1. 03 Mar, 2017 1 commit
  2. 25 Oct, 2016 1 commit
    • Josh Poimboeuf's avatar
      x86/dumpstack: Remove raw stack dump · 0ee1dd9f
      Josh Poimboeuf authored
      For mostly historical reasons, the x86 oops dump shows the raw stack
      values:
      
        ...
        [registers]
        Stack:
         ffff880079af7350 ffff880079905400 0000000000000000 ffffc900008f3ae0
         ffffffffa0196610 0000000000000001 00010000ffffffff 0000000087654321
         0000000000000002 0000000000000000 0000000000000000 0000000000000000
        Call Trace:
        ...
      
      This seems to be an artifact from long ago, and probably isn't needed
      anymore.  It generally just adds noise to the dump, and it can be
      actively harmful because it leaks kernel addresses.
      
      Linus says:
      
        "The stack dump actually goes back to forever, and it used to be
         useful back in 1992 or so. But it used to be useful mainly because
         stacks were simpler and we didn't have very good call traces anyway. I
         definitely remember having used them - I just do not remember having
         used them in the last ten+ years.
      
         Of course, it's still true that if you can trigger an oops, you've
         likely already lost the security game, but since the stack dump is so
         useless, let's aim to just remove it and make games like the above
         harder."
      
      This also removes the related 'kstack=' cmdline option and the
      'kstack_depth_to_print' sysctl.
      Suggested-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarJosh Poimboeuf <jpoimboe@redhat.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/e83bd50df52d8fe88e94d2566426ae40d813bf8f.1477405374.git.jpoimboe@redhat.comSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      0ee1dd9f
  3. 24 Oct, 2016 1 commit
  4. 02 Aug, 2016 1 commit
    • Borislav Petkov's avatar
      printk: add kernel parameter to control writes to /dev/kmsg · 750afe7b
      Borislav Petkov authored
      Add a "printk.devkmsg" kernel command line parameter which controls how
      userspace writes into /dev/kmsg.  It has three options:
      
       * ratelimit - ratelimit logging from userspace.
       * on  - unlimited logging from userspace
       * off - logging from userspace gets ignored
      
      The default setting is to ratelimit the messages written to it.
      
      This changes the kernel default setting of "on" to "ratelimit" and we do
      that because we want to keep userspace spamming /dev/kmsg to sane
      levels.  This is especially moot when a small kernel log buffer wraps
      around and messages get lost.  So the ratelimiting setting should be a
      sane setting where kernel messages should have a bit higher chance of
      survival from all the spamming.
      
      It additionally does not limit logging to /dev/kmsg while the system is
      booting if we haven't disabled it on the command line.
      
      Furthermore, we can control the logging from a lower priority sysctl
      interface - kernel.printk_devkmsg.
      
      That interface will succeed only if printk.devkmsg *hasn't* been
      supplied on the command line.  If it has, then printk.devkmsg is a
      one-time setting which remains for the duration of the system lifetime.
      This "locking" of the setting is to prevent userspace from changing the
      logging on us through sysctl(2).
      
      This patch is based on previous patches from Linus and Steven.
      
      [bp@suse.de: fixes]
        Link: http://lkml.kernel.org/r/20160719072344.GC25563@nazgul.tnic
      Link: http://lkml.kernel.org/r/20160716061745.15795-3-bp@alien8.deSigned-off-by: default avatarBorislav Petkov <bp@suse.de>
      Cc: Dave Young <dyoung@redhat.com>
      Cc: Franck Bui <fbui@suse.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      750afe7b
  5. 15 Jun, 2016 1 commit
    • Daniel Bristot de Oliveira's avatar
      rcu: sysctl: Panic on RCU Stall · 088e9d25
      Daniel Bristot de Oliveira authored
      It is not always easy to determine the cause of an RCU stall just by
      analysing the RCU stall messages, mainly when the problem is caused
      by the indirect starvation of rcu threads. For example, when preempt_rcu
      is not awakened due to the starvation of a timer softirq.
      
      We have been hard coding panic() in the RCU stall functions for
      some time while testing the kernel-rt. But this is not possible in
      some scenarios, like when supporting customers.
      
      This patch implements the sysctl kernel.panic_on_rcu_stall. If
      set to 1, the system will panic() when an RCU stall takes place,
      enabling the capture of a vmcore. The vmcore provides a way to analyze
      all kernel/tasks states, helping out to point to the culprit and the
      solution for the stall.
      
      The kernel.panic_on_rcu_stall sysctl is disabled by default.
      
      Changes from v1:
      - Fixed a typo in the git log
      - The if(sysctl_panic_on_rcu_stall) panic() is in a static function
      - Fixed the CONFIG_TINY_RCU compilation issue
      - The var sysctl_panic_on_rcu_stall is now __read_mostly
      
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Lai Jiangshan <jiangshanlai@gmail.com>
      Acked-by: default avatarChristian Borntraeger <borntraeger@de.ibm.com>
      Reviewed-by: default avatarJosh Triplett <josh@joshtriplett.org>
      Reviewed-by: default avatarArnaldo Carvalho de Melo <acme@kernel.org>
      Tested-by: default avatar"Luis Claudio R. Goncalves" <lgoncalv@redhat.com>
      Signed-off-by: default avatarDaniel Bristot de Oliveira <bristot@redhat.com>
      Signed-off-by: default avatarPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      088e9d25
  6. 17 May, 2016 1 commit
    • Arnaldo Carvalho de Melo's avatar
      perf core: Separate accounting of contexts and real addresses in a stack trace · c85b0334
      Arnaldo Carvalho de Melo authored
      The perf_sample->ip_callchain->nr value includes all the entries in the
      ip_callchain->ip[] array, real addresses and PERF_CONTEXT_{KERNEL,USER,etc},
      while what the user expects is that what is in the kernel.perf_event_max_stack
      sysctl or in the upcoming per event perf_event_attr.sample_max_stack knob be
      honoured in terms of IP addresses in the stack trace.
      
      So allocate a bunch of extra entries for contexts, and do the accounting
      via perf_callchain_entry_ctx struct members.
      
      A new sysctl, kernel.perf_event_max_contexts_per_stack is also
      introduced for investigating possible bugs in the callchain
      implementation by some arch.
      
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Brendan Gregg <brendan.d.gregg@gmail.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: He Kuang <hekuang@huawei.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Milian Wolff <milian.wolff@kdab.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: Wang Nan <wangnan0@huawei.com>
      Cc: Zefan Li <lizefan@huawei.com>
      Link: http://lkml.kernel.org/n/tip-3b4wnqk340c4sg4gwkfdi9yk@git.kernel.orgSigned-off-by: default avatarArnaldo Carvalho de Melo <acme@redhat.com>
      c85b0334
  7. 10 May, 2016 1 commit
  8. 27 Apr, 2016 1 commit
    • Arnaldo Carvalho de Melo's avatar
      perf core: Allow setting up max frame stack depth via sysctl · c5dfd78e
      Arnaldo Carvalho de Melo authored
      The default remains 127, which is good for most cases, and not even hit
      most of the time, but then for some cases, as reported by Brendan, 1024+
      deep frames are appearing on the radar for things like groovy, ruby.
      
      And in some workloads putting a _lower_ cap on this may make sense. One
      that is per event still needs to be put in place tho.
      
      The new file is:
      
        # cat /proc/sys/kernel/perf_event_max_stack
        127
      
      Chaging it:
      
        # echo 256 > /proc/sys/kernel/perf_event_max_stack
        # cat /proc/sys/kernel/perf_event_max_stack
        256
      
      But as soon as there is some event using callchains we get:
      
        # echo 512 > /proc/sys/kernel/perf_event_max_stack
        -bash: echo: write error: Device or resource busy
        #
      
      Because we only allocate the callchain percpu data structures when there
      is a user, which allows for changing the max easily, its just a matter
      of having no callchain users at that point.
      Reported-and-Tested-by: default avatarBrendan Gregg <brendan.d.gregg@gmail.com>
      Reviewed-by: default avatarFrederic Weisbecker <fweisbec@gmail.com>
      Acked-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Acked-by: default avatarDavid Ahern <dsahern@gmail.com>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: He Kuang <hekuang@huawei.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Milian Wolff <milian.wolff@kdab.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: Wang Nan <wangnan0@huawei.com>
      Cc: Zefan Li <lizefan@huawei.com>
      Link: http://lkml.kernel.org/r/20160426002928.GB16708@kernel.orgSigned-off-by: default avatarArnaldo Carvalho de Melo <acme@redhat.com>
      c5dfd78e
  9. 08 Mar, 2016 1 commit
  10. 09 Feb, 2016 1 commit
    • Mel Gorman's avatar
      sched/debug: Make schedstats a runtime tunable that is disabled by default · cb251765
      Mel Gorman authored
      schedstats is very useful during debugging and performance tuning but it
      incurs overhead to calculate the stats. As such, even though it can be
      disabled at build time, it is often enabled as the information is useful.
      
      This patch adds a kernel command-line and sysctl tunable to enable or
      disable schedstats on demand (when it's built in). It is disabled
      by default as someone who knows they need it can also learn to enable
      it when necessary.
      
      The benefits are dependent on how scheduler-intensive the workload is.
      If it is then the patch reduces the number of cycles spent calculating
      the stats with a small benefit from reducing the cache footprint of the
      scheduler.
      
      These measurements were taken from a 48-core 2-socket
      machine with Xeon(R) E5-2670 v3 cpus although they were also tested on a
      single socket machine 8-core machine with Intel i7-3770 processors.
      
      netperf-tcp
                                 4.5.0-rc1             4.5.0-rc1
                                   vanilla          nostats-v3r1
      Hmean    64         560.45 (  0.00%)      575.98 (  2.77%)
      Hmean    128        766.66 (  0.00%)      795.79 (  3.80%)
      Hmean    256        950.51 (  0.00%)      981.50 (  3.26%)
      Hmean    1024      1433.25 (  0.00%)     1466.51 (  2.32%)
      Hmean    2048      2810.54 (  0.00%)     2879.75 (  2.46%)
      Hmean    3312      4618.18 (  0.00%)     4682.09 (  1.38%)
      Hmean    4096      5306.42 (  0.00%)     5346.39 (  0.75%)
      Hmean    8192     10581.44 (  0.00%)    10698.15 (  1.10%)
      Hmean    16384    18857.70 (  0.00%)    18937.61 (  0.42%)
      
      Small gains here, UDP_STREAM showed nothing intresting and neither did
      the TCP_RR tests. The gains on the 8-core machine were very similar.
      
      tbench4
                                       4.5.0-rc1             4.5.0-rc1
                                         vanilla          nostats-v3r1
      Hmean    mb/sec-1         500.85 (  0.00%)      522.43 (  4.31%)
      Hmean    mb/sec-2         984.66 (  0.00%)     1018.19 (  3.41%)
      Hmean    mb/sec-4        1827.91 (  0.00%)     1847.78 (  1.09%)
      Hmean    mb/sec-8        3561.36 (  0.00%)     3611.28 (  1.40%)
      Hmean    mb/sec-16       5824.52 (  0.00%)     5929.03 (  1.79%)
      Hmean    mb/sec-32      10943.10 (  0.00%)    10802.83 ( -1.28%)
      Hmean    mb/sec-64      15950.81 (  0.00%)    16211.31 (  1.63%)
      Hmean    mb/sec-128     15302.17 (  0.00%)    15445.11 (  0.93%)
      Hmean    mb/sec-256     14866.18 (  0.00%)    15088.73 (  1.50%)
      Hmean    mb/sec-512     15223.31 (  0.00%)    15373.69 (  0.99%)
      Hmean    mb/sec-1024    14574.25 (  0.00%)    14598.02 (  0.16%)
      Hmean    mb/sec-2048    13569.02 (  0.00%)    13733.86 (  1.21%)
      Hmean    mb/sec-3072    12865.98 (  0.00%)    13209.23 (  2.67%)
      
      Small gains of 2-4% at low thread counts and otherwise flat.  The
      gains on the 8-core machine were slightly different
      
      tbench4 on 8-core i7-3770 single socket machine
      Hmean    mb/sec-1        442.59 (  0.00%)      448.73 (  1.39%)
      Hmean    mb/sec-2        796.68 (  0.00%)      794.39 ( -0.29%)
      Hmean    mb/sec-4       1322.52 (  0.00%)     1343.66 (  1.60%)
      Hmean    mb/sec-8       2611.65 (  0.00%)     2694.86 (  3.19%)
      Hmean    mb/sec-16      2537.07 (  0.00%)     2609.34 (  2.85%)
      Hmean    mb/sec-32      2506.02 (  0.00%)     2578.18 (  2.88%)
      Hmean    mb/sec-64      2511.06 (  0.00%)     2569.16 (  2.31%)
      Hmean    mb/sec-128     2313.38 (  0.00%)     2395.50 (  3.55%)
      Hmean    mb/sec-256     2110.04 (  0.00%)     2177.45 (  3.19%)
      Hmean    mb/sec-512     2072.51 (  0.00%)     2053.97 ( -0.89%)
      
      In constract, this shows a relatively steady 2-3% gain at higher thread
      counts. Due to the nature of the patch and the type of workload, it's
      not a surprise that the result will depend on the CPU used.
      
      hackbench-pipes
                               4.5.0-rc1             4.5.0-rc1
                                 vanilla          nostats-v3r1
      Amean    1        0.0637 (  0.00%)      0.0660 ( -3.59%)
      Amean    4        0.1229 (  0.00%)      0.1181 (  3.84%)
      Amean    7        0.1921 (  0.00%)      0.1911 (  0.52%)
      Amean    12       0.3117 (  0.00%)      0.2923 (  6.23%)
      Amean    21       0.4050 (  0.00%)      0.3899 (  3.74%)
      Amean    30       0.4586 (  0.00%)      0.4433 (  3.33%)
      Amean    48       0.5910 (  0.00%)      0.5694 (  3.65%)
      Amean    79       0.8663 (  0.00%)      0.8626 (  0.43%)
      Amean    110      1.1543 (  0.00%)      1.1517 (  0.22%)
      Amean    141      1.4457 (  0.00%)      1.4290 (  1.16%)
      Amean    172      1.7090 (  0.00%)      1.6924 (  0.97%)
      Amean    192      1.9126 (  0.00%)      1.9089 (  0.19%)
      
      Some small gains and losses and while the variance data is not included,
      it's close to the noise. The UMA machine did not show anything particularly
      different
      
      pipetest
                                   4.5.0-rc1             4.5.0-rc1
                                     vanilla          nostats-v2r2
      Min         Time        4.13 (  0.00%)        3.99 (  3.39%)
      1st-qrtle   Time        4.38 (  0.00%)        4.27 (  2.51%)
      2nd-qrtle   Time        4.46 (  0.00%)        4.39 (  1.57%)
      3rd-qrtle   Time        4.56 (  0.00%)        4.51 (  1.10%)
      Max-90%     Time        4.67 (  0.00%)        4.60 (  1.50%)
      Max-93%     Time        4.71 (  0.00%)        4.65 (  1.27%)
      Max-95%     Time        4.74 (  0.00%)        4.71 (  0.63%)
      Max-99%     Time        4.88 (  0.00%)        4.79 (  1.84%)
      Max         Time        4.93 (  0.00%)        4.83 (  2.03%)
      Mean        Time        4.48 (  0.00%)        4.39 (  1.91%)
      Best99%Mean Time        4.47 (  0.00%)        4.39 (  1.91%)
      Best95%Mean Time        4.46 (  0.00%)        4.38 (  1.93%)
      Best90%Mean Time        4.45 (  0.00%)        4.36 (  1.98%)
      Best50%Mean Time        4.36 (  0.00%)        4.25 (  2.49%)
      Best10%Mean Time        4.23 (  0.00%)        4.10 (  3.13%)
      Best5%Mean  Time        4.19 (  0.00%)        4.06 (  3.20%)
      Best1%Mean  Time        4.13 (  0.00%)        4.00 (  3.39%)
      
      Small improvement and similar gains were seen on the UMA machine.
      
      The gain is small but it stands to reason that doing less work in the
      scheduler is a good thing. The downside is that the lack of schedstats and
      tracepoints may be surprising to experts doing performance analysis until
      they find the existence of the schedstats= parameter or schedstats sysctl.
      It will be automatically activated for latencytop and sleep profiling to
      alleviate the problem. For tracepoints, there is a simple warning as it's
      not safe to activate schedstats in the context when it's known the tracepoint
      may be wanted but is unavailable.
      Signed-off-by: default avatarMel Gorman <mgorman@techsingularity.net>
      Reviewed-by: default avatarMatt Fleming <matt@codeblueprint.co.uk>
      Reviewed-by: default avatarSrikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <mgalbraith@suse.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1454663316-22048-1-git-send-email-mgorman@techsingularity.netSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      cb251765
  11. 26 Jan, 2016 1 commit
  12. 21 Jan, 2016 1 commit
  13. 19 Dec, 2015 1 commit
    • Hidehiro Kawai's avatar
      Documentation: Document kernel.panic_on_io_nmi sysctl · 9f318e3f
      Hidehiro Kawai authored
      kernel.panic_on_io_nmi sysctl was introduced by commit
      
        5211a242 ("x86: Add sysctl to allow panic on IOCK NMI error")
      
      but its documentation is missing. So, add it.
      Signed-off-by: default avatarHidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
      Requested-by: default avatarBorislav Petkov <bp@alien8.de>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Chris Metcalf <cmetcalf@ezchip.com>
      Cc: Don Zickus <dzickus@redhat.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Heinrich Schuchardt <xypron.glpk@gmx.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: kexec@lists.infradead.org
      Cc: linux-doc@vger.kernel.org
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Nicolas Iooss <nicolas.iooss_linux@m4x.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Seth Jennings <sjenning@redhat.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ulrich Obergfell <uobergfe@redhat.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: x86-ml <x86@kernel.org>
      Link: http://lkml.kernel.org/r/20151210014637.25437.71903.stgit@softrsSigned-off-by: default avatarBorislav Petkov <bp@suse.de>
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      9f318e3f
  14. 06 Nov, 2015 1 commit
  15. 26 Jun, 2015 1 commit
    • Nicolas Iooss's avatar
      coredump: use from_kuid/kgid when formatting corename · 5202efe5
      Nicolas Iooss authored
      When adding __printf attribute to cn_printf, gcc reports some issues:
      
        fs/coredump.c:213:5: warning: format '%d' expects argument of type
        'int', but argument 3 has type 'kuid_t' [-Wformat=]
             err = cn_printf(cn, "%d", cred->uid);
             ^
        fs/coredump.c:217:5: warning: format '%d' expects argument of type
        'int', but argument 3 has type 'kgid_t' [-Wformat=]
             err = cn_printf(cn, "%d", cred->gid);
             ^
      
      These warnings come from the fact that the value of uid/gid needs to be
      extracted from the kuid_t/kgid_t structure before being used as an
      integer.  More precisely, cred->uid and cred->gid need to be converted to
      either user-namespace uid/gid or to init_user_ns uid/gid.
      
      Use init_user_ns in order not to break existing ABI, and document this in
      Documentation/sysctl/kernel.txt.
      
      While at it, format uid and gid values with %u instead of %d because
      uid_t/__kernel_uid32_t and gid_t/__kernel_gid32_t are unsigned int.
      Signed-off-by: default avatarNicolas Iooss <nicolas.iooss_linux@m4x.org>
      Acked-by: default avatar"Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      5202efe5
  16. 25 Jun, 2015 1 commit
    • Chris Metcalf's avatar
      watchdog: add watchdog_cpumask sysctl to assist nohz · fe4ba3c3
      Chris Metcalf authored
      Change the default behavior of watchdog so it only runs on the
      housekeeping cores when nohz_full is enabled at build and boot time.
      Allow modifying the set of cores the watchdog is currently running on
      with a new kernel.watchdog_cpumask sysctl.
      
      In the current system, the watchdog subsystem runs a periodic timer that
      schedules the watchdog kthread to run.  However, nohz_full cores are
      designed to allow userspace application code running on those cores to
      have 100% access to the CPU.  So the watchdog system prevents the
      nohz_full application code from being able to run the way it wants to,
      thus the motivation to suppress the watchdog on nohz_full cores, which
      this patchset provides by default.
      
      However, if we disable the watchdog globally, then the housekeeping
      cores can't benefit from the watchdog functionality.  So we allow
      disabling it only on some cores.  See Documentation/lockup-watchdogs.txt
      for more information.
      
      [jhubbard@nvidia.com: fix a watchdog crash in some configurations]
      Signed-off-by: default avatarChris Metcalf <cmetcalf@ezchip.com>
      Acked-by: default avatarDon Zickus <dzickus@redhat.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Ulrich Obergfell <uobergfe@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Signed-off-by: default avatarJohn Hubbard <jhubbard@nvidia.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      fe4ba3c3
  17. 17 Apr, 2015 1 commit
  18. 14 Apr, 2015 1 commit
    • Ulrich Obergfell's avatar
      watchdog: enable the new user interface of the watchdog mechanism · 195daf66
      Ulrich Obergfell authored
      With the current user interface of the watchdog mechanism it is only
      possible to disable or enable both lockup detectors at the same time.
      This series introduces new kernel parameters and changes the semantics of
      some existing kernel parameters, so that the hard lockup detector and the
      soft lockup detector can be disabled or enabled individually.  With this
      series applied, the user interface is as follows.
      
      - parameters in /proc/sys/kernel
      
        . soft_watchdog
          This is a new parameter to control and examine the run state of
          the soft lockup detector.
      
        . nmi_watchdog
          The semantics of this parameter have changed. It can now be used
          to control and examine the run state of the hard lockup detector.
      
        . watchdog
          This parameter is still available to control the run state of both
          lockup detectors at the same time. If this parameter is examined,
          it shows the logical OR of soft_watchdog and nmi_watchdog.
      
        . watchdog_thresh
          The semantics of this parameter are not affected by the patch.
      
      - kernel command line parameters
      
        . nosoftlockup
          The semantics of this parameter have changed. It can now be used
          to disable the soft lockup detector at boot time.
      
        . nmi_watchdog=0 or nmi_watchdog=1
          Disable or enable the hard lockup detector at boot time. The patch
          introduces '=1' as a new option.
      
        . nowatchdog
          The semantics of this parameter are not affected by the patch. It
          is still available to disable both lockup detectors at boot time.
      
      Also, remove the proc_dowatchdog() function which is no longer needed.
      
      [dzickus@redhat.com: wrote changelog]
      [dzickus@redhat.com: update documentation for kernel params and sysctl]
      Signed-off-by: default avatarUlrich Obergfell <uobergfe@redhat.com>
      Signed-off-by: default avatarDon Zickus <dzickus@redhat.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      195daf66
  19. 22 Dec, 2014 1 commit
  20. 13 Dec, 2014 1 commit
    • Manfred Spraul's avatar
      ipc/msg: increase MSGMNI, remove scaling · 0050ee05
      Manfred Spraul authored
      SysV can be abused to allocate locked kernel memory.  For most systems, a
      small limit doesn't make sense, see the discussion with regards to SHMMAX.
      
      Therefore: increase MSGMNI to the maximum supported.
      
      And: If we ignore the risk of locking too much memory, then an automatic
      scaling of MSGMNI doesn't make sense.  Therefore the logic can be removed.
      
      The code preserves auto_msgmni to avoid breaking any user space applications
      that expect that the value exists.
      
      Notes:
      1) If an administrator must limit the memory allocations, then he can set
      MSGMNI as necessary.
      
      Or he can disable sysv entirely (as e.g. done by Android).
      
      2) MSGMAX and MSGMNB are intentionally not increased, as these values are used
      to control latency vs. throughput:
      If MSGMNB is large, then msgsnd() just returns and more messages can be queued
      before a task switch to a task that calls msgrcv() is forced.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: default avatarManfred Spraul <manfred@colorfullife.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Rafael Aquini <aquini@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      0050ee05
  21. 11 Dec, 2014 1 commit
    • Prarit Bhargava's avatar
      kernel: add panic_on_warn · 9e3961a0
      Prarit Bhargava authored
      There have been several times where I have had to rebuild a kernel to
      cause a panic when hitting a WARN() in the code in order to get a crash
      dump from a system.  Sometimes this is easy to do, other times (such as
      in the case of a remote admin) it is not trivial to send new images to
      the user.
      
      A much easier method would be a switch to change the WARN() over to a
      panic.  This makes debugging easier in that I can now test the actual
      image the WARN() was seen on and I do not have to engage in remote
      debugging.
      
      This patch adds a panic_on_warn kernel parameter and
      /proc/sys/kernel/panic_on_warn calls panic() in the
      warn_slowpath_common() path.  The function will still print out the
      location of the warning.
      
      An example of the panic_on_warn output:
      
      The first line below is from the WARN_ON() to output the WARN_ON()'s
      location.  After that the panic() output is displayed.
      
          WARNING: CPU: 30 PID: 11698 at /home/prarit/dummy_module/dummy-module.c:25 init_dummy+0x1f/0x30 [dummy_module]()
          Kernel panic - not syncing: panic_on_warn set ...
      
          CPU: 30 PID: 11698 Comm: insmod Tainted: G        W  OE  3.17.0+ #57
          Hardware name: Intel Corporation S2600CP/S2600CP, BIOS RMLSDP.86I.00.29.D696.1311111329 11/11/2013
           0000000000000000 000000008e3f87df ffff88080f093c38 ffffffff81665190
           0000000000000000 ffffffff818aea3d ffff88080f093cb8 ffffffff8165e2ec
           ffffffff00000008 ffff88080f093cc8 ffff88080f093c68 000000008e3f87df
          Call Trace:
           [<ffffffff81665190>] dump_stack+0x46/0x58
           [<ffffffff8165e2ec>] panic+0xd0/0x204
           [<ffffffffa038e05f>] ? init_dummy+0x1f/0x30 [dummy_module]
           [<ffffffff81076b90>] warn_slowpath_common+0xd0/0xd0
           [<ffffffffa038e040>] ? dummy_greetings+0x40/0x40 [dummy_module]
           [<ffffffff81076c8a>] warn_slowpath_null+0x1a/0x20
           [<ffffffffa038e05f>] init_dummy+0x1f/0x30 [dummy_module]
           [<ffffffff81002144>] do_one_initcall+0xd4/0x210
           [<ffffffff811b52c2>] ? __vunmap+0xc2/0x110
           [<ffffffff810f8889>] load_module+0x16a9/0x1b30
           [<ffffffff810f3d30>] ? store_uevent+0x70/0x70
           [<ffffffff810f49b9>] ? copy_module_from_fd.isra.44+0x129/0x180
           [<ffffffff810f8ec6>] SyS_finit_module+0xa6/0xd0
           [<ffffffff8166cf29>] system_call_fastpath+0x12/0x17
      
      Successfully tested by me.
      
      hpa said: There is another very valid use for this: many operators would
      rather a machine shuts down than being potentially compromised either
      functionally or security-wise.
      Signed-off-by: default avatarPrarit Bhargava <prarit@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Acked-by: default avatarYasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Fabian Frederick <fabf@skynet.be>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      9e3961a0
  22. 14 Oct, 2014 1 commit
    • Oleg Nesterov's avatar
      coredump: add %i/%I in core_pattern to report the tid of the crashed thread · b03023ec
      Oleg Nesterov authored
      format_corename() can only pass the leader's pid to the core handler,
      but there is no simple way to figure out which thread originated the
      coredump.
      
      As Jan explains, this also means that there is no simple way to create
      the backtrace of the crashed process:
      
      As programs are mostly compiled with implicit gcc -fomit-frame-pointer
      one needs program's .eh_frame section (equivalently PT_GNU_EH_FRAME
      segment) or .debug_frame section.  .debug_frame usually is present only
      in separate debug info files usually not even installed on the system.
      While .eh_frame is a part of the executable/library (and it is even
      always mapped for C++ exceptions unwinding) it no longer has to be
      present anywhere on the disk as the program could be upgraded in the
      meantime and the running instance has its executable file already
      unlinked from disk.
      
      One possibility is to echo 0x3f >/proc/*/coredump_filter and dump all
      the file-backed memory including the executable's .eh_frame section.
      But that can create huge core files, for example even due to mmapped
      data files.
      
      Other possibility would be to read .eh_frame from /proc/PID/mem at the
      core_pattern handler time of the core dump.  For the backtrace one needs
      to read the register state first which can be done from core_pattern
      handler:
      
          ptrace(PTRACE_SEIZE, tid, 0, PTRACE_O_TRACEEXIT)
          close(0);    // close pipe fd to resume the sleeping dumper
          waitpid();   // should report EXIT
          PTRACE_GETREGS or other requests
      
      The remaining problem is how to get the 'tid' value of the crashed
      thread.  It could be read from the first NT_PRSTATUS note of the core
      file but that makes the core_pattern handler complicated.
      
      Unfortunately %t is already used so this patch uses %i/%I.
      
      Automatic Bug Reporting Tool (https://github.com/abrt/abrt/wiki/overview)
      is experimenting with this.  It is using the elfutils
      (https://fedorahosted.org/elfutils/) unwinder for generating the
      backtraces.  Apart from not needing matching executables as mentioned
      above, another advantage is that we can get the backtrace without saving
      the core (which might be quite large) to disk.
      
      [mmilata@redhat.com: final paragraph of changelog]
      Signed-off-by: default avatarJan Kratochvil <jan.kratochvil@redhat.com>
      Signed-off-by: default avatarOleg Nesterov <oleg@redhat.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Jan Kratochvil <jan.kratochvil@redhat.com>
      Cc: Mark Wielaard <mjw@redhat.com>
      Cc: Martin Milata <mmilata@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b03023ec
  23. 08 Aug, 2014 1 commit
  24. 23 Jun, 2014 1 commit
    • Aaron Tomlin's avatar
      kernel/watchdog.c: print traces for all cpus on lockup detection · ed235875
      Aaron Tomlin authored
      A 'softlockup' is defined as a bug that causes the kernel to loop in
      kernel mode for more than a predefined period to time, without giving
      other tasks a chance to run.
      
      Currently, upon detection of this condition by the per-cpu watchdog
      task, debug information (including a stack trace) is sent to the system
      log.
      
      On some occasions, we have observed that the "victim" rather than the
      actual "culprit" (i.e.  the owner/holder of the contended resource) is
      reported to the user.  Often this information has proven to be
      insufficient to assist debugging efforts.
      
      To avoid loss of useful debug information, for architectures which
      support NMI, this patch makes it possible to improve soft lockup
      reporting.  This is accomplished by issuing an NMI to each cpu to obtain
      a stack trace.
      
      If NMI is not supported we just revert back to the old method.  A sysctl
      and boot-time parameter is available to toggle this feature.
      
      [dzickus@redhat.com: add CONFIG_SMP in certain areas]
      [akpm@linux-foundation.org: additional CONFIG_SMP=n optimisations]
      [mq@suse.cz: fix warning]
      Signed-off-by: default avatarAaron Tomlin <atomlin@redhat.com>
      Signed-off-by: default avatarDon Zickus <dzickus@redhat.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Mateusz Guzik <mguzik@redhat.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: default avatarJan Moskyto Matejka <mq@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ed235875
  25. 06 Jun, 2014 1 commit
    • Kees Cook's avatar
      sysctl: allow for strict write position handling · f4aacea2
      Kees Cook authored
      When writing to a sysctl string, each write, regardless of VFS position,
      begins writing the string from the start.  This means the contents of
      the last write to the sysctl controls the string contents instead of the
      first:
      
        open("/proc/sys/kernel/modprobe", O_WRONLY)   = 1
        write(1, "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"..., 4096) = 4096
        write(1, "/bin/true", 9)                = 9
        close(1)                                = 0
      
        $ cat /proc/sys/kernel/modprobe
        /bin/true
      
      Expected behaviour would be to have the sysctl be "AAAA..." capped at
      maxlen (in this case KMOD_PATH_LEN: 256), instead of truncating to the
      contents of the second write.  Similarly, multiple short writes would
      not append to the sysctl.
      
      The old behavior is unlike regular POSIX files enough that doing audits
      of software that interact with sysctls can end up in unexpected or
      dangerous situations.  For example, "as long as the input starts with a
      trusted path" turns out to be an insufficient filter, as what must also
      happen is for the input to be entirely contained in a single write
      syscall -- not a common consideration, especially for high level tools.
      
      This provides kernel.sysctl_writes_strict as a way to make this behavior
      act in a less surprising manner for strings, and disallows non-zero file
      position when writing numeric sysctls (similar to what is already done
      when reading from non-zero file positions).  For now, the default (0) is
      to warn about non-zero file position use, but retain the legacy
      behavior.  Setting this to -1 disables the warning, and setting this to
      1 enables the file position respecting behavior.
      
      [akpm@linux-foundation.org: fix build]
      [akpm@linux-foundation.org: move misplaced hunk, per Randy]
      Signed-off-by: default avatarKees Cook <keescook@chromium.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f4aacea2
  26. 07 Apr, 2014 1 commit
  27. 13 Mar, 2014 1 commit
    • Mathieu Desnoyers's avatar
      Fix: module signature vs tracepoints: add new TAINT_UNSIGNED_MODULE · 66cc69e3
      Mathieu Desnoyers authored
      Users have reported being unable to trace non-signed modules loaded
      within a kernel supporting module signature.
      
      This is caused by tracepoint.c:tracepoint_module_coming() refusing to
      take into account tracepoints sitting within force-loaded modules
      (TAINT_FORCED_MODULE). The reason for this check, in the first place, is
      that a force-loaded module may have a struct module incompatible with
      the layout expected by the kernel, and can thus cause a kernel crash
      upon forced load of that module on a kernel with CONFIG_TRACEPOINTS=y.
      
      Tracepoints, however, specifically accept TAINT_OOT_MODULE and
      TAINT_CRAP, since those modules do not lead to the "very likely system
      crash" issue cited above for force-loaded modules.
      
      With kernels having CONFIG_MODULE_SIG=y (signed modules), a non-signed
      module is tainted re-using the TAINT_FORCED_MODULE taint flag.
      Unfortunately, this means that Tracepoints treat that module as a
      force-loaded module, and thus silently refuse to consider any tracepoint
      within this module.
      
      Since an unsigned module does not fit within the "very likely system
      crash" category of tainting, add a new TAINT_UNSIGNED_MODULE taint flag
      to specifically address this taint behavior, and accept those modules
      within Tracepoints. We use the letter 'X' as a taint flag character for
      a module being loaded that doesn't know how to sign its name (proposed
      by Steven Rostedt).
      
      Also add the missing 'O' entry to trace event show_module_flags() list
      for the sake of completeness.
      Signed-off-by: default avatarMathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Acked-by: default avatarSteven Rostedt <rostedt@goodmis.org>
      NAKed-by: default avatarIngo Molnar <mingo@redhat.com>
      CC: Thomas Gleixner <tglx@linutronix.de>
      CC: David Howells <dhowells@redhat.com>
      CC: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: default avatarRusty Russell <rusty@rustcorp.com.au>
      66cc69e3
  28. 31 Jan, 2014 1 commit
  29. 28 Jan, 2014 1 commit
  30. 25 Jan, 2014 1 commit
  31. 24 Jan, 2014 1 commit
    • Kees Cook's avatar
      kexec: add sysctl to disable kexec_load · 7984754b
      Kees Cook authored
      For general-purpose (i.e.  distro) kernel builds it makes sense to build
      with CONFIG_KEXEC to allow end users to choose what kind of things they
      want to do with kexec.  However, in the face of trying to lock down a
      system with such a kernel, there needs to be a way to disable kexec_load
      (much like module loading can be disabled).  Without this, it is too easy
      for the root user to modify kernel memory even when CONFIG_STRICT_DEVMEM
      and modules_disabled are set.  With this change, it is still possible to
      load an image for use later, then disable kexec_load so the image (or lack
      of image) can't be altered.
      
      The intention is for using this in environments where "perfect"
      enforcement is hard.  Without a verified boot, along with verified
      modules, and along with verified kexec, this is trying to give a system a
      better chance to defend itself (or at least grow the window of
      discoverability) against attack in the face of a privilege escalation.
      
      In my mind, I consider several boot scenarios:
      
      1) Verified boot of read-only verified root fs loading fd-based
         verification of kexec images.
      2) Secure boot of writable root fs loading signed kexec images.
      3) Regular boot loading kexec (e.g. kcrash) image early and locking it.
      4) Regular boot with no control of kexec image at all.
      
      1 and 2 don't exist yet, but will soon once the verified kexec series has
      landed.  4 is the state of things now.  The gap between 2 and 4 is too
      large, so this change creates scenario 3, a middle-ground above 4 when 2
      and 1 are not possible for a system.
      Signed-off-by: default avatarKees Cook <keescook@chromium.org>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Eric Biederman <ebiederm@xmission.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      7984754b
  32. 17 Dec, 2013 1 commit
  33. 13 Nov, 2013 1 commit
    • Ryan Mallon's avatar
      vsprintf: check real user/group id for %pK · 312b4e22
      Ryan Mallon authored
      Some setuid binaries will allow reading of files which have read
      permission by the real user id.  This is problematic with files which
      use %pK because the file access permission is checked at open() time,
      but the kptr_restrict setting is checked at read() time.  If a setuid
      binary opens a %pK file as an unprivileged user, and then elevates
      permissions before reading the file, then kernel pointer values may be
      leaked.
      
      This happens for example with the setuid pppd application on Ubuntu 12.04:
      
        $ head -1 /proc/kallsyms
        00000000 T startup_32
      
        $ pppd file /proc/kallsyms
        pppd: In file /proc/kallsyms: unrecognized option 'c1000000'
      
      This will only leak the pointer value from the first line, but other
      setuid binaries may leak more information.
      
      Fix this by adding a check that in addition to the current process having
      CAP_SYSLOG, that effective user and group ids are equal to the real ids.
      If a setuid binary reads the contents of a file which uses %pK then the
      pointer values will be printed as NULL if the real user is unprivileged.
      
      Update the sysctl documentation to reflect the changes, and also correct
      the documentation to state the kptr_restrict=0 is the default.
      
      This is a only temporary solution to the issue.  The correct solution is
      to do the permission check at open() time on files, and to replace %pK
      with a function which checks the open() time permission.  %pK uses in
      printk should be removed since no sane permission check can be done, and
      instead protected by using dmesg_restrict.
      Signed-off-by: default avatarRyan Mallon <rmallon@gmail.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Joe Perches <joe@perches.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      312b4e22
  34. 09 Oct, 2013 5 commits
  35. 11 Sep, 2013 1 commit
  36. 23 Jun, 2013 1 commit
    • Dave Hansen's avatar
      perf: Drop sample rate when sampling is too slow · 14c63f17
      Dave Hansen authored
      This patch keeps track of how long perf's NMI handler is taking,
      and also calculates how many samples perf can take a second.  If
      the sample length times the expected max number of samples
      exceeds a configurable threshold, it drops the sample rate.
      
      This way, we don't have a runaway sampling process eating up the
      CPU.
      
      This patch can tend to drop the sample rate down to level where
      perf doesn't work very well.  *BUT* the alternative is that my
      system hangs because it spends all of its time handling NMIs.
      
      I'll take a busted performance tool over an entire system that's
      busted and undebuggable any day.
      
      BTW, my suspicion is that there's still an underlying bug here.
      Using the HPET instead of the TSC is definitely a contributing
      factor, but I suspect there are some other things going on.
      But, I can't go dig down on a bug like that with my machine
      hanging all the time.
      Signed-off-by: default avatarDave Hansen <dave.hansen@linux.intel.com>
      Acked-by: default avatarPeter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: paulus@samba.org
      Cc: acme@ghostprotocols.net
      Cc: Dave Hansen <dave@sr71.net>
      [ Prettified it a bit. ]
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      14c63f17