1. 10 Apr, 2017 1 commit
  2. 24 Mar, 2017 1 commit
    • Jason A. Donenfeld's avatar
      padata: avoid race in reordering · de5540d0
      Jason A. Donenfeld authored
      Under extremely heavy uses of padata, crashes occur, and with list
      debugging turned on, this happens instead:
      [87487.298728] WARNING: CPU: 1 PID: 882 at lib/list_debug.c:33
      [87487.301868] list_add corruption. prev->next should be next
      (ffffb17abfc043d0), but was ffff8dba70872c80. (prev=ffff8dba70872b00).
      [87487.339011]  [<ffffffff9a53d075>] dump_stack+0x68/0xa3
      [87487.342198]  [<ffffffff99e119a1>] ? console_unlock+0x281/0x6d0
      [87487.345364]  [<ffffffff99d6b91f>] __warn+0xff/0x140
      [87487.348513]  [<ffffffff99d6b9aa>] warn_slowpath_fmt+0x4a/0x50
      [87487.351659]  [<ffffffff9a58b5de>] __list_add+0xae/0x130
      [87487.354772]  [<ffffffff9add5094>] ? _raw_spin_lock+0x64/0x70
      [87487.357915]  [<ffffffff99eefd66>] padata_reorder+0x1e6/0x420
      [87487.361084]  [<ffffffff99ef0055>] padata_do_serial+0xa5/0x120
      padata_reorder calls list_add_tail with the list to which its adding
      locked, which seems correct:
      list_add_tail(&padata->list, &squeue->serial.list);
      This therefore leaves only place where such inconsistency could occur:
      if padata->list is added at the same time on two different threads.
      This pdata pointer comes from the function call to
      padata_get_next(pd), which has in it the following block:
      next_queue = per_cpu_ptr(pd->pqueue, cpu);
      padata = NULL;
      reorder = &next_queue->reorder;
      if (!list_empty(&reorder->list)) {
             padata = list_entry(reorder->list.next,
                                 struct padata_priv, list);
             goto out;
      return padata;
      I strongly suspect that the problem here is that two threads can race
      on reorder list. Even though the deletion is locked, call to
      list_entry is not locked, which means it's feasible that two threads
      pick up the same padata object and subsequently call list_add_tail on
      them at the same time. The fix is thus be hoist that lock outside of
      that block.
      Signed-off-by: default avatarJason A. Donenfeld <Jason@zx2c4.com>
      Acked-by: default avatarSteffen Klassert <steffen.klassert@secunet.com>
      Signed-off-by: default avatarHerbert Xu <herbert@gondor.apana.org.au>
  3. 16 Mar, 2017 12 commits
    • Heiko Carstens's avatar
      mm: add private lock to serialize memory hotplug operations · 55adc1d0
      Heiko Carstens authored
      Commit bfc8c901 ("mem-hotplug: implement get/put_online_mems")
      introduced new functions get/put_online_mems() and mem_hotplug_begin/end()
      in order to allow similar semantics for memory hotplug like for cpu
      The corresponding functions for cpu hotplug are get/put_online_cpus()
      and cpu_hotplug_begin/done() for cpu hotplug.
      The commit however missed to introduce functions that would serialize
      memory hotplug operations like they are done for cpu hotplug with
      This basically leaves mem_hotplug.active_writer unprotected and allows
      concurrent writers to modify it, which may lead to problems as outlined
      by commit f931ab47 ("mm: fix devm_memremap_pages crash, use
      mem_hotplug_{begin, done}").
      That commit was extended again with commit b5d24fda ("mm,
      devm_memremap_pages: hold device_hotplug lock over mem_hotplug_{begin,
      done}") which serializes memory hotplug operations for some call sites
      by using the device_hotplug lock.
      In addition with commit 3fc21924 ("mm: validate device_hotplug is held
      for memory hotplug") a sanity check was added to mem_hotplug_begin() to
      verify that the device_hotplug lock is held.
      This in turn triggers the following warning on s390:
      WARNING: CPU: 6 PID: 1 at drivers/base/core.c:643 assert_held_device_hotplug+0x4a/0x58
       Call Trace:
      One possible fix would be to add more lock_device_hotplug() and
      unlock_device_hotplug() calls around each call site of
      mem_hotplug_begin/end().  But that would give the device_hotplug lock
      additional semantics it better should not have (serialize memory hotplug
      Instead add a new memory_add_remove_lock which has the similar semantics
      like cpu_add_remove_lock for cpu hotplug.
      To keep things hopefully a bit easier the lock will be locked and unlocked
      within the mem_hotplug_begin/end() functions.
      Link: http://lkml.kernel.org/r/20170314125226.16779-2-heiko.carstens@de.ibm.comSigned-off-by: default avatarHeiko Carstens <heiko.carstens@de.ibm.com>
      Reported-by: default avatarSebastian Ott <sebott@linux.vnet.ibm.com>
      Acked-by: default avatarDan Williams <dan.j.williams@intel.com>
      Acked-by: default avatarRafael J. Wysocki <rafael.j.wysocki@intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Ben Hutchings <ben@decadent.org.uk>
      Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
    • Peter Zijlstra's avatar
      perf/core: Better explain the inherit magic · d8a8cfc7
      Peter Zijlstra authored
      While going through the event inheritance code Oleg got confused.
      Add some comments to better explain the silent dissapearance of
      orphaned events.
      So what happens is that at perf_event_release_kernel() time; when an
      event looses its connection to userspace (and ceases to exist from the
      user's perspective) we can still have an arbitrary amount of inherited
      copies of the event. We want to synchronously find and remove all
      these child events.
      Since that requires a bit of lock juggling, there is the possibility
      that concurrent clone()s will create new child events. Therefore we
      first mark the parent event as DEAD, which marks all the extant child
      events as orphaned.
      We then avoid copying orphaned events; in order to avoid getting more
      of them.
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: fweisbec@gmail.com
      Link: http://lkml.kernel.org/r/20170316125823.289567442@infradead.orgSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
    • Peter Zijlstra's avatar
      perf/core: Simplify perf_event_free_task() · 15121c78
      Peter Zijlstra authored
      We have ctx->event_list that contains all events; no need to
      repeatedly iterate the group lists to find them all.
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: fweisbec@gmail.com
      Link: http://lkml.kernel.org/r/20170316125823.239678244@infradead.orgSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
    • Peter Zijlstra's avatar
      perf/core: Fix event inheritance on fork() · e7cc4865
      Peter Zijlstra authored
      While hunting for clues to a use-after-free, Oleg spotted that
      perf_event_init_context() can loose an error value with the result
      that fork() can succeed even though we did not fully inherit the perf
      event context.
      Spotted-by: default avatarOleg Nesterov <oleg@redhat.com>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: oleg@redhat.com
      Cc: stable@vger.kernel.org
      Fixes: 889ff015 ("perf/core: Split context's event group list into pinned and non-pinned lists")
      Link: http://lkml.kernel.org/r/20170316125823.190342547@infradead.orgSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
    • Peter Zijlstra's avatar
      perf/core: Fix use-after-free in perf_release() · e552a838
      Peter Zijlstra authored
      Dmitry reported syzcaller tripped a use-after-free in perf_release().
      After much puzzlement Oleg spotted the below scenario:
        Task1                           Task2
          /* ... */
          goto bad_fork_$foo;
          /* ... */
                                          list_for_each_entry(child, ...) {
                                            /* child == B */
                                            ctx = B->ctx;
              /* ... */
            put_ctx() /* >0 */
                                            /* ... */
                                            put_ctx() /* 0 */
                                              ctx->task && !TOMBSTONE
                                                put_task_struct() /* UAF */
      This patch closes the hole by making perf_event_free_task() destroy the
      task <-> ctx relation such that perf_event_release_kernel() will no longer
      observe the now dead task.
      Spotted-by: default avatarOleg Nesterov <oleg@redhat.com>
      Reported-by: default avatarDmitry Vyukov <dvyukov@google.com>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: fweisbec@gmail.com
      Cc: oleg@redhat.com
      Cc: stable@vger.kernel.org
      Fixes: c6e5b732 ("perf: Synchronously clean up child events")
      Link: http://lkml.kernel.org/r/20170314155949.GE32474@worktop
      Link: http://lkml.kernel.org/r/20170316125823.140295131@infradead.orgSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
    • Steven Rostedt (VMware)'s avatar
      sched/deadline: Use deadline instead of period when calculating overflow · 2317d5f1
      Steven Rostedt (VMware) authored
      I was testing Daniel's changes with his test case, and tweaked it a
      little. Instead of having the runtime equal to the deadline, I
      increased the deadline ten fold.
      Daniel's test case had:
      	attr.sched_runtime  = 2 * 1000 * 1000;		/* 2 ms */
      	attr.sched_deadline = 2 * 1000 * 1000;		/* 2 ms */
      	attr.sched_period   = 2 * 1000 * 1000 * 1000;	/* 2 s */
      To make it more interesting, I changed it to:
      	attr.sched_runtime  =  2 * 1000 * 1000;		/* 2 ms */
      	attr.sched_deadline = 20 * 1000 * 1000;		/* 20 ms */
      	attr.sched_period   =  2 * 1000 * 1000 * 1000;	/* 2 s */
      The results were rather surprising. The behavior that Daniel's patch
      was fixing came back. The task started using much more than .1% of the
      CPU. More like 20%.
      Looking into this I found that it was due to the dl_entity_overflow()
      constantly returning true. That's because it uses the relative period
      against relative runtime vs the absolute deadline against absolute
        runtime / (deadline - t) > dl_runtime / dl_period
      There's even a comment mentioning this, and saying that when relative
      deadline equals relative period, that the equation is the same as using
      deadline instead of period. That comment is backwards! What we really
      want is:
        runtime / (deadline - t) > dl_runtime / dl_deadline
      We care about if the runtime can make its deadline, not its period. And
      then we can say "when the deadline equals the period, the equation is
      the same as using dl_period instead of dl_deadline".
      After correcting this, now when the task gets enqueued, it can throttle
      correctly, and Daniel's fix to the throttling of sleeping deadline
      tasks works even when the runtime and deadline are not the same.
      Signed-off-by: default avatarSteven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: default avatarDaniel Bristot de Oliveira <bristot@redhat.com>
      Cc: Juri Lelli <juri.lelli@arm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Luca Abeni <luca.abeni@santannapisa.it>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Romulo Silva de Oliveira <romulo.deoliveira@ufsc.br>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tommaso Cucinotta <tommaso.cucinotta@sssup.it>
      Link: http://lkml.kernel.org/r/02135a27f1ae3fe5fd032568a5a2f370e190e8d7.1488392936.git.bristot@redhat.comSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
    • Daniel Bristot de Oliveira's avatar
      sched/deadline: Throttle a constrained deadline task activated after the deadline · df8eac8c
      Daniel Bristot de Oliveira authored
      During the activation, CBS checks if it can reuse the current task's
      runtime and period. If the deadline of the task is in the past, CBS
      cannot use the runtime, and so it replenishes the task. This rule
      works fine for implicit deadline tasks (deadline == period), and the
      CBS was designed for implicit deadline tasks. However, a task with
      constrained deadline (deadine < period) might be awakened after the
      deadline, but before the next period. In this case, replenishing the
      task would allow it to run for runtime / deadline. As in this case
      deadline < period, CBS enables a task to run for more than the
      runtime / period. In a very loaded system, this can cause a domino
      effect, making other tasks miss their deadlines.
      To avoid this problem, in the activation of a constrained deadline
      task after the deadline but before the next period, throttle the
      task and set the replenishing timer to the begin of the next period,
      unless it is boosted.
       --------------- %< ---------------
        int main (int argc, char **argv)
      	int ret;
      	int flags = 0;
      	unsigned long l = 0;
      	struct timespec ts;
      	struct sched_attr attr;
      	memset(&attr, 0, sizeof(attr));
      	attr.size = sizeof(attr);
      	attr.sched_policy   = SCHED_DEADLINE;
      	attr.sched_runtime  = 2 * 1000 * 1000;		/* 2 ms */
      	attr.sched_deadline = 2 * 1000 * 1000;		/* 2 ms */
      	attr.sched_period   = 2 * 1000 * 1000 * 1000;	/* 2 s */
      	ts.tv_sec = 0;
      	ts.tv_nsec = 2000 * 1000;			/* 2 ms */
      	ret = sched_setattr(0, &attr, flags);
      	if (ret < 0) {
      	for(;;) {
      		/* XXX: you may need to adjust the loop */
      		for (l = 0; l < 150000; l++);
      		 * The ideia is to go to sleep right before the deadline
      		 * and then wake up before the next period to receive
      		 * a new replenishment.
      		nanosleep(&ts, NULL);
        --------------- >% ---------------
      On my box, this reproducer uses almost 50% of the CPU time, which is
      obviously wrong for a task with 2/2000 reservation.
      Signed-off-by: default avatarDaniel Bristot de Oliveira <bristot@redhat.com>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Juri Lelli <juri.lelli@arm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Luca Abeni <luca.abeni@santannapisa.it>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Romulo Silva de Oliveira <romulo.deoliveira@ufsc.br>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tommaso Cucinotta <tommaso.cucinotta@sssup.it>
      Link: http://lkml.kernel.org/r/edf58354e01db46bf42df8d2dd32418833f68c89.1488392936.git.bristot@redhat.comSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
    • Daniel Bristot de Oliveira's avatar
      sched/deadline: Make sure the replenishment timer fires in the next period · 5ac69d37
      Daniel Bristot de Oliveira authored
      Currently, the replenishment timer is set to fire at the deadline
      of a task. Although that works for implicit deadline tasks because the
      deadline is equals to the begin of the next period, that is not correct
      for constrained deadline tasks (deadline < period).
      For instance:
       --------------- %< ---------------
      int main (void)
       --------------- >% ---------------
        # gcc -o f f.c
        # trace-cmd record -e sched:sched_switch                              \
      				   -e syscalls:sys_exit_sched_setattr   \
         chrt -d --sched-runtime  490000000					\
                 --sched-deadline 500000000					\
      	   --sched-period  1000000000 0 ./f
        # trace-cmd report | grep "{pid of ./f}"
      After setting parameters, the task is replenished and continue running
      until being throttled:
               f-11295 [003] 13322.113776: sys_exit_sched_setattr: 0x0
      The task is throttled after running 492318 ms, as expected:
               f-11295 [003] 13322.606094: sched_switch:   f:11295 [-1] R ==> watchdog/3:32 [0]
      But then, the task is replenished 500719 ms after the first
          <idle>-0     [003] 13322.614495: sched_switch:   swapper/3:0 [120] R ==> f:11295 [-1]
      Running for 490277 ms:
               f-11295 [003] 13323.104772: sched_switch:   f:11295 [-1] R ==>  swapper/3:0 [120]
      Hence, in the first period, the task runs 2 * runtime, and that is a bug.
      During the first replenishment, the next deadline is set one period away.
      So the runtime / period starts to be respected. However, as the second
      replenishment took place in the wrong instant, the next replenishment
      will also be held in a wrong instant of time. Rather than occurring in
      the nth period away from the first activation, it is taking place
      in the (nth period - relative deadline).
      Signed-off-by: default avatarDaniel Bristot de Oliveira <bristot@redhat.com>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: default avatarLuca Abeni <luca.abeni@santannapisa.it>
      Reviewed-by: default avatarSteven Rostedt (VMware) <rostedt@goodmis.org>
      Reviewed-by: default avatarJuri Lelli <juri.lelli@arm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Romulo Silva de Oliveira <romulo.deoliveira@ufsc.br>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tommaso Cucinotta <tommaso.cucinotta@sssup.it>
      Link: http://lkml.kernel.org/r/ac50d89887c25285b47465638354b63362f8adff.1488392936.git.bristot@redhat.comSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
    • Niklas Cassel's avatar
      locking/rwsem: Fix down_write_killable() for CONFIG_RWSEM_GENERIC_SPINLOCK=y · 17fcbd59
      Niklas Cassel authored
      We hang if SIGKILL has been sent, but the task is stuck in down_read()
      (after do_exit()), even though no task is doing down_write() on the
      rwsem in question:
        INFO: task libupnp:21868 blocked for more than 120 seconds.
        libupnp         D    0 21868      1 0x08100008
        Call Trace:
      This bug has already been fixed for CONFIG_RWSEM_XCHGADD_ALGORITHM=y in
      the following commit:
       04cafed7 ("locking/rwsem: Fix down_write_killable()")
      ... however, this bug also exists for CONFIG_RWSEM_GENERIC_SPINLOCK=y.
      Signed-off-by: default avatarNiklas Cassel <niklas.cassel@axis.com>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: <mhocko@suse.com>
      Cc: <stable@vger.kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Niklas Cassel <niklass@axis.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: d4799608 ("locking/rwsem: Introduce basis for down_write_killable()")
      Link: http://lkml.kernel.org/r/1487981873-12649-1-git-send-email-niklass@axis.comSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
    • Matt Fleming's avatar
      sched/loadavg: Use {READ,WRITE}_ONCE() for sample window · caeb5882
      Matt Fleming authored
      'calc_load_update' is accessed without any kind of locking and there's
      a clear assumption in the code that only a single value is read or
      Make this explicit by using READ_ONCE() and WRITE_ONCE(), and avoid
      unintentionally seeing multiple values, or having the load/stores
      Technically the loads in calc_global_*() don't require this since
      those are the only functions that update 'calc_load_update', but I've
      added the READ_ONCE() for consistency.
      Suggested-by: default avatarPeter Zijlstra <peterz@infradead.org>
      Signed-off-by: default avatarMatt Fleming <matt@codeblueprint.co.uk>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
      Cc: Morten Rasmussen <morten.rasmussen@arm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Link: http://lkml.kernel.org/r/20170217120731.11868-3-matt@codeblueprint.co.ukSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
    • Matt Fleming's avatar
      sched/loadavg: Avoid loadavg spikes caused by delayed NO_HZ accounting · 6e5f32f7
      Matt Fleming authored
      If we crossed a sample window while in NO_HZ we will add LOAD_FREQ to
      the pending sample window time on exit, setting the next update not
      one window into the future, but two.
      This situation on exiting NO_HZ is described by:
        this_rq->calc_load_update < jiffies < calc_load_update
      In this scenario, what we should be doing is:
        this_rq->calc_load_update = calc_load_update		     [ next window ]
      But what we actually do is:
        this_rq->calc_load_update = calc_load_update + LOAD_FREQ   [ next+1 window ]
      This has the effect of delaying load average updates for potentially
      up to ~9seconds.
      This can result in huge spikes in the load average values due to
      per-cpu uninterruptible task counts being out of sync when accumulated
      across all CPUs.
      It's safe to update the per-cpu active count if we wake between sample
      windows because any load that we left in 'calc_load_idle' will have
      been zero'd when the idle load was folded in calc_global_load().
      This issue is easy to reproduce before,
        commit 9d89c257 ("sched/fair: Rewrite runnable load and utilization average tracking")
      just by forking short-lived process pipelines built from ps(1) and
      grep(1) in a loop. I'm unable to reproduce the spikes after that
      commit, but the bug still seems to be present from code review.
      Signed-off-by: default avatarMatt Fleming <matt@codeblueprint.co.uk>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
      Cc: Morten Rasmussen <morten.rasmussen@arm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Fixes: commit 5167e8d5 ("sched/nohz: Rewrite and fix load-avg computation -- again")
      Link: http://lkml.kernel.org/r/20170217120731.11868-2-matt@codeblueprint.co.ukSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
    • Wanpeng Li's avatar
      sched/deadline: Add missing update_rq_clock() in dl_task_timer() · dcc3b5ff
      Wanpeng Li authored
      The following warning can be triggered by hot-unplugging the CPU
      on which an active SCHED_DEADLINE task is running on:
       ------------[ cut here ]------------
       WARNING: CPU: 7 PID: 0 at kernel/sched/sched.h:833 replenish_dl_entity+0x71e/0xc40
       rq->clock_update_flags < RQCF_ACT_SKIP
       CPU: 7 PID: 0 Comm: swapper/7 Tainted: G    B           4.11.0-rc1+ #24
       Hardware name: LENOVO ThinkCentre M8500t-N000/SHARKBAY, BIOS FBKTC1AUS 02/16/2016
       Call Trace:
        ? __warn+0x1b0/0x1b0
        ? debug_check_no_locks_freed+0x2c0/0x2c0
        ? cpudl_set+0x3d/0x2b0
        ? dl_task_timer+0x777/0x990
        ? __hrtimer_run_queues+0x270/0xa50
        ? enqueue_task_dl+0x12e0/0x12e0
        ? enqueue_task_dl+0x12e0/0x12e0
        ? hrtimer_cancel+0x20/0x20
        ? hrtimer_interrupt+0x119/0x600
        ? trace_hardirqs_off+0xd/0x10
      The DL task will be migrated to a suitable later deadline rq once the DL
      timer fires and currnet rq is offline. The rq clock of the new rq should
      be updated. This patch fixes it by updating the rq clock after holding
      the new rq's rq lock.
      Signed-off-by: default avatarWanpeng Li <wanpeng.li@hotmail.com>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: default avatarMatt Fleming <matt@codeblueprint.co.uk>
      Cc: Juri Lelli <juri.lelli@arm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1488865888-15894-1-git-send-email-wanpeng.li@hotmail.comSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
  4. 14 Mar, 2017 3 commits
  5. 10 Mar, 2017 4 commits
    • Thomas Gleixner's avatar
      kexec, x86/purgatory: Unbreak it and clean it up · 40c50c1f
      Thomas Gleixner authored
      The purgatory code defines global variables which are referenced via a
      symbol lookup in the kexec code (core and arch).
      A recent commit addressing sparse warnings made these static and thereby
      broke kexec_file.
      Why did this happen? Simply because the whole machinery is undocumented and
      lacks any form of forward declarations. The variable names are unspecific
      and lack a prefix, so adding forward declarations creates shadow variables
      in the core code. Aside of that the code relies on magic constants and
      duplicate struct definitions with no way to ensure that these things stay
      in sync. The section placement of the purgatory variables happened by
      chance and not by design.
      Unbreak kexec and cleanup the mess:
       - Add proper forward declarations and document the usage
       - Use common struct definition
       - Use the proper common defines instead of magic constants
       - Add a purgatory_ prefix to have a proper name space
       - Use ARRAY_SIZE() instead of a homebrewn reimplementation
       - Add proper sections to the purgatory variables [ From Mike ]
      Fixes: 72042a8c ("x86/purgatory: Make functions and variables static")
      Reported-by: default avatarMike Galbraith <&lt;efault@gmx.de>
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Cc: Nicholas Mc Guire <der.herr@hofr.at>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: "Tobin C. Harding" <me@tobin.cc>
      Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1703101315140.3681@nanosSigned-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
    • Andrea Arcangeli's avatar
      userfaultfd: non-cooperative: rollback userfaultfd_exit · dd0db88d
      Andrea Arcangeli authored
      Patch series "userfaultfd non-cooperative further update for 4.11 merge
      Unfortunately I noticed one relevant bug in userfaultfd_exit while doing
      more testing.  I've been doing testing before and this was also tested
      by kbuild bot and exercised by the selftest, but this bug never
      reproduced before.
      I dropped userfaultfd_exit as result.  I dropped it because of
      implementation difficulty in receiving signals in __mmput and because I
      think -ENOSPC as result from the background UFFDIO_COPY should be enough
      Before I decided to remove userfaultfd_exit, I noticed userfaultfd_exit
      wasn't exercised by the selftest and when I tried to exercise it, after
      moving it to a more correct place in __mmput where it would make more
      sense and where the vma list is stable, it resulted in the
      event_wait_completion in D state.  So then I added the second patch to
      be sure even if we call userfaultfd_event_wait_completion too late
      during task exit(), we won't risk to generate tasks in D state.  The
      same check exists in handle_userfault() for the same reason, except it
      makes a difference there, while here is just a robustness check and it's
      run under WARN_ON_ONCE.
      While looking at the userfaultfd_event_wait_completion() function I
      looked back at its callers too while at it and I think it's not ok to
      stop executing dup_fctx on the fcs list because we relay on
      userfaultfd_event_wait_completion to execute
      userfaultfd_ctx_put(fctx->orig) which is paired against
      userfaultfd_ctx_get(fctx->orig) in dup_userfault just before
      list_add(fcs).  This change only takes care of fctx->orig but this area
      also needs further review looking for similar problems in fctx->new.
      The only patch that is urgent is the first because it's an use after
      free during a SMP race condition that affects all processes if
      CONFIG_USERFAULTFD=y.  Very hard to reproduce though and probably
      impossible without SLUB poisoning enabled.
      This patch (of 3):
      I once reproduced this oops with the userfaultfd selftest, it's not
      easily reproducible and it requires SLUB poisoning to reproduce.
          general protection fault: 0000 [#1] SMP
          Modules linked in:
          CPU: 2 PID: 18421 Comm: userfaultfd Tainted: G               ------------ T 3.10.0+ #15
          Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.1-0-g8891697-prebuilt.qemu-project.org 04/01/2014
          task: ffff8801f83b9440 ti: ffff8801f833c000 task.ti: ffff8801f833c000
          RIP: 0010:[<ffffffff81451299>]  [<ffffffff81451299>] userfaultfd_exit+0x29/0xa0
          RSP: 0018:ffff8801f833fe80  EFLAGS: 00010202
          RAX: ffff8801f833ffd8 RBX: 6b6b6b6b6b6b6b6b RCX: ffff8801f83b9440
          RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8800baf18600
          RBP: ffff8801f833fee8 R08: 0000000000000000 R09: 0000000000000001
          R10: 0000000000000000 R11: ffffffff8127ceb3 R12: 0000000000000000
          R13: ffff8800baf186b0 R14: ffff8801f83b99f8 R15: 00007faed746c700
          FS:  0000000000000000(0000) GS:ffff88023fc80000(0000) knlGS:0000000000000000
          CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
          CR2: 00007faf0966f028 CR3: 0000000001bc6000 CR4: 00000000000006e0
          DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
          DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
          Call Trace:
          Code: 00 00 66 66 66 66 90 55 48 89 e5 41 54 53 48 83 ec 58 48 8b 1f 48 85 db 75 11 eb 73 66 0f 1f 44 00 00 48 8b 5b 10 48 85 db 74 64 <4c> 8b a3 b8 00 00 00 4d 85 e4 74 eb 41 f6 84 24 2c 01 00 00 80
          RIP  [<ffffffff81451299>] userfaultfd_exit+0x29/0xa0
           RSP <ffff8801f833fe80>
          ---[ end trace 9fecd6dcb442846a ]---
      In the debugger I located the "mm" pointer in the stack and walking
      mm->mmap->vm_next through the end shows the vma->vm_next list is fully
      consistent and it is null terminated list as expected.  So this has to
      be an SMP race condition where userfaultfd_exit was running while the
      vma list was being modified by another CPU.
      When userfaultfd_exit() run one of the ->vm_next pointers pointed to
      SLAB_POISON (RBX is the vma pointer and is 0x6b6b..).
      The reason is that it's not running in __mmput but while there are still
      other threads running and it's not holding the mmap_sem (it can't as it
      has to wait the even to be received by the manager).  So this is an use
      after free that was happening for all processes.
      One more implementation problem aside from the race condition:
      userfaultfd_exit has really to check a flag in mm->flags before walking
      the vma or it's going to slowdown the exit() path for regular tasks.
      One more implementation problem: at that point signals can't be
      delivered so it would also create a task in D state if the manager
      doesn't read the event.
      The major design issue: it overall looks superfluous as the manager can
      check for -ENOSPC in the background transfer:
      	if (mmget_not_zero(ctx->mm)) {
      	} else {
      		return -ENOSPC;
      It's safer to roll it back and re-introduce it later if at all.
      [rppt@linux.vnet.ibm.com: documentation fixup after removal of UFFD_EVENT_EXIT]
        Link: http://lkml.kernel.org/r/1488345437-4364-1-git-send-email-rppt@linux.vnet.ibm.com
      Link: http://lkml.kernel.org/r/20170224181957.19736-2-aarcange@redhat.comSigned-off-by: default avatarAndrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: default avatarMike Rapoport <rppt@linux.vnet.ibm.com>
      Acked-by: default avatarMike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
    • Masahiro Yamada's avatar
      scripts/spelling.txt: add "overide" pattern and fix typo instances · 505d3085
      Masahiro Yamada authored
      Fix typos and add the following to the scripts/spelling.txt:
      While we are here, fix the doubled "address" in the touched line
      Also, fix the comment block style in the touched hunks in
      Link: http://lkml.kernel.org/r/1481573103-11329-21-git-send-email-yamada.masahiro@socionext.comSigned-off-by: default avatarMasahiro Yamada <yamada.masahiro@socionext.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
    • Masahiro Yamada's avatar
      scripts/spelling.txt: add "disble(d)" pattern and fix typo instances · 8a1115ff
      Masahiro Yamada authored
      Fix typos and add the following to the scripts/spelling.txt:
      I kept the TSL2563_INT_DISBLED in /drivers/iio/light/tsl2563.c
      untouched.  The macro is not referenced at all, but this commit is
      touching only comment blocks just in case.
      Link: http://lkml.kernel.org/r/1481573103-11329-20-git-send-email-yamada.masahiro@socionext.comSigned-off-by: default avatarMasahiro Yamada <yamada.masahiro@socionext.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
  6. 09 Mar, 2017 2 commits
  7. 08 Mar, 2017 1 commit
    • Linus Torvalds's avatar
      sched/headers: fix up header file dependency on <linux/sched/signal.h> · bd0f9b35
      Linus Torvalds authored
      The scheduler header file split and cleanups ended up exposing a few
      nasty header file dependencies, and in particular it showed how we in
      <linux/wait.h> ended up depending on "signal_pending()", which now comes
      from <linux/sched/signal.h>.
      That's a very subtle and annoying dependency, which already caused a
      semantic merge conflict (see commit e58bc927 "Pull overlayfs updates
      from Miklos Szeredi", which added that fixup in the merge commit).
      It turns out that we can avoid this dependency _and_ improve code
      generation by moving the guts of the fairly nasty helper #define
      __wait_event_interruptible_locked() to out-of-line code.  The code that
      includes the signal_pending() check is all in the slow-path where we
      actually go to sleep waiting for the event anyway, so using a helper
      function is the right thing to do.
      Using a helper function is also what we already did for the non-locked
      versions, see the "__wait_event*()" macros and the "prepare_to_wait*()"
      set of helper functions.
      We might want to try to unify all these macro games, we have a _lot_ of
      subtly different wait-event loops.  But this is the minimal patch to fix
      the annoying header dependency.
      Acked-by: default avatarIngo Molnar <mingo@kernel.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
  8. 07 Mar, 2017 1 commit
  9. 06 Mar, 2017 5 commits
    • Eric W. Biederman's avatar
      ucount: Remove the atomicity from ucount->count · 040757f7
      Eric W. Biederman authored
      Always increment/decrement ucount->count under the ucounts_lock.  The
      increments are there already and moving the decrements there means the
      locking logic of the code is simpler.  This simplification in the
      locking logic fixes a race between put_ucounts and get_ucounts that
      could result in a use-after-free because the count could go zero then
      be found by get_ucounts and then be freed by put_ucounts.
      A bug presumably this one was found by a combination of syzkaller and
      KASAN.  JongWhan Kim reported the syzkaller failure and Dmitry Vyukov
      spotted the race in the code.
      Cc: stable@vger.kernel.org
      Fixes: f6b2db1a ("userns: Make the count of user namespaces per user")
      Reported-by: default avatarJongHwan Kim <zzoru007@gmail.com>
      Reported-by: default avatarDmitry Vyukov <dvyukov@google.com>
      Reviewed-by: default avatarAndrei Vagin <avagin@gmail.com>
      Signed-off-by: default avatar"Eric W. Biederman" <ebiederm@xmission.com>
    • Tejun Heo's avatar
      workqueue: trigger WARN if queue_delayed_work() is called with NULL @wq · 637fdbae
      Tejun Heo authored
      If queue_delayed_work() gets called with NULL @wq, the kernel will
      oops asynchronuosly on timer expiration which isn't too helpful in
      tracking down the offender.  This actually happened with smc.
      __queue_delayed_work() already does several input sanity checks
      synchronously.  Add NULL @wq check.
      Reported-by: default avatarDave Jones <davej@codemonkey.org.uk>
      Link: http://lkml.kernel.org/r/20170227171439.jshx3qplflyrgcv7@codemonkey.org.ukSigned-off-by: default avatarTejun Heo <tj@kernel.org>
    • Kees Cook's avatar
      cgroups: censor kernel pointer in debug files · b6a6759d
      Kees Cook authored
      As found in grsecurity, this avoids exposing a kernel pointer through
      the cgroup debug entries.
      Signed-off-by: default avatarKees Cook <keescook@chromium.org>
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
    • Tejun Heo's avatar
      cgroup/pids: remove spurious suspicious RCU usage warning · 1d18c274
      Tejun Heo authored
      pids_can_fork() is special in that the css association is guaranteed
      to be stable throughout the function and thus doesn't need RCU
      protection around task_css access.  When determining the css to charge
      the pid, task_css_check() is used to override the RCU sanity check.
      While adding a warning message on fork rejection from pids limit,
      135b8b37 ("cgroup: Add pids controller event when fork fails
      because of pid limit") incorrectly added a task_css access which is
      neither RCU protected or explicitly annotated.  This triggers the
      following suspicious RCU usage warning when RCU debugging is enabled.
        cgroup: fork rejected by pids controller in
        [ ERR: suspicious RCU usage.  ]
        4.10.0-work+ #1 Not tainted
        ./include/linux/cgroup.h:435 suspicious rcu_dereference_check() usage!
        other info that might help us debug this:
        rcu_scheduler_active = 2, debug_locks = 0
        1 lock held by bash/1748:
         #0:  (&cgroup_threadgroup_rwsem){+++++.}, at: [<ffffffff81052c96>] _do_fork+0xe6/0x6e0
        stack backtrace:
        CPU: 3 PID: 1748 Comm: bash Not tainted 4.10.0-work+ #1
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.3-1.fc25 04/01/2014
        Call Trace:
        RIP: 0033:0x7f7853fab93a
        RSP: 002b:00007ffc12d05c90 EFLAGS: 00000246 ORIG_RAX: 0000000000000038
        RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f7853fab93a
        RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000001200011
        RBP: 00007ffc12d05cc0 R08: 0000000000000000 R09: 00007f78548db700
        R10: 00007f78548db9d0 R11: 0000000000000246 R12: 00000000000006d4
        R13: 0000000000000001 R14: 0000000000000000 R15: 000055e3ebe2c04d
      There's no reason to dereference task_css again here when the
      associated css is already available.  Fix it by replacing the
      task_cgroup() call with css->cgroup.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Reported-by: default avatarMike Galbraith <efault@gmx.de>
      Fixes: 135b8b37 ("cgroup: Add pids controller event when fork fails because of pid limit")
      Cc: Kenny Yu <kennyyu@fb.com>
      Cc: stable@vger.kernel.org # v4.8+
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
    • Alexei Starovoitov's avatar
      bpf: add get_next_key callback to LPM map · f38837b0
      Alexei Starovoitov authored
      map_get_next_key callback is mandatory. Supply dummy handler.
      Fixes: b95a5c4d ("bpf: add a longest prefix match trie map implementation")
      Reported-by: default avatarDmitry Vyukov <dvyukov@google.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  10. 05 Mar, 2017 2 commits
  11. 03 Mar, 2017 8 commits