1. 25 Feb, 2010 2 commits
      sched: Better name for for_each_domain_rd · 497f0ab3
      Paul E. McKenney authored
      As suggested by Peter Zijlstra, make a better choice of name
      for for_each_domain_rd(), containing "rcu_dereference", given
      that it is but a wrapper for rcu_dereference_check().  The name
      rcu_dereference_check_sched_domain() does that and provides a
      separate per-subsystem name space.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: laijs@cn.fujitsu.com
      Cc: dipankar@in.ibm.com
      Cc: mathieu.desnoyers@polymtl.ca
      Cc: josh@joshtriplett.org
      Cc: dvhltc@us.ibm.com
      Cc: niv@us.ibm.com
      Cc: peterz@infradead.org
      Cc: rostedt@goodmis.org
      Cc: Valdis.Kletnieks@vt.edu
      Cc: dhowells@redhat.com
      LKML-Reference: <1266887105-1528-7-git-send-email-paulmck@linux.vnet.ibm.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      sched: Use lockdep-based checking on rcu_dereference() · d11c563d
      Paul E. McKenney authored
      Update the rcu_dereference() usages to take advantage of the new
      lockdep-based checking.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: laijs@cn.fujitsu.com
      Cc: dipankar@in.ibm.com
      Cc: mathieu.desnoyers@polymtl.ca
      Cc: josh@joshtriplett.org
      Cc: dvhltc@us.ibm.com
      Cc: niv@us.ibm.com
      Cc: peterz@infradead.org
      Cc: rostedt@goodmis.org
      Cc: Valdis.Kletnieks@vt.edu
      Cc: dhowells@redhat.com
      LKML-Reference: <1266887105-1528-6-git-send-email-paulmck@linux.vnet.ibm.com>
      [ -v2: fix allmodconfig missing symbol export build failure on x86 ]
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
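The lockdep-based checking referenced above lets each dereference state the condition under which it is legal: being inside an RCU read-side critical section, or holding the update-side lock. A minimal userspace sketch of that idea follows; all names are illustrative stand-ins, not the kernel API.

```c
#include <assert.h>

/* Illustrative stand-ins for lockdep state; not the kernel API. */
static int rcu_read_held;    /* "inside an rcu_read_lock() section" */
static int update_lock_held; /* "update-side lock held" */
static int lockdep_splat;    /* set when a check fails */

static void *deref_check(void *p, int condition)
{
	if (!condition)
		lockdep_splat = 1; /* lockdep would warn here */
	return p;
}

/* Rough shape of rcu_dereference_check(p, c): return the pointer,
 * but complain unless the stated safety condition holds. */
#define rcu_dereference_check_sim(p) \
	deref_check((p), rcu_read_held || update_lock_held)
```

The point of the per-subsystem wrapper in the commit above is to bake a fixed condition (like this macro's) into one named helper instead of repeating it at every call site.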
  2. 17 Feb, 2010 2 commits
  3. 16 Feb, 2010 2 commits
      sched: Fix race between ttwu() and task_rq_lock() · 0970d299
      Peter Zijlstra authored
      Thomas found that due to ttwu() changing a task's cpu without holding
      the rq->lock, task_rq_lock() might end up locking the wrong rq.
      
      Avoid this by serializing against TASK_WAKING.
      Reported-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1266241712.15770.420.camel@laptop>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
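The serialization described above implies the familiar retry pattern in task_rq_lock(): take the lock for the rq the task appeared to be on, then re-check, since ttwu() may have moved the task in the window before the lock was acquired. A userspace simulation of that pattern (illustrative names, not the kernel code):

```c
/* Simulated task placement and lock state; illustrative only. */
static int sim_task_cpu;        /* cpu the task is currently on */
static int sim_locked_cpu = -1; /* which rq "lock" we hold */

/* Simulate a concurrent ttwu() migrating the task exactly once,
 * inside the race window between reading the cpu and locking. */
static void sim_ttwu_once(void)
{
	static int moved;
	if (!moved) {
		sim_task_cpu = 2;
		moved = 1;
	}
}

static int sim_task_rq_lock(void)
{
	for (;;) {
		int cpu = sim_task_cpu;
		sim_ttwu_once();          /* the race happens here */
		sim_locked_cpu = cpu;     /* "take" that rq's lock */
		if (cpu == sim_task_cpu)
			return cpu;       /* locked the right rq */
		sim_locked_cpu = -1;      /* wrong rq: unlock, retry */
	}
}
```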
      sched: Fix SMT scheduler regression in find_busiest_queue() · 9000f05c
      Suresh Siddha authored
      Fix an SMT scheduler performance regression that is leading to a scenario
      where SMT threads in one core are completely idle while both the SMT threads
      in another core (on the same socket) are busy.
      
      This is caused by this commit (with the problematic code highlighted)
      
         commit bdb94aa5
         Author: Peter Zijlstra <a.p.zijlstra@chello.nl>
         Date:   Tue Sep 1 10:34:38 2009 +0200
      
         sched: Try to deal with low capacity
      
         @@ -4203,15 +4223,18 @@ find_busiest_queue()
         ...
      	for_each_cpu(i, sched_group_cpus(group)) {
         +	unsigned long power = power_of(i);
      
         ...
      
         -	wl = weighted_cpuload(i);
         +	wl = weighted_cpuload(i) * SCHED_LOAD_SCALE;
         +	wl /= power;
      
         -	if (rq->nr_running == 1 && wl > imbalance)
         +	if (capacity && rq->nr_running == 1 && wl > imbalance)
      		continue;
      
      On an SMT system, the power of an HT logical cpu will be 589 and
      the scheduler load imbalance (for scenarios like the one mentioned above)
      can be approximately 1024 (SCHED_LOAD_SCALE). The above change of scaling
      the weighted load with the power will result in "wl > imbalance" and
      ultimately in find_busiest_queue() returning NULL, causing
      load_balance() to think that the load is well balanced. But in fact
      one of the tasks can be moved to the idle core for optimal performance.
      
      We don't need to use the weighted load (wl) scaled by the cpu power to
      compare with imbalance. In that condition, we already know there is only a
      single task ("rq->nr_running == 1") and the comparison between imbalance and
      wl is to make sure that we select the correct priority thread which matches
      imbalance. So we really need to compare the imbalance with the original
      weighted load of the cpu and not the scaled load.
      
      But in other conditions, where we want the most hammered (busiest) cpu, we
      can use the scaled load to ensure that we consider the cpu power in addition
      to the actual load on that cpu, so that we can move the load away from the
      cpu that is getting most hammered with respect to the actual capacity,
      as compared with the rest of the cpus in that busiest group.
      
      Fix it.
      Reported-by: Ma Ling <ling.ma@intel.com>
      Initial-Analysis-by: Zhang, Yanmin <yanmin_zhang@linux.intel.com>
      Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1266023662.2808.118.camel@sbs-t61.sc.intel.com>
      Cc: stable@kernel.org [2.6.32.x]
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
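The arithmetic in the message is easy to check: with SCHED_LOAD_SCALE = 1024 and an HT logical-cpu power of 589, a single task of weight 1024 gets a scaled wl of 1024 * 1024 / 589 ≈ 1780, which trips the "wl > imbalance" filter, while the raw weighted load (1024) does not. A minimal userspace sketch of just that comparison, using the illustrative values from the message (not the kernel code):

```c
#include <assert.h>

#define SCHED_LOAD_SCALE 1024UL

/* Scaled load as computed by the problematic hunk quoted above. */
static unsigned long scaled_load(unsigned long raw_wl, unsigned long power)
{
	return raw_wl * SCHED_LOAD_SCALE / power;
}

/* The single-task filter: a queue is skipped when wl > imbalance. */
static int skips_queue(unsigned long wl, unsigned long imbalance)
{
	return wl > imbalance;
}
```

With the fix, the raw load is used for this single-task comparison, so the busy sibling is no longer skipped.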
  4. 08 Feb, 2010 2 commits
      sched: cpuacct: Use bigger percpu counter batch values for stats counters · fa535a77
      Anton Blanchard authored
      When CONFIG_VIRT_CPU_ACCOUNTING and CONFIG_CGROUP_CPUACCT are
      enabled we can call cpuacct_update_stats with values much larger
      than percpu_counter_batch.  This means the call to
      percpu_counter_add will always add to the global count which is
      protected by a spinlock and we end up with a global spinlock in
      the scheduler.
      
      Based on an idea by KOSAKI Motohiro, this patch scales the batch
      value by cputime_one_jiffy such that we have the same batch
      limit as we would if CONFIG_VIRT_CPU_ACCOUNTING was disabled.
      His patch did this once at boot but that initialisation happened
      too early on PowerPC (before time_init) and it was never updated
      at runtime as a result of a hotplug cpu add/remove.
      
      This patch instead scales percpu_counter_batch by
      cputime_one_jiffy at runtime, which keeps the batch correct even
      after cpu hotplug operations.  We cap it at INT_MAX in case of
      overflow.
      
      For architectures that do not support
      CONFIG_VIRT_CPU_ACCOUNTING, cputime_one_jiffy is the constant 1
      and gcc is smart enough to optimise min(s32
      percpu_counter_batch, INT_MAX) to just percpu_counter_batch at
      least on x86 and PowerPC.  So there is no need to add an #ifdef.
      
      On a 64 thread PowerPC box with CONFIG_VIRT_CPU_ACCOUNTING and
      CONFIG_CGROUP_CPUACCT enabled, a context switch microbenchmark
      is 234x faster and almost matches a CONFIG_CGROUP_CPUACCT
      disabled kernel:
      
       CONFIG_CGROUP_CPUACCT disabled:   16906698 ctx switches/sec
       CONFIG_CGROUP_CPUACCT enabled:       61720 ctx switches/sec
       CONFIG_CGROUP_CPUACCT + patch:    16663217 ctx switches/sec
      
      Tested with:
      
       wget http://ozlabs.org/~anton/junkcode/context_switch.c
       make context_switch
       for i in `seq 0 63`; do taskset -c $i ./context_switch & done
       vmstat 1
      Signed-off-by: Anton Blanchard <anton@samba.org>
      Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Tested-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
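The runtime scaling described above boils down to multiplying percpu_counter_batch by cputime_one_jiffy and capping the product at INT_MAX. A userspace sketch of that computation, assuming a 64-bit long; the batch value here is an illustrative stand-in, not the kernel's:

```c
#include <assert.h>
#include <limits.h>

/* Illustrative stand-in for the kernel's percpu_counter_batch. */
static long percpu_counter_batch_sim = 32;

/* Batch for cpuacct stats: scale by cputime_one_jiffy, cap at
 * INT_MAX so the value stays usable as an s32 batch. */
static long cpuacct_batch(long cputime_one_jiffy)
{
	long batch = percpu_counter_batch_sim * cputime_one_jiffy;
	return batch < INT_MAX ? batch : INT_MAX;
}
```

When CONFIG_VIRT_CPU_ACCOUNTING is off, cputime_one_jiffy is 1 and the computation degenerates to the plain batch, matching the message's observation that no #ifdef is needed.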
      kernel/sched.c: Suppress unused var warning · 50200df4
      Andrew Morton authored
      On UP:
      
       kernel/sched.c: In function 'wake_up_new_task':
       kernel/sched.c:2631: warning: unused variable 'cpu'
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  5. 04 Feb, 2010 1 commit
  6. 02 Feb, 2010 1 commit
  7. 22 Jan, 2010 2 commits
  8. 21 Jan, 2010 5 commits
  9. 13 Jan, 2010 1 commit
  10. 28 Dec, 2009 2 commits
  11. 23 Dec, 2009 1 commit
  12. 20 Dec, 2009 2 commits
  13. 17 Dec, 2009 3 commits
      sched: Fix broken assertion · 077614ee
      Peter Zijlstra authored
      There's a preemption race in the set_task_cpu() debug check:
      when we get preempted after setting task->state we'd still
      be on the rq proper, but fail the test.
      
      Check for preempted tasks, since those are always on the RQ.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <20091217121830.137155561@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      sched: Teach might_sleep() about preemptible RCU · 234da7bc
      Frederic Weisbecker authored
      In practice, it is harmless to voluntarily sleep in a
      rcu_read_lock() section if we are running under preempt rcu, but
      it is illegal if we build a kernel running non-preemptable rcu.
      
      Currently, might_sleep() doesn't notice sleepable operations
      under rcu_read_lock() sections if we are running under
      preemptable rcu because preempt_count() is left untouched after
      rcu_read_lock() in this case. But we want developers who test
      their changes under such config to notice the "sleeping while
      atomic" issues.
      
      So we add rcu_read_lock_nesting to preempt_count() in the
      might_sleep() checks.
      
      [ v2: Handle rcu-tiny ]
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      LKML-Reference: <1260991265-8451-1-git-send-regression-fweisbec@gmail.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
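The check described above can be simulated in userspace: under preemptible RCU, rcu_read_lock() leaves preempt_count() untouched, so might_sleep() must add the per-task rcu_read_lock_nesting depth before deciding whether a sleep is "atomic". All names below are illustrative stand-ins:

```c
#include <assert.h>

static int preempt_count_sim;      /* untouched by preemptible RCU */
static int rcu_read_lock_nesting;  /* per-task RCU nesting depth */

static void rcu_read_lock_sim(void)   { rcu_read_lock_nesting++; }
static void rcu_read_unlock_sim(void) { rcu_read_lock_nesting--; }

/* Nonzero when might_sleep() should flag "sleeping while atomic". */
static int might_sleep_would_warn(void)
{
	return (preempt_count_sim + rcu_read_lock_nesting) != 0;
}
```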
      sched: Make warning less noisy · 416eb395
      Ingo Molnar authored
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Mike Galbraith <efault@gmx.de>
      LKML-Reference: <20091216170517.807938893@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  14. 16 Dec, 2009 9 commits
      sched: Simplify set_task_cpu() · 738d2be4
      Peter Zijlstra authored
      Rearrange the code a bit now that it's a simpler function.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Mike Galbraith <efault@gmx.de>
      LKML-Reference: <20091216170518.269101883@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      sched: Remove the cfs_rq dependency from set_task_cpu() · 88ec22d3
      Peter Zijlstra authored
      In order to remove the cfs_rq dependency from set_task_cpu() we
      need to ensure the task is cfs_rq invariant for all callsites.
      
      The simple approach is to subtract cfs_rq->min_vruntime from
      se->vruntime on dequeue, and add cfs_rq->min_vruntime on
      enqueue.
      
      However, this has the downside of breaking FAIR_SLEEPERS since
      we lose the old vruntime as we only maintain the relative
      position.
      
      To solve this, we observe that we only migrate runnable tasks,
      we do this using deactivate_task(.sleep=0) and
      activate_task(.wakeup=0), therefore we can restrain the
      min_vruntime invariance to that state.
      
      The only other case is wakeup balancing, since we want to
      maintain the old vruntime we cannot make it relative on dequeue,
      but since we don't migrate inactive tasks, we can do so right
      before we activate it again.
      
      This is where we need the new pre-wakeup hook, we need to call
      this while still holding the old rq->lock. We could fold it into
      ->select_task_rq(), but since that has multiple callsites and
      would obfuscate the locking requirements, that seems like a
      fudge.
      
      This leaves the fork() case, simply make sure that ->task_fork()
      leaves the ->vruntime in a relative state.
      
      This covers all cases where set_task_cpu() gets called, and
      ensures it sees a relative vruntime.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Mike Galbraith <efault@gmx.de>
      LKML-Reference: <20091216170518.191697025@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
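The normalization scheme above can be sketched in a few lines: subtract min_vruntime on dequeue so the vruntime becomes cfs_rq-invariant, and add the destination rq's min_vruntime back on enqueue, preserving the task's relative position across the migration. Types and values are illustrative, not the kernel's:

```c
#include <assert.h>

struct cfs_rq_sim { unsigned long long min_vruntime; };
struct se_sim     { unsigned long long vruntime; };

/* On dequeue for migration: make vruntime relative (rq-invariant). */
static void dequeue_sim(struct se_sim *se, struct cfs_rq_sim *cfs_rq)
{
	se->vruntime -= cfs_rq->min_vruntime;
}

/* On enqueue: make vruntime absolute again on the new cfs_rq. */
static void enqueue_sim(struct se_sim *se, struct cfs_rq_sim *cfs_rq)
{
	se->vruntime += cfs_rq->min_vruntime;
}
```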
      sched: Add pre and post wakeup hooks · efbbd05a
      Peter Zijlstra authored
      As will be apparent in the next patch, we need a pre wakeup hook
      for sched_fair task migration, hence rename the post wakeup hook
      and add a pre wakeup one.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Mike Galbraith <efault@gmx.de>
      LKML-Reference: <20091216170518.114746117@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      sched: Move kthread_bind() back to kthread.c · 881232b7
      Peter Zijlstra authored
      Since kthread_bind() lost its dependencies on sched.c, move it
      back where it came from.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Mike Galbraith <efault@gmx.de>
      LKML-Reference: <20091216170518.039524041@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      sched: Fix select_task_rq() vs hotplug issues · 5da9a0fb
      Peter Zijlstra authored
      Since select_task_rq() is now responsible for guaranteeing
      ->cpus_allowed and cpu_active_mask, we need to verify this.
      
      select_task_rq_rt() can blindly return
      smp_processor_id()/task_cpu() without checking the valid masks,
      select_task_rq_fair() can do the same in the rare case that all
      SD_flags are disabled.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Mike Galbraith <efault@gmx.de>
      LKML-Reference: <20091216170517.961475466@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      sched: Fix sched_exec() balancing · 38022906
      Peter Zijlstra authored
      Since we access ->cpus_allowed without holding rq->lock we need
      a retry loop to validate the result; this comes nearly for free
      when we merge sched_migrate_task() into sched_exec() since that
      already does the needed check.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Mike Galbraith <efault@gmx.de>
      LKML-Reference: <20091216170517.884743662@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      sched: Ensure set_task_cpu() is never called on blocked tasks · e2912009
      Peter Zijlstra authored
      In order to clean up the set_task_cpu() rq dependencies we need
      to ensure it is never called on blocked tasks because such usage
      does not pair with consistent rq->lock usage.
      
      This puts the migration burden on ttwu().
      
      Furthermore we need to close a race against changing
      ->cpus_allowed, since select_task_rq() runs with only preemption
      disabled.
      
      For sched_fork() this is safe because the child isn't in the
      tasklist yet; for wakeup we fix this by synchronizing
      set_cpus_allowed_ptr() against TASK_WAKING, which leaves
      sched_exec to be a problem.
      
      This also closes a hole in (6ad4c188 sched: Fix balance vs
      hotplug race) where ->select_task_rq() doesn't validate the
      result against the sched_domain/root_domain.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Mike Galbraith <efault@gmx.de>
      LKML-Reference: <20091216170517.807938893@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      sched: Use TASK_WAKING for fork wakeups · 06b83b5f
      Peter Zijlstra authored
      For later convenience use TASK_WAKING for fresh tasks.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Mike Galbraith <efault@gmx.de>
      LKML-Reference: <20091216170517.732561278@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      sched: Fix task_hot() test order · e6c8fba7
      Peter Zijlstra authored
      Make sure not to access sched_fair fields before verifying it is
      indeed a sched_fair task.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: stable@kernel.org
      LKML-Reference: <20091216170517.577998058@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
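The fix is purely an ordering constraint: confirm the task belongs to the fair class before dereferencing any sched_fair-only state. A userspace sketch with illustrative types (not the kernel's task_hot()):

```c
#include <assert.h>
#include <stddef.h>

struct sched_class_sim { int id; };
static const struct sched_class_sim fair_sched_class_sim = { 1 };

struct task_sim {
	const struct sched_class_sim *sched_class;
	unsigned long long last_wakeup; /* fair-class-only field */
};

static int task_hot_sim(const struct task_sim *p, unsigned long long now)
{
	/* The class check must come first ... */
	if (p->sched_class != &fair_sched_class_sim)
		return 0;
	/* ... only then is the fair-class field safe to read. */
	return now - p->last_wakeup < 5;
}
```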
  15. 14 Dec, 2009 5 commits