1. 03 Jul, 2006 3 commits
    • Ingo Molnar's avatar
      [PATCH] lockdep: core · fbb9ce95
      Ingo Molnar authored
      Do 'make oldconfig' and accept all the defaults for new config options -
      reboot into the kernel and if everything goes well it should boot up fine and
      you should have /proc/lockdep and /proc/lockdep_stats files.
      
      Typically if the lock validator finds some problem it will print out
      voluminous debug output that begins with "BUG: ..." and which syslog output
      can be used by kernel developers to figure out the precise locking scenario.
      
      What does the lock validator do?  It "observes" and maps all locking rules as
      they occur dynamically (as triggered by the kernel's natural use of spinlocks,
      rwlocks, mutexes and rwsems).  Whenever the lock validator subsystem detects a
      new locking scenario, it validates this new rule against the existing set of
      rules.  If this new rule is consistent with the existing set of rules then the
      new rule is added transparently and the kernel continues as normal.  If the
      new rule could create a deadlock scenario then this condition is printed out.
      
      When determining validity of locking, all possible "deadlock scenarios" are
      considered: assuming arbitrary number of CPUs, arbitrary irq context and task
      context constellations, running arbitrary combinations of all the existing
      locking scenarios.  In a typical system this means millions of separate
      scenarios.  This is why we call it a "locking correctness" validator - for all
      rules that are observed the lock validator proves it with mathematical
      certainty that a deadlock could not occur (assuming that the lock validator
      implementation itself is correct and its internal data structures are not
      corrupted by some other kernel subsystem).  [see more details and conditionals
      of this statement in include/linux/lockdep.h and
      Documentation/lockdep-design.txt]
      
      Furthermore, this "all possible scenarios" property of the validator also
      enables the finding of complex, highly unlikely multi-CPU multi-context races
      via single single-context rules, increasing the likelyhood of finding bugs
      drastically.  In practical terms: the lock validator already found a bug in
      the upstream kernel that could only occur on systems with 3 or more CPUs, and
      which needed 3 very unlikely code sequences to occur at once on the 3 CPUs.
      That bug was found and reported on a single-CPU system (!).  So in essence a
      race will be found "piecemail-wise", triggering all the necessary components
      for the race, without having to reproduce the race scenario itself!  In its
      short existence the lock validator found and reported many bugs before they
      actually caused a real deadlock.
      
      To further increase the efficiency of the validator, the mapping is not per
      "lock instance", but per "lock-class".  For example, all struct inode objects
      in the kernel have inode->inotify_mutex.  If there are 10,000 inodes cached,
      then there are 10,000 lock objects.  But ->inotify_mutex is a single "lock
      type", and all locking activities that occur against ->inotify_mutex are
      "unified" into this single lock-class.  The advantage of the lock-class
      approach is that all historical ->inotify_mutex uses are mapped into a single
      (and as narrow as possible) set of locking rules - regardless of how many
      different tasks or inode structures it took to build this set of rules.  The
      set of rules persist during the lifetime of the kernel.
      
      To see the rough magnitude of checking that the lock validator does, here's a
      portion of /proc/lockdep_stats, fresh after bootup:
      
       lock-classes:                            694 [max: 2048]
       direct dependencies:                  1598 [max: 8192]
       indirect dependencies:               17896
       all direct dependencies:             16206
       dependency chains:                    1910 [max: 8192]
       in-hardirq chains:                      17
       in-softirq chains:                     105
       in-process chains:                    1065
       stack-trace entries:                 38761 [max: 131072]
       combined max dependencies:         2033928
       hardirq-safe locks:                     24
       hardirq-unsafe locks:                  176
       softirq-safe locks:                     53
       softirq-unsafe locks:                  137
       irq-safe locks:                         59
       irq-unsafe locks:                      176
      
      The lock validator has observed 1598 actual single-thread locking patterns,
      and has validated all possible 2033928 distinct locking scenarios.
      
      More details about the design of the lock validator can be found in
      Documentation/lockdep-design.txt, which can also found at:
      
         http://redhat.com/~mingo/lockdep-patches/lockdep-design.txt
      
      [bunk@stusta.de: cleanups]
      Signed-off-by: default avatarIngo Molnar <mingo@elte.hu>
      Signed-off-by: default avatarArjan van de Ven <arjan@linux.intel.com>
      Signed-off-by: default avatarAdrian Bunk <bunk@stusta.de>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      fbb9ce95
    • Ingo Molnar's avatar
      [PATCH] lockdep: irqtrace subsystem, core · de30a2b3
      Ingo Molnar authored
      Accurate hard-IRQ-flags and softirq-flags state tracing.
      
      This allows us to attach extra functionality to IRQ flags on/off
      events (such as trace-on/off).
      Signed-off-by: default avatarIngo Molnar <mingo@elte.hu>
      Signed-off-by: default avatarArjan van de Ven <arjan@linux.intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      de30a2b3
    • Ingo Molnar's avatar
      [PATCH] lockdep: better lock debugging · 9a11b49a
      Ingo Molnar authored
      Generic lock debugging:
      
       - generalized lock debugging framework. For example, a bug in one lock
         subsystem turns off debugging in all lock subsystems.
      
       - got rid of the caller address passing (__IP__/__IP_DECL__/etc.) from
         the mutex/rtmutex debugging code: it caused way too much prototype
         hackery, and lockdep will give the same information anyway.
      
       - ability to do silent tests
      
       - check lock freeing in vfree too.
      
       - more finegrained debugging options, to allow distributions to
         turn off more expensive debugging features.
      
      There's no separate 'held mutexes' list anymore - but there's a 'held locks'
      stack within lockdep, which unifies deadlock detection across all lock
      classes.  (this is independent of the lockdep validation stuff - lockdep first
      checks whether we are holding a lock already)
      
      Here are the current debugging options:
      
      CONFIG_DEBUG_MUTEXES=y
      CONFIG_DEBUG_LOCK_ALLOC=y
      
      which do:
      
       config DEBUG_MUTEXES
                bool "Mutex debugging, basic checks"
      
       config DEBUG_LOCK_ALLOC
               bool "Detect incorrect freeing of live mutexes"
      Signed-off-by: default avatarIngo Molnar <mingo@elte.hu>
      Signed-off-by: default avatarArjan van de Ven <arjan@linux.intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      9a11b49a
  2. 30 Jun, 2006 1 commit
  3. 28 Jun, 2006 2 commits
  4. 26 Jun, 2006 2 commits
  5. 25 Jun, 2006 1 commit
    • KaiGai Kohei's avatar
      [PATCH] pacct: add pacct_struct to fix some pacct bugs. · 0e464814
      KaiGai Kohei authored
      The pacct facility need an i/o operation when an accounting record is
      generated.  There is a possibility to wake OOM killer up.  If OOM killer is
      activated, it kills some processes to make them release process memory
      regions.
      
      But acct_process() is called in the killed processes context before calling
      exit_mm(), so those processes cannot release own memory.  In the results, any
      processes stop in this point and it finally cause a system stall.
      0e464814
  6. 23 Jun, 2006 2 commits
  7. 01 May, 2006 1 commit
  8. 20 Apr, 2006 1 commit
  9. 19 Apr, 2006 1 commit
  10. 15 Apr, 2006 1 commit
  11. 31 Mar, 2006 3 commits
    • Kirill Korotaev's avatar
      [PATCH] wrong error path in dup_fd() leading to oopses in RCU · 42862298
      Kirill Korotaev authored
      Wrong error path in dup_fd() - it should return NULL on error,
      not an address of already freed memory :/
      
      Triggered by OpenVZ stress test suite.
      
      What is interesting is that it was causing different oopses in RCU like
      below:
      Call Trace:
         [<c013492c>] rcu_do_batch+0x2c/0x80
         [<c0134bdd>] rcu_process_callbacks+0x3d/0x70
         [<c0126cf3>] tasklet_action+0x73/0xe0
         [<c01269aa>] __do_softirq+0x10a/0x130
         [<c01058ff>] do_softirq+0x4f/0x60
         =======================
         [<c0113817>] smp_apic_timer_interrupt+0x77/0x110
         [<c0103b54>] apic_timer_interrupt+0x1c/0x24
        Code:  Bad EIP value.
         <0>Kernel panic - not syncing: Fatal exception in interrupt
      Signed-Off-By: default avatarPavel Emelianov <xemul@sw.ru>
      Signed-Off-By: default avatarDmitry Mishin <dim@openvz.org>
      Signed-Off-By: default avatarKirill Korotaev <dev@openvz.org>
      Signed-Off-By: default avatarLinus Torvalds <torvalds@osdl.org>
      42862298
    • Eric W. Biederman's avatar
      [PATCH] pidhash: Refactor the pid hash table · 92476d7f
      Eric W. Biederman authored
      Simplifies the code, reduces the need for 4 pid hash tables, and makes the
      code more capable.
      
      In the discussions I had with Oleg it was felt that to a large extent the
      cleanup itself justified the work.  With struct pid being dynamically
      allocated meant we could create the hash table entry when the pid was
      allocated and free the hash table entry when the pid was freed.  Instead of
      playing with the hash lists when ever a process would attach or detach to a
      process.
      
      For myself the fact that it gave what my previous task_ref patch gave for free
      with simpler code was a big win.  The problem is that if you hold a reference
      to struct task_struct you lock in 10K of low memory.  If you do that in a user
      controllable way like /proc does, with an unprivileged but hostile user space
      application with typical resource limits of 1000 fds and 100 processes I can
      trigger the OOM killer by consuming all of low memory with task structs, on a
      machine wight 1GB of low memory.
      
      If I instead hold a reference to struct pid which holds a pointer to my
      task_struct, I don't suffer from that problem because struct pid is 2 orders
      of magnitude smaller.  In fact struct pid is small enough that most other
      kernel data structures dwarf it, so simply limiting the number of referring
      data structures is enough to prevent exhaustion of low memory.
      
      This splits the current struct pid into two structures, struct pid and struct
      pid_link, and reduces our number of hash tables from PIDTYPE_MAX to just one.
      struct pid_link is the per process linkage into the hash tables and lives in
      struct task_struct.  struct pid is given an indepedent lifetime, and holds
      pointers to each of the pid types.
      
      The independent life of struct pid simplifies attach_pid, and detach_pid,
      because we are always manipulating the list of pids and not the hash table.
      In addition in giving struct pid an indpendent life it makes the concept much
      more powerful.
      
      Kernel data structures can now embed a struct pid * instead of a pid_t and
      not suffer from pid wrap around problems or from keeping unnecessarily
      large amounts of memory allocated.
      Signed-off-by: default avatarEric W. Biederman <ebiederm@xmission.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      92476d7f
    • Andrew Morton's avatar
      [PATCH] resurrect __put_task_struct · 158d9ebd
      Andrew Morton authored
      This just got nuked in mainline.  Bring it back because Eric's patches use it.
      
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      158d9ebd
  12. 29 Mar, 2006 9 commits
  13. 27 Mar, 2006 1 commit
  14. 26 Mar, 2006 2 commits
  15. 24 Mar, 2006 1 commit
    • Paul Jackson's avatar
      [PATCH] cpuset memory spread slab cache optimizations · c61afb18
      Paul Jackson authored
      The hooks in the slab cache allocator code path for support of NUMA
      mempolicies and cpuset memory spreading are in an important code path.  Many
      systems will use neither feature.
      
      This patch optimizes those hooks down to a single check of some bits in the
      current tasks task_struct flags.  For non NUMA systems, this hook and related
      code is already ifdef'd out.
      
      The optimization is done by using another task flag, set if the task is using
      a non-default NUMA mempolicy.  Taking this flag bit along with the
      PF_SPREAD_PAGE and PF_SPREAD_SLAB flag bits added earlier in this 'cpuset
      memory spreading' patch set, one can check for the combination of any of these
      special case memory placement mechanisms with a single test of the current
      tasks task_struct flags.
      
      This patch also tightens up the code, to save a few bytes of kernel text
      space, and moves some of it out of line.  Due to the nested inlines called
      from multiple places, we were ending up with three copies of this code, which
      once we get off the main code path (for local node allocation) seems a bit
      wasteful of instruction memory.
      Signed-off-by: default avatarPaul Jackson <pj@sgi.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      c61afb18
  16. 23 Mar, 2006 2 commits
    • Jens Axboe's avatar
    • Eric Dumazet's avatar
      [PATCH] Shrinks sizeof(files_struct) and better layout · 0c9e63fd
      Eric Dumazet authored
      1) Reduce the size of (struct fdtable) to exactly 64 bytes on 32bits
         platforms, lowering kmalloc() allocated space by 50%.
      
      2) Reduce the size of (files_struct), using a special 32 bits (or
         64bits) embedded_fd_set, instead of a 1024 bits fd_set for the
         close_on_exec_init and open_fds_init fields.  This save some ram (248
         bytes per task) as most tasks dont open more than 32 files.  D-Cache
         footprint for such tasks is also reduced to the minimum.
      
      3) Reduce size of allocated fdset.  Currently two full pages are
         allocated, that is 32768 bits on x86 for example, and way too much.  The
         minimum is now L1_CACHE_BYTES.
      
      UP and SMP should benefit from this patch, because most tasks will touch
      only one cache line when open()/close() stdin/stdout/stderr (0/1/2),
      (next_fd, close_on_exec_init, open_fds_init, fd_array[0 ..  2] being in the
      same cache line)
      Signed-off-by: default avatarEric Dumazet <dada1@cosmosbay.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      0c9e63fd
  17. 22 Mar, 2006 1 commit
  18. 18 Mar, 2006 1 commit
  19. 17 Mar, 2006 1 commit
  20. 14 Mar, 2006 1 commit
    • GOTO Masanori's avatar
      [PATCH] Fix sigaltstack corruption among cloned threads · f9a3879a
      GOTO Masanori authored
      This patch fixes alternate signal stack corruption among cloned threads
      with CLONE_SIGHAND (and CLONE_VM) for linux-2.6.16-rc6.
      
      The value of alternate signal stack is currently inherited after a call of
      clone(...  CLONE_SIGHAND | CLONE_VM).  But if sigaltstack is set by a
      parent thread, and then if multiple cloned child threads (+ parent threads)
      call signal handler at the same time, some threads may be conflicted -
      because they share to use the same alternative signal stack region.
      Finally they get sigsegv.  It's an undesirable race condition.  Note that
      child threads created from NPTL pthread_create() also hit this conflict
      when the parent thread uses sigaltstack, without my patch.
      
      To fix this problem, this patch clears the child threads' sigaltstack
      information like exec().  This behavior follows the SUSv3 specification.
      In SUSv3, pthread_create() says "The alternate stack shall not be inherited
      (when new threads are initialized)".  It means that sigaltstack should be
      cleared when sigaltstack memory space is shared by cloned threads with
      CLONE_SIGHAND.
      
      Note that I chose "if (clone_flags & CLONE_SIGHAND)" line because:
        - If clone_flags line is not existed, fork() does not inherit sigaltstack.
        - CLONE_VM is another choice, but vfork() does not inherit sigaltstack.
        - CLONE_SIGHAND implies CLONE_VM, and it looks suitable.
        - CLONE_THREAD is another candidate, and includes CLONE_SIGHAND + CLONE_VM,
          but this flag has a bit different semantics.
      I decided to use CLONE_SIGHAND.
      
      [ Changed to test for CLONE_VM && !CLONE_VFORK after discussion --Linus ]
      Signed-off-by: default avatarGOTO Masanori <gotom@sanori.org>
      Cc: Roland McGrath <roland@redhat.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Acked-by: default avatarLinus Torvalds <torvalds@osdl.org>
      Cc: Ulrich Drepper <drepper@redhat.com>
      Cc: Jakub Jelinek <jakub@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      f9a3879a
  21. 11 Mar, 2006 1 commit
  22. 15 Feb, 2006 2 commits
    • Oleg Nesterov's avatar
      [PATCH] fix kill_proc_info() vs fork() theoretical race · dadac81b
      Oleg Nesterov authored
      copy_process:
      
      	attach_pid(p, PIDTYPE_PID, p->pid);
      	attach_pid(p, PIDTYPE_TGID, p->tgid);
      
      What if kill_proc_info(p->pid) happens in between?
      
      copy_process() holds current->sighand.siglock, so we are safe
      in CLONE_THREAD case, because current->sighand == p->sighand.
      
      Otherwise, p->sighand is unlocked, the new process is already
      visible to the find_task_by_pid(), but have a copy of parent's
      'struct pid' in ->pids[PIDTYPE_TGID].
      
      This means that __group_complete_signal() may hang while doing
      
      	do ... while (next_thread() != p)
      
      We can solve this problem if we reverse these 2 attach_pid()s:
      
      	attach_pid() does wmb()
      
      	group_send_sig_info() calls spin_lock(), which
      	provides a read barrier. // Yes ?
      
      I don't think we can hit this race in practice, but still.
      Signed-off-by: default avatarOleg Nesterov <oleg@tv-sign.ru>
      Cc: Roland McGrath <roland@redhat.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      dadac81b
    • Oleg Nesterov's avatar
      [PATCH] fix kill_proc_info() vs CLONE_THREAD race · 3f17da69
      Oleg Nesterov authored
      There is a window after copy_process() unlocks ->sighand.siglock
      and before it adds the new thread to the thread list.
      
      In that window __group_complete_signal(SIGKILL) will not see the
      new thread yet, so this thread will start running while the whole
      thread group was supposed to exit.
      
      I beleive we have another good reason to place attach_pid(PID/TGID)
      under ->sighand.siglock. We can do the same for
      
      	release_task()->__unhash_process()
      
      	de_thread()->switch_exec_pids()
      
      After that we don't need tasklist_lock to iterate over the thread
      list, and we can simplify things, see for example do_sigaction()
      or sys_times().
      Signed-off-by: default avatarOleg Nesterov <oleg@tv-sign.ru>
      Cc: Roland McGrath <roland@redhat.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      3f17da69