1. 24 Mar, 2011 2 commits
  2. 21 Mar, 2011 1 commit
  3. 16 Mar, 2011 1 commit
    • Christoph Hellwig's avatar
      prune back iprune_sem · bab1d944
      Christoph Hellwig authored
      iprune_sem is continously giving us lockdep warnings because we do take it in
      read mode in the reclaim path, but we're also doing non-NOFS allocations under
      it taken in write mode.
      Taking a bit deeper look at it I think it's fixable quite trivially:
       - for invalidate_inodes we do not need iprune_sem at all.  We have an active
         reference on the superblock, so the filesystem is not going away until it
         has finished.
       - for evict_inodes we do need it, to make sure prune_icache has done it's
         work before we tear down the superblock.  But there is no reason to
         hold it over the actual reclaim operation - it's enough to cycle through
         it after the actual reclaim to make sure we wait for any pending
         prune_icache to complete.  We just have to remove the WARN_ON for
         otherwise busy inodes as they can actually happen now.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
  4. 24 Feb, 2011 2 commits
    • NeilBrown's avatar
      Fix over-zealous flush_disk when changing device size. · 93b270f7
      NeilBrown authored
      There are two cases when we call flush_disk.
      In one, the device has disappeared (check_disk_change) so any
      data will hold becomes irrelevant.
      In the oter, the device has changed size (check_disk_size_change)
      so data we hold may be irrelevant.
      In both cases it makes sense to discard any 'clean' buffers,
      so they will be read back from the device if needed.
      In the former case it makes sense to discard 'dirty' buffers
      as there will never be anywhere safe to write the data.  In the
      second case it *does*not* make sense to discard dirty buffers
      as that will lead to file system corruption when you simply enlarge
      the containing devices.
      flush_disk calls __invalidate_devices.
      __invalidate_device calls both invalidate_inodes and invalidate_bdev.
      invalidate_inodes *does* discard I_DIRTY inodes and this does lead
      to fs corruption.
      invalidate_bev *does*not* discard dirty pages, but I don't really care
      about that at present.
      So this patch adds a flag to __invalidate_device (calling it
      __invalidate_device2) to indicate whether dirty buffers should be
      killed, and this is passed to invalidate_inodes which can choose to
      skip dirty inodes.
      flusk_disk then passes true from check_disk_change and false from
      dm avoids tripping over this problem by calling i_size_write directly
      rathher than using check_disk_size_change.
      md does use check_disk_size_change and so is affected.
      This regression was introduced by commit 608aeef1
       which causes
      check_disk_size_change to call flush_disk, so it is suitable for any
      kernel since 2.6.27.
      Cc: stable@kernel.org
      Acked-by: default avatarJeff Moyer <jmoyer@redhat.com>
      Cc: Andrew Patterson <andrew.patterson@hp.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
    • Miklos Szeredi's avatar
      mm: prevent concurrent unmap_mapping_range() on the same inode · 2aa15890
      Miklos Szeredi authored
      Michael Leun reported that running parallel opens on a fuse filesystem
      can trigger a "kernel BUG at mm/truncate.c:475"
      Gurudas Pai reported the same bug on NFS.
      The reason is, unmap_mapping_range() is not prepared for more than
      one concurrent invocation per inode.  For example:
        thread1: going through a big range, stops in the middle of a vma and
           stores the restart address in vm_truncate_count.
        thread2: comes in with a small (e.g. single page) unmap request on
           the same vma, somewhere before restart_address, finds that the
           vma was already unmapped up to the restart address and happily
           returns without doing anything.
      Another scenario would be two big unmap requests, both having to
      restart the unmapping and each one setting vm_truncate_count to its
      own value.  This could go on forever without any of them being able to
      Truncate and hole punching already serialize with i_mutex.  Other
      callers of unmap_mapping_range() do not, and it's difficult to get
      i_mutex protection for all callers.  In particular ->d_revalidate(),
      which calls invalidate_inode_pages2_range() in fuse, may be called
      with or without i_mutex.
      This patch adds a new mutex to 'struct address_space' to prevent
      running multiple concurrent unmap_mapping_range() on the same mapping.
      [ We'll hopefully get rid of all this with the upcoming mm
        preemptibility series by Peter Zijlstra, the "mm: Remove i_mmap_mutex
        lockbreak" patch in particular.  But that is for 2.6.39 ]
      Signed-off-by: default avatarMiklos Szeredi <mszeredi@suse.cz>
      Reported-by: default avatarMichael Leun <lkml20101129@newton.leun.net>
      Reported-by: default avatarGurudas Pai <gurudas.pai@oracle.com>
      Tested-by: default avatarGurudas Pai <gurudas.pai@oracle.com>
      Acked-by: default avatarHugh Dickins <hughd@google.com>
      Cc: stable@kernel.org
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
  5. 07 Jan, 2011 4 commits
    • Nick Piggin's avatar
      fs: avoid inode RCU freeing for pseudo fs · ff0c7d15
      Nick Piggin authored
      Pseudo filesystems that don't put inode on RCU list or reachable by
      rcu-walk dentries do not need to RCU free their inodes.
      Signed-off-by: default avatarNick Piggin <npiggin@kernel.dk>
    • Nick Piggin's avatar
      fs: icache RCU free inodes · fa0d7e3d
      Nick Piggin authored
      RCU free the struct inode. This will allow:
      - Subsequent store-free path walking patch. The inode must be consulted for
        permissions when walking, so an RCU inode reference is a must.
      - sb_inode_list_lock to be moved inside i_lock because sb list walkers who want
        to take i_lock no longer need to take sb_inode_list_lock to walk the list in
        the first place. This will simplify and optimize locking.
      - Could remove some nested trylock loops in dcache code
      - Could potentially simplify things a bit in VM land. Do not need to take the
        page lock to follow page->mapping.
      The downsides of this is the performance cost of using RCU. In a simple
      creat/unlink microbenchmark, performance drops by about 10% due to inability to
      reuse cache-hot slab objects. As iterations increase and RCU freeing starts
      kicking over, this increases to about 20%.
      In cases where inode lifetimes are longer (ie. many inodes may be allocated
      during the average life span of a single inode), a lot of this cache reuse is
      not applicable, so the regression caused by this patch is smaller.
      The cache-hot regression could largely be avoided by using SLAB_DESTROY_BY_RCU,
      however this adds some complexity to list walking and store-free path walking,
      so I prefer to implement this at a later date, if it is shown to be a win in
      real situations. I haven't found a regression in any non-micro benchmark so I
      doubt it will be a problem.
      Signed-off-by: default avatarNick Piggin <npiggin@kernel.dk>
    • Nick Piggin's avatar
      fs: use fast counters for vfs caches · 3e880fb5
      Nick Piggin authored
      percpu_counter library generates quite nasty code, so unless you need
      to dynamically allocate counters or take fast approximate value, a
      simple per cpu set of counters is much better.
      The percpu_counter can never be made to work as well, because it has an
      indirection from pointer to percpu memory, and it can't use direct
      this_cpu_inc interfaces because it doesn't use static PER_CPU data, so
      code will always be worse.
      In the fastpath, it is the difference between this:
              incl %gs:nr_dentry      # nr_dentry
      and this:
              movl    percpu_counter_batch(%rip), %edx        # percpu_counter_batch,
              movl    $1, %esi        #,
              movq    $nr_dentry, %rdi        #,
              call    __percpu_counter_add    # (plus I clobber registers)
              pushq   %rbp    #
              movq    %rsp, %rbp      #,
              subq    $32, %rsp       #,
              movq    %rbx, -24(%rbp) #,
              movq    %r12, -16(%rbp) #,
              movq    %r13, -8(%rbp)  #,
              movq    %rdi, %rbx      # fbc, fbc
      # 216 "/home/npiggin/usr/src/linux-2.6/arch/x86/include/asm/thread_info.h" 1
              movq %gs:kernel_stack,%rax      #, pfo_ret__
      # 0 "" 2
              incl    -8124(%rax)     # <variable>.preempt_count
              movq    32(%rdi), %r12  # <variable>.counters, tcp_ptr__
      # 78 "lib/percpu_counter.c" 1
              add %gs:this_cpu_off, %r12      # this_cpu_off, tcp_ptr__
      # 0 "" 2
              movslq  (%r12),%r13     #* tcp_ptr__, tmp73
              movslq  %edx,%rax       # batch, batch
              addq    %rsi, %r13      # amount, count
              cmpq    %rax, %r13      # batch, count
              jge     .L27    #,
              negl    %edx    # tmp76
              movslq  %edx,%rdx       # tmp76, tmp77
              cmpq    %rdx, %r13      # tmp77, count
              jg      .L28    #,
              movq    %rbx, %rdi      # fbc,
              call    _raw_spin_lock  #
              addq    %r13, 8(%rbx)   # count, <variable>.count
              movq    %rbx, %rdi      # fbc,
              movl    $0, (%r12)      #,* tcp_ptr__
              call    _raw_spin_unlock        #
      # 216 "/home/npiggin/usr/src/linux-2.6/arch/x86/include/asm/thread_info.h" 1
              movq %gs:kernel_stack,%rax      #, pfo_ret__
      # 0 "" 2
              decl    -8124(%rax)     # <variable>.preempt_count
              movq    -8136(%rax), %rax       #, D.14625
              testb   $8, %al #, D.14625
              jne     .L32    #,
              movq    -24(%rbp), %rbx #,
              movq    -16(%rbp), %r12 #,
              movq    -8(%rbp), %r13  #,
              .p2align 4,,10
              .p2align 3
              movl    %r13d, (%r12)   # count,*
              jmp     .L29    #
              call    preempt_schedule        #
              .p2align 4,,6
              jmp     .L31    #
              .size   __percpu_counter_add, .-__percpu_counter_add
              .p2align 4,,15
      Signed-off-by: default avatarNick Piggin <npiggin@kernel.dk>
    • Nick Piggin's avatar
      vfs: revert per-cpu nr_unused counters for dentry and inodes · 86c8749e
      Nick Piggin authored
      The nr_unused counters count the number of objects on an LRU, and as such they
      are synchronized with LRU object insertion and removal and scanning, and
      protected under the LRU lock.
      Making it per-cpu does not actually get any concurrency improvements because of
      this lock, and summing the counter is much slower, and
      incrementing/decrementing it costs more code size and is slower too.
      These counters should stay per-LRU, which currently means global.
      Signed-off-by: default avatarNick Piggin <npiggin@kernel.dk>
  6. 26 Oct, 2010 19 commits
  7. 09 Aug, 2010 11 commits