1. 11 Jun, 2009 7 commits
    • Eric Paris's avatar
      fsnotify: include pathnames with entries when possible · 62ffe5df
      Eric Paris authored
      
      
      When inotify wants to send events to a directory about a child it includes
      the name of the original file.  This patch collects that filename and makes
      it available for notification.
      Signed-off-by: default avatarEric Paris <eparis@redhat.com>
      Acked-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@lst.de>
      62ffe5df
    • Eric Paris's avatar
      fsnotify: generic notification queue and waitq · a2d8bc6c
      Eric Paris authored
      
      
      inotify needs to do asyc notification in which event information is stored
      on a queue until the listener is ready to receive it.  This patch
      implements a generic notification queue for inotify (and later fanotify) to
      store events to be sent at a later time.
      Signed-off-by: default avatarEric Paris <eparis@redhat.com>
      Acked-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@lst.de>
      a2d8bc6c
    • Eric Paris's avatar
      dnotify: reimplement dnotify using fsnotify · 3c5119c0
      Eric Paris authored
      
      
      Reimplement dnotify using fsnotify.
      Signed-off-by: default avatarEric Paris <eparis@redhat.com>
      Acked-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@lst.de>
      3c5119c0
    • Eric Paris's avatar
      fsnotify: parent event notification · c28f7e56
      Eric Paris authored
      
      
      inotify and dnotify both use a similar parent notification mechanism.  We
      add a generic parent notification mechanism to fsnotify for both of these
      to use.  This new machanism also adds the dentry flag optimization which
      exists for inotify to dnotify.
      Signed-off-by: default avatarEric Paris <eparis@redhat.com>
      Acked-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@lst.de>
      c28f7e56
    • Eric Paris's avatar
      fsnotify: add marks to inodes so groups can interpret how to handle those inodes · 3be25f49
      Eric Paris authored
      
      
      This patch creates a way for fsnotify groups to attach marks to inodes.
      These marks have little meaning to the generic fsnotify infrastructure
      and thus their meaning should be interpreted by the group that attached
      them to the inode's list.
      
      dnotify and inotify  will make use of these markings to indicate which
      inodes are of interest to their respective groups.  But this implementation
      has the useful property that in the future other listeners could actually
      use the marks for the exact opposite reason, aka to indicate which inodes
      it had NO interest in.
      Signed-off-by: default avatarEric Paris <eparis@redhat.com>
      Acked-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@lst.de>
      3be25f49
    • Eric Paris's avatar
      fsnotify: unified filesystem notification backend · 90586523
      Eric Paris authored
      fsnotify is a backend for filesystem notification.  fsnotify does
      not provide any userspace interface but does provide the basis
      needed for other notification schemes such as dnotify.  fsnotify
      can be extended to be the backend for inotify or the upcoming
      fanotify.  fsnotify provides a mechanism for "groups" to register for
      some set of filesystem events and to then deliver those events to
      those groups for processing.
      
      fsnotify has a number of benefits, the first being actually shrinking the size
      of an inode.  Before fsnotify to support both dnotify and inotify an inode had
      
              unsigned long           i_dnotify_mask; /* Directory notify events */
              struct dnotify_struct   *i_dnotify; /* for directory notifications */
              struct list_head        inotify_watches; /* watches on this inode */
              struct mutex            inotify_mutex;  /* protects the watches list
      
      But with fsnotify this same functionallity (and more) is done with just
      
              __u32                   i_fsnotify_mask; /* all events for this inode */
              struct hlist_head       i_fsnotify_mark_entries; /* marks on this inode */
      
      That's right, inotify, dnotify, and fanotify all in 64 bits.  We used that
      much space just in inotify_watches alone, before this patch set.
      
      fsnotify object lifetime and locking is MUCH better than what we have today.
      inotify locking is incredibly complex.  See 8f7b0ba1 as an example of
      what's been busted since inception.  inotify needs to know internal semantics
      of superblock destruction and unmounting to function.  The inode pinning and
      vfs contortions are horrible.
      
      no fsnotify implementers do allocation under locks.  This means things like
      f04b30de
      
       which (due to an overabundance of caution) changes GFP_KERNEL to
      GFP_NOFS can be reverted.  There are no longer any allocation rules when using
      or implementing your own fsnotify listener.
      
      fsnotify paves the way for fanotify.  In brief fanotify is a notification
      mechanism that delivers the lisener both an 'event' and an open file descriptor
      to the object in question.  This means that fanotify is pathname agnostic.
      Some on lkml may not care for the original companies or users that pushed for
      TALPA, but fanotify was designed with flexibility and input for other users in
      mind.  The readahead group expressed interest in fanotify as it could be used
      to profile disk access on boot without breaking the audit system.  The desktop
      search groups have also expressed interest in fanotify as it solves a number
      of the race conditions and problems present with managing inotify when more
      than a limited number of specific files are of interest.  fanotify can provide
      for a userspace access control system which makes it a clean interface for AV
      vendors to hook without trying to do binary patching on the syscall table,
      LSM, and everywhere else they do their things today.  With this patch series
      fanotify can be implemented in less than 1200 lines of easy to review code.
      Almost all of which is the socket based user interface.
      
      This patch series builds fsnotify to the point that it can implement
      dnotify and inotify_user.  Patches exist and will be sent soon after
      acceptance to finish the in kernel inotify conversion (audit) and implement
      fanotify.
      Signed-off-by: default avatarEric Paris <eparis@redhat.com>
      Acked-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@lst.de>
      90586523
    • Alan Cox's avatar
  2. 10 Jun, 2009 6 commits
  3. 09 Jun, 2009 4 commits
    • Jan Kara's avatar
      jbd: fix race in buffer processing in commit code · a61d90d7
      Jan Kara authored
      
      
      In commit code, we scan buffers attached to a transaction.  During this
      scan, we sometimes have to drop j_list_lock and then we recheck whether
      the journal buffer head didn't get freed by journal_try_to_free_buffers().
       But checking for buffer_jbd(bh) isn't enough because a new journal head
      could get attached to our buffer head.  So add a check whether the journal
      head remained the same and whether it's still at the same transaction and
      list.
      
      This is a nasty bug and can cause problems like memory corruption (use after
      free) or trigger various assertions in JBD code (observed).
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Cc: <stable@kernel.org>
      Cc: <linux-ext4@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a61d90d7
    • Ian Kent's avatar
      autofs4: remove hashed check in validate_wait() · 463aea1a
      Ian Kent authored
      
      
      The recent ->lookup() deadlock correction required the directory inode
      mutex to be dropped while waiting for expire completion.  We were
      concerned about side effects from this change and one has been identified.
      
      I saw several error messages.
      
      They cause autofs to become quite confused and don't really point to the
      actual problem.
      
      Things like:
      
      handle_packet_missing_direct:1376: can't find map entry for (43,1827932)
      
      which is usually totally fatal (although in this case it wouldn't be
      except that I treat is as such because it normally is).
      
      do_mount_direct: direct trigger not valid or already mounted
      /test/nested/g3c/s1/ss1
      
      which is recoverable, however if this problem is at play it can cause
      autofs to become quite confused as to the dependencies in the mount tree
      because mount triggers end up mounted multiple times.  It's hard to
      accurately check for this over mounting case and automount shouldn't need
      to if the kernel module is doing its job.
      
      There was one other message, similar in consequence of this last one but I
      can't locate a log example just now.
      
      When checking if a mount has already completed prior to adding a new mount
      request to the wait queue we check if the dentry is hashed and, if so, if
      it is a mount point.  But, if a mount successfully completed while we
      slept on the wait queue mutex the dentry must exist for the mount to have
      completed so the test is not really needed.
      
      Mounts can also be done on top of a global root dentry, so for the above
      case, where a mount request completes and the wait queue entry has already
      been removed, the hashed test returning false can cause an incorrect
      callback to the daemon.  Also, d_mountpoint() is not sufficient to check
      if a mount has completed for the multi-mount case when we don't have a
      real mount at the base of the tree.
      Signed-off-by: default avatarIan Kent <raven@themaw.net>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      463aea1a
    • Li Zefan's avatar
      tracing/events: convert block trace points to TRACE_EVENT() · 55782138
      Li Zefan authored
      
      
      TRACE_EVENT is a more generic way to define tracepoints. Doing so adds
      these new capabilities to this tracepoint:
      
        - zero-copy and per-cpu splice() tracing
        - binary tracing without printf overhead
        - structured logging records exposed under /debug/tracing/events
        - trace events embedded in function tracer output and other plugins
        - user-defined, per tracepoint filter expressions
        ...
      
      Cons:
      
        - no dev_t info for the output of plug, unplug_timer and unplug_io events.
          no dev_t info for getrq and sleeprq events if bio == NULL.
          no dev_t info for rq_abort,...,rq_requeue events if rq->rq_disk == NULL.
      
          This is mainly because we can't get the deivce from a request queue.
          But this may change in the future.
      
        - A packet command is converted to a string in TP_assign, not TP_print.
          While blktrace do the convertion just before output.
      
          Since pc requests should be rather rare, this is not a big issue.
      
        - In blktrace, an event can have 2 different print formats, but a TRACE_EVENT
          has a unique format, which means we have some unused data in a trace entry.
      
          The overhead is minimized by using __dynamic_array() instead of __array().
      
      I've benchmarked the ioctl blktrace vs the splice based TRACE_EVENT tracing:
      
            dd                   dd + ioctl blktrace       dd + TRACE_EVENT (splice)
      1     7.36s, 42.7 MB/s     7.50s, 42.0 MB/s          7.41s, 42.5 MB/s
      2     7.43s, 42.3 MB/s     7.48s, 42.1 MB/s          7.43s, 42.4 MB/s
      3     7.38s, 42.6 MB/s     7.45s, 42.2 MB/s          7.41s, 42.5 MB/s
      
      So the overhead of tracing is very small, and no regression when using
      those trace events vs blktrace.
      
      And the binary output of TRACE_EVENT is much smaller than blktrace:
      
       # ls -l -h
       -rw-r--r-- 1 root root 8.8M 06-09 13:24 sda.blktrace.0
       -rw-r--r-- 1 root root 195K 06-09 13:24 sda.blktrace.1
       -rw-r--r-- 1 root root 2.7M 06-09 13:25 trace_splice.out
      
      Following are some comparisons between TRACE_EVENT and blktrace:
      
      plug:
        kjournald-480   [000]   303.084981: block_plug: [kjournald]
        kjournald-480   [000]   303.084981:   8,0    P   N [kjournald]
      
      unplug_io:
        kblockd/0-118   [000]   300.052973: block_unplug_io: [kblockd/0] 1
        kblockd/0-118   [000]   300.052974:   8,0    U   N [kblockd/0] 1
      
      remap:
        kjournald-480   [000]   303.085042: block_remap: 8,0 W 102736992 + 8 <- (8,8) 33384
        kjournald-480   [000]   303.085043:   8,0    A   W 102736992 + 8 <- (8,8) 33384
      
      bio_backmerge:
        kjournald-480   [000]   303.085086: block_bio_backmerge: 8,0 W 102737032 + 8 [kjournald]
        kjournald-480   [000]   303.085086:   8,0    M   W 102737032 + 8 [kjournald]
      
      getrq:
        kjournald-480   [000]   303.084974: block_getrq: 8,0 W 102736984 + 8 [kjournald]
        kjournald-480   [000]   303.084975:   8,0    G   W 102736984 + 8 [kjournald]
      
        bash-2066  [001]  1072.953770:   8,0    G   N [bash]
        bash-2066  [001]  1072.953773: block_getrq: 0,0 N 0 + 0 [bash]
      
      rq_complete:
        konsole-2065  [001]   300.053184: block_rq_complete: 8,0 W () 103669040 + 16 [0]
        konsole-2065  [001]   300.053191:   8,0    C   W 103669040 + 16 [0]
      
        ksoftirqd/1-7   [001]  1072.953811:   8,0    C   N (5a 00 08 00 00 00 00 00 24 00) [0]
        ksoftirqd/1-7   [001]  1072.953813: block_rq_complete: 0,0 N (5a 00 08 00 00 00 00 00 24 00) 0 + 0 [0]
      
      rq_insert:
        kjournald-480   [000]   303.084985: block_rq_insert: 8,0 W 0 () 102736984 + 8 [kjournald]
        kjournald-480   [000]   303.084986:   8,0    I   W 102736984 + 8 [kjournald]
      
      Changelog from v2 -> v3:
      
      - use the newly introduced __dynamic_array().
      
      Changelog from v1 -> v2:
      
      - use __string() instead of __array() to minimize the memory required
        to store hex dump of rq->cmd().
      
      - support large pc requests.
      
      - add missing blk_fill_rwbs_rq() in block_rq_requeue TRACE_EVENT.
      
      - some cleanups.
      Signed-off-by: default avatarLi Zefan <lizf@cn.fujitsu.com>
      LKML-Reference: <4A2DF669.5070905@cn.fujitsu.com>
      Signed-off-by: default avatarSteven Rostedt <rostedt@goodmis.org>
      55782138
    • Theodore Ts'o's avatar
      ext4: Don't treat a truncation of a zero-length file as replace-via-truncate · 0eab9282
      Theodore Ts'o authored
      
      
      If a non-existent file is opened via O_WRONLY|O_CREAT|O_TRUNC, there's
      no need to treat this as a true file truncation, so we shouldn't
      activate the replace-via-truncate hueristic.
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      0eab9282
  4. 08 Jun, 2009 1 commit
    • Toshiyuki Okajima's avatar
      ext4: fix dx_map_entry to support 256k directory blocks · 9aee2286
      Toshiyuki Okajima authored
      
      
      The dx_map_entry structure doesn't support over 64KB block size by
      current usage of its member("offs"). Because "offs" treats an offset
      of copies of the ext4_dir_entry_2 structure as is. This member size is
      16 bits. But real offset for over 64KB(256KB) block size needs 18
      bits. However, real offset keeps 4 byte boundary, so lower 2 bits is
      not used.
      
      Therefore, we do the following to fix this limitation:
      For "store": 
      	we divide the real offset by 4 and then store this result to "offs" 
      	member.
      For "use":
      	we multiply "offs" member by 4 and then use this result 
      	as real offset.
      Signed-off-by: default avatarToshiyuki Okajima <toshi.okajima@jp.fujitsu.com>
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      9aee2286
  5. 06 Jun, 2009 5 commits
    • Hugh Dickins's avatar
      integrity: fix IMA inode leak · f07502da
      Hugh Dickins authored
      
      
      CONFIG_IMA=y inode activity leaks iint_cache and radix_tree_node objects
      until the system runs out of memory.  Nowhere is calling ima_inode_free()
      a.k.a. ima_iint_delete().  Fix that by calling it from destroy_inode().
      Signed-off-by: default avatarHugh Dickins <hugh.dickins@tiscali.co.uk>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f07502da
    • Steve French's avatar
      [CIFS] Add mention of new mount parm (forceuid) to cifs readme · f0472d0e
      Steve French authored
      
      
      Also update fs/cifs/CHANGES
      Signed-off-by: default avatarSteve French <sfrench@us.ibm.com>
      f0472d0e
    • Jeff Layton's avatar
      cifs: make overriding of ownership conditional on new mount options · 4ae1507f
      Jeff Layton authored
      
      
      We have a bit of a problem with the uid= option. The basic issue is that
      it means too many things and has too many side-effects.
      
      It's possible to allow an unprivileged user to mount a filesystem if the
      user owns the mountpoint, /bin/mount is setuid root, and the mount is
      set up in /etc/fstab with the "user" option.
      
      When doing this though, /bin/mount automatically adds the "uid=" and
      "gid=" options to the share. This is fortunate since the correct uid=
      option is needed in order to tell the upcall what user's credcache to
      use when generating the SPNEGO blob.
      
      On a mount without unix extensions this is fine -- you generally will
      want the files to be owned by the "owner" of the mount. The problem
      comes in on a mount with unix extensions. With those enabled, the
      uid/gid options cause the ownership of files to be overriden even though
      the server is sending along the ownership info.
      
      This means that it's not possible to have a mount by an unprivileged
      user that shows the server's file ownership info. The result is also
      inode permissions that have no reflection at all on the server. You
      simply cannot separate ownership from the mode in this fashion.
      
      This behavior also makes MultiuserMount option less usable. Once you
      pass in the uid= option for a mount, then you can't use unix ownership
      info and allow someone to share the mount.
      
      While I'm not thrilled with it, the only solution I can see is to stop
      making uid=/gid= force the overriding of ownership on mounts, and to add
      new mount options that turn this behavior on.
      Signed-off-by: default avatarJeff Layton <jlayton@redhat.com>
      Signed-off-by: default avatarSteve French <sfrench@us.ibm.com>
      4ae1507f
    • Al Viro's avatar
      ext3/4 with synchronous writes gets wedged by Postfix · 72a43d63
      Al Viro authored
      
      
      OK, that's probably the easiest way to do that, as much as I don't like it...
      Since iget() et.al. will not accept I_FREEING (will wait to go away
      and restart), and since we'd better have serialization between new/free
      on fs data structures anyway, we can afford simply skipping I_FREEING
      et.al. in insert_inode_locked().
      
      We do that from new_inode, so it won't race with free_inode in any interesting
      ways and it won't race with iget (of any origin; nfsd or in case of fs
      corruption a lookup) since both still will wait for I_LOCK.
      Reviewed-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      Acked-by: default avatarJan Kara <jack@suse.cz>
      Tested-by: default avatarDavid Watson <dbwatson@ukfsn.org>
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      72a43d63
    • Theodore Ts'o's avatar
      Fix nobh_truncate_page() to not pass stack garbage to get_block() · 460bcf57
      Theodore Ts'o authored
      
      
      The nobh_truncate_page() function is used by ext2, exofs, and jfs.  Of
      these three, only ext2 and jfs's get_block() function pays attention
      to bh->b_size --- which is normally always the filesystem blocksize
      except when the get_block() function is called by either
      mpage_readpage(), mpage_readpages(), or the direct I/O routines in
      fs/direct_io.c.
      
      Unfortunately, nobh_truncate_page() does not initialize map_bh before
      calling the filesystem-supplied get_block() function.  So ext2 and jfs
      will try to calculate the number of blocks to map by taking stack
      garbage and shifting it left by inode->i_blkbits.  This should be
      *mostly* harmless (except the filesystem will do some unnneeded work)
      unless the stack garbage is less than filesystem's blocksize, in which
      case maxblocks will be zero, and the attempt to find out whether or
      not the filesystem has a hole at a given logical block will fail, and
      the page cache entry might not get zero'ed out.
      
      Also if the stack garbage in in map_bh->state happens to have the
      BH_Mapped bit set, there could be an attempt to call readpage() on a
      non-existent page, which could cause nobh_truncate_page() to return an
      error when it should not.
      
      Fix this by initializing map_bh->state and map_bh->size.
      
      Fortunately, it's probably fairly unlikely that ext2 and jfs users
      mount with nobh these days.
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      Cc: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
      Cc: linux-fsdevel@vger.kernel.org
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      460bcf57
  6. 05 Jun, 2009 3 commits
  7. 04 Jun, 2009 1 commit
  8. 09 Jun, 2009 1 commit
    • Jan Kara's avatar
      ext4: Get rid of EXTEND_DISKSIZE flag of ext4_get_blocks_handle() · 03f5d8bc
      Jan Kara authored
      
      
      Get rid of EXTEND_DISKSIZE flag of ext4_get_blocks_handle(). This
      seems to be a relict from some old days and setting disksize in this
      function does not make much sense.  Currently it was set only by
      ext4_getblk().  Since the parameter has some effect only if create ==
      1, it is easy to check by grepping through the sources that the three
      callers which end up calling ext4_getblk() with create == 1
      (ext4_append, ext4_quota_write, ext4_mkdir) do the right thing and set
      disksize themselves.
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      03f5d8bc
  9. 04 Jun, 2009 3 commits
    • Jens Axboe's avatar
      Revert "block: implement blkdev_readpages" · 172124e2
      Jens Axboe authored
      This reverts commit db2dbb12
      
      .
      
      It apparently causes problems with partition table read-ahead
      on archs with large page sizes. Until that problem is diagnosed
      further, just drop the readpages support on block devices.
      Signed-off-by: default avatarJens Axboe <jens.axboe@oracle.com>
      172124e2
    • Chris Mason's avatar
      Btrfs: Fix oops and use after free during space balancing · 44fb5511
      Chris Mason authored
      
      
      The btrfs allocator uses list_for_each to walk the available block
      groups when searching for free blocks.  It starts off with a hint
      to help find the best block group for a given allocation.
      
      The hint is resolved into a block group, but we don't properly check
      to make sure the block group we find isn't in the middle of being
      freed due to filesystem shrinking or balancing.  If it is being
      freed, the list pointers in it are bogus and can't be trusted.  But,
      the code happily goes along and uses them in the list_for_each loop,
      leading to all kinds of fun.
      
      The fix used here is to check to make sure the block group we find really
      is on the list before we use it.  list_del_init is used when removing
      it from the list, so we can do a proper check.
      
      The allocation clustering code has a similar bug where it will trust
      the block group in the current free space cluster.  If our allocation
      flags have changed (going from single spindle dup to raid1 for example)
      because the drives in the FS have changed, we're not allowed to use
      the old block group any more.
      
      The fix used here is to check the current cluster against the
      current allocation flags.
      Signed-off-by: default avatarChris Mason <chris.mason@oracle.com>
      44fb5511
    • Yan Zheng's avatar
      Btrfs: set device->total_disk_bytes when adding new device · 2cc3c559
      Yan Zheng authored
      
      
      It was not being properly initialized, and so the size saved to
      disk was not correct.
      Signed-off-by: default avatarChris Mason <chris.mason@oracle.com>
      2cc3c559
  10. 03 Jun, 2009 1 commit
  11. 09 Jun, 2009 1 commit
  12. 03 Jun, 2009 1 commit
  13. 02 Jun, 2009 5 commits
  14. 30 May, 2009 1 commit