1. 08 Jun, 2011 7 commits
    • writeback: split inode_wb_list_lock into bdi_writeback.list_lock · f758eeab
      Christoph Hellwig authored
      Split the global inode_wb_list_lock into a per-bdi_writeback list_lock,
      as it's currently the most contended lock in the system for metadata
      heavy workloads.  It won't help for single-filesystem workloads for
      which we'll need the I/O-less balance_dirty_pages, but at least we
      can dedicate a cpu to spinning on each bdi now for larger systems.
      
      Based on earlier patches from Nick Piggin and Dave Chinner.
      
      It reduces lock contention to about 1/4 in this test case:
      10 HDD JBOD, 100 dd on each disk, XFS, 6GB ram
      
      lock_stat version 0.3
      -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
                                    class name    con-bounces    contentions   waittime-min   waittime-max waittime-total    acq-bounces   acquisitions   holdtime-min   holdtime-max holdtime-total
      -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      vanilla 2.6.39-rc3:
                            inode_wb_list_lock:         42590          44433           0.12         147.74      144127.35         252274         886792           0.08         121.34      917211.23
                            ------------------
                            inode_wb_list_lock              2          [<ffffffff81165da5>] bdev_inode_switch_bdi+0x29/0x85
                            inode_wb_list_lock             34          [<ffffffff8115bd0b>] inode_wb_list_del+0x22/0x49
                            inode_wb_list_lock          12893          [<ffffffff8115bb53>] __mark_inode_dirty+0x170/0x1d0
                            inode_wb_list_lock          10702          [<ffffffff8115afef>] writeback_single_inode+0x16d/0x20a
                            ------------------
                            inode_wb_list_lock              2          [<ffffffff81165da5>] bdev_inode_switch_bdi+0x29/0x85
                            inode_wb_list_lock             19          [<ffffffff8115bd0b>] inode_wb_list_del+0x22/0x49
                            inode_wb_list_lock           5550          [<ffffffff8115bb53>] __mark_inode_dirty+0x170/0x1d0
                            inode_wb_list_lock           8511          [<ffffffff8115b4ad>] writeback_sb_inodes+0x10f/0x157
      
      2.6.39-rc3 + patch:
                      &(&wb->list_lock)->rlock:         11383          11657           0.14         151.69       40429.51          90825         527918           0.11         145.90      556843.37
                      ------------------------
                      &(&wb->list_lock)->rlock             10          [<ffffffff8115b189>] inode_wb_list_del+0x5f/0x86
                      &(&wb->list_lock)->rlock           1493          [<ffffffff8115b1ed>] writeback_inodes_wb+0x3d/0x150
                      &(&wb->list_lock)->rlock           3652          [<ffffffff8115a8e9>] writeback_sb_inodes+0x123/0x16f
                      &(&wb->list_lock)->rlock           1412          [<ffffffff8115a38e>] writeback_single_inode+0x17f/0x223
                      ------------------------
                      &(&wb->list_lock)->rlock              3          [<ffffffff8110b5af>] bdi_lock_two+0x46/0x4b
                      &(&wb->list_lock)->rlock              6          [<ffffffff8115b189>] inode_wb_list_del+0x5f/0x86
                      &(&wb->list_lock)->rlock           2061          [<ffffffff8115af97>] __mark_inode_dirty+0x173/0x1cf
                      &(&wb->list_lock)->rlock           2629          [<ffffffff8115a8e9>] writeback_sb_inodes+0x123/0x16f
      
      hughd@google.com: fix recursive lock when bdi_lock_two() is called with new the same as old
      akpm@linux-foundation.org: cleanup bdev_inode_switch_bdi() comment
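
The recursive-lock note above concerns taking two per-bdi list locks at once. A minimal user-space sketch of the idea follows, assuming that the two locks are always taken in one fixed order and that the "new is the same as old" case takes the lock only once. The helper names `lock_two` and `unlock_all` are hypothetical, and ordering by `id()` merely stands in for whatever ordering the kernel helper uses.

```python
import threading

def lock_two(a, b):
    """Acquire two locks in a fixed global order; take it once if a is b."""
    if a is b:
        a.acquire()
        return [a]
    # Sorting by id() is an assumption standing in for a stable lock order.
    first, second = sorted((a, b), key=id)
    first.acquire()
    second.acquire()
    return [first, second]

def unlock_all(held):
    """Release the held locks in reverse acquisition order."""
    for lock in reversed(held):
        lock.release()

old_lock = threading.Lock()
new_lock = threading.Lock()

held = lock_two(old_lock, new_lock)
unlock_all(held)

# Without the identity check, this call would block forever on the second acquire.
held_same = lock_two(old_lock, old_lock)
unlock_all(held_same)
```

A fixed acquisition order prevents two callers from deadlocking on each other, and the identity check covers the case where both arguments name the same lock.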
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
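
The "1/4" figure in this commit can be checked against its own lock_stat rows. The short calculation below uses only the contentions and acquisitions columns copied from that table; the function names are illustrative.

```python
# Figures copied from the lock_stat rows in the commit message.
vanilla = {"contentions": 44433, "acquisitions": 886792}
patched = {"contentions": 11657, "acquisitions": 527918}

def contention_ratio(before, after):
    """Return the patched contention count as a fraction of the vanilla count."""
    return after["contentions"] / before["contentions"]

def contention_rate(stats):
    """Return contentions per lock acquisition."""
    return stats["contentions"] / stats["acquisitions"]

ratio = contention_ratio(vanilla, patched)
print(f"contentions: {ratio:.3f} of vanilla")
print(f"per acquisition: {contention_rate(vanilla):.4f} -> {contention_rate(patched):.4f}")
```

The per-acquisition rate is included because the patched run also performed fewer acquisitions, so the raw contention count alone does not tell the whole story.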
    • writeback: refill b_io iff empty · 424b351f
      Wu Fengguang authored
      There is no point in carrying different refill policies for for_kupdate
      and other types of work. Use a consistent "refill b_io iff empty" policy,
      which guarantees fairness in an easy-to-understand way.
      
      A b_io refill sets up a _fixed_ work set of all currently eligible
      inodes and starts a new walk through b_io. "Fixed" means that no new
      inodes are added to the work set during the walk. Only when a complete
      walk over b_io is done are the inodes that are eligible at that time
      enqueued and the walk started over.
      
      This procedure provides fairness among the inodes because it guarantees
      that each inode is synced once and only once in each round, so no inode
      is starved.
      
      This change relies on wb_writeback() to keep retrying as long as we made
      some progress on cleaning pages and/or inodes. Without that ability,
      the old logic for background work relied on aggressively queuing all
      eligible inodes into b_io every time, which is not a guarantee.
      
      The test script below now completes slightly faster:
      
                      2.6.39-rc3   2.6.39-rc3-dyn-expire+
      ------------------------------------------------
      all elapsed     256.043      252.367
      stddev           24.381       12.530
      
      tar elapsed      30.097       28.808
      dd  elapsed      13.214       11.782
      
      	#!/bin/zsh
      
      	cp /c/linux-2.6.38.3.tar.bz2 /dev/shm/
      
      	umount /dev/sda7
      	mkfs.xfs -f /dev/sda7
      	mount /dev/sda7 /fs
      
      	echo 3 > /proc/sys/vm/drop_caches
      
      	tic=$(cat /proc/uptime|cut -d' ' -f2)
      
      	cd /fs
      	time tar jxf /dev/shm/linux-2.6.38.3.tar.bz2 &
      	time dd if=/dev/zero of=/fs/zero bs=1M count=1000 &
      
      	wait
      	sync
      	tac=$(cat /proc/uptime|cut -d' ' -f2)
      	echo elapsed: $((tac - tic))
      
      It maintains roughly the same small vs. large file writeout shares, and
      offers large files better chances to be written in nice 4M chunks.
      
      Detailed analysis from Dave Chinner:
      
      Let's say we have lots of inodes with 100 dirty pages being created,
      and one large writeback going on. We expire 8 new inodes for every
      1024 pages we write back.
      
      With the old code, we do:
      
      	b_more_io (large inode) -> b_io (1l)
      	8 newly expired inodes -> b_io (1l, 8s)
      
      	writeback  large inode 1024 pages -> b_more_io
      
      	b_more_io (large inode) -> b_io (8s, 1l)
      	8 newly expired inodes -> b_io (8s, 1l, 8s)
      
      	writeback  8 small inodes 800 pages
      		   1 large inode 224 pages -> b_more_io
      
      	b_more_io (large inode) -> b_io (8s, 1l)
      	8 newly expired inodes -> b_io (8s, 1l, 8s)
      	.....
      
      Your new code:
      
      	b_more_io (large inode) -> b_io (1l)
      	8 newly expired inodes -> b_io (1l, 8s)
      
      	writeback  large inode 1024 pages -> b_more_io
      	(b_io == 8s)
      	writeback  8 small inodes 800 pages
      
      	b_io empty: (1800 pages written)
      		b_more_io (large inode) -> b_io (1l)
      		14 newly expired inodes -> b_io (1l, 14s)
      
      	writeback  large inode 1024 pages -> b_more_io
      	(b_io == 14s)
      	writeback  10 small inodes 1000 pages
      		   1 small inode 24 pages -> b_more_io (1l, 1s(24))
      	writeback  5 small inodes 500 pages
      	b_io empty: (2548 pages written)
      		b_more_io (large inode) -> b_io (1l, 1s(24))
      		20 newly expired inodes -> b_io (1l, 1s(24), 20s)
      	......
      
      Rough progression of pages written at b_io refill:
      
      Old code:
      
      	total	large file	% of writeback
      	1024	224		21.9% (fixed)
      
      New code:
      	total	large file	% of writeback
      	1800	1024		~55%
      	2550	1024		~40%
      	3050	1024		~33%
      	3500	1024		~29%
      	3950	1024		~26%
      	4250	1024		~24%
      	4500	1024		~22.7%
      	4700	1024		~21.7%
      	4800	1024		~21.3%
      	4800	1024		~21.3%
      	(pretty much steady state from here)
      
      Ok, so the steady state is reached with a similar percentage of
      writeback to the large file as the existing code. Ok, that's good,
      but some evidence that it doesn't change the share of writeback
      to the large file should be in the commit message ;)
      
      The other advantage to this is that we always write 1024 page chunks
      to the large file, rather than smaller "whatever remains" chunks.
      
      CC: Jan Kara <jack@suse.cz>
      Acked-by: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
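
The steady state in the analysis above can be approximated with a simplified model. The sketch assumes one large inode that gets 1024 pages per round, small inodes of 100 pages each, and 8 small inodes expiring per 1024 pages written. It ignores the partially written small inode and allows fractional inode counts, so it will not reproduce the table row by row; it only illustrates where the share settles.

```python
LARGE_CHUNK = 1024      # pages written to the large inode per round
SMALL_PAGES = 100       # dirty pages per small inode
EXPIRE_PER_CHUNK = 8    # small inodes expiring per 1024 pages written

def large_file_shares(rounds):
    """Return the large file's share of writeback for each b_io round."""
    shares = []
    small = float(EXPIRE_PER_CHUNK)   # small inodes queued in the first round
    for _ in range(rounds):
        total = LARGE_CHUNK + SMALL_PAGES * small
        shares.append(LARGE_CHUNK / total)
        # Inodes that expired during this round form the next round's work set.
        small = total * EXPIRE_PER_CHUNK / 1024
    return shares

shares = large_file_shares(60)
print([round(s, 3) for s in shares[:4]], "...", round(shares[-1], 4))
```

Under these assumptions the share converges to 224/1024, the same fraction the old code gives the large file, which is consistent with the conclusion that the policy change keeps the small vs. large split roughly unchanged.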
    • writeback: the kupdate expire timestamp should be a moving target · ba9aa839
      Wu Fengguang authored
      Dynamically compute the dirty expire timestamp at queue_io() time.
      
      writeback_control.older_than_this used to be determined at entrance to
      the kupdate writeback work. This _static_ timestamp may go stale if the
      kupdate work runs on and on. The flusher may then get stuck with some old
      busy inodes, never considering newly expired inodes thereafter.
      
      This has two possible problems:
      
      - It is unfair for a large dirty inode to delay (for a long time) the
        writeback of small dirty inodes.
      
      - As time goes by, the large and busy dirty inode may contain only
        _freshly_ dirtied pages. Ignoring newly expired dirty inodes risks
        delaying the expired dirty pages to the end of LRU lists, triggering
        the evil pageout(). Nevertheless this patch merely addresses part
        of the problem.
      
      v2: keep policy changes inside wb_writeback() and keep the
      wbc.older_than_this visibility as suggested by Dave.
      
      CC: Dave Chinner <david@fromorbit.com>
      Acked-by: Jan Kara <jack@suse.cz>
      Acked-by: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Itaru Kitayama <kitayama@cl.bb4u.ne.jp>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
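
The difference between a static and a moving expire cutoff can be shown with a small model. It assumes a 30 second expire interval and a busy inode that stays dirty throughout; `run_kupdate` and its parameters are hypothetical names, and only `queue_io` and `older_than_this` echo the commit text.

```python
EXPIRE_INTERVAL = 30  # seconds; hypothetical stand-in for the dirty expire interval

def queue_io(dirtied_when, older_than_this):
    """Return names of inodes dirtied at or before the cutoff."""
    return sorted(name for name, t in dirtied_when.items() if t <= older_than_this)

def run_kupdate(dirtied_when, start, rounds, step, dynamic):
    """Collect what each queue_io() call would pick up during one long work item."""
    static_cutoff = start - EXPIRE_INTERVAL
    picked = []
    for i in range(rounds):
        now = start + i * step
        # Dynamic: recompute the cutoff at every queue_io() call.
        cutoff = now - EXPIRE_INTERVAL if dynamic else static_cutoff
        picked.append(queue_io(dirtied_when, cutoff))
    return picked

inodes = {"big": 50, "small": 90}  # dirtied_when timestamps in seconds
static_rounds = run_kupdate(inodes, start=100, rounds=3, step=15, dynamic=False)
moving_rounds = run_kupdate(inodes, start=100, rounds=3, step=15, dynamic=True)
```

With the static cutoff, the inode dirtied at t=90 is never considered while the work item keeps running; with the moving cutoff it becomes eligible once it has aged past the interval.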
    • writeback: try more writeback as long as something was written · e6fb6da2
      Wu Fengguang authored
      writeback_inodes_wb()/__writeback_inodes_sb() are not aggressive in that
      they may populate only a subset of the eligible inodes into b_io at
      entrance time. When the queued inodes are all synced, they just
      return, possibly with all queued inode pages written but
      wbc.nr_to_write still > 0.
      
      For kupdate and background writeback, there may be more eligible inodes
      sitting in b_dirty when the current set of b_io inodes are completed. So
      it is necessary to try another round of writeback as long as we made some
      progress in this round. When there are no more eligible inodes, no more
      inodes will be enqueued in queue_io(), hence nothing could/will be
      synced and we may safely bail.
      
      For example, imagine 100 inodes
      
              i0, i1, i2, ..., i90, i91, ..., i99
      
      At queue_io() time, i90-i99 happen to be expired and moved to s_io for
      IO. When finished successfully, if their total size is less than
      MAX_WRITEBACK_PAGES, nr_to_write will be > 0. Then wb_writeback() will
      quit the background work (w/o this patch) while it's still over
      background threshold. This will be a fairly normal/frequent case I guess.
      
      Now that we do tagged sync and update inode->dirtied_when after the sync,
      this change won't livelock sync(1).  I actually tried to write 1 page
      per 1ms with this command
      
      	write-and-fsync -n10000 -S 1000 -c 4096 /fs/test
      
      and do sync(1) at the same time. The sync completes quickly on ext4,
      xfs, btrfs.
      Acked-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
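
The retry rule described in this commit can be sketched as a loop that stops only when a round writes nothing. The model below is a simplification: `batches` is a hypothetical list giving, for each queue_io() call, the page counts of the inodes eligible at that moment, and pages beyond the per-round budget are simply ignored.

```python
MAX_WRITEBACK_PAGES = 1024

def wb_writeback(batches, retry_on_progress):
    """Write successive batches of eligible inodes; return total pages written."""
    total = 0
    for batch in batches:
        nr_to_write = MAX_WRITEBACK_PAGES
        wrote = 0
        for pages in batch:
            chunk = min(pages, nr_to_write)
            nr_to_write -= chunk
            wrote += chunk
        total += wrote
        if wrote == 0:
            break                      # no progress at all: safe to bail
        if nr_to_write > 0 and not retry_on_progress:
            break                      # old behaviour: leftover budget ends the work
    return total

# Ten small inodes are eligible at first, ninety more by the next queue_io().
batches = [[10] * 10, [10] * 90, []]
old_total = wb_writeback(batches, retry_on_progress=False)
new_total = wb_writeback(batches, retry_on_progress=True)
```

In the old behaviour the work ends after the first ten inodes because the budget was not used up; with retry-on-progress the loop continues until a round has nothing left to write.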
    • writeback: introduce writeback_control.inodes_written · cb9bd115
      Wu Fengguang authored
      The flusher works on dirty inodes in batches, and may quit prematurely
      if the batch of inodes happens to be metadata-only dirtied: in this case
      wbc->nr_to_write won't be decreased at all, which stands for "no pages
      written" but is also misinterpreted as "no progress".
      
      So introduce writeback_control.inodes_written to count the inodes that
      get cleaned from the VFS point of view. A non-zero value means there is
      some progress on writeback, in which case more writeback can be tried.
      Acked-by: Jan Kara <jack@suse.cz>
      Acked-by: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
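
A small model shows why a page budget alone misreports progress for metadata-only inodes. The field names `nr_to_write` and `inodes_written` come from the commit text; the class and the helpers `write_inodes` and `made_progress` are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class WritebackControl:
    nr_to_write: int
    inodes_written: int = 0

def write_inodes(wbc, inodes):
    """Clean each inode; only data pages consume the nr_to_write budget."""
    for data_pages in inodes:
        wbc.nr_to_write -= data_pages
        wbc.inodes_written += 1        # inode became clean from the VFS point of view

def made_progress(wbc, budget, count_inodes):
    """Return True if this round wrote pages or (optionally) cleaned inodes."""
    wrote_pages = wbc.nr_to_write < budget
    return wrote_pages or (count_inodes and wbc.inodes_written > 0)

budget = 1024
wbc = WritebackControl(nr_to_write=budget)
write_inodes(wbc, [0, 0, 0])           # three metadata-only dirty inodes
```

Judged by pages alone, the round above looks like "no progress" even though three inodes were cleaned; counting inodes as well lets the caller decide to try another round.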
    • writeback: update dirtied_when for synced inode to prevent livelock · 94c3dcbb
      Wu Fengguang authored
      Explicitly update .dirtied_when on synced inodes, so that they are no
      longer considered for writeback in the next round.
      
      It can prevent both of the following livelock schemes:
      
      - while true; do echo data >> f; done
      - while true; do touch f;        done (in theory)
      
      The exact livelock condition is, during sync(1):
      
      (1) no new inodes are dirtied
      (2) an inode being actively dirtied
      
      On (2), the inode will be tagged and synced with .nr_to_write=LONG_MAX.
      When finished, it will be redirty_tail()ed because it's still dirty
      and (.nr_to_write > 0). redirty_tail() won't update its ->dirtied_when
      on condition (1). The sync work will then revisit it on the next
      queue_io() and find it eligible again because its old ->dirtied_when
      predates the sync work start time.
      
      We'll do more aggressive "keep writeback as long as we wrote something"
      logic in wb_writeback(). The "use LONG_MAX .nr_to_write" trick in commit
      b9543dac ("writeback: avoid livelocking WB_SYNC_ALL writeback") will
      no longer be enough to stop sync livelock.
      Reviewed-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
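
The livelock condition described in this commit can be reduced to a toy loop. It assumes a single inode that is redirtied continuously and a sync that only considers inodes whose dirtied_when predates the sync start; the timestamps and the `max_rounds` cap are illustrative, and the cap exists only so the unpatched case terminates.

```python
def sync_rounds(update_dirtied_when, max_rounds=10):
    """Count queue_io() rounds a sync needs while one inode is redirtied nonstop."""
    sync_start = 100
    dirtied_when = 50                  # inode was first dirtied before the sync began
    now = sync_start
    rounds = 0
    while rounds < max_rounds:
        if dirtied_when > sync_start:
            break                      # inode no longer predates the sync: done
        rounds += 1                    # inode is queued and written
        now += 1
        # The writer redirties the inode immediately, so it is still dirty.
        if update_dirtied_when:
            dirtied_when = now         # patched: stamp the inode after syncing it
    return rounds

patched = sync_rounds(update_dirtied_when=True)
unpatched = sync_rounds(update_dirtied_when=False)
```

With the timestamp update the inode is visited once and then falls outside the sync's window; without it the loop only stops because of the artificial cap.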
    • writeback: introduce .tagged_writepages for the WB_SYNC_NONE sync stage · 6e6938b6
      Wu Fengguang authored
      sync(2) is performed in two stages: the WB_SYNC_NONE sync and the
      WB_SYNC_ALL sync. Identify the first stage with .tagged_writepages and
      do livelock prevention for it, too.
      
      Jan's commit f446daae ("mm: implement writeback livelock avoidance
      using page tagging") is a partial fix in that it only fixed the
      WB_SYNC_ALL phase livelock.
      
      Although ext4 is tested to no longer livelock with commit f446daae,
      that may be due to some "redirty_tail() after pages_skipped" effect, which
      is by no means a guarantee for _all_ the file systems.
      
      Note that writeback_inodes_sb() is called not only by sync(); all callers
      are treated the same because the other callers also need livelock prevention.
      
      Impact:  It changes the order in which pages/inodes are synced to disk.
      Now in the WB_SYNC_NONE stage, it won't proceed to write the next inode
      until finished with the current inode.
      Acked-by: Jan Kara <jack@suse.cz>
      CC: Dave Chinner <david@fromorbit.com>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
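
The effect of tagging can be illustrated with a model of one writeback pass over an inode that a writer keeps dirtying. This is only a sketch of the general "snapshot the work set first" idea: pages are plain integers, `writeback_pass` is a hypothetical name, and the cap of 1000 writes exists solely so the untagged case stops.

```python
def writeback_pass(dirty, redirty_stream, tagged):
    """Write one inode's dirty pages while a writer keeps dirtying new ones.

    Returns the number of pages written before the pass stops; the pass is
    capped at 1000 writes so the untagged case terminates in this model.
    """
    dirty = set(dirty)
    # Tagged: work on a fixed snapshot. Untagged: chase the live dirty set.
    targets = set(dirty) if tagged else dirty
    written = 0
    stream = iter(redirty_stream)
    while targets and written < 1000:
        page = min(targets)
        targets.discard(page)
        dirty.discard(page)
        written += 1
        new_page = next(stream, None)
        if new_page is not None:
            dirty.add(new_page)        # part of the work set only when untagged
    return written

tagged_written = writeback_pass(range(4), range(10, 10**6), tagged=True)
untagged_written = writeback_pass(range(4), range(10, 10**6), tagged=False)
```

The tagged pass finishes after the four pages that were dirty when it started, while the untagged pass keeps picking up freshly dirtied pages until the cap.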
  2. 04 Jun, 2011 11 commits
  3. 03 Jun, 2011 6 commits
    • UBIFS: fix-up free space earlier · 09801194
      Ben Gardiner authored
      The free space fixup is currently initiated during mount after the call to
      ubifs_write_master() which results in a write to PEBs; this has been observed
      with the patch 'assert no fixup when writing a node' applied:
      
      Move the free space fixup on mount to before the calls to
      ubifs_recover_inl_heads() and ubifs_write_master(). This results in no
      assertions with the previously mentioned patch applied.
      
      Artem: tweaked the patch a bit
      Signed-off-by: Ben Gardiner <bengardiner@nanometrics>
      Reviewed-by: Matthew L. Creech <mlcreech@gmail.com>
      Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
    • UBIFS: intialize LPT earlier · 781c5717
      Ben Gardiner authored
      The current 'mount_ubifs()' implementation does not initialize the LPT until
      the master node is marked dirty. Move the LPT initialization to before marking
      the master node dirty. This is a preparation for the next patch which will move
      the free-space-fixup check to before marking the master node dirty, because we
      have to fix-up the free space before doing any writes.
      
      Artem: massaged the patch and commit message.
      Signed-off-by: Ben Gardiner <bengardiner@nanometrics.ca>
      Reviewed-by: Matthew L. Creech <mlcreech@gmail.com>
      Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
    • UBIFS: assert no fixup when writing a node · 4f1ab9b0
      Ben Gardiner authored
      The current free space fixup can result in some writing to the UBI volume
      when the space_fixup flag is set.
      
      To catch instances where UBIFS is writing to the NAND while the space_fixup
      flag is set, add an assert to ubifs_write_node().
      
      Artem: tweaked the patch, added similar assertion to the write buffer
             write path.
      Signed-off-by: Ben Gardiner <bengardiner@nanometrics.ca>
      Reviewed-by: Matthew L. Creech <mlcreech@gmail.com>
      Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
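
The three related UBIFS commits above (the write-path assertion, earlier LPT initialization, and earlier free space fixup) share one invariant: nothing may be written while the space_fixup flag is set. The toy model below illustrates that invariant and why the mount order matters. `Volume`, `mount`, and the log strings are hypothetical and do not mirror the real UBIFS structures.

```python
class Volume:
    """Toy stand-in for a mounted UBIFS volume; all names are hypothetical."""

    def __init__(self, space_fixup):
        self.space_fixup = space_fixup
        self.log = []

    def fixup_free_space(self):
        self.space_fixup = False
        self.log.append("fixup")

    def write_node(self, name):
        # Mirrors the idea of the added assertion: no writes while the flag is set.
        if self.space_fixup:
            raise AssertionError("write before free space fixup")
        self.log.append(f"write {name}")

def mount(vol, fixup_first):
    """Run a two-step mount, with the fixup either before or after the first write."""
    if fixup_first and vol.space_fixup:
        vol.fixup_free_space()
    vol.write_node("master")
    if vol.space_fixup:
        vol.fixup_free_space()
    return vol.log
```

Calling `mount(Volume(True), fixup_first=False)` trips the check, which models the ordering problem the assertion was added to catch; with `fixup_first=True` the fixup runs before the first write.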
    • UBIFS: fix clean znode counter corruption in error cases · 83707237
      Artem Bityutskiy authored
      UBIFS maintains per-filesystem and global clean znode counters
      ('c->clean_zn_cnt' and 'ubifs_clean_zn_cnt'). It is important to maintain
      correct values there since the shrinker relies on 'ubifs_clean_zn_cnt'.
      
      However, in case of failures during commit the counters were corrupted. E.g.,
      if a failure happens in the middle of 'write_index()', then some nodes in the
      commit list ('c->cnext') are marked as clean, and some are marked as dirty. And
      'ubifs_destroy_tnc_subtree()' does not return the correct count, and we
      end up with non-zero 'c->clean_zn_cnt' when unmounting. This means that if we
      have 2 file-systems and one of them fails, and we unmount it,
      'ubifs_clean_zn_cnt' stays incorrect and confuses the shrinker.
      Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
    • UBIFS: fix memory leak on error path · 812eb258
      Artem Bityutskiy authored
      UBIFS leaks memory on error path in 'ubifs_jnl_update()' in case of write
      failure because it forgets to free the 'struct ubifs_dent_node *dent' object.
      Although the object is small, the alignment can make it large - e.g., 2KiB
      if the min. I/O unit is 2KiB.
      Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
      Cc: stable@kernel.org
    • UBIFS: fix shrinker object count reports · cf610bf4
      Artem Bityutskiy authored
      Sometimes the VM asks the shrinker to return the number of objects it can
      shrink, and we return ubifs_clean_zn_cnt in that case. However, it is
      possible that this counter is negative for a short period of time, due to
      the way the UBIFS TNC code updates it. I can sometimes observe the
      following warning:
      
      shrink_slab: ubifs_shrinker+0x0/0x2b7 [ubifs] negative objects to delete nr=-8541616642706119788
      
      This patch makes sure UBIFS never returns a negative count of objects.
      Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
      Cc: stable@kernel.org
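
One way to keep a transiently negative counter from reaching the caller is to clamp the reported value. The sketch below shows that idea only; the function name is hypothetical and the floor of zero is an assumption, since the commit message does not say which non-negative value the patch reports.

```python
def shrinker_count(clean_zn_cnt):
    """Report how many objects could be freed, never a negative number."""
    # Assumption: zero is used as the floor; the real patch may pick another value.
    return max(clean_zn_cnt, 0)

reported = [shrinker_count(n) for n in (-3, 0, 42)]
```

The counter itself is left untouched, so it can recover on its own once the in-flight updates complete.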
  4. 01 Jun, 2011 5 commits
    • UBIFS: fix recovery broken by the previous recovery fix · da8b94ea
      Artem Bityutskiy authored
      Unfortunately, the recovery fix d1606a59b6be4ea392eabd40d1250aa1eeb19efb
      (UBIFS: fix extremely rare mount failure) broke recovery. That commit made
      UBIFS drop the last min. I/O unit in all journal heads, but this is needed only
      for the GC head, and it does not work for non-GC heads. For example,
      suppose we have min. I/O units A and B, and A contains a valid node X, which
      was fsynced, and then a group of nodes Y which spans the rest of A and B. In
      this case we'll drop not only Y, but also X, which is obviously incorrect.
      
      This patch fixes the issue and additionally makes recovery drop the last min.
      I/O unit only for the GC head, and leaves things as they have been for ages for
      the other heads - this is safer.
      Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
    • UBIFS: amend ubifs_recover_leb interface · efcfde54
      Artem Bityutskiy authored
      Instead of passing a "grouped" parameter to 'ubifs_recover_leb()' which tells
      whether the nodes are grouped in the LEB to recover, pass the journal head
      number and let 'ubifs_recover_leb()' look at the journal head's 'grouped' flag.
      
      This patch is a preparation for a further fix where we'll need to know the
      journal head number for other purposes.
      Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
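
The interface change can be pictured as replacing a boolean argument with an index into a table of journal heads. The sketch below is illustrative only: the head numbers, the `JournalHead` class, and the dictionary returned by `recover_leb` are assumptions, not the actual UBIFS definitions.

```python
from dataclasses import dataclass

GCHD, BASEHD, DATAHD = 0, 1, 2         # journal head numbers (assumed values)

@dataclass
class JournalHead:
    grouped: bool

# Normal heads receive grouped nodes; the GC head receives ungrouped nodes.
jheads = {
    GCHD: JournalHead(grouped=False),
    BASEHD: JournalHead(grouped=True),
    DATAHD: JournalHead(grouped=True),
}

def recover_leb(lnum, jhead):
    """Return the recovery parameters derived from the journal head number."""
    return {"lnum": lnum, "jhead": jhead, "grouped": jheads[jhead].grouped}
```

Passing the head number keeps the "grouped" decision in one place and leaves room for later head-specific behaviour, such as treating the GC head differently during recovery.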
    • UBIFS: introduce a "grouped" journal head flag · 1a0b0699
      Artem Bityutskiy authored
      Journal heads differ in how UBIFS writes nodes to them. All normal
      journal heads receive grouped nodes, while the GC journal head receives
      ungrouped nodes. This patch adds a 'grouped' flag to 'struct ubifs_jhead' which
      describes this property.
      
      This patch is a preparation for a further recovery fix.
      Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
    • UBIFS: supress false error messages · ab75950b
      Artem Bityutskiy authored
      Commit ab51afe05273741f72383529ef488aa1ea598ec6 was a good clean-up, but
      it introduced a regression - now UBIFS prints scary error messages during
      recovery on all corrupted nodes, even though the corruptions are expected
      (due to a power cut). This patch fixes the issue.
      
      Additionally fix a typo in a comment introduced by the same commit.
      Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
    • block: blkdev_get() should access ->bd_disk only after success · 4c49ff3f
      Tejun Heo authored
      d4dc210f (block: don't block events on excl write for non-optical
      devices) added dereferencing of bdev->bd_disk to test
      GENHD_FL_BLOCK_EVENTS_ON_EXCL_WRITE; however, bdev->bd_disk can be
      %NULL if open failed which can lead to an oops.
      
      Test the flag after testing open was successful, not before.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: David Miller <davem@davemloft.net>
      Tested-by: David Miller <davem@davemloft.net>
      Cc: stable@kernel.org
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
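
The ordering fix in this commit follows a common pattern: check the result of an operation before touching state that only exists on success. The sketch below models that pattern in user space. The classes, the flag's bit value, and the helper names are hypothetical; only the flag name and the `bd_disk` field echo the commit text.

```python
GENHD_FL_BLOCK_EVENTS_ON_EXCL_WRITE = 0x100   # hypothetical bit value

class Disk:
    def __init__(self, flags):
        self.flags = flags

class BlockDevice:
    def __init__(self):
        self.bd_disk = None            # stays None when the open fails

def open_bdev(bdev, disk):
    """Return 0 and attach the disk on success, or a negative errno-style code."""
    if disk is None:
        return -6                      # stand-in for -ENXIO
    bdev.bd_disk = disk
    return 0

def should_block_events(bdev, res, exclusive_write):
    """Check success first, and only then look at bd_disk."""
    if res != 0 or not exclusive_write:
        return False
    return bool(bdev.bd_disk.flags & GENHD_FL_BLOCK_EVENTS_ON_EXCL_WRITE)

failed = BlockDevice()
failed_res = open_bdev(failed, None)
opened = BlockDevice()
opened_res = open_bdev(opened, Disk(GENHD_FL_BLOCK_EVENTS_ON_EXCL_WRITE))
```

Testing the flag before the result would dereference `bd_disk` while it is still `None` on the failed device, which is the user-space analogue of the oops the commit describes.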
  5. 30 May, 2011 3 commits
  6. 29 May, 2011 8 commits