1. 15 Sep, 2018 1 commit
  2. 02 Apr, 2018 1 commit
  3. 01 Feb, 2018 1 commit
    • shidao.ytt's avatar
      mm/fadvise: discard partial page if endbyte is also EOF · a7ab400d
      shidao.ytt authored
      During our recent testing with fadvise(FADV_DONTNEED), we find that if
      given offset/length is not page-aligned, the last page will not be
      discarded.  The tool we use is vmtouch (https://hoytech.com/vmtouch/),
      we map a 10KB-sized file into memory and then try to run this tool to
      evict the whole file mapping, but the last single page always remains
      staying in the memory:
      
      $./vmtouch -e test_10K
                 Files: 1
           Directories: 0
         Evicted Pages: 3 (12K)
               Elapsed: 2.1e-05 seconds
      
      $./vmtouch test_10K
                 Files: 1
           Directories: 0
        Resident Pages: 1/3  4K/12K  33.3%
               Elapsed: 5.5e-05 seconds
      
      However when we test with an older kernel, say 3.10, this problem is
      gone.  So we wonder if this is a regression:
      
      $./vmtouch -e test_10K
                 Files: 1
           Directories: 0
         Evicted Pages: 3 (12K)
               Elapsed: 8.2e-05 seconds
      
      $./vmtouch test_10K
                 Files: 1
           Directories: 0
        Resident Pages: 0/3  0/12K  0%  <-- partial page also discarded
               Elapsed: 5e-05 seconds
      
      After digging a little bit into this problem, we find it seems not a
      regression.  Not discarding partial page is likely to be on purpose
      according to commit 441c228f ("mm: fadvise: document the
      fadvise(FADV_DONTNEED) behaviour for partial pages") written by Mel
      Gorman.  He explained why partial pages should be preserved instead of
      being discarded when using fadvise(FADV_DONTNEED).
      
      However, the interesting part is that the actual code did NOT work as
      the same as it was described, the partial page was still discarded
      anyway, due to a calculation mistake of `end_index' passed to
      invalidate_mapping_pages().  This mistake has not been fixed until
      recently, that's why we fail to reproduce our problem in old kernels.
      The fix is done in commit 18aba41c ("mm/fadvise.c: do not discard
      partial pages with POSIX_FADV_DONTNEED") by Oleg Drokin.
      
      Back to the original testing, our problem becomes that there is a
      special case that, if the page-unaligned `endbyte' is also the end of
      file, it is not necessary at all to preserve the last partial page, as
      we all know no one else will use the rest of it.  It should be safe
      enough if we just discard the whole page.  So we add an EOF check in
      this patch.
      
      We also find a poosbile real world issue in mainline kernel.  Assume
      such scenario: A userspace backup application want to backup a huge
      amount of small files (<4k) at once, the developer might (I guess) want
      to use fadvise(FADV_DONTNEED) to save memory.  However, FADV_DONTNEED
      won't really happen since the only page mapped is a partial page, and
      kernel will preserve it.  Our patch also fixes this problem, since we
      know the endbyte is EOF, so we discard it.
      
      Here is a simple reproducer to reproduce and verify each scenario we
      described above:
      
        test_fadvise.c
        ==============================
        #include <sys/mman.h>
        #include <sys/stat.h>
        #include <fcntl.h>
        #include <stdlib.h>
        #include <string.h>
        #include <stdio.h>
        #include <unistd.h>
      
        int main(int argc, char **argv)
        {
        	int i, fd, ret, len;
        	struct stat buf;
        	void *addr;
        	unsigned char *vec;
        	char *strbuf;
        	ssize_t pagesize = getpagesize();
        	ssize_t filesize;
      
        	fd = open(argv[1], O_RDWR|O_CREAT, S_IRUSR|S_IWUSR);
        	if (fd < 0)
        		return -1;
        	filesize = strtoul(argv[2], NULL, 10);
      
        	strbuf = malloc(filesize);
        	memset(strbuf, 42, filesize);
        	write(fd, strbuf, filesize);
        	free(strbuf);
        	fsync(fd);
      
        	len = (filesize + pagesize - 1) / pagesize;
        	printf("length of pages: %d\n", len);
      
        	addr = mmap(NULL, filesize, PROT_READ, MAP_SHARED, fd, 0);
        	if (addr == MAP_FAILED)
        		return -1;
      
        	ret = posix_fadvise(fd, 0, filesize, POSIX_FADV_DONTNEED);
        	if (ret < 0)
        		return -1;
      
        	vec = malloc(len);
        	ret = mincore(addr, filesize, (void *)vec);
        	if (ret < 0)
        		return -1;
      
        	for (i = 0; i < len; i++)
        		printf("pages[%d]: %x\n", i, vec[i] & 0x1);
      
        	free(vec);
        	close(fd);
      
        	return 0;
        }
        ==============================
      
      Test 1: running on kernel with commit 18aba41c reverted:
      
        [root@caspar ~]# uname -r
        4.15.0-rc6.revert+
        [root@caspar ~]# ./test_fadvise file1 1024
        length of pages: 1
        pages[0]: 0    # <-- partial page discarded
        [root@caspar ~]# ./test_fadvise file2 8192
        length of pages: 2
        pages[0]: 0
        pages[1]: 0
        [root@caspar ~]# ./test_fadvise file3 10240
        length of pages: 3
        pages[0]: 0
        pages[1]: 0
        pages[2]: 0    # <-- partial page discarded
      
      Test 2: running on mainline kernel:
      
        [root@caspar ~]# uname -r
        4.15.0-rc6+
        [root@caspar ~]# ./test_fadvise test1 1024
        length of pages: 1
        pages[0]: 1    # <-- partial and the only page not discarded
        [root@caspar ~]# ./test_fadvise test2 8192
        length of pages: 2
        pages[0]: 0
        pages[1]: 0
        [root@caspar ~]# ./test_fadvise test3 10240
        length of pages: 3
        pages[0]: 0
        pages[1]: 0
        pages[2]: 1    # <-- partial page not discarded
      
      Test 3: running on kernel with this patch:
      
        [root@caspar ~]# uname -r
        4.15.0-rc6.patched+
        [root@caspar ~]# ./test_fadvise test1 1024
        length of pages: 1
        pages[0]: 0    # <-- partial page and EOF, discarded
        [root@caspar ~]# ./test_fadvise test2 8192
        length of pages: 2
        pages[0]: 0
        pages[1]: 0
        [root@caspar ~]# ./test_fadvise test3 10240
        length of pages: 3
        pages[0]: 0
        pages[1]: 0
        pages[2]: 0    # <-- partial page and EOF, discarded
      
      [akpm@linux-foundation.org: tweak code comment]
      Link: http://lkml.kernel.org/r/5222da9ee20e1695eaabb69f631f200d6e6b8876.1515132470.git.jinli.zjl@alibaba-inc.comSigned-off-by: default avatarshidao.ytt <shidao.ytt@alibaba-inc.com>
      Signed-off-by: default avatarCaspar Zhang <jinli.zjl@alibaba-inc.com>
      Reviewed-by: default avatarOliver Yang <zhiche.yy@alibaba-inc.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a7ab400d
  4. 02 Nov, 2017 1 commit
    • Greg Kroah-Hartman's avatar
      License cleanup: add SPDX GPL-2.0 license identifier to files with no license · b2441318
      Greg Kroah-Hartman authored
      Many source files in the tree are missing licensing information, which
      makes it harder for compliance tools to determine the correct license.
      
      By default all files without license information are under the default
      license of the kernel, which is GPL version 2.
      
      Update the files which contain no license information with the 'GPL-2.0'
      SPDX license identifier.  The SPDX identifier is a legally binding
      shorthand, which can be used instead of the full boiler plate text.
      
      This patch is based on work done by Thomas Gleixner and Kate Stewart and
      Philippe Ombredanne.
      
      How this work was done:
      
      Patches were generated and checked against linux-4.14-rc6 for a subset of
      the use cases:
       - file had no licensing information it it.
       - file was a */uapi/* one with no licensing information in it,
       - file was a */uapi/* one with existing licensing information,
      
      Further patches will be generated in subsequent months to fix up cases
      where non-standard license headers were used, and references to license
      had to be inferred by heuristics based on keywords.
      
      The analysis to determine which SPDX License Identifier to be applied to
      a file was done in a spreadsheet of side by side results from of the
      output of two independent scanners (ScanCode & Windriver) producing SPDX
      tag:value files created by Philippe Ombredanne.  Philippe prepared the
      base worksheet, and did an initial spot review of a few 1000 files.
      
      The 4.13 kernel was the starting point of the analysis with 60,537 files
      assessed.  Kate Stewart did a file by file comparison of the scanner
      results in the spreadsheet to determine which SPDX license identifier(s)
      to be applied to the file. She confirmed any determination that was not
      immediately clear with lawyers working with the Linux Foundation.
      
      Criteria used to select files for SPDX license identifier tagging was:
       - Files considered eligible had to be source code files.
       - Make and config files were included as candidates if they contained >5
         lines of source
       - File already had some variant of a license header in it (even if <5
         lines).
      
      All documentation files were explicitly excluded.
      
      The following heuristics were used to determine which SPDX license
      identifiers to apply.
      
       - when both scanners couldn't find any license traces, file was
         considered to have no license information in it, and the top level
         COPYING file license applied.
      
         For non */uapi/* files that summary was:
      
         SPDX license identifier                            # files
         ---------------------------------------------------|-------
         GPL-2.0                                              11139
      
         and resulted in the first patch in this series.
      
         If that file was a */uapi/* path one, it was "GPL-2.0 WITH
         Linux-syscall-note" otherwise it was "GPL-2.0".  Results of that was:
      
         SPDX license identifier                            # files
         ---------------------------------------------------|-------
         GPL-2.0 WITH Linux-syscall-note                        930
      
         and resulted in the second patch in this series.
      
       - if a file had some form of licensing information in it, and was one
         of the */uapi/* ones, it was denoted with the Linux-syscall-note if
         any GPL family license was found in the file or had no licensing in
         it (per prior point).  Results summary:
      
         SPDX license identifier                            # files
         ---------------------------------------------------|------
         GPL-2.0 WITH Linux-syscall-note                       270
         GPL-2.0+ WITH Linux-syscall-note                      169
         ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)    21
         ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)    17
         LGPL-2.1+ WITH Linux-syscall-note                      15
         GPL-1.0+ WITH Linux-syscall-note                       14
         ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause)    5
         LGPL-2.0+ WITH Linux-syscall-note                       4
         LGPL-2.1 WITH Linux-syscall-note                        3
         ((GPL-2.0 WITH Linux-syscall-note) OR MIT)              3
         ((GPL-2.0 WITH Linux-syscall-note) AND MIT)             1
      
         and that resulted in the third patch in this series.
      
       - when the two scanners agreed on the detected license(s), that became
         the concluded license(s).
      
       - when there was disagreement between the two scanners (one detected a
         license but the other didn't, or they both detected different
         licenses) a manual inspection of the file occurred.
      
       - In most cases a manual inspection of the information in the file
         resulted in a clear resolution of the license that should apply (and
         which scanner probably needed to revisit its heuristics).
      
       - When it was not immediately clear, the license identifier was
         confirmed with lawyers working with the Linux Foundation.
      
       - If there was any question as to the appropriate license identifier,
         the file was flagged for further research and to be revisited later
         in time.
      
      In total, over 70 hours of logged manual review was done on the
      spreadsheet to determine the SPDX license identifiers to apply to the
      source files by Kate, Philippe, Thomas and, in some cases, confirmation
      by lawyers working with the Linux Foundation.
      
      Kate also obtained a third independent scan of the 4.13 code base from
      FOSSology, and compared selected files where the other two scanners
      disagreed against that SPDX file, to see if there was new insights.  The
      Windriver scanner is based on an older version of FOSSology in part, so
      they are related.
      
      Thomas did random spot checks in about 500 files from the spreadsheets
      for the uapi headers and agreed with SPDX license identifier in the
      files he inspected. For the non-uapi files Thomas did random spot checks
      in about 15000 files.
      
      In initial set of patches against 4.14-rc6, 3 files were found to have
      copy/paste license identifier errors, and have been fixed to reflect the
      correct identifier.
      
      Additionally Philippe spent 10 hours this week doing a detailed manual
      inspection and review of the 12,461 patched files from the initial patch
      version early this week with:
       - a full scancode scan run, collecting the matched texts, detected
         license ids and scores
       - reviewing anything where there was a license detected (about 500+
         files) to ensure that the applied SPDX license was correct
       - reviewing anything where there was no detection but the patch license
         was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
         SPDX license was correct
      
      This produced a worksheet with 20 files needing minor correction.  This
      worksheet was then exported into 3 different .csv files for the
      different types of files to be modified.
      
      These .csv files were then reviewed by Greg.  Thomas wrote a script to
      parse the csv files and add the proper SPDX tag to the file, in the
      format that the file expected.  This script was further refined by Greg
      based on the output to detect more types of files automatically and to
      distinguish between header and source .c files (which need different
      comment types.)  Finally Greg ran the script using the .csv files to
      generate the patches.
      Reviewed-by: default avatarKate Stewart <kstewart@linuxfoundation.org>
      Reviewed-by: default avatarPhilippe Ombredanne <pombredanne@nexb.com>
      Reviewed-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      b2441318
  5. 09 Sep, 2017 1 commit
    • Shakeel Butt's avatar
      mm: fadvise: avoid fadvise for fs without backing device · 3a77d214
      Shakeel Butt authored
      The fadvise() manpage is silent on fadvise()'s effect on memory-based
      filesystems (shmem, hugetlbfs & ramfs) and pseudo file systems (procfs,
      sysfs, kernfs).  The current implementaion of fadvise is mostly a noop
      for such filesystems except for FADV_DONTNEED which will trigger
      expensive remote LRU cache draining.  This patch makes the noop of
      fadvise() on such file systems very explicit.
      
      However this change has two side effects for ramfs and one for tmpfs.
      First fadvise(FADV_DONTNEED) could remove the unmapped clean zero'ed
      pages of ramfs (allocated through read, readahead & read fault) and
      tmpfs (allocated through read fault).  Also fadvise(FADV_WILLNEED) could
      create such clean zero'ed pages for ramfs.  This change removes those
      possibilities.
      
      One of our generic libraries does fadvise(FADV_DONTNEED).  Recently we
      observed high latency in fadvise() and noticed that the users have
      started using tmpfs files and the latency was due to expensive remote
      LRU cache draining.  For normal tmpfs files (have data written on them),
      fadvise(FADV_DONTNEED) will always trigger the unneeded remote cache
      draining.
      
      Link: http://lkml.kernel.org/r/20170818011023.181465-1-shakeelb@google.comSigned-off-by: default avatarShakeel Butt <shakeelb@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      3a77d214
  6. 20 Dec, 2016 1 commit
    • Johannes Weiner's avatar
      mm: fadvise: avoid expensive remote LRU cache draining after FADV_DONTNEED · 4dd72b4a
      Johannes Weiner authored
      When FADV_DONTNEED cannot drop all pages in the range, it observes that
      some pages might still be on per-cpu LRU caches after recent
      instantiation and so initiates remote calls to all CPUs to flush their
      local caches.  However, in most cases, the fadvise happens from the same
      context that instantiated the pages, and any pre-LRU pages in the
      specified range are most likely sitting on the local CPU's LRU cache,
      and so in many cases this results in unnecessary remote calls, which, in
      a loaded system, can hold up the fadvise() call significantly.
      
      [ I didn't record it in the extreme case we observed at Facebook,
        unfortunately. We had a slow-to-respond system and noticed it
        lru_add_drain_all() leading the profile during fadvise calls. This
        patch came out of thinking about the code and how we commonly call
        FADV_DONTNEED.
      
        FWIW, I wrote a silly directory tree walker/searcher that recurses
        through /usr to read and FADV_DONTNEED each file it finds. On a 2
        socket 40 ht machine, over 1% is spent in lru_add_drain_all(). With
        the patch, that cost is gone; the local drain cost shows at 0.09%. ]
      
      Try to avoid the remote call by flushing the local LRU cache before even
      attempting to invalidate anything.  It's a cheap operation, and the
      local LRU cache is the most likely to hold any pre-LRU pages in the
      specified fadvise range.
      
      Link: http://lkml.kernel.org/r/20161214210017.GA1465@cmpxchg.orgSigned-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Acked-by: default avatarMel Gorman <mgorman@suse.de>
      Acked-by: default avatarHillf Danton <hillf.zj@alibaba-inc.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      4dd72b4a
  7. 09 Jun, 2016 1 commit
  8. 04 Apr, 2016 1 commit
    • Kirill A. Shutemov's avatar
      mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros · 09cbfeaf
      Kirill A. Shutemov authored
      PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
      ago with promise that one day it will be possible to implement page
      cache with bigger chunks than PAGE_SIZE.
      
      This promise never materialized.  And unlikely will.
      
      We have many places where PAGE_CACHE_SIZE assumed to be equal to
      PAGE_SIZE.  And it's constant source of confusion on whether
      PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
      especially on the border between fs and mm.
      
      Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
      breakage to be doable.
      
      Let's stop pretending that pages in page cache are special.  They are
      not.
      
      The changes are pretty straight-forward:
      
       - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
      
       - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
      
       - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
      
       - page_cache_get() -> get_page();
      
       - page_cache_release() -> put_page();
      
      This patch contains automated changes generated with coccinelle using
      script below.  For some reason, coccinelle doesn't patch header files.
      I've called spatch for them manually.
      
      The only adjustment after coccinelle is revert of changes to
      PAGE_CAHCE_ALIGN definition: we are going to drop it later.
      
      There are few places in the code where coccinelle didn't reach.  I'll
      fix them manually in a separate patch.  Comments and documentation also
      will be addressed with the separate patch.
      
      virtual patch
      
      @@
      expression E;
      @@
      - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E
      
      @@
      expression E;
      @@
      - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E
      
      @@
      @@
      - PAGE_CACHE_SHIFT
      + PAGE_SHIFT
      
      @@
      @@
      - PAGE_CACHE_SIZE
      + PAGE_SIZE
      
      @@
      @@
      - PAGE_CACHE_MASK
      + PAGE_MASK
      
      @@
      expression E;
      @@
      - PAGE_CACHE_ALIGN(E)
      + PAGE_ALIGN(E)
      
      @@
      expression E;
      @@
      - page_cache_get(E)
      + get_page(E)
      
      @@
      expression E;
      @@
      - page_cache_release(E)
      + put_page(E)
      Signed-off-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      09cbfeaf
  9. 02 Jun, 2015 1 commit
    • Tejun Heo's avatar
      writeback: implement and use inode_congested() · 703c2708
      Tejun Heo authored
      In several places, bdi_congested() and its wrappers are used to
      determine whether more IOs should be issued.  With cgroup writeback
      support, this question can't be answered solely based on the bdi
      (backing_dev_info).  It's dependent on whether the filesystem and bdi
      support cgroup writeback and the blkcg the inode is associated with.
      
      This patch implements inode_congested() and its wrappers which take
      @inode and determines the congestion state considering cgroup
      writeback.  The new functions replace bdi_*congested() calls in places
      where the query is about specific inode and task.
      
      There are several filesystem users which also fit this criteria but
      they should be updated when each filesystem implements cgroup
      writeback support.
      
      v2: Now that a given inode is associated with only one wb, congestion
          state can be determined independent from the asking task.  Drop
          @task.  Spotted by Vivek.  Also, converted to take @inode instead
          of @mapping and renamed to inode_congested().
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      703c2708
  10. 17 Feb, 2015 1 commit
    • Matthew Wilcox's avatar
      vfs: remove get_xip_mem · e748dcd0
      Matthew Wilcox authored
      All callers of get_xip_mem() are now gone.  Remove checks for it,
      initialisers of it, documentation of it and the only implementation of it.
       Also remove mm/filemap_xip.c as it is now empty.  Also remove
      documentation of the long-gone get_xip_page().
      Signed-off-by: default avatarMatthew Wilcox <matthew.r.wilcox@intel.com>
      Cc: Andreas Dilger <andreas.dilger@intel.com>
      Cc: Boaz Harrosh <boaz@plexistor.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e748dcd0
  11. 20 Jan, 2015 1 commit
  12. 13 Dec, 2014 1 commit
  13. 04 Mar, 2013 1 commit
  14. 24 Feb, 2013 1 commit
    • Mel Gorman's avatar
      mm/fadvise.c: drain all pagevecs if POSIX_FADV_DONTNEED fails to discard all pages · 67d46b29
      Mel Gorman authored
      Rob van der Heij reported the following (paraphrased) on private mail.
      
      	The scenario is that I want to avoid backups to fill up the page
      	cache and purge stuff that is more likely to be used again (this is
      	with s390x Linux on z/VM, so I don't give it as much memory that
      	we don't care anymore). So I have something with LD_PRELOAD that
      	intercepts the close() call (from tar, in this case) and issues
      	a posix_fadvise() just before closing the file.
      
      	This mostly works, except for small files (less than 14 pages)
      	that remains in page cache after the face.
      
      Unfortunately Rob has not had a chance to test this exact patch but the
      test program below should be reproducing the problem he described.
      
      The issue is the per-cpu pagevecs for LRU additions.  If the pages are
      added by one CPU but fadvise() is called on another then the pages
      remain resident as the invalidate_mapping_pages() only drains the local
      pagevecs via its call to pagevec_release().  The user-visible effect is
      that a program that uses fadvise() properly is not obeyed.
      
      A possible fix for this is to put the necessary smarts into
      invalidate_mapping_pages() to globally drain the LRU pagevecs if a
      pagevec page could not be discarded.  The downside with this is that an
      inode cache shrink would send a global IPI and memory pressure
      potentially causing global IPI storms is very undesirable.
      
      Instead, this patch adds a check during fadvise(POSIX_FADV_DONTNEED) to
      check if invalidate_mapping_pages() discarded all the requested pages.
      If a subset of pages are discarded it drains the LRU pagevecs and tries
      again.  If the second attempt fails, it assumes it is due to the pages
      being mapped, locked or dirty and does not care.  With this patch, an
      application using fadvise() correctly will be obeyed but there is a
      downside that a malicious application can force the kernel to send
      global IPIs and increase overhead.
      
      If accepted, I would like this to be considered as a -stable candidate.
      It's not an urgent issue but it's a system call that is not working as
      advertised which is weak.
      
      The following test program demonstrates the problem.  It should never
      report that pages are still resident but will without this patch.  It
      assumes that CPU 0 and 1 exist.
      
      int main() {
      	int fd;
      	int pagesize = getpagesize();
      	ssize_t written = 0, expected;
      	char *buf;
      	unsigned char *vec;
      	int resident, i;
      	cpu_set_t set;
      
      	/* Prepare a buffer for writing */
      	expected = FILESIZE_PAGES * pagesize;
      	buf = malloc(expected + 1);
      	if (buf == NULL) {
      		printf("ENOMEM\n");
      		exit(EXIT_FAILURE);
      	}
      	buf[expected] = 0;
      	memset(buf, 'a', expected);
      
      	/* Prepare the mincore vec */
      	vec = malloc(FILESIZE_PAGES);
      	if (vec == NULL) {
      		printf("ENOMEM\n");
      		exit(EXIT_FAILURE);
      	}
      
      	/* Bind ourselves to CPU 0 */
      	CPU_ZERO(&set);
      	CPU_SET(0, &set);
      	if (sched_setaffinity(getpid(), sizeof(set), &set) == -1) {
      		perror("sched_setaffinity");
      		exit(EXIT_FAILURE);
      	}
      
      	/* open file, unlink and write buffer */
      	fd = open("fadvise-test-file", O_CREAT|O_EXCL|O_RDWR);
      	if (fd == -1) {
      		perror("open");
      		exit(EXIT_FAILURE);
      	}
      	unlink("fadvise-test-file");
      	while (written < expected) {
      		ssize_t this_write;
      		this_write = write(fd, buf + written, expected - written);
      
      		if (this_write == -1) {
      			perror("write");
      			exit(EXIT_FAILURE);
      		}
      
      		written += this_write;
      	}
      	free(buf);
      
      	/*
      	 * Force ourselves to another CPU. If fadvise only flushes the local
      	 * CPUs pagevecs then the fadvise will fail to discard all file pages
      	 */
      	CPU_ZERO(&set);
      	CPU_SET(1, &set);
      	if (sched_setaffinity(getpid(), sizeof(set), &set) == -1) {
      		perror("sched_setaffinity");
      		exit(EXIT_FAILURE);
      	}
      
      	/* sync and fadvise to discard the page cache */
      	fsync(fd);
      	if (posix_fadvise(fd, 0, expected, POSIX_FADV_DONTNEED) == -1) {
      		perror("posix_fadvise");
      		exit(EXIT_FAILURE);
      	}
      
      	/* map the file and use mincore to see which parts of it are resident */
      	buf = mmap(NULL, expected, PROT_READ, MAP_SHARED, fd, 0);
      	if (buf == NULL) {
      		perror("mmap");
      		exit(EXIT_FAILURE);
      	}
      	if (mincore(buf, expected, vec) == -1) {
      		perror("mincore");
      		exit(EXIT_FAILURE);
      	}
      
      	/* Check residency */
      	for (i = 0, resident = 0; i < FILESIZE_PAGES; i++) {
      		if (vec[i])
      			resident++;
      	}
      	if (resident != 0) {
      		printf("Nr unexpected pages resident: %d\n", resident);
      		exit(EXIT_FAILURE);
      	}
      
      	munmap(buf, expected);
      	close(fd);
      	free(vec);
      	exit(EXIT_SUCCESS);
      }
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Reported-by: default avatarRob van der Heij <rvdheij@gmail.com>
      Tested-by: default avatarRob van der Heij <rvdheij@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      67d46b29
  15. 23 Feb, 2013 1 commit
  16. 27 Sep, 2012 2 commits
  17. 01 Aug, 2012 1 commit
  18. 11 Jan, 2012 1 commit
  19. 06 Mar, 2010 1 commit
    • Wu Fengguang's avatar
      readahead: introduce FMODE_RANDOM for POSIX_FADV_RANDOM · 0141450f
      Wu Fengguang authored
      This fixes inefficient page-by-page reads on POSIX_FADV_RANDOM.
      
      POSIX_FADV_RANDOM used to set ra_pages=0, which leads to poor performance:
      a 16K read will be carried out in 4 _sync_ 1-page reads.
      
      In other places, ra_pages==0 means
      - it's ramfs/tmpfs/hugetlbfs/sysfs/configfs
      - some IO error happened
      where multi-page read IO won't help or should be avoided.
      
      POSIX_FADV_RANDOM actually want a different semantics: to disable the
      *heuristic* readahead algorithm, and to use a dumb one which faithfully
      submit read IO for whatever application requests.
      
      So introduce a flag FMODE_RANDOM for POSIX_FADV_RANDOM.
      
      Note that the random hint is not likely to help random reads performance
      noticeably.  And it may be too permissive on huge request size (its IO
      size is not limited by read_ahead_kb).
      
      In Quentin's report (http://lkml.org/lkml/2009/12/24/145), the overall
      (NFS read) performance of the application increased by 313%!
      Tested-by: default avatarQuentin Barnes <qbarnes+nfs@yahoo-inc.com>
      Signed-off-by: default avatarWu Fengguang <fengguang.wu@intel.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Cc: Chuck Lever <chuck.lever@oracle.com>
      Cc: <stable@kernel.org>			[2.6.33.x]
      Cc: <qbarnes+nfs@yahoo-inc.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      0141450f
  20. 17 Jun, 2009 1 commit
  21. 14 Jan, 2009 1 commit
    • Heiko Carstens's avatar
      [CVE-2009-0029] System call wrapper special cases · 6673e0c3
      Heiko Carstens authored
      System calls with an unsigned long long argument can't be converted with
      the standard wrappers since that would include a cast to long, which in
      turn means that we would lose the upper 32 bit on 32 bit architectures.
      Also semctl can't use the standard wrapper since it has a 'union'
      parameter.
      
      So we handle them as special case and add some extra wrappers instead.
      Signed-off-by: default avatarHeiko Carstens <heiko.carstens@de.ibm.com>
      6673e0c3
  22. 16 Oct, 2008 1 commit
  23. 28 Apr, 2008 1 commit
  24. 05 Feb, 2008 1 commit
    • Masatake YAMATO's avatar
      check ADVICE of fadvise64_64 even if get_xip_page is given · b5beb1ca
      Masatake YAMATO authored
      I've written some test programs in ltp project.  During writing I met an
      problem which I cannot solve in user land.  So I wrote a patch for linux
      kernel.  Please, include this patch if acceptable.
      
      The test program tests the 4th parameter of fadvise64_64:
      
          long sys_fadvise64_64(int fd, loff_t offset, loff_t len, int advice);
      
      My test case calls fadvise64_64 with invalid advice value and checks errno is
      set to EINVAL.  About the advice parameter man page says:
      
          ...
          Permissible values for advice include:
      
      	   POSIX_FADV_NORMAL
                        ...
      	   POSIX_FADV_SEQUENTIAL
                        ...
      	   POSIX_FADV_RANDOM
      		  ...
      	   POSIX_FADV_NOREUSE
                        ...
      	   POSIX_FADV_WILLNEED
                        ...
      	   POSIX_FADV_DONTNEED
      		  ...
          ERRORS
                 ...
      	   EINVAL An invalid value was specified for advice.
      
      However, I got a bug report that the system call invocations
      in my test case returned 0 unexpectedly.
      
      I've inspected the kernel code:
      
          asmlinkage long sys_fadvise64_64(int fd, loff_t offset, loff_t len, int advice)
          {
      	    struct file *file = fget(fd);
      	    struct address_space *mapping;
      	    struct backing_dev_info *bdi;
      	    loff_t endbyte;			/* inclusive */
      	    pgoff_t start_index;
      	    pgoff_t end_index;
      	    unsigned long nrpages;
      	    int ret = 0;
      
      	    if (!file)
      		    return -EBADF;
      
      	    if (S_ISFIFO(file->f_path.dentry->d_inode->i_mode)) {
      		    ret = -ESPIPE;
      		    goto out;
      	    }
      
      	    mapping = file->f_mapping;
      	    if (!mapping || len < 0) {
      		    ret = -EINVAL;
      		    goto out;
      	    }
      
      	    if (mapping->a_ops->get_xip_page)
      		    /* no bad return value, but ignore advice */
      		    goto out;
          ...
          out:
      	    fput(file);
      	    return ret;
          }
      
      I found the advice parameter is just ignored in the case
      mapping->a_ops->get_xip_page is given. This behavior is different from
      what is written on the man page. Is this o.k.?
      
      get_xip_page is given if CONFIG_EXT2_FS_XIP is true.
      Anyway I cannot find the easy way to detect get_xip_page
      field is given or CONFIG_EXT2_FS_XIP is true from the
      user space.
      
      I propose the following patch which checks the advice parameter
      even if get_xip_page is given.
      Signed-off-by: default avatarMasatake YAMATO <yamato@redhat.com>
      Acked-by: default avatarCarsten Otte <cotte@de.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b5beb1ca
  25. 08 Dec, 2006 1 commit
  26. 06 Aug, 2006 1 commit
    • Andrew Morton's avatar
      [PATCH] fadvise() make POSIX_FADV_NOREUSE a no-op · 60c371bc
      Andrew Morton authored
      The POSIX_FADV_NOREUSE hint means "the application will use this range of the
      file a single time".  It seems to be intended that the implementation will use
      this hint to perform drop-behind of that part of the file when the application
      gets around to reading or writing it.
      
      However for reasons which aren't obvious (or sane?) I mapped
      POSIX_FADV_NOREUSE onto POSIX_FADV_WILLNEED.  ie: it does readahead.
      
      That's daft.  So for now, make POSIX_FADV_NOREUSE a no-op.
      
      This is a non-back-compatible change.  If someone was using POSIX_FADV_NOREUSE
      to perform readahead, they lose.  The likelihood is low.
      
      If/when we later implement POSIX_FADV_NOREUSE things will get interesting - to
      do it fully we'll need to maintain file offset/length ranges and peform all
      sorts of complex tricks, and managing the lifetime of those ranges' data
      structures will be interesting..
      
      A sensible implementation would probably ignore the file range and would
      simply mark the entire file as needing some form of drop-behind treatment.
      
      Cc: Michael Kerrisk <mtk-manpages@gmx.net>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      60c371bc
  27. 10 Jul, 2006 1 commit
  28. 31 Mar, 2006 1 commit
    • Andrew Morton's avatar
      [PATCH] sys_sync_file_range() · f79e2abb
      Andrew Morton authored
      Remove the recently-added LINUX_FADV_ASYNC_WRITE and LINUX_FADV_WRITE_WAIT
      fadvise() additions, do it in a new sys_sync_file_range() syscall instead.
      Reasons:
      
      - It's more flexible.  Things which would require two or three syscalls with
        fadvise() can be done in a single syscall.
      
      - Using fadvise() in this manner is something not covered by POSIX.
      
      The patch wires up the syscall for x86.
      
      The sycall is implemented in the new fs/sync.c.  The intention is that we can
      move sys_fsync(), sys_fdatasync() and perhaps sys_sync() into there later.
      
      Documentation for the syscall is in fs/sync.c.
      
      A test app (sync_file_range.c) is in
      http://www.zip.com.au/~akpm/linux/patches/stuff/ext3-tools.tar.gz.
      
      The available-to-GPL-modules do_sync_file_range() is for knfsd: "A COMMIT can
      say NFS_DATA_SYNC or NFS_FILE_SYNC.  I can skip the ->fsync call for
      NFS_DATA_SYNC which is hopefully the more common."
      
      Note: the `async' writeout mode SYNC_FILE_RANGE_WRITE will turn synchronous if
      the queue is congested.  This is trivial to fix: add a new flag bit, set
      wbc->nonblocking.  But I'm not sure that we want to expose implementation
      details down to that level.
      
      Note: it's notable that we can sync an fd which wasn't opened for writing.
      Same with fsync() and fdatasync()).
      
      Note: the code takes some care to handle attempts to sync file contents
      outside the 16TB offset on 32-bit machines.  It makes such attempts appear to
      succeed, for best 32-bit/64-bit compatibility.  Perhaps it should make such
      requests fail...
      
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Michael Kerrisk <mtk-manpages@gmx.net>
      Cc: Ulrich Drepper <drepper@redhat.com>
      Cc: Neil Brown <neilb@cse.unsw.edu.au>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      f79e2abb
  29. 24 Mar, 2006 1 commit
    • Andrew Morton's avatar
      [PATCH] fadvise(): write commands · ebcf28e1
      Andrew Morton authored
      Add two new linux-specific fadvise extensions():
      
      LINUX_FADV_ASYNC_WRITE: start async writeout of any dirty pages between file
      offsets `offset' and `offset+len'.  Any pages which are currently under
      writeout are skipped, whether or not they are dirty.
      
      LINUX_FADV_WRITE_WAIT: wait upon writeout of any dirty pages between file
      offsets `offset' and `offset+len'.
      
      By combining these two operations the application may do several things:
      
      LINUX_FADV_ASYNC_WRITE: push some or all of the dirty pages at the disk.
      
      LINUX_FADV_WRITE_WAIT, LINUX_FADV_ASYNC_WRITE: push all of the currently dirty
      pages at the disk.
      
      LINUX_FADV_WRITE_WAIT, LINUX_FADV_ASYNC_WRITE, LINUX_FADV_WRITE_WAIT: push all
      of the currently dirty pages at the disk, wait until they have been written.
      
      It should be noted that none of these operations write out the file's
      metadata.  So unless the application is strictly performing overwrites of
      already-instantiated disk blocks, there are no guarantees here that the data
      will be available after a crash.
      
      To complete this suite of operations I guess we should have a "sync file
      metadata only" operation.  This gives applications access to all the building
      blocks needed for all sorts of sync operations.  But sync-metadata doesn't fit
      well with the fadvise() interface.  Probably it should be a new syscall:
      sys_fmetadatasync().
      
      The patch also diddles with the meaning of `endbyte' in sys_fadvise64_64().
      It is made to represent that last affected byte in the file (ie: it is
      inclusive).  Generally, all these byterange and pagerange functions are
      inclusive so we can easily represent EOF with -1.
      
      As Ulrich notes, these two functions are somewhat abusive of the fadvise()
      concept, which appears to be "set the future policy for this fd".
      
      But these commands are a perfect fit with the fadvise() impementation, and
      several of the existing fadvise() commands are synchronous and don't affect
      future policy either.   I think we can live with the slight incongruity.
      
      Cc: Michael Kerrisk <mtk-manpages@gmx.net>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      ebcf28e1
  30. 09 Jan, 2006 1 commit
  31. 24 Jun, 2005 1 commit
  32. 16 Apr, 2005 1 commit
    • Linus Torvalds's avatar
      Linux-2.6.12-rc2 · 1da177e4
      Linus Torvalds authored
      Initial git repository build. I'm not bothering with the full history,
      even though we have it. We can create a separate "historical" git
      archive of that later if we want to, and in the meantime it's about
      3.2GB when imported into git - space that would just make the early
      git days unnecessarily complicated, when we don't have a lot of good
      infrastructure for it.
      
      Let it rip!
      1da177e4