    Revert "Change mincore() to count "mapped" pages rather than "cached" pages" · 30bac164
    Linus Torvalds authored
    This reverts commit 574823bf.
    
    It turns out that my hope that we could just remove the code that
    exposes the cache residency status from mincore() was too optimistic.
    
    There are various random users that want it, and one example would be
    the Netflix database cluster maintenance. To quote Josh Snyder:
    
     "For Netflix, losing accurate information from the mincore syscall
      would lengthen database cluster maintenance operations from days to
      months. We rely on cross-process mincore to migrate the contents of a
      page cache from machine to machine, and across reboots.
    
      To do this, I wrote and maintain happycache [1], a page cache
      dumper/loader tool. It is quite similar in architecture to pgfincore,
      except that it is agnostic to workload. The gist of happycache's
      operation is "produce a dump of residence status for each page, do
      some operation, then reload exactly the same pages which were present
      before." happycache is entirely dependent on accurate reporting of the
      in-core status of file-backed pages, as accessed by another process.
    
      We primarily use happycache with Cassandra, which (like Postgres +
      pgfincore) relies heavily on OS page cache to reduce disk accesses.
      Because our workloads never experience a cold page cache, we are able
      to provision hardware for a peak utilization level that is far lower
      than the hypothetical "every query is a cache miss" peak.
    
      A database warmed by happycache can be ready for service in seconds
      (bounded only by the performance of the drives and the I/O subsystem),
      with no period of in-service degradation. By contrast, putting a
      database in service without a page cache entails a potentially
      unbounded period of degradation (at Netflix, the time to populate a
      single node's cache via natural cache misses varies by workload from
      hours to weeks). If a single node upgrade were to take weeks, then
      upgrading an entire cluster would take months. Since we want to apply
      security upgrades (and other things) on a somewhat tighter schedule,
      we would have to develop more complex solutions to provide the same
      functionality already provided by mincore.
    
      At the bottom line, happycache is designed to benignly exploit the
      same information leak documented in the paper [2]. I think it makes
      perfect sense to remove cross-process mincore functionality from
      unprivileged users, but not to remove it entirely"
    
    We do have an alternate approach that limits the cache residency
    reporting only to processes that have write permissions to the file, so
    we can fix the original information leak issue that way.  It involves
    _adding_ code rather than removing it, which is sad, but hey, at least
    we haven't found any users that would find the restrictions
    unacceptable.
    
    So revert the optimistic first approach to make room for that alternate
    fix instead.
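
For readers unfamiliar with the interface under discussion, the "dump of residence status for each page" step in the quoted description can be sketched in userspace roughly as follows. This is a minimal illustration and not happycache's actual code: the helper name count_resident_pages is hypothetical, and it assumes a Linux system where mincore() is available. It maps a file and asks mincore() which of its pages are currently resident.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from happycache or the kernel tree).
 * Returns the number of pages of `path` that mincore() reports as
 * resident, or -1 on error. On success, the total page count is
 * stored in *total_pages when that pointer is non-NULL.
 */
long count_resident_pages(const char *path, long *total_pages)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct stat st;
    if (fstat(fd, &st) < 0 || st.st_size <= 0) {
        close(fd);
        return -1;
    }

    long page = sysconf(_SC_PAGESIZE);
    size_t len = (size_t)st.st_size;
    size_t npages = (len + (size_t)page - 1) / (size_t)page;

    /* Mapping the file does not read it in; it only gives mincore()
     * an address range to report on. */
    void *map = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);
    if (map == MAP_FAILED)
        return -1;

    /* mincore() fills one byte per page; bit 0 set means resident. */
    unsigned char *vec = malloc(npages);
    long resident = -1;
    if (vec != NULL && mincore(map, len, vec) == 0) {
        resident = 0;
        for (size_t i = 0; i < npages; i++)
            resident += vec[i] & 1;
        assert(resident <= (long)npages);
    }

    free(vec);
    munmap(map, len);
    if (resident >= 0 && total_pages != NULL)
        *total_pages = (long)npages;
    return resident;
}
```

A caller would pass a data file path and compare the returned count with the total to see how warm the cache is for that file. A dumper in the style described above would presumably record the per-page vector rather than just the count, so the same pages can be read back in later.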
    
    Reported-by: Josh Snyder <joshs@netflix.com>
    Cc: Jiri Kosina <jikos@kernel.org>
    Cc: Dominique Martinet <asmadeus@codewreck.org>
    Cc: Andy Lutomirski <luto@amacapital.net>
    Cc: Dave Chinner <david@fromorbit.com>
    Cc: Kevin Easton <kevin@guarana.org>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Cyril Hrubis <chrubis@suse.cz>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Tejun Heo <tj@kernel.org>
    Cc: Kirill A. Shutemov <kirill@shutemov.name>
    Cc: Daniel Gruss <daniel@gruss.cc>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>