1. 14 Apr, 2015 7 commits
  2. 06 Apr, 2015 1 commit
    • Al Viro's avatar
      fix mremap() vs. ioctx_kill() race · b2edffdd
      Al Viro authored
      teach ->mremap() method to return an error and have it fail for
      aio mappings in process of being killed
      Note that in case of ->mremap() failure we need to undo move_page_tables()
      we'd already done; we could call ->mremap() first, but then the failure of
      move_page_tables() would require undoing whatever _successful_ ->mremap()
      has done, which would be a lot more headache in general.
      Signed-off-by: 's avatarAl Viro <viro@zeniv.linux.org.uk>
  3. 25 Mar, 2015 9 commits
    • Mel Gorman's avatar
      mm: numa: mark huge PTEs young when clearing NUMA hinting faults · b7b04004
      Mel Gorman authored
      Base PTEs are marked young when the NUMA hinting information is cleared
      but the same does not happen for huge pages which this patch addresses.
      Note that migrated pages are not marked young as the base page migration
      code does not assume that migrated pages have been referenced.  This
      could be addressed but beyond the scope of this series which is aimed at
      Dave Chinners shrink workload that is unlikely to be affected by this
      Signed-off-by: 's avatarMel Gorman <mgorman@suse.de>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: 's avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: 's avatarLinus Torvalds <torvalds@linux-foundation.org>
    • Mel Gorman's avatar
      mm: numa: slow PTE scan rate if migration failures occur · 074c2381
      Mel Gorman authored
      Dave Chinner reported the following on https://lkml.org/lkml/2015/3/1/226
        Across the board the 4.0-rc1 numbers are much slower, and the degradation
        is far worse when using the large memory footprint configs. Perf points
        straight at the cause - this is from 4.0-rc1 on the "-o bhash=101073" config:
         -   56.07%    56.07%  [kernel]            [k] default_send_IPI_mask_sequence_phys
            - default_send_IPI_mask_sequence_phys
               - 99.99% physflat_send_IPI_mask
                  - 99.37% native_send_call_func_ipi
                     - native_flush_tlb_others
                        - 99.85% flush_tlb_page
                           - handle_mm_fault
                              - 99.73% __do_page_fault
                                 + async_page_fault
                    0.63% native_send_call_func_single_ipi
      This is showing excessive migration activity even though excessive
      migrations are meant to get throttled.  Normally, the scan rate is tuned
      on a per-task basis depending on the locality of faults.  However, if
      migrations fail for any reason then the PTE scanner may scan faster if
      the faults continue to be remote.  This means there is higher system CPU
      overhead and fault trapping at exactly the time we know that migrations
      cannot happen.  This patch tracks when migration failures occur and
      slows the PTE scanner.
      Signed-off-by: 's avatarMel Gorman <mgorman@suse.de>
      Reported-by: 's avatarDave Chinner <david@fromorbit.com>
      Tested-by: 's avatarDave Chinner <david@fromorbit.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: 's avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: 's avatarLinus Torvalds <torvalds@linux-foundation.org>
    • Mel Gorman's avatar
      mm: numa: preserve PTE write permissions across a NUMA hinting fault · b191f9b1
      Mel Gorman authored
      Protecting a PTE to trap a NUMA hinting fault clears the writable bit
      and further faults are needed after trapping a NUMA hinting fault to set
      the writable bit again.  This patch preserves the writable bit when
      trapping NUMA hinting faults.  The impact is obvious from the number of
      minor faults trapped during the basis balancing benchmark and the system
      CPU usage;
                                                   4.0.0-rc4             4.0.0-rc4
                                                    baseline              preserve
        Time System-NUMA01                  107.13 (  0.00%)      103.13 (  3.73%)
        Time System-NUMA01_THEADLOCAL       131.87 (  0.00%)       83.30 ( 36.83%)
        Time System-NUMA02                    8.95 (  0.00%)       10.72 (-19.78%)
        Time System-NUMA02_SMT                4.57 (  0.00%)        3.99 ( 12.69%)
        Time Elapsed-NUMA01                 515.78 (  0.00%)      517.26 ( -0.29%)
        Time Elapsed-NUMA01_THEADLOCAL      384.10 (  0.00%)      384.31 ( -0.05%)
        Time Elapsed-NUMA02                  48.86 (  0.00%)       48.78 (  0.16%)
        Time Elapsed-NUMA02_SMT              47.98 (  0.00%)       48.12 ( -0.29%)
                     4.0.0-rc4   4.0.0-rc4
                      baseline    preserve
        User          44383.95    43971.89
        System          252.61      201.24
        Elapsed         998.68     1000.94
        Minor Faults   2597249     1981230
        Major Faults       365         364
      There is a similar drop in system CPU usage using Dave Chinner's xfsrepair
                                            4.0.0-rc4             4.0.0-rc4
                                             baseline              preserve
        Amean    real-xfsrepair      454.14 (  0.00%)      442.36 (  2.60%)
        Amean    syst-xfsrepair      277.20 (  0.00%)      204.68 ( 26.16%)
      The patch looks hacky but the alternatives looked worse.  The tidest was
      to rewalk the page tables after a hinting fault but it was more complex
      than this approach and the performance was worse.  It's not generally
      safe to just mark the page writable during the fault if it's a write
      fault as it may have been read-only for COW so that approach was
      Signed-off-by: 's avatarMel Gorman <mgorman@suse.de>
      Reported-by: 's avatarDave Chinner <david@fromorbit.com>
      Tested-by: 's avatarDave Chinner <david@fromorbit.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: 's avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: 's avatarLinus Torvalds <torvalds@linux-foundation.org>
    • Mel Gorman's avatar
      mm: numa: group related processes based on VMA flags instead of page table flags · bea66fbd
      Mel Gorman authored
      These are three follow-on patches based on the xfsrepair workload Dave
      Chinner reported was problematic in 4.0-rc1 due to changes in page table
      management -- https://lkml.org/lkml/2015/3/1/226.
      Much of the problem was reduced by commit 53da3bc2 ("mm: fix up numa
      read-only thread grouping logic") and commit ba68bc01 ("mm: thp:
      Return the correct value for change_huge_pmd").  It was known that the
      performance in 3.19 was still better even if is far less safe.  This
      series aims to restore the performance without compromising on safety.
      For the test of this mail, I'm comparing 3.19 against 4.0-rc4 and the
      three patches applied on top
                                                      3.19.0             4.0.0-rc4             4.0.0-rc4             4.0.0-rc4             4.0.0-rc4
                                                     vanilla               vanilla          vmwrite-v5r8         preserve-v5r8         slowscan-v5r8
        Time System-NUMA01                  124.00 (  0.00%)      161.86 (-30.53%)      107.13 ( 13.60%)      103.13 ( 16.83%)      145.01 (-16.94%)
        Time System-NUMA01_THEADLOCAL       115.54 (  0.00%)      107.64 (  6.84%)      131.87 (-14.13%)       83.30 ( 27.90%)       92.35 ( 20.07%)
        Time System-NUMA02                    9.35 (  0.00%)       10.44 (-11.66%)        8.95 (  4.28%)       10.72 (-14.65%)        8.16 ( 12.73%)
        Time System-NUMA02_SMT                3.87 (  0.00%)        4.63 (-19.64%)        4.57 (-18.09%)        3.99 ( -3.10%)        3.36 ( 13.18%)
        Time Elapsed-NUMA01                 570.06 (  0.00%)      567.82 (  0.39%)      515.78 (  9.52%)      517.26 (  9.26%)      543.80 (  4.61%)
        Time Elapsed-NUMA01_THEADLOCAL      393.69 (  0.00%)      384.83 (  2.25%)      384.10 (  2.44%)      384.31 (  2.38%)      380.73 (  3.29%)
        Time Elapsed-NUMA02                  49.09 (  0.00%)       49.33 ( -0.49%)       48.86 (  0.47%)       48.78 (  0.63%)       50.94 ( -3.77%)
        Time Elapsed-NUMA02_SMT              47.51 (  0.00%)       47.15 (  0.76%)       47.98 ( -0.99%)       48.12 ( -1.28%)       49.56 ( -4.31%)
                      3.19.0   4.0.0-rc4   4.0.0-rc4   4.0.0-rc4   4.0.0-rc4
                     vanilla     vanillavmwrite-v5r8preserve-v5r8slowscan-v5r8
        User        46334.60    46391.94    44383.95    43971.89    44372.12
        System        252.84      284.66      252.61      201.24      249.00
        Elapsed      1062.14     1050.96      998.68     1000.94     1026.78
      Overall the system CPU usage is comparable and the test is naturally a
      bit variable.  The slowing of the scanner hurts numa01 but on this
      machine it is an adverse workload and patches that dramatically help it
      often hurt absolutely everything else.
      Due to patch 2, the fault activity is interesting
                                        3.19.0   4.0.0-rc4   4.0.0-rc4   4.0.0-rc4   4.0.0-rc4
                                       vanilla     vanillavmwrite-v5r8preserve-v5r8slowscan-v5r8
        Minor Faults                   2097811     2656646     2597249     1981230     1636841
        Major Faults                       362         450         365         364         365
      Note the impact preserving the write bit across protection updates and
      fault reduces faults.
        NUMA alloc hit                 1229008     1217015     1191660     1178322     1199681
        NUMA alloc miss                      0           0           0           0           0
        NUMA interleave hit                  0           0           0           0           0
        NUMA alloc local               1228514     1216317     1190871     1177448     1199021
        NUMA base PTE updates        245706197   240041607   238195516   244704842   115012800
        NUMA huge PMD updates           479530      468448      464868      477573      224487
        NUMA page range updates      491225557   479886983   476207932   489222218   229950144
        NUMA hint faults                659753      656503      641678      656926      294842
        NUMA hint local faults          381604      373963      360478      337585      186249
        NUMA hint local percent             57          56          56          51          63
        NUMA pages migrated            5412140     6374899     6266530     5277468     5755096
        AutoNUMA cost                    5121%       5083%       4994%       5097%       2388%
      Here the impact of slowing the PTE scanner on migratrion failures is
      obvious as "NUMA base PTE updates" and "NUMA huge PMD updates" are
      massively reduced even though the headline performance is very similar.
      As xfsrepair was the reported workload here is the impact of the series
      on it.
                                               3.19.0             4.0.0-rc4             4.0.0-rc4             4.0.0-rc4             4.0.0-rc4
                                              vanilla               vanilla          vmwrite-v5r8         preserve-v5r8         slowscan-v5r8
        Min      real-fsmark        1183.29 (  0.00%)     1165.73 (  1.48%)     1152.78 (  2.58%)     1153.64 (  2.51%)     1177.62 (  0.48%)
        Min      syst-fsmark        4107.85 (  0.00%)     4027.75 (  1.95%)     3986.74 (  2.95%)     3979.16 (  3.13%)     4048.76 (  1.44%)
        Min      real-xfsrepair      441.51 (  0.00%)      463.96 ( -5.08%)      449.50 ( -1.81%)      440.08 (  0.32%)      439.87 (  0.37%)
        Min      syst-xfsrepair      195.76 (  0.00%)      278.47 (-42.25%)      262.34 (-34.01%)      203.70 ( -4.06%)      143.64 ( 26.62%)
        Amean    real-fsmark        1188.30 (  0.00%)     1177.34 (  0.92%)     1157.97 (  2.55%)     1158.21 (  2.53%)     1182.22 (  0.51%)
        Amean    syst-fsmark        4111.37 (  0.00%)     4055.70 (  1.35%)     3987.19 (  3.02%)     3998.72 (  2.74%)     4061.69 (  1.21%)
        Amean    real-xfsrepair      450.88 (  0.00%)      468.32 ( -3.87%)      454.14 ( -0.72%)      442.36 (  1.89%)      440.59 (  2.28%)
        Amean    syst-xfsrepair      199.66 (  0.00%)      290.60 (-45.55%)      277.20 (-38.84%)      204.68 ( -2.51%)      150.55 ( 24.60%)
        Stddev   real-fsmark           4.12 (  0.00%)       10.82 (-162.29%)       4.14 ( -0.28%)        5.98 (-45.05%)        4.60 (-11.53%)
        Stddev   syst-fsmark           2.63 (  0.00%)       20.32 (-671.82%)       0.37 ( 85.89%)       16.47 (-525.59%)      15.05 (-471.79%)
        Stddev   real-xfsrepair        6.87 (  0.00%)        4.55 ( 33.75%)        3.46 ( 49.58%)        1.78 ( 74.12%)        0.52 ( 92.50%)
        Stddev   syst-xfsrepair        3.02 (  0.00%)       10.30 (-241.37%)      13.17 (-336.37%)       0.71 ( 76.63%)        5.00 (-65.61%)
        CoeffVar real-fsmark           0.35 (  0.00%)        0.92 (-164.73%)       0.36 ( -2.91%)        0.52 (-48.82%)        0.39 (-12.10%)
        CoeffVar syst-fsmark           0.06 (  0.00%)        0.50 (-682.41%)       0.01 ( 85.45%)        0.41 (-543.22%)       0.37 (-478.78%)
        CoeffVar real-xfsrepair        1.52 (  0.00%)        0.97 ( 36.21%)        0.76 ( 49.94%)        0.40 ( 73.62%)        0.12 ( 92.33%)
        CoeffVar syst-xfsrepair        1.51 (  0.00%)        3.54 (-134.54%)       4.75 (-214.31%)       0.34 ( 77.20%)        3.32 (-119.63%)
        Max      real-fsmark        1193.39 (  0.00%)     1191.77 (  0.14%)     1162.90 (  2.55%)     1166.66 (  2.24%)     1188.50 (  0.41%)
        Max      syst-fsmark        4114.18 (  0.00%)     4075.45 (  0.94%)     3987.65 (  3.08%)     4019.45 (  2.30%)     4082.80 (  0.76%)
        Max      real-xfsrepair      457.80 (  0.00%)      474.60 ( -3.67%)      457.82 ( -0.00%)      444.42 (  2.92%)      441.03 (  3.66%)
        Max      syst-xfsrepair      203.11 (  0.00%)      303.65 (-49.50%)      294.35 (-44.92%)      205.33 ( -1.09%)      155.28 ( 23.55%)
      The really relevant lines as syst-xfsrepair which is the system CPU
      usage when running xfsrepair.  Note that on my machine the overhead was
      45% higher on 4.0-rc4 which may be part of what Dave is seeing.  Once we
      preserve the write bit across faults, it's only 2.51% higher on average.
      With the full series applied, system CPU usage is 24.6% lower on
      Again, the impact of preserving the write bit on minor faults is obvious
      and the impact of slowing scanning after migration failures is obvious
      on the PTE updates.  Note also that the number of pages migrated is much
      reduced even though the headline performance is comparable.
                                        3.19.0   4.0.0-rc4   4.0.0-rc4   4.0.0-rc4   4.0.0-rc4
                                       vanilla     vanillavmwrite-v5r8preserve-v5r8slowscan-v5r8
        Minor Faults                 153466827   254507978   249163829   153501373   105737890
        Major Faults                       610         702         690         649         724
        NUMA base PTE updates        217735049   210756527   217729596   216937111   144344993
        NUMA huge PMD updates           129294       85044      106921      127246       79887
        NUMA pages migrated           21938995    29705270    28594162    22687324    16258075
                              3.19.0   4.0.0-rc4   4.0.0-rc4   4.0.0-rc4   4.0.0-rc4
                             vanilla     vanillavmwrite-v5r8preserve-v5r8slowscan-v5r8
        Mean sdb-avgqusz       13.47        2.54        2.55        2.47        2.49
        Mean sdb-avgrqsz      202.32      140.22      139.50      139.02      138.12
        Mean sdb-await         25.92        5.09        5.33        5.02        5.22
        Mean sdb-r_await        4.71        0.19        0.83        0.51        0.11
        Mean sdb-w_await      104.13        5.21        5.38        5.05        5.32
        Mean sdb-svctm          0.59        0.13        0.14        0.13        0.14
        Mean sdb-rrqm           0.16        0.00        0.00        0.00        0.00
        Mean sdb-wrqm           3.59     1799.43     1826.84     1812.21     1785.67
        Max  sdb-avgqusz      111.06       12.13       14.05       11.66       15.60
        Max  sdb-avgrqsz      255.60      190.34      190.01      187.33      191.78
        Max  sdb-await        168.24       39.28       49.22       44.64       65.62
        Max  sdb-r_await      660.00       52.00      280.00       76.00       12.00
        Max  sdb-w_await     7804.00       39.28       49.22       44.64       65.62
        Max  sdb-svctm          4.00        2.82        2.86        1.98        2.84
        Max  sdb-rrqm           8.30        0.00        0.00        0.00        0.00
        Max  sdb-wrqm          34.20     5372.80     5278.60     5386.60     5546.15
      FWIW, I also checked SPECjbb in different configurations but it's
      similar observations -- minor faults lower, PTE update activity lower
      and performance is roughly comparable against 3.19.
      This patch (of 3):
      Threads that share writable data within pages are grouped together as
      related tasks.  This decision is based on whether the PTE is marked
      dirty which is subject to timing races between the PTE scanner update
      and when the application writes the page.  If the page is file-backed,
      then background flushes and sync also affect placement.  This is
      unpredictable behaviour which is impossible to reason about so this
      patch makes grouping decisions based on the VMA flags.
      Signed-off-by: 's avatarMel Gorman <mgorman@suse.de>
      Reported-by: 's avatarDave Chinner <david@fromorbit.com>
      Tested-by: 's avatarDave Chinner <david@fromorbit.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: 's avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: 's avatarLinus Torvalds <torvalds@linux-foundation.org>
    • Laura Abbott's avatar
      mm/page_alloc.c: call kernel_map_pages in unset_migrateype_isolate · cfa86943
      Laura Abbott authored
      Commit 3c605096 ("mm/page_alloc: restrict max order of merging on
      isolated pageblock") changed the logic of unset_migratetype_isolate to
      check the buddy allocator and explicitly call __free_pages to merge.
      The page that is being freed in this path never had prep_new_page called
      so set_page_refcounted is called explicitly but there is no call to
      kernel_map_pages.  With the default kernel_map_pages this is mostly
      harmless but if kernel_map_pages does any manipulation of the page
      tables (unmapping or setting pages to read only) this may trigger a
          alloc_contig_range test_pages_isolated(ceb00, ced00) failed
          Unable to handle kernel paging request at virtual address ffffffc0cec00000
          pgd = ffffffc045fc4000
          [ffffffc0cec00000] *pgd=0000000000000000
          Internal error: Oops: 9600004f [#1] PREEMPT SMP
          Modules linked in: exfatfs
          CPU: 1 PID: 23237 Comm: TimedEventQueue Not tainted 3.10.49-gc72ad36-dirty #1
          task: ffffffc03de52100 ti: ffffffc015388000 task.ti: ffffffc015388000
          PC is at memset+0xc8/0x1c0
          LR is at kernel_map_pages+0x1ec/0x244
      Fix this by calling kernel_map_pages to ensure the page is set in the
      page table properly
      Fixes: 3c605096 ("mm/page_alloc: restrict max order of merging on isolated pageblock")
      Signed-off-by: 's avatarLaura Abbott <lauraa@codeaurora.org>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Acked-by: 's avatarRik van Riel <riel@redhat.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Acked-by: 's avatarJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Gioh Kim <gioh.kim@lge.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: 's avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: 's avatarLinus Torvalds <torvalds@linux-foundation.org>
    • Mark Rutland's avatar
      mm/slub: fix lockups on PREEMPT && !SMP kernels · 859b7a0e
      Mark Rutland authored
      Commit 9aabf810 ("mm/slub: optimize alloc/free fastpath by removing
      preemption on/off") introduced an occasional hang for kernels built with
      The problem is the following loop the patch introduced to
      slab_alloc_node and slab_free:
          do {
              tid = this_cpu_read(s->cpu_slab->tid);
              c = raw_cpu_ptr(s->cpu_slab);
          } while (IS_ENABLED(CONFIG_PREEMPT) && unlikely(tid != c->tid));
      GCC 4.9 has been observed to hoist the load of c and c->tid above the
      loop for !SMP kernels (as in this case raw_cpu_ptr(x) is compile-time
      constant and does not force a reload).  On arm64 the generated assembly
      looks like:
               ldr     x4, [x0,#8]
               ldr     x1, [x0,#8]
               cmp     x1, x4
               b.ne    loop
      If the thread is preempted between the load of c->tid (into x1) and tid
      (into x4), and an allocation or free occurs in another thread (bumping
      the cpu_slab's tid), the thread will be stuck in the loop until
      s->cpu_slab->tid wraps, which may be forever in the absence of
      allocations/frees on the same CPU.
      This patch changes the loop condition to access c->tid with READ_ONCE.
      This ensures that the value is reloaded even when the compiler would
      otherwise assume it could cache the value, and also ensures that the
      load will not be torn.
      Signed-off-by: 's avatarMark Rutland <mark.rutland@arm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Acked-by: 's avatarChristoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Steve Capper <steve.capper@linaro.org>
      Signed-off-by: 's avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: 's avatarLinus Torvalds <torvalds@linux-foundation.org>
    • Gu Zheng's avatar
      mm/memory hotplug: postpone the reset of obsolete pgdat · b0dc3a34
      Gu Zheng authored
      Qiu Xishi reported the following BUG when testing hot-add/hot-remove node under
      stress condition:
        BUG: unable to handle kernel paging request at 0000000000025f60
        IP: next_online_pgdat+0x1/0x50
        PGD 0
        Oops: 0000 [#1] SMP
        ACPI: Device does not support D3cold
        Modules linked in: fuse nls_iso8859_1 nls_cp437 vfat fat loop dm_mod coretemp mperf crc32c_intel ghash_clmulni_intel aesni_intel ablk_helper cryptd lrw gf128mul glue_helper aes_x86_64 pcspkr microcode igb dca i2c_algo_bit ipv6 megaraid_sas iTCO_wdt i2c_i801 i2c_core iTCO_vendor_support tg3 sg hwmon ptp lpc_ich pps_core mfd_core acpi_pad rtc_cmos button ext3 jbd mbcache sd_mod crc_t10dif scsi_dh_alua scsi_dh_rdac scsi_dh_hp_sw scsi_dh_emc scsi_dh ahci libahci libata scsi_mod [last unloaded: rasf]
        CPU: 23 PID: 238 Comm: kworker/23:1 Tainted: G           O 3.10.15-5885-euler0302 #1
        Hardware name: HUAWEI TECHNOLOGIES CO.,LTD. Huawei N1/Huawei N1, BIOS V100R001 03/02/2015
        Workqueue: events vmstat_update
        task: ffffa800d32c0000 ti: ffffa800d32ae000 task.ti: ffffa800d32ae000
        RIP: 0010: next_online_pgdat+0x1/0x50
        RSP: 0018:ffffa800d32afce8  EFLAGS: 00010286
        RAX: 0000000000001440 RBX: ffffffff81da53b8 RCX: 0000000000000082
        RDX: 0000000000000000 RSI: 0000000000000082 RDI: 0000000000000000
        RBP: ffffa800d32afd28 R08: ffffffff81c93bfc R09: ffffffff81cbdc96
        R10: 00000000000040ec R11: 00000000000000a0 R12: ffffa800fffb3440
        R13: ffffa800d32afd38 R14: 0000000000000017 R15: ffffa800e6616800
        FS:  0000000000000000(0000) GS:ffffa800e6600000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 0000000000025f60 CR3: 0000000001a0b000 CR4: 00000000001407e0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        Call Trace:
      The cause is the "memset(pgdat, 0, sizeof(*pgdat))" at the end of
      try_offline_node, which will reset all the content of pgdat to 0, as the
      pgdat is accessed lock-free, so that the users still using the pgdat
      will panic, such as the vmstat_update routine.
      process A:				offline node XX:
             find online node XX
      					offline cpu and memory, then try_offline_node()
      					node_set_offline(nid), and memset(pgdat, 0, sizeof(*pgdat))
             zone = next_zone(zone)
               pg_data_t *pgdat = zone->zone_pgdat;  // here pgdat is NULL now
                   next_online_node(pgdat->node_id);  // NULL pointer access
      So the solution here is postponing the reset of obsolete pgdat from
      try_offline_node() to hotadd_new_pgdat(), and just resetting
      pgdat->nr_zones and pgdat->classzone_idx to be 0 rather than the memset
      0 to avoid breaking pointer information in pgdat.
      Signed-off-by: 's avatarGu Zheng <guz.fnst@cn.fujitsu.com>
      Reported-by: 's avatarXishi Qiu <qiuxishi@huawei.com>
      Suggested-by: 's avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Taku Izumi <izumi.taku@jp.fujitsu.com>
      Cc: Tang Chen <tangchen@cn.fujitsu.com>
      Cc: Xie XiuQi <xiexiuqi@huawei.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: 's avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: 's avatarLinus Torvalds <torvalds@linux-foundation.org>
    • Naoya Horiguchi's avatar
      mm/pagewalk.c: prevent positive return value of walk_page_test() from being passed to callers · f6837395
      Naoya Horiguchi authored
      walk_page_test() is purely pagewalk's internal stuff, and its positive
      return values are not intended to be passed to the callers of pagewalk.
      However, in the current code if the last vma in the do-while loop in
      walk_page_range() happens to return a positive value, it leaks outside
      walk_page_range().  So the user visible effect is invalid/unexpected
      return value (according to the reporter, mbind() causes it.)
      This patch fixes it simply by reinitializing the return value after
      Another exposed interface, walk_page_vma(), already returns 0 for such
      cases so no problem.
      Fixes: fafaa426 ("pagewalk: improve vma handling")
      Signed-off-by: 's avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: 's avatarKazutomo Yoshii <kazutomo.yoshii@gmail.com>
      Reported-by: 's avatarKazutomo Yoshii <kazutomo.yoshii@gmail.com>
      Acked-by: 's avatarDavid Rientjes <rientjes@google.com>
      Signed-off-by: 's avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: 's avatarLinus Torvalds <torvalds@linux-foundation.org>
    • Leon Yu's avatar
      mm: fix anon_vma->degree underflow in anon_vma endless growing prevention · 3fe89b3e
      Leon Yu authored
      I have constantly stumbled upon "kernel BUG at mm/rmap.c:399!" after
      upgrading to 3.19 and had no luck with 4.0-rc1 neither.
      So, after looking into new logic introduced by commit 7a3ef208 ("mm:
      prevent endless growth of anon_vma hierarchy"), I found chances are that
      unlink_anon_vmas() is called without incrementing dst->anon_vma->degree
      in anon_vma_clone() due to allocation failure.  If dst->anon_vma is not
      NULL in error path, its degree will be incorrectly decremented in
      unlink_anon_vmas() and eventually underflow when exiting as a result of
      another call to unlink_anon_vmas().  That's how "kernel BUG at
      mm/rmap.c:399!" is triggered for me.
      This patch fixes the underflow by dropping dst->anon_vma when allocation
      fails.  It's safe to do so regardless of original value of dst->anon_vma
      because dst->anon_vma doesn't have valid meaning if anon_vma_clone()
      fails.  Besides, callers don't care dst->anon_vma in such case neither.
      Also suggested by Michal Hocko, we can clean up vma_adjust() a bit as
      anon_vma_clone() now does the work.
      [akpm@linux-foundation.org: tweak comment]
      Fixes: 7a3ef208 ("mm: prevent endless growth of anon_vma hierarchy")
      Signed-off-by: 's avatarLeon Yu <chianglungyu@gmail.com>
      Signed-off-by: 's avatarKonstantin Khlebnikov <koct9i@gmail.com>
      Reviewed-by: 's avatarMichal Hocko <mhocko@suse.cz>
      Acked-by: 's avatarRik van Riel <riel@redhat.com>
      Acked-by: 's avatarDavid Rientjes <rientjes@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: 's avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: 's avatarLinus Torvalds <torvalds@linux-foundation.org>
  4. 24 Mar, 2015 1 commit
  5. 23 Mar, 2015 1 commit
    • Tejun Heo's avatar
      writeback: fix possible underflow in write bandwidth calculation · c72efb65
      Tejun Heo authored
      From 1ebf33901ecc75d9496862dceb1ef0377980587c Mon Sep 17 00:00:00 2001
      From: Tejun Heo <tj@kernel.org>
      Date: Mon, 23 Mar 2015 00:08:19 -0400
      2f800fbd ("writeback: fix dirtied pages accounting on redirty")
      introduced account_page_redirty() which reverts stat updates for a
      redirtied page, making BDI_DIRTIED no longer monotonically increasing.
      bdi_update_write_bandwidth() uses the delta in BDI_DIRTIED as the
      basis for bandwidth calculation.  While unlikely, since the above
      patch, the newer value may be lower than the recorded past value and
      underflow the bandwidth calculation leading to a wild result.
      Fix it by subtracing min of the old and new values when calculating
      delta.  AFAIK, there hasn't been any report of it happening but the
      resulting erratic behavior would be non-critical and temporary, so
      it's possible that the issue is happening without being reported.  The
      risk of the fix is very low, so tagged for -stable.
      Signed-off-by: 's avatarTejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Greg Thelen <gthelen@google.com>
      Fixes: 2f800fbd ("writeback: fix dirtied pages accounting on redirty")
      Cc: stable@vger.kernel.org
      Signed-off-by: 's avatarJens Axboe <axboe@fb.com>
  6. 13 Mar, 2015 7 commits
  7. 12 Mar, 2015 2 commits
    • Mel Gorman's avatar
      mm: thp: Return the correct value for change_huge_pmd · ba68bc01
      Mel Gorman authored
      The wrong value is being returned by change_huge_pmd since commit
      10c1045f ("mm: numa: avoid unnecessary TLB flushes when setting
      NUMA hinting entries") which allows a fallthrough that tries to adjust
      non-existent PTEs. This patch corrects it.
      Signed-off-by: 's avatarMel Gorman <mgorman@suse.de>
      Signed-off-by: 's avatarLinus Torvalds <torvalds@linux-foundation.org>
    • Linus Torvalds's avatar
      mm: fix up numa read-only thread grouping logic · 53da3bc2
      Linus Torvalds authored
      Dave Chinner reported that commit 4d942466 ("mm: convert
      p[te|md]_mknonnuma and remaining page table manipulations") slowed down
      his xfsrepair test enormously.  In particular, it was using more system
      time due to extra TLB flushing.
      The ultimate reason turns out to be how the change to use the regular
      page table accessor functions broke the NUMA grouping logic.  The old
      special mknuma/mknonnuma code accessed the page table present bit and
      the magic NUMA bit directly, while the new code just changes the page
      protections using PROT_NONE and the regular vma protections.
      That sounds equivalent, and from a fault standpoint it really is, but a
      subtle side effect is that the *other* protection bits of the page table
      entries also change.  And the code to decide how to group the NUMA
      entries together used the writable bit to decide whether a particular
      page was likely to be shared read-only or not.
      And with the change to make the NUMA handling use the regular permission
      setting functions, that writable bit was basically always cleared for
      private mappings due to COW.  So even if the page actually ends up being
      written to in the end, the NUMA balancing would act as if it was always
      shared RO.
      This code is a heuristic anyway, so the fix - at least for now - is to
      instead check whether the page is dirty rather than writable.  The bit
      doesn't change with protection changes.
      NOTE! This also adds a FIXME comment to revisit this issue,
      Not only should we probably re-visit the whole "is this a shared
      read-only page" heuristic (we might want to take the vma permissions
      into account and base this more on those than the per-page ones, and
      also look at whether the particular access that triggers it is a write
      or not), but the whole COW issue shows that we should think about the
      NUMA fault handling some more.
      For example, maybe we should do the early-COW thing that a regular fault
      does.  Or maybe we should accept that while using the same bits as
      PROTNONE was a good thing (and got rid of the specual NUMA bit), we
      might still want to just preseve the other protection bits across NUMA
      Those are bigger questions, left for later.  This just fixes up the
      heuristic so that it at least approximates working again.  More analysis
      and work needed.
      Reported-by: 's avatarDave Chinner <david@fromorbit.com>
      Tested-by: 's avatarMel Gorman <mgorman@suse.de>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Ingo Molnar <mingo@kernel.org>,
      Signed-off-by: 's avatarLinus Torvalds <torvalds@linux-foundation.org>
  8. 04 Mar, 2015 1 commit
    • Tejun Heo's avatar
      writeback: add missing INITIAL_JIFFIES init in global_update_bandwidth() · 7d70e154
      Tejun Heo authored
      global_update_bandwidth() uses static variable update_time as the
      timestamp for the last update but forgets to initialize it to
      This means that global_dirty_limit will be 5 mins into the future on
      32bit and some large amount jiffies into the past on 64bit.  This
      isn't critical as the only effect is that global_dirty_limit won't be
      updated for the first 5 mins after booting on 32bit machines,
      especially given the auxiliary nature of global_dirty_limit's role -
      protecting against global dirty threshold's sudden dips; however, it
      does lead to unintended suboptimal behavior.  Fix it.
      Fixes: c42843f2 ("writeback: introduce smoothed global dirty limit")
      Signed-off-by: 's avatarTejun Heo <tj@kernel.org>
      Acked-by: 's avatarJan Kara <jack@suse.cz>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: stable@vger.kernel.org
      Signed-off-by: 's avatarJens Axboe <axboe@fb.com>
  9. 28 Feb, 2015 4 commits
    • Johannes Weiner's avatar
      mm: page_alloc: revert inadvertent !__GFP_FS retry behavior change · cc873177
      Johannes Weiner authored
      Historically, !__GFP_FS allocations were not allowed to invoke the OOM
      killer once reclaim had failed, but nevertheless kept looping in the
      Commit 9879de73 ("mm: page_alloc: embed OOM killing naturally into
      allocation slowpath"), which should have been a simple cleanup patch,
      accidentally changed the behavior to aborting the allocation at that
      point.  This creates problems with filesystem callers (?) that currently
      rely on the allocator waiting for other tasks to intervene.
      Revert the behavior as it shouldn't have been changed as part of a
      cleanup patch.
      Fixes: 9879de73 ("mm: page_alloc: embed OOM killing naturally into allocation slowpath")
      Signed-off-by: 's avatarJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: 's avatarMichal Hocko <mhocko@suse.cz>
      Reported-by: 's avatarTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Dave Chinner <david@fromorbit.com>
      Acked-by: 's avatarDavid Rientjes <rientjes@google.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: <stable@vger.kernel.org>	[3.19.x]
      Signed-off-by: 's avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: 's avatarLinus Torvalds <torvalds@linux-foundation.org>
    • Johannes Weiner's avatar
      mm: memcontrol: use "max" instead of "infinity" in control knobs · d2973697
      Johannes Weiner authored
      The memcg control knobs indicate the highest possible value using the
      symbolic name "infinity", which is long and awkward to type.
      Switch to the string "max", which is just as descriptive but shorter and
      This changes a user interface, so do it before the release and before
      the development flag is dropped from the default hierarchy.
      Signed-off-by: 's avatarJohannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Signed-off-by: 's avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: 's avatarLinus Torvalds <torvalds@linux-foundation.org>
    • Michal Hocko's avatar
      memcg: fix low limit calculation · 4e54dede
      Michal Hocko authored
      A memcg is considered low limited even when the current usage is equal to
      the low limit.  This leads to interesting side effects e.g.
      groups/hierarchies with no memory accounted are considered protected and
      so the reclaim will emit MEMCG_LOW event when encountering them.
      Another and much bigger issue was reported by Joonsoo Kim.  He has hit a
      NULL ptr dereference with the legacy cgroup API which even doesn't have
      low limit exposed.  The limit is 0 by default but the initial check fails
      for memcg with 0 consumption and parent_mem_cgroup() would return NULL if
      use_hierarchy is 0 and so page_counter_read would try to dereference NULL.
      I suppose that the current implementation is just an overlook because the
      documentation in Documentation/cgroups/unified-hierarchy.txt says:
        "The memory.low boundary on the other hand is a top-down allocated
        reserve.  A cgroup enjoys reclaim protection when it and all its
        ancestors are below their low boundaries"
      Fix the usage and the low limit comparision in mem_cgroup_low accordingly.
      Fixes: 241994ed (mm: memcontrol: default hierarchy interface for memory)
      Reported-by: 's avatarJoonsoo Kim <js1304@gmail.com>
      Signed-off-by: 's avatarMichal Hocko <mhocko@suse.cz>
      Acked-by: 's avatarJohannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: 's avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: 's avatarLinus Torvalds <torvalds@linux-foundation.org>
    • Joonsoo Kim's avatar
      mm/nommu: fix memory leak · da616534
      Joonsoo Kim authored
      Maxime reported the following memory leak regression due to commit
      dbc8358c ("mm/nommu: use alloc_pages_exact() rather than its own
      On v3.19, I am facing a memory leak.  Each time I run a command one page
      is lost.  Here an example with busybox's free command:
        / # free
                     total       used       free     shared    buffers     cached
        Mem:          7928       1972       5956          0          0        492
        -/+ buffers/cache:       1480       6448
        / # free
                     total       used       free     shared    buffers     cached
        Mem:          7928       1976       5952          0          0        492
        -/+ buffers/cache:       1484       6444
        / # free
                     total       used       free     shared    buffers     cached
        Mem:          7928       1980       5948          0          0        492
        -/+ buffers/cache:       1488       6440
        / # free
                     total       used       free     shared    buffers     cached
        Mem:          7928       1984       5944          0          0        492
        -/+ buffers/cache:       1492       6436
        / # free
                     total       used       free     shared    buffers     cached
        Mem:          7928       1988       5940          0          0        492
        -/+ buffers/cache:       1496       6432
      At some point, the system fails to sastisfy 256KB allocations:
        free: page allocation failure: order:6, mode:0xd0
        CPU: 0 PID: 67 Comm: free Not tainted 3.19.0-05389-gacf2cf1-dirty #64
        Hardware name: STM32 (Device Tree Support)
        Normal per-cpu:
        CPU    0: hi:    0, btch:   1 usd:   0
        active_anon:0 inactive_anon:0 isolated_anon:0
         active_file:0 inactive_file:0 isolated_file:0
         unevictable:123 dirty:0 writeback:0 unstable:0
         free:1515 slab_reclaimable:17 slab_unreclaimable:139
         mapped:0 shmem:0 pagetables:0 bounce:0
        Normal free:6060kB min:352kB low:440kB high:528kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:492kB isolated(anon):0ks
        lowmem_reserve[]: 0 0
        Normal: 23*4kB (U) 22*8kB (U) 24*16kB (U) 23*32kB (U) 23*64kB (U) 23*128kB (U) 1*256kB (U) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 6060kB
        123 total pagecache pages
        2048 pages of RAM
        1538 free pages
        66 reserved pages
        109 slab pages
        -46 pages shared
        0 pages swap cached
        nommu: Allocation of length 221184 from process 67 (free) failed
        Normal per-cpu:
        CPU    0: hi:    0, btch:   1 usd:   0
        active_anon:0 inactive_anon:0 isolated_anon:0
         active_file:0 inactive_file:0 isolated_file:0
         unevictable:123 dirty:0 writeback:0 unstable:0
         free:1515 slab_reclaimable:17 slab_unreclaimable:139
         mapped:0 shmem:0 pagetables:0 bounce:0
        Normal free:6060kB min:352kB low:440kB high:528kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:492kB isolated(anon):0ks
        lowmem_reserve[]: 0 0
        Normal: 23*4kB (U) 22*8kB (U) 24*16kB (U) 23*32kB (U) 23*64kB (U) 23*128kB (U) 1*256kB (U) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 6060kB
        123 total pagecache pages
        Unable to allocate RAM for process text/data, errno 12 SEGV
      This problem happens because we allocate ordered page through
      __get_free_pages() in do_mmap_private() in some cases and we try to free
      individual pages rather than ordered page in free_page_series().  In
      this case, freeing pages whose refcount is not 0 won't be freed to the
      page allocator so memory leak happens.
      To fix the problem, this patch changes __get_free_pages() to
      alloc_pages_exact() since alloc_pages_exact() returns
      physically-contiguous pages but each pages are refcounted.
      Fixes: dbc8358c ("mm/nommu: use alloc_pages_exact() rather than its own implementation").
      Reported-by: 's avatarMaxime Coquelin <mcoquelin.stm32@gmail.com>
      Tested-by: 's avatarMaxime Coquelin <mcoquelin.stm32@gmail.com>
      Signed-off-by: 's avatarJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: <stable@vger.kernel.org>	[3.19]
      Signed-off-by: 's avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: 's avatarLinus Torvalds <torvalds@linux-foundation.org>
  10. 23 Feb, 2015 1 commit
    • Sasha Levin's avatar
      mm: shmem: check for mapping owner before dereferencing · f0774d88
      Sasha Levin authored
      mapping->host can be NULL and shouldn't be dereferenced before being checked.
      [ 1295.741844] GPF could be caused by NULL-ptr deref or user memory accessgeneral protection fault: 0000 [#1] SMP KASAN
      [ 1295.746387] Dumping ftrace buffer:
      [ 1295.748217]    (ftrace buffer empty)
      [ 1295.749527] Modules linked in:
      [ 1295.750268] CPU: 62 PID: 23410 Comm: trinity-c70 Not tainted 3.19.0-next-20150219-sasha-00045-g9130270f #1939
      [ 1295.750268] task: ffff8803a49db000 ti: ffff8803a4dc8000 task.ti: ffff8803a4dc8000
      [ 1295.750268] RIP: shmem_mapping (mm/shmem.c:1458)
      [ 1295.750268] RSP: 0000:ffff8803a4dcfbf8  EFLAGS: 00010206
      [ 1295.750268] RAX: dffffc0000000000 RBX: 0000000000000000 RCX: 00000000000f2804
      [ 1295.750268] RDX: 0000000000000005 RSI: 0400000000000794 RDI: 0000000000000028
      [ 1295.750268] RBP: ffff8803a4dcfc08 R08: 0000000000000000 R09: 00000000031de000
      [ 1295.750268] R10: dffffc0000000000 R11: 00000000031c1000 R12: 0400000000000794
      [ 1295.750268] R13: 00000000031c2000 R14: 00000000031de000 R15: ffff880e3bdc1000
      [ 1295.750268] FS:  00007f8703c7e700(0000) GS:ffff881164800000(0000) knlGS:0000000000000000
      [ 1295.750268] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 1295.750268] CR2: 0000000004e58000 CR3: 00000003a9f3c000 CR4: 00000000000007a0
      [ 1295.750268] DR0: ffffffff81000000 DR1: 0000009494949494 DR2: 0000000000000000
      [ 1295.750268] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 00000000000d0602
      [ 1295.750268] Stack:
      [ 1295.750268]  ffff8803a4dcfec8 ffffffffbb1dc770 ffff8803a4dcfc38 ffffffffad6f230b
      [ 1295.750268]  ffffffffad6f2b0d 0000014100000000 ffff88001e17c08b ffff880d9453fe08
      [ 1295.750268]  ffff8803a4dcfd18 ffffffffad6f2ce2 ffff8803a49dbcd8 ffff8803a49dbce0
      [ 1295.750268] Call Trace:
      [ 1295.750268] mincore_page (mm/mincore.c:61)
      [ 1295.750268] ? mincore_pte_range (include/linux/spinlock.h:312 mm/mincore.c:131)
      [ 1295.750268] mincore_pte_range (mm/mincore.c:151)
      [ 1295.750268] ? mincore_unmapped_range (mm/mincore.c:113)
      [ 1295.750268] __walk_page_range (mm/pagewalk.c:51 mm/pagewalk.c:90 mm/pagewalk.c:116 mm/pagewalk.c:204)
      [ 1295.750268] walk_page_range (mm/pagewalk.c:275)
      [ 1295.750268] SyS_mincore (mm/mincore.c:191 mm/mincore.c:253 mm/mincore.c:220)
      [ 1295.750268] ? mincore_pte_range (mm/mincore.c:220)
      [ 1295.750268] ? mincore_unmapped_range (mm/mincore.c:113)
      [ 1295.750268] ? __mincore_unmapped_range (mm/mincore.c:105)
      [ 1295.750268] ? ptlock_free (mm/mincore.c:24)
      [ 1295.750268] ? syscall_trace_enter (arch/x86/kernel/ptrace.c:1610)
      [ 1295.750268] ia32_do_call (arch/x86/ia32/ia32entry.S:446)
      [ 1295.750268] Code: e5 48 c1 ea 03 53 48 89 fb 48 83 ec 08 80 3c 02 00 75 4f 48 b8 00 00 00 00 00 fc ff df 48 8b 1b 48 8d 7b 28 48 89 fa 48 c1 ea 03 <80> 3c 02 00 75 3f 48 b8 00 00 00 00 00 fc ff df 48 8b 5b 28 48
      All code
         0:	e5 48                	in     $0x48,%eax
         2:	c1 ea 03             	shr    $0x3,%edx
         5:	53                   	push   %rbx
         6:	48 89 fb             	mov    %rdi,%rbx
         9:	48 83 ec 08          	sub    $0x8,%rsp
         d:	80 3c 02 00          	cmpb   $0x0,(%rdx,%rax,1)
        11:	75 4f                	jne    0x62
        13:	48 b8 00 00 00 00 00 	movabs $0xdffffc0000000000,%rax
        1a:	fc ff df
        1d:	48 8b 1b             	mov    (%rbx),%rbx
        20:	48 8d 7b 28          	lea    0x28(%rbx),%rdi
        24:	48 89 fa             	mov    %rdi,%rdx
        27:	48 c1 ea 03          	shr    $0x3,%rdx
        2b:*	80 3c 02 00          	cmpb   $0x0,(%rdx,%rax,1)		<-- trapping instruction
        2f:	75 3f                	jne    0x70
        31:	48 b8 00 00 00 00 00 	movabs $0xdffffc0000000000,%rax
        38:	fc ff df
        3b:	48 8b 5b 28          	mov    0x28(%rbx),%rbx
        3f:	48                   	rex.W
      Code starting with the faulting instruction
         0:	80 3c 02 00          	cmpb   $0x0,(%rdx,%rax,1)
         4:	75 3f                	jne    0x45
         6:	48 b8 00 00 00 00 00 	movabs $0xdffffc0000000000,%rax
         d:	fc ff df
        10:	48 8b 5b 28          	mov    0x28(%rbx),%rbx
        14:	48                   	rex.W
      [ 1295.750268] RIP shmem_mapping (mm/shmem.c:1458)
      [ 1295.750268]  RSP <ffff8803a4dcfbf8>
      Fixes: 97b713ba ("fs: kill BDI_CAP_SWAP_BACKED")
      Signed-off-by: 's avatarSasha Levin <sasha.levin@oracle.com>
      Reviewed-by: 's avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: 's avatarJens Axboe <axboe@fb.com>
  11. 22 Feb, 2015 1 commit
    • David Howells's avatar
      VFS: (Scripted) Convert S_ISLNK/DIR/REG(dentry->d_inode) to d_is_*(dentry) · e36cb0b8
      David Howells authored
      Convert the following where appropriate:
       (1) S_ISLNK(dentry->d_inode) to d_is_symlink(dentry).
       (2) S_ISREG(dentry->d_inode) to d_is_reg(dentry).
       (3) S_ISDIR(dentry->d_inode) to d_is_dir(dentry).  This is actually more
           complicated than it appears as some calls should be converted to
           d_can_lookup() instead.  The difference is whether the directory in
           question is a real dir with a ->lookup op or whether it's a fake dir with
           a ->d_automount op.
      In some circumstances, we can subsume checks for dentry->d_inode not being
      NULL into this, provided we the code isn't in a filesystem that expects
      d_inode to be NULL if the dirent really *is* negative (ie. if we're going to
      use d_inode() rather than d_backing_inode() to get the inode pointer).
      Note that the dentry type field may be set to something other than
      DCACHE_MISS_TYPE when d_inode is NULL in the case of unionmount, where the VFS
      manages the fall-through from a negative dentry to a lower layer.  In such a
      case, the dentry type of the negative union dentry is set to the same as the
      type of the lower dentry.
      However, if you know d_inode is not NULL at the call site, then you can use
      the d_is_xxx() functions even in a filesystem.
      There is one further complication: a 0,0 chardev dentry may be labelled
      DCACHE_WHITEOUT_TYPE rather than DCACHE_SPECIAL_TYPE.  Strictly, this was
      intended for special directory entry types that don't have attached inodes.
      The following perl+coccinelle script was used:
      use strict;
      my @callers;
      open($fd, 'git grep -l \'S_IS[A-Z].*->d_inode\' |') ||
          die "Can't grep for S_ISDIR and co. callers";
      @callers = <$fd>;
      unless (@callers) {
          print "No matches\n";
      my @cocci = (
          'expression E;',
          '- S_ISLNK(E->d_inode->i_mode)',
          '+ d_is_symlink(E)',
          'expression E;',
          '- S_ISDIR(E->d_inode->i_mode)',
          '+ d_is_dir(E)',
          'expression E;',
          '- S_ISREG(E->d_inode->i_mode)',
          '+ d_is_reg(E)' );
      my $coccifile = "tmp.sp.cocci";
      open($fd, ">$coccifile") || die $coccifile;
      print($fd "$_\n") || die $coccifile foreach (@cocci);
      foreach my $file (@callers) {
          chomp $file;
          print "Processing ", $file, "\n";
          system("spatch", "--sp-file", $coccifile, $file, "--in-place", "--no-show-diff") == 0 ||
      	die "spatch failed";
      [AV: overlayfs parts skipped]
      Signed-off-by: 's avatarDavid Howells <dhowells@redhat.com>
      Signed-off-by: 's avatarAl Viro <viro@zeniv.linux.org.uk>
  12. 18 Feb, 2015 2 commits
  13. 17 Feb, 2015 3 commits