1. 26 Oct, 2018 1 commit
    • Daniel Borkmann's avatar
      bpf: add bpf_jit_limit knob to restrict unpriv allocations · ede95a63
      Daniel Borkmann authored
      Rick reported that the BPF JIT could potentially fill the entire module
      space with BPF programs from unprivileged users which would prevent later
      attempts to load normal kernel modules or privileged BPF programs, for
      example. If JIT was enabled but unsuccessful to generate the image, then
      before commit 290af866 ("bpf: introduce BPF_JIT_ALWAYS_ON config")
      we would always fall back to the BPF interpreter. Nowadays in the case
      where the CONFIG_BPF_JIT_ALWAYS_ON could be set, then the load will abort
      with a failure since the BPF interpreter was compiled out.
      Add a global limit and enforce it for unprivileged users such that in case
      of BPF interpreter compiled out we fail once the limit has been reached
      or we fall back to BPF interpreter earlier w/o using module mem if latter
      was compiled in. In a next step, fair share among unprivileged users can
      be resolved in particular for the case where we would fail hard once limit
      is reached.
      Fixes: 290af866 ("bpf: introduce BPF_JIT_ALWAYS_ON config")
      Fixes: 0a14842f ("net: filter: Just In Time compiler for x86-64")
      Co-Developed-by: default avatarRick Edgecombe <rick.p.edgecombe@intel.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: LKML <linux-kernel@vger.kernel.org>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
  2. 09 Sep, 2018 1 commit
    • Henrik Austad's avatar
      Drop all 00-INDEX files from Documentation/ · a7ddcea5
      Henrik Austad authored
      This is a respin with a wider audience (all that get_maintainer returned)
      and I know this spams a *lot* of people. Not sure what would be the correct
      way, so my apologies for ruining your inbox.
      The 00-INDEX files are supposed to give a summary of all files present
      in a directory, but these files are horribly out of date and their
      usefulness is brought into question. Often a simple "ls" would reveal
      the same information as the filenames are generally quite descriptive as
      a short introduction to what the file covers (it should not surprise
      anyone what Documentation/sched/sched-design-CFS.txt covers)
      A few years back it was mentioned that these files were no longer really
      needed, and they have since then grown further out of date, so perhaps
      it is time to just throw them out.
      A short status yields the following _outdated_ 00-INDEX files, first
      counter is files listed in 00-INDEX but missing in the directory, last
      is files present but not listed in 00-INDEX.
      List of outdated 00-INDEX:
      Documentation: (4/10)
      Documentation/sysctl: (0/1)
      Documentation/timers: (1/0)
      Documentation/blockdev: (3/1)
      Documentation/w1/slaves: (0/1)
      Documentation/locking: (0/1)
      Documentation/devicetree: (0/5)
      Documentation/power: (1/1)
      Documentation/powerpc: (0/5)
      Documentation/arm: (1/0)
      Documentation/x86: (0/9)
      Documentation/x86/x86_64: (1/1)
      Documentation/scsi: (4/4)
      Documentation/filesystems: (2/9)
      Documentation/filesystems/nfs: (0/2)
      Documentation/cgroup-v1: (0/2)
      Documentation/kbuild: (0/4)
      Documentation/spi: (1/0)
      Documentation/virtual/kvm: (1/0)
      Documentation/scheduler: (0/2)
      Documentation/fb: (0/1)
      Documentation/block: (0/1)
      Documentation/networking: (6/37)
      Documentation/vm: (1/3)
      Then there are 364 subdirectories in Documentation/ with several files that
      are missing 00-INDEX alltogether (and another 120 with a single file and no
      I don't really have an opinion to whether or not we /should/ have 00-INDEX,
      but the above 00-INDEX should either be removed or be kept up to date. If
      we should keep the files, I can try to keep them updated, but I rather not
      if we just want to delete them anyway.
      As a starting point, remove all index-files and references to 00-INDEX and
      see where the discussion is going.
      Signed-off-by: default avatarHenrik Austad <henrik@austad.us>
      Acked-by: default avatar"Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Just-do-it-by: default avatarSteven Rostedt <rostedt@goodmis.org>
      Reviewed-by: default avatarJens Axboe <axboe@kernel.dk>
      Acked-by: default avatarPaul Moore <paul@paul-moore.com>
      Acked-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Acked-by: default avatarMark Brown <broonie@kernel.org>
      Acked-by: default avatarMike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: [Almost everybody else]
      Signed-off-by: default avatarJonathan Corbet <corbet@lwn.net>
  3. 04 Sep, 2018 1 commit
  4. 24 Aug, 2018 1 commit
  5. 22 Aug, 2018 3 commits
  6. 26 Jul, 2018 1 commit
  7. 08 Jul, 2018 1 commit
  8. 26 Jun, 2018 1 commit
  9. 03 May, 2018 1 commit
    • Wang YanQing's avatar
      bpf, x86_32: add eBPF JIT compiler for ia32 · 03f5781b
      Wang YanQing authored
      The JIT compiler emits ia32 bit instructions. Currently, It supports eBPF
      only. Classic BPF is supported because of the conversion by BPF core.
      Almost all instructions from eBPF ISA supported except the following:
      BPF_ALU64 | BPF_DIV | BPF_K
      BPF_ALU64 | BPF_DIV | BPF_X
      BPF_ALU64 | BPF_MOD | BPF_K
      BPF_ALU64 | BPF_MOD | BPF_X
      It doesn't support BPF_JMP|BPF_CALL with BPF_PSEUDO_CALL at the moment.
      IA32 has few general purpose registers, EAX|EDX|ECX|EBX|ESI|EDI. I use
      EAX|EDX|ECX|EBX as temporary registers to simulate instructions in eBPF
      ISA, and allocate ESI|EDI to BPF_REG_AX for constant blinding, all others
      eBPF registers, R0-R10, are simulated through scratch space on stack.
      The reasons behind the hardware registers allocation policy are:
      1:MUL need EAX:EDX, shift operation need ECX, so they aren't fit
        for general eBPF 64bit register simulation.
      2:We need at least 4 registers to simulate most eBPF ISA operations
        on registers operands instead of on register&memory operands.
      3:We need to put BPF_REG_AX on hardware registers, or constant blinding
        will degrade jit performance heavily.
      Tested on PC (Intel(R) Core(TM) i5-5200U CPU).
      Testing results on i5-5200U:
      1) test_bpf: Summary: 349 PASSED, 0 FAILED, [319/341 JIT'ed]
      2) test_progs: Summary: 83 PASSED, 0 FAILED.
      3) test_lpm: OK
      4) test_lru_map: OK
      5) test_verifier: Summary: 828 PASSED, 0 FAILED.
      Above tests are all done in following two conditions separately:
      1:bpf_jit_enable=1 and bpf_jit_harden=0
      2:bpf_jit_enable=1 and bpf_jit_harden=2
      Below are some numbers for this jit implementation:
        I run test_progs in kselftest 100 times continuously for every condition,
        the numbers are in format: total/times=avg.
        The numbers that test_bpf reports show almost the same relation.
      a:jit_enable=0 and jit_harden=0            b:jit_enable=1 and jit_harden=0
        test_pkt_access:PASS:ipv4:15622/100=156    test_pkt_access:PASS:ipv4:10674/100=106
        test_pkt_access:PASS:ipv6:9130/100=91      test_pkt_access:PASS:ipv6:4855/100=48
        test_xdp:PASS:ipv4:240198/100=2401         test_xdp:PASS:ipv4:138912/100=1389
        test_xdp:PASS:ipv6:137326/100=1373         test_xdp:PASS:ipv6:68542/100=685
        test_l4lb:PASS:ipv4:61100/100=611          test_l4lb:PASS:ipv4:37302/100=373
        test_l4lb:PASS:ipv6:101000/100=1010        test_l4lb:PASS:ipv6:55030/100=550
      c:jit_enable=1 and jit_harden=2
      The numbers show we get 30%~50% improvement.
      See Documentation/networking/filter.txt for more information.
       Changes v5-v6:
       1:Add do {} while (0) to RETPOLINE_RAX_BPF_JIT for
         consistence reason.
       2:Clean up non-standard comments, reported by Daniel Borkmann.
       3:Fix a memory leak issue, repoted by Daniel Borkmann.
       Changes v4-v5:
       1:Delete is_on_stack, BPF_REG_AX is the only one
         on real hardware registers, so just check with
       2:Apply commit 1612a981 ("bpf, x64: fix JIT emission
         for dead code"), suggested by Daniel Borkmann.
       Changes v3-v4:
       1:Fix changelog in commit.
         I install llvm-6.0, then test_progs willn't report errors.
         I submit another patch:
         "bpf: fix misaligned access for BPF_PROG_TYPE_PERF_EVENT program type on x86_32 platform"
         to fix another problem, after that patch, test_verifier willn't report errors too.
       2:Fix clear r0[1] twice unnecessarily in *BPF_IND|BPF_ABS* simulation.
       Changes v2-v3:
       1:Move BPF_REG_AX to real hardware registers for performance reason.
       3:Using bpf_load_pointer instead of bpf_jit32.S, suggested by Daniel Borkmann.
       4:Delete partial codes in 1c2a088a, suggested by Daniel Borkmann.
       5:Some bug fixes and comments improvement.
       Changes v1-v2:
       1:Fix bug in emit_ia32_neg64.
       2:Fix bug in emit_ia32_arsh_r64.
       3:Delete filename in top level comment, suggested by Thomas Gleixner.
       4:Delete unnecessary boiler plate text, suggested by Thomas Gleixner.
       5:Rewrite some words in changelog.
       6:CodingSytle improvement and a little more comments.
      Signed-off-by: default avatarWang YanQing <udknight@gmail.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
  10. 27 Apr, 2018 1 commit
  11. 16 Apr, 2018 1 commit
  12. 11 Apr, 2018 3 commits
  13. 09 Mar, 2018 1 commit
    • Eric Dumazet's avatar
      net: do not create fallback tunnels for non-default namespaces · 79134e6c
      Eric Dumazet authored
      fallback tunnels (like tunl0, gre0, gretap0, erspan0, sit0,
      ip6tnl0, ip6gre0) are automatically created when the corresponding
      module is loaded.
      These tunnels are also automatically created when a new network
      namespace is created, at a great cost.
      In many cases, netns are used for isolation purposes, and these
      extra network devices are a waste of resources. We are using
      thousands of netns per host, and hit the netns creation/delete
      bottleneck a lot. (Many thanks to Kirill for recent work on this)
      Add a new sysctl so that we can opt-out from this automatic creation.
      Note that these tunnels are still created for the initial namespace,
      to be the least intrusive for typical setups.
      lpk43:~# cat add_del_unshare.sh
      for i in `seq 1 40`
       (for j in `seq 1 100` ; do  unshare -n /bin/true >/dev/null ; done) &
      lpk43:~# echo 0 >/proc/sys/net/core/fb_tunnels_only_for_init_net
      lpk43:~# time ./add_del_unshare.sh
      real	0m37.521s
      user	0m0.886s
      sys	7m7.084s
      lpk43:~# echo 1 >/proc/sys/net/core/fb_tunnels_only_for_init_net
      lpk43:~# time ./add_del_unshare.sh
      real	0m4.761s
      user	0m0.851s
      sys	1m8.343s
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  14. 07 Feb, 2018 1 commit
  15. 01 Feb, 2018 1 commit
    • Michal Hocko's avatar
      mm, hugetlb: remove hugepages_treat_as_movable sysctl · d6cb41cc
      Michal Hocko authored
      hugepages_treat_as_movable has been introduced by 396faf03 ("Allow
      huge page allocations to use GFP_HIGH_MOVABLE") to allow hugetlb
      allocations from ZONE_MOVABLE even when hugetlb pages were not
      migrateable.  The purpose of the movable zone was different at the time.
      It aimed at reducing memory fragmentation and hugetlb pages being long
      lived and large werre not contributing to the fragmentation so it was
      acceptable to use the zone back then.
      Things have changed though and the primary purpose of the zone became
      migratability guarantee.  If we allow non migrateable hugetlb pages to
      be in ZONE_MOVABLE memory hotplug might fail to offline the memory.
      Remove the knob and only rely on hugepage_migration_supported to allow
      movable zones.
      Mel said:
      : Primarily it was aimed at allowing the hugetlb pool to safely shrink with
      : the ability to grow it again.  The use case was for batched jobs, some of
      : which needed huge pages and others that did not but didn't want the memory
      : useless pinned in the huge pages pool.
      : I suspect that more users rely on THP than hugetlbfs for flexible use of
      : huge pages with fallback options so I think that removing the option
      : should be ok.
      Link: http://lkml.kernel.org/r/20171003072619.8654-1-mhocko@kernel.orgSigned-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Reported-by: default avatarAlexandru Moise <00moses.alexander00@gmail.com>
      Acked-by: default avatarMel Gorman <mgorman@suse.de>
      Cc: Alexandru Moise <00moses.alexander00@gmail.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
  16. 21 Dec, 2017 1 commit
  17. 20 Dec, 2017 1 commit
  18. 11 Dec, 2017 1 commit
  19. 30 Nov, 2017 1 commit
    • Michal Hocko's avatar
      Revert "mm/page-writeback.c: print a warning if the vm dirtiness settings are illogical" · 90daf306
      Michal Hocko authored
      This reverts commit 0f6d24f8 ("mm/page-writeback.c: print a warning
      if the vm dirtiness settings are illogical") because it causes false
      positive warnings during OOM situations as noticed by Tetsuo Handa:
        Node 0 active_anon:3525940kB inactive_anon:8372kB active_file:216kB inactive_file:1872kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:2504kB dirty:52kB writeback:0kB shmem:8660kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 636928kB writeback_tmp:0kB unstable:0kB all_unreclaimable? yes
        Node 0 DMA free:14848kB min:284kB low:352kB high:420kB active_anon:992kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15988kB managed:15904kB mlocked:0kB kernel_stack:0kB pagetables:24kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
        lowmem_reserve[]: 0 2687 3645 3645
        Node 0 DMA32 free:53004kB min:49608kB low:62008kB high:74408kB active_anon:2712648kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:3129216kB managed:2773132kB mlocked:0kB kernel_stack:96kB pagetables:5096kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
        lowmem_reserve[]: 0 0 958 958
        Node 0 Normal free:17140kB min:17684kB low:22104kB high:26524kB active_anon:812300kB inactive_anon:8372kB active_file:1228kB inactive_file:1868kB unevictable:0kB writepending:52kB present:1048576kB managed:981224kB mlocked:0kB kernel_stack:3520kB pagetables:8552kB bounce:0kB free_pcp:120kB local_pcp:120kB free_cma:0kB
        lowmem_reserve[]: 0 0 0 0
        Out of memory: Kill process 8459 (a.out) score 999 or sacrifice child
        Killed process 8459 (a.out) total-vm:4180kB, anon-rss:88kB, file-rss:0kB, shmem-rss:0kB
        oom_reaper: reaped process 8459 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
        vm direct limit must be set greater than background limit.
      The problem is that both thresh and bg_thresh will be 0 if
      available_memory is less than 4 pages when evaluating
      While this might be worked around the whole point of the warning is
      dubious at best.  We do rely on admins to do sensible things when
      changing tunable knobs.  Dirty memory writeback knobs are not any
      special in that regards so revert the warning rather than adding more
      hacks to work this around.
      Debugged by Yafang Shao.
      Link: http://lkml.kernel.org/r/20171127091939.tahb77nznytcxw55@dhcp22.suse.cz
      Fixes: 0f6d24f8 ("mm/page-writeback.c: print a warning if the vm dirtiness settings are illogical")
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Reported-by: default avatarTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Yafang Shao <laoar.shao@gmail.com>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
  20. 18 Nov, 2017 1 commit
  21. 16 Nov, 2017 4 commits
  22. 03 Oct, 2017 1 commit
  23. 07 Sep, 2017 1 commit
    • Michal Hocko's avatar
      mm, page_alloc: rip out ZONELIST_ORDER_ZONE · c9bff3ee
      Michal Hocko authored
      Patch series "cleanup zonelists initialization", v1.
      This is aimed at cleaning up the zonelists initialization code we have
      but the primary motivation was bug report [2] which got resolved but the
      usage of stop_machine is just too ugly to live.  Most patches are
      straightforward but 3 of them need a special consideration.
      Patch 1 removes zone ordered zonelists completely.  I am CCing linux-api
      because this is a user visible change.  As I argue in the patch
      description I do not think we have a strong usecase for it these days.
      I have kept sysctl in place and warn into the log if somebody tries to
      configure zone lists ordering.  If somebody has a real usecase for it we
      can revert this patch but I do not expect anybody will actually notice
      runtime differences.  This patch is not strictly needed for the rest but
      it made patch 6 easier to implement.
      Patch 7 removes stop_machine from build_all_zonelists without adding any
      special synchronization between iterators and updater which I _believe_
      is acceptable as explained in the changelog.  I hope I am not missing
      Patch 8 then removes zonelists_mutex which is kind of ugly as well and
      not really needed AFAICS but a care should be taken when double checking
      my thinking.
      This patch (of 9):
      Supporting zone ordered zonelists costs us just a lot of code while the
      usefulness is arguable if existent at all.  Mel has already made node
      ordering default on 64b systems.  32b systems are still using
      ZONELIST_ORDER_ZONE because it is considered better to fallback to a
      different NUMA node rather than consume precious lowmem zones.
      This argument is, however, weaken by the fact that the memory reclaim
      has been reworked to be node rather than zone oriented.  This means that
      lowmem requests have to skip over all highmem pages on LRUs already and
      so zone ordering doesn't save the reclaim time much.  So the only
      advantage of the zone ordering is under a light memory pressure when
      highmem requests do not ever hit into lowmem zones and the lowmem
      pressure doesn't need to reclaim.
      Considering that 32b NUMA systems are rather suboptimal already and it
      is generally advisable to use 64b kernel on such a HW I believe we
      should rather care about the code maintainability and just get rid of
      ZONELIST_ORDER_ZONE altogether.  Keep systcl in place and warn if
      somebody tries to set zone ordering either from kernel command line or
      the sysctl.
      [mhocko@suse.com: reading vm.numa_zonelist_order will never terminate]
      Link: http://lkml.kernel.org/r/20170721143915.14161-2-mhocko@kernel.orgSigned-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarMel Gorman <mgorman@suse.de>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Shaohua Li <shaohua.li@intel.com>
      Cc: Toshi Kani <toshi.kani@hpe.com>
      Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
      Cc: <linux-api@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
  24. 24 Aug, 2017 1 commit
  25. 22 Aug, 2017 1 commit
  26. 21 Aug, 2017 1 commit
  27. 18 Aug, 2017 1 commit
  28. 17 Aug, 2017 1 commit
  29. 14 Aug, 2017 1 commit
    • Tyler Hicks's avatar
      seccomp: Sysctl to display available actions · 8e5f1ad1
      Tyler Hicks authored
      This patch creates a read-only sysctl containing an ordered list of
      seccomp actions that the kernel supports. The ordering, from left to
      right, is the lowest action value (kill) to the highest action value
      (allow). Currently, a read of the sysctl file would return "kill trap
      errno trace allow". The contents of this sysctl file can be useful for
      userspace code as well as the system administrator.
      The path to the sysctl is:
      libseccomp and other userspace code can easily determine which actions
      the current kernel supports. The set of actions supported by the current
      kernel may be different than the set of action macros found in kernel
      headers that were installed where the userspace code was built.
      In addition, this sysctl will allow system administrators to know which
      actions are supported by the kernel and make it easier to configure
      exactly what seccomp logs through the audit subsystem. Support for this
      level of logging configuration will come in a future patch.
      Signed-off-by: default avatarTyler Hicks <tyhicks@canonical.com>
      Signed-off-by: default avatarKees Cook <keescook@chromium.org>
  30. 10 Jul, 2017 1 commit
  31. 21 Apr, 2017 1 commit
  32. 03 Mar, 2017 1 commit
  33. 25 Feb, 2017 1 commit
    • David Rientjes's avatar
      mm, madvise: fail with ENOMEM when splitting vma will hit max_map_count · def5efe0
      David Rientjes authored
      If madvise(2) advice will result in the underlying vma being split and
      the number of areas mapped by the process will exceed
      /proc/sys/vm/max_map_count as a result, return ENOMEM instead of EAGAIN.
      EAGAIN is returned by madvise(2) when a kernel resource, such as slab,
      is temporarily unavailable.  It indicates that userspace should retry
      the advice in the near future.  This is important for advice such as
      MADV_DONTNEED which is often used by malloc implementations to free
      memory back to the system: we really do want to free memory back when
      madvise(2) returns EAGAIN because slab allocations (for vmas, anon_vmas,
      or mempolicies) cannot be allocated.
      Encountering /proc/sys/vm/max_map_count is not a temporary failure,
      however, so return ENOMEM to indicate this is a more serious issue.  A
      followup patch to the man page will specify this behavior.
      Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1701241431120.42507@chino.kir.corp.google.comSigned-off-by: default avatarDavid Rientjes <rientjes@google.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Michael Kerrisk <mtk.manpages@googlemail.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>