Skip to content
Snippets Groups Projects
  1. May 13, 2015
    • Willem de Bruijn's avatar
      packet: rollover statistics · a9b63918
      Willem de Bruijn authored
      
      Rollover indicates exceptional conditions. Export a counter to inform
      socket owners of this state.
      
      If no socket with sufficient room is found, rollover fails. Also count
      these events.
      
      Finally, also count when flows are rolled over early thanks to huge
      flow detection, to validate its correctness.
      
      Tested:
        Read counters in bench_rollover on all other tests in the patchset
      
      Signed-off-by: default avatarWillem de Bruijn <willemb@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      a9b63918
    • Willem de Bruijn's avatar
      packet: rollover huge flows before small flows · 3b3a5b0a
      Willem de Bruijn authored
      
      Migrate flows from a socket to another socket in the fanout group not
      only when the socket is full. Start migrating huge flows early, to
      divert possible 4-tuple attacks without affecting normal traffic.
      
      Introduce fanout_flow_is_huge(). This detects huge flows, which are
      defined as taking up more than half the load. It does so cheaply, by
      storing the rxhashes of the N most recent packets. If over half of
      these are the same rxhash as the current packet, then drop it. This
      only protects against 4-tuple attacks. N is chosen to fit all data in
      a single cache line.
      
      Tested:
        Ran bench_rollover for 10 sec with 1.5 Mpps of single flow input.
      
          lpbb5:/export/hda3/willemb# ./bench_rollover -l 1000 -r -s
          cpu         rx       rx.k     drop.k   rollover     r.huge   r.failed
            0         14         14          0          0          0          0
            1         20         20          0          0          0          0
            2         16         16          0          0          0          0
            3    6168824    6168824          0    4867721    4867721          0
            4    4867741    4867741          0          0          0          0
            5         12         12          0          0          0          0
            6         15         15          0          0          0          0
            7         17         17          0          0          0          0
      
      Signed-off-by: default avatarWillem de Bruijn <willemb@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      3b3a5b0a
    • Willem de Bruijn's avatar
      packet: rollover lock contention avoidance · 2ccdbaa6
      Willem de Bruijn authored
      
      Rollover has to call packet_rcv_has_room on sockets in the fanout
      group to find a socket to migrate to. This operation is expensive
      especially if the packet sockets use rings, when a lock has to be
      acquired.
      
      Avoid pounding on the lock by all sockets by temporarily marking a
      socket as "under memory pressure" when such pressure is detected.
      While set, only the socket owner may call packet_rcv_has_room on the
      socket. Once it detects normal conditions, it clears the flag. The
      socket is not used as a victim by any other socket in the meantime.
      
      Under reasonably balanced load, each socket writer frequently calls
      packet_rcv_has_room and clears its own pressure field. As a backup
      for when the socket is rarely written to, also clear the flag on
      reading (packet_recvmsg, packet_poll) if this can be done cheaply
      (i.e., without calling packet_rcv_has_room). This is only for
      edge cases.
      
      Tested:
        Ran bench_rollover: a process with 8 sockets in a single fanout
        group, each pinned to a single cpu that receives one nic recv
        interrupt. RPS and RFS are disabled. The benchmark uses packet
        rx_ring, which has to take a lock when determining whether a
        socket has room.
      
        Sent 3.5 Mpps of UDP traffic with sufficient entropy to spread
        uniformly across the packet sockets (and inserted an iptables
        rule to drop in PREROUTING to avoid protocol stack processing).
      
        Without this patch, all sockets try to migrate traffic to
        neighbors, causing lock contention when searching for a non-
        empty neighbor. The lock is the top 9 entries.
      
          perf record -a -g sleep 5
      
          -  17.82%   bench_rollover  [kernel.kallsyms]    [k] _raw_spin_lock
             - _raw_spin_lock
                - 99.00% spin_lock
          	 + 81.77% packet_rcv_has_room.isra.41
          	 + 18.23% tpacket_rcv
                + 0.84% packet_rcv_has_room.isra.41
          +   5.20%      ksoftirqd/6  [kernel.kallsyms]    [k] _raw_spin_lock
          +   5.15%      ksoftirqd/1  [kernel.kallsyms]    [k] _raw_spin_lock
          +   5.14%      ksoftirqd/2  [kernel.kallsyms]    [k] _raw_spin_lock
          +   5.12%      ksoftirqd/7  [kernel.kallsyms]    [k] _raw_spin_lock
          +   5.12%      ksoftirqd/5  [kernel.kallsyms]    [k] _raw_spin_lock
          +   5.10%      ksoftirqd/4  [kernel.kallsyms]    [k] _raw_spin_lock
          +   4.66%      ksoftirqd/0  [kernel.kallsyms]    [k] _raw_spin_lock
          +   4.45%      ksoftirqd/3  [kernel.kallsyms]    [k] _raw_spin_lock
          +   1.55%   bench_rollover  [kernel.kallsyms]    [k] packet_rcv_has_room.isra.41
      
        On net-next with this patch, this lock contention is no longer a
        top entry. Most time is spent in the actual read function. Next up
        are other locks:
      
          +  15.52%  bench_rollover  bench_rollover     [.] reader
          +   4.68%         swapper  [kernel.kallsyms]  [k] memcpy_erms
          +   2.77%         swapper  [kernel.kallsyms]  [k] packet_lookup_frame.isra.51
          +   2.56%     ksoftirqd/1  [kernel.kallsyms]  [k] memcpy_erms
          +   2.16%         swapper  [kernel.kallsyms]  [k] tpacket_rcv
          +   1.93%         swapper  [kernel.kallsyms]  [k] mlx4_en_process_rx_cq
      
        Looking closer at the remaining _raw_spin_lock, the cost of probing
        in rollover is now comparable to the cost of taking the lock later
        in tpacket_rcv.
      
          -   1.51%         swapper  [kernel.kallsyms]  [k] _raw_spin_lock
             - _raw_spin_lock
                + 33.41% packet_rcv_has_room
                + 28.15% tpacket_rcv
                + 19.54% enqueue_to_backlog
                + 6.45% __free_pages_ok
                + 2.78% packet_rcv_fanout
                + 2.13% fanout_demux_rollover
                + 2.01% netif_receive_skb_internal
      
      Signed-off-by: default avatarWillem de Bruijn <willemb@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      2ccdbaa6
    • Willem de Bruijn's avatar
      packet: rollover prepare: per-socket state · 0648ab70
      Willem de Bruijn authored
      
      Replace rollover state per fanout group with state per socket. Future
      patches will add fields to the new structure.
      
      Signed-off-by: default avatarWillem de Bruijn <willemb@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      0648ab70
  2. Mar 12, 2015
    • Eric W. Biederman's avatar
      net: Introduce possible_net_t · 0c5c9fb5
      Eric W. Biederman authored
      
      Having to say
      > #ifdef CONFIG_NET_NS
      > 	struct net *net;
      > #endif
      
      in structures is a little bit wordy and a little bit error prone.
      
      Instead it is possible to say:
      > typedef struct {
      > #ifdef CONFIG_NET_NS
      >       struct net *net;
      > #endif
      > } possible_net_t;
      
      And then in a header say:
      
      > 	possible_net_t net;
      
      Which is cleaner and easier to use and easier to test, as the
      possible_net_t is always there no matter what the compile options.
      
      Further this allows read_pnet and write_pnet to be functions in all
      cases which is better at catching typos.
      
      This change adds possible_net_t, updates the definitions of read_pnet
      and write_pnet, updates optional struct net * variables that
      write_pnet uses on to have the type possible_net_t, and finally fixes
      up the b0rked users of read_pnet and write_pnet.
      
      Signed-off-by: default avatar"Eric W. Biederman" <ebiederm@xmission.com>
      Acked-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      0c5c9fb5
  3. Aug 21, 2014
  4. Jan 17, 2014
    • Daniel Borkmann's avatar
      packet: use percpu mmap tx frame pending refcount · b0138408
      Daniel Borkmann authored
      In PF_PACKET's packet mmap(), we can avoid using one atomic_inc()
      and one atomic_dec() call in skb destructor and use a percpu
      reference count instead in order to determine if packets are
      still pending to be sent out. Micro-benchmark with [1] that has
      been slightly modified (that is, protcol = 0 in socket(2) and
      bind(2)), example on a rather crappy testing machine; I expect
      it to scale and have even better results on bigger machines:
      
      ./packet_mm_tx -s7000 -m7200 -z700000 em1, avg over 2500 runs:
      
      With patch:    4,022,015 cyc
      Without patch: 4,812,994 cyc
      
      time ./packet_mm_tx -s64 -c10000000 em1 > /dev/null, stable:
      
      With patch:
        real         1m32.241s
        user         0m0.287s
        sys          1m29.316s
      
      Without patch:
        real         1m38.386s
        user         0m0.265s
        sys          1m35.572s
      
      In function tpacket_snd(), it is okay to use packet_read_pending()
      since in fast-path we short-circuit the condition already with
      ph != NULL, since we have next frames to process. In case we have
      MSG_DONTWAIT, we also do not execute this path as need_wait is
      false here anyway, and in case of _no_ MSG_DONTWAIT flag, it is
      okay to call a packet_read_pending(), because when we ever reach
      that path, we're done processing outgoing frames anyway and only
      look if there are skbs still outstanding to be orphaned. We can
      stay lockless in this percpu counter since it's acceptable when we
      reach this path for the sum to be imprecise first, but we'll level
      out at 0 after all pending frames have reached the skb destructor
      eventually through tx reclaim. When people pin a tx process to
      particular CPUs, we expect overflows to happen in the reference
      counter as on one CPU we expect heavy increase; and distributed
      through ksoftirqd on all CPUs a decrease, for example. As
      David Laight points out, since the C language doesn't define the
      result of signed int overflow (i.e. rather than wrap, it is
      allowed to saturate as a possible outcome), we have to use
      unsigned int as reference count. The sum over all CPUs when tx
      is complete will result in 0 again.
      
      The BUG_ON() in tpacket_destruct_skb() we can remove as well. It
      can _only_ be set from inside tpacket_snd() path and we made sure
      to increase tx_ring.pending in any case before we called po->xmit(skb).
      So testing for tx_ring.pending == 0 is not too useful. Instead, it
      would rather have been useful to test if lower layers didn't orphan
      the skb so that we're missing ring slots being put back to
      TP_STATUS_AVAILABLE. But such a bug will be caught in user space
      already as we end up realizing that we do not have any
      TP_STATUS_AVAILABLE slots left anymore. Therefore, we're all set.
      
      Btw, in case of RX_RING path, we do not make use of the pending
      member, therefore we also don't need to use up any percpu memory
      here. Also note that __alloc_percpu() already returns a zero-filled
      percpu area, so initialization is done already.
      
        [1] http://wiki.ipxwarzone.com/index.php5?title=Linux_packet_mmap
      
      
      
      Signed-off-by: default avatarDaniel Borkmann <dborkman@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      b0138408
  5. Dec 10, 2013
    • Daniel Borkmann's avatar
      packet: introduce PACKET_QDISC_BYPASS socket option · d346a3fa
      Daniel Borkmann authored
      
      This patch introduces a PACKET_QDISC_BYPASS socket option, that
      allows for using a similar xmit() function as in pktgen instead
      of taking the dev_queue_xmit() path. This can be very useful when
      PF_PACKET applications are required to be used in a similar
      scenario as pktgen, but with full, flexible packet payload that
      needs to be provided, for example.
      
      On default, nothing changes in behaviour for normal PF_PACKET
      TX users, so everything stays as is for applications. New users,
      however, can now set PACKET_QDISC_BYPASS if needed to prevent
      own packets from i) reentering packet_rcv() and ii) to directly
      push the frame to the driver.
      
      In doing so we can increase pps (here 64 byte packets) for
      PF_PACKET a bit:
      
        # CPUs -- QDISC_BYPASS   -- qdisc path -- qdisc path[**]
        1 CPU  ==  1,509,628 pps --  1,208,708 --  1,247,436
        2 CPUs ==  3,198,659 pps --  2,536,012 --  1,605,779
        3 CPUs ==  4,787,992 pps --  3,788,740 --  1,735,610
        4 CPUs ==  6,173,956 pps --  4,907,799 --  1,909,114
        5 CPUs ==  7,495,676 pps --  5,956,499 --  2,014,422
        6 CPUs ==  9,001,496 pps --  7,145,064 --  2,155,261
        7 CPUs == 10,229,776 pps --  8,190,596 --  2,220,619
        8 CPUs == 11,040,732 pps --  9,188,544 --  2,241,879
        9 CPUs == 12,009,076 pps -- 10,275,936 --  2,068,447
       10 CPUs == 11,380,052 pps -- 11,265,337 --  1,578,689
       11 CPUs == 11,672,676 pps -- 11,845,344 --  1,297,412
       [...]
       20 CPUs == 11,363,192 pps -- 11,014,933 --  1,245,081
      
       [**]: qdisc path with packet_rcv(), how probably most people
             seem to use it (hopefully not anymore if not needed)
      
      The test was done using a modified trafgen, sending a simple
      static 64 bytes packet, on all CPUs.  The trick in the fast
      "qdisc path" case, is to avoid reentering packet_rcv() by
      setting the RAW socket protocol to zero, like:
      socket(PF_PACKET, SOCK_RAW, 0);
      
      Tradeoffs are documented as well in this patch, clearly, if
      queues are busy, we will drop more packets, tc disciplines are
      ignored, and these packets are not visible to taps anymore. For
      a pktgen like scenario, we argue that this is acceptable.
      
      The pointer to the xmit function has been placed in packet
      socket structure hole between cached_dev and prot_hook that
      is hot anyway as we're working on cached_dev in each send path.
      
      Done in joint work together with Jesper Dangaard Brouer.
      
      Signed-off-by: default avatarDaniel Borkmann <dborkman@redhat.com>
      Signed-off-by: default avatarJesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      d346a3fa
  6. Nov 21, 2013
    • Daniel Borkmann's avatar
      packet: fix use after free race in send path when dev is released · e40526cb
      Daniel Borkmann authored
      
      Salam reported a use after free bug in PF_PACKET that occurs when
      we're sending out frames on a socket bound device and suddenly the
      net device is being unregistered. It appears that commit 827d9780
      introduced a possible race condition between {t,}packet_snd() and
      packet_notifier(). In the case of a bound socket, packet_notifier()
      can drop the last reference to the net_device and {t,}packet_snd()
      might end up suddenly sending a packet over a freed net_device.
      
      To avoid reverting 827d9780 and thus introducing a performance
      regression compared to the current state of things, we decided to
      hold a cached RCU protected pointer to the net device and maintain
      it on write side via bind spin_lock protected register_prot_hook()
      and __unregister_prot_hook() calls.
      
      In {t,}packet_snd() path, we access this pointer under rcu_read_lock
      through packet_cached_dev_get() that holds reference to the device
      to prevent it from being freed through packet_notifier() while
      we're in send path. This is okay to do as dev_put()/dev_hold() are
      per-cpu counters, so this should not be a performance issue. Also,
      the code simplifies a bit as we don't need need_rls_dev anymore.
      
      Fixes: 827d9780 ("af-packet: Use existing netdev reference for bound sockets.")
      Reported-by: default avatarSalam Noureddine <noureddine@aristanetworks.com>
      Signed-off-by: default avatarDaniel Borkmann <dborkman@redhat.com>
      Signed-off-by: default avatarSalam Noureddine <noureddine@aristanetworks.com>
      Cc: Ben Greear <greearb@candelatech.com>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      e40526cb
  7. Apr 25, 2013
    • Daniel Borkmann's avatar
      packet: account statistics only in tpacket_stats_u · ee80fbf3
      Daniel Borkmann authored
      
      Currently, packet_sock has a struct tpacket_stats stats member for
      TPACKET_V1 and TPACKET_V2 statistic accounting, and with TPACKET_V3
      ``union tpacket_stats_u stats_u'' was introduced, where however only
      statistics for TPACKET_V3 are held, and when copied to user space,
      TPACKET_V3 does some hackery and access also tpacket_stats' stats,
      although everything could have been done within the union itself.
      
      Unify accounting within the tpacket_stats_u union so that we can
      remove 8 bytes from packet_sock that are there unnecessary. Note that
      even if we switch to TPACKET_V3 and would use non mmap(2)ed option,
      this still works due to the union with same types + offsets, that are
      exposed to the user space.
      
      Signed-off-by: default avatarDaniel Borkmann <dborkman@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      ee80fbf3
    • Daniel Borkmann's avatar
      packet: reorder a member in packet_ring_buffer · 0578edc5
      Daniel Borkmann authored
      
      There's a 4 byte hole in packet_ring_buffer structure before
      prb_bdqc, that can be filled with 'pending' member, thus we can
      reduce the overall structure size from 224 bytes to 216 bytes.
      This also has the side-effect, that in struct packet_sock 2*4 byte
      holes after the embedded packet_ring_buffer members are removed,
      and overall, packet_sock can be reduced by 1 cacheline:
      
      Before: size: 1344, cachelines: 21, members: 24
      After:  size: 1280, cachelines: 20, members: 24
      
      Signed-off-by: default avatarDaniel Borkmann <dborkman@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      0578edc5
  8. Mar 19, 2013
    • Willem de Bruijn's avatar
      packet: packet fanout rollover during socket overload · 77f65ebd
      Willem de Bruijn authored
      
      Changes:
        v3->v2: rebase (no other changes)
                passes selftest
        v2->v1: read f->num_members only once
                fix bug: test rollover mode + flag
      
      Minimize packet drop in a fanout group. If one socket is full,
      roll over packets to another from the group. Maintain flow
      affinity during normal load using an rxhash fanout policy, while
      dispersing unexpected traffic storms that hit a single cpu, such
      as spoofed-source DoS flows. Rollover breaks affinity for flows
      arriving at saturated sockets during those conditions.
      
      The patch adds a fanout policy ROLLOVER that rotates between sockets,
      filling each socket before moving to the next. It also adds a fanout
      flag ROLLOVER. If passed along with any other fanout policy, the
      primary policy is applied until the chosen socket is full. Then,
      rollover selects another socket, to delay packet drop until the
      entire system is saturated.
      
      Probing sockets is not free. Selecting the last used socket, as
      rollover does, is a greedy approach that maximizes chance of
      success, at the cost of extreme load imbalance. In practice, with
      sufficiently long queues to absorb bursts, sockets are drained in
      parallel and load balance looks uniform in `top`.
      
      To avoid contention, scales counters with number of sockets and
      accesses them lockfree. Values are bounds checked to ensure
      correctness.
      
      Tested using an application with 9 threads pinned to CPUs, one socket
      per thread and sufficient busywork per packet operation to limits each
      thread to handling 32 Kpps. When sent 500 Kpps single UDP stream
      packets, a FANOUT_CPU setup processes 32 Kpps in total without this
      patch, 270 Kpps with the patch. Tested with read() and with a packet
      ring (V1).
      
      Also, passes psock_fanout.c unit test added to selftests.
      
      Signed-off-by: default avatarWillem de Bruijn <willemb@google.com>
      Reviewed-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      77f65ebd
  9. Nov 07, 2012
  10. Aug 20, 2012
  11. Aug 14, 2012
Loading