Skip to content
Snippets Groups Projects
  1. Aug 10, 2018
    • Toshiaki Makita's avatar
      veth: Add XDP TX and REDIRECT · d1396004
      Toshiaki Makita authored
      
      This allows further redirection of xdp_frames like
      
       NIC   -> veth--veth -> veth--veth
       (XDP)          (XDP)         (XDP)
      
      The intermediate XDP, redirecting packets from NIC to the other veth,
      reuses xdp_mem_info from NIC so that page recycling of the NIC works on
      the destination veth's XDP.
      In this way return_frame is not fully guarded by NAPI, since another
      NAPI handler on another cpu may use the same xdp_mem_info concurrently.
      Thus disable napi_direct by xdp_set_return_frame_no_direct() during the
      NAPI context.
      
      v8:
      - Don't use xdp_frame pointer address for data_hard_start of xdp_buff.
      
      v4:
      - Use xdp_[set|clear]_return_frame_no_direct() instead of a flag in
        xdp_mem_info.
      
      v3:
      - Fix double free when veth_xdp_tx() returns a positive value.
      - Convert xdp_xmit and xdp_redir variables into flags.
      
      Signed-off-by: default avatarToshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      d1396004
    • Toshiaki Makita's avatar
      veth: Add ndo_xdp_xmit · af87a3aa
      Toshiaki Makita authored
      
      This allows NIC's XDP to redirect packets to veth. The destination veth
      device enqueues redirected packets to the napi ring of its peer, then
      they are processed by XDP on its peer veth device.
      This can be thought as calling another XDP program by XDP program using
      REDIRECT, when the peer enables driver XDP.
      
      Note that when the peer veth device does not set driver xdp, redirected
      packets will be dropped because the peer is not ready for NAPI.
      
      v4:
      - Don't use xdp_ok_fwd_dev() because checking IFF_UP is not necessary.
        Add comments about it and check only MTU.
      
      v2:
      - Drop the part converting xdp_frame into skb when XDP is not enabled.
      - Implement bulk interface of ndo_xdp_xmit.
      - Implement XDP_XMIT_FLUSH bit and drop ndo_xdp_flush.
      
      Signed-off-by: default avatarToshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
      Acked-by: default avatarJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      af87a3aa
    • Toshiaki Makita's avatar
      veth: Handle xdp_frames in xdp napi ring · 9fc8d518
      Toshiaki Makita authored
      
      This is preparation for XDP TX and ndo_xdp_xmit.
      This allows napi handler to handle xdp_frames through xdp ring as well
      as sk_buff.
      
      v8:
      - Don't use xdp_frame pointer address to calculate skb->head and
        headroom.
      
      v7:
      - Use xdp_scrub_frame() instead of memset().
      
      v3:
      - Revert v2 change around rings and use a flag to differentiate skb and
        xdp_frame, since bulk skb xmit makes little performance difference
        for now.
      
      v2:
      - Use another ring instead of using flag to differentiate skb and
        xdp_frame. This approach makes bulk skb transmit possible in
        veth_xmit later.
      - Clear xdp_frame feilds in skb->head.
      - Implement adjust_tail.
      
      Signed-off-by: default avatarToshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
      Acked-by: default avatarJohn Fastabend <john.fastabend@gmail.com>
      Acked-by: default avatarJesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      9fc8d518
    • Toshiaki Makita's avatar
      veth: Avoid drops by oversized packets when XDP is enabled · dc224822
      Toshiaki Makita authored
      
      Oversized packets including GSO packets can be dropped if XDP is
      enabled on receiver side, so don't send such packets from peer.
      
      Drop TSO and SCTP fragmentation features so that veth devices themselves
      segment packets with XDP enabled. Also cap MTU accordingly.
      
      v4:
      - Don't auto-adjust MTU but cap max MTU.
      
      Signed-off-by: default avatarToshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      dc224822
    • Toshiaki Makita's avatar
      veth: Add driver XDP · 948d4f21
      Toshiaki Makita authored
      
      This is the basic implementation of veth driver XDP.
      
      Incoming packets are sent from the peer veth device in the form of skb,
      so this is generally doing the same thing as generic XDP.
      
      This itself is not so useful, but a starting point to implement other
      useful veth XDP features like TX and REDIRECT.
      
      This introduces NAPI when XDP is enabled, because XDP is now heavily
      relies on NAPI context. Use ptr_ring to emulate NIC ring. Tx function
      enqueues packets to the ring and peer NAPI handler drains the ring.
      
      Currently only one ring is allocated for each veth device, so it does
      not scale on multiqueue env. This can be resolved by allocating rings
      on the per-queue basis later.
      
      Note that NAPI is not used but netif_rx is used when XDP is not loaded,
      so this does not change the default behaviour.
      
      v6:
      - Check skb->len only when allocation is needed.
      - Add __GFP_NOWARN to alloc_page() as it can be triggered by external
        events.
      
      v3:
      - Fix race on closing the device.
      - Add extack messages in ndo_bpf.
      
      v2:
      - Squashed with the patch adding NAPI.
      - Implement adjust_tail.
      - Don't acquire consumer lock because it is guarded by NAPI.
      - Make poll_controller noop since it is unnecessary.
      - Register rxq_info on enabling XDP rather than on opening the device.
      
      Signed-off-by: default avatarToshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      948d4f21
  2. Dec 08, 2017
  3. Jun 27, 2017
  4. Jun 22, 2017
    • Serhey Popovych's avatar
      veth: Be more robust on network device creation when no attributes · 191cdb38
      Serhey Popovych authored
      
      There are number of problems with configuration peer
      network device in absence of IFLA_VETH_PEER attributes
      where attributes for main network device shared with
      peer.
      
      First it is not feasible to configure both network
      devices with same MAC address since this makes
      communication in such configuration problematic.
      
      This case can be reproduced with following sequence:
      
        # ip link add address 02:11:22:33:44:55 type veth
        # ip li sh
        ...
        26: veth0@veth1: <BROADCAST,MULTICAST,M-DOWN> mtu 1500 qdisc \
        noop state DOWN mode DEFAULT qlen 1000
            link/ether 00:11:22:33:44:55 brd ff:ff:ff:ff:ff:ff
        27: veth1@veth0: <BROADCAST,MULTICAST,M-DOWN> mtu 1500 qdisc \
        noop state DOWN mode DEFAULT qlen 1000
            link/ether 00:11:22:33:44:55 brd ff:ff:ff:ff:ff:ff
      
      Second it is not possible to register both main and
      peer network devices with same name, that happens
      when name for main interface is given with IFLA_IFNAME
      and same attribute reused for peer.
      
      This case can be reproduced with following sequence:
      
        # ip link add dev veth1a type veth
        RTNETLINK answers: File exists
      
      To fix both of the cases check if corresponding netlink
      attributes are taken from peer_tb when valid or
      name based on rtnl ops kind and random address is used.
      
      Signed-off-by: default avatarSerhey Popovych <serhe.popovych@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      191cdb38
  5. Jun 07, 2017
    • David S. Miller's avatar
      net: Fix inconsistent teardown and release of private netdev state. · cf124db5
      David S. Miller authored
      
      Network devices can allocate reasources and private memory using
      netdev_ops->ndo_init().  However, the release of these resources
      can occur in one of two different places.
      
      Either netdev_ops->ndo_uninit() or netdev->destructor().
      
      The decision of which operation frees the resources depends upon
      whether it is necessary for all netdev refs to be released before it
      is safe to perform the freeing.
      
      netdev_ops->ndo_uninit() presumably can occur right after the
      NETDEV_UNREGISTER notifier completes and the unicast and multicast
      address lists are flushed.
      
      netdev->destructor(), on the other hand, does not run until the
      netdev references all go away.
      
      Further complicating the situation is that netdev->destructor()
      almost universally does also a free_netdev().
      
      This creates a problem for the logic in register_netdevice().
      Because all callers of register_netdevice() manage the freeing
      of the netdev, and invoke free_netdev(dev) if register_netdevice()
      fails.
      
      If netdev_ops->ndo_init() succeeds, but something else fails inside
      of register_netdevice(), it does call ndo_ops->ndo_uninit().  But
      it is not able to invoke netdev->destructor().
      
      This is because netdev->destructor() will do a free_netdev() and
      then the caller of register_netdevice() will do the same.
      
      However, this means that the resources that would normally be released
      by netdev->destructor() will not be.
      
      Over the years drivers have added local hacks to deal with this, by
      invoking their destructor parts by hand when register_netdevice()
      fails.
      
      Many drivers do not try to deal with this, and instead we have leaks.
      
      Let's close this hole by formalizing the distinction between what
      private things need to be freed up by netdev->destructor() and whether
      the driver needs unregister_netdevice() to perform the free_netdev().
      
      netdev->priv_destructor() performs all actions to free up the private
      resources that used to be freed by netdev->destructor(), except for
      free_netdev().
      
      netdev->needs_free_netdev is a boolean that indicates whether
      free_netdev() should be done at the end of unregister_netdevice().
      
      Now, register_netdevice() can sanely release all resources after
      ndo_ops->ndo_init() succeeds, by invoking both ndo_ops->ndo_uninit()
      and netdev->priv_destructor().
      
      And at the end of unregister_netdevice(), we invoke
      netdev->priv_destructor() and optionally call free_netdev().
      
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      cf124db5
  6. Apr 13, 2017
  7. Mar 29, 2017
  8. Jan 08, 2017
  9. Oct 20, 2016
    • Jarod Wilson's avatar
      net: use core MTU range checking in core net infra · 91572088
      Jarod Wilson authored
      
      geneve:
      - Merge __geneve_change_mtu back into geneve_change_mtu, set max_mtu
      - This one isn't quite as straight-forward as others, could use some
        closer inspection and testing
      
      macvlan:
      - set min/max_mtu
      
      tun:
      - set min/max_mtu, remove tun_net_change_mtu
      
      vxlan:
      - Merge __vxlan_change_mtu back into vxlan_change_mtu
      - Set max_mtu to IP_MAX_MTU and retain dynamic MTU range checks in
        change_mtu function
      - This one is also not as straight-forward and could use closer inspection
        and testing from vxlan folks
      
      bridge:
      - set max_mtu of IP_MAX_MTU and retain dynamic MTU range checks in
        change_mtu function
      
      openvswitch:
      - set min/max_mtu, remove internal_dev_change_mtu
      - note: max_mtu wasn't checked previously, it's been set to 65535, which
        is the largest possible size supported
      
      sch_teql:
      - set min/max_mtu (note: max_mtu previously unchecked, used max of 65535)
      
      macsec:
      - min_mtu = 0, max_mtu = 65535
      
      macvlan:
      - min_mtu = 0, max_mtu = 65535
      
      ntb_netdev:
      - min_mtu = 0, max_mtu = 65535
      
      veth:
      - min_mtu = 68, max_mtu = 65535
      
      8021q:
      - min_mtu = 0, max_mtu = 65535
      
      CC: netdev@vger.kernel.org
      CC: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      CC: Hannes Frederic Sowa <hannes@stressinduktion.org>
      CC: Tom Herbert <tom@herbertland.com>
      CC: Daniel Borkmann <daniel@iogearbox.net>
      CC: Alexander Duyck <alexander.h.duyck@intel.com>
      CC: Paolo Abeni <pabeni@redhat.com>
      CC: Jiri Benc <jbenc@redhat.com>
      CC: WANG Cong <xiyou.wangcong@gmail.com>
      CC: Roopa Prabhu <roopa@cumulusnetworks.com>
      CC: Pravin B Shelar <pshelar@ovn.org>
      CC: Sabrina Dubroca <sd@queasysnail.net>
      CC: Patrick McHardy <kaber@trash.net>
      CC: Stephen Hemminger <stephen@networkplumber.org>
      CC: Pravin Shelar <pshelar@nicira.com>
      CC: Maxim Krasnyansky <maxk@qti.qualcomm.com>
      Signed-off-by: default avatarJarod Wilson <jarod@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      91572088
  10. Aug 31, 2016
  11. Aug 26, 2016
    • Xin Long's avatar
      veth: sctp: add NETIF_F_SCTP_CRC to device features · c80fafbb
      Xin Long authored
      
      Commit b17c7069 ("loopback: sctp: add NETIF_F_SCTP_CSUM to device
      features") added NETIF_F_SCTP_CRC to device features for lo device to
      improve the performance of sctp over lo.
      
      This patch is to add NETIF_F_SCTP_CRC to device features for veth to
      improve the performance of sctp over veth.
      
      Before this patch:
        ip netns exec cs_client netperf -H 10.167.12.2 -t SCTP_STREAM -- -m 10K
        Recv   Send    Send
        Socket Socket  Message  Elapsed
        Size   Size    Size     Time     Throughput
        bytes  bytes   bytes    secs.    10^6bits/sec
      
        212992 212992  10240    10.00    1117.16
      
      After this patch:
        ip netns exec cs_client netperf -H 10.167.12.2 -t SCTP_STREAM -- -m 10K
        Recv   Send    Send
        Socket Socket  Message  Elapsed
        Size   Size    Size     Time     Throughput
        bytes  bytes   bytes    secs.    10^6bits/sec
      
        212992 212992  10240    10.20    1415.22
      
      Tested-by: default avatarLi Shuang <tjlishuang@yeah.net>
      Signed-off-by: default avatarXin Long <lucien.xin@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      c80fafbb
  12. Apr 21, 2016
  13. Mar 01, 2016
    • Paolo Abeni's avatar
      veth: implement ndo_set_rx_headroom · 163e5292
      Paolo Abeni authored
      
      The rx headroom for veth dev is the peer device needed_headroom.
      Avoid ping-pong updates setting the private flag IFF_PHONY_HEADROOM.
      
      This avoids skb head reallocation when forwarding from a veth dev
      towards a device adding some kind of encapsulation.
      
      When transmitting frames below the MTU size towards a vxlan device,
      this gives about 10% performance speed-up when OVS is used to connect
      the veth and the vxlan device and a little more when using a
      plain Linux bridge.
      
      Signed-off-by: default avatarPaolo Abeni <pabeni@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      163e5292
  14. Dec 22, 2015
    • Vijay Pandurangan's avatar
      veth: don’t modify ip_summed; doing so treats packets with bad checksums as good. · ce8c839b
      Vijay Pandurangan authored
      
      Packets that arrive from real hardware devices have ip_summed ==
      CHECKSUM_UNNECESSARY if the hardware verified the checksums, or
      CHECKSUM_NONE if the packet is bad or it was unable to verify it. The
      current version of veth will replace CHECKSUM_NONE with
      CHECKSUM_UNNECESSARY, which causes corrupt packets routed from hardware to
      a veth device to be delivered to the application. This caused applications
      at Twitter to receive corrupt data when network hardware was corrupting
      packets.
      
      We believe this was added as an optimization to skip computing and
      verifying checksums for communication between containers. However, locally
      generated packets have ip_summed == CHECKSUM_PARTIAL, so the code as
      written does nothing for them. As far as we can tell, after removing this
      code, these packets are transmitted from one stack to another unmodified
      (tcpdump shows invalid checksums on both sides, as expected), and they are
      delivered correctly to applications. We didn’t test every possible network
      configuration, but we tried a few common ones such as bridging containers,
      using NAT between the host and a container, and routing from hardware
      devices to containers. We have effectively deployed this in production at
      Twitter (by disabling RX checksum offloading on veth devices).
      
      This code dates back to the first version of the driver, commit
      <e314dbdc> ("[NET]: Virtual ethernet device driver"), so I
      suspect this bug occurred mostly because the driver API has evolved
      significantly since then. Commit <0b796750> ("net/veth: Fix
      packet checksumming") (in December 2010) fixed this for packets that get
      created locally and sent to hardware devices, by not changing
      CHECKSUM_PARTIAL. However, the same issue still occurs for packets coming
      in from hardware devices.
      
      Co-authored-by: default avatarEvan Jones <ej@evanjones.ca>
      Signed-off-by: default avatarEvan Jones <ej@evanjones.ca>
      Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Cc: Phil Sutter <phil@nwl.cc>
      Cc: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
      Cc: netdev@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: default avatarVijay Pandurangan <vijayp@vijayp.ca>
      Acked-by: default avatarCong Wang <cwang@twopensource.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      ce8c839b
  15. Aug 18, 2015
  16. Aug 03, 2015
  17. Apr 02, 2015
  18. Jan 24, 2015
  19. Jul 15, 2014
  20. Jun 25, 2014
  21. Mar 28, 2014
  22. Mar 15, 2014
  23. Feb 20, 2014
  24. Feb 18, 2014
  25. Feb 14, 2014
  26. Nov 06, 2013
    • John Stultz's avatar
      net: Explicitly initialize u64_stats_sync structures for lockdep · 827da44c
      John Stultz authored
      
      In order to enable lockdep on seqcount/seqlock structures, we
      must explicitly initialize any locks.
      
      The u64_stats_sync structure, uses a seqcount, and thus we need
      to introduce a u64_stats_init() function and use it to initialize
      the structure.
      
      This unfortunately adds a lot of fairly trivial initialization code
      to a number of drivers. But the benefit of ensuring correctness makes
      this worth while.
      
      Because these changes are required for lockdep to be enabled, and the
      changes are quite trivial, I've not yet split this patch out into 30-some
      separate patches, as I figured it would be better to get the various
      maintainers thoughts on how to best merge this change along with
      the seqcount lockdep enablement.
      
      Feedback would be appreciated!
      
      Signed-off-by: default avatarJohn Stultz <john.stultz@linaro.org>
      Acked-by: default avatarJulian Anastasov <ja@ssi.bg>
      Signed-off-by: default avatarPeter Zijlstra <peterz@infradead.org>
      Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
      Cc: James Morris <jmorris@namei.org>
      Cc: Jesse Gross <jesse@nicira.com>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Mirko Lindner <mlindner@marvell.com>
      Cc: Patrick McHardy <kaber@trash.net>
      Cc: Roger Luethi <rl@hellgate.ch>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Simon Horman <horms@verge.net.au>
      Cc: Stephen Hemminger <stephen@networkplumber.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
      Cc: Wensong Zhang <wensong@linux-vs.org>
      Cc: netdev@vger.kernel.org
      Link: http://lkml.kernel.org/r/1381186321-4906-2-git-send-email-john.stultz@linaro.org
      
      
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      827da44c
  27. Oct 28, 2013
  28. Oct 10, 2013
  29. Oct 09, 2013
  30. Oct 08, 2013
  31. Jul 20, 2013
    • Flavio Leitner's avatar
      veth: add vlan features · b69bbddf
      Flavio Leitner authored
      
      The veth device doesn't provide the vlan features,
      so TSO for example is disabled and that causes
      performance issues when using tagged traffic.
      
      The test topology looks like this:
      
          br0                     br1
        /   \                  /     \
      vnet  veth0.10 ----- veth1.10   vnet
      VM                               VM
      
      The netperf results with current veth driver:
      MIGRATED TCP STREAM TEST from 192.168.1.1 ()
      port 0 AF_INET to 192.168.1.2 () port 0 AF_INET
      Recv   Send    Send
      Socket Socket  Message  Elapsed
      Size   Size    Size     Time     Throughput
      bytes  bytes   bytes    secs.    10^6bits/sec
      
       87380  16384  16384    10.01    2210.22
      
      Now after applying the proposed patch:
      MIGRATED TCP STREAM TEST from 192.168.1.1 ()
      port 0 AF_INET to 192.168.1.2 () port 0 AF_INET
      Recv   Send    Send
      Socket Socket  Message  Elapsed
      Size   Size    Size     Time     Throughput
      bytes  bytes   bytes    secs.    10^6bits/sec
      
       87380  16384  16384    10.00    13067.47
      
      Signed-off-by: default avatarFlavio Leitner <fbl@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      b69bbddf
  32. Jun 12, 2013
  33. Apr 19, 2013
  34. Feb 11, 2013
Loading