Skip to content
Snippets Groups Projects
  1. Nov 27, 2020
  2. Nov 26, 2020
  3. Nov 25, 2020
  4. Nov 24, 2020
    • Alexander Duyck's avatar
      tcp: Set ECT0 bit in tos/tclass for synack when BPF needs ECN · 407c85c7
      Alexander Duyck authored
      
      When a BPF program is used to select between a type of TCP congestion
      control algorithm that uses either ECN or not there is a case where the
      synack for the frame was coming up without the ECT0 bit set. A bit of
      research found that this was due to the final socket being configured to
      dctcp while the listener socket was staying in cubic.
      
      To reproduce it all that is needed is to monitor TCP traffic while running
      the sample bpf program "samples/bpf/tcp_cong_kern.c". What is observed,
      assuming tcp_dctcp module is loaded or compiled in and the traffic matches
      the rules in the sample file, is that for all frames with the exception of
      the synack the ECT0 bit is set.
      
      To address that it is necessary to make one additional call to
      tcp_bpf_ca_needs_ecn using the request socket and then use the output of
      that to set the ECT0 bit for the tos/tclass of the packet.
      
      Fixes: 91b5b21c ("bpf: Add support for changing congestion control")
      Signed-off-by: default avatarAlexander Duyck <alexanderduyck@fb.com>
      Link: https://lore.kernel.org/r/160593039663.2604.1374502006916871573.stgit@localhost.localdomain
      
      
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      407c85c7
    • Moshe Shemesh's avatar
      devlink: Fix reload stats structure · 5204bb68
      Moshe Shemesh authored
      
      Fix reload stats structure exposed to the user. Change stats structure
      hierarchy to have the reload action as a parent of the stat entry and
      then stat entry includes value per limit. This will also help to avoid
      string concatenation on iproute2 output.
      
      Reload stats structure before this fix:
      "stats": {
          "reload": {
              "driver_reinit": 2,
              "fw_activate": 1,
              "fw_activate_no_reset": 0
           }
      }
      
      After this fix:
      "stats": {
          "reload": {
              "driver_reinit": {
                  "unspecified": 2
              },
              "fw_activate": {
                  "unspecified": 1,
                  "no_reset": 0
              }
      }
      
      Fixes: a254c264 ("devlink: Add reload stats")
      Signed-off-by: default avatarMoshe Shemesh <moshe@mellanox.com>
      Reviewed-by: default avatarJiri Pirko <jiri@nvidia.com>
      Link: https://lore.kernel.org/r/1606109785-25197-1-git-send-email-moshe@mellanox.com
      
      
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      5204bb68
    • Eyal Birger's avatar
      net/packet: fix packet receive on L3 devices without visible hard header · d5496990
      Eyal Birger authored
      
      In the patchset merged by commit b9fcf0a0
      ("Merge branch 'support-AF_PACKET-for-layer-3-devices'") L3 devices which
      did not have header_ops were given one for the purpose of protocol parsing
      on af_packet transmit path.
      
      That change made af_packet receive path regard these devices as having a
      visible L3 header and therefore aligned incoming skb->data to point to the
      skb's mac_header. Some devices, such as ipip, xfrmi, and others, do not
      reset their mac_header prior to ingress and therefore their incoming
      packets became malformed.
      
      Ideally these devices would reset their mac headers, or af_packet would be
      able to rely on dev->hard_header_len being 0 for such cases, but it seems
      this is not the case.
      
      Fix by changing af_packet RX ll visibility criteria to include the
      existence of a '.create()' header operation, which is used when creating
      a device hard header - via dev_hard_header() - by upper layers, and does
      not exist in these L3 devices.
      
      As this predicate may be useful in other situations, add it as a common
      dev_has_header() helper in netdevice.h.
      
      Fixes: b9fcf0a0 ("Merge branch 'support-AF_PACKET-for-layer-3-devices'")
      Signed-off-by: default avatarEyal Birger <eyal.birger@gmail.com>
      Acked-by: default avatarJason A. Donenfeld <Jason@zx2c4.com>
      Acked-by: default avatarWillem de Bruijn <willemb@google.com>
      Link: https://lore.kernel.org/r/20201121062817.3178900-1-eyal.birger@gmail.com
      
      
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      d5496990
    • Stefano Garzarella's avatar
      vsock/virtio: discard packets only when socket is really closed · 3fe356d5
      Stefano Garzarella authored
      Starting from commit 8692cefc ("virtio_vsock: Fix race condition
      in virtio_transport_recv_pkt"), we discard packets in
      virtio_transport_recv_pkt() if the socket has been released.
      
      When the socket is connected, we schedule a delayed work to wait the
      RST packet from the other peer, also if SHUTDOWN_MASK is set in
      sk->sk_shutdown.
      This is done to complete the virtio-vsock shutdown algorithm, releasing
      the port assigned to the socket definitively only when the other peer
      has consumed all the packets.
      
      If we discard the RST packet received, the socket will be closed only
      when the VSOCK_CLOSE_TIMEOUT is reached.
      
      Sergio discovered the issue while running ab(1) HTTP benchmark using
      libkrun [1] and observing a latency increase with that commit.
      
      To avoid this issue, we discard packet only if the socket is really
      closed (SOCK_DONE flag is set).
      We also set SOCK_DONE in virtio_transport_release() when we don't need
      to wait any packets from the other peer (we didn't schedule the delayed
      work). In this case we remove the socket from the vsock lists, releasing
      the port assigned.
      
      [1] https://github.com/containers/libkrun
      
      
      
      Fixes: 8692cefc ("virtio_vsock: Fix race condition in virtio_transport_recv_pkt")
      Cc: justin.he@arm.com
      Reported-by: default avatarSergio Lopez <slp@redhat.com>
      Tested-by: default avatarSergio Lopez <slp@redhat.com>
      Signed-off-by: default avatarStefano Garzarella <sgarzare@redhat.com>
      Acked-by: default avatarJia He <justin.he@arm.com>
      Link: https://lore.kernel.org/r/20201120104736.73749-1-sgarzare@redhat.com
      
      
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      3fe356d5
    • Ricardo Dias's avatar
      tcp: fix race condition when creating child sockets from syncookies · 01770a16
      Ricardo Dias authored
      
      When the TCP stack is in SYN flood mode, the server child socket is
      created from the SYN cookie received in a TCP packet with the ACK flag
      set.
      
      The child socket is created when the server receives the first TCP
      packet with a valid SYN cookie from the client. Usually, this packet
      corresponds to the final step of the TCP 3-way handshake, the ACK
      packet. But is also possible to receive a valid SYN cookie from the
      first TCP data packet sent by the client, and thus create a child socket
      from that SYN cookie.
      
      Since a client socket is ready to send data as soon as it receives the
      SYN+ACK packet from the server, the client can send the ACK packet (sent
      by the TCP stack code), and the first data packet (sent by the userspace
      program) almost at the same time, and thus the server will equally
      receive the two TCP packets with valid SYN cookies almost at the same
      instant.
      
      When such event happens, the TCP stack code has a race condition that
      occurs between the momement a lookup is done to the established
      connections hashtable to check for the existence of a connection for the
      same client, and the moment that the child socket is added to the
      established connections hashtable. As a consequence, this race condition
      can lead to a situation where we add two child sockets to the
      established connections hashtable and deliver two sockets to the
      userspace program to the same client.
      
      This patch fixes the race condition by checking if an existing child
      socket exists for the same client when we are adding the second child
      socket to the established connections socket. If an existing child
      socket exists, we drop the packet and discard the second child socket
      to the same client.
      
      Signed-off-by: default avatarRicardo Dias <rdias@singlestore.com>
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Link: https://lore.kernel.org/r/20201120111133.GA67501@rdias-suse-pc.lan
      
      
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      01770a16
  5. Nov 21, 2020
    • Julian Wiedmann's avatar
      net/af_iucv: set correct sk_protocol for child sockets · c5dab094
      Julian Wiedmann authored
      
      Child sockets erroneously inherit their parent's sk_type (ie. SOCK_*),
      instead of the PF_IUCV protocol that the parent was created with in
      iucv_sock_create().
      
      We're currently not using sk->sk_protocol ourselves, so this shouldn't
      have much impact (except eg. getting the output in skb_dump() right).
      
      Fixes: eac3731b ("[S390]: Add AF_IUCV socket support")
      Signed-off-by: default avatarJulian Wiedmann <jwi@linux.ibm.com>
      Link: https://lore.kernel.org/r/20201120100657.34407-1-jwi@linux.ibm.com
      
      
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      c5dab094
    • Alexander Duyck's avatar
      tcp: Set INET_ECN_xmit configuration in tcp_reinit_congestion_control · 55472017
      Alexander Duyck authored
      
      When setting congestion control via a BPF program it is seen that the
      SYN/ACK for packets within a given flow will not include the ECT0 flag. A
      bit of simple printk debugging shows that when this is configured without
      BPF we will see the value INET_ECN_xmit value initialized in
      tcp_assign_congestion_control however when we configure this via BPF the
      socket is in the closed state and as such it isn't configured, and I do not
      see it being initialized when we transition the socket into the listen
      state. The result of this is that the ECT0 bit is configured based on
      whatever the default state is for the socket.
      
      Any easy way to reproduce this is to monitor the following with tcpdump:
      tools/testing/selftests/bpf/test_progs -t bpf_tcp_ca
      
      Without this patch the SYN/ACK will follow whatever the default is. If dctcp
      all SYN/ACK packets will have the ECT0 bit set, and if it is not then ECT0
      will be cleared on all SYN/ACK packets. With this patch applied the SYN/ACK
      bit matches the value seen on the other packets in the given stream.
      
      Fixes: 91b5b21c ("bpf: Add support for changing congestion control")
      Signed-off-by: default avatarAlexander Duyck <alexanderduyck@fb.com>
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      55472017
    • Alexander Duyck's avatar
      tcp: Allow full IP tos/IPv6 tclass to be reflected in L3 header · 861602b5
      Alexander Duyck authored
      
      An issue was recently found where DCTCP SYN/ACK packets did not have the
      ECT bit set in the L3 header. A bit of code review found that the recent
      change referenced below had gone though and added a mask that prevented the
      ECN bits from being populated in the L3 header.
      
      This patch addresses that by rolling back the mask so that it is only
      applied to the flags coming from the incoming TCP request instead of
      applying it to the socket tos/tclass field. Doing this the ECT bits were
      restored in the SYN/ACK packets in my testing.
      
      One thing that is not addressed by this patch set is the fact that
      tcp_reflect_tos appears to be incompatible with ECN based congestion
      avoidance algorithms. At a minimum the feature should likely be documented
      which it currently isn't.
      
      Fixes: ac8f1710 ("tcp: reflect tos value received in SYN to the socket")
      Signed-off-by: default avatarAlexander Duyck <alexanderduyck@fb.com>
      Acked-by: default avatarWei Wang <weiwan@google.com>
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      861602b5
  6. Nov 20, 2020
  7. Nov 19, 2020
  8. Nov 18, 2020
  9. Nov 17, 2020
  10. Nov 16, 2020
  11. Nov 15, 2020
Loading