1. 15 Dec, 2018 1 commit
  2. 09 Nov, 2018 1 commit
    • Stefano Brivio's avatar
      net: Convert protocol error handlers from void to int · 32bbd879
      Stefano Brivio authored
      We'll need this to handle ICMP errors for tunnels without a sending socket
      (i.e. FoU and GUE). There, we might have to look up different types of IP
      tunnels, registered as network protocols, before we get a match, so we
      want this for the error handlers of IPPROTO_IPIP and IPPROTO_IPV6 in both
      inet_protos and inet6_protos. These error codes will be used in the next
      For consistency, return sensible error codes in protocol error handlers
      whenever handlers can't handle errors because, even if valid, they don't
      match a protocol or any of its states.
      This has no effect on existing error handling paths.
      Signed-off-by: default avatarStefano Brivio <sbrivio@redhat.com>
      Reviewed-by: default avatarSabrina Dubroca <sd@queasysnail.net>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  3. 22 Jul, 2018 1 commit
    • David Ahern's avatar
      net/ipv6: Fix linklocal to global address with VRF · 24b711ed
      David Ahern authored
      Example setup:
          host: ip -6 addr add dev eth1 2001:db8:104::4
                 where eth1 is enslaved to a VRF
          switch: ip -6 ro add 2001:db8:104::4/128 dev br1
                  where br1 only has an LLA
                 ping6 2001:db8:104::4
                 ssh   2001:db8:104::4
      (NOTE: UDP works fine if the PKTINFO has the address set to the global
      address and ifindex is set to the index of eth1 with a destination an
      For ICMP, icmp6_iif needs to be updated to check if skb->dev is an
      L3 master. If it is then return the ifindex from rt6i_idev similar
      to what is done for loopback.
      For TCP, restore the original tcp_v6_iif definition which is needed in
      most places and add a new tcp_v6_iif_l3_slave that considers the
      l3_slave variability. This latter check is only needed for socket
      Fixes: 9ff74384 ("net: vrf: Handle ipv6 multicast and link-local addresses")
      Signed-off-by: default avatarDavid Ahern <dsahern@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  4. 15 Jun, 2018 1 commit
  5. 31 May, 2018 1 commit
  6. 16 May, 2018 2 commits
  7. 10 May, 2018 1 commit
    • Jon Maxwell's avatar
      tcp: Add mark for TIMEWAIT sockets · 00483690
      Jon Maxwell authored
      This version has some suggestions by Eric Dumazet:
      - Use a local variable for the mark in IPv6 instead of ctl_sk to avoid SMP
      - Use the more elegant "IP4_REPLY_MARK(net, skb->mark) ?: sk->sk_mark"
      - Factorize code as sk_fullsock() check is not necessary.
      Aidan McGurn from Openwave Mobility systems reported the following bug:
      "Marked routing is broken on customer deployment. Its effects are large
      increase in Uplink retransmissions caused by the client never receiving
      the final ACK to their FINACK - this ACK misses the mark and routes out
      of the incorrect route."
      Currently marks are added to sk_buffs for replies when the "fwmark_reflect"
      sysctl is enabled. But not for TW sockets that had sk->sk_mark set via
      Fix this in IPv4/v6 by adding tw->tw_mark for TIME_WAIT sockets. Copy the the
      original sk->sk_mark in __inet_twsk_hashdance() to the new tw->tw_mark location.
      Then progate this so that the skb gets sent with the correct mark. Do the same
      for resets. Give the "fwmark_reflect" sysctl precedence over sk->sk_mark so that
      netfilter rules are still honored.
      Signed-off-by: default avatarJon Maxwell <jmaxwell37@gmail.com>
      Reviewed-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  8. 31 Mar, 2018 1 commit
    • Andrey Ignatov's avatar
      bpf: Hooks for sys_connect · d74bad4e
      Andrey Ignatov authored
      == The problem ==
      See description of the problem in the initial patch of this patch set.
      == The solution ==
      The patch provides much more reliable in-kernel solution for the 2nd
      part of the problem: making outgoing connecttion from desired IP.
      It adds new attach types `BPF_CGROUP_INET4_CONNECT` and
      `BPF_CGROUP_INET6_CONNECT` for program type
      `BPF_PROG_TYPE_CGROUP_SOCK_ADDR` that can be used to override both
      source and destination of a connection at connect(2) time.
      Local end of connection can be bound to desired IP using newly
      introduced BPF-helper `bpf_bind()`. It allows to bind to only IP though,
      and doesn't support binding to port, i.e. leverages
      `IP_BIND_ADDRESS_NO_PORT` socket option. There are two reasons for this:
      * looking for a free port is expensive and can affect performance
      * there is no use-case for port.
      As for remote end (`struct sockaddr *` passed by user), both parts of it
      can be overridden, remote IP and remote port. It's useful if an
      application inside cgroup wants to connect to another application inside
      same cgroup or to itself, but knows nothing about IP assigned to the
      Support is added for IPv4 and IPv6, for TCP and UDP.
      IPv4 and IPv6 have separate attach types for same reason as sys_bind
      hooks, i.e. to prevent reading from / writing to e.g. user_ip6 fields
      when user passes sockaddr_in since it'd be out-of-bound.
      == Implementation notes ==
      The patch introduces new field in `struct proto`: `pre_connect` that is
      a pointer to a function with same signature as `connect` but is called
      before it. The reason is in some cases BPF hooks should be called way
      before control is passed to `sk->sk_prot->connect`. Specifically
      `inet_dgram_connect` autobinds socket before calling
      `sk->sk_prot->connect` and there is no way to call `bpf_bind()` from
      hooks from e.g. `ip4_datagram_connect` or `ip6_datagram_connect` since
      it'd cause double-bind. On the other hand `proto.pre_connect` provides a
      flexible way to add BPF hooks for connect only for necessary `proto` and
      call them at desired time before `connect`. Since `bpf_bind()` is
      allowed to bind only to IP and autobind in `inet_dgram_connect` binds
      only port there is no chance of double-bind.
      bpf_bind() sets `force_bind_address_no_port` to bind to only IP despite
      of value of `bind_address_no_port` socket field.
      bpf_bind() sets `with_lock` to `false` when calling to __inet_bind()
      and __inet6_bind() since all call-sites, where bpf_bind() is called,
      already hold socket lock.
      Signed-off-by: default avatarAndrey Ignatov <rdna@fb.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
  9. 27 Mar, 2018 1 commit
  10. 19 Feb, 2018 1 commit
    • Kirill Tkhai's avatar
      net: Convert tcpv6_net_ops · fef65a2c
      Kirill Tkhai authored
      These pernet_operations create and destroy net::ipv6.tcp_sk
      socket, which is used in tcp_v6_send_response() only. It looks
      like foreign pernet_operations don't want to set ipv6 connection
      inside destroyed net, so this socket may be created in destroyed
      in parallel with anything else. inet_twsk_purge() is also safe
      for that, as described in patch for tcp_sk_ops. So, it's possible
      to mark them as async.
      Signed-off-by: default avatarKirill Tkhai <ktkhai@virtuozzo.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  11. 14 Feb, 2018 1 commit
    • Eric Dumazet's avatar
      tcp: try to keep packet if SYN_RCV race is lost · e0f9759f
      Eric Dumazet authored
      배석진 reported that in some situations, packets for a given 5-tuple
      end up being processed by different CPUS.
      This involves RPS, and fragmentation.
      배석진 is seeing packet drops when a SYN_RECV request socket is
      moved into ESTABLISH state. Other states are protected by socket lock.
      This is caused by a CPU losing the race, and simply not caring enough.
      Since this seems to occur frequently, we can do better and perform
      a second lookup.
      Note that all needed memory barriers are already in the existing code,
      thanks to the spin_lock()/spin_unlock() pair in inet_ehash_insert()
      and reqsk_put(). The second lookup must find the new socket,
      unless it has already been accepted and closed by another cpu.
      Note that the fragmentation could be avoided in the first place by
      use of a correct TCP MSS option in the SYN{ACK} packet, but this
      does not mean we can not be more robust.
      Many thanks to 배석진 for a very detailed analysis.
      Reported-by: default avatar배석진 <soukjin.bae@samsung.com>
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  12. 08 Feb, 2018 1 commit
  13. 16 Jan, 2018 1 commit
    • Alexey Dobriyan's avatar
      net: delete /proc THIS_MODULE references · 96890d62
      Alexey Dobriyan authored
      /proc has been ignoring struct file_operations::owner field for 10 years.
      Specifically, it started with commit 786d7e16
      ("Fix rmmod/read/write races in /proc entries"). Notice the chunk where
      inode->i_fop is initialized with proxy struct file_operations for
      regular files:
      	-               if (de->proc_fops)
      	-                       inode->i_fop = de->proc_fops;
      	+               if (de->proc_fops) {
      	+                       if (S_ISREG(inode->i_mode))
      	+                               inode->i_fop = &proc_reg_file_ops;
      	+                       else
      	+                               inode->i_fop = de->proc_fops;
      	+               }
      VFS stopped pinning module at this point.
      Signed-off-by: default avatarAlexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  14. 08 Jan, 2018 1 commit
    • David Ahern's avatar
      net: ipv6: Allow connect to linklocal address from socket bound to vrf · 54dc3e33
      David Ahern authored
      Allow a process bound to a VRF to connect to a linklocal address.
      Currently, this fails because of a mismatch between the scope of the
      linklocal address and the sk_bound_dev_if inherited by the VRF binding:
          $ ssh -6 fe80::70b8:cff:fedd:ead8%eth1
          ssh: connect to host fe80::70b8:cff:fedd:ead8%eth1 port 22: Invalid argument
      Relax the scope check to allow the socket to be bound to the same L3
      device as the scope id.
      This makes ipv6 linklocal consistent with other relaxed checks enabled
      by commits 1ff23bee ("net: l3mdev: Allow send on enslaved interface")
      and 7bb387c5 ("net: Allow IP_MULTICAST_IF to set index to L3 slave").
      Signed-off-by: default avatarDavid Ahern <dsahern@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  15. 20 Dec, 2017 1 commit
  16. 12 Dec, 2017 1 commit
    • Christoph Paasch's avatar
      tcp md5sig: Use skb's saddr when replying to an incoming segment · 30791ac4
      Christoph Paasch authored
      The MD5-key that belongs to a connection is identified by the peer's
      IP-address. When we are in tcp_v4(6)_reqsk_send_ack(), we are replying
      to an incoming segment from tcp_check_req() that failed the seq-number
      Thus, to find the correct key, we need to use the skb's saddr and not
      the daddr.
      This bug seems to have been there since quite a while, but probably got
      unnoticed because the consequences are not catastrophic. We will call
      tcp_v4_reqsk_send_ack only to send a challenge-ACK back to the peer,
      thus the connection doesn't really fail.
      Fixes: 9501f972 ("tcp md5sig: Let the caller pass appropriate key for tcp_v{4,6}_do_calc_md5_hash().")
      Signed-off-by: default avatarChristoph Paasch <cpaasch@apple.com>
      Reviewed-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  17. 03 Dec, 2017 1 commit
    • Eric Dumazet's avatar
      tcp: add tcp_v4_fill_cb()/tcp_v4_restore_cb() · eeea10b8
      Eric Dumazet authored
      James Morris reported kernel stack corruption bug [1] while
      running the SELinux testsuite, and bisected to a recent
      commit bffa72cf ("net: sk_buff rbnode reorg")
      We believe this commit is fine, but exposes an older bug.
      SELinux code runs from tcp_filter() and might send an ICMP,
      expecting IP options to be found in skb->cb[] using regular IPCB placement.
      We need to defer TCP mangling of skb->cb[] after tcp_filter() calls.
      This patch adds tcp_v4_fill_cb()/tcp_v4_restore_cb() in a very
      similar way we added them for IPv6.
      [  339.806024] SELinux: failure in selinux_parse_skb(), unable to parse packet
      [  339.822505] Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in: ffffffff81745af5
      [  339.822505]
      [  339.852250] CPU: 4 PID: 3642 Comm: client Not tainted 4.15.0-rc1-test #15
      [  339.868498] Hardware name: LENOVO 10FGS0VA1L/30BC, BIOS FWKT68A   01/19/2017
      [  339.885060] Call Trace:
      [  339.896875]  <IRQ>
      [  339.908103]  dump_stack+0x63/0x87
      [  339.920645]  panic+0xe8/0x248
      [  339.932668]  ? ip_push_pending_frames+0x33/0x40
      [  339.946328]  ? icmp_send+0x525/0x530
      [  339.958861]  ? kfree_skbmem+0x60/0x70
      [  339.971431]  __stack_chk_fail+0x1b/0x20
      [  339.984049]  icmp_send+0x525/0x530
      [  339.996205]  ? netlbl_skbuff_err+0x36/0x40
      [  340.008997]  ? selinux_netlbl_err+0x11/0x20
      [  340.021816]  ? selinux_socket_sock_rcv_skb+0x211/0x230
      [  340.035529]  ? security_sock_rcv_skb+0x3b/0x50
      [  340.048471]  ? sk_filter_trim_cap+0x44/0x1c0
      [  340.061246]  ? tcp_v4_inbound_md5_hash+0x69/0x1b0
      [  340.074562]  ? tcp_filter+0x2c/0x40
      [  340.086400]  ? tcp_v4_rcv+0x820/0xa20
      [  340.098329]  ? ip_local_deliver_finish+0x71/0x1a0
      [  340.111279]  ? ip_local_deliver+0x6f/0xe0
      [  340.123535]  ? ip_rcv_finish+0x3a0/0x3a0
      [  340.135523]  ? ip_rcv_finish+0xdb/0x3a0
      [  340.147442]  ? ip_rcv+0x27c/0x3c0
      [  340.158668]  ? inet_del_offload+0x40/0x40
      [  340.170580]  ? __netif_receive_skb_core+0x4ac/0x900
      [  340.183285]  ? rcu_accelerate_cbs+0x5b/0x80
      [  340.195282]  ? __netif_receive_skb+0x18/0x60
      [  340.207288]  ? process_backlog+0x95/0x140
      [  340.218948]  ? net_rx_action+0x26c/0x3b0
      [  340.230416]  ? __do_softirq+0xc9/0x26a
      [  340.241625]  ? do_softirq_own_stack+0x2a/0x40
      [  340.253368]  </IRQ>
      [  340.262673]  ? do_softirq+0x50/0x60
      [  340.273450]  ? __local_bh_enable_ip+0x57/0x60
      [  340.285045]  ? ip_finish_output2+0x175/0x350
      [  340.296403]  ? ip_finish_output+0x127/0x1d0
      [  340.307665]  ? nf_hook_slow+0x3c/0xb0
      [  340.318230]  ? ip_output+0x72/0xe0
      [  340.328524]  ? ip_fragment.constprop.54+0x80/0x80
      [  340.340070]  ? ip_local_out+0x35/0x40
      [  340.350497]  ? ip_queue_xmit+0x15c/0x3f0
      [  340.361060]  ? __kmalloc_reserve.isra.40+0x31/0x90
      [  340.372484]  ? __skb_clone+0x2e/0x130
      [  340.382633]  ? tcp_transmit_skb+0x558/0xa10
      [  340.393262]  ? tcp_connect+0x938/0xad0
      [  340.403370]  ? ktime_get_with_offset+0x4c/0xb0
      [  340.414206]  ? tcp_v4_connect+0x457/0x4e0
      [  340.424471]  ? __inet_stream_connect+0xb3/0x300
      [  340.435195]  ? inet_stream_connect+0x3b/0x60
      [  340.445607]  ? SYSC_connect+0xd9/0x110
      [  340.455455]  ? __audit_syscall_entry+0xaf/0x100
      [  340.466112]  ? syscall_trace_enter+0x1d0/0x2b0
      [  340.476636]  ? __audit_syscall_exit+0x209/0x290
      [  340.487151]  ? SyS_connect+0xe/0x10
      [  340.496453]  ? do_syscall_64+0x67/0x1b0
      [  340.506078]  ? entry_SYSCALL64_slow_path+0x25/0x25
      Fixes: 971f10ec ("tcp: better TCP_SKB_CB layout to reduce cache line misses")
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Reported-by: default avatarJames Morris <james.l.morris@oracle.com>
      Tested-by: default avatarJames Morris <james.l.morris@oracle.com>
      Tested-by: default avatarCasey Schaufler <casey@schaufler-ca.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  18. 30 Nov, 2017 1 commit
  19. 10 Nov, 2017 1 commit
  20. 24 Oct, 2017 1 commit
  21. 18 Oct, 2017 1 commit
  22. 08 Sep, 2017 1 commit
  23. 28 Aug, 2017 1 commit
  24. 24 Aug, 2017 1 commit
    • Mike Maloney's avatar
      tcp: Extend SOF_TIMESTAMPING_RX_SOFTWARE to TCP recvmsg · 98aaa913
      Mike Maloney authored
      When SOF_TIMESTAMPING_RX_SOFTWARE is enabled for tcp sockets, return the
      timestamp corresponding to the highest sequence number data returned.
      Previously the skb->tstamp is overwritten when a TCP packet is placed
      in the out of order queue.  While the packet is in the ooo queue, save the
      timestamp in the TCB_SKB_CB.  This space is shared with the gso_*
      options which are only used on the tx path, and a previously unused 4
      byte hole.
      When skbs are coalesced either in the sk_receive_queue or the
      out_of_order_queue always choose the timestamp of the appended skb to
      maintain the invariant of returning the timestamp of the last byte in
      the recvmsg buffer.
      Signed-off-by: default avatarMike Maloney <maloney@google.com>
      Acked-by: default avatarWillem de Bruijn <willemb@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  25. 15 Aug, 2017 1 commit
  26. 07 Aug, 2017 1 commit
  27. 01 Aug, 2017 1 commit
  28. 31 Jul, 2017 1 commit
    • Florian Westphal's avatar
      tcp: remove prequeue support · e7942d06
      Florian Westphal authored
      prequeue is a tcp receive optimization that moves part of rx processing
      from bh to process context.
      This only works if the socket being processed belongs to a process that
      is blocked in recv on that socket.
      In practice, this doesn't happen anymore that often because nowadays
      servers tend to use an event driven (epoll) model.
      Even normal client applications (web browsers) commonly use many tcp
      connections in parallel.
      This has measureable impact only in netperf (which uses plain recv and
      thus allows prequeue use) from host to locally running vm (~4%), however,
      there were no changes when using netperf between two physical hosts with
      ixgbe interfaces.
      Signed-off-by: default avatarFlorian Westphal <fw@strlen.de>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  29. 29 Jul, 2017 1 commit
  30. 25 Jul, 2017 1 commit
  31. 01 Jul, 2017 2 commits
  32. 22 Jun, 2017 1 commit
  33. 19 Jun, 2017 2 commits
  34. 16 Jun, 2017 1 commit
    • Johannes Berg's avatar
      networking: make skb_push & __skb_push return void pointers · d58ff351
      Johannes Berg authored
      It seems like a historic accident that these return unsigned char *,
      and in many places that means casts are required, more often than not.
      Make these functions return void * and remove all the casts across
      the tree, adding a (u8 *) cast only where the unsigned char pointer
      was used directly, all done with the following spatch:
          expression SKB, LEN;
          typedef u8;
          identifier fn = { skb_push, __skb_push, skb_push_rcsum };
          - *(fn(SKB, LEN))
          + *(u8 *)fn(SKB, LEN)
          expression E, SKB, LEN;
          identifier fn = { skb_push, __skb_push, skb_push_rcsum };
          type T;
          - E = ((T *)(fn(SKB, LEN)))
          + E = fn(SKB, LEN)
          expression SKB, LEN;
          identifier fn = { skb_push, __skb_push, skb_push_rcsum };
          - fn(SKB, LEN)[0]
          + *(u8 *)fn(SKB, LEN)
      Note that the last part there converts from push(...)[0] to the
      more idiomatic *(u8 *)push(...).
      Signed-off-by: default avatarJohannes Berg <johannes.berg@intel.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  35. 10 Jun, 2017 1 commit
  36. 08 Jun, 2017 2 commits
    • Eric Dumazet's avatar
      tcp: add TCPMemoryPressuresChrono counter · 06044751
      Eric Dumazet authored
      DRAM supply shortage and poor memory pressure tracking in TCP
      stack makes any change in SO_SNDBUF/SO_RCVBUF (or equivalent autotuning
      limits) and tcp_mem[] quite hazardous.
      TCPMemoryPressures SNMP counter is an indication of tcp_mem sysctl
      limits being hit, but only tracking number of transitions.
      If TCP stack behavior under stress was perfect :
      1) It would maintain memory usage close to the limit.
      2) Memory pressure state would be entered for short times.
      We certainly prefer 100 events lasting 10ms compared to one event
      lasting 200 seconds.
      This patch adds a new SNMP counter tracking cumulative duration of
      memory pressure events, given in ms units.
      $ cat /proc/sys/net/ipv4/tcp_mem
      3088    4117    6176
      $ grep TCP /proc/net/sockstat
      TCP: inuse 180 orphan 0 tw 2 alloc 234 mem 4140
      $ nstat -n ; sleep 10 ; nstat |grep Pressure
      TcpExtTCPMemoryPressures        1700
      TcpExtTCPMemoryPressuresChrono  5209
      v2: Used EXPORT_SYMBOL_GPL() instead of EXPORT_SYMBOL() as David
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
    • Eric Dumazet's avatar