1. 30 Nov, 2018 1 commit
  2. 28 Nov, 2018 1 commit
  3. 25 Nov, 2018 1 commit
    • tcp: address problems caused by EDT misshaps · 9efdda4e
      Eric Dumazet authored
      When a qdisc setup including pacing FQ is dismantled and recreated,
      some TCP packets are sent earlier than instructed by the TCP stack.
      
      TCP can be fooled when the ACK comes back, because the following
      operation can return a negative value.
      
          tcp_time_stamp(tp) - tp->rx_opt.rcv_tsecr;
      
      Some paths in the TCP stack were not dealing properly with this;
      this patch addresses four of them.
      
      Fixes: ab408b6d ("tcp: switch tcp and sch_fq to new earliest departure time model")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
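The negative result described in this commit message can be illustrated with a minimal sketch. It assumes 32-bit wrapping timestamps and uses a hypothetical helper name, `ts_delta_clamped`; it is not the kernel's code, only one way to keep an echo that appears to come from the future from being read as a huge unsigned delta.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper (not a kernel function): elapsed ticks between an
 * echoed timestamp and the current clock, both 32-bit wrapping counters.
 * If the echo appears to come from the future, the signed difference is
 * negative and the result is clamped to 0.
 */
static uint32_t ts_delta_clamped(uint32_t now, uint32_t tsecr)
{
	/* Unsigned subtraction wraps; the cast assumes two's complement. */
	int32_t delta = (int32_t)(now - tsecr);

	return delta < 0 ? 0 : (uint32_t)delta;
}

/* Usage example: call from main() to check the three interesting cases. */
void ts_delta_example(void)
{
	assert(ts_delta_clamped(1000u, 900u) == 100u);    /* normal case */
	assert(ts_delta_clamped(900u, 1000u) == 0u);      /* echo from the future */
	assert(ts_delta_clamped(5u, 0xFFFFFFFBu) == 10u); /* clock wrapped */
}
```

Clamping is only one possible policy; a caller could equally treat a negative delta as "no valid sample" and skip the measurement.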
  4. 21 Nov, 2018 1 commit
    • tcp: defer SACK compression after DupThresh · 86de5921
      Eric Dumazet authored
      Jean-Louis reported a TCP regression and bisected to recent SACK
      compression.
      
      After a loss episode (receiver not able to keep up and dropping
      packets because its backlog is full), the Linux TCP stack sends
      a single SACK (DUPACK).
      
      The sender then waits a full RTO timer before recovering losses.
      
      While RFC 6675 says in section 5, "Algorithm Details",
      
         (2) If DupAcks < DupThresh but IsLost (HighACK + 1) returns true --
             indicating at least three segments have arrived above the current
             cumulative acknowledgment point, which is taken to indicate loss
             -- go to step (4).
      ...
         (4) Invoke fast retransmit and enter loss recovery as follows:
      
      there are old TCP stacks not implementing this strategy, and
      still counting the dupacks before starting fast retransmit.
      
      While these stacks probably perform poorly when receivers implement
      LRO/GRO, we should be a little more gentle to them.
      
      This patch makes sure we do not enable SACK compression unless
      3 dupacks have been sent since last rcv_nxt update.
      
      Ideally we should even rearm the timer to send one or two
      more DUPACKs if no more packets are coming, but that will
      be work aiming for linux-4.21.
      
      Many thanks to Jean-Louis for bisecting the issue, providing
      packet captures and testing this patch.
      
      Fixes: 5d9f4262 ("tcp: add SACK compression")
      Reported-by: Jean-Louis Dupond <jean-louis@dupond.be>
      Tested-by: Jean-Louis Dupond <jean-louis@dupond.be>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
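The rule stated in this commit message (no SACK compression until 3 dupacks have been sent since the last rcv_nxt update) can be sketched as a small receiver-side model. The struct, its field names, and `may_compress_sack` are hypothetical and chosen for illustration; the real patch may track this state differently.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DUP_THRESH 3 /* classic fast-retransmit threshold */

/* Hypothetical receiver state; field names are illustrative only. */
struct rx_state {
	uint32_t rcv_nxt;         /* next in-order sequence expected */
	uint32_t tracked_rcv_nxt; /* rcv_nxt when the counter was last reset */
	unsigned int dupacks;     /* dupacks sent since that reset */
};

/*
 * Returns true if the next duplicate ACK may be compressed (delayed and
 * merged), false if it must be sent immediately so that senders still
 * counting dupacks can reach DupThresh.
 */
static bool may_compress_sack(struct rx_state *rx)
{
	if (rx->tracked_rcv_nxt != rx->rcv_nxt) {
		/* rcv_nxt advanced: start counting dupacks again. */
		rx->tracked_rcv_nxt = rx->rcv_nxt;
		rx->dupacks = 0;
	}
	if (rx->dupacks < DUP_THRESH) {
		rx->dupacks++;
		return false;
	}
	return true;
}

/* Usage example: three immediate dupacks, then compression is allowed. */
void sack_compress_example(void)
{
	struct rx_state rx = { .rcv_nxt = 1000 };

	assert(!may_compress_sack(&rx));
	assert(!may_compress_sack(&rx));
	assert(!may_compress_sack(&rx));
	assert(may_compress_sack(&rx));
}
```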
  5. 20 Nov, 2018 1 commit
  6. 11 Nov, 2018 1 commit
  7. 24 Oct, 2018 1 commit
    • tcp: add tcp_reset_xmit_timer() helper · 3f80e08f
      Eric Dumazet authored
      With the EDT model, SRTT is no longer inflated by pacing delays.
      
      This means that RTO and some other xmit timers might be set up
      incorrectly. This is particularly visible with either:
      
      - Very small enforced pacing rates (SO_MAX_PACING_RATE)
      - Reduced rto (from the default 200 ms)
      
      This can lead to TCP flow aborts in the worst case,
      or spurious retransmits in other cases.
      
      For example, this session gets far more throughput
      than the requested 80kbit:
      
      $ netperf -H 127.0.0.2 -l 100 -- -q 10000
      MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 127.0.0.2 () port 0 AF_INET
      Recv   Send    Send
      Socket Socket  Message  Elapsed
      Size   Size    Size     Time     Throughput
      bytes  bytes   bytes    secs.    10^6bits/sec
      
      540000 262144 262144    104.00      2.66
      
      With the fix:
      
      $ netperf -H 127.0.0.2 -l 100 -- -q 10000
      MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 127.0.0.2 () port 0 AF_INET
      Recv   Send    Send
      Socket Socket  Message  Elapsed
      Size   Size    Size     Time     Throughput
      bytes  bytes   bytes    secs.    10^6bits/sec
      
      540000 262144 262144    104.00      0.12
      
      EDT allows for better control of rtx timers, since TCP has
      a better idea of the earliest departure time of each skb
      in the rtx queue. Where needed, we only have to add to the
      timer the difference between the EDT and the current time.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
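The timer adjustment described in the last paragraph of this commit message can be sketched in userspace as follows. This is a simplified model under stated assumptions: times are plain nanosecond counters, and the function names `pacing_delay_ns` and `xmit_timer_expiry_ns` are hypothetical rather than the kernel helper's actual signature.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical model: extra delay to add to a transmit timer, i.e. how
 * far in the future the packet's earliest departure time (EDT) lies.
 * An EDT that is already in the past contributes nothing.
 */
static uint64_t pacing_delay_ns(uint64_t edt_ns, uint64_t now_ns)
{
	return edt_ns > now_ns ? edt_ns - now_ns : 0;
}

/* Absolute expiry of a timer armed now with the given base timeout. */
static uint64_t xmit_timer_expiry_ns(uint64_t now_ns, uint64_t timeout_ns,
				     uint64_t edt_ns)
{
	return now_ns + timeout_ns + pacing_delay_ns(edt_ns, now_ns);
}

/* Usage example: a packet paced 500 ns into the future delays the timer. */
void xmit_timer_example(void)
{
	assert(xmit_timer_expiry_ns(1000, 200, 1500) == 1700);
	assert(xmit_timer_expiry_ns(1000, 200, 500) == 1200);
}
```

With a very low pacing rate the EDT of a queued packet can be far ahead of the current time, which is why a timer armed without this extra term may fire before the packet has even left the host.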
  8. 01 Oct, 2018 2 commits
  9. 29 Sep, 2018 1 commit
    • tcp: up initial rmem to 128KB and SYN rwin to around 64KB · a337531b
      Yuchung Cheng authored
      Previously the TCP initial receive buffer was ~87KB by default and
      the initial receive window was ~29KB (20 MSS). This patch changes
      the two numbers to 128KB and ~64KB (rounding down to the multiples
      of MSS) respectively. The patch also simplifies the calculations so
      that the two numbers are directly controlled by sysctl tcp_rmem[1]:
      
        1) Initial receiver buffer budget (sk_rcvbuf): while this should
           be configured via sysctl tcp_rmem[1], previously tcp_fixup_rcvbuf()
           always overrode it and set a larger size when a new connection
           was established.
      
        2) Initial receive window in SYN: previously it was set to 20
           packets if MSS <= 1460. The number 20 was based on the initial
           congestion window of 10: the receiver needs twice that amount to
           avoid being limited by the receive window upon out-of-order
           delivery in the first window burst. But since this only
           applies if the receiving MSS <= 1460, connections using a large
           MTU (e.g. to utilize receiver zero-copy) may be limited by the
           receive window.
      
      With this patch TCP memory configuration is more straightforward and
      more properly sized to modern high-speed networks by default. Several
      popular stacks have been announcing 64KB rwin in SYNs as well.
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Wei Wang <weiwan@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
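The window arithmetic in this commit message can be checked with a short sketch. It assumes that "~64KB" means the largest unscaled window, 65535 bytes, rounded down to a multiple of the MSS; that starting value and the helper name `round_down_to_mss` are assumptions made for illustration, not taken from the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: largest multiple of mss that fits in bytes. */
static uint32_t round_down_to_mss(uint32_t bytes, uint32_t mss)
{
	assert(mss > 0);
	return bytes - bytes % mss;
}

/* Usage example: old and new initial windows for common MSS values. */
void rwin_example(void)
{
	/* Old behaviour: 20 packets of 1460 bytes, about 29KB. */
	assert(20u * 1460u == 29200u);
	/* New behaviour under the 65535-byte assumption: 44 full segments. */
	assert(round_down_to_mss(65535u, 1460u) == 64240u);
	/* A jumbo-frame MSS of 8960 still gets 7 full segments. */
	assert(round_down_to_mss(65535u, 8960u) == 62720u);
}
```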
  10. 22 Sep, 2018 1 commit
  11. 12 Sep, 2018 1 commit
  12. 01 Sep, 2018 1 commit
    • tcp: change IPv6 flow-label upon receiving spurious retransmission · 7788174e
      Yuchung Cheng authored
      Currently a Linux IPv6 TCP sender will change the flow label upon
      timeouts to potentially steer away from a data path that has gone
      bad. However this does not help if the problem is on the ACK path
      and the data path is healthy. In this case the receiver is likely
      to receive repeated spurious retransmissions because the sender
      couldn't get the ACKs in time and has recurring timeouts.
      
      This patch adds another feature to mitigate this problem. It
      leverages the DSACK states in the receiver to change the flow
      label of the ACKs to speculatively re-route the ACK packets.
      In order to allow triggering on the second consecutive spurious
      RTO, the receiver changes the flow label upon sending a second
      consecutive DSACK for a sequence number below RCV.NXT.
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  13. 11 Aug, 2018 3 commits
    • tcp: avoid resetting ACK timer upon receiving packet with ECN CWR flag · fd2123a3
      Yuchung Cheng authored
      Previously commit 9aee4000 ("tcp: ack immediately when a cwr
      packet arrives") called tcp_enter_quickack_mode to force sending
      two immediate ACKs upon receiving a packet w/ CWR flag. The side
      effect is that it also resets the delayed ACK timer and interactive
      session tracking. This patch removes that side effect by using the
      new ACK_NOW flag to force an immediate ACK.
      
      Packetdrill to demonstrate:
      
          0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
         +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
         +0 setsockopt(3, SOL_TCP, TCP_CONGESTION, "dctcp", 5) = 0
         +0 bind(3, ..., ...) = 0
         +0 listen(3, 1) = 0
      
         +0 < [ect0] SEW 0:0(0) win 32792 <mss 1000,sackOK,nop,nop,nop,wscale 7>
         +0 > SE. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 8>
        +.1 < [ect0] . 1:1(0) ack 1 win 257
         +0 accept(3, ..., ...) = 4
      
         +0 < [ect0] . 1:1001(1000) ack 1 win 257
         +0 > [ect01] . 1:1(0) ack 1001
      
         +0 write(4, ..., 1) = 1
         +0 > [ect01] P. 1:2(1) ack 1001
      
         +0 < [ect0] . 1001:2001(1000) ack 2 win 257
         +0 write(4, ..., 1) = 1
         +0 > [ect01] P. 2:3(1) ack 2001
      
         +0 < [ect0] . 2001:3001(1000) ack 3 win 257
         +0 < [ect0] . 3001:4001(1000) ack 3 win 257
         // Ack delayed ...
      
         +.01 < [ce] P. 4001:4501(500) ack 3 win 257
         +0 > [ect01] . 3:3(0) ack 4001
         +0 > [ect01] E. 3:3(0) ack 4501
      
      +.001 read(4, ..., 4500) = 4500
         +0 write(4, ..., 1) = 1
         +0 > [ect01] PE. 3:4(1) ack 4501 win 100
      
       +.01 < [ect0] W. 4501:5501(1000) ack 4 win 257
         // No delayed ACK on CWR flag
         +0 > [ect01] . 4:4(0) ack 5501
      
       +.31 < [ect0] . 5501:6501(1000) ack 4 win 257
         +0 > [ect01] . 4:4(0) ack 6501
      
      Fixes: 9aee4000 ("tcp: ack immediately when a cwr packet arrives")
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: always ACK immediately on hole repairs · 15bdd568
      Yuchung Cheng authored
      RFC 5681 sec 4.2:
        To provide feedback to senders recovering from losses, the receiver
        SHOULD send an immediate ACK when it receives a data segment that
        fills in all or part of a gap in the sequence space.
      
      When a gap is partially filled, __tcp_ack_snd_check already checks
      the out-of-order queue and correctly sends an immediate ACK. However
      when a gap is fully filled, the previous implementation only resets
      pingpong mode, which does not guarantee an immediate ACK because the
      quick ACK counter may be zero. This patch addresses this issue by
      marking the one-time immediate ACK flag instead.
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Wei Wang <weiwan@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: mandate a one-time immediate ACK · 466466dc
      Yuchung Cheng authored
      Add a new flag to indicate a one-time immediate ACK. This flag is
      occasionally set under specific TCP protocol states in addition to
      the more common quickack mechanism for interactive applications.
      
      In several cases in the TCP code we want to force an immediate ACK
      but do not want to call tcp_enter_quickack_mode() because we do
      not want to forget the icsk_ack.pingpong or icsk_ack.ato state.
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Wei Wang <weiwan@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
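The idea of a one-time immediate ACK flag that leaves quickack and pingpong state alone can be sketched with a toy model. All names here (`struct ack_state`, `ACK_NOW`, `request_immediate_ack`, `ack_send_check`) are hypothetical and the logic is deliberately simplified; it only illustrates the separation of concerns described in this commit message.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical pending-ACK flags; values are illustrative. */
enum { ACK_SCHED = 1, ACK_NOW = 2 };

/* Hypothetical delayed-ACK state. */
struct ack_state {
	unsigned int pending; /* bitmask of the flags above */
	unsigned int quick;   /* remaining quick ACKs */
	bool pingpong;        /* interactive-session hint */
};

/* Ask for exactly one immediate ACK without touching quick/pingpong. */
static void request_immediate_ack(struct ack_state *a)
{
	a->pending |= ACK_NOW;
}

/* Returns true if an ACK should be sent right away. */
static bool ack_send_check(struct ack_state *a)
{
	bool quick = a->quick > 0 && !a->pingpong;

	if (!(a->pending & ACK_NOW) && !quick)
		return false; /* leave it to the delayed-ACK timer */
	if (quick)
		a->quick--;
	a->pending = 0; /* sending an ACK clears the one-time flag */
	return true;
}

/* Usage example: the flag forces one ACK and then disappears. */
void ack_now_example(void)
{
	struct ack_state a = { .pending = ACK_SCHED, .quick = 0, .pingpong = true };

	assert(!ack_send_check(&a));
	request_immediate_ack(&a);
	assert(ack_send_check(&a));
	assert(a.pingpong && a.quick == 0); /* session tracking untouched */
	assert(!ack_send_check(&a));
}
```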
  14. 01 Aug, 2018 2 commits
  15. 25 Jul, 2018 1 commit
    • tcp: ack immediately when a cwr packet arrives · 9aee4000
      Lawrence Brakmo authored
      We observed high 99% and 99.9% latencies when doing RPCs with DCTCP. The
      problem is triggered when the last packet of a request arrives CE
      marked. The reply will carry the ECE mark causing TCP to shrink its cwnd
      to 1 (because there are no packets in flight). When the 1st packet of
      the next request arrives, the ACK was sometimes delayed even though it
      is CWR marked, adding up to 40ms to the RPC latency.
      
      This patch ensures that arriving CWR-marked data packets are acked
      immediately.
      
      Packetdrill script to reproduce the problem:
      
      0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
      0.000 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
      0.000 setsockopt(3, SOL_TCP, TCP_CONGESTION, "dctcp", 5) = 0
      0.000 bind(3, ..., ...) = 0
      0.000 listen(3, 1) = 0
      
      0.100 < [ect0] SEW 0:0(0) win 32792 <mss 1000,sackOK,nop,nop,nop,wscale 7>
      0.100 > SE. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 8>
      0.110 < [ect0] . 1:1(0) ack 1 win 257
      0.200 accept(3, ..., ...) = 4
      
      0.200 < [ect0] . 1:1001(1000) ack 1 win 257
      0.200 > [ect01] . 1:1(0) ack 1001
      
      0.200 write(4, ..., 1) = 1
      0.200 > [ect01] P. 1:2(1) ack 1001
      
      0.200 < [ect0] . 1001:2001(1000) ack 2 win 257
      0.200 write(4, ..., 1) = 1
      0.200 > [ect01] P. 2:3(1) ack 2001
      
      0.200 < [ect0] . 2001:3001(1000) ack 3 win 257
      0.200 < [ect0] . 3001:4001(1000) ack 3 win 257
      0.200 > [ect01] . 3:3(0) ack 4001
      
      0.210 < [ce] P. 4001:4501(500) ack 3 win 257
      
      +0.001 read(4, ..., 4500) = 4500
      +0 write(4, ..., 1) = 1
      +0 > [ect01] PE. 3:4(1) ack 4501
      
      +0.010 < [ect0] W. 4501:5501(1000) ack 4 win 257
      // Previously the ACK sequence below would be 4501, causing a long RTO
      +0.040~+0.045 > [ect01] . 4:4(0) ack 5501   // delayed ack
      
      +0.311 < [ect0] . 5501:6501(1000) ack 4 win 257  // More data
      +0 > [ect01] . 4:4(0) ack 6501     // now acks everything
      
      +0.500 < F. 9501:9501(0) ack 4 win 257
      
      Modified based on comments by Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Acked-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  16. 23 Jul, 2018 5 commits
  17. 20 Jul, 2018 1 commit
    • tcp: do not delay ACK in DCTCP upon CE status change · a0496ef2
      Yuchung Cheng authored
      Per DCTCP RFC8257 (Section 3.2) the ACK reflecting the CE status change
      has to be sent immediately so the sender can respond quickly:
      
      """ When receiving packets, the CE codepoint MUST be processed as follows:
      
         1.  If the CE codepoint is set and DCTCP.CE is false, set DCTCP.CE to
             true and send an immediate ACK.
      
         2.  If the CE codepoint is not set and DCTCP.CE is true, set DCTCP.CE
             to false and send an immediate ACK.
      """
      
      Previously the DCTCP implementation could continue to delay the ACK.
      This patch fixes that to implement the RFC by forcing an immediate ACK.
      
      Tested with this packetdrill script provided by Larry Brakmo:
      
      0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
      0.000 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
      0.000 setsockopt(3, SOL_TCP, TCP_CONGESTION, "dctcp", 5) = 0
      0.000 bind(3, ..., ...) = 0
      0.000 listen(3, 1) = 0
      
      0.100 < [ect0] SEW 0:0(0) win 32792 <mss 1000,sackOK,nop,nop,nop,wscale 7>
      0.100 > SE. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 8>
      0.110 < [ect0] . 1:1(0) ack 1 win 257
      0.200 accept(3, ..., ...) = 4
         +0 setsockopt(4, SOL_SOCKET, SO_DEBUG, [1], 4) = 0
      
      0.200 < [ect0] . 1:1001(1000) ack 1 win 257
      0.200 > [ect01] . 1:1(0) ack 1001
      
      0.200 write(4, ..., 1) = 1
      0.200 > [ect01] P. 1:2(1) ack 1001
      
      0.200 < [ect0] . 1001:2001(1000) ack 2 win 257
      +0.005 < [ce] . 2001:3001(1000) ack 2 win 257
      
      +0.000 > [ect01] . 2:2(0) ack 2001
      // Previously the ACK below would be delayed by 40ms
      +0.000 > [ect01] E. 2:2(0) ack 3001
      
      +0.500 < F. 9501:9501(0) ack 4 win 257
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
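The two RFC 8257 rules quoted in this commit message amount to a one-bit state machine. The sketch below models only that receiver-side rule; `struct dctcp_rx` and `dctcp_ce_update` are hypothetical names, and real implementations carry additional state.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical receiver state: the DCTCP.CE variable from the RFC text. */
struct dctcp_rx {
	bool ce;
};

/*
 * Process the CE codepoint of an arriving packet. Returns true when the
 * CE status changed, meaning an immediate ACK must be sent.
 */
static bool dctcp_ce_update(struct dctcp_rx *rx, bool pkt_ce)
{
	if (pkt_ce == rx->ce)
		return false; /* no change: normal delayed-ACK rules apply */
	rx->ce = pkt_ce;
	return true;
}

/* Usage example: only transitions of the CE bit force an ACK. */
void dctcp_example(void)
{
	struct dctcp_rx rx = { .ce = false };

	assert(!dctcp_ce_update(&rx, false)); /* not marked, no change */
	assert(dctcp_ce_update(&rx, true));   /* rule 1: CE set, was false */
	assert(!dctcp_ce_update(&rx, true));  /* still marked, no change */
	assert(dctcp_ce_update(&rx, false));  /* rule 2: CE cleared, was true */
}
```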
  18. 16 Jul, 2018 1 commit
  19. 14 Jul, 2018 1 commit
  20. 12 Jul, 2018 1 commit
    • tcp: use monotonic timestamps for PAWS · cca9bab1
      Arnd Bergmann authored
      Using get_seconds() for timestamps is deprecated since it can lead
      to overflows on 32-bit systems. While the interface generally doesn't
      overflow until year 2106, the specific implementation of the TCP PAWS
      algorithm breaks in 2038 when the intermediate signed 32-bit timestamps
      overflow.
      
      A related problem is that the local timestamps in CLOCK_REALTIME form
      lead to unexpected behavior when settimeofday is called to set the system
      clock backwards or forwards by more than 24 days.
      
      While the first problem could be solved by using an overflow-safe method
      of comparing the timestamps, a nicer solution is to use a monotonic
      clocksource with ktime_get_seconds() that simply doesn't overflow (at
      least not until 136 years after boot) and that doesn't change during
      settimeofday().
      
      To make 32-bit and 64-bit architectures behave the same way here, and
      also save a few bytes in the tcp_options_received structure, I'm changing
      the type to a 32-bit integer, which is now safe on all architectures.
      
      Finally, the ts_recent_stamp field also (confusingly) gets used to store
      a jiffies value in tcp_synq_overflow()/tcp_synq_no_recent_overflow().
      This is currently safe, but changing the type to 32-bit requires
      some small changes there to keep it working.
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
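The "overflow-safe method of comparing the timestamps" mentioned in this commit message as the alternative fix can be sketched as below, alongside an age check on an unsigned 32-bit seconds counter. The helper names are hypothetical, and using the 24-day span as an expiry window is an assumption made for the example. For scale, 2^32 seconds is roughly 136 years, which matches the figure given in the message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* 24 days in seconds, used here as an assumed expiry window. */
#define WINDOW_24DAYS_SECS (24u * 24u * 60u * 60u)

/*
 * Wrap-safe ordering of two 32-bit counters: true if a is before b.
 * Valid while the two values are less than 2^31 apart; the cast assumes
 * two's complement.
 */
static bool ts_before(uint32_t a, uint32_t b)
{
	return (int32_t)(a - b) < 0;
}

/* True if stamp is at least 24 days older than now (seconds counters). */
static bool stamp_expired(uint32_t now, uint32_t stamp)
{
	return (uint32_t)(now - stamp) >= WINDOW_24DAYS_SECS;
}

/* Usage example: both helpers keep working across a counter wrap. */
void paws_example(void)
{
	assert(ts_before(0xFFFFFFF0u, 0x10u));   /* before the wrap vs after */
	assert(!ts_before(0x10u, 0xFFFFFFF0u));
	assert(!stamp_expired(5u, 0xFFFFFFFBu)); /* only 10 seconds old */
	assert(stamp_expired(2073600u, 0u));     /* exactly 24 days old */
}
```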
  21. 02 Jul, 2018 1 commit
  22. 01 Jul, 2018 1 commit
    • tcp: prevent bogus FRTO undos with non-SACK flows · 1236f22f
      Ilpo Järvinen authored
      If SACK is not enabled and the first cumulative ACK after the RTO
      retransmission covers more than the retransmitted skb, a spurious
      FRTO undo will trigger (assuming FRTO is enabled for that RTO).
      The reason is that any non-retransmitted segment acknowledged will
      set FLAG_ORIG_SACK_ACKED in tcp_clean_rtx_queue even if there is
      no indication that it would have been delivered for real (the
      scoreboard is not kept with TCPCB_SACKED_ACKED bits in the non-SACK
      case so the check for that bit won't help like it does with SACK).
      Having FLAG_ORIG_SACK_ACKED set results in the spurious FRTO undo
      in tcp_process_loss.
      
      We need to use a stricter condition for the non-SACK case and check
      that none of the cumulatively ACKed segments were retransmitted
      to prove that progress is due to original transmissions. Only then
      keep FLAG_ORIG_SACK_ACKED set, allowing FRTO undo to proceed in
      the non-SACK case.
      
      (FLAG_ORIG_SACK_ACKED is planned to be renamed to FLAG_ORIG_PROGRESS
      to better indicate its purpose but to keep this change minimal, it
      will be done in another patch).
      
      Besides burstiness and congestion control violations, this problem
      can result in an RTO loop: when the loss recovery is prematurely
      undone, only new data will be transmitted (if available) and
      the next retransmission can occur only after a new RTO, which in case
      of multiple losses (that are not for consecutive packets) requires
      one RTO per loss to recover.
      Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
      Tested-by: Neal Cardwell <ncardwell@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
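The stricter non-SACK condition described in this commit message (progress counts only if none of the cumulatively ACKed segments were retransmitted) can be sketched as a simple scan. `struct seg` and `acked_only_originals` are hypothetical names; the kernel tracks this with per-skb flags rather than an array.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-segment record for the cumulatively ACKed range. */
struct seg {
	bool retransmitted; /* was this segment ever retransmitted? */
};

/*
 * Returns true only if the cumulative ACK covers at least one segment
 * and none of the covered segments were retransmitted, i.e. the ACK
 * proves progress from original transmissions.
 */
static bool acked_only_originals(const struct seg *segs, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (segs[i].retransmitted)
			return false;
	}
	return n > 0;
}

/* Usage example: an ACK covering the RTO retransmission proves nothing. */
void frto_example(void)
{
	const struct seg with_rtx[] = { { true }, { false } };
	const struct seg originals[] = { { false }, { false } };

	assert(!acked_only_originals(with_rtx, 2));
	assert(acked_only_originals(originals, 2));
}
```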
  23. 30 Jun, 2018 1 commit
  24. 28 Jun, 2018 1 commit
  25. 26 Jun, 2018 1 commit
  26. 22 Jun, 2018 1 commit
  27. 05 Jun, 2018 1 commit
  28. 31 May, 2018 1 commit
  29. 22 May, 2018 2 commits
  30. 18 May, 2018 2 commits