1. 24 Jul, 2018 2 commits
    • rds: Enable RDS IPv6 support · 1e2b44e7
      Ka-Cheong Poon authored
      This patch enables RDS to use IPv6 addresses. For RDS/TCP, the
      listener is now an IPv6 endpoint which accepts both IPv4 and IPv6
      connection requests.  RDS/RDMA/IB uses a private data (struct
      rds_ib_connect_private) exchange between endpoints at RDS connection
      establishment time to support RDMA. This private data exchange uses a
      32-bit integer to represent an IP address, which needs to be changed
      in order to support IPv6. A new private data struct,
      rds6_ib_connect_private, is introduced to handle this. To ensure
      backward compatibility, an IPv6-capable RDS stack uses a separate RDMA
      listener port (RDS_CM_PORT) to accept IPv6 connections, and it
      continues to use the original RDS_PORT for IPv4 RDS connections. When
      it needs to communicate with an IPv6 peer, it uses RDS_CM_PORT to
      send the connection setup request.
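      
      A rough sketch of the port selection this implies (the helper below
      is hypothetical; ipv6_addr_v4mapped() is the existing helper from
      <net/ipv6.h>, and RDS_PORT/RDS_CM_PORT are the constants named
      above):
      
          /* IPv4 peers, stored internally as IPv4-mapped addresses,
           * keep using the original RDS_PORT; IPv6 peers are dialed
           * on the new RDS_CM_PORT.
           */
          static u16 rds_cm_port_for_peer(const struct in6_addr *peer)
          {
                  return ipv6_addr_v4mapped(peer) ? RDS_PORT : RDS_CM_PORT;
          }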
      
      v5: Fixed syntax problem (David Miller).
      
      v4: Changed port history comments in rds.h (Sowmini Varadhan).
      
      v3: Added support to set up IPv4 connections using mapped addresses
          (David Miller).
          Added support to set up connections between link-local and
          non-link-local addresses.
          Various review comments from Santosh Shilimkar and Sowmini Varadhan.
      
      v2: Fixed a bound and peer address scope mismatch issue.
          Added back rds_connect() IPv6 changes.
      Signed-off-by: Ka-Cheong Poon <ka-cheong.poon@oracle.com>
      Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • rds: Changing IP address internal representation to struct in6_addr · eee2fa6a
      Ka-Cheong Poon authored
      This patch changes the internal representation of an IP address to use
      struct in6_addr.  An IPv4 address is stored as an IPv4-mapped address.
      All the functions which take an IP address as an argument are also
      changed to use struct in6_addr.  The RDS socket layer, however, is not
      modified: it still does not accept an IPv6 address from an application,
      and the RDS layer neither accepts nor initiates IPv6 connections.
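      
      A short sketch of the mapping this relies on, using the existing
      ipv6_addr_set_v4mapped() helper from <net/ipv6.h> (the wrapper
      function name here is illustrative):
      
          #include <net/ipv6.h>
      
          /* Store a __be32 IPv4 address in the struct in6_addr
           * representation as the mapped form ::ffff:a.b.c.d.
           */
          static void rds_addr_from_v4(__be32 v4addr, struct in6_addr *addr6)
          {
                  ipv6_addr_set_v4mapped(v4addr, addr6);
          }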
      
      v2: Fixed sparse warnings.
      Signed-off-by: Ka-Cheong Poon <ka-cheong.poon@oracle.com>
      Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 08 Feb, 2018 1 commit
    • rds: tcp: use rds_destroy_pending() to synchronize netns/module teardown and rds connection/workq management · ebeeb1ad
      Sowmini Varadhan authored
      
      An rds_connection can get added during netns deletion between lines 528
      and 529 of
      
        506 static void rds_tcp_kill_sock(struct net *net)
        :
        /* code to pull out all the rds_connections that should be destroyed */
        :
        528         spin_unlock_irq(&rds_tcp_conn_lock);
        529         list_for_each_entry_safe(tc, _tc, &tmp_list, t_tcp_node)
        530                 rds_conn_destroy(tc->t_cpath->cp_conn);
      
      Such an rds_connection would be missed by the rds_conn_destroy()
      loop (which cancels all pending work) and, if its work was scheduled
      after netns deletion, could trigger a use-after-free.
      
      A similar race window exists for the module unload path in
      rds_tcp_exit -> rds_tcp_destroy_conns.
      
      Concurrency with netns deletion (rds_tcp_kill_sock()) must be handled
      by calling check_net() before enqueuing new work or adding new
      connections.
      
      Concurrency with module unload is handled by maintaining a
      module-specific flag that is set at the start of the module exit
      function and must be checked before enqueuing new work or adding new
      connections.
      
      This commit refactors existing RDS_DESTROY_PENDING checks added by
      commit 3db6e0d1 ("rds: use RCU to synchronize work-enqueue with
      connection teardown") and consolidates all the concurrency checks
      listed above into the function rds_destroy_pending().
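      
      A rough sketch of the consolidated check (check_net() and
      rds_conn_net() are existing kernel helpers; a per-transport
      t_unloading() hook is assumed here to carry the module-specific
      flag the message describes):
      
          static inline bool rds_destroy_pending(struct rds_connection *conn)
          {
                  /* netns teardown in progress, or transport module unloading */
                  return !check_net(rds_conn_net(conn)) ||
                         (conn->c_trans->t_unloading &&
                          conn->c_trans->t_unloading(conn));
          }
      
      Callers would test this under rcu_read_lock() before queuing new
      work, so an in-progress teardown reliably suppresses new enqueues.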
      Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
      Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  3. 05 Jan, 2018 1 commit
  4. 17 Jul, 2017 1 commit
    • rds: cancel send/recv work before queuing connection shutdown · aed20a53
      Sowmini Varadhan authored
      We could end up executing rds_conn_shutdown before the rds_recv_worker
      thread, then rds_conn_shutdown -> rds_tcp_conn_shutdown can do a
      sock_release and set sock->sk to null, which may interleave in bad
      ways with rds_recv_worker, e.g., it could result in:
      
      "BUG: unable to handle kernel NULL pointer dereference at 0000000000000078"
          [ffff881769f6fd70] release_sock at ffffffff815f337b
          [ffff881769f6fd90] rds_tcp_recv at ffffffffa043c888 [rds_tcp]
          [ffff881769f6fdb0] rds_recv_worker at ffffffffa04a4810 [rds]
          [ffff881769f6fde0] process_one_work at ffffffff810a14c1
          [ffff881769f6fe40] worker_thread at ffffffff810a1940
          [ffff881769f6fec0] kthread at ffffffff810a6b1e
      
      Also, do not enqueue any new shutdown workq items when the connection is
      shutting down (this may happen for rds-tcp in softirq mode, if a FIN
      or CLOSE is received while the module is in the middle of an unload).
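      
      A condensed sketch of the resulting ordering, assuming the
      cp_send_w/cp_recv_w/cp_down_w work items of struct rds_conn_path:
      
          /* quiesce the send/recv workers first, so neither can run
           * after the queued shutdown work releases t_sock
           */
          cancel_delayed_work_sync(&cp->cp_send_w);
          cancel_delayed_work_sync(&cp->cp_recv_w);
          queue_work(rds_wq, &cp->cp_down_w);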
      Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  5. 22 Jun, 2017 1 commit
  6. 16 Jun, 2017 2 commits
  7. 17 Nov, 2016 1 commit
    • RDS: TCP: Force every connection to be initiated by numerically smaller IP address · 1a0e100f
      Sowmini Varadhan authored
      When 2 RDS peers initiate an RDS-TCP connection simultaneously,
      there is a potential for "duelling syns" on either/both sides.
      See commit 241b2719 ("RDS-TCP: Reset tcp callbacks if re-using an
      outgoing socket in rds_tcp_accept_one()") for a description of this
      condition, and the arbitration logic which ensures that the
      numerically larger IP address in the TCP connection is bound to the
      RDS_TCP_PORT ("canonical ordering").
      
      The rds_connection should not be marked as RDS_CONN_UP until the
      arbitration logic has converged, for the following reason. The sender
      may start transmitting RDS datagrams as soon as RDS_CONN_UP is set,
      and the sender removes datagrams from the rds_connection's
      cp_retrans queue based on TCP ACKs. If a TCP ACK was sent from
      a tcp socket that got reset as part of duel arbitration (but
      before the data was delivered to the receiver's RDS socket layer),
      the sender may end up prematurely freeing the datagram, and
      the datagram is no longer reliably deliverable.
      
      This patch remedies that condition by making sure that, upon
      receipt of the TCP_ESTABLISHED state-change notification signalling
      3WH completion in rds_tcp_state_change, we mark the rds_connection as RDS_CONN_UP
      if, and only if, the IP addresses and ports for the connection are
      canonically ordered. In all other cases, rds_tcp_state_change will
      force an rds_conn_path_drop(), and rds_queue_reconnect() on
      both peers will restart the connection to ensure canonical ordering.
      
      A side-effect of enforcing this condition in rds_tcp_state_change()
      is that rds_tcp_accept_one_path() can now be refactored for simplicity.
      It is also no longer possible to encounter an RDS_CONN_UP connection in
      the arbitration logic in rds_tcp_accept_one().
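      
      A sketch of the TCP_ESTABLISHED handling described above (cp is the
      rds_conn_path for the socket; addresses were IPv4 __be32 values at
      the time, and exact helper names may differ from the patch):
      
          case TCP_ESTABLISHED:
                  /* only the numerically smaller address may initiate */
                  if (ntohl(conn->c_laddr) < ntohl(conn->c_faddr))
                          rds_connect_path_complete(cp, RDS_CONN_CONNECTING);
                  else
                          rds_conn_path_drop(cp); /* both peers reconnect */
                  break;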
      Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
      Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  8. 15 Jul, 2016 1 commit
  9. 01 Jul, 2016 4 commits
  10. 19 Jun, 2016 1 commit
  11. 15 Jun, 2016 1 commit
  12. 07 Jun, 2016 1 commit
    • RDS: TCP: fix race windows in send-path quiescence by rds_tcp_accept_one() · 9c79440e
      Sowmini Varadhan authored
      The send path needs to be quiesced before resetting callbacks from
      rds_tcp_accept_one(), and commit eb192840 ("RDS:TCP: Synchronize
      rds_tcp_accept_one with rds_send_xmit when resetting t_sock") achieves
      this using the c_state and RDS_IN_XMIT bit following the pattern
      used by rds_conn_shutdown(). However, this leaves the possibility
      of a race window, as shown in the sequence below:
      
          take t_conn_lock in rds_tcp_conn_connect
          send outgoing syn to peer
          drop t_conn_lock in rds_tcp_conn_connect
          incoming from peer triggers rds_tcp_accept_one, conn is
              marked CONNECTING
          wait for RDS_IN_XMIT to quiesce any rds_send_xmit threads
          call rds_tcp_reset_callbacks
          [.. race window where an incoming syn-ack can cause the conn
              to be marked UP from rds_tcp_state_change ..]
          lock_sock called from rds_tcp_reset_callbacks, and we set
              t_sock to null
      
      As soon as the conn is marked UP in the race window above,
      rds_send_xmit() threads will proceed to rds_tcp_xmit and may hit a
      null-pointer dereference on t_sock.
      
      Given that rds_tcp_state_change() is invoked in softirq context, whereas
      rds_tcp_reset_callbacks() is in workq context, and testing for RDS_IN_XMIT
      after lock_sock could result in a deadlock with tcp_sendmsg, this
      commit fixes the race by using a new c_state, RDS_TCP_RESETTING, which
      will prevent a transition to RDS_CONN_UP from rds_tcp_state_change().
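      
      A sketch of the guard, using the state name from this message
      (rds_conn_transition() is the existing cmpxchg-style state helper;
      the exact call sites are condensed):
      
          /* rds_tcp_reset_callbacks(): park the conn before touching
           * t_sock, so softirq-context rds_tcp_state_change() cannot
           * complete a CONNECTING -> UP transition underneath us.
           */
          atomic_set(&conn->c_state, RDS_TCP_RESETTING);
          /* ... lock_sock(), t_sock = NULL, restore callbacks ... */
      
          /* rds_tcp_state_change(), TCP_ESTABLISHED case: */
          if (!rds_conn_transition(conn, RDS_CONN_CONNECTING, RDS_CONN_UP))
                  break;  /* conn is RESETTING, so do not mark it UP */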
      Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
      Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  13. 19 May, 2016 1 commit
  14. 03 May, 2016 1 commit
    • RDS: TCP: Synchronize accept() and connect() paths on t_conn_lock. · bd7c5f98
      Sowmini Varadhan authored
      An arbitration scheme for duelling SYNs is implemented as part of
      commit 241b2719 ("RDS-TCP: Reset tcp callbacks if re-using an
      outgoing socket in rds_tcp_accept_one()") which ensures that both nodes
      involved will arrive at the same arbitration decision. However, this
      needs to be synchronized with any outgoing SYN generated by
      rds_tcp_conn_connect(). This commit achieves the synchronization
      through the t_conn_lock mutex in struct rds_tcp_connection.
      
      The rds_conn_state is checked in rds_tcp_conn_connect() after acquiring
      the t_conn_lock mutex.  A SYN is sent out only if the RDS connection is
      not already UP (UP would indicate that rds_tcp_accept_one() has
      completed the 3WH, so no SYN needs to be generated).
      
      Similarly, the rds_conn_state is checked in rds_tcp_accept_one() after
      acquiring the t_conn_lock mutex. The only acceptable states (to
      allow continuation of the arbitration logic) are UP (i.e., our outgoing
      SYN was SYN-ACKed by the peer after it sent us its SYN) or CONNECTING
      (we sent our outgoing SYN before we saw the incoming SYN).
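      
      A condensed sketch of the connect-side check (tc is the
      rds_tcp_connection; the surrounding socket setup is elided):
      
          mutex_lock(&tc->t_conn_lock);
          if (rds_conn_up(conn)) {
                  /* accept path already completed the 3WH: no SYN needed */
                  mutex_unlock(&tc->t_conn_lock);
                  return 0;
          }
          /* ... bind and kernel_connect() to emit our SYN ... */
          mutex_unlock(&tc->t_conn_lock);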
      Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
      Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  15. 07 Aug, 2015 2 commits
  16. 09 May, 2015 1 commit
    • net/rds: RDS-TCP: Always create a new rds_sock for an incoming connection. · f711a6ae
      Sowmini Varadhan authored
      When running RDS over TCP, the active (client) side connects to the
      listening ("passive") side at the RDS_TCP_PORT.  After the connection
      is established, if the client side reboots (potentially without even
      sending a FIN) the server still has a TCP socket in the established
      state.  If the server-side now gets a new SYN from the client
      with a different client port, TCP will create a new socket-pair, but
      the RDS layer will incorrectly pull up the old rds_connection (which
      is still associated with the stale t_sock and RDS socket state).
      
      This patch corrects this behavior by having rds_tcp_accept_one()
      always create a new connection for an incoming TCP SYN.
      The rds and tcp state associated with the old socket-pair is cleaned
      up via the rds_tcp_state_change() callback, which is typically
      invoked when the client TCP sends a FIN on restart, triggering a
      transition to the CLOSE_WAIT state. In the rarer event of client
      death without a FIN, TCP_KEEPALIVE probes on the socket will detect
      the stale socket, and the TCP transition to the CLOSE state will
      trigger the RDS state cleanup.
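      
      A sketch of the cleanup hook this relies on (condensed;
      rds_conn_drop() is the existing connection-teardown entry point,
      and the exact set of TCP states handled may differ from the patch):
      
          static void rds_tcp_state_change(struct sock *sk)
          {
                  /* ... conn looked up via sk->sk_user_data ... */
                  switch (sk->sk_state) {
                  case TCP_CLOSE_WAIT:    /* peer sent us a FIN */
                  case TCP_CLOSE:         /* keepalive declared peer dead */
                          rds_conn_drop(conn);
                          break;
                  }
          }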
      Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  17. 03 Oct, 2014 1 commit
    • net/rds: do proper house keeping if connection fails in rds_tcp_conn_connect · eb74cc97
      Herton R. Krzesinski authored
      I see two problems if we consider the sock->ops->connect attempt to fail
      in rds_tcp_conn_connect. The first issue is that we don't remove the
      previously added rds_tcp_connection item from rds_tcp_tc_list at
      rds_tcp_set_callbacks, which means that on the next reconnect attempt for
      the same rds_connection, when rds_tcp_conn_connect is called, we can
      again call rds_tcp_set_callbacks, resulting in duplicated items on
      rds_tcp_tc_list and leading to list corruption: to avoid this, just make
      sure we call rds_tcp_restore_callbacks properly before we exit. The
      second issue is that we should also release the sock properly, by
      setting sock = NULL only if we are returning without error.
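      
      A condensed sketch of the corrected tail of rds_tcp_conn_connect
      (only the error handling is shown; dest is the peer sockaddr set up
      earlier in the function):
      
          ret = sock->ops->connect(sock, (struct sockaddr *)&dest,
                                   sizeof(dest), O_NONBLOCK);
          if (ret == -EINPROGRESS)
                  ret = 0;
          if (ret == 0)
                  sock = NULL;    /* success: callbacks now own the sock */
          else
                  rds_tcp_restore_callbacks(sock, conn->c_transport_data);
      out:
          if (sock)
                  sock_release(sock);
          return ret;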
      Signed-off-by: Herton R. Krzesinski <herton@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  18. 23 Aug, 2012 1 commit
  19. 25 Sep, 2010 1 commit
    • net: fix a lockdep splat · f064af1e
      Eric Dumazet authored
      We have for each socket:
      
      One spinlock (sk_slock.slock)
      One rwlock (sk_callback_lock)
      
      Possible scenarios are:
      
      (A) (this is used in net/sunrpc/xprtsock.c)
      read_lock(&sk->sk_callback_lock) (without blocking BH)
      <BH>
      spin_lock(&sk->sk_slock.slock);
      ...
      read_lock(&sk->sk_callback_lock);
      ...
      
      (B)
      write_lock_bh(&sk->sk_callback_lock)
      stuff
      write_unlock_bh(&sk->sk_callback_lock)
      
      (C)
      spin_lock_bh(&sk->sk_slock)
      ...
      write_lock_bh(&sk->sk_callback_lock)
      stuff
      write_unlock_bh(&sk->sk_callback_lock)
      spin_unlock_bh(&sk->sk_slock)
      
      This (C) case conflicts with (A):
      
      CPU1 [A]                         CPU2 [C]
      read_lock(callback_lock)
      <BH>                             spin_lock_bh(slock)
      <wait to spin_lock(slock)>
                                       <wait to write_lock_bh(callback_lock)>
      
      We have one problematic (C) use case in inet_csk_listen_stop() :
      
      local_bh_disable();
      bh_lock_sock(child); // spin_lock_bh(&sk->sk_slock)
      WARN_ON(sock_owned_by_user(child));
      ...
      sock_orphan(child); // write_lock_bh(&sk->sk_callback_lock)
      
      lockdep is not happy with this, as reported by Tetsuo Handa.
      
      It seems the only way to deal with this is to use
      read_lock_bh(&sk->sk_callback_lock) everywhere.
      
      Thanks to Jarek for pointing out a bug in my first attempt and
      suggesting this solution.
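      
      A sketch of the resulting reader-side pattern:
      
          read_lock_bh(&sk->sk_callback_lock);
          /* with BHs disabled, a softirq on this CPU cannot take slock
           * and then spin on sk_callback_lock while we hold it, so the
           * (A)/(C) inversion above can no longer deadlock
           */
          /* ... invoke the protected callback ... */
          read_unlock_bh(&sk->sk_callback_lock);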
      Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Tested-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      CC: Jarek Poplawski <jarkao2@gmail.com>
      Tested-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  20. 09 Sep, 2010 1 commit
  21. 18 May, 2010 1 commit
  22. 04 Feb, 2010 1 commit
  23. 24 Aug, 2009 1 commit