    netns: restrict uevents · a3498436
    Christian Brauner authored
    Commit 07e98962 ("kobject: Send hotplug events in all network namespaces")
    enabled sending hotplug events into all network namespaces back in 2010.
    Over time the set of uevents that get sent into all network namespaces has
    shrunk. We have now reached the point where hotplug events for all devices
    that carry a namespace tag are filtered according to that namespace.
    Specifically, they are filtered whenever the namespace tag of the kobject
    does not match the namespace tag of the netlink socket.
    Currently, only network devices carry namespace tags (i.e. network
    namespace tags). Hence, uevents for network devices only show up in the
    network namespace such devices are created in or moved to.
    
    However, any uevent for a kobject that does not have a namespace tag
    associated with it will not be filtered and we will broadcast it into all
    network namespaces. This behavior stopped making sense when user namespaces
    were introduced.
    
    This patch simplifies and fixes a couple of things:
    - Split codepath for sending uevents by kobject namespace tags:
      1. Untagged kobjects - uevent_net_broadcast_untagged():
         Untagged kobjects will be broadcast into all uevent sockets recorded
         in uevent_sock_list, i.e. into all network namespaces owned by the
         initial user namespace.
      2. Tagged kobjects - uevent_net_broadcast_tagged():
         Tagged kobjects will only be broadcast into the network namespace they
         were tagged with.
      Handling of tagged kobjects in 2. does not cause any semantic changes.
      This is just splitting out the filtering logic that was handled by
      kobj_bcast_filter() before.
      Handling of untagged kobjects in 1. will cause a semantic change. The
      reasons why this is needed and ok have been discussed in [1]. Here is a
      short summary:
      - Userspace ignores uevents from network namespaces that are not owned by
        the initial user namespace:
        Uevents are filtered by userspace in a user namespace because the
        received uid != 0. Instead the uid associated with the event will be
        65534 == "nobody" because the global root uid is not mapped.
        This means we can safely and without introducing regressions modify the
        kernel to not send uevents into all network namespaces whose owning
        user namespace is not the initial user namespace because we know that
        userspace will ignore the message because of the uid anyway.
        I have verified that a) this is true for every udev implementation out
        there and b) this behavior has been present in all udev
        implementations from the very beginning.
      - Thundering herd:
        Broadcasting uevents into all network namespaces introduces significant
        overhead.
        All processes that listen to uevents running in non-initial user
        namespaces will end up responding to uevents that will be meaningless
        to them. Mainly, because non-initial user namespaces cannot easily
        manage devices unless they have a privileged host-process helping them
        out. This means that there will be a thundering herd of activity when
        there shouldn't be any.
      - Removing needless overhead/Increasing performance:
        Currently, the uevent socket for each network namespace is added to the
        global variable uevent_sock_list. The list itself needs to be protected
        by a mutex. So every time a uevent is generated the mutex is taken on
        the list. The mutex is held *from the creation of the uevent (memory
        allocation, string creation etc.) until all uevent sockets have been
        handled*. This is aggravated by the fact that for each uevent socket
        that has listeners the mc_list must be walked as well which means we're
        talking O(n^2) here. Given that a standard Linux workload usually has
        quite a lot of network namespaces and - in the face of containers - a
        lot of user namespaces this quickly becomes a performance problem (see
        "Thundering herd" above). By just recording uevent sockets of network
        namespaces that are owned by the initial user namespace we
        significantly increase performance in this codepath.
      - Injecting uevents:
        There's a valid argument that containers might be interested in
        receiving device events especially if they are delegated to them by a
        privileged userspace process. One prime example is SR-IOV enabled
        devices, which are explicitly designed to be handed off to other users
        such as VMs or containers.
        This use-case can now be correctly handled since
        commit 692ec06d ("netns: send uevent messages"). This commit
        introduced the ability to send uevents from userspace. As such we can
        let a sufficiently privileged (CAP_SYS_ADMIN in the owning user
        namespace of the network namespace of the netlink socket) userspace
        process decide which uevents should be sent. This removes the
        need to blindly broadcast uevents into all user namespaces and provides
        a performant and safe solution to this problem.
      - Filtering logic:
        This patch filters by *owning user namespace of the network namespace a
        given task resides in* and not by user namespace of the task per se.
        This means if the user namespace of a given task is unshared but the
        network namespace is kept and is owned by the initial user namespace a
        listener that is opening the uevent socket in that network namespace
        can still listen to uevents.
    - Fix permission for tagged kobjects:
      Uevents for network devices that are created in or moved into a network
      namespace that is owned by a non-initial user namespace are currently
      sent with INVALID_{G,U}ID in their credentials. This means that all
      current udev implementations in userspace will ignore the uevent they
      receive for them. This has led to weird bugs whereby new devices showing
      up in such
      network namespaces were not recognized and did not get IPs assigned etc.
      This patch adjusts the permission to the appropriate {g,u}id in the
      respective user namespace. This way udevd is able to correctly handle
      such devices.
    - Simplify filtering logic:
      do_one_broadcast() already ensures that only listeners in mc_list receive
      uevents that have the same network namespace as the uevent socket itself.
      So the filtering logic in kobj_bcast_filter() is not needed (see [3]). This
      patch therefore removes kobj_bcast_filter() and replaces
      netlink_broadcast_filtered() with the simpler netlink_broadcast()
      everywhere.
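
    The delivery rules described above can be sketched as a small userspace
    model. This is not the kernel code touched by this patch: every type and
    function name with a model_ prefix is a hypothetical stand-in, and the
    uid helper assumes that the listener lives in the user namespace that
    owns its network namespace.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for kernel objects; not the kernel's real types. */
struct model_user_ns {
    bool is_initial;                    /* true only for the initial user namespace */
};

struct model_net_ns {
    const struct model_user_ns *owner;  /* owning user namespace */
};

#define MODEL_ROOT_UID     0u
#define MODEL_OVERFLOW_UID 65534u       /* "nobody": what an unmapped uid shows up as */

/*
 * Only uevent sockets of network namespaces owned by the initial user
 * namespace are recorded for the untagged broadcast.
 */
static inline bool model_sock_is_recorded(const struct model_net_ns *net)
{
    return net->owner->is_initial;
}

/*
 * tag == NULL models an untagged kobject. Tagged kobjects are delivered
 * only into the network namespace they are tagged with; untagged kobjects
 * are delivered into every recorded network namespace.
 */
static inline bool model_should_deliver(const struct model_net_ns *tag,
                                        const struct model_net_ns *listener_net)
{
    if (tag)
        return tag == listener_net;
    return model_sock_is_recorded(listener_net);
}

/*
 * uid a listener sees for a tagged uevent in its own network namespace.
 * Assumption: the listener runs in the user namespace owning that netns.
 * Before the patch, invalid credentials appear as the overflow uid there;
 * after the patch, the sender appears as root of that user namespace.
 */
static inline unsigned int model_tagged_sender_uid(const struct model_net_ns *net,
                                                   bool patched)
{
    if (net->owner->is_initial)
        return MODEL_ROOT_UID;
    return patched ? MODEL_ROOT_UID : MODEL_OVERFLOW_UID;
}

/* udev-style check from the summary above: ignore events not sent by uid 0. */
static inline bool model_udev_accepts(unsigned int sender_uid)
{
    return sender_uid == MODEL_ROOT_UID;
}
```

    In this model an untagged uevent reaches a network namespace owned by the
    initial user namespace but not one owned by a container's user namespace,
    while a tagged uevent reaches only its own network namespace and, with
    patched set, passes the udev-style uid check there.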
    
    [1]: https://lkml.org/lkml/2018/4/4/739
    [2]: https://lkml.org/lkml/2018/4/26/767
    [3]: https://lkml.org/lkml/2018/4/26/738
    
    
    Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>