Skip to content
  • Christian Brauner's avatar
    signal: add pidfd_send_signal() syscall · 3eb39f47
    Christian Brauner authored
    The kill() syscall operates on process identifiers (pid). After a process
    has exited its pid can be reused by another process. If a caller sends a
    signal to a reused pid it will end up signaling the wrong process. This
    issue has often surfaced and there has been a push to address this problem [1].
    
    This patch uses file descriptors (fd) from proc/<pid> as stable handles on
    struct pid. Even if a pid is recycled the handle will not change. The fd
    can be used to send signals to the process it refers to.
    Thus, the new syscall pidfd_send_signal() is introduced to solve this
    problem. Instead of pids it operates on process fds (pidfd).
    
    /* prototype and argument /*
    long pidfd_send_signal(int pidfd, int sig, siginfo_t *info, unsigned int flags);
    
    /* syscall number 424 */
    The syscall number was chosen to be 424 to align with Arnd's rework in his
    y2038 to minimize merge conflicts (cf. [25]).
    
    In addition to the pidfd and signal argument it takes an additional
    siginfo_t and flags argument. If the siginfo_t argument is NULL then
    pidfd_send_signal() is equivalent to kill(<positive-pid>, <signal>). If it
    is not NULL pidfd_send_signal() is equivalent to rt_sigqueueinfo().
    The flags argument is added to allow for future extensions of this syscall.
    It currently needs to be passed as 0. Failing to do so will cause EINVAL.
    
    /* pidfd_send_signal() replaces multiple pid-based syscalls */
    The pidfd_send_signal() syscall currently takes on the job of
    rt_sigqueueinfo(2) and parts of the functionality of kill(2), Namely, when a
    positive pid is passed to kill(2). It will however be possible to also
    replace tgkill(2) and rt_tgsigqueueinfo(2) if this syscall is extended.
    
    /* sending signals to threads (tid) and process groups (pgid) */
    Specifically, the pidfd_send_signal() syscall does currently not operate on
    process groups or threads. This is left for future extensions.
    In order to extend the syscall to allow sending signal to threads and
    process groups appropriately named flags (e.g. PIDFD_TYPE_PGID, and
    PIDFD_TYPE_TID) should be added. This implies that the flags argument will
    determine what is signaled and not the file descriptor itself. Put in other
    words, grouping in this api is a property of the flags argument not a
    property of the file descriptor (cf. [13]). Clarification for this has been
    requested by Eric (cf. [19]).
    When appropriate extensions through the flags argument are added then
    pidfd_send_signal() can additionally replace the part of kill(2) which
    operates on process groups as well as the tgkill(2) and
    rt_tgsigqueueinfo(2) syscalls.
    How such an extension could be implemented has been very roughly sketched
    in [14], [15], and [16]. However, this should not be taken as a commitment
    to a particular implementation. There might be better ways to do it.
    Right now this is intentionally left out to keep this patchset as simple as
    possible (cf. [4]).
    
    /* naming */
    The syscall had various names throughout iterations of this patchset:
    - procfd_signal()
    - procfd_send_signal()
    - taskfd_send_signal()
    In the last round of reviews it was pointed out that given that if the
    flags argument decides the scope of the signal instead of different types
    of fds it might make sense to either settle for "procfd_" or "pidfd_" as
    prefix. The community was willing to accept either (cf. [17] and [18]).
    Given that one developer expressed strong preference for the "pidfd_"
    prefix (cf. [13]) and with other developers less opinionated about the name
    we should settle for "pidfd_" to avoid further bikeshedding.
    
    The  "_send_signal" suffix was chosen to reflect the fact that the syscall
    takes on the job of multiple syscalls. It is therefore intentional that the
    name is not reminiscent of neither kill(2) nor rt_sigqueueinfo(2). Not the
    fomer because it might imply that pidfd_send_signal() is a replacement for
    kill(2), and not the latter because it is a hassle to remember the correct
    spelling - especially for non-native speakers - and because it is not
    descriptive enough of what the syscall actually does. The name
    "pidfd_send_signal" makes it very clear that its job is to send signals.
    
    /* zombies */
    Zombies can be signaled just as any other process. No special error will be
    reported since a zombie state is an unreliable state (cf. [3]). However,
    this can be added as an extension through the @flags argument if the need
    ever arises.
    
    /* cross-namespace signals */
    The patch currently enforces that the signaler and signalee either are in
    the same pid namespace or that the signaler's pid namespace is an ancestor
    of the signalee's pid namespace. This is done for the sake of simplicity
    and because it is unclear to what values certain members of struct
    siginfo_t would need to be set to (cf. [5], [6]).
    
    /* compat syscalls */
    It became clear that we would like to avoid adding compat syscalls
    (cf. [7]).  The compat syscall handling is now done in kernel/signal.c
    itself by adding __copy_siginfo_from_user_generic() which lets us avoid
    compat syscalls (cf. [8]). It should be noted that the addition of
    __copy_siginfo_from_user_any() is caused by a bug in the original
    implementation of rt_sigqueueinfo(2) (cf. 12).
    With upcoming rework for syscall handling things might improve
    significantly (cf. [11]) and __copy_siginfo_from_user_any() will not gain
    any additional callers.
    
    /* testing */
    This patch was tested on x64 and x86.
    
    /* userspace usage */
    An asciinema recording for the basic functionality can be found under [9].
    With this patch a process can be killed via:
    
     #define _GNU_SOURCE
     #include <errno.h>
     #include <fcntl.h>
     #include <signal.h>
     #include <stdio.h>
     #include <stdlib.h>
     #include <string.h>
     #include <sys/stat.h>
     #include <sys/syscall.h>
     #include <sys/types.h>
     #include <unistd.h>
    
     static inline int do_pidfd_send_signal(int pidfd, int sig, siginfo_t *info,
                                             unsigned int flags)
     {
     #ifdef __NR_pidfd_send_signal
             return syscall(__NR_pidfd_send_signal, pidfd, sig, info, flags);
     #else
             return -ENOSYS;
     #endif
     }
    
     int main(int argc, char *argv[])
     {
             int fd, ret, saved_errno, sig;
    
             if (argc < 3)
                     exit(EXIT_FAILURE);
    
             fd = open(argv[1], O_DIRECTORY | O_CLOEXEC);
             if (fd < 0) {
                     printf("%s - Failed to open \"%s\"\n", strerror(errno), argv[1]);
                     exit(EXIT_FAILURE);
             }
    
             sig = atoi(argv[2]);
    
             printf("Sending signal %d to process %s\n", sig, argv[1]);
             ret = do_pidfd_send_signal(fd, sig, NULL, 0);
    
             saved_errno = errno;
             close(fd);
             errno = saved_errno;
    
             if (ret < 0) {
                     printf("%s - Failed to send signal %d to process %s\n",
                            strerror(errno), sig, argv[1]);
                     exit(EXIT_FAILURE);
             }
    
             exit(EXIT_SUCCESS);
     }
    
    /* Q&A
     * Given that it seems the same questions get asked again by people who are
     * late to the party it makes sense to add a Q&A section to the commit
     * message so it's hopefully easier to avoid duplicate threads.
     *
     * For the sake of progress please consider these arguments settled unless
     * there is a new point that desperately needs to be addressed. Please make
     * sure to check the links to the threads in this commit message whether
     * this has not already been covered.
     */
    Q-01: (Florian Weimer [20], Andrew Morton [21])
          What happens when the target process has exited?
    A-01: Sending the signal will fail with ESRCH (cf. [22]).
    
    Q-02:  (Andrew Morton [21])
           Is the task_struct pinned by the fd?
    A-02:  No. A reference to struct pid is kept. struct pid - as far as I
           understand - was created exactly for the reason to not require to
           pin struct task_struct (cf. [22]).
    
    Q-03: (Andrew Morton [21])
          Does the entire procfs directory remain visible? Just one entry
          within it?
    A-03: The same thing that happens right now when you hold a file descriptor
          to /proc/<pid> open (cf. [22]).
    
    Q-04: (Andrew Morton [21])
          Does the pid remain reserved?
    A-04: No. This patchset guarantees a stable handle not that pids are not
          recycled (cf. [22]).
    
    Q-05: (Andrew Morton [21])
          Do attempts to signal that fd return errors?
    A-05: See {Q,A}-01.
    
    Q-06: (Andrew Morton [22])
          Is there a cleaner way of obtaining the fd? Another syscall perhaps.
    A-06: Userspace can already trivially retrieve file descriptors from procfs
          so this is something that we will need to support anyway. Hence,
          there's no immediate need to add another syscalls just to make
          pidfd_send_signal() not dependent on the presence of procfs. However,
          adding a syscalls to get such file descriptors is planned for a
          future patchset (cf. [22]).
    
    Q-07: (Andrew Morton [21] and others)
          This fd-for-a-process sounds like a handy thing and people may well
          think up other uses for it in the future, probably unrelated to
          signals. Are the code and the interface designed to permit such
          future applications?
    A-07: Yes (cf. [22]).
    
    Q-08: (Andrew Morton [21] and others)
          Now I think about it, why a new syscall? This thing is looking
          rather like an ioctl?
    A-08: This has been extensively discussed. It was agreed that a syscall is
          preferred for a variety or reasons. Here are just a few taken from
          prior threads. Syscalls are safer than ioctl()s especially when
          signaling to fds. Processes are a core kernel concept so a syscall
          seems more appropriate. The layout of the syscall with its four
          arguments would require the addition of a custom struct for the
          ioctl() thereby causing at least the same amount or even more
          complexity for userspace than a simple syscall. The new syscall will
          replace multiple other pid-based syscalls (see description above).
          The file-descriptors-for-processes concept introduced with this
          syscall will be extended with other syscalls in the future. See also
          [22], [23] and various other threads already linked in here.
    
    Q-09: (Florian Weimer [24])
          What happens if you use the new interface with an O_PATH descriptor?
    A-09:
          pidfds opened as O_PATH fds cannot be used to send signals to a
          process (cf. [2]). Signaling processes through pidfds is the
          equivalent of writing to a file. Thus, this is not an operation that
          operates "purely at the file descriptor level" as required by the
          open(2) manpage. See also [4].
    
    /* References */
    [1]:  https://lore.kernel.org/lkml/20181029221037.87724-1-dancol@google.com/
    [2]:  https://lore.kernel.org/lkml/874lbtjvtd.fsf@oldenburg2.str.redhat.com/
    [3]:  https://lore.kernel.org/lkml/20181204132604.aspfupwjgjx6fhva@brauner.io/
    [4]:  https://lore.kernel.org/lkml/20181203180224.fkvw4kajtbvru2ku@brauner.io/
    [5]:  https://lore.kernel.org/lkml/20181121213946.GA10795@mail.hallyn.com/
    [6]:  https://lore.kernel.org/lkml/20181120103111.etlqp7zop34v6nv4@brauner.io/
    [7]:  https://lore.kernel.org/lkml/36323361-90BD-41AF-AB5B-EE0D7BA02C21@amacapital.net/
    [8]:  https://lore.kernel.org/lkml/87tvjxp8pc.fsf@xmission.com/
    [9]:  https://asciinema.org/a/IQjuCHew6bnq1cr78yuMv16cy
    [11]: https://lore.kernel.org/lkml/F53D6D38-3521-4C20-9034-5AF447DF62FF@amacapital.net/
    [12]: https://lore.kernel.org/lkml/87zhtjn8ck.fsf@xmission.com/
    [13]: https://lore.kernel.org/lkml/871s6u9z6u.fsf@xmission.com/
    [14]: https://lore.kernel.org/lkml/20181206231742.xxi4ghn24z4h2qki@brauner.io/
    [15]: https://lore.kernel.org/lkml/20181207003124.GA11160@mail.hallyn.com/
    [16]: https://lore.kernel.org/lkml/20181207015423.4miorx43l3qhppfz@brauner.io/
    [17]: https://lore.kernel.org/lkml/CAGXu5jL8PciZAXvOvCeCU3wKUEB_dU-O3q0tDw4uB_ojMvDEew@mail.gmail.com/
    [18]: https://lore.kernel.org/lkml/20181206222746.GB9224@mail.hallyn.com/
    [19]: https://lore.kernel.org/lkml/20181208054059.19813-1-christian@brauner.io/
    [20]: https://lore.kernel.org/lkml/8736rebl9s.fsf@oldenburg.str.redhat.com/
    [21]: https://lore.kernel.org/lkml/20181228152012.dbf0508c2508138efc5f2bbe@linux-foundation.org/
    [22]: https://lore.kernel.org/lkml/20181228233725.722tdfgijxcssg76@brauner.io/
    [23]: https://lwn.net/Articles/773459/
    [24]: https://lore.kernel.org/lkml/8736rebl9s.fsf@oldenburg.str.redhat.com/
    [25]: https://lore.kernel.org/lkml/CAK8P3a0ej9NcJM8wXNPbcGUyOUZYX+VLoDFdbenW3s3114oQZw@mail.gmail.com/
    
    
    
    Cc: "Eric W. Biederman" <ebiederm@xmission.com>
    Cc: Jann Horn <jannh@google.com>
    Cc: Andy Lutomirsky <luto@kernel.org>
    Cc: Andrew Morton <akpm@linux-foundation.org>
    Cc: Oleg Nesterov <oleg@redhat.com>
    Cc: Al Viro <viro@zeniv.linux.org.uk>
    Cc: Florian Weimer <fweimer@redhat.com>
    Signed-off-by: default avatarChristian Brauner <christian@brauner.io>
    Reviewed-by: default avatarTycho Andersen <tycho@tycho.ws>
    Reviewed-by: default avatarKees Cook <keescook@chromium.org>
    Reviewed-by: default avatarDavid Howells <dhowells@redhat.com>
    Acked-by: default avatarArnd Bergmann <arnd@arndb.de>
    Acked-by: default avatarThomas Gleixner <tglx@linutronix.de>
    Acked-by: default avatarSerge Hallyn <serge@hallyn.com>
    Acked-by: default avatarAleksa Sarai <cyphar@cyphar.com>
    3eb39f47