Commit cb60e3e6 authored by Linus Torvalds's avatar Linus Torvalds

Merge branch 'next' of git://

Pull security subsystem updates from James Morris:
 "New notable features:
   - The seccomp work from Will Drewry
   - PR_{GET,SET}_NO_NEW_PRIVS from Andy Lutomirski
   - Longer security labels for Smack from Casey Schaufler
   - Additional ptrace restriction modes for Yama by Kees Cook"

Fix up trivial context conflicts in arch/x86/Kconfig and include/linux/filter.h

* 'next' of git:// (65 commits)
  apparmor: fix long path failure due to disconnected path
  apparmor: fix profile lookup for unconfined
  ima: fix filename hint to reflect script interpreter name
  KEYS: Don't check for NULL key pointer in key_validate()
  Smack: allow for significantly longer Smack labels v4
  gfp flags for security_inode_alloc()?
  Smack: recursive tramsmute
  Yama: replace capable() with ns_capable()
  TOMOYO: Accept manager programs which do not start with / .
  KEYS: Add invalidation support
  KEYS: Do LRU discard in full keyrings
  KEYS: Permit in-place link replacement in keyring list
  KEYS: Perform RCU synchronisation on keys prior to key destruction
  KEYS: Announce key type (un)registration
  KEYS: Reorganise keys Makefile
  KEYS: Move the key config into security/keys/Kconfig
  KEYS: Use the compat keyctl() syscall wrapper on Sparc64 for Sparc32 compat
  Yama: remove an unused variable
  samples/seccomp: fix dependencies on arch macros
  Yama: add additional ptrace scopes
parents 99262a3d ff2bb047
SECure COMPuting with filters
A large number of system calls are exposed to every userland process
with many of them going unused for the entire lifetime of the process.
As system calls change and mature, bugs are found and eradicated. A
certain subset of userland applications benefit by having a reduced set
of available system calls. The resulting set reduces the total kernel
surface exposed to the application. System call filtering is meant for
use with those applications.
Seccomp filtering provides a means for a process to specify a filter for
incoming system calls. The filter is expressed as a Berkeley Packet
Filter (BPF) program, as with socket filters, except that the data
operated on is related to the system call being made: system call
number and the system call arguments. This allows for expressive
filtering of system calls using a filter program language with a long
history of being exposed to userland and a straightforward data set.
Additionally, BPF makes it impossible for users of seccomp to fall prey
to time-of-check-time-of-use (TOCTOU) attacks that are common in system
call interposition frameworks. BPF programs may not dereference
pointers which constrains all filters to solely evaluating the system
call arguments directly.
What it isn't
System call filtering isn't a sandbox. It provides a clearly defined
mechanism for minimizing the exposed kernel surface. It is meant to be
a tool for sandbox developers to use. Beyond that, policy for logical
behavior and information flow should be managed with a combination of
other system hardening techniques and, potentially, an LSM of your
choosing. Expressive, dynamic filters provide further options down this
path (avoiding pathological sizes or selecting which of the multiplexed
system calls in socketcall() is allowed, for instance) which could be
construed, incorrectly, as a more complete sandboxing solution.
An additional seccomp mode is added and is enabled using the same
prctl(2) call as the strict seccomp. If the architecture has
CONFIG_HAVE_ARCH_SECCOMP_FILTER, then filters may be added as below:
Now takes an additional argument which specifies a new filter
using a BPF program.
The BPF program will be executed over struct seccomp_data
reflecting the system call number, arguments, and other
metadata. The BPF program must then return one of the
acceptable values to inform the kernel which action should be
The 'prog' argument is a pointer to a struct sock_fprog which
will contain the filter program. If the program is invalid, the
call will return -1 and set errno to EINVAL.
If fork/clone and execve are allowed by @prog, any child
processes will be constrained to the same filters and system
call ABI as the parent.
Prior to use, the task must call prctl(PR_SET_NO_NEW_PRIVS, 1) or
run with CAP_SYS_ADMIN privileges in its namespace. If these are not
true, -EACCES will be returned. This requirement ensures that filter
programs cannot be applied to child processes with greater privileges
than the task that installed them.
Additionally, if prctl(2) is allowed by the attached filter,
additional filters may be layered on which will increase evaluation
time, but allow for further decreasing the attack surface during
execution of a process.
The above call returns 0 on success and non-zero on error.
Return values
A seccomp filter may return any of the following values. If multiple
filters exist, the return value for the evaluation of a given system
call will always use the highest precedent value. (For example,
SECCOMP_RET_KILL will always take precedence.)
In precedence order, they are:
Results in the task exiting immediately without executing the
system call. The exit status of the task (status & 0x7f) will
Results in the kernel sending a SIGSYS signal to the triggering
task without executing the system call. The kernel will
rollback the register state to just before the system call
entry such that a signal handler in the task will be able to
inspect the ucontext_t->uc_mcontext registers and emulate
system call success or failure upon return from the signal
The SECCOMP_RET_DATA portion of the return value will be passed
as si_errno.
SIGSYS triggered by seccomp will have a si_code of SYS_SECCOMP.
Results in the lower 16-bits of the return value being passed
to userland as the errno without executing the system call.
When returned, this value will cause the kernel to attempt to
notify a ptrace()-based tracer prior to executing the system
call. If there is no tracer present, -ENOSYS is returned to
userland and the system call is not executed.
A tracer will be notified if it requests PTRACE_O_TRACESECCOMP
using ptrace(PTRACE_SETOPTIONS). The tracer will be notified
the BPF program return value will be available to the tracer
Results in the system call being executed.
If multiple filters exist, the return value for the evaluation of a
given system call will always use the highest precedent value.
Precedence is only determined using the SECCOMP_RET_ACTION mask. When
multiple filters return values of the same precedence, only the
SECCOMP_RET_DATA from the most recently installed filter will be
The biggest pitfall to avoid during use is filtering on system call
number without checking the architecture value. Why? On any
architecture that supports multiple system call invocation conventions,
the system call numbers may vary based on the specific invocation. If
the numbers in the different calling conventions overlap, then checks in
the filters may be abused. Always check the arch value!
The samples/seccomp/ directory contains both an x86-specific example
and a more generic example of a higher level macro interface for BPF
program generation.
Adding architecture support
See arch/Kconfig for the authoritative requirements. In general, if an
architecture supports both ptrace_event and seccomp, it will be able to
support seccomp filter with minor fixup: SIGSYS support and seccomp return
value checking. Then it must just add CONFIG_HAVE_ARCH_SECCOMP_FILTER
to its arch-specific Kconfig.
This diff is collapsed.
......@@ -34,7 +34,7 @@ parent to a child process (i.e. direct "gdb EXE" and "strace EXE" still
work), or with CAP_SYS_PTRACE (i.e. "gdb --pid=PID", and "strace -p PID"
still work as root).
For software that has defined application-specific relationships
In mode 1, software that has defined application-specific relationships
between a debugging process and its inferior (crash handlers, etc),
prctl(PR_SET_PTRACER, pid, ...) can be used. An inferior can declare which
other process (and its descendents) are allowed to call PTRACE_ATTACH
......@@ -46,6 +46,8 @@ restrictions, it can call prctl(PR_SET_PTRACER, PR_SET_PTRACER_ANY, ...)
so that any otherwise allowed process (even those in external pid namespaces)
may attach.
These restrictions do not change how ptrace via PTRACE_TRACEME operates.
The sysctl settings are:
0 - classic ptrace permissions: a process can PTRACE_ATTACH to any other
......@@ -60,6 +62,12 @@ The sysctl settings are:
inferior can call prctl(PR_SET_PTRACER, debugger, ...) to declare
an allowed debugger PID to call PTRACE_ATTACH on the inferior.
2 - admin-only attach: only processes with CAP_SYS_PTRACE may use ptrace
3 - no attach: no processes may use ptrace with PTRACE_ATTACH. Once set,
this sysctl cannot be changed to a lower value.
The original children-only logic was based on the restrictions in grsecurity.
......@@ -805,6 +805,23 @@ The keyctl syscall functions are:
kernel and resumes executing userspace.
(*) Invalidate a key.
long keyctl(KEYCTL_INVALIDATE, key_serial_t key);
This function marks a key as being invalidated and then wakes up the
garbage collector. The garbage collector immediately removes invalidated
keys from all keyrings and deletes the key when its reference count
reaches zero.
Keys that are marked invalidated become invisible to normal key operations
immediately, though they are still visible in /proc/keys until deleted
(they're marked with an 'i' flag).
A process must have search permission on the key for this function to be
......@@ -1733,6 +1733,7 @@ S: Supported
F: include/linux/capability.h
F: security/capability.c
F: security/commoncap.c
F: kernel/capability.c
M: Arnd Bergmann <>
......@@ -5950,7 +5951,7 @@ SECURITY SUBSYSTEM
M: James Morris <>
L: (suggested Cc:)
T: git git://
S: Supported
F: security/
......@@ -231,4 +231,27 @@ config HAVE_CMPXCHG_DOUBLE
An arch should select this symbol if it provides all of these things:
- syscall_get_arch()
- syscall_get_arguments()
- syscall_rollback()
- syscall_set_return_value()
- SIGSYS siginfo_t support
- secure_computing is called from a ptrace_event()-safe context
- secure_computing return value is checked and a return value of -1
results in the system call being skipped immediately.
def_bool y
Enable tasks to build secure computing environments defined
in terms of Berkeley Packet Filter programs which implement
task-defined system call filtering polices.
See Documentation/prctl/seccomp_filter.txt for details.
source "kernel/gcov/Kconfig"
......@@ -136,7 +136,7 @@ asmlinkage long do_syscall_trace_enter(struct pt_regs *regs)
long ret = 0;
if (test_thread_flag(TIF_SYSCALL_TRACE) &&
......@@ -535,7 +535,7 @@ static inline int audit_arch(void)
asmlinkage void syscall_trace_enter(struct pt_regs *regs)
/* do the secure computing check first */
if (!(current->ptrace & PT_PTRACED))
goto out;
......@@ -1710,7 +1710,7 @@ long do_syscall_trace_enter(struct pt_regs *regs)
long ret = 0;
if (test_thread_flag(TIF_SYSCALL_TRACE) &&
......@@ -719,7 +719,7 @@ asmlinkage long do_syscall_trace_enter(struct pt_regs *regs)
long ret = 0;
/* Do the secure computing check first. */
* The sysc_tracesys code in entry.S stored the system
......@@ -503,7 +503,7 @@ asmlinkage long do_syscall_trace_enter(struct pt_regs *regs)
long ret = 0;
if (test_thread_flag(TIF_SYSCALL_TRACE) &&
......@@ -522,7 +522,7 @@ asmlinkage long long do_syscall_trace_enter(struct pt_regs *regs)
long long ret = 0;
if (test_thread_flag(TIF_SYSCALL_TRACE) &&
......@@ -583,6 +583,9 @@ config SYSVIPC_COMPAT
depends on COMPAT && SYSVIPC
default y
def_bool y if COMPAT && KEYS
source "net/Kconfig"
......@@ -1062,7 +1062,7 @@ asmlinkage int syscall_trace_enter(struct pt_regs *regs)
int ret = 0;
/* do the secure computing check first */
if (test_thread_flag(TIF_SYSCALL_TRACE))
ret = tracehook_report_syscall_entry(regs);
......@@ -74,7 +74,7 @@ sys_call_table32:
.word sys_timer_delete, compat_sys_timer_create, sys_ni_syscall, compat_sys_io_setup, sys_io_destroy
/*270*/ .word sys32_io_submit, sys_io_cancel, compat_sys_io_getevents, sys32_mq_open, sys_mq_unlink
.word compat_sys_mq_timedsend, compat_sys_mq_timedreceive, compat_sys_mq_notify, compat_sys_mq_getsetattr, compat_sys_waitid
/*280*/ .word sys32_tee, sys_add_key, sys_request_key, sys_keyctl, compat_sys_openat
/*280*/ .word sys32_tee, sys_add_key, sys_request_key, compat_sys_keyctl, compat_sys_openat
.word sys_mkdirat, sys_mknodat, sys_fchownat, compat_sys_futimesat, compat_sys_fstatat64
/*290*/ .word sys_unlinkat, sys_renameat, sys_linkat, sys_symlinkat, sys_readlinkat
.word sys_fchmodat, sys_faccessat, compat_sys_pselect6, compat_sys_ppoll, sys_unshare
......@@ -83,6 +83,7 @@ config X86
......@@ -67,6 +67,10 @@ int copy_siginfo_to_user32(compat_siginfo_t __user *to, siginfo_t *from)
switch (from->si_code >> 16) {
case __SI_FAULT >> 16:
case __SI_SYS >> 16:
put_user_ex(from->si_syscall, &to->si_syscall);
put_user_ex(from->si_arch, &to->si_arch);
case __SI_CHLD >> 16:
if (ia32) {
put_user_ex(from->si_utime, &to->si_utime);
......@@ -144,6 +144,12 @@ typedef struct compat_siginfo {
int _band; /* POLL_IN, POLL_OUT, POLL_MSG */
int _fd;
} _sigpoll;
struct {
unsigned int _call_addr; /* calling insn */
int _syscall; /* triggering system call number */
unsigned int _arch; /* AUDIT_ARCH_* of syscall */
} _sigsys;
} _sifields;
} compat_siginfo_t;
......@@ -13,9 +13,11 @@
#ifndef _ASM_X86_SYSCALL_H
#define _ASM_X86_SYSCALL_H
#include <linux/audit.h>
#include <linux/sched.h>
#include <linux/err.h>
#include <asm/asm-offsets.h> /* For NR_syscalls */
#include <asm/thread_info.h> /* for TS_COMPAT */
#include <asm/unistd.h>
extern const unsigned long sys_call_table[];
......@@ -88,6 +90,12 @@ static inline void syscall_set_arguments(struct task_struct *task,
memcpy(&regs->bx + i, args, n * sizeof(args[0]));
static inline int syscall_get_arch(struct task_struct *task,
struct pt_regs *regs)
return AUDIT_ARCH_I386;
#else /* CONFIG_X86_64 */
static inline void syscall_get_arguments(struct task_struct *task,
......@@ -212,6 +220,25 @@ static inline void syscall_set_arguments(struct task_struct *task,
static inline int syscall_get_arch(struct task_struct *task,
struct pt_regs *regs)
* TS_COMPAT is set for 32-bit syscall entry and then
* remains set until we return to user mode.
* TIF_IA32 tasks should always have TS_COMPAT set at
* system call time.
* x32 tasks should be considered AUDIT_ARCH_X86_64.
if (task_thread_info(task)->status & TS_COMPAT)
return AUDIT_ARCH_I386;
/* Both x32 and x86_64 are considered "64-bit". */
return AUDIT_ARCH_X86_64;
#endif /* CONFIG_X86_32 */
#endif /* _ASM_X86_SYSCALL_H */
......@@ -1480,7 +1480,11 @@ long syscall_trace_enter(struct pt_regs *regs)
regs->flags |= X86_EFLAGS_TF;
/* do the secure computing check first */
if (secure_computing(regs->orig_ax)) {
/* seccomp failures shouldn't expose any additional code. */
ret = -1L;
goto out;
if (unlikely(test_thread_flag(TIF_SYSCALL_EMU)))
ret = -1L;
......@@ -1505,6 +1509,7 @@ long syscall_trace_enter(struct pt_regs *regs)
regs->dx, regs->r10);
return ret ?: regs->orig_ax;
......@@ -1245,6 +1245,13 @@ static int check_unsafe_exec(struct linux_binprm *bprm)
bprm->unsafe |= LSM_UNSAFE_PTRACE;
* This isn't strictly necessary, but it makes it harder for LSMs to
* mess up.
if (current->no_new_privs)
bprm->unsafe |= LSM_UNSAFE_NO_NEW_PRIVS;
n_fs = 1;
......@@ -1288,7 +1295,8 @@ int prepare_binprm(struct linux_binprm *bprm)
bprm->cred->euid = current_euid();
bprm->cred->egid = current_egid();
if (!(bprm->file->f_path.mnt->mnt_flags & MNT_NOSUID)) {
if (!(bprm->file->f_path.mnt->mnt_flags & MNT_NOSUID) &&
!current->no_new_privs) {
/* Set-uid? */
if (mode & S_ISUID) {
bprm->per_clear |= PER_CLEAR_ON_SETID;
......@@ -681,7 +681,7 @@ static struct file *__dentry_open(struct dentry *dentry, struct vfsmount *mnt,
f->f_op = fops_get(inode->i_fop);
error = security_dentry_open(f, cred);
error = security_file_open(f, cred);
if (error)
goto cleanup_all;
......@@ -98,9 +98,18 @@ typedef struct siginfo {
int _fd;
} _sigpoll;
/* SIGSYS */
struct {
void __user *_call_addr; /* calling user insn */
int _syscall; /* triggering system call number */
unsigned int _arch; /* AUDIT_ARCH_* of syscall */
} _sigsys;
} _sifields;
} __ARCH_SI_ATTRIBUTES siginfo_t;
/* If the arch shares siginfo, then it has SIGSYS. */
#define __ARCH_SIGSYS
......@@ -124,6 +133,11 @@ typedef struct siginfo {
#define si_addr_lsb _sifields._sigfault._addr_lsb
#define si_band _sifields._sigpoll._band
#define si_fd _sifields._sigpoll._fd
#ifdef __ARCH_SIGSYS
#define si_call_addr _sifields._sigsys._call_addr
#define si_syscall _sifields._sigsys._syscall
#define si_arch _sifields._sigsys._arch
#ifdef __KERNEL__
#define __SI_MASK 0xffff0000u
......@@ -134,6 +148,7 @@ typedef struct siginfo {
#define __SI_CHLD (4 << 16)
#define __SI_RT (5 << 16)
#define __SI_MESGQ (6 << 16)
#define __SI_SYS (7 << 16)
#define __SI_CODE(T,N) ((T) | ((N) & 0xffff))
#define __SI_KILL 0
......@@ -143,6 +158,7 @@ typedef struct siginfo {
#define __SI_CHLD 0
#define __SI_RT 0
#define __SI_MESGQ 0
#define __SI_SYS 0
#define __SI_CODE(T,N) (N)
......@@ -239,6 +255,12 @@ typedef struct siginfo {
#define POLL_HUP (__SI_POLL|6) /* device disconnected */
#define NSIGPOLL 6
* SIGSYS si_codes
#define SYS_SECCOMP (__SI_SYS|1) /* seccomp triggered */
#define NSIGSYS 1
* sigevent definitions
......@@ -142,4 +142,18 @@ void syscall_set_arguments(struct task_struct *task, struct pt_regs *regs,
unsigned int i, unsigned int n,
const unsigned long *args);
* syscall_get_arch - return the AUDIT_ARCH for the current system call
* @task: task of interest, must be in system call entry tracing
* @regs: task_pt_regs() of @task
* Returns the AUDIT_ARCH_* based on the system call convention in use.
* It's only valid to call this when @task is stopped on entry to a system
* Architectures which permit CONFIG_HAVE_ARCH_SECCOMP_FILTER must
* provide an implementation of this.
int syscall_get_arch(struct task_struct *task, struct pt_regs *regs);
#endif /* _ASM_SYSCALL_H */
......@@ -24,7 +24,7 @@ struct keyring_list {
unsigned short maxkeys; /* max keys this list can hold */
unsigned short nkeys; /* number of keys currently held */
unsigned short delkey; /* key to be unlinked by RCU */
struct key *keys[0];
struct key __rcu *keys[0];
......@@ -330,6 +330,7 @@ header-y += scc.h
header-y += sched.h
header-y += screen_info.h
header-y += sdla.h
header-y += seccomp.h
header-y += securebits.h
header-y += selinux_netlink.h
header-y += sem.h
......@@ -463,7 +463,7 @@ extern void audit_putname(const char *name);
extern void __audit_inode(const char *name, const struct dentry *dentry);
extern void __audit_inode_child(const struct dentry *dentry,
const struct inode *parent);
extern void __audit_seccomp(unsigned long syscall);
extern void __audit_seccomp(unsigned long syscall, long signr, int code);
extern void __audit_ptrace(struct task_struct *t);
static inline int audit_dummy_context(void)
......@@ -508,10 +508,10 @@ static inline void audit_inode_child(const struct dentry *dentry,
void audit_core_dumps(long signr);
static inline void audit_seccomp(unsigned long syscall)
static inline void audit_seccomp(unsigned long syscall, long signr, int code)
if (unlikely(!audit_dummy_context()))
__audit_seccomp(syscall, signr, code);
static inline void audit_ptrace(struct task_struct *t)
......@@ -634,7 +634,7 @@ extern int audit_signals;
#define audit_inode(n,d) do { (void)(d); } while (0)
#define audit_inode_child(i,p) do { ; } while (0)
#define audit_core_dumps(i) do { ; } while (0)
#define audit_seccomp(i) do { ; } while (0)
#define audit_seccomp(i,s,c) do { ; } while (0)
#define auditsc_get_stamp(c,t,s) (0)
#define audit_get_loginuid(t) (-1)
#define audit_get_sessionid(t) (-1)
......@@ -10,6 +10,7 @@
#ifdef __KERNEL__
#include <linux/atomic.h>
#include <linux/compat.h>
......@@ -133,6 +134,16 @@ struct sock_fprog { /* Required for SO_ATTACH_FILTER. */
#ifdef __KERNEL__
* A struct sock_filter is architecture independent.
struct compat_sock_fprog {
u16 len;
compat_uptr_t filter; /* struct sock_filter * */
struct sk_buff;
struct sock;
......@@ -233,6 +244,7 @@ enum {
#endif /* __KERNEL__ */
......@@ -124,7 +124,10 @@ static inline unsigned long is_key_possessed(const key_ref_t key_ref)
struct key {
atomic_t usage; /* number of references */
key_serial_t serial; /* key serial number */
struct rb_node serial_node;
union {
struct list_head graveyard_link;
struct rb_node serial_node;
struct key_type *type; /* type of key */
struct rw_semaphore sem; /* change vs change sem */
struct key_user *user; /* owner of this key */
......@@ -133,6 +136,7 @@ struct key {
time_t expiry; /* time at which key expires (or 0) */
time_t revoked_at; /* time at which key was revoked */
time_t last_used_at; /* last time used for LRU keyring discard */
uid_t uid;
gid_t gid;
key_perm_t perm; /* access permissions */
......@@ -156,6 +160,7 @@ struct key {
#define KEY_FLAG_USER_CONSTRUCT 4 /* set if key is being constructed in userspace */
#define KEY_FLAG_NEGATIVE 5 /* set if key is negative */
#define KEY_FLAG_ROOT_CAN_CLEAR 6 /* set if key can be cleared by root without permission */
#define KEY_FLAG_INVALIDATED 7 /* set if key has been invalidated */
/* the description string
* - this is used to match a key against search criteria
......@@ -199,6 +204,7 @@ extern struct key *key_alloc(struct key_type *type,
#define KEY_ALLOC_NOT_IN_QUOTA 0x0002 /* not in quota */
extern void key_revoke(struct key *key);
extern void key_invalidate(struct key *key);
extern void key_put(struct key *key);
static inline struct key *key_get(struct key *key)
......@@ -236,7 +242,7 @@ extern struct key *request_key_async_with_auxdata(struct key_type *type,
extern int wait_for_key_construction(struct key *key, bool intr);
extern int key_validate(struct key *key);
extern int key_validate(const struct key *key);
extern key_ref_t key_create_or_update(key_ref_t keyring,
const char *type,
......@@ -319,6 +325,7 @@ extern void key_init(void);
#define key_serial(k) 0
#define key_get(k) ({ NULL; })
#define key_revoke(k) do { } while(0)