Skip to content
  • Daniel Borkmann's avatar
    bpf: prevent out of bounds speculation on pointer arithmetic · 979d63d5
    Daniel Borkmann authored
    Jann reported that the original commit back in b2157399
    ("bpf: prevent out-of-bounds speculation") was not sufficient
    to stop CPU from speculating out of bounds memory access:
    While b2157399 only focussed on masking array map access
    for unprivileged users for tail calls and data access such
    that the user provided index gets sanitized from BPF program
    and syscall side, there is still a more generic form affected
    from BPF programs that applies to most maps that hold user
    data in relation to dynamic map access when dealing with
    unknown scalars or "slow" known scalars as access offset, for
    example:
    
      - Load a map value pointer into R6
      - Load an index into R7
      - Do a slow computation (e.g. with a memory dependency) that
        loads a limit into R8 (e.g. load the limit from a map for
        high latency, then mask it to make the verifier happy)
      - Exit if R7 >= R8 (mispredicted branch)
      - Load R0 = R6[R7]
      - Load R0 = R6[R0]
    
    For unknown scalars there are two options in the BPF verifier
    where we could derive knowledge from in order to guarantee
    safe access to the memory: i) While </>/<=/>= variants won't
    allow to derive any lower or upper bounds from the unknown
    scalar where it would be safe to add it to the map value
    pointer, it is possible through ==/!= test however. ii) another
    option is to transform the unknown scalar into a known scalar,
    for example, through ALU ops combination such as R &= <imm>
    followed by R |= <imm> or any similar combination where the
    original information from the unknown scalar would be destroyed
    entirely leaving R with a constant. The initial slow load still
    precedes the latter ALU ops on that register, so the CPU
    executes speculatively from that point. Once we have the known
    scalar, any compare operation would work then. A third option
    only involving registers with known scalars could be crafted
    as described in [0] where a CPU port (e.g. Slow Int unit)
    would be filled with many dependent computations such that
    the subsequent condition depending on its outcome has to wait
    for evaluation on its execution port and thereby executing
    speculatively if the speculated code can be scheduled on a
    different execution port, or any other form of mistraining
    as described in [1], for example. Given this is not limited
    to only unknown scalars, not only map but also stack access
    is affected since both is accessible for unprivileged users
    and could potentially be used for out of bounds access under
    speculation.
    
    In order to prevent any of these cases, the verifier is now
    sanitizing pointer arithmetic on the offset such that any
    out of bounds speculation would be masked in a way where the
    pointer arithmetic result in the destination register will
    stay unchanged, meaning offset masked into zero similar as
    in array_index_nospec() case. With regards to implementation,
    there are three options that were considered: i) new insn
    for sanitation, ii) push/pop insn and sanitation as inlined
    BPF, iii) reuse of ax register and sanitation as inlined BPF.
    
    Option i) has the downside that we end up using from reserved
    bits in the opcode space, but also that we would require
    each JIT to emit masking as native arch opcodes meaning
    mitigation would have slow adoption till everyone implements
    it eventually which is counter-productive. Option ii) and iii)
    have both in common that a temporary register is needed in
    order to implement the sanitation as inlined BPF since we
    are not allowed to modify the source register. While a push /
    pop insn in ii) would be useful to have in any case, it
    requires once again that every JIT needs to implement it
    first. While possible, amount of changes needed would also
    be unsuitable for a -stable patch. Therefore, the path which
    has fewer changes, less BPF instructions for the mitigation
    and does not require anything to be changed in the JITs is
    option iii) which this work is pursuing. The ax register is
    already mapped to a register in all JITs (modulo arm32 where
    it's mapped to stack as various other BPF registers there)
    and used in constant blinding for JITs-only so far. It can
    be reused for verifier rewrites under certain constraints.
    The interpreter's tmp "register" has therefore been remapped
    into extending the register set with hidden ax register and
    reusing that for a number of instructions that needed the
    prior temporary variable internally (e.g. div, mod). This
    allows for zero increase in stack space usage in the interpreter,
    and enables (restricted) generic use in rewrites otherwise as
    long as such a patchlet does not make use of these instructions.
    The sanitation mask is dynamic and relative to the offset the
    map value or stack pointer currently holds.
    
    There are various cases that need to be taken under consideration
    for the masking, e.g. such operation could look as follows:
    ptr += val or val += ptr or ptr -= val. Thus, the value to be
    sanitized could reside either in source or in destination
    register, and the limit is different depending on whether
    the ALU op is addition or subtraction and depending on the
    current known and bounded offset. The limit is derived as
    follows: limit := max_value_size - (smin_value + off). For
    subtraction: limit := umax_value + off. This holds because
    we do not allow any pointer arithmetic that would
    temporarily go out of bounds or would have an unknown
    value with mixed signed bounds where it is unclear at
    verification time whether the actual runtime value would
    be either negative or positive. For example, we have a
    derived map pointer value with constant offset and bounded
    one, so limit based on smin_value works because the verifier
    requires that statically analyzed arithmetic on the pointer
    must be in bounds, and thus it checks if resulting
    smin_value + off and umax_value + off is still within map
    value bounds at time of arithmetic in addition to time of
    access. Similarly, for the case of stack access we derive
    the limit as follows: MAX_BPF_STACK + off for subtraction
    and -off for the case of addition where off := ptr_reg->off +
    ptr_reg->var_off.value. Subtraction is a special case for
    the masking which can be in form of ptr += -val, ptr -= -val,
    or ptr -= val. In the first two cases where we know that
    the value is negative, we need to temporarily negate the
    value in order to do the sanitation on a positive value
    where we later swap the ALU op, and restore original source
    register if the value was in source.
    
    The sanitation of pointer arithmetic alone is still not fully
    sufficient as is, since a scenario like the following could
    happen ...
    
      PTR += 0x1000 (e.g. K-based imm)
      PTR -= BIG_NUMBER_WITH_SLOW_COMPARISON
      PTR += 0x1000
      PTR -= BIG_NUMBER_WITH_SLOW_COMPARISON
      [...]
    
    ... which under speculation could end up as ...
    
      PTR += 0x1000
      PTR -= 0 [ truncated by mitigation ]
      PTR += 0x1000
      PTR -= 0 [ truncated by mitigation ]
      [...]
    
    ... and therefore still access out of bounds. To prevent such
    case, the verifier is also analyzing safety for potential out
    of bounds access under speculative execution. Meaning, it is
    also simulating pointer access under truncation. We therefore
    "branch off" and push the current verification state after the
    ALU operation with known 0 to the verification stack for later
    analysis. Given the current path analysis succeeded it is
    likely that the one under speculation can be pruned. In any
    case, it is also subject to existing complexity limits and
    therefore anything beyond this point will be rejected. In
    terms of pruning, it needs to be ensured that the verification
    state from speculative execution simulation must never prune
    a non-speculative execution path, therefore, we mark verifier
    state accordingly at the time of push_stack(). If verifier
    detects out of bounds access under speculative execution from
    one of the possible paths that includes a truncation, it will
    reject such program.
    
    Given we mask every reg-based pointer arithmetic for
    unprivileged programs, we've been looking into how it could
    affect real-world programs in terms of size increase. As the
    majority of programs are targeted for privileged-only use
    case, we've unconditionally enabled masking (with its alu
    restrictions on top of it) for privileged programs for the
    sake of testing in order to check i) whether they get rejected
    in its current form, and ii) by how much the number of
    instructions and size will increase. We've tested this by
    using Katran, Cilium and test_l4lb from the kernel selftests.
    For Katran we've evaluated balancer_kern.o, Cilium bpf_lxc.o
    and an older test object bpf_lxc_opt_-DUNKNOWN.o and l4lb
    we've used test_l4lb.o as well as test_l4lb_noinline.o. We
    found that none of the programs got rejected by the verifier
    with this change, and that impact is rather minimal to none.
    balancer_kern.o had 13,904 bytes (1,738 insns) xlated and
    7,797 bytes JITed before and after the change. Most complex
    program in bpf_lxc.o had 30,544 bytes (3,817 insns) xlated
    and 18,538 bytes JITed before and after and none of the other
    tail call programs in bpf_lxc.o had any changes either. For
    the older bpf_lxc_opt_-DUNKNOWN.o object we found a small
    increase from 20,616 bytes (2,576 insns) and 12,536 bytes JITed
    before to 20,664 bytes (2,582 insns) and 12,558 bytes JITed
    after the change. Other programs from that object file had
    similar small increase. Both test_l4lb.o had no change and
    remained at 6,544 bytes (817 insns) xlated and 3,401 bytes
    JITed and for test_l4lb_noinline.o constant at 5,080 bytes
    (634 insns) xlated and 3,313 bytes JITed. This can be explained
    in that LLVM typically optimizes stack based pointer arithmetic
    by using K-based operations and that use of dynamic map access
    is not overly frequent. However, in future we may decide to
    optimize the algorithm further under known guarantees from
    branch and value speculation. Latter seems also unclear in
    terms of prediction heuristics that today's CPUs apply as well
    as whether there could be collisions in e.g. the predictor's
    Value History/Pattern Table for triggering out of bounds access,
    thus masking is performed unconditionally at this point but could
    be subject to relaxation later on. We were generally also
    brainstorming various other approaches for mitigation, but the
    blocker was always lack of available registers at runtime and/or
    overhead for runtime tracking of limits belonging to a specific
    pointer. Thus, we found this to be minimally intrusive under
    given constraints.
    
    With that in place, a simple example with sanitized access on
    unprivileged load at post-verification time looks as follows:
    
      # bpftool prog dump xlated id 282
      [...]
      28: (79) r1 = *(u64 *)(r7 +0)
      29: (79) r2 = *(u64 *)(r7 +8)
      30: (57) r1 &= 15
      31: (79) r3 = *(u64 *)(r0 +4608)
      32: (57) r3 &= 1
      33: (47) r3 |= 1
      34: (2d) if r2 > r3 goto pc+19
      35: (b4) (u32) r11 = (u32) 20479  |
      36: (1f) r11 -= r2                | Dynamic sanitation for pointer
      37: (4f) r11 |= r2                | arithmetic with registers
      38: (87) r11 = -r11               | containing bounded or known
      39: (c7) r11 s>>= 63              | scalars in order to prevent
      40: (5f) r11 &= r2                | out of bounds speculation.
      41: (0f) r4 += r11                |
      42: (71) r4 = *(u8 *)(r4 +0)
      43: (6f) r4 <<= r1
      [...]
    
    For the case where the scalar sits in the destination register
    as opposed to the source register, the following code is emitted
    for the above example:
    
      [...]
      16: (b4) (u32) r11 = (u32) 20479
      17: (1f) r11 -= r2
      18: (4f) r11 |= r2
      19: (87) r11 = -r11
      20: (c7) r11 s>>= 63
      21: (5f) r2 &= r11
      22: (0f) r2 += r0
      23: (61) r0 = *(u32 *)(r2 +0)
      [...]
    
    JIT blinding example with non-conflicting use of r10:
    
      [...]
       d5:	je     0x0000000000000106    _
       d7:	mov    0x0(%rax),%edi       |
       da:	mov    $0xf153246,%r10d     | Index load from map value and
       e0:	xor    $0xf153259,%r10      | (const blinded) mask with 0x1f.
       e7:	and    %r10,%rdi            |_
       ea:	mov    $0x2f,%r10d          |
       f0:	sub    %rdi,%r10            | Sanitized addition. Both use r10
       f3:	or     %rdi,%r10            | but do not interfere with each
       f6:	neg    %r10                 | other. (Neither do these instructions
       f9:	sar    $0x3f,%r10           | interfere with the use of ax as temp
       fd:	and    %r10,%rdi            | in interpreter.)
      100:	add    %rax,%rdi            |_
      103:	mov    0x0(%rdi),%eax
     [...]
    
    Tested that it fixes Jann's reproducer, and also checked that test_verifier
    and test_progs suite with interpreter, JIT and JIT with hardening enabled
    on x86-64 and arm64 runs successfully.
    
      [0] Speculose: Analyzing the Security Implications of Speculative
          Execution in CPUs, Giorgi Maisuradze and Christian Rossow,
          https://arxiv.org/pdf/1801.04084.pdf
    
      [1] A Systematic Evaluation of Transient Execution Attacks and
          Defenses, Claudio Canella, Jo Van Bulck, Michael Schwarz,
          Moritz Lipp, Benjamin von Berg, Philipp Ortner, Frank Piessens,
          Dmitry Evtyushkin, Daniel Gruss,
          https://arxiv.org/pdf/1811.05441.pdf
    
    Fixes: b2157399
    
     ("bpf: prevent out-of-bounds speculation")
    Reported-by: default avatarJann Horn <jannh@google.com>
    Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
    Acked-by: default avatarAlexei Starovoitov <ast@kernel.org>
    Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
    979d63d5