Commit ee9f8fce, authored by Josh Poimboeuf, committed by Ingo Molnar

x86/unwind: Add the ORC unwinder

Add the new ORC unwinder which is enabled by CONFIG_ORC_UNWINDER=y.
It plugs into the existing x86 unwinder framework.

It relies on objtool to generate the needed .orc_unwind and
.orc_unwind_ip sections.

For more details on why ORC is used instead of DWARF, see
Documentation/x86/orc-unwinder.txt - but the short version is
that it's a simplified, fundamentally more robust debuginfo
data structure, which also allows up to two orders of magnitude
faster lookups than the DWARF unwinder - which matters to
profiling workloads like perf.

Thanks to Andy Lutomirski for the performance improvement ideas:
splitting the ORC unwind table into two parallel arrays and creating a
fast lookup table to search a subset of the unwind table.

Signed-off-by: Josh Poimboeuf <>
Cc: Andy Lutomirski <>
Cc: Borislav Petkov <>
Cc: Brian Gerst <>
Cc: Denys Vlasenko <>
Cc: H. Peter Anvin <>
Cc: Jiri Slaby <>
Cc: Linus Torvalds <>
Cc: Mike Galbraith <>
Cc: Peter Zijlstra <>
Cc: Thomas Gleixner <>
[ Extended the changelog. ]
Signed-off-by: Ingo Molnar <>
parent 1ee6f00d
ORC unwinder
The kernel CONFIG_ORC_UNWINDER option enables the ORC unwinder, which is
similar in concept to a DWARF unwinder. The difference is that the
format of the ORC data is much simpler than DWARF, which in turn allows
the ORC unwinder to be much simpler and faster.
The ORC data consists of unwind tables which are generated by objtool.
They contain out-of-band data which is used by the in-kernel ORC
unwinder. Objtool generates the ORC data by first doing compile-time
stack metadata validation (CONFIG_STACK_VALIDATION). After analyzing
all the code paths of a .o file, it determines information about the
stack state at each instruction address in the file and outputs that
information to the .orc_unwind and .orc_unwind_ip sections.
The per-object ORC sections are combined at link time and are sorted and
post-processed at boot time. The unwinder uses the resulting data to
correlate instruction addresses with their stack states at run time.
ORC vs frame pointers
With frame pointers enabled, GCC adds instrumentation code to every
function in the kernel. The kernel's .text size increases by about
3.2%, resulting in a broad kernel-wide slowdown. Measurements by Mel
Gorman [1] have shown a slowdown of 5-10% for some workloads.
In contrast, the ORC unwinder has no effect on text size or runtime
performance, because the debuginfo is out of band. So if you disable
frame pointers and enable the ORC unwinder, you get a nice performance
improvement across the board, and still have reliable stack traces.
Ingo Molnar says:
"Note that it's not just a performance improvement, but also an
instruction cache locality improvement: 3.2% .text savings almost
directly transform into a similarly sized reduction in cache
footprint. That can transform to even higher speedups for workloads
whose cache locality is borderline."
Another benefit of ORC compared to frame pointers is that it can
reliably unwind across interrupts and exceptions. Frame pointer based
unwinds can sometimes skip the caller of the interrupted function, if it
was a leaf function or if the interrupt hit before the frame pointer was
saved.
The main disadvantage of the ORC unwinder compared to frame pointers is
that it needs more memory to store the ORC unwind tables: roughly 2-4MB
depending on the kernel config.
ORC vs DWARF
ORC debuginfo's advantage over DWARF itself is that it's much simpler.
It gets rid of the complex DWARF CFI state machine and also gets rid of
the tracking of unnecessary registers. This allows the unwinder to be
much simpler, meaning fewer bugs, which is especially important for
mission critical oops code.
The simpler debuginfo format also enables the unwinder to be much faster
than DWARF, which is important for perf and lockdep. In a basic
performance test by Jiri Slaby [2], the ORC unwinder was about 20x
faster than an out-of-tree DWARF unwinder. (Note: That measurement was
taken before some performance tweaks were added, which doubled
performance, so the speedup over DWARF may be closer to 40x.)
The ORC data format does have a few downsides compared to DWARF. ORC
unwind tables take up ~50% more RAM (+1.3MB on an x86 defconfig kernel)
than DWARF-based eh_frame tables.
Another potential downside is that, as GCC evolves, it's conceivable
that the ORC data may end up being *too* simple to describe the state of
the stack for certain optimizations. But IMO this is unlikely because
GCC saves the frame pointer for any unusual stack adjustments it does,
so I suspect we'll really only ever need to keep track of the stack
pointer and the frame pointer between call frames. But even if we do
end up having to track all the registers DWARF tracks, at least we will
still be able to control the format, e.g. no complex state machines.
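To illustrate how little state that amounts to, here is a minimal user-space sketch of a single unwind step driven by one ORC-style entry. Everything in it is hypothetical: the names toy_orc_entry, toy_regs and toy_unwind_step, the register selectors, and the stack modeled as an array of 64-bit words. The rule it applies (previous stack pointer = base register + offset, return address in the word just below it) is an assumption based on the usual x86-64 call frame layout, not the code from this commit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical register selectors; the real ORC encoding may differ. */
enum toy_reg { TOY_REG_UNDEFINED, TOY_REG_SP, TOY_REG_BP };

/* Simplified stand-in for an ORC entry. */
struct toy_orc_entry {
	int16_t sp_offset;   /* CFA = base register + sp_offset */
	int16_t bp_offset;   /* caller's BP is saved at CFA + bp_offset */
	enum toy_reg sp_reg; /* base register used to compute the CFA */
	bool bp_saved;       /* false: BP is unchanged by this function */
};

/* sp and bp are byte offsets into the toy stack, ip is a code address. */
struct toy_regs {
	uint64_t sp, bp, ip;
};

/* Read the 64-bit word at byte offset 'off'; fail if misaligned or out of range. */
static bool toy_read(const uint64_t *stack, size_t words, uint64_t off,
		     uint64_t *val)
{
	if (off % 8 || off / 8 >= words)
		return false;
	*val = stack[off / 8];
	return true;
}

/* Step from the current frame to its caller; returns false if that fails. */
static bool toy_unwind_step(const struct toy_orc_entry *orc,
			    const uint64_t *stack, size_t words,
			    struct toy_regs *regs)
{
	uint64_t cfa, ip, bp = regs->bp;

	if (orc->sp_reg == TOY_REG_SP)
		cfa = regs->sp + orc->sp_offset;
	else if (orc->sp_reg == TOY_REG_BP)
		cfa = regs->bp + orc->sp_offset;
	else
		return false;

	/* The return address sits in the word just below the CFA. */
	if (cfa < 8 || !toy_read(stack, words, cfa - 8, &ip))
		return false;

	if (orc->bp_saved && !toy_read(stack, words, cfa + orc->bp_offset, &bp))
		return false;

	regs->sp = cfa;
	regs->bp = bp;
	regs->ip = ip;
	return true;
}
```

Only two registers are read and written per step, which is the property the paragraph above relies on; a DWARF CFI interpreter would have to evaluate a program of state-machine opcodes to reach the same result.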
ORC unwind table generation
The ORC data is generated by objtool. With the existing compile-time
stack metadata validation feature, objtool already follows all code
paths, and so it already has all the information it needs to be able to
generate ORC data from scratch. So it's an easy step to go from stack
validation to ORC data generation.
It should be possible to instead generate the ORC data with a simple
tool which converts DWARF to ORC data. However, such a solution would
be incomplete due to the kernel's extensive use of asm, inline asm, and
special sections like exception tables.
That could be rectified by manually annotating those special code paths
using GNU assembler .cfi annotations in .S files, and homegrown
annotations for inline asm in .c files. But asm annotations were tried
in the past and were found to be unmaintainable. They were often
incorrect/incomplete and made the code harder to read and keep updated.
And based on looking at glibc code, annotating inline asm in .c files
might be even worse.
Objtool still needs a few annotations, but only in code which does
unusual things to the stack like entry code. And even then, far fewer
annotations are needed than what DWARF would need, so they're much more
maintainable than DWARF CFI annotations.
So the advantages of using objtool to generate ORC data are that it
gives more accurate debuginfo, with very few annotations. It also
insulates the kernel from toolchain bugs which can be very painful to
deal with in the kernel since we often have to work around issues in
older versions of the toolchain for years.
The downside is that the unwinder now becomes dependent on objtool's
ability to reverse engineer GCC code flow. If GCC optimizations become
too complicated for objtool to follow, the ORC data generation might
stop working or become incomplete. (It's worth noting that livepatch
already has such a dependency on objtool's ability to follow GCC code
flow.)
If newer versions of GCC come up with some optimizations which break
objtool, we may need to revisit the current implementation. Some
possible solutions would be asking GCC to make the optimizations more
palatable, or having objtool use DWARF as an additional input, or
creating a GCC plugin to assist objtool with its analysis. But for now,
objtool follows GCC code quite well.
Unwinder implementation details
Objtool generates the ORC data by integrating with the compile-time
stack metadata validation feature, which is described in detail in
tools/objtool/Documentation/stack-validation.txt. After analyzing all
the code paths of a .o file, it creates an array of orc_entry structs,
and a parallel array of instruction addresses associated with those
structs, and writes them to the .orc_unwind and .orc_unwind_ip sections
respectively.
The ORC data is split into the two arrays for performance reasons, to
make the searchable part of the data (.orc_unwind_ip) more compact. The
arrays are sorted in parallel at boot time.
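The parallel-array layout can be sketched as follows. This is a simplified illustration, not the kernel's code: toy_orc and toy_orc_find are hypothetical names, and the address array holds absolute 64-bit addresses for readability, whereas the module struct in this commit declares orc_unwind_ip as an array of int.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, trimmed-down payload; index i pairs with ips[i]. */
struct toy_orc {
	int16_t sp_offset;
	int16_t bp_offset;
};

/*
 * Return the index of the last entry whose start address is <= ip,
 * or -1 if ip precedes the first entry. ips[] must be sorted ascending.
 * Only the compact address array is touched during the search.
 */
static long toy_orc_find(const uint64_t *ips, size_t n, uint64_t ip)
{
	size_t lo = 0, hi = n; /* search window is [lo, hi) */
	long found = -1;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (ips[mid] <= ip) {
			found = (long)mid;
			lo = mid + 1;
		} else {
			hi = mid;
		}
	}
	return found;
}
```

The returned index is then used once to fetch the matching struct from the second array, so the larger payload never competes for cache space with the keys being searched.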
Performance is further improved by the use of a fast lookup table which
is created at runtime. The fast lookup table associates a given address
with a range of indices for the .orc_unwind table, so that only a small
subset of the table needs to be searched.
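A rough sketch of that idea, under stated assumptions: the text range is cut into fixed 256-byte blocks, and each lookup slot stores the index of the unwind entry covering the first address of its block, with one extra slot to bound the last block. The helper names (toy_search, toy_build_lookup, toy_fast_find) are hypothetical and the real table construction may differ in detail.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_BLOCK_SIZE 256u

/* Last index in ips[first..last] whose address is <= ip, or -1 if none. */
static long toy_search(const uint64_t *ips, long first, long last, uint64_t ip)
{
	long found = -1;

	while (first <= last) {
		long mid = first + (last - first) / 2;

		if (ips[mid] <= ip) {
			found = mid;
			first = mid + 1;
		} else {
			last = mid - 1;
		}
	}
	return found;
}

/*
 * Fill lookup[0..nblocks] (nblocks + 1 slots). Slot b holds the index of
 * the entry covering the first address of block b; the extra final slot
 * gives the last block an upper bound.
 */
static void toy_build_lookup(const uint64_t *ips, size_t n, uint64_t text_start,
			     unsigned int *lookup, size_t nblocks)
{
	for (size_t b = 0; b <= nblocks; b++) {
		long idx = toy_search(ips, 0, (long)n - 1,
				      text_start + b * TOY_BLOCK_SIZE);

		lookup[b] = idx < 0 ? 0 : (unsigned int)idx;
	}
}

/* Same result as a full search, but only scans lookup[b]..lookup[b + 1]. */
static long toy_fast_find(const uint64_t *ips, const unsigned int *lookup,
			  size_t nblocks, uint64_t text_start, uint64_t ip)
{
	size_t b;

	if (ip < text_start)
		return -1;
	b = (ip - text_start) / TOY_BLOCK_SIZE;
	if (b >= nblocks)
		return -1;
	return toy_search(ips, lookup[b], lookup[b + 1], ip);
}
```

Because consecutive slots bracket the entries that can cover any address inside a block, the binary search runs over a handful of entries instead of the whole table.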
Etymology
Orcs, fearsome creatures of medieval folklore, are the Dwarves' natural
enemies. Similarly, the ORC unwinder was created in opposition to the
complexity and slowness of DWARF.
"Although Orcs rarely consider multiple solutions to a problem, they do
excel at getting things done because they are creatures of action, not
thought." [3] Similarly, unlike the esoteric DWARF unwinder, the
veracious ORC unwinder wastes no time or siliconic effort decoding
variable-length zero-extended unsigned-integer byte-coded
state-machine-based debug information entries.
Similar to how Orcs frequently unravel the well-intentioned plans of
their adversaries, the ORC unwinder frequently unravels stacks with
brutal, unyielding efficiency.
ORC stands for Oops Rewind Capability.
static inline void
unwind_module_init(struct module *mod, void *orc_ip, size_t orc_ip_size,
void *orc, size_t orc_size) {}
#endif /* _ASM_UML_UNWIND_H */
@@ -157,6 +157,7 @@ config X86
select HAVE_NMI
@@ -355,4 +355,29 @@ config PUNIT_ATOM_DEBUG
The current power state can be read from
bool "ORC unwinder"
depends on X86_64
This option enables the ORC (Oops Rewind Capability) unwinder for
unwinding kernel stack traces. It uses a custom data format which is
a simplified version of the DWARF Call Frame Information standard.
This unwinder is more accurate across interrupt entry frames than the
frame pointer unwinder. It can also enable a 5-10% performance
improvement across the entire kernel if CONFIG_FRAME_POINTER is
disabled.
Enabling this option will increase the kernel's runtime memory usage
by roughly 2-4MB, depending on your kernel config.
def_bool y
def_bool y
@@ -2,6 +2,15 @@
 #define _ASM_X86_MODULE_H
 #include <asm-generic/module.h>
+#include <asm/orc_types.h>
+struct mod_arch_specific {
+unsigned int num_orcs;
+int *orc_unwind_ip;
+struct orc_entry *orc_unwind;
 #ifdef CONFIG_X86_64
 /* X86_64 does not define MODULE_PROC_FAMILY */
* Copyright (C) 2017 Josh Poimboeuf <>
* This program is free software; you can redistribute it and/or
* modify it under the terms of the GNU General Public License
* as published by the Free Software Foundation; either version 2
* of the License, or (at your option) any later version.
* This program is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* GNU General Public License for more details.
* You should have received a copy of the GNU General Public License
* along with this program; if not, see <>.
#ifndef _ORC_LOOKUP_H
#define _ORC_LOOKUP_H
* This is a lookup table for speeding up access to the .orc_unwind table.
* Given an input address offset, the corresponding lookup table entry
* specifies a subset of the .orc_unwind table to search.
* Each block represents the end of the previous range and the start of the
* next range. An extra block is added to give the last range an end.
* The block size should be a power of 2 to avoid a costly 'div' instruction.
* A block size of 256 was chosen because it roughly doubles unwinder
* performance while only adding ~5% to the ORC data footprint.
extern unsigned int orc_lookup[];
extern unsigned int orc_lookup_end[];
#define LOOKUP_START_IP (unsigned long)_stext
#define LOOKUP_STOP_IP (unsigned long)_etext
#endif /* LINKER_SCRIPT */
#endif /* _ORC_LOOKUP_H */
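As a worked example of the sizing described in the header comment, the sketch below computes how many lookup slots a text section of a given size needs (one per 256-byte block, plus the extra closing slot) and which slot an address falls into. The function names are hypothetical; the 256-byte block size and the 4-byte slot width are taken from the comment above and the linker script in this commit.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_LOOKUP_BLOCK_SIZE 256u
#define TOY_LOOKUP_SLOT_BYTES 4u

/* A power-of-2 block size lets the compiler replace the division by a shift. */
_Static_assert((TOY_LOOKUP_BLOCK_SIZE & (TOY_LOOKUP_BLOCK_SIZE - 1)) == 0,
	       "block size must be a power of 2");

/* Number of slots: one per (partial) block, plus one to end the last range. */
static size_t toy_lookup_slots(size_t text_size)
{
	return (text_size + TOY_LOOKUP_BLOCK_SIZE - 1) / TOY_LOOKUP_BLOCK_SIZE + 1;
}

/* Bytes reserved for the table. */
static size_t toy_lookup_bytes(size_t text_size)
{
	return toy_lookup_slots(text_size) * TOY_LOOKUP_SLOT_BYTES;
}

/* Slot index for an address at or above text_start. */
static size_t toy_lookup_index(unsigned long ip, unsigned long text_start)
{
	return (ip - text_start) / TOY_LOOKUP_BLOCK_SIZE;
}
```

For a 1 KiB text section this gives 5 slots and 20 bytes, which shows why the table stays small relative to the unwind data it indexes.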
@@ -88,7 +88,7 @@ struct orc_entry {
unsigned sp_reg:4;
unsigned bp_reg:4;
unsigned type:2;
} __packed;
* This struct is used by asm and inline asm code to manually annotate the
@@ -12,11 +12,14 @@ struct unwind_state {
struct task_struct *task;
int graph_idx;
bool error;
bool signal, full_regs;
unsigned long sp, bp, ip;
struct pt_regs *regs;
bool got_irq;
unsigned long *bp, *orig_sp;
unsigned long *bp, *orig_sp, ip;
struct pt_regs *regs;
unsigned long ip;
unsigned long *sp;
@@ -24,41 +27,30 @@ struct unwind_state {
void __unwind_start(struct unwind_state *state, struct task_struct *task,
struct pt_regs *regs, unsigned long *first_frame);
bool unwind_next_frame(struct unwind_state *state);
unsigned long unwind_get_return_address(struct unwind_state *state);
unsigned long *unwind_get_return_address_ptr(struct unwind_state *state);
static inline bool unwind_done(struct unwind_state *state)
return state->stack_info.type == STACK_TYPE_UNKNOWN;
static inline
void unwind_start(struct unwind_state *state, struct task_struct *task,
struct pt_regs *regs, unsigned long *first_frame)
first_frame = first_frame ? : get_stack_pointer(task, regs);
__unwind_start(state, task, regs, first_frame);
static inline bool unwind_error(struct unwind_state *state)
return state->error;
static inline
unsigned long *unwind_get_return_address_ptr(struct unwind_state *state)
void unwind_start(struct unwind_state *state, struct task_struct *task,
struct pt_regs *regs, unsigned long *first_frame)
if (unwind_done(state))
return NULL;
first_frame = first_frame ? : get_stack_pointer(task, regs);
return state->regs ? &state->regs->ip : state->bp + 1;
__unwind_start(state, task, regs, first_frame);
static inline struct pt_regs *unwind_get_entry_regs(struct unwind_state *state)
if (unwind_done(state))
@@ -66,20 +58,46 @@ static inline struct pt_regs *unwind_get_entry_regs(struct unwind_state *state)
return state->regs;
static inline
unsigned long *unwind_get_return_address_ptr(struct unwind_state *state)
static inline struct pt_regs *unwind_get_entry_regs(struct unwind_state *state)
return NULL;
static inline struct pt_regs *unwind_get_entry_regs(struct unwind_state *state)
void unwind_init(void);
void unwind_module_init(struct module *mod, void *orc_ip, size_t orc_ip_size,
void *orc, size_t orc_size);
static inline void unwind_init(void) {}
static inline
void unwind_module_init(struct module *mod, void *orc_ip, size_t orc_ip_size,
void *orc, size_t orc_size) {}
* This disables KASAN checking when reading a value from another task's stack,
* since the other task could be running on another CPU and could have poisoned
* the stack in the meantime.
#define READ_ONCE_TASK_STACK(task, x) \
({ \
unsigned long val; \
if (task == current) \
val = READ_ONCE(x); \
else \
val; \
static inline bool task_on_another_cpu(struct task_struct *task)
return NULL;
return task != current && task->on_cpu;
return false;
#endif /* _ASM_X86_UNWIND_H */
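The functions declared in this header are meant to be used as an iterator. The sketch below shows that calling pattern with toy stand-ins: toy_unwind_state and the toy_* functions are hypothetical and simply walk a pre-recorded list of return addresses, so only the shape of the loop in toy_save_trace reflects how the real unwind_start(), unwind_done(), unwind_next_frame() and unwind_get_return_address() are combined.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for struct unwind_state: a cursor over recorded frames. */
struct toy_unwind_state {
	const unsigned long *ret_addrs;
	size_t nr;
	size_t pos;
};

static void toy_unwind_start(struct toy_unwind_state *state,
			     const unsigned long *ret_addrs, size_t nr)
{
	state->ret_addrs = ret_addrs;
	state->nr = nr;
	state->pos = 0;
}

static bool toy_unwind_done(const struct toy_unwind_state *state)
{
	return state->pos >= state->nr;
}

/* Advance to the caller's frame; returns false once the walk is finished. */
static bool toy_unwind_next_frame(struct toy_unwind_state *state)
{
	if (toy_unwind_done(state))
		return false;
	state->pos++;
	return !toy_unwind_done(state);
}

/* Return address of the current frame, or 0 if there is none. */
static unsigned long
toy_unwind_get_return_address(const struct toy_unwind_state *state)
{
	if (toy_unwind_done(state))
		return 0;
	return state->ret_addrs[state->pos];
}

/* Collect up to 'max' return addresses; stops early on a zero address. */
static size_t toy_save_trace(const unsigned long *ret_addrs, size_t nr,
			     unsigned long *out, size_t max)
{
	struct toy_unwind_state state;
	size_t n = 0;

	for (toy_unwind_start(&state, ret_addrs, nr); !toy_unwind_done(&state);
	     toy_unwind_next_frame(&state)) {
		unsigned long addr = toy_unwind_get_return_address(&state);

		if (!addr || n == max)
			break;
		out[n++] = addr;
	}
	return n;
}
```

Because callers only see this iterator interface, the same loop works whichever unwinder implementation is compiled in.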
@@ -126,11 +126,9 @@ obj-$(CONFIG_PERF_EVENTS) += perf_regs.o
 obj-$(CONFIG_TRACING) += tracepoint.o
 obj-$(CONFIG_SCHED_MC_PRIO) += itmt.o
-obj-y += unwind_frame.o
-obj-y += unwind_guess.o
+obj-$(CONFIG_ORC_UNWINDER) += unwind_orc.o
+obj-$(CONFIG_FRAME_POINTER_UNWINDER) += unwind_frame.o
+obj-$(CONFIG_GUESS_UNWINDER) += unwind_guess.o
 # 64 bit specific files
@@ -35,6 +35,7 @@
 #include <asm/page.h>
 #include <asm/pgtable.h>
 #include <asm/setup.h>
+#include <asm/unwind.h>
 #if 0
 #define DEBUGP(fmt, ...) \
@@ -213,7 +214,7 @@ int module_finalize(const Elf_Ehdr *hdr,
 struct module *me)
 const Elf_Shdr *s, *text = NULL, *alt = NULL, *locks = NULL,
-*para = NULL;
+*para = NULL, *orc = NULL, *orc_ip = NULL;
 char *secstrings = (void *)hdr + sechdrs[hdr->e_shstrndx].sh_offset;
 for (s = sechdrs; s < sechdrs + hdr->e_shnum; s++) {
@@ -225,6 +226,10 @@ int module_finalize(const Elf_Ehdr *hdr,
 locks = s;
 if (!strcmp(".parainstructions", secstrings + s->sh_name))
 para = s;
+if (!strcmp(".orc_unwind", secstrings + s->sh_name))
+orc = s;
+if (!strcmp(".orc_unwind_ip", secstrings + s->sh_name))
+orc_ip = s;
if (alt) {
@@ -248,6 +253,10 @@ int module_finalize(const Elf_Ehdr *hdr,
 /* make jump label nops */
+if (orc && orc_ip)
+unwind_module_init(me, (void *)orc_ip->sh_addr, orc_ip->sh_size,
+(void *)orc->sh_addr, orc->sh_size);
 return 0;
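The body of unwind_module_init() is not visible in this view, so the following is only a guess at the kind of bookkeeping such a hook needs: derive the entry count from the two section sizes, reject sections that do not pair up, and record the pointers in a per-module struct modeled on mod_arch_specific. The toy_* names and the 6-byte toy_orc_entry layout are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical fixed-size entry; stands in for struct orc_entry. */
struct toy_orc_entry {
	int16_t sp_offset;
	int16_t bp_offset;
	uint16_t flags;
};

/* Modeled on the fields added to struct mod_arch_specific. */
struct toy_mod_orc {
	unsigned int num_orcs;
	const int *orc_unwind_ip;
	const struct toy_orc_entry *orc_unwind;
};

/*
 * Record a module's two parallel ORC arrays. Returns false (and leaves
 * 'mod' untouched) if either size is not a whole number of elements or
 * if the two arrays do not hold the same number of elements.
 */
static bool toy_module_init(struct toy_mod_orc *mod,
			    const void *orc_ip, size_t orc_ip_size,
			    const void *orc, size_t orc_size)
{
	size_t n = orc_ip_size / sizeof(int);

	if (orc_ip_size % sizeof(int) ||
	    orc_size % sizeof(struct toy_orc_entry) ||
	    n != orc_size / sizeof(struct toy_orc_entry))
		return false;

	mod->num_orcs = (unsigned int)n;
	mod->orc_unwind_ip = orc_ip;
	mod->orc_unwind = orc;
	return true;
}
```

Checking that both sections describe the same number of entries matters because the unwinder indexes the two arrays with a single shared index.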
@@ -115,6 +115,7 @@
 #include <asm/microcode.h>
 #include <asm/mmu_context.h>
 #include <asm/kaslr.h>
+#include <asm/unwind.h>
 * max_low_pfn_mapped: highest direct mapped pfn under 4GB
@@ -1310,6 +1311,8 @@ void __init setup_arch(char **cmdline_p)
if (efi_enabled(EFI_BOOT))
#ifdef CONFIG_X86_32
@@ -10,20 +10,22 @@
#define FRAME_HEADER_SIZE (sizeof(long) * 2)
* This disables KASAN checking when reading a value from another task's stack,
* since the other task could be running on another CPU and could have poisoned
* the stack in the meantime.
#define READ_ONCE_TASK_STACK(task, x) \
({ \
unsigned long val; \
if (task == current) \
val = READ_ONCE(x); \
else \
val; \
unsigned long unwind_get_return_address(struct unwind_state *state)
if (unwind_done(state))
return 0;
return __kernel_text_address(state->ip) ? state->ip : 0;
unsigned long *unwind_get_return_address_ptr(struct unwind_state *state)
if (unwind_done(state))
return NULL;
return state->regs ? &state->regs->ip : state->bp + 1;
static void unwind_dump(struct unwind_state *state)
@@ -66,15 +68,6 @@ static void unwind_dump(struct unwind_state *state)
unsigned long unwind_get_return_address(struct unwind_state *state)
if (unwind_done(state))
return 0;
return __kernel_text_address(state->ip) ? state->ip : 0;
static size_t regs_size(struct pt_regs *regs)
/* x86_32 regs from kernel mode are two words shorter: */
@@ -19,6 +19,11 @@ unsigned long unwind_get_return_address(struct unwind_state *state)
unsigned long *unwind_get_return_address_ptr(struct unwind_state *state)
return NULL;
bool unwind_next_frame(struct unwind_state *state)
struct stack_info *info = &state->stack_info;
This diff is collapsed.
@@ -24,6 +24,7 @@
 #include <asm/asm-offsets.h>
 #include <asm/thread_info.h>
 #include <asm/page_types.h>
+#include <asm/orc_lookup.h>
 #include <asm/cache.h>
 #include <asm/boot.h>
@@ -148,6 +149,8 @@ SECTIONS
__vvar_page = .;
@@ -680,6 +680,31 @@
#define BUG_TABLE
. = ALIGN(4); \
.orc_unwind_ip : AT(ADDR(.orc_unwind_ip) - LOAD_OFFSET) { \
VMLINUX_SYMBOL(__start_orc_unwind_ip) = .; \
KEEP(*(.orc_unwind_ip)) \
VMLINUX_SYMBOL(__stop_orc_unwind_ip) = .; \
} \
. = ALIGN(6); \
.orc_unwind : AT(ADDR(.orc_unwind) - LOAD_OFFSET) { \
VMLINUX_SYMBOL(__start_orc_unwind) = .; \
KEEP(*(.orc_unwind)) \
VMLINUX_SYMBOL(__stop_orc_unwind) = .; \
} \
. = ALIGN(4); \
.orc_lookup : AT(ADDR(.orc_lookup) - LOAD_OFFSET) { \
VMLINUX_SYMBOL(orc_lookup) = .; \
. += (((SIZEOF(.text) + LOOKUP_BLOCK_SIZE - 1) / \
LOOKUP_BLOCK_SIZE) + 1) * 4; \
VMLINUX_SYMBOL(orc_lookup_end) = .; \
#define TRACEDATA \
. = ALIGN(4); \
@@ -866,7 +891,7 @@
} \
#define INIT_TEXT_SECTION(inittext_align) \
. = ALIGN(inittext_align); \
@@ -374,6 +374,9 @@ config STACK_VALIDATION
 pointers (if CONFIG_FRAME_POINTER is enabled). This helps ensure
 that runtime stack traces are more reliable.
+This is also a prerequisite for generation of ORC unwind data, which
+is needed for CONFIG_ORC_UNWINDER.
 For more information, see
@@ -258,7 +258,8 @@ ifneq ($(SKIP_STACK_VALIDATION),1)
 __objtool_obj := $(objtree)/tools/objtool/objtool
-objtool_args = check
+objtool_args = $(if $(CONFIG_ORC_UNWINDER),orc generate,check)
 objtool_args += --no-fp
@@ -279,6 +280,11 @@ objtool_obj = $(if $(patsubst y%,, \
# Rebuild all objects when objtool changes, or is enabled/disabled.
objtool_dep = $(objtool_obj) \
$(wildcard include/config/orc/unwinder.h \
define rule_cc_o_c
$(call echo-cmd,checksrc) $(cmd_checksrc) \
$(call cmd_and_fixdep,cc_o_c) \
@@ -301,13 +307,13 @@ cmd_undef_syms = echo
 # Built-in and composite module parts
-$(obj)/%.o: $(src)/%.c $(recordmcount_source) $(objtool_obj) FORCE
+$(obj)/%.o: $(src)/%.c $(recordmcount_source) $(objtool_dep) FORCE
 $(call cmd,force_checksrc)
 $(call if_changed_rule,cc_o_c)
 # Single-part modules are special since we need to mark them in $(MODVERDIR)
-$(single-used-m): $(obj)/%.o: $(src)/%.c $(recordmcount_source) $(objtool_obj) FORCE
+$(single-used-m): $(obj)/%.o: $(src)/%.c $(recordmcount_source) $(objtool_dep) FORCE
 $(call cmd,force_checksrc)
 $(call if_changed_rule,cc_o_c)
 @{ echo $(@:.o=.ko); echo $@; \
@@ -402,7 +408,7 @@ cmd_modversions_S = \
-$(obj)/%.o: $(src)/%.S $(objtool_obj) FORCE
+$(obj)/%.o: $(src)/%.S $(objtool_dep) FORCE
 $(call if_changed_rule,as_o_S)
 targets += $(real-objs-y) $(real-objs-m) $(lib-y)