2. Speaker info
Edgar Barbosa
Security researcher
Currently employed at COSEINC
Experience with reverse engineering of Windows kernel
and x86/x64 cpu architecture
Published some articles at rootkit.com
Participated in the creation of BluePill, a virtualization
hardware based rootkit
5. Hardware virtualization rootkits
Intel and AMD developed virtualization extensions to the
x86 architecture - VT-x and SVM.
There are 2 famous hardware virtualization based rootkits:
Vitriol, created by Dino Dai Zovi – uses Intel VT-x
Bluepill, designed by Joanna Rutkowska – uses AMD SVM
Source code not public
We will focus on the Bluepill rootkit in this presentation, but
the concepts and methods are very similar on the Intel
platform.
6. Bluepill
Designed by Joanna Rutkowska
Intellectual property of COSEINC
Uses AMD Secure Virtual Machine (SVM) extensions
Runs in 64-bit mode
Supports multicore systems
7. AMD SVM
SVM stands for “Secure Virtual Machine”
It’s a CPU extension to support Virtual Machine Monitors
(VMMs), a.k.a. hypervisors.
8 new instructions:
VMRUN
VMSAVE
VMLOAD
VMMCALL
CLGI
STGI
SKINIT
INVLPGA
8. Initialization of a SVM rootkit
Before any SVM instruction can be used, the EFER.SVME bit
must be set to 1.
Trying to execute an SVM instruction with SVME equal to 0
results in a #UD (invalid opcode) exception.
Allocate and initialize the VMCB structure.
The VMCB (Virtual Machine Control Block) address must be 4KB-aligned.
VMCB describes a virtual machine to be executed.
It contains:
Instruction or events in the guest to be intercepted
Control bits
Guest processor state (general registers, RIP, CR registers, …)
9. Initialization of a SVM rootkit
After VMCB initialization, set the VM_HSAVE_PA MSR.
This is the physical address where the VMRUN instruction
saves host processor state information.
Then execute the VMRUN instruction with the RAX register
holding the physical address of the VMCB.
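The initialization steps on these two slides come down to one bit in EFER and one alignment rule. A minimal user-space sketch in C, assuming the architectural values EFER = 0xC0000080, EFER.SVME = bit 12 and VM_HSAVE_PA = 0xC0010117 from the AMD manual. The helper names are hypothetical, and the privileged RDMSR/WRMSR/VMRUN steps appear only as comments because they run at CPL 0:

```c
#include <assert.h>
#include <stdint.h>

#define MSR_EFER        0xC0000080u  /* Extended Feature Enable Register */
#define MSR_VM_HSAVE_PA 0xC0010117u  /* physical address of the host save area */
#define EFER_SVME       (1ull << 12) /* Secure Virtual Machine Enable bit */
#define VMCB_ALIGN      4096u        /* the VMCB must be 4KB-aligned */

/* EFER value to write back so that SVM instructions stop raising #UD. */
static uint64_t efer_enable_svm(uint64_t efer)
{
    return efer | EFER_SVME;
}

/* VMRUN rejects a VMCB physical address that is not 4KB-aligned. */
static int vmcb_address_ok(uint64_t vmcb_pa)
{
    return (vmcb_pa & (VMCB_ALIGN - 1)) == 0;
}

/* Value to load into RAX before VMRUN (hypothetical helper). */
static uint64_t vmrun_rax(uint64_t vmcb_pa)
{
    assert(vmcb_address_ok(vmcb_pa));
    return vmcb_pa;
}

/*
 * Order of operations in ring 0 (not executable from user space):
 *   1. wrmsr(MSR_EFER, efer_enable_svm(rdmsr(MSR_EFER)));
 *   2. allocate a 4KB-aligned VMCB, fill in intercepts and guest state;
 *   3. wrmsr(MSR_VM_HSAVE_PA, host_save_area_pa);
 *   4. load RAX with vmrun_rax(vmcb_pa) and execute VMRUN.
 */
```

The sketch only captures the checks; the VMCB contents themselves depend on what the hypervisor wants to intercept.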
11. VMRUN instruction
Available only at CPL-0
The CPU enters a new processor mode: guest mode
In guest mode the behavior of some instructions changes
to facilitate virtualization
Performs consistency checks on the host and guest state
Saves the host processor state
Loads the guest processor state configured in the VMCB
The CPU now runs in guest mode until an intercept occurs
12. #VMEXIT
When an intercept triggers, the processor performs a #VMEXIT
On #VMEXIT the processor:
Disables interrupts
Clears all intercepts
Sets the host CPL to 0
Disables all breakpoints
Checks the reloaded host state for consistency
The reason for the #VMEXIT is saved in the EXITCODE field
of the VMCB structure (EXITINFO1 and EXITINFO2 carry extra detail)
Execution then continues in the Bluepill intercept handler routine
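What an intercept handler does with the exit reason can be sketched as a small dispatch routine. This is an illustration only, not Bluepill's code: the exit code values (0x40 + vector for exceptions, 0x72 for CPUID, 0x7C for MSR access, 0x80 for VMRUN) are taken from the AMD SVM documentation, the reason is assumed to be read from the VMCB EXITCODE field as that documentation names it, and the struct, enum and action names are hypothetical:

```c
#include <assert.h>
#include <stdint.h>

/* Exit codes as listed in the AMD SVM documentation (assumed here). */
enum {
    VMEXIT_EXCP_BASE = 0x40,  /* 0x40 + exception vector */
    VMEXIT_CPUID     = 0x72,
    VMEXIT_MSR       = 0x7C,  /* RDMSR or WRMSR */
    VMEXIT_VMRUN     = 0x80
};

/* Hypothetical actions a Bluepill-like handler might choose. */
typedef enum {
    ACT_EMULATE_MSR,
    ACT_EMULATE_CPUID,
    ACT_INSPECT_EXCEPTION,
    ACT_INJECT_UD,
    ACT_RESUME
} action_t;

/* Only the field this sketch needs; not the real VMCB layout. */
struct vmcb_view {
    uint64_t exitcode;
};

static action_t handle_vmexit(const struct vmcb_view *vmcb)
{
    assert(vmcb != 0);
    if (vmcb->exitcode >= VMEXIT_EXCP_BASE &&
        vmcb->exitcode < VMEXIT_EXCP_BASE + 32)
        return ACT_INSPECT_EXCEPTION;            /* e.g. an intercepted #GP */
    switch (vmcb->exitcode) {
    case VMEXIT_MSR:   return ACT_EMULATE_MSR;   /* e.g. hide EFER.SVME */
    case VMEXIT_CPUID: return ACT_EMULATE_CPUID;
    case VMEXIT_VMRUN: return ACT_INJECT_UD;     /* pretend SVM is disabled */
    default:           return ACT_RESUME;
    }
}
```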
15. “Undetectable” rootkits
Popek and Goldberg VMM properties:
Efficiency
Resource control
Equivalence
Equivalence “implies that any program executing on a virtual machine must
behave in a manner identical to the way it would have behaved when
running directly on the native hardware” [1]
SVM/VT-x rootkits are only theoretically ‘undetectable’:
the equivalence principle is not fully respected in the hardware
virtualization extensions
There are computer resources over which the hypervisor does not have full control:
TLB (partially)
Branch prediction
SMP processing
16. Timing attacks
The most obvious attack against hardware virtualization
rootkits is a timing attack.
We measure the execution time of an instruction that is probably
intercepted and compare the value against a
trusted baseline.
But the AMD and Intel hardware virtualization extensions can
intercept any internal source of timing:
RDTSC
RDMSR
I/O ports
Hardware virtualization even supports a TSC offset value that is
applied to every guest read of the TSC, so the cycles spent in the
hypervisor can be subtracted out.
This is why local timing attacks fail.
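The effect of the TSC offset is plain arithmetic. A minimal sketch, assuming the AMD behavior in which the signed offset from the VMCB is added to the real counter on every guest read; the function names and cycle counts are hypothetical:

```c
#include <assert.h>
#include <stdint.h>

/* Guest-visible TSC: the hardware adds the signed VMCB offset to the real counter. */
static uint64_t guest_tsc(uint64_t host_tsc, int64_t tsc_offset)
{
    return host_tsc + (uint64_t)tsc_offset;
}

/* After an intercept that cost `overhead` cycles, the hypervisor moves the offset back. */
static int64_t hide_overhead(int64_t tsc_offset, uint64_t overhead)
{
    assert(overhead <= (uint64_t)INT64_MAX);
    return tsc_offset - (int64_t)overhead;
}
```

If the guest reads the TSC, triggers an intercept costing 0x1000 cycles, and reads the TSC again 50 cycles of its own work later, the offset adjustment leaves it seeing only the 50 cycles.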
18. TLB
A Translation Lookaside Buffer (TLB) is a CPU cache that is
used to improve the speed of virtual address translation.
Detailed TLB information can be obtained with the CPUID
instruction, which returns the number of entries of
each TLB and the type and associativity of the cache.
Each TLB line stores information such as:
Tag, used to compare with the virtual address
Physical address, the result of the VA translation
Page attributes
If the translation is not stored in the cache (a TLB miss), the
system must perform a ‘table walk’, which costs many
clock cycles.
19. TLB
The TLB has a limited number of entries.
The contents of each line are not accessible by software.
However, we can fill the TLB by accessing several pages.
The idea is to fill all the TLB entries and measure the time
to access these cached pages. Then we execute a
privileged instruction that a hypervisor must
intercept. If there is a hypervisor running on the system,
its handler will evict some TLB entries. After executing the
privileged instruction we measure the time to access the
previously cached pages again. If the accesses take longer,
there is a hypervisor running.
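The procedure can be illustrated with a toy model that counts TLB misses instead of measuring time. This is a sketch under stated assumptions, not a detector: it assumes a fully associative TLB with LRU replacement and 40 entries (the Opteron figure quoted on a later slide), and all names are hypothetical. Real hardware uses other replacement policies, and a real probe would time the accesses in ring 0.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TLB_ENTRIES 40

struct tlb {
    uint64_t page[TLB_ENTRIES];
    uint64_t last_use[TLB_ENTRIES];  /* 0 means the entry is empty */
    uint64_t clock;
};

static void tlb_reset(struct tlb *t) { memset(t, 0, sizeof *t); }

/* Look up one page; returns 1 on a miss (table walk), 0 on a hit. */
static int tlb_access(struct tlb *t, uint64_t page)
{
    int victim = 0;
    t->clock++;
    for (int i = 0; i < TLB_ENTRIES; i++) {
        if (t->last_use[i] != 0 && t->page[i] == page) {
            t->last_use[i] = t->clock;
            return 0;
        }
        if (t->last_use[i] < t->last_use[victim])
            victim = i;              /* empty or least recently used */
    }
    t->page[victim] = page;
    t->last_use[victim] = t->clock;
    return 1;
}

/* Touch `count` consecutive pages starting at `first`; returns the number of misses. */
static int touch_pages(struct tlb *t, uint64_t first, int count)
{
    int misses = 0;
    assert(count >= 0);
    for (int i = 0; i < count; i++)
        misses += tlb_access(t, first + (uint64_t)i);
    return misses;
}

/* Fill the TLB, let a "hypervisor" touch some pages, then re-touch our pages. */
static int probe_misses(int hypervisor_pages)
{
    struct tlb t;
    tlb_reset(&t);
    touch_pages(&t, 0, TLB_ENTRIES);          /* fill every entry */
    touch_pages(&t, 1000, hypervisor_pages);  /* footprint of the intercept handler */
    return touch_pages(&t, 0, TLB_ENTRIES);   /* misses here mean slower accesses */
}
```

With no hypervisor footprint the second pass hits on every page; any footprint at all produces misses, which is the timing difference the detector looks for.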
20. TLB
The idea of using the TLB to detect a hypervisor was first published
by Peter Ferrie [2]. However, in the second version of his paper
[3], Ferrie states that the TLB method does not work on AMD-based
hypervisors because they can direct the hardware not to
flush the TLB when a hypervisor event occurs.
Ferrie suggests using the CPUID instruction in the TLB
method, but Bluepill does not need to intercept CPUID.
Another instruction can be used instead:
rdmsr EFER, which Bluepill must intercept.
It is still possible to use the TLB method to detect Bluepill even
if the hypervisor controls TLB flushing! How?
21. TLB
TLB entries are tagged with ASID (Address Space Identifier) bits to
distinguish host and guest address spaces.
ASID #0 is assigned to the VMM and #1..#63 to guests.
TLB_CONTROL field:
The VMM can control TLB flush operations by setting the
TLB_CONTROL field in the VMCB. If set to 1, the VMRUN
instruction will flush the entire TLB (all ASIDs).
Even with an ASID-tagged TLB, we can evict all lines in the TLB. The
number of TLB entries is limited, so lines are evicted when necessary.
The Opteron primary TLB has only 40 entries [4].
The AMD optimization manual recommends against using
TLB_CONTROL = 1 to flush the guest TLB. Instead, it is best to
assign a new ASID to the guest!
22. Branch prediction
Studies have shown that the behavior of branch instructions is
highly predictable [5].
The execution history of a branch instruction can be used to
predict its future behavior.
If a branch is predicted to be taken and the prediction turns out
to be incorrect, there is a large performance penalty because the
whole pipeline must be flushed.
There are many branch prediction schemes. Explaining them
is outside the scope of this presentation.
There are some very good references on this subject [5].
The branch prediction unit uses a small cache to store the history of
branch instruction execution.
23. Branch prediction
There is another buffer that stores the target address of the branch,
the BTB (Branch Target Buffer).
How can we use the branch prediction unit (BPU) to detect
hypervisor code?
Using the prediction rules of static and dynamic predictors, we
can fill the entries of the branch history tables and measure the
time to execute our code. Next the detector executes a privileged
instruction that will be intercepted if there is a hypervisor running.
The hypervisor code will disturb the branch history tables. We
then execute the ‘branch test code’ again, without the privileged
instruction, and measure the time. If the execution of the
privileged instruction was intercepted, the measured times will
differ.
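The same idea can be shown with a toy predictor that counts mispredictions instead of measuring time. This sketch assumes a table of 2-bit saturating counters indexed by the low bits of the branch address, which is one textbook scheme and not necessarily what any AMD part implements. The table size, addresses and function names are hypothetical:

```c
#include <assert.h>
#include <stdint.h>

#define BHT_ENTRIES 16   /* hypothetical table size, must be a power of two */

static uint8_t bht[BHT_ENTRIES];   /* 2-bit counters, 0..3; a value >= 2 predicts taken */

static unsigned bht_index(uint64_t branch_addr)
{
    return (unsigned)(branch_addr & (BHT_ENTRIES - 1));
}

/* Execute one branch; returns 1 if it was mispredicted, then updates the counter. */
static int run_branch(uint64_t addr, int taken)
{
    uint8_t *c = &bht[bht_index(addr)];
    int predicted_taken = (*c >= 2);
    assert(taken == 0 || taken == 1);
    if (taken && *c < 3) (*c)++;
    if (!taken && *c > 0) (*c)--;
    return predicted_taken != taken;
}

/* The detector's 'branch test code': one always-taken branch per table entry. */
static int run_test_code(void)
{
    int mispredicted = 0;
    for (uint64_t a = 0; a < BHT_ENTRIES; a++)
        mispredicted += run_branch(0x400000 + a, 1);
    return mispredicted;
}

/* Stand-in for intercept handler code: not-taken branches at addresses we do not know. */
static void run_hypervisor_code(int branches)
{
    for (int i = 0; i < branches; i++)
        run_branch(0x7f0000 + (uint64_t)(i % BHT_ENTRIES), 0);
}
```

After a few training passes the test code runs without mispredictions. Once the stand-in hypervisor code has pushed the counters back down, the next pass mispredicts, which on real hardware shows up as extra cycles.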
24. Branch prediction
The Branch Prediction Unit was successfully used to obtain a
512-bit encryption key with a Branch Prediction Analysis
(BPA) attack [6]. This attack is based on some interesting
features of the BPU:
The execution history cache is indexed using just a few low-order
bits of the branch instruction address. Two different
addresses can share the same history entry. This is called Branch Aliasing
or Branch Interference.
The cache is shared between all threads.
The spy thread ran simultaneously with the decryption
thread. Since the two threads were using the same branch
prediction cache (branch aliasing), the spy thread could
determine which branches the decryption thread had taken.
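Branch aliasing follows directly from the indexing rule. A minimal sketch, assuming a 10-bit index taken from the lowest address bits; the real width and bit positions are microarchitecture-specific, and the helper names are hypothetical:

```c
#include <assert.h>
#include <stdint.h>

#define INDEX_BITS 10   /* hypothetical index width */

/* History-table index taken from the low-order bits of the branch address. */
static unsigned history_index(uint64_t branch_addr)
{
    return (unsigned)(branch_addr & ((1u << INDEX_BITS) - 1));
}

/* Two branches alias when they select the same history entry. */
static int branches_alias(uint64_t a, uint64_t b)
{
    return history_index(a) == history_index(b);
}

/* Address inside the spy's region (starting at spy_base) that aliases with `target`. */
static uint64_t aliasing_address(uint64_t spy_base, uint64_t target)
{
    uint64_t mask = (1u << INDEX_BITS) - 1;
    assert((spy_base & mask) == 0);   /* region must start on an index boundary */
    return spy_base | (target & mask);
}
```

The spy needs the target's address to build an aliasing branch, which is exactly the piece of information a detector lacks about rootkit code.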
26. Branch prediction
It is not possible to use the Branch Aliasing effect to detect
virtualization rootkits because we do not know the
virtual address where the rootkit code is executing.
To use our detection method we must know internal details of
the implementation of the branch prediction unit. However, this
information is not easy to find in the CPU manuals.
We can use some very clever benchmarks developed by Milena
Milenkovic, Aleksandar Milenkovic and Jeffrey Kulick [7].
These benchmarks reveal very specific details of the BPU,
such as the number of entries and the associativity of the BTB, the bits
of the branch address used as the set index, and
the local branch history length.
27. Counter based detection
Bluepill can intercept attempts to read the TSC MSR or any
other clock.
However, we can create our own clock by using a simple
counter algorithm.
Using a multicore system, we can create two threads running
concurrently.
Thread #2 can measure how long Thread #1 takes to execute
some routine or instruction.
A synchronization protocol between the threads is needed
to signal when to start and when to stop incrementing the
counter.
SMP uses shared memory, which can be used for the
synchronization protocol.
28. Counter based detection
How does this detection method really work?
Basic idea:
Thread #1 sets the ‘start counter’ synchronization signal
and waits for the reply of thread #2.
Thread #1 executes an instruction that Bluepill must
intercept ( rdmsr EFER ), while thread #2 starts to increment
the counter at the same time.
After the intercepted instruction, thread #1 sets the ‘stop
counter’ synchronization signal and thread #2 stops
counting.
Now we check the value of the counter.
30. Counter based detection
Using a simple counter loop like:
while( stopCounter == 0 )
    counter++;
The results of a ‘rdmsr EFER’ test are:
Counter = 016h for a clean system.
Counter = 02000h for a system running an HVBR
(hardware virtualization based rootkit).
The counter code can be optimized using the rules in the AMD
optimization manual.
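The start/reply/stop protocol and the counter loop can be put together in a user-space sketch. It is an approximation under stated assumptions: it uses POSIX threads and C11 atomics, does not pin the threads to separate cores, and takes the probed instruction as a callback because `rdmsr EFER` only executes in ring 0. All names are hypothetical:

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>

static atomic_int start_counter;   /* thread #1: "begin counting" */
static atomic_int counting;        /* thread #2: reply, "I am counting" */
static atomic_int stop_counter;    /* thread #1: "stop counting" */
static volatile uint64_t counter;

/* Thread #2: the software clock. */
static void *counter_thread(void *arg)
{
    (void)arg;
    while (!atomic_load(&start_counter))
        ;                                  /* wait for the start signal */
    atomic_store(&counting, 1);            /* reply to thread #1 */
    while (!atomic_load(&stop_counter))
        counter++;
    return 0;
}

/* Placeholder probe; a driver would wrap the intercepted instruction here. */
static void empty_probe(void) { }

/* Thread #1: run `probe` while thread #2 counts, and return the final count. */
static uint64_t measure(void (*probe)(void))
{
    pthread_t t2;
    int rc;
    atomic_store(&start_counter, 0);
    atomic_store(&counting, 0);
    atomic_store(&stop_counter, 0);
    counter = 0;
    rc = pthread_create(&t2, 0, counter_thread, 0);
    assert(rc == 0);
    (void)rc;
    atomic_store(&start_counter, 1);
    while (!atomic_load(&counting))
        ;                                  /* wait for the reply of thread #2 */
    probe();                               /* the instruction under test */
    atomic_store(&stop_counter, 1);
    pthread_join(t2, 0);
    return counter;
}
```

Calling `measure(empty_probe)` gives a baseline for the protocol overhead itself; the absolute values depend on the machine and will not match the figures on the slide.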
31. Counter based detection
To avoid detection, Bluepill must stop the thread #2 counter as
soon as it intercepts any event.
However, the #VMEXIT control transfer alone takes
around 1000h clock cycles!
The Bluepill hypervisor on CPU#2 is in ‘sleep mode’ while the
counter runs, and even if CPU#1 sends an IPI (inter-processor
interrupt) to CPU#2, that takes even more time.
The CPU#1 hypervisor doesn’t have access to the CPU#2
register context.
It is too late to change any thread scheduling quantum.
32. Counter based detection
What if our counter code is interrupted by some external
interrupt, like the clock, at the start of the counting process?
It is good to avoid interrupts in our counter code, but not strictly
necessary.
We can’t guarantee that the counter code will not be
interrupted.
The methods for disabling interrupts are interceptable by the rootkit:
Temporarily disabling the APIC (interceptable)
CLI instruction (interceptable)
PUSHF and POPF instructions (interceptable)
Solution:
We can run the detection code several times. All we need is one
abnormal counter value.
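Repeating the measurement and looking for an abnormal value can be sketched as a simple scan over the samples. The threshold and the `min_hits` parameter are assumptions added for illustration; with `min_hits` set to 1 the rule matches the slide, where a single abnormal value is enough:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Number of runs whose counter value is above the threshold. */
static size_t count_abnormal(const uint64_t *samples, size_t n, uint64_t threshold)
{
    size_t hits = 0;
    assert(samples != 0);
    for (size_t i = 0; i < n; i++)
        if (samples[i] > threshold)
            hits++;
    return hits;
}

/* Suspect a hypervisor once at least `min_hits` runs look abnormal. */
static int hvbr_suspected(const uint64_t *samples, size_t n,
                          uint64_t threshold, size_t min_hits)
{
    return count_abnormal(samples, n, threshold) >= min_hits;
}
```

A run in which the counter thread itself was interrupted produces a low value and is simply outvoted by the other runs.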
33. Counter based detection
Is there a way for the rootkit to detect this detection
method?
It is very difficult. We can implement several different
synchronization routines and algorithms to make sure that
the threads are running concurrently.
There is no time for the rootkit to unload itself to avoid
detection after the intercept.
34. BP in hibernation-mode
One interesting idea that has been discussed is that Bluepill
could unload itself while some attack is being executed
and reload itself after the attack finishes [8].
That’s a weird idea, because if we know that the rootkit is
unloaded, we can load our own detector hypervisor and wait
for any code trying to get access to SVM resources! Remember that
Bluepill is supposed to be undetectable even if the source is
published.
However, the unload idea can be cleverly used against the next
detection idea. It is worth presenting this attack to see
how virtualization rootkits can use the ‘unload’ trick.
35. #GP detection
EFER (Extended Feature Enable Register) is a model
specific register (MSR).
It can be accessed with the RDMSR and WRMSR instructions.
The EFER MSR index is 0xC0000080.
Before using the AMD SVM extensions, it is necessary to
set the EFER.SVME bit to 1.
Bluepill intercepts all attempts to read or write
EFER.
Is there a way to learn the value of the SVME bit without
being intercepted?
36. VMSAVE instruction
The VMSAVE instruction stores a subset of the processor state into
the VMCB specified by the physical address in the RAX register.
This is a Secure Virtual Machine Instruction.
This instruction generates a #UD exception if SVM is not enabled.
Pseudo code:
37. VMSAVE and EFER
What happens if we execute the VMSAVE instruction with RAX
containing an invalid physical address?
If EFER.SVME = 0 the system generates a #UD
exception!
If EFER.SVME = 1 the system generates a #GP
exception!
The VMSAVE instruction microcode is able to read the real
value of the EFER.SVME bit without being intercepted!
We can use the VMSAVE instruction to detect an HVBR.
VMSAVE is not the only SVM instruction that can be
used for detection. Take a look at the AMD manuals.
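The check order described on this slide can be modeled in a few lines. This is a model of the behavior as the slide states it, not the microcode: the vector numbers 6 (#UD) and 13 (#GP) are the standard x86 values, and the function names are hypothetical:

```c
#include <assert.h>

enum { VEC_NONE = -1, VEC_UD = 6, VEC_GP = 13 };  /* x86 exception vectors */

/* VMSAVE fault model: SVME is tested before the address in RAX. */
static int vmsave_fault(int efer_svme, int rax_is_valid_pa)
{
    if (!efer_svme)
        return VEC_UD;        /* SVM disabled: invalid opcode */
    if (!rax_is_valid_pa)
        return VEC_GP;        /* SVM enabled, bad VMCB address */
    return VEC_NONE;          /* state is stored into the VMCB */
}

/* Detector side: infer the real SVME bit from the exception our handler caught. */
static int svme_from_fault(int vector)
{
    assert(vector == VEC_UD || vector == VEC_GP);
    return vector == VEC_GP;
}
```

The detector always passes an invalid address, so it never reaches the store; the vector it catches is the only output it needs.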
38. Counter-attack - I
If the rootkit sets the VMCB to intercept the VMSAVE
instruction, it will not detect the attack because the system
generates an exception before VMSAVE executes.
The rootkit can set the VMCB to intercept #GP
exceptions!
After a #GP exception intercept, the rootkit must verify whether
the guest RIP is pointing to a VMSAVE instruction!
If a VMSAVE instruction is being executed and the guest
EFER.SVME = 0, it can now inject a #UD exception into the
guest.
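The rootkit's decision after a #GP intercept can be sketched as a byte comparison at the guest RIP. The sketch assumes the VMSAVE encoding 0F 01 DB from the AMD manual and ignores instruction prefixes; `guest_svme` stands for the SVME value the rootkit presents to the guest, and the function names are hypothetical:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { VEC_UD = 6, VEC_GP = 13 };

/* VMSAVE is encoded as 0F 01 DB. */
static const uint8_t VMSAVE_OPCODE[3] = { 0x0F, 0x01, 0xDB };

static int rip_points_to_vmsave(const uint8_t *code_at_rip, size_t len)
{
    assert(code_at_rip != 0);
    return len >= sizeof VMSAVE_OPCODE &&
           memcmp(code_at_rip, VMSAVE_OPCODE, sizeof VMSAVE_OPCODE) == 0;
}

/* Exception the rootkit injects after intercepting a guest #GP. */
static int vector_to_inject(const uint8_t *code_at_rip, size_t len, int guest_svme)
{
    if (rip_points_to_vmsave(code_at_rip, len) && !guest_svme)
        return VEC_UD;   /* make VMSAVE look as if SVM were disabled */
    return VEC_GP;       /* unrelated fault: reflect it unchanged */
}
```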
39. #GP attack - II
The detection code must not allow the rootkit to find the
VMSAVE instruction at the guest RIP.
We can use the Translation Lookaside Buffer to hide our
detection code.
Immediately before executing the VMSAVE instruction, the
detector changes the PTE of the detection code to make it
point to a fake page containing different code that
generates a #GP exception.
When the rootkit intercepts the #GP, it will not be able to look
at the real code at RIP because the PTE no longer maps the real
detection code page. The real address of the detection code is
held in the I-TLB, and there is no way to read the contents of
the TLB.
40. Counter-attack II
The rootkit is not able to find the real detector page because it
is cached in the I-TLB. But it can detect whether the page the PTE
maps at RIP is fake.
How?
The rootkit writes a 0xCC (INT3) opcode at the RIP address
and restarts guest execution at the same RIP.
If the system generates a #BP exception, the page is not fake.
If the system generates a #GP again, the page is fake.
Even if the rootkit detects such an attack, it cannot know which
exception must be injected into the guest, because the
hidden code can be any instruction able to generate a #GP
exception. If it injects a #UD exception it will be easily
detected!
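The 0xCC probe works because data writes and instruction fetches can resolve to different pages while the I-TLB entry is stale. A toy model of that split, with hypothetical names; the fetched code is reduced to "INT3 raises #BP, anything else raises #GP", standing in for whatever faulting instruction the detector hid:

```c
#include <assert.h>
#include <stdint.h>

enum { VEC_BP = 3, VEC_GP = 13 };

/* One virtual page seen two ways. Both pointers are equal when nothing is hidden. */
struct mapping {
    uint8_t *data_view;         /* page reached through the current PTE */
    const uint8_t *fetch_view;  /* page reached through the cached I-TLB entry */
};

/* Instruction fetch at RIP uses the I-TLB translation. */
static int execute_at_rip(const struct mapping *m)
{
    return m->fetch_view[0] == 0xCC ? VEC_BP : VEC_GP;
}

/* Rootkit probe: patch 0xCC through the PTE mapping, re-run, restore the byte. */
static int page_is_fake(struct mapping *m)
{
    uint8_t saved;
    int vector;
    assert(m != 0 && m->data_view != 0 && m->fetch_view != 0);
    saved = m->data_view[0];
    m->data_view[0] = 0xCC;
    vector = execute_at_rip(m);
    m->data_view[0] = saved;
    return vector == VEC_GP;    /* #GP again: the patch never reached the fetched code */
}
```

The probe tells the rootkit that something is hidden, but not which instruction, which is why it still cannot pick the right exception to inject.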
41. Counter-attack II
What can the rootkit do now?
It knows that an exception must be generated.
It hooks the guest exception handlers.
Next, it unloads the hypervisor and executes the intercepted
instruction again.
In this case, the instruction generates the correct exception,
which is caught by the hooked exception handlers.
Now the exception handler just needs to load the hypervisor
again!
Because of the #GP attack, every virtualization rootkit must
configure the VMCB to intercept #GP exceptions.
42. CPU bugs
Is it possible to use CPU bugs to detect an HVBR?
Yes, but it is not a reliable way to detect rootkits.
I found that executing the Address-Size Prefix (0x67)
together with the VMSAVE instruction is apparently
able to freeze systems running hypervisors!
A detector which freezes the system is not very useful
outside of lab environments.
43. Credits
All the cool crypto research papers using cpu
microarchitecture based attacks.
Alexander Tereshkin, for the creation of the counter-
attacks against the #GP exception method to detect
Bluepill.
44. References
[1] J. Smith and R. Nair. Virtual Machines: Versatile Platforms for Systems and Processes. Morgan Kaufmann, 2005.
[2] http://pferrie.tripod.com/papers/attacks.pdf
[3] http://pferrie.tripod.com/papers/attacks2.pdf
[4] http://www.chip-architect.com/news/2003_09_21_Detailed_Architecture_of_AMDs_64bit_Core.html
[5] J. Shen and M. Lipasti. Modern Processor Design: Fundamentals of Superscalar Processors. McGraw-Hill, 2005.
[6] O. Acıiçmez, Ç. K. Koç and J.-P. Seifert. On the Power of Simple Branch Prediction Analysis. http://eprint.iacr.org/2006/351.pdf
[7] M. Milenkovic, A. Milenkovic and J. Kulick. Demystifying Intel Branch Predictors. http://www.ece.wisc.edu/~wddd/2002/final/milenkovic.pdf
[8] http://blogs.zdnet.com/Ou/?p=297