A New Tracer for Reverse Engineering
Niizh (Section 1b) : Background and Implementation (Work in Progress)
Tsukasa Ooi (@a...
I...
• will introduce a way to make reverse engineering
more efficient... possibly.
• Possibly?
– (Nov 2010) Generic OS...
Related Topics
• Reverse Engineering
– especially dynamic analysis, debuggers and tracers.
• Intel x86 (32-bit) architectu...
Agenda
• Drawbacks of instruction tracers
• New tracing method based on Record and Replay
• Tracing-VMM implementation on ...
Target Platform
• Intel x86 (16/32-bit) architecture
• PC/AT
• General purpose OSes (Windows, Linux etc...)
Background
Dynamic Analysis
• Analyze running programs
– e.g. By intercepting operations of the program
• Various tools
– Debuggers
•...
Tracers (1)
• Capture and save the information
associated with specific event.
– Various granularity
• instruction, basic ...
Tracers (2)
• But, in early research, I found most of these
instruction tracers have some drawbacks:
– Extremely Slow
• Th...
Tracers (3)
• Can we solve these issues?
• Major Requirements:
– Overhead : <100%
– Size of trace : <5MB/s
• Theme is:
How...
Theory: Record and Replay
Record and Replay (0)
• I thought I had discovered this independently, but:
– I had not found any related documents before.
• ...
Record and Replay (1)
• The method has a variety of names:
– VMware calls this Record and Replay
– Logging and Replay ...
• Many architectures can be represented as this model:
– Input (can be null)
– Calculation / Process (+Internal Context)
–...
• Saving all information is equivalent to
saving all of internal context (zn).
– The output is not required because we ass...
• Focusing on dependency
– Input : there are no dependency.
– Calculation / Process (+Context) : depend on input
• Now you...
• Pass 1 : Record
– Capture and save initial context
– Run the virtual machine
• Accepts input from external hardware.
– C...
• Pass 2 : Replay
– Recover initial context from trace log.
– Run the virtual machine.
• But read trace log to supply inpu...
Cons. (1)
• It seems to be just running twice but:
– You have saved trace log so you can run
Replay pass anytime, anywhere...
Cons. (2)
• (Cont.)
– Two passes are independent.
• Even if you run slow analysis, the Record pass
remains running as befo...
Real World Example (1)
• VMware Workstation (6 or later)
– Record/Replay feature
• Record execution and you can replay jus...
Real World Example (2)
• VMware Workstation (6 or later)
– But...
• It's still VMware.
• There is no enough debug inter...
• Deterministic elements can also be treated as
a type of input, but that is inefficient.
– Do you want to record many element...
• Nondeterministic Inputs
– The timing at which the internal context may become
undetermined can be determined uniquely
(like in inst...
• Interrupts
– The timing is not uniquely predictable.
– And actual content can be nondeterministic.
– In this case, trace...
• Modeling ― VM-Internal Disk
– Assume the VM-internal disk is reliable and
record initial disk image.
– Almost all elemen...
• Modeling ― Mouse, Keyboard, Network
– They are unpredictable/external input.
– The input from the device uses both of
x8...
• Modeling ― Time Stamp Counter (CPU)
– The clock count since computer reset
that can be read the value with RDTSC instruc...
• Modeling ― CPU exception
– Almost all exceptions are deterministic
including their timing.
• Page Fault occurs because t...
• Modeling ― Inexact Arithmetic Operation
– Transcendental instruction such as FSINCOS, FATAN
does not define the actual v...
Applying to x86 (4)
• Considering X nondeterministic?
– Increase number of hooks.
– Trace log get bigger, execution get sl...
What do you think?
• Is this instruction deterministic?
XOR edx, edx
– As you know, this instruction just
clears edx regist...
The curse of EFLAGS? (1)
• Let's look inside.
– edx IS zero. On the other hand,
EFLAGS.AF is updated to "?".
– Intel's manu...
The curse of EFLAGS? (2)
• This is not the end!
– The same applies to these frequently used instructions.
– According to the profiling...
The curse of EFLAGS? (3)
• They are not much, much fewer at all!
– Even 10% of instructions, the overhead of hooking
cannot be ignor...
The implementation problem (1)
• Public Record and Replay implementations
do not care about this condition!
– They just l...
The implementation problem (2)
• What is RIGHT?
– We cannot exactly know which CPU model is right.
– I want to integrate i...
EFLAGS : Lazy Evaluation (1)
• EFLAGS and programs have these characteristics:
– Over 80% of updated flags are just discar...
EFLAGS : Lazy Evaluation (2)
• Current Implementation:
– JIT compiling with static evaluation
(to make programs run faster...
EFLAGS : Lazy Evaluation (3)
• (cont.)
– If the instruction in the block depends on these
flags and virtual flags satisfy ...
Record and Replay : Conclusion
• Using Record and Replay , we can decrease
the amount of trace log and trace overhead.
• U...
Implementation
• I implemented a VMM-based tracer.
– To run general purpose OSes.
• But it was not a good idea. Because of its...
x86 on x64 (1)
• x64 is a 64-bit extension to x86 architecture.
– AMD, Intel and VIA have x64 extension.
– Very similar in...
x86 on x64 (2.1)
• Benefit : 32-bit registers and clamp
– General purpose register format is based on
its original (that s...
x86 on x64 (2.1)
• Benefit : 32-bit registers and clamp
– General purpose register format is based on
its original (that s...
x86 on x64 (2.2)
• Benefit : Increased Registers (GPR/XMM)
– 8→16 (16 additional register including XMM regs.)
– Save emul...
x86 on x64 (2.2)
• Benefit : Increased Registers (GPR/XMM)
– 8→16 (16 additional register including XMM regs.)
– Save emul...
x86 on x64 (2.2)
• Benefit : Increased Registers (GPR/XMM)
– 8→16 (16 additional register including XMM regs.)
– Save emul...
x86 on x64 (2.3.1)
• Benefit : Remained Addressing Format
– Some addressing modes are added but
still x86-based addressing...
x86 on x64 (2.3.2)
• Benefit : Remained Addressing Format
– (e.g. 1) : inc [ds:ecx] → inc [rbx+rcx]
• rbx : Base address o...
x86 on x64 (2.3.3)
• Benefit : Remained Addressing Format
– (e.g. 2 [wrong]) : inc [ds:ecx+edx] → inc [rbx+rcx+rdx]
• Stor...
x86 on x64 (2.4.1)
• Benefit : Huge Memory Range
– 64-bit address width
• Valid 48-bit (sign extended) logical address.
• ...
x86 on x64 (2.4.2)
• Benefit : Huge Memory Range
– But allocating just 4GB is not enough.
The result of address calculatio...
x86 on x64 (2.4.3)
• Benefit : Huge Memory Range
– Allocate virtual memory region.
– Considering address overflow, we allo...
x86 on x64 (2.4.4)
• Benefit : Huge Memory Range
– Allocate virtual memory region each
segment and/or segment access contr...
x86 on x64 (2.4.4)
• Benefit : Huge Memory Range
– Allocate virtual memory region each
segment and/or segment access contr...
x86 on x64 (2.5.1)
• Benefit : Simplified Architecture
– Architecture of x64 is relatively simplified
which makes implemen...
x86 on x64 (2.5.2)
• Benefit : Simplified Architecture
– Pass-through the interrupts
• We can do it safely with IDT switch...
x86 on x64 (3)
• Using these techniques, implement
binary translation.
– But currently, it is still incomplete.
• To trace...
Everything into the Ring-0
• Is privilege isolation required?
– Dynamic code is generated safely and
well isolated; enabli...
Tests
Verification
Trace size test (1.1)
• Trace log size required
– DLX Linux bundled Bochs 2.45
• From computer reset until login screen.
•...
Trace size test (1.2)
• Trace log size required
– Size of initial context is not included.
– Modeled devices in Bochs emul...
Trace size test (1.3)
• Methods (comparison included)
– Raw
Text-format instruction/memory trace generated by Bochs.
– Ver...
Trace size test (2.1)
Method Size (bytes)
Raw 7,178,948,236 6.68GB
Verbose X > 419,430,400 400MB
Dumb 60,713,538 57.90MB
R...
Trace size test (2.2)
[Chart: trace size per method; logarithmic axis from 10,000 to 10,000,000,000 bytes]
Trace size test (2.3)
• Conclusion
– This result didn't come from an actual implementation
so there is some suspicious points...
Overhead tests
[Chart: min and max, without Tracer vs. with Tracer; axis from 0 to 40]
Possible Practical Uses
Application
Possible Practical Uses (1)
• Reverse Engineering (non-Malware)
– Everything *worked* is everything *recorded*
• All your ...
Possible Practical Uses (2)
• Avoiding Anti-debugging/Anti-VM
– No well-known backdoor.
– But binary translation based VM ...
Possible Practical Uses (3)
• Reverse Engineering (Malware)
– It is DANGEROUS to run malware directly!
– However, if you c...
Possible Practical Uses (4)
• Fuzzing / Exploit analysis / Bug discovery
– Imagine that Valgrind is applied to all program...
Possible Practical Uses (5)
• Analysis Support
– Export for other well-known tools.
• e.g. Wireshark
– In this case, you h...
Possible Practical Uses (6)
• <<Place Entry Here>>
– I guess you can use for other purposes.
– I hope that many people wor...
Future Challenges / Conclusion
Future Challenges / Summary
Challenge : Multicore (1)
• The original Record and Replay is not designed for
multi-processing environments.
– Many of communications m...
Challenge : Multicore (2)
• Time-sharing
– Only one CPU running simultaneously.
– Switch the CPU execution with timer to
s...
Challenge : Multicore (3)
• Software Implementation of MESI protocol
– Memory coherency algorithm
– CPU uses this protocol...
Challenge : Multicore (4)
• Trace Memory contents
– Also trace memory contents read for shared pages.
• Pros.
– Can achiev...
Challenge : 64-bit / Others
• x64 on x64 is very difficult.
– There are some ways but not so efficient.
• SSE2 / Reciproca...
CAUTION : PATENTS
• Some of these techniques are patented!
– Record and Replay
– Optimization for Binary Translation based...
Conclusion
• I described how to build tracing-VMM for
x86 on x64.
• Using the proposed method, the trace log gets smaller
and over...
contact me at : li at livegrid dot org
Open Source Project : Niizh
will be available at http://niizh.org/
Thank you!
Any q...
A New Tracer for Reverse Engineering - PacSec 2010

  1. 1. A New Tracer for Reverse Engineering Niizh (Section 1b) : Background and Implementation (Work in Progress) Tsukasa Ooi (@a4lg)
  2. 2. I... • will introduce a way to make reverse engineering more efficient... possibly. • Possibly? – (Nov 2010) General-purpose OSes don't work yet. • Sorry, no live demo! – Some predictions are included.
  3. 3. Related Topics • Reverse Engineering – especially dynamic analysis, debuggers and tracers. • Intel x86 (32-bit) architecture • Virtualization / Virtual Machine Monitor (VMM) – Record and Replay • Intrusion detection and analysis (e.g. honeypots) • Bug detection (e.g. fuzzing)
  4. 4. Agenda • Drawbacks of instruction tracers • New tracing method based on Record and Replay • Tracing-VMM implementation on x64 • Partial Tests • (Possible) Practical use of this Tracer • Challenges
  5. 5. Target Platform • Intel x86 (16/32-bit) architecture • PC/AT • General purpose OSes (Windows, Linux etc...)
  6. 6. Background
  7. 7. Dynamic Analysis • Analyze running programs – e.g. by intercepting operations of the program • Various tools – Debuggers • e.g. OllyDbg, IDA Pro... – Monitors • Process Monitor, Wireshark... – Tracers • API Monitor, OllyDbg, Process Stalker... • Today, I will talk about so-called tracers.
  8. 8. Tracers (1) • Capture and save the information associated with a specific event. – Various granularities • instruction, basic block, function, system call... • Instruction tracing – REALLY easy to apply automatic analysis to (such as automated unpacking). – If you can trace the entire internal context at every instruction, you can acquire any information you would like.
  9. 9. Tracers (2) • But in early research, I found that most of these instruction tracers have some drawbacks: – Extremely slow • They hook every instruction executed, which makes them really slow. • x10-x1000 – Generate a huge amount of data • Several gigabytes per real-second. (real-second : 1 sec of execution without emulation) • They save a lot of information for each instruction. • Saving the information can also be a bottleneck.
  10. 10. Tracers (3) • Can we solve these issues? • Major Requirements: – Overhead : <100% – Size of trace : <5MB/s • Theme is: How (did I implement | to make) a VMM-based tracer satisfying these requirements?
  11. 11. Theory: Record and Replay
  12. 12. Record and Replay (0) • I thought I had discovered this independently, but: – I had not found any related documents before. • ReVirt : Enabling Intrusion Analysis through Virtual-Machine Logging and Replay – I found that the new method is a variety of "Record and Replay". – The two are closely related and difficult to separate. • So I'm going to describe Record and Replay together with my method.
  13. 13. Record and Replay (1) • The method has a variety of names: – VMware calls this "Record and Replay" – Logging and Replay, Lockstep • Execution in 2 passes (Record/Replay) • By focusing on common characteristics of many machine architectures, it makes the trace output phenomenally small. – Normally, input from external hardware is not so frequent.
  14. 14. Record and Replay (2) • Many architectures can be represented as this model: – Input (can be null) – Calculation / Process (+Internal Context) – Output (can be null) • Assume the output is uniquely determined by the internal context (by function g below): • $z_{n+1} = f(z_n, i_n)$, $o_{n+1} = g(z_{n+1})$ [Diagram: Input → Calc/Proc +Context → Output]
  15. 15. Record and Replay (3) • Saving all information is equivalent to saving every internal context ($z_n$). – The output need not be saved because we assume it is uniquely determined by the internal context. • Also save $z_0$ (the initial internal context). • Function f (the calculation/process) must be a mathematical function. – Same input, same output. – Not ambiguous. [Diagram: Input → Calc/Proc +Context → Output]
  16. 16. Record and Replay (4) • Focusing on dependencies – Input : has no dependencies. – Calculation / Process (+Context) : depends on the input • Now you can see that... – The internal context depends only on the initial context and the sequence of inputs. You can recover every internal context from that information. [Diagram: Input → Calc/Proc +Context]
  17. 17. Record and Replay (5) • Pass 1 : Record – Capture and save the initial context – Run the virtual machine • It accepts input from external hardware. – Capture and save all inputs • This does not generate a dump of the internal context, but you can recover it from this small amount of data. [Diagram: Input, Trace log, Calc/Proc +Context, InitState]
  18. 18. Record and Replay (6) • Pass 2 : Replay – Recover the initial context from the trace log. – Run the virtual machine. • But read the trace log to supply input data. • So it does not accept new hardware inputs. – Read the internal context from the running virtual machine. • It is very similar to the Record pass! [Diagram: Input, Trace log, Calc/Proc +Context, InitState]
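The two passes can be illustrated with a toy machine. This is only a minimal sketch, not the Niizh implementation: the transition function `step`, the `TraceLog` type and the integer "context" are hypothetical stand-ins for the function f and the internal context z_n of the model above.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Optional

State = int
Input = Optional[int]  # None models "no input at this step"


def step(z: State, i: Input) -> State:
    """Toy transition function f: a pure function of (context, input)."""
    return (z * 31 + (0 if i is None else i)) & 0xFFFFFFFF


@dataclass
class TraceLog:
    initial_state: State
    inputs: List[Input] = field(default_factory=list)


def record(z0: State, source: Iterable[Input]) -> TraceLog:
    """Pass 1: save the initial context and every input, but no context dumps."""
    log = TraceLog(initial_state=z0)
    z = z0
    for i in source:
        log.inputs.append(i)
        z = step(z, i)
    return log


def replay(log: TraceLog) -> List[State]:
    """Pass 2: feed the logged inputs back in and rebuild every context z_n."""
    states = [log.initial_state]
    for i in log.inputs:
        states.append(step(states[-1], i))
    return states
```

Because `step` is a pure function, calling `replay` on the same log any number of times yields the same list of contexts, which is what allows the Replay pass to be repeated with different analyses.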
  19. 19. Cons. (1) • It seems to be just running twice, but: – You have saved the trace log, so you can run the Replay pass anytime, anywhere, as you want. • You extract part of the information from each Replay pass. • If you need more information, you just run the Replay pass again with a different configuration. – If you need to, you can run Replay passes in parallel. • This can shorten automated analysis. (In practice, you may encounter dependency issues.)
  20. 20. Cons. (2) • (Cont.) – The two passes are independent. • Even if you run a slow analysis, the Record pass keeps running as before. • You can use the Replay pass for slow and verbose analysis that is difficult to apply directly (such as buffer-overflow detection). • This method has an affinity for reverse engineering. – The trace log contains nearly *everything* happening in the virtual machine!
  21. 21. Real World Example (1) • VMware Workstation (6 or later) – Record/Replay feature • Record execution, then replay it just like a video and/or use it to debug. – It is proprietary and not robust enough, but it is a real implementation of the Record and Replay method. – Trace log : normally 1-10MB/s
  22. 22. Real World Example (2) • VMware Workstation (6 or later) – But... • It's still VMware. • The debug interface is insufficient. – If the debug interface were well equipped, you could use it for reverse engineering. • Other examples: – ReplayDIRECTOR (Java debugging tool) – Jockey (http://home.gna.org/jockey/) • User-mode recording / debugging library for Linux
  23. 23. Applying to x86 (1) • Deterministic elements can also be treated as a type of input, but that is inefficient. – Do you want to record many null elements?! • Classify the so-called inputs by type. – Nondeterministic Input(s) – Interrupt(s) • These are just names; they should not be taken literally. [Diagram: Input, Trace log, Calc/Proc +Context, InitState]
  24. 24. Applying to x86 (2.1) • Nondeterministic Inputs – The timing at which the internal context may become undetermined can be determined uniquely (as with the IN instruction on x86). – But you cannot determine the actual value or contents without running it. – Save the actual value or contents, but don't save the timing. • We can determine the timing from the recent internal context and interrupts.
  25. 25. Applying to x86 (2.2) • Interrupts – The timing is not uniquely predictable. – The actual content can also be nondeterministic. – In this case, trace the timing. Additionally, if the actual content of the interrupt is nondeterministic, trace it too. • e.g. the interrupt vector number (hardware interrupt) • The most important thing is: – Based on this classification, we have to classify all elements in the virtual machine.
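The two classes of logged events could be represented as follows. This is a sketch under assumptions: the record layouts, the use of a plain instruction index as "timing", and the `program` list marking which instructions read an input are all hypothetical simplifications, not the trace format of the tracer described in the slides.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass(frozen=True)
class NondetInput:
    """Value only: the timing follows from the replayed context (e.g. IN, RDTSC)."""
    value: int


@dataclass(frozen=True)
class Interrupt:
    """Timing is logged; the vector is logged because it is nondeterministic too."""
    timing: int   # hypothetical: index of the instruction at delivery
    vector: int


Event = Union[NondetInput, Interrupt]


def replay_events(log: List[Event], program: List[bool]) -> List[Tuple[str, int, int]]:
    """program[k] is True when instruction k reads a nondeterministic input."""
    values = [e.value for e in log if isinstance(e, NondetInput)]
    irqs = {e.timing: e.vector for e in log if isinstance(e, Interrupt)}
    delivered = []
    next_value = 0
    for k, reads_input in enumerate(program):
        if k in irqs:                      # interrupts arrive at the logged timing
            delivered.append(("irq", k, irqs[k]))
        if reads_input:                    # inputs are consumed in program order
            delivered.append(("in", k, values[next_value]))
            next_value += 1
    return delivered
```

Only interrupts carry a timing field; the inputs are consumed in order whenever the replayed program asks for one, which is why their timing does not need to be stored.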
  26. 26. Applying to x86 (3.1) • Modeling: VM-Internal Disk – Assume the VM-internal disk is reliable and record the initial disk image. – Almost all elements are deterministic except the interrupts that the disk generates. • The content read is equivalent to the content last written. • But the timing of the ATA interrupt cannot be predicted strictly, so we treat it as an Interrupt.
  27. 27. Applying to x86 (3.2) • Modeling: Mouse, Keyboard, Network – They are unpredictable/external inputs. – Input from these devices uses both x86 interrupts and I/O port operations. – So they are modeled as both. – Network packets that the guest sends are recovered from the internal context.
  28. 28. Applying to x86 (3.3) • Modeling: Time Stamp Counter (CPU) – The clock count since computer reset, whose value can be read with the RDTSC instruction. – Consider it a Nondeterministic Input. – Even if a value is physically located inside the CPU, you should treat it this way when it produces unpredictable results. • Even if you could model it and consider it deterministic, the implementation could be inefficient. • NOT considering it deterministic improves VM emulation efficiency.
  29. 29. Applying to x86 (3.4) • Modeling: CPU exception – Almost all exceptions are deterministic, including their timing. • A Page Fault occurs because the CPU has accessed an invalid memory address. – So this is not even an input. • Modeling: Undeterminable behavior of the CPU – After some CPU operations, part of the internal context can be nondeterministic. (The value/behavior is undefined by the architecture.) – Consider this a Nondeterministic Input.
  30. 30. Applying to x86 (3.5) • Modeling: Inexact Arithmetic Operation – Transcendental instructions such as FSINCOS and FPATAN do not define the exact result value, because specifying it is very difficult. – The minimum information that can be used to recover the original value is considered a Nondeterministic Input. • Likewise, we have to model *everything* – Implementation is relatively difficult.
  31. 31. Applying to x86 (4) • What happens if we consider X nondeterministic? – The number of hooks increases. – The trace log gets bigger and execution gets slower. – Fewer is better. • I thought these nondeterministic events were much, much fewer than normal instructions, so there would be no problem. – But I was wrong.
  32. 32. What do you think? • Is this instruction deterministic? XOR edx, edx – As you know, this instruction just clears the edx register. – But the answer is No. • Many normal operations make some part of the internal context nondeterministic. – IT IS EFLAGS.
  33. 33. The curse of EFLAGS? (1) • Let's look inside. – edx IS zero. On the other hand, EFLAGS.AF is updated to "?". – Intel's manual says this value is undefined (can vary). [Diagram: before XOR edx, edx: edx = xxx......xxx and OF SF ZF AF PF CF = x x x x x x; at the next instruction: edx = 000......000 and OF SF ZF AF PF CF = 0 0 1 ? 1 0]
  34. 34. The curse of EFLAGS? (2) • This is not the end! – The same applies to these frequently used instructions. – According to profiling, 10-15% of instructions make a part of EFLAGS undefined!
      Instructions                              OF SF ZF AF PF CF
      AND, OR, XOR, TEST (Logical Arithmetic)   0  M  M  ?  M  0
      MUL, IMUL (Multiplication)                M  ?  ?  ?  ?  M
      DIV, IDIV (Division)                      ?  ?  ?  ?  ?  ?
      SHL, SHR, SAL, SAR count (Shift)          ?  M  M  ?  M  ?
      (0 = cleared, M = modified according to the result, ? = undefined)
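The flag table on this slide can be turned into a small lookup that answers "which flags does this instruction class leave undefined?". This is a sketch: the dictionary keys are hypothetical names, and the column order OF, SF, ZF, AF, PF, CF is taken from the previous slide because the header on this slide is garbled in the transcript.

```python
FLAGS = ("OF", "SF", "ZF", "AF", "PF", "CF")

# One character per flag: "0" = cleared, "M" = modified, "?" = undefined.
EFFECTS = {
    "logical": "0MM?M0",  # AND, OR, XOR, TEST
    "mul":     "M????M",  # MUL, IMUL
    "div":     "??????",  # DIV, IDIV
    "shift":   "?MM?M?",  # SHL, SHR, SAL, SAR with a count
}


def undefined_flags(kind: str) -> list:
    """Return the flags that the given instruction class leaves undefined."""
    return [flag for flag, effect in zip(FLAGS, EFFECTS[kind]) if effect == "?"]
```

For example, `undefined_flags("logical")` returns only `AF`, which is the flag that makes `XOR edx, edx` nondeterministic in the sense used here.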
  35. 35. The curse of EFLAGS? (3) • They are not much, much fewer at all! – Even at 10% of instructions, the overhead of hooking cannot be ignored. – We can choose not to trace EFLAGS. For instance, we can update the EFLAGS register to a deterministic value. But... • Updating flags (POPF) is extremely slow! • 24-25 clocks on the Intel Nehalem microarchitecture (Core i7) – To avoid this problem, we need to keep these values from having any effect.
  36. 36. The implementation problem (1) • Public Record and Replay implementations do not care about this condition! – They just restrict the processor model. If we record a program on processor model A, we need to replay on exactly the same model. – This prevents distributed analysis. – Normally, programs don't depend on these undefined (nondeterministic) values. • But technically, a single nondeterministic bit can cause chaos.
  37. 37. The implementation problem (2) • What is RIGHT? – We cannot know exactly which CPU model is right. – I want to integrate the information into one log, with no more compatibility/portability problems. • Such problems are no good for reverse engineering. – I want robustness!
  38. 38. EFLAGS : Lazy Evaluation (1) • EFLAGS and programs have these characteristics: – Over 80% of updated flags are just discarded. • We want to trace *everything*, but it is worthless to trace a value that is never used. – Updating and evaluating flags are adjacent in most cases. • e.g. Compare → Jump Conditionally • Intel CPUs optimize this pattern too! (Macro-Fusion) – How about lazy evaluation? • Trace a nondeterministic EFLAGS value only when it is used.
  39. 39. EFLAGS : Lazy Evaluation (2) • Current implementation: – JIT compiling with static evaluation (to make programs run faster). – Evaluate each instruction block • From the instruction after a jump operation to the next unconditional jump (instruction/exception). • Scan each block forward. – Evaluate the propagation of virtual EFLAGS. • Deterministic or not (initial value : No) • The last instruction that updated the flag value. • We use heuristics.
  40. 40. EFLAGS : Lazy Evaluation (3) • (cont.) – If an instruction in the block depends on these flags and the virtual flags satisfy one of the conditions below, we just consider the value nondeterministic. • The value of the virtual flag is nondeterministic. • The value is deterministic but the instruction that updated it is too old (32 bytes / 8 instructions or more earlier). • Currently, this is very effective. – I found that almost all traced flags are traced during interrupt handling / context switches.
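The forward scan described on the last two slides might look roughly like this. It is a simplified sketch, not the author's JIT: the `Insn` record, the per-flag bookkeeping and the use of an instruction-count age limit only (ignoring the 32-byte limit) are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Tuple

MAX_AGE = 8  # instructions; the slide gives "32 bytes / 8 instructions"


@dataclass(frozen=True)
class Insn:
    name: str
    reads: FrozenSet[str] = frozenset()      # flags the instruction depends on
    defines: FrozenSet[str] = frozenset()    # flags set to an architecturally defined value
    undefines: FrozenSet[str] = frozenset()  # flags left undefined


def flags_to_trace(block: List[Insn]) -> List[Tuple[int, str]]:
    """Scan a block forward; return (index, flag) pairs whose value must be traced."""
    last_writer = {}  # flag -> index of the defining instruction; absent = nondeterministic
    traced = []
    for idx, insn in enumerate(block):
        for flag in sorted(insn.reads):
            writer = last_writer.get(flag)
            if writer is None or idx - writer >= MAX_AGE:
                traced.append((idx, flag))   # nondeterministic, or defined too long ago
        for flag in insn.undefines:
            last_writer.pop(flag, None)
        for flag in insn.defines:
            last_writer[flag] = idx
    return traced
```

Every flag starts out as nondeterministic at block entry, matching "Initial Value : No" above, so a conditional jump at the very start of a block always has its flag traced.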
  41. 41. Record and Replay : Conclusion • Using Record and Replay, we can decrease the size of the trace log and the trace overhead. • Using (my) improved method, we can acquire a robust trace log on the x86 platform.
  42. 42. Implementation
  43. 43. Implementation • I implemented a VMM-based tracer. – To run general purpose OSes. • But it was not a good idea. Because of its complexity, I couldn't finish the VMM (Nov 2010). – Using binary translation • Read guest instructions and transform them to run on the host platform. – I chose the x64 platform to implement the VMM. • There are some reasons why x64 is good for binary translation-based x86 emulation.
  44. 44. x86 on x64 (1) • x64 is a 64-bit extension to the x86 architecture. – AMD, Intel and VIA have x64 extensions. – Very similar instruction format. – Some extensions: • Increased general purpose and XMM registers (8→16) • New addressing modes (64-bit, RIP [program counter] relative) • There are many elements that make implementing a binary translation-based VMM easier.
  45. 45. x86 on x64 (2.1) • Benefit : 32-bit registers and clamp – The general purpose register format is based on the original one (the registers share their lower bits). • e.g. ax (16-bit), eax (32-bit), rax (64-bit) – If you run an instruction whose destination is a 32-bit register, the upper 32 bits of the corresponding 64-bit register are cleared! [Diagram: MOV eax, 0x01234567 gives eax = 0123 4567; a following MOV ax, 0x1234 gives eax = 0123 1234]
  46. 46. x86 on x64 (2.1) • Benefit : 32-bit registers and clamp – The general purpose register format is based on the original one (the registers share their lower bits). • e.g. ax (16-bit), eax (32-bit), rax (64-bit) – If you run an instruction whose destination is a 32-bit register, the upper 32 bits of the corresponding 64-bit register are cleared! [Diagram: MOV rax, 0x0123456789abcdef gives rax = 01234567 89abcdef; a following MOV eax, 0x12345678 gives rax = 00000000 12345678]
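The difference between the two diagrams can be modeled with plain integers. This is a small sketch of the architectural rule only; the function names are hypothetical.

```python
MASK64 = (1 << 64) - 1


def write_reg16(reg: int, value: int) -> int:
    """16-bit destination (e.g. MOV ax, imm16): the upper bits are preserved."""
    return (reg & ~0xFFFF & MASK64) | (value & 0xFFFF)


def write_reg32(reg: int, value: int) -> int:
    """32-bit destination in 64-bit mode (e.g. MOV eax, imm32): upper 32 bits are cleared."""
    return value & 0xFFFFFFFF
```

With these, `write_reg16(0x01234567, 0x1234)` keeps the upper half of eax, while `write_reg32(0x0123456789ABCDEF, 0x12345678)` discards the upper half of rax, which is the "clamp" the translator relies on.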
  47. 47. x86 on x64 (2.2) • Benefit : Increased Registers (GPR/XMM) – 8→16 (16 additional registers including XMM regs.) – Save the emulator's context without destroying the existing registers. [Diagram: GPRs rax, rcx, rdx, rbx, rsp, rbp, rsi, rdi plus r8-r15; XMM registers xmm0-xmm7 plus xmm8-xmm15]
  48. 48. x86 on x64 (2.2) • Benefit : Increased Registers (GPR/XMM) – 8→16 (16 additional registers including XMM regs.) – Save the emulator's context without destroying the existing registers. Actual register mapping table (for memory/cache optimization, some registers are relocated):
      Host   Use        Host    Use
      rax    eax        r8      cs.base
      rcx    ecx        r9      es.base
      rdx    edx        r10     emuinfo
      rbx    ds.base    r11     ebx
      rsp    stack      r12     esp
      rbp    ebp        r13     tmp2
      rsi    esi        r14     ss.base
      rdi    tmp1       r15     edi
      xmm0   xmm0       xmm8    fs.base
      xmm1   xmm1       xmm9    gs.base
      xmm2   xmm2       xmm10   tmp3
      xmm3   xmm3       xmm11   tmp4
      xmm4   xmm4       xmm12   notused
      xmm5   xmm5       xmm13   notused
      xmm6   xmm6       xmm14   notused
      xmm7   xmm7       xmm15   notused
  49. 49. x86 on x64 (2.2) • Benefit : Increased Registers (GPR/XMM) – 8→16 (16 additional registers including XMM regs.) – Save the emulator's context without destroying the existing registers. – XMM registers are sometimes difficult to use, but we can transfer values to GPRs using the movq instruction.
  50. 50. x86 on x64 (2.3.1) • Benefit : Remained Addressing Format – Some addressing modes are added, but the addressing format is still x86-based. – x86 has complex addressing modes: • Such as 2 additions and 1 shift : [esi+edx*4+123] • We can use this to separate memory accesses! – Address Translation : [segbase+offset] • All memory accesses are segbase-relative. (segbase contains the 64-bit address of the segment base.) – Achieving Memory Isolation • Like Google Native Client for x64
  51. 51. x86 on x64 (2.3.2) • Benefit : Remained Addressing Format – (e.g. 1) : inc [ds:ecx] → inc [rbx+rcx] • rbx : Base address of the DS segment. • rcx : Guest ECX register. – Wait a minute, the ecx register is 32-bit, but we are using rcx, which is a 64-bit register! (Are you sure about that?) • No problem. As I described before, the results of 32-bit operations are also clamped. • We can guarantee that the value of rcx is in the 32-bit range (0x0000_0000-0xffff_ffff).
  52. 52. x86 on x64 (2.3.3) • Benefit : Remained Addressing Format – (e.g. 2 [wrong]) : inc [ds:ecx+edx] → inc [rbx+rcx+rdx] • Instead, store the intermediate result in a temporary register. – (e.g. 2 [correct]) : inc [ds:ecx+edx] → lea edi, [rcx+rdx] ; inc [rbx+rdi] • edi/rdi : Temporary register • Almost the same as the first example. – I take the best encoding x64 has. • Store a 64-bit address into a 32-bit register! • This is also a valid encoding. The address is automatically clamped and the instruction is shorter.
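Why the first translation is marked wrong can be seen by computing both host addresses for a guest offset that wraps around. This is an arithmetic sketch only; the function names and the segment base value used in the usage note are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def translate_naive(seg_base: int, ecx: int, edx: int) -> int:
    """inc [rbx+rcx+rdx]: the sum is formed at 64-bit width, so it never wraps."""
    return seg_base + ecx + edx


def translate_clamped(seg_base: int, ecx: int, edx: int) -> int:
    """lea edi, [rcx+rdx] ; inc [rbx+rdi]: the guest offset wraps at 32 bits first."""
    edi = (ecx + edx) & MASK32   # writing edi clears the upper 32 bits of rdi
    return seg_base + edi
```

With a hypothetical base of `0x7000_0000_0000`, `ecx = 0xFFFF_FFF0` and `edx = 0x20`, the clamped form lands at `base + 0x10` as a 32-bit guest would, while the naive form lands 4GB further up.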
  53. 53. x86 on x64 (2.4.1) • Benefit : Huge Memory Range – 64-bit address width • Valid logical addresses are 48-bit (sign extended). • 0x0000_1234_5678 → 0x0000_0000_1234_5678 • 0x8000_1234_5678 → 0xffff_8000_1234_5678 – We can place the data/code that the VMM uses outside the guest-accessible region. • With x86 on x86, address compression was needed to store host and guest data in the same address space. • This increases VMM speed.
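The sign extension of 48-bit logical addresses shown in the two examples can be written as a short helper. This is a sketch; the function name is hypothetical.

```python
def canonical(addr48: int) -> int:
    """Sign-extend a 48-bit logical address to its 64-bit canonical form."""
    addr48 &= (1 << 48) - 1
    if addr48 & (1 << 47):        # bit 47 set: copy it into bits 48-63
        addr48 |= 0xFFFF << 48
    return addr48
```

Applied to the slide's examples, `canonical(0x8000_1234_5678)` gives `0xffff_8000_1234_5678`, while an address with bit 47 clear is returned unchanged.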
  54. 54. x86 on x64 (2.4.2) • Benefit : Huge Memory Range – But allocating just 4GB is not enough. The result of an address calculation can overflow or underflow. – In 32-bit mode on x86, address calculation is done with 32-bit precision and overflow/underflow is ignored. This means the lower 32 bits equal the memory address actually accessed. – So we modify the page table to satisfy: equal lower 32 bits == same physical address
  55. 55. x86 on x64 (2.4.3) • Benefit : Huge Memory Range – Allocate a virtual memory region. – Considering address overflow, we allocate up to 44.5GB range of virtual memory. • The red and blue areas point to exactly the same physical region. • We use the page table to achieve this. [Diagram labels as extracted: 44.5GB 42.25GB 2.25GB]
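The page-table rule "equal lower 32 bits means same physical address" can be expressed as a mapping from a host-side offset to a guest page. This is only a sketch of the aliasing rule; the 4 KiB page size and the function names are assumptions, and no real page table is built.

```python
PAGE_SIZE = 4096          # assumption: 4 KiB pages
MASK32 = 0xFFFFFFFF


def guest_page(offset_from_seg_base: int) -> int:
    """Guest page backing a host offset: only the lower 32 bits matter."""
    return (offset_from_seg_base & MASK32) // PAGE_SIZE


def aliases(offset_a: int, offset_b: int) -> bool:
    """True when two host offsets must be mapped to the same physical page."""
    return guest_page(offset_a) == guest_page(offset_b)
```

An offset that overflowed past 4GB and a small negative offset both fold back into the guest's 4GB range this way, so an untranslated overflow still touches the memory a 32-bit CPU would have touched.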
  56. 56. x86 on x64 (2.4.4) • Benefit : Huge Memory Range – Allocate a virtual memory region for each segment and/or segment access control. • On a segment switch, just change the base address. [Diagram: cs.base, ds.base, es.base and ss.base pointing at regions data3, code0, data3, code3]
  57. 57. x86 on x64 (2.4.4) • Benefit : Huge Memory Range – Allocate a virtual memory region for each segment and/or segment access control. • On a segment switch, just change the base address. [Diagram: cs.base, ds.base, es.base and ss.base pointing at regions data3, code0, data3, code3]
  58. 58. x86 on x64 (2.5.1) • Benefit : Simplified Architecture – The x64 architecture is relatively simple, which makes implementing a Type-2 VMM easier. • Only two interrupt handler types: – Interrupt Gate and Trap Gate • Segments are now a mere façade. – Flat memory model for CS, DS, ES and SS. – Replace the IDT (interrupt vector table) to allocate a VM-specific context. • PatchGuard compatible! • Nearly stealthy, but cannot hook system calls.
  59. 59. x86 on x64 (2.5.2) • Benefit : Simplified Architecture – Pass the interrupts through • We can do it safely with IDT switching. • There is some overhead. • The actual implementation is a bit more complicated; this is only a summary. [Diagram: VM and OS sides connected by IDT switches; OS Kernel, OS IntHandler, VM Trampoline, VM Entry, VM IntHandler, VM Kernel]
  60. 60. x86 on x64 (3) • Using these techniques, I implement binary translation. – But currently, it is still incomplete. • To trace timing, the following information is required. – Value of the branch counter (a software implementation is possible). – Current program counter (IP, EIP) – Repeat count (CX, ECX) • only when a rep instruction is executing.
  61. 61. Everything into the Ring-0 • Is privilege isolation required? – Dynamic code is generated safely and well isolated, which enables running everything in kernel mode (Ring-0). • Low-overhead implementation. • The current implementation does this. – If this is too dangerous, you can also run the code in user mode (Ring-3).
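The three values listed on this slide can be combined into one timing record that a Replay pass compares against the running guest. This is a sketch under assumptions: the `Timing` type and `should_deliver` helper are hypothetical, and a real replay loop would check this at every candidate instruction boundary.

```python
from typing import NamedTuple, Optional


class Timing(NamedTuple):
    branch_count: int          # distinguishes repeated visits to the same address
    eip: int                   # current program counter
    rep_count: Optional[int]   # ECX, recorded only while a rep instruction is executing


def should_deliver(recorded: Timing, branch_count: int, eip: int,
                   ecx: int, in_rep: bool) -> bool:
    """True when the replayed guest has reached the recorded interrupt point."""
    now = Timing(branch_count, eip, ecx if in_rep else None)
    return now == recorded
```

The program counter alone is ambiguous inside a loop, and a rep instruction can iterate many times without taking a branch or moving the program counter, which is presumably why all three values are listed.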
  62. 62. Tests Verification
  63. 63. Trace size test (1.1) • Trace log size required – DLX Linux bundled with Bochs 2.45 • From computer reset until the login screen. • 52,217,403 instructions (no-emulation : 53 sec) – Specs • 1 MIPS (1,000,000 instructions/sec) • 32MB MEM, 10MB HDD – Use Bochs to generate an instruction/memory trace and convert it using the specific methods.
  64. 64. Trace size test (1.2) • Trace log size required – The size of the initial context is not included. – I modeled the devices in the Bochs emulator and estimated the trace log size required. – Because the model is simplified, the size is only an estimate (not an exact value).
  65. 65. Trace size test (1.3) • Methods (comparison included) – Raw : Text-format instruction/memory trace generated by Bochs. – Verbose : Normal tracer (like OllyDbg does) – Dumb : Record and Replay plus memory monitoring. – RnR (1) : Record and Replay (tracing EFLAGS) – PROPOSAL : Improved Record and Replay method – RnR (2) : Record and Replay (IGNORING EFLAGS)
  66. 66. Trace size test (2.1)
      Method      Size (bytes)
      Raw         7,178,948,236      6.68GB
      Verbose     X > 419,430,400    400MB
      Dumb        60,713,538         57.90MB
      RnR (1)     6,932,542          6.61MB
      PROPOSAL    389,013            380KB
      RnR (2)     31,788             31KB
      This table shows that PROPOSAL generates only about 1/1,000 of the trace log of the Verbose tracer. The Record and Replay method that ignores EFLAGS, RnR (2), is smaller than PROPOSAL but has low portability.
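The sizes reported in Trace size test (2.1) can be put next to the "<5MB/s" requirement stated earlier in the talk with a little arithmetic. The sketch below is only a rough check under stated assumptions: the 53 seconds quoted for the test run is taken as the length of the recorded run, MB is taken as 2**20 bytes, and the Verbose figure is a lower bound, so its real rate is higher. The names `summarize`, `TRACE_BYTES` and the other constants are made up for this example.

```python
# Sizes in bytes, copied from the trace size table (Verbose is a lower bound).
TRACE_BYTES = {
    "Raw": 7_178_948_236,
    "Verbose": 419_430_400,
    "Dumb": 60_713_538,
    "RnR (1)": 6_932_542,
    "PROPOSAL": 389_013,
    "RnR (2)": 31_788,
}
INSTRUCTIONS = 52_217_403               # from reset until the login screen
RUN_SECONDS = 53                        # assumption: length of the recorded run
LIMIT_BYTES_PER_SEC = 5 * 1024 * 1024   # "<5MB/s", assuming MB = 2**20 bytes


def summarize(sizes=TRACE_BYTES):
    """Per method: bytes per instruction, bytes per second, and the limit check."""
    rows = {}
    for method, size in sizes.items():
        rate = size / RUN_SECONDS
        rows[method] = {
            "bytes_per_instruction": size / INSTRUCTIONS,
            "bytes_per_second": rate,
            "within_limit": rate < LIMIT_BYTES_PER_SEC,
        }
    return rows


if __name__ == "__main__":
    for method, row in summarize().items():
        print(f"{method:9s} {row['bytes_per_instruction']:12.4f} B/insn "
              f"{row['bytes_per_second']:16.1f} B/s  ok={row['within_limit']}")
```

Keep in mind that the test machine runs at only 1 MIPS, so these per-second rates would grow roughly in proportion on a faster guest; the per-instruction figure is the more portable number.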
  67. 67. Trace size test (2.2) [Chart: trace log size per method; logarithmic axis, Size (bytes), from 10,000 to 10,000,000,000]
  68. 68. Trace size test (2.3) • Conclusion – These results did not come from an actual implementation, so some points are questionable. – Despite this, the proposed method generates a much smaller trace log than the older methods.
  69. 69. Overhead tests [Chart: min and max, without Tracer vs. with Tracer; axis from 0 to 40]
  70. 70. Possible Practical Uses Application
  71. 71. Possible Practical Uses (1) • Reverse Engineering (non-Malware) – Everything that *worked* is everything that was *recorded* • All your program are belong to us! • The program's behavior is recorded, including VM detection and/or anti-debugging. – Of course, the program is recorded in its unpacked/decrypted form. • You can integrate multiple analyses.
  72. 72. Possible Practical Uses (2) • Avoiding Anti-debugging/Anti-VM – No well-known backdoor. – But a binary-translation-based VM can be detected by running specific code. • e.g. Self-modifying code is (extremely) slow. – You can find out how the VM is detected. At the least, you can extract useful information for avoiding VM detection. • Protection of normal programs is not so strong.
  73. 73. Possible Practical Uses (3) • Reverse Engineering (Malware) – It is DANGEROUS to run malware directly! – However, if you can take care of these problems, this tracer can be useful. – Honeypots?
  74. 74. Possible Practical Uses (4) • Fuzzing / Exploit analysis / Bug discovery – Imagine that Valgrind is applied to all programs while you can still use the guest program interactively. – With offline analysis, you can find and track memory corruption. – If you can reproduce the issue, you can extract useful information. – However, whether this is efficient for fuzzing can be very implementation-dependent.
  75. 75. Possible Practical Uses (5) • Analysis Support – Export for other well-known tools. • e.g. Wireshark – In this case, you have the program's behavior, so you can add metadata and/or supplemental info. • e.g. SSL/TLS auto decryption • You cannot steal a key from a packet dump, but remember, you can run the program which uses the private (common/shared) key!
  76. 76. Possible Practical Uses (6) • <<Place Entry Here>> – I guess you can use it for other purposes. – I hope that many people will make the best use of this type of tracer.
  77. 77. Future Challenges / Conclusion Future Challenges / Summary
  78. 78. Challenge : Multicore (1) • The original Record and Replay is not designed for multi-processing environments. – Heavy communication between CPUs makes the tracer slow. – Almost all implementations are restricted to 1 CPU/thread (mine, too). • But that doesn't mean it is impossible. – Time-sharing – Software emulation of MESI protocol – Trace memory contents
  79. 79. Challenge : Multicore (2) • Time-sharing – Only one CPU runs at a time. – Switch CPU execution on a timer to simulate multiple CPUs running. • Pros. – Almost no synchronization required. • Cons. – More CPUs, less efficiency. – Difficult to reproduce multi-threading problems because this is not true multi-processing.
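The time-sharing idea above can be sketched as a round-robin loop in which exactly one virtual CPU runs per time slice and each switch is written to the log. This is a toy model, not the tracer's scheduler: `VirtualCpu`, `time_share` and the fixed instruction `quantum` are assumptions made for this illustration, and a real implementation would switch on a timer as the slide says.

```python
from dataclasses import dataclass


@dataclass
class VirtualCpu:
    """Toy stand-in for one guest CPU (hypothetical)."""
    cpu_id: int
    executed: int = 0   # instructions retired so far on this CPU

    def run(self, budget: int) -> None:
        # A real VMM would execute translated guest code here.
        self.executed += budget


def time_share(cpus, quantum, total_slices):
    """Run one virtual CPU at a time, round-robin, and log every switch."""
    switch_log = []
    for time_slice in range(total_slices):
        cpu = cpus[time_slice % len(cpus)]
        cpu.run(quantum)
        # Because only one CPU ever runs, (cpu, progress) fixes the interleaving.
        switch_log.append((cpu.cpu_id, cpu.executed))
    return switch_log


if __name__ == "__main__":
    cpus = [VirtualCpu(0), VirtualCpu(1)]
    for entry in time_share(cpus, quantum=1000, total_slices=4):
        print(entry)
```

The sketch also shows the stated drawback: with N virtual CPUs each one gets only 1/N of the slices, and the guest never sees two CPUs truly running at the same instant.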
  80. 80. Challenge : Multicore (3) • Software Implementation of MESI protocol – Memory coherency algorithm – CPUs use this protocol (or its variants) to keep memory/cache coherent. – We can implement this using page-level protection. – Lock a page before writing to it. • Pros. – High efficiency when few pages are shared. • Cons. – Software implementation is quite slow.
  81. 81. Challenge : Multicore (4) • Trace Memory contents – Also trace the memory contents read from shared pages. • Pros. – Can achieve high efficiency... maybe. • Cons. – It is not a perfect-information tracer. (Which CPU has written this value?!) – Memory tracing is slow. • A bandwidth monster may be required.
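A page-granular version of MESI can be sketched as a small state machine per shared page, where every state change that needs a protection fault is counted. This is a simplified model under assumptions that are not in the slides: a page in the Exclusive state is assumed to be mapped writable, so the Exclusive-to-Modified step costs no fault, and write-back of data is not modeled. `State`, `SharedPage` and `faults` are hypothetical names.

```python
from enum import Enum


class State(Enum):
    MODIFIED = "M"
    EXCLUSIVE = "E"
    SHARED = "S"
    INVALID = "I"


class SharedPage:
    """Per-CPU MESI state of one guest page, enforced by page protection."""

    def __init__(self, n_cpus: int):
        self.state = [State.INVALID] * n_cpus
        self.faults = 0   # protection faults the VMM would have to handle

    def read(self, cpu: int) -> None:
        if self.state[cpu] is not State.INVALID:
            return                          # already mapped readable: no trap
        self.faults += 1
        holders = [i for i, s in enumerate(self.state)
                   if i != cpu and s is not State.INVALID]
        for i in holders:
            self.state[i] = State.SHARED    # M/E holders lose write access
        self.state[cpu] = State.SHARED if holders else State.EXCLUSIVE

    def write(self, cpu: int) -> None:
        if self.state[cpu] is State.MODIFIED:
            return                          # already the only writer
        if self.state[cpu] is State.EXCLUSIVE:
            self.state[cpu] = State.MODIFIED   # assumed writable: no trap
            return
        self.faults += 1                    # S or I: the page is locked for writing
        for other in range(len(self.state)):
            if other != cpu:
                self.state[other] = State.INVALID
        self.state[cpu] = State.MODIFIED


if __name__ == "__main__":
    page = SharedPage(n_cpus=2)
    page.read(0)
    page.write(0)
    page.read(1)
    page.write(1)
    print([s.value for s in page.state], page.faults)
```

A page touched by only one CPU costs a single fault here, which matches the "few shared pages" advantage; pages that bounce between CPUs fault on almost every access, which is where the slowness of a software implementation shows up.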
  82. 82. Challenge : 64-bit / Others • x64 on x64 is very difficult. – There are some ways, but they are not so efficient. • SSE2 / Reciprocal, Square root instructions – These instructions are not required to return exact values, and they are fast to run (this is a problem.) • Hypervisor again? – Record a non-portable trace and convert it to a portable one (using the same processor model.) – This is not perfect, but it is a possible choice.
  83. 83. CAUTION : PATENTS • Some of these techniques are patented! – Record and Replay – Optimization for Binary Translation based VMM. – Difficult/Impossible to avoid these patents. • However, all the patents I have found are United States patents only, so I guess using this tracer outside the US is not a problem. – Be careful.
  84. 84. Conclusion • I described how to build a tracing VMM for x86 on x64. • With the proposed method, the trace log gets smaller and the overhead gets lower too. – However, proper tests (validation) are required to check whether this is useful for reverse engineering. • Many practical uses! – Any others?
  85. 85. contact me at : li at livegrid dot org Open Source Project : Niizh will be available at http://niizh.org/ Thank you! Any questions?
