A New Tracer for Reverse Engineering - PacSec 2010

A New Tracer for Reverse Engineering
Niizh (Section 1b) : Background and Implementation (Work in Progress)
Tsukasa Ooi (@a4lg)

I...
• will introduce a way to make reverse engineering
more efficient... possibly.
• Possibly?
– (Nov 2010) General-purpose OSes don't work on it yet.
• Sorry, no live demo!
– Some predictions are included.

Related Topics
• Reverse Engineering
– especially dynamic analysis, debuggers and tracers.
• Intel x86 (32-bit) architecture
• Virtualization / Virtual Machine Monitor (VMM)
– Record and Replay
• Intrusion detection and analysis (e.g. honeypots)
• Bug detection (e.g. fuzzing)

Agenda
• Drawbacks of instruction tracers
• New tracing method based on Record and Replay
• Tracing-VMM implementation on x64
• Partial Tests
• (Possible) Practical use of this Tracer
• Challenges

Target Platform
• Intel x86 (16/32-bit) architecture
• PC/AT
• General purpose OSes (Windows, Linux etc...)

Dynamic Analysis
• Analyze running programs
– e.g. By intercepting operations of the program
• Various tools
– Debuggers
• e.g. OllyDbg, IDA Pro...
– Monitors
• Process Monitor, Wireshark...
– Tracers
• API Monitor, OllyDbg, Process Stalker...
• Today, I will talk about so-called tracers.

Tracers (1)
• Capture and save the information
associated with a specific event.
– Various granularities
• instruction, basic block, function, system call...
• Instruction tracing
– REALLY easy to apply automatic analysis to
(like automated unpacking.)
– If you can trace the entire internal context
at each instruction, you can acquire any
information you would like.

Tracers (2)
• But in early research, I found that most
instruction tracers have some drawbacks:
– Extremely slow
• They hook every instruction executed, which makes
them really slow.
• 10x-1000x slowdown
– Generate a huge amount of data
• Several gigabytes per real-second.
(real-second : 1 sec without emulation)
• They save a lot of information for each instruction.
• Saving the information can also be a bottleneck.

Tracers (3)
• Can we solve these issues?
• Major Requirements:
– Overhead : <100%
– Size of trace : <5MB/s
• Theme is:
How (did I implement | to make) a
VMM-based tracer satisfying these requirements?

Theory ‒ Record and Replay

Record and Replay (0)
• I thought I had discovered this independently, but:
– I had not found any related documents beforehand.
• ReVirt : Enabling Intrusion Analysis through Virtual-
Machine Logging and Replay
– I found that the new method is a variant of
"Record and Replay".
– The two are closely related and difficult to separate.
• So I'm going to describe Record and Replay
together with my method.

• The method goes by several names:
– VMware calls this "Record and Replay"
– Logging and Replay Lockstep
• Execution in 2 passes (Record/Replay)
• By exploiting characteristics common to
many machine architectures, it makes
the trace output phenomenally small.
– Normally, input from external hardware
is not so frequent.

• Many architectures can be represented by this model:
– Input (can be null)
– Calculation / Process (+Internal Context)
– Output (can be null)
• Assume the output is uniquely determined
by the internal context (by function g below.)
• $z_{n+1} = f(z_n, i_n)$
$o_{n+1} = g(z_{n+1})$
(Diagram: Input → Calc/Proc + Context → Output)

• Saving all information is equivalent to
saving every internal context ($z_n$).
– The output is not required because we assume
it is uniquely determined by the internal context.
• Also save $z_0$ (the initial internal context.)
• Function f (equivalent to calculation/process)
must be a mathematical function.
– Same input, same output.
– Not ambiguous.

• Focusing on dependencies
– Input : has no dependencies.
– Calculation / Process (+Context) : depends on the input.
• Now you can see...
– The internal context depends only on the previous internal
context and the inputs. You can recover every context from
the initial context and the input sequence.
(Diagram: Input → Calc/Proc + Context)
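The dependency argument can be written out by unrolling the recurrence from the model. This is only a restatement of the formulas above, using the same symbols ($z$ for internal context, $i$ for input, $o$ for output):

```latex
\begin{aligned}
z_1 &= f(z_0, i_0) \\
z_2 &= f(z_1, i_1) = f(f(z_0, i_0), i_1) \\
    &\;\;\vdots \\
z_n &= f(z_{n-1}, i_{n-1}), \qquad o_n = g(z_n)
\end{aligned}
```

So $z_0$ together with $i_0, \dots, i_{n-1}$ is enough to recompute every $z_k$ and $o_k$ for $k \le n$, provided $f$ and $g$ are deterministic.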

• Pass 1 : Record
– Capture and save initial context
– Run the virtual machine
• Accepts input from external hardware.
– Capture and save all inputs
• This does not generate the dump of
internal context but you can recover
it from this small amount
of data.
(Diagram: InitState and Input feed Calc/Proc + Context and are written to the trace log)

• Pass 2 : Replay
– Recover initial context from trace log.
– Run the virtual machine.
• But read trace log to supply input data.
• So it does not accept new hardware inputs.
– Read internal context from
running virtual-machine.
• It is very similar to
Record pass!
(Diagram: the trace log supplies InitState and Input to Calc/Proc + Context)
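The two passes can be sketched on a toy state machine. This is a minimal illustration, not the tracer itself: `step` is a made-up deterministic transition standing in for f, and `record` / `replay` are hypothetical names.

```python
from typing import Iterable, List, Tuple

TraceLog = Tuple[int, List[int]]  # (initial context, recorded inputs)

def step(state: int, inp: int) -> int:
    """Hypothetical deterministic transition f(z, i) for a toy machine."""
    return (state * 31 + inp) & 0xFFFFFFFF

def record(initial: int, device: Iterable[int]) -> TraceLog:
    """Pass 1: run the machine, saving only the initial context and inputs."""
    state = initial
    inputs: List[int] = []
    for inp in device:          # input arriving from "external hardware"
        inputs.append(inp)      # capture the input
        state = step(state, inp)
    return initial, inputs      # no dump of intermediate contexts is kept

def replay(log: TraceLog) -> List[int]:
    """Pass 2: feed the logged inputs back and read out every context."""
    initial, inputs = log
    states = [initial]
    for inp in inputs:          # the log replaces the hardware
        states.append(step(states[-1], inp))
    return states
```

Running `replay` twice on the same log yields the same list of contexts, which is what allows the Replay pass to be repeated with different analysis settings.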

Considerations (1)
• It seems like just running twice, but:
– You have saved the trace log, so you can run
the Replay pass anytime, anywhere, as often as you want.
• You extract part of the information in each Replay pass.
• If you need more information, you just
run the Replay pass again with a different configuration.
– If you need to, you can run Replay passes in parallel.
• You can shorten automated analysis.
(In practice, you may run into dependency issues.)

Considerations (2)
• (Cont.)
– The two passes are independent.
• Even if you run slow analysis, the Record pass
keeps running as before.
• You can use the Replay pass to do slow and
verbose analysis that is difficult to apply directly
(such as buffer-overflow detection.)
• This method has an affinity for reverse engineering.
– The trace log contains nearly *everything*
happening in the virtual machine!

Real World Example (1)
• VMware Workstation (6 or later)
– Record/Replay feature
• Record an execution, then replay it like
a video and/or use it for debugging.
– It is proprietary and not robust enough,
but it is a real implementation of
the Record and Replay method.
– Trace log : normally 1-10MB/s

Real World Example (2)
• VMware Workstation (6 or later)
– But...
• It's still VMware.
• The debug interface is insufficient.
– If the debug interface were well equipped,
you could use it for reverse engineering.
• Other examples:
– ReplayDIRECTOR (Java debugging tool)
– Jockey (http://home.gna.org/jockey/)
• User-mode Recording / Debugging library for Linux

Applying to x86 (1)
• Every element, even a deterministic one, can be considered
one type of input, but that is inefficient.
– Do you want to record many null elements?!
• Classify the types of so-called inputs.
– Nondeterministic Input(s)
– Interrupt(s)
• These are just names; they do not describe
the categories literally.
(Diagram: Input, Trace, Calc/Proc + Internal context, Initial state)

Applying to x86 (2.1)
• Nondeterministic Inputs
– The timing at which the internal context becomes
undetermined can itself be determined uniquely
(like the IN instruction on x86.)
– But you cannot determine the actual value
or contents without running it.
– Save the actual value or contents,
but don't save the timing.
• We can determine the timing from
the recent internal context and interrupts.

• Interrupts
– The timing is not uniquely predictable.
– The actual content can also be nondeterministic.
– In this case, trace the timing. Additionally,
if the actual content of the interrupt is nondeterministic,
trace it too.
• e.g. Interrupt vector number (hardware interrupt)
• The most important thing is:
– Based on this classification, we have to
classify every element in the virtual machine.

• Modeling: VM-Internal Disk
– Assume the VM-internal disk is reliable and
record the initial disk image.
– Almost all elements are deterministic
except the interrupts that the disk generates.
• The content read is equivalent to
the content last written.
• But the timing of the ATA interrupt cannot be
predicted exactly, so we treat it as an Interrupt.

• Modeling: Mouse, Keyboard, Network
– They are unpredictable, external inputs.
– Input from these devices uses both
x86 interrupts and I/O port operations.
– So they are both Interrupts and Nondeterministic Inputs.
– Network packets you send are recovered from
the internal context.

• Modeling: Time Stamp Counter (CPU)
– The clock count since computer reset,
which can be read with the RDTSC instruction.
– Treat it as a Nondeterministic Input.
– Even if the value physically resides inside
the CPU, you should treat such values as inputs when
they produce unpredictable results.
• Even if you could model this as deterministic,
the implementation could be inefficient.
• NOT treating it as deterministic improves
VM emulation efficiency.

• Modeling: CPU exception
– Almost all exceptions are deterministic,
including their timing.
• A Page Fault occurs because the CPU has
accessed an invalid memory address.
– So this is not even an input.
• Modeling: Undetermined behavior of the CPU
– After some CPU operations, part of the internal context
can be nondeterministic. (The value/behavior is undefined
by the architecture.)
– Treat this as a Nondeterministic Input.

• Modeling: Inexact Arithmetic Operation
– For transcendental instructions such as FSINCOS and FPATAN,
the architecture does not define the exact value because
specifying it is very difficult.
– The minimum information needed to
recover the original value is treated as a
Nondeterministic Input.
• Likewise, we have to model *everything*
– Implementation is relatively difficult.

Applying to x86 (4)
• What if we treat X as nondeterministic?
– The number of hooks increases.
– The trace log gets bigger and execution gets slower.
– Fewer is better.
• I thought these nondeterministic events were
much, much rarer than normal instructions, so
there would be no problem.
– But I was wrong.

What do you think?
• Is this instruction deterministic?
XOR edx, edx
– As you know, this instruction just
clears the edx register.
– But the answer is No.
• Many normal operations make some part of
the internal context nondeterministic.
– IT IS EFLAGS.

The curse of EFLAGS? (1)
• Let's look inside.
– edx IS zero. On the other hand,
EFLAGS.AF is updated to "?".
– Intel's manual says this value is undefined
(it can vary.)
State     edx            OF SF ZF AF PF CF
Before    xxx......xxx   x  x  x  x  x  x
After     000......000   0  0  1  ?  1  0
(After = state at the next instruction following XOR edx, edx)

The curse of EFLAGS? (2)
• This is not the end!
– The same goes for these frequently used instructions.
– According to profiling, 10-15% of instructions
make part of EFLAGS undefined!
OF SF ZF AF PF CF   Instructions
0  M  M  ?  M  0    AND, OR, XOR, TEST (Logical Arithmetic)
M  ?  ?  ?  ?  M    MUL, IMUL (Multiplication)
?  ?  ?  ?  ?  ?    DIV, IDIV (Division)
?  M  M  ?  M  ?    SHL, SHR, SAL, SAR count (Shift)
(0 = cleared, M = modified according to the result, ? = undefined)
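The flag table above can be turned into a small lookup that a tracer could consult to decide which flags become undefined after an instruction. This is a sketch: the group names (`logic`, `mul`, `div`, `shift`) and the function name are made up for illustration, and the flag effects are copied from the slide as given.

```python
from typing import Dict, List

FLAGS = ("OF", "SF", "ZF", "AF", "PF", "CF")

# Effect on each flag, in FLAGS order, as listed on the slide:
# '0' = cleared, 'M' = modified according to the result, '?' = undefined.
FLAG_EFFECTS: Dict[str, str] = {
    "logic": "0MM?M0",  # AND, OR, XOR, TEST
    "mul":   "M????M",  # MUL, IMUL
    "div":   "??????",  # DIV, IDIV
    "shift": "?MM?M?",  # SHL, SHR, SAL, SAR with a count
}

def undefined_flags(group: str) -> List[str]:
    """Return the flags an instruction group leaves undefined."""
    return [flag for flag, effect in zip(FLAGS, FLAG_EFFECTS[group])
            if effect == "?"]
```

For example, `undefined_flags("logic")` reports only AF, which is the flag that makes `XOR edx, edx` nondeterministic in the previous slide.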

The curse of EFLAGS? (3)
• Not "much, much fewer" at all!
– Even at 10% of instructions, the overhead of hooking
cannot be ignored.
– We could choose not to trace EFLAGS.
For instance, we could force the EFLAGS register to
a deterministic value. But...
• Updating flags (POPF) is extremely slow!
• 24-25 clocks on the Intel Nehalem microarchitecture (Core i7)
– To avoid this problem, we need to
keep execution from being affected by these values.

The implementation problem (1)
• Public Record and Replay implementations
do not care about this condition!
– They just restrict the processor model.
If we record the program on processor model A,
we need to replay on exactly the same model.
– This prevents distributed analysis.
– Normally, programs don't depend on these
undefined (nondeterministic) values.
• But technically, a single nondeterministic bit
can cause chaos.

The implementation problem (2)
• What is RIGHT?
– We cannot know exactly which CPU model is right.
– I want to integrate the information into one log.
No more compatibility/portability problems.
• Depending on the processor model is no good for reverse engineering.
– I want robustness!

EFLAGS : Lazy Evaluation (1)
• EFLAGS and programs have these characteristics:
– Over 80% of updated flags are simply discarded.
• We want to trace *everything*, but it is
worthless to trace a value that is never used.
– Flag updates and flag evaluations are
adjacent in most cases.
• e.g. Compare → Jump Conditionally
• Intel uses this for an optimization! (Macro-Fusion)
– How about lazy evaluation?
• Trace a nondeterministic EFLAGS value
only when it is used.

• Current Implementation:
– JIT compiling with static evaluation
(to make programs run faster.)
– Evaluate each instruction block
• From the instruction after a jump operation
to the next unconditional jump (instruction/exception).
• Scan each block forward.
– Evaluate the propagation of virtual EFLAGS.
• Whether each flag is deterministic (Initial Value : No)
• The last instruction that updated the flag value.
• We use heuristics.

• (cont.)
– If an instruction in the block depends on these
flags and the virtual flag satisfies either condition below,
we simply consider the value nondeterministic.
• The value of the virtual flag is nondeterministic.
• The value is deterministic but the updating instruction
is too old (32 bytes / 8 instructions or more back.)
• Currently, this is very effective.
– I found that almost all traced flags are traced
during interrupt handling / context switches.
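The forward scan described above might look roughly like the following. This is a sketch under assumptions, not the actual Niizh code: the `Insn` record, its `reads` / `writes` fields, and `flags_to_trace` are hypothetical, and only the two conditions from the slide (nondeterministic value, or updater 8 instructions / 32 bytes or more back) are modeled.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple

MAX_AGE_INSNS = 8   # "8 instructions or more" from the slide
MAX_AGE_BYTES = 32  # "32 bytes or more" from the slide

@dataclass
class Insn:
    offset: int                      # byte offset inside the block
    name: str
    # flag -> True if the written value is architecturally defined
    writes: Dict[str, bool] = field(default_factory=dict)
    reads: Set[str] = field(default_factory=set)

def flags_to_trace(block: List[Insn]) -> List[Tuple[int, str]]:
    """Scan one block forward; return (instruction index, flag) pairs
    whose flag value must be treated as nondeterministic and traced."""
    # Virtual EFLAGS: flag -> (deterministic?, index of last updater).
    virt: Dict[str, Tuple[bool, int]] = {}
    traced: List[Tuple[int, str]] = []
    for idx, insn in enumerate(block):
        for flag in sorted(insn.reads):
            # Initial value: not deterministic (set outside this block).
            deterministic, updater = virt.get(flag, (False, -1))
            too_old = deterministic and (
                idx - updater >= MAX_AGE_INSNS
                or insn.offset - block[updater].offset >= MAX_AGE_BYTES
            )
            if not deterministic or too_old:
                traced.append((idx, flag))
        # Flags are read before they are overwritten by the same instruction.
        for flag, defined in insn.writes.items():
            virt[flag] = (defined, idx)
    return traced
```

With this sketch, `XOR edx, edx` followed immediately by a conditional jump on ZF needs no trace entry, while a later read of AF does.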

Record and Replay : Conclusion
• Using "Record and Replay", we can decrease
the size of the trace log and the tracing overhead.
• Using (my) improved method,
we can acquire a robust trace log on the x86 platform.

Implementation
• I am implementing a VMM-based tracer.
– To run general purpose OSes.
• But it was not a good idea. Because of its
complexity, I couldn't finish the VMM (Nov 2010.)
– Using binary translation
• Read guest instructions and transform them
to run on the host platform.
– I chose the x64 platform to implement the VMM.
• There are some reasons why x64 is good for
binary translation-based x86 emulation.

x86 on x64 (1)
• x64 is a 64-bit extension to the x86 architecture.
– AMD, Intel and VIA have x64 extensions.
– Very similar instruction format.
– Some extensions:
• More general purpose and XMM registers (8→16)
• New addressing modes
(64-bit, RIP [program counter] relative)
• Many of its features make it easier to implement
a binary translation-based VMM.

x86 on x64 (2.1)
• Benefit : 32-bit registers and clamping
– The general purpose register format is based on
the original (sharing the lower bits.)
• e.g. ax (16-bit), eax (32-bit), rax (64-bit)
– If you run an instruction whose destination is
a 32-bit register, the upper 32 bits of the corresponding
64-bit register are cleared!
MOV eax, 0x01234567   ; eax = 0x01234567
MOV ax, 0x1234        ; eax = 0x01231234 (a 16-bit write keeps the upper bits)

MOV rax, 0x0123456789abcdef   ; rax = 0x0123456789abcdef
MOV eax, 0x12345678           ; rax = 0x0000000012345678 (a 32-bit write clears the upper 32 bits)
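The register-write behavior in the two examples can be modeled in a few lines. This is only an illustrative model of the x64 rule (16-bit writes preserve the upper bits, 32-bit writes zero the upper 32 bits); the function names are made up.

```python
MASK16 = 0xFFFF
MASK32 = 0xFFFF_FFFF
MASK64 = 0xFFFF_FFFF_FFFF_FFFF

def write64(old: int, value: int) -> int:
    """MOV rax, imm64: replaces the whole register."""
    return value & MASK64

def write32(old: int, value: int) -> int:
    """MOV eax, imm32: upper 32 bits of rax are cleared (the 'clamp')."""
    return value & MASK32

def write16(old: int, value: int) -> int:
    """MOV ax, imm16: only the low 16 bits change."""
    return (old & (MASK64 ^ MASK16)) | (value & MASK16)
```

This is why a guest 32-bit register held in a 64-bit host register always stays in the range 0x0000_0000-0xffff_ffff as long as every guest write uses a 32-bit destination.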

x86 on x64 (2.2)
• Benefit : Increased Registers (GPR/XMM)
– 8→16 each (16 additional registers including XMM regs.)
– Save the emulator's context without
destroying the existing registers.
rax r8
rcx r9
rdx r10
rbx r11
rsp r12
rbp r13
rsi r14
rdi r15
xmm0 xmm8
xmm1 xmm9
xmm2 xmm10
xmm3 xmm11
xmm4 xmm12
xmm5 xmm13
xmm6 xmm14
xmm7 xmm15

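x86 on x64 (2.2)
Actual register mapping table.
For memory/cache optimization,
some registers are relocated.
Host    Use        Host    Use
rax     eax        r8      cs.base
rcx     ecx        r9      es.base
rdx     edx        r10     emuinfo
rbx     ds.base    r11     ebx
rsp     stack      r12     esp
rbp     ebp        r13     tmp2
rsi     esi        r14     ss.base
rdi     tmp1       r15     edi
xmm0    xmm0       xmm8    fs.base
xmm1    xmm1       xmm9    gs.base
xmm2    xmm2       xmm10   tmp3
xmm3    xmm3       xmm11   tmp4
xmm4    xmm4       xmm12   notused
xmm5    xmm5       xmm13   notused
xmm6    xmm6       xmm14   notused
xmm7    xmm7       xmm15   notused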

x86 on x64 (2.2)
– XMM registers are sometimes awkward to use,
but we can transfer values to GPRs with the movq instruction.

x86 on x64 (2.3.1)
• Benefit : Retained Addressing Format
– Some addressing modes are added, but
the addressing format is still x86-based.
– x86 has complex addressing modes:
• e.g. 2 adds and 1 shift : [esi+edx*4+123]
• We can use this to separate memory accesses!
– Address Translation : [segbase+offset]
• All memory accesses are segbase-relative.
(segbase contains the 64-bit address of the segment base.)
– Achieving Memory Isolation
• Like Google Native Client for x64

x86 on x64 (2.3.2)
– (e.g. 1) : inc [ds:ecx] → inc [rbx+rcx]
• rbx : Base address of DS segment.
• rcx : Guest ECX register.
– Wait a minute, the ecx register is 32-bit, but
this uses rcx, which is a 64-bit register!
(Are you sure about that?)
• No problem. As I described before,
results of 32-bit operations are clamped.
• We can guarantee that the value of
rcx is in the 32-bit range (0x0000_0000-0xffff_ffff.)

x86 on x64 (2.3.3)
– (e.g. 2 [wrong]) : inc [ds:ecx+edx] → inc [rbx+rcx+rdx]
• rcx+rdx can exceed the 32-bit range, so
store the intermediate result in a temporary register.
– (e.g. 2 [correct]) : inc [ds:ecx+edx] →
lea edi, [rcx+rdx] ; inc [rbx+rdi]
• edi/rdi : Temporary register
• Almost the same as the first example.
– I take the best encoding x64 has.
• Store a 64-bit address into a 32-bit register!
• This is also a valid encoding. The address is automatically
clamped and the instruction is shorter.
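The difference between the wrong and the correct translation of `inc [ds:ecx+edx]` comes down to where the 32-bit truncation happens. The sketch below models only the address arithmetic; the segment base value used in the usage note is an arbitrary assumption.

```python
MASK32 = 0xFFFF_FFFF

def host_address_wrong(segbase: int, ecx: int, edx: int) -> int:
    """inc [rbx+rcx+rdx]: the guest offset is never wrapped to 32 bits."""
    return segbase + ecx + edx

def host_address(segbase: int, ecx: int, edx: int) -> int:
    """lea edi, [rcx+rdx] ; inc [rbx+rdi]"""
    offset = (ecx + edx) & MASK32   # writing to edi clamps the sum
    return segbase + offset
```

With a hypothetical segment base of 0x1_0000_0000, ecx = 0xFFFF_FFF0 and edx = 0x20, the guest expects offset 0x10. `host_address` gives 0x1_0000_0010, while `host_address_wrong` gives 0x2_0000_0010, which is 4GB past the intended location.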

x86 on x64 (2.4.1)
• Benefit : Huge Memory Range
– 64-bit address width
• Valid 48-bit (sign extended) logical address.
• 0x0000_1234_5678 → 0x0000_0000_1234_5678
• 0x8000_1234_5678 → 0xffff_8000_1234_5678
– We can place the data/code that the VMM uses
outside the guest-accessible region.
• x86 on x86 needed address compression
to store host/guest data in the same address space.
• This increases VMM speed.
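The 48-bit sign extension shown in the two address examples above can be checked with a short helper. This is a sketch; the function name is made up.

```python
def canonical(addr48: int) -> int:
    """Sign-extend a 48-bit logical address to a 64-bit canonical address."""
    addr48 &= (1 << 48) - 1
    if addr48 & (1 << 47):               # bit 47 set: upper half
        return addr48 | (0xFFFF << 48)   # copy bit 47 into bits 48..63
    return addr48                        # lower half: upper bits stay zero
```

Both examples from the slide come out as listed: 0x0000_1234_5678 stays in the lower half, and 0x8000_1234_5678 maps to 0xffff_8000_1234_5678.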

x86 on x64 (2.4.2)
– But allocating just 4GB is not enough.
The result of address calculation can over/underflow.
– In 32-bit mode on x86, address calculation is
done with 32-bit precision and overflow/underflow
is ignored. This means the lower 32 bits are equivalent
to the actually accessed memory address.
– So we modify the page table to satisfy:
equal lower 32 bits == same physical address

x86 on x64 (2.4.3)
– Allocate a virtual memory region.
– Considering address overflow, we allocate
a virtual memory range of up to 44.5GB.
• The red and blue areas point to exactly the same physical region.
• We use the page table to achieve this.
(Diagram labels: 44.5GB, 42.25GB, 2.25GB)

x86 on x64 (2.4.4)
– Allocate a virtual memory region for each
segment and/or segment access control.
• On a segment switch, just change the base address.
(Diagram labels: cs.base, ds.base, es.base, ss.base; regions data3, code0, data3, code3)

x86 on x64 (2.5.1)
• Benefit : Simplified Architecture
– The x64 architecture is relatively simple,
which makes implementing a Type-2 VMM easier.
• Only two interrupt handler types:
– Interrupt Gate and Trap Gate
• Segments are now a mere façade.
– Flat memory model for CS, DS, ES and SS.
– Replace the IDT (interrupt vector table) to
allocate a VM-specific context.
• PatchGuard compatible!
• Nearly stealthy, but cannot hook system calls.

x86 on x64 (2.5.2)
• Benefit : Simplified Architecture
– Pass interrupts through
• We can do it safely with IDT switching.
• There is some overhead.
The actual implementation is a bit more complicated;
this is only a summary.
(Diagram labels: VM, OS, IDT switch (x2), OS Kernel, VM Trampoline, OS IntHandler, VM Entry, VM IntHandler, VM Kernel)

x86 on x64 (3)
• Using these techniques, I implement
binary translation.
– But currently, it is still incomplete.
• To trace timing, the following
information is required.
– Value of a branch counter
(a software implementation is possible.)
– Current program counter (IP, EIP)
– Repeat count (CX, ECX)
• only while a rep instruction is executing.
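The three values listed above together identify one exact point in execution, so an interrupt can be logged against that point and re-delivered there during Replay. The record layout below is an assumption for illustration only; `ExecPoint`, `InterruptRecord` and `due_interrupts` are hypothetical names, not the Niizh log format.

```python
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class ExecPoint:
    branch_count: int   # branches executed so far
    eip: int            # current program counter
    ecx: int = 0        # repeat count, only meaningful inside a rep instruction

@dataclass(frozen=True)
class InterruptRecord:
    when: ExecPoint     # timing of the interrupt
    vector: int         # nondeterministic content (hardware interrupt vector)

def due_interrupts(log: List[InterruptRecord], now: ExecPoint) -> List[int]:
    """Return the vectors that must be injected at this execution point."""
    return [rec.vector for rec in log if rec.when == now]
```

The program counter alone is not enough because a loop passes the same EIP many times; the branch counter tells the iterations apart, and ECX separates the iterations of a single rep instruction.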

Everything into the Ring-0
• Is privilege isolation required?
– Dynamic code is generated safely and
well isolated, which makes it possible to run
everything in kernel mode (Ring-0.)
• Low-overhead implementation.
• The current implementation does this.
– If this is too dangerous, you can also
run the code in user mode (Ring-3.)

Trace size test (1.1)
• Required trace log size
– DLX Linux bundled with Bochs 2.45
• From computer reset until the login screen.
• 52,217,403 instructions (no-emulation : 53 sec)
– Specs
• 1 MIPS (1,000,000 instructions/sec)
• 32MB MEM, 10MB HDD
– Used Bochs to generate an instruction/memory trace
and converted it with each method.

• Required trace log size
– The size of the initial context is not included.
– I modeled the devices in the Bochs emulator
and estimated the required trace log size.
– Because the model is simplified, the sizes
are only estimates (not exact values.)

• Methods (comparison included)
– Raw : Text-format instruction/memory trace generated by Bochs.
– Verbose : Normal tracer (like OllyDbg does)
– Dumb : Record and Replay plus memory monitoring.
– RnR (1) : Record and Replay (tracing EFLAGS)
– PROPOSAL : Improved Record and Replay method
– RnR (2) : Record and Replay (IGNORING EFLAGS)

Method      Size (bytes)       Size
Raw         7,178,948,236      6.68GB
Verbose     > 419,430,400      > 400MB
Dumb        60,713,538         57.90MB
RnR (1)     6,932,542          6.61MB
PROPOSAL    389,013            380KB
RnR (2)     31,788             31KB
This table shows that PROPOSAL generates only about 1/1,000 of the trace log
of the Verbose tracer. The Record and Replay method that ignores EFLAGS
produces an even smaller log than PROPOSAL, but it has low portability.
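The ratio quoted above, and the average log rate against the "<5MB/s" requirement, can be recomputed from the table. The sizes and the 53-second run time are taken from the slides; the Verbose figure is only a lower bound, so the ratio is a lower bound too. The helper names are made up.

```python
SIZES = {                      # trace log size in bytes, from the table
    "Raw": 7_178_948_236,
    "Verbose": 419_430_400,    # lower bound ("> 400MB")
    "Dumb": 60_713_538,
    "RnR (1)": 6_932_542,
    "PROPOSAL": 389_013,
    "RnR (2)": 31_788,
}
RUN_SECONDS = 53               # reset to login screen, without emulation

def ratio_to(method: str, baseline: str = "Verbose") -> float:
    """How many times smaller `method` is than `baseline`."""
    return SIZES[baseline] / SIZES[method]

def bytes_per_second(method: str) -> float:
    """Average trace log rate per real-second."""
    return SIZES[method] / RUN_SECONDS
```

`ratio_to("PROPOSAL")` is a little under 1,080, matching the "about 1/1,000" statement, and `bytes_per_second("PROPOSAL")` is roughly 7.3KB/s, far below the 5MB/s target for this small workload.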

(Chart: trace log size in bytes for each method; logarithmic axis from 10,000 to 10,000,000,000)

• Conclusion
– These results do not come from an actual implementation,
so some points are open to doubt.
– Despite this, the proposed method generates
a really small trace log compared to older methods.

Overhead tests
(Chart: min and max, without Tracer vs. with Tracer; axis from 0 to 40, units not given)

Possible Practical Uses
Application

Possible Practical Uses (1)
• Reverse Engineering (non-Malware)
– Everything that *ran* is *recorded*
• All your program are belong to us!
• Program behavior is recorded,
including VM detection and/or anti-debugging.
– Of course the program is unpacked/decrypted.
• You can integrate multiple analyses.

• Avoiding Anti-debugging/Anti-VM
– No well-known backdoor.
– But a binary translation-based VM can be detected
by running specific code.
• e.g. Self-modifying code is (extremely) slow.
– You can find out how the VM is detected.
At the least, you can extract useful information for
avoiding VM detection.
• Protection in normal programs is not so strong.

• Reverse Engineering (Malware)
– It is DANGEROUS to run malware directly!
– However, if you can take care of these problems,
this tracer can be useful.
– Honeypots?

• Fuzzing / Exploit analysis / Bug discovery
– Imagine that Valgrind is applied to all programs
and you can still use the guest programs interactively.
– With offline analysis, you can find and track
memory corruption.
– If you can reproduce the issue,
you can extract useful information.
– However, whether it is efficient for fuzzing
can be very implementation-dependent.

• Analysis Support
– Export to other well-known tools.
• e.g. Wireshark
– In this case, you have the program's behavior, so
you can add metadata and/or supplemental info.
• e.g. SSL/TLS auto decryption
• You cannot steal a key from a packet dump, but
remember, you can run the program that uses
the private (common/shared) key!

• <<Place Entry Here>>
– I guess you can use it for other purposes.
– I hope that many people will make the most of
this type of tracer.

Future Challenges / Conclusion

Challenge : Multicore (1)
• The original Record and Replay is not designed for
multi-processing environments.
– Heavy communication between CPUs makes the tracer slow.
– Almost all implementations are restricted to
1 CPU/thread. (mine, too)
• But that doesn't mean it is impossible.
– Time-sharing
– Software emulation of MESI protocol
– Trace memory contents

• Time-sharing
– Only one CPU runs at a time.
– Switch CPU execution on a timer to
simulate running multiple CPUs.
• Pros.
– Almost no synchronization required.
• Cons.
– More CPUs, less efficiency.
– Difficult to reproduce multi-threading problems
because this is not true multi-processing.

• Software Implementation of the MESI protocol
– A memory coherency algorithm
– CPUs use this protocol (or its variants) to
keep memory/cache coherent.
– We can implement this using page-level protection.
– Lock a page in order to write to it.
• Pros.
– High efficiency when few pages are shared.
• Cons.
– The software implementation is quite slow.

• Trace Memory contents
– Also trace the memory contents read from shared pages.
• Pros.
– Can achieve high efficiency... maybe.
• Cons.
– It is not a perfect-information tracer.
(Which CPU has written this value?!)
– Memory tracing is slow.
• A bandwidth monster may be required.

Challenge : 64-bit / Others
• x64 on x64 is very difficult.
– There are some ways, but they are not so efficient.
• SSE2 / Reciprocal, Square root instructions
– These instructions are not required to produce exact values
and they are fast to run (this is a problem.)
• Hypervisor again?
– Record a non-portable trace and convert it to
a portable one (using the same processor model.)
– This is not perfect, but it is a possible choice.

CAUTION : PATENTS
• Some of these techniques are patented!
– Record and Replay
– Optimization for Binary Translation based VMM.
– Difficult/Impossible to avoid these patents.
• However, all the patents I have found are
United States patents only, and I guess using this
tracer outside the US is no problem.
– Be careful.

Conclusion
• I described how to build a tracing VMM for
x86 on x64.
• Using the proposed method, the trace log gets smaller
and the overhead gets lower too.
– However, proper tests (validation) are required
to check whether this is useful for reverse engineering.
• Many practical uses!
– Any others?

contact me at : li at livegrid dot org
Open Source Project : Niizh
will be available at http://niizh.org/
Thank you!
Any questions?
