A recurring and fundamental challenge that a reverse engineer (RE) experiences is understanding the behavior and functionality exhibited by a binary under examination. To complicate matters, the skills needed to succeed in this challenge vary significantly across practitioners, and it can often take a considerable amount of experience (5-7 years) to achieve a sufficient level of competence. Recent academic work in applying machine learning (ML) to reverse engineering shows promise in addressing these issues in a way that can allow junior reverse engineers to make substantial contributions to RE tasking and can allow RE work to be performed in a scalable manner across platforms and architectures.
In this talk, we will discuss how ML techniques can be leveraged to classify behavioral characteristics (e.g. crypto, file I/O, network, IPC, and trampoline) exhibited by a function in a manner that can scale well and without the need for humans to perform labeling. We will also discuss how these techniques can be applied to identify/recover function symbols in stripped binaries. As part of the discussion, we will also explore approaches that have the potential to allow concepts and ideas presented in these academic works to be applied to real world RE problems.
2. ABOUT ME
https://www.linkedin.com/in/malachijonesphd
▪ Education
• Bachelor's Degree: Computer Engineering (Univ. of Florida, 2007)
• Master’s Degree: Computer Engineering (Georgia Tech, 2009)
• PhD: Computer Engineering (Georgia Tech, 2013)
▪ Cyber Security Experience
• PhD Thesis: Asymmetric Information Games & Cyber Security (2009-2013)
• Harris Corp.: Cyber Software Engineer/ Vuln. Researcher (2013-2015)
• Booz Allen Dark Labs: Embedded Security Researcher (2016-2018)
• MITRE: Lead Embedded Security Researcher (2018-Present)
3. OUTLINE
▪ Motivation
I. What is Software Reverse Engineering (RE)?
II. Why is it Difficult?
III. Challenges of Automating RE
▪ Scope
▪ Background
I. Machine Learning
II. Markov Decision Process (MDP)
▪ Function Matching and Applying Historic Knowledge at Scale
▪ Leveraging ML & Markov Chains for Function Classification
▪ Conclusion
5. WHAT IS SOFTWARE RE
▪ Reverse Engineering (RE): The process by which you
deconstruct something to see and understand how
it works
▪ Software RE
• Characterization of software functionality/behavior that
includes intended and unintended functionality/behavior
(e.g. vulnerabilities)
• Characterization of the target environment
• Understanding of the executable's purpose (e.g. Why
does it do what it does?)
6. WHAT IS SOFTWARE RE: EXAMPLE
Case Study: “Sample J” Malware
(sha1: 70cb0b4b8e60dfed949a319a9375fac44168ccbb)
7. WHAT IS SOFTWARE RE: EXAMPLE
Case Study: “Sample J” Malware
memset(,0, 0x124u)
CreateToolhelp32Snapshot()
Process32First()
stricmp(., “explorer.exe”)
Process32Next(.,.)
stricmp(., “explorer.exe”)
CreateThread(,,StartAddress)
Sample J Call Trace Example
8. WHAT IS SOFTWARE RE: EXAMPLE
Case Study: “Sample J” Malware
memset(,0, 0x124u)
CreateToolhelp32Snapshot()
Process32First()
stricmp(., “explorer.exe”)
Process32Next(.,.)
stricmp(., “explorer.exe”)
CreateThread(,,StartAddress)
Sample J Call Trace Example
▪Call Trace Analysis
i. Iterate through the process handles to see if a process with
name “explorer.exe” exists (Check if user logged in)
ii. If process exists, create a thread that infects target system
9. WHAT IS SOFTWARE RE
▪ Applications
• Vulnerability Research
• Malware Research
• Firmware Analysis
• Binary Repurposing
11. WHY IS RE DIFFICULT
▪ Reverse engineering (RE) applications (e.g. malware and vulnerability analysis)
have historically been a manual and time-intensive process performed by skilled
practitioners
▪ Substantial barriers to entry for RE practitioners
▪ The prevalence, ubiquity, and growth of malware and IoT devices
outpace the supply of skilled practitioners
12. WHY IS RE DIFFICULT
▪ Non-trivial (often manual-intensive) RE Processes
1. Identifying what the code blob is (e.g. ELF, PE, or
compressed)
2. Unpacking a code blob if it is packed
3. Cleaning up a disassembled binary image
4. Applying historic knowledge to a target function that is
similar to a previously analyzed function
5. Understanding function behavior and binary purpose
14. CHALLENGES OF AUTOMATING RE
▪ Computational Complexity Considerations
▪ Architecture Idiosyncrasies
▪ Humans are inherently better at some tasks (e.g.
reasoning)
▪ Software Development & Testing effort
16. SCOPE
•Non-trivial (often manual-intensive) RE Processes
1. Identifying what the code blob is (e.g. ELF, PE, or
compressed)
2. Unpacking a code blob if it is packed
3. Cleaning up a disassembled binary image
4. Applying historic knowledge to a target function that is
similar to previously analyzed functions
5. Understanding function behavior and binary purpose
Focus of this presentation
18. MACHINE LEARNING: TYPES
▪ Three Types of Machine Learning:
• Supervised Learning: labeled data, direct feedback, predict outcome/future
• Unsupervised Learning: no labels/targets, no feedback, find hidden structure in data
• Reinforcement Learning: decision process, reward system, learn series of actions
19. MACHINE LEARNING: TYPES
▪ Three Types of Machine Learning:
• Supervised Learning: labeled data, direct feedback, predict outcome/future (of interest for this talk)
• Unsupervised Learning: no labels/targets, no feedback, find hidden structure in data
• Reinforcement Learning: decision process, reward system, learn series of actions
20. SUPERVISED LEARNING
▪ Supervised learning: Main goal is to learn a model from labeled
training data that allows us to make predictions about unseen or
future data
▪ Supervised refers to a set of samples where the desired output
signals (labels) are already known
(Diagram: labels and training data feed a machine learning algorithm, which produces a predictive model; new data fed to the model yields predictions)
21. SUPERVISED LEARNING EXAMPLE
▪ Suppose we didn’t know what “features”
differentiate reptiles from mammals
▪ Note: Features can be viewed as properties
exhibited by a type/class
▪ Suppose also that we are given a sample set of
animals that have been labeled as either mammal or
reptile
22. SUPERVISED LEARNING EXAMPLE
▪ Sample Set
▪ Common Features: eyes, 4 legs, tail, tongue, etc…
▪ Differentiating Feature: Dogs and elephants give birth
to live offspring; alligators and iguanas don’t
▪ Based on the sample set, we've learned a feature that
separates mammals from reptiles
(Images: dog and elephant labeled Mammal; alligator and iguana labeled Reptile)
23. SUPERVISED LEARNING EXAMPLE
▪ Sample Set
▪ Differentiating Feature: Dogs and elephants give birth
to live offspring; alligators and iguanas don’t
▪ The quintessence of Supervised ML is learning which
features/properties separate objects that differ in type
▪ These differentiating features then allow us to
make predictions about the type of an unlabeled object
(Images: dog and elephant labeled Mammal; alligator and iguana labeled Reptile)
24. TYPES OF ML: UNSUPERVISED LEARNING
▪ Unsupervised Learning: Dealing with unlabeled data or data of unknown structure
▪ In supervised learning, we know the right answer beforehand when we train our model
▪ Using unsupervised learning techniques, we can explore the structure of our data to extract
meaningful information without the guidance of a known outcome variable or reward function
27. MARKOV DECISION PROCESS
▪ Markov Process: A stochastic process that satisfies
the Markov property
▪ Markov Property: A process where one can make
predictions for the future of the process based
solely on its present state; memorylessness
▪ Markov Chain: A type of Markov process that has a
discrete state space
28. MARKOV DECISION PROCESS
▪ Markov Chain Example*: Predicting the Weather
▪ Interpretation: The probabilities of weather conditions (i.e. modeled as either rainy or sunny), given the weather on the preceding day
(State diagram: Sunny stays Sunny with probability .9 and transitions to Rainy with .1; Rainy transitions to Sunny with .5 and stays Rainy with .5)
*(Example originally presented at https://en.wikipedia.org/wiki/Examples_of_Markov_chains)
29. MARKOV DECISION PROCESS
▪ Markov Chain Example*: Predicting the Weather
▪ The transition probabilities can be represented by
the following transition matrix:
P = [ .9  .1 ]
    [ .5  .5 ]
(rows: today's weather Sunny, Rainy; columns: tomorrow's weather Sunny, Rainy)
*(Example originally presented at https://en.wikipedia.org/wiki/Examples_of_Markov_chains)
30. MARKOV DECISION PROCESS
▪ Markov Chain Example*: Predicting the Weather
▪ Suppose the weather on day 1 is sunny; then we can
represent it by the following state vector:
x(1) = [ 1  0 ]
(entries: probability of Sunny, probability of Rainy)
*(Example originally presented at https://en.wikipedia.org/wiki/Examples_of_Markov_chains)
31. MARKOV DECISION PROCESS
▪ Markov Chain Example*: Predicting the Weather
▪ The weather on day 2 can be predicted by:
x(2) = x(1) P = [ .9  .1 ]
*(Example originally presented at https://en.wikipedia.org/wiki/Examples_of_Markov_chains)
32. MARKOV DECISION PROCESS
▪ Markov Chain Example*: Predicting the Weather
▪ The weather on day 3 can be predicted by:
x(3) = x(2) P = [ .86  .14 ]
*(Example originally presented at https://en.wikipedia.org/wiki/Examples_of_Markov_chains)
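The day-by-day predictions above can be reproduced with a few lines of code. A minimal sketch (plain Python, no libraries; names are illustrative, not from the talk):

```python
# Minimal sketch: evolving the weather Markov chain from these
# slides by repeated vector-matrix products.

# Transition matrix from the state diagram:
# rows = today's weather (Sunny, Rainy), columns = tomorrow's.
P = [[0.9, 0.1],
     [0.5, 0.5]]

def step(x, P):
    """One transition of a row vector: x' = x P."""
    return [sum(x[i] * P[i][j] for i in range(len(x)))
            for j in range(len(P[0]))]

x = [1.0, 0.0]  # day 1: sunny with certainty
for day in (2, 3):
    x = step(x, P)
    print(f"day {day}: P(sunny)={x[0]:.2f}, P(rainy)={x[1]:.2f}")
# day 2: P(sunny)=0.90, P(rainy)=0.10
# day 3: P(sunny)=0.86, P(rainy)=0.14
```

Repeating `step` simply multiplies the day-1 vector by higher powers of P, which is exactly what the slides compute by hand.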
34. DERIVING FUNCTION META-DATA
▪ A recurring RE task consists of labeling and annotating a
function that has not been previously analyzed
▪ Annotating can include the following
• Comments about context and behavior of a set of instructions
• Identifying and labeling key data structures
• Determining the function’s prototype
▪ Note: Can also record who derived the annotations to assess
reputation/accuracy/trustworthiness
▪ For the remainder of this talk, we’ll refer to this information
as the function’s meta-data.
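As an illustrative sketch only (the field names are assumptions, not a format proposed in this talk), the meta-data described above could be captured in a simple record:

```python
# Illustrative sketch: a record for the function meta-data described
# on this slide. Field names are assumptions.
from dataclasses import dataclass, field

@dataclass
class FunctionMetaData:
    prototype: str                                       # derived function prototype
    comments: dict = field(default_factory=dict)         # address -> comment text
    labeled_structs: list = field(default_factory=list)  # key data structures
    author: str = ""                                     # who derived the annotations

md = FunctionMetaData(prototype="BOOL check_explorer(void)", author="analyst1")
md.comments[0x401000] = "compares process name against 'explorer.exe'"
```

Recording the author, as the slide notes, is what would later allow reputation/accuracy to be assessed per annotation.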
36. DERIVING FUNCTION META-DATA
(Screenshot: a function labeled in IDA, showing a defined function prototype, an identified and labeled key variable, additional comments adding context about a comparison operation, and an identified and labeled key data structure)
37. LEVERAGING FUNCTION META-DATA
▪ A logical next step would be to apply the derived
function meta-data to functions in other binaries
that are either an exact match and/or similar
enough
▪ A huge bonus would be if we could perform this in a
disassembler-agnostic manner.
▪ Example: If function meta-data is generated in IDA
Pro, someone using Ghidra or Binary Ninja would
also be able to leverage that information.
38. LEVERAGING FUNCTION META-DATA
▪ Example: Disassembler Agnostic Approach
(Diagram: Binary A analyzed in Ghidra, Binary B in IDA Pro, and Binary C in Binja each contain func3() among other functions (func1(), func5(), func2(), func0()); func3() meta-data is saved once to a shared DB and then fetched by the other disassemblers)
39. STANDARD TECHNIQUE FOR EXACT MATCHES
▪ Example: Hello World (x86-64)
.global _start
.text
_start:
# write(1, message, 13)
mov $1, %rax # system call 1 is write
mov $1, %rdi # file handle 1 is stdout
mov $message, %rsi # address of string to output
mov $13, %rdx # number of bytes
syscall # invoke operating system to do the write
# exit(0)
mov $60, %rax # system call 60 is exit
xor %rdi, %rdi # we want return code 0
syscall # invoke operating system to exit
message:
.ascii "Hello, world\n"
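The slide shows the example function itself; the matching step is not spelled out here. One standard technique for exact matches (an assumption in this sketch, not a claim about a specific tool) is to hash a function's byte sequence and use the digest as a lookup key into a database of previously annotated functions:

```python
# Hedged sketch: exact-match function identification via hashing.
import hashlib

def function_digest(func_bytes):
    """SHA-1 digest of a function's raw bytes (exact-match key)."""
    return hashlib.sha1(func_bytes).hexdigest()

# Toy "database" of previously derived meta-data, keyed by digest.
prologue = b"\x55\x48\x89\xe5\xc3"   # push rbp; mov rbp, rsp; ret
db = {function_digest(prologue): {"name": "nop_stub"}}

meta = db.get(function_digest(b"\x55\x48\x89\xe5\xc3"))  # exact match hits
```

This only works for byte-identical functions; recompilation or different optimization settings change the bytes, which motivates the similarity and ML-based approaches later in the talk.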
49. FUNCTION CLASSIFICATION: MOTIVATION
▪ Another recurring task for RE practitioners is to characterize
a function’s behavior
▪ Behaviors can include the following:
• Computation
• Logic
• Crypto
• File I/O
• Networking
▪ Can you determine how the example function on the next
slides should be classified?
52. FUNCTION CLASSIFICATION: MOTIVATION
▪ Experienced RE practitioners can classify key
behaviors/operations of a function more readily than
their junior peers
▪ The ability to quickly classify behaviors/operations is a
key aspect that can allow an experienced RE
practitioner to spend more time focusing on what is
relevant to the RE task.
▪ Depending on the complexity and architecture of a
target function, even an RE SME can get bogged down
with classifying.
▪ This idea of function classification was presented in [2]
53. FUNCTION CLASSIFICATION: APPLYING ML
▪ We need at least two things to apply ML
1. Define a feature vector that can characterize an
arbitrary function
2. Obtain a “large enough” and diverse sample set of
functions with their appropriate classifications (e.g.
crypto or file I/O)
54. FUNCTION CLASSIFICATION: APPLYING ML
1. Defining a function feature vector
• We can model the function as a Markov Chain*
• The transition matrix P will then be our feature matrix
• An example of the states are the following:
i. Math
ii. Logic
iii. Memory
iv. Other
v. Branch
vi. Register
*The idea for leveraging Markov Chains was presented in [2]
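A minimal sketch of the feature construction described above: map each instruction mnemonic to a category state, count state-to-state transitions, and row-normalize into a transition matrix P. The mnemonic-to-category mapping below is an illustrative assumption, not the one used in [2]:

```python
# Sketch: building a Markov-chain feature matrix from a function's
# instruction sequence. Category mapping is illustrative only.
CATEGORIES = ["math", "logic", "memory", "branch", "register", "other"]
IDX = {c: i for i, c in enumerate(CATEGORIES)}

def category(mnemonic):
    m = mnemonic.lower()
    if m in ("add", "sub", "mul", "imul", "div", "inc", "dec"):
        return "math"
    if m in ("and", "or", "xor", "not", "shl", "shr"):
        return "logic"
    if m in ("mov", "lea", "push", "pop"):
        return "memory"
    if m in ("jmp", "je", "jne", "call", "ret", "cmp", "test"):
        return "branch"
    if m in ("xchg", "cdq", "cwde"):
        return "register"
    return "other"

def transition_matrix(mnemonics):
    """Row-normalized category-transition matrix over a mnemonic list."""
    n = len(CATEGORIES)
    counts = [[0] * n for _ in range(n)]
    for a, b in zip(mnemonics, mnemonics[1:]):
        counts[IDX[category(a)]][IDX[category(b)]] += 1
    P = []
    for row in counts:
        total = sum(row)
        P.append([c / total if total else 0.0 for c in row])
    return P
```

The rows of P (flattened into a single vector) would then serve as the feature vector fed to a supervised classifier.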
55. FUNCTION CLASSIFICATION: APPLYING ML
▪ Example Markov Chain Function Representation
(Diagram: a Markov chain whose states are instruction categories Math, Memory, Logic, Branch, Register, Stack, and Other, annotated with example transition probabilities .7, .2, .1, .1)
56. FUNCTION CLASSIFICATION: APPLYING ML
▪ Note: An intermediate step would be to lift the
disassembly to an intermediate representation (IR)
language (e.g. VEX IR)
(Diagram: the same Markov chain function representation as the previous slide)
57. FUNCTION CLASSIFICATION: APPLYING ML
2. Obtaining a “large enough” and diverse sample set
of functions with their appropriate classification
(e.g. crypto or file I/O)
• Labeling functions by hand does not scale, especially
across architectures; it is not feasible
• Commercial binaries tend to be stripped of symbol
information, so they are not a feasible source of labels either
• Open source code can allow for a diverse sample set
that provides symbol information that can be utilized to
derive a function’s classification
58. FUNCTION CLASSIFICATION: APPLYING ML
▪ Leveraging Open Source for Function Classification
• An example open source crypto package is openssl
• Functions typically have meaningful names that we can
exploit to derive classification types (e.g. aes_crypto() ➔
crypto type)
• We can also obtain diversity in terms of code
optimization by compiling the binary with various
optimization settings
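A hedged sketch of deriving labels from symbol names, as the slide suggests for packages like openssl. The keyword lists are illustrative assumptions, and naive substring matching will mislabel some names, but it conveys the idea:

```python
# Sketch: deriving a coarse classification label from a symbol name.
# Keyword lists are illustrative assumptions; substring matching is
# naive (e.g. "des" also matches "describe") and shown only as a sketch.
KEYWORDS = {
    "crypto": ("aes", "des", "sha", "rsa", "crypt"),
    "file_io": ("fopen", "fread", "fwrite", "fclose"),
    "network": ("socket", "connect", "send", "recv", "bind"),
}

def label_from_name(symbol_name):
    """Return the first behavior label whose keywords appear in the name."""
    n = symbol_name.lower()
    for label, keywords in KEYWORDS.items():
        if any(k in n for k in keywords):
            return label
    return "unknown"

print(label_from_name("aes_crypto"))  # crypto
```

Each labeled function, paired with its transition-matrix features, then becomes one training sample for the supervised classifier.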
59. CONCLUSION
▪ Question: If we had sufficient expertise, couldn’t we
write some scripts to accomplish a lot of the
automation presented in this talk?
• Yes. However, those scripts probably would not generalize
past the current problem sets and across architectures
• Most skilled RE practitioners are not necessarily skilled
and experienced software engineers.
• Developing a set of scripts that are robust and can work
across architectures requires a substantial amount of
software engineering and testing
61. REFERENCES
1. Rieck, K., Trinius, P., Willems, C., & Holz, T. (2011). Automatic analysis of
malware behavior using machine learning. Journal of Computer Security, 19(4),
639-668.
2. Anderson, B., Storlie, C., Yates, M., & McPhall, A. (2014, November).
Automating reverse engineering with machine learning techniques. In
Proceedings of the 2014 Workshop on Artificial Intelligent and Security
Workshop (pp. 103-112). ACM.
3. Keliris, A., & Maniatakos, M. (2018). ICSREF: A Framework for Automated
Reverse Engineering of Industrial Control Systems Binaries. arXiv preprint
arXiv:1812.03478.
4. He, J., Ivanov, P., Tsankov, P., Raychev, V., & Vechev, M. (2018, October).
Debin: Predicting Debug Information in Stripped Binaries. In Proceedings of the
2018 ACM SIGSAC Conference on Computer and Communications Security (pp.
1667-1680). ACM.
5. C. Chio and D. Freeman, (2018). Machine learning and security: Protecting
systems with data and algorithms. O’Reilly Media.