A recurring and fundamental challenge that a reverse engineer (RE) experiences is understanding the behavior and functionality exhibited by a binary under examination. To complicate matters, the skills needed to succeed in this challenge vary significantly across practitioners, and it can often take a considerable amount of experience (5-7 years) to achieve a sufficient level of competence. Recent academic work in applying machine learning (ML) to reverse engineering shows promise in addressing these issues in a way that can allow junior reverse engineers to make substantial contributions to RE tasking and can allow RE work to be performed in a scalable manner across platforms and architectures.
In this talk, we will discuss how ML techniques can be leveraged to classify behavioral characteristics (e.g. crypto, file I/O, network, IPC, and trampoline) exhibited by a function in a manner that can scale well and without the need for humans to perform labeling. We will also discuss how these techniques can be applied to identify/recover function symbols in stripped binaries. As part of the discussion, we will also explore approaches that have the potential to allow concepts and ideas presented in these academic works to be applied to real world RE problems.
2. ABOUT ME
https://www.linkedin.com/in/malachijonesphd
▪ Education
• Bachelor's Degree: Computer Engineering (Univ. of Florida, 2007)
• Master’s Degree: Computer Engineering (Georgia Tech, 2009)
• PhD: Computer Engineering (Georgia Tech, 2013)
▪ Cyber Security Experience
• PhD Thesis: Asymmetric Information Games & Cyber Security (2009-2013)
• Harris Corp.: Cyber Software Engineer/ Vuln. Researcher (2013-2015)
• Booz Allen Dark Labs: Embedded Security Researcher (2016-2018)
• MITRE: Lead Embedded Security Researcher (2018-Present)
3. OUTLINE
▪ Motivation
I. What is Software Reverse Engineering (RE)?
II. Why is it Difficult?
III. Challenges of Automating RE
▪ Scope
▪ Background
I. Machine Learning
II. Markov Decision Process (MDP)
▪ Function Matching and Applying Historic Knowledge at Scale
▪ Leveraging ML & Markov Chains for Function Classification
▪ Conclusion
5. WHAT IS SOFTWARE RE
▪ Reverse Engineering (RE): The process by which you
deconstruct something to see and understand how
it works
▪ Software RE
• Characterization of software functionality/behavior that
includes intended and unintended functionality/behavior
(e.g. vulnerabilities)
• Characterization of the target environment
• Understanding of the executable's purpose (e.g. Why
does it do what it does?)
6. WHAT IS SOFTWARE RE: EXAMPLE
Case Study: “Sample J” Malware
(sha1: 70cb0b4b8e60dfed949a319a9375fac44168ccbb)
7. WHAT IS SOFTWARE RE: EXAMPLE
Case Study: “Sample J” Malware
memset(,0, 0x124u)
CreateToolhelp32Snapshot()
Process32First()
stricmp(., “explorer.exe”)
Process32Next(.,.)
stricmp(., “explorer.exe”)
CreateThread(,,StartAddress)
Sample J Call Trace Example
8. WHAT IS SOFTWARE RE: EXAMPLE
Case Study: “Sample J” Malware
memset(,0, 0x124u)
CreateToolhelp32Snapshot()
Process32First()
stricmp(., “explorer.exe”)
Process32Next(.,.)
stricmp(., “explorer.exe”)
CreateThread(,,StartAddress)
Sample J Call Trace Example
▪Call Trace Analysis
i. Iterate through the process handles to see if a process with
name “explorer.exe” exists (Check if user logged in)
ii. If process exists, create a thread that infects target system
9. WHAT IS SOFTWARE RE
▪ Applications
• Vulnerability Research
• Malware Research
• Firmware Analysis
• Binary Repurposing
11. WHY IS RE DIFFICULT
▪ Reverse engineering (RE) applications (e.g. malware and vulnerability analysis)
have historically been a manual and time-intensive process performed by skilled
practitioners
▪ Substantial barriers to entry for RE practitioners
▪ The prevalence, ubiquity, and growth of malware and IoT devices
outpace the supply of skilled practitioners
12. WHY IS RE DIFFICULT
▪ Non-trivial (often manual-intensive) RE Processes
1. Identifying what the code blob is (e.g. ELF, PE, or
compressed)
2. Unpacking a code blob if it is packed
3. Cleaning up a disassembled binary image
4. Applying historic knowledge to a target function that is
similar to a previously analyzed function
5. Understanding function behavior and binary purpose
14. CHALLENGES OF AUTOMATING RE
▪ Computational Complexity Considerations
▪ Architecture Idiosyncrasies
▪ Humans are inherently better at some tasks (e.g.
reasoning)
▪ Software Development & Testing effort
16. SCOPE
•Non-trivial (often manual-intensive) RE Processes
1. Identifying what the code blob is (e.g. ELF, PE, or
compressed)
2. Unpacking a code blob if it is packed
3. Cleaning up a disassembled binary image
4. Applying historic knowledge to a target function that is
similar to previously analyzed functions
5. Understanding function behavior and binary purpose
Focus of this presentation
18. MACHINE LEARNING: TYPES
▪ Three Types of Machine Learning:
• Supervised Learning: labeled data, direct feedback, predict outcome/future
• Unsupervised Learning: no labels/targets, no feedback, find hidden structure in data
• Reinforcement Learning: decision process, reward system, learn series of actions
19. MACHINE LEARNING: TYPES
▪ Three Types of Machine Learning:
• Supervised Learning: labeled data, direct feedback, predict outcome/future (of interest for this talk)
• Unsupervised Learning: no labels/targets, no feedback, find hidden structure in data
• Reinforcement Learning: decision process, reward system, learn series of actions
20. SUPERVISED LEARNING
▪ Supervised learning: Main goal is to learn a model from labeled
training data that allows us to make predictions about unseen or
future data
▪ Supervised refers to a set of samples where the desired output
signals (labels) are already known
(Diagram: labels and training data feed a machine learning algorithm, which produces a predictive model; new data fed to the model yields predictions)
21. SUPERVISED LEARNING EXAMPLE
▪ Suppose we didn’t know what “features”
differentiate reptiles from mammals
▪ Note: Features can be viewed as properties
exhibited by a type/class
▪ Suppose also that we are given a sample set of
animals that have been labeled as either mammal or
reptile
22. SUPERVISED LEARNING EXAMPLE
▪ Sample Set
▪ Common Features: eyes, 4 legs, tail, tongue, etc…
▪ Differentiating Feature: Dogs and elephants give birth
to live offspring; alligators and iguanas don’t
▪ Based on the sample set, we've learned a feature that
separates mammals from reptiles
(Images: dog and elephant labeled Mammal; alligator and iguana labeled Reptile)
23. SUPERVISED LEARNING EXAMPLE
▪ Sample Set
▪ Differentiating Feature: Dogs and elephants give birth
to live offspring; alligators and iguanas don’t
▪ The quintessence of Supervised ML is learning which
features/properties separate objects that differ in type
▪ These differentiating features then allow us to
make predictions about the type of an unlabeled object
(Images: dog and elephant labeled Mammal; alligator and iguana labeled Reptile)
24. TYPES OF ML: UNSUPERVISED LEARNING
▪ Unsupervised Learning: Dealing with unlabeled data or data of unknown structure
▪ In supervised learning, we know the right answer beforehand when we train our model
▪ Using unsupervised learning techniques, we can explore the structure of our data to extract
meaningful information without the guidance of a known outcome variable or reward function
27. MARKOV DECISION PROCESS
▪ Markov Process: A stochastic process that satisfies
the Markov property
▪ Markov Property: A process where one can make
predictions for the future of the process based
solely on its present state; memorylessness
▪ Markov Chain: A type of Markov process that has a
discrete state space
28. MARKOV DECISION PROCESS
▪ Markov Chain Example*: Predicting the Weather
▪ Interpretation: The probabilities of weather conditions (i.e. modeled as either rainy or sunny), given the weather on the preceding day
(State diagram: Sunny stays Sunny with probability .9 and transitions to Rainy with .1; Rainy transitions to Sunny with .5 and stays Rainy with .5)
*(Example originally presented at https://en.wikipedia.org/wiki/Examples_of_Markov_chains)
29. MARKOV DECISION PROCESS
▪ Markov Chain Example*: Predicting the Weather
▪ The transition probabilities can be represented by
the following transition matrix:
P = [ .9  .1 ]
    [ .5  .5 ]
(rows: today's weather Sunny, Rainy; columns: tomorrow's weather Sunny, Rainy)
*(Example originally presented at https://en.wikipedia.org/wiki/Examples_of_Markov_chains)
30. MARKOV DECISION PROCESS
▪ Markov Chain Example*: Predicting the Weather
▪ Suppose the weather on day 1 is sunny; then we can
represent it by the following state vector:
x(1) = [ 1  0 ]
(entries: probability of Sunny, probability of Rainy)
*(Example originally presented at https://en.wikipedia.org/wiki/Examples_of_Markov_chains)
31. MARKOV DECISION PROCESS
▪ Markov Chain Example*: Predicting the Weather
▪ The weather on day 2 can be predicted by:
x(2) = x(1) P = [ .9  .1 ]
*(Example originally presented at https://en.wikipedia.org/wiki/Examples_of_Markov_chains)
32. MARKOV DECISION PROCESS
▪ Markov Chain Example*: Predicting the Weather
▪ The weather on day 3 can be predicted by:
x(3) = x(2) P = [ .86  .14 ]
*(Example originally presented at https://en.wikipedia.org/wiki/Examples_of_Markov_chains)
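The day-by-day predictions above can be reproduced with a few lines of code. A minimal sketch (plain Python, no libraries; names are illustrative, not from the talk):

```python
# Minimal sketch: evolving the weather Markov chain from these
# slides by repeated vector-matrix products.

# Transition matrix from the state diagram:
# rows = today's weather (Sunny, Rainy), columns = tomorrow's.
P = [[0.9, 0.1],
     [0.5, 0.5]]

def step(x, P):
    """One transition of a row vector: x' = x P."""
    return [sum(x[i] * P[i][j] for i in range(len(x)))
            for j in range(len(P[0]))]

x = [1.0, 0.0]  # day 1: sunny with certainty
for day in (2, 3):
    x = step(x, P)
    print(f"day {day}: P(sunny)={x[0]:.2f}, P(rainy)={x[1]:.2f}")
# day 2: P(sunny)=0.90, P(rainy)=0.10
# day 3: P(sunny)=0.86, P(rainy)=0.14
```

Repeating `step` simply multiplies the day-1 vector by higher powers of P, which is exactly what the slides compute by hand.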
34. DERIVING FUNCTION META-DATA
▪ A recurring RE task consists of labeling and annotating a
function that has not been previously analyzed
▪ Annotating can include the following
• Comments about context and behavior of a set of instructions
• Identifying and labeling key data structures
• Determining the function’s prototype
▪ Note: Can also record who derived the annotations to assess
reputation/accuracy/trustworthiness
▪ For the remainder of this talk, we’ll refer to this information
as the function’s meta-data.
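As an illustrative sketch only (the field names are assumptions, not a format proposed in this talk), the meta-data described above could be captured in a simple record:

```python
# Illustrative sketch: a record for the function meta-data described
# on this slide. Field names are assumptions.
from dataclasses import dataclass, field

@dataclass
class FunctionMetaData:
    prototype: str                                       # derived function prototype
    comments: dict = field(default_factory=dict)         # address -> comment text
    labeled_structs: list = field(default_factory=list)  # key data structures
    author: str = ""                                     # who derived the annotations

md = FunctionMetaData(prototype="BOOL check_explorer(void)", author="analyst1")
md.comments[0x401000] = "compares process name against 'explorer.exe'"
```

Recording the author, as the slide notes, is what would later allow reputation/accuracy to be assessed per annotation.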
36. DERIVING FUNCTION META-DATA
(Screenshot: a function labeled in IDA, showing a defined function prototype, an identified and labeled key variable, additional comments adding context about a comparison operation, and an identified and labeled key data structure)
37. LEVERAGING FUNCTION META-DATA
▪ A logical next step would be to apply the derived
function meta-data to functions in other binaries
that are either an exact match and/or similar
enough
▪ A huge bonus would be if we could perform this in a
disassembler-agnostic manner.
▪ Example: If function meta-data is generated in IDA
Pro, someone using Ghidra or Binary Ninja would
also be able to leverage that information.
38. LEVERAGING FUNCTION META-DATA
▪ Example: Disassembler Agnostic Approach
(Diagram: Binary A analyzed in Ghidra, Binary B in IDA Pro, and Binary C in Binja each contain func3() among other functions (func1(), func5(), func2(), func0()); func3() meta-data is saved once to a shared DB and then fetched by the other disassemblers)
39. STANDARD TECHNIQUE FOR EXACT MATCHES
▪ Example: Hello World (x86-64)
.global _start
.text
_start:
# write(1, message, 13)
mov $1, %rax # system call 1 is write
mov $1, %rdi # file handle 1 is stdout
mov $message, %rsi # address of string to output
mov $13, %rdx # number of bytes
syscall # invoke operating system to do the write
# exit(0)
mov $60, %rax # system call 60 is exit
xor %rdi, %rdi # we want return code 0
syscall # invoke operating system to exit
message:
.ascii "Hello, world\n"
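The slide shows the example function itself; the matching step is not spelled out here. One standard technique for exact matches (an assumption in this sketch, not a claim about a specific tool) is to hash a function's byte sequence and use the digest as a lookup key into a database of previously annotated functions:

```python
# Hedged sketch: exact-match function identification via hashing.
import hashlib

def function_digest(func_bytes):
    """SHA-1 digest of a function's raw bytes (exact-match key)."""
    return hashlib.sha1(func_bytes).hexdigest()

# Toy "database" of previously derived meta-data, keyed by digest.
prologue = b"\x55\x48\x89\xe5\xc3"   # push rbp; mov rbp, rsp; ret
db = {function_digest(prologue): {"name": "nop_stub"}}

meta = db.get(function_digest(b"\x55\x48\x89\xe5\xc3"))  # exact match hits
```

This only works for byte-identical functions; recompilation or different optimization settings change the bytes, which motivates the similarity and ML-based approaches later in the talk.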
49. FUNCTION CLASSIFICATION: MOTIVATION
▪ Another recurring task for RE practitioners is to characterize
a function’s behavior
▪ Behaviors can include the following:
• Computation
• Logic
• Crypto
• File I/O
• Networking
▪ Can you determine how the example function on the next
slides should be classified?
52. FUNCTION CLASSIFICATION: MOTIVATION
▪ Experienced RE practitioners can classify key
behaviors/operations of a function more readily than
their junior peers
▪ The ability to quickly classify behaviors/operations is a
key aspect that can allow an experienced RE
practitioner to spend more time focusing on what is
relevant to the RE task.
▪ Depending on the complexity and architecture of a
target function, even an RE SME can get bogged down
with classifying.
▪ This idea of function classification was presented in [2]
53. FUNCTION CLASSIFICATION: APPLYING ML
▪ We need at least two things to apply ML
1. Define a feature vector that can characterize an
arbitrary function
2. Obtain a “large enough” and diverse sample set of
functions with their appropriate classifications (e.g.
crypto or file I/O)
54. FUNCTION CLASSIFICATION: APPLYING ML
1. Defining a function feature vector
• We can model the function as a Markov Chain*
• The transition matrix P will then be our feature matrix
• An example of the states are the following:
i. Math
ii. Logic
iii. Memory
iv. Other
v. Branch
vi. Register
*The idea for leveraging Markov Chains was presented in [2]
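A minimal sketch of the feature construction described above: map each instruction mnemonic to a category state, count state-to-state transitions, and row-normalize into a transition matrix P. The mnemonic-to-category mapping below is an illustrative assumption, not the one used in [2]:

```python
# Sketch: building a Markov-chain feature matrix from a function's
# instruction sequence. Category mapping is illustrative only.
CATEGORIES = ["math", "logic", "memory", "branch", "register", "other"]
IDX = {c: i for i, c in enumerate(CATEGORIES)}

def category(mnemonic):
    m = mnemonic.lower()
    if m in ("add", "sub", "mul", "imul", "div", "inc", "dec"):
        return "math"
    if m in ("and", "or", "xor", "not", "shl", "shr"):
        return "logic"
    if m in ("mov", "lea", "push", "pop"):
        return "memory"
    if m in ("jmp", "je", "jne", "call", "ret", "cmp", "test"):
        return "branch"
    if m in ("xchg", "cdq", "cwde"):
        return "register"
    return "other"

def transition_matrix(mnemonics):
    """Row-normalized category-transition matrix over a mnemonic list."""
    n = len(CATEGORIES)
    counts = [[0] * n for _ in range(n)]
    for a, b in zip(mnemonics, mnemonics[1:]):
        counts[IDX[category(a)]][IDX[category(b)]] += 1
    P = []
    for row in counts:
        total = sum(row)
        P.append([c / total if total else 0.0 for c in row])
    return P
```

The rows of P (flattened into a single vector) would then serve as the feature vector fed to a supervised classifier.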
55. FUNCTION CLASSIFICATION: APPLYING ML
▪ Example Markov Chain Function Representation
(Diagram: a Markov chain whose states are instruction categories Math, Memory, Logic, Branch, Register, Stack, and Other, annotated with example transition probabilities .7, .2, .1, .1)
56. FUNCTION CLASSIFICATION: APPLYING ML
▪ Note: An intermediate step would be to lift the
disassembly to an intermediate representation (IR)
language (e.g. VEX IR)
(Diagram: the same Markov chain function representation as the previous slide)
57. FUNCTION CLASSIFICATION: APPLYING ML
2. Obtaining a “large enough” and diverse sample set
of functions with their appropriate classification
(e.g. crypto or file I/O)
• Labeling functions by hand does not scale, especially
across architectures; it is not feasible
• Commercial binaries tend to be stripped of symbol
information, so they are not a feasible source of labels either
• Open source code can allow for a diverse sample set
that provides symbol information that can be utilized to
derive a function’s classification
58. FUNCTION CLASSIFICATION: APPLYING ML
▪ Leveraging Open Source for Function Classification
• An example open source crypto package is openssl
• Functions typically have meaningful names that we can
exploit to derive classification types (e.g. aes_crypto() ➔
crypto type)
• We can also obtain diversity in terms of code
optimization by compiling the binary with various
optimization settings
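A hedged sketch of deriving labels from symbol names, as the slide suggests for packages like openssl. The keyword lists are illustrative assumptions, and naive substring matching will mislabel some names, but it conveys the idea:

```python
# Sketch: deriving a coarse classification label from a symbol name.
# Keyword lists are illustrative assumptions; substring matching is
# naive (e.g. "des" also matches "describe") and shown only as a sketch.
KEYWORDS = {
    "crypto": ("aes", "des", "sha", "rsa", "crypt"),
    "file_io": ("fopen", "fread", "fwrite", "fclose"),
    "network": ("socket", "connect", "send", "recv", "bind"),
}

def label_from_name(symbol_name):
    """Return the first behavior label whose keywords appear in the name."""
    n = symbol_name.lower()
    for label, keywords in KEYWORDS.items():
        if any(k in n for k in keywords):
            return label
    return "unknown"

print(label_from_name("aes_crypto"))  # crypto
```

Each labeled function, paired with its transition-matrix features, then becomes one training sample for the supervised classifier.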
59. CONCLUSION
▪ Question: If we had sufficient expertise, couldn’t we
write some scripts to accomplish a lot of the
automation presented in this talk?
• Yes. However, those scripts probably would not generalize
past the current problem sets and across architectures
• Most skilled RE practitioners are not necessarily skilled
and experienced software engineers.
• Developing a set of scripts that are robust and can work
across architectures requires a substantial amount of
software engineering and testing
61. REFERENCES
1. Rieck, K., Trinius, P., Willems, C., & Holz, T. (2011). Automatic analysis of
malware behavior using machine learning. Journal of Computer Security, 19(4),
639-668.
2. Anderson, B., Storlie, C., Yates, M., & McPhall, A. (2014, November).
Automating reverse engineering with machine learning techniques. In
Proceedings of the 2014 Workshop on Artificial Intelligent and Security
Workshop (pp. 103-112). ACM.
3. Keliris, A., & Maniatakos, M. (2018). ICSREF: A Framework for Automated
Reverse Engineering of Industrial Control Systems Binaries. arXiv preprint
arXiv:1812.03478.
4. He, J., Ivanov, P., Tsankov, P., Raychev, V., & Vechev, M. (2018, October).
Debin: Predicting Debug Information in Stripped Binaries. In Proceedings of the
2018 ACM SIGSAC Conference on Computer and Communications Security (pp.
1667-1680). ACM.
5. C. Chio and D. Freeman, (2018). Machine learning and security: Protecting
systems with data and algorithms. O’Reilly Media.