Automating Reverse Engineering:
Function Classification and Matching w/ Machine
Learning and Binary Analysis
By: Malachi Jones, PhD
ABOUT ME
https://www.linkedin.com/in/malachijonesphd
▪ Education
• Bachelor’s Degree: Computer Engineering (Univ. of Florida, 2007)
• Master’s Degree: Computer Engineering (Georgia Tech, 2009)
• PhD: Computer Engineering (Georgia Tech, 2013)
▪ Cyber Security Experience
• PhD Thesis: Asymmetric Information Games & Cyber Security (2009-2013)
• Harris Corp.: Cyber Software Engineer/ Vuln. Researcher (2013-2015)
• Booz Allen Dark Labs: Embedded Security Researcher (2016-2018)
• MITRE: Lead Embedded Security Researcher (2018-Present)
OUTLINE
▪ Motivation
I. What is Software Reverse Engineering (RE)?
II. Why is it Difficult?
III. Challenges of Automating RE
▪ Scope
▪ Background
I. Machine Learning
II. Markov Decision Process (MDP)
▪ Function Matching and Applying Historic Knowledge at Scale
▪ Leveraging ML & Markov Chains for Function Classification
▪ Conclusion
WHAT IS SOFTWARE REVERSE
ENGINEERING?
Motivation
WHAT IS SOFTWARE RE
▪ Reverse Engineering (RE): The process by which you
deconstruct something to see and understand how
it works
▪ Software RE
• Characterization of software functionality/behavior that
includes intended and unintended functionality/behavior
(e.g. vulnerabilities)
• Characterization of the target environment
• Understanding of the executable’s purpose (e.g. Why
does it do what it does?)
WHAT IS SOFTWARE RE: EXAMPLE
Case Study: “Sample J” Malware
(sha1: 70cb0b4b8e60dfed949a319a9375fac44168ccbb)
memset(,0, 0x124u)
CreateToolhelp32Snapshot()
Process32First()
stricmp(., “explorer.exe”)
Process32Next(.,.)
stricmp(., “explorer.exe”)
CreateThread(,,StartAddress)
Sample J Call Trace Example
▪ Call Trace Analysis
i. Iterate through the running processes to see whether a process
named “explorer.exe” exists (a check that a user is logged in)
ii. If the process exists, create a thread that infects the target system
WHAT IS SOFTWARE RE
▪ Applications
• Vulnerability Research
• Malware Research
• Firmware Analysis
• Binary Repurposing
WHY IS RE DIFFICULT?
Motivation
WHY IS RE DIFFICULT
▪ Reverse engineering (RE) applications (e.g. malware and vulnerability analysis)
have historically been a manual and time-intensive process performed by skilled
practitioners
▪ Substantial barriers to entry for RE practitioners
▪ The prevalence, ubiquity, and growth of malware and IoT devices outpace
the supply of skilled practitioners
WHY IS RE DIFFICULT
▪ Non-trivial (often manual-intensive) RE Processes
1. Identifying what the code blob is (e.g. ELF, PE, or
compressed)
2. Unpacking a code blob if it is packed
3. Cleaning up a disassembled binary image
4. Applying historic knowledge to a target function that is
similar to a previously analyzed function
5. Understanding function behavior and binary purpose
CHALLENGES OF AUTOMATING RE
Motivation
CHALLENGES OF AUTOMATING RE
▪ Computational Complexity Considerations
▪ Architecture Idiosyncrasies
▪ Humans are inherently better at some tasks (e.g.
reasoning)
▪ Software Development & Testing effort
SCOPE
▪ Non-trivial (often manual-intensive) RE Processes
1. Identifying what the code blob is (e.g. ELF, PE, or
compressed)
2. Unpacking a code blob if it is packed
3. Cleaning up a disassembled binary image
4. Applying historic knowledge to a target function that is
similar to previously analyzed functions
5. Understanding function behavior and binary purpose
Focus of this presentation
MACHINE LEARNING
Background
MACHINE LEARNING: TYPES
▪ Three Types of Machine Learning:
• Supervised Learning: labeled data; direct feedback; predict outcome/future
• Unsupervised Learning: no labels/targets; no feedback; find hidden structure in data
• Reinforcement Learning: decision process; reward system; learn series of actions
Of interest for this talk
SUPERVISED LEARNING
▪ Supervised learning: Main goal is to learn a model from labeled
training data that allows us to make predictions about unseen or
future data
▪ Supervised refers to a set of samples where the desired output
signals (labels) are already known
[Figure: Labels and Training Data feed a Machine Learning Algorithm, which produces a Predictive Model; New Data goes into the Predictive Model, which outputs a Prediction]
SUPERVISED LEARNING EXAMPLE
▪ Suppose we didn’t know what “features”
differentiate reptiles from mammals
▪ Note: Features can be viewed as properties
exhibited by a type/class
▪ Suppose also that we are given a sample set of
animals that have been labeled as either mammal or
reptile
SUPERVISED LEARNING EXAMPLE
▪ Sample Set: dog (mammal), elephant (mammal), alligator (reptile), iguana (reptile)
▪ Common Features: eyes, 4 legs, tail, tongue, etc.
▪ Differentiating Feature: Dogs and elephants give birth
to live offspring; alligators and iguanas don’t
▪ Based on the sample set, we’ve learned a feature that
separates mammals from reptiles
▪ The quintessence of Supervised ML is learning which
features/properties separate objects that differ in type
▪ These differentiating features then allow us to
make predictions about the type of an unlabeled object
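A minimal sketch of the mammal/reptile example, assuming a hypothetical four-animal sample set with hand-picked boolean features (the feature names and helper functions are illustrative, not from the slides). It searches for a feature whose value determines the label, then uses that feature to predict the label of an unseen animal.

```python
# Hypothetical labeled sample set: (features, label)
SAMPLES = [
    ({"has_tail": True, "four_legs": True, "live_birth": True}, "mammal"),    # dog
    ({"has_tail": True, "four_legs": True, "live_birth": True}, "mammal"),    # elephant
    ({"has_tail": True, "four_legs": True, "live_birth": False}, "reptile"),  # alligator
    ({"has_tail": True, "four_legs": True, "live_birth": False}, "reptile"),  # iguana
]

def find_separating_features(samples):
    """Return {feature: {value: label}} for features whose value determines the label."""
    separating = {}
    for name in samples[0][0]:
        value_to_labels = {}
        for features, label in samples:
            value_to_labels.setdefault(features[name], set()).add(label)
        # A feature separates the classes if it takes more than one value
        # and every value is seen with exactly one label.
        if len(value_to_labels) > 1 and all(len(v) == 1 for v in value_to_labels.values()):
            separating[name] = {val: next(iter(lbls)) for val, lbls in value_to_labels.items()}
    return separating

def predict(rule, features):
    """Classify an unlabeled object using the first separating feature found."""
    name, mapping = next(iter(rule.items()))
    return mapping[features[name]]

rule = find_separating_features(SAMPLES)
unlabeled = {"has_tail": True, "four_legs": True, "live_birth": True}
print(rule)
print(predict(rule, unlabeled))
```

The common features (tail, four legs) are rejected because they take the same value for every sample, which is the code-level version of "common features do not differentiate".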
TYPES OF ML: UNSUPERVISED LEARNING
▪ Unsupervised Learning: Dealing with unlabeled data or data of unknown structure
▪ In supervised learning, we know the right answer beforehand when we train our model
▪ Using unsupervised learning techniques, we can explore the structure of our data to extract
meaningful information without the guidance of a known outcome variable or reward function
MARKOV DECISION PROCESS (MDP)
Background
MARKOV DECISION PROCESS
▪ Markov Process: A stochastic process that satisfies
the Markov property
▪ Markov Property: A process where one can make
predictions for the future of the process based
solely on its present state; memorylessness
▪ Markov Chain: A type of Markov process that has a
discrete state space
▪ Markov Chain Example*: Predicting the Weather
▪ Interpretation: The probabilities of weather
conditions (modeled as either rainy or sunny),
given the weather on the preceding day
[Figure: two-state Markov chain. Sunny → Sunny: .9, Sunny → Rainy: .1, Rainy → Sunny: .5, Rainy → Rainy: .5]
▪ The transition probabilities can be represented by
the following transition matrix (rows: today's weather; columns: tomorrow's weather; state order: sunny, rainy):
$$P = \begin{bmatrix} 0.9 & 0.1 \\ 0.5 & 0.5 \end{bmatrix}$$
▪ Suppose the weather on day 1 is sunny; then we can
represent it by the following vector:
$$x^{(1)} = \begin{bmatrix} 1 & 0 \end{bmatrix}$$
▪ The weather on day 2 can be predicted by:
$$x^{(2)} = x^{(1)} P = \begin{bmatrix} 0.9 & 0.1 \end{bmatrix}$$
▪ The weather on day 3 can be predicted by:
$$x^{(3)} = x^{(2)} P = \begin{bmatrix} 0.86 & 0.14 \end{bmatrix}$$
*(Example originally presented at https://en.wikipedia.org/wiki/Examples_of_Markov_chains)
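The day-by-day prediction in the weather example can be sketched in a few lines. This assumes the transition probabilities read off the diagram (sunny stays sunny with .9 and turns rainy with .1; rainy goes either way with .5) and a state order of sunny, rainy; the function names are illustrative.

```python
# Transition matrix read off the diagram (assumed state order: sunny, rainy)
P = [[0.9, 0.1],   # sunny -> sunny, sunny -> rainy
     [0.5, 0.5]]   # rainy -> sunny, rainy -> rainy

def step(dist, matrix):
    """Advance one day: row vector times transition matrix."""
    n = len(matrix)
    return [sum(dist[i] * matrix[i][j] for i in range(n)) for j in range(n)]

def forecast(dist, matrix, days):
    """Apply the transition matrix `days` times to the starting distribution."""
    for _ in range(days):
        dist = step(dist, matrix)
    return dist

day1 = [1.0, 0.0]             # day 1 is known to be sunny
day2 = forecast(day1, P, 1)   # approximately [0.9, 0.1]
day3 = forecast(day1, P, 2)   # approximately [0.86, 0.14]
print(day2, day3)
```

Only the current distribution is needed to compute the next one, which is the Markov property in practice. Iterating many more days approaches the chain's stationary distribution, roughly 5/6 sunny and 1/6 rainy for this matrix.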
Function Matching & Applying Historic Knowledge
DERIVING FUNCTION META-DATA
▪ A recurring RE task consists of labeling and annotating a
function that has not been previously analyzed
▪ Annotating can include the following
• Comments about context and behavior of a set of instructions
• Identifying and labeling key data structures
• Determining the function’s prototype
▪ Note: Can also record who derived the annotations to assess
reputation/accuracy/trustworthiness
▪ For the remainder of this talk, we’ll refer to this information
as the function’s meta-data.
DERIVING FUNCTION META-DATA
[Figure: annotated IDA disassembly listing with callouts: labeled function; (IDA) defined function prototype; identified and labeled key variable; additional comments to add context about a comparison operation; identified and labeled a key data structure]
LEVERAGING FUNCTION META-DATA
▪ A logical next step would be to apply the derived
function meta-data to functions in other binaries
that are either an exact match and/or similar
enough
▪ A huge bonus would be if we could perform this in a
disassembler agnostic manner.
▪ Example: If function meta-data is generated in IDA
Pro, someone using Ghidra or Binary Ninja would
also be able to leverage that information.
▪ Example: Disassembler Agnostic Approach
[Figure: Binary A (Ghidra), Binary B (IDA Pro), and Binary C (Binja) each contain func3() alongside other functions. The func3() meta-data is saved to a shared DB from one tool and fetched from the DB by the other two.]
STANDARD TECHNIQUE FOR EXACT MATCHES
▪ Example: Hello World (x86-64 Linux, AT&T syntax)
.global _start
.text
_start:
# write(1, message, 13)
mov $1, %rax # system call 1 is write
mov $1, %rdi # file handle 1 is stdout
mov $message, %rsi # address of string to output
mov $13, %rdx # number of bytes
syscall # invoke operating system to do the write
# exit(0)
mov $60, %rax # system call 60 is exit
xor %rdi, %rdi # we want return code 0
syscall # invoke operating system to exit
message:
.ascii "Hello, world\n"
FUNCTION MATCHING
▪ Standard Technique for Exact Matches
# write(1, message, 13)
mov $1, %rax
mov $1, %rdi
mov $message, %rsi
mov $13, %rdx
syscall
# exit(0)
mov $60, %rax
xor %rdi, %rdi
syscall
main(): mov mov mov mov syscall mov xor syscall → sha1 hash
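A minimal sketch of exact matching, assuming (as the slide suggests) that the hash is taken over the function's mnemonic sequence with operands dropped. The space-joined encoding, the `mnemonic_hash` helper, and the in-memory `db` dictionary are illustrative stand-ins, not the talk's actual implementation.

```python
import hashlib

def mnemonic_hash(mnemonics):
    """SHA-1 over the space-joined mnemonic sequence (operands dropped)."""
    return hashlib.sha1(" ".join(mnemonics).encode("ascii")).hexdigest()

# Mnemonics of the Hello World function from the slide
hello = ["mov", "mov", "mov", "mov", "syscall", "mov", "xor", "syscall"]

# Hypothetical in-memory stand-in for the shared meta-data DB
db = {}
db[mnemonic_hash(hello)] = {"name": "hello_world", "comment": "write() then exit(0)"}

# The same function found in another binary hashes to the same key
hit = db.get(mnemonic_hash(list(hello)))
# Changing a single mnemonic yields a different key, so the lookup misses
miss = db.get(mnemonic_hash(hello[:-1] + ["ret"]))
print(hit, miss)
```

Because any one-instruction difference changes the digest, this only finds exact matches, which is what motivates the approximate techniques that follow.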
FUNCTION MATCHING: APPROXIMATE MATCHING
▪ N-gram Technique for Approximate Match (N=3 in this example)
main(): mov mov mov mov syscall mov xor syscall
SEQ 1: mov mov mov
SEQ 2: mov mov mov
SEQ 3: mov mov syscall
SEQ 4: mov syscall mov
SEQ 5: syscall mov xor
SEQ 6: mov xor syscall
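The sliding window shown in the N-gram slides can be sketched as follows. The `ngrams` helper is an illustrative name; the mnemonic list is the Hello World sequence used throughout the example.

```python
def ngrams(seq, n=3):
    """All contiguous length-n windows of seq, in order."""
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]

mnemonics = ["mov", "mov", "mov", "mov", "syscall", "mov", "xor", "syscall"]
grams = ngrams(mnemonics)
for idx, gram in enumerate(grams, start=1):
    print(f"SEQ {idx}: {' '.join(gram)}")
```

A sequence of 8 mnemonics yields 8 - 3 + 1 = 6 windows, matching SEQ 1 through SEQ 6 on the slides.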
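FUNCTION MATCHING: APPROXIMATE MATCHING
▪ Jaccard Index can be used to measure similarity between the n-gram sets of two functions p and q:
$$J(p,q) = \frac{|p \cap q|}{|p \cup q|}$$
[Figure: overlapping sets for Function p and Function q; value shown: .953]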
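A minimal sketch of the Jaccard index applied to n-gram sets, assuming each function is represented by the set of its mnemonic trigrams. The helper names and the second function `q` are hypothetical and chosen only to give a partial overlap.

```python
def jaccard(p, q):
    """|p ∩ q| / |p ∪ q| over two collections treated as sets."""
    p, q = set(p), set(q)
    if not p and not q:
        return 1.0  # convention chosen here: two empty sets count as identical
    return len(p & q) / len(p | q)

def trigrams(seq):
    """Set of contiguous 3-mnemonic windows."""
    return {tuple(seq[i:i + 3]) for i in range(len(seq) - 2)}

p = ["mov", "mov", "mov", "mov", "syscall", "mov", "xor", "syscall"]
q = ["mov", "mov", "mov", "mov", "syscall", "mov", "sub", "syscall"]  # xor replaced by sub

score = jaccard(trigrams(p), trigrams(q))
print(round(score, 3))
```

Here the two functions share 3 distinct trigrams out of 7 in the union, so the score is 3/7. Note that using sets discards repeated trigrams; a multiset variant is a possible alternative if repetition matters.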
Function Classification w/ Machine Learning and
Binary Analysis
FUNCTION CLASSIFICATION: MOTIVATION
▪ Another recurring task for RE practitioners is to characterize
a function’s behavior
▪ Behaviors can include the following:
• Computation
• Logic
• Crypto
• File I/O
• Networking
▪ Can you determine how the example function on the next
slides should be classified?
FUNCTION CLASSIFICATION: MOTIVATION
push rbp
mov rbp, rsp
push rbx
sub rsp, 40
mov QWORD PTR [rbp-40], rdi
mov QWORD PTR [rbp-48], rsi
mov BYTE PTR [rbp-21], 115
mov DWORD PTR [rbp-20], 0
jmp .L2
.L3:
mov eax, DWORD PTR [rbp-20]
movsx rdx, eax
mov rax, QWORD PTR [rbp-48]
add rax, rdx
movzx eax, BYTE PTR [rax]
mov edx, DWORD PTR [rbp-20]
movsx rcx, edx
mov rdx, QWORD PTR [rbp-48]
add rdx, rcx
xor al, BYTE PTR [rbp-21]
mov BYTE PTR [rdx], al
add DWORD PTR [rbp-20], 1
.L2:
mov eax, DWORD PTR [rbp-20]
movsx rbx, eax
mov rax, QWORD PTR [rbp-40]
mov rdi, rax
call strlen
cmp rbx, rax
jb .L3
mov eax, 0
add rsp, 40
pop rbx
pop rbp
ret
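As a reading aid, here is a hedged Python rendering of what the listing above appears to do: it walks a buffer and XORs each byte with the constant 115 (ASCII 's'), using strlen of the first argument as the loop bound. The function and variable names are made up for illustration. A loop like this is a simple XOR encoding routine, which one might reasonably file under the crypto (or obfuscation) category.

```python
KEY = 115  # 0x73, ASCII 's': the byte stored at [rbp-21]

def c_strlen(buf):
    """Length up to the first NUL byte, like C strlen (whole buffer if none)."""
    end = buf.find(0)
    return len(buf) if end == -1 else end

def xor_loop(src, dst):
    """Mirror of the listing: src is the first argument (rdi), dst the second (rsi)."""
    i = 0                       # DWORD PTR [rbp-20]
    while i < c_strlen(src):    # .L2: strlen is re-evaluated on every pass
        dst[i] ^= KEY           # .L3: load byte, xor with the key, store it back
        i += 1
    return 0                    # mov eax, 0

buf = bytearray(b"secret")
xor_loop(b"secret", buf)        # encode in place
encoded = bytes(buf)
xor_loop(b"secret", buf)        # XOR with the same key is its own inverse
print(encoded.hex(), buf)
```

Unlike the C original, the Python version raises an IndexError rather than writing out of bounds when the destination is shorter than the length source.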
FUNCTION CLASSIFICATION: MOTIVATION
▪ Experienced RE practitioners can classify key
behaviors/operations of a function more readily than
their junior peers
▪ The ability to quickly classify behaviors/operations is a
key aspect that can allow an experienced RE
practitioner to spend more time focusing on what is
relevant to the RE task.
▪ Depending on the complexity and architecture of a
target function, even an RE subject-matter expert (SME) can get bogged down
with classifying.
▪ This idea of function classification was presented in [2]
FUNCTION CLASSIFICATION: APPLYING ML
▪ We need at least two things to apply ML
1. Define a feature vector that can characterize an
arbitrary function
2. Obtain a “large enough” and diverse sample set of
functions with their appropriate classification (e.g.
crypto or file I/O)
FUNCTION CLASSIFICATION: APPLYING ML
1. Defining a function feature vector
• We can model the function as a Markov Chain*
• The transition matrix P will then be our feature matrix
• Example states include the following:
i. Math
ii. Logic
iii. Memory
iv. Other
v. Branch
vi. Register
*The idea for leveraging Markov Chains was presented in [2]
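A minimal sketch of turning a function into a Markov-chain feature vector, assuming each instruction has already been mapped to one of the six example states. The mnemonic-to-state map below is a hypothetical simplification (for instance, a real mapping would inspect operands to tell memory accesses from register moves), and the helper names are illustrative.

```python
from collections import Counter

STATES = ["Math", "Logic", "Memory", "Other", "Branch", "Register"]

# Hypothetical mnemonic-to-state map, for illustration only
CATEGORY = {
    "add": "Math", "sub": "Math",
    "xor": "Logic", "cmp": "Logic",
    "push": "Memory", "pop": "Memory",
    "mov": "Register",
    "jmp": "Branch", "jb": "Branch", "call": "Branch", "ret": "Branch",
}

def transition_matrix(mnemonics):
    """Row-normalized counts of state-to-state transitions between consecutive instructions."""
    states = [CATEGORY.get(m, "Other") for m in mnemonics]
    counts = Counter(zip(states, states[1:]))
    matrix = []
    for src in STATES:
        total = sum(counts[(src, dst)] for dst in STATES)
        matrix.append([counts[(src, dst)] / total if total else 0.0 for dst in STATES])
    return matrix

def feature_vector(mnemonics):
    """Flatten the transition matrix P into a fixed-length vector."""
    return [p for row in transition_matrix(mnemonics) for p in row]

example = ["mov", "add", "mov", "xor", "jmp"]
P = transition_matrix(example)
print(P[STATES.index("Register")])
print(len(feature_vector(example)))
```

The vector length depends only on the number of states (6 × 6 = 36 here), not on the function's length, which is what makes it usable as input to a standard classifier.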
FUNCTION CLASSIFICATION: APPLYING ML
▪ Example Markov Chain Function Representation
[Figure: state diagram with nodes Math, Memory, Logic, Branch, Register, Stack, and Other; example edge probabilities .7, .1, .2, .1]
▪ Note: An intermediate step would be to lift the
disassembly to an intermediate representation (IR)
language (e.g. VEX IR)
FUNCTION CLASSIFICATION: APPLYING ML
2. Obtaining a “large enough” and diverse sample set
of functions with their appropriate classification
(e.g. crypto or file I/O)
• Labeling functions by hand does not scale, especially
across architectures, so it is not feasible
• Commercial binaries tend to be stripped of symbol
information, so they are not a feasible source of labels either
• Open source code can allow for a diverse sample set
that provides symbol information that can be utilized to
derive a function’s classification
FUNCTION CLASSIFICATION: APPLYING ML
▪ Leveraging Open Source for Function Classification
• An example open source crypto package is OpenSSL
• Functions typically have meaningful names that we can
exploit to derive classification types (e.g. aes_crypto() ➔
crypto type)
• We can also obtain diversity in terms of code
optimization by compiling the source with various
optimization settings
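A minimal sketch of deriving a class label from a symbol name, in the spirit of the aes_crypto() ➔ crypto example. The keyword table and the `label_from_name` helper are hypothetical; a real table would need curation to avoid accidental substring matches.

```python
# Hypothetical keyword-to-class table, for illustration only
KEYWORDS = {
    "crypto": ("aes", "sha", "md5", "rsa", "encrypt", "decrypt"),
    "file_io": ("fopen", "fread", "fwrite", "file"),
    "networking": ("socket", "connect", "recv", "send"),
}

def label_from_name(name):
    """Return the first class whose keyword appears in the function name, else None."""
    lowered = name.lower()
    for label, keys in KEYWORDS.items():
        if any(key in lowered for key in keys):
            return label
    return None

for symbol in ("aes_crypto", "read_config_file", "tcp_recv_loop", "memcpy"):
    print(symbol, "->", label_from_name(symbol))
```

Functions that match no keyword return None and would simply be left out of the training set rather than guessed at.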
CONCLUSION
▪ Question: If we had sufficient expertise, couldn’t we
write some scripts to accomplish a lot of the
automation presented in this talk?
• Yes. However, those scripts probably won’t generalize
beyond the current problem set or across architectures
• Most skilled RE practitioners are not necessarily skilled
and experienced software engineers.
• Developing a set of scripts that are robust and can work
across architectures requires a substantial amount of
software engineering and testing
REFERENCES
1. Rieck, K., Trinius, P., Willems, C., & Holz, T. (2011). Automatic analysis of
malware behavior using machine learning. Journal of Computer Security, 19(4),
639-668.
2. Anderson, B., Storlie, C., Yates, M., & McPhall, A. (2014, November).
Automating reverse engineering with machine learning techniques. In
Proceedings of the 2014 Workshop on Artificial Intelligent and Security
Workshop (pp. 103-112). ACM.
3. Keliris, A., & Maniatakos, M. (2018). ICSREF: A Framework for Automated
Reverse Engineering of Industrial Control Systems Binaries. arXiv preprint
arXiv:1812.03478.
4. He, J., Ivanov, P., Tsankov, P., Raychev, V., & Vechev, M. (2018, October).
Debin: Predicting Debug Information in Stripped Binaries. In Proceedings of the
2018 ACM SIGSAC Conference on Computer and Communications Security (pp.
1667-1680). ACM.
5. Chio, C., & Freeman, D. (2018). Machine Learning and Security: Protecting
Systems with Data and Algorithms. O’Reilly Media.

Real Time Object Detection Using Open CV
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of Brazil
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 

Automating Reverse Engineering: Function Classification and Matching

  • 1. Automating Reverse Engineering: Function Classification and Matching w/ Machine Learning and Binary Analysis By: Malachi Jones, PhD
  • 2. ABOUT ME https://www.linkedin.com/in/malachijonesphd ▪ Education • Bachelor’s Degree: Computer Engineering (Univ. of Florida, 2007) • Master’s Degree: Computer Engineering (Georgia Tech, 2009) • PhD: Computer Engineering (Georgia Tech, 2013) ▪ Cyber Security Experience • PhD Thesis: Asymmetric Information Games & Cyber Security (2009-2013) • Harris Corp.: Cyber Software Engineer/Vuln. Researcher (2013-2015) • Booz Allen Dark Labs: Embedded Security Researcher (2016-2018) • MITRE: Lead Embedded Security Researcher (2018-Present)
  • 3. OUTLINE ▪ Motivation: I. What is Software Reverse Engineering (RE)? II. Why is it Difficult? III. Challenges of Automating RE ▪ Scope ▪ Background: I. Machine Learning II. Markov Decision Process (MDP) ▪ Function Matching and Applying Historic Knowledge at Scale ▪ Leveraging ML & Markov Chains for Function Classification ▪ Conclusion
  • 4. WHAT IS SOFTWARE REVERSE ENGINEERING? Motivation
  • 5. WHAT IS SOFTWARE RE ▪ Reverse Engineering (RE): The process by which you deconstruct something to see and understand how it works ▪ Software RE • Characterization of software functionality/behavior that includes intended and unintended functionality/behavior (e.g. vulnerabilities) • Characterization of the target environment • Understanding of the executable’s purpose (e.g. Why does it do what it does?)
  • 6. WHAT IS SOFTWARE RE: EXAMPLE Case Study: “Sample J” Malware (sha1: 70cb0b4b8e60dfed949a319a9375fac44168ccbb)
  • 7. WHAT IS SOFTWARE RE: EXAMPLE Case Study: “Sample J” Malware memset(,0, 0x124u) CreateToolhelp32Snapshot() Process32First() stricmp(., “explorer.exe”) Process32Next(.,.) stricmp(., “explorer.exe”) CreateThread(,,StartAddress) Sample J Call Trace Example
  • 8. WHAT IS SOFTWARE RE: EXAMPLE Case Study: “Sample J” Malware memset(,0, 0x124u) CreateToolhelp32Snapshot() Process32First() stricmp(., “explorer.exe”) Process32Next(.,.) stricmp(., “explorer.exe”) CreateThread(,,StartAddress) Sample J Call Trace Example ▪ Call Trace Analysis i. Iterate through the process handles to see if a process with the name “explorer.exe” exists (check whether a user is logged in) ii. If the process exists, create a thread that infects the target system
  • 9. WHAT IS SOFTWARE RE ▪ Applications • Vulnerability Research • Malware Research • Firmware Analysis • Binary Repurposing
  • 10. WHY IS RE DIFFICULT? Motivation
  • 11. WHY IS RE DIFFICULT ▪ Reverse engineering (RE) applications (e.g. malware and vulnerability analysis) have historically been a manual and time-intensive process performed by skilled practitioners ▪ Substantial barriers to entry for RE practitioners ▪ The prevalence, ubiquity, and growth of malware and IoT devices outpace the number of skilled practitioners, so manual analysis does not scale
  • 12. WHY IS RE DIFFICULT ▪ Non-trivial (often manual-intensive) RE Processes 1. Identifying what the code blob is (e.g. ELF, PE, or compressed) 2. Unpacking a code blob if it is packed 3. Cleaning up a disassembled binary image 4. Applying historic knowledge to a target function that is similar to a previously analyzed function 5. Understanding function behavior and binary purpose
  • 13. CHALLENGES OF AUTOMATING RE Motivation
  • 14. CHALLENGES OF AUTOMATING RE ▪ Computational Complexity Considerations ▪ Architecture Idiosyncrasies ▪ Humans are inherently better at some tasks (e.g. reasoning) ▪ Software Development & Testing effort
  • 15. SCOPE
  • 16. SCOPE •Non-trivial (often manual-intensive) RE Processes 1. Identifying what the code blob is (e.g. elf, PE, or compressed) 2. Unpacking a code blob if it is packed 3. Cleaning up a disassembled binary image 4. Applying historic knowledge to a target function that is similar to previously analyzed functions 5. Understanding function behavior and binary purpose Focus of this presentation
  • 18. MACHINE LEARNING: TYPES ▪ Three Types of Machine Learning: ▪ Supervised Learning • Labeled data • Direct feedback • Predict outcome/future ▪ Unsupervised Learning • No labels/targets • No feedback • Find hidden structure in data ▪ Reinforcement Learning • Decision process • Reward system • Learn series of actions
  • 19. MACHINE LEARNING: TYPES ▪ Three Types of Machine Learning: ▪ Supervised Learning • Labeled data • Direct feedback • Predict outcome/future ▪ Unsupervised Learning • No labels/targets • No feedback • Find hidden structure in data ▪ Reinforcement Learning • Decision process • Reward system • Learn series of actions Of interest for this talk
  • 20. SUPERVISED LEARNING ▪ Supervised learning: Main goal is to learn a model from labeled training data that allows us to make predictions about unseen or future data ▪ Supervised refers to a set of samples where the desired output signals (labels) are already known [Diagram: Labels and Training Data feed a Machine Learning Algorithm, which yields a Predictive Model; the model maps New Data to a Prediction]
  • 21. SUPERVISED LEARNING EXAMPLE ▪ Suppose we didn’t know what “features” differentiate reptiles from mammals ▪ Note: Features can be viewed as properties exhibited by a type/class ▪ Suppose also that we are given a sample set of animals that have been labeled as either mammal or reptile
  • 22. SUPERVISED LEARNING EXAMPLE ▪ Sample Set: dog (mammal), elephant (mammal), alligator (reptile), iguana (reptile) ▪ Common Features: eyes, 4 legs, tail, tongue, etc. ▪ Differentiating Feature: Dogs and elephants give birth to live offspring; alligators and iguanas don’t ▪ Based on the sample set, we’ve learned a feature that separates mammals from reptiles
  • 23. SUPERVISED LEARNING EXAMPLE ▪ Sample Set: dog (mammal), elephant (mammal), alligator (reptile), iguana (reptile) ▪ Differentiating Feature: Dogs and elephants give birth to live offspring; alligators and iguanas don’t ▪ The quintessence of Supervised ML is learning which features/properties separate objects that differ in type ▪ These differentiating features can then allow us to make predictions about the type of an unlabeled object
  • 24. TYPES OF ML: UNSUPERVISED LEARNING ▪ Unsupervised Learning: Dealing with unlabeled data or data of unknown structure ▪ In supervised learning, we know the right answer beforehand when we train our model ▪ Using unsupervised learning techniques, we can explore the structure of our data to extract meaningful information without the guidance of a known outcome variable or reward function
  • 25. UNSUPERVISED LEARNING ▪ Unsupervised Learning: Dealing with unlabeled data or data of unknown structure ▪ In supervised learning, we know the right answer beforehand when we train our model ▪ Using unsupervised learning techniques, we can explore the structure of our data to extract meaningful information without the guidance of a known outcome variable or reward function
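The mammal/reptile example can be sketched in a few lines of Python. This is only an illustration of the idea of learning a differentiating feature from labeled samples, not a real learning algorithm; the feature names (`eyes`, `four_legs`, `tail`, `live_birth`) and both helper functions are assumptions introduced here.

```python
# Labeled sample set from the slides; the feature names are illustrative assumptions.
SAMPLES = [
    ({"eyes": True, "four_legs": True, "tail": True, "live_birth": True}, "mammal"),    # dog
    ({"eyes": True, "four_legs": True, "tail": True, "live_birth": True}, "mammal"),    # elephant
    ({"eyes": True, "four_legs": True, "tail": True, "live_birth": False}, "reptile"),  # alligator
    ({"eyes": True, "four_legs": True, "tail": True, "live_birth": False}, "reptile"),  # iguana
]


def differentiating_features(samples):
    """Return the features whose values split the samples cleanly by label."""
    found = []
    for name in samples[0][0]:
        labels_by_value = {}
        for features, label in samples:
            labels_by_value.setdefault(features[name], set()).add(label)
        # A feature separates the classes if it takes more than one value
        # and every value is associated with exactly one label.
        separates = len(labels_by_value) > 1 and all(
            len(labels) == 1 for labels in labels_by_value.values()
        )
        if separates:
            found.append(name)
    return found


def predict(features, samples, feature_name):
    """Predict a label by matching the value of one differentiating feature."""
    for known, label in samples:
        if known[feature_name] == features[feature_name]:
            return label
    return None


learned = differentiating_features(SAMPLES)
unlabeled = {"eyes": True, "four_legs": True, "tail": True, "live_birth": True}
prediction = predict(unlabeled, SAMPLES, learned[0])
```

The common features (eyes, legs, tail) take the same value for every sample, so they are rejected; only the live-birth feature survives and drives the prediction for the unlabeled animal.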
  • 26. MARKOV DECISION PROCESS (MDP) Background
  • 27. MARKOV DECISION PROCESS ▪ Markov Process: A stochastic process that satisfies the Markov property ▪ Markov Property: A process where one can make predictions for the future of the process based solely on its present state; memorylessness ▪ Markov Chain: A type of Markov process that has a discrete state space
  • 28. MARKOV DECISION PROCESS ▪ Markov Chain Example*: Predicting the Weather ▪ Interpretation: The probabilities of weather conditions (i.e. modeled as either rainy or sunny), given the weather on the preceding day [Diagram: Sunny → Sunny .9, Sunny → Rainy .1, Rainy → Sunny .5, Rainy → Rainy .5] *(Example originally presented at https://en.wikipedia.org/wiki/Examples_of_Markov_chains)
  • 29. MARKOV DECISION PROCESS ▪ Markov Chain Example*: Predicting the Weather ▪ The transition probabilities can be represented by the following transition matrix: [Diagram: two-state Sunny/Rainy chain as on slide 28] *(Example originally presented at https://en.wikipedia.org/wiki/Examples_of_Markov_chains)
  • 30. MARKOV DECISION PROCESS ▪ Markov Chain Example*: Predicting the Weather ▪ Suppose the weather on day 1 is sunny, then we can represent it by the following vector: [Diagram: two-state Sunny/Rainy chain as on slide 28] *(Example originally presented at https://en.wikipedia.org/wiki/Examples_of_Markov_chains)
  • 31. MARKOV DECISION PROCESS ▪ Markov Chain Example*: Predicting the Weather ▪ The weather on day 2 can be predicted by: [Diagram: two-state Sunny/Rainy chain as on slide 28] *(Example originally presented at https://en.wikipedia.org/wiki/Examples_of_Markov_chains)
  • 32. MARKOV DECISION PROCESS ▪ Markov Chain Example*: Predicting the Weather ▪ The weather on day 3 can be predicted by: [Diagram: two-state Sunny/Rainy chain as on slide 28] *(Example originally presented at https://en.wikipedia.org/wiki/Examples_of_Markov_chains)
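The weather example can be worked through numerically. The sketch below assumes the transition probabilities of the cited Wikipedia example (sunny stays sunny with probability .9, rainy stays rainy with probability .5) and a row-vector convention with states ordered (sunny, rainy); the helper name `step` is introduced here for illustration.

```python
# Transition matrix: rows are today's state, columns are tomorrow's state,
# in the order (sunny, rainy).
P = [
    [0.9, 0.1],  # sunny -> sunny, sunny -> rainy
    [0.5, 0.5],  # rainy -> sunny, rainy -> rainy
]


def step(dist, matrix):
    """Advance a row-vector distribution by one day (dist multiplied by matrix)."""
    n = len(matrix)
    return [sum(dist[i] * matrix[i][j] for i in range(n)) for j in range(n)]


day1 = [1.0, 0.0]      # day 1 is known to be sunny
day2 = step(day1, P)   # approximately [0.9, 0.1]
day3 = step(day2, P)   # approximately [0.86, 0.14]
```

Because of the Markov property, each prediction needs only the previous day's distribution, so day 3 is obtained by applying the same matrix to the day 2 result.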
  • 33. Function Matching & Applying Historic Knowledge
  • 34. DERIVING FUNCTION META-DATA ▪ A recurring RE task consists of labeling and annotating a function that has not been previously analyzed ▪ Annotating can include the following • Comments about context and behavior of a set of instructions • Identifying and labeling key data structures • Determining the function’s prototype ▪ Note: Can also record who derived the annotations to assess reputation/accuracy/trustworthiness ▪ For the remainder of this talk, we’ll refer to this information as the function’s meta-data.
  • 36. DERIVING FUNCTION META-DATA ▪ Labeled Function (IDA), with screenshot callouts: • Defined function prototype • Identified and labeled key variable • Additional comments to add context about comparison operation • Identified and labeled a key data structure
  • 37. LEVERAGING FUNCTION META-DATA ▪ A logical next step would be to apply the derived function meta-data to functions in other binaries that are either an exact match or similar enough ▪ A huge bonus would be if we could perform this in a disassembler-agnostic manner. ▪ Example: If function meta-data is generated in IDA Pro, someone using Ghidra or Binary Ninja would also be able to leverage that information.
  • 38. LEVERAGING FUNCTION META-DATA ▪ Example: Disassembler Agnostic Approach func1() Binary A (Ghidra) func5() func2() Binary B (IDA Pro) func3() func3() func0() Binary C (Binja) func3() DB Save func3() meta-data Fetch func3() meta-data Fetch func3() meta-data
  • 39. STANDARD TECHNIQUE FOR EXACT MATCHES ▪ Example: Hello World (x86) .global _start .text _start: # write(1, message, 13) mov $1, %rax # system call 1 is write mov $1, %rdi # file handle 1 is stdout mov $message, %rsi # address of string to output mov $13, %rdx # number of bytes syscall # invoke operating system to do the write # exit(0) mov $60, %rax # system call 60 is exit xor %rdi, %rdi # we want return code 0 syscall # invoke operating system to exit message: .ascii "Hello, world\n"
  • 40. FUNCTION MATCHING ▪ Standard Technique for Exact Matches # write(1, message, 13) mov $1, %rax mov $1, %rdi mov $message, %rsi mov $13, %rdx syscall # exit(0) mov $60, %rax xor %rdi, %rdi syscall ▪ Mnemonic sequence of main(): mov mov mov mov syscall mov xor syscall → sha1 hash
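A minimal sketch of the exact-match technique, assuming the function is reduced to its mnemonic sequence (operands dropped) and the sequence is hashed with SHA-1 from Python's standard `hashlib`. The function name `mnemonic_hash` and the space-joined encoding are choices made here for illustration, not something the slides specify.

```python
import hashlib


def mnemonic_hash(mnemonics):
    """Hash a function's mnemonic sequence; identical sequences give identical digests."""
    joined = " ".join(mnemonics).encode("ascii")
    return hashlib.sha1(joined).hexdigest()


# Mnemonic sequence of the Hello World example.
MAIN = ["mov", "mov", "mov", "mov", "syscall", "mov", "xor", "syscall"]

digest = mnemonic_hash(MAIN)
same = mnemonic_hash(list(MAIN))                    # an exact copy matches
different = mnemonic_hash(MAIN[:-1] + ["ret"])      # a one-instruction change does not
```

Dropping operands makes the digest insensitive to addresses and immediates, but any change to the instruction sequence produces an unrelated digest, which is why approximate matching needs a different technique.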
  • 41. FUNCTION MATCHING: APPROXIMATE MATCHING ▪ N-gram Technique for Approximate Match (N=3) ▪ main(): mov mov mov mov syscall mov xor syscall ▪ SEQ 1: mov mov mov
  • 42. FUNCTION MATCHING: APPROXIMATE MATCHING ▪ N-gram Technique (N=3 in this example) ▪ main(): mov mov mov mov syscall mov xor syscall ▪ SEQ 1: mov mov mov ▪ SEQ 2: mov mov mov
  • 43. FUNCTION MATCHING: APPROXIMATE MATCHING ▪ N-gram Technique (N=3 in this example) ▪ main(): mov mov mov mov syscall mov xor syscall ▪ SEQ 1: mov mov mov ▪ SEQ 2: mov mov mov ▪ SEQ 3: mov mov syscall
  • 44. FUNCTION MATCHING: APPROXIMATE MATCHING ▪ N-gram Technique (N=3 in this example) ▪ main(): mov mov mov mov syscall mov xor syscall ▪ SEQ 1: mov mov mov ▪ SEQ 2: mov mov mov ▪ SEQ 3: mov mov syscall ▪ SEQ 4: mov syscall mov
  • 45. FUNCTION MATCHING: APPROXIMATE MATCHING ▪ N-gram Technique (N=3 in this example) ▪ main(): mov mov mov mov syscall mov xor syscall ▪ SEQ 1: mov mov mov ▪ SEQ 2: mov mov mov ▪ SEQ 3: mov mov syscall ▪ SEQ 4: mov syscall mov ▪ SEQ 5: syscall mov xor
  • 46. FUNCTION MATCHING: APPROXIMATE MATCHING ▪ N-gram Technique (N=3 in this example) ▪ main(): mov mov mov mov syscall mov xor syscall ▪ SEQ 1: mov mov mov ▪ SEQ 2: mov mov mov ▪ SEQ 3: mov mov syscall ▪ SEQ 4: mov syscall mov ▪ SEQ 5: syscall mov xor ▪ SEQ 6: mov xor syscall
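The sliding-window construction of the N-gram sequences can be sketched as follows. The helper name `ngrams` is an assumption; the input is the mnemonic sequence of the Hello World `main()` used in the slides.

```python
def ngrams(mnemonics, n=3):
    """Return every window of n consecutive mnemonics, in order."""
    return [tuple(mnemonics[i:i + n]) for i in range(len(mnemonics) - n + 1)]


MAIN = ["mov", "mov", "mov", "mov", "syscall", "mov", "xor", "syscall"]

# Eight mnemonics with N=3 yield six sequences (SEQ 1 through SEQ 6).
sequences = ngrams(MAIN, n=3)
```

A function with fewer than N instructions produces no sequences, so very short functions may need a smaller N or a separate rule.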
  • 47. FUNCTION MATCHING: APPROXIMATE MATCHING ▪ Jaccard Index can be used to measure similarity [Figure: Function p compared with Function q, J(p,q) = .953]
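A minimal sketch of the Jaccard index applied to function matching, assuming each function is represented by the set of its mnemonic 3-grams: J(p,q) = |p ∩ q| / |p ∪ q|. The helper names and the second function `Q` are hypothetical; `Q` is the Hello World sequence with `xor` replaced by `mov`.

```python
def ngrams(mnemonics, n=3):
    """Return every window of n consecutive mnemonics, in order."""
    return [tuple(mnemonics[i:i + n]) for i in range(len(mnemonics) - n + 1)]


def jaccard(p, q, n=3):
    """Jaccard index of the n-gram sets of two mnemonic sequences."""
    set_p, set_q = set(ngrams(p, n)), set(ngrams(q, n))
    union = set_p | set_q
    if not union:
        return 1.0  # two functions with no n-grams are treated as identical
    return len(set_p & set_q) / len(union)


P = ["mov", "mov", "mov", "mov", "syscall", "mov", "xor", "syscall"]
Q = ["mov", "mov", "mov", "mov", "syscall", "mov", "mov", "syscall"]  # hypothetical variant

similarity = jaccard(P, Q)
```

Here the two functions share three of six distinct 3-grams. Because sets discard repeated n-grams, a threshold on this score (a tuning choice the slides do not specify) decides when two functions count as "similar enough" to share meta-data.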
  • 48. Function Classification w/ Machine Learning and Binary Analysis
  • 49. FUNCTION CLASSIFICATION: MOTIVATION ▪ Another recurring task for RE practitioners is to characterize a function’s behavior ▪ Behaviors can include the following: • Computation • Logic • Crypto • File I/O • Networking ▪ Can you determine how the example function on the next slides should be classified?
  • 50. FUNCTION CLASSIFICATION: MOTIVATION push rbp mov rbp, rsp push rbx sub rsp, 40 mov QWORD PTR [rbp-40], rdi mov QWORD PTR [rbp-48], rsi mov BYTE PTR [rbp-21], 115 mov DWORD PTR [rbp-20], 0 jmp .L2 .L3: mov eax, DWORD PTR [rbp-20] movsx rdx, eax mov rax, QWORD PTR [rbp-48] add rax, rdx movzx eax, BYTE PTR [rax] mov edx, DWORD PTR [rbp-20] movsx rcx, edx mov rdx, QWORD PTR [rbp-48] add rdx, rcx xor al, BYTE PTR [rbp-21] mov BYTE PTR [rdx], al add DWORD PTR [rbp-20], 1 .L2: mov eax, DWORD PTR [rbp-20] movsx rbx, eax mov rax, QWORD PTR [rbp-40] mov rdi, rax call strlen cmp rbx, rax jb .L3 mov eax, 0 add rsp, 40 pop rbx pop rbp ret
  • 52. FUNCTION CLASSIFICATION: MOTIVATION ▪ Experienced RE practitioners can classify key behaviors/operations of a function more readily than their junior peers ▪ The ability to quickly classify behaviors/operations is a key aspect that can allow an experienced RE practitioner to spend more time focusing on what is relevant to the RE task. ▪ Depending on the complexity and architecture of a target function, even an RE SME can get bogged down with classifying. ▪ This idea of function classification was presented in [2]
  • 53. FUNCTION CLASSIFICATION: APPLYING ML ▪ We need at least two things to apply ML 1. Define a feature vector that can characterize an arbitrary function 2. Obtain a “large enough” and diverse sample set of functions with their appropriate classification (e.g. crypto or file I/O)
  • 54. FUNCTION CLASSIFICATION: APPLYING ML 1. Defining a function feature vector • We can model the function as a Markov Chain* • The transition matrix P will then be our feature matrix • An example of the states are the following: i. Math ii. Logic iii. Memory iv. Other v. Branch vi. Register *The idea for leveraging Markov Chains was presented in [2]
  • 55. FUNCTION CLASSIFICATION: APPLYING ML ▪ Example Markov Chain Function Representation [Diagram: states Math, Memory, Logic, Branch, Register, Stack, and Other, with example transition probabilities .7, .1, .2, .1]
  • 56. FUNCTION CLASSIFICATION: APPLYING ML ▪ Note: An intermediate step would be to lift the disassembly to an intermediate representation (IR) language (e.g. VEX IR) [Diagram: states Math, Memory, Logic, Branch, Register, Stack, and Other, with example transition probabilities .7, .1, .2, .1]
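A rough sketch of turning a function into a Markov chain feature matrix. It assumes each instruction has already been mapped to one of the six example states and estimates the transition matrix P by counting consecutive state pairs and normalizing each row. The mnemonic-to-state table `CATEGORY` is a hypothetical simplification (a real pipeline would classify lifted IR statements rather than raw mnemonics), and `transition_matrix` is a name introduced here.

```python
STATES = ["math", "logic", "memory", "other", "branch", "register"]

# Hypothetical mnemonic-to-state mapping, for illustration only.
CATEGORY = {
    "add": "math", "sub": "math",
    "xor": "logic", "cmp": "logic",
    "mov": "memory", "push": "memory", "pop": "memory",
    "jmp": "branch", "jb": "branch", "call": "branch", "ret": "branch",
}


def transition_matrix(mnemonics):
    """Estimate P[i][j], the probability that state j follows state i."""
    index = {state: i for i, state in enumerate(STATES)}
    counts = [[0] * len(STATES) for _ in STATES]
    states = [CATEGORY.get(m, "other") for m in mnemonics]
    for current, following in zip(states, states[1:]):
        counts[index[current]][index[following]] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        # States that never occur keep an all-zero row.
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix


example = ["mov", "add", "xor", "mov", "jb"]
P = transition_matrix(example)
feature_vector = [p for row in P for p in row]  # flattened 6x6 matrix
```

Flattening the matrix gives every function a fixed-length vector (36 entries for six states) regardless of how many instructions it contains, which is what a classifier needs as input.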
  • 57. FUNCTION CLASSIFICATION: APPLYING ML 2. Obtaining a “large enough” and diverse sample set of functions with their appropriate classification (e.g. crypto or file I/O) • Labeling functions by hand does not scale, especially across architectures, so it is not feasible • Commercial binaries tend to be stripped of symbol information, so deriving labels from them is also not feasible • Open source code can allow for a diverse sample set that provides symbol information that can be utilized to derive a function’s classification
  • 58. FUNCTION CLASSIFICATION: APPLYING ML ▪ Leveraging Open Source for Function Classification • An example open source crypto package is OpenSSL • Functions typically have meaningful names that we can exploit to derive classification types (e.g. aes_crypto() ➔ crypto type) • We can also obtain diversity in terms of code optimization by compiling the binary with various optimization settings
  • 59. CONCLUSION ▪ Question: If we had sufficient expertise, couldn’t we write some scripts to accomplish a lot of the automation presented in this talk? • Yes. However, those scripts probably don’t generalize past the current problem sets and across architectures • Most skilled RE practitioners are not necessarily skilled and experienced software engineers. • Developing a set of scripts that are robust and can work across architectures requires a substantial amount of software engineering and testing
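Deriving a label from a symbol name can be sketched with simple keyword matching. The keyword table below is a hypothetical starting point, not a vetted taxonomy, and `label_from_name` is a name introduced here; the class names follow the behaviors listed earlier in the talk (crypto, file I/O, networking).

```python
# Hypothetical keyword table: class label -> substrings that suggest it.
KEYWORDS = {
    "crypto": ("aes", "sha", "crypt", "cipher"),
    "file_io": ("fopen", "fread", "fwrite", "file"),
    "networking": ("socket", "connect", "send", "recv"),
}


def label_from_name(function_name):
    """Return the first class whose keyword appears in the name, or None."""
    lowered = function_name.lower()
    for label, keywords in KEYWORDS.items():
        if any(keyword in lowered for keyword in keywords):
            return label
    return None  # leave unlabeled rather than guess


labels = {name: label_from_name(name) for name in ["aes_crypto", "BIO_socket", "main"]}
```

Substring matching is noisy (a short keyword can appear inside an unrelated name), so in practice the unlabeled and ambiguous functions would be dropped from the training set or reviewed by hand.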
  • 61. REFERENCES 1. Rieck, K., Trinius, P., Willems, C., & Holz, T. (2011). Automatic analysis of malware behavior using machine learning. Journal of Computer Security, 19(4), 639-668. 2. Anderson, B., Storlie, C., Yates, M., & McPhall, A. (2014, November). Automating reverse engineering with machine learning techniques. In Proceedings of the 2014 Workshop on Artificial Intelligent and Security Workshop (pp. 103-112). ACM. 3. Keliris, A., & Maniatakos, M. (2018). ICSREF: A Framework for Automated Reverse Engineering of Industrial Control Systems Binaries. arXiv preprint arXiv:1812.03478. 4. He, J., Ivanov, P., Tsankov, P., Raychev, V., & Vechev, M. (2018, October). Debin: Predicting Debug Information in Stripped Binaries. In Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security (pp. 1667-1680). ACM. 5. Chio, C., & Freeman, D. (2018). Machine learning and security: Protecting systems with data and algorithms. O’Reilly Media.