Half-automatic Compilable Source Code
Recovery
Joxean Koret
Introduction
●
The problem
●
The idea
●
The prototype
●
The future
The problem
●
Often, in reverse engineering tasks, we need to extract
pieces of code from binaries and copy them into our own source code.
●
Some basic examples of when it might be needed:
– Compatibility.
– Copying decryption routines from malware.
– Recovery/reconstruction of lost source code.
– Half-automated porting of code written in assembler to high-level
languages.
– ...
Compatibility
●
Let’s say that we want/need to be compatible with some commercial software
that is only available in binary form.
●
After we have reverse engineered this or that obscure algorithm for some
obscure file format they invented on their own, we can:
– Either implement everything from scratch by writing specifications from the reverse
engineered piece of software (something very common)
– Or use portions of that commercial software directly in our software.
●
I’m ignoring legality here.
– This might sound barely legal, but it’s regularly done in many industries.
– Yes, regularly. Some random examples: antivirus products or commercial game cheats.
Copying decryption routines
●
It’s very common in the anti-malware industry to just copy, verbatim,
algorithms from malware after reverse engineering them.
●
Indeed, I have done this task myself more than once: manually
rewriting from assembler to C and writing AV plugins for cleaning this or
that file infector.
●
I know people that have even directly copied raw assembler and put it
in __asm__ blocks…
●
Legality, you mean? I don’t think malware authors are going to
complain about their IP being used anywhere. Call me crazy.
Recovery/Reconstruction of Lost Source Codes
●
This is one of the top 10 most common reverse engineering tasks. It’s
also one of the top 5 most tedious reverse engineering tasks.
●
One easy example:
– Company ACME produces a software named $SOFT for some specific
industry.
– Due to a disaster, ACME loses all or part of the source code for $SOFT.
– ACME contracts reverse engineering services from some poor souls to
reconstruct it from the binaries they distribute to customers.
●
With a bit of luck, but not always, the reversers might have some
version with debugging symbols (DWARF or PDB files).
The problem
●
There are various other examples where we need to extract pieces
of code from binaries and copy them into our source code, but I think these
are good ones.
●
Right now, the only partial solution to this problem is the following:
– Reverse engineer the software, discover structs, enums, function names,
etc…
– Copy & paste from IDA/Ghidra’s decompiler into our source code.
– Adapt the code to make it compilable.
– Not feasible for big code bases, or at least impractical.
The idea
●
The solution to the previous problem is obvious:
– Write a tool that automates most of the tedious and boring routine
tasks.
– Make it interactive.
– Allow incremental changes.
– Integrate with the de-facto reverse engineering tools.
– …
– Profit?
The idea
●
It sounds easier than it really is. But it isn’t that hard either,
honestly.
– Unless you want to write a tool that doesn’t use an already existing
disassembler & decompiler like IDA/Ghidra.
●
Indeed, I don’t really consider it a pure reverse engineering
problem but more of a software engineering problem.
– We’re not, say, searching for classes and their hierarchies in binaries.
– We’re just going to output compilable source code from the decompiler,
using hints from, and interacting with, the reverse engineer.
The idea
●
So, what should such a tool do, in my opinion?
– Find functions and their correct prototypes.
– Find local and global variables.
– Find imported functions and their corresponding header and library files.
– Find source files in the binaries.
– Find (and ignore) C runtime libraries.
– Find the hierarchy of structs, enums, functions, globals, etc…
– And, finally, output source code with all the required dependencies.
●
Easy. Isn’t it?
Disclaimers
●
Assume that I’m only talking about C programs. No
Visual Basic, Go, Delphi, etc… Just plain C programs.
●
During this talk I will only consider the tools IDA and
Ghidra.
●
The reason is easy: these are the only reverse
engineering tools with their own decompiler.
●
That explained, let’s continue...
Finding functions
●
Again, let’s assume that we’re talking about just C programs.
●
Finding functions is easy: we just need to use whatever APIs IDA/Ghidra offer
to walk them.
– We might need to manually find and create some functions that IDA/Ghidra didn’t find.
– Distinguishing between data and code is not trivial, although nowadays tools work pretty well
in most of the cases.
●
We also need to fix the function prototypes, which is one of the areas where
decompilers fail too often.
– Very often, calling conventions, number of arguments and types are wrongly guessed by
decompilers.
– That’s normal. Reversers can manually fix them, fortunately.
Finding Functions
●
Extract from the paper “JTR: A Binary Solution
for Switch-Case Recovery”:
Finding Functions
●
There are many cases, however, where functions can be
missed or their boundaries wrongly guessed:
– Virtual functions, jump tables, switch idioms, self-modifying code,
on-the-fly generated code, function tables exported by external libraries,
etc…
●
Writing a tool that works in all cases is pretty much impossible.
●
But we can focus on a tool that might work in the general case
and then enhance/improve it later.
Finding Local & Global Variables
●
This might sound easy too, but it isn’t.
●
Variables aren’t a concept that exists in (most?) CPUs:
– They are just memory areas.
– Data flow analysis is required to find them and their aliases.
– In Ilfak’s words in his white-paper “Decompilers and Beyond”,
“variable allocation” is “worth a separate paper”.
Finding Local & Global Variables
●
Extract from the white-paper “Decompilers and
Beyond”, by Ilfak Guilfanov:
Finding Local & Global Variables
●
Fortunately, in many cases (most cases, for many targets), finding local variables isn’t that
hard.
– And that’s not your job in most cases anyway, it’s the job of IDA/Ghidra. We just need to fix things.
●
Finding global variables is, usually, easier.
– We just need to find references to memory addresses outside of the segment where the code is.
●
Or read+write memory references to outside of the current function’s boundaries.
– Distinguishing between constants and variables might be done by checking the permissions of the
segment (ie: if it’s read-only).
– Unlike the previous cases, however, it is our task to determine whether we’re dealing with a
constant or a global variable in order to output compilable source code.
●
We will talk about it later on…
●
TIP: We don’t really care, but it would be prettier.
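A minimal sketch of the segment permission check described above, written in plain Python so it runs outside IDA/Ghidra. The `Segment` and `Global` records and the `declare` helper are hypothetical stand-ins for whatever the disassembler API reports; this is not the prototype's code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    name: str
    start: int
    end: int          # exclusive upper bound
    writable: bool


@dataclass(frozen=True)
class Global:
    name: str
    ctype: str
    address: int


def find_segment(segments, address):
    """Return the segment containing `address`, or None if there is none."""
    for seg in segments:
        if seg.start <= address < seg.end:
            return seg
    return None


def declare(glob, segments):
    """Emit a C declaration, adding `const` when the data sits in a read-only segment."""
    seg = find_segment(segments, glob.address)
    qualifier = "const " if seg is not None and not seg.writable else ""
    return f"{qualifier}{glob.ctype} {glob.name};"


# Made-up example layout (assumption): one read-only and one writable segment.
segments = [
    Segment(".rodata", 0x2000, 0x3000, writable=False),
    Segment(".data", 0x3000, 0x4000, writable=True),
]
print(declare(Global("g_magic", "unsigned int", 0x2010), segments))
print(declare(Global("g_counter", "int", 0x3004), segments))
```

A wrong guess here is mostly cosmetic: dropping the `const` still compiles, which matches the TIP above.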
Finding External Functions
●
Another thing we need to do is to find the external libraries & runtime functions
used by our target.
– For example, if it’s using CreateFileA or sqrt.
●
Depending on these functions, we will need to add proper header files as well as
their corresponding library files to link with. For example:
– CreateFileA: include <windows.h>.
– sqrt: include <math.h> & link with -lm.
●
Header files must be added to the specific source code files we are going to
write that are using them.
●
Library files are going to be used only to output build files.
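The mapping from imported functions to headers and libraries could be sketched as a simple lookup table. The `KNOWN_IMPORTS` table and the `resolve_imports` helper below are assumptions for illustration only; a real table would be far larger and platform specific.

```python
# Hypothetical lookup table: imported function -> (header, linker flag or None).
KNOWN_IMPORTS = {
    "CreateFileA": ("windows.h", None),
    "sqrt": ("math.h", "-lm"),
}


def resolve_imports(used_by_file):
    """`used_by_file` maps a source file name to the imported functions it calls.

    Returns (include lines per source file, sorted linker flags for the build file).
    """
    includes = {}
    link_flags = set()
    for source, functions in used_by_file.items():
        headers = set()
        for func in functions:
            if func not in KNOWN_IMPORTS:
                continue  # unknown import: left for the reverser to resolve
            header, flag = KNOWN_IMPORTS[func]
            headers.add(header)
            if flag:
                link_flags.add(flag)
        includes[source] = [f"#include <{h}>" for h in sorted(headers)]
    return includes, sorted(link_flags)


includes, flags = resolve_imports({"io.c": ["CreateFileA"], "geom.c": ["sqrt", "sqrt"]})
print(includes)
print(flags)
```

Headers are tracked per source file while linker flags are collected once, mirroring the split between source files and build files described above.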
Finding source files in the binaries
●
This task alone is a whole research topic:
– How can we guess object files’ boundaries, and thus,
source files’ boundaries, in binaries?
●
And, indeed, it has already been researched:
– “A Code Pirate’s Cutlass. Recovering Software
Architecture from Embedded Binaries”, by
@evm_sec.
Finding source files in the binaries
●
The previously mentioned talk, and tool, try to infer object file boundaries
from binaries without using debugging information.
– If we have debugging information, we can skip this step!
●
In order to generate compilable source codes from binaries, we will need to
“know” the boundaries of the object files.
NOTE: Image extracted from the previously mentioned talk.
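Once the boundaries are known (from debugging information or from an inference step), assigning functions to source files is the easy part. A sketch, assuming each object file occupies one contiguous address range; the `assign_to_files` helper and the addresses are made up for illustration.

```python
from bisect import bisect_right


def assign_to_files(boundaries, functions):
    """Assign each function to the file whose start address is the closest at or below it.

    boundaries: list of (start_address, file_name), one entry per object file.
    functions:  list of (address, function_name).
    """
    boundaries = sorted(boundaries)
    starts = [start for start, _ in boundaries]
    result = {name: [] for _, name in boundaries}
    for address, func in sorted(functions):
        index = bisect_right(starts, address) - 1
        if index < 0:
            continue  # function lies before the first known boundary
        result[boundaries[index][1]].append(func)
    return result


layout = assign_to_files(
    [(0x1000, "a.c"), (0x2000, "b.c")],
    [(0x1000, "f"), (0x1800, "g"), (0x2000, "h"), (0x0800, "x")],
)
print(layout)
```

Inferring the boundaries themselves is the hard problem; this only shows what the tool can do with them afterwards.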
Finding Runtime Functions
●
We also need to find the C/C++ runtime functions in use, basically to ignore
them. We don’t want to add __libc_start_main or __gmon_start__ functions to our
generated source code files.
●
The solution to this problem is kind of “easy”:
– Thanks to IDA’s FLIRT signatures we can ignore anything that seems to come from a
library.
– If we have function names, we can also black-list some of them.
●
It’s a never-ending story…
– Also, we must let the reverser, somehow, specify which functions must be skipped or
not.
●
In my experience, interactivity is always key in reverse engineering tools.
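The skip logic could look roughly like this. It is a sketch under assumptions: `is_library` stands for whatever the disassembler reports (for example a signature match), and `DEFAULT_BLACKLIST`, `force_skip` and `force_keep` are hypothetical names for the black-list and the reverser's overrides.

```python
# Hypothetical default black-list; a real one would be much longer.
DEFAULT_BLACKLIST = {"__libc_start_main", "__gmon_start__", "_start"}


def should_export(name, is_library, force_skip=(), force_keep=()):
    """Decide whether a function goes into the generated sources.

    The reverser's overrides win over the library flag and the black-list.
    """
    if name in force_keep:
        return True
    if name in force_skip:
        return False
    return not is_library and name not in DEFAULT_BLACKLIST


print(should_export("parse_header", is_library=False))
print(should_export("memcpy", is_library=True))
print(should_export("memcpy", is_library=True, force_keep={"memcpy"}))
```

Checking the overrides first keeps the tool interactive: an automatic decision never blocks the reverser.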
Finding Objects’ Hierarchy
●
Let’s say that we have reverse engineered a target for some time and we have
proper function names, local variables, structs, enums, etc…
●
When writing source code files we need to know which structs, functions, enums,
global variables, etc… are used by each source code.
●
Also, we need to remember that structs, functions, global variable types, etc… might
depend on other structs, types, functions, etc…
●
It’s required to build a hierarchy of objects to output proper compilable source code.
– We could also just add a lot of “extern” declarations wrapped in #ifdef, or apply a similar ugly
workaround but, well… it isn’t pretty.
– And it looks pretty dangerous.
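One common way to resolve such a hierarchy is a topological sort, so that every object is emitted after the objects it depends on. The sketch below uses Python's standard `graphlib` module (Python 3.9+); the object names are made up and this is not necessarily how the prototype does it.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def emission_order(depends_on):
    """Return the objects ordered so that dependencies come first.

    `depends_on` maps each object to the objects that must be declared before it.
    """
    return list(TopologicalSorter(depends_on).static_order())


# Hypothetical dependency graph for illustration.
deps = {
    "struct header": [],
    "enum kind": [],
    "struct packet": ["struct header", "enum kind"],
    "parse_packet()": ["struct packet"],
}
order = emission_order(deps)
print(order)
```

`static_order()` raises `graphlib.CycleError` on circular dependencies, for example two structs pointing at each other; in C those cases need a forward declaration to break the cycle.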
Generating Compilable Sources
●
And, finally, the last step is to “just” write the
generated source code files with all the required
dependencies and with the whole hierarchy
resolved, plus the build files.
●
In my prototype, for now, I’m generating just plain
Makefiles. But one could generate anything:
Ninja, Visual Studio project files, CMake, etc...
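Generating a plain Makefile is mostly string formatting. A minimal sketch, assuming a flat list of C files and a list of linker flags; `render_makefile` and the default compiler flags are illustrative choices, not the prototype's output.

```python
import os


def render_makefile(target, sources, link_flags=()):
    """Return the text of a plain Makefile that builds `target` from C `sources`."""
    objects = " ".join(os.path.splitext(src)[0] + ".o" for src in sources)
    libs = " ".join(link_flags)
    lines = [
        "CC ?= cc",
        "CFLAGS ?= -O2 -Wall",
        f"OBJS = {objects}",
        f"LDLIBS = {libs}".rstrip(),
        "",
        f"{target}: $(OBJS)",
        "\t$(CC) $(CFLAGS) -o $@ $(OBJS) $(LDLIBS)",
        "",
        "%.o: %.c",
        "\t$(CC) $(CFLAGS) -c -o $@ $<",
        "",
        "clean:",
        f"\trm -f $(OBJS) {target}",
        "",
        ".PHONY: clean",
    ]
    return "\n".join(lines) + "\n"


print(render_makefile("soft", ["main.c", "io.c"], ["-lm"]))
```

The same inputs (source list, linker flags) would be enough to emit CMake or Ninja files instead.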
The Prototype
●
Since September 2019 I have been working on a
prototype (for now) of such a tool.
●
The prototype is an IDA plugin + IDA Python tool.
●
It’s called “Source REcoverer”.
– Call me original.
●
Let’s briefly discuss it...
Components of the Prototype
●
An IDA C++ plugin (idaunexposed) exporting a
single function to IDA Python: get_cdef.
– This plugin uses print_type and format_data, which
aren’t really useable from Python.
●
An independent IDA Python script that uses the
previously mentioned plugin and does everything
expected from a source code recovery tool.
How does it work?
●
Iterate all functions in the binary.
●
“Guess” all the source files, if possible (using debugging information that is not available
by default or using IDAMagicStrings.py to get possible source files using debugging
strings).
●
Find structs, enums and global variables.
●
Decompile all the functions.
●
Write a project file, source codes with dependencies mostly fully resolved and a Makefile.
●
When something is changed in IDA, only the affected part/source file will be regenerated.
– The idea is that the reverser doesn’t need to modify generated source files.
– The reverser will just interact with the tool or with the generated project file.
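One simple way to get this incremental behaviour is to compare content hashes and only rewrite files that actually changed. This is a sketch of that general technique, not the prototype's mechanism; `write_if_changed` is a hypothetical helper.

```python
import hashlib
import os
import tempfile


def write_if_changed(path, content):
    """Write `content` to `path` only when it differs from what is on disk.

    Returns True if the file was (re)written, False if it was left untouched.
    """
    new_digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
    if os.path.exists(path):
        with open(path, "rb") as handle:
            if hashlib.sha256(handle.read()).hexdigest() == new_digest:
                return False
    # newline="" avoids newline translation, so the bytes on disk match the hash.
    with open(path, "w", encoding="utf-8", newline="") as handle:
        handle.write(content)
    return True


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "io.c")
    print(write_if_changed(target, "int f(void);\n"))
    print(write_if_changed(target, "int f(void);\n"))
    print(write_if_changed(target, "int g(void);\n"))
```

Leaving unchanged files untouched also keeps their timestamps, so a Makefile based build only recompiles what was regenerated.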
DEMOS!
The Future
●
The current tool is just a quick prototype.
– It works. But it sucks.
●
I will, most likely, rewrite it soon using C/C++ code.
●
Supporting Ghidra was considered but… the
decompiler generates too many constructions that
aren’t compilable and must be cleaned.
The Future
●
An integrated reverser-friendly GUI.
●
Right now, we have to manually update the
project file (a JSON formatted file).
●
We need a GUI to assign functions to source
files, select which local types we want to export,
which functions we want to ignore, etc...
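To make the idea concrete, here is what reading such a project file could look like. The JSON layout below is entirely hypothetical (the talk does not show the real schema); the key names `sources`, `ignored_functions` and `exported_types` are assumptions chosen to mirror the three GUI tasks listed above.

```python
import json

# Hypothetical project layout; the prototype's real JSON schema is not shown in the talk.
PROJECT_JSON = """
{
  "sources": {"io.c": ["read_block", "write_block"], "main.c": ["main", "__libc_start_main"]},
  "ignored_functions": ["__libc_start_main"],
  "exported_types": ["struct header"]
}
"""


def functions_to_export(project):
    """Map each source file to its assigned functions, minus the ignored ones."""
    ignored = set(project.get("ignored_functions", []))
    return {
        source: [func for func in funcs if func not in ignored]
        for source, funcs in project.get("sources", {}).items()
    }


project = json.loads(PROJECT_JSON)
print(functions_to_export(project))
```

A GUI would only need to edit this structure and trigger a regeneration.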
The Future
●
In the current version, source files are “found” using one of the
following 2 methods:
– Using debugging information (DWARF, mainly).
– Using debugging strings containing file paths.
●
In the next version, I plan to implement my own version of
@evm_sec’s algorithm for Local Function Affinity (LFA).
– It tries to infer translation units’ boundaries in binary files.
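The second method above (debugging strings containing file paths) can be approximated with a regular expression over the binary's strings. This is a rough sketch, not the logic of IDAMagicStrings.py; the `SOURCE_PATH` pattern and the sample strings are assumptions.

```python
import re

# Matches Unix or Windows style paths ending in a C/C++ source extension.
SOURCE_PATH = re.compile(r"(?:[A-Za-z]:)?[\w./\\-]+\.(?:c|cc|cpp|cxx)\b", re.IGNORECASE)


def guess_source_files(strings):
    """Return the sorted set of source file base names found in `strings`."""
    found = set()
    for text in strings:
        for match in SOURCE_PATH.findall(text):
            # Keep only the file name, dropping the directory part.
            found.add(re.split(r"[/\\]", match)[-1])
    return sorted(found)


# Made-up strings of the kind left behind by assert() or logging macros.
samples = [
    "assertion failed at /build/src/parser.c:120",
    "C:\\dev\\soft\\net\\socket.cpp(88): error",
    "hello world",
    "src/parser.c: out of memory",
]
print(guess_source_files(samples))
```

Cross-references from each string to the function using it would then suggest which function belongs to which file.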
The Future
●
Classes recovery. This is a whole project by
itself.
●
The idea is to try to find classes, resolve the
hierarchy and, finally, write the definition of the
classes too.
●
Non-trivial. But that would be awesome to have.
The Future
●
The prototype will be released at some point
this year.
●
The final tool, hopefully, will be available by the
end of this year.
And that’s all!
●
Thank you!
●
Questions?

More Related Content

What's hot

Stuxnet redux. malware attribution & lessons learned
Stuxnet redux. malware attribution & lessons learnedStuxnet redux. malware attribution & lessons learned
Stuxnet redux. malware attribution & lessons learned
Yury Chemerkin
 
Listen and look at your PHP code
Listen and look at your PHP codeListen and look at your PHP code
Listen and look at your PHP code
Gabriele Santini
 
JProfiler / an introduction
JProfiler / an introductionJProfiler / an introduction
JProfiler / an introduction
Tommaso Torti
 

What's hot (20)

Guadalajara con 2012
Guadalajara con 2012Guadalajara con 2012
Guadalajara con 2012
 
Stuxnet redux. malware attribution & lessons learned
Stuxnet redux. malware attribution & lessons learnedStuxnet redux. malware attribution & lessons learned
Stuxnet redux. malware attribution & lessons learned
 
Interview with Dmitriy Vyukov - the author of Relacy Race Detector (RRD)
Interview with Dmitriy Vyukov - the author of Relacy Race Detector (RRD)Interview with Dmitriy Vyukov - the author of Relacy Race Detector (RRD)
Interview with Dmitriy Vyukov - the author of Relacy Race Detector (RRD)
 
Static Code Analysis and Cppcheck
Static Code Analysis and CppcheckStatic Code Analysis and Cppcheck
Static Code Analysis and Cppcheck
 
Production Debugging at Code Camp Philly
Production Debugging at Code Camp PhillyProduction Debugging at Code Camp Philly
Production Debugging at Code Camp Philly
 
Ch01 basic-java-programs
Ch01 basic-java-programsCh01 basic-java-programs
Ch01 basic-java-programs
 
Model-checking for efficient malware detection
Model-checking for efficient malware detectionModel-checking for efficient malware detection
Model-checking for efficient malware detection
 
Automated In-memory Malware/Rootkit Detection via Binary Analysis and Machin...
Automated In-memory Malware/Rootkit  Detection via Binary Analysis and Machin...Automated In-memory Malware/Rootkit  Detection via Binary Analysis and Machin...
Automated In-memory Malware/Rootkit Detection via Binary Analysis and Machin...
 
Difficulties of comparing code analyzers, or don't forget about usability
Difficulties of comparing code analyzers, or don't forget about usabilityDifficulties of comparing code analyzers, or don't forget about usability
Difficulties of comparing code analyzers, or don't forget about usability
 
Difficulties of comparing code analyzers, or don't forget about usability
Difficulties of comparing code analyzers, or don't forget about usabilityDifficulties of comparing code analyzers, or don't forget about usability
Difficulties of comparing code analyzers, or don't forget about usability
 
DEFCON 21: EDS: Exploitation Detection System WP
DEFCON 21: EDS: Exploitation Detection System WPDEFCON 21: EDS: Exploitation Detection System WP
DEFCON 21: EDS: Exploitation Detection System WP
 
Listen and look at your PHP code
Listen and look at your PHP codeListen and look at your PHP code
Listen and look at your PHP code
 
Difficulties of comparing code analyzers, or don't forget about usability
Difficulties of comparing code analyzers, or don't forget about usabilityDifficulties of comparing code analyzers, or don't forget about usability
Difficulties of comparing code analyzers, or don't forget about usability
 
Python Foundation – A programmer's introduction to Python concepts & style
Python Foundation – A programmer's introduction to Python concepts & stylePython Foundation – A programmer's introduction to Python concepts & style
Python Foundation – A programmer's introduction to Python concepts & style
 
DEFCON 21: EDS: Exploitation Detection System Slides
DEFCON 21: EDS: Exploitation Detection System SlidesDEFCON 21: EDS: Exploitation Detection System Slides
DEFCON 21: EDS: Exploitation Detection System Slides
 
Cppcheck
CppcheckCppcheck
Cppcheck
 
Unit 8 Java
Unit 8 JavaUnit 8 Java
Unit 8 Java
 
SherLog: Error Diagnosis by Connecting Clues from Run-time Logs
SherLog: Error Diagnosis by Connecting Clues from Run-time LogsSherLog: Error Diagnosis by Connecting Clues from Run-time Logs
SherLog: Error Diagnosis by Connecting Clues from Run-time Logs
 
JProfiler / an introduction
JProfiler / an introductionJProfiler / an introduction
JProfiler / an introduction
 
Linux binary analysis and exploitation
Linux binary analysis and exploitationLinux binary analysis and exploitation
Linux binary analysis and exploitation
 

Similar to Half-automatic Compilable Source Code Recovery

Reverse engineering
Reverse engineeringReverse engineering
Reverse engineering
Saswat Padhi
 
Python debuggers slides
Python debuggers slidesPython debuggers slides
Python debuggers slides
mattboehm
 
Programming Languages #devcon2013
Programming Languages #devcon2013Programming Languages #devcon2013
Programming Languages #devcon2013
Iván Montes
 

Similar to Half-automatic Compilable Source Code Recovery (20)

Pigaios: A Tool for Diffing Source Codes against Binaries (Hacktivity 2018)
Pigaios: A Tool for Diffing Source Codes against Binaries (Hacktivity 2018)Pigaios: A Tool for Diffing Source Codes against Binaries (Hacktivity 2018)
Pigaios: A Tool for Diffing Source Codes against Binaries (Hacktivity 2018)
 
What every C++ programmer should know about modern compilers (w/ comments, AC...
What every C++ programmer should know about modern compilers (w/ comments, AC...What every C++ programmer should know about modern compilers (w/ comments, AC...
What every C++ programmer should know about modern compilers (w/ comments, AC...
 
DDD with Behat
DDD with BehatDDD with Behat
DDD with Behat
 
Oh the compilers you'll build
Oh the compilers you'll buildOh the compilers you'll build
Oh the compilers you'll build
 
Writing code for people
Writing code for peopleWriting code for people
Writing code for people
 
Reverse engineering
Reverse engineeringReverse engineering
Reverse engineering
 
New c sharp4_features_part_iv
New c sharp4_features_part_ivNew c sharp4_features_part_iv
New c sharp4_features_part_iv
 
Introduction to the intermediate Python - v1.1
Introduction to the intermediate Python - v1.1Introduction to the intermediate Python - v1.1
Introduction to the intermediate Python - v1.1
 
What is dev ops?
What is dev ops?What is dev ops?
What is dev ops?
 
Demystifying Binary Reverse Engineering - Pixels Camp
Demystifying Binary Reverse Engineering - Pixels CampDemystifying Binary Reverse Engineering - Pixels Camp
Demystifying Binary Reverse Engineering - Pixels Camp
 
IDE and Toolset For Magento Development
IDE and Toolset For Magento DevelopmentIDE and Toolset For Magento Development
IDE and Toolset For Magento Development
 
engage 2014 - JavaBlast
engage 2014 - JavaBlastengage 2014 - JavaBlast
engage 2014 - JavaBlast
 
Python debuggers slides
Python debuggers slidesPython debuggers slides
Python debuggers slides
 
Metaprogramming Go
Metaprogramming GoMetaprogramming Go
Metaprogramming Go
 
Itroduction about java
Itroduction about javaItroduction about java
Itroduction about java
 
Introducing systems analysis, design & development Concepts
Introducing systems analysis, design & development ConceptsIntroducing systems analysis, design & development Concepts
Introducing systems analysis, design & development Concepts
 
Whirlwind tour of the Runtime Dynamic Linker
Whirlwind tour of the Runtime Dynamic LinkerWhirlwind tour of the Runtime Dynamic Linker
Whirlwind tour of the Runtime Dynamic Linker
 
Documenting code yapceu2016
Documenting code yapceu2016Documenting code yapceu2016
Documenting code yapceu2016
 
Programming Languages #devcon2013
Programming Languages #devcon2013Programming Languages #devcon2013
Programming Languages #devcon2013
 
Refactoring, 2nd Edition
Refactoring, 2nd EditionRefactoring, 2nd Edition
Refactoring, 2nd Edition
 

Recently uploaded

FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
dollysharma2066
 
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort ServiceCall Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
dharasingh5698
 
Integrated Test Rig For HTFE-25 - Neometrix
Integrated Test Rig For HTFE-25 - NeometrixIntegrated Test Rig For HTFE-25 - Neometrix
Integrated Test Rig For HTFE-25 - Neometrix
Neometrix_Engineering_Pvt_Ltd
 

Recently uploaded (20)

University management System project report..pdf
University management System project report..pdfUniversity management System project report..pdf
University management System project report..pdf
 
FEA Based Level 3 Assessment of Deformed Tanks with Fluid Induced Loads
FEA Based Level 3 Assessment of Deformed Tanks with Fluid Induced LoadsFEA Based Level 3 Assessment of Deformed Tanks with Fluid Induced Loads
FEA Based Level 3 Assessment of Deformed Tanks with Fluid Induced Loads
 
(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7
(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7
(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7
 
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
 
VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...
VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...
VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...
 
Introduction to Serverless with AWS Lambda
Introduction to Serverless with AWS LambdaIntroduction to Serverless with AWS Lambda
Introduction to Serverless with AWS Lambda
 
(INDIRA) Call Girl Meerut Call Now 8617697112 Meerut Escorts 24x7
(INDIRA) Call Girl Meerut Call Now 8617697112 Meerut Escorts 24x7(INDIRA) Call Girl Meerut Call Now 8617697112 Meerut Escorts 24x7
(INDIRA) Call Girl Meerut Call Now 8617697112 Meerut Escorts 24x7
 
Call Girls Wakad Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Wakad Call Me 7737669865 Budget Friendly No Advance BookingCall Girls Wakad Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Wakad Call Me 7737669865 Budget Friendly No Advance Booking
 
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort ServiceCall Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
 
Water Industry Process Automation & Control Monthly - April 2024
Water Industry Process Automation & Control Monthly - April 2024Water Industry Process Automation & Control Monthly - April 2024
Water Industry Process Automation & Control Monthly - April 2024
 
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
 
Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...
Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...
Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...
 
COST-EFFETIVE and Energy Efficient BUILDINGS ptx
COST-EFFETIVE  and Energy Efficient BUILDINGS ptxCOST-EFFETIVE  and Energy Efficient BUILDINGS ptx
COST-EFFETIVE and Energy Efficient BUILDINGS ptx
 
Hazard Identification (HAZID) vs. Hazard and Operability (HAZOP): A Comparati...
Hazard Identification (HAZID) vs. Hazard and Operability (HAZOP): A Comparati...Hazard Identification (HAZID) vs. Hazard and Operability (HAZOP): A Comparati...
Hazard Identification (HAZID) vs. Hazard and Operability (HAZOP): A Comparati...
 
Unit 1 - Soil Classification and Compaction.pdf
Unit 1 - Soil Classification and Compaction.pdfUnit 1 - Soil Classification and Compaction.pdf
Unit 1 - Soil Classification and Compaction.pdf
 
(INDIRA) Call Girl Bhosari Call Now 8617697112 Bhosari Escorts 24x7
(INDIRA) Call Girl Bhosari Call Now 8617697112 Bhosari Escorts 24x7(INDIRA) Call Girl Bhosari Call Now 8617697112 Bhosari Escorts 24x7
(INDIRA) Call Girl Bhosari Call Now 8617697112 Bhosari Escorts 24x7
 
Design For Accessibility: Getting it right from the start
Design For Accessibility: Getting it right from the startDesign For Accessibility: Getting it right from the start
Design For Accessibility: Getting it right from the start
 
Integrated Test Rig For HTFE-25 - Neometrix
Integrated Test Rig For HTFE-25 - NeometrixIntegrated Test Rig For HTFE-25 - Neometrix
Integrated Test Rig For HTFE-25 - Neometrix
 
chapter 5.pptx: drainage and irrigation engineering
chapter 5.pptx: drainage and irrigation engineeringchapter 5.pptx: drainage and irrigation engineering
chapter 5.pptx: drainage and irrigation engineering
 
DC MACHINE-Motoring and generation, Armature circuit equation
DC MACHINE-Motoring and generation, Armature circuit equationDC MACHINE-Motoring and generation, Armature circuit equation
DC MACHINE-Motoring and generation, Armature circuit equation
 

Half-automatic Compilable Source Code Recovery

  • 1. Half-automatic Compilable Source Code Recovery Joxean Koret
  • 4. The problem ● Often, for so many reverse engineering tasks, we need to extract pieces of code from binaries to copy in our source codes. ● Some basic examples of when it might be needed: – Compatibility. – Copying decryption routines from malwares. – Recovery/reconstruction of lost source codes. – Half-automated porting of codes written in assembler to high level languages. – ...
  • 5. Compatibility ● Let’s say that we want/need to be compatible with some commercial software that is only available in binary form. ● After we have reverse engineered this or that obscure algorithm for some obscure file format they invented in their own, we can: – Either implement everything from zero by writing specifications from the reverse engineered piece of software (something very common) – Or use portions of that commercial software directly into our software. ● I’m ignoring legality here. – This might sound barely legal, but it’s regularly done in many industries. – Yes, regularly. Some random examples: antivirus or commercial games cheats.
  • 6. Copying decryption routines ● It’s very common in the anti-malware industry to just copy, verbatim, algorithms from malwares after reverse engineering them. ● Indeed, I have done this task myself more than once: manually rewrite from assembler to C and write AV plugins for cleaning this or that file infector. ● I know people that have even directly copied raw assembler and put it in __asm__ blocks… ● Legality, you mean? I don’t think malware authors are going to complain about their IP being used anywhere. Call me crazy.
  • 7. Recovery/Reconstruction of Lost Source Codes ● This is one of the top 10 most common reverse engineering task. It’s also one of the top 5 most tedious reverse engineering tasks. ● One easy example: – Company ACME produces a software named $SOFT for some specific industry. – Due to a disaster, ACME loses all or part of the source codes for $SOFT. – ACME contracts reverse engineering services from some poor souls to reconstruct from the binaries they distribute to customers. ● With a bit of luck, but not always, the reversers might be lucky enough to have some version with debugging symbols (DWARF or PDB files).
  • 8. The problem ● There are various other examples where we need to extract pieces of code from binaries to copy in our source codes, but I think these are good ones. ● Right now, the only partial solution to this problem is the following: – Reverse engineer the software, discover structs, enums, function names, etc… – Copy & paste from IDA/Ghidra’s decompiler to our source codes. – Adapt the code to make it compilable. – Not feasible for big codes or at least non practical.
  • 10. The idea ● The solution to the previous problem is obvious: – Write a tool that automates most of the tedious and boring routine tasks. – Make it interactive. – Allow incremental changes. – Integrate with the de-facto reverse engineering tools. – … – Profit?
  • 11. The idea ● It sounds easier than it really is. But it isn’t either that-that hard, honestly. – Unless you want to write a tool that doesn’t use an already existing disassembler & decompiler like IDA/Ghidra. ● Indeed, I don’t really consider it a pure reverse engineering problem but more of a software engineering problem. – We’re not, say, searching for classes and their hierarchies in binaries. – We’re just going to output compilable source codes using the decompiler using hints from and interacting with the reverse engineer.
  • 12. The idea ● So, what such a tool should do, in my opinion? – Find functions and their correct prototypes. – Find local and global variables. – Find imported functions and their corresponding header and library files. – Find source files in the binaries. – Find (and ignore) C runtime libraries. – Find the hierarchy of structs, enums, functions, globals, etc… – And, finally, output source codes with all the required requisites. ● Easy. Isn’t it?
  • 13. Disclaimers ● Assume that I’m only talking about C programs. No Visual Basic, Go, Delphi, etc… Just plain C programs. ● During this talk I will only consider the tools IDA and Ghidra. ● The reason is easy: these are the only reverse engineering tools with their own decompiler. ● That explained, let’s continue...
  • 14. Finding functions ● Again, let’s assume that we’re talking about just C programs. ● Finding functions is easy: we just need to use whatever IDA/Ghidra APIs offer for us to walk them. – We might need to manually find and create some functions that IDA/Ghidra didn’t find. – Distinguishing between data and code is not trivial, although nowadays tools work pretty well in most of the cases. ● We also need to fix the function prototypes, which is one of the areas where decompilers fail too much. – So many times calling conventions, number of arguments and types are wrongly guessed by decompilers. – That’s normal. Reversers can manually fix them, fortunately.
  • 15. Finding Functions ● Extract from the paper “JTR: A Binary Solution for Switch-Case Recovery”:
  • 16. Finding Functions ● There are many cases, however, where functions can be missed or their boundaries wrongly guessed: – Virtual functions, jump tables, switch idioms, self-modifying code, on-the-fly generated code, function tables exported by external libraries, etc… ● Writing a tool that works in all cases is pretty much impossible. ● But we can focus on a tool that works in the general case and then enhance/improve it later.
  • 17. Finding Local & Global Variables ● This might sound easy too, but it isn’t. ● Variables aren’t a concept that exists in (most?) CPUs: – They are just memory areas. – Data flow analysis is required to find them and their aliases. – In Ilfak’s words in his white-paper “Decompilers and Beyond”, “variable allocation” is “worth a separate paper”.
  • 18. Finding Local & Global Variables ● Extract from the white-paper “Decompilers and Beyond”, by Ilfak Guilfanov:
  • 19. Finding Local & Global Variables ● Fortunately, in many cases (most cases, for many targets), finding local variables isn’t that hard. – And that’s not your job in most cases anyway, it’s the job of IDA/Ghidra. We just need to fix things. ● Finding global variables is, usually, easier. – We just need to find references to memory addresses outside of the segment where the code is. ● Or read+write memory references to outside of the current function’s boundaries. – Distinguishing between constants and variables might be done by checking the permissions of the segment (i.e., whether it’s read-only). – Unlike the previous cases, however, it is our task to determine whether we’re dealing with a constant or a global variable in order to output compilable source code. ● We will talk about it later on… ● TIP: We don’t really care, but it would be prettier.
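The segment-permission check mentioned above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Segment` class, `SEGMENTS` list and `classify_data_reference` function are hypothetical stand-ins for whatever the disassembler reports, not IDA or Ghidra APIs.

```python
from dataclasses import dataclass


# Hypothetical, simplified model of a segment as reported by a disassembler.
# These are NOT IDA or Ghidra types.
@dataclass(frozen=True)
class Segment:
    name: str
    start: int
    end: int  # exclusive
    writable: bool
    executable: bool


def find_segment(segments, address):
    """Return the segment that contains address, or None."""
    for seg in segments:
        if seg.start <= address < seg.end:
            return seg
    return None


def classify_data_reference(segments, address):
    """Classify a referenced address as 'code', 'constant', 'variable' or 'unknown'."""
    seg = find_segment(segments, address)
    if seg is None:
        return "unknown"
    if seg.executable:
        return "code"
    # Read-only data is treated as a constant, writable data as a variable.
    return "variable" if seg.writable else "constant"


# Made-up layout for the example.
SEGMENTS = [
    Segment(".text", 0x1000, 0x2000, writable=False, executable=True),
    Segment(".rodata", 0x2000, 0x3000, writable=False, executable=False),
    Segment(".data", 0x3000, 0x4000, writable=True, executable=False),
]

for addr in (0x1800, 0x2010, 0x3008, 0x9000):
    print(hex(addr), classify_data_reference(SEGMENTS, addr))
```

A global classified as `constant` could then be emitted with a `const` qualifier, which is the "prettier" output the tip refers to.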
  • 20. Finding External Functions ● Another thing we need to do is to find the external libraries & runtime functions used by our target. – For example, if it’s using CreateFileA or sqrt. ● Depending on these functions, we will need to add proper header files as well as their corresponding library files to link with. For example: – CreateFileA: include <windows.h>. – sqrt: include <math.h> & link with -lm. ● Header files must be added to the specific source code files we are going to write that are using them. ● Library files are going to be used only to output build files.
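One way to picture this step is a lookup table from imported function names to headers and linker flags. The sketch below is an assumption about how such a table could look, not the prototype's actual code: `KNOWN_IMPORTS`, `resolve_imports` and the file names are hypothetical, and only the `CreateFileA` and `sqrt` entries come from the slide.

```python
# Illustrative table: function name -> (header, linker flag or None).
# A real tool would need a far larger database.
KNOWN_IMPORTS = {
    "CreateFileA": ("windows.h", None),
    "sqrt": ("math.h", "-lm"),
}


def resolve_imports(imports_by_source):
    """Map {source file: imported function names} to build requirements.

    Returns (headers per source file, sorted linker flags, unresolved names).
    """
    headers = {}
    libs = set()
    unresolved = set()
    for source, names in imports_by_source.items():
        needed = set()
        for name in names:
            entry = KNOWN_IMPORTS.get(name)
            if entry is None:
                # Unknown import: left for the reverser to resolve by hand.
                unresolved.add(name)
                continue
            header, lib = entry
            needed.add(header)
            if lib:
                libs.add(lib)
        headers[source] = sorted(needed)
    return headers, sorted(libs), sorted(unresolved)


# Hypothetical input: which imports each recovered source file uses.
headers, libs, unresolved = resolve_imports({
    "io.c": ["CreateFileA"],
    "geometry.c": ["sqrt", "mystery_import"],
})
for source, needed in headers.items():
    for header in needed:
        print(f"{source}: #include <{header}>")
print("link flags:", libs)
print("unresolved:", unresolved)
```

Headers stay attached to the individual source files, while the linker flags are collected once for the build file, matching the split described on the slide.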
  • 21. Finding source files in the binaries ● This task alone is a whole research topic: – How can we guess object files’ boundaries, and thus, source files’ boundaries, in binaries? ● And, indeed, it has already been researched: – “A Code Pirate’s Cutlass. Recovering Software Architecture from Embedded Binaries”, by @evm_sec.
  • 22. Finding source files in the binaries ● The previously mentioned talk, and tool, try to infer object file boundaries from binaries without using debugging information. – If we have debugging information, we can skip this step! ● In order to generate compilable source code from binaries, we will need to “know” the boundaries of the object files. NOTE: Image extracted from the previously mentioned talk.
  • 23. Finding Runtime Functions ● We also need to find the C/C++ runtime functions in use. Basically, to ignore them. We don’t want to add functions like __libc_start_main or __gmon_start__ to our generated source code files. ● The solution to this problem is kind of “easy”: – Thanks to IDA’s FLIRT signatures we can ignore anything that seems to come from a library. – If we have function names, we can also black-list some of them. ● It’s a never-ending story… – Also, we must let the reverser, somehow, specify which functions must be skipped or not. ● In my experience, interactivity is always key in reverse engineering tools.
  • 24. Finding Objects’ Hierarchy ● Let’s say that we have reverse engineered a target for some time and we have proper function names, local variables, structs, enums, etc… ● When writing source code files we need to know which structs, functions, enums, global variables, etc… are used by each source file. ● Also, we need to remember that structs, functions, global variable types, etc… might depend on other structs, types, functions, etc… ● We need to build a hierarchy of objects to output proper compilable source code. – We could also just add a lot of “extern” declarations wrapped in #ifdef’s, or apply a similar ugly workaround but, well… it isn’t pretty. – And it looks pretty dangerous.
  • 25. Generating Compilable Sources ● And, finally, the last step is to “just” write the generated source code files with all the required dependencies and with the whole hierarchy resolved, plus the build files. ● In my prototype, for now, I’m generating just plain Makefiles. But one could generate anything: Ninja, Visual Studio project files, CMake, etc...
  • 27. The Prototype ● Since September 2019 I have been working on a prototype (for now) of such a tool. ● The prototype is an IDA plugin + IDA Python tool. ● It’s called “Source REcoverer”. – Call me original. ● Let’s briefly discuss it...
  • 28. Components of the Prototype ● An IDA C++ plugin (idaunexposed) exporting a single function to IDA Python: get_cdef. – This plugin uses print_type and format_data, which aren’t really usable from Python. ● A separate IDA Python script that uses the previously mentioned plugin and does everything expected from a source code recovery tool.
  • 29. How does it work? ● Iterate all functions in the binary. ● “Guess” all the source files, if possible (using debugging information, which is not available by default, or using IDAMagicStrings.py to get possible source files from debugging strings). ● Find structs, enums and global variables. ● Decompile all the functions. ● Write a project file, source code with dependencies mostly fully resolved and a Makefile. ● When something is changed in IDA, only the modified part/source file will be modified. – The idea is that the reverser doesn’t need to modify generated source files. – The reverser will just interact with the tool or with the generated project file.
  • 32. The Future ● The current tool is just a quick prototype. – It works. But it sucks. ● I will, most likely, rewrite it soon using C/C++ code. ● Supporting Ghidra was considered but… the decompiler generates too many constructs that aren’t compilable and must be cleaned.
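The filtering policy described above (library match, name blacklist, reverser overrides) might look roughly like the following. Everything here is a hypothetical sketch: `RUNTIME_PATTERNS` is an assumed, incomplete list of common glibc/GCC start-up symbols, and the `is_library` flag merely stands in for something like a FLIRT match reported by the disassembler.

```python
import fnmatch

# Assumed blacklist of runtime/start-up symbols; far from complete.
RUNTIME_PATTERNS = [
    "__libc_*",
    "__gmon_start__",
    "_start",
    "__do_global_*",
    "frame_dummy",
    "register_tm_clones",
    "deregister_tm_clones",
]


def should_export(name, is_library=False, force_keep=(), force_skip=()):
    """Decide whether a function belongs in the generated sources.

    The reverser's explicit choices (force_keep / force_skip) always win
    over the automatic heuristics.
    """
    if name in force_keep:
        return True
    if name in force_skip:
        return False
    if is_library:
        # Stand-in for "the disassembler recognised this as library code".
        return False
    return not any(fnmatch.fnmatchcase(name, pat) for pat in RUNTIME_PATTERNS)


for func in ("main", "__libc_start_main", "frame_dummy", "parse_header"):
    print(func, should_export(func))
```

Checking the overrides first is a deliberate design choice in this sketch: it keeps the tool interactive, since a wrong automatic guess can always be corrected by the reverser.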
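Resolving such a hierarchy is essentially a topological sort: each object must be emitted after everything it depends on. A minimal sketch using Python's standard `graphlib` module (Python 3.9+) follows. The `DEPENDENCIES` graph and its names are made up for illustration and are not taken from the prototype.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical dependency graph: object -> set of objects it depends on.
DEPENDENCIES = {
    "enum file_kind": set(),
    "struct header_t": {"enum file_kind"},
    "struct file_t": {"struct header_t"},
    "g_files": {"struct file_t"},
    "parse_file": {"struct file_t", "g_files"},
}


def emission_order(dependencies):
    """Return the objects ordered so each one follows its dependencies."""
    return list(TopologicalSorter(dependencies).static_order())


order = emission_order(DEPENDENCIES)
for item in order:
    print(item)

# Mutually dependent objects (e.g. two structs pointing at each other)
# form a cycle; in C they would need a forward declaration instead.
try:
    emission_order({"struct a": {"struct b"}, "struct b": {"struct a"}})
except CycleError as exc:
    print("cycle detected:", exc.args[1])
```

Cycles are reported rather than silently broken here, since picking which struct to forward-declare is the kind of decision the reverser may want to influence.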
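As a rough idea of what "generating just plain Makefiles" can involve, the sketch below renders a small Makefile from a list of source files and linker flags. The function name `render_makefile`, its parameters and the default compiler flags are assumptions for the example. The prototype's real output may look quite different.

```python
import os


def render_makefile(target, sources, libs=(), cflags="-O2 -Wall"):
    """Render a minimal Makefile as a string (hypothetical layout)."""
    objects = [os.path.splitext(src)[0] + ".o" for src in sources]
    lines = [
        "CC = cc",
        f"CFLAGS = {cflags}",
        f"LDLIBS = {' '.join(libs)}",
        f"OBJS = {' '.join(objects)}",
        "",
        f"{target}: $(OBJS)",
        "\t$(CC) $(CFLAGS) -o $@ $(OBJS) $(LDLIBS)",
        "",
        "%.o: %.c",
        "\t$(CC) $(CFLAGS) -c -o $@ $<",
        "",
        ".PHONY: clean",
        "clean:",
        f"\trm -f {target} $(OBJS)",
        "",
    ]
    return "\n".join(lines)


makefile_text = render_makefile("recovered", ["main.c", "util.c"], libs=["-lm"])
print(makefile_text)
```

Because the build description is produced from plain data (target, sources, libraries), the same inputs could feed a CMake or Ninja generator instead.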
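The slides don't say how the prototype limits changes to the modified source file. One simple way to get that behaviour, shown here purely as an assumption, is to regenerate everything in memory and only touch files whose content actually differs. The helper name `write_if_changed` is hypothetical.

```python
import os
import tempfile


def write_if_changed(path, content):
    """Write content to path only if it differs from what is on disk.

    Returns True if the file was (re)written, False if it was left alone.
    """
    data = content.encode("utf-8")
    try:
        with open(path, "rb") as handle:
            if handle.read() == data:
                return False
    except FileNotFoundError:
        pass  # First generation: the file does not exist yet.
    with open(path, "wb") as handle:
        handle.write(data)
    return True


with tempfile.TemporaryDirectory() as workdir:
    target = os.path.join(workdir, "main.c")
    first = write_if_changed(target, "int main(void) { return 0; }\n")
    second = write_if_changed(target, "int main(void) { return 0; }\n")
    third = write_if_changed(target, "int main(void) { return 1; }\n")
    print(first, second, third)  # True False True
```

Leaving unchanged files untouched also preserves their modification times, so a Makefile-based build only recompiles what was really regenerated.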
  • 33. The Future ● An integrated reverser-friendly GUI. ● Right now, we have to manually update the project file (a JSON formatted file). ● We need a GUI to assign functions to source files, select which local types we want to export, which functions we want to ignore, etc...
  • 34. The Future ● In the current version, source files are “found” using one of the following 2 methods: – Using debugging information (DWARF, mainly). – Using debugging strings containing file paths. ● In the next version, I plan to implement my own version of @evm_sec’s algorithm for Local Function Affinity (LFA). – It tries to infer translation units’ boundaries in binary files.
  • 35. The Future ● Class recovery. This is a whole project by itself. ● The idea is to try to find classes, resolve the hierarchy and, finally, write the definition of the classes too. ● Non-trivial. But that would be awesome to have.
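The second method, debugging strings containing file paths, can be illustrated with a small heuristic: if a function references a string such as an assertion message that embeds a `.c` path, it probably came from that file. The regular expression, the `guess_source_files` function and the sample data below are assumptions for the example. This is not necessarily how IDAMagicStrings.py works.

```python
import re
from collections import defaultdict

# Optional drive letter, then path characters, ending in a C/C++ extension.
SOURCE_PATH_RE = re.compile(
    r"(?:[A-Za-z]:)?[\w./\\-]*\w\.(?:c|cc|cpp|cxx|h|hpp)\b"
)


def guess_source_files(strings_by_function):
    """Group functions by the source file names found in their strings.

    strings_by_function: {function name: [strings referenced by it]}
    Returns {source file basename: sorted list of function names}.
    """
    result = defaultdict(set)
    for func, strings in strings_by_function.items():
        for text in strings:
            for match in SOURCE_PATH_RE.finditer(text):
                # Keep only the file name, dropping Unix or Windows directories.
                basename = re.split(r"[/\\]", match.group(0))[-1]
                result[basename].add(func)
    return {name: sorted(funcs) for name, funcs in result.items()}


# Made-up example input.
EXAMPLE = {
    "parse_header": ["/home/dev/proj/src/parser.c:%d: bad header"],
    "parse_body": ["assertion failed in src/parser.c"],
    "net_connect": ["C:\\build\\net\\socket.c(%d): connect failed"],
    "main": ["usage: %s <file>"],
}

guessed = guess_source_files(EXAMPLE)
for filename, funcs in sorted(guessed.items()):
    print(filename, funcs)
```

Functions without any such string (like `main` above) stay unassigned, which is where boundary-inference approaches such as LFA, or the reverser, would have to fill in.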
  • 36. The Future ● The prototype will be released at some point this year. ● The final tool, hopefully, will be available by the end of this year.
  • 37. And that’s all! ● Thank you! ● Questions?