Source code recovery is one of the most tedious, and interesting, tasks in reverse engineering. During this talk, the author will present a tool, developed (on and off) since last year, that aims to generate auto-compilable source code from binaries. The tool currently works, though it still needs a lot more work.
4. The problem
●
Many reverse engineering tasks require us to extract
pieces of code from binaries to copy into our own source code.
●
Some basic examples of when it might be needed:
– Compatibility.
– Copying decryption routines from malware.
– Recovery/reconstruction of lost source code.
– Half-automated porting of code written in assembly to high-level
languages.
– ...
5. Compatibility
●
Let’s say that we want/need to be compatible with some commercial software
that is only available in binary form.
●
After we have reverse engineered this or that obscure algorithm for some
obscure file format they invented on their own, we can:
– Either implement everything from scratch by writing specifications from the reverse-
engineered piece of software (something very common).
– Or reuse portions of that commercial software directly in our own software.
●
I’m ignoring legality here.
– This might sound barely legal, but it’s regularly done in many industries.
– Yes, regularly. Some random examples: antivirus or commercial games cheats.
6. Copying decryption routines
●
It’s very common in the anti-malware industry to just copy, verbatim,
algorithms from malware after reverse engineering it.
●
Indeed, I have done this task myself more than once: manually
rewriting assembly to C and writing AV plugins for cleaning this or
that file infector.
●
I know people that have even directly copied raw assembler and put it
in __asm__ blocks…
●
Legality, you mean? I don’t think malware authors are going to
complain about their IP being used anywhere. Call me crazy.
7. Recovery/Reconstruction of Lost Source Codes
●
This is one of the top 10 most common reverse engineering tasks. It’s
also one of the top 5 most tedious.
●
One easy example:
– Company ACME produces a software named $SOFT for some specific
industry.
– Due to a disaster, ACME loses all or part of the source code for $SOFT.
– ACME contracts reverse engineering services from some poor souls to
reconstruct it from the binaries they distribute to customers.
●
With a bit of luck, the reversers might have access to some
version with debugging symbols (DWARF or PDB files).
8. The problem
●
There are various other examples where we need to extract pieces
of code from binaries to copy into our source code, but I think these
are good ones.
●
Right now, the only partial solution to this problem is the following:
– Reverse engineer the software, discover structs, enums, function names,
etc…
– Copy & paste from IDA/Ghidra’s decompiler to our source codes.
– Adapt the code to make it compilable.
– This isn’t feasible, or at least not practical, for large codebases.
10. The idea
●
The solution to the previous problem is obvious:
– Write a tool that automates most of the tedious and boring routine
tasks.
– Make it interactive.
– Allow incremental changes.
– Integrate with the de facto reverse engineering tools.
– …
– Profit?
11. The idea
●
It sounds easier than it really is. But honestly, it isn’t that hard
either.
– Unless you want to write a tool that doesn’t use an already existing
disassembler & decompiler like IDA/Ghidra.
●
Indeed, I don’t really consider it a pure reverse engineering
problem but more of a software engineering problem.
– We’re not, say, searching for classes and their hierarchies in binaries.
– We’re just going to output compilable source code from the decompiler,
using hints from, and interacting with, the reverse engineer.
12. The idea
●
So, what should such a tool do, in my opinion?
– Find functions and their correct prototypes.
– Find local and global variables.
– Find imported functions and their corresponding header and library files.
– Find source files in the binaries.
– Find (and ignore) C runtime libraries.
– Find the hierarchy of structs, enums, functions, globals, etc…
– And, finally, output source code with all of its prerequisites.
●
Easy. Isn’t it?
13. Disclaimers
●
Assume that I’m only talking about C programs. No
Visual Basic, Go, Delphi, etc… Just plain C programs.
●
During this talk I will only consider the tools IDA and
Ghidra.
●
The reason is simple: these are the only reverse
engineering tools that ship with their own decompiler.
●
That explained, let’s continue...
14. Finding functions
●
Again, let’s assume that we’re talking about just C programs.
●
Finding functions is easy: we just need to use whatever APIs IDA/Ghidra offer
to walk them.
– We might need to manually find and create some functions that IDA/Ghidra didn’t find.
– Distinguishing between data and code is not trivial, although nowadays tools work pretty well
in most cases.
●
We also need to fix the function prototypes, which is one of the areas where
decompilers fail most often.
– Calling conventions, argument counts and argument types are frequently guessed
wrongly by decompilers.
– That’s normal. Reversers can manually fix them, fortunately.
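As a toy illustration of the bookkeeping involved, the sketch below flags functions that still carry decompiler auto-generated names and therefore likely need a manually fixed prototype. It is plain Python with made-up addresses and names; in IDA, the (address, name) pairs would come from `idautils.Functions()` plus `idc.get_func_name()`.

```python
# Hypothetical sketch: flag auto-named functions whose prototypes likely
# still need manual review by the reverser. In IDA, the input would come
# from idautils.Functions() and idc.get_func_name(); here we model it
# with plain (address, name) pairs.

AUTO_PREFIXES = ("sub_", "nullsub_", "j_")

def needs_review(functions):
    """Return the functions still carrying auto-generated names."""
    return [(ea, name) for ea, name in functions
            if name.startswith(AUTO_PREFIXES)]

funcs = [
    (0x401000, "main"),          # already renamed by the reverser
    (0x401220, "sub_401220"),    # still auto-named: review its prototype
    (0x401400, "decrypt_blob"),  # already renamed by the reverser
]
print(needs_review(funcs))  # [(4198944, 'sub_401220')]
```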
16. Finding Functions
●
There are many cases, however, where functions can be
missed or their boundaries wrongly guessed:
– Virtual functions, jump tables, switch idioms, self-modifying code,
on-the-fly generated code, function tables exported by external libraries,
etc…
●
Writing a tool that works in all cases is pretty much impossible.
●
But we can focus on a tool that might work in the general case
and then enhance/improve it later.
17. Finding Local & Global Variables
●
This might sound easy too, but it isn’t.
●
Variables aren’t a concept that exists in (most?) CPUs:
– They are just memory areas.
– Data flow analysis is required to find them and their aliases.
– In Ilfak’s words in his white-paper “Decompilers and Beyond”,
“variable allocation” is “worth a separate paper”.
18. Finding Local & Global Variables
●
Extract from the white-paper “Decompilers and
Beyond”, by Ilfak Guilfanov:
19. Finding Local & Global Variables
●
Fortunately, in many cases (most, for many targets), finding local variables isn’t that
hard.
– And in most cases that isn’t our job anyway; it’s IDA/Ghidra’s. We just need to fix things.
●
Finding global variables is, usually, easier.
– We just need to find references to memory addresses outside of the segment where the code is.
●
Or read+write memory references outside of the current function’s boundaries.
– Distinguishing between constants and variables can be done by checking the permissions of the
segment (i.e., whether it’s read-only).
– Unlike the previous cases, however, it is our task to determine whether we’re dealing with a
constant or a global variable in order to output compilable source code.
●
We will talk about it later on…
●
TIP: We don’t really need to care, but the output would be prettier.
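The permission check above can be sketched in a few lines. This is a minimal model with a hand-made segment table (all names and addresses are illustrative); in practice the segment data would come from IDA/Ghidra.

```python
# Minimal sketch: classify a referenced address as a constant or a
# writable global by checking the permissions of the segment that
# contains it (read-only segment -> constant). Segment data is made up.

SEGMENTS = [
    # (start, end, name, writable)
    (0x402000, 0x403000, ".rodata", False),
    (0x404000, 0x405000, ".data", True),
]

def classify(addr):
    for start, end, name, writable in SEGMENTS:
        if start <= addr < end:
            return "global" if writable else "constant"
    return "unknown"

print(classify(0x402010))  # constant (lives in read-only .rodata)
print(classify(0x404100))  # global (lives in writable .data)
```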
20. Finding External Functions
●
Another thing we need to do is to find the external libraries & runtime functions
used by our target.
– For example, if it’s using CreateFileA or sqrt.
●
Depending on these functions, we will need to add proper header files as well as
their corresponding library files to link with. For example:
– CreateFileA: include <windows.h>.
– sqrt: include <math.h> & link with -lm.
●
Header files must be added to the specific source files we are going to
write that use them.
●
Library files are only used when outputting build files.
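A tiny sketch of this bookkeeping: a hand-made table mapping external functions to the header they need and, optionally, a linker flag. The table here is a small illustrative example, not an exhaustive database.

```python
# Sketch of the header/library mapping described above. The EXTERNALS
# table is a hand-made example; a real tool would derive it from the
# target platform's headers and import libraries.

EXTERNALS = {
    "CreateFileA": {"header": "windows.h", "lib": None},
    "sqrt":        {"header": "math.h",    "lib": "-lm"},
    "printf":      {"header": "stdio.h",   "lib": None},
}

def requirements(used_functions):
    """Collect the #include files and linker flags a source file needs."""
    headers, libs = set(), set()
    for name in used_functions:
        info = EXTERNALS.get(name)
        if info:
            headers.add(info["header"])
            if info["lib"]:
                libs.add(info["lib"])
    return sorted(headers), sorted(libs)

headers, libs = requirements(["sqrt", "printf"])
print(headers)  # ['math.h', 'stdio.h'] -> emitted as #include lines
print(libs)     # ['-lm']               -> emitted into the build files
```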
21. Finding source files in the binaries
●
This task by itself is a whole research topic:
– How can we guess object files’ boundaries, and thus,
source files’ boundaries, in binaries?
●
And, indeed, it has been already researched:
– “A Code Pirate’s Cutlass. Recovering Software
Architecture from Embedded Binaries”, by
@evm_sec.
22. Finding source files in the binaries
●
The previously mentioned talk, and tool, try to infer object file boundaries
from binaries without using debugging information.
– If we have debugging information, we can skip this step!
●
In order to generate compilable source code from binaries, we will need to
“know” the boundaries of the object files.
NOTE: Image extracted from the previously mentioned talk.
23. Finding Runtime Functions
●
We also need to find the C/C++ runtime functions that are used. Basically, to ignore
them: we don’t want to add __libc_start_main or __gmon_start__ to our
generated source code files.
●
The solution to this problem is kind of “easy”:
– Thanks to IDA’s FLIRT signatures, we can ignore anything that seems to come from a
library.
– If we have function names, we can also blacklist some of them.
●
It’s a never-ending story…
– Also, we must let the reverser, somehow, specify which functions must be skipped or
not.
●
In my experience, interactivity is always key in reverse engineering tools.
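The filtering step above can be sketched as follows. All names are illustrative; the `is_library` flag models what a FLIRT match would tell us, and the blacklist is exactly the piece the reverser would extend interactively.

```python
# Sketch of the runtime-function filter: drop anything flagged as a
# library function (as a FLIRT match would indicate) or matching a name
# blacklist. All names here are illustrative examples.

BLACKLIST = {"__libc_start_main", "_start", "frame_dummy"}
BLACK_PREFIXES = ("__gmon", "register_tm_")

def user_functions(functions):
    """functions: iterable of (name, is_library) pairs."""
    keep = []
    for name, is_library in functions:
        if is_library or name in BLACKLIST:
            continue  # runtime/library code: skip it
        if name.startswith(BLACK_PREFIXES):
            continue  # known runtime name patterns: skip them too
        keep.append(name)
    return keep

funcs = [("main", False), ("__libc_start_main", False),
         ("memcpy", True), ("decrypt_config", False)]
print(user_functions(funcs))  # ['main', 'decrypt_config']
```

In the real tool the reverser would be able to add to or remove from this blacklist interactively instead of editing a hard-coded table.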
24. Finding Objects’ Hierarchy
●
Let’s say that we have reverse engineered a target for some time and we have
proper function names, local variables, structs, enums, etc…
●
When writing source code files, we need to know which structs, functions, enums,
global variables, etc… are used by each source file.
●
Also, we need to remember that structs, functions, global variable types, etc… might
depend on other structs, types, functions, etc…
●
It’s required to build a hierarchy of objects to output proper compilable source code.
– We could also just add a lot of #ifdef’ed “extern” declarations or apply similar ugly
workarounds but, well… it isn’t pretty.
– And it looks pretty dangerous.
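Ordering declarations so that every type is defined before it is used is a plain topological sort. A minimal sketch over a made-up, acyclic "depends on" graph (struct/enum/global names are invented for illustration):

```python
# Minimal sketch of the dependency ordering: a depth-first topological
# sort over a hand-made "depends on" graph, so that every struct, enum
# and global is declared before anything that uses it. Assumes the graph
# is acyclic (real C types can be mutually recursive via pointers, which
# would need forward declarations on top of this).

DEPS = {
    "struct config": ["enum mode"],
    "enum mode": [],
    "struct ctx": ["struct config"],
    "g_ctx": ["struct ctx"],  # a global variable of type struct ctx
}

def declaration_order(deps):
    order, seen = [], set()
    def visit(node):
        if node in seen:
            return
        seen.add(node)
        for dep in deps.get(node, []):
            visit(dep)      # emit dependencies first
        order.append(node)  # then the node itself
    for node in deps:
        visit(node)
    return order

print(declaration_order(DEPS))
# ['enum mode', 'struct config', 'struct ctx', 'g_ctx']
```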
25. Generating Compilable Sources
●
And, finally, the last step is to “just” write the
generated source code files with all the required
dependencies and the whole hierarchy
resolved, plus build files.
●
In my prototype, for now, I’m generating just plain
Makefiles. But one could generate anything:
Ninja, Visual Studio project files, CMake, etc...
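Emitting a plain Makefile is mostly string templating. A toy sketch under simple assumptions (one object file per recovered source file, linker flags collected earlier; file names are illustrative):

```python
# Toy sketch of Makefile generation: one object per recovered source
# file, plus the linker flags gathered from the external functions.
# All file names are illustrative.

def make_makefile(target, sources, ldflags):
    objs = " ".join(s.replace(".c", ".o") for s in sources)
    lines = [
        "CC = cc",
        "LDFLAGS = " + " ".join(ldflags),
        "",
        f"{target}: {objs}",
        f"\t$(CC) -o $@ {objs} $(LDFLAGS)",
        "",
        "%.o: %.c",            # pattern rule: compile each .c to a .o
        "\t$(CC) -c $< -o $@",
    ]
    return "\n".join(lines) + "\n"

print(make_makefile("soft", ["main.c", "crypto.c"], ["-lm"]))
```

Generating Ninja, CMake or Visual Studio project files would be the same exercise with a different template.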
27. The Prototype
●
Since September 2019 I have been working on a
prototype (for now) of such a tool.
●
The prototype is an IDA plugin + IDA Python tool.
●
It’s called “Source REcoverer”.
– Call me original.
●
Let’s briefly discuss it...
28. Components of the Prototype
●
An IDA C++ plugin (idaunexposed) exporting a
single function to IDA Python: get_cdef.
– This plugin uses print_type and format_data, which
aren’t really usable from Python.
●
An IDA Python independent script that uses the
previously mentioned plugin and does everything
expected from a source code recovery tool.
29. How does it work?
●
Iterate all functions in the binary.
●
“Guess” all the source files, if possible (using debugging information that is not available
by default or using IDAMagicStrings.py to get possible source files using debugging
strings).
●
Find structs, enums and global variables.
●
Decompile all the functions.
●
Write a project file, source files with dependencies mostly resolved, and a Makefile.
●
When something is changed in IDA, only the modified part/source file will be modified.
– The idea is that the reverser doesn’t need to modify generated source files.
– The reverser will just interact with the tool or with the generated project file.
32. The Future
●
The current tool is just a quick prototype.
– It works. But it sucks.
●
I will, most likely, rewrite it soon using C/C++ code.
●
Supporting Ghidra was considered but… its
decompiler generates too many constructs that
aren’t compilable and must be cleaned up.
33. The Future
●
An integrated reverser-friendly GUI.
●
Right now, we have to manually update the
project file (a JSON formatted file).
●
We need a GUI to assign functions to source
files, select which local types we want to export,
which functions we want to ignore, etc...
34. The Future
●
In the current version, source files are “found” using one of the
following two methods:
– Using debugging information (DWARF, mainly).
– Using debugging strings containing file paths.
●
In the next version, I plan to implement my own version of
@evm_sec’s algorithm for Local Function Affinity (LFA).
– It tries to infer translation units’ boundaries in binary files.
35. The Future
●
Class recovery. This is a whole project by
itself.
●
The idea is to try to find classes, resolve the
hierarchy and, finally, write the definition of the
classes too.
●
Non-trivial. But it would be awesome to have.
36. The Future
●
The prototype will be released at some point
this year.
●
The final tool, hopefully, will be available by the
end of this year.