3. Disassembly Theory
• First-generation languages
  • These are the lowest form of language, generally consisting of ones and zeros or some shorthand form such as hexadecimal, and readable only by binary ninjas.
  • Things are confusing at this level because it is often difficult to distinguish data from instructions, since everything looks pretty much the same.
  • First-generation languages may also be referred to as machine languages, and in some cases byte code, while machine language programs are often referred to as binaries.
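To illustrate why data and instructions look alike at this level, here is a minimal sketch that views the same five bytes as hexadecimal shorthand, as ones and zeros, as text, and as an integer. The byte values are an arbitrary example chosen for illustration; nothing in the bytes themselves says which reading is the intended one.

```python
# Five bytes written in hexadecimal shorthand (an arbitrary example).
data = bytes.fromhex("48656c6c6f")

# The same bytes as ones and zeros.
as_binary = " ".join(f"{byte:08b}" for byte in data)

# The same bytes interpreted as ASCII text.
as_text = data.decode("ascii")

# The first four bytes interpreted as a little-endian 32-bit integer.
as_integer = int.from_bytes(data[:4], "little")

print(as_binary)
print(as_text)
print(hex(as_integer))
```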
4. Disassembly Theory
• Second-generation languages
  • Also called assembly languages, second-generation languages are a mere table lookup away from machine language and generally map specific bit patterns, or operation codes (opcodes), to short but memorable character sequences called mnemonics.
  • Occasionally these mnemonics actually help programmers remember the instructions with which they are associated.
  • An assembler is a tool used by programmers to translate their assembly language programs into machine language suitable for execution.
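The "table lookup" relationship can be sketched with a handful of single-byte x86 opcodes. The table below is a tiny illustrative subset, and the function names assemble and disassemble are hypothetical helpers. A real assembler also handles operands, labels, and multi-byte encodings.

```python
# A tiny subset of single-byte x86 opcodes and their mnemonics.
OPCODE_TO_MNEMONIC = {
    0x90: "nop",
    0xC3: "ret",
    0xC9: "leave",
    0xCC: "int3",
    0xF4: "hlt",
}
MNEMONIC_TO_OPCODE = {name: op for op, name in OPCODE_TO_MNEMONIC.items()}

def assemble(source):
    """Translate one mnemonic per line into machine-language bytes."""
    lines = [line.strip() for line in source.splitlines() if line.strip()]
    return bytes(MNEMONIC_TO_OPCODE[line] for line in lines)

def disassemble(code):
    """Translate machine-language bytes back into mnemonics."""
    return [OPCODE_TO_MNEMONIC[byte] for byte in code]

machine_code = assemble("nop\nleave\nret")
print(machine_code.hex())
print(disassemble(machine_code))
```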
5. Disassembly Theory
• Third-generation languages
  • These languages take another step toward the expressive capability of natural languages by introducing keywords and constructs that programmers use as the building blocks for their programs.
  • Third-generation languages are generally platform independent, though programs written using them may be platform dependent as a result of using features unique to a specific operating system.
  • Often-cited examples include FORTRAN, COBOL, C, and Java.
  • Programmers generally use compilers to translate their programs into assembly language or all the way to machine language (or some rough equivalent such as byte code).
6. The Why and How of Disassembly
• The Why of Disassembly
  • The purpose of disassembly tools is often to facilitate understanding of programs when source code is unavailable.
  • Common situations in which disassembly is used include these:
    o Analysis of malware
    o Analysis of closed-source software for vulnerabilities
    o Analysis of closed-source software for interoperability
    o Analysis of compiler-generated code to validate compiler performance/correctness
    o Display of program instructions while debugging
7. The Why and How of Disassembly
• Malware Analysis
  • Unless you are dealing with a script-based worm, malware authors seldom do you the favor of providing the source code to their creations.
  • Lacking source code, you are faced with a very limited set of options for discovering exactly how the malware behaves.
  • The two main techniques for malware analysis are dynamic analysis and static analysis.
    o Dynamic analysis involves allowing the malware to execute in a carefully controlled environment (sandbox) while recording every observable aspect of its behavior using any number of system instrumentation utilities.
    o In contrast, static analysis attempts to understand the behavior of a program simply by reading through the program code, which, in the case of malware, generally consists of a disassembly listing.
8. The Why and How of Disassembly
• Vulnerability Analysis
  • For the sake of simplification, let's break the entire security-auditing process into three steps:
    o vulnerability discovery,
    o vulnerability analysis, and
    o exploit development.
  • The same steps apply whether you have source code or not; however, the level of effort increases substantially when all you have is a binary.
  • The first step in the process is to discover a potentially exploitable condition in a program.
  • This is often accomplished using dynamic techniques such as fuzzing, but it can also be performed (usually with much more effort) via static analysis.
9. The Why and How of Disassembly
  • Once a problem has been discovered, further analysis is often required to determine whether the problem is exploitable at all and, if so, under what conditions.
  • Disassembly listings provide the level of detail required to understand exactly how the compiler has chosen to allocate program variables.
  • For example, it might be useful to know that a 70-byte character array declared by a programmer was rounded up to 80 bytes when allocated by the compiler.
  • Disassembly listings also provide the only means to determine exactly how a compiler has chosen to order all of the variables declared globally or within functions.
  • Understanding the spatial relationships among variables is often essential when attempting to develop exploits.
  • Ultimately, by using a disassembler and a debugger together, an exploit may be developed.
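The 70-to-80-byte example is consistent with rounding an allocation up to a multiple of 16 bytes. The sketch below works through that arithmetic. The 16-byte alignment is an assumption made for illustration, since the source states only the before and after sizes, and real compilers choose sizes and ordering by their own rules, which is exactly why the disassembly listing is needed.

```python
def round_up(size, alignment):
    """Round size up to the next multiple of alignment."""
    return (size + alignment - 1) // alignment * alignment

DECLARED_SIZE = 70   # bytes declared by the programmer
ALIGNMENT = 16       # assumed alignment, not stated in the source

allocated = round_up(DECLARED_SIZE, ALIGNMENT)
padding = allocated - DECLARED_SIZE

print(allocated)  # bytes actually reserved under this assumption
print(padding)    # slack between the declared end and the allocated end
```

Under this assumption, a write that runs past the declared array has to cover the padding before it reaches whatever the compiler placed next.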
10. The Why and How of Disassembly
• Software Interoperability
  • When software is released in binary form only, it is very difficult for competitors to create software that can interoperate with it or to provide plug-in replacements for that software.
  • A common example is driver code released for hardware that is supported on only one platform.
  • When a vendor is slow to support or, worse yet, refuses to support the use of its hardware with alternative platforms, substantial reverse engineering effort may be required in order to develop software drivers to support the hardware.
  • In these cases, static code analysis is almost the only remedy and often must go beyond the software driver to understand embedded firmware.
11. The Why and How of Disassembly
• Compiler Validation
  • Since the purpose of a compiler (or assembler) is to generate machine language, good disassembly tools are often required to verify that the compiler is doing its job in accordance with any design specifications.
  • Analysts may also be interested in locating additional opportunities for optimizing compiler output and, from a security standpoint, ascertaining whether the compiler itself has been compromised to the extent that it may be inserting back doors into generated code.
12. The Why and How of Disassembly
• Debugging Displays
  • Perhaps the single most common use of disassemblers is to generate listings within debuggers.
  • Unfortunately, disassemblers embedded within debuggers tend to be fairly unsophisticated.
  • They are generally incapable of batch disassembly and sometimes balk at disassembling when they cannot determine the boundaries of a function.
  • This is one of the reasons why it is best to use a debugger in conjunction with a high-quality disassembler to provide better situational awareness and context during debugging.
13. The Why and How of Disassembly
• The How of Disassembly
  • Now that you're well versed in the purposes of disassembly, it's time to move on to how the process actually works.
  • Consider a typical daunting task faced by a disassembler: Take these 100KB, distinguish code from data, convert the code to assembly language for display to a user, and please don't miss anything along the way.
  • We could tack any number of special requests on the end of this, such as asking the disassembler to locate functions, recognize jump tables, and identify local variables, making the disassembler's job that much more difficult.
  • In order to accommodate all of our demands, any disassembler will need to pick and choose from a variety of algorithms as it navigates through the files that we feed it.
  • The quality of the generated disassembly listing will be directly related to the quality of the algorithms utilized and how well they have been implemented.
14. The Why and How of Disassembly
• A Basic Disassembly Algorithm
  • For starters, let's develop a simple algorithm for accepting machine language as input and producing assembly language as output.
  • In doing so, we will gain an understanding of the challenges, assumptions, and compromises that underlie an automated disassembly process.
• Step 1: The first step in the disassembly process is to identify a region of code to disassemble. This is not necessarily as straightforward as it may seem. Instructions are generally mixed with data, and it is important to distinguish between the two. In the most common case, disassembly of an executable file, the file will conform to a common format for executable files such as the Portable Executable (PE) format used on Windows or the Executable and Linking Format (ELF) common on many Unix-based systems. These formats typically contain mechanisms (often in the form of hierarchical file headers) for locating the sections of the file that contain code and entry points into that code.
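As a rough illustration of Step 1, the sketch below reads the entry point and the section-header table location from a 64-bit little-endian ELF header. The function name parse_elf64_header and the synthetic header bytes are hypothetical, built here only so the example runs without a real file. A real tool would also handle 32-bit and big-endian files and would walk the section headers to find the code sections.

```python
import struct

# ELF64 header layout: e_ident[16], e_type, e_machine, e_version, e_entry,
# e_phoff, e_shoff, e_flags, e_ehsize, e_phentsize, e_phnum, e_shentsize,
# e_shnum, e_shstrndx (64 bytes in total).
ELF64_HEADER = struct.Struct("<16sHHIQQQIHHHHHH")

def parse_elf64_header(data):
    """Return the entry point and section-header table location."""
    if data[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    if data[4] != 2 or data[5] != 1:
        raise ValueError("this sketch handles only 64-bit little-endian ELF")
    fields = ELF64_HEADER.unpack_from(data, 0)
    return {
        "entry": fields[4],                  # e_entry
        "section_header_offset": fields[6],  # e_shoff
        "section_count": fields[12],         # e_shnum
    }

# Synthetic header for demonstration only (not a runnable program).
ident = b"\x7fELF" + bytes([2, 1, 1]) + bytes(9)
header = ELF64_HEADER.pack(ident, 2, 0x3E, 1, 0x401000, 64, 0, 0, 64, 56, 0, 64, 0, 0)

info = parse_elf64_header(header)
print(hex(info["entry"]))
```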
15. The Why and How of Disassembly
• Step 2: Given an initial address of an instruction, the next step is to read the value contained at that address (or file offset) and perform a table lookup to match the binary opcode value to its assembly language mnemonic. Depending on the complexity of the instruction set being disassembled, this may be a trivial process, or it may involve several additional operations such as understanding any prefixes that may modify the instruction's behavior and determining any operands required by the instruction. For instruction sets with variable-length instructions, such as the Intel x86, additional instruction bytes may need to be retrieved in order to completely disassemble a single instruction.
16. The Why and How of Disassembly
• Step 3: Once an instruction has been fetched and any required operands decoded, its assembly language equivalent is formatted and output as part of the disassembly listing. It may be possible to choose from more than one assembly language output syntax. For example, the two predominant formats for x86 assembly language are the Intel format and the AT&T format.
• Step 4: Following the output of an instruction, we need to advance to the next instruction and repeat the previous process until we have disassembled every instruction in the file.
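Steps 2 through 4 can be sketched for a very small x86 subset: look up the opcode, fetch the extra operand bytes a variable-length instruction needs, format the result in either Intel or AT&T syntax, and advance. The names decode_one and disassemble are hypothetical, and the opcode coverage (mov r32, imm32; nop; ret) is deliberately tiny, so this is an illustration rather than a usable decoder.

```python
import struct

# Register numbers encoded in the low bits of the 0xB8..0xBF opcodes.
REGISTERS = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]
SIMPLE = {0x90: "nop", 0xC3: "ret"}

def decode_one(code, offset, syntax="intel"):
    """Return (text, length) for the instruction at offset."""
    opcode = code[offset]
    if 0xB8 <= opcode <= 0xBF:
        # mov r32, imm32: one opcode byte followed by four operand bytes.
        reg = REGISTERS[opcode - 0xB8]
        imm = struct.unpack_from("<I", code, offset + 1)[0]
        if syntax == "att":
            return f"movl ${imm:#x}, %{reg}", 5
        return f"mov {reg}, {imm:#x}", 5
    if opcode in SIMPLE:
        return SIMPLE[opcode], 1
    raise ValueError(f"opcode {opcode:#04x} is outside this sketch's table")

def disassemble(code, syntax="intel"):
    """Decode instructions back to back until the bytes run out."""
    lines, offset = [], 0
    while offset < len(code):
        text, length = decode_one(code, offset, syntax)
        lines.append(f"{offset:04x}: {text}")
        offset += length
    return lines

sample = bytes.fromhex("b82a000000bb01000000c3")
print("\n".join(disassemble(sample, "intel")))
print("\n".join(disassemble(sample, "att")))
```

Note how the operand order differs between the two syntaxes: Intel lists the destination first, AT&T lists the source first and marks registers and immediates with sigils.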
17. The Why and How of Disassembly
• Various algorithms exist for determining where to begin a disassembly, how to choose the next instruction to be disassembled, how to distinguish code from data, and how to determine when the last instruction has been disassembled.
• The two predominant disassembly algorithms are
  • linear sweep and
  • recursive descent.
18. The Why and How of Disassembly
• Linear Sweep Algorithm
  • The linear sweep disassembly algorithm takes a very straightforward approach to locating instructions to disassemble: Where one instruction ends, another begins.
  • As a result, the most difficult decision faced is where to begin.
  • The usual solution is to assume that everything contained in sections of a program marked as code (typically specified by the program file's headers) represents machine language instructions.
  • Disassembly begins with the first byte in a code section and moves, in a linear fashion, through the section, disassembling one instruction after another until the end of the section is reached.
  • No effort is made to understand the program's control flow through recognition of nonlinear instructions such as branches.
19. The Why and How of Disassembly
  • During the disassembly process, a pointer can be maintained to mark the beginning of the instruction currently being disassembled.
  • As part of the disassembly process, the length of each instruction is computed and used to determine the location of the next instruction to be disassembled.
  • Instruction sets with fixed-length instructions (MIPS, for example) are somewhat easier to disassemble, as locating subsequent instructions is straightforward.
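The difference between the two cases comes down to how the pointer is advanced. In the sketch below, the length table covers only a few x86 opcodes and is an assumption made for illustration. The 4-byte width matches classic 32-bit MIPS instructions.

```python
# Instruction lengths for a few x86 opcodes (illustrative subset only).
X86_LENGTHS = {0x90: 1, 0xC3: 1, 0xB8: 5, 0xE8: 5, 0xEB: 2}

def variable_length_starts(code):
    """Each instruction's length must be computed to find the next one."""
    starts, pointer = [], 0
    while pointer < len(code):
        starts.append(pointer)
        pointer += X86_LENGTHS[code[pointer]]
    return starts

def fixed_length_starts(code, width=4):
    """With fixed-length instructions the starts are simple multiples."""
    return list(range(0, len(code), width))

x86_code = bytes.fromhex("b801000000" "90" "e800000000" "c3")
print(variable_length_starts(x86_code))
print(fixed_length_starts(bytes(12)))
```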
20. The Why and How of Disassembly
  • The main advantage of the linear sweep algorithm is that it provides complete coverage of a program's code sections.
  • One of the primary disadvantages of the linear sweep method is that it fails to account for the fact that data may be commingled with code.
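The commingling problem can be shown with seven bytes: a short jump that skips over two bytes of data, followed by nop, nop, ret. The decoder below is a hypothetical, deliberately tiny x86 subset. Because linear sweep treats the data bytes as code, the first data byte (0xB8) is read as the start of a five-byte mov, which swallows the real instructions that follow.

```python
import struct

def decode(code, offset):
    """Return (text, length) for one instruction of a tiny x86 subset."""
    opcode = code[offset]
    if opcode == 0x90:
        return "nop", 1
    if opcode == 0xC3:
        return "ret", 1
    if opcode == 0xEB:  # jmp short: target is relative to the next instruction
        rel = struct.unpack_from("<b", code, offset + 1)[0]
        return f"jmp {offset + 2 + rel:#x}", 2
    if opcode == 0xB8:  # mov eax, imm32
        imm = struct.unpack_from("<I", code, offset + 1)[0]
        return f"mov eax, {imm:#x}", 5
    return f"db {opcode:#04x}", 1  # unknown byte: emit it as data

def linear_sweep(code):
    """Disassemble from the first byte to the last, ignoring control flow."""
    listing, offset = [], 0
    while offset < len(code):
        text, length = decode(code, offset)
        listing.append((offset, text))
        offset += length
    return listing

# jmp 0x4 | two data bytes (b8 01) | nop | nop | ret
code = bytes.fromhex("eb02" "b801" "90" "90" "c3")
for offset, text in linear_sweep(code):
    print(f"{offset:04x}: {text}")
```

The listing contains a mov that the program never executes, and the nop, nop, ret sequence at offsets 4 through 6 does not appear at all.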
21. The Why and How of Disassembly
• Recursive Descent Disassembly
  • Recursive descent takes a different approach to locating instructions.
  • Recursive descent focuses on the concept of control flow: whether an instruction is disassembled depends on whether it is referenced by another instruction.
  • To understand recursive descent, it is helpful to classify instructions according to how they affect the CPU instruction pointer:
    o Sequential flow instructions
    o Conditional branching instructions
    o Unconditional branching instructions
    o Function call instructions
    o Return instructions
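A minimal sketch of recursive descent, using the five instruction classes above, might look like the following. The decoder covers only a handful of x86 opcodes, and the names decode and recursive_descent are hypothetical. Real implementations must also cope with indirect branches and calls whose targets cannot be computed statically, which this sketch ignores.

```python
import struct

def decode(code, offset):
    """Return (text, length, kind, target) for a tiny x86 subset."""
    opcode = code[offset]
    if opcode == 0x90:
        return "nop", 1, "sequential", None
    if opcode == 0xC3:
        return "ret", 1, "return", None
    if opcode in (0xEB, 0x74):  # jmp short / jz short
        rel = struct.unpack_from("<b", code, offset + 1)[0]
        target = offset + 2 + rel
        if opcode == 0xEB:
            return f"jmp {target:#x}", 2, "jump", target
        return f"jz {target:#x}", 2, "conditional", target
    if opcode == 0xE8:  # call rel32
        rel = struct.unpack_from("<i", code, offset + 1)[0]
        target = offset + 5 + rel
        return f"call {target:#x}", 5, "call", target
    raise ValueError(f"opcode {opcode:#04x} is outside this sketch's table")

def recursive_descent(code, entry=0):
    """Disassemble only the bytes reachable by following control flow."""
    listing, pending = {}, [entry]
    while pending:
        offset = pending.pop()
        while 0 <= offset < len(code) and offset not in listing:
            text, length, kind, target = decode(code, offset)
            listing[offset] = text
            if kind in ("jump", "conditional", "call"):
                pending.append(target)   # disassemble the target later
            if kind in ("jump", "return"):
                break                    # nothing falls through
            offset += length
    return sorted(listing.items())

# jmp 0x4 | two data bytes (b8 01) | nop | nop | ret
code = bytes.fromhex("eb02" "b801" "90" "90" "c3")
for offset, text in recursive_descent(code):
    print(f"{offset:04x}: {text}")
```

Because the jump target is followed instead of the next byte, the two data bytes at offsets 2 and 3 are never decoded as instructions.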
23. The Why and How of Disassembly
• Function call & Return
24. Reversing and Disassembly Tools
• Classification Tools
  • file
    o The file command is a standard utility, included with most *NIX-style operating systems and with the Cygwin or MinGW tools for Windows.
    o file attempts to identify a file's type by examining specific fields within the file.
  • PE Tools
    o PE Tools is a collection of tools useful for analyzing both running processes and executable files on Windows systems.
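The field-based identification that file performs can be sketched by checking a few well-known signatures at fixed offsets. The classify function and its short signature table are hypothetical. The real file utility consults a much larger database of magic values and rules.

```python
import struct

# A few well-known signatures found at the start of a file.
MAGIC_NUMBERS = [
    (b"\x7fELF", "ELF"),
    (b"\xcf\xfa\xed\xfe", "Mach-O 64-bit (little-endian)"),
    (b"\x89PNG\r\n\x1a\n", "PNG image"),
    (b"PK\x03\x04", "ZIP archive"),
]

def classify(data):
    """Guess a file type from signature fields in its first bytes."""
    if data[:2] == b"MZ":
        # A PE file stores the offset of its "PE\0\0" signature at 0x3C.
        if len(data) >= 0x40:
            pe_offset = struct.unpack_from("<I", data, 0x3C)[0]
            if data[pe_offset:pe_offset + 4] == b"PE\x00\x00":
                return "PE executable"
        return "MS-DOS executable"
    for magic, name in MAGIC_NUMBERS:
        if data.startswith(magic):
            return name
    return "data"

# Synthetic inputs for demonstration only.
fake_pe = b"MZ" + bytes(0x3A) + struct.pack("<I", 0x40) + b"PE\x00\x00"
print(classify(fake_pe))
print(classify(b"\x7fELF\x02\x01\x01"))
```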
25. Reversing and Disassembly Tools
• Classification Tools
  • PEiD
    o PEiD is another Windows tool whose primary purposes are to identify the compiler used to build a particular Windows PE binary and to identify any tools used to obfuscate a Windows PE binary.
26. Reversing and Disassembly Tools
• Summary Tools
  • nm
    o When nm is used to examine an intermediate object file (a .o file rather than an executable), the default output yields the names of any functions and global variables declared in the file.
  • ldd
    o The ldd (list dynamic dependencies) utility is a tool used to list the dynamic libraries required by any executable.
  • objdump
    o The purpose of objdump is to "display information from object files."
  • otool
    o otool is most easily described as an objdump-like utility for OS X, and it is useful for parsing information about OS X Mach-O binaries.
27. Reversing and Disassembly Tools
• Deep Inspection Tools
  • strings
    o The strings utility is designed specifically to extract string content from files, often without regard for the format of those files.
    o When using strings on executable files, it is important to remember that some versions (notably older GNU binutils releases) scan only the loadable, initialized sections of the file by default. Use the -a command-line argument to force strings to scan the entire input file.
    o By default, strings gives no indication of where, within a file, a string is located. Use the -t command-line argument to have strings print file offset information for each string found.
    o Many files utilize alternate character sets. Use the -e command-line argument to cause strings to search for wide characters such as 16-bit Unicode.
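The core of what strings does can be approximated in a few lines: scan the whole input for runs of printable characters, report each run with its file offset (the information -t adds), and optionally look for 16-bit little-endian characters (one of the encodings -e selects). The function names and the four-character minimum below are assumptions for this sketch, and the real utility's notion of a printable character differs slightly.

```python
import re

def ascii_strings(data, min_len=4):
    """Return (offset, text) for each run of printable ASCII bytes."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [(m.start(), m.group().decode("ascii")) for m in pattern.finditer(data)]

def utf16le_strings(data, min_len=4):
    """Return (offset, text) for runs of printable 16-bit little-endian characters."""
    pattern = re.compile(rb"(?:[\x20-\x7e]\x00){%d,}" % min_len)
    return [(m.start(), m.group().decode("utf-16-le")) for m in pattern.finditer(data)]

# Synthetic input: an ASCII string, two junk bytes, then a wide string.
blob = b"\x00\x01hello\x01\x02" + "wide".encode("utf-16-le") + b"\xff"

print(ascii_strings(blob))
print(utf16le_strings(blob))
```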