Every program needs some input and some logic to translate such input into structured data in memory.
Language parsers, in particular, are notoriously error-prone to write, so programmers often define a high-level grammar and use parser generators to make the parsing process easier.
What does this entail for a security analyst?
In this talk you will learn what to do when you encounter GNU bison-generated code in a binary, what kind of code it creates, and how to exploit its structure to analyze targets faster and improve your fuzzers in the process.
You will also walk away with a free (as in freedom) tool to automate your reverse engineering efforts.
[CB20] Reflex: you give me a parser, I give you a token generator by Paolo Montesel
1. Reflex:
You give me a parser, I give you a token generator
@ CODE BLUE 2020
Paolo “babush” Montesel // rev.ng srls
Mario Polino, Ph.D // Politecnico di Milano
Prof. Stefano Zanero // Politecnico di Milano
paolo.montesel@gmail.com
@pmontesel
https://babush.me/
2. Who am I
● CTF w/ mHackeroni + Tower of Hanoi + NoPwnIntended
● Contractor @ rev.ng srls
○ Security Research Engineer
○ LLVM-based Dynamic Binary Translation
○ Machine Learning for Reverse Engineering
● Main Interests
○ Reverse engineering
○ Full system fuzzing
○ Machine Learning on code/binaries
○ Obfuscation
6. Scenario
● looking at a binary target
● custom language with poor or no docs
● few or no input samples available
● might not be able to instrument the target (e.g.: firmware)
19. How to get valid tokens out of binaries?
● Dynamic approaches
○ Assume target can be executed
○ Exploit coverage information
○ Fail with table-based parsers
● Require large corpus of valid inputs
● Make assumptions on program behavior (e.g.: early exit)
20. State of the art TL;DR
“Parser-Directed Fuzzing”, Björn Mathis et al. (PLDI 2019)
What about closed-source programs?
27. How it works
1. Some background
2. Identifying the information we need from the target
3. Recover the Finite State Machine
4. FSM => regular expression
5. Generate valid tokens
34. How it works
1. Some background
2. Identifying the information we need from the target
3. Recover the Finite State Machine
4. FSM => regular expression
5. Generate valid tokens
35. What do we need from flex?
● Table offsets
● Element size
● Max state
36. Steps needed
1. Find yylex() in the binary
2. Find tables using data-flow analysis
3. Infer their element size from memory accesses
4. Interpret the tables to recover the finite-state machine
37. How to find the tables in a binary?
[Figure: two expression trees for the memory accesses to look for, built from LOAD, +, VAR and the constant 2]
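A minimal sketch of the matching idea, assuming a table lookup shows up as a load from base + index * element size. The tiny expression classes (`Const`, `Var`, `BinOp`, `Load`) and `match_table_access` are hypothetical names for illustration, not Reflex's or Ghidra's actual API.

```python
from dataclasses import dataclass
from typing import Optional, Tuple, Union


@dataclass(frozen=True)
class Const:
    value: int


@dataclass(frozen=True)
class Var:
    name: str


@dataclass(frozen=True)
class BinOp:
    op: str
    left: "Expr"
    right: "Expr"


@dataclass(frozen=True)
class Load:
    addr: "Expr"


Expr = Union[Const, Var, BinOp, Load]


def match_table_access(expr: Expr) -> Optional[Tuple[Expr, Expr, int]]:
    """Return (base, index, element_size) if expr is LOAD(base + index * k)."""
    if not isinstance(expr, Load):
        return None
    addr = expr.addr
    if not isinstance(addr, BinOp) or addr.op != "+":
        return None
    # Addition and multiplication are commutative, so try both operand orders.
    for base, scaled in ((addr.left, addr.right), (addr.right, addr.left)):
        if isinstance(scaled, BinOp) and scaled.op == "*":
            for k, index in ((scaled.left, scaled.right), (scaled.right, scaled.left)):
                if isinstance(k, Const):
                    return base, index, k.value
    return None


# Hypothetical access: load from table + 2 * state
access = Load(BinOp("+", Var("table"), BinOp("*", Const(2), Var("state"))))
match = match_table_access(access)
```

Here the constant multiplier doubles as the element size, which is one way steps 2 and 3 of the previous slide could be served by the same match.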
39. How it works
1. Some background
2. Identifying the information we need from the target
3. Recover the Finite State Machine
4. FSM => regular expression
5. Generate valid tokens
41. Example: table decoding (simplified)
XX XX XX XX XX XX
XX XX XX XX XX XX
XX XX XX XX XX XX
XX XX XX XX XX XX
XX XX XX XX XX XX
XX XX XX XX XX XX
Axes: curr state and input char. Cell value: next state.
next = next_state[curr_state][input_char]
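The simplified lookup above can be sketched as follows. This assumes an uncompressed, row-major table of unsigned little-endian entries, and `decode_table` with its parameters is a hypothetical helper. Real flex scanners typically use compressed tables, so treat this only as an illustration of the simplified model on the slide.

```python
from typing import Dict


def decode_table(blob: bytes, offset: int, elem_size: int,
                 n_states: int, n_chars: int) -> Dict[int, Dict[int, int]]:
    """Rebuild {curr_state: {input_char: next_state}} from raw table bytes."""
    fsm: Dict[int, Dict[int, int]] = {}
    for state in range(n_states):
        row = fsm.setdefault(state, {})
        for ch in range(n_chars):
            # Row-major layout: one row per state, one column per input char.
            pos = offset + (state * n_chars + ch) * elem_size
            row[ch] = int.from_bytes(blob[pos:pos + elem_size], "little")
    return fsm


# Hypothetical 2-state, 3-char table of 16-bit entries, preceded by 4 bytes of other data.
values = [1, 0, 1, 0, 0, 1]
blob = b"\x00" * 4 + b"".join(v.to_bytes(2, "little") for v in values)
fsm = decode_table(blob, offset=4, elem_size=2, n_states=2, n_chars=3)
```

The table offset, element size and number of states are the inputs listed under "What do we need from flex?".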
42. Example: table decoding (simplified)
XX 0A XX XX XX XX
XX 08 XX XX XX XX
XX 0A XX XX XX XX
XX 09 XX XX XX XX
XX 0A XX XX XX XX
XX 02 XX XX XX XX
curr state = 1
next = next_state[1][input_char]
Figure labels: “a”, “b”, “d”
44. How it works
1. Some background
2. Identifying the information we need from the target
3. Recover the Finite State Machine
4. FSM => regular expression
5. Generate valid tokens
48. Recovering the regular expressions
[Figure: pipeline with the stages “Reflex” and “simplify”, ending in the pattern def(ine)?]
See: “Introduction to Automata Theory, Languages, and Computation” by Hopcroft, Ullman
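A textbook way to turn an FSM into a regular expression is state elimination. The sketch below is a generic version of that technique, not necessarily what Reflex implements. The edge format and `fsm_to_regex` are assumptions for illustration, and the output is not simplified.

```python
import re
from itertools import product


def union(a, b):
    """Alternation of two patterns; None stands for 'no edge'."""
    if a is None:
        return b
    if b is None:
        return a
    return a if a == b else f"(?:{a}|{b})"


def star(a):
    """Kleene star; a missing or empty self-loop contributes nothing."""
    return "" if not a else f"(?:{a})*"


def fsm_to_regex(edges, start, accepting):
    """edges maps (src, dst) to a string of characters labelling that edge."""
    START, END = object(), object()  # fresh start and end states
    R = {}

    def add(src, dst, rx):
        R[(src, dst)] = union(R.get((src, dst)), rx)

    for (src, dst), chars in edges.items():
        for c in chars:
            add(src, dst, re.escape(c))
    add(START, start, "")
    for s in accepting:
        add(s, END, "")

    states = {s for pair in edges for s in pair} | {start} | set(accepting)
    for q in states:
        # Remove q and reconnect each predecessor to each successor.
        loop = star(R.pop((q, q), None))
        ins = [(p, rx) for (p, d), rx in R.items() if d == q]
        outs = [(d, rx) for (s, d), rx in R.items() if s == q]
        for key in [k for k in R if q in k]:
            del R[key]
        for (p, rin), (d, rout) in product(ins, outs):
            add(p, d, rin + loop + rout)
    return R.get((START, END))


# Hypothetical FSM accepting "def" and "define".
edges = {(0, 1): "d", (1, 2): "e", (2, 3): "f", (3, 4): "i", (4, 5): "n", (5, 6): "e"}
pattern = fsm_to_regex(edges, start=0, accepting={3, 6})
```

The result is equivalent to def(ine)? but more verbose, which is where a separate simplification pass comes in.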
49. How it works
1. Some background
2. Identifying the information we need from the target
3. Recover the Finite State Machine
4. FSM => regular expression
5. Generate valid tokens
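One simple way to get valid tokens from a recovered FSM is a breadth-first walk that records the shortest input reaching each accepting state. This is a sketch under assumptions: the transition format and `shortest_tokens` are hypothetical, and the actual tool may generate tokens differently (for example from the recovered regular expressions).

```python
from collections import deque
from typing import Dict, Iterable


def shortest_tokens(transitions: Dict[int, Dict[str, int]], start: int,
                    accepting: Iterable[int]) -> Dict[int, str]:
    """Map each reachable accepting state to one shortest input that reaches it."""
    best = {start: ""}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        # Sort edges so the result does not depend on dict ordering.
        for ch, nxt in sorted(transitions.get(state, {}).items()):
            if nxt not in best:
                best[nxt] = best[state] + ch
                queue.append(nxt)
    return {s: best[s] for s in accepting if s in best}


# Hypothetical FSM for the pattern def(ine)?
transitions = {0: {"d": 1}, 1: {"e": 2}, 2: {"f": 3},
               3: {"i": 4}, 4: {"n": 5}, 5: {"e": 6}}
tokens = shortest_tokens(transitions, start=0, accepting={3, 6})
# tokens == {3: "def", 6: "define"}
```

Tokens produced this way could seed a fuzzer dictionary without ever running the target.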
56. Takeaways
● Identify flex’s tables automatically ✅
● Extract the original FSM ✅
● Recover the lexer patterns ✅
● Generate valid tokens ✅
● No need to run the target ✅
57. Limitations
● Ghidra-based dataflow analysis not reliable
● Works only for table-driven parsers
● Implementation limited to bison/flex
● Doesn’t (yet) handle flex => bison
○ No structured file generation
58. Future work
● Find more bison-based targets (contact me)
● Reverse bison
● Improve yylex()/tables identification
○ rev.ng + LLVM IR + SCEV
● Look at other parser generators (e.g. antlr)
● Find bugs? (: