[CB20] Reflex: you give me a parser, I give you a token generator by Paolo Montesel

Reflex:
You give me a parser, I give you a token generator
@ CODE BLUE 2020
Paolo “babush” Montesel // rev.ng srls
Mario Polino, Ph.D // Politecnico di Milano
Prof. Stefano Zanero // Politecnico di Milano
paolo.montesel@gmail.com
@pmontesel
https://babush.me/

Who am I
● CTF w/ mHackeroni + Tower of Hanoi + NoPwnIntended
● Contractor @ rev.ng srls
○ Security Research Engineer
○ LLVM-based Dynamic Binary Translation
○ Machine Learning for Reverse Engineering
● Main Interests
○ Reverse engineering
○ Full system fuzzing
○ Machine Learning on code/binaries
○ Obfuscation

Overview
1. Introduction
○ What’s Reflex
○ How it was born
○ Why it’s useful
2. How it works

What’s Reflex?
[Diagram: binary => Reflex => tokens]
Example tokens (<token>):
“def”
“if”
“then”
“variable”
“switch”
...

Scenario
● looking at a binary target
● custom language with poor or no docs
● few or no input samples available
● might not be able to instrument the target (e.g. firmware)

Analyzing parsers in binaries is time-consuming
● Auto-generated code
● Finite-state machine
● Table-based generators
○ Flex / bison
● CFG-based generators

How to get valid tokens out of binaries?
● Dynamic approaches
○ Assume target can be executed
○ Exploit coverage information
○ Fail with table-based parsers
● Require large corpus of valid inputs
● Make assumptions on program behavior (e.g.: early exit)

State of the art TL;DR
“Parser-Directed Fuzzing”, Björn Mathis et al. (PLDI 2019)
What about closed-source programs?

Popular open-source parser generators
● bison / flex
● antlr
● parsec
● lemon

Where can you find bison in the wild?

Where can you find bison? (lulz version)
https://lists.gnu.org/archive/html/help-bison/2005-11/msg00029.html


How it works
1. Some background
2. Identifying the information we need from the target
3. Recover the Finite State Machine
4. FSM => regular expression
5. Generate valid tokens

How it works
1. Some background

(simplified) Parser theory in ~2 minutes
[Diagram: developer => grammar => lexer + parser]

The lexer
text: “x = y * 2 + 3”
=> lexer =>
tokens: x, =, y, *, 2, +, 3
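
The lexer step above can be sketched in a few lines of Python. This is only an illustration of what "text in, tokens out" means: the token kinds (NUMBER, IDENT, OP, SKIP) and their patterns are assumptions made for the example, not something taken from the talk or from any real target.

```python
import re

# Token kinds and patterns are illustrative assumptions.
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("IDENT", r"[A-Za-z_]\w*"),
    ("OP", r"[=+\-*/]"),
    ("SKIP", r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))


def tokenize(text):
    """Split text into (kind, lexeme) pairs, dropping whitespace."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at offset {pos}")
        if m.lastgroup != "SKIP":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens


print(tokenize("x = y * 2 + 3"))
```

The lexer only cuts the text into pieces and labels them; it knows nothing about operator precedence or nesting.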

The parser
tokens: x = y * 2 + 3
=> parser =>
abstract syntax tree (AST):
    =
   / \
  x   +
     / \
    *   3
   / \
  y   2
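
To make the tokens-to-AST step concrete, here is a minimal recursive-descent sketch for the example expression. The grammar (one assignment, `+ - * /` with the usual precedence) and the nested-tuple AST shape are assumptions chosen for the illustration; a bison-generated parser is table-driven and works differently.

```python
def parse(tokens):
    """Parse `IDENT = expr` into a nested-tuple AST of the form (op, left, right)."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def atom():
        if peek() is None:
            raise SyntaxError("unexpected end of input")
        return advance()  # identifier or number, kept as a plain string

    def term():  # term := atom (('*' | '/') atom)*
        node = atom()
        while peek() in ("*", "/"):
            op = advance()
            node = (op, node, atom())
        return node

    def expr():  # expr := term (('+' | '-') term)*
        node = term()
        while peek() in ("+", "-"):
            op = advance()
            node = (op, node, term())
        return node

    target = atom()
    if peek() != "=":
        raise SyntaxError("expected '='")
    advance()
    tree = ("=", target, expr())
    if peek() is not None:
        raise SyntaxError(f"unexpected token {peek()!r}")
    return tree


example_tree = parse(["x", "=", "y", "*", "2", "+", "3"])
print(example_tree)  # ('=', 'x', ('+', ('*', 'y', '2'), '3'))
```

The multiplication ends up deeper in the tree than the addition, which is the structure drawn on the slide.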

How does flex work?
Rules:
1. a => A
2. def(ine)? => DEFINE
3. b => B
...
N. <regexp> => <action_id>

enum {
    B = 1,
    DEFINE = 2,
    A = 3,
}
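
The rule list above can be emulated with a short Python sketch: try every rule at the current position, keep the longest match, and let the earliest rule win ties. The three rules and the token ids come from the slide; the function names are hypothetical. Note that real flex does not try the rules one by one: it compiles all of them into a single table-driven state machine, which is what the rest of the talk recovers.

```python
import re

# Rules from the slide, in order; token ids mirror the enum shown there.
RULES = [
    (re.compile(r"a"), "A"),
    (re.compile(r"def(ine)?"), "DEFINE"),
    (re.compile(r"b"), "B"),
]
TOKEN_IDS = {"B": 1, "DEFINE": 2, "A": 3}


def next_token(text, pos):
    """Return (action, lexeme) for the longest match at pos; earliest rule wins ties."""
    best = None
    for pattern, action in RULES:
        m = pattern.match(text, pos)
        if m and (best is None or m.end() > best[0]):
            best = (m.end(), action, m.group())
    if best is None:
        raise ValueError(f"no rule matches at offset {pos}")
    return best[1], best[2]


def scan(text):
    """Tokenize the whole input into (token_id, lexeme) pairs."""
    pos, out = 0, []
    while pos < len(text):
        action, lexeme = next_token(text, pos)
        out.append((TOKEN_IDS[action], lexeme))
        pos += len(lexeme)
    return out


print(scan("adefinebdef"))  # [(3, 'a'), (2, 'define'), (1, 'b'), (2, 'def')]
```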

See: “Compilers: Principles, Techniques, and Tools” by Aho, Lam, Sethi, Ullman (AKA “The Dragon Book”)

How it works
2. Identifying the information we need from the target

What do we need from flex?
● table offsets + element size
● max state

Steps needed
1. Find yylex() in the binary
2. Find tables using data-flow analysis
3. Infer their element size from memory accesses
4. Interpret the tables to recover the finite-state machine

How to find the tables in a binary?
[Figure: data-flow expression trees for two table accesses, each built from LOAD, +, VAR and the constant 2, next to a jge comparison]
Example: dataflow query
[Figure: query pattern built from global_var1, global_var2, an array access [ ], a + node, and a >= comparison against a constant]
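
A data-flow query of this kind can be pictured as pattern matching on expression trees. The sketch below is a toy model only: the `Node` class and the two matcher functions are hypothetical and do not use Ghidra's or rev.ng's actual APIs. It assumes a table access looks like `LOAD(global + scale * index)`, so the scale gives the element size, and that the constant in a `>=` comparison (the `jge`) is the max state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """Toy expression-tree node (hypothetical, not a real decompiler API)."""
    op: str            # "LOAD", "ADD", "MUL", "GE", "CONST", "GLOBAL", "VAR"
    args: tuple = ()
    value: int = 0     # constant value or global address


def match_table_load(node):
    """Match LOAD(GLOBAL + index) or LOAD(GLOBAL + CONST * index).

    Returns (table_address, element_size), or None if the shape differs.
    """
    if node.op != "LOAD" or len(node.args) != 1:
        return None
    addr = node.args[0]
    if addr.op != "ADD" or len(addr.args) != 2:
        return None
    for base, offset in (addr.args, addr.args[::-1]):
        if base.op != "GLOBAL":
            continue
        if offset.op == "MUL" and len(offset.args) == 2:
            for scale, _index in (offset.args, offset.args[::-1]):
                if scale.op == "CONST":
                    return base.value, scale.value
        else:
            return base.value, 1  # unscaled index: byte-sized elements
    return None


def match_max_state(node):
    """Match `expr >= CONST` and return the constant, or None."""
    if node.op == "GE" and len(node.args) == 2 and node.args[1].op == "CONST":
        return node.args[1].value
    return None


# Made-up example values: table at 0x4040 with 2-byte elements, max state 258.
state = Node("VAR")
load = Node("LOAD", (Node("ADD", (Node("GLOBAL", value=0x4040),
                                  Node("MUL", (Node("CONST", value=2), state)))),))
check = Node("GE", (state, Node("CONST", value=258)))
print(match_table_load(load))  # (16448, 2)
print(match_max_state(check))  # 258
```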

How it works
3. Recover the Finite State Machine

Example: table decoding (simplified)
XX XX XX XX XX XX
XX XX XX XX XX XX
XX XX XX XX XX XX
XX XX XX XX XX XX
XX XX XX XX XX XX
XX XX XX XX XX XX
[Axes: curr state, input char; each cell holds the next state]
next = next_state[curr_state][input_char]

Example: table decoding (simplified)
curr state = 1
XX 0A XX XX XX XX
XX 08 XX XX XX XX
XX 0A XX XX XX XX
XX 09 XX XX XX XX
XX 0A XX XX XX XX
XX 02 XX XX XX XX
[Input chars labelled in the figure: “a”, “b”, “d”]
next = next_state[1][input_char]
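
The lookup `next = next_state[curr_state][input_char]` can be turned into a small decoder. This sketch follows the simplified full-table view from the slide; flex's default output uses compressed tables, which need more work to interpret. The table bytes, the little-endian layout, the tiny alphabet, and the convention that state 0 means "no transition" are all assumptions made for the example.

```python
import struct


def decode_table(raw, num_states, num_inputs, elem_size):
    """Interpret raw bytes as next_state[curr_state][input_char] (little-endian assumed)."""
    fmt = {1: "B", 2: "H", 4: "I"}[elem_size]
    count = num_states * num_inputs
    flat = struct.unpack(f"<{count}{fmt}", raw[: count * elem_size])
    return [list(flat[s * num_inputs:(s + 1) * num_inputs]) for s in range(num_states)]


def transitions(table, alphabet):
    """List (curr_state, char, next_state), treating next state 0 as 'no transition'."""
    edges = []
    for state, row in enumerate(table):
        for char, nxt in zip(alphabet, row):
            if nxt != 0:
                edges.append((state, char, nxt))
    return edges


# Made-up table: 3 states, 2 input chars ("a", "b"), 1-byte elements.
RAW = bytes([0, 0,   # state 0: dead
             2, 0,   # state 1: "a" -> 2
             0, 1])  # state 2: "b" -> 1
table = decode_table(RAW, num_states=3, num_inputs=2, elem_size=1)
print(table)                     # [[0, 0], [2, 0], [0, 1]]
print(transitions(table, "ab"))  # [(1, 'a', 2), (2, 'b', 1)]
```

The element size matters here: reading the same bytes with the wrong size yields a different state machine, which is why it has to be inferred from the memory accesses first.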

How it works
4. FSM => regular expression

Recovering the regular expressions
[Diagram: Reflex => simplify => def(ine)?]
See: “Introduction to Automata Theory, Languages, and Computation” by Hopcroft, Ullman
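
One textbook way to turn a finite-state machine into a regular expression is state elimination, sketched below. The slides do not say which algorithm Reflex uses, so treat this as an illustration of the idea, not as the tool's implementation. The DFA encoding (`{(state, char): next_state}`) and the function names are assumptions, and the output is not simplified: for the example it yields an alternation of `def` and `define` where the slide shows the simplified `def(ine)?`.

```python
import re


def union(a, b):
    """Regex for 'a or b'; None stands for 'no edge'."""
    if a is None:
        return b
    if b is None:
        return a
    return a if a == b else f"(?:{a}|{b})"


def star(a):
    """Regex for zero or more repetitions; empty for a missing or epsilon loop."""
    return "" if not a else f"(?:{a})*"


def dfa_to_regex(transitions, start, accepting):
    """Convert a DFA given as {(src, char): dst} into a regex string (None if empty)."""
    states = {start, *accepting}
    for (src, _), dst in transitions.items():
        states.update((src, dst))
    edge = {}  # (src, dst) -> regex label; "" is epsilon
    for (src, ch), dst in transitions.items():
        edge[src, dst] = union(edge.get((src, dst)), re.escape(ch))
    init, final = object(), object()  # fresh start and accept states
    edge[init, start] = ""
    for s in accepting:
        edge[s, final] = union(edge.get((s, final)), "")
    for k in states:  # eliminate the original states one at a time
        loop = star(edge.pop((k, k), None))
        ins = [(p, r) for (p, q), r in edge.items() if q == k]
        outs = [(q, r) for (p, q), r in edge.items() if p == k]
        for key in [e for e in edge if k in e]:
            del edge[key]
        for p, r_in in ins:
            for q, r_out in outs:
                edge[p, q] = union(edge.get((p, q)), r_in + loop + r_out)
    return edge.get((init, final))


# DFA for def(ine)?: states 3 and 6 accept.
DFA = {(0, "d"): 1, (1, "e"): 2, (2, "f"): 3, (3, "i"): 4, (4, "n"): 5, (5, "e"): 6}
pattern = dfa_to_regex(DFA, start=0, accepting={3, 6})
print(pattern)
print(bool(re.fullmatch(pattern, "define")), bool(re.fullmatch(pattern, "defi")))
```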

How it works
5. Generate valid tokens
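
Once a finite-state machine has been recovered, valid tokens can be produced without running the target, for instance by walking the machine until an accepting state is reached. The slides do not show how Reflex does this, so the breadth-first walk below is just one simple approach. The DFA encoding, the function name, and the `limit` / `max_len` parameters are assumptions for the example.

```python
from collections import deque


def generate_tokens(transitions, start, accepting, limit=10, max_len=16):
    """Breadth-first walk of a DFA given as {(src, char): dst}.

    Returns up to `limit` non-empty accepted strings, shortest first.
    """
    found = []
    queue = deque([(start, "")])
    while queue and len(found) < limit:
        state, text = queue.popleft()
        if state in accepting and text:
            found.append(text)
        if len(text) >= max_len:
            continue  # bound the search so loops in the DFA cannot run forever
        for (src, ch), dst in sorted(transitions.items()):
            if src == state:
                queue.append((dst, text + ch))
    return found


# DFA for def(ine)?: states 3 and 6 accept.
DFA_DEF = {(0, "d"): 1, (1, "e"): 2, (2, "f"): 3, (3, "i"): 4, (4, "n"): 5, (5, "e"): 6}
print(generate_tokens(DFA_DEF, start=0, accepting={3, 6}))  # ['def', 'define']
```

Strings produced this way can seed a fuzzer's dictionary, which fits the scenario of a target with few or no input samples.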

Takeaways
● Identify flex’s tables automatically ✅
● Extract the original FSM ✅
● Recover the lexer patterns ✅
● Generate valid tokens ✅
● No need to run the target ✅

Limitations
● Ghidra-based dataflow analysis not reliable
● Works only for table-driven parsers
● Implementation limited to bison/flex
● Doesn’t (yet) handle flex => bison
○ No structured file generation

Future work
● Find more bison-based targets (contact me)
● Reverse bison
● Improve yylex()/tables identification
○ rev.ng + LLVM IR + SCEV
● Look at other parser generators (e.g. antlr)
● Find bugs? (:

https://github.com/thebabush/reflex

Thank you!
(^o^)/
Any questions?
paolo.montesel@gmail.com
@pmontesel
https://babush.me/

Credits
Icons
● flaticon.com
● public domain
