Presenting a regular expression engine that gives parse trees in a single pass by modifying the standard non-deterministic finite-state automaton algorithm. My master's thesis.
Regular expressions that produce parse trees
1. Efficient Regular Expressions that produce Parse Trees
Aaron Karper Niko Schwarz
University of Bern
January 7, 2014
Aaron Karper, Niko Schwarz (UniBe)
Regex Parse Trees
January 7, 2014
2. Regular expressions so far
Regular expressions
https?://(([a-z]+.)+([a-z]+))((/[a-z0-9]+)/?)
(([a-z]+.)+([a-z]+)): domain
((/[a-z0-9]+)/?): path segments
http://www.reddit.com/r/computerscience/comments/1sg69d/
www, reddit, com: domain
r, computerscience, comments, 1sg69d: path
4. Regular expressions so far
Regular expressions are greedy by default:
(a+)(a?) on "aaa" → "aaa" in group 0 and "" in group 1.
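The greedy behaviour above can be checked with any backtracking engine. A minimal sketch using Python's standard `re` module as a stand-in (an assumption; the slides do not use Python). Note that Python numbers capture groups from 1, so the slide's groups 0 and 1 are `group(1)` and `group(2)` here.

```python
import re

# (a+) is greedy: it takes all three characters and leaves nothing for (a?).
match = re.fullmatch(r"(a+)(a?)", "aaa")
groups = match.groups()
print(groups)  # ('aaa', '')
```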
5. Regular expressions so far
Regular expressions so far
POSIX gives only one match.
Regular languages are recognized, but parsing with combinatorial parsers takes O(n^3).
Backtracking implementations (Java, Python, Perl, ...) are exponentially slow in the worst case.
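To illustrate what "only one match" means for a group inside a repetition, here is a small sketch with Python's `re` module. Python's engine is not a POSIX implementation, so this is only an analogy: it shows the same limitation, namely that a repeated capture group reports just its last iteration, while a parse tree would keep every iteration. The pattern is a simplified, hypothetical variant of the domain part of the URL expression.

```python
import re

# Group 1 sits inside a repetition and matches "www" and then "reddit".
match = re.fullmatch(r"(?:([a-z]+)\.)+([a-z]+)", "www.reddit.com")

# Only the last iteration of group 1 survives; "www" is lost.
last_only = match.groups()
print(last_only)  # ('reddit', 'com')
```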
7. Benchmarks
Matching ((a+b)+c)+ against (a^200 bc)^2000:
Tool              Time
JParsec           4,498
java.util.regex   1,992
Ours              5,332
Extract all class names from our project with a complex regular expression [1]:
Tool              Time
java.util.regex   11,319
Ours              8,047
[1] (.*?([a-z]+.)*([A-Z][a-zA-Z]*))*.*?
8. Benchmarks
Optimizations of the algorithm
Typically most time is spent in long repetitions. We optimize for that case by:
Lazily compiling a deterministic FA.
Avoiding re-creating a state if a similar state has already been seen.
Using a compressed representation inside a static repetition.
14. Benchmarks
NFA interpretation
Threads
State: q
Histories: h1 h2 h3 h4 h5 h6
Copy of thread is modified.
Copy of array of histories makes reading a character O(m^2).
Need faster persistent data structure to get O(m log m).
16. Benchmarks
NFA interpretation
Optimized thread forking
Set entry 2 to 20:
[Figure: tree with nodes 1 to 13, plus a new node 20 and a second node 1.]
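One way to get cheap thread forking is a persistent array built as a balanced tree, where an update copies only the path from the root to the changed entry. The sketch below is an illustration under assumptions, not the data structure from the slides: the names `Node`, `build`, `get` and `assoc` are hypothetical, and values live in the leaves only.

```python
class Node:
    __slots__ = ("left", "right", "value")

    def __init__(self, left=None, right=None, value=None):
        self.left, self.right, self.value = left, right, value


def build(values, lo=0, hi=None):
    """Build a balanced tree whose leaves hold values[lo:hi] (non-empty)."""
    if hi is None:
        hi = len(values)
    if hi - lo == 1:
        return Node(value=values[lo])
    mid = (lo + hi) // 2
    return Node(build(values, lo, mid), build(values, mid, hi))


def get(node, size, index):
    """Read entry `index` by walking down from the root."""
    lo, hi = 0, size
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if index < mid:
            node, hi = node.left, mid
        else:
            node, lo = node.right, mid
    return node.value


def assoc(node, size, index, value, lo=0, hi=None):
    """Return a new root with entry `index` set; only one path is copied."""
    if hi is None:
        hi = size
    if hi - lo == 1:
        return Node(value=value)
    mid = (lo + hi) // 2
    if index < mid:
        return Node(assoc(node.left, size, index, value, lo, mid), node.right)
    return Node(node.left, assoc(node.right, size, index, value, mid, hi))


old = build(list(range(1, 14)))   # entries 1..13
new = assoc(old, 13, 1, 20)       # set the second entry to 20
print(get(old, 13, 1), get(new, 13, 1))  # 2 20
```

The old version stays valid after the update, and both versions share every subtree off the copied path, so a fork costs O(log m) per changed entry instead of a full copy.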
17. Benchmarks
NFA interpretation
Example: (a?(a)b)+
Reading "aabab"
01234
[Figure: NFA with states q1 to q9. Thread histories shown, in order: [[], [], [], []]; [[0], [], [], []]; [[0], [], [], []]; [[0], [], [0], []].]
For each character read, threads start hungry and must eat immediately.
18. Benchmarks
NFA interpretation
Example: (a?(a)b)+
Reading "aabab"
01234
[Figure: NFA with states q1 to q9. Thread histories shown, in order: [[], [], [], []]; [[0], [], [], []]; [[0], [], [], []]; [[0], [], [0], []].]
For each character read, threads start hungry and must eat immediately.
Only a hungry thread can eat.
19. Benchmarks
NFA interpretation
Example: (a?(a)b)+
Reading "aabab"
01234
[Figure: NFA with states q1 to q9. Thread histories shown, in order: [[], [], [], []]; [[0], [], [], []]; [[0], [], [], []]; [[0], [], [0], []]; [[0], [], [0], [0]].]
For each character read, threads start hungry and must eat immediately.
Only a hungry thread can eat.
20. Benchmarks
NFA interpretation
Example: (a?(a)b)+
Reading "aabab"
01234
[Figure: NFA with states q1 to q9. Thread histories shown, in order: [[], [], [], []]; [[0], [], [], []]; [[0], [], [0], []]; [[0], [], [0], [0]].]
For each character read, threads start hungry and must eat immediately.
Only a hungry thread can eat.
21. Benchmarks
NFA interpretation
Example: (a?(a)b)+
Reading "aabab"
01234
[Figure: NFA with states q1 to q9. Thread histories shown, in order: [[0], [], [0], []]; [[0], [], [1], []]; [[0], [], [], []]; [[0], [], [0], [0]].]
For each character read, threads start hungry and must eat immediately.
Only a hungry thread can eat.
22. Benchmarks
NFA interpretation
Example: (a?(a)b)+
Reading "aabab"
01234
[Figure: NFA with states q1 to q9. Thread histories shown, in order: [[0], [], [1], []]; [[0], [], [0], [0]].]
For each character read, threads start hungry and must eat immediately.
Only a hungry thread can eat.
24. Benchmarks
NFA interpretation
Example: (a?(a)b)+
Reading "aabab"
01234
[Figure: NFA with states q1 to q9. Thread histories shown, in order: [[0], [], [1], []]; [[0], [], [1], [1]].]
For each character read, threads start hungry and must eat immediately.
Only a hungry thread can eat.
25. Benchmarks
NFA interpretation
Example: (a?(a)b)+
Reading "aabab"
01234
[Figure: NFA with states q1 to q9. Thread history shown: [[0], [], [1], [1]].]
For each character read, threads start hungry and must eat immediately.
Only a hungry thread can eat.
26. Benchmarks
NFA interpretation
Example: (a?(a)b)+
Reading "aabab"
01234
[Figure: NFA with states q1 to q9. Thread histories shown, in order: [[0], [2], [1], [1]]; [[0,2], [2], [1], [1]]; [[0,2], [2], [1], [1]]; [[0,2], [2], [1,3], [1]]; [[0], [], [1], [1]]; [[0], [2], [1], [1]]; [[0], [2], [1], [1]].]
For each character read, threads start hungry and must eat immediately.
Only a hungry thread can eat.
28. Benchmarks
NFA interpretation
Example: (a?(a)b)+
Reading "aabab"
01234
[Figure: NFA with states q1 to q9. Thread histories shown, in order: [[0,2], [2], [1,3], [1]]; [[0,2], [2], [1,4], [1]]; [[0,2], [2], [1], [1]]; [[0,2], [2], [1,3], [1,3]].]
For each character read, threads start hungry and must eat immediately.
Only a hungry thread can eat.
30. Benchmarks
NFA interpretation
Example: (a?(a)b)+
Reading "aabab"
01234
[Figure: NFA with states q1 to q9. Thread histories shown, in order: [[0,2], [2,4], [1,3], [1,3]]; [[0,2,5], [2,4], [1,3], [1,3]]; [[0,2,5], [2,4], [1,3], [1,3]]; [[0,2,5], [2,4,5], [1,3], [1,3]]; [[0,2], [2], [1,3], [1,3]]; [[0,2], [2,4], [1,3], [1,3]]; [[0,2], [2,4], [1,3], [1,3]].]
For each character read, threads start hungry and must eat immediately.
Only a hungry thread can eat.
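The walkthrough of (a?(a)b)+ on "aabab" can be approximated with a small thread-based NFA simulation in which every thread carries its own history. This is a rough sketch, not the algorithm from the thesis: the instruction encoding, the names `PROGRAM`, `add_thread`, `parse` and `spans`, and the half-open (start, end) spans are all assumptions made for illustration, so the indices differ from the ones in the slides. Histories are persistent linked lists, so forking a thread copies nothing.

```python
CHAR, SPLIT, OPEN, CLOSE, MATCH = "char", "split", "open", "close", "match"

# Program for (a?(a)b)+ ; group 0 is the outer group, group 1 the inner (a).
PROGRAM = [
    (OPEN, 0),       # 0
    (SPLIT, 2, 3),   # 1: a? prefers to consume an 'a'
    (CHAR, "a"),     # 2
    (OPEN, 1),       # 3
    (CHAR, "a"),     # 4
    (CLOSE, 1),      # 5
    (CHAR, "b"),     # 6
    (CLOSE, 0),      # 7
    (SPLIT, 0, 9),   # 8: + prefers another iteration
    (MATCH,),        # 9
]


def add_thread(threads, seen, pc, hist, pos):
    """Follow non-consuming moves from pc, in priority order."""
    if pc in seen:   # a higher-priority thread already reached this state
        return
    seen.add(pc)
    op = PROGRAM[pc]
    if op[0] == SPLIT:
        add_thread(threads, seen, op[1], hist, pos)
        add_thread(threads, seen, op[2], hist, pos)
    elif op[0] in (OPEN, CLOSE):
        # Extend the history without touching the parent's copy.
        add_thread(threads, seen, pc + 1, ((op[0], op[1], pos), hist), pos)
    else:            # CHAR or MATCH: the thread waits for input here
        threads.append((pc, hist))


def spans(hist):
    """Turn a linked history into {group: [(start, end), ...]}."""
    events = []
    while hist is not None:
        event, hist = hist
        events.append(event)
    result, open_at = {}, {}
    for kind, group, pos in reversed(events):
        if kind == OPEN:
            open_at[group] = pos
        else:
            result.setdefault(group, []).append((open_at[group], pos))
    return result


def parse(text):
    threads = []
    add_thread(threads, set(), 0, None, 0)
    for pos, ch in enumerate(text):
        following, seen = [], set()
        for pc, hist in threads:
            op = PROGRAM[pc]
            if op[0] == CHAR and op[1] == ch:
                add_thread(following, seen, pc + 1, hist, pos + 1)
        threads = following
    for pc, hist in threads:
        if PROGRAM[pc][0] == MATCH:
            return spans(hist)
    return None


tree = parse("aabab")
print(tree[0], tree[1])  # [(0, 3), (3, 5)] [(1, 2), (3, 4)]
```

Every iteration of both groups is kept, in a single pass over the input and without backtracking.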
34. Backtracking’s nightmare
(a+a+)+b against "a^n b" will backtrack Θ(2^n) times.
35. Backtracking’s nightmare
Extract the first cell in a CSV that starts with "P" [1]:
^(.*?,)+(P.*?),
failing against
"1,2,3,4,5,6,7,8,9,10,11,12,13"
is exponential.
[1] From http://www.regular-expressions.info/catastrophic.html
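The failing CSV case can be tried out with a backtracking engine. A minimal sketch using Python's `re` module as a stand-in for the engines named earlier; the helper names `failing_row` and `time_failure` are hypothetical. Timings depend on the engine, its version and the machine, so no expected numbers are given.

```python
import re
import time

PATTERN = re.compile(r"^(.*?,)+(P.*?),")


def failing_row(fields):
    """Build "1,2,...,fields", which contains no cell starting with P."""
    return ",".join(str(i) for i in range(1, fields + 1))


def time_failure(fields):
    """Return the (failed) match result and the seconds spent on it."""
    row = failing_row(fields)
    start = time.perf_counter()
    result = PATTERN.search(row)
    return result, time.perf_counter() - start


for fields in (5, 9, 13):
    result, seconds = time_failure(fields)
    print(fields, result, f"{seconds:.6f}s")
```

Each comma can either end an iteration of the group or be swallowed by `.*?`, so a backtracker has many ways to split the row before it can report failure.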
36. Thread execution order matters
.*(a?)
[Figure: NFA for .*(a?) with states q1 (start) to q5 and transitions labelled any, a, τ1↑ and τ1↓.]
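The effect of priorities on .*(a?) can be observed with Python's `re` module, used here only as a stand-in (an assumption): its greedy and lazy quantifiers correspond to the two possible preference orders between staying in the `.*` loop and entering the group.

```python
import re

# Greedy .* is preferred over the group: the group captures nothing.
greedy = re.fullmatch(r".*(a?)", "aa")
# Lazy .*? prefers leaving the loop early: the group captures the last 'a'.
lazy = re.fullmatch(r".*?(a?)", "aa")

print(repr(greedy.group(1)), repr(lazy.group(1)))  # '' 'a'
```

Both expressions accept the same strings, but they produce different captures, which is why the order in which threads are executed has to be fixed.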
38. Optimization Pipeline
  1. Convert to nondeterministic FA.
  2. Interpret nondeterministic FA, building deterministic FA lazily.
  3. Find similar/mappable states to avoid creating infinite DFA.
  4. Run on DFA if possible.
  5. Compactify DFA if creation of new states wasn’t necessary for a while.
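Steps 2 and 4 of the pipeline can be sketched as a lazily built DFA: the NFA is interpreted only when a transition is seen for the first time, and the result is cached as a DFA edge. This is a simplified illustration under assumptions. It ignores capture groups and histories, so steps 3 and 5 do not arise, and the example automaton for (a|b)*abb and the class name `LazyDFA` are hypothetical.

```python
# NFA for (a|b)*abb without epsilon moves: (state, char) -> set of states.
NFA = {
    (0, "a"): {0, 1},
    (0, "b"): {0},
    (1, "b"): {2},
    (2, "b"): {3},
}
START, ACCEPT = 0, 3


class LazyDFA:
    def __init__(self):
        # (frozenset of NFA states, char) -> frozenset of NFA states
        self.cache = {}

    def step(self, state, ch):
        key = (state, ch)
        if key not in self.cache:
            # First visit: interpret the NFA and remember the DFA edge.
            target = set()
            for q in state:
                target |= NFA.get((q, ch), set())
            self.cache[key] = frozenset(target)
        return self.cache[key]

    def matches(self, text):
        state = frozenset({START})
        for ch in text:
            state = self.step(state, ch)
        return ACCEPT in state


dfa = LazyDFA()
print(dfa.matches("ababb"), len(dfa.cache))  # True 4
```

Only the DFA states that the input actually reaches are ever built, which matters when most of the time is spent in long repetitions.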