# Regular expressions that produce parse trees

Presenting a regular expression engine that produces parse trees in a single pass by modifying the standard nondeterministic finite-state automaton algorithm. My master's thesis.

### Transcript

• 1. Efficient Regular Expressions that produce Parse Trees. Aaron Karper, Niko Schwarz, University of Bern, January 7, 2014. Slide footer: Aaron Karper, Niko Schwarz (UniBe), Regex Parse Trees, January 7, 2014.
• 2. Regular expressions so far. Regular expressions: https?://(([a-z]+.)+([a-z]+))((/[a-z0-9]+)/?) with the first top-level group labeled "domain" and the second labeled "path segments".
• 3. Regular expressions so far. Regular expressions: https?://(([a-z]+.)+([a-z]+))((/[a-z0-9]+)/?) applied to http://www.reddit.com/r/computerscience/comments/1sg69d/. The parts www., reddit., and com are each labeled "domain"; the parts /r, /computerscience, /comments, and /1sg69d are each labeled "path".
• 4. Regular expressions so far. Regular expressions are greedy by default: (a+)(a?) on "aaa" puts "aaa" in group 0, the (a+) group, and "" in group 1, the (a?) group, counting capture groups from 0.
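The greedy behavior on this slide can be checked with any backtracking engine. The sketch below uses Python's `re` module, which is not the engine presented in the talk. Note that Python numbers capture groups from 1, so the slide's group 0 and group 1 are `group(1)` and `group(2)` here. The lazy variant is an extra illustration that is not on the slide.

```python
import re

# Greedy: (a+) takes as many characters as it can, leaving nothing for (a?).
greedy = re.fullmatch(r"(a+)(a?)", "aaa")
greedy_groups = greedy.groups()

# Lazy (not on the slide): (a+?) takes as few characters as it can,
# so (a?) gets to consume the last 'a'.
lazy = re.fullmatch(r"(a+?)(a?)", "aaa")
lazy_groups = lazy.groups()

print(greedy_groups)  # ('aaa', '')
print(lazy_groups)    # ('aa', 'a')
```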
• 5. Regular expressions so far. POSIX gives only one match per group. Regular languages can be recognized, but parsing with combinator parsers takes O(n^3). Backtracking implementations (Java, Python, Perl, ...) take exponential time in the worst case.
• 6. Benchmarks. Parsing with https?://(([a-z]+.)+([a-z]+))((/[a-z0-9]+)/?). Figure (POSIX): groups 0 to 4, each marked once over http:// www. reddit. com /r /computerscience /comments /1sg69d. Figure (our approach): groups 0, 1, and 3 marked once, group 2 marked three times, and group 4 marked four times over the same URL.
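The difference between the two figures is that a conventional engine keeps one slot per capture group. A minimal sketch with Python's `re` (a backtracking engine, used here only as a stand-in for the one-match-per-group behavior) shows a repeated group remembering just its last iteration. The simplified domain pattern with an escaped dot is an assumption made for this illustration, not the regular expression from the slide.

```python
import re

domain = "www.reddit.com"

# One slot per group: the repeated group only remembers its last iteration.
m = re.fullmatch(r"(?:([a-z]+)\.)+([a-z]+)", domain)
last_label = m.group(1)   # only the final repetition survives
top_level = m.group(2)

# Recovering every iteration needs a second pass over the same text.
all_labels = re.findall(r"([a-z]+)\.", domain)

print(last_label, top_level, all_labels)  # reddit com ['www', 'reddit']
```

A parse tree, as in the second figure, would carry all of these matches after a single pass.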
• 7. Benchmarks. Matching ((a+b)+c)+ against (a^200 bc)^2000:

| Tool | Time |
| --- | --- |
| JParsec | 4,498 |
| java.util.regex | 1,992 |
| Ours | 5,332 |

Extracting all class names from our project with a complex regular expression, (.*?([a-z]+.)*([A-Z][a-zA-Z]*))*.*?:

| Tool | Time |
| --- | --- |
| java.util.regex | 11,319 |
| Ours | 8,047 |
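For readers who want to reproduce the first input, the notation (a^200 bc)^2000 means 200 letters a followed by bc, repeated 2000 times. The sketch below builds that string and times one match with Python's `re`. The function names are hypothetical, and timings from Python are not comparable with the Java numbers in the table.

```python
import re
import time

PATTERN = re.compile(r"((a+b)+c)+")

def benchmark_input(inner=200, outer=2000):
    """Return (a^inner b c)^outer; inner=2, outer=2 gives 'aabcaabc'."""
    return ("a" * inner + "bc") * outer

def time_full_match(text):
    """Return (matched, seconds) for one full match of PATTERN against text."""
    start = time.perf_counter()
    matched = PATTERN.fullmatch(text) is not None
    return matched, time.perf_counter() - start
```

Calling `time_full_match(benchmark_input())` runs the full-size case of 404,000 characters.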
• 8. Benchmarks: Optimizations of the algorithm. Most time is typically spent in long repetitions, so we optimize for that case by: lazily compiling a deterministic FA; not recreating a state if a similar state has already been seen; using a compressed representation inside a static repetition.
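The first optimization, lazy DFA compilation, can be sketched on a plain NFA without capture groups. This is an illustration of the general idea under simplifying assumptions, not the talk's implementation, which also has to track thread histories. The class `LazyDFA` and its dictionary-based NFA encoding are hypothetical.

```python
class LazyDFA:
    """Subset construction on demand: DFA transitions are computed once, then cached."""

    def __init__(self, delta, eps, start, accepting):
        self.delta = delta            # {(nfa_state, char): set of nfa_states}
        self.eps = eps                # {nfa_state: set of nfa_states}
        self.accepting = accepting    # set of accepting nfa_states
        self.start = self._closure({start})
        self.cache = {}               # {(dfa_state, char): dfa_state}

    def _closure(self, states):
        """Epsilon closure of a set of NFA states, as a hashable DFA state."""
        stack, seen = list(states), set(states)
        while stack:
            state = stack.pop()
            for target in self.eps.get(state, ()):
                if target not in seen:
                    seen.add(target)
                    stack.append(target)
        return frozenset(seen)

    def step(self, dfa_state, ch):
        key = (dfa_state, ch)
        if key not in self.cache:     # interpret the NFA only on a cache miss
            moved = set()
            for state in dfa_state:
                moved |= self.delta.get((state, ch), set())
            self.cache[key] = self._closure(moved)
        return self.cache[key]

    def accepts(self, text):
        current = self.start
        for ch in text:
            current = self.step(current, ch)
        return bool(current & self.accepting)


# NFA for (ab)+ : 0 -a-> 1 -b-> 2, with an epsilon edge from 2 back to 0.
dfa = LazyDFA(delta={(0, "a"): {1}, (1, "b"): {2}}, eps={2: {0}},
              start=0, accepting={2})
```

In a long repetition such as `"ab" * 50`, the cache stops growing after the first few characters, so the rest of the input runs on cached DFA transitions only.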
• 9. Benchmarks: NFA interpretation. Example: (a?(a)b)+. Parse (a?(a)b)+ over "aabab". Figure: parse tree with groups 1, 2, 1, 2 over the characters a a b a b at positions 0 1 2 3 4.
• 10. Benchmarks: NFA interpretation. Example: (a?(a)b)+, reading "aabab" (positions 0 to 4). Figure: NFA with states q1 to q9. Thread histories shown: [[], [], [], []].
• 11. Benchmarks: NFA interpretation. Example: (a?(a)b)+, reading "aabab" (positions 0 to 4). Figure: NFA with states q1 to q9. Thread histories shown: [[], [], [], []]; [[0], [], [], []].
• 12. Benchmarks: NFA interpretation. Example: (a?(a)b)+, reading "aabab" (positions 0 to 4). Figure: NFA with states q1 to q9. Thread histories shown: [[], [], [], []]; [[0], [], [], []]; [[0], [], [], []].
• 13. Benchmarks: NFA interpretation. Example: (a?(a)b)+, reading "aabab" (positions 0 to 4). Figure: NFA with states q1 to q9. Thread histories shown for q1 to q4: [[], [], [], []]; [[0], [], [], []]; [[0], [], [], []]; [[0], [], [0], []].
• 14. Benchmarks: NFA interpretation. Threads. A thread consists of a state q and histories h1 to h6. A copy of the thread is modified. Copying the array of histories makes reading a character O(m^2). A faster persistent data structure is needed to get O(m log m).
• 15. Benchmarks: NFA interpretation. Optimized thread forking. Set entry 2 to 20. Figure: tree with nodes 1 to 13.
• 16. Benchmarks: NFA interpretation. Optimized thread forking. Set entry 2 to 20. Figure: the same tree after the update, with a new node 20 and a second copy of node 1 next to the original nodes 1 to 13.
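The update on these two slides looks like path copying in a persistent tree: only the nodes on the path to the changed entry are recreated, and everything else is shared between the old and the new version. The sketch below is one way to get that behavior, assuming a balanced binary tree over the entries; the tuple encoding and the names `build`, `get`, and `update` are hypothetical, and "entry 2" is taken to mean the second entry (index 1).

```python
def build(items):
    """Balanced tree over a non-empty list: ('leaf', value) or ('node', left, right, left_size)."""
    if len(items) == 1:
        return ("leaf", items[0])
    mid = len(items) // 2
    return ("node", build(items[:mid]), build(items[mid:]), mid)

def get(tree, index):
    """Read one entry by walking from the root to a leaf."""
    while tree[0] == "node":
        _, left, right, left_size = tree
        if index < left_size:
            tree = left
        else:
            tree, index = right, index - left_size
    return tree[1]

def update(tree, index, value):
    """Return a new version; only the nodes on the path to index are copied."""
    if tree[0] == "leaf":
        return ("leaf", value)
    _, left, right, left_size = tree
    if index < left_size:
        return ("node", update(left, index, value), right, left_size)
    return ("node", left, update(right, index - left_size, value), left_size)


old = build(list(range(1, 14)))   # entries 1 to 13
new = update(old, 1, 20)          # set the second entry to 20
```

Forking a thread then costs one root pointer, and each history update touches a logarithmic number of nodes instead of copying the whole array.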
• 17. Benchmarks: NFA interpretation. Example: (a?(a)b)+, reading "aabab" (positions 0 to 4). Figure: NFA with states q1 to q9. Thread histories shown for q1 to q4: [[], [], [], []]; [[0], [], [], []]; [[0], [], [], []]; [[0], [], [0], []]. For each character read, threads start hungry and must eat immediately.
• 18. Benchmarks: NFA interpretation. Example: (a?(a)b)+, reading "aabab" (positions 0 to 4). Figure: NFA with states q1 to q9. Thread histories shown: [[], [], [], []]; [[0], [], [], []]; [[0], [], [], []]; [[0], [], [0], []]. For each character read, threads start hungry and must eat immediately. Only a hungry thread can eat.
• 19. Benchmarks: NFA interpretation. Example: (a?(a)b)+, reading "aabab" (positions 0 to 4). Figure: NFA with states q1 to q9. Thread histories shown: [[], [], [], []]; [[0], [], [], []]; [[0], [], [], []]; [[0], [], [0], []]; [[0], [], [0], [0]]. For each character read, threads start hungry and must eat immediately. Only a hungry thread can eat.
• 20. Benchmarks: NFA interpretation. Example: (a?(a)b)+, reading "aabab" (positions 0 to 4). Figure: NFA with states q1 to q9. Thread histories shown: [[], [], [], []]; [[0], [], [], []]; [[0], [], [0], []]; [[0], [], [0], [0]]. For each character read, threads start hungry and must eat immediately. Only a hungry thread can eat.
• 21. Benchmarks: NFA interpretation. Example: (a?(a)b)+, reading "aabab" (positions 0 to 4). Figure: NFA with states q1 to q9. Thread histories shown: [[0], [], [0], []]; [[0], [], [1], []]; [[0], [], [], []]; [[0], [], [0], [0]]. For each character read, threads start hungry and must eat immediately. Only a hungry thread can eat.
• 22. Benchmarks: NFA interpretation. Example: (a?(a)b)+, reading "aabab" (positions 0 to 4). Figure: NFA with states q1 to q9. Thread histories shown: [[0], [], [1], []]; [[0], [], [0], [0]]. For each character read, threads start hungry and must eat immediately. Only a hungry thread can eat.
• 23. Benchmarks: NFA interpretation. Example: (a?(a)b)+, reading "aabab" (positions 0 to 4). Figure: NFA with states q1 to q9. Thread histories shown: [[0], [], [1], []]; [[0], [], [0], [0]]. For each character read, threads start hungry and must eat immediately. Only a hungry thread can eat.
• 24. Benchmarks: NFA interpretation. Example: (a?(a)b)+, reading "aabab" (positions 0 to 4). Figure: NFA with states q1 to q9. Thread histories shown: [[0], [], [1], []]; [[0], [], [1], [1]]. For each character read, threads start hungry and must eat immediately. Only a hungry thread can eat.
• 25. Benchmarks: NFA interpretation. Example: (a?(a)b)+, reading "aabab" (positions 0 to 4). Figure: NFA with states q1 to q9. Thread histories shown: [[0], [], [1], [1]]. For each character read, threads start hungry and must eat immediately. Only a hungry thread can eat.
• 26. Benchmarks: NFA interpretation. Example: (a?(a)b)+, reading "aabab" (positions 0 to 4). Figure: NFA with states q1 to q9. Thread histories shown: [[0], [2], [1], [1]]; [[0,2], [2], [1], [1]]; [[0,2], [2], [1], [1]]; [[0,2], [2], [1,3], [1]]; [[0], [], [1], [1]]; [[0], [2], [1], [1]]; [[0], [2], [1], [1]]. For each character read, threads start hungry and must eat immediately. Only a hungry thread can eat.
• 27. Benchmarks: NFA interpretation. Example: (a?(a)b)+, reading "aabab" (positions 0 to 4). Figure: NFA with states q1 to q9. Thread histories shown: [[0], [2], [1], [1]]; [[0,2], [2], [1], [1]]; [[0,2], [2], [1], [1]]; [[0,2], [2], [1,3], [1]]; [[0], [], [1], [1]]; [[0], [2], [1], [1]]; [[0], [2], [1], [1]]. For each character read, threads start hungry and must eat immediately. Only a hungry thread can eat.
• 28. Benchmarks: NFA interpretation. Example: (a?(a)b)+, reading "aabab" (positions 0 to 4). Figure: NFA with states q1 to q9. Thread histories shown: [[0,2], [2], [1,3], [1]]; [[0,2], [2], [1,4], [1]]; [[0,2], [2], [1], [1]]; [[0,2], [2], [1,3], [1,3]]. For each character read, threads start hungry and must eat immediately. Only a hungry thread can eat.
• 29. Benchmarks: NFA interpretation. Example: (a?(a)b)+, reading "aabab" (positions 0 to 4). Figure: NFA with states q1 to q9. Thread histories shown: [[0,2], [2], [1,3], [1]]; [[0,2], [2], [1,4], [1]]; [[0,2], [2], [1], [1]]; [[0,2], [2], [1,3], [1,3]]. For each character read, threads start hungry and must eat immediately. Only a hungry thread can eat.
• 30. Benchmarks: NFA interpretation. Example: (a?(a)b)+, reading "aabab" (positions 0 to 4). Figure: NFA with states q1 to q9. Thread histories shown: [[0,2], [2,4], [1,3], [1,3]]; [[0,2,5], [2,4], [1,3], [1,3]]; [[0,2,5], [2,4], [1,3], [1,3]]; [[0,2,5], [2,4,5], [1,3], [1,3]]; [[0,2], [2], [1,3], [1,3]]; [[0,2], [2,4], [1,3], [1,3]]; [[0,2], [2,4], [1,3], [1,3]]. For each character read, threads start hungry and must eat immediately. Only a hungry thread can eat.
• 31. Benchmarks: NFA interpretation. Example: (a?(a)b)+, reading "aabab" (positions 0 to 4). The surviving thread is in q9 with histories [[0,2], [2,4], [1,3], [1,3]]. Figure: parse tree with groups 1, 2, 1, 2 over the characters a a b a b at positions 0 1 2 3 4.
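The walkthrough above can be approximated with a small thread-list simulation in the style of a Pike VM, where every thread carries one history per group boundary. This is a sketch under assumptions, not the authors' implementation: the instruction set, the hand-written program for (a?(a)b)+, and the half-open positions are choices made here, so the recorded numbers need not coincide with those on the slides. Histories are plain tuples, which is the copying cost that a persistent data structure is meant to reduce.

```python
# Program for (a?(a)b)+. History slots: 0/1 open/close group 1, 2/3 open/close group 2.
PROG = [
    ("save", 0),      # 0: open group 1
    ("split", 2, 3),  # 1: a? prefers to consume an 'a'
    ("char", "a"),    # 2
    ("save", 2),      # 3: open group 2
    ("char", "a"),    # 4
    ("save", 3),      # 5: close group 2
    ("char", "b"),    # 6
    ("save", 1),      # 7: close group 1
    ("split", 0, 9),  # 8: + prefers another iteration
    ("match",),       # 9
]

def add_thread(prog, threads, seen, pc, hist, pos):
    """Follow non-consuming instructions from pc; preferred paths are added first."""
    if pc in seen:                      # a higher-priority thread already got here
        return
    seen.add(pc)
    op = prog[pc]
    if op[0] == "split":
        add_thread(prog, threads, seen, op[1], hist, pos)
        add_thread(prog, threads, seen, op[2], hist, pos)
    elif op[0] == "save":
        slot = op[1]
        new_hist = hist[:slot] + (hist[slot] + (pos,),) + hist[slot + 1:]
        add_thread(prog, threads, seen, pc + 1, new_hist, pos)
    else:                               # 'char' or 'match': the thread waits here
        threads.append((pc, hist))

def parse(prog, slots, text):
    """Return the histories of the best full match, or None if text does not match."""
    threads = []
    add_thread(prog, threads, set(), 0, ((),) * slots, 0)
    for pos, ch in enumerate(text):
        next_threads, seen = [], set()
        for pc, hist in threads:        # in priority order
            op = prog[pc]
            if op[0] == "char" and op[1] == ch:
                add_thread(prog, next_threads, seen, pc + 1, hist, pos + 1)
        threads = next_threads
    for pc, hist in threads:
        if prog[pc][0] == "match":
            return hist
    return None

def spans(hist, group):
    """Pair up the open and close positions of a group (0-based) as half-open spans."""
    return list(zip(hist[2 * group], hist[2 * group + 1]))

hist = parse(PROG, 4, "aabab")
print(spans(hist, 0))  # [(0, 3), (3, 5)]
print(spans(hist, 1))  # [(1, 2), (3, 4)]
```

With half-open spans, the outer group covers "aab" and "ab" and the inner group covers the two single letters a, which has the same shape as the parse tree on the final slide of the walkthrough.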
• 32. Download: https://github.com/nes1983/tree-regex
• 33. NFA construction. Figure: automaton fragments for Optional S?, Alternation S1|S2, Capture group (S), and Star operation S*?.
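The four constructions on this slide can be sketched as a compiler from a small syntax tree to a list of instructions, where `split` plays the role of a prioritized epsilon fork and `save` marks a group boundary. The tuple-based syntax tree, the instruction names, and the reading of S*? as a non-greedy star are assumptions for this illustration; the slide's automaton diagrams did not survive in the transcript.

```python
def compile_node(node, prog):
    """Append the instructions for one syntax-tree node to prog."""
    kind = node[0]
    if kind == "char":                       # a single literal character
        prog.append(("char", node[1]))
    elif kind == "cat":                      # S1 S2 ...
        for child in node[1:]:
            compile_node(child, prog)
    elif kind == "opt":                      # S? : prefer entering S
        split_at = len(prog)
        prog.append(None)
        compile_node(node[1], prog)
        prog[split_at] = ("split", split_at + 1, len(prog))
    elif kind == "alt":                      # S1|S2 : prefer S1
        split_at = len(prog)
        prog.append(None)
        compile_node(node[1], prog)
        jump_at = len(prog)
        prog.append(None)
        prog[split_at] = ("split", split_at + 1, len(prog))
        compile_node(node[2], prog)
        prog[jump_at] = ("jmp", len(prog))
    elif kind == "group":                    # (S) with 0-based group index
        index = node[1]
        prog.append(("save", 2 * index))
        compile_node(node[2], prog)
        prog.append(("save", 2 * index + 1))
    elif kind == "lazy_star":                # S*? : prefer leaving the loop
        split_at = len(prog)
        prog.append(None)
        compile_node(node[1], prog)
        prog.append(("jmp", split_at))
        prog[split_at] = ("split", len(prog), split_at + 1)
    else:
        raise ValueError(f"unknown node kind: {kind}")

def compile_regex(tree):
    """Compile a syntax tree into a program that ends with a match instruction."""
    prog = []
    compile_node(tree, prog)
    prog.append(("match",))
    return prog
```

For example, `compile_regex(("group", 0, ("opt", ("char", "a"))))` yields `save 0`, `split 2 3`, `char a`, `save 1`, `match`. The order of the two `split` targets encodes which branch has priority.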
• 34. Backtracking's nightmare. (a+a+)+b against "a^n b" will backtrack Θ(2^n) times.
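To get a feeling for the growth, the sketch below models a naive backtracker for (a+a+)+b on an input where the match fails, here n letters a with no b at all (an input chosen for this illustration). It counts how often the search reaches the final b test; each such attempt corresponds to one way of splitting a prefix among the repetitions of the group. The function `b_attempts` is a hypothetical counting model, not code from a real engine.

```python
def b_attempts(n, start=0):
    """Count the failed 'b' tests of a naive backtracker for (a+a+)+b on 'a' * n.

    start is the position where the next iteration of the group begins.
    """
    total = 0
    for first in range(1, n - start):                   # length of the first a+
        for second in range(1, n - start - first + 1):  # length of the second a+
            end = start + first + second
            total += 1                      # try 'b' at end: always fails here
            total += b_attempts(n, end)     # or run the group once more
    return total

for n in (2, 3, 4, 10):
    print(n, b_attempts(n))  # 2 1, 3 3, 4 7, 10 511
```

In this model the count is 2^(n-1) - 1, so every additional character roughly doubles the work, while an automaton-based matcher only has to track a bounded set of states per input position.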
• 35. Backtracking's nightmare. Extract the first cell in a CSV that starts with "P": ^(.*?,)+(P.*?), failing against "1,2,3,4,5,6,7,8,9,10,11,12,13", is exponential. (Example from http://www.regular-expressions.info/catastrophic.html)
• 36. Thread execution order matters: .*(a?). Figure: NFA with states q1 (start) to q5 and transitions labeled any, τ1↑, a, and τ1↓.
• 37. Priority matters: (a)|(a). Figure: NFA with states q1 (start) to q6; one branch is labeled τ1↑, a, τ1↓ and the other τ2↑, a, τ2↓.
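Both examples have more than one valid parse, so the engine needs a rule for choosing among threads. Python's `re`, a backtracking engine used here only to show the conventional answers, prefers the left alternative and lets a greedy `.*` consume first. The lazy variant at the end is an extra illustration that is not on the slides.

```python
import re

# (a)|(a): both branches match "a"; the left branch has priority.
alternation = re.fullmatch(r"(a)|(a)", "a")
print(alternation.groups())        # ('a', None)

# .*(a?): the greedy .* runs first and eats everything, so the group is empty.
greedy = re.fullmatch(r".*(a?)", "aa")
print(repr(greedy.group(1)))       # ''

# With a lazy .*? (not on the slide) the group gets the last 'a' instead.
lazy = re.fullmatch(r".*?(a?)", "aa")
print(repr(lazy.group(1)))         # 'a'
```

An automaton-based engine has to reproduce these choices by ordering its threads, since it explores all alternatives at once.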
• 38. Optimization pipeline: 1. Convert to a nondeterministic FA. 2. Interpret the nondeterministic FA, building the deterministic FA lazily. 3. Find similar/mappable states to avoid creating an infinite DFA. 4. Run on the DFA if possible. 5. Compact the DFA if no new states had to be created for a while.
• 39. NFA interpretation.