Efficient Regular Expressions that produce Parse Trees
Aaron Karper, Niko Schwarz
University of Bern

January 7, 2014

Regular expressions that produce parse trees

Presenting a regular expression engine that gives parse trees in a single pass by modifying the standard nondeterministic finite-state automaton algorithm. My master's thesis.


  1. Efficient Regular Expressions that produce Parse Trees. Aaron Karper, Niko Schwarz, University of Bern. January 7, 2014.
  2. Regular expressions so far. Regular expressions: https?://(([a-z]+.)+([a-z]+))((/[a-z0-9]+)/?), where the first top-level group is labelled "domain" and the second "path segments".
  3. Regular expressions so far. The same expression applied to http://www.reddit.com/r/computerscience/comments/1sg69d/: the parts www., reddit. and com are each labelled "domain"; the parts /r, /computerscience, /comments and /1sg69d are each labelled "path".
  4. Regular expressions so far. Regular expressions are greedy by default: (a+)(a?) on "aaa" → "aaa" in group 0 and "" in group 1.
  5. Regular expressions so far. POSIX gives only one match per capture group. Regular languages are recognized, but parsing with combinatorial parsers takes O(n^3). Backtracking implementations (Java, Python, Perl, ...) are exponentially slow in the worst case.
  6. Benchmarks. Parsing http://www.reddit.com/r/computerscience/comments/1sg69d with https?://(([a-z]+.)+([a-z]+))((/[a-z0-9]+)/?). [Figure: POSIX; the group labels 0, 1, 2, 3 and 4 each appear once.] [Figure: Our approach; the labels 0, 1 and 3 appear once, label 2 three times and label 4 four times.]
  7. Benchmarks. Matching ((a+b)+c)+ against (a^200 bc)^2000. Tool: time. JParsec: 4,498; java.util.regex: 1,992; Ours: 5,332. Extracting all class names from our project with the complex regular expression (.*?([a-z]+.)*([A-Z][a-zA-Z]*))*.*? Tool: time. java.util.regex: 11,319; Ours: 8,047.
  8. Benchmarks: optimizations of the algorithm. Typically most time is spent in long repetitions; we optimize for that case by (a) lazily compiling a deterministic FA, (b) not re-creating a state if a similar state has already been seen, and (c) using a compressed representation inside static repetitions.
  9. Benchmarks: NFA interpretation. Example: parse (a?(a)b)+ over "aabab" (character indices 0 to 4). [Figure: parse tree over a a b a b with group labels 1, 2, 1, 2.]
  10. Benchmarks: NFA interpretation. Example: (a?(a)b)+ reading "aabab". [Figure: NFA with states q1 to q9. Thread histories shown: [[], [], [], []].]
  11. Benchmarks: NFA interpretation. Example: (a?(a)b)+ reading "aabab". [Figure: NFA with states q1 to q9. Thread histories shown: [[], [], [], []]; [[0], [], [], []].]
  12. Benchmarks: NFA interpretation. Example: (a?(a)b)+ reading "aabab". [Figure: NFA with states q1 to q9. Thread histories shown: [[], [], [], []]; [[0], [], [], []]; [[0], [], [], []].]
  13. Benchmarks: NFA interpretation. Example: (a?(a)b)+ reading "aabab". [Figure: NFA with states q1 to q9. Thread histories shown: [[], [], [], []]; [[0], [], [], []]; [[0], [], [], []]; [[0], [], [0], []].]
  14. Benchmarks: NFA interpretation. Threads. A thread consists of a state q and histories h1 to h6. When a thread forks, the copy is modified. Copying the array of histories makes reading a character O(m^2); a faster persistent data structure is needed to get O(m log m).
  15. Benchmarks: NFA interpretation. Optimized thread forking. Set entry 2 to 20. [Figure: tree with nodes 1 to 13 before the update.]
  16. Benchmarks: NFA interpretation. Optimized thread forking. Set entry 2 to 20. [Figure: the same tree after the update; a node 20 and a second node 1 appear alongside the original 13 nodes.]
  17. Benchmarks: NFA interpretation. Example: (a?(a)b)+ reading "aabab". [Figure: NFA with states q1 to q9. Thread histories shown: [[], [], [], []]; [[0], [], [], []]; [[0], [], [], []]; [[0], [], [0], []].] For each character read, threads start hungry and must eat immediately.
  18. Benchmarks: NFA interpretation. Example: (a?(a)b)+ reading "aabab". [Figure: NFA with states q1 to q9. Thread histories shown: [[], [], [], []]; [[0], [], [], []]; [[0], [], [], []]; [[0], [], [0], []].] For each character read, threads start hungry and must eat immediately. Only a hungry thread can eat.
  19. Benchmarks: NFA interpretation. Example: (a?(a)b)+ reading "aabab". [Figure: NFA with states q1 to q9. Thread histories shown: [[], [], [], []]; [[0], [], [], []]; [[0], [], [], []]; [[0], [], [0], []]; [[0], [], [0], [0]].] For each character read, threads start hungry and must eat immediately. Only a hungry thread can eat.
  20. Benchmarks: NFA interpretation. Example: (a?(a)b)+ reading "aabab". [Figure: NFA with states q1 to q9. Thread histories shown: [[], [], [], []]; [[0], [], [], []]; [[0], [], [0], []]; [[0], [], [0], [0]].] For each character read, threads start hungry and must eat immediately. Only a hungry thread can eat.
  21. Benchmarks: NFA interpretation. Example: (a?(a)b)+ reading "aabab". [Figure: NFA with states q1 to q9. Thread histories shown: [[0], [], [0], []]; [[0], [], [1], []]; [[0], [], [], []]; [[0], [], [0], [0]].] For each character read, threads start hungry and must eat immediately. Only a hungry thread can eat.
  22. Benchmarks: NFA interpretation. Example: (a?(a)b)+ reading "aabab". [Figure: NFA with states q1 to q9. Thread histories shown: [[0], [], [1], []]; [[0], [], [0], [0]].] For each character read, threads start hungry and must eat immediately. Only a hungry thread can eat.
  23. Benchmarks: NFA interpretation. Example: (a?(a)b)+ reading "aabab". [Figure: NFA with states q1 to q9. Thread histories shown: [[0], [], [1], []]; [[0], [], [0], [0]].] For each character read, threads start hungry and must eat immediately. Only a hungry thread can eat.
  24. Benchmarks: NFA interpretation. Example: (a?(a)b)+ reading "aabab". [Figure: NFA with states q1 to q9. Thread histories shown: [[0], [], [1], []]; [[0], [], [1], [1]].] For each character read, threads start hungry and must eat immediately. Only a hungry thread can eat.
  25. Benchmarks: NFA interpretation. Example: (a?(a)b)+ reading "aabab". [Figure: NFA with states q1 to q9. Thread histories shown: [[0], [], [1], [1]].] For each character read, threads start hungry and must eat immediately. Only a hungry thread can eat.
  26. Benchmarks: NFA interpretation. Example: (a?(a)b)+ reading "aabab". [Figure: NFA with states q1 to q9. Thread histories shown: [[0], [2], [1], [1]]; [[0,2], [2], [1], [1]]; [[0,2], [2], [1], [1]]; [[0,2], [2], [1,3], [1]]; [[0], [], [1], [1]]; [[0], [2], [1], [1]]; [[0], [2], [1], [1]].] For each character read, threads start hungry and must eat immediately. Only a hungry thread can eat.
  27. Benchmarks: NFA interpretation. Example: (a?(a)b)+ reading "aabab". [Figure: NFA with states q1 to q9. Thread histories shown: [[0], [2], [1], [1]]; [[0,2], [2], [1], [1]]; [[0,2], [2], [1], [1]]; [[0,2], [2], [1,3], [1]]; [[0], [], [1], [1]]; [[0], [2], [1], [1]]; [[0], [2], [1], [1]].] For each character read, threads start hungry and must eat immediately. Only a hungry thread can eat.
  28. Benchmarks: NFA interpretation. Example: (a?(a)b)+ reading "aabab". [Figure: NFA with states q1 to q9. Thread histories shown: [[0,2], [2], [1,3], [1]]; [[0,2], [2], [1,4], [1]]; [[0,2], [2], [1], [1]]; [[0,2], [2], [1,3], [1,3]].] For each character read, threads start hungry and must eat immediately. Only a hungry thread can eat.
  29. Benchmarks: NFA interpretation. Example: (a?(a)b)+ reading "aabab". [Figure: NFA with states q1 to q9. Thread histories shown: [[0,2], [2], [1,3], [1]]; [[0,2], [2], [1,4], [1]]; [[0,2], [2], [1], [1]]; [[0,2], [2], [1,3], [1,3]].] For each character read, threads start hungry and must eat immediately. Only a hungry thread can eat.
  30. Benchmarks: NFA interpretation. Example: (a?(a)b)+ reading "aabab". [Figure: NFA with states q1 to q9. Thread histories shown: [[0,2], [2,4], [1,3], [1,3]]; [[0,2,5], [2,4], [1,3], [1,3]]; [[0,2,5], [2,4], [1,3], [1,3]]; [[0,2,5], [2,4,5], [1,3], [1,3]]; [[0,2], [2], [1,3], [1,3]]; [[0,2], [2,4], [1,3], [1,3]]; [[0,2], [2,4], [1,3], [1,3]].] For each character read, threads start hungry and must eat immediately. Only a hungry thread can eat.
  31. Benchmarks: NFA interpretation. Example: (a?(a)b)+ reading "aabab". The surviving thread is in state q9 with histories [[0,2], [2,4], [1,3], [1,3]]. [Figure: parse tree over a a b a b with group labels 1, 2, 1, 2.]
  32. Download: https://github.com/nes1983/tree-regex
  33. NFA construction. [Figure: automaton fragments for Optional S?, Alternation S1|S2, Capture group (S), and Star operation S*?.]
  34. Backtracking's nightmare. Matching (a+a+)+b against a^n b will backtrack Θ(2^n) times.
  35. Backtracking's nightmare. Extract the first cell in a CSV that starts with "P" [1]: ^(.*?,)+(P.*?), failing against "1,2,3,4,5,6,7,8,9,10,11,12,13" is exponential. [1] From http://www.regular-expressions.info/catastrophic.html
  36. Thread execution order matters. Example: .*(a?). [Figure: NFA with start state q1 and states q2 to q5; transitions labelled a, any, τ1↑ and τ1↓.]
  37. Priority matters. Example: (a)|(a). [Figure: NFA with start state q1 and states q2 to q6; two transitions labelled a, and transitions labelled τ1↑, τ1↓, τ2↑ and τ2↓.]
  38. Optimization pipeline. (1) Convert to a nondeterministic FA. (2) Interpret the nondeterministic FA, building a deterministic FA lazily. (3) Find similar/mappable states to avoid creating an infinite DFA. (4) Run on the DFA if possible. (5) Compact the DFA if no new states had to be created for a while.
  39. NFA interpretation.
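
Slides 4 and 5 describe greedy matching and the "one match per group" limitation. The short Python sketch below reproduces both with Python's `re` module, which is a backtracking engine and not the authors' engine or a POSIX one. Python numbers capture groups from 1, whereas slide 4 counts from 0. The input string `abaabcabc` is an illustrative assumption, not taken from the slides.

```python
import re

# Slide 4: the first group greedily takes every "a", so the second stays empty.
greedy = re.fullmatch(r"(a+)(a?)", "aaa")
print(greedy.groups())  # ('aaa', '')

# A group inside a repetition is overwritten on each iteration, so only the
# last match of every group survives. The inner group (a+b) matches "ab",
# "aab" and "ab" in turn, but only the final "ab" is reported.
nested = re.fullmatch(r"((a+b)+c)+", "abaabcabc")
print(nested.group(1), nested.group(2))  # abc ab
```

Recovering every iteration of such a group, rather than only the last one, is what the parse-tree approach in the slides is about.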
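
The walk-through on slides 9 to 31 can be approximated with a small thread-based NFA simulation. The sketch below is a simplified illustration, not the authors' implementation: `PROGRAM` is a hypothetical hand-compiled instruction list for `(a?(a)b)+`, the function names are made up, and the index convention (opening positions are the index of the next character, closing positions the index of the last character read) is an assumption that need not agree with every figure on the slides. Each thread carries four histories: group 1 open, group 1 close, group 2 open, group 2 close.

```python
CHAR, SPLIT, SAVE, MATCH = "char", "split", "save", "match"

# Hypothetical hand-compiled program for (a?(a)b)+.
PROGRAM = [
    (SAVE, 0),      # 0: open group 1
    (SPLIT, 2, 3),  # 1: a? (prefer consuming the optional "a")
    (CHAR, "a"),    # 2
    (SAVE, 2),      # 3: open group 2
    (CHAR, "a"),    # 4
    (SAVE, 3),      # 5: close group 2
    (CHAR, "b"),    # 6
    (SAVE, 1),      # 7: close group 1
    (SPLIT, 0, 9),  # 8: + (prefer another iteration)
    (MATCH,),       # 9
]
CLOSING_SLOTS = {1, 3}


def add_thread(threads, seen, pc, hist, pos):
    """Follow non-consuming transitions from pc in priority order."""
    if pc in seen:  # a higher-priority thread already reached this state
        return
    seen.add(pc)
    op = PROGRAM[pc]
    if op[0] == SPLIT:
        add_thread(threads, seen, op[1], hist, pos)
        add_thread(threads, seen, op[2], hist, pos)
    elif op[0] == SAVE:
        slot = op[1]
        value = pos - 1 if slot in CLOSING_SLOTS else pos
        # Forking copies the whole tuple of histories (the naive copy).
        forked = hist[:slot] + (hist[slot] + (value,),) + hist[slot + 1:]
        add_thread(threads, seen, pc + 1, forked, pos)
    else:  # CHAR or MATCH: the thread waits here for the next character
        threads.append((pc, hist))


def parse(text):
    """Return the histories of the best full match, or None."""
    threads = []
    add_thread(threads, set(), 0, ((), (), (), ()), 0)
    for pos, ch in enumerate(text):
        next_threads, seen = [], set()
        for pc, hist in threads:
            op = PROGRAM[pc]
            if op[0] == CHAR and op[1] == ch:
                add_thread(next_threads, seen, pc + 1, hist, pos + 1)
        threads = next_threads
    for pc, hist in threads:
        if PROGRAM[pc][0] == MATCH:
            return [list(h) for h in hist]
    return None


print(parse("aabab"))  # [[0, 3], [2, 4], [1, 3], [1, 3]]
```

Under this convention group 1 spans characters 0 to 2 and 3 to 4, and group 2 matches the characters at indices 1 and 3. Because every thread keeps its full history rather than a single slot per group, both iterations of each group are available at the end of one pass.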
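
Slides 14 to 16 ask for a persistent data structure so that forking a thread does not copy the whole array of histories. One common way to get this is path copying in a balanced tree: an update re-creates only the nodes on the path from the root to the changed entry and shares everything else. The sketch below assumes a leaf-based binary tree and hypothetical names (`Node`, `build`, `get`, `set_entry`); whether the thesis uses this exact layout is an assumption.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class Node:
    left: Optional["Node"]
    right: Optional["Node"]
    value: Any = None  # only leaves carry a value
    size: int = 1      # number of leaves below this node


def build(values):
    """Build a balanced tree over a non-empty list of values."""
    if len(values) == 1:
        return Node(None, None, values[0], 1)
    mid = len(values) // 2
    left, right = build(values[:mid]), build(values[mid:])
    return Node(left, right, None, left.size + right.size)


def get(node, index):
    """Read entry `index` by walking down from the root."""
    while node.left is not None:
        if index < node.left.size:
            node = node.left
        else:
            index -= node.left.size
            node = node.right
    return node.value


def set_entry(node, index, value):
    """Return a new root; only the path to `index` is copied."""
    if node.left is None:
        return Node(None, None, value, 1)
    if index < node.left.size:
        return Node(set_entry(node.left, index, value), node.right, None, node.size)
    new_right = set_entry(node.right, index - node.left.size, value)
    return Node(node.left, new_right, None, node.size)


old = build(list(range(1, 14)))  # entries 1..13
new = set_entry(old, 1, 20)      # overwrite the second entry with 20
print(get(old, 1), get(new, 1))  # 2 20
print(new.right is old.right)    # True
```

The old version stays readable after the update, which is what a forked thread needs, and each update touches O(log m) nodes instead of copying all m entries.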
