# Regular Expression

Provides fundamental knowledge on regular expression

Published in: Technology

### Regular Expression

1. 1. Regular Expression Minh Hoang TO Portal Team
2. 2. Agenda <ul><li>Finite State Machine </li></ul><ul><li>Pattern Parser </li></ul><ul><li>Java Regex </li></ul><ul><li>Parsers in GateIn </li></ul><ul><li>Advanced Theory </li></ul>
3. 3. Finite State Machine
4. 4. State Diagram
5. 5. JIRA Issue Lifecycle
7. 7. Java Compilation Flow
8. 8. Finite State Machine - FSM <ul><li>Behavioral model describing the workflow of a system </li></ul>
9. 9. Finite State Machine - FSM <ul><li>Directed graph with labeled edges </li></ul>
10. 10. Pattern Parser
11. 11. Classic Problem <ul><li>A: a finite character set (alphabet). Ex: A = {a, b, c, d, ..., z} or A = {a, b, c, ..., z, public, class, extends, implements, while, if, ...} </li></ul><ul><li>Pattern P and input sequence INPUT, both made of elements of A </li></ul><ul><li>Ex: P = "a.*b" or P = "class.*extends.*"; INPUT = "aaabbbcc" or INPUT = a Java source file </li></ul><ul><li>The parser reads INPUT character by character and recognizes all subsequences matching pattern P </li></ul>
12. 12. Classic Problem - Samples <ul><li>Split a sequence of characters into an array of subsequences: String path = "/portal/en/classic/home"; String[] segments = path.split("/"); </li></ul><ul><li>Handle comment block encountered in a file </li></ul><ul><li>Override readLine() in BufferedReader </li></ul><ul><li>Extract data from REST response </li></ul><ul><li>Write an XML parser from scratch </li></ul>
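The split sample from the slide above can be tried directly. This is a minimal sketch; the class name `SplitDemo` and the helper `segments` are made up for illustration. It shows one detail that is easy to miss: because the path starts with the delimiter, `String.split` returns an empty first element.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class SplitDemo {
    /** Splits a path on "/" exactly as in the slide. */
    static String[] rawSegments(String path) {
        return path.split("/");
    }

    /** Same split, with empty segments filtered out (illustrative helper). */
    static List<String> segments(String path) {
        return Arrays.stream(path.split("/"))
                .filter(s -> !s.isEmpty())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String path = "/portal/en/classic/home";
        // The leading "/" produces an empty first element.
        System.out.println(Arrays.toString(rawSegments(path))); // [, portal, en, classic, home]
        System.out.println(segments(path));                     // [portal, en, classic, home]
    }
}
```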
13. 13. Finite State Machine & Classic Problem <ul><li>Acceptor FSM? </li></ul><ul><li>How can the Classic Problem be transformed into a graph traversal problem with a well-known generic solution? Finding pattern occurrences ↔ traversing a directed graph with labeled edges </li></ul>
14. 14. FSM – Word Accepting <ul><li>Consider a word W, a sequence of characters from character set A, e.g. W = "abcd...xyz". An FSM whose graph edges are labeled with characters from A accepts W if there exists a path connecting the START node to one of the END nodes: START = S1 -> S2 -> … -> Sn = END, where (1) intermediate nodes may be visited more than once, and (2) the transition S_i -> S_(i+1) is determined by the i-th character of W </li></ul>
15. 15. FSM – Word Accepting
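Word accepting can be sketched as a table-driven graph walk. The code below is an illustration, not taken from the slides: it assumes a deterministic FSM (at most one edge per character out of each node), and the class `WordAcceptor` and its sample machine are hypothetical.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class WordAcceptor {
    // transitions.get(state).get(c) is the node reached from `state` along the edge labeled c
    private final Map<Integer, Map<Character, Integer>> transitions = new HashMap<>();
    private final Set<Integer> endStates = new HashSet<>();
    private final int start;

    WordAcceptor(int start) {
        this.start = start;
    }

    WordAcceptor edge(int from, char label, int to) {
        transitions.computeIfAbsent(from, k -> new HashMap<>()).put(label, to);
        return this;
    }

    WordAcceptor end(int state) {
        endStates.add(state);
        return this;
    }

    /** Follows one edge per character; accepts if the walk stops on an END node. */
    boolean accepts(String word) {
        int state = start;
        for (int i = 0; i < word.length(); i++) {
            Map<Character, Integer> out = transitions.get(state);
            Integer next = (out == null) ? null : out.get(word.charAt(i));
            if (next == null) {
                return false; // no edge labeled with the i-th character
            }
            state = next;
        }
        return endStates.contains(state);
    }

    /** Hypothetical sample machine: 'a', then any number of 'b', then 'c'. */
    static WordAcceptor sample() {
        return new WordAcceptor(0)
                .edge(0, 'a', 1)
                .edge(1, 'b', 1) // loop: the same node may be visited repeatedly
                .edge(1, 'c', 2)
                .end(2);
    }
}
```

With this sample machine, `WordAcceptor.sample().accepts("abbbc")` is true, while `"ab"` is rejected because the walk ends on a node that is not an END node.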
16. 16. Acceptor FSM <ul><li>Consider a pattern P. An FSM is called the Acceptor FSM of P if it accepts exactly the words matching pattern P. Ex: the Acceptor FSM of "a[0-9]b" accepts every element of the word set {"a0b", "a1b", "a2b", "a3b", "a4b", "a5b", "a6b", "a7b", "a8b", "a9b"} </li></ul>
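The word set for "a[0-9]b" can be checked with `java.util.regex`, using the compiled `Pattern` as a ready-made acceptor. A small sketch; the class name `DigitAcceptorDemo` is an assumption for illustration.

```java
import java.util.regex.Pattern;

class DigitAcceptorDemo {
    private static final Pattern P = Pattern.compile("a[0-9]b");

    /** True only if the whole word matches the pattern. */
    static boolean accepts(String word) {
        return P.matcher(word).matches();
    }

    public static void main(String[] args) {
        for (char d = '0'; d <= '9'; d++) {
            String word = "a" + d + "b";
            System.out.println(word + " -> " + accepts(word));
        }
        System.out.println("axb -> " + accepts("axb"));
    }
}
```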
17. 17. How Does the Pattern Parser Work? Traverse the directed graph associated with the Acceptor FSM: 1. Start from the root node. 2. Read the next character from INPUT, then move according to the transition rules. 3. Repeat step 2 until a leaf node is reached or INPUT is exhausted. 4. Return OK if the leaf node denotes a successful match.
18. 18. Example One <ul><li>Recognize pattern </li></ul><ul><li> eXo.*er </li></ul><ul><li>in: AAA eXo123er BBB eXoer CCC eXoeXoer DDD </li></ul>
19. 19. Example One <ul><li>Acceptor FSM with 8 states: START (start reading input sequence); e (encountered e); eX (encountered eX); eXo (encountered eXo); eXo.* (encountered eXo.*); eXo.*e (encountered eXo.*e); END (subsequence matching eXo.*er found); FAILURE </li></ul>
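Example One can be sketched as a hand-written state machine. This is a simplified illustration rather than the slide's exact diagram: the class `ExoFsm` is hypothetical, the states eXo and eXo.* are merged into one, and END and FAILURE are folded into "record the match" and "go back to START".

```java
import java.util.ArrayList;
import java.util.List;

class ExoFsm {
    enum State { START, E, EX, EXO, EXO_E }

    /** Returns every subsequence that starts with "eXo" and ends at the next "er". */
    static List<String> findAll(String input) {
        List<String> matches = new ArrayList<>();
        State state = State.START;
        int matchStart = -1;
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            switch (state) {
                case START:
                    if (c == 'e') { state = State.E; matchStart = i; }
                    break;
                case E:
                    if (c == 'X') state = State.EX;
                    else if (c == 'e') matchStart = i;  // a new candidate starts here
                    else state = State.START;
                    break;
                case EX:
                    if (c == 'o') state = State.EXO;
                    else if (c == 'e') { state = State.E; matchStart = i; }
                    else state = State.START;
                    break;
                case EXO:                               // inside ".*", waiting for 'e'
                    if (c == 'e') state = State.EXO_E;
                    break;
                case EXO_E:                             // saw 'e', waiting for 'r'
                    if (c == 'r') {
                        matches.add(input.substring(matchStart, i + 1));
                        state = State.START;
                    } else if (c != 'e') {
                        state = State.EXO;
                    }
                    break;
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        System.out.println(findAll("AAA eXo123er BBB eXoer CCC eXoeXoer DDD"));
        // [eXo123er, eXoer, eXoeXoer]
    }
}
```

Because this machine stops at the first "er" after "eXo", it behaves like the reluctant regex `eXo.*?er`. With `java.util.regex`, the greedy form `eXo.*er` would instead return a single match running from the first "eXo" to the last "er" on the line.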
20. 21. Example Two <ul><li>Recognize comment block /* */ in: /* Don't ask */ final int innerClassVariable; </li></ul>
21. 22. Example Two <ul><li>Acceptor FSM with 5 states: START (start reading input sequence); OUT (outside any comment block); ENTERING (at the beginning of a comment block); IN (inside a comment block); LEAVING (at the end of a comment block) </li></ul>
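The comment-block machine can be sketched as a small filter that drops everything between `/*` and `*/`. The class `CommentFsm` is hypothetical, and START is merged into OUT for brevity. String and character literals are deliberately ignored, so this is not a full Java lexer.

```java
class CommentFsm {
    enum State { OUT, ENTERING, IN, LEAVING }

    /** Returns src with block comments removed. */
    static String stripComments(String src) {
        StringBuilder out = new StringBuilder();
        State state = State.OUT;
        for (int i = 0; i < src.length(); i++) {
            char c = src.charAt(i);
            switch (state) {
                case OUT:
                    if (c == '/') state = State.ENTERING;   // possible comment start
                    else out.append(c);
                    break;
                case ENTERING:
                    if (c == '*') {
                        state = State.IN;                   // "/*" confirmed
                    } else if (c == '/') {
                        out.append('/');                    // previous '/' was plain text
                    } else {
                        out.append('/').append(c);
                        state = State.OUT;
                    }
                    break;
                case IN:
                    if (c == '*') state = State.LEAVING;    // possible comment end
                    break;
                case LEAVING:
                    if (c == '/') state = State.OUT;        // "*/" confirmed
                    else if (c != '*') state = State.IN;
                    break;
            }
        }
        if (state == State.ENTERING) out.append('/');       // trailing '/' was plain text
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(stripComments("/* Don't ask */ final int innerClassVariable;"));
    }
}
```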
22. 24. Finite State Machine With Stack <ul><li>Example Two is slightly harder than Example One because the transition decision depends on past information -> we must keep something in memory </li></ul><ul><li>FSM with Stack = ordinary FSM + stack structure storing past information. A contextual transition is determined by the pair (next input character, stack state) </li></ul>
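As an illustration of a transition decided by the pair (next input character, stack state), here is a bracket-balance checker. It is an example chosen for this note, not one from the slides, and the class `StackFsm` is hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;

class StackFsm {
    /** Returns the opening bracket that pairs with the given closing bracket. */
    private static char opener(char closing) {
        switch (closing) {
            case ')': return '(';
            case ']': return '[';
            default:  return '{';
        }
    }

    /** Accepts the input if every bracket is closed in the right order. */
    static boolean isBalanced(String input) {
        Deque<Character> stack = new ArrayDeque<>();
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            switch (c) {
                case '(': case '[': case '{':
                    stack.push(c);                      // remember the past
                    break;
                case ')': case ']': case '}':
                    // the move depends on both the character and the stack top
                    if (stack.isEmpty() || stack.pop() != opener(c)) {
                        return false;
                    }
                    break;
                default:
                    break;                              // other characters do not change the stack
            }
        }
        return stack.isEmpty();
    }
}
```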
23. 25. Java Regex
24. 26. Model <ul><li>Pattern: Acceptor Finite State Machine </li></ul><ul><li>Matcher: Parser </li></ul>
25. 27. java.util.regex.Pattern <ul><li>Construct the FSM accepting a pattern: Pattern p = Pattern.compile("a.*b"); FSM states are instances of java.util.regex.Pattern\$Node </li></ul><ul><li>Generate a parser working on an input sequence: Matcher matcher = p.matcher("aaabbbb"); </li></ul>
26. 28. java.util.regex.Matcher <ul><li>Find the next subsequence matching the pattern: find() </li></ul><ul><li>Get capturing groups from the latest match: group() </li></ul>
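The Pattern/Matcher pair is typically used in a `find()` loop. A minimal sketch; the class `MatcherDemo` and the "text@offset" output format are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatcherDemo {
    /** Collects every match as "text@startIndex". */
    static List<String> findAll(String regex, String input) {
        Pattern p = Pattern.compile(regex);   // build the acceptor once
        Matcher m = p.matcher(input);         // parser bound to this input
        List<String> found = new ArrayList<>();
        while (m.find()) {                    // advance to the next match
            found.add(m.group() + "@" + m.start());
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(findAll("a.*b", "aaabbbb"));        // [aaabbbb@0]
        System.out.println(findAll("[0-9]+", "a1bb22ccc333")); // [1@1, 22@4, 333@9]
    }
}
```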
27. 29. Capturing Group <ul><li>Two Pattern objects: Pattern p = Pattern.compile("abcd.*efgh"); Pattern q = Pattern.compile("abcd(.*)efgh"); String text = "abcd12345efgh"; Matcher pM = p.matcher(text); Matcher qM = q.matcher(text); </li></ul><ul><li>pM.find() == qM.find(); (both return true) </li></ul><ul><li>qM.group(1) returns "12345", whereas pM.group(1) throws IndexOutOfBoundsException because p has no capturing group </li></ul>
28. 30. Capturing Group <ul><li>Hold additional information on each match: while(matcher.find()) { matcher.group(index); } </li></ul><ul><li>Pattern P = (A)(B(C)): matcher.group(0) = the whole sequence ABC; matcher.group(1) = A; matcher.group(2) = BC; matcher.group(3) = C </li></ul>
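Group numbering follows the order of opening parentheses, which can be checked with a short sketch. The class `GroupDemo` and its helper are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupDemo {
    /** Returns group(0)..group(groupCount()) of the first match, or an empty list. */
    static List<String> groups(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        List<String> result = new ArrayList<>();
        if (m.find()) {
            for (int i = 0; i <= m.groupCount(); i++) {
                result.add(m.group(i));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(groups("(A)(B(C))", "ABC"));               // [ABC, A, BC, C]
        System.out.println(groups("abcd(.*)efgh", "abcd12345efgh")); // [abcd12345efgh, 12345]
        System.out.println(groups("abcd.*efgh", "abcd12345efgh"));   // [abcd12345efgh]
    }
}
```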
29. 31. Capturing Group <ul><li>Pattern.compile("abc(defgh"); Pattern.compile("abcdef)gh"); -> PatternSyntaxException </li></ul><ul><li>Pattern.compile("abc\\(defgh"); Pattern.compile("abcdef\\)gh"); -> Success thanks to escape character '\' </li></ul>
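The effect of escaping can be verified by catching `PatternSyntaxException`. A small sketch with a hypothetical class `EscapeDemo`; `Pattern.quote` is shown as an alternative when the whole string should be treated literally.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class EscapeDemo {
    /** True if the regex compiles, false if it is syntactically invalid. */
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(compiles("abc(defgh"));    // false: unclosed group
        System.out.println(compiles("abcdef)gh"));    // false: unmatched ')'
        System.out.println(compiles("abc\\(defgh"));  // true: '(' is escaped
        System.out.println(compiles("abcdef\\)gh"));  // true: ')' is escaped
        // Pattern.quote escapes an entire literal in one go.
        System.out.println(Pattern.matches(Pattern.quote("abc(defgh"), "abc(defgh")); // true
    }
}
```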
30. 32. Operators <ul><li>Union: [a-zA-Z0-9] </li></ul><ul><li>Negation: [^abc], [^X] </li></ul>
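The character-class operators can be tried with `Pattern.matches`. A minimal sketch; the class `CharClassDemo` is hypothetical, and the nested form `[a-d[m-p]]` is the explicit union syntax that `java.util.regex` also supports.

```java
import java.util.regex.Pattern;

class CharClassDemo {
    /** True only if the whole input matches the regex. */
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(matches("[a-zA-Z0-9]+", "Portal42")); // true: letters and digits only
        System.out.println(matches("[a-zA-Z0-9]+", "a-b"));      // false: '-' is not in the class
        System.out.println(matches("[^abc]", "d"));              // true: anything except a, b, c
        System.out.println(matches("[^abc]", "a"));              // false
        System.out.println(matches("[a-d[m-p]]", "n"));          // true: explicit union of two ranges
        System.out.println(matches("[a-d[m-p]]", "e"));          // false
    }
}
```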
31. 33. Contextual Match <ul><li>X(?=Y) </li></ul><ul><li>Positive lookahead: match X only if Y follows </li></ul><ul><li>X(?!Y) </li></ul><ul><li>Negative lookahead: match X only if Y does not follow </li></ul><ul><li>(?<=Y)X </li></ul><ul><li>Positive lookbehind: match X only if Y precedes it </li></ul><ul><li>(?<!Y)X </li></ul><ul><li>Negative lookbehind: match X only if Y does not precede it </li></ul>
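The four lookaround forms can be compared by listing where each one matches. A sketch with made-up sample strings and a hypothetical class `LookaroundDemo`. Note that the lookaround text itself is never part of the match.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaroundDemo {
    /** Returns the start index of every match of regex in input. */
    static List<Integer> starts(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        List<Integer> result = new ArrayList<>();
        while (m.find()) {
            result.add(m.start());
        }
        return result;
    }

    public static void main(String[] args) {
        // "foo" occurs at index 0 (before "bar") and at index 7 (before "baz")
        System.out.println(starts("foo(?=bar)", "foobar foobaz"));   // [0]
        System.out.println(starts("foo(?!bar)", "foobar foobaz"));   // [7]
        // "happy" occurs at index 2 (after "un") and at index 8
        System.out.println(starts("(?<=un)happy", "unhappy happy")); // [2]
        System.out.println(starts("(?<!un)happy", "unhappy happy")); // [8]
    }
}
```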
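32. 34. Tips <ul><li>A Pattern is immutable and thread-safe -> maximize reuse (a Matcher, by contrast, holds match state and should not be shared between threads). We often see: static final Pattern p = Pattern.compile("a*b"); </li></ul><ul><li>Be careful with String.split: its argument is a regex, so in hot code compare String.split against a plain Java loop with String.charAt </li></ul>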
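Both tips can be sketched side by side: a precompiled `Pattern` reused for splitting, and a plain loop over `String.charAt`. The class `PatternReuse` is hypothetical and no performance figures are claimed here; measure in your own context.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class PatternReuse {
    // Compiled once and shared; Pattern instances are safe to reuse.
    private static final Pattern SLASH = Pattern.compile("/");

    /** Regex-based split using the shared Pattern. */
    static String[] splitWithPattern(String path) {
        return SLASH.split(path);
    }

    /** Manual split with a loop; unlike Pattern.split it keeps trailing empty segments. */
    static List<String> splitWithLoop(String path) {
        List<String> out = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < path.length(); i++) {
            if (path.charAt(i) == '/') {
                out.add(path.substring(start, i));
                start = i + 1;
            }
        }
        out.add(path.substring(start));
        return out;
    }
}
```

For "/portal/en/classic/home" both methods return the same five segments, the first one being empty.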
33. 35. Parsers in GateIn
34. 36. Parsers in GateIn <ul><li>JavaScript Compressor </li></ul><ul><li>CSS Compressor </li></ul><ul><li>Groovy Template Optimizer </li></ul><ul><li>Navigation Controller: extracting URL params = regex matching + backtracking algorithm </li></ul><ul><li>StaxNavigator (nice XML parser based on StAX) </li></ul>