Regular expression made by To Minh Hoang - Portal team


This is a presentation from eXo Platform SEA.

Published in: Technology

  1. 1. Regular Expressions Minh Hoang TO Portal Team
  2. 2. Agenda <ul><li>Finite State Machine </li></ul><ul><li>Pattern Parser </li></ul><ul><li>Java Regex </li></ul><ul><li>Parsers in GateIn </li></ul><ul><li>Advanced Theory </li></ul>
  3. 3. Finite State Machine
  4. 4. State Diagram
  5. 5. JIRA Issue Lifecycle
  6. 6. Java Thread Lifecycle
  7. 7. Java Compilation Flow
  8. 8. Finite State Machine - FSM <ul><li>Behavioral model describing the workflow of a system </li></ul>
  9. 9. Finite State Machine - FSM <ul><li>Directed graph with labeled edges </li></ul>
  10. 10. Pattern Parser
  11. 11. Classic Problem <ul><li>A: a finite set of symbols. Ex: A = {a, b, c, d,..., z} or A = {a, b, c,..., z, public, class, extends, implements, while, if,...} </li></ul><ul><li>A pattern P and an input sequence INPUT, both made of elements of A </li></ul><ul><li>Ex: P = “a.*b” or P = “class.*extends.*”; INPUT = “aaabbbcc” or INPUT = a Java source file </li></ul><ul><li>-> The parser reads INPUT character by character and recognizes all subsequences matching pattern P </li></ul>
  12. 12. Classic Problem - Samples <ul><li>Split a sequence of characters into an array of subsequences: String path = "/portal/en/classic/home"; String[] segments = path.split("/"); </li></ul><ul><li>Handle a comment block encountered in a file </li></ul><ul><li>Override readLine() in BufferedReader </li></ul><ul><li>Extract data from a REST response </li></ul><ul><li>Write an XML parser from scratch </li></ul>
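The split sample above can be tried directly. This is a minimal sketch; the class name `PathSplitDemo` and the helper `segments` are hypothetical names introduced only for illustration. It shows one detail that is easy to miss: because the path starts with the delimiter, the first element of the result is an empty string.

```java
import java.util.Arrays;

class PathSplitDemo {
    // Splits a portal path on "/" exactly as in the slide's sample.
    static String[] segments(String path) {
        return path.split("/");
    }

    public static void main(String[] args) {
        // The leading "/" produces an empty first element.
        System.out.println(Arrays.toString(segments("/portal/en/classic/home")));
    }
}
```

Callers that only want the non-empty segments would need to skip the first element themselves.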
  13. 13. Finite State Machine & Classic Problem <ul><li>Acceptor FSM? </li></ul><ul><li>How to transform the Classic Problem into a graph traversal problem with a well-known generic solution? Finding pattern occurrences ↔ traversing a directed graph with labeled edges </li></ul>
  14. 14. FSM – Word Accepting <ul><li>Consider a word W, a sequence of characters from character set A: W = “”. An FSM whose graph edges are labeled with characters from A accepts W if there exists a path connecting the START node to one of the END nodes: START = S1 -> S2 -> … -> Sn = END, where 1. intermediate nodes may repeat along the path, and 2. the transition S_i -> S_(i+1) is determined (labeled) by the i-th character of W </li></ul>
  15. 15. Acceptor FSM <ul><li>Given a pattern P, an FSM is called an Acceptor FSM if it accepts exactly the words matching pattern P. Ex: The Acceptor FSM of “a[0-9]b” accepts every element of the word set { “a0b”, “a1b”, “a2b”, “a3b”, “a4b”, “a5b”, “a6b”, “a7b”, “a8b”, “a9b”} </li></ul>
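As an illustration of the definition above, here is a hand-written acceptor for “a[0-9]b”. It is only a sketch: the class `DigitAcceptor`, its state names and the `step`/`accepts` helpers are assumptions made for this example, not something taken from the slides.

```java
class DigitAcceptor {
    // Hypothetical states of an acceptor for the pattern a[0-9]b.
    enum State { START, SEEN_A, SEEN_DIGIT, END, FAILURE }

    // One labeled edge of the graph: (current state, character) -> next state.
    static State step(State s, char c) {
        switch (s) {
            case START:
                return c == 'a' ? State.SEEN_A : State.FAILURE;
            case SEEN_A:
                return (c >= '0' && c <= '9') ? State.SEEN_DIGIT : State.FAILURE;
            case SEEN_DIGIT:
                return c == 'b' ? State.END : State.FAILURE;
            default:
                // END and FAILURE have no outgoing edges in this sketch.
                return State.FAILURE;
        }
    }

    // The word is accepted if the path ends in END after the last character.
    static boolean accepts(String word) {
        State s = State.START;
        for (int i = 0; i < word.length(); i++) {
            s = step(s, word.charAt(i));
        }
        return s == State.END;
    }

    public static void main(String[] args) {
        System.out.println(accepts("a5b"));
        System.out.println(accepts("a5bb"));
    }
}
```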
  16. 16. How Does a Pattern Parser Work? Traverse the directed graph associated with the Acceptor FSM: 1. Start from the root node 2. Read the next character from INPUT, then move according to the transition rules 3. Repeat step 2 until a leaf node is reached or INPUT is exhausted 4. Return OK if the leaf node denotes a successful match.
  17. 17. Example One <ul><li>Recognize pattern </li></ul><ul><li> eXo.*er </li></ul><ul><li>in: AAA eXo123er BBB eXoer CCC eXoeXoer DDD </li></ul>
  18. 18. Example One <ul><li>Acceptor FSM with 8 states: START – start reading input sequence; e – encountered e; eX – encountered eX; eXo – encountered eXo; eXo.* – encountered eXo.*; eXo.*e – encountered eXo.*e; END – subsequence matching eXo.*er found; FAILURE </li></ul>
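The state machine of Example One can be sketched as a character-by-character scanner. The code below is an assumption-laden illustration, not the author's implementation: `ExoScanner` and its state names are made up, the slide's “eXo” and “eXo.*” states are merged into one, and END and FAILURE are handled inline rather than as separate states. It also stops at the first “er” after “eXo”, so it reports the shortest matches (what the reluctant regex `eXo.*?er` would find), and unlike `.` in a regex it does not treat line terminators specially.

```java
import java.util.ArrayList;
import java.util.List;

class ExoScanner {
    // START, E and EX follow the slide; EXO_ANY merges "eXo" and "eXo.*",
    // and EXO_ANY_E corresponds to "eXo.*e".
    enum State { START, E, EX, EXO_ANY, EXO_ANY_E }

    static List<String> scan(String input) {
        List<String> matches = new ArrayList<>();
        State state = State.START;
        int begin = 0; // index where the current candidate match starts
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            switch (state) {
                case START:
                    if (c == 'e') { begin = i; state = State.E; }
                    break;
                case E:
                    if (c == 'X') state = State.EX;
                    else if (c == 'e') begin = i;       // a new candidate starts here
                    else state = State.START;           // FAILURE: start over
                    break;
                case EX:
                    if (c == 'o') state = State.EXO_ANY;
                    else if (c == 'e') { begin = i; state = State.E; }
                    else state = State.START;           // FAILURE: start over
                    break;
                case EXO_ANY:
                    if (c == 'e') state = State.EXO_ANY_E;
                    break;
                case EXO_ANY_E:
                    if (c == 'r') {                     // END: record the match, reset
                        matches.add(input.substring(begin, i + 1));
                        state = State.START;
                    } else if (c != 'e') {
                        state = State.EXO_ANY;
                    }
                    break;
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        System.out.println(scan("AAA eXo123er BBB eXoer CCC eXoeXoer DDD"));
    }
}
```

On the slide's input this sketch finds three subsequences: eXo123er, eXoer and eXoeXoer.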
  19. 20. Example Two <ul><li>Recognize comment block /* */ in: /* Don't ask * / final int innerClassVariable; </li></ul>
  20. 21. Example Two <ul><li>Acceptor FSM with 5 states: START – start reading input sequence; OUT – outside any comment block; ENTERING – at the beginning of a comment block; IN – inside a comment block; LEAVING – at the end of a comment block </li></ul>
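A possible rendering of the comment-block machine in Java is sketched below. The class `CommentScanner` and the method `comments` are hypothetical; the slide's START state is folded into OUT, and the sketch ignores string literals and `//` line comments, which a real Java lexer would have to handle.

```java
import java.util.ArrayList;
import java.util.List;

class CommentScanner {
    // OUT, ENTERING, IN and LEAVING follow the slide; START coincides with OUT here.
    enum State { OUT, ENTERING, IN, LEAVING }

    // Returns the body of every /* ... */ block found in src.
    static List<String> comments(String src) {
        List<String> found = new ArrayList<>();
        StringBuilder body = new StringBuilder();
        State state = State.OUT;
        for (int i = 0; i < src.length(); i++) {
            char c = src.charAt(i);
            switch (state) {
                case OUT:
                    if (c == '/') state = State.ENTERING;
                    break;
                case ENTERING:
                    if (c == '*') { body.setLength(0); state = State.IN; }
                    else if (c != '/') state = State.OUT;   // a second '/' keeps us in ENTERING
                    break;
                case IN:
                    if (c == '*') state = State.LEAVING;    // maybe the start of "*/"
                    else body.append(c);
                    break;
                case LEAVING:
                    if (c == '/') { found.add(body.toString()); state = State.OUT; }
                    else if (c == '*') body.append('*');    // previous '*' was content
                    else { body.append('*').append(c); state = State.IN; }
                    break;
            }
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(comments("/* Don't ask */ final int innerClassVariable;"));
    }
}
```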
  21. 23. Finite State Machine With Stack <ul><li>Example Two is slightly harder than Example One because the transition decision depends on past information -> we must keep something in memory </li></ul><ul><li>FSM with Stack = ordinary FSM + a stack structure storing past info. A contextual transition is determined by (next input character, stack state) </li></ul>
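To make the (next input character, stack state) idea concrete, here is a small sketch that accepts strings whose round and square brackets are properly nested. The example problem and the class name `BracketAcceptor` are assumptions chosen for illustration; the slides do not use this example.

```java
import java.util.ArrayDeque;
import java.util.Deque;

class BracketAcceptor {
    // Accepts input whose '(' ')' and '[' ']' are balanced and properly nested.
    // The decision for a closing bracket depends on the top of the stack.
    static boolean accepts(String input) {
        Deque<Character> stack = new ArrayDeque<>();
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c == '(' || c == '[') {
                stack.push(c);                       // remember the opening bracket
            } else if (c == ')' || c == ']') {
                if (stack.isEmpty()) return false;   // nothing to close
                char open = stack.pop();
                if ((c == ')' && open != '(') || (c == ']' && open != '[')) {
                    return false;                    // wrong kind of bracket
                }
            }
            // any other character leaves the stack untouched
        }
        return stack.isEmpty();                      // every bracket was closed
    }

    public static void main(String[] args) {
        System.out.println(accepts("a[(b)c]"));
        System.out.println(accepts("([)]"));
    }
}
```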
  22. 24. Java Regex
  23. 25. Model <ul><li>Pattern: Acceptor Finite State Machine </li></ul><ul><li>Matcher: Parser </li></ul>
  24. 26. java.util.regex.Pattern <ul><li>Construct the FSM accepting the pattern: Pattern p = Pattern.compile("a.*b"); FSM states are instances of java.util.regex.Pattern$Node </li></ul><ul><li>Generate a parser working on the input sequence: Matcher matcher = p.matcher("aaabbbb"); </li></ul>
  25. 27. java.util.regex.Matcher <ul><li>Find the next subsequence matching the pattern: find() </li></ul><ul><li>Get capturing groups from the latest match: group() </li></ul>
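The Pattern/Matcher pair is typically used in a find loop. The sketch below uses only `Pattern.compile`, `Pattern.matcher`, `Matcher.find` and `Matcher.group`; the wrapper class `FindAllDemo` and its `findAll` helper are hypothetical names added for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FindAllDemo {
    // Collects every subsequence of input that matches regex, in order.
    static List<String> findAll(String regex, String input) {
        Pattern pattern = Pattern.compile(regex);   // builds the acceptor
        Matcher matcher = pattern.matcher(input);   // the parser over this input
        List<String> out = new ArrayList<>();
        while (matcher.find()) {                    // advance to the next match
            out.add(matcher.group());               // text of the latest match
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(findAll("a.*b", "aaabbbb"));
        System.out.println(findAll("[0-9]+", "a1b22c333"));
    }
}
```

Because `.*` is greedy, the first call yields a single match covering the whole input.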
  26. 28. Capturing Group <ul><li>Two Pattern objects: Pattern p = Pattern.compile("abcd.*efgh"); Pattern q = Pattern.compile("abcd(.*)efgh"); String text = "abcd12345efgh"; Matcher pM = p.matcher(text); Matcher qM = q.matcher(text); </li></ul><ul><li>pM.find() == qM.find(); </li></ul><ul><li>pM.groupCount() != qM.groupCount(); </li></ul>
  27. 29. Capturing Group <ul><li>Hold additional information on each match: while (matcher.find()) { matcher.group(); } </li></ul><ul><li>Pattern P = (A)(B(C)): group(0) = the whole sequence ABC, group(1) = A, group(2) = BC, group(3) = C </li></ul>
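Group numbering can be checked with a few lines of code. In this sketch the class `GroupDemo` and the helper `groups` are illustrative assumptions; `Matcher.groupCount()` and `Matcher.group(int)` are the standard API. Groups are numbered by the position of their opening parenthesis, and group 0 is always the whole match.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupDemo {
    // Returns group(0)..group(groupCount()) of the first match, or an empty array.
    static String[] groups(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.find()) {
            return new String[0];
        }
        String[] result = new String[m.groupCount() + 1];
        for (int i = 0; i <= m.groupCount(); i++) {
            result[i] = m.group(i);   // index 0 is the whole match
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(String.join(" | ", groups("(A)(B(C))", "ABC")));
        System.out.println(String.join(" | ", groups("abcd(.*)efgh", "abcd12345efgh")));
    }
}
```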
  28. 30. Capturing Group <ul><li>Pattern.compile("abc(defgh"); Pattern.compile("abcdef)gh"); -> PatternSyntaxException </li></ul><ul><li>Pattern.compile("abc\\(defgh"); Pattern.compile("abcdef\\)gh"); -> Success thanks to the escape character '\' </li></ul>
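The effect of escaping a parenthesis can be verified with a small check. `EscapeDemo` and `compiles` are hypothetical names used for this sketch; `PatternSyntaxException` is the unchecked exception thrown by `Pattern.compile` for malformed expressions. Note that the backslash is doubled in Java source because it must also be escaped inside a string literal.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class EscapeDemo {
    // Reports whether regex is a syntactically valid pattern.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(compiles("abc(defgh"));     // unbalanced group
        System.out.println(compiles("abc\\(defgh"));   // literal '('
    }
}
```

When the whole string should be taken literally, `Pattern.quote` is an alternative to escaping each special character by hand.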
  29. 31. Operators <ul><li>Union [a-zA-Z0-9] </li></ul><ul><li>Negation [^abc] [^X] </li></ul>
  30. 32. Contextual Match <ul><li>X(?=Y) </li></ul><ul><li>Once X matches, look ahead and expect to find Y </li></ul><ul><li>X(?!Y) </li></ul><ul><li>Once X matches, look ahead and expect not to find Y </li></ul><ul><li>X(?<=Y) </li></ul><ul><li>Once X matches, look behind and expect to find Y </li></ul><ul><li>X(?<!Y) </li></ul><ul><li>Once X matches, look behind and expect not to find Y </li></ul>
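The four contextual constructs can be compared on one input. The sketch below is illustrative only: `LookaroundDemo`, `starts` and the sample input are assumptions. In Java the lookahead forms are written `(?=Y)` and `(?!Y)`, and the lookbehind forms `(?<=Y)` and `(?<!Y)`; a lookbehind placed after X tests the text that ends at the current position, so Y here includes X itself.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaroundDemo {
    // Returns the start index of every match of regex in input.
    static List<Integer> starts(String regex, String input) {
        List<Integer> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.start());
        }
        return out;
    }

    public static void main(String[] args) {
        String input = "foobar foobaz xbar";
        System.out.println(starts("foo(?=bar)", input));      // "foo" followed by "bar"
        System.out.println(starts("foo(?!bar)", input));      // "foo" not followed by "bar"
        System.out.println(starts("bar(?<=foobar)", input));  // "bar" that ends a "foobar"
        System.out.println(starts("bar(?<!foobar)", input));  // "bar" that does not
    }
}
```

Lookarounds are zero-width: they constrain where X may match but are not part of the matched text.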
  31. 33. Tips <ul><li>A compiled Pattern is immutable and safe to share between threads -> maximize reuse. A common idiom: static final Pattern p = Pattern.compile("a*b"); </li></ul><ul><li>Be careful with String.split: compare String.split with a plain Java loop + String.charAt </li></ul>
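Both tips can be sketched side by side. Everything named here (`SplitTips`, `COMMA_WS`, `splitRegex`, `splitChar`) is hypothetical. The first method reuses one compiled Pattern instead of passing a regex string to `String.split` on every call; the second avoids the regex engine altogether for a single-character delimiter. One behavioral difference to keep in mind: the hand-written loop keeps trailing empty segments, whereas `String.split` drops them by default.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class SplitTips {
    // Compiled once and shared; Pattern instances are immutable.
    private static final Pattern COMMA_WS = Pattern.compile("\\s*,\\s*");

    // Splits on commas surrounded by optional whitespace, reusing the Pattern.
    static String[] splitRegex(String s) {
        return COMMA_WS.split(s);
    }

    // Splits on a single character with a plain loop and String.charAt.
    static List<String> splitChar(String s, char delim) {
        List<String> out = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) == delim) {
                out.add(s.substring(start, i));
                start = i + 1;
            }
        }
        out.add(s.substring(start));   // last segment, possibly empty
        return out;
    }

    public static void main(String[] args) {
        System.out.println(String.join("|", splitRegex("a , b,c")));
        System.out.println(splitChar("/portal/en/classic/home", '/'));
    }
}
```

Whether the manual loop is actually faster depends on the JDK and the delimiter, so it is worth measuring before replacing `String.split` in hot code.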
  32. 34. Parsers in GateIn
  33. 35. Parsers in GateIn <ul><li>JavaScript Compressor </li></ul><ul><li>CSS Compressor </li></ul><ul><li>Groovy Template Optimizer </li></ul><ul><li>Navigation Controller: extracting URL params = regex matching + backtracking algorithm </li></ul><ul><li>StaxNavigator (nice XML parser based on StAX) </li></ul>
  34. 36. Advanced Theory
  35. 37. Grammar & Language <ul><li>Any word matching pattern eXo.*er is produced by a combination of transforms, starting from S: S -> eXoQer, Q -> RQ, Q -> '', R -> {a,b,c,d,...} </li></ul><ul><li>Language of a Grammar = the set of words generated by finite combinations of transforms, starting from S. Ex: Any valid Java source file is generated by a finite number of the transforms listed in the Java Grammar (JLS) </li></ul>
  36. 38. Finite State Machine & Language <ul><li>A language accepted by an FSM with Stack (a pushdown automaton) is always generated by some context-free grammar; the explicit steps to build such a grammar are the standard pushdown-automaton-to-grammar construction </li></ul><ul><li>Conversely, the language of a context-free grammar is accepted by some FSM with Stack; the explicit steps to build such a machine are the standard grammar-to-pushdown-automaton construction. Kleene's theorem is the analogous result for ordinary FSMs and regular expressions </li></ul>