Regular Expression Minh Hoang TO Portal Team
Agenda
  Finite State Machine
  Pattern Parser
  Java Regex
  Parsers in GateIn
  Advanced Theory
Finite State Machine
State Diagram
JIRA Issue Lifecycle
Java Thread Lifecycle
Java Compilation Flow
Finite State Machine - FSM
  Behavioral model describing the workflow of a system
Finite State Machine - FSM
  Directed graph with labeled edges
Pattern Parser
Classic Problem
  A: a finite set of characters
    Ex: A = {a, b, c, d, ..., z}
    or  A = {a, b, c, ..., z, public, class, extends, implements, while, if, ...}
  Pattern P and input sequence INPUT, both made of elements of A
    Ex: P = "a.*b" or P = "class.*extends.*"
        INPUT = "aaabbbcc" or INPUT = a Java source file
  -> The parser reads INPUT character by character and recognizes all subsequences matching pattern P
Classic Problem - Samples
  Split a sequence of characters into an array of subsequences
    String path = "/portal/en/classic/home";
    String[] segments = path.split("/");
  Handle a comment block encountered in a file
  Override readLine() in BufferedReader
  Extract data from a REST response
  Write an XML parser from scratch
Finite State Machine & Classic Problem
  Acceptor FSM?
  How can the Classic Problem be transformed into a graph traversal problem with a well-known generic solution?
  Finding pattern occurrences ↔ traversing a directed graph with labeled edges
FSM – Word Accepting
  Consider a word W, a sequence of characters from character set A
    W = "abcd...xyz"
  An FSM whose graph edges are labeled with characters from A accepts W if there exists a path connecting the START node to one of the END nodes
    START = S1 -> S2 -> … -> Sn = END
  1. Intermediate nodes may be visited more than once
  2. The transition S_i -> S_(i+1) is determined by the i-th character of W
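The word-accepting rule can be sketched in Java as a walk over a graph with labeled edges. This is a minimal sketch, assuming a deterministic FSM (at most one outgoing edge per node and character). The class GraphFsm, its methods and the node names S1, S2, S3 are hypothetical and not taken from the slides.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class GraphFsm {
    // node -> (edge label -> target node); hypothetical representation
    private final Map<String, Map<Character, String>> edges = new HashMap<>();
    private final Set<String> endNodes = new HashSet<>();

    GraphFsm edge(String from, char label, String to) {
        edges.computeIfAbsent(from, k -> new HashMap<>()).put(label, to);
        return this;
    }

    GraphFsm end(String node) {
        endNodes.add(node);
        return this;
    }

    // Follows one edge per character of the word; accepts if the walk stops on an END node.
    boolean accepts(String start, String word) {
        String node = start;
        for (int i = 0; i < word.length(); i++) {
            Map<Character, String> out = edges.get(node);
            node = (out == null) ? null : out.get(word.charAt(i));
            if (node == null) {
                return false; // no edge labeled with this character
            }
        }
        return endNodes.contains(node);
    }

    public static void main(String[] args) {
        // S2 loops on 'b', so the same intermediate node is visited several times.
        GraphFsm fsm = new GraphFsm()
            .edge("S1", 'a', "S2")
            .edge("S2", 'b', "S2")
            .edge("S2", 'c', "S3")
            .end("S3");
        System.out.println(fsm.accepts("S1", "abbc"));
    }
}
```

The loop edge on S2 illustrates rule 1: the path for "abbc" passes through S2 three times.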
FSM – Word Accepting
Acceptor FSM
  Given a pattern P, an FSM is called an Acceptor FSM of P if it accepts exactly the words matching pattern P.
  Ex: the Acceptor FSM of "a[0-9]b" accepts any element of the word set
    {"a0b", "a1b", "a2b", "a3b", "a4b", "a5b", "a6b", "a7b", "a8b", "a9b"}
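A hand-written acceptor for "a[0-9]b" might look like the following Java sketch. The state names and the class DigitAcceptor are assumptions made for illustration; java.util.regex builds its own internal node structure instead.

```java
final class DigitAcceptor {
    // Hypothetical states of an acceptor for the pattern a[0-9]b.
    enum State { START, A, A_DIGIT, END, FAILURE }

    // One transition: the next state depends only on the current state and the character read.
    static State next(State s, char c) {
        switch (s) {
            case START:   return c == 'a' ? State.A : State.FAILURE;
            case A:       return (c >= '0' && c <= '9') ? State.A_DIGIT : State.FAILURE;
            case A_DIGIT: return c == 'b' ? State.END : State.FAILURE;
            default:      return State.FAILURE; // END and FAILURE have no outgoing edges
        }
    }

    // A word is accepted when reading all of its characters ends in END.
    static boolean accepts(String word) {
        State s = State.START;
        for (int i = 0; i < word.length(); i++) {
            s = next(s, word.charAt(i));
        }
        return s == State.END;
    }

    public static void main(String[] args) {
        System.out.println(accepts("a7b")); // true
        System.out.println(accepts("ab"));  // false
    }
}
```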
How Does a Pattern Parser Work?
  Traverse the directed graph associated with the Acceptor FSM
  1. Start from the root node
  2. Read the next character from INPUT, then move according to the transition rules
  3. Repeat step 2 until a leaf node is reached or INPUT is exhausted
  4. Return OK if the leaf node denotes a successful match
Example One
  Recognize pattern eXo.*er in:
    AAA eXo123er BBB eXoer CCC eXoeXoer DDD
Example One
  Acceptor FSM with 8 states:
    START – start reading input sequence
    e – encountered e
    eX – encountered eX
    eXo – encountered eXo
    eXo.* – encountered eXo.*
    eXo.*e – encountered eXo.*e
    END – subsequence matching eXo.*er found
    FAILURE
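The states above can be turned into a small scanner. The Java sketch below makes two assumptions that the slide does not state: FAILURE is modeled as a fall back to START, and END is reached at the first "er" after "eXo", so each match is the shortest one. The class ExoScanner and its enum names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class ExoScanner {
    // START, e, eX, eXo, eXo.*, eXo.*e from the slide; END is handled inline when 'r' is read.
    enum State { START, E, EX, EXO, ANY, ANY_E }

    static List<String> findAll(String input) {
        List<String> matches = new ArrayList<>();
        State s = State.START;
        int begin = -1; // index where the current candidate match starts
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            switch (s) {
                case START:
                case E:
                case EX:
                    if (s == State.E && c == 'X') {
                        s = State.EX;
                    } else if (s == State.EX && c == 'o') {
                        s = State.EXO;
                    } else if (c == 'e') {
                        s = State.E;      // a new candidate starts here
                        begin = i;
                    } else {
                        s = State.START;  // FAILURE: start over
                    }
                    break;
                case EXO:
                case ANY:
                    s = (c == 'e') ? State.ANY_E : State.ANY;
                    break;
                case ANY_E:
                    if (c == 'r') {
                        matches.add(input.substring(begin, i + 1)); // END reached
                        s = State.START;
                    } else {
                        s = (c == 'e') ? State.ANY_E : State.ANY;
                    }
                    break;
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // prints [eXo123er, eXoer, eXoeXoer]
        System.out.println(findAll("AAA eXo123er BBB eXoer CCC eXoeXoer DDD"));
    }
}
```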
 
Example Two
  Recognize comment block /* */ in:
    /* Don't ask */ final int innerClassVariable;
Example Two
  Acceptor FSM with 5 states:
    START – start reading input sequence
    OUT – outside any comment block
    ENTERING – at the beginning of a comment block
    IN – inside a comment block
    LEAVING – at the end of a comment block
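A possible use of these five states is stripping comment blocks from a text. The Java sketch below is an illustration only: the class CommentStripper is hypothetical, and it deliberately ignores string literals and // line comments.

```java
final class CommentStripper {
    enum State { START, OUT, ENTERING, IN, LEAVING }

    // Returns the source with every /* ... */ block removed.
    static String strip(String source) {
        StringBuilder out = new StringBuilder();
        State s = State.START;
        for (int i = 0; i < source.length(); i++) {
            char c = source.charAt(i);
            switch (s) {
                case START:
                case OUT:
                    if (c == '/') {
                        s = State.ENTERING;          // maybe the start of "/*"
                    } else {
                        out.append(c);
                        s = State.OUT;
                    }
                    break;
                case ENTERING:
                    if (c == '*') {
                        s = State.IN;                // "/*" confirmed
                    } else if (c == '/') {
                        out.append('/');             // previous slash was ordinary; this one may still open a comment
                    } else {
                        out.append('/').append(c);   // false alarm, emit the pending slash
                        s = State.OUT;
                    }
                    break;
                case IN:
                    if (c == '*') {
                        s = State.LEAVING;           // maybe the start of "*/"
                    }
                    break;
                case LEAVING:
                    if (c == '/') {
                        s = State.OUT;               // "*/" confirmed
                    } else if (c != '*') {
                        s = State.IN;                // still inside the comment
                    }
                    break;
            }
        }
        if (s == State.ENTERING) {
            out.append('/');                         // trailing lone slash
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(strip("/* Don't ask */ final int innerClassVariable;"));
    }
}
```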
 
Finite State Machine With Stack
  Example Two is slightly harder than Example One because the transition decision depends on past information
  -> We must keep something in memory
  FSM with Stack = Ordinary FSM + Stack structure storing past information
  A contextual transition is determined by the pair (next input character, stack state)
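Bracket matching is a common illustration of a transition that depends on the pair (next input character, stack state). It is not an example from the slides; the class BracketChecker below is a hypothetical Java sketch of the idea.

```java
import java.util.ArrayDeque;
import java.util.Deque;

final class BracketChecker {
    // Accepts the input if every closing bracket matches the most recent unclosed opening bracket.
    static boolean balanced(String input) {
        Deque<Character> stack = new ArrayDeque<>();
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c == '(' || c == '[' || c == '{') {
                stack.push(c);                       // remember past information
            } else if (c == ')' || c == ']' || c == '}') {
                if (stack.isEmpty()) {
                    return false;                    // nothing to close
                }
                char open = stack.pop();
                boolean matches = (c == ')' && open == '(')
                        || (c == ']' && open == '[')
                        || (c == '}' && open == '{');
                if (!matches) {
                    return false;                    // decision depends on the top of the stack
                }
            }
            // any other character leaves the stack untouched
        }
        return stack.isEmpty();
    }

    public static void main(String[] args) {
        System.out.println(balanced("{[()]}")); // true
        System.out.println(balanced("([)]"));   // false
    }
}
```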
Java Regex
Model
  Pattern: Acceptor Finite State Machine
  Matcher: Parser
java.util.regex.Pattern
  Constructs the FSM accepting a pattern
    Pattern p = Pattern.compile("a.*b");
  FSM states are instances of java.util.regex.Pattern$Node
  Generates a parser working on an input sequence
    Matcher matcher = p.matcher("aaabbbb");
java.util.regex.Matcher
  Find the next subsequence matching the pattern
    find()
  Get capturing groups from the latest match
    group()
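The usual find()/group() loop can be sketched as follows. The helper class FindAllDemo is hypothetical; Pattern.compile, Matcher.find and Matcher.group are the standard java.util.regex calls.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FindAllDemo {
    // Collects every subsequence of the input that matches the regex, in order.
    static List<String> findAll(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        List<String> found = new ArrayList<>();
        while (m.find()) {
            found.add(m.group()); // group() with no argument is the whole match
        }
        return found;
    }

    public static void main(String[] args) {
        String input = "AAA eXo123er BBB eXoer CCC eXoeXoer DDD";
        // Greedy .* runs to the last "er": one long match.
        System.out.println(findAll("eXo.*er", input));
        // Reluctant .*? stops at the first "er": three short matches.
        System.out.println(findAll("eXo.*?er", input));
    }
}
```

Note that the greedy quantifier in eXo.*er yields a single match spanning from the first "eXo" to the last "er" of the Example One input, while the reluctant form eXo.*?er yields the three separate words.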
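Capturing Group
  Two Pattern objects
    Pattern p = Pattern.compile("abcd.*efgh");
    Pattern q = Pattern.compile("abcd(.*)efgh");
    String text = "abcd12345efgh";
    Matcher pM = p.matcher(text);
    Matcher qM = q.matcher(text);
  Both find the same match, but only q exposes a group 1
    pM.find() == qM.find();   // both true
    qM.group(1);              // "12345"
    pM.group(1);              // throws IndexOutOfBoundsException: p has no group 1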
Capturing Group
  Holds additional information on each match
    while (matcher.find()) {
      matcher.group(index);
    }
  Pattern P = (A)(B(C))
    matcher.group(0) = the whole sequence ABC
    matcher.group(1) = A
    matcher.group(2) = BC
    matcher.group(3) = C
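Group numbering follows the order of opening parentheses, which a short Java sketch can confirm. The helper class GroupDemo is hypothetical; groupCount() and group(int) are standard Matcher methods.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GroupDemo {
    // Returns group 0..groupCount() of the first match, or an empty array if nothing matches.
    static String[] groups(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.find()) {
            return new String[0];
        }
        String[] result = new String[m.groupCount() + 1];
        for (int i = 0; i <= m.groupCount(); i++) {
            result[i] = m.group(i);
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [ABC, A, BC, C]
        System.out.println(Arrays.toString(groups("(A)(B(C))", "ABC")));
        // prints [abcd12345efgh, 12345]
        System.out.println(Arrays.toString(groups("abcd(.*)efgh", "abcd12345efgh")));
    }
}
```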
Capturing Group
    Pattern.compile("abc(defgh");
    Pattern.compile("abcdef)gh");
  -> PatternSyntaxException
    Pattern.compile("abc\\(defgh");
    Pattern.compile("abcdef\\)gh");
  -> Success thanks to the escape character '\'
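The difference between escaped and unescaped parentheses can be checked with a small Java sketch. The helper class EscapeDemo is hypothetical; Pattern.quote is the standard way to turn a whole string into a literal pattern without escaping it by hand.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

final class EscapeDemo {
    // True if the regex compiles, false if it raises PatternSyntaxException.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(compiles("abc(defgh"));    // false: unclosed group
        System.out.println(compiles("abc\\(defgh"));  // true: literal parenthesis
        // Pattern.quote escapes the whole string at once.
        System.out.println(Pattern.matches(Pattern.quote("abc(defgh"), "abc(defgh")); // true
    }
}
```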
Operators
  Union
    [a-zA-Z0-9]
  Negation
    [^abc]
    [^X]
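A few character classes in action, as a hedged Java sketch. The class CharClassDemo and its constant names are hypothetical; the nested form [a-d[m-p]] is the union syntax documented for java.util.regex.Pattern.

```java
import java.util.regex.Pattern;

final class CharClassDemo {
    static final Pattern ALNUM = Pattern.compile("[a-zA-Z0-9]+"); // union of three ranges
    static final Pattern UNION = Pattern.compile("[a-d[m-p]]");   // a through d, or m through p
    static final Pattern NOT_ABC = Pattern.compile("[^abc]");     // any character except a, b, c

    // True if the whole string matches the pattern.
    static boolean fullMatch(Pattern p, String s) {
        return p.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(fullMatch(ALNUM, "Z9"));    // true
        System.out.println(fullMatch(UNION, "e"));     // false
        System.out.println(fullMatch(NOT_ABC, "x"));   // true
    }
}
```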
Contextual Match
  X(?=Y)   Once X is matched, look ahead and expect to find Y
  X(?!Y)   Once X is matched, look ahead and expect not to find Y
  X(?<=Y)  Once X is matched, look behind and expect to find Y
  X(?<!Y)  Once X is matched, look behind and expect not to find Y
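The four lookaround constructs can be tried out with the Java sketch below. The class LookaroundDemo and the sample strings are made up for illustration. The lookbehind examples place the assertion before the digits, which is the more common arrangement; the assertion itself consumes no input in either position.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LookaroundDemo {
    // Returns the start index of every match of the regex in the input.
    static List<Integer> starts(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        List<Integer> result = new ArrayList<>();
        while (m.find()) {
            result.add(m.start());
        }
        return result;
    }

    public static void main(String[] args) {
        String words = "foobar foobaz";
        System.out.println(starts("foo(?=bar)", words));  // [0]: foo followed by bar
        System.out.println(starts("foo(?!bar)", words));  // [7]: foo not followed by bar

        String prices = "$42 and 17";
        System.out.println(starts("(?<=\\$)\\d+", prices));     // [1]: digits preceded by $
        System.out.println(starts("\\b(?<!\\$)\\d+", prices));  // [8]: number not preceded by $
    }
}
```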
Tips
  Pattern is stateless (immutable and thread-safe, unlike Matcher) -> Maximize reuse
  We often see:
    static final Pattern p = Pattern.compile("a*b");
  Be careful with String.split
    String.split vs Java loop + String.charAt
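Both tips can be illustrated with one Java sketch: a precompiled Pattern shared through a static final field, next to a hand-written charAt loop. The class SplitDemo and its method names are hypothetical. One reason to be careful is that the two approaches do not treat trailing separators the same way.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

final class SplitDemo {
    // Compiled once and reused; safe because Pattern is immutable.
    private static final Pattern SLASH = Pattern.compile("/");

    static String[] splitWithPattern(String path) {
        return SLASH.split(path);
    }

    // Plain loop over the characters, no regex involved.
    static List<String> splitWithLoop(String path) {
        List<String> segments = new ArrayList<>();
        int start = 0;
        for (int i = 0; i <= path.length(); i++) {
            if (i == path.length() || path.charAt(i) == '/') {
                segments.add(path.substring(start, i));
                start = i + 1;
            }
        }
        return segments;
    }

    public static void main(String[] args) {
        // Both keep the leading empty segment: [, portal, en, classic, home]
        System.out.println(Arrays.toString(splitWithPattern("/portal/en/classic/home")));
        System.out.println(splitWithLoop("/portal/en/classic/home"));
        // They differ on a trailing separator: split drops the trailing empty string, the loop keeps it.
        System.out.println(Arrays.toString(splitWithPattern("a/b/")));
        System.out.println(splitWithLoop("a/b/"));
    }
}
```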
Parsers in GateIn
Parsers in GateIn
  JavaScript Compressor
  CSS Compressor
  Groovy Template Optimizer
  Navigation Controller
    Extracting URL params = Regex matching + Backtracking algorithm
  StaxNavigator (nice XML parser based on StAX)
Advanced Theory
Grammar & Language
  Any word matching pattern eXo.*er is produced by a combination of transforms, starting from S
    S -> eXoQer
    Q -> RQT
    Q -> ''
    R -> {a,b,c,d,...}
    T -> {a,b,c,d,...}
    T -> ''
  Language of a Grammar = set of words generated by a finite combination of transforms, starting from S
  Ex: Any valid Java source file is generated by a finite number of transforms listed in the Java Grammar (JLS)
Finite State Machine & Language
  A language accepted by an FSM with Stack is generated by a context-free grammar
    Such a grammar can be built explicitly from the machine
  The language of a context-free grammar is accepted by an FSM with Stack
    Such a machine can be built explicitly from the grammar
  Kleene's theorem states the analogous equivalence between ordinary FSMs and regular expressions
