A lecture from my software engineering seminar on regular expression engines

  • Regular expressions

    1. 1. Regular Expressions: How do they work?
    2. 2. Several important facts
        1. Everything in computing was discovered in one form or another in the 70s-80s and was probably thought about during the 60s.
        2. The easiest way to become a great computer engineer in the 80s was to work for Bell Labs and have a beard.
    3. 3. Back to the subject at hand
    4. 4. What are regular expressions?
        From Wikipedia: In computing, a regular expression provides a concise and flexible means to "match" (specify and recognize) strings of text, such as particular characters, words, or patterns of characters. Common abbreviations for "regular expression" include regex and regexp.
    5. 5. Why do we need regular expressions (in programming)?
        Many reasons, but at their base most of them come down to finding strings in text. Preferably without reading it.
          ^(?("")(""[^""]+?""@)|(([0-9a-z]((\.(?!\.))|[-!#$%&*+/=?^`{}|~\w])*)(?<=[0-9a-z])@))(?(\[)(\[(\d{1,3}\.){3}\d{1,3}\])|(([0-9a-z][-\w]*[0-9a-z]*\.)+[a-z0-9]{2,17}))$
          ^(?=.*[^a-zA-Z])(?=.*[a-z])(?=.*[A-Z])\S{8,}$
    6. 6. Regular Expressions Syntax: metacharacters
        Grouping
          • . - matches any single character
          • [ ] - character class, matches a single character that is inside the brackets
          • [^ ] - negated character class, matches a single character that is not inside the brackets
          • ( ) - sub-expression; in Perl it can be recalled later from special variables
        Quantifiers
          • {m,n} - the preceding character or sub-expression must match at least m and no more than n times
          • * - derived from the Kleene star in formal language theory; matches zero or more of the preceding element
          • ? - matches zero or one of the preceding element
          • + - derived from the Kleene plus in formal language theory; matches one or more of the preceding element
        Location
          • ^ - marks the start of a line
          • $ - marks the end of a line
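
A few of the metacharacters and quantifiers above can be tried out in Python's built-in `re` module. The choice of Python and the sample strings are assumptions made for illustration; the slides themselves refer to Perl.

```python
import re

# "." matches any single character (except a newline by default)
dot = re.fullmatch(r"c.t", "cat") is not None

# [ ] matches one character from the set, [^ ] one character not in it
in_set = re.fullmatch(r"[bc]at", "bat") is not None
not_in_set = re.fullmatch(r"[^bc]at", "bat") is not None

# ( ) captures a sub-expression that can be recalled later
year = re.search(r"(\d{4})-(\d{2})", "released 2012-05").group(1)

# {m,n}, *, ? and + control how often the preceding element repeats
two_to_three = [bool(re.fullmatch(r"a{2,3}", s)) for s in ("a", "aa", "aaa", "aaaa")]
star = bool(re.fullmatch(r"ab*", "a"))      # zero b's is acceptable
plus = bool(re.fullmatch(r"ab+", "a"))      # at least one b is required
optional = bool(re.fullmatch(r"colou?r", "color"))

# ^ and $ anchor the match to the start and end of a line
anchored = re.findall(r"^\w+$", "one\ntwo words\nthree", flags=re.MULTILINE)
```

Here `re.fullmatch` is used so that each pattern has to account for the whole sample string, which makes the effect of every quantifier easier to see.
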
    7. 7. Regular Expressions Syntax: character groups
        [:alpha:] - Any alphabetical character - [A-Za-z]
        [:alnum:] - Any alphanumeric character - [A-Za-z0-9]
        [:ascii:] - Any character in the ASCII character set
        [:blank:] - A GNU extension, equal to a space or a horizontal tab ("\t")
        [:cntrl:] - Any control character
        [:digit:] - Any decimal digit - [0-9], equivalent to "\d"
        [:graph:] - Any printable character, excluding a space
        [:lower:] - Any lowercase character - [a-z]
        [:print:] - Any printable character, including a space
        [:punct:] - Any graphical character excluding "word" characters
        [:space:] - Any whitespace character. "\s" plus the vertical tab ("\cK")
        [:upper:] - Any uppercase character - [A-Z]
        [:word:] - A Perl extension - [A-Za-z0-9_], equivalent to "\w"
        [:xdigit:] - Any hexadecimal digit - [0-9a-fA-F]
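
Python's built-in `re` module does not understand the `[:name:]` notation, so one rough way to experiment with these classes there is to spell out the bracket equivalents listed on the slide. The `POSIX_EQUIVALENTS` table and the `expand_posix` helper below are hypothetical names invented for this sketch, and it covers only the classes for which the slide gives an explicit ASCII range.

```python
import re

# ASCII ranges copied from the slide; other classes are left out on purpose.
POSIX_EQUIVALENTS = {
    "alpha": "A-Za-z",
    "alnum": "A-Za-z0-9",
    "digit": "0-9",
    "lower": "a-z",
    "upper": "A-Z",
    "word": "A-Za-z0-9_",
    "xdigit": "0-9a-fA-F",
}

def expand_posix(pattern):
    """Replace each [:name:] inside a pattern with its ASCII range."""
    def substitute(match):
        return POSIX_EQUIVALENTS[match.group(1)]
    return re.sub(r"\[:(\w+):\]", substitute, pattern)

# One uppercase letter, some lowercase letters, a dash, four hex digits.
pattern = expand_posix(r"[[:upper:]][[:lower:]]+-[[:xdigit:]]{4}")
is_match = re.fullmatch(pattern, "Build-0fA3") is not None
```

Engines with native POSIX class support (Perl, GNU grep and others) need no such translation; the expansion is only a convenience for trying the classes out.
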
    8. 8. What is a regular expression engine?
        A regular expression engine is a program that takes a set of constraints specified in a mini-language, applies those constraints to a target string, and determines whether or not the string satisfies the constraints.
        In less grandiose terms, the first part of the job is to turn a pattern into something the computer can efficiently use to find the matching point in the string, and the second part is performing the search itself.
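
The two parts of the job are visible in most regex APIs. A minimal sketch using Python's `re` module (the pattern and target strings are made up for illustration): `re.compile` performs the first part once, and `search` performs the second part for each target string.

```python
import re

# Part one: turn the pattern into a compiled program (done once).
word_then_number = re.compile(r"([a-z]+)(\d+)")

# Part two: run that program against target strings (done many times).
targets = ["abc123", "no digits here", "x9 and y10"]
results = []
for text in targets:
    match = word_then_number.search(text)
    results.append(match.group(0) if match else None)
```

Compiling up front matters most when the same pattern is applied to many strings, since the parsing cost is paid only once.
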
    9. 9. Famous Regex Engines
    10. 10. Part 2
    11. 11. How the Perl Regex engine works
        • Unlike the army, only two steps
          – Compilation
            • Parsing (Size, Construction)
            • Peep-hole optimization and analysis
          – Execution
            • Start position and no-match optimizations
            • Program execution
    12. 12. DFA
    13. 13. DFA
    14. 14. NFA
        • Equal in strength to a DFA: both recognize exactly the same languages
        • Can be much smaller in size
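
The size difference can be seen with a classic textbook language that is not taken from the slides: strings over {a, b} whose n-th character from the end is `a`. This is a minimal sketch, assuming a hand-built NFA with n + 1 states. The hypothetical helper `count_dfa_states` runs the subset construction to count the states of the equivalent DFA.

```python
def nfa_step(states, ch, n):
    """Follow one input character from a set of NFA states.

    State 0 loops on every character and guesses, on 'a', that this is
    the n-th character from the end. States 1..n-1 simply count the
    remaining characters; state n is accepting and has no way out.
    """
    nxt = set()
    for s in states:
        if s == 0:
            nxt.add(0)
            if ch == "a":
                nxt.add(1)
        elif s < n:
            nxt.add(s + 1)
    return frozenset(nxt)

def count_dfa_states(n):
    """Subset construction: every reachable set of NFA states is one DFA state."""
    start = frozenset({0})
    seen = {start}
    work = [start]
    while work:
        current = work.pop()
        for ch in "ab":
            nxt = nfa_step(current, ch, n)
            if nxt not in seen:
                seen.add(nxt)
                work.append(nxt)
    return len(seen)

sizes = {n: (n + 1, count_dfa_states(n)) for n in (1, 3, 5)}
```

For this language the NFA grows linearly with n while the DFA built from it doubles with every step, because the DFA has to remember which of the last n characters were `a`.
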
    15. 15. Ken Thompson
    16. 16. Thompson NFA method
        • In 1968 Thompson wrote an article on how to convert a regular expression to a then still unnamed kind of automaton (an NFA)
        • The article included code to explain the point
    17. 17. Thompson NFA method
        1. Scan the regex and insert an explicit "." operator for each concatenation: a(b|c)*d becomes a.(b|c)*.d
        2. Convert to reverse Polish notation: abc|*.d.
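
These two steps can be sketched in Python as follows. This is not Thompson's original code. It is a minimal illustration that assumes the regex uses only literal characters, `|`, `*` and parentheses, which means a literal `.` is not supported because `.` is reserved for concatenation. The names `add_concat` and `to_postfix` are invented here, and the conversion uses the standard shunting-yard approach.

```python
OPERATORS = {"|": 1, ".": 2}   # binary operators and their precedence

def add_concat(regex):
    """Step 1: insert an explicit '.' wherever two pieces are concatenated."""
    out = []
    for i, ch in enumerate(regex):
        out.append(ch)
        if i + 1 < len(regex):
            nxt = regex[i + 1]
            # A '.' goes between something that can end a piece and
            # something that can start one.
            if ch not in "(|" and nxt not in ")|*":
                out.append(".")
    return "".join(out)

def to_postfix(regex):
    """Step 2: convert the infix regex to reverse Polish notation."""
    out, stack = [], []
    for ch in add_concat(regex):
        if ch == "(":
            stack.append(ch)
        elif ch == ")":
            while stack[-1] != "(":
                out.append(stack.pop())
            stack.pop()                      # discard the "("
        elif ch in OPERATORS:
            while (stack and stack[-1] != "("
                   and OPERATORS[stack[-1]] >= OPERATORS[ch]):
                out.append(stack.pop())
            stack.append(ch)
        else:
            out.append(ch)                   # literals and the postfix '*'
    while stack:
        out.append(stack.pop())
    return "".join(out)

explicit = add_concat("a(b|c)*d")
postfix = to_postfix("a(b|c)*d")
```

For the slide's example, `explicit` is `a.(b|c)*.d` and `postfix` is `abc|*.d.`, the same string shown in step 2.
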
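    18. 18. Thompson NFA method cont.
        [Figure: NFA fragments for a single character, an OR of two expressions, and the Kleene star of an expression]
    19. 19. Thompson NFA method cont.
        3. Build the NFA
        [Figure: the resulting NFA, with labels A, B, C and D]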
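
Step 3 can be sketched in Python as follows. This is not Thompson's original code. It is a minimal illustration that assumes the postfix form uses `.` for concatenation, `|` for OR and `*` for the Kleene star, with every other character treated as a literal. `State`, `build_nfa`, `closure` and `matches` are names invented for this sketch.

```python
class State:
    def __init__(self):
        self.edges = []       # list of (label, target); label None = epsilon

def build_nfa(postfix):
    """Build an NFA from a postfix regex; returns (start, accept)."""
    stack = []                # holds (start, accept) fragments
    for ch in postfix:
        if ch == ".":         # concatenation: left fragment, then right
            s2, a2 = stack.pop()
            s1, a1 = stack.pop()
            a1.edges.append((None, s2))
            stack.append((s1, a2))
        elif ch == "|":       # OR: new start branches into both fragments
            s2, a2 = stack.pop()
            s1, a1 = stack.pop()
            start, accept = State(), State()
            start.edges += [(None, s1), (None, s2)]
            a1.edges.append((None, accept))
            a2.edges.append((None, accept))
            stack.append((start, accept))
        elif ch == "*":       # Kleene star: skip the fragment or loop it
            s1, a1 = stack.pop()
            start, accept = State(), State()
            start.edges += [(None, s1), (None, accept)]
            a1.edges += [(None, s1), (None, accept)]
            stack.append((start, accept))
        else:                 # single character
            start, accept = State(), State()
            start.edges.append((ch, accept))
            stack.append((start, accept))
    return stack.pop()

def closure(states):
    """All states reachable from `states` through epsilon edges alone."""
    todo, seen = list(states), set(states)
    while todo:
        state = todo.pop()
        for label, target in state.edges:
            if label is None and target not in seen:
                seen.add(target)
                todo.append(target)
    return seen

def matches(postfix, text):
    """Simulate the NFA by tracking every state it could be in at once."""
    start, accept = build_nfa(postfix)
    current = closure({start})
    for ch in text:
        moved = {t for s in current for label, t in s.edges if label == ch}
        current = closure(moved)
    return accept in current

accepted = [s for s in ("ad", "abd", "abcbd", "a", "abca") if matches("abc|*.d.", s)]
```

Tracking a set of states instead of backtracking keeps the running time proportional to the text length times the number of NFA states.
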
    20. 20. Problems for regex
        • NLP
        • Unicode vs. ASCII
    21. 21. Some examples of Regex
        • ([^\s]+(\.(?i)(jpg|png|gif|bmp))$) - Match a file with specific extensions
        • ^(https?://)?([\da-z.-]+)\.([a-z.]{2,6})([/\w .-]*)*/?$ - Match a URL
        • /^#?([a-f0-9]{6}|[a-f0-9]{3})$/ - Match a hex color value
        • [ -~] - An interesting one: it matches any printable ASCII character, from space through tilde
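
Two of these examples are easy to try in Python's `re` module. This is a minimal sketch with made-up test strings; the surrounding `/` delimiters of the hex pattern are Perl or JavaScript syntax and are dropped here.

```python
import re

hex_color = re.compile(r"^#?([a-f0-9]{6}|[a-f0-9]{3})$")
printable = re.compile(r"[ -~]")

# The pattern is lowercase-only, so "#ABCDEF" is rejected unless
# re.IGNORECASE is added; "#12345" has neither 3 nor 6 digits.
hex_results = {
    s: bool(hex_color.match(s))
    for s in ("#a3c113", "fff", "#12345", "#ABCDEF")
}

# [ -~] is the range from space (0x20) to tilde (0x7E), so the tab and
# the NUL byte below are filtered out.
visible = "".join(ch for ch in "tab\there\x00!" if printable.fullmatch(ch))
```
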