Advanced regular expressions

A follow-up to Introduction to Regular Expressions. Introduces greedy matching, how regex engines work, and related concepts.

Published in: Engineering

  1. 1. Regular Expressions Part 2: Advanced Concepts
  2. 2. How do repetition tokens match a test string? Repetition tokens are greedy: they consume as much as they can and give characters back only when the rest of the pattern fails to match. Let’s check with valid HTML: http://rubular.com/r/nVoDVeAafp How do we solve this greediness?
  3. 3. How to fix greediness? The quick fix is laziness: add a ? after the +. So <.+?> matches only the HTML tags. Check http://rubular.com/r/yoEJztaClW A better alternative is a negated character class: <[^>]+>. It needs far less backtracking and therefore returns results faster. Check http://rubular.com/r/WHjIrJW3v7
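
The three patterns above can be compared side by side. This is a minimal Ruby sketch; the sample HTML string is made up for illustration and is not the one behind the Rubular links.

```ruby
# Hypothetical test string, chosen only to show the difference.
html = "<em>Hello</em> <b>World</b>"

greedy  = html.scan(/<.+>/)     # .+ runs to the end of the line, then backs up to the last ">"
lazy    = html.scan(/<.+?>/)    # .+? expands one character at a time until ">" fits
negated = html.scan(/<[^>]+>/)  # [^>]+ can never run past a ">"

p greedy
p lazy
p negated
```

The greedy pattern returns the whole string as a single match, while the lazy and negated-class patterns each return the four tags.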
  4. 4. Possessive Quantifiers Greedy tokens match as many repeats as possible. Lazy tokens match as few repeats as possible. Both then backtrack, trying other repeat counts, until the rest of the regex matches the test string. Possessive quantifiers, on the other hand, hold on to whatever they matched and discard the backtracking positions. So the regex engine fails as soon as the rest of the regex does not match, and doesn’t backtrack. /\D*+g/ does not match /string/. Why? Because \D*+ matches all of string and, unlike lazy/greedy tokens, possessive quantifiers can’t backtrack. Therefore the permutation in which the repeat token matches strin and g matches as a literal character is never tried.
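
A short Ruby sketch of the same experiment, assuming Ruby's regex engine (which supports possessive quantifiers and atomic groups). The variable names are illustrative only.

```ruby
text = "string"

greedy     = text[/\D*g/]      # \D* takes "string", gives back "g", then g matches
lazy       = text[/\D*?g/]     # \D*? grows one character at a time until g matches
possessive = text[/\D*+g/]     # \D*+ takes "string" and never gives "g" back
atomic     = text[/(?>\D*)g/]  # an atomic group discards backtracking positions the same way

p greedy      # "string"
p lazy        # "string"
p possessive  # nil
p atomic      # nil
```

Greedy and lazy reach the same match by different routes; only the possessive and atomic versions fail.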
  5. 5. Repetition Quantifiers
     ● Greedy (default): tokens *, +, ?, {m,n}. Backtracks: yes.
     ● Lazy: tokens *?, +?, ??, {m,n}?. Backtracks: yes.
     ● Possessive: tokens *+, ++, ?+, {m,n}+. Backtracks: no.
     Rest are possessive! :)
  6. 6. ^\d*\D+.?\S{1,10}[^0-9]+$ Recap Tip: Repetition tokens are greedy, but regular expression engines are eager.
  7. 7. ^\d*\D+.?\S{1,10}[^0-9]+$ Recap Challenge 1: Construct a regex to match an IP (v4) address.
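
One possible answer to Challenge 1, sketched in Ruby. It assumes dotted-decimal notation with four octets in the range 0 to 255 and no leading zeros; the deck does not give a reference solution, so the variable names and the sample inputs are assumptions.

```ruby
# One octet: 250-255, 200-249, 100-199, or 0-99 without a leading zero.
octet = /25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d/

# Three "octet." groups followed by a final octet, anchored to the whole string.
ipv4 = /\A(?:#{octet}\.){3}#{octet}\z/

samples = ["192.168.0.1", "255.255.255.255", "256.1.1.1", "1.2.3", "01.2.3.4"]
results = samples.map { |s| [s, ipv4.match?(s)] }.to_h

results.each { |address, ok| puts "#{address}: #{ok}" }
```

Listing the alternatives from most to least specific keeps the eager engine from settling on a short octet too early.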
  8. 8. Alternation Has the lowest precedence among all regex operators. Matches exactly one of several alternative regexes. Test strings: /I have a cat but no dog./ /I have three clown fish as pet./ What’s your pet? cat|dog|fish
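
A small Ruby sketch of the pet example, plus an illustration of what "lowest precedence" means in practice. The anchored patterns `loose` and `strict` are not from the slides; they are added here only to show how alternation splits a whole pattern.

```ruby
pets = /cat|dog|fish/

first_pet = "I have a cat but no dog."[pets]         # leftmost match wins
all_pets  = "I have a cat but no dog.".scan(pets)    # every match, left to right
fish_pet  = "I have three clown fish as pet."[pets]

# Alternation binds more loosely than anchors and sequences:
loose  = /^cat|dog$/       # means (^cat) or (dog$)
strict = /^(?:cat|dog)$/   # a group limits the alternation

p first_pet, all_pets, fish_pet
p loose.match?("catfish"), strict.match?("catfish")
```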
  9. 9. Word Boundaries: Zero-Length Assertions There are three different positions that qualify as word boundaries:
     ● Before the first character in the string, if the first character is a word character.
     ● After the last character in the string, if the last character is a word character.
     ● Between two characters in the string, where one is a word character and the other is not a word character.
     Regex /\b\w+\b/ on test string /bat=cat/; regex /\B\w+\B/ on test string /bat=cat/
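
The two patterns on this slide can be tried in Ruby as follows. The `gsub` line is an extra illustration, not from the slides, that marks each zero-length boundary position with a pipe character.

```ruby
text = "bat=cat"

whole_words = text.scan(/\b\w+\b/)   # word characters bounded on both sides
inner_only  = text.scan(/\B\w+\B/)   # word characters with NO boundary on either side
marked      = text.gsub(/\b/, "|")   # \b consumes nothing, so this only inserts markers

p whole_words  # ["bat", "cat"]
p inner_only   # ["a", "a"]
p marked       # "|bat|=|cat|"
```

With \B on both sides, only the middle letter of each word can match, because the first and last letters each touch a boundary.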
  10. 10. Groups/Backreference
      ● (group): club characters together as one unit. Regex example: /work(shop)?/ Test string: I work at a computer workshop.
      ● \1: default numeric reference for a group. Regex example: /(\w+)=\1/ Test string: Is cat=bat or rat=rat?
      ● (?<n>group): named groups. Regex example: /(?<word>\w+)/ Test string: $!, cat eats rat.
      ● \k{n}: named reference for a group. Regex example: /(?<a1>\w+)=\k{a1}/ Test string: Is cat=bat or rat=rat?
      ● (?:group): non-capturing groups. Regex example: /work(?:shop)?/ Test string: I work at a computer workshop.
      ● (?>group): atomic groups. Regex example: /a(?>bc|b)c/ Test string: abbc, abc
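
A Ruby sketch of the group types listed above, using the slide's test strings. Note one assumption: Ruby spells a named backreference as `\k<name>`, so the sketch uses that form. The string "abcc" is added here to show an input the atomic pattern does accept.

```ruby
sentence = "I work at a computer workshop."
equation = "Is cat=bat or rat=rat?"

capturing     = sentence.scan(/work(shop)?/)     # scan returns the captured group per match
non_capturing = sentence.scan(/work(?:shop)?/)   # no capture, so scan returns whole matches

numbered = equation[/(\w+)=\1/]                      # \1 repeats whatever group 1 matched
named    = equation.match(/(?<word>\w+)=\k<word>/)   # same idea with a named group

atomic_miss = "abc"[/a(?>bc|b)c/]    # group keeps "bc" and cannot retry with "b"
plain_hit   = "abc"[/a(?:bc|b)c/]    # ordinary group backtracks from "bc" to "b"
atomic_hit  = "abcc"[/a(?>bc|b)c/]   # "bc" is kept and the final c still matches

p capturing, non_capturing, numbered, named[:word]
p atomic_miss, plain_hit, atomic_hit
```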
  11. 11. What’s your language? c|c++|java|javascript /I use java for android development and javascript for everything else./
  12. 12. Alternation/Word Boundary/Groups What’s your language? c|c++|java|javascript /I use java for android development and javascript for everything else./ Challenge 2: Will this regex ever match c++ and javascript? Fix it to be “inclusive”.
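
One possible answer to Challenge 2, sketched in Ruby. In the original pattern the engine is eager, so `java` wins before `javascript` is tried, and `c++` is read as a possessive quantifier on `c` rather than literal plus signs. The `fixed` pattern and the second test sentence are assumptions made for illustration, not the deck's official solution.

```ruby
sentence = "I use java for android development and javascript for everything else."

original = /c|c++|java|javascript/
found    = sentence.scan(original)   # never contains "javascript"
cpp      = "c++"[original]           # only the "c" is matched

# Longer alternatives first, literal plus signs escaped, and a lookahead
# that refuses to stop in the middle of a longer name.
fixed    = /\b(?:javascript|java|c\+\+|c)(?![\w+])/
in_text  = sentence.scan(fixed)
all_four = "I use c, c++, java and javascript.".scan(fixed)

p found, cpp, in_text, all_four
```

A trailing \b would not work after `c++`, because the position between "+" and a space is not a word boundary; the negative lookahead covers that case.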
  13. 13. Revisiting HTML tags
  14. 14. /^(.*)?@\d+.\d{2,3}$/ Regex for matching an email address
  15. 15. /^(.*)?@\d+.\d{2,3}$/ Regex for matching an email address Challenge 3: Fix the regex!
  16. 16. Lookaround Tokens: Lookahead & Lookbehind
      ● (?=text): positive lookahead. Regex example: q(?=u)\D+ Test string: question, Iraq
      ● (?!text): negative lookahead. Regex example: q(?!u) Test string: qatar, Iraq, question
      ● (?<=text): positive lookbehind. Regex example: (?<=a)b Test string: cab, bed, debt
      ● (?<!text): negative lookbehind. Regex example: (?<!a)b Test string: cab, bed, debt
      Let’s try at http://rubular.com/r/cMuagzut6g
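
The four lookaround examples can be sketched in Ruby. To make the matched positions visible, the sketch upper-cases each match with `gsub`; that marking step is an illustration added here, not part of the slides.

```ruby
followed_by_u = "question, Iraq"[/q(?=u)\D+/]                  # lookahead checks u without consuming it
not_before_u  = "qatar, Iraq, question".gsub(/q(?!u)/, "Q")    # q NOT followed by u
after_a       = "cab, bed, debt".gsub(/(?<=a)b/, "B")          # b preceded by a
not_after_a   = "cab, bed, debt".gsub(/(?<!a)b/, "B")          # b NOT preceded by a

p followed_by_u  # "question, Iraq"
p not_before_u   # "Qatar, IraQ, question"
p after_a        # "caB, bed, debt"
p not_after_a    # "cab, Bed, deBt"
```

In the first example the match still begins with "qu": the lookahead only tests for the u, and \D+ then consumes it along with the remaining non-digits.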
  17. 17. Unicode
      ● Encoded as 2 code points: à = U+0061 (a) U+0300 (`). Regex: ^..$ Unicode regex: \P{M}\p{M}*+ or (?>\P{M}\p{M}*)
      ● Encoded as one code point: à = U+00E0. Regex: ^.$ Unicode regex: \u00E0
      ● Any Unicode character (punctuation marks, numerals etc.). Regex: .|.. Unicode regex: \X
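
A Ruby sketch of the two encodings of the same visible character, assuming UTF-8 strings. The variable names are illustrative.

```ruby
decomposed  = "a\u0300"   # letter a followed by a combining grave accent
precomposed = "\u00E0"    # the same character as a single code point

two_dots   = decomposed.match?(/\A..\z/)          # two code points, so two dots
one_dot    = precomposed.match?(/\A.\z/)          # one code point, one dot
by_escape  = precomposed.match?(/\u00E0/)         # match the code point directly
clusters   = decomposed.scan(/\P{M}\p{M}*+/)      # base character plus its combining marks
graphemes  = [decomposed, precomposed].map { |s| s.scan(/\X/).length }

p decomposed.length, precomposed.length   # 2 and 1
p two_dots, one_dot, by_escape
p clusters.length, graphemes
```

Both \P{M}\p{M}*+ and \X treat the decomposed form as a single unit, which is why they are the safer choice when the input's normalization form is unknown.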
  18. 18. How does a regex engine work?
  19. 19. Mathematics Behind Regex
      ● Originated in 1956, when mathematician Stephen Cole Kleene described regular languages using his mathematical notation called regular sets.
      ● Entered popular use from 1968 in two areas: pattern matching in a text editor and lexical analysis in a compiler.
      ● Among the first uses, Ken Thompson built a regex engine into the QED editor and later into the UNIX editor ed. That led to `grep`. Guess what grep is: g/re/p
  20. 20. Applications
  21. 21. Wait, there’s more: recursion and subroutines. You can even match palindrome strings in Ruby and Perl using regex!
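
As a taste of recursion, here is a Ruby sketch that uses a subexpression call (`\g<name>`) to match balanced parentheses. This is a simpler relative of the palindrome case, which additionally needs backreferences that are aware of the recursion level and is not shown here. The group name `paren` and the sample strings are assumptions.

```ruby
# A parenthesized unit is "(" + any mix of non-parenthesis characters
# and nested units + ")". \g<paren> calls the group recursively.
balanced = /\A(?<paren>\((?:[^()]|\g<paren>)*\))\z/

samples = ["(a(b)c)", "((()))", "(a(b c)", "())"]
verdict = samples.map { |s| [s, balanced.match?(s)] }.to_h

verdict.each { |text, ok| puts "#{text}: #{ok}" }
```

A regex without recursion cannot express arbitrary nesting depth, which is why this feature goes beyond the regular languages described in the mathematics slide.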
  22. 22. Let’s take this away as homework: https://engineering.linkedin.com/puzzle
  23. 23. Questions?
