Regexp secrets

Regexp class is in every Rubyist's toolbox. But do you know the theory behind it, and what goes on under the hood?

Published in: Technology

  1. Secrets of Regexp. Hiro Asari, Red Hat, Inc.
  2-4. Let's Talk About Regular Expressions
      • There is no regular expression
      • A good approximation as a name
  5. Let's Talk About Regexp
  6. Some people, when confronted with a problem, think, "I know, I'll use regular expressions." Now they have two problems. (Jamie Zawinski, 12 Aug 1997)
      http://regex.info/blog/2006-09-15/247
      http://www.codinghorror.com/blog/2008/06/regular-expressions-now-you-have-two-problems.html
      The point is not so much the evils of regular expressions as the evils of overusing them.
  7-10. Formal Language Theory
      • The language L
      • Over alphabet Σ
      • Alphabet Σ = {a, b, c, d, e, …, z, λ} (example)
      • Words over Σ: "a", "b", "ab", "aequafdhfad"
      • Σ*: the set of all words over Σ
  11. Formal Language over Σ
      • A subset L of Σ* (with various properties)
      • L can be finite, simply enumerating its well-formed words, but it is often infinite
  12. Example
      • Language L over Σ = {a, b}
      • a is a word
      • A word may be obtained by appending ab to an existing word
      • Only words thus formed are legal
  13. Well-formed words: a, aab, aabab
  14. Ill-formed words: baaaababb
  15. Succinctly… a(ab)*
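The example language can be checked mechanically. Below is a minimal sketch in Ruby; the helper name well_formed? and the short ill-formed samples are illustrative assumptions, not taken from the slides, and the pattern is wrapped in \A and \z so that the whole word is tested rather than a substring.

```ruby
# Hypothetical helper: true when the whole word belongs to L = a(ab)*.
def well_formed?(word)
  /\Aa(ab)*\z/.match?(word)
end

# Words built by the rule "start with a, then keep appending ab".
%w[a aab aabab].each { |w| puts "#{w}: #{well_formed?(w)}" }          # all true

# Illustrative strings that the rule can never produce.
%w[b aa abb baaaababb].each { |w| puts "#{w}: #{well_formed?(w)}" }   # all false
```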
  16. Expression
      • A textual representation of a formal language, against which an input is tested to decide whether it is a well-formed word in that language
  17-20. Regular Languages
      • ∅ (the empty language) is regular
      • For each a ∈ Σ (a belongs to Σ), the singleton language {a} is a regular language
      • If A and B are regular languages, then A ∪ B (union), A•B (concatenation), and A* (Kleene star) are regular languages
      • No other languages over Σ are regular
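The three closure operations map directly onto Regexp syntax. The following is a rough sketch, assuming Ruby's Regexp.union for the union and plain interpolation for concatenation and Kleene star; the helper in_language? is a hypothetical name introduced here to test whole strings.

```ruby
# Hypothetical helper: does the whole string belong to the language of re?
def in_language?(re, str)
  /\A(?:#{re})\z/.match?(str)
end

a = /a/                       # singleton language {a}
b = /b/                       # singleton language {b}
union  = Regexp.union(a, b)   # A union B
concat = /#{a}#{b}/           # A concatenated with B
star   = /(?:#{a})*/          # Kleene star of A

puts in_language?(union, "b")     # true
puts in_language?(concat, "ab")   # true
puts in_language?(concat, "ba")   # false
puts in_language?(star, "")       # true: A* contains the empty word
puts in_language?(star, "aaa")    # true
```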
  21-22. Regular Expressions
      • Expressions of regular languages (slide 22 overlays "Not" on this definition)
  23. Regular? Expressions
      • It turns out that some expressions are more powerful and express non-regular languages
      • Language of squares: (.*)\1
      • a, aa, aaaa, WikiWiki
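A short Ruby sketch of the language of squares follows. The constant name SQUARE and the \A and \z anchors are assumptions added here so that the whole string, not just a substring, has to be a word repeated twice.

```ruby
# Anchored form of the slide's pattern: the whole string must be some word
# written twice. The backreference \1 is what goes beyond regular languages.
SQUARE = /\A(.*)\1\z/

%w[aa aaaa WikiWiki].each { |w| puts "#{w}: #{SQUARE.match?(w)}" }   # all true
%w[Wiki abc].each { |w| puts "#{w}: #{SQUARE.match?(w)}" }           # all false

# The captured group is the repeated half.
puts SQUARE.match("WikiWiki")[1]   # Wiki
```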
  24. How does Regexp work?
      • Build a finite state automaton representing the given regular expression
      • Feed the String to the automaton and see if the match succeeds
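  25. a (state diagram; transition: a)
  26. ab* (state diagram; transitions: a, b)
  27. .* (state diagram; transition: .)
  28. a$ (state diagram; transitions: a, $)
  29. a? (state diagram; transitions: a, ε)
  30. a|b (state diagram; transitions: a, b)
  31. (ab|c) (state diagram; transitions: a, b, c)
  32. (ab+|c) (state diagram; transitions: a, b, b, c)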
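To make the automaton idea concrete, here is a hand-built deterministic automaton for the earlier expression a(ab)*. This is only a sketch: the state names, the TRANSITIONS table, and accepts? are illustrative assumptions, and a real Regexp engine does not expose its automaton this way.

```ruby
# Transition table for a(ab)*: state => { input character => next state }.
TRANSITIONS = {
  start:  { "a" => :accept },   # the leading a
  accept: { "a" => :need_b },   # start of another "ab"
  need_b: { "b" => :accept }    # finish that "ab"
}.freeze

# Feed the word to the automaton one character at a time.
def accepts?(word)
  state = :start
  word.each_char do |ch|
    state = TRANSITIONS.fetch(state, {})[ch]
    return false if state.nil?   # no transition for this character: reject
  end
  state == :accept               # must end in the accepting state
end

%w[a aab aabab].each { |w| puts "#{w}: #{accepts?(w)}" }   # all true
%w[b aa abb].each { |w| puts "#{w}: #{accepts?(w)}" }      # all false
```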
  33. Match is attempted at every character, left to right
  34-38. /a$/
      zyxwvutsrqponmlkjihgfedcba
      ^
      zyxwvutsrqponmlkjihgfedcba
       ^
      zyxwvutsrqponmlkjihgfedcba
        ^
      zyxwvutsrqponmlkjihgfedcba
         ^
      ⋮
      zyxwvutsrqponmlkjihgfedcba
                               ^
      Regexp does not think, "a$ can match only at the end of the line, so we should fast-forward to the end of the line."
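The left-to-right scan can be modeled naively in Ruby by trying the pattern at each starting position in turn. This is a simplified model written for illustration, not how the engine is implemented; the variable names are assumptions.

```ruby
str = ("a".."z").to_a.reverse.join   # "zyxwvutsrqponmlkjihgfedcba"

attempts = 0
found = (0...str.length).find do |start|
  attempts += 1
  # \A pins the attempt to this starting position only.
  str[start..-1].match?(/\Aa$/)
end

puts found        # 25
puts attempts     # 26: every position was tried before the match succeeded
puts str =~ /a$/  # 25, the same position the engine reports
```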
  39. ^\s*(.*)\s*$ against abc d a dfadg (the caret advances through the string over four builds)
      # matches abc d a dfadg
  40. a?a?a?…a?aaa…a
      def pathological(n=5)
        # n optional a's followed by n mandatory a's
        Regexp.new('a?' * n + 'a' * n)
      end
      1.upto(40) do |n|
        print n, ": "
        print Time.now, "\n" if 'a'*n =~ pathological(n)
      end
  41. a?a?a?aaa against aaa (caret at the start of the string)
  42. Regexp tips
  43. Use /x
      UP_TO_256 = /\b(?:25[0-5]   # 250-255
                   |2[0-4][0-9]   # 200-249
                   |1[0-9][0-9]   # 100-199
                   |[1-9][0-9]    # 2-digit numbers
                   |[0-9])        # single-digit numbers
                   \b/x
      IPV4_ADDRESS = /#{UP_TO_256}(?:\.#{UP_TO_256}){3}/
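A usage sketch for an extended-mode pattern like the one above. The names OCTET and IPV4 and the \A and \z anchors are assumptions made for this example; the slide's own constants are unanchored.

```ruby
# One decimal octet, 0 through 255, written with /x so the parts can be spaced out.
OCTET = /\b(?: 25[0-5] | 2[0-4][0-9] | 1[0-9][0-9] | [1-9][0-9] | [0-9] )\b/x

# Four octets separated by literal dots; anchored so the whole string must match.
IPV4 = /\A#{OCTET}(?:\.#{OCTET}){3}\z/

puts IPV4.match?("192.168.0.1")       # true
puts IPV4.match?("255.255.255.255")   # true
puts IPV4.match?("256.1.1.1")         # false: 256 is out of range
puts IPV4.match?("1.2.3")             # false: only three octets
```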
  44-45. \A, \z for strings; ^, $ for lines
      • \A: the beginning of the string
      • \z: the end of the string
      • ^: after \n
      • $: before \n
      (always in Ruby)
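The difference between the two kinds of anchors shows up as soon as the string contains a newline. A small Ruby sketch, with a sample string chosen for illustration:

```ruby
text = "abc\ndef"

p text =~ /^d/    # 4: ^ matches right after the newline
p text =~ /\Ad/   # nil: \A matches only at the very start of the string
p text =~ /c$/    # 2: $ matches right before the newline
p text =~ /c\z/   # nil: \z matches only at the very end of the string
p text =~ /f\z/   # 6
```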
  46-47. What's the problem?
      #! /usr/bin/env perl
      $a = "abc\ndef";
      if ($a =~ /^d/) {
        print "yes\n";
      }
      if ($a =~ /^d/m) {
        print "yes now\n";
      }
      # prints yes now
      Also note the difference in what /m means.
  48. What's the problem?
      #! /usr/bin/env ruby
      a = "abc\ndef";
      if (a =~ /^d/)
        p "yes"
      end
      http://guides.rubyonrails.org/security.html#regular-expressions
  49. Security Implications
      class File < ActiveRecord::Base
        validates :name, :format => /^[\w\.\-\+]+$/
      end
      http://guides.rubyonrails.org/security.html#regular-expressions
  50-51. file.txt%0A<script>alert('hello')</script>
  52-54. file.txt\n<script>alert('hello')</script>
      /^[\w\.\-\+]+$/
      Match succeeds; ActiveRecord validation succeeds
  55-56. file.txt\n<script>alert('hello')</script>
      /\A[\w\.\-\+]+\z/
      Match fails; ActiveRecord validation fails
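The same outcome can be reproduced in plain Ruby without ActiveRecord. This is a sketch only: the variable names are assumptions, and the string literal stands in for the decoded form of the %0A payload.

```ruby
# The file name after %0A has been decoded to a newline.
name = "file.txt\n<script>alert('hello')</script>"

line_anchored   = /^[\w\.\-\+]+$/     # anchors to lines
string_anchored = /\A[\w\.\-\+]+\z/   # anchors to the whole string

puts line_anchored.match?(name)             # true: the first line alone satisfies the pattern
puts string_anchored.match?(name)           # false: the newline and script tag are rejected
puts string_anchored.match?("file.txt")     # true: an ordinary name still passes
```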
  57. Prefer Character Class to Alternations
      require 'benchmark'
      # simple benchmark for alternations and character class
      n = 5_000
      str = 'cafebabedeadbeef' * n
      Benchmark.bmbm do |x|
        x.report('alternation') do
          str =~ /^(a|b|c|d|e|f)+$/
        end
        x.report('character class') do
          str =~ /^[a-f]+$/
        end
      end
  58. Benchmarks
      Ruby 1.8.7
                            user     system      total        real
      alternation       0.030000   0.010000   0.040000 (  0.036702)
      character class   0.000000   0.000000   0.000000 (  0.004704)
      Ruby 2.0.0
                            user     system      total        real
      alternation       0.020000   0.010000   0.030000 (  0.023139)
      character class   0.000000   0.000000   0.000000 (  0.009641)
      JRuby 1.7.4.dev
                            user     system      total        real
      alternation       0.030000   0.000000   0.030000 (  0.021000)
      character class   0.010000   0.000000   0.010000 (  0.007000)
  59. Beware of Character Classes
      # case-insensitively match any non-word character…
      # one is unlike the others
      'r' =~ /(?i:[\W])/
      's' =~ /(?i:[\W])/   # matches, even though s is a word character
      't' =~ /(?i:[\W])/
      https://bugs.ruby-lang.org/issues/4044
  60-65. /^1?$|^(11+?)\1+$/
      • ^1?$ matches 1 (or nothing), or
      • (11+?) non-greedily matches 2 or more 1s
      • \1+ repeats that group 1 or more additional times
      • so the second alternative matches a composite number of 1s
      • Matches a string of 1s if and only if there is a non-prime number of 1s
  66. Integer#prime?
      class Integer
        def prime?
          "1" * self !~ /^1?$|^(11+?)\1+$/
        end
      end
      No performance guarantee. Attributed to the Perl hacker Abigail.
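To see that the trick gives the right answers, it can be compared against ordinary trial division on small inputs. The sketch below avoids reopening Integer; the method names unary_prime? and trial_prime? are hypothetical, and \A and \z replace ^ and $ because the input is a single string.

```ruby
# Standalone version of the unary-string test: n is prime when a string
# of n ones matches neither alternative.
def unary_prime?(n)
  ("1" * n) !~ /\A1?\z|\A(11+?)\1+\z/
end

# Reference implementation: plain trial division up to the square root.
def trial_prime?(n)
  n >= 2 && (2..Integer.sqrt(n)).none? { |d| (n % d).zero? }
end

p (0..30).select { |n| unary_prime?(n) }
# [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
```

As the slide warns, this carries no performance guarantee; the backtracking cost grows quickly with n, so it is best kept to small numbers.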
  67. Contact
      • @hiro_asari
      • GitHub: BanzaiMan