Regexp secrets
Regexp class is in every Rubyist's toolbox. But do you know the theory behind it, and what goes on under the hood?

Transcript

  • 1. Secrets of Regexp Hiro Asari Red Hat, Inc.
  • 2. Let's Talk About Regular Expressions
  • 3. Let's Talk About Regular Expressions • There is no regular expression
  • 4. Let's Talk About Regular Expressions • A good approximation as a name
  • 5. Let's Talk About Regexp
  • 6. Some people, when confronted with a problem, think, "I know, I'll use regular expressions." Now they have two problems. Jamie Zawinski, 12 Aug 1997. http://regex.info/blog/2006-09-15/247 http://www.codinghorror.com/blog/2008/06/regular-expressions-now-you-have-two-problems.html The point is not so much the evils of regular expressions as the evils of overusing them.
  • 7. Formal Language Theory • The Language L • Over Alphabet Σ
  • 8. Formal Language Theory • Alphabet Σ = {a, b, c, d, e, …, z, λ} (example)
  • 9. Formal Language Theory • Alphabet Σ = {a, b, c, d, e, …, z, λ} (example) • Words over Σ: "a", "b", "ab", "aequafdhfad"
  • 10. Formal Language Theory • Alphabet Σ = {a, b, c, d, e, …, z, λ} (example) • Words over Σ: "a", "b", "ab", "aequafdhfad" • Σ*: The set of all words over Σ
  • 11. Formal Language over Σ • A subset L of Σ* (with various properties) • L can be finite, enumerating its well-formed words, but is often infinite
  • 12. Example • Language L over Σ = {a, b} • a is a word • A word may be obtained by appending ab to an existing word • Only words thus formed are legal
  • 13. Well-formed words: a, aab, aabab
  • 14. Ill-formed words: baaaababb
  • 15. Succinctly… • a(ab)*
  • 16. Expression • A textual representation of a formal language, against which an input is tested to see whether it is a well-formed word in that language
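As a rough illustration of slides 12 to 16, the expression a(ab)* can be tried out in Ruby. This is only a sketch: the names WELL_FORMED and well_formed? are made up here, the pattern is anchored with \A and \z so that the whole word is tested, and the sample words are chosen for this example.

```ruby
# Hypothetical names; a sketch of the example language from slide 12.
WELL_FORMED = /\Aa(ab)*\z/

def well_formed?(word)
  WELL_FORMED.match?(word)
end

# Words built by starting from "a" and appending "ab".
%w[a aab aabab].each { |w| puts "#{w}: #{well_formed?(w)}" }

# Words that cannot be built that way.
%w[b aa ab aababb].each { |w| puts "#{w}: #{well_formed?(w)}" }
```

The anchors matter: without them, Ruby would search for the pattern anywhere inside the string instead of testing the whole word.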
  • 17. Regular Languages • ∅ (empty language) is regular
  • 18. Regular Languages • ∅ (empty language) is regular • For each a ∈ Σ (a belongs to Σ), the singleton language {a} is a regular language.
  • 19. Regular Languages • ∅ (empty language) is regular • For each a ∈ Σ (a belongs to Σ), the singleton language {a} is a regular language. • If A and B are regular languages, then A ∪ B (union), A•B (concatenation), and A* (Kleene star) are regular languages
  • 20. Regular Languages • ∅ (empty language) is regular • For each a ∈ Σ (a belongs to Σ), the singleton language {a} is a regular language. • If A and B are regular languages, then A ∪ B (union), A•B (concatenation), and A* (Kleene star) are regular languages • No other languages over Σ are regular.
  • 21. Regular Expressions • Expressions of regular languages
  • 22. Regular Expressions • Not expressions of regular languages
  • 23. Regular? Expressions • It turns out that some expressions are more powerful and express non-regular languages • Language of squares: (.*)\1 • a, aa, aaaa, WikiWiki
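The language of squares can be sketched in Ruby with a backreference, which is the feature that takes Regexp beyond regular languages. The names SQUARE and square? are hypothetical, and the pattern is anchored here so that the entire string must be some word repeated twice.

```ruby
# Hypothetical names; \1 must repeat exactly what group 1 captured.
SQUARE = /\A(.*)\1\z/

def square?(word)
  SQUARE.match?(word)
end

%w[aa aaaa WikiWiki aaa Wiki].each do |w|
  puts "#{w}: #{square?(w)}"
end
```

The engine has to try different lengths for the captured group until both halves agree, which a finite automaton on its own cannot do.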
  • 24. How does Regexp work? • Build a finite state automaton representing a given regular expression • Feed the String to the regular expression and see if the match succeeds
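  • 25. a (automaton diagram; edge label: a)
  • 26. ab* (automaton diagram; edge labels: a, b)
  • 27. .* (automaton diagram; edge label: .)
  • 28. a$ (automaton diagram; edge labels: a, $)
  • 29. a? (automaton diagram; edge labels: a, ε)
  • 30. a|b (automaton diagram; edge labels: a, b)
  • 31. (ab|c) (automaton diagram; edge labels: a, b, c)
  • 32. (ab+|c) (automaton diagram; edge labels: b, a, b, c)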
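To make the automaton idea concrete, here is a minimal hand-built sketch for ab*. It is not how Ruby's engine is implemented; the state numbers, the TRANSITIONS table, and the accepts? helper are all assumptions made for this example, and it tests the whole input instead of searching within it.

```ruby
# Hypothetical automaton for ab*: state 0 is the start state, state 1 accepts.
TRANSITIONS = {
  [0, "a"] => 1, # the leading "a"
  [1, "b"] => 1  # zero or more "b"s loop on the accepting state
}.freeze
ACCEPTING = [1].freeze

def accepts?(input)
  state = 0
  input.each_char do |ch|
    state = TRANSITIONS[[state, ch]]
    return false if state.nil? # no edge for this character: reject
  end
  ACCEPTING.include?(state)
end

["a", "abbb", "", "ba", "abab"].each do |s|
  puts "#{s.inspect}: #{accepts?(s)}"
end
```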
  • 33. Match is attempted at every character, left to right
  • 34. /a$/ zyxwvutsrqponmlkjihgfedcba ^ Regexp does not think, "a$ can match only at the end of the line, so we should fast-forward to the end of the line"
  • 35. /a$/ zyxwvutsrqponmlkjihgfedcba ^ zyxwvutsrqponmlkjihgfedcba ^ Regexp does not think, "a$ can match only at the end of the line, so we should fast-forward to the end of the line"
  • 36. /a$/ zyxwvutsrqponmlkjihgfedcba ^ zyxwvutsrqponmlkjihgfedcba ^ zyxwvutsrqponmlkjihgfedcba ^ Regexp does not think, "a$ can match only at the end of the line, so we should fast-forward to the end of the line"
  • 37. /a$/ zyxwvutsrqponmlkjihgfedcba ^ zyxwvutsrqponmlkjihgfedcba ^ zyxwvutsrqponmlkjihgfedcba ^ zyxwvutsrqponmlkjihgfedcba ^ Regexp does not think, "a$ can match only at the end of the line, so we should fast-forward to the end of the line"
  • 38. /a$/ zyxwvutsrqponmlkjihgfedcba ^ zyxwvutsrqponmlkjihgfedcba ^ zyxwvutsrqponmlkjihgfedcba ^ zyxwvutsrqponmlkjihgfedcba ^ ⋮ zyxwvutsrqponmlkjihgfedcba ^ Regexp does not think, "a$ can match only at the end of the line, so we should fast-forward to the end of the line"
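The left-to-right scan in slides 33 to 38 can be modelled naively in Ruby. This is a simplified sketch, not the engine's actual code: first_match_with_attempts is a hypothetical helper that pins the pattern to each start position in turn with \A and counts how many positions it tried.

```ruby
# Hypothetical helper: try the pattern at each start offset, left to right.
def first_match_with_attempts(pattern, str)
  attempts = 0
  (0...str.length).each do |start|
    attempts += 1
    # \A pins this attempt to the beginning of the remaining substring.
    return [start, attempts] if str[start..-1].match?(/\A#{pattern}/)
  end
  [nil, attempts]
end

position, attempts = first_match_with_attempts(/a$/, "zyxwvutsrqponmlkjihgfedcba")
puts "matched at index #{position} after #{attempts} attempts"
```

In this model the match is found only at the last character, after an attempt at every earlier one. A real engine also considers the position just past the last character, which this sketch leaves out.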
  • 39. ^\s*(.*)\s*$ abc d a dfadg ^ abc d a dfadg ^ abc d a dfadg ^ abc d a dfadg ^ # matches abc d a dfadg
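  • 40. a?a?a?…a?aaa…a
        # n optional a's followed by n required a's
        def pathological(n = 5)
          Regexp.new('a?' * n + 'a' * n)
        end
        # print a timestamp after each successful match to show the slowdown
        1.upto(40) do |n|
          print n, ": "
          print Time.now, "\n" if 'a' * n =~ pathological(n)
        end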
  • 41. a?a?a?aaa aaa ^
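To see why this pattern is pathological for a backtracking matcher, the search can be modelled with a toy matcher that counts its steps. Everything here is an assumption made for illustration (the :opt and :req tokens, backtrack, and steps_for); real engines differ in detail and may optimize this case.

```ruby
# Hypothetical toy matcher: :opt stands for "a?", :req stands for "a".
def backtrack(tokens, str, ti, si, steps)
  steps[0] += 1                        # count every state that is visited
  return true if ti == tokens.length   # pattern used up: match found
  consume = str[si] == "a" && backtrack(tokens, str, ti + 1, si + 1, steps)
  return consume if tokens[ti] == :req
  # :opt is greedy: try consuming an "a" first, then fall back to skipping it.
  consume || backtrack(tokens, str, ti + 1, si, steps)
end

def steps_for(n)
  tokens = [:opt] * n + [:req] * n
  steps = [0]
  matched = backtrack(tokens, "a" * n, 0, 0, steps)
  [matched, steps[0]]
end

[5, 10, 15].each do |n|
  matched, steps = steps_for(n)
  puts "n=#{n}: matched=#{matched}, steps=#{steps}"
end
```

The only successful path is the one where every a? matches nothing, and the greedy matcher tries that path last, so in this model the step count at least doubles each time n grows by one.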
  • 42. Regexp tips
  • 43. Use /x
        UP_TO_256 = /\b(?:25[0-5]  # 250-255
          |2[0-4][0-9]             # 200-249
          |1[0-9][0-9]             # 100-199
          |[1-9][0-9]              # 2-digit numbers
          |[0-9])                  # single-digit numbers
          \b/x
        IPV4_ADDRESS = /#{UP_TO_256}(?:\.#{UP_TO_256}){3}/
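A short usage sketch for the /x pattern follows. The constants are reconstructed from slide 43, and the sample strings are made up for this example. Interpolating one Regexp into another keeps the inner /x option, which is why the commented pattern can be reused as a building block.

```ruby
# Constants reconstructed from slide 43; the candidates below are hypothetical.
UP_TO_256 = /\b(?:25[0-5]  # 250-255
  |2[0-4][0-9]             # 200-249
  |1[0-9][0-9]             # 100-199
  |[1-9][0-9]              # 2-digit numbers
  |[0-9])                  # single-digit numbers
  \b/x
IPV4_ADDRESS = /#{UP_TO_256}(?:\.#{UP_TO_256}){3}/

%w[192.168.0.1 10.0.0.256 1.2.3].each do |candidate|
  puts "#{candidate}: #{IPV4_ADDRESS.match?(candidate)}"
end
```

Note that IPV4_ADDRESS is not anchored, so it finds an address anywhere inside a longer string.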
  • 44. \A, \z for strings; ^, $ for lines • \A: the beginning of the string • \z: the end of the string • ^: after \n • $: before \n
  • 45. \A, \z for strings; ^, $ for lines • \A: the beginning of the string • \z: the end of the string • ^: after \n • $: before \n (always in Ruby)
  • 46. What's the problem? Also note the difference in what /m means
  • 47. What's the problem?
        #! /usr/bin/env perl
        $a = "abc\ndef";
        if ($a =~ /^d/) { print "yes\n"; }
        if ($a =~ /^d/m) { print "yes now\n"; }
        # prints "yes now"
        Also note the difference in what /m means
  • 48. What's the problem?
        #! /usr/bin/env ruby
        a = "abc\ndef"
        if a =~ /^d/
          p "yes"
        end
        http://guides.rubyonrails.org/security.html#regular-expressions
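The difference between line anchors and string anchors in Ruby can be checked directly. This is a small sketch with a made-up sample string; it also shows what /m means in Ruby, where it lets . match a newline and does not change ^ or $.

```ruby
text = "abc\ndef"

# ^ matches at the start of every line, so "d" is found after the newline.
p text =~ /^d/    # index of "d"

# \A matches only at the very beginning of the string.
p text =~ /\Ad/   # nil

# In Ruby, /m changes what "." matches, not what ^ and $ mean.
p "a\nb" =~ /a.b/   # nil: "." does not match "\n" by default
p "a\nb" =~ /a.b/m  # 0: with /m, "." also matches "\n"
```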
  • 49. Security Implications
        class File < ActiveRecord::Base
          validates :name, :format => /^[\w\.\-\+]+$/
        end
        http://guides.rubyonrails.org/security.html#regular-expressions
  • 50. file.txt%0A<script>alert('hello')</script>
  • 51. file.txt%0A<script>alert('hello')</script>
  • 52. file.txt\n<script>alert('hello')</script>
  • 53. file.txt\n<script>alert('hello')</script> /^[\w\.\-\+]+$/
  • 54. file.txt\n<script>alert('hello')</script> /^[\w\.\-\+]+$/ Match succeeds; ActiveRecord validation succeeds
  • 55. file.txt\n<script>alert('hello')</script> /\A[\w\.\-\+]+\z/
  • 56. file.txt\n<script>alert('hello')</script> /\A[\w\.\-\+]+\z/ Match fails; ActiveRecord validation fails
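The two outcomes in slides 53 to 56 can be reproduced outside Rails with plain Ruby. This sketch only exercises the two patterns against the decoded file name; the constant names LINE_ANCHORED and STRING_ANCHORED are made up, and no ActiveRecord validation is run.

```ruby
# The decoded file name from slide 52 (%0A is a newline).
name = "file.txt\n<script>alert('hello')</script>"

LINE_ANCHORED   = /^[\w\.\-\+]+$/    # hypothetical name: anchors match per line
STRING_ANCHORED = /\A[\w\.\-\+]+\z/  # hypothetical name: anchors match per string

# The first line, "file.txt", satisfies the line-anchored pattern on its own.
puts LINE_ANCHORED.match?(name)

# The string-anchored pattern must cover the whole value, newline included.
puts STRING_ANCHORED.match?(name)
```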
  • 57. Prefer Character Class to Alternations
        require 'benchmark'
        # simple benchmark for alternations and character class
        n = 5_000
        str = 'cafebabedeadbeef' * 5_000
        Benchmark.bmbm do |x|
          x.report('alternation') do
            str =~ /^(a|b|c|d|e|f)+$/
          end
          x.report('character class') do
            str =~ /^[a-f]+$/
          end
        end
  • 58. Benchmarks
        Ruby 1.8.7            user     system      total        real
        alternation       0.030000   0.010000   0.040000 (  0.036702)
        character class   0.000000   0.000000   0.000000 (  0.004704)
        Ruby 2.0.0            user     system      total        real
        alternation       0.020000   0.010000   0.030000 (  0.023139)
        character class   0.000000   0.000000   0.000000 (  0.009641)
        JRuby 1.7.4.dev       user     system      total        real
        alternation       0.030000   0.000000   0.030000 (  0.021000)
        character class   0.010000   0.000000   0.010000 (  0.007000)
  • 59. Beware of Character Classes
        # case-insensitively match any non-word character…
        # one is unlike the others
        'r' =~ /(?i:[\W])/
        's' =~ /(?i:[\W])/  # matches, even though 's' is a word character
        't' =~ /(?i:[\W])/
        https://bugs.ruby-lang.org/issues/4044
  • 60. /^1?$|^(11+?)\1+$/
  • 61. /^1?$|^(11+?)\1+$/ Matches "1" or "" (the empty string)
  • 62. /^1?$|^(11+?)\1+$/ Non-greedily match 2 or more 1s
  • 63. /^1?$|^(11+?)\1+$/ 1 or more additional times
  • 64. /^1?$|^(11+?)\1+$/ Matches a composite number
  • 65. /^1?$|^(11+?)\1+$/ Matches a string of 1s if and only if there is a non-prime number of 1s
  • 66. Integer#prime?
        class Integer
          def prime?
            "1" * self !~ /^1?$|^(11+?)\1+$/
          end
        end
        No performance guarantee. Attributed to the Perl hacker Abigail
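The primality pattern can be tried without reopening Integer. In this sketch, unary_prime? is a hypothetical standalone helper that wraps the same regular expression, and the range 0..20 is chosen only for illustration.

```ruby
# Hypothetical helper: n is prime when n written as n copies of "1"
# is neither "" or "1", nor some group of two or more 1s repeated.
def unary_prime?(n)
  ("1" * n) !~ /^1?$|^(11+?)\1+$/
end

primes = (0..20).select { |n| unary_prime?(n) }
p primes
```

The backreference makes the engine try each candidate divisor in turn, so this is a curiosity and not a practical primality test.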
  • 67. @hiro_asari • GitHub: BanzaiMan