# Regexp secrets

Regexp class is in every Rubyist's toolbox. But do you know the theory behind it, and what goes on under the hood?

Published in: Technology

### Regexp secrets

1. 1. Secrets of Regexp Hiro Asari Red Hat, Inc.
2. 2. Let's Talk About Regular Expressions
3. 3. Let's Talk About Regular Expressions • There is no regular expression
4. 4. Let's Talk About Regular Expressions • A good approximation as a name
5. 5. Let's Talk About Regexp
6. 6. Some people, when confronted with a problem, think, "I know, I'll use regular expressions." Now they have two problems. Jamie Zawinski, 12 Aug 1997. http://regex.info/blog/2006-09-15/247 http://www.codinghorror.com/blog/2008/06/regular-expressions-now-you-have-two-problems.html The point is not so much the evils of regular expressions as the evils of overusing them.
7. 7. Formal Language Theory • The Language L • Over Alphabet Σ
8. 8. Formal Language Theory • Alphabet Σ = {a, b, c, d, e, …, z, λ} (example)
9. 9. Formal Language Theory • Alphabet Σ = {a, b, c, d, e, …, z, λ} (example) • Words over Σ: "a", "b", "ab", "aequafdhfad"
10. 10. Formal Language Theory • Alphabet Σ = {a, b, c, d, e, …, z, λ} (example) • Words over Σ: "a", "b", "ab", "aequafdhfad" • Σ*: The set of all words over Σ
11. 11. Formal Language over Σ • A subset L of Σ* (with various properties) • L can be finite, simply enumerating its well-formed words, but it is often infinite
12. 12. Example • Language L over Σ = {a, b} • a is a word • A word may be obtained by appending ab to an existing word • Only words thus formed are legal
13. 13. Well-formed words: a, aab, aabab
14. 14. Ill-formed words: baaaababb
15. 15. Succinctly… • a(ab)*
16. 16. Expression • A textual representation of a formal language, against which an input is tested to determine whether it is a well-formed word in that language
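
As a minimal sketch of testing an input against the expression from slide 15, the Ruby below anchors a(ab)* with \A and \z so that the whole input must be a well-formed word. The constant and method names are illustrative assumptions, not something taken from the talk.

```ruby
# Anchored form of a(ab)*: the entire input has to be a word of the language.
WELL_FORMED = /\Aa(ab)*\z/

# Hypothetical helper name, chosen for this sketch only.
def well_formed?(word)
  WELL_FORMED.match?(word)
end

%w[a aab aabab b aaa ab].each do |word|
  puts "#{word}: #{well_formed?(word)}"
end
```

Without the anchors the expression would also succeed on any input that merely contains an a somewhere, which is a different question from membership in the language.
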
17. 17. Regular Languages • ∅ (empty language) is regular
18. 18. Regular Languages • ∅ (empty language) is regular • For each a ∈ Σ (a belongs to Σ), the singleton language {a} is a regular language.
19. 19. Regular Languages • ∅ (empty language) is regular • For each a ∈ Σ (a belongs to Σ), the singleton language {a} is a regular language. • If A and B are regular languages, then A ∪ B (union), A•B (concatenation), and A* (Kleene star) are regular languages
20. 20. Regular Languages • ∅ (empty language) is regular • For each a ∈ Σ (a belongs to Σ), the singleton language {a} is a regular language. • If A and B are regular languages, then A ∪ B (union), A•B (concatenation), and A* (Kleene star) are regular languages • No other languages over Σ are regular.
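
The three closure operations above map fairly directly onto Ruby regexps. The following is a rough sketch under that reading; the constants and the `member?` helper are hypothetical names introduced here for illustration.

```ruby
# Two singleton languages, {a} and {b}, written as Ruby regexps.
A = /a/
B = /b/

UNION  = Regexp.union(A, B) # A ∪ B
CONCAT = /#{A}#{B}/         # A•B
STAR   = /(?:#{A})*/        # A* (Kleene star)

# Hypothetical helper: true when the whole word belongs to the language.
def member?(language, word)
  /\A(?:#{language})\z/.match?(word)
end

puts member?(UNION, "b")
puts member?(CONCAT, "ab")
puts member?(STAR, "")
```
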
21. 21. Regular Expressions • Expressions of regular languages
22. 22. Regular Expressions • Not expressions of regular languages
23. 23. Regular? Expressions • It turns out that some expressions are more powerful and express non-regular languages • Language of squares: (.*)\1 • a, aa, aaaa, WikiWiki
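
A small sketch of the language of squares, assuming the expression is anchored so that the whole input must consist of some word repeated twice. `SQUARE` and `square?` are illustrative names.

```ruby
# \1 must repeat exactly what group 1 captured, so the input is "ww" for some w.
SQUARE = /\A(.*)\1\z/

def square?(word)
  SQUARE.match?(word)
end

%w[aa aaaa WikiWiki aba].each do |word|
  puts "#{word}: #{square?(word)}"
end
```

Note that the unanchored form (.*)\1 succeeds on any input, because the group can capture the empty string; the anchors are what turn it into a membership test.
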
24. 24. How does Regexp work? • Build a finite state automaton representing a given regular expression • Feed the String to the automaton and see if the match succeeds
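25. 25. a [automaton diagram; edge label: a]
26. 26. ab* [automaton diagram; edge labels: a, b]
27. 27. .* [automaton diagram; edge label: .]
28. 28. a$ [automaton diagram; edge labels: a, $]
29. 29. a? [automaton diagram; edge labels: a, ε]
30. 30. a|b [automaton diagram; edge labels: a, b]
31. 31. (ab|c) [automaton diagram; edge labels: a, b, c]
32. 32. (ab+|c) [automaton diagram; edge labels: b, a, b, c]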
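
To make the "build an automaton, then feed it the String" idea concrete, here is a hand-built automaton for ab* in Ruby. This is only a sketch: the state names and table layout are assumptions made for illustration and say nothing about how Ruby's Regexp engine is actually implemented.

```ruby
# Hypothetical transition table for ab*: one edge on "a", then a loop on "b".
TRANSITIONS = {
  [:start, "a"]  => :seen_a,
  [:seen_a, "b"] => :seen_a
}.freeze
ACCEPTING = [:seen_a].freeze

# Walk the automaton one character at a time; accept only if the walk
# ends in an accepting state after consuming the whole input.
def accepts?(input)
  state = :start
  input.each_char do |char|
    state = TRANSITIONS[[state, char]]
    return false if state.nil? # no edge for this character
  end
  ACCEPTING.include?(state)
end

%w[a abbb ba aba].each do |word|
  puts "#{word}: #{accepts?(word)}"
end
```
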
33. 33. Match is attempted at every character, left to right
34. 34. /a$/

        zyxwvutsrqponmlkjihgfedcba
        ^

    Regexp does not think, "a$ can match only at the end of the line, so we should fast-forward to the end of the line."
35. 35. /a$/

        zyxwvutsrqponmlkjihgfedcba
        ^
        zyxwvutsrqponmlkjihgfedcba
         ^
36. 36. /a$/

        zyxwvutsrqponmlkjihgfedcba
        ^
        zyxwvutsrqponmlkjihgfedcba
         ^
        zyxwvutsrqponmlkjihgfedcba
          ^
37. 37. /a$/

        zyxwvutsrqponmlkjihgfedcba
        ^
        zyxwvutsrqponmlkjihgfedcba
         ^
        zyxwvutsrqponmlkjihgfedcba
          ^
        zyxwvutsrqponmlkjihgfedcba
           ^
38. 38. /a$/

        zyxwvutsrqponmlkjihgfedcba
        ^
        zyxwvutsrqponmlkjihgfedcba
         ^
        zyxwvutsrqponmlkjihgfedcba
          ^
        zyxwvutsrqponmlkjihgfedcba
           ^
        ⋮
        zyxwvutsrqponmlkjihgfedcba
                                 ^
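
The left-to-right scan on these slides can be approximated in Ruby with StringScanner, whose `match?` only succeeds at the current scan pointer. This is a sketch of the idea rather than the engine's real search loop; `naive_search` is a hypothetical helper and it assumes ASCII input, since `pos` is a byte offset.

```ruby
require 'strscan'

# Try the pattern at each start position, left to right.
# Returns [match_position_or_nil, number_of_attempts].
def naive_search(pattern, text)
  scanner = StringScanner.new(text)
  attempts = 0
  (0..text.length).each do |start|
    attempts += 1
    scanner.pos = start
    # match? succeeds only if the pattern matches exactly at the scan pointer.
    return [start, attempts] if scanner.match?(pattern)
  end
  [nil, attempts]
end

text = ('a'..'z').to_a.reverse.join # "zyxwvutsrqponmlkjihgfedcba"
position, attempts = naive_search(/a$/, text)
puts "matched at #{position} after #{attempts} attempts"
```
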
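40. 40. a?a?a?…a?aaa…a

        # Build a?a?…a? followed by aa…a, each repeated n times.
        def pathological(n = 5)
          Regexp.new('a?' * n + 'a' * n)
        end

        # Print a timestamp after each successful match of n a's.
        1.upto(40) do |n|
          print n, ": "
          print Time.now, "\n" if 'a' * n =~ pathological(n)
        end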
41. 41. a?a?a?aaa

        aaa
        ^
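
To see why this pattern is hard on a backtracking matcher, the toy matcher below counts how many steps it takes to match n a's against a? repeated n times followed by a repeated n times. It is a deliberately simplified sketch, not Ruby's actual engine, and all names in it are hypothetical; real timings depend on the Ruby version and any optimizations its engine applies.

```ruby
# Tokens: :opt stands for "a?", :lit stands for "a".
def backtrack(tokens, input, token_index, pos, counter)
  counter[:steps] += 1
  return pos == input.length if token_index == tokens.length

  case tokens[token_index]
  when :lit
    input[pos] == "a" && backtrack(tokens, input, token_index + 1, pos + 1, counter)
  when :opt
    # Greedy: first try consuming an "a", then fall back to matching nothing.
    (input[pos] == "a" && backtrack(tokens, input, token_index + 1, pos + 1, counter)) ||
      backtrack(tokens, input, token_index + 1, pos, counter)
  end
end

# Returns [matched?, steps] for the pattern and input of size n.
def steps_for(n)
  counter = { steps: 0 }
  matched = backtrack([:opt] * n + [:lit] * n, "a" * n, 0, 0, counter)
  [matched, counter[:steps]]
end

[4, 8, 12, 16].each do |n|
  matched, steps = steps_for(n)
  puts "n=#{n}: matched=#{matched}, steps=#{steps}"
end
```

The match always succeeds, but only after the greedy choices have been undone one by one, so the step count grows exponentially with n.
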
42. 42. Regexp tips
43. 43. Use /x

        UP_TO_256 = /\b(?:25[0-5]   # 250-255
        |2[0-4][0-9]                # 200-249
        |1[0-9][0-9]                # 100-199
        |[1-9][0-9]                 # 2-digit numbers
        |[0-9])                     # single-digit numbers
        \b/x
        IPV4_ADDRESS = /#{UP_TO_256}(?:\.#{UP_TO_256}){3}/
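
A short usage sketch for the slide's pattern. The two constants are repeated from the slide with \b and \. written out explicitly; the `ipv4?` helper and its \A…\z anchoring are assumptions added here so that the whole string has to be an address.

```ruby
# Constants repeated from the slide; /x lets the pattern carry comments.
UP_TO_256 = /\b(?:25[0-5]   # 250-255
|2[0-4][0-9]                # 200-249
|1[0-9][0-9]                # 100-199
|[1-9][0-9]                 # 2-digit numbers
|[0-9])                     # single-digit numbers
\b/x
IPV4_ADDRESS = /#{UP_TO_256}(?:\.#{UP_TO_256}){3}/

# Hypothetical helper: the entire string must be a dotted quad.
def ipv4?(text)
  /\A#{IPV4_ADDRESS}\z/.match?(text)
end

["192.168.0.1", "256.1.1.1", "1.2.3"].each do |candidate|
  puts "#{candidate}: #{ipv4?(candidate)}"
end
```

Interpolating a Regexp keeps its own flags, so the /x comments in UP_TO_256 still apply inside IPV4_ADDRESS even though the outer literal has no /x.
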
44. 44. \A, \z for strings; ^, $ for lines • \A: the beginning of the string • \z: the end of the string • ^: after \n • $: before \n
45. 45. \A, \z for strings; ^, $ for lines • \A: the beginning of the string • \z: the end of the string • ^: after \n • $: before \n • Always in Ruby
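
The difference between the string anchors and the line anchors shows up as soon as the input contains a newline. A minimal sketch, using a two-line string chosen for illustration:

```ruby
text = "abc\ndef"

ANCHOR_RESULTS = {
  line_start:   text =~ /^d/,  # ^ matches right after the "\n"
  string_start: text =~ /\Ad/, # \A matches only at the very beginning
  line_end:     text =~ /c$/,  # $ matches right before the "\n"
  string_end:   text =~ /c\z/  # \z matches only at the very end
}.freeze

ANCHOR_RESULTS.each { |name, index| puts "#{name}: #{index.inspect}" }
```
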
46. 46. What's the problem? Also note the difference in what /m means.
47. 47. What's the problem?

        #! /usr/bin/env perl
        $a = "abc\ndef";
        if ($a =~ /^d/) { print "yes\n"; }
        if ($a =~ /^d/m) { print "yes now\n"; }
        # prints yes now

    Also note the difference in what /m means.
48. 48. What's the problem?

        #! /usr/bin/env ruby
        a = "abc\ndef";
        if (a =~ /^d/)
          p "yes"
        end

    http://guides.rubyonrails.org/security.html#regular-expressions
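
On the remark about /m: in Perl, /m switches ^ and $ to line mode, whereas in Ruby they are always in line mode and /m instead lets . match a newline (roughly what Perl's /s does). A small sketch of the Ruby side, with illustrative variable names:

```ruby
text = "abc\ndef"

caret_without_flag = text =~ /^d/    # ^ already matches after "\n" in Ruby
dot_without_flag   = text =~ /c.d/   # . does not match "\n" by default
dot_with_m         = text =~ /c.d/m  # with /m, . matches "\n" as well

puts caret_without_flag.inspect
puts dot_without_flag.inspect
puts dot_with_m.inspect
```
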
49. 49. Security Implications

        class File < ActiveRecord::Base
          validates :name, :format => /^[\w\.\-\+]+$/
        end

    http://guides.rubyonrails.org/security.html#regular-expressions
54. 54.

        file.txt\n<script>alert('hello')</script>
        /^[\w\.\-\+]+$/

    Match succeeds. ActiveRecord validation succeeds.
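
The bypass can be reproduced without Rails by applying the same pattern to the payload directly. In this sketch the constant names are hypothetical; the second pattern swaps ^ and $ for \A and \z, which is the usual fix for this kind of check.

```ruby
LINE_ANCHORED   = /^[\w\.\-\+]+$/   # pattern from the slide
STRING_ANCHORED = /\A[\w\.\-\+]+\z/ # same character class, string anchors

payload = "file.txt\n<script>alert('hello')</script>"

# The first line alone satisfies ^…$, so the line-anchored check passes.
puts LINE_ANCHORED.match?(payload)
# With \A…\z the newline and the script tag make the match fail.
puts STRING_ANCHORED.match?(payload)
puts STRING_ANCHORED.match?("file.txt")
```
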