# Introduction to Regular Expressions

^Regular expressions are one of those tools that every developer should have in their toolbox. You can do your job without regular expressions, but knowing when and how to use them will make you a much more efficient and marketable developer. You'll learn how regular expressions can be used for validating user input, parsing text, and refactoring code. We'll also cover various tools that can help you write and share expressions.\$

Published in: Technology, News & Politics

### Introduction to Regular Expressions

1. 1. Introduction to Regular Expressions<br />Matt Casto<br />http://google.com/profiles/mattcasto<br />
2. 2. Introduction to Regular Expressions<br />Matt Casto<br />Quick Solutions<br />http://google.com/profiles/mattcasto<br />
3. 3. “Some people, when confronted with a problem, think ‘I know, I'll use regular expressions.’ Now they have two problems.”<br />- Jamie Zawinski, August 12, 1997<br />
5. 5. What are Regular Expressions?<br />`^\w+@[a-zA-Z_]+?\.[a-zA-Z]{2,3}$`<br />`[\w-]+@([\w-]+\.)+[\w-]+`<br />`^.+@[^\.].*\.[a-z]{2,}$`<br />`^([a-zA-Z0-9_\-\.]+)@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.)|(([a-zA-Z0-9\-]+\.)+))([a-zA-Z]{2,4}|[0-9]{1,3})(\]?)$`<br />
7. 7. History<br />Stephen Cole Kleene<br />American mathematician credited with inventing regular expressions in the 1950s using a mathematical notation called regular sets.<br />
8. 8. History<br />Ken Thompson<br />American pioneer of computer science who, among many other things, used Kleene’s regular sets for searching in his QED and ed text editors.<br />
9. 9. History<br />grep<br />Global Regular Expression Print<br />
10. 10. History<br />Henry Spencer<br />Wrote the regex library that the Perl and Tcl languages used for regular expressions.<br />
11. 11. Why Should You Care?<br />Example: finding duplicate words in a file.<br />Requirements:<br /><ul><li>Output lines that contain duplicate words</li><li>Find doubled words that span lines</li><li>Ignore capitalization differences</li><li>Ignore HTML tags</li></ul>
15. 15. Why Should You Care?<br />Example: finding duplicate words in a file.<br />Solution:<br />`$/ = ".\n";`<br />`while (<>) {`<br />`    next if !s/\b([a-z]+)((?:\s|<[^>]+>)+)(\1\b)/\e[7m$1\e[m$2\e[7m$3\e[m/ig;`<br />`    s/^(?:[^\e]*\n)+//mg;   # remove any unmarked lines`<br />`    s/^/$ARGV: /mg;         # prefix each line with the file name`<br />`    print;`<br />`}`<br />
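
For comparison, the same requirements can be sketched in Python. This is a rough sketch rather than a port of the Perl script: it assumes the whole text is already in memory, it reports doubled words instead of highlighting them, and the names `DOUBLED`, `find_doubled_words`, and `sample` are made up for illustration.

```python
import re

# A word, then whitespace and/or HTML tags, then the same word again.
# re.IGNORECASE lets the back reference \1 ignore capitalization, and
# \s also matches line breaks, so pairs that span lines are found too.
DOUBLED = re.compile(r"\b([a-z]+)((?:\s|<[^>]+>)+)(\1\b)", re.IGNORECASE)


def find_doubled_words(text):
    """Return the first word of every doubled pair found in text."""
    return [match.group(1) for match in DOUBLED.finditer(text)]


sample = "This is is a <b>test</b> test of the\nthe doubled word finder."
print(find_doubled_words(sample))
```

The word boundaries keep the "is" inside "This" from pairing with the following "is".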
17. 17. Literal Characters<br />Any character except a small list of reserved characters.<br />Regex: `is`<br />Target string: Jack is a boy<br />Result: one match, the word “is”<br />
18. 18. Literal Characters<br />Literals will match characters in the middle of words.<br />Regex: `a`<br />Target string: Jack is a boy<br />Result: two matches, the “a” in “Jack” and the word “a”<br />
19. 19. Literal Characters<br />Literals are case sensitive – capitalization matters!<br />Regex: `j`<br />Target string: Jack is a boy<br />Result: NOT a match<br />
20. 20. Special Characters<br />`[ \ ^ $ . | ? * + ( )`<br />
21. 21. Special Characters<br />You can match special characters by escaping them with a backslash.<br />`1\+1=2`<br />I wrote 1+1=2 on the chalkboard.<br />
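
The effect of escaping can be checked with a short Python sketch. It uses the standard `re` module; the variable names are illustrative only.

```python
import re

text = "I wrote 1+1=2 on the chalkboard."

# Unescaped, + is a quantifier: the pattern means "one or more 1s, then 1=2".
unescaped = re.search(r"1+1=2", text)

# Escaped, \+ matches a literal plus sign.
escaped = re.search(r"1\+1=2", text)

# re.escape builds an escaped pattern from a literal string.
auto = re.search(re.escape("1+1=2"), text)

print(unescaped, escaped.group(), auto.group())
```

`re.escape` is handy when the literal text comes from user input rather than from the pattern author.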
22. 22. Special Characters<br />Some characters, such as { and }, are reserved only in certain contexts.<br />`if \(true\) {`<br />else if (true) { beep; }<br />
23. 23. Non-Printable Characters<br />Some literal characters can be escaped to represent non-printable characters.<br />`\t` – tab<br />`\r` – carriage return<br />`\n` – line feed<br />`\a` – bell<br />`\e` – escape<br />`\f` – form feed<br />`\v` – vertical tab<br />
24. 24. Period<br />The period character matches any single character (by default, in most engines, any character except a line break).<br />`a.boy`<br />Jack is a boy<br />
25. 25. Character Classes<br />Used to match only one of the characters inside square brackets.<br />`[Gg]r[ae]y`<br />Grayson drives a grey sedan.<br />
26. 26. Character Classes<br />Hyphen is a reserved character inside a character class and indicates a range.<br />`[0-9a-fA-F]`<br />The HTML code for White is #FFFFFF<br />
27. 27. Character Classes<br />Caret inside a character class negates the match.<br />`q[^u]`<br />Qatar is home to quite a lot of Iraqi citizens, but is not a city in Iraq<br />
28. 28. Character Classes<br />Most special characters are treated as literals inside character classes. Only `]`, `\`, `^`, and `-` are reserved.<br />`[+*]`<br />6 * 7 and 18 + 24 both equal 42<br />
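
The character-class slides can be tried out with Python's `re.findall`, which returns every non-overlapping match. This is only a sketch; the variable names are illustrative.

```python
import re

# One character from each set: G or g, then r, then a or e, then y.
gray = re.findall(r"[Gg]r[ae]y", "Grayson drives a grey sedan.")

# A negated class still has to consume one character, so the q at the
# very end of "Iraq" does not match.
sentence = ("Qatar is home to quite a lot of Iraqi citizens, "
            "but is not a city in Iraq")
q_not_u = re.findall(r"q[^u]", sentence)

# + and * are plain literals inside a character class.
operators = re.findall(r"[+*]", "6 * 7 and 18 + 24 both equal 42")

print(gray, q_not_u, operators)
```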
29. 29. Shorthand Character Classes<br />`\d` – digit or `[0-9]`<br />`\w` – word character or `[A-Za-z0-9_]`<br />`\s` – whitespace or `[ \t\r\n]` (space, tab, CR, LF)<br />`[\s\d]`<br />1 + 2 = 3<br />
30. 30. Shorthand Character Classes<br />`\D` – non-digit or `[^\d]`<br />`\W` – non-word character or `[^\w]`<br />`\S` – non-whitespace or `[^\s]`<br />`[\D]`<br />1 + 2 = 3<br />
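
A quick Python sketch of the shorthand classes follows. One caveat specific to Python 3: for text patterns, `\d`, `\w`, and `\s` are Unicode-aware by default, so the ASCII equivalences listed on the slides only hold exactly when the `re.ASCII` flag is passed. The variable names are illustrative.

```python
import re

text = "1 + 2 = 3"

# Digits and whitespace: everything except the + and = signs.
digits_or_space = re.findall(r"[\s\d]", text)

# Non-digits: the four spaces plus the + and = signs.
non_digits = re.findall(r"[\D]", text)

# With re.ASCII, \w is exactly [A-Za-z0-9_], so the accented letter
# splits the word; without the flag the whole string is one word.
ascii_words = re.findall(r"\w+", "caf\u00e9_1", re.ASCII)
unicode_words = re.findall(r"\w+", "caf\u00e9_1")

print(len(digits_or_space), len(non_digits), ascii_words, unicode_words)
```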
31. 31. Repetition<br />The asterisk repeats the preceding token (here, a character class) 0 or more times.<br />`<[A-Za-z][A-Za-z0-9]*>`<br />&lt;HTML&gt;Regex is &lt;b&gt;Awesome&lt;/b&gt;&lt;/HTML&gt;<br />
32. 32. Repetition<br />The plus repeats the preceding token 1 or more times.<br />`<[A-Za-z0-9]+>`<br />Watch out for invalid &lt;HTML&gt; tags like &lt;1&gt; and &lt;&gt;!<br />
33. 33. Repetition<br />The question mark repeats the preceding token 0 or 1 times, in effect making it optional.<br />`</?[A-Za-z][A-Za-z0-9]*>`<br />&lt;HTML&gt;Regex is &lt;b&gt;Awesome&lt;/b&gt;&lt;/HTML&gt;<br />
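
The three repetition operators can be compared side by side in Python. This is a small sketch using the slide patterns; the variable names are illustrative.

```python
import re

html = "<HTML>Regex is <b>Awesome</b></HTML>"
invalid = "Watch out for invalid <HTML> tags like <1> and <>!"

# * : a letter, then zero or more letters or digits (opening tags only).
opening = re.findall(r"<[A-Za-z][A-Za-z0-9]*>", html)

# + : one or more letters or digits, so <1> matches but <> does not.
loose = re.findall(r"<[A-Za-z0-9]+>", invalid)

# ? : the slash is optional, so closing tags match as well.
all_tags = re.findall(r"</?[A-Za-z][A-Za-z0-9]*>", html)

print(opening, loose, all_tags)
```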
34. 34. Anchors<br />The caret anchor matches the position before the first character in a string.<br />`^vac`<br />vacation evacuation<br />
35. 35. Anchors<br />The dollar sign anchor matches the position after the last character in a string.<br />`tion$`<br />vacation evacuation<br />
36. 36. Anchors<br />The caret and dollar sign anchors match the start and end of each line if the engine has multi-line mode turned on.<br />`tion$`<br />vacation evacuation<br />has ruined my evaluation<br />
37. 37. Anchors<br />The `\A` and `\Z` anchors are like `^` and `$`, but match only at the start and end of the string, even in multi-line mode.<br />`tion\Z`<br />vacation evacuation<br />has ruined my evaluation<br />
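
The difference between the anchors shows up when counting matches in a two-line string. The sketch below uses Python, where multi-line mode is the `re.MULTILINE` flag; other engines name the option differently, and the variable names are illustrative.

```python
import re

text = "vacation evacuation\nhas ruined my evaluation"

# ^ only matches at the very start, so the "vac" in "evacuation" is skipped.
starts = re.findall(r"^vac", text)

# Without multi-line mode, $ only matches at the end of the string.
end_default = re.findall(r"tion$", text)

# With multi-line mode, $ also matches at the end of the first line.
end_multiline = re.findall(r"tion$", text, re.MULTILINE)

# \Z ignores multi-line mode and matches only at the end of the string.
end_of_string = re.findall(r"tion\Z", text, re.MULTILINE)

print(len(starts), len(end_default), len(end_multiline), len(end_of_string))
```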
38. 38. Word Boundaries<br />The `\b` anchor matches…<br /><ul><li>the position before the first character in a string, if it is a word character (like `^`)</li><li>the position after the last character in a string, if it is a word character (like `$`)</li><li>between two characters where one is a word character and the other is not</li></ul>`\b4\b`<br />We’ve got 4 orders for 44 lbs of C4<br />
41. 41. Word Boundaries<br />The `\B` anchor is the negated word boundary – any position between two word characters or two non-word characters.<br />`\Bat\B`<br />vacation evacuation at that<br />time ate my evaluation<br />
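
Both word-boundary examples can be verified with a short Python sketch. The variable names are illustrative, and the start offset shown assumes the sample sentence is typed exactly as below.

```python
import re

orders = "We've got 4 orders for 44 lbs of C4"

# Only the standalone 4 has a word boundary on both sides;
# the digits in "44" and "C4" touch another word character.
standalone = [m.start() for m in re.finditer(r"\b4\b", orders)]

words = "vacation evacuation at that\ntime ate my evaluation"

# \B requires a non-boundary on both sides, so only an "at" buried inside
# a word matches: vacation, evacuation, evaluation.
inner_at = re.findall(r"\Bat\B", words)

print(standalone, len(inner_at))
```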
42. 42. Alternation<br />The pipe symbol separates two or more alternatives, any one of which can match.<br />`cat|dog`<br />A cat and dog are expected to follow<br />the dogma that their presence with one<br />another leads to catastrophe.<br />
43. 43. Alternation<br />Each alternative includes everything on its side of the pipe, including anchors and shorthand classes.<br />`\bcat|dog\b`<br />A cat and dog are expected to follow<br />the dogma that their presence with one<br />another leads to catastrophe.<br />
44. 44. Alternation<br />Use parentheses to group alternating matches when you want to limit the reach of alternation.<br />`\b(cat|dog)\b`<br />A cat and dog are expected to follow<br />the dogma that their presence with one<br />another leads to catastrophe.<br />
45. 45. Eagerness<br />The engine is eager: it stops at the first alternative that matches, so the order of the alternatives matters.<br />`and|android`<br />A robot and an android fight. The ninja wins.<br />
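
Swapping the alternatives shows why order matters. This is a small Python sketch; the variable names are illustrative.

```python
import re

text = "A robot and an android fight. The ninja wins."

# "and" is tried first and succeeds inside "android",
# so the longer alternative is never reached.
short_first = re.findall(r"and|android", text)

# Listing the longer alternative first lets it win where it applies.
long_first = re.findall(r"android|and", text)

print(short_first, long_first)
```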
46. 46. Greediness<br />Greediness means that the engine will always try to match as much as possible.<br />`an\S+`<br />A robot and an android fight. The ninja wins.<br />
47. 47. Laziness<br />Laziness, or reluctance, modifies a repetition operator to match only as much as it needs to.<br />`an\S+?`<br />A robot and an android fight. The ninja wins.<br />
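
Running the greedy and lazy patterns against the same sentence makes the contrast concrete. This is a Python sketch with illustrative variable names.

```python
import re

text = "A robot and an android fight. The ninja wins."

# Greedy: \S+ takes every non-whitespace character it can.
greedy = re.findall(r"an\S+", text)

# Lazy: \S+? stops after the single character it is required to match.
lazy = re.findall(r"an\S+?", text)

print(greedy, lazy)
```

The standalone word "an" matches neither pattern, because `\S+` needs at least one non-whitespace character after it.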
48. 48. Limiting Repetition<br />You can limit repetition with curly braces.<br />`\d{2,4}`<br />1 111111111 11111<br />
49. 49. Limiting Repetition<br />The second number can be omitted to mean no upper limit.<br />Essentially `{0,}` is the same as `*` and `{1,}` is the same as `+`.<br />`\d{2,}`<br />1 11111111111111<br />
50. 50. Limiting Repetition<br />A single number can be used to match an exact number of times.<br />`\d{4}`<br />1 11 111 1111 11111<br />
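
The brace quantifiers can be explored in Python. In this sketch the runs of digits are built with string multiplication so their lengths are explicit; the lengths (9, 5, and 14) are chosen for illustration and the variable names are made up.

```python
import re

# {2,4}: between two and four digits, as many as possible each time.
# A run of nine 1s yields 4 + 4 and leaves a single 1 unmatched.
bounded = re.findall(r"\d{2,4}", "1 " + "1" * 9 + " " + "1" * 5)

# {2,}: two or more digits with no upper limit.
open_ended = re.findall(r"\d{2,}", "1 " + "1" * 14)

# {4}: exactly four digits; the run of five leaves one digit over.
exact = re.findall(r"\d{4}", "1 11 111 1111 11111")

print(bounded, open_ended, exact)
```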
51. 51. Back References<br />Parentheses around part of a pattern group it and capture what it matched, which a back reference such as `\1` can then match again.<br />`([ai]).\1.\1`<br />The magician said abracadabra!<br />
52. 52. Named Groups<br />Named groups let you reference matched groups by name rather than just by index.<br />`(?<vowel>[ai]).\k<vowel>.\1`<br />The magician said abracadabra!<br />
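
Here is how the two slides above look in Python. Note one syntax difference: Python's `re` module writes named groups as `(?P<name>...)` and named back references as `(?P=name)`, rather than the `(?<name>...)` and `\k<name>` forms shown on the slide. The variable names are illustrative.

```python
import re

text = "The magician said abracadabra!"

# \1 must repeat whatever group 1 captured, so the same vowel
# has to appear three times with one character between each.
numbered = re.search(r"([ai]).\1.\1", text)

# The same pattern with a named group; a named group still gets a number,
# so \1 keeps working alongside (?P=vowel).
named = re.search(r"(?P<vowel>[ai]).(?P=vowel).\1", text)

print(numbered.group(), named.group("vowel"))
```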
53. 53. Negative Lookahead<br />A negative lookahead matches only where the given pattern does not follow; the lookahead itself consumes no characters.<br />`q(?!u)`<br />Qatar is home to quite a lot of Iraqi citizens, but is not a city in Iraq<br />
54. 54. Positive Lookahead<br />A positive lookahead requires the given pattern to follow, without including it in the match.<br />`q(?=u)`<br />Qatar is home to quite a lot of Iraqi citizens, but is not a city in Iraq<br />
55. 55. Positive & Negative Lookbehind<br />Lookbehinds are just like lookaheads, but look backwards from the current position. `(?<=a)` is the positive form and `(?<!a)` the negative form.<br />`(?<=a)q`<br />Qatar is home to quite a lot of Iraqi citizens, but is not a city in Iraq<br />
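
A Python sketch of the three lookaround patterns follows, with the negated character class `q[^u]` for contrast. The variable names are illustrative.

```python
import re

text = ("Qatar is home to quite a lot of Iraqi citizens, "
        "but is not a city in Iraq")

# Negative lookahead: a q not followed by u. Unlike q[^u], it needs no
# following character, so the final q in "Iraq" matches too.
not_before_u = re.findall(r"q(?!u)", text)
negated_class = re.findall(r"q[^u]", text)

# Positive lookahead: the u must be there but is not part of the match.
before_u = re.findall(r"q(?=u)", text)

# Positive lookbehind: a q that is immediately preceded by an a.
after_a = re.findall(r"(?<=a)q", text)

print(not_before_u, negated_class, before_u, after_a)
```

The capital Q in "Qatar" never matches, since matching is case sensitive by default.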
56. 56. Resources<br />Lots of web pages<br />http://del.icio.us/mattcasto/regex<br />“Mastering Regular Expressions”<br /> by Jeffrey Friedl<br />http://oreilly.com/catalog/9780596528126/<br />