Introduction to Regular Expressions Ben Brumfield THATCamp Texas 2011
Introduction to regular expressions

Introduction to Regular Expressions for THATCamp Texas 2011

  1. 1. Introduction to Regular Expressions Ben Brumfield THATCamp Texas 2011
  2. 2. What are Regular Expressions? <ul><li>Very small language for describing text. </li></ul><ul><li>Not a programming language. </li></ul><ul><li>Incredibly powerful tool for search/replace operations. </li></ul><ul><li>Arcane art. </li></ul><ul><li>Ubiquitous. </li></ul>
  3. 3. Why Use Regular Expressions? <ul><li>Finding every instance of a string in a file – i.e. every mention of “chickens” in a farm diary </li></ul><ul><li>How many times does “sing” appear in a text in all tenses and conjugations? </li></ul><ul><li>Reformatting dirty data </li></ul><ul><li>Validating input. </li></ul><ul><li>Command line work – listing files, grepping log files </li></ul>
  4. 4. The Basics <ul><li>A regex is a pattern enclosed within delimiters. </li></ul><ul><li>Most characters match themselves. </li></ul><ul><li>/THATCamp/ is a regular expression that matches “THATCamp”. </li></ul><ul><ul><li>Slash is the delimiter enclosing the expression. </li></ul></ul><ul><ul><li>“THATCamp” is the pattern. </li></ul></ul>
  5. 5. /at/ <ul><li>Matches strings with “a” followed by “t”. </li></ul>Athens aft atlas that hat at
  6. 6. /at/ <ul><li>Matches strings with “a” followed by “t”. </li></ul>Matches: the “at” in “atlas”, “that”, “hat”, and “at”. No match: “Athens” (capital “A”) and “aft” (the “a” is followed by “f”).
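A minimal sketch of this slide in Python, assuming the standard `re` module (the slides use Perl-style `/.../` delimiters, which Python does not have). It checks each sample word for the pattern `at`; the variable names are illustrative only.

```python
import re

# Sample words from the slide.
words = "Athens aft atlas that hat at".split()

# re.search looks for the pattern anywhere in the string, like /at/.
matched = [w for w in words if re.search(r"at", w)]
unmatched = [w for w in words if not re.search(r"at", w)]

print(matched)    # ['atlas', 'that', 'hat', 'at']
print(unmatched)  # ['Athens', 'aft']
```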
  7. 7. Some Theory <ul><li>Finite State Machine for the regex /at/ </li></ul>
  8. 8. Characters <ul><li>Matching is case sensitive. </li></ul><ul><li>Special characters: ( ) ^ $ { } [ ] \ | . + ? * </li></ul><ul><li>To match a special character in your text, precede it with \ in your pattern: </li></ul><ul><ul><li>/ironic [sic]/ does not match “ironic [sic]” </li></ul></ul><ul><ul><li>/ironic \[sic\]/ matches “ironic [sic]” </li></ul></ul><ul><li>Regular expressions can support Unicode. </li></ul>
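The escaping rule can be tried out with a short sketch, assuming Python's `re` module. `re.escape` is a Python convenience that is not mentioned on the slide; it builds the escaped pattern from a literal string.

```python
import re

text = "ironic [sic]"

# Unescaped, [sic] is a character class: one of "s", "i", or "c".
unescaped = re.search(r"ironic [sic]", text)

# Escaped brackets match a literal "[" and "]".
escaped = re.search(r"ironic \[sic\]", text)

# re.escape adds the backslashes for you.
auto = re.search(re.escape(text), text)

print(unescaped)        # None
print(escaped.group())  # ironic [sic]
print(auto.group())     # ironic [sic]
```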
  9. 9. Character Classes <ul><li>Characters within [ ] are choices for a single-character match. </li></ul><ul><li>Think of a set operation, or a type of “or”. </li></ul><ul><li>Order within the set is unimportant. </li></ul><ul><li>/x[01]/ matches “x0” and “x1”. </li></ul><ul><li>/[10][23]/ matches “02”, “03”, “12” and “13”. </li></ul><ul><li>An initial ^ negates the class: </li></ul><ul><ul><li>/[^45]/ matches all characters except 4 or 5. </li></ul></ul>
  10. 10. /[ch]at/ <ul><li>Matches strings with “c” or “h”, followed by “a”, followed by “t”. </li></ul>phat fat cat chat at that
  11. 11. /[ch]at/ <ul><li>Matches strings with “c” or “h”, followed by “a”, followed by “t”. </li></ul>Matches: the “hat” in “phat”, “cat”, the “hat” in “chat”, and the “hat” in “that”. No match: “fat” and “at”.
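A small sketch of the same check, assuming Python's `re` module. It records which part of each sample word the class `[ch]at` actually matches; `found` is an illustrative name.

```python
import re

words = ["phat", "fat", "cat", "chat", "at", "that"]
pattern = re.compile(r"[ch]at")

# Map each word to the substring that matched, or None if nothing did.
found = {}
for word in words:
    m = pattern.search(word)
    found[word] = m.group() if m else None

print(found)
# {'phat': 'hat', 'fat': None, 'cat': 'cat', 'chat': 'hat', 'at': None, 'that': 'hat'}
```

Note that “chat” matches on “hat”, not “cha”: the class consumes exactly one character.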
  12. 12. Ranges <ul><li>Ranges define sets of characters within a class. </li></ul><ul><ul><li>/[1-9]/ matches any non-zero digit. </li></ul></ul><ul><ul><li>/[a-zA-Z]/ matches any letter. </li></ul></ul><ul><ul><li>/[12][0-9]/ matches numbers between 10 and 29. </li></ul></ul>
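  13. 13. Shortcuts
<table>
<tr><th>Shortcut</th><th>Name</th><th>Equivalent Class</th></tr>
<tr><td>\d</td><td>digit</td><td>[0-9]</td></tr>
<tr><td>\D</td><td>not digit</td><td>[^0-9]</td></tr>
<tr><td>\w</td><td>word</td><td>[a-zA-Z0-9_]</td></tr>
<tr><td>\W</td><td>not word</td><td>[^a-zA-Z0-9_]</td></tr>
<tr><td>\s</td><td>space</td><td>[\t\n\r\f\v ]</td></tr>
<tr><td>\S</td><td>not space</td><td>[^\t\n\r\f\v ]</td></tr>
<tr><td>.</td><td>everything</td><td>[^\n] (depends on mode)</td></tr>
</table>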
  14. 14. /\d\d\d[- ]\d\d\d\d/ <ul><li>Matches strings with: </li></ul><ul><ul><li>Three digits </li></ul></ul><ul><ul><li>Space or dash </li></ul></ul><ul><ul><li>Four digits </li></ul></ul>653-6464x256 PE6-5000 713-342-7452 652.2648 234 1252 501-1234
  15. 15. /\d\d\d[- ]\d\d\d\d/ <ul><li>Matches strings with: </li></ul><ul><ul><li>Three digits </li></ul></ul><ul><ul><li>Space or dash </li></ul></ul><ul><ul><li>Four digits </li></ul></ul>Matches: “653-6464” in “653-6464x256”, “342-7452” in “713-342-7452”, “234 1252”, and “501-1234”. No match: “PE6-5000” and “652.2648”.
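The phone-number pattern can be checked against the slide's samples with a short sketch, assuming Python's `re` module and that the slide's pattern is `\d\d\d[- ]\d\d\d\d` (the backslashes were lost in the transcript). The names `candidates` and `hits` are illustrative.

```python
import re

candidates = ["653-6464x256", "PE6-5000", "713-342-7452",
              "652.2648", "234 1252", "501-1234"]

# Three digits, a space or dash, then four digits.
pattern = re.compile(r"\d\d\d[- ]\d\d\d\d")

# Record the matched substring for each candidate, or None.
hits = {}
for text in candidates:
    m = pattern.search(text)
    hits[text] = m.group() if m else None

for text, hit in hits.items():
    print(f"{text!r:16} -> {hit!r}")
```

In “713-342-7452” the match is “342-7452”: “713-” is followed by only three digits before the next dash.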
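  16. 16. Repeaters <ul><li>Symbols indicating that the preceding element of the pattern can repeat. </li></ul><ul><li>/runs?/ matches “runs” or “run”. </li></ul><ul><li>/1\d*/ matches any number beginning with “1”. </li></ul>
<table>
<tr><th>Repeater</th><th>Count</th></tr>
<tr><td>?</td><td>zero or one</td></tr>
<tr><td>+</td><td>one or more</td></tr>
<tr><td>*</td><td>zero or more</td></tr>
<tr><td>{n}</td><td>exactly n</td></tr>
<tr><td>{n,m}</td><td>between n and m times</td></tr>
<tr><td>{,m}</td><td>no more than m times</td></tr>
<tr><td>{n,}</td><td>at least n times</td></tr>
</table>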
  17. 17. Repeaters <ul><li>Strings: </li></ul><ul><li>1: “at” 2: “art” </li></ul><ul><li>3: “arrrrt” 4: “aft” </li></ul><ul><li>Patterns: </li></ul><ul><li>A: /ar?t/ B: /a[fr]?t/ </li></ul><ul><li>C: /ar*t/ D: /ar+t/ </li></ul><ul><li>E: /a.*t/ F: /a.+t/ </li></ul>
  18. 18. Repeaters <ul><li>/ar?t/ matches “at” and “art” but not “arrrrt”. </li></ul><ul><li>/a[fr]?t/ matches “at”, “art”, and “aft”. </li></ul><ul><li>/ar*t/ matches “at”, “art”, and “arrrrt”. </li></ul><ul><li>/ar+t/ matches “art” and “arrrrt” but not “at”. </li></ul><ul><li>/a.*t/ matches anything with an ‘a’ eventually followed by a ‘t’. </li></ul>
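The exercise from the previous two slides can be worked through mechanically. This is a sketch assuming Python's `re` module; it uses `re.fullmatch` so each pattern has to cover the whole string, which is a choice made here for clarity rather than something the slides specify.

```python
import re

strings = ["at", "art", "arrrrt", "aft"]
patterns = {
    "A": r"ar?t",
    "B": r"a[fr]?t",
    "C": r"ar*t",
    "D": r"ar+t",
    "E": r"a.*t",
    "F": r"a.+t",
}

# For each pattern, list the strings it matches in full.
results = {
    label: [s for s in strings if re.fullmatch(pattern, s)]
    for label, pattern in patterns.items()
}

for label, matched in results.items():
    print(label, patterns[label], matched)
```

Pattern F, which the answer slide does not cover, needs at least one character between “a” and “t”, so it rejects “at” but accepts the other three strings.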
  19. 19. Lab Session I <ul><li>http://gskinner.com/RegExr/ </li></ul><ul><li>https://gist.github.com/922838 </li></ul><ul><li>Match the titles “Mr.” and “Ms.”. </li></ul><ul><li>Find all conjugations and tenses of “sing”. </li></ul><ul><li>Find all places where more than one space follows punctuation. </li></ul>
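  20. 20. Lab Reference
<table>
<tr><th>Shortcut</th><th>Name</th></tr>
<tr><td>\d</td><td>digit</td></tr>
<tr><td>\D</td><td>not digit</td></tr>
<tr><td>\w</td><td>word</td></tr>
<tr><td>\W</td><td>not word</td></tr>
<tr><td>\s</td><td>space</td></tr>
<tr><td>\S</td><td>not space</td></tr>
<tr><td>.</td><td>everything</td></tr>
</table>
<table>
<tr><th>Repeater</th><th>Count</th></tr>
<tr><td>?</td><td>zero or one</td></tr>
<tr><td>+</td><td>one or more</td></tr>
<tr><td>*</td><td>zero or more</td></tr>
<tr><td>{n}</td><td>exactly n</td></tr>
<tr><td>{n,m}</td><td>between n and m times</td></tr>
<tr><td>{,m}</td><td>no more than m times</td></tr>
<tr><td>{n,}</td><td>at least n times</td></tr>
</table>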
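  21. 21. Anchors <ul><li>Anchors match between characters. </li></ul><ul><li>Used to assert that the characters you’re matching must appear in a certain place. </li></ul><ul><li>/\bat\b/ matches “at work” but not “batch”. </li></ul>
<table>
<tr><th>Anchor</th><th>Matches</th></tr>
<tr><td>^</td><td>start of line</td></tr>
<tr><td>$</td><td>end of line</td></tr>
<tr><td>\b</td><td>word boundary</td></tr>
<tr><td>\B</td><td>not boundary</td></tr>
<tr><td>\A</td><td>start of string</td></tr>
<tr><td>\Z</td><td>end of string</td></tr>
<tr><td>\z</td><td>raw end of string (rare)</td></tr>
</table>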
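A brief sketch of anchors in practice, assuming Python's `re` module. The `diary` text is made up for illustration, and the `re.MULTILINE` flag is Python's way of selecting the line-by-line behavior of `^`; other tools may default differently.

```python
import re

# \b asserts a word boundary without consuming a character.
at_work = bool(re.search(r"\bat\b", "at work"))
batch = bool(re.search(r"\bat\b", "batch"))
print(at_work, batch)  # True False

# Hypothetical two-line input.
diary = "fed chickens\nchickens laid eggs"

# By default, ^ matches only at the very start of the string.
first_only = re.findall(r"^\w+", diary)

# With re.MULTILINE, ^ also matches at the start of every line.
each_line = re.findall(r"^\w+", diary, flags=re.MULTILINE)

print(first_only)  # ['fed']
print(each_line)   # ['fed', 'chickens']
```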
  22. 22. Alternation <ul><li>In Regex, | means “or”. </li></ul><ul><li>You can put a full expression on the left and another full expression on the right. </li></ul><ul><li>Either can match. </li></ul><ul><li>/seeks?|sought/ matches “seek”, “seeks”, or “sought”. </li></ul>
  23. 23. Grouping <ul><li>Everything within ( … ) is grouped into a single element for the purposes of repetition and alternation. </li></ul><ul><li>The expression /(la)+/ matches “la”, “lala”, “lalalala” but not “all”. </li></ul><ul><li>/schema(ta)?/ matches “schema” and “schemata”; in “schematic” it matches only the leading “schema”, not the whole word. </li></ul>
  24. 24. Grouping Example <ul><li>What regular expression matches “eat”, “eats”, “ate” and “eaten”? </li></ul>
  25. 25. Grouping Example <ul><li>What regular expression matches “eat”, “eats”, “ate” and “eaten”? </li></ul><ul><li>/eat(s|en)?|ate/ </li></ul><ul><li>Add word boundary anchors to exclude “sate” and “eating”: /\b(eat(s|en)?|ate)\b/ </li></ul>
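The effect of the word boundary anchors can be seen by running both versions of the pattern over the same words. This is a sketch assuming Python's `re` module; `loose` and `strict` are illustrative names.

```python
import re

words = ["eat", "eats", "ate", "eaten", "sate", "eating"]

# Without anchors, the pattern also matches inside longer words.
loose = re.compile(r"eat(s|en)?|ate")

# With \b on both sides, the match must be a whole word.
strict = re.compile(r"\b(eat(s|en)?|ate)\b")

loose_hits = [w for w in words if loose.search(w)]
strict_hits = [w for w in words if strict.search(w)]

print(loose_hits)   # ['eat', 'eats', 'ate', 'eaten', 'sate', 'eating']
print(strict_hits)  # ['eat', 'eats', 'ate', 'eaten']
```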
  26. 26. Replacement <ul><li>Regex is most often used for search/replace. </li></ul><ul><li>Syntax varies; most scripting languages and CLI tools use s/pattern/replacement/. </li></ul><ul><li>s/dog/hound/ converts “slobbery dogs” to “slobbery hounds”. </li></ul><ul><li>s/\bsheeps\b/sheep/ converts </li></ul><ul><ul><li>“sheepskin is made from sheeps” to </li></ul></ul><ul><ul><li>“sheepskin is made from sheep” </li></ul></ul>
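Python has no `s/pattern/replacement/` operator, so this sketch uses `re.sub` from the standard `re` module as a stand-in. Unlike a plain `s///`, `re.sub` replaces every occurrence by default.

```python
import re

hounds = re.sub(r"dog", "hound", "slobbery dogs")
print(hounds)  # slobbery hounds

sentence = "sheepskin is made from sheeps"

# With word boundaries, only the standalone word "sheeps" is replaced.
anchored = re.sub(r"\bsheeps\b", "sheep", sentence)

# Without them, the "sheeps" inside "sheepskin" is replaced too.
unanchored = re.sub(r"sheeps", "sheep", sentence)

print(anchored)    # sheepskin is made from sheep
print(unanchored)  # sheepkin is made from sheep
```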
  27. 27. Capture <ul><li>During searches, ( … ) groups capture patterns for use in replacement. </li></ul><ul><li>Special variables $1, $2, $3 etc. contain the capture. </li></ul><ul><li>/(\d\d\d)-(\d\d\d\d)/ applied to “123-4567”: </li></ul><ul><ul><li>$1 contains “123” </li></ul></ul><ul><ul><li>$2 contains “4567” </li></ul></ul>
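In Python, which is assumed here in place of the Perl-style `$1` and `$2` variables on the slide, captures are read from the match object. A minimal sketch:

```python
import re

m = re.search(r"(\d\d\d)-(\d\d\d\d)", "123-4567")

# group(1) and group(2) play the role of $1 and $2.
first = m.group(1)
second = m.group(2)

print(first)       # 123
print(second)      # 4567
print(m.groups())  # ('123', '4567')
```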
  28. 28. Capture <ul><li>How do you convert </li></ul><ul><ul><li>“Smith, James” and “Jones, Sally” to </li></ul></ul><ul><ul><li>“James Smith” and “Sally Jones”? </li></ul></ul>
  29. 29. Capture <ul><li>How do you convert </li></ul><ul><ul><li>“Smith, James” and “Jones, Sally” to </li></ul></ul><ul><ul><li>“James Smith” and “Sally Jones”? </li></ul></ul><ul><li>s/(\w+), (\w+)/$2 $1/ </li></ul>
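The same swap can be sketched with Python's `re.sub`, an assumption made here since the slide uses the `s///` form. Python writes the captures as `\1` and `\2` in the replacement rather than `$1` and `$2`.

```python
import re

names = ["Smith, James", "Jones, Sally"]

# Group 1 is the last name, group 2 the first name; swap them.
swapped = [re.sub(r"(\w+), (\w+)", r"\2 \1", name) for name in names]

print(swapped)  # ['James Smith', 'Sally Jones']
```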
  30. 30. Capture <ul><li>Given a file containing URLs, create a script that wgets each URL: </li></ul><ul><ul><li>http://bit.ly/DHapiTRANSCRIBE </li></ul></ul><ul><ul><ul><li>becomes: </li></ul></ul></ul><ul><ul><li>wget "http://bit.ly/DHapiTRANSCRIBE" </li></ul></ul>
  31. 31. Capture <ul><li>Given a file containing URLs, create a script that wgets each URL: </li></ul><ul><ul><li>http://bit.ly/DHapiTRANSCRIBE </li></ul></ul><ul><ul><ul><li>becomes: </li></ul></ul></ul><ul><ul><li>wget "http://bit.ly/DHapiTRANSCRIBE" </li></ul></ul><ul><li>s/^(.*)$/wget "$1"/ </li></ul>
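A sketch of the whole-line capture in Python, assuming the `re` module and a list of URLs already read from the file. The substitution is applied one line at a time and blank lines are skipped, which is a choice made here so that no empty `wget ""` command is produced.

```python
import re

# Stand-in for the lines of the URL file.
urls = ["http://bit.ly/DHapiTRANSCRIBE", ""]

# ^(.*)$ captures the entire line; \1 re-inserts it inside quotes.
script_lines = [
    re.sub(r"^(.*)$", r'wget "\1"', url)
    for url in urls
    if url.strip()
]

print("\n".join(script_lines))
# wget "http://bit.ly/DHapiTRANSCRIBE"
```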
  32. 32. Lab Session II <ul><li>Convert all Miss and Mrs. to Ms. </li></ul><ul><li>Convert infinitives to gerunds </li></ul><ul><ul><li>“to sing” -> “singing” </li></ul></ul><ul><li>Extract last name, first name from (title first name last name) </li></ul><ul><ul><li>Dr. Thelma Dunn </li></ul></ul><ul><ul><li>Mr. Clay Shirky </li></ul></ul><ul><ul><li>Dana Gray </li></ul></ul>
  33. 33. Caveats <ul><li>Do not use regular expressions to parse (complicated) XML! </li></ul><ul><li>Check the language/application-specific documentation: some common shortcuts are not universal. </li></ul>
  34. 34. Acknowledgments <ul><li>James Edward Gray II and Dana Gray </li></ul><ul><ul><li>Much of the structure and some of the wording of this presentation comes from </li></ul></ul><ul><ul><li>http://www.slideshare.net/JamesEdwardGrayII/regular-expressions-7337223 </li></ul></ul>