Introduction to regular expressions
Introduction to Regular Expressions for THATCamp Texas 2011

Transcript

  • 1. Introduction to Regular Expressions Ben Brumfield THATCamp Texas 2011
  • 2. What are Regular Expressions?
    • Very small language for describing text.
    • Not a programming language.
    • Incredibly powerful tool for search/replace operations.
    • Arcane art.
    • Ubiquitous.
  • 3. Why Use Regular Expressions?
    • Finding every instance of a string in a file – e.g. every mention of “chickens” in a farm diary
    • How many times does “sing” appear in a text in all tenses and conjugations?
    • Reformatting dirty data
    • Validating input.
    • Command line work – listing files, grepping log files
  • 4. The Basics
    • A regex is a pattern enclosed within delimiters.
    • Most characters match themselves.
    • /THATCamp/ is a regular expression that matches “THATCamp”.
      • Slash is the delimiter enclosing the expression.
      • “THATCamp” is the pattern.
  • 5. /at/
    • Matches strings with “a” followed by “t”.
    Athens aft atlas that hat at
  • 6. /at/
    • Matches strings with “a” followed by “t”.
    Athens aft «at»las th«at» h«at» «at» (« » marks each match)
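The /at/ example can be tried in code. This is a minimal sketch assuming Python's standard `re` module, which takes the pattern as a string rather than between slashes; the variable names are illustrative only.

```python
import re

# The six sample strings from the slide.
words = ["Athens", "aft", "atlas", "that", "hat", "at"]

# Python has no /.../ delimiters: the pattern is passed as a (raw) string.
pattern = re.compile(r"at")

# search() succeeds if the pattern occurs anywhere in the string.
matching = [w for w in words if pattern.search(w)]
print(matching)
```

“Athens” is rejected because matching is case sensitive, and “aft” because the “a” is not immediately followed by “t”.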
  • 7. Some Theory
    • Finite State Machine for the regex /at/
  • 8. Characters
    • Matching is case sensitive.
    • Special characters: ( ) ^ $ { } [ ] \ | . + ? *
    • To match a special character in your text, precede it with \ in your pattern:
      • /ironic [sic]/ does not match “ironic [sic]”
      • /ironic \[sic\]/ matches “ironic [sic]”
    • Regular expressions can support Unicode.
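A short sketch of the escaping rule, assuming Python's `re` module. `re.escape` is a standard-library helper that adds the backslashes for you; the variable names are illustrative.

```python
import re

text = "ironic [sic]"

# Unescaped, [sic] is a character class meaning "one of s, i, or c".
unescaped = re.search(r"ironic [sic]", text)

# Escaped brackets are matched literally.
escaped = re.search(r"ironic \[sic\]", text)

# re.escape() builds a pattern that matches its argument literally.
automatic = re.search(re.escape(text), text)

print(unescaped, escaped.group(), automatic.group())
```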
  • 9. Character Classes
    • Characters within [ ] are choices for a single-character match.
    • Think of a set operation, or a type of “or”.
    • Order within the set is unimportant.
    • /x[01]/ matches “x0” and “x1”.
    • /[10][23]/ matches “02”, “03”, “12” and “13”.
    • An initial ^ negates the class:
      • /[^45]/ matches any single character except 4 or 5.
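The class examples above can be checked with a small sketch, assuming Python's `re` module; the sample input strings are made up for illustration.

```python
import re

# One "x" followed by a single character from the set {0, 1}.
x_hits = re.findall(r"x[01]", "x0 x1 x2")

# Two classes in a row: first character from {1, 0}, second from {2, 3}.
pair_hits = re.findall(r"[10][23]", "02 03 12 13 14 22")

# A leading ^ inside the brackets negates the class.
not_4_or_5 = re.findall(r"[^45]", "3456")

print(x_hits, pair_hits, not_4_or_5)
```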
  • 10. /[ch]at/
    • Matches strings with “c” or “h”, followed by “a”, followed by “t”.
    phat fat cat chat at that
  • 11. /[ch]at/
    • Matches strings with “c” or “h”, followed by “a”, followed by “t”.
    p«hat» fat «cat» c«hat» at t«hat» (« » marks each match)
  • 12. Ranges
    • Ranges define sets of characters within a class.
      • /[1-9]/ matches any non-zero digit.
      • /[a-zA-Z]/ matches any unaccented (ASCII) letter.
      • /[12][0-9]/ matches numbers between 10 and 29.
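A sketch of the three range examples, assuming Python's `re` module. `re.fullmatch` is used for the last one so that the whole string must be a two-digit number; the sample inputs are illustrative.

```python
import re

# Non-zero digits only: the zeros are skipped.
non_zero = re.findall(r"[1-9]", "1050")

# ASCII letters only: digits and the hyphen are skipped.
letters = re.findall(r"[a-zA-Z]", "R2-D2")

# A first digit of 1 or 2, then any digit: the numbers 10 through 29.
candidates = ["09", "10", "17", "29", "30"]
in_range = [c for c in candidates if re.fullmatch(r"[12][0-9]", c)]

print(non_zero, letters, in_range)
```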
  • 13. Shortcuts
    Shortcut  Name        Equivalent Class
    \d        digit       [0-9]
    \D        not digit   [^0-9]
    \w        word        [a-zA-Z0-9_]
    \W        not word    [^a-zA-Z0-9_]
    \s        space       [\t\n\r\f\v ]
    \S        not space   [^\t\n\r\f\v ]
    .         everything  [^\n] (depends on mode)
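A brief sketch of the shortcuts in use, assuming Python's `re` module. The sample text is invented. One caveat is specific to Python 3: on ordinary strings `\d`, `\w` and `\s` also match non-ASCII digits, letters and whitespace unless the `re.ASCII` flag is given.

```python
import re

text = "Room 42\tis_open"

digits = re.findall(r"\d", text)    # single digits
words = re.findall(r"\w+", text)    # runs of word characters
spaces = re.findall(r"\s", text)    # a space and a tab
chunks = re.findall(r"\S+", text)   # runs of non-space characters

# "." skips newlines by default; re.DOTALL is the "mode" that changes this.
dot_default = re.findall(r".", "a\nb")
dot_all = re.findall(r".", "a\nb", flags=re.DOTALL)

print(digits, words, spaces, chunks, dot_default, dot_all)
```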
  • 14. /\d\d\d[- ]\d\d\d\d/
    • Matches strings with:
      • Three digits
      • Space or dash
      • Four digits
    653-6464x256 PE6-5000 713-342-7452 652.2648 234 1252 501-1234
  • 15. /\d\d\d[- ]\d\d\d\d/
    • Matches strings with:
      • Three digits
      • Space or dash
      • Four digits
    «653-6464»x256 PE6-5000 713-«342-7452» 652.2648 «234 1252» «501-1234» (« » marks each match)
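The phone-number pattern can be run against the slide's sample strings. This is a sketch assuming Python's `re` module; `found` is an illustrative name for a mapping from each input to the text that matched.

```python
import re

candidates = ["653-6464x256", "PE6-5000", "713-342-7452",
              "652.2648", "234 1252", "501-1234"]

# Three digits, then a dash or a space, then four digits.
phone = re.compile(r"\d\d\d[- ]\d\d\d\d")

found = {}
for candidate in candidates:
    match = phone.search(candidate)
    if match:
        found[candidate] = match.group()

print(found)
```

The pattern finds “342-7452” inside “713-342-7452” because a search may start anywhere in the string; “PE6-5000” and “652.2648” contain no run that fits.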
  • 16. Repeaters
    • Symbols indicating that the preceding element of the pattern can repeat.
    • /runs?/ matches runs or run
    • /1\d*/ matches any number beginning with “1”.
    Repeater  Count
    ?         zero or one
    +         one or more
    *         zero or more
    {n}       exactly n
    {n,m}     between n and m times
    {,m}      no more than m times
    {n,}      at least n times
  • 17. Repeaters
    • Strings:
    • 1: “at” 2: “art”
    • 3: “arrrrt” 4: “aft”
    • Patterns:
    • A: /ar?t/ B: /a[fr]?t/
    • C: /ar*t/ D: /ar+t/
    • E: /a.*t/ F: /a.+t/
    Repeater  Count
    ?         zero or one
    +         one or more
    *         zero or more
    {n}       exactly n
    {n,m}     between n and m times
    {,m}      no more than m times
    {n,}      at least n times
  • 18. Repeaters
    • /ar?t/ matches “at” and “art” but not “arrrrt”.
    • /a[fr]?t/ matches “at”, “art”, and “aft”.
    • /ar*t/ matches “at”, “art”, and “arrrrt”.
    • /ar+t/ matches “art” and “arrrrt” but not “at”.
    • /a.*t/ matches anything with an ‘a’ eventually followed by a ‘t’.
    • /a.+t/ matches “art”, “arrrrt”, and “aft” but not “at”.
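The strings-against-patterns exercise can be tabulated mechanically. This is a sketch assuming Python's `re` module; `re.fullmatch` is used so that each pattern must account for the whole string, and the labels A to F follow the previous slide.

```python
import re

strings = ["at", "art", "arrrrt", "aft"]
patterns = {
    "A": r"ar?t",      # zero or one "r"
    "B": r"a[fr]?t",   # zero or one of "f" or "r"
    "C": r"ar*t",      # zero or more "r"
    "D": r"ar+t",      # one or more "r"
    "E": r"a.*t",      # zero or more of any character
    "F": r"a.+t",      # one or more of any character
}

# For each pattern, list the strings it matches in full.
results = {
    label: [s for s in strings if re.fullmatch(pattern, s)]
    for label, pattern in patterns.items()
}

for label, matched in results.items():
    print(label, matched)
```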
  • 19. Lab Session I
    • http://gskinner.com/RegExr/
    • https://gist.github.com/922838
    • Match the titles “Mr.” and “Ms.”.
    • Find all conjugations and tenses of “sing”.
    • Find all places where more than one space follows punctuation.
  • 20. Lab Reference
    Shortcut  Name
    \d        digit
    \D        not digit
    \w        word
    \W        not word
    \s        space
    \S        not space
    .         everything

    Repeater  Count
    ?         zero or one
    +         one or more
    *         zero or more
    {n}       exactly n
    {n,m}     between n and m times
    {,m}      no more than m times
    {n,}      at least n times
  • 21. Anchors
    • Anchors match between characters.
    • Used to assert that the characters you’re matching must appear in a certain place.
    • /\bat\b/ matches “at work” but not “batch”.
    Anchor  Matches
    ^       start of line
    $       end of line
    \b      word boundary
    \B      not boundary
    \A      start of string
    \Z      end of string
    \z      raw end of string (rare)
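A small sketch of anchors, assuming Python's `re` module. In Python, `^` and `$` anchor to the edges of the whole string unless the `re.MULTILINE` flag is passed, which is one instance of the "depends on the tool" behaviour. The diary text is invented for illustration.

```python
import re

at_word = re.compile(r"\bat\b")
in_phrase = at_word.search("at work") is not None   # "at" stands alone
in_batch = at_word.search("batch") is not None      # "at" is inside a word

diary = "fed chickens\nsold eggs"

# Without MULTILINE, ^ only matches at the very start of the string.
first_only = re.findall(r"^\w+", diary)

# With MULTILINE, ^ and $ match at the start and end of every line.
each_line_start = re.findall(r"^\w+", diary, flags=re.MULTILINE)
each_line_end = re.findall(r"\w+$", diary, flags=re.MULTILINE)

print(in_phrase, in_batch, first_only, each_line_start, each_line_end)
```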
  • 22. Alternation
    • In Regex, | means “or”.
    • You can put a full expression on the left and another full expression on the right.
    • Either can match.
    • /seeks?|sought/ matches “seek”, “seeks”, or “sought”.
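Alternation can be sketched as follows, assuming Python's `re` module; the sentence is made up for illustration.

```python
import re

sentence = "She seeks what he sought; they seek it too."

# Left alternative: "seek" with an optional "s". Right alternative: "sought".
forms = re.findall(r"seeks?|sought", sentence)

print(forms)
```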
  • 23. Grouping
    • Everything within ( … ) is grouped into a single element for the purposes of repetition and alternation.
    • The expression /(la)+/ matches “la”, “lala”, “lalalala” but not “all”.
    • /schema(ta)?/ matches “schema” and “schemata”; in “schematic” it matches only the “schema” part.
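A sketch of grouping, assuming Python's `re` module. `re.fullmatch` tests whether the whole string fits the pattern; a plain `re.search` would still find “schema” at the start of “schematic”, which the last line demonstrates.

```python
import re

# (la)+ repeats the two-character group as a unit.
la_words = [w for w in ["la", "lala", "lalalala", "all"]
            if re.fullmatch(r"(la)+", w)]

# (ta)? makes the whole "ta" suffix optional.
schema_words = [w for w in ["schema", "schemata", "schematic"]
                if re.fullmatch(r"schema(ta)?", w)]

# An unanchored search matches the prefix of a longer word.
partial = re.search(r"schema(ta)?", "schematic").group()

print(la_words, schema_words, partial)
```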
  • 24. Grouping Example
    • What regular expression matches “eat”, “eats”, “ate” and “eaten”?
  • 25. Grouping Example
    • What regular expression matches “eat”, “eats”, “ate” and “eaten”?
    • /eat(s|en)?|ate/
    • Add word boundary anchors to exclude “sate” and “eating”: /\b(eat(s|en)?|ate)\b/
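The anchored pattern can be checked against a sentence containing both the wanted forms and the decoys. This is a sketch assuming Python's `re` module; `finditer` is used because `findall` returns the contents of the groups rather than the whole match when a pattern contains parentheses.

```python
import re

verb = re.compile(r"\b(eat(s|en)?|ate)\b")

sentence = "I eat, she eats, we ate, it was eaten; sate and eating do not count."

# group() with no argument returns the full text of each match.
forms = [m.group() for m in verb.finditer(sentence)]

print(forms)
```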
  • 26. Replacement
    • Regexes are most often used for search/replace.
    • Syntax varies; most scripting languages and CLI tools use s/pattern/replacement/.
    • s/dog/hound/ converts “slobbery dogs” to “slobbery hounds”.
    • s/\bsheeps\b/sheep/ converts
      • “sheepskin is made from sheeps” to
      • “sheepskin is made from sheep”
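Python has no s/// operator; the closest equivalent, assumed here, is `re.sub(pattern, replacement, text)` from the standard `re` module. The sketch also shows what goes wrong without the word-boundary anchors.

```python
import re

hounds = re.sub(r"dog", "hound", "slobbery dogs")

sentence = "sheepskin is made from sheeps"

# Anchored: only the standalone word "sheeps" is replaced.
anchored = re.sub(r"\bsheeps\b", "sheep", sentence)

# Unanchored: the "sheeps" inside "sheepskin" is replaced too.
unanchored = re.sub(r"sheeps", "sheep", sentence)

print(hounds)
print(anchored)
print(unanchored)
```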
  • 27. Capture
    • During searches, ( … ) groups capture patterns for use in replacement.
    • Special variables $1, $2, $3 etc. contain the capture.
    • /(\d\d\d)-(\d\d\d\d)/ applied to “123-4567”:
      • $1 contains “123”
      • $2 contains “4567”
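In Python, captures are read from the match object rather than from `$1` and `$2`. A minimal sketch, assuming the standard `re` module:

```python
import re

match = re.search(r"(\d\d\d)-(\d\d\d\d)", "123-4567")

prefix = match.group(1)   # what Perl-style tools call $1
line = match.group(2)     # what Perl-style tools call $2
both = match.groups()     # all captures as a tuple

print(prefix, line, both)
```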
  • 28. Capture
    • How do you convert
      • “Smith, James” and “Jones, Sally” to
      • “James Smith” and “Sally Jones”?
  • 29. Capture
    • How do you convert
      • “Smith, James” and “Jones, Sally” to
      • “James Smith” and “Sally Jones”?
    • s/(\w+), (\w+)/$2 $1/
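The name swap translates to Python's `re.sub` as sketched below. One syntax difference is worth noting: Python's replacement string refers to captures as `\1` and `\2` (or `\g<1>`), not `$1` and `$2`.

```python
import re

names = ["Smith, James", "Jones, Sally"]

# Capture the word before the comma and the word after it, then swap them.
swapped = [re.sub(r"(\w+), (\w+)", r"\2 \1", name) for name in names]

print(swapped)
```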
  • 30. Capture
    • Given a file containing URLs, create a script that wgets each URL:
      • http://bit.ly/DHapiTRANSCRIBE
        • becomes:
      • wget "http://bit.ly/DHapiTRANSCRIBE"
  • 31. Capture
    • Given a file containing URLs, create a script that wgets each URL:
      • http://bit.ly/DHapiTRANSCRIBE
        • becomes
      • wget "http://bit.ly/DHapiTRANSCRIBE"
    • s/^(.*)$/wget "$1"/
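A sketch of the same line-by-line rewrite, assuming Python's `re.sub` with the `re.MULTILINE` flag so that `^` and `$` apply to each line. The second URL is a hypothetical placeholder added only to show more than one line.

```python
import re

# One URL per line; the second is a made-up example.
urls = "http://bit.ly/DHapiTRANSCRIBE\nhttp://example.org/second"

# Capture each whole line and wrap it in a quoted wget command.
script = re.sub(r"^(.*)$", r'wget "\1"', urls, flags=re.MULTILINE)

print(script)
```

Because `.*` also matches an empty line, blank lines in the input would become `wget ""`; using `.+` instead would skip them.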
  • 32. Lab Session II
    • Convert all Miss and Mrs. to Ms.
    • Convert infinitives to gerunds
      • “to sing” -> “singing”
    • Extract last name, first name from (title first name last name)
      • Dr. Thelma Dunn
      • Mr. Clay Shirky
      • Dana Gray
  • 33. Caveats
    • Do not use regular expressions to parse (complicated) XML!
    • Check the language/application-specific documentation: some common shortcuts are not universal.
  • 34. Acknowledgments
    • James Edward Gray II and Dana Gray
      • Much of the structure and some of the wording of this presentation comes from
      • http://www.slideshare.net/JamesEdwardGrayII/regular-expressions-7337223