Introduction to regular expressions
 

Introduction to Regular Expressions for THATCamp Texas 2011

    Presentation Transcript

    • Introduction to Regular Expressions Ben Brumfield THATCamp Texas 2011
    • What are Regular Expressions?
      • Very small language for describing text.
      • Not a programming language.
      • Incredibly powerful tool for search/replace operations.
      • Arcane art.
      • Ubiquitous.
    • Why Use Regular Expressions?
      • Finding every instance of a string in a file, e.g. every mention of “chickens” in a farm diary
      • How many times does “sing” appear in a text in all tenses and conjugations?
      • Reformatting dirty data
      • Validating input.
      • Command line work – listing files, grepping log files
    • The Basics
      • A regex is a pattern enclosed within delimiters.
      • Most characters match themselves.
      • /THATCamp/ is a regular expression that matches “THATCamp”.
        • Slash is the delimiter enclosing the expression.
        • “THATCamp” is the pattern.
    • /at/
      • Matches strings with “a” followed by “t”.
      • Sample text: Athens aft atlas that hat at
      • Matches: the “at” in “atlas”, “that”, “hat”, and “at”; nothing in “Athens” (capital “A”) or “aft”.
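The /at/ slide can be tried out with a short sketch. It assumes Python's standard `re` module, which the slides do not name; Python takes the pattern as a plain string, so the slash delimiters are dropped.

```python
import re

# The slide's /at/ pattern, written without delimiters for Python.
pattern = re.compile(r"at")

words = ["Athens", "aft", "atlas", "that", "hat", "at"]

# re.search scans the whole string for the pattern, wherever it occurs.
matching = [w for w in words if pattern.search(w)]

print(matching)  # ['atlas', 'that', 'hat', 'at']
```

“Athens” is skipped because matching is case sensitive, and “aft” is skipped because the “a” is not immediately followed by “t”.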
    • Some Theory
      • Finite State Machine for the regex /at/
    • Characters
      • Matching is case sensitive.
      • Special characters: ( ) ^ $ { } [ ] | . + ? *
      • To match a special character in your text, precede it with \ in your pattern:
        • /ironic [sic]/ does not match “ironic [sic]”
        • /ironic \[sic\]/ matches “ironic [sic]”
      • Regular expressions can support Unicode.
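A minimal sketch of escaping, assuming Python's `re` module (an assumption; the slides are language-neutral). `re.escape` is a Python convenience for building an escaped pattern from literal text.

```python
import re

text = "ironic [sic]"

# Unescaped, [sic] is a character class: one character that is s, i, or c.
unescaped = re.search(r"ironic [sic]", text)

# Escaped brackets match a literal "[" and "]".
escaped = re.search(r"ironic \[sic\]", text)

# re.escape adds the backslashes for us when the pattern is plain text.
auto = re.search(re.escape(text), text)

print(unescaped)         # None
print(escaped.group(0))  # ironic [sic]
print(auto.group(0))     # ironic [sic]
```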
    • Character Classes
      • Characters within [ ] are choices for a single-character match.
      • Think of a set operation, or a type of “or”.
      • Order within the set is unimportant.
      • /x[01]/ matches “x0” and “x1”.
      • /[10][23]/ matches “02”, “03”, “12” and “13”.
      • An initial ^ negates the class:
        • /[^45]/ matches all characters except 4 or 5.
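The character-class examples above can be checked with a short sketch, assuming Python's `re` module. The test strings are made up for illustration.

```python
import re

# /x[01]/ : "x" followed by either 0 or 1.
x_hits = re.findall(r"x[01]", "x0 x1 x2")

# Order inside the class does not matter: [10] behaves like [01].
x_hits_reordered = re.findall(r"x[10]", "x0 x1 x2")

# /[10][23]/ : one character from each class, in sequence.
candidates = ["02", "03", "12", "13", "04", "22"]
pairs = [s for s in candidates if re.fullmatch(r"[10][23]", s)]

# /[^45]/ : any single character except 4 or 5.
not_4_or_5 = re.findall(r"[^45]", "3456")

print(x_hits)      # ['x0', 'x1']
print(pairs)       # ['02', '03', '12', '13']
print(not_4_or_5)  # ['3', '6']
```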
    • /[ch]at/
      • Matches strings with “c” or “h”, followed by “a”, followed by “t”.
      • Sample text: phat fat cat chat at that
      • Matches: “hat” in “phat”, “cat”, “hat” in “chat”, and “hat” in “that”; nothing in “fat” or “at”.
    • Ranges
      • Ranges define sets of characters within a class.
        • /[1-9]/ matches any non-zero digit.
        • /[a-zA-Z]/ matches any ASCII letter.
        • /[12][0-9]/ matches numbers between 10 and 29.
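A small sketch of the range examples, assuming Python's `re` module. The sample strings are illustrative only. `re.fullmatch` is used for the last pattern so that the whole string, not just part of it, has to fit.

```python
import re

# /[1-9]/ : any non-zero digit; findall returns each matching character.
nonzero = re.findall(r"[1-9]", "2011")

# /[a-zA-Z]/ : any ASCII letter.
letters = re.findall(r"[a-zA-Z]", "R2-D2")

# /[12][0-9]/ : a 1 or 2 followed by any digit, i.e. 10 through 29.
candidates = ["09", "10", "17", "29", "30"]
in_range = [s for s in candidates if re.fullmatch(r"[12][0-9]", s)]

print(nonzero)   # ['2', '1', '1']
print(letters)   # ['R', 'D']
print(in_range)  # ['10', '17', '29']
```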
    • Shortcuts
      Shortcut   Name         Equivalent Class
      \d         digit        [0-9]
      \D         not digit    [^0-9]
      \w         word         [a-zA-Z0-9_]
      \W         not word     [^a-zA-Z0-9_]
      \s         space        [\t\n\r\f\v ]
      \S         not space    [^\t\n\r\f\v ]
      .          everything   [^\n] (depends on mode)
    • /\d\d\d[- ]\d\d\d\d/
      • Matches strings with:
        • Three digits
        • Space or dash
        • Four digits
      • Sample text: 653-6464x256 PE6-5000 713-342-7452 652.2648 234 1252 501-1234
      • Matches: “653-6464”, “342-7452”, “234 1252”, and “501-1234”.
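The phone-number slide can be reproduced with the sketch below, assuming Python's `re` module. Each sample from the slide is searched separately, and the matched portion is collected.

```python
import re

# Three digits, a dash or space, then four digits.
phone = re.compile(r"\d\d\d[- ]\d\d\d\d")

samples = [
    "653-6464x256",
    "PE6-5000",
    "713-342-7452",
    "652.2648",
    "234 1252",
    "501-1234",
]

found = []
for sample in samples:
    m = phone.search(sample)
    if m:
        # group(0) is the part of the sample that the pattern matched.
        found.append(m.group(0))

print(found)  # ['653-6464', '342-7452', '234 1252', '501-1234']
```

“PE6-5000” fails because “PE6” is not three digits, and “652.2648” fails because “.” is neither a dash nor a space.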
    • Repeaters
      • Symbols indicating that the preceding element of the pattern can repeat.
      • /runs?/ matches runs or run
      • /1\d*/ matches any number beginning with “1”.
      Repeater   Count
      ?          zero or one
      +          one or more
      *          zero or more
      {n}        exactly n
      {n,m}      between n and m times
      {,m}       no more than m times
      {n,}       at least n times
    • Repeaters
      • Strings:
      • 1: “at” 2: “art”
      • 3: “arrrrt” 4: “aft”
      • Patterns:
      • A: /ar?t/ B: /a[fr]?t/
      • C: /ar*t/ D: /ar+t/
      • E: /a.*t/ F: /a.+t/
    • Repeaters
      • /ar?t/ matches “at” and “art” but not “arrrrt”.
      • /a[fr]?t/ matches “at”, “art”, and “aft”.
      • /ar*t/ matches “at”, “art”, and “arrrrt”.
      • /ar+t/ matches “art” and “arrrrt” but not “at”.
      • /a.*t/ matches anything with an “a” eventually followed by a “t”.
      • /a.+t/ is the same, but requires at least one character between the “a” and the “t”, so it does not match “at”.
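The pattern-versus-string grid from the Repeaters slides can be worked out mechanically. This is a sketch assuming Python's `re` module; the labels A to F follow the slide.

```python
import re

strings = ["at", "art", "arrrrt", "aft"]

patterns = {
    "A": r"ar?t",     # zero or one "r"
    "B": r"a[fr]?t",  # zero or one "f" or "r"
    "C": r"ar*t",     # zero or more "r"
    "D": r"ar+t",     # one or more "r"
    "E": r"a.*t",     # zero or more of any character
    "F": r"a.+t",     # one or more of any character
}

# For each pattern, list the strings in which it finds a match.
results = {
    label: [s for s in strings if re.search(pattern, s)]
    for label, pattern in patterns.items()
}

for label, matched in results.items():
    print(label, patterns[label], matched)
```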
    • Lab Session I
      • http://gskinner.com/RegExr/
      • https://gist.github.com/922838
      • Match the titles “Mr.” and “Ms.”.
      • Find all conjugations and tenses of “sing”.
      • Find all places where more than one space follows punctuation.
    • Lab Reference
      Shortcut   Name
      \d         digit
      \D         not digit
      \w         word
      \W         not word
      \s         space
      \S         not space
      .          everything

      Repeater   Count
      ?          zero or one
      +          one or more
      *          zero or more
      {n}        exactly n
      {n,m}      between n and m times
      {,m}       no more than m times
      {n,}       at least n times
    • Anchors
      • Anchors match between characters.
      • Used to assert that the characters you’re matching must appear in a certain place.
      • /\bat\b/ matches “at work” but not “batch”.
      Anchor   Matches
      ^        start of line
      $        end of line
      \b       word boundary
      \B       not boundary
      \A       start of string
      \Z       end of string
      \z       raw end of string (rare)
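A brief sketch of anchors, assuming Python's `re` module. One Python-specific detail: `^` and `$` match at line boundaries only when the `re.MULTILINE` flag is given; otherwise they match at the start and end of the whole string. The two-line sample text is made up for illustration.

```python
import re

# \b asserts a word boundary on each side of "at".
whole_word = re.search(r"\bat\b", "at work")
inside_word = re.search(r"\bat\b", "batch")

text = "first line\nsecond line"

# Without MULTILINE, ^ matches only at the very start of the string.
start_of_string = re.findall(r"^\w+", text)

# With MULTILINE, ^ also matches right after each newline.
start_of_lines = re.findall(r"^\w+", text, flags=re.MULTILINE)

print(whole_word.group(0))  # at
print(inside_word)          # None
print(start_of_string)      # ['first']
print(start_of_lines)       # ['first', 'second']
```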
    • Alternation
      • In Regex, | means “or”.
      • You can put a full expression on the left and another full expression on the right.
      • Either can match.
      • /seeks?|sought/ matches “seek”, “seeks”, or “sought”.
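Alternation can be seen in action with the sketch below, assuming Python's `re` module. The sample sentence is invented for the example.

```python
import re

sentence = "She seeks what he sought; they seek it too."

# Either side of | may match: "seek" with an optional "s", or "sought".
forms = re.findall(r"seeks?|sought", sentence)

print(forms)  # ['seeks', 'sought', 'seek']
```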
    • Grouping
      • Everything within ( … ) is grouped into a single element for the purposes of repetition and alternation.
      • The expression /(la)+/ matches “la”, “lala”, “lalalala” but not “all”.
      • /schema(ta)?/ matches “schema” and “schemata”; in “schematic” it matches only the leading “schema”, never the whole word.
    • Grouping Example
      • What regular expression matches “eat”, “eats”, “ate” and “eaten”?
      • /eat(s|en)?|ate/
      • Add word boundary anchors to exclude “sate” and “eating”: /\b(eat(s|en)?|ate)\b/
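The effect of the word boundary anchors in the grouping example can be compared side by side. This sketch assumes Python's `re` module; `finditer` with `group(0)` is used so the whole match is reported rather than the contents of the groups.

```python
import re

text = "eat eats ate eaten sate eating"

# Without anchors, the pattern also fires inside "sate" and "eating".
loose = [m.group(0) for m in re.finditer(r"eat(s|en)?|ate", text)]

# With \b on both sides, only whole words match.
strict = [m.group(0) for m in re.finditer(r"\b(eat(s|en)?|ate)\b", text)]

print(loose)   # ['eat', 'eats', 'ate', 'eaten', 'ate', 'eat']
print(strict)  # ['eat', 'eats', 'ate', 'eaten']
```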
    • Replacement
      • Regexes are most often used for search/replace.
      • Syntax varies; many scripting languages and CLI tools use s/pattern/replacement/.
      • s/dog/hound/ converts “slobbery dogs” to “slobbery hounds”.
      • s/\bsheeps\b/sheep/ converts
        • “sheepskin is made from sheeps” to
        • “sheepskin is made from sheep”
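Python has no s/pattern/replacement/ operator, so this sketch assumes its closest equivalent, `re.sub(pattern, replacement, text)`, to reproduce the two replacements above.

```python
import re

# s/dog/hound/
hounds = re.sub(r"dog", "hound", "slobbery dogs")

# s/\bsheeps\b/sheep/ : the boundaries keep "sheepskin" untouched.
sheep = re.sub(r"\bsheeps\b", "sheep", "sheepskin is made from sheeps")

print(hounds)  # slobbery hounds
print(sheep)   # sheepskin is made from sheep
```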
    • Capture
      • During searches, ( … ) groups capture patterns for use in replacement.
      • Special variables $1, $2, $3 etc. contain the capture.
      • /(\d\d\d)-(\d\d\d\d)/ applied to “123-4567”:
        • $1 contains “123”
        • $2 contains “4567”
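A minimal sketch of capture groups, assuming Python's `re` module. Python has no `$1` and `$2` variables; the captures are read from the match object with `group(1)` and `group(2)` instead.

```python
import re

m = re.search(r"(\d\d\d)-(\d\d\d\d)", "123-4567")

# group(0) is the whole match; numbered groups follow the parentheses in order.
whole = m.group(0)
first = m.group(1)   # what the slide calls $1
second = m.group(2)  # what the slide calls $2

print(whole, first, second)  # 123-4567 123 4567
```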
    • Capture
      • How do you convert
        • “Smith, James” and “Jones, Sally” to
        • “James Smith” and “Sally Jones”?
      • s/(\w+), (\w+)/$2 $1/
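The name swap can be sketched in Python, assuming the `re` module. In a Python replacement string the captures are written `\1` and `\2` rather than `$1` and `$2`.

```python
import re

names = ["Smith, James", "Jones, Sally"]

# Capture the last name and first name, then emit them in the opposite order.
swapped = [re.sub(r"(\w+), (\w+)", r"\2 \1", name) for name in names]

print(swapped)  # ['James Smith', 'Sally Jones']
```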
    • Capture
      • Given a file containing URLs, create a script that wgets each URL:
        • http://bit.ly/DHapiTRANSCRIBE
          • becomes
        • wget "http://bit.ly/DHapiTRANSCRIBE"
      • s/^(.*)$/wget "$1"/
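The wget conversion can be sketched with `re.sub`, assuming Python's `re` module. The input is held in a string rather than read from a file to keep the example self-contained, and both URLs are taken from these slides. `re.MULTILINE` is needed so that `^` and `$` apply to each line.

```python
import re

# Stand-in for the contents of the URL file, one URL per line, no trailing newline.
urls = "http://bit.ly/DHapiTRANSCRIBE\nhttps://gist.github.com/922838"

# Capture each whole line and wrap it in a quoted wget command.
script = re.sub(r"^(.*)$", r'wget "\1"', urls, flags=re.MULTILINE)

print(script)
```

If the input ends with a newline, `^(.*)$` also matches the empty position after it; using `(.+)` instead of `(.*)` avoids emitting an empty `wget ""` line.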
    • Lab Session II
      • Convert all Miss and Mrs. to Ms.
      • Convert infinitives to gerunds
        • “to sing” -> “singing”
      • Extract last name, first name from (title first name last name)
        • Dr. Thelma Dunn
        • Mr. Clay Shirky
        • Dana Gray
    • Caveats
      • Do not use regular expressions to parse (complicated) XML!
      • Check the language/application-specific documentation: some common shortcuts are not universal.
    • Acknowledgments
      • James Edward Gray II and Dana Gray
        • Much of the structure and some of the wording of this presentation comes from
        • http://www.slideshare.net/JamesEdwardGrayII/regular-expressions-7337223