Introduction to Regular      Expressions      Ben Brumfield     RootsTech 2013
Our Texts• My Texts  – Manuscript Transcripts  – OCR• Our Texts  – Name Variants  – Abbreviations  – Spelling Changes  – “...
What are Regular Expressions?• Very small language for describing text.• Not a programming language.• Incredibly powerful ...
Why Use Regular Expressions?• Finding every instance of a string in a file  – i.e. every mention of “chickens” in a farm  ...
The Basics• A regex is a pattern enclosed within  delimiters.• Most characters match themselves.• /rootstech/ is a regular...
/at/• Matches strings with     at     hat  “a” followed by “t”.                           that   atlas                    ...
/at/• Matches strings with     at     hat  “a” followed by “t”.                           that   atlas                    ...
Some Theory• Finite State Machine for the regex /at/
Characters• Matching is case sensitive.• Special characters: ( ) ^ $ { } [ ]  | . + ? *• To match a special character in y...
Character Classes• Characters within [ ] are choices for a  single-character match.• Think of a set operation, or a type o...
/[ch]at/• Matches strings with      that   at  “c” or “h”, followed by  “a”, followed by “t”.                            c...
/[ch]at/• Matches strings with      that   at  “c” or “h”, followed by  “a”, followed by “t”.                            c...
Ranges• Ranges define sets of characters within a  class.  – /[1-9]/ matches any non-zero digit.  – /[a-zA-Z]/ matches any...
ShortcutsShortcut Name           Equivalent Class  d    digit           [0-9]  D    not digit       [^0-9]  w    word     ...
/ddd[- ]dddd/• Matches strings with:   501-1234   234 1252  – Three digits  – Space or dash  – Four digits           652.2...
/ddd[- ]dddd/• Matches strings with:   501-1234   234 1252  – Three digits  – Space or dash  – Four digits           652.2...
Repeaters• Symbols indicating       Repeater   Count  that the preceding           ?      zero or one  element of the patt...
RepeatersStrings:                     Repeater   Count1: “at”       2: “art”           ?      zero or one3: “arrrrt”   4: ...
Repeaters•   /ar?t/ matches “at” and “art” but not “arrrt”.•   /a[fr]?t/ matches “at”, “art”, and “aft”.•   /ar*t/ matches...
Lab Session ITry this URL:tinyurl.com/rootstechlab
Lab Session IMatch “Brumfield” and “Bromfield” in1702 John Bromfields estate had been  proved in Isle of Wight prior to 17...
Lab ReferenceRepeater   Count              Shortcut   Name    ?      zero or one            d     digit    +      one or m...
Anchors• Anchors match            Anchor Matches  between characters.        ^    start of line• Used to assert that      ...
Alternation• In Regex, | means “or”.• You can put a full expression on the left  and another full expression on the right....
Grouping• Everything within ( … ) is grouped into a  single element for the purposes of  repetition and alternation.• The ...
Grouping Example• What regular expression matches “eat”,  “eats”, “ate” and “eaten”?
Grouping Example• What regular expression matches “eat”,  “eats”, “ate” and “eaten”?• /eat(s|en)?|ate/• Add word boundary ...
Lab Session IIMatch “William” and “Wm.” in1736 Robert Mosby and John Brumfield  processioned the lands of Wm. Brittain1739...
Replacement• Regex most often used for search/replace• Syntax varies; most scripting languages  and CLI tools use s/patter...
Capture• During searches, ( … ) groups capture  patterns for use in replacement.• Special variables $1, $2, $3 etc. contai...
Capture• How do you convert  – “Smith, James” and “Jones, Sally” to  – “James Smith” and “Sally Jones”?
Capture• How do you convert  – “Smith, James” and “Jones, Sally” to  – “James Smith” and “Sally Jones”?• s/(w+), (w+)/$2 $1/
Caveats• Check the language/application-specific  documentation: some common shortcuts  are not universal.
QuestionsBen Brumfieldbenwbrum@gmail.comFromThePage.comManuscriptTranscription.blogspot.com
Upcoming SlideShare
Loading in …5
×

Introduction to Regular Expressions RootsTech 2013

341 views

Published on

Published in: Investor Relations
0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total views
341
On SlideShare
0
From Embeds
0
Number of Embeds
2
Actions
Shares
0
Downloads
4
Comments
0
Likes
0
Embeds 0
No embeds

No notes for slide

Introduction to Regular Expressions RootsTech 2013

  1. 1. Introduction to Regular Expressions Ben Brumfield RootsTech 2013
  2. 2. Our Texts• My Texts – Manuscript Transcripts – OCR• Our Texts – Name Variants – Abbreviations – Spelling Changes – “Mistakes”
  3. 3. What are Regular Expressions?• Very small language for describing text.• Not a programming language.• Incredibly powerful tool for search/replace operations.• Old (1950s-60s)• Arcane art.• Ubiquitous.
  4. 4. Why Use Regular Expressions?• Finding every instance of a string in a file – i.e. every mention of “chickens” in a farm diary• How many times does “sing” appear in a text in all tenses and conjugations?• Reformatting dirty data• Validating input.• Command line work – listing files, grepping log files
  5. 5. The Basics• A regex is a pattern enclosed within delimiters.• Most characters match themselves.• /rootstech/ is a regular expression that matches “rootstech”. – Slash is the delimiter enclosing the expression. – “rootstech” is the pattern.
  6. 6. /at/• Matches strings with at hat “a” followed by “t”. that atlas aft Athens
  7. 7. /at/• Matches strings with at hat “a” followed by “t”. that atlas aft Athens
  8. 8. Some Theory• Finite State Machine for the regex /at/
  9. 9. Characters• Matching is case sensitive.• Special characters: ( ) ^ $ { } [ ] | . + ? *• To match a special character in your text, precede it with in your pattern: – /snarky [sic]/ does not match “snarky [sic]” – /snarky [sic]/ matches “snarky [sic]”• Regular expressions can support Unicode.
  10. 10. Character Classes• Characters within [ ] are choices for a single-character match.• Think of a set operation, or a type of or.• Order within the set is unimportant.• /x[01]/ matches “x0” and “x1”.• /[10][23]/ matches “02”, “03”, “12” and “13”.• Initial^ negates the class: – /[^45]/ matches all characters except 4 or 5.
  11. 11. /[ch]at/• Matches strings with that at “c” or “h”, followed by “a”, followed by “t”. chat cat fat phat
  12. 12. /[ch]at/• Matches strings with that at “c” or “h”, followed by “a”, followed by “t”. chat cat fat phat
  13. 13. Ranges• Ranges define sets of characters within a class. – /[1-9]/ matches any non-zero digit. – /[a-zA-Z]/ matches any letter. – /[12][0-9]/ matches numbers between 10 and 29.
  14. 14. ShortcutsShortcut Name Equivalent Class d digit [0-9] D not digit [^0-9] w word [a-zA-Z0-9_] W not word [^a-zA-Z0-9_] s space [tnrfv ] S not space [^tnrfv ] . everything [^n] (depends on mode)
  15. 15. /ddd[- ]dddd/• Matches strings with: 501-1234 234 1252 – Three digits – Space or dash – Four digits 652.2648 713-342-7452 PE6-5000 653-6464x256
  16. 16. /ddd[- ]dddd/• Matches strings with: 501-1234 234 1252 – Three digits – Space or dash – Four digits 652.2648 713-342-7452 PE6-5000 653-6464x256
  17. 17. Repeaters• Symbols indicating Repeater Count that the preceding ? zero or one element of the pattern + one or more can repeat. * zero or more• /runs?/ matches runs or run {n} exactly n• /1d*/ matches any {n,m} between n and m times number beginning with “1”. {,m} no more than m times {n,} at least n times
  18. 18. RepeatersStrings: Repeater Count1: “at” 2: “art” ? zero or one3: “arrrrt” 4: “aft” + one or more * zero or morePatterns: {n} exactly nA: /ar?t/ B: /a[fr]?t/ {n,m} between n andC: /ar*t/ D: /ar+t/ m timesE: /a.*t/ F: /a.+t/ {,m} no more than m times {n,} at least n times
  19. 19. Repeaters• /ar?t/ matches “at” and “art” but not “arrrt”.• /a[fr]?t/ matches “at”, “art”, and “aft”.• /ar*t/ matches “at”, “art”, and “arrrrt”• /ar+t/ matches “art” and “arrrt” but not “at”.• /a.*t/ matches anything with an ‘a’ eventually followed by a ‘t’.
  20. 20. Lab Session ITry this URL:tinyurl.com/rootstechlab
  21. 21. Lab Session IMatch “Brumfield” and “Bromfield” in1702 John Bromfields estate had been proved in Isle of Wight prior to 1702,Anne Brumfield recd. more than her share from her fathers estate.
  22. 22. Lab ReferenceRepeater Count Shortcut Name ? zero or one d digit + one or more D not digit * zero or more w word {n} exactly n times {n,m} between n and W not word m times s space {,m} no more than m S not space times {n,} at least n times . everything
  23. 23. Anchors• Anchors match Anchor Matches between characters. ^ start of line• Used to assert that $ end of line the characters you’re b word boundary matching must appear in a certain B not boundary place. A start of string• /batb/ matches “at Z end of string work” but not “batch”. z raw end of string (rare)
  24. 24. Alternation• In Regex, | means “or”.• You can put a full expression on the left and another full expression on the right.• Either can match.• /seeks?|sought/ matches “seek”, “seeks”, or “sought”.
  25. 25. Grouping• Everything within ( … ) is grouped into a single element for the purposes of repetition and alternation.• The expression /(la)+/ matches “la”, “lala”, “lalalala” but not “all”.• /schema(ta)?/ matches “schema” and “schemata” but not “schematic”.
  26. 26. Grouping Example• What regular expression matches “eat”, “eats”, “ate” and “eaten”?
  27. 27. Grouping Example• What regular expression matches “eat”, “eats”, “ate” and “eaten”?• /eat(s|en)?|ate/• Add word boundary anchors to exclude “sate” and “eating”: /b(eat(s|en)?|ate)b/
  28. 28. Lab Session IIMatch “William” and “Wm.” in1736 Robert Mosby and John Brumfield processioned the lands of Wm. Brittain1739 … Witnesses: Richard Echols, William Brumfield, John Hendrick
  29. 29. Replacement• Regex most often used for search/replace• Syntax varies; most scripting languages and CLI tools use s/pattern/replacement/ .• s/dog/hound/ converts “slobbery dogs” to “slobbery hounds”.• s/bsheepsb/sheep/ converts – “sheepskin is made from sheeps” to – “sheepskin is made from sheep”
  30. 30. Capture• During searches, ( … ) groups capture patterns for use in replacement.• Special variables $1, $2, $3 etc. contain the capture.• /(ddd)-(dddd)/ “123-4567” – $1 contains “123” – $2 contains “4567”
  31. 31. Capture• How do you convert – “Smith, James” and “Jones, Sally” to – “James Smith” and “Sally Jones”?
  32. 32. Capture• How do you convert – “Smith, James” and “Jones, Sally” to – “James Smith” and “Sally Jones”?• s/(w+), (w+)/$2 $1/
  33. 33. Caveats• Check the language/application-specific documentation: some common shortcuts are not universal.
  34. 34. QuestionsBen Brumfieldbenwbrum@gmail.comFromThePage.comManuscriptTranscription.blogspot.com

×