Using Regular Expressions and Staying Sane
Presentation I gave to the local http://www.cocoacoder.org/ meeting on using Regular Expressions in Cocoa code (although much of it applies to other languages as well).

Published in: Technology
Notes
  • This is not a talk about every possible thing you can do with regular expressions. In fact, it’s exactly the opposite. This is about how to do a useful thing and do it without going crazy.
  • So before I get too far, how many of you know what a regular expression is? How many have used them before? How many feel comfortable with them?
  • So here’s a quick example, just so those of you who haven’t touched them have an idea what I’m talking about between now and when we dig into examples later on.
  • Well, it depends...you see...
  • I’m saying OOish because I have issues with Perl’s OO, but that’s another talk. I went from Basic to Pascal to C to Perl (to C++ to Lisp to Java to Ruby to Objective-C). I started learning Perl in 1989 or so, and it was exactly what I needed at the time: a language that was really good at exactly what C made very painful, string handling. I have better alternatives than Perl now, but it taught me regexes.
  • This is an example of a usage in a language where a regex is a first-class citizen.
  • This is a WTF. And it brings to mind a bunch of questions...
  • and the most often asked question in Cocoa regex...
  • This is better (but you have to do the #import).
  • re.match in Python implicitly anchors you to the beginning of a string. This is hideous.
  • Well, I’d say no. I use them all the time.
  • This is an actual regex I found in a program I was once asked to find the performance problem in.
  • This is unmaintainable, and worse...
  • We’ll come back to this one later.
  • Let me do a quick phrasebook first.
  • You can (and should) put whatever characters you are looking for in square brackets. If you omit the first [0-9], the remaining [0-9]* can match nothing at all (an empty string). Likewise, in the second part [^0-9] means “anything that isn’t a digit”.
  • Anything else that you see that’s special (like ‘^’ or ‘\’) gets matched with a ‘\’ in front of it, too.
  • I mean it, I’m done.
  • But there’s all these other characters...
  • Can you tell the difference between ‘\w’ and ‘\W’ every time, without looking? Can you promise you’ll never get confused about whether ‘\w’ means ‘word’ or ‘whitespace’?
  • Maximize the utility of your investment. There is a ‘+’ operator that *sometimes* means “one or more”, like ‘::*’. ‘+’ works in Cocoa, but not in grep. If you stick to the ones that are the same everywhere, you will get more use out of them and be less confused. Same with ‘.*?’ to handle greedy matching.
  • Note: regexes don’t parse HTML/XML “correctly”, so be careful.
  • You get the HTML between the links, don’t you?
  • Although you can use .*? at least on some platforms.
  • This code was used in production on a project I was asked to consult on, in a Content Management System (of sorts), to detect links that should be clickable on a web page, but weren’t, and make them clickable.
  • And the customer fed that Content Management System a big list of links.
  • Note it’s looking at http followed by :// followed by stuff, then anything, then /A.
  • The regex library grabs the longest string it can, first, to see if that’s a match (because it’s supposed to be greedy),
  • then, when that doesn’t match, the next longest string,
  • and so on.
  • And then, when it’s exhausted the shortest string for that beginning match,
  • it does it again for the next beginning match it finds,
  • and so on there. BAD IDEA.
  • When I’m doing Core Data on the iPhone, the images go in a directory (NEVER in the DB!!), and I put info I might need (like when I should refresh it) in the image name, so I can do maintenance without having to ask the DB.
  • And coming up next, my current favorite to use in Xcode’s search project box...
  • Which, of course, means the price just went up 25%. Once you get comfortable with them, you start to see chances to use them everywhere.
  • Transcript

    • 1. Regular Expressions: How not to turn one problem into two. Carl Brown, CarlB@PDAgent.com
    • 2. “Common Wisdom”: “Some people, when confronted with a problem, think ‘I know, I’ll use regular expressions.’ Now they have two problems.”* (*See http://regex.info/blog/2006-09-15/247 for source.)
    • 3. What is a ‘Regular Expression’? “...a concise and flexible means for ‘matching’ (specifying and recognizing) strings of text, such as particular characters, words, or patterns of characters” (So says Wikipedia). “... a way of extracting substrings from text in a ‘usefully fuzzy’ way” (So says me).
    • 4. ...so for example?
      Pull out the host from a URL string: http://([^/]*)/
      Find the date in a string: ([0-9][0-9]*[-/][0-9][0-9]*[-/][0-9][0-9]*)
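The two patterns on slide 4 can be tried unchanged in Python's `re` module. This is a minimal sketch rather than the speaker's code; the sample strings and variable names are invented for illustration.

```python
import re

# Sample inputs are invented for this sketch.
url = "http://www.example.com/path/index.html"
text = "Invoice dated 2011-03-15 for services"

# Host: everything after "http://" up to (not including) the next "/".
host = re.search(r"http://([^/]*)/", url).group(1)

# Date: three runs of digits separated by "-" or "/".
date = re.search(r"([0-9][0-9]*[-/][0-9][0-9]*[-/][0-9][0-9]*)", text).group(1)

print(host)  # www.example.com
print(date)  # 2011-03-15
```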
    • 5. But they’re a Pain to Use, Aren’t they?
    • 6. Two Kinds of (OOish) Languages: Some languages, like Perl or Ruby, have regex built into their strings, so they get used often. Most others, like Cocoa, Java, and Python, have Regular Expression Objects that are complicated and a Pain in the Ass.
    • 7. Ruby: string.sub("pattern", "replacement")
    • 8-10. Cocoa (Apple)
      +[NSRegularExpression regularExpressionWithPattern:(NSString *)pattern options:(NSRegularExpressionOptions)options error:(NSError **)error]
      -[NSRegularExpression replaceMatchesInString:(NSMutableString *)string options:(NSMatchingOptions)options range:(NSRange)range withTemplate:(NSString *)template]
      NSRegularExpressionOptions? NSMatchingOptions? Why do I need a Range? What’s a template string? Is it really worth it?
    • 11. Cocoa (sane)
      #import "NSString+PDRegex.h"
      [string stringByReplacingRegexPattern:@"pattern" withString:@"replacement" caseInsensitive:NO];
      *See https://github.com/carlbrown/RegexOnNSString/
    • 12. Python (an aside)
      import re
      re.match("pattern", "a pattern")   # no match
      re.search("pattern", "a pattern")  # matches fine
    • 13. But Regexes are impossible to maintain... Aren’t they?
    • 14-16. But what about?
      (?<!(=)|(="")|(=))(((http|ftp|https)://)|(www\.))+[\w]+(\.[\w]+)([\w\-.@?^=%&:/~+#]*[\w\-@?^=%&/~+#])?(?!.*/a>)
      *That* guy has two problems. Well, actually, he has n! problems, where n is the number of hyperlinks in the input string.
    • 17. How to keep that from happening (my advice): Limit yourself to only the basic meta-characters. Favor clarity over brevity. Take more, smaller bites. Beware of greedy matching.
    • 18. The Basic Characters: A Phrasebook
    • 19-21. PhraseBook pt 1
      ^.* “the junk to the left of what I want”. This breaks down as ^ (the beginning of the string) followed by .* (any number of any character).
      .*$ “the junk to the right of what I want”. This breaks down as .* (any number of any character) followed by $ (the end of the string).
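As a rough illustration of the two phrases from PhraseBook pt 1, the sketch below strips the junk on each side of a value. The sample line and the `id=` marker are invented for this example.

```python
import re

# Invented sample line; only the value after "id=" is wanted.
line = "user=carl id=42 status=ok"

# "^.*id=" is the junk to the left: everything up to and including "id=".
left_removed = re.sub(r"^.*id=", "", line)

# " .*$" is the junk to the right: from the first space to the end.
wanted = re.sub(r" .*$", "", left_removed)

print(left_removed)  # 42 status=ok
print(wanted)        # 42
```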
    • 22-23. PhraseBook pt 2
      [0-9][0-9]* “a number with at least one digit”. The brackets ([ and ]) mean “any of the characters contained within the brackets”. So this means 1 character of 0-9 (so 0 1 2 3 4 5 6 7 8 or 9) followed by zero or more characters from that same set.
      [^A-Za-z] “any character that’s not a letter”. The ^ as the first character inside the brackets means “not”, so instead of meaning “any letter” it means “anything not a letter”.
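A small sketch of the bracket phrases from PhraseBook pt 2, again in Python; the sample strings are made up. It also shows why the leading `[0-9]` matters: without it the pattern is satisfied by an empty string.

```python
import re

sample = "Order 66 shipped to bay 7"  # invented sample text

# A number with at least one digit: one digit, then zero or more digits.
numbers = re.findall(r"[0-9][0-9]*", sample)

# Dropping the first [0-9] lets the pattern match the empty string
# at the very start of the text.
first_loose = re.search(r"[0-9]*", sample).group(0)

# Any character that is not a letter.
non_letters = re.findall(r"[^A-Za-z]", "a1 b2")

print(numbers)            # ['66', '7']
print(repr(first_loose))  # ''
print(non_letters)        # ['1', ' ', '2']
```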
    • 24-27. PhraseBook pt 3
      \. “a literal period” (e.g. to match the dot in .com)
      \* “a literal *” (e.g. to match an asterisk)
      \( \) or \[ \] “literal parentheses/brackets” (in Cocoa, at least)
      ( …stuff… ) “stuff I want to refer to later as $1” (in Cocoa, at least)
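The escaping and capture phrases from PhraseBook pt 3 can be sketched the same way. The sample strings are invented, and one detail differs from the slide: Python's `re.sub` refers to the first group as `\1` in the replacement, where the slide's Cocoa template uses `$1`.

```python
import re

# A backslash makes a special character literal.
dot_is_literal = re.search(r"example\.com", "www.exampleXcom") is None
dot_is_wild = re.search(r"example.com", "www.exampleXcom") is not None

# Escaped parentheses match literal parentheses.
area = re.search(r"\([0-9][0-9]*\)", "call (512) 555-0100").group(0)

# Unescaped parentheses capture "stuff I want to refer to later".
swapped = re.sub(r"([A-Za-z][A-Za-z]*), ([A-Za-z][A-Za-z]*)", r"\2 \1", "Brown, Carl")

print(dot_is_literal, dot_is_wild)  # True True
print(area)                         # (512)
print(swapped)                      # Carl Brown
```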
    • 28-29. PhraseBook pt 4: There is no... Part 4
    • 30-32. But what about? (*Cheat Sheet from http://www.addedbytes.com/cheat-sheets/regular-expressions-cheat-sheet/) There is no... Part 4
    • 33. Clarity > Brevity (Really true of any language)
    • 34. Choose the clearest way: [A-Za-z_] instead of \w; [^A-Za-z_] instead of \W
    • 35-37. Choose the consistent way:
      OSX:~$ grep ^root::* /etc/passwd
      root:*:0:0:System Administrator:/var/root:/bin/sh
      OSX:~$ grep ^root:+ /etc/passwd
      OSX:~$
      OSX:~$ grep ^root:.* /etc/passwd
      root:*:0:0:System Administrator:/var/root:/bin/sh
      OSX:~$ grep ^root:.*? /etc/passwd
      OSX:~$
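The grep session above comes out empty for `+` because basic (POSIX BRE) grep reads `+` as a literal plus sign. A sketch of the same comparison in Python, where `+` does mean “one or more”, shows why `::*` is the spelling that travels; the third search imitates grep's literal reading by escaping the plus.

```python
import re

line = "root:*:0:0:System Administrator:/var/root:/bin/sh"

# Portable spelling of "one or more colons": one colon, then zero or more.
portable = re.search(r"^root::*", line)

# "+" means "one or more" in Python's re, so this matches the same text.
plus = re.search(r"^root:+", line)

# Reading "+" as a literal plus sign, as basic grep does, finds nothing here.
literal_plus = re.search(r"^root:\+", line)

print(portable.group(0))  # root:
print(plus.group(0))      # root:
print(literal_plus)       # None
```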
    • 38. Except when you can’t
      http://\([^/][^/]*\)/ => \1 (POSIX/sed)
      http://([^/][^/]*)/ => $1 (perl/cocoa)
    • 39. Take Smaller Bites: The less you do at a time, the safer each step is.
    • 40. Which is clearer?
      NSString *domainName = [myHTMLString stringByReplacingRegexPattern:@"^.*href=[\"']http://(.*)/.*$" withString:@"$1" caseInsensitive:YES];
    • 41. Which is clearer?
      NSString *leftRemoved = [myHTMLString stringByReplacingRegexPattern:@"^.*href=[\"']" withString:@"" caseInsensitive:YES];
      NSString *myURL = [leftRemoved stringByReplacingRegexPattern:@"[\"'].*$" withString:@"" caseInsensitive:NO];
      NSString *hostAndPath = [myURL stringByReplacingRegexPattern:@"^.*http://" withString:@"" caseInsensitive:YES];
      NSString *domainName = [hostAndPath stringByReplacingRegexPattern:@"/.*$" withString:@"" caseInsensitive:NO];
      Bonus: This one can be stepped through with the debugger :-)
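The four-step version on slide 41 translates almost line for line into Python. This is only a sketch of the same idea, with `re.sub` standing in for the talk's NSString category method and an invented HTML string; each intermediate value can be printed or inspected in a debugger.

```python
import re

# Invented input with a single link.
html = '<p>See <a href="http://www.example.com/docs/index.html">the docs</a>.</p>'

# Step 1: drop everything up to and including href= and the opening quote.
left_removed = re.sub(r"^.*href=[\"']", "", html, flags=re.IGNORECASE)
# Step 2: drop the closing quote and everything after it.
my_url = re.sub(r"[\"'].*$", "", left_removed)
# Step 3: drop the scheme.
host_and_path = re.sub(r"^.*http://", "", my_url, flags=re.IGNORECASE)
# Step 4: drop the first "/" and everything after it.
domain_name = re.sub(r"/.*$", "", host_and_path)

print(my_url)       # http://www.example.com/docs/index.html
print(domain_name)  # www.example.com
```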
    • 42-43. But isn’t that slower? Yes. But it doesn’t matter how fast you get the wrong answer.
    • 44-46. Beware Greedy Matching. Remember this?
      NSString *domainName = [myHTMLString stringByReplacingRegexPattern:@"^.*href=[\"']http://(.*)/.*$" withString:@"$1" caseInsensitive:YES];
      What does it do if given:
      <a href="http://1.example.com/">This is a link</a> but <a href="http://2.example.com/">This is a link, too.</a>
    • 47-48. What you meant was: After ‘http://’ up to but not including the next ‘/’. Which is: http://([^/][^/]*)/
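To see the difference concretely, the sketch below runs both patterns over the two-link input from the greedy-matching slide, using Python's `re` in place of the Cocoa call. The greedy `(.*)` runs to the last “/” in the whole string, while the negated class stops at the next one.

```python
import re

html = ('<a href="http://1.example.com/">This is a link</a> but '
        '<a href="http://2.example.com/">This is a link, too.</a>')

# Greedy version from the slide: (.*) runs to the LAST "/" in the string.
greedy = re.sub(r"^.*href=[\"']http://(.*)/.*$", r"\1", html, flags=re.IGNORECASE)

# "Up to but not including the next /": a negated class cannot run past it.
bounded = re.search(r"http://([^/][^/]*)/", html).group(1)

print(greedy)   # 2.example.com/">This is a link, too.<
print(bounded)  # 1.example.com
```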
    • 49. Remember this?
      (?<!(=)|(="")|(=))(((http|ftp|https)://)|(www\.))+[\w]+(\.[\w]+)([\w\-.@?^=%&:/~+#]*[\w\-@?^=%&/~+#])?(?!.*/a>)
      Well, actually, he has n! problems, where n is the number of hyperlinks in the input string.
    • 50. So if you had
      <p>Today’s Links:</p>
      <UL>
      <LI><A HREF="http://example.com/1">Link 1</A></LI>
      <LI><A HREF="http://example.com/2">Link 2</A></LI>
      <LI><A HREF="http://example.com/3">Link 3</A></LI>
      <LI><A HREF="http://example.com/4">Link 4</A></LI>
      <LI><A HREF="http://example.com/5">Link 5</A></LI>
      <LI><A HREF="http://example.com/6">Link 6</A></LI>
      </UL>
    • 51. And tried to use:
      (?<!(=)|(="")|(=))(((http|ftp|https)://)|(www\.))+[\w]+(\.[\w]+)([\w\-.@?^=%&:/~+#]*[\w\-@?^=%&/~+#])?(?!.*/a>)
    • 52. It would have to: (the HTML from slide 50, repeated)
    • 53-58. And then: (the same HTML, repeated on each slide)
    • 59. And so on: (the same HTML, repeated)
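One way to avoid that kind of rescanning, in the spirit of the talk's advice, is to bound each match with a negated character class so the engine never has to look ahead through the rest of the page. The sketch below is an illustrative alternative, not the fix used on that project; the set of characters that ends a URL is an assumption chosen for this input.

```python
import re

# Rebuild the six-link list from slide 50.
html = "<p>Today's Links:</p><UL>" + "".join(
    '<LI><A HREF="http://example.com/%d">Link %d</A></LI>' % (n, n)
    for n in range(1, 7)
) + "</UL>"

# Each URL ends at the first quote, space, or angle bracket (an assumed
# stop set), so no lookahead over the rest of the document is needed.
urls = re.findall(r"http://[^\"' <>][^\"' <>]*", html)

print(len(urls))  # 6
print(urls[0])    # http://example.com/1
```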
    • 60-64. But what are they good for?
      Encoding/decoding metadata from image filenames.
      Renaming files on the command line (@2x?)
      Grabbing the user’s first name from a Full Name string (careful of Locales*)
      Stripping crap I don’t want out of user input (trailing spaces, anyone?)
      //.*\[.* *release *\] *;
      *See http://www.kalzumeus.com/2010/06/17/falsehoods-programmers-believe-about-names/
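Two of these uses are easy to try out. The sketch below strips trailing spaces and applies the slide's last pattern, which looks for commented-out `release` calls. It assumes the square brackets in that pattern were meant to be escaped (literal), and the sample source lines are invented.

```python
import re

# Strip trailing spaces from user input: one space, then zero or more, at the end.
cleaned = re.sub(r"  *$", "", "Carl Brown   ")

# The slide's search pattern, with the brackets escaped so they are literal
# (an assumption): a "//" comment containing a [... release]; message send.
commented_release = r"//.*\[.* *release *\] *;"

lines = [
    "    [image release];",        # live code: no match
    "    // [image release];",     # commented out: match
    "    // nothing to see here",  # comment without release: no match
]
hits = [ln for ln in lines if re.search(commented_release, ln)]

print(repr(cleaned))  # 'Carl Brown'
print(len(hits))      # 1
```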
    • 65. Questions? CarlB@PDAgent.com, @CarlAllenBrown, www.escortmissions.com (Blog), www.PDAgent.com (Company), https://github.com/carlbrown, http://www.slideshare.net/carlbrown
