Regular Expression Best Practices
          Tony Stubblebine
          tony@tonystubblebine.com
          www.stubbleblog....
Tabbed indentation is a sin but this isn't?
$string =~ s<
 (?:http://(?:(?:(?:(?:(?:[a-zA-Z\d](?:(?:[a-zA-Z\d]|-)*[a-zA-Z\d])...
Best Practices for Any Programming
There are programming fundamentals that are
 routinely ignored by regular expression wr...
Good Code
# Given a URL/URI, fetches it.
# Returns an HTTP::Response object.
sub get {
    my $self = shift; my $uri = shi...
What if we didn't include
   documentation or whitespace?


sub get{my$self=shift;my$uri=shift;
 $uri=$self->base?URI->new...
What if we were also as terse as
               possible?
So:

    No documentation

    No whitespace

    One charact...
We'd have a regular expression.
sub g{my($s,$u)=@_;$u=$s->b?U->n($u,
 $s->b):U->q($u);return$s->SUPER::g($u->a,@_);}
What do we want from best
                   practices?

    Practices that maximize desired goals in certain
    applica...
#1: Use Extended Whitespace
   Add indentation, newlines, and comments to regular
    expressions
   Usage /x: m/regex/x...
Extended Whitespace Gotchas
•   Must explicitly ask to match a space with \s
    or \<SPACE>
•   Must escape pound signs, \#
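
Both gotchas can be seen in a minimal sketch. The input string and variable names below are hypothetical and chosen only for illustration; the pattern needs `\s` for the spaces and `\#` for the pound sign because `/x` would otherwise ignore the former and treat the latter as the start of a comment.

```perl
use strict;
use warnings;

# Hypothetical input, not taken from the talk.
my $text = "issue #42 is open";

# Under /x, bare spaces in the pattern are ignored and a bare '#' starts
# a comment, so both are written explicitly as \s and \#.
my $matched = $text =~ m/ issue \s \# (\d+) \s is \s open /x ? 1 : 0;
my $number  = $1;

print "matched issue $number\n" if $matched;
```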
Before

    What does this match?
$text =~ m/^([01]?\d\d?|2[0-4]\d|
 25[0-5])\.([01]?\d\d?|2[0-4]\d|
 25[0-5])\.([01]?\d\d?|2[0-4]\d...
After
$text =~ m/
 # Match IP addresses like 169.146.10.45
 ^   # Start of string
 ([01]?\d\d?|2[0-4]\d|25[0-5])
 # Number, 0...
#2 Test


    You don't know your data.

    And you have a typo in your regex.

    Guaranteed surprises on both front...
Fun Gotcha
What file does this code open?
$file =
"/etc/passwd\0/var/www/index.html";
if ( $file =~ m/^ .* \.html/x ) {
    ...
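
The gotcha works because `.` matches a NUL byte just like any other character, while many C-level file APIs treat NUL as the end of the string. A minimal sketch of a stricter check follows; `is_safe_html_path` is a hypothetical helper name, and rejecting NUL bytes is only one of the checks a real path validator would need.

```perl
use strict;
use warnings;

my $file = "/etc/passwd\0/var/www/index.html";

# The slide's check: '.' matches the embedded NUL, so this succeeds.
my $looks_like_html = $file =~ m/^ .* \.html/x ? 1 : 0;

# Hypothetical stricter check: no NUL bytes anywhere, and the whole
# string must end in ".html".
sub is_safe_html_path {
    my ($path) = @_;
    return $path =~ m/\A [^\x00]+ \.html \z/x ? 1 : 0;
}

print "naive check:  $looks_like_html\n";
print "strict check: ", is_safe_html_path($file), "\n";
```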
Typical Gotcha
This matches foo.gif
But also... foojpg and jpg.doc
# match image files
m/ \. gif | jpg | jpeg | png $/x
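
To see why, note that the pattern is four independent alternatives: `\.gif`, `jpg`, `jpeg`, and `png$`. The sketch below runs it against a few hypothetical file names; only the pattern itself comes from the slide.

```perl
use strict;
use warnings;

# The ungrouped pattern from the slide: the dot belongs only to "gif"
# and the end-of-string anchor belongs only to "png".
my $ungrouped = qr/ \. gif | jpg | jpeg | png $/x;

# Hypothetical file names, chosen only to illustrate the problem.
my %matched;
for my $name ("foo.gif", "foojpg", "jpg.doc", "notes.txt") {
    $matched{$name} = $name =~ $ungrouped ? 1 : 0;
    printf "%-10s %s\n", $name, $matched{$name} ? "matches" : "no match";
}
```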
Test framework

    Write your regular expressions in a place where
    you can test them.

    Build up a list of posit...
Hackers Test Framework
   Your “framework” could be this simple:
foreach my $test (@tests) {
    # looks like an image fi...
Real Tests Are Better
my @match = ("foo.gif", "foo.bar.jpg", "bar_foo.gif.jpg.png");
my @fail   = ("gif.foo", "foo.gif.", ...
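
A runnable version of this idea, sketched with the core Test::More module. It assumes the grouped form of the image pattern so that every case passes; run against the ungrouped pattern, the same lists would report failures for "foopng" and "foo.jpeg.bar".

```perl
use strict;
use warnings;
use Test::More;

# Grouped version of the image-file pattern, so the dot and the
# end-of-string anchor apply to every alternative.
sub match {
    my ($name) = @_;
    return $name =~ m/ \. (?: gif | jpg | jpeg | png ) $/x ? 1 : 0;
}

my @match = ("foo.gif", "foo.bar.jpg", "bar_foo.gif.jpg.png");
my @fail  = ("gif.foo", "foo.gif.", "foopng", "foo.jpeg.bar");

ok(  match($_), "$_ matches" )        for @match;
ok( !match($_), "$_ fails to match" ) for @fail;

done_testing();
```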
#3 Use Structure

... as a slow-witted human being I have a very
   small head and I had better learn to live with it
   a...
Breaking up an email regex
We can write an email regex that looks like this:
m/$user\@$domain/


Build your regexes from sm...
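
A minimal sketch of composing a pattern from `qr//` pieces. The pieces are deliberately simplified and will reject many valid addresses; `$email` and the sample addresses are hypothetical names added for illustration.

```perl
use strict;
use warnings;

# Small, separately readable pieces (simplified on purpose).
my $user   = qr/\w+/;
my $domain = qr/\w+ \. (?: \w+ \. )* \w\w\w?/xi;

# The composite pattern, anchored at both ends. The @ is escaped so
# Perl does not try to interpolate an array.
my $email = qr/^ $user \@ $domain $/x;

for my $candidate ('tony@example.com', 'tony@mail.example.co', 'not an address') {
    printf "%-22s %s\n", $candidate, $candidate =~ $email ? "ok" : "rejected";
}
```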
Use Post Processing

    It's easier to say a number is <= 255 in code than it is as
    a regular expression.
# IP Addre...
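
The same idea as a self-contained sketch: the regex checks only the shape of the address, and ordinary code checks the range. The subroutine name `is_ipv4` is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 for a dotted quad with every part in
# the range 0-255, and 0 otherwise.
sub is_ipv4 {
    my ($ip) = @_;

    # Shape check only: four groups of one to three digits.
    my @octets = $ip =~ m/\A (\d{1,3}) \. (\d{1,3}) \. (\d{1,3}) \. (\d{1,3}) \z/x
        or return 0;

    # Range check in plain code instead of in the pattern.
    for my $num (@octets) {
        return 0 if $num > 255;
    }
    return 1;
}

print is_ipv4("169.146.10.45") ? "valid\n" : "invalid\n";
```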
#4. Good habits

    Regexes are hard to debug, so avoid errors.

    Error avoidance habits:
    
        Group alternat...
Group Alternations
Group your alternations. In this regex, the dot and
 end of string ($) are not part of your alternation...
Use Lazy Quantifiers

    Use lazy quantifiers. It's easier to say when to
    stop.
<td>.*?</td>
Lazy Quantifiers...
Compare that to
#Matches too much
$text = "<td>foo</td><td>bar</td>";
$text =~ m!<td>.*</td>!;


#Matc...
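
The difference is easiest to see by capturing what each quantifier consumes. A minimal sketch using the two-cell string from the slide; the variable names are illustrative.

```perl
use strict;
use warnings;

my $text = "<td>foo</td><td>bar</td>";

# Greedy: .* runs to the end, then backs up to the LAST </td>.
my ($greedy) = $text =~ m!<td>(.*)</td>!;

# Lazy: .*? grows one character at a time and stops at the FIRST </td>.
my ($lazy) = $text =~ m!<td>(.*?)</td>!;

print "greedy: $greedy\n";
print "lazy:   $lazy\n";
```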
Don't use regular expressions

    Regular expressions don't deal well with
    nesting
$text = "<td> foo
 <table><tr><td...
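
A short sketch of the failure. The slide's input is truncated, so the complete string below is a hypothetical stand-in with one table nested inside a cell.

```perl
use strict;
use warnings;

# Hypothetical input: an outer cell that contains a nested table.
my $text = "<td> foo <table><tr><td>bar</td></tr></table> baz </td>";

# The lazy match stops at the first </td>, which closes the INNER cell,
# so the captured "outer cell" is cut off in the middle of the table.
my ($cell) = $text =~ m!<td> (.*?) </td>!x;

print "captured: [$cell]\n";
```

The pattern has no way to count opening and closing tags, which is why a parser is the better tool here.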
#5. Optimize Last

    It's more common for regular expressions to be
    broken than to be slow

    Optimize last.

 ...
Optimizing Quantifiers
# This is slow because the match backtracks from the end
# of the file
$text = "M1 text i'm lookin...
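
The slide names the two slow cases. A common remedy, offered here as an assumption rather than something taken from the slides, is to choose the quantifier by where the closing marker is expected: lazy when it is near the start, greedy when it is near the end. The data below is hypothetical, and Perl's own optimizer may already speed up simple cases like these.

```perl
use strict;
use warnings;

# Hypothetical data: the closing marker sits near the start of a long string.
my $text = "M1 text I'm looking for M2 " . ("x" x 10_000);
# Lazy: the match grows from M1 and stops at the first " M2".
my ($near) = $text =~ m/M1 (.*?) M2/s;

# Hypothetical data: the closing tag sits at the very end of a long document.
my $html = "<body>" . ("<p>filler</p>" x 1_000) . "</body>";
# Greedy: .* jumps to the end and backs up only a few characters.
my ($far) = $html =~ m!<body> (.*) </body>!xs;

print "near capture: $near\n";
print "far capture length: ", length($far), "\n";
```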
Buy The Book!
Available from Amazon for $9.95
http://bit.ly/regexpr


Thank you for reading!
I'm tony@tonystubblebine.com

Regex Best Practices

There are two reasons regular expressions are so hard to read and are so error prone. One, the syntax is terse. Two, programmers ignore all normal programming practices. This talk reintroduces white space, structure, and basic verification/testing and then calls them "Best Practices."

Published in: Technology
Transcript of "Regex Best Practices"

  1. Regular Expression Best Practices. Tony Stubblebine, tony@tonystubblebine.com, www.stubbleblog.com, @tonystubblebine
  2. Tabbed indentation is a sin but this isn't?
     $string =~ s<
      (?:http://(?:(?:(?:(?:(?:[a-zA-Z\d](?:(?:[a-zA-Z\d]|-)*[a-zA-Z\d])?)\.
      )*(?:[a-zA-Z](?:(?:[a-zA-Z\d]|-)*[a-zA-Z\d])?))|(?:(?:\d+)(?:\.(?:\d+)
      ){3}))(?::(?:\d+))?)(?:/(?:(?:(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F
      \d]{2}))|[;:@&=])*)(?:/(?:(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d]{
      2}))|[;:@&=])*))*)(?:\?(?:(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d]{
      2}))|[;:@&=])*))?)?)|(?:ftp://(?:(?:(?:(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?
      :%[a-fA-F\d]{2}))|[;?&=])*)(?::(?:(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-
      fA-F\d]{2}))|[;?&=])*))?@)?(?:(?:(?:(?:(?:[a-zA-Z\d](?:(?:[a-zA-Z\d]|-
      )*[a-zA-Z\d])?)\.)*(?:[a-zA-Z](?:(?:[a-zA-Z\d]|-)*[a-zA-Z\d])?))|(?:(?
      :\d+)(?:\.(?:\d+)){3}))(?::(?:\d+))?))(?:/(?:(?:(?:(?:[a-zA-Z\d$\-_.+!
      *'(),]|(?:%[a-fA-F\d]{2}))|[?:@&=])*)(?:/(?:(?:(?:[a-zA-Z\d$\-_.+!*'()
      ,]|(?:%[a-fA-F\d]{2}))|[?:@&=])*))*)(?:;type=[AIDaid])?)?)|(?:news:(?:
      ...................
     Abigail, comp.lang.perl.misc, http://aspn.activestate.com/ASPN/Cookbook/Rx/Recipe/59864
  3. Best Practices for Any Programming. There are programming fundamentals that are routinely ignored by regular expression writers.
     • Put a line break after statements and space between expressions.
     • Throw in a comment or two.
     • Use subroutines and modules to show structure and avoid duplication.
     • Test.
  4. Good Code
     # Given a URL/URI, fetches it.
     # Returns an HTTP::Response object.
     sub get {
         my $self = shift; my $uri = shift;
         $uri = $self->base
             ? URI->new_abs( $uri, $self->base )
             : URI->new( $uri );
         return $self->SUPER::get( $uri->as_string, @_ );
     }
  5. What if we didn't include documentation or whitespace?
     sub get{my$self=shift;my$uri=shift;
      $uri=$self->base?URI->new_abs($uri,
      $self->base):URI->new($uri);return$self->SUPER::get($uri->as_string,@_);}
  6. What if we were also as terse as possible? So:
     • No documentation
     • No whitespace
     • One character variable and method names
  7. We'd have a regular expression.
     sub g{my($s,$u)=@_;$u=$s->b?U->n($u,
      $s->b):U->q($u);return$s->SUPER::g($u->a,@_);}
  8. What do we want from best practices?
     • Practices that maximize desired goals in certain applications.
     • Goals of regex best practices:
       • Maintainability
       • Correctness
       • Development Speed
  9. #1: Use Extended Whitespace
     • Add indentation, newlines, and comments to regular expressions
     • Usage /x: m/regex/x
     # Look for green or red foxes
     $text =~ /(green | red) \s fox (es)?   # Allow more than one
              /x;
  10. Extended Whitespace Gotchas
      • Must explicitly ask to match a space with \s or \<SPACE>
      • Must escape pound signs, \#
  11. Before
      • What does this match?
      $text =~ m/^([01]?\d\d?|2[0-4]\d|25[0-5])\.([01]?\d\d?|2[0-4]\d|25[0-5])\.([01]?\d\d?|2[0-4]\d|25[0-5])\.([01]?\d\d?|2[0-4]\d|25[0-5])$/;
  12. After
      $text =~ m/
       # Match IP addresses like 169.146.10.45
       ^   # Start of string
       ([01]?\d\d?|2[0-4]\d|25[0-5])     # Number, 0-255
       \.([01]?\d\d?|2[0-4]\d|25[0-5])   # 0-255
       \.([01]?\d\d?|2[0-4]\d|25[0-5])   # 0-255
       \.([01]?\d\d?|2[0-4]\d|25[0-5])   # 0-255
       $/x;
  13. #2 Test
      • You don't know your data.
      • And you have a typo in your regex.
      • Guaranteed surprises on both fronts.
  14. Fun Gotcha. What file does this code open?
      $file = "/etc/passwd\0/var/www/index.html";
      if ( $file =~ m/^ .* \.html/x ) {
          open (FILE, "$file");
      }
  15. Typical Gotcha. This matches foo.gif. But also... foojpg and jpg.doc
      # match image files
      m/ \. gif | jpg | jpeg | png $/x
  16. Test framework
      • Write your regular expressions in a place where you can test them.
      • Build up a list of positive and negative matches.
      • Include the list in your documentation, e.g.:
      # matches 800-555-1212 but not
      # 800.555.1212 or 800-BETS-OFF
  17. Hackers Test Framework
      • Your “framework” could be this simple:
      foreach my $test (@tests) {
          # looks like an image file?
          if ( $test =~ m/ \. gif | jpg | jpeg | png $/x ) {
              print "Matched on $test\n";
          } else {
              print "Failed match on $test\n";
          }
      }
  18. Real Tests Are Better
      my @match = ("foo.gif", "foo.bar.jpg", "bar_foo.gif.jpg.png");
      my @fail  = ("gif.foo", "foo.gif.", "foopng", "foo.jpeg.bar");
      sub match {
          return $_[0] =~ m/ \. gif | jpg | jpeg | png $/x;
      }
      foreach my $test (@match) {
          ok( match($test), "$test matches");
      }
      foreach my $test (@fail) {
          ok( !match($test), "$test fails to match");
      }
  19. #3 Use Structure
      "... as a slow-witted human being I have a very small head and I had better learn to live with it and to respect my limitations and give them full credit, rather than to try to ignore them, for the latter vain effort will be punished by failure." ~ Edsger Dijkstra
  20. Breaking up an email regex. We can write an email regex that looks like this:
      m/$user\@$domain/
      Build your regexes from smaller regexes like this:
      $user = qr/\w+/;
      $domain = qr/\w+\.(\w+\.)*\w\w\w?/i;
  21. Use Post Processing
      • It's easier to say a number is <= 255 in code than it is as a regular expression.
      # IP Address check
      $ip =~ m/^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})$/;
      foreach my $num ($1, $2, $3, $4) {
          $failure++ unless $num < 256;
      }
  22. #4. Good habits
      • Regexes are hard to debug, so avoid errors.
      • Error avoidance habits:
        • Group alternations with parentheses
        • Use lazy quantifiers
        • Don't use regular expressions
  23. Group Alternations. Group your alternations. In this regex, the dot and end of string ($) are not part of your alternation.
      m/ \. (gif | jpg | jpeg | png) $/x
  24. Use Lazy Quantifiers
      • Use lazy quantifiers. It's easier to say when to stop.
      <td>.*?</td>
  25. Lazy Quantifiers... Compare that to
      #Matches too much
      $text = "<td>foo</td><td>bar</td>";
      $text =~ m!<td>.*</td>!;
      #Matches too little
      $text = "<td>foo <b>bar</b> </td>";
      $text =~ m/<td>[^<]*/;
  26. Don't use regular expressions
      • Regular expressions don't deal well with nesting
      $text = "<td> foo <table><tr><td>bar</td>...";
      $text =~ m!<td> .*? </td>!;
      • Use something better instead: an HTML or XML parsing library.
  28. #5. Optimize Last
      • It's more common for regular expressions to be broken than to be slow
      • Optimize last.
      • Start with the quantifiers
  29. Optimizing Quantifiers
      # This is slow because the match backtracks from the end
      # of the file
      $text = "M1 text i'm looking for M2 thousand more characters to come...";
      $text =~ m/M1 (.*) M2/s;
      # This is slow because the match looks for </body> at
      # (nearly) every position.
      $html =~ m!<body> (.*?) </body>!xs;
  30. Buy The Book! Available from Amazon for $9.95: http://bit.ly/regexpr. Thank you for reading! I'm tony@tonystubblebine.com