Regex Basics

Ciarán Walsh's PHPNW08 slides:

In the right hands regular expressions can be a powerful tool, but it’s also far too easy for them to be used badly, or in the wrong situations.

This talk will kick off with a look at alternatives to regular expressions, for when the power of pattern matching is not required, and will also go over some cases when there are better alternatives available.
Then there will be a brief refresher on pattern syntax and some general tips and tricks to help when constructing regular expressions, before we go on to look at some situations where the use of pattern matching is a good fit, how to solve some common problems, and some common pitfalls when writing patterns.

Published in: Technology

  • Transcript

    • 1. Regular Expression Basics
      • PHPNW 2008
      • Ciarán Walsh
    • 2. What are regular expressions?
      • Regular expressions allow matching and manipulation of textual data.
      • Abbreviated as regex or regexp, or alternatively just “patterns”.
    • 3. Regular Expression Basics: Literals
      • bus Matches a ‘b’, followed by a ‘u’, followed by an ‘s’
    • 4. Regular Expression Basics: Anchors
      • ^ Matches at the beginning of a line
      • $ Matches at the end of a line
    • 5. Regular Expression Basics: Character Classes
      • [abc] Matches one of ‘a’, ‘b’ or ‘c’
      • [a-c] Same as above (character range)
      • [^abc] Matches one character that is not listed
      • . Matches any single character
    • 6. Regular Expression Basics: Alternation
      • a|b Matches one of ‘a’ or ‘b’
      • dog|cat Matches one of “dog” or “cat”
    • 7. Regular Expression Basics: Quantifiers (repetition)
      • {x,y} Matches a minimum of x and a maximum of y occurrences; either can be omitted
      • * Matches zero or more occurrences (any amount). Same as {0,}
      • + Matches one or more occurrences. Same as {1,}
      • ? Matches zero or one occurrences. Same as {0,1}
    • 8. Regular Expression Basics: Grouping
      • (…) Groups the contents of the parentheses
      • Affects alternation and quantifiers
      • Allows parts of the match to be captured
      • for|backward Matches “for” or “backward”
      • (for|back)ward Matches “forward” or “backward”
    • 9. Regular Expression Basics: Delimiters
      • /pattern/modifiers
      • /i Makes match case-insensitive
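
The syntax covered on slides 3 to 9 can be tried out directly. What follows is a minimal sketch rather than anything from the original slides: the sample subjects are made up for illustration, and each row records the value `preg_match()` is expected to return for that pattern and subject.

```php
<?php
// Each row: [pattern, subject, expected preg_match() result].
// The subjects are illustrative assumptions, not taken from the slides.
$cases = [
    ['/bus/',              'omnibus',  1], // literal: found anywhere in the subject
    ['/^bus/',             'omnibus',  0], // ^ requires the match at the start
    ['/bus$/',             'omnibus',  1], // $ requires the match at the end
    ['/^[a-c]+$/',         'abcab',    1], // range class, one or more times
    ['/^[^abc]$/',         'a',        0], // negated class rejects listed characters
    ['/^(for|back)ward$/', 'backward', 1], // the group limits the alternation
    ['/^(for|back)ward$/', 'for',      0], // "for" alone is not enough here
    ['/^colou?r$/',        'color',    1], // ? makes the "u" optional
    ['/^a{2,3}$/',         'aaaa',     0], // {2,3}: two or three "a"s only
    ['/BUS/i',             'omnibus',  1], // /i ignores case
];

$results = [];
foreach ($cases as [$pattern, $subject, $expected]) {
    $actual = preg_match($pattern, $subject);
    $results[] = $actual;
    printf("%-22s %-10s => %d (expected %d)\n", $pattern, $subject, $actual, $expected);
}
```
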
    • 10. Performing a Match
      • Returns number of matches (0 or 1)
      • $matches will contain captured groups
      preg_match('/Te(.)f?/i', 'text', $matches);
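
To make the two bullet points on slide 10 concrete, this short sketch runs the slide's own pattern and subject and inspects both the return value and `$matches`. The second subject, `toast`, is an assumed example of a string that does not match.

```php
<?php
// The slide's own pattern and subject.
$result = preg_match('/Te(.)f?/i', 'text', $matches);
var_dump($result);  // 1: the pattern matched once
print_r($matches);  // [0] is the whole match "tex", [1] is the captured "x"

// An assumed subject that contains no "te": no match, so $none stays empty.
$miss = preg_match('/Te(.)f?/i', 'toast', $none);
var_dump($miss);    // 0
```

`preg_match()` can also return `false` when the pattern itself is invalid, so comparing the result with `=== 1` is a reasonably safe habit.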
    • 11. Performing a Replacement
      • Returns string after replacement
      • Can use backreferences with \0-\9
      preg_replace('/some(text)/', '\1', $text)
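
A small sketch of backreferences in the replacement string, assuming sample inputs that are not in the slides. It shows `\1` as used on slide 11, the equivalent `$1` style, and `\0` for the whole match.

```php
<?php
$text = 'sometext and sometext again';  // assumed sample input

// \1 refers to the first captured group, so only "text" is kept.
$kept = preg_replace('/some(text)/', '\1', $text);

// $1 and $2 are an alternative spelling of the same backreferences.
$swapped = preg_replace('/(\w+) (\w+)/', '$2 $1', 'hello world');

// \0 refers to the entire match.
$wrapped = preg_replace('/\d+/', '[\0]', 'room 101');

echo $kept, "\n", $swapped, "\n", $wrapped, "\n";
```
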
    • 12. Don’t Use Regular Expressions! Don’t Abuse Regular Expressions!
      • [Slide shows a regular expression several thousand characters long, which appears to be the widely circulated RFC 822 email address validation pattern; its text did not survive transcription intact.]
      • “Some people, when confronted with a problem, think ‘I know, I'll use regular expressions.’ Now they have two problems.” (Jamie Zawinski)
    • 13. Testing for a Substring
      if (preg_match('/foo/', $var))
      if (strpos($var, 'foo') !== false)
      if (preg_match('/foo/i', $var))
      if (stripos($var, 'foo') !== false)
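
The `!== false` comparison on slide 13 matters because `strpos()` returns position 0 when the needle is at the very start of the string. This sketch uses assumed sample strings to show the difference.

```php
<?php
$var = 'foobar';  // assumed sample: the needle sits at position 0

// Position 0 is falsy, so a loose truthiness check misses the match.
$naive = (bool) strpos($var, 'foo');

// Comparing against false explicitly gives the right answer.
$strict = strpos($var, 'foo') !== false;

// stripos() is the case-insensitive counterpart.
$insensitive = stripos('FOObar', 'foo') !== false;

// The regex version gives the same answer with more machinery.
$regex = preg_match('/foo/', $var) === 1;

var_dump($naive, $strict, $insensitive, $regex);
```

On PHP 8 and later, `str_contains()` expresses the case-sensitive check directly.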
    • 14. Validating an Integer: Regular Expression
      • Intention is not immediately obvious
      • Not efficient
      if (preg_match('/^\d+$/', $value)) {
          // $value is a positive integer
      }
    • 15. Validating an Integer: ctype (Character Type)
      • Native C library (fast)
      • Makes the intention obvious
      if (ctype_digit($value)) {
          // $value is a positive integer
      }
    • 16. Validating an Integer: Casting
      • Intention is fairly clear
      • Casting is safe practice
      • Any invalid values will result in zero
      $casted_value = intval($value);
      if ($casted_value > 0) {
          // $casted_value is a positive (non-zero) integer
      }
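
The three approaches on slides 14 to 16 do not accept exactly the same inputs, which a side-by-side run makes visible. This is a sketch under assumptions: `check()` is a hypothetical helper and the sample strings are invented. It suggests that `"0"` passes the first two checks but not the cast, that `intval()` keeps leading digits of a string like `"42abc"` rather than returning zero, and that `$` in the regex tolerates a trailing newline.

```php
<?php
// Hypothetical helper: runs all three checks from the slides on one string.
function check(string $value): array
{
    return [
        'regex'  => preg_match('/^\d+$/', $value) === 1, // slide 14
        'ctype'  => ctype_digit($value),                 // slide 15
        'intval' => intval($value) > 0,                  // slide 16
    ];
}

// Assumed sample inputs, chosen to show where the checks disagree.
foreach (['42', '0', '-7', '42abc', "42\n"] as $input) {
    echo json_encode($input), ' => ', json_encode(check($input)), "\n";
}
```
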
    • 17. HTML Parsing
    • 18. Using Regular Expressions
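    • 19. Using Regular Expressions
      • Postcodes: /[A-Z]{1,2}[0-9R][0-9A-Z]? [0-9][A-Z]{2}/
      • IP Addresses
      • @^(\d{1,2})/(\d{1,2})/(\d{4})$@ (three slash-separated numeric groups of 1-2, 1-2 and 4 digits, as in a date such as 31/12/2008)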
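
A sketch of how the two patterns on slide 19 might be used. The sample inputs are invented, `is_real_date()` is a hypothetical helper, and reading the slash-separated pattern as day/month/year is an assumption. The helper also illustrates that a pattern can only check the shape of the input; whether the date exists is left to `checkdate()`.

```php
<?php
$postcode = '/[A-Z]{1,2}[0-9R][0-9A-Z]? [0-9][A-Z]{2}/';
$date     = '@^(\d{1,2})/(\d{1,2})/(\d{4})$@';

// Hypothetical helper: the pattern checks the shape, checkdate() the calendar.
// Assumes the three groups are day, month and year, in that order.
function is_real_date(string $input, string $pattern): bool
{
    if (preg_match($pattern, $input, $m) !== 1) {
        return false;
    }
    return checkdate((int) $m[2], (int) $m[1], (int) $m[3]);
}

var_dump(preg_match($postcode, 'M1 1AA'));    // shape of a short postcode
var_dump(preg_match($postcode, 'GIR 0AA'));   // the "R" in [0-9R] allows this one
var_dump(is_real_date('31/12/2008', $date));  // well-formed and a real date
var_dump(is_real_date('31/02/2008', $date));  // well-formed, but no such date
```
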
    • 20. Constructing Patterns
      • Writing patterns is a balance between matching what you do want and not matching what you don’t want.
    • 21. You don’t need to use /…/ to denote a pattern!
      preg_match('/<b><s>.+<\/s>.+<\/b>/', $html)
      preg_match('@<b><s>.+</s>.+</b>@', $html)
    • 22. Greediness
      $html = <<<HTML
      <span>some text</span><span>some more text!</span>
      HTML;
      preg_match("@<span>(.+)</span>@", $html, $matches);
      echo $matches[0];
      preg_match("@<span>(.+?)</span>@", $html, $matches);
      echo $matches[0];
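
The difference between the two calls on slide 22 is easiest to see by comparing what each one captures. This sketch reuses the slide's input without the heredoc whitespace; the third pattern, using a negated character class, is an alternative added here for comparison and is not from the slides.

```php
<?php
$html = '<span>some text</span><span>some more text!</span>';

// Greedy: .+ grabs as much as possible, so the match runs to the last </span>.
preg_match('@<span>(.+)</span>@', $html, $greedy);

// Lazy: .+? grabs as little as possible, stopping at the first </span>.
preg_match('@<span>(.+?)</span>@', $html, $lazy);

// Alternative (not from the slides): forbid "<" inside the capture instead.
preg_match('@<span>([^<]+)</span>@', $html, $negated);

echo $greedy[1], "\n";   // some text</span><span>some more text!
echo $lazy[1], "\n";     // some text
echo $negated[1], "\n";  // some text
```
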
    • 23. You can make your pattern readable!
      preg_match('`^(\w+)://(?:(.+?):(.+?)@)?(.+?)\.(\w+)$`', $s, $matches)
      preg_match('`
        ^
        (\w+)://     # Protocol
        (?:
          (.+?)      # Username
          :          # :
          (.+?)      # Password
          @          # @
        )?           # Username/password are optional
        (.+?)        # Hostname
        \.(\w+)      # Top-level domain
        $
      `x', $s, $matches);
    • 24. Extracting Captures
      preg_match('`^
        (?P<protocol>\w+)://
        (?:
          (?P<user>.+?) : (?P<pass>.+?) @
        )?
        (?P<host>.+?)
        \.(?P<tld>\w+)
      $`x', $s, $matches);
      Array
      (
          [0] => http://foo:bar@baz.example.com
          [protocol] => http
          [1] => http
          [user] => foo
          [2] => foo
          [pass] => bar
          [3] => bar
          [host] => baz.example
          [4] => baz.example
          [tld] => com
          [5] => com
      )
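
Since named groups appear in `$matches` under both their name and their number, it can be convenient to keep only the named entries. The sketch below assumes a hypothetical helper, `parse_url_parts()`, and a second sample URL without credentials; the pattern is the one from slide 24 with backslashes restored.

```php
<?php
$pattern = '`^
    (?P<protocol>\w+)://
    (?:
        (?P<user>.+?) : (?P<pass>.+?) @
    )?                               # credentials are optional
    (?P<host>.+?)
    \.(?P<tld>\w+)
$`x';

// Hypothetical helper: returns only the named captures, or null on no match.
function parse_url_parts(string $url, string $pattern): ?array
{
    if (preg_match($pattern, $url, $matches) !== 1) {
        return null;
    }
    // Keep string keys (the group names) and drop the numeric duplicates.
    return array_filter($matches, 'is_string', ARRAY_FILTER_USE_KEY);
}

$withAuth = parse_url_parts('http://foo:bar@baz.example.com', $pattern);
$noAuth   = parse_url_parts('https://example.org', $pattern);  // assumed sample
$invalid  = parse_url_parts('not a url', $pattern);

print_r($withAuth);
```
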
    • 25. Variable Data
      $value = preg_quote($value, '!');
      if (preg_match("!>$value</(?:div|span)>!", $text))
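
To see why `preg_quote()` matters when a variable is placed inside a pattern, this sketch compares the raw and quoted versions. The value `1+1` and the two test strings are assumptions chosen so that the `+` metacharacter changes the result.

```php
<?php
$value = '1+1';  // assumed user-supplied text containing a metacharacter

// Unquoted: "+" is read as a quantifier, so the pattern means ">11", ">111", ...
$raw = "!>{$value}</(?:div|span)>!";

// Quoted: preg_quote() escapes "+" (and the "!" delimiter if it occurs).
$quoted = '!>' . preg_quote($value, '!') . '</(?:div|span)>!';

$rawOnLiteral    = preg_match($raw, '<div>1+1</div>');     // misses the literal text
$rawOnOther      = preg_match($raw, '<div>11</div>');      // matches something else
$quotedOnLiteral = preg_match($quoted, '<div>1+1</div>');  // finds the literal text
$quotedOnOther   = preg_match($quoted, '<div>11</div>');   // rejects the lookalike

var_dump($rawOnLiteral, $rawOnOther, $quotedOnLiteral, $quotedOnOther);
```
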
    • 26. Performing Logic on Replacements
      preg_replace('/\w+/e', 'strtoupper("\0")', 'foo bar baz')
      function upper_case_match($matches) {
          return strtoupper($matches[0]);
      }
      preg_replace_callback('/\w+/', 'upper_case_match', 'foo bar baz')
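
The `/e` modifier shown on slide 26 was deprecated in PHP 5.5 and removed in PHP 7.0, so `preg_replace_callback()` is the form that still runs today. The sketch below repeats the slide's named callback and adds an anonymous-function variant, which is an illustration added here rather than something from the slides.

```php
<?php
// The named callback from the slide, with type declarations added.
function upper_case_match(array $matches): string
{
    return strtoupper($matches[0]);
}

$upper = preg_replace_callback('/\w+/', 'upper_case_match', 'foo bar baz');

// Assumed variant: an anonymous function that capitalises each word's first letter.
$title = preg_replace_callback(
    '/\b\w/',
    function (array $m): string {
        return strtoupper($m[0]);
    },
    'foo bar baz'
);

echo $upper, "\n";  // FOO BAR BAZ
echo $title, "\n";  // Foo Bar Baz
```
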
    • 27. Testing Tools
      • RegexBuddy
      • Reggy
      • http://rubular.com
    • 28. Any Questions?