REGEX 101
The Swiss Army knife of string manipulation
@matthiasmullie
Regular expressions are undervalued, and most developers tend to know only the basics. A thorough understanding of how regular expressions work will be incredibly helpful when you need to parse structured data.

This presentation will assume you already know what regular expressions are, but will sum up (with an example) some fancy things you probably didn’t know were possible with regular expressions.

If you're interested in a more detailed write-up, I suggest you check out http://www.mullie.eu/regular-expressions-basics/ & http://www.mullie.eu/regular-expressions-advanced/

This presentation is based on the PHP implementation of PCRE, but nearly all programming languages support the same functionality, albeit sometimes with their own twists.

Published in: Education, Technology
Transcript of "Regular expressions 101"

  1. REGEX 101: The Swiss Army knife of string manipulation
  2. @matthiasmullie
  3. INTRODUCTION: What are regular expressions?
  4. "Regular expressions are special characters that match or capture portions of a field, as well as the rules that govern all characters." (Google)
  5. "A regular expression provides a concise and flexible means for 'matching' strings of text, such as particular characters, words, or patterns of characters." (Wikipedia)
  6. /\{\$([a-z0-9_]*)((\.[a-z0-9_]*)*)(->[a-z0-9_]*((\.[a-z0-9_]*)*))?((\|[a-z_][a-z0-9_]*(:.*?)*)*)\}/i
  7. "Regular expressions find patterns in strings." (Me)
  8. Neque porro quisquam est qui dolorem ipsum quia dolor sit amet, consectetur, adipisci velit...
      ‣ /ipsum/
      ‣ /[a-z]/i
      ‣ /(est|qui)/
      ‣ /[^\w]/i
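The four patterns on this slide can be tried against the sample sentence. The sketch below uses Python's `re` module rather than PHP's `preg_*` functions (an assumption of this example, so the `/…/i` delimiters and modifier become a flag argument), and it assumes the last pattern was meant as `[^\w]`, i.e. "any non-word character". The variable names are illustrative only.

```python
import re

# Sample sentence from the slide.
text = ("Neque porro quisquam est qui dolorem ipsum quia dolor sit amet, "
        "consectetur, adipisci velit...")

# /ipsum/ : a literal sequence of characters.
literal = re.findall(r"ipsum", text)

# /(est|qui)/ : alternation, either "est" or "qui".
either = re.findall(r"(est|qui)", text)

# /[a-z]/i : any single letter, case-insensitive.
letters = re.findall(r"[a-z]", text, re.IGNORECASE)

# /[^\w]/i : any single character that is NOT a word character
# (assumed reading of the slide's last pattern).
non_word = re.findall(r"[^\w]", text, re.IGNORECASE)

print(literal)        # ['ipsum']
print(either)         # ['qui', 'est', 'qui', 'qui']
print(set(non_word))  # spaces, commas and dots
```

Because the sentence contains only letters, spaces and punctuation, every character is either in `letters` or in `non_word`.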
  9. BASICS: The syntax everyone should know already
  10. /Delimiter/
      ‣ Any [^a-zA-Z0-9\s] character
      ‣ Opening char == terminating char
        ‣ Except for [ ], ( ), { } and < >
  11. Use / (uniformity, you know)
  12. Meta characters: . ( ) [ ] \ ^ $ * ? + | {n} {n,m}
  13. Pattern modifiers //x: i, A, m, D, s, U, x, J, e, ...
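A rough illustration of three of the listed modifiers follows. It is a sketch in Python's `re` module, not PHP: the assumption here is that `i`, `m` and `s` correspond to `re.IGNORECASE`, `re.MULTILINE` and `re.DOTALL`. The other modifiers on the slide (`A`, `D`, `U`, `J`, `e`) have no direct Python flag and are left out. The sample text is made up.

```python
import re

text = "First line\nsecond LINE"

# i: letters match regardless of case.
case_insensitive = re.findall(r"line", text, re.IGNORECASE)

# m: ^ and $ also match at line breaks, not only at the string ends.
line_starts = re.findall(r"^\w+", text, re.MULTILINE)

# s: the dot also matches a newline, so .* can span both lines.
without_s = re.search(r"First.*LINE", text)
with_s = re.search(r"First.*LINE", text, re.DOTALL)

print(case_insensitive)   # ['line', 'LINE']
print(line_starts)        # ['First', 'second']
print(without_s)          # None
print(with_s is not None) # True
```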
  14. Character classes [ ]
      ‣ Ranges: [0-9], [a-zA-Z], [A-F0-9]
      ‣ Inverse ranges: [^0-9], [^a-zA-Z], [^A-F0-9]
  15. Character classes [ ]: No sequence of characters!
      ‣ [lorem] matches l, o, r, e or m, not the sequence "lorem"
  16. Character classes [ ]: POSIX
      ‣ [:alnum:]
      ‣ [:blank:]
      ‣ [:lower:]
      ‣ ...
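To make the "no sequence of characters" point concrete, here is a small sketch in Python's `re` module (a stand-in for PHP's PCRE functions, chosen for this example). The input strings are made up. POSIX classes such as `[:alnum:]` are not shown because Python's `re` does not support them.

```python
import re

# A range matches ONE character from the range; a leading ^ inverts it.
hex_digits = re.findall(r"[A-F0-9]", "7F3z")
non_hex = re.findall(r"[^A-F0-9]", "7F3z")

# [lorem] is a set of single characters, not the word "lorem":
# it matches each l, o, r, e or m individually.
single_chars = re.findall(r"[lorem]", "model")

print(hex_digits)    # ['7', 'F', '3']
print(non_hex)       # ['z']
print(single_chars)  # ['m', 'o', 'e', 'l']
```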
  17. Greediness: greedy
      <ul><li>list-item1</li><li>list-item2</li></ul>
      /<li>.*<\/li>/
      ‣ <li>list-item1</li><li>list-item2</li>
  18. Greediness: lazy
      <ul><li>list-item1</li><li>list-item2</li></ul>
      /<li>.*?<\/li>/ or /<li>.*<\/li>/U
      ‣ <li>list-item1</li>
      ‣ <li>list-item2</li>
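The difference between the two quantifiers can be checked directly. This is a sketch in Python's `re` module, used here in place of PHP. Python has no equivalent of the `U` modifier, so only the `.*?` form is shown, and the forward slash needs no escaping because Python patterns have no delimiters.

```python
import re

html = "<ul><li>list-item1</li><li>list-item2</li></ul>"

# Greedy: .* grabs as much as possible, then backs off to the LAST </li>.
greedy = re.findall(r"<li>.*</li>", html)

# Lazy: .*? grabs as little as possible, stopping at the FIRST </li>.
lazy = re.findall(r"<li>.*?</li>", html)

print(greedy)  # ['<li>list-item1</li><li>list-item2</li>']
print(lazy)    # ['<li>list-item1</li>', '<li>list-item2</li>']
```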
  19. Subpatterns
      /([a-z0-9]*)@([a-z0-9.]*\.[a-z0-9]{2,3})/i
      ‣ Whole match: email
      ‣ First subpattern: user
      ‣ Second subpattern: hostname
      Note: this regex only barely satisfies my needs for this particular example; do not use it to actually find email addresses, as it does not fully satisfy RFC 5321 & RFC 5322.
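A short sketch of how the subpatterns come back as separate groups, written with Python's `re` module instead of PHP (an assumption of this example). It assumes the dot before the top-level domain is meant literally (`\.`). The sample sentence and address are invented.

```python
import re

# Group 1 captures the user part, group 2 the hostname.
EMAIL = re.compile(r"([a-z0-9]*)@([a-z0-9.]*\.[a-z0-9]{2,3})", re.IGNORECASE)

m = EMAIL.search("Contact: user@mail.example.com today")

print(m.group(0))  # user@mail.example.com  (whole match: email)
print(m.group(1))  # user                   (first subpattern)
print(m.group(2))  # mail.example.com       (second subpattern)
```

As the slide warns, this pattern is only good enough for the demonstration; it is not a validator.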
  20. Questions?
  21. ADVANCED: The juicy stuff you never knew about, until now
  22. Back references
      Problem: /href=['"](.*?)['"]/i
      Matches:
      ‣ href="xxx"
      ‣ href="xxx'
      ‣ href='xxx'
      ‣ href='xxx"
  23. Back references
      Solution: /href=(['"])(.*?)\1/i
      \1 references first subpattern!
      Don’t forget to also string-escape in PHP: preg_match('/href=([\'"])(.*?)\\1/i', ...);
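The problem and its fix can be compared side by side. The sketch below uses Python's `re` module as a stand-in for `preg_match`. The names `LOOSE` and `STRICT` and the sample tags are hypothetical. Raw triple-quoted strings sidestep the string-escaping issue the slide mentions for PHP.

```python
import re

# Without a back reference, the closing quote may differ from the opening one.
LOOSE = re.compile(r"""href=['"](.*?)['"]""", re.IGNORECASE)

# With \1, the closing quote must repeat whatever group 1 captured.
STRICT = re.compile(r"""href=(['"])(.*?)\1""", re.IGNORECASE)

mismatched = "<a href=\"xxx'>"      # opens with " but closes with '
well_formed = "<a href='page.html'>"

loose_hit = LOOSE.search(mismatched)
strict_miss = STRICT.search(mismatched)
strict_hit = STRICT.search(well_formed)

print(loose_hit.group(1))   # xxx  (accepted although the quotes differ)
print(strict_miss)          # None
print(strict_hit.group(2))  # page.html
```

Note that with the extra quote group, the URL moves from group 1 to group 2.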
  24. Named subpatterns
      Scenario: parsing large CSV
      1,a title,5.00,92,green
      2,another title,3.50,4,blue
      3,one more,33699.99,15,white
      ...
  25. Named subpatterns
      /([0-9]+),(.*?),([0-9]+\.[0-9]{2}),([0-9]+),([a-z]+)/i
      Result excerpt:
      [1] => string(1) "1"
      [2] => string(7) "a title"
      [3] => string(4) "5.00"
      [4] => string(2) "92"
      [5] => string(5) "green"
  26. Named subpatterns
      /(?P<id>[0-9]+),(?P<title>.*?),(?P<price>[0-9]+\.[0-9]{2}),(?P<stock>[0-9]+),(?P<color>[a-z]+)/i
      Result excerpt:
      ["id"] => string(1) "1"
      [1] => string(1) "1"
      ["title"] => string(7) "a title"
      [2] => string(7) "a title"
      ["price"] => string(4) "5.00"
      [3] => string(4) "5.00"
      ["stock"] => string(2) "92"
      [4] => string(2) "92"
      ["color"] => string(5) "green"
      [5] => string(5) "green"
  27. Named subpatterns
      ‣ (?P<name>pattern)
      ‣ (?<name>pattern) & (?'name'pattern) since PHP 5.2.2
  28. Named subpatterns + back references
      /href=(?P<quotes>['"])(?P<href>.*?)(?P=quotes)/i
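Named groups can be sketched with the CSV rows from the scenario. This example uses Python's `re` module, which accepts the same `(?P<name>…)` and `(?P=name)` syntax; using Python instead of PHP is an assumption of this example, and `ROW`, `HREF` and the sample link are illustrative names. The price pattern assumes a literal dot (`\.`).

```python
import re

ROW = re.compile(
    r"(?P<id>[0-9]+),(?P<title>.*?),(?P<price>[0-9]+\.[0-9]{2}),"
    r"(?P<stock>[0-9]+),(?P<color>[a-z]+)",
    re.IGNORECASE,
)

csv_lines = [
    "1,a title,5.00,92,green",
    "2,another title,3.50,4,blue",
    "3,one more,33699.99,15,white",
]

# groupdict() returns only the named groups, keyed by name.
rows = [ROW.match(line).groupdict() for line in csv_lines]
print(rows[0])
# {'id': '1', 'title': 'a title', 'price': '5.00', 'stock': '92', 'color': 'green'}

# A named back reference: the closing quote must equal the "quotes" group.
HREF = re.compile(r"""href=(?P<quotes>['"])(?P<href>.*?)(?P=quotes)""", re.IGNORECASE)
link = HREF.search('<a href="index.html">')
print(link.group("href"))  # index.html
```

Reading `row["price"]` instead of `row[3]` keeps the calling code stable if a column is later added to the pattern.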
  29. Lookahead/-behind assertions: “Take a peek, don’t eat it”
  30. Lookahead/-behind assertions
      Scenario: find all occurrences of “here”
      “Where can I find here, not there?”
  31. Lookahead/-behind assertions
      Deduction: find every “here” that is not preceded or followed by an alphabetic character.
      Solution: /(?<![a-z])here(?![a-z])/i
  32. Lookahead/-behind assertions
      ‣ Positive lookahead: (?=expression)
      ‣ Negative lookahead: (?!expression)
      ‣ Positive lookbehind: (?<=expression)
      ‣ Negative lookbehind: (?<!expression)
  33. Lookahead/-behind assertions
      “lookbehind assertion is not fixed length...”
      In PHP, lookbehind cannot contain variable-length repetition, while lookahead can.
      ‣ (?=.*) works
      ‣ (?<=.*) fails
      ‣ (?=abc) works
      ‣ (?<=abc) works
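The "here" scenario can be sketched as follows, using Python's `re` module in place of PHP (an assumption of this example). Python's engine happens to impose the same fixed-width restriction on lookbehind, which the hypothetical helper `compiles` demonstrates.

```python
import re

sentence = "Where can I find here, not there?"

# Naive search: also hits the "here" inside "Where" and "there".
naive = [m.start() for m in re.finditer(r"here", sentence, re.IGNORECASE)]

# Lookbehind and lookahead peek at the neighbours without consuming them.
standalone = [
    m.start()
    for m in re.finditer(r"(?<![a-z])here(?![a-z])", sentence, re.IGNORECASE)
]

print(naive)       # [1, 17, 28]
print(standalone)  # [17]


def compiles(pattern):
    """Return True if the pattern compiles, False on a regex syntax error."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


print(compiles(r"(?=.*)here"))     # True:  lookahead may vary in length
print(compiles(r"(?<=.*)here"))    # False: lookbehind must be fixed width
print(compiles(r"(?<=abc)here"))   # True
```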
  34. Conditional subpatterns: if-then(-else) in regular expressions. YES RLY!
  35. Conditional subpatterns
      Scenario: match all (x|ht)ml tags
      Caution!
      ‣ <element></element>
      ‣ <element />
  36. Conditional subpatterns
      Solution:
      /<(?P<tag>[a-z]+).*?(?P<self>\/)?>(?(self)|.*?<\/(?P=tag)>)/i
      ‣ Named patterns: tag, self
      ‣ If self-closing, then do nothing; else, find matching end tag
  37. Conditional subpatterns
      With subpattern (named or by id):
      ‣ (?(pattern)then)
      ‣ (?(pattern)then|else)
      With lookahead/-behind:
      ‣ (?(?=assertion)then)
      ‣ (?(?=assertion)then|else)
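The tag-matching solution can be tried as a sketch in Python's `re` module, used here instead of PHP. Python supports the group-based conditional `(?(name)then|else)` but not the assertion-based form `(?(?=…)…)`, so only the former is shown. The sample markup is invented.

```python
import re

# If the optional "self" group matched a "/", the tag is self-closing and
# the "then" branch is empty; otherwise the "else" branch looks for the
# matching end tag via the named back reference (?P=tag).
TAG = re.compile(
    r"<(?P<tag>[a-z]+).*?(?P<self>/)?>(?(self)|.*?</(?P=tag)>)",
    re.IGNORECASE,
)

html = 'before <br /> middle <p class="x">text</p> after'

matches = [m.group(0) for m in TAG.finditer(html)]
tags = [m.group("tag") for m in TAG.finditer(html)]

print(matches)  # ['<br />', '<p class="x">text</p>']
print(tags)     # ['br', 'p']
```

Like the email pattern earlier, this is a demonstration of conditionals, not a robust HTML parser; nested tags of the same name, for instance, are not handled.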
  38. Comments
      /
          # match currency symbols for USD, EUR, GBP & YEN
          [$€£¥]
          # must be followed by a number to indicate a price
          (?=[0-9])
          # pattern modifiers:
          #   u for UTF-8 interpretation (currency symbols),
          #   x to ignore whitespace (for comments)
      /ux
  39. Comments
      ‣ # Perl-style
      ‣ /x modifier
      ‣ Ignores unescaped whitespace
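The commented pattern translates almost directly to Python's `re.VERBOSE` flag, which plays the role of the `x` modifier. This is a sketch that uses Python instead of PHP, an assumption of this example. Python 3 string patterns are Unicode already, so no counterpart to the `u` modifier is needed. The sample sentence is made up.

```python
import re

PRICE_SYMBOL = re.compile(
    r"""
    # match currency symbols for USD, EUR, GBP & YEN
    [$€£¥]
    # must be followed by a number to indicate a price
    (?=[0-9])
    """,
    re.VERBOSE,  # ignore unescaped whitespace and allow # comments
)

found = PRICE_SYMBOL.findall("Costs €5, £12 or $ 3 in ¥en")

print(found)  # ['€', '£']
```

The `$` and `¥` are skipped because a space and a letter follow them; the lookahead checks for the digit without including it in the match.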
  41. Questions?
  42. Resources
      ‣ www.mullie.eu/regular-expressions-basics/
      ‣ www.mullie.eu/regular-expressions-advanced/
      ‣ mullie.eu
