Understanding advanced regular expressions

Published in: Technology
1 Comment
  • On slide 3 you say regexes are Regular languages. But PCRE adds features that go well beyond level-3 grammars. Just so you know...

  1. 1. Deeper down the rabbit hole: Advanced Regular Expressions
       Jakob Westhoff <jakob@php.net>, @jakobwesthoff
       PHPBarcamp.at, May 3, 2010
       http://westhoffswelt.de jakob@westhoffswelt.de slide: 1 / 26
  2. 2. About Me
       • Jakob Westhoff
       • PHP developer for several years
       • Computer science student at the TU Dortmund
       • Co-founder of the PHP Usergroup Dortmund
       • Active in different open source projects
  3. 3. Asking the audience
       • Who already works with regular expressions?
       • Regular expressions like this: /[a-zA-Z]+/
       • Or like this: (?P<image>(?:none|inherit)|(?:url\(\s*(?:'|")?(?:['")]|[^'")]|[^'")])*(?:'|")?\s*\)))
  6. 6. Goals of this session
       • Learn advanced techniques to use in (PCRE) regular expressions: assertions, once-only subpatterns, conditional subpatterns, pattern recursion, ...
       • Learn how to handle Unicode in your regular expressions
  9. 9. What Regular Expressions are...
       In theoretical computer science:
       • They express regular languages
       • Languages which can be described by deterministic finite state automata
       • Type 3 grammars in the Chomsky hierarchy
  12. 12. What Regular Expressions are...
       In practical day-to-day usage:
       "[...] regular expressions provide concise and flexible means for identifying strings of text of interest, such as particular characters, words, or patterns of characters." (Wikipedia [1])
  17. 17. Building Blocks of a Regular Expression
       Basic structure of every regular expression: /[a-z]+/im
       • Delimiter (/): the same freely chosen character at both ends; it must be escaped when it appears inside the expression. In PCRE a bracket pair such as ( and ) may be used as well.
       • Expression ([a-z]+)
       • Modifier (im): a sequence of characters providing processing instructions
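The three building blocks can be tried out directly in PHP. The following is a minimal sketch, not taken from the slides: the subject strings and variable names are made up for illustration. It runs the same expression with slash delimiters and with bracket-style delimiters, and shows a delimiter being escaped inside the expression.

```php
<?php
// Illustrative subject: two lines, one of them upper case.
$subject = "Hello\nWORLD";

// Slash delimiters; "i" = case-insensitive, "m" = ^ and $ match per line.
$withSlashes = preg_match_all('/^[a-z]+$/im', $subject, $a);

// Bracket-style delimiters: "(" opens and ")" closes the expression.
// They are delimiters only and do not create a capture group.
$withParens = preg_match_all('(^[a-z]+$)im', $subject, $b);

// The delimiter itself has to be escaped when used inside the expression.
$escaped = preg_match('/^\/tmp\/[a-z]+$/', '/tmp/cache');

var_dump($withSlashes, $withParens, $escaped);
```

Choosing a delimiter that does not occur in the expression (for example ~ or # for paths and URLs) avoids most of the escaping.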
  24. 24. Getting everybody up to speed
       • ., .*, .+, .?, .{1,2}: arbitrary characters and repetitions
       • ^, $: start and end of subject (or of a line in multiline mode)
       • foo|bar: logical or
       • (foo)(bar): subpattern grouping
       • /(foo|bar)baz(\1)/: backreferences
       • [a-z], [^a-z]: character classes
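Of these basics, the backreference is the one most easily misread: \1 matches the text the first group actually captured, not the group's pattern again. A small sketch, with anchors and subject strings added purely for illustration:

```php
<?php
// \1 must repeat whatever (foo|bar) matched the first time.
$pattern = '/^(foo|bar)baz\1$/';

$same      = preg_match($pattern, 'foobazfoo'); // same word on both sides
$different = preg_match($pattern, 'foobazbar'); // different word after baz

var_dump($same, $different);
```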
  35. 35. Grouping Without Subpattern Creation
       Grouping might be needed without creating a (captured) subpattern: /(?:foobar)*/
  37. 37. Subpattern identification
       Subpatterns are numbered by their opening parenthesis: /(foo(bar)(baz))/
       • 1: foobarbaz
       • 2: bar
       • 3: baz
       Matches available from within PHP:
       $matches = array(
           0 => "foobarbaz",
           1 => "foobarbaz",
           2 => "bar",
           3 => "baz",
       )
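The numbering rule and the effect of a non-capturing group can be checked side by side. This is a sketch with illustrative variable names; the second pattern is the slide's pattern with its outer group turned into (?:...), which is an assumption made here to show how the remaining groups move up.

```php
<?php
// Capturing groups are numbered by the position of their opening parenthesis.
preg_match('/(foo(bar)(baz))/', 'foobarbaz', $numbered);

// Making the outer group non-capturing removes it from the numbering,
// so (bar) becomes group 1 and (baz) becomes group 2.
preg_match('/(?:foo(bar)(baz))/', 'foobarbaz', $skipped);

print_r($numbered);
print_r($skipped);
```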
  43. 43. Subpattern Naming
       PCRE allows custom naming: /(?P<firstname>[A-Za-z]+) (?P<lastname>[A-Za-z]+)/
       Result with input Jakob Westhoff:
       array(
           0 => 'Jakob Westhoff',
           'firstname' => 'Jakob',
           1 => 'Jakob',
           'lastname' => 'Westhoff',
           2 => 'Westhoff',
       )
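A short sketch of how the named result is typically consumed from PHP. The pattern and input are the ones from the slide; the variable names are illustrative.

```php
<?php
$pattern = '/(?P<firstname>[A-Za-z]+) (?P<lastname>[A-Za-z]+)/';
preg_match($pattern, 'Jakob Westhoff', $matches);

// Each named group is available under its name and under its number.
echo $matches['firstname'], ' / ', $matches['lastname'], "\n";
```

Reading by name keeps the calling code stable when groups are later added to or removed from the pattern.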
  45. 45. Assertions
       Formulate assertions on the subject string without consuming any characters.
       • Example: /foo(?=foo)/
       • Input: foofoofoo
       • Match: the first foo and, when the search is repeated, the second one; the last foo is not followed by another foo and does not match
  54. 54. Negative Assertions
       Negative assertions are possible as well. foo not followed by another foo: /foo(?!foo)/
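Because an assertion consumes nothing, the clearest way to see its effect is to look at match offsets. A minimal sketch on the subject foofoofoo, assuming preg_match_all with PREG_OFFSET_CAPTURE; the variable names are illustrative.

```php
<?php
$subject = 'foofoofoo';

// Positive lookahead: foo that is followed by another foo.
preg_match_all('/foo(?=foo)/', $subject, $positive, PREG_OFFSET_CAPTURE);

// Negative lookahead: foo that is NOT followed by another foo.
preg_match_all('/foo(?!foo)/', $subject, $negative, PREG_OFFSET_CAPTURE);

// Each entry of $x[0] is [matched text, byte offset]; keep only the offsets.
$positiveOffsets = array_column($positive[0], 1);
$negativeOffsets = array_column($negative[0], 1);

print_r($positiveOffsets);
print_r($negativeOffsets);
```

In both cases the matched text is just foo, three characters long; the looked-ahead text is not part of the match.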
  56. 56. Backward Assertions
       • bar preceded by foo: not /(?=foo)bar/, which looks forward and can never match here
       • Backward assertion: /(?<=foo)bar/
       • Negative backward assertion, bar not preceded by foo: /(?<!foo)bar/
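The same offset technique shows what the backward assertions select. This sketch uses a made-up subject containing one bar after foo and one bar after baz.

```php
<?php
// Offsets:  foobar -> bar starts at 3, bazbar -> bar starts at 10.
$subject = 'foobar bazbar';

// Lookbehind: bar preceded by foo.
preg_match_all('/(?<=foo)bar/', $subject, $after, PREG_OFFSET_CAPTURE);

// Negative lookbehind: bar not preceded by foo.
preg_match_all('/(?<!foo)bar/', $subject, $notAfter, PREG_OFFSET_CAPTURE);

// A lookahead in front of bar asks for "foo" and "bar" at the same position,
// which is impossible, so this never matches.
$lookaheadAttempt = preg_match('/(?=foo)bar/', $subject);

$afterOffsets    = array_column($after[0], 1);
$notAfterOffsets = array_column($notAfter[0], 1);
var_dump($afterOffsets, $notAfterOffsets, $lookaheadAttempt);
```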
  60. 60. Inner workings of the PCRE matcher
       PCRE uses backtracking to find matches.
       • Pattern: /\d+foo/
       • Subject: 123456789bar
       1. Eat up all the digits: 123456789
       2. Try to match foo
       3. Backtrack by one digit and try to match foo again
       4. Repeat step 3 until a match is found or the start of the digit run is reached
  66. 66. Once only subpattern
       • Once-only subpatterns prevent backtracking once a certain pattern has acquired a match.
       • Applying a once-only subpattern to the example shown: /(?>\d+)foo/
       • After matching the digits of 123456789bar and determining that the following string is not foo, the matcher stops instead of backtracking through the digits.
       • Can massively improve regex speed if used correctly
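For a subject like 123456789bar the once-only version only saves work; the result (no match) is the same with and without it. The difference becomes visible when giving back a digit would be the only way to match. A sketch with illustrative patterns that are not from the slides:

```php
<?php
// On the example subject both variants fail; the once-only one just fails sooner.
$plainOnBar  = preg_match('/\d+foo/', '123456789bar');
$atomicOnBar = preg_match('/(?>\d+)foo/', '123456789bar');

// Here backtracking matters: \d+ first takes "12345", then gives back the
// final 5 so that the literal 5 in the pattern can match.
$greedy = preg_match('/\d+5/', '12345');

// The once-only group keeps everything it matched, so the literal 5
// never finds anything left to match.
$atomic = preg_match('/(?>\d+)5/', '12345');

var_dump($plainOnBar, $atomicOnBar, $greedy, $atomic);
```

This is why "if used correctly" matters: a once-only subpattern is safe only where giving characters back could never lead to a match that is wanted.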
  70. 70. Conditional subpattern
       • The equivalent of an if statement in PCRE: /(?(condition)yes-pattern|no-pattern)/
       • Conditions can be references to a capture group (did it take part in the match?) or assertions
       • Digits need to be followed by foo, while everything else needs to be followed by bar: /(?(?=\d)\d+foo|\D+bar)/
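PCRE accepts either an assertion or a reference to a capture group as the condition. The sketch below shows both forms; the patterns, anchors and subjects are illustrative assumptions rather than the slide's original example.

```php
<?php
// Assertion as condition: if a digit comes next, require digits + foo,
// otherwise require non-digits + bar.
$byLookahead = '/^(?(?=\d)\d+foo|\D+bar)$/';

$digitsFoo = preg_match($byLookahead, '123foo');
$wordsBar  = preg_match($byLookahead, 'abcbar');
$digitsBar = preg_match($byLookahead, '123bar');

// Group reference as condition: a closing > is required only if
// group 1 (the opening <) took part in the match.
$byGroup = '/^(<)?\w+(?(1)>)$/';

$bracketed = preg_match($byGroup, '<tag>');
$bare      = preg_match($byGroup, 'tag');
$unclosed  = preg_match($byGroup, '<tag');

var_dump($digitsFoo, $wordsBar, $digitsBar, $bracketed, $bare, $unclosed);
```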
  77. 77. Unicode: Characters, code points and graphemes
       • Unicode consists of different code points
           - The letter a: U+0061
           - The mark ` (combining grave accent): U+0300
       • One character might consist of multiple code points
           - The letter a with the mark ` (à): U+0061 U+0300
       • Some of these combinations exist as single code points
           - The letter à: U+00E0
  84. 84. Unicode: Pattern matching
       • Unicode processing is enabled using the u modifier
       • PCRE works on UTF-8 encoded strings
       • Each code point is handled as one character
       • Match any Unicode code point by its number: \x{FFFF}
       • Remember the letter a with the mark ` (à): /\x{0061}\x{0300}/u
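The effect of the u modifier can be seen by counting what the dot matches. A minimal sketch, assuming PHP 7 or later for the "\u{...}" string escapes; the variable names are illustrative.

```php
<?php
$decomposed  = "a\u{0300}"; // U+0061 followed by U+0300 (3 bytes in UTF-8)
$precomposed = "\u{00E0}";  // the single code point U+00E0

// With u, \x{...} addresses code points, so this matches only the
// two code point spelling.
$matchesDecomposed  = preg_match('/^\x{0061}\x{0300}$/u', $decomposed);
$matchesPrecomposed = preg_match('/^\x{0061}\x{0300}$/u', $precomposed);

// With u the dot steps over code points, without it over bytes.
$codePoints = preg_match_all('/./u', $decomposed);
$bytes      = preg_match_all('/./', $decomposed);

var_dump($matchesDecomposed, $matchesPrecomposed, $codePoints, $bytes);
```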
  90. 90. Unicode: Extended Unicode sequences
       • How to match both the single and the multi code point character?
       • Remember: à = U+0061 U+0300 or U+00E0
       • Use the escape for extended Unicode sequences: \X
       • \X is equivalent to (?>\P{M}\p{M}*)
       • Wait. What? → Unicode character properties
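A sketch of \X treating both spellings of à as one unit. It assumes PHP 7 or later for the "\u{...}" escapes and uses illustrative variable names.

```php
<?php
$decomposed  = "a\u{0300}"; // two code points
$precomposed = "\u{00E0}";  // one code point

// \X matches a base character together with its combining marks,
// so both spellings count as a single match.
$unitsDecomposed  = preg_match_all('/\X/u', $decomposed);
$unitsPrecomposed = preg_match_all('/\X/u', $precomposed);

// A trailing plain letter is a second unit.
$unitsWithB = preg_match_all('/\X/u', $decomposed . 'b');

var_dump($unitsDecomposed, $unitsPrecomposed, $unitsWithB);
```

Note that \X only makes the two spellings the same length for matching purposes; comparing them for equality still needs normalization, which is outside the scope of this sketch.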
  95. 95. Unicode: Character properties
       • Every Unicode code point has a certain property assigned
       • Characters may be matched by these properties
       • The escapes \p and \P are used for this:
           - \p{xx}: all code points with the property xx
           - \P{xx}: all code points without the property xx
       • Possible properties:
           - L: Letter
           - M: Mark
           - P: Punctuation
           - Sc: Currency symbol
           - ...
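The property escapes can be tried on a short mixed string. This is a sketch with a made-up subject; non-ASCII characters are written as "\u{...}" escapes (PHP 7 or later) so the example does not depend on the file's encoding.

```php
<?php
// "Prix: 10 €, naïve!" spelled with explicit code points.
$subject = "Prix: 10 \u{20AC}, na\u{00EF}ve!";

preg_match_all('/\p{L}+/u', $subject, $words);    // runs of letters
preg_match_all('/\p{Sc}/u', $subject, $currency); // currency symbols
preg_match_all('/\p{P}/u', $subject, $punct);     // punctuation

print_r($words[0]);
print_r($currency[0]);
print_r($punct[0]);
```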
  99. 99. Pattern Recursion Recursion in regular expressions ? Possible with PCRE Validate BB-Code using PCRE [b]Hello [i]World[/i]![/b] http://westhoffswelt.de jakob@westhoffswelt.de slide: 22 / 26
  102. 102. BB-Code Recursion Example Input: [b]Hello [i]World[/i]![/b] Recursive regular expression pattern: ( [^\[]* \[(b|i)\] (?:[^\[]+|(?R)) \[/\1\] [^\[]* )
  108. 108. Do NOT Parse Using Regular Expressions Even though this is possible, you do NOT want to do it: it is not maintainable; it is nearly impossible to find errors; useful information extraction (building an AST) is not possible. Use regular expressions for: matching patterns (not recursive structures); tokenizing strings; validating really restricted input values.
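One way to follow this advice for the BB-Code case is to let a non-recursive regular expression do only the tokenizing and to track the nesting with a stack. The sketch below is a swapped-in alternative, not the approach from the slides; it is written in Python, and the names TOKEN and is_balanced as well as the restriction to the b and i tags are assumptions made for this example.

```python
import re

# One token is an opening tag, a closing tag, or a run of plain text.
TOKEN = re.compile(r"\[(?P<open>b|i)\]|\[/(?P<close>b|i)\]|(?P<text>[^\[]+)")


def is_balanced(source):
    stack = []  # currently open tags, innermost last
    pos = 0
    while pos < len(source):
        match = TOKEN.match(source, pos)
        if match is None:
            # A "[" that does not start a known tag.
            return False
        if match.group("open"):
            stack.append(match.group("open"))
        elif match.group("close"):
            # A closing tag has to match the innermost open tag.
            if not stack or stack.pop() != match.group("close"):
                return False
        pos = match.end()
    # Every opened tag must have been closed again.
    return not stack


print(is_balanced("[b]Hello [i]World[/i]![/b]"))
print(is_balanced("[b]Hello [i]World[/b]![/i]"))
```

The stack is also the natural place to build a tree of nodes, which is the kind of information extraction a single recursive pattern does not give you.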
  116. 116. Thanks for listening Questions, comments or annotations? Slides: http://westhoffswelt.de/portfolio.htm Contact: Jakob Westhoff <jakob@php.net> Twitter: @jakobwesthoff Please leave comments and vote at: http://joind.in/1620
