Regular Expressions:
JavaScript And Beyond

Max Shirshin

Frontend Team Lead
deltamethod
Introduction
Types of regular expressions
• POSIX (BRE, ERE)
• PCRE = Perl-Compatible Regular Expressions
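From the JavaScript language specification: "The form and functionality of regular expressions is modelled after the regular expression facility in the Perl 5 programming language".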
JS syntax (overview only)
var re = /^foo/;
// boolean
re.test('string');
// null or Array
re.exec('string');

Regular expressions consist of...
● Tokens
  - common characters
  - special characters (metacharacters)
● Operations
  - quantification
  - enumeration
  - grouping

Tokens and metacharacters
Any character
/./.test('foo');     // true
/./.test('\r\n');    // false
What you need instead:
/[\s\S]/ for JavaScript
or
/./s (works in Perl/PCRE; not in JS at the time of this talk, added in ES2018)

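A minimal sketch of the difference between the dot and the "any character" class. The sample string `text` is made up for illustration, and the last check assumes an ES2018+ engine that accepts the `s` flag.

```javascript
// Hypothetical sample input containing a Windows-style line break.
const text = 'line one\r\nline two';

// The dot never matches a line terminator, so it cannot cross the break.
const withDot = /one.+two/.test(text);

// [\s\S] means "whitespace or non-whitespace", i.e. any character at all.
const withClass = /one[\s\S]+two/.test(text);

// Assumption: ES2018+ engine. The s flag lets the dot match line terminators.
const withDotAll = /one.+two/s.test(text);

console.log(withDot, withClass, withDotAll); // false true true
```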
String boundaries
>>> /^something$/.test('something')
true
>>> /^something$/.test('something\nbad')
false
>>> /^something$/m.test('something\nbad')
true

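The effect of the `m` flag on `^` and `$` can be sketched as follows. The three-line `log` string is a hypothetical input, not something taken from the talk.

```javascript
// Hypothetical three-line log excerpt.
const log = 'ok\nerror: disk full\nok';

// Without m, ^ matches only at the very start of the string.
const atStart = /^error/.test(log);

// With m, ^ and $ also match next to each line break.
const atLineStart = /^error/m.test(log);

// Combining g and m collects every line separately.
const lines = log.match(/^.*$/gm);

console.log(atStart, atLineStart, lines);
// false true [ 'ok', 'error: disk full', 'ok' ]
```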
Word boundaries
>>> /\ba/.test('alabama')
true
>>> /a\b/.test('alabama')
true
>>> /a\b/.test('naïve')
true
not a word boundary
/\Ba/.test('alabama');

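A short sketch of why the 'naïve' case matches: `\b` is defined in terms of `\w`, and `\w` does not include "ï". The variable names are illustrative only.

```javascript
// "\u00ef" is the letter ï; \w does not include it, so \b fires around it.
const word = 'na\u00efve';

// Splitting on word boundaries cuts the word in two.
const pieces = word.match(/\b\w+\b/g);

// There is a boundary right after the "a" in "naïve"...
const boundaryInside = /a\b/.test(word);

// ...but not in a plain ASCII word where "a" is followed by a letter.
const noBoundaryInside = /a\b/.test('nave');

console.log(pieces, boundaryInside, noBoundaryInside);
// [ 'na', 've' ] true false
```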
Character classes
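Whitespace
/\s/ (inverted version: /\S/)
FF:
\t \n \v \f \r \u0020 \u00a0 \u1680 \u180e
\u2000 \u2001 \u2002 \u2003 \u2004 \u2005 \u2006 \u2007 \u2008 \u2009 \u200a
\u2028 \u2029 \u202f \u205f \u3000
Chrome, IE 9:
as in FF plus \ufeff
IE 7, 8 :-( only:
\t \n \v \f \r \u0020
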
Alphanumeric characters
/\d/ ~ digits from 0 to 9
/\w/ ~ Latin letters, digits, underscore
Does not work for Cyrillic, Greek etc.
Inverted forms:
/\D/ ~ anything but digits
/\W/ ~ anything but alphanumeric characters

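To illustrate the ASCII-only nature of `\w` and `\d`, here is a small sketch. The Unicode property escapes (`\p{L}`, `\p{Nd}`) are not part of the talk; they are an ES2018 feature and are assumed to be available in the engine.

```javascript
// "привет" written with escapes so the source stays ASCII-only.
const greeting = '\u043f\u0440\u0438\u0432\u0435\u0442';

// \w covers only [A-Za-z0-9_], so Cyrillic letters do not match.
const asciiOnly = /^\w+$/.test(greeting);

// Assumption: ES2018+ engine. \p{L} matches a letter from any script.
const anyLetter = /^\p{L}+$/u.test(greeting);

// ARABIC-INDIC DIGIT THREE: a decimal digit, but not one of 0-9.
const arabicThree = '\u0663';
const asciiDigit = /\d/.test(arabicThree);
const anyDigit = /\p{Nd}/u.test(arabicThree);

console.log(asciiOnly, anyLetter, asciiDigit, anyDigit);
// false true false true
```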
Custom character classes
Example:
/[abc123]/
Metacharacters and ranges supported:
/[A-F\d]/
More than one range is okay:
/[a-cG-M0-7]/
IMPORTANT: ranges come from Unicode, not from national alphabets!

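One way to see that ranges follow Unicode code points rather than an alphabet: the Russian letter "ё" (U+0451) sits outside the range а-я (U+0430 to U+044F). The example words are chosen for illustration and are not from the slides.

```javascript
// The range а-я, written with escapes: U+0430 through U+044F.
const lowercaseCyrillic = /^[\u0430-\u044f]+$/;

// "елка": every letter falls inside the range.
const inRange = lowercaseCyrillic.test('\u0435\u043b\u043a\u0430');

// "ёлка": ё is U+0451, just outside the range, so the test fails.
const outOfRange = lowercaseCyrillic.test('\u0451\u043b\u043a\u0430');

// Listing ё explicitly next to the range fixes it.
const extended = /^[\u0430-\u044f\u0451]+$/.test('\u0451\u043b\u043a\u0430');

console.log(inRange, outOfRange, extended); // true false true
```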
Custom character classes
"dot" means just dot!
/[.]/.test('anything') // false
adding \ and ] (a trailing - is literal):
/[\\\]-]/

Inverted character classes
anything except a, b, c:
/[^abc]/
^ as a character:
/[abc^]/

Inverted character classes
/[^]/
matches ANY character;
could be a nice alternative to /[\s\S]/
Chrome, FF:
>>> /([^])/.exec('a');
['a', 'a']
IE:
>>> /([^])/.exec('a');
['a', '']
IE:
>>> /([\s\S])/.exec('a');
['a', 'a']

Quantifiers
Zero or more, one or more
/bo*/.test('b')    // true
/.*/.test('')      // true
/bo+/.test('b')    // false

Zero or one
/colou?r/.test('color');     // true
/colou?r/.test('colour');    // true

How many?
/bo{7}/      exactly 7
/bo{2,5}/    from 2 to 5 ({x,y} requires x <= y)
/bo{5,}/     5 or more
This does not work in JS:
/b{,5}/.test('bbbbb')    // false: {,5} is read as literal text

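The counted quantifiers can be checked with a few anchored tests. This is only a sketch; the test strings are invented, and the `{,5}` behaviour shown is that of patterns without the `u` flag.

```javascript
// Exactly two o's after the b.
const exactlyTwo = /^bo{2}$/.test('boo');

// Six o's is outside the 2..5 range.
const tooMany = /^bo{2,5}$/.test('boooooo');

// ...but satisfies "5 or more".
const atLeast = /^bo{5,}$/.test('boooooo');

// Without the u flag, {,5} is not a quantifier: it is matched literally.
const literalBraces = /b{,5}/.test('b{,5}');
const notAQuantifier = /b{,5}/.test('bbbbb');

// Spell out the lower bound to get "up to five".
const upToFive = /^b{0,5}$/.test('bbbbb');

console.log(exactlyTwo, tooMany, atLeast, literalBraces, notAQuantifier, upToFive);
// true false true true false true
```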
Greedy quantifiers
var r = /a+/.exec('aaaaa');
>>> r[0]
"aaaaa"

Lazy quantifiers
var r = /a+?/.exec('aaaaa');
>>> r[0]
"a"
r = /a*?/.exec('aaaaa');
>>> r[0]
""

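Greedy versus lazy matching matters most when a delimiter can occur more than once. A minimal sketch with a made-up HTML fragment:

```javascript
// Hypothetical input with two bold fragments.
const html = '<b>one</b> and <b>two</b>';

// Greedy: .+ runs to the end and backtracks to the LAST closing tag.
const greedy = html.match(/<b>.+<\/b>/)[0];

// Lazy: .+? stops at the FIRST closing tag it can reach.
const lazy = html.match(/<b>.+?<\/b>/)[0];

console.log(greedy); // <b>one</b> and <b>two</b>
console.log(lazy);   // <b>one</b>
```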
Groups
Groups
capturing
/(boo)/.test("boo");
non-capturing
/(?:boo)/.test("boo");

Grouping and the RegExp constructor
var result = /(bo)o+(b)/.exec('the booooob');
>>> RegExp.$1
"bo"
>>> RegExp.$2
"b"
>>> RegExp.$9
""
>>> RegExp.$10
undefined
>>> RegExp.$0
undefined

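The same captures are also available on the array returned by `exec`, which avoids the global `RegExp.$1` to `RegExp.$9` properties (these are legacy static properties and are marked as deprecated on MDN). A sketch using the same pattern and input:

```javascript
const result = /(bo)o+(b)/.exec('the booooob');

// Index 0 is the whole match; capturing groups follow in order.
const whole = result[0];
const first = result[1];
const second = result[2];

// The array also records where the match started.
const startedAt = result.index;

console.log(whole, first, second, startedAt); // booooob bo b 4
```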
Numbering of capturing groups
/((foo) (b(a)r))/
$1    ((foo) (b(a)r))    foo bar
$2    (foo)              foo
$3    (b(a)r)            bar
$4    (a)                a

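Groups are numbered by the position of their opening parenthesis, which can be confirmed by running the pattern. The second pattern is an illustrative variation showing that a non-capturing group does not take a number.

```javascript
// Four capturing groups, numbered left to right by opening parenthesis.
const nested = /((foo) (b(a)r))/.exec('foo bar');
const captures = nested.slice(1);

// Making the outer group non-capturing shifts the numbering down by one.
const flatter = /(?:(foo) (b(a)r))/.exec('foo bar');
const fewerCaptures = flatter.slice(1);

console.log(captures);      // [ 'foo bar', 'foo', 'bar', 'a' ]
console.log(fewerCaptures); // [ 'foo', 'bar', 'a' ]
```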
Lookahead
var r = /best(?= match)/.exec('best match');
>>> !!r
true
>>> r[0]
"best"
>>> /best(?! match)/.test('best match')
false

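Because lookaheads consume no characters, they are handy for inserting text at a position. A well-known sketch is grouping digits by thousands; `groupThousands` is a hypothetical helper name and it assumes the input is a plain string of digits.

```javascript
// Hypothetical helper: insert commas into a string of digits.
function groupThousands(digits) {
  // \B        not at the start of the number
  // (?=...)   followed by one or more complete groups of three digits
  // (?!\d)    ...that run up to the end of the digit sequence
  return digits.replace(/\B(?=(\d{3})+(?!\d))/g, ',');
}

console.log(groupThousands('1234567')); // 1,234,567
console.log(groupThousands('999'));     // 999
```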
Lookbehind
NOT supported in JavaScript at the time of this talk (added in ES2018)
/(?<=text)match/
positive lookbehind
/(?<!text)match/
negative lookbehind

Enumerations
Logical "or"
/red|green|blue light/
/(red|green|blue) light/
>>> /var a(;|$)/.test('var a')
true

Backreferences
/(red|green) apple is \1/.test('red apple is red')        // true
/(red|green) apple is \1/.test('green apple is green')    // true

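A backreference matches the text a group actually captured, not the group's pattern. A common use, sketched here with invented sentences, is spotting an accidentally doubled word.

```javascript
// \1 must repeat exactly what (\w+) captured.
const doubledWord = /\b(\w+) \1\b/;

const hit = doubledWord.exec('this is is a test');
const miss = doubledWord.test('this is a test');

console.log(hit[0], hit[1], hit.index); // is is is 5
console.log(miss);                      // false
```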
Alternative character representations
Representing a character
\x09 === \t (not Unicode but ASCII/ANSI)
\u20AC === € (in Unicode)
backslash takes away special character meaning:
/\(\)/.test('()')    // true
/\\n/.test('\\n')    // true
...or vice versa!
/\f/.test('f')       // false!

Flags
Regular expression flags
g i m s x y
g    global match
i    ignore case
m    multiline matching for ^ and $
JavaScript does NOT provide support for (at the time of this talk; ES2018 added s):
s    string as single line
x    extend pattern
Mozilla-only, non-standard at the time of this talk (standardized in ES2015):
y    sticky
Match only from the .lastIndex index (a regexp instance property). In the Mozilla implementation of that time, ^ could thus match at a predefined position; under ES2015, ^ still matches only at the start of the input (or of a line with m).

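Both `g` and `y` make a regexp stateful through its `lastIndex` property. The following sketch assumes an ES2015+ engine for the `y` flag; the inputs are made up.

```javascript
// g flag: each exec() call resumes searching from lastIndex.
const globalO = /o/g;
const positions = [];
let found;
while ((found = globalO.exec('foo boo')) !== null) {
  positions.push(found.index);
}

// y flag: the match must start exactly at lastIndex, with no searching ahead.
const sticky = /foo/y;
sticky.lastIndex = 3;
const atThree = sticky.test('barfoo');
const indexAfterHit = sticky.lastIndex;

sticky.lastIndex = 2;
const atTwo = sticky.test('barfoo');
const indexAfterMiss = sticky.lastIndex;

console.log(positions);                 // [ 1, 2, 5, 6 ]
console.log(atThree, indexAfterHit);    // true 6
console.log(atTwo, indexAfterMiss);     // false 0
```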
Alternative syntax for flags
/(?i)foo/
/(?i-m)bar$/
/(?i-sm).x$/
/(?i)foo(?-i)bar/
Some implementations do NOT support flag switching on-the-go. In JS, flags are set for the whole regexp instance and you can't change them.
RegExp in JavaScript
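Methods
RegExp instances:
/regexp/.exec('string')
null or array ['whole match', $1, $2, ...]
/regexp/.test('string')
false or true
String instances:
'str'.match(/regexp/)
'str'.match('\\w{1,3}')
- same as /regexp/.exec if no 'g' flag used;
- array of all matches if 'g' flag used (internal capturing groups ignored)
'str'.search(/regexp/)
'str'.search('\\w{1,3}')
first match index, or -1
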
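How `match` changes with the `g` flag, and what `search` returns, can be sketched on a made-up string:

```javascript
const sample = 'one two three';

// No g flag: same result as exec(), including the captured group and index.
const first = sample.match(/t(\w)/);

// With g: every whole match, but the captured groups are dropped.
const all = sample.match(/t(\w)/g);

// search() returns the index of the first match, or -1.
const where = sample.search(/t\w/);
const fromString = sample.search('\\w{4,}'); // a string is compiled to a regexp
const nowhere = sample.search(/\d/);

console.log(first[0], first[1], first.index); // tw w 4
console.log(all);                             // [ 'tw', 'th' ]
console.log(where, fromString, nowhere);      // 4 8 -1
```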
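Methods
String instances:
'str'.replace(/old/, 'new');
WARNING: special magic supported in the replacement string:
$$    inserts a dollar sign "$"
$&    substring that matches the regexp
$`    substring before $&
$'    substring after $&
$1, $2, $3 etc.: string that matches n-th capturing group
'str'.replace(/(r)(e)gexp/g, function(matched, $1, $2, offset, sourceString) {
    // what should replace the matched part on this iteration?
    return 'replacement';
});
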
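A few replacement-string patterns in action, as a sketch with invented inputs. Note that the callback receives no group arguments when the pattern has no capturing groups.

```javascript
// $1, $2, $3 reorder the captured parts of a date.
const swapped = '2014-03-25'.replace(/(\d{4})-(\d{2})-(\d{2})/, '$3.$2.$1');

// $$ is a literal dollar sign, $& is the matched text.
const priced = 'cost: 5'.replace(/\d+/, '$$$&');

// A function is called once per match; here with (matched, offset).
const marked = 'a-b-c'.replace(/[a-z]/g, function (matched, offset) {
  return matched.toUpperCase() + offset;
});

console.log(swapped); // 25.03.2014
console.log(priced);  // cost: $5
console.log(marked);  // A0-B2-C4
```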
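RegExp injection
// BAD CODE
var re = new RegExp('^' + userInput + '$');
// ...
var userInput = '[abc]'; // oops!

// GOOD, DO IT AT HOME
RegExp.escape = function(text) {
    return text.replace(/[-[\]{}()*+?.,\\^$|#\s]/g, "\\$&");
};
var re = new RegExp('^' + RegExp.escape(userInput) + '$');
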
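The effect of escaping can be verified directly. In this sketch the escaping logic is the same as above, but it lives in a standalone function with the hypothetical name `escapeRegExp`, so that nothing is assigned onto the built-in `RegExp` object.

```javascript
// Hypothetical helper name; same character class as the slide's version.
function escapeRegExp(text) {
  return text.replace(/[-[\]{}()*+?.,\\^$|#\s]/g, '\\$&');
}

const userInput = '[abc]';

// Unescaped: the input is interpreted as a character class.
const unsafe = new RegExp('^' + userInput + '$');

// Escaped: the input is matched literally.
const safe = new RegExp('^' + escapeRegExp(userInput) + '$');

console.log(unsafe.test('a'));     // true  (probably not what was intended)
console.log(safe.test('a'));       // false
console.log(safe.test('[abc]'));   // true
```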
Recommended reading
Online, just google it:
MDN Guide on Regular Expressions
The Book:

Mastering Regular Expressions
O'Reilly Media
Thank you!

Regular expressions are a powerful tool for text and data processing. What kind of support do browsers provide for them? What are the little misconceptions that prevent people from using regular expressions effectively?

The talk gives an overview of the regular expression syntax and typical usage examples.

Published in: Technology, Business

  1. Regular Expressions: JavaScript And Beyond. Max Shirshin, Frontend Team Lead, deltamethod
  2. Introduction
  3. Types of regular expressions
     • POSIX (BRE, ERE)
     • PCRE = Perl-Compatible Regular Expressions
     From the JavaScript language specification: "The form and functionality of regular expressions is modelled after the regular expression facility in the Perl 5 programming language".
  6. JS syntax (overview only)
     var re = /^foo/;
     // boolean
     re.test('string');
     // null or Array
     re.exec('string');
  7. Regular expressions consist of...
     ● Tokens
       - common characters
       - special characters (metacharacters)
     ● Operations
       - quantification
       - enumeration
       - grouping
  8. Tokens and metacharacters
  10. Any character
      /./.test('foo');     // true
      /./.test('\r\n');    // false
      What you need instead: /[\s\S]/ for JavaScript, or /./s (works in Perl/PCRE; not in JS at the time of this talk, added in ES2018)
  13. String boundaries
      >>> /^something$/.test('something')
      true
      >>> /^something$/.test('something\nbad')
      false
      >>> /^something$/m.test('something\nbad')
      true
  17. Word boundaries
      >>> /\ba/.test('alabama')
      true
      >>> /a\b/.test('alabama')
      true
      >>> /a\b/.test('naïve')
      true
      not a word boundary: /\Ba/.test('alabama');
  18. Character classes
  20. Whitespace
      /\s/ (inverted version: /\S/)
      FF: \t \n \v \f \r \u0020 \u00a0 \u1680 \u180e \u2000 \u2001 \u2002 \u2003 \u2004 \u2005 \u2006 \u2007 \u2008 \u2009 \u200a \u2028 \u2029 \u202f \u205f \u3000
      Chrome, IE 9: as in FF plus \ufeff
      IE 7, 8 :-( only: \t \n \v \f \r \u0020
  21. Alphanumeric characters
      /\d/ ~ digits from 0 to 9
      /\w/ ~ Latin letters, digits, underscore
      Does not work for Cyrillic, Greek etc.
      Inverted forms:
      /\D/ ~ anything but digits
      /\W/ ~ anything but alphanumeric characters
  25. Custom character classes
      Example: /[abc123]/
      Metacharacters and ranges supported: /[A-F\d]/
      More than one range is okay: /[a-cG-M0-7]/
      IMPORTANT: ranges come from Unicode, not from national alphabets!
  27. Custom character classes
      "dot" means just dot!
      /[.]/.test('anything') // false
      adding \ and ] (a trailing - is literal): /[\\\]-]/
  28. Inverted character classes
      anything except a, b, c: /[^abc]/
      ^ as a character: /[abc^]/
  31. Inverted character classes
      /[^]/ matches ANY character; could be a nice alternative to /[\s\S]/
      Chrome, FF:
      >>> /([^])/.exec('a');
      ['a', 'a']
  32. Inverted character classes
      IE:
      >>> /([^])/.exec('a');
      ['a', '']
  33. Inverted character classes
      IE:
      >>> /([\s\S])/.exec('a');
      ['a', 'a']
  34. Quantifiers
  37. Zero or more, one or more
      /bo*/.test('b')    // true
      /.*/.test('')      // true
      /bo+/.test('b')    // false
  38. Zero or one
      /colou?r/.test('color');
      /colou?r/.test('colour');
  42. How many?
      /bo{7}/      exactly 7
      /bo{2,5}/    from 2 to 5 ({x,y} requires x <= y)
      /bo{5,}/     5 or more
      This does not work in JS: /b{,5}/.test('bbbbb')
  45. Greedy quantifiers
      var r = /a+/.exec('aaaaa');
      >>> r[0]
      "aaaaa"
  51. Lazy quantifiers
      var r = /a+?/.exec('aaaaa');
      >>> r[0]
      "a"
      r = /a*?/.exec('aaaaa');
      >>> r[0]
      ""
  52. Groups
  54. Groups
      capturing: /(boo)/.test("boo");
      non-capturing: /(?:boo)/.test("boo");
  60. Grouping and the RegExp constructor
      var result = /(bo)o+(b)/.exec('the booooob');
      >>> RegExp.$1
      "bo"
      >>> RegExp.$2
      "b"
      >>> RegExp.$9
      ""
      >>> RegExp.$10
      undefined
      >>> RegExp.$0
      undefined
  65. Numbering of capturing groups
      /((foo) (b(a)r))/
      $1    ((foo) (b(a)r))    foo bar
      $2    (foo)              foo
      $3    (b(a)r)            bar
      $4    (a)                a
  69. Lookahead
      var r = /best(?= match)/.exec('best match');
      >>> !!r
      true
      >>> r[0]
      "best"
      >>> /best(?! match)/.test('best match')
      false
  70. Lookbehind
      NOT supported in JavaScript at the time of this talk (added in ES2018)
      /(?<=text)match/ positive lookbehind
      /(?<!text)match/ negative lookbehind
  71. Enumerations
  72. Logical "or"
      /red|green|blue light/
      /(red|green|blue) light/
      >>> /var a(;|$)/.test('var a')
      true
  73. Backreferences
      /(red|green) apple is \1/.test('red apple is red')        // true
      /(red|green) apple is \1/.test('green apple is green')    // true
  74. Alternative character representations
  77. Representing a character
      \x09 === \t (not Unicode but ASCII/ANSI)
      \u20AC === € (in Unicode)
      backslash takes away special character meaning:
      /\(\)/.test('()')    // true
      /\\n/.test('\\n')    // true
      ...or vice versa!
      /\f/.test('f')       // false!
  78. Flags
  83. Regular expression flags: g i m s x y
      g    global match
      i    ignore case
      m    multiline matching for ^ and $
      JavaScript does NOT provide support for (at the time of this talk; ES2018 added s):
      s    string as single line
      x    extend pattern
  84. Regular expression flags: g i m s x y
      Mozilla-only, non-standard at the time of this talk (standardized in ES2015):
      y    sticky
      Match only from the .lastIndex index (a regexp instance property). In the Mozilla implementation of that time, ^ could thus match at a predefined position; under ES2015, ^ still matches only at the start of the input (or of a line with m).
  85. Alternative syntax for flags
      /(?i)foo/
      /(?i-m)bar$/
      /(?i-sm).x$/
      /(?i)foo(?-i)bar/
      Some implementations do NOT support flag switching on-the-go. In JS, flags are set for the whole regexp instance and you can't change them.
  86. RegExp in JavaScript
  87. Methods
      RegExp instances:
      /regexp/.exec('string')
      null or array ['whole match', $1, $2, ...]
      /regexp/.test('string')
      false or true
      String instances:
      'str'.match(/regexp/)
      'str'.match('\\w{1,3}')
      - same as /regexp/.exec if no 'g' flag used;
      - array of all matches if 'g' flag used (internal capturing groups ignored)
      'str'.search(/regexp/)
      'str'.search('\\w{1,3}')
      first match index, or -1
  88. Methods
      String instances:
      'str'.replace(/old/, 'new');
      WARNING: special magic supported in the replacement string:
      $$    inserts a dollar sign "$"
      $&    substring that matches the regexp
      $`    substring before $&
      $'    substring after $&
      $1, $2, $3 etc.: string that matches n-th capturing group
      'str'.replace(/(r)(e)gexp/g, function(matched, $1, $2, offset, sourceString) {
          // what should replace the matched part on this iteration?
          return 'replacement';
      });
  89. RegExp injection
      // BAD CODE
      var re = new RegExp('^' + userInput + '$');
      // ...
      var userInput = '[abc]'; // oops!
      // GOOD, DO IT AT HOME
      RegExp.escape = function(text) {
          return text.replace(/[-[\]{}()*+?.,\\^$|#\s]/g, "\\$&");
      };
      var re = new RegExp('^' + RegExp.escape(userInput) + '$');
  90. Recommended reading
  91. Online, just google it: MDN Guide on Regular Expressions. The Book: Mastering Regular Expressions, O'Reilly Media
  92. Thank you!