Unicode regular expression tutorial with examples in Perl, PHP, and JavaScript.
Presented at: Shutterstock “Brown Bag Lunch” Tech Talk, 23 January 2013, New York, NY
2. Unicode Refresher
Unicode attempts to support the
characters of the world — a massive task!
3. Unicode Refresher
It's hard to attach a single meaning to the
word “character” but most folks think of
characters as the smallest stand-alone
components of a writing system.
4. Unicode Refresher
In Unicode, this sense of characters is
represented by one or more code points,
which are each stored in one or more bytes.
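As a rough illustration in JavaScript, the same user-perceived character can be counted three ways. This sketch assumes a runtime that provides `Intl.Segmenter` and `TextEncoder` (recent Node.js and browsers); the variable names are illustrative only.

```javascript
// One user-perceived character: α followed by three combining marks.
const text = "\u03B1\u0313\u0300\u0345";

// Grapheme clusters: what a reader would call "characters".
const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" });
const graphemeCount = [...segmenter.segment(text)].length;

// Code points: string iteration walks code points, not UTF-16 code units.
const codePointCount = [...text].length;

// Bytes: size of the UTF-8 encoding of the same text.
const byteCount = new TextEncoder().encode(text).length;

console.log(graphemeCount, codePointCount, byteCount); // 1 4 8
```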
5. Unicode Refresher
However, programmers and
programming languages tend to think of
characters as individual code points,
or worse, individual bytes.
We need to modernize our habits!
6. Unicode Refresher
Unicode is not just a big set of characters.
It also defines standard properties for
each character and standard algorithms
for operations such as collation,
normalization, and segmentation.
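Two of these algorithms are exposed directly in JavaScript, so a small sketch may help. It assumes a runtime with full `Intl` support; the German locale and the sample strings are arbitrary choices for illustration.

```javascript
const precomposed = "\u1F82";                  // ᾂ as a single code point
const decomposed = "\u03B1\u0313\u0300\u0345"; // α plus three combining marks

// Normalization: both spellings become identical once normalized.
const sameAfterNfc = precomposed === decomposed.normalize("NFC");
const sameAfterNfd = precomposed.normalize("NFD") === decomposed;

// Collation: sort by a locale's rules rather than by code point value.
const words = ["z", "\u00E4", "a"];            // z, ä, a
const sorted = [...words].sort(new Intl.Collator("de").compare);

console.log(sameAfterNfc, sameAfterNfd, sorted.join(" "));
```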
15. Grapheme Clusters
regex: /^.$/
string 1: ᾂ
⇧
string 2: α◌̓◌̀◌ͅ
⇧
1. anchor beginning of string
2. match code point (excl. \n)
16. Grapheme Clusters
regex: /^.$/
string 1: ᾂ
⇧⇧
string 2: α◌̓◌̀◌ͅ
1. anchor beginning of string
2. match code point (excl. \n)
3. anchor at end of string
17. Grapheme Clusters
regex: /^.$/
string 1: ᾂ
⇧⇧
string 2: α◌̓◌̀◌ͅ
1. anchor beginning of string
2. match code point (excl. \n)
3. anchor at end of string
4. 1 success but 1 failure: mixed results
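The mixed result can be reproduced in JavaScript. This is a minimal sketch: the strings are written with escapes so the precomposed and decomposed spellings stay distinct, and the `u` flag is assumed so that `.` consumes a whole code point.

```javascript
const pattern = /^.$/u; // anchor, one code point (excluding line terminators), anchor

const precomposed = "\u1F82";                  // ᾂ as a single code point
const decomposed = "\u03B1\u0313\u0300\u0345"; // α plus three combining marks

const matchesPrecomposed = pattern.test(precomposed); // true: one code point
const matchesDecomposed = pattern.test(decomposed);   // false: four code points

console.log(matchesPrecomposed, matchesDecomposed);
```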
26. Match Any Character
two bytes (if byte mode): е..и
code point (excl. \n): е.и
code point (incl. \n): е\p{Any}и
grapheme cluster (incl. \n): е\Xи
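The two code point forms can be compared in JavaScript, which supports `.` and `\p{Any}` under the `u` flag but has no `\X`. The sketch below is an illustration with made-up test strings, not a translation of every line above.

```javascript
const withLetter = "\u0435\u0436\u0438"; // е ж и
const withNewline = "\u0435\n\u0438";    // е, line feed, и

const dot = /\u0435.\u0438/u;            // any code point except line terminators
const any = /\u0435\p{Any}\u0438/u;      // any code point at all

const results = {
  dotLetter: dot.test(withLetter),   // true
  dotNewline: dot.test(withNewline), // false
  anyLetter: any.test(withLetter),   // true
  anyNewline: any.test(withNewline), // true
};

console.log(results);
```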
27. Match Any Letter
letter code point: е\p{General_Category=Letter}и
letter code point: е\pLи
Cyrillic code point: е\p{Script=Cyrillic}и
Cyrillic code point: е\p{Cyrillic}и
letter grapheme cluster: е(?=\pL)\Xи
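A JavaScript sketch of the code point forms follows. JavaScript accepts `\p{L}` and `\p{Script=Cyrillic}` under the `u` flag, but not the braceless `\pL` or the bare `\p{Cyrillic}`, so the braced and named spellings are used here. The test strings are invented for illustration.

```javascript
const longForm = /\u0435\p{General_Category=Letter}\u0438/u; // any letter
const shortForm = /\u0435\p{L}\u0438/u;                      // same, abbreviated
const cyrillic = /\u0435\p{Script=Cyrillic}\u0438/u;         // Cyrillic only

const cyrillicWord = "\u0435\u0436\u0438";    // е ж и
const mixedWord = "\u0435" + "z" + "\u0438";  // Latin z between е and и
const digitWord = "\u0435" + "5" + "\u0438";  // digit between е and и

const results = {
  longCyrillic: longForm.test(cyrillicWord), // true
  shortMixed: shortForm.test(mixedWord),     // true: z is a letter
  scriptMixed: cyrillic.test(mixedWord),     // false: z is not Cyrillic
  shortDigit: shortForm.test(digitWord),     // false: 5 is not a letter
};

console.log(results);
```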
28. regex: / о \p{Cyrillic} т /x
string 1: който
string 2: кои◌̆то
29. regex: / о \p{Cyrillic} т /x
string 1: който
string 2: кои◌̆то
1. match letter о
30. regex: / о \p{Cyrillic} т /x
string 1: който
string 2: кои◌̆то
1. match letter о
2. match Cyrillic letter (1 code point)
31. regex: / о \p{Cyrillic} т /x
string 1: който
string 2: кои◌̆то
1. match letter о
2. match Cyrillic letter (1 code point)
3. match letter т
32. regex: / о \p{Cyrillic} т /x
string 1: който
string 2: кои◌̆то
1. match letter о
2. match Cyrillic letter (1 code point)
3. match letter т
4. 1 success but 1 failure: mixed results
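The same walk-through can be sketched in JavaScript. JavaScript has no `/x` flag and requires the `Script=` name, so the pattern is written without spaces and with `\p{Script=Cyrillic}`; the strings use escapes to keep the two spellings of й distinct.

```javascript
// о, then exactly one Cyrillic code point, then т
const pattern = /\u043E\p{Script=Cyrillic}\u0442/u;

const precomposed = "\u043A\u043E\u0439\u0442\u043E";      // който, й as U+0439
const decomposed = "\u043A\u043E\u0438\u0306\u0442\u043E"; // който, и + combining breve

const matchesPrecomposed = pattern.test(precomposed); // true
const matchesDecomposed = pattern.test(decomposed);   // false: the breve sits before т

console.log(matchesPrecomposed, matchesDecomposed);
```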
33. regex: / о (?= \p{Cyrillic} ) \X т /x
string 1: който
string 2: кои◌̆то
34. regex: / о (?= \p{Cyrillic} ) \X т /x
string 1: който
string 2: кои◌̆то
1. match letter о
35. regex: / о (?= \p{Cyrillic} ) \X т /x
string 1: който
⇧
string 2: кои◌̆то
⇧
1. match letter о
2. positive lookahead Cyrillic letter (1 code point)
36. regex: / о (?= \p{Cyrillic} ) \X т /x
string 1: който
⇧
string 2: кои◌̆то
⇧
1. match letter о
2. positive lookahead Cyrillic letter (1 code point)
3. match grapheme cluster (1+ code points)
37. regex: / о (?= \p{Cyrillic} ) \X т /x
string 1: който
⇧
string 2: кои◌̆то
⇧
1. match letter о
2. positive lookahead Cyrillic letter (1 code point)
3. match grapheme cluster (1+ code points)
4. match letter т
38. regex: / о (?= \p{Cyrillic} ) \X т /x
string 1: който
⇧
string 2: кои◌̆то
⇧
1. match letter о
2. positive lookahead Cyrillic letter (1 code point)
3. match grapheme cluster (1+ code points)
4. match letter т
5. success!
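JavaScript regular expressions have no `\X`, so the sketch below swaps in `\P{M}\p{M}*` (one non-mark code point followed by any number of combining marks). That is only an approximation of a grapheme cluster, not the full Unicode segmentation algorithm, but it is enough to show the lookahead technique on these two strings.

```javascript
// о, lookahead for a Cyrillic code point, approximate grapheme cluster, т
const pattern = /\u043E(?=\p{Script=Cyrillic})\P{M}\p{M}*\u0442/u;

const precomposed = "\u043A\u043E\u0439\u0442\u043E";      // който, й as U+0439
const decomposed = "\u043A\u043E\u0438\u0306\u0442\u043E"; // който, и + combining breve
const nonCyrillic = "\u043E" + "z" + "\u0442";             // Latin z between о and т

const matchesPrecomposed = pattern.test(precomposed); // true
const matchesDecomposed = pattern.test(decomposed);   // true: the breve is consumed
const matchesNonCyrillic = pattern.test(nonCyrillic); // false: lookahead rejects z

console.log(matchesPrecomposed, matchesDecomposed, matchesNonCyrillic);
```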
42. Character Literals
[يی]
(?:ي|ی)
[\x{064A}\x{06CC}]
[\N{ARABIC LETTER YEH}\N{ARABIC LETTER FARSI YEH}]
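In JavaScript the `\x{...}` and `\N{...}` forms are not available; the closest equivalents are `\uXXXX` and, under the `u` flag, `\u{...}`. The following sketch shows three ways to write the same two-letter set. The sample letter U+0628 (beh) is an arbitrary non-member chosen for the check.

```javascript
const fourDigit = /^[\u064A\u06CC]$/u;       // fixed-width escapes
const braced = /^[\u{064A}\u{06CC}]$/u;      // braced escapes (u flag required)
const alternation = /^(?:\u064A|\u06CC)$/u;  // alternation instead of a class

const yeh = "\u064A";       // ARABIC LETTER YEH
const farsiYeh = "\u06CC";  // ARABIC LETTER FARSI YEH
const beh = "\u0628";       // ARABIC LETTER BEH, not in the set

const patterns = [fourDigit, braced, alternation];
const allAcceptYeh = patterns.every((p) => p.test(yeh));
const allAcceptFarsiYeh = patterns.every((p) => p.test(farsiYeh));
const anyAcceptBeh = patterns.some((p) => p.test(beh));

console.log(allAcceptYeh, allAcceptFarsiYeh, anyAcceptBeh);
```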
43. Properties
\p{Script=Latin}
Name: Script
Value: Latin
Match any code point with the
value “Latin” for the Script property.
44. Properties
\P{Script=Latin}
Name: Script
Value: not Latin
Negated form:
Match any code point without the
value “Latin” for the Script property.
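A short JavaScript sketch of the positive and negated forms, using the `u` flag and two sample letters picked for illustration:

```javascript
const latin = /^\p{Script=Latin}$/u;    // code point whose Script is Latin
const notLatin = /^\P{Script=Latin}$/u; // code point whose Script is anything else

const latinLetter = "a";
const cyrillicLetter = "\u0436"; // ж

const results = {
  latinOnLatin: latin.test(latinLetter),          // true
  latinOnCyrillic: latin.test(cyrillicLetter),    // false
  negatedOnLatin: notLatin.test(latinLetter),     // false
  negatedOnCyrillic: notLatin.test(cyrillicLetter), // true
};

console.log(results);
```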
45. Properties
\p{Latin}
Name: Script (implicit)
Value: Latin
The Script and General Category
properties don't require the name
because they're so common and
their values don't conflict.
46. Properties
\p{General_Category=Letter}
Name: General Category
Value: Letter
Match any code point with the value
“Letter” for the General Category property.
47. Properties
\p{gc=Letter}
Name: General Category (gc)
Value: Letter
Property names may be abbreviated.
48. Properties
\p{gc=L}
Name: General Category (gc)
Value: Letter (L)
The General Category property is
so commonly used that its values
all have standard abbreviations.
49. Properties
\p{L}
Name: General Category (implicit)
Value: Letter (L)
And the General Category values may even
be used on their own, like the Script values.
These two properties have distinct values.
50. Properties
\pL
Name: General Category (implicit)
Value: Letter (L)
Single-character General Category
values don't require curly braces.
51. Properties
\PL
Name: General Category (implicit)
Value: not Letter (L)
Don't forget negation!
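Not every spelling above carries over to every language. As a hedged sketch, the JavaScript below checks which forms its regex engine accepts under the `u` flag: the named and abbreviated General Category forms all work, while the braceless `\pL` and the bare script value `\p{Latin}` are rejected. The `compiles` helper is a hypothetical name introduced only for this example.

```javascript
// Hypothetical helper: does this pattern source compile with the u flag?
function compiles(source) {
  try {
    new RegExp(source, "u");
    return true;
  } catch (error) {
    return false;
  }
}

// Equivalent spellings of "any letter" that JavaScript accepts.
const spellings = [
  "\\p{General_Category=Letter}",
  "\\p{gc=Letter}",
  "\\p{gc=L}",
  "\\p{Letter}",
  "\\p{L}",
];
const letter = "\u0436"; // ж
const allMatch = spellings.every((s) => new RegExp("^" + s + "$", "u").test(letter));

// Negation works the same way with an uppercase P.
const negationRejectsLetter = !/^\P{L}$/u.test(letter);

// Forms from the slides that JavaScript does not accept.
const bracelessCompiles = compiles("\\pL");
const bareScriptCompiles = compiles("\\p{Latin}");

console.log(allMatch, negationRejectsLetter, bracelessCompiles, bareScriptCompiles);
```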