Unicode Regular Expressions

Unicode regular expression tutorial with examples in Perl, PHP, and JavaScript.

Presented at: Shutterstock “Brown Bag Lunch” Tech Talk, 23 January 2013, New York, NY

Published in: Technology

Unicode Regular Expressions

  1. Unicode Regular Expressions
     s/�/�/g
     Nick Patch
     23 January 2013
  2. Unicode Refresher
     Unicode attempts to support the characters of the world, a massive task!
  3. Unicode Refresher
     It's hard to attach a single meaning to the word “character”, but most folks think of characters as the smallest stand-alone components of a writing system.
  4. Unicode Refresher
     In Unicode, this sense of characters is represented by one or more code points, which are each stored in one or more bytes.
  5. Unicode Refresher
     However, programmers and programming languages tend to think of characters as individual code points, or worse, individual bytes. We need to modernize our habits!
  6. Unicode Refresher
     Unicode is not just a big set of characters. It also defines standard properties for each character and standard algorithms for operations such as collation, normalization, and segmentation.
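The three levels described in slides 3 to 5 (bytes, code points, and what a reader perceives as one character) can be counted separately. The following is a minimal sketch in modern JavaScript and is not part of the original talk. It assumes a runtime that provides `TextEncoder` and `Intl.Segmenter` (for example a recent Node.js), and the helper name `measure` is made up for illustration.

```javascript
// Hypothetical helper (not from the slides): count a string three ways.
function measure(str) {
  // UTF-8 bytes, as the string would be stored in a UTF-8 encoded file.
  const bytes = new TextEncoder().encode(str).length;
  // Code points: the string iterator walks code points, not UTF-16 units.
  const codePoints = [...str].length;
  // Grapheme clusters: assumes Intl.Segmenter is available in this runtime.
  const segmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });
  const graphemes = [...segmenter.segment(str)].length;
  return { bytes, codePoints, graphemes };
}

const precomposed = '\u1F82';                  // ᾂ as a single code point
const decomposed = '\u03B1\u0313\u0300\u0345'; // α plus three combining marks

console.log(measure(precomposed));
console.log(measure(decomposed));
```

Both strings look identical on screen and both count as one grapheme cluster, yet their byte and code point counts differ.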
  7. Normalization
     NFD(ᾀ◌̀) = α◌̓◌̀◌ͅ
     NFC(ᾀ◌̀) = ᾂ
  8. Normalization
     NFD(Чю◌́рлёнис) = Чю◌́рле◌̈нис
     NFC(Чю◌́рлёнис) = Чю◌́рлёнис
  9. Normalization
     ᾂ ≡ ἂ◌ͅ ≡ ᾀ◌̀ ≡ ᾳ◌̓◌̀ ≡ α◌̓◌̀◌ͅ ≡ α◌̓◌ͅ◌̀ ≡ α◌ͅ◌̓◌̀
     ≠
     ᾲ◌̓ ≡ ὰ◌̓◌ͅ ≡ ὰ◌ͅ◌̓ ≡ ᾳ◌̀◌̓ ≡ α◌̀◌̓◌ͅ ≡ α◌̀◌ͅ◌̓ ≡ α◌ͅ◌̀◌̓
  10. Perl Normalization
      use Unicode::Normalize;
      say $str;       # ᾀ◌̀
      say NFD($str);  # α◌̓◌̀◌ͅ
      say NFC($str);  # ᾂ
  11. JavaScript Normalization
      var unorm = require('unorm');
      console.log($str);             // ᾀ◌̀
      console.log(unorm.nfd($str));  // α◌̓◌̀◌ͅ
      console.log(unorm.nfc($str));  // ᾂ
  12. PHP Normalization
      echo $str;                                             # ᾀ◌̀
      echo Normalizer::normalize($str, Normalizer::FORM_D);  # α◌̓◌̀◌ͅ
      echo Normalizer::normalize($str, Normalizer::FORM_C);  # ᾂ
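The normalization slides can also be reproduced without a third-party module in JavaScript engines that implement `String.prototype.normalize` (standardized in ES2015, after this talk was given). This is a sketch under that assumption. The helpers `codePoints` and `canonicallyEquivalent` are illustrative names, not part of any library.

```javascript
// ᾀ (U+1F80) followed by a combining grave accent (U+0300), as on slide 7.
const input = '\u1F80\u0300';

const nfd = input.normalize('NFD'); // fully decomposed, marks in canonical order
const nfc = input.normalize('NFC'); // decomposed, then recomposed

// Hypothetical helper: list a string's code points as U+XXXX labels.
function codePoints(str) {
  return [...str].map(
    ch => 'U+' + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0')
  );
}

// Hypothetical helper: two strings are canonically equivalent
// when their NFD forms are identical.
function canonicallyEquivalent(a, b) {
  return a.normalize('NFD') === b.normalize('NFD');
}

console.log(codePoints(input).join(' '));
console.log(codePoints(nfd).join(' '));
console.log(codePoints(nfc).join(' '));
```

Comparing NFD forms is one way to test the ≡ and ≠ relations on slide 9: marks with the same combining class (here the breathing and the grave accent) are never reordered, so their relative order stays significant.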
  13. Grapheme Clusters
      regex:    /^.$/
      string 1: ᾂ
      string 2: α◌̓◌̀◌ͅ
  14. Grapheme Clusters
      regex:    /^.$/
      string 1: ᾂ ⇧
      string 2: α◌̓◌̀◌ͅ ⇧
      1. anchor beginning of string
  15. Grapheme Clusters
      regex:    /^.$/
      string 1: ᾂ ⇧
      string 2: α◌̓◌̀◌ͅ ⇧
      1. anchor beginning of string
      2. match code point (excl. \n)
  16. Grapheme Clusters
      regex:    /^.$/
      string 1: ᾂ ⇧⇧
      string 2: α◌̓◌̀◌ͅ
      1. anchor beginning of string
      2. match code point (excl. \n)
      3. anchor at end of string
  17. Grapheme Clusters
      regex:    /^.$/
      string 1: ᾂ ⇧⇧
      string 2: α◌̓◌̀◌ͅ
      1. anchor beginning of string
      2. match code point (excl. \n)
      3. anchor at end of string
      4. 1 success but 1 failure: mixed results
  18. Grapheme Clusters
      regex:    /^\X$/
      string 1: ᾂ
      string 2: α◌̓◌̀◌ͅ
  19. Grapheme Clusters
      regex:    /^\X$/
      string 1: ᾂ ⇧
      string 2: α◌̓◌̀◌ͅ ⇧
      1. anchor beginning of string
  20. Grapheme Clusters
      regex:    /^\X$/
      string 1: ᾂ ⇧
      string 2: α◌̓◌̀◌ͅ ⇧
      1. anchor beginning of string
      2. match grapheme cluster
  21. Grapheme Clusters
      regex:    /^\X$/
      string 1: ᾂ ⇧⇧
      string 2: α◌̓◌̀◌ͅ ⇧ ⇧
      1. anchor beginning of string
      2. match grapheme cluster
      3. anchor at end of string
  22. Grapheme Clusters
      regex:    /^\X$/
      string 1: ᾂ ⇧⇧
      string 2: α◌̓◌̀◌ͅ ⇧ ⇧
      1. anchor beginning of string
      2. match grapheme cluster
      3. anchor at end of string
      4. success!
  23. Perl
      use v5.12;  # better yet: v5.14
      use utf8;
      use charnames qw( :full );  # unless v5.16
      use open qw( :encoding(UTF-8) :std );
      $str =~ /^\X$/;
      $str =~ s/^(\X)$/->$1<-/;
  24. PHP
      preg_match('/^\X$/u', $str);
      preg_replace('/^(\X)$/u', '->$1<-', $str);
  25. JavaScript
      [This slide intentionally left blank.]
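JavaScript regular expressions have no `\X`, which is presumably why the JavaScript slide is blank. In runtimes that ship `Intl.Segmenter` (an assumption, since it arrived long after this talk), the two Perl and PHP operations above can be roughly approximated outside the regex engine. The function names below are invented for this sketch.

```javascript
// Assumes Intl.Segmenter is available (for example a recent Node.js).
const graphemeSegmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });

// Split a string into grapheme clusters.
function graphemes(str) {
  return Array.from(graphemeSegmenter.segment(str), part => part.segment);
}

// Rough stand-in for /^\X$/: the whole string is exactly one grapheme cluster.
function isSingleGrapheme(str) {
  return graphemes(str).length === 1;
}

// Rough stand-in for s/^(\X)$/->$1<-/: wrap the string only if it is one cluster.
function wrapSingleGrapheme(str) {
  return isSingleGrapheme(str) ? '->' + str + '<-' : str;
}

const composed = '\u1F82';                       // string 1 from the slides
const combining = '\u03B1\u0313\u0300\u0345';    // string 2 from the slides

console.log(/^.$/u.test(composed), /^.$/u.test(combining));
console.log(isSingleGrapheme(composed), isSingleGrapheme(combining));
console.log(wrapSingleGrapheme(combining));
```

The `/^.$/u` line reproduces the mixed result from slide 17, while the segmenter-based check treats both strings alike, as `\X` does on slide 22.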
  26. Match Any Character
      two bytes (if byte mode):     е..и
      code point (excl. \n):        е.и
      code point (incl. \n):        е\p{Any}и
      grapheme cluster (incl. \n):  е\Xи
  27. Match Any Letter
      letter code point:        е\p{General_Category=Letter}и
      letter code point:        е\pLи
      Cyrillic code point:      е\p{Script=Cyrillic}и
      Cyrillic code point:      е\p{Cyrillic}и
      letter grapheme cluster:  е(?=\pL)\Xи
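Several of the code point patterns on these two slides have close counterparts in JavaScript engines that support the `u` and `s` flags and Unicode property escapes (ES2015 and ES2018 features, so newer than this talk). The sketch below assumes such an engine. The variable names are illustrative only.

```javascript
// Without the u flag, "." matches one UTF-16 code unit, not one code point.
const anyCodeUnit = /е.и/;
// With the u flag, "." matches one code point (line terminators excluded).
const anyCodePoint = /е.и/u;
// Adding the s (dotAll) flag lets "." match \n as well.
const anyCodePointOrNewline = /е.и/su;
// Property escapes need the u flag.
const anyLetter = /е\p{L}и/u;
const anyCyrillic = /е\p{Script=Cyrillic}и/u;

const samples = ['ежи', 'е\nи', 'еzи', 'е1и', 'е\u{1F600}и'];
for (const s of samples) {
  console.log(JSON.stringify(s), {
    codeUnit: anyCodeUnit.test(s),
    codePoint: anyCodePoint.test(s),
    dotAll: anyCodePointOrNewline.test(s),
    letter: anyLetter.test(s),
    cyrillic: anyCyrillic.test(s),
  });
}
```

The last sample contains a code point outside the Basic Multilingual Plane, which is where the difference between code units and code points shows up.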
  28. regex:    / о \p{Cyrillic} т /x
      string 1: който
      string 2: кои◌̆то
  29. regex:    / о \p{Cyrillic} т /x
      string 1: който
      string 2: кои◌̆то
      1. match letter о
  30. regex:    / о \p{Cyrillic} т /x
      string 1: който
      string 2: кои◌̆то
      1. match letter о
      2. match Cyrillic letter (1 code point)
  31. regex:    / о \p{Cyrillic} т /x
      string 1: който
      string 2: кои◌̆то
      1. match letter о
      2. match Cyrillic letter (1 code point)
      3. match letter т
  32. regex:    / о \p{Cyrillic} т /x
      string 1: който
      string 2: кои◌̆то
      1. match letter о
      2. match Cyrillic letter (1 code point)
      3. match letter т
      4. 1 success but 1 failure: mixed results
  33. regex:    / о (?= \p{Cyrillic} ) \X т /x
      string 1: който
      string 2: кои◌̆то
  34. regex:    / о (?= \p{Cyrillic} ) \X т /x
      string 1: който
      string 2: кои◌̆то
      1. match letter о
  35. regex:    / о (?= \p{Cyrillic} ) \X т /x
      string 1: който ⇧
      string 2: кои◌̆то ⇧
      1. match letter о
      2. positive lookahead Cyrillic letter (1 code point)
  36. regex:    / о (?= \p{Cyrillic} ) \X т /x
      string 1: който ⇧
      string 2: кои◌̆то ⇧
      1. match letter о
      2. positive lookahead Cyrillic letter (1 code point)
      3. match grapheme cluster (1+ code points)
  37. regex:    / о (?= \p{Cyrillic} ) \X т /x
      string 1: който ⇧
      string 2: кои◌̆то ⇧
      1. match letter о
      2. positive lookahead Cyrillic letter (1 code point)
      3. match grapheme cluster (1+ code points)
      4. match letter т
  38. regex:    / о (?= \p{Cyrillic} ) \X т /x
      string 1: който ⇧
      string 2: кои◌̆то ⇧
      1. match letter о
      2. positive lookahead Cyrillic letter (1 code point)
      3. match grapheme cluster (1+ code points)
      4. match letter т
      5. success!
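The lookahead technique walked through above can be imitated in JavaScript, with one substitution that should be stated plainly: since JavaScript has no `\X`, this sketch uses `\P{M}\p{M}*` (one non-mark code point followed by any number of combining marks) as a rough approximation of a grapheme cluster. That is not the full Unicode segmentation algorithm, only a stand-in that happens to cover this example. It also assumes an engine with Unicode property escapes.

```javascript
// Naive version: exactly one Cyrillic code point between о and т.
const naive = /о\p{Script=Cyrillic}т/u;

// Lookahead version: the next code point must be Cyrillic, then consume
// a base code point plus any combining marks (approximation of \X).
const clusterAware = /о(?=\p{Script=Cyrillic})\P{M}\p{M}*т/u;

const precomposedWord = 'който';          // й as one code point (U+0439)
const decomposedWord = 'кои\u0306то';     // и (U+0438) + combining breve (U+0306)

console.log(naive.test(precomposedWord), naive.test(decomposedWord));
console.log(clusterAware.test(precomposedWord), clusterAware.test(decomposedWord));
```

The lookahead checks the script of the base letter without consuming it, so the combining breve, which does not itself carry the Cyrillic script value, is still swallowed by the cluster pattern that follows.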
  39. Character Literals
      [يی]
      (?:ي|ی)
  40. Character Literals
      [يی]
      (?:ي|ی)
  41. Character Literals
      [يی]
      (?:ي|ی)
      [\x{064A}\x{06CC}]
  42. Character Literals
      [يی]
      (?:ي|ی)
      [\x{064A}\x{06CC}]
      [\N{ARABIC LETTER YEH}\N{ARABIC LETTER FARSI YEH}]
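For comparison, JavaScript spells the escaped form as `\u{...}` and requires the `u` flag for it. To my knowledge it has no equivalent of the `\N{...}` named form, so a comment is the usual substitute. A small sketch, with variable names chosen only for illustration:

```javascript
// U+064A ARABIC LETTER YEH, U+06CC ARABIC LETTER FARSI YEH
const yehClass = /[\u{064A}\u{06CC}]/u;
const yehAlternation = /(?:\u{064A}|\u{06CC})/u;

const arabicYeh = '\u064A';
const farsiYeh = '\u06CC';
const alefMaksura = '\u0649'; // visually similar, but a different letter

for (const ch of [arabicYeh, farsiYeh, alefMaksura]) {
  console.log(ch.codePointAt(0).toString(16), yehClass.test(ch), yehAlternation.test(ch));
}
```

Escapes like these sidestep the display problems that right-to-left literals can cause inside brackets and parentheses in an editor.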
  43. Properties
      \p{Script=Latin}
      Name: Script
      Value: Latin
      Match any code point with the value “Latin” for the Script property.
  44. Properties
      \P{Script=Latin}
      Name: Script
      Value: not Latin
      Negated form: match any code point without the value “Latin” for the Script property.
  45. Properties
      \p{Latin}
      Name: Script (implicit)
      Value: Latin
      The Script and General Category properties don't require the name because they're so common and their values don't conflict.
  46. Properties
      \p{General_Category=Letter}
      Name: General Category
      Value: Letter
      Match any code point with the value “Letter” for the General Category property.
  47. Properties
      \p{gc=Letter}
      Name: General Category (gc)
      Value: Letter
      Property names may be abbreviated.
  48. Properties
      \p{gc=L}
      Name: General Category (gc)
      Value: Letter (L)
      The General Category property is so commonly used that its values all have standard abbreviations.
  49. Properties
      \p{L}
      Name: General Category (implicit)
      Value: Letter (L)
      And the General Category values may even be used on their own, like the Script values. These two properties have distinct values.
  50. Properties
      \pL
      Name: General Category (implicit)
      Value: Letter (L)
      Single-character General Category values don't require curly braces.
  51. Properties
      \PL
      Name: General Category (implicit)
      Value: not Letter (L)
      Don't forget negation!
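The shorthand forms above are Perl and PCRE conventions, and not every engine accepts all of them. As an illustration, the sketch below probes which spellings compile in a JavaScript engine with Unicode property escapes (an ES2018 feature, so newer than this talk). The helper `compiles` is a made-up name. In such engines a bare script value like `\p{Latin}` and the brace-less `\pL` are expected to be rejected, while the General Category shorthands are accepted.

```javascript
// Hypothetical helper: does this pattern source compile in Unicode (u) mode?
function compiles(source) {
  try {
    new RegExp(source, 'u');
    return true;
  } catch (err) {
    return false;
  }
}

const forms = [
  '\\p{Script=Latin}',
  '\\P{Script=Latin}',
  '\\p{sc=Latin}',
  '\\p{General_Category=Letter}',
  '\\p{gc=Letter}',
  '\\p{gc=L}',
  '\\p{L}',
  '\\p{Latin}', // bare Script value: Perl/PCRE shorthand
  '\\pL',       // brace-less form: Perl/PCRE shorthand
];

for (const form of forms) {
  console.log(form.padEnd(30), compiles(form));
}

// Positive and negated forms behave as described on slides 43 and 44.
console.log(/\p{Script=Latin}/u.test('a'), /\P{Script=Latin}/u.test('a'));
```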
  52. s/�/�/g
