Unicode
Regular Expressions

  s/�/�/g
       Nick Patch
    23 January 2013
Unicode Refresher

    Unicode attempts to support the
characters of the world — a massive task!
Unicode Refresher

It's hard to attach a single meaning to the
  word “character”, but most folks think of
  characters as the smallest stand-alone
      components of a writing system.
Unicode Refresher

  In Unicode, this sense of characters is
 represented by one or more code points,
which are each stored in one or more bytes.
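As a rough illustration of these three levels (bytes, code points, and user-perceived characters), the JavaScript sketch below counts each one for a single string. It is not from the talk: the helper name `measure` is made up for this example, and it assumes a runtime that provides `TextEncoder` and `Intl.Segmenter`, such as a recent Node.js.

```javascript
// Count the different "lengths" of one string.
// Assumes a runtime with TextEncoder and Intl.Segmenter (e.g. a recent Node.js).
function measure(str) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });
  return {
    utf8Bytes: new TextEncoder().encode(str).length, // bytes when stored as UTF-8
    codePoints: [...str].length,                     // string iteration yields code points
    graphemes: [...segmenter.segment(str)].length,   // user-perceived characters
  };
}

// α + COMBINING COMMA ABOVE + COMBINING GRAVE ACCENT + COMBINING GREEK YPOGEGRAMMENI
const decomposedAlpha = '\u03B1\u0313\u0300\u0345';
console.log(measure(decomposedAlpha)); // { utf8Bytes: 8, codePoints: 4, graphemes: 1 }
```

What a reader sees as one character is four code points and eight UTF-8 bytes here.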
Unicode Refresher

      However, programmers and
programming languages tend to think of
  characters as individual code points,
       or worse, individual bytes.

  We need to modernize our habits!
Unicode Refresher

Unicode is not just a big set of characters.
  It also defines standard properties for
 each character and standard algorithms
      for operations such as collation,
     normalization, and segmentation.
Normalization

NFD(ᾀ◌̀) = α◌̓◌̀◌ͅ
NFC(ᾀ◌̀) = ᾂ
Normalization

NFD(Чю◌́рлёнис) = Чю◌́рле◌̈нис
NFC(Чю◌́рлёнис) = Чю◌́рлёнис
Normalization

  ᾂ ≡ ἂ◌ͅ ≡ ᾀ◌̀ ≡ ᾳ◌̓◌̀ ≡
 α◌̓◌̀◌ͅ ≡ α◌̓◌ͅ◌̀ ≡ α◌ͅ◌̓◌̀
             ≠
ᾲ◌̓ ≡ ὰ◌̓◌ͅ ≡ ὰ◌ͅ◌̓ ≡ ᾳ◌̀◌̓ ≡
 α◌̀◌̓◌ͅ ≡ α◌̀◌ͅ◌̓ ≡ α◌ͅ◌̀◌̓
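The two equivalence classes above can be checked mechanically. The sketch below is an illustration rather than part of the talk: it spells each sequence with escapes and compares NFD forms using the built-in `String.prototype.normalize`. The names `group1`, `group2`, `canonicallyEqual`, and `allEqual` are invented for the example.

```javascript
// Canonical equivalence check using the built-in String.prototype.normalize.
const group1 = [
  '\u1F82',                   // ᾂ
  '\u1F02\u0345',             // ἂ + ypogegrammeni
  '\u1F80\u0300',             // ᾀ + grave
  '\u1FB3\u0313\u0300',       // ᾳ + psili + grave
  '\u03B1\u0313\u0300\u0345', // α + psili + grave + ypogegrammeni
  '\u03B1\u0313\u0345\u0300', // α + psili + ypogegrammeni + grave
  '\u03B1\u0345\u0313\u0300', // α + ypogegrammeni + psili + grave
];
const group2 = [
  '\u1FB2\u0313',             // ᾲ + psili
  '\u1F70\u0313\u0345',       // ὰ + psili + ypogegrammeni
  '\u1F70\u0345\u0313',       // ὰ + ypogegrammeni + psili
  '\u1FB3\u0300\u0313',       // ᾳ + grave + psili
  '\u03B1\u0300\u0313\u0345', // α + grave + psili + ypogegrammeni
  '\u03B1\u0300\u0345\u0313', // α + grave + ypogegrammeni + psili
  '\u03B1\u0345\u0300\u0313', // α + ypogegrammeni + grave + psili
];

// Two strings are canonically equivalent when their NFD forms are identical.
const canonicallyEqual = (a, b) => a.normalize('NFD') === b.normalize('NFD');
const allEqual = (group) => group.every((s) => canonicallyEqual(s, group[0]));

console.log(allEqual(group1));                       // true
console.log(allEqual(group2));                       // true
console.log(canonicallyEqual(group1[0], group2[0])); // false
```

The relative order of the psili and the grave matters because both marks share the same combining class, so normalization never swaps them.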
Perl Normalization

use Unicode::Normalize;

say $str;          # ᾀ◌̀
say NFD($str);     # α◌̓◌̀◌ͅ
say NFC($str);     # ᾂ
JavaScript Normalization

var unorm = require('unorm');

console.log($str);              // ᾀ◌̀
console.log(unorm.nfd($str));   // α◌̓◌̀◌ͅ
console.log(unorm.nfc($str));   // ᾂ
PHP Normalization

echo $str;            # ᾀ◌̀

echo Normalizer::normalize($str,
Normalizer::FORM_D); # α◌̓◌̀◌ͅ

echo Normalizer::normalize($str,
Normalizer::FORM_C); # ᾂ
Grapheme Clusters

regex:         /^.$/

string 1:      ᾂ
              ⇧⇧

string 2:      α◌̓◌̀◌ͅ


1. anchor beginning of string
2. match code point (excl. \n)
3. anchor at end of string
4. 1 success but 1 failure: mixed results
Grapheme Clusters

regex:         /^\X$/

string 1:      ᾂ
              ⇧⇧

string 2:      α◌̓◌̀◌ͅ
              ⇧      ⇧

1. anchor beginning of string
2. match grapheme cluster
3. anchor at end of string
4. success!
Perl

use v5.12; # better yet: v5.14
use utf8;
use charnames qw( :full ); # unless v5.16
use open qw( :encoding(UTF-8) :std );

$str =~ /^\X$/;

$str =~ s/^(\X)$/->$1<-/;
PHP

preg_match('/^\X$/u', $str);

preg_replace('/^(\X)$/u', '->$1<-', $str);
JavaScript
[This slide intentionally left blank.]
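JavaScript regular expressions still have no grapheme cluster escape, but the grapheme cluster comparison above can be approximated outside the regex engine. The sketch below is an assumption-laden illustration, not something from the talk: it relies on `Intl.Segmenter` being available (for example in a recent Node.js), and `isSingleGrapheme` is a hypothetical helper name.

```javascript
// JavaScript regexes have no grapheme cluster escape, so this sketch counts
// grapheme clusters with Intl.Segmenter instead (assumes the runtime provides it).
const segmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });

// Hypothetical helper: true when the string is exactly one grapheme cluster.
function isSingleGrapheme(str) {
  return [...segmenter.segment(str)].length === 1;
}

const precomposed = '\u1F82';                  // ᾂ as one code point
const decomposed = '\u03B1\u0313\u0300\u0345'; // α plus three combining marks

// With the u flag, "." matches exactly one code point, so the strings disagree.
console.log(/^.$/u.test(precomposed), /^.$/u.test(decomposed));           // true false
// Counting grapheme clusters treats both strings the same way.
console.log(isSingleGrapheme(precomposed), isSingleGrapheme(decomposed)); // true true
```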
Match Any Character

two bytes (if byte mode):      е..и
code point (excl. \n):         е.и
code point (incl. \n):         е\p{Any}и
grapheme cluster (incl. \n):   е\Xи
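For comparison, here is a hedged JavaScript sketch of the same levels. JavaScript has no byte mode; the closest analogue (an analogy of this example, not a claim from the talk) is that without the `u` flag `.` matches a single UTF-16 code unit. The variable names are invented for illustration.

```javascript
// "Any character" at different levels in JavaScript regular expressions.
const withNewline = 'е\nи';        // CYRILLIC е, LINE FEED, CYRILLIC и
const withAstral = 'е\u{1F600}и';  // the middle character lies outside the BMP

// Without the u flag, "." matches one UTF-16 code unit (and never \n).
console.log(/^е.и$/.test(withAstral));   // false: U+1F600 is two code units
console.log(/^е..и$/.test(withAstral));  // true
// With the u flag, "." matches one code point, still excluding \n.
console.log(/^е.и$/u.test(withAstral));  // true
console.log(/^е.и$/u.test(withNewline)); // false
// \p{Any}, or "." with the s flag, also matches \n.
console.log(/^е\p{Any}и$/u.test(withNewline)); // true
console.log(/^е.и$/su.test(withNewline));      // true
```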
Match Any Letter

letter code point:   е\p{General_Category=Letter}и
letter code point:   е\pLи
Cyrillic code point: е\p{Script=Cyrillic}и
Cyrillic code point: е\p{Cyrillic}и

letter grapheme cluster: е(?=\pL)\Xи
regex:         / о \p{Cyrillic} т /x

string 1:      който


string 2:      кои◌̆то


1. match letter о
2. match Cyrillic letter (1 code point)
3. match letter т
4. 1 success but 1 failure: mixed results
regex:          / о (?= \p{Cyrillic} ) \X т /x

string 1:       който
                 ⇧

string 2:       кои◌̆то
                 ⇧

1. match letter о
2. positive lookahead Cyrillic letter (1 code point)
3. match grapheme cluster (1+ code points)
4. match letter т
5. success!
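The same contrast can be sketched in JavaScript, with two caveats that are assumptions of this example rather than points from the talk: JavaScript requires the explicit `Script=` name, and because it has no grapheme cluster escape the second pattern approximates "lookahead plus grapheme cluster" as one Cyrillic code point followed by any number of combining marks (`\p{M}*`). That approximation does not implement the full grapheme cluster rules.

```javascript
const precomposed = 'ко\u0439то'; // който with й as one code point (U+0439)
const decomposed = 'кои\u0306то'; // който with и followed by COMBINING BREVE

// One Cyrillic code point between о and т: only the precomposed form matches,
// because the combining breve is not a Cyrillic-script code point.
const oneCodePoint = /о\p{Script=Cyrillic}т/u;
console.log(oneCodePoint.test(precomposed), oneCodePoint.test(decomposed));   // true false

// Approximation: a Cyrillic code point plus any trailing combining marks.
const baseWithMarks = /о\p{Script=Cyrillic}\p{M}*т/u;
console.log(baseWithMarks.test(precomposed), baseWithMarks.test(decomposed)); // true true
```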
Character Literals

            [يی]

          (?:ي|ی)

     [\x{064A}\x{06CC}]

[\N{ARABIC LETTER YEH}\N{ARABIC LETTER FARSI YEH}]
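A minimal JavaScript sketch of the escaped form follows. JavaScript has no named-character escape, so this example (not from the talk) uses `\u{...}` code point escapes, which require the `u` flag, and keeps the character names in comments. The variable name `yeh` is invented for illustration.

```javascript
// Escaped code points are easier to tell apart than near-identical glyphs.
const yeh = /[\u{064A}\u{06CC}]/u; // ARABIC LETTER YEH or ARABIC LETTER FARSI YEH

console.log(yeh.test('\u064A')); // true  (ARABIC LETTER YEH)
console.log(yeh.test('\u06CC')); // true  (ARABIC LETTER FARSI YEH)
console.log(yeh.test('\u0649')); // false (ARABIC LETTER ALEF MAKSURA)
```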
Properties

         \p{Script=Latin}

           Name: Script
           Value: Latin


   Match any code point with the
value “Latin” for the Script property.
Properties

         \P{Script=Latin}

           Name: Script
          Value: not Latin

           Negated form:
 Match any code point without the
value “Latin” for the Script property.
Properties

           \p{Latin}

     Name: Script (implicit)
        Value: Latin


The Script and General Category
properties don't require the name
because they're so common and
    their values don't conflict.
Properties

     \p{General_Category=Letter}

        Name: General Category
            Value: Letter


   Match any code point with the value
“Letter” for the General Category property.
Properties

          \p{gc=Letter}

   Name: General Category (gc)
          Value: Letter


Property names may be abbreviated.
Properties

            \p{gc=L}

 Name: General Category (gc)
      Value: Letter (L)


The General Category property is
so commonly used that its values
 all have standard abbreviations.
Properties

                   \p{L}

    Name: General Category (implicit)
           Value: Letter (L)


And the General Category values may even
be used on their own, like the Script values.
 These two properties have distinct values.
Properties

               \pL

Name: General Category (implicit)
       Value: Letter (L)


Single-character General Category
 values don't require curly braces.
Properties

               \PL

Name: General Category (implicit)
      Value: not Letter (L)


      Don't forget negation!
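For readers working in JavaScript, the sketch below tries the same property spellings with the `u` flag. It is an illustration under stated assumptions, not part of the talk: `compiles` is a hypothetical helper, and the last two checks reflect JavaScript's stricter syntax, which accepts neither a bare Script value nor the brace-less short form.

```javascript
// Hypothetical helper: does this pattern compile with the u flag?
function compiles(source) {
  try {
    new RegExp(source, 'u');
    return true;
  } catch (e) {
    return false;
  }
}

// Long and abbreviated spellings of the same General Category test.
const letterForms = [
  '\\p{General_Category=Letter}',
  '\\p{gc=Letter}',
  '\\p{gc=L}',
  '\\p{Letter}',
  '\\p{L}',
];
console.log(letterForms.every((src) => new RegExp(`^${src}$`, 'u').test('\u00E9'))); // true

// Script property and negation with \P.
console.log(/^\p{Script=Latin}$/u.test('\u00E9')); // true:  é is Latin
console.log(/^\P{Script=Latin}$/u.test('\u0436')); // true:  ж is not Latin
console.log(/^\P{L}$/u.test('1'));                 // true:  a digit is not a letter

// JavaScript is stricter than Perl about shorthand.
console.log(compiles('\\p{Latin}')); // false: Script values need Script= (or sc=)
console.log(compiles('\\pL'));       // false: braces are always required
```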
s/�/�/g
