Camomile : A Unicode library for OCaml

                   Yoriyuki Yamagata

  National Institute of Advanced Industrial Science and Technology (AIST)


        ML Workshop, September 18, 2011
Outline

   Overview


   ASCII to Unicode : A challenge of multilingualization


   Example : Unicode normal forms


   ulib


   Conclusion
Overview - functionality
   Camomile - A Unicode library for OCaml
      Unicode character type
      UTF-8, UTF-16, UTF-32 strings
      Conversion to/from approx 200 encodings
      Case mapping
      Collation (sort and search)
Overview - feature
      Supports only “logical” operations
      No support for rendering or formatting
      Written purely in OCaml
      Functors and lazy evaluation play crucial roles
ASCII to Unicode : challenge of multilingualization
   Large number of characters
              code range 0x0 - 0x10ffff
   Multiple representations of strings
              UTF-8, UTF-16 and UTF-32
              legacy encodings
   Combining characters
              ä = a + ◌̈ (U+0308)
              Nguyễn = Nguyê + ◌̃ + n = Nguye + ◌̂ + ◌̃ + n
              ậ = a + ◌̣ + ◌̂ = a + ◌̂ + ◌̣
   Diverse cultural conventions
              Case mapping OΣOΣ → oσoς (Greek)
              Sorting ... < H < CH < I < ... (Slovak)
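
As a rough illustration of the first three points, the sketch below counts bytes and code points for two encodings of the same visible character. It is self-contained and does not use Camomile; `utf8_length` is a hypothetical helper written for this example, and it assumes well-formed UTF-8 input.

```ocaml
(* Count the code points of a well-formed UTF-8 string by counting
   the bytes that are not continuation bytes (bit pattern 10xxxxxx). *)
let utf8_length (s : string) : int =
  let n = ref 0 in
  String.iter (fun c -> if Char.code c land 0xC0 <> 0x80 then incr n) s;
  !n

(* Precomposed form: U+00E4, encoded in UTF-8 as C3 A4. *)
let precomposed = "\xC3\xA4"

(* Decomposed form: U+0061 followed by U+0308 COMBINING DIAERESIS,
   encoded in UTF-8 as 61 CC 88. *)
let decomposed = "a\xCC\x88"

let () =
  Printf.printf "precomposed: %d bytes, %d code points\n"
    (String.length precomposed) (utf8_length precomposed);
  Printf.printf "decomposed:  %d bytes, %d code points\n"
    (String.length decomposed) (utf8_length decomposed);
  Printf.printf "byte-equal:  %b\n" (precomposed = decomposed)
```

Both strings display as the same character, yet they differ in byte length, in code point count, and under plain string equality. This is the gap that normalization is meant to close.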
Unicode normal forms - what are they?


   Unicode has multiple representations of “same” strings.
   E.g. ậ = ạ + ◌̂ = a + ◌̣ + ◌̂ = a + ◌̂ + ◌̣ etc.
   Each normal form gives a unique representation
   There are 4 normal forms
    1. NFD
    2. NFC
    3. NFKD
    4. NFKC

   We concentrate on NFD
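Unicode normal form - NFD

   1. Decompose characters as much as possible
            ậ ⇒ ạ + ◌̂ ⇒ a + ◌̣ + ◌̂
   2. Do stable sort on combining characters based on
      combining class
            a + ◌̣ + ◌̂ ⇒ a + ◌̣ + ◌̂
      (already in order: dot below has class 220, circumflex 230;
       a + ◌̂ + ◌̣ would be reordered to the same result)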
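
The two steps can be sketched in a few lines of OCaml. This is a toy illustration, not Camomile's implementation: strings are plain lists of code points, and the `decomposition` and `combining_class` tables are hypothetical stand-ins that cover only the characters used in the example (the real data comes from the Unicode Character Database).

```ocaml
(* Toy tables covering only the characters used in this example. *)
let decomposition = function
  | 0x1EAD -> Some [0x1EA1; 0x0302]  (* a with circumflex and dot below *)
  | 0x1EA1 -> Some [0x0061; 0x0323]  (* a with dot below *)
  | 0x00E2 -> Some [0x0061; 0x0302]  (* a with circumflex *)
  | _ -> None

let combining_class = function
  | 0x0323 -> 220  (* COMBINING DOT BELOW *)
  | 0x0302 -> 230  (* COMBINING CIRCUMFLEX ACCENT *)
  | _ -> 0         (* everything else is treated as a starter *)

(* Step 1: decompose recursively, as far as the table allows. *)
let rec decompose cp =
  match decomposition cp with
  | None -> [cp]
  | Some cps -> List.concat (List.map decompose cps)

(* Step 2: stable-sort each run of combining characters by class. *)
let reorder cps =
  let by_class a b = compare (combining_class a) (combining_class b) in
  (* [acc] is the output so far, reversed; [run] is the pending run
     of combining characters, also reversed. *)
  let flush run acc =
    List.rev_append (List.stable_sort by_class (List.rev run)) acc
  in
  let rec go acc run = function
    | [] -> List.rev (flush run acc)
    | cp :: rest when combining_class cp = 0 ->
        go (cp :: flush run acc) [] rest
    | cp :: rest -> go acc (cp :: run) rest
  in
  go [] [] cps

let nfd cps = reorder (List.concat (List.map decompose cps))

let () =
  (* Three spellings of the same character normalize to one sequence. *)
  let show cps =
    String.concat " " (List.map (Printf.sprintf "U+%04X") cps) in
  List.iter
    (fun cps -> Printf.printf "%s => %s\n" (show cps) (show (nfd cps)))
    [ [0x1EAD]; [0x00E2; 0x0323]; [0x0061; 0x0302; 0x0323] ]
```

The sort has to be stable because combining characters of the same class must keep their relative order; reordering them could change how the text is read.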
Camomile strings - UTF8, UTF16, UCS4
  UTF8
  UTF-8 string as a string

  UTF16
  UTF-16 string as an unsigned 16-bit integer bigarray

  UCS4
  UTF-32 string as a 32-bit integer bigarray

  UnicodeString.Type
  UTF8, UTF16 and UCS4 all conform to UnicodeString.Type
  String operations are functors over UnicodeString.Type
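
The functor pattern can be sketched as follows. The signature `UNICODE_STRING` is a hypothetical, much-reduced stand-in for `UnicodeString.Type` (the real signature has many more members), and `Utf8_string`, `Ucs4_array` and `Stats` are names invented for this example. The point is only that one operation, written once against the signature, works for every string representation.

```ocaml
(* Hypothetical, much-reduced stand-in for UnicodeString.Type. *)
module type UNICODE_STRING = sig
  type t
  (* Fold over the code points of the text, in order. *)
  val fold : ('a -> int -> 'a) -> 'a -> t -> 'a
end

(* UTF-8 text stored in an ordinary string.  Assumes well-formed
   input; malformed input may raise Invalid_argument. *)
module Utf8_string : UNICODE_STRING with type t = string = struct
  type t = string
  let fold f init s =
    let len = String.length s in
    let rec go acc i =
      if i >= len then acc
      else
        let b = Char.code s.[i] in
        (* Sequence length and payload bits of the lead byte. *)
        let n, lead =
          if b < 0x80 then (1, b)
          else if b < 0xE0 then (2, b land 0x1F)
          else if b < 0xF0 then (3, b land 0x0F)
          else (4, b land 0x07)
        in
        let cp = ref lead in
        for k = 1 to n - 1 do
          cp := (!cp lsl 6) lor (Char.code s.[i + k] land 0x3F)
        done;
        go (f acc !cp) (i + n)
    in
    go init 0
end

(* Code points stored directly in an array, standing in for UCS4. *)
module Ucs4_array : UNICODE_STRING with type t = int array = struct
  type t = int array
  let fold = Array.fold_left
end

(* A string operation written once, as a functor over the signature. *)
module Stats (Text : UNICODE_STRING) = struct
  let length (t : Text.t) = Text.fold (fun n _ -> n + 1) 0 t
  let non_ascii (t : Text.t) =
    Text.fold (fun n cp -> if cp > 0x7F then n + 1 else n) 0 t
end

module Utf8_stats = Stats (Utf8_string)
module Ucs4_stats = Stats (Ucs4_array)

let () =
  Printf.printf "UTF-8: %d code points, %d non-ASCII\n"
    (Utf8_stats.length "a\xCC\x88") (Utf8_stats.non_ascii "a\xCC\x88");
  Printf.printf "UCS4:  %d code points, %d non-ASCII\n"
    (Ucs4_stats.length [| 0x61; 0x308 |])
    (Ucs4_stats.non_ascii [| 0x61; 0x308 |])
```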
Camomile modules - UNF
  Module for Unicode normal forms
       module type Type =
       sig
         type text
         type index   (* position in a text; used by Make's constraint *)

         (* Conversion to each normal form, e.g. nfd converts to NFD *)
         val   nfd : text -> text
         val   nfkd : text -> text
         val   nfc : text -> text
         val   nfkc : text -> text

         (* Compare strings by semantic (canonical) equivalence,
            by lazily building their NFDs and comparing them *)
         val canon_compare : text -> text -> int
       end

       (* Create a module for a given Unicode string type *)
       module Make (Text : UnicodeString.Type) :
         Type with type text = Text.t and
         type index = Text.index
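
The idea of comparing by lazily built NFDs can be sketched with the standard `Seq` module. This is not Camomile's code: it assumes OCaml 4.11 or later (for `Seq.append`), represents strings as lists of code points, and uses hypothetical toy tables in which every decomposition is one starter followed by combining characters. The `calls` counter is instrumentation added only to make the laziness observable.

```ocaml
(* Toy tables covering only the characters used in this example. *)
let decomposition = function
  | 0x1EAD -> Some [0x1EA1; 0x0302]  (* a with circumflex and dot below *)
  | 0x1EA1 -> Some [0x0061; 0x0323]  (* a with dot below *)
  | _ -> None

let combining_class = function
  | 0x0323 -> 220  (* COMBINING DOT BELOW *)
  | 0x0302 -> 230  (* COMBINING CIRCUMFLEX ACCENT *)
  | _ -> 0

(* Instrumentation only: counts calls to [decompose]. *)
let calls = ref 0

let rec decompose cp =
  incr calls;
  match decomposition cp with
  | None -> [cp]
  | Some cps -> List.concat (List.map decompose cps)

let by_class a b = compare (combining_class a) (combining_class b)

(* Split off the leading run of combining characters. *)
let rec split_marks = function
  | cp :: rest when combining_class cp <> 0 ->
      let marks, tail = split_marks rest in
      (cp :: marks, tail)
  | tail -> ([], tail)

(* Lazily yield the NFD of [cps], one combining sequence at a time.
   Nothing is decomposed until the consumer asks for an element. *)
let rec nfd_seq (cps : int list) : int Seq.t = fun () ->
  match cps with
  | [] -> Seq.Nil
  | cp :: rest ->
      let marks, tail = split_marks rest in
      let segment = List.concat (List.map decompose (cp :: marks)) in
      let normalized =
        match segment with
        | s :: ms when combining_class s = 0 ->
            s :: List.stable_sort by_class ms
        | ms -> List.stable_sort by_class ms
      in
      Seq.append (List.to_seq normalized) (nfd_seq tail) ()

(* Compare two texts by their NFDs, stopping at the first difference. *)
let canon_compare a b =
  let rec go (s1 : int Seq.t) (s2 : int Seq.t) =
    match s1 (), s2 () with
    | Seq.Nil, Seq.Nil -> 0
    | Seq.Nil, Seq.Cons _ -> -1
    | Seq.Cons _, Seq.Nil -> 1
    | Seq.Cons (x, r1), Seq.Cons (y, r2) ->
        let c = compare x y in
        if c <> 0 then c else go r1 r2
  in
  go (nfd_seq a) (nfd_seq b)

let () =
  let long_tail = List.init 1000 (fun _ -> 0x0061) in
  calls := 0;
  let c = canon_compare (0x0062 :: long_tail) (0x0063 :: long_tail) in
  Printf.printf "result %d after %d decompositions\n" c !calls
```

When two long strings differ near the start, only the first combining sequence of each is normalized; eagerly calling `nfd` on both would process every character first.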
ulib - yet another Unicode library
   Now under development
ulib - yet another Unicode library
   ulib is compact
       Minimal functionality
       No data files
       No initialization
ulib - yet another Unicode library
   ulib is modern
       Rope for Unicode string
       Zipper for indexing rope
       Pluggable code converters using first-class modules
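
The last point can be illustrated with a generic sketch of first-class modules used as a converter registry. This shows the language feature only and says nothing about ulib's actual API: the `CONVERTER` signature, the two converters, and the `register` and `decode_with` functions are all hypothetical names invented for this example.

```ocaml
(* Hypothetical converter interface. *)
module type CONVERTER = sig
  val name : string
  (* Decode bytes in this encoding into Unicode code points. *)
  val decode : string -> int list
end

module Latin1 : CONVERTER = struct
  let name = "ISO-8859-1"
  (* Each Latin-1 byte maps to the code point with the same value. *)
  let decode s = List.init (String.length s) (fun i -> Char.code s.[i])
end

module Ascii : CONVERTER = struct
  let name = "US-ASCII"
  (* Bytes outside ASCII become U+FFFD REPLACEMENT CHARACTER. *)
  let decode s =
    List.init (String.length s) (fun i ->
        let b = Char.code s.[i] in
        if b < 0x80 then b else 0xFFFD)
end

(* Converters are ordinary values, so they can live in a table
   and be added at run time. *)
let registry : (string, (module CONVERTER)) Hashtbl.t = Hashtbl.create 8

let register (m : (module CONVERTER)) =
  let module M = (val m : CONVERTER) in
  Hashtbl.replace registry M.name m

let decode_with name bytes =
  match Hashtbl.find_opt registry name with
  | None -> None
  | Some m ->
      let module M = (val m : CONVERTER) in
      Some (M.decode bytes)

let () =
  register (module Latin1);
  register (module Ascii)
```

A caller picks a converter by name, for example `decode_with "ISO-8859-1" "\xE4"`, without the library having to know every encoding at compile time.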
Conclusion
     Unicode is different from ASCII
     Camomile addresses a "logical" part of Unicode
     Functors and laziness play crucial roles
     A simpler library, "ulib", is now under development.
Project URL




   Camomile https://github.com/yoriyuki/Camomile
         ulib https://github.com/yoriyuki/ulib

Camomile : A Unicode library for OCaml

  • 1. Camomile : A Unicode library for OCaml Yoriyuki Yamagata National Institute of Advanced Science and Technology (AIST) ML Workshop, September 18, 2011
  • 2. Outline Overview ASCII to Unicode : A challenge of multilingualization Example : Unicode normal forms ulib Conclusion
  • 3. Outline Overview ASCII to Unicode : A challenge of multilingualization Example : Unicode normal forms ulib Conclusion
  • 5. Overview - functionality Camomile - A Unicode library for OCaml
  • 6. Overview - functionality Camomile - A Unicode library for OCaml Unicode character type
  • 7. Overview - functionality Camomile - A Unicode library for OCaml Unicode character type UTF-8, UTF-16, UTF-32 strings
  • 8. Overview - functionality Camomile - A Unicode library for OCaml Unicode character type UTF-8, UTF-16, UTF-32 strings Conversion to/from approx 200 encodings
  • 9. Overview - functionality Camomile - A Unicode library for OCaml Unicode character type UTF-8, UTF-16, UTF-32 strings Conversion to/from approx 200 encodings Case mapping
  • 10. Overview - functionality Camomile - A Unicode library for OCaml Unicode character type UTF-8, UTF-16, UTF-32 strings Conversion to/from approx 200 encodings Case mapping Collation (sort and search)
  • 12. Overview - feature Only support “logical” operations
  • 13. Overview - feature Only support “logical” operations No support for rendering or formatting
  • 14. Overview - feature Only support “logical” operations No support for rendering or formatting Purely written in OCaml
  • 15. Overview - feature Only support “logical” operations No support for rendering or formatting Purely written in OCaml Functors and lazy evaluation play crucial roles
  • 16. Outline Overview ASCII to Unicode : A challenge of multilingualization Example : Unicode normal forms ulib Conclusion
  • 29. ASCII to Unicode : challenge of multilingualization
        Large number of characters
          code range 0x0 - 0x10ffff
        Multiple representations of strings
          UTF-8, UTF-16 and UTF-32
          legacy encodings
        Combining characters
          ä = a + ◌̈
          Nguyễn = Nguyê + ◌̃ + n = Nguye + ◌̂ + ◌̃ + n
          ậ = a + ◌̣ + ◌̂ = a + ◌̂ + ◌̣
        Diverse cultural conventions
          Case mapping: OΣOΣ → oσoς (Greek)
          Sorting: ... < H < CH < I < ... (Slovak)
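The combining-character point can be checked with the standard library alone. The sketch below is not Camomile code; the helper name `utf8_of_code_points` is made up for illustration, and it assumes OCaml 4.06 or later for `Buffer.add_utf_8_uchar`. It encodes "ä" once as a single precomposed code point and once as "a" followed by a combining diaeresis, and shows that the two byte strings differ even though they display the same.

```ocaml
(* Encode a list of code points as a UTF-8 string (stdlib only). *)
let utf8_of_code_points (cps : int list) : string =
  let buf = Buffer.create 8 in
  List.iter (fun cp -> Buffer.add_utf_8_uchar buf (Uchar.of_int cp)) cps;
  Buffer.contents buf

(* "ä" as one precomposed code point (U+00E4) ... *)
let precomposed = utf8_of_code_points [0x00E4]

(* ... and as "a" (U+0061) followed by COMBINING DIAERESIS (U+0308). *)
let decomposed = utf8_of_code_points [0x0061; 0x0308]

let () =
  Printf.printf "precomposed: %d bytes, decomposed: %d bytes, equal: %b\n"
    (String.length precomposed)
    (String.length decomposed)
    (String.equal precomposed decomposed)
```

A plain byte-wise comparison therefore treats the two spellings as different strings, which is the problem normal forms address.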
  • 35. Unicode normal forms - what is it?
        Unicode has multiple representations of the “same” string.
        E.g. ậ = ạ + ◌̂ = a + ◌̣ + ◌̂ = a + ◌̂ + ◌̣ etc.
        Normal forms give a unique representation.
        There are 4 normal forms
          1. NFD
          2. NFC
          3. NFKD
          4. NFKC
        We concentrate on NFD
  • 38. Unicode normal form - NFD
        1. Decompose characters as much as possible
           ậ ⇒ ạ + ◌̂ ⇒ a + ◌̣ + ◌̂
        2. Do a stable sort on combining characters based on combining class
           a + ◌̣ + ◌̂ ⇒ a + ◌̣ + ◌̂
           (already in order: dot below has class 220, circumflex has class 230)
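The two NFD steps above can be sketched on lists of code points. This is a toy illustration and not Camomile's implementation: the tables `decomposition` and `combining_class` are hypothetical and hold only the handful of entries needed for the ậ example, whereas a real library takes them from the Unicode Character Database. It assumes OCaml 4.10 or later for `List.concat_map`.

```ocaml
(* Toy canonical decomposition table: code point -> replacement sequence. *)
let decomposition = function
  | 0x1EAD -> Some [0x1EA1; 0x0302]   (* ậ -> ạ + circumflex *)
  | 0x1EA1 -> Some [0x0061; 0x0323]   (* ạ -> a + dot below *)
  | 0x00E2 -> Some [0x0061; 0x0302]   (* â -> a + circumflex *)
  | _ -> None

(* Toy canonical combining classes; class 0 marks a starter. *)
let combining_class = function
  | 0x0323 -> 220   (* COMBINING DOT BELOW *)
  | 0x0302 -> 230   (* COMBINING CIRCUMFLEX ACCENT *)
  | _ -> 0

(* Step 1: decompose recursively until no entry applies. *)
let rec decompose cp =
  match decomposition cp with
  | Some cps -> List.concat_map decompose cps
  | None -> [cp]

(* Step 2: stable-sort each run of combining marks by combining class.
   [run] holds the current run reversed; [out] holds the result reversed. *)
let reorder cps =
  let by_class a b = compare (combining_class a) (combining_class b) in
  let flush run out =
    List.rev_append (List.stable_sort by_class (List.rev run)) out in
  let rec go run out = function
    | [] -> List.rev (flush run out)
    | cp :: rest when combining_class cp = 0 ->
      go [] (cp :: flush run out) rest
    | cp :: rest -> go (cp :: run) out rest
  in
  go [] [] cps

let nfd cps = reorder (List.concat_map decompose cps)

let () =
  (* Three spellings of ậ end up as the same sequence. *)
  let show cps = String.concat " " (List.map (Printf.sprintf "U+%04X") cps) in
  List.iter (fun cps -> print_endline (show (nfd cps)))
    [ [0x1EAD]; [0x0061; 0x0302; 0x0323]; [0x00E2; 0x0323] ]
```

The sort has to be stable because marks with the same combining class must keep their relative order.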
  • 43. Camomile strings - UTF8, UTF16, UCS4
        UTF8: a UTF-8 string as a string
        UTF16: a UTF-16 string as an unsigned 16-bit integer bigarray
        UCS4: a UTF-32 string as a 32-bit integer bigarray
        UnicodeString.Type
          UTF8, UTF16 and UCS4 all conform to UnicodeString.Type
          String operations are functors over UnicodeString.Type
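The idea of writing string operations as functors over a common string signature can be sketched as follows. The signature `TEXT` is a hypothetical, heavily reduced stand-in for `UnicodeString.Type`, and the modules `Array_text`, `Latin1_text` and `Count_non_ascii` are invented for this example; none of them are part of Camomile.

```ocaml
(* Hypothetical, much-reduced stand-in for UnicodeString.Type. *)
module type TEXT = sig
  type t
  val length : t -> int       (* number of code points *)
  val get : t -> int -> int   (* code point at a position *)
end

(* One code point per array cell, in the style of UTF-32. *)
module Array_text : TEXT with type t = int array = struct
  type t = int array
  let length = Array.length
  let get = Array.get
end

(* Latin-1 bytes in an ordinary string: one byte per code point. *)
module Latin1_text : TEXT with type t = string = struct
  type t = string
  let length = String.length
  let get s i = Char.code s.[i]
end

(* An operation written once, as a functor over TEXT. *)
module Count_non_ascii (T : TEXT) = struct
  let count (s : T.t) : int =
    let n = ref 0 in
    for i = 0 to T.length s - 1 do
      if T.get s i > 0x7F then incr n
    done;
    !n
end

(* Instantiate the same operation for both representations. *)
module A = Count_non_ascii (Array_text)
module L = Count_non_ascii (Latin1_text)

let () =
  Printf.printf "%d %d\n" (A.count [| 0x61; 0xE4; 0x1EAD |]) (L.count "a\xE4")
```

The operation is written against the signature only, so adding a new string representation needs no change to the operation itself.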
  • 44. Camomile modules - UNF
        Module for Unicode normal form
          module type Type = sig
            type text
            type index  (* not shown on the slide; needed by the constraint on Make below *)
            val nfd : text -> text
            val nfkd : text -> text
            val nfc : text -> text
            val nfkc : text -> text
            val canon_compare : text -> text -> int
          end
          module Make (Text : UnicodeString.Type) :
            Type with type text = Text.t and type index = Text.index
  • 45. Camomile modules - UNF
        Make: create a module for a given Unicode string
  • 46. Camomile modules - UNF
        nfd: conversion to NFD
  • 47. Camomile modules - UNF
        canon_compare: compare strings by semantic equivalence
  • 48. Camomile modules - UNF
        canon_compare works by lazily building the NFDs and comparing them
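Comparing by lazily built normal forms can be illustrated with the standard `Seq` module (OCaml 4.07 or later). This is a sketch under simplifying assumptions and not how Camomile implements `canon_compare`: the function `decompose` is a hypothetical stand-in that handles two characters and does no reordering of combining marks, and the counter `forced` exists only to make the laziness observable.

```ocaml
(* Stand-in for normalization: decomposes â and ä only, no mark reordering. *)
let decompose cp =
  match cp with
  | 0x00E2 -> [0x0061; 0x0302]   (* â -> a + circumflex *)
  | 0x00E4 -> [0x0061; 0x0308]   (* ä -> a + diaeresis *)
  | _ -> [cp]

(* Counts how many input code points were actually examined. *)
let forced = ref 0

(* Lazily produce the decomposed form of a list of code points. *)
let normalized (cps : int list) : int Seq.t =
  Seq.flat_map
    (fun cp -> incr forced; List.to_seq (decompose cp))
    (List.to_seq cps)

(* Compare two lazy sequences, stopping at the first difference. *)
let rec compare_seq (a : int Seq.t) (b : int Seq.t) : int =
  match a (), b () with
  | Seq.Nil, Seq.Nil -> 0
  | Seq.Nil, Seq.Cons _ -> -1
  | Seq.Cons _, Seq.Nil -> 1
  | Seq.Cons (x, a'), Seq.Cons (y, b') ->
    if x <> y then compare x y else compare_seq a' b'

let lazy_compare s1 s2 = compare_seq (normalized s1) (normalized s2)

let () =
  let tail = List.init 1000 (fun _ -> 0x62) in
  let r = lazy_compare (0x61 :: tail) (0x63 :: tail) in
  (* The strings differ at the first code point, so the tails are never visited. *)
  Printf.printf "result sign: %d, code points examined: %d of %d\n"
    (compare r 0) !forced (2 * (List.length tail + 1))
```

When two strings differ early, only a short prefix of each is ever normalized, which is the benefit of building the normal forms lazily instead of normalizing both strings in full first.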
  • 50. ulib - yet another Unicode library
        Now under development
  • 54. ulib - yet another Unicode library
        ulib is compact
          Minimal functionality
          No data files
          No initialization
  • 58. ulib - yet another Unicode library
        ulib is modern
          Rope for Unicode strings
          Zipper for indexing ropes
          Pluggable code converters using first-class modules
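A pluggable converter built on first-class modules might look like the sketch below. It does not reproduce ulib's actual interface: the signature `CONVERTER`, the modules `Latin1` and `Ascii`, and the functions `register` and `decode_with` are all hypothetical names chosen for this example.

```ocaml
(* Hypothetical converter interface: raw bytes in, code points out. *)
module type CONVERTER = sig
  val name : string
  val decode : string -> int list
end

module Latin1 : CONVERTER = struct
  let name = "ISO-8859-1"
  (* Every Latin-1 byte is the code point with the same value. *)
  let decode s = List.init (String.length s) (fun i -> Char.code s.[i])
end

module Ascii : CONVERTER = struct
  let name = "US-ASCII"
  (* Bytes above 0x7F are replaced by U+FFFD REPLACEMENT CHARACTER. *)
  let decode s =
    List.init (String.length s) (fun i ->
      let c = Char.code s.[i] in
      if c < 0x80 then c else 0xFFFD)
end

(* Converters are stored as ordinary values in a run-time registry. *)
let registry : (string, (module CONVERTER)) Hashtbl.t = Hashtbl.create 8

let register (module C : CONVERTER) =
  Hashtbl.replace registry C.name (module C : CONVERTER)

(* Look a converter up by name and unpack it at the point of use. *)
let decode_with (name : string) (bytes : string) : int list option =
  match Hashtbl.find_opt registry name with
  | Some (module C : CONVERTER) -> Some (C.decode bytes)
  | None -> None

let () =
  register (module Latin1);
  register (module Ascii)
```

Because a packed module is an ordinary value, a new encoding can be registered at run time without touching the code that looks converters up.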
  • 64. Conclusion
        Unicode is different from ASCII
        Camomile addresses the "logical" part of Unicode
        Functors and laziness play crucial roles
        A simpler library, "ulib", is now under development.
  • 65. Project URL
        Camomile: https://github.com/yoriyuki/Camomile
        ulib: https://github.com/yoriyuki/ulib