Camomile : A Unicode library for OCaml

Transcript

  • 1. Camomile : A Unicode library for OCaml
      • Yoriyuki Yamagata, National Institute of Advanced Industrial Science and Technology (AIST)
      • ML Workshop, September 18, 2011
  • 2. Outline
      • Overview
      • ASCII to Unicode : A challenge of multilingualization
      • Example : Unicode normal forms
      • ulib
      • Conclusion
  • 10. Overview - functionality
      • Camomile - A Unicode library for OCaml
          • Unicode character type
          • UTF-8, UTF-16, UTF-32 strings
          • Conversion to/from approx. 200 encodings
          • Case mapping
          • Collation (sort and search)
  • 15. Overview - features
      • Supports only “logical” operations
          • No support for rendering or formatting
      • Written purely in OCaml
      • Functors and lazy evaluation play crucial roles
  • 29. ASCII to Unicode : challenge of multilingualization
      • Large number of characters
          • Code range 0x0 - 0x10ffff
      • Multiple representations of strings
          • UTF-8, UTF-16 and UTF-32
          • Legacy encodings
      • Combining characters
          • ä = a + ¨
          • Nguyễn = Nguyê + ˜ + n = Nguye + ˆ + ˜ + n
          • ậ = a + ◌̣ + ˆ = a + ˆ + ◌̣
      • Diverse cultural conventions
          • Case mapping: OΣOΣ → oσoς (Greek)
          • Sorting: ... < H < CH < I < ... (Slovak)
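To make the "combining characters" point concrete, here is a minimal sketch that uses only the OCaml standard library (not Camomile). It encodes ä in two ways, as the single code point U+00E4 and as a + combining diaeresis (U+0061 U+0308), and shows that the two UTF-8 byte strings differ even though they denote the same text. The helper name `utf8_of_code_points` is made up for this illustration, and `Buffer.add_utf_8_uchar` assumes OCaml 4.06 or later.

```ocaml
(* Encode a list of code points as a UTF-8 string, standard library only.
   [utf8_of_code_points] is a name invented for this sketch. *)
let utf8_of_code_points (cps : int list) : string =
  let buf = Buffer.create 16 in
  List.iter (fun cp -> Buffer.add_utf_8_uchar buf (Uchar.of_int cp)) cps;
  Buffer.contents buf

(* "ä" as one precomposed code point. *)
let precomposed = utf8_of_code_points [0x00E4]

(* "ä" as base letter "a" followed by COMBINING DIAERESIS. *)
let decomposed = utf8_of_code_points [0x0061; 0x0308]

let () =
  Printf.printf "precomposed: %d bytes, decomposed: %d bytes, byte-equal: %b\n"
    (String.length precomposed)
    (String.length decomposed)
    (String.equal precomposed decomposed)
```

A plain byte comparison therefore treats the two spellings as different strings, which is the problem normal forms address.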
  • 35. Unicode normal forms - what are they?
      • Unicode has multiple representations of the “same” string.
          • E.g. ậ = ạ + ˆ = a + ◌̣ + ˆ = a + ˆ + ◌̣, etc.
      • Normal forms give a unique representation.
      • There are 4 normal forms:
          1. NFD
          2. NFC
          3. NFKD
          4. NFKC
      • We concentrate on NFD.
  • 38. Unicode normal form - NFD
      1. Decompose characters as much as possible
         ậ ⇒ ạ + ˆ ⇒ a + ◌̣ + ˆ
      2. Do a stable sort on combining characters based on their combining class
         a + ◌̣ + ˆ ⇒ a + ◌̣ + ˆ (unchanged: the dot below, class 220, already precedes the circumflex, class 230)
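The two NFD steps can be sketched in a few lines of plain OCaml. This is a toy illustration, not Camomile's implementation: the functions `decomposition` and `combining_class` are hypothetical stand-ins that cover only the handful of code points used in the example (a with circumflex and dot below), whereas a real library reads these from the Unicode character database. `List.concat_map` assumes OCaml 4.10 or later.

```ocaml
(* Toy canonical decompositions: NOT the full Unicode tables. *)
let decomposition = function
  | 0x1EAD -> Some [0x1EA1; 0x0302]  (* a with circumflex and dot below -> a with dot below + circumflex *)
  | 0x1EA1 -> Some [0x0061; 0x0323]  (* a with dot below -> a + dot below *)
  | 0x00E2 -> Some [0x0061; 0x0302]  (* a with circumflex -> a + circumflex *)
  | _ -> None

(* Toy combining classes: 0 marks a starter (non-combining character). *)
let combining_class = function
  | 0x0323 -> 220  (* COMBINING DOT BELOW *)
  | 0x0302 -> 230  (* COMBINING CIRCUMFLEX ACCENT *)
  | _ -> 0

(* Step 1: decompose recursively, as far as possible. *)
let rec decompose (cp : int) : int list =
  match decomposition cp with
  | None -> [cp]
  | Some cps -> List.concat_map decompose cps

(* Step 2: stable-sort each run of combining marks by combining class. *)
let reorder (cps : int list) : int list =
  let by_class a b = compare (combining_class a) (combining_class b) in
  (* [run] is the current run of marks, newest first;
     [out] is the finished output, newest first. *)
  let flush run out =
    List.rev_append (List.stable_sort by_class (List.rev run)) out in
  let rec go run out = function
    | [] -> List.rev (flush run out)
    | cp :: rest when combining_class cp = 0 ->
      go [] (cp :: flush run out) rest
    | cp :: rest -> go (cp :: run) out rest
  in
  go [] [] cps

(* NFD over a list of code points, under the toy tables above. *)
let nfd (cps : int list) : int list =
  reorder (List.concat_map decompose cps)
```

With these toy tables, `nfd [0x1EAD]`, `nfd [0x00E2; 0x0323]` and `nfd [0x0061; 0x0302; 0x0323]` all yield the same list, `[0x0061; 0x0323; 0x0302]`, which is what makes the normal form usable for equality tests.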
  • 43. Camomile strings - UTF8, UTF16, UCS4
      • UTF8: UTF-8 string as a string
      • UTF16: UTF-16 string as an unsigned 16-bit integer bigarray
      • UCS4: UTF-32 string as a 32-bit integer bigarray
      • UnicodeString.Type
          • UTF8, UTF16 and UCS4 all conform to UnicodeString.Type
          • String operations are functors over UnicodeString.Type
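The idea of writing string operations once, as functors over a common string signature, can be sketched with the standard library alone. Everything below is an assumption made for illustration: the signature `TEXT` is a much-reduced, hypothetical stand-in for UnicodeString.Type (it is not Camomile's actual interface), and the modules `Utf8Text`, `CodePointArray` and `Stats` are invented names. `String.get_utf_8_uchar` assumes OCaml 4.14 or later.

```ocaml
(* Hypothetical, minimal stand-in for a Unicode string signature. *)
module type TEXT = sig
  type t
  (* Fold over the code points of the text, left to right. *)
  val fold : ('a -> Uchar.t -> 'a) -> 'a -> t -> 'a
end

(* Representation 1: UTF-8 bytes stored in an ordinary OCaml string. *)
module Utf8Text : TEXT with type t = string = struct
  type t = string
  let fold f init s =
    let rec go i acc =
      if i >= String.length s then acc
      else
        let d = String.get_utf_8_uchar s i in
        go (i + Uchar.utf_decode_length d) (f acc (Uchar.utf_decode_uchar d))
    in
    go 0 init
end

(* Representation 2: one array cell per code point. *)
module CodePointArray : TEXT with type t = Uchar.t array = struct
  type t = Uchar.t array
  let fold f init a = Array.fold_left f init a
end

(* An operation written once, usable with any representation. *)
module Stats (T : TEXT) = struct
  let length (s : T.t) : int = T.fold (fun n _ -> n + 1) 0 s
  let count_non_ascii (s : T.t) : int =
    T.fold (fun n u -> if Uchar.to_int u > 0x7F then n + 1 else n) 0 s
end

module Utf8Stats = Stats (Utf8Text)
module ArrayStats = Stats (CodePointArray)
```

Here `Utf8Stats.length "a\xCC\x88"` and `ArrayStats.length [| Uchar.of_int 0x61; Uchar.of_int 0x308 |]` both count two code points, although the underlying storage differs.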
  • 48. Camomile modules - UNF
      • Module for Unicode normal forms
      • Make creates a module for a given Unicode string type
      • nfd converts a string to NFD
      • canon_compare compares strings by semantic (canonical) equivalence, by lazily building their NFDs and comparing them

        module type Type = sig
          type text
          type index  (* index type, referenced by the constraint on Make below *)
          val nfd : text -> text
          val nfkd : text -> text
          val nfc : text -> text
          val nfkc : text -> text
          val canon_compare : text -> text -> int
        end

        module Make (Text : UnicodeString.Type) :
          Type with type text = Text.t and type index = Text.index
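Why laziness helps canon_compare can be shown with a small standard-library sketch. It is not Camomile's code: `lazily` is a hypothetical stand-in for an incremental normalizer (it passes code points through unchanged and merely counts how many were produced), and `compare_seq` and `forced` are names invented here. The point is that comparison stops at the first difference, so the rest of both strings is never normalized. `Seq` assumes OCaml 4.07 or later.

```ocaml
(* Number of code points actually produced so far; for illustration only. *)
let forced = ref 0

(* Hypothetical stand-in for an incremental normalizer. A real one would
   emit NFD code points; this one passes the input through and counts. *)
let lazily (cps : int list) : int Seq.t =
  Seq.map (fun cp -> incr forced; cp) (List.to_seq cps)

(* Compare two lazy sequences element by element, stopping at the first
   difference; a shorter sequence that is a prefix sorts first. *)
let rec compare_seq (a : int Seq.t) (b : int Seq.t) : int =
  match a (), b () with
  | Seq.Nil, Seq.Nil -> 0
  | Seq.Nil, Seq.Cons _ -> -1
  | Seq.Cons _, Seq.Nil -> 1
  | Seq.Cons (x, a'), Seq.Cons (y, b') ->
    let c = compare x y in
    if c <> 0 then c else compare_seq a' b'

(* The inputs differ at the first code point, so only one element of each
   sequence is ever produced. *)
let result =
  compare_seq
    (lazily [0x0062; 0x0061; 0x0323; 0x0302])
    (lazily [0x0061; 0x0302; 0x0323; 0x0063])

let () =
  Printf.printf "result = %d, code points produced = %d\n" result !forced
```

An eager version would normalize both strings in full before comparing, even when the answer is already decided by the first character.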
  • 50. ulib - yet another Unicode library
      • Now under development
  • 54. ulib - yet another Unicode library
      • ulib is compact
          • Minimal functionality
          • No data files
          • No initialization
  • 58. ulib - yet another Unicode library
      • ulib is modern
          • Ropes for Unicode strings
          • Zippers for indexing ropes
          • Pluggable code converters using first-class modules
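A "pluggable code converter using first-class modules" could look roughly like the sketch below. This is an assumption about the general technique, not ulib's actual API: the signature `CODEC`, the modules `Latin1` and `Ascii`, and the functions `register` and `decode_with` are all invented for illustration. Converters are packed as first-class module values and stored in a table, so new encodings can be registered at run time.

```ocaml
(* Hypothetical converter interface: bytes in, code points out. *)
module type CODEC = sig
  val name : string
  val decode : string -> int list
end

(* ISO-8859-1: every byte is the code point with the same number. *)
module Latin1 : CODEC = struct
  let name = "ISO-8859-1"
  let decode s = List.init (String.length s) (fun i -> Char.code s.[i])
end

(* US-ASCII: bytes above 0x7F become U+FFFD REPLACEMENT CHARACTER. *)
module Ascii : CODEC = struct
  let name = "US-ASCII"
  let decode s =
    List.init (String.length s) (fun i ->
        let c = Char.code s.[i] in
        if c < 0x80 then c else 0xFFFD)
end

(* Registry of converters, stored as first-class module values. *)
let registry : (string, (module CODEC)) Hashtbl.t = Hashtbl.create 8

let register (codec : (module CODEC)) : unit =
  let module C = (val codec : CODEC) in
  Hashtbl.replace registry C.name codec

(* Look up a converter by name and apply it; None if it is unknown. *)
let decode_with (name : string) (bytes : string) : int list option =
  match Hashtbl.find_opt registry name with
  | None -> None
  | Some codec ->
    let module C = (val codec : CODEC) in
    Some (C.decode bytes)

let () =
  register (module Latin1);
  register (module Ascii)
```

For example, `decode_with "ISO-8859-1" "\xE4"` yields `Some [0xE4]`, while the same byte decoded with `"US-ASCII"` yields `Some [0xFFFD]`, and an unregistered name yields `None`.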
  • 64. Conclusion
      • Unicode is different from ASCII
      • Camomile addresses the "logical" part of Unicode
      • Functors and laziness play crucial roles
      • A simpler library, "ulib", is now under development
  • 65. Project URL
      • Camomile: https://github.com/yoriyuki/Camomile
      • ulib: https://github.com/yoriyuki/ulib