1. Camomile : A Unicode library for OCaml
Yoriyuki Yamagata
National Institute of Advanced Industrial Science and Technology (AIST)
ML Workshop, September 18, 2011
2. Outline
Overview
ASCII to Unicode : A challenge of multilingualization
Example : Unicode normal forms
ulib
Conclusion
10. Overview - functionality
Camomile - A Unicode library for OCaml
Unicode character type
UTF-8, UTF-16, UTF-32 strings
Conversion to/from approximately 200 encodings
Case mapping
Collation (sort and search)
15. Overview - feature
Supports only “logical” operations
No support for rendering or formatting
Purely written in OCaml
Functors and lazy evaluation play crucial roles
29. ASCII to Unicode : challenge of multilingualization
Large number of characters
code range 0x0 - 0x10ffff
Multiple representations of strings
UTF-8, UTF-16 and UTF-32
legacy encodings
Combining characters
ä = a + ◌̈
Nguyễn = Nguyê + ◌̃ + n = Nguye + ◌̂ + ◌̃ + n
ậ = a + ◌̣ + ◌̂ = a + ◌̂ + ◌̣
Diverse cultural conventions
Case mapping OΣOΣ → oσoς (Greek)
Sorting ... < H < CH < I < ... (Slovak)
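The point about combining characters can be made concrete with the OCaml standard library alone. The following is a minimal sketch, not Camomile code: the helper name utf8_of_code_points is hypothetical, and Buffer.add_utf_8_uchar requires OCaml 4.06 or later. It shows that the precomposed ä (U+00E4) and the sequence a + combining diaeresis (U+0061 U+0308) are different byte strings, even though readers see the same text.

```ocaml
(* Hypothetical helper: encode a list of code points as a UTF-8 string. *)
let utf8_of_code_points (cps : int list) : string =
  let buf = Buffer.create 16 in
  List.iter (fun cp -> Buffer.add_utf_8_uchar buf (Uchar.of_int cp)) cps;
  Buffer.contents buf

(* Precomposed form: a single code point, U+00E4. *)
let precomposed = utf8_of_code_points [0x00E4]

(* Decomposed form: 'a' followed by COMBINING DIAERESIS (U+0308). *)
let decomposed = utf8_of_code_points [0x0061; 0x0308]

let () =
  (* prints "precomposed: 2 bytes, decomposed: 3 bytes, equal: false" *)
  Printf.printf "precomposed: %d bytes, decomposed: %d bytes, equal: %b\n"
    (String.length precomposed)
    (String.length decomposed)
    (String.equal precomposed decomposed)
```

A byte-wise comparison therefore treats the two spellings as different strings, which is the problem normal forms address.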
35. Unicode normal forms - what is it?
Unicode has multiple representations of “same” strings.
E.g. ậ = ạ + ◌̂ = a + ◌̣ + ◌̂ = a + ◌̂ + ◌̣ etc.
Each normal form gives one unique representation
There are 4 normal forms
1. NFD
2. NFC
3. NFKD
4. NFKC
We concentrate on NFD
38. Unicode normal form - NFD
1. Decompose characters as much as possible
ậ ⇒ ạ + ◌̂ ⇒ a + ◌̣ + ◌̂
2. Stable-sort each run of combining characters by combining class
a + ◌̣ + ◌̂ ⇒ a + ◌̣ + ◌̂ (already in order: dot below has class 220, circumflex has class 230)
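The two steps above can be sketched in a few lines of OCaml. This is an illustration under stated assumptions, not Camomile's implementation: the tables decomposition and combining_class are toy stand-ins covering only the characters of the running example (a real library takes them from the Unicode Character Database), and code points are plain integers. List.concat_map requires OCaml 4.10 or later.

```ocaml
(* Toy decomposition table (assumption: only the running example is covered). *)
let decomposition = function
  | 0x1EAD -> Some [0x1EA1; 0x0302]  (* ậ -> ạ + circumflex *)
  | 0x1EA1 -> Some [0x0061; 0x0323]  (* ạ -> a + dot below *)
  | 0x00E2 -> Some [0x0061; 0x0302]  (* â -> a + circumflex *)
  | _ -> None

(* Toy combining-class table: 0 means "starter" (not a combining mark). *)
let combining_class = function
  | 0x0323 -> 220  (* COMBINING DOT BELOW *)
  | 0x0302 -> 230  (* COMBINING CIRCUMFLEX ACCENT *)
  | _ -> 0

(* Step 1: decompose as far as possible. *)
let rec decompose cp =
  match decomposition cp with
  | Some cps -> List.concat_map decompose cps
  | None -> [cp]

(* Step 2: stable-sort every maximal run of combining marks by class. *)
let reorder cps =
  let by_class a b = compare (combining_class a) (combining_class b) in
  (* [out] is the output so far, reversed; [run] is the pending run, reversed. *)
  let flush out run =
    List.rev_append (List.stable_sort by_class (List.rev run)) out in
  let rec go out run = function
    | [] -> List.rev (flush out run)
    | cp :: rest when combining_class cp > 0 -> go out (cp :: run) rest
    | cp :: rest -> go (cp :: flush out run) [] rest
  in
  go [] [] cps

let nfd cps = reorder (List.concat_map decompose cps)

(* Both spellings of ậ normalize to a + dot below + circumflex. *)
let from_precomposed = nfd [0x1EAD]               (* [0x61; 0x323; 0x302] *)
let from_unordered = nfd [0x0061; 0x0302; 0x0323] (* [0x61; 0x323; 0x302] *)
```

The sort has to be stable because marks of the same combining class interact visually, so their relative order is significant and must be preserved.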
43. Camomile strings - UTF8, UTF16, UCS4
UTF8
UTF-8 string as a string
UTF16
UTF-16 string as an unsigned 16-bit integer bigarray
UCS4
UTF-32 string as a 32-bit integer bigarray
UnicodeString.Type
UTF8, UTF16 and UCS4 all conform to UnicodeString.Type
String operations are functors over UnicodeString.Type
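The design can be illustrated with a much smaller signature than the real one. In the sketch below, TEXT, Utf8_text, Ucs4_text and Count are hypothetical names invented for this example and do not reproduce UnicodeString.Type; the UTF-8 decoding relies on String.get_utf_8_uchar from the OCaml 4.14 standard library.

```ocaml
(* Hypothetical, simplified stand-in for a Unicode string signature. *)
module type TEXT = sig
  type t
  val iter : (Uchar.t -> unit) -> t -> unit
end

(* UTF-8 text stored in an ordinary string. *)
module Utf8_text : TEXT with type t = string = struct
  type t = string
  let iter f s =
    let rec go i =
      if i < String.length s then begin
        let d = String.get_utf_8_uchar s i in
        f (Uchar.utf_decode_uchar d);
        go (i + Uchar.utf_decode_length d)
      end
    in
    go 0
end

(* Fixed-width text: one array cell per code point. *)
module Ucs4_text : TEXT with type t = Uchar.t array = struct
  type t = Uchar.t array
  let iter = Array.iter
end

(* A string operation written once, as a functor over the signature. *)
module Count (Text : TEXT) = struct
  let code_points (s : Text.t) : int =
    let n = ref 0 in
    Text.iter (fun _ -> incr n) s;
    !n
end

module Utf8_count = Count (Utf8_text)
module Ucs4_count = Count (Ucs4_text)

(* "a" followed by U+0308 is 3 bytes in UTF-8 but 2 code points. *)
let n_utf8 = Utf8_count.code_points "a\xCC\x88"
let n_ucs4 =
  Ucs4_count.code_points [| Uchar.of_int 0x61; Uchar.of_int 0x308 |]
```

The operation is written against the signature only, so adding another string representation requires a new module, not a new copy of the algorithm.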
44. Camomile modules - UNF
Module for Unicode normal form
module type Type =
sig
  type text
  type index  (* position in a text; required by the constraint on Make *)
  val nfd : text -> text
  val nfkd : text -> text
  val nfc : text -> text
  val nfkc : text -> text
  val canon_compare : text -> text -> int
end
module Make (Text : UnicodeString.Type) :
  Type with type text = Text.t and type index = Text.index
45. Camomile modules - UNF
Make creates a normalization module for a given Unicode string type
46. Camomile modules - UNF
nfd converts a text to NFD
47. Camomile modules - UNF
canon_compare compares strings by semantic (canonical) equivalence
48. Camomile modules - UNF
canon_compare lazily builds the NFD of both strings and compares them
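Why laziness helps here can be shown with a generic sketch. The code below is not Camomile's canon_compare: compare_seq, traced and forced are hypothetical names, and the sequences of integers merely stand in for lazily produced NFD code points. It uses the Seq module (OCaml 4.07 or later) and counts how many elements are actually computed before the comparison is decided.

```ocaml
(* Lexicographic comparison of two lazy sequences; stops at the first
   position where they differ. *)
let rec compare_seq (a : int Seq.t) (b : int Seq.t) : int =
  match a (), b () with
  | Seq.Nil, Seq.Nil -> 0
  | Seq.Nil, Seq.Cons _ -> -1
  | Seq.Cons _, Seq.Nil -> 1
  | Seq.Cons (x, a'), Seq.Cons (y, b') ->
    let c = compare x y in
    if c <> 0 then c else compare_seq a' b'

(* Counts how many elements have been produced so far. *)
let forced = ref 0

(* Stand-in for a lazily normalized string: each element is only
   "computed" (and counted) when the consumer asks for it. *)
let traced (l : int list) : int Seq.t =
  Seq.map (fun x -> incr forced; x) (List.to_seq l)

(* The sequences differ at the second element, so only the first two
   elements of each are ever produced. *)
let result = compare_seq (traced [1; 2; 3; 4]) (traced [1; 9; 3; 4])
```

With eager normalization both strings would be converted in full before the first comparison; with the lazy version the work is proportional to the length of the common prefix.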
50. ulib - yet another Unicode library
Now under development
54. ulib - yet another Unicode library
ulib is compact
Minimal functionality
No data files
No initialization
58. ulib - yet another Unicode library
ulib is modern
Rope for Unicode string
Zipper for indexing rope
Pluggable code converter using first class modules
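The idea of a pluggable converter built on first-class modules can be sketched as follows. This does not show ulib's actual interface: CONVERTER, registry, register and decode_with are hypothetical names, and the two converters are deliberately trivial. Converters are ordinary modules packed into values, so they can be stored in a table and selected by name at run time.

```ocaml
(* Hypothetical converter interface: bytes in, code points out. *)
module type CONVERTER = sig
  val name : string
  val decode : string -> int list
end

(* ISO-8859-1: every byte is the code point with the same value. *)
module Latin1 : CONVERTER = struct
  let name = "ISO-8859-1"
  let decode s = List.init (String.length s) (fun i -> Char.code s.[i])
end

(* US-ASCII: bytes >= 0x80 are invalid; map them to U+FFFD. *)
module Ascii : CONVERTER = struct
  let name = "US-ASCII"
  let decode s =
    List.init (String.length s) (fun i ->
      let c = Char.code s.[i] in
      if c < 0x80 then c else 0xFFFD)
end

(* Registry of converters, stored as first-class modules. *)
let registry : (string, (module CONVERTER)) Hashtbl.t = Hashtbl.create 8

let register (c : (module CONVERTER)) : unit =
  let module C = (val c : CONVERTER) in
  Hashtbl.replace registry C.name c

let () =
  register (module Latin1);
  register (module Ascii)

(* Look up a converter by name and apply it, if it is registered. *)
let decode_with (name : string) (bytes : string) : int list option =
  match Hashtbl.find_opt registry name with
  | Some c ->
    let module C = (val c : CONVERTER) in
    Some (C.decode bytes)
  | None -> None
```

A new encoding is added by registering one more module; the code that calls decode_with does not change.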
64. Conclusion
Unicode is different from ASCII
Camomile addresses a "logical" part of Unicode
Functors and laziness play crucial roles
A simpler library, "ulib", is now under development.