Data Encryption and Tokenization for International Unicode
Contents
Unicode character encoding standard
Select the character encodings to be used
Unicode UTF-8
Unicode Code points for the Scripts can be stored in UTF-8 in one to four bytes
Unicode Code points (red) in UTF-8 include a header in each byte
Unicode Ranges of Code points
These are examples of European Scripts
Examples of Scripts with one- to two-byte characters
UTF-8 and UTF-16 Encoding
Examples of Tokenization of Unicode
Token Fabric generated from input of Unicode Code Points
Forward and backward chaining of tokens
The Token Fabric
The IV Pool
These are examples of East Asian Scripts
Examples of Scripts with three- to four-byte characters
Language preservation can be achieved in groups of Scripts
Example of Unicode characters in Japanese
UTF-8 can be mapped to the Japanese language standard Shift JIS X 0213 2004
Examples of the number of characters in some of the East Asian Scripts
Example of a Japanese address label
Example of tokenizing five Japanese Scripts in an address label
Example of tokenizing 3-byte and 4-byte Unicode characters
Portability of tokens and lookup tables
Encoding UTF-16 4-byte Unicode
Summary
Be careful about issues such as leaking information due to byte-length preservation
Notes
Unicode character encoding standard
Unicode is an information technology standard for the consistent encoding, representation, and
handling of text expressed in most of the world's writing systems. The standard is maintained by the
Unicode Consortium, and as of March 2020, it has a total of 143,859 characters, with Unicode 13.0
(these characters consist of 143,696 graphic characters and 163 format characters) covering 154
modern and historic scripts, as well as multiple symbol sets and emoji. The character repertoire of the
Unicode Standard is synchronized with ISO/IEC 10646, each being code-for-code identical with the
other.
The Unicode Standard consists of a set of code charts for visual reference, an encoding method and set
of standard character encodings, a set of reference data files, and a number of related items, such as
character properties, rules for normalization, decomposition, collation, rendering, and bidirectional text
display order (for the correct display of text containing both right-to-left scripts, such as Arabic and
Hebrew, and left-to-right scripts). Unicode's success at unifying character sets has led to its widespread
and predominant use in the internationalization and localization of computer software. The standard
has been implemented in many recent technologies, including modern operating systems, XML, Java
(and other programming languages), and the .NET Framework.
Unicode can be implemented by different character encodings. The Unicode Standard defines the Unicode
Transformation Formats (UTF) UTF-8, UTF-16, and UTF-32, as well as several other encodings. The most
commonly used encodings are UTF-8, UTF-16, and UCS-2 (a precursor of UTF-16 without full support for
Unicode).
Select the character encodings to be used
This paper focuses on UTF-8, because a 2020 survey of character encodings used by websites reported
that 95.4% of websites use UTF-8.
Unicode UTF-8
Unicode Code points for the Scripts can be stored in UTF-8 in one to four bytes
1) One byte: the 128 US-ASCII characters.
2) Two bytes: the next 1,920 characters, covering the Latin-script, Greek, Cyrillic, Coptic, Armenian,
Hebrew, Arabic, Syriac, Thaana and N'Ko alphabets.
3) Three bytes: the other characters in common use, including most Chinese, Japanese and Korean
characters. The commonly used Hanzi/Kanji characters are in the "CJK Unified Ideographs" block
between U+4E00 and U+9FFF and take 3 bytes in UTF-8. The Japanese Hiragana and Katakana
characters also take 3 bytes.
4) Four bytes: less common CJK characters, various historic scripts, mathematical symbols, and emoji
(pictographic symbols).
Unicode Code points (red) in UTF-8 include a header in each byte
The header of the first byte indicates how many bytes are included in the sequence for each character.
Each following byte of the sequence starts with the continuation header 10.
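The header bits can be inspected with a few lines of Python. This is only a minimal sketch: the four sample characters and the helper names utf8_bits and sequence_length are illustrative assumptions, not taken from the original figure.

```python
def utf8_bits(ch: str) -> list[str]:
    """Return the UTF-8 bytes of one character as 8-bit binary strings."""
    return [format(b, "08b") for b in ch.encode("utf-8")]


def sequence_length(first_byte: int) -> int:
    """Read the sequence length from the header of the first byte."""
    if first_byte >> 7 == 0b0:        # 0xxxxxxx
        return 1
    if first_byte >> 5 == 0b110:      # 110xxxxx
        return 2
    if first_byte >> 4 == 0b1110:     # 1110xxxx
        return 3
    if first_byte >> 3 == 0b11110:    # 11110xxx
        return 4
    raise ValueError("continuation byte (10xxxxxx) or invalid header")


# Assumed samples: U+0041, U+00E4, U+65E5, U+2000B (1, 2, 3 and 4 bytes).
for ch in ["A", "\u00e4", "\u65e5", "\U0002000b"]:
    first = ch.encode("utf-8")[0]
    print(f"U+{ord(ch):04X}", utf8_bits(ch), "length:", sequence_length(first))
```

Every byte after the first one in each printed sequence starts with 10, which is how a decoder tells continuation bytes from the start of a new character.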

Unicode Ranges of Code points
Unicode Ranges of Code points for the Scripts that are used in different languages
These are examples of European Scripts
Examples of Scripts with one- to two-byte characters
Let’s first look at examples of European Scripts: Basic Latin, Latin-1 Supplement, and Cyrillic.
Basic Latin
This Script (U+0000–U+007F) has 1-byte characters (7-bit code points) and covers the US-ASCII
characters.
Latin-1 Supplement
This Script (U+0080–U+00FF) has 2-byte characters and covers, for example, German umlauts and
Scandinavian characters.
Cyrillic
In the Cyrillic code chart, tokenization of the Russian alphabet may include the characters marked in
green (dotted lines) and use the characters marked in red for special purposes.
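A lookup of the Script for a character can be sketched as follows. The block boundaries are the standard Unicode block ranges for these three Scripts; the table name BLOCKS and the function block_of are hypothetical names chosen for this illustration.

```python
# Unicode block ranges for the three European Scripts discussed above.
BLOCKS = [
    ("Basic Latin", 0x0000, 0x007F),
    ("Latin-1 Supplement", 0x0080, 0x00FF),
    ("Cyrillic", 0x0400, 0x04FF),
]


def block_of(ch):
    """Return the name of the block that contains ch, or None."""
    cp = ord(ch)
    for name, lo, hi in BLOCKS:
        if lo <= cp <= hi:
            return name
    return None


# Assumed samples: Latin A, u with umlaut (U+00FC), Cyrillic Zhe (U+0416).
for ch in ["A", "\u00fc", "\u0416"]:
    print(f"U+{ord(ch):04X}", block_of(ch), len(ch.encode("utf-8")), "byte(s)")
```

A tokenizer that wants to preserve the Script of its input could use such a lookup to pick the substitution alphabet for each character.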
UTF-8 and UTF-16 Encoding
UTF-8, the dominant encoding on the World Wide Web (used in over 95% of websites as of 2020, and up
to 100% for some languages) and on most Unix-like operating systems, uses one byte for the first 128
code points and up to 4 bytes for other characters. The first 128 Unicode code points represent the
ASCII characters, which means that any ASCII text is also a UTF-8 text.
UCS-2 uses two bytes (16 bits) for each character but can only encode the first 65,536 code points, the
so-called Basic Multilingual Plane (BMP). With 1,112,064 possible Unicode code points corresponding to
characters on 17 planes, and with over 143,000 code points defined as of version 13.0, UCS-2 can
represent fewer than half of all encoded Unicode characters. UCS-2 is therefore outdated, though still
widely used in software. UTF-16 extends UCS-2 by using the same 16-bit encoding as UCS-2 for the
Basic Multilingual Plane and a 4-byte encoding for the other planes. As long as it contains no code
points in the reserved range U+D800–U+DFFF, a UCS-2 text is valid UTF-16 text.
UTF-8 is defined to encode code points in one to four bytes, depending on the number of significant bits
in the numerical value of the code point. We will focus on portability aspects between UTF-8 and UTF-16
(used by Teradata and some other large databases) and start with 1-byte, 2-byte, and 3-byte
characters, using three sample characters. The following table shows the structure of the encoding:
 Code point range      Byte 1    Byte 2    Byte 3    Byte 4
 U+0000 - U+007F       0xxxxxxx
 U+0080 - U+07FF       110xxxxx  10xxxxxx
 U+0800 - U+FFFF       1110xxxx  10xxxxxx  10xxxxxx
 U+10000 - U+10FFFF    11110xxx  10xxxxxx  10xxxxxx  10xxxxxx
We will tokenize the sample code points in the following examples.
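The portability point can be checked with Python's built-in codecs. In this sketch the three sample characters ($, ¢ and €) are an assumption, since the characters of the original example are not reproduced in the text; encoded_sizes is a hypothetical helper name.

```python
# Assumed samples: U+0024 (1 byte), U+00A2 (2 bytes), U+20AC (3 bytes in UTF-8).
samples = ["\u0024", "\u00a2", "\u20ac"]


def encoded_sizes(ch):
    """Return (UTF-8 byte count, UTF-16 byte count) for one character."""
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-be"))


for ch in samples:
    u8, u16 = encoded_sizes(ch)
    print(
        f"U+{ord(ch):04X}",
        "UTF-8:", ch.encode("utf-8").hex(), f"({u8} bytes)",
        "UTF-16BE:", ch.encode("utf-16-be").hex(), f"({u16} bytes)",
    )
```

All three samples occupy exactly one 16-bit unit in UTF-16, even though their UTF-8 lengths differ, which is why code points of up to 3 UTF-8 bytes map directly between the two encodings.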
Examples of Tokenization of Unicode
Token Fabric generated from input of Unicode Code Points
A fabric of intermediate tokens is created to increase the entropy of each final token. In the figure,
the blue tokens represent temporary results and the final token values are green.
Forward and backward chaining of tokens
The tokenization function can be based on randomized lookup tables or on encryption. The chaining can
add entropy by feeding additional input into the tokenization process in each step. This example with
short data is based on a two-character input string “AA”. It generates the middle-layer tokens, which
are temporary results, and the final tokens at the bottom layer. The tokens are chained forward and
backward to increase the entropy.
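The chaining idea can be illustrated with a toy example. This is a sketch under stated assumptions and not the paper's actual algorithm: the alphabet is limited to A-Z, the two lookup tables are seeded permutations built with Python's random module (which is not cryptographically secure), and all function names are hypothetical.

```python
import random
import string

ALPHABET = string.ascii_uppercase
N = len(ALPHABET)


def make_table(seed):
    """Build a randomized lookup table (a permutation of 0..N-1)."""
    rng = random.Random(seed)  # illustration only, not a secure generator
    table = list(range(N))
    rng.shuffle(table)
    return table


def invert(table):
    """Return the inverse permutation of a lookup table."""
    inv = [0] * N
    for i, v in enumerate(table):
        inv[v] = i
    return inv


def chain(values, table, start):
    """Substitute each value after mixing in the previous output token."""
    out, prev = [], start
    for v in values:
        prev = table[(v + prev) % N]
        out.append(prev)
    return out


def unchain(values, table, start):
    """Reverse chain(): undo the substitution, then remove the mixed-in token."""
    inv = invert(table)
    out, prev = [], start
    for v in values:
        out.append((inv[v] - prev) % N)
        prev = v
    return out


def tokenize(text, fwd, bwd, iv=0):
    values = [ALPHABET.index(c) for c in text]
    middle = chain(values, fwd, iv)             # forward pass: temporary tokens
    final = chain(middle[::-1], bwd, iv)[::-1]  # backward pass: final tokens
    return "".join(ALPHABET[v] for v in final)


def detokenize(token, fwd, bwd, iv=0):
    values = [ALPHABET.index(c) for c in token]
    middle = unchain(values[::-1], bwd, iv)[::-1]
    plain = unchain(middle, fwd, iv)
    return "".join(ALPHABET[v] for v in plain)


fwd_table, bwd_table = make_table(1), make_table(2)
token = tokenize("AA", fwd_table, bwd_table)
print("AA ->", token, "->", detokenize(token, fwd_table, bwd_table))
```

Because each step mixes in the previous token, the final token at every position depends on the whole input string rather than on a single character.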
The Token Fabric
The IV Pool
The IV Pool is a set of pre-generated randomized initialization vectors (IVs) to be used in different
steps when creating the encoded fabric. Substrings of records in the IV Pool are used in each step of the
tokenization process. The figure shows an example where 12 characters of the clear-text input are
tokenized and 4 characters are used as input to the selection from the IV Pool.
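One possible way to organize such a pool is sketched below. Everything specific here is an assumption made for illustration: the pool size, the IV length, the use of SHA-256 to turn the 4 selector characters into a pool index, and the choice of the last 4 characters as the selector. The paper does not specify these details.

```python
import hashlib
import secrets

POOL_SIZE = 256   # assumed number of records in the pool
IV_LENGTH = 16    # assumed length of each record in bytes


def generate_iv_pool(size=POOL_SIZE, length=IV_LENGTH):
    """Pre-generate a pool of random initialization vectors."""
    return [secrets.token_bytes(length) for _ in range(size)]


def split_input(text, selector_len=4):
    """Split the input into the part to tokenize and the selector part."""
    return text[:-selector_len], text[-selector_len:]


def select_iv(pool, selector):
    """Pick one pool record deterministically from the selector characters."""
    digest = hashlib.sha256(selector.encode("utf-8")).digest()
    index = int.from_bytes(digest[:4], "big") % len(pool)
    return pool[index]


def iv_for_step(iv, step, width=4):
    """Return the substring of the IV record used in a given step."""
    start = (step * width) % len(iv)
    return iv[start:start + width]


pool = generate_iv_pool()
body, selector = split_input("ABCDEFGHIJKL1234")  # 12 + 4 characters
iv = select_iv(pool, selector)
print(body, selector, [iv_for_step(iv, s).hex() for s in range(4)])
```

The same selector always leads to the same pool record, so detokenization can find the IV again as long as the selector characters are available in the clear.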

These are examples of East Asian Scripts
Examples of Scripts with three- to four-byte characters
Language preservation can be achieved in groups of Scripts
Group X:
1. Kanji: 4E00 - (9FA5) 9FAF.
2. Kanji extension A: 3400 - (4DB5) 4DBF.
3. Kanji extension B: 20000 - (2A6D6) 2A6DF. Old and historic Script.
4. Kanji supplement: 2F800 - (2FA1D) 2FA1F
5. Hiragana: 3040 (3041) - (309E) 309F, (3095 - 3098 unused)
Group Y:
1. Katakana: 30A0–30FF
2. Katakana Phonetic Extensions: 31F0–31FF
3. Small Kana Extension: 1B130-1B16F
4. Kana Supplement: 1B000–1B0FF
5. Kana Extended-A: 1B100–1B12F
6. Halfwidth and Fullwidth Forms: FF00-FFEF (Numeric FF10-FF19, Romaji FF21-FF5A)
7. Punctuation: 3000-3030
Group Z:
1. CJK Unified Ideographs Extension D: 2B740–2B81D
2. CJK Unified Ideographs Extension E: 2B820–2CEA1
3. CJK Unified Ideographs Extension F: 2CEB0–2EBE0
4. CJK Unified Ideographs Extension G: 30000–3134A
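The groups above can be turned into a range lookup, so that a tokenizer can check that each token character stays in the same group as its input character. This is a minimal sketch: the ranges are copied from the lists above, while the names GROUPS, group_of and same_groups are hypothetical.

```python
# Code point ranges copied from Groups X, Y and Z above (inclusive bounds).
GROUPS = {
    "X": [(0x4E00, 0x9FAF), (0x3400, 0x4DBF), (0x20000, 0x2A6DF),
          (0x2F800, 0x2FA1F), (0x3040, 0x309F)],
    "Y": [(0x30A0, 0x30FF), (0x31F0, 0x31FF), (0x1B130, 0x1B16F),
          (0x1B000, 0x1B0FF), (0x1B100, 0x1B12F), (0xFF00, 0xFFEF),
          (0x3000, 0x3030)],
    "Z": [(0x2B740, 0x2B81D), (0x2B820, 0x2CEA1), (0x2CEB0, 0x2EBE0),
          (0x30000, 0x3134A)],
}


def group_of(ch):
    """Return "X", "Y" or "Z" for a character, or None if it is in no group."""
    cp = ord(ch)
    for name, ranges in GROUPS.items():
        if any(lo <= cp <= hi for lo, hi in ranges):
            return name
    return None


def same_groups(plain, token):
    """Check that token and plain text agree in group, position by position."""
    return len(plain) == len(token) and all(
        group_of(p) == group_of(t) for p, t in zip(plain, token)
    )


# Kanji U+65E5, Hiragana U+3042, Katakana U+30AB, Extension D U+2B740.
for ch in ["\u65e5", "\u3042", "\u30ab", "\U0002b740"]:
    print(f"U+{ord(ch):04X}", group_of(ch))
```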
Example of Unicode characters in Japanese
Unicode tokenization for Japanese can focus on the characters found in Shift JIS X 0213 2004, the
standard for the Japanese language. For characters that are 4 bytes long in UTF-8, we restrict
tokenization to the 303 kanji that are specified in Shift JIS X 0213 2004.
UTF-8 can be mapped to the Japanese language standard Shift JIS X 0213 2004
Examples of the number of characters in some of the East Asian Scripts
Example of a Japanese address label
Examples of the types of characters would be:
• Half-width Kana spaces
• 4-byte Kanji characters (Chinese characters)
• Mixed strings with both Kana and Kanji (different byte sizes)
Example of tokenizing five Japanese Scripts in an address label
Example of tokenizing 3-byte and 4-byte Unicode characters
The main test criterion is to take a range of Kanji and Kana characters with different string lengths
and validate that tokenization does not increase the length.
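Such a length check can be sketched as a small test harness. The tokenizer below is only a placeholder (it rotates each character inside its own code point range) and stands in for the real tokenization function; the ranges, the sample strings and all names are assumptions for illustration.

```python
# Every code point inside one range has the same UTF-8 byte length.
RANGES = [
    (0x3041, 0x3094),    # Hiragana letters (3 bytes in UTF-8)
    (0x30A1, 0x30FA),    # Katakana letters (3 bytes)
    (0x4E00, 0x9FA5),    # Kanji (3 bytes)
    (0x20000, 0x2A6D6),  # Kanji extension B (4 bytes)
]


def toy_tokenize(text, shift=7):
    """Placeholder tokenizer: rotate each character within its own range."""
    out = []
    for ch in text:
        cp = ord(ch)
        for lo, hi in RANGES:
            if lo <= cp <= hi:
                cp = lo + (cp - lo + shift) % (hi - lo + 1)
                break
        out.append(chr(cp))
    return "".join(out)


def length_preserved(plain, token):
    """True if the token is no longer than the input, in characters and bytes."""
    return (len(token) <= len(plain)
            and len(token.encode("utf-8")) <= len(plain.encode("utf-8")))


SAMPLES = [
    "\u6771\u4eac\u90fd",          # Kanji only
    "\u30ab\u30bf\u30ab\u30ca",    # Katakana only
    "\u3072\u3089\u304c\u306a",    # Hiragana only
    "\u6771\U0002000b\u30ab",      # mixed 3-byte and 4-byte characters
]

for plain in SAMPLES:
    token = toy_tokenize(plain)
    print(len(plain.encode("utf-8")), len(token.encode("utf-8")),
          length_preserved(plain, token))
```

Checking the byte length as well as the character count matters for database columns whose size is declared in bytes.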

Portability of tokens and lookup tables
Encoding UTF-16 4-byte Unicode
Tokens whose code points take up to 3 bytes in UTF-8 can be mapped directly between UTF-8 and UTF-16,
because each of these code points fits in a single 16-bit unit. Code points that take 4 bytes in UTF-8
need to be converted. The conversion is defined in the specification "UTF-16, an encoding of ISO 10646",
which also covers the different endian formats, the UTF-16BE and UTF-16LE encodings.
Encoding of a single character from an ISO 10646 character value to UTF-16 proceeds as follows. Let U
be the character number, no greater than 0x10FFFF.
1) If U < 0x10000, encode U as a 16-bit unsigned integer and terminate.
2) Let U' = U - 0x10000. Because U is less than or equal to 0x10FFFF, U' must be less than or equal
to 0xFFFFF. That is, U' can be represented in 20 bits.
3) Initialize two 16-bit unsigned integers, W1 and W2, to 0xD800 and 0xDC00, respectively. These
integers each have 10 bits free to encode the character value, for a total of 20 bits.
4) Assign the 10 high-order bits of the 20-bit U' to the 10 low-order bits of W1 and the 10 low-order
bits of U' to the 10 low-order bits of W2. Terminate.
Graphically, steps 2 through 4 look like:
 U' = yy yyyy yyyy xx xxxx xxxx
 W1 = 110110 yy yyyy yyyy
 W2 = 110111 xx xxxx xxxx
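The four steps translate directly into code. The sketch below follows the steps as listed; the function names utf16_units and to_bytes are hypothetical, and the result is compared against Python's built-in UTF-16 codecs as a cross-check.

```python
def utf16_units(u: int) -> list[int]:
    """Encode one character number U as one or two 16-bit UTF-16 units."""
    if not 0 <= u <= 0x10FFFF:
        raise ValueError("character number out of range")
    if 0xD800 <= u <= 0xDFFF:
        raise ValueError("U+D800..U+DFFF is reserved for surrogates")
    if u < 0x10000:                   # step 1: single 16-bit unit
        return [u]
    u_prime = u - 0x10000             # step 2: 20-bit value U'
    w1 = 0xD800 | (u_prime >> 10)     # steps 3-4: 10 high-order bits into W1
    w2 = 0xDC00 | (u_prime & 0x3FF)   # steps 3-4: 10 low-order bits into W2
    return [w1, w2]


def to_bytes(units, big_endian=True):
    """Serialize 16-bit units as UTF-16BE or UTF-16LE bytes."""
    order = "big" if big_endian else "little"
    return b"".join(w.to_bytes(2, order) for w in units)


# U+2000B is a Kanji extension B character that takes 4 bytes in UTF-8.
units = utf16_units(0x2000B)
print([f"0x{w:04X}" for w in units])
print(to_bytes(units) == "\U0002000b".encode("utf-16-be"))
```

The only difference between UTF-16BE and UTF-16LE is the byte order inside each 16-bit unit; the order of W1 and W2 stays the same.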
Summary
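Be careful about issues such as leaking information due to byte-length preservation. A token that keeps
the UTF-8 byte length of every character reveals whether each clear-text character was a 1-, 2-, 3-, or
4-byte character, and therefore hints at the Scripts used in the clear text.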
Notes
1. UTF-16, an encoding of ISO 10646, https://www.ietf.org/rfc/rfc2781.txt
2. Unicode, https://en.wikipedia.org/wiki/Unicode
3. "The Unicode Standard: A Technical Introduction", https://www.unicode.org/standard/principles.html
4. "Usage Survey of Character Encodings broken down by Ranking", w3techs.com
5. "Conformance" (PDF), The Unicode Standard, https://www.unicode.org/versions/Unicode13.0.0/ch03.pdf#I1.36559
6. "UAX #29: Unicode Text Segmentation §3 Grapheme Cluster Boundaries", unicode.org, https://unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries
7. INTERNATIONAL STANDARD ISO/IEC 20889, https://webstore.ansi.org/Standards/ISO/ISOIEC208892018?gclid=EAIaIQobChMIvI-k3sXd5gIVw56zCh0Y0QeeEAAYASAAEgLVKfD_BwE
8. ISO/IEC 29101:2013 Information technology – Security techniques – Privacy architecture framework, https://www.iso.org/standard/45124.html
9. ISO/IEC 19592-1:2016 Information technology – Security techniques – Secret sharing – Part 1: General
10. ISO/IEC 19592-2:2017 Information technology – Security techniques – Secret sharing – Part 2: Fundamental mechanisms, https://www.iso.org/standard/65425.html
11. Bloom, Burton H. (1970), "Space/Time Trade-offs in Hash Coding with Allowable Errors", Communications of the ACM, 13 (7): 422–426, CiteSeerX 10.1.1.641.9096, doi:10.1145/362686.362692, https://dl.acm.org/doi/10.1145/362686.362692
12. X. Song, D. Wagner, A. Perrig, Practical techniques for searches on encrypted data, in: Proceedings of IEEE Symposium on Security and Privacy (S&P 2000), 2000, pp. 44–55
13. Cryptographically Protected Database Search, https://arxiv.org/abs/1703.02014
14. A Novel Fuzzy Search Approach over Encrypted Data with Improved Accuracy and Efficiency, https://arxiv.org/abs/1904.12111
15. Homomorphic encryption, https://brilliant.org/wiki/homomorphic-encryption/
16. Survey on Secure Search Over Encrypted Data on the Cloud, https://www.researchgate.net/publication/332271636_Survey_on_secure_search_over_encrypted_data_on_the_cloud
17. J. Singh, T. Pasquier, J. Bacon, H. Ko, D. Eyers, Twenty security considerations for cloud-supported internet of things, IEEE Internet of Things Journal 3 (3) (2016) 269–284.
18. Sergey Melnik; Andrey Gubarev; Jing Jing Long; Geoffrey Romer; Shiva Shivakumar; Matt Tolton; Theo Vassilakis (2010). "Dremel: Interactive Analysis of Web-Scale Datasets". Proc. of the 36th International Conference on Very Large Data Bases (VLDB)
19. Mattsson, Ulf. “Data Security: On Premise or in the Cloud,” ISSA Journal, December 2019 – https://www.issa.org/journal/december-2019/
20. Mattsson, Ulf. “Data Privacy: De-Identification Techniques,” ISSA Journal, May 2020 – https://www.issa.org/journal/may-2020/
21. Mattsson, Ulf. “Practical Data Security and Privacy for GDPR and CCPA,” ISACA Journal, May 2020 – https://www.isaca.org/resources/isaca-journal/issues/2020/volume-3/practical-data-security-and-privacy-for-gdpr-and-ccpa
22. C. Bösch, P. Hartel, W. Jonker, A. Peter, A survey of provably secure searchable encryption, ACM Comput. Surv. 47 (2) (2014) 18:1–18:51. doi:10.1145/2636328. URL http://doi.acm.org/10.1145/2636328; G. S. Poh, J.-J. Chin, W.-C. Yau, K.-K. R. Choo, M. S. Mohamad, Searchable symmetric encryption: Designs and challenges, ACM Comput. Surv. 50 (3) (2017) 40:1–40:37. doi:10.1145/3064005. URL http://doi.acm.org/10.1145/3064005
23. What is Secure Multiparty Computation?, https://www.inpher.io/technology/what-is-secure-multiparty-computation
24. Privacy-protected Cloud Migration, https://cryptonumerics.com/privacy-protected-cloud-migration/
25. Tokenization Product Security Guidelines, Version: 1.0, April 2015, PCI Security Standards Council, https://www.pcisecuritystandards.org/documents/Tokenization_Product_Security_Guidelines.pdf?agreement=true&time=1570880509645
