Unicode Isn't Scary

Talk video (YouTube): https://youtu.be/uXk6eOCYz_4
Slides presented at a Retrieva seminar

Published in: Technology

  1. PFI • bash vim
  2. 🔊 • ☝ Unicode and UTF • ♊ Surrogate pairs • 💣 BOM • 🔍 UTF-8 • 🕳 UTF-8 • 📎 Aside: JSON • 🔇 • 🗞 ( ) • 🛠 ( ) • NFD NFC NFKD NFKC • 🔗 ( ) • ZWJ (Family) • Skin tone • Yen mark, etc.
  3. Unicode and UTF • Unicode vs. UTF-8 • UTF-16 vs. UTF-16LE • UTF-8 vs. UTF-8LE • BOM • UCS-2 vs. UTF-16
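  4. Unicode • “ ” (Wikipedia) • Coded character set: assigns each character a number (ID) • Character encoding: defines how those numbers become bytes • e.g. あ (JIS X 0208: row 04, cell 02) gets different bytes in Shift JIS and EUC-JP • Shift JIS: 0x82 0xa0 / EUC-JP: 0xa4 0xa2 • In some standards the coded character set and the encoding coincide (e.g. ASCII)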
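The Shift JIS and EUC-JP byte values on this slide can be checked with Python's built-in codecs. This is only a sketch: it assumes the example character is あ (the character at JIS X 0208 row 04, cell 02), and the variable names are illustrative.

```python
# Assumption: the slide's example character is "あ" (JIS X 0208 row 04, cell 02).
ch = "あ"

def hex_bytes(data: bytes) -> str:
    """Format bytes as space-separated hex, e.g. b'\\x82\\xa0' -> '82 a0'."""
    return " ".join(f"{b:02x}" for b in data)

sjis = ch.encode("shift_jis")   # same character, Shift JIS bytes
eucjp = ch.encode("euc_jp")     # same character, EUC-JP bytes

print("Shift JIS:", hex_bytes(sjis))
print("EUC-JP:  ", hex_bytes(eucjp))
```

One character set, two encodings: the bytes differ, but both decode back to the same character.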
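  5. Unicode code points • Unicode assigns each character one number, its code point • Code points range from 0x0 to 0x10FFFF (the Unicode code space) • A code point is written U+XXXX (hexadecimal) • e.g. A U+0041 / Я U+042F / 🍣 U+1F363 • The hexadecimal part has at least 4 digits; shorter values are zero-padded to 4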
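As a small illustration of the U+XXXX notation, the sketch below formats code points with Python's `ord`. The helper name `u_notation` is made up for this example.

```python
def u_notation(ch: str) -> str:
    """Return the U+XXXX form of a single character (hex, at least 4 digits)."""
    return f"U+{ord(ch):04X}"

samples = ["A", "Я", "🍣"]
notations = [u_notation(c) for c in samples]
print(notations)  # ['U+0041', 'U+042F', 'U+1F363']

# The code space ends at 0x10FFFF; chr() refuses anything beyond it.
last = chr(0x10FFFF)
try:
    chr(0x110000)
    beyond_ok = True
except ValueError:
    beyond_ok = False
```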
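  6. Unicode encodings • Ways to represent Unicode code points as data • UTF (Unicode Transformation Format) • (UTF-7,) UTF-8, UTF-16, UTF-32 • Strictly, there are two layers: the character encoding form (CEF) and the character encoding scheme (CES) • Character encoding form (CEF): code points to code units (= 8-bit, 16-bit, or 32-bit integers) • Character encoding scheme (CES): code units to bytes (= LE/BE/BOM (byte order)) • The schemes are UTF-8, UTF-16, UTF-16BE, UTF-16LE, UTF-32, UTF-32BE, UTF-32LE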
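  7. UTF-32 (UTF-32 / UTF-32LE / UTF-32BE) • Unicode code points go up to 0x10FFFF, so 21 bits are enough • → Store each 21-bit code point in one 32-bit unit • e.g. "AЯ🍣" (U+0041 U+042F U+1F363) → [0x41, 0x42F, 0x1f363] (32-bit units) • Every character takes 32 bits (= 4 bytes), even ASCII • The byte order of each 32-bit unit has to be fixed (byte order: CES) • e.g. 🍣 U+1F363 → 00 01 F3 63 (UTF-32BE) / 63 F3 01 00 (UTF-32LE)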
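  8. UTF-16 (UTF-16 / UTF-16LE / UTF-16BE) • Code points that fit in 16 bits take one 16-bit unit; the rest take two 16-bit units • e.g. "AЯ🍣" (U+0041 U+042F U+1F363) • → [0x41, 0x42f, 0xd83c, 0xdf63] (16-bit units) • The two-unit form is a surrogate pair • Smaller than UTF-32, but ASCII still takes 2 bytes per character • Byte order matters here too • e.g. 🍣 U+1F363 → D8 3C DF 63 (UTF-16BE) / 3C D8 63 DF (UTF-16LE)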
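  9. UTF-8 • Code points that fit in 7 bits take 1 byte; everything else takes several bytes (variable length) • e.g. "AЯ🍣" (U+0041 U+042F U+1F363) • → [0x41, 0xd0, 0xaf, 0xf0, 0x9f, 0x8d, 0xa3] (8-bit units) • ASCII-compatible / ASCII text is already valid UTF-8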
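The code-unit lists on the UTF-32, UTF-16, and UTF-8 slides can be reproduced with Python's codecs. The following is a minimal sketch; `code_units` is a hypothetical helper that reads the big-endian byte output back as fixed-size integers.

```python
import struct

def code_units(text: str, encoding: str, fmt: str) -> list:
    """Encode text, then view the bytes as big-endian units (fmt: 'I', 'H' or 'B')."""
    data = text.encode(encoding)
    count = len(data) // struct.calcsize(">" + fmt)
    return [hex(u) for u in struct.unpack(f">{count}{fmt}", data)]

text = "AЯ🍣"
utf32_units = code_units(text, "utf-32-be", "I")  # one 32-bit unit per code point
utf16_units = code_units(text, "utf-16-be", "H")  # 🍣 becomes a surrogate pair
utf8_units = code_units(text, "utf-8", "B")       # 1, 2 and 4 bytes respectively

print(utf32_units)
print(utf16_units)
print(utf8_units)

# Byte order only changes the serialized bytes, not the code units.
sushi_be = "🍣".encode("utf-32-be")
sushi_le = "🍣".encode("utf-32-le")
```

The explicit `-be` / `-le` codec names are used on purpose: Python's plain `"utf-16"` and `"utf-32"` codecs prepend a BOM when encoding.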
  10. Unicode: • : (= ) • : • Unicode • UTF-x
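  11. Aside: UCS • Universal Coded Character Set, ISO/IEC 10646 • A standard that is largely the same as Unicode • It is kept in step with Unicode • Unicode code points stored as 32 bits are UCS-4, as 16 bits UCS-2 • UCS-2 can only express what fits in 16 bits (UCS-2 has no surrogate pairs) • Unicode's UTF-8, UTF-16, and UTF-32 are available as well • Unicode Transformation Format / UCS Transformation Format • UCS-4 / UCS-2 • ≒ UTF-32BE / UTF-16BE (without surrogates)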
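  12. Surrogate pairs • A way to express code points that do not fit in 16 bits • U+D800 to U+DFFF are set aside (surrogate code points): one value from U+D800 to U+DBFF (1024 values) and one from U+DC00 to U+DFFF (1024 values) are combined as two 16-bit units • e.g. 🍣 U+1F363 → (U+D83C, U+DF63) • Subtract 0x10000 and split the result into two 10-bit halves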
  13. Surrogate pairs (continued) • 1024 × 1024 = 1,048,576 code points can be expressed with U+D800 to U+DFFF • 16bit …… • This is why even the 32-bit form stops at 0x10ffff • (U+DBFF, U+DFFF) = U+10FFFF
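The surrogate-pair arithmetic described on these two slides (subtract 0x10000, split into two 10-bit halves) can be sketched as below. The function names are illustrative, not taken from the talk.

```python
def to_surrogates(cp: int) -> tuple:
    """Split a code point in U+10000..U+10FFFF into (high, low) surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point does not need a surrogate pair")
    v = cp - 0x10000                                  # 20-bit value
    return 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)   # top 10 bits, bottom 10 bits

def from_surrogates(high: int, low: int) -> int:
    """Recombine a (high, low) surrogate pair into one code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

high, low = to_surrogates(0x1F363)                   # 🍣
print(f"U+{high:04X} U+{low:04X}")                   # U+D83C U+DF63
print(f"U+{from_surrogates(0xDBFF, 0xDFFF):04X}")    # U+10FFFF, the largest pair
```

The largest pair lands exactly on U+10FFFF, which matches the upper bound of the code space.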
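  14. BOM • Units wider than 1 byte (UTF-16, UTF-32) have a byte-order problem • Unicode text can be LE (little endian) or BE (big endian); a BOM (Byte Order Mark) at the start tells them apart • The BOM is U+FEFF • If it reads as U+FFFE, the byte order is the other one • ……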
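  15. BOM: BOM vs. ZWNBSP • The BOM is U+FEFF • U+FEFF is also a character (ZERO WIDTH NO-BREAK SPACE (ZWNBSP)) • Text in UTF-16LE (or UTF-16BE) that starts with a ZWNBSP looks like it has a BOM when read as UTF-16 • Text with a BOM in UTF-16 gains a leading ZWNBSP when read as UTF-16LE • …… • The ZWNBSP role moved to U+2060 (WORD JOINER); using U+FEFF as ZWNBSP is deprecated (Unicode Ver. 3.2 / 2002)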
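The BOM versus ZWNBSP ambiguity shows up directly in Python's codecs: `"utf-16"` consumes a leading U+FEFF as a BOM, while `"utf-16-le"` keeps it as a character. A short sketch, with illustrative variable names:

```python
import codecs

text = "🍣"
with_bom_le = codecs.BOM_UTF16_LE + text.encode("utf-16-le")  # FF FE + payload
with_bom_be = codecs.BOM_UTF16_BE + text.encode("utf-16-be")  # FE FF + payload

# "utf-16" reads the BOM, picks the byte order, and drops the BOM.
as_utf16_from_le = with_bom_le.decode("utf-16")
as_utf16_from_be = with_bom_be.decode("utf-16")

# "utf-16-le" has a fixed byte order, so U+FEFF stays in the text as a character.
as_utf16le = with_bom_le.decode("utf-16-le")

print(repr(as_utf16_from_le), repr(as_utf16_from_be), repr(as_utf16le))
```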
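  16. BOM: BOM and UTF-8 • UTF-8 has its byte order fixed by the encoding form (CEF), so it does not need a BOM • …… • A BOM at the start of UTF-8 text can still serve to tell it apart from UTF-16/UTF-32 with a BOM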
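  17. BOM: BOM and UTF-8 • Software that expects a UTF-8 BOM (U+FEFF) • Windows software • e.g. Excel relies on the BOM when opening a UTF-8 csv / Excel's "CSV UTF-8 (Comma delimited)" format writes a BOM • Software that breaks on a UTF-8 BOM (U+FEFF) • Linux scripts (the first 2 bytes must be "#!") • PHP (the file must start with "<?PHP") • Know whether the consumer wants a BOM or not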
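In Python, the `utf-8-sig` codec writes and strips the UTF-8 BOM, which makes both cases on this slide easy to try. The sketch below uses made-up sample data (`csv_text`, `script`).

```python
import codecs

csv_text = "id,name\n1,sushi\n"          # hypothetical CSV content

# For consumers that expect a BOM (the slide names Excel): write EF BB BF first.
with_bom = csv_text.encode("utf-8-sig")

# Plain "utf-8" keeps the BOM as a U+FEFF character; "utf-8-sig" strips it.
kept = with_bom.decode("utf-8")
stripped = with_bom.decode("utf-8-sig")

# For consumers that break on a BOM: a script no longer starts with "#!".
script = "#!/bin/sh\necho hi\n".encode("utf-8-sig")
starts_with_shebang = script.startswith(b"#!")

print(with_bom[:3], repr(kept[0]), starts_with_shebang)
```

`utf-8-sig` also decodes input that has no BOM, so it is a reasonable default when reading files of unknown origin.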
  18. BOM UCS • Unicode UCS BOM • BOM U+FEFF ( ) • Unicode
  19. BOM • 16bit • ( ) ( ) • UTF-8 • BOM • ( ) BOM
  20. UTF-8: • 7bit 1byte
  21. UTF-8: the 7-bit range (U+0000 to U+007F) • A code point that fits in 7 bits becomes one byte whose 8th (top) bit is 0 • Identical to ASCII (ASCII-compatible) • e.g. "ABC" (U+0041 U+0042 U+0043) • → [0x41 0x42 0x43]
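  22. UTF-8: U+0080 and above • Code points from U+0080 up take several bytes • A lead byte (110xxxxx / 1110xxxx / 11110xxx) • followed by continuation bytes (10xxxxxx) (every byte from the 2nd on is 10xxxxxx) • The number of leading 1 bits in the lead byte gives the length; subtract 1 for the number of continuation bytes • 110xxxxx: first byte of a 2-byte character (one 10xxxxxx follows) • 1110xxxx: first byte of a 3-byte character (two 10xxxxxx follow)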
  23. UTF-8: U+0080 and above (continued) • Concatenate the bits that follow the 1...10 prefix of each byte • e.g. f0 9f 8d a3 → 11110000 10011111 10001101 10100011 • 000 011111 001101 100011 = 0x1f363 (🍣 U+1F363)
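The bit manipulation on this slide can be written out by hand. The following is a teaching sketch only: `decode_utf8_char` is a hypothetical helper that decodes exactly one character and deliberately skips the validity checks a real decoder needs (overlong forms, surrogates, the U+10FFFF limit).

```python
def decode_utf8_char(data: bytes) -> int:
    """Decode one UTF-8 sequence into a code point (no overlong/range checks)."""
    first = data[0]
    if first < 0x80:                      # 0xxxxxxx: 1 byte
        length, cp = 1, first
    elif first >> 5 == 0b110:             # 110xxxxx: 2 bytes
        length, cp = 2, first & 0x1F
    elif first >> 4 == 0b1110:            # 1110xxxx: 3 bytes
        length, cp = 3, first & 0x0F
    elif first >> 3 == 0b11110:           # 11110xxx: 4 bytes
        length, cp = 4, first & 0x07
    else:
        raise ValueError("not a valid lead byte")
    if len(data) != length:
        raise ValueError("wrong number of bytes for this lead byte")
    for b in data[1:]:
        if b >> 6 != 0b10:                # continuation bytes are 10xxxxxx
            raise ValueError("expected a continuation byte")
        cp = (cp << 6) | (b & 0x3F)       # append the low 6 bits
    return cp

print(hex(decode_utf8_char(bytes([0xF0, 0x9F, 0x8D, 0xA3]))))  # 0x1f363
```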
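  24. UTF-8: properties • Character boundaries can be found from any position • 10xxxxxx is never a first byte • 110xxxxx, 1110xxxx, and 11110xxx are always first bytes • 10xxxxxx is always a 2nd or later byte, and every 2nd or later byte is 10xxxxxx • Bytes that never appear in UTF-8 • 111110xx and 1111110x • Unicode ends at U+10FFFF, which fits in 4 bytes, so the 5- and 6-byte forms are not used • 11111110 and 11111111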
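Because continuation bytes are always 10xxxxxx, a reader dropped at an arbitrary byte offset can step back to the start of the current character. A minimal sketch (the name `char_start` is made up, and the input is assumed to be valid UTF-8):

```python
def char_start(data: bytes, i: int) -> int:
    """Return the index of the first byte of the character containing data[i]."""
    while i > 0 and (data[i] & 0xC0) == 0x80:   # 10xxxxxx: continuation byte
        i -= 1
    return i

data = "AЯ🍣".encode("utf-8")   # 41 | d0 af | f0 9f 8d a3
starts = [char_start(data, i) for i in range(len(data))]
print(starts)  # [0, 1, 1, 3, 3, 3, 3]
```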
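  25. UTF-8: overlong encodings • Invalid UTF-8: 0xc0 0xbc • 0xc0 0xbc: 11000000 10111100 • → 00000111100(2) = 0x3c / U+003C "<" • "<" is ASCII, so in UTF-8 it must be the single byte 0x3c …… • Likewise 0xe0 0x80 0xbc = 11100000 10000000 10111100 → 0000000000111100(2) = 0x3C • Padding the high bits with 0 to get a longer form is invalid UTF-8 • → A decoder must reject it • e.g. XSS: a filter that escapes "<" (0x3c) in the (UTF-8) input to "&lt;" lets 0xc0 0xbc through, and a later conversion to UTF-16LE for the DB turns it into "<"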
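To see why overlong forms are dangerous, the sketch below contrasts a deliberately naive bit-concatenating decoder with Python's strict UTF-8 codec. Both helper names are hypothetical, and `naive_decode` is intentionally wrong in the way the slide warns about.

```python
def naive_decode(data: bytes) -> int:
    """Concatenate payload bits without checking for overlong forms (unsafe)."""
    lead_mask = {2: 0x1F, 3: 0x0F, 4: 0x07}[len(data)]
    cp = data[0] & lead_mask
    for b in data[1:]:
        cp = (cp << 6) | (b & 0x3F)
    return cp

def is_valid_utf8(data: bytes) -> bool:
    """Use Python's strict decoder, which rejects overlong sequences."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

overlong_2 = bytes([0xC0, 0xBC])
overlong_3 = bytes([0xE0, 0x80, 0xBC])

print(chr(naive_decode(overlong_2)), chr(naive_decode(overlong_3)))  # < <
print(is_valid_utf8(overlong_2), is_valid_utf8(overlong_3))          # False False
```

A filter that scans raw bytes for 0x3c never sees these sequences, so validation has to happen before any filtering or conversion.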
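  26. UTF-8: surrogates • Surrogate code points (U+D800 to U+DFFF) must not appear in UTF-8 • U+10000 ("<" "/" ) • Encoding a code point from U+10000 up not as 4 bytes but as 3 bytes × 2 is a UTF-8 variant (CESU-8) • e.g. Ver. 8 Oracle Database • The 5- / 6-byte forms come from ISO/IEC 10646 (UCS) • UCS is now limited to 0x10ffff as well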
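Python's strict UTF-8 codec refuses surrogate code points in both directions. The `surrogatepass` error handler can be used to produce the 3 bytes × 2 form for inspection; this is only an illustration of the byte layout, not a full CESU-8 implementation, and the variable names are made up.

```python
pair = "\ud83c\udf63"            # two separate surrogate code points, not 🍣

# Strict UTF-8 encoding rejects surrogates.
try:
    pair.encode("utf-8")
    strict_encode_ok = True
except UnicodeEncodeError:
    strict_encode_ok = False

# surrogatepass encodes each surrogate as its own 3-byte sequence (6 bytes total).
six_bytes = pair.encode("utf-8", "surrogatepass")
four_bytes = "🍣".encode("utf-8")   # the proper 4-byte form

# Strict decoding rejects the 6-byte form.
try:
    six_bytes.decode("utf-8")
    strict_decode_ok = True
except UnicodeDecodeError:
    strict_decode_ok = False

print(len(six_bytes), len(four_bytes), strict_encode_ok, strict_decode_ok)
```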
  27. UTF-8: • →
  28. Aside: JSON and Unicode • JSON text is Unicode • A JSON text starts with { or another ASCII character, so the first 4 bytes reveal the encoding • "{" is U+007B; otherwise look for a BOM • UTF-8: 0x7b (not followed by 0x00) / UTF-8 (BOM): 0xEF 0xBB 0xBF • UTF-16LE: 0x7b 0x00 / UTF-16 (BOM) LE: 0xFF 0xFE • UTF-16BE: 0x00 0x7b / UTF-16 (BOM) BE: 0xFE 0xFF • UTF-32LE: 0x7b 0x00 0x00 0x00 / UTF-32 (BOM) LE: 0xFF 0xFE 0x00 0x00 • UTF-32BE: 0x00 0x00 0x00 0x7b / UTF-32 (BOM) BE: 0x00 0x00 0xFE 0xFF • (Note: RFC 8259 requires UTF-8)
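The byte patterns listed on this slide translate into a small detection routine. This is a sketch under the slide's assumption that the first character is ASCII; `detect_json_encoding` is a hypothetical helper that returns Python codec names, and it is not how any particular JSON library works.

```python
import codecs
import json

def detect_json_encoding(data: bytes) -> str:
    """Guess a Python codec name from the first bytes of a JSON text."""
    # BOMs first; UTF-32 before UTF-16 because FF FE 00 00 also starts with FF FE.
    if data.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        return "utf-32"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    # No BOM: the first character is ASCII, so check where the zero bytes fall.
    if len(data) >= 4:
        if data[:3] == b"\x00\x00\x00":
            return "utf-32-be"
        if data[1:4] == b"\x00\x00\x00":
            return "utf-32-le"
    if len(data) >= 2:
        if data[0] == 0:
            return "utf-16-be"
        if data[1] == 0:
            return "utf-16-le"
    return "utf-8"

raw = '{"sushi": "🍣"}'.encode("utf-16-le")
encoding = detect_json_encoding(raw)
print(encoding, json.loads(raw.decode(encoding)))
```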
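  29. Aside: JSON and Unicode (continued) • JSON strings can escape Unicode characters as \uXXXX • This lets an ASCII-only JSON text carry any Unicode character • \ + u + (4 hexadecimal digits) • Only 4 hexadecimal digits • So what about 🍣 (U+1F363)? • → Write it as a surrogate pair • {"sushi": "\ud83c\udf63"} • _(´ཀ` ∠)_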
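Python's standard `json` module shows this escaping behaviour directly: with the default `ensure_ascii=True`, characters outside the BMP are written as a surrogate pair of `\uXXXX` escapes. A short sketch:

```python
import json

doc = {"sushi": "🍣"}

escaped = json.dumps(doc)                      # ASCII-only output
literal = json.dumps(doc, ensure_ascii=False)  # keeps 🍣 as-is

print(escaped)   # {"sushi": "\ud83c\udf63"}
print(literal)   # {"sushi": "🍣"}

# Both forms parse back to the same single character.
round_trip = json.loads(escaped)
```

The parser recombines the two escapes into one code point, so the round-tripped string has length 1 in Python.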
  30. ( ) • 🔊 • ☝ Unicode and UTF • ♊ Surrogate pairs • 💣 BOM • 🔍 UTF-8 • 🕳 UTF-8 • 📎 Aside: JSON • 🔇 • 🗞 ( ) • 🛠 ( ) • NFD NFC NFKD NFKC • 🔗 ( ) • ZWJ (Family) • Skin tone • Yen mark, etc.