Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.
Upcoming SlideShare
What to Upload to SlideShare
What to Upload to SlideShare
Loading in …3
×
1 of 82

Unicode the hero or villain

1

Share

Download to read offline

Getting the Unicode text validation right

Related Books

Free with a 30 day trial from Scribd

See all

Related Audiobooks

Free with a 30 day trial from Scribd

See all

Unicode the hero or villain

  1. 1. Unicode: The hero or villain? Input Validation of free-form Unicode text in Web Applications Pawel Krawczyk
  2. 2. Unicode: The hero or villain? In application security since 90’s - pentesting, security architecture, SSDLC, DevSecOps Active developer Python, C, Java https://github.com/kravietz OWASP - SAML, PL/SQL, authentication cheatsheets WebCookies.org - web privacy and security scanner Immusec.com - competitive pentesting & incident response in UK About Contact pkrawczyk@immusec.com +44 7879 180015 https://www.linkedin.com/in/pawelkrawczyk Pawel Krawczyk
  3. 3. Definition of the problem
  4. 4. Unicode: The hero or villain? Free-form text validation Pawel Krawczyk
  5. 5. Unicode: The hero or villain? Free-form text validation Pawel Krawczyk
  6. 6. Unicode: The hero or villain? Free-form text validation Pawel Krawczyk
  7. 7. Unicode: The hero or villain? Free-form text validation Pawel Krawczyk
  8. 8. Unicode Primer
  9. 9. Author name her Official name: A-OGONEK “a letter in the Polish, Kashubian, Lithuanian, Creek, Navajo, Western Apache, Chiricahua, Osage, Hocąk, Mescalero, Gwich'in, Tutchone, and Elfdalian alphabets” (Wikipedia) The rise and fall of letter “Ą” Author name here This is the abstract title Ą
  10. 10. Unicode: The hero or villain? ASCII: just write “Ą” as “A” Pawel Krawczyk
  11. 11. Unicode: The hero or villain? ASCII: just write “Ą” as “A” Pawel Krawczyk (Pol.) KĄT = (Eng.) ANGLE (Pol.) KAT = (Eng.) HANGMAN Contextual guessing, confusion, misunderstandings, we had lots of fun on IRC back in 90’s...
  12. 12. Unicode: The hero or villain? Windows-1250: “Ą” is 0xa5 Pawel Krawczyk
  13. 13. Unicode: The hero or villain? ISO-8859-2: “Ą” is 0xa1 Pawel Krawczyk
  14. 14. Unicode: The hero or villain? To możliwe! Pawel Krawczyk Source: Wikipedia
  15. 15. Unicode: The hero or villain? Confused beyond diacritics Pawel Krawczyk
  16. 16. Unicode: The hero or villain? Confused beyond diacritics Pawel Krawczyk
  17. 17. Unicode: The hero or villain? Confusing Unicode Pawel Krawczyk Examples: PythonHosted.org, Luciano Ramalho “Fluent Python”
  18. 18. Unicode: The hero or villain? Confusing Unicode Pawel Krawczyk Examples: PythonHosted.org, Luciano Ramalho “Fluent Python”
  19. 19. Unicode: The hero or villain? Confusing Unicode Pawel Krawczyk Examples: PythonHosted.org, Luciano Ramalho “Fluent Python”
  20. 20. Unicode: The hero or villain? Confusing Unicode Pawel Krawczyk Examples: PythonHosted.org, Luciano Ramalho “Fluent Python”
  21. 21. Stay Calm and Unicode
  22. 22. Unicode: The hero or villain? Forget everything you’ve learned about pre-Unicode characters and strings* *including MBCS and UCS Rule #1 Pawel Krawczyk
  23. 23. Unicode: The hero or villain? Pawel Krawczyk Ą
  24. 24. Unicode: The hero or villain? Pawel Krawczyk Ą Character
  25. 25. Unicode: The hero or villain? Pawel Krawczyk Ą Character LATIN CAPITAL LETTER A WITH OGONEK
  26. 26. Unicode: The hero or villain? Pawel Krawczyk Ą Character LATIN CAPITAL LETTER A WITH OGONEK
  27. 27. Unicode: The hero or villain? Pawel Krawczyk Ą Character LATIN CAPITAL LETTER A WITH OGONEK Code point U+0104
  28. 28. Unicode: The hero or villain? Pawel Krawczyk Ą Character LATIN CAPITAL LETTER A WITH OGONEK Code point U+0104 “Character’s catalog number” This is not encoding
  29. 29. Unicode: The hero or villain? Pawel Krawczyk Ą Character LATIN CAPITAL LETTER A WITH OGONEK Code point U+0104 “Character’s catalog number” This is not encoding
  30. 30. Unicode: The hero or villain? Pawel Krawczyk Ą Character LATIN CAPITAL LETTER A WITH OGONEK Code point U+0104 “Character’s catalog number” This is not encoding!!!
  31. 31. Unicode: The hero or villain? Pawel Krawczyk Ą Character LATIN CAPITAL LETTER A WITH OGONEK Code point U+0104 “Character’s catalog number” This is encoding 0xC4 0x84
  32. 32. Unicode: The hero or villain? Pawel Krawczyk Ą Character LATIN CAPITAL LETTER A WITH OGONEK Code point U+0104 “Character’s catalog number” 0xC4 0x84 0x01 0x04
  33. 33. Unicode: The hero or villain? Pawel Krawczyk Ą Character LATIN CAPITAL LETTER A WITH OGONEK Code point U+0104 “Character’s catalog number” 0xC4 0x84 0x01 0x04 0x04 0x01
  34. 34. Unicode: The hero or villain? Pawel Krawczyk Ą Character LATIN CAPITAL LETTER A WITH OGONEK Code point U+0104 “Character’s catalog number” 0xC4 0x84 0x01 0x04 0x04 0x01 0x00 0x00 0x01 0x04
  35. 35. Unicode: The hero or villain? Pawel Krawczyk Ą Character LATIN CAPITAL LETTER A WITH OGONEK Code point U+0104 “Character’s catalog number” 0xC4 0x84 0x01 0x04 0x04 0x01 0x00 0x00 0x01 0x04 0x04 0x01 0x00 0x00
  36. 36. Unicode: The hero or villain? Pawel Krawczyk Ą Character LATIN CAPITAL LETTER A WITH OGONEK Code point U+0104 “Character’s catalog number” 0xC4 0x84 0x01 0x04 0x04 0x01 0x00 0x00 0x01 0x04 0x04 0x01 0x00 0x00 +AQQ- UTF-7
  37. 37. Unicode: The hero or villain? Pawel Krawczyk Ą Character LATIN CAPITAL LETTER A WITH OGONEK Code point U+0104 “Character’s catalog number” 0xC4 0x84 0x01 0x04 0x04 0x01 0x00 0x00 0x01 0x04 0x04 0x01 0x00 0x00 +AQQ- UTF-7 +ADw-script+AD4-alert(+ACc- xss+ACc-)+ADw-+AC8-script+AD4-
  38. 38. Homoglyphs
  39. 39. Unicode: The hero or villain? Pawel Krawczyk ՍnᎥⅽоԁе
  40. 40. Unicode: The hero or villain? Pawel Krawczyk ՍnᎥⅽоԁе ARMENIAN CAPITAL LETTER SEH
  41. 41. Unicode: The hero or villain? Pawel Krawczyk ՍnᎥⅽоԁе ARMENIAN CAPITAL LETTER SEH LATIN SMALL LETTER N
  42. 42. Unicode: The hero or villain? Pawel Krawczyk ՍnᎥⅽоԁе ARMENIAN CAPITAL LETTER SEH LATIN SMALL LETTER N CHEROKEE LETTER V
  43. 43. Unicode: The hero or villain? Pawel Krawczyk ՍnᎥⅽоԁе ARMENIAN CAPITAL LETTER SEH LATIN SMALL LETTER N CHEROKEE LETTER V SMALL ROMAN NUMERAL ONE HUNDRED
  44. 44. Unicode: The hero or villain? Pawel Krawczyk ՍnᎥⅽоԁе ARMENIAN CAPITAL LETTER SEH LATIN SMALL LETTER N CHEROKEE LETTER V SMALL ROMAN NUMERAL ONE HUNDRED CYRILLIC SMALL LETTER O
  45. 45. Unicode: The hero or villain? Pawel Krawczyk ՍnᎥⅽоԁе ARMENIAN CAPITAL LETTER SEH LATIN SMALL LETTER N CHEROKEE LETTER V SMALL ROMAN NUMERAL ONE HUNDRED CYRILLIC SMALL LETTER O CYRILLIC SMALL LETTER KOMI DE
  46. 46. Unicode: The hero or villain? Pawel Krawczyk ՍnᎥⅽоԁе ARMENIAN CAPITAL LETTER SEH LATIN SMALL LETTER N CHEROKEE LETTER V SMALL ROMAN NUMERAL ONE HUNDRED CYRILLIC SMALL LETTER O CYRILLIC SMALL LETTER KOMI DE CYRILLIC SMALL LETTER IE
  47. 47. Unicode: The hero or villain? Pawel Krawczyk ՍnᎥⅽоԁе
  48. 48. Surviving in the Unicode world
  49. 49. Unicode: The hero or villain? Inside your application think text composed of characters; forget about bytes Rule #2 Pawel Krawczyk
  50. 50. Unicode: The hero or villain? Decode bytes into text on input Encode text into bytes on output Rule #3 Pawel Krawczyk
  51. 51. Unicode: The hero or villain? Input decoding example Pawel Krawczyk Transport metadata Decoding Internal representation Transport data Text: Jolanta Kozak “Dziaberlak” (“Jabberwocky” by Lewis Carroll) Exceptions!
  52. 52. Unicode: The hero or villain? Text processing, persistence* and fun Pawel Krawczyk Example text processing * do not persist before watching this presentation till the end
  53. 53. Unicode: The hero or villain? Output encoding Pawel Krawczyk U+FEFF BYTE ORDER MARK (BOM) “Unicode signature”
  54. 54. When things go south
  55. 55. Unicode: The hero or villain? Wrong decoder Pawel Krawczyk ● Client told us so ● Client told us nothing, so we assumed so ● Client told us ‘utf-8’, but we ignored it and assumed ‘ascii’ because we have been writing this software since 1986
  56. 56. Unicode: The hero or villain? What to do - a policy decision! Pawel Krawczyk Reject incorrect information (fail closed) Partially lose information (fail open) Recover information (fail pretending it’s fine)
  57. 57. Unicode: The hero or villain? Policy decision Pawel Krawczyk Reject incorrect information (fail closed) Partially lose information (fail open) Recover information (fail pretending it’s fine)
  58. 58. Unicode: The hero or villain? Policy decision Pawel Krawczyk Reject incorrect information (fail closed) Partialy lose information (fail open) Recover information (fail pretending it’s fine)
  59. 59. Validation techniques
  60. 60. Unicode: The hero or villain? Character category enforcement Pawel Krawczyk
  61. 61. Unicode: The hero or villain? Character category enforcement Pawel Krawczyk Source: Unicode Standard Annex #44
  62. 62. Unicode: The hero or villain? Enforce Unicode categories Rule #5 Pawel Krawczyk
  63. 63. Unicode: The hero or villain? Character script enforcement Pawel Krawczyk Scripts used in the example: ● LATIN ● CYRILLIC ● CJK ● ARABIC
  64. 64. Unicode: The hero or villain? Enforce Unicode scripts Rule #6 Pawel Krawczyk
  65. 65. Unicode: The hero or villain? Text direction enforcement Pawel Krawczyk RIGHT-TO-LEFT OVERRIDE U+202E Visual spoof* *now prevented by many client programs
  66. 66. Unicode: The hero or villain? Text direction enforcement Pawel Krawczyk Source: Unicode Standard Annex #44
  67. 67. Unicode: The hero or villain? Enforce consistent text direction Rule #7 Pawel Krawczyk
  68. 68. Normalization
  69. 69. Unicode: The hero or villain? When café is not café? Pawel Krawczyk Single character U+00E9 Two U+000E and U+0301 characters combined on display
  70. 70. Unicode: The hero or villain? When it is a problem? Pawel Krawczyk ● Collation ● Sorting ● Comparison ● Persistence of data with different composition
  71. 71. Unicode: The hero or villain? Unicode normalization Pawel Krawczyk ● NFC = Normalization Form “C” ● Converts Unicode characters to a single, consistent form ● U+000E U+0301 and U+00E9 are Unicode “canonical equivalents”
  72. 72. Unicode: The hero or villain? NFC, NFD, NFKC, NFKD? Pawel Krawczyk ● NFC will compose, make shorter - é becomes U+00E9 ● NFD will decompose, make longer - é becomes U+000E U+0301 ● NFKC and NFKD will also replace “compatibility characters” ○ Possible loss of information!
  73. 73. Unicode: The hero or villain? NFKC, NFKD Pawel Krawczyk This is a significant loss of information! Source: Luciano Ramalho “Fluent Python”
  74. 74. Unicode: The hero or villain? Compatibility normalization Pawel Krawczyk ● Precomposed Roman numerals Ⅻ U+216B Ⅻ
  75. 75. Unicode: The hero or villain? Compatibility normalization Pawel Krawczyk ● Precomposed Roman numerals Ⅻ U+216B ● Ⅹ U+2169 Ⅰ U+2160 Ⅰ U+2160 (Roman numerals) Ⅻ
  76. 76. Unicode: The hero or villain? Compatibility normalization Pawel Krawczyk ● Precomposed Roman numerals Ⅻ U+216B ● Ⅹ U+2169 Ⅰ U+2160 Ⅰ U+2160 (Roman numerals) ● X U+0058 I U+0049 I U+0049 (Latin letters) Another typical example: ● ¼ U+00BC → 1 U+0031 ⁄ U+2044 4 U+0034 Ⅻ
  77. 77. Unicode: The hero or villain? NFKC, NFKD Pawel Krawczyk NKFC replaces single Ⅻ U+216B ROMAN NUMERAL TWELVE character by three Roman digits represented by Latin letters X and I Useful for search and comparison ● is there “X” in “XII”? ● is there “f” in “ffi”? Many typesetting programs will replace popular “compatibility sequences” by appropriate Unicode characters: --> replaced by → U+2192 RIGHTWARDS ARROW
  78. 78. Unicode: The hero or villain? Normalize Unicode text Rule #8 Pawel Krawczyk
  79. 79. Validation strategies
  80. 80. Unicode: The hero or villain? Policy decisions Pawel Krawczyk Let’s talk about your user base for a moment… ● Are they expected to communicate in Chinese, English, Polish, Arabic…? ● Do we expect text in Linear-B, Ugaritic, Klingon? ○ Can you process data in any of these languages? ● What kind of text is expected where? ● E.g. names - are they composed of letters only (“Portia Sutcliffe”) ○ Or maybe we also expect digits and punctuation (“Cynthia O'Keefe”, “Howard Upperton-Wildingham III”) ● Define what is valid text ● Define appropriate valid categories, scripts and text directions ● Normalize Unicode input prior to validation
  81. 81. Unicode: The hero or villain? Policy decisions Pawel Krawczyk Let’s talk about your user base for a moment… ● Are they expected to communicate in Chinese, English, Polish, Arabic…? ● Do we expect text in Linear-B, Ugaritic, Klingon? ○ Can you process data in any of these languages? ● What kind of text is expected where? ● E.g. names - are they composed of letters only (“Portia Sutcliffe”) ○ Or maybe we also expect digits and punctuation (“Cynthia O'Keefe”, “Howard Upperton-Wildingham III”) ● Define what is valid text ● Define appropriate valid categories, scripts and text directions ● Normalize Unicode input prior to validation ● Do we have free-text fields? ○ If yes, how free is the “free text”? ○ Letters, digits, punctuation? ○ Symbols? ■ Because if someone explains “I clicked File > Properties > General” it takes symbols ● Is all this part of localization for given region? ○ Along with database collation, currency, numbers...
  82. 82. Unicode: The hero or villain? Questions? Pawel Krawczyk pawel.krawczyk@hush.com +44 7879 180015 https://www.linkedin.com/in/pawelkrawczyk/ 🤔

×