Unicode (and Python)
An introduction to Unicode and its processing in Python.

Usage Rights: CC Attribution-ShareAlike License


Unicode (and Python) Presentation Transcript

  • 1. Unicode (and Python) Juan Manuel Gimeno Illa jmgimeno@diei.udl.cat November 2008 J.M.Gimeno (jmgimeno@diei.udl.cat) Unicode November 2008 1 / 21
  • 2. Outline
      1. Before Unicode
      2. Unicode
           - Unicode Concepts
           - Encodings
      3. Python’s Unicode Support
           - Unicode String Type
           - Source Code Encoding
      4. Bibliography
  • 3. Before Unicode
      In the beginning, computing was mainly centered in North America and done in English. Characters were stored one per byte, using either:
        - ASCII (7 bits)
        - EBCDIC (8 bits)
      In other parts of the world, different ways of storing the local characters were invented:
        - Japan: various flavours of JIS encodings
        - Russia: KOI8
        - India: the ISCII standard
      There were also some proprietary encodings defined by operating system vendors.
  • 11. Before Unicode: ISO-8859-*
      For the huge number of people in America, Europe, and the Middle East who use relatively small alphabets, there was ISO-8859, which:
        - left ASCII as ASCII (range 0 to 127)
        - used the range 128 through 255 for different purposes, depending on the part:
            1-4   Different accented characters (e.g. Latin-1)
            5     Cyrillic
            6     Arabic
            7     Greek
            8     Hebrew
            9     Turkish
            10    Nordic languages
      But only one part could be in use at a time, so one couldn’t easily mix Greek and Cyrillic in the same file.
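
To make the "one part at a time" limitation concrete, here is a minimal sketch. It assumes Python 3 (the slides date from the Python 2 era) and an arbitrarily chosen sample byte, 0xE4, and decodes that same byte under three ISO-8859 parts using the standard library codecs.

```python
# The same byte value means a different character in each ISO-8859 part.
raw = bytes([0xE4])  # arbitrary sample byte in the 128-255 range (our choice)

decoded = {
    codec: raw.decode(codec)
    for codec in ("iso8859-1", "iso8859-5", "iso8859-7")
}

for codec, char in decoded.items():
    # Show which character, and which Unicode code point, the byte maps to.
    print(f"{codec}: {char!r} (U+{ord(char):04X})")
```

Nothing in the byte itself says which part was intended, so a file mixing Greek and Cyrillic would need out-of-band information for every run of text.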
  • 17. Before Unicode: Houston, Houston, ...
      Clearly this was a very unsatisfactory situation.
      ISO-2022 provided a partial solution:
        - it allowed shifting between encodings in the middle of a string
        - it was difficult to use, so it wasn’t widespread
      What was needed was a universal way to refer to all the different characters in all the alphabets:
        - ISO/IEC 10646
        - Unicode
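
The in-band shifting that ISO-2022 relies on can be observed with Python's `iso2022_jp` codec. This is only an illustrative sketch, assuming Python 3; the sample text is our own and is not taken from the slides.

```python
# ISO-2022-JP switches character sets in-band using escape sequences.
text = "abc \u65e5\u672c\u8a9e abc"  # ASCII, three kanji, ASCII
encoded = text.encode("iso2022_jp")

# ESC $ B selects JIS X 0208; ESC ( B switches back to ASCII.
print(encoded)

# Decoding has to track the current shift state to recover the text.
roundtrip = encoded.decode("iso2022_jp")
```

Because the meaning of a byte depends on the most recent escape sequence, operations such as slicing or searching the encoded bytes are error-prone, which is one reason such schemes are awkward to work with.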
  • 24. Unicode / Unicode Concepts: Unicode’s Solution
        - One encoding for all scripts of the world
        - ASCII compatibility (even Latin-1)
        - Includes character metadata
            - Case mapping information
            - Character category information
        - Accounts for scripts using different orientations
        - Enables sorting and normalization support
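
Several of these features are visible from Python's standard `unicodedata` module. The following is a small sketch, assuming Python 3; the sample characters are our own choices.

```python
import unicodedata

# Case mapping is not restricted to ASCII letters.
upper_zhe = "\u0436".upper()  # CYRILLIC SMALL LETTER ZHE -> capital form

# General category: two-letter codes such as Lu (uppercase letter)
# or Nd (decimal digit).
cat_upper = unicodedata.category("A")
cat_digit = unicodedata.category("\u0663")  # ARABIC-INDIC DIGIT THREE

# Bidirectional class: "L" is left-to-right, "R" is right-to-left.
bidi_latin = unicodedata.bidirectional("A")
bidi_hebrew = unicodedata.bidirectional("\u05d0")  # HEBREW LETTER ALEF

# Normalization: composed and decomposed spellings compare equal after NFC.
same = unicodedata.normalize("NFC", "e\u0301") == "\u00e9"

print(upper_zhe, cat_upper, cat_digit, bidi_latin, bidi_hebrew, same)
```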
  • 31. Unicode / Unicode Concepts: Unicode’s Terminology
      Grapheme: what users regard as a character
        - André
      Code points: a Unicode encoding of the string
        - Andr + U+00E9 (LATIN SMALL LETTER E WITH ACUTE)
        - Andre + U+0301 (COMBINING ACUTE ACCENT)
      Code units: what the implementation stores (e.g. in UTF-8)
        - Andre + 0xCC 0x81
      This can be explored in Linux using the program gucharmap.
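
The three levels can be told apart in a few lines of code. This sketch assumes Python 3, where a `str` is a sequence of code points; the variable names are ours.

```python
import unicodedata

composed = "Andr\u00e9"     # e with acute as a single code point
decomposed = "Andre\u0301"  # plain e followed by a combining accent

# Same graphemes on screen, but a different number of code points.
print(len(composed), len(decomposed))

# Code units depend on the encoding; UTF-8 uses one-byte units.
utf8_composed = composed.encode("utf-8")
utf8_decomposed = decomposed.encode("utf-8")
print(utf8_composed, utf8_decomposed)

# The two spellings differ as strings until they are normalized.
equal_raw = composed == decomposed
equal_after_nfc = unicodedata.normalize("NFC", decomposed) == composed
```

Counting graphemes, as opposed to code points, is not covered by `len` and needs the segmentation rules from the Unicode standard, which the standard library does not expose directly.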
  • 36. Unicode / Unicode Concepts: Unicode Organization
      Unicode currently (as of 2008) defines just under 100,000 code points, but it has space for up to 1,114,112.
      They are organized into 17 planes of 2^16 = 65,536 characters, numbered 0 to 16.
      Plane 0 is called the Basic Multilingual Plane (BMP) and contains pretty well everything useful.
      The characters in the BMP are laid out more or less West to East:
        - ASCII characters from 0 to 127
        - Latin-1 characters from 128 to 255
        - then moving East in Europe (Greek, Cyrillic)
        - next the Middle East (Arabic, Hebrew)
        - then the Indus (scripts of India)
        - next Southeast Asia (Thai, Lao and so on)
        - and ending with China, Japan and Korea
      Planes 1 to 16 are sometimes called astral planes; they include exotic, rare and historically important characters (Old Italic, Byzantine musical symbols, etc.).
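
The plane arithmetic can be checked directly. A minimal sketch, assuming Python 3.3 or later (where `sys.maxunicode` is always 0x10FFFF); `plane_of` is a hypothetical helper name introduced here for illustration.

```python
import sys

PLANE_SIZE = 2 ** 16  # 65536 code points per plane
PLANE_COUNT = 17      # planes 0 to 16


def plane_of(char):
    """Return the plane number of a single character (hypothetical helper)."""
    return ord(char) // PLANE_SIZE


# Total capacity; the largest code point is one less than this.
capacity = PLANE_SIZE * PLANE_COUNT
print(capacity, hex(sys.maxunicode))

for ch in ("A", "\u0416", "\u4e2d", "\U00010346"):
    print(f"U+{ord(ch):04X} is in plane {plane_of(ch)}")
```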
  • 48. Unicode / Unicode Concepts: Code Points
      Each code point (“character”) gets a number and a name.
      The number is usually given in hexadecimal and prefixed by U+. (Note that it is not a 16-bit number, because of the astral planes!)
      Unicode includes tables with useful character properties (metadata), such as:
        - this is a number
        - this is uppercase
        - this is punctuation
      The standard also provides:
        - a helpful picture of a reasonably typical rendition
        - rules for line-breaking, hyphenation and sorting
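
Numbers, names and a few of the properties can be looked up from Python. This is a sketch assuming Python 3; `describe` is a hypothetical helper written for this example, not part of any library.

```python
import unicodedata


def describe(char):
    """Format a character as 'U+XXXX NAME' (hypothetical helper)."""
    return f"U+{ord(char):04X} {unicodedata.name(char)}"


print(describe("&"))
print(describe("\U00010346"))  # needs more than four hex digits

# Names also work in the other direction.
zhe = unicodedata.lookup("CYRILLIC CAPITAL LETTER ZHE")
same_zhe = "\N{CYRILLIC CAPITAL LETTER ZHE}"

# A few properties: is a number, is uppercase, is punctuation (category Po).
props = ("7".isdigit(), "Z".isupper(), unicodedata.category("!"))
print(props)
```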
  • 60. Unicode / Encodings
      Along with the code points, Unicode also defines methods for storing them as byte sequences in a computer.
      There are three approaches, named UTF-8, UTF-16 and UTF-32.
      UTF stands for Unicode Transformation Format or UCS Transformation Format, where UCS stands for Universal Character Set.
      The characters we will use in the explanations are:
        Number             Name                          Plane
        U+0026  (38)       AMPERSAND                     BMP
        U+0416  (1046)     CYRILLIC CAPITAL LETTER ZHE   BMP
        U+4E2D  (20013)    CJK UNIFIED IDEOGRAPH-4E2D    BMP
        U+10346 (66374)    GOTHIC LETTER FAIHU           Astral
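As a quick comparison of the three approaches, this sketch (Python 3 syntax) encodes the four sample characters and reports how many bytes each needs. The names `SAMPLES` and `encoded_sizes` are illustrative only.

```python
SAMPLES = ["&", "\u0416", "\u4e2d", "\U00010346"]

def encoded_sizes(ch):
    """Return the number of bytes ch needs in UTF-8, UTF-16 and UTF-32."""
    # The -be variants are used so that no byte order mark is added to the output.
    return tuple(len(ch.encode(enc)) for enc in ("utf-8", "utf-16-be", "utf-32-be"))

for ch in SAMPLES:
    print("U+%04X" % ord(ch), encoded_sizes(ch))
```

UTF-32 always needs four bytes, UTF-16 needs two or four, and UTF-8 needs between one and four.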
  • 64. Unicode / Encodings / UTF-32
      The simplest way to store characters: use 32 bits (4 bytes) for each character.
      So we store 38, 1046, 20013 and 66374 as 32-bit integers.
        - For Latin-1 characters it wastes a lot of space.
        - Problems with C strings, because most bytes are zero (use wchar_t).
      There are several ways of laying out a 4-byte integer in 4 bytes (remember big-endian and little-endian?), so if you send one of these integers to a machine that uses a different ordering, problems occur.
      Solutions:
        - Explicitness: the UTF-32BE and UTF-32LE encodings.
        - Byte Order Mark (BOM): the character U+FEFF (ZERO WIDTH NO-BREAK SPACE), together with the guarantee that U+FFFE will never be a character.
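The byte-order issue and both solutions can be seen with Python's built-in codecs. This is a small sketch in Python 3 syntax; it only relies on the standard `codecs` and `struct` modules.

```python
import codecs
import struct

ch = "\u0416"  # CYRILLIC CAPITAL LETTER ZHE, code point 1046 = 0x0416

# Explicit byte order: the same 32-bit integer, laid out in two different ways.
big = ch.encode("utf-32-be")
little = ch.encode("utf-32-le")
print(big.hex(), little.hex())

# Each is simply the code point packed as an unsigned 32-bit integer.
assert big == struct.pack(">I", ord(ch))
assert little == struct.pack("<I", ord(ch))

# With a BOM in front, the generic "utf-32" codec works out the byte order itself.
assert (codecs.BOM_UTF32_BE + big).decode("utf-32") == ch
assert (codecs.BOM_UTF32_LE + little).decode("utf-32") == ch
```

The BOM works because U+FEFF read with the wrong byte order would be U+FFFE, which is guaranteed never to be a character.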
  • 73. Unicode / Encodings / UTF-16
      UTF-16 stores Unicode characters in 16-bit chunks:
        - all the BMP characters appear as themselves
        - some trickery is needed for the astral plane ones
      There are two blocks of code points in the BMP called surrogate blocks:
        - high surrogates, from U+D800 to U+DBFF
        - low surrogates, from U+DC00 to U+DFFF
      Astral plane characters are split into two 16-bit units:
        - first, 0x10000 = 2^16 is subtracted from the code point
        - next, the remaining 20 bits are split: the high ten bits go into the high surrogate and the low ten bits into the low surrogate
      This gives 20 bits, or 2^20 characters, which exactly fits the 16 = 2^4 astral planes with 2^16 characters each.
      So U+10346 is represented as the 16-bit integers 0xD800 0xDF46.
      It also has byte-ordering problems, hence UTF-16BE, UTF-16LE or the use of the BOM.
      Nightmare in C: embedded zeros, and not always the same size as wchar_t.
      A compact way to store East Asian text: most CJK characters are in the BMP and take 2 bytes each, against 3 in UTF-8.
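The surrogate arithmetic can be written out directly. This is a sketch in Python 3 syntax; `to_surrogates` and `from_surrogates` are names invented for this example, not library functions.

```python
def to_surrogates(cp):
    """Split an astral code point into its (high, low) UTF-16 surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not an astral-plane code point")
    offset = cp - 0x10000             # a 20-bit value
    high = 0xD800 + (offset >> 10)    # the high ten bits
    low = 0xDC00 + (offset & 0x3FF)   # the low ten bits
    return high, low

def from_surrogates(high, low):
    """Rebuild the code point from a (high, low) surrogate pair."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

print([hex(unit) for unit in to_surrogates(0x10346)])
```

For U+10346 the offset is 0x0346, so the high ten bits are all zero and the pair is 0xD800 0xDF46, which matches what Python's own `utf-16-be` codec produces.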
  • 87. Unicode / Encodings / UTF-8
      UTF-8 was invented by Ken Thompson on September 2, 1992, on a placemat in a New Jersey diner with Rob Pike.
      It works like this:
        - characters whose value is less than 128 (ASCII) are encoded as themselves in one byte
        - the rest have their bits ripped apart and dealt out into several (from two to four) bytes, as follows:
            - the first byte has a run of high-order one bits telling how many bytes are used to encode the character, followed by a zero bit
            - each of the remaining bytes begins with a single one bit followed by a zero bit
            - the bits of the character are dealt out in the space left over after these signalling bits
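The dealing-out of bits can be sketched by hand as follows (Python 3 syntax). The function `utf8_encode` is written for illustration only; unlike the built-in codec, it does not reject surrogate code points, and real code should simply call `str.encode("utf-8")`.

```python
def utf8_encode(cp):
    """Encode one code point as UTF-8 bytes, following the bit-dealing rules."""
    if cp < 0x80:                      # ASCII: encoded as itself
        return bytes([cp])
    if cp < 0x800:                     # two bytes: 110yyyyy 10zzzzzz
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:                   # three bytes: 1110xxxx 10yyyyyy 10zzzzzz
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    if cp <= 0x10FFFF:                 # four bytes: 11110www 10xxxxxx 10yyyyyy 10zzzzzz
        return bytes([0xF0 | (cp >> 18),
                      0x80 | ((cp >> 12) & 0x3F),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    raise ValueError("code point out of range")

for cp in (0x26, 0x416, 0x4E2D, 0x10346):
    print("U+%04X" % cp, utf8_encode(cp).hex())
```

Each continuation byte carries six payload bits (mask 0x3F) behind the `10` marker, and the lead byte carries whatever bits remain.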
  • 94. Unicode / Encodings / UTF-8
      The following table summarizes the rules:
        Hex range        Binary                        UTF-8
        000000–00007F    0zzzzzzz                      0zzzzzzz
        000080–0007FF    00000yyy yyzzzzzz             110yyyyy 10zzzzzz
        000800–00FFFF    xxxxyyyy yyzzzzzz             1110xxxx 10yyyyyy 10zzzzzz
        010000–10FFFF    000wwwxx xxxxyyyy yyzzzzzz    11110www 10xxxxxx 10yyyyyy 10zzzzzz
      Our examples result in:
        Character    Binary                        UTF-8
        U+0026       00100110                      00100110
        U+0416       00000100 00010110             11010000 10010110
        U+4E2D       01001110 00101101             11100100 10111000 10101101
        U+10346      00000001 00000011 01000110    11110000 10010000 10001101 10000110
      Using hexadecimal:
        Character    Hexadecimal
        U+0026       0x26
        U+0416       0xD0 0x96
        U+4E2D       0xE4 0xB8 0xAD
        U+10346      0xF0 0x90 0x8D 0x86
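The rows for the example characters can be reproduced with Python's built-in UTF-8 codec. This is a short sketch in Python 3 syntax; `utf8_bits` and `utf8_hex` are helper names made up here.

```python
def utf8_bits(ch):
    """Return the UTF-8 encoding of ch as space-separated 8-bit binary groups."""
    return " ".join(format(byte, "08b") for byte in ch.encode("utf-8"))

def utf8_hex(ch):
    """Return the UTF-8 encoding of ch as space-separated 0x-prefixed hex bytes."""
    return " ".join("0x%02X" % byte for byte in ch.encode("utf-8"))

for ch in ["&", "\u0416", "\u4e2d", "\U00010346"]:
    print("U+%04X" % ord(ch), utf8_bits(ch), utf8_hex(ch), sep="   ")
```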
  • 100. Unicode Encodings UTF-8
      - UTF-8 is biased towards Western scripts:
          - English speakers get one byte per character
          - most people west of the Indus river get away with two bytes
          - India and points east need three bytes per character
      - Processing UTF-8 characters sequentially is about as efficient as in any other encoding
      - But you can't easily index into a buffer (the same is true of UTF-16); to reach the n-th character you have to either
          - count characters from the start, or
          - keep an array of character positions
      - UTF-8 has no embedded zero bytes (other than U+0000 itself), so some C routines that expect NUL-terminated strings still work
      - No byte-ordering problems
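These properties can be observed directly on encoded bytes. The sketch below assumes Python 3; the helper names `char_count` and `char_offsets` are hypothetical and rely only on the fact that UTF-8 continuation bytes have the form 10xxxxxx.

```python
def char_count(data):
    """Count characters by skipping continuation bytes (10xxxxxx)."""
    return sum(1 for byte in data if (byte & 0xC0) != 0x80)


def char_offsets(data):
    """Build the 'array of positions': the byte offset where each character starts."""
    return [i for i, byte in enumerate(data) if (byte & 0xC0) != 0x80]


text = "\u0026\u0416\u4e2d\U00010346"
data = text.encode("utf-8")

# One, two, three and four bytes respectively.
sizes = [len(ch.encode("utf-8")) for ch in text]

count = char_count(data)
offsets = char_offsets(data)

# No zero byte appears unless the text itself contains U+0000.
has_zero_byte = 0 in data

# UTF-16 comes in two byte orders; UTF-8 has only one byte sequence.
utf16_differs = text.encode("utf-16-le") != text.encode("utf-16-be")
```

Both helpers scan the whole buffer, which is the cost the slide refers to: reaching the n-th character takes time proportional to its position unless the offsets are stored.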
  • 110. Python’s Unicode Support Unicode String Type Python’s Unicode type
      - Python has a built-in Unicode type
      - Unicode string literals have the same syntax as normal ones, with a u or U prefix before the quotes (e.g. u"This is Unicode")
      - Unicode literals can include the escape sequence \uXXXX to denote code point U+XXXX and \UXXXXXXXX for U+XXXXXXXX (e.g. u"\u0026\u0416\u4e2d\U00010346")
      - Unicode characters can be named using the escape sequence \N{name} (e.g. u"\N{Ampersand}")
      - unichr(i) returns a Unicode string containing the single character with code point i (the inverse is ord)
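A short sketch of these literals, assuming Python 3 rather than the Python 2 described on the slide: there every `str` is a Unicode string, the `u` prefix is optional (accepted again since Python 3.3), and `unichr` has become `chr`.

```python
# The same four characters used in the UTF-8 examples.
s = u"\u0026\u0416\u4e2d\U00010346"

# A character referred to by its Unicode name.
ampersand = "\N{AMPERSAND}"

# chr() is the Python 3 counterpart of unichr(); ord() is its inverse.
zhe = chr(0x0416)
zhe_code_point = ord(zhe)

# Code points of every character in the literal.
code_points = [ord(ch) for ch in s]
```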
  • 115. Python’s Unicode Support Unicode String Type Encoding and Decoding
      - You can convert between plain string objects (bytes) and Unicode string objects by means of a codec
      - s.encode([encoding[, errors]]): returns a plain string encoded from the (plain or Unicode) string s using the given encoding (for example 'ascii', 'latin-1', 'utf-8') and error handling ('strict', 'replace' or 'ignore'; the default is 'strict')
      - s.decode([encoding[, errors]]): returns a Unicode string decoded from the plain string s using the given encoding and error handling. This is the same as unicode(s[, encoding[, errors]])
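The round trip and the three error handlers can be sketched as follows, assuming Python 3, where the plain string of the slide corresponds to `bytes`, the Unicode string to `str`, and `encode` exists only on `str` and `decode` only on `bytes`.

```python
text = "\u0416"  # CYRILLIC CAPITAL LETTER ZHE

# Unicode string -> bytes -> Unicode string.
utf8_bytes = text.encode("utf-8")
decoded = utf8_bytes.decode("utf-8")

# 'strict' (the default) raises when a character has no representation.
try:
    text.encode("ascii")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# 'replace' substitutes a question mark, 'ignore' drops the character.
replaced = ("A" + text).encode("ascii", "replace")
ignored = ("A" + text).encode("ascii", "ignore")
```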
  • 121. Python’s Unicode Support Unicode String Type Modules related to Unicode
      - The codecs module: defines base classes for standard Python codecs (encoders and decoders) and provides access to the internal Python codec registry, which manages the codec and error handling look-up process
          - codecs.open(filename, mode[, encoding[, errors[, buffering]]]): opens an encoded file using the given mode and returns a wrapped version providing transparent encoding/decoding
          - codecs.EncodedFile(file, input[, output[, errors]]): returns a wrapped version of file which provides transparent encoding translation
      - The unicodedata module: supplies easy access to the Unicode Character Database
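The idea of a wrapper that encodes and decodes transparently can be illustrated without touching the file system. This sketch assumes Python 3 and uses `codecs.getwriter` and `codecs.getreader` on in-memory streams instead of `codecs.open`; on Python 3 the built-in `open(..., encoding=...)` is the more usual way to read encoded files.

```python
import codecs
import io
import unicodedata

text = "\u0026\u0416\u4e2d"

# Wrap a byte stream so that text written to it is encoded on the way in.
buffer = io.BytesIO()
writer = codecs.getwriter("utf-8")(buffer)
writer.write(text)
raw = buffer.getvalue()

# Wrap the raw bytes so that reads are decoded on the way out.
reader = codecs.getreader("utf-8")(io.BytesIO(raw))
round_tripped = reader.read()

# Look up each character's name in the Unicode Character Database.
names = [unicodedata.name(ch) for ch in text]
```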
  • 129. Python’s Unicode Support Source Code Encoding Source Code Encodings
      - By default, Python source must only contain characters from the ASCII set
      - But you are allowed to tell Python that you use a superset of ASCII
      - The additional characters can only appear in
          - comments
          - string literals
      - To accomplish this, put a comment like this on the first line of your source file (or on the second line, if the first is a shebang line):
            # -*- coding: latin-1 -*-
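The effect of the declaration can be seen by compiling source that is held as bytes. This is a sketch assuming Python 3, where the default source encoding is UTF-8 rather than ASCII; the file name `<demo>` and the variable names are placeholders.

```python
# Source code as raw bytes: 0xE9 is "é" in Latin-1 but is not valid UTF-8 here.
source = b"# -*- coding: latin-1 -*-\ns = '\xe9'\n"

# compile() honours the coding declaration when it is given bytes.
namespace = {}
exec(compile(source, "<demo>", "exec"), namespace)
value = namespace["s"]
```

Without the first line, the same bytes would not decode as UTF-8 and compilation would fail.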
  • 137. Bibliography
      - Kumar McMillan, Unicode In Python, Completely Demystified (talk recorded on video), PyCon 2008, Chicago
      - Tim Bray, On the Goodness of Unicode; Characters vs. Bytes
      - Roman Czyborra, Unicode’s Characters
      - Michael Foord, A Crash Course in Character Encoding
      - A.M. Kuchling, The Unicode-HOWTO
      - Markus Kuhn, UTF-8 and Unicode FAQ for Unix/Linux
      - Marc-André Lemburg, PEP-100: Python Unicode Integration; PEP-263: Defining Python Source Code Encodings; Developing Unicode-aware Applications in Python
      - Joel Spolsky, The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
      - Unicode Consortium, The Unicode Home Page
  • 138. License This work is licensed under the Creative Commons Attribution-ShareAlike 2.5 Spain license. To view a copy, visit http://creativecommons.org/licenses/by-sa/2.5/es/ or send a letter to Creative Commons, 559 Nathan Abbott Way, Stanford, California 94305, USA.