XML FOR DUMMIES Book authors: Lucinda Dykes and Ed Tittel Slides Prepared by Cong Tan Part 2: XML and The Web Chapter 6: Adding Character(s) to XML.
Contents: About Character Encodings. Introducing Unicode. Character Sets, Fonts, Scripts, and Glyphs. For Each Character, a Code. Key Character Sets. Using Unicode Characters. Finding Character Entity Information.
1. About Character Encodings. Clearly, the trend is toward longer bit strings to encode character data, so size does matter when representing character data. Here’s why: A 7-bit string can represent a maximum of 2^7, or 128, different characters… An 8-bit string can represent a maximum of 2^8, or 256, different characters, including everything a 7-bit encoding can handle, and leaves room for what some experts call higher-order characters. A 16-bit string can represent a maximum of 2^16, or 65,536, different characters. Some modern computers still use 8-bit encodings to represent most character data. Windows NT, Windows 2000, and Windows XP, however, use 16-bit encoding for internal representations of text, and most global solutions use 16-bit encoding to support all possible languages and characters.
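The arithmetic behind these limits can be checked with a short Python sketch. The function name `capacity` is an illustrative assumption and does not come from the book.

```python
def capacity(bits: int) -> int:
    """Return how many distinct values a bit string of the given width can hold."""
    return 2 ** bits


# Print the limits for the three widths discussed on the slide.
for width in (7, 8, 16):
    print(f"{width}-bit: {capacity(width):,} different characters")
```

Each extra bit doubles the number of available codes, which is why moving from 8 to 16 bits raises the limit from 256 to 65,536.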
2. Introducing Unicode. Today, Unicode defines just over 96,000 different character codes. It is the default character set used to encode all HTML documents on the Web. Many people, including numerous XML experts, refer to the XML character set as “Unicode”. Note that XML 1.0, 2nd Edition references Unicode 2.0 and 3.0, and XML 1.1 references Unicode 4.0, whereas the 1st Edition of XML 1.0 references only Unicode 2.0… For more information about Unicode characters, symbols, history, and the current standard, you can find a plethora of information at the Unicode Consortium’s Web site at www.unicode.org.
3. Character Sets, Fonts, Scripts, and Glyphs. To see what’s in XML scripts that 7- or 8-bit character encodings can’t cover (which means special symbols or non-Roman alphabets), you’ll need a few extra local ingredients: A character set that matches the script you’re trying to read and display. Software that understands the character set for the script. An electronic font that allows the character set to be displayed on screen. All these ingredients are necessary to work with alternate character sets. Character sets represent a mapping from a script to a set of corresponding numeric character codes. Fonts represent a collection of glyphs for the numeric character codes in a character set. Finally, to create text to match the alphabets used in a script, you need an input tool, such as a text or XML editor, that can work with the character set and its corresponding font.
4. For Each Character, a Code. In the Unicode/ISO 10646 character set, individual characters correspond to specific 16-bit numbers. Numeric entities take one of two forms, decimal or hexadecimal. For example: &#4096; <!-- &# indicates a decimal number --> &#xF00; <!-- &#x indicates a hexadecimal number --> Each numeric entity in XML has an associated text encoding. If some specific encoding is not defined in a numeric entity’s definition, the default is an encoding called UTF-8, which stands for Unicode Transformation Format, 8-bit form. UTF and UCS are mechanisms for implementing Unicode. UTF versions include UTF-32, UTF-16, UTF-8, UTF-EBCDIC, and UTF-7. UCS versions include UCS-4 and UCS-2. UTF-16 is used mainly for internal processing.
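To make the decimal and hexadecimal forms concrete, here is a minimal Python sketch that parses both kinds of reference and reports how many bytes each resulting character needs in several UTF encodings. The `<sample>` element and the helper name `encoded_sizes` are assumptions made for illustration; the two references correspond to the characters shown on the slide.

```python
import xml.etree.ElementTree as ET

# Hypothetical one-element document: one decimal and one hexadecimal reference.
doc = "<sample>&#4096;&#xF00;</sample>"

# The parser resolves both references to ordinary characters.
text = ET.fromstring(doc).text
code_points = [ord(ch) for ch in text]


def encoded_sizes(ch: str) -> dict:
    """Return the number of bytes one character occupies in each encoding."""
    return {enc: len(ch.encode(enc)) for enc in ("utf-8", "utf-16-be", "utf-32-be")}


for ch in text:
    print(f"U+{ord(ch):04X} (decimal {ord(ch)}):", encoded_sizes(ch))
```

Both forms name the same kind of thing, a Unicode code point; only the number base differs. The byte counts vary by encoding, which is the practical difference between the UTF variants listed above.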
5. Key Character Sets. Most computers today use some 8-bit variant of ASCII, a character set that handles the basic Roman alphabet used for English, along with punctuation, numbers, and simple symbols. Character sets for most European languages match standard ASCII values from 0 to 127 and go on from there to define alternate mappings between character codes and local script characters for values from 128 to 255. Non-Roman alphabets, such as Hebrew, Japanese, and Thai, depend on special character sets that include basic ASCII (0-127, or 0-255). A listing of character sets built around the ASCII framework appears in Table 6-1.
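The split between the shared 0-127 range and the variant-specific 128-255 range can be seen by decoding the same byte under several ISO-8859 variants. This is a small Python sketch; the chosen byte value, the list of variants, and the helper name `decode_in` are illustrative assumptions.

```python
# One byte from the upper (128-255) range and one from the ASCII (0-127) range.
HIGH_BYTE = bytes([0xE4])
ASCII_BYTE = b"A"

# A few ISO-8859 variants that Python's standard codecs support.
VARIANTS = ("iso-8859-1", "iso-8859-5", "iso-8859-7", "iso-8859-9")


def decode_in(variant: str, raw: bytes = HIGH_BYTE) -> str:
    """Interpret the raw bytes according to the given character set."""
    return raw.decode(variant)


for variant in VARIANTS:
    high = decode_in(variant)
    low = decode_in(variant, ASCII_BYTE)
    print(f"{variant}: 0x41 -> {low!r}, 0xE4 -> {high!r} (U+{ord(high):04X})")
```

The ASCII byte decodes to the same letter everywhere, while the high byte becomes a Latin, Cyrillic, or Greek letter depending on the variant. That ambiguity is why a document has to state which character set it uses.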
Table 6-1 shows that most character sets can render English and German, plus a collection of other languages. When choosing a variant of ISO-8859, remember that the variant must cover all the languages you want to include; otherwise, you must use Unicode. XML goes beyond such idiosyncratic or customized character sets and uses Unicode.
6. Using Unicode Characters. Many modern word processors handle Unicode as well; for instance, Word 97 and later versions support a format called encoded text that uses Unicode encoding. If you don’t already have access to such tools and want to save XML files in Unicode format, you must use a conversion tool. Several different tools, both freeware and commercial products, are available, depending on your OS. Widely used tools such as Netscape Navigator (version 4.1 or newer) and IE (version 5 or newer) can handle most ISO-8859 variants. If you want to use an alternate character encoding, you must identify that encoding in your XML document’s prolog as follows: <?xml version="1.0" encoding="ISO-8859-9"?> Note that XML parsers are required to support only UTF-8 and UTF-16 encodings, so the encoding attribute in an XML document prolog might not work with all such tools.
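As a rough idea of what a conversion tool does, the following Python sketch re-encodes an ISO-8859-9 document as UTF-8 and updates the declared encoding to match. The sample document, the `<city>` element, and the function name `to_utf8` are hypothetical; a real tool would need more careful handling of the prolog.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical source document: Turkish text stored as ISO-8859-9 bytes.
source_text = '<?xml version="1.0" encoding="ISO-8859-9"?><city>Diyarbakır</city>'
legacy_bytes = source_text.encode("iso-8859-9")


def to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Decode legacy bytes, rewrite the declared encoding, and re-encode as UTF-8."""
    text = raw.decode(source_encoding)
    # Only the first encoding="..." (the one in the prolog) is rewritten.
    text = re.sub(r'encoding="[^"]*"', 'encoding="UTF-8"', text, count=1)
    return text.encode("utf-8")


utf8_bytes = to_utf8(legacy_bytes, "iso-8859-9")

# Every conforming XML parser must accept UTF-8, so this parse is portable.
root = ET.fromstring(utf8_bytes)
print(root.text)
```

The declared encoding and the actual bytes must agree, so the sketch changes both together. Converting to UTF-8 avoids relying on optional parser support for ISO-8859 variants.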
7. Finding Character Entity Information. Resource: The Unicode Standard, version 4.0. You can also find plenty of encoding information online, for example: www.unicode.org/ucd/ You’ll also find the XHTML entity lists useful in this context: Latin-1: www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent Special: www.w3.org/TR/xhtml1/DTD/xhtml-special.ent Symbols: www.w3.org/TR/xhtml1/DTD/xhtml-symbol.ent
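Character information can also be looked up programmatically. The sketch below uses Python's standard `html.entities` table, which lists the HTML 4 entity names (assumed here to match the names in the XHTML lists above), together with `unicodedata`. The helper name `describe` is an illustrative assumption.

```python
import unicodedata
from html.entities import name2codepoint


def describe(entity_name: str) -> str:
    """Show a named entity with its decimal and hexadecimal numeric forms."""
    code_point = name2codepoint[entity_name]
    char = chr(code_point)
    return (
        f"&{entity_name}; = &#{code_point}; = &#x{code_point:X}; "
        f"({unicodedata.name(char)})"
    )


for name in ("eacute", "euro", "alpha"):
    print(describe(name))
```

Note that named entities such as these are defined by the XHTML DTDs, not by XML itself, so a plain XML document would use the numeric forms unless it declares the entities.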