XML For Dummies, Chapter 6: Adding Character(s) to XML


Published in: Technology

  1. XML FOR DUMMIES
     Book authors: Lucinda Dykes and Ed Tittel
     Slides prepared by Cong Tan
     Part 2: XML and the Web
     Chapter 6: Adding Character(s) to XML
  2. Contents
     • About Character Encodings
     • Introducing Unicode
     • Character Sets, Fonts, Scripts, and Glyphs
     • For Each Character, a Code
     • Key Character Sets
     • Using Unicode Characters
     • Finding Character Entity Information
  3. 1. About Character Encodings
     The trend is toward longer bit strings to encode character data, so size does matter when representing character data. Here’s why:
     • A 7-bit string can represent a maximum of 2^7, or 128, different characters…
     • An 8-bit string can represent a maximum of 2^8, or 256, different characters. That includes everything a 7-bit encoding can handle and leaves room for what some experts call higher-order characters.
     • A 16-bit string can represent a maximum of 2^16, or 65,536, different characters.
     • Some modern computers still use 8-bit encodings to represent most character data.
     • Windows NT, Windows 2000, and Windows XP, however, use 16-bit encoding for internal representations of text, and most global solutions use 16-bit encoding to support all possible languages and characters.
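The bit-width arithmetic on this slide can be checked with a few lines of Python. This is only an illustrative sketch; the helper name `capacity` is made up for the example and does not come from the book.

```python
def capacity(bits: int) -> int:
    """Number of distinct values an n-bit string can represent: 2**n."""
    return 2 ** bits

# The three widths discussed on the slide.
capacities = {bits: capacity(bits) for bits in (7, 8, 16)}

for bits, count in capacities.items():
    print(f"{bits}-bit: {count:,} distinct codes")
```

Each extra bit doubles the number of available codes, which is why moving from 8 to 16 bits jumps from 256 to 65,536.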
  4. 2. Introducing Unicode
     • Today, Unicode defines just over 96,000 different character codes.
     • It is the default character set used to encode HTML documents on the Web.
     • Many people, including numerous XML experts, refer to the XML character set as "Unicode".
     • Note that XML 1.0, 2nd Edition references Unicode 2.0 and 3.0, and XML 1.1 references Unicode 4.0, whereas the 1st Edition of XML 1.0 references only Unicode 2.0…
     • For more information about Unicode characters, symbols, history, and the current standard, see the Unicode Consortium’s Web site at www.unicode.org.
  5. 3. Character Sets, Fonts, Scripts, and Glyphs
     To view XML text in scripts that 7- or 8-bit character encodings can’t cover (which means special symbols or non-Roman alphabets), you’ll need a few extra local ingredients:
     • A character set that matches the script you’re trying to read and display.
     • Software that understands the character set for the script.
     • An electronic font that allows the character set to be displayed on screen.
     All these ingredients are necessary to work with alternate character sets.
     • Character sets represent a mapping from a script to a set of corresponding numeric character codes.
     • Fonts represent a collection of glyphs for the numeric character codes in a character set.
     • Finally, to create text to match the alphabets used in a script, you need an input tool, such as a text or XML editor, that can work with the character set and its corresponding font.
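To make the distinction between a character set and a font concrete, here is a small Python sketch that looks up the numeric code and official Unicode name for a few characters. The sample characters and the helper name `describe` are chosen for illustration only. The code shows the character-set half of the picture; whether the glyphs actually appear on screen still depends on the fonts installed locally.

```python
import unicodedata

def describe(ch: str) -> tuple[int, str]:
    """Return the numeric character code and the Unicode name of one character."""
    return ord(ch), unicodedata.name(ch)

# Latin A, Latin e with acute, Hebrew alef, Thai ko kai (written as escapes
# so the source file itself stays plain ASCII).
samples = ["A", "\u00e9", "\u05d0", "\u0e01"]

for ch in samples:
    code, name = describe(ch)
    print(f"U+{code:04X}  {name}")
```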
  6. 4. For Each Character, a Code
     • In the Unicode/ISO 10646 character set, individual characters correspond to specific numeric codes (16-bit numbers for the most commonly used characters).
     • Numeric entities take one of two forms, decimal or hexadecimal. For example:
         &#4096;
         <!-- &# indicates a decimal number -->
         &#xF00;
         <!-- &#x indicates a hexadecimal number -->
     • Each numeric entity in XML has an associated text encoding.
     • If some specific encoding is not defined in a numeric entity’s definition, the default is an encoding called UTF-8, which stands for Unicode Transformation Format, 8-bit form.
     • UTF and UCS are mechanisms for implementing Unicode.
     • UTF versions include UTF-32, UTF-16, UTF-8, UTF-EBCDIC, and UTF-7.
     • UCS versions include UCS-4 and UCS-2.
     • UTF-16 is used mainly for internal processing.
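As a rough illustration of the two numeric entity forms, the sketch below parses a tiny XML document with Python's standard `xml.etree.ElementTree` module. It uses the decimal reference `&#4096;` and the hexadecimal reference `&#xF00;`, which resolve to the two characters in the slide's example. The element names `sample`, `dec`, and `hex` are invented for the example. The last lines show how the same character takes a different number of bytes in UTF-8 and UTF-16.

```python
import xml.etree.ElementTree as ET

# One decimal and one hexadecimal numeric character reference.
doc = "<sample><dec>&#4096;</dec><hex>&#xF00;</hex></sample>"
root = ET.fromstring(doc)

dec_char = root.findtext("dec")   # character with code 4096 (U+1000)
hex_char = root.findtext("hex")   # character with code 0xF00 (U+0F00)

print(f"decimal reference -> U+{ord(dec_char):04X}")
print(f"hex reference     -> U+{ord(hex_char):04X}")

# The same character serialized in two Unicode transformation formats.
utf8 = dec_char.encode("utf-8")        # three bytes
utf16 = dec_char.encode("utf-16-be")   # two bytes, big-endian, no byte-order mark
print(utf8.hex(), utf16.hex())
```

The parser resolves both reference forms to ordinary characters, so application code never needs to care which form the author typed.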
  7. 5. Key Character Sets
     • Most computers today use some 8-bit variant of ASCII, a character set that handles the basic Roman alphabet used for English, along with punctuation, numbers, and simple symbols.
     • Character sets for most European languages match standard ASCII values from 0 to 127 and go on from there to define alternate mappings between character codes and local script characters for values from 128 to 255.
     • Non-Roman alphabets, such as Hebrew, Japanese, and Thai, depend on special character sets that include basic ASCII (0-127, or 0-255).
     • A listing of character sets built around the ASCII framework appears in Table 6-1.
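The split between the shared 0-127 range and the variant-specific 128-255 range can be seen by decoding the same byte under several ISO-8859 variants. The following Python sketch picks three variants (Latin-1, Cyrillic, Greek) purely as examples; `decode_byte` is a hypothetical helper, not something from the book.

```python
def decode_byte(value: int, charset: str) -> str:
    """Interpret a single byte value under the named 8-bit character set."""
    return bytes([value]).decode(charset)

charsets = ["iso-8859-1", "iso-8859-5", "iso-8859-7"]

# 0x41 lies in the ASCII range (0-127); 0xE9 lies in the upper range (128-255).
for value in (0x41, 0xE9):
    decoded = {name: decode_byte(value, name) for name in charsets}
    codes = {name: f"U+{ord(ch):04X}" for name, ch in decoded.items()}
    print(hex(value), codes)
```

The byte 0x41 is the letter A in every variant, while 0xE9 becomes a different character in each one, which is the ambiguity Unicode is meant to remove.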
  8. Table 6-1: Character sets built around the ASCII framework
  9. Table 6-1 shows that most character sets can render English and German, plus a collection of other languages.
     When choosing a variant of ISO-8859, remember that the variant must cover all the languages you want to include.
     XML goes beyond such idiosyncratic or customized character sets and uses Unicode.
  10. 6. Using Unicode Characters
      • Many modern word processors also work with Unicode; for instance, Word 97 and later versions support a format called encoded text that uses Unicode encoding.
      • If you don’t already have access to such tools and want to save an XML file in Unicode format, you must use a conversion tool.
      • Several different tools, both freeware and commercial products, are available, depending on your OS.
      • Widely used tools such as Netscape Navigator (version 4.1 or newer) and IE (version 5 or newer) can handle most ISO-8859 variants.
      • If you want to use an alternate character encoding, you must identify that encoding in your XML document’s prolog as follows:
          <?xml version="1.0" encoding="ISO-8859-9"?>
      • Note that XML parsers are required to support only UTF-8 and UTF-16 encodings, so the encoding attribute in an XML document prolog might not work with all such tools.
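The role of the encoding named in the prolog can be sketched in Python as follows. This is a simplified illustration, not how a conforming XML parser is implemented: the helper `declared_encoding` is hypothetical, it only handles ASCII-compatible encodings (it would not detect UTF-16), and the `word` element and its Turkish sample text are assumptions made for the example.

```python
import re

def declared_encoding(xml_bytes: bytes, default: str = "utf-8") -> str:
    """Return the encoding named in the XML declaration, or the default if none is given."""
    match = re.match(
        rb'<\?xml[^>]*\bencoding=["\']([A-Za-z][A-Za-z0-9._-]*)["\']',
        xml_bytes,
    )
    return match.group(1).decode("ascii") if match else default

# A document saved as ISO-8859-9 (Latin-5, Turkish); \u011f is g with breve.
text = '<?xml version="1.0" encoding="ISO-8859-9"?>\n<word>da\u011f</word>'
data = text.encode("iso-8859-9")

encoding = declared_encoding(data)
roundtrip = data.decode(encoding)

print(encoding)
print(roundtrip == text)
```

If the declaration were missing, the helper would fall back to UTF-8 and the byte 0xF0 used for the Turkish letter would no longer decode to the intended character, which is why the prolog has to name the encoding actually used to save the file.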
  11. 7. Finding Character Entity Information
      • Resource: The Unicode Standard, version 4.0. You can also find plenty of encoding information online, for example at www.unicode.org/ucd/
      • You’ll also find the XHTML entity lists useful in this context:
          Latin-1: www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent
          Special: www.w3.org/TR/xhtml1/DTD/xhtml-special.ent
          Symbols: www.w3.org/TR/xhtml1/DTD/xhtml-symbol.ent
  12. THE END