What is XML?
• Some important facts about XML
  – XML stands for the eXtensible Markup Language
  – It was developed by the W3C
    • World Wide Web Consortium
    • www.w3.org
  – XML 1.0 (2nd Edition)
    • W3C recommendation
    • http://www.w3.org/TR/REC-xml
  – XML 1.1
    • Candidate recommendation
Evolution of WWW
• The Web was once a publishing tool for scientific documents only.
• Now it is a full-fledged medium, like TV or print.
  – Furthermore, the Web is an interactive medium
  – Over 800 million Web pages are written in HTML
Problems of HTML (I)
• Over the years, HTML has been extended
  – HTML has close to 100 tags
  – Supporting technologies have been introduced by vendors
  – Still more tags are needed!
• Example
  – E-commerce applications need tags for prices and product references
  – Streaming would need tags to control the flow of media
  – HTML is already on the verge of collapsing under its own weight!
Problems of HTML (II)
• Some applications would benefit greatly from a reduction in the tag count!
  – More and more people are accessing the Web from PDAs and smart phones
    • Mobile devices are not as powerful as PCs
    • They cannot process the full, complex Web language
    • The markup can take up more space than the content itself
Basic Principles of XML
• Increasingly specialized applications need more tags, while other applications want a simpler language
  – The W3C resolved this dilemma by making two changes to HTML
    • No predefined tags
    • Stricter syntax
No Predefined Tags (I)
• XML has no predefined tags.
  – Authors create all the tags they need
    • If you need a certain tag, just make it
• HTML
    <table>
      <tr>
        <td>Price USD 499 </td>
        <td><a href="/newsletter"><b>Pineapplesoft Link</b></a></td>
      </tr>
    </table>
• XML
    <price currency="usd">499.00</price>
    <toc xlink:href="/newsletter">Pineapplesoft Link</toc>
No Predefined Tags (II)
• How does the browser know what an author-defined tag looks like?
  – Style sheet
• Can we compare different prices?
• What about current and previous browsers?
• Can we simplify Web site maintenance?
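The price-comparison question can be sketched in code. Because an author-defined `<price>` tag names the amount and its currency explicitly, a program can read and compare prices without guessing at page layout. This is a minimal sketch using Python's standard `xml.etree.ElementTree`; the second offer and the `cheapest` helper are hypothetical, introduced only for illustration.

```python
import xml.etree.ElementTree as ET

# Offers marked up with the author-defined <price> tag.
# The second offer is hypothetical, added so there is something to compare.
offers = [
    '<price currency="usd">499.00</price>',
    '<price currency="usd">459.50</price>',
]

def cheapest(xml_prices, currency="usd"):
    """Return the lowest amount among the prices given in `currency`."""
    amounts = []
    for markup in xml_prices:
        elem = ET.fromstring(markup)
        if elem.get("currency") == currency:
            amounts.append(float(elem.text))
    return min(amounts)

print(cheapest(offers))
```

Doing the same with the HTML version would mean scraping the string "Price USD 499" out of a table cell, which breaks as soon as the layout changes.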
Stricter Syntax
• More than 50% of the code in a browser is devoted to handling errors or sloppiness on the author's part.
  – Due to the increasing use of HTML editors
  – Browsers are growing in size and becoming slower
• XML adopts a strict syntax for smaller and faster browsers
  – Sloppy HTML:
    <p>Welcome to our site! <img src=logo.jpg>
  – Strict XML:
    <p>Welcome to our site! <img src="logo.jpg"/></p>
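The effect of the stricter syntax can be sketched with a standard XML parser: it rejects the sloppy markup outright instead of trying to repair it. The helper name `is_well_formed` is an assumption made for this example.

```python
import xml.etree.ElementTree as ET

# Sloppy markup: unquoted attribute value, <img> and <p> never closed.
sloppy = '<p>Welcome to our site! <img src=logo.jpg>'
# Strict markup: quoted attribute, empty <img/>, explicit </p>.
strict = '<p>Welcome to our site! <img src="logo.jpg"/></p>'

def is_well_formed(markup):
    """Return True if an XML parser accepts the markup, False otherwise."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False

print(is_well_formed(sloppy))
print(is_well_formed(strict))
```

The parser needs no error-recovery logic, which is the size and speed argument made for XML.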
Document Structures (I)
• An example
    INTERNAL MEMO                                                      (title)
    From: Bh Huang                                                     (header)
    To: Conrad Ho
    Regarding: Using User Attention Model in Watermarking
    Have you finished the job? Can I adopt the program directly?       (body)
    I think using the user attention model will be of great benefit.
    Bh
Document Structures (II)
    <?xml version="1.0"?>
    <memo>
      <header>
        <from>Bh Huang</from>
        <to>Conrad Ho</to>
        <subject>Using User Attention Model in Watermarking</subject>
      </header>
      <body>
        <para>Have you finished the job? Can I adopt the program directly? I think using the user attention model will be of great benefit.</para>
        <signature>Bh</signature>
      </body>
    </memo>
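Once the memo's structure is explicit, software can address each part by name. The sketch below uses an abbreviated copy of the memo and Python's standard `xml.etree.ElementTree`; the variable names are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the memo; only the first sentence of the body is kept.
memo_xml = """<?xml version="1.0"?>
<memo>
  <header>
    <from>Bh Huang</from>
    <to>Conrad Ho</to>
    <subject>Using User Attention Model in Watermarking</subject>
  </header>
  <body>
    <para>Have you finished the job?</para>
    <signature>Bh</signature>
  </body>
</memo>"""

memo = ET.fromstring(memo_xml)

# Each part of the memo is reached through its path in the structure.
sender = memo.findtext("header/from")
recipient = memo.findtext("header/to")
subject = memo.findtext("header/subject")
signature = memo.findtext("body/signature")

print(f"{sender} -> {recipient}: {subject}")
```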
Applications of XML
• Most popular applications of XML
  – Document applications manipulate information primarily intended for human consumption
  – Data applications manipulate information primarily intended for software
Document Publishing (I)
• XML concentrates on the structure of the document, making it independent of the delivery medium
[Figure: a single XML document published as HTML, PDF, and WML]
Document Publishing (II)
• It is possible to edit and maintain documents in XML and automatically publish them on different media
  – More and more publications are available both online and in print
  – The Web is changing rapidly
  – New markup languages are introduced for specific devices
Data Applications
• If the structure of a document can be expressed in XML, so can the structure of a database.
• An XML web site can be regarded as a large database that applications can tap
Near-term Applications of XML
• Large web site maintenance
• Exchange of information between organizations
• Content made available to different web sites
• E-commerce applications where different organizations collaborate to serve a customer
• Scientific applications with new markup languages for formulas or specifications
• E-books that need to express rights and ownership
An Example
• Plain text file
    John Doe
    34 Fountain Square Plaza
    Cincinnati, OH 45202
    US
    513-744-8889 (preferred)
    513-744-7098
    email@example.com

    Jack Smith
    513-744-3465
    email@example.com
    Never leave messages on his answering machine. Email instead.
• XML Document
    <?xml version="1.0"?>
    <!-- Download from www.marchal.com or www.mcp.com -->
    <address-book>
      <entry>
        <name>John Doe</name>
        <address>
          <street>34 Fountain Square Plaza</street>
          <region>OH</region>
          <postal-code>45202</postal-code>
          <locality>Cincinnati</locality>
          <country>US</country>
        </address>
        <tel preferred="true">513-744-8889</tel>
        <tel>513-744-7098</tel>
        <email href="mailto:firstname.lastname@example.org"/>
      </entry>
      <entry>
        <name>Jack Smith</name>
        <tel>513-744-3465</tel>
        <email href="mailto:firstname.lastname@example.org"/>
        <comments>Never leave messages on his answering machine. <b>Email instead.</b></comments>
      </entry>
    </address-book>
• Which one is easier to read?
• Which one is easier for software to interpret?
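To make the "easier for software" question concrete, here is a minimal sketch that looks up a person's preferred telephone number in a trimmed copy of the address book. The function name `preferred_tel` and the fallback to the first listed number are assumptions made for this example, not part of the original slides.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the address book: names and telephone numbers only.
address_book = """<address-book>
  <entry>
    <name>John Doe</name>
    <tel preferred="true">513-744-8889</tel>
    <tel>513-744-7098</tel>
  </entry>
  <entry>
    <name>Jack Smith</name>
    <tel>513-744-3465</tel>
  </entry>
</address-book>"""

def preferred_tel(book_xml, name):
    """Return the preferred number for `name`, else the first one listed, else None."""
    root = ET.fromstring(book_xml)
    for entry in root.findall("entry"):
        if entry.findtext("name") == name:
            preferred = entry.find("tel[@preferred='true']")
            if preferred is not None:
                return preferred.text
            return entry.findtext("tel")
    return None

print(preferred_tel(address_book, "John Doe"))
print(preferred_tel(address_book, "Jack Smith"))
```

With the plain text file, the same lookup would depend on guessing which lines are telephone numbers and on the "(preferred)" wording staying unchanged.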
Elements
• Fundamental units of XML
  – E.g. <tel>513-744-7098</tel>
  – Each element is delimited by a start tag and an end tag, much as in HTML
    • The start tag is the element name enclosed in "<" and ">"
    • The end tag must include an additional "/"
  – Both a start tag and an end tag are required for an element
Naming an Element
• The names of elements must follow specific rules.
  – The element name must start with a letter or _
  – The rest of the name can consist of letters, digits, -, ., or _
  – Spaces are not allowed in an element name
  – Element names are case-sensitive

    Legal                     Illegal        Case sensitivity   Suggested writing
    <copyright-information>   <123>          <address>          address-book
    <p>                       <first name>   <ADDRESS>          AddressBook
    <base64>                  <Tom&jerry>    <Address>
    <decompte.client>
    <firstname>
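The naming rules can be sketched as a small check. This is a deliberately simplified, ASCII-only version of the rules listed on the slide; the XML specification itself allows a wider range of characters. The names `NAME_PATTERN` and `is_legal_name` are assumptions made for this example.

```python
import re

# Simplified, ASCII-only form of the slide's rules:
# first character is a letter or "_"; the rest are letters, digits, "-", "." or "_".
NAME_PATTERN = re.compile(r"[A-Za-z_][A-Za-z0-9_.-]*")

def is_legal_name(name):
    """Return True if `name` satisfies the simplified element-name rules."""
    return NAME_PATTERN.fullmatch(name) is not None

for candidate in ["copyright-information", "base64", "123", "first name", "Tom&jerry"]:
    print(candidate, is_legal_name(candidate))

# Names differing only in case pass the check but are different names.
print("address" == "ADDRESS")
```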
Attributes
• Additional information about elements
  – <tel preferred="true">513-744-8889</tel>
• An attribute consists of a name and a value.
• Attribute names must follow the same rules as element names
• The start tag of an element can contain zero, one, or several attributes
• Quote marks are required! (quotes can be ' or ")
  – <confidentiality level="I don't know">This document is not confidential</confidentiality>
• Attributes are not part of the element name
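A short sketch of how a parser exposes attributes, using Python's standard `xml.etree.ElementTree`. It shows that the attribute is kept apart from the element name, that a double-quoted value may contain an apostrophe, and that an unquoted value is rejected. The helper `parses` is an assumption made for this example.

```python
import xml.etree.ElementTree as ET

tel = ET.fromstring('<tel preferred="true">513-744-8889</tel>')
print(tel.tag)               # the element name, without any attribute
print(tel.get("preferred"))  # the attribute value as a string
print(tel.text)              # the element content

# A value delimited by double quotes may contain a single quote.
conf = ET.fromstring(
    """<confidentiality level="I don't know">This document is not confidential</confidentiality>"""
)
print(conf.get("level"))

def parses(markup):
    """Return True if the markup is accepted by the XML parser."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False

# Missing quotes around the attribute value make the document ill-formed.
print(parses('<tel preferred=true>513-744-8889</tel>'))
```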
Special Attributes
• xml:space
  – Specifies how spaces are handled
    • preserve: all spaces are preserved
    • default: the application's default whitespace handling applies (often collapsing repeated spaces)
• xml:lang
  – Specifies the language in which the element's content is written
    • <p xml:lang="en-GB">What colour is it?</p>
    • <p xml:lang="en-US">What color is it?</p>
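As a sketch of how an application might use `xml:lang`, the example below picks the paragraph written for a given language. In Python's `xml.etree.ElementTree`, the reserved `xml` prefix is reported through its namespace URI, so the attribute is looked up under that qualified name. The wrapping `<doc>` element and the helper `text_for_language` are hypothetical.

```python
import xml.etree.ElementTree as ET

# The reserved "xml" prefix is bound to this namespace URI.
XML_NS = "http://www.w3.org/XML/1998/namespace"

# <doc> is a hypothetical wrapper so both paragraphs share one root element.
doc = ET.fromstring(
    '<doc>'
    '<p xml:lang="en-GB">What colour is it?</p>'
    '<p xml:lang="en-US">What color is it?</p>'
    '</doc>'
)

def text_for_language(root, lang):
    """Return the text of the first <p> whose xml:lang equals `lang`, else None."""
    for p in root.findall("p"):
        if p.get(f"{{{XML_NS}}}lang") == lang:
            return p.text
    return None

print(text_for_language(doc, "en-GB"))
print(text_for_language(doc, "en-US"))
```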
Empty Elements
• Elements having no content are called empty elements
• An empty element can be written in a short form or with separate start and end tags
  – <email href="email@example.com" />
  – <email href="firstname.lastname@example.org"></email>
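The two notations for an empty element are interchangeable as far as a parser is concerned. A minimal sketch, assuming Python's standard `xml.etree.ElementTree`; the same address is used in both forms so the results can be compared, and the helper `describe` is an assumption made for this example.

```python
import xml.etree.ElementTree as ET

short_form = ET.fromstring('<email href="mailto:email@example.com"/>')
long_form = ET.fromstring('<email href="mailto:email@example.com"></email>')

def describe(elem):
    """Summarise an element as (tag, attributes, text, number of children)."""
    return (elem.tag, dict(elem.attrib), elem.text, len(elem))

print(describe(short_form))
print(describe(short_form) == describe(long_form))
```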
Hierarchical Structure of Elements
    <?xml version="1.0"?>
    <!-- Download from www.marchal.com or www.mcp.com -->
    <address-book>
      <entry>
        <name>John Doe</name>
        <address>
          <street>34 Fountain Square Plaza</street>
          <region>OH</region>
          <postal-code>45202</postal-code>
          <locality>Cincinnati</locality>
          <country>US</country>
        </address>
        <tel preferred="true">513-744-8889</tel>
        <tel>513-744-7098</tel>
        <email href="mailto:email@example.com"/>
      </entry>
      <entry>
        <name>Jack Smith</name>
        <tel>513-744-3465</tel>
        <email href="mailto:firstname.lastname@example.org"/>
        <comments>Never leave messages on his answering machine. <b>Email instead.</b></comments>
      </entry>
    </address-book>
• Containing text: e.g. <name>
• Containing other elements: e.g. <entry>
• Containing a mixture of both: e.g. <comments>
Hierarchical Structure of Elements (cont.)
• Elements containing other elements are called parents
• Elements contained in other elements are called children
• Children must be fully contained within their parents
• Correct
    <entry>
      <name>Jack Smith</name>
      <tel>513-744-3465</tel>
      <email href="mailto:email@example.com"/>
      <comments>Never leave messages on his answering machine. <b>Email instead.</b></comments>
    </entry>
• Wrong
    <entry>
      <name>Jack Smith</name>
      <tel>513-744-3465</tel>
      <email href="mailto:firstname.lastname@example.org"/>
      <comments>Never leave messages on his answering machine. <b>Email instead.
    </entry>
    </comments></b>
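The parent and child relationships can be walked programmatically, and a parser refuses a document whose children are not fully contained in their parents. A minimal sketch with trimmed versions of the two entries, using Python's standard `xml.etree.ElementTree`; the variable names are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

correct = """<entry>
  <name>Jack Smith</name>
  <tel>513-744-3465</tel>
  <comments>Never leave messages on his answering machine. <b>Email instead.</b></comments>
</entry>"""

# End tags appear in the wrong order, so <b> and <comments> overlap <entry>.
wrong = """<entry>
  <name>Jack Smith</name>
  <comments>Never leave messages on his answering machine. <b>Email instead.
</entry>
</comments></b>"""

entry = ET.fromstring(correct)
child_tags = [child.tag for child in entry]                      # children of <entry>
grandchild_tags = [child.tag for child in entry.find("comments")]  # children of <comments>
print(child_tags, grandchild_tags)

try:
    ET.fromstring(wrong)
    wrong_is_well_formed = True
except ET.ParseError:
    wrong_is_well_formed = False
print(wrong_is_well_formed)
```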
The Root Element
• Each document must have exactly one root element
  – All other elements must be contained in the root element
• Wrong
    <?xml version="1.0"?>
    <entry>
      <name>John Doe</name>
      <email href="mailto:email@example.com"/>
    </entry>
    <entry>
      <name>Jack Smith</name>
      <email href="mailto:email@example.com"/>
    </entry>
• Correct
    <?xml version="1.0"?>
    <address-book>
      <entry>
        <name>John Doe</name>
        <email href="mailto:firstname.lastname@example.org"/>
      </entry>
      <entry>
        <name>Jack Smith</name>
        <email href="mailto:firstname.lastname@example.org"/>
      </entry>
    </address-book>
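The single-root rule is enforced by the parser. The sketch below feeds it two top-level entries, then the same entries wrapped in one root element; the helper `count_entries` is an assumption made for this example.

```python
import xml.etree.ElementTree as ET

two_roots = (
    "<entry><name>John Doe</name></entry>"
    "<entry><name>Jack Smith</name></entry>"
)
# Wrapping both entries in a single element gives the document one root.
one_root = "<address-book>" + two_roots + "</address-book>"

def count_entries(markup):
    """Return the number of <entry> children, or None if the markup is not well-formed."""
    try:
        root = ET.fromstring(markup)
    except ET.ParseError:
        return None
    return len(root.findall("entry"))

print(count_entries(two_roots))
print(count_entries(one_root))
```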
The XML Declaration
• The first line in an XML document is called the XML declaration
  – <?xml version="1.0"?>
• The XML declaration identifies the document as an XML document
• The XML version is included in the XML declaration
• The XML declaration is optional, but including it is recommended
• The current version of XML is 1.0
• The second edition is only the first edition with errors corrected
Comments
• Comments are surrounded by "<!--" and "-->"
• Since comments are meant for human readers only, XML parsers may ignore them
  – E.g. <!-- Download from www.marchal.com or www.mcp.com -->
• Comments cannot be placed inside a tag
  – E.g. <name <!-- an invalid comment -->>Jack </name>
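A small sketch of both points, using Python's standard `xml.etree.ElementTree`, which drops comments by default: a comment placed between tags leaves the data untouched, while a comment placed inside a tag makes the document ill-formed. The sample strings and variable names are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

valid = '<name><!-- a valid comment -->Jack</name>'
invalid = '<name <!-- an invalid comment -->>Jack</name>'

name = ET.fromstring(valid)
print(name.text)   # the comment does not show up in the parsed content
print(len(name))   # and it is not counted as a child element

try:
    ET.fromstring(invalid)
    invalid_accepted = True
except ET.ParseError:
    invalid_accepted = False
print(invalid_accepted)
```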
Unicode
• Unicode supports all languages still in use in the world, as well as mathematical and other symbols
• In the UTF-16 encoding, most characters are represented by 16 bits
  – An XML file can be about 2X larger than the same text in an 8-bit encoding
  – Solution: specify the encoding in the XML declaration; all XML parsers support "UTF-8" and "UTF-16", and UTF-8 stores ASCII characters in 8 bits
  – E.g. <?xml version="1.0" encoding="ISO-8859-1"?>
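The size difference and the role of the encoding declaration can be checked with a short sketch. The sample strings are assumptions made for this example; `utf-16-le` is used so that no byte-order mark is counted.

```python
import xml.etree.ElementTree as ET

ascii_text = "<name>John Doe</name>"
utf8_size = len(ascii_text.encode("utf-8"))
utf16_size = len(ascii_text.encode("utf-16-le"))  # "-le" avoids counting a byte-order mark
print(utf8_size, utf16_size)

# A document stored as ISO-8859-1 bytes: the declaration tells the parser how to decode it.
latin1_doc = (
    '<?xml version="1.0" encoding="ISO-8859-1"?>'
    '<name>Benoît Marchal</name>'
).encode("iso-8859-1")
decoded_name = ET.fromstring(latin1_doc).text
print(decoded_name)
```

For text that is mostly ASCII, UTF-8 keeps the file at one byte per character, which is why it is a common choice for XML files.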
Entity
• Complex XML documents are often split across several files
• The unit of storage in an XML document is the entity
• E.g. if we define an entity "us" with the value "United States"
  – <country>&us;</country> is equivalent to
  – <country>United States</country>
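A minimal sketch of the "us" entity in action. The slide does not show how the entity is declared; the `<!DOCTYPE ...>` block below is an assumption, using the standard XML syntax for declaring an internal entity. Python's standard `xml.etree.ElementTree` expands such internal entities while parsing.

```python
import xml.etree.ElementTree as ET

# The DOCTYPE block declares the entity "us"; it is added here for illustration.
doc = """<?xml version="1.0"?>
<!DOCTYPE country [
  <!ENTITY us "United States">
]>
<country>&us;</country>"""

country = ET.fromstring(doc)
print(country.text)
```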
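Predefined Entities
• &lt;    <
• &amp;   &
• &gt;    >  (required for the > in "]]>" outside a CDATA section)
• &apos;  '
• &quot;  "
• Entity reference:
  – <company>Marks &amp; Spencer</company> (read by the parser as Marks & Spencer)
• Character reference:
  – <name>Beno&#238;t Marchal</name> (read by the parser as Benoît Marchal)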
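Escaping with the predefined entities, and the reverse step performed by the parser, can be sketched as follows. `xml.sax.saxutils.escape` is part of Python's standard library and replaces `&`, `<` and `>`; the variable names are assumptions made for this example.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Writing: the ampersand must be replaced by an entity reference.
escaped = escape("Marks & Spencer")
print(escaped)

# Reading: the parser turns references back into the characters they stand for.
company = ET.fromstring(f"<company>{escaped}</company>")
name = ET.fromstring("<name>Beno&#238;t Marchal</name>")
print(company.text)
print(name.text)

# A raw ampersand in content is not well-formed.
try:
    ET.fromstring("<company>Marks & Spencer</company>")
    raw_ampersand_accepted = True
except ET.ParseError:
    raw_ampersand_accepted = False
print(raw_ampersand_accepted)
```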
Processing Instruction
• The mechanism for inserting non-XML statements into an XML document
  – A compromise on the purely structural nature of XML
  – Enclosed in "<?" and "?>"
  – The first word is called the target: the application or device to which the instruction is directed
    • <?xml version="1.0" encoding="ISO-8859-1"?>
    • <?xml-stylesheet href="simple-ie5.xsl" type="text/xsl"?>
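A sketch of how an application can read the target and data of a processing instruction, using Python's standard `xml.dom.minidom`. The empty `<memo/>` root is a hypothetical placeholder so that the document is complete. The XML declaration shares the "<?" syntax, but the parser does not report it as a processing instruction.

```python
from xml.dom import minidom

# <memo/> is a hypothetical root element, present only to make the document complete.
doc = minidom.parseString(
    '<?xml version="1.0"?>'
    '<?xml-stylesheet href="simple-ie5.xsl" type="text/xsl"?>'
    '<memo/>'
)

# Collect (target, data) for every processing instruction at the top level.
instructions = [
    (node.target, node.data)
    for node in doc.childNodes
    if node.nodeType == node.PROCESSING_INSTRUCTION_NODE
]
print(instructions)
```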
CDATA Sections
• Enclosed in "<![CDATA[" and "]]>"
• The XML parser treats everything inside as plain text and does not interpret markup characters
• Used when entity references would be needed too frequently, or when another XML document is included as text
    <?xml version="1.0"?>
    <example>
    <![CDATA[
    <?xml version="1.0"?>
    <entry>
    <name>John Doe</name>
    </entry>]]>
    </example>
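The effect of a CDATA section can be sketched by parsing the example: the embedded document arrives as plain text inside `<example>`, not as child elements. This uses Python's standard `xml.etree.ElementTree`; the variable names are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

doc = """<?xml version="1.0"?>
<example>
<![CDATA[
<?xml version="1.0"?>
<entry>
<name>John Doe</name>
</entry>]]>
</example>"""

example = ET.fromstring(doc)
print(len(example))   # number of child elements of <example>
print(example.text)   # the embedded markup, kept as literal text
```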
Common Errors
• The end tag is missing
• The case of the start tag and end tag does not match (XML is case-sensitive)
• Spaces are used in element names
• The quotes around an attribute value are missing