A Programme Under the compumitra Series
Copyright 2010-14 © Sunmitra Education Technologies Limited, India
eXtensible Markup Language (XML)
A comment by Tim Bray of Sun Microsystems on the celebration of
the 10th anniversary of XML in Feb 2008:
"There is essentially no computer in the world, desk-top, hand-held,
or back-room, that doesn't process XML sometimes. This is a good
thing, because it shows that information can be packaged and
transmitted and used in a way that's independent of the kinds of
computer and software that are involved. XML won't be the last
neutral information wrapping system; but as the first, it's done very
well."
Outline
 XML Eye-opener.
 What is XML?
 HTML vs. XML.
 Basic XML Syntax.
 Constituents.
 Some XML Rules.
 Element Vs. Attribute.
 Node Naming Principles.
 Advanced Concepts related to XML
 Future of XML
XML Eye Opener
 SIMPLE: So simple that you will wonder why you
did not try to understand it sooner.
 SUCCESSFUL: One of the most successful data storage formats
to date; even big brands that strongly favoured
proprietary formats for commercial reasons have started
using it.
 SOLID: A durable concept that this generation
can pass on to future generations, who will
keep the baton moving.
What is XML-1
 XML is the abbreviation of
eXtensible Markup Language.
 XML evolved from the more general
purpose ISO standard SGML
(Standard Generalized Markup
Language).
 All data needs a description to turn
it into useful information. XML
provides a neat solution.
 XML looks like normal English, but it
has been designed to be machine
readable.
What is XML-2
 XML can store data
 XML can help standardization in
exchange of data.
 User-defined markup tags name the
data items.
 Library Functions are available in most
programming languages to parse XML.
 The syntax looks like
<addressbook>
  <adrrecord>
    <name>Name1</name>
    <address>Address1</address>
    <city>City1</city>
  </adrrecord>
</addressbook>
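As an illustration of the library functions mentioned above, here is a minimal sketch that parses the address book with Python's standard `xml.etree.ElementTree` module. The function name `read_records` is only an illustrative choice, not something from the slides.

```python
import xml.etree.ElementTree as ET

# The address book sample from the slide.
ADDRESS_BOOK = """<addressbook>
  <adrrecord>
    <name>Name1</name>
    <address>Address1</address>
    <city>City1</city>
  </adrrecord>
</addressbook>"""


def read_records(xml_text):
    """Return one dict per <adrrecord>, keyed by child tag name."""
    root = ET.fromstring(xml_text)
    return [
        {child.tag: child.text for child in record}
        for record in root.findall("adrrecord")
    ]


records = read_records(ADDRESS_BOOK)
print(records)
```

Because the tags describe the data, the program can refer to items by name ("city") instead of by position.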
Understanding Basic XML Syntax
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<COUNTRYLIST>
  <COUNTRY group="G20">
    <NAME>India</NAME>
    <CODE>IN</CODE>
    <ISD>91</ISD>
    <CAPITAL largestcity="No">New Delhi</CAPITAL>
    <LCITY>Mumbai</LCITY>
    <CURRENCY>Indian Rupee</CURRENCY>
    <CURCODE>INR</CURCODE>
  </COUNTRY>
  <COUNTRY group="G5">
    <NAME>Japan</NAME>
    <CODE>JP</CODE>
    <ISD>81</ISD>
    <CAPITAL largestcity="Yes">Tokyo</CAPITAL>
    <LCITY>Tokyo</LCITY>
    <CURRENCY>Yen</CURRENCY>
    <CURCODE>JPY</CURCODE>
  </COUNTRY>
</COUNTRYLIST>
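Callouts on the example above:
 XML Declaration (the first line):
 Version: the version of XML.
 Encoding: the character set used. UTF-8 is common (an 8-bit Unicode encoding).
 Standalone: "yes" indicates that no external type definitions are used.
 Root Element Node: <COUNTRYLIST>
 Element Node: e.g. <NAME>
 Element Value: e.g. India
 Attribute Node: e.g. group
 Attribute Value: e.g. "G20"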
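The parts named in the callouts can be read back programmatically. The following is a minimal sketch using Python's standard `xml.etree.ElementTree`; the document is shortened to a few elements for brevity, and the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

# A shortened copy of the country list; bytes are used so that the
# encoding named in the XML declaration is honoured by the parser.
COUNTRY_XML = b"""<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<COUNTRYLIST>
  <COUNTRY group="G20">
    <NAME>India</NAME>
    <CAPITAL largestcity="No">New Delhi</CAPITAL>
  </COUNTRY>
  <COUNTRY group="G5">
    <NAME>Japan</NAME>
    <CAPITAL largestcity="Yes">Tokyo</CAPITAL>
  </COUNTRY>
</COUNTRYLIST>"""

root = ET.fromstring(COUNTRY_XML)      # root element node
summary = []
for country in root:                   # child element nodes
    capital = country.find("CAPITAL")
    summary.append((
        country.get("group"),          # attribute value
        country.findtext("NAME"),      # element value
        capital.text,                  # element value
        capital.get("largestcity"),    # attribute value
    ))

print(root.tag)
print(summary)
```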
XML Constituents
 Elements
<address><name>somename</name></address>
 Attributes
<Book Version="1.0"><name></name></Book>
 Five predefined entities allow special characters to be written in the
PCDATA area.
> to &gt;
< to &lt;
& to &amp;
' to &apos;
" to &quot;
 CDATA section (character data that is not to be parsed). This is meant
for holding large amounts of code or general-purpose data. Even HTML
data can be put here.
<![CDATA[ ... ]]>
 Processing Instructions (PI) or directives, given between <? and ?>
<?xml-stylesheet type="text/css" href="mySheet.css"?>
The initial declaration below uses the same syntax, although the XML
specification treats it as the XML declaration rather than as a true PI.
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
Parsable character data (PCDATA) is the text between the start and end
tags of an element such as <address>.
An attribute has a name and a value in quotes.
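The effect of the five entities and of a CDATA section can be checked with a short sketch. It assumes Python's standard library (`xml.sax.saxutils.escape` and `xml.etree.ElementTree`); the sample strings are made up for illustration.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Text containing all five special characters.
raw = 'Tom & Jerry <"cat" vs \'mouse\'>'

# escape() handles &, < and > by default; the two quote characters
# are added through the optional mapping.
escaped = escape(raw, {"'": "&apos;", '"': "&quot;"})
print(escaped)

# The parser turns the entities back into the original characters.
note = ET.fromstring(f"<note>{escaped}</note>")
print(note.text == raw)

# Inside CDATA nothing needs escaping; the text is taken literally.
code = ET.fromstring("<code><![CDATA[if (a < b && b > c) { run(); }]]></code>")
print(code.text)
```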
Some XML Rules - 1
 All elements must have closing tags.
<address>invalid syntax
<address>valid syntax</address>
 Element names are case sensitive.
<Name>incorrect</name>
<Name>correct</Name>
 Elements must be correctly nested.
<address><name>incorrect</address></name>
<address><name>correct</name></address>
 Attribute values must be quoted.
<Book Version=1.0><name></name></Book> (incorrect)
<Book Version="1.0"><name></name></Book> (correct)
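These rules can be tried out with any XML parser, since a parser must reject a document that breaks them. A minimal sketch using Python's standard `xml.etree.ElementTree` follows; the helper name `is_well_formed` is an illustrative choice.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the parser accepts the text, False on a syntax error."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# The incorrect and correct samples from the slide.
samples = [
    "<address>invalid syntax",
    "<address>valid syntax</address>",
    "<Name>incorrect</name>",
    "<Name>correct</Name>",
    "<address><name>incorrect</address></name>",
    "<address><name>correct</name></address>",
    "<Book Version=1.0><name></name></Book>",
    '<Book Version="1.0"><name></name></Book>',
]

for sample in samples:
    print(is_well_formed(sample), sample)
```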
Some XML Rules - 2
 XML Document must have a root element and only one root element
(it can have any name though).
<root>
<Child>correct</Child>
</root>
 Special characters in data values must be written as entities.
> as &gt; < as &lt; & as &amp; ' as &apos; " as &quot;
 Comments have this syntax.
<!-- This is a comment -->
Comments cannot contain -- (two hyphens in a row) in their text.
Some XML Rules - 3
 Whitespace is preserved, unlike in HTML.
For example, "Hello     World" in HTML would be
displayed as "Hello World".
In XML the exact spaces specified are
retained.
 The optional style of writing empty
elements is
<Name /> in place of <Name></Name>
XML Practice: Element Vs Attributes - 1
 It is generally possible to define all data as
ELEMENT tags in a tree format.
<Library>
  <Book>
    <ID>201</ID>
    <ISBN>8175257660</ISBN>
    <Author>Name1</Author>
    <Title>Book Title</Title>
  </Book>
</Library>
 A neat alternative to the above is to use
ATTRIBUTES as follows:
<Library>
  <Book ID="201" ISBN="8175257660">
    <Author>Name1</Author>
    <Title>Book Title</Title>
  </Book>
</Library>
XML Practice: Element Vs Attributes -2
 Which method to use is a decision that needs some thought.
 Information that is surely singular (will not be
repeated) and is not domain specific is recommended
as an ATTRIBUTE.
 Information that you are unable to classify, or that can be
repeated (e.g. the Author tag can be repeated in the
above example), should be an ELEMENT.
 An even better format for the previous example would be
<Library>
  <Book ID="201">
    <ISBN>8175257660</ISBN>
    <Author>Name1</Author>
    <Title>Book Title</Title>
  </Book>
</Library>
This is because ISBN is a book-related property, while ID
may relate to a storage place.
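The element versus attribute choice shows up directly when a program writes the XML. The sketch below, assuming Python's standard `xml.etree.ElementTree`, stores the singular ID as an attribute and the repeatable Author as elements. The function `make_book` and the second author "Name2" are hypothetical, added only to show repetition.

```python
import xml.etree.ElementTree as ET


def make_book(book_id, isbn, authors, title):
    """Build a <Book> with ID as an attribute and the rest as elements."""
    book = ET.Element("Book", {"ID": str(book_id)})   # singular: attribute
    ET.SubElement(book, "ISBN").text = isbn
    for author in authors:                            # repeatable: elements
        ET.SubElement(book, "Author").text = author
    ET.SubElement(book, "Title").text = title
    return book


library = ET.Element("Library")
library.append(make_book(201, "8175257660", ["Name1", "Name2"], "Book Title"))

xml_text = ET.tostring(library, encoding="unicode")
print(xml_text)
```

An attribute could not hold the two authors separately, because an attribute name may appear only once on an element.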
XML Node Naming – Begins with
 Node (element or attribute) names must
begin with a letter or _ (underscore).
<1STLINE></1STLINE> (invalid element name)
<LINE1></LINE1> (valid element name)
<BOOK 1Ver="1.00"></BOOK> (invalid attribute name)
<BOOK _Ver="1.00"></BOOK> (valid attribute name)
XML Node Naming – Consists of
 A name can consist of
 Any English letter or digit, or letters of other
languages, provided they can be represented in the encoding given
in the declaration.
<Name>Sun</Name>
<नाम>सूरज</नाम>
 A dot (.), hyphen (-) or _ (underscore)
<Address.Cityname>Delhi</Address.Cityname>
<Address-Cityname>Delhi</Address-Cityname>
<Address_Cityname>Delhi</Address_Cityname>
 Tabs and spaces are not allowed in
XML node names.
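One rough way to try out the naming rules is to let a parser decide. The sketch below, assuming Python's standard `xml.etree.ElementTree`, wraps a candidate name in an empty element and reports whether it parses. The helper `is_valid_element_name` is hypothetical and is a quick check for simple names only, not a full implementation of the name grammar.

```python
import xml.etree.ElementTree as ET


def is_valid_element_name(name):
    """Rough check: does <name/> parse as a well-formed document?"""
    try:
        ET.fromstring(f"<{name}/>")
        return True
    except ET.ParseError:
        return False


# Candidate names taken from the slides, plus one containing a space.
candidates = [
    "1STLINE",            # starts with a digit
    "LINE1",
    "_Ver",
    "नाम",                # non-English letters
    "Address.Cityname",
    "Address-Cityname",
    "Address_Cityname",
    "Address Cityname",   # contains a space
]

results = {name: is_valid_element_name(name) for name in candidates}
for name, ok in results.items():
    print(ok, name)
```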
XML Node Naming – Based on Namespace
 A name can belong to a namespace.
 The name "table" may be used for an HTML table or for a piece of
furniture. One can resolve this clash by using namespace prefixes as
follows. Each prefix must be bound to a namespace URI with an xmlns
declaration, which is not shown here.
<h:table>
  <h:tr>
    <h:td>Apples</h:td>
    <h:td>Bananas</h:td>
  </h:tr>
</h:table>
<f:table>
  <f:name>Dining Table</f:name>
  <f:width>120</f:width>
  <f:length>230</f:length>
</f:table>
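For a parser to accept the h: and f: prefixes, they have to be declared. The sketch below wraps the two tables in a single root and binds each prefix with an xmlns declaration; the root element name and the two example.com URIs are assumptions made for illustration. It uses Python's standard `xml.etree.ElementTree`.

```python
import xml.etree.ElementTree as ET

# The root element and both namespace URIs are illustrative assumptions.
DOC = """<root xmlns:h="http://example.com/html"
      xmlns:f="http://example.com/furniture">
  <h:table>
    <h:tr><h:td>Apples</h:td><h:td>Bananas</h:td></h:tr>
  </h:table>
  <f:table>
    <f:name>Dining Table</f:name>
    <f:width>120</f:width>
    <f:length>230</f:length>
  </f:table>
</root>"""

# Prefix-to-URI map used when searching; the prefixes need not match the file.
NS = {"h": "http://example.com/html", "f": "http://example.com/furniture"}

root = ET.fromstring(DOC)
cells = [td.text for td in root.findall("h:table/h:tr/h:td", NS)]
furniture_name = root.findtext("f:table/f:name", namespaces=NS)
html_table_tag = root.find("h:table", NS).tag

print(cells)
print(furniture_name)
print(html_table_tag)   # the parser stores the full URI, not the prefix
```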
HTML Vs XML - 1
 Similarities.
Both use markup tags (elements and attributes), e.g.
<H1>Heading1</H1> or <font face="Verdana"></font>.
Both use entities, e.g. &lt; &gt; etc.
Both are derived from SGML.
HTML Vs XML - 2
 Differences.
HTML has predefined tags; XML
tags are user defined.
HTML is for humans, and browsers
tolerate errors. XML is for
computers, as a data store or for
definitions, so errors cannot be
ignored.
HTML is usually not updated by
programs, while XML is meant to be
written by programs.
HTML has a large number of predefined
entities. XML has just five.
XSL (Extensible Stylesheet Language)
 HTML is styled using CSS (Cascading
Style Sheets). XML tags are user defined,
so XSL is used to describe how they should be presented.
 It has three parts
XSLT (XSL Transformations): for transforming XML
data, e.g. into XHTML for display on a webpage.
XPath: a way to reach a particular data item in
an XML file. This is very often useful in
reading XML-based configuration files.
XSL-FO (XSL Formatting Objects): provides a
display/print formatting mechanism for XML
data.
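To give a feel for XPath, here is a minimal sketch using the limited XPath subset supported by Python's standard `xml.etree.ElementTree` (a full XPath engine offers much more). The data is a shortened version of the country list used earlier; the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

# A shortened copy of the country list from the syntax slide.
COUNTRIES = """<COUNTRYLIST>
  <COUNTRY group="G20">
    <NAME>India</NAME>
    <CODE>IN</CODE>
    <CAPITAL largestcity="No">New Delhi</CAPITAL>
    <CURRENCY>Indian Rupee</CURRENCY>
  </COUNTRY>
  <COUNTRY group="G5">
    <NAME>Japan</NAME>
    <CODE>JP</CODE>
    <CAPITAL largestcity="Yes">Tokyo</CAPITAL>
    <CURRENCY>Yen</CURRENCY>
  </COUNTRY>
</COUNTRYLIST>"""

root = ET.fromstring(COUNTRIES)

# Names of countries whose group attribute is "G20".
g20_names = [n.text for n in root.findall("./COUNTRY[@group='G20']/NAME")]

# Capitals, anywhere in the tree, that are also the largest city.
big_capitals = [c.text for c in root.findall(".//CAPITAL[@largestcity='Yes']")]

# Currency of the country whose CODE element contains "JP".
jp_currency = root.findtext("./COUNTRY[CODE='JP']/CURRENCY")

print(g20_names, big_capitals, jp_currency)
```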
DTD (Document Type Definition)
 A DTD is referred to within a DOCTYPE
declaration in an XML file, such as
<!DOCTYPE note SYSTEM "Note.dtd">
 The DTD can also be written inside the XML file itself,
in the format that follows. An external file such as Note.dtd
holds only the <!ELEMENT ...> lines, without the DOCTYPE wrapper.
<!DOCTYPE note
[
<!ELEMENT note (to,from,heading,body)>
<!ELEMENT to (#PCDATA)>
<!ELEMENT from (#PCDATA)>
<!ELEMENT heading (#PCDATA)>
<!ELEMENT body (#PCDATA)>
]>
The XML file has a root node named note with four sub-elements.
The sub-elements have the PCDATA format.
Parsing XML
 The process of reading an XML file and extracting
valid data out of it is called "PARSING".
 Parsers are of two types
Non-Validating Parser: checks only that the document
is well formed; it does not check the document against a DTD.
Validating Parser: also checks the document
against its DTD.
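The difference between the two parser types can be seen with a small sketch. Python's standard `xml.etree.ElementTree` is built on a non-validating parser, so it accepts a note that is well formed but breaks its own DTD. The sample values "Reader" and "Author" and the manual check at the end are assumptions for illustration; a validating parser would do that check automatically from the DTD.

```python
import xml.etree.ElementTree as ET

# Well formed, but heading and body are missing, so the note does not
# satisfy the DTD it carries.
NOTE = """<?xml version="1.0"?>
<!DOCTYPE note [
<!ELEMENT note (to,from,heading,body)>
<!ELEMENT to (#PCDATA)>
<!ELEMENT from (#PCDATA)>
<!ELEMENT heading (#PCDATA)>
<!ELEMENT body (#PCDATA)>
]>
<note><to>Reader</to><from>Author</from></note>"""

# A non-validating parser reads this without complaint.
root = ET.fromstring(NOTE)
children = [child.tag for child in root]
print(children)

# Hand-written stand-in for what a validating parser would enforce.
expected = ["to", "from", "heading", "body"]
matches_dtd = children == expected
print(matches_dtd)
```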
Some Advanced Concepts Related to XML
 XML Schema: Relates to defining
validation rules in the form of XSD
(XML Schema Definition) files, which
are themselves in XML format.
 XQuery: A way to search
within XML data and get the
nodes that match the given
criteria.
Where to View/Edit
 Browsers: Most browsers are good at viewing
XML. Internet Explorer is particularly good at it.
 Editors: Special editors are available that provide
good XML viewing/editing facilities. Microsoft's
XML Editor and Peter's XML editor are good at it.
 Office Tools: Tools like MS-Word and FrontPage
provide good XML editing. Even MS-Excel
supports opening XML files.
 Visual Studio/WebDeveloper: They provide an
excellent environment for XML editing and
viewing, along with validation support.
Let's Quickly Revise
 2 types of nodes: Elements and Attributes. Elements
are repeatable. Attributes can always be expressed as
elements; the reverse may not be true.
 Special syntax for non-parsed data: CDATA.
 5 entities for special symbols (<, >, ', ", &).
 HTML-style comments allowed. <!-- comments -->
 Case-sensitive. Closing tags required.
 One can apply Processing Instructions (PI), which
are enclosed within <? ?>. The first line is usually the
XML declaration, which uses the same syntax.
 Always have a single root node.
Future of XML
 All websites may one day be written in XML.
HTML has already been re-standardised as
XHTML, which provides stricter syntax checking
and better browser compatibility.
 XML promises to be the most open system for
storing information from all IT gadgets, from
desktops to mobile phones, iPods, iPads,
DVD players and microwave ovens. It is already
being used, and it is expected to be used in more
and more devices.
 All office documents and e-books, offline and online,
may ultimately be in XML, as it is a simple,
non-proprietary format that meets
these needs well.
 Ask and guide me at
sunmitraeducation@gmail.com
 Share this information with as
many people as possible.
 Keep visiting www.sunmitra.com
for programme updates.

Basics of XML
