UNIT-II XML
Introduction to XML
XML stands for Extensible Markup Language. It is a text-based markup language derived from
Standard Generalized Markup Language (SGML).
XML tags identify data and are used to store and organize it, whereas HTML tags specify how
data is displayed. XML is not going to replace HTML in the near future, but it introduces new
possibilities by adopting many successful features of HTML.
There are three important characteristics of XML that make it useful in a variety of systems and
solutions:
XML is extensible: XML allows you to create your own self-descriptive tags, or language, that
suits your application.
XML carries the data, does not present it: XML allows you to store the data irrespective of
how it will be presented.
XML is a public standard: XML was developed by an organization called the World Wide
Web Consortium (W3C) and is available as an open standard.
XML Usage
A short list of XML usage says it all:
XML can work behind the scenes to simplify the creation of HTML documents for large web
sites.
XML can be used to exchange information between organizations and systems.
XML can be used for offloading and reloading of databases.
XML can be used to store and arrange data in a way that suits your data handling needs.
XML can easily be merged with style sheets to create almost any desired output.
Virtually any type of data can be expressed as an XML document.
What is Markup?
XML is a markup language that defines a set of rules for encoding documents in a format that
is both human-readable and machine-readable. So what exactly is a markup language?
Markup is information added to a document that enhances its meaning in certain ways, in
that it identifies the parts and how they relate to each other. More specifically, a markup
language is a set of symbols that can be placed in the text of a document to demarcate and
label the parts of that document.
The following example shows how XML markup looks when embedded in a piece of text:
<message>
<text>Hello, world!</text>
</message>
This snippet includes the markup symbols, or the tags such as
<message>...</message> and <text>...</text>. The tags <message> and
</message> mark the start and the end of the XML code fragment. The tags <text> and
</text> surround the text Hello, world!.
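To see that the tags only label the data, the snippet can be handed to a parser and the labelled text read back. The following is a minimal sketch, assuming the JDK's built-in DOM parser; the class name MarkupDemo and the helper readText are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class: reads the text labelled by the <text> tag.
class MarkupDemo {
    // Parses an XML string and returns the content of the first <text> element.
    static String readText(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            return doc.getElementsByTagName("text").item(0).getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<message><text>Hello, world!</text></message>";
        System.out.println(readText(xml)); // prints: Hello, world!
    }
}
```

The markup itself does not appear in the result; only the text that the tags surround is returned.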
Is XML a Programming Language?
A programming language consists of grammar rules and its own vocabulary, which are used to
create computer programs. These programs instruct a computer to perform specific tasks.
XML is not a programming language, as it does not perform any computation or algorithms.
It is usually stored in a simple text file and is processed by special software that is
capable of interpreting XML.
Tags and Elements
An XML file is structured as a set of XML elements, also called XML nodes or XML tags.
XML element names are enclosed in angle brackets < > as shown below:
<element>
Syntax Rules for Tags and Elements
Element Syntax: Each XML element needs to be closed, either with a matching pair of start
and end tags as shown below:
<element>....</element>
or, for an element with no content, just this way:
<element/>
Nesting of elements: An XML element can contain multiple XML elements as its children,
but the child elements must not overlap, i.e., an end tag of an element must have the same
name as that of the most recent unmatched start tag.
The following example shows incorrectly nested tags:
<?xml version="1.0"?>
<contact-info>
<company>IARE
</contact-info>
</company>
The following example shows correctly nested tags:
<?xml version="1.0"?>
<contact-info>
<company>IARE</company>
</contact-info>
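A parser enforces the nesting rule: overlapping tags are rejected as not well-formed. The following is a minimal sketch, assuming the JDK's built-in DOM parser; the class name NestingCheck and the helper isWellFormed are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class: checks whether a string is well-formed XML.
class NestingCheck {
    // Returns true when the parser accepts the text as well-formed XML.
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler keeps the parser from printing errors to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // the parser reported a well-formedness error
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // End tags overlap: </contact-info> arrives before </company>.
        String overlapping =
            "<contact-info><company>IARE</contact-info></company>";
        // The inner element is closed before the outer one.
        String nested =
            "<contact-info><company>IARE</company></contact-info>";
        System.out.println(isWellFormed(overlapping)); // prints: false
        System.out.println(isWellFormed(nested));      // prints: true
    }
}
```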
Let us learn about one of the most important parts of XML, the XML tags. XML tags form the
foundation of XML. They define the scope of an element in the XML. They can also be used to
insert comments, declare settings required for the parsing environment, and insert special
instructions.
We can broadly categorize XML tags as follows:
Start Tag
The beginning of every non-empty XML element is marked by a
start-tag. An example of start-tag is:
<address>
End Tag
Every element that has a start tag should end with an end-tag. An
example of end-tag is:
</address>
Note that the end tags include a solidus ("/") before the name of an
element.
Empty Tag
The text that appears between start-tag and end-tag is called content. An element which has
no content is termed as empty. An empty element can be represented in two ways as below:
(1) A start-tag immediately followed by an end-tag as shown below:
<hr></hr>
(2) A complete empty-element tag is as shown below:
<hr />
Empty-element tags may be used for any element which has no content.
XML Tags Rules
Following are the rules that need to be followed to use XML tags:
Rule 1
XML tags are case-sensitive. The following line is an example of wrong syntax: the end tag
</Address> differs in case from the start tag <address>, which XML treats as an error.
<address>This is wrong syntax</Address>
The following code shows the correct way, where the start and the end tag use the same case.
<address>This is correct syntax</address>
Rule 2
XML tags must be closed in an appropriate order, i.e., an XML tag opened inside another
element must be closed before the outer element is closed. For example:
<outer_element>
<internal_element>
This tag is closed before the outer_element
</internal_element>
</outer_element>
XML Elements
XML elements can be defined as the building blocks of an XML document. Elements can behave
as containers to hold text, elements, attributes, media objects or all of these.
Each XML document contains one or more elements, the scope of which is
either delimited by start and end tags, or for empty elements, by an empty-element
tag.
Syntax
Following is the syntax to write an XML element:
<element-name attribute1 attribute2>
....content
</element-name>
where
element-name is the name of the element. The name and its
case in the start and end tags must match.
attribute1, attribute2 are attributes of the element,
separated by white space. An attribute defines a property of the element. It
associates a name with a value, which is a string of characters. An attribute
is written as:
name = "value"
The name is followed by an = sign and a string value inside double (" ") or single (' ')
quotes.
Empty Element
An empty element (element with no content) has following syntax:
<name attribute1 attribute2.../>
Example of an XML document using various XML elements:
<?xml version="1.0"?>
<contact-info>
<address category="residence">
<name>Tanmay Patil</name>
<company>TutorialsPoint</company>
<phone>(011) 123-4567</phone>
</address>
</contact-info>
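Once such a document is parsed, the attribute and the child elements of an element can be read separately. The following is a minimal sketch, assuming the JDK's built-in DOM parser and a contact-info document in which the address element is properly closed with an end tag; the class name ElementDemo and the helper describe are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical demo class: reads an attribute and child elements of <address>.
class ElementDemo {
    // Returns "category: name, phone" for the first <address> element.
    static String describe(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            Element address = (Element) doc.getElementsByTagName("address").item(0);
            // The attribute value is read by name, separately from the content.
            String category = address.getAttribute("category");
            String name = address.getElementsByTagName("name").item(0).getTextContent();
            String phone = address.getElementsByTagName("phone").item(0).getTextContent();
            return category + ": " + name + ", " + phone;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\"?>"
            + "<contact-info>"
            + "<address category=\"residence\">"
            + "<name>Tanmay Patil</name>"
            + "<company>TutorialsPoint</company>"
            + "<phone>(011) 123-4567</phone>"
            + "</address>"
            + "</contact-info>";
        System.out.println(describe(xml));
        // prints: residence: Tanmay Patil, (011) 123-4567
    }
}
```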
XML Elements Rules
Following rules are required to be followed for XML elements:
An element name can contain any alphanumeric characters. The only punctuation
marks allowed in names are the hyphen (-), underscore (_) and period (.). A name cannot
start with a digit, hyphen or period.
Names are case sensitive. For example, Address, address, and ADDRESS are
different names.
The names in the start and end tags of an element must be identical.
An element, which is a container, can contain text or elements as seen in the above
example.
Root element: An XML document can have only one root element. For example, the following
is not a well-formed XML document, because both the x and y elements occur at the top level
without a root element:
<x>...</x>
<y>...</y>
The following example shows a correctly formed XML document:
<root>
<x>...</x>
<y>...</y>
</root>
Case sensitivity: The names of XML elements are case-sensitive. That means the names in
the start and the end tags must match exactly, including case.
For example, <contact-info> is different from <Contact-Info>.
XML DTD
What is a DTD?
A DTD is a Document Type Definition.
A DTD defines the structure and the legal elements and attributes of an XML document.
Why Use a DTD?
With a DTD, independent groups of people can agree on a standard DTD for interchanging data.
An application can use a DTD to verify that XML data is valid.
An Internal DTD Declaration
If the DTD is declared inside the XML file, it must be wrapped inside the <!DOCTYPE>
definition:
XML document with an internal DTD
<?xml version="1.0"?>
<!DOCTYPE note [
<!ELEMENT note (to,from,heading,body)>
<!ELEMENT to (#PCDATA)>
<!ELEMENT from (#PCDATA)>
<!ELEMENT heading (#PCDATA)>
<!ELEMENT body (#PCDATA)>
]>
<note>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend</body>
</note>
The DTD above is interpreted like this:
!DOCTYPE note defines that the root element of this document is note
!ELEMENT note defines that the note element must contain four elements:
"to,from,heading,body"
!ELEMENT to defines the to element to be of type "#PCDATA"
!ELEMENT from defines the from element to be of type "#PCDATA"
!ELEMENT heading defines the heading element to be of type "#PCDATA"
!ELEMENT body defines the body element to be of type "#PCDATA"
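An application can check a document against its internal DTD by switching on validation in the parser. The following is a minimal sketch, assuming the JDK's built-in DOM parser; the class name DtdValidation and the helpers withDtd and isValid are made up for illustration. Validity errors are recoverable by default, so the sketch installs an error handler that turns them into exceptions.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class: validates a note document against its internal DTD.
class DtdValidation {
    // Prepends the XML declaration and the internal DTD of the note example.
    static String withDtd(String body) {
        return "<?xml version=\"1.0\"?>\n"
            + "<!DOCTYPE note [\n"
            + "<!ELEMENT note (to,from,heading,body)>\n"
            + "<!ELEMENT to (#PCDATA)>\n"
            + "<!ELEMENT from (#PCDATA)>\n"
            + "<!ELEMENT heading (#PCDATA)>\n"
            + "<!ELEMENT body (#PCDATA)>\n"
            + "]>\n"
            + body;
    }

    // Returns true when the document is well-formed and valid against its DTD.
    static boolean isValid(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true); // turn on DTD validation
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new DefaultHandler() {
                @Override
                public void error(SAXParseException e) throws SAXException {
                    throw e; // treat validity errors as failures
                }
            });
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String complete = withDtd("<note><to>Tove</to><from>Jani</from>"
            + "<heading>Reminder</heading>"
            + "<body>Don't forget me this weekend</body></note>");
        // The body element required by the DTD is missing here.
        String incomplete = withDtd("<note><to>Tove</to><from>Jani</from>"
            + "<heading>Reminder</heading></note>");
        System.out.println(isValid(complete));   // prints: true
        System.out.println(isValid(incomplete)); // prints: false
    }
}
```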
An External DTD Declaration
If the DTD is declared in an external file, the <!DOCTYPE> definition must contain a reference
to the DTD file:
XML document with a reference to an external DTD
<?xml version="1.0"?>
<!DOCTYPE note SYSTEM "note.dtd">
<note>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>
And here is the file "note.dtd", which contains the DTD:
<!ELEMENT note (to,from,heading,body)>
<!ELEMENT to (#PCDATA)>
<!ELEMENT from (#PCDATA)>
<!ELEMENT heading (#PCDATA)>
<!ELEMENT body (#PCDATA)>
DTD - XML Building Blocks
The main building blocks of both XML and HTML documents are elements.
The Building Blocks of XML Documents
Seen from a DTD point of view, all XML documents are made up of the following building
blocks:
Elements
Attributes
Entities
PCDATA
CDATA
Elements:
Elements are the main building blocks of both XML and HTML documents.
Examples of HTML elements are "body" and "table". Examples of XML elements could be
"note" and "message". Elements can contain text, other elements, or be empty. Examples of
empty HTML elements are "hr", "br" and "img".
Examples:
<body>some text</body>
<message>some text</message>
Attributes:
Attributes provide extra information about elements.
Attributes are always placed inside the opening tag of an element. Attributes always come in
name/value pairs. The following "img" element has additional information about a source file:
<img src="computer.gif" />
The name of the element is "img". The name of the attribute is "src". The value of the attribute is
"computer.gif". Since the element itself is empty it is closed by a " /".
Entities:
Some characters have a special meaning in XML, like the less than sign (<) that defines the start
of an XML tag.
Most of you know the HTML entity: "&nbsp;". This "no-breaking-space" entity is used in
HTML to insert an extra space in a document. Entities are expanded when a document is parsed
by an XML parser.
The following entities are predefined in XML:
Entity Reference    Character
&lt;                <
&gt;                >
&amp;               &
&quot;              "
&apos;              '
PCDATA:
PCDATA means parsed character data.
Think of character data as the text found between the start tag and the end tag of an XML
element.
PCDATA is text that WILL be parsed by a parser. The text will be examined by the parser for
entities and markup.
Tags inside the text will be treated as markup and entities will be expanded.
However, parsed character data should not contain any &, <, or > characters; these need to be
represented by the &amp;, &lt; and &gt; entities, respectively.
CDATA
CDATA means character data.
CDATA is text that will NOT be parsed by a parser. Tags inside the text will NOT be treated as
markup and entities will not be expanded.
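The difference can be observed by parsing the same text once as ordinary element content and once inside a CDATA section, which is one place where an XML document carries unparsed character data. The following is a minimal sketch, assuming the JDK's built-in DOM parser; the class name EntityDemo, the helper textOf and the expr element are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class: compares parsed text with a CDATA section.
class EntityDemo {
    // Parses the XML string and returns the text content of the root element.
    static String textOf(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // Parsed character data: the entity references are expanded.
        System.out.println(textOf("<expr>a &lt; b &amp;&amp; b &gt; c</expr>"));
        // prints: a < b && b > c

        // CDATA section: raw < and & are allowed and taken literally.
        System.out.println(textOf("<expr><![CDATA[a < b && b > c]]></expr>"));
        // prints: a < b && b > c

        // Inside a CDATA section an entity reference is NOT expanded.
        System.out.println(textOf("<expr><![CDATA[&lt;b&gt;]]></expr>"));
        // prints: &lt;b&gt;
    }
}
```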
XML Schema
An XML Schema describes the structure of an XML document, just like a DTD.
An XML document with correct syntax is called "Well Formed".
An XML document validated against an XML Schema is both "Well Formed" and "Valid".
XML Schema
XML Schema is an XML-based alternative to DTD:
<xs:element name="note">
<xs:complexType>
<xs:sequence>
<xs:element name="to" type="xs:string"/>
<xs:element name="from" type="xs:string"/>
<xs:element name="heading" type="xs:string"/>
<xs:element name="body" type="xs:string"/>
</xs:sequence>
</xs:complexType>
</xs:element>
The Schema above is interpreted like this:
<xs:element name="note"> defines the element called "note"
<xs:complexType> the "note" element is a complex type
<xs:sequence> the complex type is a sequence of elements
<xs:element name="to" type="xs:string"> the element "to" is of type string (text)
<xs:element name="from" type="xs:string"> the element "from" is of type string
<xs:element name="heading" type="xs:string"> the element "heading" is of type string
<xs:element name="body" type="xs:string"> the element "body" is of type string
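A document can be checked against this schema with the validation API that ships with the JDK (javax.xml.validation). The following is a minimal sketch; the class name SchemaValidation and the helper isValid are made up for illustration. The schema fragment shown above is wrapped in an xs:schema root element that declares the xs namespace, which a complete schema file needs.

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.SAXException;

// Hypothetical demo class: validates a note document against an XML Schema.
class SchemaValidation {
    // The note schema, wrapped in the xs:schema root element.
    static final String XSD =
        "<xs:schema xmlns:xs=\"http://www.w3.org/2001/XMLSchema\">"
        + "<xs:element name=\"note\">"
        + "<xs:complexType><xs:sequence>"
        + "<xs:element name=\"to\" type=\"xs:string\"/>"
        + "<xs:element name=\"from\" type=\"xs:string\"/>"
        + "<xs:element name=\"heading\" type=\"xs:string\"/>"
        + "<xs:element name=\"body\" type=\"xs:string\"/>"
        + "</xs:sequence></xs:complexType>"
        + "</xs:element>"
        + "</xs:schema>";

    // Returns true when the document is valid against the note schema.
    static boolean isValid(String xml) {
        try {
            SchemaFactory factory =
                SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            Schema schema = factory.newSchema(new StreamSource(new StringReader(XSD)));
            Validator validator = schema.newValidator();
            // validate() throws a SAXException on the first validity error.
            validator.validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String inOrder = "<note><to>Tove</to><from>Jani</from>"
            + "<heading>Reminder</heading><body>Don't forget me</body></note>";
        // xs:sequence fixes the order, so swapping to and from is invalid.
        String swapped = "<note><from>Jani</from><to>Tove</to>"
            + "<heading>Reminder</heading><body>Don't forget me</body></note>";
        System.out.println(isValid(inOrder)); // prints: true
        System.out.println(isValid(swapped)); // prints: false
    }
}
```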
XML Schemas are More Powerful than DTD
XML Schemas are written in XML
XML Schemas are extensible to additions
XML Schemas support data types
XML Schemas support namespaces
Why Use an XML Schema?
With XML Schema, your XML files can carry a description of their own format.
With XML Schema, independent groups of people can agree on a standard for interchanging
data.
With XML Schema, you can verify data.
XML Schemas Support Data Types
One of the greatest strengths of XML Schemas is the support for data types:
It is easier to describe document content
It is easier to define restrictions on data
It is easier to validate the correctness of data
It is easier to convert data between different data types
XML Schemas use XML Syntax
Another great strength of XML Schemas is that they are written in XML:
You don't have to learn a new language
You can use your XML editor to edit your Schema files
You can use your XML parser to parse your Schema files
You can manipulate your Schemas with the XML DOM
You can transform your Schemas with XSLT
XML DOM
What is the DOM?
The DOM defines a standard for accessing and manipulating documents:
"The W3C Document Object Model (DOM) is a platform and language-neutral interface that
allows programs and scripts to dynamically access and update the content, structure, and style
of a document."
The HTML DOM defines a standard way for accessing and manipulating HTML documents. It
presents an HTML document as a tree-structure.
The XML DOM defines a standard way for accessing and manipulating XML documents. It
presents an XML document as a tree-structure.
Understanding the DOM is a must for anyone working with HTML or XML.
The HTML DOM
All HTML elements can be accessed through the HTML DOM.
This example changes the value of an HTML element with id="demo":
Example
<h1 id="demo">This is a Heading</h1>
<script>
document.getElementById("demo").innerHTML = "Hello World!";
</script>
This example changes the value of the first <h1> element in an HTML document:
Example
<h1>This is a Heading</h1>
<h1>This is a Heading</h1>
<script>
document.getElementsByTagName("h1")[0].innerHTML = "Hello World!";
</script>
Note: Even if the HTML document contains only ONE <h1> element you still have to specify
the index [0], because the getElementsByTagName() method always returns a collection of
elements (an array-like HTMLCollection), never a single element.
The XML DOM
All XML elements can be accessed through the XML DOM.
The XML DOM is:
A standard object model for XML
A standard programming interface for XML
Platform- and language-independent
A W3C standard
In other words: The XML DOM is a standard for how to get, change, add, or delete XML
elements.
Get the Value of an XML Element
This code retrieves the text value of the first <title> element in an XML document:
Example
txt = xmlDoc.getElementsByTagName("title")[0].childNodes[0].nodeValue;
Loading an XML File
This example reads "books.xml" into xmlDoc and retrieves the text value of the first <title>
element in books.xml:
Example
<!DOCTYPE html>
<html>
<body>
<p id="demo"></p>
<script>
var xhttp = new XMLHttpRequest();
xhttp.onreadystatechange = function() {
if (this.readyState == 4 && this.status == 200) {
myFunction(this);
}
};
xhttp.open("GET", "books.xml", true);
xhttp.send();
function myFunction(xml) {
var xmlDoc = xml.responseXML;
document.getElementById("demo").innerHTML =
xmlDoc.getElementsByTagName("title")[0].childNodes[0].nodeValue;
}
</script>
</body>
</html>
Example Explained
xmlDoc - the XML DOM object created by the parser.
getElementsByTagName("title")[0] - get the first <title> element
childNodes[0] - the first child of the <title> element (the text node)
nodeValue - the value of the node (the text itself)
Loading an XML String
This example loads a text string into an XML DOM object, and extracts the info from it with
JavaScript:
Example
<html>
<body>
<p id="demo"></p>
<script>
var text, parser, xmlDoc;
text = "<bookstore><book>" +
"<title>Everyday Italian</title>" +
"<author>Giada De Laurentiis</author>" +
"<year>2005</year>" +
"</book></bookstore>";
parser = new DOMParser();
xmlDoc = parser.parseFromString(text,"text/xml");
document.getElementById("demo").innerHTML =
xmlDoc.getElementsByTagName("title")[0].childNodes[0].nodeValue;
</script>
</body>
</html>
Programming Interface
The DOM models XML as a set of node objects. The nodes can be accessed with JavaScript or
other programming languages. In this tutorial we use JavaScript.
The programming interface to the DOM is defined by a set of standard properties and methods.
Properties are often referred to as something that is (e.g. nodename is "book").
Methods are often referred to as something that is done (e.g. delete "book").
XML DOM Properties
These are some typical DOM properties:
x.nodeName - the name of x
x.nodeValue - the value of x
x.parentNode - the parent node of x
x.childNodes - the child nodes of x
x.attributes - the attribute nodes of x
Note: In the list above, x is a node object.
XML DOM Methods
x.getElementsByTagName(name) - get all elements with a specified tag name
x.appendChild(node) - append a child node to x
x.removeChild(node) - remove a child node from x
Note: In the list above, x is a node object.
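The same properties and methods are available from Java through the org.w3c.dom interfaces, where the properties appear as getter methods such as getNodeName() and getParentNode(). The following is a minimal sketch, assuming the JDK's built-in DOM parser; the class name DomApiDemo and its helper methods are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical demo class: exercises typical DOM properties and methods.
class DomApiDemo {
    static Document parse(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    // Uses nodeName, parentNode and nodeValue on the first <title> element.
    static String describeTitle(Document doc) {
        Node title = doc.getElementsByTagName("title").item(0);
        return title.getNodeName()
            + " in " + title.getParentNode().getNodeName()
            + " = " + title.getFirstChild().getNodeValue();
    }

    // Appends a <year> child to <book>, removes <title>, and returns
    // the number of child nodes that <book> has afterwards.
    static int replaceTitleWithYear(Document doc) {
        Node title = doc.getElementsByTagName("title").item(0);
        Node book = title.getParentNode();
        Element year = doc.createElement("year");
        year.setTextContent("2005");
        book.appendChild(year);
        book.removeChild(title);
        return book.getChildNodes().getLength();
    }

    public static void main(String[] args) {
        Document doc = parse(
            "<bookstore><book><title>Everyday Italian</title></book></bookstore>");
        System.out.println(describeTitle(doc));
        // prints: title in book = Everyday Italian
        System.out.println(replaceTitleWithYear(doc));
        // prints: 1
    }
}
```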
The sample XML considered in the examples is:
<employees>
<employee id="111">
<firstName>Rakesh</firstName>
<lastName>Mishra</lastName>
<location>Bangalore</location>
</employee>
<employee id="112">
<firstName>John</firstName>
<lastName>Davis</lastName>
<location>Chennai</location>
</employee>
<employee id="113">
<firstName>Rajesh</firstName>
<lastName>Sharma</lastName>
<location>Pune</location>
</employee>
</employees>
And the object into which the XML content is to be extracted is defined as below:
class Employee {
    String id;
    String firstName;
    String lastName;
    String location;

    @Override
    public String toString() {
        return firstName+" "+lastName+"("+id+")"+location;
    }
}
There are 3 main parsers for which I have given sample code:
DOM Parser
SAX Parser
StAX Parser
Using DOM Parser
I am making use of the DOM parser implementation that comes with the JDK and in my
example I am using JDK 7. The DOM Parser loads the complete XML content into a tree
structure, and we iterate through the Node and NodeList objects to get the content of the
XML. The code for XML parsing using DOM parser is given below.
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class DOMParserDemo {
    public static void main(String[] args) throws Exception {
        //Get the DOM Builder Factory
        DocumentBuilderFactory factory =
            DocumentBuilderFactory.newInstance();
        //Get the DOM Builder
        DocumentBuilder builder = factory.newDocumentBuilder();
        //Load and Parse the XML document
        //document contains the complete XML as a Tree.
        Document document = builder.parse(
            ClassLoader.getSystemResourceAsStream("xml/employee.xml"));
        List<Employee> empList = new ArrayList<>();
        //Iterating through the nodes and extracting the data.
        NodeList nodeList = document.getDocumentElement().getChildNodes();
        for (int i = 0; i < nodeList.getLength(); i++) {
            Node node = nodeList.item(i);
            //Skip whitespace text nodes; an Element here is an <employee> tag.
            if (node instanceof Element) {
                Employee emp = new Employee();
                emp.id = node.getAttributes().
                    getNamedItem("id").getNodeValue();
                NodeList childNodes = node.getChildNodes();
                for (int j = 0; j < childNodes.getLength(); j++) {
                    Node cNode = childNodes.item(j);
                    //Identifying the child tag of employee encountered.
                    if (cNode instanceof Element) {
                        //getTextContent() also works when the element is empty.
                        String content = cNode.getTextContent().trim();
                        switch (cNode.getNodeName()) {
                            case "firstName":
                                emp.firstName = content;
                                break;
                            case "lastName":
                                emp.lastName = content;
                                break;
                            case "location":
                                emp.location = content;
                                break;
                        }
                    }
                }
                empList.add(emp);
            }
        }
        //Printing the Employee list populated.
        for (Employee emp : empList) {
            System.out.println(emp);
        }
    }
}

class Employee {
    String id;
    String firstName;
    String lastName;
    String location;

    @Override
    public String toString() {
        return firstName+" "+lastName+"("+id+")"+location;
    }
}
The output for the above will be:
Rakesh Mishra(111)Bangalore
John Davis(112)Chennai
Rajesh Sharma(113)Pune
Using SAX Parser
The SAX Parser differs from the DOM Parser in that it does not load the complete XML into
memory. Instead, it reads the XML sequentially, triggering different events as and when it
encounters different constructs like: opening tag, closing tag, character data, comments
and so on. This is the reason why the SAX Parser is called an event-based parser.
Along with the XML source file, we also register a handler which extends the DefaultHandler
class. The DefaultHandler class provides different callbacks, out of which we are interested
in:
startElement() - called when the start of a tag is encountered.
endElement() - called when the end of a tag is encountered.
characters() - called when some text data is encountered.
The code for parsing the XML using SAX Parser is given below:
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class SAXParserDemo {
    public static void main(String[] args) throws Exception {
        SAXParserFactory parserFactor = SAXParserFactory.newInstance();
        SAXParser parser = parserFactor.newSAXParser();
        SAXHandler handler = new SAXHandler();
        parser.parse(ClassLoader.getSystemResourceAsStream("xml/employee.xml"),
            handler);
        //Printing the list of employees obtained from XML
        for (Employee emp : handler.empList) {
            System.out.println(emp);
        }
    }
}

/**
 * The Handler for SAX Events.
 */
class SAXHandler extends DefaultHandler {
    List<Employee> empList = new ArrayList<>();
    Employee emp = null;
    //Collects text, since characters() may be called more than once per element.
    StringBuilder content = new StringBuilder();

    //Triggered when the start of tag is found.
    @Override
    public void startElement(String uri, String localName,
            String qName, Attributes attributes)
            throws SAXException {
        //Discard any text collected before this tag.
        content.setLength(0);
        switch (qName) {
            //Create a new Employee object when the start tag is found
            case "employee":
                emp = new Employee();
                emp.id = attributes.getValue("id");
                break;
        }
    }

    //Triggered when the end of tag is found.
    @Override
    public void endElement(String uri, String localName,
            String qName) throws SAXException {
        String text = content.toString().trim();
        switch (qName) {
            //Add the employee to list once end tag is found
            case "employee":
                empList.add(emp);
                break;
            //For all other end tags the employee has to be updated.
            case "firstName":
                emp.firstName = text;
                break;
            case "lastName":
                emp.lastName = text;
                break;
            case "location":
                emp.location = text;
                break;
        }
        content.setLength(0);
    }

    //Triggered when text data is found.
    @Override
    public void characters(char[] ch, int start, int length)
            throws SAXException {
        content.append(ch, start, length);
    }
}

class Employee {
    String id;
    String firstName;
    String lastName;
    String location;

    @Override
    public String toString() {
        return firstName + " " + lastName + "(" + id + ")" + location;
    }
}
The output for the above would be:
Rakesh Mishra(111)Bangalore
John Davis(112)Chennai
Rajesh Sharma(113)Pune
With this I have covered parsing the same XML document and performing the same task of
populating the list of Employee objects using all the three parsers namely:
DOM Parser
SAX Parser