2. XML
• XML stands for eXtensible Markup Language.
• XML is a markup language much like HTML.
• XML was designed to describe data.
• XML tags are not predefined. You must define
your own tags.
• XML can use a Document Type Definition (DTD)
or an XML Schema to describe the structure of the data.
• XML with a DTD or XML Schema is designed to
be self-descriptive.
4/28/2024 2
GAGAN THAKRAL(ABESEC)
3. XML
• The best description of XML is this: XML is a
cross-platform, software- and hardware-independent
tool for transmitting information.
4. XML-Example
XML document : (file name: “xml_note.xml”)
<?xml version="1.0" encoding="ISO-8859-1"?>
<note>
<to>Aman</to>
<from>Raman</from>
<header>Reminder</header>
<body>Don't forget me this
weekend!</body>
</note>
5. Another Example
<?xml version="1.0" encoding="ISO-8859-1"?>
<bookstore>
<book category="COOKING">
<title lang="en">North Indian Food</title>
<author>Dr. Ram Parkash</author>
<year>2005</year>
<price>30.00</price>
</book>
<book category="CHILDREN">
<title lang="en">Harry Potter</title>
<author>J. K. Rowling</author>
<year>2005</year>
<price>29.99</price>
</book>
<!-- more book elements -->
</bookstore>
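A document like this can be read from a program. The following is a minimal sketch using Python's standard `xml.etree.ElementTree` module. The string `BOOKSTORE_XML` repeats the two books shown above, and the helper name `list_books` is an assumption made for illustration.

```python
import xml.etree.ElementTree as ET

# The two <book> entries from the slide, without the elided part.
BOOKSTORE_XML = """<bookstore>
  <book category="COOKING">
    <title lang="en">North Indian Food</title>
    <author>Dr. Ram Parkash</author>
    <year>2005</year>
    <price>30.00</price>
  </book>
  <book category="CHILDREN">
    <title lang="en">Harry Potter</title>
    <author>J. K. Rowling</author>
    <year>2005</year>
    <price>29.99</price>
  </book>
</bookstore>"""


def list_books(xml_text):
    """Return (category, title, price) for every <book> element."""
    root = ET.fromstring(xml_text)
    books = []
    for book in root.findall("book"):
        category = book.get("category")        # attribute value
        title = book.findtext("title")         # text of a child element
        price = float(book.findtext("price"))  # element text is a string; convert it
        books.append((category, title, price))
    return books


for category, title, price in list_books(BOOKSTORE_XML):
    print(category, title, price)
```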
6. The Main Differences
Between XML and HTML
– XML was designed to carry data.
– XML is not a replacement for HTML.
– XML and HTML were designed with different
goals:
• XML was designed to describe data and to focus on
what data is.
• HTML was designed to display data and to focus on
how data looks.
– HTML is about displaying information, while XML
is about describing information.
7. Advantages of Using XML
• Truly Portable Data
• Easily readable by human users
• Very expressive
• Very flexible and customizable
• Easy to use from programs (libs available)
• Easy to convert into other representations
• Many additional standards and tools
• Widely used and supported
8. XML Encoding
• XML documents can contain international
characters, like Norwegian æøå, or French
êèé.
• To avoid errors, you should specify the
encoding used, or save your XML files as UTF-8.
• Character encoding defines a unique binary
code for each different character used in a
document.
• In computer terms, a character encoding is
also called a character set, character map, code
set, or code page.
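The effect of declaring the encoding can be seen with a small experiment. This is a sketch using Python's standard `xml.etree.ElementTree`. The `<note>` wrapper and the variable names are assumptions for illustration. The same ISO-8859-1 bytes are parsed once with an encoding declaration and once without.

```python
import xml.etree.ElementTree as ET

# The three Norwegian letters mentioned above, written as escapes.
text = "\u00e6\u00f8\u00e5"

# Same content saved as ISO-8859-1 bytes, with and without a declaration.
declared = (
    '<?xml version="1.0" encoding="ISO-8859-1"?><note>' + text + "</note>"
).encode("iso-8859-1")
undeclared = ("<note>" + text + "</note>").encode("iso-8859-1")

# With the declaration, the parser decodes the bytes correctly.
parsed = ET.fromstring(declared)
print(ascii(parsed.text))

# Without it, the parser assumes UTF-8, and these bytes are not valid UTF-8.
try:
    ET.fromstring(undeclared)
    undeclared_ok = True
except ET.ParseError as err:
    undeclared_ok = False
    print("parse error:", err)
```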
10. • The Unicode Standard has become a success
and is implemented in HTML, XML, Java,
JavaScript, E-mail, ASP, PHP, etc.
• The Unicode standard is also supported in
many operating systems and all modern
browsers.
• The Unicode Consortium cooperates with the
leading standards development organizations,
like ISO, W3C, and ECMA.
11. • UTF-8 uses 1 byte (8-bits) to represent basic
Latin characters, and two, three, or four bytes
for the rest.
• UTF-8 is the standard character encoding on
the web.
• UTF-8 is the default character encoding for
XML when no encoding is declared, and the
recommended encoding for HTML5 and CSS.
• UTF-16 uses 2 bytes (16 bits) for most
characters, and four bytes for the rest.
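The byte counts above can be checked directly. This is a minimal Python sketch. The four sample characters are chosen here for illustration and are not taken from the slides.

```python
# Sample characters (illustrative assumptions): Latin A, e-acute, euro sign, an emoji.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]


def encoded_length(ch, encoding):
    """Return how many bytes one character occupies in the given encoding."""
    return len(ch.encode(encoding))


for ch in samples:
    # "utf-16-le" is used so that no byte-order mark is added to the count.
    print(
        f"U+{ord(ch):04X}",
        "UTF-8:", encoded_length(ch, "utf-8"),
        "UTF-16:", encoded_length(ch, "utf-16-le"),
    )
```

Basic Latin takes one byte in UTF-8 and the other samples take two, three, and four. In UTF-16 the first three take two bytes and the emoji takes four.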
12. A Simple XML Document
<article>
<author>Gerhard Weikum</author>
<title>The Web in Ten Years</title>
<text>
<abstract>In order to evolve...</abstract>
<section number="1" title="Introduction">
The <index>Web</index> provides the
universal...
</section>
</text>
</article>
14. Elements in XML Documents
• (Freely definable) tags: article, title, author
– with start tag: <article> etc.
– and end tag: </article> etc.
• Elements: <article> ... </article>
• Elements have a name (article) and content (...)
• Elements may be nested.
• Elements may be empty: <this_is_empty/>
• Each XML document has exactly one root element and forms
a tree.
• Elements with a common parent are ordered.
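These properties can be observed by parsing the article example. This is a sketch using Python's standard `xml.etree.ElementTree`. The `<this_is_empty/>` child is added to the article here purely for illustration.

```python
import xml.etree.ElementTree as ET

ARTICLE_XML = """<article>
  <author>Gerhard Weikum</author>
  <title>The Web in Ten Years</title>
  <text>
    <abstract>In order to evolve...</abstract>
    <section number="1" title="Introduction">
      The <index>Web</index> provides the universal...
    </section>
  </text>
  <this_is_empty/>
</article>"""

root = ET.fromstring(ARTICLE_XML)

# Exactly one root element, identified by its name.
print(root.tag)

# Children of a common parent keep their document order.
child_names = [child.tag for child in root]
print(child_names)

# Elements nest: <index> sits inside <section> inside <text>.
index = root.find("text/section/index")
print(index.text)

# An empty element has no text and no children.
empty = root.find("this_is_empty")
print(empty.text, len(empty))
```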
15. Elements vs. Attributes
Elements may have attributes (in the start tag) that have a name and
a value, e.g. <section number="1">.
What is the difference between elements and attributes?
• Only one attribute with a given name per element (but an arbitrary
number of subelements)
• Attributes have no structure, simply strings (while elements can have
subelements)
As a rule of thumb:
• Content into elements
• Metadata into attributes
Example:
<person born="1912-06-23" died="1954-06-07">
Abc</person> proved that…
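The difference shows up in how a parser exposes the two. This is a sketch with Python's standard `xml.etree.ElementTree`. The `<team>` document is a made-up example, not part of the slides.

```python
import xml.etree.ElementTree as ET

person = ET.fromstring('<person born="1912-06-23" died="1954-06-07">Abc</person>')

print(person.text)    # element content
print(person.attrib)  # attributes: a flat mapping of name -> string
print(type(person.get("born")).__name__)  # attribute values are plain strings

# Subelements, unlike attributes, may repeat under the same name.
team = ET.fromstring("<team><member>A</member><member>B</member></team>")
members = [m.text for m in team.findall("member")]
print(members)
```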
16. XML Documents as Ordered Trees
article
  author
  title
    "The Web in 10 years"
  text
    abstract
      "In order …"
    section (number="1", title="…")
      "The"
      index
        "Web"
      "provides …"
17. Well-Formed XML Documents
A well-formed document must adhere to, among others, the
following rules:
• Every start tag has a matching end tag.
• Elements may nest, but must not overlap.
• There must be exactly one root element.
• Attribute values must be quoted.
• An element may not have two attributes with the same
name.
• Comments and processing instructions may not appear
inside tags.
Only well-formed documents can
be processed by XML parsers.
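A parser's reaction to rule violations can be sketched as follows, using Python's standard `xml.etree.ElementTree`. The helper name `is_well_formed` and the tiny test documents are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the parser accepts the text, False if it rejects it."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# One small document per rule; only the first one follows all of them.
CASES = {
    "well-formed": '<a x="1"><b/></a>',
    "missing end tag": "<a><b></a>",
    "overlapping elements": "<a><b><c></b></c></a>",
    "two root elements": "<a/><b/>",
    "unquoted attribute": "<a x=1/>",
    "duplicate attribute": '<a x="1" x="2"/>',
}

for label, document in CASES.items():
    print(f"{label}: {is_well_formed(document)}")
```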
19. XML is not…
• A replacement for HTML
(but HTML can be generated from XML)
• A presentation format
(but XML can be converted into one)
• A programming language
(but it can be used with almost any language)
• A network transfer protocol
(but XML may be transferred over a network)
• A database
(but XML may be stored in a database)
20. Conversion of XML into Tree
<?xml version="1.0"?>
<address>
<name>
<first>Shiva</first>
<last>Singh</last>
</name>
<email>shivasingh@gmail.com</email>
<phone>9999999999</phone>
<birthday>
<year>1991</year>
<month>03</month>
<day>11</day>
</birthday>
</address>
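The conversion into a tree can be sketched in a few lines of Python with the standard `xml.etree.ElementTree` module. The function name `tree_lines` and the indented text layout are assumptions chosen for illustration.

```python
import xml.etree.ElementTree as ET

ADDRESS_XML = """<?xml version="1.0"?>
<address>
  <name>
    <first>Shiva</first>
    <last>Singh</last>
  </name>
  <email>shivasingh@gmail.com</email>
  <phone>9999999999</phone>
  <birthday>
    <year>1991</year>
    <month>03</month>
    <day>11</day>
  </birthday>
</address>"""


def tree_lines(element, depth=0):
    """Return one indented line per node: the element name, plus text for leaves."""
    text = (element.text or "").strip()
    label = element.tag + (f": {text}" if text else "")
    lines = ["  " * depth + label]
    for child in element:  # children are visited in document order
        lines.extend(tree_lines(child, depth + 1))
    return lines


lines = tree_lines(ET.fromstring(ADDRESS_XML))
print("\n".join(lines))
```

Each level of indentation corresponds to one level of nesting, with `address` as the root.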
21. • A well-formed XML document has a tree
structure and obeys all the XML rules.
• A particular application may add more rules in
either a DTD (document type definition) or in
a schema.
• Many specialized DTDs and schemas have
been created to describe particular areas.
22. Document Type Definitions
• A DTD describes the tree structure of a
document and something about its data.
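• There are two data types, PCDATA and CDATA.
– PCDATA is parsed character data: text that the
parser scans for markup and entity references.
– CDATA is plain character data; in a DTD it is the
type of string-valued attributes.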
• A DTD determines how many times a node
may appear, and how child nodes are ordered.
23. DTD for address Example
<!ELEMENT address (name, email, phone, birthday)>
<!ELEMENT name (first, last)>
<!ELEMENT first (#PCDATA)>
<!ELEMENT last (#PCDATA)>
<!ELEMENT email (#PCDATA)>
<!ELEMENT phone (#PCDATA)>
<!ELEMENT birthday (year, month, day)>
<!ELEMENT year (#PCDATA)>
<!ELEMENT month (#PCDATA)>
<!ELEMENT day (#PCDATA)>
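What these declarations require can be illustrated with a small check. Python's standard library parsers do not validate against DTDs, so the sketch below is not a DTD validator. It mirrors the declarations by hand in a dictionary (`CONTENT_MODEL`, a hypothetical name) and only handles the two forms used here: a fixed sequence of children and `(#PCDATA)`.

```python
import xml.etree.ElementTree as ET

# Hand-written mirror of the DTD (an assumption for illustration):
# a tuple lists the required children in order; None stands for (#PCDATA).
CONTENT_MODEL = {
    "address": ("name", "email", "phone", "birthday"),
    "name": ("first", "last"),
    "birthday": ("year", "month", "day"),
    "first": None, "last": None, "email": None, "phone": None,
    "year": None, "month": None, "day": None,
}


def conforms(element):
    """Check an element and its descendants against CONTENT_MODEL."""
    if element.tag not in CONTENT_MODEL:
        return False                  # element not declared
    expected = CONTENT_MODEL[element.tag]
    children = list(element)
    if expected is None:              # (#PCDATA): text only, no child elements
        return not children
    if tuple(child.tag for child in children) != expected:
        return False                  # wrong children, count, or order
    return all(conforms(child) for child in children)


good = ET.fromstring("<name><first>Shiva</first><last>Singh</last></name>")
bad = ET.fromstring("<name><last>Singh</last><first>Shiva</first></name>")
print(conforms(good), conforms(bad))
```

The second document contains the right elements but in the wrong order, which the `(first, last)` content model does not allow.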
24. Schemas
• Schemas are themselves XML documents.
• They were standardized after DTDs and provide more
information about the document.
• They have a number of data types including string,
decimal, integer, boolean, date, and time.
• They divide elements into simple and complex types.
• They also determine the tree structure and how
many children a node may have.
26. XML Parsers
• An XML parser is a software library or package
that provides interfaces for client applications
to work with an XML document.
• An XML parser reads the XML and gives programs
a way to use its contents.
• An XML parser checks that the document is
well-formed; a validating parser also checks it
against a DTD or schema.
27. Let's understand the working of an
XML parser with the figure given
below:
28. Types of XML Parsers
• These are the two main types of XML Parsers:
1. DOM
2. SAX
29. DOM (Document Object Model)
• A DOM document is an object which contains
all the information of an XML document. It is
organized as a tree structure.
• A DOM parser implements the DOM API. This
API is very simple to use.
30. Features of DOM Parser
• A DOM parser builds an internal structure in
memory, the DOM document object. Client
applications get information about the
original XML document by invoking methods
on this document object.
• A DOM parser has a tree-based structure.
31. Advantages of DOM Parser
1) It supports both read and write operations
and the API is very simple to use.
2) It is preferred when random access to widely
separated parts of a document is required.
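Both points can be sketched with `xml.dom.minidom`, a DOM implementation in Python's standard library. The shortened address document and the added `<city>` element (with its placeholder text) are assumptions made for this demonstration.

```python
from xml.dom import minidom

ADDRESS_XML = """<address>
  <name><first>Shiva</first><last>Singh</last></name>
  <email>shivasingh@gmail.com</email>
  <phone>9999999999</phone>
</address>"""

# The whole document is parsed into an in-memory tree.
doc = minidom.parseString(ADDRESS_XML)

# Random access: jump to widely separated parts of the tree, in any order.
phone = doc.getElementsByTagName("phone")[0]
first = doc.getElementsByTagName("first")[0]
print(phone.firstChild.data, first.firstChild.data)

# Write access: change existing text and append a new element.
phone.firstChild.data = "8888888888"
city = doc.createElement("city")  # hypothetical element, not in the original
city.appendChild(doc.createTextNode("Example City"))
doc.documentElement.appendChild(city)

result = doc.documentElement.toxml()
print(result)
```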
32. Disadvantages of DOM Parser
• It is memory inefficient (it consumes more
memory because the whole XML document
needs to be loaded into memory).
• It is comparatively slower than other parsers.
33. SAX (Simple API for XML)
• A SAX parser implements the SAX API. This API is
event-based and less intuitive than the DOM API.
34. Features of SAX Parser
• It does not create any internal structure.
• Clients do not call parsing methods themselves;
they override the callback methods of the API and
place their own code inside those methods.
• It is an event-based parser; it works like an
event handler in Java.
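The callback style can be sketched with `xml.sax` from Python's standard library. The handler class name `EventRecorder` and the short input document are assumptions chosen for illustration.

```python
import xml.sax


class EventRecorder(xml.sax.ContentHandler):
    """Records the events the parser reports; no tree is ever built."""

    def __init__(self):
        super().__init__()
        self.events = []
        self._buffer = []

    def startElement(self, name, attrs):
        # Called by the parser when a start tag is read.
        self._buffer = []
        self.events.append(("start", name))

    def characters(self, content):
        # Text may arrive in several pieces, so it is collected first.
        self._buffer.append(content)

    def endElement(self, name):
        # Called by the parser when an end tag is read.
        text = "".join(self._buffer).strip()
        if text:
            self.events.append(("text", text))
        self._buffer = []
        self.events.append(("end", name))


handler = EventRecorder()
xml.sax.parseString(b"<name><first>Shiva</first><last>Singh</last></name>", handler)

for event in handler.events:
    print(event)
```

The client code never asks the parser for data. The parser calls the overridden methods as it moves through the document.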
35. Advantages of SAX Parser
• It is simple and memory efficient.
• It is very fast and works for huge documents.
36. Disadvantages of SAX Parser
• It is event-based, so its API is less intuitive.
• Clients never see the whole document at once
because the data is delivered in pieces.