Data interchange integration. HTML, XML, Biological XML, DTD
2. Data Integration in the Life Sciences
Much unintegrated data:
• from a variety of incompatible sources
• no standard naming convention
• each with a custom browsing and querying mechanism (no common interface)
• and poor interaction with other data sources
3. Approaches to Integration
• Accessing the original data sources
• Handling redundant as well as missing data
• Normalizing analytical data from different data sources
• Conforming terminology to industry standards
• Accessing the integrated data as a single logical repository
• Metadata (used to traverse domains)
4. XML For Bioinformatics
• Biology is a complex discipline
• Wide variety of data resources and repositories
• Biological data is represented in multiple formats, e.g. FASTA, GFF, etc.
• No standard protocol exists to interrogate biological data stores
• Data Interchange
• EMBL format
• ASN.1
• XML
5. Why XML
• Data in incompatible formats
• Difficulties in Exchanging data
• Software and hardware independent way of sharing data
• XML used to store and display data
• With XML, data is available to more users
6. XML
• Allows uniform description of data and metadata
• Metadata described through DTDs (Document Type Definition)
• Data conforms to metadata description
• Provides an open, standards-based solution for data integration between components
• Lots of support in Computer Science community (modules developed)
• XML::CGI - a module to convert CGI parameters to and from XML
• XML::DOM - a Perl extension to XML::Parser. It adds a new 'Style' to XML::Parser, called 'Dom', that allows XML::Parser to build an object-oriented data structure with a DOM Level 1 compliant interface.
• XML::Dumper - a simple package to experiment with converting Perl data structures to XML and converting XML to Perl data structures.
• XML::Encoding - a subclass of XML::Parser, parses encoding map XML files.
• XML::Generator is an extremely simple module to help in the generation of XML.
• XML::Grove - provides simple objects for parsed XML documents. The objects may be modified but no checking is performed.
• XML::Parser - a Perl extension interface to James Clark's XML parser, expat
• XML::QL - an early implementation of a note published by the W3C called "XML-QL: A Query Language for XML".
• XML::XQL - a Perl extension that allows you to perform XQL queries on XML object trees.
7. How the Web is
• HTML documents
• all intended for human consumption
• many generated automatically by applications
Easy to fetch any Web page, from any server, any platform
8. Limits of the Web
• applications cannot easily consume HTML
• HTML wrapper technology is brittle
• need interoperability fast
9. Paradigm Shift on the Web
• new Web standard XML:
• XML generated by applications
• XML consumed by applications
• data exchange
• across platforms: enterprise interoperability
• across enterprises
Web: from collection of documents to data and documents
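A minimal sketch of "XML generated by applications, XML consumed by applications", assuming Python's standard xml.etree.ElementTree module. The <record> element and the two function names are hypothetical, chosen only for illustration; the field values are borrowed from the sequence example later in this deck.

```python
import xml.etree.ElementTree as ET


def export_record(record):
    """Producer side: turn a flat dict into an XML string."""
    root = ET.Element("record")  # hypothetical root element name
    for field, value in record.items():
        child = ET.SubElement(root, field)
        child.text = str(value)
    return ET.tostring(root, encoding="unicode")


def import_record(xml_text):
    """Consumer side: rebuild the dict from the XML string."""
    root = ET.fromstring(xml_text)
    return {child.tag: child.text for child in root}


sent = {"database": "SWISS-PROT", "unique_id": "P09651"}
wire = export_record(sent)       # what travels between the two applications
received = import_record(wire)   # what the second application reconstructs

print(wire)
print(received == sent)
```

The two functions share nothing except the XML text, which is the point: producer and consumer could run on different platforms or in different languages.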
10. What is XML
• XML stands for eXtensible Markup Language
• XML is a markup language much like HTML
• XML was designed to store and transport data
• XML was designed to be self-descriptive
• XML is a W3C Recommendation
• It is a hierarchical data description language
• XML was designed to describe data and focus on what data is.
• Derived from SGML (Standard Generalized Markup Language), but simpler to use than
SGML
• Documents have tags giving extra information about sections of the document
• E.g. <title> XML </title> <slide> Introduction …</slide>
• Extensible, unlike HTML
• Users can add new tags, and separately specify how the tag should be handled for display
11. What is a DTD
• DTD stands for Document Type Definition.
• A DTD defines the structure and the legal elements and attributes of
an XML document.
• Valid XML Documents
• A "valid" XML document is "well formed" and also conforms to
the rules of a DTD.
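The difference between "well formed" and "valid" can be sketched in Python. This is only an illustration: the standard xml.etree.ElementTree module checks well-formedness but does not validate against a DTD, so the function is_valid_book below hand-codes one hypothetical DTD rule instead of reading a real DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical content rule, written the way a DTD would state it:
#   <!ELEMENT book (title, author+, publisher, year)>


def is_well_formed(xml_text):
    """True if the parser accepts the text (matching, properly nested tags)."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


def is_valid_book(xml_text):
    """Hand-coded stand-in for DTD validation of a single <book> element."""
    if not is_well_formed(xml_text):
        return False
    book = ET.fromstring(xml_text)
    tags = [child.tag for child in book]
    if book.tag != "book" or len(tags) < 4:
        return False
    authors = tags[1:-2]  # everything between title and publisher
    return (tags[0] == "title"
            and all(tag == "author" for tag in authors)
            and tags[-2:] == ["publisher", "year"])


GOOD = ("<book><title>T</title><author>A</author>"
        "<publisher>P</publisher><year>1995</year></book>")
NO_AUTHOR = "<book><title>T</title><publisher>P</publisher><year>1995</year></book>"
BROKEN = "<book><title>T</book>"

for sample in (GOOD, NO_AUTHOR, BROKEN):
    print(is_well_formed(sample), is_valid_book(sample))
```

NO_AUTHOR is well formed but not valid under the assumed rule, while BROKEN is not even well formed. A real project would use a validating parser rather than hand-written checks.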
12. Features of XML
• XML is an easy and automatically parseable way to describe data
• More flexible and adaptable information identification.
• XML is extensible
13. How does XML differ from HTML?
• HTML is a presentation markup language; its tags say how content should
look, not what the content means.
• There is a single standard definition of all the tags used in HTML.
• XML tags describe the content itself; presentation is specified
separately (for example with a style sheet).
• XML relies on separate documents, such as DTDs, to define which tags a document may use.
14. HTML
<h1> Bibliography </h1>
<p> <i> Foundations of Databases
</i>
Abiteboul, Hull, Vianu
<br> Addison Wesley, 1995
<p> <i> Data on the Web </i>
Abiteboul, Buneman, Suciu
<br> Morgan Kaufmann,
1999
• <!DOCTYPE html>
<html>
<head>
<title>Page Title</title>
</head>
<body>
<h1>This is a Heading</h1>
<p>This is a paragraph.</p>
</body>
</html>
15. XML
<bibliography>
<book> <title> Foundations… </title>
<author> Abiteboul </author>
<author> Hull </author>
<author> Vianu </author>
<publisher> Addison Wesley </publisher>
<year> 1995 </year>
</book>
…
</bibliography>
XML describes the content
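Because the tags name the content, a program can query the bibliography directly. The sketch below assumes Python's standard xml.etree.ElementTree module; the helper name books_by_author is hypothetical, and the second <book> entry is taken from the HTML version of the bibliography rather than from the XML slide, which elides it.

```python
import xml.etree.ElementTree as ET

BIBLIOGRAPHY = """
<bibliography>
  <book>
    <title>Foundations of Databases</title>
    <author>Abiteboul</author>
    <author>Hull</author>
    <author>Vianu</author>
    <publisher>Addison Wesley</publisher>
    <year>1995</year>
  </book>
  <book>
    <title>Data on the Web</title>
    <author>Abiteboul</author>
    <author>Buneman</author>
    <author>Suciu</author>
    <publisher>Morgan Kaufmann</publisher>
    <year>1999</year>
  </book>
</bibliography>
"""


def books_by_author(xml_text, surname):
    """Return the titles of all books that list the given author."""
    root = ET.fromstring(xml_text)
    return [book.findtext("title")
            for book in root.findall("book")
            if surname in [author.text for author in book.findall("author")]]


print(books_by_author(BIBLIOGRAPHY, "Hull"))
print(books_by_author(BIBLIOGRAPHY, "Abiteboul"))
```

The same question asked of the HTML version would require guessing which italic text is a title and which plain text is an author list.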
16. XML separates data from html
• Updating a website dynamically takes considerable effort when data and
markup are mixed in the same HTML. Because XML separates the data from
its presentation, the XML file can be updated dynamically on its own
while the HTML takes care of how the data looks.
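One way to picture the separation, as a sketch in Python using only the standard library: the data lives in XML, and a small rendering function (render_bibliography, a hypothetical name) produces HTML in the style of the bibliography slide. Changing the data means editing only the XML; changing the look means editing only the function.

```python
import xml.etree.ElementTree as ET
from html import escape


def render_bibliography(xml_text):
    """Turn <bibliography> data into HTML; presentation lives only here."""
    root = ET.fromstring(xml_text)
    parts = ["<h1>Bibliography</h1>"]
    for book in root.findall("book"):
        title = escape(book.findtext("title", default="").strip())
        authors = ", ".join(escape((a.text or "").strip())
                            for a in book.findall("author"))
        publisher = escape(book.findtext("publisher", default="").strip())
        year = escape(book.findtext("year", default="").strip())
        parts.append(f"<p><i>{title}</i> {authors}<br>{publisher}, {year}</p>")
    return "\n".join(parts)


DATA = (
    "<bibliography><book>"
    "<title> Data on the Web </title>"
    "<author> Abiteboul </author><author> Buneman </author><author> Suciu </author>"
    "<publisher> Morgan Kaufmann </publisher><year> 1999 </year>"
    "</book></bibliography>"
)

html_page = render_bibliography(DATA)
print(html_page)
```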
17. XML Terminology
• tags: book, title, author, …
• start tag: <book>, end tag: </book>
• elements: <book>…</book>, <author>…</author>
• elements are nested
• empty element: <red></red> abbrv. <red/>
• an XML document: single root element
• well-formed XML document: every start tag has a matching end tag and elements are properly nested
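The terminology can be checked with a parser. The sketch below assumes Python's standard xml.etree.ElementTree module; describe is a hypothetical helper that returns the root tag and its child tags, or None when the text is not well formed.

```python
import xml.etree.ElementTree as ET


def describe(xml_text):
    """Return (root tag, list of child tags), or None if not well formed."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return None
    return root.tag, [child.tag for child in root]


# <red/> is only an abbreviation of <red></red>: both parse to the same tree.
short_form = describe("<book><title>XML</title><red/></book>")
long_form = describe("<book><title>XML</title><red></red></book>")

# Badly nested tags and a second root element are both rejected.
bad_nesting = describe("<book><title>XML</book></title>")
two_roots = describe("<book/><book/>")

print(short_form, long_form, bad_nesting, two_roots)
```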
18. More XML: Attributes
<book price="500" currency="INR">
<title> Foundations of Databases </title>
<author> Abiteboul </author>
…
<year> 1995 </year>
</book>
attributes are alternative ways to represent data
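A short sketch of reading attributes, assuming Python's standard xml.etree.ElementTree module. The second document, alt, is a hypothetical variant that carries the same price as a child element, to show what "alternative ways to represent data" means in practice.

```python
import xml.etree.ElementTree as ET

# Price and currency stored as attributes of <book>.
book = ET.fromstring(
    '<book price="500" currency="INR">'
    "<title>Foundations of Databases</title>"
    "</book>"
)
price = int(book.get("price"))     # attribute values are always strings
currency = book.get("currency")
missing = book.get("isbn")         # absent attributes come back as None

# The same information stored as a child element instead.
alt = ET.fromstring(
    "<book>"
    '<price currency="INR">500</price>'
    "<title>Foundations of Databases</title>"
    "</book>"
)
alt_price = int(alt.findtext("price"))

print(book.attrib, price, currency, missing, alt_price)
```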
19. XML namespace
• XML namespace is a collection of XML elements and attributes identified
by an Internationalized Resource Identifier (IRI); this collection is often
referred to as an XML "vocabulary."
• Since XML allows designers to choose their own tag names, it is possible that
two or more designers may choose the same tag names for some or all of
their elements. XML namespace solves this problem. It provides a way to
distinguish between XML elements that have the same local name but are,
in fact, from different vocabularies. This is done by associating an element
with a namespace. A namespace acts as scope for all elements associated
with it.
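A sketch of two vocabularies that both define a <title> element, assuming Python's standard xml.etree.ElementTree module. The namespace URIs under example.org and the prefixes bib and bio are made up for illustration.

```python
import xml.etree.ElementTree as ET

DOC = """
<catalog xmlns:bib="http://example.org/bibliography"
         xmlns:bio="http://example.org/biology">
  <bib:title>Foundations of Databases</bib:title>
  <bio:title>NUCLEAR RIBONUCLEOPROTEIN</bio:title>
</catalog>
"""

# Prefix-to-URI map used for lookups; the prefixes are local shorthand only.
NS = {
    "bib": "http://example.org/bibliography",
    "bio": "http://example.org/biology",
}

root = ET.fromstring(DOC)
book_title = root.findtext("bib:title", namespaces=NS)
protein_title = root.findtext("bio:title", namespaces=NS)

# Internally each tag is stored with its full namespace URI in braces.
expanded = [child.tag for child in root]

print(book_title)
print(protein_title)
print(expanded)
```

Both elements have the local name title, yet the parser never confuses them because each is qualified by a different namespace.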
20. A minimal XML document
<?xml version="1.0" ?>
<document name="first">ABC</document>
• <document …>: a tag
• name: an attribute
• "first": the attribute's value
• </document>: the closing tag
21. A Piece of Biological XML
<seq id="my_seq" name="NUCLEAR RIBONUCLEOPROTEIN">
<dbxref>
<database>SWISS-PROT</database>
<unique_id>P09651</unique_id>
</dbxref>
<residues type="aa">
SKSESPKEPEQLRKLFIGGLSFETTDESLRSHFEQWGTLTDCVVMRDPNTKRSRGFGFVTYATVEEV
DAAMNARPHKVDGRVVEPKRAVSREDSQRPGAHLTVKKIFVGGIKEDTEEHHLRDYFEQYGKIEVIE
IMTDRGSGKKRGFAFVTFDDHDSVDKIVIQKYHTVNGHNCEVRKALSKQEMASASSSQRGRSGSGNF
GGGRGGGFGGNDNFGRGGNFSGRGGFGGSRGGGGYGGSGDGYNGFGNDGGYGGGGPGYSGGSRGYGS
GGQGYGNQGSGYGGSGSYDSYNNGGGRGFGGGSGSNFGGGGSYNDFGNYNNQSSNFGPMKGGNFGGR
SSGPYGGGGQYFAKPRNQGGYGGSSSSSSYGSGRRF
</residues>
</seq>
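A record like this can be converted to another interchange format with a few lines of code. The sketch below assumes Python's standard xml.etree.ElementTree module and writes FASTA; the function name seq_to_fasta and the header layout are illustrative choices, and the residues are shortened to the first two lines of the record above.

```python
import xml.etree.ElementTree as ET

SEQ_XML = """
<seq id="my_seq" name="NUCLEAR RIBONUCLEOPROTEIN">
  <dbxref>
    <database>SWISS-PROT</database>
    <unique_id>P09651</unique_id>
  </dbxref>
  <residues type="aa">
    SKSESPKEPEQLRKLFIGGLSFETTDESLRSHFEQWGTLTDCVVMRDPNTKRSRGFGFVTYATVEEV
    DAAMNARPHKVDGRVVEPKRAVSREDSQRPGAHLTVKKIFVGGIKEDTEEHHLRDYFEQYGKIEVIE
  </residues>
</seq>
"""


def seq_to_fasta(xml_text, width=60):
    """Build a FASTA record from a <seq> element."""
    seq = ET.fromstring(xml_text)
    database = seq.findtext("dbxref/database")
    accession = seq.findtext("dbxref/unique_id")
    # Remove the line breaks and indentation inside <residues>.
    residues = "".join(seq.findtext("residues").split())
    header = f">{database}|{accession} {seq.get('name')}"
    lines = [residues[i:i + width] for i in range(0, len(residues), width)]
    return "\n".join([header] + lines)


fasta = seq_to_fasta(SEQ_XML)
print(fasta)
```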
22. Biological XML
• Some DTDs have been proposed publicly as XML formats for biological data
• GAME (Genome Annotation Markup Elements)
• BIOML (The Biopolymer Markup Language)
• IOML (Interactive Outline Markup Language)
• BSML (Bioinformatic Sequence Markup Language)
• CML (Chemical Markup Language)
• GEML (Gene Expression Markup Language)
23. phyloXML: XML for evolutionary biology and comparative genomics
• http://www.phyloxml.org/
• phyloXML is an XML language designed to describe phylogenetic trees (or networks) and
associated data.
• It provides elements for commonly used features, such as taxonomic information, gene names and
identifiers, branch lengths, support values, and gene duplication and speciation events. Using these
standardized elements allows interoperability between various applications and databases.
Furthermore, both the extensible nature of XML itself and the <property> elements provided
by phyloXML allow for extension and for domain-specific applications.
• The structure of phyloXML is described by XML Schema Definition (XSD) language.
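A small sketch of reading a phyloXML tree, assuming Python's standard xml.etree.ElementTree module. The tree below is a hand-written sample that uses the phyloxml, phylogeny, clade, name and branch_length elements in the http://www.phyloxml.org namespace; it has not been validated against the XSD, and leaf_names is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

TREE = """
<phyloxml xmlns="http://www.phyloxml.org">
  <phylogeny rooted="true">
    <name>example tree</name>
    <clade>
      <clade>
        <branch_length>0.06</branch_length>
        <clade><name>A</name><branch_length>0.102</branch_length></clade>
        <clade><name>B</name><branch_length>0.23</branch_length></clade>
      </clade>
      <clade><name>C</name><branch_length>0.4</branch_length></clade>
    </clade>
  </phylogeny>
</phyloxml>
"""

NS = {"phy": "http://www.phyloxml.org"}


def leaf_names(clade):
    """Collect the names of all terminal clades below the given clade."""
    children = clade.findall("phy:clade", NS)
    if not children:
        return [clade.findtext("phy:name", namespaces=NS)]
    names = []
    for child in children:
        names.extend(leaf_names(child))
    return names


root = ET.fromstring(TREE)
top_clade = root.find("phy:phylogeny/phy:clade", NS)
leaves = leaf_names(top_clade)
print(leaves)
```

Nested <clade> elements mirror the branching of the tree, so a recursive walk over the XML is also a walk over the phylogeny.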
24. XML at the PDBe
• http://www.ebi.ac.uk/pdbe/docs/documentation/xml.html
• The PDBe is involved in XML at two levels.
• development of standard DTDs/XML schemas for representing
macromolecular structure and other biological data.
• For example:
• structural genomics data exchange packets (with eHTPX)
• nuclear magnetic resonance experimental information (with CCPN)
• macromolecular structure data (with RCSB)
25. Significance of Using XML
1. Open and extensible - XML's open structure lets you add new elements when
needed, so a system can always be adapted to embrace industry-specific vocabulary.
2. A DTD is simple to modify. XML and DTD files are human readable and can be edited easily by
people with limited computing skills.
3. XML is Internet-oriented and has very rich capabilities for linking data
-This can be used for interconnecting databases
4. XML provides an open framework for defining standard specifications.
-This is an important point because bioinformatics clearly lacks standardization
5. XML data is self-describing: a document contains both the data and information about the data. A
traditional database system requires relational schemata, file description tables, external data
definitions and so on to be defined before any data is stored. XML does not, because the tags in the
document already carry this information.
6. XML keeps data usable across applications. This is very important for seamless integration of data in
business applications.
7. XML can be combined with many data formats, from text and numbers to multimedia such as sound and
images, and to active content such as Java applets or ActiveX components.
8. No programming required to modify the presentation of data - One can change the look and feel of documents or
even entire websites with XSL Style Sheets without manipulating the data itself
9. Single source for distributed data - An XML document can combine data from many different databases
distributed over multiple servers, so data spread across the World Wide Web can be treated as if it came
from a single database.
10. Future-oriented technology - XML is a Recommendation of the World Wide Web Consortium (W3C)
and is supported by all leading software providers. XML is also used as a standard in an increasing
number of other industries, for example health care.