The future of XML storage: Native XML Databases vs. Relational Databases

Oleksiy Prokhorov, Felix Annan
Drexel University
3141 Chestnut Street
Philadelphia, PA 19104
{oop23, foa24}@drexel.edu

ABSTRACT

Extensible Markup Language (XML) is the de facto standard for information exchange between business applications. The current methods of storing XML documents largely involve the use of XML enabled databases. This storage format relies heavily on the ability to transform XML data into a form that can be stored in a relational database using well formed XML Schemas. This procedure incurs high processing overhead, which contributes to an appallingly low performance of the entire relational database system. Native XML databases have been proposed as a solution to this problem. This paper explains why dedicated processing and storage set aside for XML formatted documents will be a better solution to the XML processing crisis that most companies are currently facing.

1. INTRODUCTION

Extensible Markup Language (XML) is an open method of marking up (describing) data. It uses tags called elements to describe the data enclosed within the elements. Because users can create custom tags to suit their own purposes, its use has quickly spread to the description of all kinds of documents, ranging from government documents to financial data. The burden of dealing with data stored in a proprietary format has largely been lifted off the shoulders of many organizations, since they can now exchange data in a non-proprietary way. This allows institutions to create their own programs to work with shared data. The hierarchical nature of XML makes it the format of choice for certain applications, such as taxonomic data storage in the life sciences and documentation storage.

XML data is usually described as semi-structured: a specific element can have multiple sub-elements of the same type. This semi-structured nature makes XML suitable for describing data that contains many variations. With this explosive use has come the stockpiling of thousands of XML documents, and hence the need to be able to actively store and query them.

There are two general types of XML documents: data-centric documents and document-centric documents [9]. Data-centric documents are usually well structured and strictly follow a relatively fixed XML Schema, whilst document-centric documents are loosely structured and may not conform strictly to an XML Schema.

The first solution that most companies came up with for storing XML documents was to store them in what are now called XML enabled databases. These are relational databases with extra processing facilities for XML. With some database vendors realizing the need for change, a different kind of database for XML storage has been developed over the last few years: the native XML database. This type of database is built from the ground up with dedicated facilities specially suited for storing XML data. The following sections discuss these two implementations for storing XML. We then compare the various issues and conclude with a preference for native XML databases.

2. XML ENABLED DATABASES

XML enabled databases are legacy relational databases with XML processing built on. Relational database systems store data in a format where the unit of storage is the table row. The structure of a relational database is generally decided once, after which it hardly ever changes. The data best suited to a relational system is well structured data.

Fig. 1, derived from [2], gives an overview of the process that XML documents and queries go through during processing.

At the XQuery interface, an XML query is received and validated for correctness. After the query is validated, the document for insertion is validated against an XML Schema (depending on the method of storage) and parsed into the various sections of a hierarchical structure.
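The parsing step described above can be illustrated with a minimal sketch in Python. It assumes a small hypothetical invoice document (the element names are invented for illustration and do not come from the paper) and parses it twice with the standard library: once with a tree-building DOM parser and once with an event-driven SAX parser. Both recover the same hierarchy of element paths.

```python
import xml.dom.minidom
import xml.sax

# Hypothetical data-centric document, invented for this example.
INVOICE_XML = """<invoice id="42">
  <customer>ACME</customer>
  <item sku="A1">Bolt</item>
  <item sku="B2">Nut</item>
</invoice>"""


def element_paths_dom(xml_text):
    """Build the whole tree in memory (DOM) and list every element path."""
    document = xml.dom.minidom.parseString(xml_text)
    paths = []

    def walk(node, prefix):
        path = prefix + "/" + node.tagName
        paths.append(path)
        for child in node.childNodes:
            # Skip text and comment nodes; recurse into elements only.
            if child.nodeType == child.ELEMENT_NODE:
                walk(child, path)

    walk(document.documentElement, "")
    return paths


class PathCollector(xml.sax.ContentHandler):
    """Collect element paths from SAX events without building a tree."""

    def __init__(self):
        super().__init__()
        self.stack = []   # names of the currently open elements
        self.paths = []

    def startElement(self, name, attrs):
        self.stack.append(name)
        self.paths.append("/" + "/".join(self.stack))

    def endElement(self, name):
        self.stack.pop()


def element_paths_sax(xml_text):
    """Stream the document through a SAX handler and list every element path."""
    handler = PathCollector()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.paths


if __name__ == "__main__":
    print(element_paths_dom(INVOICE_XML))
    print(element_paths_sax(INVOICE_XML))
```

The DOM variant holds the complete document in memory before any path is produced, whereas the SAX variant keeps only the stack of open elements, which is why streaming parsers are usually preferred for very large documents.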
[Fig. 1: XML process for a typical XML Enabled Database. Stages shown: Communication Entry Point, XQuery Interface, XML Schema Validation, XQuery/SQL Interface, SQL Interface, Database Engine, Relational Storage]

There are generally two tools available for parsing an XML document: the Document Object Model (DOM) [3] parser and the Simple API for XML (SAX) [3] parser.

The XQuery/SQL interface exists to convert the initial XQueries and the parsed documents into a set of SQL statements that can be applied to the database. There are generally two methods by which XML data can be stored in a relational database. In the first, the entire XML document is stored in a single column of a row whose data type is a Large Object (LOB) [4]. This approach works well when all actions are performed on the document as a whole. It also eliminates the need for shredding, but comes with its own set of problems. The second method maps the content of the document to a relational database structure. This conversion process is commonly called shredding. For it to succeed, the document must be validated against a (usually manually created) XML Schema. This ensures that the document is valid for the relational database it will be used against, and therefore that the XML elements will map onto the appropriate columns and tables. Adaptive XML shredding [5] can be used to automatically generate an XML Schema and derive a mapping between the various elements in the input XML document and a generated set of tables in the database.

Generally, this entire processing of XML documents is set up as a layer invisible to the relational database.

Similar processing is required in the reverse direction shown on the diagram when requesting information from the database, a process called XML publishing [7]. SQL statements decomposed from XQuery requests typically contain a lot of JOINs. When the data to be joined is stored in multiple tables because the XML Schema has many branches, the query engine repeatedly needs to create multiple views that represent intermediate stages of the query process. These views are then merged and processed by the SQL/XML interface to produce the required document or document fragment. The returned XML document is validated and then exposed as the result of the query. To improve the performance of the JOIN statements, some researchers have proposed using a denormalized database [6], which increases speed at the cost of increased redundancy.

3. NATIVE XML DATABASES

There are a number of definitions for a native XML database. It is a system that has an XML document as its fundamental unit of (logical) storage. A typical process for native XML document storage is shown in Fig. 2 (derived from [2]). Any document presented at the XQuery Interface is parsed using either a SAX or DOM parser and then validated against an XML Schema. The validation stage is optional, but it is good practice to ensure that multiple documents conform to one another. The schema also serves as documentation of what elements exist in the database. Each document can be stored with its own schema. When a schema is set up for a set of documents, the schema can evolve with the documents it describes. This loose content model allows a great amount of flexibility in data storage, in that new elements can be stored with minimal additional effort. This is further supported by the fact that the XML document model naturally supports document/element
order, sibling order, and the storage of comments and processing instructions [7], which is especially important for generating document-centric XML documents.

[Fig. 2: XML Process for a Native XML Database. Stages shown: Communication Entry Point, XQuery Interface, XML Schema Validation (Optional), Database Engine, Hierarchical XML storage]

The information retrieved from an XML document is stored directly in a hierarchical tree structure. This structure reflects the default nature of XML documents. Elements are replaced by numerical IDs [2] in the hierarchical structure, which prevents the repetition of element names throughout the structure. A reference table maintains the association between element names and numerical IDs. An XQuery executed against the store does not require translation. Results retrieved from the XML storage are generated directly into an XML document.

Native XML databases support full text searches of documents without the need for the entire document to be loaded in memory. They can perform these searches based on specified criteria, provided the different documents use the same elements in the same way.

4. COMPARISON

It is clear from Fig. 1 and Fig. 2 that there are far more processing stages in XML enabled databases than there are in native XML databases. This difference translates into performance. Valuable processing time has to be spent on mapping XML Schemas to database structures, translating XQueries into SQL [2], and parsing and reparsing XML documents stored in LOBs. LOBs are inefficient when there is a need to search for specific information between specific elements of an XML document. Each document that is retrieved in a multi-row operation has to be parsed and the information extracted [4]. Because XML document parsing is very processor intensive, the document must be heavily indexed to increase performance. These processes make work more difficult for database vendors, who have to implement and test various algorithms for them.

Native XML databases are able to insert and export data in XML more efficiently than XML enabled databases whilst preserving more information about the document. Apart from using inefficient LOBs, relational databases have no facilities for storing document order, sibling order, comments and processing instructions for documents. Relational databases are unable to perform full text searches without loading the entire document into memory. By using the XML-aware system available in native XML databases, XML documents can be searched without loading most of the document into memory. Furthermore, native XML databases can answer a question like “Find maintenance procedures for a specific part of a specific airplane” from a set of airplane manuals faster than relational databases can.

Though relational databases are very efficient at storing data that has a specific structure, the maintenance of the system becomes more difficult when the structure needs to change. Denormalized database structures can be more tolerant of change and reduce the number of joins required to retrieve data. However, they end up with a lot of redundant data and with relational design anomalies such as non-atomic values, functional dependencies and multi-valued dependencies, which make the database more difficult to manage as more changes occur. Adaptive XML shredding adds processing overhead to XML Schema based decomposition and does not provide methods for automatic schema and database augmentation should a schema change. Native XML databases support data with a changing structure. As the data evolves, the XML schema evolves, making data management easier and providing a more efficient storage process. Performance tests show that native
XML databases have a higher performance when dealing with large XML documents and large volumes of XML data [1]. Even in the domain of well structured data, native XML database processing speeds are comparable to those of a relational database processing and returning tabular data when the data is indexed [8].

5. INDUSTRIAL IMPLEMENTATION

The move to native XML databases is a slow one. Irrespective of their implementations, one thing common to all vendors is the existence of a “native” XML data type and an XML subsystem with special functions built in for the data type.

Oracle users can select from two general implementations of the Oracle XMLType data type: storage as a Character Large Object, or storage as a shredded set of database objects. These storage methods exist within the XMLType data type, and the database has special functionality to work with them.

Microsoft SQL Server 2000 started off with the transformation of XML data into relational tables. Due to problems in scalability, including issues with preserving document order [1], Microsoft SQL Server 2005 implements a “native” XML data type which stores XML in a Binary Large Object (BLOB). The data stored in the BLOB is heavily indexed to enable fast regeneration of the XML document.

IBM DB2 v9 also implements a special data type for storage of XML [2]. The internal structure of the data type uses the hierarchical structure of a tree and is so far the closest implementation of a true native database system of the three vendors mentioned here. Though the implementations of Oracle and Microsoft are not actually native, they indicate that industry realizes the importance of developing dedicated native XML data types and subsystems to better handle XML storage.

6. CONCLUSION

Though relational databases have been around for a long time, the processing of XML documents, especially semi-structured XML documents, which account for the majority, is definitely an area best served by native XML databases. With enough research and development, native XML systems have the potential to become the dominant database system in the future. The changeover to native XML systems is slow because database vendors want to squeeze all the functionality and profits they can out of pre-existing relational systems. Technical professionals have also become very comfortable with the relational systems available and are reluctant to change because of the amount of work it could involve. Because quite a few organizations need to store data in both relational and XML form, one can definitely expect to find bundled (hybrid) systems where the relational and XML systems coexist but each has a native implementation within the system. The industrial implementations are examples of this trend.

7. REFERENCES

[1] Akmal B. Chaudhri, Awais Rashid, Roberto Zicari, “XML Data Management: Native XML and XML-Enabled Database Systems”, Addison-Wesley, 2003
[2] Matthias Nicola, Bert van der Linden, “Native XML Support in DB2 Universal Database”, ACM Digital Library, 2005
[3] W3 Schools Online Web Tutorial, “XML Tutorial”, http://www.w3schools.com
[4] Fiebig, T. et al., “Anatomy of a Native XML Base Management System”, VLDB Journal 11(4), December 2002
[5] Juliana Freire and Jérôme Siméon, “Adaptive XML Shredding: Architecture, Implementation and Challenges”, University of Toronto, January 2003
[6] Balmin Andre, Yannis Papakonstantinou, “Storing and querying XML data using denormalized relational databases”, VLDB Journal, 2005
[7] Michael Rys, Don Chamberlin, Daniela Florescu, “XML and Relational Database Management Systems: the Inside Story”, ACM Digital Library, 2005
[8] Atakan Kurtl, Mustafa Atay, “An Experimental Study on Query Processing Efficiency of Native-XML and XML-enabled Relational Database Systems”, ACM Digital Library, 2002
[9] Ronald Bourret, “XML and Databases”, http://www.rpbourret.com/xml/XMLAndDatabases.htm