MAP REDUCE BASED ON CLOAK DHT DATA REPLICATION EVALUATION
Distributed databases and data replication are effective ways to increase the accessibility and reliability of unstructured, semi-structured and structured data in order to extract new knowledge. Replication offers better performance and greater availability of data. With the advent of Big Data, new storage and processing challenges are emerging.
To meet these challenges, Hadoop and DHTs compete in the storage domain, while MapReduce and other frameworks compete in distributed processing, each with its own strengths and weaknesses.
We propose an analysis of the circular and radial replication mechanisms of the CLOAK DHT. We evaluate their performance through a comparative study of simulation data. The results show that radial replication performs better for storage, whereas circular replication gives better search results.
PERFORMANCE EVALUATION OF BIG DATA PROCESSING OF CLOAK-REDUCE
Big Data has introduced the challenge of storing and processing large volumes of data (text, images, and videos). Centralised exploitation of massive data on a single node no longer suffices, leading to the emergence of distributed storage, parallel processing, and hybrid distributed storage and parallel processing frameworks.
The main objective of this paper is to evaluate the load balancing and task allocation strategy of our hybrid distributed storage and parallel processing framework, CLOAK-Reduce. To achieve this goal, we first present a theoretical study of the architecture and operation of some DHT-MapReduce frameworks. Then, we compare the data collected on their load balancing and task allocation strategies by simulation.
Finally, the simulation results show that CLOAK-Reduce's C5R5 replication provides better load-balancing efficiency and MapReduce job submission, with 10% churn or no churn.
STUDY ON EMERGING APPLICATIONS ON DATA PLANE AND OPTIMIZATION POSSIBILITIES
By programming both the data plane and the control plane, network operators can adapt their networks to their needs. Thanks to research over the past decade, this concept has become more formalized and more technologically feasible. However, since control plane programmability came first, it has already been successfully implemented in real networks and is beginning to pay off. Today, data plane programmability is evolving very rapidly to reach this level, attracting the attention of researchers and developers: designing data plane languages, developing applications on them, formalizing software switches and architectures that can run data plane code and applications, increasing the performance of software switches, and so on. As the control plane and data plane become more open, many new innovations and technologies are emerging, but some experts warn that consumers may be confused about which of the many technologies to choose. This is a testament to how much innovation is emerging in networking. This paper outlines some emerging applications on the data plane and identifies opportunities for further improvement and optimization. Our observations show that most of the implementations are done in a test environment and have not been tested thoroughly in terms of performance, but there are many interesting works; for example, solutions previously realized in the control plane are now being implemented in the data plane.
Transforming data-centric eXtensible markup language into relational database...
eXtensible markup language (XML) emerged internationally as the format for data representation over the web. Yet, most organizations still utilise relational databases as their database solutions. As such, it is crucial to provide seamless integration via effective transformation between these database infrastructures. In this paper, we propose XML-REG to bridge these two technologies based on node-based and path-based approaches. The node-based approach is well suited to annotating each positional node uniquely, while the path-based approach provides summarised path information to join the nodes. On top of that, a new range labelling is also proposed to annotate nodes uniquely while ensuring that the structural relationships between nodes are maintained. If a new node is added to the document, re-labelling is not required, as a new label will be assigned to the node via the proposed labelling scheme. Experimental evaluations indicated that the performance of XML-REG exceeded XMap, XRecursive, XAncestor and Mini-XML concerning storing time, query retrieval time and scalability. This research produces a core framework for XML to relational databases (RDB) mapping, which could be adopted in various industries.
Hadoop MapReduce Performance Enhancement Using In-Node Combiners
While advanced analysis of large datasets is in high demand, data sizes have surpassed the capabilities of conventional software and hardware. The Hadoop framework distributes large datasets over multiple commodity servers and performs parallel computations. We discuss the I/O bottlenecks of the Hadoop framework and propose methods for enhancing I/O performance. A proven approach is to cache data to maximize memory locality for all map tasks. We introduce an approach to optimize I/O: the in-node combining design, which extends the traditional combiner to the node level. The in-node combiner reduces the total number of intermediate results and curtails network traffic between mappers and reducers.
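The abstract gives no code, but the idea is close to the well-known in-mapper combining pattern; as a hedged illustration, the Java sketch below aggregates word counts inside the mapper and emits them once per task in cleanup(), so fewer intermediate pairs reach the shuffle (the paper's in-node combiner goes further, sharing state across all map tasks on a node). Class and variable names are mine, not from the paper.

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/** Word-count mapper that pre-aggregates counts in memory and emits
 *  them once per task, cutting the volume of intermediate pairs. */
public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private final Map<String, Integer> buffer = new HashMap<>();

    @Override
    protected void map(LongWritable key, Text value, Context context) {
        for (String word : value.toString().split("\\s+")) {
            if (!word.isEmpty()) {
                buffer.merge(word, 1, Integer::sum);  // combine locally
            }
        }
    }

    @Override
    protected void cleanup(Context context)
            throws IOException, InterruptedException {
        // Emit each word once per map task instead of once per occurrence.
        for (Map.Entry<String, Integer> e : buffer.entrySet()) {
            context.write(new Text(e.getKey()), new IntWritable(e.getValue()));
        }
    }
}
```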
Hadoop is an open source implementation of the MapReduce framework in the realm of distributed processing. A Hadoop cluster is a unique type of computational cluster designed for storing and analyzing large datasets across a cluster of workstations. To handle massive-scale data, Hadoop exploits the Hadoop Distributed File System (HDFS). HDFS, like most distributed file systems, shares a familiar problem of data sharing and availability among compute nodes, which often leads to decreased performance. This paper is an experimental evaluation of Hadoop's computing performance, made by designing a rack-aware cluster that utilizes Hadoop's default block placement policy to improve data availability. Additionally, an adaptive data replication scheme that relies on access count prediction using Lagrange's interpolation is adapted to fit the scenario. To validate this, experiments were conducted on a rack-aware cluster setup, which significantly reduced task completion time; however, once the volume of data being processed increases, there is a considerable cutback in computational speed due to the update cost. Further, the threshold level for the balance between update cost and replication factor is identified and presented graphically.
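The abstract names Lagrange's interpolation for access-count prediction without giving details; as a rough sketch under that assumption, the standard Lagrange polynomial through past (epoch, access-count) points can be evaluated at the next epoch like this (all names and numbers are illustrative, not from the paper):

```java
/** Lagrange interpolation: evaluates the unique polynomial through
 *  the points (x[i], y[i]) at xq. Used here to extrapolate the next
 *  access count of a block from past observations (illustrative only). */
public final class AccessPredictor {

    static double lagrange(double[] x, double[] y, double xq) {
        double sum = 0.0;
        for (int i = 0; i < x.length; i++) {
            double basis = 1.0;                 // L_i(xq)
            for (int j = 0; j < x.length; j++) {
                if (j != i) {
                    basis *= (xq - x[j]) / (x[i] - x[j]);
                }
            }
            sum += y[i] * basis;
        }
        return sum;
    }

    public static void main(String[] args) {
        double[] t = {1, 2, 3, 4};              // observation epochs
        double[] hits = {10, 14, 21, 33};       // access counts per epoch
        // Predict the access count at the next epoch; a replication scheme
        // could add replicas when the prediction crosses a threshold.
        System.out.println(lagrange(t, hits, 5));
    }
}
```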
Enhancing Big Data Analysis by using Map-Reduce Technique
A database is defined as a set of data organized and distributed in a manner that permits the user to access the stored data easily and conveniently. However, in the era of big data, traditional methods of data analytics may not be able to manage and process such large amounts of data. In order to develop an efficient way of handling big data, this work enhances the use of the Map-Reduce technique to handle big data distributed on the cloud. This approach was evaluated using a Hadoop server and applied to Electroencephalogram (EEG) big data as a case study. The proposed approach showed clear enhancement in managing and processing the EEG big data, with an average 50% reduction in response time. The obtained results provide EEG researchers and specialists with an easy and fast method of handling EEG big data.
An Approach for the Incremental Export of Relational Databases into RDF Graphs (Nikolaos Konstantinou)
Several approaches have been proposed in the literature for offering RDF views over databases. In addition to these, a variety of tools exist that allow exporting database contents into RDF graphs. The approaches in the latter category have often been shown to deliver better performance than those in the former. However, when database contents are exported into RDF, it is not always optimal or even necessary to export, or dump as this procedure is often called, the whole database contents every time. This paper investigates the problem of incremental generation and storage of the RDF graph that results from exporting relational database contents. In order to express mappings that associate tuples from the source database with triples in the resulting RDF graph, an implementation of the R2RML standard is subjected to testing. Next, a methodology is proposed and described that enables incremental generation and storage of the RDF graph that originates from the source relational database contents. The performance of this methodology is assessed through an extensive set of measurements. The paper concludes with a discussion of the authors' most important findings.
Survey of Parallel Data Processing in Context with MapReduce
MapReduce is a parallel programming model and an associated implementation introduced by Google. In the programming model, a user specifies the computation by two functions, Map and Reduce. The underlying MapReduce library automatically parallelizes the computation and handles complicated issues like data distribution, load balancing and fault tolerance. The original MapReduce implementation by Google, as well as its open-source counterpart, Hadoop, is aimed at parallel computing in large clusters of commodity machines. This paper gives an overview of the MapReduce programming model and its applications. The author describes the workflow of the MapReduce process, and some important issues, like fault tolerance, are studied in more detail. An illustration of how MapReduce works is also given. The data locality issue in heterogeneous environments can noticeably reduce MapReduce performance. The author addresses the distribution of data across nodes so that each node has a balanced data-processing load stored in a parallel manner. Given a data-intensive application running on a Hadoop MapReduce cluster, the author exemplifies how data placement is done in the Hadoop architecture and the role of MapReduce within it. The amount of data stored in each node to achieve improved data-processing performance is also explained.
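To make the two-function model concrete, here is the canonical word-count reducer in Hadoop's Java API; this is a standard textbook example, not code from the survey. The matching mapper emits a (word, 1) pair per occurrence, and the library groups all pairs by word before calling reduce():

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

/** Canonical word count: the framework groups all (word, count) pairs
 *  by word and hands each group to reduce(), which sums the counts. */
public class WordCountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable c : counts) {
            sum += c.get();     // partial counts from all mappers
        }
        context.write(word, new IntWritable(sum));
    }
}
```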
Some slides about the Map/Reduce programming model (for academic purposes), adapting some examples from the book MapReduce Design Patterns.
Special thanks to the following authors:
-http://shop.oreilly.com/product/0636920025122.do
-http://mapreducepatterns.com/index.php?title=Main_Page
-http://highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns/
A New Architecture for Group Replication in Data Grid
Nowadays, grid systems are a vital technology for running high-performance programs and solving large-scale problems in science, engineering and business. In grid systems, heterogeneous computational resources and data should be shared between independent organizations that are geographically scattered. A data grid is a type of grid that links computational and storage resources. Data replication is an efficient way in a data grid to obtain high performance and high availability by saving numerous replicas in different locations, e.g. grid sites. In this research, we propose a new architecture for dynamic group data replication. In our architecture, we added two components to the OptorSim architecture: a Group Replication Management component (GRM) and a Management of Popular Files Group component (MPFG). OptorSim was developed by the European Data Grid project to evaluate replication algorithms. Using this architecture, popular file groups are replicated to grid sites at the end of each predefined time interval.
Taking Flight: an abbreviated version of my Agile DC talk (Paul Boos)
This gives you a hint at what my talk will be about and the Agile Transformation approach that I use. There is a lot missing, but feel free to download it and get a feel for it as well.
This small presentation gives a brief part of an important speech given by the Rev. Sun Myung Moon in Yankee Stadium in 1976. The speech calls upon the True American people, those "who believe in the one family of humankind, transcendent of color and nationality as willed by God", to revive the founding spirit of the nation and bring God back to our homes and churches. "America must return to Godism, an absolutely God-centered ideology."
HOLISTIC EVALUATION OF XML QUERIES WITH STRUCTURAL PREFERENCES ON AN ANNOTATE...
With the emergence of XML as the de facto format for storing and exchanging information over the Internet, the search for ever more innovative and effective techniques for querying it is a major and current concern of the XML database community. Most studies carried out to help solve this problem are oriented towards the evaluation of so-called exact queries which, unfortunately, are likely (especially in the case of semi-structured documents) to yield abundant results (for vague queries) or empty results (for very precise queries). From the observation that users who make requests are not necessarily interested in all possible solutions, but rather in those that are closest to their needs, an important field of research has opened on the evaluation of preference queries. In this paper, we propose an approach for the evaluation of such queries in the case where the preferences concern the structure of the document. The solution investigated revolves around an evaluation plan in three phases: rewriting, evaluation, and merging. The rewriting phase makes it possible to obtain, from a partitioning-transformation operation on the initial query, a hierarchical set of preference path queries, which are holistically evaluated in the second phase by an instrumented version of the TwigStack algorithm. The merge phase synthesizes the best results.
XML COMPACTION IMPROVEMENTS BASED ON BINARY STRING ENCODINGS
Due to the flexibility and ease of use of XML, it is nowadays widely used in a vast number of application areas, and new information is increasingly being encoded as XML documents. Therefore, it is important to provide a repository for XML documents which supports efficient management and storage of XML data. For this purpose, many proposals have been made, the most common ones being node labeling schemes. On the other hand, XML repeatedly uses tags to describe the data itself. This self-describing nature of XML makes it verbose, with the result that the storage requirements of XML are often expanded and can be excessive. In addition, the increased size leads to increased costs for data manipulation. Therefore, it also seems natural to use compression techniques to increase the efficiency of storing and querying XML data.
In our previous works, we aimed at combining the advantages of both areas (labeling and compaction technologies); specifically, we took advantage of XML structural peculiarities to reduce storage space requirements and improve the efficiency of XML query processing using labeling schemes. In this paper, we continue our investigations into variations of binary string encoding forms to decrease the label size. We also report experimental results that examine the impact of binary string encoding on query performance and on the storage size needed to store the compacted XML documents.
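The paper's specific binary encodings are not reproduced in the abstract; as a generic illustration of the kind of labels being compressed, a Dewey-style path label lets structural relationships be tested by simple prefix comparison, and binary string encodings aim to shrink exactly such labels. The dot-separated label format below is an assumption for the sketch:

```java
/** Dewey-style path labels: a node's label is its parent's label plus
 *  one more component, so structural tests reduce to prefix checks.
 *  Binary string encodings compress exactly these label strings. */
public final class DeweyLabel {

    /** True if a ("1.2") is a proper ancestor of d ("1.2.5.1"). */
    static boolean isAncestor(String a, String d) {
        return d.length() > a.length()
                && d.startsWith(a)
                && d.charAt(a.length()) == '.';   // must end on a component boundary
    }

    public static void main(String[] args) {
        System.out.println(isAncestor("1.2", "1.2.5.1"));   // true
        System.out.println(isAncestor("1.2", "1.25.1"));    // false: "1.25" != "1.2"
    }
}
```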
A FAST METHOD FOR IMPLEMENTATION OF THE PROPERTY LISTS IN PROGRAMMING LANGUAGES
One of the major challenges in programming languages is to support different data structures and their variations in both static and dynamic aspects. One of these data structures is the property list, which applications use as a convenient way to store, organize, and access standard types of data. In this paper, the standard methods for implementing property lists, including the static array, linked list, hash and tree, are reviewed. Then an efficient method to implement the property list is presented. The experimental results show that our method is fast compared with the existing methods.
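The abstract lists the representations it reviews without code; here is a minimal Java sketch contrasting two of them, a linked association list and a hash table, to show the trade-off the paper measures (linear scan versus hashed lookup). All names are illustrative:

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Two of the reviewed property-list representations side by side:
 *  a linked association list (simple, O(n) lookup) and a hash map
 *  (extra memory, near O(1) lookup). Class names are illustrative. */
public final class PropertyListDemo {

    /** Link-list representation: (key, value) nodes chained together. */
    static final class PListNode {
        final String key; Object value; PListNode next;
        PListNode(String key, Object value, PListNode next) {
            this.key = key; this.value = value; this.next = next;
        }
    }

    static Object getLinked(PListNode head, String key) {
        for (PListNode n = head; n != null; n = n.next) {
            if (n.key.equals(key)) return n.value;   // linear scan
        }
        return null;
    }

    public static void main(String[] args) {
        // Linked representation.
        PListNode plist = new PListNode("color", "red",
                          new PListNode("size", 42, null));
        System.out.println(getLinked(plist, "size"));        // 42

        // Hash representation: same interface, different cost profile.
        Map<String, Object> hashed = new LinkedHashMap<>();
        hashed.put("color", "red");
        hashed.put("size", 42);
        System.out.println(hashed.get("size"));              // 42
    }
}
```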
A large volume of information is stored in XML format on the Web, and clustering is a management method for these documents. Most current methods for clustering XML documents consider only one of the two aspects of structure and content. In this paper, we propose SCEM (Expectation Maximization Structure and Content) for XML documents, which effectively clusters XML documents by combining content and structural features. The other contribution of this paper is that we use probabilistic distributions in such a way that each has probability parameters corresponding to one cluster. In this way, we obtain better effectiveness than other clustering methods due to generality. Experimental results on real datasets show the effectiveness of the proposed method, particularly when applied to large XML documents without a schema. It can also be used to improve the accuracy and effectiveness of XML information retrieval.
Very few research works have been done on XML security over relational databases, despite the fact that XML became the de facto standard for data representation and exchange on the internet and many XML documents are stored in RDBMSs. In [14], the author proposed an access control model for schema-based storage of XML documents in relational storage, translating XML access control rules to relational access control rules. However, the proposed algorithms had performance drawbacks. In this paper, we use the same access control model as [14] and try to overcome its drawbacks by proposing an efficient technique to store the XML access control rules in a relational storage of the XML DTD. The mapping of the XML DTD to a relational schema is proposed in [7]. We also propose an algorithm to translate XPath queries to SQL queries based on the mapping algorithm in [7].
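The translation algorithm itself is in the paper, not the abstract; as a hedged sketch of the general idea, a descendant-only XPath can be rewritten into a LIKE predicate over a path table holding reversed, materialized path strings, the same convention used by the pseudo SQL in Example 1 of the bitmap-index paper later on this page. The node/path schema here is an assumption, not the mapping of [7]:

```java
/** Toy XPath-to-SQL translation for descendant-only steps ("//A//B"),
 *  assuming a node table (with label columns) and a path table that
 *  stores reversed, '%'-separated path strings, e.g. 'B%A%' for //A//B.
 *  Schema and column names are illustrative, not from the paper. */
public final class XPathToSql {

    static String translate(String xpath) {
        String[] steps = xpath.split("//");        // "", "A", "B"
        StringBuilder pattern = new StringBuilder();
        for (int i = steps.length - 1; i >= 1; i--) {
            pattern.append(steps[i]).append('%');  // reversed: leaf tag first
        }
        return "select N.* from node N, path P "
             + "where N.pid = P.pid and P.pathString like '" + pattern + "' "
             + "order by N.start";
    }

    public static void main(String[] args) {
        // //A//B  ->  ... P.pathString like 'B%A%' ...
        System.out.println(translate("//A//B"));
    }
}
```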
Towards a Query Rewriting Algorithm Over Proteomics XML Resources
Querying and sharing Web proteomics data is not an easy task. Given that several data sources can be used to answer the same sub-goals in the global query, we can obviously have many candidate rewritings. The user query is formulated using concepts and properties related to proteomics research (a domain ontology), and semantic mappings describe the contents of the underlying sources. In this paper, we propose a characterization of the query rewriting problem that views the semantic mappings as an associated hypergraph. Hence, the generation of candidate rewritings can be formulated as the discovery of the minimal transversals of a hypergraph. We exploit and adapt algorithms available in hypergraph theory to find all candidate rewritings for a query answering problem. In future work, relevance criteria could help determine optimal and qualitative rewritings, according to user needs and source performance.
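Since candidate rewritings correspond to minimal transversals, a tiny brute-force enumerator makes the formulation concrete; this is a didactic sketch (exponential in general), not the adapted algorithms the paper uses. The sources and sub-goals are invented for the example:

```java
import java.util.ArrayList;
import java.util.List;

/** Brute-force enumeration of minimal transversals (minimal hitting
 *  sets) of a small hypergraph: each hyperedge lists the sources able
 *  to answer one sub-goal; each minimal transversal is one candidate
 *  rewriting. Fine for illustration only. */
public final class MinimalTransversals {

    static List<Integer> asSet(int mask, int n) {
        List<Integer> s = new ArrayList<>();
        for (int v = 0; v < n; v++) if ((mask & (1 << v)) != 0) s.add(v);
        return s;
    }

    static boolean hits(int mask, int[] edges) {
        for (int e : edges) if ((mask & e) == 0) return false;
        return true;
    }

    public static void main(String[] args) {
        int n = 4;                       // sources s0..s3
        // Hyperedges as bitmasks: sub-goal 1 answerable by {s0,s1},
        // sub-goal 2 by {s1,s2}, sub-goal 3 by {s2,s3}.
        int[] edges = {0b0011, 0b0110, 0b1100};
        for (int mask = 1; mask < (1 << n); mask++) {
            if (!hits(mask, edges)) continue;
            boolean minimal = true;      // no proper subset may also hit
            for (int v = 0; v < n && minimal; v++) {
                if ((mask & (1 << v)) != 0 && hits(mask & ~(1 << v), edges))
                    minimal = false;
            }
            if (minimal) System.out.println(asSet(mask, n));
        }
    }
}
```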
An extended database reverse engineering: a key for database forensic invest...
Abstract: Database forensic investigation plays an important role in computer forensics. The data stored in a database is generally stored in the form of tables. However, it is difficult to extract meaningful data without blueprints of the database, because the tables inside the database have exceedingly complicated relations, and the roles of the tables and of the fields within them are ambiguous. Proving a computer crime requires very complicated processes based on digital evidence collection, forensic analysis and investigation. Current database reverse engineering research presumes that the information regarding the semantics of attributes, primary keys, and foreign keys in database tables is complete. However, this may not be the case: in a recent database reverse engineering effort to derive a data model from a table-based database system, we found that the data content of many attributes was not related to their names at all. Hence, database reverse engineering is used to extract the information regarding the semantics of attributes, primary keys, foreign keys, and the various consistency constraints in database tables. In this paper, database reverse engineering (DBRE) processes such as table relationship analysis and entity relationship analysis are described. We can extract an extended entity-relationship diagram from a table-based database with little description of the fields in its tables and no description of keys. Also described are the analysis of table relationships using the database system catalogue, joins of tables, and the design of the extraction process for the examination of data. Data extraction methods can be used in digital forensics to more easily acquire digital evidence from databases using table relationships, entity relationships, and joins among tables. With these techniques, it becomes possible for the database user to detect database tampering and dishonest manipulation of the database. Index Terms: Foreign key; Table Relationship; DB Forensic; DBRE
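One concrete building block of table relationship analysis is reading declared keys from the system catalogue; the standard JDBC metadata call below does that. Note that the paper's harder case is precisely when such declarations are missing and relationships must be inferred from the data, so this sketch only covers the easy half. The connection URL and table name are placeholders:

```java
import java.sql.Connection;
import java.sql.DatabaseMetaData;
import java.sql.DriverManager;
import java.sql.ResultSet;

/** Lists the foreign keys declared on table "orders" (its columns that
 *  reference other tables' primary keys) via the JDBC system catalogue.
 *  URL, credentials and table name are placeholders. */
public final class RelationshipScan {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://host/db", "user", "pw")) {
            DatabaseMetaData md = conn.getMetaData();
            try (ResultSet rs = md.getImportedKeys(null, null, "orders")) {
                while (rs.next()) {
                    System.out.printf("%s.%s -> %s.%s%n",
                        rs.getString("FKTABLE_NAME"), rs.getString("FKCOLUMN_NAME"),
                        rs.getString("PKTABLE_NAME"), rs.getString("PKCOLUMN_NAME"));
                }
            }
        }
    }
}
```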
SASUM: A Sharing-based Approach to Fast Approximate Subgraph Matching for Lar... (Kyong-Ha Lee)
Subgraph matching is a fundamental operation for querying graph-structured data. Due to potential errors and noise in real-world graph data, exact subgraph matching is sometimes not appropriate in practice. In this paper we consider an approximate subgraph matching model that allows missing edges. Based on this model, approximate subgraph matching finds all occurrences of a given query graph in a database graph, allowing missing edges. A straightforward approach to this problem is to first generate query subgraphs of the query graph by deleting edges and then perform exact subgraph matching for each query subgraph (see the sketch below). In this paper we propose a sharing-based approach to approximate subgraph matching, called SASUM. Our method is based on the fact that query subgraphs are highly overlapped. Due to this overlapping nature, the matches of a query subgraph can be computed from the matches of a smaller query subgraph, which reduces the number of query subgraphs that need costly exact subgraph matching. Our method uses a lattice framework to identify sharing opportunities between query subgraphs. To further reduce the number of graphs that need exact subgraph matching, SASUM generates small base graphs that are shared by query subgraphs and chooses the minimum number of base graphs whose matches are used to derive the matching results of all query subgraphs. A comprehensive set of experiments shows that our approach outperforms the state-of-the-art approach by orders of magnitude in terms of query execution time.
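The "straightforward approach" the abstract contrasts against can be sketched directly: enumerate every query subgraph reachable by deleting up to k edges, each destined for exact matching. SASUM's point is to avoid matching all of them independently. A minimal Java sketch with an invented triangle query:

```java
import java.util.ArrayList;
import java.util.List;

/** Baseline from the abstract: derive all query subgraphs obtained by
 *  deleting up to k edges; each would then be fed to an exact subgraph
 *  matcher. SASUM instead shares work between these overlapping graphs. */
public final class QuerySubgraphs {

    /** Each edge is an int pair; returns edge lists with <= k edges removed. */
    static List<List<int[]>> withDeletions(List<int[]> edges, int k) {
        List<List<int[]>> out = new ArrayList<>();
        int m = edges.size();
        for (int mask = 0; mask < (1 << m); mask++) {
            if (Integer.bitCount(mask) > k) continue;   // too many deletions
            List<int[]> sub = new ArrayList<>();
            for (int i = 0; i < m; i++) {
                if ((mask & (1 << i)) == 0) sub.add(edges.get(i));
            }
            out.add(sub);
        }
        return out;
    }

    public static void main(String[] args) {
        List<int[]> triangle = List.of(
            new int[]{0, 1}, new int[]{1, 2}, new int[]{0, 2});
        // 4 query subgraphs for k = 1: the original plus 3 one-edge deletions.
        System.out.println(withDeletions(triangle, 1).size());
    }
}
```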
Scalable and Adaptive Graph Querying with MapReduce (Kyong-Ha Lee)
We address the problem of processing multiple graph queries over a massive set of data graphs in this letter. As the number of data graphs is growing rapidly, it is often hard to process graph queries with serial algorithms in a timely manner. We propose a distributed graph querying algorithm, which employs feature-based comparison and a filter-and-verify scheme working on the MapReduce framework. Moreover, we devise an efficient scheme that adaptively tunes a proper feature size at runtime by sampling data graphs. With various experiments, we show that the proposed method outperforms conventional algorithms in terms of both scalability and efficiency.
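The letter's distributed algorithm is not reproduced here; as a local, hedged sketch of the filter half of filter-and-verify, a data graph survives filtering only if its feature set contains every query feature, and only survivors undergo the costly exact verification. The feature strings are invented for the example:

```java
import java.util.List;
import java.util.Set;

/** Filtering half of a filter-and-verify scheme: a data graph is a
 *  candidate only if its feature set contains every feature of the
 *  query (features could be small paths or subtrees). Survivors go to
 *  exact verification; in the letter this runs as a distributed
 *  MapReduce job, here it is a purely local illustration. */
public final class FilterAndVerify {

    static boolean passesFilter(Set<String> dataFeatures, Set<String> queryFeatures) {
        return dataFeatures.containsAll(queryFeatures);   // necessary condition only
    }

    public static void main(String[] args) {
        Set<String> query = Set.of("A-B", "B-C");
        List<Set<String>> graphs = List.of(
            Set.of("A-B", "B-C", "C-D"),   // candidate: verify exactly
            Set.of("A-B", "C-D"));         // pruned: cannot contain the query
        graphs.forEach(g -> System.out.println(passesFilter(g, query)));
    }
}
```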
UiPath Test Automation using UiPath Test Suite series, part 5
Welcome to the UiPath Test Automation using UiPath Test Suite series, part 5. In this session, we will cover CI/CD with DevOps.
Topics covered:
CI/CD within UiPath
End-to-end overview of a CI/CD pipeline with Azure DevOps
Speaker:
Lyndsey Byblow, Test Suite Sales Engineer @ UiPath, Inc.
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
Neha Bajwa, Vice President of Product Marketing, Neo4j
Join us as we explore breakthrough innovations enabled by interconnected data and AI. Discover firsthand how organizations use relationships in data to uncover contextual insights and solve our most pressing challenges – from optimizing supply chains, detecting fraud, and improving customer experiences to accelerating drug discoveries.
Climate Impact of Software Testing at Nordic Testing Days (Kari Kakkonen)
My slides from Nordic Testing Days, 6.6.2024.
The talk discusses the climate impact and sustainability of software testing. ICT and testing must carry their part of global responsibility to help against climate warming. We can minimize the carbon footprint, but we can also have a carbon handprint, a positive impact on the climate. Quality characteristics can be extended with sustainability and then measured continuously. Test environments can be used less, at a smaller scale, and on demand. Test techniques can be used to optimize or minimize the number of tests. Test automation can be used to speed up testing.
Dr. Sean Tan, Head of Data Science, Changi Airport Group
Discover how Changi Airport Group (CAG) leverages graph technologies and generative AI to revolutionize their search capabilities. This session delves into the unique search needs of CAG’s diverse passengers and customers, showcasing how graph data structures enhance the accuracy and relevance of AI-generated search results, mitigating the risk of “hallucinations” and improving the overall customer journey.
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe (Paige Cruz)
Monitoring and observability aren't traditionally found in software curricula, and many of us cobble this knowledge together from whatever vendor or ecosystem we were first introduced to and whatever is part of our current company's observability stack.
While the dev and ops silo continues to crumble, many organizations still relegate monitoring and observability to the purview of ops, infra and SRE teams. This is a mistake: achieving a highly observable system requires collaboration up and down the stack.
I, a former op, would like to extend an invitation to all application developers to join the observability party, and I will share foundational concepts to build on.
A tale of scale & speed: How the US Navy is enabling software delivery from l...
Rapid and secure feature delivery is a goal across every application team and every branch of the DoD. The Navy’s DevSecOps platform, Party Barge, has achieved:
- Reduction in onboarding time from 5 weeks to 1 day
- Improved developer experience and productivity through actionable findings and reduction of false positives
- Maintenance of superior security standards and inherent policy enforcement with Authorization to Operate (ATO)
Development teams can ship efficiently and ensure applications are cyber ready for Navy Authorizing Officials (AOs). In this webinar, Sigma Defense and Anchore will give attendees a look behind the scenes and demo secure pipeline automation and security artifacts that speed up application ATO and time to production.
We will cover:
- How to remove silos in DevSecOps
- How to build efficient development pipeline roles and component templates
- How to deliver security artifacts that matter for ATOs (SBOMs, vulnerability reports, and policy evidence)
- How to streamline operations with automated policy checks on container images
GridMate - End to end testing is a critical piece to ensure quality and avoid...
End to end testing is a critical piece to ensure quality and avoid regressions. In this session, we share our journey building an E2E testing pipeline for GridMate components (LWC and Aura) using Cypress, JSForce, FakerJS…
Threats to mobile devices are more prevalent and increasing in scope and complexity. Users of mobile devices want to take full advantage of the features available on those devices, but many of those features provide convenience and capability while sacrificing security. This best-practices guide outlines steps users can take to better protect personal devices and information.
DevOps and Testing slides at DASA Connect (Kari Kakkonen)
Slides by me and Rik Marselis from the DASA Connect conference on 30.5.2024. We discuss what testing is, then what agile testing is, and finally what testing in DevOps means. We also held a lovely workshop in which participants tried to find different ways to think about quality and testing in different parts of the DevOps infinity loop.
Unlock the Future of Search with MongoDB Atlas: Vector Search Unleashed (Malak Abu Hammad)
Discover how MongoDB Atlas and vector search technology can revolutionize your application's search capabilities. This comprehensive presentation covers:
* What is Vector Search?
* Importance and benefits of vector search
* Practical use cases across various industries
* Step-by-step implementation guide
* Live demos with code snippets
* Enhancing LLM capabilities with vector search
* Best practices and optimization strategies
Perfect for developers, AI enthusiasts, and tech leaders. Learn how to leverage MongoDB Atlas to deliver highly relevant, context-aware search results, transforming your data retrieval process. Stay ahead in tech innovation and maximize the potential of your applications.
#MongoDB #VectorSearch #AI #SemanticSearch #TechInnovation #DataScience #LLM #MachineLearning #SearchTechnology
Unlocking Productivity: Leveraging the Potential of Copilot in Microsoft 365, a presentation by Christoforos Vlachos, Senior Solutions Manager – Modern Workplace, Uni Systems
Sudheer Mechineni, Head of Application Frameworks, Standard Chartered Bank
Discover how Standard Chartered Bank harnessed the power of Neo4j to transform complex data access challenges into a dynamic, scalable graph database solution. This keynote will cover their journey from initial adoption to deploying a fully automated, enterprise-grade causal cluster, highlighting key strategies for modelling organisational changes and ensuring robust disaster recovery. Learn how these innovations have not only enhanced Standard Chartered Bank’s data infrastructure but also positioned them as pioneers in the banking sector’s adoption of graph technology.
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo... (James Anderson)
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. A constant focus on speed to release software to market, along with traditionally slow and manual security checks, has caused gaps in continuous security, an important piece of the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their application supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
Maruthi Prithivirajan, Head of ASEAN & IN Solution Architecture, Neo4j
Get an inside look at the latest Neo4j innovations that enable relationship-driven intelligence at scale. Learn more about the newest cloud integrations and product enhancements that make Neo4j an essential choice for developers building apps with interconnected data and generative AI.
Elizabeth Buie - Older adults: Are we really designing for our future selves?
Bitmap Indexes for Relational XML Twig Query Processing
Kyong-Ha Lee, Department of Computer Science, University of Arizona, Tucson, AZ 85721, USA (bart7449@gmail.com)
Bongki Moon, Department of Computer Science, University of Arizona, Tucson, AZ 85721, USA (bkmoon@cs.arizona.edu)
CIKM'09, November 2–6, 2009, Hong Kong, China. Copyright 2009 ACM 978-1-60558-512-3/09/11.
ABSTRACT

Due to an increasing volume of XML data, it is considered prudent to store XML data on an industry-strength database system instead of relying on a domain specific application or a file system. For shredded XML data stored in the relational tables, however, it may not be straightforward to apply existing algorithms for twig query processing, because most of the algorithms require XML data to be accessed in a form of streams of elements grouped by their tags and sorted in a particular order. In order to support XML query processing within the common framework of relational database systems, we first propose several bitmap indexes for supporting holistic twig joins on XML data stored in the relational tables. Since bitmap indexes are well supported in most of the commercial and open-source database systems, the proposed bitmap indexes and twig query processing algorithms can be incorporated into the relational query processing framework with more ease. The proposed query processing algorithms are efficient in terms of both time and space, since the compressed bitmap indexes stay compressed during query processing. In addition, we propose a hybrid index which computes twig query solutions with only bit-vectors, without accessing labeled XML elements stored in the relational tables.

Categories and Subject Descriptors

H.2.3 [Database Management]: Language—Query languages; H.2.4 [Database Management]: Systems—Query processing

General Terms

Algorithms, Languages, Experimentation, Performance

1. INTRODUCTION

Due to its simplicity and flexibility, XML is widely adopted for information representation and exchange. With the growing popularity and increasing volume of XML data, much research has been done for managing XML data with relational database systems [5, 16, 12, 4]. For example, a node approach [16, 15, 12, 3] has been suggested as a way of mapping XML data into relational tables. In this approach, each XML node (i.e., element, attribute) is labeled and stored as a row in a relational table. To assist query processing, there may also be supplementary tables that store such information as path-materialized strings [15].

An XML query is commonly modeled as a twig pattern whose nodes are connected by either Parent-Child (P-C) or Ancestor-Descendant (A-D) axes. Most existing approaches to twig pattern matching are based on decomposing a twig into several linear paths. Once all the instances matching the linear paths are found from XML data (by virtue of a labeling scheme or a structural summary), they are joined together to find the instances matching the twig pattern as a final result. Numerous efficient algorithms for twig query processing have been reported in the literature, and some of them have been proven to be I/O optimal for a certain class of twig queries. Readers are referred to a recent survey for various XML query processing techniques [3].

One of the common assumptions most twig query algorithms rely on is that input data are stored and accessed as a form of inverted lists whose items are grouped by certain criteria and sorted in a specific order [1, 2, 10]. Since a storage scheme like that is not genuinely supported by relational database systems, the twig query algorithms cannot be directly applied without preprocessing XML data stored in tables. Typically, relevant elements are retrieved from tables, partitioned by their tags and then sorted by document order of the elements or an order given by a labeling scheme (e.g., interval based [7, 16]).

For example, to answer a twig query Q shown in Figure 2(a) for XML data stored in relational tables shown in Figure 3(b), Q is first decomposed into 3 paths (i.e., //A, //A//B and //A//C) and SQL queries are executed for the linear paths. Then, the path solutions are joined together to obtain the final result for Q.

Example 1. Pseudo SQL statements to answer a twig query in Figure 2(a):

    path_sol1 ← (select * from node N, path P where
        N.pid = P.pid and P.pathString like "A%"
        order by N.start);
    path_sol2 ← (select * from node N, path P where
        N.pid = P.pid and P.pathString like "B%A%"
        order by N.start);
    path_sol3 ← (select * from node N, path P where
        N.pid = P.pid and P.pathString like "C%A%"
        order by N.start);
    select * from path_sol1 A, path_sol2 B, path_sol3 C
        where A.start < B.start ∧ A.start < C.start ∧
        A.end > B.end ∧ A.end > C.end;
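The final join above encodes ancestor-descendant containment over interval labels: A is an ancestor of B exactly when A.start < B.start and A.end > B.end. A small Java rendering of that predicate, using the (start, end, level) label of a1 from Figure 1; the child label and class names are invented for the sketch:

```java
/** Interval-based labels as used in Example 1: an element is an
 *  ancestor of another exactly when its interval encloses the other's. */
public final class IntervalLabel {
    final int start, end, level;
    IntervalLabel(int start, int end, int level) {
        this.start = start; this.end = end; this.level = level;
    }

    boolean isAncestorOf(IntervalLabel other) {
        return this.start < other.start && this.end > other.end;
    }

    boolean isParentOf(IntervalLabel other) {
        // A /-axis (Parent-Child) step additionally constrains the level gap.
        return isAncestorOf(other) && other.level == this.level + 1;
    }

    public static void main(String[] args) {
        IntervalLabel a1 = new IntervalLabel(1, 32, 1);  // label from Figure 1
        IntervalLabel b  = new IntervalLabel(2, 9, 2);   // hypothetical child
        System.out.println(a1.isAncestorOf(b) + " " + a1.isParentOf(b));
    }
}
```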
To facilitate XML query processing with relational database systems, we propose several bitmap based indexes for XML data. A bitmap index has long been known as a powerful tool for OLAP queries and is supported by most commercial and open-source database systems. Since data items with a specific value from a discrete domain can be found quickly by accessing the corresponding bit-vector, a bitmap index is a very effective tool for retrieving relevant XML elements separately by their tags or other criteria, so that twig queries can be processed efficiently for XML data stored in relational tables.

We first present various cursor-enabled bitmap indexes built on different domains, namely (1) tag, (2) path, and (3) tag+level. These bitmap indexes provide quick access to relevant XML data stored in a table. Their optimization techniques are discussed as well. We also present two hybrid indexes. bitTwig, one of the hybrid indexes, is distinguished from most existing ones in that it identifies element relationships with only a pair of cursor-enabled bitmap indexes. All twig instances matching a given query can be found even without accessing the XML data (e.g., the labels associated with XML elements) stored in the relational tables. Since the amount of data fetched from the database can be minimized, this index improves the performance of twig query processing considerably. Our experiments show that bitTwig outperforms existing indexes in most cases.

The remainder of this paper is organized as follows. Section 2 reviews existing holistic twig join algorithms and XML storage schemes for RDBMS. Section 3 and Section 4 introduce our indexes and a way to determine the element relationship using bit-vectors. Section 5 describes the query processing algorithms with the indexes. Section 6 presents our experimental results. Related work is presented in Section 7.

2. BACKGROUND

2.1 Twig Patterns and Holistic Twig Join
Answering a twig pattern query, whose nodes are connected by a /-axis or a //-axis, is to find all matching occurrences of the pattern in XML documents. The solution to a twig query Q with k nodes is given by a set of k-ary tuples, each of whose elements represents an XML element stored in the relational tables. Figure 1 shows a sample XML document, each of whose elements is augmented by a document-ordered value (e.g., 0 for the element a1) and a label written in an interval-based labeling scheme (e.g., (1, 32, 1) for the element a1). Figure 2 shows examples of twig pattern queries.

Figure 1: A Sample XML Document

Figure 2: Twig Pattern Queries

Most holistic twig join algorithms [9, 2] use an interval-based labeling scheme [7, 16] to determine the relationship between two XML nodes. In addition, these algorithms require the XML data to be accessed in the form of streams of elements grouped by their tags, associated with their labels, and sorted in a particular order. They typically operate in two steps. A twig pattern is first decomposed into multiple linear paths and the solutions to the paths are computed. The intermediate results from the linear paths are then joined together for the final result.

Note that holistic join algorithms typically return as a final query result a set of k-ary tuples of labels in the form of (startPos, endPos, level), without including the actual XML data. To return the actual XML data for a given twig query, the references to the matching XML data stored in the relational tuples need to be provided as well.

2.2 XML Storage Scheme in DBMS
We assume that XML data are stored in relational tables as suggested in the node approach [16]. Each XML element is labeled by a labeling scheme (e.g., interval-based labels) and stored as a row in a table (known as a node table), as shown in Figure 3(b). The structural relationship between two elements (or rows) stored in the table is determined by the labels associated with the elements. The node table can be joined with a supplementary table (known as a path table) that stores path-materialized strings (shown in Figure 3(b)) to further assist query evaluation.

The node approach allows XML data to be indexed by the traditional indexing schemes provided by an RDBMS [12]. If the XML data stored in the node table are indexed by bitmaps, a twig query can be processed more efficiently by using the bitmap indexes alone or together with the node table, rather than by joining the node table with the path table. The bitmap indexes and query processing algorithms will be presented in more detail in Section 3 and Section 5. For bitmap indexing and query processing, we further assume that the XML elements are stored in the table in document order (as shown in Figure 3(b)), which can be maintained naturally, without additional processing, simply by following the order in which the elements appear in an XML document.
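To make the storage scheme concrete, the following C++ sketch models one row of each table. It is our illustration, not code from the paper: the field names (start, end, level, pid, tagName, pathString) follow the columns used in Example 1 and Figure 3(b), and the leaf-first pathString convention is inferred from the LIKE patterns in Example 1.

    #include <string>

    // One row of the node table: an XML element with an
    // interval-based label (start, end, level).
    struct NodeRow {
        long start;           // document-ordered start position
        long end;             // end position; descendants nest within [start, end]
        int level;            // depth of the element in the document tree
        int pid;              // path id, joining to the path table
        std::string tagName;  // element tag
    };

    // One row of the path table: a materialized path string per distinct path.
    struct PathRow {
        int pid;                 // path id
        std::string pathString;  // leaf-first, e.g., "B.A.A" for /A/A/B (assumed format)
    };

    // Interval-based ancestor-descendant test used by holistic twig joins:
    // a is an ancestor of d iff a's interval strictly contains d's.
    inline bool containsInterval(const NodeRow& a, const NodeRow& d) {
        return a.start < d.start && d.end < a.end;
    }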
3. INDEX STRUCTURES

3.1 Basic Bitmap Indexes
A bitmap index is a collection of bit-vectors. The bit-vectors in a bitmap index correspond to the distinct values in a certain value domain. The 1's in a bit-vector identify a group of XML elements (or tuples in a table) having the specific value corresponding to the bit-vector. The value domain can be a set of XML tags, a set of distinct paths, and so on. For example, if a bit-vector represents a tag name A, then the 1's in the bit-vector identify all XML elements whose tag name is A, by their row-ids in the relational table (shown in Figure 3(b)).
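As a concrete, if simplified, illustration of this idea (our sketch, not the FastBit-based implementation used in Section 6), a bitmap index over the tag domain maps each distinct tag to a bit-vector whose set bits are the row-ids of the matching elements:

    #include <cstddef>
    #include <string>
    #include <unordered_map>
    #include <vector>

    // An uncompressed bit-vector; bit i corresponds to row-id i of the
    // node table (elements are stored in document order).
    using BitVector = std::vector<bool>;

    // A bitmap index: one bit-vector per distinct value of a column,
    // e.g., the tagName column for bitTag or the pid column for bitPath.
    struct BitmapIndex {
        std::unordered_map<std::string, BitVector> vectors;

        // Row-ids of all elements having the given value, found by
        // scanning the 1's of the corresponding bit-vector.
        std::vector<std::size_t> rowIds(const std::string& value) const {
            std::vector<std::size_t> ids;
            auto it = vectors.find(value);
            if (it == vectors.end()) return ids;
            for (std::size_t i = 0; i < it->second.size(); ++i)
                if (it->second[i]) ids.push_back(i);
            return ids;
        }
    };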
Figure 3: Two Different Storage Schemes

For twig query processing, we first propose three basic bitmap indexes that can be defined on different value domains: bitTag, bitPath and bitTag+. The indexes are used to determine the row-ids of the elements required for a given query, stored in a relational table.

The bitTag is a bitmap index whose bit-vectors correspond to distinct tag names. This index can be obtained by taking the tagName column in Figure 3 as the value domain. Exactly one bit-vector will be accessed from a bitTag index per query node during query evaluation.

The bitPath is a bitmap index whose bit-vectors correspond to distinct paths. This bitmap index can be obtained by taking the pid (i.e., path id) column in Figure 3 as the value domain. A 1 in a bit-vector indicates a terminal node (i.e., the node opposite to the root) in the linear path corresponding to the bit-vector. From a bitPath index, multiple bit-vectors per query node may be accessed during query evaluation, since a query with a //-axis may select multiple linear paths (e.g., /A and /A/A for a given path //A).

The bitTag+ is a bitmap index based on a tag+level partitioning, as suggested in iTwigJoin [2]. XML elements are first partitioned by their tags, and the elements in each partition are again partitioned by their levels. In other words, bitTag+ is a bitmap index whose bit-vectors correspond to distinct pairs of (tag, level).

Note that the bitTag consists of the smallest number of bit-vectors among the indexes. The bitTag+ has more bit-vectors than bitTag due to the further partitioning by levels, and the bitPath has the largest number of bit-vectors. Obviously, the sizes of the bitmap indexes grow as the cardinality of the column over which the indexes are built grows higher. However, since bit-vectors are usually compressed, the size of a bitmap index can be shrunk remarkably. Typically, bit-vectors tend to be sparser as the cardinality grows higher, which makes them more compressible.

The number of bit-vectors required for processing a twig query will also differ depending on which index is used. For example, we have three bit-vectors in a bitPath index, all of which correspond to the query node C in the twig pattern shown in Figure 2(a), because the path for C is //A//C and the path ids of the corresponding path strings are 3, 8 and 9 in the path table in Figure 3. If a bitTag index is used instead, only one bit-vector is required for the query node C. (This input scheme is identical to that of twigStack.) In our algorithms, if more than one bit-vector is required per query node, the bit-vectors can either be merged into a single bit-vector by bitwise-OR operations, or all the required bit-vectors can be accessed at the same time by using their own cursors (see Table 1 for a summary).

3.2 More Indexes for A-D Relationship
The basic bitmap indexes presented above primarily represent XML elements having a certain tag, a certain path, or a certain pair of (tag, level). To help determine the A-D (ancestor-descendant) relationship between elements identified by a bitPath bit-vector, we propose two additional bitmap indexes.

The bitAnc index consists of a set of bit-vectors, one per distinct path in an XML document. The 1's in a bit-vector represent the set of ancestor nodes of a terminal node in the corresponding path, together with the terminal node itself. That is, a bit-vector of a bitAnc index covers all nodes constituting the linear path that the bit-vector represents. For example, the 1 bit at position 2 in the bitPath bit-vector in Figure 4(a) represents node b1, a terminal node in a path /A/A/B. In contrast, in the bitAnc bit-vector, the first and the second bits are also set to 1 to represent b1's ancestors, a1 and a2.

The bitDesc index is the same as the bitAnc index except that its bit-vectors represent descendants rather than ancestors. In Figure 4(a), the bit-vector of the bitDesc index represents the descendants of all terminal nodes for the path /A/A/B. For example, position 8 of the bitDesc bit-vector is set to 1 to represent b2's descendant, e2.

Figure 4(b) shows a subtree represented by the three bit-vectors shown in Figure 4(a). This illustrates that a subtree corresponding to a given path can be projected from the original XML document using the bit-vectors. We utilize this property for twig query processing.

3.3 Properties of Bitmap Indexes
Bitwise operators such as bitwise-AND and bitwise-OR (denoted by ⊗ and ⊕, respectively) are associative and commutative. While many rules may be derived for the indexes, we summarize only the most useful properties for our bitmap indexes in the following lemma. We denote a bit-vector of a certain index type by indexName(value), where value is the column value the bit-vector corresponds to. For example, bitTag(a) denotes a bit-vector of a bitTag index corresponding to a tag name a, and bitTag+(a, l) denotes a bit-vector of a bitTag+ index corresponding to a tag name a on level l.

Lemma 1. The proposed bitmap indexes satisfy the following properties.

    bitTag(a) ≡ ⊕_{l=1}^{height(doc)} bitTag+(a, l) ≡ ⊕_{i∈end(a)} bitPath(i)              (1)

    ⊕_{i∈pids(s)} bitPath(i) ≡ ∂_{path=s}(bitTag(a)), leaf(s) = a                          (2)

    bitPath(i) ≡ bitAnc(i) ⊗ bitPath(i) ≡ bitDesc(i) ⊗ bitPath(i)                          (3)

    (⊕_{i∈pids(s)} bitDesc(i)) ⊗ bitTag(a) ≡ ∂_{a∈pids(s)}(bitTag(a)), leaf(s) = a         (4)

Here, bitDesc(i) represents a subtree, pruned from the whole XML document, rooted at the tags corresponding to a path id i. end(a) returns the set of pids of path strings which end with a tag name a. leaf(s) returns the tag name of the terminal of a path string s, and pids(s) returns the set of path ids corresponding to a path string s.
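For intuition, the following C++ sketch (ours; it uses plain word-parallel bit-vectors rather than the compressed vectors of Section 5.4) shows the two operations the lemma is built from: the ⊕ merge that combines, e.g., all bitPath(i) for i ∈ pids(s) into one input vector, and the ⊗ intersection which, by property (3), leaves bitPath(i) unchanged when applied with bitAnc(i):

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    using Words = std::vector<std::uint64_t>; // one bit per node-table row

    // Bitwise-OR merge (the oplus of Lemma 1), e.g., merging all
    // bitPath(i) for i in pids(s) into a single input vector.
    Words orMerge(const std::vector<Words>& vs) {
        Words out(vs.empty() ? 0 : vs[0].size(), 0);
        for (const Words& v : vs)
            for (std::size_t w = 0; w < v.size(); ++w) out[w] |= v[w];
        return out;
    }

    // Bitwise-AND (the otimes of Lemma 1). By property (3),
    // andWords(bitAnc_i, bitPath_i) == bitPath_i, since a bitAnc
    // bit-vector contains the terminal nodes of its path as well.
    Words andWords(const Words& a, const Words& b) {
        Words out(a.size(), 0);
        for (std::size_t w = 0; w < a.size(); ++w) out[w] = a[w] & b[w];
        return out;
    }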
    Index                   bitTag                       bitPath                      bitTag+
    Inputs per query node   single bit-vector & labels   single bit-vector & labels   multiple bit-vectors & labels
    Type of bit-vector      bitTag(a)                    ⊕_{i∈leaf(a)} bitPath(i)     bitTag+(a, l)
    Optimization            advance(k)                   condense(q)                  advance(k)

Table 1: Summary of Basic Indexes

Figure 4: Example of A-D Indexes

Property (1) allows multiple bit-vectors to be merged into a single vector that constitutes a tag-based input for a twig join algorithm. Property (2) states that a bit-vector generated by merging all the bitPath bit-vectors satisfying a given path string s is identical to the selection, over the path string s, of the bitTag bit-vector corresponding to the tag leaf(s). By property (3), the bit-vectors in the bitAnc and bitDesc indexes contain the terminal nodes for a given path i as well, and these can be identified by virtue of bitPath. Finally, property (4) shows that, with a given path s, we can get a bit-vector which covers the nodes whose tag name is a in a subtree rooted at leaf(s), by merging bitDesc and bitTag with ⊗.

4. CHECKING ELEMENT RELATIONSHIP WITH BIT-VECTORS ONLY
Twig query processing with the basic indexes (presented in Section 3.1) requires the use of additional information, such as labels, in order to determine the relationship between elements. This is because the bitmap indexes provide no more information than the row-ids of the elements corresponding to a certain value. In this section, we introduce a hybrid index called bitTwig, which consists of two different bitmaps. With the bitTwig index, we can determine the relationship between elements without resorting to the labels.

Since XML nodes are stored in a table in document order, we can check the preceding and following relationships by simply inspecting the positions of two corresponding 1's in the bit-vectors. A bitAnc index covers all ancestors of the terminal nodes specified by a bitPath index. Thus we can identify A-D and P-C relationships with a pair of bitPath and bitAnc indexes, instead of an interval-based labeling scheme.

We let indexName(v) and indexName(v)[k] denote a bit-vector of an indexName bitmap for a given value v and the k-th bit of that bit-vector, respectively. The path identifier corresponding to an element a is denoted by pathid(a). In addition, we use pos(a) to denote the document order of an element a.

Theorem 1. For two distinct elements a and d in the same document, a is an ancestor of d if and only if the following conditions are satisfied.

    (1) pathid(a) ≠ pathid(d),
    (2) pos(a) < pos(d),
    (3) bitAnc(pathid(d))[pos(a)] = 1,
    (4) bitPath(pathid(a))[k] = 0, ∀k : pos(a) < k < pos(d).

Proof. (⇒) To show that these are necessary conditions for a to be an ancestor of d, we show that a is not an ancestor of d if any of the four conditions is not satisfied. If condition (1) is not satisfied, a cannot be an ancestor of d because their path strings are identical. If either condition (2) or condition (3) is not satisfied, a is obviously not an ancestor of d, by the document order restriction or by the definition of the bitAnc index. If condition (4) is not satisfied, there exists an element x such that bitPath(pathid(a))[pos(x)] = 1 and pos(a) < pos(x) < pos(d). Since bitPath(pathid(a))[pos(a)] = 1, the path strings of a and x are identical, which makes x come after all the descendants of a in the document order. Since pos(x) < pos(d), d cannot be a descendant of a either.

(⇐) We can show that these are sufficient conditions for a to be an ancestor of d by contradiction. Suppose a is not an ancestor of d when all four conditions are satisfied. Since a is not an ancestor of d, the relationship between a and d in the document hierarchy must be one of the following three: (i) a is a descendant of d; (ii) a and d are not related and a comes after d in the document order; (iii) a and d are not related and a comes before d in the document order. Relationships (i) and (ii) both contradict condition (2). Suppose a and d are in relationship (iii). Condition (3) dictates that a is on a certain path (not necessarily as a terminal node) whose path string is identical to the path string of d. Since a and d have different path strings by condition (1), the path string of a must be a prefix of the path string of d. This implies that there exists an element y such that y is an ancestor of d and the path string of y is identical to the path string of a. This in turn implies bitPath(pathid(a))[pos(y)] = 1, which contradicts condition (4), because pos(a) < pos(y) < pos(d). Therefore, a must be an ancestor of d if all four conditions are satisfied.
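The four conditions translate almost directly into code. The following C++ sketch (ours; it probes uncompressed bit-vectors by random access, whereas Algorithm 1 below works with cursor-enabled compressed vectors) checks them in order:

    #include <vector>

    struct Elem { int pathid; long pos; };  // pathid(a) and pos(a)
    using Bits = std::vector<bool>;          // one bit per document position

    // Theorem 1: a is an ancestor of d iff all four conditions hold.
    bool isAncestor(const Elem& a, const Elem& d,
                    const std::vector<Bits>& bitPath,
                    const std::vector<Bits>& bitAnc) {
        if (a.pathid == d.pathid) return false;      // condition (1)
        if (!(a.pos < d.pos)) return false;          // condition (2)
        if (!bitAnc[d.pathid][a.pos]) return false;  // condition (3)
        for (long k = a.pos + 1; k < d.pos; ++k)     // condition (4): no other
            if (bitPath[a.pathid][k]) return false;  // terminal of a's path between them
        return true;
    }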
Figure 5: Identifying A-D relationship with bit-vectors

A sample use of the index pair for answering the twig query in Figure 2(a) is shown in Figure 5. In Figure 5, each label under a bit-vector indicates the tag name, path id and level the bit-vector represents. For better understanding, we illustrate each pair of bit-vectors as one, by denoting terminal nodes as underlined 1's in the figure. We identify which bits in bitAnc(i) represent terminal nodes by virtue of bitPath(i) during query evaluation.

Example 2. With the pairs of bit-vectors in Figure 5, we check the A-D relationship between a2 and c1 and between a2 and c2 in Figure 1. Here, the bit positions of a2, c1 and c2 are 1, 3 and 10, respectively, as shown in Figure 1. We can decide that a2 is an ancestor of c1, since they satisfy all the conditions in Theorem 1. However, a2 is not an ancestor of c2, since bitAnc(pathid(c2))[pos(a2)] (i.e., the bit at pos(a2) in the bit-vector labeled C(8,4), which corresponds to c2) is 0. This violates condition (3) of Theorem 1.

Example 3. With the pairs of bit-vectors in Figure 5, we decide that a2 and b2, where pos(a2) = 1 and pos(b2) = 7, have no A-D relationship, since bitPath(1)[k] = 1 for some k with pos(a2) < k < pos(b2).

Note that if a_k is the k-th element which corresponds to a pathstring, and it has a descendant d, then pos(a_k) < pos(d) < pos(a_{k+1}). Thus, we can skip examining the 1's in bitPath(j) from pos(a_k) to pos(a_{k+1}) if they do not satisfy the other conditions. This gives us a performance merit while checking the A-D relationship. We additionally use the level information of the elements for identifying the P-C relationship.

Algorithm 1: checkAD(a, d)
Output: Boolean
 1  if Curr_a.pid ≡ Curr_d.pid then
 2      return false;
 3  else if pos(a) < pos(d) then
 4      if getValueFromBitAnc(Curr_d.pid, Curr_a.pos) then
            // check whether any other node exists between them in the
            // current ancestor's bitPath
 5          if Next1_a[Curr_a.pid].pos < Curr_d.pos then return false;
 6          else return true;
 7      else return false;
 8  else
 9      return false;
10  end
11  Function getValueFromBitAnc(pid, pos)
12      return the value of the bitAnc bit-vector which represents pid, at the position pos;
13  end

Algorithm 1 describes the function which checks the A-D relationship between two elements. In line 1, it checks whether the two elements a and d have the same path id. If they do, the elements are terminals of different occurrences of the same pathstring (e.g., a2 and a3 in Figure 1); thus, they have no A-D relationship. However, if a and d have the same tag name but different path ids, an A-D relationship between them may exist; e.g., the two elements a1 and a2 in Figure 1 have the same tag name and also have an A-D relationship. Note that we do not merge bit-vectors in Algorithm 1, so as to identify which bit-vector corresponds to the current nodes. In line 5, to examine condition (4) of Theorem 1, we compare the position of the next 1 with that of the current 1. To avoid moving a cursor in bitAnc from pos(a_k) to pos(a_{k+1}) every time condition (4) is checked, we keep the position of the next 1 in advance by looking ahead (see Section 5.3). getValueFromBitAnc(pid, pos) is the function for examining condition (3) of Theorem 1.

5. TWIG QUERY PROCESSING

5.1 Algorithm Overview

Algorithm 2: Main procedure
 1  condense a given twig Q to Q' if possible;                     // see Section 5.4
 2  select bit-vectors from the index(es) and load them into
    each query node q ∈ Q;                                         // see Section 5.2
 3  while ¬end(root) do
 4      q ← getNext(root);
 5      if q ≠ root then
 6          cleanStack(S_parent(q), Curr_q.pos);
 7      if q ≡ root ∨ ¬empty(S_parent(q)) then
 8          cleanStack(S_q, Curr_q.pos);
 9          push Curr_q and Next1_q[Curr_q.pid] into stack S_q;    // see Section 5.3
10          advance(q);
11          if isLeaf(q) then
12              showSolutionFromStacks(q);                         // see twigStack [1]
13              pop(S_q);
14      else advance(q, Curr_parent(q).pos);                       // see Algorithm 3
15  mergeAllPathSolutions();                                       // see twigStack [1]
16  Function getNext(q)
17      if isLeaf(q) then return q;
18      foreach q_i ∈ children(q) do
19          n_i ← getNext(q_i);
20          if ¬exist(q_min) ∨ (Curr_ni.pos < Curr_qmin.pos) then q_min ← n_i;
21          if ¬exist(q_max) ∨ (Curr_ni.pos > Curr_qmax.pos) then q_max ← n_i;
22          if parent(n_i) ≠ q then return n_i;
23      while ¬checkAD(q, q_max) do                                // see Algorithm 1
24          advance(q);
25      if Curr_q.pos > Curr_qmin.pos then return q_min;
26      else return q;
27  end

Our main algorithm for the twig join is shown in Algorithm 2. Its inputs vary according to the type of the selected index. Given a twig query Q, we first condense the query nodes if possible (this will be discussed in Section 5.4). This step reduces the number of query nodes to be processed. Second, we select the bit-vectors for each query node from the index. Then, we iterate the procedure in lines 3-14 until the root query node exhausts its input (i.e., until all the cursors in the bit-vectors for the root reach the ends of their bit-vectors). getNext(q) returns a query node n_i whose current position is the smallest among the current positions of the query nodes that are descendants of the current element of a query node q. cleanStack(S_q, pos) pops nodes out of the stack S_q if they are not ancestors of the node referred to by pos. The two functions either read labels or call checkAD(a, d) (Algorithm 1) to determine the A-D relationship, depending on the type of index used.

advance() moves a cursor in the bit-vector(s) assigned to each query node forward and sets the current value of the bit-vector(s) to the pos the cursor points to (this will be described in Section 5.3). The properties of the two hybrid indexes in this paper are summarized in Table 2.

    Name                    DescTag                                  bitTwig
    Indexes used            bitDesc & bitTag                         bitPath & bitAnc
    Inputs per query node   single bit-vector & labels               multiple bit-vectors only
    Type of bit-vector      (⊕_{i∈pids(q)} bitDesc(i)) ⊗ bitTag(a)   (bitPath(i), bitAnc(i)), i ∈ pids(s)
    Optimization            advance(k)                               condense(q)

Table 2: Summary of Two Hybrid Indexes

5.2 Selecting Bit-vectors
The number of bit-vectors assigned to a query node varies according to the selected index type. With bitTag, only a single bit-vector per query node q is selected. With bitPath, the multiple bit-vectors which satisfy a pathstring s for a query node q are merged into a single vector (Lemma 1(2)), which is then associated with q.
For a twig pattern with an OR-predicate (Figure 2(d)), we build a query node x instead of both C and D, then find the bit-vectors which represent either of the paths //A/B/C or //A/B/D. Next, we merge the bit-vectors into a single vector and assign it to x. With the DescTag index, we find all bitDesc bit-vectors for the pids corresponding to each query node q in Q, then merge them into a single vector for each q. Next, we AND this with bitTag(a) for each query node a, according to Lemma 1(4). bitTag+ and bitTwig allow multiple bit-vectors per query node, and the selected bit-vectors in these indexes are not merged. Note that if any query node in Q does not have any corresponding bit-vector, no solution for Q exists in the XML document D. This property makes it possible to examine whether a twig solution for Q exists in D without completing the whole twig join.
5.3 Advancing Cursors
For the sake of twig joining, we augment our bitmap indexes with cursors, so that the positions of the 1's in a bit-vector can be checked individually by advancing a cursor associated with the bit-vector. advance() moves a cursor forward to the next 1 in the bit-vector(s) and fills Curr_q, i.e., the current element information of q's input, with either a bit position or a corresponding label; that is, Curr_q retains a pair (pos, levelNum) for bitTwig and retains a label for the other indexes. For a single bit-vector per query node, Curr_q always retains the pos the cursor indicates. For multiple bit-vectors per query node, only the minimum of the cursors' positions is selected as Curr_q, to guarantee that Algorithm 2 reads one element per query node at a time. Figure 6 illustrates multiple bit-vectors and their cursors associated with a query node A in a twig pattern (shown in Figure 2(a)). In the figure, a Current1 and a Next1 are associated with each bit-vector. The Current1 stores the position of the bit the cursor indicates, and the Next1 is set to the position of the next 1 in the bit-vector by looking ahead. Note that we avoid seeking the next 1 when checking the A-D relationship in Algorithm 1 by keeping the Next1 in this way. Among the Current1s, the minimum value is selected as Curr_q, together with the pid and levelNum of its bit-vector.

Figure 6: Current Values and Multiple Bit-vectors Per Query Node

When advance() is called, the Next1 of the bit-vector from which the current Curr_q comes is copied to the Current1 of that bit-vector. Next, the cursor in the bit-vector moves forward to the next 1, and the Next1 is newly set to the cursor's current position. E.g., the two bit-vectors' Current1s in Figure 6 indicate 0 and 1; thus the minimum value, 0, is selected as Curr_q. After that, advance() moves the cursor of bitPath(0), which corresponds to the current Curr_q.pid, forward, and finally it reaches the end of the bit-vector. At this time, 1 is chosen as Curr_q.pos.

With the bitTag+ index, we block off copying the Current1 to Curr_q for irrelevant bit-vectors associated with q at runtime, by checking the current level value of the inputs of parent(q) or ancestor(q). For example, suppose we have a query node whose ancestor query node ancestor(q) is associated with multiple bit-vectors for its input, and let Curr_ancestor(q).level be the level of the current element of ancestor(q). Here, we block off copying the Current1 of any bit-vector which does not satisfy level > Curr_ancestor(q).level. For the P-C relationship, we ban copying the Current1 of any bit-vector which does not satisfy level = Curr_parent(q).level + 1. As a result, we can skip reading elements whose levels do not satisfy the level condition. This improves the I/O optimality of the twig join by avoiding reading useless elements from the relational table.
5.4 Optimization
The CPU time of a holistic twig join algorithm which allows multiple inputs for a single query node is O(N × |Q| × |Input + Output|) for a twig query Q [2], where N is the total number of useful inputs for the twig query Q. Since |Output| should be the same for all twig join algorithms, the only option for optimization is to reduce |Q| and |Input|. We introduce our methods to reduce |Q| and |Input| in this section.

Condensing Query Nodes. With a bitmap index built over the path domain, i.e., bitPath or bitTwig, query nodes which are neither leaf nodes nor branching nodes, and which are not projected to appear in the twig solution, can be ignored during query processing. This reduces both |Q| and |Input|. E.g., in Figure 2(c), if the query node B is not projected, we do not read any b elements during query processing, since, with our indexes built on the path domain, the bit-vectors associated with C represent exactly those c elements that correspond to a linear path //A/B/C, rather than all c ∈ C, according to Lemma 1(2). Thus, we can select the useful c nodes without reading any b.

Bit-Vector Compression and Cursor Movement. The number of bit-vectors in a bitmap index is the same as the cardinality of the column the index is built over. Since a high-cardinality column makes the index large, compression techniques are useful to reduce the index size. A common technique for compressing bit-vectors is run-length encoding, where a run is a consecutive sequence of the same bit. We use a hybrid run-length encoding [13] that groups bits into compressed words and uncompressed words. The major benefit of this approach is that bitwise operations are performed very efficiently, benefiting from the fact that they are actually performed on word units. Figure 7 shows how a bit-vector of 8,000 bits is compressed into four 32-bit words by the compression scheme. The first bit of each word decides whether the word is compressed. If the first bit is 1, the word is compressed with a sequence of the bit value that the second bit (i.e., a fill bit) stores; e.g., if the fill bit is 0, then the word is filled with a sequence of 0's. In the figure, the first and the third words are uncompressed. The second word is compressed, and 256*31 0's fill the word. The fourth word keeps the rest bits, which store the last few bits that could not constitute a full word by themselves. Another word with the value 2, not shown, is required to store the number of useful bits in the rest word.

Figure 7: Compressing A Bit-vector on a 32-bit Machine

Furthermore, to enable a cursor to move on a compressed bit-vector without decompressing it, we implemented the cursor as a composite variable, which consists of a logical position and three physical position values:

    C.position : the logical (integer) position of the current bit C is located at,
    C.word     : the current word C is located at,
    C.bit      : the current bit within C.word C is located at,
    C.rest     : the current bit within the rest word, if C is located at that word.

We use two functions for moving a cursor on a compressed bit-vector (i.e., getPosOfNext1 and getValueFromBitAnc). With the cursor C, we can skip examining a consecutive sequence of 0's by examining just one compressed word whose fill bit is 0. Thus, the cost of finding the 1's in a bit-vector is remarkably reduced whenever the cursor meets compressed words.
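The word layout just described can be sketched in C++ as follows (our sketch, modeled on the hybrid word-aligned scheme of [13] as described above, not on FastBit's actual API); a compressed 0-fill word lets the cursor jump run length * 31 logical positions in one step:

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    constexpr int kPayload = 31; // payload bits per 32-bit word

    // True if the word is a compressed fill word (first bit set).
    inline bool isFill(std::uint32_t w) { return (w >> 31) & 1u; }
    // The fill bit: the word stands for runLen(w) * 31 copies of it.
    inline int fillBit(std::uint32_t w) { return (w >> 30) & 1u; }
    // Run length in words, encoded in the remaining 30 bits.
    inline std::uint32_t runLen(std::uint32_t w) { return w & 0x3FFFFFFFu; }

    // Advance a logical position over one word without decompressing:
    // a 0-fill word is skipped in O(1), covering runLen * 31 positions.
    long skipWord(const std::vector<std::uint32_t>& words,
                  std::size_t wordIdx, long logicalPos) {
        std::uint32_t w = words[wordIdx];
        if (isFill(w) && fillBit(w) == 0)
            return logicalPos + static_cast<long>(runLen(w)) * kPayload;
        return logicalPos + kPayload; // literal word: examine bit by bit
    }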
Example 4. Let a cursor C indicate the 31-th bit of the first word of the bit-vector at the bottom of Fig. 7; then the current status of C is {31, 0, 31, 0}. To find the position of the next 1, it begins examining the first bit of the second word, which is compressed. Since the word is filled with 0's, we skip counting the position value bit-by-bit by simply increasing C by run length ∗ w, where w is the number of payload bits per word (31 here). C will be {7967, 1, 31, 0} right after reading the second word. Thus, the position of the next 1 will be 7,998 (= 31 + 256*31 + 31).

In Algorithm 1, we also check the bit at a given position pos, for identifying the A-D relationship between two elements, with the function getValueFromBitAnc. For this function, we maintain two cursors in each bit-vector of bitAnc. We separate two-way cursor moves into two one-way cursor moves, since a cursor's bidirectional moves may affect the distance C moves; i.e., C would have to move forward a longer distance if it had gone back once. In the function, if a given position pos is less than the forward cursor C, it moves another cursor B backward. Also, if pos is behind the position of B, then it copies the position of C to B to set a new starting line for B. By adding the backward cursor B, we prevent the cursor from strolling over the bit-vector due to occasional backward moves. When the cursor C moves forward, it first computes the distance C should move, and it reduces the remaining distance whenever C advances. If the remaining distance is less than the number of bits the visited word represents, that word contains the bit to be checked. The cost of computing the remaining distance is also cheap when C meets compressed words, since the number of bits to be examined is reduced by the compression. In our algorithm, the two function calls never mingle on a single bit-vector: the cursors in most of the indexes are triggered only by getPosOfNext1, and only the cursors in bitAnc are triggered by getValueFromBitAnc calls.

Example 5. Let a cursor C be at the 16-th bit of the first word in Figure 7, i.e., C = {16, 0, 16, 0}. To get the value of the 3,000-th bit, it first computes the remaining distance as 2,984 (= 3,000 − 16). Next, since the current word is not compressed and the distance is not smaller than the number of bits the word represents, C moves forward by a word. Then, C.position is 31 and the remaining distance is 2,969 (= 3,000 − 31). The second word is compressed and the distance < run length ∗ 31, thus it returns the fill bit of the word, 0, as the result.

Figure 8: Skipping to Read Useless Nodes

Skipping Obsolete Input Nodes. We skip reading many useless labels with the function advance(k). The idea is illustrated in Figure 8. Assume that there is a linear path //A//B//C, and the intervals of the elements in each query node input are like those shown in Figure 8. Let the cursors for each input indicate a_i, b_{k-1}, and c_{k-1}. With no skipping, the elements in the input for query node B will be read from b_{k-1} to b_i. Here, b_i, which has an A-D relationship with a_i, satisfies the condition startPos(a_i) < startPos(b_i) ∧ endPos(b_i) < endPos(a_i). Thus, we can skip reading the labels in the input of B from b_k to b_{i-1}, by jumping the cursor in our index to b_i, which satisfies pos(a_i) < pos(b_i). Also, this action is performed recursively, so we likewise skip reading the labels for the query node C from c_k to c_{i-1}.

Algorithm 3: advance(q, k) for a single bit-vector bv
 1  pos ← C.position;                  // a value for the position
 2  if k is not given then
 3      pos ← getPosOfNext1(bv);
 4  else
 5      repeat
 6          pos ← getPosOfNext1(bv);
 7      until pos > k;
 8  end
 9  if pos ≠ EOF then
10      read the record that the pos value indicates and load a label;
11  end

In Algorithm 3, the input k is pos(ancestor(q)), i.e., the current position of an ancestor of a query node q. Two elements which have an A-D relationship satisfy the condition pos(D) > pos(A). Therefore, we iteratively advance a descendant's cursor until it satisfies this condition (lines 5-7). Then, it loads a label from the record the cursor indicates. Note that advance(k) in Algorithm 3 is the version for a single input. If multiple inputs per query node are allowed, the function advances the cursors of all bit-vectors until k and sets the current position by the rule described in Section 5.3.
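For completeness, a direct C++ transcription of Algorithm 3 (our sketch; getPosOfNext1 abstracts the compressed-vector cursor move of Section 5.4 and is shown here over an uncompressed vector, and loadLabel stands for the record fetch from the node table):

    #include <cstddef>
    #include <optional>
    #include <vector>

    // Cursor over a bit-vector; getPosOfNext1 returns the position of
    // the next 1, or nullopt at EOF.
    struct BitCursor {
        const std::vector<bool>* bits;
        long pos = -1;
        std::optional<long> getPosOfNext1() {
            for (long i = pos + 1; i < (long)bits->size(); ++i)
                if ((*bits)[(std::size_t)i]) { pos = i; return i; }
            return std::nullopt;  // EOF
        }
    };

    // Algorithm 3, advance(q, k): skip the descendant's 1's until its
    // position passes k = pos(ancestor(q)), then load one label.
    template <typename LoadLabel>
    bool advance(BitCursor& bv, std::optional<long> k, LoadLabel loadLabel) {
        std::optional<long> p = bv.getPosOfNext1();
        if (k)
            while (p && *p <= *k)      // lines 5-7: repeat until pos > k
                p = bv.getPosOfNext1();
        if (!p) return false;          // pos = EOF: nothing to read
        loadLabel(*p);                 // read the record pos indicates
        return true;
    }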
5.5 Analysis of bitTwig
The size of the input data when using one of the basic indexes is O(Σ_{i=1}^{n} |bv_i| + |tuples|), where n is the total number of useful bit-vectors, |bv_i| is the size of a useful bit-vector bv_i, and |tuples| is the number of useful tuples in the node table (shown in Figure 3). Meanwhile, the size of the input data when using bitTwig is at most O(2 × Σ_{i=1}^{n} |bv_i|), since it requires only a pair of indexes. The CPU time of our algorithm when using a basic index is simply the sum of the CPU time of the twig joining and the cost of the cursor moves on the useful bit-vectors. The CPU time of bitTwig can be computed by summing the cost of the total calls of each sub-function in our algorithm. Algorithm 2 contains three sub-functions reading inputs, i.e., advance(), getNext() and cleanStack(). The total calls of advance() in Algorithm 2 read all the 1's in the useful bit-vectors. Moreover, in each advance() call, the minimum pos among the bit-vectors' Current1s is selected. Thus, the cost of the total advance() calls in bitTwig is O(n × |Q| × Σ_{i=1}^{n} |bitPath(i)|). Note that advance() moves a cursor in a bitPath(i) forward, and checkAD() moves a cursor of bitAnc(i) to a specific position. Thus, checkAD() does not affect the cost of the advance() calls.

The checkAD() function is called in cleanStack() and in getNext() to identify the A-D relationship between two elements. Since getNext() calls checkAD() with the position of the ancestor's cursor, which always advances forward by advance(), it makes the cursor of bitAnc(i) move forward only. Thus, the cost of the total calls of getNext() is O(n × |Q| × Σ_{i=1}^{n} |bitAnc(i)|). In contrast, cleanStack() involves the cursor moving backward. During cleanStack(a_i, d_j), the items in a stack S_A are compared with the current cursor position of bitAnc(j) for checking the A-D relationship according to condition (3) of Theorem 1. Thus, the items in S_A are visited from top to bottom, and this makes the cursor in bitAnc(j) move backward. However, even in this case, the cursor never goes back over the same 1's repeatedly, because if the topmost stack item has an A-D relationship with d_j, the other items in S_A are not compared with d_j and do not trigger backward cursor moves. On the contrary, if the topmost stack item has no A-D relationship, the item will be popped out of S_A, and the cursor will not move toward the position the popped item indicates any more. Thus, with our separated two-way cursors described in Section 5.4, the cost of the total calls of cleanStack() is O(n × |Q| × Σ_{i=1}^{n} |bitAnc(i)|). Finally, the total CPU time of bitTwig is O(n × |Q| × (Σ_{i=1}^{n} (|bitPath(i)| + |bitAnc(i)|) + |Output|)) when the output size is added in, meaning that bitTwig is linear in the sum of the sizes of the bit-vector pairs. Note that the size of a pair of the bitmap indexes is even less than the size of the labels in most cases (see Table 4).
for each in a set of path ids that the i-th XML element cor-
6. EXPERIMENTS responds to. To build bitDesc, whenever pop() is called, we
copy the pid of a lower stack item to an upper item from
6.1 Experimental Setup bottom to top simultaneously. Then, each element has a set
We implemented our algorithms as a single threaded ex- of pids for its ancestors. e.g., b1 in Fig 1 has a set of pids,
ecutable in GNU C++. All experiments were performed 0, 1, 2. We construct bitDesc with the pid sets.
on a system having a Core 2 Duo 2GHz processor and a
5,400 RPM disk that ran Linux 2.6. We used the FastBit 6.2 Experimental Results
library [13] and augmented it with our cursor mechanism. Statistics of Indexes We compared sizes of our indexes
We also used Xerces library to parse XML documents. The in Table 4. Compared with the interval-based labels, our
node table in Fig 3(b) and indexes were built on a file sys- indexes were comparably small. Also, even in a case of bit-
tem. To check how indexes and different input data struc- Twig, the sum of bitPath and bitAnc was smaller than the
ture affected the performance, we also compared our algo- labeled values, except for TBANK dataset. Our observation
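A condensed C++ sketch of this construction (ours; it folds the push/pop bookkeeping into SAX-style callbacks, the container choices are assumptions, and the actual bit-vector emission is omitted):

    #include <string>
    #include <unordered_map>
    #include <utility>
    #include <vector>

    struct StackItem {
        std::string tag;
        long startPos;
        int pid;                    // path id of this element's rps
        std::vector<int> descPids;  // pids of paths ending below (for bitAnc)
    };

    struct IndexBuilder {
        std::unordered_map<std::string, int> pidOfRps; // rps -> unique pid
        std::vector<StackItem> stack;
        long nextPos = 0;           // document order / row-id

        // startElement: label startPos, build the reverse path string by
        // reading the stack from top to bottom, and assign its pid.
        void startElement(const std::string& tag) {
            std::string rps = tag;
            for (auto it = stack.rbegin(); it != stack.rend(); ++it)
                rps += "." + it->tag;
            int pid = pidOfRps.emplace(rps, (int)pidOfRps.size()).first->second;
            for (StackItem& anc : stack)   // ancestors collect descendant pids
                anc.descPids.push_back(pid);
            stack.push_back({tag, nextPos++, pid, {pid}});
        }

        // endElement: label endPos and emit the element's bits.
        void endElement() {
            StackItem e = std::move(stack.back());
            stack.pop_back();
            // bitAnc: for each pid in e.descPids, set bit e.startPos to 1.
            // bitDesc: e's ancestor pids are the pids of the items still open.
            // (Emission into the actual bit-vectors is omitted here.)
        }
    };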
Figure 9: Query Execution Time

6.2 Experimental Results
Statistics of Indexes. We compared the sizes of our indexes in Table 4. Compared with the interval-based labels, our indexes were comparably small. Even in the case of bitTwig, the sum of bitPath and bitAnc was smaller than the labeled values, except for the TBANK dataset. Our observation was that unless the XML structure is severely deep and recursive, the size of the bit-vectors built over the path domain is small. Another observation was that the index building time is mainly driven by the column cardinality rather than by the number of tuples; e.g., building the indexes over the path domain on TBANK took the longest time (see Table 5). On the contrary, building the indexes over DBLP, which has about 14 million rows in a relation, took rather little time.

    (MBytes)   XMark   DBLP     TREEBANK   SWISSPROT
    interval   23.44   160.20   27.90      59.13
    bitTag     5.32    19.67    6.85       14.40
    bitPath    6.18    20.00    23.53      19.03
    bitAnc     6.74    20.08    30.84      26.50
    bitDesc    6.00    18.34    24.18      18.92
    bitTag+    6.05    20.67    14.33      14.62

Table 4: Size of Encoded Values

    (Seconds)  XMark   DBLP     TREEBANK   SWISSPROT
    bitTag     17.36   98.51    38.61      52.34
    bitPath    20.81   77.63    7,408.09   35.71
    bitAnc     34.45   169.10   8,054.15   69.11
    bitDesc    21.73   86.35    7,528.23   38.28
    bitTag+    29.00   180.95   206.75     88.30

Table 5: Index Build Time

Performance Measure. We measured our query processing algorithm with the various indexes using two criteria: (1) the execution time for answering twig queries (shown in Fig 9), and (2) the size of the data read during the twig join (shown in Fig 10). In Fig 9, the five indexes and twigStack were compared on the four datasets. In the figure, tagSkip stands for the use of bitTag with advance(k) for skipping useless labels. The other indexes were optimized by either advance(k) or condense(q). We assumed the bitmap indexes were not loaded into memory before query processing, and the buffer size was set to 0 in our experiments. Thus, the index loading time was always included in the execution time.

Figure 10: Bytes Read During Query Evaluation

Performance Analysis. In terms of the size of the data read during query processing (Figure 10), we saw that in all cases bitTwig read the smallest amount of data. This is because bitTwig reads only the bit-vectors from two indexes for answering twig queries. We also saw that, in general, the indexes built over the path domain read fewer records. Usually, the indexes built over the tag domain read the most records, but advance(k) significantly reduced that amount. The bitTag+ index, built over the (tag, level) domain, had a trade-off between the amount of index read and data read, especially on the TBANK dataset: bitTag+ read fewer records, but it required more bit-vectors for query processing. bitTag+ also did not outperform the other strategies in terms of execution time in our experiments. DescTag, which merges a path-based bit-vector and a tag-based bit-vector for each query node, was comparable to the other strategies which used path-based indexes in some cases. However, when the structure was extremely flat and the number of leaf elements was large, merging bit-vectors hurt the performance, e.g., SPROT1 in Figure 9. In terms of execution time, bitTwig outperformed the other indexes in most cases. In particular, it was best suited for answering twigs over XML data whose structures were neither deep nor recursive, e.g., DBLP and SPROT. However, it showed unendurable execution time for TBANK (Figure 9), since it needed to hold many bit-vectors per query node and to manage the cursor moves on them. In the case of recursively structured XML, bitTag with advance(k) showed the best performance in our experiments. bitTag+ was not the best in any case; however, it ran fast on the TBANK dataset, next to tagSkip and bitTag. DescTag showed execution time similar to that of bitPath, but it showed rather better performance on TBANK.
We also observed that, in the cases of bitPath and DescTag on TBANK, most of the execution time was spent finding and loading bit-vectors. Finding the bit-vectors corresponding to a given path which includes A-D axes involves a subsequence matching problem, which is known to be difficult to support efficiently. Similar to the concept of a pre-computed join, pre-merged bit-vectors may be useful for the path selections: once we merge the bit-vectors into a single bit-vector for a given pathstring, we can find that pathstring with an exact matching query the next time. In addition, the number of bit-vectors to be loaded decreases by one.

7. RELATED WORK
twigStack is a multi-way, sort-merge join algorithm that avoids generating obsolete path solutions during twig pattern matching [1]. Its I/O optimality is proven for twig patterns that have no /-axis. Much research has focused on guaranteeing I/O optimality for certain classes of XPath [9, 2]. The iTwigJoin work discussed optimality for twig patterns with A-D or P-C axes only, or with one branching node only, by partitioning the input data by tag+level and by prefix path [2]. Variants of a B+-tree index have been suggested to skip unnecessary input data so that twig queries can be processed more efficiently [1, 6]. TJFast adopted a different approach to reducing the size of the input data [10]: it encoded the root-to-leaf path of each leaf node in extended Dewey order, so that the internal nodes could be derived from the encoded values of the leaf nodes.

Various XML storage schemes have been suggested for relational database systems [8]. A labeling scheme or path materialization is among the common techniques adopted by an RDBMS for XML query processing. It has been shown that the conventional index structures of an RDBMS can be tailored to index XML data as well [12]. Recently, Grust et al. showed that the node approach can give better performance than a native XML storage system with a partitioned B+-tree [4].

The bitmap index is known to be effective for indexing a column from a small domain and for performing logical operations [11]. For columns with high cardinality, bitmap indexes can be compressed to reduce their sizes, or a hybrid encoding scheme (such as WAH [13]) can be adopted so that bitwise operations can be performed efficiently on word units. BitCube [14] adopts bitmap indexes for XML query processing. Similar to the data cubes for OLAP, it consists of 3-dimensional bitmaps, where the dimensions represent documents, paths, and values. However, it supports only querying a linear path with value predicates, and its size can be huge, since its structure is hard to compress well.

8. CONCLUSIONS
We propose various bitmap indexes for querying XML data stored in relational tables. We support twig query processing by an RDBMS with cursor-enabled bitmap indexes. Our cursor mechanism and query processing algorithms work on compressed bit-vectors without uncompressing them. The performance of the proposed algorithms is evaluated in our experiments. bitTwig outperforms the other strategies for shallow XML documents by computing twig solutions with only bit-vectors. For XML documents with a deep and recursive structure, tag-based bitmap indexes with advance(k) show the best performance.

9. REFERENCES
[1] N. Bruno, N. Koudas, and D. Srivastava. Holistic twig joins: optimal XML pattern matching. In Proceedings of the ACM SIGMOD Conference, pages 310-321, 2002.
[2] T. Chen, J. Lu, and T.W. Ling. On boosting holism in XML twig pattern matching using structural indexing techniques. In Proceedings of the ACM SIGMOD Conference, pages 455-466, 2005.
[3] G. Gou and R. Chirkova. Efficiently querying large XML data repositories: a survey. IEEE TKDE, 19(10):1381-1403, 2007.
[4] T. Grust et al. Why off-the-shelf RDBMSs are better at XPath than you might expect. In Proceedings of the ACM SIGMOD Conference, pages 949-958, 2007.
[5] P. J. Harding, Q. Li, and B. Moon. XISS/R: XML indexing and storage system using RDBMS. In Proceedings of the 29th VLDB Conference, pages 1073-1076, 2003.
[6] H. Jiang et al. Holistic twig joins on indexed XML documents. In Proceedings of the 29th VLDB Conference, pages 273-284, 2003.
[7] Q. Li and B. Moon. Indexing and querying XML data for regular path expressions. In Proceedings of the 27th VLDB Conference, pages 361-370, 2001.
[8] H. Lu et al. What makes the differences: benchmarking XML database implementations. ACM Transactions on Internet Technology, 5(1):154-194, 2005.
[9] J. Lu, T. Chen, and T.W. Ling. Efficient processing of XML twig patterns with parent child edges: a look-ahead approach. In Proceedings of the 13th ACM CIKM Conference, pages 533-542, 2004.
[10] J. Lu et al. From region encoding to extended Dewey: on efficient processing of XML twig pattern matching. In Proceedings of the 31st VLDB Conference, pages 193-204, 2005.
[11] P. O'Neil and D. Quass. Improved query performance with variant indexes. In Proceedings of the ACM SIGMOD Conference, pages 38-49, 1997.
[12] S. Pal et al. Indexing XML data stored in a relational database. In Proceedings of the 30th VLDB Conference, pages 1146-1157, 2004.
[13] K. Wu, E. Otoo, and A. Shoshani. On the performance of bitmap indices for high cardinality attributes. In Proceedings of the 30th VLDB Conference, pages 24-35, 2004.
[14] J.P. Yoon, V. Raghavan, V. Chakilam, and L. Kerschberg. BitCube: a three-dimensional bitmap indexing for XML documents. Journal of Intelligent Information Systems, 17(2):241-254, 2001.
[15] M. Yoshikawa et al. XRel: a path-based approach to storage and retrieval of XML documents using relational databases. ACM Transactions on Internet Technology, 1(1):110-141, 2001.
[16] C. Zhang et al. On supporting containment queries in relational database management systems. In Proceedings of the ACM SIGMOD Conference, pages 425-436, 2001.