SlideShare a Scribd company logo
1 of 8
Download to read offline
IOSR Journal of Computer Engineering (IOSR-JCE)
e-ISSN: 2278-0661,p-ISSN: 2278-8727, Volume 17, Issue 6, Ver. I (Nov – Dec. 2015), PP 69-76
www.iosrjournals.org
DOI: 10.9790/0661-17616976 www.iosrjournals.org 69 | Page
Duplicate Detection in Hierarchical Data Using XPath
1
Akash R. Petkar, 2
Vijay B. Patil
1,2
Computer Science &Engineering, MIT Aurangabad.
Abstract: There were many techniques for identifying duplicates in relational data, but only a few solutions
focus on identifying duplicates which has complex hierarchical structure, as XML data. In this paper, we
present a new technique for identifying XML duplicates, so-called XML duplication using Xpath. XML
duplication using Xpath technique uses a Bayesian network to conclude the possibility that two xml elements are
duplicates, based on the information within the elements and other information organized in the XML. In
addition, to increase the proficiency of the web usage, a new pruning strategy was created. This pruning
strategy will help to gain maximum benefits over non-computing algorithm. This technique can be used to
increase the proficiency of identifying duplicates and remove it, so no duplicate record will be there. Through
many experiments, our algorithm is able to achieve high accuracy and retrieve count in several XML dataset.
XML duplication using Xpath technique is able to outclass another technique for identifying duplicates, both in
proficiency and potency.
Keywords: Identifying duplicates, XML, Bayesian network, object cleaning, hierarchical structure, Xpath.
I. Introduction
Electronic data plays an important role in today’s world for business processes, applications and
making quick decisions. As we are focusing on how the data can be essential and we have to compromise on
different types of errors which come in different representations [1]. In this paper we are focusing on different
types of errors that can be occurred in Data. We will mainly focus on fuzzy duplicates or duplicate records.
Duplicate records are multiple representation of same real world object that are differently represented. These
records are somewhat different from each other. These records attributes differ in some way from each other in
XML document.
Duplicate detection means finding out these different representations of same real world object.
Duplicate detection is a tough task to find duplicate records. The common comparisons algorithm to find the
duplicates cannot be used, so finddifferent possibly matching strategyto compare, so that they are referring to
the same object or not
In this paper, the focusis on which possibly matching strategy can find out to detect duplicate records.
The Focus should be able to match the different representation of information at a conceptual level. Take xml
dataset for comparing different possibilities. An XML dataset or document includes a set of nodes in the
document. It consists of root node and child nodes. It starts with an opening tag (Ex. <A>) and a closing tag(Ex.
</A>).
Figure 1: Attribute Scope
In Fig.1two records are shown for two different XML records. But both records are representing the
same country so there can be possibility that in XML dataset or document it can be present. So find different
such possibilities that represent the same object. Duplicate records are exactly same by textual information. But
if they are slightly changed; the information are not exactly duplicates.
Another problem is that XML can be presented in different structures so the possibility of finding the
duplicates becomes high. An XML document contains one root element and number of child element, but child
element can also have different child elements and so on. In this paper, a novel technique is presented which can
be used to detect XML duplication of same real world object.
Rest of the paper is organized as follows: Section II the related work is discussed. Section III describes
the proposed work. Section IV describes the mathematical model. Section Vpresents the performance analysis.
Section VI shows graphical model. Section VII concludes the paper.
<Customer>
<Country>Australia</country>
</Customer>
<Customer>
<Country>AUS</country>
</Customer>
Duplicate Detection in Hierarchical Data Using XPath
DOI: 10.9790/0661-17616976 www.iosrjournals.org 70 | Page
II. Related Work
Data cleaning [2] ordata cleansing deals with identifying and removing errors, irregularities from data
are represented in such a way that it can degrade the quality of data.Dataquality problems are there in single data
files such as databases or XML File ex. Due to misspelling during entry of data, invalid data or some missing
information. When integrating multiple file data to integrate into one file then it should be able to get a single
file that is free from duplicate records.
2.1 Eliminating Fuzzy Duplicates in Data Warehouses
In data warehouses large databases are integrated ex. global web-based information system, merged
database systems so there can be a possibility of duplicate records. The need for data cleaning process becomes
an important factor for getting the accurate records with no duplication. The duplicate records or redundant
records are represented in different representations. In order to get the data accurate and consistent, joining
different types of data and merging into one data and eliminate the duplicate data becomes a necessary step for
faster processing of data. Firstly find out the different possibilities of data that can be represented in different
structure so that multiple documents when combine together will form a single document which is free from
duplicate records.
For Example in Table 1 there are records that consist off_n (firstname), country and email. Record r1
and r2 are exact duplicates because their f_n, country and email values are same so we can say they are exact
(100 %) duplicates, so we can easily say they are duplicates and can be easily removed. Record r1 and r3 are not
exact duplicates because theirf_n and email values are same, but not country values are same. But in record r3
the country values is denoted in different format, but it refers to same record. Therefore, find such duplicates.
These duplicates are called as fuzzy duplicates [3].
Table 1: Exact Duplicates as well as Fuzzy Duplicates
Record f_n country Email
r1 John Australia john@gmail.com
r2 John Australia john@gmail.com
r3 John AUS john@gmail.com
2.2 DogmatiX Tracks down Duplicates in XML
Inthis,dogmatix defines a general framework for identifying the duplicates. In this, records are checked
whether they are duplicates or not based on their values. In real world, records are represented in multiple
patterns for same object. The dogmatix frameworkis flexible to work on different algorithms and new methods
can be added to improve this framework. An overview of the framework is given in [5].Theframework consists
of three types,
 Candidate definition: Defines which document should be compare.
 Duplicate definition: Defines when two objects are duplicates.
 Duplicate detection: Defines How Duplicates are searched.
III. Proposed Work
3.1 XML Duplication Using XPath
Every tuple in a relational table has exactly one value for every attribute. Most duplicate detection
approaches designed for a single relation iteratively compare pairs of tuples as follows: They first compare
attribute values pair wisely by computing a value similarity, and then combine these similarities to a total tuple
similarity. If the similarity is above a specified threshold, tuple pairs represent duplicates, otherwise they
represent non-duplicates. This comparison approach is called a threshold similarity measure approach.
Figure 2: CustomerDetails XML
CustomerDetails
Customer 1 Customer 2 Customer n
PersonID
Address1
Address2
Email
POB
DOB
PersonName
PersonID
Address1
Address2
Email
POB
DOB
PersonName
PersonID
Address1
Address2
Email
POB
DOB
PersonName
Duplicate Detection in Hierarchical Data Using XPath
DOI: 10.9790/0661-17616976 www.iosrjournals.org 71 | Page
DTD are a strict representation or set of rules for XML.XML can be represented by a tree structure as
shown in Fig. 2.In this customerdetails, is the root element and customer are the child elements. The child
elements have 7 attributes as personid, personname, dob, pob, Email, Address1, and Address2.
Today, XML is used in many web applications. The popularity of xml has increased because of its
platform independent, less space and easy to use ability. XML is mostly used for data storage and fast transfer of
data.
3.2 Different Types of Parsers
On web there are various types of parsers are available for parsing the XML document. Some of them
are good in some features and some of them are not. In Table 2 different types of parser are used for parsing,
but some parser are design for read only access. But in this paper, a novel approach is proposed for parsing the
XML file and finding the exact duplicates or fuzzy duplicates and removing the duplicates, so that pure XML
document must be formed.STAX parser, SAX parser, DOM parser etc. are used for parsing. In Table 2 how these
parsers differ from each other.The notation for the following Table 2 can be used as Xpath Capability = XC,
CPU & Memory = C&M, Forward only = FO, Read xml = RXML,Write xml = WXML,
create,read,update,delete = crud.
Table 2: Features Table
Feature STAX SAX DOM
API Type Pull,Streaming Push, Streaming In memory tree
XC No No yes
C&M Good Good Varies
FO Yes Yes No
RXML Yes Yes Yes
WXML Yes Yes Yes
CRUD No No Yes
3.3 XMLDOM Parser
The structure of Dom parser can be identified using the following diagram.
Figure 3: XML Dom Parser Structure
DOM, also known as document object model. It is mostly used for XML operation today. The
responsibility of a DOM parser is to read the XML document specified and convert that into a tree structure
suitable for traversal. Internally, a DOM parser takes help from SAX parser to read the file and the compares the
XML against the DTD or the schema, so that relationships between parent-child tags can be set up and the tag
tree is built into the memory. DOM first copies XML into memory before parsing It, So It is a good advice to
have large heap size to avoid exceptions.
3.4 XML Duplication Using Xpath Algorithm
In this, we describe how the algorithm works and how it is able to detect duplicates in XML dataset
and remove the duplicates from the XML.
Steps in xml duplication using XPath algorithm,
Begin;
1) Read the xml file and get the ordered list of parent nodes (L).
2) Read the child nodes of xml file.
3) Current score = 0
4)For each node n in L do
5)If (attributes of child nodes= threshold value) then
Completely duplicate
Else
Initializer Parser
Dom Parsing
Process
XML Parser
XML Document
Duplicate Detection in Hierarchical Data Using XPath
DOI: 10.9790/0661-17616976 www.iosrjournals.org 72 | Page
No duplicate
6) Remove all duplicate nodes and save the new xml file.
Table 3: Results for Customer, Suplier and Order Dataset
IV. Mathematical Model
1. Identify XML records R
𝑅 = 𝑟1, 𝑟2, 𝑟3, 𝑟4, 𝑟4, 𝑟5, … … .
Where R is main set of records
2. Then Identify Nodes of each record N
𝑁 = 𝑛1, 𝑛2, 𝑛3, 𝑛4, 𝑟5, …… .
Where N is main set of nodes for a record
3. Probability that a node is duplicate
𝑃 = 𝑝1, 𝑝2, 𝑝3, 𝑝4, 𝑝5, … ….
𝑃 𝑡𝑖𝑗 |𝑉𝑡𝑖𝑗 , 𝐶𝑡𝑖𝑗 =
1 𝑖𝑓𝑉𝑡𝑖𝑗 = 𝐶𝑡𝑖𝑗 = 1
0 𝑂𝑡ℎ𝑒𝑟𝑤𝑖𝑠𝑒
Where P is main set of probability
If 𝑉𝑡𝑖𝑗 = 𝐶𝑡𝑖𝑗 = 𝑇ℎ𝑟𝑒𝑠ℎ𝑜𝑙𝑑𝑉𝑎𝑙𝑢𝑒 Then It Is Duplicate
4. Identify Duplicate nodes DN
𝐷𝑁 = 𝑑𝑛1, 𝑑𝑛2, 𝑑𝑛3, 𝑑𝑛4, 𝑑𝑛5,… … .
Where DN is main set of the duplicate nodes
5. Identify Duplicate records for DR
𝐷𝑅 = 𝑑𝑟1, 𝑑𝑟2, 𝑑𝑟3, 𝑑𝑟4, 𝑑𝑟5, … ….
Where DR is main set of the duplicate records
6. Calculating Total Time for Duplication
𝑇𝑜𝑡𝑎𝑙𝑇𝑖𝑚𝑒 =
𝑇𝑖𝑚𝑒𝐷𝑒𝑝𝑡ℎ𝑆𝑒𝑎𝑟𝑐ℎ + 𝐷𝑒𝑑𝑢𝑝𝑙𝑖𝑐𝑎𝑡𝑖𝑜𝑛
2
V. Performance Analysis
In this paper, the experiments are performed on customer, suplier and order datasets. The customer
dataset consist of attributes as personid,personname, dob, pob, email, address1, address2. The suplierdataset
consist of attributessuppkey, name, address, nationkey, phone, acctbal, comment. The Orderdatasetconsists of
orderkey, custkey, orderstatus, totalprice, orderdate, orderpriority, and clerk.
The datasets are downloaded from www.cs.washington.edu/research/xmldatasets/www.repository.html
5.1 Test Case I
In customerdetailsfile there are seven attributes they are personid, personname, DOB, POB, email, address1 and
address2. When parsed the customerdetailsXML file then which elements are exact duplicates then that records
are duplicates.
Record 1
<Customer>
<personid>1</personid>
<personname>Order#0000001</personname>
<dob>01/01/2001</dob>
Dataset Customer Suplier Order
Records 1000 1000 1000
Time Depth Search For
Duplicate
2973 Milliseconds 3000 Milliseconds 2933 Milliseconds
Sorted Set 2873 Milliseconds 2876 Milliseconds 2872 Milliseconds
Deduplication For The
Duplicate Record
91 Milliseconds 130 Milliseconds 59 Milliseconds
Total Time 1532 Milliseconds 1565 Milliseconds 1496 Milliseconds
Duplicate Detection in Hierarchical Data Using XPath
DOI: 10.9790/0661-17616976 www.iosrjournals.org 73 | Page
<pob>Australia</pob>
<email>Order1@gmail.com</email>
<Address1>Jalna</Address1>
<Address2>Beed</Address2>
</Customer>
Record 2
<Customer>
<personid>1</personid>
<personname>Order#0000001</personname>
<dob>01/01/2001</dob>
<pob>Australia</pob>
<email>Order1@gmail.com</email>
<Address1>Jalna</Address1>
<Address2>Beed</Address2>
</Customer>
Above both records are exactly same then they are exact duplicates. But if attributes places of
Adddress1 and Address2 are changed then readings as
Record 3
<Customer>
<personid>1</personid>
<personname>Order#0000001</personname>
<dob>01/01/2001</dob>
<pob>Australia</pob>
<email>Order1@gmail.com</email>
<Address1>Jalna</Address1>
<Address2>Beed</Address2>
</Customer>
Record 4
<Customer>
<personid>1</personid>
<personname>Order#0000001</personname>
<dob>01/01/2001</dob>
<pob>Australia</pob>
<email>Order1@gmail.com</email>
<Address1>Beed</Address1>
<Address2>Jalna</Address2>
</Customer>
It Shows, they are not (100 %) exactly Duplicates, because of their address places are change, but they
are duplicates. They are represented in different manner so they represent to same object.
5.2 Test Case 2
In CustomerDetails, alternate names for countries are given,soitcan be easily identifiedwhetherit is
duplicate records or not.
Table 4: Alternate Names for Countries
Countries Other name Lowercase name
Australia
Canada
China
India
United Kingdom
Sri Lanka
France
Iceland
Mexico
New Zealand
AU
CA
CN
IN
UK
LK
FR
IS
MX
NZ
au
ca
cn
in
uk
lk
fr
is
mx
nz
Duplicate Detection in Hierarchical Data Using XPath
DOI: 10.9790/0661-17616976 www.iosrjournals.org 74 | Page
Record 1
<Customer>
<personid>1</personid>
<personname>Order#0000001</personname>
<dob>01/01/2001</dob>
<pob>Australia</pob>
<email>Order1@gmail.com</email>
<Address1>Jalna</Address1>
<Address2>Beed</Address2>
</Customer>
Record 2
<Customer>
<personid>1</personid>
<personname>Order#0000001</personname>
<dob>01/01/2001</dob>
<pob>AU</pob>
<email>Order1@gmail.com</email>
<Address1>Beed</Address1>
<Address2>Jalna</Address2>
</Customer>
Record 3
<Customer>
<personid>1</personid>
<personname>Order#0000001</personname>
<dob>01/01/2001</dob>
<pob>au</pob>
<email>Order1@gmail.com</email>
<Address1>Beed</Address1>
<Address2>Jalna</Address2>
</Customer>
These records are not (100 %)Duplicates,becausethe countries names are different, but all they refer to
one country,so above 3 records are duplicates. Table 3 shows the performance results.
VI. Graphical Model
Precisionmeasures the percentage of correctly identified duplicates, over the total set of objects
determined as duplicates by the system.
Recallmeasures the percentage of duplicates correctly identified by the system, over the total set of
duplicate objects.
Table 5 shows the calculated results for Dogmatix and Xml Duplication Using Xpath.Also the
graphical comparison of Dogmatix and Xml Duplication using Xpath for customer dataset, suplier dataset and
order dataset are given below.
Table 5: Comparison Results In terms Of Precision and Recall
Dataset Dogmatix XML Duplication
Using Xpath
Precision Recall Precision Recall
Customer 0.4522 0.45 0.502 0.5
Suplier 0.1206 0.12 0.417 0.41
order 0.417 0.41 0.48 0.48
Duplicate Detection in Hierarchical Data Using XPath
DOI: 10.9790/0661-17616976 www.iosrjournals.org 75 | Page
Figure 4: For Customer Dataset
Figure 5: For Suplier Dataset
Figure 6: For Order Dataset
VII. Conclusion
In this paper, we present the algorithm to determine whether two recordsare duplicates or not based on
a given threshold. For finding the duplicates the Algorithm uses a Bayesian network. Xml duplication using
Xpath requires little user interaction, since user only needs to provide the Xml dataset file and based on that file
the user has to give the threshold value. However this technique is very flexible for duplicate detection in XML
data.
These techniques will able to solve the problem of duplicate data. Nowadays more and more data is
generated, because of various devices so thereis lots of data to be maintained in the database so there can be
duplicate records for a single record. To avoid the problem of efficiency of network, this technique can be useful
to reduce the network load. This technique is performed under various experiments to find duplicate values.
These experiments are performed on both artificial and real world dataset and showed that XML duplication
using Xpath is very good technique. The process of duplication and Deduplication of records using Xpath
technique is able to outclass other parser such as (Ex. DOM, SAX, and STAX Etc.).
When calculated Recall and Precision for Xpath is 93 %, when compared with Recall and Precision for
Dogmatix is 66 %. The success demonstrated in the experimental results will show that there is still something
more we can do for the future work.
References
[1] E. Rahm and H.H. Do, “Data Cleaning: Problems and Current Approaches,” IEEE Data Eng. Bull., vol. 23, no. 4, pp. 3-13, Dec.
2000.
0.42
0.44
0.46
0.48
0.5
0.52
Precision Recall
XML
Duplication
Using Xpath
Dogmatix
0
0.1
0.2
0.3
0.4
0.5
Precision Recall
XML
Duplication
Using Xpath
Dogmatix
0.36
0.38
0.4
0.42
0.44
0.46
0.48
0.5
Precision Recall
XML
Duplication
Using Xpath
Dogmatix
Duplicate Detection in Hierarchical Data Using XPath
DOI: 10.9790/0661-17616976 www.iosrjournals.org 76 | Page
[2] Melaine Weis and Felix Naumann, “Detecting Duplicate Objects In XML,”ACM SIGMOD Special Interest Group on Mgmt of
Data IQIS 2004, pp. 10-19, 2004
[3] R. Ananthakrishna, S. Chaudhuri, and V. Ganti, “Eliminating Fuzzy Duplicates in Data Warehouses,” Proc. Conf. Very Large
Databases (VLDB), pp. 586-597, 2002.
[4] D.V. Kalashnikov and S. Mehrotra, “Domain-Independent Data Cleaning via Analysis of Entity-Relationship Graph.”ACM Trans.
Database Systems, vol. 31, no. 2, pp. 716-767, 2006.
[5] M. Weis and F. Naumann, “Dogmatix Tracks Down Duplicates in XML,” Proc. ACM SIGMOD Conf. Management of Data, pp.
431-442, 2005.
[6] L. Leita˜o, P. Calado, and M. Weis, “Structure-Based Inference of XML Similarity for Fuzzy Duplicate Detection,” Proc. 16th
ACM Int’l Conf. Information and Knowledge Management, pp. 293-302, 2007.
[7] A.M. Kade and C.A. Heuser, “Matching XML Documents in Highly Dynamic Applications,” Proc. ACM Symp. Document Eng.
(DocEng), pp. 191-198, 2008.
[8] D. Milano, M. Scannapieco, and T. Catarci, “Structure Aware XML Object Identification,” Proc. VLDB Workshop Clean
Databases (CleanDB), 2006.
[9] Luis Leitao, PavelCalado, and Melaine Herschel, “Efficient and Effective Duplicate Detection In Hierarchical Data”Vol. 25, No. 5,
pp. 1028-1041 2013.
[10] P. Calado, M. Herschel, and L. Leita˜o, “An Overview of XML Duplicate Detection Algorithms,” Soft Computing in XML Data
Management, Studies in Fuzziness and Soft Computing, vol. 255, pp. 193-224, 2010.
[11] S. Puhlmann, M. Weis, and F. Naumann, “XML Duplicate Detection Using Sorted Neighborhoods,” Proc. Conf. Extending
Database Technology (EDBT), pp. 773-791, 2006.

More Related Content

What's hot

Upstate CSCI 525 Data Mining Chapter 3
Upstate CSCI 525 Data Mining Chapter 3Upstate CSCI 525 Data Mining Chapter 3
Upstate CSCI 525 Data Mining Chapter 3DanWooster1
 
Heterogeneous fuzzy xml data integration based on structrual and semantic sim...
Heterogeneous fuzzy xml data integration based on structrual and semantic sim...Heterogeneous fuzzy xml data integration based on structrual and semantic sim...
Heterogeneous fuzzy xml data integration based on structrual and semantic sim...Amir Shokri
 
DatumTron In-Memory Graph Database
DatumTron In-Memory Graph DatabaseDatumTron In-Memory Graph Database
DatumTron In-Memory Graph DatabaseAshraf Azmi
 
HOLISTIC EVALUATION OF XML QUERIES WITH STRUCTURAL PREFERENCES ON AN ANNOTATE...
HOLISTIC EVALUATION OF XML QUERIES WITH STRUCTURAL PREFERENCES ON AN ANNOTATE...HOLISTIC EVALUATION OF XML QUERIES WITH STRUCTURAL PREFERENCES ON AN ANNOTATE...
HOLISTIC EVALUATION OF XML QUERIES WITH STRUCTURAL PREFERENCES ON AN ANNOTATE...ijseajournal
 
CIS 336 Wonderful Education--cis336.com
CIS 336 Wonderful Education--cis336.comCIS 336 Wonderful Education--cis336.com
CIS 336 Wonderful Education--cis336.comJaseetha16
 
Relational database
Relational  databaseRelational  database
Relational databaseamkrisha
 
The Statement of Conjunctive and Disjunctive Queries in Object Oriented Datab...
The Statement of Conjunctive and Disjunctive Queries in Object Oriented Datab...The Statement of Conjunctive and Disjunctive Queries in Object Oriented Datab...
The Statement of Conjunctive and Disjunctive Queries in Object Oriented Datab...Editor IJCATR
 
Relational Data Model Introduction
Relational Data Model IntroductionRelational Data Model Introduction
Relational Data Model IntroductionNishant Munjal
 
9. Object Relational Databases in DBMS
9. Object Relational Databases in DBMS9. Object Relational Databases in DBMS
9. Object Relational Databases in DBMSkoolkampus
 
Ibps it officer exam capsule by affairs cloud
Ibps it officer exam capsule by affairs cloudIbps it officer exam capsule by affairs cloud
Ibps it officer exam capsule by affairs cloudaffairs cloud
 
Introducing FRSAD and Mapping it with Other Models
Introducing FRSAD and Mapping it with Other ModelsIntroducing FRSAD and Mapping it with Other Models
Introducing FRSAD and Mapping it with Other ModelsMarcia Zeng
 

What's hot (19)

Dbms ppt
Dbms pptDbms ppt
Dbms ppt
 
Upstate CSCI 525 Data Mining Chapter 3
Upstate CSCI 525 Data Mining Chapter 3Upstate CSCI 525 Data Mining Chapter 3
Upstate CSCI 525 Data Mining Chapter 3
 
Heterogeneous fuzzy xml data integration based on structrual and semantic sim...
Heterogeneous fuzzy xml data integration based on structrual and semantic sim...Heterogeneous fuzzy xml data integration based on structrual and semantic sim...
Heterogeneous fuzzy xml data integration based on structrual and semantic sim...
 
Bc0041
Bc0041Bc0041
Bc0041
 
DatumTron In-Memory Graph Database
DatumTron In-Memory Graph DatabaseDatumTron In-Memory Graph Database
DatumTron In-Memory Graph Database
 
Dbms
DbmsDbms
Dbms
 
App B
App BApp B
App B
 
ch6
ch6ch6
ch6
 
HOLISTIC EVALUATION OF XML QUERIES WITH STRUCTURAL PREFERENCES ON AN ANNOTATE...
HOLISTIC EVALUATION OF XML QUERIES WITH STRUCTURAL PREFERENCES ON AN ANNOTATE...HOLISTIC EVALUATION OF XML QUERIES WITH STRUCTURAL PREFERENCES ON AN ANNOTATE...
HOLISTIC EVALUATION OF XML QUERIES WITH STRUCTURAL PREFERENCES ON AN ANNOTATE...
 
CIS 336 Wonderful Education--cis336.com
CIS 336 Wonderful Education--cis336.comCIS 336 Wonderful Education--cis336.com
CIS 336 Wonderful Education--cis336.com
 
Z04506138145
Z04506138145Z04506138145
Z04506138145
 
Relational database
Relational  databaseRelational  database
Relational database
 
The Statement of Conjunctive and Disjunctive Queries in Object Oriented Datab...
The Statement of Conjunctive and Disjunctive Queries in Object Oriented Datab...The Statement of Conjunctive and Disjunctive Queries in Object Oriented Datab...
The Statement of Conjunctive and Disjunctive Queries in Object Oriented Datab...
 
Relational Data Model Introduction
Relational Data Model IntroductionRelational Data Model Introduction
Relational Data Model Introduction
 
The Aleph 4
The Aleph 4The Aleph 4
The Aleph 4
 
9. Object Relational Databases in DBMS
9. Object Relational Databases in DBMS9. Object Relational Databases in DBMS
9. Object Relational Databases in DBMS
 
Ibps it officer exam capsule by affairs cloud
Ibps it officer exam capsule by affairs cloudIbps it officer exam capsule by affairs cloud
Ibps it officer exam capsule by affairs cloud
 
L6 structure
L6 structureL6 structure
L6 structure
 
Introducing FRSAD and Mapping it with Other Models
Introducing FRSAD and Mapping it with Other ModelsIntroducing FRSAD and Mapping it with Other Models
Introducing FRSAD and Mapping it with Other Models
 

Viewers also liked

1989-2009: ora devono cadere le barriere all'innovazione
1989-2009: ora devono cadere le barriere all'innovazione1989-2009: ora devono cadere le barriere all'innovazione
1989-2009: ora devono cadere le barriere all'innovazioneMassimiliano Navacchia
 
Thong tin Landmark 3 Landmark 6 Vinhomes Central Park Vingroup
Thong tin Landmark 3 Landmark 6 Vinhomes Central Park VingroupThong tin Landmark 3 Landmark 6 Vinhomes Central Park Vingroup
Thong tin Landmark 3 Landmark 6 Vinhomes Central Park VingroupBất động sản 777
 
A Hybrid Approach for Performance Enhancement of VANET using CSMA-MACA: a Review
A Hybrid Approach for Performance Enhancement of VANET using CSMA-MACA: a ReviewA Hybrid Approach for Performance Enhancement of VANET using CSMA-MACA: a Review
A Hybrid Approach for Performance Enhancement of VANET using CSMA-MACA: a Reviewiosrjce
 
“Design and Detection of Mobile Botnet Attacks”
“Design and Detection of Mobile Botnet Attacks”“Design and Detection of Mobile Botnet Attacks”
“Design and Detection of Mobile Botnet Attacks”iosrjce
 
Climbing to success
Climbing to successClimbing to success
Climbing to successjrobles2004
 
Multicultural Program Final Proposal Project
Multicultural Program Final Proposal ProjectMulticultural Program Final Proposal Project
Multicultural Program Final Proposal ProjectLaKeisha Weber
 
Intervention strategy -Behavior-Specific Praise
Intervention strategy -Behavior-Specific PraiseIntervention strategy -Behavior-Specific Praise
Intervention strategy -Behavior-Specific PraiseLaKeisha Weber
 
Xu hướng cửa hàng pop-up trong marketing ngành bán lẻ
Xu hướng cửa hàng pop-up trong marketing ngành bán lẻXu hướng cửa hàng pop-up trong marketing ngành bán lẻ
Xu hướng cửa hàng pop-up trong marketing ngành bán lẻPhi Van Nguyen
 
Ngành dược Người tiêu dùng và hoạt động quảng cáo của các công ty
Ngành dược Người tiêu dùng và hoạt động quảng cáo của các công tyNgành dược Người tiêu dùng và hoạt động quảng cáo của các công ty
Ngành dược Người tiêu dùng và hoạt động quảng cáo của các công tyShinnosuke Mo
 
Take a Whirlwind Tour of Hot Topics in Dyslexia & Literacy PART 1 (IDA Conf 2...
Take a Whirlwind Tour of Hot Topics in Dyslexia & Literacy PART 1 (IDA Conf 2...Take a Whirlwind Tour of Hot Topics in Dyslexia & Literacy PART 1 (IDA Conf 2...
Take a Whirlwind Tour of Hot Topics in Dyslexia & Literacy PART 1 (IDA Conf 2...Carolyn D. Cowen
 
Glosario técnico n°1
Glosario técnico n°1Glosario técnico n°1
Glosario técnico n°1jrtorresb
 
Bringing Empowerment to Women through Menstrual Hygiene
Bringing Empowerment to Women through Menstrual Hygiene   Bringing Empowerment to Women through Menstrual Hygiene
Bringing Empowerment to Women through Menstrual Hygiene GlobalHunt Foundation
 

Viewers also liked (16)

1989-2009: ora devono cadere le barriere all'innovazione
1989-2009: ora devono cadere le barriere all'innovazione1989-2009: ora devono cadere le barriere all'innovazione
1989-2009: ora devono cadere le barriere all'innovazione
 
ทุนกอ.นตผ.
ทุนกอ.นตผ.ทุนกอ.นตผ.
ทุนกอ.นตผ.
 
Thong tin Landmark 3 Landmark 6 Vinhomes Central Park Vingroup
Thong tin Landmark 3 Landmark 6 Vinhomes Central Park VingroupThong tin Landmark 3 Landmark 6 Vinhomes Central Park Vingroup
Thong tin Landmark 3 Landmark 6 Vinhomes Central Park Vingroup
 
A Hybrid Approach for Performance Enhancement of VANET using CSMA-MACA: a Review
A Hybrid Approach for Performance Enhancement of VANET using CSMA-MACA: a ReviewA Hybrid Approach for Performance Enhancement of VANET using CSMA-MACA: a Review
A Hybrid Approach for Performance Enhancement of VANET using CSMA-MACA: a Review
 
“Design and Detection of Mobile Botnet Attacks”
“Design and Detection of Mobile Botnet Attacks”“Design and Detection of Mobile Botnet Attacks”
“Design and Detection of Mobile Botnet Attacks”
 
Santo Domenico_CV
Santo Domenico_CVSanto Domenico_CV
Santo Domenico_CV
 
Climbing to success
Climbing to successClimbing to success
Climbing to success
 
Alex resume and transcript
Alex resume and transcriptAlex resume and transcript
Alex resume and transcript
 
Restos jesus
Restos jesusRestos jesus
Restos jesus
 
Multicultural Program Final Proposal Project
Multicultural Program Final Proposal ProjectMulticultural Program Final Proposal Project
Multicultural Program Final Proposal Project
 
Intervention strategy -Behavior-Specific Praise
Intervention strategy -Behavior-Specific PraiseIntervention strategy -Behavior-Specific Praise
Intervention strategy -Behavior-Specific Praise
 
Xu hướng cửa hàng pop-up trong marketing ngành bán lẻ
Xu hướng cửa hàng pop-up trong marketing ngành bán lẻXu hướng cửa hàng pop-up trong marketing ngành bán lẻ
Xu hướng cửa hàng pop-up trong marketing ngành bán lẻ
 
Ngành dược Người tiêu dùng và hoạt động quảng cáo của các công ty
Ngành dược Người tiêu dùng và hoạt động quảng cáo của các công tyNgành dược Người tiêu dùng và hoạt động quảng cáo của các công ty
Ngành dược Người tiêu dùng và hoạt động quảng cáo của các công ty
 
Take a Whirlwind Tour of Hot Topics in Dyslexia & Literacy PART 1 (IDA Conf 2...
Take a Whirlwind Tour of Hot Topics in Dyslexia & Literacy PART 1 (IDA Conf 2...Take a Whirlwind Tour of Hot Topics in Dyslexia & Literacy PART 1 (IDA Conf 2...
Take a Whirlwind Tour of Hot Topics in Dyslexia & Literacy PART 1 (IDA Conf 2...
 
Glosario técnico n°1
Glosario técnico n°1Glosario técnico n°1
Glosario técnico n°1
 
Bringing Empowerment to Women through Menstrual Hygiene
Bringing Empowerment to Women through Menstrual Hygiene   Bringing Empowerment to Women through Menstrual Hygiene
Bringing Empowerment to Women through Menstrual Hygiene
 

Similar to Duplicate Detection in Hierarchical Data Using XPath

Vol 15 No 3 - May 2015
Vol 15 No 3 - May 2015Vol 15 No 3 - May 2015
Vol 15 No 3 - May 2015ijcsbi
 
Part2- The Atomic Information Resource
Part2- The Atomic Information ResourcePart2- The Atomic Information Resource
Part2- The Atomic Information ResourceJEAN-MICHEL LETENNIER
 
Xml document probabilistic
Xml document probabilisticXml document probabilistic
Xml document probabilisticIJITCA Journal
 
Expression of Query in XML object-oriented database
Expression of Query in XML object-oriented databaseExpression of Query in XML object-oriented database
Expression of Query in XML object-oriented databaseEditor IJCATR
 
Expression of Query in XML object-oriented database
Expression of Query in XML object-oriented databaseExpression of Query in XML object-oriented database
Expression of Query in XML object-oriented databaseEditor IJCATR
 
Expression of Query in XML object-oriented database
Expression of Query in XML object-oriented databaseExpression of Query in XML object-oriented database
Expression of Query in XML object-oriented databaseEditor IJCATR
 
XCLS++: A new algorithm to improve XCLS+ for clustering XML documents
XCLS++: A new algorithm to improve XCLS+ for clustering XML documents XCLS++: A new algorithm to improve XCLS+ for clustering XML documents
XCLS++: A new algorithm to improve XCLS+ for clustering XML documents IJITCA Journal
 
The International Journal of Information Technology, Control and Automation (...
The International Journal of Information Technology, Control and Automation (...The International Journal of Information Technology, Control and Automation (...
The International Journal of Information Technology, Control and Automation (...IJITCA Journal
 
Effective Data Retrieval in XML using TreeMatch Algorithm
Effective Data Retrieval in XML using TreeMatch AlgorithmEffective Data Retrieval in XML using TreeMatch Algorithm
Effective Data Retrieval in XML using TreeMatch AlgorithmIRJET Journal
 
Xml data clustering an overview
Xml data clustering an overviewXml data clustering an overview
Xml data clustering an overviewunyil96
 
RELATIONAL STORAGE FOR XML RULES
RELATIONAL STORAGE FOR XML RULESRELATIONAL STORAGE FOR XML RULES
RELATIONAL STORAGE FOR XML RULESijwscjournal
 
RELATIONAL STORAGE FOR XML RULES
RELATIONAL STORAGE FOR XML RULESRELATIONAL STORAGE FOR XML RULES
RELATIONAL STORAGE FOR XML RULESijwscjournal
 
Annotating Search Results from Web Databases
Annotating Search Results from Web Databases Annotating Search Results from Web Databases
Annotating Search Results from Web Databases Mohit Sngg
 
introduction of database in DBMS
introduction of database in DBMSintroduction of database in DBMS
introduction of database in DBMSAbhishekRajpoot8
 
An overview of fragmentation
An overview of fragmentationAn overview of fragmentation
An overview of fragmentationcsandit
 
P REFIX - BASED L ABELING A NNOTATION FOR E FFECTIVE XML F RAGMENTATION
P REFIX - BASED  L ABELING  A NNOTATION FOR  E FFECTIVE  XML F RAGMENTATIONP REFIX - BASED  L ABELING  A NNOTATION FOR  E FFECTIVE  XML F RAGMENTATION
P REFIX - BASED L ABELING A NNOTATION FOR E FFECTIVE XML F RAGMENTATIONijcsit
 

Similar to Duplicate Detection in Hierarchical Data Using XPath (20)

K04302082087
K04302082087K04302082087
K04302082087
 
Ijert semi 1
Ijert semi 1Ijert semi 1
Ijert semi 1
 
Vol 15 No 3 - May 2015
Vol 15 No 3 - May 2015Vol 15 No 3 - May 2015
Vol 15 No 3 - May 2015
 
Part2- The Atomic Information Resource
Part2- The Atomic Information ResourcePart2- The Atomic Information Resource
Part2- The Atomic Information Resource
 
Xml document probabilistic
Xml document probabilisticXml document probabilistic
Xml document probabilistic
 
Expression of Query in XML object-oriented database
Expression of Query in XML object-oriented databaseExpression of Query in XML object-oriented database
Expression of Query in XML object-oriented database
 
Expression of Query in XML object-oriented database
Expression of Query in XML object-oriented databaseExpression of Query in XML object-oriented database
Expression of Query in XML object-oriented database
 
Expression of Query in XML object-oriented database
Expression of Query in XML object-oriented databaseExpression of Query in XML object-oriented database
Expression of Query in XML object-oriented database
 
XCLS++: A new algorithm to improve XCLS+ for clustering XML documents
XCLS++: A new algorithm to improve XCLS+ for clustering XML documents XCLS++: A new algorithm to improve XCLS+ for clustering XML documents
XCLS++: A new algorithm to improve XCLS+ for clustering XML documents
 
The International Journal of Information Technology, Control and Automation (...
The International Journal of Information Technology, Control and Automation (...The International Journal of Information Technology, Control and Automation (...
The International Journal of Information Technology, Control and Automation (...
 
Bi4101343346
Bi4101343346Bi4101343346
Bi4101343346
 
Effective Data Retrieval in XML using TreeMatch Algorithm
Effective Data Retrieval in XML using TreeMatch AlgorithmEffective Data Retrieval in XML using TreeMatch Algorithm
Effective Data Retrieval in XML using TreeMatch Algorithm
 
Xml data clustering an overview
Xml data clustering an overviewXml data clustering an overview
Xml data clustering an overview
 
5010
50105010
5010
 
RELATIONAL STORAGE FOR XML RULES
RELATIONAL STORAGE FOR XML RULESRELATIONAL STORAGE FOR XML RULES
RELATIONAL STORAGE FOR XML RULES
 
RELATIONAL STORAGE FOR XML RULES
RELATIONAL STORAGE FOR XML RULESRELATIONAL STORAGE FOR XML RULES
RELATIONAL STORAGE FOR XML RULES
 
Annotating Search Results from Web Databases
Annotating Search Results from Web Databases Annotating Search Results from Web Databases
Annotating Search Results from Web Databases
 
introduction of database in DBMS
introduction of database in DBMSintroduction of database in DBMS
introduction of database in DBMS
 
An overview of fragmentation
An overview of fragmentationAn overview of fragmentation
An overview of fragmentation
 
P REFIX - BASED L ABELING A NNOTATION FOR E FFECTIVE XML F RAGMENTATION
P REFIX - BASED  L ABELING  A NNOTATION FOR  E FFECTIVE  XML F RAGMENTATIONP REFIX - BASED  L ABELING  A NNOTATION FOR  E FFECTIVE  XML F RAGMENTATION
P REFIX - BASED L ABELING A NNOTATION FOR E FFECTIVE XML F RAGMENTATION
 

More from iosrjce

An Examination of Effectuation Dimension as Financing Practice of Small and M...
An Examination of Effectuation Dimension as Financing Practice of Small and M...An Examination of Effectuation Dimension as Financing Practice of Small and M...
An Examination of Effectuation Dimension as Financing Practice of Small and M...iosrjce
 
Does Goods and Services Tax (GST) Leads to Indian Economic Development?
Does Goods and Services Tax (GST) Leads to Indian Economic Development?Does Goods and Services Tax (GST) Leads to Indian Economic Development?
Does Goods and Services Tax (GST) Leads to Indian Economic Development?iosrjce
 
Childhood Factors that influence success in later life
Childhood Factors that influence success in later lifeChildhood Factors that influence success in later life
Childhood Factors that influence success in later lifeiosrjce
 
Emotional Intelligence and Work Performance Relationship: A Study on Sales Pe...
Emotional Intelligence and Work Performance Relationship: A Study on Sales Pe...Emotional Intelligence and Work Performance Relationship: A Study on Sales Pe...
Emotional Intelligence and Work Performance Relationship: A Study on Sales Pe...iosrjce
 
Customer’s Acceptance of Internet Banking in Dubai
Customer’s Acceptance of Internet Banking in DubaiCustomer’s Acceptance of Internet Banking in Dubai
Customer’s Acceptance of Internet Banking in Dubaiiosrjce
 
A Study of Employee Satisfaction relating to Job Security & Working Hours amo...
A Study of Employee Satisfaction relating to Job Security & Working Hours amo...A Study of Employee Satisfaction relating to Job Security & Working Hours amo...
A Study of Employee Satisfaction relating to Job Security & Working Hours amo...iosrjce
 
Consumer Perspectives on Brand Preference: A Choice Based Model Approach
Consumer Perspectives on Brand Preference: A Choice Based Model ApproachConsumer Perspectives on Brand Preference: A Choice Based Model Approach
Consumer Perspectives on Brand Preference: A Choice Based Model Approachiosrjce
 
Student`S Approach towards Social Network Sites
Student`S Approach towards Social Network SitesStudent`S Approach towards Social Network Sites
Student`S Approach towards Social Network Sitesiosrjce
 
Broadcast Management in Nigeria: The systems approach as an imperative
Broadcast Management in Nigeria: The systems approach as an imperativeBroadcast Management in Nigeria: The systems approach as an imperative
Broadcast Management in Nigeria: The systems approach as an imperativeiosrjce
 
A Study on Retailer’s Perception on Soya Products with Special Reference to T...
A Study on Retailer’s Perception on Soya Products with Special Reference to T...A Study on Retailer’s Perception on Soya Products with Special Reference to T...
A Study on Retailer’s Perception on Soya Products with Special Reference to T...iosrjce
 
A Study Factors Influence on Organisation Citizenship Behaviour in Corporate ...
A Study Factors Influence on Organisation Citizenship Behaviour in Corporate ...A Study Factors Influence on Organisation Citizenship Behaviour in Corporate ...
A Study Factors Influence on Organisation Citizenship Behaviour in Corporate ...iosrjce
 
Consumers’ Behaviour on Sony Xperia: A Case Study on Bangladesh
Consumers’ Behaviour on Sony Xperia: A Case Study on BangladeshConsumers’ Behaviour on Sony Xperia: A Case Study on Bangladesh
Consumers’ Behaviour on Sony Xperia: A Case Study on Bangladeshiosrjce
 
Design of a Balanced Scorecard on Nonprofit Organizations (Study on Yayasan P...
Design of a Balanced Scorecard on Nonprofit Organizations (Study on Yayasan P...Design of a Balanced Scorecard on Nonprofit Organizations (Study on Yayasan P...
Design of a Balanced Scorecard on Nonprofit Organizations (Study on Yayasan P...iosrjce
 
Public Sector Reforms and Outsourcing Services in Nigeria: An Empirical Evalu...
Public Sector Reforms and Outsourcing Services in Nigeria: An Empirical Evalu...Public Sector Reforms and Outsourcing Services in Nigeria: An Empirical Evalu...
Public Sector Reforms and Outsourcing Services in Nigeria: An Empirical Evalu...iosrjce
 
Media Innovations and its Impact on Brand awareness & Consideration
Media Innovations and its Impact on Brand awareness & ConsiderationMedia Innovations and its Impact on Brand awareness & Consideration
Media Innovations and its Impact on Brand awareness & Considerationiosrjce
 
Customer experience in supermarkets and hypermarkets – A comparative study
Customer experience in supermarkets and hypermarkets – A comparative studyCustomer experience in supermarkets and hypermarkets – A comparative study
Customer experience in supermarkets and hypermarkets – A comparative studyiosrjce
 
Social Media and Small Businesses: A Combinational Strategic Approach under t...
Social Media and Small Businesses: A Combinational Strategic Approach under t...Social Media and Small Businesses: A Combinational Strategic Approach under t...
Social Media and Small Businesses: A Combinational Strategic Approach under t...iosrjce
 
Secretarial Performance and the Gender Question (A Study of Selected Tertiary...
Secretarial Performance and the Gender Question (A Study of Selected Tertiary...Secretarial Performance and the Gender Question (A Study of Selected Tertiary...
Secretarial Performance and the Gender Question (A Study of Selected Tertiary...iosrjce
 
Implementation of Quality Management principles at Zimbabwe Open University (...
Implementation of Quality Management principles at Zimbabwe Open University (...Implementation of Quality Management principles at Zimbabwe Open University (...
Implementation of Quality Management principles at Zimbabwe Open University (...iosrjce
 
Organizational Conflicts Management In Selected Organizaions In Lagos State, ...
Organizational Conflicts Management In Selected Organizaions In Lagos State, ...Organizational Conflicts Management In Selected Organizaions In Lagos State, ...
Organizational Conflicts Management In Selected Organizaions In Lagos State, ...iosrjce
 

More from iosrjce (20)

An Examination of Effectuation Dimension as Financing Practice of Small and M...
An Examination of Effectuation Dimension as Financing Practice of Small and M...An Examination of Effectuation Dimension as Financing Practice of Small and M...
An Examination of Effectuation Dimension as Financing Practice of Small and M...
 
Does Goods and Services Tax (GST) Leads to Indian Economic Development?
Does Goods and Services Tax (GST) Leads to Indian Economic Development?Does Goods and Services Tax (GST) Leads to Indian Economic Development?
Does Goods and Services Tax (GST) Leads to Indian Economic Development?
 
Childhood Factors that influence success in later life
Childhood Factors that influence success in later lifeChildhood Factors that influence success in later life
Childhood Factors that influence success in later life
 
Emotional Intelligence and Work Performance Relationship: A Study on Sales Pe...
Emotional Intelligence and Work Performance Relationship: A Study on Sales Pe...Emotional Intelligence and Work Performance Relationship: A Study on Sales Pe...
Emotional Intelligence and Work Performance Relationship: A Study on Sales Pe...
 
Customer’s Acceptance of Internet Banking in Dubai
Customer’s Acceptance of Internet Banking in DubaiCustomer’s Acceptance of Internet Banking in Dubai
Customer’s Acceptance of Internet Banking in Dubai
 
A Study of Employee Satisfaction relating to Job Security & Working Hours amo...
A Study of Employee Satisfaction relating to Job Security & Working Hours amo...A Study of Employee Satisfaction relating to Job Security & Working Hours amo...
A Study of Employee Satisfaction relating to Job Security & Working Hours amo...
 
Consumer Perspectives on Brand Preference: A Choice Based Model Approach
Consumer Perspectives on Brand Preference: A Choice Based Model ApproachConsumer Perspectives on Brand Preference: A Choice Based Model Approach
Consumer Perspectives on Brand Preference: A Choice Based Model Approach
 
Student`S Approach towards Social Network Sites
Student`S Approach towards Social Network SitesStudent`S Approach towards Social Network Sites
Student`S Approach towards Social Network Sites
 
Broadcast Management in Nigeria: The systems approach as an imperative
Broadcast Management in Nigeria: The systems approach as an imperativeBroadcast Management in Nigeria: The systems approach as an imperative
Broadcast Management in Nigeria: The systems approach as an imperative
 
A Study on Retailer’s Perception on Soya Products with Special Reference to T...
A Study on Retailer’s Perception on Soya Products with Special Reference to T...A Study on Retailer’s Perception on Soya Products with Special Reference to T...
A Study on Retailer’s Perception on Soya Products with Special Reference to T...
 
A Study Factors Influence on Organisation Citizenship Behaviour in Corporate ...
A Study Factors Influence on Organisation Citizenship Behaviour in Corporate ...A Study Factors Influence on Organisation Citizenship Behaviour in Corporate ...
A Study Factors Influence on Organisation Citizenship Behaviour in Corporate ...
 
Consumers’ Behaviour on Sony Xperia: A Case Study on Bangladesh
Consumers’ Behaviour on Sony Xperia: A Case Study on BangladeshConsumers’ Behaviour on Sony Xperia: A Case Study on Bangladesh
Consumers’ Behaviour on Sony Xperia: A Case Study on Bangladesh
 
Design of a Balanced Scorecard on Nonprofit Organizations (Study on Yayasan P...
Design of a Balanced Scorecard on Nonprofit Organizations (Study on Yayasan P...Design of a Balanced Scorecard on Nonprofit Organizations (Study on Yayasan P...
Design of a Balanced Scorecard on Nonprofit Organizations (Study on Yayasan P...
 
Public Sector Reforms and Outsourcing Services in Nigeria: An Empirical Evalu...
Public Sector Reforms and Outsourcing Services in Nigeria: An Empirical Evalu...Public Sector Reforms and Outsourcing Services in Nigeria: An Empirical Evalu...
Public Sector Reforms and Outsourcing Services in Nigeria: An Empirical Evalu...
 
Media Innovations and its Impact on Brand awareness & Consideration
Media Innovations and its Impact on Brand awareness & ConsiderationMedia Innovations and its Impact on Brand awareness & Consideration
Media Innovations and its Impact on Brand awareness & Consideration
 
Customer experience in supermarkets and hypermarkets – A comparative study
Customer experience in supermarkets and hypermarkets – A comparative studyCustomer experience in supermarkets and hypermarkets – A comparative study
Customer experience in supermarkets and hypermarkets – A comparative study
 
Social Media and Small Businesses: A Combinational Strategic Approach under t...
Social Media and Small Businesses: A Combinational Strategic Approach under t...Social Media and Small Businesses: A Combinational Strategic Approach under t...
Social Media and Small Businesses: A Combinational Strategic Approach under t...
 
Secretarial Performance and the Gender Question (A Study of Selected Tertiary...
Secretarial Performance and the Gender Question (A Study of Selected Tertiary...Secretarial Performance and the Gender Question (A Study of Selected Tertiary...
Secretarial Performance and the Gender Question (A Study of Selected Tertiary...
 
Implementation of Quality Management principles at Zimbabwe Open University (...
Implementation of Quality Management principles at Zimbabwe Open University (...Implementation of Quality Management principles at Zimbabwe Open University (...
Implementation of Quality Management principles at Zimbabwe Open University (...
 
Organizational Conflicts Management In Selected Organizaions In Lagos State, ...
Organizational Conflicts Management In Selected Organizaions In Lagos State, ...Organizational Conflicts Management In Selected Organizaions In Lagos State, ...
Organizational Conflicts Management In Selected Organizaions In Lagos State, ...
 

Recently uploaded

Introduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxIntroduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxupamatechverse
 
College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service NashikCollege Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service NashikCall Girls in Nagpur High Profile
 
GDSC ASEB Gen AI study jams presentation
GDSC ASEB Gen AI study jams presentationGDSC ASEB Gen AI study jams presentation
GDSC ASEB Gen AI study jams presentationGDSCAESB
 
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur High Profile
 
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...ranjana rawat
 
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escortsranjana rawat
 
Porous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingPorous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingrakeshbaidya232001
 
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Christo Ananth
 
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICS
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICSHARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICS
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICSRajkumarAkumalla
 
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINEMANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINESIVASHANKAR N
 
Software Development Life Cycle By Team Orange (Dept. of Pharmacy)
Software Development Life Cycle By  Team Orange (Dept. of Pharmacy)Software Development Life Cycle By  Team Orange (Dept. of Pharmacy)
Software Development Life Cycle By Team Orange (Dept. of Pharmacy)Suman Mia
 
HARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IVHARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IVRajaP95
 
chaitra-1.pptx fake news detection using machine learning
chaitra-1.pptx  fake news detection using machine learningchaitra-1.pptx  fake news detection using machine learning
chaitra-1.pptx fake news detection using machine learningmisbanausheenparvam
 
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130Suhani Kapoor
 
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...Dr.Costas Sachpazis
 
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...ranjana rawat
 
Processing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptxProcessing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptxpranjaldaimarysona
 
SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )Tsuyoshi Horigome
 

Recently uploaded (20)

Introduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxIntroduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptx
 
College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service NashikCollege Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
 
GDSC ASEB Gen AI study jams presentation
GDSC ASEB Gen AI study jams presentationGDSC ASEB Gen AI study jams presentation
GDSC ASEB Gen AI study jams presentation
 
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
 
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
 
Porous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingPorous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writing
 
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
 
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICS
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICSHARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICS
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICS
 
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptxExploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
 
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINEMANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
 
Software Development Life Cycle By Team Orange (Dept. of Pharmacy)
Software Development Life Cycle By  Team Orange (Dept. of Pharmacy)Software Development Life Cycle By  Team Orange (Dept. of Pharmacy)
Software Development Life Cycle By Team Orange (Dept. of Pharmacy)
 
HARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IVHARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IV
 
★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR
★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR
★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR
 
chaitra-1.pptx fake news detection using machine learning
chaitra-1.pptx  fake news detection using machine learningchaitra-1.pptx  fake news detection using machine learning
chaitra-1.pptx fake news detection using machine learning
 
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
 
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
 
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
Processing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptxProcessing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptx
 
SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )
 

Duplicate Detection in Hierarchical Data Using XPath

  • 1. IOSR Journal of Computer Engineering (IOSR-JCE) e-ISSN: 2278-0661,p-ISSN: 2278-8727, Volume 17, Issue 6, Ver. I (Nov – Dec. 2015), PP 69-76 www.iosrjournals.org DOI: 10.9790/0661-17616976 www.iosrjournals.org 69 | Page Duplicate Detection in Hierarchical Data Using XPath 1 Akash R. Petkar, 2 Vijay B. Patil 1,2 Computer Science &Engineering, MIT Aurangabad. Abstract: There were many techniques for identifying duplicates in relational data, but only a few solutions focus on identifying duplicates which has complex hierarchical structure, as XML data. In this paper, we present a new technique for identifying XML duplicates, so-called XML duplication using Xpath. XML duplication using Xpath technique uses a Bayesian network to conclude the possibility that two xml elements are duplicates, based on the information within the elements and other information organized in the XML. In addition, to increase the proficiency of the web usage, a new pruning strategy was created. This pruning strategy will help to gain maximum benefits over non-computing algorithm. This technique can be used to increase the proficiency of identifying duplicates and remove it, so no duplicate record will be there. Through many experiments, our algorithm is able to achieve high accuracy and retrieve count in several XML dataset. XML duplication using Xpath technique is able to outclass another technique for identifying duplicates, both in proficiency and potency. Keywords: Identifying duplicates, XML, Bayesian network, object cleaning, hierarchical structure, Xpath. I. Introduction Electronic data plays an important role in today’s world for business processes, applications and making quick decisions. As we are focusing on how the data can be essential and we have to compromise on different types of errors which come in different representations [1]. In this paper we are focusing on different types of errors that can be occurred in Data. We will mainly focus on fuzzy duplicates or duplicate records. Duplicate records are multiple representation of same real world object that are differently represented. These records are somewhat different from each other. These records attributes differ in some way from each other in XML document. Duplicate detection means finding out these different representations of same real world object. Duplicate detection is a tough task to find duplicate records. The common comparisons algorithm to find the duplicates cannot be used, so finddifferent possibly matching strategyto compare, so that they are referring to the same object or not In this paper, the focusis on which possibly matching strategy can find out to detect duplicate records. The Focus should be able to match the different representation of information at a conceptual level. Take xml dataset for comparing different possibilities. An XML dataset or document includes a set of nodes in the document. It consists of root node and child nodes. It starts with an opening tag (Ex. <A>) and a closing tag(Ex. </A>). Figure 1: Attribute Scope In Fig.1two records are shown for two different XML records. But both records are representing the same country so there can be possibility that in XML dataset or document it can be present. So find different such possibilities that represent the same object. Duplicate records are exactly same by textual information. But if they are slightly changed; the information are not exactly duplicates. Another problem is that XML can be presented in different structures so the possibility of finding the duplicates becomes high. An XML document contains one root element and number of child element, but child element can also have different child elements and so on. In this paper, a novel technique is presented which can be used to detect XML duplication of same real world object. Rest of the paper is organized as follows: Section II the related work is discussed. Section III describes the proposed work. Section IV describes the mathematical model. Section Vpresents the performance analysis. Section VI shows graphical model. Section VII concludes the paper. <Customer> <Country>Australia</country> </Customer> <Customer> <Country>AUS</country> </Customer>
  • 2. Duplicate Detection in Hierarchical Data Using XPath DOI: 10.9790/0661-17616976 www.iosrjournals.org 70 | Page II. Related Work Data cleaning [2] ordata cleansing deals with identifying and removing errors, irregularities from data are represented in such a way that it can degrade the quality of data.Dataquality problems are there in single data files such as databases or XML File ex. Due to misspelling during entry of data, invalid data or some missing information. When integrating multiple file data to integrate into one file then it should be able to get a single file that is free from duplicate records. 2.1 Eliminating Fuzzy Duplicates in Data Warehouses In data warehouses large databases are integrated ex. global web-based information system, merged database systems so there can be a possibility of duplicate records. The need for data cleaning process becomes an important factor for getting the accurate records with no duplication. The duplicate records or redundant records are represented in different representations. In order to get the data accurate and consistent, joining different types of data and merging into one data and eliminate the duplicate data becomes a necessary step for faster processing of data. Firstly find out the different possibilities of data that can be represented in different structure so that multiple documents when combine together will form a single document which is free from duplicate records. For Example in Table 1 there are records that consist off_n (firstname), country and email. Record r1 and r2 are exact duplicates because their f_n, country and email values are same so we can say they are exact (100 %) duplicates, so we can easily say they are duplicates and can be easily removed. Record r1 and r3 are not exact duplicates because theirf_n and email values are same, but not country values are same. But in record r3 the country values is denoted in different format, but it refers to same record. Therefore, find such duplicates. These duplicates are called as fuzzy duplicates [3]. Table 1: Exact Duplicates as well as Fuzzy Duplicates Record f_n country Email r1 John Australia john@gmail.com r2 John Australia john@gmail.com r3 John AUS john@gmail.com 2.2 DogmatiX Tracks down Duplicates in XML Inthis,dogmatix defines a general framework for identifying the duplicates. In this, records are checked whether they are duplicates or not based on their values. In real world, records are represented in multiple patterns for same object. The dogmatix frameworkis flexible to work on different algorithms and new methods can be added to improve this framework. An overview of the framework is given in [5].Theframework consists of three types,  Candidate definition: Defines which document should be compare.  Duplicate definition: Defines when two objects are duplicates.  Duplicate detection: Defines How Duplicates are searched. III. Proposed Work 3.1 XML Duplication Using XPath Every tuple in a relational table has exactly one value for every attribute. Most duplicate detection approaches designed for a single relation iteratively compare pairs of tuples as follows: They first compare attribute values pair wisely by computing a value similarity, and then combine these similarities to a total tuple similarity. If the similarity is above a specified threshold, tuple pairs represent duplicates, otherwise they represent non-duplicates. This comparison approach is called a threshold similarity measure approach. Figure 2: CustomerDetails XML CustomerDetails Customer 1 Customer 2 Customer n PersonID Address1 Address2 Email POB DOB PersonName PersonID Address1 Address2 Email POB DOB PersonName PersonID Address1 Address2 Email POB DOB PersonName
  • 3. Duplicate Detection in Hierarchical Data Using XPath DOI: 10.9790/0661-17616976 www.iosrjournals.org 71 | Page DTD are a strict representation or set of rules for XML.XML can be represented by a tree structure as shown in Fig. 2.In this customerdetails, is the root element and customer are the child elements. The child elements have 7 attributes as personid, personname, dob, pob, Email, Address1, and Address2. Today, XML is used in many web applications. The popularity of xml has increased because of its platform independent, less space and easy to use ability. XML is mostly used for data storage and fast transfer of data. 3.2 Different Types of Parsers On web there are various types of parsers are available for parsing the XML document. Some of them are good in some features and some of them are not. In Table 2 different types of parser are used for parsing, but some parser are design for read only access. But in this paper, a novel approach is proposed for parsing the XML file and finding the exact duplicates or fuzzy duplicates and removing the duplicates, so that pure XML document must be formed.STAX parser, SAX parser, DOM parser etc. are used for parsing. In Table 2 how these parsers differ from each other.The notation for the following Table 2 can be used as Xpath Capability = XC, CPU & Memory = C&M, Forward only = FO, Read xml = RXML,Write xml = WXML, create,read,update,delete = crud. Table 2: Features Table Feature STAX SAX DOM API Type Pull,Streaming Push, Streaming In memory tree XC No No yes C&M Good Good Varies FO Yes Yes No RXML Yes Yes Yes WXML Yes Yes Yes CRUD No No Yes 3.3 XMLDOM Parser The structure of Dom parser can be identified using the following diagram. Figure 3: XML Dom Parser Structure DOM, also known as document object model. It is mostly used for XML operation today. The responsibility of a DOM parser is to read the XML document specified and convert that into a tree structure suitable for traversal. Internally, a DOM parser takes help from SAX parser to read the file and the compares the XML against the DTD or the schema, so that relationships between parent-child tags can be set up and the tag tree is built into the memory. DOM first copies XML into memory before parsing It, So It is a good advice to have large heap size to avoid exceptions. 3.4 XML Duplication Using Xpath Algorithm In this, we describe how the algorithm works and how it is able to detect duplicates in XML dataset and remove the duplicates from the XML. Steps in xml duplication using XPath algorithm, Begin; 1) Read the xml file and get the ordered list of parent nodes (L). 2) Read the child nodes of xml file. 3) Current score = 0 4)For each node n in L do 5)If (attributes of child nodes= threshold value) then Completely duplicate Else Initializer Parser Dom Parsing Process XML Parser XML Document
  • 4. Duplicate Detection in Hierarchical Data Using XPath DOI: 10.9790/0661-17616976 www.iosrjournals.org 72 | Page No duplicate 6) Remove all duplicate nodes and save the new xml file. Table 3: Results for Customer, Suplier and Order Dataset IV. Mathematical Model 1. Identify XML records R 𝑅 = 𝑟1, 𝑟2, 𝑟3, 𝑟4, 𝑟4, 𝑟5, … … . Where R is main set of records 2. Then Identify Nodes of each record N 𝑁 = 𝑛1, 𝑛2, 𝑛3, 𝑛4, 𝑟5, …… . Where N is main set of nodes for a record 3. Probability that a node is duplicate 𝑃 = 𝑝1, 𝑝2, 𝑝3, 𝑝4, 𝑝5, … …. 𝑃 𝑡𝑖𝑗 |𝑉𝑡𝑖𝑗 , 𝐶𝑡𝑖𝑗 = 1 𝑖𝑓𝑉𝑡𝑖𝑗 = 𝐶𝑡𝑖𝑗 = 1 0 𝑂𝑡ℎ𝑒𝑟𝑤𝑖𝑠𝑒 Where P is main set of probability If 𝑉𝑡𝑖𝑗 = 𝐶𝑡𝑖𝑗 = 𝑇ℎ𝑟𝑒𝑠ℎ𝑜𝑙𝑑𝑉𝑎𝑙𝑢𝑒 Then It Is Duplicate 4. Identify Duplicate nodes DN 𝐷𝑁 = 𝑑𝑛1, 𝑑𝑛2, 𝑑𝑛3, 𝑑𝑛4, 𝑑𝑛5,… … . Where DN is main set of the duplicate nodes 5. Identify Duplicate records for DR 𝐷𝑅 = 𝑑𝑟1, 𝑑𝑟2, 𝑑𝑟3, 𝑑𝑟4, 𝑑𝑟5, … …. Where DR is main set of the duplicate records 6. Calculating Total Time for Duplication 𝑇𝑜𝑡𝑎𝑙𝑇𝑖𝑚𝑒 = 𝑇𝑖𝑚𝑒𝐷𝑒𝑝𝑡ℎ𝑆𝑒𝑎𝑟𝑐ℎ + 𝐷𝑒𝑑𝑢𝑝𝑙𝑖𝑐𝑎𝑡𝑖𝑜𝑛 2 V. Performance Analysis In this paper, the experiments are performed on customer, suplier and order datasets. The customer dataset consist of attributes as personid,personname, dob, pob, email, address1, address2. The suplierdataset consist of attributessuppkey, name, address, nationkey, phone, acctbal, comment. The Orderdatasetconsists of orderkey, custkey, orderstatus, totalprice, orderdate, orderpriority, and clerk. The datasets are downloaded from www.cs.washington.edu/research/xmldatasets/www.repository.html 5.1 Test Case I In customerdetailsfile there are seven attributes they are personid, personname, DOB, POB, email, address1 and address2. When parsed the customerdetailsXML file then which elements are exact duplicates then that records are duplicates. Record 1 <Customer> <personid>1</personid> <personname>Order#0000001</personname> <dob>01/01/2001</dob> Dataset Customer Suplier Order Records 1000 1000 1000 Time Depth Search For Duplicate 2973 Milliseconds 3000 Milliseconds 2933 Milliseconds Sorted Set 2873 Milliseconds 2876 Milliseconds 2872 Milliseconds Deduplication For The Duplicate Record 91 Milliseconds 130 Milliseconds 59 Milliseconds Total Time 1532 Milliseconds 1565 Milliseconds 1496 Milliseconds
  • 5. Duplicate Detection in Hierarchical Data Using XPath DOI: 10.9790/0661-17616976 www.iosrjournals.org 73 | Page <pob>Australia</pob> <email>Order1@gmail.com</email> <Address1>Jalna</Address1> <Address2>Beed</Address2> </Customer> Record 2 <Customer> <personid>1</personid> <personname>Order#0000001</personname> <dob>01/01/2001</dob> <pob>Australia</pob> <email>Order1@gmail.com</email> <Address1>Jalna</Address1> <Address2>Beed</Address2> </Customer> Above both records are exactly same then they are exact duplicates. But if attributes places of Adddress1 and Address2 are changed then readings as Record 3 <Customer> <personid>1</personid> <personname>Order#0000001</personname> <dob>01/01/2001</dob> <pob>Australia</pob> <email>Order1@gmail.com</email> <Address1>Jalna</Address1> <Address2>Beed</Address2> </Customer> Record 4 <Customer> <personid>1</personid> <personname>Order#0000001</personname> <dob>01/01/2001</dob> <pob>Australia</pob> <email>Order1@gmail.com</email> <Address1>Beed</Address1> <Address2>Jalna</Address2> </Customer> It Shows, they are not (100 %) exactly Duplicates, because of their address places are change, but they are duplicates. They are represented in different manner so they represent to same object. 5.2 Test Case 2 In CustomerDetails, alternate names for countries are given,soitcan be easily identifiedwhetherit is duplicate records or not. Table 4: Alternate Names for Countries Countries Other name Lowercase name Australia Canada China India United Kingdom Sri Lanka France Iceland Mexico New Zealand AU CA CN IN UK LK FR IS MX NZ au ca cn in uk lk fr is mx nz
  • 6. Duplicate Detection in Hierarchical Data Using XPath DOI: 10.9790/0661-17616976 www.iosrjournals.org 74 | Page Record 1 <Customer> <personid>1</personid> <personname>Order#0000001</personname> <dob>01/01/2001</dob> <pob>Australia</pob> <email>Order1@gmail.com</email> <Address1>Jalna</Address1> <Address2>Beed</Address2> </Customer> Record 2 <Customer> <personid>1</personid> <personname>Order#0000001</personname> <dob>01/01/2001</dob> <pob>AU</pob> <email>Order1@gmail.com</email> <Address1>Beed</Address1> <Address2>Jalna</Address2> </Customer> Record 3 <Customer> <personid>1</personid> <personname>Order#0000001</personname> <dob>01/01/2001</dob> <pob>au</pob> <email>Order1@gmail.com</email> <Address1>Beed</Address1> <Address2>Jalna</Address2> </Customer> These records are not (100 %)Duplicates,becausethe countries names are different, but all they refer to one country,so above 3 records are duplicates. Table 3 shows the performance results. VI. Graphical Model Precisionmeasures the percentage of correctly identified duplicates, over the total set of objects determined as duplicates by the system. Recallmeasures the percentage of duplicates correctly identified by the system, over the total set of duplicate objects. Table 5 shows the calculated results for Dogmatix and Xml Duplication Using Xpath.Also the graphical comparison of Dogmatix and Xml Duplication using Xpath for customer dataset, suplier dataset and order dataset are given below. Table 5: Comparison Results In terms Of Precision and Recall Dataset Dogmatix XML Duplication Using Xpath Precision Recall Precision Recall Customer 0.4522 0.45 0.502 0.5 Suplier 0.1206 0.12 0.417 0.41 order 0.417 0.41 0.48 0.48
  • 7. Duplicate Detection in Hierarchical Data Using XPath DOI: 10.9790/0661-17616976 www.iosrjournals.org 75 | Page Figure 4: For Customer Dataset Figure 5: For Suplier Dataset Figure 6: For Order Dataset VII. Conclusion In this paper, we present the algorithm to determine whether two recordsare duplicates or not based on a given threshold. For finding the duplicates the Algorithm uses a Bayesian network. Xml duplication using Xpath requires little user interaction, since user only needs to provide the Xml dataset file and based on that file the user has to give the threshold value. However this technique is very flexible for duplicate detection in XML data. These techniques will able to solve the problem of duplicate data. Nowadays more and more data is generated, because of various devices so thereis lots of data to be maintained in the database so there can be duplicate records for a single record. To avoid the problem of efficiency of network, this technique can be useful to reduce the network load. This technique is performed under various experiments to find duplicate values. These experiments are performed on both artificial and real world dataset and showed that XML duplication using Xpath is very good technique. The process of duplication and Deduplication of records using Xpath technique is able to outclass other parser such as (Ex. DOM, SAX, and STAX Etc.). When calculated Recall and Precision for Xpath is 93 %, when compared with Recall and Precision for Dogmatix is 66 %. The success demonstrated in the experimental results will show that there is still something more we can do for the future work. References [1] E. Rahm and H.H. Do, “Data Cleaning: Problems and Current Approaches,” IEEE Data Eng. Bull., vol. 23, no. 4, pp. 3-13, Dec. 2000. 0.42 0.44 0.46 0.48 0.5 0.52 Precision Recall XML Duplication Using Xpath Dogmatix 0 0.1 0.2 0.3 0.4 0.5 Precision Recall XML Duplication Using Xpath Dogmatix 0.36 0.38 0.4 0.42 0.44 0.46 0.48 0.5 Precision Recall XML Duplication Using Xpath Dogmatix
  • 8. Duplicate Detection in Hierarchical Data Using XPath DOI: 10.9790/0661-17616976 www.iosrjournals.org 76 | Page [2] Melaine Weis and Felix Naumann, “Detecting Duplicate Objects In XML,”ACM SIGMOD Special Interest Group on Mgmt of Data IQIS 2004, pp. 10-19, 2004 [3] R. Ananthakrishna, S. Chaudhuri, and V. Ganti, “Eliminating Fuzzy Duplicates in Data Warehouses,” Proc. Conf. Very Large Databases (VLDB), pp. 586-597, 2002. [4] D.V. Kalashnikov and S. Mehrotra, “Domain-Independent Data Cleaning via Analysis of Entity-Relationship Graph.”ACM Trans. Database Systems, vol. 31, no. 2, pp. 716-767, 2006. [5] M. Weis and F. Naumann, “Dogmatix Tracks Down Duplicates in XML,” Proc. ACM SIGMOD Conf. Management of Data, pp. 431-442, 2005. [6] L. Leita˜o, P. Calado, and M. Weis, “Structure-Based Inference of XML Similarity for Fuzzy Duplicate Detection,” Proc. 16th ACM Int’l Conf. Information and Knowledge Management, pp. 293-302, 2007. [7] A.M. Kade and C.A. Heuser, “Matching XML Documents in Highly Dynamic Applications,” Proc. ACM Symp. Document Eng. (DocEng), pp. 191-198, 2008. [8] D. Milano, M. Scannapieco, and T. Catarci, “Structure Aware XML Object Identification,” Proc. VLDB Workshop Clean Databases (CleanDB), 2006. [9] Luis Leitao, PavelCalado, and Melaine Herschel, “Efficient and Effective Duplicate Detection In Hierarchical Data”Vol. 25, No. 5, pp. 1028-1041 2013. [10] P. Calado, M. Herschel, and L. Leita˜o, “An Overview of XML Duplicate Detection Algorithms,” Soft Computing in XML Data Management, Studies in Fuzziness and Soft Computing, vol. 255, pp. 193-224, 2010. [11] S. Puhlmann, M. Weis, and F. Naumann, “XML Duplicate Detection Using Sorted Neighborhoods,” Proc. Conf. Extending Database Technology (EDBT), pp. 773-791, 2006.