24

Impulse Technologies
Beacons U to World of technology
044-42133143, 98401 03301,9841091117 ieeeprojects@yahoo.com www.impulse.net.in
Efficient and Effective Duplicate Detection in Hierarchical Data
Abstract
Although there is a long line of work on identifying duplicates in relational
data, only a few solutions focus on duplicate detection in more complex
hierarchical structures, like XML data. In this paper, we present a novel method for
XML duplicate detection, called XMLDup. XMLDup uses a Bayesian network to
determine the probability of two XML elements being duplicates, considering not
only the information within the elements, but also the way that information is
structured. In addition, to improve the efficiency of the network evaluation, a novel
pruning strategy, capable of significant gains over the unoptimized version of the
algorithm, is presented. Through experiments, we show that our algorithm is able
to achieve high precision and recall scores on several datasets. XMLDup is also
able to outperform another state-of-the-art duplicate detection solution, both in
terms of efficiency and of effectiveness. Finally, we also study how important the
structure of elements is in the duplicate detection process. We observe not only
that structure can clearly influence the outcome, but also that, by ensuring a
structure that is adequate to the characteristics of the data, we can actually improve
the quality of the results.
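
The idea of scoring two XML elements from both their values and their structure, and of stopping early once a pair can no longer qualify, can be illustrated with a minimal sketch. This is not the XMLDup Bayesian network: a plain product of similarity scores stands in for the network's probabilities, and the names duplicate_score, text_similarity and threshold are hypothetical, chosen only for this example.

```python
import xml.etree.ElementTree as ET
from difflib import SequenceMatcher


def text_similarity(a, b):
    """String similarity in [0, 1]; a stand-in for any value-level measure."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def duplicate_score(e1, e2, threshold=0.0):
    """Score in [0, 1] that two elements describe the same object.

    Returns 0.0 as soon as the running score can no longer reach `threshold`.
    """
    if e1.tag != e2.tag:
        return 0.0
    score = 1.0

    # Value level: compare the elements' own text when both have some.
    t1, t2 = (e1.text or "").strip(), (e2.text or "").strip()
    if t1 and t2:
        score *= text_similarity(t1, t2)

    # Structure level: match each child of e1 to its best same-tag child of e2.
    for c1 in e1:
        # Pruning: every factor is at most 1, so the running score is an
        # upper bound on the final score. Below the threshold, give up.
        if score <= 0.0 or score < threshold:
            return 0.0
        candidates = [c2 for c2 in e2 if c2.tag == c1.tag]
        if not candidates:
            continue  # assumption: missing children are not penalised here
        needed = threshold / score  # what the child must reach at minimum
        score *= max(duplicate_score(c1, c2, needed) for c2 in candidates)

    return score if score >= threshold else 0.0


a = ET.fromstring("<movie><title>Alien</title><year>1979</year></movie>")
b = ET.fromstring("<movie><title>Aliens</title><year>1979</year></movie>")
print(duplicate_score(a, b))        # full evaluation
print(duplicate_score(a, b, 0.95))  # pruned: the title alone rules the pair out
```

Because the running score can only decrease, checking it against the threshold before each child comparison never changes which pairs pass; it only avoids work on pairs that were already going to fail.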

Your own ideas or any project from any company can be implemented
at a better price (all projects can be done in Java or DotNet, whichever the student wants).