Efficient and effective duplicate detection in hierarchical data

For more projects visit @ www.nanocdac.com


Abstract:

Although there is a long line of work on identifying duplicates in relational data, only a few solutions focus on duplicate detection in more complex hierarchical structures, like XML data. In this paper, we present a novel method for XML duplicate detection, called XMLDup. XMLDup uses a Bayesian network to determine the probability of two XML elements being duplicates, considering not only the information within the elements, but also the way that information is structured. In addition, to improve the efficiency of the network evaluation, a novel pruning strategy, capable of significant gains over the unoptimized version of the algorithm, is presented.

Through experiments, we show that our algorithm is able to achieve high precision and recall scores in several datasets. XMLDup is also able to outperform another state-of-the-art duplicate detection solution, both in terms of efficiency and of effectiveness. Finally, we also study how important the structure of elements is in the duplicate detection process. We observe that, not only structure can clearly influence the outcome, but also that, by ensuring a structure that is adequate to the characteristics of the data, we can actually improve the quality of the results.

Software and hardware requirements

Hardware Required:
System : Pentium IV
Hard Disk : 80 GB
RAM : 512 MB

Software Required:
O/S : Windows XP
Language : Visual C#

www.nanocdac.com  www.nsrcnano.com  Branches: Hyderabad, Nagpur
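
The idea in the abstract, scoring two XML elements by both their values and their structure and stopping early when a pair cannot qualify, can be illustrated with a rough sketch. This is not the XMLDup Bayesian network itself: it is a simplified stand-in, written in Python for brevity, that assumes string similarity as the leaf-level match probability, a plain average as the combination rule, and an upper-bound check as the pruning step. The names `text_similarity`, `duplicate_score`, and `threshold` are hypothetical.

```python
import xml.etree.ElementTree as ET
from difflib import SequenceMatcher


def text_similarity(a, b):
    """String similarity in [0, 1]; an assumed stand-in for a leaf match probability."""
    left = (a or "").strip().lower()
    right = (b or "").strip().lower()
    return SequenceMatcher(None, left, right).ratio()


def duplicate_score(e1, e2, threshold=0.0):
    """Score in [0, 1] for how likely e1 and e2 describe the same object.

    Leaf elements are compared by their text. Inner elements average, over
    every child tag, the best score among same-tag child pairs, so the way
    the data is nested affects the result. Returns 0.0 as soon as the score
    can no longer reach `threshold` (the pruning step).
    """
    tags = sorted({c.tag for c in e1} | {c.tag for c in e2})
    if not tags:
        return text_similarity(e1.text, e2.text)

    total = 0.0
    for i, tag in enumerate(tags):
        remaining = len(tags) - i
        # Upper bound: assume every tag not yet evaluated scores a perfect 1.0.
        if (total + remaining) / len(tags) < threshold:
            return 0.0  # pruned: this pair cannot reach the threshold
        pairs = [(a, b) for a in e1.findall(tag) for b in e2.findall(tag)]
        total += max((duplicate_score(a, b) for a, b in pairs), default=0.0)
    return total / len(tags)


if __name__ == "__main__":
    m1 = ET.fromstring(
        "<movie><title>The Matrix</title><year>1999</year>"
        "<actor>Keanu Reeves</actor></movie>"
    )
    m2 = ET.fromstring(
        "<movie><title>Matrix, The</title><year>1999</year>"
        "<actor>Keanu Reeves</actor></movie>"
    )
    print(duplicate_score(m1, m2, threshold=0.5))
```

In this sketch, a tag that appears in only one of the two elements contributes a score of zero, which is one simple way to let structural differences lower the overall score. A higher `threshold` lets more non-matching pairs be discarded before all of their children are compared.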