Efficient and effective duplicate detection in hierarchical data
For more projects visit @ www.nanocdac.com

Published in: Technology

Transcript

  • 1. Efficient and Effective Duplicate Detection in Hierarchical Data

    Abstract:
    Although there is a long line of work on identifying duplicates in relational data, only a few solutions focus on duplicate detection in more complex hierarchical structures, like XML data. In this paper, we present a novel method for XML duplicate detection, called XMLDup. XMLDup uses a Bayesian network to determine the probability of two XML elements being duplicates, considering not only the information within the elements, but also the way that information is structured. In addition, to improve the efficiency of the network evaluation, a novel pruning strategy, capable of significant gains over the unoptimized version of the algorithm, is presented. Through experiments, we show that our algorithm is able to achieve high precision and recall scores in several datasets. XMLDup is also able to outperform another state-of-the-art duplicate detection solution, both in terms of efficiency and of effectiveness. Finally, we also study how important the structure of elements is in the duplicate detection process. We observe not only that structure can clearly influence the outcome, but also that, by ensuring a structure that is adequate to the characteristics of the data, we can actually improve the quality of the results.

    Software and hardware requirements

    Hardware required:
    System: Pentium IV
    Hard disk: 80 GB
    RAM: 512 MB

    Software required:
    O/S: Windows XP
    Language: Visual C#

    www.nanocdac.com  www.nsrcnano.com  Branches: Hyderabad, Nagpur
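
    The abstract's core idea, scoring two XML elements as duplicates from both their values and their structure, and abandoning a comparison early once it cannot succeed, can be illustrated with a short Python sketch. This is not the XMLDup Bayesian network itself: the function names, the use of difflib for string similarity, the product and average-of-best-match combination rules, and the threshold-based early exit are all assumptions chosen to keep the example small.

```python
import xml.etree.ElementTree as ET
from difflib import SequenceMatcher


def text_similarity(a, b):
    """String similarity in [0, 1] (assumption: difflib ratio, not the paper's measure)."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def duplicate_score(e1, e2, threshold=0.0):
    """Hypothetical duplicate score for two XML elements, in [0, 1].

    Leaves are compared by text. Inner elements multiply one factor per
    shared child tag, where each factor is the average best-match score of
    e1's children against e2's children with the same tag. Every factor is
    at most 1, so the running product is an upper bound on the final score;
    once it drops below `threshold`, evaluation stops and 0.0 is returned.
    """
    children1, children2 = list(e1), list(e2)
    if not children1 and not children2:
        return text_similarity(e1.text or "", e2.text or "")

    score = 1.0
    compared = False
    for tag in sorted({c.tag for c in children1} | {c.tag for c in children2}):
        group1 = [c for c in children1 if c.tag == tag]
        group2 = [c for c in children2 if c.tag == tag]
        if not group1 or not group2:
            continue  # assumption: a tag missing on one side is treated as neutral
        compared = True
        best = [max(duplicate_score(a, b) for b in group2) for a in group1]
        score *= sum(best) / len(best)
        if score < threshold:
            return 0.0  # pruned: the remaining factors cannot raise the score
    return score if compared else 0.0


if __name__ == "__main__":
    a = ET.fromstring("<movie><title>The Matrix</title><year>1999</year></movie>")
    b = ET.fromstring("<movie><title>Matrix, The</title><year>1999</year></movie>")
    print(duplicate_score(a, b, threshold=0.3))
```

    Because the structure decides which values are compared with which, moving a field to a different level of the hierarchy changes the score, which is the kind of structural effect the abstract says the authors study.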