Efficient and effective duplicate detection in hierarchical data

Abstract:

Although there is a long line of work on identifying duplicates in relational data, only a few solutions focus on duplicate detection in more complex hierarchical structures, like XML data. In this paper, we present a novel method for XML duplicate detection, called XMLDup. XMLDup uses a Bayesian network to determine the probability of two XML elements being duplicates, considering not only the information within the elements, but also the way that information is structured. In addition, to improve the efficiency of the network evaluation, a novel pruning strategy, capable of significant gains over the unoptimized version of the algorithm, is presented. Through experiments, we show that our algorithm is able to achieve high precision and recall scores in several datasets. XMLDup is also able to outperform another state-of-the-art duplicate detection solution, both in terms of efficiency and of effectiveness. Finally, we also study how important the structure of elements is in the duplicate detection process. We observe not only that structure can clearly influence the outcome, but also that, by ensuring a structure that is adequate to the characteristics of the data, we can actually improve the quality of the results.

Software and hardware requirements

Hardware Required:
System    : Pentium IV
Hard Disk : 80 GB
RAM       : 512 MB

Software Required:
O/S      : Windows XP
Language : Visual C#

www.nanocdac.com www.nsrcnano.com Branches: Hyderabad, Nagpur
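
The abstract describes scoring two XML elements by combining the similarity of their values with the similarity of their structure, and pruning the evaluation when a pair can no longer qualify as a duplicate. The Python sketch below illustrates that general idea only. It is not the XMLDup Bayesian network from the paper, nor the project's Visual C# implementation. The function names, the use of difflib's SequenceMatcher as the string measure, the product-of-best-matches combination rule, and the threshold-based early exit are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from difflib import SequenceMatcher


def text_similarity(a, b):
    """Similarity in [0, 1] between two strings (a stand-in for any measure)."""
    a = (a or "").strip().lower()
    b = (b or "").strip().lower()
    return SequenceMatcher(None, a, b).ratio()


def duplicate_probability(e1, e2, threshold=0.0):
    """Heuristic score in [0, 1] that elements e1 and e2 are duplicates.

    Hypothetical combination rule: for every child tag the two elements
    share, take the best-matching pair of children, then multiply the
    per-tag scores together. Returns 0.0 early once the running product
    falls below `threshold` (the pruning step).
    """
    if e1.tag != e2.tag:
        return 0.0
    kids1, kids2 = list(e1), list(e2)
    if not kids1 or not kids2:
        # At least one side is a leaf: fall back to comparing text values.
        return text_similarity(e1.text, e2.text)

    # Child tags present in both elements, in e1's order, without repeats.
    tags2 = {c.tag for c in kids2}
    shared = [t for t in dict.fromkeys(c.tag for c in kids1) if t in tags2]
    if not shared:
        return 0.0  # no comparable structure at all

    score = 1.0
    for tag in shared:
        best = max(
            duplicate_probability(c1, c2)
            for c1 in kids1 if c1.tag == tag
            for c2 in kids2 if c2.tag == tag
        )
        score *= best
        # Every factor is at most 1, so the product can only shrink:
        # once it drops below the threshold, the rest can be skipped.
        if score < threshold:
            return 0.0
    return score


if __name__ == "__main__":
    a = ET.fromstring("<movie><title>Pulp Fiction</title><year>1994</year></movie>")
    b = ET.fromstring("<movie><title>Pulp  fiction</title><year>1994</year></movie>")
    c = ET.fromstring("<movie><title>Casablanca</title><year>1942</year></movie>")
    print(duplicate_probability(a, b))
    print(duplicate_probability(a, c, threshold=0.9))
```

Because each per-tag score is at most 1, the running product is an upper bound on the final score, which is what makes the early exit safe in this sketch. How the elements are nested decides which values get compared with which, so restructuring the same data can change the result.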