Impulse Technologies
                                      Beacons U to World of technology
        044-42133143, 98401 03301,9841091117 ieeeprojects@yahoo.com www.impulse.net.in
      Efficient and Effective Duplicate Detection in Hierarchical Data
   Abstract
          Although there is a long line of work on identifying duplicates in relational
   data, only a few solutions focus on duplicate detection in more complex
   hierarchical structures, like XML data. In this paper, we present a novel method for
   XML duplicate detection, called XMLDup. XMLDup uses a Bayesian network to
   determine the probability of two XML elements being duplicates, considering not
   only the information within the elements, but also the way that information is
   structured. In addition, to improve the efficiency of the network evaluation, we
   present a novel pruning strategy capable of significant gains over the unoptimized
   version of the algorithm. Through experiments, we show that our algorithm achieves
   high precision and recall scores on several datasets. XMLDup also outperforms
   another state-of-the-art duplicate detection solution, in terms of both efficiency
   and effectiveness. Finally, we study how important the structure of elements is
   in the duplicate detection process. We observe not only that structure can clearly
   influence the outcome, but also that, by ensuring a structure that is adequate to
   the characteristics of the data, we can actually improve the quality of the results.
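
   To make the idea of structure-aware scoring with pruning more concrete, here is a
   minimal Python sketch. It is not the XMLDup Bayesian network itself: the function
   names (text_similarity, duplicate_score), the use of difflib for value comparison,
   the product-of-child-scores combination, and the neutral 0.5 score for missing
   children are all assumptions made for illustration only. What it does show is how
   a score can depend on both leaf values and nesting, and how a running upper bound
   allows evaluation to stop early.

```python
import xml.etree.ElementTree as ET
from difflib import SequenceMatcher


def text_similarity(a, b):
    """String similarity in [0, 1]; a stand-in for any value-level measure."""
    a = (a or "").strip().lower()
    b = (b or "").strip().lower()
    return SequenceMatcher(None, a, b).ratio()


def duplicate_score(e1, e2, threshold=0.0):
    """Illustrative score in [0, 1] that two elements describe the same object.

    Leaf elements are compared by their text. Inner elements multiply one
    score per child tag, so the running product is an upper bound on the
    final score; once it drops below `threshold` the rest is skipped.
    """
    if e1.tag != e2.tag:
        return 0.0
    tags = sorted({c.tag for c in e1} | {c.tag for c in e2})
    if not tags:
        return text_similarity(e1.text, e2.text)

    score = 1.0
    for tag in tags:
        kids1, kids2 = e1.findall(tag), e2.findall(tag)
        if not kids1 or not kids2:
            # Assumption: a child missing on one side is neutral evidence.
            tag_score = 0.5
        else:
            # Match each left child to its best partner on the right, then average.
            best = [max(duplicate_score(a, b) for b in kids2) for a in kids1]
            tag_score = sum(best) / len(best)
        score *= tag_score
        if score < threshold:
            # Pruning: remaining factors are at most 1, so the pair cannot
            # reach the threshold any more.
            return score
    return score


if __name__ == "__main__":
    left = ET.fromstring(
        "<movie><title>Alien</title><year>1979</year></movie>")
    right = ET.fromstring(
        "<movie><title>Aliens</title><year>1979</year></movie>")
    print(duplicate_score(left, right, threshold=0.5))
```

   A pair would be reported as duplicates when the returned score reaches the chosen
   threshold. Because the sketch scores nested elements recursively, moving a field
   to a different level of the hierarchy changes the result, which is the kind of
   structural sensitivity the abstract refers to.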




  Your Own Ideas or Any project from any company can be Implemented
at Better price (All Projects can be done in Java or DotNet whichever the student wants)
