An adaptive algorithm for detection of duplicate records


  1. An Adaptive Algorithm for Detection of Duplicate Records
     Presented by: Rama kanta Behera, IT200127207
     Under the guidance of: Miss Ipsita Mishra
  2. INTRODUCTION
     • A "records set" is a list of prior distinct records. A new record is to be checked for a duplicate against the records set.
     • A database is a collection of related data.
     • Various algorithms exist, such as:
         • Machine learning algorithms
         • Learnable string similarity measures
         • Adaptive algorithms
  3. OBJECTIVES
     • Reduce the cost of duplicate record detection.
     • Achieve perfect scalability of the detection procedure.
     • Cache prior information about distinct records, so that the prior records themselves no longer need to be retained for further searches.
     • Keep the algorithm adaptive.
  4. PREVALENT METHODS
     • The brute force method: its complexity is on the order of the number of records in the records set, and it requires all prior records to be stored.
     • Method by Rail et al.: the comparison of a new record against the records set is reduced from a full text match to a comparison of two integers.
  5. OUTLINE OF THE PROPOSED SOLUTION
     The central idea behind the present algorithm is based on the fundamental property of primality of numbers.
     Fig: Hashing (f(x) maps the record set into the integer number space I)
     Fig: Extended hashing into prime space (f(x) maps the record set to integer numbers I; g(x) maps those integers to prime numbers P)
  6. Fig: The complete algorithm (f(x) maps records r1, r2, …, rn to integers I1, I2, …, In; g(x) maps these to primes P1, P2, …, Pn; the product P1 × P2 × … × Pn = Pprior)
  7. REALIZATION OF THE ALGORITHM
     • Two functions, f(x) and g(x), are to be realized for the implementation of the algorithm.
         • Realizing f(x)
         • Realizing g(x)
  8. STEPS OF THE ALGORITHM
     Step 1: Each new record is hashed, yielding a hash value (Hnew) that is unique for each distinct record.
     Step 2: Hnew is mapped to its corresponding unique prime (Pnew).
     Step 3: Pprior is divided by Pnew. If Pnew divides Pprior exactly, the record corresponding to Pnew is a duplicate and is already represented in Pprior. Otherwise, the record is distinct.
     Step 4: If the record is distinct, Pprior is multiplied by Pnew and the result is stored back in Pprior. Updating Pprior in this way is what makes the algorithm adaptive.
  9. Fig: Flowchart
  10. IMPLEMENTATIONS
      There are three important implementation details that need to be discussed:
      • Size of the records set
      • Use of logarithms
      • Subsets of the records set
  11. CONCLUSION
      • A new approach to handling duplicate records is presented.
      • This approach combines concepts from number theory and algorithmics to solve the frequently encountered problem of "duplicate record detection".
  12. THANK YOU!
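
The four steps on slide 8 can be illustrated with a minimal Python sketch. The slides do not say how f(x) and g(x) are realized, so everything concrete here is an assumption made for illustration: f(x) is taken to be a SHA-256 digest reduced to a small toy integer space (HASH_SPACE), g(x) maps integer h to the h-th prime from a precomputed table, and the names DuplicateDetector and check_and_add are hypothetical. Because the toy hash space is small, two distinct records can land on the same integer and be reported as duplicates. The slides list the use of logarithms as an implementation detail without giving specifics, so the sketch simply relies on Python's arbitrary-precision integers to hold Pprior.

```python
import hashlib

HASH_SPACE = 4096  # toy size of the integer space (assumption, not from the slides)


def first_n_primes(n):
    """Return the first n primes, using a sieve whose limit doubles until it is large enough."""
    limit = 64
    while True:
        sieve = bytearray([1]) * (limit + 1)
        sieve[0:2] = b"\x00\x00"  # 0 and 1 are not prime
        for i in range(2, int(limit ** 0.5) + 1):
            if sieve[i]:
                # Clear every multiple of i starting at i*i.
                sieve[i * i :: i] = bytearray(len(range(i * i, limit + 1, i)))
        primes = [i for i, flag in enumerate(sieve) if flag]
        if len(primes) >= n:
            return primes[:n]
        limit *= 2


PRIMES = first_n_primes(HASH_SPACE)


def f(record):
    """Assumed f(x): hash the record text to an integer in [0, HASH_SPACE)."""
    digest = hashlib.sha256(record.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % HASH_SPACE


def g(h):
    """Assumed g(x): map integer h to the h-th prime, so distinct integers get distinct primes."""
    return PRIMES[h]


class DuplicateDetector:
    """Keeps only the running product Pprior, not the prior records themselves."""

    def __init__(self):
        self.p_prior = 1  # empty product: no records seen yet

    def check_and_add(self, record):
        """Return True if the record is a duplicate; otherwise fold it into Pprior."""
        p_new = g(f(record))             # Steps 1 and 2: hash, then map to a prime
        if self.p_prior % p_new == 0:    # Step 3: exact division means duplicate
            return True
        self.p_prior *= p_new            # Step 4: update Pprior (the adaptive part)
        return False


if __name__ == "__main__":
    detector = DuplicateDetector()
    print(detector.check_and_add("alice,30"))  # first sighting: False
    print(detector.check_and_add("alice,30"))  # same record again: True
```

Note that Pprior grows with every distinct record, so each divisibility test operates on an ever larger integer; a practical realization would have to address this, which is presumably what the implementation details on slide 10 concern.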
