# An adaptive algorithm for detection of duplicate records

## on Aug 20, 2011


• Presented by: Rama kanta Behera (IT200127207), under the guidance of Miss Ipsita Mishra
• INTRODUCTION
• A “records set” is a list of prior distinct records. Each new record is checked against the records set to determine whether it is a duplicate.
• A database is a collection of related data.
• Various algorithms exist for duplicate detection, such as:
• Machine learning algorithms
• Learnable string similarity measures
• OBJECTIVES
• Reduced cost of duplicate record detection.
• Perfect scalability of the detection procedure.
• Cache prior information about the distinct records, so that the prior records themselves no longer need to be retained for later searches.
• PREVALENT METHODS
• The Brute Force Method
• The cost of this method is on the order of the number of records in the records set, since each new record is compared against every prior record, and all prior records must be stored.
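A minimal sketch of the brute-force check described above, assuming records are plain strings. The function names `is_duplicate_brute_force` and `add_record_brute_force` are hypothetical and chosen only for illustration.

```python
def is_duplicate_brute_force(new_record: str, records_set: list[str]) -> bool:
    """Compare the new record against every stored record (full-text match)."""
    for prior in records_set:
        if prior == new_record:
            return True
    return False


def add_record_brute_force(new_record: str, records_set: list[str]) -> bool:
    """Store the record if it is distinct. Returns True when it was added."""
    if is_duplicate_brute_force(new_record, records_set):
        return False
    # Every distinct record must be kept so later checks can see it.
    records_set.append(new_record)
    return True
```

Each check walks the whole list, so the work per new record grows with the size of the records set, and the list itself grows with every distinct record.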
• Method by Rail et al.
• The comparison of a new record against the records set is reduced from a full-text match to a comparison of two integers.
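The slides do not say how the cited method derives its integers, so the following is only a sketch of the general idea of comparing integers instead of full text. Using CRC32 from Python's `zlib` module is an assumption made here for illustration, and two distinct records can collide on the same integer.

```python
import zlib


def record_to_int(record: str) -> int:
    """Reduce a record to one integer (CRC32 is an illustrative assumption)."""
    return zlib.crc32(record.encode("utf-8"))


def is_duplicate_by_int(new_record: str, prior_ints: set[int]) -> bool:
    """Test membership by integer comparison rather than full-text match."""
    return record_to_int(new_record) in prior_ints
```

Only the integers need to be kept for the comparison, not the record text.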
• OUTLINE OF THE PROPOSED SOLUTION The central idea behind the present algorithm is based on the fundamental property of primality of numbers.
• Fig: Hashing. f(x) maps the record set into the integer number space.
• Fig: Extended hashing into prime space. f(x) maps the record set to integers, and g(x) maps those integers to prime numbers.
• Fig: The complete algorithm. Records r1, r2, …, rn are mapped by f(x) to integers I1, I2, …, In, which g(x) maps to primes P1, P2, …, Pn. Their product P1 × P2 × … × Pn is stored as Pprior.
• REALIZATION OF THE ALGORITHM
• Two functions f(x) and g(x) are to be realized for the implementation of the algorithm.
• Realizing f(x)
• Realizing g(x)
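The slides name f(x) and g(x) without giving their realizations, so the sketch below is one possible choice and not the authors' construction. It assumes f hashes a record into a bounded slot index using SHA-256, and g returns the prime at that index in a precomputed table. `NUM_SLOTS` and `first_primes` are hypothetical names, and with a bounded slot count two distinct records can share a slot.

```python
import hashlib

NUM_SLOTS = 1000  # hypothetical bound on the integer space


def f(record: str) -> int:
    """Map a record to an integer in [0, NUM_SLOTS) (assumed realization)."""
    digest = hashlib.sha256(record.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SLOTS


def first_primes(count: int) -> list[int]:
    """Return the first `count` primes by trial division."""
    primes: list[int] = []
    candidate = 2
    while len(primes) < count:
        # A candidate is prime if no smaller prime up to its square root divides it.
        if all(candidate % p for p in primes if p * p <= candidate):
            primes.append(candidate)
        candidate += 1
    return primes


PRIMES = first_primes(NUM_SLOTS)


def g(i: int) -> int:
    """Map an integer slot to its own prime (assumed realization)."""
    return PRIMES[i]
```

Because each slot index has its own prime, equal integers always map to the same prime and different integers always map to different primes.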
• STEPS OF THE ALGORITHM
• Step 1: Each new record is hashed to obtain a hash value (Hnew) that is unique to each distinct record.
• Step 2: Hnew is mapped to its corresponding unique prime (Pnew).
• Step 3: Pprior is divided by Pnew. If Pnew divides Pprior exactly, the record corresponding to Pnew is a duplicate that is already represented in Pprior. Otherwise, the record is distinct.
• Step 4: If the record is distinct, Pprior is multiplied by Pnew and the result is stored back in Pprior. Updating Pprior in this way is what makes the algorithm adaptive.
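The four steps can be sketched as follows. This is an illustration under stated assumptions, not the presenter's implementation: the class name `PrimeProductDetector`, the SHA-256 based hash, and the bounded slot count are all hypothetical. With a bounded slot count the hash is not truly unique, so distinct records can be misreported as duplicates.

```python
import hashlib


class PrimeProductDetector:
    """Duplicate detection through divisibility of a running prime product."""

    def __init__(self, num_slots: int = 1000) -> None:
        self.num_slots = num_slots  # hypothetical bound on hash values
        self.primes = self._first_primes(num_slots)
        self.p_prior = 1  # empty product: no records seen yet

    @staticmethod
    def _first_primes(count: int) -> list[int]:
        primes: list[int] = []
        candidate = 2
        while len(primes) < count:
            if all(candidate % p for p in primes if p * p <= candidate):
                primes.append(candidate)
            candidate += 1
        return primes

    def _hash(self, record: str) -> int:
        digest = hashlib.sha256(record.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.num_slots

    def check_and_add(self, record: str) -> bool:
        """Return True if the record is a duplicate, otherwise record it."""
        h_new = self._hash(record)      # Step 1: hash the record
        p_new = self.primes[h_new]      # Step 2: map the hash to its prime
        if self.p_prior % p_new == 0:   # Step 3: exact division means duplicate
            return True
        self.p_prior *= p_new           # Step 4: fold the new prime into Pprior
        return False
```

For example, calling `check_and_add("alice")` twice on a fresh detector returns `False` and then `True`. Only the single integer `p_prior` is kept, and Python's arbitrary-precision integers let it grow without overflow.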
• Fig: Flowchart
• IMPLEMENTATIONS Three important implementation details need to be discussed:
• Size of Records set
• Use of Logarithms
• Subsets of Records set
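The slides list these points without elaborating, so the sketch below illustrates only one fact that bears on them: the logarithm of a product equals the sum of the logarithms of its factors, so the storage needed for Pprior can be estimated from the primes without forming the product. The helper names are hypothetical, and the slides may have intended a different use of logarithms.

```python
import math


def first_primes(count: int) -> list[int]:
    """Return the first `count` primes by trial division."""
    primes: list[int] = []
    candidate = 2
    while len(primes) < count:
        if all(candidate % p for p in primes if p * p <= candidate):
            primes.append(candidate)
        candidate += 1
    return primes


def product_bits(primes: list[int]) -> int:
    """Exact number of bits needed to store the product of the primes."""
    product = 1
    for p in primes:
        product *= p
    return product.bit_length()


def estimated_bits(primes: list[int]) -> float:
    """Estimate the same size as a sum of base-2 logarithms."""
    return sum(math.log2(p) for p in primes)
```

The estimate stays within one bit of the exact size, which gives a way to see how quickly Pprior grows as the records set grows. It serves only as a size estimate: rounded floating-point logarithms cannot replace the exact divisibility test.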
• CONCLUSION
• A new approach to handling duplicate records is presented.
• This approach combines concepts from number theory and algorithmics to solve the frequently encountered problem of “duplicate record detection”.
• THANK YOU !!!