Be the first to like this
Similarity join is most important technique to
involve many applications such as data integration, record
linkage and pattern recognition. Here we introduce new
algorithm for similarity join with edit distance constraints.
Currently extracting overlapping grams from string and consider
only string that share certain gram as candidate. Now we propose
extracting non-overlapping substring or chunk from string.
Chunk scheme based on tail-restricted chunk boundary
dictionary (CBD). This approach integrated existing approach
for calculating similarity with several new filters unique to chunk
based method. Greedy algorithm automatically select good
chunking scheme from given data set. Then show the result our
method occupies less space and faster performance to compute