IJCER (www.ijceronline.com) International Journal of computational Engineering research



International Journal Of Computational Engineering Research (ijceronline.com) Vol. 2 Issue. 5

A Dynamic Filtering Algorithm to Search Approximate String

A. Dasaradha¹, P. K. Sahu²
¹Dept. of CSE, Final M.Tech Student, AITAM Engineering College, INDIA
²Dept. of CSE, Associate Professor, AITAM Engineering College, INDIA

Abstract:
Recently, string data management in databases has gained a lot of interest in various applications such as data cleaning, query relaxation and spell checking. In this paper we present a solution for finding the strings in a given collection that are similar to a query string. The proposed solution has two phases. In the first phase we present three algorithms, ScanCount, MergeSkip, and DivideSkip, for answering approximate string search queries. In the second phase, we study how to integrate various filtering techniques with the proposed merging algorithms. Several experiments have been conducted on various data sets to measure the performance of the proposed techniques.

Introduction:
Recently, string data management in databases has gained a lot of interest in text mining. In this paper we study the problem of finding the strings similar to a query string ("approximate string search") in a given collection of strings. This problem arises in various applications such as data cleaning, query relaxation, and spell checking.

Spell Checking: Given an input document, a spell checker has to find all possibly mistyped words by searching for similar words in the dictionary. For words that are not in the dictionary, we have to find matching candidates to recommend.

Data Cleaning: A data collection has various inconsistencies which have to be resolved before the data can be used for accurate analysis. The process of detecting and correcting such inconsistencies is known as data cleaning. A common form of inconsistency arises when a real-world entity has more than one representation in the data collection; for example, the same address could be encoded using different strings in different records. Multiple representations arise for a variety of reasons, such as misspellings caused by typographic errors and the different formatting conventions used by data sources.

These applications require high real-time performance for each query, so it is necessary to design algorithms that answer such queries as efficiently as possible. Many techniques have been designed, such as [1], [2], [3], [4], [5], [6], [7]. These methods assume a given similarity function to quantify the closeness between two strings. Various string-similarity functions have been studied, such as edit distance, cosine similarity and the Jaccard coefficient. All these methods use the notion of a gram, a substring of a string that serves as a signature of the string. The algorithms rely on inverted lists of grams to find candidate strings, and exploit the fact that similar strings must share enough common grams. Many algorithms [9], [2], [15] have mainly focused on "join queries", i.e., finding similar pairs from two collections of strings. Approximate string search could be treated as a special case of join queries, but it is well understood that the behavior of an algorithm for answering selection queries can be very different from that for answering join queries. We believe approximate string search is important enough to deserve a separate investigation.

In this paper the proposed solution has two phases. In the first phase, we propose three efficient algorithms for answering approximate string search queries, called ScanCount, MergeSkip, and DivideSkip. The ScanCount algorithm adopts the simple idea of scanning the inverted lists and counting candidate strings. Although it is very naive, when combined with various filtering techniques this algorithm can still achieve high performance. The MergeSkip algorithm exploits the value differences among the inverted lists and the threshold on the number of common grams of similar strings to skip many irrelevant candidates on the lists. The DivideSkip algorithm combines the MergeSkip algorithm with the idea of the MergeOpt algorithm proposed in [9], which divides the lists into two groups: one group for the long lists, and the other for the remaining lists. We run the MergeSkip algorithm to merge the short lists with a different threshold, and use the long lists to verify the candidates. Our experiments on three real data sets showed that the proposed algorithms can significantly improve the performance of existing algorithms.

Issn 2250-3005(online) September| 2012 Page 1215
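The gram-based indexing shared by all of these methods can be made concrete with a short sketch. This is a minimal illustration in Python (the paper's own experiments use C++), and the function names are ours, not from the paper:

```python
from collections import defaultdict

def positional_grams(s, q):
    """All q-grams of s together with their starting positions."""
    return [(i, s[i:i + q]) for i in range(len(s) - q + 1)]

def build_inverted_index(strings, q):
    """Map each distinct q-gram to the sorted list of ids of the strings
    that contain it; similar strings share many of these lists."""
    index = defaultdict(list)
    for sid, s in enumerate(strings):
        for g in {g for _, g in positional_grams(s, q)}:
            index[g].append(sid)
    return index
```

A query then probes the lists of its own grams and merges them, reporting the string ids that occur at least T times, where T is the threshold on the number of common grams.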
In the second phase we study how to integrate various filtering techniques with the proposed merging algorithms. Various filters have been proposed to eliminate strings that cannot be similar enough to a given string. Surprisingly, our experiments and analysis show that a naive solution of adopting all available filtering techniques might not achieve the best performance when merging inverted lists. Intuitively, filters can segment inverted lists into relatively shorter lists, but merging algorithms still need to merge them. In addition, the more filters we apply, the more groups of inverted lists we need to merge, and the more overhead we spend processing these groups before merging their lists. Thus filters and merging algorithms need to be integrated judiciously by considering this tradeoff. Based on this analysis, we classify filters into two categories: single-signature filters and multi-signature filters. We propose a strategy to selectively choose proper filters to build an index structure and integrate them with the merging algorithms. Experiments show that our strategy reduces the running time by as much as one to two orders of magnitude over approaches without filtering techniques or strategies that naively use all the filtering techniques.

The remainder of this paper is organized as follows. Section 2 discusses related work, Section 3 describes the proposed solution, Section 4 explains the experimental setup, and Section 5 concludes the paper.

Related Work:
Several existing algorithms assume an index of inverted lists for the grams of the strings in the collection S to answer approximate string queries on S. In the index, for each gram g of the strings in S, we have a list lg of the ids of the strings that include this gram, possibly with the corresponding positional information of the gram in the strings [12], [13], [14].

Heap algorithm [11]: When merging the lists, maintain the frontiers of the lists as a heap. In each step, pop the top element from the heap and increment the count of the record id corresponding to the popped frontier record. Remove this record id from its list, and reinsert the next record id on the list into the heap. Report a record id whenever its count value reaches the threshold T. The time complexity of this algorithm is O(M log N) and its space complexity is O(N), where M is the total number of elements on the lists and N is the number of lists.

MergeOpt algorithm [10]: It treats the T − 1 longest inverted lists of G(Q, q) separately. The remaining N − (T − 1) relatively short inverted lists are merged with the Heap algorithm using a lower threshold of 1. For each candidate string, binary search is applied on each of the T − 1 long lists to verify whether the string appears at least T times among all the lists. This algorithm is based on the observation that a record in the answer must appear on at least one of the short lists, and it is more efficient than the Heap algorithm.

Proposed Solution:
Here we present our three proposed merging algorithms.

ScanCount:
In this algorithm [8], we maintain an array S of counts for all the string ids and scan the inverted lists. For each string id on each list, we increment the count corresponding to that string by 1, and report the string ids that appear at least T times in the lists. The time complexity of the algorithm is O(M), compared to O(M log N) for the Heap algorithm, and the space complexity is O(|S|), where |S| is the size of the string collection. ScanCount improves on the Heap algorithm by eliminating the heap data structure and the corresponding operations on the heap. The algorithm is formally described in Figure 1.

Figure 1: Flowchart for ScanCount Algorithm

MergeSkip:
The main principle of this algorithm is to skip, on the lists, those record ids that cannot be in the answer to the query, by utilizing the threshold T. Similar to the Heap algorithm, we maintain a heap for the frontiers of these lists. The key difference is that in each iteration we pop from the heap those records that have the same value as the top record t on the heap, as described in Figure 2.
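The ScanCount idea above can be sketched in a few lines. This is a minimal Python version of the counting scheme (the paper's implementation is in C++; the function name and the early-report check are ours):

```python
def scan_count(lists, num_strings, T):
    """ScanCount: count occurrences of each string id across the inverted
    lists and report the ids reaching threshold T.
    Runs in O(M) time over M total list elements, with O(|S|) extra space."""
    counts = [0] * num_strings
    result = []
    for lst in lists:
        for sid in lst:
            counts[sid] += 1
            if counts[sid] == T:   # report each qualifying id exactly once
                result.append(sid)
    return result
```

Note that the counts array is proportional to the collection size, not to the query, which is why the paper pairs ScanCount with filters that shrink the lists to be scanned.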
Figure 2: Flowchart for MergeSkip Algorithm

DivideSkip:
The key idea of the DivideSkip algorithm is to combine the MergeSkip and MergeOpt algorithms. Both algorithms try to skip irrelevant records on the lists, but using different intuitions: MergeSkip exploits the value differences among the records on the lists, while MergeOpt exploits the size differences among the lists. DivideSkip uses both differences to improve the search performance.

Figure 3: Flowchart for DivideSkip Algorithm

Experimental Setup:
The performance of five merging algorithms, Heap, MergeOpt, ScanCount, MergeSkip, and DivideSkip, has been evaluated using the DBLP dataset.

DBLP dataset: It includes paper titles downloaded from the DBLP Bibliography site. The raw data was in XML format, and we extracted 274,788 paper titles with a total size of 17.8 MB. The average size of the gram inverted lists for a query was about 67, and the total number of distinct grams was 59,940. The gram length q was 4 for the data sets. All the algorithms were implemented in GNU C++ and run on a system with 2 GB main memory, a 2.13 GHz Dual Core CPU and the Ubuntu operating system.

Figure 4: Average query time versus data set size.

Figure 5: Number of string ids visited by the algorithms.

Classification Of Filters:
A filter generates a set of signatures for a string, such that similar strings share similar signatures, and these signatures can easily be used to build an index structure. Filters are classified into two categories: single-signature filters generate a single signature (typically an integer or a hash code) for a string, while multi-signature filters generate multiple signatures for a string.
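The MergeSkip and DivideSkip algorithms described above can be sketched as follows. This is our own illustrative Python version, assuming sorted inverted lists of integer ids; in MergeSkip, when fewer than T frontiers share the smallest value, the T − 1 smallest frontiers are popped and jumped (by binary search) to the new heap top, since no smaller id can occur on T lists. DivideSkip then runs MergeSkip on the short lists with a lowered threshold and verifies candidates by binary search:

```python
import bisect
import heapq

def merge_skip(lists, T):
    """MergeSkip: merge sorted inverted lists and report the ids that occur
    on at least T lists, skipping ids that cannot reach the threshold."""
    pos = [0] * len(lists)                      # frontier index of each list
    heap = [(lst[0], i) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    result = []
    while heap:
        t = heap[0][0]
        popped = []                             # lists whose frontier == t
        while heap and heap[0][0] == t:
            popped.append(heapq.heappop(heap)[1])
        if len(popped) >= T:
            result.append(t)
            for i in popped:                    # advance these lists past t
                pos[i] += 1
                if pos[i] < len(lists[i]):
                    heapq.heappush(heap, (lists[i][pos[i]], i))
        else:
            # Pop T - 1 frontiers in total; any answer must be >= the new
            # heap top, so jump each popped list ahead by binary search.
            for _ in range(T - 1 - len(popped)):
                if heap:
                    popped.append(heapq.heappop(heap)[1])
            if not heap:                        # fewer than T lists remain
                break
            target = heap[0][0]
            for i in popped:
                pos[i] = bisect.bisect_left(lists[i], target, pos[i])
                if pos[i] < len(lists[i]):
                    heapq.heappush(heap, (lists[i][pos[i]], i))
    return result

def _contains(lst, x):
    """Binary-search membership test on a sorted list."""
    j = bisect.bisect_left(lst, x)
    return j < len(lst) and lst[j] == x

def divide_skip(lists, T, num_long):
    """DivideSkip: set aside the num_long longest lists, merge the short
    lists with MergeSkip at threshold T - num_long (an answer can occur on
    at most num_long long lists), then verify each candidate's total count."""
    ordered = sorted(lists, key=len)
    short = ordered[:len(ordered) - num_long]
    candidates = merge_skip(short, max(T - num_long, 1))
    return [c for c in candidates
            if sum(_contains(l, c) for l in ordered) >= T]
```

The choice of how many lists to treat as "long" is a tuning knob; the paper's point is that combining the value-based skipping of MergeSkip with the size-based split of MergeOpt prunes more than either alone.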
Length Filtering: If two strings s1 and s2 are within edit distance k, the difference between their lengths cannot exceed k. Thus, given a query string s1, we only need to consider strings s2 in the data collection such that the difference between |s1| and |s2| is not greater than k. This is a single-signature filter, as it generates a single signature for a string.

Position Filtering: If two strings s1 and s2 are within edit distance k, then a q-gram in s1 cannot correspond to a q-gram in the other string that differs by more than k positions. Thus, given a positional gram (i1, g1) in the query string, we only need to consider corresponding grams (i2, g2) in the data set such that |i1 − i2| ≤ k. This is a multi-signature filter because it produces a set of positional grams as signatures for a string.

Prefix Filtering [10]: Given two q-gram sets G(s1) and G(s2) for strings s1 and s2, we can fix an ordering O of the universe from which all set elements are drawn. Let p(n, s) denote the n-th prefix element of G(s) under the ordering O. For simplicity, p(1, s) is abbreviated as ps. An important property is that if |G(s1) ∩ G(s2)| ≥ T, then ps2 ≤ p(n, s1), where n = |s1| − T + 1.

Applying Filters Before Merging Lists:
The existing filters can be combined to improve the search performance of the merging algorithms. One way to combine them is to build a tree structure called a FilterTree, in which each level corresponds to a filter, as described in Figure 6.

Figure 6: A FilterTree

It is very important to decide which filters should be used on which levels. To improve the performance, we use single-signature filters at level 1 (close to the root), such as the length filter and the prefix filter, because each string in the data set is then inserted into a single path instead of appearing in multiple paths. During a search, for these filters we only need to traverse those paths on which the candidate strings can appear. From level 2 onward, we can add the multi-signature filters, such as the gram filter and the position filter. Figures 7 and 8 show the improved performance of the algorithms using these filters on the DBLP dataset.

Figure 7: DBLP data set for Merge

Figure 8: DBLP data set for Total
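The filtering conditions above are simple predicates, sketched here in Python. The function names are ours; the count bound in the last function is the standard q-gram counting argument of [6], [7] for the threshold T on common grams:

```python
def length_filter_ok(s1, s2, k):
    """Length filter: strings within edit distance k differ in
    length by at most k."""
    return abs(len(s1) - len(s2)) <= k

def position_filter_ok(i1, i2, k):
    """Position filter: matching q-grams of strings within edit
    distance k start at positions at most k apart."""
    return abs(i1 - i2) <= k

def count_threshold(length, q, k):
    """Lower bound on the number of q-grams that a string of the given
    length must share with any string within edit distance k: each edit
    destroys at most q grams. May be <= 0, in which case gram counting
    cannot prune anything."""
    return (length - q + 1) - k * q
```

These checks cost O(1) per candidate, which is why applying them before list merging, as in the FilterTree, pays off.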
Conclusion:
In this paper we have proposed a solution for efficiently finding the strings in a collection that are similar to a given string. We designed the solution in two phases. In the first phase we presented three algorithms, ScanCount, MergeSkip and DivideSkip, for answering approximate string search queries. In the second phase, we studied how to integrate various filtering techniques with the proposed merging algorithms. Several experiments have been conducted on various data sets to measure the performance of the proposed techniques.
References:
[1] A. Arasu, V. Ganti, and R. Kaushik, "Efficient Exact Set-Similarity Joins," in VLDB, 2006, pp. 918–929.
[2] R. Bayardo, Y. Ma, and R. Srikant, "Scaling up all-pairs similarity search," in WWW Conference, 2007.
[3] S. Chaudhuri, K. Ganjam, V. Ganti, and R. Motwani, "Robust and Efficient Fuzzy Match for Online Data Cleaning," in SIGMOD, 2003, pp. 313–324.
[4] L. Gravano, P. G. Ipeirotis, H. V. Jagadish, N. Koudas, S. Muthukrishnan, and D. Srivastava, "Approximate string joins in a database (almost) for free," in VLDB, 2001, pp. 491–500.
[5] C. Li, B. Wang, and X. Yang, "VGRAM: Improving performance of approximate queries on string collections using variable-length grams," in VLDB, 2007.
[6] E. Sutinen and J. Tarhio, "On Using q-Gram Locations in Approximate String Matching," in ESA, 1995, pp. 327–340.
[7] E. Ukkonen, "Approximate String Matching with q-Grams and Maximal Matches," Theor. Comput. Sci., vol. 92, pp. 191–211, 1992.
[8] V. Levenshtein, "Binary Codes Capable of Correcting Spurious Insertions and Deletions of Ones," Probl. Inf. Transmission, vol. 1, pp. 8–17, 1965.
[9] S. Sarawagi and A. Kirpal, "Efficient set joins on similarity predicates," in ACM SIGMOD, 2004.
[10] S. Chaudhuri, V. Ganti, and R. Kaushik, "A primitive operator for similarity joins in data cleaning," in ICDE, 2006, pp. 5–16.
[11] G. Navarro, "A guided tour to approximate string matching," ACM Computing Surveys, vol. 33, no. 1, pp. 31–88, 2001.
[12] K. Ramasamy, J. M. Patel, R. Kaushik, and J. F. Naughton, "Set containment joins: The good, the bad and the ugly," in VLDB, 2000.
[13] N. Koudas, S. Sarawagi, and D. Srivastava, "Record linkage: similarity measures and algorithms," in SIGMOD Tutorial, 2005, pp. 802–803.
[14] M.-S. Kim, K.-Y. Whang, J.-G. Lee, and M.-J. Lee, "n-Gram/2L: A space and time efficient two-level n-gram inverted index structure," in VLDB, 2005, pp. 325–336.
[15] C. Li, J. Lu, and Y. Lu, "Efficient Merging and Filtering Algorithms for Approximate String Searches," in ICDE, 2008.