IJCA Special Issue on "Computational Science - New Dimensions & Perspectives", NCCSE, 2011

Duplicate Detection of Query Results from Multiple Web Databases

Hemalatha S, Department of Computer Science & Engineering, Adhiyamaan College of Engineering, Hosur - 635109, India
Raja K, Department of Computer Science & Engineering, Adhiyamaan College of Engineering, Hosur - 635109, India
Tholkappia Arasu, Department of CSE, Jayam College of Engineering and Technology, Dharmapuri - 636819, India

ABSTRACT
The results from multiple databases compose the deep or hidden Web, which is estimated to contain a much larger amount of high-quality, usually structured information and to have a faster growth rate than the static Web. For a system that helps users integrate and, more importantly, compare the query results returned from multiple Web databases, an important task is to match the different sources' records that refer to the same real-world entity. Most state-of-the-art record matching methods are supervised, which requires the user to provide training data. In Web databases, the records are not available beforehand because they are query-dependent; they can be obtained only after the user submits a query. After removal of the same-source duplicates, the records from the same source, which are assumed to be non-duplicates, can be used as training examples. The method uses two classifiers, the weighted component similarity summing (WCSS) classifier and a Support Vector Machine (SVM) classifier that works along with a Gaussian mixture model (GMM), to identify the duplicates iteratively. The classifiers work cooperatively to identify the duplicate records. The complete GMM is parameterized by the mean vectors, covariance matrices, and mixture weights estimated from all the records.

General Terms
Data Mining, Record Matching, Merge/Purge Operation, Learning

Keywords
Gaussian Mixture Model, Supervised Learning, Support Vector Machine

1. INTRODUCTION
Web databases dynamically generate Web pages in response to user queries. Most Web databases are accessible only via a query interface through which users can submit queries. With many businesses, government organisations, and research projects collecting massive amounts of data, the techniques collectively known as data mining have in recent years attracted interest from both academia and industry. While there is much ongoing research in data mining algorithms and techniques, it is well known that a large proportion of the time and effort in real-world data mining projects is spent understanding the data to be analysed, as well as on the data preparation and preprocessing steps. An increasingly important task in the preprocessing step of many data mining projects is detecting and removing duplicate records that relate to the same entity within one data set. Similarly, linking or matching records relating to the same entity across several data sets is often required, because information from multiple sources needs to be integrated, combined, or linked in order to allow more detailed data analysis or mining. The approach presented here takes advantage of the dissimilarity among records from the same Web database for record matching. Most existing work requires human-labeled training data (positive, negative, or both), which places a heavy burden on users. Most previous work is based on predefined matching rules hand-coded by domain experts or on matching rules learned offline by some learning method from a set of training examples.
Such approaches work well in a traditional database environment, where all instances of the target databases can be readily accessed, as long as a set of high-quality representative records can be examined by experts or selected for the user to label. In the Web database scenario, the records to match are highly query-dependent, since they can only be obtained through online queries. Moreover, they are only a partial and biased portion of all the data in the source Web databases. To overcome such problems, we propose a new method for the record matching problem of identifying duplicates among records in query results from multiple Web databases. The proposed approach, Unsupervised Duplicate Detection [1], addresses this specific problem by employing two classifiers that collaborate in an iterative manner. This paper proposes a method in which the SVM classifier is fused with the Gaussian mixture model to identify the duplicates iteratively. The iteration stops when there are no more duplicates in the result set. Blocking methods [2] are used in record linkage systems to reduce the number of candidate record comparison pairs to a feasible number whilst still maintaining linkage accuracy. Blocking methods partition the data sets into blocks or clusters of records which share a blocking attribute or are otherwise similar with respect to a defined criterion.

2. RELATED WORK
The records consist of multiple fields, making the duplicate detection problem much more complicated. Approaches that rely on training data to learn how to match records include probabilistic approaches and supervised machine learning techniques. Other approaches rely on domain knowledge or on generic distance metrics to match records.

2.1 Probabilistic Matching Models
Assume two classes: M, which contains the record pairs that represent the same entity ("match"), and U, which contains the record pairs that represent two different entities ("nonmatch") [2], [3]. Let x be a comparison vector, randomly drawn from the comparison space, that corresponds to the record pair (a, b). The goal is to determine whether (a, b) ∈ M or (a, b) ∈ U. A decision rule based simply on probabilities can be written as follows:

(a, b) ∈ M if p(M|x) ≥ p(U|x); otherwise (a, b) ∈ U.    (1)
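For illustration, a minimal sketch of how decision rule (1) might be applied is shown below. The likelihood functions, the prior p(M), and the toy comparison vectors are assumptions made for the example; only the comparison p(M|x) ≥ p(U|x) comes from the text.

```python
# Minimal sketch of decision rule (1). The likelihoods and the prior are
# hypothetical stand-ins; only the p(M|x) >= p(U|x) comparison is from the text.

def classify_pair(x, p_x_given_M, p_x_given_U, prior_M=0.1):
    """Declare (a, b) a match when p(M|x) >= p(U|x).

    Because p(M|x) is proportional to p(x|M)p(M) and p(U|x) to p(x|U)p(U),
    with the same evidence p(x), comparing the joint scores is sufficient."""
    score_M = p_x_given_M(x) * prior_M
    score_U = p_x_given_U(x) * (1.0 - prior_M)
    return "match" if score_M >= score_U else "nonmatch"

# Toy likelihoods defined on the mean of the comparison vector x.
mean = lambda x: sum(x) / len(x)
p_match = lambda x: mean(x) ** 2             # matches: high field similarities
p_nonmatch = lambda x: (1.0 - mean(x)) ** 2  # nonmatches: low field similarities

print(classify_pair([0.9, 0.8, 1.0], p_match, p_nonmatch))  # -> match
print(classify_pair([0.1, 0.2, 0.0], p_match, p_nonmatch))  # -> nonmatch
```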
2.2 Bigram Indexing
The Bigram Indexing (BI) method, as implemented in the Febrl [4] record linkage system, allows for fuzzy blocking. The basic idea is that the blocking key values are converted into a list of bigrams (sub-strings containing two characters) and sub-lists of all possible permutations are built using a threshold (between 0.0 and 1.0). The resulting bigram lists are sorted and inserted into an inverted index, which is used to retrieve the corresponding record numbers in a block. The number of sub-lists created for a blocking key value depends both on the length of the value and on the threshold: the lower the threshold, the shorter the sub-lists, but also the more sub-lists there are per blocking key value, resulting in more (smaller) blocks in the inverted index. In the information retrieval field, bigram indexing has been found to be robust to small typographical errors in documents.
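To make the sub-list construction concrete, here is a simplified sketch. The sub-list length rule, the threshold value, and the sample keys are assumptions modelled loosely on the description above; this is not the exact Febrl implementation.

```python
# Simplified sketch of bigram-based fuzzy blocking in the spirit of Section 2.2.
# Sub-lists of bigrams become inverted-index keys; illustrative only.
from collections import defaultdict
from itertools import combinations

def bigrams(value):
    return [value[i:i + 2] for i in range(len(value) - 1)]

def insert_into_index(index, record_id, key_value, threshold=0.5):
    grams = sorted(bigrams(key_value))
    sub_len = max(1, int(round(len(grams) * threshold)))
    # Every sub-list of the thresholded length becomes an index key, so records
    # with small typographical differences can still land in a common block.
    for sub in combinations(grams, sub_len):
        index[sub].append(record_id)

index = defaultdict(list)
insert_into_index(index, "r1", "smith")
insert_into_index(index, "r2", "smyth")
blocks = [ids for ids in index.values() if len(ids) > 1]
print(blocks)  # e.g. [['r1', 'r2']] -- both spellings share the ('sm', 'th') sub-list
```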
2.3 Supervised and Semisupervised Learning
Supervised learning systems rely on the existence of training data in the form of record pairs, prelabeled as matching or not. One set of supervised learning techniques treats each record pair (a, b) independently, similar to the probabilistic techniques. A well-known example is the CART algorithm [5], which generates classification and regression trees. Other examples include a linear discriminant algorithm, which generates a linear combination of the parameters for separating the data according to their classes, and a "vector quantization" approach, which is a generalization of nearest neighbour algorithms. The transitivity assumption can sometimes result in inconsistent decisions. For example, (a, b) and (a, c) can be considered matches, but (b, c) not. Partitioning such "inconsistent" graphs with the goal of minimizing inconsistencies is an NP-complete problem.

2.4 Active Learning Based Techniques
One of the problems with the supervised learning techniques is the requirement for a large number of training examples. While it is easy to create a large number of training pairs that are either clearly nonduplicates or clearly duplicates, it is very difficult to generate ambiguous cases that would help create a highly accurate classifier. Based on this observation, some duplicate detection systems use active learning methods to automatically locate such ambiguous pairs. These methods suggest that, by creating multiple classifiers trained using slightly different data or parameters, it is possible to detect ambiguous cases and then ask the user for feedback. The key innovation in this work is the creation of several redundant functions and the concurrent exploitation of their conflicting actions in order to discover new kinds of inconsistencies among duplicates in the data set.

2.5 Distance Based Techniques
Even active learning techniques require some training data or some human effort to create the matching models. In the absence of such training data or the ability to get human input, supervised and active learning techniques are not appropriate. One way of avoiding the need for training data is to define a distance metric for records which does not need tuning through training data. Distance-based approaches [6] that conflate each record into one big field may ignore important information that can be used for duplicate detection. A simple approach is to measure the distance between individual fields, using the appropriate distance metric for each field, and then compute the weighted distance between the records. In this case, the problem is the computation of the weights, and the overall setting becomes very similar to the probabilistic setting. One of the problems of the distance-based techniques is the need to define the appropriate value for the matching threshold. In the presence of training data, it is possible to find the appropriate threshold value. However, this would nullify the major advantage of distance-based techniques, which is the ability to operate without training data. Recently, Chaudhuri et al. [9] proposed a new framework for distance-based duplicate detection, observing that the distance thresholds for detecting real duplicate entries are different for each database tuple. To detect the appropriate threshold, Chaudhuri et al. observed that entries that correspond to the same real-world object but have different representations in the database tend 1) to have small distances from each other (the compact set property) and 2) to have only a small number of other neighbors within a small distance (the sparse neighborhood property). Furthermore, Chaudhuri et al. propose an efficient algorithm for computing the required threshold for each object in the database and show that the quality of the results outperforms approaches that rely on a single, global threshold.

2.6 Rule Based Approaches
A special case of distance-based approaches [6] and [7] is the use of rules to define whether two records are the same or not. Rule-based approaches can be considered as distance-based techniques where the distance of two records is either 0 or 1. It is noteworthy that such rule-based approaches, which require a human expert to devise meticulously crafted matching rules, typically result in systems with high accuracy. However, the required tuning demands extremely high manual effort from the human experts, and this effort makes the deployment of such systems difficult in practice. Currently, the typical approach is to use a system that generates matching rules from training data and then to manually tune the automatically generated rules. The mapping transformation standardizes data, the matching transformation finds pairs of records that probably refer to the same real object, the clustering transformation groups together matching pairs with a high similarity value, and, finally, the merging transformation collapses each individual cluster into a tuple of the resulting data source.
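The following sketch illustrates the distance-based matching of Section 2.5 together with a simple rule-style threshold in the spirit of Section 2.6. The field weights, the threshold value, and the sample records are assumptions made purely for illustration.

```python
# Illustrative sketch: weighted per-field distances (Section 2.5) plus a 0/1
# threshold rule (Section 2.6). Field names, weights and threshold are assumed.

def edit_distance(a, b):
    """Plain dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def field_distance(a, b):
    """Edit distance normalised to [0, 1]."""
    if not a and not b:
        return 0.0
    return edit_distance(a, b) / max(len(a), len(b))

def record_distance(r1, r2, weights):
    return sum(w * field_distance(r1[f], r2[f]) for f, w in weights.items())

weights = {"title": 0.6, "authors": 0.3, "year": 0.1}
r1 = {"title": "Robust Fuzzy Match", "authors": "Chaudhuri et al.", "year": "2003"}
r2 = {"title": "Robust Fuzzy Matc",  "authors": "Chaudhuri et al",  "year": "2003"}

THRESHOLD = 0.25  # rule: distances below the threshold are declared duplicates
print("duplicate" if record_distance(r1, r2, weights) < THRESHOLD else "distinct")
```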
2.7 Unsupervised Learning
One way to avoid manual labeling of the comparison vectors is to use clustering algorithms and group together similar comparison vectors. The idea behind most unsupervised learning approaches [4] and [7] for duplicate detection is that similar comparison vectors correspond to the same class. The idea of unsupervised learning for duplicate detection has its roots in the probabilistic model proposed by Fellegi and Sunter [8]. When there is no training data to compute the probability estimates, it is possible to use variations of the Expectation Maximization algorithm to identify appropriate clusters in the data. Verykios et al. [10] propose the use of a bootstrapping technique based on clustering to learn matching models. The basic idea, also known as cotraining [11], is to use very few labeled data and then use unsupervised learning techniques to appropriately label the data with unknown labels. Initially, Verykios et al. treat each entry of the comparison vector (which corresponds to the result of a field comparison) as a continuous, real variable. The basic premise is that each cluster contains comparison vectors with similar characteristics. Therefore, all the record pairs in a cluster belong to the same class (matches, nonmatches, or possible matches).
There are multiple techniques for duplicate record detection. We can divide the techniques into two broad categories: ad hoc techniques that work quickly on existing relational databases, and more "principled" techniques that are based on probabilistic inference models. While probabilistic methods outperform ad hoc techniques in terms of accuracy, the ad hoc techniques work much faster and can scale to databases with hundreds of thousands of records. Probabilistic inference techniques are practical today only for data sets that are one or two orders of magnitude smaller than the data sets handled by ad hoc techniques. A promising direction for future research is to devise techniques that can substantially improve the efficiency of approaches that rely on machine learning and probabilistic inference. A question that is unlikely to be resolved soon is which of the presented methods should be used for a given duplicate detection task. Unfortunately, there is no clear answer to this question. The duplicate record detection task is highly data-dependent, and it is unclear whether we will ever see a technique dominating all others across all data sets. The problem of choosing the best method for duplicate detection is very similar to the problem of model selection and performance prediction for data mining.

3. IMPROVING THE EFFICIENCY OF DUPLICATE DETECTION
So far, in our discussion of methods for detecting whether two records refer to the same real-world object, we have focused mainly on the quality of the comparison techniques and not on the efficiency of the duplicate detection process. Now, we turn to the central issue of improving the speed of duplicate detection. An elementary technique for discovering matching entries in tables A and B is to execute a "nested-loop" comparison, i.e., to compare every record of table A with every record in table B. Unfortunately, such a strategy requires a total of |A| · |B| comparisons, a cost that is prohibitively expensive even for moderately sized tables. We describe techniques that substantially reduce the number of required comparisons. Another factor that can lead to increased computation expense is the cost of a single comparison. It is not uncommon for a record to contain tens of fields. Therefore, each record comparison requires multiple field comparisons, and each field comparison can be expensive.

3.1 Blocking
One "traditional" method for identifying identical records in a database table is to scan the table and compute the value of a hash function for each record. The value of the hash function defines the "bucket" to which the record is assigned. By definition, two records that are identical will be assigned to the same bucket. Therefore, in order to locate duplicates, it is enough to compare for matches only the records that fall into the same bucket. The hashing technique cannot be used directly for approximate duplicates, since there is no guarantee that the hash values of two similar records will be the same. However, there is an interesting counterpart of this method, named blocking. As discussed above in relation to the hash function, blocking typically refers to the procedure of subdividing files into a set of mutually exclusive subsets (blocks) under the assumption that no matches occur across different blocks. A common approach to achieving these blocks is to use an encoding function such as Soundex, NYSIIS, or Metaphone.
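A small sketch of hash-based blocking follows. The blocking key used here (first letter plus a consonant skeleton) is a crude stand-in for an encoding such as Soundex, NYSIIS, or Metaphone; it only illustrates how a key function partitions records into buckets and restricts comparisons to within-bucket pairs.

```python
# Sketch of blocking: only records whose blocking keys collide are compared.
# The key function is a consonant-skeleton stand-in for Soundex, for illustration.
from collections import defaultdict
from itertools import combinations

def blocking_key(surname):
    s = surname.strip().lower()
    skeleton = "".join(c for c in s[1:] if c not in "aeiouhwy")
    return (s[:1] + skeleton)[:4]            # e.g. 'smith' and 'smyth' -> 'smt'

def candidate_pairs(records, key_field="surname"):
    buckets = defaultdict(list)
    for rec_id, rec in records.items():
        buckets[blocking_key(rec[key_field])].append(rec_id)
    for bucket in buckets.values():           # only within-bucket comparisons
        yield from combinations(sorted(bucket), 2)

records = {
    1: {"surname": "Smith"},
    2: {"surname": "Smyth"},
    3: {"surname": "Jones"},
}
print(list(candidate_pairs(records)))         # [(1, 2)] instead of all three pairs
```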
3.2 Sorted Neighborhood Approach
The method consists of the following three steps:
Create key: A key for each record in the list is computed by extracting relevant fields or portions of fields.
Sort data: The records in the database are sorted by using the key found in the first step. A sorting key is defined to be a sequence of attributes, or a sequence of substrings within the attributes, chosen from the record in an ad hoc manner. Attributes that appear first in the key have a higher priority than those that appear subsequently.
Merge: A fixed-size window is moved through the sequential list of records in order to limit the comparisons for matching records to those records in the window.
If the size of the window is w records, then every new record that enters the window is compared with the previous w - 1 records to find "matching" records, and the first record in the window slides out of it. The sorted neighborhood approach relies on the assumption that duplicate records will be close in the sorted list and will therefore be compared during the merge step. The effectiveness of the sorted neighborhood approach is highly dependent on the comparison key that is selected to sort the records. In general, no single key will be sufficient to sort the records in such a way that all the matching records can be detected. If the error in a record occurs in the particular field or portion of the field that is the most important part of the sorting key, there is a very small possibility that the record will end up close to a matching record after sorting.
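A compact sketch of the three steps is given below. The key construction (surname prefix plus zip-code prefix), the window size w, and the comparison function are assumptions chosen for illustration.

```python
# Sketch of the sorted neighborhood method: create key, sort, slide a window.
# The key definition, window size w and comparison function are illustrative.

def create_key(record):
    # assumed fields: first 4 letters of surname + first 3 digits of zip code
    return (record["surname"][:4].lower(), record["zip"][:3])

def sorted_neighborhood(records, compare, w=3):
    ordered = sorted(records, key=create_key)        # step 2: sort by key
    matches = []
    for i, rec in enumerate(ordered):                # step 3: merge
        for prev in ordered[max(0, i - (w - 1)):i]:  # compare with previous w-1
            if compare(prev, rec):
                matches.append((prev, rec))
    return matches

same_name = lambda a, b: a["surname"].lower() == b["surname"].lower()
records = [
    {"surname": "Smith", "zip": "63511"},
    {"surname": "smith", "zip": "63519"},
    {"surname": "Jones", "zip": "10001"},
]
print(sorted_neighborhood(records, same_name, w=2))  # the two Smith records pair up
```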
3.3 Clustering and Canopies
The performance of a basic "nested-loop" record comparison can be improved by assuming that duplicate detection is transitive. Under the assumption of transitivity, the problem of matching records in a database can be described in terms of determining the connected components of an undirected graph. At any time, the connected components of the graph correspond to the transitive closure of the "record matches" relationships discovered so far. A union-find structure is used to efficiently compute the connected components of the graph. During the Union step, duplicate records are "merged" into a cluster, and only a "representative" of the cluster is kept for subsequent comparisons. This reduces the total number of record comparisons without substantially reducing the accuracy of the duplicate detection process. The concept behind this approach is that, if a record is not similar to a representative record already in the cluster, then it will not match the other members of the cluster either. Canopies can also be used to speed up the duplicate detection process. The basic idea is to use a cheap comparison metric to group records into overlapping clusters called canopies. (This is in contrast to blocking, which requires hard, nonoverlapping partitions.) After the first step, the records are then compared pairwise using a more expensive similarity metric that leads to better qualitative results. The assumption behind this method is that there is an inexpensive similarity function that can be used as a "quick-and-dirty" approximation for another, more expensive function. For example, if two strings have a length difference larger than 3, then their edit distance cannot be smaller than 3. In that case, the length comparison serves as a cheap (canopy) function for the more expensive edit distance.

4. PROPOSED APPROACH
The focus is on Web databases from the same domain, i.e., Web databases that provide the same type of records in response to user queries. Different users may have different criteria for what constitutes a duplicate, even for records within the same domain. In the following, N denotes the set of vectors formed from same-source records, which are assumed to be non-duplicates, and P denotes the set of potential duplicate vectors formed from records of different sources. An intuitive solution to this problem is to learn a classifier from N and use the learned classifier to classify P. Although there are several works based on learning from only positive examples, to our knowledge all of them assume that the positive examples are correct. Two classifiers, along with a Gaussian mixture model, identify the duplicate vector set iteratively.

4.1 Weighted Component Similarity Summing Classifier
At the beginning, this classifier is used to identify some duplicate vectors when there are no positive examples available. Then, after the iteration begins, it is used again to cooperate with the second classifier (C2, the SVM classifier) to identify new duplicate vectors. Because no duplicate vectors are available initially, classifiers that need class information for training, such as decision trees and Naive Bayes, cannot be used. An intuitive way to identify duplicate vectors is to assume that two records are duplicates if most of their fields under consideration are similar. In the WCSS classifier, we assign a weight to each component to indicate the importance of its corresponding field, under the condition that the sum of all component weights is equal to 1. The intuition for the weight assignment is as follows. The similarity between two duplicate records should be close to 1; for a duplicate vector V12 that is formed by a pair of duplicate records r1 and r2, we need to assign large weights to the components with large similarity values and small weights to the components with small similarity values. The similarity between two nonduplicate records should be close to 0; hence, for a nonduplicate vector V12 that is formed by a pair of nonduplicate records r1 and r2, we need to assign small weights to the components with large similarity values and large weights to the components with small similarity values.
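A minimal sketch of the WCSS scoring idea follows: a similarity vector is scored as a weighted sum of its per-field similarities and declared a duplicate above a threshold. The weights and the threshold are assumed inputs; the paper's actual weight-assignment scheme, derived from the duplicate and nonduplicate vector sets, is not reproduced here.

```python
# Minimal sketch of WCSS scoring. Weights and threshold are assumed inputs;
# the paper's weight-assignment procedure itself is not shown.

def wcss_score(similarity_vector, weights):
    assert abs(sum(weights) - 1.0) < 1e-9, "component weights must sum to 1"
    return sum(w * s for w, s in zip(weights, similarity_vector))

def wcss_classify(vectors, weights, threshold=0.85):
    """Return the similarity vectors judged to be duplicates (score >= threshold)."""
    return [v for v in vectors if wcss_score(v, weights) >= threshold]

weights = [0.5, 0.3, 0.2]                  # e.g. title, author, venue components
candidates = [
    [0.95, 0.90, 0.80],                    # likely a duplicate pair
    [0.40, 0.20, 0.10],                    # likely a nonduplicate pair
]
print(wcss_classify(candidates, weights))  # -> [[0.95, 0.9, 0.8]]
```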
4.2 Support Vector Machine Classifier
After a few duplicate vectors whose similarity scores are bigger than the threshold have been detected using the WCSS classifier [12], we have positive examples, namely the identified duplicate vectors in D, and negative examples, namely the remaining nonduplicate vectors in N. Hence, we can train another classifier and use this trained classifier to identify new duplicate vectors from the remaining potential duplicate vectors in P and the nonduplicate vectors in N. A classifier suitable for this task should have the following characteristics. First, it should not be sensitive to the relative sizes of the positive and negative examples, because the set of negative examples is usually much bigger than the set of positive examples. This is especially the case at the beginning of the duplicate vector detection iterations, when only a limited number of duplicates have been detected. Another requirement is that the classifier should work well given limited training examples. Similarity-based classifiers estimate the class label of a test sample based on the similarities between the test sample and a set of labeled training samples, and on the pairwise similarities between the training samples. As in other work, we use the term similarity-based classification whether the pairwise relationship is a similarity or a dissimilarity. Similarity-based classification does not require direct access to the features of the samples, and thus the sample space can be any set, not necessarily a Euclidean space, as long as the similarity function is well defined for any pair of samples.

4.3 Gaussian Mixture Model
The complete Gaussian mixture model [15] is parameterized by the mean vectors μi, covariance matrices Σi, and mixture weights wi of all component densities. These parameters are collectively represented by the notation

λ = {wi, μi, Σi}, i = 1, ..., M.    (2)

There are several variants on the GMM shown in Equation (2). The covariance matrices Σi can be full rank or constrained to be diagonal. Additionally, parameters can be shared, or tied, among the Gaussian components, such as having a common covariance matrix for all components. The choice of model configuration (number of components, full or diagonal covariance matrices, and parameter tying) is often determined by the amount of data available for estimating the GMM parameters. The use of a GMM for representing feature distributions in a biometric system may also be motivated by the intuitive notion that the individual component densities may model some underlying set of hidden classes. For example, in speaker recognition, it is reasonable to assume that the acoustic space of spectral features corresponds to a speaker's broad phonetic events, such as vowels, nasals, or fricatives.
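One possible way to combine the two models of Sections 4.2 and 4.3 is sketched below with scikit-learn. The fusion loop, the kernel choice, the number of mixture components, and the heuristic for picking the "duplicate" component are assumptions for illustration; the paper does not specify these details, so this is not the authors' exact algorithm.

```python
# Hedged sketch of one SVM/GMM iteration over similarity vectors, using
# scikit-learn. D = detected duplicate vectors, N = same-source nonduplicate
# vectors, P = remaining potential duplicate vectors (lists of similarity vectors).
import numpy as np
from sklearn.svm import SVC
from sklearn.mixture import GaussianMixture

def svm_gmm_iteration(D, N, P):
    X = np.vstack([D, N])
    y = np.array([1] * len(D) + [0] * len(N))
    svm = SVC(kernel="rbf").fit(X, y)                    # train on current D and N

    gmm = GaussianMixture(n_components=2).fit(X)         # lambda = {w_i, mu_i, Sigma_i}
    dup_comp = int(np.argmax(gmm.means_.mean(axis=1)))   # component with higher mean similarity

    P = np.asarray(P)
    agree = (svm.predict(P) == 1) & (gmm.predict(P) == dup_comp)
    new_duplicates = P[agree]                            # vectors both models call duplicates
    remaining = P[~agree]
    return new_duplicates, remaining, gmm.weights_, gmm.means_, gmm.covariances_
```

In the spirit of the paper, such an iteration would be repeated, moving newly detected duplicates into D, until no new duplicates are found in the result set.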
4.4 Evaluation Metric
The performance on the duplicate-detected data set [13], [14] can be reported using precision and recall, which are defined as follows:

Precision = (# of correctly identified duplicate pairs) / (# of all identified duplicate pairs)
Recall = (# of correctly identified duplicate pairs) / (# of true duplicate pairs)

Due to the usually imbalanced distribution of matches and nonmatches in the weight vector set, commonly used accuracy measures are not suitable for assessing the quality of record matching: the large number of nonmatches usually dominates the accuracy measure and yields results that are too optimistic. To measure the performance, the F-measure, which is the harmonic mean of precision and recall, can also be used to evaluate the classification quality:

F-measure = (2 · precision · recall) / (precision + recall)
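The three measures can be computed directly from the sets of predicted and true duplicate pairs. The small sketch below assumes pairs are represented as order-independent tuples of record identifiers; the example values are invented.

```python
# Precision, recall and F-measure over sets of record-id pairs.
# Pairs are normalised so that (a, b) and (b, a) count as the same pair.

def _norm(pairs):
    return {tuple(sorted(p)) for p in pairs}

def precision_recall_f(predicted_pairs, true_pairs):
    predicted, true = _norm(predicted_pairs), _norm(true_pairs)
    correct = len(predicted & true)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(true) if true else 0.0
    f = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f

# Example: two of the three predicted pairs are true duplicates.
predicted = [("r1", "r2"), ("r3", "r4"), ("r5", "r6")]
true = [("r2", "r1"), ("r3", "r4"), ("r7", "r8")]
print(precision_recall_f(predicted, true))   # (0.666..., 0.666..., 0.666...)
```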
5. EXPERIMENTS
We evaluate our model on a dataset of research paper citations. The dataset is from the Cora Computer Science Research Paper Engine, containing about 1800 citations, with 600 unique papers and 200 unique venues. The datasets are manually labeled for both paper coreference and venue coreference, as well as manually segmented into fields such as author, title, etc. The data were collected by searching for certain authors and topics, and they are split into subsets with non-overlapping papers for the sake of cross-validation experiments. We used a number of feature functions, including exact and approximate string match on normalized and unnormalized values for the following citation fields: title, booktitle, journal, authors, venue, date, editors, institution, and the entire unsegmented citation string. We also calculated an unweighted cosine similarity between tokens in the title and author fields. Additional features include whether or not the papers have the same publication type (e.g., journal or conference), as well as the numerical distance between fields such as year and volume. All real values were binned and converted into binary-valued features. To evaluate performance, we compare the clusters output by our system with the true clustering using pairwise metrics and use the metric ratios for evaluation. The Cora web site (www.cora.whizbang.com) provides a search interface to over 50,000 computer science research papers. As part of the site's functionality, it provides an interface for traversing the citation graph; that is, for a given paper, it provides links to all other papers it references, and links to all other papers that reference it, in the respective bibliography section of each paper. To provide this interface to the data, it is necessary to recognize when two citations from different papers are referencing the same third paper, even though the text of the citations may differ. For example, one paper may abbreviate first author names, while another may include them. The datasets were collected, and the algorithm was coded in Java and run on a machine with the Windows XP operating system. The duplicates are identified, and the core Gaussian Mixture Model is being implemented.

6. REFERENCES
[1] W. Su, J. Wang, and F.H. Lochovsky, "Record Matching over Query Results from Multiple Web Databases," IEEE Transactions on Knowledge and Data Engineering.
[2] R. Baxter, P. Christen, and T. Churches, "A Comparison of Fast Blocking Methods for Record Linkage," Proc. KDD Workshop on Data Cleaning, Record Linkage, and Object Consolidation, pp. 25-27, 2003.
[3] S. Chaudhuri, K. Ganjam, V. Ganti, and R. Motwani, "Robust and Efficient Fuzzy Match for Online Data Cleaning," Proc. ACM SIGMOD, pp. 313-324, 2003.
[4] P. Christen, T. Churches, and M. Hegland, "Febrl—A Parallel Open Source Data Linkage System," Advances in Knowledge Discovery and Data Mining, pp. 638-647, Springer, 2004.
[5] O. Benjelloun, H. Garcia-Molina, D. Menestrina, Q. Su, S.E. Whang, and J. Widom, "Swoosh: A Generic Approach to Entity Resolution," The VLDB J., vol. 18, no. 1, pp. 255-276, 2009.
[6] M. Bilenko and R.J. Mooney, "Adaptive Duplicate Detection Using Learnable String Similarity Measures," Proc. ACM SIGKDD, pp. 39-48, 2003.
[7] P. Christen, "Automatic Record Linkage Using Seeded Nearest Neighbour and Support Vector Machine Classification," Proc. ACM SIGKDD, pp. 151-159, 2008.
[8] W.E. Winkler, "Using the EM Algorithm for Weight Computation in the Fellegi-Sunter Model of Record Linkage," Proc. Section on Survey Research Methods, pp. 667-671, 1988.
[9] S. Chaudhuri, V. Ganti, and R. Motwani, "Robust Identification of Fuzzy Duplicates," Proc. 21st IEEE Int'l Conf. Data Eng. (ICDE '05), pp. 865-876, 2005.
[10] V.S. Verykios, A.K. Elmagarmid, and E.N. Houstis, "Automating the Approximate Record Matching Process," Information Sciences, vol. 126, nos. 1-4, pp. 83-98, July 2000.
[11] A. Blum and T. Mitchell, "Combining Labeled and Unlabeled Data with Co-Training," Proc. 11th Ann. Conf. Computational Learning Theory (COLT '98), pp. 92-100, 1998.
[12] W.W. Cohen and J. Richman, "Learning to Match and Cluster Large High-Dimensional Datasets for Data Integration," Proc. ACM SIGKDD, pp. 475-480, 2002.
[13] P. Christen and K. Goiser, "Quality and Complexity Measures for Data Linkage and Deduplication," Quality Measures in Data Mining, F. Guillet and H. Hamilton, eds., vol. 43, pp. 127-151, Springer, 2007.
[14] W.W. Cohen, H. Kautz, and D. McAllester, "Hardening Soft Information Sources," Proc. ACM SIGKDD, pp. 255-259, 2000.
[15] D. Reynolds, "Gaussian Mixture Models," MIT Lincoln Laboratory, 244 Wood St., Lexington, MA 02140, USA.
[16] W.E. Winkler, "Using the EM Algorithm for Weight Computation in the Fellegi-Sunter Model of Record Linkage," Proc. Section on Survey Research Methods, pp. 667-671, 1988.
