PPT file


  1. Semi-Supervised Clustering and its Application to Text Clustering and Record Linkage (Raymond J. Mooney, Sugato Basu, Mikhail Bilenko, Arindam Banerjee)
  2. Supervised Classification Example [figure]
  3. Supervised Classification Example [figure]
  4. Supervised Classification Example [figure]
  5. Unsupervised Clustering Example [figure]
  6. Unsupervised Clustering Example [figure]
  7. Semi-Supervised Learning
     - Combines labeled and unlabeled data during training to improve performance:
       - Semi-supervised classification: training on labeled data exploits additional unlabeled data, frequently resulting in a more accurate classifier.
       - Semi-supervised clustering: uses a small amount of labeled data to aid and bias the clustering of unlabeled data.
  8. Semi-Supervised Classification Example [figure]
  9. Semi-Supervised Classification Example [figure]
  10. Semi-Supervised Classification
     - Algorithms:
       - Semi-supervised EM [Ghahramani:NIPS94, Nigam:ML00]
       - Co-training [Blum:COLT98]
       - Transductive SVMs [Vapnik:98, Joachims:ICML99]
     - Assumptions:
       - A known, fixed set of categories is given in the labeled data.
       - The goal is to improve classification of examples into these known categories.
  11. Semi-Supervised Clustering Example [figure]
  12. Semi-Supervised Clustering Example [figure]
  13. Second Semi-Supervised Clustering Example [figure]
  14. Second Semi-Supervised Clustering Example [figure]
  15. Semi-Supervised Clustering
     - Can group data using the categories in the initial labeled data.
     - Can also extend and modify the existing set of categories as needed to reflect other regularities in the data.
     - Can cluster a disjoint set of unlabeled data, using the labeled data as a &quot;guide&quot; to the type of clusters desired.
  16. Search-Based Semi-Supervised Clustering
     - Alter the clustering algorithm's search for a good partitioning by:
       - Modifying the objective function to reward obeying the labels on the supervised data [Demeriz:ANNIE99].
       - Enforcing constraints (must-link, cannot-link) derived from the labeled data during clustering [Wagstaff:ICML00, Wagstaff:ICML01].
       - Using the labeled data to initialize the clusters in an iterative refinement algorithm (KMeans, EM) [Basu:ICML02].
  17. Unsupervised KMeans Clustering
     - KMeans is a partitional clustering algorithm, based on iterative relocation, that partitions a dataset into K clusters.
     - Algorithm: initialize the K cluster centers randomly, then repeat until convergence:
       - Cluster assignment step: assign each data point x to the cluster X_l whose center is closest to x in L2 distance.
       - Center re-estimation step: re-estimate each cluster center as the mean of the points assigned to that cluster.
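The two alternating steps above can be sketched in a few lines of Python. This is a minimal illustration of the algorithm as described on the slide, not the experimental code used in the talk:

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Plain KMeans (iterative relocation) with random initialization."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        # Cluster assignment step: nearest center by squared L2 distance.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[j].append(p)
        # Center re-estimation step: mean of each cluster's points.
        new_centers = [[sum(xs) / len(c) for xs in zip(*c)] if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:  # converged: assignments no longer change
            break
        centers = new_centers
    return centers, clusters
```

On two well-separated groups of points, the relocation converges to one cluster per group regardless of which points were sampled as initial centers.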
  18. KMeans Objective Function
     - Locally minimizes the sum of squared distances between the data points and their corresponding cluster centers: J = Σ_l Σ_{x ∈ X_l} ||x − μ_l||², where μ_l is the center of cluster X_l.
     - Initialization of the K cluster centers:
       - Totally random
       - Random perturbation from the global mean
       - Heuristics to ensure well-separated centers
       - etc.
  19. KMeans Example [figure]
  20. KMeans Example: Randomly Initialize Means [figure]
  21. KMeans Example: Assign Points to Clusters [figure]
  22. KMeans Example: Re-estimate Means [figure]
  23. KMeans Example: Re-assign Points to Clusters [figure]
  24. KMeans Example: Re-estimate Means [figure]
  25. KMeans Example: Re-assign Points to Clusters [figure]
  26. KMeans Example: Re-estimate Means and Converge [figure]
  27. Semi-Supervised KMeans
     - Seeded KMeans:
       - Labeled data provided by the user are used for initialization: the initial center for cluster i is the mean of the seed points having label i.
       - Seed points are used only for initialization, not in subsequent steps.
     - Constrained KMeans:
       - Labeled data provided by the user are used to initialize the KMeans algorithm.
       - Cluster labels of the seed data are kept unchanged in the cluster assignment steps; only the labels of the non-seed data are re-estimated.
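A minimal sketch of both variants (an illustration of the slide, not the authors' implementation): `seeds` maps point indices to cluster labels, every cluster is assumed to have at least one seed, and `constrained=True` clamps the seed labels during assignment — which is the only difference between Seeded and Constrained KMeans:

```python
def seeded_kmeans(points, seeds, k, iters=100, constrained=False):
    """Seeded/Constrained KMeans. `seeds`: {point_index: label in 0..k-1};
    each cluster must have at least one seed point."""
    # Initialization: center i = mean of the seed points labeled i.
    centers = []
    for i in range(k):
        members = [points[j] for j, lab in seeds.items() if lab == i]
        centers.append([sum(xs) / len(members) for xs in zip(*members)])
    labels = [None] * len(points)
    for _ in range(iters):
        for j, p in enumerate(points):
            if constrained and j in seeds:
                labels[j] = seeds[j]  # seed labels never change
            else:
                labels[j] = min(range(k),
                                key=lambda c: sum((a - b) ** 2
                                                  for a, b in zip(p, centers[c])))
        for i in range(k):
            members = [points[j] for j in range(len(points)) if labels[j] == i]
            if members:
                centers[i] = [sum(xs) / len(members) for xs in zip(*members)]
    return labels
```

With seeds on the two extreme points of a 1-D dataset, both variants recover the intended two-cluster split.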
  28. Semi-Supervised KMeans Example [figure]
  29. Semi-Supervised KMeans Example: Initialize Means Using Labeled Data [figure]
  30. Semi-Supervised KMeans Example: Assign Points to Clusters [figure]
  31. Semi-Supervised KMeans Example: Re-estimate Means and Converge [figure]
  32. Similarity-Based Semi-Supervised Clustering
     - Train an adaptive similarity function to fit the labeled data.
     - Use a standard clustering algorithm with the trained similarity function to cluster the unlabeled data.
     - Adaptive similarity functions:
       - Altered Euclidean distance [Klein:ICML02]
       - Trained Mahalanobis distance [Xing:NIPS02]
       - EM-trained edit distance [Bilenko:KDD03]
     - Clustering algorithms:
       - Single-link agglomerative [Bilenko:KDD03]
       - Complete-link agglomerative [Klein:ICML02]
       - KMeans [Xing:NIPS02]
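As a concrete example of an adaptive similarity function, a diagonal Mahalanobis distance is just Euclidean distance with learned per-dimension weights. The sketch below shows only the parameterized distance; the weight-training step (the actual subject of Xing et al.'s method) is omitted and the weights are assumed given:

```python
import math

def mahalanobis(x, y, weights):
    """Diagonal Mahalanobis distance: sqrt(sum_i w_i * (x_i - y_i)^2).
    With all weights equal to 1 this reduces to Euclidean distance;
    training would choose the weights from labeled pairs (not shown)."""
    return math.sqrt(sum(w * (a - b) ** 2 for w, a, b in zip(weights, x, y)))
```

Setting a dimension's weight to 0 makes the metric ignore that dimension entirely, which is how a learned metric can stretch or collapse directions of the input space.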
  33. Semi-Supervised Clustering Example: Similarity-Based [figure]
  34. Semi-Supervised Clustering Example: Distances Transformed by Learned Metric [figure]
  35. Semi-Supervised Clustering Example: Clustering Result with Trained Metric [figure]
  36. Experiments
     - Evaluation measures:
       - Objective function value for KMeans.
       - Mutual information (MI) between the distributions of computed cluster labels and human-provided class labels.
     - Experiments:
       - Change of objective function and MI with increasing fraction of seeding (complete labeling, no noise).
       - Change of objective function and MI with increasing noise in the seeds (complete labeling, fixed seeding).
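The MI evaluation measure can be computed directly from the two label sequences. A minimal sketch (not the talk's evaluation code), reporting MI in bits:

```python
import math
from collections import Counter

def mutual_information(cluster_labels, class_labels):
    """Mutual information (in bits) between computed cluster labels
    and human-provided class labels for the same points."""
    n = len(cluster_labels)
    joint = Counter(zip(cluster_labels, class_labels))
    pc = Counter(cluster_labels)   # cluster-label marginal counts
    pk = Counter(class_labels)     # class-label marginal counts
    mi = 0.0
    for (c, k), n_ck in joint.items():
        p_ck = n_ck / n
        mi += p_ck * math.log2(p_ck / ((pc[c] / n) * (pk[k] / n)))
    return mi
```

A clustering that perfectly reproduces two balanced classes scores 1 bit; a clustering independent of the classes scores 0.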
  37. Experimental Methodology
     - The clustering algorithm is always run on the entire dataset.
     - Learning curves with 10-fold cross-validation:
       - 10% of the data is set aside as a test set whose labels are always hidden.
       - Learning curves are generated by training on different "seed fractions" of the remaining 90% of the data, whose labels are provided.
     - The objective function is calculated over the entire dataset.
     - The MI measure is calculated only on the independent test set.
  38. Experimental Methodology (contd.)
     - For each fold in the seeding experiments:
       - Seeds are selected from the training dataset by varying the seed fraction from 0.0 to 1.0 in steps of 0.1.
     - For each fold in the noise experiments:
       - Noise is simulated by changing the labels of a fraction of the seed points to a random incorrect value.
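The noise-injection step can be sketched as a small helper (illustrative, not the original experiment code): a chosen fraction of seed labels is flipped to a uniformly random *incorrect* cluster label among the k clusters.

```python
import random

def add_label_noise(labels, noise_fraction, k, seed=0):
    """Flip `noise_fraction` of the seed labels to a random incorrect
    cluster label in 0..k-1, simulating noisy supervision."""
    rng = random.Random(seed)
    noisy = list(labels)
    flip = rng.sample(range(len(labels)), int(noise_fraction * len(labels)))
    for i in flip:
        noisy[i] = rng.choice([c for c in range(k) if c != labels[i]])
    return noisy
```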
  39. COP-KMeans
     - COP-KMeans [Wagstaff et al.: ICML01] is KMeans with must-link (must be in the same cluster) and cannot-link (cannot be in the same cluster) constraints on data points.
     - Initialization: cluster centers are chosen randomly, but as each one is chosen, any must-link constraints it participates in are enforced (so those points cannot later be chosen as the center of another cluster).
     - Algorithm: during the cluster assignment step, a point is assigned to its nearest cluster that does not violate any of its constraints. If no such assignment exists, abort.
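The constrained assignment step can be sketched as follows (an illustration of the rule on the slide, not Wagstaff et al.'s code). Here `labels` holds the current assignments (`None` for not-yet-assigned points), and `must`/`cannot` map a point index to the indices it is constrained with:

```python
def cop_assign(idx, points, centers, labels, must, cannot):
    """COP-KMeans assignment for point `idx`: try clusters in order of
    increasing distance and take the first that violates no constraint;
    return None (i.e. abort) if every cluster violates one."""
    def dist2(p, c):
        return sum((a - b) ** 2 for a, b in zip(p, c))
    for c in sorted(range(len(centers)), key=lambda c: dist2(points[idx], centers[c])):
        ok = all(labels[j] == c for j in must.get(idx, []) if labels[j] is not None) \
             and all(labels[j] != c for j in cannot.get(idx, []) if labels[j] is not None)
        if ok:
            return c
    return None
```

A must-link to an already-assigned point overrides distance, and a point whose every candidate cluster is forbidden yields `None`, triggering the abort described on the slide.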
  40. Datasets
     - Data sets:
       - UCI Iris (3 classes; 150 instances)
       - CMU 20 Newsgroups (20 classes; 20,000 instances)
       - Yahoo! News (20 classes; 2,340 instances)
     - Data subsets created for the experiments:
       - Small-20 newsgroup: a random sample of 100 documents from each newsgroup, created to study the effect of dataset size on the algorithms.
       - Different-3 newsgroup: 3 very different newsgroups (alt.atheism, rec.sport.baseball, sci.space), created to study the effect of data separability on the algorithms.
       - Same-3 newsgroup: 3 very similar newsgroups (comp.graphics, comp.os.ms-windows, comp.windows.x).
  41. Text Data
     - Vector-space model with TF-IDF weighting for text data.
     - Non-content-bearing words removed:
       - Stopwords
       - High- and low-frequency words
       - Words of length < 3
     - Text-handling software:
       - Spherical KMeans was used as the underlying clustering algorithm: it uses cosine similarity instead of Euclidean distance between word vectors.
       - Code base built on top of the MC and SPKMeans packages developed at UT Austin.
  42. Results: MI and Seeding [figure]
     - Zero noise in seeds [Small-20 NewsGroup]: semi-supervised KMeans is substantially better than unsupervised KMeans.
  43. Results: Objective Function and Seeding [figure]
     - User labeling consistent with KMeans assumptions [Small-20 NewsGroup]: the objective function of the data partition increases exponentially with the seed fraction.
  44. Results: MI and Seeding [figure]
     - Zero noise in seeds [Yahoo! News]: semi-supervised KMeans is still better than unsupervised.
  45. Results: Objective Function and Seeding [figure]
     - User labeling inconsistent with KMeans assumptions [Yahoo! News]: the objective function of the constrained algorithms decreases with seeding.
  46. Results: Dataset Separability [figure]
     - Difficult datasets, with heavy overlap between the clusters [Same-3 NewsGroup]: semi-supervision gives a substantial improvement.
  47. Results: Dataset Separability [figure]
     - Easy datasets, with little overlap between the clusters [Different-3 NewsGroup]: semi-supervision does not give a substantial improvement.
  48. Results: Noise Resistance [figure]
     - Seed fraction 0.1 [20 NewsGroup]: Seeded KMeans is the most robust against noisy seeding.
  49. Record Linkage
     - Identify and merge duplicate field values and duplicate records in a database.
     - Applications:
       - Duplicates in mailing lists
       - Information integration across multiple databases of stores, restaurants, etc.
       - Matching bibliographic references in research papers (Cora/ResearchIndex)
       - Different published editions in a database of books
  50. Experimental Datasets
     - 1,200 artificially corrupted mailing-list addresses
     - 1,295 Cora research paper citations
     - 864 restaurant listings from Fodor's and Zagat's guidebooks
     - 1,675 Citeseer research paper citations
  51. Record Linkage Examples
     - Duplicate citations (Author | Title | Venue | Address | Year):
       - Freund, Y., Seung, H.S., Shamir, E. & Tishby, N. | Information, prediction, and query by committee | Advances in Neural Information Processing Systems | San Mateo, CA. | 1993
       - Yoav Freund, H. Sebastian Seung, Eli Shamir, and Naftali Tishby | Information, prediction, and query by committee | Advances in Neural Information Processing System | San Mateo, CA
     - Duplicate restaurant listings (Name | Address | City | Cuisine):
       - Second Avenue Deli | 156 Second Ave. | New York City | Delis
       - Second Avenue Deli | 156 2nd Ave. at 10th | New York | Delicatessen
  52. Traditional Record Linkage
     - Apply a static text-similarity metric to each field:
       - Cosine similarity
       - Jaccard similarity
       - Edit distance
     - Combine the per-field similarities to determine overall similarity:
       - Manually weighted sum
     - Threshold the overall similarity to detect duplicates.
  53. Edit (Levenshtein) Distance
     - The minimum number of character deletions, insertions, or substitutions needed to make two strings equivalent:
       - "misspell" to "mispell" is distance 1
       - "misspell" to "mistell" is distance 2
       - "misspell" to "misspelling" is distance 3
     - Can be computed efficiently using dynamic programming in O(mn) time, where m and n are the lengths of the two strings being compared.
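The O(mn) dynamic program is short enough to show in full; this is the standard textbook recurrence, keeping only two rows of the table:

```python
def edit_distance(s, t):
    """Levenshtein distance via dynamic programming in O(mn) time."""
    m, n = len(s), len(t)
    prev = list(range(n + 1))  # distances from s[:0] to every prefix of t
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cur[j] = min(prev[j] + 1,                           # deletion
                         cur[j - 1] + 1,                        # insertion
                         prev[j - 1] + (s[i - 1] != t[j - 1]))  # substitution
        prev = cur
    return prev[n]
```

It reproduces the three distances given on the slide.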
  54. Edit Distance with Affine Gaps
     - Contiguous deletions/insertions are less expensive than non-contiguous ones:
       - "misspell" to "misspelling" is distance < 3
     - The relative cost of contiguous and non-contiguous deletions/insertions is determined by a manually set parameter.
     - Affine-gap edit distance is better for identifying duplicates than Levenshtein distance.
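One standard way to realize this is Gotoh's three-matrix dynamic program, sketched below with hypothetical cost parameters (the slide says the relative costs are set manually; these particular values are not from the talk). A gap of length L costs `gap_open + (L-1)*gap_extend`, so a contiguous run is cheaper than L isolated edits whenever `gap_extend < gap_open`:

```python
def affine_edit_distance(s, t, sub=1.0, gap_open=1.0, gap_extend=0.5):
    """Edit distance with affine gaps (Gotoh's DP). M: best alignment
    ending in match/substitution; D/I: ending in a gap in t/s."""
    INF = float("inf")
    m, n = len(s), len(t)
    M = [[INF] * (n + 1) for _ in range(m + 1)]
    D = [[INF] * (n + 1) for _ in range(m + 1)]
    I = [[INF] * (n + 1) for _ in range(m + 1)]
    M[0][0] = 0.0
    for i in range(1, m + 1):
        D[i][0] = gap_open + (i - 1) * gap_extend
    for j in range(1, n + 1):
        I[0][j] = gap_open + (j - 1) * gap_extend
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            c = 0.0 if s[i - 1] == t[j - 1] else sub
            M[i][j] = c + min(M[i - 1][j - 1], D[i - 1][j - 1], I[i - 1][j - 1])
            D[i][j] = min(D[i - 1][j] + gap_extend,           # extend gap
                          min(M[i - 1][j], I[i - 1][j]) + gap_open)  # open gap
            I[i][j] = min(I[i][j - 1] + gap_extend,
                          min(M[i][j - 1], D[i][j - 1]) + gap_open)
    return min(M[m][n], D[m][n], I[m][n])
```

Under these costs, the contiguous 3-character insertion turning "misspell" into "misspelling" costs 1.0 + 2 * 0.5 = 2.0, below the Levenshtein distance of 3.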
  55. Trainable Record Linkage
     - MARLIN (Multiply Adaptive Record Linkage using INduction)
     - Learn parameterized similarity metrics for comparing each field:
       - Trainable edit distance
         - Use EM to set the edit-operation costs
     - Learn to combine the multiple similarity metrics for each field to determine equivalence:
       - Use an SVM to decide on duplicates
  56. Trainable Edit Distance
     - A learnable edit distance based on a generative probabilistic model for producing matched pairs of strings.
     - Parameters are trained using EM to maximize the probability of producing the training pairs of equivalent strings.
     - Originally developed for Levenshtein distance by Ristad & Yianilos (1998).
     - We modified it for affine-gap edit distance.
  57. Sample Learned Edit Operations
     - Inexpensive operations:
       - Deleting/adding a space
       - Substituting '/' for '-' in phone numbers
       - Deleting/adding 'e' and 't' in addresses (Street → St.)
     - Expensive operations:
       - Deleting/adding a digit in a phone number
       - Deleting/adding a 'q' in a name
  58. Combining Field Similarities
     - Record similarity is determined by combining the similarities of the individual fields.
     - Some fields are more indicative of record similarity than others:
       - For addresses, city similarity is less relevant than restaurant/person name or street address.
       - For bibliographic citations, venue (i.e., conference or journal name) is less relevant than author or title.
     - Field similarities should therefore be weighted when combined into a record similarity.
     - The weights should be learned with a learning algorithm.
  59. MARLIN Record Linkage Framework [figure: each pair of field values (A.Field1/B.Field1 ... A.Fieldn/B.Fieldn) is compared by trainable similarity metrics m1 ... mk, and the resulting similarities feed a trainable duplicate detector]
  60. Learned Record Similarity
     - The field similarities are used as feature vectors describing a pair of records.
     - An SVM is trained on these feature vectors to discriminate duplicate from non-duplicate pairs.
     - Record similarity is based on the distance of the feature vector from the separator.
  61. Record Pair Classification Example [figure]
  62. Clustering Records into Equivalence Classes
     - Use similarity-based semi-supervised clustering to identify groups of equivalent records.
     - Use single-link agglomerative clustering to cluster records based on the learned similarity metric.
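Single-link merging up to a similarity threshold is equivalent to taking connected components of the "similarity above threshold" graph, so a minimal sketch (illustrative, not MARLIN itself) can use union-find:

```python
def single_link_clusters(n, sim, threshold):
    """Single-link clustering at a fixed threshold: union any two records
    whose similarity reaches the threshold; the resulting connected
    components are the equivalence classes of duplicates."""
    parent = list(range(n))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i
    for i in range(n):
        for j in range(i + 1, n):
            if sim(i, j) >= threshold:
                parent[find(i)] = find(j)
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())
```

Here `sim` would be the learned similarity metric; transitivity comes for free, since a chain of above-threshold pairs ends up in one component.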
  63. Experimental Methodology
     - 2-fold cross-validation, with equivalence classes of records randomly assigned to folds.
     - Results averaged over 20 runs of cross-validation.
     - Accuracy of duplicate detection on the test data measured using:
       - Precision = (# of correctly identified duplicate pairs) / (# of pairs identified as duplicates)
       - Recall = (# of correctly identified duplicate pairs) / (# of true duplicate pairs)
       - F-measure = 2 × Precision × Recall / (Precision + Recall)
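These three measures can be computed over sets of record pairs, for example:

```python
def pairwise_prf(predicted_pairs, true_pairs):
    """Duplicate detection scored over record pairs:
    precision = correct / predicted, recall = correct / true,
    F-measure = harmonic mean of precision and recall."""
    tp = len(predicted_pairs & true_pairs)
    p = tp / len(predicted_pairs) if predicted_pairs else 0.0
    r = tp / len(true_pairs) if true_pairs else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f
```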
  64. Mailing List Name Field Results [figure]
  65. Cora Title Field Results [figure]
  66. Maximum F-measure for Detecting Duplicate Field Values
     (t-test results indicate the differences are significant at the .05 level)

     | Metric                    | Restaurant Name | Restaurant Address | Citeseer Reason | Citeseer Face | Citeseer RL | Citeseer Constraint |
     |---------------------------|-----------------|--------------------|-----------------|---------------|-------------|---------------------|
     | Static Affine Edit Dist.  | 0.29            | 0.68               | 0.93            | 0.95          | 0.89        | 0.92                |
     | Learned Affine Edit Dist. | 0.35            | 0.71               | 0.94            | 0.97          | 0.91        | 0.94                |
  67. Mailing List Record Results [figure]
  68. Restaurant Record Results [figure]
  69. Combining Similarity- and Search-Based Semi-Supervised Clustering
     - Seeded/constrained clustering can be applied with a trained similarity metric.
     - We developed a unified framework for Euclidean distance with soft pairwise constraints (must-link, cannot-link).
     - Experiments on UCI data compare the approaches:
       - With small amounts of training data, seeded/constrained clustering tends to do better than similarity-based.
       - With larger amounts of labeled data, similarity-based tends to do better.
       - Combining the two outperforms either individual approach.
  70. Active Semi-Supervision
     - Use active learning to select the most informative labeled examples.
     - We have developed an active approach for selecting good pairwise queries to obtain must-link and cannot-link constraints:
       - "Should these two examples be in the same or different clusters?"
     - Experimental results on UCI and text data: active learning achieves much higher accuracy with fewer labeled training pairs.
  71. Future Work
     - Adaptive metric learning for vector-space cosine similarity:
       - Supervised learning of better token weights than TF-IDF.
     - A unified method for text data (cosine similarity) that combines seeded/constrained clustering with a learned similarity measure.
     - Active learning results for duplicate detection:
       - Static-active learning
     - Exploiting external data/knowledge (e.g., from the web) to improve similarity measures for duplicate detection.
  72. Conclusion
     - Semi-supervised clustering is an alternative way of combining labeled and unlabeled data in learning.
     - Search-based and similarity-based methods are two alternative approaches.
     - Both have useful applications in text clustering and database record linkage.
     - Experimental results for these applications illustrate their utility.
     - The two approaches can be combined to produce even better results.