Learning Similarity Metrics for Event Identification in Social Media

Talk presented at WSDM 2010.


  1. Learning Similarity Metrics for Event Identification in Social Media. Hila Becker and Luis Gravano (Columbia University), Mor Naaman (Rutgers University).
  3. "Event" Content in Social Media Sites. An "event" is something that occurs at a certain time in a certain place [Yang et al. '99]. This covers popular, widely known events as well as smaller events without traditional news coverage.
  5. Identifying Events and Associated Social Media Documents. Applications: event browsing, local event search, and more. General approach: group similar documents via clustering, so that each cluster corresponds to one event and its associated social media documents.
  9. Event Identification: Challenges. Uneven data quality: missing, short, or uninformative text, but revealing structured context is available (tags, date/time, geo-coordinates). Scalability: event information arrives as a dynamic data stream. The number of events is unknown, difficult to estimate, and constantly changing.
  14. Clustering Social Media Documents (outline): social media document representation; social media document similarity; social media document clustering framework; similarity metric learning for clustering (ensemble-based and classification-based); evaluation results.
  30. Social Media Document Features: Title, Description, Tags, Date/Time, Location, All-Text.
  35. Social Media Document Similarity. Text features (Title, Description, Tags, All-Text): cosine similarity of tf-idf vectors (open choices: tf-idf variant, stemming, stop-word elimination). Time: proximity in minutes. Location: geo-coordinate proximity.
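The text and time similarities above can be sketched in Python. The smoothed idf variant and the 24-hour decay horizon below are illustrative choices, not the paper's (the slide itself leaves the tf-idf variant, stemming, and stop-word handling open):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build tf-idf vectors (raw tf, smoothed idf) for a list of token lists."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log(n / df[t]) + 1.0 for t in df}
    return [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def time_similarity(t1, t2, horizon_minutes=60 * 24):
    """Map time proximity (timestamps in seconds) into [0, 1];
    0 beyond the horizon. The linear decay is an assumption."""
    gap_minutes = abs(t1 - t2) / 60.0
    return max(0.0, 1.0 - gap_minutes / horizon_minutes)
```

For example, two photo captions sharing "jazz" and "nyc" score higher than an unrelated pair, and photos taken two days apart get a time similarity of 0 under this horizon.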
  36. General Clustering Framework: social media documents are mapped to a document feature representation, which the clustering algorithm turns into event clusters.
  43. Clustering Algorithm. Many alternatives are possible [Berkhin 2002]. We use a single-pass incremental clustering algorithm: a scalable, online solution, used effectively for event identification in textual news, which does not require a priori knowledge of the number of clusters. Parameters: a similarity function σ and a threshold μ.
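The single-pass incremental algorithm can be sketched as follows; the comparison of a document against a cluster (e.g., against its centroid) is left to a caller-supplied similarity function σ with threshold μ:

```python
def single_pass_cluster(docs, similarity, threshold):
    """Single-pass incremental clustering: assign each document to the most
    similar existing cluster if that similarity exceeds the threshold mu,
    otherwise start a new cluster. Clusters are plain lists of documents."""
    clusters = []
    for doc in docs:
        best, best_sim = None, threshold
        for cluster in clusters:
            # similarity() compares the document to the cluster as a whole.
            sim = similarity(doc, cluster)
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is not None:
            best.append(doc)
        else:
            clusters.append([doc])
    return clusters
```

A toy run with 1-D "documents" and a centroid-based similarity splits [0.0, 0.1, 5.0, 5.1] into two clusters.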
  46. Cluster Representation and Parameter Tuning. Centroid cluster representation: average tf-idf scores, average time, and geographic mid-point. Parameters are tuned in a supervised training phase. Clustering quality metrics to optimize: Normalized Mutual Information (NMI) [Strehl et al. 2002] and B-Cubed [Amigó et al. 2008].
  55. Clustering Quality Metrics. Desired characteristics of clusters: homogeneity and completeness. Both are captured by NMI and by B-Cubed. We optimize both metrics using a single (Pareto-optimal) objective function: NMI + B-Cubed.
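As a concrete reference for the homogeneity/completeness trade-off, here is a minimal B-Cubed implementation: per-item precision rewards homogeneity, per-item recall rewards completeness. Combining them with F1 is one common choice, not necessarily the exact variant used in the paper:

```python
from collections import defaultdict

def bcubed(system, gold):
    """B-Cubed F1 for a clustering. `system` and `gold` map each item id to
    a cluster label and a true event label, respectively."""
    sys_clusters, gold_clusters = defaultdict(set), defaultdict(set)
    for item, label in system.items():
        sys_clusters[label].add(item)
    for item, label in gold.items():
        gold_clusters[label].add(item)
    prec = rec = 0.0
    for item in system:
        same_sys = sys_clusters[system[item]]    # items clustered with it
        same_gold = gold_clusters[gold[item]]    # items truly sharing its event
        overlap = len(same_sys & same_gold)
        prec += overlap / len(same_sys)          # homogeneity of its cluster
        rec += overlap / len(same_gold)          # completeness for its event
    n = len(system)
    p, r = prec / n, rec / n
    return 2 * p * r / (p + r) if p + r else 0.0
```

A perfect clustering scores 1.0; lumping everything into one cluster keeps recall at 1.0 but drives precision, and hence F1, down.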
  58. Learning a Similarity Metric for Clustering. Ensemble-based similarity: training a cluster ensemble, then computing a similarity score by combining individual partitions or by combining individual similarities. Classification-based similarity: training data sampling strategies and modeling strategies.
  63. Overview of a Cluster Ensemble Algorithm. Base clusterers Ctitle, Ctags, Ctime each produce a partition; a consensus function f(C, W) combines the ensemble similarities using weights Wtitle, Wtags, Wtime, learned in a training step, to produce the ensemble clustering solution.
  68. Overview of a Cluster Ensemble Algorithm. For each document di and cluster cj, each base clusterer casts a membership vote: σCtitle(di,cj) > μCtitle, σCtags(di,cj) > μCtags, σCtime(di,cj) > μCtime; the consensus function f(C, W) combines these votes with the learned weights.
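One way to read the per-clusterer membership tests above is as weighted binary votes. The sketch below combines them that way; the paper's actual consensus function may differ, and the weight normalization is an assumption:

```python
def consensus_similarity(doc, cluster, clusterers, weights):
    """Weighted consensus over an ensemble: each base clusterer votes on
    whether `doc` belongs with `cluster` (its similarity exceeds its own
    tuned threshold mu), and votes are combined with learned weights W."""
    total = sum(weights.values())
    score = 0.0
    for name, (sim_fn, mu) in clusterers.items():
        vote = 1.0 if sim_fn(doc, cluster) > mu else 0.0
        score += weights[name] * vote
    return score / total if total else 0.0
```

For instance, if a (hypothetical) title clusterer votes yes with weight 2 and a time clusterer votes no with weight 1, the consensus score is 2/3.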
  71. Classification-based Similarity Metrics. Classify pairs of documents as similar/dissimilar. Feature vector: pairwise similarity scores, one feature per similarity metric (e.g., time-proximity, location-proximity, ...). Modeling strategies: document pairs, or document-centroid pairs.
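A hypothetical sketch of the classification-based metric: build one feature per base similarity and score the pair with an already-trained logistic regression. The weights and bias below are placeholders, not learned values from the paper:

```python
import math

def pairwise_features(doc_a, doc_b, metrics):
    """One feature per base similarity metric (title, tags, time, ...)."""
    return [metric(doc_a, doc_b) for metric in metrics]

def lr_similarity(features, weights, bias):
    """Logistic-regression score: the model's probability that the pair of
    documents belongs to the same event, usable as a similarity metric."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))
```

In use, the resulting probability plays the role of σ in the clustering framework, compared against the tuned threshold μ.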
  75. Training Classification-based Similarity. Challenge: most document pairs do not correspond to the same event, leading to a skewed label distribution, and clusters are small and highly homogeneous. Sampling strategies: Random (select a document at random; randomly create one positive and one negative example for it) and Time-based (create examples for the first NxN document pairs, then resample so that the label distribution is balanced).
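The random sampling strategy can be sketched as follows; the helper name and the exact pairing scheme are illustrative assumptions:

```python
import random

def sample_training_pairs(docs, event_of, n_pairs, seed=0):
    """Random sampling: repeatedly pick an anchor document at random and
    create one positive pair (same event, label 1) and one negative pair
    (different event, label 0), keeping the label distribution balanced."""
    rng = random.Random(seed)
    by_event = {}
    for d in docs:
        by_event.setdefault(event_of[d], []).append(d)
    events = list(by_event)
    pairs = []
    while len(pairs) < n_pairs:
        anchor = rng.choice(docs)
        ev = event_of[anchor]
        mates = [d for d in by_event[ev] if d != anchor]
        other_events = [e for e in events if e != ev]
        if not mates or not other_events:
            continue  # cannot form both a positive and a negative example
        pairs.append((anchor, rng.choice(mates), 1))
        pairs.append((anchor, rng.choice(by_event[rng.choice(other_events)]), 0))
    return pairs[:n_pairs]
```

Because every anchor contributes one positive and one negative example, the resulting labels are balanced by construction, sidestepping the skew problem described above.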
  78. Experiments: Alternative Similarity Metrics. Ensemble-based techniques: combining individual partitions (ENS-PART) and combining individual similarities (ENS-SIM). Classification-based techniques: modeling with document-document vs. document-centroid pairs, sampling time-based vs. random, using Logistic Regression (CLASS-LR) and Support Vector Machines (CLASS-SVM). Baselines: Title, Description, Tags, All-Text, Time-Proximity, Location-Proximity.
  82. Experimental Setup. Datasets: Upcoming (>270K Flickr photos; event labels from the "upcoming" event database, e.g., upcoming:event=12345; split into three parts for training/validation/testing) and LastFM (>594K Flickr photos; event labels from the last.fm music catalog, e.g., lastfm:event=6789; used as an additional test set).
  91. Clustering Accuracy over the Upcoming Test Set. All similarity learning techniques outperform the baselines, and classification-based techniques perform better than ensemble-based techniques.

      Algorithm   NMI     B-Cubed
      All-Text    0.9240  0.7697
      Tags        0.9229  0.7676
      ENS-PART    0.9296  0.7819
      ENS-SIM     0.9322  0.7861
      CLASS-SVM   0.9425  0.8095
      CLASS-LR    0.9444  0.8155
  95. Statistical Significance Analysis. Clustering results over 10 partitions of the Upcoming test set; differences are significant under the Friedman test, p<0.05, followed by a post-hoc analysis.
  96. NMI: Clustering Accuracy over Both Test Sets (Upcoming and LastFM). Similarity learning models trained on Upcoming data show similar trends when tested on LastFM data.
  98. Conclusions. Structured context features of social media documents are effective complementary cues for document similarity; Tags and Time-Proximity are among the highest-weighted features. Domain-appropriate similarity metrics, combined with learned weights, yield high-quality clustering results that significantly outperform text-only techniques, and the similarity learning models generalize to unseen data sets.
  101. Current and Future Work: improving clustering accuracy with social media "links" [SSM '10 poster]; capturing event content across sites (YouTube, Flickr, Twitter); designing event search strategies.
  102. Thank You!
