Learning Similarity Metrics for Event Identification in Social Media

WSDM2010 Talk.


Transcript

  • 1. Learning Similarity Metrics for Event Identification in Social Media. Hila Becker, Luis Gravano (Columbia University); Mor Naaman (Rutgers University).
  • 2. “Event” Content in Social Media Sites. “Event”: something that occurs at a certain time in a certain place [Yang et al. '99]. Includes popular, widely known events as well as smaller events without traditional news coverage.
  • 4. Identifying Events and Associated Social Media Documents. Applications: event browsing, local event search, … General approach: group similar documents via clustering; each cluster corresponds to one event and its associated social media documents.
  • 8. Event Identification: Challenges. Uneven data quality: missing, short, uninformative text, but revealing structured context is available (tags, date/time, geo-coordinates). Scalability: a dynamic data stream of event information. Number of events unknown: difficult to estimate, and constantly changing.
  • 13. Clustering Social Media Documents (outline). Social media document representation; social media document similarity; social media document clustering framework; similarity metric learning for clustering (ensemble-based, classification-based); evaluation results.
  • 19. Social Media Document Features. Title, Description, Tags, Date/Time, Location, All-Text.
  • 31. Social Media Document Similarity. Text features (Title, Description, Tags, All-Text): cosine similarity of tf-idf vectors (tf-idf version?; stemming?; stop-word elimination?). Time (Date/Time): proximity in minutes. Location: geo-coordinate proximity.
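The three kinds of similarity above can be sketched in code. The slides deliberately leave the normalizations open (the question marks on the tf-idf variant, and how minute gaps or distances map to scores), so the decay constants `scale` and `scale_km` below are illustrative assumptions, not values from the paper.

```python
import math
from collections import Counter

def cosine_tfidf(doc_a, doc_b, df, n_docs):
    """Cosine similarity of tf-idf vectors built from token lists.

    df: document frequency per term; n_docs: collection size.
    """
    def tfidf(tokens):
        tf = Counter(tokens)
        return {t: c * math.log(n_docs / (1 + df.get(t, 0))) for t, c in tf.items()}
    va, vb = tfidf(doc_a), tfidf(doc_b)
    dot = sum(va[t] * vb.get(t, 0.0) for t in va)
    na = math.sqrt(sum(w * w for w in va.values()))
    nb = math.sqrt(sum(w * w for w in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def time_similarity(minutes_a, minutes_b, scale=60.0):
    """Proximity in minutes, mapped to (0, 1]; `scale` is an assumed decay."""
    return 1.0 / (1.0 + abs(minutes_a - minutes_b) / scale)

def geo_similarity(lat_a, lon_a, lat_b, lon_b, scale_km=50.0):
    """Geo-coordinate proximity via haversine distance, mapped to (0, 1]."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat_a), math.radians(lat_b)
    dp, dl = math.radians(lat_b - lat_a), math.radians(lon_b - lon_a)
    h = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    dist_km = 2 * r * math.asin(math.sqrt(h))
    return 1.0 / (1.0 + dist_km / scale_km)
```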
  • 36. General Clustering Framework: social media documents → document feature representation → event clusters.
  • 42. Clustering Algorithm. Many alternatives possible! [Berkhin 2002]. Single-pass incremental clustering algorithm: a scalable, online solution, used effectively for event identification in textual news; does not require a priori knowledge of the number of clusters. Parameters: similarity function σ and threshold μ.
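The single-pass scheme can be sketched as follows: each arriving document is compared against the existing clusters, joins the best match if its similarity exceeds μ, and otherwise seeds a new cluster. The `similarity` callback stands in for whichever metric σ is plugged in; comparing against full member lists rather than centroids is a simplification of this sketch.

```python
def single_pass_cluster(documents, similarity, mu):
    """Single-pass incremental clustering.

    documents:  iterable of documents, in arrival order
    similarity: function (doc, cluster_members) -> score
    mu:         threshold for joining an existing cluster
    """
    clusters = []  # each cluster is a list of member documents
    for doc in documents:
        best, best_score = None, mu
        for cluster in clusters:
            score = similarity(doc, cluster)
            if score > best_score:  # join only when sigma exceeds mu
                best, best_score = cluster, score
        if best is not None:
            best.append(doc)
        else:
            clusters.append([doc])  # the number of clusters grows as needed
    return clusters
```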
  • 45. Cluster Representation and Parameter Tuning. Centroid cluster representation: average tf-idf scores, average time, geographic mid-point. Parameters are tuned in a supervised training phase. Clustering quality metrics to optimize: Normalized Mutual Information (NMI) [Strehl et al. 2002] and B-Cubed [Amigó et al. 2008].
  • 48. Clustering Quality Metrics. Desirable characteristics of clusters: homogeneity ✔ and completeness ✔. Both are captured by NMI and by B-Cubed. Optimize both metrics using a single (Pareto optimal) objective function: NMI + B-Cubed.
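NMI itself is straightforward to compute from scratch. The sketch below uses the geometric-mean normalization MI/√(H(T)·H(P)); other normalizations (arithmetic mean, max) appear in the literature, and which variant the evaluation used is an assumption here.

```python
from collections import Counter
from math import log, sqrt

def nmi(labels_true, labels_pred):
    """Normalized Mutual Information between two clusterings of the same items."""
    n = len(labels_true)
    ct, cp = Counter(labels_true), Counter(labels_pred)
    joint = Counter(zip(labels_true, labels_pred))
    # Mutual information between the two label distributions.
    mi = sum((c / n) * log(n * c / (ct[t] * cp[p]))
             for (t, p), c in joint.items())
    # Entropies of each partition.
    h_t = -sum((c / n) * log(c / n) for c in ct.values())
    h_p = -sum((c / n) * log(c / n) for c in cp.values())
    if h_t == 0.0 or h_p == 0.0:  # a single-cluster partition
        return 1.0 if h_t == h_p else 0.0
    return mi / sqrt(h_t * h_p)
```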
  • 57. Learning a Similarity Metric for Clustering. Ensemble-based similarity: training a cluster ensemble, then computing a similarity score by combining individual partitions or combining individual similarities. Classification-based similarity: training-data sampling strategies and modeling strategies.
  • 60. Overview of a Cluster Ensemble Algorithm. Feature-specific clusterers Ctitle, Ctags, Ctime each partition the data. A consensus function f(C,W) combines the ensemble similarities using weights Wtitle, Wtags, Wtime, learned in a training step, to produce the ensemble clustering solution. For each document di and cluster cj, each clusterer casts a vote: σCtitle(di,cj) > μCtitle, σCtags(di,cj) > μCtags, σCtime(di,cj) > μCtime.
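One plausible reading of the consensus step is a weighted vote: each feature-specific clusterer decides whether σ(di, cj) exceeds its threshold μ, and f(C, W) aggregates the 0/1 votes by the learned weights. A minimal sketch under that assumption; the triple encoding is illustrative, not the paper's data structure.

```python
def ensemble_similarity(doc, cluster, clusterers):
    """Weighted consensus over an ensemble of feature-specific clusterers.

    clusterers: list of (weight, sigma, mu) triples, one per feature
                (e.g., title, tags, time), where sigma(doc, cluster) is that
                feature's similarity score and mu its learned threshold.
    Returns the weighted fraction of clusterers voting "same event".
    """
    total = sum(w for w, _, _ in clusterers)
    votes = sum(w for w, sigma, mu in clusterers if sigma(doc, cluster) > mu)
    return votes / total if total else 0.0
```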
  • 70. Classification-based Similarity Metrics. Classify pairs of documents as similar/dissimilar. Feature vector: pairwise similarity scores, with one feature per similarity metric (e.g., time-proximity, location-proximity, …). Modeling strategies: document pairs, or document-centroid pairs.
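Once such a pairwise classifier is trained, its positive-class probability can serve directly as the similarity score that drives clustering. A minimal logistic-regression-style sketch; the weights and bias would come from training CLASS-LR on labeled pairs, so the values in the usage line below are placeholders.

```python
import math

def logistic_similarity(pair_features, weights, bias):
    """Similarity of a document pair = P(same event) under logistic regression.

    pair_features: one similarity score per base metric, e.g.
                   [title_sim, tags_sim, time_proximity, location_proximity]
    """
    z = bias + sum(w * x for w, x in zip(weights, pair_features))
    return 1.0 / (1.0 + math.exp(-z))
```

For example, `logistic_similarity([0.8, 0.9, 0.7, 0.6], [1.5, 2.0, 1.0, 0.5], -2.0)` scores one pair from its four base similarities.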
  • 74. Training Classification-based Similarity. Challenge: most document pairs do not correspond to the same event, giving a skewed label distribution and small, highly homogeneous clusters. Sampling strategies: Random (select a document at random, then randomly create one positive and one negative example); Time-based (create examples for the first N×N documents, then resample so that the label distribution is balanced).
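The random strategy can be sketched as follows. Gold event labels are available during training; details such as seeding the RNG and how the cross-event partner is drawn are implementation choices of this sketch, not fixed by the slide.

```python
import random

def sample_random_pairs(docs, events, n_draws, seed=0):
    """Random sampling: for each draw, pick a document at random and create
    one positive (same event) and one negative (different event) example.

    docs:   list of documents
    events: parallel list of gold event labels (at least two distinct events)
    Returns a list of ((doc_a, doc_b), label) examples with balanced labels.
    """
    rng = random.Random(seed)
    by_event = {}
    for d, e in zip(docs, events):
        by_event.setdefault(e, []).append(d)
    examples = []
    for _ in range(n_draws):
        i = rng.randrange(len(docs))
        anchor, ev = docs[i], events[i]
        pos = rng.choice(by_event[ev])                     # same-event partner
        neg_ev = rng.choice([e for e in by_event if e != ev])
        neg = rng.choice(by_event[neg_ev])                 # cross-event partner
        examples.append(((anchor, pos), 1))
        examples.append(((anchor, neg), 0))
    return examples
```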
  • 77. Experiments: Alternative Similarity Metrics. Ensemble-based techniques: combining individual partitions (ENS-PART), combining individual similarities (ENS-SIM). Classification-based techniques: modeling with document-document vs. document-centroid pairs; time-based vs. random sampling; Logistic Regression (CLASS-LR) and Support Vector Machines (CLASS-SVM). Baselines: Title, Description, Tags, All-Text, Time-Proximity, Location-Proximity.
  • 81. Experimental Setup. Datasets: Upcoming (>270K Flickr photos; event labels from the “upcoming” event database, e.g., upcoming:event=12345; split into 3 parts for training/validation/testing) and LastFM (>594K Flickr photos; event labels from the last.fm music catalog, e.g., lastfm:event=6789; used as an additional test set).
  • 91. Clustering Accuracy over Upcoming Test Set.

        Algorithm   NMI     B-Cubed
        All-Text    0.9240  0.7697
        Tags        0.9229  0.7676
        ENS-PART    0.9296  0.7819
        ENS-SIM     0.9322  0.7861
        CLASS-SVM   0.9425  0.8095
        CLASS-LR    0.9444  0.8155

    All similarity learning techniques outperform the baselines, and classification-based techniques perform better than ensemble-based techniques.
  • 95. Statistical Significance Analysis. Clustering results for 10 partitions of the Upcoming test set; differences are significant using the Friedman test, p<0.05, followed by post-hoc analysis.
  • 96. NMI: Clustering Accuracy over Both Test Sets (Upcoming and LastFM). Similarity learning models trained on Upcoming data show similar trends when tested on LastFM data.
  • 97. Conclusions. Structured context features of social media documents are effective complementary cues for social media document similarity; Tags and Time-Proximity are among the highest-weighted features. Domain-appropriate similarity metrics: a weighted combination yields high-quality clustering results that significantly outperform text-only techniques. Similarity learning models generalize to unseen data sets.
  • 101. Current and Future Work. Improving clustering accuracy with social media “links” [SSM '10 poster]; capturing event content across sites (YouTube, Flickr, Twitter); designing event search strategies.
  • 102. Thank You!