Learning Similarity Metrics for Event Identification in Social Media
Hila Becker, Luis Gravano (Columbia University)
Mor Naaman (Rutgers University)
“Event” Content in Social Media Sites
“Event” = something that occurs at a certain time in a certain place [Yang et al. '99]
Smaller events, without traditional
news coverage
Popular, widely known events
Identifying Events and Associated
Social Media Documents
 Applications
 Event browsing
 Local event search
 …
 General approach: group similar documents
via clustering
Each cluster corresponds to one event and its
associated social media documents
Event Identification: Challenges
 Uneven data quality
 Missing, short, uninformative text
 … but revealing structured context available:
tags, date/time, geo-coordinates
 Scalability
 Dynamic data stream of event information
 Number of events unknown
 Difficult to estimate
 Constantly changing
Clustering Social Media Documents
 Social media document representation
 Social media document similarity
 Social media document clustering
framework
 Similarity metric learning for clustering
 Ensemble-based
 Classification-based
 Evaluation results
Social Media Document Features
Title
Description
Tags
Date/Time
Location
All-Text
Social Media Document Similarity
Features: Title, Description, Tags, Location, All-Text, Date/Time
 Text: cosine similarity of tf-idf vectors
(tf-idf version?; stemming?; stop-word elimination?)
 Time: proximity in minutes
 Location: geo-coordinate proximity
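The three similarity families above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the slide leaves the tf-idf variant open, so the idf values are simply passed in; the one-year window used to scale time proximity and the haversine-based scaling of location proximity are assumptions chosen so that every score falls in [0, 1]. All function names are hypothetical.

```python
import math
from collections import Counter

def tfidf_vector(tokens, idf):
    """Weight raw term counts by caller-supplied idf values (variant is an assumption)."""
    counts = Counter(tokens)
    return {term: count * idf.get(term, 0.0) for term, count in counts.items()}

def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # empty text fields carry no signal
    return dot / (norm_u * norm_v)

MINUTES_PER_YEAR = 365 * 24 * 60  # assumed normalization window

def time_similarity(minutes_a, minutes_b, window=MINUTES_PER_YEAR):
    """Proximity in minutes, scaled to [0, 1]; gaps beyond the window score 0."""
    return max(0.0, 1.0 - abs(minutes_a - minutes_b) / window)

EARTH_RADIUS_KM = 6371.0
HALF_CIRCUMFERENCE_KM = math.pi * EARTH_RADIUS_KM  # largest possible distance

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (latitude, longitude) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))

def location_similarity(point_a, point_b):
    """Geo-coordinate proximity, scaled to [0, 1] (scaling is an assumption)."""
    distance = haversine_km(point_a[0], point_a[1], point_b[0], point_b[1])
    return 1.0 - distance / HALF_CIRCUMFERENCE_KM
```

Keeping every feature-specific score on the same [0, 1] scale makes the scores easier to combine later, whether by weighting or by feeding them to a classifier.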
General Clustering Framework
[Diagram: social media documents → document feature representation → event clusters]
Clustering Algorithm
 Many alternatives possible! [Berkhin 2002]
 Single-pass incremental clustering algorithm
 Scalable, online solution
 Used effectively for event identification in textual
news
 Does not require a priori knowledge of number of
clusters
 Parameters:
 Similarity Function σ
 Threshold μ
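A single-pass incremental algorithm of the kind described above can be sketched as follows. This is an illustrative sketch under stated assumptions: the function name and the `make_centroid` / `update_centroid` callbacks are hypothetical, and a document joins its best cluster only when the similarity σ strictly exceeds the threshold μ.

```python
def single_pass_cluster(documents, sigma, mu, make_centroid, update_centroid):
    """Assign each document, in arrival order, to its most similar cluster.

    sigma(doc, centroid) is the similarity function and mu the threshold.
    A new cluster is opened whenever no existing centroid scores above mu,
    so the number of clusters never has to be known in advance.
    """
    centroids = []  # one summary per cluster
    members = []    # documents assigned to each cluster
    for doc in documents:
        best_index, best_score = -1, float("-inf")
        for index, centroid in enumerate(centroids):
            score = sigma(doc, centroid)
            if score > best_score:
                best_index, best_score = index, score
        if best_index >= 0 and best_score > mu:
            members[best_index].append(doc)
            size = len(members[best_index])
            centroids[best_index] = update_centroid(centroids[best_index], doc, size)
        else:
            centroids.append(make_centroid(doc))
            members.append([doc])
    return members
```

Each document is compared against the current centroids exactly once, which is what makes the approach suitable for a stream; the trade-off is that the result depends on arrival order.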
Cluster Representation and
Parameter Tuning
 Centroid cluster representation
 Average tf-idf scores
 Average time
 Geographic mid-point
 Parameter tuning in supervised training
phase
Clustering quality metrics to optimize:
 Normalized Mutual Information (NMI) [Strehl et al. 2002]
 B-Cubed [Amigó et al. 2008]
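The centroid representation listed on this slide (average tf-idf scores, average time, geographic mid-point) might be computed as in the sketch below. The function names are hypothetical, and the mid-point calculation, which averages unit vectors on the sphere, is one common way to define a geographic mid-point rather than a method confirmed by the slides.

```python
import math

def average_tfidf(vectors):
    """Average sparse tf-idf vectors (dicts); terms missing from a vector count as 0."""
    totals = {}
    for vector in vectors:
        for term, weight in vector.items():
            totals[term] = totals.get(term, 0.0) + weight
    count = len(vectors)
    return {term: weight / count for term, weight in totals.items()}

def average_time(minutes):
    """Mean timestamp, in minutes since any fixed reference point."""
    return sum(minutes) / len(minutes)

def geographic_midpoint(points):
    """Mid-point of (latitude, longitude) pairs in degrees.

    Each point is mapped to a unit vector, the vectors are averaged, and the
    mean is projected back to latitude and longitude. This avoids the
    wrap-around errors of averaging raw longitudes.
    """
    x = y = z = 0.0
    for lat, lon in points:
        phi, lam = math.radians(lat), math.radians(lon)
        x += math.cos(phi) * math.cos(lam)
        y += math.cos(phi) * math.sin(lam)
        z += math.sin(phi)
    count = len(points)
    x, y, z = x / count, y / count, z / count
    mid_lon = math.atan2(y, x)
    mid_lat = math.atan2(z, math.hypot(x, y))
    return math.degrees(mid_lat), math.degrees(mid_lon)
```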
Clustering Quality Metrics
 Characteristics of clusters:
 Homogeneity
 Completeness
 Captured by both NMI and B-Cubed
 Optimize both metrics using a single (Pareto optimal) objective function: NMI+B-Cubed
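To make the two metrics concrete, here is a small sketch that scores a clustering against ground-truth event labels. It assumes one common formulation of each metric: NMI normalized by the mean of the two entropies, and B-Cubed reported as the harmonic mean of its average precision and recall. The slides do not state which variants were used, so treat both choices, and the function names, as assumptions.

```python
import math
from collections import Counter

def nmi(clusters, events):
    """Normalized mutual information between cluster ids and event labels."""
    n = len(clusters)
    joint = Counter(zip(clusters, events))
    cluster_sizes = Counter(clusters)
    event_sizes = Counter(events)
    mutual = sum(
        (count / n) * math.log(count * n / (cluster_sizes[c] * event_sizes[e]))
        for (c, e), count in joint.items()
    )
    h_clusters = -sum((s / n) * math.log(s / n) for s in cluster_sizes.values())
    h_events = -sum((s / n) * math.log(s / n) for s in event_sizes.values())
    if h_clusters + h_events == 0.0:
        return 1.0  # one cluster and one event: trivially identical partitions
    return 2.0 * mutual / (h_clusters + h_events)

def b_cubed(clusters, events):
    """Harmonic mean of per-document average precision and recall."""
    n = len(clusters)
    joint = Counter(zip(clusters, events))
    cluster_sizes = Counter(clusters)
    event_sizes = Counter(events)
    # Precision rewards homogeneity; recall rewards completeness.
    precision = sum(joint[(c, e)] / cluster_sizes[c] for c, e in zip(clusters, events)) / n
    recall = sum(joint[(c, e)] / event_sizes[e] for c, e in zip(clusters, events)) / n
    return 2.0 * precision * recall / (precision + recall)

def tuning_objective(clusters, events):
    """Single objective used to pick parameters: NMI + B-Cubed."""
    return nmi(clusters, events) + b_cubed(clusters, events)
```

Lumping two events into one cluster keeps recall perfect but halves precision, so B-Cubed drops even though every event is still complete; that is the homogeneity failure the slide illustrates.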
Learning a Similarity Metric for Clustering
 Ensemble-based similarity
 Training a cluster ensemble
 Computing a similarity score by:
 Combining individual partitions
 Combining individual similarities
 Classification-based similarity
 Training data sampling strategies
 Modeling strategies
Overview of a Cluster Ensemble Algorithm
 Individual clusterers, one per feature: Ctitle, Ctags, Ctime
 Clusterer weights, learned in a training step: Wtitle, Wtags, Wtime
 Consensus function f(C,W): combine ensemble similarities
 For each document di and cluster cj, each clusterer applies its own threshold:
 σCtitle(di,cj) > μCtitle
 σCtags(di,cj) > μCtags
 σCtime(di,cj) > μCtime
 Output: ensemble clustering solution
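The consensus function f(C,W) can take two forms, matching the two ensemble variants named elsewhere in the deck (ENS-PART and ENS-SIM). The sketch below is an illustration only: the function names are hypothetical, and dividing by the total weight so that scores stay in [0, 1] is an assumption.

```python
def ens_part_score(similarities, thresholds, weights):
    """Combine individual partitions: a weighted vote.

    Each clusterer votes 1 when its similarity sigma(d_i, c_j) exceeds its own
    threshold mu, and 0 otherwise; votes are weighted by the learned weights.
    """
    total = sum(weights.values())
    votes = sum(
        weights[name] for name in weights if similarities[name] > thresholds[name]
    )
    return votes / total

def ens_sim_score(similarities, weights):
    """Combine individual similarities: a weighted average of the raw scores."""
    total = sum(weights.values())
    return sum(weights[name] * similarities[name] for name in weights) / total
```

The voting form discards how far a score sits from its threshold, while the averaging form keeps that information; either combined score can then be plugged into a clustering step as its similarity function.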
Classification-based Similarity Metrics
 Classify pairs of documents as similar/dissimilar
 Feature vector
 Pairwise similarity scores
 One feature per similarity metric (e.g., time-proximity, location-proximity, …)
 Modeling strategies
 Document pairs
 Document-centroid pairs
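As a rough sketch of the idea above, the code below trains a tiny logistic regression on pairwise feature vectors and uses the predicted probability of "same event" as the learned similarity. It is written in plain Python for self-containment; the batch gradient descent trainer, its hyperparameters, and all names are assumptions for illustration, not the setup used in the experiments.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(features, labels, epochs=2000, rate=0.5):
    """Fit weights and bias by batch gradient descent on the logistic loss.

    Each row of `features` holds one score per similarity metric for a pair
    (for example text, time, and location similarity); the label is 1 when
    the pair belongs to the same event and 0 otherwise.
    """
    dim = len(features[0])
    weights = [0.0] * dim
    bias = 0.0
    n = len(features)
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(features, labels):
            error = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x))) - y
            for i, xi in enumerate(x):
                grad_w[i] += error * xi
            grad_b += error
        weights = [w - rate * g / n for w, g in zip(weights, grad_w)]
        bias -= rate * grad_b / n
    return weights, bias

def learned_similarity(pair_features, weights, bias):
    """Probability that the pair describes the same event, used as sigma."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, pair_features)))
```

The same feature vector works for either modeling strategy: the second element of the pair can be another document or a cluster centroid.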
Training Classification-based Similarity
 Challenge: most document pairs do not correspond to the
same event
 Skewed label distribution
 Small, highly homogeneous clusters
 Sampling strategies
 Random
 Select a document at random
 Randomly create one positive and one negative example
 Time-based
 Create examples from all N×N pairs of the first N documents
 Resample such that the label distribution is balanced
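The two sampling strategies can be sketched as below. This is an illustration under assumptions: the function names and the `event_of` mapping are hypothetical, the time-based variant enumerates unordered pairs of the first N documents rather than all ordered pairs, and balancing is done by downsampling the majority label, which is only one of several ways to resample.

```python
import random
from itertools import combinations

def random_pairs(docs, event_of, samples, rng):
    """Random strategy: pick a document, then one positive and one negative partner."""
    examples = []
    for _ in range(samples):
        anchor = rng.choice(docs)
        same = [d for d in docs if d != anchor and event_of[d] == event_of[anchor]]
        other = [d for d in docs if event_of[d] != event_of[anchor]]
        if same and other:  # skip anchors that cannot form both kinds of pair
            examples.append((anchor, rng.choice(same), 1))
            examples.append((anchor, rng.choice(other), 0))
    return examples

def time_based_pairs(docs_by_time, event_of, n, rng):
    """Time-based strategy: pair up the first n documents, then balance the labels."""
    head = docs_by_time[:n]
    pairs = [(a, b, int(event_of[a] == event_of[b])) for a, b in combinations(head, 2)]
    positives = [p for p in pairs if p[2] == 1]
    negatives = [p for p in pairs if p[2] == 0]
    keep = min(len(positives), len(negatives))
    return rng.sample(positives, keep) + rng.sample(negatives, keep)
```

Passing in a seeded `random.Random` instance keeps the sampled training set reproducible across runs.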
Experiments: Alternative Similarity
Metrics
 Ensemble-based techniques
 Combining individual partitions (ENS-PART)
 Combining individual similarities (ENS-SIM)
 Classification-based techniques
 Modeling: document-document vs. document-centroid pairs
 Sampling: time-based vs. random
 Logistic Regression (CLASS-LR), Support Vector Machines
(CLASS-SVM)
 Baselines
 Title, Description, Tags, All-Text, Time-Proximity, Location-Proximity
Experimental Setup
 Datasets:
 Upcoming
 >270K Flickr photos
 Event labels from the “upcoming” event database
(upcoming:event=12345)
 Split into 3 parts for training/validation/testing
 LastFM
 >594K Flickr photos
 Event labels from last.fm music catalog (lastfm:event=6789)
 Used as an additional test set
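The event labels above come from tags of the form shown on the slide (upcoming:event=12345, lastfm:event=6789). A minimal sketch of extracting the ground-truth label from a photo's tag list follows; the function name is hypothetical, and the assumption that event ids are purely numeric is based only on the two examples given.

```python
import re

# Matches the two tag formats shown on the slide; numeric ids are an assumption.
MACHINE_TAG = re.compile(r"^(upcoming|lastfm):event=(\d+)$")

def event_label(tags):
    """Return (source, event_id) for the first event tag found, or None."""
    for tag in tags:
        match = MACHINE_TAG.match(tag)
        if match:
            return match.group(1), int(match.group(2))
    return None
```

Photos that share a label form one ground-truth cluster, which is what the quality metrics compare the predicted clusters against.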
Clustering Accuracy over Upcoming Test Set
 All similarity learning techniques outperform the baselines
 Classification-based techniques perform better than
ensemble-based techniques
Algorithm NMI B-Cubed
All-Text 0.9240 0.7697
Tags 0.9229 0.7676
ENS-PART 0.9296 0.7819
ENS-SIM 0.9322 0.7861
CLASS-SVM 0.9425 0.8095
CLASS-LR 0.9444 0.8155
Statistical Significance Analysis
 Clustering results for 10 partitions of Upcoming
test set
 Significant using Friedman test, p<0.05
 Post-hoc analysis:
NMI: Clustering Accuracy over Both Test Sets
[Chart: NMI on the Upcoming and LastFM test sets]
 Similarity learning models trained on Upcoming
data show similar trends when tested on LastFM
data
Conclusions
 Structured context features of social media documents
 Effective complementary cues for social media
document similarity
 Tags, Time-Proximity among highest weighted
features
 Domain-appropriate similarity metrics
 Weighted combination yields high quality clustering
results
 Significantly outperform text-only techniques
 Similarity learning models generalize to unseen data
sets
Current and Future Work
 Improving clustering accuracy with social media “links” [SSM '10 poster]
 Capturing event content across sites (YouTube,
Flickr, Twitter)
 Designing event search strategies
Thank You!

More Related Content

Similar to Learning Similarity Metrics for Event Identification in Social Media

Contextual Recommendation of Social Updates, a tag-based framework
Contextual Recommendation of Social Updates, a tag-based frameworkContextual Recommendation of Social Updates, a tag-based framework
Contextual Recommendation of Social Updates, a tag-based frameworkAdrien Joly
 
Effective Event Identification in Social Media
Effective Event Identification in Social MediaEffective Event Identification in Social Media
Effective Event Identification in Social MediaWei-Yuan Chang
 
20080509 Friday Food Manchester United Business School
20080509 Friday Food Manchester United Business School20080509 Friday Food Manchester United Business School
20080509 Friday Food Manchester United Business Schoolimec.archive
 
Digital Trails Dave King 1 5 10 Part 2 D3
Digital Trails   Dave King   1 5 10   Part 2   D3Digital Trails   Dave King   1 5 10   Part 2   D3
Digital Trails Dave King 1 5 10 Part 2 D3Dave King
 
Social Information Access: A Personal Update
Social Information Access: A Personal UpdateSocial Information Access: A Personal Update
Social Information Access: A Personal UpdateDaqing He
 
PhD Defense - A Context Management Framework based on Wisdom of Crowds for So...
PhD Defense - A Context Management Framework based on Wisdom of Crowds for So...PhD Defense - A Context Management Framework based on Wisdom of Crowds for So...
PhD Defense - A Context Management Framework based on Wisdom of Crowds for So...Adrien Joly
 
An Empirical Study on Criteria for Assessing Information Quality in Corporate...
An Empirical Study on Criteria for Assessing Information Quality in Corporate...An Empirical Study on Criteria for Assessing Information Quality in Corporate...
An Empirical Study on Criteria for Assessing Information Quality in Corporate...Wolfgang Reinhardt
 
Methodology of CVE Research - Sajid Amit.pdf
Methodology of CVE Research - Sajid Amit.pdfMethodology of CVE Research - Sajid Amit.pdf
Methodology of CVE Research - Sajid Amit.pdfSajid Amit
 
Methods for measuring citizen-science impact
Methods for measuring citizen-science impactMethods for measuring citizen-science impact
Methods for measuring citizen-science impactLuigi Ceccaroni
 
Text mining and analytics v6 - p1
Text mining and analytics   v6 - p1Text mining and analytics   v6 - p1
Text mining and analytics v6 - p1Dave King
 
Metadata Quality Assurance
Metadata Quality AssuranceMetadata Quality Assurance
Metadata Quality AssurancePéter Király
 
Information Seeking with Social Signals: Anatomy of a Social Tag-based Explor...
Information Seeking with Social Signals: Anatomy of a Social Tag-based Explor...Information Seeking with Social Signals: Anatomy of a Social Tag-based Explor...
Information Seeking with Social Signals: Anatomy of a Social Tag-based Explor...Ed Chi
 
Sweeny ux-seo om-cap 2014_v3
Sweeny ux-seo om-cap 2014_v3Sweeny ux-seo om-cap 2014_v3
Sweeny ux-seo om-cap 2014_v3Marianne Sweeny
 
Needs Assessment and Program Planning
Needs Assessment and Program PlanningNeeds Assessment and Program Planning
Needs Assessment and Program PlanningLynda Milne
 
Standardized Annotations for Survey Datasets: Enabling Automated Quality Assu...
Standardized Annotations for Survey Datasets: Enabling Automated Quality Assu...Standardized Annotations for Survey Datasets: Enabling Automated Quality Assu...
Standardized Annotations for Survey Datasets: Enabling Automated Quality Assu...Inspirient
 
If You Tag it, Will They Come? Metadata Quality and Repository Management
If You Tag it, Will They Come? Metadata Quality and Repository ManagementIf You Tag it, Will They Come? Metadata Quality and Repository Management
If You Tag it, Will They Come? Metadata Quality and Repository ManagementSarah Currier
 
Evolving social data mining and affective analysis
Evolving social data mining and affective analysis  Evolving social data mining and affective analysis
Evolving social data mining and affective analysis Athena Vakali
 
MCN2017 Workshop: Web Analytics and SEO
MCN2017 Workshop: Web Analytics and SEOMCN2017 Workshop: Web Analytics and SEO
MCN2017 Workshop: Web Analytics and SEOBrian Alpert
 
Building AI Applications using Knowledge Graphs
Building AI Applications using Knowledge GraphsBuilding AI Applications using Knowledge Graphs
Building AI Applications using Knowledge GraphsAndre Freitas
 

Similar to Learning Similarity Metrics for Event Identification in Social Media (20)

Contextual Recommendation of Social Updates, a tag-based framework
Contextual Recommendation of Social Updates, a tag-based frameworkContextual Recommendation of Social Updates, a tag-based framework
Contextual Recommendation of Social Updates, a tag-based framework
 
Effective Event Identification in Social Media
Effective Event Identification in Social MediaEffective Event Identification in Social Media
Effective Event Identification in Social Media
 
20080509 Friday Food Manchester United Business School
20080509 Friday Food Manchester United Business School20080509 Friday Food Manchester United Business School
20080509 Friday Food Manchester United Business School
 
Digital Trails Dave King 1 5 10 Part 2 D3
Digital Trails   Dave King   1 5 10   Part 2   D3Digital Trails   Dave King   1 5 10   Part 2   D3
Digital Trails Dave King 1 5 10 Part 2 D3
 
Social Information Access: A Personal Update
Social Information Access: A Personal UpdateSocial Information Access: A Personal Update
Social Information Access: A Personal Update
 
PhD Defense - A Context Management Framework based on Wisdom of Crowds for So...
PhD Defense - A Context Management Framework based on Wisdom of Crowds for So...PhD Defense - A Context Management Framework based on Wisdom of Crowds for So...
PhD Defense - A Context Management Framework based on Wisdom of Crowds for So...
 
An Empirical Study on Criteria for Assessing Information Quality in Corporate...
An Empirical Study on Criteria for Assessing Information Quality in Corporate...An Empirical Study on Criteria for Assessing Information Quality in Corporate...
An Empirical Study on Criteria for Assessing Information Quality in Corporate...
 
Methodology of CVE Research - Sajid Amit.pdf
Methodology of CVE Research - Sajid Amit.pdfMethodology of CVE Research - Sajid Amit.pdf
Methodology of CVE Research - Sajid Amit.pdf
 
Methods for measuring citizen-science impact
Methods for measuring citizen-science impactMethods for measuring citizen-science impact
Methods for measuring citizen-science impact
 
Text mining and analytics v6 - p1
Text mining and analytics   v6 - p1Text mining and analytics   v6 - p1
Text mining and analytics v6 - p1
 
Metadata Quality Assurance
Metadata Quality AssuranceMetadata Quality Assurance
Metadata Quality Assurance
 
Information Seeking with Social Signals: Anatomy of a Social Tag-based Explor...
Information Seeking with Social Signals: Anatomy of a Social Tag-based Explor...Information Seeking with Social Signals: Anatomy of a Social Tag-based Explor...
Information Seeking with Social Signals: Anatomy of a Social Tag-based Explor...
 
Sweeny ux-seo om-cap 2014_v3
Sweeny ux-seo om-cap 2014_v3Sweeny ux-seo om-cap 2014_v3
Sweeny ux-seo om-cap 2014_v3
 
Needs Assessment and Program Planning
Needs Assessment and Program PlanningNeeds Assessment and Program Planning
Needs Assessment and Program Planning
 
Standardized Annotations for Survey Datasets: Enabling Automated Quality Assu...
Standardized Annotations for Survey Datasets: Enabling Automated Quality Assu...Standardized Annotations for Survey Datasets: Enabling Automated Quality Assu...
Standardized Annotations for Survey Datasets: Enabling Automated Quality Assu...
 
If You Tag it, Will They Come? Metadata Quality and Repository Management
If You Tag it, Will They Come? Metadata Quality and Repository ManagementIf You Tag it, Will They Come? Metadata Quality and Repository Management
If You Tag it, Will They Come? Metadata Quality and Repository Management
 
Evolving social data mining and affective analysis
Evolving social data mining and affective analysis  Evolving social data mining and affective analysis
Evolving social data mining and affective analysis
 
MCN2017 Workshop: Web Analytics and SEO
MCN2017 Workshop: Web Analytics and SEOMCN2017 Workshop: Web Analytics and SEO
MCN2017 Workshop: Web Analytics and SEO
 
McGeary Data Curation Network: Developing and Scaling
McGeary Data Curation Network: Developing and ScalingMcGeary Data Curation Network: Developing and Scaling
McGeary Data Curation Network: Developing and Scaling
 
Building AI Applications using Knowledge Graphs
Building AI Applications using Knowledge GraphsBuilding AI Applications using Knowledge Graphs
Building AI Applications using Knowledge Graphs
 

Learning Similarity Metrics for Event Identification in Social Media

  • 1. Learning Similarity Metrics for Event Identification in Social Media Hila Becker, Luis Gravano Mor Naaman Columbia University Rutgers University
  • 2. “Event” Content in Social Media Sites
  • 3. “Event” Content in Social Media Sites “Event”= something that occurs at a certain time in a certain place [Yang et al. ‟99] Smaller events, without traditional news coverage Popular, widely known events
  • 4. Identifying Events and Associated Social Media Documents
  • 5. Identifying Events and Associated Social Media Documents  Applications  Event browsing  Local event search  …  General approach: group similar documents via clustering Each cluster corresponds to one event and its associated social media documents
  • 6. Identifying Events and Associated Social Media Documents  Applications  Event browsing  Local event search  …  General approach: group similar documents via clustering Each cluster corresponds to one event and its associated social media documents
  • 7. Identifying Events and Associated Social Media Documents  Applications  Event browsing  Local event search  …  General approach: group similar documents via clustering Each cluster corresponds to one event and its associated social media documents
  • 9. Event Identification: Challenges  Uneven data quality  Missing, short, uninformative text  … but revealing structured context available: tags, date/time, geo-coordinates  Scalability  Dynamic data stream of event information  Number of events unknown  Difficult to estimate  Constantly changing
  • 10. Event Identification: Challenges  Uneven data quality  Missing, short, uninformative text  … but revealing structured context available: tags, date/time, geo-coordinates  Scalability  Dynamic data stream of event information  Number of events unknown  Difficult to estimate  Constantly changing
  • 11. Event Identification: Challenges  Uneven data quality  Missing, short, uninformative text  … but revealing structured context available: tags, date/time, geo-coordinates  Scalability  Dynamic data stream of event information  Number of events unknown  Difficult to estimate  Constantly changing
  • 12. Event Identification: Challenges  Uneven data quality  Missing, short, uninformative text  … but revealing structured context available: tags, date/time, geo-coordinates  Scalability  Dynamic data stream of event information  Number of events unknown  Difficult to estimate  Constantly changing
  • 14. Clustering Social Media Documents  Social media document representation  Social media document similarity  Social media document clustering framework  Similarity metric learning for clustering  Ensemble-based  Classification-based  Evaluation results
  • 15. Clustering Social Media Documents  Social media document representation  Social media document similarity  Social media document clustering framework  Similarity metric learning for clustering  Ensemble-based  Classification-based  Evaluation results
  • 16. Clustering Social Media Documents  Social media document representation  Social media document similarity  Social media document clustering framework  Similarity metric learning for clustering  Ensemble-based  Classification-based  Evaluation results
  • 17. Clustering Social Media Documents  Social media document representation  Social media document similarity  Social media document clustering framework  Similarity metric learning for clustering  Ensemble-based  Classification-based  Evaluation results
  • 18. Clustering Social Media Documents  Social media document representation  Social media document similarity  Social media document clustering framework  Similarity metric learning for clustering  Ensemble-based  Classification-based  Evaluation results
  • 30. Social Media Document Features: Title, Description, Tags, Date/Time, Location, All-Text
  • 34. Social Media Document Similarity (features: Title, Description, Tags, All-Text, Date/Time, Location)
      - Text: cosine similarity of tf-idf vectors (tf-idf version?; stemming?; stop-word elimination?)
      - Time: proximity in minutes
      - Location: geo-coordinate proximity
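The three per-feature similarities named on the slide above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the one-year time window, and the normalization of distance by half the Earth's circumference are all assumptions, and the slide itself leaves the tf-idf variant, stemming, and stop-word handling open.

```python
import math
from collections import Counter

def tfidf_vector(tokens, idf):
    """Weight raw term counts by an idf table (terms missing from it get idf 1.0)."""
    counts = Counter(tokens)
    return {term: count * idf.get(term, 1.0) for term, count in counts.items()}

def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def time_similarity(t1_minutes, t2_minutes, window_minutes=365 * 24 * 60):
    """Proximity in minutes mapped to [0, 1]; the one-year window is an assumption."""
    gap = abs(t1_minutes - t2_minutes)
    return max(0.0, 1.0 - gap / window_minutes)

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * 6371.0 * math.asin(min(1.0, math.sqrt(a)))

def location_similarity(p, q, max_km=20015.0):
    """Geo-coordinate proximity mapped to [0, 1]; max_km (half the Earth's
    circumference) is an assumed normalizer."""
    return max(0.0, 1.0 - haversine_km(*p, *q) / max_km)
```

Each function returns a score in [0, 1], so the scores can later be compared against a threshold or fed into a learned combination.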
  • 36. General Clustering Framework: Social media documents → Document feature representation → Event clusters
  • 43. Clustering Algorithm
      - Many alternatives possible! [Berkhin 2002]
      - Single-pass incremental clustering algorithm
          - Scalable, online solution
          - Used effectively for event identification in textual news
          - Does not require a priori knowledge of number of clusters
      - Parameters:
          - Similarity function σ
          - Threshold μ
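A single-pass incremental clusterer with a similarity function σ and a threshold μ can be sketched roughly as below. The `Cluster` class, the dict-based document vectors, and the use of cosine similarity as σ are assumptions made for illustration; the slides do not specify these details.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

class Cluster:
    """Hypothetical cluster record holding a running-mean centroid."""
    def __init__(self, doc):
        self.size = 1
        self.centroid = dict(doc)

    def add(self, doc):
        n = self.size
        terms = set(self.centroid) | set(doc)
        self.centroid = {
            t: (self.centroid.get(t, 0.0) * n + doc.get(t, 0.0)) / (n + 1)
            for t in terms
        }
        self.size = n + 1

def single_pass_cluster(docs, sigma, mu):
    """Visit each document once: join the most similar cluster if its score
    exceeds mu, otherwise open a new cluster. No cluster count is needed."""
    clusters, assignments = [], []
    for doc in docs:
        best_idx, best_score = None, -1.0
        for idx, cluster in enumerate(clusters):
            score = sigma(doc, cluster.centroid)
            if score > best_score:
                best_idx, best_score = idx, score
        if best_idx is not None and best_score > mu:
            clusters[best_idx].add(doc)
            assignments.append(best_idx)
        else:
            clusters.append(Cluster(doc))
            assignments.append(len(clusters) - 1)
    return assignments, clusters

docs = [
    {"jazz": 1.0, "nyc": 1.0},
    {"jazz": 1.0, "nyc": 0.5},
    {"marathon": 1.0},
    {"marathon": 1.0, "boston": 1.0},
]
assignments, clusters = single_pass_cluster(docs, cosine, mu=0.5)
```

Because each document is compared only against the current centroids, the cost per document grows with the number of clusters rather than the number of documents seen so far.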
  • 46. Cluster Representation and Parameter Tuning
      - Centroid cluster representation
          - Average tf-idf scores
          - Average time
          - Geographic mid-point
      - Parameter tuning in supervised training phase
      - Clustering quality metrics to optimize:
          - Normalized Mutual Information (NMI) [Strehl et al. 2002]
          - B-Cubed [Amigó et al. 2008]
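The three centroid components listed above could be computed along these lines. The helper names are hypothetical, and the geographic mid-point here averages unit vectors on the sphere, which is one common definition and an assumption about what the slide intends.

```python
import math

def average_tfidf(vectors):
    """Term-wise mean of sparse tf-idf vectors (missing terms count as 0)."""
    n = len(vectors)
    terms = set().union(*vectors)
    return {t: sum(v.get(t, 0.0) for v in vectors) / n for t in terms}

def average_time(times_in_minutes):
    """Mean timestamp of the cluster's documents, in minutes."""
    return sum(times_in_minutes) / len(times_in_minutes)

def geographic_midpoint(points):
    """Mid-point of (lat, lon) pairs in degrees, via the mean 3-D unit vector."""
    x = y = z = 0.0
    for lat, lon in points:
        la, lo = math.radians(lat), math.radians(lon)
        x += math.cos(la) * math.cos(lo)
        y += math.cos(la) * math.sin(lo)
        z += math.sin(la)
    n = len(points)
    x, y, z = x / n, y / n, z / n
    lon = math.atan2(y, x)
    lat = math.atan2(z, math.hypot(x, y))
    return math.degrees(lat), math.degrees(lon)
```

Averaging raw latitude and longitude values would also work for tightly grouped photos, but the vector form avoids problems near the ±180° meridian.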
  • 55. Clustering Quality Metrics
      - Characteristics of clusters:
          - Homogeneity
          - Completeness
      - Captured by both NMI and B-Cubed
      - Optimize both metrics using a single (Pareto optimal) objective function: NMI+B-Cubed
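To make the two metrics concrete, here is a small sketch of B-Cubed and NMI over parallel lists of predicted cluster ids and true event labels. Two choices are assumptions: NMI is normalized by the arithmetic mean of the two entropies (other normalizations exist), and the B-Cubed score entering the combined objective is the harmonic mean of B-Cubed precision and recall.

```python
import math
from collections import Counter

def bcubed(clusters, labels):
    """Per-document precision (homogeneity) and recall (completeness), averaged."""
    n = len(labels)
    cluster_sizes = Counter(clusters)
    label_sizes = Counter(labels)
    joint = Counter(zip(clusters, labels))
    precision = sum(joint[(c, l)] / cluster_sizes[c] for c, l in zip(clusters, labels)) / n
    recall = sum(joint[(c, l)] / label_sizes[l] for c, l in zip(clusters, labels)) / n
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

def entropy(counts, n):
    return -sum((c / n) * math.log(c / n) for c in counts.values())

def nmi(clusters, labels):
    """Mutual information divided by the mean of the two entropies (assumed form)."""
    n = len(labels)
    cluster_sizes = Counter(clusters)
    label_sizes = Counter(labels)
    joint = Counter(zip(clusters, labels))
    mi = sum(
        (nij / n) * math.log(n * nij / (cluster_sizes[c] * label_sizes[l]))
        for (c, l), nij in joint.items()
    )
    h = (entropy(cluster_sizes, n) + entropy(label_sizes, n)) / 2
    return 1.0 if h == 0 else mi / h

def objective(clusters, labels):
    """The combined tuning objective named on the slide: NMI + B-Cubed."""
    return nmi(clusters, labels) + bcubed(clusters, labels)[2]
```

Putting every document in one cluster gives perfect recall but poor precision, which is why both homogeneity and completeness have to be rewarded together.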
  • 58. Learning a Similarity Metric for Clustering
      - Ensemble-based similarity
          - Training a cluster ensemble
          - Computing a similarity score by:
              - Combining individual partitions
              - Combining individual similarities
      - Classification-based similarity
          - Training data sampling strategies
          - Modeling strategies
  • 63. Overview of a Cluster Ensemble Algorithm
      - Clusterers: C_title, C_tags, C_time
      - Weights: W_title, W_tags, W_time (learned in a training step)
      - Consensus function f(C,W): combine ensemble similarities
      - Ensemble clustering solution
  • 68. Overview of a Cluster Ensemble Algorithm
      - For each document d_i and cluster c_j: σ_Ctitle(d_i, c_j) > μ_Ctitle; σ_Ctags(d_i, c_j) > μ_Ctags; σ_Ctime(d_i, c_j) > μ_Ctime
      - Combined by f(C,W) with weights W_title, W_tags, W_time
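The two ways of computing an ensemble score mentioned in this part of the deck, combining individual similarities and combining individual partitions, might look like the sketch below. The exact form of the consensus function f(C,W) is not given on the slides, so the weighted average and the weighted vote here are assumptions, as are the function names and the example weights.

```python
def combine_similarities(scores, weights):
    """Weighted average of per-clusterer similarity scores (assumed form of f)."""
    total = sum(weights.values())
    return sum(weights[name] * scores[name] for name in weights) / total

def combine_partitions(scores, thresholds, weights):
    """Weighted vote: each clusterer votes 'same cluster' when its own
    similarity exceeds its own threshold (assumed form of f)."""
    total = sum(weights.values())
    agreeing = sum(weights[name] for name in weights if scores[name] > thresholds[name])
    return agreeing / total

# Made-up scores for one document d_i against one cluster c_j.
scores = {"title": 0.2, "tags": 0.8, "time": 0.9}
thresholds = {"title": 0.5, "tags": 0.5, "time": 0.5}
weights = {"title": 1.0, "tags": 2.0, "time": 1.0}

sim_score = combine_similarities(scores, weights)
vote_score = combine_partitions(scores, thresholds, weights)
```

Either combined score can then play the role of σ in the single-pass clustering step, with its own threshold μ.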
  • 69. Learning a Similarity Metric for Clustering
      - Classification-based similarity
          - Training data sampling strategies
          - Modeling strategies
  • 71. Classification-based Similarity Metrics
      - Classify pairs of documents as similar/dissimilar
      - Feature vector
          - Pairwise similarity scores
          - One feature per similarity metric (e.g., time-proximity, location-proximity, …)
      - Modeling strategies
          - Document pairs
          - Document-centroid pairs
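As a rough illustration of the classification-based idea, the sketch below builds one feature per similarity metric for a pair and fits a tiny logistic regression with batch gradient descent. The feature names, the toy training data, and the hand-written trainer are all assumptions; the deck names logistic regression and SVMs as models but does not describe an implementation.

```python
import math

FEATURES = ("title", "tags", "time", "location")  # assumed feature order

def pair_features(similarities):
    """One feature per similarity metric, in a fixed order."""
    return [similarities[name] for name in FEATURES]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(X, y, lr=0.5, epochs=500):
    """Plain batch gradient descent on the logistic loss."""
    w, b, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def learned_similarity(w, b, x):
    """Predicted probability that the pair belongs to the same event."""
    return sigmoid(b + sum(wj * xj for wj, xj in zip(w, x)))

# Made-up training pairs: label 1 = same event, 0 = different events.
X = [
    pair_features({"title": 0.9, "tags": 0.8, "time": 1.0, "location": 0.9}),
    pair_features({"title": 0.7, "tags": 0.9, "time": 0.95, "location": 1.0}),
    pair_features({"title": 0.1, "tags": 0.0, "time": 0.3, "location": 0.2}),
    pair_features({"title": 0.2, "tags": 0.1, "time": 0.1, "location": 0.4}),
]
y = [1, 1, 0, 0]
w, b = train_logistic(X, y)
```

The same code covers the document-centroid modeling strategy if the second element of each pair is a cluster centroid rather than a document.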
  • 75. Training Classification-based Similarity
      - Challenge: most document pairs do not correspond to the same event
          - Skewed label distribution
          - Small, highly homogeneous clusters
      - Sampling strategies
          - Random
              - Select a document at random
              - Randomly create one positive and one negative example
          - Time-based
              - Create examples for the first NxN documents
              - Resample such that the label distribution is balanced
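The random sampling strategy can be sketched as follows: pick a document, then pair it once with a document from the same event and once with a document from a different event, which keeps the labels balanced by construction. The function name, the `event_of` mapping, and the photo ids are hypothetical, and the time-based strategy is not shown.

```python
import random

def sample_balanced_pairs(event_of, n_samples, rng):
    """event_of maps document id -> event label. Returns (doc_a, doc_b, label)
    triples with one positive and one negative pair per sampled document."""
    docs = list(event_of)
    by_event = {}
    for doc, event in event_of.items():
        by_event.setdefault(event, []).append(doc)

    pairs = []
    for _ in range(n_samples):
        anchor = rng.choice(docs)
        same = [d for d in by_event[event_of[anchor]] if d != anchor]
        other = [d for d in docs if event_of[d] != event_of[anchor]]
        if not same or not other:
            continue  # singleton event or single-event corpus: skip this draw
        pairs.append((anchor, rng.choice(same), 1))
        pairs.append((anchor, rng.choice(other), 0))
    return pairs

# Made-up labels for illustration.
event_of = {"p1": "e1", "p2": "e1", "p3": "e2", "p4": "e2"}
pairs = sample_balanced_pairs(event_of, 10, random.Random(0))
```

Sampling all pairs uniformly instead would yield mostly negative examples, which is the skew the slide warns about.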
  • 78. Experiments: Alternative Similarity Metrics
      - Ensemble-based techniques
          - Combining individual partitions (ENS-PART)
          - Combining individual similarities (ENS-SIM)
      - Classification-based techniques
          - Modeling: document-document vs. document-centroid pairs
          - Sampling: time-based vs. random
          - Logistic Regression (CLASS-LR), Support Vector Machines (CLASS-SVM)
      - Baselines
          - Title, Description, Tags, All-Text, Time-Proximity, Location-Proximity
  • 82. Experimental Setup
      - Datasets:
          - Upcoming
              - >270K Flickr photos
              - Event labels from the “upcoming” event database (upcoming:event=12345)
              - Split into 3 parts for training/validation/testing
          - LastFM
              - >594K Flickr photos
              - Event labels from last.fm music catalog (lastfm:event=6789)
              - Used as an additional test set
  • 91. Clustering Accuracy over Upcoming Test Set
      - All similarity learning techniques outperform the baselines
      - Classification-based techniques perform better than ensemble-based techniques

      Algorithm    NMI      B-Cubed
      All-Text     0.9240   0.7697
      Tags         0.9229   0.7676
      ENS-PART     0.9296   0.7819
      ENS-SIM      0.9322   0.7861
      CLASS-SVM    0.9425   0.8095
      CLASS-LR     0.9444   0.8155
  • 95. Statistical Significance Analysis
      - Clustering results for 10 partitions of Upcoming test set
      - Significant using Friedman test, p<0.05
      - Post-hoc analysis: [figure not reproduced]
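For readers unfamiliar with the Friedman test named on this slide, the statistic can be computed as in the sketch below: each test partition ranks the competing techniques, and the statistic measures how far the rank sums are from what chance would give. The scores are made up for illustration, the function names are assumptions, and no tie correction is applied.

```python
def rank_row(values):
    """Rank scores within one partition, best (highest) = rank 1; ties share
    the average of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return ranks

def friedman_statistic(rows):
    """rows[p][t] = score of technique t on partition p.
    Returns 12 / (n k (k+1)) * sum(R_t^2) - 3 n (k+1), without tie correction."""
    n, k = len(rows), len(rows[0])
    rank_sums = [0.0] * k
    for row in rows:
        for t, r in enumerate(rank_row(row)):
            rank_sums[t] += r
    return 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * n * (k + 1)

# Made-up scores: 4 partitions x 3 techniques, same ordering in every partition.
rows = [[0.92, 0.93, 0.94]] * 4
statistic = friedman_statistic(rows)
```

The statistic is compared against a chi-square distribution with k-1 degrees of freedom; with only a few partitions or techniques, exact tables are the safer reference.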
  • 96. NMI: Clustering Accuracy over Both Test Sets [chart: NMI on Upcoming and LastFM]
      - Similarity learning models trained on Upcoming data show similar trends when tested on LastFM data
  • 98. Conclusions
      - Structured context features of social media documents
          - Effective complementary cues for social media document similarity
          - Tags, Time-Proximity among highest weighted features
      - Domain-appropriate similarity metrics
          - Weighted combination yields high quality clustering results
          - Significantly outperform text-only techniques
      - Similarity learning models generalize to unseen data sets
  • 101. Current and Future Work
      - Improving clustering accuracy with social media “links” [SSM '10 poster]
      - Capturing event content across sites (YouTube, Flickr, Twitter)
      - Designing event search strategies