Social media sites host a variety of event information. We use the traditional event definition from the event detection literature, stating that an event is… In particular, we consider events that range from …
Our goal is to… This could facilitate applications such as… as can be seen in this image, similar to news aggregation sites but for events, including a variety of rich media content. Our approach for event identification uses clustering to group similar event documents, such that…
Social media data quality is uneven. When developing our approach, we had to consider the scalability of our algorithms, as there is a vast amount of social media event data on the web.
Define and motivate the event identification task…
We have different notions of similarity for different types of features… We have to come up with a principled way to combine these different notions into a single similarity measure.
We can cluster our document collection according to each of the feature representations discussed; each would have its own…
Social Media Document Representation: Title, Description, Tags, Date/Time, Location, All-Text
Social Media Document Similarity
Text: tf-idf weights, cosine similarity
[Figure: document features and their representations: Title, Description, Tags, All-Text, Date/Time-Keywords, Date/Time-Proximity, Location-Keywords, Location-Proximity]
Location: geo-coordinate proximity
Time: proximity in minutes
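The three similarity notions above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the slides do not give exact formulas, so the normalization constants (a one-year window for time, half the Earth's circumference for location), the unsmoothed idf, and all function names are assumptions.

```python
import math
from collections import Counter

MINUTES_PER_YEAR = 365 * 24 * 60   # assumed normalization window for time
EARTH_RADIUS_KM = 6371.0           # mean Earth radius


def tfidf_vectors(docs):
    """Build a sparse tf-idf vector (term -> weight) per tokenized document."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Unsmoothed idf: a term present in every document gets weight 0.
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def time_similarity(t1_minutes, t2_minutes, window=MINUTES_PER_YEAR):
    """1 for identical timestamps, falling linearly to 0 at `window` minutes."""
    return max(0.0, 1.0 - abs(t1_minutes - t2_minutes) / window)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in kilometers."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def location_similarity(p1, p2):
    """1 for identical coordinates, approaching 0 for antipodal points."""
    max_km = math.pi * EARTH_RADIUS_KM  # half the Earth's circumference
    return 1.0 - haversine_km(p1[0], p1[1], p2[0], p2[1]) / max_km
```

Each function returns a score in [0, 1], which keeps the different feature types comparable when they are combined later.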
Social Media Document Clustering Framework
[Diagram: Social media documents, Document feature representation, Event clusters]
Clustering: Ensemble Algorithm
Clusterers: C_title, C_tags, C_time
Weights: W_title, W_tags, W_time (learned in a training step)
Consensus function f(C, W): combine ensemble similarities
Output: ensemble clustering solution
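One plausible reading of the consensus function f(C, W) is a weighted vote over the individual clusterers: for a pair of documents, each clusterer contributes 1 if it placed both in the same cluster and 0 otherwise, and the votes are averaged using the clusterer weights. The sketch below assumes that reading; the function name, the data layout, and the example weights are hypothetical and not taken from the slides.

```python
def consensus_score(doc_a, doc_b, assignments, weights):
    """Weighted average of binary same-cluster votes for one document pair.

    assignments: clusterer name -> {document id -> cluster label}
    weights:     clusterer name -> non-negative weight (hypothetical values)
    """
    total = sum(weights.values())
    agree = sum(
        w for name, w in weights.items()
        if assignments[name][doc_a] == assignments[name][doc_b]
    )
    return agree / total


# Illustrative input: three clusterers, two photos.
assignments = {
    "title": {"photo1": 0, "photo2": 1},  # title clusterer separates them
    "tags": {"photo1": 3, "photo2": 3},   # tags clusterer groups them
    "time": {"photo1": 7, "photo2": 7},   # time clusterer groups them
}
weights = {"title": 0.2, "tags": 0.5, "time": 0.3}  # made-up weights
score = consensus_score("photo1", "photo2", assignments, weights)
```

The resulting score lies in [0, 1] and can be treated as a combined similarity between the two documents.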
Clustering: Measuring Quality
Homogeneous clusters
Complete clusters
Metric: Normalized Mutual Information (NMI), the shared information between the clustering solution and the “ground truth”
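For concreteness, NMI can be computed directly from two label lists. Several normalizations are in use; the sketch below assumes the variant that divides mutual information by the arithmetic mean of the two entropies, which may differ from the one used in the evaluation. Function names are illustrative.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (natural log) of a label assignment."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def nmi(predicted, truth):
    """Mutual information normalized by the mean of the two entropies."""
    n = len(predicted)
    joint = Counter(zip(predicted, truth))
    pred_counts = Counter(predicted)
    true_counts = Counter(truth)
    # I(P; T) = sum over label pairs of p(p, t) * log(p(p, t) / (p(p) * p(t)))
    mi = sum(
        (c / n) * math.log(c * n / (pred_counts[p] * true_counts[t]))
        for (p, t), c in joint.items()
    )
    denom = (entropy(predicted) + entropy(truth)) / 2
    # Both assignments consist of a single cluster: treat as perfect agreement.
    return mi / denom if denom > 0 else 1.0
```

A clustering that matches the ground truth up to a renaming of labels scores 1, and one that is statistically independent of it scores 0.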
Experimental Setup
Data: >270K Flickr photos
Event labels from Yahoo!’s “upcoming” event database
Split into 3 parts for training/validation/testing
Clusterers: single pass algorithm with centroid similarity
Weighting scheme: Normalized Mutual Information (NMI) scores on validation set
Consensus function: weighted average of clusterers’ binary predictions
Final prediction step: single pass clustering algorithm
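A single-pass algorithm with centroid similarity, as listed in the setup above, can be sketched as follows. Each document is compared against the existing cluster centroids and either joins the most similar cluster or starts a new one. This is a minimal sketch for sparse text vectors; the cosine measure, the threshold value, and the function names are assumptions rather than details from the slides.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors (term -> weight)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def single_pass_cluster(vectors, threshold=0.5):
    """Assign each vector to its most similar cluster, or open a new cluster.

    Clusters are stored as running sums of their members. Cosine similarity
    ignores vector length, so comparing against the sum gives the same score
    as comparing against the mean (the centroid).
    """
    centroid_sums = []
    labels = []
    for vec in vectors:
        best_id, best_sim = None, threshold
        for cid, total in enumerate(centroid_sums):
            sim = cosine(vec, total)
            if sim >= best_sim:
                best_id, best_sim = cid, sim
        if best_id is None:
            # No cluster is similar enough: start a new one.
            centroid_sums.append(dict(vec))
            labels.append(len(centroid_sums) - 1)
        else:
            # Fold the document into the chosen cluster's running sum.
            total = centroid_sums[best_id]
            for term, w in vec.items():
                total[term] = total.get(term, 0.0) + w
            labels.append(best_id)
    return labels
```

Because every document is processed once and compared only against centroids, the cost grows with the number of clusters rather than the number of document pairs, which suits the scalability concern raised earlier. The result depends on document order and on the threshold, which would need tuning on held-out data.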
Preliminary Evaluation Results
Individual clusterer performance
Highest NMI: Tags, All-Text
Lowest NMI: Description, Title
Ensemble performance, compared against all individual clusterers
Highest overall performance in terms of NMI
More homogeneous clusters: each event is spread over fewer clusters
Hila Becker, Mor Naaman, Luis Gravano, "Event Identification in Social Media", in Proceedings of the ACM SIGMOD Workshop on the Web and Databases (WebDB '09), 2009.