Identification and Characterization of Events in Social Media
Hila Becker, Thesis Defense
Social Media is Changing the World
  • Lady Gaga, Justin Bieber, and Britney Spears have more Twitter followers than the entire populations of some countries (e.g., Israel, Greece)
  • YouTube is the second largest search engine in the world
  • Every minute, 24 hours of video are uploaded to YouTube
  • Over the past five years people uploaded 6,000,000,000 images to Flickr
Source: http://www.searchenginejournal.com/the-growth-of-social-media-an-infographic/
Event Content in Social Media
Photo: MIKE CLARKE/AFP/Getty Images
Source: Tweets from Tahrir, edited by Nadia Idle and Alex Nunns
Event Identification, Characterization, and Content Selection
  • Identify events and their associated social media documents
      • In a timely manner
      • Across different social media sites
  • Characterize events along different dimensions
  • Select high-quality, relevant, useful event documents
Event Content in Social Media
  • Challenges:
      • Wide variety of topics, not all related to events (e.g., personal status updates, everyday mundane conversations)
      • Unconventional text: abbreviations, typos
      • Large-scale, rapidly produced content
  • Opportunities:
      • Content generated in real time, as events happen
      • Rich context features (e.g., time, location)
      • Users' perspective
Event Content in Social Media
  • Timeliness: real-time vs. retrospective
  • Content discovery: unknown vs. known events
  • Unknown events:
      • Twitter new event detection [Petrović et al. NAACL'10]
      • Event detection on Flickr [Chen and Roy CIKM'09]
  • Known events:
      • Organization of YouTube concert videos [Kennedy and Naaman WWW'09]
      • Earthquake prediction using Twitter [Sakaki et al. WWW'10]
Event Content in Social Media
  • Unknown content discovery: a trending event is a real-world occurrence described by:
      • One or more terms and a time period
      • A volume of messages posted for the terms in the time period that exceeds some expected level of activity
  • Known content discovery: a planned event is a real-world occurrence with a corresponding published event record consisting of:
      • A title, describing the subject of the event
      • The time at which the event is planned to occur
Contributions
  • Trend (and trending event) study, for characterizing and differentiating between different types of trends
  • Online clustering framework with an event classification step for identifying trending events and their associated documents in social media
  • Social media document similarity metric learning approaches
  • Query formulation strategies for identifying social media documents for planned events
  • Selection techniques for identifying high quality, relevant, and useful event content
  • Scenario labels on the slide: Unknown, Known, Unknown/Known
Identification and Characterization of Events in Social Media
  • Characterization of trending events
  • Identification of trending events
  • Similarity metric learning for trending events
  • Identification of content for planned events
  • Selection of event content
What Types of Trends Exist in Social Media?
  • Taxonomy of trends
  • Characterization of each trend
      • Manually assigned categories
      • Automatically computed features
  • Analysis of differences between trend types according to each characteristic
  • Trend types: Trending Events, Non-Event Trends
Trends
  • Trend:
      • One or more terms and a time period
      • Volume of messages posted for the terms in the time period exceeds some expected level of activity
      • May or may not reflect a real-world occurrence
  • A trending event is a type of trend
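The "exceeds some expected level of activity" condition can be illustrated with a minimal sketch. The slides do not say how the expected level is computed, so the rule used here (mean of past hourly counts plus `k` standard deviations) and the names `is_trending`, `history`, and `k` are assumptions made for illustration only.

```python
from statistics import mean, stdev


def is_trending(history, current, k=3.0):
    """Flag a term as trending when its current message volume exceeds
    an expected level, here assumed to be mean + k * stdev of past counts."""
    if len(history) < 2:
        # Not enough past data to estimate an expected level.
        return False
    expected = mean(history)
    spread = stdev(history)
    return current > expected + k * spread


# Hypothetical hourly message counts for one term.
past_counts = [10, 12, 11, 9, 10]
burst = is_trending(past_counts, 40)
quiet = is_trending(past_counts, 12)
```

A larger `k` makes the detector more conservative; any other baseline (for example a same-hour-of-day average) could be swapped in without changing the structure.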
Twitter Content
  • Streams of textual messages
  • Brief content (140 characters)
  • Communicated to network of followers
  • Provide timely reflection of thoughts and interests
Characterizing Trends on Twitter
  • Collect a set of Twitter trends
      • Burst detection
      • Twitter's "trending topics"
  • Qualitative analysis: trend taxonomy
  • Quantitative analysis
      • Automatically compute features of each trend and corresponding messages
      • Manually label each trend according to categories introduced by the taxonomy
      • Identify differences between trend categories according to automatically computed features
Affinity Diagram Method
Endogenous vs. Exogenous Trends
  • Endogenous trends: Twitter-centric activities that do not correspond to external events (e.g., a popular post by a celebrity)
  • Exogenous trends: trending events that originated outside of the Twitter system (e.g., an earthquake)
  • Do exogenous and endogenous trends exhibit different characteristics?
Characterization of Trends and Trending Events
  • Automatically computed features
      • Content features
      • Interaction features
      • Time-based features
      • Participation features
      • Social network features
  • Compared differences between categories
      • Hypotheses guided by differences in categories according to feature types
      • Performed t-tests for significance analysis
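As an illustration of the significance analysis, the sketch below computes a two-sample t statistic for one feature across two trend categories. The slides only say "t-tests", so the choice of Welch's unequal-variance variant is an assumption, and the feature values are made up for the example. Turning the statistic into a p-value needs the t distribution, which is left out here.

```python
import math
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Return Welch's t statistic and its approximate degrees of freedom
    (Welch-Satterthwaite) for two independent samples."""
    va = variance(sample_a) / len(sample_a)
    vb = variance(sample_b) / len(sample_b)
    t = (mean(sample_a) - mean(sample_b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (
        va ** 2 / (len(sample_a) - 1) + vb ** 2 / (len(sample_b) - 1)
    )
    return t, df


# Hypothetical per-trend values of one feature (e.g., proportion of retweets).
exogenous = [1, 2, 3, 4, 5]
endogenous = [2, 3, 4, 5, 6]
t_stat, dof = welch_t(exogenous, endogenous)
```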
Contributions of the Study
  • Trends fall into two main categories: exogenous (i.e., trending event) and endogenous (i.e., platform-centric trend)
  • There are significant differences between exogenous and endogenous trends
      • Proportion of messages with URLs
      • Unique hashtag in top 10% of messages
      • Proportion of retweets
      • Reciprocity
Identifying Trending Events
  • Diagram: documents → document clusters → event clusters
Identifying Trending Events in Real-Time
  • Order documents by post time
  • Use tf-idf vector representation of textual content
      • Stop word elimination
      • Stemming
      • idf computed over past data
  • Separate tweets by location
      • Focus on tweets from NYC
      • Different locations can be processed in parallel
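The representation step can be sketched as follows. This is a minimal illustration, not the thesis implementation: the tiny stop word list, the tokenizer, and the smoothed idf formula are assumptions, and stemming is omitted to keep the example dependency-free.

```python
import math
import re
from collections import Counter

# Illustrative stop word list; a real system would use a fuller one.
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}


def tokenize(text):
    """Lowercase, split into word-like tokens, and drop stop words."""
    tokens = re.findall(r"[a-z0-9#@']+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def fit_idf(past_docs):
    """Compute idf weights over past data (smoothed variant, an assumption)."""
    n = len(past_docs)
    df = Counter()
    for doc in past_docs:
        df.update(set(tokenize(doc)))
    idf = {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}
    default_idf = math.log(1 + n) + 1.0  # weight for terms unseen in past data
    return idf, default_idf


def tfidf_vector(text, idf, default_idf):
    """Represent a new document as a sparse {term: tf * idf} dictionary."""
    counts = Counter(tokenize(text))
    return {term: tf * idf.get(term, default_idf) for term, tf in counts.items()}


past = ["earthquake in japan", "coffee in the morning"]
idf, default_idf = fit_idf(past)
vec = tfidf_vector("Earthquake earthquake tsunami", idf, default_idf)
```

Computing idf over past data means a new tweet can be vectorized as soon as it arrives, which fits the online setting described above.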
Clustering Algorithm
  • Many alternatives possible! [Berkhin 2002]
  • Single-pass incremental clustering algorithm
      • Scalable, online solution
      • Using centroid representation
      • Used effectively for
          • Event identification in textual news [Allan et al. 1998]
          • News event detection on Twitter [Sankaranarayanan et al. 2009]
      • Does not require a priori knowledge of number of clusters
  • Parameters:
      • Similarity function σ
      • Threshold μ
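A single-pass incremental clusterer with the two parameters named above can be sketched as below. Cosine similarity stands in for σ and `mu` for the threshold μ; the sparse-dictionary vectors and the function names are assumptions for illustration, not the thesis code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def single_pass_cluster(vectors, mu):
    """Assign each document, in arrival order, to its most similar cluster
    if the similarity is at least mu; otherwise start a new cluster."""
    clusters = []  # each cluster keeps a running term sum and member indices
    for index, vec in enumerate(vectors):
        best, best_sim = None, -1.0
        for cluster in clusters:
            # Cosine is scale-invariant, so the term sum behaves like the centroid.
            sim = cosine(vec, cluster["sum"])
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is not None and best_sim >= mu:
            for term, weight in vec.items():
                best["sum"][term] = best["sum"].get(term, 0.0) + weight
            best["members"].append(index)
        else:
            clusters.append({"sum": dict(vec), "members": [index]})
    return [cluster["members"] for cluster in clusters]


docs = [
    {"quake": 1.0, "japan": 1.0},
    {"quake": 1.0, "japan": 1.0, "tsunami": 1.0},
    {"coffee": 1.0},
]
assignments = single_pass_cluster(docs, mu=0.5)
```

Each document is compared only against current centroids, so the number of clusters does not need to be known in advance.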
Overview of Cluster-based Approach
  • Group similar documents via online clustering
  • Compute statistics of cluster content
      • Top terms (e.g., [earthquake, japan])
      • Number of documents per hour
      • …
  • Use cluster-level features to identify trending event clusters
      • Single feature with threshold (e.g., increase in volume over time-window [Petrović et al. 2010])
      • Trained classification model
Event Classification on Twitter
  • Cluster-level features
      • Social interaction
      • Topic coherence
      • Trending behavior
      • Platform-centric
  • Event classifier
      • Human-annotated training data
      • SVM model (selected during training phase)
Experimental Setup
  • Classification accuracy
      • Baseline: Naïve Bayes text classification (NB-Text) [Sankaranarayanan et al. 2009]
      • 10-fold cross validation
      • Blind test set of randomly chosen tweets
  • Event surfacing: select top event clusters per hour
      • Baselines
          • Fastest-growing clusters per hour (Fastest) [Petrović et al. 2010]
          • Randomly selected clusters per hour (Random)
      • 5 hours, top-20 clusters per hour
Identified Events
  • A sample of events identified by our classifiers on the test set
Classification Performance (F-measure)
  • The RW-Event classifier is more effective at discriminating between real-world events and the rest of Twitter data
NDCG@K Evaluation
  • Performance of the event classifier and baselines for the event surfacing task
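For readers unfamiliar with the metric, NDCG@K can be computed as in the sketch below. Several gain and discount conventions exist; this one uses the raw relevance label as gain with a log2 rank discount, which is an assumption since the slides do not state the variant used.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k items of a ranked list."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )


def ndcg_at_k(relevances, k):
    """DCG normalized by the DCG of the ideal (sorted) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    if ideal == 0.0:
        return 0.0
    return dcg_at_k(relevances, k) / ideal


# Hypothetical relevance labels of the clusters surfaced for one hour.
score = ndcg_at_k([0, 1], k=2)
```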
Social Media Document Representation
  • Title
  • Description
  • Tags
  • Date/Time
  • Location
  • All-Text
Social Media Document Similarity
  • Text (Title, Description, Tags, All-Text): cosine similarity of tf-idf vectors (tf-idf version?; stemming?; stop word elimination?)
  • Time (Date/Time): proximity in minutes
  • Location: geo-coordinate proximity
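The time and location proximities can be turned into similarity scores in [0, 1] along the following lines. The normalization constants (a one-year window for time, half the Earth's circumference for distance) and the function names are assumptions for this sketch; the slides only state "proximity in minutes" and "geo-coordinate proximity".

```python
import math

MINUTES_PER_YEAR = 365 * 24 * 60
EARTH_RADIUS_KM = 6371.0
MAX_DISTANCE_KM = math.pi * EARTH_RADIUS_KM  # farthest two points can be apart


def time_similarity(t1_minutes, t2_minutes, window=MINUTES_PER_YEAR):
    """1.0 for identical timestamps, falling linearly to 0.0 at `window` minutes."""
    gap = abs(t1_minutes - t2_minutes)
    return max(0.0, 1.0 - gap / window)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (latitude, longitude) points in km."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def location_similarity(p1, p2):
    """1.0 for the same point, 0.0 for antipodal points."""
    return 1.0 - haversine_km(p1[0], p1[1], p2[0], p2[1]) / MAX_DISTANCE_KM


same_place = location_similarity((40.7, -74.0), (40.7, -74.0))
quarter_turn = location_similarity((0.0, 0.0), (0.0, 90.0))
```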
Cluster Representation and Parameter Tuning
  • Centroid cluster representation
      • Average tf-idf scores
      • Average time
      • Geographic mid-point
  • Parameter tuning in supervised training phase
  • Clustering quality metrics to optimize:
      • Normalized Mutual Information (NMI) [Strehl et al. 2002]
      • B-Cubed [Amigó et al. 2008]
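As a concrete reference for one of the metrics being optimized, the sketch below computes NMI between a predicted clustering and ground-truth event labels. The normalization by the average of the two entropies is one common convention and is an assumption here; other variants normalize by the geometric mean.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (natural log) of a label assignment."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def nmi(predicted, truth):
    """Mutual information between two labelings, normalized by mean entropy."""
    n = len(predicted)
    joint = Counter(zip(predicted, truth))
    pred_counts = Counter(predicted)
    truth_counts = Counter(truth)
    mutual_info = sum(
        (c / n) * math.log(c * n / (pred_counts[p] * truth_counts[t]))
        for (p, t), c in joint.items()
    )
    denom = (entropy(predicted) + entropy(truth)) / 2
    # Both labelings put everything in one group: treat as perfect agreement.
    return mutual_info / denom if denom > 0 else 1.0


perfect = nmi([0, 0, 1, 1], ["a", "a", "b", "b"])
unrelated = nmi([0, 1, 0, 1], [0, 0, 1, 1])
```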
Learning a Similarity Metric for Clustering
  • Ensemble-based similarity
      • Training a cluster ensemble
      • Computing a similarity score by:
          • Combining individual partitions
          • Combining individual similarities
  • Classification-based similarity
      • Training data sampling strategies
      • Modeling strategies
Overview of a Cluster Ensemble Algorithm
  • Individual clusterers: C_title, C_tags, C_time
  • Weights W_title, W_tags, W_time: learned in a training step
  • Consensus function f(C, W): combine ensemble similarities
  • Output: ensemble clustering solution
Overview of a Cluster Ensemble Algorithm: Combining Partitions
  • Clusterers C_title, C_tags, C_time with weights W_title, W_tags, W_time, combined by f(C, W)
Overview of a Cluster Ensemble Algorithm: Combining Similarities
  • For each document d_i and cluster c_j, each clusterer tests its own similarity against its own threshold:
      • σ_Ctitle(d_i, c_j) > μ_Ctitle
      • σ_Ctags(d_i, c_j) > μ_Ctags
      • σ_Ctime(d_i, c_j) > μ_Ctime
  • The outcomes are weighted by W_title, W_tags, W_time and combined by f(C, W)
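One way to read the "combining similarities" consensus is as a weighted vote: each clusterer votes 1 when its own similarity exceeds its own threshold, and the votes are averaged using the learned weights. The exact form of f(C, W) is not given on the slides, so this weighted-vote form, the weights, and the thresholds below are assumptions for illustration.

```python
def ensemble_similarity(similarities, thresholds, weights):
    """Weighted fraction of clusterers whose similarity for a
    (document, cluster) pair exceeds that clusterer's threshold."""
    total_weight = sum(weights.values())
    agreeing = sum(
        weights[name]
        for name, sim in similarities.items()
        if sim > thresholds[name]
    )
    return agreeing / total_weight


# Hypothetical per-feature similarities of one document to one cluster.
sims = {"title": 0.8, "tags": 0.1, "time": 0.9}
mus = {"title": 0.5, "tags": 0.5, "time": 0.5}
ws = {"title": 0.5, "tags": 0.3, "time": 0.2}
combined = ensemble_similarity(sims, mus, ws)
```

The combined score can then be compared against a single threshold inside the same single-pass clustering loop.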
Classification-based Similarity Metrics
  • Classify pairs of documents as similar/dissimilar
  • Feature vector
      • Pairwise similarity scores
      • One feature per similarity metric (e.g., time-proximity, location-proximity, …)
  • Modeling strategies
      • Document pairs
      • Document-centroid pairs
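The document-pair modeling strategy amounts to building one training example per pair, with one feature per similarity metric and a label saying whether the two documents belong to the same event. The sketch below shows that construction; the two toy similarity functions, the field names, and the sample documents are hypothetical, and the resulting examples could be fed to any classifier such as logistic regression or an SVM.

```python
from itertools import combinations


def time_proximity(a, b, window=60.0):
    """Toy time similarity: 1.0 for equal timestamps, 0.0 beyond `window` minutes."""
    return max(0.0, 1.0 - abs(a["minute"] - b["minute"]) / window)


def tag_overlap(a, b):
    """Toy tag similarity: Jaccard overlap of the two tag sets."""
    union = a["tags"] | b["tags"]
    return len(a["tags"] & b["tags"]) / len(union) if union else 0.0


def pair_examples(docs, similarity_fns):
    """Build (feature_vector, label) pairs; label is 1 when both documents
    carry the same event label, else 0."""
    examples = []
    for a, b in combinations(docs, 2):
        features = [fn(a, b) for fn in similarity_fns]
        label = 1 if a["event"] == b["event"] else 0
        examples.append((features, label))
    return examples


docs = [
    {"event": "e1", "minute": 0, "tags": {"x", "y"}},
    {"event": "e1", "minute": 30, "tags": {"x"}},
    {"event": "e2", "minute": 600, "tags": {"z"}},
]
training = pair_examples(docs, [time_proximity, tag_overlap])
```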
Experiments: Alternative Similarity Metrics
  • Ensemble-based techniques
      • Combining individual partitions (ENS-PART)
      • Combining individual similarities (ENS-SIM)
  • Classification-based techniques
      • Modeling: document-document vs. document-centroid pairs
      • Logistic Regression (CLASS-LR), Support Vector Machines (CLASS-SVM)
  • Baselines
      • Title, Description, Tags, All-Text, Time-Proximity, Location-Proximity
Experimental Setup
  • Datasets:
      • Upcoming
          • >270K Flickr photos
          • Event labels from the "upcoming" event database (upcoming:event=12345)
          • Split into 3 parts for training/validation/testing
      • LastFM
          • >594K Flickr photos
          • Event labels from last.fm music catalog (lastfm:event=6789)
          • Used as an additional test set
Clustering Accuracy over Upcoming Test Set
  • All similarity learning techniques outperform the baselines
  • Classification-based techniques perform better than ensemble-based techniques
NMI: Clustering Accuracy over Both Test Sets
  • Chart panels: Upcoming, LastFM (axis: NMI)
  • Similarity learning models trained on Upcoming data show similar trends when tested on LastFM data
Identifying Content for Planned Events
  • Identify planned event documents given known event information
  • User-contributed planned event records
      • LastFM Events
      • EventBrite
      • Facebook Events
  • Structured features (e.g., title, time, location)
  • Challenging identification scenario
      • Known event information is often inaccurate or incomplete
      • Social media documents are brief and noisy
Planned Event Record
  • Title
  • Description
  • Date/Time
  • Venue
  • City
Approach for Known Identification Scenario
  • Two-step query formulation strategy
      • Precision-oriented queries using known event features
      • Recall-oriented queries using retrieved content from precision-oriented queries
  • Leverage cross-site content
      • Identify event documents on each site individually
      • Use event documents on one site to retrieve additional event documents on a different site
Query Formulation Strategies
  • Precision-oriented queries: combined event record features
      • Phrase, bag-of-words, stop word elimination
      • Examples: ["title"+"venue"], [title-no-stopwords+"city"]
  • Recall-oriented queries
      • Frequency analysis
          • Frequent terms in the event's retrieved content
          • Infrequently found in Web documents
      • Term extraction
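The two example precision-oriented queries can be generated from an event record as sketched below. The record fields follow the planned event record described in the deck; the stop word list, the query syntax (double quotes for phrases), and the sample record are assumptions for illustration.

```python
# Illustrative stop word list; a real system would use a fuller one.
STOP_WORDS = {"a", "an", "and", "at", "in", "of", "the", "to"}


def precision_queries(record):
    """Build the two example strategies: ["title"+"venue"] as exact phrases,
    and [title-no-stopwords+"city"] as bag-of-words title plus city phrase."""
    title = record["title"]
    venue = record["venue"]
    city = record["city"]
    title_no_stop = " ".join(
        word for word in title.split() if word.lower() not in STOP_WORDS
    )
    return [
        '"{}" "{}"'.format(title, venue),
        '{} "{}"'.format(title_no_stop, city),
    ]


# Hypothetical planned event record.
event = {"title": "Jazz in the Park", "venue": "Central Bandshell", "city": "New York"}
queries = precision_queries(event)
```

Phrase queries keep precision high; dropping stop words and quoting only the city trades a little precision for robustness to title variations.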
Leveraging Cross-Site Content
  • Build precision-oriented queries using planned event features
  • Use precision-oriented queries to retrieve data from:
      • Twitter
      • Flickr
      • YouTube
  • Build recall-oriented queries using data from:
      • Each site individually
      • All sites collectively
  • Diagram: queries [title+city], [title+venue], … retrieve tweets (tweet 1 … tweet n), photos (photo 1 … photo n), and videos (video 1 … video n)
Experimental Settings
  • 60 planned events from EventBrite, LastFM, LinkedIn, and Facebook
  • Corresponding social media documents
      • Retrieved from Twitter, Flickr, and YouTube
      • Ranked according to similarity to event record
  • Techniques
      • Precision: only precision-oriented queries
      • MS: precision- and recall-oriented queries selected using Microsoft n-gram probability score
      • RTR: precision- and recall-oriented queries selected using ratio of document frequency around the time of the event to document frequency in larger time window
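The RTR selection criterion can be sketched as a ratio of two document counts for a candidate query term. The window sizes, the use of raw (unnormalized) counts, and the `(timestamp, terms)` document layout are assumptions made for this illustration.

```python
def rtr_score(term, docs, event_time, near_window, wide_window):
    """Ratio of the term's document frequency close to the event time to its
    document frequency in a larger surrounding window. `docs` is a list of
    (timestamp, set_of_terms) pairs; windows are half-widths in the same unit."""
    near = sum(
        1 for ts, terms in docs
        if term in terms and abs(ts - event_time) <= near_window
    )
    wide = sum(
        1 for ts, terms in docs
        if term in terms and abs(ts - event_time) <= wide_window
    )
    return near / wide if wide else 0.0


# Hypothetical retrieved documents with timestamps in hours.
retrieved = [
    (100, {"encore"}),
    (102, {"encore"}),
    (500, {"encore"}),
    (101, {"lunch"}),
]
encore_score = rtr_score("encore", retrieved, event_time=100, near_window=5, wide_window=1000)
```

Terms that are concentrated around the event time score close to 1, while terms used steadily over the larger window score lower.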
NDCG Performance on Twitter
  • NDCG scores for top-k Twitter documents retrieved by precision-oriented queries (Precision), and query strategies using Twitter data (Twitter-RTR, Twitter-MS).
Cross-Site NDCG Performance
  • NDCG scores for top-k YouTube documents retrieved by precision-oriented queries (Precision), and query strategies using data from Twitter (Twitter-MS) and YouTube (YouTube-MS).
Event Content Selection
  • Example tweets from one event cluster:
      • Tiger Woods to make a public apology Friday and talk about his future in golf.
      • Tiger woods y'all,tiger woods y'all,ah tiger woods y'all
      • Tiger Woods Apology
      • Tiger Woods Hugs: http://tinyurl.com/yhf4uzw
      • Tiger Woods Returns To Golf - Public Apology http://bit.ly/9Ui5jx
      • Wedge wars upstage Watson v Woods: BBC Sport (blog)
Event Content Selection
  • Challenges:
      • Document clusters contain noise
      • Relevant documents might have poor quality text
      • Relevant, high quality documents might not be interesting
  • For each document and a given event, evaluate:
      • Quality
      • Relevance
      • Usefulness
Centrality Based Document Selection
  • Centroid
      • Cosine similarity of each document to cluster centroid
  • Degree
      • Documents are nodes
      • Documents are connected if their similarity is above a threshold
      • Compute degree centrality of each node
  • LexRank [Erkan and Radev 2004]
      • Same graph structure as Degree method
      • Central documents are similar to other central documents
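The three centrality scores can be sketched over sparse term vectors as follows. This is a simplified illustration: the LexRank part is a plain damped power iteration over the thresholded graph, and the damping factor, iteration count, threshold, and sample vectors are all assumptions rather than the settings used in the thesis.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def centroid_scores(vectors):
    """Centroid method: similarity of each document to the cluster average."""
    centroid = {}
    for vec in vectors:
        for term, weight in vec.items():
            centroid[term] = centroid.get(term, 0.0) + weight / len(vectors)
    return [cosine(vec, centroid) for vec in vectors]


def similarity_graph(vectors, threshold):
    """Adjacency lists: documents are linked when similarity exceeds threshold."""
    n = len(vectors)
    return [
        [j for j in range(n) if j != i and cosine(vectors[i], vectors[j]) > threshold]
        for i in range(n)
    ]


def degree_scores(vectors, threshold):
    """Degree method: number of neighbors of each document in the graph."""
    return [len(neighbors) for neighbors in similarity_graph(vectors, threshold)]


def lexrank_scores(vectors, threshold, damping=0.85, iterations=50):
    """LexRank-style scores: a document is central if central documents link to it."""
    graph = similarity_graph(vectors, threshold)
    n = len(vectors)
    scores = [1.0 / n] * n
    for _ in range(iterations):
        updated = [(1.0 - damping) / n] * n
        for i, neighbors in enumerate(graph):
            # Isolated documents spread their score uniformly.
            targets = neighbors if neighbors else range(n)
            share = damping * scores[i] / len(targets)
            for j in targets:
                updated[j] += share
        scores = updated
    return scores


tweets = [
    {"tiger": 1.0, "woods": 1.0, "apology": 1.0},
    {"tiger": 1.0, "woods": 1.0},
    {"tiger": 1.0, "woods": 1.0, "golf": 1.0},
    {"lunch": 1.0},
]
by_centroid = centroid_scores(tweets)
by_degree = degree_scores(tweets, threshold=0.5)
by_lexrank = lexrank_scores(tweets, threshold=0.5)
```

Selecting the top-scoring documents under any of the three lists gives the candidate content for an event cluster.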
Experimental Methodology: Content Selection
  • 50 event clusters, randomly selected
  • 5 top tweets per event for each method: Centroid, Degree, LexRank
  • Labeled on a 1-4 scale
      • Quality: excellent (4) to poor (1)
      • Relevance: clearly relevant (4) to not relevant (1)
      • Usefulness: clearly useful (4) to not useful (1)
Content Selection Results
  • Average scores over all events (out of 4)
  • High quality and relevance (>3) for both Degree and Centroid
  • Centroid is the only method with high usefulness
Conclusions
  • Techniques for identifying, characterizing, and selecting social media content for events
  • There are significant differences between types of trends in social media, specifically trending events and non-event trends
  • Trending events and their associated social media documents can be effectively identified using online clustering with:
      • A classification step to separate event and non-event content
      • Social media document similarity metrics for documents with rich context features
  • A two-step query formulation technique is useful for identifying planned events across different social media sites
  • Centrality-based techniques can be used to select high quality, relevant, and useful social media event content
Future Work
  • Clustering framework optimization
      • Blocking techniques
      • Topic models
  • Identify unknown events with learned similarity metrics across sites
  • Improve breadth of event content
  • Rank events for search and presentation
  • Extension of content selection techniques
      • Learned ranking models
Publications
  • Hila Becker, Dan Iter, Mor Naaman, Luis Gravano, "Identifying Content for Planned Events Across Social Media Sites," under submission.
  • Hila Becker, Mor Naaman, Luis Gravano, "Beyond Trending Topics: Real-World Event Identification on Twitter," in Proceedings of the Fifth International AAAI Conference on Weblogs and Social Media (ICWSM'11), short paper.
  • Hila Becker, Mor Naaman, Luis Gravano, "Selecting Quality Twitter Content for Events," in Proceedings of the Fifth International AAAI Conference on Weblogs and Social Media (ICWSM'11), short paper.
  • Hila Becker, Feiyang Chen, Dan Iter, Mor Naaman, Luis Gravano, "Automatic Identification and Presentation of Twitter Content for Planned Events," in Proceedings of the Fifth International AAAI Conference on Weblogs and Social Media (ICWSM'11), demo paper.
  • Mor Naaman, Hila Becker, Luis Gravano, "Hip and Trendy: Characterizing Emerging Trends on Twitter," in Journal of the American Society for Information Science and Technology.
  • Hila Becker, Mor Naaman, Luis Gravano, "Learning Similarity Metrics for Event Identification in Social Media," in Proceedings of the Third ACM International Conference on Web Search and Data Mining (WSDM '10), 291-300.
  • Hila Becker, Bai Xiao, Mor Naaman, Luis Gravano, "Exploiting Social Links for Event Identification in Social Media," in Proceedings of the 3rd Annual Workshop on Search in Social Media (SSM '10), poster paper.
  • Hila Becker, Mor Naaman, Luis Gravano, "Event Identification in Social Media," in Proceedings of the ACM SIGMOD Workshop on the Web and Databases (WebDB '09), 2009.
Thank You!

Identification and Characterization of Events in Social Media

  • 1.
    Identification and Characterizationof Events in Social Media Hila Becker, Thesis Defense
  • 2.
    Social Media isChanging the World2Lady Gaga, Justin Bieber, and Britney Spears have more Twitter followers than the entire populations of some countries (e.g., Israel, Greece)YouTube is the second largest search engine in the worldEvery minute, 24 hours of video are uploaded to YouTubeOver the past five years people uploaded 6,000,000,000 images to Flickr
  • 3.
  • 4.
    Event Content inSocial Media4
  • 5.
  • 6.
    6Source: Tweets fromTahrir, edited by Nadia Idle and Alex Nunns
  • 7.
  • 8.
    Event Identification, Characterization,and Content SelectionIdentify events and their associated social media documentsIn a timely mannerAcross different social media sitesCharacterize events along different dimensions Select high-quality, relevant, useful event documents8
  • 9.
    Event Content inSocial MediaChallenges:Wide variety of topics, not all related to events (e.g., personal status updates, every-day mundane conversations)Unconventional text: abbreviations, typosLarge-scale, rapidly produced contentOpportunities:Content generated in real-time, as events happenRich context features (e.g., time, location)Users’ perspective9
  • 10.
    Event Content inSocial Media10TimelinessReal-timeRetrospectiveTwitter new event detection [Petrović et al. NAACL’10]Event detection on Flickr[Chen and Roy CIKM’09]UnknownContent DiscoveryOrganization of YouTube concert videos [Kennedy and Naaman WWW’09]Earthquake prediction using Twitter [Sakaki et al. WWW’10]Known
  • 11.
    Event Content inSocial Media11Trending Event isa real-world occurrence described by:One or more terms and a time periodVolume of messages posted for the terms in the time period exceeds some expected level of activityUnknownContent DiscoveryPlanned Event is a real-world occurrence with corresponding published event record consisting of:Title, describing the subject of the eventThe time at which the event is planned to occurKnown
  • 12.
    ContributionsTrend (and trendingevent) study, for characterizing and differentiating between different types of trendsOnline clustering framework with an event classification step for identifying trending events and their associated documents in social mediaSocial media document similarity metric learning approachesQuery formulation strategies for identifying social media documents for planned eventsSelection techniques for identifying high quality, relevant, and useful event content12UnknownKnownUnknown/Known
  • 13.
    ContributionsTrend (and trendingevent) study, for characterizing and differentiating between different types of trendsOnline clustering framework with an event classification step for identifying trending events and their associated documents in social mediaSocial media document similarity metric learning approachesQuery formulation strategies for identifying social media documents for planned eventsSelection techniques for identifying high quality, relevant, and useful event content13KnownUnknown/Known
  • 14.
    ContributionsTrend (and trendingevent) study, for characterizing and differentiating between different types of trendsOnline clustering framework with an event classification step for identifying trending events and their associated documents in social mediaSocial media document similarity metric learning approachesQuery formulation strategies for identifying social media documents for planned eventsSelection techniques for identifying high quality, relevant, and useful event content14Unknown/Known
  • 15.
    ContributionsTrend (and trendingevent) study, for characterizing and differentiating between different types of trendsOnline clustering framework with an event classification step for identifying trending events and their associated documents in social mediaSocial media document similarity metric learning approachesQuery formulation strategies for identifying social media documents for planned eventsSelection techniques for identifying high quality, relevant, and useful event content15
  • 16.
    Identification and Characterizationof Events in Social Media16Characterizationof trending events Identification of trending events Similarity metric learning for trending eventsIdentification of content for planned eventsSelection of event content
  • 17.
    What Types ofTrends Exist in Social Media?Taxonomy of trendsCharacterization of each trendManually assigned categoriesAutomatically computed featuresAnalysis of differences between trend types according to each characteristic17Trending EventsNon-Event Trends
  • 18.
    TrendsTrend:One or moreterms and a time period Volume of messages posted for the terms in the time period exceeds some expected level of activityMay or may not reflect a real-world occurrenceA trending event is a type of trend18
  • 19.
    Twitter ContentStreams oftextual messagesBrief content (140 characters)Communicated to network of followersProvide timely reflection of thoughts and interests19
  • 20.
    Characterizing Trends onTwitterCollect a set of Twitter trendsBurst detectionTwitter’s “trending topics”Qualitative analysis: trend taxonomyQuantitative analysisAutomatically compute features of each trend and corresponding messagesManually label each trend according to categories introduced by the taxonomyIdentify differences between trend categories according to automatically computed features20
  • 21.
  • 22.
    Endogenous vs. ExogenousTrendsEndogenous Trends: Twitter-centric activities that do not correspond to external events (e.g., a popular post by a celebrity)Exogenous Trends: trending eventsthat originated outside of the Twitter system (e.g., an earthquake)Do exogenous and endogenous trends exhibit different characteristics?22
  • 23.
    Characterization of Trendsand Trending Events Automatically computed featuresContent FeaturesInteraction FeaturesTime-based FeaturesParticipation FeaturesSocial Network FeaturesCompared differences between categoriesHypotheses guided by differences in categories according to feature typesPerformed t-tests for significance analysis23
  • 24.
    Contributions of theStudyTrends fall into two main categories: exogenous (i.e., trending event) and endogenous (i.e., platform-centric trend)There are significant differences between exogenous and endogenous trendsProportion of messages with URLsUnique hashtag in top 10% of messagesProportion of retweetsReciprocity24
  • 25.
    Identification and Characterizationof Events in Social Media25Characterizationof trending events Identification of trending events Similarity metric learning for trending eventsIdentification of content for planned eventsSelection of event content
  • 26.
    Identifying Trending EventsEventClustersDocumentsDocument Clusters26
  • 27.
    Identifying Trending Eventsin Real-TimeOrder documents by post timeUse tf-idf vector representation of textual contentStop word eliminationStemmingidf computed over past dataSeparate tweets by locationFocus on tweets from NYCDifferent locations can be processed in parallel27
  • 28.
    Clustering AlgorithmMany alternativespossible! [Berkhin 2002]Single-pass incremental clustering algorithmScalable, online solutionUsing centroid representation Used effectively for Event identification in textual news [Allan et al. 1998]News event detection on Twitter [Sankaranarayanan et al. 2009]Does not require a priori knowledge of number of clustersParameters:Similarity Function σThreshold μ28
  • 29.
    Overview of Cluster-basedApproachGroup similar documents via online clusteringCompute statistics of cluster content Top terms (e.g., [earthquake, japan])Number of documents per hour…Use cluster-level features to identify trendingeventclustersSingle feature with threshold (e.g., increase in volume over time-window [Petrovićet al. 2010])Trained classification model29
  • 30.
    Event Classification onTwitterCluster-level featuresSocial interaction Topic coherenceTrending behaviorPlatform-centric Event classifierHuman-annotated training dataSVM model (selected during training phase)30
  • 31.
    Experimental SetupClassification accuracyBaseline:Naïve Bayes text classification (NB-Text) [Sankaranarayanan et al. 2009]10-fold cross validationBlind test set of randomly chosen tweetsEvent surfacing: select top event clusters per hourBaselinesFastest-growing clusters per hour (Fastest) [Petrović et al. 2010]Randomly selected clusters per hour (Random)5 hours, top-20 clusters per hour31
  • 32.
    Identified Events32A sampleof events identified by our classifiers on the test set
  • 33.
    Classification Performance (F-measure)RW-Eventevent classifier is more effective at discriminating between real-world events and rest of Twitter data33
  • 34.
    NDCG@K Evaluation34Performance ofevent classifier and baselines for event surfacing task.
  • 35.
    Identification and Characterizationof Events in Social Media35Characterizationof trending events Identification of trending events Similarity metric learning for trending eventsIdentification of content for planned eventsSelection of event content
  • 36.
    Social Media DocumentRepresentationTitleDescriptionTagsDate/TimeLocationAll-Text3636
  • 37.
    Social Media DocumentSimilarityText: cosine similarity of tf-idf vectors (tf-idf version?; stemming?; stop word elimination?)37TitleAAABBBDescriptionTime: proximity in minutesTagstimeDate/TimeLocation: geo-coordinate proximityLocationAll-Text37
  • 38.
    Clustering AlgorithmMany alternativespossible! [Berkhin 2002]Single-pass incremental clustering algorithmScalable, online solutionUsing centroid representation Used effectively for Event identification in textual news [Allan et al. 1998]News event detection on Twitter [Sankaranarayanan et al. 2009]Does not require a priori knowledge of number of clustersParameters:Similarity Function σThreshold μ38
  • 39.
    Cluster Representation andParameter TuningCentroid cluster representationAverage tf-idf scoresAverage timeGeographic mid-pointParameter tuning in supervised training phaseClustering quality metrics to optimize:Normalized Mutual Information (NMI) [Amigó et al. 2008]B-Cubed [Strehl et al. 2002]39
  • 40.
    Learning a SimilarityMetric for ClusteringEnsemble-based similarityTraining a cluster ensembleComputing a similarity score by:Combining individual partitionsCombining individual similaritiesClassification-based similarityTraining data sampling strategiesModeling strategies40
  • 41.
    Overview of aCluster Ensemble AlgorithmCtitleEnsemble clustering solutionConsensus Function:combine ensemble similaritiesWtitlef(C,W)WtagsCtagsWtimeCtimeLearned in a training step41
  • 42.
    Overview of aCluster Ensemble Algorithm: Combining PartitionsWtitleCtitlef(C,W)WtagsCtagsWtimeCtime42
  • 43.
    Overview of aCluster Ensemble Algorithm: Combining SimilaritiesFor each document diand cluster cjσCtitle(di,cj)>μCtitleWtitlef(C,W)WtagsσCtags(di,cj)>μCtagsWtimeσCtime(di,cj)>μCtime43
  • 44.
    Learning a SimilarityMetric for ClusteringEnsemble-based similarityTraining a cluster ensembleComputing a similarity score by:Combining individual partitionsCombining individual similaritiesClassification-based similarityTraining data sampling strategiesModeling strategies44
  • 45.
    Classification-based Similarity MetricsClassifypairs of documents as similar/dissimilarFeature vectorPairwise similarity scores One feature per similarity metric (e.g., time-proximity, location-proximity, …)Modeling strategiesDocument pairs Document-centroid pairs45
  • 46.
    Experiments: Alternative SimilarityMetricsEnsemble-based techniquesCombining individual partitions (ENS-PART)Combining individual similarities (ENS-SIM)Classification-based techniquesModeling: document-document vs. document-centroidpairsLogistic Regression (CLASS-LR), Support Vector Machines (CLASS-SVM)BaselinesTitle, Description, Tags, All-Text, Time-Proximity, Location-Proximity46
  • 47.
    Experimental SetupDatasets:Upcoming>270K FlickrphotosEvent labels from the “upcoming” event database (upcoming:event=12345)Split into 3 parts for training/validation/testingLastFM>594K Flickr photosEvent labels from last.fm music catalog (lastfm:event=6789)Used as an additional test set47
  • 48.
    Clustering Accuracy overUpcoming Test SetAll similarity learning techniques outperform the baselinesClassification-based techniques perform better than ensemble-based techniques48
  • 49.
    NMI: Clustering Accuracyover Both Test Sets Upcoming LastFM49NMISimilarity learning models trained on Upcoming data show similar trends when tested on LastFM data
  • 50.
    Identification and Characterizationof Events in Social Media50Characterizationof trending events Identification of trending events Similarity metric learning for trending eventsIdentification of content for planned eventsSelection of event content
  • 51.
    Identifying Content for Planned Events. Identify planned event documents given known event information. User-contributed planned event records: LastFM Events, EventBrite, Facebook Events; structured features (e.g., title, time, location). Challenging identification scenario: known event information is often inaccurate or incomplete; social media documents are brief and noisy.
  • 52.
  • 53.
    Approach for Known Identification Scenario. Two-step query formulation strategy: precision-oriented queries using known event features; recall-oriented queries using retrieved content from precision-oriented queries. Leverage cross-site content: identify event documents on each site individually; use event documents on one site to retrieve additional event documents on a different site.
  • 54.
    Query Formulation Strategies. Precision-oriented queries: combined event record features; phrase, bag-of-words, stop word elimination; examples: [“title” + “venue”], [title-no-stopwords + “city”]. Recall-oriented queries: frequency analysis (terms that are frequent in the event’s retrieved content and infrequently found in Web documents); term extraction.
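A minimal sketch of building the two example precision-oriented queries from an event record. The record fields, the stop word list, and the sample event are hypothetical. Quoting marks a phrase query, which is an assumption about the search syntax.

```python
# Small illustrative stop word list; a real system would use a fuller one.
STOPWORDS = {"a", "an", "and", "at", "for", "in", "of", "on", "the", "to", "with"}


def precision_queries(event):
    """Combine event record features into precision-oriented queries."""
    title, venue, city = event["title"], event["venue"], event["city"]
    title_no_stopwords = " ".join(
        word for word in title.split() if word.lower() not in STOPWORDS
    )
    return [
        f'"{title}" "{venue}"',           # ["title" + "venue"]
        f'{title_no_stopwords} "{city}"',  # [title-no-stopwords + "city"]
    ]


# Hypothetical planned event record.
event = {"title": "Jazz in the Park", "venue": "Central Park", "city": "New York"}
queries = precision_queries(event)
```

The first query keeps the title as an exact phrase, and the second relaxes it to a bag of words and anchors it with the city.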
  • 55.
    Leveraging Cross-Site Content. Build precision-oriented queries using planned event features. Use precision-oriented queries to retrieve data from Twitter, Flickr, and YouTube. Build recall-oriented queries using data from each site individually or from all sites collectively. (Diagram: queries [title+city], [title+venue], … retrieve tweet 1 … tweet n, photo 1 … photo n, video 1 … video n.)
  • 56.
    Experimental Settings. 60 planned events from EventBrite, LastFM, LinkedIn, and Facebook. Corresponding social media documents: retrieved from Twitter, Flickr, and YouTube; ranked according to similarity to event record. Techniques. Precision: only precision-oriented queries. MS: precision- and recall-oriented queries selected using Microsoft n-gram probability score. RTR: precision- and recall-oriented queries selected using ratio of document frequency around the time of the event to document frequency in a larger time window.
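A minimal sketch of the RTR ratio for a single candidate term, assuming raw document counts and symmetric windows around the event time. The window sizes, time units, and sample documents are hypothetical, and the slides do not specify whether the frequencies are normalized.

```python
def rtr_score(term, docs, event_time, near_window, far_window):
    """Document frequency near the event divided by frequency in a larger window.

    docs is a list of (timestamp, set_of_terms) pairs; windows are half-widths.
    """
    near = sum(
        1 for t, terms in docs
        if abs(t - event_time) <= near_window and term in terms
    )
    far = sum(
        1 for t, terms in docs
        if abs(t - event_time) <= far_window and term in terms
    )
    return near / far if far else 0.0


# Hypothetical documents with timestamps in hours relative to the event.
docs = [
    (0, {"coachella", "music"}),
    (1, {"coachella", "lineup"}),
    (-2, {"coachella"}),
    (100, {"coachella", "music"}),
    (150, {"music"}),
    (-120, {"music"}),
    (500, {"music"}),  # outside the larger window, so it is ignored
]

event_specific = rtr_score("coachella", docs, event_time=0, near_window=24, far_window=200)
generic = rtr_score("music", docs, event_time=0, near_window=24, far_window=200)
```

A term concentrated around the event time gets a ratio near 1, while a term used steadily over the larger window gets a low ratio.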
  • 57.
    NDCG Performance on Twitter. NDCG scores for top-k Twitter documents retrieved by precision-oriented queries (Precision), and query strategies using Twitter data (Twitter-RTR, Twitter-MS).
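For reference, a minimal sketch of NDCG at rank k. It assumes the gain is the raw relevance label discounted by log2(rank + 1), and that the ideal ranking is the same judged list sorted by relevance. Other NDCG variants use an exponential gain, and the slides do not say which was used.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain: sum of rel_i / log2(i + 1), ranks from 1."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances, start=1)
    )


def ndcg_at_k(relevances, k):
    """DCG of the top-k results divided by the DCG of the ideal top-k ordering."""
    ideal = dcg(sorted(relevances, reverse=True)[:k])
    return dcg(relevances[:k]) / ideal if ideal > 0 else 0.0


# Relevance labels of retrieved documents, in ranked order (hypothetical).
best_first = ndcg_at_k([3, 2, 1], k=3)
worst_first = ndcg_at_k([1, 3], k=2)
```

A ranking that places the most relevant documents first scores 1, and placing a less relevant document above a more relevant one lowers the score.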
  • 58.
    Cross-Site NDCG Performance. NDCG scores for top-k YouTube documents retrieved by precision-oriented queries (Precision), and query strategies using data from Twitter (Twitter-MS) and YouTube (YouTube-MS).
  • 59.
    Identification and Characterization of Events in Social Media (outline): Characterization of trending events; Identification of trending events; Similarity metric learning for trending events; Identification of content for planned events; Selection of event content.
  • 60.
    Event Content Selection. Example tweets: “Tiger Woods to make a public apology Friday and talk about his future in golf.”; “Tiger woods y'all,tiger woods y'all,ah tiger woods y'all”; “Tiger Woods Apology”; “Tiger Woods Hugs: http://tinyurl.com/yhf4uzw”; “Tiger Woods Returns To Golf - Public Apology http://bit.ly/9Ui5jx”; “Wedge wars upstage Watson v Woods: BBC Sport (blog)”.
  • 61.
    Event Content Selection. Challenges: document clusters contain noise; relevant documents might have poor quality text; relevant, high quality documents might not be interesting. For each document and a given event, evaluate: quality, relevance, usefulness.
  • 62.
    Centrality Based Document Selection. Centroid: cosine similarity of each document to cluster centroid. Degree: documents are nodes; documents are connected if their similarity is above a threshold; compute degree centrality of each node. LexRank [Erkan and Radev 2004]: same graph structure as Degree method; central documents are similar to other central documents.
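A minimal sketch of the Centroid and Degree methods over raw term-frequency vectors. The whitespace tokenization, the unweighted term counts, the 0.5 threshold, and the sample documents are assumptions. The thesis may use different text representations, and LexRank is omitted here.

```python
import math
from collections import Counter


def tf_vector(text):
    """Raw term-frequency vector using whitespace tokenization."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0


def centroid_scores(vectors):
    """Cosine similarity of each document to the cluster centroid.

    The summed vector stands in for the mean: cosine ignores vector length.
    """
    centroid = Counter()
    for v in vectors:
        centroid.update(v)
    return [cosine(v, centroid) for v in vectors]


def degree_scores(vectors, threshold):
    """Number of other documents whose similarity exceeds the threshold."""
    n = len(vectors)
    return [
        sum(1 for j in range(n)
            if j != i and cosine(vectors[i], vectors[j]) > threshold)
        for i in range(n)
    ]


# Hypothetical event cluster with one off-topic document.
cluster = [
    "tiger woods apology",
    "tiger woods public apology friday",
    "tiger woods apology golf",
    "wedge wars upstage watson",
]
vectors = [tf_vector(doc) for doc in cluster]
by_centroid = centroid_scores(vectors)
by_degree = degree_scores(vectors, threshold=0.5)  # assumed threshold
top_document = cluster[by_centroid.index(max(by_centroid))]
```

Both methods push the off-topic document to the bottom. Centroid also separates the three on-topic documents, which all have the same degree at this threshold.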
  • 63.
    Experimental Methodology: Content Selection. 50 event clusters, randomly selected. 5 top tweets per event for each of Centroid, Degree, and LexRank. Labeled on a 1-4 scale. Quality: excellent (4) to poor (1). Relevance: clearly relevant (4) to not relevant (1). Usefulness: clearly useful (4) to not useful (1).
  • 64.
    Content Selection Results. Average scores over all events (out of 4). High quality and relevance (>3) for both Degree and Centroid. Centroid is the only method with high usefulness.
  • 65.
    Conclusions. Techniques for identifying, characterizing, and selecting social media content for events. There are significant differences between types of trends in social media, specifically trending events and non-event trends. Trending events and their associated social media documents can be effectively identified using online clustering with: a classification step to separate event and non-event content; social media document similarity metrics for documents with rich context features. A two-step query formulation technique is useful for identifying planned events across different social media sites. Centrality-based techniques can be used to select high quality, relevant, and useful social media event content.
  • 66.
    Future Work. Clustering framework optimization. Blocking techniques. Topic models. Identify unknown events with learned similarity metrics across sites. Improve breadth of event content. Rank events for search and presentation. Extension of content selection techniques. Learned ranking models.
  • 67.
    Publications
    Hila Becker, Dan Iter, Mor Naaman, Luis Gravano, “Identifying Content for Planned Events Across Social Media Sites,” under submission.
    Hila Becker, Mor Naaman, Luis Gravano, “Beyond Trending Topics: Real-World Event Identification on Twitter,” in Proceedings of the Fifth International AAAI Conference on Weblogs and Social Media (ICWSM’11), short paper.
    Hila Becker, Mor Naaman, Luis Gravano, “Selecting Quality Twitter Content for Events,” in Proceedings of the Fifth International AAAI Conference on Weblogs and Social Media (ICWSM’11), short paper.
    Hila Becker, Feiyang Chen, Dan Iter, Mor Naaman, Luis Gravano, “Automatic Identification and Presentation of Twitter Content for Planned Events,” in Proceedings of the Fifth International AAAI Conference on Weblogs and Social Media (ICWSM’11), demo paper.
    Mor Naaman, Hila Becker, Luis Gravano, “Hip and Trendy: Characterizing Emerging Trends on Twitter,” in Journal of the American Society for Information Science and Technology.
    Hila Becker, Mor Naaman, Luis Gravano, “Learning Similarity Metrics for Event Identification in Social Media,” in Proceedings of the Third ACM International Conference on Web Search and Data Mining (WSDM '10), 291-300.
    Hila Becker, Bai Xiao, Mor Naaman, Luis Gravano, “Exploiting Social Links for Event Identification in Social Media,” in Proceedings of the 3rd Annual Workshop on Search in Social Media (SSM '10), poster paper.
    Hila Becker, Mor Naaman, Luis Gravano, “Event Identification in Social Media,” in Proceedings of the ACM SIGMOD Workshop on the Web and Databases (WebDB '09), 2009.
  • 68.
  • 69.

Editor's Notes

  • #9 The problems we are trying to solve in this thesis
  • #10 Challenges and opportunities
  • #11 What’s been done in the space, very very very briefly, to introduce our known vs. unknown division
  • #12 Explain that we work in real-time (for the most part) and say we divide the space into unknown and known identification scenarios, then mention the type of event we focus on for each. Also briefly mention that, as we discuss in the thesis, these are not disjoint
  • #13 Contributions in order, broken down into identification scenarios (more or less).
  • #14 Contributions in order, broken down into identification scenarios (more or less).
  • #15 Contributions in order, broken down into identification scenarios (more or less).
  • #16 Contributions in order, broken down into identification scenarios (more or less).
  • #18 Before we identify trending events, we asked ourselves what types of trending events exist in social media and how are they different from non-event trends that exhibit similar temporal behavior
  • #20 Over 200 million users
  • #24 Give a brief example of each feature type.
  • #25 These will help guide our features for event classification next…
  • #29 Leader-follower? Dropping old clusters, merging clusters, etc. for future work
  • #35 NDCG is a precision-based metric that takes rank into account
  • #39 Bringing it back to point out the parameters
  • #41 This is an outline for the similarity metric learning discussion
  • #45 This is an outline for the similarity metric learning discussion
  • #48 Get the Upcoming page for the event with the photo thumbnails, show the machine tag
  • #49 TAGS – ALMOST AS GOOD AS ALL-TEXT
  • #63 LexRank: method for extractive summarization. Central nodes are connected to other central nodes; each node has a centrality value that it distributes to connected nodes
  • #65 Play with thresholds!
  • #68 … just the ones that went into this thesis 