Exploring the similarity between Social Knowledge Sources and Twitter for Cross-Domain Topic Classification of Tweets #KECSM 2012 #ISWC2012
 


Information propagates rapidly on social streams, which have proven to be an up-to-date channel of communication that can reveal events happening in the world. However, identifying the topicality of short messages (e.g. tweets) distributed on these streams poses new challenges for the development of accurate classification algorithms.
To alleviate this problem, we study for the first time a transfer learning setting that makes use of two frequently updated social knowledge sources (KSs), DBpedia and Freebase, for detecting topics in tweets. In this paper we investigate the similarity (and dissimilarity) between these KSs and Twitter at the lexical and conceptual (entity) level. We also evaluate the contribution of these types of features and propose various statistical measures for determining the topics which are highly similar or different in KSs and tweets.
Our findings can be of potential use to machine learning or domain adaptation algorithms aiming to use named entities for topic classification of tweets. These results can also be valuable in the identification of representative sets of annotated articles from the KSs, which can help in building accurate topic classifiers of tweets.


    Presentation Transcript

    • Motivation | Research question | State-of-the-art | Methodology | Results | Conclusions and Future Work
      Exploring the similarity between Social Knowledge Sources and Twitter for Cross-Domain Topic Classification of Tweets
      Andrea Varga, Amparo E. Cano and Fabio Ciravegna
      1 Organisations Information and Knowledge (OAK) Research Group, University of Sheffield
      2 Knowledge Management Institute (KMI), Open University
      KECSM 2012/ISWC 2012, Nov 12, 2012
      1/22
    • Outline
      1 Motivation
      2 State-of-the-art
      3 Methodology
      4 Results
      5 Conclusions and Future Work
    • Why classify tweets into topics?
      Topic classification (TC) of tweets can be important for multiple applications:
        • Information Retrieval
        • Recommendation
        • Emergency responses, etc.
      Topic name                       Example tweets
      Disaster&Accident (DisAcc)       happening accident people dying could phone ambulance wakakkaka xd
      Entertainment&Culture (EntCult)  google adwords commercial greeeat enjoyed watching greeeeeat day
      Politics (Pol)                   quoting military source sk media reports deployed rocket launchers decoys real
      Sports (Sports)                  ravens good position games left browns bengals playoffs
    • What are the challenges in Topic Classification (TC) of Tweets?
      Special characteristics of tweets:
        • the restricted size of a post (limited to 140 characters)
        • the frequent use of misspellings and jargon
        • the frequent use of abbreviations
        • the use of non-standard English, reflected in vocabulary and writing style
      Topic name                       Example tweets
      Disaster&Accident (DisAcc)       happening accident people dying could phone ambulance wakakkaka xd
      Entertainment&Culture (EntCult)  google adwords commercial greeeat enjoyed watching greeeeeat day
      Politics (Pol)                   quoting military source sk media reports deployed rocket launchers decoys real
      Sports (Sports)                  ravens good position games left browns bengals playoffs
      => These characteristics pose additional challenges for traditional supervised machine learning approaches that aim to build accurate topic classifiers for tweets
    • Why are Social Knowledge Sources (KS) relevant to Twitter?
      Data bottleneck problem: investigate an alternative approach, inspired by domain adaptation/transfer learning, for exploiting the information from Social Knowledge Sources (DBpedia and Freebase) for TC of tweets
      Commonalities between KSs and Twitter:
        • they are constantly edited by web users
        • they are social and built in a collaborative manner
        • they cover a large number of topics
      More importantly: KSs contain a large amount of annotated data on a large number of topics
    • Research questions
      1 Are KSs relevant for topic classification of tweets?
      2 Which features make the KSs look more similar to Twitter?
      3 How similar or dissimilar are KSs to Twitter? Which similarity measure better quantifies the lexical changes between KSs and Twitter?
    • State-of-the-art approaches for TC of Tweets
      Using DBpedia for Topic Classification of Tweets:
        • Wikify (Mihalcea, R. and Csomai, A., 2007)
        • Enriching unstructured text with Wikipedia links (D. Milne and I. H. Witten, 2008)
        • Tagme (P. Ferragina and U. Scaiella, 2010)
        • Topical Social Sensor (P. K. P. N. Mendes et al., 2010)
        • Vector space model (Oscar Munoz-Garcia et al., 2011)
      Using Freebase for Topic Classification of Tweets:
        • Clustering based approach (S. P. Kasiviswanathan et al., 2011)
      Our main contribution:
        • Understanding the similarity between KSs and Twitter
        • Exploring multiple KSs (DBpedia + Freebase)
        • Investigating various statistical metrics for quantifying the similarity between KSs and Twitter
    • Methodology followed
      1 Collecting Data from KSs
      2 Building Cross-Domain (CD) Topic Classifier of Tweets
      3 Measuring Distributional Changes Between KSs and Twitter
      [Figure: pipeline for the scenarios Sc. DB, Sc. FB and Sc. DB-FB. KS side: retrieve articles, concept enrichment, build cross-domain classifier. Twitter side: retrieve tweets, concept enrichment, annotate tweets.]
    • Step 1: Collecting Data from KSs
        • Twitter corpus collected in Abel et al. (2011): tweets posted between October 2010 and January 2011, annotated with 17 topics
        • Random selection of 1,000 articles/tweets from DBpedia/Freebase/Twitter for each topic => 9,465 articles from DBpedia; 16,915 articles from Freebase; and 12,412 tweets
        • Preprocessing: removal of hashtags, mentions and URLs from tweets; taking the top-1000 features for each topic
      [Figure: pie charts of the multilabel frequency in DBpedia, Freebase and Twitter]
    • Step 1: Collecting Data from KSs
      Topics: Business_Finance, Disaster_Accident, Education, Entertainment, Environment, Health, Human Interest, Labor, Law_Crime, Technology_IT, Religion, Social Issues, Weather, Sports, War_Conflict, Politics
      Retrieval of articles for a given topic (e.g. Politics):
        • from DBpedia: executing SPARQL queries for retrieving category names containing the topic name:
            Category:Politics_of_the_United_States
            Category:National_Democratic_Party_Egypt_politicians
            etc.
        • from Freebase: accessing the Text Service API for articles belonging to the topic
        • for underspecified topics/domains: consider articles containing the topic in their titles
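The DBpedia retrieval step could be sketched as a small query builder. This is a minimal sketch under stated assumptions, not the authors' actual query: it assumes articles are linked to their categories through `dct:subject`, and it reads "category names containing the topic name" as a case-insensitive substring match on the category URI. The function name `build_category_query` and its `limit` parameter are hypothetical.

```python
def build_category_query(topic: str, limit: int = 1000) -> str:
    """Build a SPARQL query for articles whose category URI contains `topic`.

    Assumption: categories are attached to articles via dct:subject and live
    under http://dbpedia.org/resource/Category:
    """
    # Escape characters that would break out of the SPARQL string literal.
    needle = topic.lower().replace("\\", "\\\\").replace('"', '\\"')
    return (
        "PREFIX dct: <http://purl.org/dc/terms/>\n"
        "SELECT DISTINCT ?article ?category WHERE {\n"
        "  ?article dct:subject ?category .\n"
        '  FILTER(STRSTARTS(STR(?category), "http://dbpedia.org/resource/Category:"))\n'
        f'  FILTER(CONTAINS(LCASE(STR(?category)), "{needle}"))\n'
        "}\n"
        f"LIMIT {int(limit)}"
    )


# Example: the query text for the Politics topic (no network call is made here).
query = build_category_query("Politics")
print(query)
```

The sketch only produces the query text; sending it to a SPARQL endpoint and sampling the returned articles is left out on purpose.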
    • Step 2: Building Cross-Domain (CD) Topic Classifier of Tweets
      Considering two different feature sets:
        • BoW: tf.idf value of the words present in the examples (articles or tweets)
        • BoE: tf.idf value of the words and entity+concept pairs present in the examples (articles or tweets)
      [Figure: pipeline diagram for Sc. DB, Sc. FB and Sc. DB-FB]
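The difference between the two feature sets can be illustrated with a few lines of code. This is a sketch only: the slides do not say which tf.idf variant was used, so length-normalised term frequency with a logarithmic inverse document frequency is assumed, and the entity+concept annotations below are made-up stand-ins for the output of a concept-enrichment service.

```python
import math
from collections import Counter


def tfidf(docs):
    """docs: list of feature lists. Returns one {feature: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each feature occurs.
    df = Counter(f for doc in docs for f in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append(
            {f: (count / len(doc)) * math.log(n_docs / df[f]) for f, count in tf.items()}
        )
    return vectors


def bow(text):
    """Bag of words: lower-cased whitespace tokens."""
    return text.lower().split()


def boe(text, annotations):
    """Bag of entities: the words plus one 'entity|concept' feature per annotation."""
    return bow(text) + [f"{entity}|{concept}" for entity, concept in annotations]


# Two example tweets from the slides; the (entity, concept) pairs are hypothetical.
tweets = [
    ("ravens good position games left browns bengals playoffs",
     [("ravens", "SportsTeam")]),
    ("quoting military source sk media reports deployed rocket launchers decoys real",
     [("rocket launchers", "Weapon")]),
]
bow_vectors = tfidf([bow(text) for text, _ in tweets])
boe_vectors = tfidf([boe(text, ann) for text, ann in tweets])
print(sorted(set(boe_vectors[0]) - set(bow_vectors[0])))
```

In this representation the BoE vector is a superset of the BoW vector, which matches the slide's description of BoE as words plus entity+concept pairs.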
    • Step 3: Measuring Distributional Changes Between KSs and Twitter
      Building a vector $\vec{d_s}$ for each source dataset (Sc.DB, Sc.FB, Sc.DB-FB) and a vector $\vec{d_t}$ for the target dataset (Twitter), consisting of the TF-IDF weights for either the BoW or the BoE feature set.
      Statistical measures applied:
        • χ² test: $\chi^2 = \sum \frac{(O - E)^2}{E}$, where O is the observed value for a feature, while E is the expected value calculated on the basis of the joint corpus
        • Kullback-Leibler symmetric distance: $KL(\vec{d_s} \,\|\, \vec{d_t}) = \sum_{f \in F_S \cup F_T} \left( \vec{d_s}(f) - \vec{d_t}(f) \right) \log \frac{\vec{d_s}(f)}{\vec{d_t}(f)}$
        • cosine similarity: $cosine(\vec{d_s}, \vec{d_t}) = \frac{\sum_{k=1}^{|F_S \cup F_T|} \vec{d_s}(f_k^S) \times \vec{d_t}(f_k^T)}{\sqrt{\sum_{k=1}^{|F_S \cup F_T|} \left( \vec{d_s}(f_k^S) \right)^2} \times \sqrt{\sum_{k=1}^{|F_S \cup F_T|} \left( \vec{d_t}(f_k^T) \right)^2}}$
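The three measures can be sketched over sparse `{feature: weight}` vectors as follows. Several details are assumptions rather than the authors' implementation: the χ² statistic is computed as a test of homogeneity over a two-row (source, target) table whose expected values come from the joint totals, and the symmetric KL distance normalises both vectors and adds a small smoothing constant `eps` so that features missing on one side do not produce a division by zero.

```python
import math


def align(ds, dt):
    """Project both sparse vectors onto the union of their features."""
    features = sorted(set(ds) | set(dt))
    return [ds.get(f, 0.0) for f in features], [dt.get(f, 0.0) for f in features]


def chi_square(ds, dt):
    """Sum of (O - E)^2 / E, with E derived from the joint (source + target) totals."""
    s, t = align(ds, dt)
    total_s, total_t = sum(s), sum(t)
    grand = total_s + total_t
    if grand == 0:
        return 0.0
    stat = 0.0
    for o_s, o_t in zip(s, t):
        column = o_s + o_t
        for observed, row_total in ((o_s, total_s), (o_t, total_t)):
            expected = row_total * column / grand
            if expected > 0:
                stat += (observed - expected) ** 2 / expected
    return stat


def symmetric_kl(ds, dt, eps=1e-9):
    """Sum over features of (p - q) * log(p / q) on smoothed, normalised vectors."""
    s, t = align(ds, dt)
    s = [x + eps for x in s]
    t = [x + eps for x in t]
    z_s, z_t = sum(s), sum(t)
    return sum((a / z_s - b / z_t) * math.log((a / z_s) / (b / z_t)) for a, b in zip(s, t))


def cosine(ds, dt):
    """Dot product divided by the product of the Euclidean norms."""
    s, t = align(ds, dt)
    dot = sum(a * b for a, b in zip(s, t))
    norm = math.sqrt(sum(a * a for a in s)) * math.sqrt(sum(b * b for b in t))
    return dot / norm if norm else 0.0


# Illustrative toy vectors, not data from the study.
source = {"game": 0.5, "team": 0.2}
target = {"game": 0.1, "vote": 0.7}
print(chi_square(source, target), symmetric_kl(source, target), cosine(source, target))
```

Note the direction of each score: χ² and KL grow as the two distributions diverge, while cosine grows as they become more similar.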
    • Experimental setting
        • 1-vs-all approach: an individual CD classifier is built for each topic
        • SVM classifiers, evaluated with 5-fold cross-validation
        • Sc.DB, Sc.FB and Sc.DB-FB classifiers: trained on the full KS data, evaluated on 20% of the Twitter data (2,482 tweets)
        • TGT classifier: trained on 80% of the Twitter data, evaluated on 20% of the Twitter data (2,482 tweets)
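The evaluation protocol can be sketched without any learning library. The snippet below only shows how the Twitter data would be split and how binary 1-vs-all labels would be derived; the SVM training itself is omitted. The interleaved fold assignment (no shuffling, no stratification) and the function names are assumptions made for illustration.

```python
def five_fold_splits(n_items, k=5):
    """Yield (train_indices, test_indices) pairs; each test fold holds about 1/k of the items."""
    indices = list(range(n_items))
    for fold in range(k):
        test = indices[fold::k]
        held_out = set(test)
        train = [i for i in indices if i not in held_out]
        yield train, test


def one_vs_all_labels(topic_sets, topic):
    """Binary labels for one topic: 1 if the example carries the topic, else 0."""
    return [1 if topic in topics else 0 for topics in topic_sets]


# 12,412 tweets split into five folds gives test folds of about 2,482 tweets (20%).
folds = list(five_fold_splits(12412))
print([len(test) for _, test in folds])

# Multi-label toy annotations, one set of topics per tweet.
labels = one_vs_all_labels([{"Sports"}, {"Politics", "War_Conflict"}, set()], "Politics")
print(labels)
```

In this setting the TGT classifier would be trained on each `train` part, while the Sc.DB, Sc.FB and Sc.DB-FB classifiers would ignore `train`, be fitted on the KS articles, and be scored on the same `test` part.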
    • Findings - Classification performance in F1 measure
      Q1: Which KS better reflects the lexical variation in Twitter?
      Sc.DB-FB showed the best performance, followed by Sc.FB and Sc.DB.
      F1 per topic, as shown in the chart:
      Topic     Sc.DB(BoE)  Sc.DB(BoW)  Sc.FB(BoE)  Sc.FB(BoW)  Sc.DB-FB(BoE)  Sc.DB-FB(BoW)  TGT(BoW)  TGT(BoE)
      BusFi     21.40       21.20       11.50       15.30       18.60          19.10          47.50     46.50
      EntCult   15.30       15.10       15.50       16.20       19.50          20.20          42.20     43.10
      Religion  23.40       25.10       14.40       14.70       21.00          20.40          58.40     58.50
      Health    28.60       30.30       25.60       24.70       26.90          25.40          51.70     51.80
      Pol       22.20       21.00       27.80       26.80       24.20          26.80          45.10     44.90
      Law       0.90        2.70        16.80       17.80       14.80          13.30          46.80     46.40
      HospRecr  1.40        2.30        17.20       19.50       11.30          13.90          41.60     42.60
      SocIssue  1.30        2.00        8.80        9.00        9.70           9.10           44.20     44.00
      DisAcc    8.30        9.70        14.50       14.50       21.00          21.10          57.50     59.20
      TechIT    1.60        2.40        18.40       18.60       12.40          9.90           57.40     58.00
      Env       15.20       14.20       2.20        2.20        8.90           8.40           46.60     48.30
      HumInt    1.10        1.40        2.00        1.60        1.50           2.20           33.60     34.20
      Weather   3.10        7.00        39.80       39.90       36.70          36.00          81.20     81.50
      Labor     1.40        1.30        31.90       31.90       30.10          29.90          79.90     79.40
      War       9.30        10.90       23.90       23.60       24.60          25.70          67.60     72.70
      Sports    10.30       11.70       26.50       26.20       26.20          26.00          60.10     59.20
      Edu       37.40       37.80       42.50       47.20       42.50          45.70          71.90     71.30
    • Findings - Classification performance in F1 measure
      Q2: What feature makes the KSs look more similar to Twitter?
        • BoW features were found better than BoE for CD classifiers
        • BoE features were found better than BoW for TGT
      [Chart: F1 per topic for Sc.DB, Sc.FB, Sc.DB-FB and TGT with BoE and BoW features]
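The averages behind the Q1 and Q2 statements can be checked with a short script over the values printed in the chart. This is a sketch that assumes the eight values printed with each topic label follow the legend order (Sc.DB(BoE), Sc.DB(BoW), Sc.FB(BoE), Sc.FB(BoW), Sc.DB-FB(BoE), Sc.DB-FB(BoW), TGT(BoW), TGT(BoE)); the unweighted mean over the 17 topics is used here as a summary and is not necessarily the average the authors computed.

```python
LEGEND = ["Sc.DB(BoE)", "Sc.DB(BoW)", "Sc.FB(BoE)", "Sc.FB(BoW)",
          "Sc.DB-FB(BoE)", "Sc.DB-FB(BoW)", "TGT(BoW)", "TGT(BoE)"]

# F1 values transcribed from the chart, one topic per line, in legend order.
CHART = """
BusFi 21.40 21.20 11.50 15.30 18.60 19.10 47.50 46.50
EntCult 15.30 15.10 15.50 16.20 19.50 20.20 42.20 43.10
Religion 23.40 25.10 14.40 14.70 21.00 20.40 58.40 58.50
Health 28.60 30.30 25.60 24.70 26.90 25.40 51.70 51.80
Pol 22.20 21.00 27.80 26.80 24.20 26.80 45.10 44.90
Law 0.90 2.70 16.80 17.80 14.80 13.30 46.80 46.40
HospRecr 1.40 2.30 17.20 19.50 11.30 13.90 41.60 42.60
SocIssue 1.30 2.00 8.80 9.00 9.70 9.10 44.20 44.00
DisAcc 8.30 9.70 14.50 14.50 21.00 21.10 57.50 59.20
TechIT 1.60 2.40 18.40 18.60 12.40 9.90 57.40 58.00
Env 15.20 14.20 2.20 2.20 8.90 8.40 46.60 48.30
HumInt 1.10 1.40 2.00 1.60 1.50 2.20 33.60 34.20
Weather 3.10 7.00 39.80 39.90 36.70 36.00 81.20 81.50
Labor 1.40 1.30 31.90 31.90 30.10 29.90 79.90 79.40
War 9.30 10.90 23.90 23.60 24.60 25.70 67.60 72.70
Sports 10.30 11.70 26.50 26.20 26.20 26.00 60.10 59.20
Edu 37.40 37.80 42.50 47.20 42.50 45.70 71.90 71.30
"""

rows = [line.split() for line in CHART.strip().splitlines()]
# Transpose the per-topic rows into one column of F1 values per classifier.
columns = list(zip(*[[float(value) for value in row[1:]] for row in rows]))
macro_f1 = {name: sum(col) / len(col) for name, col in zip(LEGEND, columns)}

for name in LEGEND:
    print(f"{name:14s} {macro_f1[name]:.2f}")
```

Under the stated column-order assumption, the printed means allow the two orderings on the slides (combined KS above Freebase above DBpedia, and BoW versus BoE per classifier type) to be compared directly.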
    • Findings - Examining the number of annotations needed for the Twitter classifier to outperform Sc.DB-FB
        • Investigated the impact of employing the Sc.DB-FB classifier over the Twitter classifier in terms of the number of annotations
        • The performance of the Twitter classifier against the three CD classifiers over the full learning curve
      => In the absence of any annotated tweets, applying these CD classifiers is beneficial
    • Findings - Examining the number of annotations needed for the Twitter classifier to outperform the CD classifiers
      Q3: How similar or dissimilar are KSs to Twitter posts, and which similarity measure better reflects the lexical changes between KSs and Twitter posts?
      Compared χ², KL-divergence and cosine for each topic:
        • χ² obtained the best correlation with the performance of the CD classifiers, achieving correlation scores >70% in 32 cases
        • cosine obtained correlation scores >70% in 25 cases
        • KL obtained correlation scores >70% in 24 cases
      => The χ² test is the best measure for quantifying the distributional differences between KSs and Twitter.
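How a similarity measure is correlated with classifier performance can be sketched as follows. The slides do not state which correlation coefficient was used or exactly which pairs of values were correlated, so Pearson's coefficient between one similarity score and one F1 score per setting is assumed here. The numbers are illustrative toy values, not results from the study.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (sd_x * sd_y)


# Toy values: one similarity score and one F1 score per (hypothetical) setting.
toy_similarity = [0.10, 0.40, 0.80]
toy_f1 = [10.0, 20.0, 45.0]
r = pearson(toy_similarity, toy_f1)
print(f"correlation: {r:.2f}, above the 70% threshold: {abs(r) > 0.70}")
```

For a distance such as χ² or KL a strong relationship would show up as a negative coefficient, which is why the sketch compares the absolute value against the threshold.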
    • Conclusions and Future Work: We presented a first study towards understanding the usefulness of KSs in topic classification (TC) of tweets at two granularities: lexical features (BoW) and entity features (BoE). Our main findings are: (1) in the absence of any annotated tweets, applying these cross-domain (CD) classifiers is beneficial; (2) of the two KSs, Freebase topics seem to be much closer to the Twitter topics than the DBpedia topics; (3) the two KSs contain complementary information; (4) for the CD classifiers, BoW features were on average more useful than BoE features; (5) for the Twitter classifiers, BoE features were on average more useful than BoW features; (6) we found the χ² test to be the best measure for quantifying the distributional differences between KSs and Twitter. Our future work will focus on building more accurate topic classifiers and investigating better measures.
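As a rough illustration of how a χ² measure can quantify the distributional difference between a KS and Twitter, the sketch below computes the Pearson χ² statistic over a 2 × V contingency table of feature counts (words for BoW, entities for BoE). The function name and the token-list inputs are assumptions made for this example; the slides do not state the exact formulation the authors used.

```python
from collections import Counter


def chi_square_statistic(source_features, target_features):
    """Pearson chi-square statistic between two bags of features.

    Hypothetical helper: rows are the two corpora (source KS, target tweets),
    columns are the distinct features observed in either corpus.
    """
    src = Counter(source_features)
    tgt = Counter(target_features)
    n_src = sum(src.values())
    n_tgt = sum(tgt.values())
    if n_src == 0 or n_tgt == 0:
        raise ValueError("both corpora must contain at least one feature")
    total = n_src + n_tgt

    stat = 0.0
    for feature in set(src) | set(tgt):
        col_total = src[feature] + tgt[feature]
        for observed, row_total in ((src[feature], n_src), (tgt[feature], n_tgt)):
            # Expected count if the feature were equally likely in both corpora.
            expected = row_total * col_total / total
            stat += (observed - expected) ** 2 / expected
    return stat


if __name__ == "__main__":
    ks_words = ["election", "vote", "party", "vote"]
    tweet_words = ["vote", "lol", "party", "vote"]
    print(chi_square_statistic(ks_words, tweet_words))
```

A larger statistic indicates that the two feature distributions differ more; identical distributions give 0.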
    • Corpus Analysis - Size of vocabulary [Figure: bar chart of vocabulary size per topic for each dataset; the plotted values are not recoverable from the transcript.]
    • Understanding the results - Number of unique entities: Examining the number of entities in the source (Sc.DB, Sc.FB, Sc.DB-FB) and target (TGT) datasets after pre-processing: the TGT dataset contains 1.73 ± 0.35 entities/tweet; the Sc.DB dataset contains 22.24 ± 1.44 entities/article; the Sc.FB dataset contains 8.14 ± 5.78 entities/article. [Figure: bar chart of the number of unique entities per topic for each dataset; the plotted values are not recoverable from the transcript.]
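The per-document entity figures above can be reproduced in spirit with a few lines of Python. This is a minimal sketch under stated assumptions: each document is represented as a list of already extracted entity strings, and the "±" value is taken to be a population standard deviation, which the slides do not confirm. Both function names are hypothetical.

```python
from statistics import mean, pstdev


def entities_per_document(entity_lists):
    """Return (mean, standard deviation) of the entity count per document."""
    counts = [len(entities) for entities in entity_lists]
    return mean(counts), pstdev(counts)


def unique_entity_count(entity_lists):
    """Return the number of distinct entities across all documents."""
    unique = set()
    for entities in entity_lists:
        unique.update(entities)
    return len(unique)


if __name__ == "__main__":
    # Toy data, not taken from the paper.
    tweets = [["Obama"], ["Obama", "London", "BBC"]]
    avg, spread = entities_per_document(tweets)
    print(f"{avg:.2f} ± {spread:.2f} entities/tweet")
    print(unique_entity_count(tweets), "unique entities")
```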
    • Corpus Analysis - Distribution of the top 15 entities [Figure: distribution of the top 15 entity types (City, Facility, Person, Region, Position, Country, Company, Continent, Technology, SportsEvent, Organization, IndustryTerm, NaturalFeature, ProvinceOrState, MedicalCondition) for each topic (BusFi, DisAcc, Edu, EntCult, Env, Health, HospRecr, HumInt, Labor, Law, Pol, Religion, SocIssue, Sports, TechIT, War, Weather) in the TGT, Sc.DB and Sc.FB datasets.]