Harnessing Linked Knowledge Sources for Topic Classification in Social Media

Presented at Hypertext'13.
Topic classification (TC) of short text messages offers an effective and fast way to reveal events happening around the world, ranging from those related to Disaster (e.g. Sandy hurricane) to those related to Violence (e.g. Egypt revolution). Previous approaches to TC have mostly focused on exploiting individual knowledge sources (KS) (e.g. DBpedia or Freebase) without considering the graph structures that surround concepts present in KSs when detecting the topics of Tweets. In this paper we introduce a novel approach for harnessing such graph structures from multiple linked KSs, by: (i) building a conceptual representation of the KSs, (ii) leveraging contextual information about concepts by exploiting semantic concept graphs, and (iii) providing a principled way for the combination of KSs. Experiments evaluating our TC classifier in the context of Violence Detection (VD) and Emergency Responses (ER) show promising results that significantly outperform various baseline models, including an approach using a single KS without linked data and an approach using only Tweets.

  • I will present work done in collaboration with the Universities of Sheffield, Lancaster and Aston. This work was done as part of the Violence Detection project, which investigates different approaches for detecting violence-related events emerging from social media streams.
  • During the last two years we have witnessed the use of these services to express different emotions within society; these services have become a proxy of information which communicates the social perception of situations regarding, for example, terrorism, social crisis and racism. Therefore the real-time identification of the topics discussed in these channels could aid in different scenarios, including violence detection and emergency response situations.
  • Our intuition indicates that in the first case, the role of Obama as President of the United States could be more indicative of the topic War and Conflict.
  • These two tweets make reference to the same entity, “President Obama”. However, the context in which the entity is used is different: in the first case, the co-occurrence of Obama, Egypt and Mubarak could be more indicative of the War and Conflict topic, while in the second case the occurrence of President Obama and Michelle is less likely to indicate a war and conflict related topic. So we wonder whether the graph structure of existing knowledge sources could help provide an abstraction of the use of these entity types for representing a topic.
  • How can we weight these graphs so as to reveal which of these features characterise Obama in the context of Violence?
  • In order to capture the relative importance of each feature in a semantic meta-graph we propose two different weighting strategies. These are based on the generality and specificity of a feature in a given meta-graph. The weighting models the relative importance of a property p to a given class c, together with the generality of the property in a KS's graph. Here Np is the number of times property p appears in all resources of type c in the KS graph.
  • Here parent(c) denotes the total number of unique parent classes derived from a KS graph.
  • To evaluate the impact of enhancing the feature space with semantic features on the topic classification of tweets, we used a large corpus of tweets and two large-coverage KSs, DBpedia and Freebase. The Twitter dataset was previously derived by Abel et al. and comprises tweets collected during two months starting from November 2010. This dataset has been topically annotated.
  • For each of the tweets and each of the articles we performed Lovins stemming and extracted entities using OpenCalais and Zemanta. Then, as described before, we built the semantic meta-graphs from the DBpedia and Freebase KSs. It is important to mention that the Twitter dataset consists of tweets which contain at least one entity.
  • Topic-Class Entropy: low entropy (LE) indicates a focused topic, while high entropy (HE) indicates that the topic is more random in the subjects it discusses. Entity-Class Entropy: LE indicates a topic is less ambiguous (i.e. entities belong to fewer classes), while HE indicates high ambiguity at the level of the entities. Topic-Class-Property Entropy: LE indicates a topic is dominated by few class-properties, while HE reveals high property diversity.
  • The darker and closer to red a cell is, the more correlated the values are. This indicates that as the number of ambiguous entities in a topic increases, the performance of the TC decreases.
  • Harnessing Linked Knowledge Sources for Topic Classification in Social Media

    1. 1. Harnessing Linked Knowledge Sources for Topic Classification in Social Media. A. Elizabeth Cano (Knowledge Media Institute, The Open University, Milton Keynes), Andrea Varga and Fabio Ciravegna (University of Sheffield, Sheffield), Matthew Rowe (Lancaster University, Lancaster) and Yulan He (Aston University, Birmingham), UK. 2013.
    2. 2. INTRODUCTION: Social Media Streams - Risk in violent and criminal activities
    3. 3. INTRODUCTION: Research Questions
        • Can semantic features help in topic classification (TC)?
        • Which knowledge source (KS) data and KS taxonomies provide useful information for improving the TC of tweets?
    4. 4. OUTLINE
        • Introduction
            - Topic Classification (TC) of Microposts
            - Related Work
            - State of the art limitations
        • Proposed Approach
        • Experiments
        • Findings
        • Conclusions
    5. 5. INTRODUCTION: Difficulties of Topic Classification of microposts
        • Restricted number of characters
        • Irregular and ill-formed words
            - Mixing upper and lowercase letters: makes it difficult to detect proper nouns and other part-of-speech tags.
            - Wide variety of language, e.g. “see u soon”
        • Event-dependent emerging jargon
            - Volatile jargon relevant to particular events, e.g. “Jan.25” (used during the Egyptian revolution)
        • High topical diversity
        • Sparse data
    6. 6. INTRODUCTION: Social Knowledge Sources (KS)
                      DBpedia*        Yago2          Freebase
        Resources     2.35 million    447 million    3.6 million
        Classes       359             562,312        1,450
        Properties    1,820           253,213,842    7,000
        *Using the DBpedia ontology
        • Structured Semantic Web representation of data
            - Maintained by thousands of editors, e.g. DBpedia (derived from Wikipedia) and Freebase
            - Evolves and adapts as knowledge changes [Syed et al., 2008]
        • Cover a broad range of topics
        • Characterise topics with a large number of resources
    7. 7. INTRODUCTION: Local and External Metadata of a Tweet
    8. 8. INTRODUCTION: Local and External Metadata of a Tweet. NER:Country, NER:Person, NER:Person
    9. 9. INTRODUCTION: Local and External Metadata of a Tweet. NER:Country, NER:Person, NER:Person; <http://dbpedia.org/resource/Barack_Obama>, <http://dbpedia.org/resource/Egypt>, <http://dbpedia.org/resource/Hosni_Mubarak>
    10. 10. PROPOSED APPROACH: Exploiting Knowledge Sources for the Topic Classification of Microposts
        • State of the art limitations
            - Use of single knowledge sources.
            - Entities’ metadata is constrained by the NER service used (e.g. OpenCalais, Alchemy).
        • Our approach
            - Exploits multiple knowledge sources.
            - Enhances the entity metadata by deriving semantic graphs.
            - Leverages the graph structures surrounding entities present in a KS for the TC task.
    11. 11. OUTLINE
        • Introduction
        • Proposed Approach
            - Semantic Meta-graphs
            - Weighting Schemas
            - Enhancing TC with Semantic Features
        • Experiments
        • Findings
        • Conclusions
    12. 12. PROPOSED APPROACH: Rationale… (tweets 1 and 2)
    13. 13. PROPOSED APPROACH: Rationale… Tweet 1 could be more indicative of War and Conflict.
    14. 14. PROPOSED APPROACH: Rationale… Tweet 2 is not necessarily a good indicator of War and Conflict.
    15. 15. PROPOSED APPROACH: Rationale… Can the graph structure of existing knowledge sources provide an abstraction of the use of these entity types for representing a topic?
    16. 16. PROPOSED APPROACH: Framework for Topic Classification of Tweets. [Diagram: Retrieve Articles (DB, FB, DB-FB) and Retrieve Tweets (TW); Concept Enrichment; Derive Semantic Features; Build Cross-Source Topic Classifier; Annotate Tweets.] Step 1, Datasets Collection: SPARQL query for all resources from a given topic (e.g. War).
    17. 17. PROPOSED APPROACH: Framework for Topic Classification of Tweets. Step 2, Datasets Enrichment: from tweets and articles’ abstracts, extract entities and link them to resources in DBpedia and Freebase.
    20. 20. PROPOSED APPROACH: Framework for Topic Classification of Tweets. Step 3: Semantic Features Derivation.
    21. 21. PROPOSED APPROACH: Framework for Topic Classification of Tweets. Step 4: build a topic classifier based on features derived from crossed sources.
    23. 23. PROPOSED APPROACH: Deriving Semantic Meta-Graphs. <dbpedia:Barack_Obama, rdf:type, yago:PresidentOfTheUnitedStates>; <dbpedia:Barack_Obama, dbo:birthPlace, dbpedia:Hawaii>
    25. 25. PROPOSED APPROACH: Definition 1 - Resource Meta-graph. A resource meta-graph is a sequence of tuples $G := (R, P, C, Y)$ where
        • $R$, $P$, $C$ are finite sets whose elements are resources, properties and classes;
        • $Y \subseteq R \times P \times C$ is a ternary relation representing a hypergraph with ternary edges;
        • $Y$ is a tripartite graph $H(Y) = \langle V, D \rangle$, where $V$ is the vertex set and $D = \{\{r, p, c\} \mid (r, p, c) \in Y\}$.
    26. 26. PROPOSED APPROACH: Resource Meta-graph. The meta-graph of entity e is the aggregation of all resources, properties and classes related to this entity. Example for Obama: projecting on properties gives birthPlace, author, spouse; projecting on classes gives LivingPeople, PresidentOfTheUnitedStates, Person, Author.
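The two projections of a resource meta-graph can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Edge` type, the function names and the toy edges are hypothetical, and each edge is assumed to record a resource, one of its properties and one of its classes, following the ternary relation Y of Definition 1.

```python
from collections import Counter
from typing import NamedTuple


class Edge(NamedTuple):
    """One ternary edge (r, p, c) of the relation Y (hypothetical type)."""
    resource: str
    prop: str
    cls: str


def project_on_properties(edges):
    """Count how often each property occurs in the meta-graph."""
    return Counter(edge.prop for edge in edges)


def project_on_classes(edges):
    """Count how often each class occurs in the meta-graph."""
    return Counter(edge.cls for edge in edges)


# Toy meta-graph for the entity Obama (illustrative only, not real KS data).
obama_edges = [
    Edge("Obama", "birthPlace", "Person"),
    Edge("Obama", "spouse", "Person"),
    Edge("Obama", "author", "Author"),
    Edge("Obama", "birthPlace", "PresidentOfTheUnitedStates"),
]

property_view = project_on_properties(obama_edges)
class_view = project_on_classes(obama_edges)
```

Keeping the edges as flat tuples makes either projection a single pass over the data; the counts are what a weighting scheme would later consume.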
    27. 27. PROPOSED APPROACH: Resource Meta-graph. How can we weight these graphs to reveal the semantic features that characterise Obama in the context of Violence?
    28. 28. PROPOSED APPROACH: Weighting Semantic Features. Specificity measures the relative importance of a property $p \in G(e)$ to a given class $c \in G(e)$ in a KS graph $G_{KS}$: $\mathrm{specificity}_{KS}(p, c) = \frac{N_p(R(c))}{N(R(c))}$
    29. 29. PROPOSED APPROACH: Weighting Semantic Features. Generality captures the specialisation of a property p to a given class c, by computing the property’s frequency among other semantically related classes R’(c): $\mathrm{generality}_{KS}(p, c) = \frac{N(R'(c))}{N_p(R'(c))}$, where $N(R'(c))$ is the number of resources whose type is either c or a specialisation of c’s parent classes.
    30. 30. PROPOSED APPROACH: Weighting Semantic Features. $SG(p, c) = \mathrm{specificity}_{KS}(p, c) \times \mathrm{generality}_{KS}(p, c)$
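A minimal sketch of the three weights on a toy knowledge source may help make the counts concrete. Everything here is an assumption for illustration: the dictionary layout, the helper names, the single-parent class hierarchy, and in particular the reading of R'(c) as "resources typed with c or with a class that shares c's parent". The toy resources and properties are made up.

```python
def resources_of(kb, classes):
    """Resources whose type set intersects `classes`."""
    return [r for r in kb.values() if r["types"] & classes]


def count_property(resources, prop):
    """N_p: number of times `prop` appears across `resources`."""
    return sum(r["props"].count(prop) for r in resources)


def related_classes(cls, parent_of):
    """`cls` plus the classes sharing its parent (assumed reading of R'(c))."""
    related = {cls}
    parent = parent_of.get(cls)
    if parent is not None:
        related |= {k for k, v in parent_of.items() if v == parent}
    return related


def specificity(kb, prop, cls):
    """N_p(R(c)) / N(R(c)); 0.0 when no resource has type `cls`."""
    members = resources_of(kb, {cls})
    return count_property(members, prop) / len(members) if members else 0.0


def generality(kb, parent_of, prop, cls):
    """N(R'(c)) / N_p(R'(c)); 0.0 when the property never occurs."""
    members = resources_of(kb, related_classes(cls, parent_of))
    uses = count_property(members, prop)
    return len(members) / uses if uses else 0.0


def sg(kb, parent_of, prop, cls):
    """Combined weight: specificity multiplied by generality."""
    return specificity(kb, prop, cls) * generality(kb, parent_of, prop, cls)


# Toy KS (invented for illustration).
toy_kb = {
    "Barack_Obama": {"types": {"President"}, "props": ["birthPlace", "spouse", "office"]},
    "George_Bush": {"types": {"President"}, "props": ["birthPlace", "office"]},
    "Angela_Merkel": {"types": {"Chancellor"}, "props": ["birthPlace"]},
}
toy_parents = {"President": "Politician", "Chancellor": "Politician"}

spouse_weight = sg(toy_kb, toy_parents, "spouse", "President")
```

In this toy data `birthPlace` occurs on every politician, so its generality is low, while `spouse` is rarer among the related classes and receives a higher generality score.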
    31. 31. PROPOSED APPROACH: Enhancing Feature Space with Semantic Features. Semantic Augmentation (A1)
        • Class Features: $F_{A1+C} = F + F_C$
        • Property Features: $F_{A1+P} = F + F_P$
        • Class + Property Features: $F_{A1+C+P} = F + F_C + F_P$
    32. 32. PROPOSED APPROACH: Enhancing Feature Space with Semantic Features. Semantic Augmentation (A1), example
        • $F$: president, obama, televised, statement, hosni, mubarak, resignation, cnn, says, egypt
        • $F_{A1+P}$: dbpedia:birth, dbpedia:state, …, dbpedia-owl:PopulatedPlace/populationDensity, …
        • $F_{A1+C}$: PopulatedPlace, Office_holder, PresidentOfTheUnitedStates, Politician, …
    33. 33. PROPOSED APPROACH: Enhancing Feature Space with Semantic Features. Semantic Augmentation with Generalisation (A2). This augmentation exploits the subsumption relation among classes within the DBpedia or Freebase ontologies. In this case we consider the set of parent classes of c.
        • Parent(c) Features: $F_{A2+C} = F + F_{parent(c)}$
        • Parent(c) + Property Features: $F_{A2+C+P} = F + F_P + F_{parent(c)}$
    34. 34. PROPOSED APPROACH: Enhancing Feature Space with Semantic Features. Semantic Augmentation with Generalisation (A2), example
        • $F$: president, obama, televised, statement, hosni, mubarak, resignation, cnn, says, egypt
        • $F_{A2+parent(c)}$: Place, Office_holder, President, Politician, …
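The A1 and A2 augmentations above amount to appending class, parent-class and property features to the lexical features F. The sketch below is one possible way to express that, not the authors' implementation: the `augment` function, the scheme strings, the `C:`/`P:` prefixes (used only to keep semantic features apart from words) and the toy parent mapping are all assumptions.

```python
def augment(tokens, classes, properties, parent_of=None, scheme="A1+C+P"):
    """Return the lexical features extended according to `scheme`."""
    valid = {"A1+C", "A1+P", "A1+C+P", "A2+C", "A2+C+P"}
    if scheme not in valid:
        raise ValueError(f"unknown scheme: {scheme}")
    parent_of = parent_of or {}
    features = list(tokens)
    if scheme in ("A1+C", "A1+C+P"):
        # A1: add the classes of the entities found in the tweet.
        features += [f"C:{c}" for c in classes]
    if scheme in ("A2+C", "A2+C+P"):
        # A2: generalise each class to its parent class instead.
        parents = {parent_of[c] for c in classes if c in parent_of}
        features += [f"C:{p}" for p in sorted(parents)]
    if scheme in ("A1+P", "A1+C+P", "A2+C+P"):
        features += [f"P:{p}" for p in properties]
    return features


# Toy inputs; the parent mapping is hypothetical.
tokens = ["president", "obama"]
classes = ["PresidentOfTheUnitedStates", "PopulatedPlace"]
properties = ["birthPlace"]
parents = {"PresidentOfTheUnitedStates": "President", "PopulatedPlace": "Place"}

a1_features = augment(tokens, classes, properties, scheme="A1+C")
a2_features = augment(tokens, classes, properties, parents, scheme="A2+C+P")
```

The resulting lists can be fed to any bag-of-features vectoriser, so the classifier itself does not need to know which augmentation was applied.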
    35. 35. OUTLINE
        • Introduction
        • Proposed Approach
        • Experiments
            - Dataset
            - Baseline Features
            - Results
        • Findings
        • Conclusions
    36. 36. PROPOSED APPROACH: Datasets
        • Twitter Dataset [Abel et al., 2011] (TW)
            - Collected during two months starting in Nov 2010.
            - Topically annotated.
            - Using tweets labelled as “War & Conflict” (War), “Law & Crime” (Cri), “Disaster & Accident” (DisAcc).
            - Multilabelled dataset comprising 10,189 Tweets.
        • DBpedia (DB) and Freebase (FB) Dataset
            - SPARQL queried endpoints for all resources from categories and subcategories of skos:concept of War, Cri, DisAcc.
            - DBpedia: 9,465 articles
            - Freebase: 16,915 articles
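The exact SPARQL queries are not given in the slides. As a rough sketch of what retrieving "all resources from categories and subcategories" of a topic could look like against DBpedia, the helper below builds a query string using `dct:subject` and a `skos:broader*` property path. The function name, the category URI and the `LIMIT` are assumptions, and the code only builds the query text without contacting any endpoint.

```python
def build_topic_query(category_uri, limit=100):
    """Build a SPARQL query for resources under a category and its subcategories."""
    return f"""
PREFIX dct: <http://purl.org/dc/terms/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT DISTINCT ?resource WHERE {{
  ?resource dct:subject ?category .
  ?category skos:broader* <{category_uri}> .
}}
LIMIT {int(limit)}
""".strip()


# Hypothetical category URI for the War topic.
war_query = build_topic_query("http://dbpedia.org/resource/Category:War", limit=10)
```

The property path `skos:broader*` matches the category itself and every category reachable by following `skos:broader` links upwards to it, which is one way to cover subcategories in a single query.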
    37. 37. PROPOSED APPROACH: Datasets
    38. 38. PROPOSED APPROACH: Experimental Setup A
        1. Use annotated Tweets for training (TW)
            - Baseline: Bag of Words (BoW), Bag of Entities (BoE), and Part of Speech tags (PoS).
            - Enhance features using the DBpedia and Freebase graphs.
        2. Train an SVM classifier based on the TW corpus. Trained/tested on 80%-20% splits over five independent runs.
        3. Compute Precision, Recall, and F-measure.
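The evaluation protocol in this setup (80%-20% splits, five runs, Precision/Recall/F-measure) can be sketched without any machine-learning library. The helper names are hypothetical, the classifier itself is left out, and scoring one topic at a time as a binary problem is an assumption about how the multilabelled data was evaluated.

```python
import random


def split_80_20(items, seed):
    """Shuffle with a fixed seed and split into 80% train / 20% test."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = (len(shuffled) * 4) // 5
    return shuffled[:cut], shuffled[cut:]


def precision_recall_f1(gold, predicted, positive):
    """Precision, recall and F-measure for one topic label."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def average_over_runs(scores):
    """Average a list of (precision, recall, f1) tuples component-wise."""
    return tuple(sum(s[i] for s in scores) / len(scores) for i in range(3))


# Five independent splits of a toy dataset of 10 item ids.
splits = [split_80_20(range(10), seed) for seed in range(5)]
```

In a full experiment each split would train a classifier on the first part and score its predictions on the second with `precision_recall_f1`, and `average_over_runs` would then summarise the five runs.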
    39. 39. PROPOSED APPROACH: Results for TW dataset
    40. 40. PROPOSED APPROACH: Experimental Setup B
        1. Use labelled articles from DBpedia (DB) and Freebase (FB) for training
            - Baseline: Bag of Words (BoW), Bag of Entities (BoE), and Part of Speech tags (PoS).
            - Enhance features using the DBpedia and Freebase graphs.
        2. Train an SVM classifier based on the DB, FB, DB+FB and DB+FB+TW training corpora and test on TW. Trained/tested on 80%-20% splits over five independent runs.
        3. Compute Precision, Recall, and F-measure.
    41. 41. PROPOSED APPROACH: Results for Training on KS articles, and Testing on TW
    42. 42. PROPOSED APPROACH: Factors contributing to the performance of a KS graph for TC
        1. Topic-Class Entropy
        2. Entity-Class Entropy
        3. Topic-Class-Property Entropy
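The slides do not spell out the entropy formulas, so the following is a sketch that assumes plain Shannon entropy over the empirical distribution of labels. The same function would apply to any of the three factors depending on what is counted: the classes of a topic's entities, the classes of a single entity, or class-property pairs. The function name and the toy class labels are made up.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of the empirical distribution of `labels`."""
    counts = Counter(labels)
    total = sum(counts.values())
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())


# A focused topic: every entity falls in one class.
focused = entropy(["Person", "Person", "Person", "Person"])

# A diverse topic: entities spread evenly over four classes.
diverse = entropy(["Person", "Place", "Event", "Organisation"])
```

Low values correspond to the focused, less ambiguous case described in the notes, and high values to the diverse, more ambiguous case.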
    43. 43. PROPOSED APPROACH: Correlating entropy metrics with the performance of the cross-source TC classifiers.
    44. 44. PROPOSED APPROACH: Correlating entropy metrics with the performance of the cross-source TC classifiers. The correlation indicates that the higher the number of ambiguous entities in a topic within a KS graph, the lower the performance of the TC.
    45. 45. FINDINGS
        1. KSs combined with Twitter data provide complementary information for TC of Tweets, outperforming the KS approaches and the approach using Tweets only.
        2. A KS’s performance on TC depends on the coverage of the entities within that KS.
        3. When entities have low coverage in a KS, exploiting the mapping between corresponding KSs’ ontologies is beneficial.
    46. 46. CONCLUSIONS
        • Explored the task of topic classification of tweets.
        • Exploited information in KSs (e.g. DBpedia, Freebase) using semantic graphs for concepts and properties surrounding an entity.
        • Presented the importance of considering graph structures in KSs for the supervised classification of tweets, by achieving significant improvement over various state-of-the-art approaches using both single KSs and Tweets only.
    47. 47. CONTACT US
        A. Elizabeth Cano: http://people.kmi.open.ac.uk/cano/
        B. Andrea Varga: http://sites.google.com/site/missandreavarga/
        C. Matthew Rowe: http://lancs.ac.uk/staff/rowem/
        D. Fabio Ciravegna: http://staffwww.dcs.shef.ac.uk/people/F.Ciravegna
        E. Yulan He: http://www1.aston.ac.uk/eas/staff/dr-yulan-he
