Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Classifying Crisis Information Relevancy with Semantics (ESWC 2018)

255 views

Published on

Social media platforms have become key portals for sharing and consuming information during crisis situations. However, humanitarian organisations and effected communities often struggle to sieve through the large volumes of data that are typically shared on such platforms during crises to determine which posts are truly relevant to the crisis, and which are not. Previous work on automatically classifying crisis information was mostly focused on using statistical features. However, such approaches tend to be inappropriate when processing data on a type of crisis that the model was not trained on, such as processing information about a train crash, whereas the classifier was trained on floods, earthquakes, and typhoons. In such cases, the model will need to be retrained, which is costly and time-consuming.
In this paper, we explore the impact of semantics in classifying Twitter posts across same, and different, types of crises. We experiment with 26 crisis events, using a hybrid system that combines statistical features with various semantic features extracted from external knowledge bases. We show that adding semantic features has no noticeable benefit over statistical features when classifying same-type crises, whereas it enhances the classifier performance by up to 7.2% when classifying information about a new type of crisis.

Published in: Data & Analytics
  • Be the first to comment

Classifying Crisis Information Relevancy with Semantics (ESWC 2018)

  1. 1. www.comrades-project.eu Classifying Crisis-information Relevancy with Semantics Prashant Khare, Gregoire Burel, Harith Alani 1 {prashant.khare, g.burel, h.alani} @open.ac.uk Knowledge Media Institute, The Open University, UK ESWC2018 – 5 June 2018 Heraklion, Crete
  2. 2. www.comrades-project.eu Motivation 2 People of NSW, be careful because there's fires spreading! Stay safe everyone! Hundreds of volunteers in Mexico tried to unearth children they hoped were still alive beneath a school's ruins Two trucks and one car in the water after a road collapse at Hwy 287 and Dillon. #cowx #boulderflood CRISIS Wildfire Floods Earthquake
  3. 3. www.comrades-project.eu Motivation 3 Challenges  A flood of data gets generated. For e.g.:  Over a million tweets were posted during the 2017 Hurricane Harvey.  500% increase in the tweets bandwidth during 2011 Japan earthquake.  Almost impossible to manually absorb and process such sheer volumes.  In addition, the characteristics of social media posts such as short length, colloquialism, syntactic issues pose additional challenges of processing the data.
  4. 4. www.comrades-project.eu Motivation 4 FEMA launched an initiative to use public social media data for situational awareness purpose1. 1: https://www.dhs.gov/sites/default/files/publications/privacy-pia-FEMA-OUSM-April2016.pdf Image source – fema.gov
  5. 5. www.comrades-project.eu Motivation 5 Relevant and Non-Relevant
  6. 6. www.comrades-project.eu Key Problem- Diverse forms of Crisis 6 Floods FireQuake Human Disaster Food & Supplies Crisis
  7. 7. www.comrades-project.eu Key Problem- Broad Spectrum Data 7 The diverse range of situations result in a broad spectrum of content People of NSW, be careful because there's fires spreading! Stay safe everyone! BREAKING: Reports of shots fired at LAX Airport, says senior government official. Two trucks and one car in the water after a road collapse at Hwy 287 and Dillon. #cowx #boulderflood Report: Between 3 and 5 firefighters missing following massive blast at West, Texas, fertilizer plant, police say Hundreds of volunteers in Mexico tried to unearth children they hoped were still alive beneath a school's ruins during earthquake Casualties from 7.2 #earthquake in the #Philippines is now 20+ according to authorities. Casualties from 7.2 #earthquake in the #Philippines is now 20+ according to authorities.
  8. 8. www.comrades-project.eu Access Relevant Information Across Crisis Situations 8 • How do we handle information overload? • How do we identify relevant and irrelevant information across diverse crisis situations? • Can we learn from one type of crisis situation, and identify relevant information in another type?
  9. 9. www.comrades-project.eu Previous Efforts - Identifying Crisis Related Information • ML Classification Methods:  Supervised Approaches: Often making use of n-grams, linguistic features, and/or statistical features of tweets.  Unsupervised Approaches: Keyword processing and clustering. • Semantic Models:  Representation of the information emerging from Crisis Events, providing faceted search of crisis related information. 9
  10. 10. www.comrades-project.eu Hypothesis and Aim • Hypothesis:  Semantics establish a consistency across various types of crisis situations thereby enabling identification of relevant information and can enhance the discriminative power of the classification systems. • Go beyond statistical features, n-grams, and incorporate the contextual semantics to the statistical features. 10
  11. 11. www.comrades-project.eu Statistical Features Example of statistical features: - Text length. - Number of words. - Presence and count of various Parts of Speech (PoS). - Data specific features such as hashtags (in tweets). - E.g., #neworleans #nola #algiers #nolafood #hurricanekatrina. - Readability Score (Gunning Fox Index using average sentence length (ASL) and percentage of complex words (PCW) : 0.4*(ASL + PCW)). 11
  12. 12. www.comrades-project.eu Semantic Features Example of semantic features: Additional information about terms found in the tweets can be extracted using NER tools, entity linking tools, and semantic databases: - Entity linking in Knowledge base. - Co-occurring words (from a data corpus) - Synset Sense – WordNet - Hierarchical Context: Hypernyms, Synonyms - Dbpedia properties. 12
  13. 13. www.comrades-project.eu Extracting Semantics Available tools for entity extraction and knowledge expansion: NER  DBpedia Spotlight  Alchemy (IBM)  Babelfy (BabelNet)  Text Razor NLP API  Aylien Text Analysis API Knowledge Bases  Dbpedia  YAGO  BabelNet  WordNet  Google Knowledge Graph  Wikidata 13
  14. 14. www.comrades-project.eu Babelfy and BabelNet BabelNet – a multilingual lexicalised semantic network formed by combining various knowledge resources- WordNet, Wikipedia, Wikitionary, OmegaWiki etc. It can enable multilingual NLP applications. It can be used for words sense disambiguation and entity linking with Babelfy. Babelfy – A words sense disambiguation and entity linking API built on top of BabelNet. 14
  15. 15. www.comrades-project.eu Features extracted Statistical Features  Number of Nouns, Verbs, Pronouns  Tweet Length  Number of words/tokens  Number of Hashtags Semantic Features  BabelNet Semantics  BabelNet Sense: English labels of entities identified via Babelfy.  BabelNet Hypernym: Direct English hypernyms of each entity (at a distance 1).  Dbpedia Semantics: List of properties associated with Dbpedia URI returned by Babelfy.  subject, label, type, city, state, country 16
  16. 16. www.comrades-project.eu Semantic Enrichment- Broader Perspective Features Post A Post B ‘No confirmed casualties yet from landslide reported in Compostela Valley. #PabloPH’ ‘News: Italy quake victims given shelter http://t.co/cXQEusVm via @BBC’ Babelfy Entities Sense (English) confirm, casualty, report, landslide Italy, earthquake, victim, shelter, news Hypernyms (English) victim, affirm, flood, seismology, geology, soil slide, announce, disaster, natural disaster, geological phenomenon natural disaster, geological phenomenon, broadcasting, communication, nation, country, unfortunate DBpedia dbc:landslide, dbr:landslide, dbo:place, dbc:Geological hazards, dbc:Seismology, dbc:Geological hazards, dbc:Seismology, dbr:Earthquake, dbc:Communication, dbr:News 17
  17. 17. www.comrades-project.eu Method • Collect Data from CrisisLex.org- collection of Crisis oriented tweets. • Extract Statistical Features. • Semantic Enrichment of tweets via annotation using Babelfy API. • Expand the semantics by incorporating hypernyms through BabelNet. • Retrieve Dbpedia features through SPARQL endpoint. • Classify using SVM classification method. 19
  18. 18. www.comrades-project.eu Data • CrisisLexT26 • 26 crisis events with 1000 labelled tweets in each event. • 4 Labels: Related & Informative, Related & Not Informative, Not Related, and Not Applicable. • Merged Related & Informative, Related & Not Informative – Related. • Merged Not Related, and Not Applicable – Not Related. 20
  19. 19. www.comrades-project.eu Data • After removing duplicates: 21378 Related and 2965 Not Related. • To prevent bias, we chose a balanced data- • Selected same number of Related tweets as Not Related in each event. • Final figure: 2966 Related and 2965 Not Related. 21
  20. 20. www.comrades-project.eu Data 22 Related Not Related Total Related Not Related Total CWF Col. Wildfire 242 242 484 COS Costa Rica E’qke 470 470 940 GAU Gautemalla E’quake 103 103 206 ITL Italy E’quake 56 56 112 PHF Philippines Flood 70 70 140 TYP Typhoon P 88 88 176 VNZ Venezuela Fire 60 60 120 ALB Alberta Flood 16 16 32 ABF Australia Bushfire 183 183 366 BOL Bohol E’quake 31 31 62 BOB Boston Bomb 69 69 138 BRZ Brazil Fire 44 44 88 CFL Col.Fire 61 61 122 GLW Glasg Crash 110 110 220 LAX LA Shootout 112 112 224 LAM Train Crash 34 34 68 MNL Manila Flood 74 74 148 NYT NY Train Crash 2 1 3 QFL Queensland Flood 278 278 556 RUS Russia Meteor 241 241 482 SAR Sardinia Flood 67 67 134 SVR Savar Building 305 305 610 SGR Singapore Haze 54 54 108 SPT Spain Train Crash 8 8 16 TPY Typhoon Y 107 107 214 WTX West Texas Ex. 81 81 162
  21. 21. www.comrades-project.eu Data- Event Type Distribution 23 Event Type Events Event Type Events Wildfire/Bushfire (2) CWF, ABF Haze (1) SGR E’quakes(4) COS, ITL, BOL, GAU Helicopter Crash (1) GLW Flood/Typhoons (8) TPY, TYP, CFL, QFL, ALB, PHF, SAR, MNL Building Collapse (1) SVR Terror Shooting/Bombing (2) LAX, BOB Location Fire (2) BRZ, VNZ Train Crash (2) SPT, LAM Explosion (1) WTX Meteor (1) RUS Crisis Type Distribution Wildfire/Bushfire E’quakes Flood/Typhoons Terror Shooting/Bombing Train Crash Meteor Haze Helicopter Crash Building Collapse Location Fire Explosion
  22. 22. www.comrades-project.eu Experiment Design Feature Models:  Statistical Features (SF- baseline)  Statistical Features + BabelNet Semantics (SF + SemEF_BN)  Statistical Features + Dbpedia Semantics (SF + SemEF_DB)  Statistical Features + BabelNet Semantics + Dbpedia Semantics (SF + SemEF_BNDB) Crisis Classification Model  Merge the entire data and perform 20 iterations of 5-fold cross-validation across all the models to evaluate the performance. 24 Statistical Features (SF), BabelNet Semantics (SemEF_BN), DBpediaSemantics (SemEF_DB), BabelNet and Dbpedia Semantics (SemEF_BNDB)
  23. 23. www.comrades-project.eu Experiment Design Cross Crisis Classification  Criteria 1- Content relatedness classification of already seen crisis event type.  When type of test data already exists in training data.  e.g. A classifier trained on data containing tweets/documents from flood event types (along with other event types), is used to classify data from a new flood type crisis event. 25 Statistical Features (SF), BabelNet Semantics (SemEF_BN), DBpediaSemantics (SemEF_DB), BabelNet and Dbpedia Semantics (SemEF_BNDB)
  24. 24. www.comrades-project.eu Experiment Design Cross Crisis Classification  Criteria 2- Content relatedness classification of unseen crisis event type.  When type of test data does not exist in training data.  e.g. A classifier trained on data containing tweets/documents from crisis events types except building fire event types, and is used to classify data from a such crisis event. To classify - “With death toll at 300, Bangladesh factory collapse becomes worst tragedy in garment industry history” 26
  25. 25. www.comrades-project.eu Experiment • Classifier Selection  Support Vector Machine with Linear Kernel  Chosen after determining its performance significance over RBF Kernel, Polynomial Kernel, and Logistic Regression via 20 iterations of 5-fold CV over the entire data) • Tools & Library  Scikit-learn Library  Python 2.7 27
  26. 26. www.comrades-project.eu Results Crisis Classification Model (20 iterations 5- fold cross validation) 28 Features Pmean Rmean Fmean Std. Dev. σ (20 iteration) ∆F /F (%) Sig. (p-value) SF (Baseline) 0.8145 0.8093 0.8118 0.0101 - SF + SemEF_BN 0.8233 0.8231 0.8231 0.0111 1.3919 <0.00001 SF + SemEF_DB 0.8148 0.8146 0.8145 0.0113 0.3326 0.01878 SF + SemEF_BN DB 0.8169 0.8167 0.8167 0.0106 0.6036 0.00001 Statistical Features (SF), BabelNet Semantics (SemEF_BN), DBpediaSemantics (SemEF_DB), BabelNet and Dbpedia Semantics (SemEF_BNDB)
  27. 27. www.comrades-project.eu Results Cross-Crisis Classification- Criteria 1 29 SemEF_BN SemEF_DB SemEF_BNDB Test F F ∆F /F (%) F ∆F /F (%) F ∆F /F (%) Flood/Typhoon TPY 0.803 0.776 -3.44 0.771 -4.01 0.780 -2.83 TYP 0.863 0.840 -2.66 0.829 -3.84 0.851 -1.29 ALB 0.718 0.749 4.25 0.844 17.41 0.844 17.41 QFL 0.783 0.792 1.18 0.77 -1.66 0.781 -0.22 CFL 0.801 0.827 3.28 0.754 -5.88 0.765 -4.41 PHF 0.764 0.763 -0.13 0.771 0.93 0.743 -2.83 SAR 0.570 0.677 18.79 0.648 13.70 0.650 -14.10 Earthquake GAU 0.780 0.725 -7.1 0.784 0.51 0.770 -1.30 ITL 0.583 0.562 -3.58 0.615 5.49 0.588 0.98 BOL 0.742 0.724 -2.38 0.758 2.2 0.674 -9.07 COS 0.790 0.770 -2.56 0.739 -6.42 0.750 -5.08 Statistical Features (SF), BabelNet Semantics (SemEF_BN), DBpediaSemantics (SemEF_DB), BabelNet and Dbpedia Semantics (SemEF_BNDB)
  28. 28. www.comrades-project.eu Results Cross-Crisis Classification- Criteria 2 30 SemEF_BN SemEF_DB SemEF_BNDB Test F F ∆F /F (%) F ∆F /F (%) F ∆F /F (%) Terror/Bomb/Train LAX 0.652 0.677 3.9 0.665 1.95 0.656 0.58 LAM 0.618 0.626 1.2 0.616 -0.34 0.628 1.62 BOB 0.608 0.635 4.4 0.605 -0.56 0.607 -0.19 SPT 0.547 0.686 25.56 0.746 36.5 0.686 25.56 Flood/Typhoon TPY 0.642 0.606 -5.67 0.651 1.39 0.582 -9.45 TYP 0.678 0.679 -0.12 0.661 -2.54 0.603 -10.99 ALB 0.716 0.705 -1.63 0.81 13.02 0.712 -0.63 QFL 0.681 0.657 -3.51 0.698 2.58 0.696 2.23 CFL 0.776 0.706 -9.04 0.704 -9.27 0.754 -2.87 PHF 0.532 0.566 6.52 0.632 18.9 0.556 4.67 SAR 0.537 0.553 2.93 0.595 10.69 0.617 14.84 Earthquake GAU 0.487 0.495 1.62 0.630 29.39 0.593 21.79 ITL 0.509 0.516 1.26 0.553 8.54 0.555 8.93 BOL 0.724 0.639 -11.73 0.674 -6.86 0.588 -18.77 COS 0.515 0.480 -6.71 0.538 4.56 0.527 2.33 Statistical Features (SF), BabelNet Semantics (SemEF_BN), DBpediaSemantics (SemEF_DB), BabelNet and Dbpedia Semantics (SemEF_BNDB)
  29. 29. www.comrades-project.eu Results and Observations • Based on IG score across each feature model (on the overall data), we observed very event specific features in SF model such as collapse, terremoto, fire, earthquake in top ranked features. • Observed 7 different hashtags in top 50 features (indicate event specific vocabulary). • In SF+SemEF_BN and SF+SemEF_DB models, we observed concepts such as natural_hazard, structural_integrity_and_failure, conflagration, perception, geological_phenomenon, dbo:location, dbc:building_defect etc in top 50 features. • structural_integrity_and_failure – annotated entity for term like collapse, building collapse – frequently occurring terms in earthquake, flood type events. • Natural_disaster – hypernym to event terms such as flood, landslide, earthquake. 31
  30. 30. www.comrades-project.eu Results and Observations • On an average SF+SemEF_DB is the best performing model (from Criteria 2). • An avg. percentage gain in F1 score (△F/F) of +7.2% with a Std. Dev. 12.83%. • Improvement over the baseline SF model, in 10 out of 15 events • 5 of 7 flood/typhoon, 3 of 4 earthquake, 2 of 4 crash/terrorist. • The results show that when type of test event is NOT seen in the training data, semantics enhance classifier performance. 32
  31. 31. www.comrades-project.eu Results and Observations • Semantics generalise event specific terms and consequently adapt to new event types (e.g., dbc:flood and dbc:natural hazard ). • Semantic concepts can be also be too general and thus do not help the classification of document (e.g., desire and virtue hypernyms). – Virtue is hypernym of broad range of concepts such as loyalty, courage, cooperation, charity. • Automatic semantic extraction tools could extract many non- relevant entities and therefore might confuse the. – e.g. “Super Typhoon in Philliphines is 236 mph It's roughly the top speed of Formula 1 cars http://t.co/vcRE…” – the annotation and semantic extraction results in 33
  32. 32. www.comrades-project.eu Further explorations • A more in-depth error analysis of misclassified documents is required. • Event type is based on the nature of the crisis. However, events of different types could produce overlapping content. Hence, content similarity could also be taken into account, along with event types. • Data about the same crisis event can emerge in multiple languages. Hence we need to expand the analysis to multilingual content. • Khare, P., Burel, G., Maynard, D., and Alani, H., Cross-Lingual Classification of Crisis Data, Int. Semantic Web Conference (ISWC), Monterey, 2018 (to be presented) 34
  33. 33. www.comrades-project.eu 35 Thank you! Questions?

×