www.comrades-project.eu
Classifying Crisis-information Relevancy
with Semantics
Prashant Khare, Gregoire Burel,
Harith Alani
{prashant.khare, g.burel, h.alani}@open.ac.uk
Knowledge Media Institute, The Open University,
UK
ESWC2018 – 5 June 2018
Heraklion, Crete
Motivation
CRISIS
Wildfire: "People of NSW, be careful because there's fires spreading! Stay safe everyone!"
Earthquake: "Hundreds of volunteers in Mexico tried to unearth children they hoped were still alive beneath a school's ruins"
Floods: "Two trucks and one car in the water after a road collapse at Hwy 287 and Dillon. #cowx #boulderflood"
Motivation
Challenges
• A flood of data is generated. For example:
  - Over a million tweets were posted during Hurricane Harvey in 2017.
  - A 500% increase in tweet bandwidth during the 2011 Japan earthquake.
• It is almost impossible to manually absorb and process such sheer volumes.
• In addition, the characteristics of social media posts, such as short length, colloquialisms, and syntactic issues, make the data harder to process.
Motivation
FEMA launched an initiative to use public social media data for situational awareness purposes [1].
[1] https://www.dhs.gov/sites/default/files/publications/privacy-pia-FEMA-OUSM-April2016.pdf
Image source: fema.gov
Motivation
Relevant and Non-Relevant
Key Problem - Diverse Forms of Crisis
[Figure: forms of crisis. Labels: Floods, Fire, Quake, Human Disaster, Food & Supplies]
Key Problem - Broad Spectrum Data
The diverse range of situations results in a broad spectrum of content:
• People of NSW, be careful because there's fires spreading! Stay safe everyone!
• BREAKING: Reports of shots fired at LAX Airport, says senior government official.
• Two trucks and one car in the water after a road collapse at Hwy 287 and Dillon. #cowx #boulderflood
• Report: Between 3 and 5 firefighters missing following massive blast at West, Texas, fertilizer plant, police say
• Hundreds of volunteers in Mexico tried to unearth children they hoped were still alive beneath a school's ruins during earthquake
• Casualties from 7.2 #earthquake in the #Philippines is now 20+ according to authorities.
Access Relevant Information Across Crisis Situations
• How do we handle information overload?
• How do we identify relevant and irrelevant
information across diverse crisis situations?
• Can we learn from one type of crisis situation, and
identify relevant information in another type?
Previous Efforts - Identifying Crisis-Related Information
• ML Classification Methods:
  - Supervised approaches: often make use of n-grams, linguistic features, and/or statistical features of tweets.
  - Unsupervised approaches: keyword processing and clustering.
• Semantic Models:
  - Representation of the information emerging from crisis events, providing faceted search of crisis-related information.
Hypothesis and Aim
• Hypothesis:
  - Semantics establish consistency across various types of crisis situations, thereby enabling the identification of relevant information and enhancing the discriminative power of classification systems.
• Aim: go beyond statistical features and n-grams by adding contextual semantics to the statistical features.
Statistical Features
Examples of statistical features:
- Text length.
- Number of words.
- Presence and count of various parts of speech (PoS).
- Data-specific features such as hashtags (in tweets).
  - E.g., #neworleans #nola #algiers #nolafood #hurricanekatrina.
- Readability score: Gunning Fog Index, computed from the average sentence length (ASL) and the percentage of complex words (PCW) as 0.4 * (ASL + PCW).
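A minimal sketch of how such statistical features could be computed for one post. The function names and the syllable heuristic (counting vowel groups, with "complex" meaning three or more syllables) are assumptions for illustration; the deck does not say how the authors counted complex words or which PoS tagger they used, so PoS counts are left out.

```python
import re


def count_syllables(word):
    """Rough syllable estimate: number of vowel groups (a heuristic, not a dictionary lookup)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def statistical_features(text):
    """Return simple surface statistics for one post."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    hashtags = re.findall(r"#\w+", text)
    # Assumption: a "complex" word has three or more syllables.
    complex_words = [w for w in words if count_syllables(w) >= 3]
    asl = len(words) / len(sentences) if sentences else 0.0
    pcw = 100.0 * len(complex_words) / len(words) if words else 0.0
    return {
        "text_length": len(text),
        "num_words": len(words),
        "num_hashtags": len(hashtags),
        "gunning_fog": 0.4 * (asl + pcw),
    }


features = statistical_features("Two trucks and one car in the water. #cowx #boulderflood")
```

Hashtag words are also counted as words here; whether to strip them first is a design choice the slides leave open.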
Semantic Features
Examples of semantic features:
Additional information about terms found in the tweets can be extracted using NER tools, entity linking tools, and semantic databases:
- Entity linking to a knowledge base.
- Co-occurring words (from a data corpus).
- Synset sense (WordNet).
- Hierarchical context: hypernyms, synonyms.
- DBpedia properties.
Extracting Semantics
Available tools for entity extraction and knowledge expansion:
NER
• DBpedia Spotlight
• Alchemy (IBM)
• Babelfy (BabelNet)
• TextRazor NLP API
• Aylien Text Analysis API
Knowledge Bases
• DBpedia
• YAGO
• BabelNet
• WordNet
• Google Knowledge Graph
• Wikidata
Babelfy and BabelNet
BabelNet: a multilingual lexicalised semantic network formed by combining various knowledge resources (WordNet, Wikipedia, Wiktionary, OmegaWiki, etc.). It enables multilingual NLP applications and, together with Babelfy, can be used for word sense disambiguation and entity linking.
Babelfy: a word sense disambiguation and entity linking API built on top of BabelNet.
Features Extracted
Statistical Features
• Number of nouns, verbs, pronouns
• Tweet length
• Number of words/tokens
• Number of hashtags
Semantic Features
• BabelNet Semantics
  - BabelNet Sense: English labels of entities identified via Babelfy.
  - BabelNet Hypernym: direct English hypernyms of each entity (at distance 1).
• DBpedia Semantics: list of properties associated with the DBpedia URI returned by Babelfy.
  - subject, label, type, city, state, country
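A hedged sketch of how the DBpedia properties listed above might be requested for one URI. The mapping of "subject, label, type, city, state, country" to dct:subject, rdfs:label, rdf:type, dbo:city, dbo:state and dbo:country is an assumption, as is the function name; the deck does not show the actual query. The function only builds the query string and does not contact the endpoint.

```python
# Public DBpedia endpoint; shown for context only, it is not called in this sketch.
DBPEDIA_ENDPOINT = "https://dbpedia.org/sparql"

# Assumed mapping of the property names on the slide to RDF terms.
PROPERTIES = ["dct:subject", "rdfs:label", "rdf:type", "dbo:city", "dbo:state", "dbo:country"]

PREFIXES = (
    "PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>\n"
    "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
    "PREFIX dct: <http://purl.org/dc/terms/>\n"
    "PREFIX dbo: <http://dbpedia.org/ontology/>\n"
)


def build_property_query(resource_uri):
    """Build a SPARQL query that returns the selected properties of one resource."""
    values = " ".join(PROPERTIES)
    return (
        PREFIXES
        + "SELECT ?p ?o WHERE {\n"
        + f"  VALUES ?p {{ {values} }}\n"
        + f"  <{resource_uri}> ?p ?o .\n"
        + "}"
    )


query = build_property_query("http://dbpedia.org/resource/Landslide")
```

The resulting string could be sent to the endpoint with any HTTP or SPARQL client; each returned (property, value) pair then becomes a candidate feature.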
Semantic Enrichment - Broader Perspective
Post A: 'No confirmed casualties yet from landslide reported in Compostela Valley. #PabloPH'
Post B: 'News: Italy quake victims given shelter http://t.co/cXQEusVm via @BBC'
Babelfy Entities
Sense (English)
  Post A: confirm, casualty, report, landslide
  Post B: Italy, earthquake, victim, shelter, news
Hypernyms (English)
  Post A: victim, affirm, flood, seismology, geology, soil slide, announce, disaster, natural disaster, geological phenomenon
  Post B: natural disaster, geological phenomenon, broadcasting, communication, nation, country, unfortunate
DBpedia
  Post A: dbc:landslide, dbr:landslide, dbo:place, dbc:Geological hazards, dbc:Seismology
  Post B: dbc:Geological hazards, dbc:Seismology, dbr:Earthquake, dbc:Communication, dbr:News
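The effect of the enrichment can be illustrated with the two posts above: their Babelfy senses share nothing, but once hypernyms are added they overlap. The sketch below uses Jaccard similarity as a stand-in measure; that choice is an assumption for illustration, not something the deck reports using.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b|."""
    return len(a & b) / len(a | b) if (a or b) else 0.0


# Senses and hypernyms copied from the Post A / Post B example.
senses_a = {"confirm", "casualty", "report", "landslide"}
senses_b = {"Italy", "earthquake", "victim", "shelter", "news"}
hypernyms_a = {"victim", "affirm", "flood", "seismology", "geology", "soil slide",
               "announce", "disaster", "natural disaster", "geological phenomenon"}
hypernyms_b = {"natural disaster", "geological phenomenon", "broadcasting",
               "communication", "nation", "country", "unfortunate"}

before = jaccard(senses_a, senses_b)
enriched_a = senses_a | hypernyms_a
enriched_b = senses_b | hypernyms_b
shared = enriched_a & enriched_b
after = jaccard(enriched_a, enriched_b)

print(f"before: {before:.3f}, after: {after:.3f}, shared: {sorted(shared)}")
```

A landslide post and an earthquake post end up sharing "victim", "natural disaster" and "geological phenomenon", which is the kind of cross-event consistency the hypothesis relies on.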
Method
• Collect data from CrisisLex.org, a collection of crisis-related tweets.
• Extract statistical features.
• Semantically enrich the tweets by annotating them with the Babelfy API.
• Expand the semantics by adding hypernyms from BabelNet.
• Retrieve DBpedia features through the SPARQL endpoint.
• Classify using a Support Vector Machine (SVM).
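Before classification, the outputs of the steps above have to be merged into one feature representation per tweet. The sketch below shows one possible way to do that; the function name, the key prefixes and the input values are hypothetical, and the deck does not describe the authors' actual encoding.

```python
def combine_features(stat_features, bn_senses=(), bn_hypernyms=(), dbpedia_props=()):
    """Merge numeric statistical features with counted semantic labels into one dict."""
    features = dict(stat_features)
    groups = (
        ("bn_sense", bn_senses),
        ("bn_hyper", bn_hypernyms),
        ("dbp", dbpedia_props),
    )
    for prefix, labels in groups:
        for label in labels:
            key = f"{prefix}={label}"
            # Count repeated labels instead of overwriting them.
            features[key] = features.get(key, 0) + 1
    return features


# Illustrative values only; real inputs would come from the annotation steps.
example = combine_features(
    {"num_words": 9, "num_hashtags": 1},
    bn_senses=["landslide", "casualty"],
    bn_hypernyms=["natural disaster"],
    dbpedia_props=["dbc:Seismology"],
)
```

Leaving out the BabelNet or DBpedia arguments gives the smaller feature models, and a dict of this shape can be turned into a sparse matrix with scikit-learn's DictVectorizer before training the SVM.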
Data
• CrisisLexT26
  - 26 crisis events with 1000 labelled tweets per event.
  - 4 labels: Related & Informative, Related & Not Informative, Not Related, and Not Applicable.
• Merged Related & Informative and Related & Not Informative into Related.
• Merged Not Related and Not Applicable into Not Related.
Data
• After removing duplicates: 21378 Related and 2965 Not Related.
• To prevent bias, we chose a balanced dataset:
  - Selected the same number of Related tweets as Not Related in each event.
• Final figures: 2966 Related and 2965 Not Related.
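The label merging and per-event balancing could be sketched as follows. This is an illustration under assumptions: the function name and the tuple layout are hypothetical, random down-sampling of the Related class is assumed (the deck does not say how the Related tweets were selected), and duplicate removal is not shown.

```python
import random
from collections import defaultdict

# Merge the four CrisisLexT26 labels into two classes.
LABEL_MAP = {
    "Related & Informative": "Related",
    "Related & Not Informative": "Related",
    "Not Related": "Not Related",
    "Not Applicable": "Not Related",
}


def balance_by_event(tweets, seed=0):
    """tweets: iterable of (event, original_label, text) tuples.

    Returns a list of (event, binary_label, text) in which each event keeps
    all Not Related tweets and at most the same number of Related tweets.
    """
    rng = random.Random(seed)
    by_event = defaultdict(lambda: {"Related": [], "Not Related": []})
    for event, label, text in tweets:
        by_event[event][LABEL_MAP[label]].append(text)

    balanced = []
    for event, groups in by_event.items():
        n = min(len(groups["Not Related"]), len(groups["Related"]))
        related = rng.sample(groups["Related"], n)
        balanced += [(event, "Related", t) for t in related]
        balanced += [(event, "Not Related", t) for t in groups["Not Related"]]
    return balanced
```

Balancing inside each event, rather than over the pooled data, keeps every event's two classes the same size, which matters when single events are later held out for testing.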
Data
Code  Event                  Related  Not Related  Total
CWF   Colorado Wildfire      242      242          484
GAU   Guatemala Earthquake   103      103          206
PHF   Philippines Flood      70       70           140
VNZ   Venezuela Fire         60       60           120
ABF   Australia Bushfire     183      183          366
BOB   Boston Bomb            69       69           138
CFL   Colorado Flood         61       61           122
LAX   LA Shootout            112      112          224
MNL   Manila Flood           74       74           148
QFL   Queensland Flood       278      278          556
SAR   Sardinia Flood         67       67           134
SGR   Singapore Haze         54       54           108
TPY   Typhoon Y              107      107          214
COS   Costa Rica Earthquake  470      470          940
ITL   Italy Earthquake       56       56           112
TYP   Typhoon P              88       88           176
ALB   Alberta Flood          16       16           32
BOL   Bohol Earthquake       31       31           62
BRZ   Brazil Fire            44       44           88
GLW   Glasgow Crash          110      110          220
LAM   Train Crash            34       34           68
NYT   NY Train Crash         2        1            3
RUS   Russia Meteor          241      241          482
SVR   Savar Building         305      305          610
SPT   Spain Train Crash      8        8            16
WTX   West Texas Explosion   81       81           162
Data - Event Type Distribution
Event Type (no. of events)    Events
Wildfire/Bushfire (2)         CWF, ABF
Earthquakes (4)               COS, ITL, BOL, GAU
Flood/Typhoons (8)            TPY, TYP, CFL, QFL, ALB, PHF, SAR, MNL
Terror Shooting/Bombing (2)   LAX, BOB
Train Crash (2)               SPT, LAM
Meteor (1)                    RUS
Haze (1)                      SGR
Helicopter Crash (1)          GLW
Building Collapse (1)         SVR
Location Fire (2)             BRZ, VNZ
Explosion (1)                 WTX
[Figure: Crisis Type Distribution across the event types listed above]
Experiment Design
Feature Models:
• Statistical Features (SF, baseline)
• Statistical Features + BabelNet Semantics (SF + SemEF_BN)
• Statistical Features + DBpedia Semantics (SF + SemEF_DB)
• Statistical Features + BabelNet Semantics + DBpedia Semantics (SF + SemEF_BNDB)
Crisis Classification Model
• Merge the entire data and perform 20 iterations of 5-fold cross-validation for every feature model to evaluate performance.
Statistical Features (SF), BabelNet Semantics (SemEF_BN), DBpedia Semantics (SemEF_DB), BabelNet and DBpedia Semantics (SemEF_BNDB)
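A sketch of what 20 iterations of 5-fold cross-validation could look like in scikit-learn. It is written for Python 3 and a current scikit-learn release, and it runs on synthetic stand-in data; the real feature matrices come from the tweets. Averaging the five folds within each iteration is an assumption about how per-iteration scores were obtained.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.svm import LinearSVC

# Synthetic stand-in for one feature model (not the CrisisLexT26 data).
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# 5 folds repeated 20 times gives 100 train/test splits.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=20, random_state=0)
scores = cross_val_score(LinearSVC(max_iter=5000), X, y, scoring="f1", cv=cv)

# Splits are generated repeat by repeat, so each row holds one iteration's 5 folds.
iteration_means = scores.reshape(20, 5).mean(axis=1)
print(f"mean F1 = {iteration_means.mean():.4f}, "
      f"std over iterations = {iteration_means.std():.4f}")
```

Running the same loop once per feature model yields comparable per-iteration score lists, which is what a significance test between models needs.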
Experiment Design
Cross-Crisis Classification
• Criteria 1: content relatedness classification for an already seen crisis event type.
  - The event type of the test data is also present in the training data.
  - E.g., a classifier trained on tweets/documents from flood events (along with other event types) is used to classify data from a new flood-type crisis event.
Experiment Design
Cross-Crisis Classification
• Criteria 2: content relatedness classification for an unseen crisis event type.
  - The event type of the test data does not exist in the training data.
  - E.g., a classifier trained on tweets/documents from all crisis event types except building fire events is used to classify data from such a crisis event.
To classify: “With death toll at 300, Bangladesh factory collapse becomes worst tragedy in garment industry history”
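The two criteria differ only in which events are allowed into the training set. A minimal sketch, assuming the grouping from the event type distribution slide and a hypothetical helper name; the deck's results group some types more coarsely (for example Terror/Bomb/Train), so the authors' exact splits may differ.

```python
# Event codes grouped by type, following the event type distribution slide.
EVENT_TYPES = {
    "Wildfire/Bushfire": ["CWF", "ABF"],
    "Earthquake": ["COS", "ITL", "BOL", "GAU"],
    "Flood/Typhoon": ["TPY", "TYP", "CFL", "QFL", "ALB", "PHF", "SAR", "MNL"],
    "Terror Shooting/Bombing": ["LAX", "BOB"],
    "Train Crash": ["SPT", "LAM"],
    "Meteor": ["RUS"],
    "Haze": ["SGR"],
    "Helicopter Crash": ["GLW"],
    "Building Collapse": ["SVR"],
    "Location Fire": ["BRZ", "VNZ"],
    "Explosion": ["WTX"],
}
TYPE_OF = {event: etype for etype, events in EVENT_TYPES.items() for event in events}


def training_events(test_event, criteria):
    """Return the event codes to train on when testing on test_event.

    Criteria 1: every other event, so the test type stays in the training data.
    Criteria 2: only events whose type differs from the test event's type.
    """
    if criteria == 1:
        return sorted(e for e in TYPE_OF if e != test_event)
    if criteria == 2:
        return sorted(e for e in TYPE_OF if TYPE_OF[e] != TYPE_OF[test_event])
    raise ValueError("criteria must be 1 or 2")
```

For a flood event such as TPY, Criteria 1 trains on the other flood events as well, while Criteria 2 removes all eight flood/typhoon events from training.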
Experiment
• Classifier Selection
  - Support Vector Machine with a linear kernel
  - Chosen after determining the significance of its performance over the RBF kernel, the polynomial kernel, and Logistic Regression via 20 iterations of 5-fold CV over the entire data.
• Tools & Library
  - Scikit-learn library
  - Python 2.7
Results
Crisis Classification Model (20 iterations of 5-fold cross-validation)
Features          Pmean   Rmean   Fmean   Std. Dev. σ  ∆F/F (%)  Sig. (p-value)
SF (Baseline)     0.8145  0.8093  0.8118  0.0101       -         -
SF + SemEF_BN     0.8233  0.8231  0.8231  0.0111       1.3919    <0.00001
SF + SemEF_DB     0.8148  0.8146  0.8145  0.0113       0.3326    0.01878
SF + SemEF_BNDB   0.8169  0.8167  0.8167  0.0106       0.6036    0.00001
Std. Dev. σ is taken over the 20 iterations.
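The ∆F/F column appears to be the relative change in mean F score with respect to the SF baseline. That reading is an inference from the numbers rather than a definition given in the deck; the sketch below recomputes it from the rounded Fmean values.

```python
def relative_gain(f_model, f_baseline):
    """Relative F-score change over the baseline, in percent."""
    return 100.0 * (f_model - f_baseline) / f_baseline


baseline = 0.8118  # Fmean of the SF baseline
models = [
    ("SF + SemEF_BN", 0.8231),
    ("SF + SemEF_DB", 0.8145),
    ("SF + SemEF_BNDB", 0.8167),
]
for name, f_mean in models:
    print(f"{name}: {relative_gain(f_mean, baseline):+.2f}%")
```

The recomputed values agree with the reported 1.3919, 0.3326 and 0.6036 to about two decimal places; small differences are expected because the inputs are rounded.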
Results
Cross-Crisis Classification - Criteria 1
       SF      SF+SemEF_BN       SF+SemEF_DB       SF+SemEF_BNDB
Test   F       F       ∆F/F (%)  F       ∆F/F (%)  F       ∆F/F (%)
Flood/Typhoon
TPY    0.803   0.776   -3.44     0.771   -4.01     0.780   -2.83
TYP    0.863   0.840   -2.66     0.829   -3.84     0.851   -1.29
ALB    0.718   0.749   4.25      0.844   17.41     0.844   17.41
QFL    0.783   0.792   1.18      0.77    -1.66     0.781   -0.22
CFL    0.801   0.827   3.28      0.754   -5.88     0.765   -4.41
PHF    0.764   0.763   -0.13     0.771   0.93      0.743   -2.83
SAR    0.570   0.677   18.79     0.648   13.70     0.650   -14.10
Earthquake
GAU    0.780   0.725   -7.1      0.784   0.51      0.770   -1.30
ITL    0.583   0.562   -3.58     0.615   5.49      0.588   0.98
BOL    0.742   0.724   -2.38     0.758   2.2       0.674   -9.07
COS    0.790   0.770   -2.56     0.739   -6.42     0.750   -5.08
Results
Cross-Crisis Classification - Criteria 2
       SF      SF+SemEF_BN       SF+SemEF_DB       SF+SemEF_BNDB
Test   F       F       ∆F/F (%)  F       ∆F/F (%)  F       ∆F/F (%)
Terror/Bomb/Train
LAX    0.652   0.677   3.9       0.665   1.95      0.656   0.58
LAM    0.618   0.626   1.2       0.616   -0.34     0.628   1.62
BOB    0.608   0.635   4.4       0.605   -0.56     0.607   -0.19
SPT    0.547   0.686   25.56     0.746   36.5      0.686   25.56
Flood/Typhoon
TPY    0.642   0.606   -5.67     0.651   1.39      0.582   -9.45
TYP    0.678   0.679   -0.12     0.661   -2.54     0.603   -10.99
ALB    0.716   0.705   -1.63     0.81    13.02     0.712   -0.63
QFL    0.681   0.657   -3.51     0.698   2.58      0.696   2.23
CFL    0.776   0.706   -9.04     0.704   -9.27     0.754   -2.87
PHF    0.532   0.566   6.52      0.632   18.9      0.556   4.67
SAR    0.537   0.553   2.93      0.595   10.69     0.617   14.84
Earthquake
GAU    0.487   0.495   1.62      0.630   29.39     0.593   21.79
ITL    0.509   0.516   1.26      0.553   8.54      0.555   8.93
BOL    0.724   0.639   -11.73    0.674   -6.86     0.588   -18.77
COS    0.515   0.480   -6.71     0.538   4.56      0.527   2.33
Results and Observations
• Based on the information gain (IG) score of each feature model (on the overall data), we observed very event-specific features in the SF model, such as collapse, terremoto, fire, and earthquake, among the top-ranked features.
• Observed 7 different hashtags in the top 50 features (indicating event-specific vocabulary).
• In the SF+SemEF_BN and SF+SemEF_DB models, we observed concepts such as natural_hazard, structural_integrity_and_failure, conflagration, perception, geological_phenomenon, dbo:location, and dbc:building_defect in the top 50 features.
• structural_integrity_and_failure: annotated entity for terms like collapse and building collapse, which occur frequently in earthquake and flood events.
• natural_disaster: hypernym of event terms such as flood, landslide, and earthquake.
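For reference, information gain for a single binary feature can be computed as the drop in label entropy once the feature's presence is known. The sketch below is a generic textbook formulation with hypothetical function names, not the authors' ranking code.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(feature_present, labels):
    """IG of a binary feature: H(labels) - H(labels | feature)."""
    total = len(labels)
    gain = entropy(labels)
    for value in (True, False):
        subset = [lab for f, lab in zip(feature_present, labels) if f == value]
        if subset:
            gain -= len(subset) / total * entropy(subset)
    return gain


labels = ["Related", "Related", "Not Related", "Not Related"]
perfect = information_gain([True, True, False, False], labels)   # feature separates the classes
useless = information_gain([True, False, True, False], labels)   # feature carries no signal
```

Ranking all features by this score and reading off the top 50 is how event-specific terms and hashtags, or general concepts such as natural_hazard, would surface.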
Results and Observations
• On average, SF+SemEF_DB is the best-performing model (under Criteria 2).
• An average percentage gain in F1 score (∆F/F) of +7.2%, with a standard deviation of 12.83%.
• Improvement over the baseline SF model in 10 out of 15 events:
• 5 of 7 flood/typhoon, 3 of 4 earthquake, 2 of 4 crash/terrorist.
• The results show that when the type of the test event is NOT seen in the training data, semantics enhance classifier performance.
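These summary figures can be rechecked from the SF + SemEF_DB ∆F/F column of the Criteria 2 results. The sketch below copies the 15 per-event values from that table and assumes the reported deviation is a sample standard deviation.

```python
import statistics

# ∆F/F (%) for SF + SemEF_DB per test event, copied from the Criteria 2 table.
gains_db = {
    "LAX": 1.95, "LAM": -0.34, "BOB": -0.56, "SPT": 36.5,
    "TPY": 1.39, "TYP": -2.54, "ALB": 13.02, "QFL": 2.58,
    "CFL": -9.27, "PHF": 18.9, "SAR": 10.69,
    "GAU": 29.39, "ITL": 8.54, "BOL": -6.86, "COS": 4.56,
}
values = list(gains_db.values())

mean_gain = statistics.mean(values)
std_gain = statistics.stdev(values)  # sample standard deviation (n - 1)
improved = sum(1 for g in values if g > 0)

print(f"mean = {mean_gain:+.1f}%, std = {std_gain:.2f}%, improved = {improved}/{len(values)}")
# prints: mean = +7.2%, std = 12.83%, improved = 10/15
```

The large spread relative to the mean shows that the average gain is driven by a few events (SPT, GAU, PHF) while several others lose performance.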
Results and Observations
• Semantics generalise event-specific terms and consequently adapt to new event types (e.g., dbc:flood and dbc:natural hazard).
• Semantic concepts can also be too general and thus do not help the classification of a document (e.g., the desire and virtue hypernyms).
  - Virtue is a hypernym of a broad range of concepts such as loyalty, courage, cooperation, and charity.
• Automatic semantic extraction tools could extract many non-relevant entities and therefore might confuse the classifier.
  - e.g. “Super Typhoon in Philliphines is 236 mph It's roughly the top speed of Formula 1 cars http://t.co/vcRE…”: the annotation and semantic extraction results in
Further explorations
• A more in-depth error analysis of misclassified documents is
required.
• Event type is based on the nature of the crisis. However,
events of different types could produce overlapping content.
Hence, content similarity could also be taken into account,
along with event types.
• Data about the same crisis event can emerge in multiple
languages. Hence we need to expand the analysis to
multilingual content.
• Khare, P., Burel, G., Maynard, D., and Alani, H., Cross-Lingual Classification of Crisis
Data, Int. Semantic Web Conference (ISWC), Monterey, 2018 (to be presented)
Thank you!
Questions?

bai-tap-tieng-anh-lop-12-unit-4-the-mass-media (1).doc
 
Potential Uses of the Floyd-Warshall Algorithm as appropriate
Potential Uses of the Floyd-Warshall Algorithm as appropriatePotential Uses of the Floyd-Warshall Algorithm as appropriate
Potential Uses of the Floyd-Warshall Algorithm as appropriate
 
Data analytics and Access Program Recommendations
Data analytics and Access Program RecommendationsData analytics and Access Program Recommendations
Data analytics and Access Program Recommendations
 
Nipissing University degree offer Nipissing diploma Transcript
Nipissing University degree offer Nipissing diploma TranscriptNipissing University degree offer Nipissing diploma Transcript
Nipissing University degree offer Nipissing diploma Transcript
 
Celonis Busniess Analyst Virtual Internship.pptx
Celonis Busniess Analyst Virtual Internship.pptxCelonis Busniess Analyst Virtual Internship.pptx
Celonis Busniess Analyst Virtual Internship.pptx
 

Classifying Crisis Information Relevancy with Semantics (ESWC 2018)

  • 1. www.comrades-project.eu Classifying Crisis-information Relevancy with Semantics. Prashant Khare, Gregoire Burel, Harith Alani ({prashant.khare, g.burel, h.alani}@open.ac.uk), Knowledge Media Institute, The Open University, UK. ESWC2018 – 5 June 2018, Heraklion, Crete.
  • 2. Motivation: crisis-related posts
      - Wildfire: "People of NSW, be careful because there's fires spreading! Stay safe everyone!"
      - Earthquake: "Hundreds of volunteers in Mexico tried to unearth children they hoped were still alive beneath a school's ruins"
      - Floods: "Two trucks and one car in the water after a road collapse at Hwy 287 and Dillon. #cowx #boulderflood"
  • 3. Motivation: Challenges
      - A flood of data gets generated. For example:
          - Over a million tweets were posted during Hurricane Harvey in 2017.
          - There was a 500% increase in tweet bandwidth during the 2011 Japan earthquake.
      - It is almost impossible to manually absorb and process such volumes.
      - In addition, characteristics of social media posts such as short length, colloquialism, and syntactic issues make the data harder to process.
  • 4. Motivation: FEMA launched an initiative to use public social media data for situational awareness purposes [1]. [1]: https://www.dhs.gov/sites/default/files/publications/privacy-pia-FEMA-OUSM-April2016.pdf (Image source: fema.gov)
  • 6. Key Problem: Diverse forms of Crisis. Crisis types shown: Floods, Fire, Quake, Human Disaster, Food & Supplies.
  • 7. Key Problem: Broad Spectrum Data. The diverse range of situations results in a broad spectrum of content:
      - "People of NSW, be careful because there's fires spreading! Stay safe everyone!"
      - "BREAKING: Reports of shots fired at LAX Airport, says senior government official."
      - "Two trucks and one car in the water after a road collapse at Hwy 287 and Dillon. #cowx #boulderflood"
      - "Report: Between 3 and 5 firefighters missing following massive blast at West, Texas, fertilizer plant, police say"
      - "Hundreds of volunteers in Mexico tried to unearth children they hoped were still alive beneath a school's ruins during earthquake"
      - "Casualties from 7.2 #earthquake in the #Philippines is now 20+ according to authorities."
  • 8. Access Relevant Information Across Crisis Situations
      - How do we handle information overload?
      - How do we identify relevant and irrelevant information across diverse crisis situations?
      - Can we learn from one type of crisis situation, and identify relevant information in another type?
  • 9. Previous Efforts: Identifying Crisis Related Information
      - ML classification methods:
          - Supervised approaches: often make use of n-grams, linguistic features, and/or statistical features of tweets.
          - Unsupervised approaches: keyword processing and clustering.
      - Semantic models: representation of the information emerging from crisis events, providing faceted search of crisis-related information.
  • 10. Hypothesis and Aim
      - Hypothesis: semantics establish consistency across various types of crisis situations, thereby enabling identification of relevant information, and can enhance the discriminative power of classification systems.
      - Aim: go beyond statistical features and n-grams by adding contextual semantics to the statistical features.
  • 11. Statistical Features. Examples of statistical features:
      - Text length.
      - Number of words.
      - Presence and count of various parts of speech (PoS).
      - Data-specific features such as hashtags (in tweets), e.g., #neworleans #nola #algiers #nolafood #hurricanekatrina.
      - Readability score: Gunning Fog Index, computed from average sentence length (ASL) and percentage of complex words (PCW) as 0.4 * (ASL + PCW).
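A minimal sketch of how some of the statistical features on slide 11 could be computed. The syllable counter is a rough vowel-group heuristic and the "three or more syllables" threshold for complex words is the usual Gunning Fog convention; both are assumptions here, not details taken from the slides. Part-of-speech counts are omitted because they need a tagger.

```python
import re

def count_syllables(word):
    """Rough vowel-group heuristic (an assumption, not a dictionary lookup)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def gunning_fog(text):
    """Readability score 0.4 * (ASL + PCW), as given on the slide."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return 0.0
    asl = len(words) / len(sentences)                      # average sentence length
    complex_words = [w for w in words if count_syllables(w) >= 3]
    pcw = 100.0 * len(complex_words) / len(words)          # percentage of complex words
    return 0.4 * (asl + pcw)

def statistical_features(tweet):
    """Length, token count, hashtag count and readability for one tweet."""
    tokens = tweet.split()
    return {
        "text_length": len(tweet),
        "num_tokens": len(tokens),
        "num_hashtags": sum(1 for t in tokens if t.startswith("#") and len(t) > 1),
        "fog_index": gunning_fog(tweet),
    }

tweet = ("Two trucks and one car in the water after a road collapse "
         "at Hwy 287 and Dillon. #cowx #boulderflood")
features = statistical_features(tweet)
```

The hypothetical helper `statistical_features` returns a plain dictionary, so the values can be appended to any other feature vector.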
  • 12. Semantic Features. Additional information about terms found in the tweets can be extracted using NER tools, entity linking tools, and semantic databases. Examples of semantic features:
      - Entity linking to a knowledge base.
      - Co-occurring words (from a data corpus).
      - Synset sense (WordNet).
      - Hierarchical context: hypernyms, synonyms.
      - DBpedia properties.
  • 13. Extracting Semantics. Available tools for entity extraction and knowledge expansion:
      - NER: DBpedia Spotlight, Alchemy (IBM), Babelfy (BabelNet), Text Razor NLP API, Aylien Text Analysis API.
      - Knowledge bases: DBpedia, YAGO, BabelNet, WordNet, Google Knowledge Graph, Wikidata.
  • 14. Babelfy and BabelNet
      - BabelNet: a multilingual lexicalised semantic network formed by combining various knowledge resources (WordNet, Wikipedia, Wiktionary, OmegaWiki, etc.). It can enable multilingual NLP applications, and can be used for word sense disambiguation and entity linking with Babelfy.
      - Babelfy: a word sense disambiguation and entity linking API built on top of BabelNet.
  • 15. Features extracted
      - Statistical features: number of nouns, verbs, and pronouns; tweet length; number of words/tokens; number of hashtags.
      - Semantic features:
          - BabelNet semantics:
              - BabelNet sense: English labels of entities identified via Babelfy.
              - BabelNet hypernym: direct English hypernyms of each entity (at distance 1).
          - DBpedia semantics: list of properties associated with the DBpedia URI returned by Babelfy (subject, label, type, city, state, country).
  • 16. Semantic Enrichment: Broader Perspective
      - Post A: ‘No confirmed casualties yet from landslide reported in Compostela Valley. #PabloPH’
          - Babelfy entity senses (English): confirm, casualty, report, landslide
          - Hypernyms (English): victim, affirm, flood, seismology, geology, soil slide, announce, disaster, natural disaster, geological phenomenon
          - DBpedia: dbc:landslide, dbr:landslide, dbo:place, dbc:Geological hazards, dbc:Seismology
      - Post B: ‘News: Italy quake victims given shelter http://t.co/cXQEusVm via @BBC’
          - Babelfy entity senses (English): Italy, earthquake, victim, shelter, news
          - Hypernyms (English): natural disaster, geological phenomenon, broadcasting, communication, nation, country, unfortunate
          - DBpedia: dbc:Geological hazards, dbc:Seismology, dbr:Earthquake, dbc:Communication, dbr:News
  • 17. Method
      - Collect data from CrisisLex.org, a collection of crisis-oriented tweets.
      - Extract statistical features.
      - Semantically enrich the tweets via annotation using the Babelfy API.
      - Expand the semantics by incorporating hypernyms through BabelNet.
      - Retrieve DBpedia features through the SPARQL endpoint.
      - Classify using the SVM classification method.
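The enrichment step of the method can be sketched as follows, assuming the annotations have already been retrieved. The speaker notes say the semantics are concatenated with the tweet text before tokenisation and n-gram creation; the functions `enrich` and `ngrams`, the dictionary layout, and the whitespace tokeniser are hypothetical illustrations, and the annotation values are copied by hand from slide 16 rather than produced by a Babelfy or DBpedia call.

```python
def enrich(tweet, annotations):
    """Append sense labels, hypernyms and DBpedia properties to the tweet text."""
    extra = []
    for key in ("senses", "hypernyms", "dbpedia"):
        extra.extend(annotations.get(key, []))
    # Join multi-word concepts with underscores so each stays a single token.
    extra = [concept.replace(" ", "_") for concept in extra]
    return tweet + " " + " ".join(extra)

def ngrams(text, n_max=2):
    """Lower-cased word n-grams up to n_max (whitespace tokeniser, an assumption)."""
    tokens = text.lower().split()
    grams = []
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

post_a = ("No confirmed casualties yet from landslide reported in "
          "Compostela Valley. #PabloPH")
# Hand-copied subset of the slide 16 annotations for Post A (not API output).
annotations_a = {
    "senses": ["confirm", "casualty", "report", "landslide"],
    "hypernyms": ["natural disaster", "geological phenomenon"],
    "dbpedia": ["dbr:landslide", "dbc:Geological hazards"],
}
enriched = enrich(post_a, annotations_a)
features = ngrams(enriched)
```

Because a concept such as natural_disaster becomes an ordinary token, a tweet about an earthquake and a tweet about a landslide can end up sharing features even when their surface vocabulary differs.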
  • 18. Data: CrisisLexT26
      - 26 crisis events with 1000 labelled tweets in each event.
      - 4 labels: Related & Informative, Related & Not Informative, Not Related, and Not Applicable.
      - Related & Informative and Related & Not Informative were merged into Related.
      - Not Related and Not Applicable were merged into Not Related.
  • 19. Data
      - After removing duplicates: 21378 Related and 2965 Not Related.
      - To prevent bias, we chose balanced data: we selected the same number of Related tweets as Not Related in each event.
      - Final figure: 2966 Related and 2965 Not Related.
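The label merging and per-event balancing described on slides 18 and 19 might look like the sketch below. The row layout `(event, text, label)`, the function name `balance_per_event`, and the random down-sampling of the Related class are assumptions; the slides do not say how the Related tweets were selected.

```python
import random
from collections import defaultdict

# Four original labels collapsed into two classes, as on slide 18.
LABEL_MAP = {
    "Related & Informative": "Related",
    "Related & Not Informative": "Related",
    "Not Related": "Not Related",
    "Not Applicable": "Not Related",
}

def balance_per_event(rows, seed=0):
    """Merge labels, drop duplicate texts, and down-sample Related per event."""
    rng = random.Random(seed)
    by_event = defaultdict(lambda: {"Related": [], "Not Related": []})
    seen = set()
    for event, text, label in rows:
        if (event, text) in seen:          # remove duplicates within an event
            continue
        seen.add((event, text))
        merged = LABEL_MAP[label]
        by_event[event][merged].append((event, text, merged))
    balanced = []
    for groups in by_event.values():
        k = min(len(groups["Related"]), len(groups["Not Related"]))
        balanced.extend(rng.sample(groups["Related"], k))
        balanced.extend(groups["Not Related"])
    return balanced

toy_rows = [
    ("E1", "a", "Related & Informative"),
    ("E1", "b", "Related & Not Informative"),
    ("E1", "c", "Related & Informative"),
    ("E1", "c", "Related & Informative"),   # duplicate
    ("E1", "d", "Not Related"),
    ("E1", "e", "Not Applicable"),
]
balanced = balance_per_event(toy_rows)
```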
  • 20. Data: Related and Not Related tweets per event
      Code  Event               Related  Not Related  Total
      CWF   Col. Wildfire           242          242    484
      COS   Costa Rica E’quake      470          470    940
      GAU   Guatemala E’quake       103          103    206
      ITL   Italy E’quake            56           56    112
      PHF   Philippines Flood        70           70    140
      TYP   Typhoon P                88           88    176
      VNZ   Venezuela Fire           60           60    120
      ALB   Alberta Flood            16           16     32
      ABF   Australia Bushfire      183          183    366
      BOL   Bohol E’quake            31           31     62
      BOB   Boston Bomb              69           69    138
      BRZ   Brazil Fire              44           44     88
      CFL   Col. Fire                61           61    122
      GLW   Glasg Crash             110          110    220
      LAX   LA Shootout             112          112    224
      LAM   Train Crash              34           34     68
      MNL   Manila Flood             74           74    148
      NYT   NY Train Crash            2            1      3
      QFL   Queensland Flood        278          278    556
      RUS   Russia Meteor           241          241    482
      SAR   Sardinia Flood           67           67    134
      SVR   Savar Building          305          305    610
      SGR   Singapore Haze           54           54    108
      SPT   Spain Train Crash         8            8     16
      TPY   Typhoon Y               107          107    214
      WTX   West Texas Ex.           81           81    162
  • 21. Data: Event Type Distribution
      Event Type                   Events
      Wildfire/Bushfire (2)        CWF, ABF
      E’quakes (4)                 COS, ITL, BOL, GAU
      Flood/Typhoons (8)           TPY, TYP, CFL, QFL, ALB, PHF, SAR, MNL
      Terror Shooting/Bombing (2)  LAX, BOB
      Train Crash (2)              SPT, LAM
      Meteor (1)                   RUS
      Haze (1)                     SGR
      Helicopter Crash (1)         GLW
      Building Collapse (1)        SVR
      Location Fire (2)            BRZ, VNZ
      Explosion (1)                WTX
      [Chart: Crisis Type Distribution across the event types above]
  • 22. Experiment Design
      - Feature models:
          - Statistical Features (SF, baseline)
          - Statistical Features + BabelNet Semantics (SF + SemEF_BN)
          - Statistical Features + DBpedia Semantics (SF + SemEF_DB)
          - Statistical Features + BabelNet Semantics + DBpedia Semantics (SF + SemEF_BNDB)
      - Crisis classification model: merge the entire data and perform 20 iterations of 5-fold cross-validation across all the models to evaluate the performance.
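A small sketch of what "20 iterations of 5-fold cross-validation" produces in terms of train/test index splits. The generator `repeated_kfold` is a hypothetical, non-stratified illustration written with the standard library only; the slides name scikit-learn as the actual toolkit but do not show which splitter was used.

```python
import random

def repeated_kfold(n_items, k=5, repeats=20, seed=0):
    """Yield (train_idx, test_idx) pairs for `repeats` reshuffled k-fold runs."""
    rng = random.Random(seed)
    for _ in range(repeats):
        idx = list(range(n_items))
        rng.shuffle(idx)                         # new shuffle for every iteration
        folds = [idx[i::k] for i in range(k)]    # k near-equal folds
        for f in range(k):
            test = folds[f]
            train = [j for g in range(k) if g != f for j in folds[g]]
            yield train, test

# 5931 = 2966 Related + 2965 Not Related tweets in the balanced data.
splits = list(repeated_kfold(5931))
```

Each of the four feature models would be trained and scored on the same 100 splits, which is what makes the per-model mean and standard deviation comparable.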
  • 23. Experiment Design: Cross-Crisis Classification
      - Criteria 1: content relatedness classification of an already seen crisis event type, i.e., the type of the test data already exists in the training data.
      - E.g., a classifier trained on data containing tweets/documents from flood event types (along with other event types) is used to classify data from a new flood-type crisis event.
  • 24. Experiment Design: Cross-Crisis Classification
      - Criteria 2: content relatedness classification of an unseen crisis event type, i.e., the type of the test data does not exist in the training data.
      - E.g., a classifier trained on data containing tweets/documents from all crisis event types except building fire event types is used to classify data from such a crisis event.
      - To classify: “With death toll at 300, Bangladesh factory collapse becomes worst tragedy in garment industry history”
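The two criteria differ only in which events are allowed into the training set. A minimal sketch, using the event-type grouping from slide 21; the function names are hypothetical, and the results slide for Criteria 2 appears to group terror, bombing and train crash events together, which this mapping does not reproduce.

```python
# Event code -> event type, transcribed from slide 21.
EVENT_TYPE = {
    "CWF": "Wildfire/Bushfire", "ABF": "Wildfire/Bushfire",
    "COS": "Earthquake", "ITL": "Earthquake", "BOL": "Earthquake", "GAU": "Earthquake",
    "TPY": "Flood/Typhoon", "TYP": "Flood/Typhoon", "CFL": "Flood/Typhoon",
    "QFL": "Flood/Typhoon", "ALB": "Flood/Typhoon", "PHF": "Flood/Typhoon",
    "SAR": "Flood/Typhoon", "MNL": "Flood/Typhoon",
    "LAX": "Terror Shooting/Bombing", "BOB": "Terror Shooting/Bombing",
    "SPT": "Train Crash", "LAM": "Train Crash",
    "RUS": "Meteor", "SGR": "Haze", "GLW": "Helicopter Crash",
    "SVR": "Building Collapse", "BRZ": "Location Fire", "VNZ": "Location Fire",
    "WTX": "Explosion",
}

def criteria_1_training_events(test_event, event_type=EVENT_TYPE):
    """Hold out one event; keep all others, including events of the same type."""
    return [e for e in event_type if e != test_event]

def criteria_2_training_events(test_event, event_type=EVENT_TYPE):
    """Hold out one event and drop every event that shares its type."""
    held_out_type = event_type[test_event]
    return [e for e in event_type if event_type[e] != held_out_type]

seen_type_train = criteria_1_training_events("TPY")
unseen_type_train = criteria_2_training_events("TPY")
```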
  • 25. Experiment
      - Classifier selection: Support Vector Machine with linear kernel, chosen after determining the significance of its performance over the RBF kernel, polynomial kernel, and Logistic Regression via 20 iterations of 5-fold CV over the entire data.
      - Tools & library: scikit-learn, Python 2.7.
  • 26. Results: Crisis Classification Model (20 iterations of 5-fold cross-validation)
      Features          Pmean   Rmean   Fmean   Std. Dev. σ (20 iterations)  ∆F/F (%)  Sig. (p-value)
      SF (Baseline)     0.8145  0.8093  0.8118  0.0101                       -         -
      SF + SemEF_BN     0.8233  0.8231  0.8231  0.0111                       1.3919    <0.00001
      SF + SemEF_DB     0.8148  0.8146  0.8145  0.0113                       0.3326    0.01878
      SF + SemEF_BNDB   0.8169  0.8167  0.8167  0.0106                       0.6036    0.00001
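The ∆F/F (%) column is consistent with a relative gain over the baseline F score, 100 * (F_model - F_baseline) / F_baseline. The sketch below recomputes it from the mean F values on slide 26; the function name is illustrative.

```python
def relative_gain(f_model, f_baseline):
    """Percentage change in F score relative to the baseline."""
    return 100.0 * (f_model - f_baseline) / f_baseline

baseline_f = 0.8118  # SF (baseline) mean F
model_f = {
    "SF + SemEF_BN": 0.8231,
    "SF + SemEF_DB": 0.8145,
    "SF + SemEF_BNDB": 0.8167,
}
gains = {name: relative_gain(f, baseline_f) for name, f in model_f.items()}
```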
  • 27. Results: Cross-Crisis Classification, Criteria 1 (SF = baseline; BN, DB, BNDB = SF + SemEF_BN, SF + SemEF_DB, SF + SemEF_BNDB; ∆F/F in %)
      Test   SF F    BN F    BN ∆F/F   DB F    DB ∆F/F   BNDB F   BNDB ∆F/F
      Flood/Typhoon
      TPY    0.803   0.776   -3.44     0.771   -4.01     0.780    -2.83
      TYP    0.863   0.840   -2.66     0.829   -3.84     0.851    -1.29
      ALB    0.718   0.749   4.25      0.844   17.41     0.844    17.41
      QFL    0.783   0.792   1.18      0.77    -1.66     0.781    -0.22
      CFL    0.801   0.827   3.28      0.754   -5.88     0.765    -4.41
      PHF    0.764   0.763   -0.13     0.771   0.93      0.743    -2.83
      SAR    0.570   0.677   18.79     0.648   13.70     0.650    -14.10
      Earthquake
      GAU    0.780   0.725   -7.1      0.784   0.51      0.770    -1.30
      ITL    0.583   0.562   -3.58     0.615   5.49      0.588    0.98
      BOL    0.742   0.724   -2.38     0.758   2.2       0.674    -9.07
      COS    0.790   0.770   -2.56     0.739   -6.42     0.750    -5.08
  • 28. Results: Cross-Crisis Classification, Criteria 2 (SF = baseline; BN, DB, BNDB = SF + SemEF_BN, SF + SemEF_DB, SF + SemEF_BNDB; ∆F/F in %)
      Test   SF F    BN F    BN ∆F/F   DB F    DB ∆F/F   BNDB F   BNDB ∆F/F
      Terror/Bomb/Train
      LAX    0.652   0.677   3.9       0.665   1.95      0.656    0.58
      LAM    0.618   0.626   1.2       0.616   -0.34     0.628    1.62
      BOB    0.608   0.635   4.4       0.605   -0.56     0.607    -0.19
      SPT    0.547   0.686   25.56     0.746   36.5      0.686    25.56
      Flood/Typhoon
      TPY    0.642   0.606   -5.67     0.651   1.39      0.582    -9.45
      TYP    0.678   0.679   -0.12     0.661   -2.54     0.603    -10.99
      ALB    0.716   0.705   -1.63     0.81    13.02     0.712    -0.63
      QFL    0.681   0.657   -3.51     0.698   2.58      0.696    2.23
      CFL    0.776   0.706   -9.04     0.704   -9.27     0.754    -2.87
      PHF    0.532   0.566   6.52      0.632   18.9      0.556    4.67
      SAR    0.537   0.553   2.93      0.595   10.69     0.617    14.84
      Earthquake
      GAU    0.487   0.495   1.62      0.630   29.39     0.593    21.79
      ITL    0.509   0.516   1.26      0.553   8.54      0.555    8.93
      BOL    0.724   0.639   -11.73    0.674   -6.86     0.588    -18.77
      COS    0.515   0.480   -6.71     0.538   4.56      0.527    2.33
  • 29. Results and Observations
      - Based on the Information Gain (IG) score across each feature model (on the overall data), we observed very event-specific features in the SF model, such as collapse, terremoto, fire, and earthquake, among the top-ranked features.
      - We observed 7 different hashtags in the top 50 features (indicating event-specific vocabulary).
      - In the SF+SemEF_BN and SF+SemEF_DB models, we observed concepts such as natural_hazard, structural_integrity_and_failure, conflagration, perception, geological_phenomenon, dbo:location, and dbc:building_defect in the top 50 features.
      - structural_integrity_and_failure: annotated entity for terms like collapse and building collapse, which occur frequently in earthquake- and flood-type events.
      - natural_disaster: hypernym of event terms such as flood, landslide, and earthquake.
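Information Gain for a single binary feature (present or absent in a document) can be sketched as below. The toy documents and labels are invented purely for illustration and are not drawn from CrisisLexT26; they only show why a shared hypernym can rank above an event-specific term.

```python
import math

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    h = 0.0
    for c in set(labels):
        p = labels.count(c) / n
        h -= p * math.log2(p)
    return h

def information_gain(docs, labels, feature):
    """IG of the presence/absence of `feature`; each doc is a set of tokens."""
    with_f = [y for d, y in zip(docs, labels) if feature in d]
    without_f = [y for d, y in zip(docs, labels) if feature not in d]
    n = len(labels)
    conditional = (len(with_f) / n) * entropy(with_f) \
                + (len(without_f) / n) * entropy(without_f)
    return entropy(labels) - conditional

# Invented toy data: 1 = Related, 0 = Not Related.
docs = [
    {"earthquake", "natural_disaster"},
    {"flood", "natural_disaster"},
    {"concert"},
    {"football"},
]
labels = [1, 1, 0, 0]
ig_hypernym = information_gain(docs, labels, "natural_disaster")
ig_event_term = information_gain(docs, labels, "earthquake")
```

In this toy set the shared concept separates the classes perfectly, while the event-specific word only covers one of the two related documents.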
  • 30. Results and Observations
      - On average, SF+SemEF_DB is the best performing model (in Criteria 2).
      - Average percentage gain in F1 score (∆F/F) of +7.2%, with a standard deviation of 12.83%.
      - Improvement over the baseline SF model in 10 out of 15 events: 5 of 7 flood/typhoon, 3 of 4 earthquake, 2 of 4 crash/terrorist.
      - The results show that when the type of the test event is NOT seen in the training data, semantics enhance classifier performance.
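The summary figures on this slide can be recomputed from the SemEF_DB ∆F/F column of the Criteria 2 results (slide 28). The sketch assumes the reported standard deviation is the sample standard deviation, which is what matches the stated 12.83%.

```python
import statistics

# SF + SemEF_DB relative gains (%) per test event, copied from slide 28.
semef_db_gain = {
    "LAX": 1.95, "LAM": -0.34, "BOB": -0.56, "SPT": 36.5,
    "TPY": 1.39, "TYP": -2.54, "ALB": 13.02, "QFL": 2.58,
    "CFL": -9.27, "PHF": 18.9, "SAR": 10.69,
    "GAU": 29.39, "ITL": 8.54, "BOL": -6.86, "COS": 4.56,
}
values = list(semef_db_gain.values())
mean_gain = statistics.mean(values)       # average ∆F/F
std_gain = statistics.stdev(values)       # sample standard deviation
improved = sum(1 for g in values if g > 0)  # events beating the baseline
```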
  • 31. Results and Observations
      - Semantics generalise event-specific terms and consequently adapt to new event types (e.g., dbc:flood and dbc:natural hazard).
      - Semantic concepts can also be too general and thus do not help the classification of documents (e.g., desire and virtue hypernyms). Virtue is a hypernym of a broad range of concepts such as loyalty, courage, cooperation, and charity.
      - Automatic semantic extraction tools could extract many non-relevant entities and therefore might confuse the classifier, e.g. “Super Typhoon in Philliphines is 236 mph It's roughly the top speed of Formula 1 cars http://t.co/vcRE…”: the annotation and semantic extraction results in
  • 32. Further explorations
      - A more in-depth error analysis of misclassified documents is required.
      - Event type is based on the nature of the crisis. However, events of different types could produce overlapping content. Hence, content similarity could also be taken into account, along with event types.
      - Data about the same crisis event can emerge in multiple languages. Hence we need to expand the analysis to multilingual content.
      - Khare, P., Burel, G., Maynard, D., and Alani, H., Cross-Lingual Classification of Crisis Data, Int. Semantic Web Conference (ISWC), Monterey, 2018 (to be presented).

Editor's Notes

  1. ‘Classifying Crisis-information Relevancy with Semantics’
  2. As the title of the paper suggests, crisis situations are the principal motivation behind this work. People around the world are affected by crises and disasters in various forms. In an era where information can be shared and accessed in real time, they turn to online social media platforms. Twitter is certainly among the most prominent media for sharing and accessing real-time information.
  3. To support this idea, I would like to highlight this case from Hurricane Harvey in 2017. Highlighted here are two crucial pieces of information shared during the crisis. One is a volunteer-driven handle that collates information on a portal about who needs help and rescue, and who could assist in that geographical area. The other is an interaction between two parties that resulted in the rescue of 3 elderly ladies. But such critical information is not always easy to find and access, and we are well aware of the challenges that social media poses alongside the opportunities it offers.
  4. But given those challenges, the opportunities these platforms offer have widely been acknowledged by humanitarian and government agencies.
  5. The need for tools and systems that correctly determine what is valuable on social media, in terms of its relatedness to crisis situations, is highlighted by this small exercise of running a keyword-based search on Twitter during Hurricane Harvey. It is evident that not everything containing crisis-specific terminology is related content. That is precisely where the problem of ‘what is crisis related’ is defined.
  6. But the problem explored in this work isn’t just about what is crisis related and what is not. The key problem is the diversity of crisis events. Different crisis situations have different types and levels of impact on human life, in terms of well-being, civic facilities, and more.
  7. This results in a very broad spectrum of data. For instance, here we see social media data from situations such as earthquakes, wildfires, explosions, road situations, and terrorism, to name a few. You can see how diverse the information is. We therefore need ways to filter in as much of this diverse related content as we can.
  8. So, as a requirement of disaster management, we come down to the following questions. Since manually sieving through the social stream is nearly impossible, can we have automated ways to identify relevant information? And can we learn from past crisis situations to identify relevancy in new ones?
  9. Previous approaches have attempted to tackle this problem. Some have adopted ML approaches, using either supervised or unsupervised classification. Supervised approaches often make use of n-grams, linguistic features, and/or statistical features of tweets. Unsupervised approaches rely on keyword processing and clustering. Other approaches semantically enrich the data to build a faceted search on top of the semantic data. But that does not strictly tell you whether something is crisis related or not, and it may require a new search strategy for each new type of crisis event.
  10. Our hypothesis is that crisis relatedness is exhibited by a combination of various concepts that occur in the user-generated content. It might not always be just one key term that establishes the crisis relatedness of a tweet. So, we go beyond the statistical features and incorporate contextual semantics along with them. We hypothesize that different crisis situations, while they may be very distinct in their vocabulary and sense, can relate to each other in a broader contextual sense.
  11. A number of statistical features can be extracted: length, number of words, counts of parts of speech such as nouns, verbs, and pronouns, and hashtags. One can also calculate a readability score, which rates how easy or complicated a tweet is to read. The Gunning Fog Index is one method for calculating it.
  12. Various additional semantics can be considered. Co-occurring words are words that commonly occur together across large-scale corpora; word embeddings are a good example. Extracted entities, along with the original text, can sharpen the context further. Next, we can refer to a knowledge base (such as DBpedia) to retrieve extra information/properties about each entity. We can also use hierarchical context (from WordNet) to retrieve hypernyms and synonyms of each concept to generalise the context further.
  13. We require NER tools and Knowledge Bases to extract the semantics that we want. Here are some of the well known tools available.
  14. In this work we have relied on the BabelNet knowledge base and the NER API built on top of it. BabelNet is a multilingual semantic network resource that incorporates multiple knowledge bases such as Wikipedia and WordNet, which caters well to the requirements here.
  15. Here is an example of using BabelNet to semantically enrich a tweet by extracting entities and their hypernyms. We annotate the key entities in the text and then look for their corresponding hypernyms and augment them to the overall context.
  16. The features that we have extracted are as follows. The statistical features are… The semantic features are. We extract hypernyms assuming that diverse concepts can relate when the context is expanded to parent levels of concepts. We refer to the Dbpedia properties to retrieve extra information/properties about that entity, which can link us to broader knowledge about its nature.
  17. To gain a perspective of what we imply by semantic enrichment across the crisis types let us look at the following real scenario tweets.
  18. Another example elaborating the same.
  19. As an end to end process in a nutshell, this is what we do.
  20. CrisisLex is a popular data repository that has repeatedly been used in related research studies. We used one particular corpus from CrisisLex, called CrisisLexT26. This dataset comprises manually labelled tweets from 26 crisis events that occurred between 2012 and 2013. Each of these 26 events has close to 1000 labelled tweets. There are 4 labels, which we merge pairwise to create a binary class system.
  21. To learn both classes in a balanced way, we make sure to pass unskewed data to the classifier. In each of the 26 events we select the same number of related tweets as unrelated ones.
  22. Here you can see the related and unrelated distribution across all 26 crisis events. Evidently, the number of tweets is not the same for every event. I would like to highlight that each event is a broad crisis situation, and the tweets collected during that situation are what we call an event. For instance, CWF is the Colorado Wildfire, and it contains the tweets collected during that event.
  23. Here we have categorised events by type. This infographic shows that most events in this dataset are floods/typhoons, followed by earthquakes. Next, we have the same number of crash and shooting/bombing events.
  24. Now we describe our experiment design. We create 4 feature models to evaluate. The first is the statistical model, which is our baseline. Next, we create 3 semantic models in which we enhance the SF feature model with semantic features: SF + BabelNet semantics, SF + DBpedia semantics, and a last model that combines both the BabelNet and DBpedia features. When we add the semantics, we concatenate them with the original text of the tweet, and then do the tokenisation and create n-grams. So, these are the feature models. Now we design the classification methods. First, as a broad run, we simply merge the data from all the events and perform 20 iterations of 5-fold CV to see how the 4 feature models perform.
  25. Now we begin designing the cross-crisis classification methods, where we aim to evaluate how the classifier performs when classifying data from a new crisis event. We set up 2 criteria. In Criteria 1, we test a crisis event whose type has already been seen by the classifier in the training data.
  26. In Criteria 2, we test a type of crisis event when the type is not seen by the classifier in the training data. For instance, we have a classifier trained on data collected during say flood and earthquake events. And we get to classify something coming from a factory collapse situation like here. Basically these 2 criteria set up here are critical part of our analysis.
  27. We chose an SVM with a linear kernel. SVM is known for its suitability in text classification problems. On top of that, we compared the linear kernel’s performance against other kernels and logistic regression using 20 iterations of 5-fold CV over the entire data. The linear-kernel SVM performed statistically significantly better, with a mean F1 value of 0.8118 and a p-value of < 0.00001.
  28. Looking at the results, this is what we get when we merge the entire data and do cross-validation. While this is not strictly cross-crisis classification, it still shows that, broadly, the semantic models perform slightly better than the baseline, with improvements ranging from 0.3% to 1.4%.
  29. Now to Criteria 1, where the event is completely new but the classifier has seen the type of the test event in training. In this case, we take the test event out of the data and use the rest of the dataset as training data. As seen in an earlier infographic, flood/typhoon and earthquake had a good number of events in the overall dataset, so we chose to perform our analysis on these crisis types. When the classifier has already seen a type of crisis event, the semantics may not always be superior to the statistical features, as we can observe from the results. The DBpedia semantics seem to be the most consistent of the semantic feature models, performing better than the baseline in 6 out of 11 test events.
  30. Now we look at Criteria 2, where the type of the event is not seen in the training data. In this case, when we test an event of a given type, we ensure that none of the other events of a similar type are in the training data. Here we see, firstly, that the performance of the baseline drops significantly in comparison to the baseline in Criteria 1. Secondly, the semantic models do better, particularly the DBpedia feature model, which outperforms the baseline in 10 out of 15 test events. As an additional category of types, we also included events from train crashes and bombing/terror attacks.
  31. To analyse how the semantics were affecting the nature of the data and the classifier, we computed Information Gain across all the feature models. In the Statistical Features model, we observed very crisis-specific terms among the top-ranked features. We also saw 7 different hashtags in the top 50 features, which indicates how vocabulary-specific the important features are. As we analysed the semantic models, we saw more generic, yet crisis-related, concepts showing up in the top-ranked features. Some of the semantic concepts existed across different crisis types.
  32. Overall, the DBpedia semantics model was the best performing model when the classifier was tested on an unseen type of crisis. The average gain was 7.2% over the baseline, and it performed uniformly well across all 3 tested types. The DBpedia model performed well likely due to its better coverage and semantic depth, which is something we need to explore more.
  33. A few more takeaways from the analysis: generalising the data semantically helps in adapting to new crisis events. Sometimes very broad and general concepts can result in underperformance of the classifier. For instance, virtue is the hypernym of diverse concepts which can often be used in different contexts. Also, semantic extraction can sometimes yield unwanted concepts, and in huge volumes, as in the example here.
  34. As a progression of this work, we need to perform a more in-depth error analysis. Currently we only take the type of event into account, which is broadly the nature of a given crisis. However, different events can have overlapping content based on similar situations, so it would make sense to take that into account as well. Crisis data can also originate in different languages, so the analysis should be expanded to cover how the classifier can be tuned to handle the multilingual aspect of the data. We have made an attempt at this, which is described in the following research paper that we will present at the upcoming semantic web conference.