Identifying Relevant Messages in a Twitter-based Citizen Channel for Natural Disaster Situations

Alfredo Cobo (ajcobo@uc.cl)
Denis Parra (dparra@ing.puc.cl)
Jaime Navón (jnavon@ing.puc.cl)

Pontificia Universidad Católica de Chile
Departamento de Ciencia de la Computación
Av. Vicuña Mackenna 4860, Macul
Santiago, Chile

I (… and some other people in this room) … come from Chile

Pictures from http://www.quadrodemedalhas.com/images/mapas/mapa-chile.jpg and
http://upload.wikimedia.org/wikipedia/commons/thumb/9/91/Chile_in_South_America_(-mini_map_-rivers).svg/409px-Chile_in_South_America_(-mini_map_-rivers).svg.png

Chile, well-known for its..
• Copper (Top Producer)

"Top 5 Copper Producers" by Plazak - Own work. Licensed under CC BY-SA 3.0 via Wikimedia Commons - http://commons.wikimedia.org/wiki/File:Top_5_Copper_Producers.png#/media/File:Top_5_Copper_Producers.png

https://www.google.com/url?sa=i&rct=j&q=&esrc=s&source=images&cd=&cad=rja&uact=8&ved=0CAYQjB0&url=http%3A%2F%2Fcommons.wikimedia.org%2Fwiki%2FFile%3ANative_Copper_(mineral).jpg&ei=L31ZVbOsL4r1UrbRgKAB&bvm=bv.93564037,d.d24&psig=AFQjCNHr2zm5m4Jmim7AgkCwwSb0b5mGUA&ust=1432014509629311

Chile, well-known for its..
• Wine (Price + quality)

"Fiesta de Vendimia" by LuxoDresden - Own work. Licensed under CC BY-SA 3.0 via Wikimedia Commons - http://commons.wikimedia.org/wiki/File:Fiesta_de_Vendimia.JPG#/media/File:Fiesta_de_Vendimia.JPG

If you start typing in Google…
9 out of 10 disasters … prefer Chile

… and for Natural Disasters :(
• Largest earthquake ever recorded: Valdivia, Chile, 22 May 1960 (9.5 on the Richter scale)
• We usually have one large earthquake every 30 years (~8 degrees on the Richter scale)
• The last one, in 2010, struck close to Concepción, but it also affected Santiago (the capital)

… so, at PUC Chile
• We created CIGIDEN, the “National Research Center for the Integrated Administration of Natural Disasters”

CIGIDEN’s Goal in this project
• Help citizens stay informed during natural disasters by using social media.
• Build a mobile application (Carlos Molina)
• Automatically separate relevant messages from those not related to earthquakes (Alfredo Cobo) to feed the application

Our Task: Building a Twitter classifier
- Separate tweets related to natural disasters from those that are not.

Related Work
Manual Classification
• Vieweg et al. (2010)
• Imran et al. (2013)
• Mendoza et al. (2010)

Data Post-processing
• Mendoza et al. (2010)
• Castillo et al. (2011) (Information Credibility on Twitter)

Feature Generation (not necessarily for natural disasters)
• Gimpel et al. (2011)
• Kouloumpis et al. (2011)
• Liu et al. (2012)
• Wu et al. (2011)
• Lee et al. (2014)

Tools for Disaster Management
• Hiltz et al. (2013)
• Power et al. (2013)
• Caragea et al. (2011)
• Abel et al. (2012)
• Middleton et al. (2014)
• Morstatter et al. (2013)
• Imran et al. (2014)

Why would building this classifier be a contribution?
• Building and validating a ground truth for classifying tweets in Spanish.
• Building the classifier and dealing with
  • Class imbalance
  • Number of latent dimensions (feature generation using LDA)

Workflow of Activities
[Diagram: Chile’s 2010 earthquake; Castillo et al. (2010) dataset; sampling, cleaning & filtering; our ground truth; non-relevant messages; realistic dataset; classifiers (feature selection with LDA, class imbalance); 10%-80%]

Building the ground truth
• Random sample of 5,000 tweets from the Castillo et al. (2010) dataset, which was used to study credibility during Chile’s 2010 earthquake.
• Dates: from February 27th until March 2nd (spanning 4 days in 2010)
• We kept only Spanish messages and removed messages that were too similar (Levenshtein distance): 2,187 messages left
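The near-duplicate removal step above can be sketched as follows. This is only an illustration: the slides do not give the similarity threshold or the exact procedure, so `drop_near_duplicates` and its `max_ratio` parameter are hypothetical, and the example tweets are made up.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def drop_near_duplicates(messages, max_ratio=0.2):
    """Keep a message only if it is not too similar to one already kept.

    max_ratio is an assumed threshold: two messages count as near-duplicates
    when their edit distance is at most max_ratio times the longer length.
    """
    kept = []
    for msg in messages:
        is_duplicate = any(
            levenshtein(msg, other) <= max_ratio * max(len(msg), len(other), 1)
            for other in kept
        )
        if not is_duplicate:
            kept.append(msg)
    return kept


# Made-up example tweets; the first two differ by a single word.
tweets = [
    "Terremoto en Concepcion, se cayo un puente",
    "Terremoto en Concepcion, se cayo el puente",
    "Hoy almorzamos empanadas",
]
unique = drop_near_duplicates(tweets)
print(unique)
```

The pairwise comparison is quadratic in the number of messages, which is workable for a sample of a few thousand tweets but would need blocking or hashing on a larger stream.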

Validating the ground truth
• Fleiss’ kappa:
  • κ = 0.645, p < .001
• Intraclass correlation:
  • ICC(2,1) = 0.646, p < .001
• Landis and Koch (1977)
• Relevant messages were labeled based on the Imran et al. (2013) classification:
  • Caution/Warning
  • Casualties and Damage
  • People (missing, found, etc.)
  • Information source
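For readers unfamiliar with the agreement statistic above, here is a minimal sketch of how Fleiss' kappa is computed and how a value is read against the Landis and Koch (1977) bands. The rating matrix is a made-up toy example (3 raters, 2 categories); the slides do not state how many annotators labeled each tweet.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for counts[i][j] = number of raters who assigned
    item i to category j. Every row must sum to the same number of raters."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_categories = len(counts[0])
    # Observed agreement for each item, then averaged over items.
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_items) / n_items
    # Chance agreement from the overall share of each category.
    totals = [sum(row[j] for row in counts) for j in range(n_categories)]
    p_e = sum((t / (n_items * n_raters)) ** 2 for t in totals)
    return (p_bar - p_e) / (1 - p_e)


def landis_koch_label(kappa):
    """Agreement bands proposed by Landis and Koch (1977)."""
    if kappa < 0:
        return "poor"
    if kappa <= 0.20:
        return "slight"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"


# Toy data: columns are (relevant, not relevant), one row per tweet.
toy_counts = [[3, 0], [0, 3], [2, 1], [3, 0]]
toy_kappa = fleiss_kappa(toy_counts)
print(round(toy_kappa, 3), landis_koch_label(toy_kappa))

# The value reported on the slide falls in the "substantial" band.
print(landis_koch_label(0.645))
```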


Classification Problem
Features
• User network: followers, followees
• Content (4,766 unique words): hashtags, words, user mentions

Class Imbalance
• The ground truth is not a realistic representation of Twitter
• We added “noise”: introduced tweets not relevant to the event (20%-80%)
• Sampled non-relevant tweets from 5 months.
• Removed all tweets posted during days of seismic activity
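The noise-injection step can be sketched like this. It assumes that the noise proportion is the share of background tweets in the final dataset and that every background tweet is labeled non-relevant (0); the slides do not spell out either detail. The function name, the dictionary fields, and the sample data are all hypothetical.

```python
import random
from datetime import date


def build_noisy_dataset(ground_truth, background, noise_proportion,
                        seismic_days, seed=0):
    """Mix labeled tweets with background tweets so that background tweets
    make up `noise_proportion` of the resulting dataset."""
    if not 0 <= noise_proportion < 1:
        raise ValueError("noise_proportion must be in [0, 1)")
    # Drop background tweets posted on days with seismic activity.
    candidates = [t for t in background if t["day"] not in seismic_days]
    # Solve n_noise / (n_truth + n_noise) = noise_proportion for n_noise.
    n_noise = round(len(ground_truth) * noise_proportion / (1 - noise_proportion))
    if n_noise > len(candidates):
        raise ValueError("not enough background tweets for this proportion")
    rng = random.Random(seed)
    noise = rng.sample(candidates, n_noise)
    dataset = [(t["text"], t["label"]) for t in ground_truth]
    dataset += [(t["text"], 0) for t in noise]  # assumed label: non-relevant
    rng.shuffle(dataset)
    return dataset


# Made-up data: 4 labeled tweets and 20 background tweets from May 2010.
ground_truth = [
    {"text": "gt 0", "label": 1},
    {"text": "gt 1", "label": 1},
    {"text": "gt 2", "label": 0},
    {"text": "gt 3", "label": 1},
]
background = [
    {"text": f"noise {i}", "day": date(2010, 5, 1 + i)} for i in range(20)
]
seismic_days = {date(2010, 5, 3)}  # hypothetical day to exclude

dataset = build_noisy_dataset(ground_truth, background, 0.6, seismic_days)
print(len(dataset))
```

Fixing the seed keeps the sampled noise reproducible, which matters when comparing classifiers across noise proportions.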

Classification Results
Model                Precision  Recall  F1 score  Accuracy  AUC    Dimensions  Noise Proportion
Baseline             0.625      0.545   0.53      0.5       0.568  -           0
Bernoulli NB         0.831      0.226   0.355     0.594     0.605  2000        0
Logistic Regression  0.827      0.641   0.722     0.756     0.834  2000        0.6
Linear SVM           0.687      0.677   0.682     0.687     0.719  1000        0.6
Random Forest        0.807      0.673   0.734     0.758     0.844  1000        0.8
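As a reading aid for the results, the sketch below recomputes F1 as the harmonic mean of precision and recall for the four trained models, using only the figures reported in the table. The baseline row is left out because its reported F1 (0.53) does not follow from this formula given its precision and recall; it may have been averaged differently, which the slides do not say.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the results table.
reported = {
    "Bernoulli NB": (0.831, 0.226, 0.355),
    "Logistic Regression": (0.827, 0.641, 0.722),
    "Linear SVM": (0.687, 0.677, 0.682),
    "Random Forest": (0.807, 0.673, 0.734),
}

for model, (precision, recall, f1) in reported.items():
    computed = f1_score(precision, recall)
    print(f"{model}: computed {computed:.3f}, reported {f1:.3f}")
```

Bernoulli NB shows why F1 is reported alongside precision: its precision is the highest in the table, but its low recall pulls F1 well below the other models.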

Analysis ~ LDA Dimensions and Noise

Conclusions & Future Work
• We built and validated a ground truth of Spanish tweets relevant to disasters
• We implemented a classifier and analyzed its performance across several algorithms while dealing with the class imbalance problem
• Future work: move the application from prototype to production and test online scalability

That’s all folks!
• Thanks! Send questions to the corresponding author, Alfredo Cobo (ajcobo@uc.cl), or to Denis Parra (dparra@uc.cl)

Chile, small country, but well-known for its..
• Length (4,300 km)
[Map labels: ~4,300 km, ~8,000 km]

Model Features
• Newman et al. (2007)
• Biro et al. (2008)
• Wei et al. (2006)
• Wang et al. (2012)
• Han (2005)

Features: followers, friends
Corpora features: hashtags, words, user mentions

Results
• Amatriain et al. (2013)

Plots of bootstrap
[Figure panels: Agreement Day 1, Agreement Day 2, Agreement Day 3, Agreement Day 4]

Manual classification
• Vieweg et al. (2010)
• Imran et al. (2013)

Post Processing
• Castillo et al. (2011)
• Mendoza et al. (2010)

Feature Generation Approaches
• Gimpel et al. (2011)
• Kouloumpis et al. (2011)
• Liu et al. (2012)
• Wu et al. (2011)
• Lee et al. (2014)

Tools For Disaster Management
• Hiltz et al. (2013)
• Power et al. (2013)
• Caragea et al. (2011)
• Abel et al. (2012)
• Middleton et al. (2014)
• Morstatter et al. (2013)
• Imran et al. (2014)

Building the ground truth
• Mendoza et al. (2010)
• Imran et al. (2013)

Algorithms and evaluation procedure
• Castillo et al. (2011)
• Fawcett et al. (2004)
• Manning et al. (2008)
• Wen et al. (2014)
