Finding Relevant Crisis Information on Social Media
1. Crisis Computing
Finding relevant and credible information on social
media during disasters
Big Data Analytics Conference
Delhi, India, December 2014
8. An earthquake hits a Twitter user
Carlos Castillo – chato@acm.org
http://www.chato.cl/research/
• When an earthquake strikes, the first tweets are
posted 20-30 seconds later
• Damaging seismic waves travel at 3-5 km/s, while
network traffic moves at close to light speed over
fiber/copper, plus latency
• Beyond ~100 km, seismic waves may be overtaken by
tweets about them
http://xkcd.com/723/
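The arithmetic behind the last bullet can be checked with a short calculation. This is a minimal sketch, not part of the talk: it assumes the tweet reaches a remote reader a fixed delay after the quake (the 20-30 s posting delay from the slide plus a hypothetical 1 s of network latency) and that seismic waves travel in a straight line at constant speed.

```python
def overtake_distance_km(wave_speed_km_s: float, tweet_delay_s: float) -> float:
    """Distance beyond which a tweet arrives before the seismic wave.

    The wave needs distance / speed seconds to arrive; the tweet needs
    roughly tweet_delay_s regardless of distance. The two tie when
    distance = speed * delay.
    """
    return wave_speed_km_s * tweet_delay_s


# Slide figures: waves at 3-5 km/s, first tweets 20-30 s after the quake.
NETWORK_LATENCY_S = 1.0  # assumption for illustration, not from the slides

for speed in (3.0, 5.0):
    for posting_delay in (20.0, 30.0):
        d = overtake_distance_km(speed, posting_delay + NETWORK_LATENCY_S)
        print(f"{speed} km/s, {posting_delay:.0f} s delay -> {d:.0f} km")
```

Under these assumptions the break-even distance falls between roughly 60 and 155 km, which is consistent with the ~100 km figure on the slide.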
10. Examples of crisis tweets (cont.)
Alexandra Olteanu, Sarah Vieweg and Carlos Castillo: What to Expect When the
Unexpected Happens: Social Media Communications Across Crises.
To appear in CSCW 2015.
11. Fertile grounds for applied research
✔ Problems of global significance
✔ Solved with labor-intensive methods
✔ Better solution provides a public good
✔ Large and noisy data sets available
✔ Engage volunteer communities
• Relevance to practitioners?
13. Current collaborators
Patrick Meier – QCRI
Sarah Vieweg – QCRI
Muhammad Imran – QCRI
Irina Temnikova – QCRI
Alexandra Olteanu – EPFL
Aditi Gupta – IIIT Delhi
“P.K.” Kumaraguru – IIIT Delhi
Fernando Diaz – Microsoft
15. Crisis maps from social media
Carlos Castillo, Fernando Diaz, and Hemant Purohit:
Leveraging Social Media and Web of Data to Assist Crisis Response Coordination
Tutorial at SDM, Philadelphia, PA, USA. April 2014.
Hemant Purohit, Carlos Castillo, Patrick Meier and Amit Sheth:
Crisis Mapping, Citizen Sensing and Social Media Analytics
Tutorial at ICWSM, May 2013.
20. Patrick Meier, Social Innovation Director @ QCRI – http://irevolution.net/
“What can speed humanitarian
response to tsunami-ravaged
coasts? Expose human rights
atrocities? Launch helicopters to
rescue earthquake victims?
Outwit corrupt regimes?
A map.”
28. Understanding Crisis Tweets
Alexandra Olteanu, Sarah Vieweg and Carlos Castillo: What to Expect When the
Unexpected Happens: Social Media Communications Across Crises.
To appear in CSCW 2015.
29. Types of Disaster
31. Filtering
• Is the tweet disaster-related? (Yes / No)
• Does it contribute to situational awareness? (Yes / No)
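The two filtering questions on this slide can be sketched as a chain of yes/no predicates. This is a minimal illustration under stated assumptions: the keyword lists and function names are hypothetical, tweets answering "No" at either stage are assumed to be dropped, and a real system would use trained classifiers rather than keyword matching.

```python
from typing import Callable, Iterable, List

Predicate = Callable[[str], bool]

# Hypothetical keyword lists, for illustration only.
DISASTER_TERMS = {"earthquake", "flood", "tornado", "hurricane", "wildfire"}
AWARENESS_TERMS = {"closed", "damaged", "injured", "evacuate", "warning", "shelter"}


def _tokens(text: str) -> set:
    """Lowercased words with surrounding punctuation and #/@ stripped."""
    return {w.strip(".,!?:;#@").lower() for w in text.split()}


def is_disaster_related(tweet: str) -> bool:
    return bool(_tokens(tweet) & DISASTER_TERMS)


def contributes_to_awareness(tweet: str) -> bool:
    return bool(_tokens(tweet) & AWARENESS_TERMS)


def filter_tweets(tweets: Iterable[str], stages: List[Predicate]) -> List[str]:
    """Keep a tweet only if every stage answers yes, evaluated in order."""
    # all() short-circuits, so later stages only see tweets that passed earlier ones.
    return [t for t in tweets if all(stage(t) for stage in stages)]


tweets = [
    "Tornado warning for the county, evacuate now",
    "Thinking about that tornado movie",
    "Great pizza tonight",
]
kept = filter_tweets(tweets, [is_disaster_related, contributes_to_awareness])
print(kept)
```

Only the first tweet passes both stages here; the second is disaster-related by keyword but adds nothing to situational awareness.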
32. Classification
Input: filtered tweets
• Information provided: Caution & Advice, Information Sources, Damage & Casualties, Donations, ...
• Information source: Gov, Eyewitness, Media, NGO, Outsider, ...
33. A large-scale study of crisis tweets
• Collect tweets from 26 disasters
• Classify according to:
– Informative / Not informative
– Information provided
– Information source
34. Advice on labeling
• Your instructions will never be correct the first
time you try
– e.g. personal / eyewitness
– Instructions must be re-written reactively
– Perform small-scale labeling first
• Instructions must be concrete and brief
– If you can't do it, the task has to be divided
35. Information Provided in Crisis Tweets
N=26; Data available at http://crisislex.org/
36. What do people tweet about?
• Affected individuals
– 20% on average (min. 5%, max. 57%)
– most prevalent in human-induced, focalized & instantaneous events
• Sympathy and emotional support
– 20% on average (min. 3%, max. 52%)
– most prevalent in instantaneous events
• Other useful information
– 32% on average (min. 7%, max. 59%)
– least prevalent in diffused events
37. What do people tweet about? (cont.)
• Infrastructure and utilities
– 7% on average (min. 0%, max. 22%)
– most prevalent in diffused events, in particular floods
• Caution and advice
– 10% on average (min. 0%, max. 34%)
– least prevalent in instantaneous & human-induced events
• Donations and volunteering
– 10% on average (min. 0%, max. 44%)
– most prevalent in natural hazards
40. Extracting information and matching
emergency-related resources
Muhammad Imran, Shady Elbassuoni, Carlos Castillo, Fernando Diaz and Patrick Meier:
Extracting Information Nuggets from Disaster-Related Messages in Social Media
In ISCRAM. Baden-Baden, Germany, 2013. Best paper award.
Hemant Purohit, Amit Sheth, Carlos Castillo, Patrick Meier, Fernando Diaz:
Emergency-Relief Coordination on Social Media: Automatically Matching Resource Requests and Offers
First Monday 19 (1), January 2014
Muhammad Imran, Shady Elbassuoni, Carlos Castillo, Fernando Diaz and Patrick Meier:
Practical Extraction of Disaster-Relevant Information from Social Media
In SWDM. Rio de Janeiro, Brazil, 2013
41. Information Extraction
Input: classified tweets
Example: @JimFreund: Apparently we have no choice. There is a tornado watch in
effect tonight.
42. Extraction
• #hashtags, @user mentions, URLs, etc.
– Regular expressions
– Text library from Twitter
• Temporal expressions
– Part-of-speech tagger + heuristics
– Natty library
• Supervised learning
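The regular-expression route in the first bullet can be sketched as follows. The patterns are deliberately simplified assumptions for illustration; they do not reproduce the rules of Twitter's text library, and temporal expressions are not covered.

```python
import re

# Simplified patterns (assumptions); real tweet entity rules are more involved.
HASHTAG_RE = re.compile(r"(?<!\w)#(\w+)")
MENTION_RE = re.compile(r"(?<!\w)@(\w{1,15})")
URL_RE = re.compile(r"https?://\S+")


def extract_entities(tweet: str) -> dict:
    """Return hashtags, user mentions and URLs found in a tweet."""
    urls = URL_RE.findall(tweet)
    # Blank out URLs first so that '#fragment' parts are not read as hashtags.
    text = URL_RE.sub(" ", tweet)
    return {
        "hashtags": HASHTAG_RE.findall(text),
        "mentions": MENTION_RE.findall(text),
        "urls": urls,
    }


example = (
    "RT @twc_hurricane: Wind gusts over 60 mph are being reported at "
    "Central Park and JFK airport in #NYC this hour. #Sandy"
)
print(extract_entities(example))
```

One known limitation of this sketch: the URL pattern stops only at whitespace, so punctuation directly after a link is captured as part of it.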
43. Labels for extraction
• Type-dependent instruction
• Ask evaluators to copy-paste a word/phrase from
each tweet
44. Learning: Conditional Random Fields
• Used extensively in NLP for part-of-speech tagging
and information extraction
• Representation of observations is important
(capitalization, position, etc.)
[Figure: graphical models of an HMM and a linear-chain CRF, with hidden and observed variables]
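To make "representation of observations" concrete, the sketch below turns each token into a dictionary of features such as capitalization and position. The feature names are illustrative assumptions, not the feature set used in the work presented here; one dictionary per token is a common input shape for CRF toolkits.

```python
from typing import Dict, List


def token_features(tokens: List[str], i: int) -> Dict[str, object]:
    """Observation features for the token at position i (illustrative set)."""
    tok = tokens[i]
    return {
        "lower": tok.lower(),
        "is_capitalized": tok[:1].isupper(),
        "is_all_caps": tok.isupper(),
        "is_hashtag": tok.startswith("#"),
        "is_mention": tok.startswith("@"),
        "has_digit": any(c.isdigit() for c in tok),
        "position": i,
        "is_first": i == 0,
        "is_last": i == len(tokens) - 1,
        # Neighbouring words give the model local context.
        "prev_lower": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next_lower": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }


def sentence_features(tokens: List[str]) -> List[Dict[str, object]]:
    """One feature dictionary per token, in order."""
    return [token_features(tokens, i) for i in range(len(tokens))]


tokens = "There is a tornado watch in effect tonight .".split()
for feats in sentence_features(tokens)[:4]:
    print(feats["lower"], feats["is_capitalized"], feats["prev_lower"])
```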
45. Tool
• CMU ARK Twitter NLP
– Tokenization
– Feature extraction
– CRF learning
• Very easy to use: simply replace the part-of-speech tags in the
training set with any other labels, and re-train
46. Output examples
• RT @weatherchannel: .@NYGovCuomo orders closing of NYC bridges. Only Staten Island bridges unaffected at this time. Bridges must close by 7pm. #Sandy #NYC
• Wow what a mess #Sandy has made. Be sure to check on the elderly and homeless please! Thoughts and prayers to all affected
• RT @twc_hurricane: Wind gusts over 60 mph are being reported at Central Park and JFK airport in #NYC this hour. #Sandy
• RT @mitchellreports: Red Cross tells us grateful for Romney donation but prefer people send money or donate blood dont collect goods NOT best way to help #Sandy
47. Extractor evaluation
Setting                                     Recall   Precision
Train 2/3 Joplin, Test 1/3 Joplin           78%      90%
Train 2/3 Sandy, Test 1/3 Sandy             41%      79%
Train Joplin, Test Sandy                    11%      78%
Train Joplin + 10% Sandy, Test 90% Sandy    21%      81%
• Precision counts an extraction as correct when it has one word
or more in common with what humans extracted
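The lenient matching rule above can be sketched in a few lines. Precision follows the slide's definition. The recall computed here (share of human-labeled tweets for which the extractor produced an overlapping span) is an assumption, since the slide does not define it, and the function and variable names are hypothetical.

```python
from typing import List, Optional, Tuple

Pair = Tuple[Optional[str], Optional[str]]  # (predicted span, human span)


def _words(span: Optional[str]) -> set:
    return set(span.lower().split()) if span else set()


def lenient_scores(pairs: List[Pair]) -> Tuple[float, float]:
    """Precision and recall where a hit means >= 1 word in common."""
    n_predicted = sum(1 for p, _ in pairs if p)
    n_labeled = sum(1 for _, h in pairs if h)
    hits = sum(1 for p, h in pairs if _words(p) & _words(h))
    precision = hits / n_predicted if n_predicted else 0.0
    recall = hits / n_labeled if n_labeled else 0.0
    return precision, recall


# Made-up pairs for illustration; None means nothing was extracted.
pairs = [
    ("tornado watch", "a tornado watch in effect"),  # hit
    ("7pm", "by 7pm"),                               # hit
    ("Central Park", "JFK airport"),                 # miss: no shared word
    (None, "60 mph"),                                # extractor found nothing
]
precision, recall = lenient_scores(pairs)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Because a single shared word is enough for a hit, this measure is more forgiving than exact-span matching, which is worth keeping in mind when reading the table.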
48. Donations matching
• Identify and match requests/offers for donations
– Money, clothing, food, shelter, volunteers, blood
Average precision = 0.21 (0.16 if only text similarity is used)
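As a rough illustration of a text-similarity-only baseline, the sketch below pairs each request with its most similar offer. Jaccard word overlap, the threshold value, and the example messages are all assumptions chosen for brevity; the slide does not say which similarity measure was used, and its figures indicate that the reported method goes beyond plain text similarity.

```python
from typing import List, Optional, Tuple


def jaccard(a: str, b: str) -> float:
    """Word-level Jaccard similarity between two messages."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if not wa or not wb:
        return 0.0
    return len(wa & wb) / len(wa | wb)


def match_requests(
    requests: List[str], offers: List[str], min_score: float = 0.1
) -> List[Tuple[str, Optional[str], float]]:
    """For each request, return the best offer, or None below min_score."""
    matches = []
    for req in requests:
        best = max(offers, key=lambda off: jaccard(req, off), default=None)
        score = jaccard(req, best) if best is not None else 0.0
        if score >= min_score:
            matches.append((req, best, score))
        else:
            matches.append((req, None, 0.0))
    return matches


requests = [
    "need blankets and food for shelter in Queens",
    "looking for blood donors",
]
offers = [
    "we have food and blankets to donate",
    "volunteers available this weekend",
]
matches = match_requests(requests, offers)
for req, off, score in matches:
    print(f"{req!r} -> {off!r} ({score:.2f})")
```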
49. Crowdsourced stream processing systems
Muhammad Imran, Ioanna Lykourentzou and Carlos Castillo:
Engineering Crowdsourced Stream Processing Systems
http://arxiv.org/abs/1310.5463
51. Design objectives and principles
Design objective     Example metric                        Principle: automatic components    Principle: crowdsourced components
Low latency          End-to-end time                       Keep items moving                  Trivial tasks
High throughput      Output items per unit of time         High-performance processing        Task automation
Load adaptability    Rate response function                Load shedding, load queueing       Task prioritization
Cost effectiveness   Cost vs. quality, throughput, etc.    N/A                                Task frugality
High quality         Application-dependent                 Redundancy, aggregation and quality control (spans both columns)
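The "redundancy, aggregation and quality control" principle in the last row can be illustrated with majority voting over redundant crowd labels. This is a minimal sketch under assumed parameters (three votes per item, 60% agreement) and is not taken from the paper; the function name and thresholds are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Optional


def aggregate_labels(
    votes: Dict[str, List[str]],
    min_votes: int = 3,
    min_agreement: float = 0.6,
) -> Dict[str, Optional[str]]:
    """Majority-vote aggregation; None marks items that need more work."""
    result: Dict[str, Optional[str]] = {}
    for item_id, labels in votes.items():
        if len(labels) < min_votes:
            result[item_id] = None  # redundancy: not enough votes yet
            continue
        label, count = Counter(labels).most_common(1)[0]
        # Quality control: accept only when workers agree strongly enough.
        result[item_id] = label if count / len(labels) >= min_agreement else None
    return result


votes = {
    "t1": ["donations", "donations", "caution"],
    "t2": ["donations", "caution", "damage"],
    "t3": ["caution"],
}
print(aggregate_labels(votes))
```

Items that come back as None could be re-queued for additional votes, which is where the cost-versus-quality trade-off from the table shows up.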
52. Design patterns
● QA loop
● Task assignment
● Process/verify
● Supervised learning
● Crowdwork sub-task
chaining
● Humans are not a
bottleneck
● Humans review every
output element
53. http://aidr.qcri.org/
54. Self-service for crisis-related classification
[Diagram components: unstructured text reports, automatic classifier, categorized information, model builder, crowdsourced ground-truth, library of training data]
57. Credibility and verification
Aditi Gupta, Ponnurangam Kumaraguru, Carlos Castillo and Patrick Meier:
TweetCred: A Real-time Web-based System for Credibility of Content on Twitter
In SocInfo 2014. Runner-up for best paper award.
Carlos Castillo, Marcelo Mendoza, Barbara Poblete:
Predicting Information Credibility in Time-Sensitive Social Media
In Internet Research, Vol. 23, Issue 5. October 2013.
A. Popoola, D. Krasnoshtan, A. Toth, V. Naroditskiy, C. Castillo, P. Meier and I. Rahwan:
Information Verification during Natural Disasters
Social Web and Disaster Management (SWDM) workshop, 2013.
62. Crowdsourced verification: Veri.ly
• Frame crowdwork correctly
• Not upvoting/downvoting a claim
• Instead, providing evidence for/against
@VeriDotLy — http://veri.ly/
65. Examples of evidence provided
66. Automatic credibility evaluation: TweetCred
• Real-time web-based service
• Used as a Chrome extension
• Annotates Twitter's timeline with credibility
scores
67. http://twitdigest.iiitd.edu.in/TweetCred/
68. Next steps
• Credibility facets
– Factually written
– Detailed
– Author on the ground
– ...
• Respond to searches about an event
71. Good projects in this space
[Venn diagram of three circles: computationally feasible, supported by data, useful]
72. Good projects in this space (cont.)
[Venn diagram of three circles: computationally feasible, supported by data, useful; additional region labels: "Temptation!", "Danger!", "Poorly planned projects :-(", "AI-complete problems"]
73. Some venues
• SWDM – Workshop on Social Web
for Disaster Management
– Deadline: January 24th
• ISCRAM – International Conference on Information Systems
for Crisis Response and Management
+ the usual suspects, depending on your area ;-)
74. Possibility of large impact by using computer
science to support humanitarian work
=
Applied computing at its best
75. Thank you!
Carlos Castillo · chato@acm.org
http://www.chato.cl/research/
With thanks to Patrick Meier for several slides