This document presents a case study on analyzing Twitter data related to UK floods in early 2014. Researchers collected over 2.3 million geotagged tweets from the UK during the flooding period. They developed a custom flood-related lexicon to filter tweets and identify affected areas. Spatial clustering was used to group tweets into regions, which were then prioritized by flood signal metrics. The most effective metric for identifying impacted areas was signal-to-noise ratio. A second-level clustering identified groups of areas with similar flooding event perceptions over time. The analysis effectively identified flood-stricken areas and those impacted in a similar manner.
UK Flood Case Study: Identifying Affected Areas from Twitter Data
1. Twitter floods when it rains: A case study of the UK floods in early 2014
Antonia Saravanou
University of Athens
Dimitrios Gunopulos
University of Athens
George Valkanas
Stevens Institute of Technology
Gennady Andrienko
Fraunhofer Institute IAIS, DE
Social Web for Disaster Management (WWW workshop 2015)
Florence, Italy
National and Kapodistrian
University of Athens
2. Outline
● Motivation
● Research Questions
● Methodology
○ Data Collection
○ Filtering Step: Flood-Related Lexicon
○ Clustering Step
○ Second Level Clustering
● Results
Twitter floods when it rains: A case study of the UK floods in early 2014 18 May 2015
6. Motivation
● Identify the event and the affected area early
● Monitor the evolution of the event
● Inform users about emergencies
● Resource allocation
● Immediate notification of special incident management units
7. Research questions
RQ1: How can we identify the areas that have been
hit the most by an event?
- where to dispatch emergency response units
RQ2: How effective can we be in identifying these
areas?
- robust and effective techniques to base decisions
RQ3: Can we identify areas that have been stricken
by the event in a similar manner?
- transfer the same techniques to similar affected areas
8. Data Collection
● Twitter - custom crawler
○ Streaming API
● Collection of public tweets
○ Bounding box that covers UK
○ Extract only tweets with GPS
● 13-17 January 2014
● > 2.3 million geotagged tweets
9. Flood-Related Tweets
[Diagram: Entire Dataset (GPS location within the UK bounding box) → ? → Flood-Related Tweets]
10. Filtering Step: Custom Flood-Related Lexicon
● Initial seed set: 13 tokens
○ rain, flood, weather, storm, showers, ...
● Expansion over the entire dataset: 1546 tokens
○ tokens that contain at least one word of the initial seed set as a substring
● Manual review: 456 tokens
○ review each keyword manually and discard unrelated false positives, e.g. brain, train, etc.
○ keep only tokens related to the event, e.g. raining, floods, #ukweather, etc.
11. Original vs. Flood-Related Lexicon
● Manual cleaning process is necessary
● Only 4 keywords are flood-related in the original lexicon
● Flood Lexicon is ⅓ of the Original
- Slow process
+ One time at the beginning
[Table: Top-10 most frequent keywords]
12. Flood-Related Tweets
● Entire Dataset + Flood-Related Lexicon → Flood-Related Tweets
● A tweet is flood-related if it has an exact match with at least one keyword from our flood-related lexicon
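The two filtering steps (substring expansion of the seed set, then exact keyword matching against the cleaned lexicon) can be sketched as follows. This is a minimal illustration, not the authors' code: the tokenizer regex and the function names `tokenize`, `expand_lexicon` and `is_flood_related` are assumptions, and the manual review step is represented only by the caller removing false positives by hand.

```python
import re


def tokenize(text):
    """Lowercase a tweet and split it into word and hashtag tokens (assumed tokenizer)."""
    return re.findall(r"[#\w']+", text.lower())


def expand_lexicon(tweets, seed):
    """Collect every token that contains at least one seed word as a substring."""
    candidates = set()
    for text in tweets:
        for token in tokenize(text):
            if any(word in token for word in seed):
                candidates.add(token)
    return candidates


def is_flood_related(text, lexicon):
    """A tweet is kept if any of its tokens exactly matches a lexicon keyword."""
    return any(token in lexicon for token in tokenize(text))
```

Substring expansion deliberately over-generates (a seed such as "rain" also matches "train"), which is why the candidates are reviewed by hand once before the exact-match filter is applied to the whole dataset.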
13. RQ1: Identifying flood-affected areas
● Why we care
○ where to dispatch emergency response units
○ notify citizens about areas with problems caused by floods
● From GPS to areas
○ Perform spatial clustering using the GPS coordinates
■ Convert GPS coordinates to Cartesian ones
14. Clustering Step: K-Means
K = 10, 100, 500, 1000
Generated clusters as Voronoi polygons
➔ more splits in the densely populated areas
[Maps: 10 clusters, 100 clusters, 500 clusters, 1000 clusters]
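A minimal sketch of the clustering step, under stated assumptions: the slides do not say which GPS-to-Cartesian conversion was used, so an equirectangular projection with an assumed mid-UK reference latitude stands in for it, and a plain Lloyd's K-Means with a deterministic farthest-first initialisation stands in for whatever implementation the authors ran. All function names are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def to_cartesian(lat, lon, ref_lat=54.0):
    """Project (lat, lon) in degrees to planar (x, y) in km.

    Equirectangular approximation; ref_lat=54.0 is an assumed
    mid-UK latitude, not a value taken from the slides.
    """
    x = EARTH_RADIUS_KM * math.radians(lon) * math.cos(math.radians(ref_lat))
    y = EARTH_RADIUS_KM * math.radians(lat)
    return (x, y)


def dist2(p, q):
    """Squared Euclidean distance between two planar points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2


def nearest(point, centroids):
    """Index of the centroid closest to point."""
    return min(range(len(centroids)), key=lambda c: dist2(point, centroids[c]))


def kmeans(points, k, iters=50):
    """Lloyd's algorithm; returns (labels, centroids)."""
    # Farthest-first initialisation (an assumption, chosen to be deterministic).
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(dist2(p, c) for c in centroids))
        )
    for _ in range(iters):
        labels = [nearest(p, centroids) for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster empties out
                centroids[c] = (
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                )
    labels = [nearest(p, centroids) for p in points]
    return labels, centroids
```

Because each tweet is assigned to its nearest centroid, the resulting areas are the Voronoi cells of the centroids, which is consistent with the polygons shown on the slide.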
15. Which areas are the most affected?
● Prioritize generated areas by their potential of
being affected
Prioritization schemes by area a:
1. By total #tweets: baseline
2. By flood-related #tweets
3. By Signal-to-Noise Ratio (SNR): score(a) = (#flood-related tweets in a) / (#tweets in a)
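The three prioritization schemes can be compared with a short sketch like the one below. It assumes each tweet already carries an area label (for example from the clustering step) and a boolean flood-related flag; the function name `rank_areas` and the scheme names are hypothetical.

```python
from collections import Counter


def rank_areas(area_labels, flood_flags, scheme="snr", top=100):
    """Rank areas by total tweets, flood-related tweets, or their ratio (SNR)."""
    total = Counter(area_labels)
    flood = Counter(a for a, is_flood in zip(area_labels, flood_flags) if is_flood)
    if scheme == "total":      # baseline: all tweets in the area
        score = lambda a: total[a]
    elif scheme == "flood":    # flood-related tweets only
        score = lambda a: flood[a]
    elif scheme == "snr":      # flood-related tweets / all tweets in the area
        score = lambda a: flood[a] / total[a]
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    return sorted(total, key=score, reverse=True)[:top]
```

For example, an area with 3 flood-related tweets out of 10 scores 0.3 under SNR and ranks below an area with 2 out of 4 (0.5), even though it has more tweets under both count-based schemes.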
16. Visualization of top-100 most affected areas
[Maps: 1. total #tweets, 2. flood-related #tweets, 3. SNR; top 100 for K-Means (K=500)]
17. RQ2: Identification Effectiveness
Ground Truth
- Met Office
1. Likert Scale [1-5]: to specify the degree to which an area has been affected
a. 1 = “normal levels of rainfall”
b. 5 = “completely flooded”
2. Running Average Likert
19. Results (k = 100)
● Baseline < Flood, SNR
● Flood ~ SNR
20. Results (k = 500)
● Baseline << Flood < SNR
● #tweets is not a good proxy
● #flood-related tweets is a better one
21. Results (k = 1000)
● SNR is the best metric (especially for the top 20)
● It captures how many users talk about the specific event
22. RQ3: Similarly affected areas
Identify areas that behave similarly over time in the way
Twitter users perceived the flooding event
Underlying connection:
● population level, e.g., similar posting patterns
● other variable, e.g., a nearby river
23. Second Level Clustering: Attributes
Features that show the temporal evolution of the
event in an area
1. Number of tweets in day d: count(d)
2. Ratio of day d in area a: ratio(d) = count(d) / Σ count(d’), summed over all days d’
3. Speed of day d: speed(d) = ratio(d) - ratio(d-1)
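The three per-area attributes can be computed from a list of daily tweet counts as in this small sketch. The function name `temporal_features` is hypothetical, and setting the speed of the first day to 0.0 is an assumption, since the slide does not define speed(d) when there is no previous day.

```python
def temporal_features(counts):
    """Return (ratios, speeds) for one area, given its tweet count per day."""
    total = sum(counts)
    # ratio(d) = count(d) / sum of count(d') over all days d'
    ratios = [c / total if total else 0.0 for c in counts]
    # speed(d) = ratio(d) - ratio(d-1); the first day has no predecessor (assumed 0.0)
    speeds = [0.0] + [ratios[i] - ratios[i - 1] for i in range(1, len(ratios))]
    return ratios, speeds
```

Because the ratio normalizes by the area's own total, areas with very different populations can still be compared on the shape of their activity over the five days.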
24. Second Level Clustering: Areas from 2 clusters
[Map: areas colored by cluster (cluster 1, cluster 2)]
● Speed feature
● Red cluster: Scotland, Liverpool and Ireland, mostly unaffected
● Purple cluster: Midlands, affected
● Red speed decreases
● Purple speed increases
● Verification with historical data
25. The INSIGHT project
Intelligent Synthesis and Real-time Response
using Massive Streaming of Heterogeneous Data
Detecting Events:
- sensors on road network
- sensors on buses
- Twitter data
http://www.insight-ict.eu/
26. Conclusions
● Analysis on Twitter data
○ emergencies, disaster management & relief
● Experimental analysis on flooding events
○ establishment of a “flood-related lexicon”
○ division of the entire UK into affected areas
○ identification of flood-stricken areas with high accuracy
● Comparison with ground truth data
○ quality evaluation
27. Future Work
● Collect more data from similar flooding events and
test our approach on larger datasets
○ generalize to other areas
○ test with a larger timespan
● Develop online clustering approaches (1ier)
● Incorporate the approach into the INSIGHT tool
28. Thank you!
Acknowledgements:
MMD - Mining Mobility Data
INSIGHT - Intelligent Synthesis and Real-time Response using Massive Streaming of Heterogeneous Data