This document discusses using social media to enhance emergency situation awareness. It analyzes tweets related to earthquakes and floods in Italy to classify them as containing damage information, no damage, or being irrelevant. It uses clustering to discover topics and support vector machines (SVM) and Naive Bayes for classification. Preprocessing includes removing punctuation, numbers, stop words and stemming. Tweets are represented as vectors using TF-IDF. Hierarchical clustering is used for topic discovery while SVM and Naive Bayes are used for classification, achieving accuracies over 86%. The results demonstrate social media can provide real-time information to help emergency responders.
Using Social Media to Enhance Emergency Situation Awareness
1. USING SOCIAL MEDIA TO ENHANCE
EMERGENCY SITUATION AWARENESS
Web Information Retrieval 2018/2019
Danilo Marzilli
Andrea Lombardo
Daniele Davoli
Prof. Andrea Vitaletti - Prof. Luca Becchetti
2. Goals
Real-time event detection through social media
• Users as sensors
• Type of disaster: earthquake and flood
Introduction and goals
Online/Hierarchical clustering
• Topic discovery
Classification problems
• Relevant/Not relevant (SVM)
• Flood/Earthquake (SVM & NB)
Experimental results
3. The dataset: CRESCI-SWDM15
5,642 manually annotated tweets in Italian, covering 4 different natural disasters
that occurred in Italy between 2009 and 2014, labeled with 3 classes:
• Damage class
• No damage class
• Not relevant class
Differences from the dataset used in the paper, which has:
Real-time tweets collected over an entire year
English tweets
A focus on Australian natural disasters
4. Preprocessing and vector transformation
NLTK Python library
1. Elimination of punctuation, numbers, symbols, and stop words;
2. Stemming: Snowball Stemmer (for the Italian language);
3. Lemmatization: Not possible.
scikit-learn Python library (TfidfVectorizer)
1. Build the vocabulary of terms;
2. Represent each tweet as a vector in a multidimensional space;
3. TF-IDF weight.
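The two stages above can be sketched as follows. This is a minimal illustration, not the project's actual code: the stop word list is a tiny hand-picked set (an assumption, standing in for a full Italian list), the sample tweets are made up, and `preprocess` is a hypothetical helper name.

```python
import re

from nltk.stem.snowball import SnowballStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

# Assumption: a tiny hand-picked stop word list, for illustration only.
ITALIAN_STOP_WORDS = {"il", "la", "a", "di", "e", "in", "che", "un", "una", "per", "è"}

stemmer = SnowballStemmer("italian")


def preprocess(tweet):
    """Lowercase, drop punctuation/numbers/symbols and stop words, then stem."""
    # Keep only letters (including common Italian accented vowels) and whitespace.
    text = re.sub(r"[^a-zàèéìòù\s]", " ", tweet.lower())
    tokens = [t for t in text.split() if t not in ITALIAN_STOP_WORDS]
    return " ".join(stemmer.stem(t) for t in tokens)


# Made-up example tweets.
tweets = [
    "Terremoto a L'Aquila, 3 case crollate!",
    "Alluvione in Sardegna: strade allagate",
    "Forte scossa di terremoto in Emilia",
]
cleaned = [preprocess(t) for t in tweets]

# Build the vocabulary and represent each tweet as a TF-IDF weighted vector.
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(cleaned)  # one row per tweet
vocabulary = sorted(vectorizer.vocabulary_)       # one column per term
```

Each row of `tfidf_matrix` is one tweet and each column one vocabulary term, so the matrix can be passed directly to a clustering or classification step.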
5. Clustering VS Classification
Clustering
• Used for topic discovery
• Unsupervised learning
• You don't know a priori how many clusters there are, or which ones
Classification
• Used for binary classification problems
• Supervised learning
• You know the classes (e.g. relevant and not relevant)
• Pre-annotated training dataset
6. Hierarchical/Agglomerative Clustering
Used for topic discovery
Cosine similarity used to compute the distance
Clustering based on centroid/prototype
The prototype/centroid is the representation of a cluster
Bottom-up approach
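A minimal sketch of bottom-up, centroid-based clustering with a cosine distance threshold is given below. It assumes non-zero input vectors and a naive pairwise search; the function names and the toy vectors are hypothetical, and the project's own implementation may differ.

```python
import numpy as np


def cosine_distance(u, v):
    """Cosine distance, computed as 1 - cos(u, v)."""
    return 1.0 - float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def agglomerative_clustering(vectors, threshold):
    """Merge the two clusters with the closest centroids until no pair is within threshold."""
    # Bottom-up: start with one cluster per vector.
    clusters = [[i] for i in range(len(vectors))]
    centroids = [np.asarray(v, dtype=float) for v in vectors]
    while len(clusters) > 1:
        # Find the pair of clusters whose centroids are closest.
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = cosine_distance(centroids[a], centroids[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        if d > threshold:
            break  # the remaining clusters are too far apart to merge
        merged = clusters[a] + clusters[b]
        clusters[a] = merged
        # The centroid (prototype) is the mean of the member vectors.
        centroids[a] = np.mean([vectors[i] for i in merged], axis=0)
        del clusters[b], centroids[b]
    return clusters


# Toy vectors: two pairs pointing in nearly the same direction.
vectors = np.array([[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.0, 0.9, 0.1]])
clusters = agglomerative_clustering(vectors, threshold=0.5)
```

Raising the threshold lets more distant clusters merge, so the number of clusters decreases as the threshold grows.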
7. Support Vector Machine & Naive Bayes
SVM finds a hyperplane that separates 2 classes while keeping the error as low as possible
Naive Bayes counts words and uses their relative and absolute frequencies
Target classes:
• Relevant or Not Relevant
• Flood or Earthquake
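As an illustration of the Flood/Earthquake task, the sketch below trains both classifiers with scikit-learn on a handful of made-up, already preprocessed tweets. The linear kernel, the default hyperparameters and the toy data are assumptions, not the project's settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Made-up training tweets with their target class.
train_tweets = [
    "terremoto forte scossa crolli",
    "scossa di terremoto paura",
    "terremoto magnitudo scossa notte",
    "alluvione fiume esondato strade allagate",
    "pioggia alluvione acqua strade",
    "alluvione acqua fiume case allagate",
]
train_labels = ["earthquake"] * 3 + ["flood"] * 3

# Each pipeline turns raw text into TF-IDF vectors, then classifies them.
svm_model = make_pipeline(TfidfVectorizer(), SVC(kernel="linear"))
nb_model = make_pipeline(TfidfVectorizer(), MultinomialNB())
svm_model.fit(train_tweets, train_labels)
nb_model.fit(train_tweets, train_labels)

new_tweets = ["nuova scossa di terremoto", "fiume esondato alluvione"]
svm_predictions = list(svm_model.predict(new_tweets))
nb_predictions = list(nb_model.predict(new_tweets))
```

The Relevant/Not Relevant task has the same shape: only the labels in `train_labels` change.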
8. Results
Clustering: number of clusters for each defined threshold*
Naive Bayes: parameters for validating
Accuracy by original paper (1): 0.862
Accuracy by original paper (2): 0.875
*distance computed as dist = 1 - cos(vec1, vec2)
9. Results
SVM, first experiment: ROC curve and AUC
SVM, second experiment: ROC curve and AUC
10. Burst detection
Goal: identify a natural disaster by comparing term frequencies in a given time window
with their historical average frequencies
Not implemented in our project because:
• No real-time tweet stream
• Unknown historical average frequencies
• The tweets only cover the time windows of natural disasters, so there is no background noise
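Although burst detection was not implemented, the idea can be sketched with a simple ratio test. This is a deliberately simplified stand-in, not the detector from the original paper: the function names, the ratio threshold, the frequency floor for unseen terms and the sample data are all assumptions.

```python
from collections import Counter


def term_frequencies(tweets):
    """Relative frequency of each term across a list of tokenized tweets."""
    counts = Counter(term for tweet in tweets for term in tweet)
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()}


def detect_bursts(window_tweets, historical_freq, ratio_threshold=5.0, floor=1e-4):
    """Return terms whose frequency in the window is at least ratio_threshold times the baseline."""
    current = term_frequencies(window_tweets)
    bursts = {}
    for term, freq in current.items():
        # Terms never seen before get a small floor instead of a zero baseline.
        baseline = max(historical_freq.get(term, 0.0), floor)
        ratio = freq / baseline
        if ratio >= ratio_threshold:
            bursts[term] = ratio
    return bursts


# Hypothetical historical averages and one time window of tokenized tweets.
historical = {"terremoto": 0.001, "scossa": 0.001, "oggi": 0.05}
window = [["terremoto", "scossa", "oggi"], ["terremoto", "forte"], ["scossa", "terremoto"]]
bursts = detect_bursts(window, historical)
```

With these numbers "terremoto" and "scossa" are flagged while the common word "oggi" is not. Note that a previously unseen term such as "forte" is also flagged, which is why a real system needs a noisy historical baseline.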
12. Vocabulary and vector representation
Vocabulary: the collection of terms found in the tweets
Vector representation: used to evaluate the similarity between
tweets
TF-IDF: used to evaluate the frequency and the
informativeness of a term
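For reference, the weight can be written out explicitly. The form below is the one scikit-learn's TfidfVectorizer uses with its default settings (smoothed IDF); whether the project kept those defaults is an assumption.

```latex
\[
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t),
\qquad
\mathrm{idf}(t) = \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1
\]
```

Here tf(t, d) is the number of times term t occurs in tweet d, n is the number of tweets, and df(t) is the number of tweets containing t. Each tweet vector is then normalized to unit Euclidean length, so a term that is frequent in one tweet but rare across the collection receives a high weight.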
15. Gamma parameter
• The gamma parameter can be seen as the inverse of the radius of influence of the samples
selected by the model as support vectors;
• The penalty for each misclassification is set by the separate C parameter, not by gamma;
• The higher the value of gamma, the narrower the region of influence of each support vector.
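The effect of gamma is easiest to see in the RBF kernel itself, K(x, y) = exp(-gamma * ||x - y||^2). The sketch below assumes the SVM uses an RBF kernel (the slides do not name the kernel); `rbf_kernel` is a hypothetical helper written for this illustration, not a library call.

```python
import math


def rbf_kernel(x, y, gamma):
    """RBF kernel: exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


# Two fixed points; only gamma changes.
x, y = (0.0, 0.0), (1.0, 1.0)
gammas = (0.01, 0.1, 1.0, 10.0)
similarities = [rbf_kernel(x, y, g) for g in gammas]
```

For the same pair of points the kernel value shrinks as gamma grows, so each support vector influences a smaller neighborhood and the decision boundary can bend more tightly around individual samples.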
16. ROC curve and AUC
• The ROC curve is a graphical plot that
shows the performance of a binary classifier
as its decision threshold is varied;
• It is created by plotting the true positive rate
against the false positive rate;
• AUC is the area under the curve, in the range [0, 1]:
• 0 means that every prediction made by
the system is wrong;
• 1 means a perfect classifier
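The construction described above can be sketched in plain Python: sweep the threshold over the classifier scores, record one (false positive rate, true positive rate) point per step, and integrate with the trapezoidal rule. The function names and the sample labels and scores are made up for illustration, and the sketch assumes both classes are present.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) points obtained by sweeping the decision threshold."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    # Sort by descending score: each prefix is the set predicted positive.
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only when the next score differs, so ties share one point.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2 for (x1, y1), (x2, y2) in zip(points, points[1:]))


# Labels (1 = positive class) and classifier scores for six samples.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
area = auc(roc_points(y_true, scores))
```

In this example 8 of the 9 positive/negative pairs are ranked correctly, so the area is 8/9.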
17. Improvements
It would be interesting to run the following experiments:
• Repeat the same experiments on other social media platforms (Facebook and Instagram)
• Create a system to help the population and police forces in case of criminal and terrorist
attacks
• Use cross-validation when training the classification algorithms
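The cross-validation idea could look like the sketch below, using scikit-learn's `cross_val_score`. The toy tweets, the Naive Bayes pipeline and the choice of 3 folds are assumptions made only to keep the example small.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up, already preprocessed tweets with their target class.
texts = [
    "terremoto forte scossa crolli",
    "scossa di terremoto paura",
    "terremoto magnitudo scossa notte",
    "alluvione fiume esondato strade allagate",
    "pioggia alluvione acqua strade",
    "alluvione acqua fiume case allagate",
]
labels = ["earthquake"] * 3 + ["flood"] * 3

# The vectorizer sits inside the pipeline so it is refit on each training fold.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())

# 3 stratified folds: train on two thirds, evaluate accuracy on the rest.
fold_scores = cross_val_score(model, texts, labels, cv=3)
mean_accuracy = fold_scores.mean()
```

Averaging the per-fold accuracy gives a less optimistic estimate than a single train/test split, because every tweet is used for evaluation exactly once.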
18. Our team
Danilo Marzilli: https://www.linkedin.com/in/danilomarzilli/
Andrea Lombardo: https://www.linkedin.com/in/andrea-lombardo-2103ba15a/
Daniele Davoli: https://www.linkedin.com/in/danieledavoli/
19. REFERENCES
• Jie Yin, Andrew Lampert, Mark Cameron, Bella Robinson, and Robert
Power, Using Social Media to Enhance Emergency Situation
Awareness, 2012, IEEE, 1541-1672;
https://ieeexplore.ieee.org/document/6148196/
• Cresci-SWDM15, http://socialsensing.it/en/datasets
• NumPy library, https://www.numpy.org/devdocs/
• Sklearn library, http://scikit-learn.org/stable/documentation.html
• Natural Language ToolKit library, http://www.nltk.org/
Code on GitHub