INFORMS 2021 Social cohesion and emotion analysis of media during 2020 wildfires a case study

Social Cohesion and Emotion Analysis of Social Media
During 2020 Wildfires: A Case Study
INFORMS 2021
Alexander Gilgur
Jose Emmanuel Ramirez-Marquez
The research performed by Jose E. Ramirez Marquez leading to these results has received funding from the National Science Foundation, CRISP Type 2 /
Collaborative Research: Resilience Analytics: A Data-Driven Approach for Enhanced Interdependent Network Resilience, Award number 1541165.

Scenario Background
Wildfires have been a fact of life in California for many years, including the 2018, 2019, and 2020 fire seasons. Comparing across these years provides a way to analyze the baseline and to tease out the interaction of wildfires with other events.
The SF Bay Area is usually not affected by wildfires, which tend to ravage the Santa Rosa / Napa / Sonoma areas, as well as Southern California.
In 2020, the SF Bay Area was hit by a rare combination of wildfires triggered by a series of dry lightning storms. The fires burned the hills surrounding the Bay (the Santa Cruz Mountains, the Coastal Mountains, and the range of populated hills stretching from East San Jose to Pleasanton), in addition to the “usual” danger zones.

Scenario
2020
SF Bay Area:
● COVID
● Protests
● Wildfires
● Emotions
● Cohesion
Can we predict Cohesion from Sentiment & Emotions?

Data Sources
https://www.fire.ca.gov/stats-events/
Meltwater

The Timeline
Reference: https://en.wikipedia.org/wiki/August_2020_California_lightning_wildfires
● August 11, 2020: 1 fire started
● August 12, 2020: 1 fire started
● August 15, 2020: 2 fires started
● August 16, 2020: 7 fires started
● August 17, 2020: 5 fires started
● August 18, 2020: 3 fires started
● August 19, 2020: 2 fires started
● August 20, 2020: 1 fire started
● September 22, 2020: Most major fires contained
● January 5, 2021: All August wildfires contained

Measuring Social Cohesion
[Diagram: inputs to the Cohesion metric]
Emotion & Sentiment Analysis
● Tools: syuzhet, sentimentr, nltk.vader
Statistical Analysis
● Z-score
● Inverse CV
Social Network Analysis
● Echo-Chamber Effect: Amplification
● Social Network Metrics: Tie Strength, Centrality

Social Network Analysis: Degree Distribution
Tweets mentioning SF Bay Area cities & counties and wildfires - before, during, and after major wildfires.
Background: Degree Centrality (degree) of a node (user) is the number of connections (edges) it has. (source)
[Figure: degree distribution (degrees vs. nodes) for each period]
Before Bay Area Wildfires
● 111 low-degree (<= 100K) nodes.
● 6 high-degree (> 100K) nodes.
● Max Degree Centrality = 900K.
During Bay Area Wildfires
● 22.6K low-degree (<= 100K) nodes.
● Max Degree Centrality = 47M.
After Containment of Major Bay Area Wildfires
● 1080 low-degree nodes.
● Max Degree Centrality = 4.2M.
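
As a minimal sketch of the degree definition above, the Python snippet below counts connections per node from an undirected edge list. The edge list and node names are hypothetical placeholders. The deck's appendix counts followers plus followed users as a user's connections, so the real input would come from account metadata, not from an explicit edge list.

```python
from collections import Counter


def degree_centrality(edges):
    """Count connections (edges) per node in an undirected edge list."""
    degrees = Counter()
    for u, v in edges:
        degrees[u] += 1
        degrees[v] += 1
    return degrees


# Hypothetical toy network; the node names are placeholders.
edges = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c")]
degrees = degree_centrality(edges)
max_degree = max(degrees.values())
```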

Structural Cohesion
Structural Cohesion is defined as the minimal number of actors in a social network that need to be removed to disconnect the group. (source)
Logically, removal of higher-degree nodes in a social network (degree outliers) would more likely result in network disconnection.
=> The count of degree outliers in a social network can be used as a measurable proxy for structural cohesion (SC).
● Before Wildfires: SC = 5
● During Wildfires: SC = 196
● After Wildfires: SC = 25
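
The proxy above can be sketched in Python as a count of degree outliers. The slides do not state the outlier rule, so this sketch assumes a z-score threshold of 3 (Z-score is listed among the statistical tools). The function name, threshold, and degree list are illustrative assumptions, not the authors' implementation.

```python
import statistics


def count_degree_outliers(degrees, z_threshold=3.0):
    """Count nodes whose degree z-score exceeds z_threshold (assumed rule)."""
    mean = statistics.fmean(degrees)
    stdev = statistics.pstdev(degrees)
    if stdev == 0:
        return 0  # all degrees are equal, so there are no outliers
    return sum(1 for d in degrees if (d - mean) / stdev > z_threshold)


# Hypothetical degree list: many small accounts plus two very large ones.
degrees = [100] * 50 + [500_000, 900_000]
sc_proxy = count_degree_outliers(degrees)
```

A different outlier rule (for example, an interquartile-range fence) would give different counts, so the SC values on the slide should not be expected from this threshold.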

Sentiment Cohesion Metric
[Diagram: Sentiment Analysis Tools -> Sentiment Cohesion]

Sentiment Cohesion: Absolute Inverse CV: Benchmark
Benchmark: Non-Specific Bay Area Sentiment
Absolute Inverse CV is the signal-to-noise ratio that can be used as a measure of cohesiveness in positive or negative sentiment.
Coefficient of Variation: CV = σ / μ (standard deviation over mean), so Absolute Inverse CV = |μ| / σ.
Abs-Inverse-CV spikes:
Positive Sentiment:
● 2020-04-20: “4/20”
● 2020-10-26: final presidential debate
Negative Sentiment:
● 2020-06-08: protests
● 2020-08-17: wildfires
● 2020-10-26: final presidential debate
● 2020-11-23: lockdown
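
A minimal Python sketch of the Absolute Inverse CV, assuming it is computed per batch of sentiment scores as |mean| / standard deviation. The weekly batches and score values are hypothetical, and the choice of the sample standard deviation is an assumption.

```python
import statistics


def abs_inverse_cv(scores):
    """Return |mean| / standard deviation for a batch of sentiment scores."""
    mean = statistics.fmean(scores)
    stdev = statistics.stdev(scores)  # sample standard deviation (assumed)
    if stdev == 0:
        return float("inf")  # perfectly uniform sentiment
    return abs(mean) / stdev


# Hypothetical weekly batches of negative-sentiment scores.
cohesive_week = [-0.8, -0.7, -0.9, -0.8]
scattered_week = [-0.9, 0.1, -0.2, -0.6]
```

A batch whose scores agree closely gets a high value, and a batch with scattered scores gets a low one, which is why spikes are read as moments of cohesive sentiment.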

Absolute Inverse CV: SF Bay Area Wildfires
Wildfires: Bay Area Sentiment
Inverse-CV spikes:
Positive Sentiment:
● 2020-08-24: no new fires
Negative Sentiment:
● 2020-08-10: wildfires
● 2020-08-24: wildfires; air quality dangerous
● 2020-09-21: largest local fires contained
Absolute Inverse CV is the signal-to-noise ratio that can be used as a measure of cohesiveness in positive or negative sentiment.
Coefficient of Variation: CV = σ / μ (standard deviation over mean), so Absolute Inverse CV = |μ| / σ.

Emotion Analysis
[Diagram: Emotion Analysis Tools]

Emotion Timeline During CA Wildfires
[Figure: dominant emotions and polarity per panel, for 2020-08-01, 2020-08-17, 2020-08-24, 2020-09-14, 2020-09-21, and 2020-09-28. Emotions shown: fear, trust, anger, surprise, sadness, joy, anticipation. Panel polarity values range from -6400 to 3451.]
Weights & values of the 6 emotions:
Fear was the dominant emotion.
Anger's effect on polarity was negative.
Surprise was rare. Its effect was uncertain.
Trust and Anticipation effects were positive.
Joy was rare. Its effect on polarity was positive.

Emotion and Cohesion Correlation Analysis

Linear Correlations
Many features (Sentiment & Emotions) are cross-correlated => need PCA
Structural Cohesion is:
Most strongly positively correlated with:
● Fear
● Sadness
Weakly positively correlated with:
● Anger
● Disgust
● Trust
Weakly negatively correlated with:
● Sentiment Cohesion
Most strongly negatively correlated with:
● Negative-Sentiment Cohesion
Cohesion Metrics are all intercorrelated
=> need PCA

Nonlinear Monotonic Correlations
Accepting nonlinearity makes things very structured: we can group strongly correlated emotions:
X_1 = (anticipation, disgust, joy, sadness, trust)
X_2 = (anger, fear, surprise)
We can also roll them all into one metric.
Then we can model C = f(X_1, X_2)
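
The slides do not name the coefficient used for nonlinear monotonic correlation. A common choice is Spearman's rank correlation, which is Pearson correlation applied to ranks. The Python sketch below assumes that choice; the function names and the toy series are hypothetical.

```python
import statistics


def ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) for constant input."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical series: monotonic but nonlinear relationship.
emotion = [1, 2, 3, 4, 5]
cohesion = [1, 8, 27, 64, 125]
```

On a monotonic but curved relationship like this one, the rank-based coefficient reaches 1 while the linear coefficient stays below it, which matches the slide's point that accepting nonlinearity sharpens the grouping.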

Principal Component Analysis (PCA)
PCA finds the linear combinations (principal components, or PCs) of the original variables that maximize the variances of the principal components.
This results in the covariance between PCs being 0 => the principal components are uncorrelated (and independent if the data are jointly Gaussian).

PCA for Cohesion
For the 4 Cohesion-related metrics, PCA has returned 4 Principal Components (PCs)
The PCs explain:
● pc_0: 51.1 % of the variance
Total:
● 100 % of the variance is explained
Discounting pc_3 will only add 3.2% to the noise

PCA-Derived Cohesion Metric
The end result is a PCA-derived metric based on the 4 proxies we defined for social cohesion:
● Structural
● Sentiment:
○ Negative
○ Positive
○ Overall (Compound)
The stepwise changes are due to weekly aggregations used in deriving the proxies.
The new metric is computed as the length of the vector built on the Principal Components (PCs). The PCs are orthogonal; the vector length is the square root of the sum of squares of the PCs.
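
The vector-length construction described above can be sketched with NumPy as follows. This is an illustration under stated assumptions, not the authors' code: the input matrix is random placeholder data, the columns are standardized before PCA (the slides do not say whether this was done), and the function names are hypothetical.

```python
import numpy as np


def pca_scores(data):
    """Project standardized columns onto their principal components."""
    centered = data - data.mean(axis=0)
    standardized = centered / data.std(axis=0, ddof=1)
    # The rows of vt are the principal axes of the standardized data.
    _, s, vt = np.linalg.svd(standardized, full_matrices=False)
    scores = standardized @ vt.T
    explained = s**2 / np.sum(s**2)  # share of variance per component
    return scores, explained


def pca_vector_length(scores):
    """Square root of the sum of squared PC scores, per observation."""
    return np.sqrt(np.sum(scores**2, axis=1))


# Hypothetical data: 30 weekly observations of 4 cohesion proxies.
rng = np.random.default_rng(0)
proxies = rng.normal(size=(30, 4))
proxies[:, 1] += 0.8 * proxies[:, 0]  # make two proxies correlated

scores, explained = pca_scores(proxies)
c_pca = pca_vector_length(scores)
```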

Linear Correlations for Cpca
Many features (Sentiment & Emotions) are cross-correlated => need PCA or RFR
The new cohesion metric C_pca is negatively correlated with emotions and compound sentiment: the stronger the emotions and sentiment, the less cohesive the community.
Disgust, Sadness, and Trust are the strongest linear correlates of C_pca, followed by Anticipation and Fear.
Overall Sentiment, Anger, and Surprise are more weakly correlated with C_pca than the other 5.

Nonlinear Monotonic Correlations for Cpca
Many features (Sentiment & Emotions) are cross-correlated => need PCA
The new cohesion metric C_pca is negatively correlated with emotions and compound sentiment: the stronger the emotions and sentiment, the less cohesive the community.
Anticipation, Disgust, Joy, Sadness, and Trust are the strongest nonlinear negative correlates of C_pca.
Overall Sentiment, Anger, and Surprise are more weakly correlated with C_pca than the other 5.

PCA for Sentiment and Emotions
The PCs explain:
● 61.1 % of the variance
Total:
● 100 % of the variance is explained
We should be fine with only 4 PCs

PCA for Sentiment and Emotions
Like C_pca, this new metric (E_pca) is computed as the length of the vector built on the Principal Components (PCs). The PCs are orthogonal; the vector length is the square root of the sum of squares of the PCs.
This PCA-derived metric is based on the 8 dimensions of Emotions:
● Anger
● Anticipation
● Disgust
● Fear
● Joy
● Sadness
● Surprise
● Trust
and 1 dimension for Sentiment: a combination of Positive and Negative Sentiment derived in the vader package.

Modeling
The nonlinear effects at low values of E_pca and C_pca speak strongly for a nonlinear model.

Linear Model
C_pca = a_0 * pc_0 + a_1 * pc_1 + a_2 * pc_2 + a_3 * pc_3
R^2 = 0.388
Steering away from combining the PCs into one metric (E_pca) and using linear regression on the PCs did not result in a good model. A nonlinear model is more appropriate.

Random Forest Regression (RFR)
● We do not know if the model can be written as a closed-form equation => Random Forest Regression works well in this situation.
● RFR does not need features to be orthogonal => interpretable results.
● RFR computes feature importances as the features' relative contributions to the variance of the dependent variable.
● RFR does not tell us whether an increment of a feature will result in an increase or decrease of the dependent variable => Sensitivity Analysis, LIME, or SHAP follow-up is needed.
R^2 = 0.959
Feature Importances (Contributions To Variance, or CTV):
● trust: 0.673
● anger: 0.147
● disgust: 0.109
● surprise: 0.030
● joy: 0.019
● anticipation: 0.016
● sadness: 0.004
● fear: 0.003
● sentiment: 0.000
Identified important features (CTV cutoff = 0.02)
● trust: 0.673
● anger: 0.147
● disgust: 0.109
● surprise: 0.030
Trust, Anger, Disgust, and Surprise are sufficient to predict the PCA-transformed Community Cohesion Metric (C_pca) with a good fit (R^2 = 0.959). Adding Joy, Anticipation, Sadness, Fear, and Sentiment will make the fit slightly better.
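
The cutoff step above can be sketched in Python using the importances reported on this slide. The function name is hypothetical, and treating the cutoff as inclusive (>=) is an assumption; no reported value equals 0.02, so the choice does not change the result here.

```python
# Feature importances (CTV) as reported on the slide.
importances = {
    "trust": 0.673, "anger": 0.147, "disgust": 0.109, "surprise": 0.030,
    "joy": 0.019, "anticipation": 0.016, "sadness": 0.004, "fear": 0.003,
    "sentiment": 0.000,
}


def select_features(ctv, cutoff=0.02):
    """Keep features whose CTV is at least the cutoff, largest first."""
    kept = {name: value for name, value in ctv.items() if value >= cutoff}
    return dict(sorted(kept.items(), key=lambda item: item[1], reverse=True))


important = select_features(importances)
# Share of the total reported importance carried by the kept features.
retained_share = sum(important.values()) / sum(importances.values())
```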

Conclusions
Using off-the-shelf sentiment and emotion analysis tools and relying on statistical analysis of their outputs, we:
● Derived measurable proxy metrics of social cohesion in two dimensions (structural and sentiment-based) using the data and metadata available from social-media interactions (tweets) within a loosely defined community.
● Used Principal Component Analysis (PCA) to build a statistically sound metric of social cohesion.
● Used PCA to reduce 1 compound 'Sentiment' metric and the 8 basic measurable emotions into 1 statistically sound linear combination of these metrics.
● Demonstrated that the relationship between the PCA-transformed Cohesion Metric and the PCA-transformed Sentiment and Emotions is linear and strong (Pearson correlation = 0.985).
● Showed, through Feature Importance Analysis of Random Forest Regression (RFR), that Anger, Trust, Disgust, and Surprise, in a nonlinear combination, are the emotions important for social cohesion.
● Applied Random Forest Regression (RFR) to predict the PCA-transformed Cohesion metric C_pca as a function of Sentiment and the 8 basic emotions. Resulting R^2 = 0.959, i.e., 95.9% of the variance in C_pca is explained by the RFR model.
○ It can be used to accurately predict social cohesion during and after disturbances.
○ Combining this with forecasts of trends of prevailing emotions can help in determining time to loss of cohesion.

Further Work
● Apply the unified metric to other communities, topics & events (e.g., COVID-19, protests, Presidential
elections, etc.)
● Perform Sensitivity Analysis of the RFR model.
● Model Community Resilience process with Cohesion as the metric of interest.
Cohesion = F(t, S, E)
S = Sentiment
E = Emotion

1. https://flowingdata.com/2020/09/10/a-timeline-of-california-wildfires/
2. https://psycnet.apa.org/record/2000-12222-004
3. https://aisel.aisnet.org/icis2009/112/
4. https://www.sciencedirect.com/science/article/abs/pii/0378873394002478
5. https://www.jstor.org/stable/3088904
6. https://doi.org/10.1016/0378-8733(94)00247-8
7. https://www.sciencedirect.com/topics/computer-science/degree-centrality
8. Our INFORMS 2020 presentation

Social Network Cohesiveness
Tweets mentioning SF Bay Area cities & counties and
wildfires - before, during, and after major wildfires.
Degree Centrality (degree) of a node (user) is the number of
connections (edges) it has. (source)
Before the wildfires, the network of Twitter users concerned about wildfires in SF Bay Area only had 117 users. Only 6 of them
had more than 100K connections (followers + followed users). No users with more than 882.5K connections were identified.
During the wildfires, the network of Twitter users concerned about wildfires in SF Bay Area grew to 10.7 K users. 333 of them
had more than 100K connections (followers + followed users). On 3 occasions, ‘@nytimes’ had more than 47M connections.
As the major Bay Area wildfires were contained, the network of Twitter users concerned about wildfires in SF Bay Area shrank
to 1.1 K users. 16 of them had more than 100K connections (followers + followed users). On 1 occasion, ‘@USATODAY’ had
more than 4.2M connections.


Principal Component Analysis and Dimensionality Reduction
Problem - thresholds are arbitrary
EV Threshold = 0.01
25 PCs 13 PCs
EV Threshold = 0.05
5 PCs
