Machine Learning and Data Mining in Identification of Unhappy Communities
Alexander Gilgur
Jose Emmanuel Ramirez-Marquez
The research performed by Jose E. Ramirez Marquez leading to these results has received funding from the National Science Foundation, CRISP Type 2 /
Collaborative Research: Resilience Analytics: A Data-Driven Approach for Enhanced Interdependent Network Resilience, Award number 1541165.
“All happy families are alike; each unhappy family is unhappy in its own way.”
“... for systems belonging to the singular part of the stability boundary a small
change of the parameters is more likely to send the system into the unstable
region than into the stable region.”
Can we predict unhappiness as a social phenomenon?
Clarification
Myers-Diener, 1995 “Who Is Happy?”
https://doi.org/10.1111/j.1467-9280.1995.tb00298.x
Four traits of Happy People
A city full of happy or unhappy people? Is it possible?
A Small(er) Problem
How unhappy are the people in
the communities mentioned in
the News Media?
Preserve People’s
Privacy in Analysis
Our Model
Assumption: Self-Harm
statistics are a measurable
proxy for Unhappiness.
Hypothesis: There is a
correlation between
Reader-Media Interaction
and Unhappiness.
Multiple Reinforcing Loops (“Echo Chambers”) of Unhappiness
[Diagram: reinforcing loops whose labels are Negative Media Sentiment; High Cost of Living; “Negative” Feelings (Pressure to Achieve, Cognitive Dissonance, Pessimistic Outlook, High Stress Levels); Unhappiness; Negative Interaction Sentiment; Communication; Self-Harm; Media Interactions; Action; Anti-Social Behavior.]
Measuring Unhappiness: Suicide Attempts
[Figure: suicide attempts for Contented vs. Unhappy counties, same counties year over year. Chart annotations: 55, 20, 0.8, 0.3. The two classes differ by 2 orders of magnitude.]
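Media Response Analysis
[Figure: reader response (comments vs. articles) for example counties Montgomery, VA; Beaver, PA; King, WA; Alameda, CA, grouped as Contented and Unhappy.]
In Contented counties, the response line (comments vs. articles) is significantly steeper than in Unhappy counties.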
Classification Algorithm
● Collect & preprocess NYT data (https://bit.ly/2PIr4mc).
● LOCAL: apply classifier: fit a linear model (comments vs. articles) for each type of material and county.
● GLOBAL: prepare the labels: fit a linear model (comments vs. articles) for each type of material and happiness level.
● Compute the Distance Metric.
● Place counties (“contented”/“unhappy”) based on Distance Metric values.
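The steps above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the slides do not define the Distance Metric, so the sketch assumes Euclidean distance between (slope, intercept) pairs, and the function names and toy numbers are hypothetical.

```python
import numpy as np

def fit_line(articles, comments):
    """Least-squares fit of comments vs. articles; returns (slope, intercept)."""
    slope, intercept = np.polyfit(articles, comments, deg=1)
    return float(slope), float(intercept)

def classify_county(county_fit, class_fits):
    """Return the label whose GLOBAL fit is nearest to the LOCAL county fit.

    Assumption: Euclidean distance in (slope, intercept) space stands in
    for the Distance Metric, which the slides leave unspecified.
    """
    county = np.asarray(county_fit)
    distances = {
        label: float(np.linalg.norm(county - np.asarray(fit)))
        for label, fit in class_fits.items()
    }
    return min(distances, key=distances.get)

# Toy data (hypothetical): article and comment counts per period.
articles = np.array([10.0, 20.0, 30.0, 40.0])
global_fits = {
    "contented": fit_line(articles, 5.0 * articles + 10.0),  # steeper slope
    "unhappy": fit_line(articles, 2.0 * articles + 60.0),    # higher intercept
}
local_fit = fit_line(articles, 2.2 * articles + 55.0)
label = classify_county(local_fit, global_fits)
```

In practice the slope and intercept live on different scales, so some normalization would probably be needed before a plain Euclidean distance is meaningful.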
What the Algorithm Does
Classify counties by their Confidence Ellipses
Regression parameters provide a clear
grouping of unhappy counties separated from
contented ones. The slope (sensitivity) of the
line is lower for unhappy counties, while the
intercept is higher.
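One possible reading of classifying by confidence ellipses is to test whether a county's (slope, intercept) point falls inside the ellipse fitted to a class. The sketch below assumes a bivariate-normal ellipse at the 95% level; neither assumption is stated in the slides, and the sample points are hypothetical.

```python
import math
import numpy as np

def inside_confidence_ellipse(point, samples, level=0.95):
    """True if `point` lies inside the `level` confidence ellipse of `samples`.

    Assumes the (slope, intercept) pairs are roughly bivariate normal, so the
    squared Mahalanobis distance follows a chi-square law with 2 degrees of
    freedom, whose quantile has the closed form -2 * ln(1 - level).
    """
    samples = np.asarray(samples, dtype=float)
    mean = samples.mean(axis=0)
    cov = np.cov(samples, rowvar=False)
    diff = np.asarray(point, dtype=float) - mean
    d2 = float(diff @ np.linalg.solve(cov, diff))
    threshold = -2.0 * math.log(1.0 - level)
    return d2 <= threshold

# Hypothetical (slope, intercept) fits for counties labelled "unhappy".
unhappy_fits = [(2.0, 60.0), (2.4, 55.0), (1.8, 66.0), (2.2, 58.0), (2.1, 62.0)]
near = inside_confidence_ellipse((2.1, 60.0), unhappy_fits)
far = inside_confidence_ellipse((5.0, 10.0), unhappy_fits)
```

A county falling inside one class ellipse and outside the other would be assigned to the former; points inside both or neither would need a tie-breaking rule, which the slides do not describe.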
Classifier Performance
Precision and Recall for “unhappy” counties are 80% and 84%; overall Accuracy is 76%.
For a detector based exclusively on very limited behavior metadata, this is better than acceptable.
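For reference, the three metrics quoted above follow from a binary confusion matrix in which “unhappy” is the positive class. The counts in this sketch are hypothetical and chosen only to show the arithmetic; they are not the study's confusion matrix.

```python
def classifier_metrics(tp, fp, fn, tn):
    """Precision, recall and accuracy with "unhappy" as the positive class."""
    precision = tp / (tp + fp)                  # flagged counties that are truly unhappy
    recall = tp / (tp + fn)                     # unhappy counties that were flagged
    accuracy = (tp + tn) / (tp + fp + fn + tn)  # all correct calls
    return precision, recall, accuracy

# Hypothetical counts, not taken from the slides.
precision, recall, accuracy = classifier_metrics(tp=8, fp=2, fn=2, tn=3)
```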
Towards a Better Understanding: Sentiment & Emotion Analysis
Using R packages:
● sentimentr
● syuzhet
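The analysis in the slides was done with the R packages listed above. The Python sketch below does not reproduce either package's API; it only illustrates the general idea of lexicon-based emotion counting, using a tiny made-up lexicon (`TOY_LEXICON` and `emotion_counts` are hypothetical names).

```python
import re
from collections import Counter

# Toy lexicon (hypothetical): word -> emotion category.
TOY_LEXICON = {
    "afraid": "fear", "worried": "fear",
    "angry": "anger", "furious": "anger",
    "sad": "sadness", "hopeless": "sadness",
    "happy": "joy", "glad": "joy",
}

def emotion_counts(comment):
    """Count how many words of each emotion category appear in a comment."""
    words = re.findall(r"[a-z']+", comment.lower())
    return Counter(TOY_LEXICON[w] for w in words if w in TOY_LEXICON)

counts = emotion_counts("I am worried and angry; honestly, I feel hopeless.")
```

Aggregating such counts per county and per year would give emotion series of the kind discussed in the later slides.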
Shared Sentiment of Comments
Observations:
1. Unhappy Counties are higher on
Average Sentiment Scores.
2. For Unhappy Counties, the Gap
between Negative and Positive
Sentiment is 67% Wider (173 vs. 104)
3. The difference between Contented
and Unhappy is most pronounced in
the Negative Sentiment.
And this is where ML ends,
and Domain Expertise begins
Interpretation of Negativity: OCEAN (Big 5) Framework?
Neuroticism is the tendency to experience
negative emotions, such as anger, anxiety,
or depression.
Interpretation of Negativity: Neuroticism Framework?
[Jeronimus et al., 2014, p. 759] demonstrated that:
(a) neuroticism showed high temporal stability,
(b) Long-Term Difficulties and Deteriorated Life Quality
predicted lasting increases in neuroticism,
(c) Improved Life Quality predicted lasting decreases in
neuroticism,
(d) life event aggregates had no persistent impact on
neuroticism,
(e) neuroticism predicted experiences more consistently
than experiences predicted change in neuroticism.
Jeronimus, B. F., Riese, H., Sanderman, R., & Ormel, J. (2014).
Mutual reinforcement between neuroticism and life experiences: A
five-wave, 16-year study to test reciprocal causation. Journal of
Personality and Social Psychology, 107(4), 751-764.
http://dx.doi.org/10.1037/a0037009
Further Work: Deteriorated vs. Improved Quality of Life
Counties with small to non-existent “cushion” in low-priced housing have high suicide-attempt rates,
whereas counties with a bimodal distribution of monthly housing costs fall into the “contented” category.
A smoking gun?
Monthly Housing Cost Distribution
Conclusions
1. Yes, we can predict societal unhappiness.
2. ML & DM can be used in Identification of Unhappy
Communities based on Newspaper Reader Comments.
3. Metadata alone can be used to build an Unhappiness
Classifier with decent performance:
● Precision = 0.8
● Recall = 0.84
4. Analysis of Sentiment and Emotions expressed in reader
comments can be used to further enhance the
performance of such a classifier and to suggest possible
paths to root-cause analysis of unhappiness.
5. Interpretation of ML & DM results is best done in
continuous collaboration with the SMEs.
Shared Emotions: Year-over-Year Dynamics
Negativity Dynamics
Observations:
1. Overall, Negativity is the prevailing
sentiment in both contentment classes.
2. Unhappy counties’ Negativity has grown
faster than for Contented counties.
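A claim such as “has grown faster” can be checked by comparing linear trend slopes of the yearly scores. The sketch below assumes a simple least-squares trend; the years and score series are hypothetical, not values read from the slides.

```python
import numpy as np

def yearly_trend(years, scores):
    """Slope of a least-squares line through yearly scores (units per year)."""
    x = np.asarray(years, dtype=float)
    x = x - x.mean()  # center the years for numerical stability
    slope, _ = np.polyfit(x, np.asarray(scores, dtype=float), deg=1)
    return float(slope)

# Hypothetical yearly negativity scores for the two classes.
years = np.arange(2010, 2018)
unhappy = 100.0 + 9.0 * (years - 2010)
contented = 90.0 + 5.0 * (years - 2010)
unhappy_trend = yearly_trend(years, unhappy)
contented_trend = yearly_trend(years, contented)
grows_faster = unhappy_trend > contented_trend
```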
Shared Emotions in Comments: Expectations
Observations:
1. Expectations in Unhappy
counties are higher than in
Contented.
2. Expectations grow more steeply in
Unhappy counties than in
Contented.
Shared Emotions in Comments: Expectation vs. Surprise
Observations:
1. The gap between
Expectation and Surprise
reached maximum in 2012.
2. This gap varies similarly in
Contented and Unhappy
counties.
3. This Gap is Wider in
Unhappy counties
Learned Disability?
Higher expectations in Unhappy?
Shared Emotions in Comments: Sadness vs. Joy
Observations:
1. Overall, Sadness is the
prevailing emotion over Joy.
2. In 8 years, Unhappy
counties’ Gap between
Sadness and joy has
increased more than for
Contented counties.
Disappointment?
Fatigue?
Burnout?
Shared Emotions in Comments: Fear vs. Trust
Observations:
1. Fear has not always been
prevailing over Trust.
2. In Contented counties,
this dynamic has been
growing Very Slowly.
3. In Unhappy counties, Fear
Grew Faster and became
Prevalent after 2012.
Fear of what?
Shared Emotions in Comments: Anger
Observations:
1. There is more Anger in
comments for Unhappy
counties
2. The Anger in
comments grows
faster for Unhappy
counties
Fear?
Learned Disability?
Anger?
Shared Emotions in Comments: Disgust
Observations:
1. There is more disgust in
comments for Unhappy
counties
2. Disgust reaches its peak
a year earlier in
comments for
Contented counties
3. For Contented counties,
disgust grows to a lower
level.
Appendix
Correlations of Emotions in Comments: Potential Anger Drivers
Anger Sensitivity to:   Anticipation (Expectations)   Disgust           Fear
Unhappy                 1/0.548 = 1.825               1/0.735 = 1.360   1/1.103 = 0.907
Contented               1/0.651 = 1.536               1/0.762 = 1.312   1/1.122 = 0.891
Unhappy : Contented     1.19                          1.037             1.017
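The arithmetic in the table can be reproduced as below. The denominators are taken from the slide, which does not say exactly what they are; the sketch simply treats them as coefficients whose reciprocal is the reported sensitivity, and the variable names are illustrative.

```python
# Denominators as shown on the slide (their exact definition is not stated).
coefficients = {
    "Unhappy": {"Anticipation": 0.548, "Disgust": 0.735, "Fear": 1.103},
    "Contented": {"Anticipation": 0.651, "Disgust": 0.762, "Fear": 1.122},
}

# Sensitivity is the reciprocal of each coefficient.
sensitivity = {
    group: {emotion: 1.0 / value for emotion, value in row.items()}
    for group, row in coefficients.items()
}

# Ratio of Unhappy to Contented sensitivity for each emotion.
ratios = {
    emotion: sensitivity["Unhappy"][emotion] / sensitivity["Contented"][emotion]
    for emotion in coefficients["Unhappy"]
}

for emotion, ratio in ratios.items():
    print(f"{emotion}: {ratio:.3f}")
```

The largest Unhappy-to-Contented ratio is for Anticipation, in line with the last row of the table.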
Online Materials
● https://factfinder.census.gov/faces/nav/jsf/pages/index.xhtml
● https://factfinder.census.gov/faces/tableservices/jsf/pages/productview.xhtml?pid=ACS_16_5YR_DP03&src=pt
● https://factfinder.census.gov/faces/nav/jsf/pages/searchresults.xhtml?refresh=t
● https://www.bloomberg.com/graphics/world-economic-indicators-dashboard/
● https://thedataweb.rm.census.gov/TheDataWeb_HotReport2/econsnapshot/2012/snapshot.hrml?NAICS=51
● http://www.pewresearch.org/
● http://www.pewresearch.org/download-datasets/
● http://www.journalism.org/topics/ - part of Pew Research
● http://www.journalism.org/2015/03/05/how-demographics-play-into-local-news-habits-a-visual-display/
● http://www.journalism.org/2015/03/05/local-news-in-a-digital-age/
● http://www.journalism.org/2009/10/05/covering-great-recession/
● https://maps.communitycommons.org/viewer/?action=open_map&id=32125
● https://maps.communitycommons.org/viewer/?action=open_map&id=29455
● https://maps.communitycommons.org/viewer/?action=open_map&id=33559
● https://www.cdc.gov/500cities/
● https://www.usnews.com/news/healthiest-communities/articles/methodology

Informs2019 machine learning and data mining in identification of unhappy communities
