Can x2vec Save Lives?
Automatic Mental Health Classification in Online
Settings Using Graph and Language Embeddings
Alexander Ruch, MPH MA
PhD Student, Cornell University
amr442@cornell.edu ~ alexruch.weebly.com
The study was supported by the US NSF (1756822) and NIH (R25HD079352)
Background
● Analyzing massive multimodal graphs is complex and resource-intensive (RAM)
● ML approaches to SNA circumvent many of these issues via online learning
while retaining graphs’ relational attributes and clustering propensity
○ Many extend the word2vec embedding architecture to preserve homophily,
structural equivalence, and edge context (Goyal and Ferrara 2017; Mikolov et al. 2013)
● Online communities’ language dynamics (e.g., norms, conformity, innovation)
correlate with users’ interaction patterns and “life cycles” (Danescu-Niculescu-Mizil et al. 2013)
● Document embeddings effectively measure language similarity in documents,
classes of documents, and authors of documents (Le and Mikolov 2014)
● Few researchers, however, have tested how graph and document embeddings
may be combined to analyze behavior and language dynamics together over
massive networks of millions of nodes and edges (cf., Bail 2016)
Questions
● How well can graph embeddings predict where users post submissions?
● How well can document embeddings predict where users make posts?
● Does integrating graph and document embeddings generate the best
predictions of where users post submissions or are they better separate?
● How correlated are graph and document embeddings?
Goal
● Can these methods help us predict individuals at risk of suicide?
Overview
1. sampling, processing
2. metapath2vec graph embeddings
3. doc2vec document embeddings
4. similarities, prediction tasks
Sampling and Processing
Population:
490M submissions,
4.3T comments,
66M authors,
27M subreddits
Timeframe: June 2005 – June 2018
Data source: https://files.pushshift.io/reddit/
Reddit Data Sample
● Main SW sample: 10M nodes with 45.3M edges (= largest component)
○ 6.6M observations collected starting from 5K SW author seeds
■ 700K submission authors (= 1% of Reddit authors)
■ 1.3M submissions
■ 1.6M comment authors
■ 6.5M comments
■ 21K subreddits
● Complement samples: 35.5M nodes with 190M edges (= largest component)
○ 6.6M observations from main sample
○ 7.0M observations from a subsample of mental health subreddits
○ 7.6M observations from a subsample of self-help subreddits
○ 5.8M observations randomly selected across all subreddits
Total-degree: x̄ (sd)
SW: 9.1 (0.63)
MH: 9.0 (0.59)
SH: 9.0 (0.56)
R: 9.4 (0.64)
metapath2vec graph embedding
How close are authors and subreddits
over a network’s interaction space?
Dong et al. (2017) metapath2vec
Embedding multi-relational networks has unique challenges from their many types of nodes
and edges, which limits the feasibility of conventional network embedding techniques
metapath2vec uses metapath-based random walks to sample nodes’ heterogeneous
neighborhoods and embeds nodes using a heterogeneous skip-gram model (cf. word2vec)
metapath2vec++ enables simultaneous modeling of structural and contextual correlations
Both models outperform state-of-the-art embedding models
in many network mining tasks, including node classification,
similarity search, and clustering
Strong results are often achieved with very little data (5%)
MP2V Sampling
and Embedding
Sampling Graphs with Biased Random Walks
Random walks are computationally efficient in terms of space and time:
● Storing nodes’ immediate neighbors is O(|E|)
● Retrieving nodes’ neighbors is then O(|V|)
● Storing the interconnections between nodes’ neighbors is O(a²|V|), where a
is the graph’s average degree and is usually small for real-world networks
Preprocessing transition probabilities makes walking from nodes O(1)
Writing walks’ real-time sampling results to disc instead of RAM saves memory
Since walks are independent, the sampler can be parallelized with multiprocessing
to greatly enhance speed or to run multiple samplers over different metapaths
∴ metapath2vec is ~8 times more efficient than SBM and requires much less RAM
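The biased-walk sampling above can be sketched in miniature. Below is an illustrative metapath-guided walker over a toy heterogeneous graph; the node names, the dict-of-lists storage, and the `metapath_walk` helper are all hypothetical stand-ins, not the project's actual pipeline:

```python
import random

# Toy heterogeneous graph: node -> neighbor list, plus a node-type map.
# All names here are illustrative only.
neighbors = {
    "r/SW": ["s1", "s2"], "r/dep": ["s3"],
    "s1": ["r/SW", "a1"], "s2": ["r/SW", "a2"], "s3": ["r/dep", "a1"],
    "a1": ["s1", "s3"], "a2": ["s2"],
}
node_type = {"r/SW": "subreddit", "r/dep": "subreddit",
             "s1": "submission", "s2": "submission", "s3": "submission",
             "a1": "author", "a2": "author"}

# The metapath from the deck: it starts and ends on the same node type,
# so it can be repeated cyclically for longer walks.
METAPATH = ["subreddit", "submission", "author", "submission", "subreddit"]

def metapath_walk(start, metapath, length, rng=random):
    """Walk `length` nodes, only stepping to neighbors whose type matches
    the next position in the (cyclically repeated) metapath."""
    walk = [start]
    i = 0
    while len(walk) < length:
        i += 1
        # len(metapath) - 1 because the endpoint type repeats when cycling
        want = metapath[i % (len(metapath) - 1)]
        candidates = [n for n in neighbors[walk[-1]] if node_type[n] == want]
        if not candidates:
            break  # dead end for this metapath
        walk.append(rng.choice(candidates))
    return walk

print(metapath_walk("r/SW", METAPATH, 9))
```

Because each walk depends only on the (static) adjacency structure, walks like these are trivially independent, which is what makes the multiprocessing parallelization mentioned above possible.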
Sampled nodes using metapath
subreddit → submission → author → submission → subreddit
● Goal: extract similarities between subreddits via the authors who post in them
Walked from each node 1000 times for a length of 100 steps
Embedded subreddit and author nodes appearing ≥5 times in 128 dimensions
using a neighborhood size of 7 and a negative sampling rate of 5
Result: embedding vectors for 1.8M subreddits and authors
≠ 10M due to minimum appearance thresholds and skipping submission/comment nodes
Total sampling and processing time < 1 day; mp2v model file size = 0.9 GB
MP2V Sampling and Embedding
Similarity between SuicideWatch and ...
Depression: 0.83 Advice: 0.74
Anxiety: 0.82 socialanxiety: 0.74
Mentalhealth: 0.75 selfharm: 0.73
AskDocs: 0.75 Needafriend: 0.72
MMFB: 0.74 StopSelfHarm: 0.72
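Scores like these are cosine similarities between nodes' embedding vectors. A minimal numpy sketch of the ranking step, with toy 4-dimensional vectors and invented values standing in for the real 128-dimensional embeddings:

```python
import numpy as np

# Toy embedding table (4-d for readability; the deck uses 128-d).
# Vector values are invented for illustration.
emb = {
    "SuicideWatch": np.array([0.9, 0.1, 0.0, 0.2]),
    "depression":   np.array([0.8, 0.2, 0.1, 0.2]),
    "Anxiety":      np.array([0.7, 0.3, 0.0, 0.1]),
    "funny":        np.array([0.0, 0.1, 0.9, 0.4]),
}

def cosine(u, v):
    """Cosine similarity: dot product of the unit-normalized vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

query = emb["SuicideWatch"]
ranked = sorted(
    ((name, cosine(query, vec)) for name, vec in emb.items()
     if name != "SuicideWatch"),
    key=lambda kv: -kv[1],
)
for name, sim in ranked:
    print(f"{name}: {sim:.2f}")
```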
doc2vec document embedding
How similar is the language in authors’ submissions
to language that’s common in different subreddits?
Background: DBOW & DM doc2vec Models
Random subreddit submissions
x̄ (sd): DBOW = 0.45 (0.15); DM = 0.40 (0.09)
SuicideWatch submissions
x̄ (sd): DBOW = 0.77 (0.07); DM = 0.49 (0.06)
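The x̄ (sd) figures above summarize each submission vector's cosine similarity to a subreddit's vector. A numpy sketch of that summary, using synthetic random vectors as stand-ins for trained DBOW/DM output (the dimensions, counts, and noise scale are invented):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for doc2vec output: one subreddit vector and 200 submission
# vectors per group. Real vectors would come from a trained model.
subreddit_vec = rng.normal(size=128)
docs_on_topic = subreddit_vec + 0.5 * rng.normal(size=(200, 128))  # SW-like
docs_random = rng.normal(size=(200, 128))                          # random

def cos_to(target, docs):
    """Cosine similarity of each row of `docs` to `target`."""
    d = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    t = target / np.linalg.norm(target)
    return d @ t

for label, docs in [("on-topic", docs_on_topic), ("random", docs_random)]:
    sims = cos_to(subreddit_vec, docs)
    print(f"{label}: mean={sims.mean():.2f} sd={sims.std():.2f}")
```

As on the slide, on-topic submissions sit much closer to the subreddit vector (higher mean, lower spread) than random ones.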
doc2vec Similarities: SW’s 15 nearest neighbors
DMM similarities to SuicideWatch:
depression, 0.99
depressed, 0.98
depression_help, 0.98
Suicide_help, 0.97
getting_over_it, 0.96
Prevent_Suicide, 0.96
mentalhealth, 0.96
sad, 0.96
MMFB, 0.95
SanctionedSuicide, 0.95
venting, 0.95
suicidenotes, 0.95
mentalillness, 0.95
ptsd, 0.94
BPD, 0.94
DBOW similarities to SuicideWatch:
depression, 0.96
MMFB, 0.93
depression_help, 0.92
whatsbotheringyou, 0.92
depressed, 0.92
Suicide_help, 0.91
sad, 0.91
suicidenotes, 0.91
offmychest, 0.91
SanctionedSuicide, 0.90
getting_over_it, 0.90
venting, 0.90
mentalhealth, 0.89
selfhelp, 0.89
Vent, 0.89
Similarities: doc2vec vs metapath2vec
Correlation of embedding distances to SW
       MP2V  DBOW  DM    D2V_x̄
MP2V   1.00  0.23  0.15  0.22
DBOW   0.23  1.00  0.59  0.93
DM     0.15  0.59  1.00  0.83
D2V_x̄  0.22  0.93  0.83  1.00
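A matrix like this correlates, across subreddits, each method's per-subreddit distance to SW. A numpy sketch with synthetic distance vectors, constructed so the language-based measures share structure while the graph-based one is mostly independent (all values and the sample size are invented):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500  # hypothetical number of subreddits compared to SuicideWatch

# Synthetic stand-ins for three per-subreddit distance-to-SW vectors.
latent = rng.normal(size=n)
dbow = latent + 0.5 * rng.normal(size=n)   # shares structure with DM
dm = latent + 0.8 * rng.normal(size=n)
mp2v = 0.2 * latent + rng.normal(size=n)   # weakly related, like the slide

corr = np.corrcoef([mp2v, dbow, dm])  # 3x3 Pearson correlation matrix
print(np.round(corr, 2))
```

The weak MP2V-vs-doc2vec correlation on the slide is what motivates combining the two: each carries information the other lacks.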
Prediction Task:
Will an author post in SuicideWatch?
Model trained/tested with an 80/20 split (n=8610/2153)
= subsamples of the embedding data to balance training
● Training/test split of SW authors = 4060/1015
● Covariates = 128 MP2V embedding positions
Testing accuracy = 69%
Quite a few false-positives (25%) and false-negatives (38%)
Overall: not bad for only including unsupervised positional
data based on network connections
Logistic Regression: MP2V only
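The deck cites scikit-learn (Pedregosa et al. 2011), so a minimal sketch of the MP2V-only task looks like the following. The synthetic 128-dimensional "embedding positions", class sizes, and separability are invented for illustration; only the 80/20 split and the 128-covariate design follow the slide:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)

# Synthetic stand-in for 128-d MP2V author embeddings: SW authors are
# shifted along a few dimensions so the classes are partly separable.
n_per_class, dim = 1000, 128
shift = np.zeros(dim)
shift[:8] = 1.0
X = np.vstack([rng.normal(size=(n_per_class, dim)) + shift,  # SW authors
               rng.normal(size=(n_per_class, dim))])         # non-SW authors
y = np.array([1] * n_per_class + [0] * n_per_class)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
acc = clf.score(X_te, y_te)
print(f"test accuracy: {acc:.2f}")
```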
Logistic Regression: D2V only
Model trained/tested with an 80/20 split (n=8610/2153)
= subsamples of the embedding data to balance training
● Training/test split of SW authors = 4060/1015
● Covariates = DBOW and DM distances to SW
Testing accuracy = 76%
Fewer false-positives (21%); still many false-negatives (27%)
Overall: surprisingly good results for only two covariates
Logistic Regression: MP2V+D2V
Model trained/tested with a 3/97 split (n=9,865/283,638)
= subsampled training data to balance training
● Training/test split of SW authors = 4073/1002
● Covariates = 128 MP2V embedding positions + DBOW
and DMM distances to SW
Testing accuracy = 90%
Few false-positives (10%) and false-negatives (12%)
∴ graph & document embeddings work very well together
● Users’ behavior and language are both important,
especially for reducing false-positives/false-negatives
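Combining the two feature sets amounts to column-wise concatenation of the 128 MP2V positions with the two doc2vec distance covariates. A small numpy sketch with hypothetical shapes and random placeholder values:

```python
import numpy as np

rng = np.random.default_rng(3)
n_authors = 100  # hypothetical sample size

mp2v = rng.normal(size=(n_authors, 128))       # graph-embedding positions
dbow_dist = rng.uniform(size=(n_authors, 1))   # DBOW distance to SW
dm_dist = rng.uniform(size=(n_authors, 1))     # DM distance to SW

# Combined design matrix: 128 + 2 = 130 covariates per author.
X = np.hstack([mp2v, dbow_dist, dm_dist])
print(X.shape)  # (100, 130)
```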
[Figure: confusion matrices for the models above; ŷ = SW author vs. ŷ = not SW author]
Next Steps
● Better compare graph/document embeddings
○ Predict membership in other subreddits (e.g., depression, Anxiety, stopdrinking)
○ Determine where/when one type of embedding helps more than the other
● Reveal differences in membership among similar subreddits
○ For example, between alcoholicsanonymous, AlAnon, cripplingalcoholism, stopdrinking, addiction
● Use embeddings to predict the presentation of psychiatric attributes in posts
○ Use multi-label neural networks to predict suicidality, depression, anxiety, substance abuse, etc.
● Discover users’ “emotional arcs” before/after posting in SW (Reagan et al. 2016)
○ How do users’ paths to posting in SW differ, and how do paths leaving SW differ over time?
● Test social influence, social contagion, and other social dynamics
○ Use DeepInf to analyze and visualize neighbors’ social influence over time (Qiu et al. 2018)
Questions/Comments?
Special thanks to Drs. Jennifer Ruch, Michael Macy, David Mimno, Lillian Lee, Christopher Bail, and Thomas Gilovich for
feedback and support on parts of this project. Thanks as well to Seunghyun Kim, Lillyan Pan, Hannah Lee, Helen Sun,
James Zou, Gary Zhuge, Jeffrey Tsang, Bryan Min, Juliana Hong, Yejeong Choi, Cornell’s Social Dynamics Laboratory,
Cornell’s Computational Social Science Reading Group, Duke’s NAC, NSF, and NIH for assistance and funding support.
Citations
Bail (2016) “Combining natural language processing and network analysis…”
Danescu-Niculescu-Mizil et al. (2013) “No Country for Old Members”
Dong et al. (2017) “metapath2vec: Scalable Representation Learning for Heterogeneous Networks”
Goyal and Ferrara (2017) “Graph Embedding Techniques, Applications, and Performance: A Survey”
Le and Mikolov (2014) “Distributed Representations of Sentences and Documents”
Mikolov et al. (2013) “Distributed Representations of Words and Phrases and their Compositionality”
Pedregosa et al. (2011) “Scikit-learn: Machine Learning in Python”
Peixoto (2014) “The graph-tool python library” (graph-tool.skewed.de/)
Reagan et al. (2016) “The emotional arcs of stories are dominated by six basic shapes”
Qiu et al. (2018) “DeepInf: Social Influence Prediction with Deep Learning”
metapath2vec code
metapath2vec original code: https://ericdongyx.github.io/metapath2vec/m2v.html
● This repository contains Dong et al.’s scripts to sample and embed graphs
stellargraph: https://github.com/stellargraph/stellargraph
● Please note that stellargraph runs on networkx, which is extremely memory
inefficient and slow compared to graph-tool
● stellargraph works well with small/moderate graphs, but you should use Dong
et al.’s original code for large/massive graphs (especially for sampling)
Sampling Process
SuicideWatch (main sample)
24,281 (n ≥ 20): get SW submission authors
20% sample → 4,948 (= distinct authors)
777,243 (= 94 subm/auth): get SW authors’ submissions
20% sample → 155,646 (= 31 subm/auth)
9,611,359 (= 1942 coms-subms/auth): get SW authors’ comments & com-subm info
20% sample → 1,415,357 (= 286 distinct coms-subms/auth)
447,579,856: get all comments to all submissions
2,109,393: get SW authors’ info
1% sample → 4,475,200: get non-SW authors info
6,584,593 (= 2,109,393 + 4,475,200): final sample count
Mental Health
7,035,904 (= 2,449,045 + 4,586,859): from 20 MH subreddit seeds
Selfhelp
7,576,231 (= 2,621,663 + 4,954,568): from 10 SH subreddit seeds
Random
5,880,113 (= 1,791,641 + 4,088,472): from a simple random sample of 5000 distinct user seeds
Metapath Examples and Complexity
# author to subr (via subm or comm)
["author", "submission", "subreddit", "submission", "author"],  # subm to same subr
["author", "comment", "submission", "subreddit", "submission", "comment", "author"],  # comm to same subr
# submission to submission (via subr or auth)
["submission", "subreddit", "submission"],  # subm to same subr
["submission", "author", "submission"],  # subm by same auth
# comment to comment (via subm or auth)
["comment", "submission", "comment"],  # comm to same subm
["comment", "author", "comment"],  # comm by same auth
# subreddit to subreddit
["subreddit", "submission", "author", "submission", "subreddit"],  # subr by same auth via subm
["subreddit", "submission", "comment", "author", "comment", "submission", "subreddit"]  # subr by same auth via comm
Estimated complexity:
SBM: O((V ln² V + E) × MCMC_sweeps) → SBM(V=10M, E=45M, sweeps=15) ≈ 40B (1 sweep ≈ 2.7B)
HSBM: O((V ln² V + E × blocks) × MCMC_sweeps) → HSBM(V=10M, E=45M, blocks=20, sweeps=15) ≈ 52B
MP2V: O((V_seeds × walks × walk_length) + ((V_sampled − mp2v_window) × mp2v_iter)) → MP2V(V=10M, E=45M, iter=15) ≈ 2B
∴ MP2V is ~8 times more computationally efficient
Same graph: SFDP vs MP2V
metapath2vec vs metapath2vec++ Results
Clinical Diagnostic Criteria: Risk Factor Keywords