Predictive Linguistic Features of Schizophrenia

Predictive Linguistic Features
of Schizophrenia
Efsun Sarioglu Kayi1
, Mona Diab1
, Luca Pauselli2
, Michael Compton2
, Glen Coppersmith3
1Computer Science Department, George Washington University
2 Columbia University Medical Center
3Qntfy

Outline
• Background
• Motivation
• Related Work
• Dataset
• Methodology
• Experiments & Results
• Conclusion

Background
• Schizophrenia is one of the most disabling and difficult to treat of all human
medical/health conditions
• Ranking in the top ten causes of disability worldwide
• Has been studied extensively on the neurological and behavioral levels
• Typically diagnosed on the basis of clinical observation
• Difficulty in identifying its basic, fundamental components
• Need for objective measures
• Some manifestations of schizophrenia can be understood from the perspective of
linguistics
• Negative symptoms, e.g. blunting of speech prosody
• Disorganization symptoms that lead to disordered language
3

Motivation
• When comparing patients and controls, many language abnormalities
have been discovered (Covington et al., 2005)
• reduction in syntax complexity
• impaired semantics, such as the organization of individual propositions into
larger structures
• abnormalities in pragmatics, i.e., relationship between language and context
• Compute many syntactic, semantic and pragmatic features from
patients' writings/tweets and examine their success in predicting
schizophrenia
4

Related Work
• Latent Semantic Analysis (LSA) for lexical coherence (Elvevag et al., 2007)
• LSA with phrase level syntactic measures such as POS tags (Bedi et al., 2015)
• LDA analysis of doctor-patient communication in therapy (Howes et al., 2013)
• Analysis of Twitter dataset (Mitchell et al., 2015)
• LDA, Brown Clustering, and Linguistic Inquiry Word Count (LIWC) for
sentiment analysis
5

Datasets
• Writing Samples (LabWriting)
• 93 schizophrenia patients and 95 controls
• English speaking, aged 18-50
• Total of 373 writings
• Twitter
• Self-reported diagnoses
• 174 users and 174 age&gender matched controls
• ~2800 tweets
6

LabWriting Dataset
• Two paragraph length essays
• Average Sunday
• What makes them angriest
• IRB approved processes
7

Twitter Dataset
• Schizophrenia users selected via regex on ‘schizo’
• ‘skitzo', ‘schizoid’, ‘skitso’, ‘schizotypal’
• Each verified genuine by a human annotator
• Excluding jokes, quotes, disingenuous statements
• Sample tweets (rephrased to preserve anonymity)
• “this is my first time being unemployed. please forgive me. i’m crazy. #schizophrenia”
• “i’m in my late 50s. i worry if i have much time left as they say people with #schizophrenia die 15-
20 years younger”
• “#schizophrenia takes me to devil-like places in my mind”
8

Feature Categories
9
Category LabWriting Twitter
Syntactic
POS
Dependency Parse
POS
Dependency Parse
Semantic
SRL
Topics
Clusters
Topics
Clusters
Pragmatic
Sentiment
Sentiment Intensity
Sentiment

Syntactic Features
• POS tags
• Dependency Parse labels
• Tools
• Stanford CoreNLP for LabWriting dataset
• CMU Tweet NLP for Twitter dataset
• Coarse POS tags (25 vs 36 of Penn Treebank)
• Tweebo Parser trained on Tweebank
• Many elements have no syntactic function, i.e., hashtags, URLs, and emoticons
10

Semantic Features
• Semantic Role Labeling (SRL)
• Shallow parsing technique where predicate or verb of a sentence is associated with semantic
roles such as personal relationship, social event, kinship and so on
• SEMAFOR (Semantic Analyzer of Frame Representations) tool (Das et al., 2010)
• Topics
• Mallet (McCallum 2002)
• Number of topics chosen on validation set
• k=20 for LabWriting & k=40 for Twitter dataset
• Dense clusters based on global word vectors (GloVe-Pennington et al., 2014)
• Wikipedia 2014 dump and GigaWord 5 for LabWriting dataset
• 6B tokens, 400K vocabulary, 300d vectors
• Twitter model used for Twitter dataset
• 2B tweets, 27B tokens, 1.2M vocabulary, 200d vectors
• k-means clustering
• k=100 selected on validation set
• Bag-of-Clusters
11

Pragmatic Features
• Sentiment and sentiment intensity
• Stanford Sentiment Tool for LabWriting dataset
• {very positive, positive, neutral, negative, very negative} at sentence level
• Intensity at phrase level
• Columbia Sentiment Tool for Twitter dataset
• {positive, neutral, negative}
• Trained on social media data
12

Methodology
• Binary classification between schizophrenia patients and controls
using SVM and Random Forest
• 10-fold stratified cross-validation
• Report ROC plots, AUC, Precision, Recall, F-score
• Majority baseline F-score
• LabWriting: 34.39
• Twitter: 32.11

LabWriting ROC Plots
14
POS Parse SRL Topics
Clusters Sentiment Sentiment Intensity

LabWriting Dataset Classification Results
15

Twitter ROC Plots
16
POS Parse Topics
Clusters Sentiment

Twitter Dataset Classification Results
17

Feature Selection
• Information Gain (IG)
• Measures change in entropy
• 𝐻 𝑋 = − ∑ 𝑃 𝑥𝑖 log ( 𝑃(𝑥𝑖)/
012
• Used in random forest attribute selection
• SVM Recursive Feature Elimination (RFE) (Guyon et al., 2002)
• Bigger weights |w| or w^2 correspond to more informative features
18

Top Syntactic Features
• Top POS tags
• FW (Foreign Word) —>misspellings, missing punctuation, abbreviations
• ‘comptenplate’, ‘tha’, ‘angerest’, ‘yo’, ‘havent’, ‘dont’, ‘u’, ‘i’, ‘chosed’, ‘beleive’, ‘awakin’
• LS (List Item Marker) —>’i’ (pronoun I)
• Similar conclusion for depression (Rude et al., 2004)
• RP (Adverbial particle)
• ‘up’, ‘down’, ‘out’, ‘on’, ‘off’
• # (hash tags) for Twitter dataset
• Patients use less hashtags than controls
• Top Dependency Parse tags
• advmod (adverb modifier)
19

Discriminative SRLs
20
Label Sample Words/Phrases
Quantity several, both, all, a lot, many, a little, a few, lots
Social Event social, dinner, picnics, hosting, dance
Morality Evaluation wrong, depraved, foul, evil, moral
Catastrophe incident,tragedy, suffer
Manipulate into Doing harassing, bullying
Being Obligated duty, job, have to, had to, should, must, responsibility, entangled, task

Discriminative Topics
21
Method Dataset Top Words
IG Writing church sunday wake god service pray sing worship bible spending thanking
IG & RFE Writing i’m can’t trust upset person lie real feel honest lied lies judge lying steal
IG & RFE Twitter god jesus mental schizophrenic schizophrenia illness paranoid medical evil
IG & RFE Twitter don love people fuck life feel fucking hate shit stop god person sleep bad die

Top Pragmatic Features
• Intensity performs better than sentiment (LabWriting only)
• Top sentiment features
• neutral, negative, very negative (LabWriting only)
• Top sentiment intensity (LabWriting only)
• neutral intensity, negative intensity, very negative intensity
22

Effect of Dataset’s Characteristics
• Social media specific tools for Twitter dataset
• Informal, lacks proper grammar, includes many abbreviations
• Some tools unsuccessful on Twitter such as SRL
• Less comprehensive output for Twitter (smaller tag sets)
• Machine learning approaches less successful
• LabWriting dataset
• Spelling errors not corrected --> could be a symptom of SZ
• Some machine learning techniques do not perform well
23

Conclusion
• For LabWriting dataset, the best performing features are syntactic based:
• According to F-Score
• POS+Parse (Syntax) for SVM
• Syntax + Pragmatics for Random Forest
• According to AUC
• POS for both classifiers
• For Twitter dataset, the best performing features are combination of features:
• According to both F-Score and AUC
• All features for SVM (Syntax + Semantics + Pragmatics)
• Semantics + Pragmatics for Random Forest
• Computational assessment models of schizophrenia
• Provide ways for clinicians to monitor symptoms more effectively
• Deeper understanding of schizophrenia could benefit affected individuals, families, and society at
large
24

References
1. Michael A Covington, Congzhou He, Cati Brown, Lorina Naci, Jonathan T McClain, Bess Sirmon Fjordbak, James Semple, and
John Brown. 2005. Schizophrenia and the Structure of Language: The Linguist’s View. Schizophr Res 77(1):85–98.
2. Brita Elvevag, Peter W Foltz, Daniel R Weinberger, and Terry E Goldberg. 2007. Quantifying Incoherence in Speech: An
Automated Methodology and Novel Application to Schizophrenia. Schizophr Res 93(1-3):304–316.
3. Gillinder Bedi, Facundo Carrillo, Guillermo A Cecchi, Diego Fernandez Slezak, Mariano Sigman, Natalia B Mota, Sidarta Ribeiro,
Daniel C Javitt, Mauro Copelli, and Cheryl M Corcoran. 2015. Automated Analysis of Free Speech Predicts Psychosis Onset in
High-Risk Youths. Npj Schizophrenia 1.
4. Christine Howes, Matthew Purver, and Rose Mc- Cabe. 2013. Using Conversation Topics for Predicting Therapy Outcomes in
Schizophrenia. Biomed Inform Insights 6(Suppl 1):39–50.
5. Margaret Mitchell, Kristy Hollingshead, and Glen Coppersmith. 2015. Quantifying the Language of Schizophrenia in Social
Media. In Proceedings of the 2nd Workshop on Computational Linguistics and Clinical Psychology: From Linguistic Signal to
Clinical Reality. pages 11–20.
6. Dipanjan Das, Nathan Schneider, Desai Chen, and Noah A. Smith. 2010. Probabilistic Frame-semantic Parsing. In Human
Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational
Linguistics. pages 948–956.
7. Andrew Kachites McCallum. 2002. MALLET: A Machine Learning for Language Toolkit. Http://mallet.cs.umass.edu.
8. Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global Vectors for Word Representation. In
Empirical Methods in Natural Language Processing (EMNLP). pages 1532– 1543.
9. Isabelle Guyon, Jason Weston, Stephen Barnhill, and Vladimir Vapnik. 2002. Gene Selection for Cancer Classification Using
Support Vector Machines. Mach. Learn. 46(1-3):389–422.
10.Stephanie Rude, Eva-Maria Gortner, and James Pennebaker. 2004. Language Use of Depressed and Depression-Vulnerable
College Students. Cognition and Emotion 18(8):1121–1133.

Predictive Linguistic Features of Schizophrenia

Recommended

Recommended

More Related Content

Similar to Predictive Linguistic Features of Schizophrenia

Similar to Predictive Linguistic Features of Schizophrenia (20)

More from Efsun Kayi

More from Efsun Kayi (6)

Recently uploaded

Recently uploaded (20)

Predictive Linguistic Features of Schizophrenia