
Alz Hack II

A progress update on the AlzHack project: an attempt to distinguish dementia sufferers from their partners via text analytics and machine learning.

Published in: Science

  1. 1. AlzHack Data Driven Diagnosis of Alzheimer's Disease Frank Kelly
  2. 2. Goal definition
  3. 3. Our goal: Diagnose Alzheimer’s disease as early as possible ● Benefit to millions of people (potentially)
  4. 4. Why is Alzheimer’s disease diagnosis important? ● Chronic neurodegenerative disease ● 60-70% of dementia cases = Alzheimer's ● 48 million people affected worldwide (2015) ● Wrecks people’s lives (+ their families’) ● 800,000 people (in the UK) formally diagnosed ● Only 43% of those with the condition get a diagnosis Figures: Wikipedia & http://www.bbc.co.uk/science/0/21878238
  5. 5. Demographic changes mean it will be more widespread Chart credit: economist.com By 2050 the number of dementia sufferers is expected to triple A global, mounting problem
  6. 6. How is Alzheimer’s disease diagnosed today? ● Medical history ● Mental status tests ● Physical and neurological examination ● Blood tests and brain imaging Example test sheet: http://www.ftdrg.org/wp-content/uploads/4a-CCT_revised-Picture-stimulus.pdf
  7. 7. A gradual decline [Timeline figure: -20 years, -15 years, -10 years, -5 years, Death; stages: Earliest Alzheimer’s, Mild to moderate, Severe; common diagnosis period marked]
  8. 8. Who are we? Full bios: https://alzhack.wordpress.com What is our approach? We’re doing citizen science ● No lab, or lab coats ● Readily available data ● Other people’s research
  9. 9. Diagnose Alzheimer’s disease as early as possible Why? Participate in clinical drug trials Benefit from treatment More time to plan Take own decisions Better carer relationship Reduce anxieties about unknowns Sketch: http://www.businessfinancenews.com/28526-will-astrazeneca-plc-and-eli-lilly-give-breakthrough-in-alzheimers/
  10. 10. Design of Study & Data Collection
  11. 11. How the disease manifests itself Protein plaques and tangles accumulate in the brain: ● Disrupting communication between nerve cells ● Killing nerve cells ● Causing loss of brain tissue Facts: https://www.alzheimers.org.uk/site/scripts/documents_info.php?documentID=100 Imagery: www.alz.org
  12. 12. How the disease manifests itself (1) Starts in the hippocampus Harder to form new memories Difficult to recollect from days or hours ago Video: https://www.youtube.com/watch?v=Eq_Er-tqPsA
  13. 13. How the disease manifests itself (2) ...then takes root in other areas 2. Language processing 3. Logical thought 4. Emotions 5. Senses 6. Older memories 7. Balance and coordination Video: https://www.youtube.com/watch?v=Eq_Er-tqPsA
  14. 14. Relevant symptoms ● Confusion with time/place ● Spatial memory ● Problems with words ● Misplacing items ● Decreased / poor judgment ● Withdrawal from work ● Mood change ● Difficulty with familiar tasks ● Challenges in planning ● Speech ● Short term memory loss [Timeline figure: -20 years, -15 years, -10 years, -5 years, Death; stages: Earliest Alzheimer’s, Mild to moderate, Severe]
  15. 15. Previously...
  16. 16. Previously: Analysis of a single user’s emails ● An Alzheimer’s disease sufferer’s emails over 4 years ● Conversion of email text to vectors ● Counts, lengths and other metrics Features Memory, language and sentiment related metrics extracted
  17. 17. Results Some “explainable” trends Challenges Single user: lack of data and likely bias Scaling up: security concerns & deletion
  18. 18. How did we get more data?
  19. 19. Forum post scraping First lxml, then BeautifulSoup ● Two sub-forums ● ~3,600 threads ● ~78,000 posts ○ Post content ○ Post metadata ○ User metadata
  20. 20. Data preparation Content punctuation sanitised by regexp substitutions. Sub forum post data (x2)
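The slide does not list the actual substitutions, so the following is only a minimal sketch of what regexp-based punctuation sanitising could look like. The function name `sanitise` and every rule in it are illustrative assumptions, not the project's real rules.

```python
import re


def sanitise(text: str) -> str:
    """Normalise punctuation in a forum post (hypothetical rules)."""
    # Map curly quotes to their plain ASCII equivalents.
    text = re.sub(r"[\u2018\u2019]", "'", text)
    text = re.sub(r"[\u201c\u201d]", '"', text)
    # Turn runs of dots ("...") into a single full stop plus a space.
    text = re.sub(r"\.{2,}", ". ", text)
    # Collapse repeated "!" / "?" into the last character of the run.
    text = re.sub(r"([!?]){2,}", r"\1", text)
    # Collapse any whitespace run into one space.
    text = re.sub(r"\s+", " ", text)
    return text.strip()


if __name__ == "__main__":
    print(sanitise("I can\u2019t  find it!!!  Where... is it??"))
```

Keeping each rule as its own `re.sub` call makes it easy to switch individual substitutions on or off when checking their effect on downstream features.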
  21. 21. User labelling
  22. 22. How do we label a user? ● Users frequently post in both sub-forums ● To differentiate: ○ Assume that OPs (thread starters) in a sub-forum are of that category ○ Otherwise look at ratio of posts (replies) between the two sub forums* FP = First Post in thread SP = Subsequent Post in thread
  23. 23. How do we label a user? Thread Reply Dementia Partner Discard Unknown
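The labelling rule above could be sketched roughly as follows. The slides give neither the reply-ratio cut-off nor the exact conditions for "Discard" and "Unknown", so the `ratio_threshold` value, the mapping of ambiguous cases to those two labels, and the function name `label_user` are all assumptions made for illustration.

```python
from collections import Counter


def label_user(posts, ratio_threshold=0.75):
    """Label one user from their posts.

    posts: list of (subforum, is_first_post) tuples, where subforum is
    "dementia" or "partner" and is_first_post marks a thread starter (FP).
    ratio_threshold is a hypothetical cut-off, not taken from the slides.
    """
    started = Counter(sf for sf, is_fp in posts if is_fp)
    replies = Counter(sf for sf, is_fp in posts if not is_fp)

    # Rule 1: a thread starter is assumed to belong to that sub-forum.
    if started:
        if len(started) == 1:
            return next(iter(started))
        # Assumption: users who start threads in both sub-forums are dropped.
        return "discard"

    # Rule 2: otherwise fall back to the ratio of replies (SP).
    total = sum(replies.values())
    if total == 0:
        return "unknown"
    for subforum in ("dementia", "partner"):
        if replies[subforum] / total >= ratio_threshold:
            return subforum
    return "unknown"


if __name__ == "__main__":
    print(label_user([("dementia", True), ("partner", False)]))
```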
  24. 24. Features and EDA
  25. 25. ‘Mood change’ as a feature: Sentiment “polarity” (out-of-the-box via NLTK & TextBlob) ● Alternatively, you can train your own text classifier: http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier/
  26. 26. ● Average of sentence sentiments per post ● Slightly higher sentiment for dementia sufferers’ posts
  27. 27. Language-oriented features Lexical functions Comprehension functions Empty phrases Paraphasias and neologisms Vocabulary-related Readability “Go ahead” phrases Unintended or invented words Difficult words count Dale-Chall readability Flesch Kincaid Flesch Reading Ease Counts of “ummm...errr” Words that are not in common usage
  28. 28. Simple language features ● Sentence count ● Word count ● Words per sentence ● Unique word count ● Unique words to total ratio ● “Go Ahead” words (Empty phrases)
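A minimal, standard-library sketch of the counts listed above. The sentence and word splitting here is deliberately naive, and the filler pattern used for "Go Ahead" words is a guess based on the "ummm...errr" example from the earlier slide, not the project's actual word list.

```python
import re

# Hypothetical filler pattern: "um", "ummm", "er", "errr", "erm", ...
FILLER_RE = re.compile(r"\b(?:u+m+|e+r+m*)\b")


def simple_features(post: str) -> dict:
    """Compute basic per-post language features."""
    text = post.lower()
    # Naive sentence split on terminal punctuation.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[a-z']+", text)
    unique = set(words)
    return {
        "sentence_count": len(sentences),
        "word_count": len(words),
        "words_per_sentence": len(words) / len(sentences) if sentences else 0.0,
        "unique_word_count": len(unique),
        "unique_ratio": len(unique) / len(words) if words else 0.0,
        "filler_count": len(FILLER_RE.findall(text)),
    }


if __name__ == "__main__":
    print(simple_features("I went out. I went home! Where is it?"))
```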
  29. 29. Readability (package readability-lxml) ● Avg syllables per word ● Avg letters per word ● Flesch reading ease ● Flesch-Kincaid grade ● Polysyllable count ● Automated readability index ● Number of “difficult” words ● Dale-Chall readability score ● Gunning fog
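To make two of these scores concrete, here is a small sketch that computes Flesch reading ease and the Flesch-Kincaid grade from their standard formulas. It does not use the package named on the slide; the syllable counter is a rough vowel-group heuristic and the function names are illustrative, so values will differ somewhat from a dedicated library.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic: count vowel groups, dropping a trailing silent 'e'."""
    word = word.lower()
    n = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and n > 1:
        n -= 1
    return max(n, 1)


def readability(text: str):
    """Return Flesch scores for a text, or None if it has no words."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    if not sentences or not words:
        return None
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = sum(count_syllables(w) for w in words) / len(words)
    return {
        "avg_syllables_per_word": syllables_per_word,
        # Flesch reading ease: higher means easier to read.
        "flesch_reading_ease": 206.835
        - 1.015 * words_per_sentence
        - 84.6 * syllables_per_word,
        # Flesch-Kincaid grade: approximate US school grade level.
        "flesch_kincaid_grade": 0.39 * words_per_sentence
        + 11.8 * syllables_per_word
        - 15.59,
    }


if __name__ == "__main__":
    print(readability("The cat sat."))
```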
  30. 30. Vocabulary & word counts
  31. 31. Memory-oriented features ● Sort posts by username and timestamp, add a shifted column
  32. 32. Apply comparison function between post and previous post: ○ NLTK edit_distance (fuzzy match) ○ Cosine similarity between TF-IDF vectors
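The "compare each post with the user's previous post" step can be sketched without any third-party library. This version covers only the cosine-similarity variant; it replaces the pandas shifted column with a plain sorted list, and it uses a simple smoothed IDF. All function names and the tuple layout are assumptions for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, using a smoothed IDF."""
    counts = [Counter(tokenize(d)) for d in docs]
    n = len(docs)
    df = Counter(term for c in counts for term in c)
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in df}
    return [{t: tf * idf[t] for t, tf in c.items()} for c in counts]


def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def similarity_to_previous(posts):
    """posts: list of (username, timestamp, text) tuples.

    Returns (username, timestamp, similarity) sorted by user and time;
    similarity is None for each user's first post.
    """
    ordered = sorted(posts, key=lambda p: (p[0], p[1]))
    vecs = tfidf_vectors([p[2] for p in ordered])
    out = []
    for i, (user, ts, _) in enumerate(ordered):
        has_previous = i > 0 and ordered[i - 1][0] == user
        sim = cosine(vecs[i], vecs[i - 1]) if has_previous else None
        out.append((user, ts, sim))
    return out


if __name__ == "__main__":
    demo = [("ann", 2, "where are my keys"), ("ann", 1, "where are my keys")]
    print(similarity_to_previous(demo))
```

A high similarity to the previous post would indicate a user repeating themselves, which is the memory-related signal this feature is meant to capture.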
  33. 33. Part of speech (POS) features ● Tag words and tally up frequencies ● Calculate “rates”
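As a rough sketch of "tally up frequencies" and "calculate rates": the function below takes tokens that have already been tagged, as a list of (word, tag) pairs with Penn Treebank tags (the format a tagger such as NLTK's `pos_tag` returns), and divides each tally by the token count. The coarse grouping by tag prefix and the name `pos_rates` are assumptions; the slides do not say which tags were grouped together.

```python
from collections import Counter

# Coarse groups keyed by Penn Treebank tag prefix (illustrative choice).
TAG_GROUPS = {
    "noun": "NN",
    "verb": "VB",
    "adjective": "JJ",
    "adverb": "RB",
    "pronoun": "PRP",
}


def pos_rates(tagged):
    """tagged: list of (word, tag) pairs. Returns e.g. {"verb_rate": 0.25}."""
    total = len(tagged)
    if total == 0:
        return {}
    tally = Counter()
    for _, tag in tagged:
        for name, prefix in TAG_GROUPS.items():
            if tag.startswith(prefix):
                tally[name] += 1
                break
    # Rate = count of a group divided by the number of tagged tokens.
    return {f"{name}_rate": tally[name] / total for name in TAG_GROUPS}


if __name__ == "__main__":
    sample = [("I", "PRP"), ("lost", "VBD"), ("my", "PRP$"), ("keys", "NNS")]
    print(pos_rates(sample))
```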
  34. 34. Models & results
  35. 35. Explanatory or predictive modelling? ● Actually both. ● First, ‘interpret’ a classifier (explanatory) ● Second, we need a ‘real-time’ detection system (predictive)
  36. 36. Data modelling strategy (used for initial ML runs) Aggregation of posts ● pandas: groupby, agg by username Balancing out the dataset ● Many more partner users than sufferers ● Subsample larger (partner) dataset to even things up Validate using random train and test sets ● Randomly select 80% of users for training, 20% test
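The three steps above (aggregate per user, balance the classes, split users 80/20) can be sketched in plain Python. The original used pandas `groupby`/`agg`; this stand-in uses the standard library and a single feature per post to stay short. Function names, the tuple layout, and the fixed random seed are assumptions.

```python
import random
from collections import defaultdict
from statistics import median


def aggregate_by_user(rows):
    """rows: (username, label, feature_value) per post.

    Returns {username: (label, median_feature_value)}.
    """
    values = defaultdict(list)
    labels = {}
    for user, label, value in rows:
        values[user].append(value)
        labels[user] = label
    return {u: (labels[u], median(v)) for u, v in values.items()}


def balance_and_split(users, train_frac=0.8, seed=0):
    """Subsample every class to the size of the smallest one, then
    randomly split the kept users into train and test lists."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for user, (label, _) in users.items():
        by_label[label].append(user)
    n = min(len(names) for names in by_label.values())
    kept = [u for names in by_label.values() for u in rng.sample(sorted(names), n)]
    rng.shuffle(kept)
    cut = int(round(train_frac * len(kept)))
    return kept[:cut], kept[cut:]


if __name__ == "__main__":
    rows = [(f"p{i}", "partner", 1.0) for i in range(10)]
    rows += [(f"s{i}", "sufferer", 2.0) for i in range(5)]
    train, test = balance_and_split(aggregate_by_user(rows))
    print(len(train), len(test))
```

Splitting by user rather than by post keeps all of one person's posts on the same side of the train/test boundary.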
  37. 37. Model Results for Misc. Features ● Median values (aggregated over all posts per user) Best: SVM Radial basis function classifier (with grid search) User classification accuracy: 57%
  38. 38. Model Results for Memory Features ● Median values (aggregated over all posts per user) Best: K-nearest neighbours Classifier User classification accuracy: 63%
  39. 39. Model Results for Readability Features ● Median values (aggregated over all posts / user) Best: K-nearest neighbours Classifier User classification accuracy: 59%
  40. 40. Model Results for Part-Of-Speech Features ● Median values (aggregated over all posts per user) Best: SVM Radial basis function classifier (with grid search) User classification accuracy: 61%
  41. 41. Model Results for All Features ● Median values (aggregated over all posts per user) Best: Naïve Bayes Classifier User classification accuracy: 63%
  42. 42. Re-think: Classify posts, not users ● Currently group by userID ● Some users post more than others ● Posts would utilise full “richness” of the dataset ● Double round of sampling required on post set: ○ 3 - 4 times more “partners” than dementia sufferers ○ Partners post approx. 3 times more posts than sufferers do
  43. 43. Model Results for All Features (by post) ● Filtered set of posts Best: Random Forest Classifier Post classification accuracy: 68%
  44. 44. Wrap up
  45. 45. Results in summary ● Best performing feature group so far on aggregated set by user: ○ Memory-based features ● Best performing individual feature on aggregated set by user: ○ Verb rate = ratio of verbs to word count in post ● Best performing individual feature on individual post: ○ Cosine similarity to previous post ● Aligns with symptoms expected in early stage to mild dementia
  46. 46. Future avenues ● Data ○ Further data gathering (more blogs, including blogs on non-Alzheimer's topics) ○ Better user identification (e.g. active learning) ● Features ○ More and better features ○ Distinguish individual types of dementia ○ More memory-related features (e.g. LSI) ● Clustering of posts into ‘topics’ or users into ‘types’ ○ gensim / LDA topic modelling ○ Early stage / medium condition / advanced condition posters ● Classification and modelling ○ Time series analysis ○ New sampling techniques, input validation and models
  47. 47. Future: Time series analysis ● Noisy datasets ○ Apply numerical Bayesian inference ● Are we looking for a steady change in the mean? ○ Ramp detection ● Or a sudden change in variance? ○ Step change detection Dementia sufferer Partner
  48. 48. Conclusions ● Introduction to Alzheimer’s and its impact ● Explanation of our technical approach and surrounding challenges ● Initial observations and predictions ● Tough problem and a worthwhile cause for data science ● Please contact us if you would like to help, or have ideas: frank.kelly@cantab.net https://alzhack.wordpress.com/contribute-2/ Thank you!
