AlzHack
Data Driven Diagnosis of Alzheimer's Disease
Frank Kelly
Goal definition
Our goal:
Diagnose Alzheimer’s disease as early as possible
Benefit to millions of people (potentially)
Why is Alzheimer’s disease diagnosis important?
Chronic neurodegenerative disease
60-70% of dementia cases = Alzheimer's
48 million people affected worldwide (2015)
Wrecks people’s lives (+ their families’)
800,000 people (in the UK) formally diagnosed
Only 43% of those with the condition get a diagnosis
Figures: Wikipedia & http://www.bbc.co.uk/science/0/21878238
A global, mounting problem
Demographic changes mean it will be more widespread
By 2050 the number of dementia sufferers is expected to triple
Chart credit: economist.com
How is Alzheimer’s disease diagnosed today?
Medical history
Mental status tests
Physical and neurological examination
Blood tests and brain imaging
Example test sheet: http://www.ftdrg.org/wp-content/uploads/4a-CCT_revised-Picture-stimulus.pdf
A gradual decline
[Timeline figure: -20 years, -15 years, -10 years, -5 years, Death. Stages: Earliest Alzheimer’s, Mild to moderate, Severe. Marked: Common diagnosis period]
Who are we?
Full bios: https://alzhack.wordpress.com
What is our approach? We’re doing citizen science
● No lab, or lab coats
● Readily available data
● Other people’s research
Diagnose Alzheimer’s disease as early as possible
Why?
Participate in clinical drug trials
Benefit from treatment
More time to plan
Take own decisions
Better carer relationship
Reduce anxieties about unknowns
Sketch: http://www.businessfinancenews.com/28526-will-astrazeneca-plc-and-eli-lilly-give-breakthrough-in-alzheimers/
Design of Study & Data Collection
How the disease manifests itself
Protein plaques and tangles accumulate in the brain. They:
● Disrupt communication between nerve cells
● Kill nerve cells
● Cause loss of brain tissue
Facts: https://www.alzheimers.org.uk/site/scripts/documents_info.php?documentID=100 Imagery: www.alz.org
How the disease manifests itself (1)
Starts in the hippocampus
Harder to form new memories
Difficult to recollect events from days or hours ago
Video: https://www.youtube.com/watch?v=Eq_Er-tqPsA
How the disease manifests itself (2)
...then takes root in other areas
2. Language processing
3. Logical thought
4. Emotions
5. Senses
6. Older memories
7. Balance and coordination
Video: https://www.youtube.com/watch?v=Eq_Er-tqPsA
Relevant symptoms
● Confusion with time/place
● Spatial memory
● Problems with words
● Misplacing items
● Decreased / poor judgment
● Withdrawal from work
● Mood change
● Difficulty with familiar tasks
● Challenges in planning
● Speech
● Short-term memory loss
[Timeline figure: -20 years, -15 years, -10 years, -5 years, Death. Stages: Earliest Alzheimer’s, Mild to moderate, Severe]
Previously...
Previously: Analysis of a single user’s emails
● An Alzheimer’s disease sufferer’s emails over 4 years
● Conversion of email text to vectors
● Counts, lengths and other metrics
Features
Memory, language and sentiment related metrics extracted
Results
Some “explainable” trends
Challenges
Single user: lack of data and likely bias
Scaling up: security concerns & deletion
How did we get more data?
Forum post scraping
First lxml, then BeautifulSoup
● Two sub-forums
● ~3,600 threads
● ~78,000 posts
○ Post content
○ Post metadata
○ User metadata
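A minimal sketch of pulling post content and metadata out of a forum page. The deck used lxml and then BeautifulSoup; this version uses the standard library's html.parser only so that it stays self-contained. The markup it expects (a div with class "post" carrying data-user and data-time attributes) is entirely hypothetical, since the real forum's HTML is not shown in the slides.

```python
from html.parser import HTMLParser


class PostParser(HTMLParser):
    """Collect posts from <div class="post" ...> elements (hypothetical markup)."""

    def __init__(self):
        super().__init__()
        self.posts = []
        self._current = None  # post being read, or None when outside a post
        self._depth = 0       # nesting depth of <div> tags inside the post

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self._current is None:
            classes = (attrs.get("class") or "").split()
            if tag == "div" and "post" in classes:
                self._current = {
                    "user": attrs.get("data-user"),       # assumed attribute
                    "timestamp": attrs.get("data-time"),  # assumed attribute
                    "content": [],
                }
                self._depth = 1
        elif tag == "div":
            self._depth += 1

    def handle_endtag(self, tag):
        if self._current is not None and tag == "div":
            self._depth -= 1
            if self._depth == 0:
                # Join the text fragments and collapse runs of whitespace.
                text = "".join(self._current["content"])
                self._current["content"] = " ".join(text.split())
                self.posts.append(self._current)
                self._current = None

    def handle_data(self, data):
        if self._current is not None:
            self._current["content"].append(data)


def parse_posts(html):
    """Return a list of {"user", "timestamp", "content"} dicts."""
    parser = PostParser()
    parser.feed(html)
    parser.close()
    return parser.posts
```

Each returned dict holds one post's content plus the user and timestamp needed for the later per-user features.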
Data preparation
Content punctuation sanitised by regexp substitutions.
Sub forum post data (x2)
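As an illustration of sanitising punctuation with regular expression substitutions, here is a small sketch. The specific rules are assumptions chosen for the example; the slides do not list the substitutions that were actually applied.

```python
import re


def sanitise(text):
    """Normalise punctuation in a post (illustrative rules only)."""
    text = re.sub(r"[\u2018\u2019]", "'", text)  # curly single quotes -> straight
    text = re.sub(r"[\u201c\u201d]", '"', text)  # curly double quotes -> straight
    text = re.sub(r"\.{2,}", "...", text)        # runs of two or more dots -> "..."
    text = re.sub(r"([!?])\1+", r"\1", text)     # "???" -> "?", "!!" -> "!"
    text = re.sub(r"\s+", " ", text)             # collapse whitespace
    return text.strip()
```

Normalising before feature extraction keeps sentence splitting and word counts from being skewed by typing habits such as repeated punctuation.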
User labelling
How do we label a user?
● Users frequently post in both sub-forums
● To differentiate:
○ Assume that OPs (thread starters) in a sub-forum are of that category
○ Otherwise look at ratio of posts (replies) between the two sub-forums*
FP = First Post in thread
SP = Subsequent Post in thread
How do we label a user?
[Decision diagram: thread starts and replies lead to a label of Dementia, Partner, or Unknown (discarded)]
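The labelling rule can be sketched roughly as follows. The slides do not give the reply ratio used, so the `ratio` parameter and its default are hypothetical, as are the label strings.

```python
from collections import Counter


def label_user(posts, ratio=2.0):
    """Label one user from their posts.

    posts: iterable of (sub_forum, is_first_post) pairs, where sub_forum is
    "dementia" or "partner". `ratio` is a hypothetical threshold.
    """
    posts = list(posts)
    starts = Counter(forum for forum, is_first in posts if is_first)
    replies = Counter(forum for forum, is_first in posts if not is_first)

    # Rule 1: a user who starts threads in exactly one sub-forum takes its label.
    if len(starts) == 1:
        return next(iter(starts))

    # Rule 2: otherwise compare reply counts between the two sub-forums.
    d, p = replies["dementia"], replies["partner"]
    if d > 0 and d >= ratio * p:
        return "dementia"
    if p > 0 and p >= ratio * d:
        return "partner"
    return "unknown"  # ambiguous users are discarded
```

Users who come back as "unknown" are dropped rather than guessed at, which trades dataset size for cleaner labels.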
Features and EDA
Sentiment “polarity” (out-of-the-box via NLTK & TextBlob)
● Alternatively, you can train your own text classifier: http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier/
‘Mood change’ as a feature
● Average of sentence sentiments per post
● Slightly higher sentiment for dementia sufferers’ posts
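To show what "average of sentence sentiments per post" means in practice, here is a toy sketch. The deck took polarity from NLTK and TextBlob; the tiny word lexicon below is a hypothetical stand-in so that the example runs without extra packages, and its scores are not comparable to TextBlob's.

```python
import re

# Tiny illustrative lexicon (hypothetical); a real run would use a library score.
LEXICON = {"good": 1.0, "happy": 1.0, "lovely": 1.0,
           "bad": -1.0, "sad": -1.0, "worried": -1.0}


def sentence_polarity(sentence):
    """Mean lexicon score of the known words in one sentence, 0.0 if none."""
    words = re.findall(r"[a-z']+", sentence.lower())
    scores = [LEXICON[w] for w in words if w in LEXICON]
    return sum(scores) / len(scores) if scores else 0.0


def post_polarity(post):
    """Average of the sentence polarities in a post."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", post.strip()) if s]
    if not sentences:
        return 0.0
    return sum(sentence_polarity(s) for s in sentences) / len(sentences)
```

Averaging per sentence rather than over the whole post stops one long emotional sentence from dominating the score.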
Language-oriented features
Function areas: lexical functions, comprehension functions
● Empty phrases: “go ahead” phrases, counts of “ummm...errr”
● Paraphasias and neologisms: unintended or invented words
● Vocabulary-related: words that are not in common usage
● Readability: difficult words count, Dale-Chall readability, Flesch-Kincaid, Flesch Reading Ease
Simple language features
● Sentence count
● Word count
● Words per sentence
● Unique word count
● Unique words to total ratio
● “Go Ahead” words (Empty phrases)
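The simple language features listed above can be computed along these lines. The sentence splitter is a naive regular expression, and the `EMPTY_WORDS` set is a hypothetical example; the deck does not list the words it counted.

```python
import re

EMPTY_WORDS = {"um", "umm", "er", "err", "thing", "stuff"}  # hypothetical list


def simple_features(post):
    """Return basic count features for one post."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", post.strip()) if s]
    words = re.findall(r"[a-z']+", post.lower())
    n_sent, n_words = len(sentences), len(words)
    unique = set(words)
    return {
        "sentence_count": n_sent,
        "word_count": n_words,
        "words_per_sentence": n_words / n_sent if n_sent else 0.0,
        "unique_word_count": len(unique),
        "unique_ratio": len(unique) / n_words if n_words else 0.0,
        "empty_word_count": sum(w in EMPTY_WORDS for w in words),
    }
```

For example, `simple_features("I lost the thing. Um, I lost it again!")` counts 2 sentences, 9 words, 7 unique words and 2 empty words.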
Readability
(package readability-lxml)
● Avg syllables per word
● Avg letters per word
● Flesch reading ease
● Flesch-Kincaid grade
● Polysyllable count
● Automated readability index
● Number of “difficult” words
● Dale-Chall readability score
● Gunning fog
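Two of the readability scores above have simple closed forms: Flesch reading ease is 206.835 - 1.015 (words per sentence) - 84.6 (syllables per word), and the Flesch-Kincaid grade is 0.39 (words per sentence) + 11.8 (syllables per word) - 15.59. The sketch below applies them with a crude vowel-group syllable counter, which is an assumption of this example; a readability package would count syllables more carefully, so its scores will differ somewhat.

```python
import re


def count_syllables(word):
    """Crude heuristic: count vowel groups, discounting a trailing silent 'e'."""
    word = word.lower()
    n = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and n > 1:
        n -= 1
    return max(n, 1)


def readability(text):
    """Return Flesch reading ease and Flesch-Kincaid grade, or None if empty."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    words = re.findall(r"[A-Za-z]+", text)
    if not sentences or not words:
        return None
    wps = len(words) / len(sentences)                           # words per sentence
    spw = sum(count_syllables(w) for w in words) / len(words)   # syllables per word
    return {
        "flesch_reading_ease": 206.835 - 1.015 * wps - 84.6 * spw,
        "flesch_kincaid_grade": 0.39 * wps + 11.8 * spw - 15.59,
    }
```

Higher reading ease means simpler text, while a higher grade means harder text, so the two scores move in opposite directions.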
Vocabulary & word counts
Memory-oriented features
● Sort posts by username and timestamp, add a shifted column
● Apply comparison function between post and previous post:
○ NLTK edit_distance (fuzzy match)
○ Cosine similarity between TF-IDF vectors
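The "compare each post with the user's previous post" idea can be sketched as below, covering only the cosine similarity variant. The deck describes a pandas shifted column; this version uses plain lists to stay self-contained, and the smoothed IDF formula is one common choice assumed here rather than taken from the slides.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, using a smoothed IDF."""
    tokenised = [tokenize(d) for d in docs]
    n = len(tokenised)
    df = Counter(term for toks in tokenised for term in set(toks))
    idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}
    return [{t: tf * idf[t] for t, tf in Counter(toks).items()}
            for toks in tokenised]


def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def similarity_to_previous(posts):
    """posts: list of (username, timestamp, text).

    Returns (username, timestamp, similarity) tuples sorted by user and time,
    with None as the similarity of each user's first post.
    """
    ordered = sorted(posts, key=lambda p: (p[0], p[1]))
    vectors = tfidf_vectors([p[2] for p in ordered])
    out = []
    for i, (user, ts, _) in enumerate(ordered):
        if i > 0 and ordered[i - 1][0] == user:
            out.append((user, ts, cosine(vectors[i], vectors[i - 1])))
        else:
            out.append((user, ts, None))
    return out
```

A high similarity to the previous post suggests the user is repeating themselves, which is why this works as a memory-oriented feature.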
Part of speech (POS) features
● Tag words and tally up frequencies
● Calculate “rates”
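A sketch of turning tag tallies into rates. It assumes the input is already tagged as (word, tag) pairs with Penn Treebank tags, the format that `nltk.pos_tag` returns; the grouping into four coarse classes is an assumption made for this example.

```python
from collections import Counter

GROUPS = ("verb", "noun", "adjective", "pronoun")


def pos_rates(tagged):
    """tagged: list of (word, tag) pairs with Penn Treebank tags.

    Returns each group's count divided by the total number of words.
    """
    counts = Counter()
    for _, tag in tagged:
        if tag.startswith("VB"):
            counts["verb"] += 1
        elif tag.startswith("NN"):
            counts["noun"] += 1
        elif tag.startswith("JJ"):
            counts["adjective"] += 1
        elif tag.startswith("PRP"):
            counts["pronoun"] += 1
    n = len(tagged)
    return {g: (counts[g] / n if n else 0.0) for g in GROUPS}
```

Dividing by the word count makes posts of different lengths comparable, which raw tallies would not be.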
Models & results
Explanatory or predictive modelling?
● Actually both.
● First, ‘interpret’ a classifier (explanatory)
● Second, we need a ‘real-time’ detection system (predictive)
Data modelling strategy (used for initial ML runs)
Aggregation of posts
● pandas: groupby, agg by username
Balancing out the dataset
● Many more partner users than sufferers
● Subsample larger (partner) dataset to even things up
Validate using random train and test sets
● Randomly select 80% of users for training, 20% test
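The three steps of this strategy (aggregate per user, balance the classes, split users into train and test) might look like the following. The deck used pandas groupby and agg; this sketch uses the standard library and a single feature per user to keep it short, and the function names are hypothetical.

```python
import random
from collections import defaultdict
from statistics import median


def aggregate_by_user(rows):
    """rows: list of (username, label, feature_value).

    Returns {username: (label, median_feature_value)}.
    """
    values = defaultdict(list)
    labels = {}
    for user, label, value in rows:
        values[user].append(value)
        labels[user] = label
    return {u: (labels[u], median(v)) for u, v in values.items()}


def balance(users, rng):
    """Subsample every class down to the size of the smallest class."""
    by_label = defaultdict(list)
    for user, (label, _) in users.items():
        by_label[label].append(user)
    size = min(len(names) for names in by_label.values())
    keep = [u for names in by_label.values()
            for u in rng.sample(sorted(names), size)]
    return {u: users[u] for u in keep}


def split_users(users, rng, train_fraction=0.8):
    """Randomly split usernames into train and test lists."""
    names = sorted(users)
    rng.shuffle(names)
    cut = int(len(names) * train_fraction)
    return names[:cut], names[cut:]
```

Splitting by user rather than by post keeps all of one person's posts on the same side of the train/test boundary.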
Model Results for Misc. Features
● Median values (aggregated over all posts per user)
Best: SVM Radial basis function classifier (with grid search)
User classification accuracy: 57%
Model Results for Memory Features
● Median values (aggregated over all posts per user)
Best: K-nearest neighbours Classifier
User classification accuracy: 63%
Model Results for Readability Features
● Median values (aggregated over all posts per user)
Best: K-nearest neighbours Classifier
User classification accuracy: 59%
Model Results for Part-Of-Speech Features
● Median values (aggregated over all posts per user)
Best: SVM Radial basis function classifier (with grid search)
User classification accuracy: 61%
Model Results for All Features
● Median values (aggregated over all posts per user)
Best: Naïve Bayes Classifier
User classification accuracy: 63%
Re-think: Classify posts, not users
● Currently group by userID
● Some users post more than others
● Posts would utilise full “richness” of the dataset
● Double round of sampling required on post set:
○ 3 - 4 times more “partners” than dementia sufferers
○ Partners post approx. 3 times more posts than sufferers do
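One possible reading of the "double round of sampling" is sketched below: first even out the number of users per class, then even out the number of posts per class among the remaining users. The slides do not spell out the procedure, so both rounds as written here are assumptions.

```python
import random
from collections import defaultdict


def double_sample(posts, rng):
    """posts: list of (username, label, text); each user has one label.

    Round 1 balances the number of users per label.
    Round 2 balances the number of posts per label among the kept users.
    """
    users = defaultdict(set)
    for user, label, _ in posts:
        users[label].add(user)
    n_users = min(len(names) for names in users.values())
    kept_users = set()
    for names in users.values():
        kept_users.update(rng.sample(sorted(names), n_users))

    by_label = defaultdict(list)
    for post in posts:
        if post[0] in kept_users:
            by_label[post[1]].append(post)
    n_posts = min(len(group) for group in by_label.values())
    sample = []
    for group in by_label.values():
        sample.extend(rng.sample(group, n_posts))
    return sample
```

Without the second round, the partner class would still contribute roughly three times as many posts even after the user counts were matched.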
Model Results for All Features (by post)
● Filtered set of posts
Best: Random Forest Classifier
Accuracy: 68% when classifying individual posts
Wrap up
Results in summary
● Best performing feature group so far on aggregated set by user:
○ Memory-based features
● Best performing individual feature on aggregated set by user:
○ Verb rate = ratio of verbs to word count in post
● Best performing individual feature on individual post:
○ Cosine similarity to previous post
● Aligns with symptoms expected in early stage to mild dementia
Future avenues
● Data
○ Further data gathering (more blogs, including blogs on non-Alzheimer's topics)
○ Better user identification (e.g. active learning)
● Features
○ More and better features
○ Distinguish between individual types of dementia
○ More memory-related features (e.g. LSI)
● Clustering of posts into ‘topics’ or users into ‘types’
○ gensim / LDA topic modelling
○ Early stage / medium condition / advanced condition posters
● Classification and modelling
○ Time series analysis
○ New sampling techniques, input validation and models
Future: Time series analysis
● Noisy datasets
○ Apply numerical Bayesian inference
● Are we looking for a steady change in the mean?
○ Ramp detection
● Or a sudden change in variance?
○ Step change detection
[Chart legend: Dementia sufferer, Partner]
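As a simple point of comparison for change detection, the sketch below finds the single split of a series that best separates it into two flat segments, using least squares. Note that this is a deliberately basic stand-in: it looks for a step in the mean only, and it is not the Bayesian inference approach the slide proposes.

```python
def best_step(series):
    """Return (index, cost) of the split minimising within-segment squared error.

    Index i means the step falls between series[i - 1] and series[i].
    Returns None if the series has fewer than two points.
    """
    def sse(xs):
        mean = sum(xs) / len(xs)
        return sum((x - mean) ** 2 for x in xs)

    best = None
    for i in range(1, len(series)):
        cost = sse(series[:i]) + sse(series[i:])
        if best is None or cost < best[1]:
            best = (i, cost)
    return best
```

On a per-user feature series such as similarity to the previous post, the returned index would mark the post at which the level appears to shift.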
Conclusions
● Introduction to Alzheimer’s and its impact
● Explanation of our technical approach and surrounding challenges
● Initial observations and predictions
● Tough problem and a worthwhile cause for data science
● Please contact us if you would like to help, or have ideas:
frank.kelly@cantab.net https://alzhack.wordpress.com/contribute-2/
Thank you!
Alz Hack II

More Related Content

Viewers also liked

Changepoint Detection with Bayesian Inference
Changepoint Detection with Bayesian InferenceChangepoint Detection with Bayesian Inference
Changepoint Detection with Bayesian Inference
Frank Kelly
 
Change Point Analysis
Change Point AnalysisChange Point Analysis
Change Point Analysis
Mark Conway
 
Change Point Analysis
Change Point AnalysisChange Point Analysis
Change Point Analysis
Taha Kass-Hout, MD, MS
 
Hierarchical clustering in Python and beyond
Hierarchical clustering in Python and beyondHierarchical clustering in Python and beyond
Hierarchical clustering in Python and beyond
Frank Kelly
 
Text classification in scikit-learn
Text classification in scikit-learnText classification in scikit-learn
Text classification in scikit-learn
Jimmy Lai
 
Text Classification in Python – using Pandas, scikit-learn, IPython Notebook ...
Text Classification in Python – using Pandas, scikit-learn, IPython Notebook ...Text Classification in Python – using Pandas, scikit-learn, IPython Notebook ...
Text Classification in Python – using Pandas, scikit-learn, IPython Notebook ...
Jimmy Lai
 
Statistical Machine Learning for Text Classification with scikit-learn and NLTK
Statistical Machine Learning for Text Classification with scikit-learn and NLTKStatistical Machine Learning for Text Classification with scikit-learn and NLTK
Statistical Machine Learning for Text Classification with scikit-learn and NLTK
Olivier Grisel
 

Viewers also liked (7)

Changepoint Detection with Bayesian Inference
Changepoint Detection with Bayesian InferenceChangepoint Detection with Bayesian Inference
Changepoint Detection with Bayesian Inference
 
Change Point Analysis
Change Point AnalysisChange Point Analysis
Change Point Analysis
 
Change Point Analysis
Change Point AnalysisChange Point Analysis
Change Point Analysis
 
Hierarchical clustering in Python and beyond
Hierarchical clustering in Python and beyondHierarchical clustering in Python and beyond
Hierarchical clustering in Python and beyond
 
Text classification in scikit-learn
Text classification in scikit-learnText classification in scikit-learn
Text classification in scikit-learn
 
Text Classification in Python – using Pandas, scikit-learn, IPython Notebook ...
Text Classification in Python – using Pandas, scikit-learn, IPython Notebook ...Text Classification in Python – using Pandas, scikit-learn, IPython Notebook ...
Text Classification in Python – using Pandas, scikit-learn, IPython Notebook ...
 
Statistical Machine Learning for Text Classification with scikit-learn and NLTK
Statistical Machine Learning for Text Classification with scikit-learn and NLTKStatistical Machine Learning for Text Classification with scikit-learn and NLTK
Statistical Machine Learning for Text Classification with scikit-learn and NLTK
 

Similar to Alz Hack II

Weird News Ranking : IRE project
Weird News Ranking : IRE projectWeird News Ranking : IRE project
Weird News Ranking : IRE project
Rupali Aher
 
Best practices machine learning final
Best practices machine learning finalBest practices machine learning final
Best practices machine learning final
Dianna Doan
 
Recsys 2016 tutorial: Lessons learned from building real-life recommender sys...
Recsys 2016 tutorial: Lessons learned from building real-life recommender sys...Recsys 2016 tutorial: Lessons learned from building real-life recommender sys...
Recsys 2016 tutorial: Lessons learned from building real-life recommender sys...
Xavier Amatriain
 
Model evaluation in the land of deep learning
Model evaluation in the land of deep learningModel evaluation in the land of deep learning
Model evaluation in the land of deep learning
Pramit Choudhary
 
Past, present, and future of Recommender Systems: an industry perspective
Past, present, and future of Recommender Systems: an industry perspectivePast, present, and future of Recommender Systems: an industry perspective
Past, present, and future of Recommender Systems: an industry perspective
Xavier Amatriain
 
Reference Domain Ontologies and Large Medical Language Models.pptx
Reference Domain Ontologies and Large Medical Language Models.pptxReference Domain Ontologies and Large Medical Language Models.pptx
Reference Domain Ontologies and Large Medical Language Models.pptx
Chimezie Ogbuji
 
Diabetes Prediction Using Machine Learning
Diabetes Prediction Using Machine LearningDiabetes Prediction Using Machine Learning
Diabetes Prediction Using Machine Learning
jagan477830
 
Best Practices for Big Data Analytics with Machine Learning by Datameer
Best Practices for Big Data Analytics with Machine Learning by DatameerBest Practices for Big Data Analytics with Machine Learning by Datameer
Best Practices for Big Data Analytics with Machine Learning by Datameer
Datameer
 
Klout as an Example Application of Topics-oriented NLP APIs
Klout as an Example Application of Topics-oriented NLP APIsKlout as an Example Application of Topics-oriented NLP APIs
Klout as an Example Application of Topics-oriented NLP APIs
Tyler Singletary
 
Sistemas de Recomendação sem Enrolação
Sistemas de Recomendação sem Enrolação Sistemas de Recomendação sem Enrolação
Sistemas de Recomendação sem Enrolação
Gabriel Moreira
 
tranSMART Community Meeting 5-7 Nov 13 - Session 3: tranSMART and the One Min...
tranSMART Community Meeting 5-7 Nov 13 - Session 3: tranSMART and the One Min...tranSMART Community Meeting 5-7 Nov 13 - Session 3: tranSMART and the One Min...
tranSMART Community Meeting 5-7 Nov 13 - Session 3: tranSMART and the One Min...
David Peyruc
 
Named Entity Recognition from Online News
Named Entity Recognition from Online NewsNamed Entity Recognition from Online News
Named Entity Recognition from Online News
Bernardo Najlis
 
Data science guide
Data science guideData science guide
Data science guide
gokulprasath06
 
Paper presentations: UK e-science AHM meeting, 2005
Paper presentations: UK e-science AHM meeting, 2005Paper presentations: UK e-science AHM meeting, 2005
Paper presentations: UK e-science AHM meeting, 2005
Paolo Missier
 
Identifying and classifying unknown Network Disruption
Identifying and classifying unknown Network DisruptionIdentifying and classifying unknown Network Disruption
Identifying and classifying unknown Network Disruption
jagan477830
 
Text Analytics for Legal work
Text Analytics for Legal workText Analytics for Legal work
Text Analytics for Legal work
AlgoAnalytics Financial Consultancy Pvt. Ltd.
 
Oxford Lectures Part 1
Oxford Lectures Part 1Oxford Lectures Part 1
Oxford Lectures Part 1
Andrea Pasqua
 
Make Sense Out of Data with Feature Engineering
Make Sense Out of Data with Feature EngineeringMake Sense Out of Data with Feature Engineering
Make Sense Out of Data with Feature Engineering
DataRobot
 
Dwdm ppt for the btech student contain basis
Dwdm ppt for the btech student contain basisDwdm ppt for the btech student contain basis
Dwdm ppt for the btech student contain basis
nivatripathy93
 
Statistical Analysis of Results in Music Information Retrieval: Why and How
Statistical Analysis of Results in Music Information Retrieval: Why and HowStatistical Analysis of Results in Music Information Retrieval: Why and How
Statistical Analysis of Results in Music Information Retrieval: Why and How
Julián Urbano
 

Similar to Alz Hack II (20)

Weird News Ranking : IRE project
Weird News Ranking : IRE projectWeird News Ranking : IRE project
Weird News Ranking : IRE project
 
Best practices machine learning final
Best practices machine learning finalBest practices machine learning final
Best practices machine learning final
 
Recsys 2016 tutorial: Lessons learned from building real-life recommender sys...
Recsys 2016 tutorial: Lessons learned from building real-life recommender sys...Recsys 2016 tutorial: Lessons learned from building real-life recommender sys...
Recsys 2016 tutorial: Lessons learned from building real-life recommender sys...
 
Model evaluation in the land of deep learning
Model evaluation in the land of deep learningModel evaluation in the land of deep learning
Model evaluation in the land of deep learning
 
Past, present, and future of Recommender Systems: an industry perspective
Past, present, and future of Recommender Systems: an industry perspectivePast, present, and future of Recommender Systems: an industry perspective
Past, present, and future of Recommender Systems: an industry perspective
 
Reference Domain Ontologies and Large Medical Language Models.pptx
Reference Domain Ontologies and Large Medical Language Models.pptxReference Domain Ontologies and Large Medical Language Models.pptx
Reference Domain Ontologies and Large Medical Language Models.pptx
 
Diabetes Prediction Using Machine Learning
Diabetes Prediction Using Machine LearningDiabetes Prediction Using Machine Learning
Diabetes Prediction Using Machine Learning
 
Best Practices for Big Data Analytics with Machine Learning by Datameer
Best Practices for Big Data Analytics with Machine Learning by DatameerBest Practices for Big Data Analytics with Machine Learning by Datameer
Best Practices for Big Data Analytics with Machine Learning by Datameer
 
Klout as an Example Application of Topics-oriented NLP APIs
Klout as an Example Application of Topics-oriented NLP APIsKlout as an Example Application of Topics-oriented NLP APIs
Klout as an Example Application of Topics-oriented NLP APIs
 
Sistemas de Recomendação sem Enrolação
Sistemas de Recomendação sem Enrolação Sistemas de Recomendação sem Enrolação
Sistemas de Recomendação sem Enrolação
 
tranSMART Community Meeting 5-7 Nov 13 - Session 3: tranSMART and the One Min...
tranSMART Community Meeting 5-7 Nov 13 - Session 3: tranSMART and the One Min...tranSMART Community Meeting 5-7 Nov 13 - Session 3: tranSMART and the One Min...
tranSMART Community Meeting 5-7 Nov 13 - Session 3: tranSMART and the One Min...
 
Named Entity Recognition from Online News
Named Entity Recognition from Online NewsNamed Entity Recognition from Online News
Named Entity Recognition from Online News
 
Data science guide
Data science guideData science guide
Data science guide
 
Paper presentations: UK e-science AHM meeting, 2005
Paper presentations: UK e-science AHM meeting, 2005Paper presentations: UK e-science AHM meeting, 2005
Paper presentations: UK e-science AHM meeting, 2005
 
Identifying and classifying unknown Network Disruption
Identifying and classifying unknown Network DisruptionIdentifying and classifying unknown Network Disruption
Identifying and classifying unknown Network Disruption
 
Text Analytics for Legal work
Text Analytics for Legal workText Analytics for Legal work
Text Analytics for Legal work
 
Oxford Lectures Part 1
Oxford Lectures Part 1Oxford Lectures Part 1
Oxford Lectures Part 1
 
Make Sense Out of Data with Feature Engineering
Make Sense Out of Data with Feature EngineeringMake Sense Out of Data with Feature Engineering
Make Sense Out of Data with Feature Engineering
 
Dwdm ppt for the btech student contain basis
Dwdm ppt for the btech student contain basisDwdm ppt for the btech student contain basis
Dwdm ppt for the btech student contain basis
 
Statistical Analysis of Results in Music Information Retrieval: Why and How
Statistical Analysis of Results in Music Information Retrieval: Why and HowStatistical Analysis of Results in Music Information Retrieval: Why and How
Statistical Analysis of Results in Music Information Retrieval: Why and How
 

Recently uploaded

cathode ray oscilloscope and its applications
cathode ray oscilloscope and its applicationscathode ray oscilloscope and its applications
cathode ray oscilloscope and its applications
sandertein
 
Male reproduction physiology by Suyash Garg .pptx
Male reproduction physiology by Suyash Garg .pptxMale reproduction physiology by Suyash Garg .pptx
Male reproduction physiology by Suyash Garg .pptx
suyashempire
 
快速办理(UAM毕业证书)马德里自治大学毕业证学位证一模一样
快速办理(UAM毕业证书)马德里自治大学毕业证学位证一模一样快速办理(UAM毕业证书)马德里自治大学毕业证学位证一模一样
快速办理(UAM毕业证书)马德里自治大学毕业证学位证一模一样
hozt8xgk
 
HUMAN EYE By-R.M Class 10 phy best digital notes.pdf
HUMAN EYE By-R.M Class 10 phy best digital notes.pdfHUMAN EYE By-R.M Class 10 phy best digital notes.pdf
HUMAN EYE By-R.M Class 10 phy best digital notes.pdf
Ritik83251
 
Holsinger, Bruce W. - Music, body and desire in medieval culture [2001].pdf
Holsinger, Bruce W. - Music, body and desire in medieval culture [2001].pdfHolsinger, Bruce W. - Music, body and desire in medieval culture [2001].pdf
Holsinger, Bruce W. - Music, body and desire in medieval culture [2001].pdf
frank0071
 
Anti-Universe And Emergent Gravity and the Dark Universe
Anti-Universe And Emergent Gravity and the Dark UniverseAnti-Universe And Emergent Gravity and the Dark Universe
Anti-Universe And Emergent Gravity and the Dark Universe
Sérgio Sacani
 
IMPORTANCE OF ALGAE AND ITS BENIFITS.pptx
IMPORTANCE OF ALGAE  AND ITS BENIFITS.pptxIMPORTANCE OF ALGAE  AND ITS BENIFITS.pptx
IMPORTANCE OF ALGAE AND ITS BENIFITS.pptx
OmAle5
 
Farming systems analysis: what have we learnt?.pptx
Farming systems analysis: what have we learnt?.pptxFarming systems analysis: what have we learnt?.pptx
Farming systems analysis: what have we learnt?.pptx
Frédéric Baudron
 
Juaristi, Jon. - El canon espanol. El legado de la cultura española a la civi...
Juaristi, Jon. - El canon espanol. El legado de la cultura española a la civi...Juaristi, Jon. - El canon espanol. El legado de la cultura española a la civi...
Juaristi, Jon. - El canon espanol. El legado de la cultura española a la civi...
frank0071
 
Pests of Storage_Identification_Dr.UPR.pdf
Pests of Storage_Identification_Dr.UPR.pdfPests of Storage_Identification_Dr.UPR.pdf
Pests of Storage_Identification_Dr.UPR.pdf
PirithiRaju
 
Clinical periodontology and implant dentistry 2003.pdf
Clinical periodontology and implant dentistry 2003.pdfClinical periodontology and implant dentistry 2003.pdf
Clinical periodontology and implant dentistry 2003.pdf
RAYMUNDONAVARROCORON
 
JAMES WEBB STUDY THE MASSIVE BLACK HOLE SEEDS
JAMES WEBB STUDY THE MASSIVE BLACK HOLE SEEDSJAMES WEBB STUDY THE MASSIVE BLACK HOLE SEEDS
JAMES WEBB STUDY THE MASSIVE BLACK HOLE SEEDS
Sérgio Sacani
 
AJAY KUMAR NIET GreNo Guava Project File.pdf
AJAY KUMAR NIET GreNo Guava Project File.pdfAJAY KUMAR NIET GreNo Guava Project File.pdf
AJAY KUMAR NIET GreNo Guava Project File.pdf
AJAY KUMAR
 
Sustainable Land Management - Climate Smart Agriculture
Sustainable Land Management - Climate Smart AgricultureSustainable Land Management - Climate Smart Agriculture
Sustainable Land Management - Climate Smart Agriculture
International Food Policy Research Institute- South Asia Office
 
Discovery of An Apparent Red, High-Velocity Type Ia Supernova at 𝐳 = 2.9 wi...
Discovery of An Apparent Red, High-Velocity Type Ia Supernova at  𝐳 = 2.9  wi...Discovery of An Apparent Red, High-Velocity Type Ia Supernova at  𝐳 = 2.9  wi...
Discovery of An Apparent Red, High-Velocity Type Ia Supernova at 𝐳 = 2.9 wi...
Sérgio Sacani
 
Compositions of iron-meteorite parent bodies constrainthe structure of the pr...
Compositions of iron-meteorite parent bodies constrainthe structure of the pr...Compositions of iron-meteorite parent bodies constrainthe structure of the pr...
Compositions of iron-meteorite parent bodies constrainthe structure of the pr...
Sérgio Sacani
 
Summary Of transcription and Translation.pdf
Summary Of transcription and Translation.pdfSummary Of transcription and Translation.pdf
Summary Of transcription and Translation.pdf
vadgavevedant86
 
Lattice Defects in ionic solid compound.pptx
Lattice Defects in ionic solid compound.pptxLattice Defects in ionic solid compound.pptx
Lattice Defects in ionic solid compound.pptx
DrRajeshDas
 
BIRDS DIVERSITY OF SOOTEA BISWANATH ASSAM.ppt.pptx
BIRDS  DIVERSITY OF SOOTEA BISWANATH ASSAM.ppt.pptxBIRDS  DIVERSITY OF SOOTEA BISWANATH ASSAM.ppt.pptx
BIRDS DIVERSITY OF SOOTEA BISWANATH ASSAM.ppt.pptx
goluk9330
 
Gadgets for management of stored product pests_Dr.UPR.pdf
Gadgets for management of stored product pests_Dr.UPR.pdfGadgets for management of stored product pests_Dr.UPR.pdf
Gadgets for management of stored product pests_Dr.UPR.pdf
PirithiRaju
 

Recently uploaded (20)

cathode ray oscilloscope and its applications
cathode ray oscilloscope and its applicationscathode ray oscilloscope and its applications
cathode ray oscilloscope and its applications
 
Male reproduction physiology by Suyash Garg .pptx
Male reproduction physiology by Suyash Garg .pptxMale reproduction physiology by Suyash Garg .pptx
Male reproduction physiology by Suyash Garg .pptx
 
快速办理(UAM毕业证书)马德里自治大学毕业证学位证一模一样
快速办理(UAM毕业证书)马德里自治大学毕业证学位证一模一样快速办理(UAM毕业证书)马德里自治大学毕业证学位证一模一样
快速办理(UAM毕业证书)马德里自治大学毕业证学位证一模一样
 
HUMAN EYE By-R.M Class 10 phy best digital notes.pdf
HUMAN EYE By-R.M Class 10 phy best digital notes.pdfHUMAN EYE By-R.M Class 10 phy best digital notes.pdf
HUMAN EYE By-R.M Class 10 phy best digital notes.pdf
 
Holsinger, Bruce W. - Music, body and desire in medieval culture [2001].pdf
Holsinger, Bruce W. - Music, body and desire in medieval culture [2001].pdfHolsinger, Bruce W. - Music, body and desire in medieval culture [2001].pdf
Holsinger, Bruce W. - Music, body and desire in medieval culture [2001].pdf
 
Anti-Universe And Emergent Gravity and the Dark Universe
Anti-Universe And Emergent Gravity and the Dark UniverseAnti-Universe And Emergent Gravity and the Dark Universe
Anti-Universe And Emergent Gravity and the Dark Universe
 
IMPORTANCE OF ALGAE AND ITS BENIFITS.pptx
IMPORTANCE OF ALGAE  AND ITS BENIFITS.pptxIMPORTANCE OF ALGAE  AND ITS BENIFITS.pptx
IMPORTANCE OF ALGAE AND ITS BENIFITS.pptx
 
Farming systems analysis: what have we learnt?.pptx
Farming systems analysis: what have we learnt?.pptxFarming systems analysis: what have we learnt?.pptx
Farming systems analysis: what have we learnt?.pptx
 
Juaristi, Jon. - El canon espanol. El legado de la cultura española a la civi...
Juaristi, Jon. - El canon espanol. El legado de la cultura española a la civi...Juaristi, Jon. - El canon espanol. El legado de la cultura española a la civi...
Juaristi, Jon. - El canon espanol. El legado de la cultura española a la civi...
 
Pests of Storage_Identification_Dr.UPR.pdf
Pests of Storage_Identification_Dr.UPR.pdfPests of Storage_Identification_Dr.UPR.pdf
Pests of Storage_Identification_Dr.UPR.pdf
 
Clinical periodontology and implant dentistry 2003.pdf
Clinical periodontology and implant dentistry 2003.pdfClinical periodontology and implant dentistry 2003.pdf
Clinical periodontology and implant dentistry 2003.pdf
 
JAMES WEBB STUDY THE MASSIVE BLACK HOLE SEEDS
JAMES WEBB STUDY THE MASSIVE BLACK HOLE SEEDSJAMES WEBB STUDY THE MASSIVE BLACK HOLE SEEDS
JAMES WEBB STUDY THE MASSIVE BLACK HOLE SEEDS
 
AJAY KUMAR NIET GreNo Guava Project File.pdf
AJAY KUMAR NIET GreNo Guava Project File.pdfAJAY KUMAR NIET GreNo Guava Project File.pdf
AJAY KUMAR NIET GreNo Guava Project File.pdf
 
Sustainable Land Management - Climate Smart Agriculture
Sustainable Land Management - Climate Smart AgricultureSustainable Land Management - Climate Smart Agriculture
Sustainable Land Management - Climate Smart Agriculture
 
Discovery of An Apparent Red, High-Velocity Type Ia Supernova at 𝐳 = 2.9 wi...
Discovery of An Apparent Red, High-Velocity Type Ia Supernova at  𝐳 = 2.9  wi...Discovery of An Apparent Red, High-Velocity Type Ia Supernova at  𝐳 = 2.9  wi...
Discovery of An Apparent Red, High-Velocity Type Ia Supernova at 𝐳 = 2.9 wi...
 
Compositions of iron-meteorite parent bodies constrainthe structure of the pr...
Compositions of iron-meteorite parent bodies constrainthe structure of the pr...Compositions of iron-meteorite parent bodies constrainthe structure of the pr...
Compositions of iron-meteorite parent bodies constrainthe structure of the pr...
 
Summary Of transcription and Translation.pdf
Summary Of transcription and Translation.pdfSummary Of transcription and Translation.pdf
Summary Of transcription and Translation.pdf
 
Lattice Defects in ionic solid compound.pptx
Lattice Defects in ionic solid compound.pptxLattice Defects in ionic solid compound.pptx
Lattice Defects in ionic solid compound.pptx
 
BIRDS DIVERSITY OF SOOTEA BISWANATH ASSAM.ppt.pptx
BIRDS  DIVERSITY OF SOOTEA BISWANATH ASSAM.ppt.pptxBIRDS  DIVERSITY OF SOOTEA BISWANATH ASSAM.ppt.pptx
BIRDS DIVERSITY OF SOOTEA BISWANATH ASSAM.ppt.pptx
 
Gadgets for management of stored product pests_Dr.UPR.pdf
Gadgets for management of stored product pests_Dr.UPR.pdfGadgets for management of stored product pests_Dr.UPR.pdf
Gadgets for management of stored product pests_Dr.UPR.pdf
 

Alz Hack II

  • 1. AlzHack Data Driven Diagnosis of Alzheimer's Disease Frank Kelly
  • 3. Diagnose Alzheimer’s disease as early as possible Benefit to millions of people (potentially) Our goal:
  • 4. Why is Alzheimer’s disease diagnosis important? Chronic neurodegenerative disease 60-70% of dementia cases = Alzheimer's 48 million people affected worldwide (2015) Wrecks people’s lives (+ their families’) 800,000 people (in the UK) formally diagnosed Only 43% of those with the condition get a diagnosis Figures: wikipedia & http://www.bbc.co.uk/science/0/21878238
  • 5. Demographic changes mean it will be more widespread Chart credit: economist.com By 2050 the number of dementia sufferers is expected to triple A global, mounting problem
  • 6. How is Alzheimer’s disease diagnosed today? Medical history Mental status tests Physical and neurological examination Blood tests and brain imaging Example test sheet:: http://www.ftdrg.org/wp-content/uploads/4a-CCT_revised-Picture-stimulus.pdf
  • 7. A gradual decline -20 years -10 years Death-15 years -5 years Earliest Alzheimer’s Mild to moderate Severe Common diagnosis period
  • 8. Who are we ? Full bios: https://alzhack.wordpress.com What is our approach? We’re doing citizen science ● No lab, or lab coats ● Readily available data ● Other people’s research
  • 9. Diagnose Alzheimer’s disease as early as possible Why? Participate in clinical drug trials Benefit from treatment More time to plan Take own decisions Better carer relationship Reduce anxieties about unknowns Sketch: http://www.businessfinancenews.com/28526-will-astrazeneca-plc-and-eli-lilly-give-breakthrough-in-alzheimers/
  • 11. How the disease manifests itself Protein plaques and tangles accumulate in the brain: Disrupting communication between nerve cells Kills nerve cells Loss of brain tissue Facts: https://www.alzheimers.org.uk/site/scripts/documents_info.php?documentID=100 Imagery: www.alz.org
  • 12. How the disease manifests itself (1) Starts in the hippocampus Harder to form new memories Difficult to recollect from days or hours ago Video: https://www.youtube.com/watch?v=Eq_Er-tqPsA
  • 13. How the disease manifests itself (2) ...then takes root in other areas 2. Language processing 3. Logical thought 4. Emotions 5. Senses 6. Older memories 7. Balance and coordination Video: https://www.youtube.com/watch?v=Eq_Er-tqPsA
  • 14. Relevant symptoms Confusion with time/place Spatial memory Problems with words Misplacing items Decreased / poor judgment Withdrawal from work Mood change Difficulty with familiar tasksChallenges in planning SpeechShort term memory loss -20 years -10 years Death -15 years -5 years Earliest Alzheimer’s Mild to moderate Severe
  • 16. Previously: Analysis of a single user’s emails ● An Alzheimer’s disease sufferer’s emails over 4 years ● Conversion of email text to vectors ● Counts, lengths and other metrics Features Memory, language and sentiment related metrics extracted
  • 17. Results Some “explainable” trends Challenges Single user: lack of data and likely bias Scaling up: security concerns & deletion
  • 18. How did we get more data?
  • 19. Forum post scraping First lxml, then BeautifulSoup ● Two sub-forums ● ~3,600 threads ● ~78,000 posts ○ Post content ○ Post metadata ○ User metadata
  • 20. Data preparation Content punctuation sanitised by regexp substitutions. Sub forum post data (x2)
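A minimal sketch of punctuation sanitising by regular-expression substitution. The slides do not list the actual patterns, so every rule in `_RULES` below is a hypothetical example rather than the rules used in the project.

```python
import re

# Hypothetical substitution rules (assumed for illustration only).
_RULES = [
    (re.compile(r"[\u2018\u2019]"), "'"),   # curly single quotes -> straight
    (re.compile(r"[\u201c\u201d]"), '"'),   # curly double quotes -> straight
    (re.compile(r"\.{2,}"), ". "),          # runs of dots -> one full stop
    (re.compile(r"([!?])\1+"), r"\1"),      # "???" -> "?", "!!!" -> "!"
    (re.compile(r"\s+"), " "),              # collapse runs of whitespace
]


def sanitise(text: str) -> str:
    """Apply each substitution in order and trim the result."""
    for pattern, replacement in _RULES:
        text = pattern.sub(replacement, text)
    return text.strip()
```

Applying the rules in a fixed order keeps the behaviour predictable: whitespace is collapsed last so that spaces introduced by earlier rules are cleaned up too.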
  • 22. How do we label a user? ● Users frequently post in both sub-forums ● To differentiate: ○ Assume that OPs (thread starters) in a sub-forum are of that category ○ Otherwise look at ratio of posts (replies) between the two sub-forums* (* FP = First Post in thread; SP = Subsequent Post in thread)
  • 23. How do we label a user? (Diagram: thread posts and replies; labels Dementia, Partner, Discard, Unknown)
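The labelling rule can be sketched as a small function. The slides give only the two rules (thread starters first, then the reply ratio); the `min_ratio` threshold, the handling of users who start threads in both sub-forums, and the exact meaning of "discard" and "unknown" are assumptions made here for illustration.

```python
from dataclasses import dataclass


@dataclass
class PostCounts:
    fp_dementia: int = 0  # first posts (threads started) in the dementia sub-forum
    sp_dementia: int = 0  # subsequent posts (replies) in the dementia sub-forum
    fp_partner: int = 0   # first posts in the partner sub-forum
    sp_partner: int = 0   # subsequent posts in the partner sub-forum


def label_user(c: PostCounts, min_ratio: float = 0.75) -> str:
    # Rule 1: a thread starter is assumed to belong to that sub-forum's category.
    if c.fp_dementia and not c.fp_partner:
        return "dementia"
    if c.fp_partner and not c.fp_dementia:
        return "partner"
    if c.fp_dementia and c.fp_partner:
        return "discard"  # assumption: conflicting thread starts are dropped
    # Rule 2: otherwise fall back to the share of replies in each sub-forum.
    replies = c.sp_dementia + c.sp_partner
    if replies == 0:
        return "unknown"
    share = c.sp_dementia / replies
    if share >= min_ratio:
        return "dementia"
    if share <= 1 - min_ratio:
        return "partner"
    return "unknown"  # assumption: an even split is left unlabelled
```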
  • 25. Sentiment “polarity” (out-of-the-box via NLTK & TextBlob) ● Alternatively can train your own text classifier: http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier/ ‘Mood change’ as a feature
  • 26. ● Average of sentence sentiments per post ● Slightly higher sentiment for dementia sufferers’ posts
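Averaging sentence sentiments per post could look roughly like the sketch below. To keep it dependency-free, the scorer is passed in as a function; `toy_score` and its word lists are hypothetical stand-ins, not the TextBlob polarity model used in the slides. With TextBlob installed, a scorer such as `lambda s: TextBlob(s).sentiment.polarity` could be passed instead.

```python
import re
from statistics import mean
from typing import Callable


def split_sentences(post: str) -> list[str]:
    # Naive splitter on ., ! and ?; a real tokenizer handles abbreviations better.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", post) if s.strip()]


def post_polarity(post: str, score: Callable[[str], float]) -> float:
    """Mean of per-sentence polarity scores; 0.0 for an empty post."""
    sentences = split_sentences(post)
    return mean(score(s) for s in sentences) if sentences else 0.0


# Hypothetical toy lexicon, for demonstration only.
_POSITIVE = {"good", "happy", "lovely"}
_NEGATIVE = {"bad", "sad", "worried"}


def toy_score(sentence: str) -> float:
    words = re.findall(r"[a-z']+", sentence.lower())
    hits = sum(w in _POSITIVE for w in words) - sum(w in _NEGATIVE for w in words)
    return max(-1.0, min(1.0, float(hits)))  # clamp to [-1, 1] like a polarity
```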
  • 27. Language-oriented features. Lexical functions: Empty phrases (“Go ahead” phrases; counts of “ummm...errr”); Paraphasias and neologisms (unintended or invented words; words that are not in common usage). Comprehension functions: Vocabulary-related (difficult words count); Readability (Dale-Chall readability, Flesch Kincaid, Flesch Reading Ease)
  • 28. Simple language features ● Sentence count ● Word count ● Words per sentence ● Unique word count ● Unique words to total ratio ● “Go Ahead” words (Empty phrases)
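The simple language features above can be sketched with the standard library alone. The tokenisation rules and the `FILLERS` list are assumptions; the slides do not specify which empty phrases were counted.

```python
import re

# Hypothetical filler list standing in for the "Go Ahead" / empty phrases.
FILLERS = {"um", "umm", "ummm", "er", "err", "errr", "uh"}


def simple_features(post: str) -> dict:
    # Naive sentence and word splitting, good enough for a first pass.
    sentences = [s for s in re.split(r"[.!?]+", post) if s.strip()]
    words = re.findall(r"[a-z']+", post.lower())
    unique = set(words)
    return {
        "sentence_count": len(sentences),
        "word_count": len(words),
        "words_per_sentence": len(words) / len(sentences) if sentences else 0.0,
        "unique_word_count": len(unique),
        "unique_ratio": len(unique) / len(words) if words else 0.0,
        "filler_count": sum(w in FILLERS for w in words),
    }
```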
  • 29. Readability (package readability-lxml) ● Avg syllables per word ● Avg letters per word ● Flesch reading ease ● Flesch-Kincaid grade ● Polysyllable count ● Automated readability index ● Number of “difficult” words ● Dale-Chall readability score ● Gunning fog
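For reference, the two Flesch scores follow standard published formulas, sketched below. The syllable counter is a rough vowel-group heuristic assumed for illustration; a readability package would normally supply a better one, so scores from this sketch will differ slightly from library output.

```python
import re


def count_syllables(word: str) -> int:
    # Rough heuristic: count vowel groups, ignoring a trailing silent "e".
    word = word.lower()
    if word.endswith("e") and len(word) > 2:
        word = word[:-1]
    return max(1, len(re.findall(r"[aeiouy]+", word)))


def flesch_scores(text: str) -> tuple[float, float]:
    """Return (Flesch reading ease, Flesch-Kincaid grade) for the text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text needs at least one sentence and one word")
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = sum(count_syllables(w) for w in words) / len(words)
    reading_ease = 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word
    grade = 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59
    return reading_ease, grade
```

Higher reading-ease values mean simpler text, while the grade score moves in the opposite direction.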
  • 31. Memory-oriented features ● Sort posts by username and timestamp, add a shifted column
  • 32. Apply comparison function between post and previous post: ○ NLTK edit_distance (fuzzy match) ○ Cosine similarity between TF-IDF vectors
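The "compare each post with the same user's previous post" step might be sketched as follows. This is a standard-library stand-in: `edit_distance` here is a plain Levenshtein implementation rather than the NLTK function, and the TF-IDF weighting (a smoothed IDF) is an assumption, since the slides do not say which TF-IDF implementation was used.

```python
import math
import re
from collections import Counter


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the usual row-by-row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def tfidf_vectors(docs: list[str]) -> list[dict]:
    counts = [Counter(re.findall(r"[a-z']+", d.lower())) for d in docs]
    df = Counter(term for c in counts for term in c)  # document frequency
    n = len(docs)
    idf = {t: math.log((1 + n) / (1 + d)) + 1.0 for t, d in df.items()}  # smoothed
    return [{t: tf * idf[t] for t, tf in c.items()} for c in counts]


def cosine(u: dict, v: dict) -> float:
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0


def similarity_to_previous(posts: list[tuple]) -> list[tuple]:
    """posts: (username, timestamp, text). Returns (username, timestamp, similarity)."""
    ordered = sorted(posts, key=lambda p: (p[0], p[1]))  # sort by user, then time
    vectors = tfidf_vectors([p[2] for p in ordered])
    out = []
    for i, (user, ts, _) in enumerate(ordered):
        # The "shifted column": the previous row, if it belongs to the same user.
        has_prev = i > 0 and ordered[i - 1][0] == user
        out.append((user, ts, cosine(vectors[i], vectors[i - 1]) if has_prev else None))
    return out
```

A user's first post has no predecessor, so its similarity is `None` rather than zero; treating it as zero would bias the per-user aggregates.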
  • 33. Part of speech (POS) features ● Tag words and tally up frequencies ● Calculate “rates”
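Once words are tagged, tallying frequencies and turning them into rates is a short step. The sketch below assumes the input is a list of `(word, tag)` pairs with Penn Treebank style tags, as produced by a tagger such as NLTK's; grouping tags by their first two letters is an illustrative choice, not something stated in the slides.

```python
from collections import Counter


def pos_rates(tagged: list[tuple[str, str]]) -> dict:
    """Share of each coarse tag among word tokens (punctuation tags are skipped)."""
    # Collapse fine-grained tags (VB, VBD, VBG, ...) to a two-letter prefix.
    counts = Counter(tag[:2] for _, tag in tagged if tag[:1].isalpha())
    total = sum(counts.values())
    return {tag: n / total for tag, n in counts.items()} if total else {}
```

Under this grouping, the `"VB"` entry is a verb rate: the ratio of verbs to the word count in the post.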
  • 35. Explanatory or predictive modelling? ● Actually both. ● First ‘interpret’ a classifier (explanatory) ● Secondly need a ‘real-time’ detection system (predictive)
  • 36. Data modelling strategy (used for initial ML runs) Aggregation of posts ● pandas: groupby, agg by username Balancing out the dataset ● Many more partner users than sufferers ● Subsample larger (partner) dataset to even things up Validate using random train and test sets ● Randomly select 80% of users for training, 20% test
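The three steps of this strategy can be sketched without pandas, which the slides name for the aggregation step. The function names below are hypothetical, and the median is used as the aggregate because the results slides report median values per user.

```python
import random
from collections import defaultdict
from statistics import median


def aggregate_by_user(posts):
    """posts: iterable of (username, feature_value). Returns {username: median}."""
    by_user = defaultdict(list)
    for user, value in posts:
        by_user[user].append(value)
    return {user: median(values) for user, values in by_user.items()}


def balance(users_by_label, rng):
    """Subsample every class down to the size of the smallest class."""
    n = min(len(users) for users in users_by_label.values())
    return {label: rng.sample(sorted(users), n) for label, users in users_by_label.items()}


def train_test_split_users(users, rng, train_frac=0.8):
    """Randomly assign whole users (not posts) to the train or test set."""
    shuffled = list(users)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]
```

Splitting by user rather than by post keeps all of one person's posts on the same side of the split, so the test set measures generalisation to unseen users.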
  • 37. Model Results for Misc. Features ● Median values (aggregated over all posts per user) Best: SVM Radial basis function classifier (with grid search) User classification accuracy: 57%
  • 38. Model Results for Memory Features ● Median values (aggregated over all posts per user) Best: K-nearest neighbours Classifier User classification accuracy: 63%
  • 39. Model Results for Readability Features ● Median values (aggregated over all posts / user) Best: K-nearest neighbours Classifier User classification accuracy: 59%
  • 40. Model Results for Part-Of-Speech Features ● Median values (aggregated over all posts per user) Best: SVM Radial basis function classifier (with grid search) User classification accuracy: 61%
  • 41. Model Results for All Features ● Median values (aggregated over all posts per user) Best: Naïve Bayes Classifier User classification accuracy: 63%
  • 42. Re-think: Classify posts, not users ● Currently group by userID ● Some users post more than others ● Posts would utilise full “richness” of the dataset ● Double round of sampling required on post set: ○ 3 - 4 times more “partners” than dementia sufferers ○ Partners post approx. 3 times more posts than sufferers do
  • 43. Model Results for All Features (by post) ● Filtered set of posts Best: Random Forest Classifier Accuracy of 68% in ability to classify a post
  • 45. Results in summary ● Best performing feature group so far on aggregated set by user: ○ Memory-based features ● Best performing individual feature on aggregated set by user: ○ Verb rate = ratio of verbs to word count in post ● Best performing individual feature on individual post: ○ Cosine similarity to previous post ● Aligns with symptoms expected in early stage to mild dementia
  • 46. Future avenues ● Data ○ Further data gathering (more blogs including non-Alzheimer's topic blogs) ○ Better user identification (e.g. active learning) ● Features ○ More and better ○ Distinguish individual types of dementia ○ More memory-related features (e.g. LSI) ● Clustering of posts into ‘topics’ or users into ‘types’ ○ gensim / LDA topic modelling ○ Early stage / medium condition / advanced condition posters ● Classification and modelling ○ Time series analysis ○ New sampling techniques, input validation and models
  • 47. Future: Time series analysis ● Noisy datasets ○ Apply numerical Bayesian inference ● Are we looking for a steady change in the mean? ○ Ramp detection ● Or a sudden change in variance? ○ Step change detection (Plots: Dementia sufferer, Partner)
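As a first look at the two questions above, simple descriptive statistics can be computed before any Bayesian modelling. The sketch below is not the Bayesian inference the slide proposes: it only estimates a least-squares slope (a steady change in the mean) and a before/after variance ratio at a chosen split point (a sudden change in variance). Both function names are hypothetical.

```python
from statistics import mean, pvariance


def slope(values):
    """Least-squares slope of values against their index (ramp indicator)."""
    n = len(values)
    x_bar = (n - 1) / 2
    y_bar = mean(values)
    num = sum((i - x_bar) * (y - y_bar) for i, y in enumerate(values))
    den = sum((i - x_bar) ** 2 for i in range(n))
    return num / den


def variance_ratio(values, split):
    """Variance after the split divided by variance before it (step indicator)."""
    before, after = values[:split], values[split:]
    var_before = pvariance(before)
    if var_before == 0:
        raise ValueError("variance before the split is zero")
    return pvariance(after) / var_before
```

A slope near zero with a variance ratio far from 1 would point to a step change in variance rather than a ramp in the mean, although noisy per-post features would need a proper significance test.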
  • 48. Conclusions ● Introduction to Alzheimer’s and its impact ● Explanation of our technical approach and surrounding challenges ● Initial observations and predictions ● Tough problem and a worthwhile cause for data science ● Please contact us if you would like to help, or have ideas: frank.kelly@cantab.net https://alzhack.wordpress.com/contribute-2/ Thank you!