Mining Unstructured Healthcare Data

Presentation by Alliance Health Chief Data Scientist Deep Dhillon to UW CS students on mining unstructured healthcare data. The talk gives technical details of a system designed to empower patients and health care professionals by automatically generating health care content that is fresh, authoritative, statistically relevant and not available by other means.

Published in: Technology


Transcript

  • 1. mining unstructured healthcare data
      deep dhillon, chief data scientist | ddhillon@alliancehealth.com | twitter.com/zang0
  • 2. alliance health networks
      • HCP/advanced patients: medify.com – 10,000% user growth past year
      • patients: diabeticconnect.com – #1 online diabetes site @ ~1.4M uniques; *connect.com – content sites + guided social networks
      • health care industry: patient surveying, matchmaking and analysis
  • 3. why mine healthcare text?
      anatomy of what patients currently use: webmd (e.g. drugs.com, yahoo health, etc.)
      content types: q/a, topic pages, news + media, experts, discussions
      • original content
      • addresses emotional needs
      • simple to understand
      • provides answers
      • original content
      • moderately authoritative
      • simple to understand
      • original content
      • good coverage in the head
      • fresh
      • provides answers
      • simple to understand
      • good coverage in the head
      • original content
      • typically aligned well with google searches, i.e. treatments for X, symptoms for Y
      • good coverage in the head
      • original content
      • aligns well with google searches
      • provides answers
      takeaways: original content=important, providing answers=important, head=developed, fresh=important, authority=important
  • 4. why mine healthcare text?
      Patients and HCPs need long tail, statistically meaningful, consumer friendly, authoritative and fresh health content.
      content types: q/a, topic pages, news + media, experts, discussions
      • manually written, not authoritative
      • not thorough, i.e. not long tail
      • not consistently credible, i.e. minimal accreditation
      • not statistically meaningful
      • sparse
      • manually written, moderately authoritative
      • not thorough, i.e. not long tail
      • manually written, not authoritative
      • not consistently credible, i.e. minimal accreditation
      • not thorough, i.e. not long tail
      • evergreen, rarely change
      • manually written, not authoritative
      • not consistently credible, i.e. minimal accreditation
      • not thorough, i.e. not long tail
      • manually written, not authoritative
      • not consistently credible, i.e. minimal accreditation
      • not statistically meaningful
      takeaways: manual=expensive, manual=head focused, manual=not authoritative, manual=old and dated
  • 5. automated content generation
      • cost effective
      • structured content
      • statistically based
      • scales to millions of patients
      • scales to long tail treatments + conditions
      • authoritative / citation driven
      • fresh
  • 6. types of text mining
  • 7. medify demo
  • 8. how do we mine text?
      diagram labels: assignments, curators, examples, knowledge, parser, Index, external repos like UMLS, Page, Search, Module
  • 9. annotate in the product:
      • Easier for people to give explicit GT feedback
      • Channels high visibility error angst productively
      • Channels more GT toward areas most seen by users
      • Override high visibility system mistakes with human data
      gt demo:
      • http://www.diabeticconnect.com/discussions/4790
      • https://www.medify.com/internal/annotate/abstract?abstractId=8181575
  • 10. Detailed Signals Preliminary Mining From Discussion Threads [bar chart; axis from 0 to 5,000 in steps of 500]
  • 11. Distribution of Signals Mined From Diabetic Connect Discussion Threads [pie chart]
      • Helper/Influencer: 33%
      • Medical History: 21%
      • Treatment Experience: 15%
      • Personal Account: 12%
      • Medical History - Condition: 10%
      • Medical History - Newly Diagnosed, Lifestyle Adjustment, Medical History - Symptom: 6%, 0%, 3% (values in the order extracted from the chart)
  • 12. knowledge base
      diagram labels: API, Research (Outcome), Discussion (Gromin), curator, UMLS
      • treatment/symptom/condition/demographic
          – identity (i.e. UMLS id)
      • taxonomical relations
          – hierarchical is_a
          – condition/arthritis/Rheumatoid Arthritis
      • synonym relations
          – RA=Rheumatoid Arthritis
          – metformim
      • polarity
          – effective=positive
          – ineffective=negative
      • parsing clues (ambiguity)
      • anaphor clues
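
The knowledge base items on slide 12 (identity, is_a hierarchy, synonyms, polarity) can be pictured as a small lookup structure. This is only a minimal sketch: the class names `Concept` and `KnowledgeBase`, the method names, and the placeholder id are assumptions made for illustration, since the slides do not show the real schema.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    concept_id: str  # placeholder value, not a real UMLS identifier
    name: str
    path: str        # is_a path, e.g. "condition/arthritis/Rheumatoid Arthritis"
    synonyms: set = field(default_factory=set)


class KnowledgeBase:
    """Toy knowledge base: synonym lookup, is_a checks, polarity cues."""

    def __init__(self):
        self._by_surface = {}
        # Polarity cues taken from the slide.
        self.polarity = {"effective": "positive", "ineffective": "negative"}

    def add(self, concept):
        # Index the canonical name and every synonym, case-insensitively.
        for surface in {concept.name, *concept.synonyms}:
            self._by_surface[surface.lower()] = concept

    def resolve(self, surface):
        """Return the concept for a surface form, or None if unknown."""
        return self._by_surface.get(surface.lower())

    def is_a(self, concept, ancestor):
        """True if `ancestor` appears above the concept in its is_a path."""
        parents = concept.path.lower().split("/")[:-1]
        return ancestor.lower() in parents


kb = KnowledgeBase()
kb.add(Concept("PLACEHOLDER-1", "Rheumatoid Arthritis",
               "condition/arthritis/Rheumatoid Arthritis", {"RA"}))
print(kb.resolve("RA").name)
```

Normalizing every surface form to one concept is what lets later stages treat "RA" and "Rheumatoid Arthritis" as the same entity.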
  • 13. abstract parser work flow
      1. sentences w/ high conclusion presence selected
      2. shallow entity tagging applied based on umls
      3. sentence text is modified to retain entities, optimize for performance, and eliminate unnecessary filler.
      4. deep typed dependency parsing of modified text
      5. 3 tier text classification applied on BOW + entities
      6. SVO triples extracted from dependencies
      7. rule based conclusions generated from triples
      8. confidence models applied to rule and classifier based conclusions
      9. knowledge base used to represent domain and discourse style.
      10. errors measured against curated data applied for improvements
      diagram stages: document, sentence selection, shallow tagging, sentence modification, dependency tree parse, triple extraction, 3 tier classification, concluder, confidence assignment, annotations, knowledge
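
Steps 6 and 7 of the abstract parser (SVO triples from dependencies, then rule based conclusions) might look roughly like the sketch below. The edges are hand-written `(head, relation, dependent)` tuples standing in for real parser output, using the `nsubj` and `dobj` relation labels from Stanford typed dependencies. The verb cue table and the functions `extract_svo` and `conclude` are hypothetical; the actual rules are not shown in the slides.

```python
from collections import defaultdict


def extract_svo(edges):
    """Collect (subject, verb, object) triples from dependency edges.

    `edges` is an iterable of (head, relation, dependent) tuples.
    """
    subjects, objects = defaultdict(list), defaultdict(list)
    for head, rel, dep in edges:
        if rel == "nsubj":
            subjects[head].append(dep)
        elif rel == "dobj":
            objects[head].append(dep)
    return [(subj, verb, obj)
            for verb in subjects
            for subj in subjects[verb]
            for obj in objects.get(verb, [])]


# Hypothetical verb cues; a real system would need context-aware rules.
VERB_POLARITY = {"reduced": "positive", "worsened": "negative"}


def conclude(triples, entity_types):
    """Toy rule: a treatment subject plus a cue verb yields a conclusion."""
    conclusions = []
    for subj, verb, obj in triples:
        if entity_types.get(subj) == "treatment" and verb in VERB_POLARITY:
            conclusions.append({"treatment": subj,
                                "target": obj,
                                "polarity": VERB_POLARITY[verb]})
    return conclusions


# Hand-written edges for "metformin reduced fasting glucose".
edges = [("reduced", "nsubj", "metformin"),
         ("reduced", "dobj", "glucose"),
         ("glucose", "amod", "fasting")]
triples = extract_svo(edges)
print(conclude(triples, {"metformin": "treatment", "glucose": "symptom"}))
```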
  • 14. discussion parser work flow
      1. sentences split from discussion thread messages
      2. shallow entity tagging applied based on umls
      3. sentence text is modified to retain entities, optimize for performance, and eliminate unnecessary filler.
      4. deep typed dependency parsing of modified text for select conclusion types
      5. select conclusion type based text classification applied on BOW + entities after dimensionality reduction
      6. SVO triples extracted from dependencies
      7. rule based conclusions generated from triples
      8. rule based KB driven anaphora resolution applied. Classification based conclusions added.
      9. knowledge base used to represent domain and discourse style.
      10. errors measured against curated data applied for improvements
      diagram stages: message, sentence split, shallow tagging, sentence modification, dependency tree parse, triple extraction, classification, concluder, anaphora resolution, conclusions, knowledge
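
Steps 2 and 3 of the discussion parser (shallow entity tagging, then sentence modification) could be approximated as follows. This sketch assumes a toy dictionary in place of the UMLS-based tagger, joins multi-word entities with underscores so they survive as single tokens, and drops a hypothetical filler list. The names `LEXICON`, `tag_entities` and `modify_sentence` are illustrative, not taken from the system.

```python
import re

# Toy lexicon standing in for a UMLS-backed tagger (assumption).
LEXICON = {
    "metformin": "treatment_medication",
    "rheumatoid arthritis": "condition",
}


def tag_entities(sentence, lexicon=LEXICON):
    """Return (start, end, mention, entity_type) for each dictionary match."""
    # Longest terms first so multi-word entities win over shorter ones.
    terms = sorted(lexicon, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b",
        re.IGNORECASE,
    )
    return [(m.start(), m.end(), m.group(0), lexicon[m.group(0).lower()])
            for m in pattern.finditer(sentence)]


def modify_sentence(sentence, lexicon=LEXICON,
                    fillers=("well", "basically", "honestly")):
    """Keep entities as single tokens and drop filler words."""
    pieces, last = [], 0
    for start, end, mention, _ in tag_entities(sentence, lexicon):
        pieces.append(sentence[last:start])
        pieces.append(mention.replace(" ", "_"))
        last = end
    pieces.append(sentence[last:])
    tokens = "".join(pieces).split()
    kept = [t for t in tokens if t.lower().strip(",.") not in fillers]
    return " ".join(kept)


print(modify_sentence("Well, metformin helped my rheumatoid arthritis"))
```

Shortening the sentence before the dependency parse is one way to read the slide's "optimize for performance"; the filler list here is purely illustrative.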
  • 15. feature engineering
      • entities, i.e. normalize synonyms, id new entity types, like social relations
      • entity types, i.e. metformin > treatment_medication
      • phrase driven cues, i.e. [have] [you] [considered] > suggestion_indicator
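
The feature types on slide 15 can be combined into one bag of features per sentence. The sketch below reuses the slide's own two examples (metformin > treatment_medication and [have] [you] [considered] > suggestion_indicator); the function name `featurize`, the feature key format and the simple whitespace tokenizer are assumptions.

```python
from collections import Counter

# Examples taken from the slide; real tables would be much larger.
ENTITY_TYPES = {"metformin": "treatment_medication"}
PHRASE_CUES = {("have", "you", "considered"): "suggestion_indicator"}


def featurize(sentence):
    """Build bag-of-words, entity-type and phrase-cue features."""
    tokens = [t.strip(".,?!").lower() for t in sentence.split()]
    tokens = [t for t in tokens if t]

    features = Counter(f"bow={t}" for t in tokens)

    # Entity type features generalize over individual drug names.
    for t in tokens:
        if t in ENTITY_TYPES:
            features[f"entity_type={ENTITY_TYPES[t]}"] += 1

    # Phrase cues fire when the token sequence appears contiguously.
    for cue, label in PHRASE_CUES.items():
        n = len(cue)
        for i in range(len(tokens) - n + 1):
            if tuple(tokens[i:i + n]) == cue:
                features[f"cue={label}"] += 1
    return features


print(featurize("Have you considered metformin?"))
```

A feature dictionary like this could be fed to any text classifier; which one the system used for each conclusion type is not stated in the slides.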
  • 16. anaphora resolution
      • relation structure, i.e. [person]>takes>it: "it" refers to treatment (i.e. not condition/symptom), and specifically: medication but not device
      • statistically driven, manually curated cues, i.e. drug > treatment/medication
      • filter
          – non matching antecedent candidates
          – singular/plural agreement
      • score candidates:
          – antecedent occurrence frequency
          – distance (#sents) from antecedent to anaphor
          – co-occurrence of anaphor w/ antecedent
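
The filter-then-score approach on slide 16 can be sketched as below. The `Candidate` fields, the linear scoring formula and its weights are all assumptions made for illustration; the slide names the three signals (frequency, sentence distance, co-occurrence) but not how they are combined.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    text: str
    entity_type: str     # e.g. "treatment/medication"
    plural: bool
    sentence_index: int  # sentence of the most recent mention
    frequency: int       # mentions in the thread so far
    cooccurrence: int    # times seen with this anaphor (hypothetical statistic)


def resolve(anaphor_plural, anaphor_sentence, required_type, candidates,
            weights=(1.0, 1.0, 0.5)):
    """Filter candidates by type and number, then return the best scorer."""
    w_freq, w_dist, w_cooc = weights

    # Filter: type must match the relation's constraint, number must agree,
    # and the antecedent must not appear after the anaphor.
    eligible = [c for c in candidates
                if c.entity_type.startswith(required_type)
                and c.plural == anaphor_plural
                and c.sentence_index <= anaphor_sentence]
    if not eligible:
        return None

    def score(c):
        distance = anaphor_sentence - c.sentence_index
        return w_freq * c.frequency - w_dist * distance + w_cooc * c.cooccurrence

    return max(eligible, key=score)


# "[person] takes it": the relation constrains "it" to a medication.
candidates = [
    Candidate("metformin", "treatment/medication", False, 1, 2, 3),
    Candidate("diabetes", "condition", False, 2, 3, 0),
    Candidate("insulin pump", "treatment/device", False, 2, 1, 0),
    Candidate("test strips", "treatment/device", True, 2, 1, 0),
]
best = resolve(False, 3, "treatment/medication", candidates)
print(best.text if best else None)
```

Filtering before scoring keeps frequent but incompatible antecedents, such as a condition mentioned many times, from outranking the only candidate the relation allows.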
  • 17. technology
      • Lang: Java + Python + Ruby
      • DB: Solr 4, Mongo DB, S3
      • Work: Map Reduce
      • Dependencies: Malt Parser, Stanford Parser
      • Misc: Tomcat, Spring, Mallet, Reverb, Minor Third
      • Tagging: Peregrine + home grown
  • 18. Data Pipeline
  • 19. request transaction flow
      diagram tiers: Solr 1, Solr 2, ... Solr N | Load Balancer – API | Cache | Portal 1, Portal 2, ... Portal N | Load Balancer – www.medify.com | browser
  • 20. questions?
  • 21. Advair: Experts vs. Patients
      • "medicalese" vs. patients' words
      • more granularity
      • a story like perspective w/ words of inspiration
      "My Pulmonologist today said that he had just come from the hospital bedside of a patient with my exact symptoms who is on oxygen and not doing well. If I hadn't been so diligent in following my asthma plan that could easily have been me. His telling me that really hit home. By the way, my Pulmonologist surprised me with his view on many of his patients. He actually got excited and thanked me for being knowledgeable about asthma, understanding my own body and health needs and for following my treatment plan. He said he doesn't get many patients who can actually talk about their disease, symptoms, time frames, treatments/medications, and non-medical measures they are taking. He also doesn't get many people who ask questions. My response to this: What are people thinking? They need to take control of their own health or noone else will."
