
NLP & ML Webinar

Artificial intelligence (AI) technologies, such as natural language processing (NLP), have been around for some time, and more recently there has been much hype surrounding the potential of combining AI with machine learning (ML) for decision making. But has it met the challenge? This webinar reviews what NLP is, the role NLP plays in machine learning approaches such as deep learning, and some real-world use cases in life sciences and healthcare that aim to improve patient outcomes.

Published in: Healthcare

  1. 1. This webinar is being recorded
  2. 2. Natural Language Processing and Machine Learning: Beyond the Hype. A Pistoia Alliance Debates Webinar, moderated by David Milward, Linguamatics. September 14, 2017
  4. 4. Poll Question 1: What role do you play in your company? A. IT B. Data scientist/bioinformatician C. Clinical/bench scientist D. Information professional E. Other
  5. 5. ©PistoiaAlliance The Panel. David Milward, Ph.D., CTO, Linguamatics: David Milward is chief technology officer (CTO) at Linguamatics. He is a pioneer of interactive text mining and a founder of Linguamatics. He has over 20 years' experience in product development, consultancy and research in natural language processing (NLP). After receiving a PhD from the University of Cambridge, he was a researcher and lecturer at the University of Edinburgh. He has published in the areas of information extraction, spoken dialogue, parsing, syntax and semantics. Chengyi Zheng, Ph.D., NLP Specialist, Kaiser Permanente: Chengyi Zheng, PhD, is an NLP specialist at Kaiser Permanente Southern California. He has worked on over 30 research projects using electronic health record (EHR) data from millions of patients. He is the principal investigator of a CDC-funded study involving 5 health care institutions on using NLP in vaccine safety studies. He won the Kaiser Permanente predictive modeling competition, and he placed first in the innovation competition (InnoCentive@Lilly) while serving as a biomedical informatics scientist at Eli Lilly. He trained in computer science with a concentration in speech recognition. He will share some experiences of using NLP and machine learning on EHR data for outcomes prediction. Eugene Myshkin, Ph.D., Senior Research Scientist, Clarivate: Eugene Myshkin, PhD, is a senior scientist in bioinformatics at Clarivate Analytics. He has over 15 years' experience in drug discovery, cheminformatics and bioinformatics. He has also been involved in a number of text mining projects, including mining of chemical reagents and antibodies from scientific literature.
  6. 6. Agenda • AI, NLP and ML (David) • Using NLP and ML in clinical research (Chengyi) • Network and pathway driven machine learning approaches to biomarker discovery and patient stratification (Eugene)
  7. 7. NLP, AI and Machine Learning. David Milward, PhD, CTO, Linguamatics, 2017
  8. 8. Overview • AI (Artificial Intelligence) • NLP (Natural Language Processing) and its applications in life sciences • ML (Machine Learning) and DL (Deep Learning) • NLP to feed ML-based DS (Decision Support) • ML in NLP. (Diagram labels: AI, ML, DL, NLP, DS.) © 2017 Linguamatics
  14. 14. Artificial Intelligence (AI) • Artificial intelligence is intelligence exhibited by machines • The central goals of AI research include reasoning, knowledge, planning, learning, natural language processing (communication), perception and the ability to move and manipulate objects • As machines become increasingly capable, tasks considered as requiring "intelligence" are often removed from the definition, leading to the quip “AI is whatever hasn't been done yet” (source: Wikipedia)
  15. 15. Natural Language Processing (NLP) • Processing of natural languages, e.g. English, French, Chinese, by computers • NLP is part of AI, but also key to other areas of AI, e.g. providing decision support − If 80% of knowledge is unstructured, we need NLP to get the right information to provide good suggestions − Currently many AI projects are limited: they can only address questions where there is structured data − Worse, they often use inappropriate structured data, such as ICD billing codes for non-billing tasks
  16. 16. Find information however it is expressed (NLP) • Different word, same meaning: cyclosporine; ciclosporin; Neoral; Sandimmune • Different expression, same meaning: Non-smoker; Does not smoke; Does not drink or smoke; Denies tobacco use • Different grammar, same meaning: 5mg/kg of cyclosporine daily; 5mg/kg/d of cyclosporine; cyclosporine 5mg/kg/day • Same word, different context: Diagnosed with diabetes; Family history of diabetes; No family history of diabetes
  17. 17. Represent it in a standardized form (Concept: Text → Normalized Value) • Diseases: breast cancer; carcinoma of the breast → Breast Neoplasm • Genes: Raf-1; Raf I → RAF1 • Dates: 27th Feb 2014; 2014/02/27 → 20140227 • Measurements: 0.2g; Two hundred milligrams → 200 mg • Mutations: Val 158 Met; Val by Met at codon 158 → V158M • Example annotation on the text "nimesulide, a selective COX2 inhibitor, …": relation "inhibits"; Entrez Gene ID: 5743
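The normalization on slide 17 could be sketched roughly as follows. This is a minimal illustration using only the examples shown on the slide, not how Linguamatics I2E works internally; the function names, the tiny synonym table and the amino-acid subset are all assumptions made for the example.

```python
import re

# Hypothetical lookup tables covering only the examples on the slide.
MONTHS = {"jan": 1, "feb": 2, "mar": 3, "apr": 4, "may": 5, "jun": 6,
          "jul": 7, "aug": 8, "sep": 9, "oct": 10, "nov": 11, "dec": 12}
AMINO_ACIDS = {"val": "V", "met": "M", "ala": "A", "gly": "G", "leu": "L"}
SYNONYMS = {"raf-1": "RAF1", "raf i": "RAF1",
            "breast cancer": "Breast Neoplasm",
            "carcinoma of the breast": "Breast Neoplasm"}


def normalize_concept(text):
    """Map a surface form to its preferred term, or None if unknown."""
    return SYNONYMS.get(text.strip().lower())


def normalize_date(text):
    """Return YYYYMMDD for '2014/02/27' or '27th Feb 2014' style dates."""
    text = text.strip()
    m = re.fullmatch(r"(\d{4})/(\d{2})/(\d{2})", text)
    if m:
        return "".join(m.groups())
    m = re.fullmatch(
        r"(\d{1,2})(?:st|nd|rd|th)?\s+([A-Za-z]{3})[A-Za-z]*\.?\s+(\d{4})", text)
    if m:
        day, month_name, year = m.groups()
        month = MONTHS.get(month_name.lower())
        if month is not None:
            return f"{year}{month:02d}{int(day):02d}"
    return None


def normalize_mass(text):
    """Return a mass expressed in mg, e.g. '0.2g' becomes '200 mg'."""
    m = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*(mg|g)\s*", text.lower())
    if not m:
        return None
    value = float(m.group(1)) * (1000 if m.group(2) == "g" else 1)
    return f"{value:g} mg"


def normalize_mutation(text):
    """Return short notation such as 'V158M' for two phrasings of a substitution."""
    t = text.strip().lower()
    m = re.fullmatch(r"([a-z]{3})\s*(\d+)\s*([a-z]{3})", t)
    if m:
        ref, pos, alt = m.groups()
    else:
        m = re.fullmatch(r"([a-z]{3}) by ([a-z]{3}) at codon (\d+)", t)
        if not m:
            return None
        ref, alt, pos = m.groups()
    if ref not in AMINO_ACIDS or alt not in AMINO_ACIDS:
        return None
    return f"{AMINO_ACIDS[ref]}{pos}{AMINO_ACIDS[alt]}"
```

Spelled-out quantities such as "Two hundred milligrams" are deliberately left unhandled here; a real system would need number-word parsing and much larger vocabularies.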
  18. 18. From Bench to Bedside: NLP Provides Insight. Stages: Discovery; Development; Delivery. Steps: Idea; Basic research; Clinical trials (Phase 1, Phase 2, Phase 3); Regulatory approval; Patient care. Business critical questions: What targets are involved in bone cancer? What companies are patenting a particular technology? What are the safety risks of my drug? Where can I site my Phase 1, Phase 3 clinical study? What are the clinical risks for my patients?
  19. 19. Direct access to the Unstructured. Example criteria: Cancer patients; Weight ≥ 80kg; Below 60 years old; Reports after 2010; With mutation C677T
  20. 20. Machine Learning • Machine Learning is used for AI in general and as a technique within NLP • 3 main flavours: − Supervised: uses annotated data mapping between inputs and outputs − Semi-supervised: uses machine analysis but incorporates a human in the loop − Unsupervised: uses unannotated data, usually at very large scale • Recent successes with deep learning approaches based on neural networks for supervised and unsupervised ML, e.g. − Machine translation using parallel corpora − Image classification in medicine
  21. 21. Using NLP to feed other AI • NLP provides access to the 80% of information in unstructured text • Provides a set of potential features to be used in, e.g., ML models for Decision Support • Example: building risk models from RWD sets − Predicting patients at risk of misusing opioid prescription drugs (AMIA November 2017) − Features extracted by Linguamatics I2E from 8.9 million de-identified medical record full-text transcripts from RealHealthData − SVM classifier trained on the features to flag patients at risk
  22. 22. Machine Learning in NLP • Supervised ML − Requires large-scale, representative annotated documents − Main paradigm for core NLP components − For extraction patterns, used in academic systems but less commonly in commercial ones • Semi-supervised ML − Useful for new tasks or data sets where there is no existing representative annotated data − Useful where a task is initially ill-defined − Puts a human in the loop judging suggestions from the machine learning − Can provide good quality results quickly, e.g. to test whether a feature extracted by NLP is useful for an ML model • Unsupervised ML − Uses large-scale unannotated data − Key example is learning the meaning of a word via the context it keeps (word embeddings)
  23. 23. Semi-Supervised ML Approaches • Similar distributions for words and syntactic constructions • Zipf’s Law: the frequency of any word is inversely proportional to its rank in the frequency table • Automatically discover what is in the data using an interactive, agile text mining platform such as Linguamatics I2E • A long tail of infrequent cases − prioritize the more frequent constructions − generalize to cover items in the tail
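To make the Zipf's Law point on slide 23 concrete, here is a small sketch that builds a rank/frequency table and measures how much of a corpus the most frequent items cover. The token stream is invented for illustration and the function names are hypothetical; it only shows why prioritizing frequent constructions pays off first.

```python
from collections import Counter


def rank_frequency(tokens):
    """Return (rank, word, count) tuples, most frequent word first."""
    counts = Counter(tokens)
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(rank, word, count)
            for rank, (word, count) in enumerate(ordered, start=1)]


def head_coverage(table, top_n):
    """Fraction of all tokens accounted for by the top_n ranked words."""
    total = sum(count for _, _, count in table)
    return sum(count for _, _, count in table[:top_n]) / total


# Invented toy corpus whose counts roughly follow count ~ 12 / rank.
tokens = ("pain " * 12 + "swelling " * 6 + "erythema " * 4 +
          "tophus " * 3 + "podagra " * 2 + "warmth").split()
table = rank_frequency(tokens)

for rank, word, count in table:
    # Under Zipf's Law, rank * count stays roughly constant.
    print(rank, word, count, rank * count)
```

In this toy table the two most frequent words already account for 18 of the 28 tokens, while the remaining four words share the tail.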
  24. 24. Semi-Supervised NLP using Linguamatics I2E
  25. 25. Summary • NLP is critical to the success of many ML projects − access to the unstructured text is key to using ML widely, not just where there is convenient structured data • Semi-supervised approaches to NLP provide an efficient way to capture features for ML projects
  26. 26. Poll Question 2: What is your company’s primary use for NLP? A. Early Discovery/ Pre-clinical B. Clinical C. Real world data D. Other E. Don’t use NLP
  27. 27. Using NLP and ML in clinical research. Chengyi Zheng, PhD, MS. DEPARTMENT of Research & Evaluation
  28. 28. Two weeks of records for a patient in an EMR system (timeline from 10/6/2012 to 10/19/2012): 10/7/2012 Pt called; 10/7/2012 Nurse Called Back; 10/8/2012 Orthopedic office visit (Where: Medical Center, Department); 10/8/2012 Progress Notes: Reason for visit: Knee Pain, Vital Sign/BMI/Pain level/History, PE/Findings/Impression/A&P, Dx: icd-9 code, Nurse Exam Note: …; 10/9/2012 Lab; 10/10/2012 Pre-op dental exam (ext); 10/6/2012 Imaging: DEXA Bone density; 10/11/2012 office visit; 10/11/2012 Rx Prescribed; 10/10/2012 Surgery Scheduled; 10/11/2012 Office Visit: Sinus Congestion, Ankle itchy, Dx: 401.9 Essential Hypertension, 274.9 Gout, 461.9 Acute Sinusitis; 10/12/2012 Picked up the Rx; 10/13/2012 Pt missed appt.; 10/13/2012 Telephone Consult Healthy bones PN; 10/14/2012 Pt emailed: Drug adverse event; 10/14/2012 Pt called cancerous area; 10/15/2012 EKG Dx: Screening; 10/15/2012 Ear Wax Wash; 10/18/2012 Pathology Report Out; 10/16/2012 Procedure: Remove Skin; 10/16/2012 - 10/19/2012 Hospitalization. 5 Ws: What, Who, When, Where and Why. Membership length: 70% > 5 years, 50% > 10 years.
  29. 29. 5 Ws: What, Who, When, Where and Why • What – What is the reason for the visit? – What happened? (pain after a fall, pain after drinking a beer?) • Who – Who is the caregiver? (primary physician, rheumatologist?) – What do we know about this patient? (age, race, past medical history, etc.) • Where – Where did this visit occur? • When – When did the problem start? • Why – Why did this problem happen? Possible causes?
  30. 30. Visual representation of KPSC research databases
  31. 31. Case study: Identify acute gout flare • Published methods to identify gout flares using claims data – Clinical coding is unreliable: under-coding, over-coding, too general – Medication is unreliable: drugs for gout maintenance; drugs also used for other diseases (which share similar symptoms) • NLP has been used to: – Identify study population and patient information – Identify and extract clinical variables (genetic, biopsy, radiology) – Evaluate patient status (disease progression, medication status)
  32. 32. Solution and challenges (NLP). Challenges: – Gout is a chronic disease which can be controlled but not cured: signs and symptoms can appear in follow-up visits; need to differentiate between acute and chronic status – Gout population is generally old, with comorbidities sharing similar symptoms: 100+ types of arthritis (> 50 million people); pain, erythema, and joint swelling – Documented information varies across clinical notes. Standard solutions: – Each search query captures one set of information – Each search query has its own sensitivity/specificity etc. – Logic operators combine search results (union, join, etc.): difficult to optimize the overall sensitivity/specificity etc.
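The difficulty described on slide 32 can be illustrated with a small sketch: two hypothetical search queries are combined with set operations and scored against a reference standard. The note IDs, query results and function names are invented for the example; only the metric definitions (sensitivity, specificity, PPV, NPV) are standard.

```python
def confusion_metrics(predicted, actual, universe):
    """Compute sensitivity, specificity, PPV and NPV from sets of note IDs."""
    tp = len(predicted & actual)             # flagged and truly positive
    fp = len(predicted - actual)             # flagged but negative
    fn = len(actual - predicted)             # missed positives
    tn = len(universe - predicted - actual)  # correctly left unflagged

    def ratio(numerator, denominator):
        return numerator / denominator if denominator else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
    }


# Invented toy data: 10 notes, 4 of which truly describe a flare.
universe = set(range(1, 11))
actual = {1, 2, 3, 4}
query_a = {1, 2, 5}   # e.g. a symptom query (hypothetical)
query_b = {2, 3, 6}   # e.g. a treatment query (hypothetical)

union_metrics = confusion_metrics(query_a | query_b, actual, universe)
intersection_metrics = confusion_metrics(query_a & query_b, actual, universe)
```

With this toy data the union finds 3 of the 4 true flares but adds two false positives, while the intersection has no false positives but finds only 1 of the 4. Each logical combination moves sensitivity and specificity in opposite directions, which is why tuning the overall result by hand is hard.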
  33. 33. Mining vs. NLP & ML in clinical research. Steps: 1. Preliminary analysis, estimate feasibility 2. Develop plan, estimate cost 3. Seek permit (government vs. IRB) 4. Mine (mining equipment vs. NLP) – Focus on completeness (high sensitivity) – Shallow & deep mining (good specificity) 5. Refine (chemical process vs. ML) – Improve purity (higher specificity) 6. Manual verification (optional) 7. Deliver to customer. “art and science combined”; “resource-heavy and time-consuming process”
  34. 34. Solution and challenges (NLP+ML). Goal: • NLP focuses on sensitivity, or information completeness – Separate ores from rock • ML focuses on improving the specificity – Improve purity without much loss of sensitivity. Solution: • NLP results as input features to the ML system – Identify related signs and symptoms – Identify temporal relationship (when and how long?) – Identify disease association (related to any other disease?) – Identify implicit and explicit mention of gout flare – Identify treatment plan associated with disease onset
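A toy version of the two-stage idea on slide 34 might look like the sketch below: a broad keyword stage keeps every note that mentions gout (favoring sensitivity), and a trained classifier then filters those candidates using NLP-derived features (improving specificity). Everything here is an assumption for illustration: the keyword lists, the example notes and their labels are invented, and a simple perceptron stands in for the ML stage because the slides do not say which model the gout study used.

```python
SYMPTOMS = ("pain", "swelling", "erythema")
ONSET = ("sudden", "acute", "since yesterday", "woke up with")
TREATMENT = ("colchicine", "prednisone")


def nlp_candidate(note):
    """Stage 1 (high sensitivity): keep any note that mentions gout at all."""
    return "gout" in note.lower()


def nlp_features(note):
    """Binary features: symptom mention, onset cue, flare treatment."""
    text = note.lower()
    return [int(any(k in text for k in group))
            for group in (SYMPTOMS, ONSET, TREATMENT)]


def predict(weights, x):
    """Linear decision rule; the last weight acts as the bias."""
    score = sum(w * xi for w, xi in zip(weights, x + [1]))
    return int(score > 0)


def train_perceptron(rows, labels, epochs=200):
    """Classic perceptron updates; stops once an epoch makes no mistakes."""
    weights = [0.0] * (len(rows[0]) + 1)
    for _ in range(epochs):
        mistakes = 0
        for x, y in zip(rows, labels):
            pred = predict(weights, x)
            if pred != y:
                mistakes += 1
                for i, xi in enumerate(x + [1]):
                    weights[i] += (y - pred) * xi
        if mistakes == 0:
            break
    return weights


def flag_flares(notes, weights):
    """Stage 1 then stage 2: candidates that the classifier also accepts."""
    candidates = [n for n in notes if nlp_candidate(n)]
    return [n for n in candidates if predict(weights, nlp_features(n))]


# Invented training notes with invented labels (1 = acute flare).
TRAINING = [
    ("Gout: sudden pain in left toe, started colchicine", 1),
    ("Woke up with swelling of the ankle, known gout", 1),
    ("Gout stable on allopurinol, no complaints", 0),
    ("History of gout, chronic knee pain from osteoarthritis", 0),
    ("Gout follow-up, prednisone taper for asthma", 0),
    ("Acute gout attack suspected, colchicine prescribed", 1),
]
weights = train_perceptron([nlp_features(n) for n, _ in TRAINING],
                           [y for _, y in TRAINING])

new_notes = [
    "Sudden erythema and pain at the MTP joint, gout flare",
    "Gout well controlled, mild knee pain after running",
    "Sinus congestion, ankle itchy",
]
flagged = flag_flares(new_notes, weights)
```

The third note never reaches the classifier because stage 1 rejects it, and the second is a candidate that stage 2 filters out, which mirrors the "separate ores from rock, then improve purity" analogy.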
  35. 35. Overview of the system development steps. Study period: 1/1/2007 to 12/31/2010. Patients > 18 years, with a diagnosis of gout and on urate-lowering therapy. Within [-3,+12] months of index date: 599,317 clinical notes for 16,519 patients.
  36. 36. Overview of the NLP+ML system
  37. 37. Performance comparisons (bar charts; all values in %, listed as Sensitivity / Specificity / PPV / NPV). Clinical note level gout flare identification: Rheumatologist 1: 81.1 / 95.4 / 88.3 / 92.2; Rheumatologist 2: 90.9 / 97.3 / 93 / 96.5; NLP+ML: 84.8 / 92.2 / 81.1 / 93.9. Identify patients with ≥ 1 gout flares: Rheumatologist 1: 98.5 / 92.9 / 97.1 / 96.3; Rheumatologist 2: 97.1 / 92.9 / 97.1 / 92.9; NLP+ML: 98.5 / 96.4 / 98.5 / 96.4; ICD-9: 88.2 / 89.3 / 95.2 / 75.8. Identify patients with ≥ 3 gout flares (refractory gout): Rheumatologist 1: 74.2 / 92.3 / 82.1 / 88.2; Rheumatologist 2: 83.9 / 95.4 / 89.7 / 92.5; NLP+ML: 93.5 / 84.6 / 74.4 / 96.5; ICD-9: 41.9 / 95.4 / 81.3 / 77.5.
  38. 38. Results • Note level (gout flare, n = 599,317): – NLP: 49,415 positive cases => ML: 18,869 positive cases • Patient level (with ≥ 3 flares, n = 16,519): – Number of patients: 1,402 (NLP+ML) vs. 516 (Claim) – Sensitivity: 93.5% (NLP+ML) vs. 41.9% (Claim) • Impact: – Identify refractory disease patients – Estimate market size (KPSC / US population = 4.5/325 million = 1.4%) – Better disease management, improve quality of life, and help reduce healthcare resource use: 1,402 patients is more manageable than 16,519 patients
  39. 39. ML in healthcare • Tremendous opportunities – Prediction: high utilizers, risk scores – Identification: cases, outcomes, social needs – Image recognition: pathology and radiology images • Challenges (Data) – Data quality: dirty, missing data – Heterogeneous data: different systems – Structured, semi-structured and free text data – Image, scanned documents – Genetic and biobank data • Challenges (People) – Who understands NLP, ML and healthcare – Who understands the complexity of healthcare data
  40. 40. Poll Question 3: How does your company primarily use machine learning in drug discovery? A. Target prediction and repositioning B. Biomarker discovery C. Patient stratification D. Other E. We don’t use machine learning
  41. 41. Network and pathway driven machine learning approaches to biomarker discovery and patient stratification. Eugene Myshkin, PhD. September 2017
  42. 42. CLARIVATE ANALYTICS TEXT MINING • Clarivate Analytics literature data feed • Comprehensive coverage – >20,000 journals – Journal content mirrors: Current Contents; Web of Science; Biosis; International Pharmaceutical Abstracts; Derwent Drug File – http://ip-science.thomsonreuters.com/cgi-bin/jrnlst/jlresults.cgi?PC=MASTER • Latest information – Updated with over 170,921 articles/month, or 2,051,051+ articles/year • Full text, cover to cover searching of all journals • Comprehensive synonym collections • Controlled vocabulary management software to support mining
  43. 43. CLARIVATE ANALYTICS LIFE SCIENCES SOLUTIONS • Pharmacovigilance Literature Monitoring • Biological and Chemical Reagent Monitoring • Concepts in social media • Automated Curation of Clinical Data • Protein and Gene Variant Monitoring
  44. 44. USING NLP FOR MANUSCRIPT MATCHING: Analyze citation connections to place the publication in the right journal
  45. 45. PITFALLS OF NLP FEATURES FOR ML. FOCUS OF DRUG DISCOVERY: DRUG, TARGET, DISEASE. These associations can be obtained with NLP, but precision is a problem: a flood of false positives, and the need to hire people just to sort the true alerts from the false ones. • 1-10 million features • Feature vectors are binary and sparse • Feature redundancy • Feature selection takes a long time
  46. 46. METABASE MANUALLY ANNOTATED CONTENT. PUBLICATIONS (209 for EGF-EGFR interaction) • Manual annotation from publications • Team of PhDs, MDs • Advanced editorial systems • Controlled vocabularies • Multiple levels of QC • Invested more than 400 man years. MOLECULAR INTERACTION NETWORK: ~ 1,500,000 molecular interactions. PATHWAY: ~ 3,000 pathways
  47. 47. INTEGRATED APPROACH: Pathway knowledge; Pathway-driven approaches; Statistical approaches. Applications: 1. Target identification or repositioning 2. Biomarker discovery 3. Patient stratification
  48. 48. WHY DIFFERENT PATIENT RESPONSE? Blockbuster strategy: “One drug for all patients”. Observed responses: Drug toxic but beneficial; Drug toxic but NOT beneficial; Drug NOT toxic and beneficial; Drug NOT toxic and NOT beneficial. New strategy is needed. Patient stratification: “The most efficient and safe drug for a cohort of patients”
  49. 49. HOW CAN PATIENTS BE STRATIFIED? Mechanism 1: Biomarkers; Mechanism 2: Biomarkers. Biomarker: measurable molecular indicator of disease subtype/progress, drug efficacy, or side effect/toxicity • Identify subtypes, resulting in multiple drug targets rather than one • A shift from the presumption of one disease to multiple diseases would reframe the drug development strategy
  50. 50. ORION BIONETWORKS: Orion Bionetworks (Cohen Veteran Biosciences) is an alliance of world leading organizations in patient care, computational modelling, translational research and patient advocacy that aims to develop open-source computational models for multiple sclerosis and improve upon existing analytical tools for model development. ~186 subjects with gene expression data and clinical parameters such as time to relapse. GOALS: • Understand the structure of the population based on the molecular data: identify cohorts of patients whose clinical course differs over time • Build stratification models • Identify new therapeutic targets
  51. 51. NETWORK/PATHWAY BASED METHODS FOR BIOMARKER DISCOVERY
  52. 52. 1. PATHWAY IDENTIFICATION: 56 pathways identified • 136 genes • 39/136 genes were present in multiple pathways • 44/136 genes are known MS biomarkers or drug targets (p = 5 × 10^-6) • Individual expression values of each member gene were averaged into a combined z-score • The association of the activity score with time to relapse was calculated in a Cox proportional hazard model
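One plausible reading of the pathway activity score on slide 52 is sketched below: each member gene is z-scored across patients, and the z-scores are averaged per patient. This is an assumption about the computation, not the presenters' confirmed pipeline; the gene names and values are invented, the population standard deviation is an arbitrary choice, and the Cox proportional hazard step is omitted because it needs a survival analysis library.

```python
from statistics import mean, pstdev


def zscores(values):
    """Standardize one gene's expression values across patients."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma if sigma else 0.0 for v in values]


def pathway_activity(expression, pathway_genes):
    """Average the per-gene z-scores into one activity score per patient.

    expression maps a gene name to a list of values, one per patient.
    Genes missing from the expression table are skipped.
    """
    z = {g: zscores(expression[g]) for g in pathway_genes if g in expression}
    if not z:
        raise ValueError("no pathway genes found in expression data")
    n_patients = len(next(iter(z.values())))
    return [mean([z[g][i] for g in z]) for i in range(n_patients)]


# Invented toy data: three patients, hypothetical gene names.
expression = {
    "GENE_A": [1.0, 2.0, 3.0],
    "GENE_B": [10.0, 20.0, 30.0],
    "GENE_C": [5.0, 5.0, 5.0],
}
scores = pathway_activity(expression, ["GENE_A", "GENE_B"])
```

Because both genes rise together across the three patients, the score is negative for the first patient, zero for the second and positive for the third, regardless of the genes' different measurement scales.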
  53. 53. 2. PATIENTS CLUSTERING BY PATHWAYS: Patients were clustered into groups based on k-means clustering of their pathway activity profiles; k=3 resulted in the best separation of patient profiles. Clusters are significantly associated with time to relapse in the presence of important clinical covariates.
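The clustering step on slide 53 can be sketched with a plain k-means implementation over pathway activity profiles. The profiles below are invented two-pathway toy vectors, and seeding the centroids with the first k points is a simplification chosen to keep the example deterministic; the slide does not describe the initialization or software used.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=100):
    """Lloyd's algorithm with deterministic seeding (first k points)."""
    centroids = [list(p) for p in points[:k]]
    assignment = None
    for _ in range(iterations):
        # Assign each patient to the nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no patient changed cluster
        assignment = new_assignment
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members)
                                for col in zip(*members)]
    return assignment, centroids


# Invented pathway activity profiles (two pathways per patient).
profiles = [(-1.0, -1.2), (0.1, 0.0), (1.1, 0.9),
            (-0.9, -1.0), (0.0, 0.2), (1.0, 1.2)]
assignment, centroids = kmeans(profiles, k=3)
```

In practice k would be chosen by comparing separation across several values, as the slide reports for k=3, and the resulting clusters would then be tested for association with time to relapse.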
  54. 54. 3. CLASSIFICATION MODEL • A K-Nearest Neighbor model was previously generated to predict risk groups 1-3 using all biomarkers • Feature selection was performed by taking the variable importance calculated from the trained KNN model • Forward feature selection was then conducted using 10-fold CV, adding features to the model in order of their importance • Once this process was complete, the predictive performance was evaluated in terms of the ability of the model to separate the three risk groups • Final feature set was applied to test data. Signature was reduced from 56 to 13 pathways, containing 65 genes
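The forward selection procedure on slide 54 could be sketched as follows, under several assumptions: the importance ranking is taken as given (here simply `[0, 1, 2]`), folds are assigned by index rather than at random, accuracy is used as the performance measure, and the data are synthetic with only the first feature carrying signal. None of these details come from the slides.

```python
from collections import Counter


def knn_predict(train_x, train_y, query, k=3):
    """Majority vote among the k nearest training points."""
    nearest = sorted(
        range(len(train_x)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_x[i], query)),
    )[:k]
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]


def cv_accuracy(rows, labels, features, folds=10, k=3):
    """Cross-validated accuracy using only the given feature columns."""
    correct = 0
    for fold in range(folds):
        test_idx = [i for i in range(len(rows)) if i % folds == fold]
        train_idx = [i for i in range(len(rows)) if i % folds != fold]
        train_x = [[rows[i][f] for f in features] for i in train_idx]
        train_y = [labels[i] for i in train_idx]
        for i in test_idx:
            query = [rows[i][f] for f in features]
            correct += knn_predict(train_x, train_y, query, k) == labels[i]
    return correct / len(rows)


def forward_select(rows, labels, ranked_features, folds=10, k=3):
    """Add features in importance order; keep the best-scoring prefix."""
    best_acc, best_size = -1.0, 0
    for size in range(1, len(ranked_features) + 1):
        acc = cv_accuracy(rows, labels, ranked_features[:size], folds, k)
        if acc > best_acc:  # strict: prefer the smaller signature on ties
            best_acc, best_size = acc, size
    return ranked_features[:best_size], best_acc


# Synthetic data: 30 subjects, risk groups 0-2; only feature 0 is informative.
labels = [i % 3 for i in range(30)]
rows = [[labels[i] * 10 + (i % 5) * 0.1, (i * 7) % 11, (i * 3) % 5]
        for i in range(30)]
selected, accuracy = forward_select(rows, labels, ranked_features=[0, 1, 2])
```

On this synthetic data the first feature alone already separates the three groups, so the selected signature stays at one feature. The same tie-breaking toward smaller prefixes is one way a signature could shrink, as in the reported reduction from 56 to 13 pathways.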
  55. 55. GENE ONLY MODEL WAS NOT ROBUST TO TEST DATA (figure panels: PATHWAY BASED APPROACH; GENE BASED APPROACH)
  56. 56. CONCLUSIONS • Signature differentiating between patient cohorts was reduced from 56 to 13 pathways • This new signature contains 65 genes • 13 biomarkers could stratify subjects into risk groups with statistically significant differences in time to relapse • This was validated in test subjects, with results consistent with what was observed in the training cohort • Pathway activities were more robust than gene expression
  57. 57. Poll Question 4: What is the greatest barrier to application of NLP/ML at your company? A. Technical expertise B. Access to data C. Data quality D. Management support/understanding E. Other
  58. 58. Poll Question 5: Do you expect an increase in ML within Life Science in the next 2 years? A. Yes B. No C. Don’t Know
  59. 59. Audience Q&A Please use the Question function in GoToWebinar
  60. 60. Where will AI/Deep learning have an impact in Life Science & Health? The next Pistoia Alliance Debates Webinar. Moderator: Nick Lynch, with Sean Ekins, CEO, Collaborations Pharmaceuticals Inc; David Pearah, CEO, HDF group; and Peter Henstock, Pfizer Research. Date: September 27, 2017. Check http://www.pistoiaalliance.org/pistoia-alliance-debates-webinar-series/ for the latest information
  61. 61. info@pistoiaalliance.org @pistoiaalliance www.pistoiaalliance.org
