Artificial intelligence (AI) technologies, such as natural language processing (NLP), have been around for some time, and more recently there has been much hype surrounding the potential of combining AI with Machine Learning (ML) for decision making. But has it met the challenge? This webinar reviews what NLP is, the role NLP plays in machine learning approaches, such as deep learning, and some real-world use cases for application to life sciences and healthcare to improve patient outcomes.
2. Natural Language Processing and Machine Learning: Beyond the Hype
A Pistoia Alliance Debates Webinar
Moderated by David Milward, Linguamatics
September 14, 2017
4. Poll Question 1: What role do you play in your company?
A. IT
B. Data scientist/bioinformatician
C. Clinical/bench scientist
D. Information professional
E. Other
26. Poll Question 2: What is your company’s primary use for NLP?
A. Early Discovery/Pre-clinical
B. Clinical
C. Real world data
D. Other
E. Don’t use NLP
27. Using NLP and ML in clinical research
Chengyi Zheng, PhD, MS
DEPARTMENT of Research & Evaluation
28. Two weeks of records for one patient in an EMR system (10/6/2012 to 10/19/2012)
10/6/2012: Imaging: DEXA Bone density
10/7/2012: Pt called
10/7/2012: Nurse Called Back
10/8/2012: Orthopedic office visit. Where: Medical Center, Department
10/8/2012: Progress Notes: Reason for visit: Knee Pain; Vital Sign/BMI/Pain level/History; PE/Findings/Impression/A&P; Dx: ICD-9 code; Nurse Exam Note: …
10/9/2012: Lab
10/10/2012: Pre-op dental exam (ext)
10/10/2012: Surgery Scheduled
10/11/2012: Office visit
10/11/2012: Rx Prescribed
10/11/2012: Office Visit: Sinus Congestion; Ankle itchy; Dx: 401.9 Essential Hypertension, 274.9 Gout, 461.9 Acute Sinusitis
10/12/2012: Picked up the Rx
10/13/2012: Pt missed appt.
10/13/2012: Telephone Consult: Healthy bones PN
10/14/2012: Pt emailed: Drug adverse event
10/14/2012: Pt called: cancerous area
10/15/2012: EKG. Dx: Screening
10/15/2012: Ear Wax Wash
10/16/2012: Procedure: Remove Skin
10/16/2012 - 10/19/2012: Hospitalization
10/18/2012: Pathology Report Out
5 Ws: What, Who, When, Where and Why
Membership length: 70% > 5 years, 50% > 10 years.
29. 5 Ws: What, Who, When, Where and Why
What
– What is the reason for the visit?
– What happened? (pain after a fall, pain after drinking a beer?)
Who
– Who is the caregiver? (primary physician, rheumatologist?)
– What do we know about this patient? (age, race, past medical history, etc.)
Where
– Where did this visit occur?
When
– When did the problem start?
Why
– Why did this problem happen? Possible causes?
31. Case study: Identify acute gout flare
Published methods to identify gout flares using claims data
– Clinical coding is unreliable: under-coding, over-coding, too general
– Medication is unreliable:
Drugs for gout maintenance
Drugs also used for other diseases (which share similar symptoms)
NLP has been used to:
– Identify study population and patient information
– Identify and extract clinical variables (genetic, biopsy, radiology)
– Evaluate patient status (disease progression, medication status)
32. Solution and challenges (NLP)
Challenges:
– Gout is a chronic disease which can be controlled but not cured
Signs and symptoms could appear in follow-up visits
Differentiate between acute and chronic status
– Gout population is generally older, with comorbidities that share similar symptoms
100+ types of arthritis (> 50 million people)
Pain, erythema, and joint swelling
– Documented information varies across clinical notes
Standard solutions:
– Each search query captures one set of information
– Each search query has its own sensitivity/specificity etc.
– Logic operators combine search results (union, join, etc.)
Difficult to optimize the overall sensitivity/specificity etc.
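To illustrate why combining per-query results with logic operators makes the overall sensitivity/specificity hard to tune, here is a minimal sketch. The note IDs, the two query result sets, and the gold-standard labels are made-up toy data, not figures from the study.

```python
def sensitivity_specificity(predicted, gold, all_ids):
    """Return (sensitivity, specificity) of a predicted-positive set vs. a gold set."""
    negatives = all_ids - gold
    tp = len(predicted & gold)
    fn = len(gold - predicted)
    fp = len(predicted & negatives)
    tn = len(negatives - predicted)
    sens = tp / (tp + fn) if (tp + fn) else 0.0
    spec = tn / (tn + fp) if (tn + fp) else 0.0
    return sens, spec

# Hypothetical toy data: 10 notes, 4 of which truly describe a gout flare.
all_notes = set(range(10))
true_flares = {0, 1, 2, 3}
query_symptoms = {0, 1, 2, 5, 6}    # notes hit by an assumed "joint symptom" query
query_treatment = {1, 2, 3, 6, 7}   # notes hit by an assumed "flare treatment" query

union_hits = query_symptoms | query_treatment          # logical OR of the queries
intersection_hits = query_symptoms & query_treatment   # logical AND of the queries

for name, hits in [("symptoms", query_symptoms), ("treatment", query_treatment),
                   ("union", union_hits), ("intersection", intersection_hits)]:
    sens, spec = sensitivity_specificity(hits, true_flares, all_notes)
    print(f"{name:12s} sensitivity={sens:.2f} specificity={spec:.2f}")
```

On this toy data the union reaches sensitivity 1.0 but specificity drops to 0.5, while the intersection raises specificity to about 0.83 at the cost of sensitivity 0.5. Neither operator improves both at once, which is the optimization difficulty the slide points to.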
33. Mining vs. NLP & ML in clinical research
Steps:
1. Preliminary analysis, estimate feasibility
2. Develop plan, estimate cost
3. Seek permit (government vs. IRB)
4. Mine (mining equipment vs. NLP)
– Focus on completeness (high sensitivity)
– Shallow & deep mining (good specificity)
5. Refine (chemical process vs. ML)
– Improve purity (higher specificity)
6. Manual verification (optional)
7. Deliver to customer
“art and science combined” “resource-heavy and time-consuming process”
34. Solution and challenges (NLP+ML)
Goal:
NLP focuses on sensitivity or information completeness
– Separate ores from rock
ML focuses on improving the specificity
– Improve purity without much loss of sensitivity
Solution:
NLP results as input features to the ML system
– Identify related signs and symptoms
– Identify temporal relationship (when and how long?)
– Identify disease association (related to any other disease?)
– Identify implicit and explicit mention of gout flare
– Identify treatment plan associated with disease onset
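A minimal sketch of feeding NLP results into an ML step as input features. The slides do not name the ML algorithm, so a simple perceptron stands in here purely for illustration. The feature names, the NLP outputs, and the labels are all hypothetical.

```python
# Hypothetical feature names; a real NLP system would emit many more.
FEATURES = ["joint_symptom", "recent_onset", "other_arthritis", "flare_treatment"]

def to_vector(nlp_output):
    """Turn one note's NLP findings (a dict of flags) into a binary feature vector."""
    return [1 if nlp_output.get(name) else 0 for name in FEATURES]

def predict(weights, bias, x):
    """Label a note as flare (1) when the weighted feature sum is positive."""
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if score > 0 else 0

def train_perceptron(samples, labels, epochs=10):
    """Classic perceptron: adjust weights whenever a training note is misclassified."""
    weights, bias = [0] * len(FEATURES), 0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            error = y - predict(weights, bias, x)
            if error:
                weights = [w + error * xi for w, xi in zip(weights, x)]
                bias += error
    return weights, bias

# Made-up NLP outputs for candidate notes; label 1 = reviewed as a true flare.
nlp_outputs = [
    {"joint_symptom": True, "recent_onset": True, "flare_treatment": True},
    {"joint_symptom": True, "recent_onset": True},
    {"joint_symptom": True, "flare_treatment": True},
    {"joint_symptom": True, "other_arthritis": True},
    {"joint_symptom": True},
    {"other_arthritis": True},
]
labels = [1, 1, 1, 0, 0, 0]

X = [to_vector(o) for o in nlp_outputs]
weights, bias = train_perceptron(X, labels)
predictions = [predict(weights, bias, x) for x in X]
print(dict(zip(FEATURES, weights)), predictions)
```

In this toy run the learned weights end up positive for onset timing and treatment plan and negative for a competing arthritis mention, which mirrors the idea of using ML to raise specificity on top of a high-sensitivity NLP pass.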
35. Overview of the system development steps
Study period: 1/1/2007 to 12/31/2010.
Patients > 18 years, with a diagnosis of gout and on urate-lowering therapy.
Within [-3,+12] months of index date, 599,317 clinical notes for 16,519 patients.
36. Overview of the NLP+ML system
38. Results
Note level (gout flare, n= 599,317):
– NLP: 49,415 positive cases => ML: 18,869 positive cases
Patient level (with ≥ 3 flares, n=16,519):
– Number of patients: 1,402 (NLP+ML) vs. 516 (Claim)
– Sensitivity: 93.5% (NLP+ML) vs. 41.9% (Claim)
Impact:
– Identify refractory disease patients
– Estimate market size (KPSC / US population = 4.5/325 million = 1.4%)
– Better disease management, improve quality of life, and help reduce healthcare resource use.
1,402 patients is more manageable than 16,519 patients
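The proportions behind these results can be checked with a few lines of arithmetic. The counts are taken from the slide; the variable names are only illustrative.

```python
# Counts as reported on the slide.
notes_total = 599_317
nlp_positive = 49_415
ml_positive = 18_869
patients_total = 16_519
patients_nlp_ml = 1_402
patients_claims = 516

nlp_rate = nlp_positive / notes_total              # share of notes flagged by NLP
ml_kept = ml_positive / nlp_positive               # share of NLP hits retained by ML
patient_rate = patients_nlp_ml / patients_total    # share of patients with >= 3 flares
ratio_vs_claims = patients_nlp_ml / patients_claims
kpsc_share = 4.5 / 325                             # KPSC members / US population (millions)

print(f"NLP flags {nlp_rate:.1%} of notes; ML keeps {ml_kept:.1%} of those")
print(f"NLP+ML finds {patient_rate:.1%} of patients, {ratio_vs_claims:.1f}x the claims-based count")
print(f"KPSC share of US population: {kpsc_share:.1%}")
```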
39. ML in healthcare
Tremendous opportunities
– Prediction: high utilizers, risk scores
– Identification: cases, outcomes, social needs
– Image recognition: pathology and radiology images
Challenges (Data)
– Data quality: dirty, missing data
– Heterogeneous data: different systems
Structured, semi-structured and free text data
Image, scanned documents
Genetic and biobank data
Challenges (People)
– Who understands NLP, ML and healthcare
– Who understands the complexity of healthcare data
40. Poll Question 3: How does your company primarily use machine learning in drug discovery?
A. Target prediction and repositioning
B. Biomarker discovery
C. Patient stratification
D. Other
E. We don’t use machine learning
41. Network and pathway-driven machine learning approaches to biomarker discovery and patient stratification
Eugene Myshkin, PhD
September 2017
42. CLARIVATE ANALYTICS TEXT MINING
• Clarivate Analytics literature data feed
• Comprehensive coverage
– >20,000 journals
– Journal content mirrors: Current Contents; Web of Science; Biosis; International Pharmaceutical Abstracts; Derwent Drug File
– http://ip-science.thomsonreuters.com/cgi-bin/jrnlst/jlresults.cgi?PC=MASTER
• Latest information
– Updated with over 170,921 articles/month, or 2,051,051+ articles/year
• Full text, cover to cover searching of all journals
• Comprehensive synonym collections
• Controlled vocabulary management software to support mining
43. CLARIVATE ANALYTICS LIFE SCIENCES SOLUTIONS
• Pharmacovigilance Literature Monitoring
• Biological and Chemical Reagent Monitoring
• Concepts in social media
• Automated Curation of Clinical Data
• Protein and Gene Variant Monitoring
44. USING NLP FOR MANUSCRIPT MATCHING
Analyze citation connections to place the publication in the right journal
45. PITFALLS OF NLP FEATURES FOR ML
FOCUS OF DRUG DISCOVERY: DRUG, TARGET, DISEASE
These associations can be obtained with NLP, but precision is a problem: a flood of false positives and the necessity to hire a bunch of people just to sort the true from the false alerts.
• 1-10 million features
• Feature vectors are binary and sparse
• Feature redundancy
• Feature selection takes a long time
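The sparsity and redundancy pitfalls named on this slide can be made concrete with a small sketch. The documents and feature names below are invented for illustration. Redundant features are taken to be those that occur in exactly the same set of documents, which is one simple definition among several possible ones.

```python
from collections import defaultdict

# Hypothetical NLP output: each document maps to the set of features found in it.
doc_features = {
    "doc1": {"drug:aspirin", "target:COX1", "target:PTGS1"},
    "doc2": {"drug:aspirin", "disease:pain"},
    "doc3": {"target:COX1", "target:PTGS1", "disease:pain"},
    "doc4": {"disease:gout"},
    "doc5": {"drug:allopurinol", "disease:gout"},
}

# Invert to feature -> documents; this is the sparse form of a binary matrix.
feature_docs = defaultdict(set)
for doc, feats in doc_features.items():
    for feat in feats:
        feature_docs[feat].add(doc)

# Density: fraction of (document, feature) cells that are 1.
nonzero = sum(len(docs) for docs in feature_docs.values())
density = nonzero / (len(doc_features) * len(feature_docs))

# Redundancy: features with identical document sets carry the same signal.
by_pattern = defaultdict(list)
for feat, docs in feature_docs.items():
    by_pattern[frozenset(docs)].append(feat)
redundant = sorted(sorted(group) for group in by_pattern.values() if len(group) > 1)

print(f"{len(feature_docs)} features, density {density:.2f}, redundant groups: {redundant}")
```

Even in this tiny example only about a third of the cells are non-zero, and the two synonymous target features are flagged as one redundant group. At the scale of millions of features, the same effects drive the long feature selection times mentioned above.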
46. METABASE MANUALLY ANNOTATED CONTENT
PUBLICATIONS (209 for EGF-EGFR interaction)
• Manual annotation from publications
• Team of PhDs, MDs
• Advanced editorial systems
• Controlled vocabularies
• Multiple levels of QC
• Invested more than 400 man-years
MOLECULAR INTERACTION NETWORK: ~1,500,000 molecular interactions
PATHWAY: ~3,000 pathways
48. WHY DIFFERENT PATIENT RESPONSE?
Blockbuster strategy: “One drug for all patients”
• Drug toxic but beneficial
• Drug toxic but NOT beneficial
• Drug NOT toxic and beneficial
• Drug NOT toxic and NOT beneficial
New strategy is needed
Patient stratification: “The most efficient and safe drug for a cohort of patients”
49. HOW CAN PATIENTS BE STRATIFIED?
Mechanism 1: Biomarkers
Mechanism 2: Biomarkers
Biomarker – measurable molecular indicator of:
– disease subtype/progress
– drug efficacy
– side effect/toxicity
• Identify subtypes resulting in multiple drug targets rather than one.
• A shift from the presumption of a single disease to multiple diseases would reframe the drug development strategy
50. ORION BIONETWORKS
Orion Bionetworks (Cohen Veteran Biosciences) is an alliance of world-leading organizations in patient care, computational modelling, translational research and patient advocacy that aims to develop open-source computational models for multiple sclerosis and improve upon existing analytical tools for model development.
~186 subjects with gene expression data and clinical parameters like time to relapse, etc.
GOALS:
• Understand the structure of the population based on the molecular data – identify cohorts of patients whose clinical course differs over time
• Build stratification models
• Identify new therapeutic targets
52. 1. PATHWAY IDENTIFICATION
• 56 pathways identified
– 136 genes
– 39/136 genes were present in multiple pathways
– 44/136 genes are known MS biomarkers or drug targets (p = 5x10^-6)
• Individual expression values of each member gene were averaged into a combined z-score
• Activity score association with time to relapse in a Cox proportional hazards model was calculated
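A minimal sketch of the pathway activity score described above: each gene's expression is converted to z-scores across patients, and the z-scores of a pathway's member genes are averaged per patient. The gene names and values are made up, and using the population standard deviation is an assumption, since the slide does not say which variant was used. The Cox model step is not shown.

```python
from statistics import mean, pstdev

def z_scores(values):
    """Standardize one gene's expression values across patients."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma if sigma else 0.0 for v in values]

def pathway_activity(expression, genes):
    """Average the member genes' z-scores to get one activity score per patient."""
    per_gene = [z_scores(expression[g]) for g in genes]
    n_patients = len(per_gene[0])
    return [mean(gene_z[i] for gene_z in per_gene) for i in range(n_patients)]

# Hypothetical expression matrix: gene -> values for patients 0, 1, 2.
expression = {
    "GENE_A": [1.0, 2.0, 3.0],
    "GENE_B": [10.0, 20.0, 30.0],
}
scores = pathway_activity(expression, ["GENE_A", "GENE_B"])
print([round(s, 3) for s in scores])
```

Standardizing first means a gene measured on a larger scale (GENE_B here) does not dominate the pathway score.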
53. 2. PATIENTS CLUSTERING BY PATHWAYS
Clusters are significantly associated with time to relapse in the presence of important clinical covariates
Patients were clustered into groups based on k-means clustering of their pathway activity profiles; k=3 resulted in the best separation of patient profiles.
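The clustering step can be sketched with a plain k-means loop at k=3. The patient profiles below are invented two-pathway activity scores, and seeding the centroids with the first k points is a simplification made for reproducibility. A real analysis would normally use random restarts or a library implementation.

```python
import math

def kmeans(points, k, iterations=20):
    """Lloyd's algorithm with a deterministic start (first k points as centroids)."""
    centroids = [list(p) for p in points[:k]]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assign each patient to the nearest centroid.
        assignments = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                       for p in points]
        # Move each centroid to the mean of its assigned patients.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignments, centroids

# Hypothetical pathway activity profiles: (pathway_1 score, pathway_2 score) per patient.
profiles = [(-1.0, -1.1), (0.0, 0.1), (1.2, 1.0),
            (-0.9, -1.0), (0.1, 0.0), (1.0, 1.1)]
assignments, centroids = kmeans(profiles, k=3)
print(assignments)
```

Comparing the separation obtained for several values of k is how a choice such as k=3 would typically be justified.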
54. 3. CLASSIFICATION MODEL
• A K-Nearest Neighbor (KNN) model was previously generated to predict risk groups 1-3 using all biomarkers
• Feature selection was performed by taking the variable importance calculated from the trained KNN model.
• Forward feature selection was then conducted using 10-fold CV, adding features to the model in order of their importance.
• Once this process was complete, the predictive performance was evaluated in terms of the ability of the model to separate the three risk groups
• Final feature set was applied to test data
Signature was reduced from 56 to 13 pathways, containing 65 genes
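The forward feature selection described on this slide can be sketched as follows: features are added in a given importance order, and each candidate set is scored by cross-validated KNN accuracy. Everything here is a toy assumption: the data, the importance ranking (supplied by hand rather than derived from a trained model), k=1, and 3 folds instead of 10 because the example has only nine patients.

```python
import math
from collections import Counter

def knn_predict(train, x, cols, k=1):
    """Majority label among the k nearest training rows, using only columns `cols`."""
    nearest = sorted(
        train,
        key=lambda row: math.dist([row[0][c] for c in cols], [x[c] for c in cols]),
    )[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def cv_accuracy(data, cols, n_folds, k=1):
    """Cross-validated accuracy; row i belongs to fold i % n_folds."""
    correct = 0
    for fold in range(n_folds):
        test = [row for i, row in enumerate(data) if i % n_folds == fold]
        train = [row for i, row in enumerate(data) if i % n_folds != fold]
        correct += sum(knn_predict(train, x, cols, k) == y for x, y in test)
    return correct / len(data)

def forward_select(data, ranked_cols, n_folds, k=1):
    """Add features in ranked order; keep the smallest prefix with the best CV score."""
    best_cols, best_score, current = [], 0.0, []
    for col in ranked_cols:
        current = current + [col]
        score = cv_accuracy(data, current, n_folds, k)
        if score > best_score:
            best_cols, best_score = list(current), score
    return best_cols, best_score

# Toy rows: ((informative pathway score, noisy pathway score), risk group).
data = [
    ((0.0, 100.0), 1), ((0.5, 0.0), 1), ((1.0, 50.0), 1),
    ((5.0, 0.0), 2), ((5.5, 50.0), 2), ((6.0, 100.0), 2),
    ((10.0, 50.0), 3), ((10.5, 100.0), 3), ((11.0, 0.0), 3),
]
selected, score = forward_select(data, ranked_cols=[0, 1], n_folds=3)
print(selected, score)
```

On this toy data the noisy second feature lowers the cross-validated accuracy, so only the first feature is kept. This is the same mechanism by which a signature can shrink, as in the reduction from 56 to 13 pathways reported here.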
55. GENE ONLY MODEL WAS NOT ROBUST TO TEST DATA
PATHWAY BASED APPROACH vs. GENE BASED APPROACH
56. CONCLUSIONS
• Signature differentiating between patient cohorts was reduced from 56 to 13 pathways
• This new signature contains 65 genes
• 13 biomarkers could stratify subjects into risk groups with statistically significant differences in time to relapse
• This was validated in test subjects, with results being consistent with what was observed in the training cohort
• Pathway activities were more robust than gene expression
57. Poll Question 4: What is the greatest barrier to application of NLP/ML at your company?
A. Technical expertise
B. Access to data
C. Data quality
D. Management support/understanding
E. Other
58. Poll Question 5: Do you expect an increase in ML within Life Science in the next 2 years?
A. Yes
B. No
C. Don’t know
60. Where will AI/Deep learning have an impact in Life Science & Health?
The next Pistoia Alliance Debates Webinar
Moderator: Nick Lynch, with Sean Ekins, CEO, Collaborations Pharmaceuticals Inc; David Pearah, CEO, HDF Group; and Peter Henstock, Pfizer Research
Date: September 27, 2017
Check http://www.pistoiaalliance.org/pistoia-alliance-debates-webinar-series/ for the latest information