NLP & ML Webinar

This webinar is being recorded

Natural Language
Processing and Machine
Learning: Beyond the Hype
A Pistoia Alliance Debates Webinar
Moderated by David Milward – Linguamatics
September 14, 2017

Poll Question 1: What role do you play in
your company?
A. IT
B. Data scientist/bioinformatician
C. Clinical/bench scientist
D. Information professional
E. Other

©PistoiaAlliance
The Panel
David Milward, Ph.D., CTO Linguamatics
David Milward is chief technology officer (CTO) at Linguamatics. He is a pioneer of interactive
text mining, and a founder of Linguamatics. He has over 20 years' experience of product
development, consultancy and research in natural language processing (NLP). After receiving a
PhD from the University of Cambridge, he was a researcher and lecturer at the University of
Edinburgh. He has published in the areas of information extraction, spoken dialogue, parsing,
syntax and semantics.
Chengyi Zheng, Ph.D., NLP Specialist, Kaiser Permanente
Chengyi Zheng, PhD, is an NLP specialist at Kaiser Permanente Southern California. He has
worked on over 30 research projects using electronic health record (EHR) data from millions
of patients. He is the principal investigator of a CDC-funded study involving 5 health care
institutions on using NLP in vaccine safety studies. He was the winner of the Kaiser
Permanente predictive modeling competition. He took first place in the innovation
competition (InnoCentive@Lilly) while serving as a biomedical informatics scientist at Eli Lilly.
He was trained in computer science with a concentration in speech recognition. He will share
some experiences of using NLP and machine learning on EHR data for outcomes prediction.
Eugene Myshkin, Ph.D., Senior Research Scientist, Clarivate
Eugene Myshkin, PhD, is a senior scientist in bioinformatics at Clarivate Analytics. He
has over 15 years' experience in drug discovery, cheminformatics and bioinformatics. He
has also been involved in a number of text mining projects, including mining of chemical
reagents and antibodies from the scientific literature.

Agenda
• AI, NLP and ML (David)
• Using NLP and ML in clinical research (Chengyi)
• Network and pathway driven machine learning
approaches to biomarker discovery and patient
stratification (Eugene)

NLP, AI and Machine Learning
David Milward, PhD
CTO, Linguamatics
2017

Overview
AI (Artificial Intelligence)
NLP (Natural Language Processing)
− and its applications in life sciences
ML (Machine Learning) and DL (Deep Learning)
NLP to feed ML-based DS (Decision Support)
ML in NLP
[Diagram: AI, ML, DL, NLP and DS]
© 2017 Linguamatics

Artificial Intelligence (AI)
Artificial intelligence is intelligence
exhibited by machines
The central goals of AI research include
reasoning, knowledge, planning,
learning, natural language processing
(communication), perception and the
ability to move and manipulate objects
As machines become increasingly
capable, tasks considered as requiring
"intelligence" are often removed from
the definition, leading to the quip
“AI is whatever hasn't been done yet”
Wikipedia

Natural Language Processing (NLP)
Processing of natural languages e.g.
English, French, Chinese by computers
NLP is part of AI, but also key to other
areas of AI e.g. providing decision
support
− If 80% of knowledge is unstructured
we need NLP to get the right information
to provide good suggestions
− Currently many AI projects are limited: they
can only address questions where there is
structured data
− Worse, they often use inappropriate
structured data such as ICD billing codes for
non-billing tasks

Find information however it is expressed
Different word, same meaning:
− cyclosporine
− ciclosporin
− Neoral
− Sandimmune
Different expression, same meaning:
− Non-smoker
− Does not smoke
− Does not drink or smoke
− Denies tobacco use
Different grammar, same meaning:
− 5mg/kg of cyclosporine daily
− 5mg/kg/d of cyclosporine
− cyclosporine 5mg/kg/day
Same word, different context:
− Diagnosed with diabetes
− Family history of diabetes
− No family history of diabetes
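As a minimal sketch of the "same word, different context" problem, the Python snippet below labels a concept mention as affirmed, negated or family history by looking for trigger phrases in front of it. The trigger lists and the function name `classify_context` are assumptions made for illustration; this is not how any particular NLP product works, and real systems rely on much richer linguistic analysis.

```python
# Hypothetical trigger phrases chosen for this sketch only.
NEGATION_TRIGGERS = ("no ", "denies ", "does not ", "non-")
FAMILY_TRIGGERS = ("family history of ",)


def classify_context(sentence, concept):
    """Label a concept mention as affirmed, negated, family or negated_family."""
    text = sentence.lower()
    position = text.find(concept.lower())
    if position < 0:
        return "absent"
    # Only the words before the mention are inspected for triggers.
    prefix = " " + text[:position]
    negated = any(" " + trigger in prefix for trigger in NEGATION_TRIGGERS)
    family = any(trigger in prefix for trigger in FAMILY_TRIGGERS)
    if negated and family:
        return "negated_family"
    if family:
        return "family"
    return "negated" if negated else "affirmed"


examples = [
    ("Diagnosed with diabetes", "diabetes"),
    ("Family history of diabetes", "diabetes"),
    ("No family history of diabetes", "diabetes"),
    ("Denies tobacco use", "tobacco"),
    ("Non-smoker", "smoker"),
]
for sentence, concept in examples:
    print(sentence, "->", classify_context(sentence, concept))
```

Even this toy version shows why keyword search alone is not enough: all three diabetes sentences contain the same word, but only one describes a diagnosis of the patient.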

Represent it in a standardized form
Concept        Text                                     Normalized Value
Diseases       breast cancer; carcinoma of the breast   Breast Neoplasm
Genes          Raf-1; Raf I                             RAF1
Dates          27th Feb 2014; 2014/02/27                20140227
Measurements   0.2g; Two hundred milligrams             200 mg
Mutations      Val 158 Met; Val by Met at codon 158     V158M
Example: "nimesulide, a selective COX2 inhibitor, …" → inhibits → Entrez Gene ID: 5743
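The normalization idea can be sketched in a few lines of Python. The functions below cover only the date, measurement and mutation rows, and only the surface forms shown on the slide; the function names, the regular expressions and the two-entry amino acid table are assumptions for illustration, not part of any real product.

```python
import re
from datetime import datetime

# Three-letter to one-letter amino acid codes (only the two needed here).
AMINO_ACIDS = {"val": "V", "met": "M"}


def normalize_date(text):
    """Map '27th Feb 2014' or '2014/02/27' to the form YYYYMMDD."""
    cleaned = re.sub(r"(\d+)(st|nd|rd|th)\b", r"\1", text.strip())
    for pattern in ("%d %b %Y", "%Y/%m/%d"):
        try:
            return datetime.strptime(cleaned, pattern).strftime("%Y%m%d")
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {text!r}")


def normalize_mass(text):
    """Map a mass given in g or mg, such as '0.2g', to milligrams."""
    match = re.fullmatch(r"\s*([\d.]+)\s*(mg|g)\s*", text.lower())
    if match is None:
        raise ValueError(f"unrecognized measurement: {text!r}")
    value = float(match.group(1))
    if match.group(2) == "g":
        value *= 1000
    return f"{value:g} mg"


def normalize_mutation(text):
    """Map 'Val 158 Met' to the short form 'V158M'."""
    match = re.fullmatch(r"\s*([A-Za-z]{3})\s*(\d+)\s*([A-Za-z]{3})\s*", text)
    if match is None:
        raise ValueError(f"unrecognized mutation: {text!r}")
    reference, position, variant = match.groups()
    return f"{AMINO_ACIDS[reference.lower()]}{position}{AMINO_ACIDS[variant.lower()]}"


print(normalize_date("27th Feb 2014"), normalize_date("2014/02/27"))
print(normalize_mass("0.2g"))
print(normalize_mutation("Val 158 Met"))
```

Forms such as "Two hundred milligrams" or "Val by Met at codon 158" need more than pattern matching on a fixed layout, which is where a full NLP system earns its keep.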

From Bench to Bedside: NLP Provides Insight
[Diagram: Idea → Basic research → Clinical trials (Phase 1, Phase 2, Phase 3) → Regulatory approval → Patient care, grouped into Discovery, Development and Delivery]
Business critical questions
− What targets are involved in bone cancer?
− What companies are patenting a particular technology?
− What are the safety risks of my drug?
− Where can I site my Phase 1, Phase 3 clinical study?
− What are the clinical risks for my patients?

Direct access to the Unstructured
Weight ≥ 80kg
Below 60 years old
Reports after 2010
With mutation C677T
Cancer patients

Machine Learning
Machine Learning is used for AI in general and as a
technique within NLP
3 main flavours:
− Supervised
− uses annotated data mapping between inputs and outputs
− Semi-supervised
− uses machine analysis but incorporates a human in the loop
− Unsupervised
− uses unannotated data, usually at very large scale.
Recent successes with deep
learning approaches based on
neural networks for supervised and
unsupervised ML e.g.
− Machine translation using parallel
corpora
− Image classification in medicine

Using NLP to feed other AI
NLP provides access to the 80% of
information in unstructured text
Provides a set of potential features to be
used in e.g. ML models for Decision
Support
Example: building risk models from RWD
sets
− Predicting patients at risk of misusing opioid
prescription drugs (AMIA November 2017)
− Features extracted by Linguamatics I2E from
8.9 million de-identified medical record full-
text transcripts from RealHealthData
− SVM classifier trained on the features to flag
patients at risk

Machine Learning in NLP
Supervised ML
− Requires large-scale, representative annotated documents
− Main paradigm for core NLP components
− For extraction patterns, used in academic systems but less commonly
in commercial
Semi-supervised ML
− Useful for new tasks or data sets where no representative
annotated data exists
− Useful where a task is initially ill-defined
− Puts a human in the loop judging suggestions from the machine
learning
− Can provide good quality results quickly e.g. to test whether a feature
extracted by NLP is useful for a ML model
Unsupervised ML
− Uses large-scale unannotated data
− Key example is learning the meaning of a word via the context it
keeps (word embeddings)
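To make the "meaning via context" idea concrete, here is a minimal count-based sketch in Python. It is a deliberately simple stand-in for neural word embeddings: each word is represented by counts of its neighbouring words, and words are compared by cosine similarity. The three toy sentences, the window size and the function names are assumptions for illustration.

```python
import math
from collections import Counter, defaultdict


def context_vectors(sentences, window=2):
    """Represent each word by counts of the words seen within `window` positions."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for i, word in enumerate(words):
            start, end = max(0, i - window), min(len(words), i + window + 1)
            for j in range(start, end):
                if j != i:
                    vectors[word][words[j]] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[key] for key, count in a.items() if key in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


sentences = [
    "patient takes cyclosporine daily",
    "patient takes ciclosporin daily",
    "patient denies tobacco use",
]
vectors = context_vectors(sentences)
same_drug = cosine(vectors["cyclosporine"], vectors["ciclosporin"])
different = cosine(vectors["cyclosporine"], vectors["tobacco"])
print(round(same_drug, 2), round(different, 2))
```

The two spellings of the drug never co-occur, yet they end up with near-identical vectors because they keep the same company. No annotation was needed, which is the point of the unsupervised approach.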

Semi-Supervised ML Approaches
Similar distributions for words and syntactic constructions
Automatically discover what is in the data using an interactive,
agile text mining platform such as Linguamatics I2E
A long tail of infrequent cases
− prioritize the more frequent constructions
− generalize to cover items in the tail
Zipf’s Law: the frequency of
any word is inversely
proportional to its rank in the
frequency table
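Zipf's Law as stated above says the count of the word at rank r is roughly the count of the top word divided by r. The short Python sketch below builds a rank-frequency table and prints the Zipf prediction beside each observed count. The toy token list is an assumption, constructed so that it follows the law exactly; real text only approximates it.

```python
from collections import Counter


def rank_frequency(tokens):
    """Return (rank, word, count) tuples, most frequent word first."""
    counts = Counter(tokens)
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(rank, word, count) for rank, (word, count) in enumerate(ordered, start=1)]


def zipf_expected(top_count, rank):
    """Count predicted by Zipf's Law for a given rank."""
    return top_count / rank


# Toy corpus built to follow the law exactly: 12, 6, 4 and 3 occurrences.
tokens = ("pain " * 12 + "gout " * 6 + "flare " * 4 + "joint " * 3).split()
table = rank_frequency(tokens)
for rank, word, count in table:
    print(rank, word, count, zipf_expected(table[0][2], rank))
```

In practice this means a handful of constructions cover most of the data, while the long tail has to be handled by generalizing rather than by listing every case.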

Semi-Supervised NLP using
Linguamatics I2E

Summary
NLP is critical to success of many ML projects
− access to the unstructured text is key to using ML
widely, not just where there is convenient structured
data
Semi-supervised approaches to NLP provide an
efficient way to capture features for ML projects

Poll Question 2: What is your company’s
primary use for NLP?
A. Early Discovery/ Pre-clinical
B. Clinical
C. Real world data
D. Other
E. Don’t use NLP

Using NLP and ML in clinical research
Chengyi Zheng, PhD, MS
DEPARTMENT of Research & Evaluation

Two weeks of records for one patient in an EMR system (10/6/2012 to 10/19/2012):
10/6/2012   Imaging: DEXA Bone density
10/7/2012   Pt called
10/7/2012   Nurse Called Back
10/8/2012   Orthopedic office visit; Where: Medical Center, Department
10/8/2012   Progress Notes: Reason for visit: Knee Pain; Vital Sign/BMI/Pain level/History; PE/Findings/Impression/A&P; Dx: ICD-9 code; Nurse Exam Note: …
10/9/2012   Lab
10/10/2012  Pre-op dental exam (ext)
10/10/2012  Surgery Scheduled
10/11/2012  Office visit
10/11/2012  Rx Prescribed
10/11/2012  Office Visit; Sinus Congestion; Ankle itchy; Dx: 401.9 Essential Hypertension, 274.9 Gout, 461.9 Acute Sinusitis
10/12/2012  Picked up the Rx
10/13/2012  Pt missed appt.
10/13/2012  Telephone Consult; Healthy bones PN
10/14/2012  Pt emailed: Drug adverse event
10/14/2012  Pt called; cancerous area
10/15/2012  EKG; Dx: Screening
10/15/2012  Ear Wax Wash
10/16/2012  Procedure: Remove Skin
10/16/2012 - 10/19/2012  Hospitalization
10/18/2012  Pathology Report Out
5 Ws: What, Who, When, Where and Why
Membership length: 70% > 5 years, 50% >10 years.

5 Ws: What, Who, When, Where and Why
• What
– What is the reason for the visit?
– What happened? (pain after a fall, pain after drinking a beer?)
• Who
– Who is the caregiver? (primary physician, rheumatologist?)
– What do we know about this patient? (age, race, past medical history, etc.)
• Where
– Where did this visit occur?
• When
– When did the problem start?
• Why
– Why did this problem happen? Possible causes?

Visual representation of KPSC research databases

Case study: Identify acute gout flare
• Published methods to identify gout flares using claims data
– Clinical coding is unreliable: under-coding, over-coding, too general
– Medication is unreliable:
▪ Drugs for gout maintenance
▪ Drugs also used for other diseases (which share similar symptoms)
• NLP has been used to:
– Identify the study population and patient information
– Identify and extract clinical variables (genetic, biopsy, radiology)
– Evaluate patient status (disease progression, medication status)

Solution and challenges (NLP)
Challenges:
– Gout is a chronic disease which can be controlled but not cured
▪ Signs and symptoms may appear in follow-up visits
▪ Need to differentiate between acute and chronic status
– The gout population is generally older, with comorbidities that share
similar symptoms
▪ 100+ types of arthritis (> 50 million people)
▪ Joint pain, erythema, and swelling
– Documented information varies between clinical notes
Standard solutions:
– Each search query captures one set of information
– Each search query has its own sensitivity/specificity etc.
– Logic operators combine search results (union, join, etc.)
▪ Difficult to optimize the overall sensitivity/specificity etc.
Mining vs. NLP & ML in clinical research
Steps:
1. Preliminary analysis, estimate feasibility
2. Develop plan, estimate cost
3. Seek permit (government vs. IRB)
4. Mine (mining equipment vs. NLP)
– Focus on completeness (high sensitivity)
– Shallow & deep mining (good specificity)
5. Refine (chemical process vs. ML)
– Improve purity (higher specificity)
6. Manual verification (optional)
7. Deliver to customer
“art and science combined” “resource-heavy and time-consuming
process”

Solution and challenges (NLP+ML)
Goal:
• NLP focuses on sensitivity or information completeness
– Separate ores from rock
• ML focuses on improving the specificity
– Improve purity without much loss of sensitivity
Solution:
• NLP results serve as input features to the ML system
– Identify related signs and symptoms
– Identify temporal relationship (when and how long?)
– Identify disease association (related to any other disease?)
– Identify implicit and explicit mention of gout flare
– Identify treatment plan associated with disease onset

Overview of the system development steps
Study period: 1/1/2007 to 12/31/2010.
Patients > 18 years, with a diagnosis of gout and on urate-lowering therapy.
Within [-3,+12] months of index date, 599,317 clinical notes for 16,519 patients.

Overview of the NLP+ML system

Performance comparisons
Clinical note level gout flare identification (%)
                  Sensitivity  Specificity  PPV   NPV
Rheumatologist 1  81.1         95.4         88.3  92.2
Rheumatologist 2  90.9         97.3         93    96.5
NLP+ML            84.8         92.2         81.1  93.9

Identify patients with ≥ 1 gout flares (%)
                  Sensitivity  Specificity  PPV   NPV
Rheumatologist 1  98.5         92.9         97.1  96.3
Rheumatologist 2  97.1         92.9         97.1  92.9
NLP+ML            98.5         96.4         98.5  96.4
ICD-9             88.2         89.3         95.2  75.8

Identify patients with ≥ 3 gout flares (refractory gout) (%)
                  Sensitivity  Specificity  PPV   NPV
Rheumatologist 1  74.2         92.3         82.1  88.2
Rheumatologist 2  83.9         95.4         89.7  92.5
NLP+ML            93.5         84.6         74.4  96.5
ICD-9             41.9         95.4         81.3  77.5

Results
• Note level (gout flare, n = 599,317):
– NLP: 49,415 positive cases => ML: 18,869 positive cases
• Patient level (with ≥ 3 flares, n = 16,519):
– Number of patients: 1,402 (NLP+ML) vs. 516 (claims)
– Sensitivity: 93.5% (NLP+ML) vs. 41.9% (claims)
• Impact:
– Identify refractory disease patients
– Estimate market size (KPSC / US population = 4.5/325 million =
1.4%)
– Better disease management, improve quality of life, and help
reduce healthcare resource use.
▪ 1,402 patients is more manageable than 16,519 patients

ML in healthcare
• Tremendous opportunities
– Prediction: high utilizers, risk scores
– Identification: cases, outcomes, social needs
– Image recognition: pathology and radiology images
• Challenges (Data)
– Data quality: dirty, missing data
– Heterogeneous data: different systems
– Structured, semi-structured and free text data
– Images, scanned documents
– Genetic and biobank data
• Challenges (People)
– Who understands NLP, ML and healthcare
– Who understands the complexity of healthcare data

Poll Question 3: How does your company
primarily use machine learning in drug
discovery?
A. Target prediction and repositioning
B. Biomarker discovery
C. Patient stratification
D. Other
E. We don’t use machine learning

Network and pathway
driven machine learning
approaches to biomarker
discovery and patient
stratification
Eugene Myshkin, PhD
September 2017

CLARIVATE ANALYTICS TEXT MINING
• Clarivate Analytics literature data feed
• Comprehensive coverage
– >20,000 journals
– Journal content mirrors: Current Contents; Web of Science; Biosis; International Pharmaceutical Abstracts;
Derwent Drug File
– http://ip-science.thomsonreuters.com/cgi-bin/jrnlst/jlresults.cgi?PC=MASTER
• Latest information
– Updated with over 170,921 articles/month, or 2,051,051+ articles/year
• Full text, cover to cover searching of all journals
• Comprehensive synonym collections
• Controlled vocabulary management software to support mining

CLARIVATE ANALYTICS LIFE SCIENCES SOLUTIONS
• Pharmacovigilance Literature Monitoring
• Biological and Chemical Reagent Monitoring
• Concepts in social media
• Automated Curation of Clinical Data
• Protein and Gene Variant Monitoring

USING NLP FOR MANUSCRIPT MATCHING
Analyze citation connections to place the publication in the right journal

PITFALLS OF NLP FEATURES FOR ML
• 1-10 million features
• Feature vectors are binary and sparse
• Feature redundancy
• Feature selection takes a long time
FOCUS OF DRUG DISCOVERY: DRUG, TARGET, DISEASE
These associations can be obtained with NLP, but precision is a problem:
a flood of false positives and the necessity to hire a bunch of people
just to sort the true from the false alerts.

METABASE MANUALLY ANNOTATED CONTENT
PUBLICATIONS (209 for EGF-EGFR interaction)
• Manual annotation from publications
• Team of PhDs, MDs
• Advanced editorial systems
• Controlled vocabularies
• Multiple levels of QC
• Invested more than 400 man-years
MOLECULAR INTERACTION NETWORK: ~ 1,500,000 molecular interactions
PATHWAY: ~ 3,000 pathways

INTEGRATED APPROACH
Pathway knowledge
Pathway-driven
approaches
Statistical
approaches
1. Target identification or repositioning
2. Biomarker discovery
3. Patient stratification

WHY DIFFERENT PATIENT RESPONSE?
Blockbuster strategy: “One drug for all patients”
• Drug toxic but beneficial
• Drug toxic but NOT beneficial
• Drug NOT toxic and beneficial
• Drug NOT toxic and NOT beneficial
New strategy is needed
Patient stratification: “The most efficient and safe drug for a cohort of patients”

HOW CAN PATIENTS BE STRATIFIED?
Mechanism 1 Mechanism 2
Biomarkers Biomarkers
Biomarker – measurable molecular indicator of:
disease subtype/progress
drug efficacy
side effect/toxicity
• Identify subtypes resulting in multiple
drug targets rather than one.
• A shift from the presumption of a disease
to multiple diseases would reframe the
drug development strategy

ORION BIONETWORKS
Orion Bionetworks (Cohen Veteran Biosciences) is an alliance of world leading
organizations in patient care, computational modelling, translational research and
patient advocacy that aims to develop open-source computational models for
multiple sclerosis and improve upon existing analytical tools for model
development.
~186 subjects with gene expression data and clinical parameters such as time to
relapse
GOALS:
• Understand the structure of the population based on
the molecular data: identify cohorts of patients whose
clinical course differs over time
• Build stratification models
• Identify new therapeutic targets

NETWORK/PATHWAY BASED METHODS FOR BIOMARKER DISCOVERY

1. PATHWAY IDENTIFICATION
• 56 pathways identified
– 136 genes
– 39/136 genes were present in multiple pathways
– 44/136 genes are known MS biomarkers or drug targets (p = 5 × 10^-6)
• Individual expression values of each member gene were averaged into a
combined z-score
• The association of the activity score with time to relapse was
calculated in a Cox proportional hazards model
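The pathway activity score can be sketched in Python as follows. Each gene is z-scored across patients, then the z-scores of a pathway's member genes are averaged per patient. The gene names and expression values are invented, and the use of the population standard deviation is an assumption, since the slide does not say which variant was used. The Cox proportional hazards step is not covered by this sketch.

```python
from statistics import mean, pstdev


def zscores(values):
    """Standardize one gene's expression values across patients."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(value - mu) / sigma for value in values]


def pathway_activity(expression, pathway_genes):
    """Average the member genes' z-scores to get one score per patient.

    `expression` maps a gene name to a list of values, one per patient.
    """
    per_gene = [zscores(expression[gene]) for gene in pathway_genes]
    return [mean(patient_values) for patient_values in zip(*per_gene)]


# Hypothetical expression values for three patients.
expression = {
    "GENE_A": [1.0, 2.0, 3.0],
    "GENE_B": [10.0, 20.0, 30.0],
    "GENE_C": [5.0, 5.0, 5.0],
}
scores = pathway_activity(expression, ["GENE_A", "GENE_B"])
print([round(score, 3) for score in scores])
```

Standardizing first matters: without it, GENE_B would dominate the average simply because it is measured on a larger scale.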

2. PATIENTS CLUSTERING BY PATHWAYS
Patients were clustered into groups based on k-means clustering of their pathway
activity profiles; k=3 resulted in the best separation of patient profiles.
Clusters are significantly associated with time to relapse in the presence of
important clinical covariates.
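As a rough illustration of this clustering step, here is a small pure-Python k-means with k=3. The pathway activity profiles are invented, and the farthest-first initialization is an assumption chosen to keep the sketch deterministic; the slide does not say how the original analysis initialized or implemented k-means.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def farthest_first(points, k):
    """Pick k starting centroids: the first point, then the farthest remaining ones."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(squared_distance(p, c) for c in centroids))
        )
    return centroids


def kmeans(points, k, max_iterations=100):
    """Assign each point to the nearest centroid until assignments stop changing."""
    centroids = farthest_first(points, k)
    labels = None
    for _ in range(max_iterations):
        new_labels = [
            min(range(k), key=lambda c: squared_distance(point, centroids[c]))
            for point in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels, centroids


# Hypothetical activity scores for two pathways in six patients.
profiles = [(-1.0, -1.1), (-0.9, -1.0), (0.0, 0.1), (0.1, 0.0), (1.0, 1.1), (0.9, 1.0)]
labels, centroids = kmeans(profiles, k=3)
print(labels)
```

The cluster numbers themselves carry no meaning; what matters is which patients end up together.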

3. CLASSIFICATION MODEL
• A K-Nearest Neighbor (KNN) model was previously generated to predict
risk groups 1-3 using all biomarkers
• Feature selection was performed by taking the variable importance
calculated from the trained KNN model
• Forward feature selection was then conducted using 10-fold cross-validation (CV),
adding features to the model in order of their importance
• Once this process was complete, the predictive performance was
evaluated in terms of the ability of the model to separate the three
risk groups
• The final feature set was applied to test data
Signature was reduced
from 56 to 13 pathways,
containing 65 genes
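A minimal Python sketch of forward feature selection with a KNN classifier and k-fold cross-validation follows. It assumes the importance ranking is already available and that a feature is kept only if it improves cross-validated accuracy; the slide does not spell out either detail. The toy data, the feature names and the choice of k=3 neighbours are likewise assumptions.

```python
from collections import Counter


def knn_predict(train_rows, train_labels, row, features, k=3):
    """Majority vote among the k nearest training rows, using only `features`."""
    distances = sorted(
        (sum((train_row[f] - row[f]) ** 2 for f in features), label)
        for train_row, label in zip(train_rows, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


def cv_accuracy(rows, labels, features, folds=10, k=3):
    """Accuracy of KNN under k-fold cross-validation (every fold-th row held out)."""
    correct = 0
    for fold in range(folds):
        test_idx = set(range(fold, len(rows), folds))
        train_rows = [r for i, r in enumerate(rows) if i not in test_idx]
        train_labels = [l for i, l in enumerate(labels) if i not in test_idx]
        for i in test_idx:
            if knn_predict(train_rows, train_labels, rows[i], features, k) == labels[i]:
                correct += 1
    return correct / len(rows)


def forward_selection(rows, labels, ranked_features, folds=10, k=3):
    """Add features in ranked order, keeping each only if CV accuracy improves."""
    selected, best_score = [], 0.0
    for feature in ranked_features:
        score = cv_accuracy(rows, labels, selected + [feature], folds, k)
        if score > best_score:
            selected.append(feature)
            best_score = score
    return selected, best_score


# Hypothetical data: pathway_1 tracks the risk group, pathway_2 is noise.
rows = [
    {"pathway_1": (i % 3) + 0.01 * (i // 3), "pathway_2": float((i * 7) % 10)}
    for i in range(30)
]
labels = [i % 3 + 1 for i in range(30)]  # risk groups 1-3
selected, best_score = forward_selection(rows, labels, ["pathway_1", "pathway_2"])
print(selected, best_score)
```

In this toy case the noise feature is rejected because it cannot improve on the accuracy already reached with the informative one, which mirrors how a signature can shrink to a smaller set of pathways.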

GENE ONLY MODEL WAS NOT ROBUST TO TEST DATA
[Figure panels: PATHWAY BASED APPROACH vs. GENE BASED APPROACH]

CONCLUSIONS
• The signature differentiating between patient cohorts was reduced
from 56 to 13 pathways
• This new signature contains 65 genes
• 13 biomarkers could stratify subjects into risk groups with
statistically significant differences in time to relapse
• This was validated in test subjects, with results consistent
with what was observed in the training cohort
• Pathway activities were more robust than gene expression

Poll Question 4: What is the greatest
barrier to application of NLP/ML at your
company?
A. Technical expertise
B. Access to data
C. Data quality
D. Management support/understanding
E. Other

Poll Question 5: Do you expect an
increase in ML within Life Science in the
next 2 years?
A. Yes
B. No
C. Don’t know

Audience Q&A
Please use the Question function in GoToWebinar

The next Pistoia Alliance Debates Webinar:
Where will AI/Deep learning have an impact in Life Science & Health?
Moderator: Nick Lynch, with Sean Ekins, CEO, Collaborations
Pharmaceuticals Inc; David Pearah, CEO, HDF Group; and Peter Henstock,
Pfizer Research
Date: September 27, 2017
Check http://www.pistoiaalliance.org/pistoia-alliance-debates-webinar-series/
for the latest information

info@pistoiaalliance.org @pistoiaalliance www.pistoiaalliance.org
