Automated & Explainable Deep
Learning for Clinical Language
Understanding at Roche
Vishakha Sharma, PhD
Principal Data Scientist
Roche
Yogesh Pandit, MS
Staff Software Engineer
Roche
David Talby, PhD
Chief Technology Officer
John Snow Labs
Agenda
The Clinical Language Understanding
Challenge at Roche
Why patients & doctors need accurate & automated
natural language understanding at scale
Delivering an Automated, Explained,
State-of-the-art NLP & OCR System
The deep-learning NLP & OCR models and pipelines
that were built to address the challenge
Achieving state-of-the-art accuracy
on real healthcare data in production
Reusing, training & tuning clinical embeddings,
entity recognition, entity linking, and OCR
Disclaimer
▪ Roche has been one of John Snow Labs’ customers since August 2018.
▪ This presentation has been prepared by Roche and John Snow Labs to
provide a high-level overview of Roche’s use of John Snow Labs’
products.
▪ Nothing contained or stated herein or during the presentation
constitutes Roche’s endorsement of John Snow Labs’ products.
▪ John Snow Labs is fully responsible for accuracy and completeness of
any statements related to John Snow Labs’ products, including the
product’s performance.
The Roche difference
Rooted in Science • Trusted in Healthcare
Diagnostics
#1 in biotechnology
and in vitro diagnostics
20 billion diagnostic tests performed*
Advanced scientific knowledge
and technology that increases
the medical value
of diagnostic solutions
Pharmaceuticals
Leading provider
of cancer treatments worldwide
127 million patients treated
with Roche medicines*
Focused on major medical indications
and disease areas
30 Roche medicines on the
WHO Model List of Essential Medicines*
Decision Support
Workflow | Data | Analytics
Delivering
Personalized Healthcare
Decision support software leveraging more
than 120 years of medical innovation rooted in science
*Roche Annual Report, 2018 © 2019 F. Hoffmann-La Roche, Ltd NAVIFY is a trademark of Roche.
NAVIFY Tumor Board
NAVIFY Clinical Decision Support appsA cloud-based workflow product that securely integrates
and displays relevant aggregated data into a single, holistic patient
dashboard for oncology care teams to review, align and decide on
the optimal treatment for the patient.
The clinical decision support apps ecosystem is
secured and fully integrated with NAVIFY Tumor Board.
NAVIFY Guidelines app
Delivering personalized up-to-date guidelines for reviewing
and recording patient diagnostic and treatment paths and
documentation adherence using an intuitive execution
decision tree released in collaboration with GE Healthcare.
NAVIFY Clinical Trial Match app*
Easily search the largest international trial registries,
including ClinicalTrials.gov, European Medicines Agency,
Japan Medical Association Center for Clinical Trials, etc.
NAVIFY Publication Search app*
Effortlessly search more than 858,000 publications across
PubMed, American Society of Clinical Oncology and
American Association of Cancer Research.
*Powered by MolecularMatch, Inc. © 2019 F. Hoffmann-La Roche, Ltd NAVIFY is a trademark of Roche.
Unstructured healthcare data challenges for
NAVIFY portfolio
▪ Diverse customers distributed across the world
▪ Multiple Languages
▪ Oncology
▪ Different report formats (ex: pathology, radiology)
▪ Different terminologies (ex: SNOMED, LOINC, ICD-O-3)
Must unlock unstructured data to build a comprehensive,
longitudinal view of the patient, and enable both clinical decision
support and population analytics
Sample Pathology Report
Disclaimer: There is no real patient data being displayed here.
Pathology reports are very
diverse:
▪ Jargon
▪ Tables
▪ Key-value pairs
▪ Hand-written notes
Manually Curated Report
Manual curation is extremely
time consuming, expensive,
and prone to errors
Disclaimer: There is no real patient data being displayed here.
The NAVIFY team identified two significant needs
Requirements for both:
▪ Scalable (support 10 million pathology and
radiology reports per year
▪ Compliant with privacy laws
▪ Integrates easily with AWS services
▪ Low cost
Natural Language Processing (NLP):
▪ High accuracy
▪ Specialized for medical data
▪ Minimize time to train new models
▪ Extensible for new content types
Optical Character Recognition (OCR):
▪ High accuracy
▪ Retain document structure
(i.e. tables, lists, paragraphs…)
45+ Oncology Entities to Extract
Example: Surgical Pathology Report (Lung, Breast, Colon)
Disclaimer: This is sample data from TCGA. There is no real patient data being displayed here.
Optical Character Recognition (OCR)
PDF Text
engine_mode
page_segmentation_mode
erosion
page_iterator_level
scaling_factor
Parameters
Metrics
word_error_rate
character_error_rate
bag_of_words_error_rate
Experiment
& Optimize
Named Entity Recognition (NER)
Spark NLP provides both CNN+Bi-LSTM and Bio-Bert implementations
We trained a model to extract 45+ labels from Pathology reports
Chiu, J. P., et al.(2016). Named entity recognition with bidirectional
LSTM-CNNs. Transactions of the Association for Computational
Linguistics, 4, 357-370.
Bi-directional Long Short Term Memory
with Convolutional Neural Network
Tensorflow Models
Devlin J. et al. (2019) Bert: pre-training of deep bidirectional
transformers for language understanding. In: Proceedings of the 2019
Conference of the North American Chapter of the Association for
Computational Linguistics: Human Language Technologies, Volume 1
(Long and Short Papers), Minneapolis, MN, USA. pp. 4171–4186.
Bidirectional Encoder Representations from
Transformers (BERT)
https://github.com/google-research/bert
Lee, J., et al. (2020). BioBERT: a pre-trained biomedical language
representation model for biomedical text mining. Bioinformatics, 36(4),
1234-1240.
Bio-BERT
Entity Resolution (ER)
With just NER - we can not resolve entities to structured code
Pre-trained models for resolving healthcare entities to standard SNOMED & ICD-10 codes
Workflow
Workflow
Training NER model with BERT
Initialization
Training data
Resources
Annotator
Pipeline
Run Training
The use of NLP will be a journey
▪ Initial goal of speeding up of
pathology and radiology reports
▪ Faster curation and term
highlighting in clinical reports
▪ Automate extraction of
high-confidence entities,
relationships
What is Spark NLP?
▪ State of the art Natural Language Processing
▪ Production-grade, trainable, and scalable
▪ Open-Source Python, Java & Scala libraries
▪ 100+ Pre-trained models & pipelines
▪ Active: 26 new releases in 2018, 30 in 2019
Spark NLP
for Healthcare
Accuracy Benchmarks
Scaling Benchmarks
Speed Benchmarks
• Optimized builds of Spark NLP
for both Intel and Nvidia
• Benchmark done on AWS:
Train a French NER model
• Achieving F1-score of 89%
requires at least 80 Epochs with
batch size of 512
• Intel outperformed Nvidia:
Cascade Lake was 19% faster &
46% cheaper than Tesla P-100
Clinical Entity Recognition: Accuracy
Bert
NerDLApproach
93.3 %
on blind test set
The best NER score
in a production system
Clinical Entity Resolution
“CNN-based ranking for
biomedical entity normalization”.
Li et al., BMC Bioinformatics,
October 2017.
Learn more: Spark NLP
Public Python notebooks - runnable on Google Colab with one click in a browser:
github.com/JohnSnowLabs/spark-nlp-workshop/tree/master/tutorials/colab
github.com/JohnSnowLabs/spark-nlp-workshop/tree/master/jupyter/
enterprise/healthcare/colab
Overview of the Spark NLP:
nlp.johnsnowlabs.com
Thank you!
We are hiring!!!
www.navify.com/careers/
Try Spark NLP at
nlp.johnsnowlabs.com

Automated and Explainable Deep Learning for Clinical Language Understanding at Roche

  • 2.
    Automated & ExplainableDeep Learning for Clinical Language Understanding at Roche Vishakha Sharma, PhD Principal Data Scientist Roche Yogesh Pandit, MS Staff Software Engineer Roche David Talby, PhD Chief Technology Officer John Snow Labs
  • 3.
    Agenda The Clinical LanguageUnderstanding Challenge at Roche Why patients & doctors need accurate & automated natural language understanding at scale Delivering an Automated, Explained, State-of-the-art NLP & OCR System The deep-learning NLP & OCR models and pipelines that were built to address the challenge Achieving state-of-the-art accuracy on real healthcare data in production Reusing, training & tuning clinical embeddings, entity recognition, entity linking, and OCR
  • 4.
    Disclaimer ▪ Roche hasbeen one of John Snow Labs’ customers since August 2018. ▪ This presentation has been prepared by Roche and John Snow Labs to provide a high-level overview of Roche’s use of John Snow Labs’ products. ▪ Nothing contained or stated herein or during the presentation constitutes Roche’s endorsement of John Snow Labs’ products. ▪ John Snow Labs is fully responsible for accuracy and completeness of any statements related to John Snow Labs’ products, including the product’s performance.
  • 5.
    The Roche difference Rootedin Science • Trusted in Healthcare Diagnostics #1 in biotechnology and in vitro diagnostics 20 billion diagnostic tests performed* Advanced scientific knowledge and technology that increases the medical value of diagnostic solutions Pharmaceuticals Leading provider of cancer treatments worldwide 127 million patients treated with Roche medicines* Focused on major medical indications and disease areas 30 Roche medicines on the WHO Model List of Essential Medicines* Decision Support Workflow | Data | Analytics Delivering Personalized Healthcare Decision support software leveraging more than 120 years of medical innovation rooted in science *Roche Annual Report, 2018 © 2019 F. Hoffmann-La Roche, Ltd NAVIFY is a trademark of Roche.
  • 6.
    NAVIFY Tumor Board NAVIFYClinical Decision Support appsA cloud-based workflow product that securely integrates and displays relevant aggregated data into a single, holistic patient dashboard for oncology care teams to review, align and decide on the optimal treatment for the patient. The clinical decision support apps ecosystem is secured and fully integrated with NAVIFY Tumor Board. NAVIFY Guidelines app Delivering personalized up-to-date guidelines for reviewing and recording patient diagnostic and treatment paths and documentation adherence using an intuitive execution decision tree released in collaboration with GE Healthcare. NAVIFY Clinical Trial Match app* Easily search the largest international trial registries, including ClinicalTrials.gov, European Medicines Agency, Japan Medical Association Center for Clinical Trials, etc. NAVIFY Publication Search app* Effortlessly search more than 858,000 publications across PubMed, American Society of Clinical Oncology and American Association of Cancer Research. *Powered by MolecularMatch, Inc. © 2019 F. Hoffmann-La Roche, Ltd NAVIFY is a trademark of Roche.
  • 7.
    Unstructured healthcare datachallenges for NAVIFY portfolio ▪ Diverse customers distributed across the world ▪ Multiple Languages ▪ Oncology ▪ Different report formats (ex: pathology, radiology) ▪ Different terminologies (ex: SNOMED, LOINC, ICD-O-3) Must unlock unstructured data to build a comprehensive, longitudinal view of the patient, and enable both clinical decision support and population analytics
  • 8.
    Sample Pathology Report Disclaimer:There is no real patient data being displayed here. Pathology reports are very diverse: ▪ Jargon ▪ Tables ▪ Key-value pairs ▪ Hand-written notes
  • 9.
    Manually Curated Report Manualcuration is extremely time consuming, expensive, and prone to errors Disclaimer: There is no real patient data being displayed here.
  • 10.
    The NAVIFY teamidentified two significant needs Requirements for both: ▪ Scalable (support 10 million pathology and radiology reports per year ▪ Compliant with privacy laws ▪ Integrates easily with AWS services ▪ Low cost Natural Language Processing (NLP): ▪ High accuracy ▪ Specialized for medical data ▪ Minimize time to train new models ▪ Extensible for new content types Optical Character Recognition (OCR): ▪ High accuracy ▪ Retain document structure (i.e. tables, lists, paragraphs…)
  • 11.
    45+ Oncology Entitiesto Extract Example: Surgical Pathology Report (Lung, Breast, Colon) Disclaimer: This is sample data from TCGA. There is no real patient data being displayed here.
  • 12.
    Optical Character Recognition(OCR) PDF Text engine_mode page_segmentation_mode erosion page_iterator_level scaling_factor Parameters Metrics word_error_rate character_error_rate bag_of_words_error_rate Experiment & Optimize
  • 13.
    Named Entity Recognition(NER) Spark NLP provides both CNN+Bi-LSTM and Bio-Bert implementations We trained a model to extract 45+ labels from Pathology reports Chiu, J. P., et al.(2016). Named entity recognition with bidirectional LSTM-CNNs. Transactions of the Association for Computational Linguistics, 4, 357-370. Bi-directional Long Short Term Memory with Convolutional Neural Network Tensorflow Models Devlin J. et al. (2019) Bert: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, MN, USA. pp. 4171–4186. Bidirectional Encoder Representations from Transformers (BERT) https://github.com/google-research/bert Lee, J., et al. (2020). BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics, 36(4), 1234-1240. Bio-BERT
  • 14.
    Entity Resolution (ER) Withjust NER - we can not resolve entities to structured code Pre-trained models for resolving healthcare entities to standard SNOMED & ICD-10 codes
  • 15.
  • 16.
  • 17.
    Training NER modelwith BERT Initialization Training data Resources Annotator Pipeline Run Training
  • 18.
    The use ofNLP will be a journey ▪ Initial goal of speeding up of pathology and radiology reports ▪ Faster curation and term highlighting in clinical reports ▪ Automate extraction of high-confidence entities, relationships
  • 19.
    What is SparkNLP? ▪ State of the art Natural Language Processing ▪ Production-grade, trainable, and scalable ▪ Open-Source Python, Java & Scala libraries ▪ 100+ Pre-trained models & pipelines ▪ Active: 26 new releases in 2018, 30 in 2019
  • 20.
  • 21.
  • 22.
  • 23.
    Speed Benchmarks • Optimizedbuilds of Spark NLP for both Intel and Nvidia • Benchmark done on AWS: Train a French NER model • Achieving F1-score of 89% requires at least 80 Epochs with batch size of 512 • Intel outperformed Nvidia: Cascade Lake was 19% faster & 46% cheaper than Tesla P-100
  • 24.
    Clinical Entity Recognition:Accuracy Bert NerDLApproach 93.3 % on blind test set The best NER score in a production system
  • 25.
    Clinical Entity Resolution “CNN-basedranking for biomedical entity normalization”. Li et al., BMC Bioinformatics, October 2017.
  • 26.
    Learn more: SparkNLP Public Python notebooks - runnable on Google Colab with one click in a browser: github.com/JohnSnowLabs/spark-nlp-workshop/tree/master/tutorials/colab github.com/JohnSnowLabs/spark-nlp-workshop/tree/master/jupyter/ enterprise/healthcare/colab Overview of the Spark NLP: nlp.johnsnowlabs.com
  • 27.
    Thank you! We arehiring!!! www.navify.com/careers/ Try Spark NLP at nlp.johnsnowlabs.com