Advanced Natural Language Processing with Apache Spark NLP

This hands-on deep-dive session uses the open-source Apache Spark NLP library to explore advanced NLP in Python. Apache Spark NLP provides state-of-the-art accuracy, speed, and scalability for language understanding by delivering production-grade implementations of some of the most recent research in applied deep learning. Apache Spark NLP is the only open-source NLP library that can natively scale to use any Apache Spark cluster, as well as take advantage of the latest processors from Intel and Nvidia. It’s the most widely used NLP library in the enterprise today.

You’ll edit and run executable Python notebooks as we walk through these common NLP tasks: document classification, named entity recognition, sentiment analysis, spell checking and correction, grammar understanding, question answering, and translation. The discussion of each NLP task includes the latest advances in deep learning and transfer learning used to tackle it – from the hundreds of BERT-based embeddings to models based on the T5 transformer, MarianNMT, multilingual and domain-specific models.

  1. Advanced Natural Language Processing with Spark NLP. David Talby, CTO, John Snow Labs
  2. Agenda: 1. Introducing Spark NLP 2. State-of-the-art Accuracy 3. Speed & Scalability 4. Ease of Use 5. Examples
  3. Introducing Spark NLP
     ● Most popular NLP library in the enterprise (O’Reilly Media)
     ● 54% share of healthcare AI teams use Spark NLP (Gradient Flow)
     ● 16x growth in downloads of the library since Jan 2020 (PyPI Download Stats)
  4. What is Spark NLP?
     ▪ State-of-the-art Natural Language Processing
     ▪ Production-grade, trainable, and scalable
     ▪ Open-source Python, Java & Scala libraries
     ▪ 1,400+ pre-trained models & pipelines
     ▪ Active: 26+ new releases/year since 2017!
  5. Spark NLP in Industry. NLP Industry Survey by Gradient Flow, an independent data science research & insights company, September 2020. Survey question: Which NLP libraries does your organization use?
  6. Trusted By
  7. There’s a world of difference between an academic result and a production system: TRAINABLE & TUNABLE, 100% PRIVATE, EXPLAINABLE, REPRODUCIBLE, HARDWARE OPTIMIZED, SCALABLE, COMMUNITY & EDUCATION
  8. Spark NLP
  9. Introducing Spark NLP 3
     • Massive speedups [Databricks 7.2 ML GPU on 10 AWS f4dn.large]:
       7.9 times faster in calculating BERT-Large
       6.5 times faster in calculating BERT-base
       3.0 times faster in calculating NER DL
     • The latest compute platforms:
       Spark 3.1, 3.0, 2.4, 2.3
       Databricks 8.x, 7.x, 6.x (CPU and GPU)
       Linux, Mac, Windows (local development)
       Docker (with & without Kubernetes)
       Hadoop 2.7 and 3.x
       Cloudera & Hortonworks
       AWS, Azure, and GCP
  10. Agenda: 1. Introducing Spark NLP 2. State-of-the-art Accuracy 3. Speed & Scalability 4. Ease of Use 5. Examples
  11. On Accuracy
      Biomedical Named Entity Recognition at Scale (presented at CADL 2020, the International Workshop on Computational Aspects of Deep Learning, in conjunction with ICPR 2020)
      • Obtains new state-of-the-art results on seven public biomedical benchmarks without using heavy contextual embeddings, including BC4CHEMD to 93.72% (4.1% gain), Species800 to 80.91% (4.6% gain), and JNLPBA to 81.29% (5.2% gain)
      • Production-grade codebase on top of the Spark NLP library; can scale up for training and inference in any Spark cluster; GPU support; polyglot API
      Improving Clinical Document Understanding on COVID-19 Research with Spark NLP (presented at the SDU (Scientific Document Understanding) workshop at AAAI 2021)
      • Improves on the previous best accuracy benchmarks for assertion status detection
      • Recognizes 100+ entity types, including social determinants of health, anatomy, risk factors, and adverse events, in addition to other commonly used clinical and biomedical entities
      • Extracts trends and insights from CORD-19: most frequent disorders & symptoms, and most common vital signs and EKG findings
      Accurate Clinical Named Entity Recognition at Scale (under review)
      • Establishes new state-of-the-art accuracy on 3 clinical concept extraction challenges: 2010 i2b2/VA clinical concept extraction, 2014 n2c2 de-identification, and 2018 n2c2 medication extraction
      • Outperforms the accuracy of Amazon Comprehend Medical and Google Cloud Healthcare API by a large margin (8.9% and 6.7% respectively)
      • Outperforms a plain Keras implementation
  12. Accuracy: State-of-the-art Models (Named Entity Recognition)
      ● “State of the art” means the best peer-reviewed academic results
      ● For example: the best F1 score on the CoNLL-2003 NER benchmark for a system in production
      ● Spark NLP uses a custom model based on Bi-LSTM + Char-CNN + CRF + Word Embeddings
  13. Accuracy: State-of-the-art Models (Named Entity Recognition)
      ● Spark NLP achieves the best F1 score on the CoNLL-2003 NER benchmark for a system in production
      ● The BERT Large model was used to train our Bi-LSTM + Char-CNN + CRF model
  14. Accuracy: State-of-the-art Models (Named Entity Recognition)
      ● Everything must work right out of the box
      ● All parameters are left at their defaults
      ● The CoNLL-2003 dataset is used in this benchmark: eng.train was used for training and eng.testa for evaluating the model
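The NER slides above rank systems by F1 on CoNLL-2003. As a rough illustration of what that number measures, here is a minimal sketch of entity-level micro F1 in plain Python. It assumes BIO-style tags and exact span matching; the function names are hypothetical, and this is not the official evaluation script or Spark NLP code.

```python
def extract_entities(tags):
    """Return the set of (start, end, type) spans in a BIO tag sequence."""
    spans, start, etype = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel "O" closes a trailing span
        continues = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not continues:
            spans.add((start, i, etype))
            start, etype = None, None
        # A B- tag always opens a span; a stray I- tag is treated leniently as an opener.
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, etype = i, tag[2:]
    return spans


def micro_f1(gold_sentences, pred_sentences):
    """Entity-level micro F1: a predicted span counts only if boundaries and type match."""
    tp = fp = fn = 0
    for gold, pred in zip(gold_sentences, pred_sentences):
        g, p = extract_entities(gold), extract_entities(pred)
        tp += len(g & p)
        fp += len(p - g)
        fn += len(g - p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


gold = [["B-PER", "I-PER", "O", "B-LOC"]]
pred = [["B-PER", "I-PER", "O", "B-ORG"]]
score = micro_f1(gold, pred)  # one of two entities matches exactly
```

Because a span with the right boundaries but the wrong type counts as both a false positive and a false negative, entity-level F1 is stricter than token-level accuracy.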
  15. Transformers & Embeddings. Spark NLP: 100+ Word Embeddings
      ● BERT ● Small BERT ● BioBERT ● CovidBERT ● ALBERT ● ELECTRA ● XLNet ● ELMO ● GloVe
  16. Accuracy: State-of-the-art Models (Multi-class & Multi-label Text Classification)
      ● Multi-class text classification to detect emotions, cyberbullying, fake news, spam, etc.
      ● Multi-label text classification to detect toxic comments, movie genres, etc.
      ● Hundreds of pre-trained Word and Sentence Embeddings
      ● Language-Agnostic BERT Sentence Embedding
      ● Universal Sentence Encoder as an input for text classification
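The difference between multi-class and multi-label classification on the slide above comes down to how per-label scores are turned into a decision. The sketch below shows the generic technique in plain Python: softmax plus argmax picks exactly one label, while an independent sigmoid per label with a threshold can pick several. The scores, label names, and 0.5 threshold are made-up illustrations, and nothing here is claimed to be how ClassifierDL or MultiClassifierDL work internally.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def predict_multiclass(scores, labels):
    """Multi-class: labels compete, exactly one wins."""
    probs = softmax(scores)
    best = max(range(len(labels)), key=lambda i: probs[i])
    return labels[best]


def predict_multilabel(scores, labels, threshold=0.5):
    """Multi-label: each label is an independent yes/no decision."""
    return [label for s, label in zip(scores, labels) if sigmoid(s) >= threshold]


emotion = predict_multiclass([2.0, 0.5, -1.0], ["joy", "anger", "fear"])
toxicity = predict_multilabel([1.2, -0.3, 0.8], ["toxic", "obscene", "insult"])
```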
  17. Accuracy: State-of-the-art Models (SentimentDL, ClassifierDL, and MultiClassifierDL)
      ● Embeddings: BERT, Small BERT, BioBERT, CovidBERT, LaBSE, ALBERT, ELECTRA, XLNet, ELMO, Universal Sentence Encoder, GloVe
      ● Dimensions: 100, 200, 128, 256, 300, 512, 768, 1024
      ● Pretrained models: tfhub_use, tfhub_use_lg, glove_6B_100, glove_6B_300, glove_840B_300, bert_base_cased, bert_base_uncased, bert_large_cased, bert_large_uncased, bert_multi_uncased, electra_small_uncased, elmo, ...
      ● Classes: 2 classes (positive/negative), 3 classes (0, 1, 2), 4 classes (Sports, Business, etc.), 5 classes (1.0, 2.0, 3.0, 4.0, 5.0), ... up to 100 classes!
  18. Accuracy: State-of-the-art Models (Language Detection & Identification)
      ● LanguageDetectorDL is a state-of-the-art TensorFlow/Keras model
      ● Uses the positions of the characters
      ● It is around 3 MB to 5 MB
      ● It has been trained on more than 8 million Wikipedia pages
      ● It has between 97% and 99% accuracy for text longer than 140 characters
  19. Accuracy: State-of-the-art Models (Context Spell Checker)
      ● Ability to consider OCR-specific error patterns
      ● Ability to leverage the context
      ● Ability to preserve and even correct custom patterns
      ● Flexibility to incorporate your own custom patterns
  20. Agenda: 1. Introducing Spark NLP 2. State-of-the-art Accuracy 3. Speed & Scalability 4. Ease of Use 5. Examples
  21. Optimizing Performance (BERT Embeddings)
      ● Transformers are slow!
      ● They need GPUs
      ● Speed depends highly on max sequence length
      Spark NLP 2.6 optimizations:
      ● Reduce memory consumption by 30%
      ● Improve performance by more than 70% with dynamic shape
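One way to see why max sequence length matters so much, and why a dynamic input shape can help, is to count how many token slots a model has to process. The sketch below is only a back-of-the-envelope illustration under an assumption: that "dynamic shape" means padding each batch to its own longest sentence rather than to the global maximum. The slides do not spell out the mechanism, and the function name and numbers are hypothetical.

```python
def padded_slots(lengths, batch_size, max_seq_len, dynamic):
    """Count token slots processed when sentences are padded batch by batch.

    dynamic=False pads every batch to max_seq_len;
    dynamic=True pads each batch only to its own longest (truncated) sentence.
    """
    total = 0
    for i in range(0, len(lengths), batch_size):
        batch = [min(n, max_seq_len) for n in lengths[i:i + batch_size]]
        width = max(batch) if dynamic else max_seq_len
        total += width * len(batch)
    return total


sentence_lengths = [5, 7, 120, 9]  # made-up token counts
fixed = padded_slots(sentence_lengths, batch_size=2, max_seq_len=128, dynamic=False)
dynamic = padded_slots(sentence_lengths, batch_size=2, max_seq_len=128, dynamic=True)
```

With these toy numbers the fixed shape processes 512 slots and the dynamic one 254, so most of the fixed-shape work is spent on padding. Real gains depend on the length distribution of the data.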
  22. Performance (BERT Embeddings)
      ● Trade off size, memory, and accuracy
      ● Tiny BERT ● Mini BERT ● Small BERT ● Medium BERT ● Others…
      Example:
      ● BERT-Tiny is 24x smaller and 28x faster than BERT-Base
  23. Performance: Hardware
      ● Optimized builds of Spark NLP for both Intel and Nvidia
      ● Out-of-the-box optimizations for Intel (MKL, etc.) and Nvidia (Spark 3, etc.)
      ● Ongoing profiling with engineering teams at both companies
  24. Scale: Distribution & Parallelism
      ● Zero code changes to scale a pipeline to any Spark cluster
      ● The only natively distributed open-source NLP library
      ● Spark provides execution planning, caching, serialization, and shuffling
      ● Caveats:
        ● Speedup depends on what you actually do
        ● Spark configurations matter
        ● Cluster tuning based on your data is advised
  25. Scale: Distribution & Parallelism (Recognize Entity DL Pipeline)
      ● Amazon full reviews: 15 million sentences and 255 million tokens
      ● Single node with 32G memory & 32 cores
      ● 10 workers with 32G memory & 16 cores
      ● The pipeline includes sentence detection, tokenization, word embeddings, and NER
      Setup:
      ● The single node is a dedicated Dell server
      ● The 10 nodes run in Databricks on AWS
  26. Scale: Distribution & Parallelism (BERT Embeddings)
      ● Amazon full reviews: 15 million sentences and 255 million tokens
      ● Single node with 64G memory & 32 cores
      ● 10 workers with 32G memory & 16 cores
      ● 128 max sequence length
      Setup:
      ● The single node is a dedicated Dell server
      ● The 10 nodes run in Databricks on AWS
  27. Agenda: 1. Introducing Spark NLP 2. State-of-the-art Accuracy 3. Speed & Scalability 4. Ease of Use 5. Examples
  28. Easy to Use: Python, Scala, and Java
      ● Pretrained pipelines
      ● Pretrained models
      ● Training your own models
  29. Easy to Use: Pretrained Pipelines
      ● 100+ pretrained pipelines
      ● Full support for 13 languages
      ● Simple and easy to use
      ● Works online and offline
      ● Preconfigured
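For readers who have not seen a pretrained pipeline in use, the following is a minimal sketch in Python. It assumes the `spark-nlp` and `pyspark` packages are installed and that the `recognize_entities_dl` English pipeline can be downloaded on first use; the wrapper function `annotate_text` is a hypothetical helper, not part of the library. The imports are deferred into the function so the snippet loads even where Spark NLP is not installed.

```python
def annotate_text(text, pipeline_name="recognize_entities_dl", lang="en"):
    """Run a Spark NLP pretrained pipeline on one string and return its annotations."""
    # Deferred imports: these require the spark-nlp and pyspark packages.
    import sparknlp
    from sparknlp.pretrained import PretrainedPipeline

    sparknlp.start()  # starts (or reuses) a local Spark session configured for Spark NLP
    pipeline = PretrainedPipeline(pipeline_name, lang=lang)  # downloaded and cached on first use
    return pipeline.annotate(text)  # dict keyed by the pipeline's output column names


# Example call (not executed here because it needs Spark and a model download):
# result = annotate_text("Google was founded in California by Larry Page.")
# print(result.keys())
```

The same pipeline object can also be applied to a Spark DataFrame with `transform`, which is where the cluster-scale behavior described earlier comes in.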
  30. Easy to Use: Pretrained Models
      ● Hundreds of pretrained models
      ● Support for 46 languages
      ● Works online and offline
      ● Flexible & customized pipelines
      ● Caveat: some models depend on each other
  31. Easy to Use: Train your own POS tagging models
      ● POS() accepts token-tag format
      ● The POS tagger is based on the Averaged Perceptron algorithm
      ● Language-agnostic: supports any language
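To make the two technical points on the POS slide concrete, here is a toy averaged perceptron tagger in plain Python that reads token-tag pairs. It is a sketch of the general algorithm only, not Spark NLP's implementation: the `|` delimiter, the feature set, and all names are assumptions chosen for illustration.

```python
from collections import defaultdict


def parse_tagged(line, delimiter="|"):
    """Split 'The|DT dog|NN' into (['The', 'dog'], ['DT', 'NN'])."""
    pairs = [item.rsplit(delimiter, 1) for item in line.split()]
    return [w for w, _ in pairs], [t for _, t in pairs]


def features(words, i, prev_tag):
    """A deliberately tiny, illustrative feature set."""
    w = words[i].lower()
    return ["bias", "word=" + w, "suffix=" + w[-2:], "prev_tag=" + prev_tag]


class AveragedPerceptron:
    def __init__(self, tags):
        self.tags = sorted(tags)
        self.w = defaultdict(float)       # current weight per (feature, tag)
        self.totals = defaultdict(float)  # accumulated weight over all steps
        self.stamps = defaultdict(int)    # step at which each weight last changed
        self.step = 0

    def predict(self, feats):
        return max(self.tags, key=lambda t: sum(self.w.get((f, t), 0.0) for f in feats))

    def update(self, truth, guess, feats):
        self.step += 1
        if truth == guess:
            return
        for f in feats:
            for tag, delta in ((truth, 1.0), (guess, -1.0)):
                key = (f, tag)
                # Lazy averaging: credit the old value for the steps it was in effect.
                self.totals[key] += (self.step - self.stamps[key]) * self.w[key]
                self.stamps[key] = self.step
                self.w[key] += delta

    def average(self):
        """Replace each weight by its average over all training steps."""
        if self.step == 0:
            return
        for key, value in self.w.items():
            total = self.totals[key] + (self.step - self.stamps[key]) * value
            self.w[key] = total / self.step


def train(sentences, epochs=5):
    model = AveragedPerceptron({t for _, tags in sentences for t in tags})
    for _ in range(epochs):
        for words, gold in sentences:
            prev = "<s>"
            for i, truth in enumerate(gold):
                feats = features(words, i, prev)
                model.update(truth, model.predict(feats), feats)
                prev = truth  # assumption: condition on the gold previous tag while training
    model.average()
    return model


def tag(model, words):
    prev, out = "<s>", []
    for i in range(len(words)):
        prev = model.predict(features(words, i, prev))
        out.append(prev)
    return out


data = [parse_tagged("The|DT dog|NN barks|VBZ"), parse_tagged("A|DT cat|NN sleeps|VBZ")]
model = train(data)
predicted = tag(model, ["The", "dog", "barks"])
```

Averaging the weights over all training steps, rather than keeping only the final ones, is what makes the plain perceptron less sensitive to the last few examples it saw.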
  32. Easy to Use: Train your own NER models
      ● CoNLL 2003 format as input
      ● Accepts 50+ Word Embeddings models
      ● Train on CPU or GPU
      ● Extended metrics and evaluation
      ● Built-in validation split with metrics
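The CoNLL 2003 format mentioned above is a plain-text layout with one token per line, whitespace-separated columns (token, POS tag, chunk tag, NER tag), blank lines between sentences, and `-DOCSTART-` lines between documents. Spark NLP ships its own reader for this format; the sketch below is independent of it and only shows the layout, using a hypothetical `read_conll` helper and a short made-up sample.

```python
def read_conll(lines):
    """Parse CoNLL-2003 style lines into a list of (tokens, ner_tags) sentences."""
    sentences, tokens, tags = [], [], []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("-DOCSTART-"):
            # A blank line or document marker ends the current sentence, if any.
            if tokens:
                sentences.append((tokens, tags))
                tokens, tags = [], []
            continue
        cols = line.split()
        tokens.append(cols[0])   # first column: the token
        tags.append(cols[-1])    # last column: the NER tag
    if tokens:
        sentences.append((tokens, tags))
    return sentences


sample = """-DOCSTART- -X- -X- O

EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC
call NN I-NP O

Peter NNP B-NP B-PER
Blackburn NNP I-NP I-PER
""".splitlines()

parsed = read_conll(sample)
```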
  33. Easy to Use: Train your own NER models
      ● BERT with 2 layers & 768 dimensions
      ● 16 minutes training
      ● 91% Micro F1 on Dev
      ● 90% conll_eval on Dev
      ● Full CoNLL 2003 training dataset
      ● Google Colab with GPU
  34. Easy to Use: Train your own multi-class classifiers
      ● Supports up to 100 classes
      ● Accepts 90+ Word & Sentence Embeddings models
      ● Train on CPU or GPU
      ● Extended metrics and evaluation
      ● Built-in validation split with metrics
  35. Agenda: 1. Introducing Spark NLP 2. State-of-the-art Accuracy 3. Speed & Scalability 4. Ease of Use 5. Examples
  36. Spark NLP for Healthcare
  37. Spark OCR
  38. The Annotation Lab: Project creation, Team setup, Tasks creation, Labeling
  39. Learn More
      ● Using Spark NLP to build a drug discovery knowledge graph for Covid-19 (Vishnu Vettrivel & Alexander Thomas, Founder & Principal Data Scientist at Wisecube)
      ● NLP in Healthcare: Challenges & Opportunities (Ganesh Thodikulam, Executive Director, Kaiser Permanente)
      ● A Unified CV, OCR, and NLP for Scalable Document Understanding (Patrick Beukema, Ph.D., Senior ML Engineer, DocuSign)
      ● Text Analytics and its Applications in the Pharma Industry (Harsha Gurulingappa, Ph.D., Text Analytics Product Owner at Merck)
      ● NLP in Oncology Real World Data: Opportunities to develop a true learning healthcare system (George A. Komatsoulis, Ph.D., Chief of Bioinformatics at CancerLinQ)
      ● Automated & Explainable Deep Learning for Clinical Language Understanding at Roche (Vishakha Sharma, Ph.D., Principal Data Scientist, Roche)
  40. Thank you! Live demos: demo.johnsnowlabs.com. Get started: nlp.johnsnowlabs.com. © 2015-2021 John Snow Labs Inc. All rights reserved. The John Snow Labs logo is a trademark of John Snow Labs Inc. The included information is for informational purposes only and represents the current view of John Snow Labs as of the date of this presentation. Since John Snow Labs must respond to changing market conditions, it should not be interpreted to be a commitment on its part, and John Snow Labs cannot guarantee the accuracy of any information provided after the date of this presentation. John Snow Labs makes no warranties, express or statutory, as to the information in this presentation.