Advanced Natural Language Processing with Apache Spark NLP

NLP is a key component in many data science systems that must understand or reason about text. This hands-on tutorial uses the open-source Spark NLP library to explore advanced NLP in Python.
  1. Advanced Natural Language Processing with Spark NLP. Alex Thomas, Principal Data Scientist at WiseCube; David Talby, CTO at John Snow Labs
  2. Agenda: Introducing Spark NLP (accuracy, scalability, and speed benchmarks; out-of-the-box functionality). Getting Things Done (end-to-end NLP tasks in 3 lines of code; key concepts and a backstage tour). Notebooks! (using pre-trained pipelines & models; named entity recognition; document classification)
  3. INTRODUCING SPARK NLP: STATE-OF-THE-ART NLP FOR PYTHON, JAVA & SCALA. 1. Accuracy 2. Scalability 3. Speed
  4. SPARK NLP IN THE ENTERPRISE. Source: O’Reilly “AI Adoption in the Enterprise” survey of 1,300 practitioners, Feb 2019
  5. ACCURACY • “State of the art” means the best peer-reviewed academic results • Public benchmarks: comparing production-grade NLP libraries • Public benchmarks of pre-trained models: nlp.johnsnowlabs.com • “Spark NLP 2.4 sets new accuracy records for common tasks including NER, OCR & Matching” (new: redesigned NER-DL and BERT-large; Spark OCR image filters & scalable pipelines; hierarchical clinical entity resolution) • “Spark NLP 2.5 delivers state-of-the-art accuracy for spell checking and sentiment analysis” (new: ALBERT & XLNet embeddings; contextual spell checker; DL-based sentiment analysis)
  6. SCALABILITY • Zero code changes to scale a pipeline to any Spark cluster • The only natively distributed open-source NLP library • Spark provides execution planning, caching, serialization, and shuffling • Caveats: speedup depends heavily on what you actually do; not all algorithms scale well; Spark configuration matters
  7. SPEED: GET THE MOST FROM MODERN HARDWARE • Optimized builds of Spark NLP for both Intel and Nvidia • Benchmark done on AWS: train a named entity recognizer in French • Achieving an F1 score of 89% requires at least 80 epochs with a batch size of 512 • Intel outperformed Nvidia: Cascade Lake was 19% faster & 46% cheaper than a Tesla P-100
  8. Production Grade + Active Community • In production at multiple Fortune 500 companies • 26 new releases in 2018, 30 in 2019 • Active Slack community • Permissive open-source license: Apache 2.0
  9. SPARK NLP out-of-the-box functionality
  10. OFFICIALLY SUPPORTED RUNTIMES
  11. Getting Things Done
  12. SENTIMENT ANALYSIS
      import sparknlp
      sparknlp.start()
      from sparknlp.pretrained import PretrainedPipeline
      pipeline = PretrainedPipeline('analyze_sentiment', 'en')
      result = pipeline.annotate('Harry Potter is a great movie')
      print(result['sentiment'])  # will print ['positive']
  13. NAMED ENTITY RECOGNITION
      pipeline = PretrainedPipeline('recognize_entities_bert', 'en')
      result = pipeline.annotate('Harry and Ron met in Hogsmeade')
      print(result['ner'])  # prints ['I-PER', 'O', 'I-PER', 'O', 'O', 'I-LOC']
  14. SPELL CHECKING & CORRECTION. Now in Scala:
      val pipeline = PretrainedPipeline("spell_check_ml", "en")
      val result = pipeline.annotate("Harry Potter is a graet muvie")
      println(result("spell"))  /* will print Seq[String](…, "is", "a", "great", "movie") */
  15. UNDER THE HOOD
      1. sparknlp.start() starts a new Spark session if there isn’t one, and returns it.
      2. PretrainedPipeline() loads the English version of the explain_document_dl pipeline, the pre-trained models, and the embeddings it depends on.
      3. These are stored and cached locally.
      4. TensorFlow is initialized within the same JVM process that runs Spark. The pre-trained embeddings and deep-learning models (like NER) are loaded. Models are automatically distributed and shared if running on a cluster.
      5. The annotate() call runs an NLP inference pipeline which activates each stage’s algorithm (tokenization, POS, etc.).
      6. The NER stage is run on TensorFlow, applying a neural network with bi-LSTM layers for tokens and a CNN for characters.
      7. Embeddings are used to convert contextual tokens into vectors during the NER inference process.
      8. The result object is a plain old local Python dictionary.
  16. KEY CONCEPT #1: PIPELINE. A list of text processing steps; each step has input and output columns.
      DocumentAssembler → SentenceDetector → Tokenizer → SentimentAnalyzer
      (columns: text → document → sentence → token → sentiment)
  17. KEY CONCEPT #2: ANNOTATOR. An object encapsulating one text processing step.
      sentiment_detector = SentimentDetector() \
          .setInputCols(["sentence"]) \
          .setOutputCol("sentiment_score") \
          .setDictionary(resource_path + "sent.txt")
  18. KEY CONCEPT #3: RESOURCE. An external file that an annotator needs; resources can be shared, cached, and locally stored. • Trained ML models • Trained DL networks • Dictionaries • Embeddings • Rules • Pretrained pipelines
  19. KEY CONCEPT #4: PRETRAINED PIPELINE. A pre-built pipeline, with all the annotators and resources it needs.
  20. PUTTING IT ALL TOGETHER: TRAINING A NER WITH BERT
      Initialization → Training data → Resources → Annotator → Pipeline → Run training
  21. Notebooks!
  22. Cleaning, Splitting, and Finding Text + Understanding Grammar. Run the “Spark NLP Basics” notebook. Open on Google Colab: https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/1.SparkNLP_Basics.ipynb
  23. Using Pre-trained Pipelines + Named Entity Recognition. Run the “Entity Recognizer with Deep Learning” notebook. Open on Google Colab: https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/colab/4-%20Entity%20Recognizer%20DL.ipynb
  24. Training your own NER model. Run the “NER BERT Training” notebook. Open on Google Colab: https://colab.research.google.com/drive/1A1ovV74nOG-MEpVQnmageeU-ksRLSmXZ — Walkthrough in blog post: https://www.johnsnowlabs.com/named-entity-recognition-ner-with-bert-in-spark-nlp/
  25. Document Classification + Universal Sentence Embeddings. Run the “Text Classification with ClassifierDL” notebook. Open on Google Colab: https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/5.Text_Classification_with_ClassifierDL.ipynb
  26. WHAT ELSE IS AVAILABLE? • Spark NLP for Healthcare: 50+ models for clinical entity recognition, linking to medical terminologies, assertion status detection, and de-identification • Spark OCR: 20 annotators for image enhancement, layout, and smart editing
  27. LEARN MORE: TECHNICAL CASE STUDIES • Improving Patient Flow Forecasting • Automated clinical coding & chart reviews • Knowledge Extraction from Pathology Reports • High-accuracy fact extraction from long financial documents • Improving Mental Health for HIV-Positive Adolescents • Accelerating Clinical Trial Recruiting
  28. NEXT STEPS 1. Read the docs & join Slack: https://nlp.johnsnowlabs.com 2. Star & fork the repo: github.com/JohnSnowLabs/spark-nlp 3. Questions? Get in touch
  29. Thank you! alex@wisecube.com david@johnsnowlabs.com