1. Michelle Casbon
January 12, 2016 – Advanced Apache Spark Meetup
Training & Serving NLP Models in a Distributed Cloud-based Infrastructure
2. What do we do?
• Idibon creates adaptive machine intelligence that can analyze text in any language
[Diagram labels: natural language text, social media, structured insights]
3. Agenda
• Background
• Platform description
• Why we chose Spark
• How we’re using Spark ML & MLlib
• Challenges of adopting Spark in a distributed NLP system
4. What are our use cases?
• Intent to purchase
• Global health trends
• Interactive Voice Response
• Multilingual news
• SMS Prioritization
• Supply Chain Risk
• Change reception
6. Adaptive learning
• Fewer annotations
• Lower costs
• Less time spent training
• Higher accuracy
• Improves over time
[Diagram labels: labeled training set, human annotation, intelligent queuing & machine learning, unlabeled pool]
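The loop on this slide (model scores the unlabeled pool, a queue decides what humans annotate next) can be sketched as follows. This is only an illustration: the slides do not say which queuing strategy Idibon uses, so uncertainty sampling, a common active-learning heuristic, is swapped in here. The names `uncertainty`, `queue_for_annotation`, `toy_score`, and the probabilities in `TOY_SCORES` are all hypothetical.

```python
from typing import Callable, List, Sequence

def uncertainty(prob_positive: float) -> float:
    """Return 1.0 when the model is maximally unsure (p = 0.5), 0.0 at p = 0 or 1."""
    return 1.0 - abs(prob_positive - 0.5) * 2.0

def queue_for_annotation(
    pool: Sequence[str],
    score: Callable[[str], float],
    batch_size: int,
) -> List[str]:
    """Pick the unlabeled items whose predicted probability is closest to 0.5."""
    ranked = sorted(pool, key=lambda text: uncertainty(score(text)), reverse=True)
    return ranked[:batch_size]

# Hypothetical model output: probability that a message expresses intent to purchase.
TOY_SCORES = {
    "I want to buy this phone": 0.95,
    "Is this any good?": 0.55,
    "Terrible, never again": 0.05,
    "Might get one next month": 0.40,
}

def toy_score(text: str) -> float:
    return TOY_SCORES[text]

next_batch = queue_for_annotation(list(TOY_SCORES), toy_score, batch_size=2)
```

Under these assumptions, the two items the model is least sure about go to annotators first, which is one way a system could reach a given accuracy with fewer annotations.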
7. How do we do it?
[Workflow diagram with group labels: Dataset, Models]
Step 1: Goal Definition
Step 2: Identification
Step 3: Cleansing
Step 4: Training data creation
Step 5: Quality Control
Step 6: Creation
Step 7: Hyperparameter Tuning
Step 8: Intelligent Queueing
Step 9: Rule Creation
Step 10: Unseen Data Prediction
9. Why are we using Spark?
• Wide variety of algorithms
• Active development
• Open source
• Industry-standard algorithm implementation
• Intended for use in enterprise applications
• Scalability
10. How are we using Spark?
• Feature Extraction
  • TF-IDF
  • Word2Vec
  • Dimensionality reduction
• Training
  • Logistic Regression
  • SVM
  • Naïve Bayes
  • LDA
• Prediction
  • Evaluation metrics
[1.0, [1.0, 0.0, 3.0]]
[Diagram labels: Feature Extraction, Training, Prediction]
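To make the feature-extraction stage concrete, here is a minimal TF-IDF sketch in plain Python rather than Spark, so it runs without a cluster. It assumes the smoothed IDF form documented for Spark MLlib, log((N + 1) / (df + 1)), and raw term counts as TF. The function name `tf_idf` and the sample documents are hypothetical, and the talk's actual tokenization and hashing choices are not shown in the slides.

```python
import math
from collections import Counter
from typing import Dict, List

def tf_idf(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Weight each term by its count in the document times a smoothed IDF."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    doc_freq: Counter = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))
    # Smoothed IDF (assumed form): log((N + 1) / (df + 1)).
    idf = {term: math.log((n_docs + 1) / (df + 1)) for term, df in doc_freq.items()}
    return [
        {term: count * idf[term] for term, count in Counter(tokens).items()}
        for tokens in docs
    ]

docs = [["spark", "nlp"], ["spark", "ml"]]
weights = tf_idf(docs)
```

A term that occurs in every document ("spark" here) gets weight zero, while terms specific to one document keep a positive weight. The resulting per-document weights would then be packed into a feature vector and paired with a label, which is one way to read the `[1.0, [1.0, 0.0, 3.0]]` example on the slide.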
14. How do we provide online predictions with Spark?
… if you have small data
Task                    Time (µs)
Vector prediction       300
DataFrame prediction    7800
DataFrames are slow
...
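One plausible reading of the timings above is that scoring a single feature vector needs very little work, while a DataFrame call adds fixed per-call overhead that dominates on small inputs. The sketch below shows the single-vector path for a logistic regression model in plain Python. The function `predict_proba`, the weights, and the intercept are hypothetical; only the feature vector `[1.0, 0.0, 3.0]` comes from the slides.

```python
import math
from typing import Sequence

def predict_proba(
    weights: Sequence[float],
    intercept: float,
    features: Sequence[float],
) -> float:
    """Score one feature vector: dot product plus intercept, then a sigmoid."""
    if len(weights) != len(features):
        raise ValueError("weights and features must have the same length")
    margin = sum(w * x for w, x in zip(weights, features)) + intercept
    return 1.0 / (1.0 + math.exp(-margin))

# Hypothetical trained parameters; the feature vector is the one shown on the slide.
probability = predict_proba([0.5, -1.0, 0.25], -0.25, [1.0, 0.0, 3.0])
```

For an online REST request that carries one document, a direct call of this shape avoids building any tabular structure around a single row.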
15. How do we fit Spark into our existing system?
[Diagram labels: Core functionality, Idibon custom ML, …, REST API, ML persistence layer]
16. How does a persistence layer enable us to use Spark?
• Real-time operationalization of many, many models
• Embed within different platforms
• Single save/load framework
• Rapidly incorporate new NLP features
• Logging/monitoring standardized & abstracted
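The "single save/load framework" bullet can be illustrated with a small sketch: every model is written inside the same envelope, and one loader dispatches on a type tag. This is not Idibon's persistence layer, whose design the slides do not describe. The class `LinearModel`, the `LOADERS` registry, the JSON envelope, and the functions `save_model` and `load_model` are all assumptions made for the example.

```python
import json
from typing import Any, Callable, Dict, List

class LinearModel:
    """Hypothetical model type; any class with kind/to_dict/from_dict would fit."""
    kind = "linear"

    def __init__(self, weights: List[float], intercept: float) -> None:
        self.weights = weights
        self.intercept = intercept

    def to_dict(self) -> Dict[str, Any]:
        return {"weights": self.weights, "intercept": self.intercept}

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "LinearModel":
        return cls(data["weights"], data["intercept"])

# Registry mapping a type tag to the function that rebuilds that model type.
LOADERS: Dict[str, Callable[[Dict[str, Any]], Any]] = {
    LinearModel.kind: LinearModel.from_dict,
}

def save_model(model: Any) -> str:
    """Serialize any registered model into one common envelope."""
    return json.dumps({"kind": model.kind, "params": model.to_dict()})

def load_model(blob: str) -> Any:
    """Rebuild a model by dispatching on the envelope's type tag."""
    envelope = json.loads(blob)
    return LOADERS[envelope["kind"]](envelope["params"])

restored = load_model(save_model(LinearModel([0.5, -1.0], 0.25)))
```

With a layout like this, adding a new model type means registering one more loader, and callers such as a REST API never need to know which library trained the model.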
17. Summary
• Analyzing human language is hard
• We’re using the most exciting parts of Spark to build performant NLP systems that are faster & better than ever before
Detecting intent to buy in Twitter data
Extracting insights about global health trends
Making the call center IVR smarter
Informing investors of company performance based on news sentiment
Identifying supply chain risk in news articles
Understanding user reception of code pushes in online games
Prioritizing urgent SMS messages for UNICEF