Michelle Casbon
January 12, 2016 – Advanced Apache Spark Meetup
Training & Serving NLP Models in a Distributed Cloud-based Infrastructure
What do we do?
• Idibon creates adaptive machine intelligence that can analyze text in any language
• Natural language text & social media → structured insights
Agenda
• Background
• Platform description
• Why we chose Spark
• How we’re using Spark ML & MLlib
• Challenges of adopting Spark in a distributed NLP system
What are our use cases?
• Intent to purchase
• Global health trends
• Interactive Voice Response
• Multilingual news
• SMS Prioritization
• Supply Chain Risk
• Change reception
How do we do it?
Adaptive learning
• Fewer annotations
• Lower costs
• Less time spent training
• Higher accuracy
• Improves over time
The loop: unlabeled pool → intelligent queuing & machine learning → human annotation → labeled training set (see the sketch below)
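The "intelligent queuing" step is, in spirit, uncertainty sampling: score the unlabeled pool with the current model and route the least-confident documents to human annotators first. The slide doesn't show Idibon's implementation; the Scala sketch below is only a minimal illustration of that idea with spark.mllib, and `unlabeledDocs` and `batchSize` are hypothetical names.

```scala
import org.apache.spark.mllib.classification.LogisticRegressionModel
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// Minimal uncertainty-sampling sketch (not Idibon's actual queuing logic).
// clearThreshold() makes predict() return the raw positive-class probability
// instead of a hard 0.0/1.0 label.
def queueForAnnotation(model: LogisticRegressionModel,
                       unlabeledDocs: RDD[(String, Vector)], // (docId, features)
                       batchSize: Int): Array[String] = {
  model.clearThreshold()
  unlabeledDocs
    .map { case (docId, features) =>
      val p = model.predict(features)     // probability of the positive class
      (docId, math.abs(p - 0.5))          // distance from the decision boundary
    }
    .sortBy(_._2)                         // least confident (closest to 0.5) first
    .keys
    .take(batchSize)
}
```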
How do we do it?
1. Goal Definition
Dataset
2. Identification
3. Cleansing
4. Training data creation
5. Quality Control
Models
6. Creation
7. Hyperparameter Tuning
8. Intelligent Queueing
9. Rule Creation
10. Unseen Data Prediction
What does our platform look like?
Why are we using Spark?
• Wide variety of algorithms
• Active development
• Open source
• Industry-standard algorithm implementation
• Intended for use in enterprise applications
• Scalability
How are we using Spark?
• Feature Extraction
  • TF-IDF (see the sketch below)
  • Word2Vec
  • Dimensionality reduction
• Training
  • Logistic Regression
  • SVM
  • Naïve Bayes
  • LDA
• Prediction
• Evaluation metrics
Pipeline: Feature Extraction → Training → Prediction, passing labeled vectors such as [1.0, [1.0, 0.0, 3.0]]
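As a concrete example of the feature-extraction bullets, here is a minimal TF-IDF sketch using spark.mllib's `HashingTF` and `IDF`. The toy documents and the `sc` SparkContext are assumed; this is a generic MLlib usage sketch, not Idibon's code.

```scala
import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// Toy corpus: each document already tokenized into terms
val docs: RDD[Seq[String]] = sc.parallelize(Seq(
  Seq("natural", "language", "text"),
  Seq("social", "media", "text")
))

val hashingTF = new HashingTF(1 << 18)       // hash terms into a fixed-size feature space
val tf: RDD[Vector] = hashingTF.transform(docs)
tf.cache()                                    // IDF needs two passes over the term frequencies

val idfModel = new IDF().fit(tf)
val tfidf: RDD[Vector] = idfModel.transform(tf)
```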
Feature Extraction
Extract Content → Tokenize → Bigrams → Trigrams → Feature Lookup → Vector, e.g. [1.0, 0.0, 3.0]
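The same flow in plain Scala, with simplified stand-ins for each box: a whitespace tokenizer, n-gram generation, and a hypothetical feature index mapping each feature string to a vector column. None of this is Idibon's implementation; it only shows how the steps compose into an MLlib `Vector`.

```scala
import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Extract Content -> Tokenize (real tokenization is language-aware; this is not)
def tokenize(content: String): Seq[String] =
  content.toLowerCase.split("\\s+").filter(_.nonEmpty).toSeq

// Bigrams / Trigrams
def ngrams(tokens: Seq[String], n: Int): Seq[String] =
  tokens.sliding(n).filter(_.size == n).map(_.mkString(" ")).toSeq

// Feature Lookup: hypothetical index from feature string to vector column
def featureLookup(features: Seq[String], index: Map[String, Int]): Vector = {
  val counts = features.flatMap(index.get)
    .groupBy(identity)
    .map { case (col, hits) => (col, hits.size.toDouble) }
  Vectors.sparse(index.size, counts.toSeq)
}

val index  = Map("spark" -> 0, "hadoop" -> 1, "natural language processing" -> 2)
val tokens = tokenize("Spark for natural language processing")
val vector = featureLookup(tokens ++ ngrams(tokens, 2) ++ ngrams(tokens, 3), index)
// vector is a sparse representation of [1.0, 0.0, 1.0]
```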
Training
Vector [1.0, 0.0, 3.0] → Add classification → LabeledPoint [1.0, [1.0, 0.0, 3.0]] → LogisticRegressionWithLBFGS → LogisticRegressionModel → Model Storage
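In spark.mllib terms, the training flow above is roughly: attach the label to each feature vector as a `LabeledPoint`, fit `LogisticRegressionWithLBFGS`, and persist the resulting model. A minimal sketch, with toy data and an illustrative save path:

```scala
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// "Add classification": wrap each feature vector with its label,
// producing points like [1.0, [1.0, 0.0, 3.0]]
val training = sc.parallelize(Seq(
  LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 3.0)),
  LabeledPoint(0.0, Vectors.dense(0.0, 1.0, 4.0))
))

val model = new LogisticRegressionWithLBFGS()
  .setNumClasses(2)
  .run(training)                    // yields a LogisticRegressionModel

// "Model Storage": Spark's built-in persistence (path is illustrative)
model.save(sc, "/models/example-classifier")
```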
Prediction
New tweet → Extract Content → Tokenize → Bigrams → Trigrams → Feature Lookup → Vector, e.g. [0.0, 1.0, 4.0] → Model Lookup → Predict → Classification Lookup
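The prediction side in the same API: load the stored model, run the new document through the same feature extraction to get a vector, and call `predict`. The path and values are illustrative; mapping the numeric label back to a human-readable class ("Classification Lookup") is application code.

```scala
import org.apache.spark.mllib.classification.LogisticRegressionModel
import org.apache.spark.mllib.linalg.Vectors

// "Model Lookup": load a previously stored model (illustrative path)
val model = LogisticRegressionModel.load(sc, "/models/example-classifier")

// Feature extraction on the new tweet yields a vector such as [0.0, 1.0, 4.0]
val features = Vectors.dense(0.0, 1.0, 4.0)

// "Predict": returns 0.0 or 1.0, which is then mapped back to a label name
val predicted = model.predict(features)
```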
How do we provide online predictions with Spark?
… if you have small data

Task                   Time in µs
Vector prediction             300
DataFrame prediction        7,800

DataFrames are slow … (see the sketch below)
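The timings above come from the slide and are specific to this setup, but the shape of the comparison is reproducible: predicting on a bare `Vector` is a local method call on the model, while the DataFrame route builds a single-row DataFrame and runs a spark.ml `transform`, paying planning and job-scheduling overhead per request. In the hedged sketch below, `mllibModel` and `pipelineModel` stand for an already-trained spark.mllib model and a fitted spark.ml pipeline.

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.sql.SQLContext

// Path 1: spark.mllib — a local method call, no Spark job involved
val fast = mllibModel.predict(Vectors.dense(0.0, 1.0, 4.0))

// Path 2: spark.ml — wrap the single document in a DataFrame and transform it;
// the work only happens when first() collects the result, and the per-request
// planning/scheduling overhead dominates for one-document requests
val sqlContext = new SQLContext(sc)
val df = sqlContext.createDataFrame(Seq(
  (0.0, Vectors.dense(0.0, 1.0, 4.0))
)).toDF("label", "features")
val slow = pipelineModel.transform(df).select("prediction").first().getDouble(0)
```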
How do we fit Spark into our existing system?
• Core functionality
• Idibon custom ML
• …
• REST API
• ML persistence layer
How does a persistence layer enable us to use Spark?
• Real-time operationalization of many, many models
• Embed within different platforms
• Single save/load framework (see the sketch below)
• Rapidly incorporate new NLP features
• Logging/monitoring standardized & abstracted
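For models that are plain spark.mllib classes, the "single save/load framework" bullet lines up with Spark's own save/load support; Idibon's persistence layer abstracts more than this, but the core round trip it builds on looks like the sketch below, where `trainedModel` is a previously fitted MLlib model, `sc` is an available SparkContext, and the path is illustrative.

```scala
import org.apache.spark.mllib.classification.LogisticRegressionModel

// Save once after training (e.g. from the training job)...
trainedModel.save(sc, "hdfs:///models/sentiment/v42")

// ...then load from any process with a SparkContext, such as the REST API serving layer
val restored = LogisticRegressionModel.load(sc, "hdfs:///models/sentiment/v42")
```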
Summary
• Analyzing human language is hard
• We’re using the most exciting parts of Spark to build performant NLP systems that are faster & better than ever before
Questions?
Michelle Casbon
michelle@idibon.com
@texasmichelle