Data Day TX 2016 - Jan 16, 2016

Under the Hood of Idibon’s Scalable NLP Services
Michelle Casbon
January 16, 2016, Data Day Texas, Austin

What do we do?
• Idibon creates adaptive machine intelligence that can analyze text in any language
[Diagram: natural language text, social media → structured insights]

Agenda
• Background
• Process walk-through
• Platform description
• Why we chose Spark
• How we’re using Spark ML & MLlib
• Challenges of adopting Spark in a distributed NLP system

What are our use cases?
• Supply Chain Risk
• Intent to purchase
• Global health trends
• Interactive Voice Response
• Multilingual news
• SMS Prioritization
• Change reception

Adaptive learning
• Fewer annotations
• Lower costs
• Less time spent training
• Higher accuracy
• Improves over time
[Diagram labels: unlabeled pool, intelligent queuing & machine learning, human annotation, labeled training set]
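
One way to picture the intelligent queuing step is uncertainty sampling: send the items the current model is least sure about to human annotators first. The slides do not say which queuing strategy Idibon uses, so this is only a sketch under that assumption. The toy keyword scorer and the names `score` and `queue_for_annotation` are hypothetical.

```python
# Hypothetical sketch of intelligent queuing via uncertainty sampling.
# The scorer below is a toy stand-in for a trained classifier.

def score(text, positive_words):
    """Return a pseudo-probability in [0, 1] that `text` is positive."""
    tokens = text.lower().split()
    if not tokens:
        return 0.5
    hits = sum(1 for token in tokens if token in positive_words)
    return hits / len(tokens)


def queue_for_annotation(unlabeled, positive_words, batch_size):
    """Return the `batch_size` items whose score is closest to 0.5."""
    ranked = sorted(
        unlabeled,
        key=lambda text: abs(score(text, positive_words) - 0.5),
    )
    return ranked[:batch_size]


unlabeled_pool = [
    "love love love",                # score 1.0, model is confident
    "terrible service",              # score 0.0, model is confident
    "love the price hate the wait",  # score 1/6
    "great but slow",                # score 1/3, closest to 0.5
]
positive = {"love", "great"}

# The most ambiguous items are queued for human annotation first.
to_annotate = queue_for_annotation(unlabeled_pool, positive, batch_size=2)
print(to_annotate)
```

Items that come back labeled would join the training set, the model would be retrained, and the remaining pool would be re-ranked.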

How do we do it?
1. Goal Definition
Dataset
2. Identification
3. Cleansing
4. Training data creation
5. Quality Control
Models
6. Creation
7. Hyperparameter Tuning
8. Intelligent Queueing
9. Rule Creation
10. Unseen Data Prediction

Scalability Challenges
• Real-time API support
• Document storage
• Thousands of individual predictions per second
• Continuous training
• Hyperparameter optimization

What does our platform look like?

Why are we using Spark?
• Wide variety of algorithms
• Active development
• Open source
• Industry-standard algorithm implementation
• Intended for use in enterprise applications
• Scalability

How are we using Spark?
• Feature Extraction
  • TF-IDF
  • Word2Vec
  • Dimensionality reduction
• Training
  • Logistic Regression
  • SVM
  • Naïve Bayes
  • LDA
• Prediction
• Evaluation metrics
[Diagram: Feature Extraction → Training → Prediction, with example value [1.0, [1.0, 0.0, 3.0]]]
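
For readers unfamiliar with TF-IDF, the first feature-extraction technique listed above, here is a small plain-Python sketch of the idea. It is not Spark code and not Idibon's implementation. It uses a smoothed inverse document frequency, log((N + 1) / (df + 1)), and the function name `tf_idf` is an assumption made for illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Compute smoothed TF-IDF weights for a list of tokenized documents."""
    n_docs = len(docs)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))

    weights = []
    for doc in docs:
        term_freq = Counter(doc)
        weights.append({
            term: count * math.log((n_docs + 1) / (doc_freq[term] + 1))
            for term, count in term_freq.items()
        })
    return weights


docs = [["spark", "is", "fast"], ["spark", "scales"]]
weights = tf_idf(docs)

# "spark" appears in every document, so its weight is 0;
# "fast" appears in only one, so it gets a positive weight.
print(weights[0])
```

Terms that occur everywhere carry no signal for classification, which is why they are weighted down to zero here.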

Feature Extraction
[Diagram: Extract Content → Tokenize → Bigrams → Trigrams → Feature Lookup → Vector [1.0, 0.0, 3.0]]
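
The feature-extraction steps on this slide can be sketched in plain Python as follows. This is a hedged illustration, not the production pipeline: whitespace tokenization, counting unigrams alongside bigrams and trigrams, and the three-entry `feature_index` are all assumptions chosen so the output matches the example vector on the slide.

```python
def ngrams(tokens, n):
    """Return all contiguous n-token sequences, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def extract_features(text, feature_index):
    """Tokenize, build bigrams and trigrams, and count known features."""
    tokens = text.lower().split()
    grams = tokens + ngrams(tokens, 2) + ngrams(tokens, 3)

    vector = [0.0] * len(feature_index)
    for gram in grams:
        position = feature_index.get(gram)  # feature lookup
        if position is not None:
            vector[position] += 1.0
    return vector


# Hypothetical feature vocabulary: feature string -> vector position.
feature_index = {"love": 0, "not good": 1, "really": 2}

vector = extract_features("really really really love it", feature_index)
print(vector)  # [1.0, 0.0, 3.0]
```

Any n-gram missing from the lookup table is simply ignored, so the vector length stays fixed no matter what text arrives.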

Training
[Diagram: Vector [1.0, 0.0, 3.0] → Add classification → LabeledPoint [1.0, [1.0, 0.0, 3.0]] → LogisticRegressionWithLBFGS → LogisticRegressionModel → Model Storage]
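
To make the training step concrete without requiring a Spark cluster, here is a plain-Python stand-in. It pairs each label with its feature vector, mirroring the LabeledPoint shape on the slide, and fits a logistic regression. Note the substitution: this sketch uses simple stochastic gradient descent, not the L-BFGS optimizer the slide names, and the data, function names, and hyperparameters are all illustrative assumptions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_regression(labeled_points, iterations=200, learning_rate=0.1):
    """Fit weights and an intercept with stochastic gradient descent."""
    n_features = len(labeled_points[0][1])
    weights = [0.0] * n_features
    intercept = 0.0
    for _ in range(iterations):
        for label, features in labeled_points:
            z = intercept + sum(w * x for w, x in zip(weights, features))
            error = label - sigmoid(z)
            intercept += learning_rate * error
            weights = [
                w + learning_rate * error * x
                for w, x in zip(weights, features)
            ]
    return weights, intercept


def predict(model, features):
    """Return 1.0 or 0.0 using a 0.5 probability threshold."""
    weights, intercept = model
    z = intercept + sum(w * x for w, x in zip(weights, features))
    return 1.0 if sigmoid(z) >= 0.5 else 0.0


# (label, features) pairs; the first one is the example from the slide.
training_set = [
    (1.0, [1.0, 0.0, 3.0]),
    (0.0, [0.0, 1.0, 0.0]),
    (1.0, [2.0, 0.0, 1.0]),
    (0.0, [0.0, 2.0, 1.0]),
]

model = train_logistic_regression(training_set)
print([predict(model, features) for _, features in training_set])
```

The returned `(weights, intercept)` pair plays the role of the stored model in the diagram.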

Prediction
[Diagram: New tweet → Extract Content → Tokenize → Bigrams → Trigrams → Feature Lookup → Vector [0.0, 1.0, 4.0] → Model Lookup → Predict → Classification Lookup]
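
The tail end of this flow (model lookup, predict, classification lookup) might look roughly like the sketch below. It starts from the already-extracted vector shown on the slide. The in-memory dictionaries, the task name "sentiment", and the weights are hypothetical placeholders, not values from the talk.

```python
import math

# Hypothetical stores standing in for persisted models and label names.
MODELS = {
    "sentiment": {"weights": [-1.0, 2.0, 0.5], "intercept": -1.5},
}
CLASSIFICATIONS = {
    "sentiment": {0.0: "negative", 1.0: "positive"},
}


def predict_label(task, vector):
    """Look up the model, score the vector, and map the result to a name."""
    model = MODELS[task]  # model lookup
    z = model["intercept"] + sum(
        w * x for w, x in zip(model["weights"], vector)
    )
    probability = 1.0 / (1.0 + math.exp(-z))
    numeric = 1.0 if probability >= 0.5 else 0.0  # predict
    return CLASSIFICATIONS[task][numeric]  # classification lookup


# Vector from the slide: z = -1.5 + 0.0 + 2.0 + 2.0 = 2.5
label = predict_label("sentiment", [0.0, 1.0, 4.0])
print(label)  # positive
```

Working on a single vector like this, with no distributed data structure in the path, is what keeps per-request overhead small.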

How do we provide online predictions with Spark?
DataFrames are slow … if you have small data

Task                  Time in µs
Vector prediction     300
DataFrame prediction  7800

How do we fit Spark into our existing system?
[Diagram labels: REST API, Core functionality, ML persistence layer, Idibon custom ML, …]

How does a persistence layer enable us to use Spark?
• Real-time operationalization of many, many models
• Embed within different platforms
• Single save/load framework
• Rapidly incorporate new NLP features
• Logging/monitoring standardized & abstracted
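
As a rough illustration of the "single save/load framework" idea, the sketch below routes every model type through one serialization entry point. The slides do not describe Idibon's actual persistence layer, so the class names `ModelRegistry` and `LinearModel`, the JSON format, and the method signatures are all hypothetical.

```python
import json


class ModelRegistry:
    """Hypothetical single save/load entry point for different model kinds."""

    def __init__(self):
        self._loaders = {}

    def register(self, kind, loader):
        """Associate a model kind with a callable that rebuilds it."""
        self._loaders[kind] = loader

    def save(self, kind, params):
        """Serialize a model's kind and parameters to a JSON string."""
        return json.dumps({"kind": kind, "params": params})

    def load(self, payload):
        """Rebuild a model from a JSON string produced by `save`."""
        record = json.loads(payload)
        return self._loaders[record["kind"]](record["params"])


class LinearModel:
    """Toy model type; a Spark-backed model could register the same way."""

    def __init__(self, params):
        self.weights = params["weights"]
        self.intercept = params["intercept"]

    def score(self, vector):
        return self.intercept + sum(
            w * x for w, x in zip(self.weights, vector)
        )


registry = ModelRegistry()
registry.register("linear", LinearModel)

payload = registry.save("linear", {"weights": [1.0, -1.0], "intercept": 0.5})
restored = registry.load(payload)
print(restored.score([2.0, 1.0]))  # 1.5
```

Because callers only see `save` and `load`, a new model type can be added by registering one more loader, without touching the code that serves predictions.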

Summary
• Analyzing human language is hard
• We’re using exciting tools to build performant NLP systems that are faster & better than ever before
• Introduce yourself!

Questions?
Michelle Casbon
michelle@idibon.com
@texasmichelle
