OPTIMIZING TERASCALE MACHINE LEARNING PIPELINES WITH KEYSTONEML
Evan R. Sparks, UC Berkeley AMPLab
with Shivaram Venkataraman, Tomer Kaftan, Michael Franklin, Benjamin Recht
WHAT’S A MACHINE LEARNING PIPELINE?
A STANDARD MACHINE LEARNING PIPELINE
Right?
[Diagram: Data → Train Classifier → Model]
A STANDARD MACHINE LEARNING PIPELINE
That’s more like it!
[Diagram: Data → Feature Extraction → Train Linear Classifier → Model; Test Data → Feature Extraction → Model → Predictions]
A REAL PIPELINE FOR IMAGE CLASSIFICATION
Inspired by Coates & Ng, 2012
[Diagram: Data → Image Parser → Normalizer → Convolver (sqrt, mean) → Symmetric Rectifier (ident, abs) → Pooler (ident, mean; global pooling) → Zipper → Linear Solver → Linear Mapper (Model). Convolution filters come from Patch Extractor → Patch Whitener → KMeans Clusterer; a Label Extractor supplies training labels. At test time, Test Data flows through the same Feature Extractor and a Label Extractor into an Error Computer, producing Test Error.]
[The same pipeline diagram, with stages annotated by scaling behavior: Embarrassingly Parallel, Requires Coordination, or Tricky to Scale.]
ABOUT KEYSTONEML
• Software framework for building scalable end-to-end machine learning pipelines on Apache Spark.
• Helps us understand what it means to build systems for robust, scalable, end-to-end advanced analytics workloads and the patterns that emerge.
• Example pipelines that achieve state-of-the-art results on large-scale datasets in computer vision, NLP, and speech - fast.
• Open source software, available at: http://keystone-ml.org/
SIMPLE EXAMPLE: TEXT CLASSIFICATION
[Diagram: calling .fit(20 Newsgroups) on the pipeline Trim → Tokenize → Bigrams → Top Features → Naive Bayes → Max Classifier produces the fitted chain Trim → Tokenize → Bigrams → Top Features Transformer → Naive Bayes Model → Max Classifier.]
Once estimated, apply these steps to your production data in an online or batch fashion.
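As a rough sketch, the chaining API shown in the backup slides lets you write this pipeline in a few lines. Stage names below mirror the slide’s labels and are illustrative; KeystoneML’s actual operator names and fit signature may differ.

// Hedged sketch of the 20 Newsgroups pipeline. Trim, Tokenize, Bigrams,
// TopFeatures, NaiveBayes, and MaxClassifier stand in for the slide's
// stages and are not necessarily the library's exact identifiers.
val pipeline = (Trim then Tokenize then Bigrams
  thenEstimator TopFeatures(100000)   // fit on the training data
  thenLabelEstimator NaiveBayes       // fit on data and labels
  then MaxClassifier).fit(newsgroupsData, newsgroupsLabels)
// The result is a Transformer you can apply to production documents,
// one at a time or as an RDD, online or in batch.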
NOT SO SIMPLE EXAMPLE: IMAGE CLASSIFICATION
[Diagram: calling .fit(Images (VOC2007)) on Resize → Grayscale → SIFT → PCA → Fisher Vector → Linear Regression → Max Classifier produces the fitted chain Resize → Grayscale → SIFT → PCA Map → Fisher Encoder → Linear Model → Max Classifier.]
Achieves the performance of Chatfield et al., 2011.
Pleasantly parallel featurization and evaluation.
7 minutes on a modest cluster: 5,000 examples, 40,000 features, 20 classes.
EVEN LESS SIMPLE: IMAGENET
[Diagram: parallel feature branches feeding a Block Linear Solver and a Top 5 Classifier.
Edges branch: Resize → Grayscale → SIFT → PCA → Fisher Vector.
Color branch: LCS → PCA → Fisher Vector.
Texture branch: Gabor Wavelets → PCA → Fisher Vector.]
<100 SLOC.
Upgrading the solver for higher precision means changing 1 LOC: swap the Block Linear Solver for a Weighted Block Linear Solver (sketched below).
Adding 100,000 more texture features is easy: just attach the Gabor Wavelets branch.
1000-class classification: 1,200,000 examples, 64,000 features, 90 minutes on 100 nodes.
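Since the solver is just one stage in the chain, the “1 LOC” upgrade is a stage swap. A hedged sketch (stage names are the slide’s, not necessarily the library’s):

// Before: the standard block solver.
val imagenet = (featurizer thenLabelEstimator BlockLinearSolver()).fit(data, labels)
// After: the higher-precision weighted variant. Only this stage changes.
val imagenetHiPrec = (featurizer thenLabelEstimator WeightedBlockLinearSolver()).fit(data, labels)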
OPTIMIZING KEYSTONEML PIPELINES
High-level API enables a rich space of optimizations.
Automated ML operator selection: a logical Linear Solver, for example, may be executed as a Direct Solver, Iterative SGD, or L-BFGS.
[Diagram: Training Data → Grayscaler → SIFT Extractor → Reduce Dimensions → Fisher Vector → Normalize → Linear Map → Predictions, with Training Labels feeding the solver. Physical choices include Column Sampler + Distributed PCA for Reduce Dimensions, Column Sampler + Local GMM for Fisher Vector, and Least Sq. or L-BFGS for the Linear Map.]
Auto-caching for iterative workloads.
KEYSTONEML OPTIMIZER
• Sampling-based cost model projects resource usage (a toy sketch follows the table below):
  • CPU, Memory, Network
• Utilization tracked through the pipeline.
• Decisions made to minimize total cost of execution.
• Catalyst-based optimizer does the heavy lifting.

Stage              n      d               Size (GB)
Input              5000   1m-pixel JPEG   0.4
Resize             5000   260k pixels     3.6
Grayscale          5000   260k pixels     1.2
SIFT               5000   65000x128       309
PCA                5000   65000x80        154
FV                 5000   256x64x2        1.2
Linear Regression  5000   20              0.0007
Max Classifier     5000   1               0.00009
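To make “sampling-based cost model” concrete: time each operator on a small sample, measure output size, and extrapolate to the full dataset. The sketch below illustrates that idea only; it is not KeystoneML’s optimizer code, and Profile, profile, and project are hypothetical names.

// Toy sampling-based cost projection (illustrative, not KeystoneML's code).
case class Profile(secsPerRecord: Double, bytesPerRecord: Double)

// Run the operator on a sample; measure per-record time and output size.
def profile[A, B](op: A => B, sample: Seq[A], sizeOf: B => Long): Profile = {
  val start = System.nanoTime()
  val outputs = sample.map(op)
  val secs = (System.nanoTime() - start) / 1e9
  Profile(secs / sample.size, outputs.map(sizeOf).sum.toDouble / sample.size)
}

// Extrapolate to n records: projected (seconds, bytes).
def project(p: Profile, n: Long): (Double, Double) =
  (p.secsPerRecord * n, p.bytesPerRecord * n)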
CHOOSING A SOLVER
• Datasets have a number of interesting degrees of freedom:
  • Problem size (n, d, k)
  • Sparsity (nnz)
  • Condition number
• Platform has degrees of freedom:
  • Memory, CPU, Network, Nodes
• Solvers are predictable!

Objective: $\min_X \|AX - B\|_2^2 + \|X\|_2^2$, where $A \in \mathbb{R}^{n \times d}$, $X \in \mathbb{R}^{d \times k}$, $B \in \mathbb{R}^{n \times k}$.
CHOOSING A SOLVER
• Three solvers: Exact, Block, L-BFGS.
• Two datasets:
  • Amazon: >99% sparse, n = 65M
  • TIMIT: dense, n = 2M
• Exact solve works well for a small number of features.
• Use L-BFGS for sparse problems.
• Block solver scales well to big, dense problems with hundreds of thousands of features.
[Plot: solve time (s, log scale) vs. number of features (1024 to 16384) on Amazon and TIMIT for the Exact, Block Solver, and LBFGS implementations.]
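The curves follow from the solvers’ closed forms. For the ridge objective above (write $\lambda$ for the weight on the $\|X\|_2^2$ term; it is 1 in the slide’s form), the exact solve is the normal equations, a minimal worked form:

\[
X^\star = \left(A^\top A + \lambda I\right)^{-1} A^\top B
\]
% Forming A^T A costs O(n d^2) and factoring it O(d^3), independent of
% sparsity, which is why the exact solver wins only at small d.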
SOLVER PERFORMANCE
• Compared KeystoneML with:
  • Vowpal Wabbit: a specialized system for large, sparse problems.
  • SystemML: a general-purpose, optimizing ML system.
• Two problems:
  • Amazon: sparse text features.
  • Binary TIMIT: dense phoneme data.
• High-order bit: KeystoneML pipelines featurization and adapts to workload changes.
[Plots: training time (s) vs. number of features (1024 to 16384), comparing KeystoneML with SystemML and with Vowpal Wabbit on the Amazon and Binary TIMIT workloads.]
DECIDING WHAT TO SAVE
• Pipelines generate lots of intermediate state.
  • E.g., SIFT features blow up a 0.42 GB VOC dataset to 300 GB.
• Iterative algorithms → state needed many times.
• How do we determine what to save for later and what to reuse, given a fixed resource budget?
• Can we adapt to workload changes?
[Diagram: the VOC pipeline again (Resize → Grayscale → SIFT → PCA → Fisher Vector → Linear Regression → Max Classifier), each stage producing cacheable intermediate state.]
CACHING PROBLEM
• Output is computed via depth-first execution of the DAG.
• Caching “truncates” a path after the first visit.
• Want to minimize execution time, subject to memory constraints.
• Picking the optimal set is hard! (A toy brute-force sketch follows the table below.)
[Diagram: a five-node DAG, A through E, feeding Output, annotated with per-node compute times (5s to 60s) and output sizes (10g to 200g).]

Cache set   Time   Memory
ABCDE       140s   340g
B           140s   200g
A           180s   50g
{}          240s   0g
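Even in this toy setting the choice is a constrained subset-selection problem. A brute-force sketch under simplifying assumptions (an uncached node is recomputed once per downstream use; real counts depend on DAG structure, and the Node fields are hypothetical):

// Toy cache-set search (illustrative, not KeystoneML's algorithm).
case class Node(name: String, time: Double, size: Double, uses: Int)

// Cached nodes pay their time once; uncached nodes pay once per use.
def totalTime(nodes: Seq[Node], cached: Set[String]): Double =
  nodes.map(n => if (cached(n.name)) n.time else n.time * n.uses).sum

// Enumerate all subsets within the memory budget; keep the fastest.
def bestCacheSet(nodes: Seq[Node], memBudget: Double): Set[String] =
  nodes.map(_.name).toSet.subsets()
    .filter(s => nodes.filter(n => s(n.name)).map(_.size).sum <= memBudget)
    .minBy(s => totalTime(nodes, s))
// Exponential in the number of nodes: fine for 5, hopeless at scale.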
END-TO-END PERFORMANCE

Dataset    Training Examples   Features        Raw Size (GB)   Feature Size (GB)
Amazon     65 million          100k (sparse)   14              89
TIMIT      2.25 million        528k            7.5             8800
ImageNet   1.28 million        262k            74              2500
VOC        5000                40k             0.43            1.5
END-TO-END PERFORMANCE

Dataset    KeystoneML Accuracy   Reported Accuracy   KeystoneML Time (m)   Reported Time (m)   Speedup over Reported
Amazon     91.6%                 N/A                 3.3                   N/A                 N/A
TIMIT      66.1%                 66.3%               138                   120                 0.87x
ImageNet   67.4%                 66.6%               270                   5760                21x
VOC        57.2%                 59.2%               7                     87                  12x
END-TO-END PERFORMANCE
[Plots: per-stage running time (loading train data, featurization, model solve, loading test data, model eval) vs. cluster size (8 to 128 nodes) for Amazon, TIMIT, and ImageNet, plus speedup over 8 nodes vs. cluster size for the same workloads.]
END-TO-END PERFORMANCE
• Tested three levels of optimization:
  • None
  • Auto-caching only
  • Auto-caching and operator selection
• 7x to 15x speedup.
[Plot: speedup on the Amazon, TIMIT, and VOC workloads at optimization levels None, Whole-Pipeline, and All.]
QUESTIONS?
Project Page: http://keystone-ml.org/
Code: http://github.com/amplab/keystone
Training: http://goo.gl/axbkkc
BACKUP SLIDES
SOFTWARE FEATURES
• Data Loaders
  • CSV, CIFAR, ImageNet, VOC, TIMIT, 20 Newsgroups
• Transformers
  • NLP - Tokenization, n-grams, term frequency, NER*, parsing*
  • Images - Convolution, Grayscaling, LCS, SIFT*, Fisher Vector*, Pooling, Windowing, HOG, Daisy
  • Speech - MFCCs*
  • Stats - Random Features, Normalization, Scaling*, Signed Hellinger Mapping, FFT
  • Utility/misc - Caching, Top-K classifier, indicator label mapping, sparse/dense encoding transformers
• Estimators
  • Learning - Block linear models, Linear Discriminant Analysis, PCA, ZCA Whitening, Naive Bayes*, GMM*
• Example Pipelines
  • NLP - Amazon Product Review Classification, 20 Newsgroups, Wikipedia Language Model
  • Images - MNIST, CIFAR, VOC, ImageNet
  • Speech - TIMIT
• Evaluation Metrics
  • Binary Classification
  • Multiclass Classification
  • Multilabel Classification
* - Links to external library
Just 11k lines of code, 5k of which are tests or JavaDoc.
KEY API CONCEPTS
TRANSFORMERS
[Diagram: Input → Transformer → Output]

abstract class Transformer[In, Out] {
  def apply(in: In): Out
  def apply(in: RDD[In]): RDD[Out] = in.map(apply)
  // …
}

TYPE SAFETY HELPS ENSURE ROBUSTNESS
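A hypothetical concrete transformer, to show how little a new operator needs:

// Hypothetical example: a Transformer that lower-cases text. Defining
// the single-record apply is enough; the RDD apply comes from the
// abstract class.
object LowerCase extends Transformer[String, String] {
  def apply(in: String): String = in.toLowerCase
}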
ESTIMATORS
[Diagram: an Estimator consumes an RDD[Input] and, via .fit(), produces a Transformer.]

abstract class Estimator[In, Out] {
  def fit(in: RDD[In]): Transformer[In, Out]
  // …
}
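And a hypothetical estimator: fit() inspects an RDD and returns a Transformer closing over what it learned.

// Hypothetical example: learn the mean of the data, return a
// transformer that centers new inputs by it.
object MeanCenterer extends Estimator[Double, Double] {
  def fit(in: RDD[Double]): Transformer[Double, Double] = {
    val mean = in.mean()
    new Transformer[Double, Double] {
      def apply(x: Double): Double = x - mean
    }
  }
}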
CHAINING
[Diagram: String → NGrams(2) → Bigrams → Vectorizer → Vector, which collapses to String → featurizer → Vector.]

val featurizer: Transformer[String, Vector] = NGrams(2) then Vectorizer
COMPLEX PIPELINES

val pipeline = (featurizer thenLabelEstimator LinearModel).fit(data, labels)

[Diagram: featurizer (String → Vector) feeds the LinearModel label estimator; after .fit(data, labels) the result is pipeline: String → Prediction, with the learned Linear Map in place of the estimator.]
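Because the fitted pipeline is itself a Transformer, applying it is uniform across single records and RDDs. A usage sketch with hypothetical inputs:

// Hypothetical usage of the fitted pipeline above.
val one: Prediction = pipeline("a single input document")
val many: RDD[Prediction] = pipeline(testDocs) // testDocs: RDD[String]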
