OPTIMIZING TERASCALE MACHINE LEARNING PIPELINES WITH KEYSTONEML
Evan R. Sparks, UC Berkeley AMPLab
with Shivaram Venkataraman, Tomer Kaftan, Michael Franklin, Benjamin Recht
WHAT’S A MACHINE LEARNING PIPELINE?
A STANDARD MACHINE LEARNING PIPELINE
Right?
[Diagram: Data → Train Classifier → Model]
A STANDARD MACHINE LEARNING PIPELINE
That’s more like it!
[Diagram: Data → Feature Extraction → Train Linear Classifier → Model; Test Data → Feature Extraction → Model → Predictions]
A REAL PIPELINE FOR IMAGE CLASSIFICATION
Inspired by Coates & Ng, 2012
[Diagram: Data → Image Parser → Normalizer → Convolver (sqrt, mean) → Symmetric Rectifier (ident, abs) → Pooler (ident, mean; global pooling) → Zipper → Linear Solver → Linear Mapper (Model). Convolution filters come from Patch Extractor → Patch Whitener → KMeans Clusterer; a Label Extractor supplies training labels. At test time, Test Data flows through the same Feature Extractor and a Label Extractor into an Error Computer, producing Test Error.]
[The same pipeline diagram, with stages annotated by scaling behavior: Embarrassingly Parallel, Requires Coordination, or Tricky to Scale.]
ABOUT KEYSTONEML
• Software framework for building scalable end-to-end machine learning pipelines on Apache Spark.
• Helps us understand what it means to build systems for robust, scalable, end-to-end advanced analytics workloads and the patterns that emerge.
• Example pipelines that achieve state-of-the-art results on large-scale datasets in computer vision, NLP, and speech - fast.
• Open source software, available at: http://keystone-ml.org/
SIMPLE EXAMPLE: TEXT CLASSIFICATION
[Diagram: calling .fit(20 Newsgroups) on the pipeline Trim → Tokenize → Bigrams → Top Features → Naive Bayes → Max Classifier produces the fitted chain Trim → Tokenize → Bigrams → Top Features Transformer → Naive Bayes Model → Max Classifier.]
Once estimated, apply these steps to your production data in an online or batch fashion.
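As a rough sketch, the chaining API shown in the backup slides lets you write this pipeline in a few lines. Stage names below mirror the slide’s labels and are illustrative; KeystoneML’s actual operator names and fit signature may differ.

// Hedged sketch of the 20 Newsgroups pipeline. Trim, Tokenize, Bigrams,
// TopFeatures, NaiveBayes, and MaxClassifier stand in for the slide's
// stages and are not necessarily the library's exact identifiers.
val pipeline = (Trim then Tokenize then Bigrams
  thenEstimator TopFeatures(100000)   // fit on the training data
  thenLabelEstimator NaiveBayes       // fit on data and labels
  then MaxClassifier).fit(newsgroupsData, newsgroupsLabels)
// The result is a Transformer you can apply to production documents,
// one at a time or as an RDD, online or in batch.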
NOT SO SIMPLE EXAMPLE: IMAGE CLASSIFICATION
[Diagram: calling .fit(Images (VOC2007)) on Resize → Grayscale → SIFT → PCA → Fisher Vector → Linear Regression → Max Classifier produces the fitted chain Resize → Grayscale → SIFT → PCA Map → Fisher Encoder → Linear Model → Max Classifier.]
Achieves the performance of Chatfield et al., 2011.
Pleasantly parallel featurization and evaluation.
7 minutes on a modest cluster: 5,000 examples, 40,000 features, 20 classes.
EVEN LESS SIMPLE: IMAGENET
[Diagram: parallel feature branches feeding a Block Linear Solver and a Top 5 Classifier.
Edges branch: Resize → Grayscale → SIFT → PCA → Fisher Vector.
Color branch: LCS → PCA → Fisher Vector.
Texture branch: Gabor Wavelets → PCA → Fisher Vector.]
<100 SLOC.
Upgrading the solver for higher precision means changing 1 LOC: swap the Block Linear Solver for a Weighted Block Linear Solver (sketched below).
Adding 100,000 more texture features is easy: just attach the Gabor Wavelets branch.
1000-class classification: 1,200,000 examples, 64,000 features, 90 minutes on 100 nodes.
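Since the solver is just one stage in the chain, the “1 LOC” upgrade is a stage swap. A hedged sketch (stage names are the slide’s, not necessarily the library’s):

// Before: the standard block solver.
val imagenet = (featurizer thenLabelEstimator BlockLinearSolver()).fit(data, labels)
// After: the higher-precision weighted variant. Only this stage changes.
val imagenetHiPrec = (featurizer thenLabelEstimator WeightedBlockLinearSolver()).fit(data, labels)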
OPTIMIZING KEYSTONEML PIPELINES
High-level API enables a rich space of optimizations.
Automated ML operator selection: a logical Linear Solver, for example, may be executed as a Direct Solver, Iterative SGD, or L-BFGS.
[Diagram: Training Data → Grayscaler → SIFT Extractor → Reduce Dimensions → Fisher Vector → Normalize → Linear Map → Predictions, with Training Labels feeding the solver. Physical choices include Column Sampler + Distributed PCA for Reduce Dimensions, Column Sampler + Local GMM for Fisher Vector, and Least Sq. or L-BFGS for the Linear Map.]
Auto-caching for iterative workloads.
KEYSTONEML OPTIMIZER
• Sampling-based cost model projects resource usage (a toy sketch follows the table below):
  • CPU, Memory, Network
• Utilization tracked through the pipeline.
• Decisions made to minimize total cost of execution.
• Catalyst-based optimizer does the heavy lifting.

Stage              n      d               Size (GB)
Input              5000   1m-pixel JPEG   0.4
Resize             5000   260k pixels     3.6
Grayscale          5000   260k pixels     1.2
SIFT               5000   65000x128       309
PCA                5000   65000x80        154
FV                 5000   256x64x2        1.2
Linear Regression  5000   20              0.0007
Max Classifier     5000   1               0.00009
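To make “sampling-based cost model” concrete: time each operator on a small sample, measure output size, and extrapolate to the full dataset. The sketch below illustrates that idea only; it is not KeystoneML’s optimizer code, and Profile, profile, and project are hypothetical names.

// Toy sampling-based cost projection (illustrative, not KeystoneML's code).
case class Profile(secsPerRecord: Double, bytesPerRecord: Double)

// Run the operator on a sample; measure per-record time and output size.
def profile[A, B](op: A => B, sample: Seq[A], sizeOf: B => Long): Profile = {
  val start = System.nanoTime()
  val outputs = sample.map(op)
  val secs = (System.nanoTime() - start) / 1e9
  Profile(secs / sample.size, outputs.map(sizeOf).sum.toDouble / sample.size)
}

// Extrapolate to n records: projected (seconds, bytes).
def project(p: Profile, n: Long): (Double, Double) =
  (p.secsPerRecord * n, p.bytesPerRecord * n)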
CHOOSING A SOLVER
• Datasets have a number of interesting degrees of freedom:
  • Problem size (n, d, k)
  • Sparsity (nnz)
  • Condition number
• Platform has degrees of freedom:
  • Memory, CPU, Network, Nodes
• Solvers are predictable!

Objective: $\min_X \|AX - B\|_2^2 + \|X\|_2^2$, where $A \in \mathbb{R}^{n \times d}$, $X \in \mathbb{R}^{d \times k}$, $B \in \mathbb{R}^{n \times k}$.
CHOOSING A SOLVER
• Three solvers: Exact, Block, L-BFGS.
• Two datasets:
  • Amazon: >99% sparse, n = 65M
  • TIMIT: dense, n = 2M
• Exact solve works well for a small number of features.
• Use L-BFGS for sparse problems.
• Block solver scales well to big, dense problems with hundreds of thousands of features.
[Plot: solve time (s, log scale) vs. number of features (1024 to 16384) on Amazon and TIMIT for the Exact, Block Solver, and LBFGS implementations.]
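The curves follow from the solvers’ closed forms. For the ridge objective above (write $\lambda$ for the weight on the $\|X\|_2^2$ term; it is 1 in the slide’s form), the exact solve is the normal equations, a minimal worked form:

\[
X^\star = \left(A^\top A + \lambda I\right)^{-1} A^\top B
\]
% Forming A^T A costs O(n d^2) and factoring it O(d^3), independent of
% sparsity, which is why the exact solver wins only at small d.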
SOLVER PERFORMANCE
• Compared KeystoneML with:
  • Vowpal Wabbit: a specialized system for large, sparse problems.
  • SystemML: a general-purpose, optimizing ML system.
• Two problems:
  • Amazon: sparse text features.
  • Binary TIMIT: dense phoneme data.
• High-order bit: KeystoneML pipelines featurization and adapts to workload changes.
[Plots: training time (s) vs. number of features (1024 to 16384), comparing KeystoneML with SystemML and with Vowpal Wabbit on the Amazon and Binary TIMIT workloads.]
DECIDING WHAT TO SAVE
• Pipelines generate lots of intermediate state.
  • E.g., SIFT features blow up a 0.42 GB VOC dataset to 300 GB.
• Iterative algorithms → state needed many times.
• How do we determine what to save for later and what to reuse, given a fixed resource budget?
• Can we adapt to workload changes?
[Diagram: the VOC pipeline again (Resize → Grayscale → SIFT → PCA → Fisher Vector → Linear Regression → Max Classifier), each stage producing cacheable intermediate state.]
CACHING PROBLEM
• Output is computed via depth-first execution of the DAG.
• Caching “truncates” a path after the first visit.
• Want to minimize execution time, subject to memory constraints.
• Picking the optimal set is hard! (A toy brute-force sketch follows the table below.)
[Diagram: a five-node DAG, A through E, feeding Output, annotated with per-node compute times (5s to 60s) and output sizes (10g to 200g).]

Cache set   Time   Memory
ABCDE       140s   340g
B           140s   200g
A           180s   50g
{}          240s   0g
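Even in this toy setting the choice is a constrained subset-selection problem. A brute-force sketch under simplifying assumptions (an uncached node is recomputed once per downstream use; real counts depend on DAG structure, and the Node fields are hypothetical):

// Toy cache-set search (illustrative, not KeystoneML's algorithm).
case class Node(name: String, time: Double, size: Double, uses: Int)

// Cached nodes pay their time once; uncached nodes pay once per use.
def totalTime(nodes: Seq[Node], cached: Set[String]): Double =
  nodes.map(n => if (cached(n.name)) n.time else n.time * n.uses).sum

// Enumerate all subsets within the memory budget; keep the fastest.
def bestCacheSet(nodes: Seq[Node], memBudget: Double): Set[String] =
  nodes.map(_.name).toSet.subsets()
    .filter(s => nodes.filter(n => s(n.name)).map(_.size).sum <= memBudget)
    .minBy(s => totalTime(nodes, s))
// Exponential in the number of nodes: fine for 5, hopeless at scale.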
END-TO-END PERFORMANCE

Dataset    Training Examples   Features        Raw Size (GB)   Feature Size (GB)
Amazon     65 million          100k (sparse)   14              89
TIMIT      2.25 million        528k            7.5             8800
ImageNet   1.28 million        262k            74              2500
VOC        5000                40k             0.43            1.5
END-TO-END PERFORMANCE

Dataset    KeystoneML Accuracy   Reported Accuracy   KeystoneML Time (m)   Reported Time (m)   Speedup over Reported
Amazon     91.6%                 N/A                 3.3                   N/A                 N/A
TIMIT      66.1%                 66.3%               138                   120                 0.87x
ImageNet   67.4%                 66.6%               270                   5760                21x
VOC        57.2%                 59.2%               7                     87                  12x
END-TO-END PERFORMANCE
[Plots: per-stage running time (loading train data, featurization, model solve, loading test data, model eval) vs. cluster size (8 to 128 nodes) for Amazon, TIMIT, and ImageNet, plus speedup over 8 nodes vs. cluster size for the same workloads.]
END-TO-END PERFORMANCE
• Tested three levels of optimization:
  • None
  • Auto-caching only
  • Auto-caching and operator selection
• 7x to 15x speedup.
[Plot: speedup on the Amazon, TIMIT, and VOC workloads at optimization levels None, Whole-Pipeline, and All.]
QUESTIONS?
Project Page: http://keystone-ml.org/
Code: http://github.com/amplab/keystone
Training: http://goo.gl/axbkkc
BACKUP SLIDES
SOFTWARE FEATURES
• Data Loaders
  • CSV, CIFAR, ImageNet, VOC, TIMIT, 20 Newsgroups
• Transformers
  • NLP - Tokenization, n-grams, term frequency, NER*, parsing*
  • Images - Convolution, Grayscaling, LCS, SIFT*, Fisher Vector*, Pooling, Windowing, HOG, Daisy
  • Speech - MFCCs*
  • Stats - Random Features, Normalization, Scaling*, Signed Hellinger Mapping, FFT
  • Utility/misc - Caching, Top-K classifier, indicator label mapping, sparse/dense encoding transformers
• Estimators
  • Learning - Block linear models, Linear Discriminant Analysis, PCA, ZCA Whitening, Naive Bayes*, GMM*
• Example Pipelines
  • NLP - Amazon Product Review Classification, 20 Newsgroups, Wikipedia Language Model
  • Images - MNIST, CIFAR, VOC, ImageNet
  • Speech - TIMIT
• Evaluation Metrics
  • Binary Classification
  • Multiclass Classification
  • Multilabel Classification
* - Links to external library
Just 11k lines of code, 5k of which are tests or JavaDoc.
KEY API CONCEPTS
TRANSFORMERS
[Diagram: Input → Transformer → Output]

abstract class Transformer[In, Out] {
  def apply(in: In): Out
  def apply(in: RDD[In]): RDD[Out] = in.map(apply)
  // …
}

TYPE SAFETY HELPS ENSURE ROBUSTNESS
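A hypothetical concrete transformer, to show how little a new operator needs:

// Hypothetical example: a Transformer that lower-cases text. Defining
// the single-record apply is enough; the RDD apply comes from the
// abstract class.
object LowerCase extends Transformer[String, String] {
  def apply(in: String): String = in.toLowerCase
}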
ESTIMATORS
[Diagram: an Estimator consumes an RDD[Input] and, via .fit(), produces a Transformer.]

abstract class Estimator[In, Out] {
  def fit(in: RDD[In]): Transformer[In, Out]
  // …
}
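And a hypothetical estimator: fit() inspects an RDD and returns a Transformer closing over what it learned.

// Hypothetical example: learn the mean of the data, return a
// transformer that centers new inputs by it.
object MeanCenterer extends Estimator[Double, Double] {
  def fit(in: RDD[Double]): Transformer[Double, Double] = {
    val mean = in.mean()
    new Transformer[Double, Double] {
      def apply(x: Double): Double = x - mean
    }
  }
}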
CHAINING
[Diagram: String → NGrams(2) → Bigrams → Vectorizer → Vector, which collapses to String → featurizer → Vector.]

val featurizer: Transformer[String, Vector] = NGrams(2) then Vectorizer
COMPLEX PIPELINES

val pipeline = (featurizer thenLabelEstimator LinearModel).fit(data, labels)

[Diagram: featurizer (String → Vector) feeds the LinearModel label estimator; after .fit(data, labels) the result is pipeline: String → Prediction, with the learned Linear Map in place of the estimator.]
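Because the fitted pipeline is itself a Transformer, applying it is uniform across single records and RDDs. A usage sketch with hypothetical inputs:

// Hypothetical usage of the fitted pipeline above.
val one: Prediction = pipeline("a single input document")
val many: RDD[Prediction] = pipeline(testDocs) // testDocs: RDD[String]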
