MACHINE LEARNING PIPELINES
Evan R. Sparks 
Graduate Student, AMPLab 
With: Shivaram Venkataraman, Tomer Kaftan, Gylfi Gudmundsson, 
Michael Franklin, Benjamin Recht, and others!
WHAT IS MACHINE LEARNING?
“Machine learning is a scientific discipline that deals 
with the construction and study of algorithms that can 
learn from data. Such algorithms operate by building 
a model based on inputs and using that to make 
predictions or decisions, rather than following only 
explicitly programmed instructions.” 
–Wikipedia 
(Diagram: Data → Model.)
ML PROBLEMS 
• Real data is often not ∈ R^d. 
• Real data is not well-behaved according to my algorithm. 
• Features need to be engineered. 
• Transformations need to be applied. 
• Hyperparameters need to be tuned. 
(Figures: idealized SVM input vs. messy real data.)
SYSTEMS PROBLEMS 
• Datasets are huge. 
• Distributed computing is hard. 
• Mapping common ML techniques to the distributed setting may be untenable.
WHAT IS MLBASE? 
• Distributed Machine Learning - Made Easy! 
• A Spark-based platform to simplify the development and usage of large-scale machine learning.
A STANDARD MACHINE LEARNING PIPELINE 
Data → Train Classifier → Model 
Right?
A STANDARD MACHINE LEARNING PIPELINE 
That’s more like it! 
Data → Feature Extraction → Train Linear Classifier → Model 
Test Data → Model → Predictions
A REAL PIPELINE FOR IMAGE CLASSIFICATION 
Inspired by Coates & Ng, 2012 
Training: Data → Image Parser → Feature Extractor → Linear Solver → Model (labels via the Label Extractor) 
Feature Extractor: Normalizer → Patch Extractor → Patch Selector → Patch Whitener → Convolver → Symmetric Rectifier → Pooler 
Testing: Test Data → Feature Extractor → Model → Error Computer → Test Error (labels via the Label Extractor)
A SIMPLE EXAMPLE 
• Load up some images. 
• Featurize. 
• Apply a transformation. 
• Fit a linear model. 
• Evaluate on test data. 
Replicates the Fast Food features pipeline (Le et al., 2012).
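A minimal, self-contained sketch of these five steps on toy 1-D “images”. Everything here is illustrative: the data is synthetic, the closed-form solve is a stand-in for a distributed linear solver, and the real pipeline runs on Spark RDDs.

```scala
object SimpleExample extends App {
  val rng = new scala.util.Random(0)

  // "Load" some images: class 0 pixels cluster near 0.3, class 1 near 0.7.
  def makeData(n: Int): Seq[(Double, Double)] = Seq.fill(n) {
    val label = rng.nextInt(2)
    val pixel = (if (label == 0) 0.3 else 0.7) + 0.1 * rng.nextGaussian()
    (pixel, label.toDouble)
  }
  val train = makeData(200)
  val test  = makeData(50)

  // Featurize + transform: [x, 1] gives the linear model a bias term.
  def features(x: Double): Array[Double] = Array(x, 1.0)

  // Fit a linear model: least squares via the 2x2 normal equations,
  // solved by hand here (a real pipeline would use a distributed solver).
  val ata = Array.ofDim[Double](2, 2)
  val atb = Array.ofDim[Double](2)
  for ((x, y) <- train; row = features(x); i <- 0 until 2) {
    atb(i) += row(i) * y
    for (j <- 0 until 2) ata(i)(j) += row(i) * row(j)
  }
  val det = ata(0)(0) * ata(1)(1) - ata(0)(1) * ata(1)(0)
  val w = Array(
    (ata(1)(1) * atb(0) - ata(0)(1) * atb(1)) / det,
    (ata(0)(0) * atb(1) - ata(1)(0) * atb(0)) / det)

  // Evaluate on test data: predict class 1 when the score exceeds 0.5.
  val err = test.count { case (x, y) =>
    val score = features(x).zip(w).map { case (f, wi) => f * wi }.sum
    (if (score > 0.5) 1.0 else 0.0) != y
  }.toDouble / test.size
  println(f"Test error: $err%.2f")
}
```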
PIPELINES API 
• A pipeline is made of nodes, which have an expected input and output type. 
• Nodes fit together in a sensible way. 
• Pipelines are just nodes. 
• Nodes should be things that we know how to scale.
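A minimal sketch of the idea (illustrative, not the actual Pipelines API): a node is just a typed function, and composing two nodes yields another node, which is why a pipeline is itself a node.

```scala
// Illustrative sketch only -- not the actual Pipelines API.
// A node is a typed function; composing two nodes returns another node.
trait Node[In, Out] extends (In => Out) { self =>
  def andThenNode[Next](next: Node[Out, Next]): Node[In, Next] =
    new Node[In, Next] { def apply(in: In): Next = next(self(in)) }
}

// Example: parse a comma-separated pixel string, then normalize to [-1, 1].
object Parse extends Node[String, Array[Int]] {
  def apply(s: String): Array[Int] = s.split(",").map(_.trim.toInt)
}
object Normalize extends Node[Array[Int], Array[Double]] {
  def apply(px: Array[Int]): Array[Double] = px.map(p => p / 255.0 * 2 - 1)
}
val pipeline: Node[String, Array[Double]] = Parse.andThenNode(Normalize)
// pipeline("0, 128, 255") => Array(-1.0, 0.0039..., 1.0)
```

The type parameters are what make nodes “fit together in a sensible way”: composing a `Node[A, B]` with anything that doesn’t accept a `B` is a compile-time error.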
WHAT’S IN THE TOOLBOX? 
Nodes 
Images - Patches, Gabor Filters, HoG, Contrast Normalization 
Text - n-grams, lemmatization, TF-IDF, POS, NER 
General Purpose - ZCA Whitening, FFT, Scaling, Random Signs, Linear Rectifier, Windowing, Pooling, Sampling, QR Decomposition 
Statistics - Borda Voting, Linear Mapping, Matrix Multiply 
ML - Linear Solvers, TSQR, Cholesky Solver, MLlib 
Speech and more - coming soon! 
Pipelines 
Example pipelines across domains: CIFAR, MNIST, ImageNet, ACL Argument Extraction, TIMIT. 
Hyper Parameter Tuning - stay tuned! 
Libraries 
(Diagram of the library stack: Pipelines and MLI atop GraphX, MLlib, ml-matrix, Featurizers, Stats, and Utils, all built on Spark.)
A REAL PIPELINE FOR IMAGE CLASSIFICATION 
Inspired by Coates & Ng, 2012 
(The same pipeline diagram as above.) 
YOU’RE GOING TO BUILD THIS!!
BEAR WITH ME 
Photo: Andy Rouse, (c) Smithsonian Institute
COMPUTER VISION CRASH COURSE
FEATURE EXTRACTION 
(Pipeline diagram: Data → Image Parser → Normalizer → Patch Extractor → Patch Selector → Patch Whitener → Convolver → Symmetric Rectifier → Pooler → Linear Solver → Model, with labels via the Label Extractor.)
FEATURE EXTRACTION 
(Pipeline diagram repeated; next up: the Normalizer.)
NORMALIZATION 
• Moves pixels from [0, 255] to [-1.0, 1.0]. 
• Why? Math! 
• (-1)·(-1) = 1, 1·1 = 1 
• If I overlay two pixels on each other and they’re similar values, their product will be close to 1 - otherwise, it will be close to 0 or -1. 
• Necessary for whitening. 
(Diagram: [0, 255] mapped onto [-1, +1].)
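As a one-function sketch of the rescaling (assuming 8-bit pixels):

```scala
// Rescale an 8-bit pixel from [0, 255] to [-1.0, 1.0].
def normalizePixel(p: Int): Double = p / 255.0 * 2.0 - 1.0
// normalizePixel(0) == -1.0; normalizePixel(255) == 1.0
```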
FEATURE EXTRACTION 
(Pipeline diagram repeated; next up: the Patch Extractor.)
PATCH EXTRACTION 
• Image patches become our “visual vocabulary”. 
• Intuition from text classification. 
• If I’m trying to classify a document as “sports”, I’d look for words like “football”, “batter”, etc. 
• For images, classifying pictures as “face”, I’m looking for things that look like eyes, ears, noses, etc. 
(Figure: a visual vocabulary of image patches.)
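A minimal sketch of building that vocabulary, assuming images are 2-D arrays of normalized pixel values:

```scala
import scala.util.Random

// Extract `count` random square patches from one image; collected across
// many images, these patches form the "visual vocabulary".
def extractPatches(image: Array[Array[Double]],
                   patchSize: Int,
                   count: Int,
                   rng: Random = new Random(42)): Seq[Array[Array[Double]]] =
  Seq.fill(count) {
    val r = rng.nextInt(image.length - patchSize + 1)
    val c = rng.nextInt(image(0).length - patchSize + 1)
    Array.tabulate(patchSize, patchSize)((i, j) => image(r + i)(c + j))
  }
```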
FEATURE EXTRACTION 
(Pipeline diagram repeated; next up: the Convolver.)
CONVOLUTION 
• A convolution filter applies a weighted average to sliding patches of data. 
• Can be used for lots of things - finding edges, blurring, etc. 
• Normalized input: an image and an ear filter. 
• Output: a new image - close to 1 for areas that look like the ear filter. 
• Apply many of these simultaneously.
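A minimal sketch of one 2-D convolution in “valid” mode (strictly speaking this is cross-correlation, which is what these feature pipelines compute); the real Convolver applies many filters at once:

```scala
// Slide the filter over the image; each output value is the weighted sum of
// the patch under the filter. On normalized data, matching regions score near 1.
def convolve(image: Array[Array[Double]],
             filter: Array[Array[Double]]): Array[Array[Double]] = {
  val (fh, fw) = (filter.length, filter(0).length)
  Array.tabulate(image.length - fh + 1, image(0).length - fw + 1) { (r, c) =>
    var sum = 0.0
    for (i <- 0 until fh; j <- 0 until fw) sum += image(r + i)(c + j) * filter(i)(j)
    sum
  }
}
```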
FEATURE EXTRACTION 
(Pipeline diagram repeated; next up: the Symmetric Rectifier.)
LINEAR RECTIFICATION 
• For each feature x, given some a (= 0.25): 
• x_new = max(x - a, 0) 
• What does it do? Removes a bunch of noise.
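As code, the rectifier is a one-liner. The pipeline’s Symmetric Rectifier node presumably keeps both the positive and negative halves (an assumption here, based on the split encoding in Coates & Ng):

```scala
// Shifted linear rectifier: zero out responses below the threshold a.
def rectify(x: Double, a: Double = 0.25): Double = math.max(x - a, 0.0)

// Assumed symmetric variant: keep both halves as separate features.
def rectifySymmetric(x: Double, a: Double = 0.25): (Double, Double) =
  (math.max(x - a, 0.0), math.max(-x - a, 0.0))
```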
FEATURE EXTRACTION 
(Pipeline diagram repeated; next up: the Pooler.)
POOLING 
• convolve(image, k filters) => k filtered images. 
• Lots of info - super granular. 
• Pooling lets us break the (filtered) images into regions and sum. 
• Think of the “sum” as how much an image quadrant is activated. 
• Image summarized into 4*k numbers. 
(Example quadrant sums: 0.5, 8, 0, 2.)
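A minimal sketch of sum-pooling one filtered image into four quadrants; applied to all k filter outputs, this yields the 4*k summary numbers:

```scala
// Sum-pool a filtered image over its four quadrants.
// Each sum says how strongly that quadrant activated the filter.
def poolQuadrants(image: Array[Array[Double]]): Array[Double] = {
  val (h2, w2) = (image.length / 2, image(0).length / 2)
  val sums = Array.fill(4)(0.0)
  for (r <- image.indices; c <- image(r).indices) {
    val q = (if (r < h2) 0 else 2) + (if (c < w2) 0 else 1)
    sums(q) += image(r)(c)
  }
  sums
}
```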
LINEAR CLASSIFICATION 
Data: A, Labels: b, Model: x 
Hypothesis: Ax = b + error 
Find the x that minimizes the error |Ax - b|.
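As a worked equation, this is ordinary least squares; assuming A has full column rank, the minimizer has the closed form:

```latex
\hat{x} = \arg\min_{x} \|Ax - b\|_2^2 = (A^\top A)^{-1} A^\top b
```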
WHY LINEAR CLASSIFIERS? 
They’re simple. They’re fast. They’re well studied. They scale. 
With the right features, they do a good job!
BACK TO OUR PROBLEM 
• What is A in our problem? #images x #features (4f, from pooling). 
• What about x? #features x #classes. 
• For f < 10,000, pretty easy to solve! 
• Bigger - we have to get creative. 
(Diagram: A is 10m x 100k, x is 100k x 1k, and Ax = b is 10m x 1k.)
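A minimal sketch of the “easy” case via the normal equations, assuming the Breeze linear algebra library. The point is that A^T A is only f x f, which is why f < 10,000 is comfortable on one machine while larger f needs something cleverer (TSQR, distributed solvers):

```scala
import breeze.linalg.{DenseMatrix, DenseVector}

// Solve min_x |Ax - b| via the normal equations: (A^T A) x = A^T b.
// A^T A is f x f, so this fits in memory when the feature count f is small.
// (In the multi-class case, x has one column per class.)
def solveLeastSquares(a: DenseMatrix[Double],
                      b: DenseVector[Double]): DenseVector[Double] =
  (a.t * a) \ (a.t * b)
```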
TODAY’S EXERCISE 
• Build 3 image classification pipelines - simple, 
intermediate, advanced. 
• Qualitatively (with your eyes) and quantitatively 
(with statistics) compare their effectiveness.
ML PIPELINES 
• Reusable, general-purpose components. 
• Built with distributed data in mind from day 1. 
• Used together, they give a complex system composed of well-understood parts. 
GO BEARS
