MACHINE LEARNING PIPELINES
Evan R. Sparks 
Graduate Student, AMPLab 
With: Shivaram Venkataraman, Tomer Kaftan, Gylfi Gudmundsson, 
Michael Franklin, Benjamin Recht, and others!
WHAT IS MACHINE LEARNING?
“Machine learning is a scientific discipline that deals 
with the construction and study of algorithms that can 
learn from data. Such algorithms operate by building 
a model based on inputs and using that to make 
predictions or decisions, rather than following only 
explicitly programmed instructions.” 
–Wikipedia 
(Diagram: Data → Model.)
ML PROBLEMS 
• Real data is often not ∈ R^d. 
• Real data is not well-behaved according to my algorithm. 
• Features need to be engineered. 
• Transformations need to be applied. 
• Hyperparameters need to be tuned. 
(Figures: idealized SVM input vs. messy real data.)
SYSTEMS PROBLEMS 
• Datasets are huge. 
• Distributed computing is hard. 
• Mapping common ML techniques to the distributed setting may be untenable.
WHAT IS MLBASE? 
• Distributed Machine Learning - Made Easy! 
• A Spark-based platform to simplify the development and usage of large-scale machine learning.
A STANDARD MACHINE LEARNING PIPELINE 
Data → Train Classifier → Model 
Right?
A STANDARD MACHINE LEARNING PIPELINE 
That’s more like it! 
Data → Feature Extraction → Train Linear Classifier → Model 
Test Data → Model → Predictions
A REAL PIPELINE FOR IMAGE CLASSIFICATION 
Inspired by Coates & Ng, 2012 
Training: Data → Image Parser → Feature Extractor → Linear Solver → Model (labels via the Label Extractor) 
Feature Extractor: Normalizer → Patch Extractor → Patch Selector → Patch Whitener → Convolver → Symmetric Rectifier → Pooler 
Testing: Test Data → Feature Extractor → Model → Error Computer → Test Error (labels via the Label Extractor)
A SIMPLE EXAMPLE 
• Load up some images. 
• Featurize. 
• Apply a transformation. 
• Fit a linear model. 
• Evaluate on test data. 
Replicates the Fast Food features pipeline (Le et al., 2012).
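A minimal, self-contained sketch of these five steps on toy 1-D “images”. Everything here is illustrative: the data is synthetic, the closed-form solve is a stand-in for a distributed linear solver, and the real pipeline runs on Spark RDDs.

```scala
object SimpleExample extends App {
  val rng = new scala.util.Random(0)

  // "Load" some images: class 0 pixels cluster near 0.3, class 1 near 0.7.
  def makeData(n: Int): Seq[(Double, Double)] = Seq.fill(n) {
    val label = rng.nextInt(2)
    val pixel = (if (label == 0) 0.3 else 0.7) + 0.1 * rng.nextGaussian()
    (pixel, label.toDouble)
  }
  val train = makeData(200)
  val test  = makeData(50)

  // Featurize + transform: [x, 1] gives the linear model a bias term.
  def features(x: Double): Array[Double] = Array(x, 1.0)

  // Fit a linear model: least squares via the 2x2 normal equations,
  // solved by hand here (a real pipeline would use a distributed solver).
  val ata = Array.ofDim[Double](2, 2)
  val atb = Array.ofDim[Double](2)
  for ((x, y) <- train; row = features(x); i <- 0 until 2) {
    atb(i) += row(i) * y
    for (j <- 0 until 2) ata(i)(j) += row(i) * row(j)
  }
  val det = ata(0)(0) * ata(1)(1) - ata(0)(1) * ata(1)(0)
  val w = Array(
    (ata(1)(1) * atb(0) - ata(0)(1) * atb(1)) / det,
    (ata(0)(0) * atb(1) - ata(1)(0) * atb(0)) / det)

  // Evaluate on test data: predict class 1 when the score exceeds 0.5.
  val err = test.count { case (x, y) =>
    val score = features(x).zip(w).map { case (f, wi) => f * wi }.sum
    (if (score > 0.5) 1.0 else 0.0) != y
  }.toDouble / test.size
  println(f"Test error: $err%.2f")
}
```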
PIPELINES API 
• A pipeline is made of nodes, which have an expected input and output type. 
• Nodes fit together in a sensible way. 
• Pipelines are just nodes. 
• Nodes should be things that we know how to scale.
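A minimal sketch of the idea (illustrative, not the actual Pipelines API): a node is just a typed function, and composing two nodes yields another node, which is why a pipeline is itself a node.

```scala
// Illustrative sketch only -- not the actual Pipelines API.
// A node is a typed function; composing two nodes returns another node.
trait Node[In, Out] extends (In => Out) { self =>
  def andThenNode[Next](next: Node[Out, Next]): Node[In, Next] =
    new Node[In, Next] { def apply(in: In): Next = next(self(in)) }
}

// Example: parse a comma-separated pixel string, then normalize to [-1, 1].
object Parse extends Node[String, Array[Int]] {
  def apply(s: String): Array[Int] = s.split(",").map(_.trim.toInt)
}
object Normalize extends Node[Array[Int], Array[Double]] {
  def apply(px: Array[Int]): Array[Double] = px.map(p => p / 255.0 * 2 - 1)
}
val pipeline: Node[String, Array[Double]] = Parse.andThenNode(Normalize)
// pipeline("0, 128, 255") => Array(-1.0, 0.0039..., 1.0)
```

The type parameters are what make nodes “fit together in a sensible way”: composing a `Node[A, B]` with anything that doesn’t accept a `B` is a compile-time error.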
WHAT’S IN THE TOOLBOX? 
Nodes 
Images - Patches, Gabor Filters, HoG, Contrast Normalization 
Text - n-grams, lemmatization, TF-IDF, POS, NER 
General Purpose - ZCA Whitening, FFT, Scaling, Random Signs, Linear Rectifier, Windowing, Pooling, Sampling, QR Decomposition 
Statistics - Borda Voting, Linear Mapping, Matrix Multiply 
ML - Linear Solvers, TSQR, Cholesky Solver, MLlib 
Speech and more - coming soon! 
Pipelines 
Example pipelines across domains: CIFAR, MNIST, ImageNet, ACL Argument Extraction, TIMIT. 
Hyper Parameter Tuning - stay tuned! 
Libraries 
(Diagram of the library stack: Pipelines and MLI atop GraphX, MLlib, ml-matrix, Featurizers, Stats, and Utils, all built on Spark.)
A REAL PIPELINE FOR IMAGE CLASSIFICATION 
Inspired by Coates & Ng, 2012 
(The same pipeline diagram as above.) 
YOU’RE GOING TO BUILD THIS!!
BEAR WITH ME 
Photo: Andy Rouse, (c) Smithsonian Institute
COMPUTER VISION CRASH COURSE
FEATURE EXTRACTION 
(Pipeline diagram: Data → Image Parser → Normalizer → Patch Extractor → Patch Selector → Patch Whitener → Convolver → Symmetric Rectifier → Pooler → Linear Solver → Model, with labels via the Label Extractor.)
FEATURE EXTRACTION 
(Pipeline diagram repeated; next up: the Normalizer.)
NORMALIZATION 
• Moves pixels from [0, 255] to [-1.0, 1.0]. 
• Why? Math! 
• (-1)·(-1) = 1, 1·1 = 1 
• If I overlay two pixels on each other and they’re similar values, their product will be close to 1 - otherwise, it will be close to 0 or -1. 
• Necessary for whitening. 
(Diagram: [0, 255] mapped onto [-1, +1].)
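As a one-function sketch of the rescaling (assuming 8-bit pixels):

```scala
// Rescale an 8-bit pixel from [0, 255] to [-1.0, 1.0].
def normalizePixel(p: Int): Double = p / 255.0 * 2.0 - 1.0
// normalizePixel(0) == -1.0; normalizePixel(255) == 1.0
```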
FEATURE EXTRACTION 
(Pipeline diagram repeated; next up: the Patch Extractor.)
PATCH EXTRACTION 
• Image patches become our “visual vocabulary”. 
• Intuition from text classification. 
• If I’m trying to classify a document as “sports”, I’d look for words like “football”, “batter”, etc. 
• For images, classifying pictures as “face”, I’m looking for things that look like eyes, ears, noses, etc. 
(Figure: a visual vocabulary of image patches.)
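A minimal sketch of building that vocabulary, assuming images are 2-D arrays of normalized pixel values:

```scala
import scala.util.Random

// Extract `count` random square patches from one image; collected across
// many images, these patches form the "visual vocabulary".
def extractPatches(image: Array[Array[Double]],
                   patchSize: Int,
                   count: Int,
                   rng: Random = new Random(42)): Seq[Array[Array[Double]]] =
  Seq.fill(count) {
    val r = rng.nextInt(image.length - patchSize + 1)
    val c = rng.nextInt(image(0).length - patchSize + 1)
    Array.tabulate(patchSize, patchSize)((i, j) => image(r + i)(c + j))
  }
```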
FEATURE EXTRACTION 
(Pipeline diagram repeated; next up: the Convolver.)
CONVOLUTION 
• A convolution filter applies a weighted average to sliding patches of data. 
• Can be used for lots of things - finding edges, blurring, etc. 
• Normalized input: an image and an ear filter. 
• Output: a new image - close to 1 for areas that look like the ear filter. 
• Apply many of these simultaneously.
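A minimal sketch of one 2-D convolution in “valid” mode (strictly speaking this is cross-correlation, which is what these feature pipelines compute); the real Convolver applies many filters at once:

```scala
// Slide the filter over the image; each output value is the weighted sum of
// the patch under the filter. On normalized data, matching regions score near 1.
def convolve(image: Array[Array[Double]],
             filter: Array[Array[Double]]): Array[Array[Double]] = {
  val (fh, fw) = (filter.length, filter(0).length)
  Array.tabulate(image.length - fh + 1, image(0).length - fw + 1) { (r, c) =>
    var sum = 0.0
    for (i <- 0 until fh; j <- 0 until fw) sum += image(r + i)(c + j) * filter(i)(j)
    sum
  }
}
```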
FEATURE EXTRACTION 
(Pipeline diagram repeated; next up: the Symmetric Rectifier.)
LINEAR RECTIFICATION 
• For each feature x, given some a (= 0.25): 
• x_new = max(x - a, 0) 
• What does it do? Removes a bunch of noise.
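As code, the rectifier is a one-liner. The pipeline’s Symmetric Rectifier node presumably keeps both the positive and negative halves (an assumption here, based on the split encoding in Coates & Ng):

```scala
// Shifted linear rectifier: zero out responses below the threshold a.
def rectify(x: Double, a: Double = 0.25): Double = math.max(x - a, 0.0)

// Assumed symmetric variant: keep both halves as separate features.
def rectifySymmetric(x: Double, a: Double = 0.25): (Double, Double) =
  (math.max(x - a, 0.0), math.max(-x - a, 0.0))
```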
FEATURE EXTRACTION 
(Pipeline diagram repeated; next up: the Pooler.)
POOLING 
• convolve(image, k filters) => k filtered images. 
• Lots of info - super granular. 
• Pooling lets us break the (filtered) images into regions and sum. 
• Think of the “sum” as how much an image quadrant is activated. 
• Image summarized into 4*k numbers. 
(Example quadrant sums: 0.5, 8, 0, 2.)
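A minimal sketch of sum-pooling one filtered image into four quadrants; applied to all k filter outputs, this yields the 4*k summary numbers:

```scala
// Sum-pool a filtered image over its four quadrants.
// Each sum says how strongly that quadrant activated the filter.
def poolQuadrants(image: Array[Array[Double]]): Array[Double] = {
  val (h2, w2) = (image.length / 2, image(0).length / 2)
  val sums = Array.fill(4)(0.0)
  for (r <- image.indices; c <- image(r).indices) {
    val q = (if (r < h2) 0 else 2) + (if (c < w2) 0 else 1)
    sums(q) += image(r)(c)
  }
  sums
}
```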
LINEAR CLASSIFICATION 
Data: A, Labels: b, Model: x 
Hypothesis: Ax = b + error 
Find the x that minimizes the error |Ax - b|.
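As a worked equation, this is ordinary least squares; assuming A has full column rank, the minimizer has the closed form:

```latex
\hat{x} = \arg\min_{x} \|Ax - b\|_2^2 = (A^\top A)^{-1} A^\top b
```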
WHY LINEAR CLASSIFIERS? 
They’re simple. They’re fast. They’re well studied. They scale. 
With the right features, they do a good job!
BACK TO OUR PROBLEM 
• What is A in our problem? #images x #features (4f, from pooling). 
• What about x? #features x #classes. 
• For f < 10,000, pretty easy to solve! 
• Bigger - we have to get creative. 
(Diagram: A is 10m x 100k, x is 100k x 1k, and Ax = b is 10m x 1k.)
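A minimal sketch of the “easy” case via the normal equations, assuming the Breeze linear algebra library. The point is that A^T A is only f x f, which is why f < 10,000 is comfortable on one machine while larger f needs something cleverer (TSQR, distributed solvers):

```scala
import breeze.linalg.{DenseMatrix, DenseVector}

// Solve min_x |Ax - b| via the normal equations: (A^T A) x = A^T b.
// A^T A is f x f, so this fits in memory when the feature count f is small.
// (In the multi-class case, x has one column per class.)
def solveLeastSquares(a: DenseMatrix[Double],
                      b: DenseVector[Double]): DenseVector[Double] =
  (a.t * a) \ (a.t * b)
```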
TODAY’S EXERCISE 
• Build 3 image classification pipelines - simple, 
intermediate, advanced. 
• Qualitatively (with your eyes) and quantitatively 
(with statistics) compare their effectiveness.
ML PIPELINES 
• Reusable, general-purpose components. 
• Built with distributed data in mind from day 1. 
• Used together, they give a complex system composed of well-understood parts. 
GO BEARS
