KEYSTONEML
Evan R. Sparks, Shivaram Venkataraman
With: Tomer Kaftan, Zongheng Yang, Mike Franklin, Ben Recht
WHAT IS KEYSTONEML
• Software framework for building scalable end-to-end machine learning
pipelines.
• Helps us understand what it means to build systems for robust, scalable,
end-to-end advanced analytics workloads and the patterns that emerge.
• Example pipelines that achieve state-of-the-art results on large scale
datasets in computer vision, NLP, and speech - fast.
• Previewed at AMP Camp 5 and on AMPLab Blog as “ML Pipelines”
• Public release last month! http://keystone-ml.org/
HOW DOES IT FIT WITH
BDAS?
Spark
MLlib, GraphX, ml-matrix
KeystoneML: Batch Model Training
Velox Model Server: Real-Time Serving
http://amplab.github.io/velox-modelserver
WHAT’S A MACHINE
LEARNING PIPELINE?
A STANDARD MACHINE LEARNING PIPELINE
Right?
[Diagram: Data → Train Classifier → Model]
A STANDARD MACHINE LEARNING PIPELINE
That’s more like it!
[Diagram: Data → Feature Extraction → Train Linear Classifier → Model; Test Data → Model → Predictions]
SIMPLE EXAMPLE:
TEXT CLASSIFICATION
[Diagram: 20 Newsgroups → .fit( ) on Trim → Tokenize → Bigrams → Top Features → Naive Bayes → Max Classifier]
[Diagram: fitted pipeline Trim → Tokenize → Bigrams → Top Features Transformer → Naive Bayes Model → Max Classifier]
Then evaluate and ship to Velox when you’re ready to apply to real-time data.
NOT SO SIMPLE EXAMPLE:
IMAGE CLASSIFICATION
[Diagram: Images (VOC2007) → .fit( ) on Resize → Grayscale → SIFT → PCA → Fisher Vector → Linear Regression → Max Classifier]
[Diagram: fitted pipeline Resize → Grayscale → SIFT → PCA Map → Fisher Encoder → Linear Model → Max Classifier]
Achieves performance of Chatfield et al., 2011.
Embarrassingly parallel featurization and evaluation.
15 minutes on a modest cluster.
5,000 examples, 40,000 features, 20 classes.
EVEN LESS SIMPLE: IMAGENET
[Diagram: two feature branches, Resize → Grayscale → SIFT → PCA → Fisher Vector and LCS → PCA → Fisher Vector, feeding a Block Linear Solver and a Top 5 Classifier. Further boxes: Color Edges, Texture, "More Stuff I’d Never Heard Of".]
<200 SLOC
And Shivaram doesn’t have a heart attack when Prof. Recht tells us he wants to try a new solver (Weighted Block Linear Solver).
Or add 100,000 more texture features.
1000-class classification.
1,200,000 examples.
64,000 features.
90 minutes on 100 nodes.
SOFTWARE FEATURES
• Data Loaders
  • CSV, CIFAR, ImageNet, VOC, TIMIT, 20 Newsgroups
• Transformers
  • NLP - Tokenization, n-grams, term frequency, NER*, parsing*
  • Images - Convolution, Grayscaling, LCS, SIFT*, FisherVector*, Pooling, Windowing, HOG, Daisy
  • Speech - MFCCs*
  • Stats - Random Features, Normalization, Scaling*, Signed Hellinger Mapping, FFT
  • Utility/misc - Caching, Top-K classifier, indicator label mapping, sparse/dense encoding transformers.
• Estimators
  • Learning - Block linear models, Linear Discriminant Analysis, PCA, ZCA Whitening, Naive Bayes*, GMM*
• Example Pipelines
  • NLP - 20 Newsgroups, Wikipedia Language model
  • Images - MNIST, CIFAR, VOC, ImageNet
  • Speech - TIMIT
• Evaluation Metrics
  • Binary Classification
  • Multiclass Classification
  • Multilabel Classification
Just 5k Lines of Code, 1.5k of which are TESTS + 1.5k lines of JavaDoc
* - Links to external library: MLlib, ml-matrix, VLFeat, EncEval
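To make the "Max Classifier", "Top-K classifier" and multiclass evaluation entries concrete, here is a minimal sketch in plain Scala. The names `maxClassifier`, `topK` and `accuracy` are hypothetical and are not taken from the KeystoneML code base; scores are plain `Seq[Double]` values rather than distributed vectors.

```scala
object ClassifierSketch {
  // Index of the largest score; ties go to the earliest index.
  // Assumes a non-empty score vector.
  def maxClassifier(scores: Seq[Double]): Int =
    scores.indices.reduceLeft((i, j) => if (scores(j) > scores(i)) j else i)

  // Indices of the k largest scores, best first.
  def topK(scores: Seq[Double], k: Int): Seq[Int] =
    scores.zipWithIndex.sortWith((a, b) => a._1 > b._1).take(k).map(_._2)

  // Multiclass accuracy: fraction of examples whose predicted label matches the true label.
  def accuracy(predicted: Seq[Int], actual: Seq[Int]): Double =
    predicted.zip(actual).count { case (p, a) => p == a }.toDouble / actual.size
}
```

A "Top 5 Classifier" in this sketch would simply be `topK(scores, 5)`.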
KEY API CONCEPTS
TRANSFORMERS
[Diagram: Input → Transformer → Output]
abstract class Transformer[In, Out] {
  def apply(in: In): Out
  def apply(in: RDD[In]): RDD[Out] = in.map(apply)
  …
}
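A self-contained sketch of the same idea that runs without Spark. It assumes a local `Seq[In]` as a stand-in for `RDD[In]`; the `Trim` and `Tokenize` nodes are hypothetical illustrations named after the boxes in the text classification example, not the library's implementations.

```scala
object TransformerSketch {
  abstract class Transformer[In, Out] {
    // Transform a single item.
    def apply(in: In): Out
    // Transform a whole collection by mapping the single-item method over it.
    def apply(in: Seq[In]): Seq[Out] = in.map(x => apply(x))
  }

  // Hypothetical node: strips leading and trailing whitespace.
  object Trim extends Transformer[String, String] {
    def apply(in: String): String = in.trim
  }

  // Hypothetical node: splits a string on runs of whitespace.
  object Tokenize extends Transformer[String, Seq[String]] {
    def apply(in: String): Seq[String] = in.split("\\s+").toSeq
  }
}
```

Because the input and output types are part of the signature, passing a `Seq[String]` to a node that expects numbers fails at compile time rather than at run time.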
TYPE SAFETY HELPS ENSURE ROBUSTNESS
ESTIMATORS
[Diagram: RDD[Input] → Estimator → .fit() → Transformer]
abstract class Estimator[In, Out] {
  def fit(in: RDD[In]): Transformer[In, Out]
  …
}
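A minimal sketch of an estimator, again assuming `Seq` in place of `RDD`. `MeanCenterer` is a hypothetical example chosen because its fitted state (a single mean) is easy to follow; it is not one of the estimators listed under Software Features.

```scala
object EstimatorSketch {
  abstract class Transformer[In, Out] { def apply(in: In): Out }

  abstract class Estimator[In, Out] {
    // Learn from data and return a Transformer that carries the learned state.
    def fit(data: Seq[In]): Transformer[In, Out]
  }

  // Hypothetical estimator: learns the mean of the training data and
  // returns a transformer that subtracts it. Assumes non-empty data.
  object MeanCenterer extends Estimator[Double, Double] {
    def fit(data: Seq[Double]): Transformer[Double, Double] = {
      val mean = data.sum / data.size
      new Transformer[Double, Double] {
        def apply(in: Double): Double = in - mean
      }
    }
  }
}
```

Usage: `EstimatorSketch.MeanCenterer.fit(Seq(1.0, 2.0, 3.0))` returns a transformer that maps `5.0` to `3.0`.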
CHAINING
[Diagram: String → NGrams(2) → Bigrams → Vectorizer → Vector]
val featurizer: Transformer[String, Vector] = NGrams(2) then Vectorizer
[Diagram: = String → featurizer → Vector]
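One way chaining could be implemented, as a sketch. The combinator is called `andThen` here because `then` is a reserved word in newer Scala versions; `NGrams` and `Vectorizer` below are hypothetical stand-ins (whitespace tokens, counts over a fixed vocabulary, Scala's `Vector[Double]` as the vector type), not the library's nodes.

```scala
object ChainingSketch {
  abstract class Transformer[A, B] { self =>
    def apply(in: A): B
    // Compose two nodes; the compiler checks that this node's output type
    // matches the next node's input type.
    def andThen[C](next: Transformer[B, C]): Transformer[A, C] =
      new Transformer[A, C] { def apply(in: A): C = next(self(in)) }
  }

  // Hypothetical n-gram node over whitespace-separated tokens.
  case class NGrams(n: Int) extends Transformer[String, Seq[String]] {
    def apply(in: String): Seq[String] =
      in.split("\\s+").toSeq.sliding(n).filter(_.size == n).map(_.mkString(" ")).toSeq
  }

  // Hypothetical vectorizer: counts how often each vocabulary entry occurs.
  case class Vectorizer(vocab: Seq[String]) extends Transformer[Seq[String], Vector[Double]] {
    def apply(in: Seq[String]): Vector[Double] =
      vocab.map(v => in.count(_ == v).toDouble).toVector
  }
}
```

With these definitions, `NGrams(2).andThen(Vectorizer(vocab))` has type `Transformer[String, Vector[Double]]`, mirroring the `featurizer` above.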
COMPLEX PIPELINES
val pipeline = (featurizer thenLabelEstimator LinearModel).fit(data, labels)
[Diagram: (String → featurizer → Vector → Linear Model → Prediction).fit(data, labels) = String → featurizer → Vector → Linear Map → Prediction = String → pipeline → Prediction]
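A sketch of how chaining a supervised estimator onto a featurizer might work: fitting first pushes the training data through the featurizer, trains on the resulting features, and returns the featurizer chained with the fitted model. All names (`andThenLabelEstimator`, `Length`, `ScalarLeastSquares`) are hypothetical, and `Seq` again stands in for `RDD`.

```scala
object PipelineSketch {
  abstract class Transformer[A, B] { self =>
    def apply(in: A): B
    def andThen[C](next: Transformer[B, C]): Transformer[A, C] =
      new Transformer[A, C] { def apply(in: A): C = next(self(in)) }
    // Chain a label estimator: featurize the training data, fit on the
    // features, then compose this node with the fitted model.
    def andThenLabelEstimator[C, L](est: LabelEstimator[B, C, L]): LabelEstimator[A, C, L] =
      new LabelEstimator[A, C, L] {
        def fit(data: Seq[A], labels: Seq[L]): Transformer[A, C] =
          self.andThen(est.fit(data.map(x => self(x)), labels))
      }
  }

  abstract class LabelEstimator[A, B, L] {
    def fit(data: Seq[A], labels: Seq[L]): Transformer[A, B]
  }

  // Hypothetical featurizer: maps a string to its length.
  object Length extends Transformer[String, Double] {
    def apply(in: String): Double = in.length.toDouble
  }

  // Hypothetical one-dimensional least squares through the origin:
  // w = sum(x * y) / sum(x * x). Assumes at least one non-zero feature.
  object ScalarLeastSquares extends LabelEstimator[Double, Double, Double] {
    def fit(data: Seq[Double], labels: Seq[Double]): Transformer[Double, Double] = {
      val w = data.zip(labels).map { case (x, y) => x * y }.sum / data.map(x => x * x).sum
      new Transformer[Double, Double] { def apply(in: Double): Double = w * in }
    }
  }
}
```

The fitted result is an ordinary `Transformer[String, Double]`, so it can be applied to new inputs or chained further like any other node.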
USING THE API
CREATE A TRANSFORMER
• Can be defined as a class.
• Or as a unary function.
“I want a node that takes a vector and divides by its two-norm.”
Chain with other nodes:
val pipeline = Vectorizer then Normalizer
or
val pipeline = Vectorizer thenFunction normalize _
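The slide's code for the two options did not survive extraction, so the following is a reconstruction sketch under stated assumptions: Scala's `Vector[Double]` stands in for the library's vector type, and `fromFunction` is a hypothetical helper for wrapping a plain function as a node.

```scala
object NormalizerSketch {
  abstract class Transformer[A, B] { def apply(in: A): B }

  // Option 1: define the node as a class (here a singleton object).
  // A zero vector has norm 0 and would produce NaN entries.
  object Normalizer extends Transformer[Vector[Double], Vector[Double]] {
    def apply(in: Vector[Double]): Vector[Double] = {
      val norm = math.sqrt(in.map(x => x * x).sum)
      in.map(_ / norm)
    }
  }

  // Option 2: define it as a unary function...
  def normalize(in: Vector[Double]): Vector[Double] = {
    val norm = math.sqrt(in.map(x => x * x).sum)
    in.map(_ / norm)
  }

  // ...and wrap the function as a node when it needs to be chained.
  def fromFunction[A, B](f: A => B): Transformer[A, B] =
    new Transformer[A, B] { def apply(in: A): B = f(in) }
}
```

Both forms compute the same result; the function form is shorter, while the class form has room for parameters and documentation.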
INTEGRATING WITH C CODE?
Don’t reinvent the wheel - use JNI!
• Loads a shared library.
• javah generates a header for the shared library.
• Native code is shared across the cluster.
RESULTS
• TIMIT Speech Dataset:
  • Matches state-of-the-art statistical performance.
  • 90 minutes on 64 EC2 nodes.
  • Compare to 120 minutes on 256 IBM Blue Gene machines.
• ImageNet:
  • 67% accuracy with weighted block coordinate descent. Matches best accuracy with 64k features in the 2012 ImageNet contest.
  • 90 minutes end-to-end on 100 nodes.
RESEARCH COMPLEMENT TO
MLLIB PIPELINE API
MLlib Pipeline API
• Basic set of operators for text, numbers.
• All spark.ml operations are transformations on DataFrames.
• Scala, Java, Python
• Part of Apache Spark
KeystoneML
• Enriched set of operators for complex domains: vision, NLP, speech, plus advanced math.
• Type safety.
• Scala-only (for now)
• Separate project (for now)
• External library integration.
• Integrated with new BDAS technologies: Velox, ml-matrix, soon Planck, TuPAQ and SampleClean
RESEARCH DIRECTIONS
AUTOMATIC RESOURCE
ESTIMATION
• Long, complicated pipelines.
• Just a composition of dataflows!
• How long will this thing take to run?
• When do I cache?
• Pose as a constrained optimization problem.
• Enables efficient hyperparameter tuning.
[Diagram: the ImageNet pipeline, Resize → Grayscale → SIFT → PCA → Fisher Vector and LCS → PCA → Fisher Vector, feeding a Block Linear Solver or Weighted Block Linear solver and a Top 5 Classifier]
BETTER ALGORITHMS
• Linear models are pretty great.
• Can be slow to solve exactly.
• Need tons of RAM to materialize full, dense,
feature spaces.
• In classification - bad at class imbalance.
• We’ve developed a new algorithm to address all
of these issues.
• Block Weighted Linear Solve
• Distributed, lazily materialized, weighted, and
approximate.
• Can be orders of magnitude faster than
direct solve methods, and much more
accurate for highly imbalanced problems.
[Diagram: solve(Labels, Features) = Model 1; solve(Residual 1, Features) = Model 2; …; solve(Residual N, Features) = Final Model]
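The residual-fitting picture can be illustrated with a deliberately simplified sketch: one pass of block coordinate descent where every block is a single feature column, with no weighting, no distribution and no lazy materialization. This is an assumption-laden toy, not the authors' solver; `solveByBlocks`, `Col` and `dot` are hypothetical names.

```scala
object BlockSolveSketch {
  type Col = Vector[Double]

  private def dot(a: Col, b: Col): Double =
    a.zip(b).map { case (x, y) => x * y }.sum

  // One pass over the feature columns. Each step solves a one-column least
  // squares problem against the current residual, then removes that column's
  // contribution from the residual. Assumes no column is all zeros.
  def solveByBlocks(features: Seq[Col], labels: Col): (Vector[Double], Col) =
    features.foldLeft((Vector.empty[Double], labels)) {
      case ((weights, residual), col) =>
        val w = dot(col, residual) / dot(col, col)
        val next = residual.zip(col).map { case (r, x) => r - w * x }
        (weights :+ w, next)
    }
}
```

For orthogonal columns a single pass already recovers the exact least squares solution; in general, several passes over the blocks would be needed to converge.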
MORE APPLICATIONS
• Ongoing collaborations involving:
• Astronomical image processing.
• Count the stars in the sky!
• Find sunspots with image
classification.
• High resolution structure recognition for
materials science.
• Images so big they are RDDs!
• Advanced Language Models.
• Scalable Kneser-Ney smoothing for
high-quality machine translation.
D. Ushizima et al., 2014
Z. Zhang et al., 2014
E. Jonas et al., 2015
QUESTIONS?
Contributors:
Daniel Bruckner, Mike Franklin, Glyfi Gudmundsson, Tomer Kaftan, Daniel Langkilde, Henry Milner, Ben Recht, Vaishaal Shankar, Evan Sparks, Shivaram Venkataraman, Zongheng Yang
KeystoneML: http://keystone-ml.org/
Code: http://github.com/amplab/keystone

Talk: Building Large Scale Machine Learning Applications with Pipelines (Evan Sparks and Shivaram Venkataraman, UC Berkeley AMPLab)

obat aborsi Tarakan wa 081336238223 jual obat aborsi cytotec asli di Tarakan9...
 
Bios of leading Astrologers & Researchers
Bios of leading Astrologers & ResearchersBios of leading Astrologers & Researchers
Bios of leading Astrologers & Researchers
 
Northern New England Tableau User Group (TUG) May 2024
Northern New England Tableau User Group (TUG) May 2024Northern New England Tableau User Group (TUG) May 2024
Northern New England Tableau User Group (TUG) May 2024
 
Huawei Ransomware Protection Storage Solution Technical Overview Presentation...
Huawei Ransomware Protection Storage Solution Technical Overview Presentation...Huawei Ransomware Protection Storage Solution Technical Overview Presentation...
Huawei Ransomware Protection Storage Solution Technical Overview Presentation...
 
ℂall Girls In Navi Mumbai Hire Me Neha 9910780858 Top Class ℂall Girl Serviℂe...
ℂall Girls In Navi Mumbai Hire Me Neha 9910780858 Top Class ℂall Girl Serviℂe...ℂall Girls In Navi Mumbai Hire Me Neha 9910780858 Top Class ℂall Girl Serviℂe...
ℂall Girls In Navi Mumbai Hire Me Neha 9910780858 Top Class ℂall Girl Serviℂe...
 
Abortion Clinic in Kempton Park +27791653574 WhatsApp Abortion Clinic Service...
Abortion Clinic in Kempton Park +27791653574 WhatsApp Abortion Clinic Service...Abortion Clinic in Kempton Park +27791653574 WhatsApp Abortion Clinic Service...
Abortion Clinic in Kempton Park +27791653574 WhatsApp Abortion Clinic Service...
 
DS Lecture-1 about discrete structure .ppt
DS Lecture-1 about discrete structure .pptDS Lecture-1 about discrete structure .ppt
DS Lecture-1 about discrete structure .ppt
 
Abortion pills in Doha {{ QATAR }} +966572737505) Get Cytotec
Abortion pills in Doha {{ QATAR }} +966572737505) Get CytotecAbortion pills in Doha {{ QATAR }} +966572737505) Get Cytotec
Abortion pills in Doha {{ QATAR }} +966572737505) Get Cytotec
 
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
 
Jual Obat Aborsi Bandung (Asli No.1) Wa 082134680322 Klinik Obat Penggugur Ka...
Jual Obat Aborsi Bandung (Asli No.1) Wa 082134680322 Klinik Obat Penggugur Ka...Jual Obat Aborsi Bandung (Asli No.1) Wa 082134680322 Klinik Obat Penggugur Ka...
Jual Obat Aborsi Bandung (Asli No.1) Wa 082134680322 Klinik Obat Penggugur Ka...
 
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi ArabiaIn Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
 
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptx
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptxRESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptx
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptx
 
如何办理(UCLA毕业证书)加州大学洛杉矶分校毕业证成绩单学位证留信学历认证原件一样
如何办理(UCLA毕业证书)加州大学洛杉矶分校毕业证成绩单学位证留信学历认证原件一样如何办理(UCLA毕业证书)加州大学洛杉矶分校毕业证成绩单学位证留信学历认证原件一样
如何办理(UCLA毕业证书)加州大学洛杉矶分校毕业证成绩单学位证留信学历认证原件一样
 

Building Large Scale Machine Learning Applications with Pipelines (Evan Sparks and Shivaram Venkataraman, UC Berkeley AMPLab)

  • 1. KEYSTONEML Evan R. Sparks, Shivaram Venkataraman. With: Tomer Kaftan, Zongheng Yang, Mike Franklin, Ben Recht
  • 2. WHAT IS KEYSTONEML
      • Software framework for building scalable end-to-end machine learning pipelines.
      • Helps us understand what it means to build systems for robust, scalable, end-to-end advanced analytics workloads and the patterns that emerge.
      • Example pipelines that achieve state-of-the-art results on large-scale datasets in computer vision, NLP, and speech - fast.
      • Previewed at AMP Camp 5 and on the AMPLab Blog as “ML Pipelines”.
      • Public release last month! http://keystone-ml.org/
  • 3. HOW DOES IT FIT WITH BDAS? [Diagram: Spark; MLlib, GraphX, ml-matrix; KeystoneML (batch model training); Velox Model Server (real-time serving).] http://amplab.github.io/velox-modelserver
  • 5. A STANDARD MACHINE LEARNING PIPELINE [Diagram: Data, Train Classifier, Model.] Right?
  • 6. A STANDARD MACHINE LEARNING PIPELINE [Diagram: Data, Feature Extraction, Train Linear Classifier, Model, Test Data, Predictions.] That’s more like it!
  • 7. SIMPLE EXAMPLE: TEXT CLASSIFICATION Dataset: 20 Newsgroups. Pipeline before .fit( ): Trim, Tokenize, Bigrams, Top Features, Naive Bayes, Max Classifier. Pipeline after .fit( ): Trim, Tokenize, Bigrams, Top Features Transformer, Naive Bayes Model, Max Classifier. Then evaluate and ship to Velox when you’re ready to apply to real-time data.
  • 8. NOT SO SIMPLE EXAMPLE: IMAGE CLASSIFICATION Dataset: Images (VOC2007). Pipeline before .fit( ): Resize, Grayscale, SIFT, PCA, FisherVector, Linear Regression, MaxClassifier. Pipeline after .fit( ): Resize, Grayscale, SIFT, PCA Map, Fisher Encoder, Linear Model, MaxClassifier. Achieves the performance of Chatfield et al., 2011. Embarrassingly parallel featurization and evaluation. 15 minutes on a modest cluster. 5,000 examples, 40,000 features, 20 classes.
  • 9. EVEN LESS SIMPLE: IMAGENET [Diagram nodes: Color Edges, Resize, Grayscale, SIFT, PCA, FisherVector, LCS, PCA, FisherVector, Texture, More Stuff I’d Never Heard Of, Block Linear Solver, Weighted Block Linear Solver, Top 5 Classifier.] <200 SLOC. And Shivaram doesn’t have a heart attack when Prof. Recht tells us he wants to try a new solver. Or add 100,000 more texture features. 1000-class classification. 1,200,000 examples, 64,000 features. 90 minutes on 100 nodes.
  • 10. SOFTWARE FEATURES
      • Data Loaders
          • CSV, CIFAR, ImageNet, VOC, TIMIT, 20 Newsgroups
      • Transformers
          • NLP - Tokenization, n-grams, term frequency, NER*, parsing*
          • Images - Convolution, Grayscaling, LCS, SIFT*, FisherVector*, Pooling, Windowing, HOG, Daisy
          • Speech - MFCCs*
          • Stats - Random Features, Normalization, Scaling*, Signed Hellinger Mapping, FFT
          • Utility/misc - Caching, Top-K classifier, indicator label mapping, sparse/dense encoding transformers.
      • Estimators
          • Learning - Block linear models, Linear Discriminant Analysis, PCA, ZCA Whitening, Naive Bayes*, GMM*
      • Example Pipelines
          • NLP - 20 Newsgroups, Wikipedia Language model
          • Images - MNIST, CIFAR, VOC, ImageNet
          • Speech - TIMIT
      • Evaluation Metrics
          • Binary Classification
          • Multiclass Classification
          • Multilabel Classification
      Just 5k Lines of Code, 1.5k of which are TESTS + 1.5k lines of JavaDoc
      * - Links to external library: MLlib, ml-matrix, VLFeat, EncEval
  • 12. TRANSFORMERS [Diagram: Input → Transformer → Output.]
      abstract class Transformer[In, Out] {
        def apply(in: In): Out
        def apply(in: RDD[In]): RDD[Out] = in.map(apply)
        …
      }
      TYPE SAFETY HELPS ENSURE ROBUSTNESS
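
To make the Transformer contract concrete, here is a minimal sketch that can be run without Spark. It assumes a plain `Seq` as a stand-in for `RDD`, and the `Trim` and `Tokenize` nodes are hypothetical illustrations borrowed from the node names on the text-classification slide, not KeystoneML's own implementations.

```scala
// Sketch only: Seq stands in for Spark's RDD so the example runs locally.
object TransformerSketch {
  abstract class Transformer[In, Out] {
    // Per-item logic supplied by each concrete node.
    def apply(in: In): Out
    // Bulk version: map the per-item logic over the whole collection.
    def apply(in: Seq[In]): Seq[Out] = in.map(x => apply(x))
  }

  // Hypothetical node: strips leading and trailing whitespace.
  object Trim extends Transformer[String, String] {
    def apply(in: String): String = in.trim
  }

  // Hypothetical node: splits a string on runs of whitespace.
  object Tokenize extends Transformer[String, Seq[String]] {
    def apply(in: String): Seq[String] = in.split("\\s+").toSeq
  }
}
```

Calling `TransformerSketch.Trim(Seq(" a", "b "))` uses the bulk overload, while `TransformerSketch.Trim("  hi ")` uses the per-item one. The compiler rejects a node applied to the wrong input type, which is the type-safety point the slide makes.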
  • 13. ESTIMATORS [Diagram: RDD[Input] → Estimator → .fit() → Transformer.]
      abstract class Estimator[In, Out] {
        def fit(in: RDD[In]): Transformer[In, Out]
        …
      }
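
The Estimator idea (fit on data, get back a Transformer) can be sketched in the same local style. The following assumes `Seq` in place of `RDD`; `MeanCenterer` is a hypothetical estimator chosen only because its fitted state is a single number.

```scala
// Sketch only: Seq stands in for Spark's RDD so the example runs locally.
object EstimatorSketch {
  abstract class Transformer[In, Out] {
    def apply(in: In): Out
    def apply(in: Seq[In]): Seq[Out] = in.map(x => apply(x))
  }

  abstract class Estimator[In, Out] {
    // Learn from data and return a node that can be applied to new items.
    def fit(in: Seq[In]): Transformer[In, Out]
  }

  // Hypothetical estimator: learns the mean, returns a node that subtracts it.
  object MeanCenterer extends Estimator[Double, Double] {
    def fit(in: Seq[Double]): Transformer[Double, Double] = {
      require(in.nonEmpty, "cannot fit on empty data")
      val mean = in.sum / in.size
      new Transformer[Double, Double] {
        def apply(x: Double): Double = x - mean
      }
    }
  }
}
```

For example, `EstimatorSketch.MeanCenterer.fit(Seq(1.0, 2.0, 3.0))` returns a transformer that maps `2.0` to `0.0`.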
  • 14. CHAINING [Diagram: String → NGrams(2) → Bigrams → Vectorizer → Vector, which equals String → featurizer → Vector.]
      val featurizer: Transformer[String, Vector] = NGrams(2) then Vectorizer
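
Chaining can be sketched as ordinary function composition over typed nodes. In this sketch the combinator is named `andThen` rather than the slide's `then`, because `then` is a reserved word in Scala 3; `NGrams` follows the slide's name but its body is an assumption, and `TermCounts` is a hypothetical stand-in for `Vectorizer`.

```scala
// Sketch only: a local, Spark-free illustration of node chaining.
object ChainingSketch {
  abstract class Transformer[A, B] { self =>
    def apply(in: A): B
    // Compose two nodes: feed this node's output into `next`.
    def andThen[C](next: Transformer[B, C]): Transformer[A, C] =
      new Transformer[A, C] {
        def apply(in: A): C = next(self(in))
      }
  }

  // Assumed behavior: word n-grams of a whitespace-tokenized string.
  case class NGrams(n: Int) extends Transformer[String, Seq[String]] {
    def apply(in: String): Seq[String] =
      in.trim.split("\\s+").toSeq
        .sliding(n)
        .filter(_.size == n) // drop the short window produced by short inputs
        .map(_.mkString(" "))
        .toSeq
  }

  // Hypothetical stand-in for Vectorizer: counts occurrences of each term.
  object TermCounts extends Transformer[Seq[String], Map[String, Int]] {
    def apply(in: Seq[String]): Map[String, Int] =
      in.groupBy(identity).map { case (term, hits) => term -> hits.size }
  }

  // The output type of NGrams must match the input type of TermCounts.
  val featurizer: Transformer[String, Map[String, Int]] =
    NGrams(2) andThen TermCounts
}
```

Swapping the two nodes (`TermCounts andThen NGrams(2)`) does not compile, since a `Map[String, Int]` is not a `String`.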
  • 15. COMPLEX PIPELINES [Diagram: before fitting, String → featurizer → Vector → Linear Model → Prediction; after .fit(data, labels), String → featurizer → Vector → Linear Map → Prediction, which equals String → pipeline → Prediction.]
      val pipeline = (featurizer thenLabelEstimator LinearModel).fit(data, labels)
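
One way to read the `thenLabelEstimator ... .fit(data, labels)` pattern: chaining an estimator yields an unfitted pipeline, and fitting it featurizes the training data, trains the estimator on the features, and returns an ordinary transformer. The sketch below is an assumption about that flow, not KeystoneML's code. `UnfitPipeline`, `Length`, and `SlopeModel` are hypothetical names, `Seq` replaces `RDD`, and `andThen` replaces `then` because the latter is reserved in Scala 3.

```scala
// Sketch only: a local illustration of chaining a supervised estimator.
object PipelineSketch {
  abstract class Transformer[A, B] { self =>
    def apply(in: A): B
    def andThen[C](next: Transformer[B, C]): Transformer[A, C] =
      new Transformer[A, C] { def apply(in: A): C = next(self(in)) }
    // Chain a supervised estimator; nothing is trained until fit is called.
    def andThenLabelEstimator[C, L](est: LabelEstimator[B, C, L]): UnfitPipeline[A, B, C, L] =
      new UnfitPipeline(self, est)
  }

  abstract class LabelEstimator[In, Out, L] {
    def fit(data: Seq[In], labels: Seq[L]): Transformer[In, Out]
  }

  class UnfitPipeline[A, B, C, L](prefix: Transformer[A, B], est: LabelEstimator[B, C, L]) {
    def fit(data: Seq[A], labels: Seq[L]): Transformer[A, C] = {
      val features = data.map(x => prefix(x))  // featurize the training data
      prefix andThen est.fit(features, labels) // train, then compose
    }
  }

  // Hypothetical featurizer: string length as a single numeric feature.
  object Length extends Transformer[String, Double] {
    def apply(in: String): Double = in.length.toDouble
  }

  // Stand-in for LinearModel: 1-D least squares through the origin.
  // Assumes at least one nonzero feature value.
  object SlopeModel extends LabelEstimator[Double, Double, Double] {
    def fit(data: Seq[Double], labels: Seq[Double]): Transformer[Double, Double] = {
      val w = data.zip(labels).map { case (x, y) => x * y }.sum / data.map(x => x * x).sum
      new Transformer[Double, Double] { def apply(x: Double): Double = w * x }
    }
  }

  val pipeline: Transformer[String, Double] =
    (Length andThenLabelEstimator SlopeModel).fit(Seq("a", "bb", "ccc"), Seq(2.0, 4.0, 6.0))
}
```

After fitting, `pipeline` has the same type as any other node, so it can be applied to new strings or chained further.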
  • 17. CREATE A TRANSFORMER “I want a node that takes a vector and divides by its two-norm.”
      • Can be defined as a class.
      • Or as a unary function.
      Chain with other nodes:
      val pipeline = Vectorizer then Normalizer
      or
      val pipeline = Vectorizer thenFunction normalize _
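
The two options on this slide, a node defined as a class or as a unary function, can be sketched as follows. The code that accompanied the slide is not in the transcript, so everything here is an assumption: `Vector[Double]` from the standard library stands in for the framework's vector type, and `fromFunction` is a hypothetical helper that plays the role of `thenFunction`.

```scala
// Sketch only: two ways to define a two-norm normalizer node.
object NormalizerSketch {
  abstract class Transformer[A, B] {
    def apply(in: A): B
  }

  object Transformer {
    // Hypothetical helper: lift a plain unary function into a node.
    def fromFunction[A, B](f: A => B): Transformer[A, B] =
      new Transformer[A, B] { def apply(in: A): B = f(in) }
  }

  // Shared logic: divide each component by the Euclidean (two-) norm.
  // The zero vector is returned unchanged to avoid dividing by zero.
  def normalize(v: Vector[Double]): Vector[Double] = {
    val norm = math.sqrt(v.map(x => x * x).sum)
    if (norm == 0.0) v else v.map(_ / norm)
  }

  // Option 1: define the node as a class (here, a singleton object).
  object Normalizer extends Transformer[Vector[Double], Vector[Double]] {
    def apply(in: Vector[Double]): Vector[Double] = normalize(in)
  }

  // Option 2: wrap the unary function directly.
  val normalizerFn: Transformer[Vector[Double], Vector[Double]] =
    Transformer.fromFunction((v: Vector[Double]) => normalize(v))
}
```

Both forms produce a node of the same type, so either can be chained after a vectorizing step.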
  • 18. INTEGRATING WITH C CODE? Don’t reinvent the wheel - use JNI! Loads a shared library. javah generates a header for the shared library. Native code is shared across the cluster.
  • 19. RESULTS
      • TIMIT Speech Dataset:
          • Matches state-of-the-art statistical performance.
          • 90 minutes on 64 EC2 nodes.
          • Compare to 120 minutes on 256 IBM Blue Gene machines.
      • ImageNet:
          • 67% accuracy with weighted block coordinate descent. Matches best accuracy with 64k features in the 2012 ImageNet contest.
          • 90 minutes end-to-end on 100 nodes.
  • 20. RESEARCH COMPLEMENT TO MLLIB PIPELINE API
      MLlib Pipeline API:
      • Basic set of operators for text, numbers.
      • All spark.ml operations are transformations on DataFrames.
      • Scala, Java, Python
      • Part of Apache Spark
      KeystoneML:
      • Enriched set of operators for complex domains: vision, NLP, speech, plus advanced math.
      • Type safety.
      • Scala-only (for now)
      • Separate project (for now)
      • External library integration.
      • Integrated with new BDAS technologies: Velox, ml-matrix, soon Planck, TuPAQ and SampleClean
  • 22. AUTOMATIC RESOURCE ESTIMATION
      • Long, complicated pipelines.
      • Just a composition of dataflows!
      • How long will this thing take to run?
      • When do I cache?
      • Pose as a constrained optimization problem.
      • Enables efficient hyperparameter tuning.
      [Diagram nodes: Resize, Grayscale, SIFT, PCA, FisherVector, LCS, PCA, FisherVector, Block Linear Solver, Weighted Block Linear, Top 5 Classifier.]
  • 23. BETTER ALGORITHMS
      • Linear models are pretty great.
          • Can be slow to solve exactly.
          • Need tons of RAM to materialize full, dense feature spaces.
          • In classification - bad at class imbalance.
      • We’ve developed a new algorithm to address all of these issues.
      • Block Weighted Linear Solve
          • Distributed, lazily materialized, weighted, and approximate.
          • Can be orders of magnitude faster than direct solve methods, and much more accurate for highly imbalanced problems.
      [Diagram: Model 1 = solve(Features, Labels); Model 2 = solve(Features, Residual 1); …; Final Model = solve(Features, Residual N).]
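
The solve-against-the-residual loop in this slide's diagram can be illustrated with a heavily simplified sketch. This is not the Block Weighted Linear Solve itself: it assumes every feature block is a single column (so each per-block solve is a scalar least-squares fit), it is unweighted, and it runs on local arrays rather than a cluster. The name `BlockSolveSketch` and the `passes` parameter are hypothetical.

```scala
// Sketch only: block coordinate descent for least squares with block size 1.
object BlockSolveSketch {
  // columns(j) holds feature block j, reduced here to one column.
  // Returns one weight per column.
  def solve(columns: Seq[Array[Double]], labels: Array[Double], passes: Int = 1): Array[Double] = {
    val weights = Array.fill(columns.length)(0.0)
    val residual = labels.clone() // what the model so far fails to explain
    for (_ <- 0 until passes; j <- columns.indices) {
      val col = columns(j)
      val colNormSq = col.map(x => x * x).sum
      if (colNormSq > 0.0) {
        // Scalar least-squares fit of this block against the current residual.
        val dot = col.indices.map(i => col(i) * residual(i)).sum
        val delta = dot / colNormSq
        weights(j) += delta
        // Update the residual before moving on to the next block.
        for (i <- col.indices) residual(i) -= delta * col(i)
      }
    }
    weights
  }
}
```

Only one block and the residual are needed at a time, which is what makes lazy materialization of features possible. When the columns are mutually orthogonal a single pass recovers the exact least-squares weights; correlated columns generally need several passes.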
  • 24. MORE APPLICATIONS
      • Ongoing collaborations involving:
          • Astronomical image processing.
              • Count the stars in the sky!
              • Find sunspots with image classification.
          • High resolution structure recognition for materials science.
              • Images so big they are RDDs!
          • Advanced Language Models.
              • Scalable Kneser-Ney smoothing for high-quality machine translation.
      D. Ushizima et al., 2014; Z. Zhang et al., 2014; E. Jonas et al., 2015
  • 25. QUESTIONS? Contributors: Daniel Bruckner, Mike Franklin, Gylfi Gudmundsson, Tomer Kaftan, Daniel Langkilde, Henry Milner, Ben Recht, Vaishaal Shankar, Evan Sparks, Shivaram Venkataraman, Zongheng Yang. KeystoneML: http://keystone-ml.org/ Code: http://github.com/amplab/keystone