Spark DataFrames
and ML Pipelines
Joseph K. Bradley
May 1, 2015
MLconf Seattle
Who am I?
Joseph K. Bradley
Ph.D. in ML from CMU, postdoc at Berkeley
Apache Spark committer
Software Engineer @ Databricks Inc.
Databricks Inc.
Founded by the creators of Spark
& driving its development
Databricks Cloud: the best place to run Spark
Guess what…we’re hiring!
databricks.com/company/careers
Concise APIs in Python, Java, Scala
… and R in Spark 1.4!
500+ enterprises using or planning
to use Spark in production (blog)
Spark
SparkSQL Streaming MLlib GraphX
Distributed computing engine
• Built for speed, ease of use,
and sophisticated analytics
• Apache open source
Beyond Hadoop
Early adopters (Data) Engineers
MapReduce &
functional API
Data Scientists
& Statisticians
Spark for Data Science
DataFrames
Intuitive manipulation of distributed structured data
Machine Learning Pipelines
Simple construction and tuning of ML workflows
Google Trends for “dataframe”
DataFrames
dept age name
Bio 48 H Smith
CS 54 A Turing
Bio 43 B Jones
Chem 61 M Kennedy
RDD API
DataFrame API
Data grouped into
named columns
DataFrames
dept age name
Bio 48 H Smith
CS 54 A Turing
Bio 43 B Jones
Chem 61 M Kennedy
Data grouped into
named columns
DSL for common tasks
• Project, filter, aggregate, join, …
• Metadata
• UDFs
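The common tasks listed above can be illustrated on the example table. This is a plain-Python sketch of the ideas only, not Spark's DataFrame API: rows are dictionaries, and the helper names `select`, `where`, and `group_avg` are made up for illustration.

```python
from collections import defaultdict

# The example table from the slide: data grouped into named columns.
rows = [
    {"dept": "Bio", "age": 48, "name": "H Smith"},
    {"dept": "CS", "age": 54, "name": "A Turing"},
    {"dept": "Bio", "age": 43, "name": "B Jones"},
    {"dept": "Chem", "age": 61, "name": "M Kennedy"},
]

def select(rows, *cols):
    """Project: keep only the named columns."""
    return [{c: r[c] for c in cols} for r in rows]

def where(rows, predicate):
    """Filter: keep rows for which the predicate holds."""
    return [r for r in rows if predicate(r)]

def group_avg(rows, key, value):
    """Aggregate: average of `value` per distinct `key`."""
    sums = defaultdict(lambda: [0, 0])
    for r in rows:
        sums[r[key]][0] += r[value]
        sums[r[key]][1] += 1
    return {k: total / count for k, (total, count) in sums.items()}

# Names and departments of everyone older than 45.
over_45 = select(where(rows, lambda r: r["age"] > 45), "name", "dept")

# Average age per department.
avg_age = group_avg(rows, "dept", "age")
```

In a distributed DataFrame the same three steps would be expressed declaratively so that an optimizer can plan them; the sketch only shows what each step computes.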
Spark DataFrames
API inspired by R and Python Pandas
• Python, Scala, Java (+ R in dev)
• Pandas integration
Distributed DataFrame
Highly optimized
Spark DataFrames are fast
[Bar chart: runtime of aggregating 10 million int pairs (secs), axis from 0 to 10, for RDD Scala, RDD Python, Spark Scala DF, and Spark Python DF; shorter bars are better]
Uses SparkSQL
Catalyst optimizer
Demo: DataFrames
Spark for Data Science
DataFrames
• Structured data
• Familiar API based on R & Python Pandas
• Distributed, optimized implementation
Machine Learning Pipelines
Simple construction and tuning of ML workflows
About Spark MLlib
Started @ Berkeley
• Spark 0.8
Now (Spark 1.3)
• Contributions from 50+ orgs, 100+ individuals
• Growing coverage of distributed algorithms
Spark
SparkSQL Streaming MLlib GraphX
About Spark MLlib
Classification
• Logistic regression
• Naive Bayes
• Streaming logistic regression
• Linear SVMs
• Decision trees
• Random forests
• Gradient-boosted trees
Regression
• Ordinary least squares
• Ridge regression
• Lasso
• Isotonic regression
• Decision trees
• Random forests
• Gradient-boosted trees
• Streaming linear methods
Statistics
• Pearson correlation
• Spearman correlation
• Online summarization
• Chi-squared test
• Kernel density estimation
Linear algebra
• Local dense & sparse vectors & matrices
• Distributed matrices
  • Block-partitioned matrix
  • Row matrix
  • Indexed row matrix
  • Coordinate matrix
• Matrix decompositions
Frequent itemsets
• FP-growth
Model import/export
Clustering
• Gaussian mixture models
• K-Means
• Streaming K-Means
• Latent Dirichlet Allocation
• Power Iteration Clustering
Recommendation
• Alternating Least Squares
Feature extraction & selection
• Word2Vec
• Chi-Squared selection
• Hashing term frequency
• Inverse document frequency
• Normalizer
• Standard scaler
• Tokenizer
ML Workflows are complex
Image classification pipeline*
* Evan Sparks. “ML Pipelines.” amplab.cs.berkeley.edu/ml-pipelines
• Specify pipeline
• Inspect & debug
• Re-run on new data
• Tune parameters
Example: Text Classification
Goal: Given a text document, predict its topic.
Subject: Re: Lexan Polish?
Suggest McQuires #1 plastic
polish. It will help somewhat
but nothing will remove deep
scratches without making it
worse than it already is.
McQuires will do something...
1: about science
0: not about science
Label
Features
Dataset: “20 Newsgroups”
From UCI KDD Archive
ML Workflow
Load data → Extract features → Train model → Evaluate
Load Data
Load data → Extract features → Train model → Evaluate
Data sources for DataFrames
Built-in and external: JSON, JDBC, and more …
Load Data
Load data → Extract features → Train model → Evaluate
Current data schema
label: Int
text: String
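To make the loading step concrete, here is a small sketch that reads JSON-lines text into records and checks them against the two-column schema shown above. It uses only Python's standard library rather than Spark's data source API; the sample records and the helper name `load_json_lines` are hypothetical.

```python
import json

# Hypothetical JSON-lines input; the field names follow the slide's schema.
raw = """\
{"label": 1, "text": "Suggest a plastic polish for Lexan"}
{"label": 0, "text": "Looking for a used motorcycle helmet"}
"""

# Schema from the slide: label: Int, text: String.
SCHEMA = {"label": int, "text": str}

def load_json_lines(source, schema):
    """Parse one JSON object per line and check each field's type."""
    records = []
    for line in source.splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        for field, expected in schema.items():
            if not isinstance(record.get(field), expected):
                raise TypeError(f"{field!r} should be {expected.__name__}")
        records.append(record)
    return records

data = load_json_lines(raw, SCHEMA)
```

Checking types at load time means later stages can rely on `label` and `text` being present with the expected types.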
Extract Features
Load data → Extract features → Train model → Evaluate
Current data schema
label: Int
text: String
Extract Features
Load data → Tokenizer → Hashed Term Freq. → Train model → Evaluate
Transformer: Tokenizer, Hashed Term Freq.
Current data schema
label: Int
text: String
words: Seq[String]
features: Vector
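The two feature-extraction steps can be sketched in a few lines of plain Python. This is an illustration of the idea under simple assumptions (whitespace tokenization, CRC32 as the hash function, 16 buckets), not Spark's implementation; the function names are made up.

```python
import zlib

def tokenize(text):
    """Tokenizer: lower-case the text and split on whitespace."""
    return text.lower().split()

def hashed_term_freq(words, num_features=16):
    """Hashed term frequency: count words in a fixed number of buckets."""
    vector = [0] * num_features
    for word in words:
        # crc32 is stable across runs, unlike Python's built-in hash() for strings.
        bucket = zlib.crc32(word.encode("utf-8")) % num_features
        vector[bucket] += 1
    return vector

def transform(row):
    """Append `words` and `features` columns to a copy of the row."""
    out = dict(row)
    out["words"] = tokenize(out["text"])
    out["features"] = hashed_term_freq(out["words"])
    return out

row = transform({
    "label": 1,
    "text": "It will help somewhat but nothing will remove deep scratches",
})
```

Each step adds a column and leaves the existing ones in place, which matches the growing schema on the slide: `label` and `text` are joined by `words` and then `features`.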
Train a Model
Load data → Tokenizer → Hashed Term Freq. → Logistic Regression → Evaluate
Transformer: Tokenizer, Hashed Term Freq.
Estimator: Logistic Regression
Current data schema
label: Int
text: String
words: Seq[String]
features: Vector
prediction: Int
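The difference between an Estimator and a Transformer can be shown with a toy example: fitting an Estimator on data returns a model, and that model is itself a Transformer that appends a `prediction` column. The classes below are a plain-Python sketch with hypothetical hyperparameter names (`reg_param`, `step`, `iterations`) and a tiny hand-made training set; they are not Spark's classes.

```python
import math

class LogisticRegressionModel:
    """Transformer produced by fitting: appends a `prediction` column."""
    def __init__(self, weights, bias):
        self.weights = weights
        self.bias = bias

    def transform(self, rows):
        out = []
        for row in rows:
            z = self.bias + sum(w * x for w, x in zip(self.weights, row["features"]))
            out.append({**row, "prediction": 1 if z > 0 else 0})
        return out

class LogisticRegression:
    """Estimator: `fit` learns from labeled rows and returns a model."""
    def __init__(self, reg_param=0.0, step=0.5, iterations=100):
        self.reg_param = reg_param
        self.step = step
        self.iterations = iterations

    def fit(self, rows):
        n = len(rows)
        dim = len(rows[0]["features"])
        weights, bias = [0.0] * dim, 0.0
        for _ in range(self.iterations):
            grad_w, grad_b = [0.0] * dim, 0.0
            for row in rows:
                z = bias + sum(w * x for w, x in zip(weights, row["features"]))
                error = 1.0 / (1.0 + math.exp(-z)) - row["label"]
                grad_b += error / n
                for j, x in enumerate(row["features"]):
                    grad_w[j] += error * x / n
            # Batch gradient step with an L2 penalty scaled by reg_param.
            weights = [w - self.step * (g + self.reg_param * w)
                       for w, g in zip(weights, grad_w)]
            bias -= self.step * grad_b
        return LogisticRegressionModel(weights, bias)

# Tiny, linearly separable training set (made up for illustration).
train = [
    {"label": 0, "features": [0.0, 1.0]},
    {"label": 1, "features": [1.0, 0.0]},
    {"label": 0, "features": [0.0, 2.0]},
    {"label": 1, "features": [2.0, 0.0]},
]
model = LogisticRegression().fit(train)
predicted = model.transform(train)
```

Separating `fit` from `transform` is what lets a trained model be reused on new data without retraining.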
Evaluate the Model
Load data → Tokenizer → Hashed Term Freq. → Logistic Regression → Evaluate
Transformer: Tokenizer, Hashed Term Freq.
Estimator: Logistic Regression
Evaluator: Evaluate
Current data schema
label: Int
text: String
words: Seq[String]
features: Vector
prediction: Int
By default, always append new columns
• Can go back & inspect intermediate results
• Made efficient by DataFrame optimizations
ML Pipelines
Load data → Pipeline (Tokenizer → Hashed Term Freq. → Logistic Regression) → Evaluate
Test data → Tokenizer → Hashed Term Freq. → Logistic Regression → Evaluate
Re-run exactly the same way
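The following sketch shows why wrapping the stages in a pipeline makes re-running on test data straightforward: fitting the pipeline once yields a fitted pipeline that applies the same stages, in the same order, to any new rows. It is plain Python, not Spark's `Pipeline` class, and to stay short it swaps logistic regression for a made-up `KeywordClassifier`.

```python
class Tokenizer:
    """Transformer: appends a `words` column."""
    def transform(self, rows):
        return [{**r, "words": r["text"].lower().split()} for r in rows]

class KeywordModel:
    """Fitted model (a Transformer): appends a `prediction` column."""
    def __init__(self, keywords):
        self.keywords = keywords

    def transform(self, rows):
        return [{**r, "prediction": int(any(w in self.keywords for w in r["words"]))}
                for r in rows]

class KeywordClassifier:
    """Estimator standing in for logistic regression:
    learns the words seen only in label-1 rows."""
    def fit(self, rows):
        positive = {w for r in rows if r["label"] == 1 for w in r["words"]}
        negative = {w for r in rows if r["label"] == 0 for w in r["words"]}
        return KeywordModel(positive - negative)

class PipelineModel:
    """Fitted pipeline: applies every fitted stage in order."""
    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows

class Pipeline:
    """Chains stages; estimators are fitted on the output of earlier stages."""
    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):
                stage = stage.fit(rows)  # an estimator becomes a model
            rows = stage.transform(rows)
            fitted.append(stage)
        return PipelineModel(fitted)

# Made-up training and test rows.
train = [
    {"label": 1, "text": "plastic polish removes scratches"},
    {"label": 0, "text": "selling my old car"},
]
test = [{"text": "which polish works on deep scratches"}]

model = Pipeline([Tokenizer(), KeywordClassifier()]).fit(train)
predictions = model.transform(test)
```

Because the test rows pass through the fitted pipeline rather than through hand-written preprocessing, there is no chance of tokenizing or featurizing them differently from the training rows.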
Parameter Tuning
Tokenizer → Hashed Term Freq. → Logistic Regression → Evaluate
hashingTF.numFeatures: {100, 1000, 10000}
lr.regParam: {0.01, 0.1, 0.5}
Given:
• Estimator
• Parameter grid
• Evaluator
Find best parameters
CrossValidator
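The three inputs named above (an estimator, a parameter grid, and an evaluator) are enough to write a small grid search with k-fold cross-validation. The sketch below is plain Python rather than Spark's `CrossValidator`; the toy threshold estimator, its `threshold` and `invert` parameters, and the data are all hypothetical.

```python
from itertools import product

def param_grid(options):
    """Expand {name: [values]} into a list of {name: value} combinations."""
    names = list(options)
    return [dict(zip(names, combo))
            for combo in product(*(options[n] for n in names))]

def cross_validate(fit, evaluate, rows, grid, num_folds=3):
    """Return (best_params, best_score) by mean score over k folds."""
    best_params, best_score = None, float("-inf")
    for params in grid:
        scores = []
        for fold in range(num_folds):
            held_out = rows[fold::num_folds]
            training = [r for i, r in enumerate(rows) if i % num_folds != fold]
            scores.append(evaluate(fit(training, params), held_out))
        mean = sum(scores) / num_folds
        if mean > best_score:
            best_params, best_score = params, mean
    return best_params, best_score

# Toy estimator: a threshold rule whose settings come from the grid.
def fit(training, params):
    def predict(x):
        above = x > params["threshold"]
        return int(not above) if params["invert"] else int(above)
    return predict

# Toy evaluator: accuracy on the held-out rows.
def evaluate(model, rows):
    return sum(model(r["x"]) == r["label"] for r in rows) / len(rows)

rows = [{"x": x, "label": int(x >= 3)} for x in range(6)]
grid = param_grid({"threshold": [0.5, 2.5, 4.5], "invert": [False, True]})
best_params, best_score = cross_validate(fit, evaluate, rows, grid)
```

The grid on the slide works the same way: three values of `lr.regParam` times three values of `hashingTF.numFeatures` gives nine combinations, each of which is fitted and scored on every fold.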
Demo: ML Pipelines
Recap
DataFrames
• Structured data
• Familiar API based on R & Python Pandas
• Distributed, optimized implementation
Machine Learning Pipelines
• Integration with DataFrames
• Familiar API based on scikit-learn
• Simple parameter tuning
Composable & DAG Pipelines
Schema validation
User-defined Transformers
& Estimators
Looking Ahead
Collaborations with UC Berkeley & others
• Auto-tuning models
DataFrames
• Further optimization
• API for R
ML Pipelines
• More algorithms & pluggability
• API for R
Thank you!
Spark documentation
spark.apache.org
Pipelines blog post
databricks.com/blog/2015/01/07
DataFrames blog post
databricks.com/blog/2015/02/17
Databricks Cloud Platform
databricks.com/product
Spark MOOCs on edX
Intro to Spark & ML with Spark
Spark Packages
spark-packages.org

Joseph Bradley, Software Engineer, Databricks Inc. at MLconf SEA - 5/01/15

  • 1. Spark DataFrames and ML Pipelines Joseph K. Bradley May 1, 2015 MLconf Seattle
  • 2. Who am I? Joseph K. Bradley: Ph.D. in ML from CMU, postdoc at Berkeley; Apache Spark committer; Software Engineer @ Databricks Inc.
  • 3. Databricks Inc.: Founded by the creators of Spark & driving its development. Databricks Cloud: the best place to run Spark. Guess what…we’re hiring! databricks.com/company/careers
  • 4. Spark: Distributed computing engine • Built for speed, ease of use, and sophisticated analytics • Apache open source. Concise APIs in Python, Java, Scala … and R in Spark 1.4! 500+ enterprises using or planning to use Spark in production (blog). [Diagram: Spark with SparkSQL, Streaming, MLlib, GraphX]
  • 5. Beyond Hadoop: Early adopters • (Data) Engineers • MapReduce & functional API • Data Scientists & Statisticians
  • 6. Spark for Data Science. DataFrames: Intuitive manipulation of distributed structured data. Machine Learning Pipelines: Simple construction and tuning of ML workflows
  • 7. Google Trends for “dataframe”
  • 8. DataFrames: Data grouped into named columns. Columns (dept, age, name); rows (Bio, 48, H Smith), (CS, 54, A Turing), (Bio, 43, B Jones), (Chem, 61, M Kennedy). RDD API; DataFrame API
  • 9. DataFrames: Data grouped into named columns. Columns (dept, age, name); rows (Bio, 48, H Smith), (CS, 54, A Turing), (Bio, 43, B Jones), (Chem, 61, M Kennedy). DSL for common tasks • Project, filter, aggregate, join, … • Metadata • UDFs
  • 10. Spark DataFrames: API inspired by R and Python Pandas • Python, Scala, Java (+ R in dev) • Pandas integration. Distributed DataFrame. Highly optimized
  • 11. Spark DataFrames are fast. Uses SparkSQL Catalyst optimizer. [Bar chart: runtime of aggregating 10 million int pairs (secs), axis 0 to 10; bars for RDD Scala, RDD Python, Spark Scala DF, Spark Python DF]
  • 13. Spark for Data Science. DataFrames • Structured data • Familiar API based on R & Python Pandas • Distributed, optimized implementation. Machine Learning Pipelines: Simple construction and tuning of ML workflows
  • 14. About Spark MLlib: Started @ Berkeley • Spark 0.8. Now (Spark 1.3) • Contributions from 50+ orgs, 100+ individuals • Growing coverage of distributed algorithms. [Diagram: Spark with SparkSQL, Streaming, MLlib, GraphX]
  • 15. About Spark MLlib. Classification: Logistic regression; Naive Bayes; Streaming logistic regression; Linear SVMs; Decision trees; Random forests; Gradient-boosted trees. Regression: Ordinary least squares; Ridge regression; Lasso; Isotonic regression; Decision trees; Random forests; Gradient-boosted trees; Streaming linear methods. Statistics: Pearson correlation; Spearman correlation; Online summarization; Chi-squared test; Kernel density estimation. Linear algebra: Local dense & sparse vectors & matrices; Distributed matrices (Block-partitioned matrix, Row matrix, Indexed row matrix, Coordinate matrix); Matrix decompositions. Frequent itemsets: FP-growth. Model import/export. Clustering: Gaussian mixture models; K-Means; Streaming K-Means; Latent Dirichlet Allocation; Power Iteration Clustering. Recommendation: Alternating Least Squares. Feature extraction & selection: Word2Vec; Chi-Squared selection; Hashing term frequency; Inverse document frequency; Normalizer; Standard scaler; Tokenizer
  • 16. ML Workflows are complex: Image classification pipeline* • Specify pipeline • Inspect & debug • Re-run on new data • Tune parameters. (* Evan Sparks. “ML Pipelines.” amplab.cs.berkeley.edu/ml-pipelines)
  • 17. Example: Text Classification. Goal: Given a text document, predict its topic. Features: “Subject: Re: Lexan Polish? Suggest McQuires #1 plastic polish. It will help somewhat but nothing will remove deep scratches without making it worse than it already is. McQuires will do something...” Label: 1: about science; 0: not about science. Dataset: “20 Newsgroups” from UCI KDD Archive
  • 19. Load Data. Workflow: Load data → Extract features → Train model → Evaluate. Data sources for DataFrames: built-in, external; { JSON }, JDBC, and more …
  • 20. Load Data. Workflow: Load data → Extract features → Train model → Evaluate. Current data schema: label: Int; text: String
  • 21. Extract Features. Workflow: Load data → Extract features → Train model → Evaluate. Current data schema: label: Int; text: String
  • 22. Extract Features. Workflow: Load data → Tokenizer → Hashed Term Freq. → Train model → Evaluate. Tokenizer and Hashed Term Freq. are Transformers. Current data schema: label: Int; text: String; words: Seq[String]; features: Vector
  • 23. Train a Model. Workflow: Load data → Tokenizer → Hashed Term Freq. (Transformers) → Logistic Regression (Estimator) → Evaluate. Current data schema: label: Int; text: String; words: Seq[String]; features: Vector; prediction: Int
  • 24. Evaluate the Model. Workflow: Load data → Tokenizer → Hashed Term Freq. (Transformers) → Logistic Regression (Estimator) → Evaluate (Evaluator). Current data schema: label: Int; text: String; words: Seq[String]; features: Vector; prediction: Int. By default, always append new columns • Can go back & inspect intermediate results • Made efficient by DataFrame optimizations
  • 25. ML Pipelines. Pipeline: Tokenizer → Hashed Term Freq. → Logistic Regression. Training: Load data → Pipeline → Evaluate. Test data → Tokenizer → Hashed Term Freq. → Logistic Regression → Evaluate: re-run exactly the same way
  • 26. Parameter Tuning. Workflow: Tokenizer → Hashed Term Freq. → Logistic Regression → Evaluate. Parameter grid: hashingTF.numFeatures {100, 1000, 10000}; lr.regParam {0.01, 0.1, 0.5}. CrossValidator: Given • Estimator • Parameter grid • Evaluator, find best parameters
  • 28. Recap. DataFrames • Structured data • Familiar API based on R & Python Pandas • Distributed, optimized implementation. Machine Learning Pipelines • Integration with DataFrames • Familiar API based on scikit-learn • Simple parameter tuning. Composable & DAG Pipelines; Schema validation; User-defined Transformers & Estimators
  • 29. Looking Ahead. Collaborations with UC Berkeley & others • Auto-tuning models. DataFrames • Further optimization • API for R. ML Pipelines • More algorithms & pluggability • API for R
  • 30. Thank you! Spark documentation: spark.apache.org • Pipelines blog post: databricks.com/blog/2015/01/07 • DataFrames blog post: databricks.com/blog/2015/02/17 • Databricks Cloud Platform: databricks.com/product • Spark MOOCs on edX: Intro to Spark & ML with Spark • Spark Packages: spark-packages.org
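
The feature-extraction stages on slides 22 to 24 (Tokenizer, then Hashed Term Freq., each appending a new column) can be illustrated with a minimal plain-Python sketch. This is not Spark's API: the function names `tokenize`, `hashed_term_freq`, and `run_stages` are hypothetical, rows are ordinary dicts rather than a DataFrame, and `zlib.crc32` is a stand-in hash chosen only because it is deterministic.

```python
import zlib
from collections import Counter


def tokenize(row):
    """Transformer-style step: keep every existing column, append 'words'."""
    return {**row, "words": row["text"].lower().split()}


def hashed_term_freq(row, num_features=1000):
    """Transformer-style step: append a sparse 'features' column.

    Each word is hashed to a bucket in [0, num_features); the stored value
    is how many words of the document landed in that bucket.
    crc32 is an assumed stand-in hash, not the one Spark uses.
    """
    buckets = Counter(
        zlib.crc32(word.encode("utf-8")) % num_features for word in row["words"]
    )
    return {**row, "features": dict(buckets)}


def run_stages(rows, stages):
    """Apply the stages in order; each stage returns new rows with one more column."""
    for stage in stages:
        rows = [stage(row) for row in rows]
    return rows


# Text taken from slide 17; the label value is a placeholder for illustration.
docs = [{"label": 0, "text": "Suggest McQuires #1 plastic polish."}]
result = run_stages(docs, [tokenize, hashed_term_freq])

print(sorted(result[0]))   # column names after both stages
print(result[0]["words"])  # intermediate column is still available
```

Because each stage only appends a column, the intermediate `words` column stays available for inspection after the run, which mirrors the "always append new columns" point on slide 24.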

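The parameter tuning on slide 26 (an estimator, a parameter grid, and an evaluator handed to a cross-validator) can be sketched in plain Python as follows. Only the parameter names and values come from the slide; `expand_grid`, `k_fold_indices`, `grid_search`, and `toy_score` are hypothetical names, the fold assignment is a simple assumption, and `toy_score` stands in for actually fitting and evaluating a model.

```python
from itertools import product

# Parameter names and candidate values as listed on slide 26.
param_grid = {
    "lr.regParam": [0.01, 0.1, 0.5],
    "hashingTF.numFeatures": [100, 1000, 10000],
}


def expand_grid(grid):
    """Return one dict per combination of parameter values."""
    names = list(grid)
    return [dict(zip(names, values)) for values in product(*(grid[n] for n in names))]


def k_fold_indices(n_rows, k):
    """Yield (train_idx, test_idx) pairs; row i is held out in fold i % k."""
    for fold in range(k):
        test = [i for i in range(n_rows) if i % k == fold]
        train = [i for i in range(n_rows) if i % k != fold]
        yield train, test


def grid_search(grid, n_rows, k, fit_and_score):
    """Pick the combination with the highest mean score across the k folds."""
    best_params, best_score = None, float("-inf")
    for params in expand_grid(grid):
        scores = [fit_and_score(params, tr, te) for tr, te in k_fold_indices(n_rows, k)]
        mean_score = sum(scores) / len(scores)
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score


def toy_score(params, train_idx, test_idx):
    """Hypothetical stand-in for 'fit on train_idx, evaluate on test_idx'."""
    return (-abs(params["lr.regParam"] - 0.1)
            - abs(params["hashingTF.numFeatures"] - 1000) / 1e4)


best_params, best_score = grid_search(param_grid, n_rows=10, k=3, fit_and_score=toy_score)
print(len(expand_grid(param_grid)))  # 3 x 3 combinations
print(best_params)
```

With k folds and 9 combinations the search fits 9 × k models, which is why the grid on the slide is kept small.
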
Editor's Notes

  1. Contributions plot from: https://databricks.com/blog/2015/03/31/spark-turns-five-years-old.html. Daytona GraySort contest (100TB sort) (blog).
  2. For those coming from Hadoop, this is a huge improvement: simpler code, runs on a laptop and on a huge cluster, very efficient. Can you spot the bug in the code using the RDD API?
  3. Contributions estimated from github commit logs, with some effort to de-duplicate entities.
  4. Dataset source: http://kdd.ics.uci.edu/databases/20newsgroups/20newsgroups.html. Data from UCI KDD Archive, originally donated to the archive by Tom Mitchell (CMU).
  5. No time to mention: user-defined functions (UDFs); optimizations: code gen, predicate pushdown.