Spark DataFrames
and ML Pipelines
Joseph K. Bradley
May 1, 2015
MLconf Seattle
Who am I?
Joseph K. Bradley
Ph.D. in ML from CMU, postdoc at Berkeley
Apache Spark committer
Software Engineer @ Databricks Inc.
2
Databricks Inc.
3
Founded by the creators of Spark
& driving its development
Databricks Cloud: the best place to run Spark
Guess what…we’re hiring!
databricks.com/company/careers
4
Concise APIs in Python, Java, Scala
… and R in Spark 1.4!
500+ enterprises using or planning
to use Spark in production (blog)
Spark
SparkSQL Streaming MLlib GraphX
Distributed computing engine
• Built for speed, ease of use,
and sophisticated analytics
• Apache open source
Beyond Hadoop
5
Early adopters (Data) Engineers
MapReduce &
functional API
Data Scientists
& Statisticians
Spark for Data Science
DataFrames
Intuitive manipulation of distributed structured data
6
Machine Learning Pipelines
Simple construction and tuning of ML workflows
Google Trends for “dataframe”
7
DataFrames
8
dept age name
Bio 48 H Smith
CS 54 A Turing
Bio 43 B Jones
Chem 61 M Kennedy
RDD API
DataFrame API
Data grouped into
named columns
DataFrames
9
dept age name
Bio 48 H Smith
CS 54 A Turing
Bio 43 B Jones
Chem 61 M Kennedy
Data grouped into
named columns
DSL for common tasks
• Project, filter, aggregate, join, …
• Metadata
• UDFs
Spark DataFrames
10
API inspired by R and Python Pandas
• Python, Scala, Java (+ R in dev)
• Pandas integration
Distributed DataFrame
Highly optimized
11
0 2 4 6 8 10
RDD Scala
RDD Python
Spark Scala DF
Spark Python DF
Runtime of aggregating 10 million int pairs (secs)
Spark DataFrames are fast
better
Uses SparkSQL
Catalyst optimizer
12
Demo: DataFrames
Spark for Data Science
DataFrames
• Structured data
• Familiar API based on R & Python Pandas
• Distributed, optimized implementation
13
Machine Learning Pipelines
Simple construction and tuning of ML workflows
About Spark MLlib
Started @ Berkeley
• Spark 0.8
Now (Spark 1.3)
• Contributions from 50+ orgs, 100+ individuals
• Growing coverage of distributed algorithms
Spark
SparkSQL Streaming MLlib GraphX
14
About Spark MLlib
Classification
• Logistic regression
• Naive Bayes
• Streaming logistic regression
• Linear SVMs
• Decision trees
• Random forests
• Gradient-boosted trees
Regression
• Ordinary least squares
• Ridge regression
• Lasso
• Isotonic regression
• Decision trees
• Random forests
• Gradient-boosted trees
• Streaming linear methods
15
Statistics
• Pearson correlation
• Spearman correlation
• Online summarization
• Chi-squared test
• Kernel density estimation
Linear algebra
• Local dense & sparse vectors &
matrices
• Distributed matrices
• Block-partitioned matrix
• Row matrix
• Indexed row matrix
• Coordinate matrix
• Matrix decompositions
Frequent itemsets
• FP-growth
Model import/export
Clustering
• Gaussian mixture models
• K-Means
• Streaming K-Means
• Latent Dirichlet Allocation
• Power Iteration Clustering
Recommendation
• Alternating Least Squares
Feature extraction & selection
• Word2Vec
• Chi-Squared selection
• Hashing term frequency
• Inverse document frequency
• Normalizer
• Standard scaler
• Tokenizer
ML Workflows are complex
16
Image
classification
pipeline*
* Evan Sparks. “ML Pipelines.”
amplab.cs.berkeley.edu/ml-pipelines
 Specify pipeline
 Inspect & debug
 Re-run on new data
 Tune parameters
Example: Text Classification
17
Goal: Given a text document, predict its
topic.
Subject: Re: Lexan Polish?
Suggest McQuires #1 plastic
polish. It will help somewhat
but nothing will remove deep
scratches without making it
worse than it already is.
McQuires will do something...
1: about science
0: not about science
LabelFeatures
Dataset: “20 Newsgroups”
From UCI KDD Archive
ML Workflow
18
Train model
Evaluate
Load data
Extract features
Load Data
19
Train model
Evaluate
Load data
Extract features
built-in external
{ JSON }
JDBC
and more …
Data sources for DataFrames
Load Data
20
Train model
Evaluate
Load data
Extract features
label: Int
text: String
Current data schema
Extract Features
21
Train model
Evaluate
Load data
Extract features
label: Int
text: String
Current data schema
Extract Features
22
Train model
Evaluate
Load data
label: Int
text: String
Current data schema
Tokenizer
Hashed Term Freq.
features: Vector
words: Seq[String]
Transformer
Train a Model
23
Logistic Regression
Evaluate
label: Int
text: String
Current data schema
Tokenizer
Hashed Term Freq.
features: Vector
words: Seq[String]
prediction: Int
Estimator
Load data
Transformer
Evaluate the Model
24
Logistic Regression
Evaluate
label: Int
text: String
Current data schema
Tokenizer
Hashed Term Freq.
features: Vector
words: Seq[String]
prediction: Int
Load data
Transformer
Evaluator
Estimator
By default, always
append new columns
 Can go back & inspect
intermediate results
 Made efficient by
DataFrame
optimizations
ML Pipelines
25
Logistic Regression
Evaluate
Tokenizer
Hashed Term Freq.
Load data
Pipeline
Test data
Logistic Regression
Tokenizer
Hashed Term Freq.
Evaluate
Re-run exactly
the same way
Parameter Tuning
26
Logistic Regression
Evaluate
Tokenizer
Hashed Term Freq.
lr.regParam
{0.01, 0.1, 0.5}
hashingTF.numFeatures
{100, 1000, 10000} Given:
• Estimator
• Parameter grid
• Evaluator
Find best parameters
CrossValidator
27
Demo: ML Pipelines
Recap
DataFrames
• Structured data
• Familiar API based on R & Python
Pandas
• Distributed, optimized
implementation
Machine Learning Pipelines
• Integration with DataFrames
• Familiar API based on scikit-learn
• Simple parameter tuning 28
Composable & DAG Pipelines
Schema validation
User-defined Transformers
& Estimators
Looking Ahead
Collaborations with UC Berkeley & others
• Auto-tuning models
29
DataFrames
• Further
optimization
• API for R
ML Pipelines
• More algorithms &
pluggability
• API for R
Thank you!
Spark documentation
spark.apache.org
Pipelines blog post
databricks.com/blog/2015/01/07
DataFrames blog post
databricks.com/blog/2015/02/17
Databricks Cloud Platform
databricks.com/product
Spark MOOCs on edX
Intro to Spark & ML with Spark
Spark Packages
spark-packages.org

Joseph Bradley, Software Engineer, Databricks Inc. at MLconf SEA - 5/01/15

  • 1.
    Spark DataFrames and MLPipelines Joseph K. Bradley May 1, 2015 MLconf Seattle
  • 2.
    Who am I? JosephK. Bradley Ph.D. in ML from CMU, postdoc at Berkeley Apache Spark committer Software Engineer @ Databricks Inc. 2
  • 3.
    Databricks Inc. 3 Founded bythe creators of Spark & driving its development Databricks Cloud: the best place to run Spark Guess what…we’re hiring! databricks.com/company/careers
  • 4.
    4 Concise APIs inPython, Java, Scala … and R in Spark 1.4! 500+ enterprises using or planning to use Spark in production (blog) Spark SparkSQL Streaming MLlib GraphX Distributed computing engine • Built for speed, ease of use, and sophisticated analytics • Apache open source
  • 5.
    Beyond Hadoop 5 Early adopters(Data) Engineers MapReduce & functional API Data Scientists & Statisticians
  • 6.
    Spark for DataScience DataFrames Intuitive manipulation of distributed structured data 6 Machine Learning Pipelines Simple construction and tuning of ML workflows
  • 7.
    Google Trends for“dataframe” 7
  • 8.
    DataFrames 8 dept age name Bio48 H Smith CS 54 A Turing Bio 43 B Jones Chem 61 M Kennedy RDD API DataFrame API Data grouped into named columns
  • 9.
    DataFrames 9 dept age name Bio48 H Smith CS 54 A Turing Bio 43 B Jones Chem 61 M Kennedy Data grouped into named columns DSL for common tasks • Project, filter, aggregate, join, … • Metadata • UDFs
  • 10.
    Spark DataFrames 10 API inspiredby R and Python Pandas • Python, Scala, Java (+ R in dev) • Pandas integration Distributed DataFrame Highly optimized
  • 11.
    11 0 2 46 8 10 RDD Scala RDD Python Spark Scala DF Spark Python DF Runtime of aggregating 10 million int pairs (secs) Spark DataFrames are fast better Uses SparkSQL Catalyst optimizer
  • 12.
  • 13.
    Spark for DataScience DataFrames • Structured data • Familiar API based on R & Python Pandas • Distributed, optimized implementation 13 Machine Learning Pipelines Simple construction and tuning of ML workflows
  • 14.
    About Spark MLlib Started@ Berkeley • Spark 0.8 Now (Spark 1.3) • Contributions from 50+ orgs, 100+ individuals • Growing coverage of distributed algorithms Spark SparkSQL Streaming MLlib GraphX 14
  • 15.
    About Spark MLlib Classification •Logistic regression • Naive Bayes • Streaming logistic regression • Linear SVMs • Decision trees • Random forests • Gradient-boosted trees Regression • Ordinary least squares • Ridge regression • Lasso • Isotonic regression • Decision trees • Random forests • Gradient-boosted trees • Streaming linear methods 15 Statistics • Pearson correlation • Spearman correlation • Online summarization • Chi-squared test • Kernel density estimation Linear algebra • Local dense & sparse vectors & matrices • Distributed matrices • Block-partitioned matrix • Row matrix • Indexed row matrix • Coordinate matrix • Matrix decompositions Frequent itemsets • FP-growth Model import/export Clustering • Gaussian mixture models • K-Means • Streaming K-Means • Latent Dirichlet Allocation • Power Iteration Clustering Recommendation • Alternating Least Squares Feature extraction & selection • Word2Vec • Chi-Squared selection • Hashing term frequency • Inverse document frequency • Normalizer • Standard scaler • Tokenizer
  • 16.
    ML Workflows arecomplex 16 Image classification pipeline* * Evan Sparks. “ML Pipelines.” amplab.cs.berkeley.edu/ml-pipelines  Specify pipeline  Inspect & debug  Re-run on new data  Tune parameters
  • 17.
    Example: Text Classification 17 Goal:Given a text document, predict its topic. Subject: Re: Lexan Polish? Suggest McQuires #1 plastic polish. It will help somewhat but nothing will remove deep scratches without making it worse than it already is. McQuires will do something... 1: about science 0: not about science LabelFeatures Dataset: “20 Newsgroups” From UCI KDD Archive
  • 18.
  • 19.
    Load Data 19 Train model Evaluate Loaddata Extract features built-in external { JSON } JDBC and more … Data sources for DataFrames
  • 20.
    Load Data 20 Train model Evaluate Loaddata Extract features label: Int text: String Current data schema
  • 21.
    Extract Features 21 Train model Evaluate Loaddata Extract features label: Int text: String Current data schema
  • 22.
    Extract Features 22 Train model Evaluate Loaddata label: Int text: String Current data schema Tokenizer Hashed Term Freq. features: Vector words: Seq[String] Transformer
  • 23.
    Train a Model 23 LogisticRegression Evaluate label: Int text: String Current data schema Tokenizer Hashed Term Freq. features: Vector words: Seq[String] prediction: Int Estimator Load data Transformer
  • 24.
    Evaluate the Model 24 LogisticRegression Evaluate label: Int text: String Current data schema Tokenizer Hashed Term Freq. features: Vector words: Seq[String] prediction: Int Load data Transformer Evaluator Estimator By default, always append new columns  Can go back & inspect intermediate results  Made efficient by DataFrame optimizations
  • 25.
    ML Pipelines 25 Logistic Regression Evaluate Tokenizer HashedTerm Freq. Load data Pipeline Test data Logistic Regression Tokenizer Hashed Term Freq. Evaluate Re-run exactly the same way
  • 26.
    Parameter Tuning 26 Logistic Regression Evaluate Tokenizer HashedTerm Freq. lr.regParam {0.01, 0.1, 0.5} hashingTF.numFeatures {100, 1000, 10000} Given: • Estimator • Parameter grid • Evaluator Find best parameters CrossValidator
  • 27.
  • 28.
    Recap DataFrames • Structured data •Familiar API based on R & Python Pandas • Distributed, optimized implementation Machine Learning Pipelines • Integration with DataFrames • Familiar API based on scikit-learn • Simple parameter tuning 28 Composable & DAG Pipelines Schema validation User-defined Transformers & Estimators
  • 29.
    Looking Ahead Collaborations withUC Berkeley & others • Auto-tuning models 29 DataFrames • Further optimization • API for R ML Pipelines • More algorithms & pluggability • API for R
  • 30.
    Thank you! Spark documentation spark.apache.org Pipelinesblog post databricks.com/blog/2015/01/07 DataFrames blog post databricks.com/blog/2015/02/17 Databricks Cloud Platform databricks.com/product Spark MOOCs on edX Intro to Spark & ML with Spark Spark Packages spark-packages.org

Editor's Notes

  • #5 Contributions plot from: https://databricks.com/blog/2015/03/31/spark-turns-five-years-old.html Daytona GraySort contest (100TB sort) (blog)
  • #9 For those coming from Hadoop, this is a huge improvement: simpler code, runs on a laptop and on a huge cluster, very efficient. Can you spot the bug in the code using the RDD API?
  • #15 Contributions estimated from github commit logs, with some effort to de-duplicate entities.
  • #18 Dataset source: http://kdd.ics.uci.edu/databases/20newsgroups/20newsgroups.html *Data from UCI KDD Archive, originally donated to archive by Tom Mitchell (CMU).
  • #29 No time to mention: User-defined functions (UDFs) Optimizations: code gen, predicate pushdown