Joseph Bradley, Software Engineer, Databricks Inc. at MLconf SEA - 5/01/15

Spark DataFrames
and ML Pipelines
Joseph K. Bradley
May 1, 2015
MLconf Seattle

Who am I?
Joseph K. Bradley
Ph.D. in ML from CMU, postdoc at Berkeley
Apache Spark committer
Software Engineer @ Databricks Inc.

Databricks Inc.
Founded by the creators of Spark
& driving its development
Databricks Cloud: the best place to run Spark
Guess what…we’re hiring!
databricks.com/company/careers

Spark
Distributed computing engine
• Built for speed, ease of use, and sophisticated analytics
• Apache open source
Libraries: SparkSQL, Streaming, MLlib, GraphX
Concise APIs in Python, Java, Scala … and R in Spark 1.4!
500+ enterprises using or planning to use Spark in production (blog)

Beyond Hadoop
Early adopters (Data) Engineers
MapReduce &
functional API
Data Scientists
& Statisticians

Spark for Data Science
DataFrames
Intuitive manipulation of distributed structured data
Machine Learning Pipelines
Simple construction and tuning of ML workflows

Google Trends for “dataframe”

DataFrames
dept   age   name
Bio    48    H Smith
CS     54    A Turing
Bio    43    B Jones
Chem   61    M Kennedy
RDD API
DataFrame API
Data grouped into named columns

DataFrames
dept   age   name
Bio    48    H Smith
CS     54    A Turing
Bio    43    B Jones
Chem   61    M Kennedy
Data grouped into named columns
DSL for common tasks
• Project, filter, aggregate, join, …
• Metadata
• UDFs
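
To make the DSL operations concrete, here is a minimal pure-Python sketch of project, filter, and aggregate applied to the example table on this slide. It uses plain lists of dicts rather than Spark, and the helper names (`project`, `where`, `mean_by`) are hypothetical stand-ins, not Spark's DataFrame API.

```python
from collections import defaultdict

# The example table from the slide: each row is a dict keyed by column name.
rows = [
    {"dept": "Bio", "age": 48, "name": "H Smith"},
    {"dept": "CS", "age": 54, "name": "A Turing"},
    {"dept": "Bio", "age": 43, "name": "B Jones"},
    {"dept": "Chem", "age": 61, "name": "M Kennedy"},
]


def project(rows, columns):
    """Project: keep only the named columns (hypothetical helper)."""
    return [{c: r[c] for c in columns} for r in rows]


def where(rows, predicate):
    """Filter: keep only the rows for which the predicate holds (hypothetical helper)."""
    return [r for r in rows if predicate(r)]


def mean_by(rows, key, value):
    """Aggregate: mean of one column, grouped by another (hypothetical helper)."""
    totals = defaultdict(lambda: [0, 0])  # group -> [sum, count]
    for r in rows:
        totals[r[key]][0] += r[value]
        totals[r[key]][1] += 1
    return {group: s / n for group, (s, n) in totals.items()}


names_over_45 = project(where(rows, lambda r: r["age"] > 45), ["name"])
mean_age_by_dept = mean_by(rows, "dept", "age")
```

Because every operation refers to columns by name, a system like the one described in the slides can inspect and optimize the whole query before running it, which this eager sketch does not attempt.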

Spark DataFrames
API inspired by R and Python Pandas
• Python, Scala, Java (+ R in dev)
• Pandas integration
Distributed DataFrame
Highly optimized

Spark DataFrames are fast
[Bar chart: runtime of aggregating 10 million int pairs (secs), axis from 0 to 10; bars for RDD Scala, RDD Python, Spark Scala DF, Spark Python DF; lower is better]
Uses SparkSQL Catalyst optimizer

Spark for Data Science
DataFrames
• Structured data
• Familiar API based on R & Python Pandas
• Distributed, optimized implementation
Machine Learning Pipelines
• Simple construction and tuning of ML workflows

About Spark MLlib
Started @ Berkeley
• Spark 0.8
Now (Spark 1.3)
• Contributions from 50+ orgs, 100+ individuals
• Growing coverage of distributed algorithms
Spark
SparkSQL Streaming MLlib GraphX

About Spark MLlib
Classification
• Logistic regression
• Naive Bayes
• Streaming logistic regression
• Linear SVMs
• Decision trees
• Random forests
• Gradient-boosted trees
Regression
• Ordinary least squares
• Ridge regression
• Lasso
• Isotonic regression
• Decision trees
• Random forests
• Gradient-boosted trees
• Streaming linear methods
Statistics
• Pearson correlation
• Spearman correlation
• Online summarization
• Chi-squared test
• Kernel density estimation
Linear algebra
• Local dense & sparse vectors & matrices
• Distributed matrices
  • Block-partitioned matrix
  • Row matrix
  • Indexed row matrix
  • Coordinate matrix
• Matrix decompositions
Frequent itemsets
• FP-growth
Model import/export
Clustering
• Gaussian mixture models
• K-Means
• Streaming K-Means
• Latent Dirichlet Allocation
• Power Iteration Clustering
Recommendation
• Alternating Least Squares
Feature extraction & selection
• Word2Vec
• Chi-Squared selection
• Hashing term frequency
• Inverse document frequency
• Normalizer
• Standard scaler
• Tokenizer

ML Workflows are complex
Image classification pipeline*
* Evan Sparks. “ML Pipelines.” amplab.cs.berkeley.edu/ml-pipelines
• Specify pipeline
• Inspect & debug
• Re-run on new data
• Tune parameters

Example: Text Classification
Goal: Given a text document, predict its topic.
Features:
Subject: Re: Lexan Polish?
Suggest McQuires #1 plastic
polish. It will help somewhat
but nothing will remove deep
scratches without making it
worse than it already is.
McQuires will do something...
Label:
1: about science
0: not about science
Dataset: “20 Newsgroups” from UCI KDD Archive

ML Workflow
Load data → Extract features → Train model → Evaluate

Load Data
Load data → Extract features → Train model → Evaluate
Data sources for DataFrames
• Built-in and external: { JSON }, JDBC, and more …

Load Data
Load data → Extract features → Train model → Evaluate
Current data schema
label: Int
text: String

Extract Features
Load data → Extract features → Train model → Evaluate
Current data schema
label: Int
text: String

Extract Features
Load data → Tokenizer → Hashed Term Freq. → Train model → Evaluate
Transformer: Tokenizer, Hashed Term Freq.
Current data schema
label: Int
text: String
words: Seq[String]
features: Vector
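
The two feature-extraction steps on this slide can be sketched in plain Python. This is only an illustration under stated assumptions: it splits on whitespace and buckets words with `zlib.crc32`, whereas Spark's own Tokenizer and hashed term frequency implementation may tokenize and hash differently. The function names `tokenize` and `hashed_term_freq` are hypothetical.

```python
import zlib


def tokenize(text):
    """text -> words: lowercase and split on whitespace (assumed tokenization rule)."""
    return text.lower().split()


def hashed_term_freq(words, num_features):
    """words -> features: count words per hash bucket.

    crc32 is an assumed stand-in hash, chosen because it is deterministic
    across runs (Python's built-in hash() for strings is not).
    """
    features = [0] * num_features
    for word in words:
        index = zlib.crc32(word.encode("utf-8")) % num_features
        features[index] += 1
    return features


# Each step appends a new column to the row and leaves earlier columns intact.
row = {"label": 1, "text": "Suggest McQuires #1 plastic polish"}
row["words"] = tokenize(row["text"])
row["features"] = hashed_term_freq(row["words"], num_features=16)
```

Hashing gives a fixed-length vector without building a vocabulary first; the cost is that distinct words can collide in the same bucket, which is why the number of features is worth tuning.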

Train a Model
Load data → Tokenizer → Hashed Term Freq. → Logistic Regression → Evaluate
Transformer: Tokenizer, Hashed Term Freq.
Estimator: Logistic Regression
Current data schema
label: Int
text: String
words: Seq[String]
features: Vector
prediction: Int

Evaluate the Model
Load data → Tokenizer → Hashed Term Freq. → Logistic Regression → Evaluate
Transformer: Tokenizer, Hashed Term Freq.
Estimator: Logistic Regression
Evaluator: Evaluate
Current data schema
label: Int
text: String
words: Seq[String]
features: Vector
prediction: Int
By default, always append new columns
→ Can go back & inspect intermediate results
→ Made efficient by DataFrame optimizations

ML Pipelines
Load data → Pipeline (Tokenizer → Hashed Term Freq. → Logistic Regression) → Evaluate
Test data → Tokenizer → Hashed Term Freq. → Logistic Regression → Evaluate
Re-run exactly the same way
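
The Transformer, Estimator, and Pipeline roles from these slides can be sketched in a few lines of plain Python. This is a toy under stated assumptions, not Spark code: rows are dicts, and `KeywordEstimator` / `KeywordModel` are hypothetical stand-ins for logistic regression, chosen only to keep the example short.

```python
class Tokenizer:
    """Transformer: appends a 'words' column computed from 'text'."""

    def transform(self, rows):
        return [dict(r, words=r["text"].lower().split()) for r in rows]


class KeywordModel:
    """Fitted model, itself a Transformer: appends a 'prediction' column."""

    def __init__(self, keywords):
        self.keywords = keywords

    def transform(self, rows):
        return [
            dict(r, prediction=int(any(w in self.keywords for w in r["words"])))
            for r in rows
        ]


class KeywordEstimator:
    """Estimator (hypothetical stand-in for logistic regression): fit() returns a model."""

    def fit(self, rows):
        positive, negative = set(), set()
        for r in rows:
            (positive if r["label"] == 1 else negative).update(r["words"])
        # Keep only words seen exclusively in positive examples.
        return KeywordModel(positive - negative)


class PipelineModel:
    """A fitted pipeline: applies every fitted stage in order."""

    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class Pipeline:
    """Chains stages; fitting replaces each Estimator with the model it produces."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):
                stage = stage.fit(rows)  # Estimator -> fitted Transformer
            rows = stage.transform(rows)
            fitted.append(stage)
        return PipelineModel(fitted)


train = [
    {"label": 1, "text": "plastic polish removes scratches"},
    {"label": 0, "text": "the game went to overtime"},
]
test = [{"text": "which polish for deep scratches"}]

model = Pipeline([Tokenizer(), KeywordEstimator()]).fit(train)
predictions = model.transform(test)  # same stages, re-run on new data
```

Because the fitted pipeline carries its feature-extraction stages with it, test data goes through exactly the same steps as training data, and each intermediate column stays available for inspection.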

Parameter Tuning
Tokenizer → Hashed Term Freq. → Logistic Regression → Evaluate
hashingTF.numFeatures: {100, 1000, 10000}
lr.regParam: {0.01, 0.1, 0.5}
CrossValidator
Given:
• Estimator
• Parameter grid
• Evaluator
Find best parameters
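
The idea behind this slide (given an estimator, a parameter grid, and an evaluator, find the best parameters) can be sketched as a small k-fold grid search. The code below is a pure-Python illustration, not Spark's CrossValidator: `fit_ridge`, `neg_mse`, and `cross_validate` are hypothetical names, and a one-dimensional ridge regression on noise-free toy data stands in for the text classifier.

```python
from itertools import product


def fit_ridge(train, reg_param):
    """Toy estimator: 1-D ridge regression without intercept; returns the weight."""
    sxy = sum(x * y for x, y in train)
    sxx = sum(x * x for x, _ in train)
    return sxy / (sxx + reg_param)


def neg_mse(weight, held_out):
    """Toy evaluator: negative mean squared error, so higher is better."""
    return -sum((y - weight * x) ** 2 for x, y in held_out) / len(held_out)


def cross_validate(fit, evaluate, param_grid, data, num_folds=3):
    """Try every parameter combination and keep the best average fold score."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        scores = []
        for fold in range(num_folds):
            # Every num_folds-th point is held out; the rest is training data.
            held_out = data[fold::num_folds]
            train = [p for i, p in enumerate(data) if i % num_folds != fold]
            scores.append(evaluate(fit(train, **params), held_out))
        score = sum(scores) / num_folds
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Toy data with y = 2x and no noise, so less regularization fits better.
data = [(float(x), 2.0 * x) for x in range(1, 10)]
best_params, best_score = cross_validate(
    fit_ridge, neg_mse, {"reg_param": [0.01, 0.1, 0.5]}, data
)
```

The grid here has a single parameter for brevity; adding a second key to the dict, as with the two parameters on the slide, makes the search cover every combination of values.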

Recap
DataFrames
• Structured data
• Familiar API based on R & Python Pandas
• Distributed, optimized implementation
ML Pipelines
• Integration with DataFrames
• Familiar API based on scikit-learn
• Simple parameter tuning
Composable & DAG Pipelines
Schema validation
User-defined Transformers & Estimators

Looking Ahead
Collaborations with UC Berkeley & others
• Auto-tuning models
DataFrames
• Further optimization
• API for R
ML Pipelines
• More algorithms & pluggability
• API for R

Thank you!
Spark documentation: spark.apache.org
Pipelines blog post: databricks.com/blog/2015/01/07
DataFrames blog post: databricks.com/blog/2015/02/17
Databricks Cloud Platform: databricks.com/product
Spark MOOCs on edX: Intro to Spark & ML with Spark
Spark Packages: spark-packages.org
