Spark DataFrames and ML Pipelines: In this talk, we will discuss two recent efforts in Spark to scale up data science: distributed DataFrames and Machine Learning Pipelines. These components allow users to manipulate distributed datasets and handle complex ML workflows, using intuitive APIs in Python, Java, and Scala (and R in development).
Data frames in R and Python have become standards for data science, yet they do not work well with Big Data. Inspired by R and Pandas, Spark DataFrames provide concise, powerful interfaces for structured data manipulation. DataFrames support rich data types, a variety of data sources and storage systems, and state-of-the-art optimization via the Spark SQL Catalyst optimizer.
On top of DataFrames, we have built a new ML Pipeline API. ML workflows often involve a complex sequence of processing and learning stages, including data cleaning, feature extraction and transformation, training, and hyperparameter tuning. With most current tools for ML, it is difficult to set up practical pipelines. Inspired by scikit-learn, we built simple APIs to help users quickly assemble and tune practical ML pipelines.
2. Who am I?
Joseph K. Bradley
Ph.D. in ML from CMU, postdoc at Berkeley
Apache Spark committer
Software Engineer @ Databricks Inc.
3. Databricks Inc.
Founded by the creators of Spark & driving its development
Databricks Cloud: the best place to run Spark
Guess what…we’re hiring!
databricks.com/company/careers
4. Spark
Distributed computing engine
• Built for speed, ease of use, and sophisticated analytics
• Apache open source
SparkSQL | Streaming | MLlib | GraphX
Concise APIs in Python, Java, Scala
… and R in Spark 1.4!
500+ enterprises using or planning to use Spark in production (blog)
6. Spark for Data Science
DataFrames
Intuitive manipulation of distributed structured data
Machine Learning Pipelines
Simple construction and tuning of ML workflows
8. DataFrames
dept  age  name
Bio   48   H Smith
CS    54   A Turing
Bio   43   B Jones
Chem  61   M Kennedy
RDD API
DataFrame API
Data grouped into named columns
9. DataFrames
dept  age  name
Bio   48   H Smith
CS    54   A Turing
Bio   43   B Jones
Chem  61   M Kennedy
Data grouped into named columns
DSL for common tasks
• Project, filter, aggregate, join, …
• Metadata
• UDFs
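To make the filter, project, and aggregate operations concrete, here is a minimal sketch in plain Python over the sample table from the slide. It is not Spark code: the rows are ordinary dictionaries, and the variable names (`over_45`, `avg_age`) are illustrative only. A DataFrame expresses the same steps declaratively and runs them distributed.

```python
from collections import defaultdict

# The sample table from the slide, one dict per row (named columns).
rows = [
    {"dept": "Bio", "age": 48, "name": "H Smith"},
    {"dept": "CS", "age": 54, "name": "A Turing"},
    {"dept": "Bio", "age": 43, "name": "B Jones"},
    {"dept": "Chem", "age": 61, "name": "M Kennedy"},
]

# Filter: keep rows where age > 45.
over_45 = [r for r in rows if r["age"] > 45]

# Project: keep only the "name" column of the filtered rows.
names = [r["name"] for r in over_45]

# Aggregate: average age per department (running sum and count per key).
totals = defaultdict(lambda: [0, 0])
for r in rows:
    totals[r["dept"]][0] += r["age"]
    totals[r["dept"]][1] += 1
avg_age = {dept: s / n for dept, (s, n) in totals.items()}

print(names)
print(avg_age)
```

In the PySpark DataFrame API the aggregation would be written roughly as `df.groupBy("dept").avg("age")`, leaving the execution plan to the optimizer.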
10. Spark DataFrames
API inspired by R and Python Pandas
• Python, Scala, Java (+ R in dev)
• Pandas integration
Distributed DataFrame
Highly optimized
11. Spark DataFrames are fast
[Bar chart: runtime of aggregating 10 million int pairs (secs), axis from 0 to 10, lower is better. Bars: RDD Scala, RDD Python, Spark Scala DF, Spark Python DF.]
Uses SparkSQL Catalyst optimizer
13. Spark for Data Science
DataFrames
• Structured data
• Familiar API based on R & Python Pandas
• Distributed, optimized implementation
Machine Learning Pipelines
Simple construction and tuning of ML workflows
14. About Spark MLlib
Started @ Berkeley
• Spark 0.8
Now (Spark 1.3)
• Contributions from 50+ orgs, 100+ individuals
• Growing coverage of distributed algorithms
Spark
SparkSQL Streaming MLlib GraphX
15. About Spark MLlib
Classification
• Logistic regression
• Naive Bayes
• Streaming logistic regression
• Linear SVMs
• Decision trees
• Random forests
• Gradient-boosted trees
Regression
• Ordinary least squares
• Ridge regression
• Lasso
• Isotonic regression
• Decision trees
• Random forests
• Gradient-boosted trees
• Streaming linear methods
Statistics
• Pearson correlation
• Spearman correlation
• Online summarization
• Chi-squared test
• Kernel density estimation
Linear algebra
• Local dense & sparse vectors & matrices
• Distributed matrices
  • Block-partitioned matrix
  • Row matrix
  • Indexed row matrix
  • Coordinate matrix
• Matrix decompositions
Frequent itemsets
• FP-growth
Model import/export
Clustering
• Gaussian mixture models
• K-Means
• Streaming K-Means
• Latent Dirichlet Allocation
• Power Iteration Clustering
Recommendation
• Alternating Least Squares
Feature extraction & selection
• Word2Vec
• Chi-Squared selection
• Hashing term frequency
• Inverse document frequency
• Normalizer
• Standard scaler
• Tokenizer
16. ML Workflows are complex
Image classification pipeline*
* Evan Sparks. “ML Pipelines.” amplab.cs.berkeley.edu/ml-pipelines
Specify pipeline
Inspect & debug
Re-run on new data
Tune parameters
17. Example: Text Classification
Goal: Given a text document, predict its topic.
Subject: Re: Lexan Polish?
Suggest McQuires #1 plastic
polish. It will help somewhat
but nothing will remove deep
scratches without making it
worse than it already is.
McQuires will do something...
Features: the text of the document
Label: 1 (about science) or 0 (not about science)
Dataset: “20 Newsgroups” from the UCI KDD Archive
23. Train a Model
Pipeline: Load data → Tokenizer → Hashed Term Freq. → Logistic Regression → Evaluate
Current data schema
• label: Int
• text: String
• words: Seq[String]
• features: Vector
• prediction: Int
Stage types
• Transformer: Tokenizer, Hashed Term Freq.
• Estimator: Logistic Regression
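The two feature steps in this pipeline (text to words, words to a fixed-length vector) can be sketched in plain Python. This is an illustration of the hashing idea only, not Spark's implementation: the function names, the CRC32 hash, and the vector size of 16 are all assumptions chosen to keep the example small and deterministic.

```python
import zlib

def tokenize(text):
    """Split text into lowercase words (text: String -> words: Seq[String])."""
    return text.lower().split()

def hashed_term_frequency(words, num_features=16):
    """Map words to a fixed-length count vector via the hashing trick.

    Each word is hashed to a bucket index; the bucket counts occurrences.
    Different words may collide in the same bucket.
    """
    vec = [0] * num_features
    for w in words:
        # CRC32 is used here only because it is stable across runs;
        # it is not the hash function Spark uses.
        idx = zlib.crc32(w.encode("utf-8")) % num_features
        vec[idx] += 1
    return vec

words = tokenize("Polish will help but polish will not remove scratches")
features = hashed_term_frequency(words)
print(words)
print(features)
```

No vocabulary has to be built or shared across machines, which is why hashing suits distributed data; the cost is occasional collisions between unrelated words.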
24. Evaluate the Model
Pipeline: Load data → Tokenizer → Hashed Term Freq. → Logistic Regression → Evaluate
Current data schema
• label: Int
• text: String
• words: Seq[String]
• features: Vector
• prediction: Int
Stage types
• Transformer: Tokenizer, Hashed Term Freq.
• Estimator: Logistic Regression
• Evaluator: Evaluate
By default, always append new columns
• Can go back & inspect intermediate results
• Made efficient by DataFrame optimizations
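The Transformer/Estimator split and the "always append new columns" convention can be sketched in a few lines of plain Python. This is a toy model of the design, not the Spark ML API: rows are dictionaries, and `WordVoteEstimator` is a hypothetical stand-in for logistic regression that simply counts which words co-occur with each label.

```python
class Tokenizer:
    """Transformer: appends a 'words' column derived from 'text'."""
    def transform(self, rows):
        return [{**r, "words": r["text"].lower().split()} for r in rows]

class WordVoteModel:
    """Fitted model, itself a Transformer: appends a 'prediction' column."""
    def __init__(self, scores):
        self.scores = scores
    def transform(self, rows):
        out = []
        for r in rows:
            total = sum(self.scores.get(w, 0) for w in r["words"])
            out.append({**r, "prediction": 1 if total > 0 else 0})
        return out

class WordVoteEstimator:
    """Estimator: fit() learns from labeled rows and returns a model."""
    def fit(self, rows):
        scores = {}
        for r in rows:
            vote = 1 if r["label"] == 1 else -1
            for w in set(r["words"]):
                scores[w] = scores.get(w, 0) + vote
        return WordVoteModel(scores)

class PipelineModel:
    """Chain of fitted Transformers applied in order."""
    def __init__(self, stages):
        self.stages = stages
    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows

class Pipeline:
    """Fits Estimators in sequence, passing transformed rows downstream."""
    def __init__(self, stages):
        self.stages = stages
    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):
                stage = stage.fit(rows)  # Estimator -> fitted Transformer
            rows = stage.transform(rows)
            fitted.append(stage)
        return PipelineModel(fitted)

train = [
    {"label": 1, "text": "plastic polish removes scratches"},
    {"label": 0, "text": "the game went to overtime"},
]
model = Pipeline([Tokenizer(), WordVoteEstimator()]).fit(train)
result = model.transform([{"text": "polish for deep scratches"}])
print(result[0])
```

Because every stage returns the input columns plus its own, the intermediate `words` column is still present in the final rows and can be inspected after prediction.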
28. Recap
DataFrames
• Structured data
• Familiar API based on R & Python Pandas
• Distributed, optimized implementation
Machine Learning Pipelines
• Integration with DataFrames
• Familiar API based on scikit-learn
• Simple parameter tuning
Composable & DAG Pipelines
Schema validation
User-defined Transformers & Estimators
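Parameter tuning over a pipeline amounts to trying each combination in a grid and keeping the best-scoring one. The sketch below shows that loop in plain Python. The parameter names (`num_features`, `reg_param`) and the `toy_score` function are hypothetical; in practice the score would come from fitting the pipeline and evaluating it on held-out data.

```python
from itertools import product

def param_grid(grid):
    """Expand {name: [values]} into a list of {name: value} combinations."""
    names = sorted(grid)
    return [dict(zip(names, values))
            for values in product(*(grid[n] for n in names))]

def grid_search(grid, score_fn):
    """Return the (params, score) pair with the highest score."""
    best_params, best_score = None, float("-inf")
    for params in param_grid(grid):
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

def toy_score(params):
    # Hypothetical stand-in for "fit the pipeline, then evaluate it".
    # It peaks at num_features=1000 and reg_param=0.1.
    return (-abs(params["num_features"] - 1000)
            - 100 * abs(params["reg_param"] - 0.1))

grid = {"num_features": [10, 100, 1000], "reg_param": [0.01, 0.1, 1.0]}
best_params, best_score = grid_search(grid, toy_score)
print(best_params, best_score)
```

Treating the whole pipeline as the unit being tuned lets feature parameters and model parameters be searched together.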
29. Looking Ahead
Collaborations with UC Berkeley & others
• Auto-tuning models
DataFrames
• Further optimization
• API for R
ML Pipelines
• More algorithms & pluggability
• API for R
30. Thank you!
Spark documentation
spark.apache.org
Pipelines blog post
databricks.com/blog/2015/01/07
DataFrames blog post
databricks.com/blog/2015/02/17
Databricks Cloud Platform
databricks.com/product
Spark MOOCs on edX
Intro to Spark & ML with Spark
Spark Packages
spark-packages.org
For those coming from Hadoop, this is a huge improvement: simpler code, runs on a laptop and on a huge cluster, very efficient.
Can you spot the bug in the code using the RDD API?
Contributions estimated from github commit logs, with some effort to de-duplicate entities.
Dataset source: http://kdd.ics.uci.edu/databases/20newsgroups/20newsgroups.html
*Data from UCI KDD Archive, originally donated to archive by Tom Mitchell (CMU).
No time to mention:
• User-defined functions (UDFs)
• Optimizations: code gen, predicate pushdown