Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.
Spark DataFrames
and ML Pipelines
Joseph K. Bradley
May 1, 2015
MLconf Seattle
Who am I?
Joseph K. Bradley
Ph.D. in ML from CMU, postdoc at Berkeley
Apache Spark committer
Software Engineer @ Databrick...
Databricks Inc.
3
Founded by the creators of Spark
& driving its development
Databricks Cloud: the best place to run Spark...
4
Concise APIs in Python, Java, Scala
… and R in Spark 1.4!
500+ enterprises using or planning
to use Spark in production ...
Beyond Hadoop
5
Early adopters (Data) Engineers
MapReduce &
functional API
Data Scientists
& Statisticians
Spark for Data Science
DataFrames
Intuitive manipulation of distributed structured data
6
Machine Learning Pipelines
Simpl...
Google Trends for “dataframe”
7
DataFrames
8
dept age name
Bio 48 H Smith
CS 54 A Turing
Bio 43 B Jones
Chem 61 M Kennedy
RDD API
DataFrame API
Data group...
DataFrames
9
dept age name
Bio 48 H Smith
CS 54 A Turing
Bio 43 B Jones
Chem 61 M Kennedy
Data grouped into
named columns
...
Spark DataFrames
10
API inspired by R and Python Pandas
• Python, Scala, Java (+ R in dev)
• Pandas integration
Distribute...
11
0 2 4 6 8 10
RDD Scala
RDD Python
Spark Scala DF
Spark Python DF
Runtime of aggregating 10 million int pairs (secs)
Spa...
12
Demo: DataFrames
Spark for Data Science
DataFrames
• Structured data
• Familiar API based on R & Python Pandas
• Distributed, optimized imp...
About Spark MLlib
Started @ Berkeley
• Spark 0.8
Now (Spark 1.3)
• Contributions from 50+ orgs, 100+ individuals
• Growing...
About Spark MLlib
Classification
• Logistic regression
• Naive Bayes
• Streaming logistic regression
• Linear SVMs
• Decis...
ML Workflows are complex
16
Image
classification
pipeline*
* Evan Sparks. “ML Pipelines.”
amplab.cs.berkeley.edu/ml-pipeli...
Example: Text Classification
17
Goal: Given a text document, predict its
topic.
Subject: Re: Lexan Polish?
Suggest McQuire...
ML Workflow
18
Train model
Evaluate
Load data
Extract features
Load Data
19
Train model
Evaluate
Load data
Extract features
built-in external
{ JSON }
JDBC
and more …
Data sources for D...
Load Data
20
Train model
Evaluate
Load data
Extract features
label: Int
text: String
Current data schema
Extract Features
21
Train model
Evaluate
Load data
Extract features
label: Int
text: String
Current data schema
Extract Features
22
Train model
Evaluate
Load data
label: Int
text: String
Current data schema
Tokenizer
Hashed Term Freq....
Train a Model
23
Logistic Regression
Evaluate
label: Int
text: String
Current data schema
Tokenizer
Hashed Term Freq.
feat...
Evaluate the Model
24
Logistic Regression
Evaluate
label: Int
text: String
Current data schema
Tokenizer
Hashed Term Freq....
ML Pipelines
25
Logistic Regression
Evaluate
Tokenizer
Hashed Term Freq.
Load data
Pipeline
Test data
Logistic Regression
...
Parameter Tuning
26
Logistic Regression
Evaluate
Tokenizer
Hashed Term Freq.
lr.regParam
{0.01, 0.1, 0.5}
hashingTF.numFea...
27
Demo: ML Pipelines
Recap
DataFrames
• Structured data
• Familiar API based on R & Python
Pandas
• Distributed, optimized
implementation
Machi...
Looking Ahead
Collaborations with UC Berkeley & others
• Auto-tuning models
29
DataFrames
• Further
optimization
• API for...
Thank you!
Spark documentation
spark.apache.org
Pipelines blog post
databricks.com/blog/2015/01/07
DataFrames blog post
da...
Upcoming SlideShare
Loading in …5
×

Joseph Bradley, Software Engineer, Databricks Inc. at MLconf SEA - 5/01/15

1,145 views

Published on

Spark DataFrames and ML Pipelines: In this talk, we will discuss two recent efforts in Spark to scale up data science: distributed DataFrames and Machine Learning Pipelines. These components allow users to manipulate distributed datasets and handle complex ML workflows, using intuitive APIs in Python, Java, and Scala (and R in development).

Data frames in R and Python have become standards for data science, yet they do not work well with Big Data. Inspired by R and Pandas, Spark DataFrames provide concise, powerful interfaces for structured data manipulation. DataFrames support rich data types, a variety of data sources and storage systems, and state-of-the-art optimization via the Spark SQL Catalyst optimizer.

On top of DataFrames, we have built a new ML Pipeline API. ML workflows often involve a complex sequence of processing and learning stages, including data cleaning, feature extraction and transformation, training, and hyperparameter tuning. With most current tools for ML, it is difficult to set up practical pipelines. Inspired by scikit-learn, we built simple APIs to help users quickly assemble and tune practical ML pipelines.

Published in: Technology
  • Be the first to comment

Joseph Bradley, Software Engineer, Databricks Inc. at MLconf SEA - 5/01/15

  1. 1. Spark DataFrames and ML Pipelines Joseph K. Bradley May 1, 2015 MLconf Seattle
  2. 2. Who am I? Joseph K. Bradley Ph.D. in ML from CMU, postdoc at Berkeley Apache Spark committer Software Engineer @ Databricks Inc. 2
  3. 3. Databricks Inc. 3 Founded by the creators of Spark & driving its development Databricks Cloud: the best place to run Spark Guess what…we’re hiring! databricks.com/company/careers
  4. 4. 4 Concise APIs in Python, Java, Scala … and R in Spark 1.4! 500+ enterprises using or planning to use Spark in production (blog) Spark SparkSQL Streaming MLlib GraphX Distributed computing engine • Built for speed, ease of use, and sophisticated analytics • Apache open source
  5. 5. Beyond Hadoop 5 Early adopters (Data) Engineers MapReduce & functional API Data Scientists & Statisticians
  6. 6. Spark for Data Science DataFrames Intuitive manipulation of distributed structured data 6 Machine Learning Pipelines Simple construction and tuning of ML workflows
  7. 7. Google Trends for “dataframe” 7
  8. 8. DataFrames 8 dept age name Bio 48 H Smith CS 54 A Turing Bio 43 B Jones Chem 61 M Kennedy RDD API DataFrame API Data grouped into named columns
  9. 9. DataFrames 9 dept age name Bio 48 H Smith CS 54 A Turing Bio 43 B Jones Chem 61 M Kennedy Data grouped into named columns DSL for common tasks • Project, filter, aggregate, join, … • Metadata • UDFs
  10. 10. Spark DataFrames 10 API inspired by R and Python Pandas • Python, Scala, Java (+ R in dev) • Pandas integration Distributed DataFrame Highly optimized
  11. 11. 11 0 2 4 6 8 10 RDD Scala RDD Python Spark Scala DF Spark Python DF Runtime of aggregating 10 million int pairs (secs) Spark DataFrames are fast better Uses SparkSQL Catalyst optimizer
  12. 12. 12 Demo: DataFrames
  13. 13. Spark for Data Science DataFrames • Structured data • Familiar API based on R & Python Pandas • Distributed, optimized implementation 13 Machine Learning Pipelines Simple construction and tuning of ML workflows
  14. 14. About Spark MLlib Started @ Berkeley • Spark 0.8 Now (Spark 1.3) • Contributions from 50+ orgs, 100+ individuals • Growing coverage of distributed algorithms Spark SparkSQL Streaming MLlib GraphX 14
  15. 15. About Spark MLlib Classification • Logistic regression • Naive Bayes • Streaming logistic regression • Linear SVMs • Decision trees • Random forests • Gradient-boosted trees Regression • Ordinary least squares • Ridge regression • Lasso • Isotonic regression • Decision trees • Random forests • Gradient-boosted trees • Streaming linear methods 15 Statistics • Pearson correlation • Spearman correlation • Online summarization • Chi-squared test • Kernel density estimation Linear algebra • Local dense & sparse vectors & matrices • Distributed matrices • Block-partitioned matrix • Row matrix • Indexed row matrix • Coordinate matrix • Matrix decompositions Frequent itemsets • FP-growth Model import/export Clustering • Gaussian mixture models • K-Means • Streaming K-Means • Latent Dirichlet Allocation • Power Iteration Clustering Recommendation • Alternating Least Squares Feature extraction & selection • Word2Vec • Chi-Squared selection • Hashing term frequency • Inverse document frequency • Normalizer • Standard scaler • Tokenizer
  16. 16. ML Workflows are complex 16 Image classification pipeline* * Evan Sparks. “ML Pipelines.” amplab.cs.berkeley.edu/ml-pipelines  Specify pipeline  Inspect & debug  Re-run on new data  Tune parameters
  17. 17. Example: Text Classification 17 Goal: Given a text document, predict its topic. Subject: Re: Lexan Polish? Suggest McQuires #1 plastic polish. It will help somewhat but nothing will remove deep scratches without making it worse than it already is. McQuires will do something... 1: about science 0: not about science LabelFeatures Dataset: “20 Newsgroups” From UCI KDD Archive
  18. 18. ML Workflow 18 Train model Evaluate Load data Extract features
  19. 19. Load Data 19 Train model Evaluate Load data Extract features built-in external { JSON } JDBC and more … Data sources for DataFrames
  20. 20. Load Data 20 Train model Evaluate Load data Extract features label: Int text: String Current data schema
  21. 21. Extract Features 21 Train model Evaluate Load data Extract features label: Int text: String Current data schema
  22. 22. Extract Features 22 Train model Evaluate Load data label: Int text: String Current data schema Tokenizer Hashed Term Freq. features: Vector words: Seq[String] Transformer
  23. 23. Train a Model 23 Logistic Regression Evaluate label: Int text: String Current data schema Tokenizer Hashed Term Freq. features: Vector words: Seq[String] prediction: Int Estimator Load data Transformer
  24. 24. Evaluate the Model 24 Logistic Regression Evaluate label: Int text: String Current data schema Tokenizer Hashed Term Freq. features: Vector words: Seq[String] prediction: Int Load data Transformer Evaluator Estimator By default, always append new columns  Can go back & inspect intermediate results  Made efficient by DataFrame optimizations
  25. 25. ML Pipelines 25 Logistic Regression Evaluate Tokenizer Hashed Term Freq. Load data Pipeline Test data Logistic Regression Tokenizer Hashed Term Freq. Evaluate Re-run exactly the same way
  26. 26. Parameter Tuning 26 Logistic Regression Evaluate Tokenizer Hashed Term Freq. lr.regParam {0.01, 0.1, 0.5} hashingTF.numFeatures {100, 1000, 10000} Given: • Estimator • Parameter grid • Evaluator Find best parameters CrossValidator
  27. 27. 27 Demo: ML Pipelines
  28. 28. Recap DataFrames • Structured data • Familiar API based on R & Python Pandas • Distributed, optimized implementation Machine Learning Pipelines • Integration with DataFrames • Familiar API based on scikit-learn • Simple parameter tuning 28 Composable & DAG Pipelines Schema validation User-defined Transformers & Estimators
  29. 29. Looking Ahead Collaborations with UC Berkeley & others • Auto-tuning models 29 DataFrames • Further optimization • API for R ML Pipelines • More algorithms & pluggability • API for R
  30. 30. Thank you! Spark documentation spark.apache.org Pipelines blog post databricks.com/blog/2015/01/07 DataFrames blog post databricks.com/blog/2015/02/17 Databricks Cloud Platform databricks.com/product Spark MOOCs on edX Intro to Spark & ML with Spark Spark Packages spark-packages.org

×