Slideshare uses cookies to improve functionality and performance, and to provide you with relevant advertising. If you continue browsing the site, you agree to the use of cookies on this website. See our User Agreement and Privacy Policy.

Slideshare uses cookies to improve functionality and performance, and to provide you with relevant advertising. If you continue browsing the site, you agree to the use of cookies on this website. See our Privacy Policy and User Agreement for details.

Successfully reported this slideshow.

1,119 views

Published on

Data frames in R and Python have become standards for data science, yet they do not work well with Big Data. Inspired by R and Pandas, Spark DataFrames provide concise, powerful interfaces for structured data manipulation. DataFrames support rich data types, a variety of data sources and storage systems, and state-of-the-art optimization via the Spark SQL Catalyst optimizer.

On top of DataFrames, we have built a new ML Pipeline API. ML workflows often involve a complex sequence of processing and learning stages, including data cleaning, feature extraction and transformation, training, and hyperparameter tuning. With most current tools for ML, it is difficult to set up practical pipelines. Inspired by scikit-learn, we built simple APIs to help users quickly assemble and tune practical ML pipelines.

Published in:
Technology

No Downloads

Total views

1,119

On SlideShare

0

From Embeds

0

Number of Embeds

9

Shares

0

Downloads

35

Comments

0

Likes

3

No embeds

No notes for slide

Daytona GraySort contest

(100TB sort) (blog)

Can you spot the bug in the code using the RDD API?

*Data from UCI KDD Archive, originally donated to archive by Tom Mitchell (CMU).

User-defined functions (UDFs)

Optimizations: code gen, predicate pushdown

- 1. Spark DataFrames and ML Pipelines Joseph K. Bradley May 1, 2015 MLconf Seattle
- 2. Who am I? Joseph K. Bradley Ph.D. in ML from CMU, postdoc at Berkeley Apache Spark committer Software Engineer @ Databricks Inc. 2
- 3. Databricks Inc. 3 Founded by the creators of Spark & driving its development Databricks Cloud: the best place to run Spark Guess what…we’re hiring! databricks.com/company/careers
- 4. 4 Concise APIs in Python, Java, Scala … and R in Spark 1.4! 500+ enterprises using or planning to use Spark in production (blog) Spark SparkSQL Streaming MLlib GraphX Distributed computing engine • Built for speed, ease of use, and sophisticated analytics • Apache open source
- 5. Beyond Hadoop 5 Early adopters (Data) Engineers MapReduce & functional API Data Scientists & Statisticians
- 6. Spark for Data Science DataFrames Intuitive manipulation of distributed structured data 6 Machine Learning Pipelines Simple construction and tuning of ML workflows
- 7. Google Trends for “dataframe” 7
- 8. DataFrames 8 dept age name Bio 48 H Smith CS 54 A Turing Bio 43 B Jones Chem 61 M Kennedy RDD API DataFrame API Data grouped into named columns
- 9. DataFrames 9 dept age name Bio 48 H Smith CS 54 A Turing Bio 43 B Jones Chem 61 M Kennedy Data grouped into named columns DSL for common tasks • Project, filter, aggregate, join, … • Metadata • UDFs
- 10. Spark DataFrames 10 API inspired by R and Python Pandas • Python, Scala, Java (+ R in dev) • Pandas integration Distributed DataFrame Highly optimized
- 11. 11 0 2 4 6 8 10 RDD Scala RDD Python Spark Scala DF Spark Python DF Runtime of aggregating 10 million int pairs (secs) Spark DataFrames are fast better Uses SparkSQL Catalyst optimizer
- 12. 12 Demo: DataFrames
- 13. Spark for Data Science DataFrames • Structured data • Familiar API based on R & Python Pandas • Distributed, optimized implementation 13 Machine Learning Pipelines Simple construction and tuning of ML workflows
- 14. About Spark MLlib Started @ Berkeley • Spark 0.8 Now (Spark 1.3) • Contributions from 50+ orgs, 100+ individuals • Growing coverage of distributed algorithms Spark SparkSQL Streaming MLlib GraphX 14
- 15. About Spark MLlib Classification • Logistic regression • Naive Bayes • Streaming logistic regression • Linear SVMs • Decision trees • Random forests • Gradient-boosted trees Regression • Ordinary least squares • Ridge regression • Lasso • Isotonic regression • Decision trees • Random forests • Gradient-boosted trees • Streaming linear methods 15 Statistics • Pearson correlation • Spearman correlation • Online summarization • Chi-squared test • Kernel density estimation Linear algebra • Local dense & sparse vectors & matrices • Distributed matrices • Block-partitioned matrix • Row matrix • Indexed row matrix • Coordinate matrix • Matrix decompositions Frequent itemsets • FP-growth Model import/export Clustering • Gaussian mixture models • K-Means • Streaming K-Means • Latent Dirichlet Allocation • Power Iteration Clustering Recommendation • Alternating Least Squares Feature extraction & selection • Word2Vec • Chi-Squared selection • Hashing term frequency • Inverse document frequency • Normalizer • Standard scaler • Tokenizer
- 16. ML Workflows are complex 16 Image classification pipeline* * Evan Sparks. “ML Pipelines.” amplab.cs.berkeley.edu/ml-pipelines Specify pipeline Inspect & debug Re-run on new data Tune parameters
- 17. Example: Text Classification 17 Goal: Given a text document, predict its topic. Subject: Re: Lexan Polish? Suggest McQuires #1 plastic polish. It will help somewhat but nothing will remove deep scratches without making it worse than it already is. McQuires will do something... 1: about science 0: not about science LabelFeatures Dataset: “20 Newsgroups” From UCI KDD Archive
- 18. ML Workflow 18 Train model Evaluate Load data Extract features
- 19. Load Data 19 Train model Evaluate Load data Extract features built-in external { JSON } JDBC and more … Data sources for DataFrames
- 20. Load Data 20 Train model Evaluate Load data Extract features label: Int text: String Current data schema
- 21. Extract Features 21 Train model Evaluate Load data Extract features label: Int text: String Current data schema
- 22. Extract Features 22 Train model Evaluate Load data label: Int text: String Current data schema Tokenizer Hashed Term Freq. features: Vector words: Seq[String] Transformer
- 23. Train a Model 23 Logistic Regression Evaluate label: Int text: String Current data schema Tokenizer Hashed Term Freq. features: Vector words: Seq[String] prediction: Int Estimator Load data Transformer
- 24. Evaluate the Model 24 Logistic Regression Evaluate label: Int text: String Current data schema Tokenizer Hashed Term Freq. features: Vector words: Seq[String] prediction: Int Load data Transformer Evaluator Estimator By default, always append new columns Can go back & inspect intermediate results Made efficient by DataFrame optimizations
- 25. ML Pipelines 25 Logistic Regression Evaluate Tokenizer Hashed Term Freq. Load data Pipeline Test data Logistic Regression Tokenizer Hashed Term Freq. Evaluate Re-run exactly the same way
- 26. Parameter Tuning 26 Logistic Regression Evaluate Tokenizer Hashed Term Freq. lr.regParam {0.01, 0.1, 0.5} hashingTF.numFeatures {100, 1000, 10000} Given: • Estimator • Parameter grid • Evaluator Find best parameters CrossValidator
- 27. 27 Demo: ML Pipelines
- 28. Recap DataFrames • Structured data • Familiar API based on R & Python Pandas • Distributed, optimized implementation Machine Learning Pipelines • Integration with DataFrames • Familiar API based on scikit-learn • Simple parameter tuning 28 Composable & DAG Pipelines Schema validation User-defined Transformers & Estimators
- 29. Looking Ahead Collaborations with UC Berkeley & others • Auto-tuning models 29 DataFrames • Further optimization • API for R ML Pipelines • More algorithms & pluggability • API for R
- 30. Thank you! Spark documentation spark.apache.org Pipelines blog post databricks.com/blog/2015/01/07 DataFrames blog post databricks.com/blog/2015/02/17 Databricks Cloud Platform databricks.com/product Spark MOOCs on edX Intro to Spark & ML with Spark Spark Packages spark-packages.org

No public clipboards found for this slide

Be the first to comment