Scaling and Unifying SciKit Learn and Apache Spark Pipelines


Pipelines have become ubiquitous as stringing multiple functions together to compose applications has gained popularity. Common pipeline abstractions such as “fit” and “transform” are even shared across divergent platforms such as Python Scikit-Learn and Apache Spark. Scaling pipelines at the level of simple functions is desirable for many AI applications, but it is not directly supported by Ray’s parallelism primitives. In this talk, Raghu will describe a pipeline abstraction that takes advantage of Ray’s compute model to efficiently scale arbitrarily complex pipeline workflows. He will demonstrate how this abstraction cleanly unifies pipeline workflows across multiple platforms such as Scikit-Learn and Spark, and achieves nearly optimal scale-out parallelism on pipelined computations. Attendees will learn how pipelined workflows can be mapped to Ray’s compute model and how they can both unify and accelerate their pipelines with Ray.

Data & Analytics



1. Scaling and Unifying Scikit Learn and Spark Pipelines using Ray
Raghu Ganti, Principal Research Staff Member, IBM T J Watson Research Center
Team (IBM & Red Hat): Michael Behrendt, Linsong Chu, Carlos Costa, Erik Erlandson, Mudhakar Srivatsa

2. So many pipelines… And many more…

3. Ray.IO
§ Can we do pipelines on Ray?
§ Can we scale popular AI/ML pipelines on Ray?
§ Can we unify scikit learn and Spark pipelines?

4. Current pipeline API
• Focus on scikit learn and Spark pipelines
• Scikit learn is missing scaling; Spark focuses on data-parallel scaling
[Diagram: Transform maps X to X’; Fit maps X and y to a fitted model]
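The fit/transform contract on this slide can be sketched in a few lines of plain Python. The `MeanCenterer` class below is a hypothetical stand-in, not part of scikit-learn or Spark; it only mirrors the shape of the API, where `fit` takes X (and optionally y) and returns a fitted model, and `transform` maps X to X’.

```python
from statistics import mean


class MeanCenterer:
    """Hypothetical transformer that follows the fit/transform convention."""

    def fit(self, X, y=None):
        # Learn state from the data; y is accepted for API symmetry but unused.
        self.mean_ = mean(X)
        return self  # the "fitted model"

    def transform(self, X):
        # Apply the learned state to produce X'.
        return [x - self.mean_ for x in X]


X = [1.0, 2.0, 3.0]
fitted = MeanCenterer().fit(X)
X_prime = fitted.transform(X)
print(X_prime)
```

Returning `self` from `fit` is the convention that lets calls be chained, as in `MeanCenterer().fit(X).transform(X)`.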

5. Scaling Pipelines: I/O as List of Objects
[Diagram: Transform maps [X1, X2, … XN] to [X1’, X2’, …, XN’]; Fit maps [X1, X2, … XN] and [y1, y2, … yN] to [FM1, FM2, … FMN]]
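A minimal sketch of the list-in, list-out idea, assuming each list element can be processed independently. The standard library's `ThreadPoolExecutor` stands in for Ray tasks here so the example stays self-contained; `fit_one`, `transform_one`, `fit_all`, and `transform_all` are hypothetical names, and the "model" is deliberately trivial.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import mean


def fit_one(X, y):
    # Hypothetical "fit": the fitted model is just the mean of X (y unused).
    return {"mean": mean(X)}


def transform_one(model, X):
    # Hypothetical "transform": center X with the fitted mean.
    return [x - model["mean"] for x in X]


def fit_all(Xs, ys, pool):
    # [X1..XN], [y1..yN] -> [FM1..FMN], one independent task per element.
    return list(pool.map(fit_one, Xs, ys))


def transform_all(models, Xs, pool):
    # [FM1..FMN], [X1..XN] -> [X1'..XN'], again one task per element.
    return list(pool.map(transform_one, models, Xs))


Xs = [[1.0, 3.0], [10.0, 20.0, 30.0]]
ys = [[0, 1], [1, 0, 1]]
with ThreadPoolExecutor(max_workers=2) as pool:
    fitted_models = fit_all(Xs, ys, pool)
    Xs_prime = transform_all(fitted_models, Xs, pool)
print(Xs_prime)
```

Because every element of the input list maps to exactly one element of the output list, the amount of available parallelism grows with N without changing the function signatures.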

6. Scaling Pipelines: AND/OR Graphs
[Diagram, And node: inputs X1, X2, … XN; outputs X1’, X2’, … XM’]
[Diagram, Or node: input X fans out to Step1, Step2, … StepN, each producing its own X’]
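The two node types can be illustrated with plain functions. This is only a sketch under assumptions: `or_node` and `and_node` are hypothetical helpers, the steps are toy lambdas, and a thread pool again stands in for Ray tasks. The Or node sends one input to every step, while the And node hands all N inputs to one arbitrary function that returns M outputs.

```python
from concurrent.futures import ThreadPoolExecutor


def or_node(steps, X, pool):
    # Fan-out: every step receives the same input X and yields its own X'.
    futures = [pool.submit(step, X) for step in steps]
    return [f.result() for f in futures]


def and_node(combine, inputs):
    # Fan-in: an arbitrary function sees all N inputs and returns M outputs.
    return combine(inputs)


steps = [
    lambda X: [x * 2 for x in X],   # Step1
    lambda X: [x + 1 for x in X],   # Step2
    lambda X: [x * x for x in X],   # Step3
]

with ThreadPoolExecutor() as pool:
    branches = or_node(steps, [1, 2, 3], pool)

# Combine N = 3 branch outputs into M = 2 outputs: element-wise sum and max.
combined = and_node(
    lambda outs: [[sum(col) for col in zip(*outs)],
                  [max(col) for col in zip(*outs)]],
    branches,
)
print(branches)
print(combined)
```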

7. Key Features
• Function as unit of compute
▪ Python function as unit of compute
▪ Intuitive for data scientist
▪ Follows transformer APIs
• List of objects as I/O
▪ MPI-style scaling
▪ Object references as I/O for unit of compute
▪ Sharing of objects using Plasma store
▪ Enables zero-copy object sharing
• Cross environment
▪ Scikit learn typically in Python
▪ Ray.IO with RayDP enables efficient data exchange
• AND/OR Graphs
▪ Enriched DAGs from plain pipelines
▪ OR nodes for fan-out expressions
▪ AND nodes for arbitrary lambdas

8. Illustrative Example
[Diagram: Preprocess, Random Forest, Gradient Boost, Decision Tree; panels: Sample Pipeline, Scikit learn Pipeline, Our Pipeline]
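The slide text does not spell out how the two panels differ. One plausible reading, assumed here and not confirmed by the source, is that a linear pipeline needs one copy of the preprocessing step per model, whereas a graph can share a single Preprocess node across all three. The sketch below counts preprocessing calls under that assumption; the three "models" are hypothetical toy functions, not real estimators.

```python
calls = {"preprocess": 0}


def preprocess(X):
    # Hypothetical preprocessing step; counts how often it runs.
    calls["preprocess"] += 1
    return [x / 10 for x in X]


# Hypothetical stand-ins for the three estimators named on the slide.
models = {
    "random_forest": lambda X: sum(X),
    "gradient_boost": lambda X: max(X),
    "decision_tree": lambda X: min(X),
}

X = [10, 20, 30]

# Three independent linear pipelines: preprocess runs once per pipeline.
linear = {name: model(preprocess(X)) for name, model in models.items()}
linear_calls = calls["preprocess"]

# One graph with a shared preprocess node that fans out to the three models.
calls["preprocess"] = 0
shared = preprocess(X)
fanned = {name: model(shared) for name, model in models.items()}
fanned_calls = calls["preprocess"]

print(linear_calls, fanned_calls)
```

Both variants produce the same per-model results; only the number of preprocessing runs differs.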

9. Pipelines Galore…
                    Airflow    Kubeflow   Scikit learn     Spark Pipeline        Our pipeline
Task parallelism    ✓          ✓          ✗                ✓                     ✓
Data parallelism    ✗          ✗          ✗                ✓                     ✓
And/Or Graphs       ✓          ✓          ✗                ✗                     ✓
Computational unit  Container  Container  Python function  Python/Java function  Python/Java function
Mutability of DAG   ✗          ✗          ✓                ✓                     ✓

10. What to expect?
• Execution strategies based on graph traversals
• Early stopping criteria
• Mutability of execution pipelines
• Current status: Proposal discussion with Ray and OSS community
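As a rough illustration of what a traversal-based execution strategy with an early stopping criterion could look like, the sketch below walks a small DAG in topological order and stops once a predicate is satisfied. This is an assumption-laden toy, not the proposed design: `run_dag`, the graph, and the scores are all hypothetical, and the traversal uses the standard library's `graphlib.TopologicalSorter` (Python 3.9+).

```python
from graphlib import TopologicalSorter


def run_dag(deps, funcs, source, should_stop):
    # Visit nodes so that every node runs after all of its predecessors.
    results = {}
    for name in TopologicalSorter(deps).static_order():
        # Root nodes read the source data; other nodes read predecessor outputs.
        inputs = [results[d] for d in deps[name]] or [source]
        results[name] = funcs[name](*inputs)
        if should_stop(name, results[name]):
            break  # early stopping: remaining nodes are never executed
    return results


# Hypothetical graph: one shared "scale" node feeding three candidate scorers.
deps = {"scale": [], "a": ["scale"], "b": ["scale"], "c": ["scale"]}
funcs = {
    "scale": lambda X: [x / max(X) for x in X],
    "a": lambda X: 0.70,   # made-up scores standing in for model evaluation
    "b": lambda X: 0.95,
    "c": lambda X: 0.80,
}

results = run_dag(
    deps, funcs, [2, 4, 8],
    should_stop=lambda name, value: name != "scale" and value >= 0.9,
)
print(sorted(results))
```

The relative order of the three candidates is left to the sorter, so which of `a` and `c` run before the stop may vary; `scale` and `b` always appear in the results.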

11. Q&A
Contacts:
Raghu Ganti (rganti@us.ibm.com)
Michael Behrendt (michaelbehrendt@de.ibm.com)
Linsong Chu (lchu@us.ibm.com)
Carlos Costa (chcost@us.ibm.com)
Erik Erlandson (eerlands@redhat.com)
Mudhakar Srivatsa (msrivats@us.ibm.com)

12. Feedback
Your feedback is important to us. Don’t forget to rate and review the sessions.
