SlideShare a Scribd company logo
1 of 31
Download to read offline
Foundations for Scaling ML
in Apache Spark
Joseph K. Bradley
August 14, 2016
® ™
Who am I?
Apache Spark committer & PMC member
Software Engineer @ Databricks
Machine Learning Department @ Carnegie Mellon
2
•  General engine for big data computing
•  Fast
•  Easy to use
•  APIs in Python, Scala, Java & R
3
Apache Spark
Spark	SQL	 Streaming	 MLlib	 GraphX	
Largest cluster:
8000 Nodes (Tencent)
Open source
•  Apache Software Foundation
•  1000+ contributors
•  200+ companies & universities
NOTABLE USERS THAT PRESENTED AT SPARK SUMMIT 2015 SAN
FRANCISCO
Source: Slide 5 of Spark Community Update
MLlib: Spark’s ML library
5
0
500
1000
v0.8 v0.9 v1.0 v1.1 v1.2 v1.3 v1.4 v1.5 v1.6 v2.0
commits/release
Learning tasks
Classification
Regression
Recommendation
Clustering
Frequent itemsets
Data utilities
Featurization
Statistics
Linear algebra
Workflow utilities
Model import/export
Pipelines
DataFrames
Cross validation
Goals
Scale-out ML
Standard library
Extensible API
MLlib: original design
RDDs
Challenges for scalability
6
Resilient Distributed Datasets (RDDs)
7
Map Reduce
master
Resilient Distributed Datasets (RDDs)
8
Resilient Distributed Datasets (RDDs)
9
Resiliency
•  Lineage
•  Caching &
checkpointing
ML on RDDs: the good
Flexible: GLMs, trees, matrix factorization, etc.
Scalable: E.g., Alternating Least Squares on Spotify data
•  50+ million users x 30+ million songs
•  50 billion ratings
Cost ~ $10
•  32 r3.8xlarge nodes (spot instances)
•  For rank 10 with 10 iterations, ~1 hour running time.
10
ML on RDDs: the challenges
Partitioning
•  Data partitioning impacts performance.
•  E.g., for Alternating Least Squares
Lineage
•  Iterative algorithms à long RDD lineage
•  Solvable via careful caching and checkpointing
JVM
•  Garbage collection (GC)
•  Boxed types
11
MLlib: current status
DataFrame & Dataset integration
Pipelines API
12
Spark DataFrames & Datasets
13
dept	 age	 name	
Bio	 48	 H	Smith	
CS	 34	 A	Turing	
Bio	 43	 B	Jones	
Chem	 61	 M	Kennedy	
Data grouped into
named columns
DSL for common tasks
•  Project, filter, aggregate, join, …
•  100+ functions available
•  User-Defined Functions (UDFs)
data.groupBy(“dept”).avg(“age”)
Datasets: Strongly typed DataFrames
DataFrame optimizations
Catalyst query optimizer
Project Tungsten
• Memory management
• Code generation
14
Predicate pushdown
Join selection
…
Off-heap
Avoid JVM GC
Compressed format
Combine operations into single,
efficient code blocks
ML Pipelines
•  DataFrames: unified ML dataset API
•  Flexible types
•  Add & remove columns during Pipeline execution
15
Load data
Feature	
extracIon	
Original	
dataset	
16
PredicIve	
model	
EvaluaIon	
Text Label
I bought the game... 4
Do NOT bother try... 1
this shirt is aweso... 5
never got it. Seller... 1
I ordered this to... 3
Extract features
Feature	
extracIon	
Original	
dataset	
17
PredicIve	
model	
EvaluaIon	
Text Label Words Features
I bought the game... 4 “i", “bought”,... [1, 0, 3, 9, ...]
Do NOT bother try... 1 “do”, “not”,... [0, 0, 11, 0, ...]
this shirt is aweso... 5 “this”, “shirt” [0, 2, 3, 1, ...]
never got it. Seller... 1 “never”, “got” [1, 2, 0, 0, ...]
I ordered this to... 3 “i”, “ordered” [1, 0, 0, 3, ...]
Fit a model
Feature	
extracIon	
Original	
dataset	
18
PredicIve	
model	
EvaluaIon	
Text Label Words Features Prediction Probability
I bought the game... 4 “i", “bought”,... [1, 0, 3, 9, ...] 4 0.8
Do NOT bother try... 1 “do”, “not”,... [0, 0, 11, 0, ...] 2 0.6
this shirt is aweso... 5 “this”, “shirt” [0, 2, 3, 1, ...] 5 0.9
never got it. Seller... 1 “never”, “got” [1, 2, 0, 0, ...] 1 0.7
I ordered this to... 3 “i”, “ordered” [1, 0, 0, 3, ...] 4 0.7
Evaluate
Feature	
extracIon	
Original	
dataset	
19
PredicIve	
model	
EvaluaIon	
Text Label Words Features Prediction Probability
I bought the game... 4 “i", “bought”,... [1, 0, 3, 9, ...] 4 0.8
Do NOT bother try... 1 “do”, “not”,... [0, 0, 11, 0, ...] 2 0.6
this shirt is aweso... 5 “this”, “shirt” [0, 2, 3, 1, ...] 5 0.9
never got it. Seller... 1 “never”, “got” [1, 2, 0, 0, ...] 1 0.7
I ordered this to... 3 “i”, “ordered” [1, 0, 0, 3, ...] 4 0.7
ML Pipelines
DataFrames: unified ML dataset API
•  Flexible types
•  Add & remove columns during Pipeline execution
•  Materialize columns lazily
•  Inspect intermediate results
20
Under the hood: optimizations
Current use of DataFrames
•  API
•  Transformations & predictions
21
Feature transformation & model
prediction are phrased as User-
Defined Functions (UDFs)
à Catalyst query optimizer
à Tungsten memory management
+ code generation
MLlib: future scaling
DataFrames for training
Potential benefits
•  Spilling to disk
•  Catalyst
•  Tungsten
Challenges remaining
22
Implementing ML on DataFrames
23
Map Reduce
master
Scalability
DataFrames automatically spill to disk
à Classic pain point of RDDs
24
java.lang.OutOfMemoryError
Goal: Smoothly scale, without custom per-algorithm optimizations
Catalyst in ML
Key idea: automatic query (ML algorithm) optimization
•  DataFrame operations are lazy.
•  Express entire algorithm as DataFrame operations.
•  Let Catalyst reorganize the algorithm, data, etc.
à Fewer manual optimizations
25
Tungsten in ML
Tungsten: off-heap memory management
•  Avoids JVM GC
•  Uses efficient storage formats
•  Code generation
26
Issue in ML:
object creation during each iteration
Issue in ML:
Array[(Int,Double,Double)]
Issue in ML:
Volcano iterator model in MR/RDDs
Prototyping ML on DataFrames
Currently:
•  Belief propagation
•  Connected components
Current challenges:
•  DataFrame query plans do not have iteration as a top-level concept
•  ML/Graph-specific optimizations for Catalyst query planner
Eventual goal: Port all ML algorithms to run on top of DataFrames
à speed & scalability
27
To summarize...
MLlib on RDDs
•  Required custom optimizations
MLlib with a DataFrame-based API
•  Friendly API
•  Improvements for prediction
MLlib on DataFrames
•  Potential for even greater scaling for training
•  Simpler for non-experts to write new algorithms
28
Get started
Get involved
•  JIRA http://issues.apache.org
•  mailing lists http://spark.apache.org
•  Github http://github.com/apache/spark
•  Spark Packages http://spark-packages.org
Learn more
•  New in Apache Spark 2.0
http://databricks.com/blog/2016/06/01
•  MOOCs on EdX http://databricks.com/spark/training
29
Try out Apache Spark 2.0 in
Databricks Community Edition
http://databricks.com/ce
Many thanks to the community
for contributions & support!
Databricks
Founded by the creators of Apache Spark
Offers hosted service
•  Spark on EC2
•  Notebooks
•  Visualizations
•  Cluster management
•  Scheduled jobs
30
We’re hiring!
Thank you!
Twitter: @jkbatcmu

More Related Content

Similar to Foundations for Scaling ML in Apache Spark by Joseph Bradley at BigMine16

Distributed ML in Apache Spark
Distributed ML in Apache SparkDistributed ML in Apache Spark
Distributed ML in Apache SparkDatabricks
 
Deploying MLlib for Scoring in Structured Streaming with Joseph Bradley
Deploying MLlib for Scoring in Structured Streaming with Joseph BradleyDeploying MLlib for Scoring in Structured Streaming with Joseph Bradley
Deploying MLlib for Scoring in Structured Streaming with Joseph BradleyDatabricks
 
Combining Machine Learning Frameworks with Apache Spark
Combining Machine Learning Frameworks with Apache SparkCombining Machine Learning Frameworks with Apache Spark
Combining Machine Learning Frameworks with Apache SparkDatabricks
 
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...Jose Quesada (hiring)
 
Combining Machine Learning frameworks with Apache Spark
Combining Machine Learning frameworks with Apache SparkCombining Machine Learning frameworks with Apache Spark
Combining Machine Learning frameworks with Apache SparkDataWorks Summit/Hadoop Summit
 
Building machine learning service in your business — Eric Chen (Uber) @PAPIs ...
Building machine learning service in your business — Eric Chen (Uber) @PAPIs ...Building machine learning service in your business — Eric Chen (Uber) @PAPIs ...
Building machine learning service in your business — Eric Chen (Uber) @PAPIs ...PAPIs.io
 
Apache Spark MLlib 2.0 Preview: Data Science and Production
Apache Spark MLlib 2.0 Preview: Data Science and ProductionApache Spark MLlib 2.0 Preview: Data Science and Production
Apache Spark MLlib 2.0 Preview: Data Science and ProductionDatabricks
 
Machine Learning Pipelines - Joseph Bradley - Databricks
Machine Learning Pipelines - Joseph Bradley - DatabricksMachine Learning Pipelines - Joseph Bradley - Databricks
Machine Learning Pipelines - Joseph Bradley - DatabricksSpark Summit
 
The Nitty Gritty of Advanced Analytics Using Apache Spark in Python
The Nitty Gritty of Advanced Analytics Using Apache Spark in PythonThe Nitty Gritty of Advanced Analytics Using Apache Spark in Python
The Nitty Gritty of Advanced Analytics Using Apache Spark in PythonMiklos Christine
 
Scaling Up Machine Learning Experimentation at Tubi 5x and Beyond
Scaling Up Machine Learning Experimentation at Tubi 5x and BeyondScaling Up Machine Learning Experimentation at Tubi 5x and Beyond
Scaling Up Machine Learning Experimentation at Tubi 5x and BeyondScyllaDB
 
How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....
How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....
How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....Databricks
 
Low Latency Polyglot Model Scoring using Apache Apex
Low Latency Polyglot Model Scoring using Apache ApexLow Latency Polyglot Model Scoring using Apache Apex
Low Latency Polyglot Model Scoring using Apache ApexApache Apex
 
Ml ops and the feature store with hopsworks, DC Data Science Meetup
Ml ops and the feature store with hopsworks, DC Data Science MeetupMl ops and the feature store with hopsworks, DC Data Science Meetup
Ml ops and the feature store with hopsworks, DC Data Science MeetupJim Dowling
 
Tuning ML Models: Scaling, Workflows, and Architecture
Tuning ML Models: Scaling, Workflows, and ArchitectureTuning ML Models: Scaling, Workflows, and Architecture
Tuning ML Models: Scaling, Workflows, and ArchitectureDatabricks
 
R4ML: An R Based Scalable Machine Learning Framework
R4ML: An R Based Scalable Machine Learning FrameworkR4ML: An R Based Scalable Machine Learning Framework
R4ML: An R Based Scalable Machine Learning FrameworkAlok Singh
 
PyCon Sweden 2022 - Dowling - Serverless ML with Hopsworks.pdf
PyCon Sweden 2022 - Dowling - Serverless ML with Hopsworks.pdfPyCon Sweden 2022 - Dowling - Serverless ML with Hopsworks.pdf
PyCon Sweden 2022 - Dowling - Serverless ML with Hopsworks.pdfJim Dowling
 
Practical Machine Learning Pipelines with MLlib
Practical Machine Learning Pipelines with MLlibPractical Machine Learning Pipelines with MLlib
Practical Machine Learning Pipelines with MLlibDatabricks
 
Sista: Improving Cog’s JIT performance
Sista: Improving Cog’s JIT performanceSista: Improving Cog’s JIT performance
Sista: Improving Cog’s JIT performanceESUG
 
Maintainable Machine Learning Products
Maintainable Machine Learning ProductsMaintainable Machine Learning Products
Maintainable Machine Learning ProductsAndrew Musselman
 
MLlib: Spark's Machine Learning Library
MLlib: Spark's Machine Learning LibraryMLlib: Spark's Machine Learning Library
MLlib: Spark's Machine Learning Libraryjeykottalam
 

Similar to Foundations for Scaling ML in Apache Spark by Joseph Bradley at BigMine16 (20)

Distributed ML in Apache Spark
Distributed ML in Apache SparkDistributed ML in Apache Spark
Distributed ML in Apache Spark
 
Deploying MLlib for Scoring in Structured Streaming with Joseph Bradley
Deploying MLlib for Scoring in Structured Streaming with Joseph BradleyDeploying MLlib for Scoring in Structured Streaming with Joseph Bradley
Deploying MLlib for Scoring in Structured Streaming with Joseph Bradley
 
Combining Machine Learning Frameworks with Apache Spark
Combining Machine Learning Frameworks with Apache SparkCombining Machine Learning Frameworks with Apache Spark
Combining Machine Learning Frameworks with Apache Spark
 
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
 
Combining Machine Learning frameworks with Apache Spark
Combining Machine Learning frameworks with Apache SparkCombining Machine Learning frameworks with Apache Spark
Combining Machine Learning frameworks with Apache Spark
 
Building machine learning service in your business — Eric Chen (Uber) @PAPIs ...
Building machine learning service in your business — Eric Chen (Uber) @PAPIs ...Building machine learning service in your business — Eric Chen (Uber) @PAPIs ...
Building machine learning service in your business — Eric Chen (Uber) @PAPIs ...
 
Apache Spark MLlib 2.0 Preview: Data Science and Production
Apache Spark MLlib 2.0 Preview: Data Science and ProductionApache Spark MLlib 2.0 Preview: Data Science and Production
Apache Spark MLlib 2.0 Preview: Data Science and Production
 
Machine Learning Pipelines - Joseph Bradley - Databricks
Machine Learning Pipelines - Joseph Bradley - DatabricksMachine Learning Pipelines - Joseph Bradley - Databricks
Machine Learning Pipelines - Joseph Bradley - Databricks
 
The Nitty Gritty of Advanced Analytics Using Apache Spark in Python
The Nitty Gritty of Advanced Analytics Using Apache Spark in PythonThe Nitty Gritty of Advanced Analytics Using Apache Spark in Python
The Nitty Gritty of Advanced Analytics Using Apache Spark in Python
 
Scaling Up Machine Learning Experimentation at Tubi 5x and Beyond
Scaling Up Machine Learning Experimentation at Tubi 5x and BeyondScaling Up Machine Learning Experimentation at Tubi 5x and Beyond
Scaling Up Machine Learning Experimentation at Tubi 5x and Beyond
 
How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....
How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....
How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....
 
Low Latency Polyglot Model Scoring using Apache Apex
Low Latency Polyglot Model Scoring using Apache ApexLow Latency Polyglot Model Scoring using Apache Apex
Low Latency Polyglot Model Scoring using Apache Apex
 
Ml ops and the feature store with hopsworks, DC Data Science Meetup
Ml ops and the feature store with hopsworks, DC Data Science MeetupMl ops and the feature store with hopsworks, DC Data Science Meetup
Ml ops and the feature store with hopsworks, DC Data Science Meetup
 
Tuning ML Models: Scaling, Workflows, and Architecture
Tuning ML Models: Scaling, Workflows, and ArchitectureTuning ML Models: Scaling, Workflows, and Architecture
Tuning ML Models: Scaling, Workflows, and Architecture
 
R4ML: An R Based Scalable Machine Learning Framework
R4ML: An R Based Scalable Machine Learning FrameworkR4ML: An R Based Scalable Machine Learning Framework
R4ML: An R Based Scalable Machine Learning Framework
 
PyCon Sweden 2022 - Dowling - Serverless ML with Hopsworks.pdf
PyCon Sweden 2022 - Dowling - Serverless ML with Hopsworks.pdfPyCon Sweden 2022 - Dowling - Serverless ML with Hopsworks.pdf
PyCon Sweden 2022 - Dowling - Serverless ML with Hopsworks.pdf
 
Practical Machine Learning Pipelines with MLlib
Practical Machine Learning Pipelines with MLlibPractical Machine Learning Pipelines with MLlib
Practical Machine Learning Pipelines with MLlib
 
Sista: Improving Cog’s JIT performance
Sista: Improving Cog’s JIT performanceSista: Improving Cog’s JIT performance
Sista: Improving Cog’s JIT performance
 
Maintainable Machine Learning Products
Maintainable Machine Learning ProductsMaintainable Machine Learning Products
Maintainable Machine Learning Products
 
MLlib: Spark's Machine Learning Library
MLlib: Spark's Machine Learning LibraryMLlib: Spark's Machine Learning Library
MLlib: Spark's Machine Learning Library
 

More from BigMine

Inside the Atoms: Mining a Network of Networks and Beyond by HangHang Tong at...
Inside the Atoms: Mining a Network of Networks and Beyond by HangHang Tong at...Inside the Atoms: Mining a Network of Networks and Beyond by HangHang Tong at...
Inside the Atoms: Mining a Network of Networks and Beyond by HangHang Tong at...BigMine
 
From Practice to Theory in Learning from Massive Data by Charles Elkan at Big...
From Practice to Theory in Learning from Massive Data by Charles Elkan at Big...From Practice to Theory in Learning from Massive Data by Charles Elkan at Big...
From Practice to Theory in Learning from Massive Data by Charles Elkan at Big...BigMine
 
Big Data and Small Devices by Katharina Morik
Big Data and Small Devices by Katharina MorikBig Data and Small Devices by Katharina Morik
Big Data and Small Devices by Katharina MorikBigMine
 
Exact Data Reduction for Big Data by Jieping Ye
Exact Data Reduction for Big Data by Jieping YeExact Data Reduction for Big Data by Jieping Ye
Exact Data Reduction for Big Data by Jieping YeBigMine
 
Processing Reachability Queries with Realistic Constraints on Massive Network...
Processing Reachability Queries with Realistic Constraints on Massive Network...Processing Reachability Queries with Realistic Constraints on Massive Network...
Processing Reachability Queries with Realistic Constraints on Massive Network...BigMine
 
Challenging Problems for Scalable Mining of Heterogeneous Social and Informat...
Challenging Problems for Scalable Mining of Heterogeneous Social and Informat...Challenging Problems for Scalable Mining of Heterogeneous Social and Informat...
Challenging Problems for Scalable Mining of Heterogeneous Social and Informat...BigMine
 
Big & Personal: the data and the models behind Netflix recommendations by Xa...
 Big & Personal: the data and the models behind Netflix recommendations by Xa... Big & Personal: the data and the models behind Netflix recommendations by Xa...
Big & Personal: the data and the models behind Netflix recommendations by Xa...BigMine
 
Large Graph Mining – Patterns, tools and cascade analysis by Christos Faloutsos
Large Graph Mining – Patterns, tools and cascade analysis by Christos FaloutsosLarge Graph Mining – Patterns, tools and cascade analysis by Christos Faloutsos
Large Graph Mining – Patterns, tools and cascade analysis by Christos FaloutsosBigMine
 
Unexpected Challenges in Large Scale Machine Learning by Charles Parker
 Unexpected Challenges in Large Scale Machine Learning by Charles Parker Unexpected Challenges in Large Scale Machine Learning by Charles Parker
Unexpected Challenges in Large Scale Machine Learning by Charles ParkerBigMine
 
Big Data Analytics: Applications and Opportunities in On-line Predictive Mode...
Big Data Analytics: Applications and Opportunities in On-line Predictive Mode...Big Data Analytics: Applications and Opportunities in On-line Predictive Mode...
Big Data Analytics: Applications and Opportunities in On-line Predictive Mode...BigMine
 

More from BigMine (10)

Inside the Atoms: Mining a Network of Networks and Beyond by HangHang Tong at...
Inside the Atoms: Mining a Network of Networks and Beyond by HangHang Tong at...Inside the Atoms: Mining a Network of Networks and Beyond by HangHang Tong at...
Inside the Atoms: Mining a Network of Networks and Beyond by HangHang Tong at...
 
From Practice to Theory in Learning from Massive Data by Charles Elkan at Big...
From Practice to Theory in Learning from Massive Data by Charles Elkan at Big...From Practice to Theory in Learning from Massive Data by Charles Elkan at Big...
From Practice to Theory in Learning from Massive Data by Charles Elkan at Big...
 
Big Data and Small Devices by Katharina Morik
Big Data and Small Devices by Katharina MorikBig Data and Small Devices by Katharina Morik
Big Data and Small Devices by Katharina Morik
 
Exact Data Reduction for Big Data by Jieping Ye
Exact Data Reduction for Big Data by Jieping YeExact Data Reduction for Big Data by Jieping Ye
Exact Data Reduction for Big Data by Jieping Ye
 
Processing Reachability Queries with Realistic Constraints on Massive Network...
Processing Reachability Queries with Realistic Constraints on Massive Network...Processing Reachability Queries with Realistic Constraints on Massive Network...
Processing Reachability Queries with Realistic Constraints on Massive Network...
 
Challenging Problems for Scalable Mining of Heterogeneous Social and Informat...
Challenging Problems for Scalable Mining of Heterogeneous Social and Informat...Challenging Problems for Scalable Mining of Heterogeneous Social and Informat...
Challenging Problems for Scalable Mining of Heterogeneous Social and Informat...
 
Big & Personal: the data and the models behind Netflix recommendations by Xa...
 Big & Personal: the data and the models behind Netflix recommendations by Xa... Big & Personal: the data and the models behind Netflix recommendations by Xa...
Big & Personal: the data and the models behind Netflix recommendations by Xa...
 
Large Graph Mining – Patterns, tools and cascade analysis by Christos Faloutsos
Large Graph Mining – Patterns, tools and cascade analysis by Christos FaloutsosLarge Graph Mining – Patterns, tools and cascade analysis by Christos Faloutsos
Large Graph Mining – Patterns, tools and cascade analysis by Christos Faloutsos
 
Unexpected Challenges in Large Scale Machine Learning by Charles Parker
 Unexpected Challenges in Large Scale Machine Learning by Charles Parker Unexpected Challenges in Large Scale Machine Learning by Charles Parker
Unexpected Challenges in Large Scale Machine Learning by Charles Parker
 
Big Data Analytics: Applications and Opportunities in On-line Predictive Mode...
Big Data Analytics: Applications and Opportunities in On-line Predictive Mode...Big Data Analytics: Applications and Opportunities in On-line Predictive Mode...
Big Data Analytics: Applications and Opportunities in On-line Predictive Mode...
 

Recently uploaded

Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Thomas Poetter
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryJeremy Anderson
 
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesConf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesTimothy Spann
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一F sss
 
April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024
April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024
April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024Timothy Spann
 
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhh
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhhThiophen Mechanism khhjjjjjjjhhhhhhhhhhh
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhhYasamin16
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.natarajan8993
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsVICTOR MAESTRE RAMIREZ
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...limedy534
 
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...GQ Research
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Cathrine Wilhelmsen
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024thyngster
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...Boston Institute of Analytics
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectBoston Institute of Analytics
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档208367051
 
detection and classification of knee osteoarthritis.pptx
detection and classification of knee osteoarthritis.pptxdetection and classification of knee osteoarthritis.pptx
detection and classification of knee osteoarthritis.pptxAleenaJamil4
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queensdataanalyticsqueen03
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanMYRABACSAFRA2
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改yuu sss
 
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxaleedritatuxx
 

Recently uploaded (20)

Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data Story
 
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesConf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
 
April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024
April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024
April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024
 
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhh
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhhThiophen Mechanism khhjjjjjjjhhhhhhhhhhh
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhh
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business Professionals
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
 
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis Project
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
 
detection and classification of knee osteoarthritis.pptx
detection and classification of knee osteoarthritis.pptxdetection and classification of knee osteoarthritis.pptx
detection and classification of knee osteoarthritis.pptx
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queens
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population Mean
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
 
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
 

Foundations for Scaling ML in Apache Spark by Joseph Bradley at BigMine16

  • 1. Foundations for Scaling ML in Apache Spark Joseph K. Bradley August 14, 2016 ® ™
  • 2. Who am I? Apache Spark committer & PMC member Software Engineer @ Databricks Machine Learning Department @ Carnegie Mellon 2
  • 3. •  General engine for big data computing •  Fast •  Easy to use •  APIs in Python, Scala, Java & R 3 Apache Spark Spark SQL Streaming MLlib GraphX Largest cluster: 8000 Nodes (Tencent) Open source •  Apache Software Foundation •  1000+ contributors •  200+ companies & universities
  • 4. NOTABLE USERS THAT PRESENTED AT SPARK SUMMIT 2015 SAN FRANCISCO Source: Slide 5 of Spark Community Update
  • 5. MLlib: Spark’s ML library 5 0 500 1000 v0.8 v0.9 v1.0 v1.1 v1.2 v1.3 v1.4 v1.5 v1.6 v2.0 commits/release Learning tasks Classification Regression Recommendation Clustering Frequent itemsets Data utilities Featurization Statistics Linear algebra Workflow utilities Model import/export Pipelines DataFrames Cross validation Goals Scale-out ML Standard library Extensible API
  • 7. Resilient Distributed Datasets (RDDs) 7 Map Reduce master
  • 9. Resilient Distributed Datasets (RDDs) 9 Resiliency •  Lineage •  Caching & checkpointing
  • 10. ML on RDDs: the good Flexible: GLMs, trees, matrix factorization, etc. Scalable: E.g., Alternating Least Squares on Spotify data •  50+ million users x 30+ million songs •  50 billion ratings Cost ~ $10 •  32 r3.8xlarge nodes (spot instances) •  For rank 10 with 10 iterations, ~1 hour running time. 10
  • 11. ML on RDDs: the challenges Partitioning •  Data partitioning impacts performance. •  E.g., for Alternating Least Squares Lineage •  Iterative algorithms à long RDD lineage •  Solvable via careful caching and checkpointing JVM •  Garbage collection (GC) •  Boxed types 11
  • 12. MLlib: current status DataFrame & Dataset integration Pipelines API 12
  • 13. Spark DataFrames & Datasets 13 dept age name Bio 48 H Smith CS 34 A Turing Bio 43 B Jones Chem 61 M Kennedy Data grouped into named columns DSL for common tasks •  Project, filter, aggregate, join, … •  100+ functions available •  User-Defined Functions (UDFs) data.groupBy(“dept”).avg(“age”) Datasets: Strongly typed DataFrames
  • 14. DataFrame optimizations Catalyst query optimizer Project Tungsten • Memory management • Code generation 14 Predicate pushdown Join selection … Off-heap Avoid JVM GC Compressed format Combine operations into single, efficient code blocks
  • 15. ML Pipelines •  DataFrames: unified ML dataset API •  Flexible types •  Add & remove columns during Pipeline execution 15
  • 16. Load data Feature extracIon Original dataset 16 PredicIve model EvaluaIon Text Label I bought the game... 4 Do NOT bother try... 1 this shirt is aweso... 5 never got it. Seller... 1 I ordered this to... 3
  • 17. Extract features Feature extracIon Original dataset 17 PredicIve model EvaluaIon Text Label Words Features I bought the game... 4 “i", “bought”,... [1, 0, 3, 9, ...] Do NOT bother try... 1 “do”, “not”,... [0, 0, 11, 0, ...] this shirt is aweso... 5 “this”, “shirt” [0, 2, 3, 1, ...] never got it. Seller... 1 “never”, “got” [1, 2, 0, 0, ...] I ordered this to... 3 “i”, “ordered” [1, 0, 0, 3, ...]
  • 18. Fit a model Feature extracIon Original dataset 18 PredicIve model EvaluaIon Text Label Words Features Prediction Probability I bought the game... 4 “i", “bought”,... [1, 0, 3, 9, ...] 4 0.8 Do NOT bother try... 1 “do”, “not”,... [0, 0, 11, 0, ...] 2 0.6 this shirt is aweso... 5 “this”, “shirt” [0, 2, 3, 1, ...] 5 0.9 never got it. Seller... 1 “never”, “got” [1, 2, 0, 0, ...] 1 0.7 I ordered this to... 3 “i”, “ordered” [1, 0, 0, 3, ...] 4 0.7
  • 19. Evaluate Feature extracIon Original dataset 19 PredicIve model EvaluaIon Text Label Words Features Prediction Probability I bought the game... 4 “i", “bought”,... [1, 0, 3, 9, ...] 4 0.8 Do NOT bother try... 1 “do”, “not”,... [0, 0, 11, 0, ...] 2 0.6 this shirt is aweso... 5 “this”, “shirt” [0, 2, 3, 1, ...] 5 0.9 never got it. Seller... 1 “never”, “got” [1, 2, 0, 0, ...] 1 0.7 I ordered this to... 3 “i”, “ordered” [1, 0, 0, 3, ...] 4 0.7
  • 20. ML Pipelines DataFrames: unified ML dataset API •  Flexible types •  Add & remove columns during Pipeline execution •  Materialize columns lazily •  Inspect intermediate results 20
  • 21. Under the hood: optimizations Current use of DataFrames •  API •  Transformations & predictions 21 Feature transformation & model prediction are phrased as User- Defined Functions (UDFs) à Catalyst query optimizer à Tungsten memory management + code generation
  • 22. MLlib: future scaling DataFrames for training Potential benefits •  Spilling to disk •  Catalyst •  Tungsten Challenges remaining 22
  • 23. Implementing ML on DataFrames 23 Map Reduce master
  • 24. Scalability DataFrames automatically spill to disk à Classic pain point of RDDs 24 java.lang.OutOfMemoryError Goal: Smoothly scale, without custom per-algorithm optimizations
  • 25. Catalyst in ML Key idea: automatic query (ML algorithm) optimization •  DataFrame operations are lazy. •  Express entire algorithm as DataFrame operations. •  Let Catalyst reorganize the algorithm, data, etc. à Fewer manual optimizations 25
  • 26. Tungsten in ML Tungsten: off-heap memory management •  Avoids JVM GC •  Uses efficient storage formats •  Code generation 26 Issue in ML: object creation during each iteration Issue in ML: Array[(Int,Double,Double)] Issue in ML: Volcano iterator model in MR/RDDs
  • 27. Prototyping ML on DataFrames Currently: •  Belief propagation •  Connected components Current challenges: •  DataFrame query plans do not have iteration as a top-level concept •  ML/Graph-specific optimizations for Catalyst query planner Eventual goal: Port all ML algorithms to run on top of DataFrames à speed & scalability 27
  • 28. To summarize... MLlib on RDDs •  Required custom optimizations MLlib with a DataFrame-based API •  Friendly API •  Improvements for prediction MLlib on DataFrames •  Potential for even greater scaling for training •  Simpler for non-experts to write new algorithms 28
  • 29. Get started Get involved •  JIRA http://issues.apache.org •  mailing lists http://spark.apache.org •  Github http://github.com/apache/spark •  Spark Packages http://spark-packages.org Learn more •  New in Apache Spark 2.0 http://databricks.com/blog/2016/06/01 •  MOOCs on EdX http://databricks.com/spark/training 29 Try out Apache Spark 2.0 in Databricks Community Edition http://databricks.com/ce Many thanks to the community for contributions & support!
  • 30. Databricks Founded by the creators of Apache Spark Offers hosted service •  Spark on EC2 •  Notebooks •  Visualizations •  Cluster management •  Scheduled jobs 30 We’re hiring!