Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.
Foundations for Scaling ML
in Apache Spark
Joseph K. Bradley
August 14, 2016
® ™
Who am I?
Apache Spark committer & PMC member
Software Engineer @ Databricks
Machine Learning Department @ Carnegie Mellon...
•  General engine for big data computing
•  Fast
•  Easy to use
•  APIs in Python, Scala, Java & R
3
Apache Spark
Spark	SQ...
NOTABLE USERS THAT PRESENTED AT SPARK SUMMIT 2015 SAN
FRANCISCO
Source: Slide 5 of Spark Community Update
MLlib: Spark’s ML library
5
0
500
1000
v0.8 v0.9 v1.0 v1.1 v1.2 v1.3 v1.4 v1.5 v1.6 v2.0
commits/release
Learning tasks
Cl...
MLlib: original design
RDDs
Challenges for scalability
6
Resilient Distributed Datasets (RDDs)
7
Map Reduce
master
Resilient Distributed Datasets (RDDs)
8
Resilient Distributed Datasets (RDDs)
9
Resiliency
•  Lineage
•  Caching &
checkpointing
ML on RDDs: the good
Flexible: GLMs, trees, matrix factorization, etc.
Scalable: E.g., Alternating Least Squares on Spotif...
ML on RDDs: the challenges
Partitioning
•  Data partitioning impacts performance.
•  E.g., for Alternating Least Squares
L...
MLlib: current status
DataFrame & Dataset integration
Pipelines API
12
Spark DataFrames & Datasets
13
dept	 age	 name	
Bio	 48	 H	Smith	
CS	 34	 A	Turing	
Bio	 43	 B	Jones	
Chem	 61	 M	Kennedy	...
DataFrame optimizations
Catalyst query optimizer
Project Tungsten
• Memory management
• Code generation
14
Predicate pushd...
ML Pipelines
•  DataFrames: unified ML dataset API
•  Flexible types
•  Add & remove columns during Pipeline execution
15
Load data
Feature	
extracIon	
Original	
dataset	
16
PredicIve	
model	
EvaluaIon	
Text Label
I bought the game... 4
Do NOT ...
Extract features
Feature	
extracIon	
Original	
dataset	
17
PredicIve	
model	
EvaluaIon	
Text Label Words Features
I bought...
Fit a model
Feature	
extracIon	
Original	
dataset	
18
PredicIve	
model	
EvaluaIon	
Text Label Words Features Prediction Pr...
Evaluate
Feature	
extracIon	
Original	
dataset	
19
PredicIve	
model	
EvaluaIon	
Text Label Words Features Prediction Proba...
ML Pipelines
DataFrames: unified ML dataset API
•  Flexible types
•  Add & remove columns during Pipeline execution
•  Mat...
Under the hood: optimizations
Current use of DataFrames
•  API
•  Transformations & predictions
21
Feature transformation ...
MLlib: future scaling
DataFrames for training
Potential benefits
•  Spilling to disk
•  Catalyst
•  Tungsten
Challenges re...
Implementing ML on DataFrames
23
Map Reduce
master
Scalability
DataFrames automatically spill to disk
à Classic pain point of RDDs
24
java.lang.OutOfMemoryError
Goal: Smooth...
Catalyst in ML
Key idea: automatic query (ML algorithm) optimization
•  DataFrame operations are lazy.
•  Express entire a...
Tungsten in ML
Tungsten: off-heap memory management
•  Avoids JVM GC
•  Uses efficient storage formats
•  Code generation
26...
Prototyping ML on DataFrames
Currently:
•  Belief propagation
•  Connected components
Current challenges:
•  DataFrame que...
To summarize...
MLlib on RDDs
•  Required custom optimizations
MLlib with a DataFrame-based API
•  Friendly API
•  Improve...
Get started
Get involved
•  JIRA http://issues.apache.org
•  mailing lists http://spark.apache.org
•  Github http://github...
Databricks
Founded by the creators of Apache Spark
Offers hosted service
•  Spark on EC2
•  Notebooks
•  Visualizations
•  ...
Thank you!
Twitter: @jkbatcmu
Upcoming SlideShare
Loading in …5
×

Foundations for Scaling ML in Apache Spark by Joseph Bradley at BigMine16

1,192 views

Published on

Apache Spark has become the most active open source Big Data project, and its Machine Learning library MLlib has seen rapid growth in usage. A critical aspect of MLlib and Spark is the ability to scale: the same code used on a laptop can scale to 100’s or 1000’s of machines. This talk will describe ongoing and future efforts to make MLlib even faster and more scalable by integrating with two key initiatives in Spark. The first is Catalyst, the query optimizer underlying DataFrames and Datasets. The second is Tungsten, the project for approaching bare-metal speeds in Spark via memory management, cache-awareness, and code generation. This talk will discuss the goals, the challenges, and the benefits for MLlib users and developers. More generally, we will reflect on the importance of integrating ML with the many other aspects of big data analysis.

About MLlib: MLlib is a general Machine Learning library providing many ML algorithms, feature transformers, and tools for model tuning and building workflows. The library benefits from integration with the rest of Apache Spark (SQL, streaming, Graph, core), which facilitates ETL, streaming, and deployment. It is used in both ad hoc analysis and production deployments throughout academia and industry.

Published in: Data & Analytics
  • Be the first to comment

  • Be the first to like this

Foundations for Scaling ML in Apache Spark by Joseph Bradley at BigMine16

  1. 1. Foundations for Scaling ML in Apache Spark Joseph K. Bradley August 14, 2016 ® ™
  2. 2. Who am I? Apache Spark committer & PMC member Software Engineer @ Databricks Machine Learning Department @ Carnegie Mellon 2
  3. 3. •  General engine for big data computing •  Fast •  Easy to use •  APIs in Python, Scala, Java & R 3 Apache Spark Spark SQL Streaming MLlib GraphX Largest cluster: 8000 Nodes (Tencent) Open source •  Apache Software Foundation •  1000+ contributors •  200+ companies & universities
  4. 4. NOTABLE USERS THAT PRESENTED AT SPARK SUMMIT 2015 SAN FRANCISCO Source: Slide 5 of Spark Community Update
  5. 5. MLlib: Spark’s ML library 5 0 500 1000 v0.8 v0.9 v1.0 v1.1 v1.2 v1.3 v1.4 v1.5 v1.6 v2.0 commits/release Learning tasks Classification Regression Recommendation Clustering Frequent itemsets Data utilities Featurization Statistics Linear algebra Workflow utilities Model import/export Pipelines DataFrames Cross validation Goals Scale-out ML Standard library Extensible API
  6. 6. MLlib: original design RDDs Challenges for scalability 6
  7. 7. Resilient Distributed Datasets (RDDs) 7 Map Reduce master
  8. 8. Resilient Distributed Datasets (RDDs) 8
  9. 9. Resilient Distributed Datasets (RDDs) 9 Resiliency •  Lineage •  Caching & checkpointing
  10. 10. ML on RDDs: the good Flexible: GLMs, trees, matrix factorization, etc. Scalable: E.g., Alternating Least Squares on Spotify data •  50+ million users x 30+ million songs •  50 billion ratings Cost ~ $10 •  32 r3.8xlarge nodes (spot instances) •  For rank 10 with 10 iterations, ~1 hour running time. 10
  11. 11. ML on RDDs: the challenges Partitioning •  Data partitioning impacts performance. •  E.g., for Alternating Least Squares Lineage •  Iterative algorithms à long RDD lineage •  Solvable via careful caching and checkpointing JVM •  Garbage collection (GC) •  Boxed types 11
  12. 12. MLlib: current status DataFrame & Dataset integration Pipelines API 12
  13. 13. Spark DataFrames & Datasets 13 dept age name Bio 48 H Smith CS 34 A Turing Bio 43 B Jones Chem 61 M Kennedy Data grouped into named columns DSL for common tasks •  Project, filter, aggregate, join, … •  100+ functions available •  User-Defined Functions (UDFs) data.groupBy(“dept”).avg(“age”) Datasets: Strongly typed DataFrames
  14. 14. DataFrame optimizations Catalyst query optimizer Project Tungsten • Memory management • Code generation 14 Predicate pushdown Join selection … Off-heap Avoid JVM GC Compressed format Combine operations into single, efficient code blocks
  15. 15. ML Pipelines •  DataFrames: unified ML dataset API •  Flexible types •  Add & remove columns during Pipeline execution 15
  16. 16. Load data Feature extracIon Original dataset 16 PredicIve model EvaluaIon Text Label I bought the game... 4 Do NOT bother try... 1 this shirt is aweso... 5 never got it. Seller... 1 I ordered this to... 3
  17. 17. Extract features Feature extracIon Original dataset 17 PredicIve model EvaluaIon Text Label Words Features I bought the game... 4 “i", “bought”,... [1, 0, 3, 9, ...] Do NOT bother try... 1 “do”, “not”,... [0, 0, 11, 0, ...] this shirt is aweso... 5 “this”, “shirt” [0, 2, 3, 1, ...] never got it. Seller... 1 “never”, “got” [1, 2, 0, 0, ...] I ordered this to... 3 “i”, “ordered” [1, 0, 0, 3, ...]
  18. 18. Fit a model Feature extracIon Original dataset 18 PredicIve model EvaluaIon Text Label Words Features Prediction Probability I bought the game... 4 “i", “bought”,... [1, 0, 3, 9, ...] 4 0.8 Do NOT bother try... 1 “do”, “not”,... [0, 0, 11, 0, ...] 2 0.6 this shirt is aweso... 5 “this”, “shirt” [0, 2, 3, 1, ...] 5 0.9 never got it. Seller... 1 “never”, “got” [1, 2, 0, 0, ...] 1 0.7 I ordered this to... 3 “i”, “ordered” [1, 0, 0, 3, ...] 4 0.7
  19. 19. Evaluate Feature extracIon Original dataset 19 PredicIve model EvaluaIon Text Label Words Features Prediction Probability I bought the game... 4 “i", “bought”,... [1, 0, 3, 9, ...] 4 0.8 Do NOT bother try... 1 “do”, “not”,... [0, 0, 11, 0, ...] 2 0.6 this shirt is aweso... 5 “this”, “shirt” [0, 2, 3, 1, ...] 5 0.9 never got it. Seller... 1 “never”, “got” [1, 2, 0, 0, ...] 1 0.7 I ordered this to... 3 “i”, “ordered” [1, 0, 0, 3, ...] 4 0.7
  20. 20. ML Pipelines DataFrames: unified ML dataset API •  Flexible types •  Add & remove columns during Pipeline execution •  Materialize columns lazily •  Inspect intermediate results 20
  21. 21. Under the hood: optimizations Current use of DataFrames •  API •  Transformations & predictions 21 Feature transformation & model prediction are phrased as User- Defined Functions (UDFs) à Catalyst query optimizer à Tungsten memory management + code generation
  22. 22. MLlib: future scaling DataFrames for training Potential benefits •  Spilling to disk •  Catalyst •  Tungsten Challenges remaining 22
  23. 23. Implementing ML on DataFrames 23 Map Reduce master
  24. 24. Scalability DataFrames automatically spill to disk à Classic pain point of RDDs 24 java.lang.OutOfMemoryError Goal: Smoothly scale, without custom per-algorithm optimizations
  25. 25. Catalyst in ML Key idea: automatic query (ML algorithm) optimization •  DataFrame operations are lazy. •  Express entire algorithm as DataFrame operations. •  Let Catalyst reorganize the algorithm, data, etc. à Fewer manual optimizations 25
  26. 26. Tungsten in ML Tungsten: off-heap memory management •  Avoids JVM GC •  Uses efficient storage formats •  Code generation 26 Issue in ML: object creation during each iteration Issue in ML: Array[(Int,Double,Double)] Issue in ML: Volcano iterator model in MR/RDDs
  27. 27. Prototyping ML on DataFrames Currently: •  Belief propagation •  Connected components Current challenges: •  DataFrame query plans do not have iteration as a top-level concept •  ML/Graph-specific optimizations for Catalyst query planner Eventual goal: Port all ML algorithms to run on top of DataFrames à speed & scalability 27
  28. 28. To summarize... MLlib on RDDs •  Required custom optimizations MLlib with a DataFrame-based API •  Friendly API •  Improvements for prediction MLlib on DataFrames •  Potential for even greater scaling for training •  Simpler for non-experts to write new algorithms 28
  29. 29. Get started Get involved •  JIRA http://issues.apache.org •  mailing lists http://spark.apache.org •  Github http://github.com/apache/spark •  Spark Packages http://spark-packages.org Learn more •  New in Apache Spark 2.0 http://databricks.com/blog/2016/06/01 •  MOOCs on EdX http://databricks.com/spark/training 29 Try out Apache Spark 2.0 in Databricks Community Edition http://databricks.com/ce Many thanks to the community for contributions & support!
  30. 30. Databricks Founded by the creators of Apache Spark Offers hosted service •  Spark on EC2 •  Notebooks •  Visualizations •  Cluster management •  Scheduled jobs 30 We’re hiring!
  31. 31. Thank you! Twitter: @jkbatcmu

×