Asynchronous Parameter Server for Distributed Machine Learning with Spark

•

2 likes•937 views

Spark Summit

Glint: An Asynchronous Parameter Server for Spark

Data & Analytics

An Asynchronous Parameter Server for Spark
Rolf Jagerman
University of Amsterdam

Motivation
Data
Machine
Learning
Algorithm
Model

Motivation
Data
ModelMachine
Learning
Algorithm

Motivation
Factorization Machines
https://github.com/MLnick/glint-fm

Tourism Video games Javascript Biology
hotel allianc write tag
local hord docum protein
car euro var hypothet
holidai warcraft return gene
area wow prop cytoplasm
golf gold function prk
wed warhamm subset locu
Motivation
Topic Modeling (LDA)
Tourism Video games Javascript Biology
hotel allianc write tag
local hord docum protein
car euro var hypothet
holidai warcraft return gene
area wow prop cytoplasm
golf gold function prk
wed warhamm subset locu
Tourism Video games Javascript Biology
hotel allianc write tag
local hord docum protein
car euro var hypothet
holidai warcraft return gene
area wow prop cytoplasm
golf gold function prk
wed warhamm subset locu

Problem
Model size exceeds memory of a single machine!
Data
ModelMachine
Learning
Algorithm

9
Spark Worker
Spark Driver
Spark Worker Spark Worker Spark Worker

Spark Worker
Spark Driver
Spark Worker Spark Worker Spark Worker
10

Spark Worker
Spark Driver
Spark Worker Spark Worker Spark Worker
11

Spark Worker
Spark Driver
Spark Worker Spark Worker Spark Worker
12

Spark Worker
Spark Driver
Spark Worker Spark Worker Spark Worker
13

Spark Worker
Spark Driver
Spark Worker Spark Worker Spark Worker
14

Spark Worker
Spark Driver
Spark Worker Spark Worker Spark Worker
15

Parameter Server
• A machine learning framework
• Distributes a model over multiple machines
• Offers two operations:
• Pull: query parts of the model
• Push: update parts of the model
16

Parameter Server
• Machine learning update equation:
wi ← wi + Δ 
• (Stochastic) gradient descent
• Collapsed Gibbs sampling for topic modeling
17

Parameter Server
• Machine learning update equation:
wi ← wi + Δ 
• (Stochastic) gradient descent
• Collapsed Gibbs sampling for topic modeling
• Aggregate push updates via addition (+)
18

19
Spark Worker
Spark Driver
Spark Worker Spark Worker Spark Worker
Parameter Server Parameter Server Parameter Server

20
Spark Worker
Spark Driver
Spark Worker Spark Worker Spark Worker
Parameter Server Parameter Server Parameter Server

21
Spark Worker
Spark Driver
Spark Worker Spark Worker Spark Worker
Parameter Server Parameter Server Parameter Server

22
Spark Worker
Spark Driver
Spark Worker Spark Worker Spark Worker
Parameter Server Parameter Server Parameter Server

23
Spark Worker
Spark Driver
Spark Worker Spark Worker Spark Worker
Parameter Server Parameter Server Parameter Server

Spark Worker
Spark Driver
Spark Worker Spark Worker Spark Worker
Parameter Server Parameter Server Parameter Server
24

Spark Worker
Spark Driver
Spark Worker Spark Worker Spark Worker
Parameter Server Parameter Server Parameter Server
25

Experiments
Task
• LDA topic modeling
• 1,000 topics
• 27TB data set (ClueWeb12)
27

Experiments
Setup
• 30 Spark workers (16 CPU cores each)
• 3.7TB RAM total
• Interconnected over 10Gb/s ethernet
28

Experiments
Approach
• Glint for model storage (billions of parameters)
• LDA approximation algorithm for runtime
improvements
• A small loss in model quality is acceptable
29

Experiments
Comparing three methods:
1. Glint LDA
2. MLLib EM LDA
3. MLLib Online LDA
30

Experiments
31
Glint MLLib EM MLLib Online
6,108 -1.3% +0.4%
5,731 -5.4% -4.6%
5,427 -10.7% -8.2%
6,021 -5.1% +4.0%
5,813 +1.2% -1.3%
5,520 +4.8% -5.5%
5,861 +6.8% -0.5%
Data (GB) # Topics
50 20
100 20
150 20
200 20
200 40
200 60
200 80
Perplexity (topic model quality)

Experiments
32
Glint MLLib EM MLLib Online
6.3 9.7 16.3
7.1 14.2 17.8
8.9 14.1 19.6
10.8 22.3 21.5
11.9 23.7 57.5
13.4 32.4 131.0
14.7 34.4 233.2
Data (GB) # Topics
50 20
100 20
150 20
200 20
200 40
200 60
200 80
Runtime (minutes)

Experiments
33
Glint MLLib EM MLLib Online
3.3
4.6
5.5
6.2
12.1
18.0
23.9
Data (GB) # Topics
50 20
100 20
150 20
200 20
200 40
200 60
200 80
Shuffle Write (GB)
No
Shuffle
Write
No
Shuffle
Write

• MLLib could not scale beyond 200GB or 
100 topics due to task and job failures 
• Glint can compute a topic model on the full 
27TB with 1,000 topics
Experiments
34

Experiments
0 20 40 60
Time (hours)
Perplexity
104204304404504
35

Conclusion
• Glint is a parameter server for Spark
• Machine learning for very large models
• Asynchrony enables highly flexible threading
• Extremely easy to use
• Outperforms MLLib on LDA topic modeling
36

Future work
• Better fault tolerance (using Chord/DHT)
• User defined functions for aggregation
• Support for sparse models
• Implementing other algorithms
• Deep learning
• Linear models
37

What's hot

Analytics at Scale with Apache Spark on AWS with Jonathan FritzDatabricks

Spark Summit EU talk by Michael NitschingerSpark Summit

Spark Summit EU talk by Simon WhitearSpark Summit

Spark Summit EU talk by Mikhail Semeniuk Hollin WilkinsSpark Summit

Spark Summit EU talk by Luca CanaliSpark Summit

A Journey into Databricks' Pipelines: Journey and Lessons LearnedDatabricks

Spark Summit EU talk by Shay Nativ and Dvir VolkSpark Summit

Extending Spark With Java Agent (handout)Jaroslav Bachorik

Spark Summit EU talk by Heiko KorndorfSpark Summit

Spark Summit EU talk by Bas GeerdinkSpark Summit

Simplifying Big Data Applications with Apache Spark 2.0Spark Summit

Degrading Performance? You Might be Suffering From the Small Files SyndromeDatabricks

Spark Summit EU talk by Christos ErotocritouSpark Summit

Dr. Elephant for Monitoring and Tuning Apache Spark Jobs on Hadoop with Carl ...Databricks

Operational Tips For Deploying Apache SparkDatabricks

Spark Summit EU talk by Tim HunterSpark Summit

Spark Tuning For Enterprise System Administrators, Spark Summit East 2016Anya Bida

Spark Summit EU talk by Patrick Baier and Stanimir DragievSpark Summit

Sparking up Data Engineering: Spark Summit East talk by Rohan SharmaSpark Summit

Random Walks on Large Scale Graphs with Apache Spark with Min ShenDatabricks

What's hot (20)

Analytics at Scale with Apache Spark on AWS with Jonathan Fritz

Spark Summit EU talk by Michael Nitschinger

Spark Summit EU talk by Simon Whitear

Spark Summit EU talk by Mikhail Semeniuk Hollin Wilkins

Spark Summit EU talk by Luca Canali

A Journey into Databricks' Pipelines: Journey and Lessons Learned

Spark Summit EU talk by Shay Nativ and Dvir Volk

Extending Spark With Java Agent (handout)

Spark Summit EU talk by Heiko Korndorf

Spark Summit EU talk by Bas Geerdink

Simplifying Big Data Applications with Apache Spark 2.0

Degrading Performance? You Might be Suffering From the Small Files Syndrome

Spark Summit EU talk by Christos Erotocritou

Dr. Elephant for Monitoring and Tuning Apache Spark Jobs on Hadoop with Carl ...

Operational Tips For Deploying Apache Spark

Spark Summit EU talk by Tim Hunter

Spark Tuning For Enterprise System Administrators, Spark Summit East 2016

Spark Summit EU talk by Patrick Baier and Stanimir Dragiev

Sparking up Data Engineering: Spark Summit East talk by Rohan Sharma

Random Walks on Large Scale Graphs with Apache Spark with Min Shen

Viewers also liked

Machine Learning Methods for Parameter Acquisition in a Human ...butest

MALT: Distributed Data-Parallelism for Existing ML Applications (Distributed ...asimkadav

Scalable Deep Learning Platform On Spark In BaiduJen Aman

Challenges on Distributed Machine Learningjie cao

GASGD: Stochastic Gradient Descent for Distributed Asynchronous Matrix Comple...Fabio Petroni, PhD

Large Scale Distributed Deep NetworksHiroyuki Vincent Yamazaki

Distributed machine learningStanley Wang

Alex Smola, Professor in the Machine Learning Department, Carnegie Mellon Uni...MLconf

Spark after Dark by Chris Fregly of DatabricksData Con LA

Distributed Deep Learning on SparkMathieu Dumoulin

FlinkML: Large Scale Machine Learning with Apache FlinkTheodoros Vasiloudis

Deep Learning at ScaleMateusz Dymczyk

Inspection of CloudML Hyper Parameter Tuningnagachika t

A Comparative Analysis of Data Privacy and Utility Parameter Adjustment, Usin...Kato Mivule

Scaling Machine Learning To Billions Of ParametersJen Aman

Rajat Monga, Engineering Director, TensorFlow, Google at MLconf 2016MLconf

Machine Learning on Big DataMax Lin

Introduction to Machine Learning and Deep LearningTerry Taewoong Um

Viewers also liked (18)

Machine Learning Methods for Parameter Acquisition in a Human ...

MALT: Distributed Data-Parallelism for Existing ML Applications (Distributed ...

Scalable Deep Learning Platform On Spark In Baidu

Challenges on Distributed Machine Learning

GASGD: Stochastic Gradient Descent for Distributed Asynchronous Matrix Comple...

Large Scale Distributed Deep Networks

Distributed machine learning

Alex Smola, Professor in the Machine Learning Department, Carnegie Mellon Uni...

Spark after Dark by Chris Fregly of Databricks

Distributed Deep Learning on Spark

FlinkML: Large Scale Machine Learning with Apache Flink

Deep Learning at Scale

Inspection of CloudML Hyper Parameter Tuning

A Comparative Analysis of Data Privacy and Utility Parameter Adjustment, Usin...

Scaling Machine Learning To Billions Of Parameters

Rajat Monga, Engineering Director, TensorFlow, Google at MLconf 2016

Machine Learning on Big Data

Introduction to Machine Learning and Deep Learning

Similar to Asynchronous Parameter Server for Distributed Machine Learning with Spark

Asynchronous Hyperparameter Search with Spark on Hopsworks and MaggyJim Dowling

Advanced Apache Spark Meetup Spark SQL + DataFrames + Catalyst Optimizer + Da...Chris Fregly

Spark ML Pipeline servingStepan Pushkarev

Profiling & Testing with SparkRoger Rafanell Mas

Build, train, and deploy Machine Learning models at scale (May 2018)Julien SIMON

An Insider’s Guide to Maximizing Spark SQL PerformanceTakuya UESHIN

Javantura v4 - Java and lambdas and streams - are they better than for loops ...HUJAK - Hrvatska udruga Java korisnika / Croatian Java User Association

Build, train, and deploy Machine Learning models at scale (May 2018)Julien SIMON

Asynchronous Hyperparameter Optimization with Apache SparkDatabricks

(BDT303) Running Spark and Presto on the Netflix Big Data PlatformAmazon Web Services

Running Presto and Spark on the Netflix Big Data PlatformEva Tse

Build Large-Scale Data Analytics and AI Pipeline Using RayDPDatabricks

Architetture serverless e pattern avanzati per AWS LambdaAmazon Web Services

Building a SIMD Supported Vectorized Native Engine for Spark SQLDatabricks

AMD It's Time to ROCinside-BigData.com

Apache spark-melbourne-april-2015-meetupNed Shawa

Apache Spark 2.0: Faster, Easier, and SmarterDatabricks

Spark streaming state of the unionDatabricks

Building Machine Learning models with Apache Spark and Amazon SageMaker | AWS...Amazon Web Services

Deep_dive_on_Amazon_Neptune_DAT361.pdfShaikAsif83

Similar to Asynchronous Parameter Server for Distributed Machine Learning with Spark (20)

Asynchronous Hyperparameter Search with Spark on Hopsworks and Maggy

Advanced Apache Spark Meetup Spark SQL + DataFrames + Catalyst Optimizer + Da...

Spark ML Pipeline serving

Profiling & Testing with Spark

Build, train, and deploy Machine Learning models at scale (May 2018)

An Insider’s Guide to Maximizing Spark SQL Performance

Javantura v4 - Java and lambdas and streams - are they better than for loops ...

Build, train, and deploy Machine Learning models at scale (May 2018)

Asynchronous Hyperparameter Optimization with Apache Spark

(BDT303) Running Spark and Presto on the Netflix Big Data Platform

Running Presto and Spark on the Netflix Big Data Platform

Build Large-Scale Data Analytics and AI Pipeline Using RayDP

Architetture serverless e pattern avanzati per AWS Lambda

Building a SIMD Supported Vectorized Native Engine for Spark SQL

AMD It's Time to ROC

Apache spark-melbourne-april-2015-meetup

Apache Spark 2.0: Faster, Easier, and Smarter

Spark streaming state of the union

Building Machine Learning models with Apache Spark and Amazon SageMaker | AWS...

Deep_dive_on_Amazon_Neptune_DAT361.pdf

Recently uploaded

(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Serviceranjana rawat

Call Girls In Mahipalpur O9654467111 Escorts ServiceSapana Sha

Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAroojKhan71

Smarteg dropshipping via API with DroFx.pptxolyaivanovalion

BabyOno dropshipping via API with DroFx.pptxolyaivanovalion

BigBuy dropshipping via API with DroFx.pptxolyaivanovalion

CebaBaby dropshipping via API with DroFX.pptxolyaivanovalion

Week-01-2.ppt BBB human Computer interactionfulawalesam

Generative AI on Enterprise Cloud with NiFi and MilvusTimothy Spann

Log Analysis using OSSEC sasoasasasas.pptxJohnnyPlasten

VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...Call Girls In Delhi Whatsup 9873940964 Enjoy Unlimited Pleasure

VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...Suhani Kapoor

April 2024 - Crypto Market Report's Analysismanisha194592

Ravak dropshipping via API with DroFx.pptxolyaivanovalion

Dubai Call Girls Wifey O52&786472 Call Girls Dubaihf8803863

VidaXL dropshipping via API with DroFx.pptxolyaivanovalion

Ukraine War presentation: KNOW THE BASICSAishani27

Schema on read is obsolete. Welcome metaprogramming..pdfLars Albertsson

Customer Service Analytics - Make Sense of All Your Data.pptxEmmanuel Dauda

Introduction-to-Machine-Learning (1).pptxfirstjob4

Recently uploaded (20)

(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service

Call Girls In Mahipalpur O9654467111 Escorts Service

Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha

Smarteg dropshipping via API with DroFx.pptx

BabyOno dropshipping via API with DroFx.pptx

BigBuy dropshipping via API with DroFx.pptx

CebaBaby dropshipping via API with DroFX.pptx

Week-01-2.ppt BBB human Computer interaction

Generative AI on Enterprise Cloud with NiFi and Milvus

Log Analysis using OSSEC sasoasasasas.pptx

VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...

VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...

April 2024 - Crypto Market Report's Analysis

Ravak dropshipping via API with DroFx.pptx

Dubai Call Girls Wifey O52&786472 Call Girls Dubai

VidaXL dropshipping via API with DroFx.pptx

Ukraine War presentation: KNOW THE BASICS

Schema on read is obsolete. Welcome metaprogramming..pdf

Customer Service Analytics - Make Sense of All Your Data.pptx

Introduction-to-Machine-Learning (1).pptx

Asynchronous Parameter Server for Distributed Machine Learning with Spark

1.   An Asynchronous Parameter Server for Spark Rolf Jagerman University of Amsterdam

2. Motivation Data Machine Learning Algorithm Model

3. Motivation Data Machine Learning Algorithm Model

4. Motivation Data ModelMachine Learning Algorithm

5. Motivation Deep Learning

6. Motivation Factorization Machines https://github.com/MLnick/glint-fm

7. Tourism Video games Javascript Biology hotel allianc write tag local hord docum protein car euro var hypothet holidai warcraft return gene area wow prop cytoplasm golf gold function prk wed warhamm subset locu Motivation Topic Modeling (LDA) Tourism Video games Javascript Biology hotel allianc write tag local hord docum protein car euro var hypothet holidai warcraft return gene area wow prop cytoplasm golf gold function prk wed warhamm subset locu Tourism Video games Javascript Biology hotel allianc write tag local hord docum protein car euro var hypothet holidai warcraft return gene area wow prop cytoplasm golf gold function prk wed warhamm subset locu

8. Problem Model size exceeds memory of a single machine! Data ModelMachine Learning Algorithm

9. 9 Spark Worker Spark Driver Spark Worker Spark Worker Spark Worker

10. Spark Worker Spark Driver Spark Worker Spark Worker Spark Worker 10

11. Spark Worker Spark Driver Spark Worker Spark Worker Spark Worker 11

12. Spark Worker Spark Driver Spark Worker Spark Worker Spark Worker 12

13. Spark Worker Spark Driver Spark Worker Spark Worker Spark Worker 13

14. Spark Worker Spark Driver Spark Worker Spark Worker Spark Worker 14

15. Spark Worker Spark Driver Spark Worker Spark Worker Spark Worker 15

16. Parameter Server • A machine learning framework • Distributes a model over multiple machines • Offers two operations: • Pull: query parts of the model • Push: update parts of the model 16

17. Parameter Server • Machine learning update equation: wi ← wi + Δ  • (Stochastic) gradient descent • Collapsed Gibbs sampling for topic modeling 17

18. Parameter Server • Machine learning update equation: wi ← wi + Δ  • (Stochastic) gradient descent • Collapsed Gibbs sampling for topic modeling • Aggregate push updates via addition (+) 18

19. 19 Spark Worker Spark Driver Spark Worker Spark Worker Spark Worker Parameter Server Parameter Server Parameter Server

20. 20 Spark Worker Spark Driver Spark Worker Spark Worker Spark Worker Parameter Server Parameter Server Parameter Server

21. 21 Spark Worker Spark Driver Spark Worker Spark Worker Spark Worker Parameter Server Parameter Server Parameter Server

22. 22 Spark Worker Spark Driver Spark Worker Spark Worker Spark Worker Parameter Server Parameter Server Parameter Server

23. 23 Spark Worker Spark Driver Spark Worker Spark Worker Spark Worker Parameter Server Parameter Server Parameter Server

24. Spark Worker Spark Driver Spark Worker Spark Worker Spark Worker Parameter Server Parameter Server Parameter Server 24

25. Spark Worker Spark Driver Spark Worker Spark Worker Spark Worker Parameter Server Parameter Server Parameter Server 25

26. Demo 26

27. Experiments Task • LDA topic modeling • 1,000 topics • 27TB data set (ClueWeb12) 27

28. Experiments Setup • 30 Spark workers (16 CPU cores each) • 3.7TB RAM total • Interconnected over 10Gb/s ethernet 28

29. Experiments Approach • Glint for model storage (billions of parameters) • LDA approximation algorithm for runtime improvements • A small loss in model quality is acceptable 29

30. Experiments Comparing three methods: 1. Glint LDA 2. MLLib EM LDA 3. MLLib Online LDA 30

31. Experiments 31 Glint MLLib EM MLLib Online 6,108 -1.3% +0.4% 5,731 -5.4% -4.6% 5,427 -10.7% -8.2% 6,021 -5.1% +4.0% 5,813 +1.2% -1.3% 5,520 +4.8% -5.5% 5,861 +6.8% -0.5% Data (GB) # Topics 50 20 100 20 150 20 200 20 200 40 200 60 200 80 Perplexity (topic model quality)

32. Experiments 32 Glint MLLib EM MLLib Online 6.3 9.7 16.3 7.1 14.2 17.8 8.9 14.1 19.6 10.8 22.3 21.5 11.9 23.7 57.5 13.4 32.4 131.0 14.7 34.4 233.2 Data (GB) # Topics 50 20 100 20 150 20 200 20 200 40 200 60 200 80 Runtime (minutes)

33. Experiments 33 Glint MLLib EM MLLib Online 3.3 4.6 5.5 6.2 12.1 18.0 23.9 Data (GB) # Topics 50 20 100 20 150 20 200 20 200 40 200 60 200 80 Shuffle Write (GB) No Shuffle Write No Shuffle Write

34. • MLLib could not scale beyond 200GB or  100 topics due to task and job failures  • Glint can compute a topic model on the full  27TB with 1,000 topics Experiments 34

35. Experiments 0 20 40 60 Time (hours) Perplexity 104204304404504 35

36. Conclusion • Glint is a parameter server for Spark • Machine learning for very large models • Asynchrony enables highly flexible threading • Extremely easy to use • Outperforms MLLib on LDA topic modeling 36

37. Future work • Better fault tolerance (using Chord/DHT) • User defined functions for aggregation • Support for sparse models • Implementing other algorithms • Deep learning • Linear models 37

38. THANK YOU github.com/rjagerman/glint

Asynchronous Parameter Server for Distributed Machine Learning with Spark

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Viewers also liked

Viewers also liked (18)

Similar to Asynchronous Parameter Server for Distributed Machine Learning with Spark

Similar to Asynchronous Parameter Server for Distributed Machine Learning with Spark (20)

More from Spark Summit

More from Spark Summit (20)

Recently uploaded

Recently uploaded (20)

Asynchronous Parameter Server for Distributed Machine Learning with Spark