2. Introduction & Motivation
• Motivation
• State-of-the-art DL models keep growing, as do the datasets on which they are trained
• GPT-3 (state-of-the-art NLP model)
• model size: 175 billion parameters
• trained on 500 billion tokens
• Training time therefore keeps growing …
• A single training run is usually not sufficient
• Hyperparameter tuning is usually required
• Adapting learning rate, momentum, …
• One might want to experiment with the network architecture
• Neural architecture search (NAS)
3. Definition
• In distributed training, the workload of training a model is split up
and shared among multiple processors (workers / nodes)
• Can be a "cluster" of a few workers or up to several hundred
• Usually, each worker is equipped with 2-8 GPUs
• Optimal case is linear scaling
• Training time is inversely proportional to the number of workers
• Usually not achieved, due to adverse effects in larger clusters
• Serial parts (which cannot be parallelized) become more prominent,
as quantified by Amdahl's law (see the sketch after this list)
• Communication cost (between workers) may rise disproportionately
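To make the limits of linear scaling concrete, here is a minimal sketch of Amdahl's law; the 5% serial fraction is an illustrative assumption, not a measured value:

```python
# Amdahl's law: the speedup is bounded by the fraction of the
# training pipeline that cannot be parallelized ("serial").
def amdahl_speedup(n_workers: int, serial: float) -> float:
    """Ideal speedup with n_workers, given a serial fraction in [0, 1]."""
    return 1.0 / (serial + (1.0 - serial) / n_workers)

# Even with only 5% serial work, 512 workers yield ~19x, not 512x:
print(amdahl_speedup(512, 0.05))  # ~19.3
```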
• Distributed training allows training "from scratch" on a huge dataset in minutes
• E.g., an image classification model can be trained in 1.5 minutes
on the ImageNet dataset, employing 512 GPUs
4. Data parallelism versus Model parallelism
• Data parallelism
• Training data is split into chunks
• Each worker processes a chunk
and updates the model (see the sketch after this list)
• Advantages
• Can be applied to any model
• Disadvantages
• Each worker must have enough (GPU)
memory to hold the whole model
• The updated model must be communicated
regularly to all workers
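A minimal data-parallel sketch using PyTorch's DistributedDataParallel; the toy model and the launch via torchrun (one process per GPU) are illustrative assumptions:

```python
# Data parallelism with PyTorch DDP. Assumes launch via
# `torchrun --nproc_per_node=<gpus> train.py`, which sets the env vars
# (RANK, LOCAL_RANK, WORLD_SIZE) read by init_process_group.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")          # one process per GPU
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(1000, 10).cuda(local_rank)  # toy model (assumption)
ddp_model = DDP(model, device_ids=[local_rank])     # full replica per worker
optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)

# Each worker trains on its own chunk of the data; DDP averages the
# gradients across all workers via allreduce during backward().
for _ in range(100):
    x = torch.randn(32, 1000, device=f"cuda:{local_rank}")
    y = torch.randint(0, 10, (32,), device=f"cuda:{local_rank}")
    loss = torch.nn.functional.cross_entropy(ddp_model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```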
5. Data parallelism versus Model parallelism
• Model parallelism
• Model is split into several parts
• Each worker processes
its respective model part (see the sketch after this list)
• Advantages
• Supports large models which
do not fit into GPU memory (e.g. NLP models)
• Disadvantages
• One has to find an efficient split of the model,
which depends on the model structure and the number of workers
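A minimal model-parallel sketch: a two-layer toy model split across two GPUs. The split point and layer sizes are illustrative assumptions; as noted above, a good split depends on the actual model structure:

```python
# Model parallelism: each part of the model lives on a different GPU.
# Assumes at least two GPUs are visible.
import torch
import torch.nn as nn

class TwoGPUModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.part1 = nn.Linear(1000, 4096).to("cuda:0")  # worker / GPU 0
        self.part2 = nn.Linear(4096, 10).to("cuda:1")    # worker / GPU 1

    def forward(self, x):
        h = torch.relu(self.part1(x.to("cuda:0")))
        return self.part2(h.to("cuda:1"))  # activations cross devices

model = TwoGPUModel()
out = model(torch.randn(32, 1000))  # output tensor lives on cuda:1
```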
6. System architecture
• The system architecture describes how the model parameter
updates of the different workers are exchanged and applied
• Centralized system architecture
• Workers periodically report their model
updates to one (or more) parameter servers
• Decentralized system architecture
• Workers exchange the model updates
directly via an allreduce operation (see the sketch after this list)
• The topology of the allreduce operation is critical
• Fully connected => communication cost O(n^2)!
• Usually high-performance topologies like
ring, tree, butterfly etc. are used
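A minimal sketch of the decentralized pattern with torch.distributed; the launch via torchrun is an assumption. NCCL implements the allreduce with efficient ring/tree topologies internally, avoiding the fully connected O(n^2) exchange:

```python
# Decentralized gradient averaging via allreduce.
import os
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")          # torchrun-style launch
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

grad = torch.randn(10, device="cuda")            # stand-in local gradient
dist.all_reduce(grad, op=dist.ReduceOp.SUM)      # sum across all workers
grad /= dist.get_world_size()                    # SUM + divide = average
```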
7. Synchronization strategies
• Different strategies to synchronize the model parameters between all workers
• Synchronous
• Model parameters are synchronized after each iteration (mini-batch)
• Prone to the straggler problem (the slowest worker delays all others)
• Bounded asynchronous
• Workers may train on model parameters which are "a few iterations" old
• Asynchronous (e.g. the Hogwild algorithm, sketched after this list)
• Workers update the model completely independently of the others
• Difficult to reason about model convergence
• Lost-update problem: new parameters written by
worker A can be overwritten by worker B
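A minimal Hogwild-style sketch in PyTorch (the toy model is an assumption): several processes update a model held in shared memory without any locking, which is exactly why updates from one process can overwrite another's:

```python
# Asynchronous, lock-free (Hogwild-style) training on shared parameters.
import torch
import torch.multiprocessing as mp

def train(model):
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    for _ in range(100):
        x = torch.randn(32, 1000)
        y = torch.randint(0, 10, (32,))
        loss = torch.nn.functional.cross_entropy(model(x), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()  # writes shared parameters without locking

if __name__ == "__main__":
    model = torch.nn.Linear(1000, 10)
    model.share_memory()  # parameters live in shared memory
    workers = [mp.Process(target=train, args=(model,)) for _ in range(4)]
    for p in workers:
        p.start()
    for p in workers:
        p.join()
```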
8. Distributed training frameworks
• Main DL frameworks (PyTorch, TensorFlow, MXNet)
• Mainly provide support for a single node (but with multiple GPUs)
• Horovod (Uber), usage sketched after this list
• PyTorch, TensorFlow, Keras, MXNet
• Data parallelism and limited model parallelism
• FairScale (Facebook)
• PyTorch
• Data parallelism and limited model/pipeline parallelism
• DeepSpeed (Microsoft)
• PyTorch
• Data parallelism and model/pipeline parallelism
• Gradient compression (1-bit Adam / 1-bit LAMB), …
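A minimal Horovod sketch on top of PyTorch; the toy model and the launch via e.g. `horovodrun -np 4 python train.py` are illustrative assumptions:

```python
# Data parallelism with Horovod: one process per GPU.
import torch
import horovod.torch as hvd

hvd.init()
torch.cuda.set_device(hvd.local_rank())

model = torch.nn.Linear(1000, 10).cuda()  # toy model (assumption)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Wrap the optimizer so gradients are averaged via allreduce, and make
# sure all workers start from identical parameters and optimizer state.
optimizer = hvd.DistributedOptimizer(
    optimizer, named_parameters=model.named_parameters())
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)
```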
GPT-3 info: https://lambdalabs.com/blog/demystifying-gpt-3/ and https://scilogs.spektrum.de/hlf/an-ai-walks-into-a-bar-and-it-writes-an-awesome-story/
AlexNet in 1.5 minutes: see https://arxiv.org/pdf/1902.06855.pdf
Info and figures from https://arxiv.org/pdf/1903.11314.pdf