Toward Distributed, Global, Deep Learning
Using IoT Devices
Bharath Sudharsan, Pankesh Patel, John Breslin,
Muhammad Intizar Ali, Karan Mitra, Schahram Dustdar,
Omer Rana, Prem Prakash Jayaraman, Rajiv Ranjan
Distributed Training
▪ In Deep Learning (DL), more is better
➢ More data, more layers, and more compute power lead to higher accuracy and better robustness of trained models
▪ Distributed training can improve model
convergence speed
➢ Every GPU runs the exact same copy of the training script; each training process is uniquely identified by its rank (see the sketch below)
➢ As the number of training processes increases, inter-process communication grows, and the communication overhead starts to limit scaling efficiency
Distribute training on multiple GPUs using Amazon SageMaker
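As an illustration of this rank-based pattern, here is a minimal data-parallel training sketch using PyTorch's torch.distributed. It is a generic example, not the SageMaker setup pictured above; the rendezvous configuration (environment variables, backend choice) is assumed.

```python
# Minimal sketch of rank-based data-parallel training (illustrative only;
# assumes PyTorch and a rendezvous address provided via environment variables).
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler

def train(model, dataset, epochs=1):
    # Every process runs this exact same script; only its rank differs.
    dist.init_process_group(backend="gloo")  # "nccl" on GPU clusters
    rank, world = dist.get_rank(), dist.get_world_size()

    ddp_model = DDP(model)  # gradients are all-reduced across processes
    sampler = DistributedSampler(dataset, num_replicas=world, rank=rank)
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)
    opt = torch.optim.SGD(ddp_model.parameters(), lr=0.01)

    for epoch in range(epochs):
        sampler.set_epoch(epoch)  # reshuffle shards each epoch
        for x, y in loader:
            opt.zero_grad()
            torch.nn.functional.cross_entropy(ddp_model(x), y).backward()
            opt.step()  # communication overhead grows with world size
    dist.destroy_process_group()
```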
Osmotic Computing: Federated View
▪ The cloud-centric IoT programming model needs to be revised into something more adaptable and decentralized to meet the needs of emerging IoT applications
▪ Osmotic Computing: Maintains an optimal balance of microservices across the edge and the cloud with a tunable configuration
➢ Data Analysis: Distribution of data analysis tasks
across cloud and edge computing environments
➢ Deep Learning: DL can be orchestrated across the cloud-edge continuum for privacy-preserving DL model training
Osmotic Computing
[Figure: the Osmotic Computing paradigm. Cloud Layer (cloud datacenter): massive data processing and massive machine learning, reachable only at high latency. Edge Layer(s) (storage, data-cache, local-compute, and agent nodes): data caching, agenting, and load-balancing, with limited storage but extremely low latency. IoT Layer (body monitors, humidity/motion/light/GPS sensors, aggregators, hospital alarms, mobiles): real-time data collection and processing, with restricted storage and limited compute capacity.]
IoT Devices - Hardware View
[Figure: edge computing hardware from highly resource-constrained to high-resource (left to right): an ARM Cortex-M0 MCU based BLE beacon, powerful CPU + basic GPU based SBCs (single board computers), and an edge gateway with GPUs and SSDs]
▪ MCUs and small CPUs: BLE beacons, smart
bulbs, smart plugs, TV remotes, fitness bands
▪ SBCs: Raspberry Pis, BeagleBones, NVIDIA
Jetsons, Latte Pandas, Intel NUCs, Google Coral
▪ GPU accelerated: AWS Snowball, Digi gateways, Dell Edge Gateways for IoT, HPE Edgeline
▪ Roughly 50 billion MCU chips were shipped in 2020 (market estimates), far exceeding other chips such as GPUs and CPUs (only about 100 million units sold)
IoT Devices - Hardware View
▪ SBCs for DL model training – established area
✓ Numerous papers, libraries, algorithms, and tools exist to enable ML self-learning and re-training
✓ ML framework support, e.g., TF Lite can run on SBCs but not on MCUs (see the inference sketch below)
▪ MCUs for DL model training – emerging area
✓ Edge2Train: Train SVMs on MCUs
✓ Train++: Ultra-fast incremental learning
✓ ML-MCU: Train a 50-class classifier on a $3 IoT chip
[Figures: powerful CPU + basic GPU based SBCs (single board computers); billions of IoT devices are designed using such MCUs and small CPUs]
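To make the framework-support gap concrete, the sketch below runs TF Lite inference on an SBC such as a Raspberry Pi; "model.tflite" is a placeholder path and the input is random data.

```python
# Minimal TF Lite inference sketch for an SBC (illustrative;
# "model.tflite" is a placeholder path, the input is random data).
import numpy as np
from tflite_runtime.interpreter import Interpreter

interpreter = Interpreter(model_path="model.tflite")
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

x = np.random.random_sample(inp["shape"]).astype(np.float32)
interpreter.set_tensor(inp["index"], x)
interpreter.invoke()
print(interpreter.get_tensor(out["index"]))  # model prediction
```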
Distributed Training on IoT Devices
▪ When idle IoT devices are efficiently connected, they can collectively train mid-sized models
✓ Efficiently connecting 20 Alexas can collectively pool 40 GB of RAM
✓ Possible to train at speeds similar to a GPU, but at $0 investment, since millions of IoT devices already exist globally and most of them are idle
[Figure: a $2,500 GeForce RTX 2080 Ti GPU setup with 11 GB RAM (left) compared with common Alexa smart speakers with approx. 2 GB RAM each (right)]
Distributed Training on IoT Devices
▪ With strict privacy regulations, the process of building historic datasets is rapidly transforming into building historic intelligence, achieved by distributed learning on IoT devices followed by central model combining
▪ ML model aggregation rather than data aggregation: using locally generated data to collectively train a model without storing or transmitting the data to a server (see the sketch below)
▪ Use case and model combining methods: https://github.com/bharathsudharsan/ML-Model-Combining
Distributed learning on IoT devices in Osmotic Computing
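The model-aggregation idea can be sketched as federated averaging: each device trains locally and transmits only weights, which the server combines weighted by local dataset size. This is a generic numpy illustration, not the specific combining methods in the linked repository; local_gradient() is a hypothetical helper.

```python
# Sketch of ML model aggregation rather than data aggregation
# (federated-averaging style; local_gradient() is a hypothetical helper).
import numpy as np

def local_update(weights, local_data, lr=0.01, steps=5):
    """Runs on each device: the data never leaves it, only weights do."""
    w = weights.copy()
    for _ in range(steps):
        w -= lr * local_gradient(w, local_data)  # hypothetical helper
    return w

def aggregate(device_weights, num_samples):
    """Runs on the server: combine models, weighted by local dataset size."""
    total = sum(num_samples)
    return sum(w * (n / total) for w, n in zip(device_weights, num_samples))
```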
Distributed, Global Training on IoT Devices
▪ Distributed training of one DL model on the
hardware of thousands of IoT devices is the
future of ML and IoT
➢ Improved convergence speed
➢ Improved data privacy
➢ Effective utilization of idle devices
➢ Avoid investing in GPU clusters or the Cloud
▪ Global training scenarios/setups can be impacted by real-world network uncertainties and staleness; these challenges are presented in the upcoming slides
Servers coordinating with geographically
separated IoT devices to produce a trained model
High FLOPs Consumption
▪ FLOPS = Floating point operations per second. FLOPs = Floating point operations
➢ FLOPS is a unit of speed. FLOPs is a unit of amount
▪ The input/activation of video analytics DL networks has [N, T, C, H, W] as its five dimensions
➢ N is batch size, T is temporal timestamps, C is channel number, H & W are spatial resolution
▪ Computational overhead and network congestion: A 2-D CNN can be applied to each video frame, but then the temporal relationship between frames cannot be modeled/learned
▪ More parameters can cause stalling: For distributed learning of spatio-temporal data, models with 3-D convolutions suffer not only from large model size but also from a large number of parameters (see the cost sketch below)
➢ This is the main reason training and communication slow down, even within a GPU cluster
➢ Training will stall when unexpected network issues are encountered
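A back-of-the-envelope count makes the 3-D blow-up visible. The sketch below assumes dense convolutions with stride 1 and no bias, counts one multiply-accumulate as one FLOP, and uses made-up example layer sizes.

```python
# Back-of-the-envelope parameters/FLOPs for 2-D vs 3-D convolution layers
# (illustrative; dense convolutions, stride 1, no bias, 1 MAC = 1 FLOP).
def conv2d_cost(c_in, c_out, k, h, w):
    params = c_out * c_in * k * k
    return params, params * h * w          # one MAC per weight per pixel

def conv3d_cost(c_in, c_out, k, t, h, w):
    params = c_out * c_in * k * k * k      # extra temporal kernel dimension
    return params, params * t * h * w      # extra temporal output dimension

# Example layer on a [N, T, C, H, W] = [1, 16, 64, 56, 56] activation:
print(conv2d_cost(64, 64, 3, 56, 56))      # (36864, ~116M FLOPs) per frame
print(conv3d_cost(64, 64, 3, 16, 56, 56))  # (110592, ~5.5G FLOPs) per clip
```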
Slow Exchange of Model Gradients
[Figure: network conditions across different continents]
▪ Shanghai to Boston: Even at the speed of light over the direct air distance, it still takes 78 ms to send and receive a packet (see the check below)
➢ 11,725 km × 2 / (3 × 10⁸ m/s) = 78.16 ms. Distance collected from Google Maps
▪ Unlike training inside a data center, long-distance distributed training across continents suffers from high latency, which poses a severe challenge to scaling across the world
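The 78 ms figure follows directly from the distance and the speed of light; as a quick sanity check:

```python
# Physical lower bound on the Shanghai-Boston round trip
# (distance from the slide; real networks are much slower than this bound).
distance_m = 11_725 * 1000         # direct air distance in meters
c = 3e8                            # speed of light in m/s
rtt_ms = 2 * distance_m / c * 1e3  # there and back
print(f"{rtt_ms:.2f} ms")          # -> 78.17 ms (the slide rounds to 78.16)
```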
Staleness Effects
▪ Network communication bottlenecks produce stale parameters
➢ Model parameters arrive late, not reflecting the latest updates
➢ Staleness during training can lead to model instability
➢ Staleness not only slows down convergence but also degrades model performance
➢ Popular distributed model training techniques (e.g., SSGD, ASGD, D2, AD-PSGD) adopt a
nonsynchronous execution approach to handle staleness
➢ Not feasible to monitor and control staleness in today's complex IoT environments, which contain heterogeneous devices using different network protocols
▪ Staleness challenges can be addressed by designing accuracy-guaranteeing dynamic error compensation and network coding techniques (a simple damping illustration follows)
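One simple form of compensation is to damp each update by how stale its gradient is. The sketch below illustrates the idea generically; the 1/(1 + staleness) rule is an assumption, not the rule used by SSGD, ASGD, D2, or AD-PSGD.

```python
# Staleness-damped asynchronous update (illustrative sketch; the
# 1/(1 + staleness) damping rule is an assumption, not a named algorithm).
def apply_stale_gradient(weights, grad, lr, server_step, grad_step):
    staleness = server_step - grad_step  # how many steps late the gradient is
    scale = 1.0 / (1.0 + staleness)      # older gradients count for less
    return weights - lr * scale * grad
```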
Dataset I/O Efficiency
▪ Datasets are usually stored in a high-performance storage system (HPSS) shared across all worker nodes
➢ While HPSS systems have good sequential I/O performance, their random-access performance is inferior, causing bottlenecks under heavy data traffic
▪ Research needs to consider novel data approximation, sampling and filtering methods
➢ Develop a method to identify videos that have multiple similar frames (i.e., nearby frames contain similar information), then load and share only nonredundant frames during distributed training (see the sketch after this list)
➢ Similarly, for other datasets of images and sensor readings, we recommend filtering or downsampling the data without losing information, then distributing it during training
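A minimal version of the nonredundant-frame idea: keep a frame only when it differs enough from the last kept frame. The mean-absolute-difference measure and the threshold below are assumptions for illustration.

```python
# Drop near-duplicate video frames before loading/sharing them in training
# (illustrative sketch; the difference measure and threshold are assumptions).
import numpy as np

def nonredundant_frames(frames, threshold=8.0):
    """Keep a frame only if it differs enough from the last kept frame."""
    kept = [frames[0]]
    for frame in frames[1:]:
        diff = np.mean(np.abs(frame.astype(np.float32)
                              - kept[-1].astype(np.float32)))
        if diff > threshold:  # frames below this are treated as redundant
            kept.append(frame)
    return kept
```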
Design Considerations
▪ Latency and Bandwidth: Dynamic and depend on the network condition, which we cannot control
▪ Scalability: Essential when connecting many devices. To improve it, we need to significantly reduce the communication cost (Tc), which is determined by network bandwidth and latency
➢ Tc = latency + (model size / bandwidth)
➢ If we achieve an X-times training speedup on Y machines, the overall distributed training scalability (defined as X/Y) increases (see the sketch after this list)
▪ IoT Hardware Friendliness: When following SSGD and ASGD, IoT devices need techniques that tolerate extreme network conditions by reducing the data to be transferred
➢ Gradient sparsification, temporally sparse updates, gradient quantization/compression
➢ Accommodating such techniques adds computational strain while consuming the limited memory that is barely sufficient for training models and executing the device's routine functionalities
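The Tc and X/Y definitions above translate directly into a small sizing helper; the link and speedup numbers in the example are assumptions, not measurements.

```python
# Communication cost and scalability, using the slide's definitions
# (illustrative; the example link and speedup numbers are assumptions).
def communication_cost(latency_s, model_bytes, bandwidth_bytes_per_s):
    """Tc = latency + (model size / bandwidth)."""
    return latency_s + model_bytes / bandwidth_bytes_per_s

def scalability(speedup_x, machines_y):
    """X-times speedup on Y machines gives scalability X/Y (1.0 = linear)."""
    return speedup_x / machines_y

# A 25 MB model over a 10 Mbit/s link with 80 ms latency:
print(communication_cost(0.080, 25e6, 10e6 / 8))  # -> 20.08 s per exchange
print(scalability(14, 20))                        # -> 0.7
```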
Two-step Deep Compression Method
▪ Step One: Sets the weight threshold Wt high for training-involved devices that have a poor internet connection
➢ Reduces frequent transmission of weights, saving network bandwidth; i.e., weights are locally accumulated until they reach Wt
➢ Uses the congested network less by not allowing weights to be transmitted frequently
▪ Step Two: Shrinks all the accumulated weights using the encode() function
➢ Similar to how vanilla and Nesterov momentum SGD encode the parameters/gradients
➢ Shrinks and efficiently transmits trained weights without straining IoT hardware
▪ Both steps jointly improve training scalability and speed by tolerating real-world network uncertainties and by reducing the communication-to-computation ratio (see the sketch below)
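Below is a minimal sketch of how the two steps could fit together. The accumulate-until-Wt logic follows the description above, while encode() here is a stand-in 8-bit quantizer, an assumption rather than the method's exact encoder.

```python
# Sketch of the two-step deep compression method (illustrative; encode()
# is a stand-in 8-bit quantizer, not the exact encoder of the method).
import numpy as np

def encode(x):
    """Step-two stand-in: scale to int8, roughly 4x smaller than float32."""
    scale = max(float(np.abs(x).max()) / 127.0, 1e-12)
    return (x / scale).astype(np.int8), scale

class TwoStepCompressor:
    def __init__(self, w_t):
        self.w_t = w_t   # threshold Wt; set high on poor connections
        self.acc = None  # locally accumulated weights

    def step(self, weight_update):
        # Step one: accumulate locally until the threshold Wt is reached,
        # so the congested network is not used for frequent transmissions.
        if self.acc is None:
            self.acc = np.zeros_like(weight_update, dtype=np.float32)
        self.acc += weight_update
        if np.abs(self.acc).max() < self.w_t:
            return None  # nothing transmitted this round
        # Step two: shrink the accumulated weights, then transmit.
        payload = encode(self.acc)
        self.acc[:] = 0.0
        return payload
```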
Two-step Deep Compression Method
Comparing distributed training within a GPU cluster versus training using geographically distributed IoT devices
▪ The proposed two-step deep compression method can
(a) Tolerate latency and increase training speed
(b) Reduce the communication-to-computation ratio to improve scalability and reduce communication costs
Summary
▪ Presented the concept of training DL models on idle IoT devices, millions of which exist across the world
➢ Reduces Cost: Avoids investing in GPU clusters or the Cloud through effective utilization of idle IoT devices
➢ Improves Privacy: The historic-dataset building process transforms into historic-intelligence building
➢ Improves Training Speed: Can pool massive hardware resources if devices are efficiently connected
▪ Identified and studied challenges that impact the distributed, global training on IoT devices
▪ Presented a two-step deep compression method to improve distributed training speed and scalability
➢ Can significantly compress gradients during the training of a wide range of NN architectures
➢ Can also be utilized alongside TF Lite and Federated Learning approaches, thereby providing the basis for a broad spectrum of decentralized and collaborative learning applications
Contact: Bharath Sudharsan
Email: bharath.sudharsan@insight-centre.org
www.confirm.ie