11. Pre-activation batch normalization
- Apply batch normalization and the non-linearity before the convolution operation (see the sketch below)
- Given m layers, pre-activation generates up to m normalized copies of each feature
- Typically (m - 1)(m - 2)/2 copies in total
“Identity Mappings in Deep Residual Networks” - Kaiming He et al.
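A minimal PyTorch sketch of the pre-activation ordering (BN -> ReLU -> Conv); the class name, kernel size, and growth_rate parameter are illustrative assumptions, not the paper's exact module:

```python
import torch.nn as nn

class PreActivationLayer(nn.Module):
    """One dense layer in pre-activation order: BN -> ReLU -> Conv."""
    def __init__(self, in_channels, growth_rate):
        super().__init__()
        self.norm = nn.BatchNorm2d(in_channels)
        self.relu = nn.ReLU(inplace=True)
        self.conv = nn.Conv2d(in_channels, growth_rate,
                              kernel_size=3, padding=1, bias=False)

    def forward(self, x):
        # Normalization and the non-linearity act on the concatenated input
        # *before* the convolution, so every later layer re-normalizes the
        # same features with its own BN parameters, hence the extra copies.
        return self.conv(self.relu(self.norm(x)))
```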
12. Contiguous Concatenation
- Convolution is most efficient when its input lies in a contiguous block of memory
- To build a contiguous input, each layer must copy all previous features
- Given m layers, each feature may be copied up to m times (illustrated below)
“Memory-Efficient Implementation of DenseNets” - Geoff Pleiss et al.
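A small illustration of why concatenation copies (shapes here are arbitrary): torch.cat allocates a fresh contiguous block and copies every input into it, so a feature produced early in the block is re-copied by every later layer.

```python
import torch

# Three feature maps from previous layers (batch, channels, height, width).
features = [torch.randn(8, 16, 32, 32) for _ in range(3)]

# torch.cat allocates a new contiguous buffer and copies all inputs into it;
# repeating this at every layer copies early features up to m times.
cat_input = torch.cat(features, dim=1)
assert cat_input.is_contiguous()
```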
16. Shared storage for concatenation
- Rather than allocating memory for each concatenation operation, assign the outputs to a memory allocation shared across all layers (sketch below)
- Shared Memory Storage 1 is used by all network layers; its data is not permanent
- The data needs to be recomputed during back-propagation
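A hypothetical sketch of the idea, with illustrative sizes and names: one flat pre-allocated buffer plays the role of Shared Memory Storage 1, and each concatenation copies features into a view of it instead of allocating a new tensor. Because the buffer is reused, its contents are only valid until the next layer writes to it.

```python
import torch

# One flat, pre-allocated buffer acts as "Shared Memory Storage 1"
# (all sizes here are illustrative).
batch, growth, h, w, num_layers = 8, 16, 32, 32, 4
shared_storage = torch.empty(batch * growth * num_layers * h * w)

def concat_into_shared(features):
    """Copy the current feature list into the shared buffer and return a
    contiguous view over the filled region. The buffer is reused by every
    layer, so the result must be recomputed during back-propagation."""
    total_channels = sum(f.size(1) for f in features)
    n = batch * total_channels * h * w
    out = shared_storage[:n].view(batch, total_channels, h, w)
    offset = 0
    for f in features:
        out[:, offset:offset + f.size(1)].copy_(f)
        offset += f.size(1)
    return out
```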
17. Shared storage for batch normalization
- Assign the outputs of batch normalization to a shared memory allocation
- The data in Shared Memory Storage 2 is not permanent and will be overwritten by the next layer
- The batch normalization outputs must therefore be recomputed during back-propagation (see the checkpointing sketch below)
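One way to get this recomputation in PyTorch is torch.utils.checkpoint, which discards the function's intermediates after the forward pass and re-runs it during backward; torchvision's memory-efficient DenseNet follows the same pattern. A sketch with illustrative shapes:

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

norm = nn.BatchNorm2d(48)   # 48 = total channels of the concatenated input
relu = nn.ReLU()

def bn_function(*features):
    # Concatenation + BN + ReLU: cheap to recompute, expensive to store.
    return relu(norm(torch.cat(features, dim=1)))

features = [torch.randn(8, 16, 32, 32, requires_grad=True) for _ in range(3)]
# checkpoint frees the intermediate output after forward and re-runs
# bn_function during back-propagation, trading compute for memory.
bottleneck_input = checkpoint(bn_function, *features)
```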
18. Implementation Details (PyTorch)
- Pre-allocate the shared memory buffers when the DenseBlock is initialized (sketch below)
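A minimal sketch of pre-allocating the shared buffers at construction time; the sizes and names (a fixed maximum batch size and resolution) are assumptions for illustration, not the paper's exact code:

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    def __init__(self, num_layers, in_channels, growth_rate,
                 max_batch=8, height=32, width=32):
        super().__init__()
        final_channels = in_channels + num_layers * growth_rate
        numel = max_batch * final_channels * height * width
        # Allocate the two shared buffers once, when the block is built,
        # so forward passes reuse them instead of allocating per layer.
        # (persistent=False keeps them out of the state_dict.)
        self.register_buffer("shared_concat", torch.empty(numel),
                             persistent=False)  # Shared Memory Storage 1
        self.register_buffer("shared_bn", torch.empty(numel),
                             persistent=False)  # Shared Memory Storage 2
```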