Scott Gray presented at the 2016 ICML conference, covering various ways of computing convolution in the "On-device Intelligence" workshop.
Exploring the Future Potential of AI-Enabled Smartphone Processors
1. An Analysis of Convolution for Inference
24 June 2016
Scott Gray
Nervana Systems
MAKING MACHINES SMARTER.™
2. Proprietary and confidential. Do not distribute. nervana
Direct Convolution
• Compute with in-place slicing + gemm
• Data layout considerations: C, H, W, N
• Minimize slicing logic
• Maximize contiguous access
• Leverage filter overlap
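The slice + GEMM formulation above can be sketched in NumPy. This is an illustrative, unoptimized sketch assuming C,H,W,N layout and no padding; the function name and shapes are ours, not Neon's API:

```python
import numpy as np

def direct_conv_chwn(x, f, stride=1):
    """Direct convolution as in-place slicing + GEMM (illustrative sketch)."""
    C, H, W, N = x.shape           # input:  channels, height, width, batch
    C2, R, S, K = f.shape          # filter: channels, height, width, out-channels
    assert C == C2
    P = (H - R) // stride + 1      # output height (no padding)
    Q = (W - S) // stride + 1      # output width
    out = np.zeros((K, P, Q, N), dtype=x.dtype)
    for p in range(P):
        for q in range(Q):
            # contiguous input slice of shape (C, R, S, N)
            tile = x[:, p*stride:p*stride+R, q*stride:q*stride+S, :]
            # GEMM: (K, C*R*S) x (C*R*S, N) -> (K, N)
            out[:, p, q, :] = f.reshape(C*R*S, K).T @ tile.reshape(C*R*S, N)
    return out
```

Each (p, q) output position reads one contiguous (C, R, S, N) slice and reduces it with a single GEMM; real kernels block and fuse these steps and reuse the filter overlap between adjacent slices.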
3. Small N direct convolution: Without Superblocking
fprop
Q = (W-S+1 + 2 * pad) / stride
wi = sk + qj * stride - pad
Fig from V. Dumoulin,
https://github.com/vdumoulin/conv_arithmetic
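As a quick numeric check of the fprop index relation above, wi = sk + qj*stride - pad, with illustrative sizes (not from the talk):

```python
# stride 1 keeps the output-width formula unambiguous for this demo
W, S, pad, stride = 5, 3, 1, 1
Q = (W - S + 1 + 2 * pad) // stride        # 5 output columns
# input column touched by each filter tap sk at each output column qj
cols = {qj: [sk + qj * stride - pad for sk in range(S)] for qj in range(Q)}
# Indices outside [0, W) fall in the zero padding and contribute zeros.
```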
4. Small N direct convolution: With Superblocking
fprop
Q = (W-S+1 + 2 * pad) / stride
wi = sk + qj * stride - pad
5. Small N direct convolution: Bprop for deconv
bprop
pad’ = S - pad - 1
wi = (qj - pad’ + sk) / stride
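A numeric check of the bprop relations above, pad' = S - pad - 1 and wi = (qj - pad' + sk) / stride, with illustrative sizes; in bprop only taps where the division comes out even hit a real element (fractional striding):

```python
S, pad, stride = 3, 1, 2
pad_b = S - pad - 1                       # pad' = 1 for a 3x3 filter, pad 1

def taps(qj):
    """Input positions wi reached from output column qj (illustrative)."""
    return [(qj - pad_b + sk) // stride
            for sk in range(S)
            if (qj - pad_b + sk) % stride == 0]
```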
6. Small N direct convolution: Dilated Filters
Dilated
S’ = (S-1) * rate + 1
Q = (W-S’+1 + 2*pad) / stride
wi = sk * rate + qj * stride - pad
Fig from F. Yu, V. Koltun
http://arxiv.org/abs/1511.07122v3
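A numeric check of the dilated-filter formulas above, with illustrative sizes (not from the talk):

```python
W, S, rate, pad, stride = 7, 3, 2, 2, 1
S_eff = (S - 1) * rate + 1                 # effective filter width S' = 5
Q = (W - S_eff + 1 + 2 * pad) // stride    # 7 output columns
# input columns touched at output column qj = 0: taps spaced by the rate
taps0 = [sk * rate + 0 * stride - pad for sk in range(S)]
```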
7. Convolution with Algorithmic Speedups
• FFT and Winograd have the same basic computational flow
• FFT tiles typically need to be much bigger
• Winograd history: Toom and Cook, then Lavin
8. Winograd: input transform
Fig: 4x4 input tiles, stride 2, over the input feature map
• Input transform
• 2D Winograd is a nested product of 1D transforms
• Transforms can be simplified to remove zeros
9. Winograd: filter transform
• Filter transform
• Same as input but with different coefficients
• Transform each feature map independently
10. Winograd: batched GEMM
• Point-wise multiplication
• Posed as a batched GEMM operation
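The point-wise stage can be sketched as 16 independent GEMMs, one per position of the flattened 4x4 transform space; the shapes below are illustrative:

```python
import numpy as np

K, C, T = 8, 4, 32                   # out-channels, in-channels, tiles*batch
U = np.random.randn(16, K, C)        # transformed filters, 4x4 flattened to 16
V = np.random.randn(16, C, T)        # transformed input tiles
# one (K,C) x (C,T) GEMM per transform position, i.e. a batched GEMM
M = np.einsum('xkc,xct->xkt', U, V)
```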
11. Winograd: output transform
Fig: output feature map
• Output transform
• Same as input and filter
• Transform back to pixel space to obtain the 2x2 output tile
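Putting the four steps together (input transform, filter transform, point-wise multiply, output transform), a minimal single-tile F(2x2,3x3) sketch in NumPy, using the transform matrices from Lavin and Gray (arXiv:1509.09308); the helper names are ours:

```python
import numpy as np

# F(2x2,3x3) transform matrices (Lavin & Gray)
Bt = np.array([[1, 0, -1, 0],
               [0, 1,  1, 0],
               [0, -1, 1, 0],
               [0, 1,  0, -1]], dtype=float)
G  = np.array([[1.0,  0.0, 0.0],
               [0.5,  0.5, 0.5],
               [0.5, -0.5, 0.5],
               [0.0,  0.0, 1.0]])
At = np.array([[1, 1,  1,  0],
               [0, 1, -1, -1]], dtype=float)

def winograd_2x2_3x3(d, g):
    """One 4x4 input tile d, one 3x3 filter g -> 2x2 output tile."""
    U = G @ g @ G.T        # filter transform (4x4)
    V = Bt @ d @ Bt.T      # input transform  (4x4)
    M = U * V              # point-wise multiply (the batched-GEMM stage)
    return At @ M @ At.T   # output transform back to pixel space (2x2)

def direct_2x2_3x3(d, g):
    """Direct 3x3 correlation on the same tile, for comparison."""
    out = np.zeros((2, 2))
    for i in range(2):
        for j in range(2):
            out[i, j] = np.sum(d[i:i+3, j:j+3] * g)
    return out
```

Each tile costs 16 multiplies in transform space versus 36 for the direct 2x2 output, the 2.25x factor on the multiplier-efficiency slide.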
14. Multiplier Transistor Efficiency
Algo     bits   speedup   transistors   performance/transistor
Direct    8      1.0        3000          1.0
2x2       9      2.25       3750          1.8
4x4      12      4.0        6000          2.0
Transistor Counts from Wikipedia:
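The speedup column follows from counting multiplies per output pixel (a quick sanity check, not from the slide):

```python
direct = 3 * 3              # 9 multiplies per output pixel for a 3x3 filter
f2x2 = (4 * 4) / (2 * 2)    # F(2x2,3x3): 16 multiplies per 2x2 tile -> 4/pixel
f4x4 = (6 * 6) / (4 * 4)    # F(4x4,3x3): 36 multiplies per 4x4 tile -> 2.25/pixel
speedup_2x2 = direct / f2x2  # 2.25x fewer multiplies
speedup_4x4 = direct / f4x4  # 4.0x fewer multiplies
```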
15. Logarithmic quantization
D. Miyashita, E. H. Lee, B. Murmann, "Convolutional Neural Networks using Logarithmic Data Representation," http://arxiv.org/abs/1603.01025v2
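A hedged sketch of the cited paper's idea: store values as signed powers of two so multiplies become bit shifts. The parameter names (fsr, bits) loosely follow the paper, but the exact clipping scheme here is illustrative:

```python
import numpy as np

def log_quantize(x, fsr=0, bits=4):
    """Round each value to the nearest signed power of two (illustrative)."""
    x = np.asarray(x, dtype=float)
    sign = np.sign(x)                                     # zeros stay zero
    mag = np.maximum(np.abs(x), 2.0 ** (fsr - 2 ** bits))  # avoid log2(0)
    exp = np.clip(np.round(np.log2(mag)), fsr - 2 ** bits, fsr)
    return sign * np.exp2(exp)
```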
16. Performance: VGG fp32 on GTX 1080
Fig: effective TFLOPS (0-25) vs. batch size (64, 32, 16, 8, 4, 2, 1), totals over all VGG layers; series: Neon Direct, Neon F(2x2,3x3), Neon F(4x4,3x3), cuDNN FFT
17. Peak Performance: VGG fp32 on GTX 1080
Fig: effective TFLOPS (0-25) vs. batch size (64, 32, 16, 8, 4, 2, 1) for VGG layer 4.2; series: Neon Direct, Neon F(2x2,3x3), Neon F(4x4,3x3), cuDNN FFT