"Even Faster CNNs: Exploring the New Class of Winograd Algorithms," a Presentation from Arm
Agenda
Arm: Extraordinary Growth From Sensors to Server
50 billion chips shipped
Layer breakdown for AlexNet: conv1 17%, conv2 22%, conv3 18%, conv4 18%, conv5 18%
Embedded Vision Summit 2016
Even Smaller Convolution Kernels…
CNN Layer Breakdown
Fully Connected Layer Issue (1)
Fully Connected Layer Issue (2)
GEMM-based Convolution
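GEMM-based convolution lowers every input window into a column of a matrix (the classic im2col trick), so the whole convolution layer becomes a single matrix multiply that can be handed to an optimized GEMM. A minimal NumPy sketch of the idea (function names are illustrative, not Arm's implementation):

```python
import numpy as np

def im2col(x, kh, kw):
    """Lower a (C, H, W) input into a (C*kh*kw, out_h*out_w) matrix:
    each column holds one receptive-field window, flattened."""
    c, h, w = x.shape
    out_h, out_w = h - kh + 1, w - kw + 1
    cols = np.empty((c * kh * kw, out_h * out_w), dtype=x.dtype)
    idx = 0
    for ci in range(c):
        for i in range(kh):
            for j in range(kw):
                cols[idx] = x[ci, i:i + out_h, j:j + out_w].reshape(-1)
                idx += 1
    return cols

def conv_gemm(x, weights):
    """weights: (K, C, kh, kw). Convolution = one GEMM over the lowered input."""
    k, c, kh, kw = weights.shape
    out_h, out_w = x.shape[1] - kh + 1, x.shape[2] - kw + 1
    w_mat = weights.reshape(k, c * kh * kw)       # (K, C*kh*kw)
    return (w_mat @ im2col(x, kh, kw)).reshape(k, out_h, out_w)
```

The cost of this approach is the memory blow-up of the lowered matrix (each input element is duplicated roughly kh*kw times), which is what the later Winograd comparison targets.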
What Did We Do to Improve the Performance?
OpenCL Concepts: Platform Model
OpenCL Concepts: Compute Unit
OpenCL Concepts: Work-items/Work-groups
Improving L1 Cache Utilization: Memory Coalescing
Improving L2 Cache Utilization: Tuning LWS (1)
Improving L2 Cache Utilization: Tuning LWS (2)
Improving L2 Cache Utilization: Tuning LWS (3)
Goal
r0 = d00·w0 + d01·w1 + d02·w2
r1 = d10·w0 + d11·w1 + d12·w2
Introduction
Winograd’s Minimal Filtering Algorithm (1)
m0 = (k0 − k2) · w0
m1 = (k1 + k2) · (w0 + w1 + w2) / 2
m2 = (k2 − k1) · (w0 − w1 + w2) / 2
m3 = (k1 − k3) · w2
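The four products m0..m3 above produce both outputs of the 3-tap filter (r0 = m0 + m1 + m2, r1 = m1 − m2 − m3), i.e. 4 multiplications instead of the 6 a direct computation needs. A small sketch of F(2,3), directly transcribing the slide's formulas:

```python
import numpy as np

def winograd_f23(k, w):
    """F(2,3): two outputs of a 3-tap filter over a 4-element input
    with 4 multiplications (m0..m3) instead of 6."""
    k0, k1, k2, k3 = k
    w0, w1, w2 = w
    m0 = (k0 - k2) * w0
    m1 = (k1 + k2) * (w0 + w1 + w2) / 2
    m2 = (k2 - k1) * (w0 - w1 + w2) / 2
    m3 = (k1 - k3) * w2
    return np.array([m0 + m1 + m2, m1 - m2 - m3])
```

Note that the two halves involving only w (e.g. (w0 + w1 + w2)/2) depend just on the filter, so they can be precomputed once and reused across every input tile.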
Winograd’s Minimal Filtering Algorithm (2)
2D Case: Nest Minimal 1D Algorithms (1)
2D Case: Nest Minimal 1D Algorithms (2)
2D Case: Nest Minimal 1D Algorithms (3)
M0 = (K0 − K2) · W0
M1 = (K1 + K2) · (W0 + W1 + W2) / 2
M2 = (K2 − K1) · (W0 − W1 + W2) / 2
M3 = (K1 − K3) · W2
Complexity Reduction
Y = Aᵀ[(G w Gᵀ) ⊙ (Bᵀ k B)]A
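The matrix form above can be checked numerically. These are the standard B, G and A transform matrices for F(2x2, 3x3) from Lavin and Gray's "Fast Algorithms for Convolutional Neural Networks" (which this class of algorithms follows); a sketch, with the filter transform G·w·Gᵀ computable once per filter:

```python
import numpy as np

# F(2x2, 3x3) transform matrices (Lavin & Gray).
Bt = np.array([[1,  0, -1,  0],
               [0,  1,  1,  0],
               [0, -1,  1,  0],
               [0,  1,  0, -1]], dtype=float)
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]])
At = np.array([[1, 1,  1,  0],
               [0, 1, -1, -1]], dtype=float)

def winograd_2x2_3x3(k, w):
    """Y = At @ ((G w Gt) * (Bt k B)) @ A for a 4x4 input tile k
    and a 3x3 filter w; returns the 2x2 output tile."""
    U = G @ w @ G.T       # 4x4 transformed filter (per-filter, reusable)
    V = Bt @ k @ Bt.T     # 4x4 transformed input tile
    return At @ (U * V) @ At.T
```

Each 2x2 output tile costs 16 element-wise multiplications instead of the 36 a direct 3x3 convolution needs, the 2.25x complexity reduction the next slide refers to.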
Algorithm Design (1)
Algorithm Design (2)
Input Transform
Filter Transform
Element-wise Multiplication as Batched GEMM
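Once inputs and filters are transformed, the element-wise multiplications still have to be accumulated over input channels; gathering them by transform position turns this step into a batch of independent GEMMs (one per position of the 4x4 transform domain). A sketch of the idea with hypothetical shapes (K output channels, C input channels, P tiles — the names are illustrative):

```python
import numpy as np

# Transformed filters U: (16 positions, K, C); transformed tiles V: (16, C, P).
rng = np.random.default_rng(0)
K, C, P = 4, 3, 5
U = rng.standard_normal((16, K, C))
V = rng.standard_normal((16, C, P))

# One (K x C) @ (C x P) GEMM per transform position: the channel
# accumulation of all element-wise products happens inside the matmul.
M = np.einsum('xkc,xcp->xkp', U, V)   # equivalently: np.matmul(U, V)
```

Batching the 16 small GEMMs this way is what lets the element-wise stage reuse the same highly tuned GEMM kernels as GEMM-based convolution.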
Output Transform
Memory Footprint: GEMM-based vs Winograd-based
Optimizing Input/Output Transform
Optimizing batched-GEMM
VGG16 Convolution Layers Breakdown (CPU)
VGG16 Convolution Layers Breakdown (GPU)
GEMM-based vs Winograd-based Convolution (1)
GEMM-based vs Winograd-based Convolution (2)
Extending Winograd-based Convolution: F(4x4,3x3)
Accuracy: Absolute Error
Accuracy: ILSVRC2012
Current Investigations
Conclusion
References
© 2018 Arm Limited
The Arm trademarks featured in this presentation are registered trademarks or
trademarks of Arm Limited (or its subsidiaries) in the US and/or elsewhere. All
rights reserved. All other marks featured may be trademarks of their respective
owners.
www.arm.com/company/policies/trademarks