© 2021 Arm
David Cochard & Kazuki Kyakuno
21st September 2021
Optimizing NN inference
performance on Arm NEON
and Vulkan using ailia SDK
ax Inc
Welcome!
Tweet us: @ArmSoftwareDev -> #AIVTT
Check out our Arm Software Developers YouTube channel
Sign up now for our next AI Virtual Tech Talk: developer.arm.com/techtalks
Attendees: don’t forget to fill out the survey to be in with a chance of winning
an Arduino Nano 33 BLE board
Our upcoming Arm AI Tech Talks
Date | Title | Host
September 21st | Optimizing NN inference performance on Arm NEON and Vulkan using the ailia SDK | ax Inc
October 5th | EON Tuner: AutoML for real-world embedded devices | Edge Impulse
October 28th | Development to Deployment of Endpoint AI Vision Algorithms Based on Arm Architecture (ARM架构端侧AI视觉算法的开发到部署) | ICE TECH
November 2nd | Getting started with running Machine Learning on Arm Ethos-U55 | Arm
Visit: developer.arm.com/techtalks
Presenters
David Cochard,
Engineering Manager,
ax Inc.
Kazuki Kyakuno,
CTO,
ax Inc.
Agenda
・Presentation of our solution “ailia”
・Optimize computation on CPU
・Optimize computation on GPU
Company Profile
Name ax Inc.
Location Tokyo, Japan
CEO TERADA Takehiko
Business
・Development and provision of the ailia SDK
・AI consulting, AI model training,
and other AI businesses
“AI everywhere”
We believe that AI will be used on every device in the near future.
We are developing fast SDKs and constantly researching the latest AI models
to help accelerate the evolution of the AI era.
ax Inc. aims to provide tools to implement the latest AI capabilities to solve
real-world problems.
There are many challenges
in implementing AI for various devices.
Those challenges include:
• Wide variety of AI models
• Multiple platforms
• Various programming languages
• Long-term API consistency
• Performance optimization for each device
and more.
ailia SDK
ailia SDK is an AI framework leveraging CPU and GPU to achieve high-performance AI inference
It supports ONNX (opset 10 & 11) and enables high-performance inference using NEON and Vulkan
It offers over 140 pre-trained models, ready to be integrated into our clients’ applications
https://ailia.jp/en
Benefits when implementing AI on edge devices
Conventional workflow: search for the optimal model → clone the model repository → set up the development environment → install dependent libraries → convert the model → verify results (NG: back to model search) → set up the edge environment → implement pre- and post-processing → verify results (NG: back to earlier steps)
ailia SDK workflow: install the ailia SDK → download ailia MODELS → execute the sample code
ailia SDK Architecture
ailia MODELS (built on top of the ailia SDK)
ailia SDK:
• API (C++, Python, C#, JNI)
• ONNX (opset=10, 11) (supporting over 100 layers)
• Runtime Graph Optimization
• Accelerator (Convolution, Pooling, Resize, Add, and more)
• Backends: Vulkan | Metal | NEON
ailia MODELS
Over 140 models compatible with the ailia SDK are publicly available on GitHub
You can easily try out the latest models such as YOLOv4, MiDaS and PaddleOCR
https://github.com/axinc-ai/ailia-models
ailia MODELS
ailia MODELS has samples for Python, C++ and Unity.
Vulkan-accelerated inference can be used from any of these environments.
(Examples: Hand Detection, Depth Estimation)
Optimize computation on CPU
NEON implementation in AI
AI workloads involve far more computation than typical applications, so significant speedups are required.
Thanks to the high degree of parallelism in layer operations, there is a lot of room for optimization using SIMD instructions such as NEON.
There are many types of layers to implement, and many opportunities to apply various optimization techniques.
NEON overview
NEON provides scalar/vector instructions and registers for Arm CPUs
It performs parallel calculations on 128-bit units; for FP32, up to 4 elements are processed simultaneously
(Diagram: a single SIMD instruction operating on two 128-bit registers, 4 floats at a time.)
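As a minimal illustration of 4-wide FP32 processing (our own sketch, not ailia SDK code), the following adds two float arrays four elements at a time, assuming count is a multiple of 4:

#include <arm_neon.h>

void add_f32(float* dst, const float* a, const float* b, int count) {
    for (int i = 0; i < count; i += 4) {
        float32x4_t va = vld1q_f32(a + i);        // load 4 floats from a
        float32x4_t vb = vld1q_f32(b + i);        // load 4 floats from b
        vst1q_f32(dst + i, vaddq_f32(va, vb));    // add 4 lanes, store 4 floats
    }
}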
Basic techniques
Replace code branching with a bit-select instruction
Process the sign of a number using bit manipulation
Reorder data structures on load / store
// save the sign bit from the input value
// (cast_u32 / cast_f32 stand for vreinterpretq_u32_f32 / vreinterpretq_f32_u32)
mask = vdupq_n_u32(1u << 31);
sign = vandq_u32(cast_u32(x), mask);
v = vabsq_f32(x);
// .. any operation with the absolute value ..
// restore the sign bit with xor
res = cast_f32(veorq_u32(cast_u32(v), sign));
// remove branch with bit select
// dst[i]=(src[i]<0.0f)?src[i]*alpha:src[i];
val = vld1q_f32(src);
sel = vcltq_f32(val, zero);
mod = vmulq_n_f32(val, alpha);
vst1q_f32(dst, vbslq_f32(sel, mod, val));
// reorder data with a structure load (de-interleaving a 4x4 matrix)
float x[16] = {
11, 12, 13, 14,
21, 22, 23, 24,
31, 32, 33, 34,
41, 42, 43, 44,
};
v = vld4q_f32(x);
// v.val[0] : { 11, 21, 31, 41 }
// v.val[1] : { 12, 22, 32, 42 }
// v.val[2] : { 13, 23, 33, 43 }
// v.val[3] : { 14, 24, 34, 44 }
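To make the branch-removal snippet above concrete, here is a self-contained sketch (our own, assuming count is a multiple of 4) of the LeakyReLU loop it comes from:

#include <arm_neon.h>

// dst[i] = (src[i] < 0.0f) ? src[i] * alpha : src[i], four elements at a time
void leaky_relu_neon(float* dst, const float* src, int count, float alpha) {
    const float32x4_t zero = vdupq_n_f32(0.0f);
    for (int i = 0; i < count; i += 4) {
        float32x4_t val = vld1q_f32(src + i);          // load 4 floats
        uint32x4_t  sel = vcltq_f32(val, zero);        // lanes where val < 0
        float32x4_t mod = vmulq_n_f32(val, alpha);     // val * alpha
        vst1q_f32(dst + i, vbslq_f32(sel, mod, val));  // per-lane select, store
    }
}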
Approximation
NEON implementations use approximate expressions for transcendental functions such as exp, log, and erf
// log(x) = log((2^n) * z) = n*log(2) + log(z)
// log(z) ≈ 2 * (w + (w^3)/3 + (w^5)/5 + ..) : w = (z-1)/(z+1)
log2 = 0.6931471805599453f;
n = pick_exponent(x);
z = pick_fractional(x);
w = (z-1.0) / (z+1.0);
ww = w*w;
r = (1.0/7.0) + (1.0/9.0)*ww;
r = (1.0/5.0) + r*ww;
r = (1.0/3.0) + r*ww;
r = (1.0 + r*ww);
r = r*w; // w + (w^3)/3 + (w^5)/5 + (w^7)/7 + (w^9)/9
return (n*log2 + r*2);
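A hedged NEON sketch of the same polynomial (our reconstruction, not ailia's code; pick_exponent / pick_fractional are realized with bit manipulation, and it is valid only for normalized positive inputs):

#include <arm_neon.h>

float32x4_t log_approx_f32x4(float32x4_t x) {
    const uint32x4_t u = vreinterpretq_u32_f32(x);
    // exponent n = ((bits >> 23) & 0xFF) - 127
    int32x4_t n = vsubq_s32(
        vreinterpretq_s32_u32(vshrq_n_u32(u, 23)), vdupq_n_s32(127));
    // fractional z in [1, 2): keep the mantissa, force the exponent to 0
    float32x4_t z = vreinterpretq_f32_u32(vorrq_u32(
        vandq_u32(u, vdupq_n_u32(0x007FFFFFu)), vdupq_n_u32(0x3F800000u)));
    const float32x4_t one = vdupq_n_f32(1.0f);
    // w = (z-1)/(z+1): division via reciprocal estimate + two Newton steps
    float32x4_t den = vaddq_f32(z, one);
    float32x4_t rcp = vrecpeq_f32(den);
    rcp = vmulq_f32(rcp, vrecpsq_f32(den, rcp));
    rcp = vmulq_f32(rcp, vrecpsq_f32(den, rcp));
    float32x4_t w  = vmulq_f32(vsubq_f32(z, one), rcp);
    float32x4_t ww = vmulq_f32(w, w);
    // same Horner scheme as the scalar version above
    float32x4_t r = vmlaq_f32(vdupq_n_f32(1.0f/7.0f), vdupq_n_f32(1.0f/9.0f), ww);
    r = vmlaq_f32(vdupq_n_f32(1.0f/5.0f), r, ww);
    r = vmlaq_f32(vdupq_n_f32(1.0f/3.0f), r, ww);
    r = vmlaq_f32(one, r, ww);
    r = vmulq_f32(r, w);
    // log(x) = n*log(2) + 2*r
    return vmlaq_f32(vmulq_f32(r, vdupq_n_f32(2.0f)),
                     vcvtq_f32_s32(n), vdupq_n_f32(0.6931471805599453f));
}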
Threading
Taking advantage of Arm big.LITTLE technology, performance gains can be achieved by efficiently assigning jobs to different cores.
Adjust the processing unit according to the cache size.
Dividing a task into units of equal size does not fit a big.LITTLE SoC design.
Assign more processing-intensive tasks to big cores, and smaller ones to little cores, as in the sketch below.
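A hedged sketch of uneven work splitting (Linux-specific, our own; the core IDs and weights in the usage example are illustrative assumptions, not a real SoC topology):

#include <pthread.h>
#include <sched.h>
#include <thread>
#include <utility>
#include <vector>

// Pin the calling thread to one CPU core (Linux).
static void pin_to_core(int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &set);
}

// Split [0, total) unevenly; each entry of `cores` is {core id, weight},
// so big cores receive proportionally larger chunks.
template <typename Body>
void weighted_parallel_for(int total,
                           std::vector<std::pair<int, int>> cores,
                           Body body) {
    int weight_sum = 0;
    for (const auto& c : cores) weight_sum += c.second;
    std::vector<std::thread> pool;
    int begin = 0;
    for (size_t i = 0; i < cores.size(); ++i) {
        const int len = (i + 1 == cores.size())
                            ? total - begin  // last chunk takes the remainder
                            : total * cores[i].second / weight_sum;
        const int core = cores[i].first;
        const int b = begin;
        pool.emplace_back([=] {
            pin_to_core(core);
            body(b, b + len);  // process rows [b, b+len)
        });
        begin += len;
    }
    for (auto& t : pool) t.join();
}

// Example (assumed topology): 3x more rows per big core (4-7) than little (0-3)
// weighted_parallel_for(rows, {{4,3},{5,3},{6,3},{7,3},{0,1},{1,1},{2,1},{3,1}},
//                       [&](int b, int e) { compute_rows(b, e); });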
Benchmark
Comparison of inference time with NEON enabled and disabled
SoC            | Model    | Improvement (w/ vs w/o NEON)
Snapdragon 888 | ResNet50 | 2.15 times faster
Snapdragon 888 | YOLOv3   | 3.90 times faster
Exynos 9820    | ResNet50 | 3.08 times faster
Exynos 9820    | YOLOv3   | 2.86 times faster
Future work
The vector length of NEON is fixed at 128 bits, but SVE2, added in Armv9-A, enables wider vector operations (e.g. 512 bits)
Going forward, ailia SDK will continue to support the latest instruction sets
Optimize computation on GPU
Benefits of Vulkan
Support for all major GPUs
Support for all major OSs: Windows, Android, Linux
Easy installation for GPU inference: being widely used for gaming, it only requires standard drivers to run
Little additional disk space usage: ailia_vulkan.dll is only 2.8 MB
AI Inference Acceleration with Vulkan
Fast AI inference achieved using runtime graph optimization and layer fusion
Optimized GEMM logic implemented with Vulkan compute shaders
Layers such as Convolution and Pooling implemented with Vulkan compute shaders
The Winograd algorithm implemented with shaders to accelerate the heavy Convolution layers
Roles between CPU and GPU
CPU (through the computing API / GPU driver):
• Allocate GPU memory
• Send data to GPU memory
• Create the compute kernel on the GPU
• Dispatch the compute kernel on the GPU
• Fetch the compute result from GPU memory
GPU:
• Compute using the compute kernel
Code required for Vulkan
Create VkInstance
Select VkPhysicalDevice
Create VkDevice
Get VkQueue
Allocate VkDeviceMemory
Bind VkDeviceMemory to VkBuffer
Send data from host to device
Configure VkShaderModule
Create VkDescriptorSetLayout
Create VkPipelineLayout
Create VkPipeline
Create VkDescriptorPool
Create VkDescriptorSet
Create VkCommandPool
Create VkCommandBuffer
Register VkPipeline to VkCommandBuffer
Register VkDescriptorSet to VkCommandBuffer
Call vkCmdDispatch
Execute and wait for VkCommandBuffer to complete
Transfer data from device to host
A condensed sketch of the first few steps follows.
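A condensed sketch (our own simplification; error handling, queue-family selection and cleanup omitted) of the instance / device / queue steps at the top of the list:

#include <vulkan/vulkan.h>

VkQueue create_compute_queue(VkInstance* outInstance, VkDevice* outDevice) {
    VkApplicationInfo app = { VK_STRUCTURE_TYPE_APPLICATION_INFO };
    app.apiVersion = VK_API_VERSION_1_1;
    VkInstanceCreateInfo ici = { VK_STRUCTURE_TYPE_INSTANCE_CREATE_INFO };
    ici.pApplicationInfo = &app;
    vkCreateInstance(&ici, NULL, outInstance);           // Create VkInstance

    uint32_t count = 1;
    VkPhysicalDevice phys;
    vkEnumeratePhysicalDevices(*outInstance, &count, &phys); // take the first GPU

    float priority = 1.0f;
    VkDeviceQueueCreateInfo qci = { VK_STRUCTURE_TYPE_DEVICE_QUEUE_CREATE_INFO };
    qci.queueFamilyIndex = 0;  // assumption: family 0 supports compute (verify!)
    qci.queueCount = 1;
    qci.pQueuePriorities = &priority;
    VkDeviceCreateInfo dci = { VK_STRUCTURE_TYPE_DEVICE_CREATE_INFO };
    dci.queueCreateInfoCount = 1;
    dci.pQueueCreateInfos = &qci;
    vkCreateDevice(phys, &dci, NULL, outDevice);         // Create VkDevice

    VkQueue queue;
    vkGetDeviceQueue(*outDevice, 0, 0, &queue);          // Get VkQueue
    return queue;
}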
Example of GPU Computing
Add the corresponding elements of A and B and write the result to the corresponding element of the output, all buffers having the same shape:
Input A [ z=4, y=360, x=640 ] + Input B [ z=4, y=360, x=640 ] → Output [ z=4, y=360, x=640 ]
Example of GPU Computing : Device code (GPU code)
#version 450
layout(std430, binding = 0) writeonly buffer Dst {
float data[];
} dst;
layout(std430, binding = 1) readonly buffer Src_A {
float data[];
} src_a;
layout(std430, binding = 2) readonly buffer Src_B {
float data[];
} src_b;
layout(local_size_x = 64) in;
void main()
{
// Exception handling of fractional blocks is omitted
const uint id = gl_GlobalInvocationID.x;
dst.data[id] = src_a.data[id] + src_b.data[id];
}
Example of GPU Computing : Host code (CPU code)
// Instance initialization, device selection, buffer allocation, shader module construction, etc. are omitted
vkBeginCommandBuffer(cmdBuf, &beginInfo); // Start recording the command buffer
// Register the pipeline in the command buffer -- the shader module is registered in the pipeline
vkCmdBindPipeline(cmdBuf, VK_PIPELINE_BIND_POINT_COMPUTE, pipeline);
// Register the descriptor set in the command buffer -- the I/O buffer bindings are registered in descSet
// (pipelineLayout comes from the omitted initialization)
vkCmdBindDescriptorSets(cmdBuf, VK_PIPELINE_BIND_POINT_COMPUTE, pipelineLayout,
                        0, 1, &descSet, 0, NULL);
// Specify how many workgroups to launch
// Runs with the most recently bound pipeline and descriptor set
// Example: processing z=4, y=360, x=640 one-dimensionally with localsize = 64
vkCmdDispatch(cmdBuf, (4*360*640)/64, 1, 1);
vkEndCommandBuffer(cmdBuf); // End command buffer recording
// Pass the command buffer to the device and run the shader -- cmdBuf is registered in submitInfo
vkQueueSubmit(queue, 1, &submitInfo, VK_NULL_HANDLE);
// Waiting for completion and reading back the data are omitted
Workgroups and Threads (Invocations)
The host (CPU) code specifies how many workgroups to launch (groupCount, passed to vkCmdDispatch)
The device (GPU) code (the shader) describes what each thread does
localsize specifies the number of threads (invocations) per workgroup
(Diagram: vkCmdDispatch on the CPU launches groupCount workgroups on the GPU; each workgroup contains localsize threads.)
Arm GPU hardware configuration example
https://developer.arm.com/ip-products/graphics-and-multimedia/mali-gpus/mali-g76-gpu
Mali-G76: configurable from 4 to 20 shader cores, delivering the largest capability of a Mali GPU at the time
3 execution engines per shader core
8 execution lanes per engine
Correspondence with the API
vkCmdDispatch(.., groupCountX, groupCountY, groupCountZ);
Kernel launch from host code
Launches (groupCountX * groupCountY * groupCountZ) workgroups
One workgroup is assigned to one execution engine
If all the execution engines can be filled, the GPU is fully used
layout(local_size_x=64) in;
Thread-count specification in device code
Specifies the number of local threads (invocations) in a workgroup
If the lanes of the execution engine can be filled, the execution engine is fully utilized
Relation between localsize and dispatch
If the localsize is 64 and the image size is 640 * 360, you need to run (360 * 640) / 64 = 3600 workgroups to process the entire image
(Diagram: a 640 x 360 image tiled into workgroups of 64 threads each; groupCount = (360*640)/64.)
How to choose localsize
The limit on localsize is defined by maxComputeWorkGroupSize
maxComputeWorkGroupSize = {384, 384, 384} (Mali-G76): maximum size of a local compute workgroup, per dimension
maxComputeWorkGroupInvocations = 384 (Mali-G76): maximum total number of compute shader invocations in a single local workgroup
The Arm Mali GPU Best Practices Developer Guide says:
“Use 64 as a baseline workgroup size. Do not use more than 64 threads per workgroup.”
For complex and heavy kernels, further localsize adjustment is required; the limits can be queried as sketched below.
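A minimal sketch (our own) of querying the two limits named above through the standard Vulkan API before choosing a localsize:

#include <vulkan/vulkan.h>
#include <cstdio>

void print_compute_limits(VkPhysicalDevice physicalDevice) {
    VkPhysicalDeviceProperties props;
    vkGetPhysicalDeviceProperties(physicalDevice, &props);
    const VkPhysicalDeviceLimits& lim = props.limits;
    printf("maxComputeWorkGroupSize = {%u, %u, %u}\n",
           lim.maxComputeWorkGroupSize[0],
           lim.maxComputeWorkGroupSize[1],
           lim.maxComputeWorkGroupSize[2]);
    printf("maxComputeWorkGroupInvocations = %u\n",
           lim.maxComputeWorkGroupInvocations);
}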
Time spent by localsize
(Benchmark chart of kernel time vs. localsize:)
• Exceeding maxComputeWorkGroupSize causes an error
• Performance saturates at localsize >= 32
• localsize = 1 gives the worst performance, since each workgroup then uses only a single lane
GEMM
General Matrix Multiply: multiplication of dense 2D matrices
Often used in scientific computing (computer simulation)
Patterson & Hennessy, "Computer Organization and Design": a 1426-page textbook that tells, among other things, the story of speeding up GEMM
Matrix A * Matrix B = Matrix C
Basic implementation of GEMM
// I/O definitions omitted (case of trans_a == false && trans_b == false)
#define BLOCK_SIZE 8
layout(local_size_x=BLOCK_SIZE, local_size_y=BLOCK_SIZE) in;
// shared memory can be referenced by the whole workgroup
shared float sa[ BLOCK_SIZE * BLOCK_SIZE ];
shared float sb[ BLOCK_SIZE * BLOCK_SIZE ];
void main()
{
uint lx = gl_LocalInvocationID.x; uint ly = gl_LocalInvocationID.y;
uint dx = gl_WorkGroupID.x * BLOCK_SIZE; uint dy = gl_WorkGroupID.y * BLOCK_SIZE;
float sum = 0.0;
for (uint k=0; k<TOTAL_K; k+=BLOCK_SIZE) {
// each thread loads one element of A and one of B into shared memory
sa[ ly * BLOCK_SIZE + lx ] = src_a.data[ (dy + ly) * src_a_width + ( k+lx) ];
sb[ ly * BLOCK_SIZE + lx ] = src_b.data[ ( k + ly) * src_b_width + (dx+lx) ];
barrier();
// between the barriers, accumulate using the tiles loaded by other threads
for (uint i=0; i<BLOCK_SIZE; ++i) {
sum += sa[ ly * BLOCK_SIZE + i ] * sb[ i * BLOCK_SIZE + lx ];
}
barrier();
}
// Exception handling of fractional blocks is omitted
dst.data[ (dy+ly) * dst_width + (dx+lx) ] = sum;
}
Features of the basic GEMM implementation
• 1 thread is responsible for 1 output element
• localsize is 8 * 8 = 64, with threads created as a 2D block
• A and B are both staged in shared memory
• Between barrier() calls, each thread computes its output element from the shared memory loaded by other threads in the workgroup
Related research about GEMM
V. Volkov and J. Demmel, “Benchmarking GPUs to Tune Dense Linear Algebra”
https://www.cs.colostate.edu/~cs675/volkov08-sc08talk.pdf
• Increasing the localsize makes register spills more likely
• Shared memory is slower than registers
• Putting only B in shared memory and keeping A in registers made it faster
• A larger block size reduces global memory accesses
Optimized implementation of GEMM
// I/O definitions omitted
#define BLOCK_SIZE 16
layout(local_size_x=BLOCK_SIZE) in;
// shared memory can be referenced by the whole workgroup; only B is staged here
shared vec4 sb[ BLOCK_SIZE ];
void main()
{
// Coordinate calculation omitted (lx, dx, dy come from invocation/workgroup IDs)
float sum[BLOCK_SIZE];
for (int i=0; i<BLOCK_SIZE; ++i) { sum[i] = 0.0f; }
for (uint k=0; k<TOTAL_K; k+=4) {
sb[ lx ] = load_b(dx+lx, k);
barrier();
// A stays in a plain variable, expected to be register-allocated
vec4 wa = load_a(dy+lx, k);
for (uint i=0; i<BLOCK_SIZE; ++i) {
sum[i] += dot(wa, sb[i]); // vec4 dot product: 4 MACs per operation
}
barrier();
}
// Exception handling of fractional blocks is omitted
for (int i=0; i<BLOCK_SIZE; ++i) {
dst.data[ (dy+lx) * dst_width + (dx+i) ] = sum[i];
}
}
Features of the new GEMM implementation
• 1 thread is responsible for BLOCK_SIZE output elements
• localsize is one-dimensional, with BLOCK_SIZE = 16 threads
• Only B is stored in shared memory
• A is a normal variable, expected to be register-allocated
• The vec4 type, packing 4 floats, lets one thread use 4 arithmetic units
Performance comparison on BERT
BERT is a Transformer-based natural language processing model
Its computation is dominated by GEMM (MatMul)
Inference is 2.5 to 3.5 times faster with the optimized implementation
Based on our benchmark, it is for example 3.53 times faster on an Arm Mali-G76
The Winograd Algorithm
The arithmetic complexity of the convolution layer can be reduced by applying transforms to the inputs and the output
Although the transforms add some work, the matrix multiplication that takes up most of the computation time is reduced
https://arxiv.org/abs/1509.09308
For a 3x3 convolution, the number of multiply operations per output tile can be reduced from 36 to 16
Pipeline: Image → Transform, Filter weights → Transform, then Matrix multiplication, then InvTransform
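To illustrate the idea, here is a 1D Winograd F(2,3) sketch (our example, not ailia's shader code): 2 outputs of a 3-tap convolution in 4 multiplications instead of 6. The 2D F(2x2, 3x3) case nests this construction, giving 16 multiplications instead of 36 per 2x2 output tile:

// y[0] = d0*g0 + d1*g1 + d2*g2, y[1] = d1*g0 + d2*g1 + d3*g2
void winograd_f23(const float d[4], const float g[3], float y[2]) {
    // Filter transform (can be precomputed once per filter)
    float g0 = g[0];
    float g1 = 0.5f * (g[0] + g[1] + g[2]);
    float g2 = 0.5f * (g[0] - g[1] + g[2]);
    float g3 = g[2];
    // Input transform + the 4 multiplications
    float m0 = (d[0] - d[2]) * g0;
    float m1 = (d[1] + d[2]) * g1;
    float m2 = (d[2] - d[1]) * g2;
    float m3 = (d[1] - d[3]) * g3;
    // Inverse transform: equals the direct 3-tap convolution
    y[0] = m0 + m1 + m2;
    y[1] = m1 - m2 - m3;
}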
Matrix Multiplication Optimization
The GEMM block size is M=N=K=4 (M=8 on some architectures)
Each invocation computes the matrix product for one block
No workgroup synchronization is needed
16-byte memory alignment allows reading a vec4 in a single instruction
Memory access is optimized by keeping data in vector format as much as possible in later processing
Command Buffer Management
The overhead of vkQueueSubmit being large, a sub-thread is used to reduce the load
1. Start a new thread to process the command buffers
2. When a request is made through ailia’s API, the command buffer is added to the sub-thread’s queue (ready_command_buffer_vector)
3. When the sub-thread is idle, the queued command buffers are sent all at once to vkQueueSubmit, followed by vkQueueWaitIdle
(Diagram: the main thread parses the graph and enqueues command buffers such as Convolution + Activation and Pooling; the sub-thread submits and waits.)
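A hedged sketch of this scheme (our own, with assumed names; shutdown handling simplified): a sub-thread drains the queue and submits the whole batch in one vkQueueSubmit call:

#include <vulkan/vulkan.h>
#include <condition_variable>
#include <mutex>
#include <vector>

std::mutex mtx;
std::condition_variable cv;
std::vector<VkCommandBuffer> ready;  // ready_command_buffer_vector
bool done = false;

// Main thread: called for each recorded command buffer.
void enqueue(VkCommandBuffer cmdBuf) {
    { std::lock_guard<std::mutex> lk(mtx); ready.push_back(cmdBuf); }
    cv.notify_one();
}

// Sub-thread: drain the queue and submit everything in a single call.
void submit_worker(VkQueue queue) {
    for (;;) {
        std::vector<VkCommandBuffer> batch;
        {
            std::unique_lock<std::mutex> lk(mtx);
            cv.wait(lk, [] { return !ready.empty() || done; });
            if (ready.empty() && done) return;
            batch.swap(ready);
        }
        VkSubmitInfo submitInfo = { VK_STRUCTURE_TYPE_SUBMIT_INFO };
        submitInfo.commandBufferCount = (uint32_t)batch.size();
        submitInfo.pCommandBuffers = batch.data();
        vkQueueSubmit(queue, 1, &submitInfo, VK_NULL_HANDLE);
        vkQueueWaitIdle(queue);  // simplified; a fence would be used in practice
    }
}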
Summary
High-performance AI inference can be achieved using NEON and Vulkan regardless of device or OS, with only standard drivers
Easy integration into any application through the ailia SDK
Fast inference using Vulkan has already been implemented for more than 140 AI models
“ailia” resolves all these problems:
ailia SDK, ailia MODELS, ailia TRAINER, ailia VISION, ailia APPS, ailia MEDIA
Reference
ax Inc. provides total solutions related to AI, from consulting to models, SDKs, and AI application / system development and support. Should you have any inquiry, feel free to get in touch with us.
ax Inc. home page: https://axinc.jp/en/ (inquiries)
ailia MEDIA: https://medium.com/axinc-ai
ailia SDK: https://ailia.jp/en/ (free trial version available)
ailia MODELS: https://github.com/axinc-ai/ailia-models
ailia AI Showcase (video): https://www.youtube.com/watch?v=lRnWX1rDRQU
ailia AI Showcase (Android): https://play.google.com/store/apps/details?id=jp.axinc.ailia_ai_showcase
Thank you!
Tweet us: @ArmSoftwareDev -> #AIVTT
Check out our Arm Software Developers YouTube channel
Sign up now for our next AI Virtual Tech Talk: developer.arm.com/techtalks
Attendees: don’t forget to fill out the survey to be in with a chance of winning
an Arduino Nano 33 BLE board
Join Arm DevSummit
Live on October 19-21
Where Hardware and Software
Join Forces
Sign Up Today at Devsummit.arm.com
Thank You
Danke
Merci
谢谢
ありがとう
Gracias
Kiitos
감사합니다
धन्यवाद
شكرًا
תודה
AI Virtual Tech Talks Series
Optimizing NN inference performance on Arm NEON and Vulkan

  • 1. © 2021 Arm David Cochard & Kazuki Kyakuno 21st September 2021 Optimizing NN inference performance on Arm NEON and Vulkan using ailia SDK ax Inc
  • 2. © 2020 Arm Limited (or its affiliates) Welcome! Tweet us: @ArmSoftwareDev -> #AIVTT Check out our Arm Software Developers YouTubechannel Signup now for our next AI Virtual Tech Talk: developer.arm.com/techtalks Attendees: don’t forget to fill out the survey to be in with a chance of winning an Arduino Nano 33 BLE board
  • 3. Our upcoming Arm AI Tech Talks Date Title Host September 21st Optimizing NN inference performance on Arm NEON and Vulkan using the Ailia SDK Ax Inc October 5th EON Tuner: AutoML for real-world embedded devices Edge Impulse October 28th ARM架构端侧AI视觉算法的开发到部署 (Development to Deployment of Endpoint AI vision Algorithms Based on Arm Architecture) ICE TECH November 2nd Getting started with running Machine Learning on Arm Ethos-U55 Arm Visit: developer.arm.com/techtalks
  • 4. Presenters David Cochard, Engineering Manager, ax Inc. Kazuki Kyakuno, CTO, ax Inc.
  • 5. Agenda ・Presentation of our solution “ ailia ” ・Optimize computation on CPU ・Optimize computation on GPU
  • 6. Company Profile Name ax Inc. Location Tokyo, Japan CEO TERADA Takehiko Business ・Development and provide “ailia SDK” ・AI Consulting, Training AI models and more AI businesses
  • 7. “AI everywhere ” We believe that AI will be used on every device in the near future. We are developing fast SDKs and constantly researching the latest AI models to help accelerating the evolution of AI era. ax Inc. aims to provide tools to implement the latest AI capabilities to solve real-world problems.
  • 8. There are many challenges in implementing AI for various devices.
  • 9. Those challenges include: • Wide variety of AI models • Multi platform • Various programing languages • Long-term API consistency • Performance optimization for each devices and more.
  • 10. ailia SDK ailia SDK is an AI framework leveraging CPU and GPU to achieve high-performance AI inference It supports ONNX (opset 10 & 11) and enables high-performance inference using NEON and Vulkan It offers over 140 pre-trained models, ready to be integrated into our client’s application https://ailia.jp/en
  • 11. Benefits when implementing AI to Edge devices install ailia SDK download ailia MODELS execute sample code result verification model conversion install dependent libraries search optimal model clone model repository setup edge environment setup development environment NG NG NG result verification Implement pre and post processing Conventional ailia SDK
  • 12. ailia SDK Architecture Vulkan Metal NEON Accelerator (Convolution, Pooling, Resize, Add, and more.) Runtime Graph Optimization API (C++, Python, C#, JNI) ONNX (opset=10, 11) (supporting over 100 layers) ailia SDK ailia MODELS
  • 13. ailia MODELS Over 140 models compatible with ailia SDK are publicly available on github You can easily try out the latest models such as YOLOv4, MIDAS and PaddleOCR https://github.com/axinc-ai/ailia-models
  • 14. ailia MODELS ailia MODELS has samples for Python, C ++ and Unity. You can use accelerated inference using Vulkan in any environment. Hand Detection Depth Estimation
  • 16. NEON implementation in AI Since the amount of processing is large for AI compared to general applications, significant speedup is required. Thanks to the high degree of parallelism in layer operations, there is a lot of room for optimization using SIMD instructions such as NEON. There are many types of layers to implement and possibilities to use various optimization techniques.
  • 17. NEON overview NEON provides scalar/vector instructions and registers for Arm CPUs It is possible to perform parallel calculation in 128-bit units, as well as FP32, up to 4 elements simultaneously Float 4 Parallel SIMD instructions 128 bit Data 128 bit Data
  • 18. Basic techniques Replace code branching with a bit select instruction Process number sign using by bit manipulation Reorder data structure (load / store) // save sign bit from input value. mask = vdupq_n_u32(1<<31); sign = vandq_u32(cast_u32(x), mask); v = vabsq_f32(x); // .. any operation with abs value .. // restore sign bit with xor res = cast_f32(veorq_u32(cast_u32(v), sign)); // remove branch with bit select // dst[i]=(src[i]<0.0f)?src[i]*alpha:src[i]; val = vld1q_f32(src); sel = vcltq_f32(val, zero); mod = vmulq_n_f32(val, alpha); vst1q_f32(dst, vbslq_f32(sel, mod, val)); // reorder data with structure load x[] = { 11, 12, 13, 14, 21, 22, 23, 24, 31, 32, 33, 34, 41, 42, 43, 44, }; v = vld4q_f32(x); // v.val[0] : { 11, 21, 31, 41 } // v.val[1] : { 12, 22, 32, 42 } // v.val[2] : { 13, 23, 33, 43 } // v.val[3] : { 14, 24, 34, 44 }
  • 19. Approximation NEON implementation using approximate expressions for high-level functions such as exp, log, and erf // log(x) = log((2^n) * z) = n*log(2) + log(z) // log(z) ≒ 2 * (w + (w^3)/3 + (w^5)/5 + ..) : w = (z-1)/(z+1) log2 = 0.6931471805599453f; n = pick_exponent(x); z = pick_fractional(x); w = (z-1.0) / (z+1.0); ww = w*w; r = (1.0/7.0) + (1.0/9.0)*ww; r = (1.0/5.0) + r*ww; r = (1.0/3.0) + r*ww; r = (1.0 + r*ww); r = r*w; // w + (w^3)/3 + (w^5)/5 + (w^7)/7 + (w^9)/9 return (n*log2 + r*2);
  • 20. Threading Taking advantage of Arm big.LITTLE technology, performance gains can be achieved by efficiently assigning jobs to different cores. Adjust the processing unit according to the cache size. Dividing a task into units of equal size does not fit the big.LITTLE SoC design: assign more processing-intensive tasks to big cores and smaller ones to LITTLE cores (see the sketch below).
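  As a hedged illustration of steering work onto a specific cluster, the Linux sketch below pins the calling thread to a set of cores. The core IDs are hypothetical; the actual big/LITTLE topology must be discovered per SoC:

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>

    // Pin the calling thread to the given cores (e.g. the big cluster).
    static int pin_to_cores(const int *cores, int count) {
        cpu_set_t set;
        CPU_ZERO(&set);
        for (int i = 0; i < count; ++i)
            CPU_SET(cores[i], &set);
        return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    }

    // Hypothetical usage: cores 4-7 are the big cluster on this imaginary SoC.
    // int big_cores[] = { 4, 5, 6, 7 };
    // pin_to_cores(big_cores, 4);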
  • 21. Benchmark Comparison of inference time with NEON enabled vs. disabled:

    SoC             Model      Improvement (with vs. without NEON)
    Snapdragon 888  ResNet50   2.15x faster
    Snapdragon 888  YOLOv3     3.90x faster
    Exynos 9820     ResNet50   3.08x faster
    Exynos 9820     YOLOv3     2.86x faster
  • 22. Future work NEON's vector length is fixed at 128 bits, but SVE2, added in Armv9-A, enables 512-bit vector operations. Going forward, ailia SDK will continue to support the latest instruction sets (a vector-length-agnostic sketch follows).
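  For illustration, SVE2 code can be written vector-length-agnostically, so the same loop runs unchanged at any hardware vector width. A minimal sketch using the ACLE SVE intrinsics (our own example, not from the slides):

    #include <arm_sve.h>

    // Element-wise add that adapts to the hardware vector length.
    void add_sve(const float *a, const float *b, float *dst, int n) {
        for (int i = 0; i < n; i += svcntw()) {   // svcntw() = FP32 lanes per vector
            svbool_t pg = svwhilelt_b32(i, n);    // predicate masks the tail
            svfloat32_t va = svld1(pg, a + i);
            svfloat32_t vb = svld1(pg, b + i);
            svst1(pg, dst + i, svadd_x(pg, va, vb));
        }
    }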
  • 24. Benefits of Vulkan Support for all major GPUs. Support for all major OSs: Windows, Android, Linux. Easy installation for GPU inference: being widely used for gaming, Vulkan only requires standard drivers to run. Little additional disk space usage: ailia_vulkan.dll is only 2.8MB.
  • 25. AI Inference Acceleration with Vulkan • Fast AI inference achieved using Runtime Graph Optimization and Layer Fusion • Optimized GEMM logic implemented with Vulkan's compute shaders • Layers such as Convolution and Pooling implemented with Vulkan's compute shaders • The Winograd algorithm implemented with shaders to accelerate the heavy Convolution layer
  • 26. Roles between CPU and GPU CPU side: allocate GPU memory, send data to GPU memory, create the compute kernel on the GPU, dispatch the compute kernel, and fetch the result from GPU memory. GPU side: compute using the compute kernel. The two communicate through the computing API and GPU driver.
  • 27. Code required for Vulkan
    Create VkInstance
    Select VkPhysicalDevice
    Create VkDevice
    Get VkQueue
    Allocate VkDeviceMemory
    Bind VkDeviceMemory to VkBuffer
    Send data from host to device
    Configure VkShaderModule
    Create VkDescriptorSetLayout
    Create VkPipelineLayout
    Create VkPipeline
    Create VkDescriptorPool
    Create VkDescriptorSet
    Create VkCommandPool
    Create VkCommandBuffer
    Register VkPipeline to VkCommandBuffer
    Register VkDescriptorSet to VkCommandBuffer
    Call vkCmdDispatch
    Execute and wait for VkCommandBuffer to complete
    Transfer data from device to host
  (The first two steps are sketched below.)
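  As a hedged sketch of the first two steps (instance creation and physical-device selection); error handling is omitted and the device choice is deliberately simplistic:

    #include <vulkan/vulkan.h>

    // Sketch: create a VkInstance and pick the first physical device.
    VkPhysicalDevice pick_first_device(VkInstance *out_instance) {
        VkApplicationInfo app = {
            .sType = VK_STRUCTURE_TYPE_APPLICATION_INFO,
            .apiVersion = VK_API_VERSION_1_1,
        };
        VkInstanceCreateInfo info = {
            .sType = VK_STRUCTURE_TYPE_INSTANCE_CREATE_INFO,
            .pApplicationInfo = &app,
        };
        vkCreateInstance(&info, NULL, out_instance);

        uint32_t count = 1;   // ask for just the first device
        VkPhysicalDevice device = VK_NULL_HANDLE;
        vkEnumeratePhysicalDevices(*out_instance, &count, &device);
        return device;
    }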
  • 28. Example of GPU Computing Add the corresponding elements of inputs A and B of the same size, and write the result to the corresponding element of an output of the same size. Input A [ z=4, y=360, x=640 ] + Input B [ z=4, y=360, x=640 ] -> Output [ z=4, y=360, x=640 ]
  • 29. Example of GPU Computing: Device code (GPU code)

    #version 450
    layout(std430, binding = 0) writeonly buffer Dst   { float data[]; } dst;
    layout(std430, binding = 1) readonly  buffer Src_A { float data[]; } src_a;
    layout(std430, binding = 2) readonly  buffer Src_B { float data[]; } src_b;
    layout(local_size_x = 64) in;

    void main() {
        // Exception handling of fractional blocks is omitted
        const uint id = gl_GlobalInvocationID.x;
        dst.data[id] = src_a.data[id] + src_b.data[id];
    }
  • 30. Example of GPU Computing: Host code (CPU code)

    // Instance initialization, device selection, buffer allocation,
    // shader module construction, etc. are omitted.
    vkBeginCommandBuffer(cmdBuf, &beginInfo);  // start recording the command buffer

    // Register the pipeline in the command buffer
    // (the shader module was registered in the pipeline)
    vkCmdBindPipeline(cmdBuf, VK_PIPELINE_BIND_POINT_COMPUTE, pipeline);

    // Register the descriptor set in the command buffer
    // (the I/O buffer bindings were registered in descSet)
    vkCmdBindDescriptorSets(cmdBuf, VK_PIPELINE_BIND_POINT_COMPUTE,
                            pipelineLayout, 0, 1, &descSet, 0, NULL);

    // Specify how many workgroups to start, using the most recently
    // bound pipeline and descriptor set.
    // Example: z=4, y=360, x=640 processed one-dimensionally with localsize = 64
    vkCmdDispatch(cmdBuf, (4*360*640)/64, 1, 1);

    vkEndCommandBuffer(cmdBuf);  // end of command buffer construction

    // Pass the command buffer to the device and run the shader
    // (cmdBuf was registered in submitInfo)
    vkQueueSubmit(queue, 1, &submitInfo, VK_NULL_HANDLE);

    // Waiting for completion and reading back the result are omitted.
  • 31. Workgroups and Threads (Invocations) In the host (CPU) code, specify how many workgroups to launch (groupCount in vkCmdDispatch). In the device (GPU) code (the shader), describe what each thread does, and use localsize to specify the number of threads (invocations) per workgroup. [Diagram: vkCmdDispatch on the CPU launches groupCount workgroups on the GPU, each containing localsize threads]
  • 32. Arm GPU HW configuration example https://developer.arm.com/ip-products/graphics-and-multimedia/mali-gpus/mali-g76-gpu Mali-G76: configurable from 4 to 20 shader cores, delivering the largest capability for a Mali GPU; 3 engines per shader core; 8 execution lanes per engine.
  • 33. Correspondence with API vkCmdDispatch(.., groupCountX, groupCountY, groupCountZ); — the kernel launch in the host code starts (groupCountX * groupCountY * groupCountZ) workgroups. One workgroup is assigned to one execution engine; if you can fill all the execution engines, you make full use of the GPU. layout(local_size_x = 64) in; — the thread-count declaration in the device code specifies the number of local threads (invocations) in a workgroup. If it fills the number of threads the execution engine can issue, the execution engine is fully utilized.
  • 34. Relation between localsize and dispatch If the localsize is 64 and the image size is 640 * 360, you need to run (360 * 640) / 64 = 3600 workgroups to process the entire image. [Diagram: a 640 x 360 image tiled into workgroups of 64 threads, groupCount = (360*640)/64]
  • 35. How to choose localsize The limit on localsize is defined by maxComputeWorkGroupSize:
    maxComputeWorkGroupSize = {384, 384, 384} (Mali-G76) — maximum size of a local compute workgroup, per dimension
    maxComputeWorkGroupInvocations = 384 (Mali-G76) — maximum total number of compute shader invocations in a single local workgroup
  The Arm Mali GPU Best Practices Developer Guide says: "Use 64 as a baseline workgroup size. Do not use more than 64 threads per workgroup." localsize adjustment is required for complex, heavy kernels. (A sketch of querying these limits follows.)
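  These limits can be read from the Vulkan device properties at runtime. A small sketch, assuming physicalDevice was selected earlier:

    #include <stdio.h>
    #include <vulkan/vulkan.h>

    // Query the compute workgroup limits discussed above.
    void print_compute_limits(VkPhysicalDevice physicalDevice) {
        VkPhysicalDeviceProperties props;
        vkGetPhysicalDeviceProperties(physicalDevice, &props);
        printf("maxComputeWorkGroupSize = {%u, %u, %u}\n",
               props.limits.maxComputeWorkGroupSize[0],
               props.limits.maxComputeWorkGroupSize[1],
               props.limits.maxComputeWorkGroupSize[2]);
        printf("maxComputeWorkGroupInvocations = %u\n",
               props.limits.maxComputeWorkGroupInvocations);
    }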
  • 36. Time spent by localsize An error occurs if maxComputeWorkGroupSize is exceeded. Performance saturates at localsize >= 32. localsize = 1 gives the worst performance, since only one shader module is used.
  • 37. GEMM GEMM (general matrix multiply) is the multiplication of 2D dense matrices, often used in scientific computing (computer simulation). Patterson & Hennessy's "Computer Organization and Design", a 1426-page textbook, tells the story of speeding up GEMM. [Diagram: Matrix A * Matrix B = Matrix C]
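  For reference, the scalar baseline that the shader versions below improve on fits in a few lines; a naive sketch of our own (row-major, no transposes):

    // Naive reference GEMM: C = A * B with A (M x K), B (K x N), C (M x N), row-major.
    void gemm_ref(const float *A, const float *B, float *C, int M, int N, int K) {
        for (int m = 0; m < M; ++m) {
            for (int n = 0; n < N; ++n) {
                float sum = 0.0f;
                for (int k = 0; k < K; ++k)
                    sum += A[m * K + k] * B[k * N + n];
                C[m * N + n] = sum;
            }
        }
    }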
  • 38. Basic implementation of GEMM

    // I/O definitions omitted (case trans_a == false && trans_b == false)
    #define BLOCK_SIZE 8
    layout(local_size_x = BLOCK_SIZE, local_size_y = BLOCK_SIZE) in;

    // shared memory is commonly visible to the whole workgroup
    shared float sa[ BLOCK_SIZE * BLOCK_SIZE ];
    shared float sb[ BLOCK_SIZE * BLOCK_SIZE ];

    void main() {
        uint lx = gl_LocalInvocationID.x;
        uint ly = gl_LocalInvocationID.y;
        uint dx = gl_WorkGroupID.x * BLOCK_SIZE;
        uint dy = gl_WorkGroupID.y * BLOCK_SIZE;
        float sum = 0.0;
        for (uint k = 0; k < TOTAL_K; k += BLOCK_SIZE) {
            sa[ ly * BLOCK_SIZE + lx ] = src_a.data[ (dy + ly) * src_a_width + (k + lx) ];
            sb[ ly * BLOCK_SIZE + lx ] = src_b.data[ (k + ly) * src_b_width + (dx + lx) ];
            barrier();
            for (uint i = 0; i < BLOCK_SIZE; ++i) {
                sum += sa[ ly * BLOCK_SIZE + i ] * sb[ i * BLOCK_SIZE + lx ];
            }
            barrier();
        }
        // Exception handling of fractional blocks is omitted
        dst.data[ (dy + ly) * dst_width + (dx + lx) ] = sum;
    }
  • 43. Features of the basic GEMM implementation • One thread is responsible for one output element • localsize is 8 * 8 = 64, with threads created as 2D blocks • A and B are both staged in shared memory • Between barrier() calls, each thread computes its output element by referring to shared memory loaded by the other threads in the workgroup
  • 44. Related research about GEMM V. Volkov and J. Demmel, "Benchmarking GPUs to Tune Dense Linear Algebra" https://www.cs.colostate.edu/~cs675/volkov08-sc08talk.pdf • Increasing the localsize makes register spills more likely • Shared memory is slower than registers • Putting only B in shared memory and keeping A in registers made it faster
  • 45. Larger block size reduces global memory access
  • 46. Optimized implementation of GEMM

    // I/O definitions omitted
    #define BLOCK_SIZE 16
    layout(local_size_x = BLOCK_SIZE) in;

    // shared memory is commonly visible to the whole workgroup
    shared vec4 sb[ BLOCK_SIZE ];

    void main() {
        // Coordinate calculation omitted
        float sum[BLOCK_SIZE];
        for (int i = 0; i < BLOCK_SIZE; ++i) { sum[i] = 0.0f; }
        for (uint k = 0; k < TOTAL_K; k += 4) {
            sb[ lx ] = load_b(dx + lx, k);
            barrier();
            vec4 wa = load_a(dy + lx, k);
            for (uint i = 0; i < BLOCK_SIZE; ++i) {
                sum[i] += dot(wa, sb[i]);
            }
            barrier();
        }
        // Exception handling of fractional blocks is omitted
        for (int i = 0; i < BLOCK_SIZE; ++i) {
            dst.data[ (dy + lx) * dst_width + (dx + i) ] = sum[i];
        }
    }
  • 52. Features of the new GEMM implementation • One thread is responsible for BLOCK_SIZE output elements • localsize is one-dimensional, with BLOCK_SIZE = 16 • Only B is stored in shared memory • A is held in a normal variable, expected to be allocated to registers • The vec4 type, packing 4 floats, lets one thread drive 4 arithmetic units
  • 53. Performance comparison on BERT BERT is a Transformer-based natural language processing model whose computation is dominated by GEMM (MatMul). Inference is 2.5 to 3.5 times faster with the optimized implementation; based on our benchmark, it is for example 3.53 times faster on an Arm Mali-G76.
  • 54. The Winograd Algorithm The arithmetic complexity of the convolution layer can be reduced by applying transforms to the inputs and the output. Although the transforms add some cost, the matrix multiplication that takes up most of the computation time is reduced. https://arxiv.org/abs/1509.09308 For a 3x3 convolution, the number of multiply operations can be reduced from 36 to 16. [Diagram: filter weights and image each go through a transform, then matrix multiplication, then an inverse transform]
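  To make the idea concrete, here is the classic 1D case F(2,3) from the Lavin & Gray paper linked above: two outputs of a 3-tap convolution computed with 4 data multiplications instead of 6. A scalar sketch of ours, not ailia's implementation; in practice the /2 factors belong to the filter transform and are precomputed:

    // Winograd F(2,3): 4 inputs d[0..3], 3 weights g[0..2], 2 outputs.
    void winograd_f23(const float d[4], const float g[3], float y[2]) {
        float m1 = (d[0] - d[2]) * g[0];
        float m2 = (d[1] + d[2]) * 0.5f * (g[0] + g[1] + g[2]);
        float m3 = (d[2] - d[1]) * 0.5f * (g[0] - g[1] + g[2]);
        float m4 = (d[1] - d[3]) * g[2];
        y[0] = m1 + m2 + m3;   // = d0*g0 + d1*g1 + d2*g2
        y[1] = m2 - m3 - m4;   // = d1*g0 + d2*g1 + d3*g2
    }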
  • 55. Matrix Multiplication Optimization • GEMM block size is M = N = K = 4 (on some architectures M = 8) • Each invocation computes the matrix product of one block • No workgroup synchronization • 16-byte memory alignment, reading as vec4 in a single instruction • Memory access optimized by keeping the vector format as much as possible in later processing
  • 56. Command Buffer Management Because the overhead of vkQueueSubmit is large, a sub thread is used to reduce the load (see the sketch below): 1. Start a new thread to process the command buffer 2. When a request is made through ailia's API, the command buffer is added to the sub thread's queue 3. If the sub thread is idle, the queued command buffers are sent all at once to vkQueueSubmit [Diagram: the main thread parses the graph and pushes Convolution + Activation and Pooling command buffers into ready_command_buffer_vector; the sub thread drains it with vkQueueSubmit and vkQueueWaitIdle]
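  A hedged sketch of such a batching scheme; the names and the fixed-size queue are illustrative, not ailia's actual API, and overflow handling is omitted:

    #include <pthread.h>
    #include <string.h>
    #include <vulkan/vulkan.h>

    #define MAX_PENDING 64
    static VkCommandBuffer pending[MAX_PENDING];
    static uint32_t pending_count = 0;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  cv   = PTHREAD_COND_INITIALIZER;

    // (2) Producer: called from the API side, queues a recorded command buffer.
    void enqueue(VkCommandBuffer cb) {
        pthread_mutex_lock(&lock);
        pending[pending_count++] = cb;   // bounds check omitted in this sketch
        pthread_cond_signal(&cv);
        pthread_mutex_unlock(&lock);
    }

    // (1)(3) Sub thread: whenever work is queued, submits it all in one call.
    void *submit_thread(void *arg) {
        VkQueue queue = *(VkQueue *)arg;
        for (;;) {
            VkCommandBuffer batch[MAX_PENDING];
            uint32_t n;
            pthread_mutex_lock(&lock);
            while (pending_count == 0) pthread_cond_wait(&cv, &lock);
            n = pending_count;
            memcpy(batch, pending, n * sizeof(batch[0]));
            pending_count = 0;
            pthread_mutex_unlock(&lock);

            VkSubmitInfo si = { .sType = VK_STRUCTURE_TYPE_SUBMIT_INFO,
                                .commandBufferCount = n,
                                .pCommandBuffers = batch };
            vkQueueSubmit(queue, 1, &si, VK_NULL_HANDLE);  // one submit per batch
            vkQueueWaitIdle(queue);
        }
        return NULL;
    }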
  • 58. Summary High-performance AI inference can be achieved using NEON and Vulkan regardless of device or OS, with only standard drivers. It is easy to integrate into any application through ailia SDK. Fast inference using Vulkan has already been implemented for more than 140 AI models.
  • 59. “ailia” resolves all problems: ailia SDK / ailia MODELS / ailia TRAINER / ailia VISION / ailia APPS / ailia MEDIA
  • 60. Reference ax Inc. provides total solutions related to AI, from consulting to models, SDKs, and AI application/system development and support. Should you have any inquiry, feel free to get in touch with us.
    ax Inc. home page: https://axinc.jp/en/ (inquiries)
    ailia MEDIA: https://medium.com/axinc-ai
    ailia SDK: https://ailia.jp/en/ (free trial version available)
    ailia MODELS: https://github.com/axinc-ai/ailia-models
    ailia AI Showcase (video): https://www.youtube.com/watch?v=lRnWX1rDRQU
    ailia AI Showcase (Android): https://play.google.com/store/apps/details?id=jp.axinc.ailia_ai_showcase
  • 61. © 2020 Arm Limited (or its affiliates) Thank you! Tweet us: @ArmSoftwareDev -> #AIVTT Check out our Arm Software Developers YouTube channel Sign up now for our next AI Virtual Tech Talk: developer.arm.com/techtalks Attendees: don't forget to fill out the survey to be in with a chance of winning an Arduino Nano 33 BLE board
  • 62. © 2020 Arm Limited (or its affiliates) Join Arm DevSummit Live on October 19-21 Where Hardware and Software Join Forces Sign Up Today at Devsummit.arm.com