An Analysis of Convolution for Inference


Scott Gray presented at the ICML 2016 workshop on "On-device Intelligence", going over various ways of computing convolution.


An Analysis of Convolution for Inference
24 June 2016
Scott Gray
Nervana Systems
MAKING MACHINES SMARTER.™

Proprietary and confidential. Do not distribute. nervana
Direct Convolution
• Compute with in-place slicing + gemm
• Data layout considerations: C, H, W, N
• Minimize slicing logic
• Maximize contiguous access
• Leverage filter overlap
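The "slicing + gemm" idea can be sketched in NumPy as follows. This is a minimal illustration, not the Nervana kernel: the function name `direct_conv_chwn`, the (C, R, S, K) filter layout, and the absence of padding are assumptions made for the example. With N as the innermost dimension, every sliced pixel is a contiguous run of N values.

```python
import numpy as np

def direct_conv_chwn(I, F, stride=1):
    """Direct convolution (cross-correlation) via slicing + GEMM, no padding.

    I: input,   shape (C, H, W, N)
    F: filters, shape (C, R, S, K)   (layout assumed for this sketch)
    returns O,  shape (K, P, Q, N)
    """
    C, H, W, N = I.shape
    _, R, S, K = F.shape
    P = -(-(H - R + 1) // stride)   # ceil((H - R + 1) / stride)
    Q = -(-(W - S + 1) // stride)
    F_mat = F.reshape(C * R * S, K)             # one GEMM operand, built once
    O = np.empty((K, P, Q, N), dtype=I.dtype)
    for p in range(P):
        for q in range(Q):
            h0, w0 = p * stride, q * stride
            # Slice the receptive field and flatten it to (C*R*S, N).
            patch = I[:, h0:h0 + R, w0:w0 + S, :].reshape(C * R * S, N)
            O[:, p, q, :] = F_mat.T @ patch     # (K, C*R*S) @ (C*R*S, N)
    return O
```

Neighbouring output positions slice overlapping regions of the input, which is the overlap a tuned kernel would try to reuse rather than reload.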

Small N direct convolution: Without Superblocking
fprop
Q = (W-S+1 + 2 * pad) / stride
wi = sk + qj * stride - pad
Fig from V. Dumoulin,
https://github.com/vdumoulin/conv_arithmetic
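The two fprop formulas can be exercised in one dimension as below. This is a sketch under stated assumptions: the division in Q is taken to round up, taps that fall into the padding are treated as zero, and the names `out_size` and `fprop_1d` are made up for the example.

```python
import numpy as np

def out_size(W, S, pad, stride):
    # Q = (W - S + 1 + 2*pad) / stride, rounded up (rounding is an assumption)
    return -(-(W - S + 1 + 2 * pad) // stride)

def fprop_1d(x, f, pad=0, stride=1):
    W, S = len(x), len(f)
    Q = out_size(W, S, pad, stride)
    out = np.zeros(Q)
    for q in range(Q):              # qj: output position
        for s in range(S):          # sk: filter tap
            w = s + q * stride - pad    # wi = sk + qj * stride - pad
            if 0 <= w < W:          # taps landing in the padding contribute zero
                out[q] += x[w] * f[s]
    return out

x = np.arange(1.0, 8.0)             # W = 7
f = np.array([1.0, 0.0, -1.0])      # S = 3
y = fprop_1d(x, f, pad=1, stride=2) # Q = ceil(7 / 2) = 4 outputs
```

The bounds check on `w` is the slicing logic that the direct-convolution slide asks to minimize; a real kernel would hoist it out of the inner loop.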

Small N direct convolution: With Superblocking
fprop
Q = (W-S+1 + 2 * pad) / stride
wi = sk + qj * stride - pad

Small N direct convolution: Bprop for deconv
bprop
pad’ = S - pad - 1
wi = (qj - pad’ + sk) / stride
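One way to read the bprop formulas is that they visit the same (input pixel, output pixel, filter tap) triples as fprop, but starting from the input side with a mirrored filter index, and only where the division by stride is exact. The sketch below checks that reading numerically; the function names, the round-up in Q, and the mirrored-index interpretation are assumptions of this example rather than statements from the slide.

```python
def out_size(W, S, pad, stride):
    # Round-up division, assumed as in the fprop sketch.
    return -(-(W - S + 1 + 2 * pad) // stride)

def fprop_pairs(W, S, pad, stride):
    """All (w, q, s) triples touched by fprop: w = s + q*stride - pad."""
    Q = out_size(W, S, pad, stride)
    pairs = set()
    for q in range(Q):
        for s in range(S):
            w = s + q * stride - pad
            if 0 <= w < W:
                pairs.add((w, q, s))
    return pairs

def bprop_pairs(W, S, pad, stride):
    """Same triples, enumerated from the W side with pad' = S - pad - 1."""
    Q = out_size(W, S, pad, stride)
    pad_b = S - pad - 1
    pairs = set()
    for w in range(W):
        for s_m in range(S):              # mirrored filter index
            num = w - pad_b + s_m
            if num % stride == 0:         # only exact multiples of stride map to a pixel
                q = num // stride
                if 0 <= q < Q:
                    pairs.add((w, q, S - 1 - s_m))
    return pairs

same = fprop_pairs(7, 3, 1, 2) == bprop_pairs(7, 3, 1, 2)
```

For stride greater than 1 most (w, s) combinations fail the divisibility test, which is why a strided bprop behaves like a convolution over a zero-interleaved gradient.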

Small N direct convolution: Dilated Filters
Dilated
S’ = (S-1) * rate + 1
Q = (W-S’+1 + 2*pad) / stride
wi = sk * rate + qj * stride - pad
Fig from F. Yu, V. Koltun
http://arxiv.org/abs/1511.07122v3
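The dilated formulas only change the effective filter extent and the tap spacing, so the inner loop stays the same. A minimal 1D sketch, assuming round-up division for Q and zero padding (the name `dilated_conv1d` is hypothetical):

```python
import numpy as np

def dilated_conv1d(x, f, rate=1, stride=1, pad=0):
    W, S = len(x), len(f)
    S_eff = (S - 1) * rate + 1                       # S' = (S-1) * rate + 1
    Q = -(-(W - S_eff + 1 + 2 * pad) // stride)      # rounded up (assumption)
    out = np.zeros(Q)
    for q in range(Q):
        for s in range(S):
            w = s * rate + q * stride - pad          # wi = sk * rate + qj * stride - pad
            if 0 <= w < W:
                out[q] += x[w] * f[s]
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal(12)
f = rng.standard_normal(3)

# Equivalent dense filter: insert rate-1 zeros between taps.
f_stuffed = np.zeros((len(f) - 1) * 2 + 1)
f_stuffed[::2] = f

y_dilated = dilated_conv1d(x, f, rate=2)
y_dense = np.correlate(x, f_stuffed, mode="valid")
```

The dilated version does the same arithmetic as the zero-stuffed filter while skipping the multiplications by zero.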

Convolution with Algorithmic Speedups
• FFT and Winograd have the same basic computational flow
• FFT tiles typically need to be much bigger
• Winograd history: Toom and Cook, then Lavin

Winograd: input transform
Input Feature Map
4x4 stride 2
• Input transform
• 2D Winograd is a nested product of 1D transforms
• Transforms can be simplified to remove zeros
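The input transform can be sketched as follows. The matrix `BT` is the F(2x2, 3x3) input transform published by Lavin and Gray, which is assumed here to be the one the slide illustrates; the helper names are made up for the example. The sketch cuts 4x4 tiles at stride 2, applies the nested transform, and shows a hand-simplified 1D version that uses only additions and subtractions.

```python
import numpy as np

# B^T for F(2x2, 3x3), from Lavin and Gray (assumed to match the slide).
BT = np.array([[1,  0, -1,  0],
               [0,  1,  1,  0],
               [0, -1,  1,  0],
               [0,  1,  0, -1]], dtype=np.float64)

def input_tiles(x, tile=4, step=2):
    """Overlapping 4x4 tiles taken at stride 2 from a 2D feature map."""
    H, W = x.shape
    return np.array([[x[i:i + tile, j:j + tile]
                      for j in range(0, W - tile + 1, step)]
                     for i in range(0, H - tile + 1, step)])

def input_transform(d):
    """Nested 1D transforms: columns (BT @ d), then rows (@ BT.T)."""
    return BT @ d @ BT.T          # broadcasts over stacked tiles

def bt_1d(d):
    """The same 1D transform with the zero multiplications removed."""
    d0, d1, d2, d3 = d
    return np.stack([d0 - d2, d1 + d2, d2 - d1, d1 - d3])

x = np.arange(36, dtype=np.float64).reshape(6, 6)
tiles = input_tiles(x)            # shape (2, 2, 4, 4)
V = input_transform(tiles)
V_simplified = bt_1d(bt_1d(tiles[0, 0]).T).T
```

The simplified form needs four add/subtract operations per 1D transform and no multiplications.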

Winograd: filter transform
• Filter transform
• Same as input but with different coefficients
• Transform each feature map independently
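A matching sketch for the filter side, again using the F(2x2, 3x3) matrix `G` from Lavin and Gray as an assumption about which transform is meant. Each 3x3 filter slice is transformed on its own into a 4x4 block, so a (K, C, 3, 3) filter bank becomes (K, C, 4, 4).

```python
import numpy as np

# G for F(2x2, 3x3), from Lavin and Gray (assumed to match the slide).
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]])

def filter_transform(g):
    """Nested 1D transforms of 3x3 filters; broadcasts over leading axes."""
    return G @ g @ G.T

rng = np.random.default_rng(0)
filters = rng.standard_normal((8, 3, 3, 3))   # (K, C, 3, 3), sizes are illustrative
U = filter_transform(filters)                  # (K, C, 4, 4)
```

For inference the filters are fixed, so this transform can be computed once ahead of time.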

Winograd: batched GEMM
• Point-wise Multiplication
• Posed as batched GEMM operation
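To see why the point-wise stage becomes a batched GEMM: once the products are summed over input channels, each of the 16 positions of the 4x4 transformed tile is an independent matrix multiply of a (K x C) filter matrix with a (C x T) matrix of tiles. The shapes and axis order below are illustrative assumptions, not the layout used in the Nervana kernels.

```python
import numpy as np

rng = np.random.default_rng(0)
K, C, T = 8, 3, 5                         # filters, channels, tiles * batch (illustrative)
U = rng.standard_normal((4, 4, K, C))     # transformed filters
V = rng.standard_normal((4, 4, C, T))     # transformed input tiles

# np.matmul treats the two leading axes as batch dimensions:
# 16 independent (K x C) @ (C x T) GEMMs.
M = U @ V

# The same result written as a point-wise product summed over channels.
M_ref = np.einsum("xykc,xyct->xykt", U, V)
```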

Winograd: output transform
Output Feature Map
• Output transform
• Same as input and filter
• Transform back to pixel space to obtain 2x2 output tile
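Putting the stages together for a single tile and a single channel gives a small end-to-end check. The matrices are the F(2x2, 3x3) transforms from Lavin and Gray, assumed to correspond to the slides; `winograd_2x2_3x3` is a name chosen for this sketch.

```python
import numpy as np

# F(2x2, 3x3) transforms from Lavin and Gray (assumed to match the slides).
BT = np.array([[1, 0, -1, 0], [0, 1, 1, 0], [0, -1, 1, 0], [0, 1, 0, -1]], dtype=np.float64)
G = np.array([[1, 0, 0], [0.5, 0.5, 0.5], [0.5, -0.5, 0.5], [0, 0, 1]], dtype=np.float64)
AT = np.array([[1, 1, 1, 0], [0, 1, -1, -1]], dtype=np.float64)

def winograd_2x2_3x3(d, g):
    """One 4x4 input tile d and one 3x3 filter g -> one 2x2 output tile."""
    U = G @ g @ G.T          # filter transform
    V = BT @ d @ BT.T        # input transform
    M = U * V                # point-wise multiplication (16 multiplies)
    return AT @ M @ AT.T     # output transform back to pixel space

def direct_2x2_3x3(d, g):
    """Reference: direct 3x3 correlation over the same tile (36 multiplies)."""
    return np.array([[np.sum(d[i:i + 3, j:j + 3] * g) for j in range(2)]
                     for i in range(2)])

rng = np.random.default_rng(0)
d = rng.standard_normal((4, 4))
g = rng.standard_normal((3, 3))
y_winograd = winograd_2x2_3x3(d, g)
y_direct = direct_2x2_3x3(d, g)
```

In floating point the two results agree to rounding error; the slides that follow look at how that error grows at reduced bit widths.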

Transforms for Increased Accuracy
Input transforms for 4x4

Integer roots
 4     0    -5     0     1   0
 0    -4    -4     1     1   0
 0     4    -4    -1     1   0
 0    -2    -1     2     1   0
 0     2    -1    -2     1   0
 0     4     0    -5     0   1

Fractional roots
 0.87  0    -2.64  0     1   0
 0    -1.4  -2.25  0.62  1   0
 0     1.4  -2.25 -0.62  1   0
 0    -0.58 -0.39  1.5   1   0
 0     0.58 -0.39 -1.5   1   0
 0     0.87  0    -2.64  0   1
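One rough way to compare the two transforms is the worst-case magnitude growth of a row, i.e. the largest sum of absolute coefficients. This is only a proxy chosen for this sketch, not the error metric used in the talk, and the fractional coefficients are the rounded values printed on the slide. Smaller growth leaves more of a fixed bit budget for the data, which may be part of why the fractional roots behave better at low precision.

```python
import numpy as np

BT_INT = np.array([[4,  0, -5,  0, 1, 0],
                   [0, -4, -4,  1, 1, 0],
                   [0,  4, -4, -1, 1, 0],
                   [0, -2, -1,  2, 1, 0],
                   [0,  2, -1, -2, 1, 0],
                   [0,  4,  0, -5, 0, 1]], dtype=np.float64)

BT_FRAC = np.array([[0.87,  0.0,  -2.64,  0.0,  1, 0],
                    [0.0,  -1.4,  -2.25,  0.62, 1, 0],
                    [0.0,   1.4,  -2.25, -0.62, 1, 0],
                    [0.0,  -0.58, -0.39,  1.5,  1, 0],
                    [0.0,   0.58, -0.39, -1.5,  1, 0],
                    [0.0,   0.87,  0.0,  -2.64, 0, 1]], dtype=np.float64)

def worst_case_gain(bt):
    """Largest possible |output| of one 1D transform for inputs bounded by 1."""
    return np.abs(bt).sum(axis=1).max()

gain_int = worst_case_gain(BT_INT)      # 1D gain, integer roots
gain_frac = worst_case_gain(BT_FRAC)    # 1D gain, fractional roots
# The 2D transform nests two 1D transforms, so the bound squares.
gain_int_2d, gain_frac_2d = gain_int ** 2, gain_frac ** 2
```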

Precision

Percentage error from Convolution
[Figure: percentage error (y-axis, 0 to 25) versus bit width (x-axis, 3 to 16) for Direct, 2x2 Winograd, 4x4 Winograd (Fractional Roots), and 4x4 Winograd (Integer Roots).]

Bits  Direct   2x2 Winograd  4x4 frac  4x4 int
3     56.461   112.174       351.196   314.62
4     23.533   46.222        274.28    432.959
5     10.879   21.394        142.649   459.723
6     5.245    10.34         68.062    446.271
7     2.585    5.074         33.73     250.057
8     1.286    2.516         16.667    123.585
9     0.639    1.253         8.246     62.001
10    0.319    0.626         4.154     31.006
11    0.159    0.312         2.064     15.439
12    0.08     0.156         1.029     7.669
13    0.04     0.078         0.515     3.857
14    0.02     0.039         0.259     1.923
15    0.01     0.019         0.129     0.966
16    0.005    0.01          0.064     0.483

Multiplier Transistor Efficiency
Algo    bits  speedup  transistors  performance / transistor
Direct  8     1.0      3000         1
2x2     9     2.25     3750         1.8
4x4     12    4.0      6000         2.0
Transistor Counts from Wikipedia:
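The speedup and performance-per-transistor columns can be reproduced with a few lines of arithmetic. The sketch assumes the speedup is the ratio of multiplications per output tile for 3x3 filters and that the last column is normalised to the direct 8-bit multiplier; the bit widths and transistor counts are copied from the table, and the names are chosen for this example.

```python
def winograd_speedup(m, r=3):
    """Multiplication ratio for F(m x m, r x r): direct needs m^2 * r^2
    multiplies per output tile, Winograd needs (m + r - 1)^2."""
    return (m * m * r * r) / (m + r - 1) ** 2

# name: (bits, speedup, transistors), bits and transistors taken from the table
rows = {
    "Direct": (8, 1.0, 3000),
    "2x2": (9, winograd_speedup(2), 3750),
    "4x4": (12, winograd_speedup(4), 6000),
}

base_transistors = rows["Direct"][2]
perf_per_transistor = {
    name: speedup / (transistors / base_transistors)
    for name, (bits, speedup, transistors) in rows.items()
}
```

Under these assumptions the wider multipliers needed by the Winograd variants cost less than the multiplications they save.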

Logarithmic quantization
D. Miyashita, EH. Lee, B. Murmann
Convolutional Neural Networks using Logarithmic Data Representation
http://arxiv.org/abs/1603.01025v2

Performance: VGG fp32 on GTX 1080
VGG - Totals
[Figure: effective TFLOPS (y-axis, 0 to 25) versus batch size (x-axis: 64, 32, 16, 8, 4, 2, 1) for Neon Direct, Neon F(2x2,3x3), Neon F(4x4,3x3), and cuDNN FFT.]
Peak Performance: VGG fp32 on GTX 1080
VGG - Layer 4.2
[Figure: effective TFLOPS (y-axis, 0 to 25) versus batch size (x-axis: 64, 32, 16, 8, 4, 2, 1) for Neon Direct, Neon F(2x2,3x3), Neon F(4x4,3x3), and cuDNN FFT.]

Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.

Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf

Paige Cruz

Monitoring and observability aren’t traditionally found in software curriculums and many of us cobble this knowledge together from whatever vendor or ecosystem we were first introduced to and whatever is a part of your current company’s observability stack. While the dev and ops silo continues to crumble….many organizations still relegate monitoring & observability as the purview of ops, infra and SRE teams. This is a mistake - achieving a highly observable system requires collaboration up and down the stack. I, a former op, would like to extend an invitation to all application developers to join the observability party will share these foundational concepts to build on:

RESUME BUILDER APPLICATION Project for students

KAMESHS29

20240605 QFM017 Machine Intelligence Reading List May 2024

Matthew Sinclair

Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...

SOFTTECHHUB

The choice of an operating system plays a pivotal role in shaping our computing experience. For decades, Microsoft's Windows has dominated the market, offering a familiar and widely adopted platform for personal and professional use. However, as technological advancements continue to push the boundaries of innovation, alternative operating systems have emerged, challenging the status quo and offering users a fresh perspective on computing. One such alternative that has garnered significant attention and acclaim is Nitrux Linux 3.5.0, a sleek, powerful, and user-friendly Linux distribution that promises to redefine the way we interact with our devices. With its focus on performance, security, and customization, Nitrux Linux presents a compelling case for those seeking to break free from the constraints of proprietary software and embrace the freedom and flexibility of open-source computing.

Elizabeth Buie - Older adults: Are we really designing for our future selves?

Nexer Digital

Free Complete Python - A step towards Data Science

RinaMondal9

Epistemic Interaction - tuning interfaces to provide information for AI support

Alan Dix

Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024 https://alandix.com/academic/papers/synergy2024-epistemic/ As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.

Video Streaming: Then, Now, and in the Future

Alpen-Adria-Universität

In his public lecture, Christian Timmerer provides insights into the fascinating history of video streaming, starting from its humble beginnings before YouTube to the groundbreaking technologies that now dominate platforms like Netflix and ORF ON. Timmerer also presents provocative contributions of his own that have significantly influenced the industry. He concludes by looking at future challenges and invites the audience to join in a discussion.

An Analysis of Convolution for Inference

1. An Analysis of Convolution for Inference 24 June 2016 Scott Gray Nervana Systems MAKING MACHINES SMARTER.™

2. Direct Convolution
• Compute with in-place slicing + gemm
• Data layout considerations: C, H, W, N
• Minimize slicing logic
• Maximize contiguous access
• Leverage filter overlap

3. Small N direct convolution: Without Superblocking
fprop
Q = (W-S+1 + 2 * pad) / stride
wi = sk + qj * stride - pad
Fig from V. Dumoulin, https://github.com/vdumoulin/conv_arithmetic
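The fprop formulas above can be sketched in one dimension as follows. This is a minimal illustration, not the Neon kernel: it assumes the division in Q rounds up (the slide does not state the rounding), and the function names `ceil_div`, `output_size`, and `direct_conv1d` are hypothetical.

```python
def ceil_div(a, b):
    """Integer division rounding up."""
    return -(-a // b)

def output_size(W, S, pad, stride):
    # Q = (W-S+1 + 2*pad) / stride, with rounding up assumed.
    return ceil_div(W - S + 1 + 2 * pad, stride)

def direct_conv1d(x, f, pad=0, stride=1):
    """1D direct convolution (no filter flip) using the slide's index map."""
    W, S = len(x), len(f)
    Q = output_size(W, S, pad, stride)
    out = [0.0] * Q
    for qj in range(Q):              # output position
        for sk in range(S):          # filter tap
            wi = sk + qj * stride - pad
            if 0 <= wi < W:          # taps that land in the padding add zero
                out[qj] += x[wi] * f[sk]
    return out

print(direct_conv1d([1, 2, 3, 4, 5], [1, 0, -1], pad=1, stride=2))
```

With rounding up, this Q agrees with the more common form floor((W - S + 2*pad) / stride) + 1.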

4. Small N direct convolution: With Superblocking
fprop
Q = (W-S+1 + 2 * pad) / stride
wi = sk + qj * stride - pad

5. Small N direct convolution: Bprop for deconv
bprop
pad’ = S - pad - 1
wi = (qj - pad’ + sk) / stride
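One way to read the bprop formulas is as a gather: each pixel of the input gradient collects contributions from the error map. The sketch below is an interpretation rather than the slide's implementation. It assumes that qj indexes the gradient being produced, that wi indexes the error map, that a tap is skipped when the division by stride leaves a remainder, and that the filter is traversed in reverse. `bprop_gather` and `bprop_scatter` are hypothetical names; the scatter version pushes each error pixel back through the fprop index map and serves as a reference.

```python
def bprop_gather(err, f, W, pad=0, stride=1):
    """Gradient w.r.t. a length-W input, gathered per input pixel."""
    S, Q = len(f), len(err)
    pad_b = S - pad - 1                  # pad' from the slide
    grad = [0.0] * W
    for qj in range(W):                  # position in the gradient being produced
        for sk in range(S):
            num = qj - pad_b + sk
            if num % stride:             # not a multiple of stride: no error pixel here
                continue
            wi = num // stride           # position in the error map
            if 0 <= wi < Q:
                grad[qj] += err[wi] * f[S - 1 - sk]   # reversed filter (assumption)
    return grad

def bprop_scatter(err, f, W, pad=0, stride=1):
    """Reference: scatter each error pixel through the fprop index map."""
    grad = [0.0] * W
    for q, e in enumerate(err):
        for s, weight in enumerate(f):
            wi = s + q * stride - pad
            if 0 <= wi < W:
                grad[wi] += e * weight
    return grad

err, f = [1.0, -1.0, 2.0, 0.5], [1.0, 2.0, 3.0]
print(bprop_gather(err, f, W=7, pad=1, stride=2))
print(bprop_scatter(err, f, W=7, pad=1, stride=2))
```

Both functions should print the same list, which is the check that the gather index map is the inverse of the fprop one.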

6. Small N direct convolution: Dilated Filters
Dilated
S’ = (S-1) * rate + 1
Q = (W-S’+1 + 2*pad) / stride
wi = sk * rate + qj * stride - pad
Fig from F. Yu, V. Koltun, http://arxiv.org/abs/1511.07122v3
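The dilated index map can be checked against an ordinary convolution whose filter has zeros inserted between taps. A minimal 1D sketch, assuming the division in Q rounds up; `dilated_conv1d` and `dilate_filter` are hypothetical helpers, not names from the talk.

```python
def ceil_div(a, b):
    """Integer division rounding up."""
    return -(-a // b)

def dilated_conv1d(x, f, rate=1, pad=0, stride=1):
    W, S = len(x), len(f)
    S_eff = (S - 1) * rate + 1                       # S' from the slide
    Q = ceil_div(W - S_eff + 1 + 2 * pad, stride)    # rounding up is an assumption
    out = [0.0] * Q
    for qj in range(Q):
        for sk in range(S):
            wi = sk * rate + qj * stride - pad
            if 0 <= wi < W:                          # padding contributes zero
                out[qj] += x[wi] * f[sk]
    return out

def dilate_filter(f, rate):
    """Insert rate-1 zeros between taps, giving a filter of length S'."""
    out = [0.0] * ((len(f) - 1) * rate + 1)
    out[::rate] = f
    return out

x = [1, 2, 3, 4, 5, 6, 7]
print(dilated_conv1d(x, [1, 1, 1], rate=2))
print(dilated_conv1d(x, dilate_filter([1, 1, 1], 2), rate=1))
```

The first call touches only S taps per output; the second multiplies through the inserted zeros, which is the work the dilated index map avoids.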

7. Convolution with Algorithmic Speedups
• FFT and Winograd have same basic computational flow
• FFT tiles typically need to be much bigger
• Winograd history: Toom and Cook, then Lavin

8. Winograd: input transform
Input Feature Map, 4x4 stride 2
• Input transform
• 2D Winograd is a nested product of 1D transforms
• Transforms can be simplified to remove zeros

9. Winograd: filter transform
• Filter transform
• Same as input but with different coefficients
• Transform each feature map independently

10. Winograd: batched GEMM
• Point-wise Multiplication
• Posed as batched GEMM operation

11. Winograd: output transform
Output Feature Map
• Output transform
• Same as input and filter
• Transform back to pixel space to obtain 2x2 output tile
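The four Winograd steps above (input transform, filter transform, point-wise multiplication, output transform) can be sketched for a single 4x4 input tile and one 3x3 filter. The matrices are the standard F(2x2,3x3) ones from Lavin's paper, which the slides do not reproduce, so treat them as an assumption here; the helper names are hypothetical. With multiple channels, the point-wise product is summed over channels, which is what lets it be posed as a batched GEMM; this single-channel sketch omits that.

```python
# Standard F(2x2,3x3) transform matrices (assumed, from Lavin's paper).
BT = [[1, 0, -1, 0],
      [0, 1, 1, 0],
      [0, -1, 1, 0],
      [0, 1, 0, -1]]
G = [[1.0, 0.0, 0.0],
     [0.5, 0.5, 0.5],
     [0.5, -0.5, 0.5],
     [0.0, 0.0, 1.0]]
AT = [[1, 1, 1, 0],
      [0, 1, -1, -1]]

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(r) for r in zip(*m)]

def nested(t, x):
    """Apply the 1D transform t along both dimensions: t x t^T."""
    return matmul(matmul(t, x), transpose(t))

def winograd_2x2_3x3(tile, filt):
    V = nested(BT, tile)                  # input transform, 4x4
    U = nested(G, filt)                   # filter transform, 4x4
    M = [[u * v for u, v in zip(ur, vr)] for ur, vr in zip(U, V)]  # point-wise multiply
    return nested(AT, M)                  # output transform, 2x2

def direct_2x2_3x3(tile, filt):
    """Reference: direct convolution (no filter flip) of the same tile."""
    return [[sum(tile[y + r][x + s] * filt[r][s] for r in range(3) for s in range(3))
             for x in range(2)] for y in range(2)]

tile = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
filt = [[1, 0, -1], [2, 0, -2], [1, 0, -1]]
print(winograd_2x2_3x3(tile, filt))
print(direct_2x2_3x3(tile, filt))
```

The Winograd path uses 16 multiplications in the point-wise step where the direct path uses 36 for the same 2x2 output tile.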

12. Transforms for Increased Accuracy
Input transforms for 4x4

Integer roots
 4   0  -5   0   1   0
 0  -4  -4   1   1   0
 0   4  -4  -1   1   0
 0  -2  -1   2   1   0
 0   2  -1  -2   1   0
 0   4   0  -5   0   1

Fractional roots
 0.87   0     -2.64   0      1   0
 0     -1.4   -2.25   0.62   1   0
 0      1.4   -2.25  -0.62   1   0
 0     -0.58  -0.39   1.5    1   0
 0      0.58  -0.39  -1.5    1   0
 0      0.87   0     -2.64   0   1
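As a sanity check on the integer-roots input transform, the sketch below uses it in a 1D F(4,3) Winograd convolution and compares the result with direct convolution. Only `BT_INT` comes from the slide. The filter transform `G` and output transform `AT` are the matching matrices from Lavin's paper and are an assumption here, since the slide shows input transforms only. Exact rational arithmetic is used so the comparison is not affected by rounding; the function names are hypothetical.

```python
from fractions import Fraction as Fr

# Input transform with integer roots, as listed on the slide.
BT_INT = [[4, 0, -5, 0, 1, 0],
          [0, -4, -4, 1, 1, 0],
          [0, 4, -4, -1, 1, 0],
          [0, -2, -1, 2, 1, 0],
          [0, 2, -1, -2, 1, 0],
          [0, 4, 0, -5, 0, 1]]
# Filter and output transforms (assumed, from Lavin's paper).
G = [[Fr(1, 4), 0, 0],
     [Fr(-1, 6), Fr(-1, 6), Fr(-1, 6)],
     [Fr(-1, 6), Fr(1, 6), Fr(-1, 6)],
     [Fr(1, 24), Fr(1, 12), Fr(1, 6)],
     [Fr(1, 24), Fr(-1, 12), Fr(1, 6)],
     [0, 0, 1]]
AT = [[1, 1, 1, 1, 1, 0],
      [0, 1, -1, 2, -2, 0],
      [0, 1, 1, 4, 4, 0],
      [0, 1, -1, 8, -8, 1]]

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def winograd_f4_3(d, g):
    """1D F(4,3): 6 inputs and 3 filter taps give 4 outputs with 6 multiplies."""
    v = matvec(BT_INT, d)                 # input transform
    u = matvec(G, g)                      # filter transform
    m = [a * b for a, b in zip(u, v)]     # point-wise multiply
    return matvec(AT, m)                  # output transform

def direct_f4_3(d, g):
    return [sum(d[i + k] * g[k] for k in range(3)) for i in range(4)]

d, g = [1, 2, 3, 4, 5, 6], [1, -2, 3]
print(winograd_f4_3(d, g))
print(direct_f4_3(d, g))
```

The entries of `BT_INT` range up to 5 in magnitude and `G` down to 1/24, which gives some intuition for why the larger tile needs more bits than the 2x2 transform.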

13. Precision
[Chart: percentage error from convolution vs. bit width, for Direct, 2x2 Winograd, 4x4 Winograd (Fractional Roots), and 4x4 Winograd (Integer Roots)]

Percentage error by bit width:
Bits   Direct   2x2 Winograd   4x4 frac   4x4 int
3      56.461   112.174        351.196    314.62
4      23.533   46.222         274.28     432.959
5      10.879   21.394         142.649    459.723
6      5.245    10.34          68.062     446.271
7      2.585    5.074          33.73      250.057
8      1.286    2.516          16.667     123.585
9      0.639    1.253          8.246      62.001
10     0.319    0.626          4.154      31.006
11     0.159    0.312          2.064      15.439
12     0.08     0.156          1.029      7.669
13     0.04     0.078          0.515      3.857
14     0.02     0.039          0.259      1.923
15     0.01     0.019          0.129      0.966
16     0.005    0.01           0.064      0.483

14. Multiplier Transistor Efficiency
Algo     bits   speedup   transistors   performance / transistor
Direct   8      1.0       3000          1
2x2      9      2.25      3750          1.8
4x4      12     4.0       6000          2.0
Transistor Counts from Wikipedia:
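The speedup and performance-per-transistor columns are consistent with simple arithmetic, sketched below. Two assumptions are made that the slide does not state: that the speedup is the ratio of multiplications between direct convolution and Winograd F(m x m, 3x3), and that performance per transistor is normalized to the Direct row. The function names are hypothetical.

```python
def multiply_reduction(m, r=3):
    """Multiplies per m x m output tile: direct m*m*r*r vs. Winograd (m+r-1)^2."""
    return (m * m * r * r) / ((m + r - 1) ** 2)

def perf_per_transistor(speedup, transistors, baseline=3000):
    """Speedup divided by transistor cost relative to the Direct multiplier."""
    return speedup / (transistors / baseline)

rows = [("Direct", 1.0, 3000),
        ("2x2", multiply_reduction(2), 3750),
        ("4x4", multiply_reduction(4), 6000)]
for name, speedup, transistors in rows:
    print(name, speedup, round(perf_per_transistor(speedup, transistors), 2))
```

Under these assumptions the 4x4 transform still comes out ahead per transistor even though its multiplier is twice the size of the 8-bit one.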

15. Logarithmic quantization
D. Miyashita, E. H. Lee, B. Murmann, "Convolutional Neural Networks using Logarithmic Data Representation", http://arxiv.org/abs/1603.01025v2

16. Performance: VGG fp32 on GTX1080
[Chart: VGG totals, effective TFLOPS vs. batch size (64, 32, 16, 8, 4, 2, 1), for Neon Direct, Neon F(2x2,3x3), Neon F(4x4,3x3), and cuDNN FFT]

17. Peak Performance: VGG fp32 on GTX1080
[Chart: VGG layer 4.2, effective TFLOPS vs. batch size (64, 32, 16, 8, 4, 2, 1), for Neon Direct, Neon F(2x2,3x3), Neon F(4x4,3x3), and cuDNN FFT]
