A Dataflow Processing Chip for Training Deep Neural Networks
Dr. Chris Nicol
Chief Technology Officer
Wave Computing Copyright 2017.
Founded in 2010
• Tallwood Venture Capital
• Southern Cross Venture Partners
Headquartered in Campbell, CA
• World class team of 53 dataflow, data science, and systems experts
• 60+ patents
Invented Dataflow Processing Unit (DPU) architecture to
accelerate deep learning training by up to 1000x
• Coarse Grain Reconfigurable Array (CGRA) Architecture
• Static scheduling of data flow graphs onto massive array of processors
Now accepting qualified customers for Early Access Program
Wave Computing Profile
Extended training time due to increasing size of datasets
• Weeks to tune and train typical deep learning models
Hardware for accelerating ML was created for other applications
• GPUs for graphics, FPGAs for RTL emulation
Data coming in “from the edge” is growing faster than
the datacenter can accommodate/use it…
Design
• Neural network architecture
• Cost functions
Tune
• Parameter initialization
• Learning rate
• Mini-batch size
Train
• Accuracy
• Convergence rate
Deploy
• For testing
Deploy
• For production
➢ Problem: Model development times can take days or weeks
Challenges of Machine Learning
Source: Google; http://download.tensorflow.org/paper/whitepaper2015.pdf
• Co-processors must wait on the CPU for instructions
• This limits performance and reduces efficiency and scalability
• Restricts embedded use cases to inferencing-only
GPU waiting on CPU
Figure 13: EEG visualization of Inception training showing CPU and GPU activity.
Problems with Existing Solutions
Deep Learning Networks are Dataflow Graphs
• Programmed on Deep Learning Software
• Run on Wave Dataflow Processor
[Figure: the same dataflow graph (Times, Plus, Sigmoid and Softmax nodes) shown twice, once as programmed in deep learning software and once as run on the Wave Dataflow Processor, with Mem and I/O blocks and the WaveFlow Agent Library]
Wave Dataflow Processor is Ideal for Deep Learning
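As a rough illustration of the "networks are dataflow graphs" idea, the sketch below evaluates a tiny graph built from the node types named on the slide (Times, Plus, Sigmoid, Softmax). Everything in it is an assumption made for illustration: the graph shape, the input values, the element-wise Times, and the `run_graph` helper are hypothetical and do not come from the WaveFlow software.

```python
import math
from graphlib import TopologicalSorter  # Python 3.9+


def softmax(values):
    shift = max(values)  # subtract the max for numerical stability
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical node implementations; Times is element-wise here for brevity.
OPS = {
    "Times": lambda a, b: [x * y for x, y in zip(a, b)],
    "Plus": lambda a, b: [x + y for x, y in zip(a, b)],
    "Sigmoid": lambda a: [1.0 / (1.0 + math.exp(-x)) for x in a],
    "Softmax": softmax,
}

# node name -> (operation, names of the nodes feeding it)
GRAPH = {
    "t1": ("Times", ["x", "w1"]),
    "p1": ("Plus", ["t1", "b1"]),
    "h": ("Sigmoid", ["p1"]),
    "t2": ("Times", ["h", "w2"]),
    "p2": ("Plus", ["t2", "b2"]),
    "y": ("Softmax", ["p2"]),
}

INPUTS = {
    "x": [1.0, 2.0, 3.0],
    "w1": [0.5, -0.25, 0.1],
    "b1": [0.0, 0.1, -0.1],
    "w2": [1.0, 2.0, 3.0],
    "b2": [0.0, 0.0, 0.0],
}


def run_graph(graph, inputs):
    """Fire each node once all of its producers have produced a value."""
    deps = {name: set(srcs) for name, (_, srcs) in graph.items()}
    values = dict(inputs)
    for name in TopologicalSorter(deps).static_order():
        if name in values:  # graph inputs already carry data
            continue
        op, srcs = graph[name]
        values[name] = OPS[op](*(values[s] for s in srcs))
    return values


result = run_graph(GRAPH, INPUTS)
print(result["y"])
```

The only ordering constraint is data availability, which is the property that lets such a graph be scheduled statically onto an array of processors.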
[Block diagram labels: DDR4 (x2), HMC (x4), PCIe Gen3 x16, MCU, AXI4 (x4), Secure DPU Program Buffer, Secure DPU Program Loader]
• 16ff CMOS Process Node
• 16K Processors, 8192 DPU Arithmetic Units
• Self-Timed, MPP Synchronization
• 181 Peak Tera-Ops, 7.25 TeraBytes/sec Bisection Bandwidth
• 16 MB Distributed Data Memory
• 8 MB Distributed Instruction Memory
• 1.71 TB/s I/O Bandwidth
• 4096 Programmable FIFOs
• 270 GB/s Peak Memory Bandwidth
• 2048 outstanding memory requests
• 4 Billion 16-Byte Random Access Transfers / sec
• 4 Hybrid Memory Cube Interfaces
• 2 DDR4 Interfaces
• PCIe Gen3 16-Lane Host interface
• 32-b Andes N9 MCU
• 1 MB Program Store for Paging
• Hardware Engine for Fast Loading of AES Encrypted Programs
• Up to 32 Programmable dynamic reconfiguration zones
• Variable Fabric Dimensions (User Programmable at Boot)
Wave Dataflow Processing Unit
Chip Characteristics & Design Features
• Clock-less CGRA is robust to Process, Voltage & Temperature.
• Distributed memory architecture for parallel processing
• Optimized for data flow graph execution
• DMA-driven architecture – overlapping I/O and computation
Key DPU Board Features
• 65,536 CGRA Processing Elements
• 4 Wave DPU chips per board
• Modular, flexible design
• Multiple DPU boards per Wave Compute Appliance
• Off-the-shelf components
• 32GB of ultra high-speed DRAM
• 512GB of DDR4 DRAM
• FPGA for high-speed board-to-board communication
Wave Current Generation DPU Board
• Best-in-class, highly scalable deep learning training and inference
• More than orders of magnitude better compute-power efficiency
• Plug-and-play node in a datacenter network -- Big Data – Hadoop, Yarn, Spark, Kafka
• Native support of Google TensorFlow (initially)
Wave’s Solution: Dataflow Computer for Deep Learning
Pipelined 1KB Single Port Data RAM w/ BIST & ECC
Pipelined 256-entry Instruction RAM w/ ECC
Quad of PEs are fully connected
[Figure: a quad of four PEs labeled PE a, PE b, PE c and PE d]
Dataflow Processing Element (PE)
• 16 Processor CLUSTER: a full custom tiled GDSII block
• Fully-Connected PE Quads with fan-out
• 8 DPU Arithmetic Units
– Per-cycle grouping into 8, 16, 24, 32, 64-b Operations
– Pipelined MAC Units with (un)Signed Saturation
– Support for floating point emulation
– Barrel Shifter, Bit Processor
– SIMD and MIMD instruction classes
– Data driven
• 16KB Data RAM
• 16 Instruction RAMs
• Full custom semi-static digital circuits
• Robust PVT insensitive operation
– Scalable to low voltages
– No global signals, no global clocks
Cluster of 16 Dataflow PEs
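The per-PE, per-cluster, per-chip and per-board figures quoted across these slides can be cross-checked with a few lines of arithmetic. The sketch below assumes "16K" means 16 × 1024; the cluster count per chip is derived here and is not stated in the deck.

```python
# Figures quoted on the slides
PES_PER_CLUSTER = 16           # "16 Processor CLUSTER"
ARITH_UNITS_PER_CLUSTER = 8    # "8 DPU Arithmetic Units" per cluster
DATA_RAM_PER_PE_KB = 1         # "1KB Single Port Data RAM" per PE
PES_PER_CHIP = 16 * 1024       # "16K Processors" (assuming K = 1024)
CHIPS_PER_BOARD = 4            # "4 Wave DPU chips per board"

# Derived values (the cluster count is an inference, not a quoted figure)
clusters_per_chip = PES_PER_CHIP // PES_PER_CLUSTER
arith_units_per_chip = clusters_per_chip * ARITH_UNITS_PER_CLUSTER
data_ram_per_cluster_kb = PES_PER_CLUSTER * DATA_RAM_PER_PE_KB
data_ram_per_chip_mb = clusters_per_chip * data_ram_per_cluster_kb // 1024
pes_per_board = PES_PER_CHIP * CHIPS_PER_BOARD

print(clusters_per_chip)        # clusters per chip
print(arith_units_per_chip)     # compare with "8192 DPU Arithmetic Units"
print(data_ram_per_cluster_kb)  # compare with "16KB Data RAM"
print(data_ram_per_chip_mb)     # compare with "16 MB Distributed Data Memory"
print(pes_per_board)            # compare with "65,536 CGRA Processing Elements"
```

Under that assumption the derived totals agree with the chip and board slides.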
Each cluster has a pipelined instruction-driven word-level switch
Each cluster has 4 independent pipelined instruction-driven byte switches
Word switch supports fan-out and fan-in
All switches have registers for Router use to avoid congestion
“valid” and “invalid” data in the switch enables fan-in
[Figure labels: fan-in, fan-out]
Hybrid CGRA Architecture
From Asleep to Active
• Word switch fabric remains active
• If valid data arrives at switch input AND switch executes instruction to send data to one of Quads THEN wake up PEs
• Copy PC from word switch to PE and byte switch iRAMs
• Send the incoming data to the PEs
From Active to Asleep
• A PE executes a “sleep” instruction
• All PE & byte switch execution is suspended
• PE can opt for fast wakeup or slow wakeup (deep sleep with lower power)
Data-Driven Power Management
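The sleep and wake rules on this slide amount to a small state machine. The following is a toy software model of those rules, not the hardware design: the class and method names are hypothetical, and timing differences between fast and slow wakeup are not modeled.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class PEState(Enum):
    ACTIVE = "active"
    FAST_SLEEP = "asleep, fast wakeup"
    DEEP_SLEEP = "asleep, slow wakeup (lower power)"


@dataclass
class ClusterModel:
    """Toy model: the word switch always runs; PEs and byte switches may sleep."""
    state: PEState = PEState.DEEP_SLEEP
    pe_pc: Optional[int] = None
    delivered: List[int] = field(default_factory=list)

    def word_switch_cycle(self, valid, sends_to_quad, switch_pc, data):
        # Wake condition from the slide: valid data AND a send-to-quad instruction.
        if not (valid and sends_to_quad):
            return
        if self.state is not PEState.ACTIVE:
            self.state = PEState.ACTIVE
            self.pe_pc = switch_pc  # copy PC from the word switch
        self.delivered.append(data)  # send the incoming data to the PEs

    def execute_sleep(self, deep):
        # A PE executes "sleep"; PE and byte-switch execution is suspended.
        self.state = PEState.DEEP_SLEEP if deep else PEState.FAST_SLEEP


cluster = ClusterModel()
cluster.word_switch_cycle(valid=True, sends_to_quad=False, switch_pc=12, data=7)
state_after_passthrough = cluster.state  # data passed by, PEs stay asleep
cluster.word_switch_cycle(valid=True, sends_to_quad=True, switch_pc=13, data=7)
state_after_wake = cluster.state
cluster.execute_sleep(deep=False)
state_after_sleep = cluster.state
print(state_after_passthrough, state_after_wake, state_after_sleep)
```

The point of the model is that wakeup is triggered by arriving data rather than by a host command, so idle regions of the fabric stay asleep without software intervention.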
Compute Machine with AXI4 Interfaces
24 Compute Machines
Wave DPU Hierarchy
Wave DPU Memory Hierarchy
• Clock skew and jitter limit cycle time with traditional clock distribution
• Self-timed “done” signal from PEs if they are awake. Programmable tuning of margin.
• Synchronized with neighboring Clusters to minimize skew
• 1-sigma local mismatch ~1.3ps and global + local mismatch ~6ps at 140ps cycle time
Clock Distribution and Generation Network across entire Fabric
6-10 GHz Auto-Calibrated Clock Distribution
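The timing figures on this slide can be related to the 6-10 GHz range in the title with simple arithmetic. The sketch below only converts the quoted numbers; it makes no claim about how the margins are tuned in silicon.

```python
CYCLE_TIME_PS = 140.0        # quoted cycle time
LOCAL_SIGMA_PS = 1.3         # quoted 1-sigma local mismatch
GLOBAL_PLUS_LOCAL_PS = 6.0   # quoted global + local mismatch

# A period in picoseconds converts to GHz as 1000 / period.
frequency_ghz = 1000.0 / CYCLE_TIME_PS
local_fraction = LOCAL_SIGMA_PS / CYCLE_TIME_PS
global_fraction = GLOBAL_PLUS_LOCAL_PS / CYCLE_TIME_PS

# Cycle times corresponding to the ends of the 6-10 GHz range
period_at_6ghz_ps = 1000.0 / 6.0
period_at_10ghz_ps = 1000.0 / 10.0

print(f"{frequency_ghz:.2f} GHz at {CYCLE_TIME_PS:.0f} ps")
print(f"local mismatch: {local_fraction:.1%} of a cycle")
print(f"global + local mismatch: {global_fraction:.1%} of a cycle")
print(f"6-10 GHz spans {period_at_10ghz_ps:.0f} ps to {period_at_6ghz_ps:.1f} ps")
```

A 140 ps cycle corresponds to roughly 7.1 GHz, inside the stated range, and the quoted mismatch figures are a few percent of a cycle or less.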
• Up-counter in each cluster initialized to -(1 + Manhattan distance from end cluster)
• Propagate control signal from start cluster to end cluster. Advances 1 cluster per cycle.
• Propagate signal starts the up-counter in each cluster.
• When counter reaches 0, either:
  - Reset the processors
  - Suspend processors for configuration (at PC=0)
  - Enable processors to execute (from PC=0)
• Pre-program 4 clusters to ENTER config mode.
• DMA new kernel instructions into cluster I-mems
• Pre-program 4 clusters to EXIT config mode.
• All clusters running in-synch
• SW controls this process to manage surge current
[Figure: Steps 1 to 4 on a 4x4 array of clusters with a start cluster and an end cluster. Panel labels: Enter config mode, Exit config mode, Propagate Signal (twice), Counters operating, Old kernel, New kernel. Initial counter values run from -1 to -7 across the array, and all counters read 0 after the signal has propagated.]
Reset, Configuration Modes
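The counter rule on this slide can be checked with a small simulation. The sketch below assumes a 4x4 array with the start and end clusters at opposite corners, that the start cluster sees the signal at cycle 0, and that a cluster increments its counter on every cycle after the signal has reached it. These modeling choices are assumptions for illustration; only the initialization formula and the one-cluster-per-cycle propagation come from the slide.

```python
ROWS, COLS = 4, 4
START = (3, 0)  # assumption: start and end clusters sit at opposite corners
END = (0, 3)


def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def initial_counters():
    """Counter = -(1 + Manhattan distance from the end cluster)."""
    return {
        (r, c): -(1 + manhattan((r, c), END))
        for r in range(ROWS)
        for c in range(COLS)
    }


def simulate():
    """Return the cycle at which each cluster's counter reaches 0."""
    counters = initial_counters()
    zero_cycle = {}
    cycle = 0
    while len(zero_cycle) < ROWS * COLS:
        cycle += 1
        for cell in counters:
            # The signal advances one cluster per cycle from START, so it has
            # reached this cluster once manhattan(cell, START) < cycle.
            if cell not in zero_cycle and manhattan(cell, START) < cycle:
                counters[cell] += 1
                if counters[cell] == 0:
                    zero_cycle[cell] = cycle
    return zero_cycle


zero_cycle = simulate()
print(sorted(set(zero_cycle.values())))
```

Under these assumptions every cluster reaches 0 on the same cycle, because the distance already travelled from the start cluster plus the distance remaining to the end cluster is the same for every cluster in the rectangle. That is consistent with the slide's statement that all clusters end up running in-synch.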
[Figure labels: Stop; one region marked (config mode), nine marked (running)]
[Figure: cluster array with I/Os at the bottom, shown in two panels. Panel "Kernel1 and kernel2 mounted": Mount(kernel3) is requested and an empty region is marked "New Kernel goes here!". Panel "After mount(kernel3)": kernels 1, 2 and 3 are mounted, with a just-in-time route through Kernel 2 to the I/O. A further callout reads "The I/Os cannot go here!".]
• Runtime resource manager in lightweight host.
• Mount(). Online placement algorithm with maxrects management of empty clusters.
• Uses “porosity map” for each kernel showing route-thru opportunities. (SDK provides this)
• Just-in-time Place & Route (using A*) of I/Os through other kernels without functional side-effects.
• Unmount(). Removes paths through other kernels.
• Machines are combined for mounting large kernels. Partitioned during unmount().
• Periodic garbage collection used for cleanup.
• Average mount time < 1ms
Runtime resource manager performing mount()
Dynamic Reconfiguration
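The routing step named above (A* through other kernels, guided by a porosity map) can be sketched on a small grid. This is a generic textbook A* with a Manhattan heuristic, not Wave's resource manager: the grid, the `astar_route` helper and the boolean encoding of the porosity map are hypothetical, and the placement step (maxrects) is not shown.

```python
import heapq


def astar_route(passable, start, goal):
    """Shortest 4-neighbour path over cells where passable[r][c] is True."""
    rows, cols = len(passable), len(passable[0])

    def heuristic(cell):
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    frontier = [(heuristic(start), 0, start)]
    came_from = {start: None}
    cost = {start: 0}
    while frontier:
        _, g, cell = heapq.heappop(frontier)
        if cell == goal:
            path = []
            while cell is not None:
                path.append(cell)
                cell = came_from[cell]
            return path[::-1]
        if g > cost[cell]:
            continue  # stale queue entry
        r, c = cell
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nr, nc = nxt
            if 0 <= nr < rows and 0 <= nc < cols and passable[nr][nc]:
                new_cost = g + 1
                if new_cost < cost.get(nxt, float("inf")):
                    cost[nxt] = new_cost
                    came_from[nxt] = cell
                    heapq.heappush(frontier, (new_cost + heuristic(nxt), new_cost, nxt))
    return None  # no route-through opportunity


# Hypothetical porosity map: True marks a cluster that a route may pass through.
POROSITY = [
    [True, True, True, True],     # newly mounted kernel (route source side)
    [False, False, True, False],  # existing kernel: one route-through column
    [False, False, True, False],  # existing kernel
    [True, True, True, True],     # I/O row at the bottom
]

route = astar_route(POROSITY, start=(0, 0), goal=(3, 0))
print(route)

BLOCKED = [row[:] for row in POROSITY]
BLOCKED[1][2] = False  # close the only route-through column
no_route = astar_route(BLOCKED, start=(0, 0), goal=(3, 0))
```

When no porous path exists the function returns `None`, which in a real resource manager would presumably force a different placement; that behavior is a guess, not something the slide states.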
WaveFlow Session Manager
• DF agent partitioning
• DFG throughput optimization
• Runs on Session Host
WaveFlow Execution Engine
• Resource Manager
• Monitors
• Drivers
• Runs on a Wave Deep Learning Computer
WaveFlow Agent Library
• BLAS 1,2,3
• CONV2D
• SoftMax, etc.
WaveFlow SDK
• WFG Compiler
• WFG Linker
• WFG Simulator
[Figure labels: On line, Off line, Encrypted Agent Code]
WaveFlow Software Stack
WaveFlow agents are precompiled off-line using WaveFlow SDK
• Wave provides a complete agent library for TensorFlow
• Customer can create additional agents for differentiation
Customer supplied agent source code
• Your new DNN training technique
Wave supplied agent source code
• MATMUL
• Batchnorm
• Relu, etc.
Wave SDK
• WFG Compiler
• WFG Linker
• WFG Simulator
• WFG Debugger
WaveFlow Agent Library
• Encrypted Agent Code
WaveFlow Agent Library
[Figure: WaveFlow SDK tool flow. Labels: ML function (gemm, sigmoid, …), LLVM Frontend, WFG Compiler, SAT Solver, Assembler, WFG Linker, WFG Simulator, Architectural Simulator, WaveFlow Agents]
To appear in ICCAD 2017
WFG = Wave Flow Graph
WaveFlow SDK
Kernels are islands of machine code scheduled onto machine cycles
Example: Sum of Products on 16 PEs in a single cluster
[Figure, left panel: WFG of Sum of Products]
[Figure, right panel: Sum of Products Kernel, drawn as a schedule with one column per PE (PE 0 to 15) and one row per machine cycle (Time). Opcodes appearing in the schedule: mov, mac, movcr, movi, add8, incc8, memr, memw, st, cmuxi, cmuxi. Rows in which all 16 PEs execute mac recur through the schedule, separated by rows of mov and memory operations.]
Wave SDK: Compiler Produces Kernels
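For readers unfamiliar with the example, a sum of products spread over 16 processing elements can be written functionally as below. This is only a reference model of the arithmetic: the strided data layout and the pairwise reduction are assumptions chosen for illustration, and the actual kernel is whatever schedule the WFG Compiler emits.

```python
NUM_PES = 16  # one cluster


def sum_of_products(a, b):
    """Compute sum(a[i] * b[i]) using 16 per-PE accumulators."""
    # Phase 1: each PE multiply-accumulates its own strided slice of the inputs.
    acc = [0] * NUM_PES
    for i, (x, y) in enumerate(zip(a, b)):
        acc[i % NUM_PES] += x * y
    # Phase 2 (assumed): combine the 16 partial sums pairwise in 4 steps.
    width = NUM_PES
    while width > 1:
        width //= 2
        for pe in range(width):
            acc[pe] += acc[pe + width]
    return acc[0]


a = list(range(64))
b = list(range(64, 0, -1))
total = sum_of_products(a, b)
print(total)
```

In phase 1 all 16 accumulators are updated independently, which corresponds to the idea of a row of `mac` instructions issued across the cluster in the same machine cycle.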
Session Manager Partitions & Maps to DPUs & Memory
Inference Graph generated directly from Keras model by Wave Compiler
Wave Flow Graph Format
Mapping Inception V4 to DPUs
Single Node
64-DPU Computer
Benchmarks on a single node 64-DPU Data Flow Computer
• ImageNet training, 90 epochs, 1.28M images, 224x224x3
• Seq2Seq training using parameters from https://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf by I. Sutskever, O. Vinyals & Q. Le
Network      Inferencing (Images/sec)   Training time
AlexNet      962,000                    40 mins
GoogleNet    420,000                    1 hour 45 mins
Squeezenet   75,000                     3 hours
Seq2Seq      -                          7 hours 15 min
Deep Neural Network Performance
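The training times in the table can be turned into an implied training throughput. The sketch below assumes that each image-network training time covers the full run described above (90 epochs over 1.28M images); the deck does not state the throughput figures themselves, so treat the results as derived estimates.

```python
EPOCHS = 90
IMAGES_PER_EPOCH = 1_280_000

# Training times from the table, in minutes (Seq2Seq is not an image workload)
TRAINING_MINUTES = {
    "AlexNet": 40,
    "GoogleNet": 105,
    "Squeezenet": 180,
}


def training_images_per_sec(minutes):
    """Images processed per second if the whole run takes `minutes`."""
    return EPOCHS * IMAGES_PER_EPOCH / (minutes * 60)


throughput = {name: training_images_per_sec(m) for name, m in TRAINING_MINUTES.items()}
for name, rate in throughput.items():
    print(f"{name}: {rate:,.0f} images/sec during training")
```

For AlexNet this works out to 48,000 images per second during training, compared with the 962,000 images per second quoted for inferencing.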
Wave is now accepting qualified customers to its Early Access Program (EAP)
Provides select companies access to a Wave machine learning computer for testing
and benchmarking months before official system sales begin
For details about participation in the limited number of EAP positions,
contact info@wavecomp.com
Wave’s Early Access Program

More Related Content

What's hot

Accelerate Service Function Chaining Vertical Solution with DPDK
Accelerate Service Function Chaining Vertical Solution with DPDKAccelerate Service Function Chaining Vertical Solution with DPDK
Accelerate Service Function Chaining Vertical Solution with DPDK
OPNFV
 
DPDK Summit - 08 Sept 2014 - 6WIND - High Perf Networking Leveraging the DPDK...
DPDK Summit - 08 Sept 2014 - 6WIND - High Perf Networking Leveraging the DPDK...DPDK Summit - 08 Sept 2014 - 6WIND - High Perf Networking Leveraging the DPDK...
DPDK Summit - 08 Sept 2014 - 6WIND - High Perf Networking Leveraging the DPDK...
Jim St. Leger
 
00 opencapi acceleration framework yonglu_ver2
00 opencapi acceleration framework yonglu_ver200 opencapi acceleration framework yonglu_ver2
00 opencapi acceleration framework yonglu_ver2
Yutaka Kawai
 
"Fast Deployment of Low-power Deep Learning on CEVA Vision Processors," a Pre...
"Fast Deployment of Low-power Deep Learning on CEVA Vision Processors," a Pre..."Fast Deployment of Low-power Deep Learning on CEVA Vision Processors," a Pre...
"Fast Deployment of Low-power Deep Learning on CEVA Vision Processors," a Pre...
Edge AI and Vision Alliance
 
DPDK Integration: A Product's Journey - Roger B. Melton
DPDK Integration: A Product's Journey - Roger B. MeltonDPDK Integration: A Product's Journey - Roger B. Melton
DPDK Integration: A Product's Journey - Roger B. Melton
harryvanhaaren
 
Energy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic TuningEnergy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic Tuning
inside-BigData.com
 
Maxwell siuc hpc_description_tutorial
Maxwell siuc hpc_description_tutorialMaxwell siuc hpc_description_tutorial
Maxwell siuc hpc_description_tutorialmadhuinturi
 
DPDK Summit - 08 Sept 2014 - Futurewei - Jun Xu - Revisit the IP Stack in Lin...
DPDK Summit - 08 Sept 2014 - Futurewei - Jun Xu - Revisit the IP Stack in Lin...DPDK Summit - 08 Sept 2014 - Futurewei - Jun Xu - Revisit the IP Stack in Lin...
DPDK Summit - 08 Sept 2014 - Futurewei - Jun Xu - Revisit the IP Stack in Lin...
Jim St. Leger
 
Sunx4450 Intel7460 GigaSpaces XAP Platform Benchmark
Sunx4450 Intel7460 GigaSpaces XAP Platform BenchmarkSunx4450 Intel7460 GigaSpaces XAP Platform Benchmark
Sunx4450 Intel7460 GigaSpaces XAP Platform Benchmark
Shay Hassidim
 
Learning from ZFS to Scale Storage on and under Containers
Learning from ZFS to Scale Storage on and under ContainersLearning from ZFS to Scale Storage on and under Containers
Learning from ZFS to Scale Storage on and under Containers
inside-BigData.com
 
ODSA Proof of Concept SmartNIC Speeds & Feeds
ODSA Proof of Concept SmartNIC Speeds & FeedsODSA Proof of Concept SmartNIC Speeds & Feeds
ODSA Proof of Concept SmartNIC Speeds & Feeds
ODSA Workgroup
 
ODSA Use Case - SmartNIC
ODSA Use Case - SmartNICODSA Use Case - SmartNIC
ODSA Use Case - SmartNIC
ODSA Workgroup
 
IBM and ASTRON 64-Bit Microserver Prototype Prepares for Big Bang's Big Data,...
IBM and ASTRON 64-Bit Microserver Prototype Prepares for Big Bang's Big Data,...IBM and ASTRON 64-Bit Microserver Prototype Prepares for Big Bang's Big Data,...
IBM and ASTRON 64-Bit Microserver Prototype Prepares for Big Bang's Big Data,...
IBM Research
 
ODSA Workshop: Development Effort Summary
ODSA Workshop: Development Effort SummaryODSA Workshop: Development Effort Summary
ODSA Workshop: Development Effort Summary
ODSA Workgroup
 
OVS and DPDK - T.F. Herbert, K. Traynor, M. Gray
OVS and DPDK - T.F. Herbert, K. Traynor, M. GrayOVS and DPDK - T.F. Herbert, K. Traynor, M. Gray
OVS and DPDK - T.F. Herbert, K. Traynor, M. Gray
harryvanhaaren
 
DPDK Summit 2015 - Intel - Keith Wiles
DPDK Summit 2015 - Intel - Keith WilesDPDK Summit 2015 - Intel - Keith Wiles
DPDK Summit 2015 - Intel - Keith Wiles
Jim St. Leger
 
Lightweight DNN Processor Design (based on NVDLA)
Lightweight DNN Processor Design (based on NVDLA)Lightweight DNN Processor Design (based on NVDLA)
Lightweight DNN Processor Design (based on NVDLA)
Shien-Chun Luo
 
DPDK summit 2015: It's kind of fun to do the impossible with DPDK
DPDK summit 2015: It's kind of fun  to do the impossible with DPDKDPDK summit 2015: It's kind of fun  to do the impossible with DPDK
DPDK summit 2015: It's kind of fun to do the impossible with DPDK
Lagopus SDN/OpenFlow switch
 
DPDK Summit - 08 Sept 2014 - NTT - High Performance vSwitch
DPDK Summit - 08 Sept 2014 - NTT - High Performance vSwitchDPDK Summit - 08 Sept 2014 - NTT - High Performance vSwitch
DPDK Summit - 08 Sept 2014 - NTT - High Performance vSwitch
Jim St. Leger
 

What's hot (20)

Accelerate Service Function Chaining Vertical Solution with DPDK
Accelerate Service Function Chaining Vertical Solution with DPDKAccelerate Service Function Chaining Vertical Solution with DPDK
Accelerate Service Function Chaining Vertical Solution with DPDK
 
DPDK Summit - 08 Sept 2014 - 6WIND - High Perf Networking Leveraging the DPDK...
DPDK Summit - 08 Sept 2014 - 6WIND - High Perf Networking Leveraging the DPDK...DPDK Summit - 08 Sept 2014 - 6WIND - High Perf Networking Leveraging the DPDK...
DPDK Summit - 08 Sept 2014 - 6WIND - High Perf Networking Leveraging the DPDK...
 
Infrastructure et serveurs HP
Infrastructure et serveurs HPInfrastructure et serveurs HP
Infrastructure et serveurs HP
 
00 opencapi acceleration framework yonglu_ver2
00 opencapi acceleration framework yonglu_ver200 opencapi acceleration framework yonglu_ver2
00 opencapi acceleration framework yonglu_ver2
 
"Fast Deployment of Low-power Deep Learning on CEVA Vision Processors," a Pre...
"Fast Deployment of Low-power Deep Learning on CEVA Vision Processors," a Pre..."Fast Deployment of Low-power Deep Learning on CEVA Vision Processors," a Pre...
"Fast Deployment of Low-power Deep Learning on CEVA Vision Processors," a Pre...
 
DPDK Integration: A Product's Journey - Roger B. Melton
DPDK Integration: A Product's Journey - Roger B. MeltonDPDK Integration: A Product's Journey - Roger B. Melton
DPDK Integration: A Product's Journey - Roger B. Melton
 
Energy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic TuningEnergy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic Tuning
 
Maxwell siuc hpc_description_tutorial
Maxwell siuc hpc_description_tutorialMaxwell siuc hpc_description_tutorial
Maxwell siuc hpc_description_tutorial
 
DPDK Summit - 08 Sept 2014 - Futurewei - Jun Xu - Revisit the IP Stack in Lin...
DPDK Summit - 08 Sept 2014 - Futurewei - Jun Xu - Revisit the IP Stack in Lin...DPDK Summit - 08 Sept 2014 - Futurewei - Jun Xu - Revisit the IP Stack in Lin...
DPDK Summit - 08 Sept 2014 - Futurewei - Jun Xu - Revisit the IP Stack in Lin...
 
Sunx4450 Intel7460 GigaSpaces XAP Platform Benchmark
Sunx4450 Intel7460 GigaSpaces XAP Platform BenchmarkSunx4450 Intel7460 GigaSpaces XAP Platform Benchmark
Sunx4450 Intel7460 GigaSpaces XAP Platform Benchmark
 
Learning from ZFS to Scale Storage on and under Containers
Learning from ZFS to Scale Storage on and under ContainersLearning from ZFS to Scale Storage on and under Containers
Learning from ZFS to Scale Storage on and under Containers
 
ODSA Proof of Concept SmartNIC Speeds & Feeds
ODSA Proof of Concept SmartNIC Speeds & FeedsODSA Proof of Concept SmartNIC Speeds & Feeds
ODSA Proof of Concept SmartNIC Speeds & Feeds
 
ODSA Use Case - SmartNIC
ODSA Use Case - SmartNICODSA Use Case - SmartNIC
ODSA Use Case - SmartNIC
 
IBM and ASTRON 64-Bit Microserver Prototype Prepares for Big Bang's Big Data,...
IBM and ASTRON 64-Bit Microserver Prototype Prepares for Big Bang's Big Data,...IBM and ASTRON 64-Bit Microserver Prototype Prepares for Big Bang's Big Data,...
IBM and ASTRON 64-Bit Microserver Prototype Prepares for Big Bang's Big Data,...
 
ODSA Workshop: Development Effort Summary
ODSA Workshop: Development Effort SummaryODSA Workshop: Development Effort Summary
ODSA Workshop: Development Effort Summary
 
OVS and DPDK - T.F. Herbert, K. Traynor, M. Gray
OVS and DPDK - T.F. Herbert, K. Traynor, M. GrayOVS and DPDK - T.F. Herbert, K. Traynor, M. Gray
OVS and DPDK - T.F. Herbert, K. Traynor, M. Gray
 
DPDK Summit 2015 - Intel - Keith Wiles
DPDK Summit 2015 - Intel - Keith WilesDPDK Summit 2015 - Intel - Keith Wiles
DPDK Summit 2015 - Intel - Keith Wiles
 
Lightweight DNN Processor Design (based on NVDLA)
Lightweight DNN Processor Design (based on NVDLA)Lightweight DNN Processor Design (based on NVDLA)
Lightweight DNN Processor Design (based on NVDLA)
 
DPDK summit 2015: It's kind of fun to do the impossible with DPDK
DPDK summit 2015: It's kind of fun  to do the impossible with DPDKDPDK summit 2015: It's kind of fun  to do the impossible with DPDK
DPDK summit 2015: It's kind of fun to do the impossible with DPDK
 
DPDK Summit - 08 Sept 2014 - NTT - High Performance vSwitch
DPDK Summit - 08 Sept 2014 - NTT - High Performance vSwitchDPDK Summit - 08 Sept 2014 - NTT - High Performance vSwitch
DPDK Summit - 08 Sept 2014 - NTT - High Performance vSwitch
 

Similar to A Dataflow Processing Chip for Training Deep Neural Networks

Project Slides for Website 2020-22.pptx
Project Slides for Website 2020-22.pptxProject Slides for Website 2020-22.pptx
Project Slides for Website 2020-22.pptx
AkshitAgiwal1
 
Building efficient 5G NR base stations with Intel® Xeon® Scalable Processors
Building efficient 5G NR base stations with Intel® Xeon® Scalable Processors Building efficient 5G NR base stations with Intel® Xeon® Scalable Processors
Building efficient 5G NR base stations with Intel® Xeon® Scalable Processors
Michelle Holley
 
Introduction to HPC & Supercomputing in AI
Introduction to HPC & Supercomputing in AIIntroduction to HPC & Supercomputing in AI
Introduction to HPC & Supercomputing in AI
Tyrone Systems
 
SoC Solutions Enabling Server-Based Networking
SoC Solutions Enabling Server-Based NetworkingSoC Solutions Enabling Server-Based Networking
SoC Solutions Enabling Server-Based Networking
Netronome
 
Introduction to DPDK
Introduction to DPDKIntroduction to DPDK
Introduction to DPDK
Kernel TLV
 
AI Accelerators for Cloud Datacenters
AI Accelerators for Cloud DatacentersAI Accelerators for Cloud Datacenters
AI Accelerators for Cloud Datacenters
CastLabKAIST
 
Exascale Capabl
Exascale CapablExascale Capabl
Exascale Capabl
Sagar Dolas
 
Advanced Computer Architecture
Advanced Computer ArchitectureAdvanced Computer Architecture
Advanced Computer Architecture
nibiganesh
 
HPC Infrastructure To Solve The CFD Grand Challenge
HPC Infrastructure To Solve The CFD Grand ChallengeHPC Infrastructure To Solve The CFD Grand Challenge
HPC Infrastructure To Solve The CFD Grand Challenge
Anand Haridass
 
Apache Big Data 2016: Next Gen Big Data Analytics with Apache Apex
Apache Big Data 2016: Next Gen Big Data Analytics with Apache ApexApache Big Data 2016: Next Gen Big Data Analytics with Apache Apex
Apache Big Data 2016: Next Gen Big Data Analytics with Apache Apex
Apache Apex
 
Dataplane networking acceleration with OpenDataplane / Максим Уваров (Linaro)
Dataplane networking acceleration with OpenDataplane / Максим Уваров (Linaro)Dataplane networking acceleration with OpenDataplane / Максим Уваров (Linaro)
Dataplane networking acceleration with OpenDataplane / Максим Уваров (Linaro)
Ontico
 
2018 03 25 system ml ai and openpower meetup
2018 03 25 system ml ai and openpower meetup2018 03 25 system ml ai and openpower meetup
2018 03 25 system ml ai and openpower meetup
Ganesan Narayanasamy
 
HPC and cloud distributed computing, as a journey
HPC and cloud distributed computing, as a journeyHPC and cloud distributed computing, as a journey
HPC and cloud distributed computing, as a journey
Peter Clapham
 
Introduction to Digital Signal processors
Introduction to Digital Signal processorsIntroduction to Digital Signal processors
Introduction to Digital Signal processors
PeriyanayagiS
 
End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...
End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...
End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...
DataWorks Summit/Hadoop Summit
 
High performace network of Cloud Native Taiwan User Group
High performace network of Cloud Native Taiwan User GroupHigh performace network of Cloud Native Taiwan User Group
High performace network of Cloud Native Taiwan User Group
HungWei Chiu
 
FPGA Hardware Accelerator for Machine Learning
FPGA Hardware Accelerator for Machine Learning FPGA Hardware Accelerator for Machine Learning
FPGA Hardware Accelerator for Machine Learning
Dr. Swaminathan Kathirvel
 
Streaming multiprocessors and HPC
Streaming multiprocessors and HPCStreaming multiprocessors and HPC
Streaming multiprocessors and HPC
OmkarKachare1
 
Backend.AI Technical Introduction (19.09 / 2019 Autumn)
Backend.AI Technical Introduction (19.09 / 2019 Autumn)Backend.AI Technical Introduction (19.09 / 2019 Autumn)
Backend.AI Technical Introduction (19.09 / 2019 Autumn)
Lablup Inc.
 

Similar to A Dataflow Processing Chip for Training Deep Neural Networks (20)

Project Slides for Website 2020-22.pptx
Project Slides for Website 2020-22.pptxProject Slides for Website 2020-22.pptx
Project Slides for Website 2020-22.pptx
 
Building efficient 5G NR base stations with Intel® Xeon® Scalable Processors
Building efficient 5G NR base stations with Intel® Xeon® Scalable Processors Building efficient 5G NR base stations with Intel® Xeon® Scalable Processors
Building efficient 5G NR base stations with Intel® Xeon® Scalable Processors
 
Introduction to HPC & Supercomputing in AI
Introduction to HPC & Supercomputing in AIIntroduction to HPC & Supercomputing in AI
Introduction to HPC & Supercomputing in AI
 
SoC Solutions Enabling Server-Based Networking
SoC Solutions Enabling Server-Based NetworkingSoC Solutions Enabling Server-Based Networking
SoC Solutions Enabling Server-Based Networking
 
Current Trends in HPC
Current Trends in HPCCurrent Trends in HPC
Current Trends in HPC
 
Introduction to DPDK
Introduction to DPDKIntroduction to DPDK
Introduction to DPDK
 
AI Accelerators for Cloud Datacenters
AI Accelerators for Cloud DatacentersAI Accelerators for Cloud Datacenters
AI Accelerators for Cloud Datacenters
 
Exascale Capabl
Exascale CapablExascale Capabl
Exascale Capabl
 
Advanced Computer Architecture
Advanced Computer ArchitectureAdvanced Computer Architecture
Advanced Computer Architecture
 
HPC Infrastructure To Solve The CFD Grand Challenge
HPC Infrastructure To Solve The CFD Grand ChallengeHPC Infrastructure To Solve The CFD Grand Challenge
HPC Infrastructure To Solve The CFD Grand Challenge
 
Apache Big Data 2016: Next Gen Big Data Analytics with Apache Apex
Apache Big Data 2016: Next Gen Big Data Analytics with Apache ApexApache Big Data 2016: Next Gen Big Data Analytics with Apache Apex
Apache Big Data 2016: Next Gen Big Data Analytics with Apache Apex
 
Dataplane networking acceleration with OpenDataplane / Максим Уваров (Linaro)
Dataplane networking acceleration with OpenDataplane / Максим Уваров (Linaro)Dataplane networking acceleration with OpenDataplane / Максим Уваров (Linaro)
Dataplane networking acceleration with OpenDataplane / Максим Уваров (Linaro)
 
2018 03 25 system ml ai and openpower meetup
2018 03 25 system ml ai and openpower meetup2018 03 25 system ml ai and openpower meetup
2018 03 25 system ml ai and openpower meetup
 
HPC and cloud distributed computing, as a journey
HPC and cloud distributed computing, as a journeyHPC and cloud distributed computing, as a journey
HPC and cloud distributed computing, as a journey
 
Introduction to Digital Signal processors
Introduction to Digital Signal processorsIntroduction to Digital Signal processors
Introduction to Digital Signal processors
 
End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...
End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...
End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...
 
High performace network of Cloud Native Taiwan User Group
High performace network of Cloud Native Taiwan User GroupHigh performace network of Cloud Native Taiwan User Group
High performace network of Cloud Native Taiwan User Group
 
FPGA Hardware Accelerator for Machine Learning
FPGA Hardware Accelerator for Machine Learning FPGA Hardware Accelerator for Machine Learning
FPGA Hardware Accelerator for Machine Learning
 
Streaming multiprocessors and HPC
Streaming multiprocessors and HPCStreaming multiprocessors and HPC
Streaming multiprocessors and HPC
 
Backend.AI Technical Introduction (19.09 / 2019 Autumn)
Backend.AI Technical Introduction (19.09 / 2019 Autumn)Backend.AI Technical Introduction (19.09 / 2019 Autumn)
Backend.AI Technical Introduction (19.09 / 2019 Autumn)
 

A Dataflow Processing Chip for Training Deep Neural Networks

  • 1. A Dataflow Processing Chip for Training Deep Neural Networks Dr. Chris Nicol Chief Technology Officer Wave Computing Copyright 2017.
  • 2. Wave Computing Profile
      • Founded in 2010 (Tallwood Venture Capital, Southern Cross Venture Partners)
      • Headquartered in Campbell, CA: world-class team of 53 dataflow, data science, and systems experts; 60+ patents
      • Invented the Dataflow Processing Unit (DPU) architecture to accelerate deep learning training by up to 1000x: Coarse Grain Reconfigurable Array (CGRA) architecture; static scheduling of dataflow graphs onto a massive array of processors
      • Now accepting qualified customers for the Early Access Program
  • 3. Challenges of Machine Learning
      • Extended training time due to increasing size of datasets: weeks to tune and train typical deep learning models
      • Hardware for accelerating ML was created for other applications: GPUs for graphics, FPGAs for RTL emulation
      • Data coming in “from the edge” is growing faster than the datacenter can accommodate or use it
      • Workflow: Design (neural network architecture, cost functions), Tune (parameter initialization, learning rate, mini-batch size), Train (accuracy, convergence rate), Deploy (for testing), Deploy (for production)
      • Problem: model development can take days or weeks
  • 4. Problems with Existing Solutions
      • Co-processors must wait on the CPU for instructions
      • This limits performance and reduces efficiency and scalability
      • Restricts embedded use cases to inferencing only
      • Figure: GPU waiting on CPU (Figure 13: EEG visualization of Inception training showing CPU and GPU activity). Source: Google; http://download.tensorflow.org/paper/whitepaper2015.pdf
  • 5. Wave Dataflow Processor is Ideal for Deep Learning
      • Deep learning networks are dataflow graphs
      • Programmed on deep learning software, run on the Wave Dataflow Processor
      • Figure labels: Times, Plus, Sigmoid, Softmax, Mem, I/O; WaveFlow Agent Library; Wave Dataflow Processor
  • 6. Wave Dataflow Processing Unit: Chip Characteristics & Design Features
      • 16ff CMOS process node
      • 16K processors, 8192 DPU arithmetic units
      • Self-timed, MPP synchronization
      • 181 peak tera-ops, 7.25 TB/s bisection bandwidth
      • 16 MB distributed data memory, 8 MB distributed instruction memory
      • 1.71 TB/s I/O bandwidth, 4096 programmable FIFOs
      • 270 GB/s peak memory bandwidth, 2048 outstanding memory requests
      • 4 billion 16-byte random access transfers per second
      • 4 Hybrid Memory Cube (HMC) interfaces, 2 DDR4 interfaces
      • PCIe Gen3 16-lane host interface
      • 32-bit Andes N9 MCU
      • 1 MB program store for paging
      • Hardware engine for fast loading of AES-encrypted programs
      • Up to 32 programmable dynamic reconfiguration zones
      • Variable fabric dimensions (user programmable at boot)
      • Design features: clock-less CGRA is robust to process, voltage, and temperature; distributed memory architecture for parallel processing; optimized for dataflow graph execution; DMA-driven architecture that overlaps I/O and computation
      • Block diagram labels: DDR4 (x2), HMC (x4), PCIe Gen3 x16, MCU, AXI4 (x4), Secure DPU Program Buffer, Secure DPU Program Loader
  • 7. Wave Current Generation DPU Board: Key DPU Board Features
      • 65,536 CGRA processing elements
      • 4 Wave DPU chips per board
      • Modular, flexible design
      • Multiple DPU boards per Wave Compute Appliance
      • Off-the-shelf components
      • 32 GB of ultra-high-speed DRAM
      • 512 GB of DDR4 DRAM
      • FPGA for high-speed board-to-board communication
  • 8. Wave’s Solution: Dataflow Computer for Deep Learning
      • Best-in-class, highly scalable deep learning training and inference
      • More than orders of magnitude better compute-power efficiency
      • Plug-and-play node in a datacenter network; Big Data: Hadoop, Yarn, Spark, Kafka
      • Native support of Google TensorFlow (initially)
  • 9. Dataflow Processing Element (PE)
      • Pipelined 1 KB single-port data RAM w/ BIST & ECC
      • Pipelined 256-entry instruction RAM w/ ECC
      • Quad of PEs (PE a, PE b, PE c, PE d) is fully connected
  • 10. Cluster of 16 Dataflow PEs
      • 16-processor cluster: a full-custom tiled GDSII block
      • Fully connected PE quads with fan-out
      • 8 DPU arithmetic units: per-cycle grouping into 8, 16, 24, 32, 64-bit operations; pipelined MAC units with (un)signed saturation; support for floating-point emulation; barrel shifter, bit processor; SIMD and MIMD instruction classes; data driven
      • 16 KB data RAM
      • 16 instruction RAMs
      • Full-custom semi-static digital circuits
      • Robust PVT-insensitive operation: scalable to low voltages; no global signals, no global clocks
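The per-cluster figures on this slide can be cross-checked against the per-chip and per-board figures quoted on slides 6 and 7. The short sketch below does that arithmetic; it assumes "16K processors" means 16 × 1024 = 16,384, which the slides do not state explicitly, and the variable names are illustrative only.

```python
# Figures quoted on the slides (variable names are illustrative).
PES_PER_CHIP = 16 * 1024        # "16K Processors" (assumed to mean 16,384)
PES_PER_CLUSTER = 16            # "16 Processor CLUSTER"
ARITH_UNITS_PER_CLUSTER = 8     # "8 DPU Arithmetic Units" per cluster
DATA_RAM_PER_CLUSTER_KB = 16    # "16KB Data RAM" per cluster
CHIPS_PER_BOARD = 4             # "4 Wave DPU chips per board"

# Derived quantities.
clusters_per_chip = PES_PER_CHIP // PES_PER_CLUSTER
arith_units_per_chip = clusters_per_chip * ARITH_UNITS_PER_CLUSTER
data_ram_per_chip_mb = clusters_per_chip * DATA_RAM_PER_CLUSTER_KB // 1024
pes_per_board = PES_PER_CHIP * CHIPS_PER_BOARD

print(f"clusters per chip:         {clusters_per_chip}")
print(f"arithmetic units per chip: {arith_units_per_chip}")
print(f"data RAM per chip (MB):    {data_ram_per_chip_mb}")
print(f"PEs per board:             {pes_per_board}")
```

Under that assumption the derived values agree with the slides: 8192 arithmetic units and 16 MB of distributed data memory per chip, and 65,536 processing elements per board.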
  • 11. Hybrid CGRA Architecture
      • Each cluster has a pipelined, instruction-driven word-level switch
      • Each cluster has 4 independent, pipelined, instruction-driven byte switches
      • Word switch supports fan-out and fan-in
      • All switches have registers for the router to use to avoid congestion
      • “Valid” and “invalid” data in the switch enables fan-in
  • 12. Data-Driven Power Management
      • From asleep to active: the word switch fabric remains active; if valid data arrives at a switch input AND the switch executes an instruction to send data to one of the quads, THEN wake up the PEs; copy the PC from the word switch to the PE and byte switch iRAMs; send the incoming data to the PEs
      • From active to asleep: a PE executes a “sleep” instruction; all PE and byte switch execution is suspended; the PE can opt for fast wakeup or slow wakeup (deep sleep with lower power)
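The wake and sleep rules on this slide amount to a small state machine. The following is a minimal behavioural sketch in Python, not Wave's hardware logic: the class `Quad`, its fields, and the method names are hypothetical, and the model ignores timing and the difference in wakeup latency between fast and deep sleep.

```python
from dataclasses import dataclass, field
from enum import Enum


class PEState(Enum):
    ASLEEP = "asleep"
    ACTIVE = "active"


@dataclass
class Quad:
    """Hypothetical model of one PE quad behind an always-active word switch."""
    state: PEState = PEState.ASLEEP
    pc: int = 0
    deep_sleep: bool = False
    inbox: list = field(default_factory=list)

    def on_switch_cycle(self, data_valid, sends_to_quad, switch_pc, data=None):
        # Wake only if valid data arrives AND the switch instruction
        # targets this quad (the AND condition from the slide).
        if not (data_valid and sends_to_quad):
            return
        if self.state is PEState.ASLEEP:
            self.state = PEState.ACTIVE
            self.pc = switch_pc        # PC is copied from the word switch
            self.deep_sleep = False
        self.inbox.append(data)        # incoming data is sent to the PEs

    def sleep(self, deep=False):
        # A PE executes "sleep": execution is suspended until the next wake.
        self.state = PEState.ASLEEP
        self.deep_sleep = deep


quad = Quad()
quad.on_switch_cycle(data_valid=True, sends_to_quad=False, switch_pc=5, data=1)
print(quad.state)   # still asleep: the switch did not target this quad
quad.on_switch_cycle(data_valid=True, sends_to_quad=True, switch_pc=5, data=42)
print(quad.state, quad.pc, quad.inbox)
quad.sleep(deep=True)
print(quad.state, quad.deep_sleep)
```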
  • 13. Compute Machine with AXI4 Interfaces
  • 14. Wave DPU Hierarchy: 24 Compute Machines
  • 15. Wave DPU Memory Hierarchy
  • 16. 6-10 GHz Auto-Calibrated Clock Distribution
      • Clock skew and jitter limit cycle time with traditional clock distribution
      • Self-timed “done” signal from PEs if they are awake; programmable tuning of margin
      • Synchronized with neighboring clusters to minimize skew
      • 1-sigma local mismatch ~1.3 ps and global + local mismatch ~6 ps at 140 ps cycle time
      • Figure: clock distribution and generation network across the entire fabric
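The quoted cycle time and mismatch figures can be related to the 6-10 GHz range in the slide title with a quick calculation. This is plain unit arithmetic on the numbers from the slide; the variable names are illustrative.

```python
# Numbers quoted on the slide.
CYCLE_TIME_PS = 140.0       # cycle time in picoseconds
LOCAL_MISMATCH_PS = 1.3     # 1-sigma local mismatch
TOTAL_MISMATCH_PS = 6.0     # global + local mismatch

# 1 / (1 ps) = 1000 GHz, so frequency in GHz is 1000 / (period in ps).
freq_ghz = 1000.0 / CYCLE_TIME_PS
local_fraction = LOCAL_MISMATCH_PS / CYCLE_TIME_PS
total_fraction = TOTAL_MISMATCH_PS / CYCLE_TIME_PS

print(f"frequency at 140 ps: {freq_ghz:.2f} GHz")
print(f"local mismatch:      {local_fraction:.1%} of a cycle")
print(f"total mismatch:      {total_fraction:.1%} of a cycle")
```

A 140 ps cycle corresponds to roughly 7.1 GHz, inside the 6-10 GHz range, and the ~6 ps combined mismatch is a little over 4% of that cycle.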
  • 17. Reset, Configuration Modes
      • Up-counter in each cluster is initialized to -(1 + Manhattan distance from end cluster); the example 4x4 grid of clusters shows counter values from -1 to -7
      • A control signal propagates from the start cluster to the end cluster, advancing 1 cluster per cycle; the propagate signal starts the up-counter in each cluster
      • When the counter reaches 0, either: reset the processors; suspend processors for configuration (at PC=0); or enable processors to execute (from PC=0)
      • Sequence shown in the figure (Steps 1 to 4): with the old kernel running, pre-program 4 clusters to ENTER config mode; propagate signal, counters operating, clusters stop in config mode; DMA new kernel instructions into cluster I-mems and pre-program 4 clusters to EXIT config mode; propagate signal, counters operating, all clusters running in sync on the new kernel
      • SW controls this process to manage surge current
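The counter scheme on this slide makes every cluster act in the same cycle even though the control signal reaches them at different times. The sketch below simulates that on a 4x4 grid. It rests on assumptions not stated on the slide: the start and end clusters sit at opposite corners, the signal reaches each cluster after a number of cycles equal to its Manhattan distance from the start cluster, and each counter increments once per cycle from the moment the signal arrives. The function name and coordinates are illustrative.

```python
def simulate_sync(rows, cols, start, end):
    """Return {cluster: cycle at which its up-counter reaches 0}."""

    def manhattan(a, b):
        return abs(a[0] - b[0]) + abs(a[1] - b[1])

    zero_cycle = {}
    for r in range(rows):
        for c in range(cols):
            cell = (r, c)
            # Signal advances 1 cluster per cycle from the start cluster
            # (assumed to follow shortest Manhattan paths).
            arrival = manhattan(cell, start)
            # Counter preset from the slide: -(1 + distance from end cluster).
            counter = -(1 + manhattan(cell, end))
            # Counting up by one per cycle, it hits 0 after -counter cycles.
            zero_cycle[cell] = arrival + (-counter)
    return zero_cycle


# Assumed layout: start cluster top-right, end cluster bottom-left.
zeros = simulate_sync(4, 4, start=(0, 3), end=(3, 0))
print(sorted(set(zeros.values())))
```

With opposite-corner start and end clusters, the distance to the start plus the distance to the end is the same for every cluster, so all counters reach 0 in the same cycle. The presets this produces also range from -1 at the end cluster to -7 at the start cluster, matching the values on the slide.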
  • 18. Dynamic Reconfiguration
      • Runtime resource manager in lightweight host
      • Mount(): online placement algorithm with maxrects management of empty clusters
      • Uses a “porosity map” for each kernel showing route-through opportunities (the SDK provides this)
      • Just-in-time place and route (using A*) of I/Os through other kernels without functional side effects
      • Unmount(): removes paths through other kernels
      • Machines are combined for mounting large kernels and partitioned during unmount()
      • Periodic garbage collection used for cleanup
      • Average mount time < 1 ms
      • Figure: runtime resource manager performing mount(). Before: kernel1 and kernel2 mounted, with empty clusters and I/Os at the bottom. After mount(kernel3): kernel3 occupies the empty clusters. Labels: “New Kernel goes here!”, “The I/Os cannot go here!”, “Just-in-time route through Kernel 2”
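The slide names A* as the search used to route a new kernel's I/Os through other kernels. As an illustration of the idea only, here is a generic grid A* in Python. The `porous` grid is a hypothetical stand-in for the per-kernel porosity map (True where a path may pass), and the layout, coordinates, and function name are invented for the example; the actual Wave resource manager is not described in enough detail to reproduce.

```python
import heapq


def astar_route(porous, start, goal):
    """Shortest 4-connected path from start to goal through porous cells.

    porous[r][c] is True if an I/O path may pass through that cluster.
    Returns a list of (row, col) cells, or None if no route exists.
    """
    rows, cols = len(porous), len(porous[0])

    def h(cell):  # Manhattan distance: admissible on a 4-connected grid
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    frontier = [(h(start), 0, start)]
    came_from = {start: None}
    cost = {start: 0}
    while frontier:
        _, g, cell = heapq.heappop(frontier)
        if cell == goal:
            path = []
            while cell is not None:
                path.append(cell)
                cell = came_from[cell]
            return path[::-1]
        if g > cost[cell]:
            continue  # stale queue entry
        r, c = cell
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nr, nc = nxt
            if 0 <= nr < rows and 0 <= nc < cols and porous[nr][nc]:
                ng = g + 1
                if ng < cost.get(nxt, float("inf")):
                    cost[nxt] = ng
                    came_from[nxt] = cell
                    heapq.heappush(frontier, (ng + h(nxt), ng, nxt))
    return None


# Hypothetical layout: kernel3 in row 2 must reach the I/O row (row 4)
# through kernel2 in row 3, which offers a single route-through cluster.
porous = [
    [False, False, False, False],  # kernel1: no route-through
    [False, False, False, False],  # kernel2: no route-through
    [True, True, True, True],      # kernel3: its own clusters
    [False, False, True, False],   # kernel2: one porous cluster
    [True, True, True, True],      # I/O row
]
print(astar_route(porous, start=(2, 0), goal=(4, 0)))
```

If the single porous cluster in row 3 were unavailable, the function would return None, which corresponds to a mount that cannot be routed at that placement.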
  • 19. WaveFlow Software Stack
      • WaveFlow SDK: WFG Compiler, WFG Linker, WFG Simulator
      • WaveFlow Agent Library: BLAS 1, 2, 3; CONV2D; SoftMax, etc.
      • WaveFlow Session Manager: DF agent partitioning; DFG throughput optimization; runs on session host
      • WaveFlow Execution Engine: resource manager; monitors; drivers; runs on a Wave Deep Learning Computer
      • Figure labels: On line, Off line, Encrypted Agent Code
  • 20. WaveFlow Agent Library
      • WaveFlow agents are pre-compiled off-line using the WaveFlow SDK
      • Wave provides a complete agent library for TensorFlow
      • Customer can create additional agents for differentiation
      • Wave SDK: WFG Compiler, WFG Linker, WFG Simulator, WFG Debugger
      • Figure labels: Wave supplied agent source code (MATMUL, Batchnorm, Relu, etc.); Customer supplied agent source code (“Your new DNN training technique”); Encrypted Agent Code
  • 21. WaveFlow SDK
      • Figure labels: ML function (gemm, sigmoid, …); LLVM Frontend; WFG Compiler; SAT Solver; WFG Linker; Assembler; Architectural Simulator; WFG Simulator; WaveFlow Agents
      • WFG = Wave Flow Graph
      • To appear in ICCAD 2017
  • 22. Wave SDK: Compiler Produces Kernels
      • Kernels are islands of machine code scheduled onto machine cycles
      • Example: sum of products on 16 PEs in a single cluster
      • Figure: the WFG of the sum of products beside the resulting kernel, drawn as a schedule grid of PE 0 to 15 against time; the opcodes in the grid are mov, movi, movcr, mac, add8, incc8, memr, memw, st, and cmuxi, mostly mov and mac
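To make the "sum of products on 16 PEs" example concrete, the sketch below shows one common way to spread a dot product over 16 lanes: each lane accumulates its share with multiply-accumulate steps, then the partial sums are combined in a tree. This is an assumed decomposition for illustration; the schedule the Wave compiler actually emits is only visible as the opcode grid on the slide, and the function name and lane mapping are hypothetical.

```python
def sum_of_products_16(a, b, lanes=16):
    """Dot product of a and b computed as 16 lane-local MACs plus a reduction."""
    # Phase 1: each lane (standing in for one PE) multiply-accumulates
    # every 16th element pair, like a run of "mac" instructions.
    acc = [0] * lanes
    for i, (x, y) in enumerate(zip(a, b)):
        acc[i % lanes] += x * y

    # Phase 2: pairwise tree reduction of the partial sums. On a real
    # fabric this would need data movement between PEs (the "mov" traffic).
    width = lanes
    while width > 1:
        width //= 2
        for lane in range(width):
            acc[lane] += acc[lane + width]
    return acc[0]


a = list(range(1, 33))
b = [2] * 32
print(sum_of_products_16(a, b))
print(sum(x * y for x, y in zip(a, b)))  # reference result for comparison
```

The tree reduction takes log2(16) = 4 combining steps, which is why a kernel of this kind tends to be dominated by data movement around a few bursts of mac instructions.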
  • 23. Mapping Inception V4 to DPUs
      • Inference graph generated directly from a Keras model by the Wave compiler (Wave Flow Graph format)
      • Session Manager partitions and maps to DPUs and memory
      • Single-node 64-DPU computer
  • 24. Deep Neural Network Performance
      • Benchmarks on a single-node 64-DPU dataflow computer
      • ImageNet training: 90 epochs, 1.28M images, 224x224x3
      • Seq2Seq training using parameters from https://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf by I. Sutskever, O. Vinyals & Q. Le

      Network      Inferencing (images/sec)   Training time
      AlexNet      962,000                    40 mins
      GoogleNet    420,000                    1 hour 45 mins
      Squeezenet   75,000                     3 hours
      Seq2Seq      -                          7 hours 15 mins
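The training times in this table can be turned into an implied training throughput. The calculation below assumes, and the slide does not confirm, that each quoted training time covers the full ImageNet configuration listed above (90 epochs over 1.28M images); the names are illustrative.

```python
# ImageNet configuration from the slide.
EPOCHS = 90
IMAGES_PER_EPOCH = 1_280_000
total_images = EPOCHS * IMAGES_PER_EPOCH

# Training times from the table, converted to minutes.
training_minutes = {
    "AlexNet": 40,
    "GoogleNet": 105,     # 1 hour 45 mins
    "Squeezenet": 180,    # 3 hours
}

# Implied images per second during training, under the assumption above.
implied_rate = {
    name: total_images / (minutes * 60)
    for name, minutes in training_minutes.items()
}

for name, rate in implied_rate.items():
    print(f"{name:<11} {rate:>10,.0f} images/sec (implied, training)")
```

On that reading, AlexNet's 40 minutes corresponds to about 48,000 training images per second, roughly a twentieth of its quoted inferencing rate.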
  • 25. Wave’s Early Access Program
      • Wave is now accepting qualified customers to its Early Access Program (EAP)
      • Provides select companies access to a Wave machine learning computer for testing and benchmarking months before official system sales begin
      • For details about participation in the limited number of EAP positions, contact info@wavecomp.com