"New Dataflow Architecture for Machine Learning," a Presentation from Wave Computing

Copyright © 2017 Wave Computing
Dr. Chris Nicol, CTO
May 2017
New Dataflow Architecture for Machine Learning

Wave Computing Profile
• Founded in 2010
• Tallwood Venture Capital
• Southern Cross Venture Partners
• Headquartered in Campbell, CA
• World-class team of 45 dataflow, data science, and systems experts
• 60+ patents
• Invented Dataflow Processing Unit (DPU) architecture to accelerate deep learning training by up to 1000x
• Coarse Grain Reconfigurable Array (CGRA) Architecture
• Static scheduling of data flow graphs onto massive array of processors
• Now accepting qualified customers for Early Access Program

Wave Computing – Market Focus
Wave’s Dataflow Computing Technology
• Wave’s initial market: Machine Learning in the Datacenter
• Consumer Smart Memory
• Industrial

Wave’s Dataflow Computer for ML Training
2.9 Peta-Ops/Second
256,000 Processing Elements
Over 2 TB Bulk & High Speed Memory
Up to 32 TB SSD Storage
Over 4.5 TB/Sec Dataflow Bandwidth
Up to 4 Wave Computers per Data Center Node
Initially Supporting TensorFlow

Inception v3 DNN Model
https://github.com/tensorflow/models/tree/master/inception
[Figure: Inception v3 network graph with stages input, icp1, icp2a, icp3b, icp4a, icp5b, icp6c, icp7d, icp8e, icp9, icp10a, icp11b, final]
Training on Wave Deep Learning Computer
(Wave uses 16-bit fixed point with stochastic rounding)
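The note on stochastic rounding can be illustrated with a minimal Python sketch. The slide only states "16-bit fixed point with stochastic rounding"; the split into 8 fractional bits, the saturation behavior, and the function names below are assumptions made for illustration, not Wave's implementation.

```python
import math
import random


def stochastic_round_fixed(value, frac_bits=8, total_bits=16, rng=random):
    """Quantize `value` to signed fixed point with stochastic rounding.

    The value is rounded up with probability equal to its fractional
    remainder, so the rounding error is zero on average.
    frac_bits=8 is an illustrative assumption, not a figure from the slides.
    """
    scaled = value * (1 << frac_bits)
    lower = math.floor(scaled)
    remainder = scaled - lower  # in [0, 1)
    q = lower + (1 if rng.random() < remainder else 0)
    # Saturate to the signed range representable in `total_bits`.
    q_min = -(1 << (total_bits - 1))
    q_max = (1 << (total_bits - 1)) - 1
    return max(q_min, min(q_max, q))


def to_float(q, frac_bits=8):
    """Convert a fixed-point integer back to a float."""
    return q / (1 << frac_bits)


# 0.3 * 256 = 76.8, so each call returns 76 or 77; the average over many
# calls approaches 76.8, which round-to-nearest (always 77) cannot do.
rng = random.Random(0)
samples = [stochastic_round_fixed(0.3, rng=rng) for _ in range(20000)]
mean_q = sum(samples) / len(samples)
print(to_float(mean_q))
```

The point of the technique is that small weight updates, which round-to-nearest would repeatedly discard, survive in expectation.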

Training Performance: Inception V3
Layer    Configuration  Weights (M)  GMAC   DPU Arith. Units Used
input                   0.213        4.64   15,872
icp1     35x35x256      0.255        0.94    3,392
icp2a    35x35x288      0.276        1.02    3,712
icp3b    35x35x288      0.284        1.04    3,712
icp4a    17x17x768      1.152        1.21    4,160
icp5b    17x17x768      1.294        1.12    4,096
icp6c    17x17x768      1.688        1.46    5,248
icp7d    17x17x768      1.688        1.46    5,248
icp8e    17x17x768      2.138        1.85    6,400
icp9     17x17x1280     1.696        0.87    3,072
icp10a   8x8x2048       5.038        0.97    3,584
icp11b   8x8x2048       6.070        1.17    4,288
final                   2.048        0.06       64
Total                   23.8         17.7   62,848
Training Processing per 299x299x3 Image

Implementation Stat             Value
Weights / Image                 47.6 MB
Memory Bandwidth                304 GB/sec
TMAC/s (Peak)                   53 (@6.67 GHz)
TMAC/s (utilization)            37.67
Utilization of DPU Arithmetic   71%
Images Total                    1.28 Million
Training Epochs                 90
Training Time                   15 Hours
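The figures above can be cross-checked with a short Python calculation. The slide does not spell out how the numbers relate, so the formulas below are assumptions: that the 17.7 GMAC total is the full training work per image, that the 37.67 TMAC/s rate is sustained throughout, and that each weight occupies 2 bytes (16-bit fixed point).

```python
# Figures quoted on the slide.
images_per_epoch = 1.28e6
epochs = 90
gmac_per_image = 17.7          # "Total" GMAC per 299x299x3 image
sustained_tmac_per_s = 37.67   # "TMAC/s (utilization)"
peak_tmac_per_s = 53.0         # "TMAC/s (Peak)"
weights_millions = 23.8        # "Total" weights (M)
bytes_per_weight = 2           # assumption: 16-bit fixed-point weights

# Total multiply-accumulates for the whole training run.
total_mac = images_per_epoch * epochs * gmac_per_image * 1e9

training_hours = total_mac / (sustained_tmac_per_s * 1e12) / 3600
utilization = sustained_tmac_per_s / peak_tmac_per_s
weights_mb = weights_millions * bytes_per_weight

print(f"training time  ~ {training_hours:.1f} hours")
print(f"utilization    ~ {utilization:.0%}")
print(f"weights/image  ~ {weights_mb:.1f} MB")
```

Under these assumptions the results agree with the quoted 15 hours, 71%, and 47.6 MB.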
• Direct execution of dataflow graph
• Load balancing for high throughput
• High utilization of MAC units
https://github.com/tensorflow/models/tree/master/inception

Current Technology Limits DL Performance
Existing CPUs & GPUs are not designed for dataflow computations
[Figure: Deep Learning Dataflow Graph, from http://download.tensorflow.org/paper/whitepaper2015.pdf]
Typical compute model: CPU + GPU/FPGA
OpenCL and CUDA
• Convert dataflow graph to sequential threads on host
• CUDA/OpenCL kernels for acceleration
* MPI = Message Passing Interface
1. Deep learning is a dataflow application
2. We execute dataflow applications on dataflow processors

Wave Dataflow Processor for Deep Learning
[Figure: a deep learning network drawn as a dataflow graph of Times, Plus, Sigmoid, Softmax, Mem, and I/O nodes, mapped onto the Wave Dataflow Processor]
• Deep Learning Networks are Dataflow Graphs
• Programmed on Deep Learning Software
• Run on Wave Dataflow Processor
• WaveFlow Agent Library
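The idea that a deep learning network is a dataflow graph can be sketched in plain Python, using the node types named in the figure (Times, Plus, Sigmoid, Softmax). This is a generic data-driven interpreter in which a node fires once its operands are available; it is not Wave's statically scheduled mapping, and the graph, node names, and weights are made up for illustration.

```python
import math
from collections import deque


def times(w, x):
    """Matrix-vector product; w is a list of rows."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def plus(a, b):
    return [ai + bi for ai, bi in zip(a, b)]


def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-vi)) for vi in v]


def softmax(v):
    m = max(v)  # subtract the max for numerical stability
    exps = [math.exp(vi - m) for vi in v]
    total = sum(exps)
    return [e / total for e in exps]


# Node name -> (operation, names of the values it consumes).
GRAPH = {
    "h_lin": (times, ["W1", "x"]),
    "h_sum": (plus, ["h_lin", "b1"]),
    "h": (sigmoid, ["h_sum"]),
    "y_lin": (times, ["W2", "h"]),
    "y_sum": (plus, ["y_lin", "b2"]),
    "y": (softmax, ["y_sum"]),
}


def run_dataflow(graph, inputs):
    """Fire each node as soon as all of its operands are available."""
    values = dict(inputs)
    pending = deque(graph)
    stalled = 0
    while pending:
        name = pending.popleft()
        op, deps = graph[name]
        if all(d in values for d in deps):
            values[name] = op(*(values[d] for d in deps))
            stalled = 0
        else:
            pending.append(name)
            stalled += 1
            if stalled >= len(pending):
                raise ValueError("graph has unmet dependencies")
    return values


inputs = {
    "x": [1.0, 2.0],
    "W1": [[0.5, -0.5], [0.25, 0.75]],
    "b1": [0.0, 0.1],
    "W2": [[1.0, -1.0], [-1.0, 1.0]],
    "b2": [0.0, 0.0],
}
out = run_dataflow(GRAPH, inputs)
print(out["y"])
```

Because execution is driven only by data availability, the order in which nodes are listed does not affect the result.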

CGRA Competitive Positioning
[Figure: Power Efficiency (TOPS/Watt) versus Flexibility, positioning CGRA (Wave Computing) relative to ASIC, ASSP, FPGA, DSP, GPU, MPPA, and CPU]
Coarse Grain Reconfigurable Array
• Good power efficiency from coarse-grain statically compiled architecture
• Does not require host with runtime API (like CUDA or OpenCL)
• Similar flexibility to MPPA, GPU and CPU
• Programmable in HLL
• Dynamic reconfiguration for mix of applications
• Training and Inference
• ML and non-ML

Wave DPU Architecture
24 Compute Machines
Cluster (16 PEs and 8 Arithmetic Units)

Wave DPU Chip
• 16 nm CMOS Process Node
• 16 K Processors, 8192 DPU Arithmetic Units
• 6.7 GHz (typ)
• 181 Peak Tera-Ops, 8.6 TB/s Bisection Bandwidth
• 16 MB Distributed Data Memory
• 8 MB Distributed Instruction Memory
• 1.71 TB/s I/O Bandwidth through 4096 Programmable FIFOs
• 270 GB/s Peak Memory Bandwidth
• 2048 outstanding memory requests
• PCIe Gen3-16 Host Interface
• 4 HMC Interfaces
• 2 DDR4 2400 Interfaces
• Hardware Engine for Fast Loading of Encrypted Programs
• Up to 32 Programmable dynamic reconfiguration zones
• Auto-Calibrated Clocking with no global signals
Chip Characteristics & Design Features
• Clock-less CGRA is robust to Process, Voltage & Temperature
• Distributed memory architecture for parallel processing
• Optimized for data flow graph execution
• DMA-driven architecture – overlapping I/O and computation
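The per-chip, per-cluster, and system-level figures in this deck can be related with a small Python calculation. The slides state neither the number of clusters per chip nor the number of DPU chips per computer, so the derived values below are inferences rather than quoted specifications; reading "16 K Processors" as 16 x 1024 is also an assumption.

```python
# Per-chip figures from the "Wave DPU Chip" slide.
processors_per_chip = 16 * 1024   # assumption: "16 K" read as 16 x 1024
arith_units_per_chip = 8192
peak_tera_ops_per_chip = 181

# Per-cluster figures from the "Wave DPU Architecture" slide.
pes_per_cluster = 16
arith_units_per_cluster = 8

# System figures from the "Dataflow Computer for ML Training" slide.
system_peta_ops = 2.9
system_processing_elements = 256_000

# Clusters per chip, derived two independent ways.
clusters_from_pes = processors_per_chip // pes_per_cluster
clusters_from_arith = arith_units_per_chip // arith_units_per_cluster

# Implied DPU chips per computer, derived two independent ways.
chips_from_ops = system_peta_ops * 1000 / peak_tera_ops_per_chip
chips_from_pes = system_processing_elements / processors_per_chip

print(clusters_from_pes, clusters_from_arith)
print(round(chips_from_ops, 2), round(chips_from_pes, 2))
```

Both cluster counts agree, and both chip estimates land close to 16, which suggests the quoted figures are mutually consistent.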

Wave Current Generation DPU Board

Wave’s Dataflow Solution is Plug & Play
Plug & Play
• No change in customer workflow
• Training and Inferencing
• TensorFlow works MUCH faster
TensorFlow subgraphs are executed on Wave DPU processors in the Wave Deep Learning Computers
Wave requires NO change to existing Data Center
[Figure: a data scientist's TensorFlow Client on a Client Host connects over the Data Center connection to Wave Session Hosts and existing ML nodes in the Data Center (Public Cloud or Enterprise)]

Running ML on WaveFlow SW Stack
[Figure: WaveFlow software stack spanning the Client Host and the Wave Session Host in the Data Center (Public Cloud or Enterprise)]
• Wave ML Framework Interface
• ProtoBuf: TensorFlow Frontend, Caffe Frontend, MXNet Frontend
• WaveFlow Session Manager
• WaveFlow Execution Engine
• WaveFlow Agent Library
• Wave Run-Time Software

WaveFlow Agent Library
• WaveFlow agents are precompiled offline using the WaveFlow SDK
• Wave provides a complete agent library for TensorFlow
• Customers can create additional agents for differentiation
[Figure: customer-supplied and Wave-supplied agent source code is built with the WaveFlow SDK into the WaveFlow Agent Library]
WaveFlow SDK
• WFG Compiler
• WFG Linker
• WFG Simulator
• WFG Debugger
WaveFlow Agent Library
• GEMM
• ReLU
• Softmax, etc.

TensorFlow example
# Copyright 2015 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""A very simple MNIST classifier.
See extensive documentation at
http://tensorflow.org/tutorials/mnist/beginners/index.md
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
# Import data
from tensorflow.examples.tutorials.mnist import input_data
import tensorflow as tf
flags = tf.app.flags
FLAGS = flags.FLAGS
flags.DEFINE_string('data_dir', '/tmp/data/', 'Directory for storing data')
mnist = input_data.read_data_sets(FLAGS.data_dir, one_hot=True)
sess = tf.InteractiveSession()
N = 100
# Create the model
x = tf.placeholder(tf.float32, [N, 784], name="W_x")
W = tf.Variable(tf.zeros([784, 10]))
b = tf.Variable(tf.zeros([10]))
y = tf.nn.softmax(tf.matmul(x, W) + b, name="W_y")
# Define loss and optimizer
y_ = tf.placeholder(tf.float32, [N,10], name="W_y_")
cross_entropy = tf.reduce_mean(-tf.reduce_sum(y_ * tf.log(y), reduction_indices=[1]),name="W_entropy")
train_step = tf.train.GradientDescentOptimizer(0.5).minimize(cross_entropy)
correct_prediction = tf.equal(tf.argmax(y, 1), tf.argmax(y_, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32),name="W_acc")
tf.initialize_all_variables().run()
# Write the graph definition as a text protobuf
tf.train.write_graph(sess.graph_def, ".", "mtrain.pb", True)
# Train, reporting accuracy, loss, and bias on the current batch
for i in range(101):
    batch_xs, batch_ys = mnist.train.next_batch(N)
    train_step.run({x: batch_xs, y_: batch_ys})
    print(accuracy.eval({x: batch_xs, y_: batch_ys}))
    print(cross_entropy.eval({x: batch_xs, y_: batch_ys}))
    print(sess.run(b))
Session Manager
• Dataflow graph is partitioned into agents and dispatched to the dataflow computer for execution
• Graph is generated at runtime from the graph provided by TensorFlow
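The partitioning step can be pictured with a small Python sketch. Everything here is hypothetical: the slides do not describe the Session Manager's algorithm, the mapping from TensorFlow op types to agent names is invented (only GEMM, ReLU, and Softmax are named as agents), and the "host" fallback for ops without an agent is an assumption.

```python
from itertools import groupby

# Hypothetical agent library: framework op type -> precompiled agent name.
AGENT_LIBRARY = {"MatMul": "GEMM", "Relu": "ReLU", "Softmax": "Softmax"}


def partition(nodes, library):
    """Group consecutive graph nodes by where they would run.

    `nodes` is an ordered list of (node_name, op_type) pairs. Nodes whose
    op type has an agent go to the "dpu" target; the rest stay on "host".
    Returns a list of (target, [(node_name, agent_or_None), ...]).
    """
    def target(node):
        _name, op_type = node
        return "dpu" if op_type in library else "host"

    plan = []
    for where, group in groupby(nodes, key=target):
        members = [(name, library.get(op_type)) for name, op_type in group]
        plan.append((where, members))
    return plan


# Illustrative node list loosely modeled on the MNIST example's forward pass.
nodes = [
    ("W_x", "Placeholder"),
    ("MatMul", "MatMul"),
    ("add", "Add"),
    ("W_y", "Softmax"),
]
plan = partition(nodes, AGENT_LIBRARY)
for where, members in plan:
    print(where, members)
```

A real partitioner would work on the graph's dependency structure rather than a flat list; the flat list keeps the sketch short.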

Wave’s Early Access Program
• Wave is now accepting qualified customers to its Early Access Program (EAP)
• Provides select companies access to a Wave machine learning computer for testing and benchmarking months before official system sales begin
• For details about participation in the limited number of EAP positions, contact info@wavecomp.com