- 1.
Enabling Computation for
Machine Learning Algorithms
Inspired by Neurobiology
Taking Lessons From Nature
Presentation to CSRC Colloquium, SDSU
January 27, 2017
Doug Bergman, PhD, Staff Scientist / Mathemaperson
KnuEdge
- 2.
Contents
• Neurobiological inspiration for machine learning algorithms
• Problems with existing machine learning tools and hardware
• Design principles
• Hermosa processor design
• Developer tools and Programming
• Events
- 3.
The Neurobiological Computer
Neurobiological Systems: Flexible and Scalable
Animal            # Neurons          log10 scale
Roundworm         302                (2)
Medicinal leech   10,000             (4)
Pond snail        11,000             (4)
Sea slug          18,000             (4)
Lobster           100,000            (5)
Fruit fly         250,000            (5)
Ant               250,000            (5)
Honey bee         960,000            (6)
Cockroach         1,000,000          (6)
Frog              16,000,000         (7)
Mouse             71,000,000         (8)
Finch             131,000,000        (8)
Octopus           500,000,000        (9)
Human             100,000,000,000    (11)
Elephant          200,000,000,000    (11)
Current generation machine learning capabilities
Where KnuEdge wants to be in 2021: MindScale.
Source: Wikipedia
https://en.wikipedia.org/wiki/List_of_animals_by_number_of_neurons
- 4.
BatVision
• SONAR rather than optical vision: “blind” vision
• 25,000 bats in one cave can avoid collisions
• Can tell a moth from a fly at 20 m
• Outgoing pulses
– Frequency: up to 120 kHz, decaying roughly 60% over the pulse
– Pulse width: 2 ms
– Pulse repetition rate: ~130 Hz, increasing with proximity
– Sound amplitude: 130 dB (as loud as a jet plane)
• Incoming reception
– Time resolution: 2 μs
• Signal processing
– ~10 million neurons
- 5.
Structure of the Mammalian Nervous System
• Extremely complex and diverse
• Thousands or millions of neuron types
• Complex topology of connections, like the Internet
Takeaway: heterogeneous and sparsely connected
- 6.
Natural Pattern Recognition
• Recognition is immediate and automatic
• Only a few unlabeled samples needed
• Inspiring development of new machine learning models
• Perfected through a billion years of natural selection
• Heterogeneity of neuron types
• Sparse, “plastic”, evolving connections
- 7.
Contents
• Neurobiological inspiration for machine learning algorithms
• Problems with existing machine learning tools and hardware
• Design principles
• Hermosa processor design
• Developer tools and Programming
• Events
- 8.
Artificial Pattern Recognition: Neural Networks
• Lots of costly computation required
• Hundreds or thousands of labeled samples needed to train
• Largely variations on homogeneous multilayer perceptron (MLP) models
• Relatively easy to develop
• Import a library into a script, train with data, run
• Build models without sophisticated theory; choose parameters arbitrarily
“Throw mud at the wall and see what sticks”
This has been the thrust of development, in part because these single-instruction, multiple-data (SIMD) models map well to existing hardware, e.g., GPUs.
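To see why such models map so well to SIMD hardware, note that each MLP layer is just the same multiply-accumulate loop repeated across neurons. A minimal C++ sketch of one forward layer (plain loops, no framework assumed):

    #include <cmath>
    #include <cstddef>
    #include <vector>

    // One fully connected layer: y = sigmoid(W x + b).
    // Every output neuron runs the same multiply-accumulate loop on
    // different data: exactly the SIMD pattern GPUs accelerate.
    std::vector<float> forward_layer(const std::vector<std::vector<float>>& W,
                                     const std::vector<float>& b,
                                     const std::vector<float>& x) {
        std::vector<float> y(W.size());
        for (std::size_t i = 0; i < W.size(); ++i) {   // same instruction...
            float acc = b[i];
            for (std::size_t j = 0; j < x.size(); ++j)
                acc += W[i][j] * x[j];                 // ...multiple data
            y[i] = 1.0f / (1.0f + std::exp(-acc));     // sigmoid activation
        }
        return y;
    }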
- 9.
Nature-Inspired Learning Models
• Heterogeneity of neuron types
• Hebbian learning: neurons interconnect sparsely after repeated sympathetic firing
• Plasticity and pruning: connections may change over time
• Recurrent models
• Sequential “deep learning” models allow training on unlabeled or partially labeled data
The field has been slower to develop, due in part to the unavailability of multiple-instruction, multiple-data (MIMD) processing.
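As a concrete illustration of the Hebbian rule above, a minimal C++ sketch of a sparse weight update with decay and pruning; the learning rate, decay, and pruning threshold are illustrative values, not taken from the slides:

    #include <unordered_map>

    // Sparse Hebbian plasticity: the Hebb term strengthens a connection
    // when pre- and post-synaptic neurons are active together; the decay
    // term weakens it otherwise, and weak connections are pruned.
    struct SparseSynapses {
        std::unordered_map<long, float> w;  // key = pre * N + post
        float eta = 0.01f, decay = 0.001f, prune_below = 1e-4f;

        void update(long pre, long post, float pre_act, float post_act, long N) {
            long key = pre * N + post;
            float& wij = w[key];                            // creates if absent
            wij += eta * pre_act * post_act - decay * wij;  // Hebb term + decay
            if (wij < prune_below) w.erase(key);            // pruning
        }
    };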
Raudies, Zilli, Hasselmo: http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0093250
Piekniewski, Laurent, Petre, Richert, Fisher, Hylton: https://arxiv.org/pdf/1607.06854v3.pdf
- 10.
Problem: Software
• Brittleness/Lack of Security
• Multiple modalities of communications
• Poor scatter/gather communications
• Power consumption
• Ease of use
• Lack of skilled manpower
[Figure: Increasing complexity of applications (lines of code), with examples at 1.8, 2.0, 24, and 90 million lines.]
Answer: Learning, based on neurobiological principles, could be preferable to programming.
- 11.
Current Generation Chips are a Problem
• Memory gets larger but not faster
• Logic gets faster but spends more time waiting for memory
• Logic gets more energy efficient but memory transport does not
Today’s processors: Time & Energy Dominated by Fetch
- 12.
Contents
• Neurobiological inspiration for machine learning algorithms
• Problems with existing machine learning tools and hardware
• Design principles
• Hermosa processor design
• Developer tools and Programming
• Events
- 13.
Hardware Solution: Wetware to Silicon
Nature's Design Principles
• Neuron body contains “context”
• Communication via synapses
• Axonal connections define geometry
Intellisis Design Goals
• Scalable heterogeneous parallelism
• Proximity of memory and processing
• Support for complex connections
• Communications-driven architecture
• All information is “pushed”
• Must operate with noise and failures
[Figure: a directed graph of functions F0–F3 linked by weighted edges W(i,j), contrasting neurobiology's sparse directed-graph connectivity with the random-access operations of a conventional ASIC.]
Takeaway: be flexible.
- 14.
Nodal Processing
• Data flow from inputs (raw data) to outputs (calculated results)
• Finite-state processing: “Lambda” push model (see the sketch below)
• Parallel processing kernels run independently of other activities and flows
• Nodes can be subnetworks themselves
• New nodes can be inserted at any time
[Fig 5: Heterogeneous network with heterogeneous nodes and subnets. Inputs flow through a processing pipeline of nodes (A, B, C, E, F, G, H, J, K, L, M, P, Q, T, X) grouped into subnetworks (Net A, Net B, Net C, Net D, Net X), ending in a classifier at the outputs.]
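To make the “Lambda” push model concrete, a minimal C++ sketch (types and names are illustrative, not the KNUPATH API) of a node that fires its kernel when data is pushed in and forwards the result downstream; splicing in a new node at runtime is just an edge-list edit:

    #include <functional>
    #include <vector>

    // Push-model dataflow node: nothing polls it; work happens only
    // when an upstream node pushes data in.
    struct Node {
        std::function<float(float)> kernel;   // this node's computation
        std::vector<Node*> downstream;        // outgoing connections

        void push(float value) {
            float result = kernel(value);     // fire on arrival
            for (Node* n : downstream)        // forward to successors (in
                n->push(result);              // hardware these run in parallel)
        }
    };

    // Inserting a new node at any time is an edge-list update:
    //   a.downstream.push_back(&new_node);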
- 15.
Hermosa Processing
Node Processing
• Each node accepts n inputs; when all n inputs have been processed, a single output is generated (sketched below)
• Specially designed for coupled differential equation solving
[Figure: Hermosa Processing Unit Model: inputs (connections) i = 0 … n−1 feed the processing element (the “neuron”, f_j), which produces outputs (activations).]
Edge Processing
• There are many more software connections than hardware connections
• A router performs the task of connecting neurons
[Figure: Hermosa Connection Processing: a router links processing units j and k.]
Directed Graph Representation of Processing
• Each line is called an “edge” and each circle is called a “vertex”
[Figure: a directed graph over functions F1–F8.]
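A sketch of that firing rule in C++ (illustrative, not the Hermosa microarchitecture): the node buffers arriving inputs and emits a single output only once all n of them have landed; this is also what lets a router multiplex many software connections over far fewer hardware links:

    #include <cstddef>
    #include <numeric>
    #include <optional>
    #include <vector>

    // Accumulate-and-fire node: emits one output only after all n
    // expected inputs have arrived (assumes each slot is written once
    // per round; arrival order does not matter).
    class HermosaStyleNode {
        std::vector<float> inputs_;
        std::size_t arrived_ = 0;
    public:
        explicit HermosaStyleNode(std::size_t n) : inputs_(n, 0.0f) {}

        std::optional<float> receive(std::size_t i, float value) {
            inputs_[i] = value;
            if (++arrived_ < inputs_.size()) return std::nullopt;
            arrived_ = 0;  // reset for the next round
            return std::accumulate(inputs_.begin(), inputs_.end(), 0.0f);
        }
    };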
- 16.
Lambda Architecture Model
• The fundamental architectural block is a cluster, a physical analog of a network node or subgraph
• Arbitrarily scalable and nestable
• Each cluster contains data storage and specialized processors; co-location minimizes latency and energy consumption
• Processes are pipelined by message passing among clusters
• Clusters retain finite-state information and are free to perform other tasks while waiting for responses (sketched below)
• Internal handling of processing within clusters minimizes traffic congestion
[Figure: four clusters (Cluster 0–3) exchanging pipelined messages over the fabric.]
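A toy sketch of this pattern in C++ (illustrative, not the LambdaFabric protocol): a cluster records a continuation for each outstanding request in its local state, so sends never block and the cluster keeps draining its own inbox while it waits:

    #include <deque>
    #include <functional>
    #include <unordered_map>

    struct Message { int src, tag; float payload; };

    class Cluster {
        std::deque<Message> inbox_;
        std::unordered_map<int, std::function<void(float)>> waiting_;
        int next_tag_ = 0;
    public:
        void deliver(const Message& m) { inbox_.push_back(m); }

        // Ask another cluster for something; resume later via `then`.
        void request(Cluster& remote, int self_id, float arg,
                     std::function<void(float)> then) {
            int tag = next_tag_++;
            waiting_[tag] = std::move(then);      // remember local state
            remote.deliver({self_id, tag, arg});  // push; do not wait
        }

        // Event loop calls this; the cluster stays busy between replies.
        void step() {
            if (inbox_.empty()) return;
            Message m = inbox_.front(); inbox_.pop_front();
            auto it = waiting_.find(m.tag);
            if (it != waiting_.end()) {
                it->second(m.payload);            // response arrived: resume
                waiting_.erase(it);
            }
            // else: a fresh request; compute locally and reply.
        }
    };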
- 17.
Another application: Graph Analytics
In addition to machine learning, the Hermosa processor and Lambda fabric architecture will handle graph processing elegantly, using data in compressed format (a CSR-style sketch follows the figure below).
[Figure: N memory clusters (Memory Cluster 0 … N−1), each with a manager and workers 0–6, linked by a router exchanging packets. Each cluster keeps vertex pointers and working data in its local memory and edge-successor data in its auxiliary memory.]
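The “compressed format” can be pictured as a compressed sparse row (CSR) layout, which mirrors the split in the figure: an array of vertex pointers in local memory and an array of edge successors in auxiliary memory. A minimal C++ sketch (illustrative, not KnuEdge's actual layout):

    #include <cstddef>
    #include <vector>

    // CSR graph: vertex_ptr[v] .. vertex_ptr[v+1] delimits the slice of
    // `successors` holding v's outgoing edges. vertex_ptr maps naturally
    // onto a cluster's local memory, successors onto auxiliary memory.
    struct CsrGraph {
        std::vector<std::size_t> vertex_ptr;  // size = num_vertices + 1
        std::vector<int> successors;          // size = num_edges

        template <typename F>
        void for_each_successor(int v, F f) const {
            for (std::size_t e = vertex_ptr[v]; e < vertex_ptr[v + 1]; ++e)
                f(successors[e]);  // no decompression step needed
        }
    };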
- 18.
Graph Finite-State Operations
• Graph Partitions = Data Objects = Memory Clusters
• Partitions have local states that track states of primitive functions
Example: find the minimum distance from vertex 0 to vertex 3.
[Figure: a four-vertex graph with candidate paths 0→1→3 and 0→2→3.]
D(0,3) = min( D(0,1) + D(1,3), D(0,2) + D(2,3) )
Each partition computes the minimum over the path segments it owns and forwards the running result to the partition that owns the destination vertex.
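One way to realize this with finite-state message passing, sketched in C++ (the queue stands in for fabric packets; this is a sketch of the idea, not KnuEdge's implementation):

    #include <limits>
    #include <queue>
    #include <utility>
    #include <vector>

    struct Relax { int vertex; float dist; };

    // Each "relax" message carries a candidate distance; the owner of a
    // vertex keeps the best value seen (its local state) and forwards
    // only improvements, so stale messages die out.
    float min_distance(const std::vector<std::vector<std::pair<int, float>>>& adj,
                       int src, int dst) {
        std::vector<float> best(adj.size(), std::numeric_limits<float>::infinity());
        std::queue<Relax> mailbox;
        mailbox.push({src, 0.0f});
        while (!mailbox.empty()) {
            Relax m = mailbox.front(); mailbox.pop();
            if (m.dist >= best[m.vertex]) continue;   // stale: drop
            best[m.vertex] = m.dist;                  // update local state
            for (auto [next, w] : adj[m.vertex])
                mailbox.push({next, m.dist + w});     // push improvements
        }
        return best[dst];
    }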
- 19.
Contents
• Neurobiological inspiration for machine learning algorithms
• Problems with existing machine learning tools and hardware
• Design principles
• Hermosa processor design
• Developer tools and Programming
• Events
- 20.
Asynchronous Cloud Processor Data Plane
[Figure: data-plane block diagram: an L1 router fanning out to 32 L2 routers; more than 8 GB of highly segmented memory in multiple blocks; 16 MACs; and a supervisor serving as root of trust.]
tDSP Processor core:
• 256 registers
• 4 Packet I/O engines
• Single cycle sleep state
• HW Event synchronization
• Independent clock domains
High-speed serial:
• 28.5 Gbps
• FEC
• 802.3 PHY
• 10GBASE-KR compatible
Layer 1 router:
• Ports: 49
• Rate: 8 GB/s per port
• Size: 2 mm²
• Latency: 7 ns
All communications via flit packets:
• Physical & virtual addressing
• Operation codes
Flit links:
• 64b/256b parallel
• Clock-domain crossing
• Interposer or die crossing
- 22.
LambdaFabric™: Scalable Across Resources
Cluster → Chip → Board → Rack → Multi-Rack
• Scale-invariant network architecture
• Low latency to everywhere
• High bandwidth
• Multi-dimensional connectivity
• Scalable up to 512,000 chips
• Latency grows from 6 ns at cluster scale to 62 ns, 247 ns, and 437 ns at successively larger scales
Low-latency, high-throughput, low-power computing fabric
[Figure: one cluster: eight tDSPs (tDSP 0–7), an AIP, and 2 MB of memory.]
- 23.
“Tiny DSP” Tuned to Stream Operations
Cluster and tDSP Processor
• 256 registers (128 general-purpose and 128 special)
• 1 GHz clock (gated)
• Harvard program/data memory separation
• Fixed 32-bit instruction word; format: Opcode | Op1 | Op2 | Op3 (an illustrative encoding is sketched below)
• Storage merged into a shared register file
• Built-in synchronization with event flags
• Scatter/gather engines
• Three states: sleep, run, or single-step
• Single-instruction packet launch
• 2K instruction store
• Multiple EPIC-style state machines
• No interrupts
• Everything is addressed, including registers
Fits in about 4 square microns – about the size of a neuron.
[Figure: cluster block diagram: four tDSPs sharing SRAM, with a controller, dispatcher, feeder, and packet router connecting to DRAM, the AIP, and other routers across the device.]
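As an illustration of the fixed 32-bit word, a sketch that packs the Opcode | Op1 | Op2 | Op3 fields; the equal 8-bit field widths are an assumption for illustration, not the documented Hermosa encoding:

    #include <cstdint>

    // Hypothetical 32-bit packing: Opcode | Op1 | Op2 | Op3 (8 bits each).
    constexpr uint32_t encode(uint8_t opcode, uint8_t op1,
                              uint8_t op2, uint8_t op3) {
        return (uint32_t(opcode) << 24) | (uint32_t(op1) << 16) |
               (uint32_t(op2) << 8)  |  uint32_t(op3);
    }
    constexpr uint8_t opcode_of(uint32_t insn) { return insn >> 24; }

    // With registers memory-mapped ("everything is addressed"), the
    // operand fields can name registers and memory uniformly.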
- 24.
Contents
• Neurobiological inspiration for machine learning algorithms
• Problems with existing machine learning tools and hardware
• Design principles
• Hermosa processor design
• Developer tools and Programming
• Events
- 25.
“Flit” Packet Communications
• Only a single modality of operation
• Can operate from registers instead of SRAM
• Superior for real-time transport
• “Cut-through” operations prioritize important traffic
• No bus contention
• All communications are bidirectional (a core can receive and transmit at the same time)
• Flit headers carry VLIW packet execution codes (POP)
Normal packet queuing: the next packet must wait a long period, up to 60,000 clocks, before the previous packet leaves the queue.
Flit packet queuing: the next packet waits only 4 clocks before the previous packet leaves the queue.
[Figure: packet format: a header with fields including OPC (operation code), ADR (address), SIZ (size), VP, POP, and V, followed by a payload of 0 to 32 quadwords (QWDS).]
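The header can be pictured as a C++ struct; the field names follow the figure, while the bit widths are assumptions for illustration:

    #include <cstdint>

    // Illustrative flit layout; field names from the packet-format
    // figure (OPC, ADR, SIZ, VP, POP, V), widths assumed.
    struct FlitHeader {
        uint32_t opc : 6;   // operation code
        uint32_t vp  : 1;   // virtual vs. physical addressing
        uint32_t siz : 6;   // payload length, 0..32 quadwords
        uint32_t pop : 8;   // VLIW packet execution code
        uint32_t v   : 1;   // valid bit
        uint64_t adr;       // destination address
    };

    struct FlitPacket {
        FlitHeader header;
        uint64_t payload[32];  // only header.siz quadwords are sent
    };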
- 26.
Hermosa Control-Plane Organization
• Router Arbitration
• Synchronization: flags (notifications), mutexes (ownership of a resource)
• Shared Memory Control: vector memory, matrix memory
• Layer Processing
• Feeders = programmable cache controllers
• Error handling
• SAM – System Activity Monitoring
• Traffic flows
[Figure: Hermosa hierarchical control and data distribution scheme: a supervisor (Supv) and HRouter1 at the root, HRouter2 and Ctrl2 elements below, and tDSPs at the leaves, with separate control and data paths extending to the next device.]
Each control element includes local memory and control registers.
Control-plane/data-plane separation is central to software-defined networks.
H1000 Event Flag Mapping
[Figure: event-flag registers at three levels: per-tDSP latch & control, the cluster event register (CCR), and the device event register (DCR), fed by 4 external device inputs and flags EVFG0–3, EVFD0–3, EVFC0–3, EVFMBX, EVFIOR, EVFDMA, EVFFDR, and EVFU0–15.]
• The AIP synch function sets the EVFCn event flags.
• The supervisor sets the EVFDn event flags.
• The host machine sets the EVFGn event flags.
• EVFCn and EVFUn event flags are settable by the tDSPs directly.
• EVFCn event flags are shared by all tDSPs in a cluster.
• A tDSP masks against all event flags via the WFLOR or WFLAND functions; if the result is true, the tDSP is activated. An event flag remains latched until cleared by the tDSP.
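A sketch of the latched-flag wake logic in C++ (illustrative; WFLOR and WFLAND are the slide's names for the OR and AND wait conditions):

    #include <cstdint>

    // Latched event flags: hardware sets them, the tDSP clears them,
    // and a masked OR/AND test decides whether to wake the tDSP.
    struct EventFlags {
        uint32_t latched = 0;

        void set(uint32_t flag)    { latched |= flag; }    // stays latched
        void clear(uint32_t mask)  { latched &= ~mask; }   // tDSP clears

        bool wflor(uint32_t mask) const  { return (latched & mask) != 0; }
        bool wfland(uint32_t mask) const { return (latched & mask) == mask; }
    };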
- 27.
KNUPATH Hermosa Programming Model
A range of programming interfaces balances performance and productivity. From highest performance to highest abstraction: Hermosa assembly language, the KNUPATH Performance Interface (KPI), and the KNUPATH Network Interface (KNI).

KNUPATH Network Interface (KNI):
• Implicit message-passing model
• Dataflow programming model based on Kahn Process Networks (KPN)
• C++ libraries and compiler extensions are used to support user-defined kernels and networks
• A host target allows development of KNI networks and kernels without Hermosa hardware

KNUPATH Performance Interface (KPI):
• Explicit message-passing model
• Hermosa kernel-function library for MPI-like message passing and synchronization
• C/C++ kernel development using the clang/LLVM compiler toolchain with a Hermosa target
• Host/device accelerator model, similar to CUDA/OpenCL workflows
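For flavor, a sketch of what an explicit message-passing kernel in the KPI style might look like; every identifier here (kpi_rank, kpi_send, kpi_recv) is a hypothetical placeholder for the library's MPI-like calls, not the real API:

    // Hypothetical KPI-style kernel: rank 0 sends work to rank 1,
    // which processes it and sends the result back.
    extern int  kpi_rank();                                   // placeholder
    extern void kpi_send(int dest, const void* buf, int n);   // placeholder
    extern void kpi_recv(int src, void* buf, int n);          // placeholder

    void kernel_main() {
        float x;
        if (kpi_rank() == 0) {
            x = 3.14f;
            kpi_send(1, &x, sizeof x);   // explicit message passing
            kpi_recv(1, &x, sizeof x);   // wait for the result
        } else {
            kpi_recv(0, &x, sizeof x);
            x *= 2.0f;                   // do the work
            kpi_send(0, &x, sizeof x);
        }
    }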
- 28.
Contents
• Neurobiological inspiration for machine learning algorithms
• Problems with existing machine learning tools and hardware
• Design principles
• Hermosa processor design
• Developer tools and Programming
• Events
- 29.
KnuEdge Web Service
• Hermosa developer boards are available for developing parallel computation on a queued server
• Available to anyone, free of charge
• Training materials available
• User agreement required
• Contact jotchis@knupath.com for details
- 30.
Upcoming Events
Thursday, Feb. 16, 2017: Machine Learning Society Meetup: Processing Hardware for Deep Learning. Panelist: D. Palmer (KnuEdge CTO). At ScaleMatrix, 5775 Kearny Villa Road, San Diego.
https://www.meetup.com/machine-learning-society/events/237055385/
www.mlsociety.com

April 26–28, 2017: Workshop on Sparse and Heterogeneous Neural Networks. Hosted by the California Institute for Telecommunications and Information Technology (Calit2); sponsored by KnuEdge. Participants wanted: submit your abstract! At Calit2, Atkinson Hall, UCSD campus.
https://www.knuedge.com/about-us/events/hnnworkshop/

TBA: KnuEdge Developer Users' Group. Stay tuned!