V8.8.2016
Enabling Computation for
Machine Learning Algorithms
Inspired by Neurobiology
Taking Lessons From Nature
Presentation to CSRC Colloquium, SDSU
January 27, 2017
Doug Bergman, PhD, Staff Scientist / Mathemaperson
KnuEdge
Contents
• Neurobiological inspiration for machine learning algorithms
• Problems with existing machine learning tools and hardware
• Design principles
• Hermosa processor design
• Developer tools and Programming
• Events
© KnuEdge™ 2016. All Rights Reserved. 2
The Neurobiological Computer
Neurobiological Systems: Flexible and Scalable
Animal | # Neurons | Order of magnitude (10^x)
Roundworm | 302 | 2
Medicinal leech | 10,000 | 4
Pond snail | 11,000 | 4
Sea slug | 18,000 | 4
Lobster | 100,000 | 5
Fruit fly | 250,000 | 5
Ant | 250,000 | 5
Honey bee | 960,000 | 6
Cockroach | 1,000,000 | 6
Frog | 16,000,000 | 7
Mouse | 71,000,000 | 8
Finch | 131,000,000 | 8
Octopus | 500,000,000 | 9
Human | 100,000,000,000 | 11
Elephant | 200,000,000,000 | 11
Chart annotation: Current generation machine learning capabilities
Chart annotation: Where KnuEdge wants to be in 2021: MindScale.
Source: Wikipedia
https://en.wikipedia.org/wiki/List_of_animals_by_number_of_neurons
BatVision
• SONAR or optical “blind” vision
• 25,000 bats in one cave can avoid collisions
• Can tell a moth from a fly at 20 m
• Outgoing pulses
– Frequency: Up to 120 kHz, f ~ 60% (decay)
– Pulse width: 2 ms
– Pulse rep rate of ~130 Hz, increasing with proximity
– Sound amplitude of 130 decibels (jet plane loud)
• Incoming reception
– Time resolution of 2 μs
• Signal Processing
– ~ 10 million neurons
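To put the pulse figures above in physical terms, here is a rough back-of-the-envelope sketch. It assumes a speed of sound in air of about 343 m/s, which the slide does not state, and the usual two-way echo relation (range = speed × time / 2). The function names are illustrative only.

```python
SPEED_OF_SOUND_M_S = 343.0  # assumption: air at about 20 degrees C, not from the slide


def range_resolution_m(time_resolution_s):
    """Smallest range difference resolvable from echo timing (two-way path)."""
    return SPEED_OF_SOUND_M_S * time_resolution_s / 2.0


def round_trip_time_s(distance_m):
    """Time for a pulse to reach a target and for its echo to return."""
    return 2.0 * distance_m / SPEED_OF_SOUND_M_S


def max_unambiguous_range_m(pulse_rate_hz):
    """Farthest target whose echo returns before the next pulse is emitted."""
    return SPEED_OF_SOUND_M_S / (2.0 * pulse_rate_hz)


print(range_resolution_m(2e-6))        # range resolution for 2 microsecond timing
print(max_unambiguous_range_m(130.0))  # range covered at a 130 Hz pulse rate
print(round_trip_time_s(20.0))         # echo delay for a target at 20 m
```

Under these assumptions, 2 μs timing corresponds to roughly a third of a millimetre in range, and a 130 Hz pulse rate only covers targets within about 1.3 m, which is consistent with the slide's remark that the pulse rate increases with proximity.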
Structure of the Mammalian Nervous System
• Extremely complex and diverse
• Thousands or millions of neuron types
• Complex topology of connections, like the internet
Take away: Heterogeneous and sparsely connected
Natural Pattern Recognition
• Recognition is immediate and automatic
• Only a few unlabeled samples needed
• Inspiring development of new machine
learning models
• Perfected through a billion years of natural
selection
• Heterogeneity of neuron types
• Sparse, “plastic” evolving connections
Artificial Pattern Recognition: Neural Networks
• Lots of costly computation required
• Hundreds or thousands of labeled samples
to train
• Largely variations on homogeneous
Multilayered Perceptron models
• Relatively easy to develop
• Import library into script, train with data, run
• Build models without sophisticated theory; choose
parameters arbitrarily
“Throw mud on wall, see what sticks”
This has been the thrust of development, in part because these single-instruction, multiple-data (SIMD) models map well to existing hardware, e.g. GPUs.
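The homogeneity that makes multilayer perceptrons a good fit for SIMD hardware can be seen in a minimal sketch. This is a generic illustration in plain Python, not code from the talk: every neuron in a layer performs the same multiply-accumulate followed by the same activation, only on different weights. The choice of tanh and all function names are assumptions.

```python
import math


def dense_layer(inputs, weights, biases):
    """One homogeneous layer: identical operation per neuron, different data."""
    return [
        math.tanh(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def mlp_forward(inputs, layers):
    """Apply a list of (weights, biases) layers in sequence."""
    activations = inputs
    for weights, biases in layers:
        activations = dense_layer(activations, weights, biases)
    return activations


# Hypothetical two-layer network: 2 inputs, 2 hidden neurons, 1 output.
example_layers = [
    ([[0.5, -0.5], [1.0, 1.0]], [0.0, 0.0]),
    ([[1.0, -1.0]], [0.0]),
]
print(mlp_forward([1.0, 1.0], example_layers))
```

Because the inner loop is the same for every neuron, a GPU can run it in lockstep across thousands of lanes; a network of heterogeneous neuron types would not map this way.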
Nature-Inspired Learning Models
• Heterogeneity of neuron types
• Hebbian Learning: neurons interconnect
sparsely after repeated sympathetic firing
• Plasticity and pruning: connections may
change over time
• Recurrent models
• Sequential “Deep Learning” models allow for training on unlabeled or partially-labeled data
The field has been slower to develop, due in part to the unavailability of Multiple-Instruction, Multiple-Data (MIMD) processing.
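A minimal sketch of the Hebbian and pruning ideas listed above, under stated assumptions: connections are stored sparsely, a connection strengthens when its pre- and post-synaptic neurons fire together, all weights decay slowly, and weak connections are pruned. The update rule, parameter values, and names are illustrative and are not taken from the talk or the cited papers.

```python
def hebbian_step(weights, pre, post, rate=0.1, decay=0.02, prune_below=0.01):
    """Return updated sparse weights {(i, j): w} after one activity pattern.

    pre and post are lists of activity levels (for example 0.0 or 1.0).
    """
    updated = {}
    for i, x in enumerate(pre):
        for j, y in enumerate(post):
            w = weights.get((i, j), 0.0)
            w = (1.0 - decay) * w + rate * x * y  # decay plus co-activity term
            if w >= prune_below:                  # pruning keeps the graph sparse
                updated[(i, j)] = w
    return updated


weights = {}
for _ in range(5):
    weights = hebbian_step(weights, pre=[1.0, 0.0], post=[0.0, 1.0])
print(weights)  # only the co-active pair (0, 1) has formed a connection
```

Connections here appear, strengthen, and vanish over time, so the topology itself is part of the learned state.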
Raudies, Zilli, Hasselmo: http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0093250
Piekniewski, Laurent, Petre, Richert, Fisher, Hylton: https://arxiv.org/pdf/1607.06854v3.pdf
Problem: Software
• Brittleness/Lack of Security
• Multiple modalities of communications
• Poor scatter/gather communications
• Power consumption
• Ease of use
• Lack of skilled manpower
[Figure: Increasing Complexity of Applications (Lines of Code): 24 million, 1.8 million, 90 million, 2.0 million]
Answer: Learning, based on neurobiological principles, could be preferable to programming.
Current Generation Chips are a Problem
• Memory gets larger but not
faster
• Logic gets faster but spends
more time waiting for
memory
• Logic gets more energy
efficient but memory
transport does not
Today’s processors: Time & Energy Dominated by Fetch
Hardware Solution: Wetware to Silicon
Nature’s Design Principles
• Neuron body contains “context”
• Communication via synapses
• Axonal connections define geometry
Intellisis Design Goals
• Scalable heterogeneous parallelism
• Proximity of memory and processing
• Support for complex connections
• Communications driven architecture
• All information is “pushed”
• Must operate with noise and failures
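[Figure: Directed graph with vertices F0, F1, F2, F3 and weighted edges W1,0, W0,2, W2,1, W3,1, W2,3, W0,3]
\[
\begin{bmatrix} A'_0 \\ A'_1 \\ A'_2 \\ A'_3 \end{bmatrix}
=
\begin{bmatrix}
0 & 0 & W_{2,0} & W_{3,0} \\
W_{0,1} & 0 & 0 & 0 \\
0 & W_{1,2} & 0 & W_{3,2} \\
0 & W_{1,3} & 0 & 0
\end{bmatrix}
\begin{bmatrix} A_0 \\ A_1 \\ A_2 \\ A_3 \end{bmatrix}
\]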
Neurobiology Random Access Operations
ASIC
Take away: Be flexible.
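The directed graph on this slide can be read as a sparse update A' = W A over four activations. The sketch below stores only the six nonzero weights and pushes each activation along its outgoing edges. It assumes the subscript convention W[(source, destination)], and the numeric weights are made up for illustration.

```python
def propagate(edges, activations):
    """Compute A'[dst] = sum over src of W[(src, dst)] * A[src]."""
    updated = [0.0] * len(activations)
    for (src, dst), weight in edges.items():
        updated[dst] += weight * activations[src]  # "push" along one edge
    return updated


# Six nonzero weights; the values are hypothetical.
edges = {
    (2, 0): 1.0, (3, 0): 1.0,
    (0, 1): 1.0,
    (1, 2): 1.0, (3, 2): 1.0,
    (1, 3): 1.0,
}
print(propagate(edges, [1.0, 2.0, 3.0, 4.0]))  # [7.0, 1.0, 6.0, 2.0]
```

Only six multiply-adds are performed instead of the sixteen a dense 4 × 4 product would need, which is the practical payoff of sparse connectivity.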
Nodal Processing
• Data flow from inputs
(raw data) to outputs
(calculated results)
• Finite-state processing:
“Lambda” push model
• Parallel processing
kernels occur
independent of other
activities and flows
• Nodes can be
subnetworks themselves
• New nodes can be
inserted at any time
[Fig 5. Heterogeneous network with heterogeneous nodes and subnets. Labels: Inputs, Outputs, Processing Pipeline, Classifier, nodes A through X, and subnetworks Net A, Net B, Net C, Net D, Net X.]
Hermosa Processing
Node Processing
• Each node accepts n inputs.
• When all n inputs are processed, a single output is generated.
• Specially designed for coupled differential equation solving
[Figure: Hermosa Processing Unit Model. Inputs (connections) i = 0 … n-1 enter the processing element (neuron) f_j, which emits outputs (activations).]
Edge Processing
• There are many more software connections than hardware connections.
• A router performs the task of connecting neurons.
[Figure: Hermosa Connection Processing, with a router between elements J and k.]
Directed Graph Representation of Processing
[Figure: directed graph with vertices F1 through F8.]
Each line is called an “edge” and each circle is called a “vertex”.
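The node behaviour described for Hermosa processing (accept n inputs, then emit a single output that is pushed onward) can be sketched as a small dataflow model. This is a plain-Python illustration of the idea only; the class and method names are hypothetical and do not correspond to any Hermosa API.

```python
class Node:
    """A vertex that fires once all of its n input slots have been filled."""

    def __init__(self, name, n_inputs, func):
        self.name = name
        self.n_inputs = n_inputs
        self.func = func
        self.pending = {}     # finite state: inputs received so far
        self.successors = []  # edges as (target node, target input slot)
        self.outputs = []     # record of every value this node produced

    def connect(self, target, slot):
        self.successors.append((target, slot))

    def push(self, slot, value):
        """Deliver one input; fire and push the result when all n have arrived."""
        self.pending[slot] = value
        if len(self.pending) == self.n_inputs:
            args = [self.pending[i] for i in range(self.n_inputs)]
            self.pending = {}
            result = self.func(*args)
            self.outputs.append(result)
            for target, target_slot in self.successors:
                target.push(target_slot, result)


adder = Node("add", 2, lambda a, b: a + b)
doubler = Node("double", 1, lambda x: 2 * x)
adder.connect(doubler, 0)
adder.push(0, 3)        # nothing fires yet: one input is still missing
adder.push(1, 4)        # adder fires, and its result is pushed to doubler
print(doubler.outputs)  # [14]
```

No node ever asks for data; values arrive and trigger work, which is the sense in which all information is "pushed".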
Lambda Architecture Model
• The fundamental architecture block is a cluster, a physical analog of a network node or subgraph
• Arbitrarily scalable and nestable
• Each cluster contains data storage and specialized processors
Co-location minimizes latency and energy consumption
• Processes are pipelined by message passing among clusters
• Clusters retain finite state information; are free to perform other tasks while waiting for responses
• Internal handling of processing within clusters minimizes traffic congestion
[Figure: Cluster 0, Cluster 1, Cluster 2, Cluster 3]
Another application: Graph Analytics
In addition to machine learning, the Hermosa processor and Lambda fabric architecture will handle graph processing elegantly, using data in compressed format.
[Figure: Memory Cluster 0, Memory Cluster 1, …, Memory Cluster N-1, exchanging packets through a router. Each cluster has a Manager and Workers 0 through 6. Vertex pointers and working data are held in local memory; edge successor data is held in auxiliary memory.]
Graph Finite-State Operations
• Graph Partitions = Data Objects = Memory Clusters
• Partitions have local states that track states of primitive functions
Example: Find Minimum Distance from Vertex 0 to Vertex 3
\[
D_{\min}(0,3) = \min\bigl\{\, D_{\min}(0,1) + D(1,3),\; D_{\min}(0,2) + D(2,3) \,\bigr\}
\]
[Figure: graph from vertex 0 to vertex 3, labeled with the intermediate results D_min(0,1) and D_min(0,2) and the edge lengths D(1,3) and D(2,3)]
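The minimum-distance example can be sketched with a standard Bellman-Ford-style relaxation, where each vertex's best known distance is local state that is updated as shorter paths arrive. This is a generic algorithm swapped in for illustration, not the partitioned finite-state implementation from the slide, and the edge lengths below are hypothetical.

```python
def min_distance(edges, source, target):
    """Shortest distance over directed edges {(u, v): length}, lengths >= 0."""
    vertices = {v for edge in edges for v in edge}
    best = {source: 0.0}  # per-vertex state: best distance found so far
    for _ in range(max(len(vertices) - 1, 0)):
        for (u, v), length in edges.items():
            if u in best and best[u] + length < best.get(v, float("inf")):
                best[v] = best[u] + length
    return best.get(target, float("inf"))


# Two routes from vertex 0 to vertex 3: via vertex 1 or via vertex 2.
edges = {(0, 1): 2.0, (0, 2): 1.0, (1, 3): 1.0, (2, 3): 5.0}
print(min_distance(edges, 0, 3))  # min(2 + 1, 1 + 5) = 3.0
```

The final value is the minimum over the two routes into vertex 3, one through vertex 1 and one through vertex 2.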
Asynchronous Cloud Processor Data Plane
[Figure: L1 Router; L2 (×32); > 8 Gbytes Highly Segmented Memory in Multiple Memory Blocks; MAC (×16); Supervisor Root of Trust]
tDSP Processor core:
• 256 registers
• 4 Packet I/O engines
• Single cycle sleep state
• HW Event synchronization
• Independent clock domains
High-speed Serial:
• 28.5 Gbps
• FEC
• 802.3 PHY
• 10GBASE-KR compatible
Layer 1 Router:
• Ports: 49
• Rate: 8 GB/s/port
• Size: 2 mm²
• Latency: 7 ns
All communications via flit packets:
• Physical & virtual addressing
• Operation codes
Flit links:
• 64b/256b parallel
• Clock domain crossing
• Interposer or die crossing
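As a quick sanity check on the Layer 1 router figures, the arithmetic below multiplies out the port count and per-port rate and estimates how long a message occupies one port. It assumes decimal gigabytes (1 GB = 10^9 bytes) and a hypothetical 256-byte message; neither is stated on the slide.

```python
def aggregate_bandwidth_gb_s(ports, rate_gb_s_per_port):
    """Total router bandwidth if every port runs at full rate."""
    return ports * rate_gb_s_per_port


def transfer_time_ns(n_bytes, rate_gb_s):
    """Time to move n_bytes through one port; bytes / (GB/s) gives nanoseconds."""
    return n_bytes / rate_gb_s


print(aggregate_bandwidth_gb_s(49, 8))  # 392 GB/s across all 49 ports
print(transfer_time_ns(256, 8))         # 32.0 ns for a hypothetical 256-byte message
```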
“Hermosa” ASIC
24.6 mm
Clusters
Central Router
SERDES
LambdaFabric™: Scalable Across Resources
Cluster | Chip | Board | Rack | Multi-Rack
• Scale invariant network architecture
• Low latency to everywhere
• High bandwidth
• Multi-dimensional connectivity
• Scalable up to 512,000 chips
Latency across scales: 6 ns, 62 ns, 247 ns, 437 ns
Low-Latency, High-Throughput, Low-Power Computing Fabric
[Figure: cluster with eight tDSPs (tDSP 0 through tDSP 7), an AIP, and Memory (2 MB)]
“Tiny DSP” Tuned to Stream Operations
Cluster and tDSP Processor
• 256 registers (128 GP and 128 special)
• 1 GHz clock (gated)
• Harvard program/data memory separation
• Fixed 32 bit instruction word.
• Storage merged into shared register file
• Built-in synch w/ event flags
• Scatter/gather engines
• Three states: Sleep, run, or single step
• Single instruction packet launch
• 2K instruction store
• Multiple state machines EPIC-style
• No interrupts
• Everything is addressed, including registers
Instruction format: Opcode | Op1 | Op2 | Op3
Fits in about 4 microns², about the size of a neuron.
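One way a fixed 32-bit instruction word with an opcode and three operands could be laid out is four 8-bit fields, since 256 registers need 8 address bits each. The slide does not give field widths or bit positions, so the layout below is an assumption made purely for illustration.

```python
FIELD_BITS = 8  # assumption: 8 bits per field, enough to address 256 registers
FIELD_MASK = (1 << FIELD_BITS) - 1


def encode(opcode, op1, op2, op3):
    """Pack four 8-bit fields into one 32-bit word, opcode in the top byte."""
    for field in (opcode, op1, op2, op3):
        if not 0 <= field <= FIELD_MASK:
            raise ValueError("field does not fit in 8 bits")
    return (opcode << 24) | (op1 << 16) | (op2 << 8) | op3


def decode(word):
    """Unpack a 32-bit word into (opcode, op1, op2, op3)."""
    return (
        (word >> 24) & FIELD_MASK,
        (word >> 16) & FIELD_MASK,
        (word >> 8) & FIELD_MASK,
        word & FIELD_MASK,
    )


word = encode(0x12, 0x34, 0x56, 0x78)
print(hex(word))     # 0x12345678
print(decode(word))  # (18, 52, 86, 120)
```

A fixed-width format like this keeps decoding to a few shifts and masks, which suits a very small core.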
[Figure: Device and Cluster block diagram. Labels: tDSPs, SRAM, Controller, Dispatcher, Packet Router, Feeder, AIP, DRAM, and links to other routers.]
“Flit” Packet Communications
Advantages:
• Only a single modality of operation
• Can operate from registers instead of SRAM
• Superior for real-time transport
• “Cut-through” operations prioritize important traffic
• No bus contention
• All comms are bi-directional (core can receive and transmit at the same time)
• Flit headers carry VLIW packet execution codes (POP)
Normal Packet Queuing: the next packet along must wait a long period, up to 60,000 clocks, before the previous packet leaves the queue.
Flit Packet Queuing: the next packet along must wait only 4 clocks before the previous packet leaves the queue.
Packet Format fields: Payload (0 to 32 QWDS), OPC, ADR, SIZ, V, POP, V
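To give the two queuing figures a time scale, the sketch below converts clock counts to seconds. It assumes the 1 GHz clock quoted for the tDSP also applies to the queues, which the slides do not state.

```python
def clocks_to_seconds(clocks, clock_hz=1e9):
    """Convert a clock count to seconds; 1 GHz is an assumed clock rate."""
    return clocks / clock_hz


normal_wait_s = clocks_to_seconds(60_000)  # worst-case wait, normal queuing
flit_wait_s = clocks_to_seconds(4)         # wait with flit queuing
print(normal_wait_s, flit_wait_s)
print(60_000 // 4)  # ratio of the two waits: 15000
```

At the assumed clock rate, that is about 60 microseconds against 4 nanoseconds of head-of-line waiting.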
Hermosa Control-Plane Organization
• Router Arbitration
• Synchronization
• Flags (Notifications)
• Mutexes (Ownership of a
resource)
• Shared Memory Control
• Vector memory
• Matrix memory
• Layer Processing
• Feeders == Programmable
cache controllers
• Error handling
• SAM – System Activity Monitoring
• Traffic flows
[Figure: Hermosa Hierarchical Control and Data Distribution Scheme. Control path: Supv and Ctrl2 elements. Data path: HRouter1, HRouter2, and tDSPs, with a link to the next device.]
Each control element includes local memory and control registers.
Control Plane/Data Plane separation is central to Software Defined Networks
H1000 Event Flag Mapping
[Figure: 4 external device inputs, the DCR (Device Event Register), the CCR (Cluster Event Register), and the tDSP m Event Register feed the tDSP latch and control. Flags shown: EVFG0 to EVFG3, EVFD0 to EVFD3, EVFC0 to EVFC3, EVFMBX, EVFIOR, EVFDMA, EVFFDR, EVFU0 to EVFU15.]
Mask against all event flags via WFLOR or WFLAND functions. If true then activate tDSP. Event flag will remain latched until cleared by tDSP.
• AIP Synch Function sets EVFCn event flags.
• Supervisor sets EVFDn event flags.
• Host machine sets EVFGn event flags.
• EVFCn & EVFUn event flags are settable by the tDSPs directly.
• EVFCn event flags are shared by all tDSPs in a cluster.
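The event-flag wake-up described for the tDSP can be sketched as a latched bit register tested against a mask. The sketch assumes, from the names alone, that WFLOR wakes when any masked flag is set and WFLAND when all masked flags are set. Bit positions and method names are hypothetical.

```python
class EventFlags:
    """Latched event flags; a flag stays set until explicitly cleared."""

    def __init__(self):
        self.latched = 0

    def set(self, bit):
        self.latched |= 1 << bit

    def clear(self, bit):
        self.latched &= ~(1 << bit)

    def wait_or(self, mask):
        """WFLOR-style test (assumed semantics): any masked flag is set."""
        return (self.latched & mask) != 0

    def wait_and(self, mask):
        """WFLAND-style test (assumed semantics): all masked flags are set."""
        return (self.latched & mask) == mask


flags = EventFlags()
mask = 0b0011               # hypothetical: wait on flags 0 and 1
flags.set(0)
print(flags.wait_or(mask))   # True: one of the masked flags is latched
print(flags.wait_and(mask))  # False: flag 1 has not arrived yet
flags.set(1)
print(flags.wait_and(mask))  # True: the core would now be activated
```

Because flags are latched, an event that arrives while the core is busy is not lost; it is seen at the next wait.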
KNUPATH Hermosa Programming Model
A range of programming interfaces to balance performance and productivity, ordered from performance to abstraction:
Hermosa Assembly Language
KNUPATH Performance Interface (KPI)
• Explicit message passing model
• Hermosa kernel functions library for MPI-like message passing and synchronization
• C/C++ kernel development using the clang/LLVM compiler toolchain with a Hermosa target
• Host/device accelerator model, similar to CUDA/OpenCL workflows
KNUPATH Network Interface (KNI)
• Implicit message passing model
• Dataflow programming model based on Kahn Process Networks (KPN)
• C++ libraries and compiler extensions are used to support user-defined kernels and networks
• Host target allows development of KNI networks and kernels without Hermosa hardware
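For readers unfamiliar with Kahn Process Networks, the dataflow model named for KNI, here is a minimal generic sketch using Python threads and FIFO queues. It is not the KNI API (which the slide describes as C++ libraries and compiler extensions); every name below is illustrative. The defining KPN property shown is that processes communicate only through channels with blocking reads.

```python
import queue
import threading

STOP = object()  # end-of-stream marker (an illustrative convention)


def source(out_channel, values):
    """Process that emits a fixed stream of values."""
    for value in values:
        out_channel.put(value)
    out_channel.put(STOP)


def square(in_channel, out_channel):
    """Process that reads one token at a time and emits its square."""
    while True:
        value = in_channel.get()  # blocking read, as KPN semantics require
        if value is STOP:
            out_channel.put(STOP)
            return
        out_channel.put(value * value)


def run_network(values):
    """Wire source -> square -> collector and run until the stream ends."""
    a, b = queue.Queue(), queue.Queue()
    workers = [
        threading.Thread(target=source, args=(a, values)),
        threading.Thread(target=square, args=(a, b)),
    ]
    for worker in workers:
        worker.start()
    results = []
    while True:
        value = b.get()
        if value is STOP:
            break
        results.append(value)
    for worker in workers:
        worker.join()
    return results


print(run_network([1, 2, 3]))  # [1, 4, 9]
```

The result does not depend on how the threads are scheduled, which is what lets the same network description run on a host target or on parallel hardware.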
KnuEdge Web Service
• Hermosa developer boards available for
development of parallel computation on a
queued server
• Available to anyone, free of charge
• Training materials available
• User agreement required
• Contact jotchis@knupath.com for details
Upcoming Events
Date | Event | Information
Thursday, Feb. 16, 2017 | Machine Learning Society Meetup: Processing Hardware for Deep Learning. Panelist: D. Palmer (KnuEdge CTO) | At ScaleMatrix, 5775 Kearny Villa Road, San Diego; https://www.meetup.com/machine-learning-society/events/237055385/ ; www.mlsociety.com
April 26-28, 2017 | Workshop on Sparse and Heterogeneous Neural Networks. Hosted by California Institute for Telecommunications and Information Technology (Calit2). Sponsored by KnuEdge. Participants Wanted: Submit Your Abstract! | At Calit2, Atkinson Hall, UCSD campus; https://www.knuedge.com/about-us/events/hnnworkshop/
TBA | KnuEdge Developer Users’ Group | Stay tuned!
Questions?
31© KnuEdge™ 2016. All Rights Reserved.
V8.8.2016
™

More Related Content

What's hot

Dov Nimratz, Roman Chobik "Embedded artificial intelligence"
Dov Nimratz, Roman Chobik "Embedded artificial intelligence"Dov Nimratz, Roman Chobik "Embedded artificial intelligence"
Dov Nimratz, Roman Chobik "Embedded artificial intelligence"Lviv Startup Club
 
The OptIPuter and Its Applications
The OptIPuter and Its ApplicationsThe OptIPuter and Its Applications
The OptIPuter and Its ApplicationsLarry Smarr
 
A California-Wide Cyberinfrastructure for Data-Intensive Research
A California-Wide Cyberinfrastructure for Data-Intensive ResearchA California-Wide Cyberinfrastructure for Data-Intensive Research
A California-Wide Cyberinfrastructure for Data-Intensive ResearchLarry Smarr
 
Riding the Light: How Dedicated Optical Circuits are Enabling New Science
Riding the Light: How Dedicated Optical Circuits are Enabling New ScienceRiding the Light: How Dedicated Optical Circuits are Enabling New Science
Riding the Light: How Dedicated Optical Circuits are Enabling New ScienceLarry Smarr
 
Toward a National Research Platform
Toward a National Research PlatformToward a National Research Platform
Toward a National Research PlatformLarry Smarr
 
40 Powers of 10 - Simulating the Universe with the DiRAC HPC Facility
40 Powers of 10 - Simulating the Universe with the DiRAC HPC Facility40 Powers of 10 - Simulating the Universe with the DiRAC HPC Facility
40 Powers of 10 - Simulating the Universe with the DiRAC HPC Facilityinside-BigData.com
 
ONOS-based Location and Load aware Virtually Dedicated Container Networking o...
ONOS-based Location and Load aware Virtually Dedicated Container Networking o...ONOS-based Location and Load aware Virtually Dedicated Container Networking o...
ONOS-based Location and Load aware Virtually Dedicated Container Networking o...APNIC
 
Cloud computing and bioinformatics
Cloud computing and bioinformaticsCloud computing and bioinformatics
Cloud computing and bioinformaticsEnis Afgan
 
OptIPuter Overview
OptIPuter OverviewOptIPuter Overview
OptIPuter OverviewLarry Smarr
 
OCCI - The Open Cloud Computing Interface – flexible, portable, interoperable...
OCCI - The Open Cloud Computing Interface – flexible, portable, interoperable...OCCI - The Open Cloud Computing Interface – flexible, portable, interoperable...
OCCI - The Open Cloud Computing Interface – flexible, portable, interoperable...Alan Sill
 
DEYPOS: DEDUPLICATABLE DYNAMIC PROOF OF STORAGE FOR MULTI-USER ENVIRONMENTS
DEYPOS: DEDUPLICATABLE DYNAMIC PROOF OF STORAGE FOR MULTI-USER ENVIRONMENTSDEYPOS: DEDUPLICATABLE DYNAMIC PROOF OF STORAGE FOR MULTI-USER ENVIRONMENTS
DEYPOS: DEDUPLICATABLE DYNAMIC PROOF OF STORAGE FOR MULTI-USER ENVIRONMENTSNexgen Technology
 
High Performance Cyberinfrastructure Enabling Data-Driven Science Supporting ...
High Performance Cyberinfrastructure Enabling Data-Driven Science Supporting ...High Performance Cyberinfrastructure Enabling Data-Driven Science Supporting ...
High Performance Cyberinfrastructure Enabling Data-Driven Science Supporting ...Larry Smarr
 
OGF Introductory Overview - OGF 44 at EGI Conference 2015
OGF Introductory Overview - OGF 44 at EGI Conference 2015OGF Introductory Overview - OGF 44 at EGI Conference 2015
OGF Introductory Overview - OGF 44 at EGI Conference 2015Alan Sill
 
Cloud Testbeds for Standards Development and Innovation
Cloud Testbeds for Standards Development and InnovationCloud Testbeds for Standards Development and Innovation
Cloud Testbeds for Standards Development and InnovationAlan Sill
 
OpenACC Monthly Highlights - September
OpenACC Monthly Highlights - SeptemberOpenACC Monthly Highlights - September
OpenACC Monthly Highlights - SeptemberNVIDIA
 
PLNOG 18 - Dr Marek Michalewicz - InfiniCortex: Superkomputer wielki jak świat
PLNOG 18 - Dr Marek Michalewicz - InfiniCortex: Superkomputer wielki jak światPLNOG 18 - Dr Marek Michalewicz - InfiniCortex: Superkomputer wielki jak świat
PLNOG 18 - Dr Marek Michalewicz - InfiniCortex: Superkomputer wielki jak światPROIDEA
 
OGF standards for cloud computing
OGF standards for cloud computingOGF standards for cloud computing
OGF standards for cloud computingAlan Sill
 
OGF Standards Overview - ITU-T JCA Cloud
OGF Standards Overview - ITU-T JCA CloudOGF Standards Overview - ITU-T JCA Cloud
OGF Standards Overview - ITU-T JCA CloudAlan Sill
 
OGF Introductory Overview - FAS* 2014
OGF Introductory Overview -  FAS* 2014OGF Introductory Overview -  FAS* 2014
OGF Introductory Overview - FAS* 2014Alan Sill
 

What's hot (20)

Dov Nimratz, Roman Chobik "Embedded artificial intelligence"
Dov Nimratz, Roman Chobik "Embedded artificial intelligence"Dov Nimratz, Roman Chobik "Embedded artificial intelligence"
Dov Nimratz, Roman Chobik "Embedded artificial intelligence"
 
The OptIPuter and Its Applications
The OptIPuter and Its ApplicationsThe OptIPuter and Its Applications
The OptIPuter and Its Applications
 
A California-Wide Cyberinfrastructure for Data-Intensive Research
A California-Wide Cyberinfrastructure for Data-Intensive ResearchA California-Wide Cyberinfrastructure for Data-Intensive Research
A California-Wide Cyberinfrastructure for Data-Intensive Research
 
DIET_BLAST
DIET_BLASTDIET_BLAST
DIET_BLAST
 
Riding the Light: How Dedicated Optical Circuits are Enabling New Science
Riding the Light: How Dedicated Optical Circuits are Enabling New ScienceRiding the Light: How Dedicated Optical Circuits are Enabling New Science
Riding the Light: How Dedicated Optical Circuits are Enabling New Science
 
Toward a National Research Platform
Toward a National Research PlatformToward a National Research Platform
Toward a National Research Platform
 
40 Powers of 10 - Simulating the Universe with the DiRAC HPC Facility
40 Powers of 10 - Simulating the Universe with the DiRAC HPC Facility40 Powers of 10 - Simulating the Universe with the DiRAC HPC Facility
40 Powers of 10 - Simulating the Universe with the DiRAC HPC Facility
 
ONOS-based Location and Load aware Virtually Dedicated Container Networking o...
ONOS-based Location and Load aware Virtually Dedicated Container Networking o...ONOS-based Location and Load aware Virtually Dedicated Container Networking o...
ONOS-based Location and Load aware Virtually Dedicated Container Networking o...
 
Cloud computing and bioinformatics
Cloud computing and bioinformaticsCloud computing and bioinformatics
Cloud computing and bioinformatics
 
OptIPuter Overview
OptIPuter OverviewOptIPuter Overview
OptIPuter Overview
 
OCCI - The Open Cloud Computing Interface – flexible, portable, interoperable...
OCCI - The Open Cloud Computing Interface – flexible, portable, interoperable...OCCI - The Open Cloud Computing Interface – flexible, portable, interoperable...
OCCI - The Open Cloud Computing Interface – flexible, portable, interoperable...
 
DEYPOS: DEDUPLICATABLE DYNAMIC PROOF OF STORAGE FOR MULTI-USER ENVIRONMENTS
DEYPOS: DEDUPLICATABLE DYNAMIC PROOF OF STORAGE FOR MULTI-USER ENVIRONMENTSDEYPOS: DEDUPLICATABLE DYNAMIC PROOF OF STORAGE FOR MULTI-USER ENVIRONMENTS
DEYPOS: DEDUPLICATABLE DYNAMIC PROOF OF STORAGE FOR MULTI-USER ENVIRONMENTS
 
High Performance Cyberinfrastructure Enabling Data-Driven Science Supporting ...
High Performance Cyberinfrastructure Enabling Data-Driven Science Supporting ...High Performance Cyberinfrastructure Enabling Data-Driven Science Supporting ...
High Performance Cyberinfrastructure Enabling Data-Driven Science Supporting ...
 
OGF Introductory Overview - OGF 44 at EGI Conference 2015
OGF Introductory Overview - OGF 44 at EGI Conference 2015OGF Introductory Overview - OGF 44 at EGI Conference 2015
OGF Introductory Overview - OGF 44 at EGI Conference 2015
 
Cloud Testbeds for Standards Development and Innovation
Cloud Testbeds for Standards Development and InnovationCloud Testbeds for Standards Development and Innovation
Cloud Testbeds for Standards Development and Innovation
 
OpenACC Monthly Highlights - September
OpenACC Monthly Highlights - SeptemberOpenACC Monthly Highlights - September
OpenACC Monthly Highlights - September
 
PLNOG 18 - Dr Marek Michalewicz - InfiniCortex: Superkomputer wielki jak świat
PLNOG 18 - Dr Marek Michalewicz - InfiniCortex: Superkomputer wielki jak światPLNOG 18 - Dr Marek Michalewicz - InfiniCortex: Superkomputer wielki jak świat
PLNOG 18 - Dr Marek Michalewicz - InfiniCortex: Superkomputer wielki jak świat
 
OGF standards for cloud computing
OGF standards for cloud computingOGF standards for cloud computing
OGF standards for cloud computing
 
OGF Standards Overview - ITU-T JCA Cloud
OGF Standards Overview - ITU-T JCA CloudOGF Standards Overview - ITU-T JCA Cloud
OGF Standards Overview - ITU-T JCA Cloud
 
OGF Introductory Overview - FAS* 2014
OGF Introductory Overview -  FAS* 2014OGF Introductory Overview -  FAS* 2014
OGF Introductory Overview - FAS* 2014
 

Similar to Bergman Enabling Computation for neuro ML external

Rack Cluster Deployment for SDSC Supercomputer
Rack Cluster Deployment for SDSC SupercomputerRack Cluster Deployment for SDSC Supercomputer
Rack Cluster Deployment for SDSC SupercomputerRebekah Rodriguez
 
FPGA Hardware Accelerator for Machine Learning
FPGA Hardware Accelerator for Machine Learning FPGA Hardware Accelerator for Machine Learning
FPGA Hardware Accelerator for Machine Learning Dr. Swaminathan Kathirvel
 
Webinar: OpenEBS - Still Free and now FASTEST Kubernetes storage
Webinar: OpenEBS - Still Free and now FASTEST Kubernetes storageWebinar: OpenEBS - Still Free and now FASTEST Kubernetes storage
Webinar: OpenEBS - Still Free and now FASTEST Kubernetes storageMayaData Inc
 
AWS re:Invent 2016: Bringing Deep Learning to the Cloud with Amazon EC2 (CMP314)
AWS re:Invent 2016: Bringing Deep Learning to the Cloud with Amazon EC2 (CMP314)AWS re:Invent 2016: Bringing Deep Learning to the Cloud with Amazon EC2 (CMP314)
AWS re:Invent 2016: Bringing Deep Learning to the Cloud with Amazon EC2 (CMP314)Amazon Web Services
 
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...inside-BigData.com
 
HPC and cloud distributed computing, as a journey
HPC and cloud distributed computing, as a journeyHPC and cloud distributed computing, as a journey
HPC and cloud distributed computing, as a journeyPeter Clapham
 
Pioneering and Democratizing Scalable HPC+AI at PSC
Pioneering and Democratizing Scalable HPC+AI at PSCPioneering and Democratizing Scalable HPC+AI at PSC
Pioneering and Democratizing Scalable HPC+AI at PSCinside-BigData.com
 
SoC Solutions Enabling Server-Based Networking
SoC Solutions Enabling Server-Based NetworkingSoC Solutions Enabling Server-Based Networking
SoC Solutions Enabling Server-Based NetworkingNetronome
 
KIISE:SIGDB Workshop presentation.
KIISE:SIGDB Workshop presentation.KIISE:SIGDB Workshop presentation.
KIISE:SIGDB Workshop presentation.Kyong-Ha Lee
 
Dell High-Performance Computing solutions: Enable innovations, outperform exp...
Dell High-Performance Computing solutions: Enable innovations, outperform exp...Dell High-Performance Computing solutions: Enable innovations, outperform exp...
Dell High-Performance Computing solutions: Enable innovations, outperform exp...Dell World
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark with Ma...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark with Ma...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark with Ma...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark with Ma...Databricks
 
ODSA Sub-Project Launch
ODSA Sub-Project LaunchODSA Sub-Project Launch
ODSA Sub-Project LaunchODSA Workgroup
 
ODSA Sub-Project Launch
 ODSA Sub-Project Launch ODSA Sub-Project Launch
ODSA Sub-Project LaunchNetronome
 
Deep Learning Frameworks Using Spark on YARN by Vartika Singh
Deep Learning Frameworks Using Spark on YARN by Vartika SinghDeep Learning Frameworks Using Spark on YARN by Vartika Singh
Deep Learning Frameworks Using Spark on YARN by Vartika SinghData Con LA
 
A Dataflow Processing Chip for Training Deep Neural Networks
A Dataflow Processing Chip for Training Deep Neural NetworksA Dataflow Processing Chip for Training Deep Neural Networks
A Dataflow Processing Chip for Training Deep Neural Networksinside-BigData.com
 
ABCI: AI Bridging Cloud Infrastructure for Scalable AI/Big Data
ABCI: AI Bridging Cloud Infrastructure for Scalable AI/Big DataABCI: AI Bridging Cloud Infrastructure for Scalable AI/Big Data
ABCI: AI Bridging Cloud Infrastructure for Scalable AI/Big DataHitoshi Sato
 
Device Data Directory and Asynchronous execution: A path to heterogeneous com...
Device Data Directory and Asynchronous execution: A path to heterogeneous com...Device Data Directory and Asynchronous execution: A path to heterogeneous com...
Device Data Directory and Asynchronous execution: A path to heterogeneous com...LEGATO project
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark with Ma...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark with Ma...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark with Ma...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark with Ma...Databricks
 

Similar to Bergman Enabling Computation for neuro ML external (20)

Rack Cluster Deployment for SDSC Supercomputer
Rack Cluster Deployment for SDSC SupercomputerRack Cluster Deployment for SDSC Supercomputer
Rack Cluster Deployment for SDSC Supercomputer
 
FPGA Hardware Accelerator for Machine Learning
FPGA Hardware Accelerator for Machine Learning FPGA Hardware Accelerator for Machine Learning
FPGA Hardware Accelerator for Machine Learning
 
Webinar: OpenEBS - Still Free and now FASTEST Kubernetes storage
Webinar: OpenEBS - Still Free and now FASTEST Kubernetes storageWebinar: OpenEBS - Still Free and now FASTEST Kubernetes storage
Webinar: OpenEBS - Still Free and now FASTEST Kubernetes storage
 
AWS re:Invent 2016: Bringing Deep Learning to the Cloud with Amazon EC2 (CMP314)
Bergman Enabling Computation for neuro ML external

  • 1. V8.8.2016
    Enabling Computation for Machine Learning Algorithms Inspired by Neurobiology
    Taking Lessons From Nature
    Presentation to CSRC Colloquium, SDSU, January 27, 2017
    Doug Bergman, PhD, Staff Scientist / Mathemaperson, KnuEdge
  • 2. Contents
    • Neurobiological inspiration for machine learning algorithms
    • Problems with existing machine learning tools and hardware
    • Design principles
    • Hermosa processor design
    • Developer tools and programming
    • Events
    © KnuEdge™ 2016. All Rights Reserved.
  • 3. The Neurobiological Computer
    Neurobiological Systems: Flexible and Scalable
    Animal             # Neurons          (10^x scale)
    Roundworm          302                (2)
    Medicinal leech    10,000             (4)
    Pond snail         11,000             (4)
    Sea slug           18,000             (4)
    Lobster            100,000            (5)
    Fruit fly          250,000            (5)
    Ant                250,000            (5)
    Honey bee          960,000            (6)
    Cockroach          1,000,000          (6)
    Frog               16,000,000         (7)
    Mouse              71,000,000         (8)
    Finch              131,000,000        (8)
    Octopus            500,000,000        (9)
    Human              100,000,000,000    (14)
    Elephant           200,000,000,000    (14)
    Chart callouts: "Current generation machine learning capabilities" and "Where KnuEdge wants to be in 2021: MindScale."
    Source: Wikipedia, https://en.wikipedia.org/wiki/List_of_animals_by_number_of_neurons
  • 4. BatVision
    • SONAR or optical “blind” vision
    • 25,000 bats in one cave can avoid collisions
    • Can tell a moth from a fly at 20 m
    • Outgoing pulses
      – Frequency: up to 120 kHz, f ~ 60% (decay)
      – Pulse width: 2 ms
      – Pulse repetition rate of ~130 Hz, increasing with proximity
      – Sound amplitude of 130 decibels (jet plane loud)
    • Incoming reception
      – Time resolution of 2 μs
    • Signal processing
      – ~10 million neurons
  • 5. Structure of the Mammalian Nervous System
    • Extremely complex and diverse
    • Thousands or millions of neuron types
    • Complex topology of connections, like the Internet
    Take away: Heterogeneous and sparsely connected
  • 6. Natural Pattern Recognition
    • Recognition is immediate and automatic
    • Only a few unlabeled samples needed
    • Inspiring development of new machine learning models
    • Perfected through a billion years of natural selection
    • Heterogeneity of neuron types
    • Sparse, “plastic”, evolving connections
  • 7. Contents
    • Neurobiological inspiration for machine learning algorithms
    • Problems with existing machine learning tools and hardware
    • Design principles
    • Hermosa processor design
    • Developer tools and programming
    • Events
  • 8. Artificial Pattern Recognition: Neural Networks
    • Lots of costly computation required
    • Hundreds or thousands of labeled samples to train
    • Largely variations on homogeneous Multilayered Perceptron models
    • Relatively easy to develop
      – Import library into script, train with data, run
      – Build models without sophisticated theory; choose parameters arbitrarily
      – “Throw mud on wall, see what sticks”
    This has been the thrust of development in part because these single-instruction, multiple-data (SIMD) models map well to existing hardware, e.g. GPUs.
  • 9. Nature-Inspired Learning Models
    • Heterogeneity of neuron types
    • Hebbian learning: neurons interconnect sparsely after repeated sympathetic firing
    • Plasticity and pruning: connections may change over time
    • Recurrent models
    • Sequential “Deep Learning” models allow for training on unlabeled or partially labeled data
    The field has been slower to develop, due in part to the unavailability of multiple-instruction, multiple-data (MIMD) processing.
    References:
    • Raudies, Zilli, Hasselmo: http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0093250
    • Piekniewski, Laurent, Petre, Richert, Fisher, Hylton: https://arxiv.org/pdf/1607.06854v3.pdf
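The Hebbian learning and pruning bullets on this slide can be illustrated with a small sketch. It is not taken from the deck: the update rule (a product of pre- and post-synaptic activity plus a weight decay), the learning rate, the decay, and the pruning threshold are all assumptions chosen for illustration.

```python
def hebbian_step(weights, activity, rate=0.1, decay=0.02, prune_below=0.01):
    """One Hebbian update on a sparse weight dict {(pre, post): weight}.

    All constants are illustrative assumptions, not values from the slides.
    """
    updated = {}
    for (pre, post), w in weights.items():
        # Strengthen the connection when both neurons are active together;
        # the decay term slowly weakens connections that are not reinforced.
        w = w + rate * activity[pre] * activity[post] - decay * w
        # Pruning: connections that fall below the threshold are dropped.
        if w >= prune_below:
            updated[(pre, post)] = w
    return updated


# Three neurons, initially weakly connected.
weights = {(0, 1): 0.05, (0, 2): 0.05, (1, 2): 0.05}

# Neurons 0 and 1 fire together repeatedly; neuron 2 stays silent.
for _ in range(200):
    weights = hebbian_step(weights, activity=[1.0, 1.0, 0.0])

print(weights)
```

After repeated co-firing only the connection between neurons 0 and 1 survives, which is the sparse, plastic connectivity the slide describes.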
  • 10. Problem: Software
    • Brittleness / lack of security
    • Multiple modalities of communications
    • Poor scatter/gather communications
    • Power consumption
    • Ease of use
    • Lack of skilled manpower
    [Figure: Increasing Complexity of Applications (Lines of Code); values shown: 24 million, 1.8 million, 90 million, 2.0 million]
    Answer: Learning, based on neurobiological principles, could be preferable to programming.
  • 11. Current Generation Chips are a Problem
    • Memory gets larger but not faster
    • Logic gets faster but spends more time waiting for memory
    • Logic gets more energy efficient but memory transport does not
    Today’s processors: Time & Energy Dominated by Fetch
  • 12. Contents
    • Neurobiological inspiration for machine learning algorithms
    • Problems with existing machine learning tools and hardware
    • Design principles
    • Hermosa processor design
    • Developer tools and programming
    • Events
  • 13. Hardware Solution: Wetware to Silicon
    Nature’s Design Principles
    • Neuron body contains “context”
    • Communication via synapses
    • Axonal connections define geometry
    Intellisis Design Goals
    • Scalable heterogeneous parallelism
    • Proximity of memory and processing
    • Support for complex connections
    • Communications driven architecture
    • All information is “pushed”
    • Must operate with noise and failures
    [Figure: Directed Graph with nodes F0, F1, F2, F3 and edge labels W1,0, W0,2, W2,1, W3,1, W2,3, W0,3; panels labeled Neurobiology, Random Access Operations, ASIC]
    $$
    \begin{bmatrix} A'_0 \\ A'_1 \\ A'_2 \\ A'_3 \end{bmatrix}
    =
    \begin{bmatrix}
    0 & 0 & W_{2,0} & W_{3,0} \\
    W_{0,1} & 0 & 0 & 0 \\
    0 & W_{1,2} & 0 & W_{3,2} \\
    0 & W_{1,3} & 0 & 0
    \end{bmatrix}
    \begin{bmatrix} A_0 \\ A_1 \\ A_2 \\ A_3 \end{bmatrix}
    $$
    Take away: Be flexible.
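The directed graph and weight matrix on this slide can be read as a sparse, push-style activation update. The sketch below is one possible reading, not KnuEdge code: it assumes each weight W(j,i) carries the activation of node j to node i, and the numeric weights and activations are made up for illustration.

```python
# Hypothetical weights, keyed (source, target). The six keys mirror the six
# nonzero matrix entries on the slide; the values are invented.
EDGES = {
    (0, 1): 0.5,
    (1, 2): 0.25,
    (1, 3): 2.0,
    (2, 0): 1.0,
    (3, 0): -1.0,
    (3, 2): 0.5,
}


def push_update(edges, activations):
    """Compute A'[i] = sum over j of W(j, i) * A[j] by pushing along edges.

    Only edges that exist are visited, so the cost scales with the number
    of connections rather than with the full n x n matrix.
    """
    new_activations = [0.0] * len(activations)
    for (source, target), weight in edges.items():
        new_activations[target] += weight * activations[source]
    return new_activations


A = [1.0, 2.0, 3.0, 4.0]
A_next = push_update(EDGES, A)
print(A_next)  # [-1.0, 0.5, 2.5, 4.0]
```

Storing the graph as an edge list rather than a dense matrix is the software analogue of the slide's "sparsely connected" design principle.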
  • 14. Nodal Processing
    • Data flow from inputs (raw data) to outputs (calculated results)
    • Finite-state processing: “Lambda” push model
    • Parallel processing kernels occur independent of other activities and flows
    • Nodes can be subnetworks themselves
    • New nodes can be inserted at any time
    [Figure: Fig 5. Heterogeneous network with heterogeneous nodes and subnets. Labeled parts: Inputs, Processing Pipeline, Classifier, Outputs; subnets Net A, Net B, Net C, Net D, Net X]
  • 15. Hermosa Processing
    Hermosa Processing Unit Model
    • Inputs (connections), indexed i = 0 … n-1
    • Processing (neuron)
    • Outputs (activations)
    Node Processing
    • Each node accepts n inputs.
    • When all n inputs are processed, a single output is generated.
    • Specially designed for coupled differential equation solving
    Edge Processing (Hermosa Connection Processing)
    • There are many more software connections than hardware connections.
    • A router performs the task of connecting neurons.
    Directed Graph Representation of Processing
    • Each line is called an “edge” and each circle is called a “vertex”.
    [Figure: directed graph with vertices F1 to F8]
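The node-processing rule on this slide (a node accepts n inputs and emits one output once all n have arrived) can be sketched in a few lines. This is a toy model under stated assumptions: the `Node` class, its `receive` method, and the direct call that stands in for the router are all hypothetical and are not the Hermosa API.

```python
class Node:
    """A vertex that fires once all of its n inputs have been pushed to it."""

    def __init__(self, name, n_inputs, fn, targets=()):
        self.name = name
        self.n_inputs = n_inputs
        self.fn = fn                   # maps the list of n inputs to one output
        self.targets = list(targets)   # (node, input_slot) pairs, i.e. edges
        self.pending = {}              # input_slot -> value received so far

    def receive(self, slot, value, trace):
        self.pending[slot] = value
        if len(self.pending) < self.n_inputs:
            return                     # still waiting for other inputs
        inputs = [self.pending[i] for i in range(self.n_inputs)]
        self.pending = {}              # reset state for the next round
        output = self.fn(inputs)
        trace.append((self.name, output))
        # Stand-in for the router: push the single output along every edge.
        for node, target_slot in self.targets:
            node.receive(target_slot, output, trace)


f3 = Node("F3", 2, sum)
f1 = Node("F1", 1, lambda xs: xs[0] * 2, targets=[(f3, 0)])
f2 = Node("F2", 1, lambda xs: xs[0] + 1, targets=[(f3, 1)])

trace = []
f1.receive(0, 3, trace)   # F1 fires; F3 now holds one of its two inputs
f2.receive(0, 4, trace)   # F2 fires; F3 has both inputs and fires too
print(trace)  # [('F1', 6), ('F2', 5), ('F3', 11)]
```

Nothing is ever polled or fetched: each node only reacts to values pushed to it, which matches the "all information is pushed" goal stated earlier in the deck.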
  • 16. Lambda Architecture Model
    • The fundamental architecture block is a cluster, a physical analog of a network node or subgraph
    • Arbitrarily scalable and nestable
    • Each cluster contains data storage and specialized processors
      – Co-location minimizes latency and energy consumption
    • Processes are pipelined by message passing among clusters
    • Clusters retain finite state information; they are free to perform other tasks while waiting for responses
    • Internal handling of processing within clusters minimizes traffic congestion
    [Figure: four clusters, Cluster 0 to Cluster 3]
  • 17. Another application: Graph Analytics
    In addition to Machine Learning, the Hermosa processor and Lambda fabric architecture will handle graph processing elegantly, using data in compressed format.
    [Figure: Memory Cluster 0, Memory Cluster 1, …, Memory Cluster N-1, connected through a router and exchanging packets. Each cluster has a Manager and Workers 0 to 6, holds vertex pointers and working data in its local memory, and holds edge successor data in its auxiliary memory.]
  • 18. Graph Finite-State Operations
    • Graph Partitions = Data Objects = Memory Clusters
    • Partitions have local states that track states of primitive functions
    Example: Find Minimum Distance from Vertex 0 to Vertex 3
    $$
    D^{\min}_{0,3} = \min\left( D^{\min}_{0,2} + D_{2,3},\; D^{\min}_{0,1} + D_{1,3} \right)
    $$
    [Figure: graph annotated with D^min(0,1), D^min(0,2), D(2,3) and D(1,3)]
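The minimum-distance example on this slide can be checked with a short relaxation loop. The sketch uses plain Bellman-Ford relaxation, which is a substitute chosen for brevity and not necessarily the algorithm run on Hermosa; the edge lengths are invented for illustration.

```python
import math

# Hypothetical edge lengths D(u, v) for a four-vertex graph.
EDGES = {(0, 1): 4.0, (0, 2): 1.0, (1, 3): 1.0, (2, 3): 5.0}


def min_distances(edges, source, n_vertices):
    """Bellman-Ford relaxation: dist[v] = min over edges (u, v) of dist[u] + D(u, v)."""
    dist = [math.inf] * n_vertices
    dist[source] = 0.0
    for _ in range(n_vertices - 1):
        changed = False
        for (u, v), length in edges.items():
            if dist[u] + length < dist[v]:
                dist[v] = dist[u] + length   # local state update at vertex v
                changed = True
        if not changed:
            break                            # every vertex state is stable
    return dist


dist = min_distances(EDGES, source=0, n_vertices=4)
print(dist)  # [0.0, 4.0, 1.0, 5.0]
```

With these lengths the result for vertex 3 is the smaller of the route through vertex 2 (1 + 5) and the route through vertex 1 (4 + 1), the same two-term minimum the slide writes out.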
  • 19. Contents
    • Neurobiological inspiration for machine learning algorithms
    • Problems with existing machine learning tools and hardware
    • Design principles
    • Hermosa processor design
    • Developer tools and programming
    • Events
  • 20. Asynchronous Cloud Processor
    [Figure: Data Plane with an L1 Router, L2 blocks, MAC blocks, multiple memory blocks (> 8 Gbytes of highly segmented memory), a Supervisor and a Root of Trust]
    tDSP processor core:
    • 256 registers
    • 4 packet I/O engines
    • Single-cycle sleep state
    • HW event synchronization
    • Independent clock domains
    High-speed serial:
    • 28.5 Gbps
    • FEC
    • 802.3 PHY
    • 10GBASE-KR compatible
    Layer 1 Router:
    • Ports: 49
    • Rate: 8 GB/s/port
    • Size: 2 mm²
    • Latency: 7 ns
    All communications via flit packets:
    • Physical & virtual addressing
    • Operation codes
    Flit links:
    • 64b/256b parallel
    • Clock domain crossing
    • Interposer or die crossing
  • 21. “Hermosa” ASIC
    [Figure: die photo, 24.6 mm across, with labeled regions: Clusters, Central Router, SERDES]
  • 22. LambdaFabric™: Scalable Across Resources
    Scale levels: Cluster, Chip, Board, Rack, Multi-Rack
    • Scale-invariant network architecture
    • Low latency to everywhere
    • High bandwidth
    • Multi-dimensional connectivity
    • Scalable up to 512,000 chips
    Latencies shown: 6 ns, 62 ns, 247 ns, 437 ns
    Low-Latency, High-Throughput, Low-Power Computing Fabric
    [Figure: cluster with tDSP 0 to tDSP 7, an AIP, and Memory (2 MB)]
  • 23. “Tiny DSP” Tuned to Stream Operations
    Cluster and tDSP Processor
    • 256 registers (128 GP and 128 special)
    • 1 GHz clock (gated)
    • Harvard program/data memory separation
    • Fixed 32-bit instruction word
    • Storage merged into shared register file
    • Built-in synchronization with event flags
    • Scatter/gather engines
    • Three states: sleep, run, or single step
    • Single-instruction packet launch
    • 2K instruction store
    • Multiple state machines, EPIC-style
    • No interrupts
    • Everything is addressed, including registers
    Instruction format fields: Opcode, Op1, Op2, Op3
    Fits in about 4 microns², about the size of a neuron.
    [Figure: Device containing a Cluster with SRAM, tDSPs, Controller, Dispatcher, Packet Router, Feeder and AIP, plus DRAM and links to other routers]
  • 24. Contents
    • Neurobiological inspiration for machine learning algorithms
    • Problems with existing machine learning tools and hardware
    • Design principles
    • Hermosa processor design
    • Developer tools and programming
    • Events
  • 25. “Flit” Packet Communications
    Advantages:
    • Only a single modality of operation
    • Can operate from registers instead of SRAM
    • Superior for real-time transport
    • “Cut-through” operations prioritize important traffic
    • No bus contention
    • All comms are bi-directional (core can receive and transmit at the same time)
    • Flit headers carry VLIW packet execution codes (POP)
    Normal packet queuing: the next packet must wait a long time before the previous packet leaves the queue, up to 60,000 clocks.
    Flit packet queuing: the next packet must wait only 4 clocks before the previous packet leaves the queue.
    Packet format fields: OPC, ADR, SIZ, V, POP, V, Payload (0 to 32 QWDS)
  • 26. Hermosa Control-Plane Organization
    • Router arbitration
    • Synchronization
      – Flags (notifications)
      – Mutexes (ownership of a resource)
    • Shared memory control
      – Vector memory
      – Matrix memory
    • Layer processing
    • Feeders == programmable cache controllers
    • Error handling
    • SAM: System Activity Monitoring
    • Traffic flows
    [Figure: Hermosa Hierarchical Control and Data Distribution Scheme, showing HRouter1, HRouter2, Ctrl2, Supv and tDSP blocks, separate control and data paths, and a link to the next device]
    Each control element includes local memory and control registers. Control Plane/Data Plane separation is central to Software Defined Networks.
    [Figure: H1000 Event Flag Mapping, showing the DCR (Device Event Register), the CCR (Cluster Event Register), the per-tDSP event register with latch and control, and 4 external device inputs. Flags shown: EVFG0 to EVFG3, EVFD0 to EVFD3, EVFC0 to EVFC3, EVFU0 to EVFU15, EVFMBX, EVFIOR, EVFDMA, EVFFDR]
    Mask against all event flags via the WFLOR or WFLAND functions. If true, then activate the tDSP. An event flag remains latched until cleared by the tDSP.
    • The AIP synch function sets EVFCn event flags.
    • The Supervisor sets EVFDn event flags.
    • The host machine sets EVFGn event flags.
    • EVFCn and EVFUn event flags are settable by the tDSPs directly.
    • EVFCn event flags are shared by all tDSPs in a cluster.
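The event-flag behavior described on this slide (flags are latched, a mask is tested with WFLOR or WFLAND, and the tDSP wakes when the test is true) can be modeled roughly as follows. This is a guess at the semantics from the names alone: it assumes WFLOR means "any masked flag is set" and WFLAND means "all masked flags are set". The bit positions, the `EventRegister` class, and the non-blocking return values are hypothetical and do not describe the real instruction set.

```python
# Hypothetical bit assignments; the slide names the flags but not their bits.
FLAG_BITS = {"EVFG0": 0, "EVFD0": 1, "EVFC0": 2, "EVFMBX": 3}


def mask_of(*names):
    """Build a bit mask from flag names."""
    mask = 0
    for name in names:
        mask |= 1 << FLAG_BITS[name]
    return mask


class EventRegister:
    """Latched event flags for one tDSP (toy model)."""

    def __init__(self):
        self.latched = 0

    def set_flag(self, name):
        self.latched |= 1 << FLAG_BITS[name]      # stays set until cleared

    def clear_flag(self, name):
        self.latched &= ~(1 << FLAG_BITS[name])   # cleared explicitly by the tDSP

    def wflor(self, mask):
        """Assumed meaning: wake if ANY flag in the mask is latched."""
        return (self.latched & mask) != 0

    def wfland(self, mask):
        """Assumed meaning: wake only if ALL flags in the mask are latched."""
        return (self.latched & mask) == mask


reg = EventRegister()
wanted = mask_of("EVFG0", "EVFC0")

reg.set_flag("EVFC0")                    # e.g. set by the AIP synch function
or_wakes = reg.wflor(wanted)             # True: one of the two flags is set
and_wakes = reg.wfland(wanted)           # False: EVFG0 is still missing

reg.set_flag("EVFG0")                    # e.g. set by the host machine
and_wakes_later = reg.wfland(wanted)     # True: both flags are latched

reg.clear_flag("EVFC0")
and_after_clear = reg.wfland(wanted)     # False again after the clear
```

On the real device the wait would presumably put the core to sleep until the condition holds; here the two methods only report whether it would wake.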
  • 27. KNUPATH Hermosa Programming Model
    A range of programming interfaces to balance performance and productivity, on an axis from Performance to Abstraction:
    Hermosa Assembly Language
    KNUPATH Performance Interface (KPI)
    • Explicit message passing model
    • Hermosa kernel functions library for MPI-like message passing and synchronization
    • C/C++ kernel development using the clang/LLVM compiler toolchain with a Hermosa target
    • Host/device accelerator model, similar to CUDA/OpenCL workflows
    KNUPATH Network Interface (KNI)
    • Implicit message passing model
    • Dataflow programming model based on Kahn Process Networks (KPN)
    • C++ libraries and compiler extensions are used to support user-defined kernels and networks
    • Host target allows development of KNI networks and kernels without Hermosa hardware
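The Kahn Process Network model that the slide names as the basis of KNI can be illustrated with ordinary threads and FIFO queues. This is a generic KPN sketch, not the KNI C++ API: the process functions, the `DONE` end-of-stream marker, and the network shape are all assumptions made for the example.

```python
import queue
import threading

DONE = object()  # end-of-stream marker (an assumption of this sketch)


def source(out, values):
    """Process with no inputs: writes a fixed stream to its output channel."""
    for value in values:
        out.put(value)
    out.put(DONE)


def scale(inp, out, factor):
    """Process with one input channel; every read blocks until data arrives."""
    while True:
        value = inp.get()
        if value is DONE:
            out.put(DONE)
            return
        out.put(value * factor)


def add(in_a, in_b, out):
    """Process with two input channels, read in a fixed order."""
    while True:
        a = in_a.get()
        b = in_b.get()
        if a is DONE or b is DONE:
            out.put(DONE)
            return
        out.put(a + b)


def run_network(xs, ys):
    # Channels are unbounded FIFOs, as in the Kahn model.
    a, a_scaled, b, result = (queue.Queue() for _ in range(4))
    processes = [
        threading.Thread(target=source, args=(a, xs)),
        threading.Thread(target=scale, args=(a, a_scaled, 10)),
        threading.Thread(target=source, args=(b, ys)),
        threading.Thread(target=add, args=(a_scaled, b, result)),
    ]
    for process in processes:
        process.start()
    outputs = []
    while True:
        value = result.get()
        if value is DONE:
            break
        outputs.append(value)
    for process in processes:
        process.join()
    return outputs


print(run_network([1, 2, 3], [4, 5, 6]))  # [14, 25, 36]
```

Because each process only performs blocking reads on FIFO channels, the output is the same however the threads are scheduled. That determinism is the property that makes the Kahn model attractive for implicit message passing.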
  • 28. Contents
    • Neurobiological inspiration for machine learning algorithms
    • Problems with existing machine learning tools and hardware
    • Design principles
    • Hermosa processor design
    • Developer tools and programming
    • Events
  • 29. KnuEdge Web Service
    • Hermosa developer boards available for development of parallel computation on a queued server
    • Available to anyone, free of charge
    • Training materials available
    • User agreement required
    • Contact jotchis@knupath.com for details
  • 30. Upcoming Events
    • Thursday, Feb. 16, 2017: Machine Learning Society Meetup: Processing Hardware for Deep Learning
      – Panelist: D. Palmer (KnuEdge CTO)
      – At ScaleMatrix, 5775 Kearny Villa Road, San Diego
      – https://www.meetup.com/machine-learning-society/events/237055385/
      – www.mlsociety.com
    • April 26-28, 2017: Workshop on Sparse and Heterogeneous Neural Networks
      – Hosted by California Institute for Telecommunications and Information Technology (Calit2)
      – Sponsored by KnuEdge
      – Participants wanted: submit your abstract!
      – At Calit2, Atkinson Hall, UCSD campus
      – https://www.knuedge.com/about-us/events/hnnworkshop/
    • TBA: KnuEdge Developer Users’ Group. Stay tuned!