This document discusses several emerging computing architectures including The Machine, computational memory, computational RAM, managed language accelerators, and neuromorphic engines. For each architecture, it outlines the key technical claims and challenges, and provides a prediction on the technology's likelihood of widespread adoption and penetration into markets like mobile, embedded, and HPC. Overall, the document analyzes these novel approaches against the realities of technology maturation, programming difficulties, application limitations, customer acceptance, and commercial viability.
3. Architectural Failures
Hybrid & Heterogeneity
Typical SOC design
Emerging Trends
Heterogeneous Compute is the Future
What are the trends in industrial adoption of new computing elements?
4. Speaker background: 21 years; Hewlett Packard; Micron; Intel; Transmeta; HP Labs: The Machine; memristor; IEEE Senior Member; ACM Senior Member; Distinguished Technologist / Chief Architect; Hardware Architect; Advanced Memory Systems Architect; Software Engineer; Firmware Engineer; Embedded Computing; Imaging and Embedded Systems; ASIC and SOC Design; Heterogeneous Computing; ARM; MIPS; Patents; Publications; BS: Computer Science; MS: Computer Engineering; PD: Electrical Engineering; Columbia University.
7. Talking about something as a failure will lead to debate. I'm fine debating.
What I ask is that we define failure by asking: why didn't these architectures stand alone?
Was it the architecture, cost, market, or timing?
8. •Founded out of Danny Hillis's PhD work in 1983, Cambridge, MA.
•Array of bit-serial processors.
•Hypercube interconnect.
•Near-memory (4K-bit) microprocessors.
•Not commercially viable until they added an off-the-shelf (OTS) FPU.
•Originally programmed in LISP and later C.
•The later CM5 moved to a MIMD architecture and a fat-tree network.
•Peak performance: 20 Gflops
•Max revenue: $65M
9. •Founded by Jeff Kalb (DEC) in 1987, Sunnyvale.
•Based on a SIMD architecture.
•1K to 16K processors on the MP1.
•Custom logic, fab'd by HP and TI.
•Required a VAX 11/780 front end.
•Originally programmed in Fortran and C on the front end, with MPL running on the MasPar.
•Marketed as "a general purpose computer system."
•Peak performance: 1.2 Gflops
•Max revenue: $20M
10. •Founded by Josh Fisher in 1984, New Haven, CT.
•Based on a VLIW architecture and trace scheduling.
•Ran Unix.
•7 parallel 32-bit operations per instruction.
•125 units sold.
•Peak performance: 100 Mflops
•Max revenue: $15M
11. Company | Primary Architecture | Start | Chapter 11 Filing Date | Where are they now?
Thinking Machines | Massively parallel bit-serial | 1983 | 1994 | SUN/Oracle
nCube | Distributed MIMD hypercube | 1983 | ~2005 | IP used for video on demand
Meiko | Mesh of Inmos Transputers | 1985 | 2003 / 2009 | Bought by Quadrics - defunct
Multiflow | VLIW | 1984 | 1990 | HP, ST
MasPar | Massive SIMD array | 1987 | 1996 | Remnants in data mining
1. The 1980s were a good time to be a computer architect.
2. Novel architectures that try to solve all the world's problems probably won't last.
14. "Heterogeneous System Architecture is a type of computer processor architecture that integrates central processing units and graphics processors on the same bus…"
While that is true, I contend that a true heterogeneous architecture blends the right core/ISA/processor to the job.
A GPU/CPU blend only gets you so far.
16. Matching the workload to the compute element:
CPU (traditional SMP):
•Unpredictable code flow paths
•Legacy code base and application
•Naturally aligns to fine-grain parallelism
•Cost is less of a factor
DSP / VLIW:
•Code has real-time requirements
•Dataflow is naturally streaming in the application
•Design unproven, needs programmatic flexibility
•Cost somewhat a factor
GPU:
•Code is embarrassingly parallel
•Code aligns well with SMT
•Cost not an object
uController:
•Code must perform small kernels of execution in very low power
•Code and data have security implications
•Die area concerns
Fixed Function Si:
•Code has real-time requirements
•Dataflow is naturally streaming in the application
•Fixed IO dependencies
•Design can be hardened
•Ultimate performance
•NRE high, risk high
(A toy selection sketch follows below.)
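To make the selection criteria above concrete, here is a minimal sketch: the trait vocabulary, the CRITERIA table, and suggest_engine() are all made up for illustration and are not part of any real SOC design flow.

```python
# Illustrative only: a toy lookup mirroring the selection criteria above.
# The trait names, CRITERIA table, and suggest_engine() are assumptions made
# up for this sketch, not part of any real SOC design flow.
CRITERIA = {
    "cpu_smp":        {"unpredictable control flow", "legacy code base"},
    "dsp_vliw":       {"real-time requirements", "streaming dataflow"},
    "gpu":            {"embarrassingly parallel", "aligns with smt"},
    "ucontroller":    {"tiny kernels", "very low power", "security sensitive"},
    "fixed_function": {"fixed io dependencies", "ultimate performance"},
}

def suggest_engine(workload_traits):
    """Return the compute element whose criteria overlap the workload most."""
    scores = {engine: len(traits & set(workload_traits))
              for engine, traits in CRITERIA.items()}
    return max(scores, key=scores.get)

# Example: a streaming, hard-real-time workload points at the DSP/VLIW slot.
print(suggest_engine({"streaming dataflow", "real-time requirements"}))  # dsp_vliw
```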
17. SOCs and ASSPs amortize system cost (power, board area, die) into a single die and package.
Push as much functionality into a single package as possible.
SOC data flow is optimized for particularly narrow use cases.
Heterogeneous is more than a CPU + GPU.
18. Typical SOCs currently blend together many cores, much fixed-function silicon, many OSes, and many code bases.
VMs are adopted to ease software migration.
2 symmetric or asymmetric CPUs running some RTOS.
Vivante GC 2D/3D GPU running multiple threads.
DSP running a unique RTOS.
Others…
Example: Marvell Armada 1500 (courtesy Marvell) - actually 8 cores in the SOC and potentially 6 operating systems:
NAND: RTOS
DSP: small DSP OS (STOS)
Secure Core: TEE
Front Panel: uKernel
ARM 1: Greenhills RTOS
ARM 2: Linux
TS Processor: ThreadX
19. Is it the 1980s again? What emerging architectures may find their way into an SOC?
21. The Machine
A combination of:
a collapsed memory stack,
near-memory SOC compute (Moonshot-like),
photonic interconnect.
Pervasive from IOT/embedded markets to exascale.
A new OS is being created (Linux++) with university collaboration.
Designed from the ground up to address performance, security, and data locality.
22. Claims:
Problem size: 2^42 edges vs Blue Gene/Q at 2^40 edges
Performance: 16 GTEPS vs Blue Gene/Q at 15.3 GTEPS
Power: 400 kW vs Blue Gene/Q at 7,900 kW
Cores/racks: 122K cores / 20 racks vs Blue Gene/Q at 1.6M cores / 96 racks
Utilization < 70%
The memristor is optional in this architecture; PCM may be a backup option.
(A quick efficiency check follows below.)
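Taking the slide's numbers at face value, here is a back-of-the-envelope check of the implied energy efficiency; the calculation and variable names are mine, the figures are the ones quoted above.

```python
# Back-of-the-envelope efficiency check using the figures on this slide.
machine_gteps,  machine_kw  = 16.0, 400.0     # The Machine (claimed)
bluegene_gteps, bluegene_kw = 15.3, 7900.0    # Blue Gene/Q (as quoted here)

machine_eff  = machine_gteps  / machine_kw    # GTEPS per kW
bluegene_eff = bluegene_gteps / bluegene_kw

print(f"The Machine : {machine_eff:.4f} GTEPS/kW")                     # 0.0400
print(f"Blue Gene/Q : {bluegene_eff:.4f} GTEPS/kW")                    # 0.0019
print(f"Claimed efficiency ratio: {machine_eff / bluegene_eff:.0f}x")  # ~21x
```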
23. Two types of computational memory with data stored as resistance:
Memristor crossbar
Material implication ("IMP") architecture: p implies q, i.e. if p then q.
Adding memristor IMP components would double the size of the die but yield a 1000x performance improvement.
Still need a silicon gate to drive current; memristors don't "drive" anything.
Still unproven that bulk memristors can be synthesized with any usable yield.
Truth table of Boolean logic built with memristor IMP (courtesy Shahsavari 2010). A sketch of IMP-based logic follows below.
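As a logic-level illustration only (not a device or circuit model), the sketch below shows material implication as a Boolean primitive and how, together with a FALSE/reset operation, it is functionally complete; the function names are made up for this example.

```python
# Logic-level sketch only (not a device model). Material implication,
# p IMP q, is "if p then q", i.e. (not p) or q. On a memristor crossbar the
# result conditionally overwrites the cell holding q; here it is just Boolean.
def IMP(p, q):
    return (not p) or q

def FALSE():
    return False   # in hardware: reset a cell to logic 0

# IMP plus FALSE is functionally complete, so other gates can be composed:
def NOT(p):     return IMP(p, FALSE())        # (not p) or 0  ==  not p
def NAND(p, q): return IMP(q, NOT(p))         # (not q) or (not p)
def AND(p, q):  return NOT(NAND(p, q))
def OR(p, q):   return IMP(NOT(p), q)         # p or q

# Truth table for IMP, matching the figure referenced on this slide:
for p in (False, True):
    for q in (False, True):
        print(int(p), int(q), "->", int(IMP(p, q)))
```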
24. Automata Processor
Micron-fabricated DRAM
Non-von Neumann dataflow
48K state transition elements per chip
6.6T path decisions/s
4W max TDP
DDR3 RAM interface
State Machine Compiler and unique tools
Pattern matching on a von Neumann CPU: O(n^2); on the Automata Processor: O(1).
(A software NFA sketch follows below.)
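The dataflow idea can be illustrated in software: in the sketch below, every active state advances on each input symbol at once, which is the behavior the AP's state transition elements provide in hardware. The Automaton class and its methods are invented for this example and are not Micron's toolchain.

```python
# Software sketch of the dataflow idea: every active state examines each input
# symbol at the same time, so per-symbol work does not grow with the number of
# patterns. The Automaton class and its methods are invented for this example.
from collections import defaultdict

class Automaton:
    def __init__(self):
        self.transitions = defaultdict(set)   # (state, symbol) -> next states
        self.accepting = set()                # (state, pattern tag)

    def add_pattern(self, pattern, tag):
        state = ("start",)
        for ch in pattern:
            nxt = state + (ch,)
            self.transitions[(state, ch)].add(nxt)
            state = nxt
        self.accepting.add((state, tag))

    def scan(self, text):
        """Feed one symbol at a time; all active states advance together."""
        matches, active = [], set()
        for pos, ch in enumerate(text):
            active.add(("start",))            # a pattern may begin at any offset
            active = {n for s in active for n in self.transitions.get((s, ch), ())}
            matches += [(pos, tag) for state, tag in self.accepting if state in active]
        return matches

a = Automaton()
a.add_pattern("cat", "CAT")
a.add_pattern("cart", "CART")
print(a.scan("the cat pushed a cart"))        # [(6, 'CAT'), (20, 'CART')]
```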
26. Automata Processor to full ALU support
Berkeley iRAM
VIRAM
CRAM Experiments during IRAM era
Courtesy Elliot, 1999
27. Direct Bytecode Execution (DBX)
No need to JIT.
Improves startup time.
Avoids the code inflation of the JIT process (~8x).
ARM Jazelle:
140 Java instructions are directly executed.
94 are emulated in short bursts of ARM instructions.
12K silicon gates.
The Dalvik VM put it out of business: no JIT, register based.
Back in 2004 this data looked phenomenal.
(A toy bytecode interpreter sketch follows below.)
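For contrast with direct bytecode execution, here is a toy stack-machine interpreter; the opcode names are simplified and made up, not real JVM bytecode handling. It shows the per-opcode dispatch loop that Jazelle moves into hardware and that a JIT compiles away.

```python
# Toy stack-machine interpreter (simplified, made-up opcodes, not real JVM
# bytecode). It shows the per-opcode fetch/decode/dispatch overhead that
# direct bytecode execution moves into hardware and that a JIT compiles away.
def interpret(bytecode):
    stack, pc = [], 0
    while pc < len(bytecode):
        op = bytecode[pc]; pc += 1            # fetch and decode every iteration
        if op == "iconst":
            stack.append(bytecode[pc]); pc += 1
        elif op == "iadd":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == "ireturn":
            return stack.pop()
        else:
            raise ValueError(f"unhandled opcode {op!r}")

# 2 + 3 expressed as stack operations:
print(interpret(["iconst", 2, "iconst", 3, "iadd", "ireturn"]))   # 5
```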
28. Build a non-von Neumann architecture based on a "leaky integrate and fire" CMOS topology.
Each of the 4096 cores models 256 "neurons" with 256 "synapses" on a 5B-transistor (28nm) die.
The system clock is very slow.
Memory units are tightly coupled to the "neurons". IBM refers to this as the TrueNorth architecture.
Goals:
10B cores, 100T connections (synapses)
1 kW (45 pJ per compute)
2 liters of space
TrueNorth architecture: uses simple leaky integrate-and-fire CMOS. (A minimal LIF sketch follows below.)
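Below is a minimal discrete-time leaky integrate-and-fire model, sketched with arbitrary parameter values; it illustrates the general LIF idea only and is not the TrueNorth neuron model.

```python
# Minimal discrete-time leaky integrate-and-fire neuron (illustrative only;
# parameter values are arbitrary and this is not the TrueNorth neuron model).
def lif(inputs, leak=0.9, threshold=1.0):
    """Integrate weighted input with leakage; spike and reset at threshold."""
    v, spikes = 0.0, []
    for i in inputs:
        v = v * leak + i          # leak, then integrate this step's input
        if v >= threshold:        # fire
            spikes.append(1)
            v = 0.0               # reset membrane potential
        else:
            spikes.append(0)
    return spikes

print(lif([0.3, 0.3, 0.3, 0.0, 0.6, 0.6]))   # [0, 0, 0, 0, 1, 0]
```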
29. 1st parts fabricated at HRL Labs.
Total current funding: $102M since 2009.
Timeline:
Phase 2 (2013): multi-core synaptic processor based on TrueNorth; 1M neurons (256 neurons per core, 4096 cores).
Phase 3 (??): fabricate a 10M-core part. Simulate a mouse.
Phase 4 (2017): 100M cores.
TrueNorth architecture: uses simple leaky integrate-and-fire CMOS.
30. Technology | Challenges | Prediction
The Machine | Large OS hurdle; persistent memory yields; doubtful SOC benefits | Select customer adoption; low penetration in mobile or embedded
Computational Memory | Difficult to program | Uncertain application; crossbar may have its genesis as an FPGA alternative
Computational RAM | Toolchain; Automata programming; application limits | Possible HPC and mobile penetration as the technology matures
Managed Language Accelerators | Modern JIT engines changing rapidly | Low acceptance
Neuromorphic Engines | Huge die size and cost; programming difficulties | Acceptance in research and limited HPC application