Title: What I learned building a parallel processor from scratch
Presenter: Andreas Olofsson
Location: ISPASS 2015 Keynote
Date: March
Abstract:
In 2008 I left my long time employer (Analog Devices) to start a company with a mission to build a new type of parallel computer. This talk will present the inspiration, design philosophy, mistakes, success, iterations and surprises encountered in designing the Epiphany computer architecture, four generations of Epiphany chips, and the $99 Parallella credit card sized “supercomputer” project.
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
What I learned building a parallel processor from scratch
1. What I Learned Building a
Parallel Processor from Scratch
ISPASS, Philadelphia
March 31, 2015
1
2. My 2007 “Epiphany”
It's more efficient to design a pipeline of
specialized processors than having a single massive
general purpose processor that runs a complete
application.
DUH!
2
3. My Background
8 years designing DSPs
100 people, $100M R&D
TigerSHARC was a tech success,
but a huge financial flop!
Mixed signal SOCs for CCDs
Invented new ISA in 2 weeks
3-4 designers per derivative
Huge success (>$100M)
+
3
4. My 2008 Processor Design Goals
●
ANSI-C programmable
●
Floating point (industrial + medical markets)
●
Scalabale
●
10X boost in GFLOPS/Watt (otherwise, who would buy?)
●
Simple enough to be built by < 5 engineers
4
5. The Epiphany Multicore Architecture
RISC SRAM
NOC
Data
Mover
(3,3)
(0,0)
● ANSI-C Programmable
● IEEE 32bit floating point
● Nothing shared
● Distributed memory model
● Scalabale to “infinity”
● 50 GFLOPS/W
RISC
(1GHZ)
SRAM
NOC
Data
Mover
5
6. 20/20 Decision Report Card
6
✔
NO HW cache
✔
NO HW coherence
✔
Tiny custom RISC ISA
✔
Floating point
✔
2D Mesh
✔
C-programmability
✔
Distributed memory
✔
64-bit LDR/STR
✔
64 registers
✔
4 bank local memory
✔
5+3 dual issue pipeline
✔
Single flit NOC
✔
Byte addressability
✔
HW loops
✔
Autoincrement LDR/STR
✔
Slow reads
✔
Fetch from external
✔
NO 64bit FPU
✔
NO 64bit addressing
✔
32KB of memory
✔
NO native integer mult
✔
NO DRAM interface
7. My Development Process
●
One person owns the whole path from arch to layout
●
Spreadsheet (triangulating architectures)
●
Emacs/VI (does code look clean on paper?)
●
Manually “count” gates in code. “Feel complexity”
●
Verilog Simulation (checks for logical/mental mistakes)
●
Synthesize design (in FPGA or EDA to measure cost!!)
●
Use advisors (peer review to look for gotchas)
●
Application development (No but I should have!!)
(*considering the lack of benchmarks, it's a small miracle
Epiphany works as well as it does)7
8. Step 1: Study History - 3 months
My inspiration of what to do (and NOT to do)
●
Henessy and Patterson's classic books. (Re)Read them!
●
Blackfin, TigerSHARC, and c64x (DSP features)
●
ARM and MIPS (standard features)
●
Tilera and Ambric (manycore features)
GOTCHA: Watch out for “clever” features! The really
useful features will be common to all architectures.
8
9. Step 2: Create a baseline - 3 months
●
How much area goes to computation?
●
How expensive is SRAM and cache?
●
What is the $ cost of IP and design effort?
●
How expensive is the network?
●
What is the performance gain of instruction X in app Y?
●
How often is instruction X used?
●
What does a “typical” application look like
These numbers are your anchor, get them right!
9
11. Epiphany (-1)
●
May 2008
●
Naive but close...
●
Could have had 64
cores in 2008!
●
Epiphany arch has
only changed by
“delta” since 2008
●
Submitted many
grant proposals,
papers (all rejected)
11
12. Epiphany-0
●
Oct 2008 synthesis and layout prototype
●
Virtual prototypes (synthesis and layout of cores)
●
Showed that 1 Ghz operation and 50GFLOPS/W was feasible
●
Gave me the courage to ask friends and family for money ($200K)
12
13. Epiphany-I
●
Jun 2009 tapeout
●
16 cores, 65nm prototype
●
~1 person, 12 weeks
●
Changes 7 days after TO!
●
Barely worked but proved
it was possible
●
~10mm^2
13
14. Raised capital (Nov 2009)
14
Series-A
●
$1.5M from BittWare, a
DSP board company.
(a TigerSHARC customer)
●
They invested because
their customers were
interested
●
Complicated deal
●
6 traunches x $250K
15. Building a Team (Dec, 2009)
Oleg Raikhman
DV/Tools
Roman Trogan
Design/Layout
Andreas Olofsson
Arch/Design
Lesson: Team makeup is the #1 determinant for success. You
must have it in place before you set out on journey.
15
16. Epiphany-II
●
May 2010 tapeout
●
16 cores, 65nm
●
~3 person, ~6 weeks
●
Changes 1 day before TO
●
A vendor screwed up!
●
Should have been a
product!
●
~12mm^2
16
17. It Works!
●
Sept 2010 bringup
●
Ran floating point
program!
●
Proved efficiency!
●
Packaging yield was
less than 10% on
eLink (vendor!!).
●
SPI debug channel was
key to debugging
vendor packaging
issue17
18. Epiphany-III
●
Dec 2010 tapeout
●
16 cores, 65nm
●
This is a product!
●
Kept improving with
every iteration
●
RTL changes 1 day
before TO
●
~25 GFLOPS/W
●
~12mm^2
18
19. New Packaging
●
60um staggered pitch is
not trivial!
●
Vendor failure on E1
and E2 almost broke
company!
●
New vendor (Kyocera)
●
You can't be on the
bleeding edge without
the right partners!
●
Better results!
19
20. First Delivery
●
May 2011
●
Chip worked
perfectly out of the
box!
●
This silicon became
the E16G301
●
Felt at peace
●
Happy moment, but
only the beginning...
20
22. First sale!
●
June 2011
●
A $5,000 pitiful eval kit
●
To a big competitor...
●
They never even turned
it on...
●
...which was probably a
good thing.
22
24. First sale!
●
June 2011
●
A $5,000 pitiful eval kit
●
To a big competitor...
●
They never even turned
it on...
●
...which was probably a
good thing.
24
25. First Board
Product
●
May, 2012 ship
●
FMC board from
Bittware
●
4 chip array
●
Expensive and
not a solution
●
Niche market
●
Never shipped in
volume25
26. Software
●
2012 focus
●
Invested heavily in
software tools +
demos
●
Face recognition,
FFT, Matmul
●
Beat ARM by 10X
on demo for BIG
smartphone
company!
●
Response: “Meh...”26
27. Parallella
●
Sept, 2012
●
Running out of
money...
●
We needed a
parallel community
and a better kit, so
the $99 Parallella
was born.
●
KS combines pre-
purchase,
community, and
marketing.27
28. The Parallella Computer
● An open parallel computer
● Launched in 2012 at $99
● Open source SW/HW!
● Dual-core ARM A9 processor
● FPGA logic
● 1GB RAM, USB, HDMI, GigE
● 16/64 Epiphany coprocessors
● 50 Gbit/sec IO, 25/100
GFLOPS
● <5 Watts
● 10,000 shipped (20,000 built)
● 200 Universities
31. Powerup (Gen0)
●
May 2013
●
My first board and it
worked!
●
Only minor issues
●
Challenge was super
aggressive deisgn
targets (credit card,
features, cost,
schedule)
31
32. Gen0 Release
●
July 2013
●
Gen0 board
released!
●
We build working
cluster with 42
boards!
●
Sent out 50 boards
to KS backers, only
~1 got significant
use!
●
Why?32
33. Chips are back!
●
Aug 2013
●
Full mask tapeout
●
New Tier-1 package
vendor (ASE).
●
~90% yield!
●
Good thermals
●
How to test 10,000
chips with $0 NRE?
33
34. New Investment...
●
Dec 20, 2013
●
$3.6M from Ericsson + VC
●
Saved the company...but
came in too late to save
eng team...
●
...and we were still buried
in KS comittments and
logistics issues...
●
5,000 customers waiting
for us to deliver
34
35. 2014 Distractions: Manufacturig, logistics, T-
shirts, cases, logistics, power supplies,
accessories...what is Adapteva about again?
35
36. Shipping HW
●
June 2014
●
Shipped to 200
Universities & 10,000
developers
●
Built the “A1”, the
world's densest
cluster and launched
at ISC14
●
Making progress...
36
38. Some Key Hardware Lessons
●
Designing leading edge chips is fairly easy today for a small
team with the right experience (<<$100M)
●
Designing boards is easier than designing chips but not trivial
●
One bad choice can break a startup (99/100 is a failing grade)
●
Never push the tools, process, or capabilities of your vendor
●
Architects should be vertically integrated
●
Symmetry is a very powerful design principle
●
Your product is only strong as your weakest vendor partner
●
Tiled architectures are MAGICAL
38
39. Some Software Lessons
●
Customers take the path of least resistance
●
Developers take the path of least resistance
●
Surprisingly hard to find applications that need more
performance....
...and those applications are really complex..
..meaning they have tons of legacy stack software...
●
SW trumps energy efficiency in 99.99% of applications
●
Parallel programming still very much a niche
39
41. ..so we keep pushing the ball up hill
The Parallel Architectures Library
●
A new “standard library” for parallel
●
Compact C library with optimized routines for vector math,
synchronization, and multi-processor communication.
●
Designed to be portable across multiple ISAs
●
Open source (apache 2.0 permissive license)
●
Open invitation to participate!!
●
https://github.com/parallella/pal
41
42. What I (still) Know to be True...
●
The world will continue to be driven by $$
●
Moore's law WILL come to an end
●
Parallel computing is inevitable
●
Architectures like Epiphany are the future
●
CPUs, FPGAs, and manycore will coexist
●
Never doubt the ingenuity of an engineer
●
It's a dirty job but someone has to do it...
4242