What I Learned Building a
Parallel Processor from Scratch
ISPASS, Philadelphia
March 31, 2015
1
My 2007 “Epiphany”
It's more efficient to design a pipeline of
specialized processors than having a single massive
general purpose processor that runs a complete
application.
DUH!
2
My Background
8 years designing DSPs
100 people, $100M R&D
TigerSHARC was a tech success,
but a huge financial flop!
Mixed signal SOCs for CCDs
Invented new ISA in 2 weeks
3-4 designers per derivative
Huge success (>$100M)
+
3
My 2008 Processor Design Goals
●
ANSI-C programmable
●
Floating point (industrial + medical markets)
●
Scalabale
●
10X boost in GFLOPS/Watt (otherwise, who would buy?)
●
Simple enough to be built by < 5 engineers
4
The Epiphany Multicore Architecture
RISC SRAM
NOC
Data
Mover
(3,3)
(0,0)
● ANSI-C Programmable
● IEEE 32bit floating point
● Nothing shared
● Distributed memory model
● Scalabale to “infinity”
● 50 GFLOPS/W
RISC
(1GHZ)
SRAM
NOC
Data
Mover
5
20/20 Decision Report Card
6
✔
NO HW cache
✔
NO HW coherence
✔
Tiny custom RISC ISA
✔
Floating point
✔
2D Mesh
✔
C-programmability
✔
Distributed memory
✔
64-bit LDR/STR
✔
64 registers
✔
4 bank local memory
✔
5+3 dual issue pipeline
✔
Single flit NOC
✔
Byte addressability
✔
HW loops
✔
Autoincrement LDR/STR
✔
Slow reads
✔
Fetch from external
✔
NO 64bit FPU
✔
NO 64bit addressing
✔
32KB of memory
✔
NO native integer mult
✔
NO DRAM interface
My Development Process
●
One person owns the whole path from arch to layout
●
Spreadsheet (triangulating architectures)
●
Emacs/VI (does code look clean on paper?)
●
Manually “count” gates in code. “Feel complexity”
●
Verilog Simulation (checks for logical/mental mistakes)
●
Synthesize design (in FPGA or EDA to measure cost!!)
●
Use advisors (peer review to look for gotchas)
●
Application development (No but I should have!!)
(*considering the lack of benchmarks, it's a small miracle
Epiphany works as well as it does)7
Step 1: Study History - 3 months
My inspiration of what to do (and NOT to do)
●
Henessy and Patterson's classic books. (Re)Read them!
●
Blackfin, TigerSHARC, and c64x (DSP features)
●
ARM and MIPS (standard features)
●
Tilera and Ambric (manycore features)
GOTCHA: Watch out for “clever” features! The really
useful features will be common to all architectures.
8
Step 2: Create a baseline - 3 months
●
How much area goes to computation?
●
How expensive is SRAM and cache?
●
What is the $ cost of IP and design effort?
●
How expensive is the network?
●
What is the performance gain of instruction X in app Y?
●
How often is instruction X used?
●
What does a “typical” application look like
These numbers are your anchor, get them right!
9
Step 3: Verify assumptions - 6 months
●
Design and simulate (to verify cost assumptions)
●
Build prototypes (to verify simulation assumptions)
●
Talk to customers (to verify product feature assumptions)
10
Epiphany (-1)
●
May 2008
●
Naive but close...
●
Could have had 64
cores in 2008!
●
Epiphany arch has
only changed by
“delta” since 2008
●
Submitted many
grant proposals,
papers (all rejected)
11
Epiphany-0
●
Oct 2008 synthesis and layout prototype
●
Virtual prototypes (synthesis and layout of cores)
●
Showed that 1 Ghz operation and 50GFLOPS/W was feasible
●
Gave me the courage to ask friends and family for money ($200K)
12
Epiphany-I
●
Jun 2009 tapeout
●
16 cores, 65nm prototype
●
~1 person, 12 weeks
●
Changes 7 days after TO!
●
Barely worked but proved
it was possible
●
~10mm^2
13
Raised capital (Nov 2009)
14
Series-A
●
$1.5M from BittWare, a
DSP board company.
(a TigerSHARC customer)
●
They invested because
their customers were
interested
●
Complicated deal
●
6 traunches x $250K
Building a Team (Dec, 2009)
Oleg Raikhman
DV/Tools
Roman Trogan
Design/Layout
Andreas Olofsson
Arch/Design
Lesson: Team makeup is the #1 determinant for success. You
must have it in place before you set out on journey.
15
Epiphany-II
●
May 2010 tapeout
●
16 cores, 65nm
●
~3 person, ~6 weeks
●
Changes 1 day before TO
●
A vendor screwed up!
●
Should have been a
product!
●
~12mm^2
16
It Works!
●
Sept 2010 bringup
●
Ran floating point
program!
●
Proved efficiency!
●
Packaging yield was
less than 10% on
eLink (vendor!!).
●
SPI debug channel was
key to debugging
vendor packaging
issue17
Epiphany-III
●
Dec 2010 tapeout
●
16 cores, 65nm
●
This is a product!
●
Kept improving with
every iteration
●
RTL changes 1 day
before TO
●
~25 GFLOPS/W
●
~12mm^2
18
New Packaging
●
60um staggered pitch is
not trivial!
●
Vendor failure on E1
and E2 almost broke
company!
●
New vendor (Kyocera)
●
You can't be on the
bleeding edge without
the right partners!
●
Better results!
19
First Delivery
●
May 2011
●
Chip worked
perfectly out of the
box!
●
This silicon became
the E16G301
●
Felt at peace
●
Happy moment, but
only the beginning...
20
Embedded
Systems
Conference
●
May, 2011
●
Launched Epiphany-III
●
Lots of press!
●
Competitors curious
●
But where were the
customers!!??
●
Flew home depressed
and pissed off...21
First sale!
●
June 2011
●
A $5,000 pitiful eval kit
●
To a big competitor...
●
They never even turned
it on...
●
...which was probably a
good thing.
22
Epiphany-IV
●
Aug 2011 tapeout
●
(Jul 2012 samples)
●
64 cores, 28nm
●
50 GFLOPS/W
●
RTL changes 2 days
before TO
●
3 engineers, 12 weeks!
23
First sale!
●
June 2011
●
A $5,000 pitiful eval kit
●
To a big competitor...
●
They never even turned
it on...
●
...which was probably a
good thing.
24
First Board
Product
●
May, 2012 ship
●
FMC board from
Bittware
●
4 chip array
●
Expensive and
not a solution
●
Niche market
●
Never shipped in
volume25
Software
●
2012 focus
●
Invested heavily in
software tools +
demos
●
Face recognition,
FFT, Matmul
●
Beat ARM by 10X
on demo for BIG
smartphone
company!
●
Response: “Meh...”26
Parallella
●
Sept, 2012
●
Running out of
money...
●
We needed a
parallel community
and a better kit, so
the $99 Parallella
was born.
●
KS combines pre-
purchase,
community, and
marketing.27
The Parallella Computer
● An open parallel computer
● Launched in 2012 at $99
● Open source SW/HW!
● Dual-core ARM A9 processor
● FPGA logic
● 1GB RAM, USB, HDMI, GigE
● 16/64 Epiphany coprocessors
● 50 Gbit/sec IO, 25/100
GFLOPS
● <5 Watts
● 10,000 shipped (20,000 built)
● 200 Universities
OPTIMISTIC & STUBBORN
29
Naive
Optimism
Birthday
Bottomed Out
Pure
Stubborness
Relaunch
Success?
●
Nov 2012
●
Woke up
next
morning
stressed and
behind
schedule.
●
Only sell
what you
have!!
30
Powerup (Gen0)
●
May 2013
●
My first board and it
worked!
●
Only minor issues
●
Challenge was super
aggressive deisgn
targets (credit card,
features, cost,
schedule)
31
Gen0 Release
●
July 2013
●
Gen0 board
released!
●
We build working
cluster with 42
boards!
●
Sent out 50 boards
to KS backers, only
~1 got significant
use!
●
Why?32
Chips are back!
●
Aug 2013
●
Full mask tapeout
●
New Tier-1 package
vendor (ASE).
●
~90% yield!
●
Good thermals
●
How to test 10,000
chips with $0 NRE?
33
New Investment...
●
Dec 20, 2013
●
$3.6M from Ericsson + VC
●
Saved the company...but
came in too late to save
eng team...
●
...and we were still buried
in KS comittments and
logistics issues...
●
5,000 customers waiting
for us to deliver
34
2014 Distractions: Manufacturig, logistics, T-
shirts, cases, logistics, power supplies,
accessories...what is Adapteva about again?
35
Shipping HW
●
June 2014
●
Shipped to 200
Universities & 10,000
developers
●
Built the “A1”, the
world's densest
cluster and launched
at ISC14
●
Making progress...
36
Now what?
Faulty cores
Cooperating
Program
(messages)
NODE:
64KB
1GHz RISC Core
2 GFLOPS/core
Physical
Constraints:
1.5ns/hop latency
10pJ / FLOP
30pJ / off chip
read/write
10pJ / on chip
end2end
(give or take....)
Some Key Hardware Lessons
●
Designing leading edge chips is fairly easy today for a small
team with the right experience (<<$100M)
●
Designing boards is easier than designing chips but not trivial
●
One bad choice can break a startup (99/100 is a failing grade)
●
Never push the tools, process, or capabilities of your vendor
●
Architects should be vertically integrated
●
Symmetry is a very powerful design principle
●
Your product is only strong as your weakest vendor partner
●
Tiled architectures are MAGICAL
38
Some Software Lessons
●
Customers take the path of least resistance
●
Developers take the path of least resistance
●
Surprisingly hard to find applications that need more
performance....
...and those applications are really complex..
..meaning they have tons of legacy stack software...
●
SW trumps energy efficiency in 99.99% of applications
●
Parallel programming still very much a niche
39
MY FINAL LESSON:
(took me 7 years to learn)
IT'S THE SOFTWARE
STUPID!!!!
40
..so we keep pushing the ball up hill
The Parallel Architectures Library
●
A new “standard library” for parallel
●
Compact C library with optimized routines for vector math,
synchronization, and multi-processor communication.
●
Designed to be portable across multiple ISAs
●
Open source (apache 2.0 permissive license)
●
Open invitation to participate!!
●
https://github.com/parallella/pal
41
What I (still) Know to be True...
●
The world will continue to be driven by $$
●
Moore's law WILL come to an end
●
Parallel computing is inevitable
●
Architectures like Epiphany are the future
●
CPUs, FPGAs, and manycore will coexist
●
Never doubt the ingenuity of an engineer
●
It's a dirty job but someone has to do it...
4242

What I learned building a parallel processor from scratch

  • 1.
    What I LearnedBuilding a Parallel Processor from Scratch ISPASS, Philadelphia March 31, 2015 1
  • 2.
    My 2007 “Epiphany” It'smore efficient to design a pipeline of specialized processors than having a single massive general purpose processor that runs a complete application. DUH! 2
  • 3.
    My Background 8 yearsdesigning DSPs 100 people, $100M R&D TigerSHARC was a tech success, but a huge financial flop! Mixed signal SOCs for CCDs Invented new ISA in 2 weeks 3-4 designers per derivative Huge success (>$100M) + 3
  • 4.
    My 2008 ProcessorDesign Goals ● ANSI-C programmable ● Floating point (industrial + medical markets) ● Scalabale ● 10X boost in GFLOPS/Watt (otherwise, who would buy?) ● Simple enough to be built by < 5 engineers 4
  • 5.
    The Epiphany MulticoreArchitecture RISC SRAM NOC Data Mover (3,3) (0,0) ● ANSI-C Programmable ● IEEE 32bit floating point ● Nothing shared ● Distributed memory model ● Scalabale to “infinity” ● 50 GFLOPS/W RISC (1GHZ) SRAM NOC Data Mover 5
  • 6.
    20/20 Decision ReportCard 6 ✔ NO HW cache ✔ NO HW coherence ✔ Tiny custom RISC ISA ✔ Floating point ✔ 2D Mesh ✔ C-programmability ✔ Distributed memory ✔ 64-bit LDR/STR ✔ 64 registers ✔ 4 bank local memory ✔ 5+3 dual issue pipeline ✔ Single flit NOC ✔ Byte addressability ✔ HW loops ✔ Autoincrement LDR/STR ✔ Slow reads ✔ Fetch from external ✔ NO 64bit FPU ✔ NO 64bit addressing ✔ 32KB of memory ✔ NO native integer mult ✔ NO DRAM interface
  • 7.
    My Development Process ● Oneperson owns the whole path from arch to layout ● Spreadsheet (triangulating architectures) ● Emacs/VI (does code look clean on paper?) ● Manually “count” gates in code. “Feel complexity” ● Verilog Simulation (checks for logical/mental mistakes) ● Synthesize design (in FPGA or EDA to measure cost!!) ● Use advisors (peer review to look for gotchas) ● Application development (No but I should have!!) (*considering the lack of benchmarks, it's a small miracle Epiphany works as well as it does)7
  • 8.
    Step 1: StudyHistory - 3 months My inspiration of what to do (and NOT to do) ● Henessy and Patterson's classic books. (Re)Read them! ● Blackfin, TigerSHARC, and c64x (DSP features) ● ARM and MIPS (standard features) ● Tilera and Ambric (manycore features) GOTCHA: Watch out for “clever” features! The really useful features will be common to all architectures. 8
  • 9.
    Step 2: Createa baseline - 3 months ● How much area goes to computation? ● How expensive is SRAM and cache? ● What is the $ cost of IP and design effort? ● How expensive is the network? ● What is the performance gain of instruction X in app Y? ● How often is instruction X used? ● What does a “typical” application look like These numbers are your anchor, get them right! 9
  • 10.
    Step 3: Verifyassumptions - 6 months ● Design and simulate (to verify cost assumptions) ● Build prototypes (to verify simulation assumptions) ● Talk to customers (to verify product feature assumptions) 10
  • 11.
    Epiphany (-1) ● May 2008 ● Naivebut close... ● Could have had 64 cores in 2008! ● Epiphany arch has only changed by “delta” since 2008 ● Submitted many grant proposals, papers (all rejected) 11
  • 12.
    Epiphany-0 ● Oct 2008 synthesisand layout prototype ● Virtual prototypes (synthesis and layout of cores) ● Showed that 1 Ghz operation and 50GFLOPS/W was feasible ● Gave me the courage to ask friends and family for money ($200K) 12
  • 13.
    Epiphany-I ● Jun 2009 tapeout ● 16cores, 65nm prototype ● ~1 person, 12 weeks ● Changes 7 days after TO! ● Barely worked but proved it was possible ● ~10mm^2 13
  • 14.
    Raised capital (Nov2009) 14 Series-A ● $1.5M from BittWare, a DSP board company. (a TigerSHARC customer) ● They invested because their customers were interested ● Complicated deal ● 6 traunches x $250K
  • 15.
    Building a Team(Dec, 2009) Oleg Raikhman DV/Tools Roman Trogan Design/Layout Andreas Olofsson Arch/Design Lesson: Team makeup is the #1 determinant for success. You must have it in place before you set out on journey. 15
  • 16.
    Epiphany-II ● May 2010 tapeout ● 16cores, 65nm ● ~3 person, ~6 weeks ● Changes 1 day before TO ● A vendor screwed up! ● Should have been a product! ● ~12mm^2 16
  • 17.
    It Works! ● Sept 2010bringup ● Ran floating point program! ● Proved efficiency! ● Packaging yield was less than 10% on eLink (vendor!!). ● SPI debug channel was key to debugging vendor packaging issue17
  • 18.
    Epiphany-III ● Dec 2010 tapeout ● 16cores, 65nm ● This is a product! ● Kept improving with every iteration ● RTL changes 1 day before TO ● ~25 GFLOPS/W ● ~12mm^2 18
  • 19.
    New Packaging ● 60um staggeredpitch is not trivial! ● Vendor failure on E1 and E2 almost broke company! ● New vendor (Kyocera) ● You can't be on the bleeding edge without the right partners! ● Better results! 19
  • 20.
    First Delivery ● May 2011 ● Chipworked perfectly out of the box! ● This silicon became the E16G301 ● Felt at peace ● Happy moment, but only the beginning... 20
  • 21.
    Embedded Systems Conference ● May, 2011 ● Launched Epiphany-III ● Lotsof press! ● Competitors curious ● But where were the customers!!?? ● Flew home depressed and pissed off...21
  • 22.
    First sale! ● June 2011 ● A$5,000 pitiful eval kit ● To a big competitor... ● They never even turned it on... ● ...which was probably a good thing. 22
  • 23.
    Epiphany-IV ● Aug 2011 tapeout ● (Jul2012 samples) ● 64 cores, 28nm ● 50 GFLOPS/W ● RTL changes 2 days before TO ● 3 engineers, 12 weeks! 23
  • 24.
    First sale! ● June 2011 ● A$5,000 pitiful eval kit ● To a big competitor... ● They never even turned it on... ● ...which was probably a good thing. 24
  • 25.
    First Board Product ● May, 2012ship ● FMC board from Bittware ● 4 chip array ● Expensive and not a solution ● Niche market ● Never shipped in volume25
  • 26.
    Software ● 2012 focus ● Invested heavilyin software tools + demos ● Face recognition, FFT, Matmul ● Beat ARM by 10X on demo for BIG smartphone company! ● Response: “Meh...”26
  • 27.
    Parallella ● Sept, 2012 ● Running outof money... ● We needed a parallel community and a better kit, so the $99 Parallella was born. ● KS combines pre- purchase, community, and marketing.27
  • 28.
    The Parallella Computer ●An open parallel computer ● Launched in 2012 at $99 ● Open source SW/HW! ● Dual-core ARM A9 processor ● FPGA logic ● 1GB RAM, USB, HDMI, GigE ● 16/64 Epiphany coprocessors ● 50 Gbit/sec IO, 25/100 GFLOPS ● <5 Watts ● 10,000 shipped (20,000 built) ● 200 Universities
  • 29.
  • 30.
    Success? ● Nov 2012 ● Woke up next morning stressedand behind schedule. ● Only sell what you have!! 30
  • 31.
    Powerup (Gen0) ● May 2013 ● Myfirst board and it worked! ● Only minor issues ● Challenge was super aggressive deisgn targets (credit card, features, cost, schedule) 31
  • 32.
    Gen0 Release ● July 2013 ● Gen0board released! ● We build working cluster with 42 boards! ● Sent out 50 boards to KS backers, only ~1 got significant use! ● Why?32
  • 33.
    Chips are back! ● Aug2013 ● Full mask tapeout ● New Tier-1 package vendor (ASE). ● ~90% yield! ● Good thermals ● How to test 10,000 chips with $0 NRE? 33
  • 34.
    New Investment... ● Dec 20,2013 ● $3.6M from Ericsson + VC ● Saved the company...but came in too late to save eng team... ● ...and we were still buried in KS comittments and logistics issues... ● 5,000 customers waiting for us to deliver 34
  • 35.
    2014 Distractions: Manufacturig,logistics, T- shirts, cases, logistics, power supplies, accessories...what is Adapteva about again? 35
  • 36.
    Shipping HW ● June 2014 ● Shippedto 200 Universities & 10,000 developers ● Built the “A1”, the world's densest cluster and launched at ISC14 ● Making progress... 36
  • 37.
    Now what? Faulty cores Cooperating Program (messages) NODE: 64KB 1GHzRISC Core 2 GFLOPS/core Physical Constraints: 1.5ns/hop latency 10pJ / FLOP 30pJ / off chip read/write 10pJ / on chip end2end (give or take....)
  • 38.
    Some Key HardwareLessons ● Designing leading edge chips is fairly easy today for a small team with the right experience (<<$100M) ● Designing boards is easier than designing chips but not trivial ● One bad choice can break a startup (99/100 is a failing grade) ● Never push the tools, process, or capabilities of your vendor ● Architects should be vertically integrated ● Symmetry is a very powerful design principle ● Your product is only strong as your weakest vendor partner ● Tiled architectures are MAGICAL 38
  • 39.
    Some Software Lessons ● Customerstake the path of least resistance ● Developers take the path of least resistance ● Surprisingly hard to find applications that need more performance.... ...and those applications are really complex.. ..meaning they have tons of legacy stack software... ● SW trumps energy efficiency in 99.99% of applications ● Parallel programming still very much a niche 39
  • 40.
    MY FINAL LESSON: (tookme 7 years to learn) IT'S THE SOFTWARE STUPID!!!! 40
  • 41.
    ..so we keeppushing the ball up hill The Parallel Architectures Library ● A new “standard library” for parallel ● Compact C library with optimized routines for vector math, synchronization, and multi-processor communication. ● Designed to be portable across multiple ISAs ● Open source (apache 2.0 permissive license) ● Open invitation to participate!! ● https://github.com/parallella/pal 41
  • 42.
    What I (still)Know to be True... ● The world will continue to be driven by $$ ● Moore's law WILL come to an end ● Parallel computing is inevitable ● Architectures like Epiphany are the future ● CPUs, FPGAs, and manycore will coexist ● Never doubt the ingenuity of an engineer ● It's a dirty job but someone has to do it... 4242