SlideShare a Scribd company logo
1 of 42
Download to read offline
What I Learned Building a
Parallel Processor from Scratch
ISPASS, Philadelphia
March 31, 2015
1
My 2007 “Epiphany”
It's more efficient to design a pipeline of
specialized processors than having a single massive
general purpose processor that runs a complete
application.
DUH!
2
My Background
8 years designing DSPs
100 people, $100M R&D
TigerSHARC was a tech success,
but a huge financial flop!
Mixed signal SOCs for CCDs
Invented new ISA in 2 weeks
3-4 designers per derivative
Huge success (>$100M)
+
3
My 2008 Processor Design Goals
●
ANSI-C programmable
●
Floating point (industrial + medical markets)
●
Scalabale
●
10X boost in GFLOPS/Watt (otherwise, who would buy?)
●
Simple enough to be built by < 5 engineers
4
The Epiphany Multicore Architecture
RISC SRAM
NOC
Data
Mover
(3,3)
(0,0)
● ANSI-C Programmable
● IEEE 32bit floating point
● Nothing shared
● Distributed memory model
● Scalabale to “infinity”
● 50 GFLOPS/W
RISC
(1GHZ)
SRAM
NOC
Data
Mover
5
20/20 Decision Report Card
6
✔
NO HW cache
✔
NO HW coherence
✔
Tiny custom RISC ISA
✔
Floating point
✔
2D Mesh
✔
C-programmability
✔
Distributed memory
✔
64-bit LDR/STR
✔
64 registers
✔
4 bank local memory
✔
5+3 dual issue pipeline
✔
Single flit NOC
✔
Byte addressability
✔
HW loops
✔
Autoincrement LDR/STR
✔
Slow reads
✔
Fetch from external
✔
NO 64bit FPU
✔
NO 64bit addressing
✔
32KB of memory
✔
NO native integer mult
✔
NO DRAM interface
My Development Process
●
One person owns the whole path from arch to layout
●
Spreadsheet (triangulating architectures)
●
Emacs/VI (does code look clean on paper?)
●
Manually “count” gates in code. “Feel complexity”
●
Verilog Simulation (checks for logical/mental mistakes)
●
Synthesize design (in FPGA or EDA to measure cost!!)
●
Use advisors (peer review to look for gotchas)
●
Application development (No but I should have!!)
(*considering the lack of benchmarks, it's a small miracle
Epiphany works as well as it does)7
Step 1: Study History - 3 months
My inspiration of what to do (and NOT to do)
●
Henessy and Patterson's classic books. (Re)Read them!
●
Blackfin, TigerSHARC, and c64x (DSP features)
●
ARM and MIPS (standard features)
●
Tilera and Ambric (manycore features)
GOTCHA: Watch out for “clever” features! The really
useful features will be common to all architectures.
8
Step 2: Create a baseline - 3 months
●
How much area goes to computation?
●
How expensive is SRAM and cache?
●
What is the $ cost of IP and design effort?
●
How expensive is the network?
●
What is the performance gain of instruction X in app Y?
●
How often is instruction X used?
●
What does a “typical” application look like
These numbers are your anchor, get them right!
9
Step 3: Verify assumptions - 6 months
●
Design and simulate (to verify cost assumptions)
●
Build prototypes (to verify simulation assumptions)
●
Talk to customers (to verify product feature assumptions)
10
Epiphany (-1)
●
May 2008
●
Naive but close...
●
Could have had 64
cores in 2008!
●
Epiphany arch has
only changed by
“delta” since 2008
●
Submitted many
grant proposals,
papers (all rejected)
11
Epiphany-0
●
Oct 2008 synthesis and layout prototype
●
Virtual prototypes (synthesis and layout of cores)
●
Showed that 1 Ghz operation and 50GFLOPS/W was feasible
●
Gave me the courage to ask friends and family for money ($200K)
12
Epiphany-I
●
Jun 2009 tapeout
●
16 cores, 65nm prototype
●
~1 person, 12 weeks
●
Changes 7 days after TO!
●
Barely worked but proved
it was possible
●
~10mm^2
13
Raised capital (Nov 2009)
14
Series-A
●
$1.5M from BittWare, a
DSP board company.
(a TigerSHARC customer)
●
They invested because
their customers were
interested
●
Complicated deal
●
6 traunches x $250K
Building a Team (Dec, 2009)
Oleg Raikhman
DV/Tools
Roman Trogan
Design/Layout
Andreas Olofsson
Arch/Design
Lesson: Team makeup is the #1 determinant for success. You
must have it in place before you set out on journey.
15
Epiphany-II
●
May 2010 tapeout
●
16 cores, 65nm
●
~3 person, ~6 weeks
●
Changes 1 day before TO
●
A vendor screwed up!
●
Should have been a
product!
●
~12mm^2
16
It Works!
●
Sept 2010 bringup
●
Ran floating point
program!
●
Proved efficiency!
●
Packaging yield was
less than 10% on
eLink (vendor!!).
●
SPI debug channel was
key to debugging
vendor packaging
issue17
Epiphany-III
●
Dec 2010 tapeout
●
16 cores, 65nm
●
This is a product!
●
Kept improving with
every iteration
●
RTL changes 1 day
before TO
●
~25 GFLOPS/W
●
~12mm^2
18
New Packaging
●
60um staggered pitch is
not trivial!
●
Vendor failure on E1
and E2 almost broke
company!
●
New vendor (Kyocera)
●
You can't be on the
bleeding edge without
the right partners!
●
Better results!
19
First Delivery
●
May 2011
●
Chip worked
perfectly out of the
box!
●
This silicon became
the E16G301
●
Felt at peace
●
Happy moment, but
only the beginning...
20
Embedded
Systems
Conference
●
May, 2011
●
Launched Epiphany-III
●
Lots of press!
●
Competitors curious
●
But where were the
customers!!??
●
Flew home depressed
and pissed off...21
First sale!
●
June 2011
●
A $5,000 pitiful eval kit
●
To a big competitor...
●
They never even turned
it on...
●
...which was probably a
good thing.
22
Epiphany-IV
●
Aug 2011 tapeout
●
(Jul 2012 samples)
●
64 cores, 28nm
●
50 GFLOPS/W
●
RTL changes 2 days
before TO
●
3 engineers, 12 weeks!
23
First sale!
●
June 2011
●
A $5,000 pitiful eval kit
●
To a big competitor...
●
They never even turned
it on...
●
...which was probably a
good thing.
24
First Board
Product
●
May, 2012 ship
●
FMC board from
Bittware
●
4 chip array
●
Expensive and
not a solution
●
Niche market
●
Never shipped in
volume25
Software
●
2012 focus
●
Invested heavily in
software tools +
demos
●
Face recognition,
FFT, Matmul
●
Beat ARM by 10X
on demo for BIG
smartphone
company!
●
Response: “Meh...”26
Parallella
●
Sept, 2012
●
Running out of
money...
●
We needed a
parallel community
and a better kit, so
the $99 Parallella
was born.
●
KS combines pre-
purchase,
community, and
marketing.27
The Parallella Computer
● An open parallel computer
● Launched in 2012 at $99
● Open source SW/HW!
● Dual-core ARM A9 processor
● FPGA logic
● 1GB RAM, USB, HDMI, GigE
● 16/64 Epiphany coprocessors
● 50 Gbit/sec IO, 25/100
GFLOPS
● <5 Watts
● 10,000 shipped (20,000 built)
● 200 Universities
OPTIMISTIC & STUBBORN
29
Naive
Optimism
Birthday
Bottomed Out
Pure
Stubborness
Relaunch
Success?
●
Nov 2012
●
Woke up
next
morning
stressed and
behind
schedule.
●
Only sell
what you
have!!
30
Powerup (Gen0)
●
May 2013
●
My first board and it
worked!
●
Only minor issues
●
Challenge was super
aggressive deisgn
targets (credit card,
features, cost,
schedule)
31
Gen0 Release
●
July 2013
●
Gen0 board
released!
●
We build working
cluster with 42
boards!
●
Sent out 50 boards
to KS backers, only
~1 got significant
use!
●
Why?32
Chips are back!
●
Aug 2013
●
Full mask tapeout
●
New Tier-1 package
vendor (ASE).
●
~90% yield!
●
Good thermals
●
How to test 10,000
chips with $0 NRE?
33
New Investment...
●
Dec 20, 2013
●
$3.6M from Ericsson + VC
●
Saved the company...but
came in too late to save
eng team...
●
...and we were still buried
in KS comittments and
logistics issues...
●
5,000 customers waiting
for us to deliver
34
2014 Distractions: Manufacturig, logistics, T-
shirts, cases, logistics, power supplies,
accessories...what is Adapteva about again?
35
Shipping HW
●
June 2014
●
Shipped to 200
Universities & 10,000
developers
●
Built the “A1”, the
world's densest
cluster and launched
at ISC14
●
Making progress...
36
Now what?
Faulty cores
Cooperating
Program
(messages)
NODE:
64KB
1GHz RISC Core
2 GFLOPS/core
Physical
Constraints:
1.5ns/hop latency
10pJ / FLOP
30pJ / off chip
read/write
10pJ / on chip
end2end
(give or take....)
Some Key Hardware Lessons
●
Designing leading edge chips is fairly easy today for a small
team with the right experience (<<$100M)
●
Designing boards is easier than designing chips but not trivial
●
One bad choice can break a startup (99/100 is a failing grade)
●
Never push the tools, process, or capabilities of your vendor
●
Architects should be vertically integrated
●
Symmetry is a very powerful design principle
●
Your product is only strong as your weakest vendor partner
●
Tiled architectures are MAGICAL
38
Some Software Lessons
●
Customers take the path of least resistance
●
Developers take the path of least resistance
●
Surprisingly hard to find applications that need more
performance....
...and those applications are really complex..
..meaning they have tons of legacy stack software...
●
SW trumps energy efficiency in 99.99% of applications
●
Parallel programming still very much a niche
39
MY FINAL LESSON:
(took me 7 years to learn)
IT'S THE SOFTWARE
STUPID!!!!
40
..so we keep pushing the ball up hill
The Parallel Architectures Library
●
A new “standard library” for parallel
●
Compact C library with optimized routines for vector math,
synchronization, and multi-processor communication.
●
Designed to be portable across multiple ISAs
●
Open source (apache 2.0 permissive license)
●
Open invitation to participate!!
●
https://github.com/parallella/pal
41
What I (still) Know to be True...
●
The world will continue to be driven by $$
●
Moore's law WILL come to an end
●
Parallel computing is inevitable
●
Architectures like Epiphany are the future
●
CPUs, FPGAs, and manycore will coexist
●
Never doubt the ingenuity of an engineer
●
It's a dirty job but someone has to do it...
4242

More Related Content

What's hot

VLSI Systems & Design
VLSI Systems & DesignVLSI Systems & Design
VLSI Systems & Design
Aakash Mishra
 

What's hot (20)

Linux on Open Source Hardware
Linux on Open Source HardwareLinux on Open Source Hardware
Linux on Open Source Hardware
 
Dileep Random Access Talk at salishan 2016
Dileep Random Access Talk at salishan 2016Dileep Random Access Talk at salishan 2016
Dileep Random Access Talk at salishan 2016
 
20150528 group presentation jing
20150528 group presentation jing20150528 group presentation jing
20150528 group presentation jing
 
Appsterdam talk - about the chips inside your phone
Appsterdam talk - about the chips inside your phoneAppsterdam talk - about the chips inside your phone
Appsterdam talk - about the chips inside your phone
 
IS-4082, Real-Time insight in Big Data – Even faster using HSA, by Norbert He...
IS-4082, Real-Time insight in Big Data – Even faster using HSA, by Norbert He...IS-4082, Real-Time insight in Big Data – Even faster using HSA, by Norbert He...
IS-4082, Real-Time insight in Big Data – Even faster using HSA, by Norbert He...
 
EDA
EDAEDA
EDA
 
VLSI Systems & Design
VLSI Systems & DesignVLSI Systems & Design
VLSI Systems & Design
 
Introduction to EDA Tools
Introduction to EDA ToolsIntroduction to EDA Tools
Introduction to EDA Tools
 
Artificial intelligence on the Edge
Artificial intelligence on the EdgeArtificial intelligence on the Edge
Artificial intelligence on the Edge
 
FPGA/Reconfigurable computing (HPRC)
FPGA/Reconfigurable computing (HPRC)FPGA/Reconfigurable computing (HPRC)
FPGA/Reconfigurable computing (HPRC)
 
E3MV - Embedded Vision - Sundance
E3MV - Embedded Vision - SundanceE3MV - Embedded Vision - Sundance
E3MV - Embedded Vision - Sundance
 
SGI HPC Update for June 2013
SGI HPC Update for June 2013SGI HPC Update for June 2013
SGI HPC Update for June 2013
 
From Breadboard to Finished Product
From Breadboard to Finished ProductFrom Breadboard to Finished Product
From Breadboard to Finished Product
 
Ceph Day Beijing- Ceph Community Update
Ceph Day Beijing- Ceph Community UpdateCeph Day Beijing- Ceph Community Update
Ceph Day Beijing- Ceph Community Update
 
Keynote (Tony King-Smith) - Silicon? Check. HSA? Check. All done? Wrong! - by...
Keynote (Tony King-Smith) - Silicon? Check. HSA? Check. All done? Wrong! - by...Keynote (Tony King-Smith) - Silicon? Check. HSA? Check. All done? Wrong! - by...
Keynote (Tony King-Smith) - Silicon? Check. HSA? Check. All done? Wrong! - by...
 
OpenCL & the Future of Desktop High Performance Computing in CAD
OpenCL & the Future of Desktop High Performance Computing in CADOpenCL & the Future of Desktop High Performance Computing in CAD
OpenCL & the Future of Desktop High Performance Computing in CAD
 
vlsi design summer training ppt
vlsi design summer training pptvlsi design summer training ppt
vlsi design summer training ppt
 
Chips&toys
Chips&toysChips&toys
Chips&toys
 
MM-4104, Smart Sharpen using OpenCL in Adobe Photoshop CC – Challenges and Ac...
MM-4104, Smart Sharpen using OpenCL in Adobe Photoshop CC – Challenges and Ac...MM-4104, Smart Sharpen using OpenCL in Adobe Photoshop CC – Challenges and Ac...
MM-4104, Smart Sharpen using OpenCL in Adobe Photoshop CC – Challenges and Ac...
 
How to Choose Mobile Workstation? VR Ready
How to Choose Mobile Workstation? VR ReadyHow to Choose Mobile Workstation? VR Ready
How to Choose Mobile Workstation? VR Ready
 

Similar to What I learned building a parallel processor from scratch

IoT simple with the ESP8266 - presented at the July 2015 Austin IoT Hardware ...
IoT simple with the ESP8266 - presented at the July 2015 Austin IoT Hardware ...IoT simple with the ESP8266 - presented at the July 2015 Austin IoT Hardware ...
IoT simple with the ESP8266 - presented at the July 2015 Austin IoT Hardware ...
David Fowler
 

Similar to What I learned building a parallel processor from scratch (20)

Build an Open Hardware GNU/Linux PowerPC Notebook
Build an Open Hardware GNU/Linux PowerPC NotebookBuild an Open Hardware GNU/Linux PowerPC Notebook
Build an Open Hardware GNU/Linux PowerPC Notebook
 
Electronic CAD Tool Options for Schematic and PCB work
Electronic CAD Tool Options for Schematic and PCB workElectronic CAD Tool Options for Schematic and PCB work
Electronic CAD Tool Options for Schematic and PCB work
 
Artificial Intelligence in practice - Gerbert Kaandorp - Codemotion Amsterdam...
Artificial Intelligence in practice - Gerbert Kaandorp - Codemotion Amsterdam...Artificial Intelligence in practice - Gerbert Kaandorp - Codemotion Amsterdam...
Artificial Intelligence in practice - Gerbert Kaandorp - Codemotion Amsterdam...
 
My talk at Linux Piter 2015
My talk at Linux Piter 2015My talk at Linux Piter 2015
My talk at Linux Piter 2015
 
SFScon18 - Roberto Innocenti - Open Hardware PowerPC Notebook disclose the mo...
SFScon18 - Roberto Innocenti - Open Hardware PowerPC Notebook disclose the mo...SFScon18 - Roberto Innocenti - Open Hardware PowerPC Notebook disclose the mo...
SFScon18 - Roberto Innocenti - Open Hardware PowerPC Notebook disclose the mo...
 
Advanced Video Production with FOSS
Advanced Video Production with FOSSAdvanced Video Production with FOSS
Advanced Video Production with FOSS
 
Deep Learning on Everyday Devices
Deep Learning on Everyday DevicesDeep Learning on Everyday Devices
Deep Learning on Everyday Devices
 
Introduction to Internet of Things Hardware
Introduction to Internet of Things HardwareIntroduction to Internet of Things Hardware
Introduction to Internet of Things Hardware
 
Small Electronics for Your Makerspace (CLC Trendspotting - September 2014)
Small Electronics for Your Makerspace (CLC Trendspotting - September 2014)Small Electronics for Your Makerspace (CLC Trendspotting - September 2014)
Small Electronics for Your Makerspace (CLC Trendspotting - September 2014)
 
BKK16-100K1 George Grey, Linaro CEO Opening Keynote
BKK16-100K1 George Grey, Linaro CEO Opening KeynoteBKK16-100K1 George Grey, Linaro CEO Opening Keynote
BKK16-100K1 George Grey, Linaro CEO Opening Keynote
 
Hybrid Cloud: The Engine for Digital Transformation
Hybrid Cloud: The Engine for Digital TransformationHybrid Cloud: The Engine for Digital Transformation
Hybrid Cloud: The Engine for Digital Transformation
 
IoT simple with the ESP8266 - presented at the July 2015 Austin IoT Hardware ...
IoT simple with the ESP8266 - presented at the July 2015 Austin IoT Hardware ...IoT simple with the ESP8266 - presented at the July 2015 Austin IoT Hardware ...
IoT simple with the ESP8266 - presented at the July 2015 Austin IoT Hardware ...
 
Arduino Hands-on Workshop
Arduino Hands-on WorkshopArduino Hands-on Workshop
Arduino Hands-on Workshop
 
Fpga computing
Fpga computingFpga computing
Fpga computing
 
digitaldesign-s20-lecture3b-fpga-afterlecture.pdf
digitaldesign-s20-lecture3b-fpga-afterlecture.pdfdigitaldesign-s20-lecture3b-fpga-afterlecture.pdf
digitaldesign-s20-lecture3b-fpga-afterlecture.pdf
 
Glimworm 21 11-13 (1)
Glimworm 21 11-13 (1)Glimworm 21 11-13 (1)
Glimworm 21 11-13 (1)
 
Linux day 2015 presentation of Open Hardware Source PowerPC Notebook
Linux day 2015 presentation of Open Hardware Source PowerPC NotebookLinux day 2015 presentation of Open Hardware Source PowerPC Notebook
Linux day 2015 presentation of Open Hardware Source PowerPC Notebook
 
Prepare yourself to switch computing to Open Hardware Power Architecture
Prepare yourself to switch computing to Open Hardware Power ArchitecturePrepare yourself to switch computing to Open Hardware Power Architecture
Prepare yourself to switch computing to Open Hardware Power Architecture
 
LAS16 109 - The status quo and the future of 96Boards
LAS16 109 - The status quo and the future of 96BoardsLAS16 109 - The status quo and the future of 96Boards
LAS16 109 - The status quo and the future of 96Boards
 
LAS16-109: LAS16-109: The status quo and the future of 96Boards
LAS16-109: LAS16-109: The status quo and the future of 96BoardsLAS16-109: LAS16-109: The status quo and the future of 96Boards
LAS16-109: LAS16-109: The status quo and the future of 96Boards
 

Recently uploaded

+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 

Recently uploaded (20)

Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot ModelNavi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu SubbuApidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 

What I learned building a parallel processor from scratch

  • 1. What I Learned Building a Parallel Processor from Scratch ISPASS, Philadelphia March 31, 2015 1
  • 2. My 2007 “Epiphany” It's more efficient to design a pipeline of specialized processors than having a single massive general purpose processor that runs a complete application. DUH! 2
  • 3. My Background 8 years designing DSPs 100 people, $100M R&D TigerSHARC was a tech success, but a huge financial flop! Mixed signal SOCs for CCDs Invented new ISA in 2 weeks 3-4 designers per derivative Huge success (>$100M) + 3
  • 4. My 2008 Processor Design Goals ● ANSI-C programmable ● Floating point (industrial + medical markets) ● Scalabale ● 10X boost in GFLOPS/Watt (otherwise, who would buy?) ● Simple enough to be built by < 5 engineers 4
  • 5. The Epiphany Multicore Architecture RISC SRAM NOC Data Mover (3,3) (0,0) ● ANSI-C Programmable ● IEEE 32bit floating point ● Nothing shared ● Distributed memory model ● Scalabale to “infinity” ● 50 GFLOPS/W RISC (1GHZ) SRAM NOC Data Mover 5
  • 6. 20/20 Decision Report Card 6 ✔ NO HW cache ✔ NO HW coherence ✔ Tiny custom RISC ISA ✔ Floating point ✔ 2D Mesh ✔ C-programmability ✔ Distributed memory ✔ 64-bit LDR/STR ✔ 64 registers ✔ 4 bank local memory ✔ 5+3 dual issue pipeline ✔ Single flit NOC ✔ Byte addressability ✔ HW loops ✔ Autoincrement LDR/STR ✔ Slow reads ✔ Fetch from external ✔ NO 64bit FPU ✔ NO 64bit addressing ✔ 32KB of memory ✔ NO native integer mult ✔ NO DRAM interface
  • 7. My Development Process ● One person owns the whole path from arch to layout ● Spreadsheet (triangulating architectures) ● Emacs/VI (does code look clean on paper?) ● Manually “count” gates in code. “Feel complexity” ● Verilog Simulation (checks for logical/mental mistakes) ● Synthesize design (in FPGA or EDA to measure cost!!) ● Use advisors (peer review to look for gotchas) ● Application development (No but I should have!!) (*considering the lack of benchmarks, it's a small miracle Epiphany works as well as it does)7
  • 8. Step 1: Study History - 3 months My inspiration of what to do (and NOT to do) ● Henessy and Patterson's classic books. (Re)Read them! ● Blackfin, TigerSHARC, and c64x (DSP features) ● ARM and MIPS (standard features) ● Tilera and Ambric (manycore features) GOTCHA: Watch out for “clever” features! The really useful features will be common to all architectures. 8
  • 9. Step 2: Create a baseline - 3 months ● How much area goes to computation? ● How expensive is SRAM and cache? ● What is the $ cost of IP and design effort? ● How expensive is the network? ● What is the performance gain of instruction X in app Y? ● How often is instruction X used? ● What does a “typical” application look like These numbers are your anchor, get them right! 9
  • 10. Step 3: Verify assumptions - 6 months ● Design and simulate (to verify cost assumptions) ● Build prototypes (to verify simulation assumptions) ● Talk to customers (to verify product feature assumptions) 10
  • 11. Epiphany (-1) ● May 2008 ● Naive but close... ● Could have had 64 cores in 2008! ● Epiphany arch has only changed by “delta” since 2008 ● Submitted many grant proposals, papers (all rejected) 11
  • 12. Epiphany-0 ● Oct 2008 synthesis and layout prototype ● Virtual prototypes (synthesis and layout of cores) ● Showed that 1 Ghz operation and 50GFLOPS/W was feasible ● Gave me the courage to ask friends and family for money ($200K) 12
  • 13. Epiphany-I ● Jun 2009 tapeout ● 16 cores, 65nm prototype ● ~1 person, 12 weeks ● Changes 7 days after TO! ● Barely worked but proved it was possible ● ~10mm^2 13
  • 14. Raised capital (Nov 2009) 14 Series-A ● $1.5M from BittWare, a DSP board company. (a TigerSHARC customer) ● They invested because their customers were interested ● Complicated deal ● 6 traunches x $250K
  • 15. Building a Team (Dec, 2009) Oleg Raikhman DV/Tools Roman Trogan Design/Layout Andreas Olofsson Arch/Design Lesson: Team makeup is the #1 determinant for success. You must have it in place before you set out on journey. 15
  • 16. Epiphany-II ● May 2010 tapeout ● 16 cores, 65nm ● ~3 person, ~6 weeks ● Changes 1 day before TO ● A vendor screwed up! ● Should have been a product! ● ~12mm^2 16
  • 17. It Works! ● Sept 2010 bringup ● Ran floating point program! ● Proved efficiency! ● Packaging yield was less than 10% on eLink (vendor!!). ● SPI debug channel was key to debugging vendor packaging issue17
  • 18. Epiphany-III ● Dec 2010 tapeout ● 16 cores, 65nm ● This is a product! ● Kept improving with every iteration ● RTL changes 1 day before TO ● ~25 GFLOPS/W ● ~12mm^2 18
  • 19. New Packaging ● 60um staggered pitch is not trivial! ● Vendor failure on E1 and E2 almost broke company! ● New vendor (Kyocera) ● You can't be on the bleeding edge without the right partners! ● Better results! 19
  • 20. First Delivery ● May 2011 ● Chip worked perfectly out of the box! ● This silicon became the E16G301 ● Felt at peace ● Happy moment, but only the beginning... 20
  • 21. Embedded Systems Conference ● May, 2011 ● Launched Epiphany-III ● Lots of press! ● Competitors curious ● But where were the customers!!?? ● Flew home depressed and pissed off...21
  • 22. First sale! ● June 2011 ● A $5,000 pitiful eval kit ● To a big competitor... ● They never even turned it on... ● ...which was probably a good thing. 22
  • 23. Epiphany-IV ● Aug 2011 tapeout ● (Jul 2012 samples) ● 64 cores, 28nm ● 50 GFLOPS/W ● RTL changes 2 days before TO ● 3 engineers, 12 weeks! 23
  • 24. First sale! ● June 2011 ● A $5,000 pitiful eval kit ● To a big competitor... ● They never even turned it on... ● ...which was probably a good thing. 24
  • 25. First Board Product ● May, 2012 ship ● FMC board from Bittware ● 4 chip array ● Expensive and not a solution ● Niche market ● Never shipped in volume25
  • 26. Software ● 2012 focus ● Invested heavily in software tools + demos ● Face recognition, FFT, Matmul ● Beat ARM by 10X on demo for BIG smartphone company! ● Response: “Meh...”26
  • 27. Parallella ● Sept, 2012 ● Running out of money... ● We needed a parallel community and a better kit, so the $99 Parallella was born. ● KS combines pre- purchase, community, and marketing.27
  • 28. The Parallella Computer ● An open parallel computer ● Launched in 2012 at $99 ● Open source SW/HW! ● Dual-core ARM A9 processor ● FPGA logic ● 1GB RAM, USB, HDMI, GigE ● 16/64 Epiphany coprocessors ● 50 Gbit/sec IO, 25/100 GFLOPS ● <5 Watts ● 10,000 shipped (20,000 built) ● 200 Universities
  • 30. Success? ● Nov 2012 ● Woke up next morning stressed and behind schedule. ● Only sell what you have!! 30
  • 31. Powerup (Gen0) ● May 2013 ● My first board and it worked! ● Only minor issues ● Challenge was super aggressive deisgn targets (credit card, features, cost, schedule) 31
  • 32. Gen0 Release ● July 2013 ● Gen0 board released! ● We build working cluster with 42 boards! ● Sent out 50 boards to KS backers, only ~1 got significant use! ● Why?32
  • 33. Chips are back! ● Aug 2013 ● Full mask tapeout ● New Tier-1 package vendor (ASE). ● ~90% yield! ● Good thermals ● How to test 10,000 chips with $0 NRE? 33
  • 34. New Investment... ● Dec 20, 2013 ● $3.6M from Ericsson + VC ● Saved the company...but came in too late to save eng team... ● ...and we were still buried in KS comittments and logistics issues... ● 5,000 customers waiting for us to deliver 34
  • 35. 2014 Distractions: Manufacturig, logistics, T- shirts, cases, logistics, power supplies, accessories...what is Adapteva about again? 35
  • 36. Shipping HW ● June 2014 ● Shipped to 200 Universities & 10,000 developers ● Built the “A1”, the world's densest cluster and launched at ISC14 ● Making progress... 36
  • 37. Now what? Faulty cores Cooperating Program (messages) NODE: 64KB 1GHz RISC Core 2 GFLOPS/core Physical Constraints: 1.5ns/hop latency 10pJ / FLOP 30pJ / off chip read/write 10pJ / on chip end2end (give or take....)
  • 38. Some Key Hardware Lessons ● Designing leading edge chips is fairly easy today for a small team with the right experience (<<$100M) ● Designing boards is easier than designing chips but not trivial ● One bad choice can break a startup (99/100 is a failing grade) ● Never push the tools, process, or capabilities of your vendor ● Architects should be vertically integrated ● Symmetry is a very powerful design principle ● Your product is only strong as your weakest vendor partner ● Tiled architectures are MAGICAL 38
  • 39. Some Software Lessons ● Customers take the path of least resistance ● Developers take the path of least resistance ● Surprisingly hard to find applications that need more performance.... ...and those applications are really complex.. ..meaning they have tons of legacy stack software... ● SW trumps energy efficiency in 99.99% of applications ● Parallel programming still very much a niche 39
  • 40. MY FINAL LESSON: (took me 7 years to learn) IT'S THE SOFTWARE STUPID!!!! 40
  • 41. ..so we keep pushing the ball up hill The Parallel Architectures Library ● A new “standard library” for parallel ● Compact C library with optimized routines for vector math, synchronization, and multi-processor communication. ● Designed to be portable across multiple ISAs ● Open source (apache 2.0 permissive license) ● Open invitation to participate!! ● https://github.com/parallella/pal 41
  • 42. What I (still) Know to be True... ● The world will continue to be driven by $$ ● Moore's law WILL come to an end ● Parallel computing is inevitable ● Architectures like Epiphany are the future ● CPUs, FPGAs, and manycore will coexist ● Never doubt the ingenuity of an engineer ● It's a dirty job but someone has to do it... 4242