FPL'2014 - FlexTiles Workshop - 6 - FlexTiles Embedded FPGA Accelerators

www.flextiles.eu
FlexTiles
Runtime Mapping of Hardware Accelerators on the Embedded FPGA Layer
FPL’14, FlexTiles Workshop September 1st 2014
Olivier SENTIEYS★, Christophe HURIAUX, Antoine COURTAY  University of Rennes 1
★ Inria

The information contained in this document and any attachments are the property of FlexTiles consortium. You are hereby notified that any review, dissemination, distribution, copying or otherwise use of this document must be done in accordance with the CA of the project (TRT/DJ/624412785.2011). Template version 1.0
University of Rennes 1 – FPL’14 FlexTiles Workshop
The Multicore Era is Hitting the Utilization Wall
The multicore era has been a reality since 2005-2008, but what comes next?
Energy efficiency is not scaling along with integration capacity
Transistor and power budgets no longer balanced
Quantity                 Classical scaling    Leakage-limited scaling
Device count             S^2                  S^2
Device frequency         S                    S
Device power (cap)       1/S                  1/S
Device power (Vdd)       1/S^2                ~1
Utilization              1                    1/S^2
Power of core i: $P_i = a_i f_i C_i V_{dd,i}^2$
[Venkatesh et al., ASPLOS’10]
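The scaling factors above can be combined in a small sketch. It assumes that the relative power of a fully active chip is the product of the four device factors listed on the slide, and that utilization is the fraction of the chip that fits in an unchanged power budget. The function names are hypothetical and this is only an illustration of the argument, not the model used in the cited paper.

```python
def full_chip_power(s, leakage_limited):
    """Relative power of a chip switching entirely at full frequency,
    after scaling by a factor s (assumed to be the product of the
    per-device factors listed on the slide)."""
    device_count = s ** 2
    frequency = s
    power_cap = 1 / s                                    # capacitance term
    power_vdd = 1.0 if leakage_limited else 1 / s ** 2   # Vdd^2 term
    return device_count * frequency * power_cap * power_vdd


def utilization(s, leakage_limited):
    """Fraction of the chip usable at full frequency for a fixed power budget."""
    return min(1.0, 1 / full_chip_power(s, leakage_limited))


for s in (1.0, 1.4, 2.0):
    print(s, utilization(s, leakage_limited=False), utilization(s, leakage_limited=True))
```

With classical scaling the product stays at 1, so the whole chip can remain active; with leakage-limited scaling it grows as S^2, which gives the 1/S^2 utilization shown in the table.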

The Utilization Wall
With each successive process generation, the percentage of a chip that can switch at full frequency drops exponentially due to power constraints
8nm in 2018: best-case average speedup of 3.7x, i.e. 14% per year (highly parallel codes and optimal per-benchmark)
[Esmaeilzadeh et al., ISCA’11]
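The relation between the 3.7x total speedup and the 14% per year figure can be checked with a compound-growth sketch. The ten-year span used below is an assumption, since the slide does not state the starting year; the function names are hypothetical.

```python
import math


def annual_growth(total_speedup, years):
    """Constant yearly growth rate that compounds to total_speedup."""
    return total_speedup ** (1 / years) - 1


def years_needed(total_speedup, annual_rate):
    """Number of years for annual_rate to compound to total_speedup."""
    return math.log(total_speedup) / math.log(1 + annual_rate)


print(annual_growth(3.7, 10))     # yearly rate for an assumed 10-year span
print(years_needed(3.7, 0.14))    # span implied by 3.7x at 14% per year
```

Under that assumption the two numbers on the slide agree: 14% per year compounds to roughly 3.7x over about ten years.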

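Multicore and Dark Silicon
[Figure: speedup (axis 0 to 20) versus technology node (45nm, 32nm, 22nm, 16nm, 11nm, 8nm) for three curves: Historical Scaling, ITRS Scaling, Realistic Scaling; values marked on the plot: 18x, 7.9x, 3.7x]
[Doug Burger, HiPEAC’13]
[Figure: Dark Silicon share over time (2014, >2016, >2018); values shown: 47%, 36%, 71%, 51%, 62%, 40%, 17%, 1%]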

The Efficiency of Specialization
* Source: Ning Zhang and Bob Brodersen, ISSCC data
100-1000X gap in efficiency … but specialization comes with penalties in programmability
[Figure labels: ASICs, FPGAs]

Heterogeneous Multicores
Different cores on a single chip
GPPs, HW accelerators, memory, network-on-chip
Reconfigurable HW accelerators keep flexibility while increasing area and energy efficiency
Self-adapting devices
Dynamically adapt the hardware to the application and to changing environments
[Figure: chip made of Core tiles together with processor (Proc.), reconfigurable hardware (Reconf. HW), memory (Mem.) and hardware accelerator (HW Acc.) tiles]

Can 3D Stacking Help?
3D-Stacked Reconfigurable Accelerators
Improved bandwidth/latency between cores and accelerators
Improved resource usage
Improved performance and energy efficiency
[Figure: two stacked layers, a reconfigurable layer and a multicore layer made of Core tiles]

Outline
eFPGA Reconfigurable Fabric
General architecture overview
Expected features
Task migration in FPGA vs. task migration in eFPGA
Virtual Bit-Stream
Coping with Heterogeneous Blocks
Development Flow
Achievements & Conclusion

FlexTiles Architecture Overview
3D interface to the NoC
DSP blocks
Memory blocks

Expected Features of the Reconfigurable Layer
Main expected features
Low reconfiguration time (and power) overhead
Double-context configuration memory
Low complexity reconfiguration control
Easy resource sharing/distribution, simplified task migration
No predefined configuration domains
Bit-stream independent from task location
Smaller bit-stream size in configuration memory → Virtual Bit-Stream (VBS)

Task Allocation & Migration in an FPGA
Predefined reconfigurable regions
Bit-stream depends on task location
[Figure: FPGA surrounded by I/O blocks; HW Accelerator #1 is shown at two locations, with bit-stream BS #1 for one and BS #2 for the other]

Task Migration in eFPGA
[Figure: eFPGA fabric with 3D NI and RAM blocks; HW Accelerator #1 (BS #1) and HW Accelerator #2 (BS #2) placed in the fabric]

Outline
eFPGA Reconfigurable Fabric
Virtual Bit-Stream
Concept
Abstraction of routing details
Results
Coping with Heterogeneous Fabric
Development Flow
Achievements & Conclusion

Concept of Virtual Bit-Stream
A task is synthesized and placed & routed into a Virtual Bit-Stream (VBS)
- Hide some routing details which are architecture dependent
- Remove details coming from task physical location in the fabric
- No predefined configuration domains
Final bit-stream is generated at run time
- Resource sharing/distribution becomes easier, task migration is simplified
Quartus II

Interconnection Architecture
Hiding routing details
Full BS is 129 bits
Could be reduced by giving fewer details
[Figure: interconnection block with CLB pins CLBIN[0] to CLBIN[3] and CLBOUT, and routing nodes numbered 0 to 20]

Virtual Bit Stream
Hiding routing details
List of I/O and connections
20  8
1  9
5  18
4 5 6 7
12 13 14 15
0 1 2 3
8 9 10 11
16
17
18
19 20
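A rough size comparison shows why a list of connections can be smaller than the 129-bit full configuration. The sketch below assumes a naive encoding in which each endpoint is one of the 21 node numbers (0 to 20) stored in a fixed-width field; the actual VBS encoding is not described on these slides, and the function name is hypothetical.

```python
FULL_BS_BITS = 129  # size of the full bit-stream, taken from the slides


def connection_list_bits(connections, node_count):
    """Bits needed to store (source, destination) pairs with fixed-width node ids."""
    bits_per_node = (node_count - 1).bit_length()
    return len(connections) * 2 * bits_per_node


# The three connections listed on the slide.
connections = [(20, 8), (1, 9), (5, 18)]
vbs_bits = connection_list_bits(connections, node_count=21)
print(vbs_bits, "bits for the connection list versus", FULL_BS_BITS, "bits")
```

The gain in this naive encoding depends on how many connections are actually used, so it is only meant to illustrate the principle of giving fewer routing details.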

Results
VBS is independent of task location with a
smaller size than BS
[Figure: BS size and VBS size in kilo-bits (axis 0 to 1600) and compression ratio (axis 0% to 100%) per benchmark; benchmark labels as extracted: tseng, tseng, diffeq, diffeq, apex4, des, ex5p, misex3; compression ratios as extracted: 44.4%, 49.2%, 47.2%, 55.2%, 49.7%, 29.5%, 27.4%, 26.6%]
3-4 times smaller for large bit-streams
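The "3-4 times smaller" statement can be related to the percentages in the plot with a one-line conversion. The sketch assumes that the compression ratio is the VBS size divided by the BS size, which the slide does not state explicitly; the function name is hypothetical.

```python
def reduction_factor(compression_ratio):
    """How many times smaller the VBS is, assuming ratio = VBS size / BS size."""
    return 1 / compression_ratio


# The three lowest ratios shown in the plot.
for ratio in (0.295, 0.274, 0.266):
    print(f"{ratio:.1%} -> {reduction_factor(ratio):.2f}x smaller")
```

Under that assumption, the lowest ratios in the plot correspond to a VBS between 3 and 4 times smaller than the full bit-stream.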

eFPGA Architecture using VBS
Reconfiguration controller
On request from the GPP: can place, duplicate and migrate tasks
Finalizes the VBS
[Figure: reconfiguration controller connected to an external memory holding VBS 1, VBS 2, VBS 3 ... VBS N and to a buffer memory; data and control paths; steps 1 and 2]
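The idea of finalizing a location-independent VBS at run time can be sketched as follows. The data layout is purely hypothetical: each connection is stored relative to the task origin, and finalization adds the target position chosen by the controller. The real VBS format and controller are not described at this level on the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Connection:
    dx: int   # tile column, relative to the task origin (hypothetical field)
    dy: int   # tile row, relative to the task origin (hypothetical field)
    src: int  # routing node number inside the tile
    dst: int  # routing node number inside the tile


def finalize(vbs, origin_x, origin_y):
    """Turn relative connections into absolute (x, y, src, dst) entries."""
    return [(origin_x + c.dx, origin_y + c.dy, c.src, c.dst) for c in vbs]


vbs = [Connection(0, 0, 20, 8), Connection(0, 0, 1, 9), Connection(1, 0, 5, 18)]
placed = finalize(vbs, origin_x=2, origin_y=3)      # initial placement
migrated = finalize(vbs, origin_x=10, origin_y=3)   # same VBS, new location
print(placed)
print(migrated)
```

In this sketch, placing, duplicating and migrating a task all reuse the same stored VBS and differ only in the origin passed to the finalization step.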

Outline
eFPGA Reconfigurable Fabric
Virtual Bit-Stream
Coping with Heterogeneous Fabric
Heterogeneous Blocks
Task placement in a Homogeneous context
Task placement in a Heterogeneous context
Development Flow
Achievements & Conclusion

Heterogeneous Blocks
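Logic Elements
- Cluster of four 6-input LUTs
- 3309 mm2
Arithmetic Elements
- 18x18 multiplier, 48-bit adder/subtractor
- 4351 mm2
[Figure: logic element with CLBIN, CLBOUT and four LUTs; arithmetic element with inputs A and B, an adder/subtractor (+, -) and signal widths 18, 18, 36 and 48]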

Heterogeneous Blocks
Memories
- 1024 x 16-bit word SRAM
- 6570 mm2
3D TSV and Accelerator Interface
[Figure: Reconfiguration Controller, Reconfiguration RAM, and 3D / 3DNI interface blocks]
NoC Link (400 I/O)
Pitch   X    Y    Size X   Size Y   Area (mm²)
40      20   20   800      800      0.64
26.95mm²
Work In Progress
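The NoC link figures are consistent with a regular 20 x 20 array of connections. The sketch below recomputes them, assuming the pitch and sizes are in micrometers (the slide gives no unit for them, but this choice reproduces the 0.64 mm² area); the function name is hypothetical.

```python
def interface_footprint(pitch_um, count_x, count_y):
    """I/O count, side lengths (um) and area (mm^2) of a regular connection array."""
    size_x_um = pitch_um * count_x
    size_y_um = pitch_um * count_y
    area_mm2 = size_x_um * size_y_um / 1e6  # 1 mm^2 = 1e6 um^2
    return count_x * count_y, size_x_um, size_y_um, area_mm2


io_count, size_x, size_y, area = interface_footprint(pitch_um=40, count_x=20, count_y=20)
print(io_count, size_x, size_y, area)
```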

eFPGA Floorplan (heterogeneous)
Logic Block Arithmetic Accelerator Memories Accelerator Interface

Task Placement & Migration
Homogeneous case
No constraint on task placement
Regular routing architecture
Easy! (thanks to the Virtual Bit-Stream)
Cope with heterogeneity
RAM, DSP, 3D I/Os
Migration is limited:
vertically, to the same column
horizontally, to the next column containing the same complex blocks
Task
Configured LE
Logic Element (LE)
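The horizontal migration constraint can be illustrated with a small search over column types. The fabric description as a list of column types and the function name are assumptions made for this sketch; the slides do not describe how the controller actually searches for a destination.

```python
def next_compatible_column(column_types, start, width):
    """Return the next start column whose column types match the task footprint,
    or None if the fabric contains no such position to the right."""
    footprint = column_types[start:start + width]
    for candidate in range(start + 1, len(column_types) - width + 1):
        if column_types[candidate:candidate + width] == footprint:
            return candidate
    return None


# Hypothetical heterogeneous fabric: mostly logic elements (LE) with RAM and DSP columns.
fabric = ["LE", "LE", "RAM", "LE", "LE", "DSP", "LE", "LE", "RAM", "LE"]
print(next_compatible_column(fabric, start=1, width=3))       # task spans LE, RAM, LE
print(next_compatible_column(["LE"] * 5, start=0, width=2))   # homogeneous fabric
```

In the homogeneous case any shift is valid, so the very next column is returned; with complex blocks the task has to jump to the next position where the same block types line up.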

eFPGA: Handling of Complex Blocks
Heterogeneous blocks routing is abstracted from logic routing
Long lines allow a trade-off between placement flexibility and routing complexity
A two-level routing is performed at runtime:
Logic routing (as in the homogeneous case)
Heterogeneous block routing through long lines

eFPGA: Handling of Complex Blocks
Delay depends on final placement
Only worst-case delay can be estimated offline
Flexibility is still limited in the vertical axis
Multiple of block height
The length of long lines and the connections between long lines and routing resources should be limited
Area overhead, but slight delay penalty
(see our paper at FPL’14 on Wednesday)

Outline
eFPGA Reconfigurable Fabric
Virtual Bit-Stream
Coping with Heterogeneous Fabric
Development Flow
Achievements & Conclusion

Development Flow
Custom development flow from C to Virtual Bit-Stream
[Flow diagram: High-level task description -> High-level Synthesis -> RTL task description; HDL task description -> HDL Synthesis -> Flat logic netlist -> Technology mapping -> Mapped logic netlist -> Placer and Router -> Placement data and Routing data -> Bitstream generation -> Virtual bit-stream. An Arch. description and an Arch. netlist also appear in the diagram.]
- Integrated within the FlexTiles development flow
- Generates VBS from a C description or an HDL description

Development Flow
Relies on Catapult C from Calypto Design Systems
High-level synthesis from C to VHDL

Development Flow
Use the Verilog To Routing (VTR) academic tool flow to generate netlist and routing data from Verilog
[Flow diagram, VTR portion: RTL task description or HDL task description -> HDL Synthesis -> Flat logic netlist -> Technology mapping -> Mapped logic netlist -> Placer and Router -> Placement data and Routing data; the Arch. netlist and Arch. description also appear in the diagram]

Development Flow
A custom back-end generates the VBS from the data generated by VTR
The VBS can be loaded on the FlexTiles platform

Conclusions
Overall results and achievements
3-D stacked embedded FPGA coupled to a processor layer
Flexible resource allocation/sharing
Seamless task migration
Virtual Bit-Stream
VBS also reduces bitstream size
eFPGA Chip “Proof of Concept”
65nm CMOS
Homogeneous Fabric of LBs
I/O Ring (not 3D…)
External Reconfiguration Controller

Results
Thank you for your attention

Where do the energy savings come from?
MIPS baseline, 91 pJ/instr.: D-cache 6%, Datapath 38%, Reg. File 14%, Fetch/Decode 19%, I-cache 23%
Specialized core, 8 pJ/instr.: D-cache 6%, Datapath 3%, Energy Saved 91%
[Goulding et al., Hot Chips’10]
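The energy figures on this slide can be cross-checked with a short calculation. It assumes that the percentages of the specialized core are expressed relative to the 91 pJ baseline, which is how the 91% "Energy Saved" share reads; variable names are hypothetical.

```python
BASELINE_PJ = 91.0     # MIPS baseline, pJ per instruction (from the slide)
SPECIALIZED_PJ = 8.0   # specialized core, pJ per instruction (from the slide)

saved_fraction = 1 - SPECIALIZED_PJ / BASELINE_PJ

# Remaining shares of the specialized core, assumed relative to the baseline.
remaining_shares = {"D-cache": 0.06, "Datapath": 0.03}
remaining_pj = {name: share * BASELINE_PJ for name, share in remaining_shares.items()}

print(f"energy saved: {saved_fraction:.1%}")
print(remaining_pj, "total:", sum(remaining_pj.values()))
```

Under that reading, the remaining D-cache and datapath shares add up to about 8 pJ, which matches the energy quoted for the specialized core.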

Energy per operation: 45nm CMOS, 40nm V6 FPGA
HW operators (45nm)
32-bit addition: 0.5 pJ
16-bit multiply: 2.2 pJ
64-bit FPU: 50 pJ/op
40nm V6 FPGA
16/32-bit multiply and add: 114 pJ (DSP blocks), 170 pJ (LUT)
32-bit I/O access: 1.47 nJ
32-bit memory read: 660 pJ
32-bit register R/W: 1.12 pJ
Embedded RISC Processor (45nm)
32-bit register R/W: 0.33 pJ
32-bit cache R/W: 3.5 pJ
add instruction⋆⋆: 5.32 pJ
⋆⋆ add instruction (best case) = fetch, decode, read 2 operands from RF, execute, write back (into local reg. first, then copy into RF)
[Dally et al., Computer, 2010]
[Bonamy et al., 2013]

The Energy Cost of Data Movement
Fetching operands costs more than computing
Energy cost of cache coherence is huge!
[Figure: energy costs in 28nm CMOS: 64-bit DP operation 20 pJ; 256-bit buses with values 26 pJ, 256 pJ and 1 nJ; 256-bit access to an 8 kB SRAM 50 pJ; efficient off-chip link 500 pJ; DRAM Rd/Wr 16 nJ]
[Dally, IPDPS’11]
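The claim that fetching operands costs more than computing can be made concrete by normalizing each cost to the 64-bit DP operation. The pairing of labels and values below follows their adjacency in the figure and should be read as an assumption; the dictionary keys are hypothetical names.

```python
# Energy values in picojoules, as read from the figure (28nm CMOS).
ENERGY_PJ = {
    "64-bit DP operation": 20.0,
    "256-bit access, 8 kB SRAM": 50.0,
    "efficient off-chip link": 500.0,
    "DRAM Rd/Wr": 16_000.0,  # 16 nJ
}

compute_pj = ENERGY_PJ["64-bit DP operation"]
relative_cost = {name: pj / compute_pj for name, pj in ENERGY_PJ.items()}

for name, factor in relative_cost.items():
    print(f"{name}: {factor:g}x the cost of one 64-bit DP operation")
```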

Efficient Hardware Task Swapping
Hiding reconfiguration time with computing
Single-context memory
Double-context memory
eFPGA will use double-context memory
Gain in dynamic reconfiguration efficiency
At the cost of ~50% overhead
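[Figure: timelines of Task 1 and Task 2 with their configurations Cfg. 1 and Cfg. 2, for single-context and double-context memory; configuration cell built from a configuration bit (CB), FF, Latch, ConfClk and ConfEn]
CB: one configuration bit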
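The benefit of hiding reconfiguration behind computation can be sketched with a simple timeline model. It assumes that, with double-context memory, the configuration of the next task is loaded while the current task runs and that the context switch itself takes no time; the task durations are made-up numbers and the function names are hypothetical.

```python
def single_context_time(tasks):
    """Total time when each task must be fully configured before it runs.
    tasks is a list of (config_time, run_time) pairs."""
    return sum(cfg + run for cfg, run in tasks)


def double_context_time(tasks):
    """Total time when the next configuration is loaded during the current run."""
    if not tasks:
        return 0
    total = tasks[0][0]  # the first configuration cannot be hidden
    for i, (_, run) in enumerate(tasks):
        next_cfg = tasks[i + 1][0] if i + 1 < len(tasks) else 0
        total += max(run, next_cfg)  # the longer of the two activities dominates
    return total


tasks = [(2, 5), (2, 5)]  # (configuration time, execution time), arbitrary units
print(single_context_time(tasks), double_context_time(tasks))
```

When execution is longer than configuration, only the first configuration remains visible; when configuration is longer, the gain shrinks because the fabric waits for the next context to finish loading.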

eFPGA(V1) Architecture
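[Figure, two panels. Logic Block: LUT fed by CLBIN, FF, mux, CLBOUT, clk and rstb, configuration bits (CB), ScanIn and ScanOut. Switch Block: NORTH(i), SOUTH(i), EAST(i) and WEST(i) connections with configuration bits (CB), ScanIn and ScanOut]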

eFPGA Architecture
Interconnection Block
[Figure: interconnection block linking CLBIN[0] to CLBIN[3] and CLBOUT to tracks 0 to 3 on the NORTH, SOUTH, WEST and EAST sides]

eFPGA Architecture
eFPGA macro
[Figure: eFPGA macro at position (i,j) with CLB(i,j), SB(i,j), CHANX(i,j) and CHANY(i,j), its CLBIN[0] to CLBIN[3] and CLBOUT pins, and the neighbouring CLB, SB, CHANX and CHANY elements at (i-1,j), (i+1,j), (i,j-1) and (i,j+1)]

eFPGA Floorplan
