The document describes a proposed method for parallelizing 2D stencil computations across multiple FPGAs. The key points are:
1) The data set is divided into blocks and assigned to each FPGA, with boundary values communicated between neighbors.
2) Computation is performed in parallel by updating grid points in a specific order that increases the acceptable communication latency between FPGAs.
3) This method ensures a margin of about one iteration between communications, allowing latency to scale with problem size.
4) The architecture and implementation are described, including how the data subset is stored in block RAM memory and computed in parallel using multiply-add (MADD) units over 8 cycles per stencil point.
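As a rough single-process illustration of points 1 and 2 (block decomposition plus boundary exchange), the sketch below runs a 5-point Jacobi stencil over row blocks with one-cell halos. The grid size, block count, and the use of NumPy arrays to stand in for per-FPGA block RAM are all assumptions for the demo, not details taken from the paper.

```python
import numpy as np

# Sketch of 1-D block decomposition with halo exchange for a 5-point
# Jacobi stencil. Each "device" is just a NumPy array here, standing in
# for one FPGA's block RAM; halo rows model the boundary communication.

N, P = 16, 4                     # N x N grid split across P devices
rows = N // P                    # owned rows per device
rng = np.random.default_rng(0)
grid = rng.random((N, N))

# each block keeps one halo row above and below its owned rows
blocks = [np.vstack([np.zeros((1, N)),
                     grid[i * rows:(i + 1) * rows],
                     np.zeros((1, N))]) for i in range(P)]

def exchange_halos(blocks):
    """Copy each neighbor's boundary row into this block's halo row."""
    for i in range(P):
        if i > 0:
            blocks[i][0] = blocks[i - 1][-2]   # top halo <- upper neighbor's last owned row
        if i < P - 1:
            blocks[i][-1] = blocks[i + 1][1]   # bottom halo <- lower neighbor's first owned row

def step(b):
    """One Jacobi update: each interior point becomes the mean of its 4 neighbors."""
    out = b.copy()
    out[1:-1, 1:-1] = 0.25 * (b[:-2, 1:-1] + b[2:, 1:-1] +
                              b[1:-1, :-2] + b[1:-1, 2:])
    return out

for _ in range(10):
    exchange_halos(blocks)
    blocks = [step(b) for b in blocks]

result = np.vstack([b[1:-1] for b in blocks])   # drop halos, reassemble grid
print(result.shape)   # (16, 16)
```

In the paper's scheme the halo exchange overlaps with computation so latency can hide behind roughly one iteration of work; here the two phases simply alternate for clarity.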
Digital IC Applications JNTU Model Paper - guest3f9c6b
This document contains eight questions related to digital integrated circuits and applications. The questions cover topics such as CMOS and TTL gates, VHDL programming, counters, decoders, arithmetic circuits, memories and programmable logic devices. Students are instructed to answer any five of the eight questions, which can include circuit design problems, writing VHDL code using different styles, and analyzing and explaining the operation of digital components.
This document provides information about assembly language and data movement instructions for microprocessors. It discusses conventions for moving data between registers and memory using instructions like MOV, PUSH, and POP. It also covers related topics like the stack organization, segment overrides, logical and arithmetic operations, data types including signed and unsigned integers, and examples of simple assembly language programs. The document is presented as lecture slides with definitions, syntax examples, and illustrations to explain key concepts in assembly language programming for microprocessors.
This document summarizes four problems completed in a lab using a Tiva C Series microcontroller. Problem 1 introduced the microcontroller and Code Composer Studio software. Problem 2 described header files used in the code. Problem 3 explored adjusting clock speeds. Problem 4 provided an overview of the microcontroller platform, peripheral drivers library, and GPIO module. The code caused three LEDs to blink by writing pin values using delays timed based on clock speed.
The document proposes a new optimized design for a binary coded decimal (BCD) adder using reversible logic gates. It summarizes the basic definitions of reversible logic and describes commonly used reversible gates like CNOT, Toffoli, Peres, TR, and MTSG gates. It then presents the conventional design of a BCD adder and proposes a new design using MTSG gates that has lower quantum cost, fewer gates, and less delay compared to existing designs. The proposed 4-bit reversible BCD adder requires only 10 gates and has a quantum cost of 40.
This document presents the implementation of an optimized floating point adder on an FPGA that follows the IEEE 754-2008 standard for decimal floating point numbers. It uses a densely packed decimal encoding scheme. The design uses a low power equal bypass adder to reduce power consumption and delay. Testing showed the design has a maximum delay of 45ns and operates in a single clock cycle on a Virtex-5 FPGA. The optimized adder design can help reduce power usage in larger floating point arithmetic units.
The document discusses polymorphic heterogeneous multi-core systems as a solution to limitations in instruction-level parallelism (ILP) and thread-level parallelism (TLP) approaches for improving single-core performance. It proposes an architecture with cores that can dynamically reconfigure their internal structure and collaborate to best match software requirements. The cores are connected to a reconfigurable fabric that implements custom instructions to further speed up programs. Experimental results show this approach achieves speedups and better load balancing compared to homogeneous multi-core systems. Future work is needed to study overhead and implement dynamic scheduling.
ZVxPlus Presentation: Characterization of Nonlinear RF/HF Components in Time ... - NMDG NV
This document describes the ZVxPlus extension kit for Rohde & Schwarz ZVA and ZVT vector network analyzers. ZVxPlus enables these VNAs to perform large-signal network analysis, characterizing nonlinear RF/HF components in both the time and frequency domains. It provides a single connection for both small- and large-signal measurements of devices like diodes, transistors, and amplifiers. This allows for better, more complete nonlinear device characterization compared to traditional techniques.
Design and Implementation of Parallel Prefix Adders using FPGAs - IOSR Journals
Abstract: Adders are among the most frequently used blocks in VLSI designs. Digital design provides the half adder and full adder, and from these basic adders a ripple carry adder (RCA) can be built. An RCA can add operands of any width, but it is a serial adder and suffers from carry-propagation delay: as more half and full adders are chained, the delay grows accordingly. This motivates parallel prefix adders, among them the Kogge-Stone (KS) adder, sparse Kogge-Stone (SKS) adder, spanning tree adder, and Brent-Kung adder. These adders were designed and implemented on a Spartan-3E FPGA kit, simulated with ModelSim 6.4b, and synthesized with Xilinx ISE 10.1.
The document discusses the instruction set of the 8086 microprocessor. It is divided into 7 sections that cover: 1) data transfer instructions like MOV, IN, OUT, PUSH, and POP; 2) arithmetic/logical instructions; 3) branch instructions; 4) shift and rotate instructions; 5) string manipulation instructions; 6) flag manipulation and processor control instructions; and 7) machine control instructions. Examples are provided for each type of instruction to illustrate their operation and effect on registers or memory locations.
The document describes the instruction set of the 8086 microprocessor. It discusses the different types of instructions including data transfer instructions like MOV, PUSH, POP, XCHG, IN, OUT, and XLAT. It also covers addressing modes, instruction formats, and the various registers used by the 8086 microprocessor like the stack pointer and flag register. In total there are 14 different data transfer instructions described that are used to move data between registers, memory, ports, and the flag and stack pointers.
This document proposes a calibration technique for sigma-delta analog-to-digital converters (ΣΔADCs) that uses histogram test methods. The technique can calibrate errors in the flash subADC as well as other components, including the DAC and accumulator. It works by applying an analog signal with a known probability distribution to the converter input and recording the number of occurrences of digital output codes. Differences between the actual and expected output distributions are used to estimate linearity, gain, and offset errors, which can then be corrected. Simulation results show the technique improves the effective number of bits from 6.6 to 11.3 while correcting for large introduced errors, demonstrating its robustness.
This document contains the contents and program descriptions for various programs to be completed as part of a Microprocessor Lab course. There are 23 interfacing programs and 20 8085 microprocessor programs described, including programs to transfer data blocks with and without overlap, add/multiply/divide numbers, implement counters, check codes, and interface with keyboards, displays, and other peripherals.
The document discusses the instruction set of the 8085 microprocessor. It organizes instructions into categories: data transfer, arithmetic, logical, branching, and control instructions. The data transfer instructions include MOV, MVI, LDA, STA, etc. The arithmetic instructions perform operations like addition, subtraction, increment, and decrement. Some examples of instructions and their operations are provided.
Ex1: Assume AL = 85H and BL = 35H. Here are the steps:
1. MUL BL
- AL (85H) is multiplied by BL (35H)
- The 16-bit result (1B89H) is stored in AX, with the lower byte in AL and the higher byte in AH
So after the multiplication, AX = 1B89H.
Ex2: Assume that each instruction starts from these values:
DX:AX = 1234H, BX = 57H
1. DIV BX → Quotient in AX, Remainder in DX
After the division, AX = 0035H (1234H ÷ 57H) and DX = 0031H (the remainder).
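The two worked examples can be cross-checked with a small Python model of the 8086 MUL and DIV semantics; the helper names below are invented for illustration, and the divide-overflow trap of the real DIV instruction is not modeled.

```python
# Hypothetical helpers modeling 8086 byte MUL and word DIV to verify
# the worked examples above. Register names are just Python variables.

def mul_byte(al: int, bl: int) -> tuple[int, int]:
    """8086 'MUL BL': AX = AL * BL; returns (AH, AL) of the 16-bit product."""
    product = (al & 0xFF) * (bl & 0xFF)
    return (product >> 8) & 0xFF, product & 0xFF

def div_word(dx: int, ax: int, bx: int) -> tuple[int, int]:
    """8086 'DIV BX': divides 32-bit DX:AX by BX; returns (quotient AX, remainder DX).
    The #DE divide-overflow trap is not modeled here."""
    dividend = ((dx & 0xFFFF) << 16) | (ax & 0xFFFF)
    return dividend // (bx & 0xFFFF), dividend % (bx & 0xFFFF)

# Ex1: AL = 85H, BL = 35H  ->  AX = 1B89H
ah, al = mul_byte(0x85, 0x35)
print(hex((ah << 8) | al))    # 0x1b89

# Ex2: DX:AX = 0000:1234H, BX = 57H
q, r = div_word(0x0000, 0x1234, 0x57)
print(hex(q), hex(r))         # 0x35 0x31
```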
This document presents information about the instruction set of the 8086 processor. It is divided into several sections that classify the different types of instructions: data transfer instructions like MOV, XCHG, PUSH, and POP; arithmetic instructions such as ADD, SUB, INC, and DEC; program execution transfer instructions including CALL, RET, JMP, and conditional jumps; string instructions like MOVS, SCAS, and REP; and processor control instructions like STC, CLC, and CLD. Examples are provided for many of the instructions. The presentation is made to the lecturer by 5 students, whose names and student IDs are listed.
This document provides an overview of the different types of instructions in the 8086 microprocessor architecture. It discusses data transfer, arithmetic, logical, string, control transfer, and processor control instructions. For each type, it provides examples of common instructions and explains how they work and affect registers or flags. The document is intended as a guide to understanding the instruction set of the 8086.
Compaan Design provides services that accelerate the execution of compute-intensive applications, starting from existing software, on specialized high-performance compute systems using FPGAs. They deliver complete hardware and software solutions that improve performance and quality and reduce time-to-market compared to multicore, GPU, or other solutions. Their technology was developed over five years of EU R&D projects and is applied in healthcare, communications, financial, automotive, and other industries.
The document discusses the instruction set of the 8086 microprocessor. It describes the different types of instructions including data transfer, arithmetic, logic, shift/rotate, branch, loop, and string instructions. It provides details on common instructions like MOV, ADD, SUB, MUL, DIV, CMP, INC, DEC, NEG, CBW and CWD. Examples of assembly language programs are given to perform operations like addition, subtraction, multiplication, division, comparison etc. of 8-bit, 16-bit and 32-bit numbers.
Digital Communications JNTU Model Paper - guest3f9c6b
This document contains questions from a digital communications exam for a B.Tech course. The questions cover topics like PCM systems, delta modulation, digital modulation techniques, error probability analysis, information theory concepts, channel capacity, block codes, and convolutional codes. There are 8 questions in total, with sub-questions on analyzing and comparing communication systems and coding schemes.
The instruction set of the 8086 microprocessor has the following categories:
-Data transfer instructions
-Arithmetic instructions
-Logical instructions
-Flag manipulation instructions
-Shift and rotate instructions
-String instructions
-8086 assembler directives
The document contains chapters from a digital fundamentals textbook covering topics such as combinational logic circuits, Karnaugh maps, universal gates, and pulsed waveforms. It provides examples of implementing sum-of-products expressions using AND-OR gates, converting circuits to NAND or NOR form, reading logic expressions from Karnaugh maps, and analyzing the output of combinational circuits with pulsed inputs. It also contains several practice problems with answers.
The document discusses the various types of instructions in the 8086 microprocessor, including:
1) Data transfer instructions such as MOV, PUSH, and POP for moving data between registers and memory.
2) Arithmetic instructions like ADD, SUB, MUL, and DIV for mathematical operations.
3) Bit manipulation and logic instructions including AND, OR, XOR, and shift instructions.
4) Program flow control instructions like CALL, RET, JMP, and conditional jumps.
5) String instructions for comparing and moving blocks of data efficiently.
6) Processor control instructions that set processor modes and flags.
3bOS: A flexible and lightweight embedded OS operated using only 3 buttons - Ryohei Kobayashi
This presentation describes 3bOS, a simple and customizable embedded operating system that runs on the MieruEMB educational kit using only three push buttons. 3bOS is designed for educational purposes, with around 800 lines of code, making it easy for users to understand and modify. It loads programs from an SD card filesystem and runs them, restoring the operating system interface after they exit. Key aspects of the 3bOS design include its small memory footprint, simple input/output interfaces, and support for executing ELF files from the SD card.
Hystor is a hybrid storage system that manages both SSDs and HDDs as a single block device with minimal changes to existing OS kernels. It monitors I/O access patterns at runtime to identify high-cost data blocks, such as those resulting in long latencies or containing critical metadata. It uses these blocks to effectively leverage the performance advantages of SSDs. The paper presents Hystor's design and implementation in the Linux kernel, which can identify high-cost blocks using a metric based on access frequency and request size, and maintain detailed access histories efficiently using a block table structure.
FACE: Fast and Customizable Sorting Accelerator for Heterogeneous Many-core S... - Ryohei Kobayashi
The document describes a proposed sorting accelerator for heterogeneous many-core systems. The accelerator uses a sorting network and merge sorter tree to sort data in parallel. It reads unsorted data from DRAM, processes it through the sorting network and merge sorter tree on an FPGA, and writes the sorted data back to DRAM. An example is provided of sorting 256 elements step-by-step through the sorting network and merge sorter tree to fully sort the data.
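The two-stage flow described above, a sorting network producing small sorted runs that a merge sorter tree then combines, can be sketched in software. The 4-element network and chunk size below are illustrative choices rather than the accelerator's actual parameters, and a heap-based k-way merge stands in for the hardware merge tree.

```python
import heapq

# Sketch of the two-stage flow: a fixed sorting network first sorts
# small chunks in parallel fashion, then a merge stage combines the
# sorted runs. Input length must be a multiple of the chunk size (4).

def sort4_network(a, b, c, d):
    """The optimal 5-comparator sorting network for 4 elements:
    compare-exchange pairs (0,1), (2,3), (0,2), (1,3), (1,2)."""
    if a > b: a, b = b, a
    if c > d: c, d = d, c
    if a > c: a, c = c, a
    if b > d: b, d = d, b
    if b > c: b, c = c, b
    return [a, b, c, d]

def accelerate_sort(data):
    # stage 1: sorting network over 4-element chunks
    runs = [sort4_network(*data[i:i + 4]) for i in range(0, len(data), 4)]
    # stage 2: merge sorter tree, modeled here as a k-way heap merge
    return list(heapq.merge(*runs))

unsorted = [9, 3, 7, 1, 8, 2, 6, 4]
print(accelerate_sort(unsorted))   # [1, 2, 3, 4, 6, 7, 8, 9]
```

In the hardware design both stages are pipelined between DRAM reads and writes; the software model only preserves the dataflow, not the parallelism.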
A High-speed Verilog HDL Simulation Method using a Lightweight Translator - Ryohei Kobayashi
This document appears to be an exam question paper for a Digital Logic Circuits course. It contains 15 multiple choice and long answer questions covering various topics in digital logic design including:
- Logic simplification using K-maps
- Half adder and full adder circuit design
- Flip flop circuit design including JK, T and binary counter circuits
- Finite state machine design and state reduction
- Programmable logic array and read only memory circuit design
- Hardware description language modeling of digital circuits
This document discusses programmable logic devices including PLA, PAL, and ROM. It provides examples of implementing logic functions using a PLA and converting between BCD and gray code using a PLA. ROM is described as a programmable logic device that can also be viewed as a memory, with the inputs serving as addresses and the outputs as stored data. An example is given of designing a BCD to 7-segment display controller using a ROM.
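The ROM-as-logic view, with the inputs serving as addresses and the stored words as outputs, can be modeled as a plain lookup table. The sketch below uses the common active-high gfedcba segment encoding for the BCD to 7-segment example; that encoding is an assumption, since the source does not give the exact table.

```python
# ROM as combinational logic: the 4-bit BCD input is the address, and
# the stored 7-bit word drives segments a..g (bit 0 = segment a,
# gfedcba order, active high). The ROM contents ARE the logic function.
SEGMENT_ROM = [
    0b0111111,  # 0
    0b0000110,  # 1
    0b1011011,  # 2
    0b1001111,  # 3
    0b1100110,  # 4
    0b1101101,  # 5
    0b1111101,  # 6
    0b0000111,  # 7
    0b1111111,  # 8
    0b1101111,  # 9
]

def decode_bcd(digit: int) -> int:
    """Look up the segment pattern for one BCD digit (the 'read cycle')."""
    if not 0 <= digit <= 9:
        raise ValueError("not a BCD digit")
    return SEGMENT_ROM[digit]

print(format(decode_bcd(3), '07b'))  # 1001111
```

Unlike a PLA or PAL, the ROM stores an output word for every input combination, so no minimization is needed; the cost is that the array size doubles with each added input.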
This document summarizes a seminar presentation on field programmable gate arrays (FPGAs) given by Saransh Choudhary. The presentation covered the introduction, architecture, applications and conclusion of FPGAs. It discussed the components of an FPGA including configurable logic blocks, input/output blocks and programmable interconnects. A case study demonstrated how FPGAs can efficiently implement Monte Carlo option pricing simulations. Applications mentioned included digital signal processing, image processing, radar systems and supercomputers.
This document discusses programmable logic devices (PLDs), including their structure, programming, advantages, and examples. PLDs contain programmable logic elements like gates and flip-flops that can be configured by the user to implement different logic functions. This reduces components compared to using separate ICs, lowering costs and improving reliability. The document examines the structures of programmable array logic (PAL) and programmable logic arrays (PLA), how they are programmed, and provides examples of implementing logic functions with each. It also discusses more complex PLDs and field programmable gate arrays, which provide even more programmable logic resources.
These slides are a series of "best practices" for running on the Cray XT line of supercomputers. This talk was presented at the HPCMP meeting at SDSC on 11/5/2009
This document provides information about an ECAD & VLSI lab course, including course objectives, outcomes, and list of experiments. The objectives are to learn HDL programming, simulation of basic gates and circuits, synthesis and layout of CMOS circuits. The outcomes are the ability to simulate and synthesize digital and CMOS circuits. The list of experiments involves designing logic gates, decoders, encoders, multiplexers using CAD tools and verifying designs through simulation and testing on FPGA boards. The document also provides background on logic gates and an example experiment to design a 2-to-4 decoder in Verilog.
GPGPU Programming @DroidconNL 2012 by Alten - Arjan Somers
This document discusses using graphics processing units (GPUs) to perform general purpose calculations on Android devices. It begins with an introduction to GPGPU programming and its history using GPUs for parallel processing. The document then covers how to parallelize code, pack data, implement OpenGL ES 2.0 shaders, and perform input/output operations for GPGPU on Android. It provides an example of using a GPU to implement the AES encryption algorithm on an Android device in parallel compared to a non-parallel CPU implementation. The document aims to explain what GPGPU is, how it can be done on Android, and when it is useful to use a GPU for general computations.
The document discusses the architecture of CPLDs and FPGAs. It begins by explaining the problems with using basic logic gates on PCBs and introduces programmable logic devices as a solution. It then describes different types of PLDs including PLA, PAL, GAL, CPLD and FPGA. CPLDs have a complexity between FPGAs and basic PLDs, containing non-volatile memory and supporting larger logic than PLDs. FPGAs contain logic cells, interconnects, and can implement thousands of gates. The document provides examples of implementing logic with different PLDs and describes the architecture and programming of CPLDs and FPGAs.
This document provides information about an e-CAD lab manual for a third year electronics and communication engineering course. It outlines the course objectives, which include learning HDL programming, simulating basic and complex digital circuits using programming languages, and synthesizing and designing analog and digital CMOS circuits. The course outcomes are also listed. The document then provides a list of experiments to be completed as part of the course, which involve programming and simulating various digital components and circuits using HDL, as well as layout design, verification, placement and routing of circuits. Example programs for simulating basic logic gates using Verilog HDL are also included, along with sample output waveforms.
The document discusses reconfigurable computing architectures and FPGA internals. It covers two main types of reconfigurable computing - microprocessor-based using dynamically joined multi-core processors, and FPGA-based using programmable logic blocks connected to processors. The internals of FPGAs are described including lookup tables, logic blocks, and configurable logic blocks. Performance evaluation considers mapping designs to logic blocks and calculating timing.
1. The document discusses the design of a carry-ripple adder. It defines the generate, propagate, and kill functions used for each bit in the adder.
2. The carry for each bit is calculated by grouping the generate and propagate functions of lower order bits. The sum is calculated using the generate and propagate functions as well as the carry in.
3. The critical path in a carry-ripple adder goes through a chain of AND-OR gates rather than majority gates when using the grouped generate-propagate approach.
This document discusses different types of programmable logic devices including PLA, PAL, and ROM. PLAs are programmable logic arrays that contain a matrix of AND gates and OR gates that can be programmed to implement different logic functions. PALs are similar but have a fixed OR array. ROMs can also implement logic functions and act as a memory device where the address inputs select the output values. Examples are given of implementing logic functions using PLA, PAL, and ROM structures.
High Performance FPGA Based Decimal-to-Binary Conversion SchemesSilicon Mentor
Here we represent high performance FPGA based decimal to binary conversion scheme to support BCD arithmetic based on binary hardware .The architecture presented here requires less LUTs as compare to others and delay is also reduced by the help of shifters in place of multipliers.
For more info visit us at:
http://www.siliconmentor.com/
This document provides information about using high-level programming languages to generate hardware implementations on FPGAs. It discusses how high-level synthesis (HLS) can be used to synthesize register transfer level (RTL) descriptions from C/C++ or Python code. This allows hardware to be programmed at a higher level of abstraction without having to manually write RTL code. Specific HLS tools mentioned include Xilinx Vivado HLS, Altera OpenCL, Veriloggen for Python, and synthesizing hardware from languages like C, C++, Java, and Python.
The document describes simulating and implementing various digital logic circuits using an XC3S400 FPGA kit, including:
1) Logic gates using Verilog code in a FPGA module.
2) Half adders, full adders, half subtractors, and full subtractors using Verilog code.
3) Parallel adders and subtractors using Verilog code to add and subtract 4-bit inputs.
4) Carry look-ahead adders using Verilog code.
5) CMOS logic gates like inverters, NOR gates, and XOR gates using Verilog code.
GPGPU: что это такое и для чего. Александр Титов. CoreHard Spring 2019corehard_by
GPGPU -- это использование графического процессора (GPU) для выполнения общих вычислений, которые обычно проводит центральный процессор (CPU). Благодаря большим вычислительным ресурсам GPU, данный подход позволяет ускорить некоторые приложения в десятки раз по сравнению с традиционным CPU. Принимая во внимание, что GPU есть во множестве современных устройств, данный подход может стать полезных инструментом для программиста, заботящегося о производительности своих программ. Доклад является введением в технологию GPGPU. В ходе презентации, обсуждаются различия между CPU и GPU на аппаратном уровне и объясняется, как эти различия привели к разным моделям программирования этих устройств. Будут рассмотрены классы задач, которые хорошо ускоряются при помощи GPGPU, и когда GPU может оказаться медленнее чем CPU. Доклад не фокусируются на каком-то определенном GPGPU API (OpenCL, CUDA и т.д.) и не требует от слушателей предварительных знаний аппаратуры GPU или CPU.
Resource to Performance Tradeoff Adjustment for Fine-Grained Architectures ─A...Fahad Cheema
Resource to Performance Tradeoff
Adjustment for Fine-Grained Architectures
─A Design Methodology
When implementing computation-intensive algorithms on finegrained
parallel architectures, adjustment of resource to
performance tradeoff is a big challenge. This paper proposes a
methodology for dealing with some of these performance tradeoffs
by adjusting parallelism at different levels. In a case study,
interpolation kernels are implemented on a fine-grained
architecture (FPGA) using a high level language (Mitrion-C).
For both cubic and bi-cubic interpolation, one single-kernel, one
cross-kernel and two multi-kernel parallel implementations are
designed and evaluated. Our results demonstrate that no single
level of parallelism can be used for trade-off adjustment. Instead,
the appropriate degree of parallelism on each level, according to
available resources and the performance requirements of the
application, needs to be found. Basing the design on high-level
programming simplifies the trade-off process. This research is a
step towards automation of the choice of parallelization based on
a combination of parallelism levels.
This document summarizes common issues encountered when developing FPGA projects. It introduces FPGAs, the development process, and applications. Key issues discussed include timing violations from negative slack, hardware configuration errors affecting ADCs, DDR3 interface problems from hardware design faults like improper impedance matching, and excessive resource usage from unnecessary registers. Solutions involve optimizing code and hardware design, as well as adjusting compiler options.
This document describes an FPGA-based error generator for PROFIBUS DP networks that can be used to stress test networks or diagnosis tools by deterministically generating errors. The system architecture uses an FPGA and microcontroller, with the FPGA handling real-time signal decoding, analysis, and fault generation due to its parallel processing capabilities. The FPGA design was validated through simulation and testing. The error generator can be fully configured to generate faults of different types, durations, and repetition patterns to simulate various error scenarios.
Similar to CMPP 2012 held in conjunction with ICNC’12 (20)
1. 2012/12/07 The Third International Conference on Networking and Computing
International Workshop on Challenges on Massively Parallel Processors (CMPP) (11:00-11:30)
25-minute presentation and 5-minute question and discussion time
Towards a Low-Power Accelerator of Many FPGAs for Stencil Computations
☆Ryohei Kobayashi†1, Shinya Takamaeda-Yamazaki†1†2, Kenji Kise†1
†1 Tokyo Institute of Technology, Japan
†2 JSPS Research Fellow, Japan
3. FPGA-Based Accelerators
There is growing demand for low-power, high-performance scientific computation.
Various FPGA-based accelerators have been designed to run scientific computing kernels:
► CUBE (Mencer, O., et al., SPL 2009)
◇ Systolic array of 512 FPGAs
◇ For encryption and pattern matching
► Stencil computation accelerator composed of 9 FPGAs
◇ Scalable streaming array with constant memory bandwidth
(Sano, K., et al., IEEE 19th Annual International Symposium on Field-Programmable Custom Computing Machines, 2011)
4. 2D Stencil Computation
An iterative computation that updates a data set using nearest-neighbor values (the "stencil").
It is one method to obtain an approximate solution of a partial differential equation (e.g. in thermodynamics, hydrodynamics, electromagnetism, ...):

v1[i][j] = (C0 * v0[i-1][j]) + (C1 * v0[i][j+1]) +
           (C2 * v0[i][j-1]) + (C3 * v0[i+1][j]);

v1[i][j] is updated with the weighted sum of its four neighbor values (Cx: weighting factor), advancing the data set from time-step k to the next.
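The update rule above can be sketched as one time-step in C; the grid size, boundary handling, and coefficient values below are illustrative assumptions, not taken from the slides:

```c
#define NI 6
#define NJ 6

/* One time-step of the 4-point stencil from the slide: each interior
   point of v1 is a weighted sum of the four neighbors in v0.
   Boundary rows/columns are left untouched here (assumed fixed). */
static void stencil_step(const float v0[NI][NJ], float v1[NI][NJ],
                         const float C[4]) {
    for (int i = 1; i < NI - 1; i++)
        for (int j = 1; j < NJ - 1; j++)
            v1[i][j] = (C[0] * v0[i-1][j]) + (C[1] * v0[i][j+1]) +
                       (C[2] * v0[i][j-1]) + (C[3] * v0[i+1][j]);
}
```

With all coefficients set to 0.25 this becomes the familiar Jacobi averaging step used, for example, for heat diffusion.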
6. ScalableCore System *Takamaeda-Yamazaki, S., et al. (ARC 2012)
A tile-architecture simulator built from multiple low-end FPGAs:
► A high-speed simulation environment for many-core processor research
► We use the hardware components of this system as an infrastructure for HPC hardware accelerators.
One FPGA node consists of an FPGA, a PROM, and SRAM.
7. Our Plan
One node → 4 nodes (2×2) → 100 nodes (10×10)
Now implementing the small configurations; the 100-node array is the final goal.
9. Block Division and Assignment to Each FPGA
Legend: grid-point; data subset communicated with neighbor FPGAs; group of grid-points assigned to one FPGA.
・The data set is divided into blocks according to the number of FPGAs.
・Each FPGA performs the stencil computation on its block in parallel.
10. The Computing Order of Grid-points on an FPGA
Our proposed method increases the acceptable communication latency between FPGAs.
Let us now compare the baseline order (a) with the proposed order (b).
11. Comparison between (a) and (b) (1/2)
・"Iteration": the sequential process of computing all the grid-points at one time-step.
・Suppose that updating the value of one grid-point takes exactly one cycle.
・Each FPGA updates its assigned data of sixteen grid-points (0 to 15) during every Iteration.
[Figure: four FPGAs A, B, C, D, each holding a 4×4 block of grid-points (A0~A15, ..., D0~D15); baseline order (a) vs. the proposed order (b)]
13. Comparison between (a) and (b) (2/2)
(a) Each FPGA computes its grid-points in row-major order (A0, A1, ..., A15; B0, B1, ..., B15), and the first Iteration ends after 16 cycles. In order not to stall the computation of B1, the value of A13 must be communicated within three cycles (14, 15, 16) after its computation.
[Figure: cycle-by-cycle timelines of FPGA(A) and FPGA(B) under order (a)]
14. Comparison between (a) and (b) (2/2)
(b) Under the proposed method, FPGA(C) computes C0, C1, ..., C15 and FPGA(D) computes D0, D1, ..., D15, each Iteration again taking 16 cycles. In order not to stall the computation of D1 in Iteration 2 (the 17th cycle), the margin to send the value of C1 (computed in the 1st cycle) is 15 cycles.
[Figure: cycle-by-cycle timelines of FPGA(C) and FPGA(D) under the proposed order (b)]
15. Comparison between (a) and (b) (N×M grid-points)
(a) If an N×M block of grid-points is assigned to a single FPGA, every shared value must be communicated within N-1 cycles.
(b) Under the proposed method, every shared value must be communicated within N×M-1 cycles.
[Figure: an N×M block per FPGA, with the communication deadline relative to the Iteration end for each order]
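The two deadlines can be checked numerically as a minimal sketch; plugging in the 4×4 block from the earlier comparison reproduces the 3-cycle and 15-cycle margins shown on the previous slides:

```c
/* Acceptable communication latency (in cycles) for a shared boundary
   value, assuming one grid-point update per cycle on an N x M block. */
static int margin_row_major(int n, int m) { (void)m; return n - 1; }  /* order (a) */
static int margin_proposed (int n, int m) { return n * m - 1; }       /* order (b) */
```

Note that the margin of order (a) depends only on the block width N, while the proposed order scales with the whole block size N×M.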
16. Comparison between (a) and (b) (N×M grid-points)
The proposed method increases the acceptable communication latency from N-1 cycles to N×M-1 cycles!
17. Computing Order with the Proposed Method Applied
[Figure: computation order over the block]
This method ensures a margin of about one Iteration.
As the number of grid-points increases, the acceptable latency scales accordingly.
19. System Architecture
[Figure: one node built around a Spartan-6 FPGA. A memory unit (BlockRAMs) feeds a computation unit of eight MADD units through per-MADD muxes; links from the north, south, east, and west neighbors enter through mux2; GATE[0]~GATE[3] and mux8 drive the outgoing links to the north, south, east, and west. Ser/Des blocks connect to/from the adjacent units. A configuration ROM (XCF04S) with a JTAG port, plus clock and reset, complete the node.]
20. Relationship between the Data Subset and BlockRAM (Memory unit)
BlockRAM is the low-latency SRAM inside each FPGA.
The data set assigned to each FPGA of the 4×4 FPGA array is split in the vertical direction and stored across BlockRAMs 0~7.
If a data set of 64×128 is assigned to one FPGA, each BlockRAM holds one 8×128 slice of it.
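A minimal sketch of this vertical split, assuming row-major addressing inside each BlockRAM (the addressing scheme is an assumption, not stated on the slide):

```c
/* A 64x128 data set split vertically across 8 BlockRAMs:
   the 64 columns are divided into 8 stripes of width 8, so
   BlockRAM b holds the 8x128 stripe with columns j in [8b, 8b+7]. */
#define WIDTH  64
#define HEIGHT 128
#define NBRAM  8
#define STRIPE (WIDTH / NBRAM)   /* 8 columns per BlockRAM */

static int bram_index(int j)        { return j / STRIPE; }           /* which BlockRAM */
static int bram_addr (int i, int j) { return i * STRIPE + j % STRIPE; } /* word address */
```

Each stripe occupies 8×128 = 1024 words, so word addresses run from 0 to 1023 inside every BlockRAM.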
21. Relationship between MADD and BlockRAM (Memory unit)
・The data stored in each BlockRAM is computed by its own MADD unit.
・The eight MADDs perform their computations in parallel.
・The computed data is written back to BlockRAM.
22. MADD Architecture (Computation unit)
MADD
► Multiplier: seven pipeline stages
► Adder: seven pipeline stages
► Both the multiplier and the adder are single-precision floating-point units conforming to IEEE 754.
33. MADD Pipeline Operation (in cycles 0〜7)
The grid-points 1~8 are loaded from BlockRAM and fed to the multiplier in cycles 0~7.
[Figure: the computation of grid-points 11~18; the 8-stage multiplier and the two adder inputs (Input1, Input2), each behind 8 stages]
34. MADD Pipeline Operation (in cycles 8〜15)
The first results emerge from the multiplier while, at the same time, grid-points 10~17 are fed into it in cycles 8~15.
[Figure: the computation of grid-points 11~18 progressing through the 8-stage multiplier and adder inputs]
35. MADD Pipeline Operation (in cycles 16〜23)
Grid-points 12~19 are fed into the multiplier while, at the same time, the values of grid-points 1〜8 and 10~17, each multiplied by a weighting factor, are summed in cycles 16~23.
[Figure: multiplier and adder pipelines operating concurrently]
38. MADD Pipeline Operation (in cycles 40〜48)
The final results, in which the data of the up, down, left, and right grid-points have each been multiplied by a weighting factor and summed, are output in cycles 40~48.
[Figure: adder pipeline draining the accumulated partial sums]
39. MADD Pipeline Operation (Computation unit)
The filling rate of the pipeline is ((N-8)/N)×100%, where N is the number of cycles the computation takes.
► This achieves high computation performance with a small circuit area.
► This scheduling is valid only when the width of the computed grid equals the number of pipeline stages of the multiplier and adder.
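A quick arithmetic check of the filling-rate formula; the cycle counts used below are illustrative values, not figures from the talk:

```c
/* Pipeline filling rate ((N - 8) / N) * 100% for an N-cycle computation:
   8 cycles are lost to filling the 8-stage pipeline, the rest produce
   one result per cycle. */
static double fill_rate(int n) { return (n - 8) * 100.0 / n; }
```

For long runs the rate approaches 100%, which is why the scheme combines high throughput with a small circuit.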
40. Initialization Mechanism (1/2)
[Figure: 4×4 FPGA array with the Master at (0,0); coordinates (x,y) propagate eastward as "x-coordinate + 1" and southward as "y-coordinate + 1"]
・To determine its computation order, every FPGA uses its own position coordinate in the system.
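The coordinate legend suggests that each node derives its position from a neighbor; a minimal sketch, assuming this propagation rule (the actual protocol is not detailed on the slide):

```c
/* Position coordinates propagate from the Master at (0,0):
   a node receiving (x,y) from its west neighbor is at (x+1, y),
   and from its north neighbor at (x, y+1), matching the
   "x-coordinate + 1" / "y-coordinate + 1" legend on the slide. */
typedef struct { int x, y; } coord_t;

static coord_t from_west (coord_t w) { return (coord_t){ w.x + 1, w.y }; }
static coord_t from_north(coord_t n) { return (coord_t){ n.x, n.y + 1 }; }
```

Starting from the Master, two hops (east then south) reach the node at (1,1), and so on across the 4×4 array.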
41. Initialization Mechanism (2/2)
・The array system must synchronize the start of computation in the first Iteration precisely across all FPGAs.
・If there is any skew, the system cannot obtain the data of the communication region needed for the next Iteration.
・The Master therefore sends a start-of-computation signal through the array.
[Figure: start signal propagating through the 4×4 FPGA array]
43. Environment
FPGA: Xilinx Spartan-6 XC6SLX16
► BlockRAM: 72KB
Design tool: Xilinx ISE WebPACK 13.3
Hardware description language: Verilog HDL
Implementation of MADD: IP cores generated by the Xilinx CORE Generator
► A single MADD consumes four of the 32 DSP blocks a Spartan-6 FPGA has.
◇ Therefore, at most eight MADDs can be implemented in a single FPGA.
SRAM is not used.
Hardware configuration of the FPGA array: ScalableCore boards
44. Performance of a Single FPGA Node (1/2)
Grid size: 64×128; Iterations: 500,000
Performance and power consumption (at 160MHz):
► Performance: 2.24GFlop/s
► Power consumption: 2.37W
Peak performance [GFlop/s]:
Peak = 2 × F × N_FPGA × N_MADD × 7/8
F: operating frequency [GHz]
N_FPGA: the number of FPGAs
N_MADD: the number of MADDs per FPGA
7/8: average utilization of a MADD unit, since each update consists of four multiplications and three additions:
v1[i][j] = (C0 * v0[i-1][j]) + (C1 * v0[i][j+1]) +
           (C2 * v0[i][j-1]) + (C3 * v0[i+1][j]);
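The peak-performance formula can be evaluated directly; with F = 0.16GHz and eight MADDs per FPGA it reproduces the 2.24GFlop/s single-node figure reported above:

```c
/* Peak = 2 * F * N_FPGA * N_MADD * 7/8, with F in GHz and the result
   in GFlop/s. The factor 2 counts the multiply and the add issued by
   each MADD per cycle; 7/8 is the average MADD utilization. */
static double peak_gflops(double f_ghz, int n_fpga, int n_madd) {
    return 2.0 * f_ghz * n_fpga * n_madd * 7.0 / 8.0;
}
```

Scaling the same formula to 256 FPGA nodes gives the roughly 573GFlop/s upper bound estimated later in the talk.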
45. Performance of a Single FPGA Node (2/2)
Performance and performance per watt (at 160MHz):
► Performance: 2.24GFlop/s
◇ 26% of an Intel Core i7-2600 (single thread, 3.4GHz, -O3 option)
► Performance per watt: 0.95GFlop/s/W
◇ About six times better than an Nvidia GTX 280 GPU card
Hardware resource consumption:
► LUTs: 50%
► Slices: 67%
► BlockRAM: 75%
► DSP48A1: 100%
46. Estimation of Effective Performance with 256 FPGA Nodes
Upper limit of effective performance:
► 573GFlop/s = (8 multipliers + 8 adders) × 256 FPGAs × 160MHz × 7/8
Performance per watt:
► 0.944GFlop/s/W
[Figure: estimated effective performance [GFlop/s], log scale from 1 to 1000, versus the number of FPGA nodes (2 to 256) at 0.16GHz, showing the performance improvement rate]
47. Conclusion
Proposal of a high-performance stencil computing method and architecture.
Implementation result (one FPGA node):
► Frequency: 160MHz (without communication)
► Effective performance: 2.24GFlop/s; power consumption: 2.37W
► Hardware resource consumption: 67% of slices
Estimation of performance with 256 FPGA nodes:
► Upper limit of effective performance: 573GFlop/s
► Effective performance per watt: 0.944GFlop/s/W
An array of low-end FPGAs is promising (better performance per watt than an Nvidia GTX280 GPU card)!
Future work:
► Implementation and evaluation of a larger-scale FPGA array
► Implementation targeting even lower power