CMPP 2012 held in conjunction with ICNC’12

I gave a presentation at CMPP, a workshop held in conjunction with ICNC2012.

Presentation Transcript

  • 2012/12/07, The Third International Conference on Networking and Computing, International Workshop on Challenges on Massively Parallel Processors (CMPP), 11:00-11:30 (25-minute presentation and 5-minute question and discussion time). Towards a Low-Power Accelerator of Many FPGAs for Stencil Computations. ☆Ryohei Kobayashi†1, Shinya Takamaeda-Yamazaki†1,†2, Kenji Kise†1. †1 Tokyo Institute of Technology, Japan; †2 JSPS Research Fellow, Japan.
  • Motivation(1/2) GPU or FPGA ?? or 1
  • FPGA-Based Accelerators. There is a growing demand to perform scientific computation with low power and high performance, and various accelerators have been designed to solve scientific computing kernels using FPGAs. ► CUBE (Mencer, O., SPL 2009) ◇ a systolic array of 512 FPGAs ◇ for encryption and pattern matching. ► A stencil computation accelerator composed of 9 FPGAs ◇ a scalable streaming array with constant memory bandwidth (Sano, K., IEEE 19th Annual International Symposium on Field-Programmable Custom Computing Machines, 2011).
  • 2D Stencil Computation. An iterative computation that updates a data set using nearest-neighbor values (the stencil). It is one of the methods used to obtain approximate solutions of partial differential equations (e.g. thermodynamics, hydrodynamics, electromagnetism, ...). v1[i][j] = (C0 * v0[i-1][j]) + (C1 * v0[i][j+1]) + (C2 * v0[i][j-1]) + (C3 * v0[i+1][j]); that is, v1[i][j] is updated as the weighted sum of the four neighboring values, where Cx is a weighting factor. At every time-step k the whole data set is updated.
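As a reading aid, here is a minimal C sketch of one time-step of this stencil. The grid size, halo handling, coefficient values, and the small self-test in main are illustrative assumptions, not code from the presented system.

```c
#include <stdio.h>

#define NX 64          /* interior width  (illustrative)        */
#define NY 128         /* interior height (illustrative)        */

static float v0[NY + 2][NX + 2];   /* current values, with halo */
static float v1[NY + 2][NX + 2];   /* updated values            */

/* One time-step of the 4-point stencil from the slide. */
static void stencil_step(const float C[4])
{
    for (int i = 1; i <= NY; i++) {
        for (int j = 1; j <= NX; j++) {
            v1[i][j] = C[0] * v0[i - 1][j]
                     + C[1] * v0[i][j + 1]
                     + C[2] * v0[i][j - 1]
                     + C[3] * v0[i + 1][j];
        }
    }
}

int main(void)
{
    const float C[4] = { 0.25f, 0.25f, 0.25f, 0.25f };  /* example weights */
    v0[NY / 2][NX / 2] = 1.0f;                          /* a point source  */
    stencil_step(C);
    printf("%g\n", v1[NY / 2 - 1][NX / 2]);             /* 0.25 expected   */
    return 0;
}
```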
  • Motivation(2/2) Small or Big ?? or 4
  • ScalableCore System (Takamaeda-Yamazaki, S., ARC 2012). A tile-architecture simulator built from multiple low-end FPGAs. ► A high-speed simulation environment for many-core processor research. ► We use the hardware components of this system as the infrastructure for our HPC hardware accelerator. One FPGA node: FPGA, PROM, SRAM.
  • Our Plan. One node → 4 nodes (2×2, now being implemented) → 100 nodes (10×10, final goal).
  • Parallel Stencil Computation Using Multiple FPGAs
  • Block Division and Assignment to Each FPGA. ・The data set is divided into several blocks according to the number of FPGAs, and each block (a group of grid-points) is assigned to one FPGA. ・Each FPGA performs its stencil computation in parallel, communicating the boundary data subset with its neighbor FPGAs.
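A hedged C sketch of this block division, assuming a P×Q FPGA array and a W×H grid that divides evenly; the function and type names are illustrative, not the actual system code.

```c
#include <stdio.h>

/* Bounds of the block assigned to the FPGA at array position (px, py),
 * assuming a P x Q FPGA array and a W x H grid that divides evenly.   */
typedef struct { int x0, y0, x1, y1; } block_t;

static block_t assign_block(int px, int py, int P, int Q, int W, int H)
{
    block_t b;
    int bw = W / P, bh = H / Q;          /* block width and height */
    b.x0 = px * bw;  b.x1 = b.x0 + bw;   /* [x0, x1) columns       */
    b.y0 = py * bh;  b.y1 = b.y0 + bh;   /* [y0, y1) rows          */
    return b;
}

int main(void)
{
    /* Example: a 2x2 FPGA array sharing a 128x256 grid.
     * Each FPGA would then exchange its boundary rows/columns
     * with its neighbors every Iteration.                      */
    for (int py = 0; py < 2; py++)
        for (int px = 0; px < 2; px++) {
            block_t b = assign_block(px, py, 2, 2, 128, 256);
            printf("FPGA(%d,%d): cols [%d,%d) rows [%d,%d)\n",
                   px, py, b.x0, b.x1, b.y0, b.y1);
        }
    return 0;
}
```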
  • The Computing Order of Grid-points on an FPGA. Our proposed method increases the acceptable communication latency. Let us now compare model (a) with the proposed method (b).
  • Comparison between (a) and (b) (1/2). An "Iteration" is the sequence of operations that computes all the grid-points for one time-step. Suppose that updating the value of one grid-point takes exactly one cycle, and that each FPGA updates its assigned sixteen grid-points (0 to 15) in every Iteration. [Figure: the 4×4 blocks of FPGAs (A)-(D), shown with the conventional computing order (a) and the proposed order (b).]
  • Comparison between (a) and (b) (2/2). [Figure: cycle-by-cycle timelines of the first Iteration (cycles 0 to 16) for FPGAs (A)-(D), under order (a) and under the proposed order (b).] Under order (a), in order not to stall the computation of B1, the value of A13 must be communicated within the three cycles (14, 15, 16) after it is computed. Under the proposed order (b), in order not to stall the computation of D1 in Iteration 2 (the 17th cycle), the margin for sending the value of C1 (computed in the 1st cycle) is 15 cycles.
  • Comparison between (a) and (b) for N×M grid-points. If N×M grid-points are assigned to a single FPGA, then under order (a) every shared value must be communicated within N-1 cycles, whereas under the proposed order (b) every shared value must be communicated within N×M-1 cycles. The proposed method therefore increases the acceptable communication latency.
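As a worked restatement of the two margins using the 4×4 example from the previous slides:

```latex
% Acceptable communication latency per shared boundary value,
% assuming one grid-point update per cycle.
\[
  \text{order (a): } N - 1 \text{ cycles}, \qquad
  \text{order (b): } N \times M - 1 \text{ cycles}.
\]
% For the 4x4 blocks on the previous slides (N = M = 4):
\[
  4 - 1 = 3 \text{ cycles} \quad \text{vs.} \quad 4 \times 4 - 1 = 15 \text{ cycles}.
\]
```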
  • Computing Order with the Proposed Method Applied. This ordering ensures a margin of about one Iteration. As the number of grid-points increases, the acceptable latency scales accordingly.
  • Architecture and Implementation
  • System Architecture. [Block diagram: a Spartan-6 FPGA containing a memory unit (BlockRAMs), a computation unit with eight MADD units behind per-unit muxes, a configuration ROM (XCF04S) with a JTAG port, clock and reset, and Ser/Des links to and from the adjacent units to the north, south, east, and west.]
  • Relationship between the Data Subset and BlockRAM (Memory unit). BlockRAM is a low-latency SRAM that each FPGA has. The data set assigned to each FPGA is split in the vertical direction and stored in BlockRAMs 0~7. For example, if a data set of 64×128 is assigned to one FPGA, a split data set of 8×128 is stored in each BlockRAM.
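A small C sketch of the vertical split described above, using the 64×128 example (eight BlockRAMs, so an 8×128 stripe each); the mapping function is an illustrative assumption, not the actual address logic.

```c
#include <stdio.h>

#define GRID_W     64   /* columns assigned to one FPGA (example) */
#define GRID_H    128   /* rows assigned to one FPGA (example)    */
#define N_BRAM      8   /* BlockRAMs 0..7                         */
#define STRIPE_W  (GRID_W / N_BRAM)   /* = 8 columns per BlockRAM */

/* Map a global column index to (BlockRAM id, local column),
 * splitting the data set in the vertical direction.          */
static void map_column(int col, int *bram, int *local_col)
{
    *bram      = col / STRIPE_W;
    *local_col = col % STRIPE_W;
}

int main(void)
{
    int bram, local;
    map_column(42, &bram, &local);
    printf("column 42 -> BlockRAM %d, local column %d\n", bram, local);
    /* prints: column 42 -> BlockRAM 5, local column 2 */
    return 0;
}
```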
  • Relationship between MADD and BlockRAM (Memory unit). ・The data stored in each BlockRAM is computed by the corresponding MADD. ・The MADDs perform their computations in parallel. ・The computed data is stored back into the BlockRAM.
  • MADD Architecture (Computation unit). ► Multiplier: seven pipeline stages. ► Adder: seven pipeline stages. ► Both the multiplier and the adder are single-precision floating-point units conforming to IEEE 754.
  • Stencil Computation at the MADD. v1[i][j] = (C0 * v0[i-1][j]) + (C1 * v0[i][j-1]) + (C2 * v0[i][j+1]) + (C3 * v0[i+1][j]); [Animation slides: the four products C0*v0[i-1][j], C1*v0[i][j-1], C2*v0[i][j+1], and C3*v0[i+1][j] are fed one after another through the 8-stage multiplier, each phase taking 8 cycles, and the partial sums flow through the 8-stage adder until v1[i][j] is produced.]
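The sequence above amounts to four weighted-neighbor phases feeding an accumulation. Below is a purely sequential C model of that ordering, with pipeline latencies ignored; the function name and the 8-point batch width (chosen to match the 8 pipeline stages on the slides) are assumptions for illustration.

```c
#include <stdio.h>

#define BATCH 8   /* batch width, matching the 8 pipeline stages */

/* One output batch of the MADD schedule: the four weighted neighbor
 * streams are processed in consecutive phases and summed.  Each input
 * array holds the corresponding neighbor of the 8 grid-points.        */
static void madd_batch(const float up[BATCH],    const float left[BATCH],
                       const float right[BATCH], const float down[BATCH],
                       const float C[4], float out[BATCH])
{
    float acc[BATCH];
    for (int j = 0; j < BATCH; j++) acc[j]  = C[0] * up[j];     /* phase 1 */
    for (int j = 0; j < BATCH; j++) acc[j] += C[1] * left[j];   /* phase 2 */
    for (int j = 0; j < BATCH; j++) acc[j] += C[2] * right[j];  /* phase 3 */
    for (int j = 0; j < BATCH; j++) acc[j] += C[3] * down[j];   /* phase 4 */
    for (int j = 0; j < BATCH; j++) out[j] = acc[j];
}

int main(void)
{
    const float C[4] = { 0.25f, 0.25f, 0.25f, 0.25f };
    float up[BATCH], left[BATCH], right[BATCH], down[BATCH], out[BATCH];
    for (int j = 0; j < BATCH; j++)
        up[j] = left[j] = right[j] = down[j] = 1.0f;
    madd_batch(up, left, right, down, C, out);
    printf("%g\n", out[0]);   /* 1.0 for these example inputs */
    return 0;
}
```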
  • MADD Pipeline Operation(Computation unit) The computation of grid-points 11~18 8-stages Input2(adder) 8-stages Input1(adder) 31
  • MADD Pipeline Operation (cycles 0~7). Grid-points 1~8 are loaded from the BlockRAM and fed into the multiplier in cycles 0~7.
  • MADD Pipeline Operation (cycles 8~15). The multiplier outputs its first results while, at the same time, grid-points 10~17 are fed into the multiplier in cycles 8~15.
  • MADD Pipeline Operation (cycles 16~23). Grid-points 12~19 are fed into the multiplier while, at the same time, the values of grid-points 1~8 and 10~17, each multiplied by a weighting factor, are summed in cycles 16~23.
  • MADD Pipeline Operation (cycles 24~31). Adder input 2: the weighted grid-points 1~8 and 10~17. Adder input 1: the weighted grid-points 12~19. Multiplier input: grid-points 21~28.
  • MADD Pipeline Operation (cycles 32~39). Adder input 2: the accumulated weighted grid-points 1~8, 10~17, and 12~19. Adder input 1: the weighted grid-points 21~28. Multiplier input: grid-points 11~18.
  • MADD Pipeline Operation (cycles 40~48). The computation results, in which the values of the up, down, left, and right grid-points are multiplied by weighting factors and summed, are output in cycles 40~48.
  • MADD Pipeline Operation(Computation unit)The filing rate of the pipeline: (N-8/N)×100% (N is cycles which taken this computation.) ► Achievement of high computation performance and the small circuit area ► This scheduling is valid only when width of computed grid is equal to the pipeline stages of multiplier and adder. 38
  • Initialization Mechanism (1/2). To determine its computation order, every FPGA uses its own position coordinate in the system. [Figure: a 4×4 array with a Master node at (0,0); each node derives its coordinate from a neighbor by adding 1 to the x- or y-coordinate.]
  • Initialization Mechanism (2/2). The array system must precisely synchronize the timing at which computation starts in the first Iteration: if there is a skew, the system cannot obtain the data of the communication region needed for the next Iteration. [Figure: the start-of-computation signal being sent across the 4×4 FPGA array.]
  • Evaluation
  • Environment FPGA:Xilinx Spartan-6 XC6SLX16 ► BlockRAM: 72KB Design tool: Xilinx ISE webpack 13.3 Hardware description language: Verilog HDL Implementation of MADD:IP core generated by Xilinx core-generator ► Implementing single MADD expends four pieces of 32 DSP-blocks which a Spartan-6 FPGA has. ◇ Therefore, the number of MADD to be able to be implemented in single FPGA is eight SRAM is not used. Hardware configuration of FPGA array ScalableCore board 42
  • Performance of a Single FPGA Node (1/2). Grid size: 64×128; Iterations: 500,000. Performance and power consumption at 160 MHz: ► Performance: 2.24 GFlop/s ► Power consumption: 2.37 W. Peak performance [GFlop/s]: Peak = 2 × F × NFPGA × NMADD × 7/8, where F is the operating frequency [GHz], NFPGA is the number of FPGAs, NMADD is the number of MADDs, and 7/8 is the average utilization of a MADD unit (four multiplications and three additions): v1[i][j] = (C0 * v0[i-1][j]) + (C1 * v0[i][j+1]) + (C2 * v0[i][j-1]) + (C3 * v0[i+1][j]);
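Plugging the single-node figures from this slide into the peak-performance formula (F = 0.16 GHz, NFPGA = 1, NMADD = 8):

```latex
\[
  \mathrm{Peak}
  = 2 \times F \times N_{\mathrm{FPGA}} \times N_{\mathrm{MADD}} \times \tfrac{7}{8}
  = 2 \times 0.16 \times 1 \times 8 \times \tfrac{7}{8}
  = 2.24~\mathrm{GFlop/s}.
\]
```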
  • Performance of a Single FPGA Node (2/2). Performance and performance per watt at 160 MHz: ► Performance: 2.24 GFlop/s, which is 26% of an Intel Core i7-2600 (single thread, 3.4 GHz, -O3 option). ► Performance per watt: 0.95 GFlop/s/W; this performance/W value is about six times better than that of an Nvidia GTX 280 GPU card. Hardware resource consumption: ► LUT: 50% ► Slice: 67% ► BlockRAM: 75% ► DSP48A1: 100%
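For reference, the performance-per-watt figure follows directly from the two numbers above:

```latex
\[
  \frac{2.24~\mathrm{GFlop/s}}{2.37~\mathrm{W}} \approx 0.95~\mathrm{GFlop/s/W}.
\]
```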
  • Estimation of Effective Performance with 256 FPGA Nodes. Upper limit of effective performance: ► 573 GFlop/s = (8 multipliers + 8 adders) × 256 FPGAs × 160 MHz × 7/8. Performance per watt: ► 0.944 GFlop/s/W. [Figure: estimated effective performance (GFlop/s, log scale) versus the number of FPGA nodes, from 2 to 256, at a frequency of 0.16 GHz.]
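Working out the stated upper limit, and, as a consistency check under the assumption that the 2.37 W single-node power scales linearly with the node count, the performance per watt:

```latex
\[
  (8 + 8) \times 256 \times 0.16~\mathrm{GHz} \times \tfrac{7}{8}
  = 573.44~\mathrm{GFlop/s} \approx 573~\mathrm{GFlop/s},
\]
\[
  \frac{573.44~\mathrm{GFlop/s}}{256 \times 2.37~\mathrm{W}}
  \approx 0.94~\mathrm{GFlop/s/W}.
\]
```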
  • Conclusion Proposition of high performance stencil computing method and architecture Implementation result (One-FPGA node) ► Frequency 160MHz (no communication) ► Effective performance 2.24GFlop/s. Power consumption 2.37W. ► Hardware resource consumption : Slices 67% Estimation of performance in 256 FPGA nodes ► Upper limit of effective performance:573GFlop/s ► Effective performance par watt:0.944GFlop/sW Low end FPGAs array system is promising ! (Better than Nvidia GTX280 GPU card) Future works ► Implementation and evaluation of more scaled FPGA array ► Implementation towards lower-power 46