An FPGA-based Scalable Simulation Accelerator for Tile Architectures @HEART2011

A presentation of ScalableCore system 1.1 at HEART 2011 @Imperial College London


  1. An FPGA-based Scalable Simulation Accelerator for Tile Architectures
     HEART 2011 @ Imperial College London, June 2, 2011, 14:30 – 15:00
     Shinya Takamaeda-Yamazaki†‡, Ryosuke Sasakawa†, Yoshito Sakaguchi†, Kenji Kise†
     †Tokyo Institute of Technology, Japan; ‡JSPS Research Fellow
  2. This presentation shows the ScalableCore system
     • Multi-FPGA system for tile architecture simulations
       • Achieves SCALABLE simulation speed
     [Figure labels: Target Core, System Function]
  3. Agenda
     • Background & Motivation
     • Proposal: ScalableCore
     • System Implementation
       • Overall system
       • Components: ScalableCore Unit & Board
       • Logic Hierarchy & Architecture
     • Evaluation
       • Simulation Speed
       • Power
     • Conclusion
  4. Background: Multicores to Many-cores
     • Intel Single Chip Cloud Computer: 48 cores (x86)
     • TILERA TILE-Gx100: 100 cores (MIPS)
  5. Simulation Target Manycore: M-Core
     • Tile architecture with 2D mesh network
       • A Node has: Core, Local Memory, INCC (DMA controller) and Router
       • Local Memory: independent address space, data transfer by DMAs
     [Figure: mesh of Nodes (Local Memory, Core, INCC, Router "R") with four DRAM Controllers]
  6. How to evaluate the architectures?
     • Customizability vs. Simulation Speed
       • We want to run a large benchmark fast
     [Figure: options ordered by closeness to reality]
     • Chip: real but expensive
     • FPGA Simulator: faster simulation and customizable, but difficult to construct
     • Software Simulator: easy construction of an ideal system without HW limitations
  7. Poor scalability of simulation speed on software simulators
     • Speed decreases as the number of target cores increases
       • SimMc: M-Core simulator
       • Difficult to achieve scalable speed
         • Overhead of cycle-accurate simulation
     [Chart: Simulation Speed on SimMc (M-Core simulator), in K cycle / sec vs. # Target Cores]
       # Target Cores    Simulation Speed [K cycle / sec]
       16                343
       32                149
       48                96
       64                70
     Chart annotation: speed degrades faster than the number of cores grows
  8. Motivation
     • Achieve SCALABLE simulation speed
       • = Keep the simulation speed constant even for a large number of cores
     • How to scale the simulation speed?
       • Our target architecture: M-Core
         • Tile architecture with 2D mesh network
       • Partition the target processor across multiple FPGAs
     [Figure: Many-core Processor partitioned into parts]
  9. Proposal of ScalableCore
     • Multiple FPGAs corresponding to the target processor
       • Each ScalableCore Unit holds a part of the target processor and shares the simulation progress with its neighbor Units
     [Figure labels]
     • ScalableCore Unit (FPGA card with off-chip memory): a part of the target processor
     • ScalableCore Board: connects the ScalableCore Units
     • LCD display for simulation information
     • Target Core, System Function, Target Processor (M-Core)
  10. Simulation Target Manycore: M-Core
     • Tile architecture with 2D mesh network
       • A Node has: Core, Local Memory, INCC (DMA controller) and Router
       • Local Memory: independent address space, data transfer by DMAs
     [Figure: mesh of Nodes (Local Memory, Core, INCC, Router "R") with four DRAM Controllers; the Nodes are marked "Current Target of ScalableCore system"]
  11. ScalableCore system 1.1: Overview
     • Simulates the M-Core with up to 64 Nodes (= FPGAs)
     • The number of Nodes can be increased or decreased
     [Figure labels: Local Memory, Core, INCC, R, System Functions]
  12. 1 Node: 1 ScalableCore Unit [Photo, scale marks: 45 cm, 30 cm]
  13. 4 Nodes (2×2): 4 ScalableCore Units [Photo, scale marks: 45 cm, 30 cm]
  14. 16 Nodes (4×4): 16 ScalableCore Units [Photo, scale marks: 45 cm, 30 cm]
  15. 64 Nodes (8×8): 64 ScalableCore Units. Scalable Extension!
  16. ScalableCore system 1.1: Components
     • ScalableCore Unit: FPGA board with off-chip SRAM
       • Xilinx Spartan-3E XC3S500E
       • 512 KiB SRAM (8-bit, 1 port for read/write)
       • Configuration ROM
     • ScalableCore Board: interface board bridging Units
       • Power regulator & SD card slot
  17. ScalableCore system 1.1: Logic Hierarchy
     [Figure: hierarchy of logic blocks]
     • Target Core (a Node in M-Core): Core, INCC, Router, Local Memory (Interface)
     • System Functions: Interface Register, Arbiter, Memory Multiplexer, Ser/Des, Device Controller, Initializer
  18. ScalableCore system 1.1: Logic Architecture
     [Figure: block diagram of one ScalableCore Unit (Spartan-3E FPGA)]
     • Off-chip parts: Off-chip SRAM, SD Card, Configuration ROM (XCF04S), JTAG port
     • Memory path: SRAM Controller, SD Card Controller, Memory Controller, Memory Multiplexer, Node Memory
     • Core: Fetch Unit, Decoder, Register File, Execution Unit, Memory Access Unit, State Machine Controller
     • INCC: DMA Generator/Receiver, DMA Register, INCC Register
     • Router: Arbiter, XBAR
     • Interface Registers (IR) between blocks; Ser/Des links to/from adjacent Units; Clock and Reset
  19. Two key techniques
     • Local Barrier Synchronization
       • Each FPGA holds one Node of M-Core (or another tile architecture)
       • To preserve cycle accuracy, the FPGAs must handshake on the simulation state
         • All-to-all handshake: overhead grows with the number of cores
       • Our target is a tile architecture, so handshaking with only the 4 neighbors is enough
     • Virtual Cycle
       • How to emulate complex hardware?
         • Ex.) a larger number of memory ports
       • Use multiple FPGA cycles for 1 target cycle
  20. Local Barrier Synchronization
     • Handshakes with the 4 neighbor FPGAs
       • Constant handshaking overhead that does not grow with the number of target cores
       • So it achieves scalable simulation speed
     [Figure: a Unit and its four neighbors (Units 0 to 3); in each of Cycle 1 and Cycle 2 the Unit sends to Units 0, 1, 2, 3 and receives from Units 0, 1, 2, 3]
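The neighbor-only handshake on this slide can be illustrated with a small software model. This is a sketch under stated assumptions, not the authors' hardware logic: the names `neighbors`, `handshakes_per_unit` and `simulate` are hypothetical, units are visited in a random order to stand in for FPGAs running at slightly different paces, and a unit may advance only when every mesh neighbor has reached its current target cycle.

```python
import random


def neighbors(x, y, width, height):
    """Return the existing mesh neighbors (north, east, south, west) of unit (x, y)."""
    candidates = [(x, y - 1), (x + 1, y), (x, y + 1), (x - 1, y)]
    return [(nx, ny) for nx, ny in candidates if 0 <= nx < width and 0 <= ny < height]


def handshakes_per_unit(width, height):
    """Worst-case handshakes per unit per target cycle: (neighbor-only, all-to-all)."""
    local = max(
        len(neighbors(x, y, width, height))
        for y in range(height)
        for x in range(width)
    )
    all_to_all = width * height - 1
    return local, all_to_all


def simulate(width, height, cycles, seed=0):
    """Advance all units to `cycles` using neighbor-only barriers.

    Returns (final cycle per unit, largest cycle gap ever seen between neighbors).
    """
    rng = random.Random(seed)
    cycle = {(x, y): 0 for y in range(height) for x in range(width)}
    units = list(cycle)
    max_skew = 0
    while min(cycle.values()) < cycles:
        rng.shuffle(units)  # hypothetical: models units progressing in arbitrary order
        for u in units:
            nbrs = neighbors(u[0], u[1], width, height)
            # Local barrier: advance only if every neighbor has reached our cycle.
            if cycle[u] < cycles and all(cycle[n] >= cycle[u] for n in nbrs):
                cycle[u] += 1
                for n in nbrs:
                    max_skew = max(max_skew, abs(cycle[u] - cycle[n]))
    return cycle, max_skew


if __name__ == "__main__":
    print(handshakes_per_unit(4, 4))  # neighbor-only vs all-to-all for 16 Nodes
    print(handshakes_per_unit(8, 8))  # neighbor-only vs all-to-all for 64 Nodes
    final, skew = simulate(8, 8, cycles=100)
    print(set(final.values()), skew)
```

In this model the neighbor-only count stays at 4 as the mesh grows, while an all-to-all scheme would need one handshake per other unit. Adjacent units never drift more than one target cycle apart, which is the property a cycle-accurate mesh simulation relies on.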
  21. Virtual Cycle
     • Multiple FPGA clock cycles for 1 target clock cycle
       • Virtually complex hardware built from simple FPGA resources
         • Example: a multiport RAM emulated by driving a 1-port RAM multiple times
     [Figure: timeline of one Virtual Cycle (Virtual Cycle N, then N+1)]
     • Target Circuit State: drive the circuits of the target components (Core, INCC, Router)
     • Memory Access via Memory Multiplexer: process the memory accesses interleaved (Core (IF), Core (L/S), INCC Send, INCC Recv)
     • Data Sender via Serial I/Os: start sending, then send the synchronized data via Serial I/O (North, East, West, South)
     • Data Receiver via Serial I/Os: receive the synchronized data via Serial I/O (North, East, West, South), then finish synchronization
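The multiport-RAM example on this slide can be sketched in software as follows. It is an illustrative model only: `SinglePortRam`, `virtual_cycle` and `fpga_cycles_per_target_cycle` are hypothetical names, and the read-before-write ordering inside one target cycle is an assumption rather than something the slides specify.

```python
class SinglePortRam:
    """A RAM that can serve exactly one read or write per FPGA clock cycle."""

    def __init__(self, size):
        self.data = [0] * size
        self.fpga_cycles = 0  # FPGA cycles spent on memory accesses so far

    def read(self, addr):
        self.fpga_cycles += 1
        return self.data[addr]

    def write(self, addr, value):
        self.fpga_cycles += 1
        self.data[addr] = value


def virtual_cycle(ram, requests):
    """Serve all port requests of ONE target cycle on a single-port RAM.

    `requests` holds ("read", addr) or ("write", addr, value) tuples, one per
    emulated port. Assumption: every read observes the memory state from the
    start of the target cycle, so reads are served before writes.
    """
    reads = [r for r in requests if r[0] == "read"]
    writes = [r for r in requests if r[0] == "write"]
    results = [ram.read(addr) for _, addr in reads]
    for _, addr, value in writes:
        ram.write(addr, value)
    return results


def fpga_cycles_per_target_cycle(fpga_hz, target_hz):
    """FPGA clock cycles available for one target (virtual) cycle."""
    return fpga_hz // target_hz


if __name__ == "__main__":
    ram = SinglePortRam(16)
    virtual_cycle(ram, [("write", 3, 7)])
    # Three emulated ports in the same target cycle, served one after another.
    print(virtual_cycle(ram, [("read", 3), ("write", 3, 9), ("read", 3)]))
    print(ram.fpga_cycles)
    # Figures reported in the evaluation slides: 45 MHz FPGA clock, 1000 K target cycles/sec.
    print(fpga_cycles_per_target_cycle(45_000_000, 1_000_000))
```

If the 45 MHz FPGA frequency and the 1000 K cycle / sec simulation speed from the evaluation slides both apply, one virtual cycle would span about 45 FPGA cycles, which has to cover circuit evaluation, interleaved memory accesses and the serial synchronization.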
  22. Evaluation
     • Evaluation Points
       • Simulation Speed [K cycle / sec]
       • Power [W]
     • Environment
       • ScalableCore system 1.1 (FPGA-based simulator)
         • Freq.: 45 MHz
       • SimMc 1.1 (software simulator of M-Core)
         • Intel Core2Duo, Memory 4 GB, gcc 4.1.2, Debian 5
     • # Nodes
       • 16, 32, 48, 64
  23. Evaluation: Simulation Speed [K cycle / sec]
     • = Clock frequency of the target processor [KHz]
       • Software simulator: speed degrades as the number of target cores increases
       • ScalableCore system: constant speed
     • Relative Speed
       • The relative speed increases with the number of cores
         • In the simulation of 64 Nodes, it achieves a 14.2x speedup
     [Charts: Simulation Speed and Relative Speed vs. # Nodes]
       # Nodes    ScalableCore system [K cycle / sec]    Software Simulator [K cycle / sec]    Relative Speed
       16         1000                                   343                                   2.9
       32         1000                                   149                                   6.7
       48         1000                                   96                                    10.4
       64         1000                                   70                                    14.2
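The relative speed on this slide is the ratio of the two simulation speeds. The sketch below recomputes it from the values read off the chart; the dictionary and function names are hypothetical, and because the chart values are rounded, the recomputed ratios only approximate the reported ones (for 64 Nodes the ratio comes out near 14.3 rather than the reported 14.2).

```python
# Simulation speeds in K cycle / sec, keyed by number of Nodes (values from the chart).
SCALABLECORE_SPEED = {16: 1000, 32: 1000, 48: 1000, 64: 1000}
SOFTWARE_SPEED = {16: 343, 32: 149, 48: 96, 64: 70}


def relative_speed(fpga_speed, software_speed):
    """Speedup of the FPGA system over the software simulator, per Node count."""
    return {n: fpga_speed[n] / software_speed[n] for n in sorted(fpga_speed)}


if __name__ == "__main__":
    for nodes, ratio in relative_speed(SCALABLECORE_SPEED, SOFTWARE_SPEED).items():
        print(f"{nodes} Nodes: {ratio:.1f}x")
```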
  24. Evaluation: Power [W]
     • = Energy consumption of the system per second
       • Software simulator: constant power consumption [W]
       • ScalableCore system: power [W] increases with the number of Nodes
     • Relative Efficiency (= ratio of the energy used to simulate 1 clock cycle on the target)
       • Efficiency improves as the number of target cores increases (up to 23.5 at 64 Nodes in the chart)
     [Charts: Power and Relative Efficiency vs. # Nodes]
       # Nodes    ScalableCore system [W]    Software Simulator [W]    Relative Efficiency
       16         13                         84                        19.2
       32         26                         84                        22.2
       48         38                         84                        22.9
       64         51                         84                        23.5
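One plausible reading of the relative efficiency metric is the energy per simulated target cycle of the software simulator divided by that of the ScalableCore system, where energy per cycle is power divided by simulation speed. The sketch below applies that reading to the chart values; the names are hypothetical, the formula is an assumption about how the slide's numbers were derived, and since the inputs are rounded the results only roughly match the reported 19.2, 22.2, 22.9 and 23.5.

```python
# Values read from the charts, keyed by number of Nodes.
SCALABLECORE_POWER_W = {16: 13, 32: 26, 48: 38, 64: 51}
SOFTWARE_POWER_W = {16: 84, 32: 84, 48: 84, 64: 84}
SCALABLECORE_SPEED_KCPS = {16: 1000, 32: 1000, 48: 1000, 64: 1000}
SOFTWARE_SPEED_KCPS = {16: 343, 32: 149, 48: 96, 64: 70}


def energy_per_cycle(power_w, speed_kcps):
    """Energy in joules spent per simulated target cycle."""
    return power_w / (speed_kcps * 1000.0)


def relative_efficiency(nodes):
    """Software energy per cycle divided by ScalableCore energy per cycle (assumed definition)."""
    software = energy_per_cycle(SOFTWARE_POWER_W[nodes], SOFTWARE_SPEED_KCPS[nodes])
    fpga = energy_per_cycle(SCALABLECORE_POWER_W[nodes], SCALABLECORE_SPEED_KCPS[nodes])
    return software / fpga


if __name__ == "__main__":
    for nodes in (16, 32, 48, 64):
        print(f"{nodes} Nodes: {relative_efficiency(nodes):.1f}")
```

Under this reading the ScalableCore system draws more power as Nodes are added, yet its energy per simulated cycle still falls relative to the software simulator, because the software simulator slows down faster than the FPGA system's power grows.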
  25. Conclusion
     • ScalableCore system 1.1: an FPGA-based scalable simulation system for tile architecture evaluations
       • Multiple FPGAs
       • Two key techniques
         • Virtual Cycle
         • Local Barrier Synchronization
       • 14.2 times faster simulation than the software simulator
         • When simulating a more detailed architecture, the speedup becomes even larger
     • Future Work
       • Off-chip DRAM support
       • Virtually combining multiple FPGAs for a large core
       • Time-multiplexed driving for higher hardware utilization
