1. An FPGA-based Scalable Simulation Accelerator called ScalableCore is presented for simulating Tile architectures like the M-Core manycore processor.
2. ScalableCore partitions the target processor across multiple FPGAs, with each FPGA representing a "ScalableCore Unit" containing part of the processor. Units are connected via a "ScalableCore Board" to simulate the entire processor faster.
3. An initial ScalableCore system was implemented to simulate the M-Core manycore processor with up to 64 cores distributed across 64 ScalableCore Units/FPGAs. This allows simulation speed to scale with the number of FPGAs used.
ScalableCore System: A Scalable Many-core Simulator by Employing Over 100 FPGAs (Shinya Takamaeda-Y)
The document summarizes a presentation about the ScalableCore System, a scalable many-core simulator that employs over 100 FPGAs. It maps a target many-core processor across multiple FPGA boards, each simulating a tile/core. This allows achieving scalable simulation speeds as the number of target cores increases. The evaluation reports resource usage and shows simulation speeds faster than software simulators as the number of simulated nodes increases from 16 to 100.
Cameron Swen is the Divisional Marketing Manager for AMD’s Embedded Solutions Division. He is responsible for outbound marketing and works with AMD’s customers to develop and market board- and system-level solutions to serve the COTS market.
High Performance Computing Infrastructure: Past, Present, and Future (karl.barnes)
This document discusses high performance computing infrastructure from the past to present and future. It begins with an introduction to reconfigurable computing and describes the Bison Configurable Digital Signal Processor and its design flow. It discusses function cores and modules that have been developed. It also describes a remote reconfigurable computer called RARE and a parallel and configurable computer system. Finally, it discusses high performance weather forecast modeling and a proposed reconfigurable and open architecture module for unmanned systems.
Presentation at the FreedomHEC 2012 conference. 0xlab extends DMTCP (Distributed Multi-Threaded CheckPointing) to enable Android checkpointing, allowing the system to resume from a stored state for faster Android boot times and a better product field-trial experience.
This document provides an introduction to GPU computing. It discusses the architectural differences between CPUs and GPUs and when each is better suited for certain tasks. It also overviews several GPU programming models such as CUDA, OpenCL, and directives. Finally, it discusses approaches for analyzing GPU performance, including using explicit events, the CUDA profiler, and CrayPAT tools.
The document discusses implementing checkpointing for Android to speed up boot time and development process. It proposes checkpointing processes to save their state, then restore that state to resume execution faster after crashes or reboots. This would allow resuming to a stored state for a faster Android boot. Challenges include checkpointing the network stack state and sockets. Existing checkpointing mechanisms like CryoPID and BLCR are mentioned. DMTCP is discussed as it supports checkpointing applications without modifications in userspace.
This document describes a test architecture that separates parallel program communication from computation kernels to enable future partial dynamic reconfiguration of processing elements (PEs) on FPGAs. The architecture implements static softcore processors as test PEs on a Xilinx Virtex 5 FPGA. One PE acts as a host cell running MPI for communication, while other PEs act as computing cells running computation kernels. The NAS Parallel Benchmarks integer sort is used to benchmark communication and computation performance on this architecture.
This document discusses delay tolerant streaming services for transmitting live video from mobile devices over unstable mobile ad-hoc networks. It motivates the need for such services when conventional network infrastructure is unavailable. The approach involves building an adaptive overlay network on top of the mobile ad-hoc network to enable delayed and disrupted video streaming. Several technical challenges are outlined and initial results are highlighted from experiments and simulations evaluating the feasibility of video streaming over mobile ad-hoc networks formed by mobile phones. Future work is discussed around developing a prototype system and exploring fundamental changes needed to support emerging applications and technologies.
Toward a practical “HPC Cloud”: Performance tuning of a virtualized HPC cluster (Ryousei Takano)
1) Performance tuning methods for HPC Cloud include PCI passthrough, NUMA affinity, and reducing VMM noise to improve performance and close the gap with bare metal machines.
2) Evaluation of MPI and HPC applications on a 16-node cluster showed PCI passthrough improved MPI bandwidth close to bare metal, and NUMA affinity improved performance up to 2%.
3) Parallel efficiency of coarse-grained applications was comparable to bare metal, but fine-grained applications saw up to 22% degradation due to communication overhead and virtualization.
Xvisor is an open source lightweight hypervisor for ARM architectures. It uses a technique called cpatch to modify guest operating system binaries, replacing privileged instructions with hypercalls. This allows the guest OS to run without privileges in user mode under the hypervisor. Xvisor also implements virtual CPU and memory management to isolate guest instances and virtualize physical resources for multiple operating systems.
GPUs are specialized processors designed for graphics processing. CUDA (Compute Unified Device Architecture) allows general purpose programming on NVIDIA GPUs. CUDA programs launch kernels across a grid of blocks, with each block containing multiple threads that can cooperate. Threads have unique IDs and can access different memory types including shared, global, and constant memory. Applications that map well to this architecture include physics simulations, image processing, and other data-parallel workloads. The future of CUDA includes more general purpose uses through GPGPU and improvements in virtual memory, size, and cooling.
This document discusses modifications made to the Xen code to create Xenon, a high-assurance version of Xen. It describes simplifying and refactoring the code based on complexity metrics and modularity guidelines. Construction guidelines for Xenon include adding comments, pseudocode design language files, readme files, formatting tools, and limits on complexity, abstraction, and coding practices. The goal is to develop a separation hypervisor with an evidence package for high assurance.
Toward a practical “HPC Cloud”: Performance tuning of a virtualized HPC cluster (Ryousei Takano)
This document evaluates the performance of a virtualized HPC cluster using the HPC Challenge benchmark suite. It investigates three performance tuning techniques: PCI passthrough to bypass virtualization overhead for the network interface card, NUMA affinity to improve memory access performance, and reducing "VMM noise" like unnecessary services on the host OS. The results show these techniques can improve performance of the virtualized cluster to be close to that of a non-virtualized or "bare metal" system, realizing a more practical "true HPC Cloud."
This document proposes a thread clustering technique for sharing-aware thread scheduling on multiprocessor systems. It detects sharing patterns between threads using hardware performance counters and samples of remote cache accesses. Threads are clustered based on their sharing signatures to improve data locality and reduce cross-chip traffic. Experimental results show the approach reduces remote cache accesses by up to 70% and improves performance up to 7% across several workloads.
[Harvard CS264] 05 - Advanced-level CUDA Programming (npinto)
The document discusses optimizations for memory and communication in massively parallel computing. It recommends caching data in faster shared memory to reduce loads and stores to global device memory. This can improve performance by avoiding non-coalesced global memory accesses. The document provides an example of coalescing writes for a matrix transpose by first loading data into shared memory and then writing columns of the tile to global memory in contiguous addresses.
[Harvard CS264] 11a - Programming the Memory Hierarchy with Sequoia (Mike Bau...) (npinto)
This document discusses performance optimization of GPU kernels. It outlines analyzing kernels to determine if they are limited by memory bandwidth, instruction throughput, or latency. The profiler can identify limiting factors by comparing memory transactions and instructions issued. Source code modifications for memory-only and math-only versions help analyze memory vs computation balance and latency hiding. The goal is to optimize kernels by addressing their most significant performance limiters.
[Harvard CS264] 11b - Analysis-Driven Performance Optimization with CUDA (Cli...) (npinto)
This document discusses performance optimization of GPU kernels. It outlines analyzing kernels to determine if they are limited by memory bandwidth, instruction throughput, or latency. The profiler can identify limiting factors by comparing memory transactions and instructions issued. Source code modifications for memory-only and math-only versions help analyze memory vs computation balance and latency hiding. The goal is to optimize kernels by addressing their most significant performance limiters.
PEER 1 Offers NVIDIA GPU to Accelerate High Performance Applications
PEER 1 has teamed up with NVIDIA, the creator of the GPU and a world leader in visual computing, to provide high-performance GPU cloud applications. NVIDIA’s GPUs are well known for making customer software run faster, and PEER 1 offers a number of services that run on NVIDIA’s GPUs. PEER 1’s cloud service is built on NVIDIA Tesla GPUs, delivering supercomputing performance in the cloud to solve much tougher problems.
Algorithmic Memory Increases Memory Performance by an Order of Magnitude (chiportal)
Algorithmic memory increases memory performance by an order of magnitude using algorithms and memory macros. It presents a standard memory interface while adding no clock cycle latency. This allows creating multiport functionality from single-port physical memory. Algorithmic memory lowers area and power while increasing available memory ports and clock performance compared to physical memory alone. It provides configurable high performance, density-efficient, and power-efficient memories to alleviate the growing processor-embedded memory performance gap.
This presentation covers the working model of processes, threads, system calls, memory operations, Binder IPC, and interactions with Android frameworks.
Dan Schatzberg, Jonathan Appavoo, Orran Krieger, and Eric Van Hensbergen. Scalable elastic systems architecture. In Proceedings of the ASPLOS Runtime Environment/Systems, Layering, and Virtualized Environments (RESoLVE) Workshop, March 2011.
DFX Architecture for High-performance Multi-core Microprocessors (Ishwar Parulkar)
This presentation was given at ITC 2008 (International Test Conference). It deals with DFX challenges and solutions for high-core-count multi-core microprocessors. Acknowledgment: co-authors on the ITC presentation were Gaurav Agarwal, Sriram Anandakumar, Gordon Liu, Rajesh Pendurkar, Krishna Rajan, and Frank Chiu.
This document summarizes an MIT lecture on GPU cluster programming using MPI. It provides administrative details such as homework due dates and project information. It also announces various donations of computing resources for the class, including Amazon AWS credits and a Tesla graphics card for the best project. The lecture outline covers the problem of computations too large for a single CPU, an introduction to MPI, MPI basics, using MPI with CUDA, and other parallel programming approaches.
A CGRA-based Approach for Accelerating Convolutional Neural Networks (Shinya Takamaeda-Y)
The document presents an approach for accelerating convolutional neural networks (CNNs) using a coarse-grained reconfigurable array (CGRA) called EMAX. EMAX features processing elements with local memory to improve data locality and memory bandwidth utilization. CNN computations like convolutions are mapped to EMAX by assigning weight matrices to constant registers and performing numerous small matrix multiplications in parallel. Evaluation shows EMAX achieves better performance per memory bandwidth and area than GPUs for CNN workloads due to its optimization for small matrix operations.
This document provides information about using high-level programming languages to generate hardware implementations on FPGAs. It discusses how high-level synthesis (HLS) can be used to synthesize register transfer level (RTL) descriptions from C/C++ or Python code. This allows hardware to be programmed at a higher level of abstraction without having to manually write RTL code. Specific HLS tools mentioned include Xilinx Vivado HLS, Altera OpenCL, Veriloggen for Python, and synthesizing hardware from languages like C, C++, Java, and Python.
Debian Linux on Zynq (Xilinx ARM-SoC FPGA) Setup Flow (Vivado 2015.4) (Shinya Takamaeda-Y)
The document describes the process to set up Debian Linux on a Zynq FPGA board using a Zybo board as a reference platform. The key steps include:
1. Developing the hardware design in Vivado, including adding a CPU, GPIO for LEDs and switches, and generating a bitstream;
2. Compiling U-boot and the Linux kernel, as well as creating a device tree and root filesystem;
3. Setting up an SD card and booting the system from the SD card.
This document provides an overview of system on chip (SoC) design. It discusses that a SoC integrates all components of an electronic system onto a single chip, including digital, analog and radio frequency functions. The SoC design process involves identifying user needs and integrating various intellectual property blocks. It describes the SoC design flow, fundamentals like using soft and hard IP cores, and considerations like architecture strategy and validation. Key aspects covered include SoC architecture, on-chip buses to connect IP cores, and examples of commercial SoCs.
(1) An FPGA is a field-programmable gate array that contains configurable digital components that can be interconnected by the user. (2) The Advanced Digital Technologies student group uses FPGAs for projects such as a datalogger and the X-ISCKER embedded processor design. (3) X-ISCKER is an open source FPGA-based embedded processor project that aims to teach computer architectures through implementing RISC and CISC processors on an FPGA.
This document provides an overview of FPGA technology. It describes that an FPGA is a field programmable gate array that can be reprogrammed after manufacturing. The core components of an FPGA include look-up tables, flip-flops, multiplexors, I/O blocks, programmable interconnects, and SRAM memory cells. FPGAs offer advantages over ASICs like quick time to market and reprogrammability. Major FPGA manufacturers like Xilinx and Altera integrate additional components into their devices like RAM blocks, DSP blocks, and embedded processor cores.
This document provides an overview of system on chip (SoC) design. It discusses that a SoC integrates all components of an electronic system onto a single chip, and that SoC design involves identifying user needs and integrating various intellectual property blocks. The document then covers SoC fundamentals like the use of soft and hard IP cores, the design flow from specification to fabrication, and strategies for addressing SoC complexity through partitioning, abstraction levels, and reuse of pre-designed components.
The presentation provides an introduction to the emulation world, in particular to the mythical Commodore 64 and its peripherals, such as the disk drive, printer, and cartridges. To truly emulate the software written for this 8-bit home computer it is mandatory to be as accurate as possible and reproduce every single aspect of the real machine, starting from the chips that compose the hardware architecture. Besides the emulation topics, the presentation addresses some Scala performance issues that come up when you have to optimize low-level operations. It ends with a demo of the emulator running a game and a demoscene production, one of the hardest kinds of software to emulate.
This document discusses image processing applications using Vivado for FPGAs. It provides information on FPGA architecture including distributed memory, block RAM features, and core generator. An example of a real-time breast cancer diagnosis application using YOLO on an FPGA board is described. A second example discusses implementing CCSDS standard DWT-based hyperspectral image decompression on an FPGA using techniques like Haar wavelet transform and MAP encoding.
The document summarizes a new type of smart camera called the PC Camera. The PC Camera integrates a fully functional high-performance industrial PC inside the camera. This allows for zero CPU overhead on image data delivery and a true zero copy paradigm. The PC Camera uses an AMD accelerated processing unit (APU) which collocates a CPU and GPU on a single die. This provides very high computational performance of over 90 GFlops in a small form factor while avoiding the limitations of traditional smart cameras.
Altreonic was spun off in 2008 from Eonic Systems to focus on real-time operating systems using formal techniques. Their OpenComRTOS is a small, network-centric real-time OS that uses CSP concurrency and can scale from 1 to over 10,000 nodes. It provides priority-based communication and fault tolerance and has been implemented on many heterogeneous platforms from DSPs to many-core systems.
The document discusses AMD's Barcelona quad core microprocessor. It provides details on the Barcelona architecture including its quad core die layout with two cores per module and shared L3 cache. It also examines AMD's 65nm transistor structure and SRAM cache design. Performance comparisons are made between AMD's native Barcelona quad core and Intel's quad core solution using two dual core dies. Key advantages and challenges for both AMD and Intel's quad core approaches are identified.
A Dataflow Processing Chip for Training Deep Neural Networks (inside-BigData.com)
In this deck from the Hot Chips conference, Chris Nicol from Wave Computing presents: A Dataflow Processing Chip for Training Deep Neural Networks.
Watch the video: https://wp.me/p3RLHQ-k6W
Learn more: https://wavecomp.ai/ and http://www.hotchips.org/
Krupesh Patel has over 5 years of experience designing and developing FPGA IP cores. He has experience with SD host controllers, NAND flash memory controllers, microcontrollers, and error correction coding. Currently he is working on the design of an SD UHS-II host controller IP core. Previously he has designed IP cores compliant with ONFI, SD, and eMMC specifications.
Mirabilis Design AMD Versal System-Level IP Library (Deepak Shankar)
Mirabilis Design provides the VisualSim Versal Library, which enables system architects and algorithm designers to quickly map signal-processing algorithms onto the Versal FPGA and define the fabric based on the measured performance. The Versal IP library supports all of the device's heterogeneous resources.
This document summarizes a presentation on reverse engineering the Rocket-Chip SoC generator to develop a customized SoC called Aghaaz. The presentation covers deconstructing the Rocket-Chip software architecture, developing a Micro-Architecture and Software Specification (MASS) document, configuring an Aghaaz SoC using the MASS document, and generating the SoC from the Rocket-Chip generator. Key aspects included developing object-oriented representations of Rocket-Chip modules, flowcharts to explain the code, and configuring an RV32 core with caches and extensions.
Various processor architectures are described in this presentation. It could be useful for people working for h/w selection and processor identification.
FPGA_prototyping proccesing with conclusionPersiPersi1
This document discusses FPGA prototyping and system on chip (SoC) design using the Xilinx Zynq architecture. It begins with an overview of FPGA prototyping benefits like architecture exploration, software development and validation. Next, it describes the basic elements of a typical SoC like processors, memory and peripherals. It then introduces the Zynq architecture which combines an ARM processor with programmable logic on a single chip. Key aspects of the Zynq such as the processing system, application processing unit, external interfaces and programmable logic resources are explained. Memory mapped and FIFO interfaces for hardware/software communication are also covered. Finally, the basic design flow for Zynq SoC
Industrial trends in heterogeneous and esoteric computePerry Lea
This document discusses several emerging computing architectures including The Machine, computational memory, computational RAM, managed language accelerators, and neuromorphic engines. For each architecture, it outlines the key technical claims and challenges, and provides a prediction on the technology's likelihood of widespread adoption and penetration into markets like mobile, embedded, and HPC. Overall, the document analyzes these novel approaches against the realities of technology maturation, programming difficulties, application limitations, customer acceptance, and commercial viability.
This document discusses streaming SIMD extensions (SSE) and how to use SIMD instructions to boost program performance. It defines SSE as a set of CPU instructions for applications like signal processing that use single instruction, multiple data (SIMD) parallelism. The document outlines what SSE is, the advantages of SIMD, how to identify if an application can benefit from SSE, different SSE versions, coding methods like assembly and intrinsics, and references for further information.
Ajay Kumar Bandaru is a senior layout design engineer with over 5 years of experience working on memory designs from 110nm to 14nm nodes. He has expertise in SRAM, register file, and ROM layouts as well as backend verification. Some of his responsibilities include floorplanning, placement, developing compiler-compatible core arrays, and leaf cell and instance-level design rule and layout versus schematic checks. He is proficient with EDA tools from Cadence and Synopsys and has worked on projects for various foundries and clients.
TitanIC presented, "ODSA Use Case - SmartNIC," at the ODSA Workshop. The charter of the ODSA (Open Domain Specification Architecture) Workgroup is to define an open specification that enables building of Domain Specific Accelerator silicon using best-of-breed components from the industry made available as chiplet dies that can be integrated together as Lego blocks on an organic substrate packaging layer. The resulting multi-chip module (MCM) silicon can be produced at significantly lower development and manufacturing costs, and will deliver much needed performance per watt and performance per dollar efficiencies in networking, security, machine learning and other applications. The ODSA Workgroup also intends to deliver implementations of the specification as board-level prototypes, RTL code and libraries.
This document discusses hardware trends and challenges for building exascale computers. It describes the evolution of processor/node architectures including multi-core and many-core designs. Reaching exascale performance will require addressing power consumption, concurrency, scalability, and fault tolerance issues. Evolutionary paths using commodity processors are unlikely to succeed, while aggressive approaches using clean-sheet designs for low-power customized chips may be needed to achieve exascale performance by 2018. International efforts are underway to develop exascale systems, but overcoming technical challenges to efficiently utilize extreme parallelism remains difficult.
Similar to An FPGA-based Scalable Simulation Accelerator for Tile Architectures @HEART2011 (20)
This document discusses NNgen, a tool for generating hardware implementations of neural networks from high-level models. It can generate optimized RTL and IP-XACT from models defined using frameworks like TensorFlow or ONNX. NNgen uses the Veriloggen library for hardware synthesis from Python, generating FSMs and scheduled pipelines to implement DNN layers as hardware accelerators. It aims to bridge the gap between deep learning and hardware for deploying neural networks in embedded systems.
This document discusses NNgen, a tool for generating neural network hardware implementations from TensorFlow models. NNgen takes a TensorFlow model as input, performs optimizations, and generates an FPGA implementation including a control unit, computing units, RAM blocks, and interconnects. It outputs RTL code and an IP-XACT description of the generated neural network hardware accelerator. Diagrams show an example convolutional layer implementation generated by NNgen, including weight and activation memory blocks, multiply-accumulate units, addition trees, and reuse of computation units via a substream pool.
This document discusses Veriloggen, a Python framework for generating Verilog HDL code from Python. It allows designing hardware at the register-transfer level using Python by mapping Python constructs to Verilog modules, always blocks, wires, and other Verilog constructs. Veriloggen includes modules for RTL generation (Core), connecting Python threads to finite state machines (Thread), and defining streaming hardware (Stream). It aims to support a "Veriloggen for DSL X" approach to create domain-specific hardware description languages in Python.
Veriloggen is a Python library that allows users to generate RTL from Python code for FPGA implementation. It supports threads to model hardware tasks, streams to connect hardware components, and intrinsic functions that map to RTL. The library can synthesize Python code into Verilog for FPGA synthesis and implementation, providing an easier high-level approach to developing FPGA hardware compared to writing RTL directly in Verilog.
Pythonによるカスタム可能な高位設計技術 (Design Solution Forum 2016@新横浜)Shinya Takamaeda-Y
Veriloggen is a Python library that allows users to generate Verilog HDL code from Python. It provides objects and methods to define RTL modules in Python, including module inputs/outputs, registers, assignments, always blocks, etc. When the Veriloggen object is passed to the to_verilog() method, it traverses the object and generates equivalent Verilog HDL code. This allows rapid prototyping of RTL designs in Python without having to write low-level Verilog code directly.
The document discusses Twitter and GitHub accounts, an IPSJ conference, and hardware including an Intel Core i7, FPGA boards from Digilent and ScalableCore, and code snippets for C programs and hardware designs including for a convolutional neural network layer.
A Framework for Efficient Rapid Prototyping by Virtually Enlarging FPGA Resou...Shinya Takamaeda-Y
A Framework for Efficient Rapid Prototyping by Virtually Enlarging FPGA Resources (ReConFig2014@Cancun, Mexico)
flipSyrup, a new framework for rapid prototyping is proposed.
PyCoRAM: Yet Another Implementation of CoRAM Memory Architecture for Modern F...Shinya Takamaeda-Y
This document describes PyCoRAM, a Python-based implementation of the CoRAM memory architecture for FPGA-based computing. PyCoRAM provides a high-level abstraction for memory management that decouples computing logic from memory access behaviors. It allows defining memory access patterns using Python control threads. PyCoRAM generates an IP core that integrates with standard IP cores on Xilinx FPGAs using the AMBA AXI4 interconnect. It supports parameterized RTL design and achieves high memory bandwidth utilization of over 84% on two FPGA boards in evaluations of an array summation application.
For the full video of this presentation, please visit: https://www.edge-ai-vision.com/2024/06/building-and-scaling-ai-applications-with-the-nx-ai-manager-a-presentation-from-network-optix/
Robin van Emden, Senior Director of Data Science at Network Optix, presents the “Building and Scaling AI Applications with the Nx AI Manager,” tutorial at the May 2024 Embedded Vision Summit.
In this presentation, van Emden covers the basics of scaling edge AI solutions using the Nx tool kit. He emphasizes the process of developing AI models and deploying them globally. He also showcases the conversion of AI models and the creation of effective edge AI pipelines, with a focus on pre-processing, model conversion, selecting the appropriate inference engine for the target hardware and post-processing.
van Emden shows how Nx can simplify the developer’s life and facilitate a rapid transition from concept to production-ready applications.He provides valuable insights into developing scalable and efficient edge AI solutions, with a strong focus on practical implementation.
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!SOFTTECHHUB
As the digital landscape continually evolves, operating systems play a critical role in shaping user experiences and productivity. The launch of Nitrux Linux 3.5.0 marks a significant milestone, offering a robust alternative to traditional systems such as Windows 11. This article delves into the essence of Nitrux Linux 3.5.0, exploring its unique features, advantages, and how it stands as a compelling choice for both casual users and tech enthusiasts.
Climate Impact of Software Testing at Nordic Testing DaysKari Kakkonen
My slides at Nordic Testing Days 6.6.2024
Climate impact / sustainability of software testing discussed on the talk. ICT and testing must carry their part of global responsibility to help with the climat warming. We can minimize the carbon footprint but we can also have a carbon handprint, a positive impact on the climate. Quality characteristics can be added with sustainability, and then measured continuously. Test environments can be used less, and in smaller scale and on demand. Test techniques can be used in optimizing or minimizing number of tests. Test automation can be used to speed up testing.
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc
How does your privacy program stack up against your peers? What challenges are privacy teams tackling and prioritizing in 2024?
In the fifth annual Global Privacy Benchmarks Survey, we asked over 1,800 global privacy professionals and business executives to share their perspectives on the current state of privacy inside and outside of their organizations. This year’s report focused on emerging areas of importance for privacy and compliance professionals, including considerations and implications of Artificial Intelligence (AI) technologies, building brand trust, and different approaches for achieving higher privacy competence scores.
See how organizational priorities and strategic approaches to data security and privacy are evolving around the globe.
This webinar will review:
- The top 10 privacy insights from the fifth annual Global Privacy Benchmarks Survey
- The top challenges for privacy leaders, practitioners, and organizations in 2024
- Key themes to consider in developing and maintaining your privacy program
In the rapidly evolving landscape of technologies, XML continues to play a vital role in structuring, storing, and transporting data across diverse systems. The recent advancements in artificial intelligence (AI) present new methodologies for enhancing XML development workflows, introducing efficiency, automation, and intelligent capabilities. This presentation will outline the scope and perspective of utilizing AI in XML development. The potential benefits and the possible pitfalls will be highlighted, providing a balanced view of the subject.
We will explore the capabilities of AI in understanding XML markup languages and autonomously creating structured XML content. Additionally, we will examine the capacity of AI to enrich plain text with appropriate XML markup. Practical examples and methodological guidelines will be provided to elucidate how AI can be effectively prompted to interpret and generate accurate XML markup.
Further emphasis will be placed on the role of AI in developing XSLT, or schemas such as XSD and Schematron. We will address the techniques and strategies adopted to create prompts for generating code, explaining code, or refactoring the code, and the results achieved.
The discussion will extend to how AI can be used to transform XML content. In particular, the focus will be on the use of AI XPath extension functions in XSLT, Schematron, Schematron Quick Fixes, or for XML content refactoring.
The presentation aims to deliver a comprehensive overview of AI usage in XML development, providing attendees with the necessary knowledge to make informed decisions. Whether you’re at the early stages of adopting AI or considering integrating it in advanced XML development, this presentation will cover all levels of expertise.
By highlighting the potential advantages and challenges of integrating AI with XML development tools and languages, the presentation seeks to inspire thoughtful conversation around the future of XML development. We’ll not only delve into the technical aspects of AI-powered XML development but also discuss practical implications and possible future directions.
Communications Mining Series - Zero to Hero - Session 1DianaGray10
This session provides introduction to UiPath Communication Mining, importance and platform overview. You will acquire a good understand of the phases in Communication Mining as we go over the platform with you. Topics covered:
• Communication Mining Overview
• Why is it important?
• How can it help today’s business and the benefits
• Phases in Communication Mining
• Demo on Platform overview
• Q/A
Maruthi Prithivirajan, Head of ASEAN & IN Solution Architecture, Neo4j
Get an inside look at the latest Neo4j innovations that enable relationship-driven intelligence at scale. Learn more about the newest cloud integrations and product enhancements that make Neo4j an essential choice for developers building apps with interconnected data and generative AI.
Removing Uninteresting Bytes in Software FuzzingAftab Hussain
Imagine a world where software fuzzing, the process of mutating bytes in test seeds to uncover hidden and erroneous program behaviors, becomes faster and more effective. A lot depends on the initial seeds, which can significantly dictate the trajectory of a fuzzing campaign, particularly in terms of how long it takes to uncover interesting behaviour in your code. We introduce DIAR, a technique designed to speedup fuzzing campaigns by pinpointing and eliminating those uninteresting bytes in the seeds. Picture this: instead of wasting valuable resources on meaningless mutations in large, bloated seeds, DIAR removes the unnecessary bytes, streamlining the entire process.
In this work, we equipped AFL, a popular fuzzer, with DIAR and examined two critical Linux libraries -- Libxml's xmllint, a tool for parsing xml documents, and Binutil's readelf, an essential debugging and security analysis command-line tool used to display detailed information about ELF (Executable and Linkable Format). Our preliminary results show that AFL+DIAR does not only discover new paths more quickly but also achieves higher coverage overall. This work thus showcases how starting with lean and optimized seeds can lead to faster, more comprehensive fuzzing campaigns -- and DIAR helps you find such seeds.
- These are slides of the talk given at IEEE International Conference on Software Testing Verification and Validation Workshop, ICSTW 2022.
In his public lecture, Christian Timmerer provides insights into the fascinating history of video streaming, starting from its humble beginnings before YouTube to the groundbreaking technologies that now dominate platforms like Netflix and ORF ON. Timmerer also presents provocative contributions of his own that have significantly influenced the industry. He concludes by looking at future challenges and invites the audience to join in a discussion.
UiPath Test Automation using UiPath Test Suite series, part 5DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 5. In this session, we will cover CI/CD with devops.
Topics covered:
CI/CD with in UiPath
End-to-end overview of CI/CD pipeline with Azure devops
Speaker:
Lyndsey Byblow, Test Suite Sales Engineer @ UiPath, Inc.
Introducing Milvus Lite: Easy-to-Install, Easy-to-Use vector database for you...Zilliz
Join us to introduce Milvus Lite, a vector database that can run on notebooks and laptops, share the same API with Milvus, and integrate with every popular GenAI framework. This webinar is perfect for developers seeking easy-to-use, well-integrated vector databases for their GenAI apps.
Introducing Milvus Lite: Easy-to-Install, Easy-to-Use vector database for you...
An FPGA-based Scalable Simulation Accelerator for Tile Architectures @HEART2011
1. 14:30 – 15:00, June 2, 2011
HEART 2011 @Imperial College London
An FPGA-based Scalable Simulation Accelerator for Tile Architectures
Shinya Takamaeda-Yamazaki†‡, Ryosuke Sasakawa†, Yoshito Sakaguchi†, Kenji Kise†
†Tokyo Institute of Technology, Japan
‡JSPS Research Fellow
2. This presentation shows the ScalableCore system
• Multi-FPGA system for tile architecture simulations
– Achieving SCALABLE simulation speed
[Figure: an array of FPGA units simulating the target cores and system functions]
3. Agenda
• Background & Motivation
• Proposal: ScalableCore
• System Implementation
– Overall system
– Components: ScalableCore Unit & Board
– Logic Hierarchy & Architecture
• Evaluation
– Simulation Speed
– Power
• Conclusion
4. Background: Multicores to Many-cores
• Intel Single Chip Cloud Computer: 48 cores (x86)
• TILERA TILE-Gx100: 100 cores (MIPS)
5. Simulation Target Many-core: M-Core
• Tile architecture with a 2D mesh network
– A node has: Core, Local Memory, INCC (DMA controller), and Router
– Local Memory: independent address space; data transfer by DMA
[Figure: 2D mesh of nodes (Core, Local Memory, INCC, Router) surrounded by DRAM controllers]
6. How to evaluate the architectures?
• Customizability vs. simulation speed
– We want to run a large benchmark fast
[Figure: trade-off chart placing Real Chip, FPGA Simulator, and Software Simulator between the two goals, annotated: "real but expensive", "easy construction of ideal system without HW limitations", "faster simulation and customizable", "difficulty to construct"]
7. Less scalable simulation speed on software simulators
• Speed decreases as the number of target cores increases
– SimMc: an M-Core simulator
– Difficult to achieve scalable speed
  • Overhead of cycle-accurate simulation
[Chart: simulation speed on SimMc (M-Core simulator) — 343, 149, 96, and 70 K cycles/sec for 16, 32, 48, and 64 target cores; the speed degradation exceeds the increase in # cores]
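A quick numeric check of the slide's claim (a sketch using the chart's data points): from 16 to 64 target cores, a 4x increase, SimMc's speed falls by about 4.9x, so the degradation indeed outpaces the growth in core count.

```python
# Simulation speeds read off the SimMc chart, in K cycles/sec.
simmc_kcps = {16: 343, 32: 149, 48: 96, 64: 70}

slowdown = simmc_kcps[16] / simmc_kcps[64]   # 343 / 70 = 4.9x slower
core_growth = 64 / 16                        # only 4x more cores

# The per-core simulation cost grows, so the speed is not scalable.
assert slowdown > core_growth
```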
8. Motivation
• Achieve SCALABLE simulation speed
– i.e., keep the simulation speed constant even for a large number of cores
• How to scale the simulation speed?
– Our target architecture: M-Core
  • Tile architecture with a 2D mesh network
– Partition the target processor across multiple FPGAs
[Figure: a many-core processor partitioned into per-tile pieces]
9. Proposal of ScalableCore
• Multiple FPGAs corresponding to the target processor
– Each ScalableCore Unit has a part of the target processor and shares the simulation progress with its neighbor Units
[Figure: ScalableCore Unit (FPGA card with off-chip memory) holding a part of the target processor (M-Core); ScalableCore Board connecting the ScalableCore Units; LCD display for simulation information]
10. Simulation Target Many-core: M-Core
• Tile architecture with a 2D mesh network
– A node has: Core, Local Memory, INCC (DMA controller), and Router
– Local Memory: independent address space; data transfer by DMA
[Figure: the node diagram from slide 5, with a single node highlighted as the current target of the ScalableCore system]
11. ScalableCore system 1.1: Overview
• Simulating the M-Core with up to 64 nodes (= FPGAs)
[Figure: each unit holds one node (Core, Local Memory, INCC, Router) plus system functions; the number of nodes can be increased or decreased]
15. 64 nodes (8×8): 64 ScalableCore Units
Scalable extension!
16. ScalableCore system 1.1: Components
• ScalableCore Unit: FPGA board with off-chip SRAM
– Xilinx Spartan-3E XC3S500E
– 512 KiB SRAM (8-bit, 1 port shared for read/write)
– Configuration ROM
• ScalableCore Board: interface board bridging the Units
– Power regulator & SD card slot
17. ScalableCore system 1.1: Logic Hierarchy
[Figure: logic hierarchy — Target Core (a node in M-Core): Core, INCC, Router, Local Memory (Interface); System Functions: Interface Register, Arbiter, Memory Multiplexer, Ser/Des, Device Controller, Initializer]
18. ScalableCore system 1.1: Logic Architecture
[Figure: block diagram of a ScalableCore Unit (Spartan-3E FPGA) — Core (Fetch Unit, Decoder, Register File, Execution Unit, State Machine Controller), INCC (Memory Access Unit, DMA Generator/Receiver, DMA Register), Router (XBAR with interface registers), Node Memory (Memory Controller, Memory Multiplexer) backed by the off-chip SRAM via an SRAM Controller, an SD Card Controller for devices, a Configuration ROM (XCF04S) with JTAG port, an Arbiter, and four Ser/Des links carrying clock, reset, and interface-register data to/from the adjacent Units]
19. Two key techniques
• Local Barrier Synchronization
– Each FPGA has one node of M-Core (or another tile architecture)
– To preserve cycle accuracy, handshaking of the simulation state is needed
  • All-to-all handshake: overhead increases with the number of cores
– Our target is a tile architecture, so: handshake with only the 4 neighbors
• Virtual Cycle
– How to emulate complex hardware?
  • e.g., a larger number of memory ports
– Use multiple FPGA cycles for 1 target cycle
20. Local Barrier Synchronization
• Handshakes with the 4 neighbor FPGAs
– Constant handshaking overhead that does not grow with the number of target cores
– Thus scalable simulation speed is achieved
[Figure: Unit 4 and its mesh neighbors 0–3; in each of cycles 1 and 2, the unit sends to and receives from Units 0, 1, 2, and 3]
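The local barrier can be sketched in software. The following Python sketch (invented names; the real system implements this in FPGA logic over the Ser/Des links) lets each unit of a mesh advance by one simulated cycle only after it has exchanged state with its 4 neighbors, so the per-cycle synchronization cost stays constant regardless of mesh size:

```python
from collections import deque

class Unit:
    """One mesh tile; advances only after hearing from all 4 neighbors."""
    def __init__(self, x, y, w, h):
        self.x, self.y, self.w, self.h = x, y, w, h
        self.cycle = 0
        self.inbox = {}  # neighbor coordinate -> deque of (cycle, state)

    def neighbors(self):
        for dx, dy in ((0, -1), (0, 1), (-1, 0), (1, 0)):
            nx, ny = self.x + dx, self.y + dy
            if 0 <= nx < self.w and 0 <= ny < self.h:
                yield (nx, ny)

def simulate(w, h, cycles):
    units = {(x, y): Unit(x, y, w, h) for x in range(w) for y in range(h)}
    for u in units.values():
        u.inbox = {n: deque() for n in u.neighbors()}
    for _ in range(cycles):
        # Phase 1: every unit sends its current state to its neighbors.
        for (x, y), u in units.items():
            for n in u.neighbors():
                units[n].inbox[(x, y)].append((u.cycle, f"state@{u.cycle}"))
        # Phase 2: the local barrier — a unit advances only once it holds
        # the current-cycle state from every neighbor.
        for u in units.values():
            assert all(q[0][0] == u.cycle for q in u.inbox.values())
            for q in u.inbox.values():
                q.popleft()
            u.cycle += 1
    return units

units = simulate(4, 4, 10)
assert all(u.cycle == 10 for u in units.values())
```

Because each unit waits only on its 4 neighbors, cycle-accurate lock-step propagates across the whole mesh without any all-to-all barrier.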
21. Virtual Cycle
• Multiple FPGA clock cycles for 1 target clock cycle
– Virtually complex hardware using simple FPGA resources
  • Example: a multiport RAM emulated by driving a 1-port RAM multiple times
[Figure: timeline of one virtual cycle — the target circuit state proceeds (Core, INCC, Router) while the memory accesses are processed, interleaved through the Memory Multiplexer (Core IF, Core L/S, INCC send, INCC recv); then the data sender transmits the synchronized data via the serial I/Os to the North, East, West, and South neighbors, the data receiver collects the same from each direction, and synchronization finishes before virtual cycle N+1 begins]
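As an illustration of the virtual-cycle idea (a software sketch with invented names, not the actual RTL), the single-port RAM below serves the several memory requesters of one target cycle by spending one FPGA cycle per requester:

```python
class SinglePortRAM:
    """One read-or-write access per FPGA cycle, like the 1-port SRAM."""
    def __init__(self, size):
        self.mem = [0] * size
        self.fpga_cycles = 0

    def access(self, addr, data=None):
        self.fpga_cycles += 1          # each access costs one FPGA cycle
        if data is None:
            return self.mem[addr]      # read
        self.mem[addr] = data          # write

def virtual_cycle(ram, requests):
    """Serve every requester of one target cycle, one FPGA cycle each."""
    return {port: ram.access(addr, data)
            for port, (addr, data) in requests.items()}

ram = SinglePortRAM(256)
# One target cycle with 4 requesters is interleaved over 4 FPGA cycles,
# as if the node had a 4-port memory.
virtual_cycle(ram, {"fetch": (0, None), "load_store": (8, 42),
                    "incc_send": (16, None), "incc_recv": (24, 7)})
assert ram.fpga_cycles == 4
```

Stretching each target cycle over several FPGA cycles trades simulation speed for FPGA simplicity, which is what lets a small Spartan-3E emulate a richer node.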
23. Evaluation: Simulation Speed [K cycles/sec]
• = clock frequency of the target processor [KHz]
– Software simulator: speed degrades as the number of target cores increases
– ScalableCore system: constant speed
• Relative speed
– Increasing the number of cores increases the relative speed
  • Simulating 64 nodes achieves a 14.2x speedup
[Chart: the ScalableCore system holds 1000 K cycles/sec at 16, 32, 48, and 64 nodes, while the software simulator drops from 343 to 149, 96, and 70 K cycles/sec; the relative speed grows from 2.9x (16 nodes) to 6.7x, 10.4x, and 14.2x (64 nodes)]
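The relative speeds follow from dividing the constant ScalableCore speed by the software-simulator speed at each node count. A quick check against the chart values (which are rounded, so the 64-node result comes out as 14.3x here versus the slide's 14.2x, presumably computed from the unrounded measurement):

```python
# Speeds in K cycles/sec, as read off the chart.
scalablecore_kcps = 1000
simmc_kcps = {16: 343, 32: 149, 48: 96, 64: 70}

speedup = {n: scalablecore_kcps / s for n, s in simmc_kcps.items()}
for n, x in sorted(speedup.items()):
    print(f"{n} nodes: {x:.1f}x")
```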
24. Evaluation: Power [W]
• = energy consumption of the system per second
– Software simulator: constant power [W]
– ScalableCore system: power increases with the number of nodes [W]
• Relative efficiency
(= ratio of the energy used to simulate 1 clock cycle of the target)
– The more target cores, the more efficient
  • Simulating 64 nodes achieves a 23.5x efficiency
[Chart: the software-simulator host stays at 84 W from 16 to 64 nodes, while the ScalableCore system grows from 13 W (16 nodes) to 26, 38, and 51 W (64 nodes); the relative efficiency grows from 19.2x to 22.2x, 22.9x, and 23.5x]
25. Conclusion
• ScalableCore system 1.1: an FPGA-based scalable simulation system for tile architecture evaluations
– Multiple FPGAs
– Two key techniques
  • Virtual Cycle
  • Local Barrier Synchronization
– 14.2x faster simulation than the software simulator
  • The speedup becomes even larger when simulating a more detailed architecture
• Future work
– Off-chip DRAM support
– Virtually combining multiple FPGAs for a large core
– Time-multiplexed operation for higher hardware utilization