14:30 – 15:00 June 2, 2011
HEART 2011 @Imperial College London


An FPGA-based Scalable Simulation
Accelerator for Tile Architectures

   Shinya Takamaeda-Yamazaki†‡, Ryosuke Sasakawa†,
                       Yoshito Sakaguchi†, Kenji Kise†

                  †Tokyo   Institute of Technology, Japan
                                ‡JSPS   Research Fellow
This presentation introduces the ScalableCore system
 n  Multi-FPGA system for Tile architecture simulations
    l  Achieving SCALABLE simulation speed


[Figure: the ScalableCore system; labels: Target Core, System Function]
Agenda

n  Background & Motivation
n  Proposal: ScalableCore
n  System Implementation
   l  Overall system
   l  Components: ScalableCore Unit & Board
   l  Logic Hierarch & Architecture
n  Evaluation
   l  Simulation Speed
   l  Power
n  Conclusion
Background: Multicores to Many-cores

n  Intel Single Chip Cloud Computer: 48 cores (x86)
n  TILERA TILE-Gx100: 100 cores (MIPS)
Simulation Target Manycore: M-Core
n  Tile architecture with 2D mesh network
   l  A Node has: Core, Local Memory, INCC (DMA controller) and Router
   l  Local Memory: Independent Address Space, Data transfer by DMAs
[Figure: M-Core tile array with DRAM Controllers; each Node contains a Core, Local Memory, INCC, and Router (R)]
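As a rough illustration of what "independent address space, data transfer by DMAs" means, the following Python sketch models two Nodes whose local memories can only exchange data through an explicit copy. The names `Node` and `dma_transfer` and the 1 KiB memory size are hypothetical and are not taken from the M-Core design; the copy stands in for the INCC and Router path.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One tile: only the parts needed for this sketch (position + local memory)."""
    x: int
    y: int
    # Hypothetical size; each Node owns a private address space starting at 0.
    local_memory: bytearray = field(default_factory=lambda: bytearray(1024))


def dma_transfer(src, src_addr, dst, dst_addr, length):
    """Copy `length` bytes from src's local memory into dst's local memory.

    This stands in for a DMA issued through the INCC and carried by the Routers.
    """
    dst.local_memory[dst_addr:dst_addr + length] = \
        src.local_memory[src_addr:src_addr + length]


a = Node(0, 0)
b = Node(1, 0)
a.local_memory[0:4] = b"tile"      # write into Node a's private memory
dma_transfer(a, 0, b, 16, 4)       # explicit transfer to Node b, address 16
print(bytes(b.local_memory[16:20]))
```

Address 16 in one Node and address 16 in another refer to different storage, so the only way data moves between them in this model is the explicit transfer.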
How to evaluate the architectures?
 n  Customizability vs. Simulation Speed
       l  We want to run a large benchmark fast
[Figure: Software Simulator, FPGA Simulator, and Chip arranged along a "Reality" axis]
       l  Annotations: "Easy construction of ideal system without HW limitations", "Faster simulation and customizable", "Difficulty to construct", "Real but expensive"
Less scalability of simulation speed on software simulators
 n  Simulation speed decreases as the number of target cores increases
    l  SimMc: M-Core simulator
    l  Scalable speed is difficult to achieve
        •  Overhead of cycle-accurate simulation
    l  Speed degrades faster than the number of cores grows

   Simulation Speed on SimMc (M-Core simulator)
   # Target Cores    Simulation Speed [K cycle / sec]
   16                343
   32                149
   48                 96
   64                 70
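The claim that speed degrades faster than the core count grows can be checked with a little arithmetic. The Python sketch below uses only the four SimMc data points from the chart; the helper name `slowdown_vs_core_growth` is made up for illustration.

```python
# Simulation speeds read from the SimMc chart [K cycle / sec].
simmc_speed = {16: 343, 32: 149, 48: 96, 64: 70}


def slowdown_vs_core_growth(speeds, base):
    """Return {cores: (core_growth, slowdown)} relative to the `base` core count."""
    out = {}
    for cores, speed in sorted(speeds.items()):
        core_growth = cores / base          # how many times more cores
        slowdown = speeds[base] / speed     # how many times slower
        out[cores] = (core_growth, slowdown)
    return out


ratios = slowdown_vs_core_growth(simmc_speed, base=16)
for cores, (growth, slowdown) in ratios.items():
    print(f"{cores:2d} cores: {growth:.2f}x cores, {slowdown:.2f}x slower")
```

For every point beyond 16 cores the slowdown factor is larger than the core-count factor, which is what the annotation on the chart says.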
Motivation
 n  Achieve the SCALABLE simulation speed
    l  = Keep the constant simulation speed in case of large number of
        cores
 n  How to scale the simulation speed?
    l  Our target architecture: M-Core
        •  Tile architecture with 2D mesh network


   Partitioning of the target processor into multiple FPGAs
   [Figure: a Many-core Processor partitioned into per-FPGA parts]
Proposal of ScalableCore
 n  Multiple FPGAs corresponding to the target processor
    l  Each ScalableCore Unit has a part of the target processor
        and shares the simulation progress with its neighbor Units

      ScalableCore Unit (FPGA card with off-chip memory): holds a part of the target processor
      ScalableCore Board: connects the ScalableCore Units
      LCD Display: shows simulation information
      [Figure: the Target Processor (M-Core) next to the system; labels: Target Core, System Function]
Simulation Target Manycore: M-Core
n  Tile architecture with 2D mesh network
   l  A Node has: Core, Local Memory, INCC (DMA controller) and Router
   l  Local Memory: Independent Address Space, Data transfer by DMAs
[Figure: the same M-Core diagram, with a region marked "Current Target of ScalableCore system"]
ScalableCore system 1.1: Overview
 n  Simulating the M-Core with up to 64 Nodes (= FPGAs)

[Figure: one Node (Core, Local Memory, INCC, Router R) together with System Functions]
    l  Able to increase/decrease the number of Nodes
1 Node: 1 ScalableCore Unit
[Photo; dimension labels: 45cm, 30cm]
4 Nodes (2×2): 4 ScalableCore Units
[Photo; dimension labels: 45cm, 30cm]
16 Nodes (4×4): 16 ScalableCore Units
[Photo; dimension labels: 45cm, 30cm]
64 Nodes (8×8): 64 ScalableCore Units
[Photo] Scalable Extension!
ScalableCore system 1.1: Components

n  ScalableCore Unit
    FPGA board with off-chip SRAM
   l  Xilinx Spartan-3E XC3S500E
   l  512KBi SRAM (8bit, 1 port for read/write)
   l  Configuration ROM




n  ScalableCore Board
    Interface board bridging Units
   l  Power regulator & SD card slot
ScalableCore system 1.1: Logic Hierarchy
[Figure: logic hierarchy, listed top to bottom; side labels: Target Core (a Node in M-Core), System Functions]
   l  Core, INCC, Router
   l  Local Memory (Interface)
   l  Interface Register, Arbiter
   l  Memory Multiplexer, Ser/Des
   l  Device Controller, Initializer
ScalableCore system 1.1: Logic Architecture
[Figure: block diagram of one ScalableCore Unit (FPGA Spartan-3E); blocks shown:]
   l  Off-chip Devices: SRAM, SD, Configuration ROM (XCF04S), JTAG port
   l  SRAM Controller, SD Card Controller
   l  Node Memory: Memory Controller, DMA Register, Memory Multiplexer
   l  Core: Memory Access Unit, Fetch Unit, Register File, Decoder, Execution Unit
   l  INCC: DMA Generator/Receiver, Interface Register
   l  Router: Interface Register, Arbiter, XBAR
   l  IR blocks between the components
   l  State Machine Controller, Clock, Reset
   l  Four Ser/Des blocks to/from Adjacent Units
Two key techniques
n  Local Barrier Synchronization
   l  Each FPGA has one Node of M-Core (or other tile architecture)
   l  To satisfy the cycle accuracy, hand shaking of simulation state is
       needed
       •  All-to-All hand shake: Increasing overhead to the number of cores
   l  Our target is a tile architecture, so …

             Hand shaking by only 4 neighbors

n  Virtual Cycle
   l  How to emulate the complex hardware?
       •  Ex.) larger number of memory ports

       Use multiple FPGA cycles for 1 target cycle
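To make the handshake argument concrete, the Python sketch below counts how many partners one Node has to handshake with per cycle under each scheme, for the mesh sizes used by the system. The function names are hypothetical; the only assumption is a 2D mesh in which a Node talks to its north, east, south, and west neighbors when they exist.

```python
def all_to_all_partners(num_nodes):
    """Partners per Node when every Node handshakes with every other Node."""
    return num_nodes - 1


def mesh_partners(width, height, x, y):
    """Partners of Node (x, y) when it handshakes only with mesh neighbors."""
    count = 0
    for dx, dy in ((0, -1), (1, 0), (0, 1), (-1, 0)):  # N, E, S, W
        nx, ny = x + dx, y + dy
        if 0 <= nx < width and 0 <= ny < height:
            count += 1
    return count


def max_mesh_partners(width, height):
    """Worst case over all Nodes of a width x height mesh."""
    return max(mesh_partners(width, height, x, y)
               for y in range(height) for x in range(width))


for side in (1, 2, 4, 8):
    nodes = side * side
    print(f"{nodes:2d} Nodes: all-to-all {all_to_all_partners(nodes):2d}, "
          f"local {max_mesh_partners(side, side)}")
```

The all-to-all count grows with the number of Nodes, while the local count never exceeds 4 regardless of mesh size.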
Local Barrier Synchronization
 n  Handshakes with 4 neighbor FPGAs
    l  Constant handshaking overhead, not increasing with the
        increasing of # target cores
    l  So it achieves scalable simulation speed

[Figure: Unit 4 surrounded by neighbor Units 0, 1, 2, and 3. In each of Cycle 1 and Cycle 2, it sends to Units 0-3 and receives from Units 0-3]
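The idea can be sketched as a small software model, assuming that a Unit may start target cycle c+1 only after every existing neighbor has finished cycle c. This is an illustrative Python model, not the FPGA implementation: the function names, the random scheduling, and the skew measurement are all assumptions made for the sketch.

```python
import random


def neighbors(x, y, width, height):
    """Yield the mesh neighbors (N, E, S, W) of (x, y) that exist."""
    for dx, dy in ((0, -1), (1, 0), (0, 1), (-1, 0)):
        nx, ny = x + dx, y + dy
        if 0 <= nx < width and 0 <= ny < height:
            yield nx, ny


def run_local_barrier(width, height, target_cycles, seed=0):
    """Advance all Units to `target_cycles` using only neighbor handshakes.

    Returns (finished cycle count per Unit, largest skew seen between neighbors).
    """
    rng = random.Random(seed)
    done = {(x, y): 0 for y in range(height) for x in range(width)}
    units = list(done)
    max_skew = 0
    while min(done.values()) < target_cycles:
        unit = rng.choice(units)            # Units run at unrelated paces
        if done[unit] >= target_cycles:
            continue
        nbrs = list(neighbors(*unit, width, height))
        # Local barrier: data of cycle done[unit] must have arrived from
        # every neighbor before this Unit can simulate the next cycle.
        if all(done[n] >= done[unit] for n in nbrs):
            done[unit] += 1
            for n in nbrs:
                max_skew = max(max_skew, abs(done[unit] - done[n]))
    return done, max_skew


final, skew = run_local_barrier(4, 4, target_cycles=20)
print(set(final.values()), skew)
```

In this model, adjacent Units never drift more than one target cycle apart even though no Unit ever waits on the whole system, which is why the handshake cost per cycle stays fixed as the mesh grows.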
Virtual Cycle
 n  Multiple FPGA clock cycles for 1 target clock cycle
       l  Virtually complex hardware by using simple FPGA equipment
               •  Example. Multiport RAM by driving 1 port RAM multiple times

[Figure: timeline of 1 Virtual Cycle (Virtual Cycle N, followed by Virtual Cycle N+1)]
       l  Proceeding Target Circuit State (Core, INCC, Router): drive the circuit of target components
       l  Interleaved Memory Access via Memory Multiplexer: process the memory accesses in the slots Core (IF), Core (L/S), INCC Send, INCC Recv
       l  Data Sender via Serial I/Os: start sending, then send the synchronized data via Serial I/O (North, East, West, South)
       l  Data Receiver via Serial I/Os: receive the synchronized data via Serial I/O (North, East, West, South), then finish synchronization
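The multiport RAM example can be sketched in Python as follows. The four port names follow the memory-access slots named on the slide (Core IF, Core L/S, INCC Send, INCC Recv); the classes `SinglePortRam` and `VirtualMultiPortRam`, the fixed slot order, and the one-FPGA-cycle-per-slot cost are assumptions made for illustration.

```python
PORTS = ("core_if", "core_ls", "incc_send", "incc_recv")


class SinglePortRam:
    """A RAM that can serve exactly one read or write per FPGA cycle."""

    def __init__(self, size):
        self.data = [0] * size

    def access(self, addr, write_value=None):
        if write_value is None:
            return self.data[addr]          # read
        self.data[addr] = write_value       # write
        return None


class VirtualMultiPortRam:
    """Looks like a multiport RAM per target cycle; uses one port underneath."""

    def __init__(self, size, ports=PORTS):
        self.ram = SinglePortRam(size)
        self.ports = ports
        self.fpga_cycles = 0
        self.target_cycles = 0

    def step(self, requests):
        """Run one target (virtual) cycle.

        `requests` maps a port name to (addr, write_value); write_value None
        means read. Ports are served in a fixed order, one FPGA cycle per slot,
        so a later slot sees a write made by an earlier slot in the same cycle.
        """
        reads = {}
        for port in self.ports:
            self.fpga_cycles += 1           # the slot is spent even if idle
            if port not in requests:
                continue
            addr, write_value = requests[port]
            value = self.ram.access(addr, write_value)
            if write_value is None:
                reads[port] = value
        self.target_cycles += 1
        return reads


mem = VirtualMultiPortRam(size=16)
mem.step({"core_ls": (3, 42)})                               # store
out = mem.step({"core_if": (3, None), "incc_send": (3, None)})  # two reads
print(out, mem.fpga_cycles, mem.target_cycles)
```

Two target cycles cost eight FPGA cycles here; the target sees several ports per cycle while the hardware only ever drives one.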
Evaluation

 n  Evaluation Points
    l  Simulation Speed [K cycle / sec]
    l  Power [W]
 n  Environment
    l  ScalableCore system 1.1 (FPGA-based simulator)
        •  Freq.: 45MHz
    l  SimMc 1.1(Software simulator of M-Core)
        •  Intel Core2Duo, Memory 4GB, gcc4.1.2, Debian 5

 n  # Node
    l  16, 32, 48, 64
Evaluation: Simulation Speed [K cycle/sec]
 n  = Clock frequency of the target processor [kHz]
    l  Software simulator: speed degrades as the number of target cores increases
    l  ScalableCore system: constant speed
 n  Relative Speed
    l  The relative speed increases with the number of cores
        •  In simulation of 64 Nodes, achieves 14.2x speedup

   # Nodes   ScalableCore system   Software Simulator   Relative Speed
             [K cycle / sec]       [K cycle / sec]
   16        1000                  343                   2.9
   32        1000                  149                   6.7
   48        1000                   96                  10.4
   64        1000                   70                  14.2
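The relative speed is simply the ratio of the two simulation speeds, and the same numbers give a rough idea of how long a virtual cycle is. The Python sketch below assumes the 45MHz listed under Environment is the FPGA clock of each Unit; under that assumption, 1000 K target cycles per second corresponds to 45 FPGA cycles per target cycle on average. The variable names are illustrative.

```python
# Data points from the charts [K cycle / sec].
scalablecore_speed = {16: 1000, 32: 1000, 48: 1000, 64: 1000}
simmc_speed = {16: 343, 32: 149, 48: 96, 64: 70}

# Relative speed = FPGA-based speed / software simulator speed.
relative_speed = {n: scalablecore_speed[n] / simmc_speed[n]
                  for n in scalablecore_speed}

# Assumption: the 45MHz in the Environment slide is the FPGA clock.
FPGA_CLOCK_HZ = 45_000_000
target_cycles_per_sec = scalablecore_speed[64] * 1000
fpga_cycles_per_target_cycle = FPGA_CLOCK_HZ / target_cycles_per_sec

for n, r in relative_speed.items():
    print(f"{n:2d} Nodes: {r:.2f}x")
print(f"FPGA cycles per target cycle: {fpga_cycles_per_target_cycle:.0f}")
```

The computed ratios agree with the chart to within rounding of the plotted values.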
Evaluation: Power [W]
 n  = Energy consumption of the system per sec
    l  Software simulator: constant power [W]
    l  ScalableCore system: power increases with the number of Nodes [W]
 n  Relative Efficiency
    (= Ratio of energy used for simulation of 1 clock cycle on the target)
    l  More efficient as the number of target cores increases
        •  In simulation of 64 Nodes, achieves a relative efficiency of 23.5

   # Nodes   ScalableCore system   Software Simulator   Relative Efficiency
             [W]                   [W]
   16        13                    84                   19.2
   32        26                    84                   22.2
   48        38                    84                   22.9
   64        51                    84                   23.5
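The relative efficiency can be approximately reproduced from the power and speed charts: energy per target cycle is power divided by simulation speed, and the relative efficiency is the software simulator's energy per cycle divided by the ScalableCore system's. The Python sketch below uses the plotted values, which appear to be rounded, so the results only approximate the reported figures. The function name is hypothetical.

```python
# Data points from the charts.
power_w = {
    "scalablecore": {16: 13, 32: 26, 48: 38, 64: 51},
    "software": {16: 84, 32: 84, 48: 84, 64: 84},
}
speed_kcps = {  # [K cycle / sec]
    "scalablecore": {16: 1000, 32: 1000, 48: 1000, 64: 1000},
    "software": {16: 343, 32: 149, 48: 96, 64: 70},
}


def energy_per_cycle_uj(watts, kcycles_per_sec):
    """Energy spent per simulated target cycle, in microjoules."""
    joules = watts / (kcycles_per_sec * 1000)   # W / (cycle/s) = J/cycle
    return joules * 1e6


relative_efficiency = {}
for n in (16, 32, 48, 64):
    sw = energy_per_cycle_uj(power_w["software"][n], speed_kcps["software"][n])
    hw = energy_per_cycle_uj(power_w["scalablecore"][n],
                             speed_kcps["scalablecore"][n])
    relative_efficiency[n] = sw / hw
    print(f"{n:2d} Nodes: software {sw:7.1f} uJ, ScalableCore {hw:5.1f} uJ, "
          f"ratio {relative_efficiency[n]:.1f}")
```

The ratio keeps rising with the Node count because the software simulator's energy per cycle grows faster than the FPGA system's power does.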
Conclusion
n ScalableCore system 1.1
  An FPGA-based scalable simulation system
  for tile architecture evaluations
   l  Multiple FPGAs
   l  Two key techniques
       •  Virtual cycle
       •  Local Barrier Synchronization
   l  14.2 times faster simulation than the software simulator
       •  When simulating the more detailed architecture the speedup rate
          becomes the very larger
n  Future Work
   l  Off-chip DRAM support
   l  Virtual combined multiple FPGAs for a large core
   l  Time-multiplexed driven for higher hardware utilization

More Related Content

What's hot

Toward a practical “HPC Cloud”: Performance tuning of a virtualized HPC cluster
Toward a practical “HPC Cloud”: Performance tuning of a virtualized HPC clusterToward a practical “HPC Cloud”: Performance tuning of a virtualized HPC cluster
Toward a practical “HPC Cloud”: Performance tuning of a virtualized HPC clusterRyousei Takano
 
NVidia CUDA Tutorial - June 15, 2009
NVidia CUDA Tutorial - June 15, 2009NVidia CUDA Tutorial - June 15, 2009
NVidia CUDA Tutorial - June 15, 2009Randall Hand
 
Gpu with cuda architecture
Gpu with cuda architectureGpu with cuda architecture
Gpu with cuda architectureDhaval Kaneria
 
Toward a practical “HPC Cloud”: Performance tuning of a virtualized HPC cluster
Toward a practical “HPC Cloud”: Performance tuning of a virtualized HPC clusterToward a practical “HPC Cloud”: Performance tuning of a virtualized HPC cluster
Toward a practical “HPC Cloud”: Performance tuning of a virtualized HPC clusterRyousei Takano
 
thread-clustering
thread-clusteringthread-clustering
thread-clusteringdavidkftam
 
[Harvard CS264] 05 - Advanced-level CUDA Programming
[Harvard CS264] 05 - Advanced-level CUDA Programming[Harvard CS264] 05 - Advanced-level CUDA Programming
[Harvard CS264] 05 - Advanced-level CUDA Programmingnpinto
 
[Harvard CS264] 11a - Programming the Memory Hierarchy with Sequoia (Mike Bau...
[Harvard CS264] 11a - Programming the Memory Hierarchy with Sequoia (Mike Bau...[Harvard CS264] 11a - Programming the Memory Hierarchy with Sequoia (Mike Bau...
[Harvard CS264] 11a - Programming the Memory Hierarchy with Sequoia (Mike Bau...npinto
 
[Harvard CS264] 11b - Analysis-Driven Performance Optimization with CUDA (Cli...
[Harvard CS264] 11b - Analysis-Driven Performance Optimization with CUDA (Cli...[Harvard CS264] 11b - Analysis-Driven Performance Optimization with CUDA (Cli...
[Harvard CS264] 11b - Analysis-Driven Performance Optimization with CUDA (Cli...npinto
 
Algorithmic Memory Increases Memory Performance by an Order of Magnitude
Algorithmic Memory Increases Memory Performance by an Order of MagnitudeAlgorithmic Memory Increases Memory Performance by an Order of Magnitude
Algorithmic Memory Increases Memory Performance by an Order of Magnitudechiportal
 
Scalable Elastic Systems Architecture (SESA)
Scalable Elastic Systems Architecture (SESA)Scalable Elastic Systems Architecture (SESA)
Scalable Elastic Systems Architecture (SESA)Eric Van Hensbergen
 
GPU Virtualization on VMware's Hosted I/O Architecture
GPU Virtualization on VMware's Hosted I/O ArchitectureGPU Virtualization on VMware's Hosted I/O Architecture
GPU Virtualization on VMware's Hosted I/O Architectureguestb3fc97
 
DFX Architecture for High-performance Multi-core Microprocessors
DFX Architecture for High-performance Multi-core MicroprocessorsDFX Architecture for High-performance Multi-core Microprocessors
DFX Architecture for High-performance Multi-core MicroprocessorsIshwar Parulkar
 
[Harvard CS264] 07 - GPU Cluster Programming (MPI & ZeroMQ)
[Harvard CS264] 07 - GPU Cluster Programming (MPI & ZeroMQ)[Harvard CS264] 07 - GPU Cluster Programming (MPI & ZeroMQ)
[Harvard CS264] 07 - GPU Cluster Programming (MPI & ZeroMQ)npinto
 

What's hot (20)

Delay Tolerant Streaming Services, Thomas Plagemann, UiO
Delay Tolerant Streaming Services, Thomas Plagemann, UiODelay Tolerant Streaming Services, Thomas Plagemann, UiO
Delay Tolerant Streaming Services, Thomas Plagemann, UiO
 
Toward a practical “HPC Cloud”: Performance tuning of a virtualized HPC cluster
Toward a practical “HPC Cloud”: Performance tuning of a virtualized HPC clusterToward a practical “HPC Cloud”: Performance tuning of a virtualized HPC cluster
Toward a practical “HPC Cloud”: Performance tuning of a virtualized HPC cluster
 
Xvisor: embedded and lightweight hypervisor
Xvisor: embedded and lightweight hypervisorXvisor: embedded and lightweight hypervisor
Xvisor: embedded and lightweight hypervisor
 
NVidia CUDA Tutorial - June 15, 2009
NVidia CUDA Tutorial - June 15, 2009NVidia CUDA Tutorial - June 15, 2009
NVidia CUDA Tutorial - June 15, 2009
 
Gpu with cuda architecture
Gpu with cuda architectureGpu with cuda architecture
Gpu with cuda architecture
 
XS Boston 2008 Security
XS Boston 2008 SecurityXS Boston 2008 Security
XS Boston 2008 Security
 
Toward a practical “HPC Cloud”: Performance tuning of a virtualized HPC cluster
Toward a practical “HPC Cloud”: Performance tuning of a virtualized HPC clusterToward a practical “HPC Cloud”: Performance tuning of a virtualized HPC cluster
Toward a practical “HPC Cloud”: Performance tuning of a virtualized HPC cluster
 
thread-clustering
thread-clusteringthread-clustering
thread-clustering
 
[Harvard CS264] 05 - Advanced-level CUDA Programming
[Harvard CS264] 05 - Advanced-level CUDA Programming[Harvard CS264] 05 - Advanced-level CUDA Programming
[Harvard CS264] 05 - Advanced-level CUDA Programming
 
[Harvard CS264] 11a - Programming the Memory Hierarchy with Sequoia (Mike Bau...
[Harvard CS264] 11a - Programming the Memory Hierarchy with Sequoia (Mike Bau...[Harvard CS264] 11a - Programming the Memory Hierarchy with Sequoia (Mike Bau...
[Harvard CS264] 11a - Programming the Memory Hierarchy with Sequoia (Mike Bau...
 
[Harvard CS264] 11b - Analysis-Driven Performance Optimization with CUDA (Cli...
[Harvard CS264] 11b - Analysis-Driven Performance Optimization with CUDA (Cli...[Harvard CS264] 11b - Analysis-Driven Performance Optimization with CUDA (Cli...
[Harvard CS264] 11b - Analysis-Driven Performance Optimization with CUDA (Cli...
 
Nvidia Cuda Apps Jun27 11
Nvidia Cuda Apps Jun27 11Nvidia Cuda Apps Jun27 11
Nvidia Cuda Apps Jun27 11
 
Algorithmic Memory Increases Memory Performance by an Order of Magnitude
Algorithmic Memory Increases Memory Performance by an Order of MagnitudeAlgorithmic Memory Increases Memory Performance by an Order of Magnitude
Algorithmic Memory Increases Memory Performance by an Order of Magnitude
 
Embedded Virtualization for Mobile Devices
Embedded Virtualization for Mobile DevicesEmbedded Virtualization for Mobile Devices
Embedded Virtualization for Mobile Devices
 
Low Level View of Android System Architecture
Low Level View of Android System ArchitectureLow Level View of Android System Architecture
Low Level View of Android System Architecture
 
Explore Android Internals
Explore Android InternalsExplore Android Internals
Explore Android Internals
 
Scalable Elastic Systems Architecture (SESA)
Scalable Elastic Systems Architecture (SESA)Scalable Elastic Systems Architecture (SESA)
Scalable Elastic Systems Architecture (SESA)
 
GPU Virtualization on VMware's Hosted I/O Architecture
GPU Virtualization on VMware's Hosted I/O ArchitectureGPU Virtualization on VMware's Hosted I/O Architecture
GPU Virtualization on VMware's Hosted I/O Architecture
 
DFX Architecture for High-performance Multi-core Microprocessors
DFX Architecture for High-performance Multi-core MicroprocessorsDFX Architecture for High-performance Multi-core Microprocessors
DFX Architecture for High-performance Multi-core Microprocessors
 
[Harvard CS264] 07 - GPU Cluster Programming (MPI & ZeroMQ)
[Harvard CS264] 07 - GPU Cluster Programming (MPI & ZeroMQ)[Harvard CS264] 07 - GPU Cluster Programming (MPI & ZeroMQ)
[Harvard CS264] 07 - GPU Cluster Programming (MPI & ZeroMQ)
 

Viewers also liked

Pythonによる高位設計フレームワークPyCoRAMでFPGAシステムを開発してみよう
Pythonによる高位設計フレームワークPyCoRAMでFPGAシステムを開発してみようPythonによる高位設計フレームワークPyCoRAMでFPGAシステムを開発してみよう
Pythonによる高位設計フレームワークPyCoRAMでFPGAシステムを開発してみようShinya Takamaeda-Y
 
A CGRA-based Approach for Accelerating Convolutional Neural Networks
A CGRA-based Approachfor Accelerating Convolutional Neural NetworksA CGRA-based Approachfor Accelerating Convolutional Neural Networks
A CGRA-based Approach for Accelerating Convolutional Neural NetworksShinya Takamaeda-Y
 
PyCoRAMを用いたグラフ処理FPGAアクセラレータ
PyCoRAMを用いたグラフ処理FPGAアクセラレータPyCoRAMを用いたグラフ処理FPGAアクセラレータ
PyCoRAMを用いたグラフ処理FPGAアクセラレータShinya Takamaeda-Y
 
PyCoRAM (高位合成友の会@ドワンゴ, 2015年1月16日)
PyCoRAM (高位合成友の会@ドワンゴ, 2015年1月16日)PyCoRAM (高位合成友の会@ドワンゴ, 2015年1月16日)
PyCoRAM (高位合成友の会@ドワンゴ, 2015年1月16日)Shinya Takamaeda-Y
 
PyCoRAMによるPythonを用いたポータブルなFPGAアクセラレータ開発 (チュートリアル@ESS2014)
PyCoRAMによるPythonを用いたポータブルなFPGAアクセラレータ開発 (チュートリアル@ESS2014)PyCoRAMによるPythonを用いたポータブルなFPGAアクセラレータ開発 (チュートリアル@ESS2014)
PyCoRAMによるPythonを用いたポータブルなFPGAアクセラレータ開発 (チュートリアル@ESS2014)Shinya Takamaeda-Y
 
PythonとPyCoRAMでお手軽にFPGAシステムを開発してみよう
PythonとPyCoRAMでお手軽にFPGAシステムを開発してみようPythonとPyCoRAMでお手軽にFPGAシステムを開発してみよう
PythonとPyCoRAMでお手軽にFPGAシステムを開発してみようShinya Takamaeda-Y
 
Mapping Applications with Collectives over Sub-communicators on Torus Network...
Mapping Applications with Collectives over Sub-communicators on Torus Network...Mapping Applications with Collectives over Sub-communicators on Torus Network...
Mapping Applications with Collectives over Sub-communicators on Torus Network...Shinya Takamaeda-Y
 
助教が吼える! 各界の若手研究者大集合「ハードウェアはやわらかい」
助教が吼える! 各界の若手研究者大集合「ハードウェアはやわらかい」助教が吼える! 各界の若手研究者大集合「ハードウェアはやわらかい」
助教が吼える! 各界の若手研究者大集合「ハードウェアはやわらかい」Shinya Takamaeda-Y
 
マルチパラダイム型高水準ハードウェア設計環境の検討
マルチパラダイム型高水準ハードウェア設計環境の検討マルチパラダイム型高水準ハードウェア設計環境の検討
マルチパラダイム型高水準ハードウェア設計環境の検討Shinya Takamaeda-Y
 
Veriloggen: Pythonによるハードウェアメタプログラミング(第3回 高位合成友の会 @ドワンゴ)
Veriloggen: Pythonによるハードウェアメタプログラミング(第3回 高位合成友の会 @ドワンゴ)Veriloggen: Pythonによるハードウェアメタプログラミング(第3回 高位合成友の会 @ドワンゴ)
Veriloggen: Pythonによるハードウェアメタプログラミング(第3回 高位合成友の会 @ドワンゴ)Shinya Takamaeda-Y
 
Pythonを用いた高水準ハードウェア設計環境の検討
Pythonを用いた高水準ハードウェア設計環境の検討Pythonを用いた高水準ハードウェア設計環境の検討
Pythonを用いた高水準ハードウェア設計環境の検討Shinya Takamaeda-Y
 
コンピュータアーキテクチャ研究の最新動向〜ISCA2015参加報告〜 @FPGAエクストリーム・コンピューティング 第7回 (#fpgax #7)
コンピュータアーキテクチャ研究の最新動向〜ISCA2015参加報告〜 @FPGAエクストリーム・コンピューティング 第7回 (#fpgax #7)コンピュータアーキテクチャ研究の最新動向〜ISCA2015参加報告〜 @FPGAエクストリーム・コンピューティング 第7回 (#fpgax #7)
コンピュータアーキテクチャ研究の最新動向〜ISCA2015参加報告〜 @FPGAエクストリーム・コンピューティング 第7回 (#fpgax #7)Shinya Takamaeda-Y
 
PythonとVeriloggenを用いたRTL設計メタプログラミング
PythonとVeriloggenを用いたRTL設計メタプログラミングPythonとVeriloggenを用いたRTL設計メタプログラミング
PythonとVeriloggenを用いたRTL設計メタプログラミングShinya Takamaeda-Y
 
Zynq + Vivado HLS入門
Zynq + Vivado HLS入門Zynq + Vivado HLS入門
Zynq + Vivado HLS入門narusugimoto
 
Debian Linux on Zynq (Xilinx ARM-SoC FPGA) Setup Flow (Vivado 2015.4)
Debian Linux on Zynq (Xilinx ARM-SoC FPGA) Setup Flow (Vivado 2015.4)Debian Linux on Zynq (Xilinx ARM-SoC FPGA) Setup Flow (Vivado 2015.4)
Debian Linux on Zynq (Xilinx ARM-SoC FPGA) Setup Flow (Vivado 2015.4)Shinya Takamaeda-Y
 
FPGA・リコンフィギャラブルシステム研究の最新動向
FPGA・リコンフィギャラブルシステム研究の最新動向FPGA・リコンフィギャラブルシステム研究の最新動向
FPGA・リコンフィギャラブルシステム研究の最新動向Shinya Takamaeda-Y
 

Viewers also liked (17)

Pythonによる高位設計フレームワークPyCoRAMでFPGAシステムを開発してみよう
Pythonによる高位設計フレームワークPyCoRAMでFPGAシステムを開発してみようPythonによる高位設計フレームワークPyCoRAMでFPGAシステムを開発してみよう
Pythonによる高位設計フレームワークPyCoRAMでFPGAシステムを開発してみよう
 
A CGRA-based Approach for Accelerating Convolutional Neural Networks
A CGRA-based Approachfor Accelerating Convolutional Neural NetworksA CGRA-based Approachfor Accelerating Convolutional Neural Networks
A CGRA-based Approach for Accelerating Convolutional Neural Networks
 
PyCoRAMを用いたグラフ処理FPGAアクセラレータ
PyCoRAMを用いたグラフ処理FPGAアクセラレータPyCoRAMを用いたグラフ処理FPGAアクセラレータ
PyCoRAMを用いたグラフ処理FPGAアクセラレータ
 
PyCoRAM (高位合成友の会@ドワンゴ, 2015年1月16日)
PyCoRAM (高位合成友の会@ドワンゴ, 2015年1月16日)PyCoRAM (高位合成友の会@ドワンゴ, 2015年1月16日)
PyCoRAM (高位合成友の会@ドワンゴ, 2015年1月16日)
 
PyCoRAMによるPythonを用いたポータブルなFPGAアクセラレータ開発 (チュートリアル@ESS2014)
PyCoRAMによるPythonを用いたポータブルなFPGAアクセラレータ開発 (チュートリアル@ESS2014)PyCoRAMによるPythonを用いたポータブルなFPGAアクセラレータ開発 (チュートリアル@ESS2014)
PyCoRAMによるPythonを用いたポータブルなFPGAアクセラレータ開発 (チュートリアル@ESS2014)
 
PythonとPyCoRAMでお手軽にFPGAシステムを開発してみよう
PythonとPyCoRAMでお手軽にFPGAシステムを開発してみようPythonとPyCoRAMでお手軽にFPGAシステムを開発してみよう
PythonとPyCoRAMでお手軽にFPGAシステムを開発してみよう
 
Mapping Applications with Collectives over Sub-communicators on Torus Network...
Mapping Applications with Collectives over Sub-communicators on Torus Network...Mapping Applications with Collectives over Sub-communicators on Torus Network...
Mapping Applications with Collectives over Sub-communicators on Torus Network...
 
助教が吼える! 各界の若手研究者大集合「ハードウェアはやわらかい」
助教が吼える! 各界の若手研究者大集合「ハードウェアはやわらかい」助教が吼える! 各界の若手研究者大集合「ハードウェアはやわらかい」
助教が吼える! 各界の若手研究者大集合「ハードウェアはやわらかい」
 
マルチパラダイム型高水準ハードウェア設計環境の検討
マルチパラダイム型高水準ハードウェア設計環境の検討マルチパラダイム型高水準ハードウェア設計環境の検討
マルチパラダイム型高水準ハードウェア設計環境の検討
 
Veriloggen: Pythonによるハードウェアメタプログラミング(第3回 高位合成友の会 @ドワンゴ)
Veriloggen: Pythonによるハードウェアメタプログラミング(第3回 高位合成友の会 @ドワンゴ)Veriloggen: Pythonによるハードウェアメタプログラミング(第3回 高位合成友の会 @ドワンゴ)
Veriloggen: Pythonによるハードウェアメタプログラミング(第3回 高位合成友の会 @ドワンゴ)
 
Pythonを用いた高水準ハードウェア設計環境の検討
Pythonを用いた高水準ハードウェア設計環境の検討Pythonを用いた高水準ハードウェア設計環境の検討
Pythonを用いた高水準ハードウェア設計環境の検討
 
コンピュータアーキテクチャ研究の最新動向〜ISCA2015参加報告〜 @FPGAエクストリーム・コンピューティング 第7回 (#fpgax #7)
コンピュータアーキテクチャ研究の最新動向〜ISCA2015参加報告〜 @FPGAエクストリーム・コンピューティング 第7回 (#fpgax #7)コンピュータアーキテクチャ研究の最新動向〜ISCA2015参加報告〜 @FPGAエクストリーム・コンピューティング 第7回 (#fpgax #7)
コンピュータアーキテクチャ研究の最新動向〜ISCA2015参加報告〜 @FPGAエクストリーム・コンピューティング 第7回 (#fpgax #7)
 
Zynq+PyCoRAM(+Debian)入門
Zynq+PyCoRAM(+Debian)入門Zynq+PyCoRAM(+Debian)入門
Zynq+PyCoRAM(+Debian)入門
 
PythonとVeriloggenを用いたRTL設計メタプログラミング
PythonとVeriloggenを用いたRTL設計メタプログラミングPythonとVeriloggenを用いたRTL設計メタプログラミング
PythonとVeriloggenを用いたRTL設計メタプログラミング
 
Zynq + Vivado HLS入門
Zynq + Vivado HLS入門Zynq + Vivado HLS入門
Zynq + Vivado HLS入門
 
Debian Linux on Zynq (Xilinx ARM-SoC FPGA) Setup Flow (Vivado 2015.4)
Debian Linux on Zynq (Xilinx ARM-SoC FPGA) Setup Flow (Vivado 2015.4)Debian Linux on Zynq (Xilinx ARM-SoC FPGA) Setup Flow (Vivado 2015.4)
Debian Linux on Zynq (Xilinx ARM-SoC FPGA) Setup Flow (Vivado 2015.4)
 
FPGA・リコンフィギャラブルシステム研究の最新動向
FPGA・リコンフィギャラブルシステム研究の最新動向FPGA・リコンフィギャラブルシステム研究の最新動向
FPGA・リコンフィギャラブルシステム研究の最新動向
 


An FPGA-based Scalable Simulation Accelerator for Tile Architectures @HEART2011

  • 1. 14:30 – 15:00, June 2, 2011, HEART 2011 @Imperial College London
       An FPGA-based Scalable Simulation Accelerator for Tile Architectures
       Shinya Takamaeda-Yamazaki†‡, Ryosuke Sasakawa†, Yoshito Sakaguchi†, Kenji Kise†
       †Tokyo Institute of Technology, Japan
       ‡JSPS Research Fellow
  • 2. This presentation introduces the ScalableCore system
       - A multi-FPGA system for tile architecture simulations
         - Achieves SCALABLE simulation speed
       [Figure legend: Target Core, System Function]
  • 3. Agenda
       - Background & Motivation
       - Proposal: ScalableCore
       - System Implementation
         - Overall system
         - Components: ScalableCore Unit & Board
         - Logic Hierarchy & Architecture
       - Evaluation
         - Simulation Speed
         - Power
       - Conclusion
  • 4. Background: Multicores to Many-cores
       - Intel Single Chip Cloud Computer: 48 cores (x86)
       - TILERA TILE-Gx100: 100 cores (MIPS)
  • 5. Simulation Target Manycore: M-Core
       - Tile architecture with a 2D mesh network
         - A Node has: Core, Local Memory, INCC (DMA controller), and Router
         - Local Memory: independent address space; data is transferred by DMA
       [Figure: mesh of Nodes, each containing Core, Local Memory, INCC, and Router (R), with four DRAM Controllers]
  • 6. How to evaluate the architectures?
       - Customizability vs. Simulation Speed
         - We want to run a large benchmark fast
       [Figure: Real Chip, FPGA Simulator, and Software Simulator compared by reality and difficulty of construction. Labels in the figure: "Real but expensive", "Faster simulation and customizable", "Easy construction of ideal system without HW limitations"]
  • 7. Poor scalability of simulation speed on software simulators
       - Speed decreases as the number of target cores increases
         - SimMc: M-Core simulator
         - It is difficult to achieve scalable speed
           - Overhead of cycle-accurate simulation
       - Speed degrades faster than the number of cores increases
       [Chart: Simulation Speed on SimMc (M-Core simulator)]
         # Target Cores: 16, 32, 48, 64
         Simulation Speed [K cycle / sec]: 343, 149, 96, 70
  • 8. Motivation
       - Achieve SCALABLE simulation speed
         - = Keep the simulation speed constant even with a large number of cores
       - How to scale the simulation speed?
         - Our target architecture: M-Core
           - Tile architecture with a 2D mesh network
         - Partition the target processor across multiple FPGAs
       [Figure: a many-core processor partitioned into parts]
  • 9. Proposal of ScalableCore
       - Multiple FPGAs correspond to the target processor
         - Each ScalableCore Unit holds a part of the target processor and shares the simulation progress with its neighbor Units
       - ScalableCore Unit (FPGA card with off-chip memory): a part of the target processor
       - ScalableCore Board: connects the ScalableCore Units
       - LCD display for simulation information
       [Figure: Target Processor (M-Core) mapped onto Units; legend: Target Core, System Function]
  • 10. Simulation Target Manycore: M-Core
       - Tile architecture with a 2D mesh network
         - A Node has: Core, Local Memory, INCC (DMA controller), and Router
         - Local Memory: independent address space; data is transferred by DMA
       [Figure: mesh of Nodes, each containing Core, Local Memory, INCC, and Router (R), with four DRAM Controllers; a region is marked "Current Target of ScalableCore system"]
  • 11. ScalableCore system 1.1: Overview
       - Simulates the M-Core with up to 64 Nodes (= FPGAs)
       - The number of Nodes can be increased or decreased
       [Figure: a Node (Core, Local Memory, INCC, Router R) together with System Functions]
  • 12. 1 Node: 1 ScalableCore Unit [Photo; dimension labels: 45 cm, 30 cm]
  • 13. 4 Nodes (2×2): 4 ScalableCore Units [Photo; dimension labels: 45 cm, 30 cm]
  • 14. 16 Nodes (4×4): 16 ScalableCore Units [Photo; dimension labels: 45 cm, 30 cm]
  • 15. 64 Nodes (8×8): 64 ScalableCore Units. Scalable Extension!
  • 16. ScalableCore system 1.1: Components
       - ScalableCore Unit: FPGA board with off-chip SRAM
         - Xilinx Spartan-3E XC3S500E
         - 512 KiB SRAM (8-bit, 1 port for read/write)
         - Configuration ROM
       - ScalableCore Board: interface board bridging Units
         - Power regulator & SD card slot
  • 17. ScalableCore system 1.1: Logic Hierarchy
       - Target Core (a Node in M-Core): Core, INCC, Router, Local Memory (Interface)
       - System Functions: Interface Register, Arbiter, Memory Multiplexer, Ser/Des, Device Controller, Initializer
  • 18. ScalableCore system 1.1: Logic Architecture
       [Figure: block diagram of the ScalableCore Unit FPGA (Spartan-3E). Blocks named in the diagram: Off-chip SRAM, SRAM Controller, SD Card Controller, Devices, Node Memory, Memory Controller, Memory Multiplexer, Interface Registers (IR), Configuration ROM (XCF04S), JTAG port, Core (Fetch Unit, Decoder, Register File, Execution Unit, Memory Access Unit), INCC (DMA Generator/Receiver, DMA Register), Router (XBAR, Arbiter), State Machine Controller, Clock, Reset, and Ser/Des links to/from adjacent Units]
  • 19. Two key techniques
       - Local Barrier Synchronization
         - Each FPGA holds one Node of M-Core (or of another tile architecture)
         - To preserve cycle accuracy, the FPGAs must handshake on the simulation state
           - All-to-all handshake: overhead grows with the number of cores
         - Our target is a tile architecture, so handshaking with only the 4 neighbors is enough
       - Virtual Cycle
         - How to emulate complex hardware?
           - Example: a larger number of memory ports
         - Use multiple FPGA cycles for 1 target cycle
  • 20. Local Barrier Synchronization
       - Handshakes with the 4 neighbor FPGAs
         - Constant handshaking overhead that does not grow with the number of target cores
         - This is what makes the simulation speed scalable
       [Figure: timing diagram over Cycle 1 and Cycle 2. In each cycle the center Unit (4) sends to and receives from its neighbor Units 0 to 3]
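The neighbor-only handshake on this slide can be illustrated with a small software model. The sketch below is not the authors' FPGA logic; it assumes a simple rule, namely that a Unit may advance to its next target cycle only when none of its mesh neighbors is behind it, and every name in it (`neighbors`, `simulate`) is hypothetical. It shows the two properties the slide relies on: adjacent Units never drift more than one cycle apart, and the number of handshake partners per Unit stays at 4 or fewer however large the mesh is.

```python
import random


def neighbors(x, y, width, height):
    """Return the up-to-4 mesh neighbors (north, east, south, west) of a node."""
    candidates = [(x, y - 1), (x + 1, y), (x, y + 1), (x - 1, y)]
    return [(nx, ny) for nx, ny in candidates
            if 0 <= nx < width and 0 <= ny < height]


def simulate(width, height, target_cycles, seed=0):
    """Advance every node to target_cycles using neighbor-only handshakes.

    Nodes are scheduled in random order to mimic FPGAs running at slightly
    different paces. Returns (max_skew, max_handshakes): the largest cycle
    difference seen between adjacent nodes, and the largest number of
    handshake partners any node needed for one step.
    """
    rng = random.Random(seed)
    cycle = {(x, y): 0 for x in range(width) for y in range(height)}
    pending = list(cycle)
    max_skew = 0
    max_handshakes = 0
    while pending:
        node = rng.choice(pending)
        nbrs = neighbors(node[0], node[1], width, height)
        # Local barrier (assumed rule): step only if no neighbor is behind.
        if all(cycle[n] >= cycle[node] for n in nbrs):
            cycle[node] += 1
            max_handshakes = max(max_handshakes, len(nbrs))
            skew = max((abs(cycle[node] - cycle[n]) for n in nbrs), default=0)
            max_skew = max(max_skew, skew)
            if cycle[node] == target_cycles:
                pending.remove(node)
    return max_skew, max_handshakes


if __name__ == "__main__":
    for side in (2, 4, 8):
        print(side * side, "nodes:", simulate(side, side, 10))
```

The node with the smallest cycle count can always step, so the loop cannot deadlock, and no global barrier is needed to keep the whole mesh cycle-accurate.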
  • 21. Virtual Cycle
       - Multiple FPGA clock cycles for 1 target clock cycle
         - Virtually complex hardware built from simple FPGA resources
           - Example: a multiport RAM emulated by driving a 1-port RAM multiple times
       [Figure: timeline of 1 Virtual Cycle (Virtual Cycle N, then N+1), with these stages:]
         1. Proceeding target circuit state: drive the circuits of the target components (Core, INCC, Router)
         2. Memory access via Memory Multiplexer: process the memory accesses, interleaved (Core (IF), Core (L/S), INCC Send, INCC Recv)
         3. Start sending. Data Sender via Serial I/Os: send the synchronized data via Serial I/O (North, East, West, South)
         4. Data Receiver via Serial I/Os: receive the synchronized data via Serial I/O (North, East, West, South)
         5. Finish synchronization
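The multiport-RAM example on this slide can be sketched in software. The model below is an illustration under stated assumptions, not the authors' Memory Multiplexer: `SinglePortRAM` and `virtual_cycle` are hypothetical names, the port names follow the slide's labels (Core (IF), Core (L/S), INCC Send, INCC Recv), and the choice to serve all reads before all writes within a target cycle is an assumption made so that every reader sees the value from the start of that cycle.

```python
class SinglePortRAM:
    """A RAM that can do only one read or one write per FPGA cycle."""

    def __init__(self, size):
        self.data = [0] * size
        self.fpga_cycles = 0  # FPGA clock cycles spent on accesses

    def access(self, addr, write_value=None):
        """Perform one access; each call costs exactly one FPGA cycle."""
        self.fpga_cycles += 1
        if write_value is None:
            return self.data[addr]
        self.data[addr] = write_value
        return None


def virtual_cycle(ram, reads, writes):
    """Serve every port's request for ONE target cycle, one port at a time.

    reads:  {port_name: address}
    writes: {port_name: (address, value)}
    Reads are served first (assumption), so all ports observe the memory
    state from the beginning of the target cycle.
    """
    results = {port: ram.access(addr) for port, addr in reads.items()}
    for addr, value in writes.values():
        ram.access(addr, value)
    return results


ram = SinglePortRAM(16)
ram.data[3] = 7
results = virtual_cycle(
    ram,
    reads={"core_if": 0, "core_ls": 3, "incc_send": 3},
    writes={"incc_recv": (3, 99)},
)
print(results, ram.data[3], ram.fpga_cycles)
```

Four ports cost four FPGA cycles here, yet the target sees them as a single cycle, which is the trade the slide describes: simple hardware, more FPGA clocks per target clock.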
  • 22. Evaluation
       - Evaluation points
         - Simulation Speed [K cycle / sec]
         - Power [W]
       - Environment
         - ScalableCore system 1.1 (FPGA-based simulator)
           - Frequency: 45 MHz
         - SimMc 1.1 (software simulator of M-Core)
           - Intel Core 2 Duo, 4 GB memory, gcc 4.1.2, Debian 5
       - Number of Nodes
         - 16, 32, 48, 64
  • 23. Evaluation: Simulation Speed [K cycle/sec]
       - Simulation speed = clock frequency of the target processor [kHz]
         - Software simulator: speed degrades as the number of target cores increases
         - ScalableCore system: constant speed
       - Relative Speed
         - The relative speed grows with the number of cores
           - In the 64-Node simulation, it achieves a 14.2x speedup
       [Charts, by # Nodes (16, 32, 48, 64)]
         Simulation Speed [K cycle / sec], ScalableCore system: 1000, 1000, 1000, 1000
         Simulation Speed [K cycle / sec], Software Simulator: 343, 149, 96, 70
         Relative Speed: 2.9, 6.7, 10.4, 14.2
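As a worked check of the figures on this slide, the sketch below recomputes the relative speed as the ratio of the two simulation speeds. The input numbers are the rounded chart values, so the results agree with the reported ratios only to within about 0.1 (1000 / 70 is about 14.29 against the reported 14.2). The last line is a further assumption: if the 45 MHz frequency from the evaluation environment is the clock that drives each virtual cycle, then 1000 K cycle/sec corresponds to 45 FPGA clocks per target cycle. The function names are hypothetical.

```python
NODES = [16, 32, 48, 64]
SOFTWARE_KCPS = [343, 149, 96, 70]            # SimMc, K cycle / sec (chart values)
SCALABLECORE_KCPS = [1000, 1000, 1000, 1000]  # ScalableCore, K cycle / sec
REPORTED_RELATIVE_SPEED = [2.9, 6.7, 10.4, 14.2]
FPGA_CLOCK_KHZ = 45_000                       # 45 MHz, from the evaluation setup


def relative_speed(hardware_kcps, software_kcps):
    """Speedup of the FPGA system over the software simulator."""
    return hardware_kcps / software_kcps


def fpga_cycles_per_target_cycle(fpga_khz, simulated_kcps):
    """FPGA clocks spent per simulated target clock (assumes one shared clock)."""
    return fpga_khz / simulated_kcps


relative = [relative_speed(h, s)
            for h, s in zip(SCALABLECORE_KCPS, SOFTWARE_KCPS)]

for n, r, reported in zip(NODES, relative, REPORTED_RELATIVE_SPEED):
    print(f"{n:2d} nodes: computed {r:.2f}x, reported {reported}x")

print(fpga_cycles_per_target_cycle(FPGA_CLOCK_KHZ, SCALABLECORE_KCPS[0]))
```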
  • 24. Evaluation: Power [W]
       - Power = energy consumption of the system per second
         - Software simulator: constant power consumption
         - ScalableCore system: power increases with the number of Nodes
       - Relative Efficiency (= ratio of the energy used to simulate 1 clock cycle of the target)
         - Efficiency improves as the number of target cores increases
           - In the 64-Node simulation, the relative efficiency reaches 23.5
       [Charts, by # Nodes (16, 32, 48, 64)]
         Power [W], ScalableCore system: 13, 26, 38, 51
         Power [W], Software Simulator: 84, 84, 84, 84
         Relative Efficiency: 19.2, 22.2, 22.9, 23.5
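The relative efficiency on this slide can be cross-checked from the power and speed figures. The sketch below assumes one particular reading of the slide's definition: energy per simulated target cycle is power divided by simulation speed, and relative efficiency is the software simulator's energy per cycle divided by the ScalableCore system's. The pairing of chart values with node counts (ScalableCore power 13, 26, 38, 51 W; efficiency 19.2, 22.2, 22.9, 23.5) is read from the flattened chart and is also an assumption. Because the inputs are rounded, the recomputed ratios match the chart only to within roughly 0.5.

```python
NODES = [16, 32, 48, 64]
SOFTWARE_POWER_W = [84, 84, 84, 84]
SCALABLECORE_POWER_W = [13, 26, 38, 51]
SOFTWARE_KCPS = [343, 149, 96, 70]
SCALABLECORE_KCPS = [1000, 1000, 1000, 1000]
REPORTED_EFFICIENCY = [19.2, 22.2, 22.9, 23.5]


def energy_per_cycle_mj(power_w, speed_kcps):
    """Energy per simulated target cycle in millijoules.

    W / (K cycle / sec) = J per 1000 cycles = mJ per cycle.
    """
    return power_w / speed_kcps


def relative_efficiency(sw_power, sw_speed, hw_power, hw_speed):
    """How many times less energy per target cycle the FPGA system uses."""
    return (energy_per_cycle_mj(sw_power, sw_speed)
            / energy_per_cycle_mj(hw_power, hw_speed))


efficiency = [
    relative_efficiency(sp, ss, hp, hs)
    for sp, ss, hp, hs in zip(SOFTWARE_POWER_W, SOFTWARE_KCPS,
                              SCALABLECORE_POWER_W, SCALABLECORE_KCPS)
]

for n, e, reported in zip(NODES, efficiency, REPORTED_EFFICIENCY):
    print(f"{n:2d} nodes: computed {e:.1f}, chart {reported}")
```

Under this reading, the ScalableCore system draws more power as Nodes are added, but the software simulator slows down faster than that, so the energy per simulated cycle still favors the FPGA system more at larger scales.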
  • 25. Conclusion
       - ScalableCore system 1.1: an FPGA-based scalable simulation system for evaluating tile architectures
         - Multiple FPGAs
         - Two key techniques
           - Virtual Cycle
           - Local Barrier Synchronization
         - 14.2 times faster simulation than the software simulator
           - When simulating a more detailed architecture, the speedup becomes much larger
       - Future Work
         - Off-chip DRAM support
         - Virtually combining multiple FPGAs for a large core
         - Time-multiplexed execution for higher hardware utilization