HPPS - Final - 06/14/2007

High Performance Processors and Systems PdM – UIC joint master 2007 Instructor: Prof. Donatella Sciuto HPPS @ PdM – June 2007

General Outline DRESD DReAMS Alessandro Panella Matteo Murgida Operating System Ivan Beretta Design Flow Antonio Piazzi Polaris Massimo Morandi Marco Novati HLR Marco Maggioni

DRESD in a Nutshell Dynamic Reconfigurability in Embedded System Design DRESD @ PdM – June 2007

Outline Reconfiguration Motivations Basic Definition SoC

Motivations Increasing need for behavioral flexibility in embedded systems design Support of new standards, e.g. in media processing Addition of new features Applications too large to fit on the device all at once Speed up the overall computation of the final system

Reconfiguration The process of physically altering the location or functionality of network or system elements. Automatic configuration describes the way sophisticated networks can readjust themselves in the event of a link or device failing, enabling the network to continue operation. Gerald Estrin, 1960

SoC Reconfiguration [Figure labels: fix, Partial, Total, Embedded]

Different Scenarios... Single Device Distributed System

What’s next DRESD DReAMS Alessandro Panella Matteo Murgida Operating System Ivan Beretta Design Flow Antonio Piazzi Polaris Massimo Morandi Marco Novati HLR Marco Maggioni

Dynamic Reconfigurability Applied to Multi-FPGA Systems

DReAMS Dynamic Reconfigurability Applied to Multi-FPGA Systems Branch of DRESD project Inherits architectures and tools Automatic workflow from VHDL system description to FPGA implementation VHDL parsing and system simulation System creation over a specific architecture Bitstream creation and download onto FPGAs

Multi-FPGA Partitioning Alessandro Panella [email_address]

Outline Problem description Project goals and contributions Project phases What is partitioning? Existing approaches Going deeper into the problem SPartA The framework The idea The algorithm Experimental results Future work

Problem description Multi-FPGA - RATIONALE Large designs do not fit into a single chip High performance parallelized applications Our case: apply dynamic reconfigurability Need to break the initial design into several blocks One block corresponds to a single FPGA chip Which inputs/outputs? Which objectives? Which techniques?

Project goals and contributions Analyze existing approaches Obtain a deep knowledge of this -well explored- field Extract basic ideas for a new approach Obtain some terms of comparison Define precisely which problem(s) we cope with Contextualize the problem Focus on our needs Develop a new solution Theoretical background Implementation and evaluation

Project phases First Phase [15th March – 12th April] Documentation: presentation (12/4), report Goals: Analysis of the state of the art Produce some hints on a new approach Second Phase [13th April – 17th May] Documentation: presentation (17/5), report Goals: Precise definition of the problem Propose a new solution Third Phase [18th May – 14th June] Documentation: presentation (14/6), final report Goal: Implementation and evaluation of the proposed solution

What is partitioning? Goal Divide a set of interrelated objects into a set of subsets Optimize one or more specific objectives K-way partitioning Given a graph G=(V,E), partition V into k subsets V_1 ... V_k that are pairwise disjoint and whose union is V. Balance constraint: |V_i| ≈ |V|/k Aims at minimizing (or maximizing) an objective function Edge-cut Other objectives In general: NP-complete Several heuristics that provide good results have been developed
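As a minimal sketch of the two quantities named on this slide, the following Python computes the edge-cut of a partition and checks the balance constraint. The example graph, the `tolerance` slack and the function names are illustrative assumptions, not part of the original project.

```python
from collections import Counter

def edge_cut(edges, part):
    """Number of edges whose endpoints fall in different subsets."""
    return sum(1 for u, v in edges if part[u] != part[v])

def is_balanced(part, k, tolerance=1):
    """Check |V_i| ≈ |V|/k, allowing an additive slack (assumed here)."""
    sizes = Counter(part.values())
    target = len(part) / k
    return all(abs(sizes.get(i, 0) - target) <= tolerance for i in range(k))

# Hypothetical graph: two triangles joined by a single edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
good = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}   # keeps each triangle together
bad = {0: 0, 1: 1, 2: 0, 3: 1, 4: 0, 5: 1}    # splits both triangles

print(edge_cut(edges, good), edge_cut(edges, bad))
print(is_balanced(good, k=2), is_balanced(bad, k=2))
```

Both partitions satisfy the balance constraint, so only the edge-cut distinguishes them; this is the objective a partitioning heuristic would try to reduce.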

Existing approaches - a glance Traditional methods Kernighan – Lin and Fiduccia – Mattheyses heuristics Iterative-improvement algorithms Begin with an initial partition and iteratively improve it O(n^3) complexity Iterative algorithms Genetic Simulated annealing Multilevel algorithms Clustering -> Initial partitioning -> Refining METIS/hMETIS suite: best current results for partitioning large flattened graphs

Going deeper into the problem Two kinds of multi-FPGA partition Topology-aware Architecture topology is an input No optimization of the no. of FPGAs needed Main task: association between the (larger) system graph and the (smaller) architectural graph Topology-free Architecture topology is not provided Input: dimension and communication features of FPGAs Minimization of the number of FPGAs Place and route after partitioning At the moment, we deal with the Topology-free problem

SPartA: the framework Input: VHDL system description Output: several VHDL files, one for each block (FPGA) Three main phases: Extract design from VHDL description “Real” partitioning phase (core) Build VHDL files

SPartA: the idea Structural approach Fully exploits the design hierarchy Modules can be treated as single blocks Bases for expansions toward dynamic reconfigurability Objectives Minimize cutsize Minimize the number of used FPGAs Preserve module integrity

SPartA: the algorithm 1/2 Recursive algorithm (deals with trees) Starts from TOP node Precondition No leaves with dimension > FPGA size At every moment, a node can be: COVERED, UNCOVERED or PARTIALLY COVERED Stop condition Node TOP is COVERED

SPartA: the algorithm 2/2 OPEN ISSUE: Selecting the first node to be inserted into an empty partition Random node Node with overall max communication Node with max communication with its siblings
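The three candidate policies for choosing the first node can be sketched as follows. This is only an illustration under assumptions: the `comm` dictionary of pairwise communication weights, the `parent` map describing the design hierarchy, and all function names are hypothetical and not taken from the SPartA implementation.

```python
import random

def total_comm(node, comm):
    """Sum of communication weights on all edges touching the node."""
    return sum(w for (a, b), w in comm.items() if node in (a, b))

def sibling_comm(node, comm, parent):
    """Communication restricted to edges between nodes sharing a parent."""
    return sum(w for (a, b), w in comm.items()
               if node in (a, b) and parent[a] == parent[b])

def pick_first(candidates, comm, parent, policy, rng=random):
    """Select the first node for an empty partition under the given policy."""
    ordered = sorted(candidates)  # deterministic tie-breaking
    if policy == "random":
        return rng.choice(ordered)
    if policy == "max_overall":
        return max(ordered, key=lambda n: total_comm(n, comm))
    if policy == "max_siblings":
        return max(ordered, key=lambda n: sibling_comm(n, comm, parent))
    raise ValueError(f"unknown policy: {policy}")

# Hypothetical hierarchy: a, b under m1; c, d under m2; e under m3.
parent = {"a": "m1", "b": "m1", "c": "m2", "d": "m2", "e": "m3"}
comm = {("a", "b"): 1, ("c", "d"): 6, ("a", "e"): 9}

print(pick_first(parent, comm, parent, "max_overall"))
print(pick_first(parent, comm, parent, "max_siblings"))
```

In this toy case the two communication-based policies disagree, which is the situation the open issue is about.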

Results 2/2 Complexity: exponential, due to the recursive nature of the algorithm Execution time nevertheless low (tens of seconds for a reasonably large design) [Figure: example of an original tree and the partitioned tree]

Results 3/3 Evaluation metrics EDGECUT, FILLING and SPLITS Evaluation of the three policies for node selection 18 different trees of varying size

Future work Algorithm improvement Balancing of last partition First node selection policies More refined “score” function for selecting node Use closeness metrics Comparisons with existing algorithms Expansion SPartA framework development Topology-aware partitioning

Chimera Multi-FPGAs Architecture Definition Matteo Murgida [email_address]

Outline Introduction Problem description Project Goals State of the Art Project in details Contributions Phases Results What’s next

Problem Description Architectural description of a distributed FPGA environment 3-layer architecture

Project Goals Design the architecture of the most generic distributed system Node definition Interface definition Communication channel definition Design a communication protocol Essential protocol Interrupt based protocol Timeout improvement

State of the Art CONFigurable ElecTronic TIssue (CONFETTI) by EPFL Cellular based architecture PROs: high degree of parallelism, high computational power CONs: no flexibility, oversized for small problems, small architectural customizations imply big cost/effort Splash 2 by IDA Supercomputing Center Architecture composed of a Sun Sparcstation host, an interface board and “Splash Array” boards PROs: again high parallelism and power CONs: a central host coordinates the computational units, no fault tolerance, no flexibility

Contributions The proposed architecture: Allows several Spartan-3 Starter Boards to communicate and exchange data It is portable to different FPGAs with minimum effort It is the basic infrastructure that will allow external partial dynamic reconfiguration

Project Phases First Phase, time window: 15th March – 12th April Documentation: prj presentation (12/4), prj report Goals: Digilent Spartan-3 Starter Board study Boards connection Second Phase, time window: 13th April – 17th May Documentation: prj presentation (17/5), prj report Goals: Communication between two Microblaze soft-processors GPIO integration in the architecture Third Phase, time window: 18th May – 14th June Documentation: prj presentation (14/6), prj report Goals: Interrupt handling, timeout handling Simple application as example

Board Study How to use resources like switches, LEDs and connectors on the board How to map an IP-Core port to a physical pin of the board Choice of the A2 Expansion Connector to connect two boards

Microblaze Communication Communication between two Microblaze soft-processors Development of a display controller to visualize the data flow

GPIO Insertion Higher architecture portability through the use of the GPIO IP-Core.

Interrupt Controller Insertion Communication protocol improvement by interrupt handling to prevent processor from busy waiting Interrupt Controller is included in the architecture to permit multi-interrupt detection and handling

Timeout Malfunctioning due to interference on the communication channel leads to deadlocks Communication protocol is not reliable at all Counter implementation, including the driver used by the processor to lower raised interrupts Development of a simple application to verify the correctness of the proposed approach
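The role of the timeout can be illustrated with a small software analogy: a receiver that gives up after a bounded wait instead of blocking forever on a word that was lost on the channel. This is not the hardware counter and interrupt driver described on the slide; the queue standing in for the channel and the function name are assumptions made for illustration.

```python
import queue

def receive_with_timeout(channel, timeout_s):
    """Return the next word from the channel, or None if the timeout expires."""
    try:
        return channel.get(timeout=timeout_s)
    except queue.Empty:
        return None  # the caller can retry or report an error instead of deadlocking

channel = queue.Queue()
channel.put(0x2A)

first = receive_with_timeout(channel, 0.05)   # a word is available
second = receive_with_timeout(channel, 0.05)  # nothing arrives: the wait is bounded
print(first, second)
```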

Future Work Apply the proposed approach to external partial dynamic reconfiguration Develop a co-simulation framework based on the VHDL/SystemC descriptions of distributed systems Receive as input the VHDL description of the system Build the VHDL description for every node Create the SystemC stub to allow inter node communication Describe the communication in SystemC Co-simulate the VHDL / SystemC description

Operating System support for Reconfigurable SoC

Development of an OS architecture-independent layer for dynamic reconfiguration Ivan Beretta [email_address]

Problem description Need for an operating system support on Reconfigurable SoCs Simplified software development process Improved code portability Lack of support for dynamic reconfigurable architectures Specific solutions for specific architectures Need for an architecture-independent abstraction layer

Project Goal Primary goals: Analysis of the State of the Art Definition of the new intermediate layer Physical implementation Specific goals: Study of the solutions developed inside the DRESD group Comparison between existing solutions Recovery of one of the two implementations Hardware architectures generation using up-to-date tools on Xilinx Virtex II – Pro VP7

State of the Art Caronte implementation (Alberto Donato, 2005) Two kernel modules ICAP device driver IP-Core manager (IPCM)

State of the Art (cont’d) YaRA implementation (Vincenzo Rana, 2006) Multi-layered structure Four modules: Reconfiguration controller driver, MAC, LOL, Reconfiguration Library ROTFL architecture

Contributions Limits of existing implementations Lack of portability E.g. YaRA solution implemented on RAPTOR2000 Reconfiguration process details visible from userspace Definition of an architecture independent middleware Improved portability It works on different hardware architectures It works with different Linux distributions Opportunity to optimize latencies

Phases First phase: Layer definition Goal: Factorization of common features Boundaries of the new middleware Mapping of existing solutions on the functionalities Motivation: Provide guidelines for actual implementation Second phase: Implementation recovery Goal: Recovery of bootstrap process and kernel images Motivation: Full recovery of Caronte solution Third phase: Architectures generation Goal: Synthesis of hardware architectures using up-to-date Xilinx tools and cores Motivation: Synthesis of hardware architectures using up-to-date Xilinx tools

First Phase: Layer definition Definition of new layer boundaries Factorization of existing features Mapping of the required functionalities on existing implementations Legend: ● = Both hardware and software ● = Hardware independent
Feature | Caronte Solution | YaRA Solution
Reconfiguration controller support | ICAP device driver | Reconfiguration Controller Driver
Dynamic address space assignment | IPCM Module | MAC module
Dynamic device registration and driver loading | IPCM Module | LOL module
API | Direct interaction with modules | Reconfiguration library
Module management (caching, placement...) | Not implemented | ROTFL architecture

Second Phase: Implementation Recovery Bootstrap process from flash memory [Figure: memory map and boot sequence, steps 1 to 6. 16 MB Flash from 0xe4000000 to 0xe4FFFFFF (marked addresses: 0xe42FFFFF, 0xe4F00000, 0xe4F80000); 64 MB DDR SDRAM from 0x00000000 to 0x03FFFFFF (marked address: 0x00800000); BRAM, PowerPC, FPGA; contents: Bootloader, Bootmanager, Kernel and RAMDisk Image]

Second Phase: Implementation Recovery (cont’d) Several issues No bootmanager nor Linux kernel on flash memory at the beginning Flash memory seen as read-only memory at runtime Need for an ad-hoc solution Avmon command line interface Executed from DDR SDRAM memory FTP transfer of bootmanager and flash programming Also useful for kernel download Kernel executable image Kernel image built using a cross-compiler ICAP and IPCM modules loaded at runtime

Third Phase: Architecture generation Hardware architecture used in Second Phase no longer useful Synthesized with Xilinx ISE and EDK 6.1 Same hardware structure realized with updated cores and recent tool versions Synthesis with Xilinx ISE and EDK 7.1 Synthesis with Xilinx ISE and EDK 9.1 Lack of device driver support and documentation to configure newest cores

Results: Implementation Recovery Linux Bootstrap from flash memory

Results: Implementation Recovery Design summary for hardware architectures on Xilinx Virtex II – Pro VP7 Two main limitations Ethernet controller Necessity of a top-level design Design too large for module-based reconfiguration
Resource | ISE/EDK 7.1: Used | Available | % | ISE/EDK 9.1: Used | Available | %
Slices | 4926 | 4928 | 99% | 5318 | 4928 | 107%
Flip-Flops | 5217 | 9856 | 52% | 5724 | 9856 | 58%
4-in LUTs | 6974 | 9856 | 70% | 6993 | 9856 | 70%
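The percentages in the design summary can be re-derived from the used and available counts. The short sketch below does so, assuming the tool report truncates the percentage to an integer (this assumption reproduces every figure in the table), and flags any resource that exceeds the device capacity. The dictionary layout and names are illustrative.

```python
def utilization(used, available):
    """Return (truncated percentage, fits_on_device)."""
    return 100 * used // available, used <= available

# Figures copied from the design summary: (used, available).
summary = {
    "ISE/EDK 7.1": {"Slices": (4926, 4928), "Flip-Flops": (5217, 9856), "4-in LUTs": (6974, 9856)},
    "ISE/EDK 9.1": {"Slices": (5318, 4928), "Flip-Flops": (5724, 9856), "4-in LUTs": (6993, 9856)},
}

for tool, resources in summary.items():
    for name, (used, available) in resources.items():
        pct, fits = utilization(used, available)
        print(f"{tool:12s} {name:10s} {pct:3d}% {'ok' if fits else 'exceeds device'}")
```

The only over-subscribed resource is the slice count under ISE/EDK 9.1, which is consistent with the "design too large" limitation listed on the slide.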

What’s next Device driver updates to support newest architectures Intermediate layer implementation Opportunity to add some additional features Reconfiguration scheduler Opportunity to define a common device driver interface to simplify the creation of a new driver by the user Integration of the middleware and the operating system support in a complete design flow

Design Flow Antonio Piazzi [email_address]

Problem description The user has to spread attention across many problems, some of them related to the implementation of the design. Users often know nothing about reconfigurable architecture generation and they haven’t.

Project Goals New design methodology tailored to support partial dynamic reconfigurable architecture Definition and implementation of a design framework able to Support different design paradigms i.e. Xilinx Module Based, Xilinx EAPR Hide the dirty work (due to the reconfiguration) from the application designer Support different architectural solutions i.e. different communication infrastructure IBM CoreConnect or Wishbone

Contributions With our framework, all users (novice or not) can develop and debug their functionality on a reconfigurable architecture without analyzing all the problems related to that development methodology

Phases 1st phase (15 March – 15 April): Budgeting Study of the state of the art 2nd phase (15 April – 15 May): Realization phase Construction of the entire framework based on previously separate tools Implementation of an innovative workflow 3rd phase (15 May – 15 June): Project’s validation Definition of a new communication infrastructure and transfer protocol for the reconfigurable part Verify the integration of the new infrastructure in the project

First Phase Study of the state of the art Standard reconfigurable design flow Xilinx Module Based and EAPR Caronte Design Flow EDK-based architecture

Self Reconfigurable Architecture

Second Phase 1/4 Construction of the entire framework based on previously separate tools The user has to focus only on the development of the IBM CoreConnect architecture and on writing the modules that implement the desired functionality SYSTEM.VHD contains all information about the IBM CoreConnect architecture

Second Phase 2/4 ArchGen takes the system.vhd file, processes the contained architecture and translates that static architecture into a dynamic one FIX.VHD contains the instantiations of the processors (one or more) and all the components present in the IBM CoreConnect architecture TOP.VHD contains the instantiations of the fix component and the information about the communication infrastructure

Second Phase 3/4 COMiC generates an NCD file which contains the information about the communication infrastructure and an XDL file which contains the same information in text form

Second Phase 4/4 At this point we only have to collect all the information we need: through a parser we insert it into a new top.vhd, which will be the fix part of our architecture. After that we only have to manage the reconfigurable modules written by the user

Third Phase 1/3 An OPB bus based on 3-state buffers is used to link one or more modules to the fix part (created with ISE) Definition of a new communication infrastructure and transfer protocol for the reconfigurable part

Third Phase 2/3 Use ncd2xdl converter to obtain an xdl file which contains all parameters of our bus

Third Phase 3/3 Perfect integration in our process: we can use any bus type to connect the fix and reconfigurable parts Verify the integration of the new infrastructure in the project

Results The framework answers the need for automation expressed by novice users and, more generally, helps all users who need a short time to market.

What’s next Our plan for future work is to schedule one or two working days to patch some bugs present in the project and to adjust the output of COMiC, which has to create an OPB replay bus.

What’s next DRESD DReAMS Matteo Murgida Alessandro Panella Operating System Ivan Beretta Design Flow Antonio Piazzi Polaris Massimo Morandi Marco Novati HLR Marco Maggioni

Polaris Create an integrated HW/SW system to manage 2D reconfiguration SW side: Maintain information on FPGA status Decide how to efficiently allocate tasks HW side: Provide support for effective task allocation Perform 2D bitstream relocation

Management of 2D Reconfiguration in a Reconfigurable System Massimo Morandi [email_address]

Outline Introduction Problem description Project Goals and Contributions Project in details Phases Results Future Work

Problem Description New Generation of FPGAs Virtex-4 and Virtex-5 Allow bi-dimensional reconfiguration This makes it possible to: Better exploit reconfigurable area Obtain module performance optimizations More complex management: Handle one more degree of freedom Avoid more fragmentation Perform good placement choices to keep the Task Rejection Rate (TRR) low Keep acceptable intra-module routing paths

Project Goals and Contributions Analyze effects of 2D reconfiguration New advantages New problems Examine possible solutions to new problems Explore literature to find promising ideas Evaluate those solutions in various scenarios Propose a new solution Combining ideas from literature with new ones Obtaining good cost-quality tradeoff

Project Phases First Phase, time window: 15th March – 12th April Documentation: prj presentation (12/4), prj report Goals: General analysis of 2D reconfiguration Detailed description of the new problems Second Phase, time window: 13th April – 17th May Documentation: prj presentation (17/5), prj report Goals: Definition of desired features for a solution Analysis and evaluation of existing solutions Third Phase, time window: 18th May – 14th June Documentation: prj presentation (14/6), prj report Goal: propose a new combined solution to effectively handle problems of 2D reconfiguration

Setting and Advantages Definition Definition of the setting: 2D self partial dynamical run-time reconfiguration Analysis of the advantages of 2D Reconfiguration In area usage and performance

2D Fragmentation Problem Analysis of the 2D-fragmentation problem Area generally more fragmented Can nullify the area optimizations obtained

Placement Decisions Analysis of 2D placement choices effects: Again, bad choices can lead to performance loss

Allocation manager Definition of allocation manager desired features: Low TRR Low management overhead High routing efficiency Low fragmentation Definition of allocation manager structure: Empty space manager Complete space Heuristic selection Fitter General (FF,BL,BF,WF…) Focused (FA,RA… )

Most relevant works Maintain complete information on empty space: KAMER: Keep All Maximally Empty Rectangles Apply a general fitting strategy CUR: Maintain the Contour of a Union of Rectangles Apply a focused fitting strategy Heuristically prune part of the information: KNER: Keep Non-overlapping Empty Rectangles Apply a general fitting strategy 2D-HASHING: Keep Non-overlapping Empty Rectangles in an optimized data structure Apply (exclusively) a general fitting strategy

Evaluation and Proposed Approach Proposed Approach Heuristic (KNER-like) empty space manager, to keep low complexity for use in a self-reconfigurable system Fitting strategy focused on minimizing routing paths, to maintain high performance of the reconfigurable system (chosen metric to minimize: Manhattan distance) High placement quality => high complexity Lowest complexity => no focused fitting (bad especially for routing)

Structure of the allocation manager Task, defined by: Arrival time, ASAP, (ALAP), H, W, Latency, Communicating Tasks Hosted in a queue which also adds a pointer to the rectangle where it is placed Reconfigurable Device, represented as: Binary Tree structure, each node is a Rectangle, each leaf is an empty Rectangle. Navigation through pointers to left child, right child, next leaf and a function to find previous leaf (for bookkeeping after split or merge) Rectangle, defined by: X, Y, H, W Initially one, (X,Y)=(0,0), H=FPGA Rows, W=FPGA Cols
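A minimal sketch of such a structure is given below: a binary tree of rectangles whose leaves are the empty areas, with a placement routine that picks the fitting leaf closest (in Manhattan distance) to the tasks it communicates with. The split rule, the tie-breaking and every name here are assumptions for illustration; merging on task removal and the next/previous leaf pointers are omitted, and this is not the project's actual code.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Rect:
    x: int  # column of the origin corner
    y: int  # row of the origin corner
    h: int  # height in rows
    w: int  # width in columns
    left: Optional["Rect"] = None
    right: Optional["Rect"] = None

def leaves(node):
    """Empty rectangles are exactly the leaves of the tree."""
    if node.left is None and node.right is None:
        return [node]
    return leaves(node.left) + leaves(node.right)

def manhattan(p, q):
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def place(root, h, w, partners=()):
    """Place an h x w task; return its (x, y) origin, or None if rejected."""
    fitting = [leaf for leaf in leaves(root) if leaf.h >= h and leaf.w >= w]
    if not fitting:
        return None
    # Focused fitting: minimize total distance to communicating tasks.
    best = min(fitting, key=lambda r: sum(manhattan((r.x, r.y), p) for p in partners))
    # Split the leftover L-shape into two non-overlapping empty rectangles.
    best.left = Rect(best.x + w, best.y, h, best.w - w)       # beside the task
    best.right = Rect(best.x, best.y + h, best.h - h, best.w)  # remaining rows
    return (best.x, best.y)

fpga = Rect(0, 0, h=8, w=10)           # hypothetical 8-row, 10-column device
p1 = place(fpga, 4, 6)                 # first task goes to the origin
p2 = place(fpga, 2, 2, partners=[p1])  # placed in the empty area nearest to p1
p3 = place(fpga, 8, 8)                 # no empty rectangle is large enough
print(p1, p2, p3)
```

With no communicating tasks the selection degenerates to first fit, which matches the "general" fitting strategies listed earlier.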

Experimental Results Benchmark of 100 randomly generated tasks: Size (5% to 25% of FPGA), randomly interconnected Execution time: 3x less than CUR, close to KNER Communication cost: 3x less than KNER, close to CUR Task Rejection Rate: all solutions quite close

Future Work Apply the proposed solution to self reconfiguration: Adapt the algorithm to run on the internal processor Create a validation reconfigurable architecture Integrate the architecture with relocation Tune the algorithm to improve results: Experiment with techniques to reduce TRR Try to optimize the code to have an algorithm with lower running time

Relocation for 2D Reconfigurable Systems Marco Novati [email_address]

Outline Introduction Problem description Project Goals Project in details Phases Results What’s next

Problem Description Self Dynamical Runtime 2D Reconfiguration Xilinx Virtex-4 and Virtex-5 Relocation, different solutions Software (BAnMat, PARBIT) Hardware (REPLICA, BiRF) We chose a hardware solution BiRF Square

Project Goals Study of the new FPGA Families Examination of Xilinx documentation on V4 and V5 Analysis of the new bitstream structure Generation of V4 and V5 bitstream Development of the new version of BiRF Implementation Validation

Phases First Phase: 15th March – 12th April Documentation: prj presentation (12/4), prj report Goals: Xilinx documentation examination V4 & V5 bitstream structure analysis Second Phase: 13th April – 17th May Documentation: prj presentation (17/5), prj report Goals: Implementation of BiRF Square Synthesis Third Phase: 18th May – 14th June Documentation: prj presentation (14/6), prj report Goals: Verification & Validation

Frame Addressing New Frame Addressing: Possibility of addressing rows and columns

CRC Calculation Particular CRC value, used by Xilinx tools Two versions of BiRF Square: By using the “predefined” value With actual CRC calculation An optimized algorithm has been used

Synthesis results On a Virtex-4 with speed grade -12 General purpose version: max frequency of 160 MHz Specific version: max frequency of 290 MHz

Results 1/2 BiRF Square Permits relocation in a self partially and dynamically 2D-reconfigurable system The occupation ratio is relatively small Frequency more than acceptable Reduction of internal memory requirements

Results 2/2 Throughput of 7.3 MB/s: The total configuration file size is about 1 MB Considering an architecture: 1/3 of the area as fixed part 2/3 as reconfigurable part with 6 slots Under this hypothesis The size of a partial bitstream will be about 110 KB Relocation time of about 15 ms
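The figures on this slide follow from a short calculation, reproduced here as a sketch. It assumes that bitstream size scales with the configured area and that 1 MB = 1000 KB; both are simplifying assumptions, and the variable names are illustrative.

```python
total_bitstream_kb = 1000      # "about 1 MB" full configuration file
reconfigurable_fraction = 2 / 3
slots = 6
throughput_kb_per_s = 7300     # 7.3 MB/s

# One slot holds an equal share of the reconfigurable area.
partial_kb = total_bitstream_kb * reconfigurable_fraction / slots

# Time to stream one partial bitstream through the relocation filter.
relocation_ms = partial_kb / throughput_kb_per_s * 1000

print(f"partial bitstream ≈ {partial_kb:.0f} KB, relocation ≈ {relocation_ms:.1f} ms")
```

The result, roughly 111 KB and 15 ms, agrees with the rounded values quoted on the slide.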

What’s Next Future improvements: Direct access to the memory (DMA) Direct manipulation of the bitstream Portability Integration with ICAP Elimination of the relocation overhead Relocation time << reconfiguration time The final goal: Creation of a real architecture that exploits self partial and dynamical 2D-reconfiguration, with relocation

High Level Reconfiguration Marco Maggioni marco.maggioni@dresd.org

Outline Introduction Problem description Project Goals State of the Art Project in details Contributions HLR workflow GraphGen IsomorphClustering SimpleLatency Salomone Results What’s next

Problem Description What is High Level Reconfiguration...? Theoretical approach to dynamic reconfiguration... Vision... Reconfigurability has many advantages... Mission... Exploit these advantages to obtain best performance... How...? Adapting a system to this execution model managing complexity and drawbacks...

Project Goal Create a complete HLR workflow... From a real system specification to its reconfigurable execution model... Define precise interfaces for each phase... To promote flexibility and future HLR research... To develop a complete toolchain... Apply some algorithms regarding reconfigurability... To reuse past works...

State of the Art Present of HLR ... Some ideas/concepts regarding clustering and scheduling... ... but not a complete and well-defined workflow. ... but a lot of work to do. System specifications analysis... PandA HW/SW framework to promote new ideas... Dynamic Reconfigurability can be considered as a branch of this research...

Contribution Dynamic library loading system ... Embedded into GNU compilation tool-chain Porting of PandA libraries into Earendil... Suitable for future analysis... HLR tools deployed onto Earendil... Cover each step of workflow...

HLR workflow Clustering (with Analysis)... 1st Month Coloring... 2nd Month Scheduling... 3rd Month [Figure: workflow diagram with blocks Gcc Frontend, Partitioning Algorithm, PandA, Scheduling Algorithm, Clustered Graph, Metric Evaluation, Reconfigurable Clustered Graph, Target Architecture Database; metrics: Area, Latency, Rec. Time, Power]

GraphGen GraphGen is the first step of the HLR toolchain ... Takes as input a system specification or an algorithm... Produces a graph (CFG/BB/DFG/SDG) Performs high level analysis step... Transforms the system description (C/C++/SystemC) to a representation suitable for further elaboration... Based on GCC and compiler theory... Uses PandA 0.4 functionalities to produce a statement level graph...

IsomorphClustering IsomorphClustering follows GraphGen in the HLR toolchain ... Takes as input a statement level graph... Produces a clustered graph... Clustering phase... Aggregates nodes into configurations (basic unit of reconfigurable execution)... Based on isomorphism, tries to find different instances of isomorphic templates... We can also apply different algorithms...

SimpleLatency SimpleLatency follows IsomorphClustering in the HLR toolchain ... Takes as input a clustered graph... Adds latency information to each configuration... Produces a reconfigurable clustered graph with latency evaluations... Coloring... “Colors” each cluster with useful evaluations for reconfigurability... Based on the internal critical path of each cluster... Different metrics for different architectures... Connects HLR with real architectural parameters...
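A latency estimate based on a cluster's internal critical path can be sketched as the longest weighted path through the cluster's dependency graph. The example operations, their latencies and the function name below are hypothetical; the slide does not describe how SimpleLatency is actually implemented.

```python
from functools import lru_cache

def critical_path_latency(latency, succs):
    """Longest path through a DAG, summing per-node latencies."""
    @lru_cache(maxsize=None)
    def longest_from(node):
        tail = max((longest_from(s) for s in succs.get(node, ())), default=0)
        return latency[node] + tail
    return max(longest_from(n) for n in latency)

# Hypothetical cluster: a load feeding a multiply and an add, both feeding a store.
latency = {"ld": 2, "mul": 3, "add": 1, "st": 2}
succs = {"ld": ("mul", "add"), "mul": ("st",), "add": ("st",)}

print(critical_path_latency(latency, succs))
```

Swapping the `latency` table is one way to express "different metrics for different architectures" without touching the graph.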

Salomone Salomone is the last step in the HLR toolchain ... Takes as input a reconfigurable clustered graph... Produces a schedule on an abstract reconfigurable architecture... Scheduling... It's considered the core task of HLR ... Maps each configuration on an area portion... Adapts the system execution to the reconfigurable model... Based on a graph coloring algorithm...
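To illustrate how graph coloring can map configurations to area portions, the sketch below applies a generic greedy coloring to a conflict graph in which an edge means two configurations must be resident at the same time. This is a textbook heuristic chosen for illustration, not the Salomone algorithm; the conflict graph and names are assumptions.

```python
def greedy_coloring(conflicts):
    """Assign each node the smallest color unused by its colored neighbors."""
    color = {}
    # Visit high-degree nodes first; a common ordering for greedy coloring.
    for node in sorted(conflicts, key=lambda n: -len(conflicts[n])):
        used = {color[m] for m in conflicts[node] if m in color}
        color[node] = next(c for c in range(len(conflicts)) if c not in used)
    return color

# Hypothetical conflicts: c1 overlaps in time with c2 and c3; c4 overlaps with none.
conflicts = {"c1": {"c2", "c3"}, "c2": {"c1"}, "c3": {"c1"}, "c4": set()}

area_of = greedy_coloring(conflicts)
print(area_of, "area portions used:", len(set(area_of.values())))
```

Each color stands for one area portion, so the number of distinct colors is the number of portions the schedule needs.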

Results 1/3 Based on AES encryption... Templates found with IsomorphClustering... Execution time... 123.94 s

Results 2/3 Salomone adapting and coloring... Execution time... 113.55 s

Results 3/3 Final Scheduling...

What's next Heuristic implementation for Salomone... To improve result quality in terms of number of area portions... A new metric for area/latency... Based on RTL logical synthesis evaluations... Introduce feedback into HLR workflow... Based on schedule evaluation... New clustering and scheduling algorithms... Such as Napoleon...
