High Performance Processors and Systems PdM – UIC joint master 2007 Instructor: Prof. Donatella Sciuto HPPS @ PdM – June 2007
General Outline DRESD DReAMS Alessandro Panella Matteo Murgida Operating System Ivan Beretta Design Flow Antonio Piazzi Polaris Massimo Morandi Marco Novati HLR Marco Maggioni
DRESD in a Nutshell Dynamic Reconfigurability in Embedded System Design DRESD @ PdM – June 2007
Outline Reconfiguration Motivations Basic Definition SoC
Motivations Increasing need for behavioral flexibility in embedded systems design Support of new standards, e.g. in media processing Addition of new features Applications too large to fit on the device all at once Speed up the overall computation of the final system
Reconfiguration The process of physically altering the location or functionality of network or system elements. Automatic configuration describes the way sophisticated networks can readjust themselves in the event of a link or device failing, enabling the network to continue operation. Gerald Estrin, 1960
SoC Reconfiguration Fix Partial Total Embedded
Different Scenarios... Single Device Distributed System
What’s next DRESD DReAMS Alessandro Panella Matteo Murgida Operating System Ivan Beretta Design Flow Antonio Piazzi Polaris Massimo Morandi Marco Novati HLR Marco Maggioni
Dynamic Reconfigurability Applied to Multi-FPGA Systems
DReAMS Dynamic Reconfigurability Applied to Multi-FPGA Systems Branch of DRESD project Inherits architectures and tools Automatic workflow from VHDL system description to FPGA implementation VHDL parsing and system simulation System creation over a specific architecture Bitstream creation and download onto FPGAs
Multi-FPGA Partitioning Alessandro Panella [email_address]
Outline Problem description Project goals and contributions Project phases What is partitioning? Existing approaches Going deeper into the problem SPartA The framework The idea The algorithm Experimental results Future work
Problem description Multi-FPGA - RATIONALE Large designs do not fit into a single chip High performance parallelized applications Our case: apply dynamic reconfigurability Need to break the initial design into several blocks One block corresponds to a single FPGA chip Which inputs/outputs? Which objectives? Which techniques?
Project goals and contributions Analyze existing approaches Obtain a deep knowledge of this -well explored- field Extract basic ideas for a new approach Obtain some terms of comparison Define precisely which problem(s) we cope with Contextualize the problem Focus on our needs Develop a new solution Theoretical background Implementation and evaluation
Project phases First Phase  [15th March – 12th April] Documentation: presentation (12/4), report Goals: Analysis of the state of the art Produce some hints on a new approach Second Phase [13th April – 17th May] Documentation: presentation (17/5), report Goals: Precise definition of the problem Propose a new solution Third Phase [18th May – 14th June] Documentation: presentation (14/6), final report Goal Implementation and evaluation of the proposed solution
What is partitioning? Goal Divide a set of interrelated objects into a set of subsets Optimize one or more specific objectives K-way partitioning Given a graph $G=(V,E)$, partition it into $k$ subsets $V_1, \dots, V_k$ such that they are pairwise disjoint and their union is $V$. Balance constraint: $|V_i| \approx |V|/k$ Aims at minimizing (or maximizing) an objective function Edge-cut Other objectives In general: NP-complete Several heuristics that provide good results have been developed
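The two quantities on this slide, edge-cut and balance, can be made concrete with a small sketch. The graph, the partition and the function names below are hypothetical and chosen only for illustration; they are not taken from the project.

```python
from collections import Counter

def edge_cut(edges, part):
    """Count the edges whose endpoints lie in different subsets."""
    return sum(1 for u, v in edges if part[u] != part[v])

def imbalance(part, k):
    """Largest subset size divided by the ideal size |V|/k (1.0 means perfectly balanced)."""
    sizes = Counter(part.values())
    return max(sizes.values()) / (len(part) / k)

# Hypothetical 6-vertex graph: two triangles joined by a single edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
part = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}  # vertex -> subset index

cut = edge_cut(edges, part)      # only edge (2, 3) crosses the partition
balance = imbalance(part, k=2)   # both subsets hold |V|/k = 3 vertices
```

A partitioner typically tries to lower `cut` while keeping `balance` close to 1.0.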
Existing approaches - a glance Traditional methods Kernighan–Lin and Fiduccia–Mattheyses heuristics Iterative-improvement algorithms Begin with an initial partition and iteratively improve it $O(n^3)$ complexity Iterative algorithms Genetic Simulated annealing Multilevel algorithms Clustering -> Initial partitioning -> Refining METIS/hMETIS suite: best current results for partitioning large flattened graphs
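As a rough illustration of the iterative-improvement idea, the sketch below performs single-vertex moves in the spirit of Fiduccia–Mattheyses on a bipartition. It is a simplified assumption-laden example: it ignores the balance constraint, vertex locking and the bucket data structure of the real heuristic, and the graph is hypothetical.

```python
def move_gain(v, adj, part):
    """Edge-cut reduction obtained by moving v to the other side of a bipartition."""
    external = sum(1 for u in adj[v] if part[u] != part[v])
    internal = sum(1 for u in adj[v] if part[u] == part[v])
    return external - internal

def improve_once(adj, part):
    """Move the vertex with the best strictly positive gain; return True if a move happened."""
    best = max(adj, key=lambda v: move_gain(v, adj, part))
    if move_gain(best, adj, part) <= 0:
        return False
    part[best] = 1 - part[best]
    return True

def cut_size(adj, part):
    """Each cut edge is seen from both endpoints, hence the division by two."""
    return sum(1 for v in adj for u in adj[v] if part[u] != part[v]) // 2

# Hypothetical graph: triangles {0,1,2} and {3,4,5} joined by edge 2-3.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
part = {0: 0, 1: 0, 2: 1, 3: 1, 4: 1, 5: 1}  # poor start: vertex 2 is on the wrong side

before = cut_size(adj, part)
moved = improve_once(adj, part)
after = cut_size(adj, part)
```

Repeating `improve_once` until it returns `False` gives a local optimum, which is why such heuristics are usually combined with good initial partitions or multilevel schemes.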
Going deeper into the problem Two kinds of multi-FPGA partition Topology-aware Architecture topology is an input No optimization of the no. of FPGAs needed Main task: association between the (larger)  system graph  and the (smaller)  architectural graph Topology-free Architecture topology is not provided Input: dimension and communication features of FPGAs Minimization of the number of FPGAs Place and route after partitioning At the moment, we deal with the  Topology-free  problem
SPartA: the framework Input: VHDL system description Output: several VHDL files, one for each block (FPGA) Three main phases: Extract design from VHDL description “Real” partitioning phase (core) Build VHDL files
SPartA: the idea Structural approach Fully exploits the design hierarchy Modules can be treated as single blocks Bases for expansions toward dynamic reconfigurability Objectives Minimize cutsize Minimize the number of used FPGAs Preserve module integrity
SPartA: the algorithm  1/2 Recursive algorithm (deals with trees) Starts from TOP node Precondition No leaves with dimension > FPGA size At every moment, a node can be: COVERED, UNCOVERED or PARTIALLY COVERED Stop condition Node TOP is COVERED
SPartA: the algorithm  2/2 OPEN ISSUE: Selecting the first node to be inserted into an empty partition Random node Node with overall max communication Node with max communication with its siblings
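The three candidate policies for choosing the first node can be sketched as follows. The data layout (a dictionary of communication weights keyed by node pairs) and all names are assumptions made for this example; the slides do not describe SPartA's internal representation.

```python
import random

def total_comm(node, comm):
    """Sum of the communication weights on every edge touching the node."""
    return sum(w for (a, b), w in comm.items() if node in (a, b))

def sibling_comm(node, siblings, comm):
    """Same sum, restricted to edges whose two endpoints are both siblings."""
    return sum(w for (a, b), w in comm.items()
               if node in (a, b) and a in siblings and b in siblings)

def pick_first(siblings, comm, policy, rng=random):
    """Select the node that opens an empty partition, according to the given policy."""
    ordered = sorted(siblings)  # fixed order so that ties resolve deterministically
    if policy == "random":
        return rng.choice(ordered)
    if policy == "max_total":
        return max(ordered, key=lambda n: total_comm(n, comm))
    if policy == "max_sibling":
        return max(ordered, key=lambda n: sibling_comm(n, set(siblings), comm))
    raise ValueError(f"unknown policy: {policy}")

# Hypothetical example: A talks a lot to an outside node X, B talks most to its siblings.
siblings = ["A", "B", "C"]
comm = {("A", "B"): 2, ("A", "X"): 10, ("B", "C"): 5}

first_total = pick_first(siblings, comm, "max_total")
first_sibling = pick_first(siblings, comm, "max_sibling")
```

The example shows that the two communication-based policies can disagree, which is what makes the choice an open issue.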
Results 2/2 Complexity: exponential, due to the recursive nature of the algorithm Execution time nevertheless low (tens of seconds for a reasonably large design) EXAMPLE ORIGINAL TREE PARTITIONED TREE
Results  3/3 Evaluation metrics EDGECUT, FILLING and SPLITS Evaluation of the three policies for node selection 18 different trees of varying size
Results  3/3
Future work Algorithm improvement Balancing of last partition First node selection policies More refined “score” function for selecting node Use closeness metrics Comparisons with existing algorithms Expansion SPartA framework development Topology-aware partitioning
The end ANY QUESTIONS?
What’s next DRESD DReAMS Alessandro Panella Matteo Murgida Operating System Ivan Beretta Design Flow Antonio Piazzi Polaris Massimo Morandi Marco Novati HLR Marco Maggioni
Chimera Multi-FPGAs Architecture Definition Matteo Murgida [email_address]
Outline Introduction Problem description Project Goals State of the Art Project in details Contributions Phases Results What’s next
Problem Description Architectural description of a distributed FPGAs environment 3 layers architecture
Project Goals Design the architecture of the most generic distributed system Node definition Interface definition Communication channel definition Design a communication protocol Essential protocol Interrupt based protocol Timeout improvement
State of the Art CONFigurable ElecTronic TIssue (CONFETTI) by EPFL Cellular based architecture PROs: high degree of parallelism, high computational power CONs: no flexibility, oversized for small problems, small architectural customizations imply big cost/effort Splash 2 by IDA Supercomputing Center Architecture composed of a Sun Sparcstation host, an interface board and “Splash Array” boards PROs: again high parallelism and power CONs: a central host coordinates the computational units, no fault tolerance, no flexibility
Contributions The proposed architecture: Allows several Spartan-3 Starter Boards to communicate and exchange data It is portable to different FPGAs with minimum effort It is the basic infrastructure that will allow  external  partial dynamic reconfiguration
Project Phases First Phase, time window: 15th March – 12th April Documentation: prj presentation (12/4), prj report Goals: Digilent Spartan-3 Starter Board study Boards connection Second Phase, time window: 13th April – 17th May Documentation: prj presentation (17/5), prj report Goals: Communication between two Microblaze soft-processors GPIO integration in the architecture Third Phase, time window: 18th May – 14th June Documentation: prj presentation (14/6), prj report Goals:  Interrupt handling, timeout handling Simple application as example
Board Study How to use resources like switches, LEDs and connectors on the board How to map an IP-Core port to a physical pin of the board Choice of the A2 Expansion Connector to connect two boards
Microblaze Communication Communication between two Microblaze soft-processors Development of a display controller to visualize the data flow
GPIO Insertion Higher architecture portability through the use of the GPIO IP-Core
Interrupt Controller Insertion Communication protocol improvement by interrupt handling to prevent the processor from busy waiting. Interrupt Controller is included in the architecture to permit multi-interrupt detection and handling
Timeout Malfunctions due to interference on the communication channel lead to deadlocks Communication protocol is not reliable at all Counter implementation, including the driver used by the processor to clear raised interrupts Development of a simple application to verify the correctness of the proposed approach
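The role of the timeout counter can be illustrated with a small software simulation. This is not the MicroBlaze driver from the project: `SimulatedChannel` is a hypothetical stand-in for the board-to-board link, and polling ticks play the role of the hardware counter.

```python
class SimulatedChannel:
    """Hypothetical link model: delivers a word after some ticks, or never."""

    def __init__(self, word, ready_after=None):
        self.word = word
        self.ready_after = ready_after  # None models a word lost to interference
        self.ticks = 0

    def poll(self):
        """Advance one tick and return the word once it has arrived, else None."""
        self.ticks += 1
        if self.ready_after is not None and self.ticks >= self.ready_after:
            return self.word
        return None

def receive_with_timeout(channel, max_ticks):
    """Wait for a word for at most max_ticks instead of blocking forever."""
    for _ in range(max_ticks):
        word = channel.poll()
        if word is not None:
            return word
    return None  # timeout expired: the caller can retry or report an error

ok = receive_with_timeout(SimulatedChannel(0x2A, ready_after=3), max_ticks=10)
lost = receive_with_timeout(SimulatedChannel(0x2A), max_ticks=10)
```

Without the bound on the number of ticks, the second call would never return, which is the deadlock the slide refers to.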
Results A short Demo ...
Future Work Apply the proposed approach to  external  partial dynamic reconfiguration Develop a co-simulation framework based on the VHDL/SystemC descriptions of distributed systems Receive as input the VHDL description of the system Build the VHDL description for every node Create the SystemC stub to allow inter node communication Describe the communication in SystemC Co-simulate the VHDL / SystemC description
Questions
What’s next DRESD DReAMS Alessandro Panella Matteo Murgida Operating System Ivan Beretta Design Flow Antonio Piazzi Polaris Massimo Morandi Marco Novati HLR Marco Maggioni
Operating System support for Reconfigurable SoC
Development of an OS architecture-independent layer for dynamic reconfiguration Ivan Beretta [email_address]
Outline Introduction Problem description Project Goals State of the Art Project in details Contributions Phases Results What’s next
Problem description Need for an operating system support on Reconfigurable SoCs Simplified software development process Improved code portability Lack of support for dynamic reconfigurable architectures Specific solutions for specific architectures Need for an architecture-independent abstraction layer
Project Goal Primary goals: Analysis of the State of the Art Definition of the new intermediate layer Physical implementation Specific goals: Study of the solutions developed inside the DRESD group Comparison between existing solutions Recovery of one of the two implementations Hardware architectures generation using up-to-date tools on Xilinx Virtex II – Pro VP7
State of the Art Caronte implementation (Alberto Donato, 2005) Two kernel modules ICAP device driver IP-Core manager (IPCM)
State of the Art (cont’d) YaRA implementation (Vincenzo Rana, 2006) Multi-layered structure Four modules: Reconfiguration controller driver, MAC, LOL, Reconfiguration Library ROTFL architecture
Contributions Limits of existing implementations Lack of portability E.g. YaRA solution implemented on RAPTOR2000 Reconfiguration process details visible from userspace Definition of an architecture independent middleware Improved portability It works on different hardware architectures It works with different Linux distributions Opportunity to optimize latencies
Phases First phase: Layer definition Goal: Factorization of common features Boundaries of the new middleware Mapping of existing solutions on the functionalities Motivation: Provide guidelines for actual implementation Second phase: Implementation recovery Goal: Recovery of bootstrap process and kernel images Motivation: Full recovery of Caronte solution Third phase: Architectures generation Goal: Synthesis of hardware architectures using up-to-date Xilinx tools and cores Motivation: Synthesis of hardware architectures using up-to-date Xilinx tools
First Phase: Layer definition Definition of new layer boundaries Factorization of existing features Mapping of the required functionalities on existing implementations
Feature | Caronte Solution | YaRA Solution
Reconfiguration controller support | ICAP device driver | Reconfiguration Controller Driver
Dynamic address space assignment | IPCM Module | MAC module
Dynamic device registration and driver loading | IPCM Module | LOL module
API | Direct interaction with modules | Reconfiguration library
Module management (caching, placement...) | Not implemented | ROTFL architecture
Legend: ● = Both hardware and software ● = Hardware independent
Second Phase: Implementation Recovery Bootstrap process from flash memory [Figure: memory map and boot sequence, steps 1 to 6. 16 MB Flash from 0xe4000000 to 0xe4FFFFFF (marked addresses: 0xe42FFFFF, 0xe4F00000, 0xe4F80000). 64 MB DDR SDRAM from 0x00000000 to 0x03FFFFFF (marked address: 0x00800000). Blocks: FPGA, PowerPC, BRAM, Bootloader, Bootmanager, Kernel and RAMDisk Image]
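The address ranges in the memory map are consistent with the stated memory sizes, as the short check below shows. The first and last addresses are taken from the slide; the helper names are made up for this example, and which boot component sits at which marked address is not asserted here.

```python
MB = 1024 * 1024

# (first address, last address) of each memory, as given on the slide.
REGIONS = {
    "flash": (0xE4000000, 0xE4FFFFFF),
    "ddr_sdram": (0x00000000, 0x03FFFFFF),
}

def size_mb(name):
    """Size of a region in MiB, computed from its inclusive address range."""
    first, last = REGIONS[name]
    return (last - first + 1) // MB

def region_of(address):
    """Name of the region containing the address, or None if it is unmapped here."""
    for name, (first, last) in REGIONS.items():
        if first <= address <= last:
            return name
    return None

flash_mb = size_mb("flash")
ddr_mb = size_mb("ddr_sdram")
```

For instance, `region_of(0xE4F80000)` falls in flash and `region_of(0x00800000)` falls in DDR SDRAM, matching the two columns of the figure.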
Second Phase: Implementation Recovery (cont’d) Several issues No bootmanager nor Linux kernel on flash memory at the beginning Flash memory seen as read-only memory at runtime Need for an ad-hoc solution Avmon command line interface Executed from DDR SDRAM memory FTP transfer of bootmanager and flash programming Also useful for kernel download Kernel executable image Kernel image built using a cross-compiler ICAP and IPCM modules loaded at runtime
Third Phase: Architecture generation Hardware architecture used in Second Phase no longer useful Synthesized with Xilinx ISE and EDK 6.1 Same hardware structure realized with updated cores and recent tool versions Synthesis with Xilinx ISE and EDK 7.1 Synthesis with Xilinx ISE and EDK 9.1 Lack of device driver support and documentation to configure newest cores
Results: Implementation Recovery Linux Bootstrap from flash memory
Results: Implementation Recovery Design summary for hardware architectures on Xilinx Virtex II – Pro VP7
Resource | ISE/EDK 7.1 Used | ISE/EDK 7.1 Available | ISE/EDK 7.1 % | ISE/EDK 9.1 Used | ISE/EDK 9.1 Available | ISE/EDK 9.1 %
Slices | 4926 | 4928 | 99% | 5318 | 4928 | 107%
Flip-Flops | 5217 | 9856 | 52% | 5724 | 9856 | 58%
4-in LUTs | 6974 | 9856 | 70% | 6993 | 9856 | 70%
Two main limitations: Ethernet controller, necessity of a top-level design. Design too large for module-based reconfiguration
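The percentages in the design summary can be reproduced from the used and available counts. The sketch below assumes the tool report truncates percentages to whole numbers, which matches every figure on the slide; the dictionary layout is only illustrative.

```python
# (used, available) per resource, copied from the design summary.
USAGE = {
    "ISE/EDK 7.1": {"Slices": (4926, 4928), "Flip-Flops": (5217, 9856), "4-in LUTs": (6974, 9856)},
    "ISE/EDK 9.1": {"Slices": (5318, 4928), "Flip-Flops": (5724, 9856), "4-in LUTs": (6993, 9856)},
}

def percent(used, available):
    """Utilization as a whole-number percentage, truncated rather than rounded."""
    return 100 * used // available

def fits(tool):
    """A design fits only if no resource exceeds what the device provides."""
    return all(used <= available for used, available in USAGE[tool].values())

slices_71 = percent(*USAGE["ISE/EDK 7.1"]["Slices"])
slices_91 = percent(*USAGE["ISE/EDK 9.1"]["Slices"])
```

With these numbers the 7.1 build uses almost every slice of the device, while the 9.1 build exceeds the slice budget and therefore does not fit.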
What’s next Device driver updates to support newest architectures Intermediate layer implementation Opportunity to add some additional features Reconfiguration scheduler Opportunity to define a common device driver interface to simplify the creation of a new driver by the user Integration of the middleware and the operating system support in a complete design flow
Questions
What’s next DRESD DReAMS Alessandro Panella Matteo Murgida Operating System Ivan Beretta Design Flow Antonio Piazzi Polaris Massimo Morandi Marco Novati HLR Marco Maggioni
Design Flow Antonio Piazzi [email_address]
Outline Introduction Problem description Project Goals State of the Art Project in details Contributions Phases Results What’s next
Problem description Users have to spread their attention over many problems, some of them related to the implementation of the design. Often users may not know anything about reconfigurable architecture generation and they haven’t.
Project Goals New design methodology tailored to support partial dynamic reconfigurable architecture Definition and implementation of a design framework able to Support different design paradigms, i.e. Xilinx Module Based, Xilinx EAPR Hide the dirty work (due to the reconfiguration) from the application designer Support different architectural solutions, i.e. different communication infrastructure IBM CoreConnect or Wishbone
Contributions With our framework all users (novice or not) are able to develop and debug their functionality on a reconfigurable architecture without analyzing all the problems related to that development methodology
Phases 1st phase (15 March – 15 April): Budgeting Study of the state of the art 2nd phase (15 April – 15 May): Realization phase Construction of the entire framework based on previously separated tools Implementation of an innovative work flow 3rd phase (15 May – 15 June): Project’s validation Definition of a new communication infrastructure and transfer protocol for the reconfigurable part Verify the integration of the new infrastructure in the project
First Phase Study of the state of the art Standard reconfigurable design flow Xilinx Module Based and EAPR Caronte Design Flow EDK-based architecture
Self Reconfigurable Architecture
Second Phase 1/4 Construction of the entire framework based on previously separated tools The user has to focus only on the development of the IBM CoreConnect architecture and on writing the modules which implement his functionality SYSTEM.VHD contains all information about the IBM CoreConnect architecture
Second Phase 2/4 ArchGen takes the system.vhd file, processes the architecture it contains and translates that static architecture into a dynamic one FIX.VHD contains the instantiations of the processors (one or more) and all the components present in the IBM CoreConnect architecture TOP.VHD contains the instantiations of the fix component and the information about the communication infrastructure
Second Phase 3/4 COMiC generates an NCD file which contains the information about the communication infrastructure and an XDL file which contains the same information in text form
Second Phase 4/4 At this point we only have to collect all the information we need: through a parser we insert it into a new top.vhd, which will be the fix part of the architecture. After that we only have to manage the reconfigurable modules written by the user
Third Phase  1/3 An OPB bus based on 3-state buffer used to link one or more modules to the fix part (created with ISE) Definition of a new communication infrastructure and transfer protocol for the reconfigurable part
Third Phase  2/3 Use ncd2xdl converter to obtain an xdl file which contains all parameters of our bus
Third Phase 3/3 Perfect integration in our process: any bus type can be used to connect the fix and reconfigurable parts Verify the integration of the new infrastructure in the project
Results The framework answers the need for automation expressed by novice users and, more generally, helps all users who need a short time to market.
What’s next Our idea for future work is to schedule one or two work days to patch some bugs present in the project and to adjust the output of COMiC, which has to create an OPB replay bus.
Questions?
What’s next DRESD DReAMS Matteo Murgida Alessandro Panella Operating System Ivan Beretta Design Flow Antonio Piazzi Polaris Massimo Morandi Marco Novati HLR Marco Maggioni
Polaris
Polaris Create an integrated HW/SW system to manage 2D reconfiguration SW side: Maintain information on FPGA status Decide of how to efficiently allocate tasks HW side: Provide support for effective task allocation Perform 2D bitstream relocation
Management of 2D Reconfiguration in a Reconfigurable System Massimo Morandi [email_address]
Outline Introduction Problem description  Project Goals and Contributions Project in details Phases Results Future Work
Problem Description New Generation of FPGAs Virtex-4 and Virtex-5 Allow bi-dimensional reconfiguration This makes it possible to: Better exploit reconfigurable area Obtain modules performance optimizations More complex management: Handle one more degree of freedom Avoid more fragmentation Perform good placement choices to keep the Task Rejection Rate (TRR) low Keep acceptable intra-module routing paths
Project Goals and Contributions Analyze effects of 2D reconfiguration New advantages New problems Examine possible solutions to new problems Explore literature to find promising ideas Evaluate those solutions in various scenarios Propose a new solution Combining ideas from literature with new ones Obtaining good cost-quality tradeoff
Project Phases First Phase, time window: 15th March – 12th April Documentation: prj presentation (12/4), prj report Goals: General analysis of 2D reconfiguration Detailed description of the new problems Second Phase, time window: 13th April – 17th May Documentation: prj presentation (17/5), prj report Goals: Definition of desired features for a solution Analysis and evaluation of existing solutions Third Phase, time window: 18th May – 14th June Documentation: prj presentation (14/6), prj report Goal: propose a new combined solution to effectively handle problems of 2D reconfiguration
Setting and Advantages Definition Definition of the setting: 2D self partial dynamical run-time reconfiguration Analysis of the advantages of 2D Reconfiguration In area usage and performance
2D Fragmentation Problem Analysis of the 2D-fragmentation problem Area generally more fragmented Can nullify the area optimizations obtained
Placement Decisions Analysis of 2D placement choices effects: Again, bad choices can lead to performance loss
Allocation manager Definition of allocation manager desired features: Low TRR Low management overhead High routing efficiency Low fragmentation Definition of allocation manager structure: Empty space manager Complete space  Heuristic selection Fitter General (FF,BL,BF,WF…) Focused (FA,RA… )
Most relevant works Maintain complete information on empty space: KAMER: Keep All Maximally Empty Rectangles Apply a general fitting strategy CUR: Maintain the Contour of a Union of Rectangles Apply a focused fitting strategy Heuristically prune part of the information: KNER: Keep Non-overlapping Empty Rectangles Apply a general fitting strategy 2D-HASHING: Keep Non-ov. Empty Rectangles in optimized data structure Apply (exclusively) a general fitting strategy
Evaluation and Proposed Approach Proposed Approach Heuristic (KNER-like) empty space manager, to keep low complexity for use in a self-reconfigurable system Fitting strategy focused on minimizing routing paths, to maintain high performance of the reconfigurable system (chosen metric to minimize Manhattan distance) High placement quality => high complexity Lowest compl. => no focused fitting (bad especially for routing)
Structure of the allocation manager Task, defined by: Arrival time, ASAP, (ALAP), H, W, Latency, Communicating Tasks Hosted in a queue which also adds a pointer to the rectangle where it is placed Reconfigurable Device, represented as: Binary Tree structure, each node is a Rectangle, each leaf is an empty Rectangle. Navigation through pointers to left child, right child, next leaf and a function to find previous leaf (for bookkeeping after split or merge) Rectangle, defined by: X, Y, H, W Initially one, (X,Y)=(0,0), H=FPGA Rows, W=FPGA Cols
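A minimal sketch of this kind of allocation manager is given below. It keeps non-overlapping empty rectangles (KNER-like) and picks the fitting rectangle that minimizes the Manhattan distance to the communicating tasks. Several simplifications are assumptions of this example, not the project's design: the leaves are held in a flat list instead of a binary tree, merging of freed rectangles is omitted, and distance is measured between top-left corners.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Rect:
    x: int  # column of the top-left corner
    y: int  # row of the top-left corner
    h: int  # height in rows
    w: int  # width in columns

def split(free, h, w):
    """Place an h x w task in the top-left corner of `free`; return (placed, leftover empties)."""
    placed = Rect(free.x, free.y, h, w)
    leftovers = []
    if free.w > w:  # strip to the right of the task, over the full height
        leftovers.append(Rect(free.x + w, free.y, free.h, free.w - w))
    if free.h > h:  # strip below the task, as wide as the task
        leftovers.append(Rect(free.x, free.y + h, free.h - h, w))
    return placed, leftovers

def manhattan(a, b):
    return abs(a.x - b.x) + abs(a.y - b.y)

def place(empties, h, w, partners):
    """Return the placed rectangle, or None when the task must be rejected."""
    fitting = [r for r in empties if r.h >= h and r.w >= w]
    if not fitting:
        return None
    best = min(fitting, key=lambda r: sum(manhattan(r, p) for p in partners))
    placed, leftovers = split(best, h, w)
    empties.remove(best)
    empties.extend(leftovers)
    return placed

# Hypothetical 10 x 10 device, initially one empty rectangle at (0, 0).
empties = [Rect(0, 0, 10, 10)]
t1 = place(empties, h=4, w=5, partners=[])
t2 = place(empties, h=3, w=3, partners=[t1])  # lands next to the task it talks to
```

Keeping only non-overlapping rectangles keeps the bookkeeping cheap, at the price of sometimes missing a placement that a complete empty-space manager would find.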
The Placement Algorithm
Experimental Results Benchmark of 100 randomly generated tasks: Size (5% to 25% of FPGA), randomly interconnected Execution time: 3x less than CUR, close to KNER Communication cost: 3x less than KNER, close to CUR Task Rejection Rate: all solutions quite close
Future Work Apply the proposed solution to self reconfiguration: Adapt the algorithm to run on the internal processor Create a validation reconfigurable architecture Integrate the architecture with relocation Tune the algorithm to improve results: Experiment techniques to reduce TRR Try to optimize the code to have an algorithm with lower running time
Questions?
What’s next DRESD DReAMS Alessandro Panella Matteo Murgida Operating System Ivan Beretta Design Flow Antonio Piazzi Polaris Massimo Morandi Marco Novati HLR Marco Maggioni
Relocation for 2D Reconfigurable Systems Marco Novati [email_address]
Outline Introduction Problem description Project Goals  Project in details Phases  Results What’s next
Problem Description Self Dynamical Runtime 2D Reconfiguration Xilinx Virtex-4 and Virtex-5 Relocation, different solutions Software (BAnMat, PARBIT) Hardware (REPLICA, BiRF) We chose a hardware solution BiRF Square
Project Goals Study of the new FPGA Families Examination of Xilinx documentation on V4 and V5 Analysis of the new bitstream structure Generation of V4 and V5 bitstream Development of the new version of BiRF Implementation Validation
Phases First Phase:  15th March – 12th April Documentation: prj presentation (12/4), prj report Goals: Xilinx documentation examination V4 & V5 bitstream structure analysis Second Phase:  13th April – 17th May Documentation: prj presentation (17/5), prj report Goals: Implementation of BiRF Square Synthesis Third Phase:  18th May – 14th June Documentation: prj presentation (14/6), prj report Goals: Verification & Validation
Frame Addressing New Frame Addressing: Possibility of addressing rows and columns
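Relocation in two dimensions amounts to rewriting the row and column fields of each frame address. The sketch below shows the idea on a packed address word. The field positions and widths are an assumption modelled on a Virtex-4-style frame address register and must be checked against the Xilinx configuration guide; device-specific details such as top/bottom mirroring are ignored, and this is not how BiRF Square is implemented in hardware.

```python
# Assumed layout: name -> (bit offset, width). Verify against the vendor documentation.
FIELDS = {
    "top_bottom": (22, 1),
    "block_type": (19, 3),
    "row": (14, 5),
    "column": (6, 8),
    "minor": (0, 6),
}

def unpack_far(far):
    """Split a packed frame address into its fields."""
    return {name: (far >> off) & ((1 << width) - 1)
            for name, (off, width) in FIELDS.items()}

def pack_far(**values):
    """Build a packed frame address; fields that are not given default to 0."""
    far = 0
    for name, (off, width) in FIELDS.items():
        value = values.get(name, 0)
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        far |= value << off
    return far

def relocate(far, d_row, d_col):
    """Shift a frame address by a row and column offset, keeping the other fields."""
    fields = unpack_far(far)
    fields["row"] += d_row
    fields["column"] += d_col
    return pack_far(**fields)

src = pack_far(row=1, column=3, minor=5)
dst = relocate(src, d_row=1, d_col=2)
```

Only the row and column fields change, so the minor address inside the column is preserved for every relocated frame.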
New Parser
CRC Calculation Particular CRC value, used by Xilinx tools Two versions of BiRF Square: By using the “predefined” value With actual CRC calculation An optimized algorithm has been used
Synthesis results On a Virtex-4 with speed grade -12 General purpose version: max frequency of 160 MHz Specific version: max frequency of 290 MHz
Target Device
Validation Architecture
Results 1/2 BiRF Square Makes it possible to apply relocation in a self partially and dynamically 2D-reconfigurable system The occupation ratio is relatively small Frequency more than acceptable Reduction of internal memory requirements
Results 2/2 Throughput of 7.3 MB/s: The total configuration file size is about 1 MB Considering an architecture: 1/3 of the area as fixed part 2/3 as reconfigurable part with 6 slots Under these hypotheses Size of a partial bitstream will be about 110 KB Relocation time of about 15 ms
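The two estimates on this slide follow from simple arithmetic, reproduced below. The sketch assumes decimal units (1 MB = 1000 KB) and that bitstream size scales linearly with area; both are simplifications made for this example.

```python
total_bitstream_kb = 1000.0        # full configuration file, about 1 MB
reconfigurable_fraction = 2 / 3    # share of the device area that is reconfigurable
slots = 6                          # reconfigurable slots sharing that area
throughput_kb_per_s = 7300.0       # 7.3 MB/s relocation throughput

# Size of one partial bitstream, assuming size is proportional to area.
partial_kb = total_bitstream_kb * reconfigurable_fraction / slots

# Time needed to push one partial bitstream through the relocation filter.
relocation_ms = partial_kb / throughput_kb_per_s * 1000
```

The result is roughly 111 KB and 15 ms, in line with the "about 110 KB" and "about 15 ms" quoted on the slide.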
What’s Next Future improvements: Direct access to the memory (DMA) Direct manipulation of the bitstream Portability Integration with ICAP Elimination of the relocation overhead Relocation time << reconfiguration time The final goal: Creation of a real architecture that exploits self partial and dynamical 2D-reconfiguration, with relocation
Questions
What’s next DRESD DReAMS Alessandro Panella Matteo Murgida Operating System Ivan Beretta Design Flow Antonio Piazzi Polaris Massimo Morandi Marco Novati HLR Marco Maggioni
High Level Reconfiguration Marco Maggioni marco.maggioni@dresd.org
Outline Introduction Problem description  Project Goals State of the Art Project in details Contributions HLR  workflow GraphGen IsomorphClustering SimpleLatency Salomone Results What’s next
Problem Description What is High Level Reconfiguration...? Theoretical approach to dynamic reconfiguration... Vision... Reconfigurability has many advantages... Mission... Exploit these advantages to obtain best performance... How...? Adapting a system to this execution model managing complexity and drawbacks...
Project Goal Create a complete HLR workflow... From a real system specification to its reconfigurable execution model... Define precise interfaces for each phase... To promote flexibility and future HLR research... To develop a complete toolchain... Apply some algorithms regarding reconfigurability... To reuse past works...
State of the Art Present of HLR... Some ideas/concepts regarding clustering and scheduling... ... but not a complete and well-defined workflow. ... but a lot of work to do. System specifications analysis... PandA HW/SW framework to promote new ideas... Dynamic Reconfigurability can be considered as a branch of this research...
Contribution Dynamic library loading system... Embedded into GNU compilation tool-chain Porting of PandA libraries into Earendil... Suitable for future analysis... HLR tools deployed onto Earendil... Cover each step of workflow...
HLR workflow Clustering (with Analysis)... 1st Month Coloring... 2nd Month Scheduling... 3rd Month [Figure: workflow diagram with blocks Gcc Frontend, PandA, Partitioning Algorithm, Clustered Graph, Metric Evaluation (Area, Latency, Rec. Time, Power), Target Architecture Database, Reconfigurable Clustered Graph, Scheduling Algorithm]
GraphGen GraphGen is the first step of the HLR toolchain... Takes as input a system specification or an algorithm... Produces a graph (CFG/BB/DFG/SDG) Performs high level analysis step... Transforms the system description (C/C++/SystemC) to a representation suitable for further elaboration... Based on GCC and compiler theory... Uses PandA 0.4 functionalities to produce a statement level graph...
IsomorphClustering IsomorphClustering follows GraphGen in the HLR toolchain... Takes as input a statement level graph... Produces a clustered graph... Clustering phase... Aggregates nodes into configurations (basic unit of reconfigurable execution)... Based on isomorphism, tries to find different instances of isomorphic templates... We can also apply different algorithms...
SimpleLatency SimpleLatency follows IsomorphClustering in the HLR toolchain... Takes as input a clustered graph... Adds latency information to each configuration... Produces a reconfigurable clustered graph with latency evaluations... Coloring... “Colors” each cluster with useful evaluations for reconfigurability... Based on the internal critical path of each cluster... Different metric for different architectures... Connects HLR with real architectural parameters...
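A latency estimate based on the internal critical path can be sketched as the longest path through the cluster's dependency graph. The node latencies and the graph below are hypothetical, and this is only one way to compute such a metric; the slides do not detail SimpleLatency's implementation.

```python
from functools import lru_cache

def critical_path(latency, succ):
    """Longest cumulative latency along any dependency chain of an acyclic cluster."""

    @lru_cache(maxsize=None)
    def longest_from(node):
        # Latency of this node plus the slowest chain among its successors.
        tails = (longest_from(s) for s in succ.get(node, ()))
        return latency[node] + max(tails, default=0)

    return max(longest_from(node) for node in latency)

# Hypothetical cluster: a feeds b and c, which both feed d.
latency = {"a": 2, "b": 3, "c": 1, "d": 4}
succ = {"a": ("b", "c"), "b": ("d",), "c": ("d",)}

cluster_latency = critical_path(latency, succ)  # chain a -> b -> d dominates
```

Per-node latencies are where architecture-specific parameters would enter, which fits the remark that different architectures call for different metrics.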
Salomone Salomone is the last step in the HLR toolchain... Takes as input a reconfigurable clustered graph... Produces a schedule on an abstract reconfigurable architecture... Scheduling... It's considered the core task of HLR... Maps each configuration on an area portion... Adapts the system execution to the reconfigurable model... Based on a graph coloring algorithm...
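One plausible reading of a coloring-based mapping is sketched below: configurations that must be on the device at the same time conflict, and each color stands for an area portion. This greedy heuristic and the conflict graph are assumptions made for illustration; they are not Salomone's actual algorithm.

```python
def greedy_coloring(conflicts):
    """Assign each node the smallest color not used by an already colored neighbour."""
    color = {}
    # Visit the most constrained nodes first (highest number of conflicts).
    for node in sorted(conflicts, key=lambda n: -len(conflicts[n])):
        used = {color[m] for m in conflicts[node] if m in color}
        color[node] = next(c for c in range(len(conflicts)) if c not in used)
    return color

# Hypothetical conflict graph: c1 overlaps in time with c2 and c3, c4 with nobody.
conflicts = {
    "c1": {"c2", "c3"},
    "c2": {"c1"},
    "c3": {"c1"},
    "c4": set(),
}

area_of = greedy_coloring(conflicts)
areas_needed = max(area_of.values()) + 1
```

Fewer colors mean fewer area portions, which is the quality measure the future-work slide mentions for improved heuristics.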
Results 1/3 Based on AES encryption... Templates found with IsomorphClustering... Execution time... 123.94 s
Results  2/3 Salomone adapting and coloring... Execution time... 113.55 s
Results  3/3 Final Scheduling...
What's next Heuristic implementation for Salomone... To improve result quality in terms of number of area portions... A new metric for area/latency... Based on RTL logical synthesis evaluations... Introduce feedback into HLR workflow... Based on schedule evaluation... New clustering and scheduling algorithms... Such as Napoleon...
Questions

HPPS - Final - 06/14/2007

  • 1.
    H igh P erformance P rocessors and S ystems PdM – UIC joint master 2007 Instructor: Prof. Donatella Sciuto HPPS @ PdM – June 2007
  • 2.
    General Outline DRESDDReAMS Alessandro Panella Matteo Murgida Operating System Ivan Beretta Design Flow Antonio Piazzi Polaris Massimo Morandi Marco Novati HLR Marco Maggioni
  • 3.
    DRESD ina Nutshell D ynamic R econfigurability in E mbedded S ystem D esign DRESD @ PdM – June 2007
  • 4.
  • 5.
    Motivations Increasing needfor behavioral flexibility in embedded systems design Support of new standards, e.g. in media processing Addition of new features Applications too large to fit on the device all at once Speedup the overall computation of the final system
  • 6.
    Reconfiguration The processof physically altering the location or functionality of network or system elements. Automatic configuration describes the way sophisticated networks can readjust themselves in the event of a link or device failing, enabling the network to continue operation. Gerald Estrin, 1960
  • 7.
    SoC Reconfiguration fi x Partial Total Embedded
  • 8.
    Different Scenarios... SingleDevice Distributed System
  • 9.
    What’s next DRESDDReAMS Alessandro Panella Matteo Murgida Operating System Ivan Beretta Design Flow Antonio Piazzi Polaris Massimo Morandi Marco Novati HLR Marco Maggioni
  • 10.
    D ynamic Re configurability A pplied to M ulti-FPGA S ystems
  • 11.
    DReAMS Dynamic ReconfigurabilityApplied to Multi-FPGA Systems Branch of DRESD project Inherits architectures and tools Automatic workflow from VHDL system description to FPGA implementation VHDL parsing and system simulation System creation over a specific architecture Bitstream creation and download onto FPGAs
  • 12.
    Multi-FPGA Partitioning AlessandroPanella [email_address]
  • 13.
    Outline Problem descriptionProject goals and contributions Project phases What is partitioning? Existing approaches Going deep into the problem SpartA The framework The idea The algorithm Experimental results Future work
  • 14.
    Problem description. Multi-FPGA rationale: large designs do not fit into a single chip; high performance parallelized applications; our case: apply dynamic reconfigurability. Need to break the initial design into several blocks; one block corresponds to a single FPGA chip. Which inputs/outputs? Which objectives? Which techniques?
  • 15.
    Project goals and contributions. Analyze existing approaches: obtain a deep knowledge of this (well explored) field; extract basic ideas for a new approach; obtain some terms of comparison. Define precisely which problem(s) we cope with: contextualize the problem; focus on our needs. Develop a new solution: theoretical background; implementation and evaluation.
  • 16.
    Project phases. First Phase [15th March – 12th April]. Documentation: presentation (12/4), report. Goals: analysis of the state of the art; produce some hints on a new approach. Second Phase [13th April – 17th May]. Documentation: presentation (17/5), report. Goals: precise definition of the problem; propose a new solution. Third Phase [18th May – 14th June]. Documentation: presentation (14/6), final report. Goal: implementation and evaluation of the proposed solution.
  • 17.
    What is partitioning? Goal: divide a set of interrelated objects into a set of subsets while optimizing one or more specific objectives. K-way partitioning: given a graph G=(V,E), partition V into k subsets V_1, ..., V_k that are pairwise disjoint and whose union is V. Balance constraint: |V_i| ≈ |V|/k. Aims at minimizing (or maximizing) an objective function: edge-cut or other objectives. In general: NP-complete. Several heuristics that provide good results have been developed.
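    The two quantities named on this slide, edge-cut and the balance constraint, can be computed with a few lines of Python. This is only an illustrative sketch: the function names, the `tolerance` parameter and the toy graph are assumptions, not part of the presented work.

```python
from collections import Counter

def edge_cut(edges, part):
    """Count the edges whose endpoints lie in different subsets."""
    return sum(1 for u, v in edges if part[u] != part[v])

def is_balanced(part, k, tolerance=1):
    """Check |V_i| ~ |V|/k: every subset size is within `tolerance` of the ideal size."""
    sizes = Counter(part.values())
    ideal = len(part) / k
    return all(abs(sizes.get(i, 0) - ideal) <= tolerance for i in range(k))

# Toy graph (hypothetical): two triangles joined by a single edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
part = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}  # vertex -> subset index
print(edge_cut(edges, part), is_balanced(part, k=2))
```

    Here each triangle is kept in its own subset, so only the joining edge is cut and both subsets have exactly |V|/k = 3 vertices.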
  • 18.
    Existing approaches - a glance. Traditional methods: Kernighan–Lin and Fiduccia–Mattheyses heuristics; iterative-improvement algorithms that begin with an initial partition and iteratively improve it; O(n^3) complexity. Iterative algorithms: genetic, simulated annealing. Multilevel algorithms: clustering -> initial partitioning -> refining; METIS/hMETIS suite: best current results for partitioning large flattened graphs.
  • 19.
    Going deeper into the problem. Two kinds of multi-FPGA partitioning. Topology-aware: architecture topology is an input; no optimization of the number of FPGAs needed; main task: association between the (larger) system graph and the (smaller) architectural graph. Topology-free: architecture topology is not provided; input: dimension and communication features of FPGAs; minimization of the number of FPGAs; place and route after partitioning. At the moment, we deal with the topology-free problem.
  • 20.
    SPartA: the framework. Input: VHDL system description. Output: several VHDL files, one for each block (FPGA). Three main phases: extract the design from the VHDL description; “real” partitioning phase (core); build the VHDL files.
  • 21.
    SPartA: the idea. Structural approach: fully exploits the design hierarchy; modules can be treated as single blocks; basis for expansion toward dynamic reconfigurability. Objectives: minimize the cutsize; minimize the number of FPGAs used; preserve module integrity.
  • 22.
    SPartA: the algorithm 1/2. Recursive algorithm (deals with trees); starts from the TOP node. Precondition: no leaves with dimension > FPGA size. At every moment, a node can be COVERED, UNCOVERED or PARTIALLY COVERED. Stop condition: node TOP is COVERED.
  • 23.
    SPartA: the algorithm 2/2 OPEN ISSUE: Selecting the first node to be inserted into an empty partition Random node Node with overall max communication Node with max communication with its siblings
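    The slides give only the outline of the algorithm, so the following Python sketch is not SPartA itself. It illustrates, under stated assumptions, how a recursive descent over the design hierarchy can keep a module intact whenever its whole subtree fits on one FPGA and split it into its children otherwise. The `Node` class, the first-fit placement rule and the largest-child-first ordering are all hypothetical choices made for the example.

```python
class Node:
    """Design-hierarchy node; an internal node's size is the sum of its children (assumption)."""
    def __init__(self, name, size=0, children=()):
        self.name = name
        self.children = list(children)
        self.size = sum(c.size for c in self.children) if self.children else size

def partition(top, fpga_size):
    """Return one list of node names per FPGA, keeping modules intact when they fit."""
    parts = []  # each entry is [free_space, [node names]]

    def cover(node):
        if node.size <= fpga_size:
            # The whole module fits: place it as a single block (first fit).
            for p in parts:
                if p[0] >= node.size:
                    p[0] -= node.size
                    p[1].append(node.name)
                    return
            parts.append([fpga_size - node.size, [node.name]])
        elif not node.children:
            raise ValueError("precondition violated: leaf larger than the FPGA")
        else:
            # The module must be split: recurse on its children, largest first.
            for child in sorted(node.children, key=lambda c: -c.size):
                cover(child)

    cover(top)
    return [names for _, names in parts]

top = Node("TOP", children=[
    Node("A", children=[Node("a1", 3), Node("a2", 3)]),
    Node("B", 5),
    Node("C", 4),
])
print(partition(top, fpga_size=10))
```

    In this toy tree TOP (size 15) does not fit on a size-10 FPGA, so its children are placed individually; module A is never split because it fits as a whole.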
  • 24.
    Results 2/2. Complexity: exponential, due to the recursive nature of the algorithm. Execution time nevertheless low (tens of seconds for a reasonably large design). [Figure: example of an original tree and the corresponding partitioned tree]
  • 25.
    Results 3/3. Evaluation metrics: EDGECUT, FILLING and SPLITS. Evaluation of the three policies for node selection on 18 different trees of varying size.
  • 26.
  • 27.
    Future work. Algorithm improvement: balancing of the last partition; first node selection policies; more refined “score” function for selecting a node (use closeness metrics); comparisons with existing algorithms. Expansion: SPartA framework development; topology-aware partitioning.
  • 28.
    The end. ANY QUESTIONS?
  • 29.
    What’s next: DRESD; DReAMS (Alessandro Panella, Matteo Murgida); Operating System (Ivan Beretta); Design Flow (Antonio Piazzi); Polaris (Massimo Morandi, Marco Novati); HLR (Marco Maggioni)
  • 30.
    Chimera: Multi-FPGA Architecture Definition. Matteo Murgida [email_address]
  • 31.
    Outline: Introduction (Problem description, Project Goals, State of the Art); Project in detail (Contributions, Phases, Results); What’s next
  • 32.
    Problem Description. Architectural description of a distributed FPGA environment. 3-layer architecture.
  • 33.
    Project Goals. Design the architecture of the most generic distributed system: node definition; interface definition; communication channel definition. Design a communication protocol: essential protocol; interrupt-based protocol; timeout improvement.
  • 34.
    State of the Art. CONFigurable ElecTronic TIssue (CONFETTI) by EPFL: cellular-based architecture. PROs: high degree of parallelism, high computational power. CONs: no flexibility, oversized for small problems, small architectural customizations imply big cost/effort. Splash 2 by IDA Supercomputing Center: architecture composed of a Sun SPARCstation host, an interface board and “Splash Array” boards. PROs: again high parallelism and power. CONs: a central host coordinates the computational units, no fault tolerance, no flexibility.
  • 35.
    Contributions. The proposed architecture: allows several Spartan-3 Starter Boards to communicate and exchange data; is portable to different FPGAs with minimum effort; is the basic infrastructure that will allow external partial dynamic reconfiguration.
  • 36.
    Project Phases. First Phase, time window: 15th March – 12th April. Documentation: prj presentation (12/4), prj report. Goals: Digilent Spartan-3 Starter Board study; boards connection. Second Phase, time window: 13th April – 17th May. Documentation: prj presentation (17/5), prj report. Goals: communication between two MicroBlaze soft-processors; GPIO integration in the architecture. Third Phase, time window: 18th May – 14th June. Documentation: prj presentation (14/6), prj report. Goals: interrupt handling, timeout handling; simple application as example.
  • 37.
    Board Study. How to use resources like switches, LEDs and connectors on the board. How to map an IP-Core port to a physical pin of the board. Choice of the A2 Expansion Connector to connect two boards.
  • 38.
    MicroBlaze Communication. Communication between two MicroBlaze soft-processors. Development of a display controller to visualize the data flow.
  • 39.
    GPIO Insertion. Higher architecture portability through the use of the GPIO IP-Core.
  • 40.
    Interrupt Controller Insertion. Communication protocol improved by interrupt handling, to prevent the processor from busy waiting. An Interrupt Controller is included in the architecture to permit detection and handling of multiple interrupts.
  • 41.
    Timeout. Malfunctioning due to interference on the communication channel leads to deadlocks: without a timeout the communication protocol is not reliable. Counter implementation, including the driver used by the processor to clear raised interrupts. Development of a simple application to verify the correctness of the proposed approach.
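    The project implements the timeout with a hardware counter and an interrupt on the boards. As a purely software analogy, the Python sketch below shows why a bounded wait removes the deadlock: the receiver gives up after a deadline instead of blocking forever on a message that interference has destroyed. The `receive` helper and the use of `queue.Queue` as the channel are assumptions made for the illustration.

```python
import queue

def receive(channel, timeout_s):
    """Wait for one message; return None if nothing arrives within timeout_s seconds."""
    try:
        return channel.get(timeout=timeout_s)
    except queue.Empty:
        return None  # timeout expired: the caller can retry or report the fault

channel = queue.Queue()
channel.put("data")
print(receive(channel, 0.05))  # message available: returned immediately
print(receive(channel, 0.05))  # message lost: gives up after the timeout
```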
  • 42.
  • 43.
    Future Work. Apply the proposed approach to external partial dynamic reconfiguration. Develop a co-simulation framework based on the VHDL/SystemC descriptions of distributed systems: receive as input the VHDL description of the system; build the VHDL description for every node; create the SystemC stub to allow inter-node communication; describe the communication in SystemC; co-simulate the VHDL/SystemC description.
  • 44.
  • 45.
    What’s next: DRESD; DReAMS (Alessandro Panella, Matteo Murgida); Operating System (Ivan Beretta); Design Flow (Antonio Piazzi); Polaris (Massimo Morandi, Marco Novati); HLR (Marco Maggioni)
  • 46.
    Operating System support for Reconfigurable SoC
  • 47.
    Development of an OS architecture-independent layer for dynamic reconfiguration. Ivan Beretta [email_address]
  • 48.
    Outline: Introduction (Problem description, Project Goals, State of the Art); Project in detail (Contributions, Phases, Results); What’s next
  • 49.
    Problem description. Need for operating system support on Reconfigurable SoCs: simplified software development process; improved code portability. Lack of support for dynamically reconfigurable architectures: specific solutions for specific architectures. Need for an architecture-independent abstraction layer.
  • 50.
    Project Goal. Primary goals: analysis of the State of the Art; definition of the new intermediate layer; physical implementation. Specific goals: study of the solutions developed inside the DRESD group; comparison between existing solutions; recovery of one of the two implementations; generation of hardware architectures using up-to-date tools on Xilinx Virtex II – Pro VP7.
  • 51.
    State of the Art. Caronte implementation (Alberto Donato, 2005). Two kernel modules: ICAP device driver; IP-Core manager (IPCM).
  • 52.
    State of the Art (cont’d). YaRA implementation (Vincenzo Rana, 2006). Multi-layered structure. Four modules: Reconfiguration controller driver, MAC, LOL, Reconfiguration Library. ROTFL architecture.
  • 53.
    Contributions. Limits of existing implementations: lack of portability (e.g. the YaRA solution is implemented on RAPTOR2000); reconfiguration process details visible from userspace. Definition of an architecture-independent middleware: improved portability (it works on different hardware architectures and with different Linux distributions); opportunity to optimize latencies.
  • 54.
    Phases. First phase: Layer definition. Goal: factorization of common features; boundaries of the new middleware; mapping of existing solutions on the functionalities. Motivation: provide guidelines for the actual implementation. Second phase: Implementation recovery. Goal: recovery of bootstrap process and kernel images. Motivation: full recovery of the Caronte solution. Third phase: Architectures generation. Goal: synthesis of hardware architectures using up-to-date Xilinx tools and cores. Motivation: synthesis of hardware architectures using up-to-date Xilinx tools.
  • 55.
    First Phase: Layer definition. Definition of new layer boundaries. Factorization of existing features. Mapping of the required functionalities on existing implementations. Legend: ● = Both hardware and software, ● = Hardware independent.
    Feature                                         | Caronte Solution                | YaRA Solution
    Reconfiguration controller support              | ICAP device driver              | Reconfiguration Controller Driver
    Dynamic address space assignment                | IPCM Module                     | MAC module
    Dynamic device registration and driver loading  | IPCM Module                     | LOL module
    API                                             | Direct interaction with modules | Reconfiguration library
    Module management (caching, placement...)       | Not implemented                 | ROTFL architecture
  • 56.
    Second Phase: Implementation Recovery. Bootstrap process from flash memory. [Diagram: 16 MB Flash, 0xe4000000 to 0xe4FFFFFF, with marked addresses 0xe42FFFFF, 0xe4F00000 and 0xe4F80000; 64 MB DDR SDRAM, 0x00000000 to 0x03FFFFFF, with marked address 0x00800000; BRAM and PowerPC on the FPGA; Bootloader, Bootmanager, Kernel and RAMDisk image; boot steps numbered 1 to 6]
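    The address ranges on this slide are consistent with the stated memory sizes, which a short calculation confirms. The sketch below only checks that arithmetic; the constant and function names are assumptions, and the base addresses are the ones shown on the slide.

```python
MB = 1024 * 1024

def region_end(base, size):
    """Last byte address of a region that starts at `base` and spans `size` bytes."""
    return base + size - 1

FLASH_BASE, FLASH_SIZE = 0xE4000000, 16 * MB  # 16 MB Flash
DDR_BASE, DDR_SIZE = 0x00000000, 64 * MB      # 64 MB DDR SDRAM

print(hex(region_end(FLASH_BASE, FLASH_SIZE)))  # 0xe4ffffff
print(hex(region_end(DDR_BASE, DDR_SIZE)))      # 0x3ffffff
```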
  • 57.
    Second Phase: Implementation Recovery (cont’d). Several issues: neither bootmanager nor Linux kernel on flash memory at the beginning; flash memory seen as read-only memory at runtime; need for an ad-hoc solution. Avmon command line interface: executed from DDR SDRAM memory; FTP transfer of the bootmanager and flash programming; also useful for kernel download. Kernel executable image: kernel image built using a cross-compiler; ICAP and IPCM modules loaded at runtime.
  • 58.
    Third Phase: Architecture generation. Hardware architecture used in the Second Phase no longer useful: synthesized with Xilinx ISE and EDK 6.1. Same hardware structure realized with updated cores and recent tool versions: synthesis with Xilinx ISE and EDK 7.1; synthesis with Xilinx ISE and EDK 9.1. Lack of device driver support and documentation to configure the newest cores.
  • 59.
    Results: Implementation Recovery. Linux bootstrap from flash memory.
  • 60.
    Results: Implementation Recovery. Design summary for hardware architectures on Xilinx Virtex II – Pro VP7. Two main limitations. Ethernet controller. Necessity of a top-level design. Design too large for module-based reconfiguration.
    Resource   | Xilinx ISE/EDK 7.1 (Used / Available / %) | Xilinx ISE/EDK 9.1 (Used / Available / %)
    Slices     | 4926 / 4928 / 99%                         | 5318 / 4928 / 107%
    Flip-Flops | 5217 / 9856 / 52%                         | 5724 / 9856 / 58%
    4-in LUTs  | 6974 / 9856 / 70%                         | 6993 / 9856 / 70%
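    The percentages in the design summary appear to be truncated (not rounded) utilization figures, and the 107% slice usage under ISE/EDK 9.1 means that design exceeds the device. A small check of the reported numbers, with hypothetical function names:

```python
def utilization(used, available):
    """Utilization in percent, truncated to an integer (matches the slide's figures)."""
    return 100 * used // available

def fits(used, available):
    """A resource fits only if its usage does not exceed what the device offers."""
    return used <= available

# (used, available) pairs copied from the design summary.
ise71 = {"Slices": (4926, 4928), "Flip-Flops": (5217, 9856), "4-in LUTs": (6974, 9856)}
ise91 = {"Slices": (5318, 4928), "Flip-Flops": (5724, 9856), "4-in LUTs": (6993, 9856)}

for name in ise71:
    print(name, utilization(*ise71[name]), utilization(*ise91[name]), fits(*ise91[name]))
```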
  • 61.
    What’s next. Device driver updates to support the newest architectures. Intermediate layer implementation: opportunity to add some additional features (reconfiguration scheduler); opportunity to define a common device driver interface to simplify the creation of a new driver by the user. Integration of the middleware and the operating system support in a complete design flow.
  • 62.
  • 63.
    What’s next: DRESD; DReAMS (Alessandro Panella, Matteo Murgida); Operating System (Ivan Beretta); Design Flow (Antonio Piazzi); Polaris (Massimo Morandi, Marco Novati); HLR (Marco Maggioni)
  • 64.
    Design Flow. Antonio Piazzi [email_address]
  • 65.
    Outline: Introduction (Problem description, Project Goals, State of the Art); Project in detail (Contributions, Phases, Results); What’s next
  • 66.
    Problem description. The user has to spread his attention over many problems, some of them related to the implementation of the design. Often users may not know anything about reconfigurable architecture generation and they haven’t.
  • 67.
    Project Goals. New design methodology tailored to support partially dynamically reconfigurable architectures. Definition and implementation of a design framework able to: support different design paradigms, i.e. Xilinx Module Based, Xilinx EAPR; hide the dirty work (due to the reconfiguration) from the application designer; support different architectural solutions, i.e. different communication infrastructures, IBM CoreConnect or Wishbone.
  • 68.
    Contributions. With our framework all users (novice or not) can develop and debug their functionality on a reconfigurable architecture without analyzing all the problems related to that development methodology.
  • 69.
    Phases. 1st phase (15 March – 15 April): Budgeting. Study of the state of the art. 2nd phase (15 April – 15 May): Realization phase. Construction of the entire framework based on previously separated tools; implementation of an innovative workflow. 3rd phase (15 May – 15 June): Project’s validation. Definition of a new communication infrastructure and transfer protocol for the reconfigurable part; verify the integration of the new infrastructure in the project.
  • 70.
    First Phase. Study of the state of the art: standard reconfigurable design flow (Xilinx Module Based and EAPR); Caronte Design Flow; EDK-based architecture.
  • 71.
  • 72.
    Second Phase 1/4. Construction of the entire framework based on previously separated tools. The user has to focus his attention only on the development of the IBM CoreConnect architecture and on writing the modules which implement his functionality. SYSTEM.VHD contains all the information about the IBM CoreConnect architecture.
  • 73.
    Second Phase 2/4. ArchGen takes the system.vhd file, processes the architecture it contains and translates that static architecture into a dynamic one. FIX.VHD contains the instantiations of the processors (one or more) and of all the components present in the IBM CoreConnect architecture. TOP.VHD contains the instantiation of the fix component and the information about the communication infrastructure.
  • 74.
    Second Phase 3/4. COMiC generates an NCD file which contains the information about the communication infrastructure, and an XDL file which contains the same information in text form.
  • 75.
    Second Phase 4/4. At this point we only have to collect all the information we need and, through a parser, insert it into a new top.vhd, which will be the fix part of the architecture. After that we only have to manage the reconfigurable modules written by the user.
  • 76.
    Third Phase 1/3. Definition of a new communication infrastructure and transfer protocol for the reconfigurable part. An OPB bus based on 3-state buffers is used to link one or more modules to the fix part (created with ISE).
  • 77.
    Third Phase 2/3. Use the ncd2xdl converter to obtain an XDL file which contains all the parameters of our bus.
  • 78.
    Third Phase 3/3. Verify the integration of the new infrastructure in the project. Perfect integration in our process: we can use any bus type to connect the fix and reconfigurable parts.
  • 79.
    Results. The framework answers the need for automation expressed by novice users and, more generally, helps all users who need a short time to market.
  • 80.
    What’s next. Our idea for future work is to schedule one or two working days to patch some bugs present in the project and to adjust the output of COMiC, which has to create an OPB replay bus.
  • 81.
  • 82.
    What’s next: DRESD; DReAMS (Matteo Murgida, Alessandro Panella); Operating System (Ivan Beretta); Design Flow (Antonio Piazzi); Polaris (Massimo Morandi, Marco Novati); HLR (Marco Maggioni)
  • 83.
  • 84.
    Polaris. Create an integrated HW/SW system to manage 2D reconfiguration. SW side: maintain information on FPGA status; decide how to efficiently allocate tasks. HW side: provide support for effective task allocation; perform 2D bitstream relocation.
  • 85.
    Management of 2D Reconfiguration in a Reconfigurable System. Massimo Morandi [email_address]
  • 86.
    Outline: Introduction (Problem description, Project Goals and Contributions); Project in detail (Phases, Results); Future Work
  • 87.
    Problem Description. New generation of FPGAs (Virtex-4 and Virtex-5) allows bi-dimensional reconfiguration. This makes it possible to: better exploit the reconfigurable area; obtain module performance optimizations. More complex management: handle one more degree of freedom; avoid more fragmentation; make good placement choices to keep the Task Rejection Rate (TRR) low; keep intra-module routing paths acceptable.
  • 88.
    Project Goals and Contributions. Analyze effects of 2D reconfiguration: new advantages; new problems. Examine possible solutions to new problems: explore literature to find promising ideas; evaluate those solutions in various scenarios. Propose a new solution: combining ideas from literature with new ones; obtaining a good cost-quality tradeoff.
  • 89.
    Project Phases. First Phase, time window: 15th March – 12th April. Documentation: prj presentation (12/4), prj report. Goals: general analysis of 2D reconfiguration; detailed description of the new problems. Second Phase, time window: 13th April – 17th May. Documentation: prj presentation (17/5), prj report. Goals: definition of desired features for a solution; analysis and evaluation of existing solutions. Third Phase, time window: 18th May – 14th June. Documentation: prj presentation (14/6), prj report. Goal: propose a new combined solution to effectively handle problems of 2D reconfiguration.
  • 90.
    Setting and Advantages Definition. Definition of the setting: 2D self partial dynamic run-time reconfiguration. Analysis of the advantages of 2D reconfiguration in area usage and performance.
  • 91.
    2D Fragmentation Problem. Analysis of the 2D-fragmentation problem: area generally more fragmented; can nullify the area optimizations obtained.
  • 92.
    Placement Decisions. Analysis of the effects of 2D placement choices: again, bad choices can lead to performance loss.
  • 93.
    Allocation manager. Definition of allocation manager desired features: low TRR; low management overhead; high routing efficiency; low fragmentation. Definition of allocation manager structure: empty space manager (complete space or heuristic selection); fitter, either general (FF, BL, BF, WF…) or focused (FA, RA…).
  • 94.
    Most relevant works. Maintain complete information on empty space: KAMER (Keep All Maximally Empty Rectangles), applies a general fitting strategy; CUR (maintain the Contour of a Union of Rectangles), applies a focused fitting strategy. Heuristically prune part of the information: KNER (Keep Non-overlapping Empty Rectangles), applies a general fitting strategy; 2D-HASHING (keep Non-overlapping Empty Rectangles in an optimized data structure), applies (exclusively) a general fitting strategy.
  • 95.
    Evaluation and Proposed Approach. Proposed approach: heuristic (KNER-like) empty space manager, to keep complexity low for use in a self-reconfigurable system; fitting strategy focused on minimizing routing paths, to maintain high performance of the reconfigurable system (chosen metric to minimize: Manhattan distance). Evaluation of existing solutions: high placement quality => high complexity; lowest complexity => no focused fitting (bad especially for routing).
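    A fitting strategy focused on routing can be illustrated by choosing, among the candidate positions, the one with the smallest total Manhattan distance to the tasks it communicates with. The sketch below is an assumption-laden illustration, not the presented allocation manager: positions are reduced to single (x, y) points and the function names are hypothetical.

```python
def manhattan(a, b):
    """Manhattan distance between two (x, y) points."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def best_position(candidates, neighbours):
    """Pick the candidate with the smallest total distance to the communicating tasks."""
    return min(candidates, key=lambda c: sum(manhattan(c, n) for n in neighbours))

candidates = [(0, 0), (5, 5), (9, 0)]  # free positions (hypothetical)
neighbours = [(4, 4), (6, 6)]          # positions of communicating tasks (hypothetical)
print(best_position(candidates, neighbours))
```

    With these values the totals are 20, 4 and 18, so the position (5, 5) between the two communicating tasks is selected.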
  • 96.
    Structure of the allocation manager. Task, defined by: Arrival time, ASAP, (ALAP), H, W, Latency, Communicating Tasks; hosted in a queue which also adds a pointer to the rectangle where it is placed. Reconfigurable Device, represented as: binary tree structure, each node is a Rectangle, each leaf is an empty Rectangle; navigation through pointers to left child, right child, next leaf and a function to find the previous leaf (for bookkeeping after split or merge). Rectangle, defined by: X, Y, H, W; initially one, (X,Y)=(0,0), H=FPGA Rows, W=FPGA Cols.
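    The split step mentioned for the rectangle tree can be sketched as follows. This is a minimal illustration assuming a guillotine-style split that places the task in the corner of an empty rectangle and leaves two non-overlapping empty rectangles; the slides do not specify the actual split rule, and the `Rect` type and `split` function are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Rect:
    x: int  # column of the top-left corner
    y: int  # row of the top-left corner
    h: int  # height in rows
    w: int  # width in columns

def split(free, h, w):
    """Place an h x w task in the corner of `free`; return (placed, empty leftovers) or None."""
    if h > free.h or w > free.w:
        return None  # the task does not fit in this empty rectangle
    placed = Rect(free.x, free.y, h, w)
    leftovers = []
    if free.w > w:  # strip to the right of the task, as tall as the task
        leftovers.append(Rect(free.x + w, free.y, h, free.w - w))
    if free.h > h:  # strip below the task, spanning the full width
        leftovers.append(Rect(free.x, free.y + h, free.h - h, free.w))
    return placed, leftovers

fpga = Rect(0, 0, h=10, w=8)  # initial single empty rectangle: H = rows, W = columns
print(split(fpga, h=4, w=3))
```

    The placed rectangle and the two leftovers tile the original rectangle exactly, so no area is lost or counted twice; in a tree representation the leftovers would become the two children of the split node.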
  • 97.
  • 98.
    Experimental Results. Benchmark of 100 randomly generated tasks: size 5% to 25% of the FPGA, randomly interconnected. Execution time: 3x less than CUR, close to KNER. Communication cost: 3x less than KNER, close to CUR. Task Rejection Rate: all solutions quite close.
  • 99.
    Future Work. Apply the proposed solution to self reconfiguration: adapt the algorithm to run on the internal processor; create a validation reconfigurable architecture; integrate the architecture with relocation. Tune the algorithm to improve results: experiment with techniques to reduce TRR; optimize the code to lower the running time of the algorithm.
  • 100.
  • 101.
    What’s next: DRESD; DReAMS (Alessandro Panella, Matteo Murgida); Operating System (Ivan Beretta); Design Flow (Antonio Piazzi); Polaris (Massimo Morandi, Marco Novati); HLR (Marco Maggioni)
  • 102.
    Relocation for 2D Reconfigurable Systems. Marco Novati [email_address]
  • 103.
    Outline: Introduction (Problem description, Project Goals); Project in detail (Phases, Results); What’s next
  • 104.
    Problem Description. Self dynamic runtime 2D reconfiguration: Xilinx Virtex-4 and Virtex-5. Relocation, different solutions: software (BAnMat, PARBIT); hardware (REPLICA, BiRF). We chose a hardware solution: BiRF Square.
  • 105.
    Project Goals. Study of the new FPGA families: examination of Xilinx documentation on V4 and V5; analysis of the new bitstream structure; generation of V4 and V5 bitstreams. Development of the new version of BiRF: implementation; validation.
  • 106.
    Phases First Phase: 15th March – 12th April Documentation: prj presentation (12/4), prj report Goals: Xilinx documentation examination V4 & V5 bitstream structure analysis Second Phase: 13th April – 17th May Documentation: prj presentation (17/5), prj report Goals: Implementation of BiRF Square Synthesis Third Phase: 18th May – 14th June Documentation: prj presentation (14/6), prj report Goals: Verification & Validation
  • 107.
    Frame Addressing. New frame addressing: possibility of addressing rows and columns.
  • 108.
  • 109.
    CRC Calculation. Particular CRC value, used by Xilinx tools. Two versions of BiRF Square: one using the “predefined” value; one with actual CRC calculation, for which an optimized algorithm has been used.
  • 110.
    Synthesis results. On a Virtex-4 with speed grade -12. General purpose version: max frequency of 160 MHz. Specific version: max frequency of 290 MHz.
  • 111.
  • 112.
  • 113.
    Results 1/2. BiRF Square: makes it possible to apply relocation in a self partially and dynamically 2D-reconfigurable system. The occupation ratio is relatively small. Frequency more than acceptable. Reduction of internal memory requirements.
  • 114.
    Results 2/2. Throughput of 7.3 MB/s. The total configuration file size is about 1 MB. Considering an architecture with 1/3 of the area as fixed part and 2/3 as reconfigurable part with 6 slots, under this hypothesis the size of a partial bitstream will be about 110 KB and the relocation time about 15 ms.
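    The two estimates on this slide follow from simple arithmetic, reproduced below. The sketch assumes 1 MB = 1000 KB and that the bitstream size scales linearly with area; both are simplifications introduced here, and the function names are hypothetical.

```python
def partial_bitstream_kb(total_kb, reconfigurable_fraction, slots):
    """Size of one slot's bitstream, assuming size scales linearly with area."""
    return total_kb * reconfigurable_fraction / slots

def relocation_time_ms(size_kb, throughput_mb_s):
    """Time to process size_kb at the given throughput (1 MB = 1000 KB assumed)."""
    return size_kb / (throughput_mb_s * 1000) * 1000

size = partial_bitstream_kb(total_kb=1000, reconfigurable_fraction=2 / 3, slots=6)
print(round(size))                            # about 111 KB, i.e. roughly 110 KB
print(round(relocation_time_ms(110, 7.3), 1)) # about 15 ms
```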
  • 115.
    What’s Next. Future improvements: direct access to the memory (DMA); direct manipulation of the bitstream; portability; integration with ICAP; elimination of the relocation overhead (relocation time << reconfiguration time). The final goal: creation of a real architecture that exploits self partial and dynamic 2D-reconfiguration, with relocation.
  • 116.
  • 117.
    What’s next: DRESD; DReAMS (Alessandro Panella, Matteo Murgida); Operating System (Ivan Beretta); Design Flow (Antonio Piazzi); Polaris (Massimo Morandi, Marco Novati); HLR (Marco Maggioni)
  • 118.
    High Level Reconfiguration. Marco Maggioni marco.maggioni@dresd.org
  • 119.
    Outline: Introduction (Problem description, Project Goals, State of the Art); Project in detail (Contributions, HLR workflow, GraphGen, IsomorphClustering, SimpleLatency, Salomone, Results); What’s next
  • 120.
    Problem Description. What is High Level Reconfiguration...? Theoretical approach to dynamic reconfiguration... Vision... Reconfigurability has many advantages... Mission... Exploit these advantages to obtain the best performance... How...? Adapting a system to this execution model, managing complexity and drawbacks...
  • 121.
    Project Goal. Create a complete HLR workflow... From a real system specification to its reconfigurable execution model... Define precise interfaces for each phase... To promote flexibility and future HLR research... To develop a complete toolchain... Apply some algorithms regarding reconfigurability... To reuse past work...
  • 122.
    State of the Art. Present of HLR... Some ideas/concepts regarding clustering and scheduling... ... but not a complete and well-defined workflow. ... but a lot of work to do. System specifications analysis... PandA HW/SW framework to promote new ideas... Dynamic Reconfigurability can be considered as a branch of this research...
  • 123.
    Contribution. Dynamic library loading system... Embedded into the GNU compilation tool-chain. Porting of PandA libraries into Earendil... Suitable for future analysis... HLR tools deployed onto Earendil... Cover each step of the workflow...
  • 124.
    HLR workflow. Clustering (with Analysis)... 1st month. Coloring... 2nd month. Scheduling... 3rd month. [Diagram labels: GCC Frontend, Partitioning Algorithm, PandA, Scheduling Algorithm, Clustered Graph, Metric Evaluation, Reconfigurable Clustered Graph, Area, Latency, Rec. Time, Power, Target Architecture Database]
  • 125.
    GraphGen. GraphGen is the first step of the HLR toolchain... Takes as input a system specification or an algorithm... Produces a graph (CFG/BB/DFG/SDG). Performs the high level analysis step... Transforms the system description (C/C++/SystemC) into a representation suitable for further elaboration... Based on GCC and compiler theory... Uses PandA 0.4 functionalities to produce a statement level graph...
  • 126.
    IsomorphClustering. IsomorphClustering follows GraphGen in the HLR toolchain... Takes as input a statement level graph... Produces a clustered graph... Clustering phase... Aggregates nodes into configurations (basic units of reconfigurable execution)... Based on isomorphism, tries to find different instances of isomorphic templates... We can also apply different algorithms...
  • 127.
    SimpleLatency. SimpleLatency follows IsomorphClustering in the HLR toolchain... Takes as input a clustered graph... Adds latency information to each configuration... Produces a reconfigurable clustered graph with latency evaluations... Coloring... “Colors” each cluster with evaluations useful for reconfigurability... Based on the clusters’ internal critical path... Different metrics for different architectures... Connects HLR with real architectural parameters...
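    A latency estimate based on a cluster's internal critical path amounts to a longest-path computation on the cluster's dependency graph. The sketch below shows that computation on a small acyclic graph; the data layout (`latency` and `succs` dictionaries) and the per-node latencies are assumptions for illustration, not SimpleLatency's actual interface.

```python
from functools import lru_cache

def critical_path_latency(latency, succs):
    """Longest latency-weighted path through an acyclic cluster graph."""
    @lru_cache(maxsize=None)
    def longest_from(node):
        # Latency of this node plus the slowest chain among its successors.
        return latency[node] + max((longest_from(s) for s in succs.get(node, ())), default=0)
    return max(longest_from(n) for n in latency)

latency = {"a": 2, "b": 3, "c": 1, "d": 4}          # per-statement latencies (hypothetical)
succs = {"a": ["b", "c"], "b": ["d"], "c": ["d"]}   # data dependencies inside one cluster
print(critical_path_latency(latency, succs))
```

    In this example the path a -> b -> d dominates, giving a cluster latency of 2 + 3 + 4 = 9.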
  • 128.
    Salomone. Salomone is the last step in the HLR toolchain... Takes as input a reconfigurable clustered graph... Produces a schedule on an abstract reconfigurable architecture... Scheduling... It is considered the core task of HLR... Maps each configuration onto an area portion... Adapts the system execution to the reconfigurable model... Based on a graph coloring algorithm...
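    The slides say only that Salomone is based on graph coloring, so the following is a generic greedy coloring sketch rather than Salomone's algorithm. It assumes a conflict graph in which two configurations are adjacent when they cannot share an area portion, and reads each colour as one area portion; the graph, the names and the degree-based ordering are all hypothetical.

```python
def greedy_coloring(conflicts):
    """Assign each node the smallest colour not used by its already-coloured neighbours."""
    colour = {}
    # Visit high-degree nodes first; this is a common heuristic, not an optimality guarantee.
    for node in sorted(conflicts, key=lambda n: -len(conflicts[n])):
        used = {colour[m] for m in conflicts[node] if m in colour}
        colour[node] = next(c for c in range(len(conflicts)) if c not in used)
    return colour

# Hypothetical conflict graph: c1 overlaps in time with c2 and c3; c4 conflicts with nothing.
conflicts = {"c1": {"c2", "c3"}, "c2": {"c1"}, "c3": {"c1"}, "c4": set()}
assignment = greedy_coloring(conflicts)
print(assignment, "area portions used:", len(set(assignment.values())))
```

    Here two area portions suffice: c2 and c3 share one portion because they never conflict with each other, while c1 and c4 share the other.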
  • 129.
    Results 1/3. Based on AES encryption... Templates found with IsomorphClustering... Execution time... 123.94 s
  • 130.
    Results 2/3. Salomone adapting and coloring... Execution time... 113.55 s
  • 131.
    Results 3/3. Final Scheduling...
  • 132.
    What's next. Heuristic implementation for Salomone... To improve result quality in terms of number of area portions... A new metric for area/latency... Based on RTL logic synthesis evaluations... Introduce feedback into the HLR workflow... Based on schedule evaluation... New clustering and scheduling algorithms... Such as Napoleon...
  • 133.