Design Methodologies for Dynamic Reconfigurable Multi-FPGA Systems BY Alessandro Panella [email_address] 3-Day DRESD  07/28 – 08/01 2008 Hotel Villa Gina, Goglio, Italy
About this thesis (1/2) PROBLEM STATEMENT: Extend the range of application of dynamic reconfigurability techniques from the single FPGA case to multi-FPGA systems NOVELTY Methodology for the design of multi-FPGA systems Dynamic reconfigurability Seen as a solution for implementing applications that require more area than is available Only used “when needed” Regularity-driven partitioning for run-time reuse
About this thesis (2/2) Major contribution: Development of a multi-FPGA system design flow which exploits dynamic reconfigurability for blocks’ reuse. Useful contributions: Creation of an intermediate representation for structural and hierarchical circuits. Creation of a framework for the extraction of the design from VHDL. Design and implementation of static global layout algorithms. Exploit hierarchy information for regular patterns extraction.
Outline Context definition FPGA Multi-FPGA Systems (MFS) Dynamic reconfigurability Related works MFS design flows Dynamic reconfigurable MFS’s Proposed methodology Design extraction Global layout Reuse and Dynamic reconfigurability Experimental results Conclusion and future works
Field Programmable Gate Array Re-programmable semi-custom hardware Low Non-Recurring Engineering (NRE) costs Good performance High flexibility Composed of Configurable Logic Blocks (CLB) Xilinx Virtex CLB: 2 slices, each containing two 4-input Look-Up Tables (LUT)
Multi-FPGA Systems (MFS) Ensembles of multiple FPGAs (2 - 1000’s) Motivations: Massively parallel computing Need to implement large applications General trend in VLSI towards multi-core computers Applications: Supercomputing Logic emulation Neural networks, … Terminology: Architecture : physical cluster of FPGAs Application : programmed functionality System : architecture + application
MFS topologies (1/2) Connections: Hardwired vs. Programmable; Dedicated vs. Shared (bus, point to point). Complete graph (clique): PRO: direct connection between any two chips; CON: planarity, pin requirements. Mesh (4(8)-neighbor pattern): PRO: expandability; CON: no fixed-length path, communication logic in intermediate chips.
MFS topologies (2/2) Crossbar : logic bearing chips and routing chips Total (one routing chip)  Partial (several routing chips) Equal communication delays Low scalability Hybrid : combine benefits of the two approaches Example: Complete Graph Partial Crossbar (HCGP) (from Khalid, M.: Routing Architecture and Layout Synthesis for Multi-FPGA Systems, Ph.D. Thesis, University of Toronto, 1999)
Reconfigurability Reconfiguration: altering the location or functionality of a system element (G. Estrin, 1960) FPGA: suitable physical ground Partial vs. Total (Partial) Dynamic vs. Static: Only some parts of the system take part in each reconfiguration The execution of the system does not cease Motivations and applications Provide a larger virtual area React to sudden and frequent changes in application needs Fault tolerance
Dynamically Reconfigurable MFS’s Rationale: expand the capabilities of  static  MFS’s Going beyond MFS physical limitations Provide a high level of flexibility  E.g. in logic emulation: dynamic fault fixing Partial vs. Total reconfiguration in MFS Two main scenarios (not exclusive) Reconfiguration of  logic  chips Reconfiguration of  routing  chips The interconnections are dynamically mutable Components can be  reused
Design hierarchy Application composed of: Blocks Can have sub-blocks Nets Block-to-block Block-to-interface Advantages: Handle the complexity of design Reuse of modules IP-Cores libraries Block-to-block net Block-to-interface net
What’s next Context definition FPGA Multi-FPGA Systems (MFS) Dynamic reconfigurability Related works MFS design flows Dynamic reconfigurable MFS’s Proposed methodology Design extraction Global layout Reuse and Dynamic reconfigurability Experimental results Conclusion and future works
Related works - MFS design flow All MFS design flows have a similar structure Different algorithms used in each phase Examples: Hauck (a) and Khalid (b) Global layout tasks: partitioning, placement and routing Hauck, S.: Multi-FPGA Systems, Ph.D. Thesis, University of Washington, 1995 Khalid, M.: Routing Architecture and Layout Synthesis for Multi-FPGA Systems, Ph.D. Thesis, University of Toronto, 1999
Complete MFS design flows (a) Integrated solution to partitioning, placement and routing Recursive bi-partitioning Multilevel approach Clustering and refinement phases Partition orderings  for placement Identify the bottlenecks in the architecture Assign the two initial partitions to the least connected parts of the architecture, and so on recursively The connections are routed as the bisections are computed PROS: the architecture is considered CONS: no flexibility on routing given partitioning and placement
Complete MFS design flows (b) Partitioning: recursive bisection using Fiduccia-Mattheyses heuristic Placement: dependent on the topology Mesh: force-directed Crossbar: trivial task, the FPGAs have the same distance Routing: two approaches General (obtain a graph from the architecture) Specific (fitted on the particular MFS topology) PROS: uses existing effective and robust algorithms CONS: stress on routing and topology evaluation
Partial MFS design flows Address only some phases of the design Usually partitioning and placement Iterative approaches Genetic algorithm [Hidalgo et al., DSD ‘02] Simulated annealing [Roy et al., ICCAD ’93; Vicente et al., FPL ‘99] Hierarchical approaches Exploit the design hierarchy in partitioning Behrens et al., ICCAD ‘96 Hierarchy exploration heuristic Fang et al., TODAES ‘00 Hierarchy extraction from Verilog spec. Set-covering procedure
Dynamic Reconfigurable MFS Extraction of a directed task graph from VHDL Task graph divided into  time segments Using a non-linear programming model Each segment is spatially partitioned [ Ouaiss et al. , An Integrated Partitioning and Synthesis System for Dynamically Reconfigurable Multi-FPGA architectures, 1998] Dynamic?
What’s next Context definition FPGA Multi-FPGA Systems (MFS) Dynamic reconfigurability Related works MFS design flows Dynamic reconfigurable MFS’s Proposed methodology Design extraction Global layout Reuse and Dynamic reconfigurability Experimental results Conclusion and future works
Proposed methodology Multi-FPGA design flow Three main phases Design extraction Static Global Physical Layout Partitioning Placement Routing Reuse through Dynamic Reconfigurability Reuse introduces extra delays Reconf. times, sequential execution… Only adopted  when needed In such case,  the introduced delay has to be minimized
Design Extraction Input: VHDL description Output: Intermediate representation Ad hoc created data structure Two sub-phases: VHDL preprocessing VHDL structural parsing
Intermediate representation C++ data structure Contains both structural and hierarchical information Graphs implemented using the Boost Graph Library Container class provides an API
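A minimal sketch of what such an intermediate representation might hold, written in Python rather than the C++ / Boost Graph Library implementation named on the slide. The class and field names (Block, Net, Design, type_name, area) are hypothetical and are not taken from the thesis; the point is only to show hierarchy (children) and structure (nets) living side by side.

```python
from dataclasses import dataclass, field


@dataclass
class Block:
    """A design block; type_name is the component it is an instance of."""
    name: str
    type_name: str
    area: int = 0                       # e.g. slices reported by synthesis
    children: list = field(default_factory=list)

    def leaves(self):
        """Yield the leaf blocks below this block (the flattened view)."""
        if not self.children:
            yield self
        for child in self.children:
            yield from child.leaves()


@dataclass
class Net:
    """A connection between two leaf blocks, identified by name."""
    endpoints: tuple
    width: int = 1


@dataclass
class Design:
    """Container exposing both the hierarchy and the nets."""
    top: Block
    nets: list = field(default_factory=list)

    def flat_area(self):
        return sum(block.area for block in self.top.leaves())


# Hypothetical two-level hierarchy: one round containing two S-boxes.
sbox_a = Block("sbox_a", "sbox", area=40)
sbox_b = Block("sbox_b", "sbox", area=40)
round0 = Block("round0", "des_round", children=[sbox_a, sbox_b])
design = Design(top=round0, nets=[Net(("sbox_a", "sbox_b"), width=8)])
```

Keeping the component type on every block is what later allows regular (repeated) substructures to be recognised.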
VHDL Parsing VHDL preprocessing: obtain a pure structural VHDL description Features of each component are retrieved using vendors’ synthesis tools (e.g. Xilinx XST, Synplify PRO) Create the intermediate representation from the pure structural VHDL description
Example Hierarchy Flattened view DES encryption core (part of the 3DES core circuit)
Static Global Layout This phase addresses Partitioning and Placement Two implemented approaches: Integrated P&P Sequential P&P
Integrated P&P: simulated annealing algorithm Iterative randomized approach Suitable to cope with high-dimensionality problems Partitioning + Placement is such a problem Aim: minimize a cost function f The algorithm starts with a “high” temperature T At each iteration M random moves are performed The move is accepted according to the Metropolis criterion: Always if the cost decreases or remains equal With probability $e^{-\Delta f / T}$ if the cost increases by $\Delta f$ T is decreased by a cooling factor α Stop after S consecutive non-accepted moves
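The annealing loop described on this slide can be sketched as follows. This is an illustrative Python version, not the thesis code: the function names, the default parameter values and the toy one-dimensional problem are assumptions, and the lower bound on the temperature is a safety guard added here.

```python
import math
import random


def accept(delta, temperature, rng=random):
    """Metropolis criterion: always accept when the cost does not increase."""
    if delta <= 0:
        return True
    return rng.random() < math.exp(-delta / temperature)


def anneal(state, cost, random_move, t0=10.0, alpha=0.9,
           moves_per_temp=50, stop_after=200, rng=random):
    """Return (state, cost) once stop_after consecutive moves are rejected."""
    current, current_cost = state, cost(state)
    temperature, rejected = t0, 0
    # The temperature floor is an added guard, not part of the slide.
    while rejected < stop_after and temperature > 1e-12:
        for _ in range(moves_per_temp):      # M random moves per temperature
            candidate = random_move(current, rng)
            delta = cost(candidate) - current_cost
            if accept(delta, temperature, rng):
                current, current_cost = candidate, current_cost + delta
                rejected = 0
            else:
                rejected += 1
                if rejected >= stop_after:   # S consecutive rejections
                    break
        temperature *= alpha                 # cooling factor
    return current, current_cost


# Toy problem (hypothetical): walk along the integers towards x = 3.
rng = random.Random(0)
result = anneal(20, lambda x: (x - 3) ** 2,
                lambda x, r: x + r.choice((-1, 1)), rng=rng)
```

In the partitioning and placement setting the state would be the node-to-FPGA assignment and the move a single-node reassignment or a swap.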
Annealing implementation Solution: array $[c_i]$, node $i$ is placed in FPGA $c_i$ Cost: Weighted Estimated Wire Length (WEWL) Random move: single-node or swap, with equal probability Constraints: Area constraint I/O Pin constraint Handled with penalties
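The slide does not give the WEWL formula, so the sketch below assumes one plausible reading: each two-terminal net contributes its weight times the distance between the FPGAs hosting its endpoints, and constraint violations are added as linear penalties. The Manhattan distance, the penalty coefficient and all names are assumptions made for illustration.

```python
def manhattan(a, b):
    """Distance between two FPGAs given their mesh coordinates."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def wewl(assignment, nets, coords, distance=manhattan):
    """Assumed WEWL: sum of net weight times distance between host FPGAs."""
    total = 0
    for u, v, weight in nets:
        fu, fv = assignment[u], assignment[v]
        if fu != fv:
            total += weight * distance(coords[fu], coords[fv])
    return total


def penalized_cost(assignment, nets, coords, area, capacity, max_pins,
                   penalty=100, distance=manhattan):
    """WEWL plus linear penalties for area and I/O pin overflow."""
    cost = wewl(assignment, nets, coords, distance)
    used_area = [0] * len(coords)
    used_pins = [0] * len(coords)
    for node, fpga in enumerate(assignment):
        used_area[fpga] += area[node]
    for u, v, weight in nets:
        if assignment[u] != assignment[v]:   # cut nets consume pins
            used_pins[assignment[u]] += weight
            used_pins[assignment[v]] += weight
    for fpga in range(len(coords)):
        cost += penalty * max(0, used_area[fpga] - capacity)
        cost += penalty * max(0, used_pins[fpga] - max_pins)
    return cost


# Hypothetical 2x2 mesh, four nodes of 300 slices, three weighted nets.
coords = [(0, 0), (0, 1), (1, 0), (1, 1)]
nets = [(0, 1, 4), (1, 2, 2), (2, 3, 1)]
assignment = [0, 0, 3, 3]                    # node i sits on FPGA c_i
area = [300, 300, 300, 300]
```

With a distance function that returns 1 for every pair of FPGAs, the same cost reduces to the weighted cut size, so the placement aspect disappears and only partitioning is evaluated.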
Sequential P&P Partitioning: bottom-up clustering 1-to-1 Placement: annealing Simplified version of the integrated P&P algorithm CLUSTERING: Initialization: each node is considered as a cluster At each iteration Choose two nodes on the basis of a metric Collapse them Stop when Only one cluster is left No clusters can be formed due to Area constraint I/O Pin constraint
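A minimal sketch of the bottom-up clustering loop, assuming the closeness metric is simply the total weight of the nets joining two clusters and checking only the area constraint (the I/O pin constraint is omitted for brevity). Function and variable names are hypothetical, and clusters that share no net are never merged in this simplified version.

```python
def cluster(areas, nets, max_area):
    """Greedy bottom-up clustering; returns a list of sets of node ids."""
    clusters = {i: {i} for i in range(len(areas))}   # each node is a cluster
    size = {i: areas[i] for i in range(len(areas))}
    owner = list(range(len(areas)))                  # node -> cluster id
    while len(clusters) > 1:
        # Total net weight between every pair of distinct clusters.
        link = {}
        for u, v, weight in nets:
            a, b = owner[u], owner[v]
            if a != b:
                key = (min(a, b), max(a, b))
                link[key] = link.get(key, 0) + weight
        feasible = [(weight, pair) for pair, weight in link.items()
                    if size[pair[0]] + size[pair[1]] <= max_area]
        if not feasible:                             # no legal merge is left
            break
        # Pick the most strongly connected pair; ties go to the lowest ids.
        _, (a, b) = max(feasible,
                        key=lambda item: (item[0], -item[1][0], -item[1][1]))
        clusters[a] |= clusters.pop(b)               # collapse b into a
        size[a] += size.pop(b)
        for node in clusters[a]:
            owner[node] = a
    return sorted(clusters.values(), key=min)


# Hypothetical chain of four 100-slice nodes, FPGAs of 200 slices.
parts = cluster([100, 100, 100, 100],
                [(0, 1, 5), (1, 2, 1), (2, 3, 4)], max_area=200)
```

Because each merge is chosen locally, the result depends on the order of merges, which is the greediness the deck lists as a limitation of clustering.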
Clustering metrics Connection; Communication Ratio: internal comm. / external comm.; Communication density
Blocks reuse Problem: application does not fit onto the architecture  Reuse similar parts of the circuit in order to  save space Def:  dynamically-interconnected structure Architectural scenarios Bus Crossbar
Isomorphic clusters Which parts of the structure to consider for reuse? Def. Isomorphic Clusters Substructures which contain the same blocks having the same connections Example Two subproblems Finding isomorphic clusters Select the ones to reuse (and how many times)
Isomorphic clusters extraction (1/2) Regularity driven clustering Def. type of a node : component which the node is an instance of If two nodes selected for collapsing have the same parent Look for nodes with the same type as the parent in the hierarchy Execute the same collapsing operation Assign the same type to the newly created clusters Clustering itself benefits from this enhancement Problem of standard clustering: lack of global metric Regularity provides global information
Isomorphic clusters extraction (2/2) The key feature is the assignment of a “type” to clusters Example:
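The type assignment can be sketched as a small registry: merging the same pair of types always yields the same cluster type, so clusters that share a type become candidates for reuse. This Python sketch is a simplification made for illustration. It only looks at the types being merged and does not check that the connections match, and the names (TypeRegistry, merged_type, group_by_type) are hypothetical.

```python
from collections import defaultdict


class TypeRegistry:
    """Hands out one cluster type per unordered pair of merged types."""

    def __init__(self):
        self._types = {}

    def merged_type(self, type_a, type_b):
        key = tuple(sorted((type_a, type_b)))
        if key not in self._types:
            self._types[key] = f"cluster_type_{len(self._types)}"
        return self._types[key]


def group_by_type(cluster_types):
    """Map each type to the names of the clusters that carry it."""
    groups = defaultdict(list)
    for name, type_name in cluster_types.items():
        groups[type_name].append(name)
    return dict(groups)


registry = TypeRegistry()
t0 = registry.merged_type("sbox", "perm")    # first (sbox, perm) merge
t1 = registry.merged_type("perm", "sbox")    # same pair elsewhere in the design
t2 = registry.merged_type(t0, "xor")         # a higher-level cluster
groups = group_by_type({"c_a": t0, "c_b": t1, "c_c": t2})
```

Any group with more than one member is a set of candidate isomorphic clusters.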
Blocks reuse choices Choose which blocks to reuse Difficulty: high complexity due to hierarchical clusters Some clusters contain others Solution ILP model fast even for a high number of nodes Run the ILP model on each “cut” of the dendrogram Each cut is a flattened structural view of the application
ILP model for blocks reuse $x_i$: number of times cluster type $t_i$ is reused (= no. of needed reconfigurations)
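To make the notion of a dendrogram cut concrete, the sketch below replays the first k merges of a clustering history and returns the resulting flat set of clusters. The encoding of the history (each merge creates a new cluster whose id continues after the leaf ids) is an assumption chosen for this example, not the representation used in the thesis.

```python
def cut(num_leaves, merges, k):
    """Flat clusters obtained after applying the first k merges."""
    clusters = {i: frozenset([i]) for i in range(num_leaves)}
    next_id = num_leaves
    for a, b in merges[:k]:
        # The merged cluster gets a fresh id, as in a standard dendrogram.
        clusters[next_id] = clusters.pop(a) | clusters.pop(b)
        next_id += 1
    return sorted(clusters.values(), key=min)


# Hypothetical history over four leaves: (0,1) -> 4, (2,3) -> 5, (4,5) -> 6.
merges = [(0, 1), (2, 3), (4, 5)]
views = [cut(4, merges, k) for k in range(len(merges) + 1)]
```

Each element of views is one flattened structural view; in the flow described here, the reuse model would be solved once per view and the results compared.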
What’s next Context definition FPGA Multi-FPGA Systems (MFS) Dynamic reconfigurability Related works MFS design flows Dynamic reconfigurable MFS’s Proposed methodology Design extraction Global layout Reuse and Dynamic reconfigurability Experimental results Conclusion and future works
Experiments Test circuit description (slide 37) Integrated vs. Sequential partitioning & placement Methodologically, both approaches are valid They are compared from a numerical point of view Partitioning evaluation (slide 38) Placement evaluation (slide 39) Sequential P&P vs. Metis (slide 40) Provide a comparison with an external approach Blocks reuse evaluation (slide 41) Execution time Example of application
Results: test circuits Triple-DES encryption+decryption core  (3DES) Finite Impulse Response filter  (FIR) Noekeon cipher  (NOEK) Composed module FIR+3DES
Integrated vs. Sequential P&P (1/2) Partitioning evaluation NOTE : by setting the distance between any two FPGAs equal to 1, the integrated annealing approach is actually a partitioning algorithm
Integrated vs. Sequential P&P (2/2) Placement evaluation (on mesh architectures) Integrated P&P Sequential P&P
Clustering Vs. Metis
Results: ILP model solving Timing results ILP result - example : 3DES-FIR circuit Conn metric 4 FPGAs of 600 slices needed Only 3 are available Adopt reuse Dendrogram cuts 2-7 provide the lowest estimated reconfiguration time
What’s next Context definition FPGA Multi-FPGA Systems (MFS) Dynamic reconfigurability Related works MFS design flows Dynamic reconfigurable MFS’s Proposed methodology Design extraction Global layout Reuse and Dynamic reconfigurability Experimental results Conclusion and future works
Conclusion: contributions Major contribution: Development of a multi-FPGA systems design flow which exploits dynamic reconfigurability for blocks reuse while minimizing the estimated execution time. Useful contributions: Creation of an intermediate representation for structural and hierarchical circuits. Creation of a framework for the extraction of the design from VHDL. Design and implementation of static global layout algorithms. Exploit hierarchy information for regular patterns extraction. The proposed approaches have been validated through experimental evaluations
Conclusion: future works Improvements Go beyond the inherent greediness of clustering More powerful closeness metrics More accurate time estimation function for blocks reuse Additions Development of a robust and effective routing algorithm for both static and dynamic implementations Partitioning and placement for dynamically-interconnected structures Binding and scheduling of application blocks on the instantiated clusters
The end. Questions?
That’s all folks! Thank you. How ‘bout a funny joke?

3rd 3DDRESD: DReAMS


Editor's Notes

  • #2 Good morning to everybody and thank you for being here, I am… I’m going to present my thesis work, which is entitled…