CSBP: A Fast Circuit Similarity-Based Placement for FPGA Incremental Design and Design Space Exploration

Invited talk at Fudan University, by Xiaoyu Shi: http://webdocs.cs.ualberta.ca/~xshi


  1. CSBP: A Fast Circuit Similarity-Based Placement for FPGA Incremental Design and Design Space Exploration
     Xiaoyu Shi¹, Dahua Zeng¹, Yu Hu², Guohui Lin¹, Osmar R. Zaiane¹
     ¹ Dept. of Computing Science, University of Alberta
     ² Dept. of Electrical and Computer Engineering, University of Alberta
     Presented by Xiaoyu Shi
     Please address comments to bryanhu@ece.ualberta.ca
  2. Outline
       • Introduction
       • Circuit Similarity-Based Placement
       • Experimental Results
       • Conclusion and Future Work
  3. Introduction
     Field Programmable Gate Array (FPGA)
       • Ease of design, low start-up costs and fast manufacturing turnaround time.
       • FPGA sizes have reached the million-gate level.
       • Modern FPGA designs suffer from long compilation times.
       [Figure: Xilinx SPARTAN-6 board]
     FPGA placement
       • Determines which logic block within an FPGA should implement each of the logic blocks required by the circuit.
       • Has a significant impact on performance and routability in nanometer circuit designs.
       • The optimization goal is to minimize criteria such as wire length, critical delay and area.
       • Has become the bottleneck of modern FPGA circuit design [Chen’06].
     Fast placement algorithms to date
       • Placement efficiency, as a single synthesis phase, has been studied extensively for decades.
       • State-of-the-art work includes multi-core [Ludwin’08], embedding-based [Gopalakrishnan’06], partitioning-based [Maidee’05], multi-level [Sankar’99] and simulated annealing [Betz’97] approaches.
  4. Reusable Info in CAD
     Incremental design for FPGAs
       • Design preservation is the key to incremental design.
       • Similarity among circuits exists because functional changes or optimizations are small, and they generally result in a modified circuit whose topology is similar to that of the original circuit [Krishnaswamy’09].
       [Figure: Incremental design process for FPGAs. Initial design (iteration 1) → changes due to verification, timing, etc. (iteration 2) → optimizations, timing, etc. (iteration 3, …) → final design (final iteration)]
  5. Reusable Info in CAD (Cont.)
     Design space exploration for FPGAs
       • FPGA design offers a variety of customizations by varying design parameters.
       • Local similarity and global similarity exist in design space exploration.
       [Figure: design flow from initial design, through changes due to verification, timing, etc. and optimizations, to final design]
       [Figure: Constant multiplier blocks by CMU SPIRAL [Puschel’04]]
  6. 6. Data Mining Overview  The key of data mining is to extract patterns and useful information from data, including text, graphs and circuits, etc.  It has been extensively studied since 1950s, and has been widely applied to many domains, such as businesses, sciences and health cares.  Graph mining, including graph pattern mining, graph classification and graph compression, is a research hot area in data mining [Borgwardt’08]. Graph similarity  It quantitatively defines the topological similarity between two graphs.  It has been used to many applications, such as web searching [Kleinberg’99], social network mapping [Watts’99] and chemical structure matching [Hattori’03].
  7. Graph Similarity
     Summary of graph similarity measures

     | Measure | Description | Time complexity | Global topology |
     |---|---|---|---|
     | Isomorphism [Pelillo’02] | Identifying a bijection between the nodes of two graphs which preserves (directed) adjacency | NP-Hard | Yes |
     | Edit distance [Bunke’99] | Given a cost function on edit operations, determine the minimum cost transformation from one graph to another | NP-Hard | Yes |
     | Common subgraph [Fernandez’01] | Identifying the largest isomorphic subgraphs of two graphs | NP-Hard | Yes |
     | Iterative methods [Blondel’04] | Two graph elements are similar if their neighborhoods are similar | Cubic | Yes |
     | Statistical methods [Alberta’02] | Assessing aggregate measures of graph structure, degree distribution, diameter, betweenness measures | Linear | No |

     Iterative methods
       • They have lower computational complexity and consider global topological information.
       • They take advantage of graph sparsity.
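As an illustration of the "iterative methods" row, here is a minimal sketch of the neighborhood-based similarity update in the style of [Blondel’04]: the score of a node pair is repeatedly replaced by the sum of the scores of their fanout pairs and fanin pairs, followed by normalization. This is a generic, dense, unoptimized version for small graphs. It is not the CSBP algorithm itself, and the function name and the adjacency-matrix input format are assumptions made for this example.

```python
import math

def similarity_matrix(adj_a, adj_b, iterations=20):
    """Blondel-style node similarity between directed graphs A and B.

    adj_a, adj_b: square 0/1 adjacency matrices (row = source, column = target).
    Returns an n_b x n_a matrix S; S[i][j] scores node i of B against node j of A.
    An even iteration count is used because the even iterates are the ones
    that converge in this scheme.
    """
    n_a, n_b = len(adj_a), len(adj_b)
    s = [[1.0] * n_a for _ in range(n_b)]
    for _ in range(iterations):
        nxt = [[0.0] * n_a for _ in range(n_b)]
        for i in range(n_b):
            for j in range(n_a):
                total = 0.0
                for k in range(n_b):
                    for l in range(n_a):
                        # Fanout term: i -> k in B and j -> l in A.
                        if adj_b[i][k] and adj_a[j][l]:
                            total += s[k][l]
                        # Fanin term: k -> i in B and l -> j in A.
                        if adj_b[k][i] and adj_a[l][j]:
                            total += s[k][l]
                nxt[i][j] = total
        # Normalize by the Frobenius norm so the scores stay bounded.
        norm = math.sqrt(sum(x * x for row in nxt for x in row))
        if norm == 0.0:
            return nxt
        s = [[x / norm for x in row] for row in nxt]
    return s

# Usage: self-similarity of the directed path 0 -> 1 -> 2.
path = [[0, 1, 0], [0, 0, 1], [0, 0, 0]]
for row in similarity_matrix(path, path):
    print(["%.4f" % x for x in row])
```

The quadruple loop is only for readability. A practical implementation would iterate over edge lists instead, which is where graph sparsity pays off.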
  8. Circuit Similarity
     Circuit similarity
       • We define circuit similarity to describe the similar topological structures between two circuits.
       • We adapt the iterative methods from graph similarity.
       • It exists in several CAD phases, such as placement, routing and verification.
       • It can be widely used to accelerate FPGA design tasks, such as incremental design and design space exploration.
  9. Outline
       • Introduction
       • Circuit Similarity-Based Placement
       • Experimental Results
       • Conclusion and Future Work
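  10. Motivating Example
      Circuit similarity algorithm
        [Figures: Graph G and Graph G’]

      Similarity score matrix for G and G’ (columns: nodes of G, rows: nodes of G’)

      |      | V7   | V8   | V9   | V10  | V11  | V12  | V13  | V14  | V15  | V16  |
      |------|------|------|------|------|------|------|------|------|------|------|
      | V’7  | 0.92 | 0.25 | 0.48 | 0.15 | 0    | 0    | 0    | 0.42 | 0.06 | 0    |
      | V’8  | 0    | 0.73 | 0    | 0    | 0.05 | 0    | 0.39 | 0    | 0.17 | 0.06 |
      | V’9  | 0    | 0.39 | 0    | 0    | 0.4  | 0    | 0.73 | 0    | 0.06 | 0.48 |
      | V’10 | 0.48 | 0    | 0.89 | 0.25 | 0.3  | 0.12 | 0.14 | 0.06 | 0.33 | 0.09 |
      | V’11 | 0    | 0    | 0.11 | 0.48 | 0    | 0.86 | 0    | 0.36 | 0.17 | 0    |
      | V’12 | 0    | 0    | 0.3  | 0.34 | 0.64 | 0.25 | 0.39 | 0.34 | 0.15 | 0.42 |
      | V’13 | 0.48 | 0.25 | 0.07 | 0.4  | 0    | 0.36 | 0    | 0.88 | 0.06 | 0    |
      | V’14 | 0.4  | 0.39 | 0.29 | 0.15 | 0.15 | 0.18 | 0.12 | 0.46 | 0.59 | 0.06 |
      | V’15 | 0    | 0.12 | 0.09 | 0    | 0.63 | 0    | 0.36 | 0    | 0.27 | 0.82 |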
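To show how a node correspondence can be read off such a score matrix, the sketch below finds the one-to-one matching between G’ nodes and G nodes with the largest total score. The deck states that the CS2 package [Goldberg’97] is used for the maximum matching problem. This example swaps in a small exact dynamic program over column subsets, which is only practical for toy sizes like this one. The function name `best_matching` is an assumption made for illustration.

```python
from functools import lru_cache

COLS = ["V7", "V8", "V9", "V10", "V11", "V12", "V13", "V14", "V15", "V16"]   # nodes of G
ROWS = ["V'7", "V'8", "V'9", "V'10", "V'11", "V'12", "V'13", "V'14", "V'15"]  # nodes of G'
SCORES = [
    [0.92, 0.25, 0.48, 0.15, 0,    0,    0,    0.42, 0.06, 0],
    [0,    0.73, 0,    0,    0.05, 0,    0.39, 0,    0.17, 0.06],
    [0,    0.39, 0,    0,    0.4,  0,    0.73, 0,    0.06, 0.48],
    [0.48, 0,    0.89, 0.25, 0.3,  0.12, 0.14, 0.06, 0.33, 0.09],
    [0,    0,    0.11, 0.48, 0,    0.86, 0,    0.36, 0.17, 0],
    [0,    0,    0.3,  0.34, 0.64, 0.25, 0.39, 0.34, 0.15, 0.42],
    [0.48, 0.25, 0.07, 0.4,  0,    0.36, 0,    0.88, 0.06, 0],
    [0.4,  0.39, 0.29, 0.15, 0.15, 0.18, 0.12, 0.46, 0.59, 0.06],
    [0,    0.12, 0.09, 0,    0.63, 0,    0.36, 0,    0.27, 0.82],
]

def best_matching(scores):
    """Exact maximum-weight matching of every row to a distinct column.

    Assumes there are at least as many columns as rows. 'used' is a bitmask
    of the columns already assigned to earlier rows.
    """
    n_rows, n_cols = len(scores), len(scores[0])

    @lru_cache(maxsize=None)
    def solve(row, used):
        if row == n_rows:
            return 0.0, ()
        best_total, best_pairs = float("-inf"), ()
        for col in range(n_cols):
            if used & (1 << col):
                continue
            total, pairs = solve(row + 1, used | (1 << col))
            total += scores[row][col]
            if total > best_total:
                best_total, best_pairs = total, ((row, col),) + pairs
        return best_total, best_pairs

    return solve(0, 0)

total, pairs = best_matching(SCORES)
matching = {ROWS[r]: COLS[c] for r, c in pairs}
for new_node, ref_node in matching.items():
    print(new_node, "->", ref_node)
```

In this matrix every row's highest score falls in a different column, so the optimal matching simply pairs each G’ node with its best-scoring G node and leaves V10 unmatched.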
  11. Motivating Example (Cont.)
      Circuit similarity-based placement
        • The initial placement of the new circuit design (G’) is generated by computing the similarity between the original (G) and modified circuits, and finding the corresponding node matching.
        • A low-temperature simulated annealing is then applied to refine the results.
        • The proposed circuit similarity algorithm can be used to speed up placement, which allows faster incremental design and design space exploration.
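The first step of this flow can be sketched as follows: each matched node of the new circuit inherits the location of its counterpart in the reference placement, and unmatched nodes are dropped into leftover legal locations. This is a hedged illustration only. The function name, the data layout, and the "first free slot" policy for unmatched nodes are assumptions, and the low-temperature annealing refinement is not shown.

```python
def seed_placement(new_nodes, matching, ref_placement, all_slots):
    """Build an initial placement {node: (x, y)} for the new circuit.

    matching:      {new_node: ref_node} from the similarity step (may be partial).
    ref_placement: {ref_node: (x, y)} of the already placed reference circuit.
    all_slots:     list of legal (x, y) locations; assumed to be at least as
                   many as there are nodes to place.
    """
    placement = {}
    taken = set()
    # Matched nodes reuse the location of their reference counterpart.
    for node in new_nodes:
        ref = matching.get(node)
        if ref in ref_placement and ref_placement[ref] not in taken:
            placement[node] = ref_placement[ref]
            taken.add(ref_placement[ref])
    # Unmatched nodes take the remaining free locations in order (assumed policy).
    free = [slot for slot in all_slots if slot not in taken]
    for node in new_nodes:
        if node not in placement:
            placement[node] = free.pop(0)
    return placement

# Usage with hypothetical data: "c" has no counterpart in the reference circuit.
slots = [(1, 1), (2, 1), (1, 2), (2, 2)]
print(seed_placement(["a", "b", "c"], {"a": "A", "b": "B"},
                     {"A": (1, 1), "B": (2, 1)}, slots))
```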
  12. Motivating Example (Cont.)
      [Figure: Placement layouts comparison of circuit “des”. Panels: (a) placement of reference config, (b) initial placement using CS, (c) final placement using CS, (d) initial placement using VPR, (e) final placement using VPR]
      A real example
        • For circuit “des”, the reference configuration (synthesized using the “resyn3” script in ABC) has 1245 CLBs and 1501 nets, while the new configuration (synthesized using the “rwsat2” script in ABC) has 1215 CLBs and 1471 nets.
        • The results show that CSBP successfully finds the internal node correspondence.

      Placement results of circuit “des”

      |           | Wire | Delay (E-05) | Critical delay (E-08) | Runtime (s) |
      |-----------|------|--------------|-----------------------|-------------|
      | CS-init   | 306  | 5.93         | -                     | -           |
      | VPR-init  | 1087 | 14.00        | -                     | -           |
      | CS-final  | 237  | 5.08         | 8.28                  | 13.38       |
      | VPR-final | 221  | 4.98         | 10.10                 | 28.42       |
  13. Circuit Similarity CAD Flow
      [Figure: CAD flow for incremental design]
      [Figure: CAD flow for design space exploration]
  14. Circuit Similarity Algorithm
      Iterative similarity algorithm
        • We employ the iterative similarity algorithm for undirected molecular graphs [Rupp’07].
        • We adapt it to directed circuit graphs: based on constraints unique to circuits, we fix the I/O pins and compute the similarity of fanin and fanout nodes separately.
        [Figure: algorithm pseudocode, including the condition "If (|in(vi)| < |in(v’j)| and |out(vi)| < |out(v’j)|)"]
        [Table: Summary of variables]
  15. Performance Enhancement
      Support constraint
        • A support of a node is the set of nodes with predefined matchings in the transitive fanin or fanout cone of this node.
        • Formally, if v ∈ G and v’ ∈ G’, the support constraint requires: [equation not preserved in the transcript], where β ∈ (0,1].
      Level constraint
        • A topological sort and a reverse topological sort can label each internal node with two values.
        • Formally, if v ∈ G and v’ ∈ G’, the level constraint requires: [equation not preserved in the transcript], where Bl and Br are two nonnegative integers.
      [Figure: Effectiveness of the pruning techniques]
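The two level labels mentioned under the level constraint can be computed with one topological pass over the circuit graph and one over its reversed graph. The sketch below is an assumption-heavy illustration. It takes "level" to mean the longest-path depth from the inputs and from the outputs, and it assumes the constraint bounds the absolute level differences by Bl and Br, since the exact formula is not preserved in the transcript. All function names are hypothetical.

```python
from collections import deque

def forward_levels(nodes, edges):
    """Longest-path depth of each node from the sources of a DAG."""
    fanout = {n: [] for n in nodes}
    indegree = {n: 0 for n in nodes}
    for src, dst in edges:
        fanout[src].append(dst)
        indegree[dst] += 1
    level = {n: 0 for n in nodes}
    queue = deque(n for n in nodes if indegree[n] == 0)
    while queue:  # Kahn's topological sort
        n = queue.popleft()
        for m in fanout[n]:
            level[m] = max(level[m], level[n] + 1)
            indegree[m] -= 1
            if indegree[m] == 0:
                queue.append(m)
    return level

def two_levels(nodes, edges):
    """Return {node: (level from inputs, level from outputs)}."""
    fwd = forward_levels(nodes, edges)
    # Reverse topological order is obtained by running on the reversed graph.
    bwd = forward_levels(nodes, [(dst, src) for src, dst in edges])
    return {n: (fwd[n], bwd[n]) for n in nodes}

def passes_level_constraint(lv, lv_prime, b_l, b_r):
    """ASSUMED form of the check: both level differences stay within the bounds."""
    return abs(lv[0] - lv_prime[0]) <= b_l and abs(lv[1] - lv_prime[1]) <= b_r

# Usage on a tiny hypothetical circuit: a, b feed c, which feeds d.
levels = two_levels(["a", "b", "c", "d"], [("a", "c"), ("b", "c"), ("c", "d")])
print(levels)
```

Under these assumptions, a candidate pair (v, v’) whose levels differ by more than the bounds would be skipped before any similarity score is computed, which is how such a constraint prunes work.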
  16. Outline
       • Introduction
       • Circuit Similarity-Based Placement
       • Experimental Results
       • Conclusion and Future Work
  17. Incremental Design
      CAD flow
        • Two-iteration CAD flow.
        • CSBP flow (a) and from-scratch flow (b) are compared.
        • Optimization “imfs” reduces the number of CLBs by 2%.
        [Figure: CAD flow for incremental design]
      Settings
        • Two versions of CSBP are compared: a high-quality version (CS) with β = 0.5, inner_num = 1 and Bl = Br = 1; and a turbo version (CS-t) with β = 1, inner_num = 0.1 and Bl = Br = 0.
        • CSBP is implemented in C and evaluated on the 20 largest MCNC benchmarks.
        • The results are averaged over 5 runs on a Linux server with a dual-core 2.19GHz CPU and 5GB memory.
        • The CS2 package [Goldberg’97] is used for the maximum matching problem.
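To give a feel for the inner_num setting: in VPR's annealer, the number of moves attempted at each temperature is inner_num times the number of blocks raised to the 4/3 power. Assuming CSBP's inner_num keeps that meaning (the deck does not say so explicitly), the sketch below compares the per-temperature effort of the CS and CS-t settings. The block count is a hypothetical value, not a benchmark figure.

```python
def moves_per_temperature(num_blocks, inner_num):
    """VPR-style move budget per temperature: inner_num * num_blocks^(4/3)."""
    return round(inner_num * num_blocks ** (4.0 / 3.0))

num_blocks = 1000  # hypothetical circuit size
cs_moves = moves_per_temperature(num_blocks, 1.0)     # CS setting
cs_t_moves = moves_per_temperature(num_blocks, 0.1)   # CS-t setting
print(cs_moves, cs_t_moves)
```

Under this assumption, the turbo version attempts about one tenth as many moves at each temperature, which is one plausible source of its extra speedup.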
  18. Results
      Initial placement results
        • Bounding box cost (bb cost) and delay cost are compared.
        • The initial placement results generated using CS are clearly much better than VPR’s initial results, and are very close to VPR’s final results.
      [Figure: Comparisons of initial bb cost. Percentage per benchmark circuit for CS-init, VPR-final and VPR-init. CS reduces bb cost by 72% on avg. compared to VPR]
      [Figure: Comparisons of initial delay cost. Percentage per benchmark circuit for CS-init, VPR-final and VPR-init. CS reduces delay cost by 53% on avg. compared to VPR]
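For readers unfamiliar with the bounding box cost, it can be sketched as the sum, over all nets, of the half-perimeter of the smallest rectangle enclosing the net's blocks. The version below is deliberately simplified: VPR's actual bb cost also scales each net by a fanout-dependent correction factor, which is omitted here. The function name and data layout are assumptions made for illustration.

```python
def bounding_box_cost(nets, placement):
    """Sum of half-perimeter bounding boxes over all nets.

    nets:      list of nets, each a list of block names.
    placement: {block: (x, y)}.
    """
    cost = 0
    for net in nets:
        xs = [placement[block][0] for block in net]
        ys = [placement[block][1] for block in net]
        cost += (max(xs) - min(xs)) + (max(ys) - min(ys))
    return cost

# Usage with hypothetical data.
placement = {"a": (0, 0), "b": (3, 4), "c": (1, 1)}
nets = [["a", "b"], ["a", "b", "c"]]
print(bounding_box_cost(nets, placement))
```

A good initial placement keeps connected blocks close together, which is why a placement inherited from a similar circuit starts with a much lower cost than a random one.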
  19. Results (Cont.)
      Post-routing results comparison
        • A low-temperature annealing is applied to the initial results.
        • Wire length, critical delay and area are compared.
        • The results demonstrate the effectiveness of the pruning techniques, which do not affect the quality significantly.
      [Figure: Wire length per benchmark circuit for CS-t, CS and VPR. CS increases the wire length by 3% on avg.]
      [Figure: Area per benchmark circuit for CS-t, CS and VPR. CS increases the area by 2% on avg.]
      [Figure: Critical delay per benchmark circuit for CS-t, CS and VPR. CS increases the crit. delay by 6% on avg.]
  20. Results (Cont.)
      Runtime comparison
        • Only placement time is compared.
        • CS-t achieves a 31x speedup on average, and up to 91x.
        • Larger speedups are expected as circuits become larger.
      [Figure: Speedups compared to VPR, for CS-t, CS and VPR]
  21. Design Space Exploration
      CAD flow
        • Study the logic-level and algorithm-level design spaces, respectively.
        • CSBP flow (a) and from-scratch flow (b) are compared.
        [Figure: CAD flow for design space exploration]
      Settings
        • The logic-level design space consists of 19 configurations generated by 19 ABC¹ synthesis scripts in abc.rc.
        • The algorithm-level design space consists of 18 configurations of a constant multiplier generated by CMU SPIRAL [Puschel’04], varying bits from 7 to 25².
        • Both CS and CS-t are evaluated.
        • The benchmarking environments are the same as logic-level design space exploration.
      ¹ http://www.eecs.berkeley.edu/~alanmi/abc/
      ² Bit = 16 is abandoned due to ABC crash
  22. Logic-level Sample Synthesis Scripts

      | Alias    | Script |
      |----------|--------|
      | resyn    | "b; rw; rwz; b; rwz; b" |
      | resyn2   | "b; rw; rf; b; rw; rwz; b; rfz; rwz; b" |
      | resyn2a  | "b; rw; b; rw; rwz; b; rwz; b" |
      | src_rw   | "st; rw -l; rwz -l; rwz -l" |
      | src_rs   | "st; rs -K 6 -N 2 -l; rs -K 9 -N 2 -l; rs -K 12 -N 2 -l" |
      | choice   | "fraig_store; resyn; fraig_store; resyn2; fraig_store; fraig_restore" |
      | rwsat    | "st; rw -l; b -l; rw -l; rf -l" |
      | compress | "b -l; rw -l; rwz -l; b -l; rwz -l; b -l" |
      | share    | "st; multi -m; fx; resyn2" |

      Source: http://www.eecs.berkeley.edu/~alanmi/abc/
  23. Logic Level Results
      Initial results comparison
        • The number of CLBs and levels vary widely in the logic-level design space.
        • Circuit “dsip” is shown as an example.
        • Bounding box cost and delay cost are compared for initial placement results.
      [Figure: Characteristics of logic-level design space]
      [Figure: Initial bb cost of “dsip” for CS, CS-t and VPR across the 19 synthesis scripts. CS reduces bb cost by 76% on avg.]
      [Figure: Initial delay cost of “dsip” for CS, CS-t and VPR across the 19 synthesis scripts. CS reduces delay cost by 48% on avg.]
  24. Logic Level Results (Cont.)
      Final placement results
        • Wire length and critical delay of circuit “dsip” are compared.
        • The final results produced by CS and CS-t are very close or better compared to VPR’s, with 32% overhead for wire length and 20% improvement for critical delay.
      [Figure: Final wire length comparison of “dsip”. Percentage for CS-t, CS and VPR across the 19 synthesis scripts]
      [Figure: Final critical delay comparison of “dsip”. Percentage for CS-t, CS and VPR across the 19 synthesis scripts]
  25. Logic Level Results (Cont.)
      Design space shape characterization
        • We compare the minimal, median and maximal wire length and critical delay produced by CS, CS-t and VPR.
        • We also compare the shapes of each configuration over 19 designs.
        • The almost identical curves show that CSBP is able to accurately depict the shape of a design space.
      [Figure: Shape of final wire length of circuit “dsip” for vpr, cs and cs-t across the 19 synthesis scripts]
      [Figure: Shape of minimal wire length of 20 circuits over 19 designs, for vpr-min, cs-min and cs-t-min]
      [Figure: Shape of minimal crit. delay of 20 circuits over 19 designs, for vpr-min, cs-min and cs-t-min]
  26. Logic Level Results (Cont.)
      Runtime comparison
        • Only placement time is compared.
        • CS-t achieves a 30x speedup on average, and up to 100x.
        • In practice, one can take advantage of the significant speedup of CS-t to perform quick design space exploration.
      [Table: Runtime comparison (“*” marked time is measured with a timeout)]
      [Figure: Speedups compared to VPR per benchmark circuit, for CS, CS-t and VPR]
  27. Algorithm Level Results
      Experimental settings
        • The algorithm-level design is a constant multiplier.
        • The design parameter explored in our experiments is the number of fractional bits, varying from 7 to 25¹.
        • CMU SPIRAL is used to generate the RTL design based on the Hcub algorithm [Voronenko’07].
        [Figure: Characteristics of algorithm-level design space generated by CMU SPIRAL]
        [Figure: An example of a constant parallel multiplier]
      Experimental results
        • The initial and final placement results are similar to those of the logic-level space exploration.
        • CS and CS-t achieve 7x and 30x speedups compared to VPR, respectively.
      ¹ Bit = 16 is abandoned due to ABC crash
  28. Algorithm Level Results (Cont.)
      Wire length-delay space comparison
        • The Pareto points, which are the optimal configurations in a design space, are of most interest to IC designers.
        • CS and VPR find the same Pareto points.
        • Bits = 24 is used as the reference circuit.
      [Figure: Wire length-delay space of VPR. Estimated critical delay vs. wire length, one point per bit width (B7 to B25)]
      [Figure: Wire length-delay space of CS. Estimated critical delay vs. wire length, one point per bit width (B7 to B25)]
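As a reminder of what a Pareto point is in a wire length-delay space: a configuration is Pareto-optimal if no other configuration is at least as good in both metrics and strictly better in one. The sketch below extracts such points from a small set of designs. The configuration names and numbers are hypothetical and are not taken from the plots.

```python
def pareto_points(designs):
    """Return the names of non-dominated designs.

    designs: {name: (wire_length, critical_delay)}; smaller is better for both.
    """
    front = []
    for name, (w, d) in designs.items():
        dominated = any(
            (w2 <= w and d2 <= d) and (w2 < w or d2 < d)
            for other, (w2, d2) in designs.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    return sorted(front)

# Hypothetical design space: cfgC is beaten by cfgA in both metrics.
designs = {
    "cfgA": (100, 3.0),
    "cfgB": (150, 2.0),
    "cfgC": (160, 3.5),
    "cfgD": (300, 1.5),
}
print(pareto_points(designs))
```

Running the same function on the results of two placers and comparing the returned sets is one simple way to check a claim such as "CS and VPR find the same Pareto points".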
  29. Outline
       • Introduction
       • Circuit Similarity-Based Placement
       • Experimental Results
       • Conclusion and Future Work
  30. Future Work
      Improvement to CSBP
        • Integrate predefined matchings, for example name matching, into CSBP to further enhance both the efficiency and the quality of the design.
      Other applications
        • Study the effectiveness of applying the circuit similarity algorithm to other applications, such as routing and sequential verification for FPGAs.
  31. Conclusion
      Proposed an efficient circuit similarity algorithm
      Developed CSBP, a fast circuit similarity-based placement for FPGAs
        • Applied CSBP to incremental design and design space exploration.
        • Open-source tool available at: http://webdocs.cs.ualberta.ca/~xshi/soft.html
      Applied CSBP to incremental design for FPGAs
        • CSBP is able to reduce engineering effort by capturing the similarity from the previous design iterations.
        • CSBP (CS-t) is 31x faster than VPR on average.
      Applied CSBP to design space exploration for FPGAs
        • CSBP can precisely depict the shape of a design space and pinpoint the optimal designs.
        • CSBP (CS-t) is 30x faster than VPR on average.
  32. Xiaoyu Shi, Dahua Zeng, Yu Hu, Guohui Lin, Osmar R. Zaiane
      CSBP: A Fast Circuit Similarity-Based Placement for FPGA Incremental Design and Design Space Exploration
