GPU-Accelerated VLSI Routing Using Group Steiner Trees
Basileal Imana, Venkata Suhas Maringanti and Peter Yoon
Department of Computer Science, Trinity College, Hartford, CT
Group Steiner Problem
 Given an undirected weighted graph 𝐺 = (𝑉, 𝐸) and a family 𝑁 = {𝑁1, … , π‘π‘˜} of π‘˜ disjoint groups of nodes 𝑁𝑖 βŠ† 𝑉, find a
minimum-cost tree that contains at least one node from each group 𝑁𝑖.
 Steiner nodes, i.e., nodes that are not a member of any group 𝑁𝑖, can optionally be used to interconnect the groups. In the
figure, the solid dots represent non-Steiner nodes, the hollow dots represent optional Steiner nodes, and the boxes represent groups.
 A solution to the Group Steiner Problem (GSP) gives an efficient way of connecting at least one vertex from each of several
groups of vertices in a graph. Applications of this problem and its variants include VLSI circuit routing and computer-aided design.
Figure: A feasible solution for a Group Steiner Problem instance
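To make the definition concrete, here is a tiny Python sketch of one possible encoding of a GSP instance and its feasibility condition. The encoding, the node names, and the use of the networkx library are our own illustration, not part of the poster.

# Illustrative sketch only: one possible encoding of a GSP instance.
# Assumes the networkx library; any graph representation would do.
import networkx as nx

G = nx.Graph()
G.add_weighted_edges_from([
    ("a", "s", 1.0),   # "s" is a Steiner node: it belongs to no group
    ("s", "b", 2.0),
    ("a", "b", 5.0),
])
groups = [{"a"}, {"b"}]   # disjoint groups N_1, N_2

def is_feasible(tree, groups):
    """A tree is feasible iff it contains at least one node per group."""
    return all(set(tree.nodes) & g for g in groups)

# Connecting the groups through the Steiner node costs 1.0 + 2.0 = 3.0,
# cheaper than the direct edge of cost 5.0.
T = G.edge_subgraph([("a", "s"), ("s", "b")])
assert nx.is_tree(T) and is_feasible(T, groups)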
VLSI Routing Estimation
 In the VLSI design flow, logical design is followed by physical-layout planning, which involves three steps:
partitioning, placement, and routing.
 Tens of thousands of non-overlapping nets may need to be routed simultaneously in large-scale circuit design.
 Conventional procedures for VLSI global routing estimation assume a one-to-one correspondence between
terminals and ports.
 In practice, however, each terminal consists of a large collection of electrically equivalent ports, a fact that
is not accounted for in layout steps such as wiring estimation. Each module can also be rotated or flipped,
giving eight possible locations for a given port.
Figure: Different wiring layouts using multiple equivalent ports
Figure: A module is rotated and flipped to induce a group of eight terminal positions
Figures: Placement of standard cells; after global routing; after detailed routing
 Routing:
Input: the terminal list and the locations of terminals and pins/ports
Output: a geometric layout of all nets
Objective: minimize the total wire length while completing all connections
without increasing chip area
Performance
 All experiments were run on the Blue Waters supercomputer, a Cray XE6/XK7 system.
 We used several VLSI WRP (Wire Routing Problem) instances from the datasets provided by [3], with
graph sizes ranging from 128 to 3,168 vertices.
 We ran a strong-scalability test, increasing the number of processors while keeping the problem
size constant. (Figure 5)
 We also compared our implementation with the best known sequential performance, reported by [4]. The
results show our implementation achieves up to a 302x speedup for a graph of 3,168 vertices.
(Figure 6)
 Table 1 shows the accuracy of our solutions compared with the optimum costs from [4].
Figure 5: Strong-scalability test (runtime in seconds vs. number of processes). Also shows a comparison with
our serial implementation, which took 5,852.4 s for a graph of 2,518 vertices.
Figure 6: Runtime comparison with [4] (runtime in seconds vs. number of vertices)
Specifications: The master process runs on a GPU-enabled XK7 compute
node, while all slave processes run on traditional XE6 compute nodes.
XK7: one AMD 6200-series Interlagos CPU, 23 GB host memory; one NVIDIA
GK110 β€œKepler” accelerator, 6 GB memory
XE6: two AMD 6276 Interlagos CPUs, 2.3 GHz, 16 cores, 64 GB physical memory
Discussion
 Our implementation outperforms existing serial implementations
while producing accurate solutions for WRP instances.
 The algorithm is adaptive: it refines the solution over several steps
and is highly dynamic, which makes the size of the solution hard to
predict before actually computing it. This causes some load imbalance.
 Due to the dynamic nature of the problem, our implementation
exhibits some irregular memory access patterns.
Future Directions
 Design more efficient work-distribution schemes to decrease the
load imbalance.
 Overlap communication with computation where possible.
 Improve the efficiency of memory access patterns.
Graph     Size    Terminals   Opt. cost [4]   Approx. cost   Error %
wrp3-11   128     11          1,100,361       1,100,427      0.006
wrp3-39   703     39          3,900,450       3,900,600      0.004
wrp3-96   2,518   96          96,001,172      96,003,009     0.002
wrp3-83   3,168   83          8,300,906       8,302,279      0.017
Table 1: Accuracy comparison
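For example, the error for wrp3-11 is (1,100,427 βˆ’ 1,100,361) / 1,100,361 Γ— 100 β‰ˆ 0.006%.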
References
[1] Helvig, C. S., Robins, G., and Zelikovsky, A. New approximation algorithms for routing
with multi-port terminals. IEEE Transactions on Computer-Aided Design of Integrated
Circuits and Systems 19(10), 1118–1128.
[2] Lund, B. D. and Smith, J. W. A multi-stage CUDA kernel for Floyd-Warshall. CoRR
abs/1001.4108 (2010).
[3] Koch, T., Martin, A., and Voß, S. SteinLib: An updated library on Steiner tree problems
in graphs. In Cheng, X. and Du, D.-Z., eds., Steiner Trees in Industry, Springer, Berlin,
2001, 285–326.
[4] Polzin, T. and Vahdati, S. The Steiner tree challenge: An updated study. 11th DIMACS
Implementation Challenge. Retrieved July 6, 2015, from Princeton University:
http://dimacs11.cs.princeton.edu/papers/PolzinVahdatiDIMACS.pdf
The Group Steiner Heuristic [1]
1. Replace the input graph 𝐺 = (𝑉, 𝐸) by its metric closure, i.e., a complete graph
on the vertices 𝑉 whose edge weights equal the shortest-path lengths. (Figure 1)
2. For every terminal vertex 𝑣, create a new node 𝑣′ and a new zero-cost edge
(𝑣, 𝑣′). 𝑣′ takes the role of 𝑣: 𝑣′ becomes a port and 𝑣 a non-port. (Figure 2)
3. Construct a rooted 1-star tree, i.e., a tree of depth 1 whose leaves are
terminal nodes, one from each group. (Figure 3)
4. Select intermediate nodes and determine the set of groups that should be
connected to each intermediate node to form a partial star. (Figure 4)
5. Combine the partial stars to obtain a 2-star tree, i.e., a tree of depth 2.
(Figure 4)
The 2-star should be rooted at the center of the optimal solution tree. The
center, however, cannot be determined ahead of time. Therefore, a Steiner 2-star
is constructed with each node 𝑣 of 𝐺 as the root π‘Ÿ, and the minimum-cost tree
is taken over all possible choices of π‘Ÿ.
Depth-Bounded Tree Approximation

Input: A graph 𝐺 = (𝑉, 𝐸), a family 𝑁 of π‘˜ disjoint groups 𝑁1, … , π‘π‘˜ βŠ† 𝑉, and a root π‘Ÿ ∈ 𝑉
π΄π‘π‘π‘Ÿπ‘œπ‘₯2(π‘Ÿ) ← {π‘Ÿ} /* add root to solution */
𝑁′ ← 𝑁 /* begin with all groups remaining */
WHILE 𝑁′ β‰  βˆ… DO
    FOR EACH 𝑣 ∈ 𝑉 DO
        Sort 𝑀 = 𝑁1, … , π‘π‘˜ such that π‘π‘œπ‘ π‘‘(𝑣, 𝑁𝑖)/π‘π‘œπ‘ π‘‘(π‘Ÿ, 𝑁𝑖) ≀ π‘π‘œπ‘ π‘‘(𝑣, 𝑁𝑖+1)/π‘π‘œπ‘ π‘‘(π‘Ÿ, 𝑁𝑖+1)
        Find 𝑗 ∈ {1, … , π‘˜} that minimizes
            π‘›π‘œπ‘Ÿπ‘š(𝑣) = (π‘π‘œπ‘ π‘‘(π‘Ÿ, 𝑣) + Σ_{i=1}^{j} π‘π‘œπ‘ π‘‘(𝑣, 𝑁𝑖)) / (Σ_{i=1}^{j} π‘π‘œπ‘ π‘‘(π‘Ÿ, 𝑁𝑖))
        𝑀(𝑣) ← 𝑁1, … , 𝑁𝑗 /* store the groups to be connected to 𝑣 */
    ENDFOR
    Find 𝑣_π‘šπ‘–π‘› with the minimum π‘›π‘œπ‘Ÿπ‘š(𝑣)
    𝑃 ← (π‘Ÿ, 𝑣_π‘šπ‘–π‘›, 𝑀(𝑣_π‘šπ‘–π‘›)) /* partial star with root π‘Ÿ, intermediate 𝑣_π‘šπ‘–π‘› */
    𝑁′ ← 𝑁′ βˆ’ π‘”π‘Ÿπ‘œπ‘’π‘π‘ (𝑃)
    π΄π‘π‘π‘Ÿπ‘œπ‘₯2(π‘Ÿ) ← π΄π‘π‘π‘Ÿπ‘œπ‘₯2(π‘Ÿ) βˆͺ 𝑃 /* add partial star to solution */
ENDWHILE
Output: A low-cost 2-star π΄π‘π‘π‘Ÿπ‘œπ‘₯2(π‘Ÿ) with root π‘Ÿ intersecting each group 𝑁𝑖
Figure 1: Construction of the metric closure
Figure 2: Transformation of the input graph
Figure 3: 1-star tree rooted at π‘Ÿ
Figure 4: 2-star tree made of three partial stars
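As a concrete companion to the pseudocode, the following Python sketch transcribes the partial-star loop under stated assumptions: the instance is a metric-closure distance matrix plus a list of group index arrays, cost(v, N) is the distance from v to the nearest member of N, and the root lies in no group (so cost(r, N_i) > 0). It is an illustrative transcription, not the authors' code.

import numpy as np

def approx2(Mc, groups, r):
    """Greedy 2-star construction rooted at r, transcribed from the
    pseudocode above (illustrative sketch, not the authors' code).

    Mc:     metric-closure distance matrix, shape (n, n)
    groups: list of disjoint arrays of node indices N_1, ..., N_k
    r:      root vertex index; assumed to belong to no group,
            so that cost(r, N_i) > 0 for every group
    Returns a list of partial stars (root, intermediate, groups served).
    """
    def cost(v, g):
        return Mc[v, g].min()   # distance to the nearest group member

    remaining = list(range(len(groups)))   # N' <- N
    stars = []
    while remaining:
        best_norm, best = np.inf, None
        for v in range(Mc.shape[0]):
            # Sort remaining groups by cost(v, N_i) / cost(r, N_i).
            order = sorted(remaining, key=lambda i: cost(v, groups[i])
                                                    / cost(r, groups[i]))
            # Prefix sums give norm(v) for every prefix length j at once.
            cv = np.cumsum([cost(v, groups[i]) for i in order])
            cr = np.cumsum([cost(r, groups[i]) for i in order])
            norms = (Mc[r, v] + cv) / cr
            j = int(np.argmin(norms))       # best prefix length j
            if norms[j] < best_norm:
                best_norm, best = norms[j], (v, order[:j + 1])
        v_min, served = best
        stars.append((r, v_min, served))    # partial star P
        remaining = [i for i in remaining if i not in served]
    return stars

A full heuristic run would call approx2 once per candidate root r and keep the cheapest resulting 2-star, as described in the text above.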
The time complexity of the Group Steiner Heuristic is 𝛰(𝛼 + |𝑉|Β² βˆ™ π‘˜Β² βˆ™ log π‘˜),
where π‘˜ is the number of groups and 𝛼 is the time it takes to compute all-pairs
shortest paths using the Floyd-Warshall algorithm.
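For reference, the 𝛼 term can be realized with the classic Floyd-Warshall recurrence D[i][j] = min(D[i][j], D[i][m] + D[m][j]). Below is a compact NumPy sketch of the metric closure with predecessors; it is a plain O(|V|Β³) version for illustration only, not the blocked GPU kernel of [2] used by the implementation.

import numpy as np

def metric_closure(W):
    """All-pairs shortest paths via Floyd-Warshall: a plain O(|V|^3)
    sketch for illustration (the poster uses a blocked GPU kernel [2]).

    W: (n, n) edge-weight matrix with 0 on the diagonal and np.inf
       where no edge exists.
    Returns (D, P): shortest-path distances and predecessors.
    """
    n = W.shape[0]
    D = W.copy()
    # P[i, j] = predecessor of j on the current best i -> j path.
    P = np.where(np.isfinite(W), np.arange(n)[:, None], -1)
    np.fill_diagonal(P, -1)
    for m in range(n):                       # allow m as an intermediate
        via = D[:, m:m + 1] + D[m:m + 1, :]  # lengths of i -> m -> j
        better = via < D
        D = np.where(better, via, D)
        P = np.where(better, P[m:m + 1, :], P)
    return D, P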
CUDA-Aware MPI-Based Approach
 Our implementation of the Group Steiner Heuristic [1] is
based on CUDA-Aware MPI.
 We parallelize the algorithm by exploiting the fact that a rooted
2-star tree must be constructed with each vertex 𝑣 of the graph
as a possible root.
 All such trees can be constructed independently on separate
processes.
 The metric closure is constructed by computing all-pairs shortest
paths with a blocked Floyd-Warshall algorithm on a GPU [2].
 There is no communication during 2-star tree construction
(the most time-consuming step), making our implementation
scalable.
 The flowchart below shows the communication pattern when we launch as many processes as there are
vertices in the graph. If fewer processes are launched, a round-robin work-distribution scheme is used
to assign roots (see the mpi4py sketch after the pseudocode below).
Two-star tree construction
IF master
    𝐺 ← read_graph(π‘“π‘–π‘™π‘’π‘›π‘Žπ‘šπ‘’)
    (𝑀𝑐, 𝑃) ← FLOYD_APSP_CUDA(𝐺) /* get metric closure and predecessor matrix */
ENDIF
BROADCAST(𝑀𝑐) /* from master */
π‘œπ‘›π‘’π‘ π‘‘π‘Žπ‘Ÿ_π‘Žπ‘™π‘™ ← [ ]
WHILE there is a remaining root π‘Ÿ
    π‘œπ‘›π‘’π‘ π‘‘π‘Žπ‘Ÿ ← build_onestar(𝑀𝑐, π‘Ÿ)
    GATHER(π‘œπ‘›π‘’π‘ π‘‘π‘Žπ‘Ÿ, π‘œπ‘›π‘’π‘ π‘‘π‘Žπ‘Ÿ_π‘Žπ‘™π‘™) /* to master */
ENDWHILE
BROADCAST(π‘œπ‘›π‘’π‘ π‘‘π‘Žπ‘Ÿ_π‘Žπ‘™π‘™) /* from master */
π‘”π‘™π‘œπ‘π‘Žπ‘™_π‘šπ‘–π‘› ← { }
π‘™π‘œπ‘π‘Žπ‘™_π‘šπ‘–π‘› ← { }
WHILE there is a remaining root π‘Ÿ
    π‘‘π‘€π‘œπ‘ π‘‘π‘Žπ‘Ÿ ← build_twostar(𝑀𝑐, π‘Ÿ, π‘œπ‘›π‘’π‘ π‘‘π‘Žπ‘Ÿ_π‘Žπ‘™π‘™)
    IF cost(π‘‘π‘€π‘œπ‘ π‘‘π‘Žπ‘Ÿ) < cost(π‘™π‘œπ‘π‘Žπ‘™_π‘šπ‘–π‘›)
        π‘™π‘œπ‘π‘Žπ‘™_π‘šπ‘–π‘› ← π‘‘π‘€π‘œπ‘ π‘‘π‘Žπ‘Ÿ
    ENDIF
ENDWHILE
REDUCE(π‘™π‘œπ‘π‘Žπ‘™_π‘šπ‘–π‘›, π‘”π‘™π‘œπ‘π‘Žπ‘™_π‘šπ‘–π‘›) /* to master */
IF master
    build_sol_graph(π‘”π‘™π‘œπ‘π‘Žπ‘™_π‘šπ‘–π‘›, 𝑃, 𝑀𝑐)
ENDIF
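To make the communication pattern concrete, here is a minimal, runnable mpi4py sketch of the same scheme on a toy instance. Everything in it is our illustrative assumption rather than the authors' implementation: the random graph, the 1-star standing in for the refined 2-star, and a gather-then-min in place of MPI_Reduce.

# Minimal mpi4py sketch of the scheme above on a toy instance
# (illustrative only; not the authors' implementation).
# Run with, e.g.:  mpiexec -n 4 python sketch.py
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
n, k = 64, 4                              # toy problem sizes

if rank == 0:
    rng = np.random.default_rng(0)        # toy symmetric weight matrix
    W = rng.uniform(1.0, 10.0, (n, n))
    W = np.minimum(W, W.T)
    np.fill_diagonal(W, 0.0)
    Mc = W.copy()                         # metric closure (GPU in the poster)
    for m in range(n):
        Mc = np.minimum(Mc, Mc[:, m:m + 1] + Mc[m:m + 1, :])
    groups = np.array_split(np.arange(n), k)   # toy disjoint groups
else:
    Mc, groups = None, None
Mc = comm.bcast(Mc, root=0)               # broadcast metric closure
groups = comm.bcast(groups, root=0)

def build_onestar(Mc, r, groups):
    """Depth-1 star: connect r to the cheapest terminal of each group."""
    return (sum(Mc[r, g].min() for g in groups), r)

my_roots = range(rank, n, size)           # round-robin root assignment
# Each process works on its roots independently (no communication).
# A 1-star stands in here for the refined 2-star of the real code.
local_min = min((build_onestar(Mc, r, groups) for r in my_roots),
                default=(np.inf, -1))

# The poster uses MPI_Reduce; a gather-then-min is a simple equivalent.
candidates = comm.gather(local_min, root=0)
if rank == 0:
    best_cost, best_root = min(candidates)
    print(f"best root {best_root}, cost {best_cost:.3f}")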
Figure: A partial star and its intermediate nodes
Figure: CUDA-Aware MPI-based approach. Flowchart: read input graph from file β†’
construct metric closure on a GPU β†’ broadcast metric closure (MPI_Bcast) β†’
each process constructs a 1-star for its assigned root β†’ collect 1-stars
(MPI_Gather) β†’ broadcast all 1-stars β†’ each process constructs a 2-star for
its assigned root β†’ find the global minimum-cost tree (MPI_Reduce) β†’ build
solution graph
Acknowledgment
This research was supported by:
 CUDA Teaching Center Program, NVIDIA Research
 Faculty Research Committee and Interdisciplinary Science Program, Trinity College
 Blue Waters Student Internship Program