Implementing 3D SPHARM Surfaces Registration on Cell B.E. Processor
1. Implementing 3D SPHARM Surfaces Registration on Cell Processor
Huian Li (huili@indiana.edu), Mi Yan (miyan@us.ibm.com)
Robert Henschel (rhensche@indiana.edu), Li Shen (shenli@iupui.edu)
July 29, 2009
5. SHREC
(a) template, (b) object, (c) after ICP, (d) after registration of parameterization
6. Calculation of coefficients
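• After rotating the parameter net on the surface by Euler angles (α, β, γ), the new coefficients are:

$$c_l^m(\alpha\beta\gamma) = \sum_{n=-l}^{l} D^l_{mn}(\alpha\beta\gamma)\, c_l^n$$

where

$$D^l_{mn}(\alpha\beta\gamma) = e^{-i(m\alpha + n\gamma)} \sum_{t=\max(0,\,n-m)}^{\min(l+n,\,l-m)} (-1)^t\, d^l_{mnt}(\beta)$$

and

$$d^l_{mnt}(\beta) = \frac{\sqrt{(l+n)!\,(l-n)!\,(l+m)!\,(l-m)!}}{(l+n-t)!\,(l-m-t)!\,(t+m-n)!\,t!} \left(\cos\frac{\beta}{2}\right)^{2l+n-m-2t} \left(\sin\frac{\beta}{2}\right)^{2t+m-n}$$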
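The rotation formula above can be sketched directly in Python. This is a minimal illustration and not the authors' code: the function names `wigner_d` and `rotate_coefficients` and the nested-list layout `coeffs[l][m + l]` are hypothetical, each coefficient is treated as a single complex number for brevity (a SPHARM model carries one such coefficient per coordinate), and the pairing of α with m and γ with n in the phase factor is an assumed convention. Exact factorials are used, so the sketch is only meant for small l.

```python
import cmath
import math


def wigner_d(l, m, n, beta):
    """Alternating sum over t of the terms d^l_{mnt}(beta)."""
    prefactor = math.sqrt(
        math.factorial(l + n) * math.factorial(l - n)
        * math.factorial(l + m) * math.factorial(l - m)
    )
    c, s = math.cos(beta / 2), math.sin(beta / 2)
    total = 0.0
    # Bounds keep every factorial argument below non-negative.
    for t in range(max(0, n - m), min(l + n, l - m) + 1):
        denom = (math.factorial(l + n - t) * math.factorial(l - m - t)
                 * math.factorial(t + m - n) * math.factorial(t))
        term = prefactor / denom
        term *= c ** (2 * l + n - m - 2 * t) * s ** (2 * t + m - n)
        total += (-1) ** t * term
    return total


def rotate_coefficients(coeffs, alpha, beta, gamma):
    """coeffs[l][m + l] holds c_l^m; returns rotated values in the same layout."""
    rotated = []
    for l, row in enumerate(coeffs):
        new_row = []
        for m in range(-l, l + 1):
            acc = 0j
            for n in range(-l, l + 1):
                # Assumed phase convention: alpha pairs with m, gamma with n.
                phase = cmath.exp(-1j * (m * alpha + n * gamma))
                acc += phase * wigner_d(l, m, n, beta) * row[n + l]
            new_row.append(acc)
        rotated.append(new_row)
    return rotated
```

With all three angles set to zero the sketch returns the input coefficients unchanged, which is a quick sanity check on the index bounds.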
7. RMSD
• RMSD (Root Mean Square Distance): distance between two SPHARM models

$$\mathrm{RMSD} = \sqrt{\frac{1}{4\pi} \sum_{l=0}^{L_{\max}} \sum_{m=-l}^{l} \left\| c_{1,l}^{m} - c_{2,l}^{m} \right\|^2}$$

where $c_{1,l}^{m}$ and $c_{2,l}^{m}$ are the coefficients of the two SPHARM models
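As a small worked example of the RMSD definition, the sketch below sums the squared differences over all (l, m) and all coordinates. The name `spharm_rmsd` and the data layout (`model[l][m + l]` is a tuple of three complex numbers, one per coordinate) are assumptions made for illustration.

```python
import math


def spharm_rmsd(model1, model2):
    """RMSD between two coefficient sets with identical (l, m) layout."""
    total = 0.0
    for row1, row2 in zip(model1, model2):      # one row per degree l
        for c1, c2 in zip(row1, row2):          # one entry per order m
            # Squared Euclidean norm of the complex 3-vector difference.
            total += sum(abs(a - b) ** 2 for a, b in zip(c1, c2))
    return math.sqrt(total / (4 * math.pi))
```

Two identical models give 0, and two models that differ by 1 in a single coordinate of a single coefficient give 1/sqrt(4π).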
8. Matlab implementation
• A straightforward implementation in Matlab:
for l = 0, Lmax
  for m = -l, l
    for n = -l, l
      for t = max(0, n-m), min(l+n, l-m)
        ... performing calculations ...
• One rotation for Lmax = 50 took 823 seconds on a 2 GHz quad-core Intel Xeon E5335
10. Cell implementation
• Domain decomposition:
for l = 0, Lmax
  for m = -l, l
    for n = -l, l
      for t = max(0, n-m), min(l+n, l-m)
        ... calculations ...
• Decomposition along l leads to workload imbalance among SPUs
• Decomposition along m creates unnecessary data communication
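The imbalance from splitting along l can be made concrete by counting innermost iterations per degree. The sketch below is illustrative only: `inner_iterations` and `split_by_l` are hypothetical helpers, and assigning contiguous blocks of l values to each SPE is an assumed partitioning, not necessarily the one the authors tried.

```python
def inner_iterations(l):
    """Number of innermost (t) iterations executed for one degree l."""
    count = 0
    for m in range(-l, l + 1):
        for n in range(-l, l + 1):
            # Length of the t range max(0, n-m) .. min(l+n, l-m), inclusive.
            count += min(l + n, l - m) - max(0, n - m) + 1
    return count


def split_by_l(l_max, num_spes):
    """Give each SPE a contiguous block of l values; return work per SPE."""
    work = [0] * num_spes
    for l in range(l_max + 1):
        spe = l * num_spes // (l_max + 1)
        work[spe] += inner_iterations(l)
    return work


if __name__ == "__main__":
    for spe, w in enumerate(split_by_l(50, 8)):
        print(f"SPE {spe}: {w} inner iterations")
```

Because the iteration count grows quickly with l, the SPE that holds the highest degrees ends up with far more work than the one that holds the lowest.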
11. Cell implementation
• Loop fusion:
for l = 0, Lmax
  for m = -l, l
    for n = -l, l
      for t = max(0, n-m), min(l+n, l-m)
        ... calculations ...
• Unique index for combined loop:
f(l, m) = l^2 + l + m
• Workload for each SPE:
(Lmax + 1)^2 / (total # of SPEs)
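A short sketch of the fused index may help: f(l, m) = l^2 + l + m enumerates all (l, m) pairs as 0, 1, ..., (Lmax + 1)^2 - 1, so the range can be cut into equal slices. The helper names `fuse`, `unfuse`, and `spe_range` are hypothetical, and the even slicing is an assumption about how the slices are assigned.

```python
import math


def fuse(l, m):
    """Map (l, m) with -l <= m <= l to a unique index in 0 .. (Lmax+1)^2 - 1."""
    return l * l + l + m


def unfuse(k):
    """Inverse of fuse: indices l^2 .. l^2 + 2l all belong to degree l."""
    l = math.isqrt(k)
    return l, k - l * l - l


def spe_range(spe, num_spes, l_max):
    """Half-open slice [start, stop) of fused indices handled by one SPE."""
    total = (l_max + 1) ** 2
    return spe * total // num_spes, (spe + 1) * total // num_spes


if __name__ == "__main__":
    start, stop = spe_range(0, 8, 50)
    print([unfuse(k) for k in range(start, min(stop, start + 5))])
```

Each SPE then loops over its slice, recovers (l, m) with `unfuse`, and runs the inner n and t loops for that pair.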
12. Cell implementation
• Lookup table T for the factorial, stored as logarithms: T(k) = log(k!)
• Transform exponentials & multiplications into multiplications & additions, respectively:

$$d^l_{mnt}(\beta) = \frac{\sqrt{(l+n)!\,(l-n)!\,(l+m)!\,(l-m)!}}{(l+n-t)!\,(l-m-t)!\,(t+m-n)!\,t!} \left(\cos\frac{\beta}{2}\right)^{2l+n-m-2t} \left(\sin\frac{\beta}{2}\right)^{2t+m-n}$$

$$= \exp\Big( \tfrac{1}{2}\big(T(l+n) + T(l-n) + T(l+m) + T(l-m)\big) - T(l+n-t) - T(l-m-t) - T(t+m-n) - T(t) + (2l+n-m-2t)\log\cos\frac{\beta}{2} + (2t+m-n)\log\sin\frac{\beta}{2} \Big)$$
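The log-domain form can be sketched as follows. The names `build_log_factorial_table` and `d_term_log` are hypothetical, the table is assumed to hold log(k!) for k up to 2·Lmax (the largest factorial argument, l + n, never exceeds 2l), and the sketch assumes 0 < β < π so that both logarithms are defined.

```python
import math


def build_log_factorial_table(k_max):
    """T[k] = log(k!), built incrementally so no large factorial is formed."""
    table = [0.0] * (k_max + 1)
    for k in range(1, k_max + 1):
        table[k] = table[k - 1] + math.log(k)
    return table


def d_term_log(T, l, m, n, t, beta):
    """One term d^l_{mnt}(beta) evaluated as exp of a sum of table lookups."""
    log_value = (
        0.5 * (T[l + n] + T[l - n] + T[l + m] + T[l - m])
        - T[l + n - t] - T[l - m - t] - T[t + m - n] - T[t]
        + (2 * l + n - m - 2 * t) * math.log(math.cos(beta / 2))
        + (2 * t + m - n) * math.log(math.sin(beta / 2))
    )
    return math.exp(log_value)


if __name__ == "__main__":
    T = build_log_factorial_table(100)   # enough for Lmax = 50
    print(d_term_log(T, 50, 10, -5, 3, 0.9))
```

Besides replacing powers and products with multiplications and additions, working with logarithms keeps intermediate values small even when the factorials themselves would be enormous.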
13. Cell implementation
• Other optimizations specific to Cell:
• Vectorization & data alignment
• DMA data transfer between main memory & local store
• SPU decrementer
17. Performance analysis
[Figure: Performance of one rotation on Cell BE. x-axis: Number of SPEs (1, 2, 4, 8, 16); y-axis: Time (seconds), 0 to 1.8.]
18. Performance analysis
[Figure: Performance of finding the shortest distance at Level 3 on Cell BE. x-axis: Number of SPEs (4, 8, 12, 16); y-axis: Time (seconds), 0 to 7000; series: GNU gcc, IBM xlc.]
19. Conclusion
• Performance increases dramatically on Cell due to
its unique architecture and algorithm optimization.
• Care must be taken with data placement because the local store is limited.
• Care must also be taken with data transfer between local store and main memory.