Knoxville aug02-full

SPIRAL: Current Status
Markus Püschel
Students
Faculty
• José Moura (CMU)
• Jeremy Johnson (Drexel)
• Robert Johnson (MathStar)
• David Padua (UIUC)
• Viktor Prasanna (USC)
• Markus Püschel (CMU)
• Manuela Veloso (CMU)

• Gavin Haentjens (CMU)
• Pinit Kumhom (Drexel)
• Neungsoo Park (USC)
• David Sepiashvili (CMU)
• Bryan Singer (CMU)
• Yevgen Voronenko (Drexel)
• Edward Wertz (CMU)
• Jianxin Xiong (UIUC)
Collaborators
• Christoph Überhuber (TU Vienna)
• Franz Franchetti (TU Vienna)

http://www.ece.cmu.edu/~spiral

Sponsor
Work supported by DARPA (DSO), Applied & Computational
Mathematics Program, OPAL, through grant managed by
research grant DABT63-98-1-0004 administered by the Army
Directorate of Contracting.

Moore’s Law

and High(est) Performance
Scientific Computing
(single processor, off-the-shelf)
Moore’s Law:

 processor-memory bottleneck
 short life cycles of computers
 very complex architectures
• vendor specific
• special instructions (MMX, SSE, FMA, …)
• undocumented features

Consequences for software/algorithms:
 arithmetic cost model not accurate for predicting runtime
 better performance models hard to get
 best code is machine dependent (registers/caches size, structure)
 hand-tuned code becomes obsolete as fast as it is written
 compiler limitations
 full performance requires (in part) assembly coding

Portable performance requires automation

SPIRAL
Automates
Implementation

 cuts development costs
 code less error-prone

Optimization

 systematic exploration of alternatives both at
algorithmic and code level

Platform-Adaptation

 takes advantage of architecture specific features
 porting without loss of performance

of DSP algorithms

 are performance critical

A library generator for highly optimized
signal processing algorithms

SPIRAL system

goes for a coffee

ian
tic
a

controls
algorithm generation

Formula Generator

fast algorithm
as SPL formula

ert
Exp m
SPL Compiler er
ram
rog
P

controls
implementation options

C/Fortran/SIMD
code

platform-adapted
implementation

Search Engine

hem
at
M

specifies

runtime on given platform

comes back

(or an espresso for small transforms)

SPIRAL

DSP transform

user

DSP Algorithms: Example 4-point DFT
Cooley/Tukey FFT (size 4):

1 1 1 1  1
1 i − 1 − i  0

=
1 − 1 1 − 1 1

 
1 − i − 1 i  0

Fourier transform

0 1 0  1
1 0 1  0

0 − 1 0  0

1 0 − 1 0

0
1
0
0

0
0
1
0

0 1 1
0 1 − 1

0  0 0

i  0 0

0 0  1
0 0  0

1 1  0

1 − 1 0

0
0
1
0

0
1
0
0

Diagonal matrix (twiddles)

DFT4 = ( DFT2 ⊗ I 2 ) ⋅ T24 ⋅ ( I 2 ⊗ DFT2 ) ⋅ L4
2
Kronecker product

Identity

Permutation

 product of structured sparse matrices
 mathematical notation

0
0

0

1

DSP Algorithms: Terminology
Transform

DFTn

Rule

DFTnm → ( DFTn ⊗ I m ) ⋅ D ⋅ ( I n ⊗ DFTm ) ⋅ P

parameterized matrix

• a breakdown strategy
• product of sparse matrices

DFT 8

Ruletree
DFT 2

DFT 4
DFT 2

Formula

DFT 2

• recursive application of rules
• uniquely defines an algorithm
• efficient representation
• easy manipulation

DFT8 = ( F2 ⊗ I 4 ) ⋅ D ⋅ ( I 2 ⊗ ( I 2 ⊗ F2 ) ) ⋅ P
• few constructs and primitives
• uniquely defines an algorithm
• can be translated into code

DSP Transforms
discrete Fourier transform
Walsh-Hadamard transform

DFTn = [ exp( 2kliπ / n ) ]
WHT2k = DFT2 ⊗ ⊗ DFT2
DCT ( II ) n = [ cos( k ( l + 1 / 2 )π / n ) ]

DCT ( IV ) n = [ cos( ( k + 1 / 2) (l + 1 / 2)π / n ) ]

discrete cosine and sine
Transforms (16 types)
modified discrete
cosine transform

DST ( I ) n = [ sin ( klπ / n ) ]
MDCTn×2 n = [ cos( ( k + ( n + 1) / 2)(l +1 / 2)π / n ) ]

two-dimensional transform

T ⊗T

Others: filters, discrete wavelet transforms, Haar, Hartley, …

Rules = Breakdown Strategies

(

)

DCT2( II ) → diag 1,1 / 2 ⋅ F2

(

)

DCTn( II ) → P ⋅ DCTn(/II ) ⊕ DCTn(/IV ) ⋅ ( I n / 2 ⊗ F2 )
2
2

base case
Q

recursive

DCTn( IV ) → S ⋅ DCTn( II ) ⋅ D

translation

DCTn( IV ) → M 1  M r
DFTn → B ⋅ ( DCTn(/I2) ⊕ DSTn(/I2) ) ⋅ C

iterative
recursive

DFTnm → ( DFTn ⊗ I m ) ⋅ D ⋅ ( I n ⊗ DFTm ) ⋅ P

recursive

Fn (h) → ( I n / d ⊗ k I d + k ) ⋅ ( I n / d ⊗ Fd (h))

recursive

Fn (h) → Circ (h ) ⋅ E

recursive

DWTn (W ) → ( DWTn / 2 (W ) ⊕ I n / 2 ) ⋅ P ⋅ ( I n / 2 ⊗ k W ) ⋅ E

recursive

WHT2n → ∏ ( I 2n1+K+ni −1 ⊗ WHT2ni ⊗ I 2ni +1+K+nt )

iterative/
recursive

MDCTn×2 n → S ⋅ DCTn( IV ) ⋅ P

translation

n

i =1

DSP Transform

Algorithm (Formula)

Implementation

Formulas in SPL
••••
( compose
( diagonal ( 2*cos(1/16*pi) 2*cos(3/16*pi) 2*cos(5/16*pi) 2*cos(7/16*pi) ) )
( permutation ( 1 3 4 2 ) )
( tensor
( I 2 )
( F 2 )
)
( permutation ( 1 4 2 3 ) )
( direct_sum
( compose
( F 2 )
( diagonal ( 1 sqrt(1/2) ) )
)
( compose
( matrix
( 1 1 0 )
( 0 (-1) 1 )
)
( diagonal ( cos(13/8*pi)-sin(13/8*pi) sin(13/8*pi) cos(13/8*pi)+sin(13/8*pi) ) )
( matrix
( 1 0 )
( 1 1 )
( 0 1 )
)
( permutation ( 2 1 ) )

••••

SPL Syntax (Subset)
 matrix operations:
(compose formula formula ...)
(tensor formula formula ...)
(direct_sum formula formula ...)
 direct matrix description:
(matrix (a11 a12 ...) (a21 a22 ...) ...)
(diagonal (d1 d2 ...))
(permutation (p1 p2 ...))
 parameterized matrices:
(I n)
(F n)
 scalars:
1.5, 2/7, cos(..), w(3), pi, 1.2e-04
allows extension of SPL
 definition of new symbols:
(define name formula)
(template formula (i-code-list)
 directives for code generation
#codetype real/complex
controls loop unrolling
#unroll on/off

SPL Compiler, 4-point FFT
(compose (tensor (F 2) (I 2)) (T 4 2)
(tensor (I 2) (F 2)) (L 4 2))

fast algorithm
as
formula
as
SPL program

#codetype
complex
f0
f1
f2
f3
f4
y(1)
y(2)
y(3)
y(4)

=
=
=
=
=
=
=
=
=

x(1) + x(3)
x(1) - x(3)
x(2) + x(4)
x(2) - x(4)
(0.00d0,-1.00d0)*f(3)
f0 + f2
f0 - f2
f1 + f4
f1 - f4

real
r0
r1
r2
r3
r4
r5
r6
r7
y(1)
y(2)
y(3)
y(4)
y(5)
y(6)
y(7)
y(8)

=
=
=
=
=
=
=
=
=
=
=
=
=
=
=
=

x(1)
x(1)
x(2)
x(2)
x(3)
x(3)
x(4)
x(4)
r0 +
r1 +
r0 r1 r2 +
r3 r2 r3 +

+ x(5)
- x(5)
+ x(6)
- x(6)
+ x(7)
- x(7)
+ x(8)
- x(8)
r4
r5
r4
r5
r7
r6
r7
r6

SPL Compiler: Summary
SPL Program
SPL Formula

Symbol Definition Template Definition

Parsing
Abstract Syntax Tree

Symbol Table

Template Table

Intermediate Code Generation
I-Code

Intermediate Code Restructuring
I-Code

Optimization
I-Code

Target Code Generation
C, FORTRAN function

Built-in optimizations:
 single static assignment code
 no reuse of temporary vars
 only scalar temporary vars
 constants precomputed
 limited CSE
Extensible through templates

DSP Transform

Algorithm (Formula)

Implementation

Search

Why Search?

DCT, type IV, size 16

~31000 formulas

• maaaany different formulas
• large spread in runtimes, even for modest size
• not due to arithmetic cost
• best formula is platform-dependent

Search Methods available in SPIRAL






Exhaustive Search
Dynamic Programming (DP)
Random Search
Hill Climbing
STEER (similar to a genetic algorithm)
Possible
Sizes
Exhaust Very small

Formulas
Timed

Results

All

Best

DP

All

10s-100s

(very) good

Random

All

User decided

fair/good

Hill Climbing

All

100s-1000s

Good

STEER

All

100s-1000s

(very) good

Search over
• algorithm space and
• implementation options (degree of unrolling)

STEER
Population n:

Mutation

……

Cross-Breeding

expand
differently

Population n+1:
swap
expansions

……

Survival of Fittest

Experimental Results (C code)
search methods
(applicable to all transforms)

high performance code
(compared with FFTW)

different transforms

generated high quality code

SPIRAL System
 Available for download (v3.1): www.ece.cmu.edu/~spiral
 Easy installation (Unix: configure/make; Windows: install shield)
 Unix/Linux and Windows 98/ME/NT/2000/XP
 Current transforms: DFT, DHT, WHT, RHT, DCT/DST type I – IV,
MDCT, Filters, Wavelets, Toeplitz, Circulants
 Extensible
 New version (4.0) in preparation

Learning to Generate Fast Algorithms
• Learns from given dataset (formulas+runtimes) how to design a
fast algorithm (breakdown strategy)
• Learns from a transform of one size, generates the best algorithm
for many sizes
• Tested for DFT and WHT

SIMD Short Vector Extensions

vector length = 4
(4-way)
x

+

 Extension to instruction set architecture
 Available on most current architectures
(SSE on Pentium, AltiVec on Motorola G4)
 Originally for multimedia (like MMX for integers)
 Requires fine grain parallelism
 Large potential speed-up

Problems:
 SIMD instructions are architecture specific
 No common API (usually assembly hand coding)
 Performance very sensitive to memory access
 Automatic vectorization very limited

very difficult to use

Vector code generation from SPL formulas
Naturally vectorizable construct

A ⊗ I4

A

vector length

x

y

(Current) generic construct completely vectorizable:
k

∏ P D ( A ⊗ Iυ ) E Q
i

i

i

i =1

i

i

Pi, Qi
Di, Ei
Ai
ν

permutations
diagonals
arbitrary formulas
SIMD vector length

Vectorization in two steps:
1. Formula manipulation using manipulation rules
2. Code generation (vector code + C code)

Generated Vector Code DFTs: Pentium 4
7

6

Spiral SSE

gflops

5

Intel MKL interl.
4

FFTW 2.1.3
Spiral C and F95

3

Spiral F95 vect
SIMD-FFT

2

1

0
4

5

6

7

8

9

10

11

12

13

14

n
DFT 2^n, Pentium 4, 2.53 GHz, using Intel C compiler 6.0

 speedups (to C code) up to factor of 3.3
 beats hand-tuned vendor library

Generated Vector Code, Other Transforms
WHT
normalized runtime

normalized runtime

2-dim DCT

transform size

transform size

speedups up to factor of 2.5

Flexible Loop Interleaving (Runtime Gain WHT)

Athlon XP: up to 55%

Alpha 21264: up to 70%

Pentium 4: up to 45%

UltraSparc III: up to 60%

Parallel Code Generation: Example WHT
10

PowerPC RS64 III

Speedup

8
6
4

1 thread
8 threads
10 threads

2
0
1

6

11

16

WHT size log(N)

21

26
Parallelized constructs:
In ⊗ A, A ⊗ In

Code Scheduling
Runtime histograms

DFT, size 16 ~6500 formulas

DCT4, size 16 ~16000 formulas

unscheduled
scheduled

Filters and Wavelets
New constructs: row/column overlapped tensor product

A

A

A

I n ⊗k A =

A

I n ⊗k A =

A

A

A
Examples for rules:

Fn (h) → ( I n / d ⊗ k I d + k ) ⋅ ( I n / d ⊗ Fd (h))

DWTn (W ) → ( DWTn / 2 (W ) ⊕ I n / 2 ) ⋅ P ⋅ ( I n / 2 ⊗ k W ) ⋅ E

A

Conclusions


Automatic code generation for the entire domain of
(linear) DSP algorithms



Portable high performance across platforms and
across time



Integration of math (high) level and implementation
(low) level



Intelligence through search and learning in the
space of alternatives

Future Plans


Transforms: Radon, Gabor, Hankel, structured
matrices, …



Target platforms: parallel platforms, DSP
processors, SW/HW architectures, FPGAs, ASICs



Instructions: Vector, FMAs, prefetching, OpenMP,
MPI



Beyond transforms: entire DSP applications



Other domains amenable to SPIRAL approach?

Questions?
http://www.ece.cmu.edu/~spiral

Generating Parallel Programs



Interpret constructs such as In ⊗ A as parallel operations and
transform formulas to obtain maximal parallelism.



Explore alternative data access patterns mathematically (e.g.
different permutations in matrix factorizations)



Prototype implementation using WHT

 Build on existing sequential package
 SMP implementation using OpenMP (IPDPS’02)

 90% efficiency obtained on 12 processor PowerPC RS64 III
 Distributed memory implementation using MPI (POHLL’02)

Comparison of Parallel DDL Schemes

10

Speedup

PowerPC RS64 III
10 threads
8
6
4
2
0
1

6

11

16

21

WHT size log(N)

26

Best sequential with
DDL
Parallel without DDL
Coarse-grained DDL
Fine-grained DDL
with ID shift

Overall Parallel Speedup

10

PowerPC RS64 III

6

Speedup

8

1 thread
8 threads
10 threads

4
2
0
1

6

11

16

WHT size log(N)

21

26

Performance of Digit Permutations on CRAY T3E

Distribution of Global Tensor Permutations on 128 Processors (shmem put)
1200

Number of tensor permutations

1000

800

600

400

200

0

0

50

100

150
200
250
Bandwidth in MB/sec/node

300

350

Architecture Framework

P4 ( I 8 ⊗ F2 )T4 P4− 1 P3 ( I 8 ⊗ F2 )T3 P3− 1 P2 ( I 8 ⊗ F2 )T2 P2− 1 P1 ( I 8 ⊗ F2 )T1P1− 1 Pd

Host

I/O Interface

Parameters: Dimension,
Pd
I/O Interface

M = Memory
AG = Address Generator
CU = Computation Unit

Interconnection Network

Parameters: Pi, 1 ≤ i ≤ n,
no. of processor

AG
CU AG

CU AG

M

M

CU AG

... M

CU AG

M

Parameters: Dimension,
and Ti, 1 ≤ i ≤ n, no. of
processor 2m

CU

Pease Algorithm Dataflow

L16 ( I 8 ⊗ F2 )T3 L16 L16 ( I 8 ⊗ F2 )T2 L16 L16 ( I 8 ⊗ F2 )T1 L16 ( I 8 ⊗ F2 ) Pd
2
8
4
4
8
2
Stage 0

Stage 1

Stage 2

L16
2

Stage 3
0

0

1

2

2

4

3

6

4

8

5

10

6

12

7

14

8

1

9

3

10

5

11

7

12

9

13

11

14

13

15

15

Optimal Dataflow

L16 ( L82 ⊗ I 2 )( I 8 ⊗ F2 )T3 ( L84 ⊗ I 2 ) L16 L16 ( L82 ⊗ I 2 )( I 8 ⊗ F2 )T2 ( L84 ⊗ I 2 ) L16 L16 ( I 8 ⊗ F2 )T1 L16 ( L84 ⊗ I 2 )( I 8 ⊗ F2 )( L82 ⊗ I 2 ) Pd
2
8
4
4
8
2

Stage 0

Stage 1

Stage 2

Stage 3

Pease v.s. Optimal

Performance comparison of Pease algorithm and
optimal algorithm

no. of clock cycles

12000
10000
8000
Pease

6000

Optimal

4000
2000
0
5

6

7

8

FFT size

Stage
1
2
3
4
5
6

Local
Access
32
32
64
128
64
32

Pease
Remote
Local
Access Peformance Access
96
3900
128
96
3860
128
64
3900
128
0
640
128
64
3840
64
96
3880
64

Optimal
Remote
Access Performance
0
640
0
640
0
640
0
640
64
2560
64
2560

Performance: Speedup

Speedup

Speedup(4*N*log(N)/No. of
Clocks)

4
3.5
3
2.5
2
1.5
1
0.5
0
5

6

7

8

9

10

11

12

13

14

15

16

Address Size (Bits)

Speedup = S/C
C = no. of. clock cycles using 4-processor FFT engine
S = minimum no. of clock cycles using single processor
(4*2n*n,
2n is the number of points)

SPIRAL Approach
given

DSP Transform

given

Possible
Algorithms
Possible
Implementations
Performance
Evaluation

Intelligent Search

SPIRAL Search Space

(DFT, DCT, Wavelets etc.)

Computing Platform
(Pentium III, Pentium 4, Athlon,
SUN, PowerPC, Alpha, … )

adapted
implementation

Classical Code Generation System
given

DSP Transform
(DFT, DCT, Wavelets etc.)

Math/Algorithm
Expert
Expert
Programmer
Performance
Evaluation

given

Computing Platform
(Pentium III, Pentium 4, Athlon,
SUN, PowerPC, Alpha, … )

adapted
implementation

Algorithms = Ruletrees = Formulas
DCT8( II )

DCTn( II ) → P ⋅ ( DCTn(/II ) ⊕ DCTn(/IV ) ) ⋅ ( F2 ⊗ I n / 2 )
2
2

R1

DCT4( IV )

DCT4( II )
R1

DCT2( II )

DCT4( II )

DCT2( IV )

R3

F2

DCTn( IV ) → P ⋅ DCTn( II ) ⋅ S

R6

R6
( II )
2

R1

DST
R4

DCT2( II )

F2

R3

DCT2( II ) →

1
⋅ F2
2

DCT2( IV )

F2

R6

DST2( II )
R4

F2

Mathematical Framework: Summary
 fast algorithms represented as ruletrees (easy generation/manipulation)
and as formulas (can be translated into code)
 formulas built from few constructs and primitives
 many different algorithms/formulas generated from few rules
(combinatorial explosion)
 these algorithms are (essentially) equal in arithmetic cost,
but differ in data flow

Formula Generation

data base (extensible!)
data type

Formula Generator

rules

recursive
application

transforms

control search engine

ruletrees

formulas

runtime

export

formula
translation
(spl compiler)

translation

 written in GAP/AREP (computer algebra system)
 all computation/manipulation is symbolic
 exact arithmetic
 easy extensible rule and transform data base
 verification of rules and formulas

cut here for other
optimization problems

Number of Formulas/Algorithms
k

# DFT, size 2^k

# DCT-IV, size 2^k

1
2
3
4
5
6
7
8
9

1
6
40
296
27744
162570361280
~1.01 • 10^27
~2.31 • 10^61
~2.86 • 10^133

1
10
126
31242
1924443362
7343815121631354242
~1.07 • 10^38
~2.30 • 10^76
~1.06 • 10^153

 differ in data flow not in arithmetic cost
 exponential search space

Extensibility of SPIRAL
New transforms are readily included on the high level
(easy, due to SPIRAL’s framework)
New constructs and primitives (potentially required by radically
different transforms) are readily included in SPL
(moderate effort, due to template mechanism)
New instructions sets available (e.g., SSE) are included by
extending the SPL compiler
(doable one time effort)

Generated Vector Code DFTs: Pentium III
1.8

1.6

1.4

gflops

1.2

Intel MKL
Spiral SSE
Spiral C/F95 vect
Spiral C/F95
FFTW 2.1.3

1

0.8

0.6

0.4

0.2

0
4

5

6

7

8

9

10

11

12

13

n
DFT 2^n, Pentium III, 1 GHz, using Intel C compiler 6.0

speedups (to C code) up to factor of 2.5

Knoxville aug02-full

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Viewers also liked

Viewers also liked (20)

Similar to Knoxville aug02-full

Similar to Knoxville aug02-full (20)

Recently uploaded

Recently uploaded (20)

Knoxville aug02-full

Editor's Notes