Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Knoxville aug02-full
1. SPIRAL: Current Status
Markus Püschel
Students
Faculty
• José Moura (CMU)
• Jeremy Johnson (Drexel)
• Robert Johnson (MathStar)
• David Padua (UIUC)
• Viktor Prasanna (USC)
• Markus Püschel (CMU)
• Manuela Veloso (CMU)
• Gavin Haentjens (CMU)
• Pinit Kumhom (Drexel)
• Neungsoo Park (USC)
• David Sepiashvili (CMU)
• Bryan Singer (CMU)
• Yevgen Voronenko (Drexel)
• Edward Wertz (CMU)
• Jianxin Xiong (UIUC)
Collaborators
• Christoph Überhuber (TU Vienna)
• Franz Franchetti (TU Vienna)
http://www.ece.cmu.edu/~spiral
2. Sponsor
Work supported by DARPA (DSO), Applied & Computational
Mathematics Program, OPAL, through grant managed by
research grant DABT63-98-1-0004 administered by the Army
Directorate of Contracting.
3. Moore’s Law
and High(est) Performance
Scientific Computing
(single processor, off-the-shelf)
Moore’s Law:
processor-memory bottleneck
short life cycles of computers
very complex architectures
• vendor specific
• special instructions (MMX, SSE, FMA, …)
• undocumented features
Consequences for software/algorithms:
arithmetic cost model not accurate for predicting runtime
better performance models hard to get
best code is machine dependent (registers/caches size, structure)
hand-tuned code becomes obsolete as fast as it is written
compiler limitations
full performance requires (in part) assembly coding
Portable performance requires automation
4. SPIRAL
Automates
Implementation
cuts development costs
code less error-prone
Optimization
systematic exploration of alternatives both at
algorithmic and code level
Platform-Adaptation
takes advantage of architecture specific features
porting without loss of performance
of DSP algorithms
are performance critical
A library generator for highly optimized
signal processing algorithms
5. SPIRAL system
goes for a coffee
ian
tic
a
controls
algorithm generation
Formula Generator
fast algorithm
as SPL formula
ert
Exp m
SPL Compiler er
ram
rog
P
controls
implementation options
C/Fortran/SIMD
code
platform-adapted
implementation
Search Engine
hem
at
M
specifies
runtime on given platform
comes back
(or an espresso for small transforms)
SPIRAL
DSP transform
user
8. DSP Algorithms: Terminology
Transform
DFTn
Rule
DFTnm → ( DFTn ⊗ I m ) ⋅ D ⋅ ( I n ⊗ DFTm ) ⋅ P
parameterized matrix
• a breakdown strategy
• product of sparse matrices
DFT 8
Ruletree
DFT 2
DFT 4
DFT 2
Formula
DFT 2
• recursive application of rules
• uniquely defines an algorithm
• efficient representation
• easy manipulation
DFT8 = ( F2 ⊗ I 4 ) ⋅ D ⋅ ( I 2 ⊗ ( I 2 ⊗ F2 ) ) ⋅ P
• few constructs and primitives
• uniquely defines an algorithm
• can be translated into code
9. DSP Transforms
discrete Fourier transform
Walsh-Hadamard transform
DFTn = [ exp( 2kliπ / n ) ]
WHT2k = DFT2 ⊗ ⊗ DFT2
DCT ( II ) n = [ cos( k ( l + 1 / 2 )π / n ) ]
DCT ( IV ) n = [ cos( ( k + 1 / 2) (l + 1 / 2)π / n ) ]
discrete cosine and sine
Transforms (16 types)
modified discrete
cosine transform
DST ( I ) n = [ sin ( klπ / n ) ]
MDCTn×2 n = [ cos( ( k + ( n + 1) / 2)(l +1 / 2)π / n ) ]
two-dimensional transform
T ⊗T
Others: filters, discrete wavelet transforms, Haar, Hartley, …
10. Rules = Breakdown Strategies
(
)
DCT2( II ) → diag 1,1 / 2 ⋅ F2
(
)
DCTn( II ) → P ⋅ DCTn(/II ) ⊕ DCTn(/IV ) ⋅ ( I n / 2 ⊗ F2 )
2
2
base case
Q
recursive
DCTn( IV ) → S ⋅ DCTn( II ) ⋅ D
translation
DCTn( IV ) → M 1 M r
DFTn → B ⋅ ( DCTn(/I2) ⊕ DSTn(/I2) ) ⋅ C
iterative
recursive
DFTnm → ( DFTn ⊗ I m ) ⋅ D ⋅ ( I n ⊗ DFTm ) ⋅ P
recursive
Fn (h) → ( I n / d ⊗ k I d + k ) ⋅ ( I n / d ⊗ Fd (h))
recursive
Fn (h) → Circ (h ) ⋅ E
recursive
DWTn (W ) → ( DWTn / 2 (W ) ⊕ I n / 2 ) ⋅ P ⋅ ( I n / 2 ⊗ k W ) ⋅ E
recursive
WHT2n → ∏ ( I 2n1+K+ni −1 ⊗ WHT2ni ⊗ I 2ni +1+K+nt )
iterative/
recursive
MDCTn×2 n → S ⋅ DCTn( IV ) ⋅ P
translation
n
i =1
16. SPL Compiler: Summary
SPL Program
SPL Formula
Symbol Definition Template Definition
Parsing
Abstract Syntax Tree
Symbol Table
Template Table
Intermediate Code Generation
I-Code
Intermediate Code Restructuring
I-Code
Optimization
I-Code
Target Code Generation
C, FORTRAN function
Built-in optimizations:
single static assignment code
no reuse of temporary vars
only scalar temporary vars
constants precomputed
limited CSE
Extensible through templates
18. Why Search?
DCT, type IV, size 16
~31000 formulas
• maaaany different formulas
• large spread in runtimes, even for modest size
• not due to arithmetic cost
• best formula is platform-dependent
19. Search Methods available in SPIRAL
Exhaustive Search
Dynamic Programming (DP)
Random Search
Hill Climbing
STEER (similar to a genetic algorithm)
Possible
Sizes
Exhaust Very small
Formulas
Timed
Results
All
Best
DP
All
10s-100s
(very) good
Random
All
User decided
fair/good
Hill Climbing
All
100s-1000s
Good
STEER
All
100s-1000s
(very) good
Search over
• algorithm space and
• implementation options (degree of unrolling)
21. Experimental Results (C code)
search methods
(applicable to all transforms)
high performance code
(compared with FFTW)
different transforms
generated high quality code
22. SPIRAL System
Available for download (v3.1): www.ece.cmu.edu/~spiral
Easy installation (Unix: configure/make; Windows: install shield)
Unix/Linux and Windows 98/ME/NT/2000/XP
Current transforms: DFT, DHT, WHT, RHT, DCT/DST type I – IV,
MDCT, Filters, Wavelets, Toeplitz, Circulants
Extensible
New version (4.0) in preparation
24. Learning to Generate Fast Algorithms
• Learns from given dataset (formulas+runtimes) how to design a
fast algorithm (breakdown strategy)
• Learns from a transform of one size, generates the best algorithm
for many sizes
• Tested for DFT and WHT
25. SIMD Short Vector Extensions
vector length = 4
(4-way)
x
+
Extension to instruction set architecture
Available on most current architectures
(SSE on Pentium, AltiVec on Motorola G4)
Originally for multimedia (like MMX for integers)
Requires fine grain parallelism
Large potential speed-up
Problems:
SIMD instructions are architecture specific
No common API (usually assembly hand coding)
Performance very sensitive to memory access
Automatic vectorization very limited
very difficult to use
26. Vector code generation from SPL formulas
Naturally vectorizable construct
A ⊗ I4
A
vector length
x
y
(Current) generic construct completely vectorizable:
k
∏ P D ( A ⊗ Iυ ) E Q
i
i
i
i =1
i
i
Pi, Qi
Di, Ei
Ai
ν
permutations
diagonals
arbitrary formulas
SIMD vector length
Vectorization in two steps:
1. Formula manipulation using manipulation rules
2. Code generation (vector code + C code)
27. Generated Vector Code DFTs: Pentium 4
7
6
Spiral SSE
gflops
5
Intel MKL interl.
4
FFTW 2.1.3
Spiral C and F95
3
Spiral F95 vect
SIMD-FFT
2
1
0
4
5
6
7
8
9
10
11
12
13
14
n
DFT 2^n, Pentium 4, 2.53 GHz, using Intel C compiler 6.0
speedups (to C code) up to factor of 3.3
beats hand-tuned vendor library
28. Generated Vector Code, Other Transforms
WHT
normalized runtime
normalized runtime
2-dim DCT
transform size
transform size
speedups up to factor of 2.5
29. Flexible Loop Interleaving (Runtime Gain WHT)
Athlon XP: up to 55%
Alpha 21264: up to 70%
Pentium 4: up to 45%
UltraSparc III: up to 60%
30. Parallel Code Generation: Example WHT
10
PowerPC RS64 III
Speedup
8
6
4
1 thread
8 threads
10 threads
2
0
1
6
11
16
WHT size log(N)
21
26
Parallelized constructs:
In ⊗ A, A ⊗ In
32. Filters and Wavelets
New constructs: row/column overlapped tensor product
A
A
A
I n ⊗k A =
A
I n ⊗k A =
A
A
A
Examples for rules:
Fn (h) → ( I n / d ⊗ k I d + k ) ⋅ ( I n / d ⊗ Fd (h))
DWTn (W ) → ( DWTn / 2 (W ) ⊕ I n / 2 ) ⋅ P ⋅ ( I n / 2 ⊗ k W ) ⋅ E
A
33. Conclusions
Automatic code generation for the entire domain of
(linear) DSP algorithms
Portable high performance across platforms and
across time
Integration of math (high) level and implementation
(low) level
Intelligence through search and learning in the
space of alternatives
37. Generating Parallel Programs
Interpret constructs such as In ⊗ A as parallel operations and
transform formulas to obtain maximal parallelism.
Explore alternative data access patterns mathematically (e.g.
different permutations in matrix factorizations)
Prototype implementation using WHT
Build on existing sequential package
SMP implementation using OpenMP (IPDPS’02)
90% efficiency obtained on 12 processor PowerPC RS64 III
Distributed memory implementation using MPI (POHLL’02)
38. Comparison of Parallel DDL Schemes
10
Speedup
PowerPC RS64 III
10 threads
8
6
4
2
0
1
6
11
16
21
WHT size log(N)
26
Best sequential with
DDL
Parallel without DDL
Coarse-grained DDL
Fine-grained DDL
with ID shift
40. Performance of Digit Permutations on CRAY T3E
Distribution of Global Tensor Permutations on 128 Processors (shmem put)
1200
Number of tensor permutations
1000
800
600
400
200
0
0
50
100
150
200
250
Bandwidth in MB/sec/node
300
350
41. Architecture Framework
P4 ( I 8 ⊗ F2 )T4 P4− 1 P3 ( I 8 ⊗ F2 )T3 P3− 1 P2 ( I 8 ⊗ F2 )T2 P2− 1 P1 ( I 8 ⊗ F2 )T1P1− 1 Pd
Host
I/O Interface
Parameters: Dimension,
Pd
I/O Interface
M = Memory
AG = Address Generator
CU = Computation Unit
Interconnection Network
Parameters: Pi, 1 ≤ i ≤ n,
no. of processor
AG
CU AG
CU AG
M
M
CU AG
... M
CU AG
M
Parameters: Dimension,
and Ti, 1 ≤ i ≤ n, no. of
processor 2m
CU
47. Classical Code Generation System
given
DSP Transform
(DFT, DCT, Wavelets etc.)
Math/Algorithm
Expert
Expert
Programmer
Performance
Evaluation
given
Computing Platform
(Pentium III, Pentium 4, Athlon,
SUN, PowerPC, Alpha, … )
adapted
implementation
48. Algorithms = Ruletrees = Formulas
DCT8( II )
DCTn( II ) → P ⋅ ( DCTn(/II ) ⊕ DCTn(/IV ) ) ⋅ ( F2 ⊗ I n / 2 )
2
2
R1
DCT4( IV )
DCT4( II )
R1
DCT2( II )
DCT4( II )
DCT2( IV )
R3
F2
DCTn( IV ) → P ⋅ DCTn( II ) ⋅ S
R6
R6
( II )
2
R1
DST
R4
DCT2( II )
F2
R3
DCT2( II ) →
1
⋅ F2
2
DCT2( IV )
F2
R6
DST2( II )
R4
F2
49. Mathematical Framework: Summary
fast algorithms represented as ruletrees (easy generation/manipulation)
and as formulas (can be translated into code)
formulas built from few constructs and primitives
many different algorithms/formulas generated from few rules
(combinatorial explosion)
these algorithms are (essentially) equal in arithmetic cost,
but differ in data flow
50. Formula Generation
data base (extensible!)
data type
Formula Generator
rules
recursive
application
transforms
control search engine
ruletrees
formulas
runtime
export
formula
translation
(spl compiler)
translation
written in GAP/AREP (computer algebra system)
all computation/manipulation is symbolic
exact arithmetic
easy extensible rule and transform data base
verification of rules and formulas
cut here for other
optimization problems
51. Number of Formulas/Algorithms
k
# DFT, size 2^k
# DCT-IV, size 2^k
1
2
3
4
5
6
7
8
9
1
6
40
296
27744
162570361280
~1.01 • 10^27
~2.31 • 10^61
~2.86 • 10^133
1
10
126
31242
1924443362
7343815121631354242
~1.07 • 10^38
~2.30 • 10^76
~1.06 • 10^153
differ in data flow not in arithmetic cost
exponential search space
52. Extensibility of SPIRAL
New transforms are readily included on the high level
(easy, due to SPIRAL’s framework)
New constructs and primitives (potentially required by radically
different transforms) are readily included in SPL
(moderate effort, due to template mechanism)
New instructions sets available (e.g., SSE) are included by
extending the SPL compiler
(doable one time effort)
53. Generated Vector Code DFTs: Pentium III
1.8
1.6
1.4
gflops
1.2
Intel MKL
Spiral SSE
Spiral C/F95 vect
Spiral C/F95
FFTW 2.1.3
1
0.8
0.6
0.4
0.2
0
4
5
6
7
8
9
10
11
12
13
n
DFT 2^n, Pentium III, 1 GHz, using Intel C compiler 6.0
speedups (to C code) up to factor of 2.5
Editor's Notes
This graph shows the parallel speedup of the WHT transform with 10 threads at different data sizes. The effect of granularity on parallel pseudo transpose is obvious. The fine-grained is better than the coarse-grained. It is different from the transform part.
The yellow line shows the speedup of work-sharing OpenMP. There is only 3 times speedup even though 10 threads are used. This indicates, although OpenMP provides the convenience to parallelize iteration automatically, the efficiency could be very low depending on the nature of the problem.
The dataflow of the conjugating form of the Pease Algorithm.