Knoxville aug02-full

165 views

Published on

Published in: Technology, Education
0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total views
165
On SlideShare
0
From Embeds
0
Number of Embeds
2
Actions
Shares
0
Downloads
2
Comments
0
Likes
0
Embeds 0
No embeds

No notes for slide
  • This graph shows the parallel speedup of the WHT transform with 10 threads at different data sizes. The effect of granularity on parallel pseudo transpose is obvious. The fine-grained is better than the coarse-grained. It is different from the transform part.
    The yellow line shows the speedup of work-sharing OpenMP. There is only 3 times speedup even though 10 threads are used. This indicates, although OpenMP provides the convenience to parallelize iteration automatically, the efficiency could be very low depending on the nature of the problem.
  • The dataflow of the conjugating form of the Pease Algorithm.
  • Knoxville aug02-full

    1. 1. SPIRAL: Current Status Markus Püschel Students Faculty • José Moura (CMU) • Jeremy Johnson (Drexel) • Robert Johnson (MathStar) • David Padua (UIUC) • Viktor Prasanna (USC) • Markus Püschel (CMU) • Manuela Veloso (CMU) • Gavin Haentjens (CMU) • Pinit Kumhom (Drexel) • Neungsoo Park (USC) • David Sepiashvili (CMU) • Bryan Singer (CMU) • Yevgen Voronenko (Drexel) • Edward Wertz (CMU) • Jianxin Xiong (UIUC) Collaborators • Christoph Überhuber (TU Vienna) • Franz Franchetti (TU Vienna) http://www.ece.cmu.edu/~spiral
    2. 2. Sponsor Work supported by DARPA (DSO), Applied & Computational Mathematics Program, OPAL, through grant managed by research grant DABT63-98-1-0004 administered by the Army Directorate of Contracting.
    3. 3. Moore’s Law and High(est) Performance Scientific Computing (single processor, off-the-shelf) Moore’s Law:  processor-memory bottleneck  short life cycles of computers  very complex architectures • vendor specific • special instructions (MMX, SSE, FMA, …) • undocumented features Consequences for software/algorithms:  arithmetic cost model not accurate for predicting runtime  better performance models hard to get  best code is machine dependent (registers/caches size, structure)  hand-tuned code becomes obsolete as fast as it is written  compiler limitations  full performance requires (in part) assembly coding Portable performance requires automation
    4. 4. SPIRAL Automates Implementation  cuts development costs  code less error-prone Optimization  systematic exploration of alternatives both at algorithmic and code level Platform-Adaptation  takes advantage of architecture specific features  porting without loss of performance of DSP algorithms  are performance critical A library generator for highly optimized signal processing algorithms
    5. 5. SPIRAL system goes for a coffee ian tic a controls algorithm generation Formula Generator fast algorithm as SPL formula ert Exp m SPL Compiler er ram rog P controls implementation options C/Fortran/SIMD code platform-adapted implementation Search Engine hem at M specifies runtime on given platform comes back (or an espresso for small transforms) SPIRAL DSP transform user
    6. 6. DSP Transform Algorithm
    7. 7. DSP Algorithms: Example 4-point DFT Cooley/Tukey FFT (size 4): 1 1 1 1  1 1 i − 1 − i  0  = 1 − 1 1 − 1 1    1 − i − 1 i  0  Fourier transform 0 1 0  1 1 0 1  0  0 − 1 0  0  1 0 − 1 0 0 1 0 0 0 0 1 0 0 1 1 0 1 − 1  0  0 0  i  0 0 0 0  1 0 0  0  1 1  0  1 − 1 0 0 0 1 0 0 1 0 0 Diagonal matrix (twiddles) DFT4 = ( DFT2 ⊗ I 2 ) ⋅ T24 ⋅ ( I 2 ⊗ DFT2 ) ⋅ L4 2 Kronecker product Identity Permutation  product of structured sparse matrices  mathematical notation 0 0  0  1
    8. 8. DSP Algorithms: Terminology Transform DFTn Rule DFTnm → ( DFTn ⊗ I m ) ⋅ D ⋅ ( I n ⊗ DFTm ) ⋅ P parameterized matrix • a breakdown strategy • product of sparse matrices DFT 8 Ruletree DFT 2 DFT 4 DFT 2 Formula DFT 2 • recursive application of rules • uniquely defines an algorithm • efficient representation • easy manipulation DFT8 = ( F2 ⊗ I 4 ) ⋅ D ⋅ ( I 2 ⊗ ( I 2 ⊗ F2 ) ) ⋅ P • few constructs and primitives • uniquely defines an algorithm • can be translated into code
    9. 9. DSP Transforms discrete Fourier transform Walsh-Hadamard transform DFTn = [ exp( 2kliπ / n ) ] WHT2k = DFT2 ⊗ ⊗ DFT2 DCT ( II ) n = [ cos( k ( l + 1 / 2 )π / n ) ] DCT ( IV ) n = [ cos( ( k + 1 / 2) (l + 1 / 2)π / n ) ] discrete cosine and sine Transforms (16 types) modified discrete cosine transform DST ( I ) n = [ sin ( klπ / n ) ] MDCTn×2 n = [ cos( ( k + ( n + 1) / 2)(l +1 / 2)π / n ) ] two-dimensional transform T ⊗T Others: filters, discrete wavelet transforms, Haar, Hartley, …
    10. 10. Rules = Breakdown Strategies ( ) DCT2( II ) → diag 1,1 / 2 ⋅ F2 ( ) DCTn( II ) → P ⋅ DCTn(/II ) ⊕ DCTn(/IV ) ⋅ ( I n / 2 ⊗ F2 ) 2 2 base case Q recursive DCTn( IV ) → S ⋅ DCTn( II ) ⋅ D translation DCTn( IV ) → M 1  M r DFTn → B ⋅ ( DCTn(/I2) ⊕ DSTn(/I2) ) ⋅ C iterative recursive DFTnm → ( DFTn ⊗ I m ) ⋅ D ⋅ ( I n ⊗ DFTm ) ⋅ P recursive Fn (h) → ( I n / d ⊗ k I d + k ) ⋅ ( I n / d ⊗ Fd (h)) recursive Fn (h) → Circ (h ) ⋅ E recursive DWTn (W ) → ( DWTn / 2 (W ) ⊕ I n / 2 ) ⋅ P ⋅ ( I n / 2 ⊗ k W ) ⋅ E recursive WHT2n → ∏ ( I 2n1+K+ni −1 ⊗ WHT2ni ⊗ I 2ni +1+K+nt ) iterative/ recursive MDCTn×2 n → S ⋅ DCTn( IV ) ⋅ P translation n i =1
    11. 11. Formula for a DCT, size 16
    12. 12. DSP Transform Algorithm (Formula) Implementation
    13. 13. Formulas in SPL •••• ( compose ( diagonal ( 2*cos(1/16*pi) 2*cos(3/16*pi) 2*cos(5/16*pi) 2*cos(7/16*pi) ) ) ( permutation ( 1 3 4 2 ) ) ( tensor ( I 2 ) ( F 2 ) ) ( permutation ( 1 4 2 3 ) ) ( direct_sum ( compose ( F 2 ) ( diagonal ( 1 sqrt(1/2) ) ) ) ( compose ( matrix ( 1 1 0 ) ( 0 (-1) 1 ) ) ( diagonal ( cos(13/8*pi)-sin(13/8*pi) sin(13/8*pi) cos(13/8*pi)+sin(13/8*pi) ) ) ( matrix ( 1 0 ) ( 1 1 ) ( 0 1 ) ) ( permutation ( 2 1 ) ) ••••
    14. 14. SPL Syntax (Subset)  matrix operations: (compose formula formula ...) (tensor formula formula ...) (direct_sum formula formula ...)  direct matrix description: (matrix (a11 a12 ...) (a21 a22 ...) ...) (diagonal (d1 d2 ...)) (permutation (p1 p2 ...))  parameterized matrices: (I n) (F n)  scalars: 1.5, 2/7, cos(..), w(3), pi, 1.2e-04 allows extension of SPL  definition of new symbols: (define name formula) (template formula (i-code-list)  directives for code generation #codetype real/complex controls loop unrolling #unroll on/off
    15. 15. SPL Compiler, 4-point FFT (compose (tensor (F 2) (I 2)) (T 4 2) (tensor (I 2) (F 2)) (L 4 2)) fast algorithm as formula as SPL program #codetype complex f0 f1 f2 f3 f4 y(1) y(2) y(3) y(4) = = = = = = = = = x(1) + x(3) x(1) - x(3) x(2) + x(4) x(2) - x(4) (0.00d0,-1.00d0)*f(3) f0 + f2 f0 - f2 f1 + f4 f1 - f4 real r0 r1 r2 r3 r4 r5 r6 r7 y(1) y(2) y(3) y(4) y(5) y(6) y(7) y(8) = = = = = = = = = = = = = = = = x(1) x(1) x(2) x(2) x(3) x(3) x(4) x(4) r0 + r1 + r0 r1 r2 + r3 r2 r3 + + x(5) - x(5) + x(6) - x(6) + x(7) - x(7) + x(8) - x(8) r4 r5 r4 r5 r7 r6 r7 r6
    16. 16. SPL Compiler: Summary SPL Program SPL Formula Symbol Definition Template Definition Parsing Abstract Syntax Tree Symbol Table Template Table Intermediate Code Generation I-Code Intermediate Code Restructuring I-Code Optimization I-Code Target Code Generation C, FORTRAN function Built-in optimizations:  single static assignment code  no reuse of temporary vars  only scalar temporary vars  constants precomputed  limited CSE Extensible through templates
    17. 17. DSP Transform Algorithm (Formula) Implementation Search
    18. 18. Why Search? DCT, type IV, size 16 ~31000 formulas • maaaany different formulas • large spread in runtimes, even for modest size • not due to arithmetic cost • best formula is platform-dependent
    19. 19. Search Methods available in SPIRAL      Exhaustive Search Dynamic Programming (DP) Random Search Hill Climbing STEER (similar to a genetic algorithm) Possible Sizes Exhaust Very small Formulas Timed Results All Best DP All 10s-100s (very) good Random All User decided fair/good Hill Climbing All 100s-1000s Good STEER All 100s-1000s (very) good Search over • algorithm space and • implementation options (degree of unrolling)
    20. 20. STEER Population n: Mutation …… Cross-Breeding expand differently Population n+1: swap expansions …… Survival of Fittest
    21. 21. Experimental Results (C code) search methods (applicable to all transforms) high performance code (compared with FFTW) different transforms generated high quality code
    22. 22. SPIRAL System  Available for download (v3.1): www.ece.cmu.edu/~spiral  Easy installation (Unix: configure/make; Windows: install shield)  Unix/Linux and Windows 98/ME/NT/2000/XP  Current transforms: DFT, DHT, WHT, RHT, DCT/DST type I – IV, MDCT, Filters, Wavelets, Toeplitz, Circulants  Extensible  New version (4.0) in preparation
    23. 23. Recent Work
    24. 24. Learning to Generate Fast Algorithms • Learns from given dataset (formulas+runtimes) how to design a fast algorithm (breakdown strategy) • Learns from a transform of one size, generates the best algorithm for many sizes • Tested for DFT and WHT
    25. 25. SIMD Short Vector Extensions vector length = 4 (4-way) x +  Extension to instruction set architecture  Available on most current architectures (SSE on Pentium, AltiVec on Motorola G4)  Originally for multimedia (like MMX for integers)  Requires fine grain parallelism  Large potential speed-up Problems:  SIMD instructions are architecture specific  No common API (usually assembly hand coding)  Performance very sensitive to memory access  Automatic vectorization very limited very difficult to use
    26. 26. Vector code generation from SPL formulas Naturally vectorizable construct A ⊗ I4 A vector length x y (Current) generic construct completely vectorizable: k ∏ P D ( A ⊗ Iυ ) E Q i i i i =1 i i Pi, Qi Di, Ei Ai ν permutations diagonals arbitrary formulas SIMD vector length Vectorization in two steps: 1. Formula manipulation using manipulation rules 2. Code generation (vector code + C code)
    27. 27. Generated Vector Code DFTs: Pentium 4 7 6 Spiral SSE gflops 5 Intel MKL interl. 4 FFTW 2.1.3 Spiral C and F95 3 Spiral F95 vect SIMD-FFT 2 1 0 4 5 6 7 8 9 10 11 12 13 14 n DFT 2^n, Pentium 4, 2.53 GHz, using Intel C compiler 6.0  speedups (to C code) up to factor of 3.3  beats hand-tuned vendor library
    28. 28. Generated Vector Code, Other Transforms WHT normalized runtime normalized runtime 2-dim DCT transform size transform size speedups up to factor of 2.5
    29. 29. Flexible Loop Interleaving (Runtime Gain WHT) Athlon XP: up to 55% Alpha 21264: up to 70% Pentium 4: up to 45% UltraSparc III: up to 60%
    30. 30. Parallel Code Generation: Example WHT 10 PowerPC RS64 III Speedup 8 6 4 1 thread 8 threads 10 threads 2 0 1 6 11 16 WHT size log(N) 21 26 Parallelized constructs: In ⊗ A, A ⊗ In
    31. 31. Code Scheduling Runtime histograms DFT, size 16 ~6500 formulas DCT4, size 16 ~16000 formulas unscheduled scheduled
    32. 32. Filters and Wavelets New constructs: row/column overlapped tensor product A A A I n ⊗k A = A I n ⊗k A = A A A Examples for rules: Fn (h) → ( I n / d ⊗ k I d + k ) ⋅ ( I n / d ⊗ Fd (h)) DWTn (W ) → ( DWTn / 2 (W ) ⊕ I n / 2 ) ⋅ P ⋅ ( I n / 2 ⊗ k W ) ⋅ E A
    33. 33. Conclusions  Automatic code generation for the entire domain of (linear) DSP algorithms  Portable high performance across platforms and across time  Integration of math (high) level and implementation (low) level  Intelligence through search and learning in the space of alternatives
    34. 34. Future Plans  Transforms: Radon, Gabor, Hankel, structured matrices, …  Target platforms: parallel platforms, DSP processors, SW/HW architectures, FPGAs, ASICs  Instructions: Vector, FMAs, prefetching, OpenMP, MPI  Beyond transforms: entire DSP applications  Other domains amenable to SPIRAL approach?
    35. 35. Questions? http://www.ece.cmu.edu/~spiral
    36. 36. Extra Slides
    37. 37. Generating Parallel Programs  Interpret constructs such as In ⊗ A as parallel operations and transform formulas to obtain maximal parallelism.  Explore alternative data access patterns mathematically (e.g. different permutations in matrix factorizations)  Prototype implementation using WHT  Build on existing sequential package  SMP implementation using OpenMP (IPDPS’02)  90% efficiency obtained on 12 processor PowerPC RS64 III  Distributed memory implementation using MPI (POHLL’02)
    38. 38. Comparison of Parallel DDL Schemes 10 Speedup PowerPC RS64 III 10 threads 8 6 4 2 0 1 6 11 16 21 WHT size log(N) 26 Best sequential with DDL Parallel without DDL Coarse-grained DDL Fine-grained DDL with ID shift
    39. 39. Overall Parallel Speedup 10 PowerPC RS64 III 6 Speedup 8 1 thread 8 threads 10 threads 4 2 0 1 6 11 16 WHT size log(N) 21 26
    40. 40. Performance of Digit Permutations on CRAY T3E Distribution of Global Tensor Permutations on 128 Processors (shmem put) 1200 Number of tensor permutations 1000 800 600 400 200 0 0 50 100 150 200 250 Bandwidth in MB/sec/node 300 350
    41. 41. Architecture Framework P4 ( I 8 ⊗ F2 )T4 P4− 1 P3 ( I 8 ⊗ F2 )T3 P3− 1 P2 ( I 8 ⊗ F2 )T2 P2− 1 P1 ( I 8 ⊗ F2 )T1P1− 1 Pd Host I/O Interface Parameters: Dimension, Pd I/O Interface M = Memory AG = Address Generator CU = Computation Unit Interconnection Network Parameters: Pi, 1 ≤ i ≤ n, no. of processor AG CU AG CU AG M M CU AG ... M CU AG M Parameters: Dimension, and Ti, 1 ≤ i ≤ n, no. of processor 2m CU
    42. 42. Pease Algorithm Dataflow L16 ( I 8 ⊗ F2 )T3 L16 L16 ( I 8 ⊗ F2 )T2 L16 L16 ( I 8 ⊗ F2 )T1 L16 ( I 8 ⊗ F2 ) Pd 2 8 4 4 8 2 Stage 0 Stage 1 Stage 2 L16 2 Stage 3 0 0 1 2 2 4 3 6 4 8 5 10 6 12 7 14 8 1 9 3 10 5 11 7 12 9 13 11 14 13 15 15
    43. 43. Optimal Dataflow L16 ( L82 ⊗ I 2 )( I 8 ⊗ F2 )T3 ( L84 ⊗ I 2 ) L16 L16 ( L82 ⊗ I 2 )( I 8 ⊗ F2 )T2 ( L84 ⊗ I 2 ) L16 L16 ( I 8 ⊗ F2 )T1 L16 ( L84 ⊗ I 2 )( I 8 ⊗ F2 )( L82 ⊗ I 2 ) Pd 2 8 4 4 8 2 Stage 0 Stage 1 Stage 2 Stage 3
    44. 44. Pease v.s. Optimal Performance comparison of Pease algorithm and optimal algorithm no. of clock cycles 12000 10000 8000 Pease 6000 Optimal 4000 2000 0 5 6 7 8 FFT size Stage 1 2 3 4 5 6 Local Access 32 32 64 128 64 32 Pease Remote Local Access Peformance Access 96 3900 128 96 3860 128 64 3900 128 0 640 128 64 3840 64 96 3880 64 Optimal Remote Access Performance 0 640 0 640 0 640 0 640 64 2560 64 2560
    45. 45. Performance: Speedup Speedup Speedup(4*N*log(N)/No. of Clocks) 4 3.5 3 2.5 2 1.5 1 0.5 0 5 6 7 8 9 10 11 12 13 14 15 16 Address Size (Bits) Speedup = S/C C = no. of. clock cycles using 4-processor FFT engine S = minimum no. of clock cycles using single processor (4*2n*n, 2n is the number of points)
    46. 46. SPIRAL Approach given DSP Transform given Possible Algorithms Possible Implementations Performance Evaluation Intelligent Search SPIRAL Search Space (DFT, DCT, Wavelets etc.) Computing Platform (Pentium III, Pentium 4, Athlon, SUN, PowerPC, Alpha, … ) adapted implementation
    47. 47. Classical Code Generation System given DSP Transform (DFT, DCT, Wavelets etc.) Math/Algorithm Expert Expert Programmer Performance Evaluation given Computing Platform (Pentium III, Pentium 4, Athlon, SUN, PowerPC, Alpha, … ) adapted implementation
    48. 48. Algorithms = Ruletrees = Formulas DCT8( II ) DCTn( II ) → P ⋅ ( DCTn(/II ) ⊕ DCTn(/IV ) ) ⋅ ( F2 ⊗ I n / 2 ) 2 2 R1 DCT4( IV ) DCT4( II ) R1 DCT2( II ) DCT4( II ) DCT2( IV ) R3 F2 DCTn( IV ) → P ⋅ DCTn( II ) ⋅ S R6 R6 ( II ) 2 R1 DST R4 DCT2( II ) F2 R3 DCT2( II ) → 1 ⋅ F2 2 DCT2( IV ) F2 R6 DST2( II ) R4 F2
    49. 49. Mathematical Framework: Summary  fast algorithms represented as ruletrees (easy generation/manipulation) and as formulas (can be translated into code)  formulas built from few constructs and primitives  many different algorithms/formulas generated from few rules (combinatorial explosion)  these algorithms are (essentially) equal in arithmetic cost, but differ in data flow
    50. 50. Formula Generation data base (extensible!) data type Formula Generator rules recursive application transforms control search engine ruletrees formulas runtime export formula translation (spl compiler) translation  written in GAP/AREP (computer algebra system)  all computation/manipulation is symbolic  exact arithmetic  easy extensible rule and transform data base  verification of rules and formulas cut here for other optimization problems
    51. 51. Number of Formulas/Algorithms k # DFT, size 2^k # DCT-IV, size 2^k 1 2 3 4 5 6 7 8 9 1 6 40 296 27744 162570361280 ~1.01 • 10^27 ~2.31 • 10^61 ~2.86 • 10^133 1 10 126 31242 1924443362 7343815121631354242 ~1.07 • 10^38 ~2.30 • 10^76 ~1.06 • 10^153  differ in data flow not in arithmetic cost  exponential search space
    52. 52. Extensibility of SPIRAL New transforms are readily included on the high level (easy, due to SPIRAL’s framework) New constructs and primitives (potentially required by radically different transforms) are readily included in SPL (moderate effort, due to template mechanism) New instructions sets available (e.g., SSE) are included by extending the SPL compiler (doable one time effort)
    53. 53. Generated Vector Code DFTs: Pentium III 1.8 1.6 1.4 gflops 1.2 Intel MKL Spiral SSE Spiral C/F95 vect Spiral C/F95 FFTW 2.1.3 1 0.8 0.6 0.4 0.2 0 4 5 6 7 8 9 10 11 12 13 n DFT 2^n, Pentium III, 1 GHz, using Intel C compiler 6.0 speedups (to C code) up to factor of 2.5

    ×