Optimizing Commercial Software for Intel Xeon Phi Coprocessors: Lessons Learned

Slide Notes
  • Here we see an example of a single RTM shot as the simulation progresses in time. First, we propagate the source wavelet through the earth model. Next, we inject the receiver data into another wavefield using the same earth model. Finally, we can see the image forming over time.
  • Expand 3D to 2D, explain how finer granularity in scheduling helps, especially when there are more cores
  • Marcel: is my fractional time correct? 250 × 150 = 37,500 iterations / 240 threads = 156.25, so during the last pass (#157) only a quarter of the threads are busy while the others wait; the overall slowdown is only one extra pass, i.e. about 1/157 of the total time.

Transcript

  • 1. Optimizing Commercial Software for Intel® Xeon Phi™ Coprocessors: Lessons Learned. Supercomputing Conference, Denver, Colorado, USA, November 17-22, 2013. Dan Cyca, Chief Technical Officer, Acceleware. ©2013 Acceleware Ltd. All rights reserved.
  • 2. In My Parallel Universe…
    – Small to medium-sized seismic companies aren't limited by computational resources when processing seismic data
  • 3. Seismic Computing Requirements: chart of compute requirements for seismic imaging algorithms over time (1990-2015), from Kirchhoff Migration, Post SDM and PreSTM at ~100 GF-1 TF, through WEM (~10 TF), paraxial WE approximation (~100 TF), RTM (~1 PF), FWI (~10 PF) and elastic imaging (~100 PF), up to the full WE approximation (~1 EF). Source: Total.
  • 4. RTM Overview: the source wavelet is propagated forwards in time, and the receiver data are propagated backwards in time.
  • 5. RTM Introduction
    – Finite-difference code
    – Compute intensive: 10s of hours per seismic shot
    – Large memory footprint: 100 GB per shot
    – Large local storage requirement: 500 GB per shot
    – 10,000s of shots
  • 6. RTM: Computational Requirements
    – The RTM image is made by migrating and then stacking a large number of shots (typically between 10,000 and 100,000)
    – Migrating each shot requires two or three 3D wave propagations
    – Each shot migration requires large RAM (~100 GB) and temporary disk space (~500 GB)
    – Runtime per shot varies from a few minutes (low-frequency isotropic) to several hours (high-frequency anisotropic)
    – A typical compute cluster used for RTM has 100s of nodes
  • 7. In My Parallel Universe…
    – Small to medium-sized seismic companies aren't limited by computational resources when processing seismic data
      – We want to make RTM (1 PFlop) available to these companies
    – We're delivering parallel software to run RTM on Xeon Phi systems
  • 8. RTM: Wave Propagation
    – Finite-difference time-domain technique
      – 3D grid with millions of points
      – 3D stencils
      – Update the entire grid every time step
      – 1000s of time steps
    – Memory footprint of 10-100 GB
    – Wavefield data from the forward pass is stored to disk to facilitate imaging (see the sketch below)
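
    A minimal sketch of the shot-migration flow described above, assuming hypothetical helpers (apply_stencil, inject_source, inject_receivers, write_snapshot, read_snapshot) and a simple cross-correlation imaging condition; it illustrates the two passes and the snapshot traffic, not the actual Acceleware implementation:

      #include <cstddef>
      #include <vector>

      using Field = std::vector<float>;   // flat 3D wavefield, illustrative only

      // Hypothetical stand-ins for the real propagation, injection and snapshot I/O.
      void apply_stencil(Field &wf, const Field &model);   // 3D finite-difference update
      void inject_source(Field &wf, int t);                // add source wavelet at step t
      void inject_receivers(Field &wf, int t);             // add recorded data at step t
      void write_snapshot(const Field &wf, int t);         // store forward wavefield to disk
      Field read_snapshot(int t);                          // reload forward wavefield

      // One RTM shot: the forward pass stores the source wavefield, the backward pass
      // propagates the receiver data and correlates it with the stored wavefield.
      void migrate_shot(const Field &model, Field &image, int nt)
      {
          Field fwd(model.size(), 0.0f), bwd(model.size(), 0.0f);

          for (int t = 0; t < nt; ++t) {          // propagate forwards in time
              apply_stencil(fwd, model);
              inject_source(fwd, t);
              write_snapshot(fwd, t);
          }
          for (int t = nt - 1; t >= 0; --t) {     // propagate backwards in time
              apply_stencil(bwd, model);
              inject_receivers(bwd, t);
              Field snap = read_snapshot(t);
              for (std::size_t i = 0; i < image.size(); ++i)
                  image[i] += snap[i] * bwd[i];   // imaging condition: image forms over time
          }
      }
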
  • 9. Parallelizing Single Shots
    – The finite-difference grid contains over 200 million cells per volume (2 GB)
    – Numerous volumes per shot (Earth model, wavefields and image)
    – One shot easily fits in a CPU compute node, but may be too large for a single Xeon Phi
  • 10. Parallelizing Each Shot: Multiple Cards
    – The volume is partitioned into pieces that each fit on a single Xeon Phi (diagram: Phi 0, Phi 1, Phi 2)
  • 11. Parallelizing Each Shot: Multiple Cards
    – Boundaries must be transferred between partitions (diagram: transfers between Phi 0 and Phi 1)
    – Transfers can become a bottleneck unless they are done asynchronously with the stencil calculations (see the sketch below)
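
    One way to hide the boundary transfers, sketched with non-blocking MPI under assumed helper names (update_interior, update_boundary); the production code may use a different transport, so treat this only as an illustration of the overlap idea:

      #include <mpi.h>
      #include <vector>

      // Hypothetical stand-ins for the stencil work on one partition.
      void update_interior(std::vector<float> &wf);        // cells needing no halo data
      void update_boundary(std::vector<float> &wf,
                           const std::vector<float> &halo_lo,
                           const std::vector<float> &halo_hi);

      // One time step on one partition: post the halo exchange, compute the interior
      // while the transfers are in flight, then finish the boundary cells.
      // Neighbours at the domain ends can be MPI_PROC_NULL, making those
      // sends/receives no-ops.
      void timestep(std::vector<float> &wf,
                    std::vector<float> &send_lo, std::vector<float> &send_hi,
                    std::vector<float> &recv_lo, std::vector<float> &recv_hi,
                    int lo_rank, int hi_rank)
      {
          MPI_Request reqs[4];
          MPI_Irecv(recv_lo.data(), (int)recv_lo.size(), MPI_FLOAT, lo_rank, 0, MPI_COMM_WORLD, &reqs[0]);
          MPI_Irecv(recv_hi.data(), (int)recv_hi.size(), MPI_FLOAT, hi_rank, 1, MPI_COMM_WORLD, &reqs[1]);
          MPI_Isend(send_lo.data(), (int)send_lo.size(), MPI_FLOAT, lo_rank, 1, MPI_COMM_WORLD, &reqs[2]);
          MPI_Isend(send_hi.data(), (int)send_hi.size(), MPI_FLOAT, hi_rank, 0, MPI_COMM_WORLD, &reqs[3]);

          update_interior(wf);                        // overlap computation with communication

          MPI_Waitall(4, reqs, MPI_STATUSES_IGNORE);  // halos have arrived
          update_boundary(wf, recv_lo, recv_hi);
      }
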
  • 12. Parallelizing Each Shot: Within a Card
    – Data in the x and y dimensions are split over cores (diagram: x/y tiles assigned to core/thread pairs such as Core 0 Threads 0-3, Core 1 Threads 0-1, …)
    – Operations in the z dimension are vectorized
  • 13. Levels of Parallelism
    – Each shot is split over multiple Xeon Phi coprocessors (or Xeon nodes) using MPI
    – The partition on each Phi is split over cores using OpenMP
    – Operations on each thread are vectorized using the compiler's autovectorizer
  • 14. Kernel: 8th Order Spatial Derivative

      #pragma omp parallel for
      for(size_t x = xMin; x < xMax; x++) {
        for(size_t y = yMin; y < yMax; y++) {
          size_t const idx = x*strideX + y*strideY;
          #pragma vector …
          for(size_t z = zMin; z < zMax; z++) {
            size_t const i = idx + z;
            pVy[i] = yCoeffs[0]*(pV[i-4*strideY]-pV[i+4*strideY]) +
                     yCoeffs[1]*(pV[i-3*strideY]-pV[i+3*strideY]) +
                     yCoeffs[2]*(pV[i-2*strideY]-pV[i+2*strideY]) +
                     yCoeffs[3]*(pV[i-1*strideY]-pV[i+1*strideY]) +
                     yCoeffs[4]*pV[i];
          }
        }
      }

    – Triple loop over the three dimensions
    – One-dimensional derivative: a simple calculation with a large memory-bandwidth requirement
  • 15. Tuning OpenMP

      #pragma omp parallel for collapse(2) schedule(static)
      for(size_t x = xMin; x < xMax; x++) {
        for(size_t y = yMin; y < yMax; y++) {
          size_t const idx = x*strideX + y*strideY;
          #pragma vector …
          for(size_t z = zMin; z < zMax; z++) {
            size_t const i = idx + z;
            // Derivative Calculations
          }
        }
      }

    – Many options are available for OpenMP; tuning is especially important on Phi (mostly because of the high thread count)
    – Here we use static loop scheduling because it has the lowest overhead; it is also the most prone to load-balance issues
  • 16. Tuning OpenMP
    – collapse(2) combines two adjacent for loops; here, the X and Y dimensions are combined (e.g. X = 250, Y = 150), turning the X and Y loops into a single X*Y iteration space
    – Work is spread more evenly over the cores when there are more iterations
      – 250 iterations on 240 threads (60 cores × 4) means 10 threads do double work while the other threads wait (about 1/2 the time wasted)
      – 250 × 150 iterations divide much better onto 240 threads (about 1/157 of the time wasted)
    – Improved Phi performance by 1.5x! (See the worked example below.)
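
    A small worked example of the load-balance arithmetic above (illustrative, not from the slides): under static scheduling the parallel region takes ceil(N/T) iteration-times per thread, so the idle fraction is 1 - (N/T)/ceil(N/T).

      #include <cstdio>

      // Fraction of thread time spent idle when N iterations are statically
      // scheduled onto T threads.
      double wasted_fraction(long n, long t)
      {
          long passes = (n + t - 1) / t;          // ceil(N/T): time actually taken
          double ideal = (double)n / (double)t;   // N/T: perfectly balanced time
          return 1.0 - ideal / (double)passes;
      }

      int main()
      {
          // Outer loop only: 250 iterations on 240 threads -> roughly half the time
          // is wasted (10 threads do 2 iterations, 230 do 1 and then wait).
          std::printf("x loop only : %.4f\n", wasted_fraction(250, 240));

          // collapse(2): 250*150 = 37,500 iterations -> only the last pass (#157)
          // is partially occupied, so at most ~1/157 of the time is lost.
          std::printf("x*y collapse: %.4f\n", wasted_fraction(250L * 150L, 240));
          return 0;
      }
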
  • 17. Tuning Thread Affinity
    – We programmatically set affinity with run-dependent logic
    – Isolating the various tasks prevents over-subscription of cores (diagram: transfer threads on Core 0, disk IO threads on Core 1, propagation threads on Cores 2-60, OS threads on Core 61); see the sketch below
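
    The affinity code itself is not shown on the slide; below is a hedged sketch of one common way to pin a thread on Linux with pthread_setaffinity_np, with role and core numbers chosen only to mirror the mapping above (transfer on core 0, disk I/O on core 1, propagation from core 2):

      #ifndef _GNU_SOURCE
      #define _GNU_SOURCE              // for pthread_setaffinity_np / CPU_SET
      #endif
      #include <pthread.h>
      #include <sched.h>

      // Pin the calling thread to a single logical core (returns 0 on success).
      static int pin_to_core(int core)
      {
          cpu_set_t set;
          CPU_ZERO(&set);
          CPU_SET(core, &set);
          return pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &set);
      }

      // Example placement mirroring the slide's diagram (illustrative numbers).
      void place_worker(int role, int thread_id)
      {
          if (role == 0)      pin_to_core(0);               // transfer thread
          else if (role == 1) pin_to_core(1);               // disk I/O thread
          else                pin_to_core(2 + thread_id);   // propagation threads
      }

    Intel's OpenMP runtime also offers the KMP_AFFINITY environment variable, but run-dependent placement like the split above generally needs an explicit call of this kind.
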
  • 18. Tuning Thread Affinity
    – Thread affinity settings improved scaling on multiple Phis and on multiple CPU sockets (table: scaling of dual Xeon sockets vs. a single socket and of dual Xeon Phi vs. a single Phi, without and with the affinity changes: 1.3x, 1.9x, 1.9x, 1.7x)
    – Different settings are used for Xeon Phi and Xeon
  • 19. Tuning Memory Access

      #pragma omp parallel for collapse(2) schedule(static)
      for(size_t x = xMin; x < xMax; x++) {
        for(size_t y = yMin; y < yMax; y++) {
          size_t const idx = x*strideX + y*strideY;
          __assume(strideX%16==0);
          __assume(strideY%16==0);
          __assume(idx%16==0);
          __assume_aligned(pV, 64);
          __assume_aligned(pVy, 64);
          #pragma vector always assert vecremainder
          #pragma ivdep
          #pragma vector nontemporal (pVy)
          for(size_t z = zMin; z < zMax; z++) {
            size_t const i = idx + z;
            pVy[i] = yCoeffs[0]*(pV[i-4*strideY]-pV[i+4*strideY])...   // derivative as on slide 14
          }
        }
      }

    – Give the compiler hints about indexing so it knows when it can use aligned reads/writes
    – pVy[i] is written once and should not be cached (nontemporal stores)
    – Improved performance by 1.1x on both Xeon and Xeon Phi!
  • 20. Current Performance Results
    – For anisotropic wave propagation, one Xeon Phi coprocessor delivers ~2.3x the performance of a single Xeon E5-2670 CPU
    – The same code base and optimizations are applied to Xeon and Xeon Phi
  • 21. About Acceleware
    – Professional training: Xeon Phi coprocessor optimization, OpenCL, OpenMP, MPI
    – High performance consulting: feasibility studies, porting and optimization, algorithm parallelization
    – Accelerated software: oil and gas, electromagnetics
  • 22. Questions? Come visit us in booth #1825! Head Office: Tel +1 403.249.9099, Email services@acceleware.com. Viktoria Kaczur, Senior Account Manager: Tel +1 403.249.9099 ext. 356, Cell +1 403.671.4455, Email viktoria.kaczur@acceleware.com