
- 1. Optimizing Commercial Software for Intel® Xeon Phi™ Coprocessors: Lessons Learned. Supercomputing Conference, Denver, Colorado, USA, November 17-22, 2013. Dan Cyca, Chief Technical Officer, Acceleware. ©2013 Acceleware Ltd. All rights reserved.
- 2. In My Parallel Universe… Small to medium-sized seismic companies aren’t limited by computational resources when processing seismic data.
- 3. Seismic Computing Requirements. [Chart (source: Total): compute demand of seismic algorithms vs. year, 1990-2015 — Kirchhoff migration, Post SDM, PreSTM at 100 GF-1 TF; WEM at 10 TF; paraxial WE approximation at 100 TF; RTM at 1 PF; FWI at 10 PF; elastic imaging at 100 PF; full WE approximation at 1 EF.]
- 4. RTM Overview. Source: propagate forwards in time. Receiver data: propagate backwards in time.
- 5. RTM Introduction.
  - Finite-difference code
  - Compute intensive: 10s of hours per seismic shot
  - Large memory footprint: 100 GB per shot
  - Large local storage requirement: 500 GB per shot
  - 10,000s of shots
- 6. RTM: Computational Requirements.
  - An RTM image is made by migrating and then stacking a large number of shots (typically between 10,000 and 100,000)
  - Migrating each shot requires two or three 3D wave propagations
  - Each shot migration requires large RAM (~100 GB) and temporary disk space (~500 GB)
  - Runtime per shot varies from a few minutes (low-frequency isotropic) to several hours (high-frequency anisotropic)
  - A typical compute cluster used for RTM has 100s of nodes
- 7. In My Parallel Universe… Small to medium-sized seismic companies aren’t limited by computational resources when processing seismic data. We want to make RTM (1 PFlop) available to these companies; we’re delivering parallel software to run RTM on Xeon Phi systems.
- 8. RTM: Wave Propagation. Finite-difference time domain technique:
  - 3D grid with millions of points
  - 3D stencils
  - Update the entire grid every time step
  - 1000s of time steps
  - Memory footprint of 10-100 GB
  - Wavefield data from the forward pass is stored to disk to facilitate imaging
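The time-stepping structure described above can be sketched in miniature. This is an illustrative 1D, 2nd-order acoustic update (a hypothetical helper, not the production kernel, which works on 3D grids with 8th-order stencils):

```c
#include <stddef.h>

/* Illustrative sketch only: a 1D acoustic wave update, 2nd order in
 * space and time. The production RTM code applies the same pattern to
 * 3D grids with 8th-order stencils, for thousands of time steps. */
void fdtd_step(const double *prev, const double *curr, double *next,
               size_t n, double c2 /* (c*dt/dx)^2 */) {
    for (size_t i = 1; i + 1 < n; i++) {
        double lap = curr[i-1] - 2.0*curr[i] + curr[i+1]; /* spatial stencil */
        next[i] = 2.0*curr[i] - prev[i] + c2*lap;         /* leapfrog in time */
    }
}
```

Each step reads the two previous wavefields and writes the next; in practice the three buffers are rotated rather than copied.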
- 9. Parallelizing Single Shots. The finite-difference grid contains over 200 million cells per volume (2 GB), with numerous volumes per shot (Earth model, wavefields and image). One shot easily fits in a CPU compute node, but may be too large for a single Xeon Phi.
- 10. Parallelizing Each Shot: Multiple Cards. The volume is partitioned into pieces that fit on a single Xeon Phi. [Diagram: volume split across Phi 0, Phi 1, Phi 2.]
- 11. Parallelizing Each Shot: Multiple Cards. Boundaries must be transferred between partitions. Transfers can become a bottleneck unless they are done asynchronously with the stencil calculations. [Diagram: boundary transfers between Phi 0 and Phi 1.]
- 12. Parallelizing Each Shot: Within Card. Data in the x and y dimensions are split over cores; operations in the z dimension are vectorized. [Diagram: x/y tiles assigned to Core 0, Threads 0-3 and Core 1, Threads 0-1; z runs along the vector lanes.]
- 13. Levels of Parallelism.
  - Each shot is split over multiple Xeon Phi coprocessors (or Xeon nodes) using MPI
  - The partition on each Phi is split over cores using OpenMP
  - Operations on each thread are vectorized using the compiler’s autovectorizer
- 14. Kernel: 8th Order Spatial Derivative. Triple loop over the dimensions; the one-dimensional derivative is a simple calculation with large memory bandwidth.

  ```c
  #pragma omp parallel for
  for(size_t x = xMin; x < xMax; x++) {
      for(size_t y = yMin; y < yMax; y++) {
          size_t const idx = x*strideX + y*strideY;
  #pragma vector …
          for(size_t z = zMin; z < zMax; z++) {
              size_t const i = idx + z;
              pVy[i] = yCoeffs[0]*(pV[i-4*strideY]-pV[i+4*strideY]) +
                       yCoeffs[1]*(pV[i-3*strideY]-pV[i+3*strideY]) +
                       yCoeffs[2]*(pV[i-2*strideY]-pV[i+2*strideY]) +
                       yCoeffs[3]*(pV[i-1*strideY]-pV[i+1*strideY]) +
                       yCoeffs[4]*pV[i];
          }
      }
  }
  ```
- 15. Tuning OpenMP. Many options are available for OpenMP; tuning is especially important on Phi (mostly because of the high thread count). Here we use static loop scheduling because it has the lowest overhead; it is also the most prone to load-balance issues.

  ```c
  #pragma omp parallel for collapse(2) schedule(static)
  for(size_t x = xMin; x < xMax; x++) {
      for(size_t y = yMin; y < yMax; y++) {
          size_t const idx = x*strideX + y*strideY;
  #pragma vector …
          for(size_t z = zMin; z < zMax; z++) {
              size_t const i = idx + z;
              // Derivative Calculations
          }
      }
  }
  ```
- 16. Tuning OpenMP. collapse(2) combines two adjacent for loops; here, the X and Y dimensions are combined (e.g. X = 250, Y = 150). Work is spread more evenly over the cores when there are more iterations:
  - 250 iterations on 240 threads (60 cores × 4 threads) means 10 threads do double work while the other threads wait (1/2 of the time wasted)
  - 250 × 150 iterations divide much better onto 240 threads (1/157 of the time wasted)
  Improved Phi performance by 1.5x!
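The load-balance arithmetic above can be checked directly. This small helper (hypothetical, not from the slides) computes the idle fraction under static scheduling, where threads given the smaller chunk wait for the threads given the larger one:

```c
/* Idle-time fraction under schedule(static): iterations split into
 * chunks of ceil(n/threads) and floor(n/threads); threads holding the
 * smaller chunk sit idle for the difference. */
double wasted_fraction(long iterations, long threads) {
    long max_chunk = (iterations + threads - 1) / threads; /* ceil  */
    long min_chunk = iterations / threads;                 /* floor */
    if (max_chunk == min_chunk) return 0.0;                /* perfect split */
    return (double)(max_chunk - min_chunk) / (double)max_chunk;
}
```

wasted_fraction(250, 240) gives 0.5 (the "1/2 time wasted" case: chunks of 2 vs. 1), while wasted_fraction(250 * 150, 240) gives 1/157 (chunks of 157 vs. 156).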
- 17. Tuning Thread Affinity. We programmatically set affinity with run-dependent logic; isolating the various tasks prevents over-subscription of cores. [Diagram: transfer threads on Core 0, disk IO threads on Core 1, propagation threads on Cores 2-60, OS threads on Core 61.]
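The slides don't show the affinity code itself. On Linux, pinning the calling thread to a dedicated core can be sketched with the glibc-specific pthread_setaffinity_np call; the helper name is hypothetical:

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Hypothetical sketch (Linux/glibc only): pin the calling thread to a
 * single core, as one might for dedicated transfer, disk-IO, or
 * propagation thread pools. Returns 0 on success. */
int pin_to_core(int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &set);
}
```

Environment variables such as Intel's KMP_AFFINITY offer a coarser, non-programmatic alternative when per-task logic like the above isn't needed.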
- 18. Tuning Thread Affinity. Thread affinity settings improved scaling on multiple Phis and multiple CPU sockets (different settings for Xeon Phi and Xeon):
  - Dual Xeon Phi vs. single Phi: 1.3x without affinity changes, 1.9x with affinity changes
  - Dual Xeon sockets vs. single socket: 1.9x without affinity changes, 1.7x with affinity changes
- 19. Tuning Memory Access. Give the compiler hints about indexing so it knows when to use aligned reads/writes; pVy[i] is written once and should not be cached. Improved performance by 1.1x on both Xeon and Xeon Phi!

  ```c
  #pragma omp parallel for collapse(2) schedule(static)
  for(size_t x = xMin; x < xMax; x++) {
      for(size_t y = yMin; y < yMax; y++) {
          size_t const idx = x*strideX + y*strideY;
          __assume(strideX%16==0);
          __assume(strideY%16==0);
          __assume(idx%16==0);
          __assume_aligned(pV, 64);
          __assume_aligned(pVy, 64);
  #pragma vector always assert vecremainder
  #pragma ivdep
  #pragma vector nontemporal (pVy)
          for(size_t z = zMin; z < zMax; z++) {
              size_t const i = idx + z;
              pVy[i] = yCoeffs[0]*(pV[i-4*strideY]-pV[i+4*strideY])...
          }
      }
  }
  ```
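Hints like __assume_aligned(pV, 64) are only safe if the arrays really are 64-byte aligned, so the allocations must guarantee it. A POSIX sketch (helper name hypothetical; Intel's _mm_malloc is a common alternative):

```c
#include <stdlib.h>
#include <stdint.h>

/* Sketch: allocate n floats on a 64-byte boundary so that
 * __assume_aligned(p, 64) style hints are actually valid.
 * Free with free(). Returns NULL on failure. */
float *alloc_aligned_floats(size_t n) {
    void *p = NULL;
    if (posix_memalign(&p, 64, n * sizeof(float)) != 0)
        return NULL;
    return (float *)p;
}
```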
- 20. Current Performance Results. For anisotropic wave propagation, a Xeon Phi coprocessor is ~2.3x a single Xeon E5-2670 CPU. The same code-base and optimizations were applied to Xeon and Xeon Phi.
- 21. About Acceleware.
  - Professional training: Xeon Phi Coprocessor Optimization, OpenCL, OpenMP, MPI
  - High performance consulting: Feasibility Studies, Porting and Optimization, Algorithm parallelization
  - Accelerated software: Oil and Gas, Electromagnetics
- 22. Questions? Come visit us in booth #1825! Head Office: Tel +1 403.249.9099, Email services@acceleware.com. Viktoria Kaczur, Senior Account Manager: Tel +1 403.249.9099 ext. 356, Cell +1 403.671.4455, Email viktoria.kaczur@acceleware.com
