Accelerators: the good, the bad and the ugly!
Dr Ian Reid
Ian.Reid@nag.co.uk
NAG - Experts in numerical algorithms and HPC services
Agenda
- NAG introduction
- Accelerators – NAG experience
- NAG on Intel Xeon Phi
- Summary
NAG Background
- Founded 1970
- Not-for-profit organisation: surpluses fund ongoing R&D
- Mathematical and statistical expertise
  - Libraries of components
  - Consulting
- HPC services
  - Computational Science and Engineering (CSE) support
  - Procurement advice, market watch, benchmarking
Where has my escalator gone?
- The escalator: want more performance? Buy the next processor!
- To get performance/efficiency we now have to go (massively) parallel
- This disruption is forcing a serious look at 'other' technologies (and algorithms!)
  - Even CPUs now have tens of cores
  - Hybrid, shared-memory and distributed-memory parallelism
- Painful whichever way we turn!
Accelerators
- Loose definition: hardware on which your software runs better than on your (general-purpose) CPU
- Generally NOT an easy win
  - Significant learning curve and effort
  - Offload disadvantages…
- The good: put some effort in; get a great result!
- The bad: put effort in, get an OK result, but learn lessons which can be re-used (often good!)
- The ugly: put significant effort in, get a poor result and learn nothing substantive
Intel Xeon Phi
- The Intel Xeon Phi is a co-processor attached to a host system via the PCI Express bus
- Highly parallel architecture
- Compiler support for OpenMP parallelism
- Its memory system is distinct from the host's
- Several use cases to consider:
  - Automatic offload
  - Explicit offload
  - Native applications
NAG Experience with Intel Xeon Phi
- Relatively easy to take existing OpenMP-based code and port it to the Phi
- Tuning for the Phi takes some learning and expertise
  - … but the feedback into Xeon code is often very strong
- The NAG Library for Intel Xeon Phi supports all models
  - Offload (automatic and explicit) and native libraries
  - A Windows version for Intel Xeon Phi is now in beta
Automatic Offload
- OpenMP regions are offloaded to the Phi when problem sizes are above some threshold
  - Estimating problem size can be complex
- Required data is transferred to/from the host before/after executing the OpenMP region
  - Data transfer takes time and eats into the benefit of running the OpenMP region on the Phi
- Transparent to the user of the Library
  - Just recompile code containing NAG Library function calls to benefit
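Because automatic offload is transparent, enabling it is typically a matter of environment configuration rather than source changes. The sketch below shows the Intel MKL-style controls as an illustration; whether a given NAG Library build honours these exact variables is an assumption, so check the library documentation.

```shell
# Enable MKL-style automatic offload (illustrative; variable names may
# differ for other library builds).
export MKL_MIC_ENABLE=1    # allow sufficiently large calls to migrate to the Phi
export OFFLOAD_REPORT=2    # report what was offloaded and data-transfer times

# The binary itself is unchanged: simply re-run it.
# ./my_app                 # hypothetical application name
```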
Explicit Offload
- All NAG functions can be explicitly offloaded by the user
  - User code is modified to include the relevant offload statements
  - Allows control over which functions are offloaded
- Data transfers to the Phi can be dissociated from function offloading, allowing data to remain on the Phi
  - The user is responsible for data movement
  - Reduces the penalty of offloading data by allowing its use by multiple offloaded function calls before returning to the host
- Effort is required by the user to re-code the application
Native Applications
- Users may choose to port their entire application to run natively on the Phi
  - Code is cross-compiled for the Phi and launched directly on the card
  - No offload statements and no host/device transfers during the run
  - The whole application (code and data) must fit within the Phi's own memory
- Effort is required by the user to build and tune the application for the Phi
Performance Examples and Lessons
- Sandy Bridge CPUs (typically using 32 threads)
- Knights Corner Phi processor (typically using 240 threads)
Hierarchical Cluster Analysis (g03ec)
[Chart: time (s) vs problem size (n); series: 32 threads original, Phi offload original, Phi offload opt, 32 threads opt]
- n = 30,000; m = 3,000
- Xeon 32t: 1,412 s
- Phi 240t*: 1,259 s
- Xeon 32t*: 1,073 s
- For this size of problem it is best to stay on the CPU - but take the 25%!
Uniform RNG - Mersenne Twister (g05sa)
[Chart: time (s) vs problem size (n, log scale); series: 8 threads original, Native Phi original, Native Phi opt, 8 threads opt]
- n = 500m
- Xeon 8t: 0.25 s
- Phi 240t*: 0.08 s
- Xeon 8t*: 0.22 s
- Phi gain ~3x
Maximum Likelihood Estimates (g03ca)
[Chart: time (s) vs weighted problem size; series: 32 threads original, Phi offload original, Phi offload opt, 32 threads opt]
- n = 2,500; m = 2,500; nfac = 30; nvar = 200
- Xeon 32t: 256 s
- Phi 240t*: 53.6 s
- Xeon 32t*: 54.7 s
- Phi gain ~4x, but also a Xeon speed-up (green line under red)
Solve real symmetric positive definite simultaneous linear equations using iterative refinement (f04af)
[Chart: time (s) vs problem size (n); series: 32 threads original, Phi offload original, Phi offload opt, 32 threads opt]
- n = 6,000; nrhs = 1,000
- Xeon 32t: 171 s
- Phi 240t*: 66 s
- Xeon 32t*: 86 s
- Phi gain ~1.3x (~3x over the original code)
Summary
- Parallelism is a real issue we all face
  - Exciting for some, challenging for others!
- Accelerators are interesting and can offer spectacular wins
- The Intel Phi claims less spectacular performance gains
  - But it demands less effort than other accelerators
  - … and the work often pays off on the CPU as well!
- The acid test is always solving your (complete) problem!
- NAG can help you try out this technology
  - NAG Library for Phi
  - NAG expertise