Do Multicore ao Manycore: Práticas de Configuração, Compilação e Execução no coprocessador Intel® Xeon Phi™ (From Multicore to Manycore: Configuration, Compilation, and Execution Practices on the Intel® Xeon Phi™ Coprocessor) - Intel Software Conference 2013

Talk given by Luciano Palma at the Intel Software Conference on August 6 (NCC/UNESP/SP) and August 12 (COPPE/UFRJ/RJ).

Transcript

  • 1. HW and SW Architecture of the Intel® Xeon Phi™ Coprocessor. Leo Borges (leonardo.borges@intel.com), Intel, Software and Services Group. iStep-Brazil, August 2013.
  • 2. Agenda: Introduction; High-level overview of the Intel® Xeon Phi™ platform: Hardware and Software; Performance and Thread Parallelism; Conclusions & References.
  • 3. Efficient vectorization, threading, and parallel execution drive higher performance for many applications. [Chart: theoretical acceleration of a highly parallel Intel® Xeon Phi™ coprocessor versus a standard multi-core Intel® Xeon® processor, plotted against fraction parallel (0-100%) and % vector.] Big gains for selected applications that scale to manycore, parallelize, and vectorize: medical imaging and biophysics; computer-aided design & manufacturing; climate modeling & weather prediction; financial analyses, trading; energy & oil exploration; digital content creation.
  • 4. Evaluating your applications for Intel® Xeon Phi™. Decision flow: Can your workload scale to over 100 threads? Can it benefit from large vectors? Can it benefit from more memory bandwidth? If yes, use Intel® Xeon Phi™ coprocessors: they suit applications that scale with threads, vectors, and memory bandwidth.
  • 5. Agenda: Introduction; High-level overview of the Intel® Xeon Phi™ platform: Hardware and Software; Performance and Thread Parallelism; Conclusions & References.
  • 6. Intel® Xeon Phi™ product family, based on the Intel Many Integrated Core (MIC, pronounced "Mike") architecture for highly parallel applications:
- Based on a large number of smaller, low-power Intel Architecture cores
- 512-bit wide vector engine
- Complements the Intel Xeon processor product line
- Provides breakthrough performance for highly parallel apps: familiar x86 programming model; the same source code supports both Intel Xeon processors and Intel Xeon Phi coprocessors; initially a coprocessor in a PCI Express form factor
First products announced at SC12, code-named Knights Corner (KNC): up to 61 cores with 4 threads per core; up to 16GB GDDR5 memory (up to 352 GB/s); 225-300W (both passive and active cooling SKUs); x16 PCIe form factor (requires an IA host).
  • 7. Each Intel® Xeon Phi™ coprocessor core is a fully functional multi-threaded execution unit:
- >50 in-order cores, ring interconnect, 64-bit addressing
- Scalar unit based on the Intel® Pentium® processor family
- Two pipelines: dual issue with scalar instructions; one-per-clock scalar pipeline throughput; 4-clock latency from issue to resolution
- 4 hardware threads per core: each thread issues instructions in turn, and round-robin execution hides scalar-unit latency
[Core diagram: ring, scalar and vector registers, 512K L2 cache, 32K L1 I-cache, 32K L1 D-cache, instruction decode, vector unit, scalar unit.]
  • 8. Each Intel® Xeon Phi™ coprocessor core also has a fully functional vector unit:
- Optimized for single and double precision
- An all-new vector unit with 512-bit SIMD instructions (not Intel® SSE, MMX™, or Intel® AVX)
- 32 vector registers, each 512 bits wide, holding 16 singles or 8 doubles per register
- Fully coherent L1 and L2 caches
Takeaway: vectorization is important.
  • 9. Individual cores are tied together via fully coherent caches into a bidirectional ring:
- L1: 32K I- and D-cache per core, 3-cycle access, up to 8 concurrent accesses
- L2: 512K cache per core, 11-cycle best access, up to 32 concurrent accesses
- GDDR5 memory: 16 memory channels at up to 5.5 GT/s, 16 GB, ~300 ns access
- Bidirectional ring: 115 GB/s; a Distributed Tag Directory (DTD) reduces ring snoop traffic; the PCIe port has its own ring stop
Takeaway: parallelization and data placement are important.
  • 10. Each Xeon Phi can be addressed as an individual node in the cluster, with 6 to 16 GB of GDDR5 memory.
  • 11. Intel® Xeon Phi™ coprocessor families:
- 3 Family (3120P, 3120A): outstanding parallel-computing solution, performance/$ leadership; 6GB GDDR5, 240 GB/s, >1 TFlops DP
- 5 Family (5120P, 5120D): optimized for high-density environments, performance/watt leadership; 8GB GDDR5, >300 GB/s, >1 TFlops DP
- 7 Family (7120P, 7120X): highest level of features, performance leadership; 16GB GDDR5, 352 GB/s, >1.2 TFlops DP, Turbo
  • 12. Agenda: Introduction; High-level overview of the Intel® Xeon Phi™ platform: Hardware and Software; Performance Considerations; Performance and Thread Parallelism; Conclusions & References.
  • 13. Reminder: vectorization, what is it?
for (i=0;i<=MAX;i++) c[i]=a[i]+b[i];
Scalar: one instruction, one mathematical operation (C = A + B).
Vector: one instruction, eight mathematical operations(1): c[i..i+7] = a[i..i+7] + b[i..i+7].
(1) The number of operations per instruction varies based on which SIMD instruction is used and the width of the operands.
Vectorization is core-level parallelism.
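To make the loop above concrete, here is a minimal sketch, assuming the 2013-era Intel C/C++ compiler: #pragma simd asks the compiler to vectorize the loop, and the -vec-report2 option reports whether it succeeded (the file name and initialization values are illustrative):

#include <stdio.h>
#define MAX 1024

int main(void) {
    float a[MAX], b[MAX], c[MAX];
    for (int i = 0; i < MAX; i++) { a[i] = (float)i; b[i] = 2.0f * i; }  /* initialize inputs */
    #pragma simd                    /* ask the compiler to vectorize this loop */
    for (int i = 0; i < MAX; i++)
        c[i] = a[i] + b[i];         /* one 512-bit instruction covers 16 floats on Xeon Phi */
    printf("c[10] = %f\n", c[10]);
    return 0;
}

/* Possible build lines:
   icc -vec-report2 -o vecadd vecadd.c         (host)
   icc -mmic -vec-report2 -o vecadd vecadd.c   (native coprocessor) */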
  • 14. SIMD vector instructions per family:
Instruction | Instruction Width | Operand Width | Operations per Instruction | Family
SSE         | 128-bit           | 32-bit (SP)   | 4                          | Westmere
SSE         | 128-bit           | 64-bit (DP)   | 2                          | Westmere
AVX         | 256-bit           | 32-bit (SP)   | 8                          | Sandy Bridge
AVX         | 256-bit           | 64-bit (DP)   | 4                          | Sandy Bridge
MIC ISA     | 512-bit           | 32-bit (SP)   | 16                         | Xeon Phi
MIC ISA     | 512-bit           | 64-bit (DP)   | 8                          | Xeon Phi
Each generation doubles (2X) the operations per instruction.
  • 15. Theoretical peak flops on Xeon and Xeon Phi.
Sandy Bridge/Ivy Bridge: two 256-bit SIMD operations per cycle; 8 MUL (32b) and 8 ADD (32b) = 16 single-precision flops/cycle; 4 MUL (64b) and 4 ADD (64b) = 8 double-precision flops/cycle.
Theoretical peak for a 2-socket E5-2697 v2 (12 cores @ 2.7 GHz):
16 [flops/cycle] * 2 [sockets] * 12 [cores] * 2.7 [Gcycles/sec] = 1036.8 [Gflops/sec] SP
8 [flops/cycle] * 2 [sockets] * 12 [cores] * 2.7 [Gcycles/sec] = 518.4 [Gflops/sec] DP
Xeon Phi: one 512-bit SIMD FMA per cycle; 16 MUL (32b) and 16 ADD (32b) = 32 single-precision flops/cycle; 8 MUL (64b) and 8 ADD (64b) = 16 double-precision flops/cycle.
Theoretical peak for a KNC 7120X (61 cores @ 1.24 GHz):
32 [flops/cycle] * 61 [cores] * 1.24 [Gcycles/sec] = 2420.5 [Gflops/sec] SP
16 [flops/cycle] * 61 [cores] * 1.24 [Gcycles/sec] = 1210.2 [Gflops/sec] DP
  • 16. Theoretical memory bandwidth on Xeon and Xeon Phi.
Basic rule: BW [bytes/sec] = [bytes/channel] * memory frequency [Gcycles/sec] * number of channels * number of sockets.
Sandy Bridge/Ivy Bridge (4 channels, 2 sockets, 1600/1866 MHz memory):
8 * 1.600 * 4 * 2 = 102 GB/s peak (ST: 80 GB/s) on SNB-EP
8 * 1.866 * 4 * 2 = 120 GB/s peak (ST: 90 GB/s) on IVB-EP
Xeon Phi (16 channels, 5.5 GT/s memory):
4 [bytes/channel] * 5.5 [GT/s] * 16 [channels] = 352 GB/s peak (ST: 170 GB/s, ECC enabled) on the KNC 7120X
  • 17. Synthetic benchmarks: Intel® Xeon Phi™ coprocessor and Intel® MKL (higher is better; 2S Intel® Xeon® vs. Intel Xeon Phi with ECC on):
STREAM Triad: 75 vs. 171 GB/s (up to 2.2X)
SMP Linpack: 330 vs. 802 GF/s (up to 2.4X; 75% efficient)
DGEMM: 347 vs. 887 GF/s (up to 2.5X; 83% efficient)
SGEMM: 728 vs. 1,796 GF/s (up to 2.4X; 84% efficient)
Notes: 1. Intel® Xeon® processor E5-2680 used for all; SGEMM matrix 12800 x 12800, DGEMM matrix 10752 x 10752, SMP Linpack matrix 26000 x 26000. 2. Intel® Xeon Phi™ coprocessor SE10P (ECC on) with "Gold" SW stack; SGEMM matrix 12800 x 12800, DGEMM matrix 12800 x 12800, SMP Linpack matrix 26872 x 28672. 3. Average single-node results from measurements across a set of nodes of the Stampede* cluster at the Texas Advanced Computing Center (TACC), University of Texas at Austin. Coprocessor results: benchmark run 100% on the coprocessor, with no help from the Intel® Xeon® processor host (aka native).
  • 18. Agenda: Introduction; High-level overview of the Intel® Xeon Phi™ platform: Hardware and Software; Native, Offload and Variations; Performance and Thread Parallelism; Conclusions & References.
  • 19. Wide spectrum of execution models, from multicore-centric (Intel® Xeon® processors) to many-core-centric (Intel® Many Integrated Core coprocessors):
- Multi-core-hosted: general-purpose serial and parallel computing
- Offload: codes with highly parallel phases
- Symmetric: codes with balanced needs
- Many-core-hosted: highly parallel codes
A range of models to meet application needs.
  • 20. The Intel® Manycore Platform Software Stack (MPSS) provides Linux* on the coprocessor. [Diagram: host processor and Intel® Xeon Phi™ coprocessor connected over the PCI-E bus; each side runs user-level and system-level code on a Linux* OS, with Intel® Xeon Phi™ coprocessor support libraries, tools, drivers, and communication and application-launch support.]
  • 21. The coprocessor runs either as an accelerator for offloaded host computation: a host-side offload application (user code over offload libraries, a user-level driver, and user-accessible APIs and libraries) works with a target-side offload application (user code over offload libraries and user-accessible APIs and libraries). Advantages: more memory available; better file access; the host is better on serial code; better use of resources.
  • 22. ...or runs as a native or MPI* compute node via IP or OFED: an ssh or telnet connection to the coprocessor's IP address provides a virtual terminal session, and a target-side "native" application runs on user code and standard OS libraries plus any 3rd-party or Intel libraries, with access to the IB fabric. Advantages: simpler model; no directives; easier port; good kernel test. Use if: not serial; modest memory; complex code.
  • 23. The Intel® Xeon Phi™ coprocessor becomes a network node: the Intel® Xeon Phi™ architecture plus Linux enables IP addressability, with a virtual network connection between each Intel® Xeon® processor and its Intel® Xeon Phi™ coprocessor.
  • 24. Flexible: enables multiple programming models:
- Coprocessor only: homogeneous network of many-core CPUs (data and MPI among coprocessors)
- Host + Offload: heterogeneous network of homogeneous CPUs (MPI among hosts, offload to coprocessors)
- Symmetric: homogeneous network of heterogeneous nodes (data and MPI among both CPUs and coprocessors)
  • 25. Xeon Phi can work as a node: the Intel® Manycore Platform Software Stack (Intel® MPSS) provides Linux* on the coprocessor. Authenticated users can treat it like another node; add -mmic to compiles to create native programs; Intel MPSS supplies a virtual FS and native execution.
icc -O3 -g -mmic -o nativeMIC myNativeProgram.c
sudo scp /opt/intel/composerxe/lib/mic/libiomp5.so root@mic0:/lib64
scp native.exe mic0:/tmp
ssh mic0 "/tmp/native.exe <my-args>"
ssh mic0 top
Mem: 298016K used, 7578640K free, 0K shrd, 0K buff, 100688K cached
CPU: 0.0% usr 0.3% sys 0.0% nic 99.6% idle 0.0% io 0.0% irq 0.0% sirq
Load average: 1.00 1.04 1.01 1/2234 7265
PID  PPID USER    STAT VSZ   %MEM CPU %CPU COMMAND
7265 7264 fdkew   R    7060  0.0  14  0.3  top
43   2    root    SW   0     0.0  13  0.0  [ksoftirqd/13]
5748 1    root    S    119m  1.5  226 0.0  ./sep_mic_server3.8
5670 1    micuser S    97872 1.2  0   0.0  /bin/coi_daemon --coiuser=micuser
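A minimal native OpenMP program to exercise the flow above (the file name mic_hello.c is hypothetical; this assumes MPSS is running, mic0 is reachable, and libiomp5.so has been copied to the card as shown):

#include <stdio.h>
#include <omp.h>

int main(void) {
    #pragma omp parallel
    {
        #pragma omp single   /* one thread reports the team size */
        printf("Running with %d OpenMP threads\n", omp_get_num_threads());
    }
    return 0;
}

/* icc -openmp -mmic -o mic_hello mic_hello.c
   scp mic_hello mic0:/tmp
   ssh mic0 /tmp/mic_hello */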
  • 26. Compiler-assisted offload: examples. Xeon Phi can work as a coprocessor: offload a section of code, or any function call, to it.
Offload a code section:
#pragma offload target(mic) in(transa, transb, N, alpha, beta) \
  in(A:length(matrix_elements)) in(B:length(matrix_elements)) \
  in(C:length(matrix_elements)) \
  out(C:length(matrix_elements) alloc_if(0))
{
  sgemm(&transa, &transb, &N, &N, &N, &alpha, A, &N, B, &N, &beta, C, &N);
}
Offload an OpenMP* loop:
float pi = 0.0f;
#pragma offload target(mic)
#pragma omp parallel for reduction(+:pi)
for (i=0; i<count; i++) {
  float t = (float)((i+0.5f)/count);
  pi += 4.0f/(1.0f+t*t);
}
pi /= count;
  • 27. Compiler-assisted offload: an example in Fortran:
!DEC$ ATTRIBUTES OFFLOAD : TARGET( MIC ) :: SGEMM
!DEC$ OMP OFFLOAD TARGET( MIC ) &
!DEC$ IN( TRANSA, TRANSB, M, N, K, ALPHA, BETA, LDA, LDB, LDC ), &
!DEC$ IN( A: LENGTH( NCOLA * LDA )), &
!DEC$ IN( B: LENGTH( NCOLB * LDB )), &
!DEC$ INOUT( C: LENGTH( N * LDC ))
CALL SGEMM( TRANSA, TRANSB, M, N, K, ALPHA, &
            A, LDA, B, LDB, BETA, C, LDC )
  • 28. Offload directives are independent of function boundaries (host: Intel® Xeon® processor; target: Intel® Xeon Phi™ coprocessor). Execution: if the target is available at the first offload, the target program is loaded; at each offload, if the target is available the statement runs on the target, else it runs on the host; at program termination the target program is unloaded.
f() {
  #pragma offload
  a = b + g();
  h();
}
__attribute__((target(mic))) g() { ... }
h() { ... }
The compiler splits f(): the offloaded statement a = b + g() becomes a target-side part, g() is compiled for both sides, and h() stays on the host.
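A self-contained sketch of the pattern on this slide (function and variable names are illustrative): g() is compiled for both sides, and the offloaded statement falls back to the host when no coprocessor is available:

#include <stdio.h>

__attribute__((target(mic)))   /* compiled for host and coprocessor */
int g(int x) { return 2 * x; }

int main(void) {
    int a = 0, b = 20;
    /* Runs on the coprocessor if one is available, else on the host. */
    #pragma offload target(mic) in(b) out(a)
    {
        a = b + g(1);
    }
    printf("a = %d\n", a);   /* prints a = 22 either way */
    return 0;
}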
  • 29. Example: share work between coprocessor and host using OpenMP*:
omp_set_nested(1);
#pragma omp parallel private(ip)       /* top level, runs on host */
{
  #pragma omp sections
  {
    #pragma omp section
    /* use pointer to copy back only part of the potential array, to avoid overwriting the host */
    #pragma offload target(mic) in(xp) in(yp) in(zp) out(ppot:length(np1))
    #pragma omp parallel for private(ip)   /* runs on coprocessor */
    for (i=0;i<np1;i++) {
      ppot[i] = threed_int(x0,xn,y0,yn,z0,zn,nx,ny,nz,xp[i],yp[i],zp[i]);
    }
    #pragma omp section
    #pragma omp parallel for private(ip)   /* runs on host */
    for (i=0;i<np2;i++) {
      pot[i+np1] = threed_int(x0,xn,y0,yn,z0,zn,nx,ny,nz,xp[i+np1],yp[i+np1],zp[i+np1]);
    }
  }
}
  • 30. Pragmas and directives mark data and code to be offloaded and executed (a push/pop sketch follows below).
C/C++ syntax:
- Offload pragma: #pragma offload <clauses> <statement> (allow the next statement to execute on the coprocessor or the host CPU)
- Variable/function offload properties: __attribute__((target(mic))) (compile a function for, or allocate a variable on, both host CPU and coprocessor)
- Entire blocks of data/code definitions: #pragma offload_attribute(push, target(mic)) ... #pragma offload_attribute(pop) (mark entire files or large blocks of code to compile for both host CPU and coprocessor)
Fortran syntax:
- Offload directive: !dir$ omp offload <clauses> <statement> (execute an OpenMP* parallel block on the coprocessor); !dir$ offload <clauses> <statement> (execute the next statement or function on the coprocessor)
- Variable/function offload properties: !dir$ attributes offload:<mic> :: <ret-name> OR <var1,var2,...> (compile a function or variable for CPU and coprocessor)
- Entire code blocks: !dir$ offload begin <clauses> ... !dir$ end offload
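A short sketch of the C/C++ push/pop form (names are illustrative): everything between the two pragmas is compiled for both host and coprocessor, so the variable and function can be used inside an offload region:

#include <stdio.h>

#pragma offload_attribute(push, target(mic))
static float factor = 2.5f;                   /* allocated on both sides */
float scale(float x) { return factor * x; }   /* callable on both sides */
#pragma offload_attribute(pop)

int main(void) {
    float r = 0.0f;
    #pragma offload target(mic) out(r)
    r = scale(3.0f);
    printf("r = %f\n", r);   /* 7.5 */
    return 0;
}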
  • 31. Options on offloads can control data copying and manage coprocessor dynamic allocation.
Clauses:
- Multiple coprocessors: target(mic[:unit]) (select specific coprocessors)
- Conditional offload: if (condition) / mandatory (select coprocessor or host compute)
- Inputs: in(var-list [modifiers]) (copy from host to coprocessor)
- Outputs: out(var-list [modifiers]) (copy from coprocessor to host)
- Inputs & outputs: inout(var-list [modifiers]) (copy host to coprocessor and back when the offload completes)
- Non-copied data: nocopy(var-list [modifiers]) (data is local to the target)
Modifiers:
- length(N): copy N elements of the pointer's type
- alloc_if(bool): allocate coprocessor space on this offload (default: TRUE)
- free_if(bool): free coprocessor space at the end of this offload (default: TRUE)
- align(N bytes): specify minimum memory alignment on the coprocessor
- alloc(array-slice) into (var-expr): enables partial array allocation and data copy into other variables & ranges
  • 32. Data persistence with compiler offload:
__declspec(target(mic)) static float *A, *B, *C, *C1;
// Transfer matrices A, B, and C to the coprocessor; do not de-allocate A and B
#pragma offload target(mic) in(transa, transb, M, N, K, alpha, beta, LDA, LDB, LDC) \
  in(A:length(NCOLA * LDA) free_if(0)) \
  in(B:length(NCOLB * LDB) free_if(0)) \
  inout(C:length(N * LDC))
{
  sgemm(&transa, &transb, &M, &N, &K, &alpha, A, &LDA, B, &LDB, &beta, C, &LDC);
}
// Transfer matrix C1 to the coprocessor and reuse matrices A and B
#pragma offload target(mic) in(transa1, transb1, M, N, K, alpha1, beta1, LDA, LDB, LDC1) \
  nocopy(A:length(NCOLA * LDA) alloc_if(0) free_if(0)) \
  nocopy(B:length(NCOLB * LDB) alloc_if(0) free_if(0)) \
  inout(C1:length(N * LDC1))
{
  sgemm(&transa1, &transb1, &M, &N, &K, &alpha1, A, &LDA, B, &LDB, &beta1, C1, &LDC1);
}
// Deallocate A and B on the coprocessor
#pragma offload target(mic) \
  nocopy(A:length(NCOLA * LDA) free_if(1)) \
  nocopy(B:length(NCOLB * LDB) free_if(1))
{ }
  • 33. Data persistence with compiler offload, using macros for readability:
#define ALLOC alloc_if(1) free_if(0)
#define REUSE alloc_if(0) free_if(0)
#define FREE  alloc_if(0) free_if(1)
__declspec(target(mic)) static float *A, *B, *C, *C1;
// Transfer matrices A, B, and C to the coprocessor; do not de-allocate A and B
#pragma offload target(mic) in(transa, transb, M, N, K, alpha, beta, LDA, LDB, LDC) \
  in(A:length(NCOLA * LDA) ALLOC) \
  in(B:length(NCOLB * LDB) ALLOC) \
  inout(C:length(N * LDC))
{
  sgemm(&transa, &transb, &M, &N, &K, &alpha, A, &LDA, B, &LDB, &beta, C, &LDC);
}
// Transfer matrix C1 to the coprocessor and reuse matrices A and B
#pragma offload target(mic) in(transa1, transb1, M, N, K, alpha1, beta1, LDA, LDB, LDC1) \
  nocopy(A:length(NCOLA * LDA) REUSE) \
  nocopy(B:length(NCOLB * LDB) REUSE) \
  inout(C1:length(N * LDC1))
{
  sgemm(&transa1, &transb1, &M, &N, &K, &alpha1, A, &LDA, B, &LDB, &beta1, C1, &LDC1);
}
// Deallocate A and B on the coprocessor
#pragma offload_transfer target(mic) \
  nocopy(A:length(NCOLA * LDA) FREE) \
  nocopy(B:length(NCOLB * LDB) FREE)
  • 34. To handle more complex data structures on the coprocessor, use Virtual Shared Memory: an identical range of virtual addresses is reserved on both host and coprocessor, and changes are shared at offload points, allowing:
- Seamless sharing of complex data structures, including linked lists
- Elimination of manual data marshaling and shared-array management
- Freer use of new C++ features and standard classes
[Diagram: a C/C++ executable spans host and coprocessor, with the same virtual address range in host VM and coprocessor VM around the offload code.]
  • 35. Example: Virtual Shared Memory, shared between host and Xeon Phi:
// Shared variable declarations
_Cilk_shared T in1[SIZE];
_Cilk_shared T in2[SIZE];
_Cilk_shared T res[SIZE];
_Cilk_shared void compute_sum() {
  int i;
  for (i=0; i<SIZE; i++) {
    res[i] = in1[i] + in2[i];
  }
}
(...)
// Call compute_sum on the target
_Cilk_offload compute_sum();
  • 36. Virtual Shared Memory uses special allocation to manage data sharing at offload boundaries. Declare virtual shared data with the _Cilk_shared allocation specifier, and allocate virtual dynamic shared data with these special functions:
_Offload_shared_malloc(), _Offload_shared_aligned_malloc(), _Offload_shared_free(), _Offload_shared_aligned_free()
Shared data copying occurs automatically around offload sections: memory is only synchronized on entry to or exit from an offload call, and only modified data blocks are transferred between host and coprocessor. The model allows transfer of C++ objects (pointers remain transportable when they point to "shared" data addresses), and well-known methods (locks, critical sections, etc.) can be used to synchronize access to shared data and prevent data races within offloaded code. This model is integrated with the Intel® Cilk™ Plus parallel extensions. Note: not supported in Fortran; available for C/C++ only.
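A small sketch of the dynamic case (names are illustrative): a shared buffer is allocated with _Offload_shared_malloc, filled on the coprocessor via _Cilk_offload, and read on the host with no explicit copy clauses, since synchronization happens at the offload boundary:

#include <stdio.h>
#include <offload.h>   /* _Offload_shared_malloc, _Offload_shared_free */

_Cilk_shared void fill(int _Cilk_shared *buf, int n) {
    for (int i = 0; i < n; i++)
        buf[i] = i * i;                  /* runs on the target */
}

int main(void) {
    int n = 8;
    int _Cilk_shared *data =
        (int _Cilk_shared *)_Offload_shared_malloc(n * sizeof(int));
    _Cilk_offload fill(data, n);         /* memory syncs at the offload boundary */
    printf("data[7] = %d\n", data[7]);   /* 49, visible on the host */
    _Offload_shared_free(data);
    return 0;
}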
  • 37. Data sharing between host and coprocessor can be enabled using this Intel® Cilk™ Plus syntax:
- Function: int _Cilk_shared f(int x) { return x+1; } (code emitted for host and target; may be called from either side)
- Global: _Cilk_shared int x = 0; (datum is visible on both sides)
- File/function static: static _Cilk_shared int x; (datum visible on both sides, only to code within the file/function)
- Class: class _Cilk_shared x {...}; (class methods, members, and operators available on both sides)
- Pointer to shared data: int _Cilk_shared *p; (p is local, not shared, and can point to shared data)
- A shared pointer: int *_Cilk_shared p; (p is shared and should only point at shared data)
- Entire blocks of code: #pragma offload_attribute(push, _Cilk_shared) ... #pragma offload_attribute(pop) (mark entire files or blocks of code _Cilk_shared using this pragma)
  • 38. Intel® Cilk™ Plus syntax can also specify the offloading of computation to the coprocessor:
- Offloading a function call: x = _Cilk_offload func(y); (func executes on the coprocessor if possible); x = _Cilk_offload_to(card_num) func(y); (func must execute on the specified coprocessor or an error occurs)
- Offloading asynchronously: x = _Cilk_spawn _Cilk_offload func(y); (func executes on the coprocessor; the continuation is available for stealing)
- Offloading a parallel for-loop: _Cilk_offload _Cilk_for(i=0; i<N; i++) { a[i] = b[i] + c[i]; } (the loop executes in parallel on the coprocessor; it is implicitly "un-inlined" as a function call)
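A sketch combining the spawn and offload forms from the table above (the function name is illustrative): the host continues working while the coprocessor computes, and the result is safe to read after the sync:

#include <stdio.h>

_Cilk_shared long sum_to(long n) {   /* emitted for host and target */
    long s = 0;
    for (long i = 0; i < n; i++) s += i;
    return s;
}

int main(void) {
    long r;
    r = _Cilk_spawn _Cilk_offload sum_to(1L << 20);   /* asynchronous offload */
    printf("host continues while the coprocessor computes\n");
    _Cilk_sync;                       /* r is valid from here on */
    printf("r = %ld\n", r);
    return 0;
}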
  • 39. Agenda: Introduction; High-level overview of the Intel® Xeon Phi™ platform: Hardware and Software; Performance and Thread Parallelism; Conclusions & References.
  • 40. A comprehensive set of SW tools for Xeon and Xeon Phi programming:
- Code analysis: Advisor XE, VTune Amplifier XE, Inspector XE, Trace Analyzer
- Programming models: Intel Cilk Plus, Threading Building Blocks, OpenMP, OpenCL, MPI, Offload/Native/MYO
- Libraries & compilers: Math Kernel Library, Integrated Performance Primitives, Intel Compilers
  • 41. Options for thread parallelism, ranging from ease of use / code maintainability toward full programmer control: Intel® Math Kernel Library; OpenMP*; Intel® Threading Building Blocks; Intel® Cilk™ Plus; OpenCL*; Pthreads* and other threading libraries. A choice of unified programming to target both the Intel® Xeon® and Intel® Xeon Phi™ architectures!
  • 42. Agenda: Introduction; High-level overview of the Intel® Xeon Phi™ platform: Hardware and Software; Performance and Thread Parallelism: OpenMP; Conclusions & References.
  • 43. OpenMP* on the coprocessor:
- The basics work just like on the host CPU, for both native and offload models; specify -openmp
- There are 4 hardware thread contexts per core; you need at least 2 x ncore threads for good performance (for all except the most memory-bound workloads; often 3x or 4x the number of available cores is best; this is very different from hyperthreading on the host); -opt-threads-per-core=n advises the compiler how many threads to optimize for
- If you don't saturate all available threads, be sure to set KMP_AFFINITY to control thread distribution
  • 44. Thread affinity interface: allows OpenMP threads to be bound to physical or logical cores via the environment variable KMP_AFFINITY (see the examples below):
- physical: use all physical cores before assigning threads to other logical cores (other hardware thread contexts)
- compact: assign threads to consecutive h/w contexts on the same physical core (e.g., to benefit from a shared cache)
- scatter: assign consecutive threads to different physical cores (e.g., to maximize access to memory)
- balanced: a blend of compact & scatter (currently only available for the Intel® MIC Architecture)
This helps optimize access to memory or cache, and is particularly important if not all available h/w threads are used, else some physical cores may sit idle while others run multiple threads. See the compiler documentation for (much) more detail.
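An illustration of these settings for a native run, reusing the ssh-based launch shown earlier (the thread count and program path are hypothetical):

ssh mic0 "OMP_NUM_THREADS=120 KMP_AFFINITY=scatter  /tmp/native.exe"   # spread across physical cores
ssh mic0 "OMP_NUM_THREADS=120 KMP_AFFINITY=compact  /tmp/native.exe"   # pack h/w contexts, share cache
ssh mic0 "OMP_NUM_THREADS=120 KMP_AFFINITY=balanced /tmp/native.exe"   # MIC-only blend of the two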
  • 45. OpenMP defaults. OMP_NUM_THREADS defaults to:
- 1 x ncore for the host (or 2x if hyperthreading is enabled)
- 4 x ncore for native coprocessor applications
- 4 x (ncore-1) for offload applications (one core is reserved for offload daemons and the OS)
Defaults may be changed via environment variables or via API calls on either the host or the coprocessor (a sketch follows below).
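A minimal sketch of overriding the offload default from code, using only standard OpenMP calls (240 is an example value for a 61-core card, where offload otherwise reserves one core):

#include <stdio.h>
#include <omp.h>

int main(void) {
    int nthreads = 0;
    #pragma offload target(mic) out(nthreads)
    {
        omp_set_num_threads(240);    /* override the 4 x (ncore-1) default */
        #pragma omp parallel
        {
            #pragma omp single
            nthreads = omp_get_num_threads();
        }
    }
    printf("coprocessor team size: %d\n", nthreads);
    return 0;
}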
  • 46. Target OpenMP environment (offload). Use target-specific APIs to set values for the coprocessor target only, e.g. omp_set_num_threads_target() (called from the host), omp_set_nested_target(), etc.:
- Protect the calls with #ifdef __INTEL_OFFLOAD (undefined with -no-offload)
- Fortran: USE MIC_LIB and OMP_LIB; C: #include <offload.h>
Or define MIC-specific versions of environment variables using MIC_ENV_PREFIX=MIC (no underscore); values on MIC then no longer default to the values on the host. Set values specific to MIC using:
export MIC_OMP_NUM_THREADS=120 (all cards)
export MIC_2_OMP_NUM_THREADS=180 (for card #2, etc.)
export MIC_3_ENV="OMP_NUM_THREADS=240|KMP_AFFINITY=balanced"
  • 47. Agenda: Introduction; High-level overview of the Intel® Xeon Phi™ platform: Hardware and Software; Performance and Thread Parallelism: MKL; Conclusions & References.
  • 48. [Image-only slide.]
  • 49. MKL usage models on the Intel® Xeon Phi™ coprocessor (a sketch of Automatic Offload follows below):
- Automatic Offload: no code changes required; automatically uses both host and target; transparent data transfer and execution management
- Compiler Assisted Offload: explicit control of data transfer and remote execution using compiler offload pragmas/directives; can be used together with Automatic Offload
- Native Execution: uses the coprocessors as independent nodes; input data is copied to the targets in advance
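A sketch of Automatic Offload, assuming a 2013-era Intel MKL with one coprocessor present: mkl_mic_enable() turns AO on from code (setting MKL_MIC_ENABLE=1 in the environment does the same with no code change), after which sufficiently large DGEMMs are split between host and card automatically; the matrix size here is just an example:

#include <stdio.h>
#include <mkl.h>

int main(void) {
    int n = 4096;
    double *A = (double *)mkl_malloc((size_t)n * n * sizeof(double), 64);
    double *B = (double *)mkl_malloc((size_t)n * n * sizeof(double), 64);
    double *C = (double *)mkl_malloc((size_t)n * n * sizeof(double), 64);
    for (long i = 0; i < (long)n * n; i++) { A[i] = 1.0; B[i] = 2.0; C[i] = 0.0; }

    if (mkl_mic_enable() != 0)                    /* request Automatic Offload */
        printf("AO unavailable; running on the host only\n");

    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, n, n, n,
                1.0, A, n, B, n, 0.0, C, n);      /* may be split host/coprocessor */

    printf("C[0] = %.1f\n", C[0]);                /* expect 2*n = 8192.0 */
    mkl_free(A); mkl_free(B); mkl_free(C);
    return 0;
}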
  • 50. MKL execution models, from multicore-centric (Intel® Xeon®) to many-core-centric (Intel® Xeon Phi™): multicore-hosted (general-purpose serial and parallel computing), offload (codes with highly parallel phases), symmetric (codes with balanced needs), and many-core-hosted (highly parallel codes).
  • 51. Work division control in MKL Automatic Offload (see the snippet below):
- API call: mkl_mic_set_workdivision(MKL_TARGET_MIC, 0, 0.5) offloads 50% of the computation only to the 1st card
- Environment variable: MKL_MIC_0_WORKDIVISION=0.5 offloads 50% of the computation only to the 1st card
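Continuing the Automatic Offload sketch above, the division can be pinned by inserting these lines before the cblas_dgemm call (the MKL_TARGET_MIC form is taken from this slide; MKL_TARGET_HOST as the host-side counterpart is an assumption, so check the MKL documentation):

/* Keep 50% of AO work on the 1st coprocessor and 50% on the host. */
mkl_mic_set_workdivision(MKL_TARGET_MIC,  0, 0.5);
mkl_mic_set_workdivision(MKL_TARGET_HOST, 0, 0.5);  /* assumed constant */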
  • 52. How to use MKL with Compiler Assisted Offload: the same way you would offload any function call to the coprocessor. An example in C:
#pragma offload target(mic) in(transa, transb, N, alpha, beta) \
  in(A:length(matrix_elements)) in(B:length(matrix_elements)) \
  in(C:length(matrix_elements)) \
  out(C:length(matrix_elements) alloc_if(0))
{
  sgemm(&transa, &transb, &N, &N, &N, &alpha, A, &N, B, &N, &beta, C, &N);
}
  • 53. Agenda: Introduction; High-level overview of the Intel® Xeon Phi™ platform: Hardware and Software; Performance and Thread Parallelism; Conclusions & References.
  • 54. Conclusions. Intel® Xeon Phi™ coprocessor advantages:
- Comparable performance potential to other accelerators
- Faster time to solution due to reduced development effort
- Better investment protection with a single code base for processors and coprocessors
A flexible and wide range of programming models, from pure native to offloaded, and all variants in between, all with the familiar Intel development environment.
  • 55. Intel® Xeon Phi™ Coprocessor developer site: http://software.intel.com/mic-developer. One-stop shop for: tools & software downloads; getting started; development guides; video workshops, tutorials, & events; code samples & case studies; articles, forums, & blogs; associated product links.
  • 56. Obrigado. (Thank you.)
  • 57. Legal Disclaimer & Optimization Notice. INFORMATION IN THIS DOCUMENT IS PROVIDED "AS IS". NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. Copyright © , Intel Corporation. All rights reserved. Intel, the Intel logo, Xeon, Xeon Phi, Core, VTune, and Cilk are trademarks of Intel Corporation in the U.S. and other countries. Optimization Notice: Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #20110804