Intel Technologies for High Performance Computing

Leo Borges

Intel Software Conference 2014 Brazil
May 2014


Intel Technologies for High Performance Computing: Presentation Transcript

  • Intel Technologies for High Performance Computing Leo Borges Intel Software Conference 2014 Brazil May 2014
  • Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others. Optimization Notice Legal Disclaimers Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. Relative performance is calculated by assigning a baseline value of 1.0 to one benchmark result, and then dividing the actual benchmark result for the baseline platform into each of the specific benchmark results of each of the other platforms, and assigning them a relative performance number that correlates with the performance improvements reported. Intel does not control or audit the design or implementation of third party benchmarks or Web sites referenced in this document. Intel encourages all of its customers to visit the referenced Web sites or others where similar performance benchmarks are reported and confirm whether the referenced benchmarks are accurate and reflect performance of systems available for purchase. Intel® Hyper-Threading Technology Available on select Intel® Xeon® processors. Requires an Intel® HT Technology-enabled system. Consult your PC manufacturer. Performance will vary depending on the specific hardware and software used. For more information including details on which processors support HT Technology, visit http://www.intel.com/info/hyperthreading. Intel® Turbo Boost Technology requires a Platform with a processor with Intel Turbo Boost Technology capability. 
Intel Turbo Boost Technology performance varies depending on hardware, software and overall system configuration. Check with your platform manufacturer on whether your system delivers Intel Turbo Boost Technology. For more information, see http://www.intel.com/technology/turboboost. Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor series, not across different processor sequences. See http://www.intel.com/products/processor_number for details. Intel products are not intended for use in medical, life saving, life sustaining, critical control or safety systems, or in nuclear facility applications. All dates and products specified are for planning purposes only and are subject to change without notice. Intel product plans in this presentation do not constitute Intel plan of record product roadmaps. Please contact your Intel representative to obtain Intel’s current plan of record product roadmaps. Product plans, dates, and specifications are preliminary and subject to change without notice. Intel, the Intel logo, Xeon and Xeon logo, Xeon Phi and Xeon Phi logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.
  • Building Blocks. Many Product Families – Today’s talk: HPC Focus. Product families and building blocks targeting an array of segments.
    Processors (Haswell, v3): E3-1200 v3; E5-1600 v3, E5-2600 v3 (E5-2400 v3 for Comms & Storage only), E5-4600 v3; E7-2800 v3, E7-4800 v3, E7-8800 v3. Label on the chart: Efficient Performance.
    Other building blocks: Boards/PDKs, Software, SSDs, LAN, RAID, Co-processors, Storage, Networking.
    Segments: Channel, Enterprise, HPC, Mission Critical, Big Data, Public Cloud, Cloud Storage.
    Note: For discussion purposes only (not intended to be interpreted as portfolio recommendations or guidance).
  • Objectives of this session: recall of a few basics for HPC (what to expect from your code, what to expect from the hardware); review vectorization; Xeon + Xeon Phi example.
  • Review of a few HPC basics for non-ninja programmers
  • How it works and where are the bottlenecks. [Diagram: two nodes, each with CPU, L1, L2, L3 and memory, joined through I/O and an interconnect.] Questions raised at each level: Core count, size & perf? Cache size, BW & latency? Memory size, BW & latency? Intra / inter socket communications? Inter-node communication?
  • Unfortunately, you need to be aware. Bandwidth, latency and capacity all change on the way from the core to the I/O subsystem. [Diagram: CPU, L1, L2, L3, memory.] Levels named on the slide: L1, L2, L3, L4, L5 … Ln caches, eDRAM, MCDRAM, DDR, NVM, SSD, PCIe SSD, HDD, tapes.
  • FLOPS and memory bandwidth impact the efficiency & scalability. Performing flops is not an issue; data movement is the issue (BW, latency, power). Efficiency (= achieved flops / peak flops) won’t be high enough if stores and loads are not fast enough (GB/s). First approximation: only a matter of frequency and bandwidth.
      for (i=0;i<=MAX;i++) c[i]= a[i] + b[i]* d[i];
    Per iteration: 3 loads, 1 mul, 1 add, 1 store.
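The "first approximation" on this slide can be sketched in a few lines of C. The helper names and the roofline-style bound `min(peak, intensity × bandwidth)` are illustrative assumptions, not something the slides define; only the per-iteration counts (2 flops, 3 loads, 1 store) come from the loop above.

```c
#include <assert.h>
#include <stddef.h>

/* Flops per byte moved for one loop iteration (hypothetical helper). */
double arithmetic_intensity(int flops, int loads, int stores, size_t elem_bytes)
{
    assert(loads + stores > 0 && elem_bytes > 0);
    return (double)flops / ((double)(loads + stores) * (double)elem_bytes);
}

/* Upper bound on achievable GFlop/s: limited either by the core's peak
 * or by how fast memory can feed it (a roofline-style estimate). */
double attainable_gflops(double peak_gflops, double mem_bw_gbs, double intensity)
{
    double memory_limit = mem_bw_gbs * intensity;
    return memory_limit < peak_gflops ? memory_limit : peak_gflops;
}
```

For the loop above in double precision, `arithmetic_intensity(2, 3, 1, 8)` is 2 flops per 32 bytes, i.e. 0.0625 flop/byte. With the demo laptop figures quoted later in the talk (28.8 GFlop/s double-precision peak, 25.6 GB/s memory bandwidth), `attainable_gflops(28.8, 25.6, 0.0625)` gives about 1.6 GFlop/s, so this kernel would be bandwidth bound at roughly 6% of peak under these assumptions.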
  • Performance expectation: upper bounds. CPU bound (“HPL”): Flops/s demanding applications. Memory bound (“Stream”): BW demanding applications. Real world applications are placed between these two bounds. Analyzing the flop/memory-access ratio gives a first guess for performance prediction.
      • Our performance metrics are Gflop/s and % of peak (efficiency)
      • Elapsed time might not tell the whole story (how far from the peak performance?)
  • Performance expectation: upper bounds (continued). Same spectrum from CPU bound (“HPL”) to memory bound (“Stream”); the question to ask of a real world application: Memory bound? Compute bound?
  • Glossary, “High performance computing”. Peak = number of floating point operations per cycle * frequency, in “Flops/sec”: $\text{Peak} = (\text{Flops}/\text{cycle}) \times (\text{cycles}/\text{sec}) = \text{Flops}/\text{sec}$. “Efficiency = % of the peak performance”. Same for bandwidth (but in GBytes/sec). By the way: what is the peak perf of your laptop?
  • Anatomy of a Computer Platform
  • CPU: Core/Uncore - Designed For Modularity. [Diagram: cores and L3 cache on the core side; IMC (to DRAM), QPI links, and power & clock in the uncore.] Differentiation in the “Uncore”: # cores, size of cache, # mem channels, type of memory, # QPI links, power management, integrated graphics. QPI: Intel® QuickPath Interconnect.
  • Romley EP/EN Platforms: Intel® Xeon® Processor E5-2600 v2/2400 v2 Product Families. [Block diagram: two Intel® Xeon® processors (E5-2400/2600 product family) linked by QPI, each with DDR3 channels and PCIe* 3.0 x8 ports; the Intel® C600 series chipset attaches through DMI2 and a PCIe* 2.0 x4 link and provides 3Gb/s SAS and SATA.]
    Ivy Bridge CPUs: Socket R: up to 12 cores / socket. Socket B2: up to 10 cores / socket.
    QPI: Socket R: 2 QPI links. Socket B2: 1 QPI link.
    Memory: DDR3 & DDR3L RDIMMs & UDIMMs, LR DIMMs. Socket R: 4 channels per socket, up to 3 DPC; speeds up to DDR3 1866. Socket B2: 3 channels per socket, up to 2 DPC; speeds up to DDR3 1600.
    PCI Express* 3.0: Socket R: 40 lanes per socket. Socket B2: 24 lanes per socket. Extra Gen 2 x4 on 2nd CPU.
    Intel® C600 series chipset (Patsburg PCH): optimized server & WS PCH. Integrated storage: up to 8 ports 3Gb/s SAS. RAID 5 optional.
  • IvyBridge (IVB) E5-2600 v2 family. The total benefit (at node level) is given by a combination of factors. [Die diagram: 12 cores (C), LLC cache, memory controller with four DDR3 channels, QPI links, and I/O with PCIe Gen3 x16, x16 and x8.] Demo: wstream.exe
    Feature: Xeon E5-2600 v2
    Process Technology: 22 nm
    Cores/Threads: Up to 12 Cores/24 Threads
    Last-level Cache: Up to 30 MB
    Max Memory Speed (MHz): Up to 1866
    Max DIMM Capacity: 12 Slots/Processor
    PCIe* Lanes / Controllers / Speed: 40 / 10 (PCIe* 3.0 at 8 GT/s)
    TDP (W): 150 (Workstation only), 130, 115, 95, 80, 70, 60
  • E5-2600 v2 Product Family. Socket compatible with SNB-EP top to bottom on the SKU stack. All SKUs, frequencies and features can change without notice.
    Tiers: Advanced, Standard, Basic, Segment Optimized, Low Power, Workstation Only.
    Feature sets shown on the chart:
      8.0 GT/s QPI, DDR3-1866, Intel® HT, Intel® Turbo Boost
      8.0 GT/s QPI, DDR3-1866 (skt R) / DDR3-1600 (skt B2), Intel® HT, Intel® Turbo Boost
      7.2 GT/s QPI, DDR3 1600, Intel® HT, Intel® Turbo Boost
      6.4 GT/s QPI, DDR3 1333, no Intel® HT, no Intel® Turbo
      Low Power: 10C 8.0 GT/s QPI, 6C 7.2 GT/s QPI, DDR3-1600, Intel® HT, Intel® Turbo Boost
    SKUs (cores, TDP, frequency, cache):
      E5-2697 v2: 12C, 130W, 2.7GHz, 30M
      E5-2695 v2: 12C, 115W, 2.4GHz, 30M
      E5-2690 v2: 10C, 130W, 3.0GHz, 25M
      E5-2680 v2: 10C, 115W, 2.8GHz, 25M
      E5-2670 v2: 10C, 115W, 2.5GHz, 25M
      E5-2660 v2: 10C, 95W, 2.2GHz, 25M
      E5-2650 v2: 8C, 95W, 2.6GHz, 20M
      E5-2640 v2: 8C, 95W, 2.0GHz, 20M
      E5-2630 v2: 6C, 80W, 2.6GHz, 15M
      E5-2620 v2: 6C, 80W, 2.1GHz, 15M
      E5-2609 v2: 4C, 80W, 2.5GHz, 10M
      E5-2603 v2: 4C, 80W, 1.8GHz, 10M
      E5-2667 v2: 8C, 130W, 3.3GHz, 25M
      E5-2643 v2: 6C, 130W, 3.5GHz, 25M
      E5-2637 v2: 4C, 130W, 3.5GHz, 15M
      E5-2650L v2: 10C, 70W, 1.7GHz, 25M
      E5-2630L v2: 6C, 60W, 2.4GHz, 15M
      E5-2687W v2: 8C, 150W, 3.4GHz, 20M
  • Parallel Programming for Intel® Architecture (or, in general, for normal CPUs). 4 considerations when writing an efficient, unconstrained parallel program:
    Cores: OpenMP, TBB, Cilk Plus; threads, locks.
    Vectors: vector loops, vector functions; array notations; intrinsics.
    Memory, caches: blocking algorithms.
    Data layout and alignment: manual layout, ugly code; AoS → SoA library; directives for alignment.
    All of these are supported by Performance Analysis.
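The "AoS → SoA" item in the data layout column can be illustrated with a small sketch. The type and function names below are hypothetical and chosen only for this example; the slides name the technique but show no code for it.

```c
#include <assert.h>
#include <stddef.h>

#define N 1024  /* capacity of the SoA container, arbitrary for this sketch */

/* Array of Structures: x, y, z of one point sit next to each other, so a
 * loop that only needs x skips over y and z on every step. */
struct point_aos { float x, y, z; };

/* Structure of Arrays: all x values are contiguous (unit stride), which is
 * the access pattern vector loads handle best. */
struct points_soa { float x[N], y[N], z[N]; };

float sum_x_aos(const struct point_aos *p, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++)
        sum += p[i].x;          /* strided access: one struct per step */
    return sum;
}

float sum_x_soa(const struct points_soa *p, size_t n)
{
    assert(n <= N);
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++)
        sum += p->x[i];         /* unit-stride access */
    return sum;
}
```

Both functions compute the same result; the difference is only in how many useful bytes each cache line and each vector load delivers.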
  • “SIMDization”, so-called Vectorization. Single Instruction Multiple Data (SIMD): processing a vector with a single operation; provides data level parallelism (DLP). Vector: consists of more than one element; elements are of the same scalar data type (e.g. floats, integers, …). [Diagram: scalar processing computes one C = A + B per operation; vector processing computes VL elements Ci = Ai + Bi per operation.]
  • Vectorization of Code
      • Transform sequential code to exploit vector processing capabilities (SIMD)
        – Manually by explicit syntax
        – Automatically by tools like a compiler
      for(i = 0; i <= MAX;i++) c[i] = a[i] + b[i];
    [Diagram: the scalar loop adds a[i] + b[i] into c[i] one element at a time; the vectorized loop adds a[i] … a[i+7] and b[i] … b[i+7] into c[i] … c[i+7] with one operation.]
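The eight-element picture on this slide corresponds to a strip-mined loop. The sketch below spells that structure out in plain C; the function name, the `VL` constant and the use of `restrict` are assumptions made for illustration, and a real compiler would emit one vector instruction for the inner lane loop rather than execute it as written.

```c
#include <assert.h>
#include <stddef.h>

#define VL 8  /* elements per vector: 8 floats fill one 256-bit register */

/* c[i] = a[i] + b[i], written the way a vectorizer restructures it:
 * a main loop that handles VL elements per step plus a scalar remainder.
 * `restrict` promises the compiler that the arrays do not overlap. */
void add_strip_mined(const float *restrict a, const float *restrict b,
                     float *restrict c, size_t n)
{
    assert(a && b && c);
    size_t i = 0;
    for (; i + VL <= n; i += VL) {
        /* One vector add in compiled code; spelled out per lane here. */
        for (size_t lane = 0; lane < VL; lane++)
            c[i + lane] = a[i + lane] + b[i + lane];
    }
    for (; i < n; i++)          /* remainder: fewer than VL elements left */
        c[i] = a[i] + b[i];
}
```

Calling `add_strip_mined(a, b, c, 19)` processes two full chunks of eight elements and then three remainder elements.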
  • Reminder about the peak flops: Intel® SandyBridge/Ivy Bridge micro-architecture. [Diagram: scheduler issuing to ports 0, 1 and 5 (ALUs, jump, divide, SSE and AVX FP multiply, add, shuffle, blend and boolean units, vector-integer multiply and add) and to ports 2, 3 and 4 (load, store address, store data); register widths 0, 63, 127, 255; port names as used by Intel® Architecture Code Analyzer.]
    6 instructions / cycle: 3 memory ops and 3 computational operations.
    Nehalem/Westmere: two 128-bit SIMD per cycle. 4 MUL (32b) and 4 ADD (32b): 8 single precision flops / cycle. 2 MUL (64b) and 2 ADD (64b): 4 double precision flops / cycle.
    SandyBridge/Ivy Bridge: two 256-bit SIMD per cycle. 8 MUL (32b) and 8 ADD (32b): 16 single precision flops / cycle. 4 MUL (64b) and 4 ADD (64b): 8 double precision flops / cycle.
  • In the Laptop We’ll be Using for Demo… Processor: Intel Core i5-3427U (ark.intel.com):
    Processor Number: i5-3427U
    # of Cores: 2
    # of Threads: 4
    Clock Speed: 1.8 GHz
    Max Turbo Frequency: 2.8 GHz
    Instruction Set Extensions: AVX
    SandyBridge/Ivy Bridge: two 256-bit SIMD per cycle. 8 MUL (32b) and 8 ADD (32b): 16 single precision flops / cycle. 4 MUL (64b) and 4 ADD (64b): 8 double precision flops / cycle.
    2 (cores) * 1.8 GHz * 16 Flop/cycle = 57.6 Gflop/s (single precision)
    2 (cores) * 1.8 GHz * 8 Flop/cycle = 28.8 Gflop/s (double precision)
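The laptop arithmetic above follows directly from the glossary formula. A minimal sketch, with hypothetical function names, that reproduces it and also expresses efficiency as achieved over peak:

```c
#include <assert.h>

/* Theoretical peak in GFlop/s = cores * GHz * flops per cycle per core. */
double peak_gflops(int cores, double freq_ghz, int flops_per_cycle)
{
    assert(cores > 0 && freq_ghz > 0.0 && flops_per_cycle > 0);
    return cores * freq_ghz * flops_per_cycle;
}

/* Efficiency = achieved / peak, as a percentage. */
double efficiency_percent(double achieved_gflops, double peak)
{
    assert(peak > 0.0);
    return 100.0 * achieved_gflops / peak;
}
```

`peak_gflops(2, 1.8, 16)` gives the 57.6 Gflop/s single-precision figure and `peak_gflops(2, 1.8, 8)` the 28.8 Gflop/s double-precision figure. The base clock is used here, as on the slide; using the 2.8 GHz turbo frequency instead would be a different assumption.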
  • Haswell-EP vs IvyBridge-EP. The total benefit (at node level) is given by a combination of factors:
      • Benefit from micro-architecture optimization (IPC): 25% IPC improvement
      • Benefit from the number of cores: up to 1.16x (at constant frequency)
      • Benefit from AVX2: up to 2x (FMA)
      • Benefit from memory bandwidth: up to 1.14x (1866 MHz to 2133 MHz)
    [Die diagram: 14 cores (C), LLC cache, memory controller with four DDR4 channels, QPI links, and I/O with PCIe Gen3 x16, x16 and x8.]
  • Flops/s, AVX, AVX2 and AVX-512. [Timeline, 2013 to 2016 by half-year (H1, H2): Ivy Bridge-EP, Haswell-EP, then two future products labelled with -512.]
  • Haswell Execution Unit Overview, Intel® Microarchitecture (Haswell). [Diagram: unified reservation station issuing to ports 0 to 7. Units shown: integer ALU & shift, integer ALU & LEA, FMA / FP multiply, FMA / FP mult / FP add, divide, branch, vector int multiply, vector int ALU, vector logicals, vector shifts, vector shuffle, load & store address (ports 2 and 3), store data, store address (port 7).]
    2xFMA: doubles peak FLOPs; two FP multiplies benefit legacy code.
    4th ALU: great for integer workloads; frees Port0 & 1 for vector.
    New Branch Unit: reduces Port0 conflicts; 2nd EU for high branch code.
    New AGU for Stores: leaves Port 2 & 3 open for loads.
  • Intel® AVX2 extends 128-bit integer vector instructions to 256-bit. Floating point fused multiply add (A*B + C): increased FLOPS potential; increased accuracy (only a single rounding). Enhanced vectorization with gather, shifts and powerful permutes. Intel® AVX2 uses the same 256-bit YMM registers as Intel AVX.
    Floating-Point Performance (Peak) per Core, in flops/cycle:
      SSE4 (Nehalem/Westmere), MUL (*) + ADD (+): 4 DP (8 SP)
      AVX (SandyBridge/Ivy Bridge), 256b AVX1, MUL (*) + ADD (+): 8 DP (16 SP), 2x over SSE4
      AVX2 (Haswell), 256b AVX2, FMA (*,+) + FMA (*,+): 16 DP (32 SP), 2x over AVX
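The per-core figures in this table can be derived from register width, element size and the number of floating-point pipes. The helper below is a hypothetical sketch of that derivation; its parameter names are not from the slides, and the "two FP pipes" value is taken from the "two 128-bit/256-bit SIMD per cycle" and "2xFMA" statements in the talk.

```c
#include <assert.h>

/* Peak floating-point operations per cycle per core.
 * vector_bits: SIMD register width (128 for SSE, 256 for AVX/AVX2)
 * elem_bits:   32 for single precision, 64 for double precision
 * fp_pipes:    vector FP instructions issued per cycle (2 on these cores)
 * fused:       nonzero if each instruction is an FMA (multiply + add) */
int flops_per_cycle(int vector_bits, int elem_bits, int fp_pipes, int fused)
{
    assert(elem_bits > 0 && vector_bits % elem_bits == 0 && fp_pipes > 0);
    int lanes = vector_bits / elem_bits;
    return lanes * fp_pipes * (fused ? 2 : 1);
}
```

With two pipes, `flops_per_cycle(128, 64, 2, 0)` is 4 (SSE4, DP), `flops_per_cycle(256, 64, 2, 0)` is 8 (AVX, DP) and `flops_per_cycle(256, 64, 2, 1)` is 16 (AVX2 with FMA, DP); the single-precision values are twice as large because twice as many elements fit in a register.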
  • Use math libs for best use of AVX1, AVX2 & AVX-512. Use Intel® Math Kernel Library as much as possible. Use intrinsics or assembly for specific kernels. Use the compiler and Intel tools to optimize your source code. [Bar chart: speedup (axis 0.0 to 2.0), one-core basis comparison, for application source code, assembly / intrinsics, MKL DGEMM benchmark and MKL FFT benchmark.]
  • Intel® Math Kernel Library: Optimized Mathematical Building Blocks. Intel® MKL is an integral part of Intel® Parallel Studio XE.
    Linear Algebra: BLAS, LAPACK, Sparse Solvers, Iterative, Pardiso*, ScaLAPACK
    Fast Fourier Transforms: Multidimensional, FFTW interfaces, Cluster FFT
    Vector Math: Trigonometric, Hyperbolic, Exponential, Log, Power / Root
    Vector RNGs: Congruential, Wichmann-Hill, Mersenne Twister, Sobol, Niederreiter, Non-deterministic
    Summary Statistics: Kurtosis, Variation coefficient, Order statistics, Min/max, Variance-covariance
    And More: Splines, Interpolation, Trust Region, Fast Poisson Solver
  • Many Ways to Vectorize, ordered from ease of use (top) to programmer control (bottom):
    Compiler: Auto-vectorization (no change of code)
    Compiler: Auto-vectorization hints (#pragma simd, …)
    Compiler: Intel® Cilk™ Plus Array Notation Extensions
    SIMD intrinsic class (e.g.: F32vec, F64vec, …)
    Vector intrinsic (e.g.: _mm_fmadd_pd(…), _mm_add_ps(…), …)
    Assembler code (e.g.: [v]addps, [v]addss, …)
  • Control Vectorization! Provides details on vectorization success & failure: Linux*, Mac OS* X: -vec-report<n>, Windows*: /Qvec-report<n>
    n: Diagnostic Messages
    0: Tells the vectorizer to report no diagnostic information. Useful for turning off reporting in case it was enabled on command line earlier.
    1: Tells the vectorizer to report on vectorized loops. [default if n missing]
    2: Tells the vectorizer to report on vectorized and non-vectorized loops.
    3: Tells the vectorizer to report on vectorized and non-vectorized loops and any proven or assumed data dependences.
    4: Tells the vectorizer to report on non-vectorized loops.
    5: Tells the vectorizer to report on non-vectorized loops and the reason why they were not vectorized.
    6*: Tells the vectorizer to use greater detail when reporting on vectorized and non-vectorized loops and any proven or assumed data dependences.
    *: First available with Intel® Parallel Studio XE
  • Vectorization Report II
      35: subroutine fd( y )
      36:   integer :: i
      37:   real, dimension(10), intent(inout) :: y
      38:   do i=2,10
      39:     y(i) = y(i-1) + 1
      40:   end do
      41: end subroutine fd
    novec.f90(38): (col. 3) remark: loop was not vectorized: existence of vector dependence.
    novec.f90(39): (col. 5) remark: vector dependence: proven FLOW dependence between y line 39, and y line 39.
    novec.f90(38:3-38:3):VEC:MAIN_: loop was not vectorized: existence of vector dependence
    Note: In case inter-procedural optimization (-ipo or /Qipo) is activated and compilation and linking are separate compiler invocations, the switch to enable reporting needs to be added to the link step!
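The flow dependence reported for `fd` can be seen, and in this particular case removed, in a C rendering of the same loop. This is a sketch only: the C function names are hypothetical, indices are zero-based, and the closed-form rewrite is one possible fix that happens to apply to this recurrence, not a method the slides propose.

```c
#include <assert.h>
#include <stddef.h>

/* Same recurrence as subroutine fd: each element needs the result of the
 * previous iteration, so iterations cannot run side by side in vector lanes. */
void fd_recurrence(float *y, size_t n)
{
    for (size_t i = 1; i < n; i++)
        y[i] = y[i - 1] + 1.0f;
}

/* Closed form of the recurrence: every element depends only on y[0],
 * so the iterations are independent and the loop can be vectorized. */
void fd_closed_form(float *y, size_t n)
{
    assert(y != NULL && n > 0);
    const float y0 = y[0];
    for (size_t i = 1; i < n; i++)
        y[i] = y0 + (float)i;
}
```

The two versions agree exactly only while no rounding occurs (small integer-valued data, as in the test array of ten elements); for general floating-point input the closed form rounds differently from the running sum.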
  • Reasons Why Vectorization Fails & How to Succeed
    ● Most frequent reason is dependence: minimize dependencies among iterations by design!
    ● Alignment: align your arrays/data structures
    ● Function calls in loop body: use aggressive in-lining (IPO)
    ● Complex control flow/conditional branches: avoid them in loops by creating multiple versions of loops
    ● Unsupported loop structure: use loop invariant expressions
    ● Not inner loop: manual loop interchange possible?
    ● Mixed data types: avoid type conversions
    ● Non-unit stride between elements: possible to change algorithm to allow linear/consecutive access?
    ● “Loop body too complex” reports: try splitting up the loops!
    ● “Vectorization seems inefficient” reports: enforce vectorization, benchmark!
  • IVDEP vs. SIMD Pragma/Directives. Differences between IVDEP & SIMD pragmas/directives:
    #pragma ivdep (C/C++) or !DIR$ IVDEP (Fortran)
      - Ignore vector dependencies (IVDEP): compiler ignores assumed but not proven dependencies for a loop
      - Example:
          void foo(int *a, int k, int c, int m)
          {
          #pragma ivdep
            for (int i = 0; i < m; i++)
              a[i] = a[i + k] * c;
          }
    #pragma simd (C/C++) or !DIR$ SIMD (Fortran):
      - Aggressive version of IVDEP: ignores all dependencies inside a loop
      - It’s an imperative that forces the compiler to try everything to vectorize
      - Efficiency heuristic is ignored
      - Attention: this can break semantically correct code! However, it can vectorize code legally in some cases that wouldn’t be possible otherwise!
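Whether ignoring the assumed dependence in `foo` is safe depends on the sign and size of `k`, which the compiler cannot know. The sketch below emulates vector execution in plain C (all loads of a chunk before any of its stores) and compares it with the scalar loop. The emulation, the vector length of 4 and every function name are assumptions made for illustration; no compiler pragma is involved.

```c
#include <assert.h>

#define VL 4  /* emulated vector length, chosen only for illustration */

/* Reference semantics: iterations run strictly one after another. */
static void scale_shift_scalar(int *a, int k, int c, int m)
{
    for (int i = 0; i < m; i++)
        a[i] = a[i + k] * c;
}

/* Emulates vector execution: all VL loads of a chunk happen before any
 * of its stores, as in a SIMD load / multiply / store sequence. */
static void scale_shift_vector(int *a, int k, int c, int m)
{
    for (int i = 0; i < m; i += VL) {
        int tmp[VL];
        int n = (m - i < VL) ? m - i : VL;
        for (int j = 0; j < n; j++)
            tmp[j] = a[i + j + k] * c;
        for (int j = 0; j < n; j++)
            a[i + j] = tmp[j];
    }
}

/* Returns 1 if the emulated vector loop matches the scalar loop for a
 * given shift k (requires -PAD <= k <= PAD), else 0. */
int vector_matches_scalar(int k)
{
    enum { M = 16, PAD = 8 };
    int s[M + 2 * PAD], v[M + 2 * PAD];
    assert(k >= -PAD && k <= PAD);
    for (int i = 0; i < M + 2 * PAD; i++)
        s[i] = v[i] = i + 1;
    scale_shift_scalar(s + PAD, k, 3, M);
    scale_shift_vector(v + PAD, k, 3, M);
    for (int i = 0; i < M + 2 * PAD; i++)
        if (s[i] != v[i])
            return 0;
    return 1;
}
```

In this emulation the results match for `k >= 0` (each iteration reads an element that has not been overwritten yet) and for `k <= -VL`, but differ for `k` between `-1` and `-(VL - 1)`, where an iteration reads a value produced by an earlier iteration of the same chunk. That is the case in which telling the compiler to ignore the dependence would change the program's result.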
  • Memory Subsystem
  • Memory Bandwidth update. Basic rule for theoretical memory BW [Bytes / second]: 8 [Bytes / channel] * Mem freq [Gcycles/sec] * number of channels * number of sockets.
    For Sandy Bridge EP platform: 4 channels, 2 sockets and 1600 MHz memory: 8 * 1.600 * 4 * 2 = 102.4 GB/s peak (ST: 80 GB/s)
    For Ivy Bridge EP platform: 4 channels, 2 sockets and 1866 MHz memory: 8 * 1.866 * 4 * 2 = 119.42 GB/s peak (ST: ~98 GB/s)
    For Haswell EP platform: 4 channels, 2 sockets and 2133 MHz memory: 8 * 2.133 * 4 * 2 = 136.5 GB/s peak (ST: ~114 GB/s)
    [Diagram: two HSW Socket-R3 LGA processors linked by 2 full width QPI 1.1, each with four DDR3/4 channels, 40 lanes of PCIe 3.0 and DMI2.]
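The bandwidth rule above is simple enough to check in code. A minimal sketch with a hypothetical function name; the inputs are the platform figures quoted on the slides.

```c
#include <assert.h>

/* Theoretical peak memory bandwidth in GB/s:
 * 8 bytes per channel * memory frequency (Gcycles/s) * channels * sockets. */
double peak_mem_bw_gbs(double mem_freq_ghz, int channels, int sockets)
{
    assert(mem_freq_ghz > 0.0 && channels > 0 && sockets > 0);
    return 8.0 * mem_freq_ghz * channels * sockets;
}
```

`peak_mem_bw_gbs(1.6, 4, 2)` returns 102.4 for the Sandy Bridge EP platform, and `peak_mem_bw_gbs(1.6, 2, 1)` returns 25.6 for the two-channel demo laptop. Dividing the quoted ST figure by the peak gives the bandwidth efficiency, for example 80 / 102.4, or about 78%, for Sandy Bridge EP.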
  • In the Laptop We’ll be Using for Demo… Processor: Intel Core i5-3427U (ark.intel.com):
    Memory Types: DDR3/L/-RS 1333/1600
    # of Memory Channels: 2
    Max Memory Bandwidth: 25.6 GB/s
    Basic rule for theoretical memory BW [Bytes / second]: 8 [Bytes / channel] * Mem freq [Gcycles/sec] * number of channels * number of sockets.
    Platform: 2 channels, 1 socket and 1600 MHz memory: 8 * 1.6 * 2 * 1 = 25.6 GB/s peak (ST: 20 GB/s)
  • Intel® Many Integrated Core Architecture
  • Intel® Xeon Phi™ Product Family (passive card and active card). http://software.intel.com/en-us/mic-developer
    Up to 61 IA cores / 1.2 GHz / 244 threads
    Up to 16 GB memory with up to 352 GB/s bandwidth
    512-bit SIMD instructions
    Open source Linux operating system
    IP addressable
    Standard programming languages, tools, clustering
    22 nm process
  • Intel® Xeon Phi™ Coprocessor Product Lineup
    3 Family: Outstanding Parallel Computing Solution. Performance/$ leadership. 6GB GDDR5, 240GB/s, >1TF DP, 300W TDP. 3120P (MM# 927501), 3120A (MM# 927500).
    5 Family: Optimized for High Density Environments. Performance/Watt leadership. 8GB GDDR5, >300GB/s, >1TF DP, 225-245W TDP. 5110P (MM# 924044), 5120D (no thermal) (MM# 927503).
    7 Family: Highest Performance, Most Memory. Performance leadership. 16GB GDDR5, 352GB/s, >1.2TF DP, 300W TDP. 7120P (MM# 927499), 7120X (No Thermal Solution) (MM# 927498), 7120A (MM# 934878), 7120D (Dense Form Factor) (MM# 932330).
    Optional 3-year Warranty: extend to 3-year warranty on any Intel® Xeon Phi™ Coprocessor. Product Code: XPX100WRNTY, MM# 933057.
    For more information go to http://www.intel.com/performance
  • Core Architecture. [Diagram: instruction decoder; scalar unit with scalar registers; vector unit with vector registers; L1 cache (I & D); L2 cache; interprocessor network.]
    In order, 64-bit; 4 threads per core
    VPU: 512-wide; integer, SP, DP; 3-operand, 16-instruction
    L1 cache (I & D): 32 KB per core
    L2 cache: 512 KB slice per core; fully coherent; L2 hardware prefetching
  • Spectrum of Execution Models (Offload / Native / Symmetric). Offload: workload is run on host, and highly parallel phases on coprocessor. [Diagram: MPI example on host with offload to coprocessors.]
      !dir$ omp offload target(mic)
      !$omp parallel do
      do i=1,10
        A(i) = B(i) * C(i)
      enddo
      !$omp end parallel do
  • Spectrum of Execution Models (Offload / Native / Symmetric). Native (coprocessor-only model): workload is run solely on coprocessor. [Diagram: MPI example on coprocessor only.]
      icc -mmic … ./bin_mic
    Then:
      ssh mic0 ./bin_mic
    Or start it from host:
      micnativeloadex ./bin_mic
  • Symmetric Mode Command Line. Workload runs on host AND coprocessors. Source: Arslan et al. 2013. Rice HPC Conf.
  • Single Node Tests – HW and SW Configuration. Isotropic RTM FD Kernel.
    Hardware: 2x E5-2697v2 (12C, 2.7GHz) and 64GB DDR3-1866 MHz; 4x 7120A (61 cores, 1.238 GHz, 16GB GDDR5).
    MPI processes with OpenMP threads: rank 0 in “mic0”, rank 1 in “mic1”, rank 4 in “mic2”, rank 5 in “mic3” (244 threads each); rank 2 in “cpu0”, rank 3 in “cpu1” (12 threads each).
    Peer-to-peer via DMA: direct DMA transfers between devices.
    [Diagram: two sockets linked by QPI, each with an IOH*.] *Integrated in the processor
  • Single Node Tests – Performance & Scalability. Isotropic RTM FD Kernel. Scalability study with one to four Intel® Xeon Phi™ coprocessors: high performance and scalability with Intel® Xeon Phi™ coprocessor.
    [Chart: scaling based on number of coprocessors, in GCell/s (axis 0.0 to 30.0) and TFlops (axis 0.0 to 1.6). Configurations: 2x Xeon¹ semi-OPT, 2x Xeon¹, 2x Xeon¹ + 1x 7120A, 2x Xeon¹ + 2x 7120A, 2x Xeon¹ + 3x 7120A, 2x Xeon¹ + 4x 7120A; bar values 1.1, 4.0, 9.3, 14.7, 20.1 and 24.4. The values 4.2 GCell/s and 5.1 appear for the CUDA K40c and CUDA K10 reference bars.]
    Scaling analysis with each Intel® Xeon Phi™ coprocessor solving a 14GB subdomain and the pair of Intel® Xeon® processors solving a 10GB subdomain. 16th order 3D space and 2nd order time; 61 Flops per Cell. 24.4 GCell/s total performance with 2 processors + 4 coprocessors.
    The semi-OPT measurement is an OpenMP parallel version implemented with cache-blocking and compiler directives to improve vectorization. The remaining measurements are on code with additional optimizations such as loop unrolling, non-temporal stores, tiling on Y-Z, prefetch tuning, and balance between MULs and ADDs via intrinsics.
    CUDA K40c and CUDA K10 are measurements on single devices using code that extended the FDTD3d sample in the CUDA SDK 5.5 to 16th order in space and was further optimized to increase register reuse.
    Config. summary: IC 14.0 U1; MPI 4.1.1.036; MPSS 6720-15; ECC off, Turbo on (Xeon & 7120A); CUDA 5.5 (875MHz Boost Enabled).
    1. Xeon = Intel® Xeon® processor E5-2697v2. Source: Intel Measured Results as of April 2014. See benchmark tests and configurations in the speaker notes. For more information go to http://www.intel.com/performance
  • Parallel Programming for Intel® Architecture (or, in general, for normal CPUs): 4 considerations when writing an efficient, unconstrained parallel program. Cores: OpenMP, TBB, Cilk Plus; threads, locks. Vectors: vector loops, vector functions, array notations; intrinsics. Memory, caches: blocking algorithms. Data layout and alignment: AoS/SoA library, directives for alignment; manual layout, ugly code. Performance Analysis.
  • 3DFD comparison: E5-2697v2 (Ivy Bridge) and Xeon Phi 7120A
  • Single Node Tests – Performance/Watt: energy efficiency with multiple Intel® Xeon Phi™ cards. High energy efficiency with Xeon Phi. Note: the power values for 3 and 4 Xeon Phi cards are projections based on the data collected for 1 and 2 Xeon Phi cards. This data was presented by Petrobras at SC13 and at the Rice 2014 Oil & Gas HPC Workshop. Source: Petrobras presentation at 2014 RICE Oil & Gas HPC: http://rice2014oghpc.blogs.rice.edu/files/2014/03/Intel-Rice2014-RTM-XeonPhi-V3.pdf
  • Next Intel® Xeon Phi™ Product Family (Codenamed Knights Landing). All products, computer systems, dates and figures specified are preliminary based on current expectations, and are subject to change without notice. • “Knights Landing” is the code name for the 2nd generation Intel® Xeon Phi™ product • Based on Intel’s 14 nanometer manufacturing process • Available as a standalone bootable processor (running the host OS) and as a PCIe coprocessor (PCIe end-point device) • Integrated on-package high-bandwidth memory • Flexible memory modes for the on-package memory, including cache and flat • Support for Intel® Advanced Vector Extensions 512 (Intel® AVX-512) • 60+ cores, 3+ TeraFLOPS of double-precision peak performance per single-socket node • Multiple hardware threads per core with improved single-thread performance over the current generation Intel® Xeon Phi™ coprocessor. Note that the code name above is not the product name.
  • Programming Resources: Intel® Xeon Phi™ Coprocessor Developer’s Quick Start Guide; overview of programming for Intel® Xeon® processors and Intel® Xeon Phi™ coprocessors; access to webinar replays and over 50 training videos; beginning labs for the Intel® Xeon Phi™ coprocessor; programming guides, tools, case studies, labs, code samples, forums & more. http://software.intel.com/mic-developer Using a familiar programming model and tools means that developers don’t need to start from scratch. Many programming resources are available to further accelerate time to solution.
  • Questions? Are you ready for Multicore and ManyCore?
  • Legal Disclaimer & Optimization Notice: INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. Copyright © 2014, Intel Corporation. All rights reserved. Intel, Pentium, Xeon, Xeon Phi, Core, VTune, Cilk, and the Intel logo are trademarks of Intel Corporation in the U.S. and other countries. Optimization Notice: Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #20110804