11.
Introduction
High-level overview of the Intel® Xeon Phi™ platform:
Hardware and Software
Performance Considerations
Intel Xeon Phi Case Studies
Intel Xeon Phi Ecosystem
Conclusions & References
12. Application Characterization
Based on memory access and flops required:
• Temporal/spatial locality of data
• Bandwidth requirement
[Chart: workloads arranged along a spectrum from bandwidth-limited to core-limited, with a 6 GB/s bandwidth marker. Legend: Y = math kernel, B = application, W = segment. Workloads shown: Stream-triad; BLAS1 & BLAS2 (all); SPECfp2000 (all); Linpack and DGEMM (manufacturing & scientific); sparse matrix-vector, fluid dynamics, ocean models, and molecular dynamics (scientific); FFT (scientific, oil & gas, mil HPC); reservoir simulation, FDTD, Kirchhoff migration, and RTM (oil & gas); option pricing (FSI).]
13. INTEL CONFIDENTIAL
Synthetic Benchmarks
Intel® Xeon Phi™ Coprocessor and Intel® MKL
Higher is better. Coprocessor results: benchmark run 100% on coprocessor, no help from Intel® Xeon® processor host (aka native).

Benchmark             2S Intel® Xeon®   Intel® Xeon Phi™ (ECC on)   Speedup       Efficiency
STREAM Triad (GB/s)   75                171                         up to 2.2X    -
SMP Linpack (GF/s)    330               802                         up to 2.4X    75% efficient
DGEMM (GF/s)          347               887                         up to 2.5X    83% efficient
SGEMM (GF/s)          728               1,796                       up to 2.4X    84% efficient

Notes:
1. Intel® Xeon® Processor E5-2680 used for all; SGEMM matrix = 12800 x 12800, DGEMM matrix = 10752 x 10752, SMP Linpack matrix = 26000 x 26000.
2. Intel® Xeon Phi™ coprocessor SE10P (ECC on) with "Gold" SW stack; SGEMM matrix = 12800 x 12800, DGEMM matrix = 12800 x 12800, SMP Linpack matrix = 26872 x 28672.
3. Average single-node results from measurements across a set of nodes from the TACC+ Stampede* cluster.
+ Texas Advanced Computing Center (TACC) at the University of Texas at Austin.
++ Measured on the TACC+ Stampede cluster.
14. Performance per Watt
Intel® Xeon Phi™ Coprocessor vs. 2S Intel® Xeon® processor (Intel MKL)
1 Intel® Xeon Phi™ coprocessor 5110P vs. 2-socket Intel® Xeon® processor. Higher is better.
Relative performance per watt, normalized to a 1.0 baseline of a 2-socket Intel® Xeon® processor E5-2670:
• 2S Intel® Xeon® processor: 1.00
• SMP Linpack: 3.91
• DGEMM: 4.63
• SGEMM: 4.81
Coprocessor results: benchmark run 100% on coprocessor, no help from Intel® Xeon® processor host (aka native).
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.
Source: Intel measured results as of October 26, 2012. Configuration details: please reference slide speaker notes. For more information go to http://www.intel.com/performance
Notes:
1. 2 x Intel® Xeon® Processor E5-2670 (2.6 GHz, 8C, 115 W)
2. Intel® Xeon Phi™ coprocessor 5110P (ECC on) with Gold RC SW stack (coprocessor power only)
15.
Introduction
High-level overview of the Intel® Xeon Phi™ platform:
Hardware and Software
Native, Offload and Variations
Intel Xeon Phi Case Studies
Intel Xeon Phi Ecosystem
Conclusions & References
23.
Introduction
High-level overview of the Intel® Xeon Phi™ platform:
Hardware and Software
Intel Xeon Phi Case Studies
Intel Xeon Phi Ecosystem
Conclusions & References
24. Parallelizing for High Performance
Example: A Two-Step Process with SAXPY
The following performance results are based on already optimized code.
• Starting point: unoptimized serial code running on multi-core Intel® Xeon® processors: 67.097 seconds (current performance)
• Step 1, optimize code: parallelize and vectorize the code and continue to run on multi-core Intel Xeon processors: 0.46 seconds (145X faster)
• Step 2, use coprocessors: run all or part of the optimized code on Intel® Xeon Phi™ coprocessors: 0.197 seconds (2.3X faster than step 1, 340X faster overall)
SOURCE: INTEL MEASURED RESULTS AS OF NOVEMBER, 2012
25.
• Application: Hybrid Monte-Carlo program that simulates lattice QCD with dynamical Wilson fermions. It is one of the main production programs of the QCDSF collaboration (DEISA) and is used beyond it for quark simulation.
• Status: Many optimizations already in released
version; more optimizations and alternative offload
model version in development
• Demonstrated Results:
- No source code changes
- Recompiled, selected run-time parameters to get
maximum performance
Performance Proof-Point: Government and Academic Research
BQCD
“The performance improvement for BQCD using the
Intel Xeon Phi coprocessor was reached in record
time, requiring only recompilation. We are confident
that larger speed-ups can be obtained with modest
modifications of the code.”
Prof. Dr. Tilo Wettig
Principal Investigator of the QPACE project
BQCD scalability, Gflops/sec (higher is better)
[Chart: Gflops/sec from 0 to 300, measured at 1, 2, 4, and 8 nodes for three configurations:]
• 2S Intel® Xeon® Processor E5-2670
• Intel® Xeon Phi™ coprocessor, native (pre-production HW/SW)
• 2S Intel Xeon E5-2670 + Intel® Xeon Phi™ coprocessor, symmetric (pre-production HW/SW)
SOURCE: INTEL MEASURED MARCH '13
27.
• Application: Monte Carlo algorithms are used to evaluate complex instruments, portfolios, and investments. Performance depends on raw computational power and on the performance of exp2().
• Status: Case Study available
• Highlights: Dramatic performance scaling for both
single-precision and double-precision calculations
• Demonstrated Results:
- Intel® Xeon Phi™ coprocessor fast exp2() and FMA
instructions deliver high performance, high accuracy
for single precision computations
- Compiler based loop unrolling delivers high performance
- Cache blocking further optimizes cache utilization,
reduces cache misses, and makes outer loop
vectorization possible
• Read the Case Study: software.intel.com/en-us/articles/case-study-achieving-high-performance-on-monte-carlo-european-option-on-intel-xeon-phi
Performance Proof-Point: Financial Services
MONTE CARLO EUROPEAN OPTIONS
Speedup (higher is better), relative to a 1.0 baseline:
• 2S Intel® Xeon® processor E5-2670: 1.0 (single and double precision)
• 2S Intel Xeon processor E5-2670 + Intel® Xeon Phi™ coprocessor (pre-production HW/SW): 10.36X single precision, 3.34X double precision
SOURCE: INTEL MEASURED RESULTS AS OF JULY, 2013
28.
• Application: Weather Research and Forecasting (WRF)
• Status: WRF V3.5 was released 4/18/13
• Code Optimization:
– Approximately two dozen files with less than 2,000
lines of code were modified (out of approximately
700,000 lines of code in about 800 files, all Fortran
standard compliant)
– Most modifications improved performance for both the
host and the co-processors
• Performance Measurements: Pre-release of WRF 3.5 (V3.5Pre) and the NCAR-supported CONUS2.5KM benchmark (a high-resolution weather forecast)
• Acknowledgments: There were many contributors to
these results, including the National Renewable Energy
Laboratory and The Weather Channel Companies
Performance Proof-Point: Government and Academic Research
WEATHER RESEARCH AND FORECASTING (WRF)
Speedup (higher is better):
• 2S Intel® Xeon® processor E5-2670 with eight-node cluster configuration: 1.0 (baseline)
• 2S Intel® Xeon® processor E5-2670 + Intel® Xeon Phi™ coprocessor (pre-production HW/SW) with eight-node cluster configuration: 1.4X
SOURCE: INTEL MEASURED RESULTS AS OF JULY, 2013
29.
• Application: Sandia National Laboratories' best
approximation to an unstructured implicit finite
element or finite volume application in fewer than
8000 lines of code
• Status: available at
http://software.sandia.gov/trac/mantevo/browser/trunk/packages
• Demonstrated Results:
- Porting was easy using OpenMP
- Substituting an Intel MKL routine for the sparse matrix-
vector product accelerated performance
and will simplify future optimization
- The Intel MPI Library enables rapid performance
improvement when adding an Intel® Xeon Phi™
coprocessor
• Read the Case Study: software.intel.com/en-us/articles/running-minife-on-intel-xeon-phi-coprocessors
Performance Proof-Point: Government and Academic Research
SANDIA MANTEVO miniFE
Speedup (higher is better):
• 2S Intel® Xeon® processor E5-2670: 1.0 (baseline)
• 2S Intel Xeon processor E5-2670 + Intel® Xeon Phi™ coprocessor (pre-production HW/SW): 2.2X
SOURCE: INTEL MEASURED RESULTS AS OF MARCH, 2012
"The programming models available for the Intel MIC Architecture are open-standard and portable between traditional processors and Intel Xeon Phi coprocessors. This should allow us to leverage code development across multiple platforms."
James A. Ang, Ph.D., Extreme-scale Computing, Sandia National Laboratories
30.
DEMONSTRATED PERFORMANCE BENEFITS
Intel® Xeon Phi™ Coprocessor
UP TO
2.23X
Acceleware 8th
Order Isotropic
Variable Velocity2
Seismic
UP TO
2X
Sandia National
Labs MiniFE1
Finite Element Analysis
1. 8-node cluster, each node with 2S Xeon* (comparison is cluster performance with and without 1 Xeon Phi* per node) (hetero)
2. 2S Xeon* vs. 1 Xeon Phi* (pre-production HW/SW; application running 100% on coprocessor unless otherwise noted)
3. 2S Xeon* vs. 2S Xeon* + 2 Xeon Phi* (offload)
UP TO
3.54X
China Oil & Gas
Geoeast Pre-stack
Time Migration3
SOURCE: INTEL MEASURED RESULTS AS OF NOVEMBER, 2012
31.
DEMONSTRATED PERFORMANCE BENEFITS
Intel® Xeon Phi™ Coprocessor
UP TO
10.75X
Monte Carlo SP3
Finance
UP TO
2.7X
Jefferson Lab
Lattice QCD
Physics
UP TO
7X
Black-Scholes SP3
Notes:
1. 2S Xeon* vs. 1 Xeon Phi* (pre-production HW/SW; application running 100% on coprocessor unless otherwise noted)
2. Intel Measured Oct. 2012
3. Includes additional FLOPS from transcendental function unit
SPEED-UP
2.11X
Intel Labs
Ray Tracing2
Embree Ray Tracing
SOURCE: INTEL MEASURED RESULTS AS OF NOVEMBER, 2012
32.
Introduction
High-level overview of the Intel® Xeon Phi™ platform:
Hardware and Software
Intel Xeon Phi Case Studies
Intel Xeon Phi Ecosystem
Conclusions & References
33.
• System: TACC Stampede is a 10 petaflop
supercomputer, one of the largest computing systems
in the world for open science research. It became
operational on January 7, 2013
• Status: In Service
• Workloads: Runs hundreds of applications for
thousands of users around the world
• Performance:
– More than 7 petaflops using Intel® Xeon Phi™
coprocessors1
– More than 2 petaflops using the Intel® Xeon®
processor E5 family1
• More Information:
– SC12 interview: insidehpc.com/2012/12/06/video-
intel-xeon-phi-powers-7-tacc-stampede-super/
– TACC HPC systems overview:
www.tacc.utexas.edu/resources/hpc
Implementation Proof-Point: Government and Academic Research
Texas Advanced Computing Center (TACC)
1 http://www.tacc.utexas.edu/resources/hpc/stampede
34. System: Located in southwest China, it comprises 16,000 nodes, the world's largest (public) installation of Intel Ivy Bridge and Xeon Phi processors. Each cluster node contains:
• 2 twelve-core Intel® Xeon® Ivy Bridge CPUs @ 2.2 GHz
• 3 Intel® Xeon Phi™ cards, each with 57 cores @ 1.1 GHz
Performance: Theoretical peak of 54.9 Pflop/s:
• 6.8 Pflop/s from 32,000 Xeon Ivy Bridge sockets
• 48.1 Pflop/s from 48,000 Xeon Phi cards
• 3,120,000 cores in total
Sustained Linpack: 30.65 Pflop/s.
More Information: "Visit to the National University
for Defense Technology Changsha, China." Jack
Dongarra, University of Tennessee, and Oak Ridge
National Laboratory. June 2013.
www.netlib.org/utk/people/JackDongarra/PAPERS/tianhe-2-
dongarra-report.pdf
Tianhe-2 System: #1 June 2013 Top500 List
35. Other brands and names are the property of their respective owners.
A Growing Software Ecosystem:
Developing today on Intel® Xeon Phi™ coprocessors
Shown at SC’12, November 2012
35
36.
Introduction
High-level overview of the Intel® Xeon Phi™ platform:
Hardware and Software
Intel Xeon Phi Case Studies
Intel Xeon Phi Ecosystem
Conclusions & References