11.
Introduction
High-level overview of the Intel® Xeon Phi™ platform:
Hardware and Software
Performance Considerations
Intel Xeon Phi Case Studies
Intel Xeon Phi Ecosystem
Conclusions & References
12. Application Characterization
Based on memory access and flops required:
• Temporal/spatial locality of data
• Bandwidth requirement
[Chart: workloads arranged along a spectrum from bandwidth-limited to core-limited, with a 6 GB/s bandwidth marker. Legend: Y = math kernel, B = application, W = segment. Workloads shown: Stream-triad; BLAS1 & BLAS2 (all); SPECfp2000 (all); Linpack and DGEMM (manufacturing & scientific); sparse matrix-vector, fluid dynamics, ocean models, and molecular dynamics (scientific); FFT (scientific, oil & gas, mil HPC); reservoir simulation, FDTD, Kirchhoff migration, and RTM (oil & gas); option pricing (FSI).]
13. INTEL CONFIDENTIAL
Synthetic Benchmarks
Intel® Xeon Phi™ Coprocessor and Intel® MKL
Higher is better. Coprocessor results: benchmark run 100% on coprocessor, no help from Intel® Xeon® processor host (aka native).

Benchmark             2S Intel® Xeon®   Intel® Xeon Phi™ (ECC on)   Speedup       Efficiency
STREAM Triad (GB/s)   75                171                         up to 2.2X    -
SMP Linpack (GF/s)    330               802                         up to 2.4X    75% efficient
DGEMM (GF/s)          347               887                         up to 2.5X    83% efficient
SGEMM (GF/s)          728               1,796                       up to 2.4X    84% efficient

Notes:
1. Intel® Xeon® Processor E5-2680 used for all; SGEMM matrix = 12800 x 12800, DGEMM matrix = 10752 x 10752, SMP Linpack matrix = 26000 x 26000.
2. Intel® Xeon Phi™ coprocessor SE10P (ECC on) with "Gold" SW stack; SGEMM matrix = 12800 x 12800, DGEMM matrix = 12800 x 12800, SMP Linpack matrix = 26872 x 28672.
3. Average single-node results from measurements across a set of nodes from the TACC+ Stampede* cluster.
+ Texas Advanced Computing Center (TACC) at the University of Texas at Austin.
++ Measured on the TACC+ Stampede cluster.
14. Performance per Watt
Intel® Xeon Phi™ Coprocessor vs. 2S Intel® Xeon® processor (Intel MKL)
1 Intel® Xeon Phi™ coprocessor 5110P vs. 2-socket Intel® Xeon® processor. Higher is better.
Relative performance per watt, normalized to a 1.0 baseline of a 2-socket Intel® Xeon® processor E5-2670:
• 2S Intel® Xeon® processor: 1.00
• SMP Linpack: 3.91
• DGEMM: 4.63
• SGEMM: 4.81
Coprocessor results: benchmark run 100% on coprocessor, no help from Intel® Xeon® processor host (aka native).
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.
Source: Intel measured results as of October 26, 2012. Configuration details: please reference slide speaker notes. For more information go to http://www.intel.com/performance
Notes:
1. 2 x Intel® Xeon® Processor E5-2670 (2.6 GHz, 8C, 115 W)
2. Intel® Xeon Phi™ coprocessor 5110P (ECC on) with Gold RC SW stack (coprocessor power only)
15.
Introduction
High-level overview of the Intel® Xeon Phi™ platform:
Hardware and Software
Native, Offload and Variations
Intel Xeon Phi Case Studies
Intel Xeon Phi Ecosystem
Conclusions & References
23.
Introduction
High-level overview of the Intel® Xeon Phi™ platform:
Hardware and Software
Intel Xeon Phi Case Studies
Intel Xeon Phi Ecosystem
Conclusions & References
24. Parallelizing for High Performance
Example: A Two-Step Process with SAXPY
The following performance results are based on already optimized code.
• Starting point: unoptimized serial code running on multi-core Intel® Xeon® processors: 67.097 seconds (current performance)
• Step 1, optimize code: parallelize and vectorize the code and continue to run on multi-core Intel Xeon processors: 0.46 seconds (145X faster)
• Step 2, use coprocessors: run all or part of the optimized code on Intel® Xeon Phi™ coprocessors: 0.197 seconds (2.3X faster than step 1, 340X faster overall)
SOURCE: INTEL MEASURED RESULTS AS OF NOVEMBER, 2012
25.
• Application: Hybrid Monte-Carlo program that simulates lattice QCD with dynamical Wilson fermions. It is one of the main production programs of the QCDSF collaboration (DEISA) and is used beyond it for quark simulation.
• Status: Many optimizations already in released
version; more optimizations and alternative offload
model version in development
• Demonstrated Results:
- No source code changes
- Recompiled, selected run-time parameters to get
maximum performance
Performance Proof-Point: Government and Academic Research
BQCD
“The performance improvement for BQCD using the
Intel Xeon Phi coprocessor was reached in record
time, requiring only recompilation. We are confident
that larger speed-ups can be obtained with modest
modifications of the code.”
Prof. Dr. Tilo Wettig
Principal Investigator of the QPACE project
BQCD scalability, Gflops/sec (higher is better)
[Chart: Gflops/sec from 0 to 300, measured at 1, 2, 4, and 8 nodes for three configurations:]
• 2S Intel® Xeon® Processor E5-2670
• Intel® Xeon Phi™ coprocessor, native (pre-production HW/SW)
• 2S Intel Xeon E5-2670 + Intel® Xeon Phi™ coprocessor, symmetric (pre-production HW/SW)
SOURCE: INTEL MEASURED MARCH '13
27.
• Application: Monte Carlo algorithms are used to evaluate complex instruments, portfolios, and investments. Performance depends on raw computational power and on the performance of exp2().
• Status: Case Study available
• Highlights: Dramatic performance scaling for both
single-precision and double-precision calculations
• Demonstrated Results:
- Intel® Xeon Phi™ coprocessor fast exp2() and FMA
instructions deliver high performance, high accuracy
for single precision computations
- Compiler based loop unrolling delivers high performance
- Cache blocking further optimizes cache utilization,
reduces cache misses, and makes outer loop
vectorization possible
• Read the Case Study: software.intel.com/en-us/articles/case-study-achieving-high-performance-on-monte-carlo-european-option-on-intel-xeon-phi
Performance Proof-Point: Financial Services
MONTE CARLO EUROPEAN OPTIONS
Speedup (higher is better), relative to a 1.0 baseline:
• 2S Intel® Xeon® processor E5-2670: 1.0 (single and double precision)
• 2S Intel Xeon processor E5-2670 + Intel® Xeon Phi™ coprocessor (pre-production HW/SW): 10.36X single precision, 3.34X double precision
SOURCE: INTEL MEASURED RESULTS AS OF JULY, 2013
28.
• Application: Weather Research and Forecasting (WRF)
• Status: WRF V3.5 was released 4/18/13
• Code Optimization:
– Approximately two dozen files with less than 2,000
lines of code were modified (out of approximately
700,000 lines of code in about 800 files, all Fortran
standard compliant)
– Most modifications improved performance for both the
host and the co-processors
• Performance Measurements: Pre-release of WRF 3.5 (V3.5Pre) and the NCAR-supported CONUS2.5KM benchmark (a high-resolution weather forecast)
• Acknowledgments: There were many contributors to
these results, including the National Renewable Energy
Laboratory and The Weather Channel Companies
Performance Proof-Point: Government and Academic Research
WEATHER RESEARCH AND FORECASTING (WRF)
Speedup (higher is better):
• 2S Intel® Xeon® processor E5-2670 with eight-node cluster configuration: 1.0 (baseline)
• 2S Intel® Xeon® processor E5-2670 + Intel® Xeon Phi™ coprocessor (pre-production HW/SW) with eight-node cluster configuration: 1.4X
SOURCE: INTEL MEASURED RESULTS AS OF JULY, 2013
29.
• Application: Sandia National Laboratories' best
approximation to an unstructured implicit finite
element or finite volume application in fewer than
8000 lines of code
• Status: available at
http://software.sandia.gov/trac/mantevo/browser/trunk/packages
• Demonstrated Results:
- Porting was easy using OpenMP
- Substituting an Intel MKL routine for the sparse matrix-
vector product accelerated performance
and will simplify future optimization
- The Intel MPI Library enables rapid performance
improvement when adding an Intel® Xeon Phi™
coprocessor
• Read the Case Study: software.intel.com/en-us/articles/running-minife-on-intel-xeon-phi-coprocessors
Performance Proof-Point: Government and Academic Research
SANDIA MANTEVO miniFE
Speedup (higher is better):
• 2S Intel® Xeon® processor E5-2670: 1.0 (baseline)
• 2S Intel Xeon processor E5-2670 + Intel® Xeon Phi™ coprocessor (pre-production HW/SW): 2.2X
SOURCE: INTEL MEASURED RESULTS AS OF MARCH, 2012
"The programming models available for the Intel MIC Architecture are open-standard and portable between traditional processors and Intel Xeon Phi coprocessors. This should allow us to leverage code development across multiple platforms."
James A. Ang, Ph.D., Extreme-scale Computing, Sandia National Laboratories
30.
DEMONSTRATED PERFORMANCE BENEFITS
Intel® Xeon Phi™ Coprocessor
UP TO
2.23X
Acceleware 8th
Order Isotropic
Variable Velocity2
Seismic
UP TO
2X
Sandia National
Labs MiniFE1
Finite Element Analysis
1. 8-node cluster, each node with 2S Xeon* (comparison is cluster performance with and without 1 Xeon Phi* per node) (hetero)
2. 2S Xeon* vs. 1 Xeon Phi* (pre-production HW/SW; application running 100% on coprocessor unless otherwise noted)
3. 2S Xeon* vs. 2S Xeon* + 2 Xeon Phi* (offload)
UP TO
3.54X
China Oil & Gas
Geoeast Pre-stack
Time Migration3
SOURCE: INTEL MEASURED RESULTS AS OF NOVEMBER, 2012
31.
DEMONSTRATED PERFORMANCE BENEFITS
Intel® Xeon Phi™ Coprocessor
UP TO
10.75X
Monte Carlo SP3
Finance
UP TO
2.7X
Jefferson Lab
Lattice QCD
Physics
UP TO
7X
Black-Scholes SP3
Notes:
1. 2S Xeon* vs. 1 Xeon Phi* (pre-production HW/SW; application running 100% on coprocessor unless otherwise noted)
2. Intel Measured Oct. 2012
3. Includes additional FLOPS from transcendental function unit
SPEED-UP
2.11X
Intel Labs
Ray Tracing2
Embree Ray Tracing
SOURCE: INTEL MEASURED RESULTS AS OF NOVEMBER, 2012
32.
Introduction
High-level overview of the Intel® Xeon Phi™ platform:
Hardware and Software
Intel Xeon Phi Case Studies
Intel Xeon Phi Ecosystem
Conclusions & References
33.
• System: TACC Stampede is a 10 petaflop
supercomputer, one of the largest computing systems
in the world for open science research. It became
operational on January 7, 2013
• Status: In Service
• Workloads: Runs hundreds of applications for
thousands of users around the world
• Performance:
– More than 7 petaflops using Intel® Xeon Phi™
coprocessors1
– More than 2 petaflops using the Intel® Xeon®
processor E5 family1
• More Information:
– SC12 interview: insidehpc.com/2012/12/06/video-
intel-xeon-phi-powers-7-tacc-stampede-super/
– TACC HPC systems overview:
www.tacc.utexas.edu/resources/hpc
Implementation Proof-Point: Government and Academic Research
Texas Advanced Computing Center (TACC)
1 http://www.tacc.utexas.edu/resources/hpc/stampede
34. System: Located in southwest China, it comprises 16,000 nodes, the world's largest (public) installation of Intel Ivy Bridge and Xeon Phi processors. Each cluster node contains:
• 2 twelve-core Intel® Xeon® Ivy Bridge CPUs @ 2.2 GHz
• 3 Intel® Xeon Phi™ cards, each with 57 cores @ 1.1 GHz
Performance: Theoretical peak of 54.9 Pflop/s:
• 6.8 Pflop/s from 32,000 Xeon Ivy Bridge sockets
• 48.1 Pflop/s from 48,000 Xeon Phi cards
• 3,120,000 cores in total
Sustained Linpack: 30.65 Pflop/s.
More Information: "Visit to the National University
for Defense Technology Changsha, China." Jack
Dongarra, University of Tennessee, and Oak Ridge
National Laboratory. June 2013.
www.netlib.org/utk/people/JackDongarra/PAPERS/tianhe-2-
dongarra-report.pdf
Tianhe-2 System: #1 June 2013 Top500 List
35. Other brands and names are the property of their respective owners.
A Growing Software Ecosystem:
Developing today on Intel® Xeon Phi™ coprocessors
Shown at SC’12, November 2012
35
36.
Introduction
High-level overview of the Intel® Xeon Phi™ platform:
Hardware and Software
Intel Xeon Phi Case Studies
Intel Xeon Phi Ecosystem
Conclusions & References