Productive Parallel Programming for Intel Xeon Phi Coprocessors
In this video from Moabcon 2013, Bill Magro from Intel presents: Productive Parallel Programming for Intel Xeon Phi Coprocessors.

Learn more at:
http://www.adaptivecomputing.com/company/news-and-events/events/moabcon-2013/moabcon-2013-full-agenda/
and
http://www.intel.com/content/www/us/en/high-performance-computing/high-performance-xeon-phi-coprocessor-brief.html

You can watch the video of this presentation at:
http://insidehpc.com/?p=36407

    Presentation Transcript

    • Productive Parallel Programming for Intel® Xeon Phi™ Coprocessors. Bill Magro, Director and Chief Technologist, Technical Computing Software, Intel Software & Services Group
    • Still an Insatiable Need For Computing: weather prediction, genomics research, and medical imaging keep driving demand for more FLOPS. [Chart: Top500 performance projection from 100 MFlops in 1993 toward 1 ZFlops around 2029.] The PetaFlop systems of today are the client and handheld systems 10 years later. Source: www.top500.org
    • Approaching Exascale
    • Some believe virtually none of today's hardware or software technologies can be improved or modified to reach exascale, and that a complete revolution is needed. We believe evolution of today's technologies, plus hardware and software innovation, can get us there, and that a systems approach, with co-design, is critical.
    • Moore's Law: Alive and Well. Process generations: 90 nm (2003), 65 nm (2005), 45 nm (2007), 32 nm (2009), and 22 nm (2011). Along the way: strained silicon (SiGe invented, then 2nd-generation SiGe), high-k metal gate (gate-last invented, then 2nd-generation gate-last), and tri-gate transistors (first to implement), a revolutionary leap in process technology delivering a 37% performance gain at low voltage and more than 50% active power reduction at constant performance.* The foundation for all computing, including exascale. Source: Intel. *Compared to Intel 32nm Technology
    • Intel® Xeon Phi™ Coprocessor [Knights Corner]: Power Efficiency. Performance per watt of a prototype Knights Corner cluster compared to the two top graphics-accelerated clusters (higher is better): Intel Corp Knights Corner, 1381 MFLOPS/Watt (Top500 #150, 72.9 kW); Nagasaki Univ. ATI Radeon, 1380 MFLOPS/Watt (Top500 #456, 47 kW); Barcelona Supercomputing Center Nvidia Tesla 2090, 1266 MFLOPS/Watt (Top500 #177, 81.5 kW). Source: www.green500.org, June 2012. Copyright © 2012 Intel Corporation. All rights reserved.
    • Myth: explicitly managed locality is in and caches are out. Reality: caches remain the path to high performance and efficiency. [Chart: relative bandwidth and bandwidth per watt rise steeply from memory to L2 cache to L1 cache.]
    • #1 Green500 Cluster, WORLD RECORD: "Beacon" at NICS, an Intel® Xeon® processor + Intel® Xeon Phi™ coprocessor cluster, is the most power-efficient system on the list at 2.449 GigaFLOPS/Watt with 70.1% efficiency. Source: www.green500.org as of Nov 2012. Other brands and names are the property of their respective owners.
    • Reaching Exascale Power Goals Requires Architectural & Systems Focus. Memory (2x-5x): new memory interfaces (optimized memory control and transfer); extend DRAM with non-volatile memory. Processor (10x-20x): reducing data movement (functional reorganization, > 20x); domain/core power gating and aggressive voltage scaling. Interconnect (2x-5x): more interconnect on package; replace long-haul copper with integrated optics. Data center energy efficiencies (10%-20%): higher operating temperature tolerance; 480V to the rack and free air/water cooling efficiencies.
    • Extreme Parallelism: reliability of these machines requires a systems approach. [Charts: top-system concurrency trend rising steadily from 1993 to 2009; projected chip, DRAM, and socket counts through 2016 drive MTTI from hours down to minutes, with a crossover where failures per socket per year exceed the time to save a global checkpoint.] Requirements: transparent process migration; holistic fault detection and recovery; reliable end-to-end communications; integrated memory in the storage layer for fast checkpoint and workflow; N+1 scale reliable architectures consistent with stacked-memory constraints; system-wide power management and dynamic optimization; and design for system-level debug capability. Reliability is the primary force driving next-generation designs. Source: Exascale Computing Study: Technology Challenges in Achieving Exascale Systems (2008)
    • Foundation of Performance: Computing
    • Architecture for Discovery. Intel® Xeon® processor: ground-breaking real-world application performance; industry-leading energy efficiency; meets a broad set of HPC challenges. Intel® Xeon Phi™ product family: based on the Intel® Many Integrated Core (MIC) architecture; leading performance for highly parallel workloads; common Intel Xeon programming model; a productive solution for highly parallel computing.
    • Intel® Xeon® E5-2600 processors: up to 73% performance boost vs. the prior generation1 on an HPC suite of applications, and over 2X improvement on key industry benchmarks. Up to 8 cores and up to 20 MB cache significantly reduce compute time on large, complex data sets with Intel® Advanced Vector Extensions. Up to 4 channels of DDR3 1600 memory. Integrated PCI Express* I/O cuts latency while adding capacity and bandwidth. 1 Over previous-generation Intel® processors; Intel internal estimate. For more legal information on performance forecasts go to http://www.intel.com/performance
    • Introducing Intel® Xeon Phi™ Coprocessors: Highly Parallel Processing for Unparalleled Discovery. Groundbreaking differences: up to 61 IA cores at 1.1 GHz with 244 threads; up to 8 GB memory with up to 352 GB/s bandwidth; 512-bit SIMD instructions; a Linux operating system, IP addressable; standard programming languages and tools. Leading to groundbreaking results: up to 1 TeraFlop/s double-precision peak performance1; up to 2.2x higher memory bandwidth than an Intel® Xeon® processor E5 family-based server2; up to 4x more performance per watt than an Intel® Xeon® processor E5 family-based server.3 Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://www.intel.com/performance. Notes 1, 2 & 3: see backup for system configuration details.
    • BIG GAINS FOR SELECT APPLICATIONS. [Chart: theoretical acceleration of a highly parallel Intel® Xeon Phi™ coprocessor versus a standard multi-core Intel® Xeon® processor, plotted against the fraction of the code that is parallel and vectorized; meaningful gains appear only as that fraction approaches 100%.] The path: parallelize, vectorize, then scale to many-core.
    • Performance Potential of Intel® Xeon Phi™ Coprocessors
    • Synthetic Benchmark Summary (Intel® MKL), 2S Intel® Xeon® processor vs. 1 Intel® Xeon Phi™ coprocessor, higher is better: SGEMM up to 2.9X (1,860 vs. 640 GF/s, 86% efficient); DGEMM up to 2.8X (883 vs. 309 GF/s, 82% efficient); HPLinpack up to 2.6X (803 vs. 303 GF/s, 75% efficient); STREAM Triad up to 2.2X (175 GB/s with ECC on, 181 GB/s with ECC off, vs. 79 GB/s). Notes: 1. Intel® Xeon® processor E5-2670 used for all; SGEMM matrix 13824 x 13824, DGEMM matrix 7936 x 7936, SMP Linpack matrix 30720 x 30720. 2. Intel® Xeon Phi™ coprocessor SE10P (ECC on) with "Gold Release Candidate" software stack; SGEMM matrix 15360 x 15360, DGEMM matrix 7680 x 7680, SMP Linpack matrix 26872 x 28672. Source: Intel measured results as of October 26, 2012. Configuration details: please reference slide speaker notes. For more information go to http://www.intel.com/performance
    • PARALLELIZING FOR HIGH PERFORMANCE. Example: SAXPY. Starting point: typical serial code running on multi-core Intel® Xeon® processors, 67.097 seconds. Step 1, optimize the code: parallelize and vectorize, and continue to run on multi-core Intel Xeon processors, 0.46 seconds (145X faster). Step 2, use coprocessors: run all or part of the optimized code on Intel® Xeon Phi™ coprocessors, 0.197 seconds (2.3X faster than step 1, 340X faster overall).
    • Performance Proof-Point: Government and Academic Research, Weather Research and Forecasting (WRF). Speedup (higher is better): 1 for a 2S Intel® Xeon® processor E5-2670 four-node cluster; 1.45 for 2S Intel® Xeon® processor E5-2670 + Intel® Xeon Phi™ coprocessor (pre-production HW/SW) in the same four-node cluster configuration. Status: WRF v3.5 coming soon. Code optimization: approximately two dozen files with less than 2,000 lines of code were modified (out of approximately 700,000 lines of code in about 800 files, all Fortran standard compliant); most modifications improved performance for both the host and the coprocessors. Performance measurements: V3.5Pre and the NCAR-supported CONUS2.5KM benchmark (a high-resolution weather forecast). Acknowledgments: there were many contributors to these results, including the National Renewable Energy Laboratory and The Weather Channel Companies. Source: Intel or third-party measured results as of December 2012. Configuration details: please see backup slides. For more information go to http://www.intel.com/performance
    • PROVEN PERFORMANCE BENEFITS, Intel® Xeon Phi™ Coprocessor. Finite element analysis: Sandia National Labs MiniFE2, up to 2X. Seismic: Acceleware 8th-order isotropic variable velocity1, up to 2.23X. China Oil & Gas: Geoeast pre-stack time migration3, up to 2.05X. Notes: 1. 2S Xeon* vs. 1 Xeon Phi* (pre-production HW/SW, application running 100% on the coprocessor unless otherwise noted). 2. 8-node cluster, each node with 2S Xeon* (comparison is cluster performance with and without 1 Xeon Phi* per node) (hetero). 3. 2S Xeon* vs. 2S Xeon* + 2 Xeon Phi* (offload).
    • PROVEN PERFORMANCE BENEFITS, Intel® Xeon Phi™ Coprocessor. Ray tracing: Intel Labs Embree ray tracing3, 1.8X speed-up. Physics: Jefferson Lab Lattice QCD, up to 2.7X. Finance: Black-Scholes SP2, up to 7X; Monte Carlo SP2, up to 10.75X. Notes: 1. 2S Xeon* vs. 1 Xeon Phi* (pre-production HW/SW, application running 100% on the coprocessor unless otherwise noted). 2. Includes additional FLOPS from the transcendental function unit. 3. Intel measured, Oct. 2012.
    • Achieving Productive Parallelism with Intel® Xeon Phi™ Coprocessors
    • More Cores. Wider Vectors. Performance Delivered. Intel® Parallel Studio XE 2013 and Intel® Cluster Studio XE 2013 scale performance efficiently from multicore to many-core (50+ cores) and from 128-bit to 256-bit to 512-bit vectors, spanning serial, task- and data-parallel, and distributed performance: industry-leading performance from advanced compilers; comprehensive libraries; parallel programming models; insightful analysis tools.
    • Parallel Performance Potential: if your performance needs are met by an Intel® Xeon® processor, they will be achieved with fewer threads than on a coprocessor. On a coprocessor, you need more threads to achieve the same performance, and the same thread count can yield less performance. Intel® Xeon Phi™ excels on highly parallel applications.
    • Maximizing Parallel Performance. Two fundamental considerations: scaling (does the application already scale to the limits of Intel® Xeon® processors?) and vectorization and memory usage (does the application make good use of vectors, or is it memory-bandwidth bound?). If both are true for an application, then the highly parallel and power-efficient Intel Xeon Phi coprocessor is most likely to be worth evaluating.
    • Intel® Family of Parallel Programming Models, applicable to multi-core and many-core programming. Intel® Cilk™ Plus: C/C++ language extensions to simplify parallelism (open sourced, also an Intel product). Intel® Threading Building Blocks: widely used C++ template library for parallelism (open sourced, also an Intel product). Domain-specific libraries: Intel® Math Kernel Library. Established standards: Message Passing Interface (MPI), OpenMP*, Coarray Fortran, OpenCL*. Research and development: Intel® Concurrent Collections, Offload Extensions, Intel® SPMD Parallel Compiler. A choice of high-performance parallel programming models.
    • Single-Source Approach to Multi- and Many-Core: one source, with the same compilers, libraries, and parallel models, targets multicore CPUs, the Intel® MIC architecture coprocessor, multicore clusters, and clusters mixing multicore and many-core nodes. "Unparalleled productivity… most of this software does not run on a GPU." (R. Harrison, "Opportunities and Challenges Posed by Exascale Computing: ORNL's Plans and Perspectives", National Institute of Computational Sciences, Nov 2011)
    • The tool stack spans both architectures: Intel® C/C++ and Fortran Compilers with OpenMP; Intel® MKL, Intel® Cilk™ Plus, Intel® TBB, and Intel® IPP; and Intel® Inspector XE, Intel® VTune™ Amplifier XE, and Intel® Advisor, packaged together as Intel® Parallel Studio XE. Adding the Intel® MPI Library and the Intel® Trace Analyzer and Collector yields Intel® Cluster Studio XE.
    • Intel® Xeon Phi™ Coprocessor: a Game Changer for HPC. Build your applications on a known compute platform, and watch them take off sooner. With restrictive special-purpose hardware: new learning, complex code porting, and slow progress to accurate runs. With the Intel® Xeon Phi™ coprocessor: familiar tools and runtimes. "We ported millions of lines of code in only days… Unparalleled productivity… most of this software does not run on a GPU and never will." (Robert Harrison, National Institute for Computational Sciences, Oak Ridge National Laboratory; "Opportunities and Challenges Posed by Exascale Computing: ORNL's Plans and Perspectives", NICS, 2011)
    • Achieving Parallelism in Applications. IA benefit: a wide range of development options, ordered from ease of use to fine control. Parallelization options: Intel® Math Kernel Library; MPI*; OpenMP*; Intel® Threading Building Blocks; Intel® Cilk™ Plus; OpenCL*; Pthreads*. Vector options: Intel® Math Kernel Library; auto vectorization; semi-auto vectorization: #pragma (vector, ivdep, simd); array notation (Intel® Cilk™ Plus); C/C++ vector classes (F32vec16, F64vec8); intrinsics.
    • Spectrum of Programming Models and Mindsets, from multi-core centric to many-core centric: Multi-Core Hosted (general-purpose serial and parallel computing; main(), foo(), and MPI_*() all on the CPU); Offload (codes with highly parallel phases; main() on the CPU, foo() offloaded to the coprocessor); Symmetric (codes with balanced needs; main(), foo(), and MPI_*() on both); Many-Core Hosted (highly parallel codes running entirely on the coprocessor). A range of models to meet application needs.
    • Operating Environment View: the Intel® Xeon® processor host runs a Linux Standard Base ("LSB") platform; the Intel® Xeon Phi™ coprocessor runs Linux with its own ABI, connected to the host over PCIe via SCIF. The coprocessor is IP addressable and supports SSH, NFS, file I/O, and sockets/OFED runtimes. A flexible, familiar, compatible operating environment.
    • Programming View: the same parallel models apply to processor and coprocessor. Intra-node parallelism: Intel® MKL, TBB, Cilk™ Plus, OpenMP, and Pthreads on both sides. Intra- and inter-node parallelism: MPI. Node performance: automatic-offload MKL (AO-MKL) and OpenCL. Offload: Language Extensions for Offload in C++/Fortran, over SCIF / OFED / IP across PCIe.
    • Programming Intel® Xeon Phi™ based systems (MPI + Offload): MPI ranks run on Intel® Xeon® processors only, and all messages go into and out of the Xeon processors; offload models are used to accelerate the MPI ranks, with TBB, OpenMP*, Cilk Plus, or Pthreads inside the coprocessor. The result is a homogeneous network of hybrid nodes, each moving data between CPU and MIC.
    • Offload Code Examples.

      C/C++ offload pragma:

        #pragma offload target (mic)
        #pragma omp parallel for reduction(+:pi)
        for (i = 0; i < count; i++) {
            float t = (float)((i + 0.5) / count);
            pi += 4.0 / (1.0 + t*t);
        }
        pi /= count;

      Fortran offload directive:

        !dir$ omp offload target(mic)
        !$omp parallel do
        do i = 1, 10
            A(i) = B(i) * C(i)
        enddo

      C/C++ language extension:

        class _Shared common {
            int data1;
            char *data2;
            class common *next;
            void process();
        };
        _Shared class common obj1, obj2;
        _Cilk_spawn _Offload obj1.process();
        _Cilk_spawn obj2.process();

      Function offload example:

        #pragma offload target(mic) \
            in(transa, transb, N, alpha, beta) \
            in(A:length(matrix_elements)) \
            in(B:length(matrix_elements)) \
            inout(C:length(matrix_elements))
        sgemm(&transa, &transb, &N, &N, &N,
              &alpha, A, &N, B, &N, &beta, C, &N);
    • Programming Intel® Xeon Phi™ based systems (MIC native): MPI ranks run on the Intel MIC coprocessors only, and all messages go into and out of the coprocessors; TBB, OpenMP*, Cilk Plus, or Pthreads are used directly within the MPI processes. The system is programmed as a homogeneous network of many-core CPUs.
    • Programming Intel® MIC-based systems (symmetric): MPI ranks run on both the coprocessors and the Intel® Xeon® processors, with messages to and from any core; TBB, OpenMP*, Cilk Plus, or Pthreads are used directly within the MPI processes. The system is programmed as a heterogeneous network of homogeneous nodes.
    • A GROWING ECOSYSTEM: developing today on Intel® Xeon Phi™ coprocessors. (Approved for public presentation. Other brands and names are the property of their respective owners.)
    • Software, Drivers, Tools & Online Resources: tools and software downloads; getting-started development guides; video workshops, tutorials, and events; code samples and case studies; articles, forums, and blogs; associated product links. http://software.intel.com/mic-developer
    • Keys to Productive Parallel Performance. Determine the best platform target for your application: Intel® Xeon® processors, Intel® Xeon Phi™ coprocessors, or both. Choose the right Xeon-centric or MIC-centric model for your application. Vectorize your application. Parallelize your application: with MPI (or another multi-process model); with threads (via Pthreads, TBB, Cilk Plus, OpenMP, etc.); and go asynchronous, overlapping computation and communication. Maintain unified source code for CPU and coprocessor.
    • Legal Disclaimer & Optimization Notice INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. Copyright © 2012 , Intel Corporation. All rights reserved. Intel, the Intel logo, Xeon, Xeon Phi, Xeon Phi logo, Core, VTune, and Cilk are trademarks of Intel Corporation in the U.S. and other countries. *Other names and brands may be claimed as the property of others. Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor family, not across different processor families: Go to: Learn About Intel® Processor Numbers Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. 
Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #20110804. Copyright © 2012, Intel Corporation. All rights reserved. intel.com/software/products
    • This slide MUST be used with any slides removed from this presentation. Legal Disclaimers: •  All products, computer systems, dates, and figures specified are preliminary based on current expectations, and are subject to change without notice. •  Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor family, not across different processor families. Go to: http://www.intel.com/products/processor_number •  Intel processors, chipsets, and desktop boards may contain design defects or errors known as errata, which may cause the product to deviate from published specifications. Current characterized errata are available on request. •  Intel® Virtualization Technology requires a computer system with an enabled Intel® processor, BIOS, and virtual machine monitor (VMM). Functionality, performance or other benefits will vary depending on hardware and software configurations. Software applications may not be compatible with all operating systems. Consult your PC manufacturer. For more information, visit http://www.intel.com/go/virtualization •  No computer system can provide absolute security under all conditions. Intel® Trusted Execution Technology (Intel® TXT) requires a computer system with Intel® Virtualization Technology, an Intel TXT-enabled processor, chipset, BIOS, Authenticated Code Modules and an Intel TXT-compatible measured launched environment (MLE). Intel TXT also requires the system to contain a TPM v1.2. For more information, visit http://www.intel.com/technology/security •  Requires a system with Intel® Turbo Boost Technology capability. Consult your PC manufacturer. Performance varies depending on hardware, software and system configuration. For more information, visit http://www.intel.com/technology/turboboost •  Intel® AES-NI requires a computer system with an AES-NI enabled processor, as well as non-Intel software to execute the instructions in the correct sequence. 
AES-NI is available on select Intel® processors. For availability, consult your reseller or system manufacturer. For more information, see http://software.intel.com/en-us/articles/intel-advanced-encryption-standard-instructions-aes-ni/
  •  Intel product is manufactured on a lead-free process. Lead is below 1000 PPM per EU RoHS directive (2002/95/EC, Annex A). No exemptions required.
  •  Halogen-free: applies only to halogenated flame retardants and PVC in components. Halogens are below 900 ppm bromine and 900 ppm chlorine.
  •  Intel, Intel Xeon, Intel Core microarchitecture, the Intel Xeon logo and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.
  •  Copyright © 2011, Intel Corporation. All rights reserved.
    • Legal Disclaimers: Performance
This slide MUST be used with any slides with performance data removed from this presentation.
  •  Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, go to: http://www.intel.com/performance/resources/benchmark_limitations.htm
  •  Intel does not control or audit the design or implementation of third party benchmarks or Web sites referenced in this document. Intel encourages all of its customers to visit the referenced Web sites or others where similar performance benchmarks are reported and confirm whether the referenced benchmarks are accurate and reflect performance of systems available for purchase.
  •  Relative performance is calculated by assigning a baseline value of 1.0 to one benchmark result, and then dividing the actual benchmark result for the baseline platform into each of the specific benchmark results of each of the other platforms, and assigning them a relative performance number that correlates with the performance improvements reported.
  •  SPEC, SPECint, SPECfp, SPECrate, SPECpower, SPECjAppServer, SPECjEnterprise, SPECjbb, SPECompM, SPECompL, and SPEC MPI are trademarks of the Standard Performance Evaluation Corporation. See http://www.spec.org for more information.
  •  TPC Benchmark is a trademark of the Transaction Processing Council. See http://www.tpc.org for more information.
  •  SAP and SAP NetWeaver are the registered trademarks of SAP AG in Germany and in several other countries. See http://www.sap.com/benchmark for more information.
  •  INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.
  •  Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, reference www.intel.com/software/products
    • WRF Configuration (Backup)
  •  Measured by Intel 3/27/2013 (Endeavor Cluster).
  •  Runs in symmetric mode.
  •  Citation: WRF graphic and detail: John Michalakes, 11/12/12. Hardware: 2-socket server with Intel® Xeon® processor E5-2670 (8C, 2.6 GHz, 115W), each node equipped with one Intel® Xeon Phi™ coprocessor (SE10X B1, 61 cores, 1.1 GHz, 8GB @ 5.5 GT/s), in an 8-node FDR cluster. WRF is available from the US National Center for Atmospheric Research in Boulder, Colorado.
  •  It is available from http://www.wrf-model.org/
  •  All KNC optimizations are in the V3.5 svn today.
  •  Results obtained under MPSS 3552, Compiler rev 146, MPI rev 30 on SE10X B1 KNC (61 cores, 1.1 GHz, 5.5 GT/s).
  •  WRF CONUS2.5km workload available from www.mmm.ucar.edu/wrf/WG2/bench/
  •  Performance comparison is based upon average timestep; initialization and post-simulation file operations are excluded.