Productive Parallel Programming for Intel Xeon Phi Coprocessors
In this video from Moabcon 2013, Bill Magro from Intel presents: Productive Parallel Programming for Intel Xeon Phi Coprocessors.

Learn more at:
http://www.adaptivecomputing.com/company/news-and-events/events/moabcon-2013/moabcon-2013-full-agenda/
and
http://www.intel.com/content/www/us/en/high-performance-computing/high-performance-xeon-phi-coprocessor-brief.html

You can watch the video of this presentation at:
http://insidehpc.com/?p=36407

    Presentation Transcript

    • Productive Parallel Programming for Intel® Xeon Phi™ Coprocessors. Bill Magro, Director and Chief Technologist, Technical Computing Software, Intel Software & Services Group
    • Still an Insatiable Need For Computing: weather prediction, genomics research, and medical imaging keep driving demand for more FLOPS. [Chart: Top500 performance projection from 100 MFlops in 1993 toward 1 ZFlops around 2029.] The PetaFlop systems of today are the client and handheld systems 10 years later. Source: www.top500.org
    • Approaching Exascale
    • Some believe virtually none of today's hardware or software technologies can be improved or modified to reach exascale, and that a complete revolution is needed. We believe evolution of today's technologies, plus hardware and software innovation, can get us there, and that a systems approach, with co-design, is critical.
    • Moore's Law: Alive and Well. Process generations: 90 nm (2003), 65 nm (2005), 45 nm (2007), 32 nm (2009), and 22 nm (2011). Along the way: strained silicon (SiGe invented, then 2nd-generation SiGe), high-k metal gate (gate-last invented, then 2nd-generation gate-last), and tri-gate transistors (first to implement), a revolutionary leap in process technology delivering a 37% performance gain at low voltage and more than 50% active power reduction at constant performance.* The foundation for all computing, including exascale. Source: Intel. *Compared to Intel 32nm Technology
    • Intel® Xeon Phi™ Coprocessor [Knights Corner]: Power Efficiency. Performance per watt of a prototype Knights Corner cluster compared to the two top graphics-accelerated clusters (higher is better): Intel Corp Knights Corner, 1381 MFLOPS/Watt (Top500 #150, 72.9 kW); Nagasaki Univ. ATI Radeon, 1380 MFLOPS/Watt (Top500 #456, 47 kW); Barcelona Supercomputing Center Nvidia Tesla 2090, 1266 MFLOPS/Watt (Top500 #177, 81.5 kW). Source: www.green500.org, June 2012. Copyright © 2012 Intel Corporation. All rights reserved.
    • Myth: explicitly managed locality is in and caches are out. Reality: caches remain the path to high performance and efficiency. [Chart: relative bandwidth and bandwidth per watt rise steeply from memory to L2 cache to L1 cache.]
    • #1 Green500 Cluster, WORLD RECORD: "Beacon" at NICS, an Intel® Xeon® processor + Intel® Xeon Phi™ coprocessor cluster, is the most power-efficient system on the list at 2.449 GigaFLOPS/Watt with 70.1% efficiency. Source: www.green500.org as of Nov 2012. Other brands and names are the property of their respective owners.
    • Reaching Exascale Power Goals Requires Architectural & Systems Focus. Memory (2x-5x): new memory interfaces (optimized memory control and transfer); extend DRAM with non-volatile memory. Processor (10x-20x): reducing data movement (functional reorganization, > 20x); domain/core power gating and aggressive voltage scaling. Interconnect (2x-5x): more interconnect on package; replace long-haul copper with integrated optics. Data center energy efficiencies (10%-20%): higher operating temperature tolerance; 480V to the rack and free air/water cooling efficiencies.
    • Extreme Parallelism: reliability of these machines requires a systems approach. [Charts: top-system concurrency trend rising steadily from 1993 to 2009; projected chip, DRAM, and socket counts through 2016 drive MTTI from hours down to minutes, with a crossover where failures per socket per year exceed the time to save a global checkpoint.] Requirements: transparent process migration; holistic fault detection and recovery; reliable end-to-end communications; integrated memory in the storage layer for fast checkpoint and workflow; N+1 scale reliable architectures consistent with stacked-memory constraints; system-wide power management and dynamic optimization; and design for system-level debug capability. Reliability is the primary force driving next-generation designs. Source: Exascale Computing Study: Technology Challenges in Achieving Exascale Systems (2008)
    • Foundation of Performance: Computing
    • Architecture for Discovery. Intel® Xeon® processor: ground-breaking real-world application performance; industry-leading energy efficiency; meets a broad set of HPC challenges. Intel® Xeon Phi™ product family: based on the Intel® Many Integrated Core (MIC) architecture; leading performance for highly parallel workloads; common Intel Xeon programming model; a productive solution for highly parallel computing.
    • Intel® Xeon® E5-2600 processors: up to 73% performance boost vs. the prior generation1 on an HPC suite of applications, and over 2X improvement on key industry benchmarks. Up to 8 cores and up to 20 MB cache significantly reduce compute time on large, complex data sets with Intel® Advanced Vector Extensions. Up to 4 channels of DDR3 1600 memory. Integrated PCI Express* I/O cuts latency while adding capacity and bandwidth. 1 Over previous-generation Intel® processors; Intel internal estimate. For more legal information on performance forecasts go to http://www.intel.com/performance
    • Introducing Intel® Xeon Phi™ Coprocessors: Highly Parallel Processing for Unparalleled Discovery. Groundbreaking differences: up to 61 IA cores at 1.1 GHz with 244 threads; up to 8 GB memory with up to 352 GB/s bandwidth; 512-bit SIMD instructions; a Linux operating system, IP addressable; standard programming languages and tools. Leading to groundbreaking results: up to 1 TeraFlop/s double-precision peak performance1; up to 2.2x higher memory bandwidth than an Intel® Xeon® processor E5 family-based server2; up to 4x more performance per watt than an Intel® Xeon® processor E5 family-based server.3 Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://www.intel.com/performance. Notes 1, 2 & 3: see backup for system configuration details.
    • BIG GAINS FOR SELECT APPLICATIONS. [Chart: theoretical acceleration of a highly parallel Intel® Xeon Phi™ coprocessor versus a standard multi-core Intel® Xeon® processor, plotted against the fraction of the code that is parallel and vectorized; meaningful gains appear only as that fraction approaches 100%.] The path: parallelize, vectorize, then scale to many-core.
    • Performance Potential of Intel® Xeon Phi™ Coprocessors
    • Synthetic Benchmark Summary (Intel® MKL), 2S Intel® Xeon® processor vs. 1 Intel® Xeon Phi™ coprocessor, higher is better: SGEMM up to 2.9X (1,860 vs. 640 GF/s, 86% efficient); DGEMM up to 2.8X (883 vs. 309 GF/s, 82% efficient); HPLinpack up to 2.6X (803 vs. 303 GF/s, 75% efficient); STREAM Triad up to 2.2X (175 GB/s with ECC on, 181 GB/s with ECC off, vs. 79 GB/s). Notes: 1. Intel® Xeon® processor E5-2670 used for all; SGEMM matrix 13824 x 13824, DGEMM matrix 7936 x 7936, SMP Linpack matrix 30720 x 30720. 2. Intel® Xeon Phi™ coprocessor SE10P (ECC on) with "Gold Release Candidate" software stack; SGEMM matrix 15360 x 15360, DGEMM matrix 7680 x 7680, SMP Linpack matrix 26872 x 28672. Source: Intel measured results as of October 26, 2012. Configuration details: please reference slide speaker notes. For more information go to http://www.intel.com/performance
    • PARALLELIZING FOR HIGH PERFORMANCE. Example: SAXPY. Starting point: typical serial code running on multi-core Intel® Xeon® processors, 67.097 seconds. Step 1, optimize the code: parallelize and vectorize, and continue to run on multi-core Intel Xeon processors, 0.46 seconds (145X faster). Step 2, use coprocessors: run all or part of the optimized code on Intel® Xeon Phi™ coprocessors, 0.197 seconds (2.3X faster than step 1, 340X faster overall).
    • Performance Proof-Point: Government and Academic Research, Weather Research and Forecasting (WRF). Speedup (higher is better): 1 for a 2S Intel® Xeon® processor E5-2670 four-node cluster; 1.45 for 2S Intel® Xeon® processor E5-2670 + Intel® Xeon Phi™ coprocessor (pre-production HW/SW) in the same four-node cluster configuration. Status: WRF v3.5 coming soon. Code optimization: approximately two dozen files with less than 2,000 lines of code were modified (out of approximately 700,000 lines of code in about 800 files, all Fortran standard compliant); most modifications improved performance for both the host and the coprocessors. Performance measurements: V3.5Pre and the NCAR-supported CONUS2.5KM benchmark (a high-resolution weather forecast). Acknowledgments: there were many contributors to these results, including the National Renewable Energy Laboratory and The Weather Channel Companies. Source: Intel or third-party measured results as of December 2012. Configuration details: please see backup slides. For more information go to http://www.intel.com/performance
    • PROVEN PERFORMANCE BENEFITS, Intel® Xeon Phi™ Coprocessor. Finite element analysis: Sandia National Labs MiniFE2, up to 2X. Seismic: Acceleware 8th-order isotropic variable velocity1, up to 2.23X. China Oil & Gas: Geoeast pre-stack time migration3, up to 2.05X. Notes: 1. 2S Xeon* vs. 1 Xeon Phi* (pre-production HW/SW, application running 100% on the coprocessor unless otherwise noted). 2. 8-node cluster, each node with 2S Xeon* (comparison is cluster performance with and without 1 Xeon Phi* per node) (hetero). 3. 2S Xeon* vs. 2S Xeon* + 2 Xeon Phi* (offload).
    • PROVEN PERFORMANCE BENEFITS, Intel® Xeon Phi™ Coprocessor. Ray tracing: Intel Labs Embree ray tracing3, 1.8X speed-up. Physics: Jefferson Lab Lattice QCD, up to 2.7X. Finance: Black-Scholes SP2, up to 7X; Monte Carlo SP2, up to 10.75X. Notes: 1. 2S Xeon* vs. 1 Xeon Phi* (pre-production HW/SW, application running 100% on the coprocessor unless otherwise noted). 2. Includes additional FLOPS from the transcendental function unit. 3. Intel measured, Oct. 2012.
    • Achieving Productive Parallelism with Intel® Xeon Phi™ Coprocessors
    • More Cores. Wider Vectors. Performance Delivered. Intel® Parallel Studio XE 2013 and Intel® Cluster Studio XE 2013 scale performance efficiently from multicore to many-core (50+ cores) and from 128-bit to 256-bit to 512-bit vectors, spanning serial, task- and data-parallel, and distributed performance: industry-leading performance from advanced compilers; comprehensive libraries; parallel programming models; insightful analysis tools.
    • Parallel Performance Potential: if your performance needs are met by an Intel® Xeon® processor, they will be achieved with fewer threads than on a coprocessor. On a coprocessor, you need more threads to achieve the same performance, and the same thread count can yield less performance. Intel® Xeon Phi™ excels on highly parallel applications.
    • Maximizing Parallel Performance. Two fundamental considerations: scaling (does the application already scale to the limits of Intel® Xeon® processors?) and vectorization and memory usage (does the application make good use of vectors, or is it memory-bandwidth bound?). If both are true for an application, then the highly parallel and power-efficient Intel Xeon Phi coprocessor is most likely to be worth evaluating.
    • Intel® Family of Parallel Programming Models, applicable to multi-core and many-core programming. Intel® Cilk™ Plus: C/C++ language extensions to simplify parallelism (open sourced, also an Intel product). Intel® Threading Building Blocks: widely used C++ template library for parallelism (open sourced, also an Intel product). Domain-specific libraries: Intel® Math Kernel Library. Established standards: Message Passing Interface (MPI), OpenMP*, Coarray Fortran, OpenCL*. Research and development: Intel® Concurrent Collections, Offload Extensions, Intel® SPMD Parallel Compiler. A choice of high-performance parallel programming models.
    • Single-Source Approach to Multi- and Many-Core: one source, with the same compilers, libraries, and parallel models, targets multicore CPUs, the Intel® MIC architecture coprocessor, multicore clusters, and clusters mixing multicore and many-core nodes. "Unparalleled productivity… most of this software does not run on a GPU." (R. Harrison, "Opportunities and Challenges Posed by Exascale Computing: ORNL's Plans and Perspectives", National Institute of Computational Sciences, Nov 2011)
    • The tool stack spans both architectures: Intel® C/C++ and Fortran Compilers with OpenMP; Intel® MKL, Intel® Cilk™ Plus, Intel® TBB, and Intel® IPP; and Intel® Inspector XE, Intel® VTune™ Amplifier XE, and Intel® Advisor, packaged together as Intel® Parallel Studio XE. Adding the Intel® MPI Library and the Intel® Trace Analyzer and Collector yields Intel® Cluster Studio XE.
    • Intel® Xeon Phi™ Coprocessor: a Game Changer for HPC. Build your applications on a known compute platform, and watch them take off sooner. With restrictive special-purpose hardware: new learning, complex code porting, and slow progress to accurate runs. With the Intel® Xeon Phi™ coprocessor: familiar tools and runtimes. "We ported millions of lines of code in only days… Unparalleled productivity… most of this software does not run on a GPU and never will." (Robert Harrison, National Institute for Computational Sciences, Oak Ridge National Laboratory; "Opportunities and Challenges Posed by Exascale Computing: ORNL's Plans and Perspectives", NICS, 2011)
    • Achieving Parallelism in Applications. IA benefit: a wide range of development options, ordered from ease of use to fine control. Parallelization options: Intel® Math Kernel Library; MPI*; OpenMP*; Intel® Threading Building Blocks; Intel® Cilk™ Plus; OpenCL*; Pthreads*. Vector options: Intel® Math Kernel Library; auto vectorization; semi-auto vectorization: #pragma (vector, ivdep, simd); array notation (Intel® Cilk™ Plus); C/C++ vector classes (F32vec16, F64vec8); intrinsics.
    • Spectrum of Programming Models and Mindsets, from multi-core centric to many-core centric: Multi-Core Hosted (general-purpose serial and parallel computing; main(), foo(), and MPI_*() all on the CPU); Offload (codes with highly parallel phases; main() on the CPU, foo() offloaded to the coprocessor); Symmetric (codes with balanced needs; main(), foo(), and MPI_*() on both); Many-Core Hosted (highly parallel codes running entirely on the coprocessor). A range of models to meet application needs.
    • Operating Environment View: the Intel® Xeon® processor host runs a Linux Standard Base ("LSB") platform; the Intel® Xeon Phi™ coprocessor runs Linux with its own ABI, connected to the host over PCIe via SCIF. The coprocessor is IP addressable and supports SSH, NFS, file I/O, and sockets/OFED runtimes. A flexible, familiar, compatible operating environment.
    • Programming View: the same parallel models apply to processor and coprocessor. Intra-node parallelism: Intel® MKL, TBB, Cilk™ Plus, OpenMP, and Pthreads on both sides. Intra- and inter-node parallelism: MPI. Node performance: automatic-offload MKL (AO-MKL) and OpenCL. Offload: Language Extensions for Offload in C++/Fortran, over SCIF / OFED / IP across PCIe.
    • Programming Intel® Xeon Phi™ based systems (MPI + Offload): MPI ranks run on Intel® Xeon® processors only, and all messages go into and out of the Xeon processors; offload models are used to accelerate the MPI ranks, with TBB, OpenMP*, Cilk Plus, or Pthreads inside the coprocessor. The result is a homogeneous network of hybrid nodes, each moving data between CPU and MIC.
    • Offload Code Examples.

      C/C++ offload pragma:

        #pragma offload target (mic)
        #pragma omp parallel for reduction(+:pi)
        for (i = 0; i < count; i++) {
            float t = (float)((i + 0.5) / count);
            pi += 4.0 / (1.0 + t*t);
        }
        pi /= count;

      Fortran offload directive:

        !dir$ omp offload target(mic)
        !$omp parallel do
        do i = 1, 10
            A(i) = B(i) * C(i)
        enddo

      C/C++ language extension:

        class _Shared common {
            int data1;
            char *data2;
            class common *next;
            void process();
        };
        _Shared class common obj1, obj2;
        _Cilk_spawn _Offload obj1.process();
        _Cilk_spawn obj2.process();

      Function offload example:

        #pragma offload target(mic) \
            in(transa, transb, N, alpha, beta) \
            in(A:length(matrix_elements)) \
            in(B:length(matrix_elements)) \
            inout(C:length(matrix_elements))
        sgemm(&transa, &transb, &N, &N, &N,
              &alpha, A, &N, B, &N, &beta, C, &N);
    • Programming Intel® Xeon Phi™ based systems (MIC native): MPI ranks run on the Intel MIC coprocessors only, and all messages go into and out of the coprocessors; TBB, OpenMP*, Cilk Plus, or Pthreads are used directly within the MPI processes. The system is programmed as a homogeneous network of many-core CPUs.
    • Programming Intel® MIC-based systems (symmetric): MPI ranks run on both the coprocessors and the Intel® Xeon® processors, with messages to and from any core; TBB, OpenMP*, Cilk Plus, or Pthreads are used directly within the MPI processes. The system is programmed as a heterogeneous network of homogeneous nodes.
    • A GROWING ECOSYSTEM: developing today on Intel® Xeon Phi™ coprocessors. (Approved for public presentation. Other brands and names are the property of their respective owners.)
    • Software, Drivers, Tools & Online Resources: tools and software downloads; getting-started development guides; video workshops, tutorials, and events; code samples and case studies; articles, forums, and blogs; associated product links. http://software.intel.com/mic-developer
    • Keys to Productive Parallel Performance. Determine the best platform target for your application: Intel® Xeon® processors, Intel® Xeon Phi™ coprocessors, or both. Choose the right Xeon-centric or MIC-centric model for your application. Vectorize your application. Parallelize your application: with MPI (or another multi-process model); with threads (via Pthreads, TBB, Cilk Plus, OpenMP, etc.); and go asynchronous, overlapping computation and communication. Maintain unified source code for CPU and coprocessor.
    • Legal Disclaimer & Optimization Notice INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. Copyright © 2012 , Intel Corporation. All rights reserved. Intel, the Intel logo, Xeon, Xeon Phi, Xeon Phi logo, Core, VTune, and Cilk are trademarks of Intel Corporation in the U.S. and other countries. *Other names and brands may be claimed as the property of others. Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor family, not across different processor families: Go to: Learn About Intel® Processor Numbers Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. 
Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #20110804. Copyright © 2012, Intel Corporation. All rights reserved. intel.com/software/products
    • This slide MUST be used with any slides removed from this presentation. Legal Disclaimers: •  All products, computer systems, dates, and figures specified are preliminary based on current expectations, and are subject to change without notice. •  Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor family, not across different processor families. Go to: http://www.intel.com/products/processor_number •  Intel processors, chipsets, and desktop boards may contain design defects or errors known as errata, which may cause the product to deviate from published specifications. Current characterized errata are available on request. •  Intel® Virtualization Technology requires a computer system with an enabled Intel® processor, BIOS, and virtual machine monitor (VMM). Functionality, performance or other benefits will vary depending on hardware and software configurations. Software applications may not be compatible with all operating systems. Consult your PC manufacturer. For more information, visit http://www.intel.com/go/virtualization •  No computer system can provide absolute security under all conditions. Intel® Trusted Execution Technology (Intel® TXT) requires a computer system with Intel® Virtualization Technology, an Intel TXT-enabled processor, chipset, BIOS, Authenticated Code Modules and an Intel TXT-compatible measured launched environment (MLE). Intel TXT also requires the system to contain a TPM v1.2. For more information, visit http://www.intel.com/technology/security •  Requires a system with Intel® Turbo Boost Technology capability. Consult your PC manufacturer. Performance varies depending on hardware, software and system configuration. For more information, visit http://www.intel.com/technology/turboboost •  Intel® AES-NI requires a computer system with an AES-NI enabled processor, as well as non-Intel software to execute the instructions in the correct sequence. 
AES-NI is available on select Intel® processors. For availability, consult your reseller or system manufacturer. For more information, see http://software.intel.com/en-us/articles/intel-advanced-encryption-standard-instructions-aes-ni/
  •  Intel product is manufactured on a lead-free process. Lead is below 1000 PPM per EU RoHS directive (2002/95/EC, Annex A). No exemptions required.
  •  Halogen-free: applies only to halogenated flame retardants and PVC in components. Halogens are below 900 ppm bromine and 900 ppm chlorine.
  •  Intel, Intel Xeon, Intel Core microarchitecture, the Intel Xeon logo and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.
  •  Copyright © 2011, Intel Corporation. All rights reserved.
    • Legal Disclaimers: Performance
This slide MUST be used with any slides with performance data removed from this presentation.
  •  Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, go to: http://www.intel.com/performance/resources/benchmark_limitations.htm
  •  Intel does not control or audit the design or implementation of third party benchmarks or Web sites referenced in this document. Intel encourages all of its customers to visit the referenced Web sites or others where similar performance benchmarks are reported and confirm whether the referenced benchmarks are accurate and reflect performance of systems available for purchase.
  •  Relative performance is calculated by assigning a baseline value of 1.0 to one benchmark result, and then dividing the actual benchmark result for the baseline platform into each of the specific benchmark results of each of the other platforms, and assigning them a relative performance number that correlates with the performance improvements reported.
  •  SPEC, SPECint, SPECfp, SPECrate, SPECpower, SPECjAppServer, SPECjEnterprise, SPECjbb, SPECompM, SPECompL, and SPEC MPI are trademarks of the Standard Performance Evaluation Corporation. See http://www.spec.org for more information.
  •  TPC Benchmark is a trademark of the Transaction Processing Council. See http://www.tpc.org for more information.
  •  SAP and SAP NetWeaver are the registered trademarks of SAP AG in Germany and in several other countries. See http://www.sap.com/benchmark for more information.
  •  INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.
  •  Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, reference www.intel.com/software/products
    • WRF Configuration (Backup)
  •  Measured by Intel 3/27/2013 (Endeavor Cluster).
  •  Runs in symmetric mode.
  •  Citation: WRF graphic and detail: John Michalakes, 11/12/12. Hardware: 2-socket server with Intel® Xeon® processor E5-2670 (8C, 2.6 GHz, 115W), each node equipped with one Intel® Xeon Phi™ coprocessor (SE10X B1, 61 cores, 1.1 GHz, 8GB @ 5.5 GT/s), in an 8-node FDR cluster. WRF is available from the US National Center for Atmospheric Research in Boulder, Colorado.
  •  It is available from http://www.wrf-model.org/
  •  All KNC optimizations are in the V3.5 svn today.
  •  Results obtained under MPSS 3552, Compiler rev 146, MPI rev 30 on SE10X B1 KNC (61 cores, 1.1 GHz, 5.5 GT/s).
  •  WRF CONUS2.5km workload available from www.mmm.ucar.edu/wrf/WG2/bench/
  •  Performance comparison is based upon average timestep; initialization and post-simulation file operations are excluded.