Unleash performance through
parallelism – Intel® Math Kernel Library
Kent Moffat
Product Manager
Intel Developer Products Division
ISC ‘13
Leipzig, Germany
June 2013
Agenda
• What is Intel® Math Kernel Library (Intel® MKL)?
• Intel® MKL supports Intel® Xeon Phi™ Coprocessor
• Intel® MKL usage models on Intel® Xeon Phi™
 Automatic Offload
 Compiler Assisted Offload
 Native Execution
• Performance roadmap
• Where to find more information?
Linear Algebra
• BLAS
• LAPACK
• Sparse solvers
• ScaLAPACK
Fast Fourier Transforms
• Multidimensional (up to 7D)
• FFTW interfaces
• Cluster FFT
Vector Math
• Trigonometric
• Hyperbolic
• Exponential, Logarithmic
• Power / Root
• Rounding
Vector Random Number Generators
• Congruential
• Recursive
• Wichmann-Hill
• Mersenne Twister
• Sobol
• Niederreiter
• Non-deterministic
Summary Statistics
• Kurtosis
• Variation coefficient
• Quantiles, order statistics
• Min/max
• Variance-covariance
• …
Data Fitting
• Splines
• Interpolation
• Cell search
Intel® MKL is the industry’s leading math library *
[Diagram: Intel® MKL scales from a multicore CPU, to a multicore CPU with an Intel® MIC co-processor, to multicore clusters, to clusters with both multicore and many-core nodes]
* 2011 and 2012 Evans Data N. American developer survey
Intel® MKL Support for Intel® Xeon Phi™ Coprocessor
• Intel® MKL 11.0 supports the Intel® Xeon Phi™ coprocessor
• Heterogeneous computing
 Takes advantage of both multicore host and many-core coprocessors
• Optimized for wider (512-bit) SIMD instructions
• Flexible usage models:
 Automatic Offload: Offers transparent heterogeneous computing
 Compiler Assisted Offload: Allows fine offloading control
 Native execution: Use the coprocessors as independent nodes
Using Intel® MKL on Intel® Xeon Phi™
• Performance scales from multicore to many-cores
• Familiarity of architecture and programming models
• Code re-use, faster time-to-market
Usage Models on Intel® Xeon Phi™ Coprocessor
• Automatic Offload
 No code changes required
 Automatically uses both host and target
 Transparent data transfer and execution management
• Compiler Assisted Offload
 Explicit controls of data transfer and remote execution using compiler offload pragmas/directives
 Can be used together with Automatic Offload
• Native Execution
 Uses the coprocessors as independent nodes
 Input data is copied to targets in advance
Execution Models
• Multicore Hosted: general-purpose serial and parallel computing
• Offload: codes with highly-parallel phases
• Symmetric: codes with balanced needs
• Many Core Hosted: highly-parallel codes
These models span a spectrum from multicore centric (Intel® Xeon®) to many-core centric (Intel® Xeon Phi™).
Automatic Offload (AO)
• Offloading is automatic and transparent
• By default, Intel® MKL decides:
 When to offload
 Work division between host and targets
• Users enjoy host and target parallelism automatically
• Users can still control work division to fine tune performance
How to Use Automatic Offload
• Using Automatic Offload is easy
• What if there is no coprocessor in the system?
 Runs on the host as usual without any penalty!
Call a function:
mkl_mic_enable()
or
Set an env variable:
MKL_MIC_ENABLE=1
Automatic Offload Enabled Functions
• A select subset of MKL functions is AO enabled.
 Only functions with sufficient computation to offset data transfer overhead are subject to AO.
• In 11.0, AO enabled functions include:
 Level-3 BLAS: ?GEMM, ?TRSM, ?TRMM
 LAPACK 3 amigos: LU, QR, Cholesky
Compiler Assisted Offload (CAO)
• Offloading is explicitly controlled by compiler pragmas or directives.
• All MKL functions can be offloaded in CAO.
 In comparison, only a subset of MKL is subject to AO.
• Can leverage the full potential of the compiler’s offloading facility.
• More flexibility in data transfer and remote execution management.
 A big advantage is data persistence: reusing transferred data for multiple operations.
How to Use Compiler Assisted Offload
• The same way you would offload any function call to the coprocessor.
• An example in C:
#pragma offload target(mic) \
    in(transa, transb, N, alpha, beta) \
    in(A:length(matrix_elements)) \
    in(B:length(matrix_elements)) \
    in(C:length(matrix_elements)) \
    out(C:length(matrix_elements) alloc_if(0))
{
    sgemm(&transa, &transb, &N, &N, &N, &alpha, A, &N, B, &N,
          &beta, C, &N);
}
How to Use Compiler Assisted Offload
• An example in Fortran:
!DEC$ ATTRIBUTES OFFLOAD : TARGET( MIC ) :: SGEMM
!DEC$ OMP OFFLOAD TARGET( MIC ) &
!DEC$ IN( TRANSA, TRANSB, M, N, K, ALPHA, BETA, LDA, LDB, LDC ), &
!DEC$ IN( A: LENGTH( NCOLA * LDA )), &
!DEC$ IN( B: LENGTH( NCOLB * LDB )), &
!DEC$ INOUT( C: LENGTH( N * LDC ))
!$OMP PARALLEL SECTIONS
!$OMP SECTION
CALL SGEMM( TRANSA, TRANSB, M, N, K, ALPHA, &
            A, LDA, B, LDB, BETA, C, LDC )
!$OMP END PARALLEL SECTIONS
Native Execution
• Programs can be built to run only on the coprocessor by using the -mmic build option.
• Tips for using MKL in native execution:
– Use all threads to get best performance (240 threads on a 60-core coprocessor)
MIC_OMP_NUM_THREADS=240
– Thread affinity setting
KMP_AFFINITY=explicit,proclist=[1-240:1,0,241,242,243],granularity=fine
– Use huge pages for memory allocation (more details on the “Intel Developer Zone”):
 The mmap() system call
 The open-source libhugetlbfs.so library: http://libhugetlbfs.sourceforge.net/
Suggestions on Choosing Usage Models
• Choose native execution if
 Highly parallel code.
 Want to use coprocessors as independent compute nodes.
• Choose AO when
 A sufficiently high FLOP-to-byte ratio makes offload beneficial.
 Level-3 BLAS functions: ?GEMM, ?TRMM, ?TRSM.
 LU and QR factorization (in upcoming release updates).
• Choose CAO when either of the following holds:
 There is enough computation to offset data transfer overhead.
 Transferred data can be reused by multiple operations.
• You can always run on the host.
Intel® MKL Link Line Advisor
• A web tool that helps users choose the correct link-line options.
• http://software.intel.com/sites/products/mkl/
• Also available offline in the MKL product package.
Where to Find MKL Documentation?
• How to use MKL on Intel® Xeon Phi™ coprocessors:
– “Using Intel Math Kernel Library on Intel® Xeon Phi™ Coprocessors” section in the User’s Guide.
– “Support Functions for Intel Many Integrated Core Architecture” section in the Reference Manual.
• Tips, app notes, etc. can also be found on the “Intel Developer Zone”:
– http://www.intel.com/software/mic
– http://www.intel.com/software/mic-developer
Unleash performance through parallelism - Intel® Math Kernel Library
Unleash performance through parallelism - Intel® Math Kernel Library

More Related Content

What's hot

Implementation of Soft-core processor on FPGA (Final Presentation)
Implementation of Soft-core processor on FPGA (Final Presentation)Implementation of Soft-core processor on FPGA (Final Presentation)
Implementation of Soft-core processor on FPGA (Final Presentation)
Deepak Kumar
 
Presentation systemc
Presentation systemcPresentation systemc
Presentation systemc
SUBRAHMANYA S
 
Synthesizing HDL using LeonardoSpectrum
Synthesizing HDL using LeonardoSpectrumSynthesizing HDL using LeonardoSpectrum
Synthesizing HDL using LeonardoSpectrum
Hossam Hassan
 
Technology
TechnologyTechnology
Technology
Nishant Rayan
 
Glow user review
Glow user reviewGlow user review
Glow user review
冠旭 陳
 
Threading Successes 03 Gamebryo
Threading Successes 03   GamebryoThreading Successes 03   Gamebryo
Threading Successes 03 Gamebryo
guest40fc7cd
 
Coverage Solutions on Emulators
Coverage Solutions on EmulatorsCoverage Solutions on Emulators
Coverage Solutions on Emulators
DVClub
 
Developer's Guide to Knights Landing
Developer's Guide to Knights LandingDeveloper's Guide to Knights Landing
Developer's Guide to Knights Landing
Andrey Vladimirov
 
Chris brown ti
Chris brown tiChris brown ti
Chris brown ti
Obsidian Software
 
Programmable logic device (PLD)
Programmable logic device (PLD)Programmable logic device (PLD)
Programmable logic device (PLD)
Sɐɐp ɐɥɯǝp
 
Compiler optimization
Compiler optimizationCompiler optimization
Compiler optimization
liu_ming50
 
Instruction level parallelism
Instruction level parallelismInstruction level parallelism
Instruction level parallelism
deviyasharwin
 
Efficient execution of quantized deep learning models a compiler approach
Efficient execution of quantized deep learning models a compiler approachEfficient execution of quantized deep learning models a compiler approach
Efficient execution of quantized deep learning models a compiler approach
jemin lee
 
Introduction to OpenMP
Introduction to OpenMPIntroduction to OpenMP
Introduction to OpenMP
Akhila Prabhakaran
 
Fpga Knowledge
Fpga KnowledgeFpga Knowledge
Fpga Knowledge
ranvirsingh
 
Cobol performance tuning paper lessons learned - s8833 tr
Cobol performance tuning paper   lessons learned - s8833 trCobol performance tuning paper   lessons learned - s8833 tr
Cobol performance tuning paper lessons learned - s8833 tr
Pedro Barros
 
Using a Field Programmable Gate Array to Accelerate Application Performance
Using a Field Programmable Gate Array to Accelerate Application PerformanceUsing a Field Programmable Gate Array to Accelerate Application Performance
Using a Field Programmable Gate Array to Accelerate Application Performance
Odinot Stanislas
 
Embedded system -Introduction to hardware designing
Embedded system  -Introduction to hardware designingEmbedded system  -Introduction to hardware designing
Embedded system -Introduction to hardware designing
Vibrant Technologies & Computers
 

What's hot (18)

Implementation of Soft-core processor on FPGA (Final Presentation)
Implementation of Soft-core processor on FPGA (Final Presentation)Implementation of Soft-core processor on FPGA (Final Presentation)
Implementation of Soft-core processor on FPGA (Final Presentation)
 
Presentation systemc
Presentation systemcPresentation systemc
Presentation systemc
 
Synthesizing HDL using LeonardoSpectrum
Synthesizing HDL using LeonardoSpectrumSynthesizing HDL using LeonardoSpectrum
Synthesizing HDL using LeonardoSpectrum
 
Technology
TechnologyTechnology
Technology
 
Glow user review
Glow user reviewGlow user review
Glow user review
 
Threading Successes 03 Gamebryo
Threading Successes 03   GamebryoThreading Successes 03   Gamebryo
Threading Successes 03 Gamebryo
 
Coverage Solutions on Emulators
Coverage Solutions on EmulatorsCoverage Solutions on Emulators
Coverage Solutions on Emulators
 
Developer's Guide to Knights Landing
Developer's Guide to Knights LandingDeveloper's Guide to Knights Landing
Developer's Guide to Knights Landing
 
Chris brown ti
Chris brown tiChris brown ti
Chris brown ti
 
Programmable logic device (PLD)
Programmable logic device (PLD)Programmable logic device (PLD)
Programmable logic device (PLD)
 
Compiler optimization
Compiler optimizationCompiler optimization
Compiler optimization
 
Instruction level parallelism
Instruction level parallelismInstruction level parallelism
Instruction level parallelism
 
Efficient execution of quantized deep learning models a compiler approach
Efficient execution of quantized deep learning models a compiler approachEfficient execution of quantized deep learning models a compiler approach
Efficient execution of quantized deep learning models a compiler approach
 
Introduction to OpenMP
Introduction to OpenMPIntroduction to OpenMP
Introduction to OpenMP
 
Fpga Knowledge
Fpga KnowledgeFpga Knowledge
Fpga Knowledge
 
Cobol performance tuning paper lessons learned - s8833 tr
Cobol performance tuning paper   lessons learned - s8833 trCobol performance tuning paper   lessons learned - s8833 tr
Cobol performance tuning paper lessons learned - s8833 tr
 
Using a Field Programmable Gate Array to Accelerate Application Performance
Using a Field Programmable Gate Array to Accelerate Application PerformanceUsing a Field Programmable Gate Array to Accelerate Application Performance
Using a Field Programmable Gate Array to Accelerate Application Performance
 
Embedded system -Introduction to hardware designing
Embedded system  -Introduction to hardware designingEmbedded system  -Introduction to hardware designing
Embedded system -Introduction to hardware designing
 

Similar to Unleash performance through parallelism - Intel® Math Kernel Library

Performance Optimization of Deep Learning Frameworks Caffe* and Tensorflow* f...
Performance Optimization of Deep Learning Frameworks Caffe* and Tensorflow* f...Performance Optimization of Deep Learning Frameworks Caffe* and Tensorflow* f...
Performance Optimization of Deep Learning Frameworks Caffe* and Tensorflow* f...
Intel® Software
 
2018 03 25 system ml ai and openpower meetup
2018 03 25 system ml ai and openpower meetup2018 03 25 system ml ai and openpower meetup
2018 03 25 system ml ai and openpower meetup
Ganesan Narayanasamy
 
Ml also helps generic compiler ?
Ml also helps generic compiler ?Ml also helps generic compiler ?
Ml also helps generic compiler ?
Ryo Takahashi
 
Parallelization of Coupled Cluster Code with OpenMP
Parallelization of Coupled Cluster Code with OpenMPParallelization of Coupled Cluster Code with OpenMP
Parallelization of Coupled Cluster Code with OpenMP
Anil Bohare
 
How I Sped up Complex Matrix-Vector Multiplication: Finding Intel MKL's "S
How I Sped up Complex Matrix-Vector Multiplication: Finding Intel MKL's "SHow I Sped up Complex Matrix-Vector Multiplication: Finding Intel MKL's "S
How I Sped up Complex Matrix-Vector Multiplication: Finding Intel MKL's "S
Brandon Liu
 
OMI - The Missing Piece of a Modular, Flexible and Composable Computing World
OMI - The Missing Piece of a Modular, Flexible and Composable Computing WorldOMI - The Missing Piece of a Modular, Flexible and Composable Computing World
OMI - The Missing Piece of a Modular, Flexible and Composable Computing World
Allan Cantle
 
OpenPOWER Acceleration of HPCC Systems
OpenPOWER Acceleration of HPCC SystemsOpenPOWER Acceleration of HPCC Systems
OpenPOWER Acceleration of HPCC Systems
HPCC Systems
 
Smart logic
Smart logicSmart logic
Floating Point Operations , Memory Chip Organization , Serial Bus Architectur...
Floating Point Operations , Memory Chip Organization , Serial Bus Architectur...Floating Point Operations , Memory Chip Organization , Serial Bus Architectur...
Floating Point Operations , Memory Chip Organization , Serial Bus Architectur...
KRamasamy2
 
OpenCL caffe IWOCL 2016 presentation final
OpenCL caffe IWOCL 2016 presentation finalOpenCL caffe IWOCL 2016 presentation final
OpenCL caffe IWOCL 2016 presentation final
Junli Gu
 
Floating Point Operations , Memory Chip Organization , Serial Bus Architectur...
Floating Point Operations , Memory Chip Organization , Serial Bus Architectur...Floating Point Operations , Memory Chip Organization , Serial Bus Architectur...
Floating Point Operations , Memory Chip Organization , Serial Bus Architectur...
VAISHNAVI MADHAN
 
ARM® Cortex™ M Energy Optimization - Using Instruction Cache
ARM® Cortex™ M Energy Optimization - Using Instruction CacheARM® Cortex™ M Energy Optimization - Using Instruction Cache
ARM® Cortex™ M Energy Optimization - Using Instruction Cache
Raahul Raghavan
 
Challenges in Embedded Computing
Challenges in Embedded ComputingChallenges in Embedded Computing
Challenges in Embedded Computing
Pradeep Kumar TS
 
Lect_1.pptx
Lect_1.pptxLect_1.pptx
Lect_1.pptx
varshaks3
 
High Performance Erlang - Pitfalls and Solutions
High Performance Erlang - Pitfalls and SolutionsHigh Performance Erlang - Pitfalls and Solutions
High Performance Erlang - Pitfalls and Solutions
Yinghai Lu
 
My parallel universe
My parallel universeMy parallel universe
My parallel universe
Andreas Olofsson
 
Intel's Presentation in SIGGRAPH OpenCL BOF
Intel's Presentation in SIGGRAPH OpenCL BOFIntel's Presentation in SIGGRAPH OpenCL BOF
Intel's Presentation in SIGGRAPH OpenCL BOF
Ofer Rosenberg
 
Distributed Model Validation with Epsilon
Distributed Model Validation with EpsilonDistributed Model Validation with Epsilon
Distributed Model Validation with Epsilon
Sina Madani
 
Parallel Vector Tile-Optimized Library (PVTOL) Architecture-v3.pdf
Parallel Vector Tile-Optimized Library (PVTOL) Architecture-v3.pdfParallel Vector Tile-Optimized Library (PVTOL) Architecture-v3.pdf
Parallel Vector Tile-Optimized Library (PVTOL) Architecture-v3.pdf
Slide_N
 
Lightweight DNN Processor Design (based on NVDLA)
Lightweight DNN Processor Design (based on NVDLA)Lightweight DNN Processor Design (based on NVDLA)
Lightweight DNN Processor Design (based on NVDLA)
Shien-Chun Luo
 

Similar to Unleash performance through parallelism - Intel® Math Kernel Library (20)

Performance Optimization of Deep Learning Frameworks Caffe* and Tensorflow* f...
Performance Optimization of Deep Learning Frameworks Caffe* and Tensorflow* f...Performance Optimization of Deep Learning Frameworks Caffe* and Tensorflow* f...
Performance Optimization of Deep Learning Frameworks Caffe* and Tensorflow* f...
 
2018 03 25 system ml ai and openpower meetup
2018 03 25 system ml ai and openpower meetup2018 03 25 system ml ai and openpower meetup
2018 03 25 system ml ai and openpower meetup
 
Ml also helps generic compiler ?
Ml also helps generic compiler ?Ml also helps generic compiler ?
Ml also helps generic compiler ?
 
Parallelization of Coupled Cluster Code with OpenMP
Parallelization of Coupled Cluster Code with OpenMPParallelization of Coupled Cluster Code with OpenMP
Parallelization of Coupled Cluster Code with OpenMP
 
How I Sped up Complex Matrix-Vector Multiplication: Finding Intel MKL's "S
How I Sped up Complex Matrix-Vector Multiplication: Finding Intel MKL's "SHow I Sped up Complex Matrix-Vector Multiplication: Finding Intel MKL's "S
How I Sped up Complex Matrix-Vector Multiplication: Finding Intel MKL's "S
 
OMI - The Missing Piece of a Modular, Flexible and Composable Computing World
OMI - The Missing Piece of a Modular, Flexible and Composable Computing WorldOMI - The Missing Piece of a Modular, Flexible and Composable Computing World
OMI - The Missing Piece of a Modular, Flexible and Composable Computing World
 
OpenPOWER Acceleration of HPCC Systems
OpenPOWER Acceleration of HPCC SystemsOpenPOWER Acceleration of HPCC Systems
OpenPOWER Acceleration of HPCC Systems
 
Smart logic
Smart logicSmart logic
Smart logic
 
Floating Point Operations , Memory Chip Organization , Serial Bus Architectur...
Floating Point Operations , Memory Chip Organization , Serial Bus Architectur...Floating Point Operations , Memory Chip Organization , Serial Bus Architectur...
Floating Point Operations , Memory Chip Organization , Serial Bus Architectur...
 
OpenCL caffe IWOCL 2016 presentation final
OpenCL caffe IWOCL 2016 presentation finalOpenCL caffe IWOCL 2016 presentation final
OpenCL caffe IWOCL 2016 presentation final
 
Floating Point Operations , Memory Chip Organization , Serial Bus Architectur...
Floating Point Operations , Memory Chip Organization , Serial Bus Architectur...Floating Point Operations , Memory Chip Organization , Serial Bus Architectur...
Floating Point Operations , Memory Chip Organization , Serial Bus Architectur...
 
ARM® Cortex™ M Energy Optimization - Using Instruction Cache
ARM® Cortex™ M Energy Optimization - Using Instruction CacheARM® Cortex™ M Energy Optimization - Using Instruction Cache
ARM® Cortex™ M Energy Optimization - Using Instruction Cache
 
Challenges in Embedded Computing
Challenges in Embedded ComputingChallenges in Embedded Computing
Challenges in Embedded Computing
 
Lect_1.pptx
Lect_1.pptxLect_1.pptx
Lect_1.pptx
 
High Performance Erlang - Pitfalls and Solutions
High Performance Erlang - Pitfalls and SolutionsHigh Performance Erlang - Pitfalls and Solutions
High Performance Erlang - Pitfalls and Solutions
 
My parallel universe
My parallel universeMy parallel universe
My parallel universe
 
Intel's Presentation in SIGGRAPH OpenCL BOF
Intel's Presentation in SIGGRAPH OpenCL BOFIntel's Presentation in SIGGRAPH OpenCL BOF
Intel's Presentation in SIGGRAPH OpenCL BOF
 
Distributed Model Validation with Epsilon
Distributed Model Validation with EpsilonDistributed Model Validation with Epsilon
Distributed Model Validation with Epsilon
 
Parallel Vector Tile-Optimized Library (PVTOL) Architecture-v3.pdf
Parallel Vector Tile-Optimized Library (PVTOL) Architecture-v3.pdfParallel Vector Tile-Optimized Library (PVTOL) Architecture-v3.pdf
Parallel Vector Tile-Optimized Library (PVTOL) Architecture-v3.pdf
 
Lightweight DNN Processor Design (based on NVDLA)
Lightweight DNN Processor Design (based on NVDLA)Lightweight DNN Processor Design (based on NVDLA)
Lightweight DNN Processor Design (based on NVDLA)
 

More from Intel IT Center

AI Crash Course- Supercomputing
AI Crash Course- SupercomputingAI Crash Course- Supercomputing
AI Crash Course- Supercomputing
Intel IT Center
 
FPGA Inference - DellEMC SURFsara
FPGA Inference - DellEMC SURFsaraFPGA Inference - DellEMC SURFsara
FPGA Inference - DellEMC SURFsara
Intel IT Center
 
High Memory Bandwidth Demo @ One Intel Station
High Memory Bandwidth Demo @ One Intel StationHigh Memory Bandwidth Demo @ One Intel Station
High Memory Bandwidth Demo @ One Intel Station
Intel IT Center
 
INFOGRAPHIC: Advantages of Intel vs. IBM Power on SAP HANA solutions
INFOGRAPHIC: Advantages of Intel vs. IBM Power on SAP HANA solutionsINFOGRAPHIC: Advantages of Intel vs. IBM Power on SAP HANA solutions
INFOGRAPHIC: Advantages of Intel vs. IBM Power on SAP HANA solutions
Intel IT Center
 
Disrupt Hackers With Robust User Authentication
Disrupt Hackers With Robust User AuthenticationDisrupt Hackers With Robust User Authentication
Disrupt Hackers With Robust User Authentication
Intel IT Center
 
Strengthen Your Enterprise Arsenal Against Cyber Attacks With Hardware-Enhanc...
Strengthen Your Enterprise Arsenal Against Cyber Attacks With Hardware-Enhanc...Strengthen Your Enterprise Arsenal Against Cyber Attacks With Hardware-Enhanc...
Strengthen Your Enterprise Arsenal Against Cyber Attacks With Hardware-Enhanc...
Intel IT Center
 
Harness Digital Disruption to Create 2022’s Workplace Today
Harness Digital Disruption to Create 2022’s Workplace TodayHarness Digital Disruption to Create 2022’s Workplace Today
Harness Digital Disruption to Create 2022’s Workplace Today
Intel IT Center
 
Don't Rely on Software Alone. Protect Endpoints with Hardware-Enhanced Security.
Don't Rely on Software Alone.Protect Endpoints with Hardware-Enhanced Security.Don't Rely on Software Alone.Protect Endpoints with Hardware-Enhanced Security.
Don't Rely on Software Alone. Protect Endpoints with Hardware-Enhanced Security.
Intel IT Center
 
Achieve Unconstrained Collaboration in a Digital World
Achieve Unconstrained Collaboration in a Digital WorldAchieve Unconstrained Collaboration in a Digital World
Achieve Unconstrained Collaboration in a Digital World
Intel IT Center
 
Intel® Xeon® Scalable Processors Enabled Applications Marketing Guide
Intel® Xeon® Scalable Processors Enabled Applications Marketing GuideIntel® Xeon® Scalable Processors Enabled Applications Marketing Guide
Intel® Xeon® Scalable Processors Enabled Applications Marketing Guide
Intel IT Center
 
#NABshow: National Association of Broadcasters 2017 Super Session Presentatio...
#NABshow: National Association of Broadcasters 2017 Super Session Presentatio...#NABshow: National Association of Broadcasters 2017 Super Session Presentatio...
#NABshow: National Association of Broadcasters 2017 Super Session Presentatio...
Intel IT Center
 
Identity Protection for the Digital Age
Identity Protection for the Digital AgeIdentity Protection for the Digital Age
Identity Protection for the Digital Age
Intel IT Center
 
Three Steps to Making a Digital Workplace a Reality
Three Steps to Making a Digital Workplace a RealityThree Steps to Making a Digital Workplace a Reality
Three Steps to Making a Digital Workplace a Reality
Intel IT Center
 
Three Steps to Making The Digital Workplace a Reality - by Intel’s Chad Const...
Three Steps to Making The Digital Workplace a Reality - by Intel’s Chad Const...Three Steps to Making The Digital Workplace a Reality - by Intel’s Chad Const...
Three Steps to Making The Digital Workplace a Reality - by Intel’s Chad Const...
Intel IT Center
 
Intel® Xeon® Processor E7-8800/4800 v4 EAMG 2.0
Intel® Xeon® Processor E7-8800/4800 v4 EAMG 2.0Intel® Xeon® Processor E7-8800/4800 v4 EAMG 2.0
Intel® Xeon® Processor E7-8800/4800 v4 EAMG 2.0
Intel IT Center
 
Intel® Xeon® Processor E5-2600 v4 Enterprise Database Applications Showcase
Intel® Xeon® Processor E5-2600 v4 Enterprise Database Applications ShowcaseIntel® Xeon® Processor E5-2600 v4 Enterprise Database Applications Showcase
Intel® Xeon® Processor E5-2600 v4 Enterprise Database Applications Showcase
Intel IT Center
 
Intel® Xeon® Processor E5-2600 v4 Core Business Applications Showcase
Intel® Xeon® Processor E5-2600 v4 Core Business Applications ShowcaseIntel® Xeon® Processor E5-2600 v4 Core Business Applications Showcase
Intel® Xeon® Processor E5-2600 v4 Core Business Applications Showcase
Intel IT Center
 
Intel® Xeon® Processor E5-2600 v4 Financial Security Applications Showcase
Intel® Xeon® Processor E5-2600 v4 Financial Security Applications ShowcaseIntel® Xeon® Processor E5-2600 v4 Financial Security Applications Showcase
Intel® Xeon® Processor E5-2600 v4 Financial Security Applications Showcase
Intel IT Center
 
Intel® Xeon® Processor E5-2600 v4 Telco Cloud Digital Applications Showcase
Intel® Xeon® Processor E5-2600 v4 Telco Cloud Digital Applications ShowcaseIntel® Xeon® Processor E5-2600 v4 Telco Cloud Digital Applications Showcase
Intel® Xeon® Processor E5-2600 v4 Telco Cloud Digital Applications Showcase
Intel IT Center
 
Intel® Xeon® Processor E5-2600 v4 Tech Computing Applications Showcase
Intel® Xeon® Processor E5-2600 v4 Tech Computing Applications ShowcaseIntel® Xeon® Processor E5-2600 v4 Tech Computing Applications Showcase
Intel® Xeon® Processor E5-2600 v4 Tech Computing Applications Showcase
Intel IT Center
 

More from Intel IT Center (20)

AI Crash Course- Supercomputing
AI Crash Course- SupercomputingAI Crash Course- Supercomputing
AI Crash Course- Supercomputing
 
FPGA Inference - DellEMC SURFsara
FPGA Inference - DellEMC SURFsaraFPGA Inference - DellEMC SURFsara
FPGA Inference - DellEMC SURFsara
 
High Memory Bandwidth Demo @ One Intel Station
High Memory Bandwidth Demo @ One Intel StationHigh Memory Bandwidth Demo @ One Intel Station
High Memory Bandwidth Demo @ One Intel Station
 
INFOGRAPHIC: Advantages of Intel vs. IBM Power on SAP HANA solutions
INFOGRAPHIC: Advantages of Intel vs. IBM Power on SAP HANA solutionsINFOGRAPHIC: Advantages of Intel vs. IBM Power on SAP HANA solutions
INFOGRAPHIC: Advantages of Intel vs. IBM Power on SAP HANA solutions
 
Disrupt Hackers With Robust User Authentication
Disrupt Hackers With Robust User AuthenticationDisrupt Hackers With Robust User Authentication
Disrupt Hackers With Robust User Authentication
 
Strengthen Your Enterprise Arsenal Against Cyber Attacks With Hardware-Enhanc...
Strengthen Your Enterprise Arsenal Against Cyber Attacks With Hardware-Enhanc...Strengthen Your Enterprise Arsenal Against Cyber Attacks With Hardware-Enhanc...
Strengthen Your Enterprise Arsenal Against Cyber Attacks With Hardware-Enhanc...
 
Harness Digital Disruption to Create 2022’s Workplace Today
Harness Digital Disruption to Create 2022’s Workplace TodayHarness Digital Disruption to Create 2022’s Workplace Today
Harness Digital Disruption to Create 2022’s Workplace Today
 
Don't Rely on Software Alone. Protect Endpoints with Hardware-Enhanced Security.
Don't Rely on Software Alone.Protect Endpoints with Hardware-Enhanced Security.Don't Rely on Software Alone.Protect Endpoints with Hardware-Enhanced Security.
Don't Rely on Software Alone. Protect Endpoints with Hardware-Enhanced Security.
 
Achieve Unconstrained Collaboration in a Digital World
Achieve Unconstrained Collaboration in a Digital WorldAchieve Unconstrained Collaboration in a Digital World
Achieve Unconstrained Collaboration in a Digital World
 
Intel® Xeon® Scalable Processors Enabled Applications Marketing Guide
Intel® Xeon® Scalable Processors Enabled Applications Marketing GuideIntel® Xeon® Scalable Processors Enabled Applications Marketing Guide
Intel® Xeon® Scalable Processors Enabled Applications Marketing Guide
 
#NABshow: National Association of Broadcasters 2017 Super Session Presentatio...
#NABshow: National Association of Broadcasters 2017 Super Session Presentatio...#NABshow: National Association of Broadcasters 2017 Super Session Presentatio...
#NABshow: National Association of Broadcasters 2017 Super Session Presentatio...
 
Identity Protection for the Digital Age
Identity Protection for the Digital AgeIdentity Protection for the Digital Age
Identity Protection for the Digital Age
 
Three Steps to Making a Digital Workplace a Reality
Three Steps to Making a Digital Workplace a RealityThree Steps to Making a Digital Workplace a Reality
Three Steps to Making a Digital Workplace a Reality
 
Three Steps to Making The Digital Workplace a Reality - by Intel’s Chad Const...
Three Steps to Making The Digital Workplace a Reality - by Intel’s Chad Const...Three Steps to Making The Digital Workplace a Reality - by Intel’s Chad Const...
Three Steps to Making The Digital Workplace a Reality - by Intel’s Chad Const...
 
Intel® Xeon® Processor E7-8800/4800 v4 EAMG 2.0
Intel® Xeon® Processor E7-8800/4800 v4 EAMG 2.0Intel® Xeon® Processor E7-8800/4800 v4 EAMG 2.0
Intel® Xeon® Processor E7-8800/4800 v4 EAMG 2.0
 
Intel® Xeon® Processor E5-2600 v4 Enterprise Database Applications Showcase
Intel® Xeon® Processor E5-2600 v4 Enterprise Database Applications ShowcaseIntel® Xeon® Processor E5-2600 v4 Enterprise Database Applications Showcase
Intel® Xeon® Processor E5-2600 v4 Enterprise Database Applications Showcase
 
Intel® Xeon® Processor E5-2600 v4 Core Business Applications Showcase
Intel® Xeon® Processor E5-2600 v4 Core Business Applications ShowcaseIntel® Xeon® Processor E5-2600 v4 Core Business Applications Showcase
Intel® Xeon® Processor E5-2600 v4 Core Business Applications Showcase
 
Intel® Xeon® Processor E5-2600 v4 Financial Security Applications Showcase
Intel® Xeon® Processor E5-2600 v4 Financial Security Applications ShowcaseIntel® Xeon® Processor E5-2600 v4 Financial Security Applications Showcase
Intel® Xeon® Processor E5-2600 v4 Financial Security Applications Showcase
 
Intel® Xeon® Processor E5-2600 v4 Telco Cloud Digital Applications Showcase
Intel® Xeon® Processor E5-2600 v4 Telco Cloud Digital Applications ShowcaseIntel® Xeon® Processor E5-2600 v4 Telco Cloud Digital Applications Showcase
Intel® Xeon® Processor E5-2600 v4 Telco Cloud Digital Applications Showcase
 
Intel® Xeon® Processor E5-2600 v4 Tech Computing Applications Showcase
Intel® Xeon® Processor E5-2600 v4 Tech Computing Applications ShowcaseIntel® Xeon® Processor E5-2600 v4 Tech Computing Applications Showcase
Intel® Xeon® Processor E5-2600 v4 Tech Computing Applications Showcase
 

Unleash performance through parallelism - Intel® Math Kernel Library

  • 1. Unleash performance through parallelism – Intel® Math Kernel Library Kent Moffat Product Manager Intel Developer Products Division ISC ‘13 Leipzig, Germany June 2013
  • 2. Agenda • What is Intel® Math Kernel Library (Intel® MKL)? • Intel® MKL supports Intel® Xeon Phi™ Coprocessor • Intel® MKL usage models on Intel® Xeon Phi™  Automatic Offload  Compiler Assisted Offload  Native Execution • Performance roadmap • Where to find more information?
  • 3. Linear Algebra •BLAS •LAPACK •Sparse solvers •ScaLAPACK Fast Fourier Transforms •Multidimensional (up to 7D) •FFTW interfaces •Cluster FFT Vector Math •Trigonometric •Hyperbolic •Exponential, Logarithmic •Power / Root •Rounding Vector Random Number Generators •Congruential •Recursive •Wichmann-Hill •Mersenne Twister •Sobol •Neiderreiter •Non-deterministic Summary Statistics •Kurtosis •Variation coefficient •Quantiles, order statistics •Min/max •Variance-covariance •… Data Fitting •Splines •Interpolation •Cell search Intel® MKL is industry’s leading math library * Many-coreMulticore Multicore CPU Multicore CPU Intel® MIC co-processor Source Clusters with Multicore and Many-core … … Multicore Cluster Clusters Intel® MKL * 2011 and 2012 Evans Data N. American developer survey
  • 4. Intel® MKL Support for Intel® Xeon Phi™ Coprocessor • Intel® MKL 11.0 supports the Intel® Xeon Phi™ coprocessor • Heterogeneous computing  Takes advantage of both the multicore host and many-core coprocessors • Optimized for wider (512-bit) SIMD instructions • Flexible usage models:  Automatic Offload: offers transparent heterogeneous computing  Compiler Assisted Offload: allows fine offloading control  Native execution: uses the coprocessors as independent nodes Using Intel® MKL on Intel® Xeon Phi™ • Performance scales from multicore to many-core • Familiarity of architecture and programming models • Code re-use, faster time-to-market
  • 5. Usage Models on Intel® Xeon Phi™ Coprocessor • Automatic Offload  No code changes required  Automatically uses both host and target  Transparent data transfer and execution management • Compiler Assisted Offload  Explicit controls of data transfer and remote execution using compiler offload pragmas/directives  Can be used together with Automatic Offload • Native Execution  Uses the coprocessors as independent nodes  Input data is copied to targets in advance
  • 6. Execution Models, from multicore-centric (Intel® Xeon®) to many-core-centric (Intel® Xeon Phi™):
 Multicore Hosted: general-purpose serial and parallel computing
 Offload: codes with highly parallel phases
 Symmetric: codes with balanced needs
 Many-Core Hosted: highly parallel codes
  • 7. Automatic Offload (AO) • Offloading is automatic and transparent • By default, Intel® MKL decides:  When to offload  Work division between host and targets • Users enjoy host and target parallelism automatically • Users can still control work division to fine-tune performance
  • 8. How to Use Automatic Offload • Using Automatic Offload is easy: call a function, mkl_mic_enable(), or set an environment variable, MKL_MIC_ENABLE=1 • What if there is no coprocessor in the system?  The code runs on the host as usual, without any penalty!
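As a sketch, enabling AO requires nothing beyond the environment; the binary name below is a placeholder, and MKL_MIC_WORKDIVISION is the documented variable for overriding MKL's automatic host/coprocessor split:

```shell
# Enable Automatic Offload for any MKL-linked program
# (no code changes or rebuild; ./dgemm_bench is a placeholder binary):
export MKL_MIC_ENABLE=1
# Optional: send 70% of the work to the coprocessors instead of
# letting MKL decide the division at run time.
export MKL_MIC_WORKDIVISION=0.7
#   ./dgemm_bench
```

If no coprocessor is present, these settings are simply ignored and the program runs on the host.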
  • 9. Automatic Offload Enabled Functions • A selective set of MKL functions are AO enabled.  Only functions with sufficient computation to offset data transfer overhead are subject to AO • In 11.0, AO enabled functions include:  Level-3 BLAS: ?GEMM, ?TRSM, ?TRMM  LAPACK 3 amigos: LU, QR, Cholesky
  • 10. Compiler Assisted Offload (CAO) • Offloading is explicitly controlled by compiler pragmas or directives. • All MKL functions can be offloaded in CAO.  In comparison, only a subset of MKL is subject to AO. • Can leverage the full potential of the compiler’s offloading facility. • More flexibility in data transfer and remote execution management.  A big advantage is data persistence: reusing transferred data for multiple operations
  • 11. How to Use Compiler Assisted Offload • The same way you would offload any function call to the coprocessor. • An example in C:
#pragma offload target(mic) \
        in(transa, transb, N, alpha, beta) \
        in(A:length(matrix_elements)) \
        in(B:length(matrix_elements)) \
        in(C:length(matrix_elements)) \
        out(C:length(matrix_elements) alloc_if(0))
{
    sgemm(&transa, &transb, &N, &N, &N, &alpha, A, &N, B, &N, &beta, C, &N);
}
  • 12. How to Use Compiler Assisted Offload • An example in Fortran:
!DEC$ ATTRIBUTES OFFLOAD : TARGET( MIC ) :: SGEMM
!DEC$ OMP OFFLOAD TARGET( MIC ) &
!DEC$ IN( TRANSA, TRANSB, M, N, K, ALPHA, BETA, LDA, LDB, LDC ), &
!DEC$ IN( A: LENGTH( NCOLA * LDA )), &
!DEC$ IN( B: LENGTH( NCOLB * LDB )), &
!DEC$ INOUT( C: LENGTH( N * LDC ))
!$OMP PARALLEL SECTIONS
!$OMP SECTION
      CALL SGEMM( TRANSA, TRANSB, M, N, K, ALPHA, &
                  A, LDA, B, LDB, BETA, C, LDC )
!$OMP END PARALLEL SECTIONS
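A minimal build-and-check sketch for the CAO examples above; the source file name is an assumption, and OFFLOAD_REPORT is the Intel compiler runtime's standard variable for verifying what actually gets transferred:

```shell
# Compile and link against MKL with the Intel compiler
# (sgemm_offload.c is a placeholder source file):
#   icc -O3 -mkl sgemm_offload.c -o sgemm_offload
# At run time, print a per-offload report; level 2 adds the amount of
# data transferred in each direction, useful for spotting missed
# data-persistence opportunities:
export OFFLOAD_REPORT=2
#   ./sgemm_offload
```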
  • 13. Native Execution • Programs can be built to run only on the coprocessor by using the –mmic build option. • Tips for using MKL in native execution:
– Use all threads to get the best performance (for a 60-core coprocessor): MIC_OMP_NUM_THREADS=240
– Thread affinity setting: KMP_AFFINITY=explicit,proclist=[1-240:1,0,241,242,243],granularity=fine
– Use huge pages for memory allocation (more details on the “Intel Developer Zone”):  The mmap( ) system call  The open-source libhugetlbfs.so library: http://libhugetlbfs.sourceforge.net/
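The native-execution tips above can be sketched as a shell session; the source/binary names and the `mic0` hostname are assumptions (the first card is conventionally reachable as mic0), and the thread counts assume a 60-core coprocessor:

```shell
# Build for native execution on the coprocessor (placeholder source file):
#   icc -O3 -mmic -mkl my_app.c -o my_app.mic
# Copy the binary to the card and run it there, e.g.:
#   scp my_app.mic mic0:/tmp/ && ssh mic0 /tmp/my_app.mic
# Environment for best MKL performance on a 60-core part
# (240 threads = 60 cores x 4 hardware threads):
export MIC_OMP_NUM_THREADS=240
export KMP_AFFINITY="explicit,proclist=[1-240:1,0,241,242,243],granularity=fine"
```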
  • 14. Suggestions on Choosing Usage Models • Choose native execution if  The code is highly parallel.  You want to use the coprocessors as independent compute nodes. • Choose AO when  A sufficient FLOP-per-byte ratio makes offload beneficial.  Level-3 BLAS functions: ?GEMM, ?TRMM, ?TRSM.  LU and QR factorization (in upcoming release updates). • Choose CAO when either  There is enough computation to offset the data transfer overhead, or  Transferred data can be reused by multiple operations. • You can always run on the host.
  • 15. Intel® MKL Link Line Advisor • A web tool that helps users choose the correct link line options. • http://software.intel.com/sites/products/mkl/ • Also available offline in the MKL product package.
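For illustration, the advisor's output for one common configuration (Linux, Intel® 64, LP64 interface, dynamic linking, OpenMP threading) looks roughly like the line below; treat it as a sketch and generate the exact line for your own toolchain with the tool:

```shell
# Typical Link Line Advisor result; MKLROOT is assumed to be set by the
# MKL environment script (e.g. mklvars.sh):
MKL_LINK="-L${MKLROOT}/lib/intel64 -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -liomp5 -lpthread -lm"
# Used when linking, e.g.:
#   icc -O3 my_app.c -o my_app ${MKL_LINK}
```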
  • 16. Where to Find MKL Documentation? • How to use MKL on Intel® Xeon Phi™ coprocessors: – “Using Intel Math Kernel Library on Intel® Xeon Phi™ Coprocessors” section in the User’s Guide. – “Support Functions for Intel Many Integrated Core Architecture” section in the Reference Manual. • Tips, app notes, etc. can also be found on the “Intel Developer Zone”: – http://www.intel.com/software/mic – http://www.intel.com/software/mic-developer