Unleash performance through parallelism - Intel® Math Kernel Library
 


    Presentation Transcript

    • Unleash performance through parallelism – Intel® Math Kernel Library
      Kent Moffat, Product Manager, Intel Developer Products Division
      ISC '13, Leipzig, Germany, June 2013
    • Agenda
      • What is Intel® Math Kernel Library (Intel® MKL)?
      • Intel® MKL supports the Intel® Xeon Phi™ coprocessor
      • Intel® MKL usage models on Intel® Xeon Phi™: Automatic Offload, Compiler Assisted Offload, Native Execution
      • Performance roadmap
      • Where to find more information
    • Intel® MKL is the industry's leading math library*
      • Linear Algebra: BLAS, LAPACK, sparse solvers, ScaLAPACK
      • Fast Fourier Transforms: multidimensional (up to 7D), FFTW interfaces, cluster FFT
      • Vector Math: trigonometric, hyperbolic, exponential, logarithmic, power/root, rounding
      • Vector Random Number Generators: congruential, recursive, Wichmann-Hill, Mersenne Twister, Sobol, Niederreiter, non-deterministic
      • Summary Statistics: kurtosis, variation coefficient, quantiles and order statistics, min/max, variance-covariance, and more
      • Data Fitting: splines, interpolation, cell search
      [Diagram: a single source runs across multicore CPUs, the Intel® MIC coprocessor, and clusters with multicore and many-core nodes]
      * 2011 and 2012 Evans Data North American developer surveys
    • Intel® MKL Support for the Intel® Xeon Phi™ Coprocessor
      • Intel® MKL 11.0 supports the Intel® Xeon Phi™ coprocessor
      • Heterogeneous computing: takes advantage of both the multicore host and many-core coprocessors
      • Optimized for wider (512-bit) SIMD instructions
      • Flexible usage models:
        • Automatic Offload: offers transparent heterogeneous computing
        • Compiler Assisted Offload: allows fine offloading control
        • Native execution: uses the coprocessors as independent nodes
      Using Intel® MKL on Intel® Xeon Phi™:
      • Performance scales from multicore to many-core
      • Familiarity of architecture and programming models
      • Code re-use and faster time-to-market
    • Usage Models on the Intel® Xeon Phi™ Coprocessor
      • Automatic Offload: no code changes required; automatically uses both host and target; transparent data transfer and execution management
      • Compiler Assisted Offload: explicit control of data transfer and remote execution using compiler offload pragmas/directives; can be used together with Automatic Offload
      • Native Execution: uses the coprocessors as independent nodes; input data is copied to targets in advance
    • Execution Models (multicore-centric to many-core-centric)
      • Multicore Hosted: general-purpose serial and parallel computing on the multicore host (Intel® Xeon®)
      • Offload: codes with highly parallel phases offloaded to the many-core coprocessor (Intel® Xeon Phi™)
      • Symmetric: codes with balanced needs running on both
      • Many-Core Hosted: highly parallel codes running entirely on the coprocessor
    • Automatic Offload (AO)
      • Offloading is automatic and transparent
      • By default, Intel MKL decides when to offload and how to divide the work between host and targets
      • Users enjoy host and target parallelism automatically
      • Users can still control work division to fine-tune performance
    • How to Use Automatic Offload
      • Using Automatic Offload is easy: call a function, mkl_mic_enable(), or set an environment variable, MKL_MIC_ENABLE=1 (see the sketch below)
      • What if there is no coprocessor in the system? The code runs on the host as usual, without any penalty!
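    A minimal C sketch of the two AO controls above, assuming MKL 11.0+ headers; mkl_mic_set_workdivision() is MKL's documented call for manual work division, while the matrix size and the 0.5 split here are illustrative assumptions (AO only offloads when the problem is large enough to pay for the data transfer):

        #include <stdio.h>
        #include <mkl.h>

        int main(void) {
            /* Enable Automatic Offload programmatically; equivalent to
               setting MKL_MIC_ENABLE=1 in the environment. Returns a
               nonzero status if AO support is unavailable. */
            if (mkl_mic_enable() != 0)
                printf("AO not available; running on the host only.\n");

            /* Optional fine-tuning: ask MKL to run 50% of the work on
               coprocessor 0 instead of deciding the split itself. */
            mkl_mic_set_workdivision(MKL_TARGET_MIC, 0, 0.5);

            int n = 8192;  /* illustrative size; AO needs enough work to pay off */
            double *A = (double *)mkl_malloc((size_t)n * n * sizeof(double), 64);
            double *B = (double *)mkl_malloc((size_t)n * n * sizeof(double), 64);
            double *C = (double *)mkl_malloc((size_t)n * n * sizeof(double), 64);
            for (size_t i = 0; i < (size_t)n * n; i++) { A[i] = 1.0; B[i] = 2.0; C[i] = 0.0; }

            /* An ordinary DGEMM call: with AO enabled, MKL transparently
               splits the computation between host and coprocessor. */
            cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                        n, n, n, 1.0, A, n, B, n, 0.0, C, n);

            mkl_free(A); mkl_free(B); mkl_free(C);
            return 0;
        }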
    • Automatic Offload Enabled Functions
      • A selective set of MKL functions is AO-enabled; only functions with sufficient computation to offset the data transfer overhead are subject to AO
      • In 11.0, AO-enabled functions include:
        • Level-3 BLAS: ?GEMM, ?TRSM, ?TRMM
        • The LAPACK "three amigos": LU, QR, and Cholesky factorization
    • Compiler Assisted Offload (CAO)
      • Offloading is explicitly controlled by compiler pragmas or directives
      • All MKL functions can be offloaded in CAO; in comparison, only a subset of MKL is subject to AO
      • Can leverage the full potential of the compiler's offloading facility
      • More flexibility in data transfer and remote execution management; a big advantage is data persistence: reusing transferred data for multiple operations (see the sketch below)
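    A minimal sketch of CAO data persistence in C, assuming the Intel compiler's offload pragma support (function and variable names are illustrative): alloc_if/free_if control the buffer's lifetime on the coprocessor, and nocopy reuses data already resident there:

        #include <mkl.h>

        void persistent_gemm(float *A, float *B, float *C, int n) {
            size_t len = (size_t)n * n;
            float one = 1.0f, zero = 0.0f;

            /* First offload: transfer A and keep it allocated on the
               target after the offload region ends. */
            #pragma offload target(mic:0) \
                in(A : length(len) alloc_if(1) free_if(0)) \
                in(B : length(len)) inout(C : length(len))
            {
                sgemm("N", "N", &n, &n, &n, &one, A, &n, B, &n, &zero, C, &n);
            }

            /* Second offload: A is already resident on the target, so do
               not retransfer it; free it when this offload finishes. */
            #pragma offload target(mic:0) \
                nocopy(A : length(len) alloc_if(0) free_if(1)) \
                in(B : length(len)) inout(C : length(len))
            {
                sgemm("N", "N", &n, &n, &n, &one, A, &n, B, &n, &one, C, &n);
            }
        }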
    • How to Use Compiler Assisted Offload
      • The same way you would offload any function call to the coprocessor
      • An example in C:

        #pragma offload target(mic) \
            in(transa, transb, N, alpha, beta) \
            in(A : length(matrix_elements)) \
            in(B : length(matrix_elements)) \
            in(C : length(matrix_elements)) \
            out(C : length(matrix_elements) alloc_if(0))
        {
            sgemm(&transa, &transb, &N, &N, &N, &alpha, A, &N, B, &N,
                  &beta, C, &N);
        }
    • How to Use Compiler Assisted Offload
      • An example in Fortran:

        !DEC$ ATTRIBUTES OFFLOAD : TARGET( MIC ) :: SGEMM
        !DEC$ OMP OFFLOAD TARGET( MIC ) &
        !DEC$ IN( TRANSA, TRANSB, M, N, K, ALPHA, BETA, LDA, LDB, LDC ), &
        !DEC$ IN( A: LENGTH( NCOLA * LDA )), &
        !DEC$ IN( B: LENGTH( NCOLB * LDB )), &
        !DEC$ INOUT( C: LENGTH( N * LDC ))
        !$OMP PARALLEL SECTIONS
        !$OMP SECTION
            CALL SGEMM( TRANSA, TRANSB, M, N, K, ALPHA, &
                        A, LDA, B, LDB, BETA, C, LDC )
        !$OMP END PARALLEL SECTIONS
    • Native Execution
      • Programs can be built to run only on the coprocessor by using the -mmic build option
      • Tips for using MKL in native execution:
        • Use all threads to get the best performance (for a 60-core coprocessor): MIC_OMP_NUM_THREADS=240
        • Thread affinity setting: KMP_AFFINITY=explicit,proclist=[1-240:1,0,241,242,243],granularity=fine
        • Use huge pages for memory allocation (more details on the Intel Developer Zone), via either the mmap() system call (see the sketch below) or the open-source libhugetlbfs.so library: http://libhugetlbfs.sourceforge.net/
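    A minimal sketch of the mmap() approach to huge pages, assuming a Linux kernel with MAP_HUGETLB support and a pre-reserved huge-page pool; the 2 MB page size and the fallback to normal pages are illustrative assumptions:

        #define _GNU_SOURCE
        #include <stdio.h>
        #include <string.h>
        #include <sys/mman.h>

        #define HUGE_PAGE_SIZE (2UL * 1024 * 1024)  /* common huge-page size */

        static void *alloc_huge(size_t bytes) {
            /* Round the request up to a whole number of huge pages. */
            size_t size = (bytes + HUGE_PAGE_SIZE - 1) & ~(HUGE_PAGE_SIZE - 1);
            void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
            if (p == MAP_FAILED)  /* fall back to normal pages */
                p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            return (p == MAP_FAILED) ? NULL : p;
        }

        int main(void) {
            size_t bytes = 64 * HUGE_PAGE_SIZE;
            double *buf = alloc_huge(bytes);
            if (!buf) { perror("mmap"); return 1; }
            memset(buf, 0, bytes);  /* touch the pages before use */
            /* ... pass buf to MKL routines ... */
            munmap(buf, bytes);
            return 0;
        }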
    • Suggestions on Choosing Usage Models
      • Choose native execution if the code is highly parallel, or you want to use the coprocessors as independent compute nodes
      • Choose AO when a sufficient Byte/FLOP ratio makes offload beneficial: Level-3 BLAS functions (?GEMM, ?TRMM, ?TRSM), plus LU and QR factorization (in upcoming release updates)
      • Choose CAO when there is enough computation to offset the data transfer overhead, or when transferred data can be reused by multiple operations
      • You can always run on the host
    • Intel® MKL Link Line Advisor
      • A web tool that helps users choose the correct link line options
      • http://software.intel.com/sites/products/mkl/
      • Also available offline in the MKL product package
    • Where to Find MKL Documentation
      • How to use MKL on Intel® Xeon Phi™ coprocessors:
        • The "Using Intel Math Kernel Library on Intel® Xeon Phi™ Coprocessors" section in the User's Guide
        • The "Support Functions for Intel Many Integrated Core Architecture" section in the Reference Manual
      • Tips, app notes, and more can also be found on the Intel Developer Zone:
        • http://www.intel.com/software/mic
        • http://www.intel.com/software/mic-developer