Scale Up Performance with Intel® Development Tools
Overview of Intel® Cluster Studio XE & Intel® Parallel Studio XE
June 19, 2013
Mike Lee

Vision: span from few cores to many cores with consistent models, languages, tools, and techniques

[Figure labels: Source; Compilers; Libraries, Parallel Models; Multicore CPU; Multicore CPU; Intel® MIC architecture coprocessor]

Game Changer
“Unparalleled productivity… most of this software does not run on a GPU” - Robert Harrison, NICS, ORNL
Source: R. Harrison, “Opportunities and Challenges Posed by Exascale Computing - ORNL's Plans and Perspectives”, National Institute of Computational Sciences, Nov 2011

Support for Latest Intel Processors and Coprocessors

Product | Intel® Ivy Bridge microarchitecture | Intel® Haswell microarchitecture | Intel® Xeon Phi™ coprocessor
Intel® C++ and Fortran Compiler | ✔ AVX | ✔ AVX2, FMA3 | ✔ IMCI
Intel® TBB library | ✔ | ✔ | ✔
Intel® MKL library | ✔ AVX | ✔ AVX2, FMA3 | ✔
Intel® MPI library | ✔ | ✔ | ✔
Intel® VTune™ Amplifier XE† | ✔ Hardware Events | ✔ Hardware Events | ✔ Hardware Events
Intel® Inspector XE | ✔ Memory & Thread Checks | ✔ Memory & Thread | ✔ Memory & Thread††

† Hardware events for new processors added as new processors ship.
†† Analysis runs on multicore processors, provides analysis for multicore and many-core processors.

A Family of Parallel Programming Models
Developer Choice

Intel® Cilk™ Plus
• C/C++ language extensions to simplify parallelism
• Open sourced
• Also an Intel product

Intel® Threading Building Blocks
• Widely used C++ template library for parallelism
• Open sourced
• Also an Intel product

Domain-Specific Libraries
• Intel® Integrated Performance Primitives
• Intel® Math Kernel Library

Established Standards
• Message Passing Interface (MPI)
• OpenMP*
• Coarray Fortran
• OpenCL*

Research and Development
• Intel® Concurrent Collections
• Offload Extensions
• Intel® SPMD Parallel Compiler

Choice of high-performance parallel programming models
Applicable to Multicore and Many-core Programming
Delivered with Intel® Cluster Studio XE

Intel® Cluster Studio XE
Tools to Scale Forward, Scale Faster – for HPC Clusters

Phase: Build
Product: Intel® MPI Library
Feature: High Performance Message Passing (MPI) Library
Benefit:
• Enabling High Performance Scalability, Interconnect Independence, Runtime Fabric Selection, and Application Tuning Capability

Phase: Build
Product: Intel® Composer XE
Feature: C/C++ and Fortran compilers and performance libraries
• Intel® Threading Building Blocks
• Intel® Cilk™ Plus
• Intel® Integrated Performance Primitives
• Intel® Math Kernel Library
Benefit:
• Enabling solution to achieve the application performance and scalability benefits of multicore and forward scale to many-core

Phase: Verify
Product: Intel® Inspector XE
Feature: Memory & threading dynamic analysis for code quality; Static Security Analysis for code quality
Benefit:
• Increases productivity and code quality and lowers cost; finds memory, threading, and security defects before they happen
• Now MPI enabled at every cluster node

Phase: Verify & Tune
Product: Intel® Trace Analyzer & Collector
Feature: MPI Performance Profiler for understanding application correctness & behavior
Benefit:
• Analyze performance of MPI programs and visualize parallel application behavior and communications patterns to identify hotspots

Phase: Tune
Product: Intel® VTune™ Amplifier XE
Feature: Performance Profiler for optimizing application performance and scalability
Benefit:
• Removes guesswork, saves time, makes it easier to find performance and scalability bottlenecks
• Now MPI enabled at every cluster node

Intel® Composer XE – HPC Compilers & Libraries
Great Application Performance
Serial or Parallel Programming

Scale Forward & Flexibility
Target Multicore & Manycore Systems on Linux*, Windows*, and OS X*

Standards Driven Compilers
Acclaimed Fortran and C++ Compilers. Remarkable performance improvements with just a simple recompile

Parallel Programming Models & Libraries
Intel® TBB, Intel® Cilk™ Plus, Intel® OpenMP, Intel® Coarray Fortran, Intel® IPP & Intel® MKL

Intel® Cilk™ Plus & Intel® Threading Building Blocks
Compilers & Libraries

Intel® Cilk™ Plus
Language extensions to simplify task/data parallelism
• 3 simple keywords & array notations for parallelism
• Support for task and data parallelism
• Semantics similar to serial code
• Simple way to parallelize your code
• Sequentially consistent, low overhead, powerful solution

Intel® Threading Building Blocks
Widely used C++ template library for task parallelism
• Parallel algorithms and data structures
• Scalable memory allocation and task scheduling
• Synchronization primitives
• Rich feature set for general purpose parallelism
• Available as open source or commercial license

Composability
Utilize appropriate parallelism model in the same application with both Intel® Cilk™ Plus & Intel® Threading Building Blocks.

Simplify Parallelism
Implement parallelism through open sourced models with simple language extensions/keywords & template libraries

Scale Forward & Flexibility
Target Multicore & Manycore Systems on Linux*, Windows*, and OS X*

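The task parallelism both models target follows a fork-join pattern: spawn a child task, keep working, then wait for the child. The Cilk Plus keywords need a compiler that supports the extension, so the sketch below is only a stand-in written in standard C++ with std::async. The function names fib_serial and fib_parallel and the cutoff parameter are hypothetical and chosen for illustration; this is not Intel's implementation of either model.

```cpp
#include <cassert>
#include <cstdint>
#include <future>

// Hypothetical example: naive recursive Fibonacci, a common fork-join
// teaching case. This is the serial baseline.
std::uint64_t fib_serial(unsigned n) {
    return n < 2 ? n : fib_serial(n - 1) + fib_serial(n - 2);
}

// Fork-join version. Problems smaller than `cutoff` run serially so that
// the cost of creating tasks does not dominate the useful work.
// Assumption: callers keep `cutoff` reasonably large (the default is 20),
// because every level at or above the cutoff starts a new thread.
std::uint64_t fib_parallel(unsigned n, unsigned cutoff = 20) {
    if (n < 2 || n < cutoff) {
        return fib_serial(n);
    }
    // "Fork": run one branch asynchronously (the role cilk_spawn plays).
    std::future<std::uint64_t> left =
        std::async(std::launch::async, fib_parallel, n - 1, cutoff);
    // The current thread continues with the other branch.
    const std::uint64_t right = fib_parallel(n - 2, cutoff);
    // "Join": wait for the spawned branch (the role cilk_sync plays).
    return left.get() + right;
}
```

Calling fib_parallel(25) splits only the top few levels of the recursion into tasks and computes everything below the cutoff serially. A work-stealing scheduler such as the ones behind Cilk Plus and TBB would normally replace the one-thread-per-task approach used here.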
Intel® OpenMP
Compilers & Libraries

OpenMP* 4.0 RC1 & TR1
Intel® C++ and Fortran Compiler adds support for SIMD extensions and target extensions.

16 Years and Counting…
Intel supports and advances standards to advance the HPC industry

Available Now in Intel® Compilers
• Intel® Fortran Composer XE 2013 Update 2 (version 13.1)
• Intel® C++ Composer XE Update 2 (version 13.1)

Welcome OpenMP 4.0!

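As a rough illustration of the SIMD extensions mentioned above, the sketch below marks two loops with the OpenMP 4.0 simd construct. The function names saxpy and dot are hypothetical. A compiler without OpenMP enabled ignores the pragmas and runs the loops serially with the same results, so whether vector code is actually generated depends on the compiler and its flags.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: y[i] = a * x[i] + y[i] for every element.
void saxpy(float a, const std::vector<float>& x, std::vector<float>& y) {
    assert(x.size() == y.size());
    const float* xp = x.data();
    float* yp = y.data();
    const std::size_t n = y.size();
    // Ask the compiler to vectorize this loop (OpenMP 4.0 simd construct).
    #pragma omp simd
    for (std::size_t i = 0; i < n; ++i) {
        yp[i] = a * xp[i] + yp[i];
    }
}

// Hypothetical helper: dot product using a simd reduction, so each
// vector lane keeps a partial sum that is combined after the loop.
float dot(const std::vector<float>& x, const std::vector<float>& y) {
    assert(x.size() == y.size());
    const float* xp = x.data();
    const float* yp = y.data();
    const std::size_t n = x.size();
    float sum = 0.0f;
    #pragma omp simd reduction(+:sum)
    for (std::size_t i = 0; i < n; ++i) {
        sum += xp[i] * yp[i];
    }
    return sum;
}
```

The reduction clause matters in the second loop: without it, the dependence on sum from one iteration to the next would make the simd request invalid.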
Intel® MPI Library – Flexible, Efficient & Scalable

Optimized Performance
Automatically employ optimized collectives via cluster- and application-level tuning

Sustainable Scalability
Take advantage of reduced memory overhead and native fabric support resulting in lower latencies and higher bandwidth

Full Hybrid Support
Finely tuned control over threaded and OpenMP* hybrid regions for multicore and manycore systems

“Fast and accurate state of the art general purpose CFD solvers is the focus at S & I Engineering Solutions Pvt, Ltd. Scalability and efficiency are key to us when it comes to our choice and use of MPI Libraries. The Intel® MPI Library has enabled us to scale to over 10k cores with high efficiency and performance.”
Nikhil Vijay Shende, Director, S & I Engineering Solutions, Pvt. Ltd.

Intel® Math Kernel Library – Performance Ready to Use

Comprehensive Math Functionality
A wealth of threaded and vectorized complex math functions to accelerate a wide variety of software applications.

Vectorized and Threaded
Replace code with one of thousands of highly optimized functions for science, engineering and financial apps

Flexible, Scalable and Compatible
Standard APIs for C & Fortran, Compatible with Present & Future Processors/Coprocessors, Compilers, OS’s, linking and threading models.

“Intel MKL is indispensable for any high-performance user”
Prof. Jack Dongarra, Innovative Computing Lab, University of Tennessee

Intel® Integrated Performance Primitives – Performance Ready to Use
A Library of Highly Optimized Algorithmic Building Blocks for Media and Data Applications
Engineered to Save Time

Extensive & Rich Library
Thousands of optimized functions covering frequently used fundamental algorithms, including those for creating digital media, enterprise, data, embedded, communications, and scientific / technical applications.

Optimized for Performance
Uses Intel® Streaming SIMD Extensions (Intel® SSE) and Intel® Advanced Vector Extensions (Intel® AVX) instructions to perform faster than what an optimizing compiler can produce alone.

Intel® Advisor XE – Data Driven Threading Design
Simplifies and Speeds Threading Design
Best Results with Parallelism Design Insight and Analysis

Step-by-step Threading Guidance
From surveying code, finding the best implementation, to checking correctness.

Simplifies adding Parallelism
Shorter learning curve for parallelism by helping to identify and experiment with parallel opportunities

Evaluate Return on Investment
Performance benefit vs. the cost of transitioning to parallelism

Intel® Advisor XE – Data Driven Threading Design
Add Parallelism with Less Effort, Less Risk and More Impact

Intel® VTune™ Amplifier XE - Performance Profiler
Optimize Serial & Parallel Performance
Premier Performance Profiler

Easy
Performance optimization can be difficult, but the performance profiling tool you use shouldn’t be.

Rich Set of Performance Profiles
Collect a rich set of performance data for hotspots, threading, locks & waits, DirectX*, bandwidth and more.

Mine Results & Understand
Good data is not enough. Powerful analysis lets you sort, filter and visualize results on the timeline and on your source.

“Last week, Intel® VTune™ Amplifier XE helped us find almost 3X performance improvement. This week it helped us improve the performance another 3X.”
Claire Cates, Principal Developer, SAS Institute Inc

Intel® VTune™ Amplifier XE - Performance Profiler
Advanced Profiling For Scalable Multicore Performance

Where is my application…

Spending Time?
• Focus tuning on functions taking time
• See call stacks
• See time on source

Wasting Time?
• See cache misses on your source
• See functions sorted by # of cache misses

Waiting Too Long?
• See locks by wait time
• Red/Green for CPU utilization during wait

Intel® Inspector XE – Dynamic Analysis
Deliver More Reliable Applications
Detect Memory & Threading Errors

Memory & Threading Errors
Leaks, corruption, allocation/de-allocation, API mismatches, data races in stack and heap, deadlocks, and thread & sync API errors

Find Errors Early in Development Cycle
Easy to use tool for serial and parallel applications enhances productivity, cuts cost and speeds time-to-results.

Flexible to Fit Workflow
Inspect C, C++, C#, F#, and Fortran. No special builds required. Inspects all code even without source

“We struggled for a week with a crash situation, …we ran Intel® Inspector XE and immediately found the array out of bounds that occurred long before the actual crash. We could have saved a week!”
Mikael Le Guerroué, Senior Codec Architecture Engineer, Envivio

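To make the "data race" category above concrete, here is a minimal sketch in standard C++ of a shared counter updated by several threads. The function name count_with_threads is hypothetical, and the example says nothing about how Inspector XE works internally. It only shows the kind of defect a threading checker is meant to report, with the racy form left as a comment and the corrected form in code.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical example: each thread increments a shared counter.
//
// Racy form (not compiled here): declaring `long counter = 0;` and doing
// `++counter;` from several threads is an unsynchronized read-modify-write.
// Updates can be lost and the behavior is undefined.
//
// Corrected form: std::atomic makes every increment indivisible.
long count_with_threads(unsigned num_threads, long increments_per_thread) {
    assert(num_threads > 0);
    std::atomic<long> counter{0};
    std::vector<std::thread> workers;
    workers.reserve(num_threads);
    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&counter, increments_per_thread] {
            for (long i = 0; i < increments_per_thread; ++i) {
                // Relaxed ordering is enough for a pure counter; join()
                // below makes the final value visible to the caller.
                counter.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    for (std::thread& w : workers) {
        w.join();
    }
    return counter.load();
}
```

With the atomic counter, count_with_threads(4, 10000) always returns 40000. The racy form can return a smaller value on some runs and the expected one on others, which is why such bugs are hard to find by testing alone.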
Intel® Trace Analyzer and Collector
Profile MPI Communications
Understand MPI Application Behavior

Low Overhead & Effective Visualization
Visualize and understand parallel application behavior at minimal cost to concentrate on relevant information quickly

Powerful Analysis
Find temporal dependencies in your code: bottlenecks, hotspots, and load balancing issues, plus correctness checking

Flexible to Fit Workflow
Use at compile, link or run to capture trace data for your application.