Methods and practices to analyze the performance of your application with Intel® VTune™ Amplifier XE
Leo Borges

Intel Software Conference 2014 Brazil
May 2014

Presentation Transcript

  • Methods and practices to analyze the performance of your application with Intel® VTune™ Amplifier XE Leo Borges Intel Software Conference 2014 Brazil May 2014
  • Legal Disclaimer & Optimization Notice
    Copyright © 2013, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
    INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.
    Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.
    Copyright © , Intel Corporation. All rights reserved. Intel, the Intel logo, Xeon, Xeon Phi, Core, VTune, and Cilk are trademarks of Intel Corporation in the U.S. and other countries.
    Optimization Notice: Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #20110804
  • Agenda
    • Intel® VTune™ Amplifier XE Intro
    • Microarchitecture Review
    • The Top-Down Characterization details
    • Intel® VTune™ Amplifier XE Implementation
    • Demo
    Sources for current presentation: http://software.intel.com/en-us/articles/advanced-profiling-with-intel-vtune-amplifier-xe-part-1-find-the-bottleneck
  • Two Ways to Collect Data - Intel® VTune™ Amplifier XE
    Software Collector (Hotspots, Concurrency, Locks & Waits):
      • Uses OS interrupts
      • Collects from a single process tree
      • ~10 ms default resolution
      • Collects on both Intel® and compatible processors
      • Call stacks show calling sequence
      • Works in virtual environments
      • No driver required
    Hardware Collector (Lightweight Hotspots, Advanced Analysis):
      • Uses the on-chip Performance Monitoring Unit (PMU)
      • Collects system-wide or from a single process tree
      • ~1 ms default resolution (finer granularity, finds small functions)
      • Requires a genuine Intel® processor for collection
      • New! Optionally collects call stacks
      • Works in virtual environments only when supported by the VM (e.g., vSphere* 5.1)
      • Requires a driver
    No special recompiles - C, C++, C#, Fortran, Java, Assembly
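A rough back-of-the-envelope sketch of why the finer default resolution helps find small functions: the expected number of samples that land in a function is approximately its CPU time divided by the sampling interval. The helper name and the 5 ms workload below are illustrative assumptions, not VTune output.

```python
def expected_samples(cpu_time_ms: float, interval_ms: float) -> float:
    """Approximate number of samples that land in a function.

    cpu_time_ms: total CPU time the function consumes (hypothetical input).
    interval_ms: sampling interval of the collector.
    """
    if interval_ms <= 0:
        raise ValueError("interval_ms must be positive")
    return cpu_time_ms / interval_ms


# Hypothetical small function that consumes 5 ms of CPU time in total.
software = expected_samples(5.0, 10.0)  # ~10 ms default resolution
hardware = expected_samples(5.0, 1.0)   # ~1 ms default resolution
print(f"software collector: ~{software:.1f} samples")
print(f"hardware collector: ~{hardware:.1f} samples")
```

With these assumed numbers the software collector would, on average, see less than one sample in the function, while the hardware collector would see about five.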
  • Microarchitecture basics
    [Figure: Fetch -> Decode -> Execute -> Retire]
    • Classic 4-stage pipeline depicted here.
    • Memory not shown.
    • Pipeline on current processors is capable of speculative and out-of-order execution.
  • Intuitive approach to EBS (event-based sampling)
    • Use a small list of metrics to monitor the level of optimization
    • Example 1: Cycles per instruction (CPI)
    • Example 2: Instruction retirement ratio
      With m instructions issued and n retired:
      Retirement ratio = n/m
      % executed but not retired = (1 - n/m) * 100
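The retirement-ratio arithmetic from this slide can be sketched as follows. The function names and the sample counts are illustrative assumptions; only the two formulas come from the slide.

```python
def retirement_ratio(issued: int, retired: int) -> float:
    """Retirement ratio n/m for m issued and n retired instructions."""
    if issued <= 0:
        raise ValueError("issued must be positive")
    if not 0 <= retired <= issued:
        raise ValueError("retired must be between 0 and issued")
    return retired / issued


def percent_not_retired(issued: int, retired: int) -> float:
    """Percentage executed but not retired: (1 - n/m) * 100."""
    return (1.0 - retirement_ratio(issued, retired)) * 100.0


# Hypothetical counts: 1000 instructions issued, 800 retired.
print(retirement_ratio(1000, 800))
print(round(percent_not_retired(1000, 800), 1))
```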
  • Microarchitecture Review
    [Figure: Fetch -> Decode -> Execute -> Memory -> Commit]
    The traditional 5-stage pipeline. The pipeline on current processors is capable of out-of-order execution.
  • Microarchitecture Review
    [Figure: Fetch -> Decode -> Execute -> Memory -> Commit, with the Front-End marked]
    The front-end fetches instructions IN ORDER, decodes them into u-ops (micro-operations), and sends the u-ops to the back-end.
  • Microarchitecture Review
    [Figure: same pipeline, with the Front-End and Back-End marked]
    The back-end receives u-ops, executes them OUT OF ORDER, accesses memory as needed, and commits results to memory IN ORDER.
  • Microarchitecture Review
    [Figure: same pipeline, with Allocation marked]
    Allocation is the point where u-ops transfer from the front-end to the back-end. The front-end can allocate 4 u-ops per cycle.
  • Microarchitecture Review
    [Figure: same pipeline, with Allocation and Retirement marked]
    Retirement is the point where u-ops leave the back-end. The back-end can retire 4 u-ops per cycle.
  • And a New Term: the Pipeline Slot
    [Figure: Fetch -> Decode -> Execute -> Memory -> Commit, with Front-End and Back-End marked; 4 potential allocations per cycle, 4 potential retirements per cycle]
    In reality, there are many queues, buffers, and pieces of logic throughout the pipeline to allow up to 4 allocations and 4 retirements per cycle.
  • And a New Term: the Pipeline Slot
    The “Pipeline Slot” is an abstraction representing all the resources needed to move one u-op through the pipeline.
  • And a New Term: the Pipeline Slot
    There are 4 Pipeline Slots available every cycle (S1, S2, S3, S4).
  • And a New Term: the Pipeline Slot
    Pipeline slots are filled with u-ops that travel from allocation to retirement over multiple cycles.
  • Cycles Per Instruction (CPI), a standard measure, has some special kinks
    • On current multi-core Intel processors, CPI can get as low as 0.25 cycles per instruction.
    • Normally, a CPI below ~1.0 is targeted for better performance.
    • Some suggest targeting a CPI of around 0.75 to 0.50.
    • But does this hold for every architecture?
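A minimal sketch of the CPI arithmetic behind these numbers. It assumes the 0.25 figure corresponds to a core that retires up to 4 instructions per cycle, so the best case is 1/4; that reading, the function names, and the sample counts are assumptions for illustration.

```python
def cpi(cycles: int, instructions: int) -> float:
    """Cycles per instruction from two raw counts."""
    if instructions <= 0:
        raise ValueError("instructions must be positive")
    return cycles / instructions


def best_case_cpi(retire_width: int) -> float:
    """Lowest possible CPI if retire_width instructions retire every cycle."""
    if retire_width <= 0:
        raise ValueError("retire_width must be positive")
    return 1.0 / retire_width


# Hypothetical measurement: 1,000 cycles for 2,000 instructions.
print(cpi(1_000, 2_000))
# Assumed 4-wide retirement.
print(best_case_cpi(4))
```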
  • Cycles Per Instruction (CPI), a standard measure, has some special kinks
    • Threads on each Intel® Xeon Phi™ core share a clock. If all 4 HW threads are active, each gets ¼ of the total cycles.
    • Multi-stage instruction decode requires two threads to utilize the whole core; one thread only gets half.
    • With two ops per cycle (U-V-pipe dual issue):
        Threads per Core    Best CPI per Core
        1                   1.0
        2                   0.5
        3                   0.5
        4                   0.5
    • To get thread CPI, multiply by the active threads:
        Threads per Core    Best CPI per Core    Best CPI per Thread
        1                 x 1.0                = 1.0
        2                 x 0.5                = 1.0
        3                 x 0.5                = 1.5
        4                 x 0.5                = 2.0
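The two tables on this slide can be reproduced with a short sketch. The function names are hypothetical; the per-core values are the ones listed on the slide.

```python
def best_cpi_per_core(active_threads: int) -> float:
    """Best per-core CPI from the slide: one thread can use only half the core."""
    if active_threads not in (1, 2, 3, 4):
        raise ValueError("a core runs 1 to 4 hardware threads")
    return 1.0 if active_threads == 1 else 0.5


def best_cpi_per_thread(active_threads: int) -> float:
    """Per-thread CPI is the per-core CPI multiplied by the active threads."""
    return active_threads * best_cpi_per_core(active_threads)


for threads in range(1, 5):
    print(threads, best_cpi_per_core(threads), best_cpi_per_thread(threads))
```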
  • The Top-Down Characterization: What is it?
    The Top-Down Characterization is:
    • A new way to organize and use processor events to identify the real hardware bottlenecks in systems/applications
    • Based on PMU events specifically designed for this task
    • Integrated into Intel® VTune™ Amplifier XE for Core
    • Available on Intel® microarchitecture code named Sandy Bridge and newer
  • The Top-Down Characterization
    Each pipeline slot on each cycle is classified into 1 of 4 categories.
    For each slot on each cycle: [flowchart not preserved in this transcript]
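The flowchart for this slide is missing from the transcript. The sketch below follows the decision tree as it is commonly described for the Top-Down method, using the four category names that appear in the breakdown slide. The function name and the three boolean inputs are illustrative assumptions, not hardware events.

```python
from enum import Enum


class SlotCategory(Enum):
    RETIRING = "Retiring"
    BAD_SPECULATION = "Bad Speculation"
    BACK_END_BOUND = "Back-End Bound"
    FRONT_END_BOUND = "Front-End Bound"


def classify_slot(uop_allocated: bool, uop_retires: bool,
                  back_end_stalled: bool) -> SlotCategory:
    """Classify one pipeline slot in one cycle into one of four categories."""
    if uop_allocated:
        # The slot was used: it was useful only if the u-op eventually retires.
        if uop_retires:
            return SlotCategory.RETIRING
        return SlotCategory.BAD_SPECULATION
    # The slot was empty: blame the back-end if it could not accept a u-op,
    # otherwise the front-end failed to deliver one.
    if back_end_stalled:
        return SlotCategory.BACK_END_BOUND
    return SlotCategory.FRONT_END_BOUND


print(classify_slot(True, True, False).value)
print(classify_slot(False, False, True).value)
```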
  • The Top-Down Characterization
    • The four categories sum to 1.0
    • Unit is “Percentage of total Pipeline Slots”
    • This is the core of the new Top-Down characterization
    • Each category is further broken down depending on available events
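A small sketch of how the four fractions relate to the total number of pipeline slots, assuming 4 slots per cycle as stated earlier in the deck. The slot counts are made-up numbers for illustration, and the function name is hypothetical.

```python
def slot_fractions(cycles: int, slots_by_category: dict,
                   slots_per_cycle: int = 4) -> dict:
    """Convert per-category slot counts into fractions of all pipeline slots."""
    total_slots = cycles * slots_per_cycle
    if total_slots <= 0:
        raise ValueError("cycles and slots_per_cycle must be positive")
    if sum(slots_by_category.values()) != total_slots:
        raise ValueError("every slot must fall into exactly one category")
    return {name: count / total_slots
            for name, count in slots_by_category.items()}


# Hypothetical run: 1,000 cycles, so 4,000 slots in total.
fractions = slot_fractions(1_000, {
    "Retiring": 1_600,
    "Bad Speculation": 400,
    "Front-End Bound": 800,
    "Back-End Bound": 1_200,
})
for name, value in fractions.items():
    print(f"{name}: {value:.2f}")
```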
  • Issues breakdown
    • Front-End
      • Latency: ITLB Overhead, ICache Misses, DSB Switches, Branch Resteers
      • Bandwidth: DSB, MITE
    • Back-End
      • Memory Bound: L1, L2, L3, DRAM (Local or Remote), Store Bound
      • Core Bound: DIV Active, Port Utilization (0 .. 3 ports)
    • Retiring: General, Microcode Sequencer
    • Bad Speculation: Branch Mispredict, Machine Clears
  • Examples of Metrics (Xeon Phi™)
  • Problem Area: L1 Cache Usage
    • Significantly affects data access latency and therefore application performance
    • Tuning Suggestions:
      • Software prefetching
      • Tile/block data access for cache size
      • Use streaming stores
      • If using 4K access stride, may be experiencing conflict misses
      • Examine Compiler prefetching (Compiler-generated L1 prefetches should not miss)
    • Metrics:
      • L1 Misses = DATA_READ_MISS_OR_WRITE_MISS + L1_DATA_HIT_INFLIGHT_PF1
      • L1 Hit Rate = (DATA_READ_OR_WRITE - L1 Misses) / DATA_READ_OR_WRITE; investigate if < 95%
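The two L1 formulas can be evaluated from raw event counts as sketched below. The event counts are made-up values and the function names are hypothetical; the formulas and the 95% threshold are taken from the slide.

```python
def l1_misses(data_read_miss_or_write_miss: int,
              l1_data_hit_inflight_pf1: int) -> int:
    """L1 Misses = DATA_READ_MISS_OR_WRITE_MISS + L1_DATA_HIT_INFLIGHT_PF1."""
    return data_read_miss_or_write_miss + l1_data_hit_inflight_pf1


def l1_hit_rate(data_read_or_write: int, misses: int) -> float:
    """L1 Hit Rate = (DATA_READ_OR_WRITE - L1 Misses) / DATA_READ_OR_WRITE."""
    if data_read_or_write <= 0:
        raise ValueError("DATA_READ_OR_WRITE must be positive")
    return (data_read_or_write - misses) / data_read_or_write


# Hypothetical event counts.
misses = l1_misses(60_000, 20_000)
rate = l1_hit_rate(1_000_000, misses)
print(f"L1 hit rate: {rate:.1%}")
print("investigate" if rate < 0.95 else "ok")
```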
  • Problem Area: Data Access Latency
    • Significantly affects application performance
    • Tuning Suggestions:
      • Software prefetching
      • Tile/block data access for cache size
      • Use streaming stores
      • Check cache locality: turn off prefetching and use CACHE_FILL events; reduce sharing if needed/possible
      • If using 64K access stride, may be experiencing conflict misses
    • Metric:
      • Estimated Latency Impact = (CPU_CLK_UNHALTED - EXEC_STAGE_CYCLES - DATA_READ_OR_WRITE) / DATA_READ_OR_WRITE_MISS; investigate if > 145
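A minimal sketch of the Estimated Latency Impact formula, using the event names exactly as they appear on the slide. The counts are invented for illustration and the function name is hypothetical.

```python
def estimated_latency_impact(cpu_clk_unhalted: int, exec_stage_cycles: int,
                             data_read_or_write: int,
                             data_read_or_write_miss: int) -> float:
    """(CPU_CLK_UNHALTED - EXEC_STAGE_CYCLES - DATA_READ_OR_WRITE)
    / DATA_READ_OR_WRITE_MISS, as given on the slide."""
    if data_read_or_write_miss <= 0:
        raise ValueError("DATA_READ_OR_WRITE_MISS must be positive")
    stalled = cpu_clk_unhalted - exec_stage_cycles - data_read_or_write
    return stalled / data_read_or_write_miss


# Hypothetical event counts.
impact = estimated_latency_impact(10_000_000, 4_000_000, 1_000_000, 25_000)
print(impact)
print("investigate" if impact > 145 else "ok")
```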
  • Problem Area: TLB Usage
    • Also affects data access latency and therefore application performance
    • Tuning Suggestions:
      • Improve cache usage & data access latency
      • If L1 TLB miss/L2 TLB miss is high, try using large pages
      • For loops with multiple streams, try splitting into multiple loops
      • If data access stride is a large power of 2, consider padding between arrays by one 4 KB page
    • Metrics:
      • L1 TLB miss ratio = DATA_PAGE_WALK / DATA_READ_OR_WRITE; investigate if > 1%
      • L2 TLB miss ratio = LONG_DATA_PAGE_WALK / DATA_READ_OR_WRITE; investigate if > 0.1%
      • L1 TLB misses per L2 TLB miss = DATA_PAGE_WALK / LONG_DATA_PAGE_WALK; investigate if > 100x
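The three TLB metrics and their thresholds can be sketched together. The dictionary keys, function names, and event counts are assumptions for illustration; the formulas and thresholds are the ones on the slide.

```python
def tlb_metrics(data_page_walk: int, long_data_page_walk: int,
                data_read_or_write: int) -> dict:
    """Compute the three TLB ratios from raw event counts."""
    if data_read_or_write <= 0 or long_data_page_walk <= 0:
        raise ValueError("denominator events must be positive")
    return {
        "l1_tlb_miss_ratio": data_page_walk / data_read_or_write,
        "l2_tlb_miss_ratio": long_data_page_walk / data_read_or_write,
        "l1_misses_per_l2_miss": data_page_walk / long_data_page_walk,
    }


# Thresholds from the slide: > 1%, > 0.1%, > 100x.
THRESHOLDS = {
    "l1_tlb_miss_ratio": 0.01,
    "l2_tlb_miss_ratio": 0.001,
    "l1_misses_per_l2_miss": 100.0,
}


def to_investigate(metrics: dict) -> list:
    """Names of the metrics that exceed their threshold."""
    return [name for name, value in metrics.items()
            if value > THRESHOLDS[name]]


# Hypothetical event counts.
metrics = tlb_metrics(20_000, 500, 1_000_000)
print(metrics)
print(to_investigate(metrics))
```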
  • Problem Area: VPU Usage
    • Indicates whether an application is vectorized successfully and efficiently
    • Tuning Suggestions:
      • Use the Compiler vectorization report!
      • For data dependencies preventing vectorization, try using Intel® Cilk™ Plus #pragma simd (if safe!)
      • Align data and tell the Compiler!
      • Re-structure code if possible: Array notations, AOS->SOA
    • Metric:
      • Vectorization Intensity = VPU_ELEMENTS_ACTIVE / VPU_INSTRUCTIONS_EXECUTED; investigate if < 8 (DP), < 16 (SP)
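A short sketch of the Vectorization Intensity check, with the double-precision and single-precision thresholds from the slide. The function names and the sample counts are illustrative assumptions.

```python
# Thresholds from the slide: investigate if below 8 (DP) or 16 (SP).
INTENSITY_THRESHOLD = {"DP": 8.0, "SP": 16.0}


def vectorization_intensity(vpu_elements_active: int,
                            vpu_instructions_executed: int) -> float:
    """VPU_ELEMENTS_ACTIVE / VPU_INSTRUCTIONS_EXECUTED."""
    if vpu_instructions_executed <= 0:
        raise ValueError("VPU_INSTRUCTIONS_EXECUTED must be positive")
    return vpu_elements_active / vpu_instructions_executed


def needs_investigation(intensity: float, precision: str) -> bool:
    """True if the intensity is below the slide's threshold for the precision."""
    return intensity < INTENSITY_THRESHOLD[precision]


# Hypothetical double-precision kernel.
intensity = vectorization_intensity(6_000_000, 1_000_000)
print(intensity, needs_investigation(intensity, "DP"))
```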
  • Problem Area: Memory Bandwidth
    • Can increase data latency in the system or become a performance bottleneck
    • Tuning Suggestions:
      • Improve locality in caches
      • Use streaming stores
      • Improve software prefetching
    • Metric:
      • Memory Bandwidth = (UNC_F_CH0_NORMAL_READ + UNC_F_CH0_NORMAL_WRITE + UNC_F_CH1_NORMAL_READ + UNC_F_CH1_NORMAL_WRITE) * 64 / time; investigate if < 80 GB/sec (practical peak 140 GB/sec, with 8 memory controllers)
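A minimal sketch of the Memory Bandwidth formula. It assumes the factor 64 is the number of bytes moved per counted read or write (one cache line) and that 1 GB is 10^9 bytes; both are assumptions, as are the function name and the counts.

```python
BYTES_PER_TRANSFER = 64  # assumed: one 64-byte cache line per counted event


def memory_bandwidth_gb_per_sec(ch0_read: int, ch0_write: int,
                                ch1_read: int, ch1_write: int,
                                seconds: float) -> float:
    """(sum of the four UNC_F_CH*_NORMAL_* counts) * 64 / time, in GB/sec."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    transfers = ch0_read + ch0_write + ch1_read + ch1_write
    return transfers * BYTES_PER_TRANSFER / seconds / 1e9


# Hypothetical counts collected over a 1-second interval.
bandwidth = memory_bandwidth_gb_per_sec(
    250_000_000, 250_000_000, 250_000_000, 250_000_000, 1.0)
print(f"{bandwidth:.1f} GB/sec")
print("investigate" if bandwidth < 80 else "ok")
```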
  • VTune™ Amplifier XE
  • DEMO
  • Running the General Exploration Collector
    1. Click the “New Analysis” button
    2. Select “General Exploration” for your CPU architecture
    3. Click “Start” to begin profiling
  • General Exploration Summary
  • VTune™ Amplifier XE visualizes performance
  • VTune™ Amplifier XE visualizes performance: Toolbar. Project: Instructions, Navigator, New, Open, Properties. Result: New, Open, Compare.
  • VTune™ Amplifier XE visualizes performance: Project Navigator
  • VTune™ Amplifier XE visualizes performance: Result Display Tabs
  • VTune™ Amplifier XE visualizes performance: Result Analysis Type
  • VTune™ Amplifier XE visualizes performance: Result Viewpoint
  • VTune™ Amplifier XE visualizes performance: Viewpoint Alternates
  • VTune™ Amplifier XE visualizes performance: Result Components
  • VTune™ Amplifier XE visualizes performance: Grid Pane
  • VTune™ Amplifier XE visualizes performance: Grid Pane, Grouping pull-down
  • VTune™ Amplifier XE visualizes performance: Stack Pane
  • VTune™ Amplifier XE visualizes performance: Timeline
  • VTune™ Amplifier XE visualizes performance: Filter/Options Bar
  • VTune™ Amplifier XE visualizes performance: Source View / Per-line localization
  • VTune™ Amplifier XE visualizes performance: Source View / Hot spot navigation controls
  • VTune™ Amplifier XE visualizes performance: Assembly View / Hot spot navigation controls
  • VTune™ Amplifier XE visualizes performance: Assembly View / Assembly groupings
  • For event collection the coprocessor is treated as a special HW architecture
  • Project properties provide the means to invoke data collection by target type
  • Launch Application serves many uses, from host/offload to native execution
  • Search directories have been reorganized to speed symbol resolution during finalization. Notable coprocessor library paths:
      /opt/mpss/3.2/sysroots/k1om-mpss-Linux/boot
      /opt/mpss/3.2/sysroots/k1om-mpss-Linux/lib64
      /opt/intel/composerxe/lib/mic
      /opt/intel/composerxe/tbb/lib/mic
      /opt/intel/composerxe/mkl/lib/mic
      /opt/intel/mpi-rt/4.1.3/mic
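Before adding these paths as search directories, it can help to confirm they exist on the host, since the version components (3.2, 4.1.3) change between installations. A minimal sketch follows; the helper name `split_search_dirs` is hypothetical and not part of any Intel tool.

```python
import os

# Paths copied from the slide; the 3.2 and 4.1.3 version components
# are specific to that installation and will differ elsewhere.
COPROCESSOR_SEARCH_DIRS = [
    "/opt/mpss/3.2/sysroots/k1om-mpss-Linux/boot",
    "/opt/mpss/3.2/sysroots/k1om-mpss-Linux/lib64",
    "/opt/intel/composerxe/lib/mic",
    "/opt/intel/composerxe/tbb/lib/mic",
    "/opt/intel/composerxe/mkl/lib/mic",
    "/opt/intel/mpi-rt/4.1.3/mic",
]


def split_search_dirs(paths, exists=os.path.isdir):
    """Return (present, missing) lists, preserving the input order."""
    present, missing = [], []
    for path in paths:
        (present if exists(path) else missing).append(path)
    return present, missing


if __name__ == "__main__":
    present, missing = split_search_dirs(COPROCESSOR_SEARCH_DIRS)
    for path in missing:
        print("missing:", path)
```

The `exists` parameter is there only so the check can be exercised without the real directories.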
  • General Exploration runs a set of events to drive top-down analysis
  • For more information on Intel® Xeon Phi™ and VTune™ Amplifier XE. Optimization on the coprocessor: http://software.intel.com/en-us/articles/optimization-and-performance-tuning-for-intel-xeon-phi-coprocessors-part-1-optimization and http://software.intel.com/en-us/articles/optimization-and-performance-tuning-for-intel-xeon-phi-coprocessors-part-2-understanding. Coprocessor Performance Monitoring Unit: http://software.intel.com/sites/default/files/forum/278102/intelr-xeon-phitm-pmu-rev1.01.pdf. For general information: http://software.intel.com/mic-developer
  • Grid is Based on Top-Down
  • Use the Hover Text to Understand Metrics* (*Suggestions welcome: submit issues if the text isn’t helpful)
  • Event collections on the coprocessor can generate large volumes of data (example: dgemm on 60+ cores). Tip: use cpu-mask to reduce the data set while maintaining the same accuracy.
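A CPU mask is typically written as a comma-separated list of logical CPU ids and ranges, such as `0-3,8`. The sketch below shows one way to produce such a string from a chosen subset of CPUs. The helper name `format_cpu_mask` is hypothetical, the range syntax is an assumption to check against the tool's help text, and which logical CPUs map to which cores on the coprocessor is not stated on the slide.

```python
def format_cpu_mask(cpus):
    """Compress logical CPU ids into a range string such as "0-3,8"."""
    ids = sorted(set(cpus))
    if not ids:
        return ""
    parts = []
    start = prev = ids[0]
    for cpu in ids[1:]:
        if cpu == prev + 1:
            # Still inside a contiguous run; extend it.
            prev = cpu
            continue
        # Run ended: emit it as "a" or "a-b", then start a new run.
        parts.append(str(start) if start == prev else f"{start}-{prev}")
        start = prev = cpu
    parts.append(str(start) if start == prev else f"{start}-{prev}")
    return ",".join(parts)


if __name__ == "__main__":
    # Hypothetical selection: sample only the first 16 logical CPUs.
    print(format_cpu_mask(range(16)))
```

Sampling a representative subset of CPUs keeps the per-CPU event data intact while shrinking the overall result, which is the effect the tip describes.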
  • Resources. Top-Down Characterization White Paper: http://software.intel.com/en-us/articles/how-to-tune-applications-using-a-top-down-characterization-of-microarchitectural-issues. Tuning Guides: http://software.intel.com/en-us/articles/processor-specific-performance-analysis-papers