Methods and practices to analyze the performance of your application with Intel® VTune™ Amplifier XE

Leo Borges
Intel Software Conference 2014 Brazil
May 2014

Copyright© 2013, Intel Corporation. All rights reserved.
*Other brands and names are the property of their respective owners.
INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE,
TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER
AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR
WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY
PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.
Software and workloads used in performance tests may have been optimized for performance only on Intel
microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems,
components, software, operations and functions. Any change to any of those factors may cause the results to vary. You
should consult other information and performance tests to assist you in fully evaluating your contemplated purchases,
including the performance of that product when combined with other products.
Copyright © , Intel Corporation. All rights reserved. Intel, the Intel logo, Xeon, Xeon Phi, Core, VTune, and Cilk are
trademarks of Intel Corporation in the U.S. and other countries.
Optimization Notice
Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that
are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and
other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on
microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended
for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for
Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information
regarding the specific instruction sets covered by this notice.
Notice revision #20110804
Legal Disclaimer & Optimization Notice

Agenda
• Intel® VTune Amplifier XE Intro
• Microarchitecture Review
• The Top-Down Characterization details
• Intel® VTune™ Amplifier XE Implementation
• Demo
**Sources for current presentation:
http://software.intel.com/en-us/articles/advanced-profiling-with-intel-vtune-amplifier-xe-part-1-find-the-bottleneck

Two Ways to Collect Data - Intel® VTune™ Amplifier XE
Software Collector (Hotspots, Concurrency, Locks & Waits)
• Uses OS interrupts
• Collects from a single process tree
• ~10 ms default resolution
• Collects on both Intel® and compatible processors
• Call stacks show calling sequence
• Works in virtual environments
• No driver required
Hardware Collector (Lightweight Hotspots, Advanced Analysis)
• Uses the on-chip Performance Monitoring Unit (PMU)
• Collects system-wide or from a single process tree
• ~1 ms default resolution (finer granularity, finds small functions)
• Requires a genuine Intel® processor for collection
• New! Optionally collects call stacks
• Works in virtual environments only when supported by the VM (e.g., vSphere* 5.1)
• Requires a driver
No special recompiles: C, C++, C#, Fortran, Java, Assembly

Microarchitecture basics
Fetch -> Decode -> Execute -> Retire
• Classic 4-stage pipeline depicted here.
• Memory not shown.
• The pipeline on current processors is capable of speculative and out-of-order execution.

Intuitive approach to event-based sampling (EBS)
• Use a small list of metrics to monitor the level of optimization
• Example 1: Cycles per instruction (CPI)
• Example 2: Instruction retirement ratio
For m instructions issued and n retired:
Retirement ratio = n/m
% executed but not retired = (1 – n/m)*100
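The two example metrics can be sketched in a few lines of Python. This is only an illustration: the function names and the counter values are hypothetical, and real values would come from a profiler's event counts.

```python
def cycles_per_instruction(cycles: int, instructions_retired: int) -> float:
    """CPI: clock cycles divided by instructions retired."""
    if instructions_retired <= 0:
        raise ValueError("instructions_retired must be positive")
    return cycles / instructions_retired


def retirement_ratio(issued: int, retired: int) -> float:
    """Retirement ratio n/m for m instructions issued and n retired."""
    if issued <= 0:
        raise ValueError("issued must be positive")
    return retired / issued


def percent_not_retired(issued: int, retired: int) -> float:
    """Percentage of issued instructions that executed but did not retire."""
    return (1 - retirement_ratio(issued, retired)) * 100


# Hypothetical counter values, chosen only for illustration.
cycles, issued, retired = 2_000_000, 1_250_000, 1_000_000
print(f"CPI = {cycles_per_instruction(cycles, retired):.2f}")
print(f"Retirement ratio = {retirement_ratio(issued, retired):.2f}")
print(f"Executed but not retired = {percent_not_retired(issued, retired):.1f}%")
```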

Microarchitecture Review
Fetch -> Decode -> Execute -> Memory -> Commit
The traditional 5-stage pipeline. The pipeline on current processors is capable of out-of-order execution.

Front-End
The front-end fetches instructions IN ORDER, decodes them into u-ops (micro-operations), and sends the u-ops to the back-end.

Back-End
The back-end receives u-ops, executes them OUT OF ORDER,
accesses memory as needed, and commits results to memory
IN ORDER.

Allocation
Allocation is the point where u-ops transfer from the
front-end to the back-end. The front-end can allocate 4
u-ops per cycle.

Retirement
Retirement is the point where u-ops leave the back-end. The
back-end can retire 4 u-ops per cycle.

And a New Term: the Pipeline Slot
[Figure: 4 potential allocations per cycle, 4 potential retirements per cycle]
In reality, there are many queues, buffers, and pieces of logic
throughout the pipeline to allow up to 4 allocations and 4
retirements per cycle.

The “Pipeline Slot” is an abstraction representing all the
resources needed to move one u-op through the pipeline.

There are 4 Pipeline Slots available every cycle.
[Figure: Fetch, Decode, Execute, Memory, Commit pipeline with slots S1, S2, S3, S4]

Pipeline slots are filled with u-ops that travel from allocation to retirement over multiple cycles.
[Figure: slots S1 to S4 repeated across five consecutive cycles]
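As a rough sketch of the slot arithmetic described above: with 4 slots per cycle, the total slot count is 4 times the cycle count, and the share of slots that carried a retired u-op follows directly. The function names and sample numbers are hypothetical.

```python
PIPELINE_WIDTH = 4  # slots per cycle, per the slides (4 allocations, 4 retirements)


def total_pipeline_slots(cycles: int, width: int = PIPELINE_WIDTH) -> int:
    """Total pipeline slots available over a number of cycles."""
    return width * cycles


def slot_utilization(uops_retired: int, cycles: int, width: int = PIPELINE_WIDTH) -> float:
    """Fraction of all pipeline slots that ended up holding a retired u-op."""
    slots = total_pipeline_slots(cycles, width)
    if slots <= 0:
        raise ValueError("cycles and width must be positive")
    return uops_retired / slots


# Five cycles, as in the figure: 20 slots in total.
print(total_pipeline_slots(5))
print(slot_utilization(uops_retired=10, cycles=5))
```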

Cycles Per Instruction (CPI), a standard measure, has some special kinks
For multi-core processors, CPI can get as low as 0.25 cycles per instruction with current Intel processors.
Normally, a CPI below ~1.0 is targeted for better performance.
Some would suggest targeting a CPI of around 0.75 to 0.50.
But does this hold for every architecture?

Cycles Per Instruction (CPI), a standard measure, has some special kinks
• Threads on each Intel® Xeon Phi™ core share a clock
If all 4 HW threads are active, each gets ¼ of the total cycles
• Multi-stage instruction decode requires two threads to utilize the whole core; one thread only gets half
• With two ops per cycle (U-V-pipe dual issue), the best CPI per core is 0.5
• To get thread CPI, multiply by the active threads

Threads per Core x Best CPI per Core = Best CPI per Thread
1 x 1.0 = 1.0
2 x 0.5 = 1.0
3 x 0.5 = 1.5
4 x 0.5 = 2.0
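The table can be reproduced with a small lookup. This is a sketch only: the dictionary simply restates the per-core values from the slide, and the function name is hypothetical.

```python
# Best achievable CPI per core for each number of active HW threads (from the slide).
BEST_CPI_PER_CORE = {1: 1.0, 2: 0.5, 3: 0.5, 4: 0.5}


def best_cpi_per_thread(active_threads: int) -> float:
    """Best per-thread CPI: per-core CPI multiplied by the active threads."""
    if active_threads not in BEST_CPI_PER_CORE:
        raise ValueError("a core runs 1 to 4 hardware threads")
    return active_threads * BEST_CPI_PER_CORE[active_threads]


for threads in sorted(BEST_CPI_PER_CORE):
    print(threads, BEST_CPI_PER_CORE[threads], best_cpi_per_thread(threads))
```

A measured per-thread CPI is then best judged against the value for the matching thread count, not against a fixed target such as 1.0.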

The Top-Down Characterization
What is it?
The Top-Down Characterization is:
• A new way to organize and use processor events to
identify the real hardware bottlenecks in
systems/applications
• Based on PMU events specifically designed for this task
• Integrated into Intel® VTune Amplifier XE for Core
• Available on Intel® Microarchitecture code named Sandy
Bridge and newer

Each pipeline slot on each cycle is classified into 1 of 4 categories: Front-End, Back-End, Bad Speculation, or Retiring.
[Figure: classification flowchart applied to each slot on each cycle]
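The per-slot decision can be sketched as below. The flowchart itself is not reproduced in this text, so the decision order here is an assumption based on Intel's published top-down description (was a u-op allocated, does it eventually retire, is the back-end stalled); all names are hypothetical.

```python
from enum import Enum


class SlotCategory(Enum):
    FRONT_END = "Front-End"
    BACK_END = "Back-End"
    BAD_SPECULATION = "Bad Speculation"
    RETIRING = "Retiring"


def classify_slot(uop_allocated: bool, uop_retires: bool, back_end_stalled: bool) -> SlotCategory:
    """Classify one pipeline slot on one cycle (assumed decision order)."""
    if uop_allocated:
        # A u-op filled the slot: useful work if it retires, wasted work if not.
        return SlotCategory.RETIRING if uop_retires else SlotCategory.BAD_SPECULATION
    # The slot stayed empty: blame the back-end if it could not accept a u-op,
    # otherwise the front-end failed to deliver one.
    return SlotCategory.BACK_END if back_end_stalled else SlotCategory.FRONT_END


print(classify_slot(True, True, False).value)
print(classify_slot(False, False, True).value)
```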

• The four category values sum to 1.0
• Unit is “Percentage of total Pipeline Slots”
• This is the core of the new Top-Down characterization
• Each category is further broken down depending on available events
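A minimal sketch of why the categories sum to 1.0: each value is the category's slot count divided by the total number of slots. The counts below are invented for illustration and the function name is hypothetical.

```python
from typing import Dict


def slot_fractions(slot_counts: Dict[str, int]) -> Dict[str, float]:
    """Convert per-category slot counts into fractions of total pipeline slots."""
    total = sum(slot_counts.values())
    if total <= 0:
        raise ValueError("at least one slot must be counted")
    return {name: count / total for name, count in slot_counts.items()}


# Hypothetical slot counts for the four top-level categories.
counts = {"Front-End": 200, "Back-End": 400, "Bad Speculation": 100, "Retiring": 300}
fractions = slot_fractions(counts)
for name, value in fractions.items():
    print(f"{name}: {value:.0%}")
```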

Issues breakdown
• Front-End
    Latency: ITLB Overhead, ICache Misses, DSB Switches, Branch Resteers
    Bandwidth: DSB, MITE
• Back-End
    Memory Bound: L1, L2, L3, DRAM (local or remote), Store Bound
    Core Bound: DIV Active, Port Utilization (0 .. 3 ports)
• Bad Speculation: Branch Mispredict, Machine Clears
• Retiring: General, Microcode Sequencer

Examples of Metrics (Intel® Xeon Phi™)

Problem Area: L1 Cache Usage
• Significantly affects data access latency and therefore application performance
• Tuning Suggestions:
Software prefetching
Tile/block data access for cache size
Use streaming stores
If using 4K access stride, may be experiencing conflict misses
Examine Compiler prefetching (Compiler-generated L1 prefetches should not miss)
Metric: L1 Misses
Formula: DATA_READ_MISS_OR_WRITE_MISS + L1_DATA_HIT_INFLIGHT_PF1
Metric: L1 Hit Rate
Formula: (DATA_READ_OR_WRITE – L1 Misses) / DATA_READ_OR_WRITE
Investigate if: < 95%
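The two L1 formulas can be evaluated from raw event counts as sketched below. The parameter names mirror the events on the slide; the function names and sample counts are hypothetical.

```python
def l1_misses(data_read_miss_or_write_miss: int, l1_data_hit_inflight_pf1: int) -> int:
    """L1 Misses = DATA_READ_MISS_OR_WRITE_MISS + L1_DATA_HIT_INFLIGHT_PF1."""
    return data_read_miss_or_write_miss + l1_data_hit_inflight_pf1


def l1_hit_rate(data_read_or_write: int, misses: int) -> float:
    """L1 Hit Rate = (DATA_READ_OR_WRITE - L1 Misses) / DATA_READ_OR_WRITE."""
    if data_read_or_write <= 0:
        raise ValueError("data_read_or_write must be positive")
    return (data_read_or_write - misses) / data_read_or_write


def l1_needs_investigation(hit_rate: float, threshold: float = 0.95) -> bool:
    """Flag the metric when the hit rate falls below the 95% guideline."""
    return hit_rate < threshold


# Hypothetical event counts.
misses = l1_misses(data_read_miss_or_write_miss=80, l1_data_hit_inflight_pf1=20)
rate = l1_hit_rate(data_read_or_write=1_000, misses=misses)
print(f"L1 hit rate = {rate:.1%}, investigate: {l1_needs_investigation(rate)}")
```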

Problem Area: Data Access Latency
• Significantly affects application performance
• Tuning Suggestions:
Software prefetching
Tile/block data access for cache size
Check cache locality: turn off prefetching and use CACHE_FILL events; reduce sharing if needed/possible
If using 64K access stride, may be experiencing conflict misses
Metric: Estimated Latency Impact
Formula: (CPU_CLK_UNHALTED – EXEC_STAGE_CYCLES – DATA_READ_OR_WRITE) / DATA_READ_OR_WRITE_MISS
Investigate if: > 145
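A sketch of the Estimated Latency Impact calculation, assuming the four event counts are already available as integers. Parameter names follow the slide; the function names and sample counts are hypothetical.

```python
def estimated_latency_impact(
    cpu_clk_unhalted: int,
    exec_stage_cycles: int,
    data_read_or_write: int,
    data_read_or_write_miss: int,
) -> float:
    """(CPU_CLK_UNHALTED - EXEC_STAGE_CYCLES - DATA_READ_OR_WRITE) / DATA_READ_OR_WRITE_MISS."""
    if data_read_or_write_miss <= 0:
        raise ValueError("data_read_or_write_miss must be positive")
    stalled = cpu_clk_unhalted - exec_stage_cycles - data_read_or_write
    return stalled / data_read_or_write_miss


def latency_needs_investigation(impact: float, threshold: float = 145.0) -> bool:
    """Flag the metric when it exceeds the guideline of 145 from the slide."""
    return impact > threshold


# Hypothetical event counts.
impact = estimated_latency_impact(1_000_000, 300_000, 200_000, 2_500)
print(impact, latency_needs_investigation(impact))
```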

Problem Area: TLB Usage
• Also affects data access latency and therefore application performance
• Tuning Suggestions:
Improve cache usage & data access latency
If L1 TLB miss/L2 TLB miss is high, try using large pages
For loops with multiple streams, try splitting into multiple loops
If data access stride is a large power of 2, consider padding between arrays by one 4 KB page
Metric: L1 TLB miss ratio
Formula: DATA_PAGE_WALK / DATA_READ_OR_WRITE
Investigate if: > 1%
Metric: L2 TLB miss ratio
Formula: LONG_DATA_PAGE_WALK / DATA_READ_OR_WRITE
Investigate if: > 0.1%
Metric: L1 TLB misses per L2 TLB miss
Formula: DATA_PAGE_WALK / LONG_DATA_PAGE_WALK
Investigate if: > 100x
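The three TLB metrics and their thresholds can be sketched together. Parameter names follow the events on the slide; the function names, dictionary keys, and sample counts are hypothetical, and treating a zero LONG_DATA_PAGE_WALK count as an infinite ratio is a choice made here, not something the slides specify.

```python
import math
from typing import Dict


def tlb_metrics(data_page_walk: int, long_data_page_walk: int, data_read_or_write: int) -> Dict[str, float]:
    """Compute the three TLB metrics from raw event counts."""
    if data_read_or_write <= 0:
        raise ValueError("data_read_or_write must be positive")
    return {
        "l1_tlb_miss_ratio": data_page_walk / data_read_or_write,
        "l2_tlb_miss_ratio": long_data_page_walk / data_read_or_write,
        # Assumption: with no L2 TLB misses the ratio is reported as infinite.
        "l1_per_l2": data_page_walk / long_data_page_walk if long_data_page_walk else math.inf,
    }


def tlb_flags(metrics: Dict[str, float]) -> Dict[str, bool]:
    """Apply the guideline thresholds: > 1%, > 0.1%, and > 100x."""
    return {
        "l1_tlb_miss_ratio": metrics["l1_tlb_miss_ratio"] > 0.01,
        "l2_tlb_miss_ratio": metrics["l2_tlb_miss_ratio"] > 0.001,
        "l1_per_l2": metrics["l1_per_l2"] > 100,
    }


# Hypothetical event counts.
metrics = tlb_metrics(data_page_walk=20_000, long_data_page_walk=100, data_read_or_write=1_000_000)
print(metrics)
print(tlb_flags(metrics))
```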

Problem Area: VPU Usage
• Indicates whether an application is vectorized successfully and efficiently
• Tuning Suggestions:
Use the Compiler vectorization report!
For data dependencies preventing vectorization, try using Intel® Cilk™ Plus #pragma SIMD (if safe!)
Align data and tell the Compiler!
Re-structure code if possible: Array notations, AOS->SOA
Metric: Vectorization Intensity
Formula: VPU_ELEMENTS_ACTIVE / VPU_INSTRUCTIONS_EXECUTED
Investigate if: < 8 (DP), < 16 (SP)
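A sketch of the Vectorization Intensity check. The thresholds of 8 (double precision) and 16 (single precision) come from the slide; the function names, the "dp"/"sp" keys, and the sample counts are hypothetical.

```python
# Guideline thresholds from the slide: investigate below 8 (DP) or 16 (SP).
THRESHOLDS = {"dp": 8.0, "sp": 16.0}


def vectorization_intensity(vpu_elements_active: int, vpu_instructions_executed: int) -> float:
    """VPU_ELEMENTS_ACTIVE / VPU_INSTRUCTIONS_EXECUTED."""
    if vpu_instructions_executed <= 0:
        raise ValueError("vpu_instructions_executed must be positive")
    return vpu_elements_active / vpu_instructions_executed


def vpu_needs_investigation(intensity: float, precision: str) -> bool:
    """Flag the metric when intensity is below the threshold for the precision."""
    return intensity < THRESHOLDS[precision]


# Hypothetical event counts for a double-precision kernel.
intensity = vectorization_intensity(vpu_elements_active=6_000, vpu_instructions_executed=1_000)
print(intensity, vpu_needs_investigation(intensity, "dp"))
```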

Problem Area: Memory Bandwidth
• Can increase data latency in the system or become a performance bottleneck
• Tuning Suggestions:
Improve locality in caches
Improve software prefetching
Metric: Memory Bandwidth
Formula: (UNC_F_CH0_NORMAL_READ + UNC_F_CH0_NORMAL_WRITE + UNC_F_CH1_NORMAL_READ + UNC_F_CH1_NORMAL_WRITE) * 64 / time
Investigate if: < 80 GB/sec (practical peak 140 GB/sec with 8 memory controllers)
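The bandwidth formula can be sketched as follows, assuming each counted read or write moves one 64-byte cache line, that time is in seconds, and that 1 GB means 10^9 bytes. The function names and sample counts are hypothetical.

```python
CACHE_LINE_BYTES = 64  # the "* 64" factor in the formula


def memory_bandwidth_gb_per_s(
    ch0_read: int, ch0_write: int, ch1_read: int, ch1_write: int, seconds: float
) -> float:
    """Sum the four UNC_F_CH*_NORMAL_* counts, scale by 64 bytes, divide by time."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    total_lines = ch0_read + ch0_write + ch1_read + ch1_write
    return total_lines * CACHE_LINE_BYTES / seconds / 1e9  # assumption: 1 GB = 1e9 bytes


def bandwidth_needs_investigation(gb_per_s: float, threshold: float = 80.0) -> bool:
    """Flag the metric when it falls below the 80 GB/sec guideline."""
    return gb_per_s < threshold


# Hypothetical event counts collected over one second.
bw = memory_bandwidth_gb_per_s(500_000_000, 500_000_000, 500_000_000, 500_000_000, 1.0)
print(bw, bandwidth_needs_investigation(bw))
```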

VTune™ Amplifier XE

DEMO

Running the General Exploration Collector
1. Click “New Analysis” button
2. Select “General Exploration” for your CPU architecture
3. Click “Start” to begin profiling
General Exploration Summary

VTune™ Amplifier XE visualizes performance
• Toolbar: Instructions, Navigator; Project (New, Open, Properties); Result (New, Open, Compare)
• Project Navigator
• Result Display Tabs
• Result Analysis Type
• Result Viewpoint
• Viewpoint Alternates
• Result Components
• Grid Pane
• Grid Pane: Grouping pull-down
• Stack Pane
• Timeline
• Filter/Options Bar
• Source View: Per-line localization
• Source View: Hot spot navigation controls
• Assembly View: Hot spot navigation controls
• Assembly View: Assembly groupings

For event collection the coprocessor is treated as a special HW architecture

Project properties provides the means to invoke data collection by target type

Launch Application serves many uses, from host/offload to native execution

Search directories have been reorganized to speed symbol resolution during finalization
Notable coprocessor library paths:
/opt/mpss/3.2/sysroots/k1om-mpss-Linux/boot
/opt/mpss/3.2/sysroots/k1om-mpss-Linux/lib64
/opt/intel/composerxe/lib/mic
/opt/intel/composerxe/tbb/lib/mic
/opt/intel/composerxe/mkl/lib/mic
/opt/intel/mpi-rt/4.1.3/mic

General Exploration runs a set of events to drive top-down analysis

For more information on Intel® Xeon Phi™ and VTune™ Amplifier XE
Optimization on the coprocessor:
http://software.intel.com/en-us/articles/optimization-and-performance-tuning-for-intel-xeon-phi-coprocessors-part-1-optimization
http://software.intel.com/en-us/articles/optimization-and-performance-tuning-for-intel-xeon-phi-coprocessors-part-2-understanding
Coprocessor Performance Monitoring Unit:
http://software.intel.com/sites/default/files/forum/278102/intelr-xeon-phitm-pmu-rev1.01.pdf
For general information: http://software.intel.com/mic-developer

Grid is Based on Top-Down

Use the Hover Text to Understand Metrics*
*Suggestions welcome: Submit issues if the text isn’t helpful

Event collections on the coprocessor can generate volumes of data
dgemm: on 60+ cores
Tip: Use cpu-mask to reduce the data set while maintaining the same accuracy.

Resources
Top-Down Characterization White Paper
http://software.intel.com/en-us/articles/how-to-tune-applications-using-a-top-down-characterization-of-microarchitectural-issues
Tuning Guides
http://software.intel.com/en-us/articles/processor-specific-performance-analysis-papers
