Methods and practices to analyze the performance of your application with Intel® VTune™ Amplifier XE
Leo Borges
Intel Software Conference 2014 Brazil
May 2014
Copyright© 2013, Intel Corporation. All rights reserved.
*Other brands and names are the property of their respective owners.
INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE,
TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER
AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR
WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY
PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.
Software and workloads used in performance tests may have been optimized for performance only on Intel
microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems,
components, software, operations and functions. Any change to any of those factors may cause the results to vary. You
should consult other information and performance tests to assist you in fully evaluating your contemplated purchases,
including the performance of that product when combined with other products.
Copyright © , Intel Corporation. All rights reserved. Intel, the Intel logo, Xeon, Xeon Phi, Core, VTune, and Cilk are
trademarks of Intel Corporation in the U.S. and other countries.
Optimization Notice
Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that
are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and
other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on
microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended
for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for
Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information
regarding the specific instruction sets covered by this notice.
Notice revision #20110804
Legal Disclaimer & Optimization Notice
Agenda
• Intel® VTune™ Amplifier XE Intro
• Microarchitecture Review
• The Top-Down Characterization details
• Intel® VTune™ Amplifier XE Implementation
• Demo
**Sources for current presentation:
http://software.intel.com/en-us/articles/advanced-profiling-with-intel-vtune-amplifier-xe-part-1-find-the-bottleneck
Two Ways to Collect Data - Intel® VTune™ Amplifier XE
Software Collector                                | Hardware Collector
Hotspots, Concurrency, Locks & Waits              | Lightweight Hotspots, Advanced Analysis
Uses OS interrupts                                | Uses the on-chip Performance Monitoring Unit (PMU)
Collects from a single process tree               | Collects system-wide or from a single process tree
~10 ms default resolution                         | ~1 ms default resolution (finer granularity - finds small functions)
Collects on both Intel® and compatible processors | Requires a genuine Intel® processor for collection
Call stacks show calling sequence                 | New! Optionally collect call stacks
Works in virtual environments                     | Works in virtual environments only when supported by the VM (e.g., vSphere* 5.1)
No driver required                                | Requires a driver
No special recompiles - C, C++, C#, Fortran, Java, Assembly
Microarchitecture basics
Fetch -> Decode -> Execute -> Retire
• Classic 4-stage pipeline depicted here.
• Memory not shown.
• Pipelines on current processors are capable of speculative
and out-of-order execution.
Intuitive approach to EBS (event-based sampling)
• Use a small list of metrics to monitor level of
optimization
• Example 1: Cycles per instruction (CPI)
• Example 2: Instruction retirement ratio
Of m instructions issued, n are retired:
Retirement ratio = n/m
% executed but not retired = (1 - n/m) * 100
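The two example metrics above can be worked through with a few lines of Python. This is only a minimal sketch: the function names and the sample counts are hypothetical and chosen for easy arithmetic; they are not taken from VTune output.

```python
def cpi(cycles: int, instructions_retired: int) -> float:
    """Cycles per instruction: fewer cycles per retired instruction is better."""
    if instructions_retired <= 0:
        raise ValueError("instructions_retired must be positive")
    return cycles / instructions_retired


def retirement_ratio(issued: int, retired: int) -> float:
    """Fraction n/m of issued instructions (m) that were retired (n)."""
    if issued <= 0:
        raise ValueError("issued must be positive")
    return retired / issued


def percent_not_retired(issued: int, retired: int) -> float:
    """Percentage of issued instructions that never retired: (1 - n/m) * 100."""
    return (1 - retirement_ratio(issued, retired)) * 100


# Hypothetical sample counts, for illustration only.
print(cpi(cycles=2_000_000, instructions_retired=1_600_000))  # 1.25
print(retirement_ratio(issued=1_000, retired=750))            # 0.75
print(percent_not_retired(issued=1_000, retired=750))         # 25.0
```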
Microarchitecture Review
Fetch -> Decode -> Execute -> Memory -> Commit
The traditional 5-stage pipeline. Pipelines on current
processors are capable of out-of-order execution.
Microarchitecture Review
Fetch -> Decode -> Execute -> Memory -> Commit
Front-End
The front-end fetches instructions IN ORDER, decodes them into
u-ops (micro-operations), and sends the u-ops to the back-end.
Microarchitecture Review
Fetch -> Decode -> Execute -> Memory -> Commit
Front-End | Back-End
The back-end receives u-ops, executes them OUT OF ORDER,
accesses memory as needed, and commits results to memory
IN ORDER.
Microarchitecture Review
Fetch -> Decode -> Execute -> Memory -> Commit
Front-End | Back-End
Allocation
Allocation is the point where u-ops transfer from the
front-end to the back-end. The front-end can allocate 4
u-ops per cycle.
Microarchitecture Review
Fetch -> Decode -> Execute -> Memory -> Commit
Front-End | Back-End
Allocation | Retirement
Retirement is the point where u-ops leave the back-end. The
back-end can retire 4 u-ops per cycle.
And a New Term: the Pipeline Slot
Fetch -> Decode -> Execute -> Memory -> Commit
Front-End | Back-End
4 Potential Allocations per Cycle | 4 Potential Retirements per Cycle
In reality, there are many queues, buffers, and pieces of logic
throughout the pipeline to allow up to 4 allocations and 4
retirements per cycle.
And a New Term: the Pipeline Slot
Fetch -> Decode -> Execute -> Memory -> Commit
Front-End | Back-End
4 Potential Allocations per Cycle | 4 Potential Retirements per Cycle
The “Pipeline Slot” is an abstraction representing all the
resources needed to move one u-op through the pipeline.
And a New Term: the Pipeline Slot
Fetch -> Decode -> Execute -> Memory -> Commit
Front-End | Back-End
There are 4 Pipeline Slots available every cycle: S1, S2, S3, S4.
And a New Term: the Pipeline Slot
Fetch -> Decode -> Execute -> Memory -> Commit
Front-End | Back-End
Pipeline slots are filled with u-ops that travel from allocation
to retirement over multiple cycles.
[Figure: pipeline slots S1 to S4, repeated five times across the pipeline diagram]
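Since 4 pipeline slots are available every cycle, the total number of slots in a measurement window is simply 4 times the cycle count. The sketch below illustrates that arithmetic; the function names and the counts are hypothetical, and the "filled" ratio is only an illustration of how slots relate to retired u-ops, not a VTune metric definition.

```python
SLOTS_PER_CYCLE = 4  # from the slides: 4 pipeline slots available every cycle


def total_pipeline_slots(cycles: int) -> int:
    """Total slots that could have carried a u-op during `cycles` cycles."""
    return SLOTS_PER_CYCLE * cycles


def slot_utilization(uops_retired: int, cycles: int) -> float:
    """Fraction of available slots that ended up carrying a retired u-op."""
    if cycles <= 0:
        raise ValueError("cycles must be positive")
    return uops_retired / total_pipeline_slots(cycles)


# Hypothetical counts, for illustration only.
print(total_pipeline_slots(1_000))                        # 4000
print(slot_utilization(uops_retired=3_000, cycles=1_000))  # 0.75
```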
Cycles Per Instruction (CPI), a standard
measure, has some special kinks
For multi-core processors, CPI can get as low as 0.25 cycles
per instruction on current Intel processors (four retirements per cycle).
Normally, a CPI below ~1.0 is targeted for
better performance.
Some suggest targeting a CPI of around 0.75 to
0.50.
But does this hold for every architecture?
Cycles Per Instruction (CPI), a standard
measure, has some special kinks
• Threads on each Intel® Xeon Phi™ core share a clock
  • If all 4 HW threads are active, each gets ¼ of the total cycles
• Multi-stage instruction decode requires two threads to utilize the
whole core; one thread only gets half
• With two ops per cycle (U-V-pipe dual issue):

Threads per Core | Best CPI per Core | Best CPI per Thread
1                | 1.0               | 1 x 1.0 = 1.0
2                | 0.5               | 2 x 0.5 = 1.0
3                | 0.5               | 3 x 0.5 = 1.5
4                | 0.5               | 4 x 0.5 = 2.0

• To get thread CPI, multiply the per-core CPI by the number of active threads
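The table can be reproduced with a short calculation. This is a minimal sketch assuming only what the slide states: a single thread can use half the core (best CPI 1.0), two or more threads can reach the dual-issue limit (best CPI 0.5 per core), and per-thread CPI is the per-core value multiplied by the active thread count. The function names are illustrative.

```python
def best_cpi_per_core(active_threads: int) -> float:
    """Best achievable CPI for the whole core, per the slide's table."""
    if not 1 <= active_threads <= 4:
        raise ValueError("a core runs 1 to 4 hardware threads")
    # One thread only gets half the core; two or more can dual-issue.
    return 1.0 if active_threads == 1 else 0.5


def best_cpi_per_thread(active_threads: int) -> float:
    """Per-thread CPI: per-core CPI multiplied by the active threads."""
    return active_threads * best_cpi_per_core(active_threads)


for threads in range(1, 5):
    print(threads, best_cpi_per_core(threads), best_cpi_per_thread(threads))
# 1 1.0 1.0
# 2 0.5 1.0
# 3 0.5 1.5
# 4 0.5 2.0
```

Read this way, a per-thread CPI of 2.0 with four active threads is already the best case, which is why a single CPI target does not carry over between architectures.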
The Top-Down Characterization
What is it?
The Top-Down Characterization is:
• A new way to organize and use processor events to
identify the real hardware bottlenecks in
systems/applications
• Based on PMU events specifically designed for this task
• Integrated into Intel® VTune™ Amplifier XE for Core
• Available on Intel® Microarchitecture code named Sandy
Bridge and newer
The Top-Down Characterization
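Each pipeline slot on each cycle is classified into 1 of 4 categories:
Front-End Bound, Back-End Bound, Retiring, or Bad Speculation.
[Figure: classification flowchart applied to each slot on each cycle]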
The Top-Down Characterization
• The four category values sum to 1.0
• Unit is “Percentage of total Pipeline Slots”
• This is the core of the new Top-Down
characterization
• Each category is further broken down depending on
available events
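As a rough illustration of how the four categories partition the pipeline slots, the sketch below labels each slot and then reports each category as a fraction of all slots. The decision rule in `classify_slot` follows the commonly published Top-Down flowchart and is an assumption here, since the slide's own figure did not survive extraction. The boolean inputs and sample data are hypothetical; the real tool derives these values from PMU events rather than per-slot flags.

```python
from collections import Counter

CATEGORIES = ("Front-End Bound", "Back-End Bound", "Retiring", "Bad Speculation")


def classify_slot(uop_allocated: bool, uop_retires: bool, backend_stalled: bool) -> str:
    """Assumed flowchart: allocated slots either retire or are wasted;
    empty slots are blamed on the back-end if it was stalled, else the front-end."""
    if uop_allocated:
        return "Retiring" if uop_retires else "Bad Speculation"
    return "Back-End Bound" if backend_stalled else "Front-End Bound"


def category_fractions(slot_labels):
    """Fraction of total pipeline slots in each category (fractions sum to 1.0)."""
    labels = list(slot_labels)
    if not labels:
        raise ValueError("need at least one slot")
    counts = Counter(labels)
    return {name: counts[name] / len(labels) for name in CATEGORIES}


# Hypothetical data: 8 slots, i.e. 2 cycles x 4 slots per cycle.
slots = (
    [classify_slot(True, True, False)] * 4      # u-ops that retire
    + [classify_slot(True, False, False)] * 1   # u-op on a mispredicted path
    + [classify_slot(False, False, True)] * 2   # empty, back-end stalled
    + [classify_slot(False, False, False)] * 1  # empty, front-end did not deliver
)
fractions = category_fractions(slots)
print(fractions)
print(sum(fractions.values()))  # 1.0
```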
Issues breakdown
Front-End Bound
  Latency: ITLB Overhead, ICache Misses, DSB Switches, Branch Resteers
  Bandwidth: DSB, MITE
Back-End Bound
  Memory Bound: L1, L2, L3, DRAM (Local or Remote), Store Bound
  Core Bound: DIV Active, Port Utilization (0 .. 3 ports)
Retiring
  General, Microcode Sequencer
Bad Speculation
  Branch Mispredict, Machine Clears
Examples of Metrics (Xeon™ Phi)
Problem Area: L1 Cache Usage
• Significantly affects data access latency and therefore application performance
• Tuning Suggestions:
  • Software prefetching
  • Tile/block data access for cache size
  • Use streaming stores
  • If using 4K access stride, may be experiencing conflict misses
  • Examine Compiler prefetching (Compiler-generated L1 prefetches should not miss)

Metric      | Formula                                                 | Investigate if
L1 Misses   | DATA_READ_MISS_OR_WRITE_MISS + L1_DATA_HIT_INFLIGHT_PF1 |
L1 Hit Rate | (DATA_READ_OR_WRITE - L1 Misses) / DATA_READ_OR_WRITE   | < 95%
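The L1 metrics can be computed directly from the raw event counts. The following is a minimal sketch: the event names come from the table, while the function names and the sample counts are hypothetical.

```python
L1_HIT_RATE_THRESHOLD = 0.95  # from the slide: investigate if hit rate < 95%


def l1_misses(data_read_miss_or_write_miss: int, l1_data_hit_inflight_pf1: int) -> int:
    """L1 Misses = DATA_READ_MISS_OR_WRITE_MISS + L1_DATA_HIT_INFLIGHT_PF1."""
    return data_read_miss_or_write_miss + l1_data_hit_inflight_pf1


def l1_hit_rate(data_read_or_write: int, misses: int) -> float:
    """L1 Hit Rate = (DATA_READ_OR_WRITE - L1 Misses) / DATA_READ_OR_WRITE."""
    if data_read_or_write <= 0:
        raise ValueError("DATA_READ_OR_WRITE must be positive")
    return (data_read_or_write - misses) / data_read_or_write


# Hypothetical event counts, for illustration only.
misses = l1_misses(data_read_miss_or_write_miss=60_000, l1_data_hit_inflight_pf1=20_000)
rate = l1_hit_rate(data_read_or_write=1_000_000, misses=misses)
needs_investigation = rate < L1_HIT_RATE_THRESHOLD
print(misses, rate, needs_investigation)
```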
Problem Area: Data Access Latency
• Significantly affects application performance
• Tuning Suggestions:
  • Software prefetching
  • Tile/block data access for cache size
  • Use streaming stores
  • Check cache locality: turn off prefetching and use CACHE_FILL events; reduce sharing if needed/possible
  • If using 64K access stride, may be experiencing conflict misses

Metric                   | Formula                                                                                    | Investigate if
Estimated Latency Impact | (CPU_CLK_UNHALTED - EXEC_STAGE_CYCLES - DATA_READ_OR_WRITE) / DATA_READ_OR_WRITE_MISS | > 145
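A small sketch of the Estimated Latency Impact calculation follows. The event names and the 145 threshold are taken from the table; the function name and the sample counts are hypothetical.

```python
LATENCY_IMPACT_THRESHOLD = 145  # from the slide: investigate if > 145


def estimated_latency_impact(
    cpu_clk_unhalted: int,
    exec_stage_cycles: int,
    data_read_or_write: int,
    data_read_or_write_miss: int,
) -> float:
    """(CPU_CLK_UNHALTED - EXEC_STAGE_CYCLES - DATA_READ_OR_WRITE)
    / DATA_READ_OR_WRITE_MISS, i.e. approximate cycles spent per L1 miss."""
    if data_read_or_write_miss <= 0:
        raise ValueError("DATA_READ_OR_WRITE_MISS must be positive")
    stalled = cpu_clk_unhalted - exec_stage_cycles - data_read_or_write
    return stalled / data_read_or_write_miss


# Hypothetical event counts, for illustration only.
impact = estimated_latency_impact(
    cpu_clk_unhalted=10_000_000,
    exec_stage_cycles=3_000_000,
    data_read_or_write=1_000_000,
    data_read_or_write_miss=30_000,
)
print(impact, impact > LATENCY_IMPACT_THRESHOLD)  # 200.0 True
```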
Problem Area: TLB Usage
• Also affects data access latency and therefore application performance
• Tuning Suggestions:
  • Improve cache usage & data access latency
  • If L1 TLB miss/L2 TLB miss is high, try using large pages
  • For loops with multiple streams, try splitting into multiple loops
  • If data access stride is a large power of 2, consider padding between arrays by one 4 KB page

Metric                         | Formula                                  | Investigate if
L1 TLB miss ratio              | DATA_PAGE_WALK / DATA_READ_OR_WRITE      | > 1%
L2 TLB miss ratio              | LONG_DATA_PAGE_WALK / DATA_READ_OR_WRITE | > .1%
L1 TLB misses per L2 TLB miss  | DATA_PAGE_WALK / LONG_DATA_PAGE_WALK     | > 100x
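The three TLB ratios and their thresholds can be checked together. This is a sketch under the assumption that the counts come from a single collection run; the event names and thresholds are from the table, while the function names and sample counts are hypothetical.

```python
import math


def tlb_metrics(data_read_or_write: int, data_page_walk: int, long_data_page_walk: int) -> dict:
    """Compute the three TLB ratios from raw event counts."""
    if data_read_or_write <= 0:
        raise ValueError("DATA_READ_OR_WRITE must be positive")
    if long_data_page_walk:
        l1_per_l2 = data_page_walk / long_data_page_walk
    else:
        # No L2 TLB misses at all: the ratio is unbounded if any L1 miss occurred.
        l1_per_l2 = math.inf if data_page_walk else 0.0
    return {
        "l1_tlb_miss_ratio": data_page_walk / data_read_or_write,
        "l2_tlb_miss_ratio": long_data_page_walk / data_read_or_write,
        "l1_misses_per_l2_miss": l1_per_l2,
    }


def tlb_flags(metrics: dict) -> dict:
    """True where a metric crosses the slide's 'investigate if' threshold."""
    return {
        "l1_tlb_miss_ratio": metrics["l1_tlb_miss_ratio"] > 0.01,           # > 1%
        "l2_tlb_miss_ratio": metrics["l2_tlb_miss_ratio"] > 0.001,          # > 0.1%
        "l1_misses_per_l2_miss": metrics["l1_misses_per_l2_miss"] > 100,    # > 100x
    }


# Hypothetical event counts, for illustration only.
metrics = tlb_metrics(data_read_or_write=1_000_000, data_page_walk=20_000, long_data_page_walk=100)
flags = tlb_flags(metrics)
print(metrics)
print(flags)
```

In this made-up run the L1 TLB miss ratio and the L1-per-L2 ratio are flagged while the L2 ratio is not, which matches the case where the slide suggests trying large pages.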
Problem Area: VPU Usage
• Indicates whether an application is vectorized successfully and efficiently
• Tuning Suggestions:
  • Use the Compiler vectorization report!
  • For data dependencies preventing vectorization, try using Intel® Cilk™ Plus #pragma simd (if safe!)
  • Align data and tell the Compiler!
  • Re-structure code if possible: Array notations, AOS->SOA

Metric                  | Formula                                          | Investigate if
Vectorization Intensity | VPU_ELEMENTS_ACTIVE / VPU_INSTRUCTIONS_EXECUTED  | < 8 (DP), < 16 (SP)
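Vectorization Intensity is the average number of active vector elements per VPU instruction, so the check is a single division plus a precision-dependent threshold. A minimal sketch, with event names and thresholds from the table and hypothetical function names and counts:

```python
# From the slide: investigate if below 8 (double precision) or 16 (single precision).
THRESHOLDS = {"DP": 8, "SP": 16}


def vectorization_intensity(vpu_elements_active: int, vpu_instructions_executed: int) -> float:
    """VPU_ELEMENTS_ACTIVE / VPU_INSTRUCTIONS_EXECUTED."""
    if vpu_instructions_executed <= 0:
        raise ValueError("VPU_INSTRUCTIONS_EXECUTED must be positive")
    return vpu_elements_active / vpu_instructions_executed


def should_investigate(intensity: float, precision: str) -> bool:
    """True when the intensity falls below the threshold for 'DP' or 'SP'."""
    return intensity < THRESHOLDS[precision]


# Hypothetical event counts, for illustration only.
intensity = vectorization_intensity(vpu_elements_active=6_000_000,
                                    vpu_instructions_executed=1_000_000)
print(intensity, should_investigate(intensity, "DP"))  # 6.0 True
```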
Problem Area: Memory Bandwidth
• Can increase data latency in the system or become a performance bottleneck
• Tuning Suggestions:
  • Improve locality in caches
  • Use streaming stores
  • Improve software prefetching

Metric           | Formula                                                                                                                 | Investigate if
Memory Bandwidth | (UNC_F_CH0_NORMAL_READ + UNC_F_CH0_NORMAL_WRITE + UNC_F_CH1_NORMAL_READ + UNC_F_CH1_NORMAL_WRITE) * 64 / time | < 80 GB/sec (practical peak 140 GB/sec with 8 memory controllers)
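The bandwidth formula can be sketched as follows. The event names, the factor 64, and the 80 GB/sec threshold come from the table. Treating 64 as the number of bytes moved per counted read or write, measuring time in seconds, and using 1 GB = 10^9 bytes are assumptions of this sketch, as are the function name and the sample counts.

```python
BYTES_PER_ACCESS = 64        # factor from the slide's formula (assumed: bytes per access)
INVESTIGATE_BELOW_GB_S = 80  # from the slide; practical peak is quoted as 140 GB/sec


def memory_bandwidth_gb_per_s(
    ch0_normal_read: int,
    ch0_normal_write: int,
    ch1_normal_read: int,
    ch1_normal_write: int,
    seconds: float,
) -> float:
    """(CH0 reads + CH0 writes + CH1 reads + CH1 writes) * 64 / time, in GB/sec."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    accesses = ch0_normal_read + ch0_normal_write + ch1_normal_read + ch1_normal_write
    return accesses * BYTES_PER_ACCESS / seconds / 1e9


# Hypothetical event counts over a 1-second window, for illustration only.
bandwidth = memory_bandwidth_gb_per_s(
    ch0_normal_read=250_000_000,
    ch0_normal_write=250_000_000,
    ch1_normal_read=250_000_000,
    ch1_normal_write=250_000_000,
    seconds=1.0,
)
print(bandwidth, bandwidth < INVESTIGATE_BELOW_GB_S)  # 64.0 True
```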
VTune™ Amplifier XE
DEMO
Running the General Exploration Collector
1. Click “New Analysis” button
2. Select “General Exploration” for your CPU architecture
3. Click “Start” to begin profiling
General Exploration Summary
VTune™ Amplifier XE visualizes performance
VTune™ Amplifier XE visualizes performance
35
Instructions Navigator New Open PropertiesInstructions Navigator New Open PropertiesInstructions Navigator New Open PropertiesInstructions Navigator New Open Properties New Open CompareNew Open CompareNew Open CompareNew Open Compare
ProjectProjectProjectProject ResultResultResultResult
ToolbarToolbarToolbarToolbar
Copyright©Copyright©Copyright©Copyright© 2013,2013,2013,2013, Intel Corporation. All rights reserved.Intel Corporation. All rights reserved.Intel Corporation. All rights reserved.Intel Corporation. All rights reserved.
*Other brands and names are the property of their respective owners.*Other brands and names are the property of their respective owners.*Other brands and names are the property of their respective owners.*Other brands and names are the property of their respective owners.
VTune™ Amplifier XE visualizes performance
36
ProjectProjectProjectProject
NavigatorNavigatorNavigatorNavigator
Copyright©Copyright©Copyright©Copyright© 2013,2013,2013,2013, Intel Corporation. All rights reserved.Intel Corporation. All rights reserved.Intel Corporation. All rights reserved.Intel Corporation. All rights reserved.
*Other brands and names are the property of their respective owners.*Other brands and names are the property of their respective owners.*Other brands and names are the property of their respective owners.*Other brands and names are the property of their respective owners.
VTune™ Amplifier XE visualizes performance
37
Result DisplayResult DisplayResult DisplayResult Display
TabsTabsTabsTabs
Copyright©Copyright©Copyright©Copyright© 2013,2013,2013,2013, Intel Corporation. All rights reserved.Intel Corporation. All rights reserved.Intel Corporation. All rights reserved.Intel Corporation. All rights reserved.
*Other brands and names are the property of their respective owners.*Other brands and names are the property of their respective owners.*Other brands and names are the property of their respective owners.*Other brands and names are the property of their respective owners.
VTune™ Amplifier XE visualizes performance
38
Result AnalysisResult AnalysisResult AnalysisResult Analysis
TypeTypeTypeType
Copyright©Copyright©Copyright©Copyright© 2013,2013,2013,2013, Intel Corporation. All rights reserved.Intel Corporation. All rights reserved.Intel Corporation. All rights reserved.Intel Corporation. All rights reserved.
*Other brands and names are the property of their respective owners.*Other brands and names are the property of their respective owners.*Other brands and names are the property of their respective owners.*Other brands and names are the property of their respective owners.
VTune™ Amplifier XE visualizes performance
39
Result ViewpointResult ViewpointResult ViewpointResult Viewpoint
Copyright©Copyright©Copyright©Copyright© 2013,2013,2013,2013, Intel Corporation. All rights reserved.Intel Corporation. All rights reserved.Intel Corporation. All rights reserved.Intel Corporation. All rights reserved.
*Other brands and names are the property of their respective owners.*Other brands and names are the property of their respective owners.*Other brands and names are the property of their respective owners.*Other brands and names are the property of their respective owners.
VTune™ Amplifier XE visualizes performance
40
ViewpointViewpointViewpointViewpoint
AlternatesAlternatesAlternatesAlternates
Copyright©Copyright©Copyright©Copyright© 2013,2013,2013,2013, Intel Corporation. All rights reserved.Intel Corporation. All rights reserved.Intel Corporation. All rights reserved.Intel Corporation. All rights reserved.
*Other brands and names are the property of their respective owners.*Other brands and names are the property of their respective owners.*Other brands and names are the property of their respective owners.*Other brands and names are the property of their respective owners.
VTune™ Amplifier XE visualizes performance
41
ResultResultResultResult ComponentsComponentsComponentsComponents
Copyright©Copyright©Copyright©Copyright© 2013,2013,2013,2013, Intel Corporation. All rights reserved.Intel Corporation. All rights reserved.Intel Corporation. All rights reserved.Intel Corporation. All rights reserved.
*Other brands and names are the property of their respective owners.*Other brands and names are the property of their respective owners.*Other brands and names are the property of their respective owners.*Other brands and names are the property of their respective owners.
VTune™ Amplifier XE visualizes performance
42
GridGridGridGrid PanePanePanePane
Copyright©Copyright©Copyright©Copyright© 2013,2013,2013,2013, Intel Corporation. All rights reserved.Intel Corporation. All rights reserved.Intel Corporation. All rights reserved.Intel Corporation. All rights reserved.
*Other brands and names are the property of their respective owners.*Other brands and names are the property of their respective owners.*Other brands and names are the property of their respective owners.*Other brands and names are the property of their respective owners.
VTune™ Amplifier XE visualizes performance
43
GridGridGridGrid PanePanePanePane
Grouping pullGrouping pullGrouping pullGrouping pull----downdowndowndown
Copyright©Copyright©Copyright©Copyright© 2013,2013,2013,2013, Intel Corporation. All rights reserved.Intel Corporation. All rights reserved.Intel Corporation. All rights reserved.Intel Corporation. All rights reserved.
*Other brands and names are the property of their respective owners.*Other brands and names are the property of their respective owners.*Other brands and names are the property of their respective owners.*Other brands and names are the property of their respective owners.
VTune™ Amplifier XE visualizes performance
44
StackStackStackStack
PanePanePanePane
Copyright©Copyright©Copyright©Copyright© 2013,2013,2013,2013, Intel Corporation. All rights reserved.Intel Corporation. All rights reserved.Intel Corporation. All rights reserved.Intel Corporation. All rights reserved.
*Other brands and names are the property of their respective owners.*Other brands and names are the property of their respective owners.*Other brands and names are the property of their respective owners.*Other brands and names are the property of their respective owners.
VTune™ Amplifier XE visualizes performance
45
TimelineTimelineTimelineTimeline
Copyright©Copyright©Copyright©Copyright© 2013,2013,2013,2013, Intel Corporation. All rights reserved.Intel Corporation. All rights reserved.Intel Corporation. All rights reserved.Intel Corporation. All rights reserved.
*Other brands and names are the property of their respective owners.*Other brands and names are the property of their respective owners.*Other brands and names are the property of their respective owners.*Other brands and names are the property of their respective owners.
VTune™ Amplifier XE visualizes performance
46
Filter/OptionsFilter/OptionsFilter/OptionsFilter/Options
BarBarBarBar
Copyright©Copyright©Copyright©Copyright© 2013,2013,2013,2013, Intel Corporation. All rights reserved.Intel Corporation. All rights reserved.Intel Corporation. All rights reserved.Intel Corporation. All rights reserved.
*Other brands and names are the property of their respective owners.*Other brands and names are the property of their respective owners.*Other brands and names are the property of their respective owners.*Other brands and names are the property of their respective owners.
VTune™ Amplifier XE visualizes performance
Intel Confidential47
5/30/20
14
Source View /Source View /Source View /Source View /
Per line localizationPer line localizationPer line localizationPer line localization
Copyright©Copyright©Copyright©Copyright© 2013,2013,2013,2013, Intel Corporation. All rights reserved.Intel Corporation. All rights reserved.Intel Corporation. All rights reserved.Intel Corporation. All rights reserved.
*Other brands and names are the property of their respective owners.*Other brands and names are the property of their respective owners.*Other brands and names are the property of their respective owners.*Other brands and names are the property of their respective owners.
VTune™ Amplifier XE visualizes performance
Intel Confidential48
5/30/20
14
Source View /Source View /Source View /Source View /
View / Hot spotView / Hot spotView / Hot spotView / Hot spot
Navigation controlsNavigation controlsNavigation controlsNavigation controls
Copyright©Copyright©Copyright©Copyright© 2013,2013,2013,2013, Intel Corporation. All rights reserved.Intel Corporation. All rights reserved.Intel Corporation. All rights reserved.Intel Corporation. All rights reserved.
*Other brands and names are the property of their respective owners.*Other brands and names are the property of their respective owners.*Other brands and names are the property of their respective owners.*Other brands and names are the property of their respective owners.
VTune™ Amplifier XE visualizes performance
Intel Confidential49
5/30/20
14
Assembly View /Assembly View /Assembly View /Assembly View /
View / Hot spotView / Hot spotView / Hot spotView / Hot spot
Navigation controlsNavigation controlsNavigation controlsNavigation controls
Copyright©Copyright©Copyright©Copyright© 2013,2013,2013,2013, Intel Corporation. All rights reserved.Intel Corporation. All rights reserved.Intel Corporation. All rights reserved.Intel Corporation. All rights reserved.
*Other brands and names are the property of their respective owners.*Other brands and names are the property of their respective owners.*Other brands and names are the property of their respective owners.*Other brands and names are the property of their respective owners.
VTune™ Amplifier XE visualizes performance
Intel Confidential50
5/30/20
14
Assembly View /Assembly View /Assembly View /Assembly View /
AssemblyAssemblyAssemblyAssembly
groupingsgroupingsgroupingsgroupings
Copyright©Copyright©Copyright©Copyright© 2013,2013,2013,2013, Intel Corporation. All rights reserved.Intel Corporation. All rights reserved.Intel Corporation. All rights reserved.Intel Corporation. All rights reserved.
*Other brands and names are the property of their respective owners.*Other brands and names are the property of their respective owners.*Other brands and names are the property of their respective owners.*Other brands and names are the property of their respective owners.
For event collection, the coprocessor is treated as a special HW architecture
Project properties provides the means to invoke data collection by target type
Launch Application serves many uses, from host/offload to native execution
Search directories have been reorganized to speed symbol resolution during finalization
Notable coprocessor library paths:
/opt/mpss/3.2/sysroots/k1om-mpss-Linux/boot
/opt/mpss/3.2/sysroots/k1om-mpss-Linux/lib64
/opt/intel/composerxe/lib/mic
/opt/intel/composerxe/tbb/lib/mic
/opt/intel/composerxe/mkl/lib/mic
/opt/intel/mpi-rt/4.1.3/mic
General Exploration runs a set of events to drive top-down analysis
For more information on Intel® Xeon Phi™ and VTune™ Amplifier XE
Optimization on the coprocessor:
http://software.intel.com/en-us/articles/optimization-and-performance-tuning-for-intel-xeon-phi-coprocessors-part-1-optimization
http://software.intel.com/en-us/articles/optimization-and-performance-tuning-for-intel-xeon-phi-coprocessors-part-2-understanding
Coprocessor Performance Monitoring Unit:
http://software.intel.com/sites/default/files/forum/278102/intelr-xeon-phitm-pmu-rev1.01.pdf
For general information: http://software.intel.com/mic-developer
Grid is Based on Top-Down
Use the Hover Text to Understand Metrics*
*Suggestions welcome: Submit issues if the text isn’t helpful
Event collections on the coprocessor can generate volumes of data
dgemm: on 60+ cores
Tip: Use cpu-mask to reduce the data set while maintaining the same accuracy.
Resources
Top-Down Characterization White Paper
http://software.intel.com/en-us/articles/how-to-tune-applications-using-a-top-down-characterization-of-microarchitectural-issues
Tuning Guides
http://software.intel.com/en-us/articles/processor-specific-performance-analysis-papers
Methods and practices to analyze the performance of your application with Intel® VTune™ Amplifier XE


  • 1.
    Methods and practices to analyze the performance of your application with Intel® VTune™ Amplifier XE. Leo Borges, Intel Software Conference 2014 Brazil, May 2014
  • 3.
    Agenda
    • Intel® VTune™ Amplifier XE Intro
    • Microarchitecture Review
    • The Top-Down Characterization details
    • Intel® VTune™ Amplifier XE Implementation
    • Demo
    Sources for current presentation: http://software.intel.com/en-us/articles/advanced-profiling-with-intel-vtune-amplifier-xe-part-1-find-the-bottleneck
  • 4.
    Two Ways to Collect Data - Intel® VTune™ Amplifier XE
    Software Collector (Hotspots, Concurrency, Locks & Waits)
    • Uses OS interrupts
    • Collects from a single process tree
    • ~10ms default resolution
    • Collect on both Intel® and compatible processors
    • Call stacks show calling sequence
    • Works in virtual environments
    • No driver required
    Hardware Collector (Lightweight Hotspots, Advanced Analysis)
    • Uses the on chip Performance Monitoring Unit (PMU)
    • Collect system wide or from a single process tree
    • ~1ms default resolution (finer granularity - finds small functions)
    • Requires a genuine Intel® processor for collection
    • New! Optionally collect call stacks
    • Works in virtual environments only when supported by the VM (e.g., vSphere* 5.1)
    • Requires a driver
    No special recompiles - C, C++, C#, Fortran, Java, Assembly
  • 6.
    Microarchitecture basics
    Pipeline stages: Fetch, Decode, Execute, Retire
    • Classic 4-stage pipeline depicted here.
    • Memory not shown.
    • Pipeline on current processors capable of speculative and out of order execution.
  • 7.
    Intuitive approach to EBS (event-based sampling)
    • Use a small list of metrics to monitor level of optimization
    • Example 1: Cycles per instruction (CPI)
    • Example 2: Instruction retirement ratio
    With m instructions issued and n retired:
    Retirement ratio = n/m
    % executed but not retired = (1 - n/m)*100
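The two retirement formulas on this slide can be checked with a few lines of Python. This is only a minimal sketch: the function name `retirement_stats` and the sample counts are hypothetical and are not taken from VTune™ Amplifier XE or from the slides.

```python
from typing import Tuple


def retirement_stats(issued: int, retired: int) -> Tuple[float, float]:
    """Return (retirement ratio n/m, percent executed but not retired)."""
    if issued <= 0:
        raise ValueError("issued (m) must be positive")
    if not 0 <= retired <= issued:
        raise ValueError("retired (n) must be between 0 and issued (m)")
    ratio = retired / issued             # n/m
    not_retired_pct = (1 - ratio) * 100  # (1 - n/m) * 100
    return ratio, not_retired_pct


# Hypothetical counts, for illustration only.
ratio, not_retired_pct = retirement_stats(issued=1_000_000, retired=800_000)
print(f"retirement ratio = {ratio:.2f}, not retired = {not_retired_pct:.1f}%")
```

A ratio close to 1 means little issued work was discarded; a low ratio suggests that much of the work done never retired.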
  • 8.
    Microarchitecture Review
    Pipeline stages: Fetch, Decode, Execute, Memory, Commit
    The traditional 5-stage pipeline. Pipeline on current processors capable of out of order execution.
  • 10.
    Microarchitecture Review
    Pipeline stages: Fetch, Decode, Execute, Memory, Commit
    Diagram labels: Front-End
    The front-end fetches instructions IN ORDER, decodes them into u-ops (micro-operations), and sends the u-ops to the back-end.
  • 11.
    Microarchitecture Review
    Pipeline stages: Fetch, Decode, Execute, Memory, Commit
    Diagram labels: Front-End, Back-End
    The back-end receives u-ops, executes them OUT OF ORDER, accesses memory as needed, and commits results to memory IN ORDER.
  • 12.
    Microarchitecture Review
    Pipeline stages: Fetch, Decode, Execute, Memory, Commit
    Diagram labels: Front-End, Back-End, Allocation
    Allocation is the point where u-ops transfer from the front-end to the back-end. The front-end can allocate 4 u-ops per cycle.
  • 13.
    Microarchitecture Review
    Pipeline stages: Fetch, Decode, Execute, Memory, Commit
    Diagram labels: Front-End, Back-End, Allocation, Retirement
    Retirement is the point where u-ops leave the back-end. The back-end can retire 4 u-ops per cycle.
  • 14.
    And a New Term: the Pipeline Slot
    Pipeline stages: Fetch, Decode, Execute, Memory, Commit
    Diagram labels: Front-End, Back-End, 4 Potential Allocations per Cycle, 4 Potential Retirements per Cycle
    In reality, there are many queues, buffers, and pieces of logic throughout the pipeline to allow up to 4 allocations and 4 retirements per cycle.
  • 15.
    And a New Term: the Pipeline Slot
    Pipeline stages: Fetch, Decode, Execute, Memory, Commit
    Diagram labels: Front-End, Back-End, 4 Potential Allocations per Cycle, 4 Potential Retirements per Cycle
    The “Pipeline Slot” is an abstraction representing all the resources needed to move one u-op through the pipeline.
  • 16.
    And a New Term: the Pipeline Slot
    [Diagram: pipeline stages Fetch, Decode, Execute, Memory, Commit, labeled Front-End and Back-End, with slots S1 S2 S3 S4]
    There are 4 Pipeline Slots available every cycle.
  • 17.
    And a New Term: the Pipeline Slot
    [Diagram: pipeline stages Fetch, Decode, Execute, Memory, Commit, labeled Front-End and Back-End, with slots S1 S2 S3 S4 repeated along the pipeline]
    Pipeline slots are filled with u-ops that travel from allocation to retirement over multiple cycles.
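As a rough illustration of the slot abstraction, the sketch below counts the slots available over a run of cycles, assuming the 4 slots per cycle stated on the slides. The function names and the sample numbers are hypothetical and are not part of any VTune interface.

```python
# Minimal sketch: pipeline slots as a budget of 4 per clock cycle.
# SLOTS_PER_CYCLE comes from the slides; everything else is illustrative.
SLOTS_PER_CYCLE = 4


def total_pipeline_slots(cycles: int) -> int:
    """Number of pipeline slots available over `cycles` clock cycles."""
    return SLOTS_PER_CYCLE * cycles


def slot_utilization(uops_retired: int, cycles: int) -> float:
    """Fraction of the available slots that held a u-op which retired."""
    return uops_retired / total_pipeline_slots(cycles)


# Hypothetical run: 1000 cycles in which 1500 u-ops retired.
print(total_pipeline_slots(1000))
print(slot_utilization(1500, 1000))
```

A utilization well below 1.0 means many slots went unused or were wasted, which is the quantity the Top-Down categories later break down.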
  • 18.
    Cycles Per Instruction (CPI), a standard measure, has some special kinks
    For multi-core processors, CPI can get as low as 0.25 cycles per instruction with current Intel processors. Normally, a CPI below ~1.0 is targeted for better performance. Some would suggest targeting a CPI of around 0.75 to 0.50. But is this correct for every architecture?
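For reference, CPI is simply elapsed cycles divided by retired instructions. The sketch below shows where the 0.25 floor comes from under the assumption of 4 retirements per cycle; the function name and counts are illustrative only.

```python
# Minimal sketch: CPI as cycles divided by retired instructions.
def cpi(cycles: int, instructions_retired: int) -> float:
    """Cycles per instruction for a measured interval."""
    if instructions_retired <= 0:
        raise ValueError("instructions_retired must be positive")
    return cycles / instructions_retired


# Hypothetical best case: 4 instructions retire on every one of 1000 cycles.
print(cpi(1000, 4000))
# A less efficient run: one instruction per cycle on average.
print(cpi(1000, 1000))
```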
  • 19.
    Cycles Per Instruction (CPI), a standard measure, has some special kinks
    • Threads on each Intel® Xeon Phi™ core share a clock
    • If all 4 HW threads are active, each gets ¼ of the total cycles
    • Multi-stage instruction decode requires two threads to utilize the whole core; one thread only gets half
    • With two ops per cycle (U-V-pipe dual issue), the best CPI per core is as tabulated
    • To get thread CPI, multiply by the active threads

    Threads per Core   Best CPI per Core   Best CPI per Thread
    1                  1.0                 1 x 1.0 = 1.0
    2                  0.5                 2 x 0.5 = 1.0
    3                  0.5                 3 x 0.5 = 1.5
    4                  0.5                 4 x 0.5 = 2.0
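The per-thread figures on this slide can be reproduced with a few lines. This is a sketch of the arithmetic only, assuming the best per-core CPI values given on the slide (1.0 with one thread, 0.5 with two or more); the function names are hypothetical.

```python
# Minimal sketch of the slide's arithmetic for Intel Xeon Phi cores.
def best_cpi_per_core(active_threads: int) -> float:
    """Best achievable CPI per core: one thread can use only half the core."""
    if not 1 <= active_threads <= 4:
        raise ValueError("a core runs between 1 and 4 hardware threads")
    return 1.0 if active_threads == 1 else 0.5


def best_cpi_per_thread(active_threads: int) -> float:
    """Thread CPI is the core CPI multiplied by the number of active threads."""
    return active_threads * best_cpi_per_core(active_threads)


for threads in range(1, 5):
    print(threads, best_cpi_per_core(threads), best_cpi_per_thread(threads))
```

The point of the exercise is that a thread CPI of 2.0 with four active threads is already the best case, so a fixed "CPI below 1.0" target does not transfer to this architecture.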
  • 20.
    The Top-Down Characterization
    What is it? The Top-Down Characterization is:
    • A new way to organize and use processor events to identify the real hardware bottlenecks in systems/applications
    • Based on PMU events specifically designed for this task
    • Integrated into Intel® VTune™ Amplifier XE for Core
    • Available on Intel® microarchitecture code named Sandy Bridge and newer
  • 21.
    The Top-Down Characterization
    Each pipeline slot on each cycle is classified into 1 of 4 categories. For each slot on each cycle:
  • 22.
    The Top-Down Characterization
    • The four category values sum to 1.0
    • Unit is “Percentage of total Pipeline Slots”
    • This is the core of the new Top-Down characterization
    • Each category is further broken down depending on available events
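The bullets above can be sketched as a normalization step: given how many slots fell into each category, divide by the total so the four fractions sum to 1.0. The slot counts below are made up, and the function is a hypothetical helper rather than the way VTune derives these values from PMU events.

```python
# Minimal sketch: turn per-category slot counts into fractions of all slots.
import math
from typing import Dict


def top_down_fractions(slot_counts: Dict[str, int]) -> Dict[str, float]:
    """Express each category as a fraction of the total pipeline slots."""
    total = sum(slot_counts.values())
    if total <= 0:
        raise ValueError("at least one slot must be counted")
    return {name: count / total for name, count in slot_counts.items()}


# Hypothetical counts for 1000 cycles (4000 slots in total).
counts = {
    "Front-End Bound": 1000,
    "Back-End Bound": 1500,
    "Retiring": 1000,
    "Bad Speculation": 500,
}
fractions = top_down_fractions(counts)
print(fractions)
print(math.isclose(sum(fractions.values()), 1.0))
```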
  • 23.
    Issues breakdown
    Front-End
        Latency: ITLB Overhead, ICache Misses, DSB Switches, Branch Resteers
        Bandwidth: DSB, MITE
    Back-End
        Memory Bound: L1, L2, L3, DRAM (Local or Remote), Store Bound
        Core Bound: DIV Active, Port Utilization (0 .. 3 ports)
    Retiring: General, Microcode Sequencer
    Bad Speculation: Branch Mispredict, Machine Clears
  • 24.
    Examples of Metrics (Xeon Phi™)
  • 25.
    Problem Area: L1 Cache Usage
    • Significantly affects data access latency and therefore application performance
    • Tuning Suggestions:
        Software prefetching
        Tile/block data access for cache size
        Use streaming stores
        If using 4K access stride, may be experiencing conflict misses
        Examine Compiler prefetching (Compiler-generated L1 prefetches should not miss)

    Metric        Formula                                                   Investigate if
    L1 Misses     DATA_READ_MISS_OR_WRITE_MISS + L1_DATA_HIT_INFLIGHT_PF1
    L1 Hit Rate   (DATA_READ_OR_WRITE - L1 Misses) / DATA_READ_OR_WRITE     < 95%
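The two formulas in the table translate directly into code. The sketch below assumes the event counts have already been collected and are passed in as plain integers; the function names and sample counts are illustrative.

```python
# Minimal sketch of the L1 metrics from the table above.
def l1_misses(data_read_miss_or_write_miss: int, l1_data_hit_inflight_pf1: int) -> int:
    """L1 Misses = DATA_READ_MISS_OR_WRITE_MISS + L1_DATA_HIT_INFLIGHT_PF1."""
    return data_read_miss_or_write_miss + l1_data_hit_inflight_pf1


def l1_hit_rate(data_read_or_write: int, misses: int) -> float:
    """L1 Hit Rate = (DATA_READ_OR_WRITE - L1 Misses) / DATA_READ_OR_WRITE."""
    return (data_read_or_write - misses) / data_read_or_write


def l1_needs_investigation(hit_rate: float) -> bool:
    """The slide suggests investigating when the hit rate is below 95%."""
    return hit_rate < 0.95


# Hypothetical counts.
misses = l1_misses(80, 20)
rate = l1_hit_rate(1000, misses)
print(misses, rate, l1_needs_investigation(rate))
```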
  • 26.
    Problem Area: Data Access Latency
    • Significantly affects application performance
    • Tuning Suggestions:
        Software prefetching
        Tile/block data access for cache size
        Use streaming stores
        Check cache locality: turn off prefetching and use CACHE_FILL events; reduce sharing if needed/possible
        If using 64K access stride, may be experiencing conflict misses

    Metric: Estimated Latency Impact
    Formula: (CPU_CLK_UNHALTED - EXEC_STAGE_CYCLES - DATA_READ_OR_WRITE) / DATA_READ_OR_WRITE_MISS
    Investigate if: > 145
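As a worked instance of the formula above, the following sketch computes the estimated latency impact from four event counts. The counts are invented for illustration, and the helper names are hypothetical.

```python
# Minimal sketch of the Estimated Latency Impact metric.
def estimated_latency_impact(
    cpu_clk_unhalted: int,
    exec_stage_cycles: int,
    data_read_or_write: int,
    data_read_or_write_miss: int,
) -> float:
    """(CPU_CLK_UNHALTED - EXEC_STAGE_CYCLES - DATA_READ_OR_WRITE) / DATA_READ_OR_WRITE_MISS."""
    stalled = cpu_clk_unhalted - exec_stage_cycles - data_read_or_write
    return stalled / data_read_or_write_miss


def latency_needs_investigation(impact: float) -> bool:
    """The slide suggests investigating when the value exceeds 145."""
    return impact > 145


# Hypothetical counts: 300,000 unexplained cycles spread over 2,000 misses.
impact = estimated_latency_impact(1_000_000, 400_000, 300_000, 2_000)
print(impact, latency_needs_investigation(impact))
```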
  • 27.
    Problem Area: TLB Usage
    • Also affects data access latency and therefore application performance
    • Tuning Suggestions:
        Improve cache usage & data access latency
        If L1 TLB miss/L2 TLB miss is high, try using large pages
        For loops with multiple streams, try splitting into multiple loops
        If data access stride is a large power of 2, consider padding between arrays by one 4 KB page

    Metric                         Formula                                     Investigate if
    L1 TLB miss ratio              DATA_PAGE_WALK / DATA_READ_OR_WRITE         > 1%
    L2 TLB miss ratio              LONG_DATA_PAGE_WALK / DATA_READ_OR_WRITE    > 0.1%
    L1 TLB misses per L2 TLB miss  DATA_PAGE_WALK / LONG_DATA_PAGE_WALK        > 100x
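The three TLB metrics and their thresholds can be checked together. The sketch below is one possible way to do that, with made-up event counts; the function and key names are assumptions for illustration.

```python
# Minimal sketch of the TLB metrics and the thresholds from the table above.
from typing import Dict


def tlb_metrics(data_page_walk: int, long_data_page_walk: int, data_read_or_write: int) -> Dict[str, float]:
    """Compute the three ratios from raw event counts."""
    return {
        "l1_tlb_miss_ratio": data_page_walk / data_read_or_write,
        "l2_tlb_miss_ratio": long_data_page_walk / data_read_or_write,
        "l1_misses_per_l2_miss": data_page_walk / long_data_page_walk,
    }


def tlb_flags(metrics: Dict[str, float]) -> Dict[str, bool]:
    """True where a metric crosses the slide's 'investigate if' threshold."""
    return {
        "l1_tlb_miss_ratio": metrics["l1_tlb_miss_ratio"] > 0.01,          # > 1%
        "l2_tlb_miss_ratio": metrics["l2_tlb_miss_ratio"] > 0.001,         # > 0.1%
        "l1_misses_per_l2_miss": metrics["l1_misses_per_l2_miss"] > 100,   # > 100x
    }


# Hypothetical counts.
metrics = tlb_metrics(data_page_walk=20_000, long_data_page_walk=100, data_read_or_write=1_000_000)
print(metrics)
print(tlb_flags(metrics))
```

In this made-up case the L1 TLB ratio and the L1-per-L2 ratio are flagged while the L2 TLB ratio is not, which is the pattern for which the slide suggests trying large pages.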
  • 28.
    Problem Area: VPU Usage
    • Indicates whether an application is vectorized successfully and efficiently
    • Tuning Suggestions:
        Use the Compiler vectorization report!
        For data dependencies preventing vectorization, try using Intel® Cilk™ Plus #pragma SIMD (if safe!)
        Align data and tell the Compiler!
        Re-structure code if possible: Array notations, AOS->SOA

    Metric: Vectorization Intensity
    Formula: VPU_ELEMENTS_ACTIVE / VPU_INSTRUCTIONS_EXECUTED
    Investigate if: < 8 (DP), < 16 (SP)
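A short sketch of the vectorization intensity check follows, using the double-precision and single-precision thresholds from the slide. The counts and function names are illustrative assumptions.

```python
# Minimal sketch of the Vectorization Intensity metric.
THRESHOLDS = {"DP": 8, "SP": 16}  # "investigate if" limits from the slide


def vectorization_intensity(vpu_elements_active: int, vpu_instructions_executed: int) -> float:
    """VPU_ELEMENTS_ACTIVE / VPU_INSTRUCTIONS_EXECUTED."""
    return vpu_elements_active / vpu_instructions_executed


def vpu_needs_investigation(intensity: float, precision: str) -> bool:
    """True when the intensity is below the threshold for 'DP' or 'SP'."""
    return intensity < THRESHOLDS[precision]


# Hypothetical double-precision kernel: 6 active elements per VPU instruction.
intensity = vectorization_intensity(6_000, 1_000)
print(intensity, vpu_needs_investigation(intensity, "DP"))
```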
  • 29.
    Problem Area: Memory Bandwidth
    • Can increase data latency in the system or become a performance bottleneck
    • Tuning Suggestions:
        Improve locality in caches
        Use streaming stores
        Improve software prefetching

    Metric: Memory Bandwidth
    Formula: (UNC_F_CH0_NORMAL_READ + UNC_F_CH0_NORMAL_WRITE + UNC_F_CH1_NORMAL_READ + UNC_F_CH1_NORMAL_WRITE) * 64 / time
    Investigate if: < 80 GB/sec (practical peak 140 GB/sec, with 8 memory controllers)
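The bandwidth formula can be sketched as follows. Two assumptions are made here that the slide does not spell out: the factor 64 is taken to be bytes per counted read or write, and GB is taken as 10^9 bytes. The counts and function names are hypothetical.

```python
# Minimal sketch of the Memory Bandwidth metric.
BYTES_PER_ACCESS = 64  # the slide's "* 64", assumed to mean bytes per counted access


def memory_bandwidth_gb_per_s(
    ch0_reads: int, ch0_writes: int, ch1_reads: int, ch1_writes: int, seconds: float
) -> float:
    """Sum of the four UNC_F_CH*_NORMAL_* counts, times 64, divided by time."""
    total_accesses = ch0_reads + ch0_writes + ch1_reads + ch1_writes
    return total_accesses * BYTES_PER_ACCESS / seconds / 1e9  # assumes 1 GB = 1e9 bytes


def bandwidth_needs_investigation(gb_per_s: float) -> bool:
    """The slide suggests investigating when bandwidth is below 80 GB/sec."""
    return gb_per_s < 80


# Hypothetical one-second sample with 250 million accesses per counter.
bw = memory_bandwidth_gb_per_s(250_000_000, 250_000_000, 250_000_000, 250_000_000, 1.0)
print(bw, bandwidth_needs_investigation(bw))
```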
  • 30.
    VTune™ Amplifier XE
  • 31.
    DEMO
  • 32.
    Running the General Exploration Collector
    1. Click the “New Analysis” button
    2. Select “General Exploration” for your CPU architecture
    3. Click “Start” to begin profiling
  • 33.
    General Exploration Summary
  • 34.
    VTune™ Amplifier XE visualizes performance
  • 35.
    VTune™ Amplifier XE visualizes performance
    [Callout: Toolbar (Instructions, Navigator, New, Open, Properties; New, Open, Compare; Project; Result)]
  • 36.
    VTune™ Amplifier XE visualizes performance
    [Callout: Project Navigator]
  • 37.
    VTune™ Amplifier XE visualizes performance
    [Callout: Result Display Tabs]
  • 38.
    VTune™ Amplifier XE visualizes performance
    [Callout: Result Analysis Type]
  • 39.
    VTune™ Amplifier XE visualizes performance
    [Callout: Result Viewpoint]
  • 40.
    VTune™ Amplifier XE visualizes performance
    [Callout: Viewpoint Alternates]
  • 41.
    VTune™ Amplifier XE visualizes performance
    [Callout: Result Components]
  • 42.
    VTune™ Amplifier XE visualizes performance
    [Callout: Grid Pane]
  • 43.
    VTune™ Amplifier XE visualizes performance
    [Callout: Grid Pane, Grouping pull-down]
  • 44.
    VTune™ Amplifier XE visualizes performance
    [Callout: Stack Pane]
  • 45.
    VTune™ Amplifier XE visualizes performance
    [Callout: Timeline]
  • 46.
    VTune™ Amplifier XE visualizes performance
    [Callout: Filter/Options Bar]
  • 47.
    VTune™ Amplifier XE visualizes performance
    [Callout: Source View / Per line localization]
    Intel Confidential, 5/30/2014
  • 48.
    VTune™ Amplifier XE visualizes performance
    [Callout: Source View / Hot spot navigation controls]
  • 49.
    VTune™ Amplifier XE visualizes performance
    [Callout: Assembly View / Hot spot navigation controls]
  • 50.
    VTune™ Amplifier XE visualizes performance
    [Callout: Assembly View / Assembly groupings]
  • 51.
    For event collection the coprocessor is treated as a special HW architecture
  • 52.
    Project properties provide the means to invoke data collection by target type
  • 53.
    Launch Application serves many uses, from host/offload to native execution
  • 54.
    Search directories have been reorganized to speed symbol resolution during finalization
    Notable coprocessor library paths:
        /opt/mpss/3.2/sysroots/k1om-mpss-Linux/boot
        /opt/mpss/3.2/sysroots/k1om-mpss-Linux/lib64
        /opt/intel/composerxe/lib/mic
        /opt/intel/composerxe/tbb/lib/mic
        /opt/intel/composerxe/mkl/lib/mic
        /opt/intel/mpi-rt/4.1.3/mic
  • 55.
    General Exploration runs a set of events to drive top-down analysis
  • 56.
    For more information on Intel® Xeon Phi™ and VTune™ Amplifier XE
    Optimization on the coprocessor:
        http://software.intel.com/en-us/articles/optimization-and-performance-tuning-for-intel-xeon-phi-coprocessors-part-1-optimization
        http://software.intel.com/en-us/articles/optimization-and-performance-tuning-for-intel-xeon-phi-coprocessors-part-2-understanding
    Coprocessor Performance Monitoring Unit:
        http://software.intel.com/sites/default/files/forum/278102/intelr-xeon-phitm-pmu-rev1.01.pdf
    For general information:
        http://software.intel.com/mic-developer
  • 57.
    Grid is Based on Top-Down
  • 58.
    Use the Hover Text to Understand Metrics*
    *Suggestions welcome: Submit issues if the text isn’t helpful
  • 59.
    Event collections on the coprocessor can generate volumes of data
    dgemm: on 60+ cores
    Tip: Use cpu-mask to reduce the data set while maintaining the same accuracy.
  • 60.
    Resources
    Top-Down Characterization White Paper
        http://software.intel.com/en-us/articles/how-to-tune-applications-using-a-top-down-characterization-of-microarchitectural-issues
    Tuning Guides
        http://software.intel.com/en-us/articles/processor-specific-performance-analysis-papers