Estimating instruction-level throughput (for example, predicting cycle counts) is critical for many applications that rely on tightly calculated and accurate timing bounds. In this talk, we will present a new throughput analysis tool, MCA Daemon (MCAD). It is built on top of LLVM MCA and combines the advantages of both static and dynamic throughput analyses, providing a powerful, fast, and easy-to-use tool that scales up to large real-world programs.
14. Genesis: Assured Micro Patching (AMP)
• A research project initiated by the United States DARPA to assure the correctness
of binary patching with little or no source code
• Including functional and timing aspects
• Focuses on small (micro) binary patches
• Focuses on embedded systems
• UCI was studying the timing impacts of binary patches
• Example: After the firmware on a truck is binary-patched to prevent brakes
from locking up, we need to make sure latencies do not degrade badly
18. Timing impacts of binary patches
Problem definition
[Diagram: a Small Binary Patch transforms the Original Program into a Patched
Program; both the original and patched programs run on the same set of inputs,
and we ask what the timing difference ΔT between them is]
23. Execution time assessment
Interesting use cases
• Predicting program run time in remote environments or time-sensitive
applications
• Examples: firmware in cars or satellites (e.g. the Kepler space telescope by NASA)
• Performance analysis
• Insights into performance bottlenecks
• Examples: Potential CPU pipeline stalling, GPU memory bank conflicts
37. Execution time assessment
Challenges
Static analysis — low precision, fast turnaround:
• Poor handling of branches & function calls
• Small scope (only a few basic blocks)
• Lack of run-time information
• Faster analysis speed (due to coarser granularity)
• Easier integration with other tools
Dynamic analysis — high precision, slow turnaround:
• Complete execution traces
• Higher fidelity on hardware details
• Usually requires non-trivial setup
• Slow simulation speed
Can we get the best of both?
49. Introduction to LLVM MCA
• A tool (llvm-mca) and library (libLLVMMCA) for predicting cycle counts and
potential performance hazards in a sequence of assembly code
• Uses instruction scheduling data (e.g. instruction latency) provided by each
LLVM target
• New ISAs (with proper scheduling info) can be supported out of the box
• Accounts for modern processor features: superscalar, out-of-order execution, etc.
• Implemented via lightweight simulation
• Abstracts the real CPU pipeline into a small handful of stages
52. Introduction to MCA
An example
vmulps %xmm0, %xmm1, %xmm2
vhaddps %xmm2, %xmm2, %xmm3
vhaddps %xmm3, %xmm3, %xmm4
(test/tools/llvm-mca/X86/BtVer2/dot-product.s)
llvm-mca -mtriple=x86_64 -mcpu=btver2 -iterations=300 dot-product.s
Summary
Iterations: 300
Instructions: 900
Total Cycles: 610
Total uOps: 900
Dispatch Width: 2
uOps Per Cycle: 1.48
IPC: 1.48
Block RThroughput: 2.0
59. MCA Daemon (MCAD)
Highlights
• Combines the advantages of dynamic & static throughput analysis
• Augments the analysis region beyond basic blocks
• MCAD is able to analyze the entire program execution trace
• Throughput analysis happens in parallel / on-the-fly with the
target program execution
61. Analyze execution traces using MCA
Using unmodified MCA libraries
[Diagram: QEMU runs the Target Program; the executed instructions are passed
through a Disassembler into the LLVM MCA libraries]
62. Analyze execution traces using MCA
Challenge: Sequential workflow
[Diagram: same pipeline, but the MCA analysis is blocked until QEMU is finished
emitting the executed instructions]
81. Incremental SourceMgr
Implemented with threads: Pros & Cons
[Diagram: QEMU (Process 1) streams the execution trace to a receiver thread;
mca::IncrementalSourceMgr feeds MCInst / mca::Instruction objects into the MCA
simulation pipeline (Stage 0 … Stage N) running on a separate simulation thread]
Pros: No modification to the simulation pipeline
Cons: To use IncrementalSourceMgr, you have to use threads
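The threaded design on this slide is a classic producer/consumer pair: a receiver thread pushes incoming instructions into a thread-safe feed while the simulation thread drains it concurrently. Below is a minimal Python sketch under assumed names — the `IncrementalSourceMgr` class merely stands in for `mca::IncrementalSourceMgr`, and strings stand in for translated instructions; it is not MCAD's actual implementation:

```python
import queue
import threading

STOP = object()  # sentinel marking the end of the execution trace

class IncrementalSourceMgr:
    """Stand-in for mca::IncrementalSourceMgr: a thread-safe instruction feed."""
    def __init__(self):
        self._q = queue.Queue()

    def add_inst(self, inst):          # called by the receiver thread
        self._q.put(inst)

    def end_of_stream(self):
        self._q.put(STOP)

    def __iter__(self):                # drained by the simulation thread
        while (inst := self._q.get()) is not STOP:
            yield inst

def receiver(src, trace):
    """Plays the role of the thread receiving executed instructions from QEMU."""
    for inst in trace:
        src.add_inst(inst)
    src.end_of_stream()

def simulate(src, results):
    """Plays the role of the MCA simulation-pipeline thread."""
    count = 0
    for inst in src:
        count += 1                     # stand-in for per-instruction pipeline work
    results["insts"] = count

src, results = IncrementalSourceMgr(), {}
t_recv = threading.Thread(target=receiver, args=(src, ["vmulps", "vhaddps", "vhaddps"]))
t_sim = threading.Thread(target=simulate, args=(src, results))
t_sim.start(); t_recv.start()
t_recv.join(); t_sim.join()
print(results["insts"])  # → 3
```

Note how the simulation side needs no change beyond iterating the feed, which is exactly the "Pros" above; the "Cons" is that the feed only makes sense with a second thread filling it.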
93. Resumable simulation pipeline
• Save (and restore) the analysis state from the previous subset of instructions
• Threads are not required when using IncrementalSourceMgr + the resumable
pipeline
• Much easier to integrate into other uses
• You can still wrap the resumable pipeline in a thread
• Minor downside: Modifications to the simulation pipeline
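A resumable pipeline can be sketched as an object that preserves its analysis state across calls, so the caller can push instruction batches as they arrive, with no dedicated feeder thread. This is a minimal, hypothetical Python sketch (real MCA stages and cycle accounting are far richer than the counter used here):

```python
class ResumablePipeline:
    """Hypothetical stand-in for MCA's resumable simulation pipeline."""
    def __init__(self):
        self.total_insts = 0     # analysis state persisted between run() calls
        self.total_cycles = 0

    def run(self, batch):
        """Simulate one batch, then pause; state survives until the next call."""
        for inst in batch:
            self.total_insts += 1
            self.total_cycles += 1   # stand-in for real per-stage simulation
        return self.total_cycles

pipe = ResumablePipeline()
pipe.run(["vmulps", "vhaddps"])  # first chunk of the trace
pipe.run(["vhaddps"])            # later chunk; analysis resumes where it left off
print(pipe.total_insts)  # → 3
```

Because each `run()` call simply picks up where the last one stopped, the caller drives the pipeline from its own event loop, which is why no thread is needed and integration gets easier.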
97. Incremental SourceMgr + Resumable pipeline
Putting it into action
[Diagram: QEMU streams the execution trace into mca::IncrementalSourceMgr, which
feeds MCInst / mca::Instruction objects into the resumable simulation pipeline
(Stage 0 … Stage N)]
• A program that continuously reads from an I/O device
• Only record the user-space traces
• Collected ~1 million instructions
104. Challenge
• Creates a significant memory footprint
• Bottleneck: ~37GB of accumulated (virtual) memory was allocated by
mca::InstrBuilder::createInstruction
[Chart: memory allocated by mca::InstrBuilder while translating MCInst into
mca::Instruction (unit: MB)]
108. Large memory footprint
Root cause
• Most of the translated mca::Instruction objects are never deallocated
until the simulation is finished
• mca::Instruction is also used for tracking simulation state, so it’s hard to
make it immutable
• Doesn’t scale well with large inputs (recall: ~1 million instructions)
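A common remedy for this allocation pattern — and the idea behind the "Recycling InstrBuilder" component that appears among MCAD's upstreaming candidates later in the talk — is to return retired instruction objects to a free list and reuse them, bounding live memory by pipeline depth rather than trace length. A minimal, hypothetical Python sketch (class and method names are illustrative, not MCAD's API):

```python
class Instruction:
    """Stand-in for mca::Instruction: holds mutable simulation state."""
    def __init__(self):
        self.opcode = None

class RecyclingInstrBuilder:
    """Hypothetical recycling allocator for instruction objects."""
    def __init__(self):
        self._free = []       # retired objects awaiting reuse
        self.allocated = 0    # how many objects were ever created

    def create_instruction(self, opcode):
        # Reuse a retired object if one is available; allocate otherwise.
        inst = self._free.pop() if self._free else self._new()
        inst.opcode = opcode
        return inst

    def _new(self):
        self.allocated += 1
        return Instruction()

    def recycle(self, inst):
        # Called once the instruction retires from the simulated pipeline.
        self._free.append(inst)

b = RecyclingInstrBuilder()
for op in ["mul", "add", "sub", "add"] * 250:   # a 1000-instruction "trace"
    inst = b.create_instruction(op)
    # ... simulate the instruction through the pipeline ...
    b.recycle(inst)                             # retire immediately (depth 1)
print(b.allocated)  # → 1: memory no longer grows with trace length
```

The catch noted above still applies: because the object carries simulation state, it may only be recycled once the pipeline has truly retired it.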
124. Custom QEMU TCG plugin
• The plugin interface allows us to tap into various TCG events to collect
executed instructions
• Example: When a TCG block is translated / executed
• We also added a few plugin interfaces (not upstreamed yet)
• Example: Retrieving CPU register values
• Sending raw binary instructions* through TCP sockets
133. Evaluation
Scalability compared against llvm-mca
• Compare MCAD against llvm-mca (i.e. the baseline) on analysis speed and memory
consumption
• Using an execution trace collected from running x86_64 FFmpeg 4.2 to decode a
14KB MPEG-4 video file
• Command: ffmpeg -i input.mp4 -f null -
• Size of the trace: ~27 million x86_64 instructions
• For the baseline, we dump the execution trace (assembly instructions) to a file
before feeding it into llvm-mca
• Time measurement on the baseline only accounts for llvm-mca’s run time,
excluding the trace collection time
147. // TODO
• More efficient ways to collect traces without QEMU
• Analyzing traces from multi-threaded programs
• Improving MCA’s precision via dynamic information (e.g. memory accesses)
• Visualizing analysis results, or improving MCA’s result display
• Example: Loadable plugins for custom display of the results
• Going upstream: QEMU & LLVM
150. Going upstream: LLVM
The plan: components to upstream
[Diagram: MCAD’s components — QEMU TCG Plugin, (QEMU) Broker, MCAD Core,
Receiver, Disassembler, IncrementalSourceMgr, Resumable Simulation Pipeline,
Recycling InstrBuilder, Display* — each marked as “intended to upstream”,
“not intended to upstream for now”, or “unrelated / no change”; QEMU, the
Target Program, and TCP Sockets are unchanged]
155. Going upstream: LLVM
The plan
• We would like to upstream the components that are beneficial to the core MCA
libraries first
• The QEMU broker plugin & our TCG plugin will be maintained out-of-tree
• We’re not sure about upstreaming the rest of the tool right now
• With the assembly broker, MCAD can be a drop-in replacement for llvm-
mca…with even more features (e.g. the broker plugin infrastructure)
• Some of the (advanced) interfaces in the broker plugin are only used by the
QEMU broker, so without the latter they’re not well tested
162. Summary
• MCA Daemon (MCAD) is a high-performance hybrid throughput analysis tool
built on top of the LLVM MCA libraries
• Online, whole-program analysis of real-world applications
• Scales up to large programs with tens of millions of instructions
• We improved the performance & flexibility of the core MCA libraries
• We would like to merge these changes upstream to benefit the wider
community
164. Acknowledgements
This material is based upon work partially supported by the
Defense Advanced Research Projects Agency (DARPA) under
contract N66001-20-C-4027
Special Thanks:
Galois Inc. (https://galois.com) Immunant Inc. (https://immunant.com)