SlideShare a Scribd company logo
1 of 168
Download to read offline
Min-Yih “Min” Hsu, David Gens, Michael Franz. University of California, Irvine
MCA Daemon
Hybrid Throughput Analysis Beyond Basic Blocks
Keynote, EuroLLVM 2022
Outline
2
Outline
Motivation
2
Outline
Motivation
MCA Daemon (MCAD)
2
Outline
Motivation
MCA Daemon (MCAD)
Future Plans
2
Outline
Motivation
MCA Daemon (MCAD)
Future Plans
Epilogue
2
Outline
Motivation

MCA Daemon (MCAD)

Future Plans

Epilogue
3
Genesis: Assured Micro Patching (AMP)
4
Genesis: Assured Micro Patching (AMP)
• A research project initiated by United States DARPA to assure the correctness
of binary patching with little or no source code
4
Genesis: Assured Micro Patching (AMP)
• A research project initiated by United States DARPA to assure the correctness
of binary patching with little or no source code
• Including functional and timing aspects
4
Genesis: Assured Micro Patching (AMP)
• A research project initiated by United States DARPA to assure the correctness
of binary patching with little or no source code
• Including functional and timing aspects
• Focuses on small (micro) binary patches
4
Genesis: Assured Micro Patching (AMP)
• A research project initiated by United States DARPA to assure the correctness
of binary patching with little or no source code
• Including functional and timing aspects
• Focuses on small (micro) binary patches
• Focuses on embedded systems
4
Genesis: Assured Micro Patching (AMP)
• A research project initiated by United States DARPA to assure the correctness
of binary patching with little or no source code
• Including functional and timing aspects
• Focuses on small (micro) binary patches
• Focuses on embedded systems
• UCI was studying the timing impacts of binary patches
4
Genesis: Assured Micro Patching (AMP)
• A research project initiated by United States DARPA to assure the correctness
of binary patching with little or no source code
• Including functional and timing aspects
• Focuses on small (micro) binary patches
• Focuses on embedded systems
• UCI was studying the timing impacts of binary patches
• Example: After the
fi
rmware on a truck is binary-patched to prevent brakes
from locking up, we need to make sure latencies do not degrade terribly
4
Timing impacts of binary patches
Problem de
fi
nition
5
Original


Program
Timing impacts of binary patches
Problem de
fi
nition
5
Original


Program
Small


Binary
Patch
Patched


Program
Timing impacts of binary patches
Problem de
fi
nition
5
Original


Program
Small


Binary
Patch
Patched


Program
Original


Program
Same set of inputs
Timing impacts of binary patches
Problem de
fi
nition
5
Original


Program
Small


Binary
Patch
Patched


Program
Original


Program
ΔT?
Same set of inputs
Execution time assessment
6
Original


Program
Small


Binary
Patch
Patched


Program
Original


Program
ΔT?
Same set of inputs
Execution time assessment
Interesting use cases
7
Execution time assessment
Interesting use cases
• Predicting program run time in remote environments or time-sensitive
applications
• Examples:
fi
rmware in cars or satellite (e.g. Kepler space telescope by NASA)
7
Execution time assessment
Interesting use cases
• Predicting program run time in remote environments or time-sensitive
applications
• Examples:
fi
rmware in cars or satellite (e.g. Kepler space telescope by NASA)
• Performance analysis
• Insights into performance bottlenecks
7
Execution time assessment
Interesting use cases
• Predicting program run time in remote environments or time-sensitive
applications
• Examples:
fi
rmware in cars or satellite (e.g. Kepler space telescope by NASA)
• Performance analysis
• Insights into performance bottlenecks
• Examples: Potential CPU pipeline stalling, GPU memory bank con
fl
icts
7
Execution time assessment
Previous e
ff
orts
8
Execution time assessment
Previous e
ff
orts
• Static approaches
8
Execution time assessment
Previous e
ff
orts
• Static approaches
• Throughput analysis: predicting the cycle counts for linear code (e.g. basic
block, loop) statically
8
Execution time assessment
Previous e
ff
orts
• Static approaches
• Throughput analysis: predicting the cycle counts for linear code (e.g. basic
block, loop) statically
• Examples: IACA, OSACA, uiCA, LLVM MCA, Ithemal
8
Execution time assessment
Previous e
ff
orts
• Static approaches
• Throughput analysis: predicting the cycle counts for linear code (e.g. basic
block, loop) statically
• Examples: IACA, OSACA, uiCA, LLVM MCA, Ithemal
• Dynamic approaches
• Cycle-accurate simulators / emulators
8
Execution time assessment
Previous e
ff
orts
• Static approaches
• Throughput analysis: predicting the cycle counts for linear code (e.g. basic
block, loop) statically
• Examples: IACA, OSACA, uiCA, LLVM MCA, Ithemal
• Dynamic approaches
• Cycle-accurate simulators / emulators
• Examples: gem5, gpgpu-sim
8
Execution time assessment
Challenges
9
Static Dynamic
Execution time assessment
Challenges
9
Static Dynamic
High
Low
Precision
Execution time assessment
Challenges
9
Static Dynamic
High
Low
Precision
• Complete execution traces

• Higher
fi
delity on hardware
details
Execution time assessment
Challenges
9
Static Dynamic
High
Low
Precision
• Poor handling on branches &
function calls

• Small scope (only few blocks)

• Lack of run-time information
• Complete execution traces

• Higher
fi
delity on hardware
details
Execution time assessment
Challenges
9
Static Dynamic
High
Low
Precision
Fast Slow
Turnaround
• Poor handling on branches &
function calls

• Small scope (only few blocks)

• Lack of run-time information
• Complete execution traces

• Higher
fi
delity on hardware
details
Execution time assessment
Challenges
9
Static Dynamic
High
Low
Precision
Fast Slow
Turnaround
• Poor handling on branches &
function calls

• Small scope (only few blocks)

• Lack of run-time information
• Complete execution traces

• Higher
fi
delity on hardware
details
• Faster analysis speed (due to
coarser granularity)

• Easier integration with other tools
Execution time assessment
Challenges
9
Static Dynamic
High
Low
Precision
Fast Slow
Turnaround
• Poor handling on branches &
function calls

• Small scope (only few blocks)

• Lack of run-time information
• Complete execution traces

• Higher
fi
delity on hardware
details
• Faster analysis speed (due to
coarser granularity)

• Easier integration with other tools
• Usually require non-trivial
setup

• Slow simulation speed
Execution time assessment
Challenges
9
Static Dynamic
High
Low
Precision
Fast Slow
Turnaround
?
• Poor handling on branches &
function calls

• Small scope (only few blocks)

• Lack of run-time information
• Complete execution traces

• Higher
fi
delity on hardware
details
• Faster analysis speed (due to
coarser granularity)

• Easier integration with other tools
• Usually require non-trivial
setup

• Slow simulation speed
Outline
Motivation

MCA Daemon (MCAD)

Future Plans

Epilogue
10
MCA Daemon (MCAD)
High-level concept
11
Dynamic Runtime
Static Throughput
Analysis Tool
Target Program
MCA Daemon (MCAD)
High-level concept
11
Dynamic Runtime
Static Throughput
Analysis Tool
Target Program
Execution Trace
MCA Daemon (MCAD)
High-level concept
11
Dynamic Runtime
Static Throughput
Analysis Tool
Target Program
Execution Trace
• The instructions that just got executed

• Run-time values (e.g. register values)
MCA Daemon (MCAD)
High-level concept
11
Dynamic Runtime
Static Throughput
Analysis Tool
Target Program
Execution Trace
Online Environment
Process 1 Process 2
• The instructions that just got executed

• Run-time values (e.g. register values)
MCA Daemon (MCAD)
High-level concept
11
Dynamic Runtime
Static Throughput
Analysis Tool
Target Program
Execution Trace
Online Environment
Process 1 Process 2
Streaming
• The instructions that just got executed

• Run-time values (e.g. register values)
MCA Daemon (MCAD)
High-level concept
12
QEMU
LLVM MCA
Libraries
Target Program
Online Environment
Process 1 Process 2
Execution Trace
Streaming
Introduction to LLVM MCA
13
Introduction to LLVM MCA
• A tool (llvm-mca) and library (libLLVMMCA) for predicting cycle counts and
potential performance hazards in a sequence of assembly code
13
Introduction to LLVM MCA
• A tool (llvm-mca) and library (libLLVMMCA) for predicting cycle counts and
potential performance hazards in a sequence of assembly code
• Using instruction scheduling data (e.g. instruction latency) provided by each
LLVM target

• New ISA (with proper scheduling info) can be supported out of the box
13
Introduction to LLVM MCA
• A tool (llvm-mca) and library (libLLVMMCA) for predicting cycle counts and
potential performance hazards in a sequence of assembly code
• Using instruction scheduling data (e.g. instruction latency) provided by each
LLVM target

• New ISA (with proper scheduling info) can be supported out of the box
• Accounting for modern processor features: super scalar, out-of-order etc.
13
Introduction to LLVM MCA
• A tool (llvm-mca) and library (libLLVMMCA) for predicting cycle counts and
potential performance hazards in a sequence of assembly code
• Using instruction scheduling data (e.g. instruction latency) provided by each
LLVM target

• New ISA (with proper scheduling info) can be supported out of the box
• Accounting for modern processor features: super scalar, out-of-order etc.
• Implemented via lightweight simulation

• Abstract real CPU pipeline stages into a small handful of stages
13
Introduction to MCA
An example
14
vmulps %xmm0, %xmm1, %xmm2


vhaddps %xmm2, %xmm2, %xmm3


vhaddps %xmm3, %xmm3, %xmm4
test/tools/llvm-mca/X86/BtVer2/dot-product.s
Introduction to MCA
An example
14
vmulps %xmm0, %xmm1, %xmm2


vhaddps %xmm2, %xmm2, %xmm3


vhaddps %xmm3, %xmm3, %xmm4
test/tools/llvm-mca/X86/BtVer2/dot-product.s
llvm-mca -mtriple=x86_64 -mcpu=btver2 


-iterations=300 dot-products.s
Introduction to MCA
An example
14
vmulps %xmm0, %xmm1, %xmm2


vhaddps %xmm2, %xmm2, %xmm3


vhaddps %xmm3, %xmm3, %xmm4
test/tools/llvm-mca/X86/BtVer2/dot-product.s Summary
Iterations: 300


Instructions: 900


Total Cycles: 610


Total uOps: 900


Dispatch Width: 2


uOps Per Cycle: 1.48


IPC: 1.48


Block RThroughput: 2.0
llvm-mca -mtriple=x86_64 -mcpu=btver2 


-iterations=300 dot-products.s
15
llvm-mca
MCAD
15
llvm-mca
Assembly
fi
le
LLVM MCA
Libraries
llvm-mca
MCAD
15
QEMU
LLVM MCA
Libraries
Target Program
Process 1 Process 2
Execution Trace
Streaming
llvm-mca
Assembly
fi
le
LLVM MCA
Libraries
llvm-mca
MCAD
MCA Daemon (MCAD)
Highlights
16
MCA Daemon (MCAD)
Highlights
• Combine the advantages of dynamic & static throughput analysis
16
MCA Daemon (MCAD)
Highlights
• Combine the advantages of dynamic & static throughput analysis
• Augment the analysis region beyond basic blocks

• MCAD is able to analyze the entire program execution trace
16
MCA Daemon (MCAD)
Highlights
• Combine the advantages of dynamic & static throughput analysis
• Augment the analysis region beyond basic blocks

• MCAD is able to analyze the entire program execution trace
• Throughput analysis is happening in parallel / on-the-fly with the
target program execution
16
Implementation
17
Analyze execution traces using MCA
Using unmodi
fi
ed MCA libraries
18
QEMU
Target Program
Executed
instructions
LLVM MCA
Libraries
Disassembler
Analyze execution traces using MCA
Challenge: Sequential work
fl
ow
19
QEMU
Target Program
Blocked until QEMU is
fi
nished
Executed
instructions
LLVM MCA
Libraries
Disassembler
MCA internal
20
Assembly
fi
le
MCA internal
20
MCInst
Assembly
fi
le
MCA internal
20
MCInst mca::Instruction
Assembly
fi
le
MCA internal
20
MCInst mca::Instruction
mca::SourceMgr
Assembly
fi
le
MCA internal
20
MCInst mca::Instruction Stage 0 Stage 1 Stage N
….
MCA Simulation Pipeline
mca::SourceMgr
Assembly
fi
le
MCA internal
20
MCInst mca::Instruction Stage 0 Stage 1 Stage N
….
MCA Simulation Pipeline
Display
mca::SourceMgr
Assembly
fi
le
MCA with execution trace stream as input
21
MCInst mca::Instruction Stage 0 Stage 1 Stage N
….
MCA Simulation Pipeline
Display
mca::SourceMgr
Execution
Trace
Stream
MCA with execution trace stream as input
21
MCInst mca::Instruction Stage 0 Stage 1 Stage N
….
MCA Simulation Pipeline
Display
mca::SourceMgr
Execution
Trace
Stream
Blocking
Incremental SourceMgr
22
MCInst mca::Instruction Stage 0 Stage 1 Stage N
….
MCA Simulation Pipeline
Display
mca::IncrementalSourceMgr
Execution
Trace
Stream
Incremental SourceMgr
22
MCInst mca::Instruction Stage 0 Stage 1 Stage N
….
MCA Simulation Pipeline
Display
Making mca::Instruction available to
simulation pipeline right away
mca::IncrementalSourceMgr
Execution
Trace
Stream
Incremental SourceMgr
23
MCInst mca::Instruction Stage 0 Stage 1 Stage N
….
MCA Simulation Pipeline
mca::IncrementalSourceMgr
Execution
Trace
Stream
QEMU
Incremental SourceMgr
23
MCInst mca::Instruction Stage 0 Stage 1 Stage N
….
MCA Simulation Pipeline
mca::IncrementalSourceMgr
Execution
Trace
Stream
QEMU
Process 2
Process 1
Incremental SourceMgr
Implement with threads
24
MCInst mca::Instruction Stage 0 Stage 1 Stage N
….
MCA Simulation Pipeline
mca::IncrementalSourceMgr
Execution
Trace
Stream
QEMU
Process 1
Incremental SourceMgr
Implement with threads
24
MCInst mca::Instruction Stage 0 Stage 1 Stage N
….
MCA Simulation Pipeline
mca::IncrementalSourceMgr
Execution
Trace
Stream
QEMU
Process 1 Simulation Thread
Incremental SourceMgr
Implement with threads
24
MCInst mca::Instruction Stage 0 Stage 1 Stage N
….
MCA Simulation Pipeline
mca::IncrementalSourceMgr
Execution
Trace
Stream
QEMU
Process 1 Simulation Thread
mca::Instruction fetching loop
Incremental SourceMgr
Implement with threads
24
MCInst mca::Instruction Stage 0 Stage 1 Stage N
….
MCA Simulation Pipeline
mca::IncrementalSourceMgr
Execution
Trace
Stream
QEMU
Process 1 Simulation Thread
Receiver Thread
mca::Instruction fetching loop
Incremental SourceMgr
Implement with threads: Pros & Cons
25
MCInst mca::Instruction Stage 0 Stage 1 Stage N
….
MCA Simulation Pipeline
mca::IncrementalSourceMgr
Execution
Trace
Stream
QEMU
Process 1 Simulation Thread
Receiver Thread
Incremental SourceMgr
Implement with threads: Pros & Cons
25
MCInst mca::Instruction Stage 0 Stage 1 Stage N
….
MCA Simulation Pipeline
mca::IncrementalSourceMgr
Execution
Trace
Stream
QEMU
Process 1 Simulation Thread
Receiver Thread
Pros: No modi
fi
cation on the simulation pipeline
Incremental SourceMgr
Implement with threads: Pros & Cons
25
MCInst mca::Instruction Stage 0 Stage 1 Stage N
….
MCA Simulation Pipeline
mca::IncrementalSourceMgr
Execution
Trace
Stream
QEMU
Process 1 Simulation Thread
Receiver Thread
Pros: No modi
fi
cation on the simulation pipeline
Cons: To use IncrementalSourceMgr, you have to use threads
Incremental SourceMgr
Better solution: Resumable simulation pipeline
26
MCInst mca::Instruction Stage 0 Stage 1 Stage N
….
Resumable Simulation Pipeline
mca::IncrementalSourceMgr
Execution
Trace
Stream
QEMU
Incremental SourceMgr
Better solution: Resumable simulation pipeline
26
MCInst mca::Instruction Stage 0 Stage 1 Stage N
….
Resumable Simulation Pipeline
mca::IncrementalSourceMgr
Execution
Trace
Stream
QEMU
A subset of trace
Incremental SourceMgr
Better solution: Resumable simulation pipeline
26
MCInst mca::Instruction Stage 0 Stage 1 Stage N
….
Resumable Simulation Pipeline
mca::IncrementalSourceMgr
Execution
Trace
Stream
QEMU
A subset of trace
A subset of
mca::Instruction
Incremental SourceMgr
Better solution: Resumable simulation pipeline
27
MCInst mca::Instruction Stage 0 Stage 1 Stage N
….
Resumable Simulation Pipeline
mca::IncrementalSourceMgr
Execution
Trace
Stream
QEMU
A subset of
mca::Instruction
Incremental SourceMgr
Better solution: Resumable simulation pipeline
28
MCInst mca::Instruction Stage 0 Stage 1 Stage N
….
Resumable Simulation Pipeline
mca::IncrementalSourceMgr
Execution
Trace
Stream
QEMU
Pause
Incremental SourceMgr
Better solution: Resumable simulation pipeline
28
MCInst mca::Instruction Stage 0 Stage 1 Stage N
….
Resumable Simulation Pipeline
mca::IncrementalSourceMgr
Execution
Trace
Stream
QEMU
Pause
Resumable simulation pipeline
29
Resumable simulation pipeline
• Save (and restore) the analysis state from previous subset of instructions
29
Resumable simulation pipeline
• Save (and restore) the analysis state from previous subset of instructions
• Threads are not required when using IncrementalSourceMgr + resumable
pipeline
29
Resumable simulation pipeline
• Save (and restore) the analysis state from previous subset of instructions
• Threads are not required when using IncrementalSourceMgr + resumable
pipeline
• Much easier to integrate into other uses
29
Resumable simulation pipeline
• Save (and restore) the analysis state from previous subset of instructions
• Threads are not required when using IncrementalSourceMgr + resumable
pipeline
• Much easier to integrate into other uses
• You can still wrap resumable pipeline with a thread
29
Resumable simulation pipeline
• Save (and restore) the analysis state from previous subset of instructions
• Threads are not required when using IncrementalSourceMgr + resumable
pipeline
• Much easier to integrate into other uses
• You can still wrap resumable pipeline with a thread
• Minor downside: Modi
fi
cations on the simulation pipeline
29
Incremental SourceMgr + Resumable pipeline
Put into real actions
30
MCInst mca::Instruction Stage 0 Stage 1 Stage N
….
Resumable Simulation Pipeline
mca::IncrementalSourceMgr
Execution
Trace
Stream
QEMU
Incremental SourceMgr + Resumable pipeline
Put into real actions
30
MCInst mca::Instruction Stage 0 Stage 1 Stage N
….
Resumable Simulation Pipeline
mca::IncrementalSourceMgr
Execution
Trace
Stream
QEMU
• A program that continuously reads from an I/O device
Incremental SourceMgr + Resumable pipeline
Put into real actions
30
MCInst mca::Instruction Stage 0 Stage 1 Stage N
….
Resumable Simulation Pipeline
mca::IncrementalSourceMgr
Execution
Trace
Stream
QEMU
• A program that continuously reads from an I/O device
• Only record the user space traces
Incremental SourceMgr + Resumable pipeline
Put into real actions
30
MCInst mca::Instruction Stage 0 Stage 1 Stage N
….
Resumable Simulation Pipeline
mca::IncrementalSourceMgr
Execution
Trace
Stream
QEMU
• A program that continuously reads from an I/O device
• Only record the user space traces
• Collected ~1 million instructions
Challenge
31
Challenge
• Created a signi
fi
cant amount of memory footprint
31
Challenge
• Created a signi
fi
cant amount of memory footprint
31
(Unit: MB)
Challenge
• Created a signi
fi
cant amount of memory footprint
• Bottleneck: ~37GB of accumulated (virtual)
memory was allocated by
mca::InstrBuilder::createInstruction
31
(Unit: MB)
Challenge
• Created a signi
fi
cant amount of memory footprint
• Bottleneck: ~37GB of accumulated (virtual)
memory was allocated by
mca::InstrBuilder::createInstruction
31
(Unit: MB)
Challenge
• Created a signi
fi
cant amount of memory footprint
• Bottleneck: ~37GB of accumulated (virtual)
memory was allocated by
mca::InstrBuilder::createInstruction
31
(Unit: MB)
Challenge
• Created a signi
fi
cant amount of memory footprint
• Bottleneck: ~37GB of accumulated (virtual)
memory was allocated by
mca::InstrBuilder::createInstruction
31
MCInst mca::Instruction
mca::InstrBuilder
(Unit: MB)
Large memory footprint
Root cause
32
Large memory footprint
Root cause
• Most of the translated mca::Instruction objects are never deallocated
until the simulation is
fi
nished
32
Large memory footprint
Root cause
• Most of the translated mca::Instruction objects are never deallocated
until the simulation is
fi
nished
• mca::Instruction is also used for tracking simulation state, so it’s hard to
make it immutable
32
Large memory footprint
Root cause
• Most of the translated mca::Instruction objects are never deallocated
until the simulation is
fi
nished
• mca::Instruction is also used for tracking simulation state, so it’s hard to
make it immutable
• Doesn’t scale really well with large input (recall: ~1 million instructions)
32
Large memory footprint
Observation
33
mca::IncrementalSourceMgr
mca::Instruction
Resumable Simulation
Pipeline
Large memory footprint
Observation
33
mca::IncrementalSourceMgr
mca::Instruction
Resumable Simulation
Pipeline
Stream direction
Large memory footprint
Observation
33
mca::IncrementalSourceMgr
mca::Instruction
Resumable Simulation
Pipeline
Copy
Stream direction
Large memory footprint
Solution: Recycling mca::Instruction
34
mca::IncrementalSourceMgr
mca::Instruction
Resumable Simulation
Pipeline
Copy
Stream direction
Large memory footprint
Solution: Recycling mca::Instruction
34
mca::IncrementalSourceMgr
mca::Instruction
Resumable Simulation
Pipeline
Copy
Stream direction
Large memory footprint
Solution: Recycling mca::Instruction
34
mca::IncrementalSourceMgr
mca::Instruction
Resumable Simulation
Pipeline
Copy
Stream direction
Recycle
mca::InstrBuilder
67%
improvement on accumulated memory consumption
35
~70%
of the mca::Instruction objects are recycled
36
Collecting execution traces via QEMU
37
user mode qemu
Target Program
QEMU MCAD
Collecting execution traces via QEMU
37
user mode qemu
Target Program
Custom TCG Plugin
Instrument
QEMU MCAD
Broker
Collecting execution traces via QEMU
37
user mode qemu
Target Program
Custom TCG Plugin
Instrument
Receiver
TCP Socket
QEMU MCAD
Broker
Collecting execution traces via QEMU
37
user mode qemu
Target Program
Custom TCG Plugin
Instrument Disassembler
Receiver
TCP Socket
MCInst
QEMU MCAD
Custom QEMU TCG plugin
38
Custom QEMU TCG plugin
• The plugin interface allows us to tap into various TCG events to collect
executed instructions

• Example: When a TCG block is translated / executed
38
Custom QEMU TCG plugin
• The plugin interface allows us to tap into various TCG events to collect
executed instructions

• Example: When a TCG block is translated / executed
• We also added a few plugin interfaces (not upstreamed yet)

• Example: Retrieving CPU register values
38
Custom QEMU TCG plugin
• The plugin interface allows us to tap into various TCG events to collect
executed instructions

• Example: When a TCG block is translated / executed
• We also added a few plugin interfaces (not upstreamed yet)

• Example: Retrieving CPU register values
• Sending raw binary instructions* through TCP sockets
38
QEMU
Complete structure of MCAD
39
TCG Plugin
MCAD Core
IncrementalSourceMgr
Resumable Simulation Pipeline
Display
(QEMU) Broker
Disassembler
Receiver
Target Program
TCP Sockets
Recycling InstrBuilder
QEMU
Complete structure of MCAD
39
TCG Plugin
MCAD Core
IncrementalSourceMgr
Resumable Simulation Pipeline
Display
(QEMU) Broker
Disassembler
Receiver
Target Program
TCP Sockets
☹
Recycling InstrBuilder
QEMU
Complete structure of MCAD
A modular design
40
TCG Plugin
Display
(QEMU) Broker
Disassembler
Receiver
Target Program
TCP Sockets
MCAD Core
IncrementalSourceMgr
Resumable Simulation Pipeline
Recycling InstrBuilder
QEMU
Complete structure of MCAD
A modular design
40
TCG Plugin
Display
(QEMU) Broker
Disassembler
Receiver
Target Program
TCP Sockets
Loadable Plugin
MCAD Core
IncrementalSourceMgr
Resumable Simulation Pipeline
Recycling InstrBuilder
Loadable Plugin
Complete structure of MCAD
Example: Assembly broker plugin
41
Display
Assembly Broker
AsmParser
Assembly File
MCAD Core
IncrementalSourceMgr
Resumable Simulation Pipeline
Recycling InstrBuilder
Evaluation
Scalability compared against llvm-mca
42
Evaluation
Scalability compared against llvm-mca
• Compare MCAD against llvm-mca (i.e. baseline) on analysis speed and memory
consumption
42
Evaluation
Scalability compared against llvm-mca
• Compare MCAD against llvm-mca (i.e. baseline) on analysis speed and memory
consumption
• Using execution trace collected from running x86_64 FFmpeg 4.2 to decode a
14KB MPEG-4 video
fi
le

• Command: ffmpeg -i input.mp4 -f null - 

• Size of the trace: ~27 million x86_64 instructions
42
Evaluation
Scalability compared against llvm-mca
• Compare MCAD against llvm-mca (i.e. baseline) on analysis speed and memory
consumption
• Using execution trace collected from running x86_64 FFmpeg 4.2 to decode a
14KB MPEG-4 video
fi
le

• Command: ffmpeg -i input.mp4 -f null - 

• Size of the trace: ~27 million x86_64 instructions
• For baseline, we dump the execution trace (assembly instructions) to a
fi
le before
feeding into llvm-mca

• Time measurement on baseline only accounts for llvm-mca’s run time. Excluding
the trace collection time.
42
Evaluation
Scalability compared against llvm-mca
43
Evaluation
Scalability compared against llvm-mca
43
Analysis time
seconds
0 44 88 132 176 220
llvm-mca
MCAD
Evaluation
Scalability compared against llvm-mca
43
Analysis time
seconds
0 44 88 132 176 220
llvm-mca
MCAD
4x Faster
Evaluation
Scalability compared against llvm-mca
43
Analysis time
seconds
0 44 88 132 176 220
llvm-mca
MCAD
Max resident memory
Gigabytes
0 5 10 15 20 25 30
4x Faster
Evaluation
Scalability compared against llvm-mca
43
Analysis time
seconds
0 44 88 132 176 220
llvm-mca
MCAD
Max resident memory
Gigabytes
0 5 10 15 20 25 30
4x Faster
13x Less
Evaluation
Scalability compared against other static throughput analysis tools
44
Analysis Time Max Resident Memory
uiCA Timeout after 48h 113 GB
OSACA Exit w/ error after 24h N/A
Ithemal Exit w/ error after 2m N/A
MCAD 52.69s 2.16 GB
Outline
Motivation

MCA Daemon (MCAD)

Future Plans

Epilogue
45
// TODO
46
// TODO
• More e
ffi
cient ways to collect traces without QEMU
46
// TODO
• More e
ffi
cient ways to collect traces without QEMU
• Analyzing traces from multi-thread programs
46
// TODO
• More e
ffi
cient ways to collect traces without QEMU
• Analyzing traces from multi-thread programs
• Improve MCA’s precision via dynamic information (e.g. memory accesses)
46
// TODO
• More e
ffi
cient ways to collect traces without QEMU
• Analyzing traces from multi-thread programs
• Improve MCA’s precision via dynamic information (e.g. memory accesses)
• Visualizing analysis results. Or: Improve MCA’s result display
46
// TODO
• More e
ffi
cient ways to collect traces without QEMU
• Analyzing traces from multi-thread programs
• Improve MCA’s precision via dynamic information (e.g. memory accesses)
• Visualizing analysis results. Or: Improve MCA’s result display
• Example: Loadable plugins for custom display of the result
46
// TODO
• More e
ffi
cient ways to collect traces without QEMU
• Analyzing traces from multi-thread programs
• Improve MCA’s precision via dynamic information (e.g. memory accesses)
• Visualizing analysis results. Or: Improve MCA’s result display
• Example: Loadable plugins for custom display of the result
• Going upstream: QEMU & LLVM
46
Going upstream: LLVM
The plan
47
Going upstream: LLVM
The plan
• We would like to upstream components that are bene
fi
cial to the core MCA
libraries
fi
rst
47
Going upstream: LLVM
The plan: components to upstream
48
QEMU
TCG Plugin
MCAD Core
IncrementalSourceMgr
Resumable Simulation Pipeline
(QEMU) Broker
Disassembler
Receiver
Target Program
TCP Sockets
Recycling InstrBuilder
Display*
Not intended to upstream for now
Intended to upstream
Unrelated / no change
Going upstream: LLVM
The plan
• We would like to upstream components that are bene
fi
cial to the core MCA
libraries
fi
rst
49
Going upstream: LLVM
The plan
• We would like to upstream components that are bene
fi
cial to the core MCA
libraries
fi
rst
• QEMU broker plugin & our TCG plugin will be maintained out-of-tree
49
Going upstream: LLVM
The plan
• We would like to upstream components that are bene
fi
cial to the core MCA
libraries
fi
rst
• QEMU broker plugin & our TCG plugin will be maintained out-of-tree
• We’re not sure about upstreaming rest of the tool right now
49
Going upstream: LLVM
The plan
• We would like to upstream components that are bene
fi
cial to the core MCA
libraries
fi
rst
• QEMU broker plugin & our TCG plugin will be maintained out-of-tree
• We’re not sure about upstreaming rest of the tool right now
• With the assembly broker, MCAD can be a drop-in replacement for llvm-
mca…with even more features (e.g. the broker plugin infrastructure)
49
Going upstream: LLVM
The plan
• We would like to upstream components that are bene
fi
cial to the core MCA
libraries
fi
rst
• QEMU broker plugin & our TCG plugin will be maintained out-of-tree
• We’re not sure about upstreaming rest of the tool right now
• With the assembly broker, MCAD can be a drop-in replacement for llvm-
mca…with even more features (e.g. the broker plugin infrastructure)
• Some of the (advanced) interfaces in broker plugin are only used by QEMU
broker. So, without the latter, it’s not well tested.
49
Outline
Motivation

MCA Daemon (MCAD)

Future Plans

Epilogue
50
Summary
51
Summary
• MCA Daemon (MCAD) is a high-performance hybrid throughput analysis tool
built on top of LLVM MCA libraries
51
Summary
• MCA Daemon (MCAD) is a high-performance hybrid throughput analysis tool
built on top of LLVM MCA libraries
• Online, whole-program analysis on real-world applications
51
Summary
• MCA Daemon (MCAD) is a high-performance hybrid throughput analysis tool
built on top of LLVM MCA libraries
• Online, whole-program analysis on real-world applications
• Scale up with large-scale programs with tens of millions of instructions
51
Summary
• MCA Daemon (MCAD) is a high-performance hybrid throughput analysis tool
built on top of LLVM MCA libraries
• Online, whole-program analysis on real-world applications
• Scale up with large-scale programs with tens of millions of instructions
• We improved the performance &
fl
exibility of core MCA libraries
51
Summary
• MCA Daemon (MCAD) is a high-performance hybrid throughput analysis tool
built on top of LLVM MCA libraries
• Online, whole-program analysis on real-world applications
• Scale up with large-scale programs with tens of millions of instructions
• We improved the performance &
fl
exibility of core MCA libraries
• We would like to merge these changes upstream to bene
fi
t the wider
community
51
Source code
52
https://github.com/securesystemslab/LLVM-MCA-Daemon
Acknowledgements
53
This material is based upon work partially supported by the
Defense Advanced Research Projects Agency (DARPA) under
contract N66001-20-C-4027
Special Thanks:
Galois Inc. (https://galois.com) Immunant Inc. (https://immunant.com)
Thank You!
54
Q&A
Aldrich Park @ UC Irvine, California
Me
Appendix
55
Introduction to MCA
An example
56
vmulps %xmm0, %xmm1, %xmm2


vhaddps %xmm2, %xmm2, %xmm3


vhaddps %xmm3, %xmm3, %xmm4
test/tools/llvm-mca/X86/BtVer2/dot-product.s Timeline
012345


Index 0123456789


[0,0] DeeER. . . vmulps


[0,1] D==eeeER . . vhaddps


[0,2] .D====eeeER . vhaddps


[1,0] .DeeE-----R . vmulps


[1,1] . D=eeeE---R . vhaddps


[1,2] . D====eeeER . vhaddps


[2,0] . DeeE-----R . vmulps


[2,1] . D====eeeER . vhaddps


[2,2] . D======eeeER vhaddps
llvm-mca -mtriple=x86_64 -mcpu=btver2 


-iterations=300 dot-products.s
Introduction to MCA
An example
56
vmulps %xmm0, %xmm1, %xmm2


vhaddps %xmm2, %xmm2, %xmm3


vhaddps %xmm3, %xmm3, %xmm4
test/tools/llvm-mca/X86/BtVer2/dot-product.s Timeline
012345


Index 0123456789


[0,0] DeeER. . . vmulps


[0,1] D==eeeER . . vhaddps


[0,2] .D====eeeER . vhaddps


[1,0] .DeeE-----R . vmulps


[1,1] . D=eeeE---R . vhaddps


[1,2] . D====eeeER . vhaddps


[2,0] . DeeE-----R . vmulps


[2,1] . D====eeeER . vhaddps


[2,2] . D======eeeER vhaddps
llvm-mca -mtriple=x86_64 -mcpu=btver2 


-iterations=300 dot-products.s

More Related Content

What's hot

Linux 4.x Tracing Tools: Using BPF Superpowers
Linux 4.x Tracing Tools: Using BPF SuperpowersLinux 4.x Tracing Tools: Using BPF Superpowers
Linux 4.x Tracing Tools: Using BPF SuperpowersBrendan Gregg
 
eBPF Perf Tools 2019
eBPF Perf Tools 2019eBPF Perf Tools 2019
eBPF Perf Tools 2019Brendan Gregg
 
RISC-V : Berkeley Boot Loader & Proxy Kernelのソースコード解析
RISC-V : Berkeley Boot Loader & Proxy Kernelのソースコード解析RISC-V : Berkeley Boot Loader & Proxy Kernelのソースコード解析
RISC-V : Berkeley Boot Loader & Proxy Kernelのソースコード解析Mr. Vengineer
 
ARM LinuxのMMUはわかりにくい
ARM LinuxのMMUはわかりにくいARM LinuxのMMUはわかりにくい
ARM LinuxのMMUはわかりにくいwata2ki
 
Linux Networking Explained
Linux Networking ExplainedLinux Networking Explained
Linux Networking ExplainedThomas Graf
 
Meet cute-between-ebpf-and-tracing
Meet cute-between-ebpf-and-tracingMeet cute-between-ebpf-and-tracing
Meet cute-between-ebpf-and-tracingViller Hsiao
 
U-Boot Porting on New Hardware
U-Boot Porting on New HardwareU-Boot Porting on New Hardware
U-Boot Porting on New HardwareRuggedBoardGroup
 
Launch the First Process in Linux System
Launch the First Process in Linux SystemLaunch the First Process in Linux System
Launch the First Process in Linux SystemJian-Hong Pan
 
BPF: Tracing and more
BPF: Tracing and moreBPF: Tracing and more
BPF: Tracing and moreBrendan Gregg
 
Container Performance Analysis
Container Performance AnalysisContainer Performance Analysis
Container Performance AnalysisBrendan Gregg
 
Linux BPF Superpowers
Linux BPF SuperpowersLinux BPF Superpowers
Linux BPF SuperpowersBrendan Gregg
 
P4, EPBF, and Linux TC Offload
P4, EPBF, and Linux TC OffloadP4, EPBF, and Linux TC Offload
P4, EPBF, and Linux TC OffloadOpen-NFP
 
Building Network Functions with eBPF & BCC
Building Network Functions with eBPF & BCCBuilding Network Functions with eBPF & BCC
Building Network Functions with eBPF & BCCKernel TLV
 
Performance Analysis Tools for Linux Kernel
Performance Analysis Tools for Linux KernelPerformance Analysis Tools for Linux Kernel
Performance Analysis Tools for Linux Kernellcplcp1
 
Linux kernel debugging
Linux kernel debuggingLinux kernel debugging
Linux kernel debuggingHao-Ran Liu
 
QEMU - Binary Translation
QEMU - Binary Translation QEMU - Binary Translation
QEMU - Binary Translation Jiann-Fuh Liaw
 
Jagan Teki - U-boot from scratch
Jagan Teki - U-boot from scratchJagan Teki - U-boot from scratch
Jagan Teki - U-boot from scratchlinuxlab_conf
 
Browsing Linux Kernel Source
Browsing Linux Kernel SourceBrowsing Linux Kernel Source
Browsing Linux Kernel SourceMotaz Saad
 
Linux Performance Tools
Linux Performance ToolsLinux Performance Tools
Linux Performance ToolsBrendan Gregg
 

What's hot (20)

Linux 4.x Tracing Tools: Using BPF Superpowers
Linux 4.x Tracing Tools: Using BPF SuperpowersLinux 4.x Tracing Tools: Using BPF Superpowers
Linux 4.x Tracing Tools: Using BPF Superpowers
 
eBPF Perf Tools 2019
eBPF Perf Tools 2019eBPF Perf Tools 2019
eBPF Perf Tools 2019
 
RISC-V : Berkeley Boot Loader & Proxy Kernelのソースコード解析
RISC-V : Berkeley Boot Loader & Proxy Kernelのソースコード解析RISC-V : Berkeley Boot Loader & Proxy Kernelのソースコード解析
RISC-V : Berkeley Boot Loader & Proxy Kernelのソースコード解析
 
Hands-on ethernet driver
Hands-on ethernet driverHands-on ethernet driver
Hands-on ethernet driver
 
ARM LinuxのMMUはわかりにくい
ARM LinuxのMMUはわかりにくいARM LinuxのMMUはわかりにくい
ARM LinuxのMMUはわかりにくい
 
Linux Networking Explained
Linux Networking ExplainedLinux Networking Explained
Linux Networking Explained
 
Meet cute-between-ebpf-and-tracing
Meet cute-between-ebpf-and-tracingMeet cute-between-ebpf-and-tracing
Meet cute-between-ebpf-and-tracing
 
U-Boot Porting on New Hardware
U-Boot Porting on New HardwareU-Boot Porting on New Hardware
U-Boot Porting on New Hardware
 
Launch the First Process in Linux System
Launch the First Process in Linux SystemLaunch the First Process in Linux System
Launch the First Process in Linux System
 
BPF: Tracing and more
BPF: Tracing and moreBPF: Tracing and more
BPF: Tracing and more
 
Container Performance Analysis
Container Performance AnalysisContainer Performance Analysis
Container Performance Analysis
 
Linux BPF Superpowers
Linux BPF SuperpowersLinux BPF Superpowers
Linux BPF Superpowers
 
P4, EPBF, and Linux TC Offload
P4, EPBF, and Linux TC OffloadP4, EPBF, and Linux TC Offload
P4, EPBF, and Linux TC Offload
 
Building Network Functions with eBPF & BCC
Building Network Functions with eBPF & BCCBuilding Network Functions with eBPF & BCC
Building Network Functions with eBPF & BCC
 
Performance Analysis Tools for Linux Kernel
Performance Analysis Tools for Linux KernelPerformance Analysis Tools for Linux Kernel
Performance Analysis Tools for Linux Kernel
 
Linux kernel debugging
Linux kernel debuggingLinux kernel debugging
Linux kernel debugging
 
QEMU - Binary Translation
QEMU - Binary Translation QEMU - Binary Translation
QEMU - Binary Translation
 
Jagan Teki - U-boot from scratch
Jagan Teki - U-boot from scratchJagan Teki - U-boot from scratch
Jagan Teki - U-boot from scratch
 
Browsing Linux Kernel Source
Browsing Linux Kernel SourceBrowsing Linux Kernel Source
Browsing Linux Kernel Source
 
Linux Performance Tools
Linux Performance ToolsLinux Performance Tools
Linux Performance Tools
 

Similar to MCA Daemon: Hybrid Throughput Analysis Beyond Basic Blocks

Intel Atom Processor Pre-Silicon Verification Experience
Intel Atom Processor Pre-Silicon Verification ExperienceIntel Atom Processor Pre-Silicon Verification Experience
Intel Atom Processor Pre-Silicon Verification ExperienceDVClub
 
Performance tuning Grails applications SpringOne 2GX 2014
Performance tuning Grails applications SpringOne 2GX 2014Performance tuning Grails applications SpringOne 2GX 2014
Performance tuning Grails applications SpringOne 2GX 2014Lari Hotari
 
SFScon 21 - Matteo Camilli - Performance assessment of microservices with str...
SFScon 21 - Matteo Camilli - Performance assessment of microservices with str...SFScon 21 - Matteo Camilli - Performance assessment of microservices with str...
SFScon 21 - Matteo Camilli - Performance assessment of microservices with str...South Tyrol Free Software Conference
 
It Does What You Say, Not What You Mean: Lessons From A Decade of Program Repair
It Does What You Say, Not What You Mean: Lessons From A Decade of Program RepairIt Does What You Say, Not What You Mean: Lessons From A Decade of Program Repair
It Does What You Say, Not What You Mean: Lessons From A Decade of Program RepairClaire Le Goues
 
Huawei Advanced Data Science With Spark Streaming
Huawei Advanced Data Science With Spark StreamingHuawei Advanced Data Science With Spark Streaming
Huawei Advanced Data Science With Spark StreamingJen Aman
 
Serverless on OpenStack with Docker Swarm, Mistral, and StackStorm
Serverless on OpenStack with Docker Swarm, Mistral, and StackStormServerless on OpenStack with Docker Swarm, Mistral, and StackStorm
Serverless on OpenStack with Docker Swarm, Mistral, and StackStormDmitri Zimine
 
Performance tuning Grails Applications GR8Conf US 2014
Performance tuning Grails Applications GR8Conf US 2014Performance tuning Grails Applications GR8Conf US 2014
Performance tuning Grails Applications GR8Conf US 2014Lari Hotari
 
5 Steps on the Way to Continuous Delivery
5 Steps on the Way to Continuous Delivery5 Steps on the Way to Continuous Delivery
5 Steps on the Way to Continuous DeliveryXebiaLabs
 
Chill, Distill, No Overkill: Best Practices to Stress Test Kafka with Siva Ku...
Chill, Distill, No Overkill: Best Practices to Stress Test Kafka with Siva Ku...Chill, Distill, No Overkill: Best Practices to Stress Test Kafka with Siva Ku...
Chill, Distill, No Overkill: Best Practices to Stress Test Kafka with Siva Ku...HostedbyConfluent
 
Automating Speed: A Proven Approach to Preventing Performance Regressions in ...
Automating Speed: A Proven Approach to Preventing Performance Regressions in ...Automating Speed: A Proven Approach to Preventing Performance Regressions in ...
Automating Speed: A Proven Approach to Preventing Performance Regressions in ...HostedbyConfluent
 
Web Application Release
Web Application ReleaseWeb Application Release
Web Application ReleasePiyush Mattoo
 
Planning and Control Algorithms Model-Based Approach (State-Space)
Planning and Control Algorithms Model-Based Approach (State-Space)Planning and Control Algorithms Model-Based Approach (State-Space)
Planning and Control Algorithms Model-Based Approach (State-Space)M Reza Rahmati
 
Streaming meetup
Streaming meetupStreaming meetup
Streaming meetupkarthik_krk
 
Performance tuning Grails applications
Performance tuning Grails applicationsPerformance tuning Grails applications
Performance tuning Grails applicationsLari Hotari
 
Case Study: Automating Code Reviews for Custom SAP ABAP Applications with Vir...
Case Study: Automating Code Reviews for Custom SAP ABAP Applications with Vir...Case Study: Automating Code Reviews for Custom SAP ABAP Applications with Vir...
Case Study: Automating Code Reviews for Custom SAP ABAP Applications with Vir...Virtual Forge
 
The Diabolical Developers Guide to Performance Tuning
The Diabolical Developers Guide to Performance TuningThe Diabolical Developers Guide to Performance Tuning
The Diabolical Developers Guide to Performance TuningjClarity
 

Similar to MCA Daemon: Hybrid Throughput Analysis Beyond Basic Blocks (20)

Intel Atom Processor Pre-Silicon Verification Experience
Intel Atom Processor Pre-Silicon Verification ExperienceIntel Atom Processor Pre-Silicon Verification Experience
Intel Atom Processor Pre-Silicon Verification Experience
 
Java Performance Tuning
Java Performance TuningJava Performance Tuning
Java Performance Tuning
 
Performance tuning Grails applications SpringOne 2GX 2014
Performance tuning Grails applications SpringOne 2GX 2014Performance tuning Grails applications SpringOne 2GX 2014
Performance tuning Grails applications SpringOne 2GX 2014
 
SFScon 21 - Matteo Camilli - Performance assessment of microservices with str...
SFScon 21 - Matteo Camilli - Performance assessment of microservices with str...SFScon 21 - Matteo Camilli - Performance assessment of microservices with str...
SFScon 21 - Matteo Camilli - Performance assessment of microservices with str...
 
It Does What You Say, Not What You Mean: Lessons From A Decade of Program Repair
It Does What You Say, Not What You Mean: Lessons From A Decade of Program RepairIt Does What You Say, Not What You Mean: Lessons From A Decade of Program Repair
It Does What You Say, Not What You Mean: Lessons From A Decade of Program Repair
 
Huawei Advanced Data Science With Spark Streaming
Huawei Advanced Data Science With Spark StreamingHuawei Advanced Data Science With Spark Streaming
Huawei Advanced Data Science With Spark Streaming
 
Serverless on OpenStack with Docker Swarm, Mistral, and StackStorm
Serverless on OpenStack with Docker Swarm, Mistral, and StackStormServerless on OpenStack with Docker Swarm, Mistral, and StackStorm
Serverless on OpenStack with Docker Swarm, Mistral, and StackStorm
 
Spyglass dft
Spyglass dftSpyglass dft
Spyglass dft
 
Performance tuning Grails Applications GR8Conf US 2014
Performance tuning Grails Applications GR8Conf US 2014Performance tuning Grails Applications GR8Conf US 2014
Performance tuning Grails Applications GR8Conf US 2014
 
5 Steps on the Way to Continuous Delivery
5 Steps on the Way to Continuous Delivery5 Steps on the Way to Continuous Delivery
5 Steps on the Way to Continuous Delivery
 
Chill, Distill, No Overkill: Best Practices to Stress Test Kafka with Siva Ku...
Chill, Distill, No Overkill: Best Practices to Stress Test Kafka with Siva Ku...Chill, Distill, No Overkill: Best Practices to Stress Test Kafka with Siva Ku...
Chill, Distill, No Overkill: Best Practices to Stress Test Kafka with Siva Ku...
 
Automating Speed: A Proven Approach to Preventing Performance Regressions in ...
Automating Speed: A Proven Approach to Preventing Performance Regressions in ...Automating Speed: A Proven Approach to Preventing Performance Regressions in ...
Automating Speed: A Proven Approach to Preventing Performance Regressions in ...
 
Web Application Release
Web Application ReleaseWeb Application Release
Web Application Release
 
Værktøjer udviklet på AAU til analyse af SCJ programmer
Værktøjer udviklet på AAU til analyse af SCJ programmerVærktøjer udviklet på AAU til analyse af SCJ programmer
Værktøjer udviklet på AAU til analyse af SCJ programmer
 
Planning and Control Algorithms Model-Based Approach (State-Space)
Planning and Control Algorithms Model-Based Approach (State-Space)Planning and Control Algorithms Model-Based Approach (State-Space)
Planning and Control Algorithms Model-Based Approach (State-Space)
 
Streaming meetup
Streaming meetupStreaming meetup
Streaming meetup
 
When Should I Use Simulation?
When Should I Use Simulation?When Should I Use Simulation?
When Should I Use Simulation?
 
Performance tuning Grails applications
Performance tuning Grails applicationsPerformance tuning Grails applications
Performance tuning Grails applications
 
Case Study: Automating Code Reviews for Custom SAP ABAP Applications with Vir...
Case Study: Automating Code Reviews for Custom SAP ABAP Applications with Vir...Case Study: Automating Code Reviews for Custom SAP ABAP Applications with Vir...
Case Study: Automating Code Reviews for Custom SAP ABAP Applications with Vir...
 
The Diabolical Developers Guide to Performance Tuning
The Diabolical Developers Guide to Performance TuningThe Diabolical Developers Guide to Performance Tuning
The Diabolical Developers Guide to Performance Tuning
 

More from Min-Yih Hsu

Debug Information And Where They Come From
Debug Information And Where They Come FromDebug Information And Where They Come From
Debug Information And Where They Come FromMin-Yih Hsu
 
Handling inline assembly in Clang and LLVM
Handling inline assembly in Clang and LLVMHandling inline assembly in Clang and LLVM
Handling inline assembly in Clang and LLVMMin-Yih Hsu
 
How to write a TableGen backend
How to write a TableGen backendHow to write a TableGen backend
How to write a TableGen backendMin-Yih Hsu
 
[COSCUP 2021] LLVM Project: The Good, The Bad, and The Ugly
[COSCUP 2021] LLVM Project: The Good, The Bad, and The Ugly[COSCUP 2021] LLVM Project: The Good, The Bad, and The Ugly
[COSCUP 2021] LLVM Project: The Good, The Bad, and The UglyMin-Yih Hsu
 
[TGSA Academic Friday] How To Train Your Dragon - Intro to Modern Compiler Te...
[TGSA Academic Friday] How To Train Your Dragon - Intro to Modern Compiler Te...[TGSA Academic Friday] How To Train Your Dragon - Intro to Modern Compiler Te...
[TGSA Academic Friday] How To Train Your Dragon - Intro to Modern Compiler Te...Min-Yih Hsu
 
Paper Study - Demand-Driven Computation of Interprocedural Data Flow
Paper Study - Demand-Driven Computation of Interprocedural Data FlowPaper Study - Demand-Driven Computation of Interprocedural Data Flow
Paper Study - Demand-Driven Computation of Interprocedural Data FlowMin-Yih Hsu
 
Paper Study - Incremental Data-Flow Analysis Algorithms by Ryder et al
Paper Study - Incremental Data-Flow Analysis Algorithms by Ryder et alPaper Study - Incremental Data-Flow Analysis Algorithms by Ryder et al
Paper Study - Incremental Data-Flow Analysis Algorithms by Ryder et alMin-Yih Hsu
 
Souper-Charging Peepholes with Target Machine Info
Souper-Charging Peepholes with Target Machine InfoSouper-Charging Peepholes with Target Machine Info
Souper-Charging Peepholes with Target Machine InfoMin-Yih Hsu
 
From V8 to Modern Compilers
From V8 to Modern CompilersFrom V8 to Modern Compilers
From V8 to Modern CompilersMin-Yih Hsu
 
Introduction to Khronos SYCL
Introduction to Khronos SYCLIntroduction to Khronos SYCL
Introduction to Khronos SYCLMin-Yih Hsu
 
Trace Scheduling
Trace SchedulingTrace Scheduling
Trace SchedulingMin-Yih Hsu
 
Polymer Start-Up (SITCON 2016)
Polymer Start-Up (SITCON 2016)Polymer Start-Up (SITCON 2016)
Polymer Start-Up (SITCON 2016)Min-Yih Hsu
 
War of Native Speed on Web (SITCON2016)
War of Native Speed on Web (SITCON2016)War of Native Speed on Web (SITCON2016)
War of Native Speed on Web (SITCON2016)Min-Yih Hsu
 
From Android NDK To AOSP
From Android NDK To AOSPFrom Android NDK To AOSP
From Android NDK To AOSPMin-Yih Hsu
 

More from Min-Yih Hsu (14)

Debug Information And Where They Come From
Debug Information And Where They Come FromDebug Information And Where They Come From
Debug Information And Where They Come From
 
Handling inline assembly in Clang and LLVM
Handling inline assembly in Clang and LLVMHandling inline assembly in Clang and LLVM
Handling inline assembly in Clang and LLVM
 
How to write a TableGen backend
How to write a TableGen backendHow to write a TableGen backend
How to write a TableGen backend
 
[COSCUP 2021] LLVM Project: The Good, The Bad, and The Ugly
[COSCUP 2021] LLVM Project: The Good, The Bad, and The Ugly[COSCUP 2021] LLVM Project: The Good, The Bad, and The Ugly
[COSCUP 2021] LLVM Project: The Good, The Bad, and The Ugly
 
[TGSA Academic Friday] How To Train Your Dragon - Intro to Modern Compiler Te...
[TGSA Academic Friday] How To Train Your Dragon - Intro to Modern Compiler Te...[TGSA Academic Friday] How To Train Your Dragon - Intro to Modern Compiler Te...
[TGSA Academic Friday] How To Train Your Dragon - Intro to Modern Compiler Te...
 
Paper Study - Demand-Driven Computation of Interprocedural Data Flow
Paper Study - Demand-Driven Computation of Interprocedural Data FlowPaper Study - Demand-Driven Computation of Interprocedural Data Flow
Paper Study - Demand-Driven Computation of Interprocedural Data Flow
 
Paper Study - Incremental Data-Flow Analysis Algorithms by Ryder et al
Paper Study - Incremental Data-Flow Analysis Algorithms by Ryder et alPaper Study - Incremental Data-Flow Analysis Algorithms by Ryder et al
Paper Study - Incremental Data-Flow Analysis Algorithms by Ryder et al
 
Souper-Charging Peepholes with Target Machine Info
Souper-Charging Peepholes with Target Machine InfoSouper-Charging Peepholes with Target Machine Info
Souper-Charging Peepholes with Target Machine Info
 
From V8 to Modern Compilers
From V8 to Modern CompilersFrom V8 to Modern Compilers
From V8 to Modern Compilers
 
Introduction to Khronos SYCL
Introduction to Khronos SYCLIntroduction to Khronos SYCL
Introduction to Khronos SYCL
 
Trace Scheduling
Trace SchedulingTrace Scheduling
Trace Scheduling
 
Polymer Start-Up (SITCON 2016)
Polymer Start-Up (SITCON 2016)Polymer Start-Up (SITCON 2016)
Polymer Start-Up (SITCON 2016)
 
War of Native Speed on Web (SITCON2016)
War of Native Speed on Web (SITCON2016)War of Native Speed on Web (SITCON2016)
War of Native Speed on Web (SITCON2016)
 
From Android NDK To AOSP
From Android NDK To AOSPFrom Android NDK To AOSP
From Android NDK To AOSP
 

Recently uploaded

Unveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML DiagramsUnveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML DiagramsAhmed Mohamed
 
What is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWhat is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWave PLM
 
Folding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesFolding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesPhilip Schwarz
 
Implementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureImplementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureDinusha Kumarasiri
 
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...Christina Lin
 
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideBuilding Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideChristina Lin
 
Asset Management Software - Infographic
Asset Management Software - InfographicAsset Management Software - Infographic
Asset Management Software - InfographicHr365.us smith
 
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdfGOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdfAlina Yurenko
 
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样umasea
 
EY_Graph Database Powered Sustainability
EY_Graph Database Powered SustainabilityEY_Graph Database Powered Sustainability
EY_Graph Database Powered SustainabilityNeo4j
 
Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Andreas Granig
 
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...stazi3110
 
Cloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEECloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEEVICTOR MAESTRE RAMIREZ
 
Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...OnePlan Solutions
 
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...gurkirankumar98700
 
Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)OPEN KNOWLEDGE GmbH
 
Salesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantSalesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantAxelRicardoTrocheRiq
 
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...soniya singh
 
software engineering Chapter 5 System modeling.pptx
software engineering Chapter 5 System modeling.pptxsoftware engineering Chapter 5 System modeling.pptx
software engineering Chapter 5 System modeling.pptxnada99848
 

Recently uploaded (20)

Unveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML DiagramsUnveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML Diagrams
 
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort ServiceHot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
 
What is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWhat is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need It
 
Folding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesFolding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a series
 
Implementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureImplementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with Azure
 
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
 
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideBuilding Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
 
Asset Management Software - Infographic
Asset Management Software - InfographicAsset Management Software - Infographic
Asset Management Software - Infographic
 
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdfGOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
 
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
 
EY_Graph Database Powered Sustainability
EY_Graph Database Powered SustainabilityEY_Graph Database Powered Sustainability
EY_Graph Database Powered Sustainability
 
Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024
 
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
 
Cloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEECloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEE
 
Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...
 
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
 
Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)
 
Salesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantSalesforce Certified Field Service Consultant
Salesforce Certified Field Service Consultant
 
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
 
software engineering Chapter 5 System modeling.pptx
software engineering Chapter 5 System modeling.pptxsoftware engineering Chapter 5 System modeling.pptx
software engineering Chapter 5 System modeling.pptx
 

MCA Daemon: Hybrid Throughput Analysis Beyond Basic Blocks

  • 1. Min-Yih “Min” Hsu, David Gens, Michael Franz. University of California, Irvine MCA Daemon Hybrid Throughput Analysis Beyond Basic Blocks Keynote, EuroLLVM 2022
  • 8. Genesis: Assured Micro Patching (AMP) 4
  • 9. Genesis: Assured Micro Patching (AMP) • A research project initiated by United States DARPA to assure the correctness of binary patching with little or no source code 4
  • 10. Genesis: Assured Micro Patching (AMP) • A research project initiated by United States DARPA to assure the correctness of binary patching with little or no source code • Including functional and timing aspects 4
  • 11. Genesis: Assured Micro Patching (AMP) • A research project initiated by United States DARPA to assure the correctness of binary patching with little or no source code • Including functional and timing aspects • Focuses on small (micro) binary patches 4
  • 12. Genesis: Assured Micro Patching (AMP) • A research project initiated by United States DARPA to assure the correctness of binary patching with little or no source code • Including functional and timing aspects • Focuses on small (micro) binary patches • Focuses on embedded systems 4
  • 13. Genesis: Assured Micro Patching (AMP) • A research project initiated by United States DARPA to assure the correctness of binary patching with little or no source code • Including functional and timing aspects • Focuses on small (micro) binary patches • Focuses on embedded systems • UCI was studying the timing impacts of binary patches 4
  • 14. Genesis: Assured Micro Patching (AMP) • A research project initiated by United States DARPA to assure the correctness of binary patching with little or no source code • Including functional and timing aspects • Focuses on small (micro) binary patches • Focuses on embedded systems • UCI was studying the timing impacts of binary patches • Example: After the fi rmware on a truck is binary-patched to prevent brakes from locking up, we need to make sure latencies do not degrade terribly 4
  • 15. Timing impacts of binary patches Problem de fi nition 5 Original Program
  • 16. Timing impacts of binary patches Problem de fi nition 5 Original Program Small Binary Patch Patched Program
  • 17. Timing impacts of binary patches Problem de fi nition 5 Original Program Small Binary Patch Patched Program Original Program Same set of inputs
  • 18. Timing impacts of binary patches Problem de fi nition 5 Original Program Small Binary Patch Patched Program Original Program ΔT? Same set of inputs
  • 21. Execution time assessment Interesting use cases • Predicting program run time in remote environments or time-sensitive applications • Examples: fi rmware in cars or satellite (e.g. Kepler space telescope by NASA) 7
  • 22. Execution time assessment Interesting use cases • Predicting program run time in remote environments or time-sensitive applications • Examples: fi rmware in cars or satellite (e.g. Kepler space telescope by NASA) • Performance analysis • Insights into performance bottlenecks 7
  • 23. Execution time assessment Interesting use cases • Predicting program run time in remote environments or time-sensitive applications • Examples: fi rmware in cars or satellite (e.g. Kepler space telescope by NASA) • Performance analysis • Insights into performance bottlenecks • Examples: Potential CPU pipeline stalling, GPU memory bank con fl icts 7
  • 25. Execution time assessment Previous e ff orts • Static approaches 8
  • 26. Execution time assessment Previous e ff orts • Static approaches • Throughput analysis: predicting the cycle counts for linear code (e.g. basic block, loop) statically 8
  • 27. Execution time assessment Previous e ff orts • Static approaches • Throughput analysis: predicting the cycle counts for linear code (e.g. basic block, loop) statically • Examples: IACA, OSACA, uiCA, LLVM MCA, Ithemal 8
  • 28. Execution time assessment Previous e ff orts • Static approaches • Throughput analysis: predicting the cycle counts for linear code (e.g. basic block, loop) statically • Examples: IACA, OSACA, uiCA, LLVM MCA, Ithemal • Dynamic approaches • Cycle-accurate simulators / emulators 8
  • 29. Execution time assessment Previous e ff orts • Static approaches • Throughput analysis: predicting the cycle counts for linear code (e.g. basic block, loop) statically • Examples: IACA, OSACA, uiCA, LLVM MCA, Ithemal • Dynamic approaches • Cycle-accurate simulators / emulators • Examples: gem5, gpgpu-sim 8
  • 31. Execution time assessment Challenges 9 Static Dynamic High Low Precision
  • 32. Execution time assessment Challenges 9 Static Dynamic High Low Precision • Complete execution traces • Higher fi delity on hardware details
  • 33. Execution time assessment Challenges 9 Static Dynamic High Low Precision • Poor handling on branches & function calls • Small scope (only few blocks) • Lack of run-time information • Complete execution traces • Higher fi delity on hardware details
  • 34. Execution time assessment Challenges 9 Static Dynamic High Low Precision Fast Slow Turnaround • Poor handling on branches & function calls • Small scope (only few blocks) • Lack of run-time information • Complete execution traces • Higher fi delity on hardware details
  • 35. Execution time assessment Challenges 9 Static Dynamic High Low Precision Fast Slow Turnaround • Poor handling on branches & function calls • Small scope (only few blocks) • Lack of run-time information • Complete execution traces • Higher fi delity on hardware details • Faster analysis speed (due to coarser granularity) • Easier integration with other tools
  • 36. Execution time assessment Challenges 9 Static Dynamic High Low Precision Fast Slow Turnaround • Poor handling on branches & function calls • Small scope (only few blocks) • Lack of run-time information • Complete execution traces • Higher fi delity on hardware details • Faster analysis speed (due to coarser granularity) • Easier integration with other tools • Usually require non-trivial setup • Slow simulation speed
  • 37. Execution time assessment Challenges 9 Static Dynamic High Low Precision Fast Slow Turnaround ? • Poor handling on branches & function calls • Small scope (only few blocks) • Lack of run-time information • Complete execution traces • Higher fi delity on hardware details • Faster analysis speed (due to coarser granularity) • Easier integration with other tools • Usually require non-trivial setup • Slow simulation speed
  • 39. MCA Daemon (MCAD) High-level concept 11 Dynamic Runtime Static Throughput Analysis Tool Target Program
  • 40. MCA Daemon (MCAD) High-level concept 11 Dynamic Runtime Static Throughput Analysis Tool Target Program Execution Trace
  • 41. MCA Daemon (MCAD) High-level concept 11 Dynamic Runtime Static Throughput Analysis Tool Target Program Execution Trace • The instructions that just got executed • Run-time values (e.g. register values)
  • 42. MCA Daemon (MCAD) High-level concept 11 Dynamic Runtime Static Throughput Analysis Tool Target Program Execution Trace Online Environment Process 1 Process 2 • The instructions that just got executed • Run-time values (e.g. register values)
  • 43. MCA Daemon (MCAD) High-level concept 11 Dynamic Runtime Static Throughput Analysis Tool Target Program Execution Trace Online Environment Process 1 Process 2 Streaming • The instructions that just got executed • Run-time values (e.g. register values)
  • 44. MCA Daemon (MCAD) High-level concept 12 QEMU LLVM MCA Libraries Target Program Online Environment Process 1 Process 2 Execution Trace Streaming
  • 46. Introduction to LLVM MCA • A tool (llvm-mca) and library (libLLVMMCA) for predicting cycle counts and potential performance hazards in a sequence of assembly code 13
  • 47. Introduction to LLVM MCA • A tool (llvm-mca) and library (libLLVMMCA) for predicting cycle counts and potential performance hazards in a sequence of assembly code • Using instruction scheduling data (e.g. instruction latency) provided by each LLVM target • New ISA (with proper scheduling info) can be supported out of the box 13
  • 48. Introduction to LLVM MCA • A tool (llvm-mca) and library (libLLVMMCA) for predicting cycle counts and potential performance hazards in a sequence of assembly code • Using instruction scheduling data (e.g. instruction latency) provided by each LLVM target • New ISA (with proper scheduling info) can be supported out of the box • Accounting for modern processor features: super scalar, out-of-order etc. 13
  • 49. Introduction to LLVM MCA • A tool (llvm-mca) and library (libLLVMMCA) for predicting cycle counts and potential performance hazards in a sequence of assembly code • Using instruction scheduling data (e.g. instruction latency) provided by each LLVM target • New ISA (with proper scheduling info) can be supported out of the box • Accounting for modern processor features: super scalar, out-of-order etc. • Implemented via lightweight simulation • Abstract real CPU pipeline stages into a small handful of stages 13
  • 50. Introduction to MCA An example 14 vmulps %xmm0, %xmm1, %xmm2 vhaddps %xmm2, %xmm2, %xmm3 vhaddps %xmm3, %xmm3, %xmm4 test/tools/llvm-mca/X86/BtVer2/dot-product.s
  • 51. Introduction to MCA An example 14 vmulps %xmm0, %xmm1, %xmm2 vhaddps %xmm2, %xmm2, %xmm3 vhaddps %xmm3, %xmm3, %xmm4 test/tools/llvm-mca/X86/BtVer2/dot-product.s llvm-mca -mtriple=x86_64 -mcpu=btver2 -iterations=300 dot-products.s
  • 52. Introduction to MCA An example 14 vmulps %xmm0, %xmm1, %xmm2 vhaddps %xmm2, %xmm2, %xmm3 vhaddps %xmm3, %xmm3, %xmm4 test/tools/llvm-mca/X86/BtVer2/dot-product.s Summary Iterations: 300 Instructions: 900 Total Cycles: 610 Total uOps: 900 Dispatch Width: 2 uOps Per Cycle: 1.48 IPC: 1.48 Block RThroughput: 2.0 llvm-mca -mtriple=x86_64 -mcpu=btver2 -iterations=300 dot-products.s
  • 55. 15 QEMU LLVM MCA Libraries Target Program Process 1 Process 2 Execution Trace Streaming llvm-mca Assembly fi le LLVM MCA Libraries llvm-mca MCAD
  • 57. MCA Daemon (MCAD) Highlights • Combine the advantages of dynamic & static throughput analysis 16
  • 58. MCA Daemon (MCAD) Highlights • Combine the advantages of dynamic & static throughput analysis • Augment the analysis region beyond basic blocks • MCAD is able to analyze the entire program execution trace 16
  • 59. MCA Daemon (MCAD) Highlights • Combine the advantages of dynamic & static throughput analysis • Augment the analysis region beyond basic blocks • MCAD is able to analyze the entire program execution trace • Throughput analysis is happening in parallel / on-the-fly with the target program execution 16
  • 61. Analyze execution traces using MCA Using unmodi fi ed MCA libraries 18 QEMU Target Program Executed instructions LLVM MCA Libraries Disassembler
  • 62. Analyze execution traces using MCA Challenge: Sequential work fl ow 19 QEMU Target Program Blocked until QEMU is fi nished Executed instructions LLVM MCA Libraries Disassembler
  • 67. MCA internal 20 MCInst mca::Instruction Stage 0 Stage 1 Stage N …. MCA Simulation Pipeline mca::SourceMgr Assembly fi le
  • 68. MCA internal 20 MCInst mca::Instruction Stage 0 Stage 1 Stage N …. MCA Simulation Pipeline Display mca::SourceMgr Assembly fi le
  • 69. MCA with execution trace stream as input 21 MCInst mca::Instruction Stage 0 Stage 1 Stage N …. MCA Simulation Pipeline Display mca::SourceMgr Execution Trace Stream
  • 70. MCA with execution trace stream as input 21 MCInst mca::Instruction Stage 0 Stage 1 Stage N …. MCA Simulation Pipeline Display mca::SourceMgr Execution Trace Stream Blocking
  • 71. Incremental SourceMgr 22 MCInst mca::Instruction Stage 0 Stage 1 Stage N …. MCA Simulation Pipeline Display mca::IncrementalSourceMgr Execution Trace Stream
  • 72. Incremental SourceMgr 22 MCInst mca::Instruction Stage 0 Stage 1 Stage N …. MCA Simulation Pipeline Display Making mca::Instruction available to simulation pipeline right away mca::IncrementalSourceMgr Execution Trace Stream
  • 73. Incremental SourceMgr 23 MCInst mca::Instruction Stage 0 Stage 1 Stage N …. MCA Simulation Pipeline mca::IncrementalSourceMgr Execution Trace Stream QEMU
  • 74. Incremental SourceMgr 23 MCInst mca::Instruction Stage 0 Stage 1 Stage N …. MCA Simulation Pipeline mca::IncrementalSourceMgr Execution Trace Stream QEMU Process 2 Process 1
  • 75. Incremental SourceMgr Implement with threads 24 MCInst mca::Instruction Stage 0 Stage 1 Stage N …. MCA Simulation Pipeline mca::IncrementalSourceMgr Execution Trace Stream QEMU Process 1
  • 76. Incremental SourceMgr Implement with threads 24 MCInst mca::Instruction Stage 0 Stage 1 Stage N …. MCA Simulation Pipeline mca::IncrementalSourceMgr Execution Trace Stream QEMU Process 1 Simulation Thread
  • 77. Incremental SourceMgr Implement with threads 24 MCInst mca::Instruction Stage 0 Stage 1 Stage N …. MCA Simulation Pipeline mca::IncrementalSourceMgr Execution Trace Stream QEMU Process 1 Simulation Thread mca::Instruction fetching loop
  • 78. Incremental SourceMgr Implement with threads 24 MCInst mca::Instruction Stage 0 Stage 1 Stage N …. MCA Simulation Pipeline mca::IncrementalSourceMgr Execution Trace Stream QEMU Process 1 Simulation Thread Receiver Thread mca::Instruction fetching loop
  • 79. Incremental SourceMgr Implement with threads: Pros & Cons 25 MCInst mca::Instruction Stage 0 Stage 1 Stage N …. MCA Simulation Pipeline mca::IncrementalSourceMgr Execution Trace Stream QEMU Process 1 Simulation Thread Receiver Thread
  • 80. Incremental SourceMgr Implement with threads: Pros & Cons 25 MCInst mca::Instruction Stage 0 Stage 1 Stage N …. MCA Simulation Pipeline mca::IncrementalSourceMgr Execution Trace Stream QEMU Process 1 Simulation Thread Receiver Thread Pros: No modi fi cation on the simulation pipeline
  • 81. Incremental SourceMgr Implement with threads: Pros & Cons 25 MCInst mca::Instruction Stage 0 Stage 1 Stage N …. MCA Simulation Pipeline mca::IncrementalSourceMgr Execution Trace Stream QEMU Process 1 Simulation Thread Receiver Thread Pros: No modi fi cation on the simulation pipeline Cons: To use IncrementalSourceMgr, you have to use threads
  • 82. Incremental SourceMgr Better solution: Resumable simulation pipeline 26 MCInst mca::Instruction Stage 0 Stage 1 Stage N …. Resumable Simulation Pipeline mca::IncrementalSourceMgr Execution Trace Stream QEMU
  • 83. Incremental SourceMgr Better solution: Resumable simulation pipeline 26 MCInst mca::Instruction Stage 0 Stage 1 Stage N …. Resumable Simulation Pipeline mca::IncrementalSourceMgr Execution Trace Stream QEMU A subset of trace
  • 84. Incremental SourceMgr Better solution: Resumable simulation pipeline 26 MCInst mca::Instruction Stage 0 Stage 1 Stage N …. Resumable Simulation Pipeline mca::IncrementalSourceMgr Execution Trace Stream QEMU A subset of trace A subset of mca::Instruction
  • 85. Incremental SourceMgr Better solution: Resumable simulation pipeline 27 MCInst mca::Instruction Stage 0 Stage 1 Stage N …. Resumable Simulation Pipeline mca::IncrementalSourceMgr Execution Trace Stream QEMU A subset of mca::Instruction
  • 86. Incremental SourceMgr Better solution: Resumable simulation pipeline 28 MCInst mca::Instruction Stage 0 Stage 1 Stage N …. Resumable Simulation Pipeline mca::IncrementalSourceMgr Execution Trace Stream QEMU Pause
  • 87. Incremental SourceMgr Better solution: Resumable simulation pipeline 28 MCInst mca::Instruction Stage 0 Stage 1 Stage N …. Resumable Simulation Pipeline mca::IncrementalSourceMgr Execution Trace Stream QEMU Pause
  • 89. Resumable simulation pipeline • Save (and restore) the analysis state from previous subset of instructions 29
  • 90. Resumable simulation pipeline • Save (and restore) the analysis state from previous subset of instructions • Threads are not required when using IncrementalSourceMgr + resumable pipeline 29
  • 91. Resumable simulation pipeline • Save (and restore) the analysis state from previous subset of instructions • Threads are not required when using IncrementalSourceMgr + resumable pipeline • Much easier to integrate into other uses 29
  • 92. Resumable simulation pipeline • Save (and restore) the analysis state from previous subset of instructions • Threads are not required when using IncrementalSourceMgr + resumable pipeline • Much easier to integrate into other uses • You can still wrap resumable pipeline with a thread 29
  • 93. Resumable simulation pipeline • Save (and restore) the analysis state from previous subset of instructions • Threads are not required when using IncrementalSourceMgr + resumable pipeline • Much easier to integrate into other uses • You can still wrap resumable pipeline with a thread • Minor downside: Modi fi cations on the simulation pipeline 29
  • 94. Incremental SourceMgr + Resumable pipeline Put into real actions 30 MCInst mca::Instruction Stage 0 Stage 1 Stage N …. Resumable Simulation Pipeline mca::IncrementalSourceMgr Execution Trace Stream QEMU
  • 95. Incremental SourceMgr + Resumable pipeline Put into real actions 30 MCInst mca::Instruction Stage 0 Stage 1 Stage N …. Resumable Simulation Pipeline mca::IncrementalSourceMgr Execution Trace Stream QEMU • A program that continuously reads from an I/O device
  • 96. Incremental SourceMgr + Resumable pipeline Put into real actions 30 MCInst mca::Instruction Stage 0 Stage 1 Stage N …. Resumable Simulation Pipeline mca::IncrementalSourceMgr Execution Trace Stream QEMU • A program that continuously reads from an I/O device • Only record the user space traces
  • 97. Incremental SourceMgr + Resumable pipeline Put into real actions 30 MCInst mca::Instruction Stage 0 Stage 1 Stage N …. Resumable Simulation Pipeline mca::IncrementalSourceMgr Execution Trace Stream QEMU • A program that continuously reads from an I/O device • Only record the user space traces • Collected ~1 million instructions
  • 99. Challenge • Created a signi fi cant amount of memory footprint 31
  • 100. Challenge • Created a signi fi cant amount of memory footprint 31 (Unit: MB)
  • 101. Challenge • Created a signi fi cant amount of memory footprint • Bottleneck: ~37GB of accumulated (virtual) memory was allocated by mca::InstrBuilder::createInstruction 31 (Unit: MB)
  • 102. Challenge • Created a signi fi cant amount of memory footprint • Bottleneck: ~37GB of accumulated (virtual) memory was allocated by mca::InstrBuilder::createInstruction 31 (Unit: MB)
  • 103. Challenge • Created a signi fi cant amount of memory footprint • Bottleneck: ~37GB of accumulated (virtual) memory was allocated by mca::InstrBuilder::createInstruction 31 (Unit: MB)
  • 104. Challenge • Created a signi fi cant amount of memory footprint • Bottleneck: ~37GB of accumulated (virtual) memory was allocated by mca::InstrBuilder::createInstruction 31 MCInst mca::Instruction mca::InstrBuilder (Unit: MB)
  • 106. Large memory footprint Root cause • Most of the translated mca::Instruction objects are never deallocated until the simulation is fi nished 32
  • 107. Large memory footprint Root cause • Most of the translated mca::Instruction objects are never deallocated until the simulation is fi nished • mca::Instruction is also used for tracking simulation state, so it’s hard to make it immutable 32
  • 108. Large memory footprint Root cause • Most of the translated mca::Instruction objects are never deallocated until the simulation is fi nished • mca::Instruction is also used for tracking simulation state, so it’s hard to make it immutable • Doesn’t scale really well with large input (recall: ~1 million instructions) 32
  • 112. Large memory footprint Solution: Recycling mca::Instruction 34 mca::IncrementalSourceMgr mca::Instruction Resumable Simulation Pipeline Copy Stream direction
  • 113. Large memory footprint Solution: Recycling mca::Instruction 34 mca::IncrementalSourceMgr mca::Instruction Resumable Simulation Pipeline Copy Stream direction
  • 114. Large memory footprint Solution: Recycling mca::Instruction 34 mca::IncrementalSourceMgr mca::Instruction Resumable Simulation Pipeline Copy Stream direction Recycle mca::InstrBuilder
  • 115. 67% improvement on accumulated memory consumption 35
  • 116. ~70% of the mca::Instruction objects are recycled 36
  • 117. Collecting execution traces via QEMU 37 user mode qemu Target Program QEMU MCAD
  • 118. Collecting execution traces via QEMU 37 user mode qemu Target Program Custom TCG Plugin Instrument QEMU MCAD
  • 119. Broker Collecting execution traces via QEMU 37 user mode qemu Target Program Custom TCG Plugin Instrument Receiver TCP Socket QEMU MCAD
  • 120. Broker Collecting execution traces via QEMU 37 user mode qemu Target Program Custom TCG Plugin Instrument Disassembler Receiver TCP Socket MCInst QEMU MCAD
  • 121. Custom QEMU TCG plugin 38
  • 122. Custom QEMU TCG plugin • The plugin interface allows us to tap into various TCG events to collect executed instructions • Example: When a TCG block is translated / executed 38
  • 123. Custom QEMU TCG plugin • The plugin interface allows us to tap into various TCG events to collect executed instructions • Example: When a TCG block is translated / executed • We also added a few plugin interfaces (not upstreamed yet) • Example: Retrieving CPU register values 38
  • 124. Custom QEMU TCG plugin • The plugin interface allows us to tap into various TCG events to collect executed instructions • Example: When a TCG block is translated / executed • We also added a few plugin interfaces (not upstreamed yet) • Example: Retrieving CPU register values • Sending raw binary instructions* through TCP sockets 38
  • 125. QEMU Complete structure of MCAD 39 TCG Plugin MCAD Core IncrementalSourceMgr Resumable Simulation Pipeline Display (QEMU) Broker Disassembler Receiver Target Program TCP Sockets Recycling InstrBuilder
  • 126. QEMU Complete structure of MCAD 39 TCG Plugin MCAD Core IncrementalSourceMgr Resumable Simulation Pipeline Display (QEMU) Broker Disassembler Receiver Target Program TCP Sockets ☹ Recycling InstrBuilder
  • 127. QEMU Complete structure of MCAD A modular design 40 TCG Plugin Display (QEMU) Broker Disassembler Receiver Target Program TCP Sockets MCAD Core IncrementalSourceMgr Resumable Simulation Pipeline Recycling InstrBuilder
  • 128. QEMU Complete structure of MCAD A modular design 40 TCG Plugin Display (QEMU) Broker Disassembler Receiver Target Program TCP Sockets Loadable Plugin MCAD Core IncrementalSourceMgr Resumable Simulation Pipeline Recycling InstrBuilder
  • 129. Loadable Plugin Complete structure of MCAD Example: Assembly broker plugin 41 Display Assembly Broker AsmParser Assembly File MCAD Core IncrementalSourceMgr Resumable Simulation Pipeline Recycling InstrBuilder
  • 131. Evaluation Scalability compared against llvm-mca • Compare MCAD against llvm-mca (i.e. baseline) on analysis speed and memory consumption 42
  • 132. Evaluation Scalability compared against llvm-mca • Compare MCAD against llvm-mca (i.e. baseline) on analysis speed and memory consumption • Using execution trace collected from running x86_64 FFmpeg 4.2 to decode a 14KB MPEG-4 video fi le • Command: ffmpeg -i input.mp4 -f null - • Size of the trace: ~27 million x86_64 instructions 42
  • 133. Evaluation Scalability compared against llvm-mca • Compare MCAD against llvm-mca (i.e. baseline) on analysis speed and memory consumption • Using execution trace collected from running x86_64 FFmpeg 4.2 to decode a 14KB MPEG-4 video fi le • Command: ffmpeg -i input.mp4 -f null - • Size of the trace: ~27 million x86_64 instructions • For baseline, we dump the execution trace (assembly instructions) to a fi le before feeding into llvm-mca • Time measurement on baseline only accounts for llvm-mca’s run time. Excluding the trace collection time. 42
  • 135. Evaluation Scalability compared against llvm-mca 43 Analysis time seconds 0 44 88 132 176 220 llvm-mca MCAD
  • 136. Evaluation Scalability compared against llvm-mca 43 Analysis time seconds 0 44 88 132 176 220 llvm-mca MCAD 4x Faster
  • 137. Evaluation Scalability compared against llvm-mca 43 Analysis time seconds 0 44 88 132 176 220 llvm-mca MCAD Max resident memory Gigabytes 0 5 10 15 20 25 30 4x Faster
  • 138. Evaluation Scalability compared against llvm-mca 43 Analysis time seconds 0 44 88 132 176 220 llvm-mca MCAD Max resident memory Gigabytes 0 5 10 15 20 25 30 4x Faster 13x Less
  • 139. Evaluation Scalability compared against other static throughput analysis tools 44 Analysis Time Max Resident Memory uiCA Timeout after 48h 113 GB OSACA Exit w/ error after 24h N/A Ithemal Exit w/ error after 2m N/A MCAD 52.69s 2.16 GB
  • 142. // TODO • More e ffi cient ways to collect traces without QEMU 46
  • 143. // TODO • More e ffi cient ways to collect traces without QEMU • Analyzing traces from multi-thread programs 46
  • 144. // TODO • More e ffi cient ways to collect traces without QEMU • Analyzing traces from multi-thread programs • Improve MCA’s precision via dynamic information (e.g. memory accesses) 46
  • 145. // TODO • More e ffi cient ways to collect traces without QEMU • Analyzing traces from multi-thread programs • Improve MCA’s precision via dynamic information (e.g. memory accesses) • Visualizing analysis results. Or: Improve MCA’s result display 46
  • 146. // TODO • More e ffi cient ways to collect traces without QEMU • Analyzing traces from multi-thread programs • Improve MCA’s precision via dynamic information (e.g. memory accesses) • Visualizing analysis results. Or: Improve MCA’s result display • Example: Loadable plugins for custom display of the result 46
  • 147. // TODO • More e ffi cient ways to collect traces without QEMU • Analyzing traces from multi-thread programs • Improve MCA’s precision via dynamic information (e.g. memory accesses) • Visualizing analysis results. Or: Improve MCA’s result display • Example: Loadable plugins for custom display of the result • Going upstream: QEMU & LLVM 46
  • 149. Going upstream: LLVM The plan • We would like to upstream components that are bene fi cial to the core MCA libraries fi rst 47
  • 150. Going upstream: LLVM The plan: components to upstream 48 QEMU TCG Plugin MCAD Core IncrementalSourceMgr Resumable Simulation Pipeline (QEMU) Broker Disassembler Receiver Target Program TCP Sockets Recycling InstrBuilder Display* Not intended to upstream for now Intended to upstream Unrelated / no change
  • 151. Going upstream: LLVM The plan • We would like to upstream components that are bene fi cial to the core MCA libraries fi rst 49
  • 152. Going upstream: LLVM The plan • We would like to upstream components that are bene fi cial to the core MCA libraries fi rst • QEMU broker plugin & our TCG plugin will be maintained out-of-tree 49
  • 153. Going upstream: LLVM The plan • We would like to upstream components that are bene fi cial to the core MCA libraries fi rst • QEMU broker plugin & our TCG plugin will be maintained out-of-tree • We’re not sure about upstreaming rest of the tool right now 49
  • 154. Going upstream: LLVM The plan • We would like to upstream components that are bene fi cial to the core MCA libraries fi rst • QEMU broker plugin & our TCG plugin will be maintained out-of-tree • We’re not sure about upstreaming rest of the tool right now • With the assembly broker, MCAD can be a drop-in replacement for llvm- mca…with even more features (e.g. the broker plugin infrastructure) 49
  • 155. Going upstream: LLVM The plan • We would like to upstream components that are bene fi cial to the core MCA libraries fi rst • QEMU broker plugin & our TCG plugin will be maintained out-of-tree • We’re not sure about upstreaming rest of the tool right now • With the assembly broker, MCAD can be a drop-in replacement for llvm- mca…with even more features (e.g. the broker plugin infrastructure) • Some of the (advanced) interfaces in broker plugin are only used by QEMU broker. So, without the latter, it’s not well tested. 49
  • 158. Summary • MCA Daemon (MCAD) is a high-performance hybrid throughput analysis tool built on top of LLVM MCA libraries 51
  • 159. Summary • MCA Daemon (MCAD) is a high-performance hybrid throughput analysis tool built on top of LLVM MCA libraries • Online, whole-program analysis on real-world applications 51
  • 160. Summary • MCA Daemon (MCAD) is a high-performance hybrid throughput analysis tool built on top of LLVM MCA libraries • Online, whole-program analysis on real-world applications • Scale up with large-scale programs with tens of millions of instructions 51
  • 161. Summary • MCA Daemon (MCAD) is a high-performance hybrid throughput analysis tool built on top of LLVM MCA libraries • Online, whole-program analysis on real-world applications • Scale up with large-scale programs with tens of millions of instructions • We improved the performance & fl exibility of core MCA libraries 51
  • 162. Summary • MCA Daemon (MCAD) is a high-performance hybrid throughput analysis tool built on top of LLVM MCA libraries • Online, whole-program analysis on real-world applications • Scale up with large-scale programs with tens of millions of instructions • We improved the performance & fl exibility of core MCA libraries • We would like to merge these changes upstream to bene fi t the wider community 51
  • 164. Acknowledgements 53 This material is based upon work partially supported by the Defense Advanced Research Projects Agency (DARPA) under contract N66001-20-C-4027 Special Thanks: Galois Inc. (https://galois.com) Immunant Inc. (https://immunant.com)
  • 165. Thank You! 54 Q&A Aldrich Park @ UC Irvine, California Me
  • 167. Introduction to MCA An example 56 vmulps %xmm0, %xmm1, %xmm2 vhaddps %xmm2, %xmm2, %xmm3 vhaddps %xmm3, %xmm3, %xmm4 test/tools/llvm-mca/X86/BtVer2/dot-product.s Timeline 012345 Index 0123456789 [0,0] DeeER. . . vmulps [0,1] D==eeeER . . vhaddps [0,2] .D====eeeER . vhaddps [1,0] .DeeE-----R . vmulps [1,1] . D=eeeE---R . vhaddps [1,2] . D====eeeER . vhaddps [2,0] . DeeE-----R . vmulps [2,1] . D====eeeER . vhaddps [2,2] . D======eeeER vhaddps llvm-mca -mtriple=x86_64 -mcpu=btver2 -iterations=300 dot-products.s
  • 168. Introduction to MCA An example 56 vmulps %xmm0, %xmm1, %xmm2 vhaddps %xmm2, %xmm2, %xmm3 vhaddps %xmm3, %xmm3, %xmm4 test/tools/llvm-mca/X86/BtVer2/dot-product.s Timeline 012345 Index 0123456789 [0,0] DeeER. . . vmulps [0,1] D==eeeER . . vhaddps [0,2] .D====eeeER . vhaddps [1,0] .DeeE-----R . vmulps [1,1] . D=eeeE---R . vhaddps [1,2] . D====eeeER . vhaddps [2,0] . DeeE-----R . vmulps [2,1] . D====eeeER . vhaddps [2,2] . D======eeeER vhaddps llvm-mca -mtriple=x86_64 -mcpu=btver2 -iterations=300 dot-products.s