Estimating instruction-level throughput (for example, predicting cycle counts) is critical for many applications that rely on tightly calculated and accurate timing bounds. In this talk, we will present a new throughput analysis tool, MCA Daemon (MCAD). It is built on top of LLVM MCA and combines the advantages of both static and dynamic throughput analyses, providing a powerful, fast, and easy-to-use tool that scales up to large real-world programs.
14. Genesis: Assured Micro Patching (AMP)
• A research project initiated by the United States DARPA to assure the correctness
of binary patching with little or no source code
• Including functional and timing aspects
• Focuses on small (micro) binary patches
• Focuses on embedded systems
• UCI was studying the timing impacts of binary patches
• Example: After the firmware on a truck is binary-patched to prevent brakes
from locking up, we need to make sure latencies do not degrade badly
18. Timing impacts of binary patches
Problem definition
[Diagram: a Small Binary Patch transforms the Original Program into a Patched
Program; both the original and patched programs run on the same set of inputs,
and we ask what the timing difference ΔT between them is]
23. Execution time assessment
Interesting use cases
• Predicting program run time in remote environments or time-sensitive
applications
• Examples: firmware in cars or satellites (e.g. the Kepler space telescope by NASA)
• Performance analysis
• Insights into performance bottlenecks
• Examples: Potential CPU pipeline stalling, GPU memory bank conflicts
37. Execution time assessment
Challenges
Static analysis — low precision, fast turnaround:
• Poor handling of branches & function calls
• Small scope (only a few basic blocks)
• Lack of run-time information
• Faster analysis speed (due to coarser granularity)
• Easier integration with other tools
Dynamic analysis — high precision, slow turnaround:
• Complete execution traces
• Higher fidelity on hardware details
• Usually requires non-trivial setup
• Slow simulation speed
Can we get the best of both?
49. Introduction to LLVM MCA
• A tool (llvm-mca) and library (libLLVMMCA) for predicting cycle counts and
potential performance hazards in a sequence of assembly code
• Uses instruction scheduling data (e.g. instruction latency) provided by each
LLVM target
• New ISAs (with proper scheduling info) can be supported out of the box
• Accounts for modern processor features: superscalar, out-of-order execution, etc.
• Implemented via lightweight simulation
• Abstracts the real CPU pipeline into a small handful of stages
52. Introduction to MCA
An example
vmulps %xmm0, %xmm1, %xmm2
vhaddps %xmm2, %xmm2, %xmm3
vhaddps %xmm3, %xmm3, %xmm4
(test/tools/llvm-mca/X86/BtVer2/dot-product.s)
llvm-mca -mtriple=x86_64 -mcpu=btver2 -iterations=300 dot-product.s
Summary
Iterations: 300
Instructions: 900
Total Cycles: 610
Total uOps: 900
Dispatch Width: 2
uOps Per Cycle: 1.48
IPC: 1.48
Block RThroughput: 2.0
59. MCA Daemon (MCAD)
Highlights
• Combines the advantages of dynamic & static throughput analysis
• Augments the analysis region beyond basic blocks
• MCAD is able to analyze the entire program execution trace
• Throughput analysis happens in parallel / on-the-fly with the
target program execution
61. Analyze execution traces using MCA
Using unmodified MCA libraries
[Diagram: QEMU runs the Target Program; the executed instructions are passed
through a Disassembler into the LLVM MCA libraries]
62. Analyze execution traces using MCA
Challenge: Sequential workflow
[Diagram: same pipeline, but the MCA analysis is blocked until QEMU is finished
emitting the executed instructions]
81. Incremental SourceMgr
Implemented with threads: Pros & Cons
[Diagram: QEMU (Process 1) streams the execution trace to a receiver thread;
mca::IncrementalSourceMgr feeds MCInst / mca::Instruction objects into the MCA
simulation pipeline (Stage 0 … Stage N) running on a separate simulation thread]
Pros: No modification to the simulation pipeline
Cons: To use IncrementalSourceMgr, you have to use threads
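The threaded design on this slide is a classic producer/consumer pair: a receiver thread pushes incoming instructions into a thread-safe feed while the simulation thread drains it concurrently. Below is a minimal Python sketch under assumed names — the `IncrementalSourceMgr` class merely stands in for `mca::IncrementalSourceMgr`, and strings stand in for translated instructions; it is not MCAD's actual implementation:

```python
import queue
import threading

STOP = object()  # sentinel marking the end of the execution trace

class IncrementalSourceMgr:
    """Stand-in for mca::IncrementalSourceMgr: a thread-safe instruction feed."""
    def __init__(self):
        self._q = queue.Queue()

    def add_inst(self, inst):          # called by the receiver thread
        self._q.put(inst)

    def end_of_stream(self):
        self._q.put(STOP)

    def __iter__(self):                # drained by the simulation thread
        while (inst := self._q.get()) is not STOP:
            yield inst

def receiver(src, trace):
    """Plays the role of the thread receiving executed instructions from QEMU."""
    for inst in trace:
        src.add_inst(inst)
    src.end_of_stream()

def simulate(src, results):
    """Plays the role of the MCA simulation-pipeline thread."""
    count = 0
    for inst in src:
        count += 1                     # stand-in for per-instruction pipeline work
    results["insts"] = count

src, results = IncrementalSourceMgr(), {}
t_recv = threading.Thread(target=receiver, args=(src, ["vmulps", "vhaddps", "vhaddps"]))
t_sim = threading.Thread(target=simulate, args=(src, results))
t_sim.start(); t_recv.start()
t_recv.join(); t_sim.join()
print(results["insts"])  # → 3
```

Note how the simulation side needs no change beyond iterating the feed, which is exactly the "Pros" above; the "Cons" is that the feed only makes sense with a second thread filling it.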
93. Resumable simulation pipeline
• Save (and restore) the analysis state from the previous subset of instructions
• Threads are not required when using IncrementalSourceMgr + the resumable
pipeline
• Much easier to integrate into other uses
• You can still wrap the resumable pipeline in a thread
• Minor downside: Modifications to the simulation pipeline
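A resumable pipeline can be sketched as an object that preserves its analysis state across calls, so the caller can push instruction batches as they arrive, with no dedicated feeder thread. This is a minimal, hypothetical Python sketch (real MCA stages and cycle accounting are far richer than the counter used here):

```python
class ResumablePipeline:
    """Hypothetical stand-in for MCA's resumable simulation pipeline."""
    def __init__(self):
        self.total_insts = 0     # analysis state persisted between run() calls
        self.total_cycles = 0

    def run(self, batch):
        """Simulate one batch, then pause; state survives until the next call."""
        for inst in batch:
            self.total_insts += 1
            self.total_cycles += 1   # stand-in for real per-stage simulation
        return self.total_cycles

pipe = ResumablePipeline()
pipe.run(["vmulps", "vhaddps"])  # first chunk of the trace
pipe.run(["vhaddps"])            # later chunk; analysis resumes where it left off
print(pipe.total_insts)  # → 3
```

Because each `run()` call simply picks up where the last one stopped, the caller drives the pipeline from its own event loop, which is why no thread is needed and integration gets easier.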
97. Incremental SourceMgr + Resumable pipeline
Putting it into action
[Diagram: QEMU streams the execution trace into mca::IncrementalSourceMgr, which
feeds MCInst / mca::Instruction objects into the resumable simulation pipeline
(Stage 0 … Stage N)]
• A program that continuously reads from an I/O device
• Only record the user-space traces
• Collected ~1 million instructions
104. Challenge
• Creates a significant memory footprint
• Bottleneck: ~37GB of accumulated (virtual) memory was allocated by
mca::InstrBuilder::createInstruction
[Chart: memory allocated by mca::InstrBuilder while translating MCInst into
mca::Instruction (unit: MB)]
108. Large memory footprint
Root cause
• Most of the translated mca::Instruction objects are never deallocated
until the simulation is finished
• mca::Instruction is also used for tracking simulation state, so it’s hard to
make it immutable
• Doesn’t scale well with large inputs (recall: ~1 million instructions)
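A common remedy for this allocation pattern — and the idea behind the "Recycling InstrBuilder" component that appears among MCAD's upstreaming candidates later in the talk — is to return retired instruction objects to a free list and reuse them, bounding live memory by pipeline depth rather than trace length. A minimal, hypothetical Python sketch (class and method names are illustrative, not MCAD's API):

```python
class Instruction:
    """Stand-in for mca::Instruction: holds mutable simulation state."""
    def __init__(self):
        self.opcode = None

class RecyclingInstrBuilder:
    """Hypothetical recycling allocator for instruction objects."""
    def __init__(self):
        self._free = []       # retired objects awaiting reuse
        self.allocated = 0    # how many objects were ever created

    def create_instruction(self, opcode):
        # Reuse a retired object if one is available; allocate otherwise.
        inst = self._free.pop() if self._free else self._new()
        inst.opcode = opcode
        return inst

    def _new(self):
        self.allocated += 1
        return Instruction()

    def recycle(self, inst):
        # Called once the instruction retires from the simulated pipeline.
        self._free.append(inst)

b = RecyclingInstrBuilder()
for op in ["mul", "add", "sub", "add"] * 250:   # a 1000-instruction "trace"
    inst = b.create_instruction(op)
    # ... simulate the instruction through the pipeline ...
    b.recycle(inst)                             # retire immediately (depth 1)
print(b.allocated)  # → 1: memory no longer grows with trace length
```

The catch noted above still applies: because the object carries simulation state, it may only be recycled once the pipeline has truly retired it.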
124. Custom QEMU TCG plugin
• The plugin interface allows us to tap into various TCG events to collect
executed instructions
• Example: When a TCG block is translated / executed
• We also added a few plugin interfaces (not upstreamed yet)
• Example: Retrieving CPU register values
• Sending raw binary instructions* through TCP sockets
133. Evaluation
Scalability compared against llvm-mca
• Compare MCAD against llvm-mca (i.e. the baseline) on analysis speed and memory
consumption
• Using an execution trace collected from running x86_64 FFmpeg 4.2 to decode a
14KB MPEG-4 video file
• Command: ffmpeg -i input.mp4 -f null -
• Size of the trace: ~27 million x86_64 instructions
• For the baseline, we dump the execution trace (assembly instructions) to a file
before feeding it into llvm-mca
• Time measurement on the baseline only accounts for llvm-mca’s run time,
excluding the trace collection time
147. // TODO
• More efficient ways to collect traces without QEMU
• Analyzing traces from multi-threaded programs
• Improving MCA’s precision via dynamic information (e.g. memory accesses)
• Visualizing analysis results, or improving MCA’s result display
• Example: Loadable plugins for custom display of the results
• Going upstream: QEMU & LLVM
150. Going upstream: LLVM
The plan: components to upstream
[Diagram: MCAD’s components — QEMU TCG Plugin, (QEMU) Broker, MCAD Core,
Receiver, Disassembler, IncrementalSourceMgr, Resumable Simulation Pipeline,
Recycling InstrBuilder, Display* — each marked as “intended to upstream”,
“not intended to upstream for now”, or “unrelated / no change”; QEMU, the
Target Program, and TCP Sockets are unchanged]
155. Going upstream: LLVM
The plan
• We would like to upstream the components that are beneficial to the core MCA
libraries first
• The QEMU broker plugin & our TCG plugin will be maintained out-of-tree
• We’re not sure about upstreaming the rest of the tool right now
• With the assembly broker, MCAD can be a drop-in replacement for llvm-
mca…with even more features (e.g. the broker plugin infrastructure)
• Some of the (advanced) interfaces in the broker plugin are only used by the
QEMU broker, so without the latter they’re not well tested
162. Summary
• MCA Daemon (MCAD) is a high-performance hybrid throughput analysis tool
built on top of the LLVM MCA libraries
• Online, whole-program analysis of real-world applications
• Scales up to large programs with tens of millions of instructions
• We improved the performance & flexibility of the core MCA libraries
• We would like to merge these changes upstream to benefit the wider
community
164. Acknowledgements
This material is based upon work partially supported by the
Defense Advanced Research Projects Agency (DARPA) under
contract N66001-20-C-4027
Special Thanks:
Galois Inc. (https://galois.com) Immunant Inc. (https://immunant.com)