Translation cache policies for dynamic binary translation
Saber Ferjani
École Nationale des Sciences de l'Informatique
Responsible: Prof. Frédéric Pétrot
Supervisor: Luc Michel
TIMA Laboratory - SLS Group, Grenoble, France
 DBT is a CPU simulation technique: it reads a short sequence of code from one CPU (the target), translates it, and executes the result on a different CPU (the host).
[Figure: the simulated target's asm code is translated into Translated Blocks (TB) stored on the host machine]
 Translation cache: a buffer in the host machine that stores the Translated Blocks (TB)
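The lookup-translate-execute cycle described above can be sketched as a small loop over a translation cache. This is an illustrative model, not Qemu's actual code; `translate_block` and `execute_block` are hypothetical stand-ins for the real translator and executor.

```python
# Minimal sketch of a DBT execution loop with a translation cache.
# translate_block / execute_block are hypothetical stand-ins for the
# real translator and executor of an engine such as Qemu.

def translate_block(target_pc):
    """Pretend translation: return a host 'code' object for this block."""
    return ("host_code_for", target_pc)

def execute_block(tb):
    """Pretend execution: return the next target PC (here: just advance)."""
    return tb[1] + 4

translation_cache = {}   # target PC -> translated block (TB)

def run(start_pc, steps):
    pc = start_pc
    for _ in range(steps):
        tb = translation_cache.get(pc)
        if tb is None:                    # miss: translate and cache the block
            tb = translate_block(pc)
            translation_cache[pc] = tb
        pc = execute_block(tb)            # hit or freshly translated: execute
    return pc

run(0x1000, 3)
```

On a cache hit the translation step is skipped entirely, which is where DBT gains its speed over pure interpretation.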
Outline
1. Virtualization and simulation techniques
2. Qemu Internals
3. Typical cache algorithms
4. Cache algorithm proposal
5. Simulation results
6. Conclusion & Perspectives
1. Virtualization and simulation techniques
1.1. Just In Time Compiler
1.2. Hosted & Native Hypervisors
1.3. Virtualization tools
Virtual Box
Virtual PC
VMware
Xen
Bochs
Valgrind
Qemu
KVM
1.4. Simulation techniques
 Interpretive technique ► extremely slow!
 Native simulation ► needs the source code!
 Binary translation:
 Static ► cannot handle indirect branches
 Dynamic ► quite fast & flexible
2. Qemu internals
2.1. Overview
 Generic & Open source machine emulator
 Created by Fabrice Bellard in 2003
 Supported targets: IA32, ARM, SPARC, MIPS, PPC…
2.2. Execution flow example
2.3. Main execution loop
2.4. Translation cache size
2.5. TB allocation
3. Typical cache algorithms
Optimal cache algorithm (offline)
Basic cache algorithms: Flush, Random, FIFO, LRU, LFU
Advanced cache algorithms: LRFU, 2Q, LIRS, ARC
Qemu constraints:
TBs are not movable
TB size is variable
TB size is unpredictable
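As a reference point for the basic policies above, a byte-capacity LRU can be sketched as follows. This is a minimal illustrative model, not Qemu code; the sizes are arbitrary, and the sketch also hints at the constraint just listed, since evicting variably sized, non-movable TBs fragments the cache in ways a page-style LRU never faces.

```python
from collections import OrderedDict

# Byte-capacity LRU: evicts least-recently-used TBs until a new TB fits.
# Addresses and sizes are illustrative assumptions.

class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity          # total bytes available
        self.used = 0
        self.tbs = OrderedDict()          # address -> size, in LRU order

    def access(self, addr, size):
        if addr in self.tbs:              # hit: move to the MRU position
            self.tbs.move_to_end(addr)
            return True
        while self.used + size > self.capacity:   # evict LRU victims
            _, victim_size = self.tbs.popitem(last=False)
            self.used -= victim_size
        self.tbs[addr] = size             # miss: insert the new TB
        self.used += size
        return False
```

Because a victim's bytes are not contiguous with the newcomer's slot in a real, non-movable layout, this simple accounting is exactly what Qemu's constraints break.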
4. Cache algorithm proposal
4.1. Algorithm design
4.2. Data structure
Constant insertion overhead
Frequently referenced TBs are elected for re-translation into a separate cache area
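The two-area idea can be sketched as follows: new TBs live in the cold area (CSA), and TBs executed often enough are elected into the hot-spot area (HSA) so they survive CSA flushes. The area names and the threshold F_th follow the slides; the bookkeeping itself is our illustrative assumption, not the actual implementation.

```python
# Sketch of the CSA/HSA election scheme described above.
# F_TH and the counting scheme are illustrative assumptions.

F_TH = 3                 # execution-count threshold for promotion

csa = {}                 # target addr -> execution count (cold area)
hsa = set()              # target addrs promoted to the hot-spot area

def execute(addr):
    if addr in hsa:                      # hot TB: runs from the HSA
        return "hsa"
    csa[addr] = csa.get(addr, 0) + 1     # cold TB: count executions in CSA
    if csa[addr] > F_TH:                 # elected: re-translate into HSA
        hsa.add(addr)
        del csa[addr]
        return "promoted"
    return "csa"

def flush_csa():
    csa.clear()                          # hot TBs in the HSA survive the flush
```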
4.3. HST update
Before a CSA flush, add the addresses of all TBs that were executed more than F_th times
The HST is used as a circular buffer
The HST size is fixed to half of the HSA size
[Figure: HST circular buffer holding entries @HS1 … @HS5]
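The Hot-Spot Table (HST) update rule above can be sketched as a fixed-size circular buffer. The buffer size here is an arbitrary illustrative number (the slides fix it to half the HSA size); the function names are our own.

```python
# Sketch of the HST as a fixed-size circular buffer: the newest entry
# overwrites the oldest one once the buffer is full.
# HST_SIZE and the function names are illustrative assumptions.

HST_SIZE = 4
hst = [None] * HST_SIZE
head = 0

def hst_record(addr):
    global head
    hst[head] = addr              # overwrite the oldest entry when full
    head = (head + 1) % HST_SIZE

def hst_contains(addr):
    return addr in hst            # linear scan: cost grows with HST size
```

The linear scan in `hst_contains` is the cost noted later in the conclusion: the address-find operation depends on the HST size.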
5. Simulation results
5.1. Qemu log
Qemu monitor: back-end configuration console interface
Log options:
out_asm: show generated host code
in_asm: show target assembly code
exec: show a trace before each executed TB
…etc.
Generated log of (log exec):
Trace (Host Address) [(Target Address)]
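Extracting the (host, target) address pairs from such a trace can be sketched with a small parser. The exact field layout of the exec log varies across Qemu versions, so the regex below is an assumption matching only the "Trace <host address> [<target address>]" shape described on the slide.

```python
import re

# Sketch of parsing 'log exec' trace lines of the form:
#   Trace 0x<host address> [<target address>]
# The regex is an assumption based on the format shown above.

TRACE_RE = re.compile(r"Trace\s+(0x[0-9a-f]+)\s+\[(0x)?([0-9a-f]+)\]")

def parse_trace_line(line):
    m = TRACE_RE.search(line)
    if not m:
        return None                       # not a trace line
    host = int(m.group(1), 16)            # host address of the TB
    target = int(m.group(3), 16)          # simulated target address
    return host, target
```

A stream of such pairs is exactly the input a trace-driven cache simulator needs.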
5.2. TB-trace: Translation cache simulator
5.3. Simulated cache algorithms
• A-LRU
• A-LFU
• A-2Q
[Figure: CSA managed by LRU/LFU, with the HST feeding promoted TBs into the HSA]
5.4. Guest machines used with Qemu
LZMA benchmark
Linux Kernel
Windows XP start-up
5.5. Guest 1: LZMA benchmark over Debian

CSA flushes (Quota = 0.25 / 0.375 / 0.5):
LRU: 62, 89, 72
LFU: 50, 55, 52
2Q: 56, 68, 88

Hotspot hit (Quota = 0.25 / 0.375 / 0.5):
LRU: 18.5%, 39.6%, 26.1%
LFU: 86.9%, 91.3%, 90.1%
2Q: 81.8%, 81.9%, 81.8%
5.6. Guest 2: Linux kernel 2.6.20

CSA flushes (Quota = 0.25 / 0.375 / 0.5):
LRU: 15, 18, 22
LFU: 15, 17, 21
2Q: 16, 19, 23
(two of the configurations incur one additional HSA flush)

Hotspot hit (Quota = 0.25 / 0.375 / 0.5):
LRU: 24.1%, 32.1%, 43.6%
LFU: 24.4%, 61.9%, 57.4%
2Q: 30.0%, 64.1%, 65.2%
5.7. Guest 3: Windows XP start-up

CSA flushes (Quota = 0.25 / 0.375 / 0.5):
LRU: 15, 18, 21
LFU: 15, 17, 21
2Q: 16, 19, 24
(three of the configurations incur one additional HSA flush)

Hotspot hit (Quota = 0.25 / 0.375 / 0.5):
LRU: 16.0%, 45.2%, 52.1%
LFU: 23.4%, 56.5%, 51.4%
2Q: 29.0%, 45.3%, 64.7%
6. Conclusion & Perspectives
6.1. Conclusion
Qemu's translation cache is inefficient
Cache algorithms based on page replacement cannot be used directly
Advantages of our algorithm proposal:
Reduces unneeded re-translations
TB insertion overhead is constant
Drawbacks:
Invalidated TBs remain allocated
The address-find operation depends on the HST size
6.2. Perspectives
Use a hash function for the HST to accelerate TB lookup before each new translation
Use an op-code buffer to accelerate the re-translation of hot spots
Estimate the size of the next translation and try to overwrite invalidated TBs
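The first perspective can be sketched by pairing the circular HST buffer with a hash-based index, so the membership check before each new translation becomes O(1) instead of a linear scan. The set-based side index is our illustrative assumption; sizes and names are arbitrary.

```python
# Sketch: circular HST buffer plus a hash-based index (a set) kept in
# sync on every overwrite, so lookups no longer scan the whole buffer.
# HST_SIZE, hst_index, and the function names are assumptions.

HST_SIZE = 4
hst = [None] * HST_SIZE
hst_index = set()              # hash-based view of the buffer contents
head = 0

def hst_record(addr):
    global head
    old = hst[head]
    if old is not None:
        hst_index.discard(old)  # keep the index in sync on overwrite
    hst[head] = addr
    hst_index.add(addr)
    head = (head + 1) % HST_SIZE

def hst_contains(addr):
    return addr in hst_index    # O(1) average-case membership test
```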
Questions?
