Dynamic Binary Optimization for Virtualization on Multi-cores
The software market requires applications to run on many generations of
hardware. Even if software vendors tune their binaries for the most prevalent
hardware at release time, the code rapidly becomes mismatched to new platforms
as hardware implementations evolve. Most recent high-performance microprocessors
provide sophisticated runtime monitoring support that allows runtime information,
such as cache misses and instruction pipeline stalls, to be collected and used to
re-optimize binary code at runtime for better overall performance. Such
continuous program re-optimization requires a dynamic compiler that can manipulate
binary code at runtime.
Another very important type of application is process/system
virtualization, in which a software layer is set up above the hardware to allow multiple OSes
and/or applications with different instruction-set architectures (ISAs) to run on the
same hardware platform. The main technology supporting virtualization is binary
translation. It is essentially a different kind of runtime compiler that takes the binary
code of OSes and application programs in one ISA and translates it
into a sequence of instructions in the ISA of the underlying hardware platform.
Most of these binary manipulation techniques today incur substantial runtime
overhead. Dynamic optimization of binary code at runtime to improve
overall performance is therefore a core technology that deserves study. Since the
optimizations are performed at runtime, a dynamic binary optimizer must be
carefully designed so that the overhead of runtime optimization does not outweigh
the performance gain of the optimized code. We will call an optimizer that produces
more performance gain than overhead an "effective" optimizer. A number of
factors are crucial to the effectiveness of a dynamic optimizer. Before discussing
them, we first give an overview of how a general dynamic binary optimizer works.
In general, execution of an application under a dynamic binary optimizer, as
shown in Figure 1, begins with the system executing (or emulating) and profiling a
running program's instruction stream to track its execution flow. When the system
detects a significant change in the profile, it tries to find frequently executed code
sequences (i.e., hot traces); each sequence is then analyzed, optimized, and placed in
a code cache. Execution then switches to the optimized code in the code cache.
Figure 1. Control flow of a general dynamic binary optimizer
Now we can identify a number of key factors that have a profound impact
on the effectiveness of a dynamic binary optimizer: (1) the profiling method used to collect
runtime information; (2) the frequency at which the optimizer is activated, which often depends
on the phase detection method; (3) the detection of frequently executed code sequences,
also referred to as hot code identification and hot trace generation; (4) the
optimizations performed; and (5) improving optimizations with the help of compiler
annotations. In this project, we propose a light-weight, sampling-based dynamic
binary optimization framework that provides novel solutions to these important issues.
Furthermore, most binary optimization techniques today target single-core
platforms. We plan to extend such binary optimization techniques to multi-core
platforms. This is a much harder problem, as we need to deal with multithreaded
applications and with many more shared resources on multi-core platforms.
The core technologies we will develop in this project, if successful, could have
significant impact on the development of dynamic binary optimization systems and
virtualization technologies.
Profile-guided optimizations provide runtime information for advanced
optimizations. Hence, recent research attempts to extend the
idea of branch profiling to value, cache-miss, and data-dependency profiling.
However, collecting a representative profile is difficult for real applications.
Although post-link optimizations optimize programs
based on performance profiles and reduce the need to recompile, applications can
exhibit different performance characteristics with individual inputs. Therefore, an
application may have to be optimized during execution, since specific information
about its performance cannot be gathered before the input is given. Dynamic
optimization systems are becoming important because of the need to customize
optimizations for individual inputs, behavior that changes over time, dynamically
linked libraries, and the micro-architecture. Such dynamic optimization
systems typically manipulate and optimize binary code at runtime.
The profiling methods used by most binary manipulation and optimization
systems can be classified into two categories: Virtual Machine (VM) based and
sampling based. VM-based systems, such as Dynamo, DynamoRIO, Mojo,
and Pin, typically instrument code for profiling. Therefore, accurate runtime data
for phase detection strategies, such as instruction working sets and basic block vectors,
can be collected without difficulty. However, such systems incur substantial
overhead from profiling, emulation, code-cache management, and the expensive handling
of indirect branches. For example, Pin has an overhead of 54% for the SPECint2000
benchmarks on IA32 systems, and DynamoRIO has an overhead of 42% in the
same environment. These are the minimal overheads reported, when no instrumentation
or optimization is performed.
Unlike VM-based optimizers, sampling-based optimizers, such as ADORE,
and sampling-based profiling tools, such as SimPoint, typically do not instrument
code for profiling. Therefore, runtime data for phase detection strategies cannot be
collected with the same accuracy. Also, sampling-based optimizers do not have complete
control over program execution. They take frequent snapshots of program execution
and thus only see frequently executed code, not the complete execution path
leading to it. However, sampling-based profiling has much lower runtime overhead than
VM-based profiling.
Dynamic optimization systems using sample-based profiling rely on phase
detection to detect changes in the code working set and changes in performance
characteristics that can affect optimization strategies. Phase detection techniques can
be classified into two categories: Global Phase Detection (GPD) [8] and Local Phase
Detection (LPD) [9]. In GPD, program characteristics are computed by taking
into account information from all regions that executed during the profiled interval.
Hence, it is sensitive to the sampling period, interval size, and thresholds used in the phase
detector. LPD can detect phase changes more accurately than GPD because the scope
of phase detection is reduced to a small code region, such as a basic block, a loop, or a
procedure. Commonly used LPD methods include region monitoring and
trace compilation.
Table 1 compares these optimization systems. Details of each optimization
system are described below. Note that all of these optimization systems are for single-
core platforms, except that the ADORE system runs the optimizer and the user
application code on separate cores.
Dynamo  is a software dynamic optimization system that is capable of
transparently improving the performance of a native instruction stream as it
executes on the processor. The input native instruction stream to Dynamo can be
dynamically generated (by a JIT for example), or it can come from the execution
of a statically compiled native binary. Dynamo focuses its efforts on optimization
opportunities that tend to manifest only at runtime. Experimental results
demonstrate that even statically optimized native binaries can be accelerated by
Dynamo. For example, the average performance of -O optimized SpecInt95
benchmark binaries created by the HP product C compiler is improved to a level
comparable to their -O4 optimized version running without Dynamo. The
performance advantage of Dynamo in such cases is not surprising, because it was
compared against compile-time static optimizations, which usually lack the runtime
information needed to generate high-performance code. Since Dynamo relies on
VM-based profiling and runtime emulation of program execution, its runtime
overhead can be high.
DynamoRIO is a framework, extended from Dynamo, for implementing
dynamic analyses and optimizations. It provides an interface for building
external modules, or clients, for the DynamoRIO dynamic code modification
system. This interface abstracts away many low-level details of the DynamoRIO
runtime system while exposing a simple yet efficient API. This is achieved by
restricting optimization units to linear streams of code and using adaptive levels
of detail for representing instructions. The interface is not restricted to
optimization and can be used for instrumentation, profiling, dynamic translation,
etc. DynamoRIO also implements several optimizations. These improve the
performance of some applications by 12% on average, relative to native
execution. Since DynamoRIO is intended to be an analysis and instrumentation tool,
it uses expensive software-instrumentation-based profiling and an interpreter for
program emulation.
|               | Dynamo | DynamoRIO | Mojo | ADORE | Pin | JikesRVM |
|---------------|--------|-----------|------|-------|-----|----------|
| Sampling      | no     | no        | no   | yes   | no  | no       |
| VM            | yes    | yes       | yes  | no    | yes | yes      |
| Emulation     | yes    | yes       | no   | no    | yes | yes      |
| Annotation    | no     | no        | no   | no    | no  | yes      |
| Optimizations | 1. Hot tracing and linking | 1. Constant propagation; 2. Dead code removal; 3. Call/return matching; 4. Stack adjust | 1. Hot path linking; 2. Drop unconditional jumps; 3. Inline call/return sequences | 1. Dynamic register allocation; 2. Runtime data prefetching; 3. Hot trace patching | 1. Persistent code caching | 1. Adaptive inlining; 2. Register allocation and coalescing; 3. Tail recursion elimination; 4. Code reordering; 5. Dead code elimination; 6. Loop normalization & unrolling; 7. Load/store & redundant branch elimination |

Table 1. Comparison of VM based dynamic optimizers and sampling based optimizers
Unlike most dynamic optimizers, which have chiefly targeted running the
SPEC benchmarks on scientific workstations, Mojo,
developed by Microsoft Research, contends that dynamic optimization
technology is also important in the desktop computing environment, where
running large, complex commercial software applications is commonplace. Mojo
implements its optimizations for the x86 architecture. It also supports exception
handling and multithreaded applications on Windows, along with preliminary
performance measurements. Similar to Dynamo and DynamoRIO, Mojo
employs VM-based profiling. However, it does not rely on time-consuming
emulation/interpretation of program execution.
ADORE  is a light-weight dynamic binary optimization system developed at
the University of Minnesota. It’s light-weight because it uses hardware
performance monitoring based sampling for profiling. ADORE uses dynamic
optimization to address cache miss, branch mis-prediction, and other
performance events at runtime. It detects performance problems of running
applications and deploys optimizations to increase execution efficiency.
ADORE’s approach includes detecting performance bottlenecks, generating
optimized traces and redirecting execution from the original code to the
dynamically optimized code. Experiment results show that ADORE speeds up
many of the CPU2000 benchmark programs having large numbers of D-Cache
misses through dynamically deployed cache prefetching. For other applications
that don’t benefit from ADORE’s runtime optimization, the average cost is only
2% of execution time. ADORE is a good example of using existing hardware and
software to deploy speculative optimizations to improve a program’s runtime
performance. In this project, we will develop our dynamic binary optimization
system based on ADORE because of the efficient and attractive features described above.
PIN  is an instrumentation system developed by Intel. It aims to support easy
to use, portable, transparent, and efficient instrumentation. Instrumentation tools
(called Pintools) are written in C/C++ using Pin's API. Pin follows the model of
ATOM, allowing the tool writer to analyze an application at the instruction level
without the need for detailed knowledge of the underlying instruction set. Pin
uses dynamic compilation to instrument executables while they are running. For
efficiency, Pin uses several techniques, including inlining, register re-allocation,
liveness analysis, and instruction scheduling to optimize instrumentation. As a
result, Pin is 3.3x faster than Valgrind and 2x faster than DynamoRIO for basic-
block counting. Pin is publicly available for Linux platforms on four
architectures: IA32 (32-bit x86), EM64T (64-bit x86), Itanium®, and ARM.
JikesRVM : Jikes RVM (Research Virtual Machine) provides a flexible open
testbed to prototype virtual machine technologies and experiment with a large
variety of design alternatives. Jikes RVM can run on various platforms. It
implements virtual machine technologies for dynamic compilation, adaptive
optimization, garbage collection, thread scheduling, and synchronization. A
distinguishing characteristic of Jikes RVM is that it is implemented in the Java™
programming language and is self-hosted, i.e., its Java code runs on itself without
requiring a second virtual machine. Jikes RVM uses VM-based profiling and an
interpreter for program emulation.
We propose a light-weight, sampling based dynamic binary optimization
system. The system diagram, including system components and major data
structures, of our virtualization system proposed in the main project is depicted in
Figure 2. The blocks circled by a dotted line are the components for the dynamic
binary optimization sub-system proposed in this sub-project. The components
include a Hardware Performance Monitor profiler (HPM Data), Phase Detector, Hot
Trace Generator, and Optimizer.
Figure 2. System diagram of our virtualization system
We first describe the functionality of and our design decision for each of the
components in our optimization system. We also address the important research
issues in each component.
1.1 Light-weight HPM-Based Profiling
We exploit the hardware performance monitors (HPM) in the processor for
light-weight profiling. Since the HPM counts events automatically in hardware, the
extra overhead of monitoring program behavior is much lower than with an
instrumentation approach that counts events in software.
We adopt Perfmon2, a standard performance monitoring interface for
Linux, to access the HPM. It provides friendly interfaces that help the user set up
the HPM registers for the events to be monitored. Each
Linux thread can run a monitoring session under Perfmon2, and in each session
the user can indicate which core or which thread is to be monitored. In our
virtualization system, we will target multi-threaded programs and implement
each guest thread as a pthread, so we will create a monitoring session for each pthread
created for a guest thread. Moreover, we will create a monitoring session for each
core in the host platform as well.
1.2 Sampling Accumulation Phase Detection
A dynamic optimizer needs to accurately identify periods of execution
when a program must be optimized or re-optimized. The concept of a phase was
introduced to identify periods of execution during which certain runtime characteristics
do not change. Phase detection identifies these periods and the
transitions between them. Thus, an accurate and reliable phase
detection scheme is crucial to runtime performance.
Phase detection is an important component of sampling based dynamic
optimizers. Phase detection, as implemented in current sampling-based prototype
dynamic optimization systems such as ADORE, is called Global Phase
Detection (GPD) as program characteristics are computed by taking into account
information from all regions that executed during the profiled interval. The
problem with GPD is that it may not be able to detect the change between two
phases if they have the same average program counter value.
We propose a new phase detection approach called sampling accumulation
phase detection to solve this problem. For each sampling interval, we maintain a
code blocks vector and an accumulation vector. Both vectors have the same
cardinality. An element in the code blocks vector is a pair of program counters
indicating the beginning and end of a code block of the program. An element in
the accumulation vector records the number of times a program counter is located
in the corresponding code block of this vector element. When the HPM data
buffer overflows, the program counter in the data structure HPM Data is retrieved
and compared with the values in the code blocks vector to find the code block in
which this program counter is located. Then, the corresponding element (a value)
in the accumulation vector is incremented by 1. For two adjacent sampling
intervals, we can compare the Manhattan distance or the Euclidean distance of
their accumulation vectors. If the distance is larger than a threshold value, then
there is a phase change.
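The accumulation and comparison steps above can be sketched as follows. This is a minimal illustration of the proposed scheme; the code-block layout, sample values, and threshold are made-up examples, not measurements.

```python
def locate_block(pc, code_blocks):
    """Return the index of the code block [start, end) containing pc, or None."""
    for i, (start, end) in enumerate(code_blocks):
        if start <= pc < end:
            return i
    return None

def accumulate(pc_samples, code_blocks):
    """Build the accumulation vector for one sampling interval."""
    acc = [0] * len(code_blocks)
    for pc in pc_samples:
        i = locate_block(pc, code_blocks)
        if i is not None:
            acc[i] += 1
    return acc

def phase_changed(prev_acc, cur_acc, threshold):
    """Compare two adjacent intervals by Manhattan distance."""
    distance = sum(abs(a - b) for a, b in zip(prev_acc, cur_acc))
    return distance > threshold

# Two code blocks: [0x400000, 0x400100) and [0x400100, 0x400200)
blocks = [(0x400000, 0x400100), (0x400100, 0x400200)]
interval1 = accumulate([0x400010, 0x400020, 0x400110], blocks)  # [2, 1]
interval2 = accumulate([0x400120, 0x400130, 0x400140], blocks)  # [0, 3]
print(phase_changed(interval1, interval2, threshold=3))  # True (distance = 4)
```

The Euclidean distance could be substituted for the Manhattan distance in `phase_changed` without changing the rest of the scheme.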
1.3 Hot Code Identification
Optimizing at runtime can be expensive and incurs real performance
penalty. Limiting the scope of optimization reduces this overhead. The scope of
dynamic optimization can be reduced by finding frequently executed code. Such
code exists naturally in programs from loops and recursive function calls. The
general technique for identifying such code is to maintain a count for each basic
block and when the basic block count exceeds a threshold, it is optimized.
Sampling-based dynamic optimizers rely on hardware performance counters to
collect this data. By sampling these counters, program counter samples are
obtained periodically, and from these samples frequently executed code can be identified.
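Under these assumptions (periodic PC samples and a known mapping from addresses to basic blocks), hot code identification reduces to a counting pass. The mapping function and threshold below are illustrative stand-ins, not part of an existing system:

```python
from collections import Counter

def hot_blocks(pc_samples, block_of, threshold):
    """Count PC samples per basic block; blocks whose sample count
    exceeds the threshold are considered hot.
    block_of maps a sampled program counter to a basic-block id."""
    counts = Counter(block_of(pc) for pc in pc_samples)
    return {blk for blk, n in counts.items() if n > threshold}

# Toy mapping for illustration: block id = pc // 0x100
samples = [0x100, 0x104, 0x108, 0x204, 0x10C]
print(hot_blocks(samples, lambda pc: pc // 0x100, threshold=2))  # {1}
```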
1.4 Hot Trace Generation
Optimization at a basic block level may not be beneficial because the
granularity is too small. Thus, it is desirable to aggregate multiple basic blocks
into a larger code segment, also called a trace. Traces are sequences of basic blocks that form
the unit of optimization for dynamic optimizers. Dynamic optimizers try to select
basic blocks that form loops, so that trace exits are minimized.
consideration when building traces is to minimize analysis time. As traces are
units of optimization, the dynamic optimizer passes these traces to its
optimization algorithms. These algorithms must quickly generate an optimized
trace. A sampling-based optimizer such as ours is limited by the fact that it does
not have complete control over execution. We will address the trace generation
problem with dynamic code analysis and runtime profile estimation. We may also
apply the concepts of superblocks and hyperblocks to help guide our trace generation.
Our approach to trace generation is as follows. From the
HPM Data for a sampling interval, we construct a directed graph with
weighted edges. A vertex represents an IR basic block, and the weight on an edge
represents the frequency of the branch between two IR basic blocks in this
sampling interval. We generate the hot traces as follows. First, according to
the result of hot code identification, we delete the vertices that represent IR
basic blocks that are not hot. Next, we delete the edges whose weights are lower
than the threshold value. This step results in a graph with a number of
connected sub-graphs. Each sub-graph represents a hot trace. Furthermore, the hottest
block in a sub-graph is the entry point of the trace.
In the example shown in Figure 3, we have six IR basic blocks. Blocks A, C
and E are the hot blocks and block A is the hottest block. Let the threshold value
for frequent branches between two basic blocks be 8. According to the algorithm
described above, blocks B, D and F will be removed, and edges with weight
smaller than 8 will be removed. This results in a graph of three vertices and three
edges, which happens to be a loop.
Figure 3 An example of hot trace generation
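The pruning-and-components algorithm above, applied to the Figure 3 example, can be sketched as follows. The edge weights and hotness counts are modeled on the example, not taken from a real profile:

```python
def hot_traces(edges, hotness, hot_blocks, edge_threshold):
    """edges: {(src, dst): weight}; hotness: {block: sample count};
    hot_blocks: the set of blocks surviving hot code identification."""
    # Keep only edges between hot blocks with weight >= edge_threshold.
    kept = {(s, d): w for (s, d), w in edges.items()
            if s in hot_blocks and d in hot_blocks and w >= edge_threshold}
    # Find connected components of the surviving graph (treated as undirected).
    adj = {}
    for s, d in kept:
        adj.setdefault(s, set()).add(d)
        adj.setdefault(d, set()).add(s)
    seen, traces = set(), []
    for v in adj:
        if v in seen:
            continue
        stack, comp = [v], set()
        while stack:
            u = stack.pop()
            if u in comp:
                continue
            comp.add(u)
            stack.extend(adj[u] - comp)
        seen |= comp
        # The hottest block in the component is the trace entry point.
        entry = max(comp, key=lambda b: hotness[b])
        traces.append((entry, comp))
    return traces

# Figure 3 example: blocks A, C, E are hot, A is hottest, edge threshold 8.
edges = {("A", "C"): 10, ("C", "E"): 9, ("E", "A"): 12,
         ("A", "B"): 3, ("B", "D"): 2, ("D", "F"): 1}
hotness = {"A": 50, "C": 30, "E": 20}
print(hot_traces(edges, hotness, {"A", "C", "E"}, edge_threshold=8))
```

For the example this yields a single trace with entry block A covering {A, C, E}, matching the loop of three vertices described in the text.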
1.5 Machine-Independent Optimizations
LLVM was originally developed as a research infrastructure at the
University of Illinois at Urbana-Champaign to investigate dynamic compilation
techniques for static and dynamic programming languages. LLVM can perform
its own optimizations (scalar, interprocedural, profile-driven, and loop
optimizations) and code generation from the intermediate form generated by
GCC front ends. The LLVM code generator is easily re-targetable, supporting
x86, PowerPC, MIPS, and various other ISAs. Because of these attractive
features, our dynamic optimization system uses the LLVM back-end to perform
machine-independent optimizations on the intermediate form (IR). To expose
more opportunities for optimization and to maximize the benefit of LLVM
optimization, our dynamic optimizer will try to aggregate smaller hot code blocks
into a longer trace using our hot trace generation method.
Another optimization we will consider is optimization for indirect branches.
The original program addresses must be used wherever the application stores
indirect branch targets. These addresses must be translated into their
corresponding code cache addresses in order to jump to the target code. This
translation is usually performed as a hash table lookup, which may be a source of
overhead for a dynamic optimizer. Instead, we will use the following approach.
With several rounds of execution and profiling, the frequently occurring branch
targets of an indirect branch instruction can be detected. The optimizer inserts a
code sequence at the bottom of the trace. The code sequence consists of a series
of compares and conditional direct branches for each frequent target. Hash table
lookup is performed only when the comparisons in the code sequence fail.
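The dispatch logic that the emitted compare-and-branch chain performs can be modeled as follows. The addresses and the `make_dispatcher` helper are hypothetical, used only to illustrate the fast-path/slow-path split:

```python
def make_dispatcher(frequent_targets, hash_lookup):
    """frequent_targets: {original_address: code_cache_address} for the
    few targets seen most often in profiling. hash_lookup is the
    fallback translation-table lookup."""
    def dispatch(original_target):
        # Inline chain: one compare + direct branch per frequent target.
        for orig, cached in frequent_targets.items():
            if original_target == orig:
                return cached
        # Slow path: full hash-table lookup, taken only on a miss.
        return hash_lookup(original_target)
    return dispatch

# Illustrative translation table from original to code-cache addresses.
translation_table = {0x1000: 0x9000, 0x2000: 0x9A00, 0x3000: 0x9F00}
dispatch = make_dispatcher({0x1000: 0x9000, 0x2000: 0x9A00},
                           translation_table.__getitem__)
print(hex(dispatch(0x1000)))  # hit in the inline chain
print(hex(dispatch(0x3000)))  # falls through to the hash lookup
```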
1.6 Machine-Dependent Optimizations
Itanium provides predicate bits, a mechanism that can turn the
effect of an instruction on or off by setting a bit. The compiler may generate
multiple versions of the binary code for different data access patterns or
different branch-taken frequencies. Based on profiling, we can set
the predicate bits to choose the appropriate version of the binary code. We
may also set the predicate bits to turn off some prefetch operations to reduce
the cache miss rate.
With profiling, we can collect information about frequent branches. The
x86 ISA has different kinds of branch instructions, depending on the
address offset, and these different kinds of instructions have different latencies. In
general, a branch instruction with a shorter offset has lower latency. We should
therefore place basic blocks that frequently branch to one another as close
together as possible, so that frequent branches can be replaced by
lower-latency branch instructions.
The register addressing mode has the lowest latency of all the addressing
modes in the x86 ISA. We should keep frequently accessed objects in registers,
replacing instructions that use other addressing modes with ones that use the
register addressing mode to improve performance. With profiling, we can
collect information about frequently accessed objects and apply this optimization to them.
Data locality may improve the efficiency of data cache accesses on the x86
architecture, because the hardware automatically prefetches nearby memory into
the cache. We should put data that are accessed at the same time in
neighboring locations so that data cache space is conserved.
1.7 Optimization for Multi-cores
Optimization for multi-core platforms is a much harder problem than for
single-core platforms, as we need to deal with parallel applications and with many more
shared resources. One of our optimizations for multi-cores
is to reduce resource contention caused by concurrent access to the same
resource by multiple threads. For example, if the profiling data shows that two
threads constantly compete for the same resource on one core, one
way to solve the problem is to propagate this information to the operating
system so that the OS scheduler can dispatch the two threads to different CPUs,
or lower the priority of one of the threads so that they will not execute at the
same time. Such optimization can be done either at user level or at system level.
The user-level approach is based on the assumption that the OS is capable of
taking hints from the hardware monitor through a user-level optimization
software layer. The system-level approach will require modifying the OS scheduler.
Another optimization problem we would like to investigate is disabling
over-aggressive prefetching to reduce cache miss rate. On single-core platforms,
prefetching is an effective mechanism to overlap computation with data access.
On multi-core platforms, however, prefetching needs to be done carefully. The
caches are usually shared by multiple cores (and thus multiple threads). Over-
aggressive prefetching of one thread may cause increased cache miss rates in
other threads. One solution is to disable some of the prefetch instructions.
A challenging research issue is to determine an appropriate set of prefetch
instructions to disable, striking a good balance between the benefit of prefetching
and the penalty of over-prefetching. One possible solution is to use hardware
support that provides prefetch information, such as whether the prefetched data is
actually used or is evicted from the cache before it is used.
1.8 Interaction with Annotations
Help for phase detection
Annotations can provide the code boundaries of important
procedures and loops. This information helps us define an appropriate
code region for each element of the vector that accumulates execution
frequencies in different code regions. Using only the hottest basic blocks to define
those code regions may fail to detect a phase change if two different phases have
similar hottest basic blocks.
Help for identification of hot code
Annotations can provide information about the expected execution frequencies
of basic blocks. This information helps us determine the execution frequency of
each basic block at runtime.
Help for hot trace generation
The code boundaries of important procedures and loops provided by
annotations can help us find the entry point of a hot trace, or help us group
more basic blocks together for more optimization opportunities with the LLVM
back-end code generator. For example, according to the algorithm described
above, we may generate two hot traces for two hot loops. However, these two
loops may be the main part of a single procedure. In this case, we should combine the
two traces into one hot trace, or even make the whole procedure one hot trace.
Help for optimization
Annotation information on functional unit use can guide the optimizer to
dispatch the threads that compete for the same functional units to different cores.
Annotation information on register use can guide the optimizer to replace
memory accesses with register accesses, using low-latency register reads to access
frequently used data.
Figure 4. Control flow of our dynamic binary optimization system
Having addressed all the important issues, we can now describe the control
flow of our dynamic binary optimization system (Figure 4). The Hardware
Performance Monitor (HPM) samples hardware events periodically and writes
the sampling data into a kernel buffer. When the buffer overflows, HPM Data is
produced from the samples in the buffer. HPM Data contains the timestamp,
the program counter of the instruction executing at the time of sampling, the
number of data cache misses, etc. An HPM buffer overflow activates the Phase
Detector to analyze the HPM Data and detect whether the behavior of the guest
program has changed. If a phase change is detected, the Optimizer is
triggered. The Optimizer identifies the hot code blocks of the guest program,
finds their corresponding IR code blocks by looking up the address mapping
table between guest binary and host binary, and then chains these hot IR code
blocks together to form hot traces in IR form. Next, these hot IR traces are fed to
the LLVM back-end for optimization and code generation. The generated code is
then passed to the Optimizer for further machine-dependent optimization.
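The control flow just described can be condensed into the following skeleton. Every component is reduced to a stand-in function (all of them assumptions) so the end-to-end shape of the pipeline is visible; a real implementation would plug in the actual Phase Detector, Hot Trace Generator, LLVM back-end, and code cache.

```python
def run_pipeline(hpm_data, prev_acc, ctx):
    """Called on each HPM buffer overflow; returns the new accumulation
    vector so the caller can compare the next interval against it."""
    acc = ctx["accumulate"](hpm_data)
    if ctx["distance"](prev_acc, acc) <= ctx["phase_threshold"]:
        return acc                           # no phase change: keep executing
    hot = ctx["hot_blocks"](hpm_data)        # hot code identification
    ir = [ctx["addr_map"][b] for b in hot]   # guest binary -> IR blocks
    for trace in ctx["make_traces"](ir):     # hot trace generation
        native = ctx["llvm_codegen"](trace)  # machine-independent opts + codegen
        # Machine-dependent pass, then install at the trace entry point.
        ctx["code_cache"][trace[0]] = ctx["post_opt"](native)
    return acc

# Trivial stand-ins, chosen only so the skeleton runs end to end.
ctx = {
    "accumulate": lambda data: [len(data)],
    "distance": lambda a, b: abs(a[0] - b[0]),
    "phase_threshold": 2,
    "hot_blocks": lambda data: {pc // 16 for pc in data},
    "addr_map": {i: f"ir_{i}" for i in range(16)},
    "make_traces": lambda ir: [tuple(sorted(ir))],
    "llvm_codegen": lambda trace: f"native({','.join(trace)})",
    "post_opt": lambda code: code,
    "code_cache": {},
}
acc = run_pipeline([0x10, 0x14, 0x18, 0x24], [0], ctx)
print(ctx["code_cache"])
```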
2. Work Plan
Year 1: Light-weight Profiling, Phase Detection, and Hot Code Identification
The goal of the first year is to develop a light-weight profiling mechanism, a
phase detector, and hot trace identification. The work items include:
profiling with Perfmon2
With Perfmon2, we can monitor runtime information for an individual thread
or an individual core, and store the runtime data in HPM Data. The data in HPM
Data represent runtime information for a set of samples. The data for one sample
include a program counter, a time stamp, a thread ID, a core ID, and counters for the
last-level cache miss event, the instructions retired event, and the clock cycles event. We
will develop the mechanism that retrieves data from the kernel buffer into
HPM Data when the buffer overflows. We will also implement API(s) to access
individual fields of any individual sample in HPM Data.
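A possible shape for the HPM Data structure and its field-accessor API is sketched below. The field names follow the text; the record layout and class names are assumptions, not the Perfmon2 buffer format.

```python
from dataclasses import dataclass

@dataclass
class HPMSample:
    timestamp: int
    pc: int              # program counter at the time of sampling
    thread_id: int
    core_id: int
    llc_misses: int      # last-level cache miss counter
    insts_retired: int
    cycles: int

class HPMData:
    """Accumulates samples drained from the kernel buffer on overflow."""
    def __init__(self):
        self.samples = []

    def drain(self, raw_records):
        # In a real system this would parse the kernel buffer format.
        self.samples.extend(HPMSample(*r) for r in raw_records)

    def field(self, index, name):
        """Access an individual field of an individual sample."""
        return getattr(self.samples[index], name)

data = HPMData()
data.drain([(1000, 0x400010, 7, 0, 12, 5000, 9000)])
print(data.field(0, "llc_misses"))  # 12
```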
Algorithm design and implementation for Phase detection
First, we will construct a number of program phases from the guest program
and define the code region for each program phase. A program phase could be a basic
block; however, the number of basic blocks may be too large for this to be an
appropriate choice. We will instead consider using the hottest basic blocks, on
average, as the code regions for phase detection.
Year 2: Implementation of Optimizer and Hot Trace Cache
The goal of the second year is to develop the hot trace generator and
implement machine-dependent optimizations on Itanium2. Work items include:
Design and implement the algorithm for generating hot traces.
Develop the mechanism to interact with the translator developed in sub-
project 2 to perform IR-level machine-independent optimizations.
This will require a method to map binary hot trace to LLVM IR form, and
develop API for passing the IR hot trace to the translator.
Design and implement the algorithms for machine-dependent optimizations
for Itanium. The optimizations include choosing the appropriate version of
binary code and minimizing cache contention by turning off some of the
prefetch instructions.
Year 3: x86 Optimizations, Multi-core Optimizations, and Annotation Support
The goal of the third year is to develop machine-dependent optimizations for
x86, optimizations for multi-cores, and improved optimizations with compiler
annotations. Work items include:
Design and implement the algorithms for machine-dependent optimizations
for x86. The optimizations include generating low-latency branch
instructions, improving register usage, and improving data locality.
Develop optimizations for multi-cores. The main objective is to reduce
contention in shared resources.
Improve phase detection and trace generation with procedure and loop
boundary annotations.
Improve machine-dependent optimizations with register-use annotations.
Improve multi-core optimizations with functional-unit-use and data-access
annotations.
Year 1 (08/01/2010 - 07/31/2011)
Main research items:
1. Study the hardware monitoring mechanisms on the multi-cores.
2. Design and implement the related API with the translator to get the address
mapping between guest binary code and host binary code.
3. Develop the phase detection algorithm.
Checkpoints:
 Getting the data from the HPM.
 Getting the mapping information from the translator.
 Detecting phase changes accurately.
Deliverables:
 The data structure for HPM data.
 Phase detector.

Year 2 (08/01/2011 - 07/31/2012)
Main research items:
1. Study the micro-architecture and instruction set of the targeted host machine.
2. Design the API with the translator to generate optimized code for hot code
regions.
3. Develop the algorithm to form long paths.
4. Develop machine-dependent dynamic binary optimizations.
Checkpoints:
 Generating hot traces for the LLVM back-end to generate optimized code.
 Optimizing the code generated by LLVM.
Deliverables:
 Hot trace generator.
 Machine-independent dynamic optimizer with LLVM back-end code.

Year 3 (08/01/2012 - 07/31/2013)
Main research items:
1. Study the OS's thread scheduler for the host.
2. Design the API with annotations to get the information for optimization.
3. Develop the new optimization algorithms with annotation data.
4. Develop techniques to generate scheduling hints from analysis of HPM data and
annotation data, as well as the technique to pass the hints to the OS scheduler.
Checkpoints:
 Reducing resource contention.
 Improving the effectiveness of the dynamic optimizer with annotation data.
Deliverables:
 Annotation-enhanced dynamic binary optimizer.
 Dynamic binary optimizer for multi-cores.
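The first-year item on an address-mapping API between guest and host binary code could, under simplifying assumptions, look roughly like the sketch below. The `AddressMap` class, its block-level granularity, and the same-offset translation are illustrative assumptions only; a real translator's map must handle code whose layout changes under translation.

```python
import bisect

# Hypothetical sketch of a guest-to-host address map kept by the translator.
# Each translated guest block records its guest start address, its length, and
# the host address of the generated code; a lookup translates any guest PC
# that falls inside a recorded block.

class AddressMap:
    def __init__(self):
        self._starts = []   # sorted guest block start addresses
        self._blocks = {}   # guest start -> (block length, host start)

    def record(self, guest_start, length, host_start):
        """Register a translated block with the map."""
        bisect.insort(self._starts, guest_start)
        self._blocks[guest_start] = (length, host_start)

    def to_host(self, guest_pc):
        """Return the host address for guest_pc, or None if untranslated."""
        i = bisect.bisect_right(self._starts, guest_pc) - 1
        if i < 0:
            return None
        start = self._starts[i]
        length, host = self._blocks[start]
        if guest_pc < start + length:
            return host + (guest_pc - start)  # assumes unchanged code layout
        return None

m = AddressMap()
m.record(0x1000, 0x20, 0x400000)
print(hex(m.to_host(0x1008)))  # -> 0x400008
```

Such a map is what lets HPM samples, which report host-side addresses, be attributed back to guest code regions for phase detection and re-optimization.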
 Vasanth Bala, Evelyn Duesterwald, Sanjeev Banerjia, “Dynamo: A transparent dynamic
optimization system”, Proceedings of the ACM SIGPLAN conference on programming
language design and implementation, p.1-12, June 18-21, 2000.
 D. Bruening, T. Garnett, S. Amarasinghe, “An Infrastructure for Adaptive Dynamic
Optimization”, Proceedings of the international symposium on code generation and
optimization, March 2003.
 W.K. Chen, S. Lerner, R. Chaiken, and D. Gillies, “Mojo: A dynamic optimization
system”, 3rd ACM workshop on feedback-directed and dynamic optimization, p.81-90,
December 2000.
 J. Lu, H. Chen, P.-C. Yew, W.-C. Hsu, “Design and Implementation of a Lightweight
Dynamic Optimization System”, Journal of Instruction-Level Parallelism, vol.6, 2004.
 Luk, C.-K., Cohn, R., Muth, R., Patil, H., Klauser, A., Lowney, G., Wallace, S.,
Reddi, V. J., Hazelwood, K., “Pin: Building Customized Program Analysis Tools with
Dynamic Instrumentation”, Programming language design and implementation, June 2005.
 “Jikes Research Virtual Machine (RVM)”, http://jikesrvm.org/
 Timothy Sherwood, Erez Perelman, Greg Hamerly and Brad Calder, “Automatically
Characterizing Large Scale Program Behavior”, In the 10th International Conference on
Architectural Support for Programming Languages and Operating Systems, October
2002.
 Christian Wimmer, Marcelo S. Cintra, Michael Bebenita, Mason Chang, Andreas Gal and
Michael Franz, “Phase Detection using Trace Compilation”, PPPJ’09, 2009.
 Abhinav Das, Jiwei Lu and Wei-Chung Hsu, “Region Monitoring for Local Phase
Detection in Dynamic Optimization Systems”, International Symposium on Code
Generation and Optimization, 2006.
 “PerfMon”, http://www.hpl.hp.com/research/linux/perfmon/.
 W. W. Hwu, S. A. Mahlke, W. Y. Chen, P. P. Chang, N. J. Warter, R. A. Bringmann, R. G.
Ouellette, R. E. Hank, T. Kiyohara, G. E. Haab, J. G. Holm, D. M. Lavery, “The
superblock: an effective technique for VLIW and superscalar compilation”, The Journal
of Supercomputing, v.7 n.1-2, p.229-248, May 1993.
 S. A. Mahlke, D. C. Lin, W. Y. Chen, R. E. Hank, R. A. Bringmann, “Effective compiler
support for predicated execution using the hyperblock”, Proceedings of the 25th annual
international symposium on Microarchitecture, p.45-54, December 01-04, 1992.
 “Low Level Virtual Machine (LLVM)”, http://llvm.org/
 Sherwood, T., Sair, S., and Calder, B., “Phase tracking and prediction”, International
symposium on computer architecture, 2003.
 Merten, M. C., Trick, A. R., George, C. N., Gyllenhaal, J. C., and Hwu, W.W., “A
hardware-driven profiling scheme for identifying program hot spots to support runtime
optimization”, International symposium on computer architecture, 1999.
 Karl Pettis, Robert C. Hansen, “Profile guided code positioning”, Proceedings of the
ACM SIGPLAN conference on programming language design and implementation,
p.16-27, June 1990.
 A. Ramirez, L. Barroso, K. Gharachorloo, R. Cohn, J. Larriba-Pey, P. G. Lowney, M.
Valero, “Code layout optimizations for transaction processing workloads”, Proceedings
of the 28th annual international symposium on computer architecture, p.155-164, 2001.
 P. P. Chang, W.W. Hwu, “Trace selection for compiling large C application programs to
microcode”, Proceedings of the 21st annual workshop on microprogramming and
microarchitecture, p.21-29, 1988.
 Brad Calder, Peter Feller, Alan Eustace, “Value profiling”, Proceedings of the 30th
annual ACM/IEEE international symposium on microarchitecture, p.259-269, December
1997.
 S. G. Abraham, R. A. Sugumar, D. Windheiser, B. R. Rau, Rajiv Gupta, “Predictability of
load/store instruction latencies”, Proceedings of the 26th annual international symposium
on microarchitecture, p.139-152, December 01-03, 1993.
 Todd M. Austin, Gurindar S. Sohi, “Dynamic dependency analysis of ordinary
programs”, Proceedings of the 19th annual international symposium on Computer
architecture, p.342-351, May 19-21, 1992.
 Scott McFarling, “Reality-based optimization”, Proceedings of the international
symposium on code generation and optimization, p.59 - 68, 2003.
 P. P. Chang, S. A. Mahlke, and W. W. Hwu, “Using profile information to assist classic
code optimizations”, Software-Practice and Experience, vol.21(12), p.1301-1321,
December 1991.
 Robert Cohn, P. G. Lowney, “Hot cold optimization of large Windows NT applications”,
Proceedings of the 29th annual ACM/IEEE international symposium on
microarchitecture, p.80-89, December 02-04, 1996.
 Todd C. Mowry, Chi-Keung Luk, “Predicting data cache misses in non-numeric
applications through correlation profiling”, Proceedings of the 30th annual
ACM/IEEE international symposium on microarchitecture, p.314-320, December 01-03,
1997.
 A. Srivastava, D. W. Wall, “A practical system for intermodule code optimization at
link-time”, Journal of programming languages, vol.1 (1), p.1-18, March 1993.
 C.-K. Luk, R. Muth, H. Patil, R. Weiss, P. G. Lowney, R. Cohn, “Profile-guided post-link
stride prefetching”, Proceedings of the 16th international conference on Supercomputing,
p. 167-178, 2002.
 C. B. Zilles, G. S. Sohi, “Understanding the backward slices of performance degrading
instructions”, Proceedings of the 27th annual international symposium on computer
architecture, p.172-181, June 2000.
 Goodwin, D. W., “Interprocedural dataflow analysis in an executable optimizer”,
Programming language design and implementation, June 16-18, 1997.
 A. Srivastava, A. Edwards, and H. Vo, “Vulcan: Binary translation in a distributed
environment”, Technical Report MSR-TR-2001-50, Microsoft Research, April 2001.
 Luk, C., Muth, R., Patil, H., Cohn, R., and Lowney, G., “Ispike: A Post-link Optimizer
for the Intel® Itanium® Architecture”, Proceedings of the international symposium on
code generation and optimization: feedback-directed and runtime optimization, March
2004.
 Patel, S. J. and Lumetta, S. S., “rePLay: A Hardware Framework for Dynamic
Optimization”, IEEE Transactions on Computers, vol.50 (6), p.590-608, June 2001.
 Fahs, B., Bose, S., Crum, M., Slechta, B., Spadini, F., Tung, T., Patel, S. J., and Lumetta,
S. S., “Performance characterization of a hardware mechanism for dynamic
optimization”, Proceedings of the 34th annual ACM/IEEE international symposium on
microarchitecture, December 01-05, 2001.
 Dehnert, J. C., Grant, B. K., Banning, J. P., Johnson, R., Kistler, T., Klaiber, A., and
Mattson, J., “The Transmeta Code Morphing™ Software: using speculation, recovery,
and adaptive retranslation to address real-life challenges”, Proceedings of the
international symposium on code generation and optimization: feedback-directed and
runtime optimization, March 23-26, 2003.
 Zhang, W., Calder, B., and Tullsen, D. M., “An Event-Driven Multithreaded Dynamic
Optimization Framework”, Proceedings of the 14th international conference on parallel
architectures and compilation techniques, September 17-21, 2005.