In the past, after binary code was generated by a static compiler, linked, and loaded into memory, it was pretty much left untouched during program execution and for the rest of its software life cycle, unless the code was re-compiled and changed. However, on more recent systems, code manipulation has been extended beyond static compilation time. The most notable examples are virtual machines such as the Java virtual machine (JVM), the C# runtime, and those of some recent scripting languages. An intermediate code format, such as bytecode in the JVM, is generated first and then interpreted at runtime. The intermediate code can also be compiled dynamically after hot code traces are detected during execution, and further optimized in its binary form for continuous improvement and transformation during program execution.
As a matter of fact, binary code manipulation is not limited to virtual machines. Virtualization at all system levels has become increasingly important with the advent of multi-cores and the dawn of utility computing, also known as cloud computing. When many compute resources are available on a platform, utilizing such resources efficiently and effectively often requires code compiled for one instruction set architecture (ISA) to be moved around over the Internet and run on platforms with different ISAs. Binary translation and optimization become very important to support such system virtualization.
There are also other important applications that require manipulation of binary code. The most notable is binary instrumentation, in which additional binary code segments are inserted at specific points of the original binary code. These added binary
code segments could be used to monitor the program by collecting runtime information during execution that can be fed back to profile-based static compilers to further improve their compiled code; to detect potential breaches of security protocols; to trace program execution for testing and debugging; or to reverse engineer the binary code and retrieve its algorithms or other vital program information.
The main objective of this research is to support high-performance binary code
manipulation, in particular to support system virtualization, in which binary translation and binary optimization are crucial to performance.
Existing binary manipulation systems such as DynamoRIO, QEMU, and Simics all assume that the binary code they take as input comes straight from existing optimizing compilers. Such
binary code is often assembled after linking with runtime libraries and other relevant files that
are needed for its execution. Hence, it has the entire execution code image. It allows such
binary manipulation systems to have a global view of the entire code and to perform global analyses that each individual code piece could not do during its own compilation phase.
However, such global analyses are notoriously time consuming. Furthermore, a lot of vital program information, such as control and data flow graphs, data types and liveness of variables, as well as alias information, is often not available, and is very difficult to obtain from
the original binary code. Such program information could be extremely valuable for a runtime binary manipulation system trying to carry out more advanced types of code translation and optimization.
Binary translation and optimization have been used very successfully in many applications.
In functional simulators such as QEMU and Simics, an entire guest operating system using a
different ISA could be brought up on a host platform with a completely different ISA and
operating system. Guest applications running on the guest operating system will be totally
unaware of the host platform and the host operating system it is run on. Such a virtualization
system has many important applications. One important application is to allow system and application software to be developed in parallel with a new hardware system that is still under development: the new hardware could be virtualized and run on a host system with a completely different ISA and OS. This could save a significant amount of
software development time and shorten the time to market substantially. Other virtualization
systems being developed on multicore platforms allow different OS’s to be run simultaneously
on the same multicore platform. This provides excellent isolation among these OS's for high security and reliability: a crashed OS will not affect the other OS's running concurrently on the same platform.
B. Previous related work
We have many years of experience in dynamic binary optimization systems. ADORE [ ], developed on single-core platforms, was started in the early 2000s on Intel's Itanium processor and later ported to microprocessors developed by Sun Microsystems. ADORE (see Figure 1) has several major components. It relies on the hardware performance monitoring system to help identify program phases, program control flow structures, and performance bottlenecks in the running program.
Because such a runtime binary manipulation system takes away compute resources from the application that is currently running, its overhead is counted as part of the execution time of the application it tries to optimize. Hence, the resulting program improvement has to be substantial enough to offset the overhead, or the overhead has to be minimized enough not to interfere with the original program execution. Even though
such binary manipulation could be done on a different core, thus not interfering with the program
execution, it still takes away major core resources from other useful work.
To minimize runtime overhead, ADORE uses the hardware performance monitoring system to sample machine states at a fixed interval. The interval size determines the overhead, the program phases, and the amount of runtime information that can be collected. When a stable program phase is detected, hot traces of the program execution are generated and optimizations are applied to them. Optimized hot traces are then placed in a code cache, and the original program is patched to transfer control to the optimized code stored in the code cache. If, for most of the execution time, the program executes from the optimized code in the code cache, good performance can be achieved.
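The sampling and phase-detection step described above can be sketched as follows. This is a simplified illustration under our own assumptions; the window size, coverage threshold, and working-set bound are hypothetical parameters, not ADORE's actual values:

```python
from collections import Counter

def detect_stable_phase(pc_samples, window=100, threshold=0.8, max_hot=4):
    """Scan fixed-size windows of sampled program counters; report a stable
    phase when a small set of addresses dominates a window."""
    for start in range(0, len(pc_samples) - window + 1, window):
        counts = Counter(pc_samples[start:start + window])
        top = counts.most_common(max_hot)          # candidate hot addresses
        coverage = sum(n for _, n in top) / window
        if coverage >= threshold:                  # few PCs dominate: stable
            return {pc for pc, _ in top}
    return None                                    # no stable phase found
```

Hot traces would then be built from the code around the returned addresses, optimized, and placed in the code cache.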
ADORE, COBRA and current NSF grants
C. Proposed approach
To provide valuable program information obtained during sophisticated static compiler analyses
for more effective and efficient runtime binary manipulation, we plan to have the static compiler
annotate the generated binary code with such information. The types and the extent of program
information useful for runtime binary manipulation will be one of the main subjects of this research.
For example, from the experience of ADORE, we found that it is extremely difficult to find free
registers that could be safely used by the runtime binary optimizer. As the runtime binary
optimizer will share the same register file as the code it tries to optimize, it needs to spill some
registers to memory for its own use. However, no application programming interface (API) convention has been defined for such interaction between the static compiler and the runtime optimizer. Hence, it is not even easy to find available registers to execute the spill code itself, because spilling a register requires at least one free register to hold the address of the memory location it plans to spill into.
Another example is that it is quite difficult to determine the boundaries of a loop, especially in loops with complex control flow structures. This makes it extremely difficult to insert memory prefetch instructions for identified long-latency delinquent load operations, because such prefetch instructions often need to be inserted at a location that is executed several loop iterations before the intended delinquent load in order to offset the long miss penalty. The difficulty arises because loops are often implemented with jump instructions that are hard to distinguish from the other jump instructions in the same loop or in nesting loops. Hence, annotating loop boundaries and/or the control flow graph of each procedure could save a lot of analysis time and give the runtime binary optimizer tremendous advantages in finding optimization opportunities.
Several potentially useful types of program annotations were identified in [Das]. They include:
(1) control flow annotations; (2) register usage; (3) data flow annotations; (4) annotations that will
be useful for exception handlers; (5) annotations describing the load-prefetch relationship.
Our research in annotations will be driven first by the needs of binary translation and binary optimization, in particular for multi-core systems. As most studies in this area have so far focused primarily on single-core systems, adding annotations for multi-core applications will add
additional complexity and challenges. In particular, information that could be used at runtime to balance the workload and to help mitigate synchronization overhead will be of particular interest to this research. (Expand on this)
C.1. Annotate binary code by expanding ELF and adding required information
The annotations will primarily be incorporated into the binary code. We plan to use the ELF binary format in our prototype, as ELF is a standard binary format used on most Linux systems. The
challenges will primarily be in several important areas:
(1) Size of annotations. Annotating all of the information mentioned in [ ] could take up a lot of memory space and expand the binary code size to the extent that it becomes unmanageable. Much of the annotated information may not be useful for a particular program. A carefully designed annotation format and encoding scheme could drastically reduce the annotation size and keep the ELF file as compact as possible.
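As one illustration of such an encoding scheme, annotation addresses could be delta-encoded and packed as ULEB128 variable-length integers, so that nearby addresses cost one byte instead of a full 8-byte word. This is a sketch of one possible format, not a committed design:

```python
def encode_uleb128(value):
    """Encode a non-negative integer as a ULEB128 byte string."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)   # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def encode_annotation_addresses(addresses):
    """Delta-encode a sorted list of annotated addresses, then pack each
    delta as ULEB128; small deltas between neighbors compress to one byte."""
    out, prev = bytearray(), 0
    for addr in sorted(addresses):
        out += encode_uleb128(addr - prev)
        prev = addr
    return bytes(out)
```

Three neighboring 64-bit addresses, which would occupy 24 bytes stored raw, pack into a few bytes this way.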
(2) API for annotations. Annotating only the types of runtime information that are useful to a particular program on a particular platform could further reduce its size. In many cases, the programmer has the needed inside knowledge about such information, for example, whether the target platform is single-core or multi-core, or whether the program is integer-intensive or floating-point intensive. Such knowledge could affect the types of runtime optimization that could be useful to the code. Hence, a carefully designed API that allows programmers to direct the types of annotations to be generated, e.g. CFG, DFG, register liveness information, or likelihood of exceptions, for particular code regions could benefit both the generation of annotations and the potential sets of runtime optimizations to be applied.
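A producer-side API along these lines might let the programmer request annotation kinds per code region with a set of flags. The flag and function names below are hypothetical, purely to illustrate the idea:

```python
from enum import IntFlag

class Annot(IntFlag):
    """Hypothetical annotation kinds a programmer could request."""
    NONE     = 0
    CFG      = 1 << 0   # control flow graph
    DFG      = 1 << 1   # data flow graph
    LIVENESS = 1 << 2   # register liveness
    EXCEPT   = 1 << 3   # likelihood of exceptions
    PREFETCH = 1 << 4   # load-prefetch relationship

def requested_annotations(directives, region):
    """Return the annotation kinds requested for a code region; regions the
    programmer did not mention get no annotations, keeping the binary small."""
    return directives.get(region, Annot.NONE)
```

For example, `directives = {"hot_loop": Annot.CFG | Annot.LIVENESS}` would direct the compiler to emit only CFG and liveness annotations for that region.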
(3) Dynamic update for annotations. The types of useful annotation might change after each
phase of binary manipulation. For example, during binary translation, we might want to add information that could be useful to the later runtime optimization. Binary translation might also change the original CFG or DFG, and some runtime information may reveal that certain original annotations have become useless, so they can be trimmed to reduce overall code size.
(4) Code security. As runtime binary manipulation could potentially alter the original binary code, avoiding accidental overwriting of code regions during annotation updates, or during binary translation and optimization, should be carefully considered. For example, using offsets from a base address for all memory references, instead of absolute memory addresses, could avoid accidentally stepping into forbidden code regions and improve security.
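The offset-based referencing idea amounts to a bounds check on every resolved reference. A minimal sketch (the function name and error handling are our own assumptions):

```python
def resolve_annotation_ref(region_base, offset, region_size):
    """Resolve an annotation reference stored as an offset from a region's
    base address, rejecting any offset that would escape the region."""
    if not 0 <= offset < region_size:
        raise ValueError("annotation offset escapes its code region")
    return region_base + offset
```

Because every reference is validated against its region's size, a corrupted or stale annotation cannot redirect an update into forbidden code.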
C.2. API for specifying the program information to be included in the binary code
To allow flexibility and “annotate as you go”, we need to provide a good API for the user/compiler to specify the types of annotation needed, and a good API to allow the binary manipulator (i.e. the translator and the optimizer) to access annotations without knowing their format and arrangement. This makes future upgrades and changes to the annotations possible without changing the users of the annotations, as the API is standardized.
C.3. Annotations useful in sub-project 2 for binary translation
The main purpose of binary translation is to translate binary code in the guest ISA into binary code in the host ISA. In sub-project 2, we propose to build a binary translator based on a QEMU-like functional simulator. In such a translator, each guest binary instruction is first interpreted and translated into host binary instructions. In QEMU, each guest instruction is translated into a string of micro-operations defined in QEMU. These micro-operations are then converted to the host ISA. The advantage of such an approach is that the machine state can be accurately preserved and converted from one ISA to another. Hence, even privileged instructions can be interpreted and translated this way, which allows an OS to be booted and run on QEMU accurately. However, the overhead of such an approach is quite high, as each guest instruction requires several micro-operations to carry out, and each micro-operation may in turn require several host machine instructions to execute.
As shown in sub-project 2, we plan to interpret and translate guest instructions on a per-basic-block basis. These basic blocks will be translated into an intermediate representation (IR) format, such as that used in LLVM. They will be optimized at the IR level using the annotated information provided in the guest binary code. Such optimization will usually be mostly machine independent and fast. The LLVM code generator can then be used to generate machine-dependent host binary code. When hot traces are identified, further machine-dependent optimizations will be applied to them, and they will be kept in a hot-trace code cache for fast program execution and further optimization. Information collected by the runtime hardware performance monitoring system could be used in such further optimizations.
How the annotated information in the guest binary code is used has many implications, and it in turn determines what kinds of annotated information will be useful. Past experience shows that, instead of interpreting each guest binary instruction first and starting to translate only after hot traces are identified, it is actually more efficient to translate each guest binary instruction as it is being executed. Code optimization can be applied after hot traces are identified. Optimized hot traces are then placed in the code cache for future execution. If the coverage of the traces in the code cache is high, most of the execution time will come from the optimized hot traces in the code cache, and the overall performance could be significantly improved.
In the approach proposed in sub-project 2, we will first translate the guest binary into a
machine independent IR format on a per basic block basis, and carry out machine independent
code optimization at the IR level followed by code generation to the targeted host machine with
another ISA. The annotation in the guest binary code could then be used directly in machine
independent code optimization within each translated basic block before host machine code is
generated. The annotated information could help in forming hot traces and in global optimization across basic blocks within an identified hot trace. Some annotations, such as the branch target information for indirect branches identified in [Das], could also be useful in this process.
C.4. Annotations useful in sub-project 3 for further binary optimization
Many types of annotation that could be useful in binary optimizations for single-core platforms have been identified in [Das]. We plan to identify more annotated information useful on multi-core platforms.
Here, we have two possible scenarios. One is similar to a traditional binary optimizer, such as ADORE, in which the guest binary code uses the same ISA as the host platform. The other is for the binary optimizer to optimize the translated binary code from a binary translator. In that case, annotated information may have to be converted to match the translated host binary code as well. In sub-project 3, our main focus will be on optimizing the translated host binary code. As
mentioned earlier, there are two levels of optimization that could be performed. One is machine
independent optimization at the machine independent IR level. The other is at the machine
dependent host binary code level after runtime information is collected during program execution.
Even some machine-dependent optimizations, such as register allocation, might be parameterized and performed during the machine-independent optimization phase at the IR level. For
example, if we know the number of registers on the host platform and their ABI convention such
as register assignment in a procedure call, the types of annotation in the guest binary code such as
alias and data dependence information could be used to optimize register allocation.
Other useful annotations include:
(1) Control flow graph. It could be used in determining the loop structures for hot traces to
increase optimized code coverage. It has also been proposed that edge profiling information could
be used to annotate each branch instruction and indicate whether it is most likely to be taken or
not-taken. Hot traces could be formed this way during the binary translation phase. Indirect branch instructions, whose branch targets are not known until runtime, are the most challenging for forming the CFG. However, many indirect branches come from high-level program structures such as the switch statement in a C program, or the return statement of a procedure. If such information could be annotated, their target instructions could be identified and a more accurate CFG could be constructed.
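As a sketch of how annotated taken/not-taken likelihoods could drive trace formation, the following walks the most-likely successor of each block until the trace closes on itself or reaches a length bound. The data layout and names are our own illustration:

```python
def form_hot_trace(cfg, branch_likely, entry, max_len=8):
    """Build a straight-line hot trace from `entry` by following the
    annotated most-likely successor of each block.

    cfg:           block -> list of successor blocks
    branch_likely: block -> index of its most-likely successor (annotation)
    """
    trace, block = [], entry
    while block is not None and block not in trace and len(trace) < max_len:
        trace.append(block)
        succs = cfg.get(block, [])
        block = succs[branch_likely.get(block, 0)] if succs else None
    return trace
```

Here `branch_likely` stands in for the per-branch edge-profile annotation; in practice it would be read from the annotated binary.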
(2) Live-in and live-out variables could be annotated to provide information for better local register allocation. If the optimization is performed after binary translation, the live-in and live-out information might need to be updated, as the translated code might introduce additional values that need to be passed on to the following basic blocks. Here, the register acquisition problem may not be as serious as in ADORE, because register allocation will be performed during the binary translation process. Free registers still available after the translation phase could be annotated for the binary optimizer to use later.
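With such liveness annotations available, the optimizer could compute which registers are safe to use in a block without spilling. A conservative sketch (register names are illustrative):

```python
def free_registers(all_regs, live_in, live_out, defined):
    """Conservatively compute registers the optimizer may use inside a basic
    block: not live on entry or exit, and not defined within the block."""
    return set(all_regs) - (set(live_in) | set(live_out) | set(defined))
```

This directly addresses the register acquisition problem described for ADORE: with the annotation, no register needs to be spilled just to discover a free one.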
(3) Data flow graph. Def-use and use-def information could be useful in many well-known compiler optimizations. Other useful information includes alias information and data dependence information. Such information could be useful in register allocation and some well-known partial redundancy elimination (PRE) optimizations. It has even been suggested that some information obtained from value profiling could be annotated to help in tracking data flow information. However, the effectiveness of such optimizations, and the amount of runtime overhead they require, are interesting research issues.
(4) Exception handlers. Many optimizations could alter the order of program execution and, hence, could alter the original machine state when an exception is thrown. However, binary code usually does not carry information on where exception handlers will be used. To produce accurate machine states when exceptions occur in the original guest binary code, a lot of potential code optimization, such as code scheduling, needs to be conservatively suppressed, which could have a very significant impact on program performance. By annotating the regions of exception handlers and avoiding aggressive optimizations only in those regions, such concerns could be alleviated.
(5) Prefetch instructions and their corresponding load instructions. Prefetching can be very effective in reducing miss penalties and cache misses on single-core platforms. However, prefetch instructions usually come with additional instructions to compute prefetch addresses. Also, because of its proven effectiveness, existing optimizing compilers often generate very aggressive, and often excessive, prefetch instructions. Such excessive prefetch instructions could consume a lot of bus and memory bandwidth if their corresponding load instructions are not delinquent loads as originally assumed. It has been shown that selectively eliminating such aggressive prefetch instructions actually improves performance in many programs on single-core platforms [ ]. In a multi-core environment, where multiple programs may execute concurrently, overly aggressive prefetching from one program could adversely affect the other programs running on the same platform, because it takes away valuable bus and memory bandwidth they need. Annotating each load instruction with its corresponding prefetch instruction and the supporting address-calculation instructions could help eliminate those instructions when the load is no longer delinquent in a particular run.
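Given such a load-prefetch annotation and per-load miss rates from the hardware performance monitors, the pruning decision itself is simple. A sketch under our own assumptions (the 5% delinquency threshold and the data layout are illustrative):

```python
def prune_prefetches(pairs, miss_rate, threshold=0.05):
    """Given annotated (load, prefetch, addr_calc_instrs) triples and each
    load's measured miss rate in this run, return the instructions that can
    be removed because the load is no longer delinquent."""
    removable = []
    for load, prefetch, addr_calc in pairs:
        if miss_rate.get(load, 0.0) < threshold:
            removable.append(prefetch)
            removable.extend(addr_calc)   # drop supporting address math too
    return removable
```

Without the annotation, the optimizer could not safely tell which address-calculation instructions exist only to feed the prefetch.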
(6) Workload and synchronization information for parallel applications. For parallel applications running on multi-core platforms, the types of annotation that could help improve program performance, or help track program execution for testing and debugging, will be identified and studied. Some well-known information that could help improve parallel programs includes the workload estimate of each thread and the synchronization information that identifies critical sections or signal and wait instructions. Such information could help in balancing the workload at runtime and in reducing synchronization wait time and overhead. Other useful information, such as that identified in (5), which could improve bus and memory bandwidth usage as well as load latency on a multi-core platform, and the tracking of cache coherence traffic to remove potential false data sharing, will be of great interest to our study.
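To illustrate how annotated per-thread workload estimates could be consumed at runtime, the following applies a standard greedy longest-processing-time assignment, placing each thread on the currently least-loaded core. This is one possible balancing policy, not a committed design:

```python
import heapq

def balance_workload(work_estimates, num_cores):
    """Greedy LPT scheduling: assign threads, heaviest annotated workload
    first, to whichever core currently has the smallest total load."""
    heap = [(0.0, core) for core in range(num_cores)]
    heapq.heapify(heap)
    assignment = {}
    for tid, work in sorted(work_estimates.items(), key=lambda kv: -kv[1]):
        load, core = heapq.heappop(heap)   # least-loaded core so far
        assignment[tid] = core
        heapq.heappush(heap, (load + work, core))
    return assignment
```

The quality of the resulting balance depends directly on how accurate the annotated workload estimates are, which is itself a question this research would study.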
Evaluation of the effectiveness of adding annotations to the binary code will be another
major effort in this project. The Open64 compiler will be used as our main platform for generating annotations. Open64 is a very robust, production-level, high-quality open-source compiler. It has all of the major components of a production compiler and also supports a profile-based approach. It has good documentation and a large, active user community. It is currently supported by major companies such as HP and AMD, and by major compiler groups at the University of Minnesota, the University of Delaware, the University of Houston, and several others. It supports most major platforms, including Intel IA-32, x86-64, Itanium (IA-64), MIPS, and CUDA. It is a very general-purpose compiler that supports C, C++, Fortran, and Java.
We plan to use it to generate all of the useful binary annotations and to study their effectiveness in improving program performance, the resulting code size expansion, and the API support needed.
D. Other related work
E. Methodology and our Work Plan (for each year)
E.1 The annotation framework
Figure 1 shows this sub-project's framework, which can be divided into two parts: the annotation producer and the annotation consumer. The producer's functionality focuses on producing annotations and embedding them into the binary file; its main component is the compiler. The consumer's functionality reads annotations from the binary file and tries to leverage them efficiently; its main components are the binary translator and the binary optimizer.
The annotation data come from two sources: one is static compiler analysis, and the other is analysis of the program's profiles. We will adopt Open64 as the compiler; it is an open-source optimizing compiler for the Itanium and x86-64 microprocessor architectures. The compiler derives from the SGI compilers for the MIPS R10000 processor and was released under the GNU GPL in 2000. Open64 supports Fortran 77/95 and C/C++, as well as the shared
memory programming model OpenMP. The major components of Open64 are the frontend for C/
C++ (using GCC) and Fortran 77/90 (using the CraySoft front-end and libraries), inter-procedural
analysis (IPA), loop nest optimizer (LNO), global optimizer (WOPT), and code generator (CG). It
can conduct high-quality inter-procedural analysis, data-flow analysis, data dependence analysis,
and array region analysis.
Figure 1: The annotation framework.
E.2 The Producer
Figure 2: The producer’s components and data flowchart.
Figure 2 shows how the producer annotates information into the ELF executable file. There are two ways to produce the annotated ELF executable, according to the annotation source: static compiler analysis and profile analysis.
In the static compiler analysis path, the program source code is compiled by the modified Open64 compiler, which produces assembly code and annotation data. At the compilation phase, however, virtual addresses have not yet been assigned, so only function labels can be identified. Therefore, if the annotation data describe information inside a function, the function label serves as the base point and each annotation's position is recorded as an offset from the function label. After the assembler assembles the assembly code and produces the ELF executable file, the virtual addresses of functions and instructions are assigned. The annotation data and the corresponding virtual addresses are then combined by the annotation combiner to produce the annotated ELF executable file.
In the profile analysis path, profile data are analyzed by the profile analyzer to produce useful annotation data. The annotation combiner then merges these annotation data with those from the static compiler analysis path to produce the annotated ELF executable file.
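The annotation combiner's address-resolution step can be sketched as follows: compile-time annotations carry only a function label and an offset, and once the assembler has assigned virtual addresses they are rewritten to absolute form. Names and data layout here are our own illustration:

```python
def combine_annotations(annotations, symbol_table):
    """Rewrite (function_label, offset, payload) annotations produced at
    compile time into (virtual_address, payload) entries, using the symbol
    table the assembler/linker produced."""
    combined = []
    for label, offset, payload in annotations:
        combined.append((symbol_table[label] + offset, payload))
    return combined
```

The same routine also serves the profile path, since profile-derived annotations can be expressed against the same labels before merging.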
Annotation information can be stored as a new section in the ELF file (for example, “.annotate”).
This section can be loaded into memory after the text segment by setting the SHF_ALLOC flag in
the section header for .annotate section and adding this section to the program header table with the
PT_LOAD flag set. In order for the consumer (sub-project 2 or sub-project 3) to find the location of the .annotate section, the section-name string table should be loaded in memory too.
Modifications can be made to the ELF file to enable this. The consumer can now read the in-memory
representation of the ELF headers and load the contents of the .annotate section.
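The lookup through the section-name string table can be sketched as follows; each (name_offset, addr, size) triple stands in for the fields a real ELF section header (sh_name, sh_addr, sh_size) would provide:

```python
def find_section(sections, shstrtab, wanted=".annotate"):
    """Locate a section by name the way the consumer would: each section
    header holds an offset into the section-name string table (shstrtab),
    where names are stored NUL-terminated."""
    for name_off, addr, size in sections:
        end = shstrtab.index(b"\x00", name_off)   # name ends at next NUL
        if shstrtab[name_off:end] == wanted.encode():
            return addr, size
    return None
```

A real consumer would first parse the ELF header to reach the section header table and shstrtab; this sketch assumes those have already been located.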
E.3 The annotation granularity
There are four annotation data levels: the program level, the procedure unit level, the basic block level, and the instruction level, according to the range of the described information. The instruction level needs the most storage space, because the annotation data have to record each instruction's information. Each instruction maps to a virtual address at runtime; therefore, the annotation data contain the instruction information and its corresponding virtual address. At the other extreme, the program level needs the least storage space, because the described information covers the whole program and no corresponding virtual addresses need to be kept.
Data at the program level describe whole-program information. For example, the hot trace annotation marks a set of basic blocks that form a frequently executed path, called a hot trace, which is then treated as an optimization unit at runtime. The phase change annotation marks which program regions are hit frequently; when the program's execution path hits these regions more than a threshold number of times within a period at runtime, a phase change can be identified. The inter-procedure loop annotation records the relations among loops across procedures; it can help in building hot traces.
Data at the procedure unit level describe each procedure's information, such as the intra-procedure loop annotation, which records the information of loops within a procedure. A loop is usually executed many times at runtime; therefore, it is often treated as an optimization unit.
Data at the basic block level describe each basic block's information. Register usage is an annotation that labels each basic block's register usage; this information can be used to identify free registers at runtime.
Data at the instruction level describe each instruction's information, such as a memory reference annotation, which records each instruction's memory references. If the memory reference information is accurately known, instruction rescheduling can further be performed by the dynamic binary optimizer.
E.4 The Consumer
Figure 3: Annotation framework’s position in the virtualization system.
Figure 3 shows this sub-project's position in the whole virtualization system (red parts). The goal of this sub-project is to assist sub-project 2 in performing dynamic binary translation efficiently and to help sub-project 3 perform advanced optimizations. Therefore, sub-projects 2 and 3 are the consumers of this sub-project. The annotation data are first embedded into the ELF executable file and then read by sub-project 2. When sub-project 2 performs dynamic binary translation, the ranges of information described by the annotation data are affected, because the memory layout of the guest machine is changed to the memory layout of the host machine. Therefore, the annotation data will be adjusted when sub-project 2 performs dynamic binary translation, so that the dynamic binary optimizer can read correct annotation data.
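The address adjustment performed after translation can be sketched as follows. For simplicity we assume a linear guest-to-host mapping within each translated block, which a real translator would replace with a per-instruction map; names and data layout are our own:

```python
import bisect

def remap_annotations(annotations, block_map):
    """Rebase annotated guest addresses into translated host code.

    block_map: sorted list of (guest_start, guest_end, host_start) for each
    translated block. Annotations pointing at untranslated code are dropped,
    so the optimizer never reads a stale guest address.
    """
    starts = [b[0] for b in block_map]
    remapped = []
    for guest_addr, payload in annotations:
        i = bisect.bisect_right(starts, guest_addr) - 1
        if i >= 0:
            g_start, g_end, h_start = block_map[i]
            if guest_addr < g_end:
                remapped.append((h_start + (guest_addr - g_start), payload))
    return remapped
```

After this pass, the dynamic binary optimizer of sub-project 3 can consume the annotations against host addresses directly.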
F. Work Plan
There are three tasks in the first year. The first is to identify useful annotations from static compiler analysis that improve the efficiency of the dynamic binary translator. The second is to identify beneficial annotations from static compiler analysis that help the dynamic binary optimizer perform advanced optimizations. To achieve these two tasks, we have to understand in detail the runtime behaviors of benchmarks as they are manipulated by the dynamic binary translator and the dynamic binary optimizer. The last task is to modify the compiler to produce the annotation data identified in the previous two tasks.
There are three tasks in the second year. The first is to design a guest binary annotation encoding format for the dynamic binary translator. The second is to implement, in the dynamic binary translator, the annotations identified in the first year. The last is to identify useful annotations from the profile data produced by the dynamic binary translator and feed them back to the dynamic binary translator.
There are three tasks in the third year. The first is to translate the guest binary annotation encoding format into the host annotation format for the dynamic binary translator. The second is to implement, in the dynamic binary optimizer, the annotations identified in the first year. The last is to identify useful annotations from the profile data produced by the dynamic binary optimizer and feed them back to the dynamic binary optimizer.
G. Milestones and Deliverables
First year:
1. Identifying various annotations from static compiler analysis that help binary translation and binary optimization, and estimating the performance gain and overhead of using these annotations.
2. Modifying the Open64 compiler to produce the corresponding annotations, and estimating the resulting size overhead.
Second year:
1. Designing the annotation format for the binary translator to consume, and embedding the annotations into the ELF file.
2. Equipping the binary translator with the annotation schemes identified in the first year, and evaluating the performance gain and overhead when the dynamic translator uses them.
3. Identifying various annotations that help binary translation from the profiles produced by the binary translator, and implementing these annotation schemes in the binary translator.
Third year:
1. Transforming the annotation format from the binary translator to the binary optimizer.
2. Equipping the binary optimizer with the annotation schemes identified in the first year, and evaluating the performance gain and overhead when the dynamic optimizer uses them.
3. Identifying various annotations that help binary optimization from the profiles produced by the binary optimizer, and implementing these annotation schemes in the binary optimizer.
4. Evaluating the performance gain and overhead for the virtualization system when using the annotations.