Intermediate languages are with us for quite a long time. Most of the
compilers employ them in one-way or another. The way of using
intermediate languages have also sometimes contributed to the
success of the languages like Pascal, Java and recently C# (with .NET
initiative in general). Even some tools like MS-Word makes use of an
intermediate language (p-code). Intermediate languages have made a
big difference in achieving portability of code. This article explores the
concept of intermediate languages and their use in achieving complete
portability and language interoperability.
In 1950s itself, this idea of having an intermediate language was
experimented with the introduction of UNCOL . Later Pascal became
widely available because of p-code. Not until recently did Java
employed the same technology with its promise of portability to
become a big success. Now Microsoft's .NET initiative is also based on
this approach, but the keyword is language interoperability. A closer
look reveals that all are based on same idea: use of intermediate code,
and that is the current trend in the language translation technology.
Approaches in Language Translation
Two major approaches towards language translation are compilation
and interpretation. In compilation, since the native code is generated,
different compilers are needed for different platforms. An interpreter is
also particular to the platform, but its specifics are abstracted making
only the behavior of the program transparent as the code is translated
and executed 'on-the-fly'.
Neither of them is ‘the’ best approach - both boasts of significant
advantages and also suffer from some serious disadvantages.
The advantages of compilation mostly overweigh its disadvantages and
that is why it is preferred in most of the languages. The translation
and optimization needs to be done only once, and native code ensures
efficiency of execution. However as native code is generated, it is
Interpretation has some very good advantages that compilation
doesn’t have. In particular, they can achieve full portability. Also it can
provide better interactive features and diagnostic (error) messages.
But one problem that cripples interpretation is performance. Since
translation needs to be done every time (and for the same reason,
optimization is less attractive) execution becomes generally very slow
(10 to 100 times) compared to the equivalent compiled code.
You can think of an interpreter as the one that implements a virtual
machine. So, it can be inferred that the 'machine language' of that
virtual machine is the high-level language for which the interpreter is
Interpreters provide many advantages including greater flexibility,
better diagnostics and interactive features like debugging. True power
of interpretation is generally underestimated. To illustrate, you can
think of printf library routine in C as a tiny interpreter for formatted
output - depending on the requirement it parses the format string,
gets the arguments, computes the format on the fly, prints it to the
output along with facilities like padding, aligning, specific notations etc.
Interpretation allows the code to be generated or modify itself quot;on the
flyquot; and execute it. For example, interpreted languages like Lisp or
even partly interpreted languages like Java or C# provides such
facilities (although not recommended to use such facilities for practical
programming). Note that few language features like reflection are only
possible with interpretation. This is the reason why compilers cannot
be written for some languages and interpretation is the only option.
So, instead of restricting to pure compilation or interpretation, you can
infer that a combination of the both is better to enjoy the advantages
of these approaches.
It seems to be very simple to use both compilation and interpretation.
For that itself, various languages use different approaches.
Few languages provides options for both interpretation and compilation
and can be used depending on the requirement. For example, for
development and debugging, Microsoft Visual Basic uses an interpreter
and for deployment, it employs a compiler.
As already stated, writing compilers for the languages that are meant
for interpretation is sometimes not possible. In such cases, it is
possible to bundle the source code with the whole interpreter as an
executable file and distribute it. Java Native Interface (JNI) follows a
similar approach by embedding a Java interpreter in an executable
code created by a C/C++ source code, and run the Java code through
it. This solution is feasible because the Java interpreter’s size is not
A very interesting approach towards combining compilation and
interpretation through hardware support was widely used in as early
as 1960s in IBM chips. IBM released chips that incorporated the
interpreter software burnt into chips (nothing but firmware) that
supported a general instruction set. The main advantage being that
the same instruction set (of the interpreter) shall be used across
different processor architectures. Since, the compilers would generate
the code just for that one instruction set, the effort is minimized.
Later, the advent of VLSI design with RISC architecture obviated this
The very popular and widely used approach that combines the
compilation and interpretation is making use of intermediate code
directly. Since its use in Java this approach seems to have received
widespread acceptance, even though its idea was used long back in
Pascal and in many other languages. Languages following this
approach are sometimes referred to as p-code languages since p-code
of Pascal is one of the first intermediate codes to follow this approach
successfully. Few of important p-code languages are Python, Perl, Java
and now, C# (.NET framework). The idea is to use a compiler to
generate the intermediate language code. The intermediate code can
be moved on to other platforms or even across the network where an
interpreter (also called as virtual machine) for that platform will take
over to interpret the code.
The usage of intermediate code in compilers is not a new idea. There
are various phases in a compiler and they can be grouped generally as
the front-end and the back-end. The analysis and translation of the
code (in general 'parsing') is done by the front end and the synthesis
of the code is taken care by the back-end. The information in the form
of intermediate languages is represented that is generally independent
any particular architecture and thus serves as a link to transfer the
information from front-end to the back-end. The advantage with this
approach is that, the front-end can be same and only the back-end
needs to be changed to support new architectures and hence the effort
is limited only in rewriting the back-end of the compiler. Intermediate
languages also help in general optimization of the code that are not
particular to any platform and provides many related advantages. So
most of the compilers employ intermediate code.
If intermediate languages have been in usage for a long time in
compilers, why is this renewed interest?
The difference is that previously most of the compilers hid the use of
intermediate languages as ‘implementation details’. But now, to make
the best use of them, they are extended, exposed and used in
conjunction with interpretation.
An advantage with the intermediate language approach is ‘binary code
portability’, i.e., the next level of portability promised by languages
like C/C++ that provides source code portability. As long as an
interpreter is there to understand and execute the intermediate code,
no recompilation is necessary to move the code to a new platform. It
also provides the ability to verify code and thus type-safety can be
ensured and as a result, ability to build security systems on that.
Intermediate languages can also provide the advantage of ‘language
interoperability’. As more than one language can be converted to the
same intermediate language, the code compiled through two different
high-level languages can interact at intermediate code level agnostic of
the different possible source languages. A related benefit is that the
number of translators that needs to be written is also drastically
reduced. This is a significant benefit as the cost of retargeting the
translator is high and writing a new one is very tedious. As a result
only a few translators need to be developed.
To illustrate, in traditional compilation approach, for N high-level
languages and M target machines, N * M translators are required. With
intermediate language approach, where the IL is capable of supporting
N high-level languages and M target platforms, then the number of
translators required effectively reduces to just N + M.
Apart from portability and interoperability, there are many features
that make the use of intermediate languages attractive. One of the
advantages that seems to be insignificant at the first look is that the
file size tends to be small compared to the equivalent native code.
That is one of the reasons why intermediate code approach was
popular around 1980s when disk and memory space was precious. The
small size of the code is advantageous even now: in the networked
environment, the small size of applets (nothing but Java class files)
makes their transmission across the network faster.
Approaches in Representing Intermediate Code
A challenge for an intermediate language is to satisfy seemingly two
1) To represent the source code without significant loss of
2) To keep only the necessary information so that the source code
can be translated directly and efficiently to the target machine
Depending on the level of abstraction of the intermediate language, we
can classify them as high-level, medium-level or low-level ILs.
Usually the high-level ILs are represented as trees or DAGs (Directed
Acyclic Graphs) and sometimes as stack based. The middle-level ILs
typically use approaches like triples and quadruples. The low-level ILs
generally use code closer to the assembly language of the target
High-level ILs takes more effort to get the code converted to the
native code, but are sufficiently high-level that it can represent rich
features of various source languages directly. So it is possible to
retarget new languages to high-level ILs. However, more effort is
needed in converting the intermediate code to native code. On another
extreme, low-level ILs make the translation to native code easier, but
it is generally hard to retarget other high-level languages than that it
is originally intended for.
Another design approach is to abstract the programming language (for
example, ANDF) or the machine (for example, RTL) and an IL has to
strike a balance between the two extremes.
There are many methods used to represent the information at
intermediate level. Some widely used approaches are discussed here.
In most of the language compilers, three address codes are generated
as intermediate code and later code-generators generate code for that
particular target machine. They are mostly register based, and are
fairly low-level, so target-code generation is straightforward. A similar
approach is Register Transfer Language (RTL) widely used in GNU C
compilers. Following is an example for three address code:
R1 := R2 + R3
And the general syntax is.
X := Y op Z
An interesting approach to represent information at intermediate level
is object files. With each platform, the format of object code file and
the information stored differs. This makes them platform dependent.
An effort has been made to combine the information that is required in
various formats. It resulted in creating a huge format that contains
all such details, and is referred to as 'fat binaries' effectively used in
NExtStep platforms. On the other hand, 'slim binaries' are also
available that generates the portable object code 'on-the-fly'.
ANDF [2,3] stands for Architectural Neutral Distribution Format
abstracts high-level language constructs. It aims at converting various
language constructs in a programming language to an intermediate
form that shall retain all the information available in the source. It
takes care to represent different language semantics, platform
dependencies etc. Languages including C, C++, Fortran, Ada has ANDF
producers (high-level compilers). This is achieved through an
intermediate representation of the language constructs (keeping its
variants in various programming languages). For example, expressions
in the source language are converted into EXPs, identifiers as TAGs,
parameterization as TOKENs etc. ANDF follows a tree representation of
constructs called CAPSULEs. It also has installers (low-level compilers)
in wide variety of platforms including Dec-Alpha, Sun-Sparc, PowerPC
and Intel and executed as native code.
Using another High Level Language!
Instead of creating a new intermediate language, this approach is
towards using a high-level language as a intermediate language that is
already available, highly portable, efficient and sufficiently low-level.
As you would have guessed, C is an interesting high-level language
with ability to write low-level code. Owing to this property, it is
sometimes referred to as a portable assembler or 'middle-level'
language. The beauty of C is that, in spite of its low-level abilities, the
code is highly portable. Due to this nature, C has effectively served as
a form of a portable intermediate language for other high-level
programming languages. In fact, in initial days of UNIX, translators
generating C code as target code was written so that re-writing them
for every platform were not required. Many languages follow this
approach, including Mercury, Scheme, APL, Haskell etc. Also GNU
Fortran and Pascal compilers (f2c and p2c) generate C code and native
C compilers take over to compile and execute them.
This representation assumes the presence of a run-time stack and
generates code for that. It uses the stack for evaluation of expressions
by making use of stack itself to store the intermediate values. Thus
the underlying concept is very simple and that is its main strength.
Also, least assumptions are made about the hardware and support
available in the target architecture. When a virtual machine (runtime
interpreter) simulates a stack-based machine and provides necessary
resources, this approach can effectively provide platform-
independence. In other words, the virtual machine becomes a platform
of its own that shall be simulated in possibly any hardware platform
(and thus needs to be written for those platforms). One problem with
this approach however is that the code optimization is difficult. Java
virtual machine is a very good example for such stack-based
Other major efforts include Oberon Module Interchange (OMI) ,
'Juice' that is based on OMI, Dynamic binding and VP code of TAO
operating system and U-Code.
Implementations based on stack approach
Since stack based machines are one of the most successful ones and is
gaining widespread acceptance in recent times, let us discuss few
implementations based on that in detail here.
P-system was popular in 1970-80s as an operating system that made
use of p-code . Compilers would emit p-code, an architectural
neutral one that was executed by a virtual machine. UCSD Pascal was
the most popular compiler that generated p-code and also the whole
p-system was written in UCSD Pascal itself making it easier to port to
new architectures with much ease.
Pascal implementations contained a compiler generating p-code and
later an interpreter will take over to execute that p-code. The
interpreter was very simple and so it was easy to write an interpreter
for a new architecture. With that interpreter, a Pascal compiler in the
form of p-code can be used to compile Pascal programs that can in-
turn be run on the interpreter.
However, this steps need to be done only in initial steps. Later the
compiler can be modified to generate native code, and once, the
Pascal compiler itself is given as input for the Pascal compiler in p-
code, the compiler becomes available in native code. This process is
referred to as 'bootstrapping', that is generally used in porting
compilers to new environments. But the point here is that this
'bootstrapping' process becomes easy and elegant with the p-code
approach. This let to the popularity and widespread use of Pascal
Java was originally intended to be used in embedded systems and
set-top boxes and thus a heterogeneous environment. Internet was
becoming very popular at that time and there was no language that
best suited that heterogeneous environment. Java suited for that
purpose very well and thus became a sort of lingua franca of Internet.
The technology is as we already saw of intermediate languages. Java
compiler converts the source into intermediate code - bytecode and
pack with related information in class files. These class files (applets or
applications) can be distributed across the network and executed in
any platform provided that a Java Virtual Machine  is available
there. The advantage of this technology is, as everybody knows now,
'write once, run anywhere'. Programmers need not worry about the
target platform. They just concentrate on the problem and develop the
To illustrate, for each and every class or interface defined, a class file
is generated. Every method is compiled into byte codes - the
instructions for the hypothetical microprocessor. For example:
int a = b + 2;
this may be converted to the bytecodes as,
iload_1 // push the variable 1 (int type) to evaluation stack
iconst_2 // push the constant value 2 (int type) on the stack
iadd // pop two integer values from stack, add them, push the result back.
istore_2 // pop the int value from the stack and store it in variable 2 // (int type)
All the information that you give inside the class definition are
converted and get stored to form an intermediate class file.
The intermediate class file format is a specification given by Sun. It is
a complex format and holds the information like the methods declared
along with the byte codes, the class variables used, initializing values
(if any), super class info, signatures of variables and methods, etc. It
has a constant pool - a per class runtime data structure, which is
similar to the symbol table created by the compiler.
Virtual machine plays a major role in Java technology. All the Java
virtual machines have the same behavior, as defined by SUN, but the
implementation differs. It forms a layer over the physical machine in
Even though the Java technology provides the basis for achieving full
portability, Java’s design aids to make it architectural neutral. To
illustrate, lets compare it with a C design consideration with that of
Java. C gives much importance to efficiency and for that, it leaves size
of its data types (except char) to be implementation-dependent. But
Java primitive types are of predetermined size, independent of the
implementation. In general, Java improves upon C by removing the
constructs having various behavioral types in C by having well defined
JVMs are available in very wide variety of platforms. However, Java
class file format and bytecodes are closely tied with the Java language
semantics. So, it becomes tough to retarget other language compilers
to generate Java class files. Few of the languages including Eiffel, Ada
and Python have compilers generating class files to be executed by
JVMs. Since the class files and bytecodes are not versatile enough to
accommodate wide range of languages, class files (and bytecodes)
doesn’t fare as a feasible universal intermediate language.
.NET architecture addresses an important need - language
interoperability, the concept that can change the way programming is
done and is significantly different from what Java offers. If Java came
as a revolution providing platform independence, .NET has language
interoperability to offer.
To illustrate, JVM is implemented for multiple platforms including
Windows, Linux, Solaris, OS/2 etc. But it has only few languages
targeting at JVM to generate class files. On the other hand, .NET
supports multiple languages generating code (MSIL – Microsoft
Intermediate Language) targeting Common Language Runtime (CLR).
This list includes, C#, VB, C, C++, COBOL, etc. and nevertheless this
list is an impressive one. CLR, the runtime interpreter of .NET platform
(equivalent of JVM of Java) is currently implemented only for Windows
platform and efforts are underway to implement it for Linux and other
platforms. A Java program can be compiled on a PC and can be
executed by a Mac or a Mainframe. Whereas with .NET, we can write a
code in COBOL, extend it in VB and deploy it as a component. Thus the
Java and .NET have fundamentally different design goals. However,
you may have noticed that they partially satisfy the two different
requirements for universal intermediate language: to support different
source languages and target architectures.
In .NET, the unit of deployment is the PE (Portable Executable) file - a
predefined binary standard (similar to class files of Java). It is made
up of collection of modules, exported types and resources and is put
together as an .exe file. It is very useful in versioning, deploying,
sharing, etc. The modules in PE file are known as assemblies.
The IL is designed such that it can accommodate wide range of
languages. Also at the heart of the .NET is the type system where the
types are declared and the metadata of how to use the types is also
stored. One important difference between MSIL and bytecodes is that
MSIL is type-neutral. For example, iconst_0 is a bytecode that tells to
push integer value 0 on the stack, meaning that the type information
is kept in the bytecodes. But the similar IL code just tells to push four
bytes, meaning that no type information is passed on.
With .NET you can write a class in C#, extend it in VB and instantiate
it in managed C++. This gives you the ability to work in diversified
languages depending on your requirement and still benefit from the
unified platform to which the code for any .NET compliant language is
The benefits of using intermediate languages are well known, and the
current trend is towards achieving fully portable code and language
interoperability. To understand the concept of portable code
compilation, the understanding of the conventional translation
technology and their advantages/disadvantages needs to be known.
There are many issues involved in intermediate code design and
understanding them shall enable us to appreciate the various efforts to
generate portable code.
Way back in 1950s itself, the first effort towards achieving portability
was taken through UNCOL initiative. Later, during 1970-80s, Pascal in
the form of p-code picked up the idea. In 1995s Java came as a
revolution to capture the imagination of the language world. Now it
is .NET’s turn with the age-old concept of language interoperability.
Thanks to such technological advances, achieving 100% portable code
with language interoperability is no more a dream. It is clear that the
concept of intermediate languages is here to stay for a long time and
we are not far from having a universal intermediate language.
 see ftp://riftp.osf.org/pub/andf/andf_coll_papers/UNCOLtoANDF.ps
 see ANDF home page: http://www.andf.net/
 refer 'Architecture Neutral Distribution Format (XANDF)', Xopen
Co. Ltd., 1996
 see http://huxley.inf.ethz.ch/~marais/OMI.html
 see http://www.ics.uci.edu/~juice/
 see http://www.threedee.com/jcm/psystem/
 see http://java.sun.com/
 refer James Gosling, Bill Joy, Guy Steele, quot;The Java Language
Specificationquot;, Addison-Wesley, 1996
 refer Tim Lindolm, Frank Yellin, quot;The Java Virtual Machine
Specificationquot;, Addison-Wesley, 1997
 ECMA C# specification, Microsoft Corporation, 2001
 quot;The Development of the C languagequot;, Dennis M. Ritchie, Second
History of programming Languages conference, Cambridge, Mass.
Approaches in improving the performance of the intermediate
code in Java/.NET
One very obvious disadvantage of the combination of compilation and
interpretation is the speed factor. Although the code is compiled, still
an interpreter needs to be employed to execute the code and typically
the execution speed is reduced to the factor of 10 to 30 times to the
equivalent of the native C/C++ code. There are many suggested ways
to improve this speed to achieve the speed of quot;native codequot;. The
remaining part of the article explores the approaches that are made to
overcome this problem with Java and .NET in particular.
An interesting approach is to avoid the interpretation at the target
platform and do compilation again. This can avoid the slowness
inherent in the interpretation approach, as translation is required each
and every time the code needs to be executed. So an alternative
approach of compiling the intermediate code again into native code to
execute can be thought of. This makes the life of the virtual machine
tougher but can improve the performance and it should be noted that
the approach should make sure that the platform independence is not
lost for this performance gain.
One such approach is Just-In-Time compilation. The aim is to compile
the code when the code is called and 'boot-strap' that the compiled
native code be used the next time the code is called. Since, in general,
90% of the time is spent in executing 10% of the code, this approach
can reap rich dividends. JIT compilers are also referred to as quot;load and
goquot; compilers as they don’t write the target code into a file. When a
method is called for the first time, it is compiled into native code and
kept in memory that the compiled code may be used next time the
method is called again. Java uses this approach of compilation on
demand extensively and many of commercial JVMs are JITters. Note
that, JVMs are allowed to follow the interpretative approach or JIT and
hence it is an optional feature for performance improvement. For .NET,
the use of JIT is mandatory and thus all MSIL code is JITted to native
code before execution.
Aggressive optimizations are also possible to make that are not
possible with static optimization techniques, since:
i) precious runtime information is available for doing
ii) depending on the situation and user requirement, only
particular parts of the code will be called and doing aggressive
optimizations on that part of the code and bootstrapping it shall
improve the performance considerably.
iii) since optimization is done the frequently called code, it shall
provide improved performance in general than the general
optimization without any profiler information.
However it should be noted that, the performance of JITters is still not
on par with the equivalent C/C++ code, but it is considerably better
than the equivalent interpreted code. Also, in environments like
embedded systems where RAM is precious, it may not be possible to
bootstrap the methods for later use.
Providing ‘Ahead of time’ recompilation is yet another approach for
improving the performance through native code. A compiler reads the
information in the Java class files and generates a file that contains the
original bytecodes and the native code equivalent for various
platforms. The resultant file is sometime referred to as quot;fatquot; class file
(refer to the quot;fatquot; binaries in the main text) and depending on the
platform, the native code shall be invoked. This can effectively avoid
the necessity to recompile every time the code needs to be executed
and at the same time not losing portability.
Another approach is Ahead of time compilation. When the file is loaded
into the memory itself, all the code gets compiled into native code. The
advantage with this approach is that the code shall be compiled before
any method is called, as opposed to JIT approach and hence is
sometimes referred to as PreJIT. So, it doesn't suffer the problem of
overhead with JIT compilation. This can thus effectively reduce the
time for start-up and execution, as the code is ready of execution in
form of native code.
So it seems that Ahead of time compiling is a very good alternative for
JIT, but that is not the end of the story. The virtual machine still needs
the meta-data about the classes, for example for supporting features
like reflection. If no meta-data is maintained, then reflection cannot be
supported which would be a serious disadvantage. Moreover, if the
environment or security settings change, then the code needs to be
compiled again. Thus PreJIT is not much promising approach. .NET
supports PreJIT with its support with its NGEN tool in Visual
One of the interesting ideas to overcome the performance problem is
implementing the interpreter in hardware. That is what the JavaChip
intends to do - the bytecodes shall be directly executed by a
microprocessor designed specifically for that. Since Java bytecode are
not too low level to be implemented at chip level, still a JVM runs over
that, with the difference that the bytecodes are now directly executed
by the hardware. This idea too, is reminiscent of the old idea in IBM
machines that is discussed in the main text.
Sidebar - 2
C - A retrospect on its evolution and design decision
C is known for its high performance through generation of native code.
Its interesting to note that its predecessor B didn't generate native
code, instead it generated a stack based intermediate code called as
threaded code. When Ritchie wanted to write his operating system
with his new language, he found that it was handicapped to use the
interpreted approach. To overcome this problem, Thomson tried by
offering a 'virtual B' compiler that still had an interpreter, but it was
too slow for practical purposes. So, to make C sufficiently low level
that it can be used to write operating systems and compilers, Ritchie
abandoned threaded code approach and instead generated native
code. So this drastic change in the translation approach itself was
influenced by two main problems with interpretation: the runtime
overhead of interpretation and performance. In retrospect, had C
continued the B's tradition of interpretive approach, would it have
become such a huge success?
All rights reserved. Copyright Jan 2004.