IT-University Copenhagen
Micro-F#
Compiling Functional Source-code to
Object-oriented stack Machine Code
Master Thesis
September 2015
Author:
Joachim Vincent Hasseldam (jhas@itu.dk)
Supervisor:
Associate Professor
Rasmus Ejlers Møgelberg (mogel@itu.dk)
Joachim Vincent Hasseldam September 2015
Acknowledgement
I would like to thank Associate Professor Rasmus Ejlers Møgelberg, from
the Theoretical Computer Science department at ITU Copenhagen, for
supervising this thesis and for his valued advice and guidance throughout
the project.
Abstract
Compiling a functional programming language to an object-oriented,
stack-based language requires a way to translate constructs like higher-order
functions and inductive data types to object-oriented equivalents, while
still maintaining the structural and semantic integrity of the source
program.
To understand how this can be done, a small programming language with
a subset of the features found in an ML-based language is created and
compiled to Microsoft Common Intermediate Language (CIL).
This report describes the new language and suggests a technique for
translating the language to an object-oriented intermediate language conceived
for this project, which is then translated to the target language CIL.
The technique presented proves to accommodate the required constructs
within the constraints of the language, although support for recursive
functions with a substantial input size is still lacking.
Contents
I Introduction
1 Testing Environment
2 Background
2.1 Higher-Order Functions
2.1.1 Currying and Partial Application
2.1.2 Closure
2.2 Recursive Data Types
2.3 Common Intermediate Language
3 CIL Abstraction
II Design
4 Micro-F#
4.1 Types
4.2 Functions
4.3 Let Bindings
4.4 Program Structure
4.5 Abstract Syntax Tree
5 Translating Program Structure
5.1 Expression Language
5.2 First-Order Functions
5.3 One Parameter Higher-Order Functions
5.3.1 Shared Interface
5.4 N Parameter Higher-Order Functions
6 Inductive Data Types
6.1 List
6.1.1 Constructor Methods
6.1.2 Test Methods
6.1.3 Accessor Methods
6.2 Tree
6.3 Limitations
7 Intermediate Language
7.1 Bridging Paradigms
7.2 Encapsulating Side Effects
7.3 Target Language Agnostic
7.4 Abstract Syntax Tree
III Implementation
8 Lexical and Syntax Analysis
8.1 Lexing
8.2 Parsing
9 Type-Checking and AST Transformation
9.1 Type Checking
9.2 Transform Multiple Argument Function Calls
9.3 Enriching the AST
10 MF Compilation
10.1 Expressions
10.2 Function Definitions
10.2.1 Extracting Invoke Types
10.2.2 Step Class Instructions
10.2.3 Step Class Structure
10.3 Match Expression
11 MFIL Compilation
11.1 IFunc Interface
11.2 Built-in Types
11.2.1 Structure
11.2.2 Create Classes
11.2.3 Create Methods
11.3 Meta Information
11.3.1 Environment
11.3.2 Global Functions
11.3.3 Predefined Types
11.4 Expected MFIL Output
11.5 Generate CIL Class Structure
11.5.1 Define Classes
11.5.2 Populate Classes
11.6 Generate CIL Instructions
11.6.1 Step Structure
11.6.2 Match Expression
11.6.3 Branch Instructions
IV Discussion
12 Tail Recursion
13 .NET Interoperability
14 Conclusion
Appendices
A Compiler Source Code
A.1 Ast.fs
A.2 Util.fs
A.3 TypeChecker.fs
A.4 MFIL.fs
A.5 MFCompiler.fs
A.6 Env.fs
A.7 CILTyping.fs
A.8 CILCompileInstrs.fs
A.9 CILInterfaceBuilder.fs
A.10 CILMethodBuilder.fs
A.11 CILClassBuilder.fs
A.12 CILListBuilder.fs
A.13 CILTreeBuilder.fs
A.14 MFILCompiler.fs
A.15 SourceCode.fs
A.16 Program.fs
A.17 Lexer Specification
A.18 Parser Specification
Part I
Introduction
Functional programming languages are structurally and semantically different
from object-oriented languages. Functional languages treat functions as
first-class citizens by allowing them to be passed as arguments to other functions,
returned as the result of other functions, and assigned to local values. In that
sense, functions provide the primary level of abstraction in functional languages.
This is in contrast to object-oriented languages, where the object is the primary
level of abstraction.
Functional languages also tend to favour recursion over the iterative loop
approach primarily used in object-oriented languages when traversing data
structures. Accordingly, data structures in functional languages are typically
recursively defined, in keeping with the languages' immutable approach to data.
Despite the differences, it is still possible to translate a functional language to an
object-oriented one. This is necessary when the two paradigms
share the same runtime environment, as is the case with Java and Scala on the
Java Virtual Machine (JVM) platform, and C# and F# on the Microsoft Common
Language Runtime (CLR) platform.
There are several advantages to languages sharing the same runtime, one being
the interoperability between them. For example, on the CLR platform,
having constructs written in F# available in a C# project provides a great
source of code reuse. To achieve this, the two languages are compiled
to the same intermediate language, namely Common Intermediate Language, or
CIL. CIL is an object-oriented language, so it seems intuitive that
a program written in C#, also being object-oriented, can easily be translated
into CIL. A functional F# program, on the other hand, has to undergo a
paradigm shift.
This project aims at understanding how a functional language can be translated
to an object-oriented stack-based language and still maintain its structural
integrity. The focus will be on how higher-order functions and recursive data
types are translated to equivalent object-oriented structures, and not so much
on the efficiency of the translation in terms of recursive calls.
To accomplish this goal, a functional language will be designed and implemented
with the intention of making it run on the CLR platform by compiling it to CIL
byte-code. The language is not meant to be a fully fledged programming language,
but rather to contain a small subset of the features found in a typical
ML-based functional language, with support for higher-order functions and
recursive data types.
The design and compilation of a programming language span
many different areas and cover many interesting topics. This report will
describe the entire compilation process, but since the focus of this project is on
the translation from functional constructs to object-oriented ones, certain areas,
such as source language design, lexical analysis and syntactical analysis, will
only be covered in modest detail.
1 Testing Environment
To test the compiler, a small development environment has been created. It
enables the reader to run a list of predefined programs, as well as to create and
run their own programs in MF. The development environment can be obtained
by downloading MFDE.zip from the following location:
https://www.dropbox.com/s/l6ml7gszid3631l/MFDE.zip?dl=0
Then extract the content of the zip file and run MFDE.exe.
2 Background
This section will briefly introduce a few topics, which are important for the
remainder of this report.
2.1 Higher-Order Functions
Languages with support for higher-order functions allow functions to be passed
as arguments to other functions, and allow functions to return functions as
their return values [Sestoft, 2012]. Higher-order functions are an integral part
of functional programming languages and, as one of their applications, provide
support for multiple parameters through currying (see section 2.1.1). A classic
example of a higher-order function is the List.map function which, implemented
in F#, would look something like this:
let rec map f lst =
    match lst with
    | next::rest -> (f next)::map f rest
    | [] -> []
Here the map function takes two arguments, a function f and a list
lst, and applies the function to each element in the input list, resulting in a
new list.
Figure 1: List.map in F# and IEnumerable.Select in C# produce a new list with
the results of applying a function to each element in the input list
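The C# counterpart mentioned in Figure 1 can be sketched as follows. This is only a minimal illustration, restricted to integers; the names MapDemo and Map are hypothetical and do not come from the thesis code.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class MapDemo
{
    // Applies f to every element of lst and returns the results as a new
    // list, mirroring the F# map function shown above.
    public static List<int> Map(Func<int, int> f, IEnumerable<int> lst)
    {
        return lst.Select(f).ToList();
    }
}
```

A call such as MapDemo.Map(x => x * x, new[] { 1, 2, 3, 4 }) yields a new list containing 1, 4, 9 and 16, while the input is left unchanged.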
2.1.1 Currying and Partial Application
Several functional languages, including F#, handle functions with more than one
parameter through a technique known as currying. This means that a multiple
parameter function is actually evaluated as a series of one parameter functions.
This is best illustrated through an example. The map function just presented
has the following function signature:
('a -> 'b) -> 'a list -> 'b list
This translates to: map takes as input a function from generic type 'a to generic
type 'b and returns a new function, which takes a list with elements of type
'a as input and returns a list with elements of type 'b. Currying gives the
compiler the responsibility of keeping track of that extra function when a
call to map is made with two arguments. That is why the following call to map
is perfectly valid:
map (fun x -> x * x) [1..10]
In essence, this is just syntactic sugar for first calling map with the squaring
function, which returns a new anonymous function, and then calling
the new function with the list argument.
It is also possible to call an n parameter function with any number of
arguments from 1 to n. This fixes the given arguments in the returned
function and is called partial application. In the map example, a call to the
function could be made with just a function for squaring numbers as its argument:
let square x = x * x
let squareNums = map square
Now squareNums is fixed with the square function and has the function signature:
int list -> int list
It can be passed as an argument to other functions or called with a list of
integers, yielding a new integer list as expected.
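To make the "series of one parameter functions" concrete, the curried map and its partial application can be sketched in C# with nested delegates. This is an illustration under assumptions: the names CurryDemo, Map, Square and SquareNums are hypothetical, and the element type is fixed to int for brevity.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class CurryDemo
{
    // Curried map: takes only f and returns a new function that still
    // expects the list argument.
    public static Func<List<int>, List<int>> Map(Func<int, int> f)
    {
        return lst => lst.Select(f).ToList();
    }

    public static int Square(int x)
    {
        return x * x;
    }

    // Partial application: only the first argument is supplied, so the
    // result has the shape int list -> int list.
    public static readonly Func<List<int>, List<int>> SquareNums = Map(Square);
}
```

Calling CurryDemo.SquareNums(new List<int> { 1, 2, 3 }) then corresponds to applying squareNums to a list in the F# example.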
2.1.2 Closure
With functions being passed to other functions, a method for keeping track of
free variables must be present. In the above example, map is not part of the local
environment within the stack frame of squareNums, and therefore appears as
a free variable within the scope of the function. When a call to squareNums is
made, the compiler has to know how to resolve map with a bound square function
variable. This problem is handled by representing a function variable as a closure
[Appel and Palsberg, 2003]. A closure is a structure that provides a way to
access a function outside the scope in which it was defined, together with the
function's local environment.
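One way to picture a closure in object-oriented terms is as an object whose fields hold the captured values. The following is a hand-written sketch, not the thesis's implementation; the class name MapClosure is hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hand-written closure for "map f": the already supplied function f is
// stored in a field, so it is still available when the list argument
// arrives later.
public sealed class MapClosure
{
    private readonly Func<int, int> f; // captured value

    public MapClosure(Func<int, int> f)
    {
        this.f = f;
    }

    public List<int> Invoke(List<int> lst)
    {
        return lst.Select(f).ToList();
    }
}
```

An instance created as new MapClosure(x => x * x) plays the role of squareNums: it carries the square function with it and can be invoked with a list at any later point.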
2.2 Recursive Data Types
A type which can be recursively defined in terms of simpler elements is called a
recursive or inductive data type [Pierce, 2002]. Popular examples of inductive data types
include lists and trees. We can also define our own types, like the following
simple abstract syntax tree for a small expression language written in F#:
type Expr =
    | Mul of Expr * Expr
    | Add of Expr * Expr
    | Value of float
This type allows for defining simple unambiguous arithmetic expressions:
let trapezoidArea = Mul(Value(0.5), Mul(Value(7.4), Add(Value(2.3), Value(5.9))))
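For comparison, an object-oriented rendering of the same recursive type could use an abstract base class with one subclass per case. The sketch below is an assumption made for illustration: the Eval method and the ExprDemo helper are not part of the original type, which only defines the structure.

```csharp
public abstract class Expr
{
    // Evaluates the expression tree rooted at this node.
    public abstract double Eval();
}

public sealed class Value : Expr
{
    private readonly double v;
    public Value(double v) { this.v = v; }
    public override double Eval() { return v; }
}

public sealed class Add : Expr
{
    private readonly Expr left, right;
    public Add(Expr left, Expr right) { this.left = left; this.right = right; }
    public override double Eval() { return left.Eval() + right.Eval(); }
}

public sealed class Mul : Expr
{
    private readonly Expr left, right;
    public Mul(Expr left, Expr right) { this.left = left; this.right = right; }
    public override double Eval() { return left.Eval() * right.Eval(); }
}

public static class ExprDemo
{
    // The same tree as trapezoidArea above: 0.5 * (7.4 * (2.3 + 5.9)).
    public static Expr TrapezoidArea()
    {
        return new Mul(new Value(0.5),
                       new Mul(new Value(7.4),
                               new Add(new Value(2.3), new Value(5.9))));
    }
}
```

Evaluating ExprDemo.TrapezoidArea().Eval() gives approximately 30.34, up to floating-point rounding.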
2.3 Common Intermediate Language
Common Intermediate Language, or CIL, is an object-oriented stack-based
programming language designed by Microsoft. CIL is part of the Common
Language Infrastructure (CLI) and is an intermediate language shared by all .NET
languages. At compile time the source language is compiled to CIL instructions
by a language-specific compiler, and then at runtime the CIL instructions
are compiled to platform-specific machine code by the Common Language
Runtime (CLR) virtual machine. Having an intermediate language like CIL greatly
reduces the number of compilers needed for n languages to target m
architectures, from n * m to n + m, since only one compiler from each language to
the intermediate language is needed, and one from the intermediate language
to each target architecture [Richter, 2014].
Having a common shared platform also gives the .NET-based languages the
possibility to share their assemblies. This means that code written in F# can be
used in C# and vice versa.
Figure 2: Translation of .NET source languages to platform specific byte-code
3 CIL Abstraction
In several places throughout this report, C# will be used as a tool to conceptualize
and reason about CIL. The reason for this is that C# is easier to read
and less verbose than actual CIL output, and as a result provides a
convenient way to illustrate examples and discuss certain concepts. Since
CIL is also object-oriented, considering design decisions and the desired
output of translations and examples in C# provides a close abstraction of how
the actual CIL code would look. And since C# itself compiles to CIL, everything that can
be expressed in C# can also be expressed in CIL, so the step from C# design
ideas to CIL is only a matter of implementation.
Part II
Design
4 Micro-F#
This section provides an introduction to Micro-F# and its syntax, to establish
some familiarity with the language before we start looking at sample code in
the sections to come. The few examples shown in this section provide only a
brief introduction to the language and its capabilities. The reader is therefore
encouraged to download a copy of the testing environment, to see more elaborate
examples and to experiment with compiling and running their own programs
written in Micro-F#. Information on how to obtain a copy of the test environment
can be found in section 1.
Micro-F#, or MF, is a statically typed, eagerly evaluated functional programming
language. It is designed for the sole purpose of experimenting with the
compilation process of a functional language on the .NET platform. MF aims at
providing a subset of the features found in ML-based languages, and its syntax
is heavily inspired by F#, so the name MF reflects the language's syntactic
similarities.
The entire list of keywords and symbols supported by the language can be found
in figure 3.
fun end let in do = <> > < >= <=
if then else not + - * / %
bool int tree list ( ) [ ] -> |
Cons Nil Node Leaf :: : ; ,
match with true false
Figure 3: Keywords and symbols supported by MF
4.1 Types
MF currently supports four types: the primitive types boolean and integer, and
the inductive types list and tree. The inductive types, however, are limited to
holding integer values.
4.2 Functions
MF provides support for global, non-nested, higher-order functions. The
following is the MF equivalent¹ of the F# map function from section 2.1.
¹ Although slightly more restricted, since the language has no support for generic types, nor
for lists of any type other than integers.
fun map f lst : (int -> int) -> list -> list =
    match lst with
    | Cons(x, xs) -> (f x)::map f xs
    | Nil -> []
    end
end
The syntax for the function definition is similar to what we know from F#, but
a few noteworthy differences do exist. Unlike F#, MF does not provide type
inference for function definitions. The type signature for a function has to be
explicitly stated in the form:
$$t_1 \to t_2 \to t_3 \to \dots \to t_{n+1} \tag{1}$$
where each t_i is a type supported by the language and n is the number of parameters
of the function. The last type given, t_{n+1}, is the return type of the function.
MF is not whitespace sensitive like F#, and the example above has been written
on several lines, and with indentation, purely for aesthetic reasons. Instead,
scoping for functions and match expressions is provided by the end keyword,
and in this regard the language is more similar to C#, which is also not whitespace
sensitive and uses symbols to separate one construct from the next.
4.3 Let Bindings
Let bindings to values, expressions or functions are supported in MF. Below are
a few examples:
...
let a = 42 in // Value
let b = if a > 2 then a else 2 in // Expression
let c = mul 5 in // Function
let d = [3; 5; 8; 13; 21] in // List
let e = Node(8, Node(5, Leaf, Leaf), Leaf) in // Tree
...
As opposed to functions, the types for let bindings are omitted, and instead the
types are inferred by the type checker at compile time.
Again, since MF is not whitespace sensitive, the keyword in is used to separate
the let declaration from the body, a goal achieved in F# through a line break².
4.4 Program Structure
A valid MF program is composed of a list of zero or more global functions, each
with a body consisting of a single expression tree, followed by a single expression
tree. Since we are interested in producing an executable assembly, this final
expression tree becomes the implicit entry point of the
application, and since all functions must return a value, it cannot be
omitted. Figure 4 shows the separation of global functions and the entry point
of a source program.
² It should be noted, however, that stating the "in" keyword explicitly is also supported, but
rarely used, in the F# syntax.
Figure 4: The structure of an MF source program
4.5 Abstract Syntax Tree
We will conclude the MF section by taking a look at the abstract syntax tree
for the language and briefly examining the different constructs.
The abstract syntax tree presented here is based on the syntax tree for Peter
Sestoft's expression language³, modified to support higher-order functions
and the built-in inductive data types. The full abstract syntax tree can be seen
in figure 5.
TypeDef The types currently supported by the language.
MatchExpr Contains the matching part of a match case, with information about
which names to bind to the result of the match expression. Suppose, for example,
a match is performed on a cons element in the list type. A name to bind the
value element, as well as one for the rest of the list, is needed. In the match case
| Cons(x, xs) -> (g x)::f xs
the MatchExpr is the part Cons(x, xs).
Expr As stated earlier, the body of a global function, as well as the implicit
main function, contains a single expression tree.
TypedExpr Used to hold the inferred type of an expression. We will get
back to that when we discuss the type checker in section 9.
ConstInt Constant of the primitive type integer.
ConstBool Constant of the primitive type boolean.
Var Named values.
Let Contains the name of a let expression, the value part and the body
which is the rest of the expression tree.
³ More on this in section 8.
PrimOp One of the arithmetic operators supported by the language and
its two operands.
If An If-then-else expression.
Match A complete match expression. First the matched expression, then
a list of match cases and their corresponding "action" expressions.
FunCall Function call with a name part and an argument part.
Constructor Constructs one of the cases of the inductive types. Contains
a value and a rest expression for the list type, and a value and the two
subtrees for the tree type.
FuncDef The global function definition contains the name of the function, a
list of argument names, a type signature and the expression tree constituting
the body of the function.
Program Finally the source program consisting of a list of global functions
and the expression tree for the implicit main function.
5 Translating Program Structure
Translating a functional language like MF to an object-oriented target language
like CIL demands that we start to consider the functional concepts from an
object-oriented mindset. What is the object-oriented equivalent of a function? How do
we replicate the scope of a functional program in the translation? And so on. In
this section we will examine how a functional source program can be translated
to an object-oriented equivalent and still retain its structural integrity.
We will do this by first looking at a simpler version of the problem, enforcing
several constraints on the MF language. This will help to identify where the
challenges in the translation lie, and how they can be addressed to support the
full language. We will start by looking at how to translate a simple expression
language and then move on to approach the problem as if MF were a first-order
functional language. Next we relax the first-order constraint and allow
higher-order functions, although limited to only one parameter, and see how
this affects the design. Finally we shall expand on this structure to allow an
arbitrary number of arguments for our functions and thus provide full support
for MF.
5.1 Expression Language
With no functions, the expression language limits MF to let bindings of values
or expressions. A sample program could look like this:
let a = 11 in
let b = 5 in
let c = if a > b then a else b in
c + a
type TypeDef =
    | FuncType of TypeDef * TypeDef
    | IntType
    | BoolType
    | ListType
    | TreeType
type MatchExpr =
    | NilMatch
    | ConsMatch of string * string
    | LeafMatch
    | NodeMatch of string * string * string
type Expr =
    | TypedExpr of Expr * TypeDef
    | ConstInt of int
    | ConstBool of bool
    | Var of string
    | Let of string * Expr * Expr
    | PrimOp of string * Expr * Expr
    | If of Expr * Expr * Expr
    | Match of Expr * (MatchExpr * Expr) list
    | FunCall of Expr * Expr
    | Constructor of Constructor
and Constructor =
    | Nil
    | Cons of Expr * Expr
    | Leaf
    | Node of Expr * Expr * Expr
type FuncDef = FuncDef of string * string list * TypeDef * Expr
type Program = Program of FuncDef list * Expr
Figure 5: Abstract syntax tree for the MF language
If we were to build a C# program based on the code above, the resulting program
would as a minimum require a class and a method to contain the instructions.
The desired output for the examples shown here is an executable assembly and
not a library file⁴, so for the above program we define a single class with a
main method. The body of the main method will consist of the expression
above. Each let expression can be viewed as a declaration of a local value in C#,
and scope for the expressions will conveniently be preserved by declaring values
from top to bottom.
class Program
{
    public static int Main()
    {
        int a = 11;
        int b = 5;
        int c = a > b ? a : b;
        return c + a;
    }
}
Since the end of the expression also marks the end of the main method, the last
instruction in the expression (c + a) is also the return value of the
method.
5.2 First-Order Functions
Now let us extend the language to allow functions, although limited to accepting
and returning only values. This allows for a bit more interesting programs, but
also requires an expansion of the structure for the expression language from the
previous section. Consider the following example:
fun prevN x : int -> int =
    let y = x - 1 in
    if y > 0 then y else 0
end
let a = prevN 11 in
a * a
The expression after the function declaration is still going to constitute the
body of the main method of the program, but a single method is no longer
sufficient since the function prevN should translate to a new method with its
own method body. With functions limited to only primitive types, an additional
static method in the same class, should be sufficient to accommodate the new
function. Similar to the main method, the result of the expression in the body
of prevN will be the return value of the method.
⁴ Although it would not make much difference, except that an executable assembly requires a
program entry point (or main method), as opposed to a library file.
class Program
{
    public static int prevN(int x)
    {
        int y = x - 1;
        return y > 0 ? y : 0;
    }
    public static int Main()
    {
        int a = prevN(11);
        return a * a;
    }
}
Translating MF as a first-order language is a relatively trivial process. A new
static method is added when a new function is added to the source code. Each
method is compiled using a local environment which enforces the static scoping
rule of the language.
5.3 One Parameter Higher-Order Functions
For a function to take another function as input, or to return a function as
output, a larger refactoring of the structure defined so far is required. For the
time being, we limit all functions to only one parameter. We do however allow
a function to be called with zero arguments, which means the function just
returns itself. This enables let bindings to functions, as opposed to only values
as in the previous examples.
Consider the following program:
fun f x : int -> bool = x % 4 = 0 end
let g = f in
if g 1324 then 1 else 0
We want to bind the name g to the function f, and for this to work we require
the ability to treat functions as values. A translation to a static
method in a single class is therefore no longer enough to support the language. Instead
the static method is changed to an invoke instance method inside a class named
after the function. The invoke method body will contain the instructions from
the source language function, similar to the static methods in the first-order
language from section 5.2. The method signature is also similar to how it was
constructed in the first-order language, with the type definition of the function
matching the input and output type of the method.
public class f
{
    public bool Invoke(int x)
    {
        return x % 4 == 0;
    }
}
public class Program
{
    public static int Main()
    {
        f g = new f();
        return g.Invoke(1324) ? 1 : 0;
    }
}
A call to f now results in an instance of class f being created, and if the call
contains an argument, the invoke method of the instance is called. In this
example a let binding is made from g to f, and g is now a reference to the new
instance. The type of g is of course f, since g is an instance of class f. When g
is subsequently called with an argument, the invoke method of the instance is called.
5.3.1 Shared Interface
The solution described so far enables functions to return functions, but we
still run into problems if we try to pass a function to another function as an
argument. Suppose we have the following input program:
fun apply f : (int -> int) -> int = f 42 end
fun g x : int -> int = x * x end
fun h x : int -> int = x * 2 end
(apply g) + (apply h)
Function apply takes a single function matching the type definition int ->
int and returns the result of applying the function to 42. Translating functions
g and h is trivial, but apply poses a problem. The invoke method for class apply
cannot be limited to a specific parameter type, since it should be able to receive
an instance of either class, g or h⁵. The example here is a little contrived,
since the language is limited to only one parameter. It does illustrate the point
though: functions with the same type definition should share the same instance
type.
One way to solve this problem could be through inheritance. We could create
an abstract superclass for each function type and then let each
function class with the same type signature inherit from that superclass.
To avoid creating a new abstract class for each type definition, and having to
implement functionality to keep track of them and to retrieve them, a single
interface will instead be created, which all function classes will implement. The
interface will contain a single method definition with a generic input and
output type.
⁵ Or any other class compiled from a function with the same type definition.
public interface IFunc<TInput, TOutput>
{
    TOutput Invoke(TInput input);
}
Now a function parameter in a higher-order function will always have the type
IFunc, and likewise for all instances created of the function classes. This solves
the compatibility issue from before and enables full support for higher-order
functions under the one parameter restriction.
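Applied to the apply example above, the shared interface could be used roughly as follows. This is a sketch under assumptions: the class names FuncG, FuncH, FuncApply and SharedInterfaceDemo are hypothetical stand-ins for the classes generated from g, h, apply and the main expression, and the actual generated code may differ.

```csharp
public interface IFunc<TInput, TOutput>
{
    TOutput Invoke(TInput input);
}

// fun g x : int -> int = x * x end
public class FuncG : IFunc<int, int>
{
    public int Invoke(int x) { return x * x; }
}

// fun h x : int -> int = x * 2 end
public class FuncH : IFunc<int, int>
{
    public int Invoke(int x) { return x * 2; }
}

// fun apply f : (int -> int) -> int = f 42 end
// The parameter type is the interface, so any int -> int function class fits.
public class FuncApply : IFunc<IFunc<int, int>, int>
{
    public int Invoke(IFunc<int, int> f) { return f.Invoke(42); }
}

public static class SharedInterfaceDemo
{
    // (apply g) + (apply h)
    public static int Run()
    {
        IFunc<IFunc<int, int>, int> apply = new FuncApply();
        return apply.Invoke(new FuncG()) + apply.Invoke(new FuncH());
    }
}
```

Here Run returns 42 * 42 + 42 * 2 = 1848, and FuncApply never needs to know which concrete function class it receives.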
5.4 N Parameter Higher-Order Functions
Now we are ready to lift the final restriction and support the language in its
entirety. This will be done by extending the previous solution to support
functions with an arbitrary number of parameters. We will begin by taking a closer
look at how multiple parameter functions are actually evaluated.
As stated in section 2.1.1, multiple parameter functions are handled as a series
of specialized one parameter functions with knowledge of the previous
parameters, through a technique known as currying. The following example illustrates
this concept:
fun f x y : int -> int -> bool =
x > y
end
let g = f 5 in
let res = g 4 in
...
In the previous section a let binding to a function with no parameters resulted
in the new value becoming a reference to the actual function. This time g is
not just a reference to f but a new function, with a single parameter y, which
returns a boolean result. Inside the function g, the name x appears as a free
variable and access to the environment of f must be available here in the form
of a function closure. When g is called the body of the original function f is
evaluated and the result is assigned to the value res.
Calling a function with zero parameters still results in an instance of the function
class being created and referenced by the binding. But calling the invoke method
on the instance can now result either in the computation of the expression in
the function body, if no more parameters exist in the original function, or in
the creation of a new function with a closure over the previous one.
This new function is simply modelled as a new class with its own invoke
method, similar to the process described earlier. The closure is provided by
extending the function class with fields to accommodate the free variables. This
creates a nested structure of classes, where the invoke method of the current
instance creates an instance of the next function class. The list of fields in each
class grows in parallel with the "level" of this nested structure, since the next
level must contain all values in the closure of the previous level as well as the
value given as the parameter of the previous level.
Because each new class corresponds to the next step in the nested structure,
this concept will be referred to as the "step-class structure". Figure 6 provides
a graphical representation of the step-class structure for a function with three
parameters. A level in the step-class structure refers to how deep in the structure
we are, starting from level zero with the first class. Each parameter in the
original MF function results in a new class, so the "bottom" level of the structure
is n − 1, where n is the number of parameters in the original function.
Figure 6: Step-class structure from compiling the function definition: fun f x
y z : int -> int -> int -> int = <expr> end
If we apply this concept to the example above, we end up with the following:
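public class f : IFunc<int, IFunc<int, bool>>
{
    public IFunc<int, bool> Invoke(int x)
    {
        // Level 0: copy x into the closure field of the next step class.
        return new f_Step1() { x = x };
    }
}
public class f_Step1 : IFunc<int, bool>
{
    public int x; // closure field holding the first argument
    public bool Invoke(int y) { return x > y; }
}
public class Program
{
    public static int Main()
    {
        IFunc<int, bool> g = new f().Invoke(5);
        bool res = g.Invoke(4);
        // ... (remainder of the program, elided in the MF source)
        return 0; // placeholder so that Main compiles
    }
}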
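The same idea extends to the three-parameter function of figure 6. The following F# sketch shows how the closure grows by one value per level. It rests on assumptions: the body x * 100 + y * 10 + z is a hypothetical stand-in for <expr>, the class names FStep0 to FStep2 are illustrative, and constructor parameters play the role of the public closure fields used in the C# version above.

```fsharp
type IFunc<'TInput, 'TOutput> =
    abstract Invoke : 'TInput -> 'TOutput

// Level 2 (the bottom level, n - 1): the closure holds x and y,
// and Invoke evaluates the (hypothetical) function body.
type FStep2(x: int, y: int) =
    interface IFunc<int, int> with
        member this.Invoke(z: int) = x * 100 + y * 10 + z

// Level 1: the closure holds x; Invoke passes x and y on to the next level.
type FStep1(x: int) =
    interface IFunc<int, IFunc<int, int>> with
        member this.Invoke(y: int) = FStep2(x, y) :> IFunc<int, int>

// Level 0: no closure yet; Invoke only captures the first argument.
type FStep0() =
    interface IFunc<int, IFunc<int, IFunc<int, int>>> with
        member this.Invoke(x: int) = FStep1(x) :> IFunc<int, IFunc<int, int>>

let f = FStep0() :> IFunc<int, IFunc<int, IFunc<int, int>>>
let g = f.Invoke(1)   // partial application: x = 1
let h = g.Invoke(2)   // partial application: x = 1, y = 2
let res = h.Invoke(3) // full application
printfn "%d" res      // 1 * 100 + 2 * 10 + 3 = 123
```

Note that g can be reused with other arguments, since every call to Invoke on a non-bottom level creates a fresh instance of the next step class.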
6 Inductive Data Types
MF supports recursive data types in the form of two built-in types: list and tree.
Since the two types are built in, their construction is not determined by any
input from the developer. They also have to be constructed before the source
program, since constructs in the source code can depend on the types'
existence. We will start by taking a closer look at how the types are represented,
then proceed to how they are modelled in C# and how we can interact
with them from MF.
6.1 List
Similar to F#, lists in MF are represented as singly linked lists [Smith, 2012].
This means that every element in the list contains two pieces of information: the
value of the element and a pointer to the next element in the list. Since each
element points to the next, we also need a special case representing the end of
the list. An element containing a value is usually referred to as cons, and the
empty list element as nil [Pierce, 2002]. This will also be the case here.
Figure 7 provides a graphical representation of such a linked list.
Figure 7: A singly linked list
The list structure will be modelled through object-oriented decomposition, which
means that a base class is created for the actual type (in this case list) and each
of the cases (nil and cons) is represented as a subclass [Emir, 2006].
Determining the type of an instance in relation to the subclasses, as well as
accessing members of the subclasses, all happens through the list superclass. Even
constructing an instance of any of the subclasses happens through methods
defined in the superclass. For the list type we only have two cases (this is also
true for the tree type, as we shall see later), but we can imagine this technique
supporting other structures with more cases equally well.
This technique encapsulates the underlying structure of the subclasses and
allows us to communicate only with the superclass, regardless of the number of
cases it represents. The class structure for the list type can be seen in figure 8.
public abstract class MFList
{
    public MFList() {}
    public bool IsCons() { return this is Cons; }
    public bool IsNil() { return this is Nil; }
    public static MFList Nil() { return new Nil(); }
    public static MFList Cons(int value, MFList next)
    {
        return new Cons(value, next);
    }
    public int GetValue() { return ((Cons)this).value; }
    public MFList GetNext() { return ((Cons)this).next; }
}
public class Cons : MFList
{
    public int value;
    public MFList next;
    public Cons(int value, MFList next)
    {
        this.value = value;
        this.next = next;
    }
}
public class Nil : MFList
{
}
Figure 8: Class structure of the MF list type
6.1.1 Constructor Methods
Two static methods in the list class create new instances of the nil and cons
subclasses. The nil case takes no arguments, since the constructor of the nil
subclass defines no parameters. The cons case takes two arguments: the value of
the cons element, which is limited to an integer type, and a list representing
the rest of the list, which can be either a nil or a cons element.
Creating a new list element always adds the element to the beginning of the
list, and as a result the list is built from the bottom up.
6.1.2 Test Methods
When we traverse a list, we need a way to establish the type of a given instance,
to see whether we have a cons case, so that traversal can continue, or a nil
case, meaning the end of the list has been reached. A method is added for each
case, which tests a given instance against the type of the case and responds
with a boolean value.
6.1.3 Accessor Methods
Once the case type has been established, the relevant information of the case
can be extracted. This is of course only relevant for the cons type, since no
information is contained in the nil case. A method is added to access each
member of the cons class. The methods perform a type cast of the given
instance to the cons class and then return the member.
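To show how the three groups of methods work together, the list type can be sketched in F# and traversed with a small sum function. This is an illustrative sketch, not the thesis implementation: the static methods are named NewNil and NewCons here (an assumption, chosen to keep them distinct from the class names), and the sum function is hypothetical.

```fsharp
[<AbstractClass>]
type MFList() =
    // Test methods: report which case an instance belongs to.
    member this.IsCons() : bool = (box this) :? Cons
    member this.IsNil() : bool = (box this) :? Nil
    // Accessor methods: cast to Cons and read the member (this fails on a Nil).
    member this.GetValue() : int = ((box this) :?> Cons).Value
    member this.GetNext() : MFList = ((box this) :?> Cons).Next
    // Constructor methods: the only way client code creates list elements.
    static member NewNil() : MFList = Nil() :> MFList
    static member NewCons(value: int, next: MFList) : MFList =
        Cons(value, next) :> MFList

and Cons(value: int, next: MFList) =
    inherit MFList()
    member this.Value : int = value
    member this.Next : MFList = next

and Nil() =
    inherit MFList()

// Traversal: test the case first, then use the accessors only on a cons.
let rec sum (xs: MFList) : int =
    if xs.IsCons() then xs.GetValue() + sum (xs.GetNext())
    else 0

// The list 1, 2, 3 is built from the end: nil first, then each cons in front.
let xs = MFList.NewCons(1, MFList.NewCons(2, MFList.NewCons(3, MFList.NewNil())))
printfn "%d" (sum xs) // 1 + 2 + 3 = 6
```

The traversal only ever talks to the MFList superclass, which is the encapsulation the decomposition is meant to provide.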
6.2 Tree
The second inductive data type supported by the MF language is a binary tree.
This structure is very similar to the list type previously described and also
contains two cases.
Each element in the tree contains a value which, as in the list type, is limited
to integers, and a reference to the left and right sub-tree. This element will
be referred to as a node, and the end of the tree is indicated with a leaf type.
Similar to the nil element in the list, the leaf element does not contain any
information. Figure 9 shows the structure of the binary tree.
The class structure for the tree is almost identical to that of the list. The only
noteworthy difference is that the node case has three accessor methods, for
the value, left sub-tree and right sub-tree, as opposed to only two in the cons
case. The tree class structure can be seen in figure 10.
6.3 Limitations
The technique for the inductive data types presented here is relatively easy to
implement and offers a clean interface. When the types are used, we only have
to interact with the abstract superclass and do not have to concern ourselves
with the subclasses or how they are implemented. It would be easy to expand with
a different data structure with more cases, since each new case just results in
adding a new subclass, together with the necessary methods in the superclass to
interact with it. We do however have to consider its limitations.
Figure 9: Binary tree structure
The type cast performed when field values are retrieved from the subclasses is
not type safe, because the correct type is just assumed. When the inductive
types are used by the MF compiler this should not pose a problem, since logic
in the compiler provides a check using the test methods, to ensure that type
casts are only performed on the appropriate types. But if the assembly and the
types are used by another .NET-based language, the guarantee of type safety
is no longer present.
Consider the following simple example in C#:
var tree = MFTree.Leaf();
var leftSubTree = tree.GetLeft(); // Runtime-error
The program compiles but will result in a runtime error (an invalid cast) as
soon as we try to access a field, for example the left sub-tree, of a Leaf. As
we saw earlier, the Leaf class does not contain any fields, but because the
methods for extracting the values of the Node class fields are defined in the
abstract superclass, they are available in all subclasses.
However, it is worth noting that this limitation is not confined to the MF
language. F# handles inductive data types very similarly to the technique applied
here. A notable difference is that F# types do not provide methods in the
abstract class for retrieving the values of a subclass' fields; instead the type
casting is left to the developer [FSSPEC, 2012]. But the result is the same: we
still end up producing type-unsafe code which could result in runtime errors.
public abstract class MFTree
{
    public MFTree() {}
    public bool IsNode() { return this is Node; }
    public bool IsLeaf() { return this is Leaf; }
    public static MFTree Leaf() { return new Leaf(); }
    public static MFTree Node(int value, MFTree left, MFTree right)
    {
        return new Node(value, left, right);
    }
    public int GetValue() { return ((Node)this).value; }
    public MFTree GetLeft() { return ((Node)this).left; }
    public MFTree GetRight() { return ((Node)this).right; }
}
public class Leaf : MFTree { }
public class Node : MFTree
{
    public int value;
    public MFTree left;
    public MFTree right;
    public Node(int value, MFTree left, MFTree right)
    {
        this.value = value;
        this.left = left;
        this.right = right;
    }
}
Figure 10: Class structure of the MF tree type
7 Intermediate Language
When targeting CIL through the Reflection API we work through several levels
of abstraction, ranging from higher levels like interfaces, classes and methods,
to fields and method parameters, and all the way down to low-level stack
manipulation. This process is further complicated by the paradigm switch from the
functional source language to the object-oriented target language CIL.
To simplify the compilation process we break it into two parts by introduc-
ing a new intermediate language between MF and CIL. This language is called
Micro-F# Intermediate Language or MFIL for short. MFIL is object-oriented
like CIL and its main purpose is to transform MF to a state closer to CIL, but
postpone the production of the actual constructions and stack manipulation,
until a paradigm shift has been completed for the whole source program.
MFIL bridges the functional aspects of the source language and the
object-oriented ones in the target language CIL. This is where functional constructs
like functions and let bindings are translated to classes, methods, fields and
local variables. It provides a new abstract syntax tree which acts as an
abstraction on top of the actual CIL instructions.
7.1 Bridging Paradigms
Working in a polyglot environment with a functional source language, an
object-oriented target language and another functional language as the meta language,
it is essential to have a structure that supports some separation of the
different elements. The most important responsibility of the intermediate language
is therefore to create a smoother transition from the semantics of the source
language to those of the destination language. The intermediate language is not
meant to be a wrapper for CIL generation in the sense that MFIL instructions
correspond to exact CIL instructions. Rather, it serves as a translation between
a functional abstract syntax tree and an object-oriented one.
When, for example, a function in the source language is compiled to CIL, many
individual steps, in addition to just creating the list of function classes, are
performed on the stack machine. This includes instructions for parameters, types,
interfaces, fields and so on. The responsibility of the MF compiler is
therefore to create a new abstract syntax tree, with the object-oriented equivalent of
the instructions in the source language, to feed to the next step in the
compilation process.
An example could be the following function definition in the MF source lan-
guage:
fun name p1 p2..pn : a’ -> b’... =
.
.
end
Here p1..pn are the parameters defined by the function, with n being the total
number of parameters. The result of the compilation to MFIL is a list of
FunClass values of size n. The FunClass case is defined as:
FunClass of string * string * (ILType * ILType) * Field list *
    Instruction list
This instruction tells the MFIL compiler that a class has to be created with an
implicit invoke method. The class has a name, a parameter name for the invoke
method, an input and output type, a list of fields and a list of instructions
to populate the body of the invoke method.
This communicates to the final stage of compilation what has to be done,
without having to worry at this point about exactly how it is done.
7.2 Encapsulating Side Effects
The compiler will be written in F#, which acts as the meta language, since this
is the language code examples are presented in and the language in which we
reason about the choices and implementation details [Sestoft, 2012].
The Microsoft Reflection API, used to generate the CIL instructions, is
object-oriented, and although F# works fine with object-oriented constructs, it does
suffer some inconveniences, from a purely functional perspective, in terms of
mutable state and side effects. Therefore the intermediate language also serves
as a separation in the compiler between meta language code which is purely
functional and code which uses object-oriented constructs and produces side
effects. The motivation for splitting the compilation process into two steps
is also to get as much as possible of the translation done in a purely functional
environment and to postpone the interaction with object-oriented elements in the
Reflection API for as long as possible, as well as keeping those elements
encapsulated from the rest of the code. It also aids the development and testing
of the compiler to know that the translation to MFIL marks the point
in the compilation process where input should be viewed with an object-oriented
mindset rather than a functional one.
7.3 Target Language Agnostic
As we saw in section 2.3, having an intermediate language provides an option for
reusability. By targeting CIL, MF takes advantage of the architecture-specific
compilers in the CLR. The MFIL compiler however does not target a
CPU architecture, but instead another intermediate language. Even so, we can
imagine the same reusability for the new intermediate language. Suppose we
would like to compile MF to another platform, for example the Java Virtual
Machine. All work done by the MF compiler could potentially be reused, and only a
new MFIL compiler would have to be constructed for the new target platform.
As stated earlier, the intermediate language will not contain specific CIL
instructions. Instead it is a representation of instructions meant to be executed in
an object-oriented stack-based language, and because the instructions in MFIL
are kept at a high enough level of abstraction, the design should also support
the possibility of replacing the MFIL compiler with another compiler targeting
a different language that supports the same constructs.
So although targeting a different intermediate language is not a primary
motivational factor for creating MFIL, care will still be taken to ensure the
instructions in MFIL are as generic as possible, to support potential future reusability.
7.4 Abstract Syntax Tree
With the motivational choices out of the way, we can start examining the struc-
ture of MFIL, by taking a look at the abstract syntax tree for the language, and
take a brief tour through its various constructions.
As can be seen in figure 11 the abstract syntax tree contains an assortment of
higher and lower level instructions.
ILType Contains the four types supported by the language and is recursively
defined to support type definitions for higher-order functions through the IFunc
interface described in section 5.3.1.
Variant Recall from section 6 that the built-in inductive types tree and list
each contain two different sub-classes: leaf and node for the tree type and nil
and cons for the list type. When an instruction is given either to create a new
instance of one of the sub-classes, or to check whether an existing instance
matches the type of a specific sub-class, the Variant type is used to identify
the sub-class in question.
VariantValue This is used to identify a specific value to be extracted from one
of the sub-classes of the inductive types.
Instruction Some of the operations found here in the Instruction type are
rather low level and match exactly the target operation in CIL. This is true
for pushing an integer or boolean value to the stack, the equality-, comparison-
and arithmetic operations as well as loading and storing values. Others carry
a bit more information and require several steps to be performed in the target
language.
The Branch instruction controls the flow of the application and is used for
if-then-else expressions as well as match expressions. CallInvoke instructs a call
to be made to the invoke method of an instance of one of the step-classes (see
section 5.4), and the instruction list evaluates to the argument of the method
call. CreateObject creates an instance of a given class. StoreField stores the
value currently on top of the stack to a specific field belonging to a specific
class. This instruction is used when closure values get replicated in the fields
of the next level in the step-class structure. The last three instructions perform
operations on one of the variants in either of the two inductive types.
Field The name and the type of a class field are stored in this type.
Class The final and most high-level construct in the MFIL language is the
Class. An input program in MFIL contains a list of zero or more FunClass
elements and a single EntryClass element containing the instructions for the
implicit main method discussed earlier. Since all classes in the translation
described in section 5.4 only contain a single method, there is no reason to
state the method explicitly. Instead the information contained in the FunClass
type is the name of the class, the name of the invoke method parameter, an
input and output type for the invoke method, a list of fields and finally the
MFIL instructions that make up the body of the method.
type ILType =
    | ILInt
    | ILBool
    | ILList
    | ILTree
    | ILFunc of ILType * ILType
type Variant =
    | ILNil
    | ILCons
    | ILLeaf
    | ILNode
type VariantValue =
    | ListValue
    | TreeValue
    | Next
    | Left
    | Right
type Instruction =
    | PushInt of int
    | PushBool of bool
    | Add
    | Mul
    | Mod
    | Div
    | Sub
    | Gt // >
    | Lt // <
    | Eq // =
    | Neq // <>
    | Ge // >=
    | Le // <=
    | Print of ILType
    | PrintAscii
    | Load of string
    | StoreLocal of string * ILType
    | Branch of Instruction list * Instruction list
    | CallInvoke of string * Instruction list
    | CreateObject of string
    | StoreField of string * string
    | CallConstructVariant of Variant
    | CallCheckVariant of Variant
    | ExtractVariantVal of VariantValue
type Field =
    | Field of string * ILType
type Class =
    | FunClass of string * string * (ILType * ILType) * Field list * Instruction list
    | EntryClass of string * Instruction list
Figure 11: The abstract syntax tree of MFIL
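The recursive ILFunc case is what lets an MFIL type describe the nested IFunc interface of a step class. The following F# sketch makes that correspondence concrete by rendering an ILType as the C# type it stands for. The helper clrName is hypothetical and not part of the MFIL compiler; the names MFList and MFTree are taken from section 6.

```fsharp
// Copy of the ILType definition, repeated so the sketch is self-contained.
type ILType =
    | ILInt
    | ILBool
    | ILList
    | ILTree
    | ILFunc of ILType * ILType

// Hypothetical helper: the C# type name that an ILType corresponds to.
let rec clrName (t: ILType) : string =
    match t with
    | ILInt -> "int"
    | ILBool -> "bool"
    | ILList -> "MFList"
    | ILTree -> "MFTree"
    | ILFunc (input, output) ->
        sprintf "IFunc<%s, %s>" (clrName input) (clrName output)

// fun f x y : int -> int -> bool
let curried = ILFunc (ILInt, ILFunc (ILInt, ILBool))
// fun apply f : (int -> int) -> int
let higherOrder = ILFunc (ILFunc (ILInt, ILInt), ILInt)

printfn "%s" (clrName curried)     // IFunc<int, IFunc<int, bool>>
printfn "%s" (clrName higherOrder) // IFunc<IFunc<int, int>, int>
```

A curried function nests in the output position, while a higher-order function nests in the input position.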
Part III
Implementation
In part II we outlined a strategy for translating the structure of the MF language
and how to approach some of the challenges of the process. Now we will examine
how the discussed concepts are actually implemented and illustrate through
examples the compilation process from MF source code to CIL byte-code.
The implementation part introduces topics in chronological order with respect
to the compilation process. We start in section 8 by describing
the lexical and syntactic analysis of MF. The topic of lexical and
syntactic analysis is very broad and will only be briefly covered here, since it
falls outside the main focus of this project and because the resulting lexer and
parser files are mostly auto-generated using Lex/Yacc.
After parsing of the MF source code, section 9 will discuss how type checking is
performed on the resulting abstract syntax tree. Not only does the type checker
ensure a type-safe input program before the compilation starts, but it also
provides type inference for let expressions by enriching the abstract syntax tree of
the language.
With a type-safe program, we will proceed to the first actual compilation step
in section 10. The MF compiler takes an MF syntax tree and transforms it to the
object-oriented intermediate language MFIL, designed for this project.
Finally, in section 11 we will examine how the MFIL compiler receives constructs
written in MFIL and emits the actual stack machine instructions, thereby
concluding the transition from MF source code to CIL byte-code.
The examples presented in the following sections were selected because they
illustrate an important concept in the process or because they provided
particular challenges during the implementation.
Figure 12 provides a roadmap to the different steps of the compilation process,
and in which source files each step primarily takes place.
Figure 12: The overall architecture of the compilation process
8 Lexical and Syntax Analysis
Before 1975, writing a lexer and a parser was a large and time-consuming
part of building a compiler. This changed with the introduction of Lex/Yacc,
and now the process can be mostly automated [Niemann, 2015]. Because the
focus in this project is on compilation, the decision has been made to use
Lex/Yacc to generate the lexer and parser programs for the language.
The lexer and grammar specifications used here are inspired by Professor Peter
Sestoft's definition for the Expression language and extended to accommodate
the additional constructs of the MF language (Sestoft's original specifications
can be found here: http://www.itu.dk/people/sestoft/plc/).
8.1 Lexing
FsLex is an F# version of Lex designed to produce functionality to translate a
Unicode input string to a series of tokens [FsLex, FsYacc, 2015]. FsLex takes as
input a file describing which keywords and symbols the language will support.
Based on these rules the program produces an F# file with source code to perform
the actual tokenization. The entire list of keywords and symbols supported by
MF can be found in section 4, figure 3. Any string the lexer encounters that is
not specified as either a symbol or a keyword is treated as a name, or as a number
(in the case of integers). The complete lexer specification file can be found in
appendix A.17.
8.2 Parsing
The parser creates syntactic relationships between the tokens received from the
lexer. FsYacc, the F# implementation of Yacc, takes as input a
context-free grammar written in a variant of Backus-Naur Form, and
produces the F# parser file [Niemann, 2015]. The resulting parser transforms the
source code to an abstract syntax tree representing the structure of the program.
The type definition for the MF abstract syntax tree can be found in the file
Ast.fs in the project source code. FsYacc generates an LR-type parser and
therefore allows the grammar to be left recursive. Associativity is handled by
prefixing the supported tokens with %left and %right for left and right
associativity respectively. Precedence is then handled by ordering these
declarations from top to bottom, from lower towards higher precedence. An example of
enforcing associativity behaviour can be found by analysing the ARROW ("->") token
in the grammar. This token is used in the type specification for functions
in the following manner: a’ -> b’ -> c’ where a’, b’ and c’ are types
supported by the language. Suppose the parser encounters the type definition
described above. Then a choice must be made whether to parse the expression
like this: (a’ -> b’) -> c’ or like this: a’ -> (b’ -> c’). The latter is right
associative and the correct choice, since a function call will consume
the first type and return a new function with the remaining type definition, as
we saw earlier. The token ARROW is therefore declared with the %right tag in
the beginning of the grammar, to ensure this right associativity. If instead the
function was intended to take as input another function of type a’ -> b’ and
return the type c’, then it would have to be stated explicitly using parentheses.
Figure 13: Right versus left associativity parse trees for function type
specification
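The difference between the two parse trees can be made concrete with a small F# sketch that folds a sequence of types into a function type in both directions. This is an illustration of associativity only, not of how FsYacc works internally. The case names IntType, BoolType and FuncType follow the type checker in section 9; the helpers rightAssoc and leftAssoc are hypothetical.

```fsharp
type TypeDef =
    | IntType
    | BoolType
    | FuncType of TypeDef * TypeDef

// Right associative: a -> b -> c is read as a -> (b -> c).
// Requires a non-empty list.
let rightAssoc (types: TypeDef list) : TypeDef =
    List.reduceBack (fun t rest -> FuncType (t, rest)) types

// Left associative: a -> b -> c is read as (a -> b) -> c.
// Requires a non-empty list.
let leftAssoc (types: TypeDef list) : TypeDef =
    List.reduce (fun acc t -> FuncType (acc, t)) types

// int -> int -> bool
let types = [ IntType; IntType; BoolType ]
let asCurried = rightAssoc types    // takes an int, returns int -> bool
let asHigherOrder = leftAssoc types // takes an int -> int, returns bool
printfn "%A" asCurried
printfn "%A" asHigherOrder
```

Only the right-associative reading describes a two-parameter curried function; the left-associative reading describes a one-parameter higher-order function.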
The top level production of the grammar: Program defines a source program as
being a list of FuncDef, followed by an expression (Expr). The list of FunDef
will contain the definitions for the global functions stated in the beginning of
the source code, and the expression corresponds to the explicit entry point (or
main method) of the program following the language design stated in section 4.
The list of function definitions might be empty since a program without any
functions, besides the main function, is still considered valid. The expression,
on the other hand, always has to be present, since the program must return an
integer value.
9 Type-Checking and AST Transformation
MF is designed, like F#, to be a statically typed language. Since types can
be verified at compile time, as opposed to runtime, and because the target
language CIL is a strongly typed language, a logical next step is to perform
type checking on the abstract syntax tree provided by the parser. In addition
to the actual type checking, the type checker will also provide type inference for
let expressions, enrich the abstract syntax tree with the types, and
support currying by transforming multiple-argument function calls into a
form of partial application by introducing anonymous let bindings.
9.1 Type Checking
The MF language supports four types: the primitive types integer and boolean,
and the inductive types list and tree. In essence the expressions are checked
recursively until one of the four types is reached, in which case the type is
returned. Lookup of let bindings and function names is handled by maintaining
a symbol table of the form name -> typeDef, where name is a string representing
the bound name and typeDef is the type for said name. This is also referred
to as an environment [Appel and Palsberg, 2003].
The first step is to update the environment with the type definitions for the
global functions in the program. Then the expression tree (function body) of each
function is checked by first updating the environment with each parameter of
the function and its corresponding type. This means that before examining the
expression of the following function:
fun myFun f x : (int -> bool) -> int -> int =
<expr>
end
the environment would contain the following entries:
myFun -> FuncType (FuncType (IntType,BoolType),FuncType (IntType,IntType))
f -> FuncType(IntType, BoolType)
x -> IntType
The type checker then proceeds to check the function body with the updated
environment, and since the scope of the function parameters is limited to the
body of the function, each function is checked with its own environment. The
only entries in the environment shared between functions are the global
function names and their corresponding types.
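The environment handling described above can be sketched with an F# Map. The sketch assumes that parameters are bound by peeling one input type off the function type per parameter; the helper bindParams and the Env alias are hypothetical and not taken from the type checker's source.

```fsharp
type TypeDef =
    | IntType
    | BoolType
    | FuncType of TypeDef * TypeDef

type Env = Map<string, TypeDef>

// Bind each parameter name to the next input type of the function type.
let rec bindParams (env: Env) (parameters: string list) (funType: TypeDef) : Env =
    match parameters, funType with
    | [], _ -> env
    | p :: rest, FuncType (input, output) ->
        bindParams (Map.add p input env) rest output
    | p :: _, _ -> failwithf "no parameter type available for %s" p

// fun myFun f x : (int -> bool) -> int -> int = <expr> end
let myFunType =
    FuncType (FuncType (IntType, BoolType), FuncType (IntType, IntType))

// Global environment: shared by all functions, holds only function names.
let globalEnv : Env = Map.ofList [ ("myFun", myFunType) ]

// Environment used for the body of myFun: globals plus its own parameters.
let bodyEnv = bindParams globalEnv [ "f"; "x" ] myFunType
printfn "%A" (Map.toList bodyEnv)
```

Since Map is immutable, globalEnv is left untouched, which gives each function body its own environment without any explicit scope clean-up.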
A type check fails when the actual type does not match the expected type. In
the case of a Match expression check on a list, the following must hold:
if typeof(ve) = list and typeof(ne) = typeof(ce) return typeof(ve)
where ve is the match expression type, ne is the nil expression type and ce is the
cons expression type. The match expression has to be a list, but the
cons and nil expressions do not necessarily have to be lists, since the result
of a match on a list is not required to yield a new list. It is however required
that the two expressions share the same type, since the base case must return
the same type as the inductive step. If the type check holds, ve (in this case the
list type) is set as the type for the match expression. The complete type
rules for the expressions are listed in table 1.
Expression                Type rule
List Match(ve, ne, ce)    if typeof(ve) = list and typeof(ne) = typeof(ce) return typeof(ve)
Tree Match(ve, le, ne)    if typeof(ve) = tree and typeof(le) = typeof(ne) return typeof(ve)
Let(ve, be)               return typeof(be)
If-then-else(ie, te, ee)  if (typeof(ie) = bool or typeof(ie) = int) and typeof(te) = typeof(ee) return typeof(te)
Push int                  int
Push bool                 bool
Int Op(le, re)            if typeof(le) = int and typeof(re) = int return typeof(le)
Bin Op(le, re)            if (typeof(le) = int or typeof(le) = bool) and typeof(le) = typeof(re) return typeof(le)
Table 1: Type rules for MF
9.2 Transform Multiple Argument Function Calls
As established earlier, all functions accept only one argument and return a
new function with the remaining type definition. Without a way to handle
several arguments, we would be forced to write source code with calls to
multiple-parameter functions in the rather cumbersome way of partial application. The
possibility to call a function f with all of its arguments directly would make the
language more concise and more enjoyable to work with. From the observation that,
from a semantic point of view, a call to a function f with n arguments should
behave equivalently to performing n individual function calls, storing the
intermediate functions in let bindings, one solution is to transform the syntax tree
to that form.
Parsing an n argument function call results in the following parse tree:
FunCall...(FunCall(FunCall(Var name, Expr1), Expr2)...Exprn)
The following function "flattens" the nested function call structure to a series
of let expressions and function calls:
let rec flattenCall innerCall =
    match innerCall with
    // A call whose function position is a plain name needs no rewriting.
    | FunCall((Var _), _) -> innerCall
    // Otherwise flatten the inner call first, bind its result to a fresh
    // name, and apply that name to the outermost argument.
    | FunCall(innerCall, argExpr) ->
        let inner = flattenCall innerCall
        let outerName = createRandomName "fun"
        let outerCall = FunCall((Var outerName), argExpr)
        Let(outerName, inner, outerCall)
To illustrate how flattenCall works, imagine a function f taking two arguments x and y, and an input program that calls f with the two arguments five and seven. The input parse tree would look like this:
FunCall(FunCall(Var "f", ConstInt 5), ConstInt 7)
After this structure has been through the flattenCall function, we end up with
the following:
Let("randomName1", FunCall(Var "f", ConstInt 5),
    FunCall(Var "randomName1", ConstInt 7))    (2)
Which is the same as if the original input program had read:
let randomName1 = f 5 in randomName1 7
From this example we see that each argument beyond the first results in a new let expression that "wraps" the previous call in its value expression and, in its body, calls the newly bound name with the next argument. Since the name of the intermediate function is of no importance to the developer, it is given a randomly generated name by the function.
This transformation benefits the next step in the compilation process by adding
support for currying, so multiple parameter function calls are handled identi-
cally to evaluating a series of single parameter functions.
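The transformation can be tried out in isolation with a minimal sketch. The Expr type below is a reduced stand-in for the MF syntax tree, createName is a hypothetical counter-based replacement for createRandomName (so the output is reproducible), and the catch-all case is an assumption added to make the match complete. None of these are taken from the compiler source.

```fsharp
// Reduced stand-in for the MF syntax tree (assumption, not the real Ast.fs).
type Expr =
    | Var of string
    | ConstInt of int
    | FunCall of Expr * Expr
    | Let of string * Expr * Expr

// Hypothetical replacement for createRandomName: a counter keeps names predictable.
let createName =
    let counter = ref 0
    fun (prefix: string) ->
        counter.Value <- counter.Value + 1
        sprintf "%s%d" prefix counter.Value

let rec flattenCall call =
    match call with
    // A call directly on a named function is already in its final form.
    | FunCall(Var _, _) -> call
    // Bind the flattened inner call to a fresh name, then apply that name.
    | FunCall(inner, argExpr) ->
        let flatInner = flattenCall inner
        let outerName = createName "fun"
        Let(outerName, flatInner, FunCall(Var outerName, argExpr))
    // Assumption: any other expression is returned unchanged.
    | other -> other

// f 5 7, parsed as nested single-argument calls.
let example = FunCall(FunCall(Var "f", ConstInt 5), ConstInt 7)
let flattened = flattenCall example
printfn "%A" flattened
```

With the counter starting at one, the result corresponds to the source text let fun1 = f 5 in fun1 7.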
9.3 Enriching the AST
The type checker will, in addition to performing the actual checks, return an
exact copy of the abstract syntax tree, with type enriched expressions. To
contain the enriched expressions, a ”wrapper” addition is made to the Expr
type in the Ast.fs file:
TypedExpr of Expr * TypeDef
The TypedExpr tuple contains the original expression as well as its inferred type.
With each recursive check call, the typed expressions bubble up to the previous
level maintaining the original structure and resulting in the original parse tree
enriched with types. The check function for a let expression is listed below to
illustrate this concept:
and checkLet env name valExpr restExpr =
    let (valType, valTypedExpr) = checkExpr env valExpr
    let updatedEnv = Map.add name valType env
    let (restType, restTypedExpr) = checkExpr updatedEnv restExpr
    (restType, TypedExpr(Let(name, valTypedExpr, restTypedExpr), restType))
First the value part of the let expression is checked recursively and results in
its type and its original syntax tree wrapped in typed expressions. The current
environment is updated with the name of the let binding and its value type. The
body of the expression is then checked with the updated environment, making
the name a bound value in the scope of the rest of the expression. The function
proceeds to return the original structure of the let syntax tree, with the value
and body expressions swapped for their respective typed versions, all wrapped in a typed expression. A graphical representation of the let expression abstract syntax tree before and after type checking can be found in figure 14.
The process described above enables type inference for let expressions in a relatively simple manner. The abstract syntax tree is traversed only once to perform both the type checking and the type inference.
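To see the enrichment at work, the following self-contained sketch type checks a single let expression. The two-case TypeDef, the reduced Expr type and the handling of constants and variables are assumptions made for the example; only checkLet follows the listing above.

```fsharp
// Reduced types for illustration (assumption, not the real Ast.fs).
type TypeDef = IntType | BoolType

type Expr =
    | ConstInt of int
    | ConstBool of bool
    | Var of string
    | Let of string * Expr * Expr
    | TypedExpr of Expr * TypeDef

let rec checkExpr (env: Map<string, TypeDef>) expr =
    match expr with
    | ConstInt _ -> (IntType, TypedExpr(expr, IntType))
    | ConstBool _ -> (BoolType, TypedExpr(expr, BoolType))
    | Let(name, valExpr, restExpr) -> checkLet env name valExpr restExpr
    | TypedExpr(_, t) -> (t, expr)
    | Var name ->
        // A name must have been bound by an enclosing let.
        match Map.tryFind name env with
        | Some t -> (t, TypedExpr(expr, t))
        | None -> failwithf "Unbound name %s" name

and checkLet env name valExpr restExpr =
    let (valType, valTypedExpr) = checkExpr env valExpr
    let updatedEnv = Map.add name valType env
    let (restType, restTypedExpr) = checkExpr updatedEnv restExpr
    (restType, TypedExpr(Let(name, valTypedExpr, restTypedExpr), restType))

// let a = 42 in a
let (resultType, typedTree) = checkExpr Map.empty (Let("a", ConstInt 42, Var "a"))
printfn "%A" typedTree
```

The type of a is never written in the source expression. It is inferred from the value expression and carried into the body through the updated environment.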
Omitting types on let expressions results in a more compact and concise syntax
for the source language. This could be improved even further by implementing
type inference for function definitions as well, although that would require a
different approach.
Figure 14: A conceptual representation of the abstract syntax tree for let ex-
pressions before and after type enrichment
10 MF Compilation
At this point the source program is considered type safe, the abstract syntax
tree has been enriched with the inferred types, and the first compilation step
can take place.
The MF compiler compiles an MF source program to MFIL. MFIL supports a set of constructs matching those found in object-oriented stack-based languages and is semantically closer to the intended target language CIL than MF is. Roughly speaking, the MF compiler provides the transition from a functional source program to an object-oriented one.
Because the MF compiler deals with a mix of the functional and object-oriented paradigms, a distinction will be made between "function" and "method" for this discussion. From now on, a function refers to a function defined in the source language MF or a function in F#, the language of the compiler. A method is an invoke method in the step-classes defined in the intermediate language MFIL.
10.1 Expressions
In any given program, the body of all the global function definitions, as well as
the implicit main function, consists of a single expression tree. The MF compiler
translates the expression tree to a list of instructions in MFIL.
Consider for example a let expression which is defined in the MF abstract syntax
tree as:
Let(name, valExpr, restExpr)
The two expression trees valExpr and restExpr correspond to the binding of the let expression and to the rest of the expression tree, where name is a bound value, respectively:
let a = 5 + 7 + 2 * 10      (valExpr)
in let b = if a < 42 ...    (restExpr)
The type checker has wrapped valExpr in a typed expression (TypedExpr) containing the original expression as well as its type. The type, along with the value and rest expressions, is then sent to the compileLetExpr function:
and compileLetExpr name typeDef valExpr restExpr =
    let valInstr = compileExpr valExpr
    let storeInstr = StoreLocal(name, (getILType typeDef))
    let restInstr = compileExpr restExpr
    restInstr @ (storeInstr :: valInstr)
First the valExpr is translated to a list of instructions. Later, in the MFIL compilation step, these instructions eventually get pushed on the stack, and the store instruction binds their result to the name of the let expression. Next the restExpr is translated to its instruction list, and the combined list of instructions becomes the result of translating the let expression.
Suppose we were given the following simple program written in MF:
let a = 42 in
let b = 3 in
a * b
The parsed expression tree would contain a nested structure of the two let
expressions.
Let("a", 42, Let("b", 3, Mul(a, b)))
The expression tree has been simplified here to make the example easier to read, and the constructs do not correspond to the exact output from the parser, but the end result is the same. After the translation, we end up with a set of instructions in MFIL that look very similar to the actual CIL end result:
PushInt 42
StoreLocal ("a", ILInt)
PushInt 3
StoreLocal ("b", ILInt)
Load "a"
Load "b"
Mul
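The listing above can be reproduced with a small sketch. The Expr and Instruction types are reduced stand-ins with only the cases needed here, and every value is assumed to be an integer. The sketch also assumes that instructions are accumulated newest first and reversed at the end, which is how we read the expression restInstr @ (storeInstr :: valInstr) in compileLetExpr.

```fsharp
// Reduced stand-ins for the MF and MFIL types (assumptions for illustration).
type ILType = ILInt

type Instruction =
    | PushInt of int
    | StoreLocal of string * ILType
    | Load of string
    | Mul

type Expr =
    | ConstInt of int
    | Var of string
    | MulExpr of Expr * Expr
    | Let of string * Expr * Expr

// Instructions are collected newest first, so every case builds a reversed list.
let rec compileExpr expr =
    match expr with
    | ConstInt n -> [PushInt n]
    | Var name -> [Load name]
    | MulExpr(left, right) -> Mul :: (compileExpr right @ compileExpr left)
    | Let(name, valExpr, restExpr) ->
        let valInstr = compileExpr valExpr
        let storeInstr = StoreLocal(name, ILInt)
        let restInstr = compileExpr restExpr
        restInstr @ (storeInstr :: valInstr)

// let a = 42 in let b = 3 in a * b
let program = Let("a", ConstInt 42, Let("b", ConstInt 3, MulExpr(Var "a", Var "b")))

// Reverse once at the end to obtain execution order.
let instrs = compileExpr program |> List.rev
instrs |> List.iter (printfn "%A")
```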
10.2 Function Definitions
As stated in section 5.4 a function in MF is translated to a structure of step-
classes and invoke methods. Let us consider the simplest example: a function
with only one parameter. This translates to a single class with a single invoke
method. The return value of the invoke method is the result of executing the
expression in the function body. This scenario is considered the base case and
does not require any special measures being taken to handle closure, because
calling the function creates an immediate result. Any functions with more than
one parameter, on the other hand, do have to handle closure.
10.2.1 Extracting Invoke Types
Since each parameter in the source function corresponds to one class and invoke
method, an approach must be established for analyzing each parameter in the
function as its own method with an input- and an output type. The resulting
parameter name, input and output type is defined as an InvokeType with the
following definition:
type InvokeType = InvokeType of string * TypeDef * TypeDef
The invoke types of each function are extracted by going through the accepted list of parameters and matching them one by one with the function's type definition, TypeDef. The process is straightforward, since the nested format of TypeDef is already arranged as input/output: when the data structure is traversed, the left side is the input and the right side (the rest of the type definition) is the output.
let rec getInvokeTypes parameters funTypeDef =
    match parameters, funTypeDef with
    | ([], _) -> []
    | (p::ps, FuncType(left, right)) ->
        InvokeType(p, left, right) :: getInvokeTypes ps right
    | _ -> failwith "Illegal function type"
The function getInvokeTypes builds the invoke type list by going through the parameters and the nested type definition of the function, fixing the input type of the next parameter to be the left side of the type definition and the output type to be the remaining type definition.
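As a worked example, consider a two-parameter function of type int -> bool -> int. The TypeDef below is a reduced version with only the cases needed here (an assumption), and the parameter names x and flag are made up for the illustration.

```fsharp
// Reduced TypeDef with only the cases needed for the example (assumption).
type TypeDef =
    | IntType
    | BoolType
    | FuncType of TypeDef * TypeDef

type InvokeType = InvokeType of string * TypeDef * TypeDef

let rec getInvokeTypes parameters funTypeDef =
    match parameters, funTypeDef with
    | [], _ -> []
    | p :: ps, FuncType(left, right) ->
        // left is the input of this step, right is everything that remains.
        InvokeType(p, left, right) :: getInvokeTypes ps right
    | _ -> failwith "Illegal function type"

// int -> bool -> int
let funType = FuncType(IntType, FuncType(BoolType, IntType))
let invokeTypes = getInvokeTypes ["x"; "flag"] funType
printfn "%A" invokeTypes
```

The first invoke type takes an int and returns the remaining function bool -> int. The second takes a bool and returns the final int.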
10.2.2 Step Class Instructions
Recall that the execution of the expression tree in the function body does not
happen until we reach the ”bottom” of the step-class structure, which is in prac-
tice when the last argument has been given. Until that point the compilation
process of each step (represented by each parameter) is carried out in the same
manner, regardless of the number of steps. For an arbitrary step n in the process, the required actions are as follows:
1 The invoke method in step-class n creates an instance of step-class n + 1.
2 The argument from the step-class n invoke method, is stored in a field by
the same name in the new step-class n + 1 instance.
3 Any fields in step-class n are replicated in the step-class n + 1 instance,
preserving the names.
The instance of step n + 1 is then returned by the invoke method in step n. See
figure 6 in section 5.4 for a graphical representation of this concept.
The function: createStepClassInstrs generates the required instructions for
the invoke method in each step class.
let createStepClassInstrs invokeType nextLvlName nextLvlFields =
    let (InvokeType(paramName, inType, outType)) = invokeType
    let localIlType = getILType (FuncType(inType, outType))
    let newObj = CreateObject(nextLvlName)
    let tempName = createRandomName "local"
    let storeObjRef = StoreLocal(tempName, localIlType)
    let loadObjRef = Load(tempName)
    let copyFields = dupFieldInstrs nextLvlName nextLvlFields loadObjRef
    newObj::storeObjRef::copyFields@[loadObjRef]
The function takes as input the invoke type for the current step, the name of the next-level class, and a list of the required fields for the next step level. The invoke type, extracted through the process described in section 10.2.1, contains the parameter name and the input and output types for the current step. First the input and output types are translated to their IL-type equivalent, which is a one-to-one mapping of the MF types and acts more like a conceptual way to illustrate the transition from MF to MFIL. Next an instruction is made to create an instance of the next-level step-class. An instruction is added to store the newly created instance locally, with its IL type, bound to a randomly generated name prefixed with "local" (see footnote 9). Next the instructions for duplicating the fields to the next level are constructed. This process is addressed in more detail below. Finally the class instance is loaded again. That way the reference is on top of the stack when the method returns, and the instance pointer is returned to the caller.
9 Prefixing randomly generated names with an identifier referring to their use made examining the syntax trees a lot easier during the development and debugging of the compiler.
Duplicating Fields The instructions for copying fields between the two step
levels are handled by the dupFieldInstrs function. Keep in mind that the list
of fields passed to the function includes all fields present in the current step
level, as well as the invoke method argument.
let rec dupFieldInstrs className fields loadObjRef =
    match fields with
    | field::rest ->
        let (Field(fieldName,_)) = field
        let loadValue = Load(fieldName)
        let storeField = StoreField(className, fieldName)
        loadObjRef::loadValue::storeField::dupFieldInstrs className rest loadObjRef
    | [] -> []
At this point in the process we do not worry about details such as how a field value is loaded. The only information conveyed to MFIL is that any value x from the current instance has to be replicated to the field of the same name in the newly created instance of the next level. Whether that value exists as an argument or a field in the current level is an implementation detail left to the MFIL compiler.
As can be seen above, for each field a set of three instructions is returned:
1 Load the instance reference
2 Load the value to be duplicated
3 Store the value in the field
The instance reference has to be loaded for each field because the CIL operation that stores the field value consumes both the value and the instance reference from the stack [ECMA, 2015]. This is an implementation detail that requires knowledge of how the end compiler works, and it might be different if the MFIL code produced were to target another stack-based language than CIL.
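The two functions can be combined into a runnable sketch that shows the instructions generated for one step. The types are reduced stand-ins, ILFunc is an assumed name for the IL type of a function object, and the temporary name is passed in as a parameter instead of being generated randomly so that the result is predictable.

```fsharp
// Reduced stand-ins for the MFIL types (assumptions for illustration).
type ILType = ILInt | ILFunc

type Field = Field of string * ILType

type Instruction =
    | CreateObject of string
    | StoreLocal of string * ILType
    | Load of string
    | StoreField of string * string

let rec dupFieldInstrs className fields loadObjRef =
    match fields with
    | Field(fieldName, _) :: rest ->
        // Instance reference, then the value, then the store that consumes both.
        let copyOne = [loadObjRef; Load fieldName; StoreField(className, fieldName)]
        copyOne @ dupFieldInstrs className rest loadObjRef
    | [] -> []

// tempName replaces the randomly generated local name of the original function.
let createStepClassInstrs tempName nextLvlName nextLvlFields =
    let loadObjRef = Load tempName
    let copyFields = dupFieldInstrs nextLvlName nextLvlFields loadObjRef
    CreateObject nextLvlName :: StoreLocal(tempName, ILFunc) :: (copyFields @ [loadObjRef])

// Step 0 of a hypothetical two-parameter function f: the argument x must be
// carried over to the next step-class, here called fstep1.
let stepInstrs = createStepClassInstrs "local1" "fstep1" [Field("x", ILInt)]
stepInstrs |> List.iter (printfn "%A")
```

The final Load leaves the reference to the new instance on top of the stack, ready to be returned by the invoke method.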
10.2.3 Step Class Structure
Each function in the source program is converted to a list of classes in the MFIL
language. The type containing the classes in MFIL is defined as:
type Class = FunClass of string * string * (ILType * ILType) * Field list * Instruction list
The information contained in the class type consists of the name of the function, the name of the parameter for the invoke method, the input and output types of the invoke method, a list of fields required for the class, and finally a list of instructions for the invoke method.
The compileFunDef function which is responsible for creating the step-classes
described in section 5.4, takes as input the name of the source function, the list
of invoke types extracted from the parameters and the expression tree with the
instructions for the body of the function.
The list of invoke types is traversed and determines how deep the step-class structure will be. As long as the list of invoke types still contains elements, we perform the following steps:
let (InvokeType(paramName, inType, outType)) = nextParam
let classes = buildFunClasses rest (Field(paramName, getILType inType)::reqFields) (lvl+1)
let (FunClass(className, _, _, nextStepFields, _)) = List.head classes
The next parameter, in the form of an invoke type, is deconstructed to access its name and type definition. The input type of the parameter is converted to its IL type equivalent, and the type, along with the parameter name, is wrapped in a field type. The build function is called with the remaining parameters and with the new field added to the list of required fields for the next level. Returned is a list of classes from the next level all the way to the "bottom" of the structure. The first element of the returned list contains the class for the next step level, and its name and required fields are extracted.
let (inILType, outILType) = getILTypes inType outType
let stepClassName = genFunClassName funName lvl
let instrs = createStepClassInstrs nextParam className nextStepFields
The parameter input and output types are converted to IL types, and a name for the current step class is created. The name is built from the original name of the function appended with step<n>, where n is the number of the current level, contained in the level counter. This holds for any n > 0. For n = 0, which is the top level of the step-class structure, the name is kept as the original name of the function, for convenience when handling function calls. Next the instructions for the invoke method body are generated, using the process described in section 10.2.2. When the "bottom" of the step-class structure is reached, which happens when the last element in the list of invoke types is reached, the actual computation of the function body takes place. So instead of providing the invoke method with the step-class instructions, as was the case previously, it is given the set of instructions that results from compiling the expression tree defined in the original function body. Compiling the expression tree is explained in section 10.1.
FunClass(stepClassName, paramName, (inILType, outILType), reqFields, instrs)::classes
Finally the class definition is built and added to the list of classes from the
recursive step.
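The recursion can be sketched in a simplified form. This is not the compiler's buildFunClasses: the class type is reduced to four components, all parameters are assumed to be integers, instructions are represented by placeholder strings, and the signature is an assumption chosen to keep the example self-contained.

```fsharp
// Simplified stand-ins (assumptions): parameters are all ints and
// instructions are plain strings.
type ILType = ILInt
type Field = Field of string * ILType
type FunClass = FunClass of string * string * Field list * string list

// Level 0 keeps the function name; deeper levels get a step<n> suffix.
let genFunClassName funName lvl =
    if lvl = 0 then funName else sprintf "%sstep%d" funName lvl

let rec buildFunClasses funName parameters reqFields lvl bodyInstrs =
    match parameters with
    | [] -> []
    | [lastParam] ->
        // Bottom of the structure: the invoke method runs the function body.
        [FunClass(genFunClassName funName lvl, lastParam, reqFields, bodyInstrs)]
    | p :: rest ->
        // The current parameter becomes a required field of the next level.
        let nextFields = Field(p, ILInt) :: reqFields
        let classes = buildFunClasses funName rest nextFields (lvl + 1) bodyInstrs
        let (FunClass(nextName, _, _, _)) = List.head classes
        let instrs = [sprintf "create %s, copy fields, return it" nextName]
        FunClass(genFunClassName funName lvl, p, reqFields, instrs) :: classes

let addClasses = buildFunClasses "add" ["x"; "y"] [] 0 ["body of add"]
addClasses |> List.iter (printfn "%A")
```

For a two-parameter function named add, the sketch yields the classes add and addstep1, where addstep1 requires the field x and carries the body instructions.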
10.3 Match Expression
To illustrate the concept of working with the inductive types, we are going to
look at how a pattern match expression is compiled. In MF the two inductive
data types supported: tree and list each have only two cases. For the tree we
have either a Node containing a value and the two sub-trees, or a Leaf indicating
the end of the tree. For the list we have either a Cons element containing the
current element value and the next element in the list, or Nil meaning we have
reached the end of the list.
For both types, any given element is one of exactly two cases. This is very similar to an if-then-else expression: either a condition holds and we execute the instructions in the "then-block", or the condition fails and the "else-block" instructions are executed. In fact these two constructs are so similar that they are both compiled to the MFIL Branch instruction.
To keep the list of instructions short, a very simple match example has been
chosen.
The function myMatchExample takes a list of integers and produces a new list,
with the result of squaring all elements. Since we already discussed the trans-
lation of a function to its corresponding step-class structure, we limit this dis-
cussion to cover only the actual match expression in the function body.
fun myMatchExample lst : list -> list =
    match lst with
    | Nil -> []
    | Cons(x, xs) -> (x * x)::myMatchExample xs
    end
end
From the function compileMatchCases we extract the two match cases and send
them to compileListMatch where the actual compilation takes place. This is
done so compileListMatch can process the input regardless of the order of the
match cases. The input given is the actual expression which the match is per-
formed upon, the type of the expression, the names for the value and rest binding
and an expression for each of the cases, with the action to perform.
We start by compiling the match expression which in this case translates to a
load instruction of the name lst, because the match is performed on the list
argument of the function. An instruction is then made to store the loaded
argument to a local value given a random name.
let result = compileExpr matchExpr
let name = createRandomName "match"
let storeResult = StoreLocal (name, getILType matchExprType) :: result
Next the newly stored value is loaded again and is matched against the nil type.
Based on the result of this check, the branching of the instructions can take
place.
let loadResult = Load name
let isNilInstr = CallCheckVariant(ILNil)
As we saw earlier, the Branch type consists of two lists of instructions: the first set of instructions is executed if the previous check holds, and the other set if it fails. In this case we tested whether the match expression was of type nil, so the first set of instructions is the result of compiling the expression for the nil "action".
let nilBranchInstrs = compileExpr nilAction |> List.rev
In this case a match on nil just returns an empty list, so the expression here simply evaluates to a CallConstructVariant instruction with an ILNil variant argument.
The cons case is a little more interesting, since we also have to bind the value and the next element before we evaluate the action expression, where these appear as bound names. This is not much different from how let expressions are handled.
For each of the two elements we load the local value containing the list, create
an extract instruction for the appropriate element and store it to a local value
with the supplied name.
let bindValue = [loadResult; ExtractVariantVal(ListValue); StoreLocal(valName, ILInt)]
let bindNext = [loadResult; ExtractVariantVal(Next); StoreLocal(restName, ILList)]
Finally the expression for the action part of the cons case is translated.
let consActionInstrs = compileExpr consAction |> List.rev
Since the :: operator is simply syntactic sugar for creating a new cons element, we start with the two arguments needed for cons. The value argument is found by loading x (the name bound to the value in the cons element of the current match) twice and applying the mul operator. To find the next element, or the rest of the list, the function calls itself recursively with the list argument bound to xs, the name of the next element in the list. This results in a Load instruction for xs and a CallInvoke instruction to myMatchExample. The last instruction in the cons action expression constructs a new cons element.
The combined list of MFIL instructions from this match example can be seen in
section 11.6.2, where we complete the compilation process by compiling these
instructions to CIL.
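How the pieces described in this section fit together can be sketched as follows. This is only an illustration under several assumptions: the instruction type is reduced, ILCons is an assumed name for the cons variant, the helper assembleListMatch is hypothetical, the pre-compiled instruction lists are written by hand in execution order, and the exact operand order around CallInvoke is a guess. The authoritative instruction list is the one given in section 11.6.2.

```fsharp
// Reduced MFIL instruction set (assumption). ILCons is an assumed variant name.
type ILType = ILInt | ILList
type Variant = ILNil | ILCons
type VariantField = ListValue | Next

type Instruction =
    | Load of string
    | StoreLocal of string * ILType
    | Mul
    | CallInvoke of string
    | CallCheckVariant of Variant
    | CallConstructVariant of Variant
    | ExtractVariantVal of VariantField
    | Branch of Instruction list * Instruction list

// Hypothetical helper: combines already compiled parts, all in execution order.
let assembleListMatch matchInstrs tempName valName restName nilInstrs consInstrs =
    let loadResult = Load tempName
    let bindValue = [loadResult; ExtractVariantVal ListValue; StoreLocal(valName, ILInt)]
    let bindNext = [loadResult; ExtractVariantVal Next; StoreLocal(restName, ILList)]
    // Store the matched value, reload it and test it against the nil variant.
    let check = [StoreLocal(tempName, ILList); loadResult; CallCheckVariant ILNil]
    // First branch: nil action. Second branch: bindings followed by the cons action.
    let branch = Branch(nilInstrs, bindValue @ bindNext @ consInstrs)
    matchInstrs @ check @ [branch]

let matchInstrs =
    assembleListMatch
        [Load "lst"] "match1" "x" "xs"
        [CallConstructVariant ILNil]
        [Load "x"; Load "x"; Mul; Load "xs"; CallInvoke "myMatchExample"; CallConstructVariant ILCons]

matchInstrs |> List.iter (printfn "%A")
```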
11 MFIL Compilation
This section concludes the compilation process by explaining the MFIL compiler. The compiler takes a program written in MFIL and produces an assembly of CIL instructions, classes and methods. Microsoft's Reflection.Emit API is used to accomplish this task by creating a single assembly, with a single module, to contain all the classes. Figure 15 gives a hierarchical overview of the constructs used for this project and how they relate to each other.
11.1 IFunc Interface
To enable a structure in which instances with the right type definition can be given as arguments to the invoke methods of other instances, all classes must support a common interface (see footnote 10). The need for this interface is explained in section 5.3.1.
The interface defines a single method, "Invoke", with a single generic parameter type and a generic return type.
10 Although the input and output type definitions vary depending on the original function's type definition, it is fundamentally still the same interface.
Figure 15: A hierarchical overview of the CIL constructs used in this project
let [| tinput; toutput |] = newInterface.DefineGenericParameters([|"TInput"; "TOutput"|])
newInterface.DefineMethod("Invoke", methodAttributes, toutput.AsType(), [| tinput.AsType() |]) |> ignore
The actual types for the generic input and output types are made concrete when the step-classes implement the interface, as we shall see later.
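For readers who want to experiment with the two lines above, the following sketch defines a comparable generic interface in a throwaway dynamic assembly. The assembly and module names, the interface name IFunc and the chosen attribute flags are assumptions. The sketch uses AssemblyBuilder.DefineDynamicAssembly, which may differ from how the thesis compiler sets up its assembly.

```fsharp
open System
open System.Reflection
open System.Reflection.Emit

// Throwaway in-memory assembly and module (names are made up for this sketch).
let asmBuilder =
    AssemblyBuilder.DefineDynamicAssembly(AssemblyName("MFDemo"), AssemblyBuilderAccess.Run)
let modBuilder = asmBuilder.DefineDynamicModule("MFDemoModule")

// An interface must be marked both Interface and Abstract.
let ifaceAttrs = TypeAttributes.Public ||| TypeAttributes.Interface ||| TypeAttributes.Abstract
let newInterface = modBuilder.DefineType("IFunc", ifaceAttrs)

// Two generic parameters: the input type and the output type of Invoke.
let genParams = newInterface.DefineGenericParameters("TInput", "TOutput")

// Interface methods are abstract and virtual.
let methodAttributes =
    MethodAttributes.Public ||| MethodAttributes.Abstract ||| MethodAttributes.Virtual
    ||| MethodAttributes.HideBySig ||| MethodAttributes.NewSlot

newInterface.DefineMethod("Invoke", methodAttributes, genParams.[1] :> Type, [| genParams.[0] :> Type |])
|> ignore

let ifaceType = newInterface.CreateType()
printfn "%s has %d generic parameters" ifaceType.Name (ifaceType.GetGenericArguments().Length)
```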
11.2 Built-in Types
To ensure the availability of the two inductive types for the rest of the instruc-
tions, they are built first before the actual source program is compiled. The tree
and list types are created in the files: CILTreeBuilder.fs and CILListBuilder.fs
respectively. A set of helper functions shared between the two can be found in:
CILClassBuilder.fs and CILMethodBuilder.fs.
The two types are semantically very similar, the only difference being that a node element in a tree contains a value and a left and right sub-tree, whereas a cons element in the list only contains the value and the rest of the list.
To illustrate the concept, this section will go through the process of creating the MFTree type (see footnote 11). The expected outcome can be seen in figure 10 in section 6.
11 The tree type is named MFTree to emphasize that it is part of the MF language.
11.2.1 Structure
Recall from section 6 that through object-oriented decomposition, a tree is defined as a set of nodes and leaves. A class has to be created for each of the two cases and made a subclass of the abstract MFTree class. For convenience the MFTree superclass will contain a set of static methods for creating and returning instances of the two subclasses. In addition, methods that test a given instance against each of the two types (leaf and node), and getter methods for the three fields in the node class (value, left and right sub-tree), will also be implemented.
The Leaf class contains no fields because it only acts as an indicator for an
empty tree.
11.2.2 Create Classes
The function: createTreeStructure is responsible for building the MFTree and
its two subclasses. It takes only a single argument which is an instance of the
Reflection.Emit.ModuleBuilder class (mb). As mentioned earlier: all classes are
defined in the same module in the output assembly. This includes the built-in
types.
let treeClass = defineSuperClass mb "MFTree"
let treeClassCTor = defineDefaultCTor treeClass
First the abstract class MFTree along with a default (no parameter) constructor
is created. The constructor is needed when the Node class constructor is created
as we shall see next. An important thing to keep in mind is that at the CIL
level, a constructor is not treated much differently than a method.
let nodeClass = defineSubClass mb "Node" treeClass
let valFld = addField nodeClass "value" typeof<int>
let leftFld = addField nodeClass "left" treeClass
let rightFld = addField nodeClass "right" treeClass
let nodeClassCTor = defineNodeCTor nodeClass treeClass treeClassCTor valFld leftFld rightFld
The defineSubClass function creates a class named "Node" and defines its parent class as the one provided by the argument, in this case the MFTree class. Next we define the three required fields in the Node class: a value field and one for each sub-tree, left and right. The value field is restricted to an integer type, and the type of the two sub-tree fields is the superclass MFTree, since a sub-tree can be either a leaf or a node.
Next a custom three-argument constructor is created in the Node class. The three fields created previously are given as arguments to the function defineNodeCTor, which creates the constructor. The fields are used to store the three arguments provided by the constructor call. The instructions for loading an argument and storing it in the appropriate field come in sets of three for each argument, shown here for the value argument/field pair:
ilGen.Emit(OpCodes.Ldarg_0)
ilGen.Emit(OpCodes.Ldarg_1)
ilGen.Emit(OpCodes.Stfld, valFld)
For instance methods (the constructor falls in this category), argument 0 refers to the this pointer of the current instance. Actual arguments range from 1 to n. So the Ldarg_0 instruction pushes the instance pointer to the top of the stack. Then the first argument of the constructor is loaded (Ldarg_1) and stored to its respective field (Stfld), consuming both the instance reference and the value in the process.
The MFTree class is used in the defineNodeCTor function to define the argument types for the left and right sub-tree. The MFTree constructor, on the other hand, is used for a less obvious reason. When a call to a constructor in a subclass is made, a call to the constructor of its parent class also has to be made [ECMA, 2015]. This is a detail usually handled automatically by the compiler when we write C# code, but it has to be handled manually here at the CIL level.
11.2.3 Create Methods
With the class hierarchy built and the appropriate constructors for the two
subclasses defined, it is time to create the necessary methods in the abstract
parent-class: MFTree. We start by defining the type-testing methods for each
of the subclasses:
createIsTypeMethod treeClass nodeClass "IsNode"
createIsTypeMethod treeClass leafClass "IsLeaf"
The createIsTypeMethod function is designed to create a testing method for
any given compare-class. The resulting method takes no arguments, returns
a boolean value with the comparison result, and contains the following set of
instructions:
ilGen.Emit(OpCodes.Ldarg_0)
ilGen.Emit(OpCodes.Isinst, compareClass)
ilGen.Emit(OpCodes.Ldnull)
ilGen.Emit(OpCodes.Cgt_Un)
ilGen.Emit(OpCodes.Ret)
First the this pointer is pushed onto the stack. Next the instance is tested against the compare class (Isinst): if the instance is of that class, the instance reference is left on the stack; if not, null is pushed in its place. A null value is then pushed on the stack (Ldnull), and a check is performed (Cgt_Un) to see whether the result of the Isinst instruction is greater than the newly pushed null value. This is the case if Isinst yielded anything but null, in other words if the instance was indeed of the compare class type. The greater-than check pushes 1 on the stack if so and 0 otherwise. The return instruction (Ret) ends the current stack frame and returns the value on top of the stack (either 0 or 1), which is implicitly treated as a boolean.
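The same five instructions can be tried out without building a class hierarchy. The sketch below emits them into a DynamicMethod instead of an instance method on MFTree. That is a deliberate simplification: the generated method is static, so Ldarg_0 refers to its first (and only) argument rather than to a this pointer. The helper name makeIsTypeCheck is made up for the example.

```fsharp
open System
open System.Reflection.Emit

// Hypothetical helper: builds a function that tests whether an object is an
// instance of compareClass, using the Isinst / Ldnull / Cgt_Un sequence.
let makeIsTypeCheck (compareClass: Type) : Func<obj, bool> =
    let dm = DynamicMethod("IsType", typeof<bool>, [| typeof<obj> |])
    let ilGen = dm.GetILGenerator()
    ilGen.Emit(OpCodes.Ldarg_0)               // static method: first argument
    ilGen.Emit(OpCodes.Isinst, compareClass)  // reference if it matches, else null
    ilGen.Emit(OpCodes.Ldnull)
    ilGen.Emit(OpCodes.Cgt_Un)                // 1 if the Isinst result is not null
    ilGen.Emit(OpCodes.Ret)
    dm.CreateDelegate(typeof<Func<obj, bool>>) :?> Func<obj, bool>

let isString = makeIsTypeCheck typeof<string>
let stringResult = isString.Invoke(box "hello")
let intResult = isString.Invoke(box 42)
printfn "string: %b, int: %b" stringResult intResult
```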
Next up are the two static methods for creating instances of the subclasses Leaf and Node:
createNewLeafMethod treeClass nilClassCTor
createNewNodeMethod treeClass nodeClassCTor
The MFTree class is passed to the create functions, since the return type of the resulting methods is MFTree. The respective constructors are also passed to the functions to create the new instances. The method signature of the create Node method is very similar to the Node class constructor described in section 11.2.2. Both take three arguments: a value and the left and right sub-tree. The difference is that the create Node method is static and has a return type, whereas the constructor does not. As for the body of the method, we load all three arguments (in the right order), like in the constructor, but instead of storing them to fields, we create an object of the Node class with the aforementioned constructor and end the stack frame. This consumes the three values on the stack and returns the new class instance, now on top of the stack. Notice that Ldarg_0 does not load the this pointer, as opposed to the previous example, because this is a static method.
ilGen.Emit(OpCodes.Ldarg_0) // First argument, not instance pointer
ilGen.Emit(OpCodes.Ldarg_1)
ilGen.Emit(OpCodes.Ldarg_2)
ilGen.Emit(OpCodes.Newobj, nodeClassCTor)
ilGen.Emit(OpCodes.Ret)
The Leaf method takes no arguments and only creates an instance of the Leaf
class and returns.
The last step is to create the methods for extracting the three values from a
Node instance. Given a superclass, subclass, field and a name, the function
createGetSubFieldMethod creates a method that returns the value of said field.
The method takes no arguments and its return type is set to the type of the
field. The method body is relatively simple:
ilGen.Emit(OpCodes.Ldarg_0)
ilGen.Emit(OpCodes.Castclass, subClass)
ilGen.Emit(OpCodes.Ldfld, fieldInfo)
ilGen.Emit(OpCodes.Ret)
The instance pointer is loaded as we have seen before. A typecast (Castclass) is performed from the instance of type MFTree to the provided subclass (in this case the Node class). Next the value of the given field is loaded (Ldfld), and the stack frame ends, which returns the value.
11.3 Meta Information
During the compilation process a local environment must be present for each
invoke method. The environment has to be updated with local values and
arguments as we traverse through the list of instructions in the method bodies.
In addition to this, global functions and the built-in types: MFList, MFTree
along with the shared interface, must also be available to the compiler. In this
section a record named Meta is described, which will act as a placeholder, or a "toolbox", for these necessities. Gathering all this information in a single place provides some advantages:
• The compiler source code is more readable and easier to maintain, with shorter functions, since only a single meta value has to be passed around.
• Functions passing the information around do not need refactoring if the meta record is extended with new functionality, such as a new supported type.
type Meta =
    { currentEnv : Env
      classInfoMap : Map<string, ClassInfo>
      preDefTypes : PreDefTypes }
The Meta record contains the environment for the method currently being com-
piled, a symbol table matching names of global functions with their required
information and finally a record containing the predefined types mentioned ear-
lier.
11.3.1 Environment
At the CIL level, names for local values and arguments are unimportant. When
interacting with these, they are instead referred to by their index. The job of
the environment is therefore to map names to their respective indices.
Because CIL is a strongly typed language, type information for the values also
has to be stored.
When compiling an invoke method, bound names stem from three different
sources: a method argument, a local value[12] or a class field. A field is
treated a bit differently from the other two, since a field is not part of the
method but is instead defined in the class. Unlike an argument or a local value,
which are accessed by index, a field is accessed through the FieldInfo type. But
even though an argument and a local value are both loaded based on their index,
this is done by two different CIL instructions (ldarg and ldloc respectively).
Because of this, the environment must be able to distinguish how each value is
present in the method.
type Element =
| Arg of int * ILType
| Local of int * ILType
| Field of FieldInfo * ILType
type Env =
| Env of int * Map<string, Element>
An entry in the environment is therefore a mapping between a name and
an Element value. This enables the compiler to respond appropriately to the
different kinds of element when a lookup is made in the environment. The integer
in the Env tuple refers to the next available index and is incremented when an
argument or a local value is added to the environment.
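The environment described above can be sketched as follows. This is an illustration under assumptions rather than the thesis implementation: ILType is reduced to two cases, the helper names addArg, addLocal, lookup and loadInstr are hypothetical, and the load instruction is rendered as text instead of being emitted.

```fsharp
open System.Reflection

// Reduced stand-in for the thesis' ILType (assumption).
type ILType = ILInt | ILBool

type Element =
    | Arg of int * ILType
    | Local of int * ILType
    | Field of FieldInfo * ILType

type Env =
    | Env of int * Map<string, Element>

let emptyEnv = Env (0, Map.empty)

// Hypothetical helpers: each one binds the name to the next free index
// and increments the counter, as described in the text.
let addArg name ty (Env (next, bindings)) =
    Env (next + 1, Map.add name (Arg (next, ty)) bindings)

let addLocal name ty (Env (next, bindings)) =
    Env (next + 1, Map.add name (Local (next, ty)) bindings)

let lookup name (Env (_, bindings)) = Map.tryFind name bindings

// The element kind decides which CIL load instruction applies.
let loadInstr element =
    match element with
    | Arg (index, _)   -> sprintf "ldarg %d" index
    | Local (index, _) -> sprintf "ldloc %d" index
    | Field (info, _)  -> sprintf "ldfld %s" info.Name

let env = emptyEnv |> addArg "x" ILInt |> addLocal "y" ILBool
```

A lookup returns an option, so an unbound name surfaces as None at the call site instead of as an exception inside the map.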
[12] A let expression in the function body.
11.3.2 Global Functions
The classInfoMap maps method names to the ClassInfo type, which is a
place-holder for all relevant information related to the step-classes. The type
is defined as:
type ClassInfo =
    { classBuilder : TypeBuilder
      fields       : (FieldBuilder * ILType) list
      classType    : Type
      ctorBuilder  : ConstructorBuilder }
In addition to the actual class, its type and its constructor, we also store the
list of class fields. At the top level of the step-class structure this is an
empty list, but at subsequent levels the fields contain the values or function
closures described earlier.
Each field is stored together with its type. If the field contains a
value, its type is not needed to load it onto the stack. But if the field contains
a function closure, a call has to be made to the invoke method of the class
instance, and here the type is required.
11.3.3 Predefined Types
Lastly, we have the built-in types MFList and MFTree, and the IFunc interface.
type PreDefTypes =
    { funcInterface : Type
      listType      : Type
      treeType      : Type }
It is worth noticing that, unlike the step-classes from the previous section,
we do not need all the extra information here, only the classes (and the
interface). Notice also that the predefined types are saved as system types (Type)
and not as TypeBuilder objects, as was the case with the step-classes. The
reason for this is that the predefined types are already built and completed and,
as a result, can be interacted with in a different way than the step-classes, whose
instructions, fields and types depend on the content of the source program.
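The difference between a class under construction and a completed one can be seen in a small sketch. The assembly and module names are arbitrary, and the empty MFList class defined here is only a stand-in for the real built-in type.

```fsharp
open System
open System.Reflection
open System.Reflection.Emit

// Arbitrary names for a throwaway in-memory assembly (assumption).
let asm =
    AssemblyBuilder.DefineDynamicAssembly(
        AssemblyName("Demo"), AssemblyBuilderAccess.Run)
let modl = asm.DefineDynamicModule("DemoModule")

// A class under construction: members can still be added through the
// TypeBuilder, which is the situation of the step-classes.
let listBuilder : TypeBuilder =
    modl.DefineType("MFList", TypeAttributes.Public ||| TypeAttributes.Class)

// Completing the class yields an ordinary System.Type, which is the
// form in which the predefined types are stored in PreDefTypes.
let listType : Type = listBuilder.CreateType()
```

Once CreateType has been called the class can no longer be extended, which fits the built-in types whose shape is fixed, but not the step-classes, whose members depend on the source program.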
11.4 Expected MFIL Output
Before moving on to the translation of the step-class structure, we will revisit
the MFIL abstract syntax tree from section 7.4 and take a closer look at the
expected output from the MFIL compiler for each instruction.
It will soon become clear that the instructions are a mix of lower level ones,
which are very close to the CIL end result, and others with a higher level of
abstraction, which require several steps to be performed.
PushInt and PushBool The lower level push instructions push either an
integer or a boolean value onto the evaluation stack.
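As a sketch of what such a translation might look like, the code below maps the two push instructions to CIL constants. It rests on assumptions: the Instr type is a reduced stand-in for the MFIL syntax tree, emitPush and run are hypothetical helpers, and representing booleans as the integers 1 and 0 follows the usual CIL convention rather than anything stated in the text.

```fsharp
open System
open System.Reflection.Emit

// Reduced stand-in for the MFIL instruction type (assumption).
type Instr =
    | PushInt of int
    | PushBool of bool

// Hypothetical helper: emit the CIL constant for a push instruction.
// Booleans are represented as the 32-bit integers 1 and 0.
let emitPush (ilGen : ILGenerator) instr =
    match instr with
    | PushInt n      -> ilGen.Emit(OpCodes.Ldc_I4, n)
    | PushBool true  -> ilGen.Emit(OpCodes.Ldc_I4_1)
    | PushBool false -> ilGen.Emit(OpCodes.Ldc_I4_0)

// Wrap a single push in a dynamic method that returns whatever value
// was left on the evaluation stack.
let run instr =
    let m = DynamicMethod("push", typeof<int>, Type.EmptyTypes)
    let ilGen = m.GetILGenerator()
    emitPush ilGen instr
    ilGen.Emit(OpCodes.Ret)
    (m.CreateDelegate(typeof<Func<int>>) :?> Func<int>).Invoke()
```

Calling run (PushInt 42) executes the generated method and hands back the pushed constant, which makes the one-to-one nature of these lower level instructions easy to check.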
MicroFSharp

  • 1. IT-University Copenhagen Micro-F# Compiling Functional Source-code to Object-oriented stack Machine Code Master Thesis September 2015 Author: Joachim Vincent Hasseldam (jhas@itu.dk) Supervisor: Associate Professor Rasmus Ejlers Møgelberg (mogel@itu.dk)
  • 2. Joachim Vincent Hasseldam September 2015 Acknowledgement I would like to thank Associate Professor Rasmus Ejlers Møgelberg, from the Theoretical Computer Science department at ITU Copenhagen, for supervising this thesis and for his valued advice and guidance throughout the project. i
  • 3. Joachim Vincent Hasseldam September 2015 Abstract Compiling a functional programming language to an object-oriented stack- based language requires a way to translate constructs like higher-order functions and inductive data types to object-oriented equivalents, while still maintaining the structural and semantically integrity of the source program. To understand how this can be done, a small programming language, with a subset of the features found in a ML-based language is created and com- piled to Microsoft Common Intermediate Language (CIL). This report describes the new language and suggests a technique for trans- lating the language to an object-oriented intermediate language conceived for this project, which is then translated to the target language CIL. The technique presented proves to accommodate the required constructs within the constraints of the language, although support for recursive functions with a substantial input size is still lacking. ii
  • 4. Contents I Introduction 1 1 Testing Environment 2 2 Background 2 2.1 Higher-Order Functions . . . . . . . . . . . . . . . . . . . . . . . 2 2.1.1 Currying and Partial Application . . . . . . . . . . . . . . 3 2.1.2 Closure . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 2.2 Recursive Data Types . . . . . . . . . . . . . . . . . . . . . . . . 4 2.3 Common Intermediate Language . . . . . . . . . . . . . . . . . . 4 3 CIL Abstraction 5 II Design 6 4 Micro-F# 6 4.1 Types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 4.2 Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 4.3 Let Bindings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 4.4 Program Structure . . . . . . . . . . . . . . . . . . . . . . . . . . 7 4.5 Abstract Syntax Tree . . . . . . . . . . . . . . . . . . . . . . . . 8 5 Translating Program Structure 9 5.1 Expression Language . . . . . . . . . . . . . . . . . . . . . . . . . 9 5.2 First-Order Functions . . . . . . . . . . . . . . . . . . . . . . . . 11 5.3 One Parameter Higher-Order Functions . . . . . . . . . . . . . . 12 5.3.1 Shared Interface . . . . . . . . . . . . . . . . . . . . . . . 13 5.4 N Parameter Higher-Order Functions . . . . . . . . . . . . . . . . 14 6 Inductive Data Types 16 6.1 List . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 6.1.1 Constructor Methods . . . . . . . . . . . . . . . . . . . . 16 6.1.2 Test Methods . . . . . . . . . . . . . . . . . . . . . . . . . 18 6.1.3 Accessor Methods . . . . . . . . . . . . . . . . . . . . . . 18 6.2 Tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 6.3 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 7 Intermediate Language 21 7.1 Bridging Paradigms . . . . . . . . . . . . . . . . . . . . . . . . . 21 7.2 Encapsulating Side Effects . . . . . . . . . . . . . . . . . . . . . . 22 7.3 Target Language Agnostic . . . . . . . . . . . . . . . . . . . . . . 
22 7.4 Abstract Syntax Tree . . . . . . . . . . . . . . . . . . . . . . . . 23 III Implementation 26 8 Lexical and Syntax Analysis 26 iii
  • 5. Joachim Vincent Hasseldam September 2015 8.1 Lexing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 8.2 Parsing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 9 Type-Checking and AST Transformation 28 9.1 Type Checking . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 9.2 Transform Multiple Argument Function Calls . . . . . . . . . . . 29 9.3 Enriching the AST . . . . . . . . . . . . . . . . . . . . . . . . . . 31 10 MF Compilation 32 10.1 Expressions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 10.2 Function Definitions . . . . . . . . . . . . . . . . . . . . . . . . . 34 10.2.1 Extracting Invoke Types . . . . . . . . . . . . . . . . . . . 34 10.2.2 Step Class Instructions . . . . . . . . . . . . . . . . . . . . 35 10.2.3 Step Class Structure . . . . . . . . . . . . . . . . . . . . . 36 10.3 Match Expression . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 11 MFIL Compilation 39 11.1 IFunc Interface . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 11.2 Built-in Types . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 11.2.1 Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 11.2.2 Create Classes . . . . . . . . . . . . . . . . . . . . . . . . 41 11.2.3 Create Methods . . . . . . . . . . . . . . . . . . . . . . . 42 11.3 Meta Information . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 11.3.1 Environment . . . . . . . . . . . . . . . . . . . . . . . . . 44 11.3.2 Global Functions . . . . . . . . . . . . . . . . . . . . . . . 45 11.3.3 Predefined Types . . . . . . . . . . . . . . . . . . . . . . . 45 11.4 Expected MFIL Output . . . . . . . . . . . . . . . . . . . . . . . 45 11.5 Generate CIL Class Structure . . . . . . . . . . . . . . . . . . . . 47 11.5.1 Define Classes . . . . . . . . . . . . . . . . . . . . . . . . 47 11.5.2 Populate Classes . . . . . . . . . . . . . . . . . . . . . . . 50 11.6 Generate CIL Instructions . . . . . . . . . . 
. . . . . . . . . . . . 51 11.6.1 Step Structure . . . . . . . . . . . . . . . . . . . . . . . . 51 11.6.2 Match Expression . . . . . . . . . . . . . . . . . . . . . . 54 11.6.3 Branch Instructions . . . . . . . . . . . . . . . . . . . . . 57 IV Discussion 62 12 Tail Recursion 62 13 .NET Interoperability 63 14 Conclusion 64 Appendices 66 A Compiler Source Code 66 A.1 Ast.fs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67 A.2 Util.fs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 A.3 TypeChecker.fs . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69 A.4 MFIL.fs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73 iv
  • 6. Joachim Vincent Hasseldam September 2015 A.5 MFCompiler.fs . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75 A.6 Env.fs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79 A.7 CILTyping.fs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81 A.8 CILCompileInstrs.fs . . . . . . . . . . . . . . . . . . . . . . . . . 82 A.9 CILInterfaceBuilder.fs . . . . . . . . . . . . . . . . . . . . . . . . 85 A.10 CILMethodBuilder.fs . . . . . . . . . . . . . . . . . . . . . . . . . 86 A.11 CILClassBuilder.fs . . . . . . . . . . . . . . . . . . . . . . . . . . 88 A.12 CILListBuilder.fs . . . . . . . . . . . . . . . . . . . . . . . . . . . 90 A.13 CILTreeBuilder.fs . . . . . . . . . . . . . . . . . . . . . . . . . . . 92 A.14 MFILCompiler.fs . . . . . . . . . . . . . . . . . . . . . . . . . . . 94 A.15 SourceCode.fs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96 A.16 Program.fs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97 A.17 Lexer Specification . . . . . . . . . . . . . . . . . . . . . . . . . . 98 A.18 Parser Specification . . . . . . . . . . . . . . . . . . . . . . . . . . 100 v
  • 7. Joachim Vincent Hasseldam September 2015 Part I Introduction Functional programming languages are structurally and semantically different from object-oriented languages. Functional languages treats functions as first- class citizens by allowing them to be passed as arguments to other functions, have a function be the return type of another function and assigning functions to local values. In that sense, functions provide the primary level of abstraction in functional languages. This is in contrast to object-oriented languages where the object is the primary level of abstraction. Functional languages also have a tendency to favour recursion, over the more it- erative loop approach primarily used in object-oriented languages, when travers- ing data structures. Because of this, data structures in functional languages are recursively defined, to adhere with the languages immutable data approach. Despite the differences it is still possible to translate a functional language to an object-oriented one. This is necessary when these different language paradigms share the same runtime environment, as it is the case with Java and Scala on the Java Virtual Machine(JVM) platform and C# and F# on the Microsoft Common Language Runtime(CLR) platform. There are several advantages for languages to share the same runtime, one being the interoperability between them. For eaxmple, if we look at the CLR platform having constructs written in F# available in another C# project, provides a great source of code re-usability. To achieve this goal, the two languages are compiled to the same intermediate language namely: Common Intermediate Language or CIL. CIL is an object-oriented language, so it seems somewhat intuitive that a program written in C#, also being object-oriented, can easily be translated into CIL. The functional F# program on the other hand, will have to undergo a paradigm shift. 
This project aims at understanding how a functional language can be translated to an object-oriented stack-based language, and still maintain its structural in- tegrity. The focus will be on how higher-order functions and recursive data types are translated to object-oriented equivalent structures, and not so much on the efficiency of the translation in terms of recursive calls. To accomplish this goal, a functional language will be designed and implemented with the intention to make it run on the CLR platform by compiling it to CIL byte-code. The language is not meant to be a fully fledged programming lan- guage, but rather contain a small subset of the features found in a typical ML-based functional language, with support for higher-order functions and re- cursive data-types. The entire design and compilation process of a programming language, spans many different areas and cover many interesting topics. This report will de- scribe the entire compilation process, but since the focus of this project is on the translation from functional constructs to object-oriented ones, certain areas such as source language design, lexical analysis and syntactical analysis, will only be covered in modest details. 1
  • 8. Joachim Vincent Hasseldam September 2015 1 Testing Environment To test the compiler, a small development environment has been created. This enables the reader to run a list of predefined programs, as well as creating and running his own programs in MF. The development environment can be accessed by downloading the MFDE.zip from the following location: https://www.dropbox.com/s/l6ml7gszid3631l/MFDE.zip?dl=0 then extract the content of the zip file and run MFDE.exe. 2 Background This section will briefly introduce a few topics, which are important for the remainder of this report. 2.1 Higher-Order Functions Languages with support for higher-order functions, allow functions to be passed as arguments to other functions, or having functions returning functions as their return values [Sestoft, 2012]. Higher-order functions are an integral part of functional programming languages and as one of its applications: provide support for multiple parameters through currying (see section: 2.1.1). A classic example of a higher-order function is the List.map function which, implemented in F#, would look something like this: let rec map f lst = match lst with | next::rest -> (f next)::map f rest | [] -> [] Here the map function takes two arguments, a function predicate f and a list lst, and applies the predicate to each element in the input list, resulting in a new list. Figure 1: List.map in F# and IEnumerable.Select in C# produces a new list with the results of applying a predicate to each element in the input list 2
  • 9. Joachim Vincent Hasseldam September 2015 2.1.1 Currying and Partial Application Several functional languages including F# handles functions with more than one parameter through a technique known as currying. This means that a multiple parameter function, is actually evaluated as a series of one parameter functions. This is best illustrated through an example. The map function just presented has the following function signature: ('a -> 'b) -> 'a list -> 'b list} This translates to: map takes as input a function from generic type ’a to generic type ’b and returns a new function which takes a list with elements of type ’a as input and returns a list with elements of type ’b. Currying provides the compiler with the responsibility of keeping track of that extra function when a call to map is made with two arguments. That is why the following call to map is perfectly valid: map (fun x -> x * x) [1..10] When in essence it is just syntactic sugar for first calling map with the square function predicate which returns a new anonymous function, and then make a call to the new function with the list argument. Of course it is possible to call a n parameter function with with any number of arguments ranging from 1..n. This fixes the given arguments in the returned function and is called partial application. In the map example, a call to the func- tion could be made with just a function for squaring numbers as its argument: let square x = x * x let squareNums = map square Now squareNums is fixed with the square function and has the function signature: int list -> int list It can be passed as an argument to other functions or called with a list of integers, yielding a new integer list as expected. 2.1.2 Closure With functions being passed to other functions, a method for keeping track of free variables must be present. 
In the above example map is not part of the local environment within the stack frame of squareNums, and therefore appears as a free variable within the scope of the function. When a call to squareNums is made, the compiler has to know how to resolve map with a bound square function variable. This problem is handled by representing a function variable as a closure [Appel and Palsberg, 2003]. A closure is a structure that provides a way to access a function outside scope, as well as the function’s local environment. 3
  • 10. Joachim Vincent Hasseldam September 2015 2.2 Recursive Data Types A type which can be recursively defined by simpler elements is called a recursive- or inductive data type. [Pierce, 2002]. Popular examples of inductive data types includes lists and trees. We can also define our own types, like the following simple abstract syntax tree for a small expression language written in F#: type Expr = | Mul of Expr * Expr | Add of Expr * Expr | Value of float This type allows for defining simple unambiguous arithmetic expressions: let trapezoidArea = Mul(Value(0.5), Mul(Value(7.4), Add(Value(2.3), Value(5.9)))) 2.3 Common Intermediate Language Common Intermediate Language, or CIL, is an object-oriented stack-based pro- gramming language designed by Microsoft. CIL is part of the Common Lan- guage Infrastructure (CLI) and is an intermediate language shared by all .NET languages. At compile time the source language is compiled to CIL instruc- tions by a language specific compiler, and then at runtime the CIL instructions are compiled to platform specific bytecode, by the Common Language Run- time (CLR) virtual machine. Having an intermediate language like CIL greatly reduces the amount of compilers needed to for n languages to target m archi- tectures, from n ∗ m to n + m since only one compiler from each language to the intermediate language are needed, and one for each target architecture from the intermediate language[Richter, 2014]. Having a common shared platform, also grants the .NET based languages the possibility to share their assemblies. This means that code written F# can be used in C# and vice versa. Figure 2: Translation of .NET source languages to platform specific byte-code 4
  • 11. Joachim Vincent Hasseldam September 2015 3 CIL Abstraction Several places throughout this report, C# will be used as a tool to conceptualize and rationalize about CIL. This reason for this is that C# is easier to read and less verbose than looking at actual CIL output and as a result provide a convenient way to illustrate examples and discuss certain concepts. Since the CIL language is also object-oriented, considering design decisions and desired output of translations and examples in C#, provides a close abstraction to how the actual CIL code would look. And for obvious reasons, everything than can be expressed in C# can also be expressed in CIL, so the step from C# design ideas to CIL is only a matter of implementation. 5
  • 12. Joachim Vincent Hasseldam September 2015 Part II Design 4 Micro-F# This section provides an introduction to Micro-F#, and its syntax, to establish some familiarity with the language before we start looking a sample code in the sections to come. The few examples shown in this section provides only a brief introduction to the language and its capabilities. Therefore the reader is encouraged to download a copy of the testing environment, to see more elabo- rate examples and to experiment with compiling and running his own programs written in Micro-F#. Information on how to obtain a copy of the test environ- ment can be found in section 1. Micro-F#, or MF, is a statically typed, eager evaluated functional programming language. It is deigned for the sole purpose of experimenting with the com- pilation process of a functional language on the .NET platform. MF, aims at providing a subset of the features found in ML-based languages and its syntax is heavily inspired by F#, so the name MF reflects the language’s syntactic sim- ilarities. The entire list of keywords and symbols supported by the language can be found in figure 3. fun end let in do = <> > < >= <= if then else not + - * / % bool int tree list ( ) [ ] -> | Cons Nil Node Leaf :: : ; , match with true false Figure 3: Keywords and symbols supported by MF 4.1 Types MF currently supports four types, the primitive types boolean and integer and the inductive types list and tree. The inductive types however are limited to values of integers. 4.2 Functions MF provides support for global, non-nested, higher-order functions. The fol- lowing is the MF equivalent1 of the F# map function from section 2.1. 1Although slightly more restricted since the language has no support for generic types, nor any other types than integers for lists. 6
  • 13.
    fun map f lst : (int -> int) -> list -> list =
        match lst with
        | Cons(x, xs) -> (f x)::map f xs
        | Nil -> []
        end
    end

The syntax for the function definition is similar to what we know from F#, but a few noteworthy differences do exist. Unlike F#, MF does not provide type inference for function definitions. The type signature of a function has to be stated explicitly in the form:

    t_1 → t_2 → t_3 → ... → t_{n+1}    (1)

where each t_i is a type supported by the language and n is the number of parameters of the function. The last type given, t_{n+1}, is the return type of the function.

MF is not whitespace sensitive like F#, and the example above has been written on several lines, and with indentation, purely for aesthetic reasons. Instead, scoping for functions and match expressions is provided by the end keyword. In this regard the language is more similar to C#, which is also not whitespace sensitive and uses symbols to differentiate one construct from the next.

4.3 Let Bindings

Let bindings to values, expressions or functions are supported in MF. Below are a few examples:

    ...
    let a = 42 in                                  // Value
    let b = if a > 2 then a else 2 in              // Expression
    let c = mul 5 in                               // Function
    let d = [3; 5; 8; 13; 21] in                   // List
    let e = Node(8, Node(5, Leaf, Leaf), Leaf) in  // Tree
    ...

As opposed to functions, the types of let bindings are omitted; instead they are inferred by the type checker at compile time. Again, since MF is not whitespace sensitive, the keyword in is used to separate the let declaration from the body. F# achieves the same through a line break².

4.4 Program Structure

A valid MF program is composed of a list of zero or more global functions, each with a body consisting of a single expression tree, followed by a single expression tree.
Since we are interested in producing an executable assembly, this final expression tree becomes the implicit entry point of the application. Since all functions must return a value, this part cannot be omitted. Figure 4 shows the separation of global functions and the entry point of a source program.

² It should be noted, however, that stating the "in" keyword explicitly is also supported, but rarely used, in the F# syntax.
  • 14. Figure 4: The structure of a MF source program

4.5 Abstract Syntax Tree

We will conclude the MF section by taking a look at the abstract syntax tree for the language and briefly examining the different constructs. The abstract syntax tree presented here is based on the syntax tree for Peter Sestoft's Expression language³ and modified to support higher-order functions and the built-in inductive data types. The full abstract syntax tree can be seen in figure 5.

TypeDef: The types currently supported by the language.

MatchExpr: Contains the matching part of a match case, with information about which names to bind to the result of the match expression. Suppose for example a match is performed on a cons element of the list type. A name to bind the value element to, as well as one for the rest of the list, is needed. In the case below, the pattern Cons(x, xs) is the MatchExpr:

    | Cons(x, xs) -> (g x)::f xs

Expr: As stated earlier, the body of a global function, as well as the implicit main function, contains a single expression tree.

TypedExpr: Used to hold the inferred type of an expression. We will get back to that when we discuss the type checker in section 9.

ConstInt: Constant of the primitive type integer.

ConstBool: Constant of the primitive type boolean.

Var: Named values.

Let: Contains the name of a let expression, the value part, and the body, which is the rest of the expression tree.

³ More on this in section 8.
  • 15. PrimOp: One of the arithmetic operators supported by the language and its two operands.

If: An if-then-else expression.

Match: A complete match expression. First the matched expression, then a list of match cases and their corresponding "action" expressions.

FunCall: Function call with a name part and an argument part.

Constructor: Constructs one of the cases of the inductive types. Contains a value and a rest expression for the list type, and a value and the two sub-trees for the tree type.

FuncDef: The global function definition contains the name of the function, a list of argument names, a type signature and the expression tree constituting the body of the function.

Program: Finally, the source program, consisting of a list of global functions and the expression tree for the implicit main function.

5 Translating Program Structure

Translating a functional language like MF to an object-oriented target language like CIL demands that we start to consider the functional concepts in an object-oriented mindset. What is the object-oriented equivalent of a function? How do we replicate the scope of a functional program in the translation? And so on. In this section we will examine how a functional source program can be translated to an object-oriented equivalent and still retain its structural integrity.

We will do this by first looking at a simpler version of the problem, enforcing several constraints on the MF language. This will help to identify where the challenges in the translation lie, and how they can be addressed to support the full language. We will start by looking at how to translate a simple expression language and then move on to approach the problem as if MF were a first-order functional language. Next we relax the first-order constraint and allow higher-order functions, although limited to only one parameter, and see how this affects the design.
Finally we shall expand on this structure to allow an arbitrary number of arguments for our functions and thus provide full support for MF.

5.1 Expression Language

With no functions, the expression language limits MF to let bindings of values or expressions. A sample program could look like this:

    let a = 11 in
    let b = 5 in
    let c = if a > b then a else b in
    c + a
  • 16.
    type TypeDef =
        | FuncType of TypeDef * TypeDef
        | IntType
        | BoolType
        | ListType
        | TreeType

    type MatchExpr =
        | NilMatch
        | ConsMatch of string * string
        | LeafMatch
        | NodeMatch of string * string * string

    type Expr =
        | TypedExpr of Expr * TypeDef
        | ConstInt of int
        | ConstBool of bool
        | Var of string
        | Let of string * Expr * Expr
        | PrimOp of string * Expr * Expr
        | If of Expr * Expr * Expr
        | Match of Expr * (MatchExpr * Expr) list
        | FunCall of Expr * Expr
        | Constructor of Constructor
    and Constructor =
        | Nil
        | Cons of Expr * Expr
        | Leaf
        | Node of Expr * Expr * Expr

    type FuncDef = FuncDef of string * string list * TypeDef * Expr

    type Program = Program of FuncDef list * Expr

Figure 5: Abstract syntax tree for the MF language
  • 17. If we were to build a C# program based on the code above, the resulting program would, as a minimum, require a class and a method to contain the instructions. The desired output for the examples shown here is an executable assembly and not a library file⁴, so for the above program we define a single class with a Main method. The body of the Main method will consist of the expression above. Each let expression can be viewed as a declaration of a local value in C#, and the scope of the expressions is conveniently preserved by declaring values from top to bottom.

    class Program
    {
        public static int Main()
        {
            int a = 11;
            int b = 5;
            int c = a > b ? a : b;
            return c + a;
        }
    }

Since the end of the expression also marks the end of the Main method, the last instruction in the expression (c + a) is also the return value of the method.

5.2 First-Order Functions

Now let us extend the language to allow functions, although limited to accepting and returning only values. This allows for somewhat more interesting programs, but also requires an expansion of the structure for the expression language from the previous section. Consider the following example:

    fun prevN x : int -> int =
        let y = x - 1 in
        if y > 0 then y else 0
    end
    let a = prevN 11 in
    a * a

The expression after the function declaration is still going to constitute the body of the Main method of the program, but a single method is no longer sufficient, since the function prevN should translate to a new method with its own method body. With functions limited to only primitive types, an additional static method in the same class is sufficient to accommodate the new function. As with the Main method, the result of the expression in the body of prevN will be the return value of the method.

⁴ Although it would not make much difference, except that an executable assembly requires a program entry point (or main method) as opposed to a library file.
  • 18.
    class Program
    {
        public static int prevN(int x)
        {
            int y = x - 1;
            return y > 0 ? y : 0;
        }

        public static int Main()
        {
            int a = prevN(11);
            return a * a;
        }
    }

Translating MF as a first-order language is a relatively trivial process. A new static method is added when a new function is added to the source code. Each method is compiled using a local environment, which enforces the static scoping rule of the language.

5.3 One Parameter Higher-Order Functions

For a function to take another function as input, or to return a function as output, a larger refactoring of the structure defined so far is required. For the time being, we limit all functions to only one parameter. We do however allow a function to be called with zero parameters, which means the function just returns itself. This enables let bindings to functions, as opposed to only values as in the previous examples. Consider the following program:

    fun f x : int -> bool = x % 4 = 0 end
    let g = f in
    if g 1324 then 1 else 0

We want to bind the name g to the function f. For this to work, we require the ability to treat functions as values, and therefore a translation to a static method in a single class is no longer enough to support the language. Instead, the static method is changed to an Invoke instance method inside a class named after the function. The Invoke method body will contain the instructions from the source language function, similar to the static methods in the first-order language from section 5.2. The method signature is also constructed as in the first-order language, with the type definition of the function matching the input and output types of the method.

    public class f
    {
        public bool Invoke(int x)
        {
            return x % 4 == 0;
  • 19.
        }
    }

    public class Program
    {
        public static int Main()
        {
            f g = new f();
            return g.Invoke(1324) ? 1 : 0;
        }
    }

A call to f now results in an instance of class f being created, and if the call contains an argument, the Invoke method of the instance is called. In this example a let binding is made from g to f, and g is now a reference to the new instance. The type of g is f, since g is now an instance of class f. When g is next called with an argument, the Invoke method of the instance is called.

5.3.1 Shared Interface

The solution described so far enables functions to return functions, but we still run into problems if we try to pass a function to another function as an argument. Suppose we have the following input program:

    fun apply f : (int -> int) -> int = f 42 end
    fun g x : int -> int = x * x end
    fun h x : int -> int = x * 2 end
    (apply g) + (apply h)

Function apply takes a single function matching the type definition int -> int and returns the result of applying the function to 42. Translating functions g and h is trivial, but apply poses a problem. The Invoke method of class apply cannot be limited to a specific parameter type, since it should be able to receive an instance of both classes g and h⁵. The example here is a little contrived, since the language is limited to only one parameter. It does illustrate the point though: functions with the same type definition should share the same instance type.

One way to solve this problem could be through inheritance. We could create an abstract superclass for each function type and then let each function class with the same type signature inherit from the superclass. To avoid creating a new abstract class for each type definition, and having to implement functionality to keep track of them and to retrieve them, a single interface will instead be created which all function classes will implement.
The interface will contain a single method definition with a generic input and output type.

⁵ Or any other classes compiled from a function with the same type definition.
  • 20.
    public interface IFunc<TInput, TOutput>
    {
        TOutput Invoke(TInput input);
    }

Now a function parameter in a higher-order function will always have the type IFunc, and likewise for all instances created of the function classes. This solves the compatibility issue from before and enables full support for higher-order functions under the one-parameter restriction.

5.4 N Parameter Higher-Order Functions

Now we are ready to lift the final restriction and support the language in its entirety. This will be done by extending the previous solution to support functions with an arbitrary number of parameters. We will begin by taking a closer look at how multiple-parameter functions are actually evaluated.

As stated in section 2.1.1, multiple-parameter functions are handled as a series of specialized one-parameter functions with knowledge of the previous parameters, through a technique known as currying. The following example illustrates this concept:

    fun f x y : int -> int -> bool = x > y end
    let g = f 5 in
    let res = g 4 in
    ...

In the previous section, a let binding to a function with no parameters resulted in the new value becoming a reference to the actual function. This time g is not just a reference to f but a new function, with a single parameter y, which returns a boolean result. Inside the function g, the name x appears as a free variable, and access to the environment of f must be available here in the form of a function closure. When g is called, the body of the original function f is evaluated and the result is assigned to the value res.

Calling a function with zero parameters still results in an instance of the function class being created and referenced by the binding.
But calling the Invoke method on the instance can now result either in the computation of the expression in the function body, if no more parameters exist in the original function, or in the creation of a new function holding a closure over the previous one. This new function will simply be modelled as a new class with its own Invoke method, similar to the process described earlier. The closure is provided by extending the function class with fields to accommodate the free variables.

This creates a nested structure of classes, where the Invoke method of the current instance creates an instance of the next function class. The list of fields in each class grows in parallel with the "level" of this nested structure, since the next level must contain all values in the closure of the previous level, as well as the value given as the parameter of the previous level. Because each new class corresponds to the next step in the nested structure,
  • 21. this concept will be referred to as the "step-class structure". Figure 6 provides a graphical representation of the step-class structure for a function with three parameters.

Figure 6: Step-class structure from compiling the function definition: fun f x y z : int -> int -> int -> int = <expr> end

A level in the step-class structure refers to how deep in the structure we are, starting from level zero with the first class. Each parameter in the original MF function results in a new class, so the "bottom" level of the structure is at n − 1, where n is the number of parameters in the original function. If we apply this concept to the example above, we end up with the following:

    public class f : IFunc<int, IFunc<int, bool>>
    {
        public IFunc<int, bool> Invoke(int x)
        {
            return new f_Step1() { x = x };
        }
    }

    public class f_Step1 : IFunc<int, bool>
    {
        public int x;

        public bool Invoke(int y)
        {
            return x > y;
        }
    }

    public class Program
    {
        public static int Main()
        {
            IFunc<int, bool> g = new f().Invoke(5);
            bool res = g.Invoke(4);
            // ... (the rest of the program, including the return value,
            // is elided here just as in the MF source above)
        }
    }
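To make the growth of the field list concrete, the following is a minimal sketch of the step-class structure for the three-parameter function from figure 6. The class name f_Step2 merely extends the naming pattern used above, and the body x * y + z is a hypothetical stand-in for <expr>; neither is taken from the compiler's actual output.

```csharp
public interface IFunc<TInput, TOutput>
{
    TOutput Invoke(TInput input);
}

// Level 0: no closure yet, so the class has no fields.
public class f : IFunc<int, IFunc<int, IFunc<int, int>>>
{
    public IFunc<int, IFunc<int, int>> Invoke(int x)
    {
        // Pass the argument on to the next level as a closure field.
        return new f_Step1 { x = x };
    }
}

// Level 1: the closure holds x.
public class f_Step1 : IFunc<int, IFunc<int, int>>
{
    public int x;

    public IFunc<int, int> Invoke(int y)
    {
        // Replicate the previous closure (x) and add the new argument (y).
        return new f_Step2 { x = x, y = y };
    }
}

// Level 2 (bottom level, n - 1 for n = 3): the closure holds x and y.
public class f_Step2 : IFunc<int, int>
{
    public int x;
    public int y;

    public int Invoke(int z)
    {
        // Hypothetical function body standing in for <expr>.
        return x * y + z;
    }
}
```

With these classes, new f().Invoke(2) yields a partially applied function that can be stored and applied later, while new f().Invoke(2).Invoke(3).Invoke(4) runs the body with all three arguments available.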
  • 22. 6 Inductive Data Types

MF supports recursive data types in the form of two built-in types: list and tree. Since the two types are built in, their construction is not determined by any input from the developer. They also have to be constructed before the source program, since constructs in the source code can depend on the types' existence. We will start by taking a closer look at how the types are represented, and then proceed to how they are modelled in C# and how we can interact with them from MF.

6.1 List

Similar to F#, lists in MF are represented as singly linked lists [Smith, 2012]. This means that every element in the list contains two pieces of information: the value of the element, and a pointer to the next element in the list. Since each element points to the next, we also need a special case representing the end of the list. An element containing a value and the empty list element are usually referred to as cons and nil respectively [Pierce, 2002]. This will also be the case here. Figure 7 provides a graphical representation of such a linked list.

Figure 7: A singly linked list

The list structure will be modelled through object-oriented decomposition, which means that a base class is created for the actual type (in this case list) and each of the cases (nil and cons) is represented as a subclass [Emir, 2006]. Determining the type of an instance in relation to the subclasses, as well as accessing members of the subclasses, all happens through the list superclass. Even constructing an instance of any of the subclasses happens through methods defined in the superclass. For the list type we only have two cases⁶, but we can imagine this technique supporting other structures with more cases equally well.

This technique encapsulates the underlying structure of the subclasses and allows us to communicate only with the superclass, regardless of the number of cases it represents.
The class structure for the list type can be seen in figure 8.

⁶ This is also true for the tree type, as we shall see later.

6.1.1 Constructor Methods

Two static methods in the list class create new instances of the nil and cons subclasses. The nil case takes no arguments, since the constructor of the nil
  • 23.
    public abstract class MFList
    {
        public MFList() {}

        public bool IsCons() { return this is Cons; }

        public bool IsNil() { return this is Nil; }

        public static MFList Nil() { return new Nil(); }

        public static MFList Cons(int value, MFList next)
        {
            return new Cons(value, next);
        }

        public int GetValue() { return ((Cons)this).value; }

        public MFList GetNext() { return ((Cons)this).next; }
    }

    public class Cons : MFList
    {
        public int value;
        public MFList next;

        public Cons(int value, MFList next)
        {
            this.value = value;
            this.next = next;
        }
    }

    public class Nil : MFList { }

Figure 8: Class structure of the MF list type
  • 24. subclass defines no parameters. The cons case takes two arguments: the value of the cons element, which is limited to an integer type, and a list representing the rest of the list, which can of course be either a nil or a cons element. Creating a new list element always prepends the element to the beginning of the list, and as a result the list is built from the bottom up.

6.1.2 Test Methods

When we traverse a list, we need a way to establish the type of a given instance, to see whether we have a cons case, so that traversal of the structure can continue, or a nil case, meaning the end of the list has been reached. A method is added for each case which tests a given instance against the type of the case and responds with a logical value.

6.1.3 Accessor Methods

Once the case type has been established, the relevant information of the case can be extracted. This is only relevant for the cons type, since no information is contained in the nil case. A method is added to access each member of the cons class. The methods perform a type cast of the given instance to the cons class and then return the member.

6.2 Tree

The second inductive data type supported by the MF language is a binary tree. This structure is very similar to the list type previously described and also contains two cases. Each element in the tree contains a value which, like the list type, is limited to integers, and a reference to the left and right sub-trees. This element will be referred to as a node, and the end of the tree is indicated with a leaf type. Similar to the nil element in the list, the leaf element does not contain any information. Figure 9 shows the structure of the binary tree.

The class structure for the tree is almost identical to that of the list. The only noteworthy difference is that the node case has three accessor methods, for the value, left sub-tree and right sub-tree, as opposed to only two in the cons case.
The tree class structure can be seen in figure 10.

6.3 Limitations

The technique for the inductive data types presented here is relatively easy to implement and offers a clean interface. When the types are used, we only have to interact with the abstract superclass and do not have to concern ourselves with the subclasses or how they are implemented. It would be easy to expand with a different data structure with more cases, since each new case just results in adding a new subclass, and the necessary methods to interact with the subclass, in the superclass. We do however have to consider its limitations.

The type cast performed when field values are retrieved from the subclasses is
  • 25. Figure 9: Binary tree structure

not very type safe, because the correct type is just assumed. When the inductive types are used by the MF compiler this should not pose a problem, since logic in the compiler provides a check, using the test methods, to ensure that type casts are only performed on the appropriate types. But if the assembly and the types are used by another .NET-based language, the guarantee of type safety is no longer present. Consider the following simple example in C#:

    var tree = MFTree.Leaf();
    var leftSubTree = tree.GetLeft(); // Runtime error

The program compiles but will result in a runtime error as soon as we try to access a field, for example the left sub-tree, of a Leaf. As we saw earlier, the Leaf class does not contain any fields, but because the methods for extracting the values of the Node class fields are defined in the abstract superclass, they are available in all subclasses.

However, it is worth noting that this limitation is not confined to the MF language. F# handles inductive data types very similarly to the technique applied here. A notable difference is that F# types do not provide methods in the abstract class for retrieving the values of a subclass' fields; instead the type casting is left to the developer [FSSPEC, 2012]. But the result is the same: we still end up producing type-unsafe code which could result in runtime errors.
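One possible mitigation, which is not part of the design presented in this report, is to offer a checked accessor next to the unchecked one. The sketch below uses a trimmed-down MFTree with only the left sub-tree accessor; the method TryGetLeft is a hypothetical addition that follows the common .NET "Try" pattern.

```csharp
public abstract class MFTree
{
    public static MFTree Leaf() { return new Leaf(); }

    public static MFTree Node(int value, MFTree left, MFTree right)
    {
        return new Node(value, left, right);
    }

    // Unchecked accessor as in the design above:
    // the cast throws an InvalidCastException when called on a Leaf.
    public MFTree GetLeft() { return ((Node)this).left; }

    // Hypothetical checked accessor: reports failure instead of throwing.
    public bool TryGetLeft(out MFTree left)
    {
        var node = this as Node;            // null when this is a Leaf
        left = node != null ? node.left : null;
        return node != null;
    }
}

public class Leaf : MFTree { }

public class Node : MFTree
{
    public int value;
    public MFTree left;
    public MFTree right;

    public Node(int value, MFTree left, MFTree right)
    {
        this.value = value;
        this.left = left;
        this.right = right;
    }
}
```

A caller from another .NET language could then write if (tree.TryGetLeft(out var left)) { ... } and never trigger the cast failure. The price is one extra method per field, so the superclass grows faster as cases are added.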
  • 26.
    public abstract class MFTree
    {
        public MFTree() {}

        public bool IsNode() { return this is Node; }

        public bool IsLeaf() { return this is Leaf; }

        public static MFTree Leaf() { return new Leaf(); }

        public static MFTree Node(int value, MFTree left, MFTree right)
        {
            return new Node(value, left, right);
        }

        public int GetValue() { return ((Node)this).value; }

        public MFTree GetLeft() { return ((Node)this).left; }

        public MFTree GetRight() { return ((Node)this).right; }
    }

    public class Leaf : MFTree { }

    public class Node : MFTree
    {
        public int value;
        public MFTree left;
        public MFTree right;

        public Node(int value, MFTree left, MFTree right)
        {
            this.value = value;
            this.left = left;
            this.right = right;
        }
    }

Figure 10: Class structure of the MF tree type
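As a sketch of how the test and accessor methods are meant to work together, the following shows one way a match over a tree could look once expressed in C#: the test method guards every accessor call, so the cast is only ever performed on a Node. The helper TreeOps.Sum and the MF function quoted in its comment are hypothetical illustrations, not output of the MF compiler; the tree classes repeat figure 10 in condensed form so that the example is self-contained.

```csharp
public abstract class MFTree
{
    public bool IsNode() { return this is Node; }
    public bool IsLeaf() { return this is Leaf; }

    public static MFTree Leaf() { return new Leaf(); }

    public static MFTree Node(int value, MFTree left, MFTree right)
    {
        return new Node(value, left, right);
    }

    public int GetValue() { return ((Node)this).value; }
    public MFTree GetLeft() { return ((Node)this).left; }
    public MFTree GetRight() { return ((Node)this).right; }
}

public class Leaf : MFTree { }

public class Node : MFTree
{
    public int value;
    public MFTree left;
    public MFTree right;

    public Node(int value, MFTree left, MFTree right)
    {
        this.value = value;
        this.left = left;
        this.right = right;
    }
}

public static class TreeOps
{
    // Hypothetical MF source this method mirrors:
    //   fun sum t : tree -> int =
    //     match t with
    //     | Node(v, l, r) -> v + (sum l) + (sum r)
    //     | Leaf -> 0
    //     end
    //   end
    public static int Sum(MFTree t)
    {
        if (t.IsNode())
        {
            // Accessors are only reached after the test method succeeded.
            int v = t.GetValue();
            MFTree l = t.GetLeft();
            MFTree r = t.GetRight();
            return v + Sum(l) + Sum(r);
        }
        // With only two cases, anything that is not a Node is a Leaf.
        return 0;
    }
}
```

Applied to the tree from section 4.3, Node(8, Node(5, Leaf, Leaf), Leaf), the method visits both nodes and never calls an accessor on a leaf.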
  • 27. 7 Intermediate Language

When targeting CIL through the Reflection API we work through several levels of abstraction, ranging from higher levels like interfaces, classes and methods, to fields and method parameters, and all the way down to low-level stack manipulation. This process is further complicated by the paradigm switch from the functional source language to the object-oriented target language CIL.

To simplify the compilation process we break it into two parts by introducing a new intermediate language between MF and CIL. This language is called Micro-F# Intermediate Language, or MFIL for short. MFIL is object-oriented like CIL, and its main purpose is to transform MF to a state closer to CIL, but postpone the production of the actual constructions and stack manipulation until the paradigm shift has been completed for the whole source program.

MFIL bridges the functional aspects of the source language and the object-oriented ones of the target language CIL. This is where functional constructs like functions and let bindings are translated to classes, methods, fields and local variables. It provides a new abstract syntax tree which acts as an abstraction on top of the actual CIL instructions.

7.1 Bridging Paradigms

Working in a polyglot environment with a functional source language, an object-oriented target language and another functional language as the meta language, it is essential to have a structure that supports some separation of the different elements. The most important responsibility of the intermediate language is therefore to create a smoother transition from the semantics of the source language to those of the destination language.

The intermediate language is not meant to be a wrapper for CIL generation in the sense that the MFIL instructions correspond to exact CIL instructions. Rather, it serves as a translation between a functional abstract syntax tree and an object-oriented one.
When, for example, a function in the source language is compiled to CIL, many individual steps, in addition to just creating the list of function classes, are performed on the stack machine. This includes instructions for parameters, types, interfaces, fields and so on. The responsibility of the MF compiler is therefore to create a new abstract syntax tree, with the object-oriented equivalent of the instructions in the source language, to feed to the next step in the compilation process.

An example could be the following function definition in the MF source language:

    fun name p1 p2 .. pn : a' -> b' ... =
        .
        .
    end

Here p1..pn are the parameters defined by the function, with n being the total number of parameters. The result of the compilation to MFIL is a list of FuncClass types of size n. The FuncClass type is defined as:
  • 28.
    FuncClass of string * string * (ILType * ILType) * Field list * Instruction list

This instruction transfers information to the MFIL compiler that a class has to be created with an implicit Invoke method. The class has a name, a parameter name for the Invoke method, an input and output type, a list of fields and a list of instructions to populate the body of the Invoke method. This helps communicate effectively to the final stage of compilation what has to be done, without having to worry at this point about exactly how it is done.

7.2 Encapsulating Side Effects

The compiler will be written in F#, which acts as the meta language, since this is the language in which code examples will be presented and in which we reason about the choices and implementation details [Sestoft, 2012]. The Microsoft Reflection API, used to generate the CIL instructions, is object-oriented, and although F# works fine with object-oriented constructs, it does suffer some inconveniences, from a purely functional perspective, in terms of mutable state and side effects. The intermediate language therefore also serves as a separation in the compiler between meta language code which is purely functional and code which uses object-oriented constructs and produces side effects.

The motivation for splitting the compilation process into two steps is also to get as much as possible of the translation done in a purely functional environment, to postpone the interaction with object-oriented elements in the Reflection API for as long as possible, and to keep those elements encapsulated from the rest of the code. It also aids the development and testing of the compiler to know that the translation to MFIL marks the point in the compilation process where input should be viewed in an object-oriented mindset rather than a functional one.
7.3 Target Language Agnostic

As we saw in section 2.3, having an intermediate language provides an option for reusability. By targeting CIL, MF takes advantage of the architecture-specific compilers in the CLR. The MFIL compiler, however, does not target a CPU architecture but another intermediate language. Even so, we can imagine the same reusability for the new intermediate language. Suppose we would like to compile MF to another platform, for example the Java Virtual Machine. All work done by the MF compiler could potentially be reused, and only a new MFIL compiler would have to be constructed for the new target platform.

As stated earlier, the intermediate language will not contain specific CIL instructions. Instead it is a representation of instructions meant to be executed in an object-oriented stack-based language, and because the instructions in MFIL are kept at a high enough level of abstraction, the design should also support the possibility of replacing the MFIL compiler with another compiler targeting a different language that supports the same constructs.

So although targeting a different intermediate language is not a primary motivation for creating MFIL, care will still be taken to ensure the instructions in MFIL are as generic as possible, to support potential future reusability.
  • 29. 7.4 Abstract Syntax Tree

With the motivational choices out of the way, we can start examining the structure of MFIL by taking a look at the abstract syntax tree for the language and taking a brief tour through its various constructions. As can be seen in figure 11, the abstract syntax tree contains an assortment of higher- and lower-level instructions.

ILType: Contains the four types supported by the language and is recursively defined to support type definitions for higher-order functions through the IFunc interface described in section 5.3.1.

Variant: Recall from section 6 that the built-in inductive types tree and list each contain two different sub-classes: leaf and node for the tree type, and nil and cons for the list type. When an instruction is given either to create a new instance of one of the sub-classes, or to check whether an existing instance matches the type of a specific sub-class, the variant type is used to identify the sub-class in question.

VariantValue: Used to identify a specific value to be extracted from one of the sub-classes of the inductive types.

Instruction: Some of the operations found in the Instruction type are rather low level and match exactly the target operation in CIL. This is true for pushing an integer or boolean value to the stack, the equality, comparison and arithmetic operations, as well as loading and storing values. Others carry a bit more information and require several steps to be performed in the target language.

The Branch instruction controls the flow of the application and is used for if-then-else expressions as well as match expressions.

CallInvoke instructs a call to be made to the Invoke method of an instance of one of the step-classes⁷, and the instruction list evaluates to the argument of the method call.

CreateObject creates an instance of a given class.

StoreField stores the value currently on top of the stack to a specific field belonging to a specific class.
This instruction is used when closure values get replicated in the fields of the next level in the step-class structure. The last three instructions perform operations on one of the variants in either of the two inductive types.

Field
The name and the type of a class field are stored in this type.

Class
The final and most high-level construct in the MFIL language is the Class. An input program in MFIL contains a list of zero or more FunClass elements and a single EntryClass element containing the instructions for the implicit main method discussed earlier. Since all classes in the translation described in section 5.4 contain only a single method, there is no reason to state the method explicitly. Instead, the information contained in the FunClass type is the name of the class, the name of the invoke method parameter, an input and output type for the invoke method, a list of fields, and finally the MFIL instructions that make up the body of the method.

[7] See section 5.4.

    type ILType =
        | ILInt
        | ILBool
        | ILList
        | ILTree
        | ILFunc of ILType * ILType

    type Variant =
        | ILNil
        | ILCons
        | ILLeaf
        | ILNode

    type VariantValue =
        | ListValue
        | TreeValue
        | Next
        | Left
        | Right

    type Instruction =
        | PushInt of int
        | PushBool of bool
        | Add
        | Mul
        | Mod
        | Div
        | Sub
        | Gt  // >
        | Lt  // <
        | Eq  // =
        | Neq // <>
        | Ge  // >=
        | Le  // <=
        | Print of ILType
        | PrintAscii
        | Load of string
        | StoreLocal of string * ILType
        | Branch of Instruction list * Instruction list
        | CallInvoke of string * Instruction list
        | CreateObject of string
        | StoreField of string * string
        | CallConstructVariant of Variant
        | CallCheckVariant of Variant
        | ExtractVariantVal of VariantValue

    type Field =
        | Field of string * ILType

    type Class =
        | FunClass of string * string * (ILType * ILType) * Field list * Instruction list
        | EntryClass of string * Instruction list

Figure 11: The abstract syntax tree of MFIL
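As an illustration of how these types fit together, the sketch below builds a small MFIL program as F# values. The type definitions repeat Figure 11; everything else is an assumption made for the example: the class names "square" and "Main", the helper instructionCount, the representation of a program as a Class list, and the choice to write instruction lists in execution order.

```fsharp
type ILType =
    | ILInt
    | ILBool
    | ILList
    | ILTree
    | ILFunc of ILType * ILType

type Variant = ILNil | ILCons | ILLeaf | ILNode

type VariantValue = ListValue | TreeValue | Next | Left | Right

type Instruction =
    | PushInt of int
    | PushBool of bool
    | Add | Mul | Mod | Div | Sub
    | Gt | Lt | Eq | Neq | Ge | Le
    | Print of ILType
    | PrintAscii
    | Load of string
    | StoreLocal of string * ILType
    | Branch of Instruction list * Instruction list
    | CallInvoke of string * Instruction list
    | CreateObject of string
    | StoreField of string * string
    | CallConstructVariant of Variant
    | CallCheckVariant of Variant
    | ExtractVariantVal of VariantValue

type Field = Field of string * ILType

type Class =
    | FunClass of string * string * (ILType * ILType) * Field list * Instruction list
    | EntryClass of string * Instruction list

// A one-parameter function class: no fields are needed, the body squares x.
// The class name "square" is hypothetical.
let squareClass =
    FunClass("square", "x", (ILInt, ILInt), [], [ Load "x"; Load "x"; Mul ])

// Entry class for: let a = 42 in let b = 3 in a * b
// (instructions written in execution order, which is an assumption here).
let mainInstrs =
    [ PushInt 42
      StoreLocal("a", ILInt)
      PushInt 3
      StoreLocal("b", ILInt)
      Load "a"
      Load "b"
      Mul ]

let entry = EntryClass("Main", mainInstrs)

// Assumed program shape: zero or more FunClass values plus one EntryClass.
let program = [ squareClass; entry ]

// Hypothetical helper: number of instructions in a class body.
let instructionCount cls =
    match cls with
    | FunClass(_, _, _, _, instrs) -> List.length instrs
    | EntryClass(_, instrs) -> List.length instrs
```

Because each class holds exactly one method, the instruction list alone describes the method body, and no separate method construct is needed.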
Part III
Implementation

In part II we outlined a strategy for translating the structure of the MF language and how to approach some of the challenges of the process. Now we will examine how the discussed concepts are actually implemented, and illustrate through examples the compilation process from MF source code to CIL byte-code.

The implementation part introduces topics in chronological order with respect to the compilation process. We start in section 8 by describing the lexical and syntactic analysis of MF. This topic is very broad and will only be covered briefly here, since it falls outside the main focus of this project and because the resulting lexer and parser files are mostly auto-generated using Lex/Yacc.

After the MF source code has been parsed, section 9 discusses how type checking is performed on the resulting abstract syntax tree. Not only does the type checker ensure a type-safe input program before the compilation starts, it also provides type inference for let expressions by enriching the abstract syntax tree of the language.

With a type-safe program, we proceed to the first actual compilation step in section 10. The MF compiler takes an MF syntax tree and transforms it to the object-oriented intermediate language MFIL, designed for this project.

Finally, in section 11 we examine how the MFIL compiler receives constructs written in MFIL and emits the actual stack machine instructions, thereby concluding the transition from MF source code to CIL byte-code.

The examples presented in the following sections were selected because they illustrate an important concept in the process or because they posed particular challenges during the implementation. Figure 12 provides a roadmap of the different steps of the compilation process and of the source files in which each step primarily takes place.
8 Lexical and Syntax Analysis

Before 1975, writing a lexer and a parser program was a large and time-consuming part of building a compiler. This changed with the introduction of Lex/Yacc, and the process can now be mostly automated [Niemann, 2015]. Because the focus in this project is on compilation, the decision has been made to use Lex/Yacc to generate the lexer and parser programs for the language. The lexer and grammar specifications used here are inspired by Professor Peter Sestoft's definition for the Expression language and extended to accommodate the additional constructs of the MF language [8].

[8] Peter's original specifications can be found here: http://www.itu.dk/people/sestoft/plc/

Figure 12: The overall architecture of the compilation process

8.1 Lexing

FsLex is an F# version of Lex that generates code for translating a Unicode input string into a series of tokens [FsLex, FsYacc, 2015]. FsLex takes as input a file describing which keywords and symbols the language will support. Based on these rules the program produces an F# file with source code to perform the actual tokenization. The entire list of keywords and symbols supported by MF can be found in section 4, figure 3. Any string the lexer encounters that is not specified as either a symbol or a keyword is treated as a name, or as a number (in the case of integers). The complete lexer specification file can be found in appendix A.17.

8.2 Parsing

The parser creates syntactic relationships between the tokens received from the lexer. FsYacc, the F# implementation of the Yacc parser generator, takes as input a context-free grammar written in a variant of Backus-Naur Form and produces the F# parser file [Niemann, 2015]. The resulting parser transforms the source code to an abstract syntax tree representing the structure of the program. The type definition for the MF abstract syntax tree can be found in the file Ast.fs in the project source code.

FsYacc generates an LR-type parser and therefore allows the grammar to be left recursive. Associativity is handled by declaring the supported tokens with %left and %right for left and right associativity respectively. Precedence is handled by the order of these declarations: tokens declared further down have higher precedence.

An example of enforcing associativity can be found by analysing the ARROW ("->") token in the grammar. This token is used in the type specification for functions in the following manner:

    a’ -> b’ -> c’

where a’, b’ and c’ are types supported by the language. Suppose the parser encounters the type definition above. A choice must then be made whether to parse the expression as (a’ -> b’) -> c’ or as a’ -> (b’ -> c’). The latter is right associative and is the correct choice, since a function call consumes the first type and returns a new function with the remaining type definition, as we saw earlier. The ARROW token is therefore declared with %right at the beginning of the grammar to ensure this right associativity. If the function was instead intended to take as input another function of type a’ -> b’ and return the type c’, this would have to be stated explicitly using brackets.

Figure 13: Right versus left associativity parse trees for function type specification

The top-level production of the grammar, Program, defines a source program as a list of FuncDef followed by an expression (Expr). The list of FunDef contains the definitions of the global functions stated at the beginning of the source code, and the expression corresponds to the explicit entry point (or main method) of the program, following the language design stated in section 4. The list of function definitions may be empty, since a program without any functions besides the main function is still considered valid. The expression, on the other hand, always has to be present, since the program must return an integer value.

9 Type-Checking and AST Transformation

MF is designed, like F#, to be a statically typed language.
Since types can be verified at compile time, as opposed to runtime, and because the target language CIL is strongly typed, a logical next step is to perform type checking on the abstract syntax tree provided by the parser. In addition to the actual type checking, the type checker also provides type inference for let expressions, enriching the abstract syntax tree with the inferred types, and it supports currying by transforming multiple-argument function calls into a form of partial application through the introduction of anonymous let bindings.
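The equivalence on which this currying transformation relies can be observed directly in F#, the language of the compiler. The sketch below is only an illustration: the function f and the binding name randomName1 are chosen for the example, and the code is not part of the MF type checker.

```fsharp
// A curried two-argument function.
let f x y = x * y

// Calling f with both arguments at once ...
let direct = f 5 7

// ... behaves like binding the partially applied function to a name
// and then applying that name to the remaining argument.
let viaLet =
    let randomName1 = f 5
    randomName1 7
```

Both bindings evaluate to the same value, which is why a multiple-argument call can be rewritten into nested let bindings without changing the meaning of the program.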
9.1 Type Checking

The MF language supports four types: the primitive types integer and boolean, and the inductive types list and tree. In essence, the expressions are checked recursively until one of the four types is reached, in which case that type is returned. Lookup of let bindings and function names is handled by maintaining a symbol table of the form

    name -> typeDef

where name is a string representing the bound name and typeDef is the type for that name. This is also referred to as an environment [Appel and Palsberg, 2003].

The first step is to update the environment with the type definitions for the global functions in the program. The expression tree (function body) of each function is then checked, after first updating the environment with each parameter of the function and its corresponding type. This means that before examining the expression of the following function:

    fun myFun f x : (int -> bool) -> int -> int =
        <expr>
    end

the environment would contain the following entries:

    myFun -> FuncType (FuncType (IntType,BoolType),FuncType (IntType,IntType))
    f -> FuncType(IntType, BoolType)
    x -> IntType

The type checker then proceeds to check the function body with the updated environment. Since the scope of the function parameters is limited to the body of the function, each function is checked with its own environment. The only entries in the environment shared between functions are the global function names and their corresponding types.

A type check fails when the actual type does not match the expected type. In the case of a match expression on a list, the following must hold:

    if typeof(ve) = list and typeof(ne) = typeof(ce)
    return typeof(ve)

where ve is the match expression type, ne is the nil expression type and ce is the cons expression type.
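The rule can be written down as a small F# function. This is a hedged sketch that mirrors the rule exactly as stated above and is not the thesis code: the case names ListType and TreeType, the function name checkListMatch, and the use of Result for reporting failures are all assumptions made for the example.

```fsharp
// Assumed shape of the MF type definition; ListType and TreeType are
// hypothetical names for the two inductive types.
type TypeDef =
    | IntType
    | BoolType
    | ListType
    | TreeType
    | FuncType of TypeDef * TypeDef

// ve, ne and ce are the already computed types of the matched expression,
// the nil branch and the cons branch.
let checkListMatch ve ne ce =
    if ve <> ListType then
        Error "the matched expression must be a list"
    elif ne <> ce then
        Error "the nil and cons branches must have the same type"
    else
        Ok ve // the rule as stated returns typeof(ve)
```

Splitting the condition into two checks only serves to give each failure its own message; the accepted inputs are the same as in the single combined condition.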
The match expression obviously has to be a list, but the cons and nil expressions do not necessarily have to be lists, since the result of a match on a list is not required to yield a new list. It is however required that the two expressions share the same type, since the base case must return the same type as the inductive step. If the type check holds, ve (in this case the list type) is set as the type for the match expression. The complete list of type rules for the expressions is given in table 1.

    Expression                  Type rule
    List Match(ve, ne, ce)      if typeof(ve) = list and typeof(ne) = typeof(ce) return typeof(ve)
    Tree Match(ve, le, ne)      if typeof(ve) = tree and typeof(le) = typeof(ne) return typeof(ve)
    Let(ve, be)                 return typeof(be)
    If-then-else(ie, te, ee)    if (typeof(ie) = bool or typeof(ie) = int) and typeof(te) = typeof(ee) return typeof(te)
    Push int                    int
    Push bool                   bool
    Int Op(le, re)              if typeof(le) = int and typeof(re) = int return typeof(le)
    Bin Op(le, re)              if (typeof(le) = int or typeof(le) = bool) and typeof(le) = typeof(re) return typeof(le)

Table 1: Type rules for MF

9.2 Transform Multiple Argument Function Calls

As established earlier, all functions accept only one argument and return a new function with the remaining type definition. Without a way to handle several arguments, we would be forced to write source code with calls to multiple-parameter functions in the rather cumbersome way of partial application. Being able to call a function f with all of its arguments directly makes the language more concise and more enjoyable to work with. From a semantic point of view, a call to a function f with n arguments should behave equivalently to performing n individual function calls, storing the intermediate functions in let bindings. One solution is therefore to transform the syntax tree to that form. Parsing an n-argument function call results in the following parse tree:

    FunCall...(FunCall(FunCall(Var name, Expr1), Expr2)...Exprn)

The following function "flattens" the nested function call structure into a series of let expressions and function calls:

    let rec flattenCall innerCall =
        match innerCall with
        | FunCall((Var _), _) -> innerCall
        | FunCall(innerCall, argExpr) ->
            let inner = flattenCall innerCall
            let outerName = createRandomName "fun"
            let outerCall = FunCall((Var outerName), argExpr)
            Let(outerName, inner, outerCall)

To illustrate how flattenCall works, let us imagine a function f taking two arguments x and y, and an input program containing a call to f with the two arguments five and seven. The input parse tree would look like this:

    FunCall(FunCall(Var "f", ConstInt 5), ConstInt 7)

After this structure has been through the flattenCall function, we end up with the following:

    Let("randomName1", FunCall(Var "f", ConstInt 5), FunCall(Var "randomName1", ConstInt 7))

This is the same as if the original input program had read:

    let randomName1 = f 5
    in randomName1 7

From this example we see that each argument beyond the first results in a new let expression, which "wraps" the previous call in its value expression and, in its body, applies the newly bound name to the next argument. Since the name of the anonymous binding is of no importance to the developer, it is randomly generated by the function.

This transformation benefits the next step in the compilation process by adding support for currying, so that multiple-parameter function calls are handled identically to a series of single-parameter function calls.

9.3 Enriching the AST

In addition to performing the actual checks, the type checker returns an exact copy of the abstract syntax tree with type-enriched expressions. To contain the enriched expressions, a "wrapper" case is added to the Expr type in the Ast.fs file:

    TypedExpr of Expr * TypeDef

The TypedExpr tuple contains the original expression as well as its inferred type. With each recursive check call, the typed expressions bubble up to the previous level, maintaining the original structure and resulting in the original parse tree enriched with types.
The check function for a let expression is listed below to illustrate this concept:

    and checkLet env name valExpr restExpr =
        let (valType, valTypedExpr) = checkExpr env valExpr
        let updatedEnv = Map.add name valType env
        let (restType, restTypedExpr) = checkExpr updatedEnv restExpr
        (restType, TypedExpr(Let(name, valTypedExpr, restTypedExpr), restType))

First the value part of the let expression is checked recursively, which yields its type and its original syntax tree wrapped in typed expressions. The current environment is updated with the name of the let binding and its value type. The body of the expression is then checked with the updated environment, making the name a bound value in the scope of the rest of the expression. The function proceeds to return the original structure of the let syntax tree, with the value and body expressions swapped for their respective typed versions, wrapped in a typed expression. A graphical representation of the let expression abstract syntax tree before and after type checking can be found in figure 14.

The process described above enables type inference for let expressions in a relatively simple manner. The abstract syntax tree is traversed only once to perform both the type checking and the type inference. Omitting types on let expressions results in a more compact and concise syntax for the source language. This could be improved even further by implementing type inference for function definitions as well, although that would require a different approach.

Figure 14: A conceptual representation of the abstract syntax tree for let expressions before and after type enrichment

10 MF Compilation

At this point the source program is considered type safe, the abstract syntax tree has been enriched with the inferred types, and the first compilation step can take place. The MF compiler compiles an MF source program to MFIL. MFIL supports a set of constructs matching those found in object-oriented stack-based languages and is semantically closer to the intended target language CIL than MF is. Roughly speaking, the MF compiler provides the transition from a functional source program to an object-oriented one.

Because the MF compiler deals with a mix of the functional and object-oriented paradigms, a distinction is made between "function" and "method" in this discussion. From now on, a function refers to a function defined in the source language MF or a function in F#, the language of the compiler. A method is an invoke method in the step-classes defined in the intermediate language MFIL.
10.1 Expressions

In any given program, the body of each global function definition, as well as the implicit main function, consists of a single expression tree. The MF compiler translates the expression tree to a list of instructions in MFIL. Consider for example a let expression, which is defined in the MF abstract syntax tree as:

    Let(name, valExpr, restExpr)

The two expression trees valExpr and restExpr relate, respectively, to the binding of the let expression and to the rest of the expression tree, where name is a bound value.

    let a = 5 + 7 + 2 * 10        <- valExpr
    in let b = if a < 42 ...      <- restExpr

The type checker has wrapped valExpr in a typed expression (TypedExpr) containing the original expression as well as its type. The type, along with the value and rest expressions, is then sent to the compileLetExpr function:

    and compileLetExpr name typeDef valExpr restExpr =
        let valInstr = compileExpr valExpr
        let storeInstr = StoreLocal(name, (getILType typeDef))
        let restInstr = compileExpr restExpr
        restInstr @ (storeInstr :: valInstr)

First valExpr is translated to a list of instructions. Later, in the MFIL compilation step, the values produced by these instructions end up on the stack, and the store instruction binds the result to the name of the let expression. Next restExpr is translated to its instruction list, and the combined list of instructions becomes the result of translating the let expression. Note that the combined list is assembled in reverse: the instructions to be executed last (restInstr) appear first.

Suppose we were given the following simple program written in MF:

    let a = 42 in
    let b = 3 in
    a * b

The parsed expression tree would contain a nested structure of the two let expressions:

    Let("a", 42, Let("b", 3, Mul(a, b)))

The expression tree has been simplified here to make the example easier to read, and the constructs do not correspond to the exact output of the parser, but the end result is the same.
After the translation, we end up with a set of instructions in MFIL that look very similar to the actual CIL end result:

    PushInt 42
    StoreLocal ("a", ILInt)
    PushInt 3
    StoreLocal ("b", ILInt)
    Load "a"
    Load "b"
    Mul

10.2 Function Definitions

As stated in section 5.4, a function in MF is translated to a structure of step-classes and invoke methods. Let us consider the simplest example: a function with only one parameter. This translates to a single class with a single invoke method. The return value of the invoke method is the result of executing the expression in the function body. This scenario is considered the base case and does not require any special measures to handle closure, because calling the function produces an immediate result. Any function with more than one parameter, on the other hand, does have to handle closure.

10.2.1 Extracting Invoke Types

Since each parameter in the source function corresponds to one class and invoke method, an approach must be established for analyzing each parameter in the function as its own method with an input and an output type. The resulting parameter name, input type and output type are collected in an InvokeType with the following definition:

    type InvokeType = InvokeType of string * TypeDef * TypeDef

The invoke types of each function are extracted by going through the list of parameters and matching them one by one with the function's type definition, TypeDef. The process is straightforward, since the nested format of TypeDef is already arranged as input/output: when the data structure is traversed, the left side is the input and the right side (the rest of the type definition) is the output.
    let rec getInvokeTypes parameters funTypeDef =
        match parameters, funTypeDef with
        | ([], _) -> []
        | (p::ps, FuncType(left, right)) ->
            InvokeType(p, left, right) :: getInvokeTypes ps right
        | _ -> failwith "Illegal function type"

The function getInvokeTypes builds the invoke type list by going through the parameters and the nested type definition of the function in parallel. For each parameter, the input type is the left side of the current type definition and the output type is the remaining type definition on the right side.
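To see the result for a concrete case, the sketch below applies getInvokeTypes to the function myFun from section 9.1. The function itself is repeated from the listing above so that the example is self-contained; the case names ListType and TreeType in TypeDef are assumptions, since only IntType, BoolType and FuncType appear in the text.

```fsharp
// Assumed shape of the MF type definition.
type TypeDef =
    | IntType
    | BoolType
    | ListType // hypothetical name
    | TreeType // hypothetical name
    | FuncType of TypeDef * TypeDef

type InvokeType = InvokeType of string * TypeDef * TypeDef

let rec getInvokeTypes parameters funTypeDef =
    match parameters, funTypeDef with
    | ([], _) -> []
    | (p :: ps, FuncType(left, right)) ->
        InvokeType(p, left, right) :: getInvokeTypes ps right
    | _ -> failwith "Illegal function type"

// fun myFun f x : (int -> bool) -> int -> int
let myFunType =
    FuncType(FuncType(IntType, BoolType), FuncType(IntType, IntType))

let invokeTypes = getInvokeTypes [ "f"; "x" ] myFunType

// One invoke type per parameter:
//   f : input (int -> bool), output (int -> int)
//   x : input int,           output int
let expected =
    [ InvokeType("f", FuncType(IntType, BoolType), FuncType(IntType, IntType))
      InvokeType("x", IntType, IntType) ]
```

The output type of the first step is itself a function type, which reflects that the first invoke method returns the next step-class instance rather than a final value.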
10.2.2 Step Class Instructions

Recall that the execution of the expression tree in the function body does not happen until we reach the "bottom" of the step-class structure, which in practice is when the last argument has been given. Until that point, the compilation of each step (represented by each parameter) is carried out in the same manner, regardless of the number of steps. Given an arbitrary step in the process, the required actions are as follows:

    1. The invoke method in step-class n creates an instance of step-class n + 1.
    2. The argument of the step-class n invoke method is stored in a field of the same name in the new step-class n + 1 instance.
    3. Any fields in step-class n are replicated in the step-class n + 1 instance, preserving the names.

The instance of step n + 1 is then returned by the invoke method in step n. See figure 6 in section 5.4 for a graphical representation of this concept.

The function createStepClassInstrs generates the required instructions for the invoke method in each step class.

    let createStepClassInstrs invokeType nextLvlName nextLvlFields =
        let (InvokeType(paramName, inType, outType)) = invokeType
        let localIlType = getILType (FuncType(inType, outType))
        let newObj = CreateObject(nextLvlName)
        let tempName = createRandomName "local"
        let storeObjRef = StoreLocal(tempName, localIlType)
        let loadObjRef = Load(tempName)
        let copyFields = dupFieldInstrs nextLvlName nextLvlFields loadObjRef
        newObj::storeObjRef::copyFields@[loadObjRef]

The function takes as input the invoke type for the current step, the name of the next-level class, and a list of the required fields for the next step level. The invoke type, extracted through the process described in section 10.2.1, contains the parameter name and the required input and output types for the current step.
First the input and output types are translated to their IL-type equivalent. This is a one-to-one mapping of the MF types and acts more as a conceptual way to illustrate the transition from MF to MFIL. Next an instruction is made to create an instance of the next-level step-class. An instruction is added to store the newly created instance locally, with its IL type, bound to a randomly generated name prefixed with "local" [9]. Next the instructions for duplicating the fields to the next level are constructed. This process is addressed in more detail below. Finally the class instance is loaded again. That way the reference is on top of the stack when the method returns, and the instance pointer is returned to the caller.

[9] Prefixing randomly generated names with an identifier referring to their use made examining the syntax trees a lot easier during the development and debugging of the compiler.
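The instruction sequence described above can be traced for a concrete case. The sketch below is a simplified stand-in for createStepClassInstrs and not the thesis code: it uses a reduced Instruction type, takes the local name and local type as explicit arguments instead of calling createRandomName and getILType, and includes a compact version of the field duplication that the following subsection covers. The function name stepInstrs and the names "add_step1" and "local1" are hypothetical.

```fsharp
type ILType =
    | ILInt
    | ILBool
    | ILList
    | ILTree
    | ILFunc of ILType * ILType

// Reduced instruction set: only what a step method needs.
type Instruction =
    | Load of string
    | StoreLocal of string * ILType
    | CreateObject of string
    | StoreField of string * string

type Field = Field of string * ILType

// For every field: load the new instance, load the value, store the field.
let rec dupFieldInstrs className fields loadObjRef =
    match fields with
    | Field(fieldName, _) :: rest ->
        [ yield loadObjRef
          yield Load fieldName
          yield StoreField(className, fieldName)
          yield! dupFieldInstrs className rest loadObjRef ]
    | [] -> []

// Simplified stand-in for createStepClassInstrs.
let stepInstrs nextLvlName nextLvlFields localType tempName =
    let loadObjRef = Load tempName
    [ yield CreateObject nextLvlName
      yield StoreLocal(tempName, localType)
      yield! dupFieldInstrs nextLvlName nextLvlFields loadObjRef
      yield loadObjRef ]

// Top step of a function add x y : int -> int -> int.
// The next level needs one field, x, holding the first argument.
let addStep0 =
    stepInstrs "add_step1" [ Field("x", ILInt) ] (ILFunc(ILInt, ILFunc(ILInt, ILInt))) "local1"
```

For this input the sketch yields six instructions: create the object, store it in the local, three instructions to copy x, and a final load that leaves the new instance on top of the stack.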
Duplicating Fields

The instructions for copying fields between the two step levels are generated by the dupFieldInstrs function. Keep in mind that the list of fields passed to the function includes all fields present in the current step level as well as the invoke method argument.

    let rec dupFieldInstrs className fields loadObjRef =
        match fields with
        | field::rest ->
            let (Field(fieldName,_)) = field
            let loadValue = Load(fieldName)
            let storeField = StoreField(className, fieldName)
            loadObjRef::loadValue::storeField::dupFieldInstrs className rest loadObjRef
        | [] -> []

At this point in the process we do not worry about details such as how a field value is loaded. The only information conveyed to MFIL is that any value x from the current instance has to be replicated to the field of the same name in the newly created instance of the next level. Whether that value exists as an argument or as a field in the current level is an implementation detail left to the MFIL compiler. As can be seen above, a set of three instructions is returned for each field:

    1. Load the instance reference
    2. Load the value to be duplicated
    3. Store the value in the field

The instance reference has to be loaded for each field because the CIL operation that stores the field value consumes both the value and the instance reference from the stack [ECMA, 2015]. This is an implementation detail that requires knowledge of how the end compiler works, and it might be different if the MFIL code produced were to target a stack-based language other than CIL.

10.2.3 Step Class Structure

Each function in the source program is converted to a list of classes in the MFIL language.
The type containing the classes in MFIL is defined as:

    type Class =
        FunClass of string * string * (ILType * ILType) * Field list * Instruction list

The information contained in the class type consists of the name of the function, the name of the parameter of the invoke method, the input and output types of the invoke method, a list of fields required for the class, and finally a list of instructions for the invoke method.

The compileFunDef function, which is responsible for creating the step-classes described in section 5.4, takes as input the name of the source function, the list of invoke types extracted from the parameters, and the expression tree with the instructions for the body of the function.

The list of invoke types is traversed and determines how deep the step-class structure will be. As long as the list of invoke types still contains elements, we perform the following steps:

    let (InvokeType(paramName, inType, outType)) = nextParam
    let classes = buildFunClasses rest (Field(paramName, getILType inType)::reqFields) (lvl+1)
    let (FunClass(className, _, _, nextStepFields, _)) = List.head classes

The next parameter, in the form of an invoke type, is deconstructed to access its name and type definition. The input type of the parameter is converted to its IL type equivalent, and the type, along with the parameter name, is wrapped in a field type. The build function is called with the remaining parameters and with the new field added to the list of required fields for the next level. It returns a list of classes from the current level all the way to the "bottom" of the structure. The head of the returned list contains the class for the next step level, from which the name and required fields are extracted.

    let (inILType, outILType) = getILTypes inType outType
    let stepClassName = genFunClassName funName lvl
    let instrs = createStepClassInstrs nextParam className nextStepFields

The parameter input and output types are converted to IL types, and a name for the current step class is created. The name is built from the original name of the function appended with step<n>, where n is the number of the current level, held in the level counter. This holds for any n > 0. For n = 0, which is the top level of the step-class structure, the name is kept as the original name of the function, for convenience when handling function calls. Next the instructions for the invoke method body are generated using the process described in section 10.2.2.

When the "bottom" of the step-class structure is reached, which happens when the last element in the list of invoke types is reached, the actual computation of the function body takes place.
So instead of providing the invoke method with the step-class instructions, as was the case previously, it is given the set of instructions that results from compiling the expression tree defined in the original function body. Compiling the expression tree is explained in section 10.1.

    FunClass(stepClassName, paramName, (inILType, outILType), reqFields, instrs)::classes

Finally the class definition is built and added to the list of classes from the recursive step.

10.3 Match Expression

To illustrate the concept of working with the inductive types, we are going to look at how a pattern match expression is compiled. In MF the two supported inductive data types, tree and list, each have only two cases. For the tree we have either a Node containing a value and the two sub-trees, or a Leaf indicating the end of the tree. For the list we have either a Cons element containing the current element value and the next element in the list, or Nil, meaning we have reached the end of the list.

What holds for both types is that any given element is either of type A or of type B. This is very similar to an if-then-else expression: either a condition holds and we execute the instructions in the "then-block", or the condition fails and the "else-block" instructions are executed. In fact these two constructs are so similar that they are both compiled to the MFIL Branch instruction.

To keep the list of instructions short, a very simple match example has been chosen. The function myMatchExample takes a list of integers and produces a new list with the result of squaring all elements. Since we have already discussed the translation of a function to its corresponding step-class structure, we limit this discussion to the actual match expression in the function body.

    fun myMatchExample lst : list -> list =
        match lst with
        | Nil -> []
        | Cons(x, xs) -> (x * x)::myMatchExample xs
        end
    end

In the function compileMatchCases we extract the two match cases and send them to compileListMatch, where the actual compilation takes place. This is done so that compileListMatch can process the input regardless of the order of the match cases. The input given is the actual expression on which the match is performed, the type of that expression, the names for the value and rest bindings, and an expression for each of the cases with the action to perform.

We start by compiling the match expression, which in this case translates to a load instruction of the name lst, because the match is performed on the list argument of the function. An instruction is then made to store the loaded argument in a local value with a random name.

    let result = compileExpr matchExpr
    let name = createRandomName "match"
    let storeResult = StoreLocal (name, getILType matchExprType) :: result

Next the newly stored value is loaded again and matched against the nil type. Based on the result of this check, the branching of the instructions can take place.
let loadResult = Load name let isNilInstr = CallCheckVariant(ILNil) As we saw earlier the Branch type consists of two lists of instructions. The first set of instructions if the previous check holds and another set if it fails. In this case we tested if the match expression was of type nil, so the first set of instructions is the result of compiling the expression for the nil ”action”. let nilBranchInstrs = compileExpr nilAction |> List.rev In this case a match on nil just returns an empty list, so the expression here simple evaluates to a CallConstructVariant instruction with an ILNil variant 38
argument.

The cons case is a little more interesting, since we also have to bind the value and the next element before we evaluate the action expression, in which these appear as bound names. This is not much different from how let expressions are handled. For each of the two elements we load the local value containing the list, create an extract instruction for the appropriate element, and store it to a local value with the supplied name.

    let bindValue = [loadResult; ExtractVariantVal(ListValue); StoreLocal(valName, ILInt)]
    let bindNext = [loadResult; ExtractVariantVal(Next); StoreLocal(restName, ILList)]

Finally, the expression for the action part of the cons case is translated.

    let consActionInstrs = compileExpr consAction |> List.rev

Since the :: operator is simply syntactic sugar for creating a new cons element, we start with the two arguments needed for cons. The value argument is found by loading x (the name bound to the value in the cons element of the current match) twice and applying the mul operator. To find the next element, or the rest of the list, the function calls itself recursively with the list argument bound to xs, the name of the rest of the list. This results in a Load instruction for xs and a CallInvoke instruction to myMatchExample. The last instruction in the cons action expression constructs a new cons element.

The combined list of MFIL instructions for this match example can be seen in section 11.6.2, where we complete the compilation process by compiling these instructions to CIL.

11 MFIL Compilation

This section concludes the compilation process by explaining the MFIL compiler. The compiler takes a program written in MFIL and produces an assembly of CIL instructions, classes and methods. Microsoft's Reflection.Emit API is used to accomplish this task by creating a single assembly, with a single module, to contain all the classes.
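As a rough illustration of the single-assembly, single-module setup, the following minimal sketch builds an in-memory assembly with one module, one class and one static method, and then calls that method. The names MFOutputSketch, MFModuleSketch, Answer and Get are hypothetical, and the use of AssemblyBuilderAccess.Run (rather than saving the assembly to disk) is an assumption made to keep the sketch self-contained; it is not taken from the MFIL compiler itself.

```fsharp
open System
open System.Reflection
open System.Reflection.Emit

// One dynamic assembly containing one module (names are hypothetical).
let assemblyBuilder =
    AssemblyBuilder.DefineDynamicAssembly(
        AssemblyName("MFOutputSketch"), AssemblyBuilderAccess.Run)
let moduleBuilder = assemblyBuilder.DefineDynamicModule("MFModuleSketch")

// Every class is defined through the same ModuleBuilder.
let typeBuilder =
    moduleBuilder.DefineType("Answer", TypeAttributes.Public ||| TypeAttributes.Class)

// A static method with no parameters that returns an int.
let methodBuilder =
    typeBuilder.DefineMethod(
        "Get",
        MethodAttributes.Public ||| MethodAttributes.Static,
        typeof<int>,
        Type.EmptyTypes)

// Method body: push the constant 42 and return it.
let ilGen = methodBuilder.GetILGenerator()
ilGen.Emit(OpCodes.Ldc_I4, 42)
ilGen.Emit(OpCodes.Ret)

// Complete the type and invoke the generated method through reflection.
let answerType = typeBuilder.CreateType()
let result = answerType.GetMethod("Get").Invoke(null, [||]) :?> int
printfn "%d" result
```

The same ModuleBuilder value would be handed to every function that defines a class, which is how all generated classes end up in one module.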
Figure 15 gives a hierarchical overview of the constructs used for this project and how they relate to each other.

11.1 IFunc Interface

To allow instances with the right type definition to be given as arguments to the invoke methods of other instances, all classes must support a common interface [10]. The need for this interface is explained in section 5.3.1. The interface defines a single method, "Invoke", with a single generic parameter type and a generic return type.

[10] Although the input and output type definitions vary depending on the original function's type definition, it is fundamentally still the same interface.
Figure 15: A hierarchical overview of the CIL constructs used in this project

    let [| tinput; toutput |] = newInterface.DefineGenericParameters([|"TInput"; "TOutput"|])
    newInterface.DefineMethod("Invoke", methodAttributes, toutput.AsType(), [| tinput.AsType() |]) |> ignore

The actual types for the generic input and output types are made concrete when the step-classes implement the interface, as we shall see later.

11.2 Built-in Types

To ensure that the two inductive types are available to the rest of the instructions, they are built before the actual source program is compiled. The tree and list types are created in the files CILTreeBuilder.fs and CILListBuilder.fs respectively. A set of helper functions shared between the two can be found in CILClassBuilder.fs and CILMethodBuilder.fs.

The two types are semantically very similar, the only difference being that a node element in a tree contains a value and a left and a right sub-tree, whereas a cons element in a list contains only the value and the rest of the list. To illustrate the concept, this section goes through the process of creating the MFTree type [11]. The expected outcome can be seen in figure 10, section 6.

[11] The tree type is named MFTree to emphasize that it is part of the MF language.
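As a source-level point of reference for what the following subsections build instruction by instruction with Reflection.Emit, the class hierarchy could be written directly in F# roughly as below. This is only a sketch under assumptions: the member names NewLeaf, NewNode, GetValue, GetLeft and GetRight are hypothetical (only IsNode and IsLeaf are named in the text), and the generated CIL classes are not produced from source code like this.

```fsharp
// Abstract superclass with static creation methods, type tests and getters.
[<AbstractClass>]
type MFTree() =
    static member NewLeaf() : MFTree = upcast Leaf()
    static member NewNode(value: int, left: MFTree, right: MFTree) : MFTree =
        upcast Node(value, left, right)
    // Type tests against the two subclasses.
    member this.IsNode() : bool = box this :? Node
    member this.IsLeaf() : bool = box this :? Leaf
    // Getters cast to Node first; they throw if the instance is a Leaf.
    member this.GetValue() : int = (box this :?> Node).Value
    member this.GetLeft() : MFTree = (box this :?> Node).Left
    member this.GetRight() : MFTree = (box this :?> Node).Right

// A leaf has no fields; it only marks an empty tree.
and Leaf() =
    inherit MFTree()

// A node stores an integer value and two sub-trees.
and Node(value: int, left: MFTree, right: MFTree) =
    inherit MFTree()
    member this.Value : int = value
    member this.Left : MFTree = left
    member this.Right : MFTree = right

// Usage: a single node holding 3 with two empty sub-trees.
let tree = MFTree.NewNode(3, MFTree.NewLeaf(), MFTree.NewLeaf())
printfn "%b %d %b" (tree.IsNode()) (tree.GetValue()) (tree.GetLeft().IsLeaf())
```

Placing the type tests and getters on the superclass means that calling code only ever needs to know the MFTree type, at the cost of a runtime cast inside each getter.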
11.2.1 Structure

Recall from section 6 that, through object-oriented decomposition, a tree is defined as a set of nodes and leaves. A class has to be created for each of the two cases and made a subclass of the abstract MFTree class. For convenience, the MFTree superclass contains a set of static methods for creating and returning instances of the two subclasses. In addition, it implements methods that test a given instance against each of the two types (leaf and node), as well as getter methods for the three fields of the node class: value, left sub-tree and right sub-tree. The Leaf class contains no fields because it only acts as an indicator for an empty tree.

11.2.2 Create Classes

The function createTreeStructure is responsible for building the MFTree class and its two subclasses. It takes a single argument, an instance of the Reflection.Emit.ModuleBuilder class (mb). As mentioned earlier, all classes are defined in the same module in the output assembly. This includes the built-in types.

    let treeClass = defineSuperClass mb "MFTree"
    let treeClassCTor = defineDefaultCTor treeClass

First the abstract class MFTree is created, along with a default (parameterless) constructor. The constructor is needed when the Node class constructor is created, as we shall see next. An important thing to keep in mind is that, at the CIL level, a constructor is not treated much differently from a method.

    let nodeClass = defineSubClass mb "Node" treeClass
    let valFld = addField nodeClass "value" typeof<int>
    let leftFld = addField nodeClass "left" treeClass
    let rightFld = addField nodeClass "right" treeClass
    let nodeClassCTor = defineNodeCTor nodeClass treeClass treeClassCTor valFld leftFld rightFld

The defineSubClass function creates a class named "Node" and defines its parent class as the one provided by the argument, in this case the MFTree class.
Next we define the three required fields in the Node class: a value field and one field for each sub-tree, left and right. The value field is restricted to an integer type, and the two sub-tree fields have the superclass type MFTree, since a sub-tree can be either a leaf or a node.

Next, a custom three-argument constructor is created in the Node class. The three fields created previously are given as arguments to the function defineNodeCTor, which creates the constructor. The fields are used to store the three arguments provided by the constructor call. The instructions for loading an argument and storing it in the appropriate field come in sets of three for each argument, shown here for the value argument/field pair:

    ilGen.Emit(OpCodes.Ldarg_0)
    ilGen.Emit(OpCodes.Ldarg_1)
    ilGen.Emit(OpCodes.Stfld, valFld)
For instance methods (the constructor falls in this category), argument 0 refers to the this pointer of the current instance. Actual arguments range from 1 to n. The Ldarg_0 instruction therefore pushes the instance pointer onto the stack. Then the first argument of the constructor is loaded and stored to its field (Stfld), consuming both the instance reference and the value in the process.

The MFTree class is used in the defineNodeCTor function to define the argument types for the left and right sub-trees. The MFTree constructor, on the other hand, is used for a less obvious reason. When a constructor in a subclass is called, the constructor of its parent class also has to be called [ECMA, 2015]. This detail is usually handled automatically by the compiler when we write C# code, but it has to be handled manually here at the CIL level.

11.2.3 Create Methods

With the class hierarchy built and the appropriate constructors for the two subclasses defined, it is time to create the necessary methods in the abstract parent class MFTree. We start by defining the type-testing methods for each of the subclasses:

    createIsTypeMethod treeClass nodeClass "IsNode"
    createIsTypeMethod treeClass leafClass "IsLeaf"

The createIsTypeMethod function is designed to create a testing method for any given compare class. The resulting method takes no arguments, returns a boolean value with the comparison result, and contains the following set of instructions:

    ilGen.Emit(OpCodes.Ldarg_0)
    ilGen.Emit(OpCodes.Isinst, compareClass)
    ilGen.Emit(OpCodes.Ldnull)
    ilGen.Emit(OpCodes.Cgt_Un)
    ilGen.Emit(OpCodes.Ret)

First the this pointer is pushed onto the stack. Next the instance is tested against the compare class (Isinst). If the instance is of that class, the instance pointer is left on the stack; if not, null is pushed onto the stack in its place.
A null value is then pushed onto the stack (Ldnull) and a check is performed (Cgt_Un) to see whether the result of the previous instruction is greater than the newly pushed null value. This is the case if the Isinst instruction yields anything but null, in other words if the instance was indeed of the compare class type. The greater-than check pushes 1 onto the stack if the reference is greater than null and 0 otherwise. The return instruction (Ret) ends the current stack frame and returns the value on top of the stack (either 0 or 1), implicitly converting it to a boolean in the process.

Next up are the two static methods for creating instances of the subclasses Leaf and Node:

    createNewLeafMethod treeClass nilClassCTor
    createNewNodeMethod treeClass nodeClassCTor
The MFTree class is passed to the create functions, since the return type of the resulting methods is MFTree. The respective constructors are also passed to the functions so that the methods can create the new instances. The signature of the create-Node method is very similar to that of the Node class constructor described in section 11.2.2. Both take three arguments: a value and the left and right sub-trees. The difference is that the create-Node method is static and has a return type, which the constructor does not.

As for the body of the method, we load all three arguments (in the right order), as in the constructor, but instead of storing them to fields we create an object of the Node class with the aforementioned constructor (Newobj) and end the stack frame. This consumes the three values on the stack and returns the new class instance, now on top of the stack. Notice that, in contrast to the previous example, Ldarg_0 does not load the this pointer, because this is a static method.

    ilGen.Emit(OpCodes.Ldarg_0) // First argument, not instance pointer
    ilGen.Emit(OpCodes.Ldarg_1)
    ilGen.Emit(OpCodes.Ldarg_2)
    ilGen.Emit(OpCodes.Newobj, nodeClassCTor)
    ilGen.Emit(OpCodes.Ret)

The create-Leaf method takes no arguments; it only creates an instance of the Leaf class and returns it.

The last step is to create the methods for extracting the three values from a Node instance. Given a superclass, a subclass, a field and a name, the function createGetSubFieldMethod creates a method that returns the value of that field. The method takes no arguments and its return type is set to the type of the field. The method body is relatively simple:

    ilGen.Emit(OpCodes.Ldarg_0)
    ilGen.Emit(OpCodes.Castclass, subClass)
    ilGen.Emit(OpCodes.Ldfld, fieldInfo)
    ilGen.Emit(OpCodes.Ret)

The instance pointer is loaded as we have seen before. A typecast (Castclass) is performed from the instance of type MFTree to the provided subclass (in this case the Node class).
Next the value of the given field is loaded (Ldfld) and the stack frame ends, which returns the value.

11.3 Meta Information

During the compilation process a local environment must be present for each invoke method. The environment has to be updated with local values and arguments as we traverse the list of instructions in the method bodies. In addition, global functions and the built-in types MFList and MFTree, along with the shared interface, must also be available to the compiler. This section describes a record named Meta, which acts as a placeholder, or "toolbox", for these necessities. Gathering all this information in a single place provides some advantages:
• The compiler source code is more readable and easier to maintain, with shorter functions, since only a single meta value has to be passed around
• Functions that pass the information around do not need to be refactored if the meta record is extended with new functionality, such as a new supported type

    type Meta = {
        currentEnv : Env
        classInfoMap : Map<string, ClassInfo>
        preDefTypes : PreDefTypes
    }

The Meta record contains the environment for the method currently being compiled, a symbol table matching the names of global functions with their required information, and finally a record containing the predefined types mentioned earlier.

11.3.1 Environment

At the CIL level, names for local values and arguments are unimportant. When interacting with these, they are instead referred to by their index. The job of the environment is therefore to map names to their respective indices. Because CIL is a strongly typed language, type information for the values also has to be stored.

When compiling an invoke method, bound names stem from three different sources: the method argument, a local value [12] or a class field. A field is treated a bit differently from the other two, since a field is not part of the method but is instead defined in the class. A field is accessed through the FieldInfo type instead of an index, as is the case for an argument and a local value. But even though an argument and a local value are both loaded based on their index, this is done by two different CIL instructions. Because of this, the environment must be able to distinguish between the ways in which values are present in the local method.

    type Element =
        | Arg of int * ILType
        | Local of int * ILType
        | Field of FieldInfo * ILType

    type Env =
        | Env of int * Map<string, Element>

An entry in the environment is therefore a mapping between a name and an Element value.
This enables the compiler to respond appropriately to the different element types when a lookup is made in the environment. The integer in the Env tuple refers to the next available index in the CIL evaluation stack and is incremented when an argument or a local value is added to the environment.

[12] A let expression in the function body.
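The way such an environment might be extended and queried can be sketched as follows. The Element and Env definitions follow the text; the ILType definition is a simplified stand-in, and the helpers emptyEnv, addArg, addLocal and describe are hypothetical names introduced only for this illustration, not functions from the MFIL compiler.

```fsharp
open System.Reflection

// Simplified stand-in for ILType (assumption: the real type has other cases).
type ILType = ILInt | ILBool | ILList | ILTree

type Element =
    | Arg of int * ILType
    | Local of int * ILType
    | Field of FieldInfo * ILType

type Env =
    | Env of int * Map<string, Element>

// Hypothetical starting point: no bindings, next free index is 0.
let emptyEnv = Env (0, Map.empty)

// Hypothetical helpers: bind a name and advance the index counter.
let addArg name ilType (Env (next, bindings)) =
    Env (next + 1, Map.add name (Arg (next, ilType)) bindings)

let addLocal name ilType (Env (next, bindings)) =
    Env (next + 1, Map.add name (Local (next, ilType)) bindings)

// Hypothetical lookup: reports which kind of load the compiler must emit.
let describe name (Env (_, bindings)) =
    match Map.tryFind name bindings with
    | Some (Arg (i, _)) -> sprintf "argument %d" i
    | Some (Local (i, _)) -> sprintf "local %d" i
    | Some (Field (f, _)) -> sprintf "field %s" f.Name
    | None -> failwithf "unbound name: %s" name

// Usage: the list argument of a function followed by one let-bound value.
let env = emptyEnv |> addArg "lst" ILList |> addLocal "x" ILInt
printfn "%s, %s" (describe "lst" env) (describe "x" env)
```

Because the environment is an immutable value, each compilation step returns an updated copy, which fits the idea of threading a single Meta record through the compiler.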
11.3.2 Global Functions

The classInfoMap maps method names to the ClassInfo type, which is a placeholder for all relevant information related to the step-classes. The type is defined as:

    type ClassInfo = {
        classBuilder : TypeBuilder
        fields : (FieldBuilder * ILType) list
        classType : Type
        ctorBuilder : ConstructorBuilder
    }

In addition to the actual class, its type and its constructor, we also store the list of class fields. At the top level of the step-class structure this is an empty list, but in subsequent levels the fields contain the values or function closures described earlier. Together with each field, its type is also stored. If the field contains a value, its type is not needed to load it onto the stack. But if the field contains a function closure, a call has to be made to the invoke method of the class instance, and here the type is required.

11.3.3 Predefined Types

Lastly, we have the built-in types MFList and MFTree and the IFunc interface.

    type PreDefTypes = {
        funcInterface : Type
        listType : Type
        treeType : Type
    }

It is worth noticing that, unlike for the step-classes from the previous section, we do not need all the extra information here, only the classes (and the interface). Notice also that the predefined types are saved as system types (Type) and not as TypeBuilder objects, as was the case with the step-classes. The reason is that the predefined types are already built and completed, and as a result they can be interacted with in a different way than the step-classes, whose instructions, fields and types depend on the content of the source program.

11.4 Expected MFIL Output

Before moving on to the translation of the step-class structure, we will revisit the MFIL abstract syntax tree from section 7.4 and take a closer look at the expected output from the MFIL compiler for each instruction.
It will soon become clear that the instructions are a mix of lower-level ones, which are very close to the CIL end result, and others with a higher level of abstraction, which require several steps to be performed.

PushInt and PushBool

The lower-level push instructions emit either an integer or a boolean value to the evaluation stack.
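The push instructions described above could be emitted roughly as in the following sketch. It assumes that both instructions map to the CIL Ldc_I4 opcode, with a boolean represented as 0 or 1; the Instr type, emitPush and run are hypothetical names for this illustration, and a DynamicMethod is used only so that the emitted instruction can be executed and inspected.

```fsharp
open System
open System.Reflection.Emit

// Hypothetical two-case subset of the MFIL instruction type.
type Instr =
    | PushInt of int
    | PushBool of bool

// Emit a constant load for a push instruction (assumed mapping to Ldc_I4).
let emitPush (ilGen: ILGenerator) (instr: Instr) =
    match instr with
    | PushInt n -> ilGen.Emit(OpCodes.Ldc_I4, n)
    | PushBool b -> ilGen.Emit(OpCodes.Ldc_I4, (if b then 1 else 0))

// Wrap a single push in a method that returns the pushed value.
let run<'T> (instr: Instr) : 'T =
    let dm = DynamicMethod("push", typeof<'T>, Type.EmptyTypes)
    let ilGen = dm.GetILGenerator()
    emitPush ilGen instr
    ilGen.Emit(OpCodes.Ret)
    let f = dm.CreateDelegate(typeof<Func<'T>>) :?> Func<'T>
    f.Invoke()

printfn "%d %b" (run<int> (PushInt 7)) (run<bool> (PushBool true))
```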