Compiler
Construction
LECTURE 1
Why Take this Course
Reason #1: understand compilers and
languages
understand the code structure
understand language semantics
understand relation between source
code and generated machine code
become a better programmer
2
Why Take this Course
Reason #2: nice balance of theory and
practice
Theory
◦ mathematical models: regular expressions, automata,
grammars, graphs
◦ algorithms that use these models
Practice
◦ Apply theoretical notions to build a real compiler
3
Why Take this Course
Reason #3: programming experience
write a large program which manipulates complex data
structures
4
What are Compilers
Translate information from one
representation to another
Usually information = program
5
Examples
Typical Compilers
◦ VC, VC++, GCC, JavaC
◦ FORTRAN, Pascal, VB
Translators
◦ Word to PDF
◦ PDF to Postscript
6
In This Course
We will study typical compilation:
from programs written in high-level languages to low-
level object code and machine code
7
Typical Compilation
High-level source code → Compiler → Low-level machine code
8
Source Code
int expr( int n )
{
    int d;
    d = 4*n*n*(n+1)*(n+1);
    return d;
}
9
Source Code
Optimized for human readability
Matches human notions of grammar
Uses named constructs such as variables and
procedures
10
Assembly Code
.globl _expr
_expr:
    pushl %ebp
    movl %esp,%ebp
    subl $24,%esp
    movl 8(%ebp),%eax
    movl %eax,%edx
    leal 0(,%edx,4),%eax
    movl %eax,%edx
    imull 8(%ebp),%edx
    movl 8(%ebp),%eax
    incl %eax
    imull %eax,%edx
    movl 8(%ebp),%eax
    incl %eax
    imull %eax,%edx
    movl %edx,-4(%ebp)
    movl -4(%ebp),%edx
    movl %edx,%eax
    jmp L2
    .align 4
L2:
    leave
    ret
11
Assembly Code
Optimized for hardware
Consists of machine instructions
Uses registers and unnamed memory locations
Much harder for humans to understand
12
How to Translate
Correctness:
the generated machine code must execute
precisely the same computation as the source code
13
How to Translate
Is there a unique translation? No!
Is there an algorithm for an “ideal translation”? No!
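To make the "no unique translation" point concrete, here is a small sketch in C. The function `expr` is the one from the Source Code slide; `expr_alt` and `check_equivalent` are hypothetical names introduced only for this illustration. Both functions compute the same value (as long as the result fits in an `int`), yet they would naturally lead to different machine code.

```c
#include <assert.h>

/* The lecture's function, as written on the Source Code slide. */
int expr(int n)
{
    int d;
    d = 4 * n * n * (n + 1) * (n + 1);
    return d;
}

/* Hypothetical alternative: 4*n*n*(n+1)*(n+1) equals (2*n*(n+1)) squared,
   so three multiplications and one addition are enough. */
int expr_alt(int n)
{
    int t = 2 * n * (n + 1);
    return t * t;
}

/* Asserts that both versions agree for every n in 0..max_n.
   max_n must be small enough that the results do not overflow int. */
void check_equivalent(int max_n)
{
    for (int n = 0; n <= max_n; n++)
        assert(expr(n) == expr_alt(n));
}
```

A compiler is free to emit code resembling either version, which is one reason no single "ideal" translation exists.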
14
How to Translate
Translation is a complex process
source language and generated code are very different
Need to structure the translation
15
INTRODUCTION
Languages
Our world operates by objects sending messages to each other in
various forms.
These messages may be commands, requests, or general information.
Every message is expressed in some language.
There is a protocol for every situation.
Human languages developed over centuries for communication are
called natural languages.
16
INTRODUCTION
Languages
When the era of computer-like devices began, a need arose to communicate with
them.
Commands were given in a complete sequence to perform a specific task.
A sequence of commands is called a program.
In the early days, binary was the only language available, realized directly in
gates and flip-flops.
Such a language was difficult for humans to understand.
Human Language: L
Machine Language: M
There should be a method to translate L into M. But how?
17
INTRODUCTION
Languages
Solution:
Translation is made possible by certain properties of
programming languages.
This mechanical work can be done by computing
machines.
Our focus will be on the translation of programming
languages using models from formal languages.
18
INTRODUCTION
Languages
Machine Language:
The language that computer hardware can interpret directly for execution.
Consists of instructions encoded in binary digits.
Machine instructions of modern computers consist of
one or more bytes.
19
INTRODUCTION
Languages
Hex Absolute Loader Language:
The difficulty of machine language was overcome to
some extent by writing a hex loader program in machine
language.
The loader could read an ASCII representation of the machine
instructions in hexadecimal.
Still a far cry from a human-friendly programming language.
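The core job of such a loader can be sketched in C: read pairs of hexadecimal digits and store the resulting bytes in memory. This is only an illustration; the names `hex_value` and `hex_load` are hypothetical, and a real absolute loader would also handle load addresses and would itself have been written in machine language.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Value of one hexadecimal digit, or -1 if c is not a hex digit. */
static int hex_value(int c)
{
    if (c >= '0' && c <= '9') return c - '0';
    c = tolower(c);
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

/* Reads pairs of hex digits from text (whitespace between bytes is skipped)
   and stores the decoded bytes in memory[0..capacity).
   Returns the number of bytes loaded, or -1 on malformed input or overflow. */
long hex_load(const char *text, unsigned char *memory, size_t capacity)
{
    size_t count = 0;
    assert(text != NULL && memory != NULL);
    while (*text != '\0') {
        if (isspace((unsigned char)*text)) { text++; continue; }
        int hi = hex_value((unsigned char)text[0]);
        int lo = (hi < 0) ? -1 : hex_value((unsigned char)text[1]);
        if (hi < 0 || lo < 0 || count == capacity) return -1;
        memory[count++] = (unsigned char)(hi * 16 + lo);
        text += 2;
    }
    return (long)count;
}
```

Calling `hex_load("55 89 E5", buffer, sizeof buffer)` would place three bytes in `buffer`; the programmer still has to know the numeric encoding of every instruction, which is why this remains far from human-friendly.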
20
INTRODUCTION
Languages
Assembly Language….
An assembly language provided:
• Mnemonic opcodes
• Symbolic operands
• Address arithmetic
• Data declaration
• Memory reservation
An assembly language reflects the architecture and
instruction set of the computer for which it is designed.
21
INTRODUCTION
Languages
Macro Assembly Language:
With some experience in assembly language,
programmers found that the same instruction sequences were used
again and again.
Such a sequence is defined as a macro and given an
identifier.
This removed much of the drudgery from programming.
Example: IBM Autocoder.
22
INTRODUCTION
Languages
Intermediate Code or Bytecode:
Bytecode is the “machine language” of a phantom or
virtual computing machine.
It is used as an intermediate representation of a program,
keeping only the most essential features of the program.
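As a rough illustration of a "virtual computing machine", the following C sketch interprets a made-up bytecode for a tiny stack machine. The opcodes `OP_PUSH`, `OP_ADD`, `OP_MUL`, `OP_HALT` and the function `vm_run` are assumptions for this example and do not correspond to any real bytecode format.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical opcodes of a tiny stack-based virtual machine. */
enum { OP_PUSH, OP_ADD, OP_MUL, OP_HALT };

/* Executes bytecode until OP_HALT and returns the value on top of the stack.
   OP_PUSH is followed by one operand word. Errors are caught only by asserts. */
int vm_run(const int *code)
{
    int stack[64];
    size_t sp = 0;   /* number of values currently on the stack */
    size_t pc = 0;   /* index of the next instruction word */

    for (;;) {
        switch (code[pc++]) {
        case OP_PUSH:
            assert(sp < 64);
            stack[sp++] = code[pc++];
            break;
        case OP_ADD:
            assert(sp >= 2);
            stack[sp - 2] = stack[sp - 2] + stack[sp - 1];
            sp--;
            break;
        case OP_MUL:
            assert(sp >= 2);
            stack[sp - 2] = stack[sp - 2] * stack[sp - 1];
            sp--;
            break;
        case OP_HALT:
            assert(sp >= 1);
            return stack[sp - 1];
        default:
            assert(!"unknown opcode");
            return 0;
        }
    }
}
```

The program `{OP_PUSH, 2, OP_PUSH, 3, OP_PUSH, 4, OP_ADD, OP_MUL, OP_HALT}` encodes 2 * (3 + 4). The same bytecode can run on any hardware for which this interpreter has been compiled.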
23
INTRODUCTION
Languages
High-Level Languages (HLL):
Look more like a natural language than a machine or
assembly language. Characterized by:
• Statements made up of language atoms such as
numbers, text strings, characters, function calls, etc.
• Statement kinds include:
◦ Declarative
◦ Definition
◦ Imperative
◦ Assignment
◦ Control
24
INTRODUCTION
Languages
Very High Level Languages….
Building on the advantages of HLLs, very high-level languages were developed, for example:
PROLOG
Python
Perl
Haskell
METAFONT
25
INTRODUCTION
Translation Process
• A programming language translator takes source code
written in one language and generates output code in
the target language.
• Sometimes the output is further processed by another
translator into a second target language.
Instructions in source language → TRANSLATOR → Instructions in target language
26
INTRODUCTION
Translation Schemes
• Different kinds of translators are available. We focus on
compilers…
27
INTRODUCTION
Translation Schemes
• Compilers
A compiler takes source code in an HLL and generates
either an executable in machine code or object code for
subsequent linking and execution.
Instructions in HLL → COMPILER → executable machine code
28
INTRODUCTION
Translation Schemes
Types of Compilers
1. One-pass compilers: the compiler completes all its processing
while scanning the source code only once.
• The compiler is simpler and faster, but it cannot perform some of the
more sophisticated optimizations.
2. Multi-pass compilers: the compiler scans the source code
several times to complete the translation.
• This allows much better optimization.
• It also takes care of some quirks of the HLL being handled.
3. Load-and-go compilers: used for one-time programs.
• Programs are written in a built-in editor and immediately translated and executed.
• The program is always loaded at a fixed location in memory.
29
INTRODUCTION
Translation Schemes
Types of Compilers
4. Optimizing compilers: contain provisions for generating target code that is
efficient in terms of execution speed and memory use.
5. Just-in-time compilers: used by Java and by Microsoft .NET's
Common Intermediate Language.
30
INTRODUCTION
Translation Schemes
What does a compiler do?
1. Translates a user program in one language L1 into a program
in another language L2.
2. The compiler itself is a large set of programs, with several modules.
3. L1 (source) is usually an HLL like C, C++ or Java.
4. L2 (target) is usually a form of binary machine language.
5. L2 is not pure machine language, because two further
operations, linking and loading, are needed before the program
is in executable form.
6. A compiler consists of several phases.
31
INTRODUCTION
Translation Schemes
What does a compiler do?
Instructions in HLL (L1) → COMPILER → L2 → Linker (with Library) → Loader → executable machine code
32
INTRODUCTION
Acceptor & Compiler
• In its simplest form, a compiler is an acceptor that reads an input string
and outputs “Yes” or “No”, depending on whether the string
is in L1 or not.
• A compiler is useless if it does not point out errors.
• It tells you whether your program adheres strictly to the
rules of a particular language.
• These rules are given as a grammar, and the compiler
embodies this grammar.
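The acceptor idea can be sketched in C for a deliberately tiny language: strings of balanced parentheses. The function name `accepts` and the choice of language are assumptions for this illustration. The function answers "Yes" (1) or "No" (0) and, in line with the second bullet, also reports where the input goes wrong.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 ("Yes") if s consists only of balanced parentheses, else 0 ("No").
   On rejection, *error_pos receives the index of the offending character,
   or the string length when a closing parenthesis is missing. */
int accepts(const char *s, size_t *error_pos)
{
    size_t depth = 0, i;
    assert(s != NULL && error_pos != NULL);
    for (i = 0; s[i] != '\0'; i++) {
        if (s[i] == '(') {
            depth++;
        } else if (s[i] == ')' && depth > 0) {
            depth--;
        } else {
            *error_pos = i;      /* stray ')' or a character outside the language */
            return 0;
        }
    }
    if (depth != 0) {
        *error_pos = i;          /* input ended while parentheses were still open */
        return 0;
    }
    return 1;
}
```

A real compiler does the same kind of membership test against the full grammar of L1, and additionally produces a translation when the answer is "Yes".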
33
INTRODUCTION
Phases of a Compiler
• Pre-processing
• Lexical analysis (Scanner)
• Syntax analysis (Parser)
• Semantic analysis (Mapper)
• Code Generation
• Error Checking (Spread throughout the compiler)
• Optimization (Spread among several phases)
34
INTRODUCTION
Phases of a Compiler
• Front end and back end of a compiler
Compiler = Front end (analysis) + Back end (synthesis), followed by execution
Front end: Source → Scanner → Parser → Mapper → Intermediate code
Back end: Intermediate code → Code Gen → Optimizer → target
Execution: Interpreter / VM
35
INTRODUCTION
Phases of a Compiler
• Front end of the compiler:
◦ Consists of the pre-processing, lexical, syntax and semantic phases.
◦ This part analyzes the input code.
• Back end of the compiler:
◦ Includes the code generation and optimization phases.
◦ This part synthesizes the output or target code.
36
INTRODUCTION
Phases of a Compiler
• A compiler generates intermediate files between phases to pass the output of
one phase as input to the next.
Source Program → Pre-processor → (pre-processed source) → Scanner → (tokens) → Parser → (tree) → Mapper → (intermediate code) → Code Gen → Assembly Language
37
INTRODUCTION
Phases of a Compiler
Lexical Analyzer – Scanner
• Analyzes individual character sequences and finds language tokens,
such as Number, Identifier, Operator, etc. These tokens are usually
denoted internally by small integers, e.g.
257 for NUMBER
258 for IDENTIFIER
259 for OPERATOR
• Sends a stream of (token type, value) pairs, one for each language
construct, to the Parser.
• Lexical tokens are generally defined by regular expressions.
• Usually based on a finite state machine (FSM) model.
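A minimal hand-written scanner along these lines might look as follows in C. The token codes 257 to 259 come from the slide; the names `TOK_EOF`, `struct token` and `next_token` are assumptions for this sketch, and each `if` branch plays the role of one state of a very small finite state machine.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

enum { TOK_EOF = 0, TOK_NUMBER = 257, TOK_IDENTIFIER = 258, TOK_OPERATOR = 259 };

struct token {
    int  type;       /* one of the TOK_ codes above */
    char text[32];   /* the word itself, e.g. "x" or "42" */
};

/* Scans one token starting at *src, advances *src past it and returns it.
   Words longer than 31 characters are split; a real scanner would report that. */
struct token next_token(const char **src)
{
    struct token tok = { TOK_EOF, "" };
    const char *p;
    size_t len = 0;

    assert(src != NULL && *src != NULL);
    p = *src;
    while (isspace((unsigned char)*p)) p++;          /* skip blanks */

    if (isdigit((unsigned char)*p)) {                /* state: in a number */
        tok.type = TOK_NUMBER;
        while (isdigit((unsigned char)*p) && len < sizeof tok.text - 1)
            tok.text[len++] = *p++;
    } else if (isalpha((unsigned char)*p) || *p == '_') {   /* state: in an identifier */
        tok.type = TOK_IDENTIFIER;
        while ((isalnum((unsigned char)*p) || *p == '_') && len < sizeof tok.text - 1)
            tok.text[len++] = *p++;
    } else if (*p != '\0') {                         /* any other character: operator */
        tok.type = TOK_OPERATOR;
        tok.text[len++] = *p++;
    }
    tok.text[len] = '\0';
    *src = p;
    return tok;
}
```

Calling `next_token` repeatedly on `"x = x + 42"` yields the pairs (258, "x"), (259, "="), (258, "x"), (259, "+"), (257, "42") and finally `TOK_EOF`. In practice scanners are often generated from regular expressions rather than written by hand.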
38
INTRODUCTION
Phases of a Compiler
Lexical Analyzer – Scanner
• Maps the character stream into words, the basic units of syntax
• Produces pairs:
◦ a word and
◦ its part of speech
39
• Example:
x = x + y
becomes
<id,x>
<assign,=>
<id,x>
<op,+>
<id,y>
In each pair, e.g. <id,x>, the first element is the token type and the second is the word.
INTRODUCTION
Phases of a Compiler
Lexical Analyzer – Scanner
• We call the pair
“<token type, word>” a “token”
• Typical tokens: number, identifier, +, -, new, while, if
40
INTRODUCTION
Phases of a Compiler
Syntax Analyzer – Parser
• Analyzes the stream of (token type, value) pairs and finds the language's
syntactic constructs, such as ASSIGNMENT, IF-THEN-ELSE, WHILE, FOR, etc.
• Makes a syntax tree for each identified construct.
• Detects syntax errors while doing the above.
• Recognizes context-free syntax and reports errors
• Guides context-sensitive (“semantic”) analysis
• Builds IR for the source program
◦ The syntax of most programming languages is designed to be context-free (a context-free language, CFL).
◦ A pushdown automaton, i.e. an FSM plus a stack, is an acceptor for this type of language.
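One common way to recognize a context-free language by hand is recursive descent, where the C call stack plays the role of the pushdown automaton's stack. The sketch below only checks syntax and builds no tree. The grammar (shown in the comments), the single-letter identifiers and all function names are assumptions chosen for this example.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

static const char *cur;   /* next unread character of the input */

static void skip_spaces(void) { while (*cur == ' ') cur++; }

static int parse_expr(void);

/* factor -> letter | digit+ | '(' expr ')' */
static int parse_factor(void)
{
    skip_spaces();
    if (isalpha((unsigned char)*cur)) { cur++; return 1; }
    if (isdigit((unsigned char)*cur)) {
        while (isdigit((unsigned char)*cur)) cur++;
        return 1;
    }
    if (*cur == '(') {
        cur++;
        if (!parse_expr()) return 0;
        skip_spaces();
        if (*cur != ')') return 0;       /* syntax error: missing ')' */
        cur++;
        return 1;
    }
    return 0;                            /* syntax error: operand expected */
}

/* term -> factor { '*' factor } */
static int parse_term(void)
{
    if (!parse_factor()) return 0;
    for (skip_spaces(); *cur == '*'; skip_spaces()) {
        cur++;
        if (!parse_factor()) return 0;
    }
    return 1;
}

/* expr -> term { ('+' | '-') term } */
static int parse_expr(void)
{
    if (!parse_term()) return 0;
    for (skip_spaces(); *cur == '+' || *cur == '-'; skip_spaces()) {
        cur++;
        if (!parse_term()) return 0;
    }
    return 1;
}

/* Returns 1 if the whole string is a well-formed expression, else 0. */
int is_valid_expression(const char *text)
{
    assert(text != NULL);
    cur = text;
    if (!parse_expr()) return 0;
    skip_spaces();
    return *cur == '\0';
}
```

A production parser would work on the scanner's token stream instead of raw characters and would build a syntax tree while recognizing each construct.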
41
INTRODUCTION
Phases of a Compiler
Semantic Analyzer – Mapper
• It converts, or maps, the syntax tree for each construct into a sequence of intermediate language
statements.
• For example, source:
d = a * (b+c);
◦ Syntax tree, in prefix form: = ( d, * ( a, + ( b, c ) ) )
◦ The (unoptimized) intermediate language code may look like:
load b
add c
store t1
load a
mult t1
store t2
load t2
store d
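The mapping from tree to intermediate code can be sketched as a recursive walk in C. This is one possible scheme that happens to reproduce the sequence above, not necessarily the one the lecture has in mind; `struct node`, `emit`, `gen_expr` and `gen_assign` are hypothetical names, and the target is assumed to be a single-accumulator machine.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical syntax-tree node: a leaf holds a variable name,
   an inner node holds an operator ('+' or '*') and two children. */
struct node {
    char op;                          /* 0 for a leaf */
    const char *name;                 /* variable name, leaves only */
    const struct node *left, *right;  /* operands, inner nodes only */
};

static int temp_count;                /* temporaries created so far */

/* Appends one "opcode arg" line to the text already in out (cap bytes). */
static void emit(char *out, size_t cap, const char *opcode, const char *arg)
{
    size_t used = strlen(out);
    snprintf(out + used, cap - used, "%s %s\n", opcode, arg);
}

/* Emits code for n and writes into result the name of the location that
   holds its value: the variable itself for a leaf, a fresh temporary otherwise. */
static void gen_expr(const struct node *n, char *out, size_t cap, char result[16])
{
    char left[16], right[16];
    if (n->op == 0) {
        snprintf(result, 16, "%s", n->name);
        return;
    }
    gen_expr(n->left, out, cap, left);
    gen_expr(n->right, out, cap, right);
    emit(out, cap, "load", left);
    emit(out, cap, n->op == '+' ? "add" : "mult", right);
    snprintf(result, 16, "t%d", ++temp_count);
    emit(out, cap, "store", result);
}

/* Emits code for the assignment "target = value" into out (cap bytes). */
void gen_assign(const char *target, const struct node *value, char *out, size_t cap)
{
    char result[16];
    assert(cap > 0);
    out[0] = '\0';
    temp_count = 0;
    gen_expr(value, out, cap, result);
    emit(out, cap, "load", result);
    emit(out, cap, "store", target);
}
```

Because every inner node stores its result in a fresh temporary, the output contains redundant store/load pairs. That is exactly the kind of inefficiency a later optimization phase can remove.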
42
43
Syntax Tree
Parse tree for x+2-y:
goal
  expr
    expr
      expr
        term  <id,x>
      op  +
      term  <number,2>
    op  –
    term  <id,y>
44
Abstract Syntax Trees
The parse tree contains a lot of
unneeded information.
Compilers often use an abstract
syntax tree (AST).
45
Abstract Syntax Trees
This is much more concise:
        –
       / \
      +   <id,y>
     / \
<id,x>   <number,2>
46
Abstract Syntax Trees
AST summarizes grammatical structure without the details of derivation
47
Abstract Syntax Trees
ASTs are one kind of intermediate
representation (IR)
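As a data structure, the AST for x+2-y needs only five nodes. The C sketch below shows one possible representation together with a small evaluator that walks the tree. `struct ast`, `ast_eval` and the convention of single-letter identifiers looked up in a 26-entry table are assumptions made for this illustration.

```c
#include <assert.h>
#include <stddef.h>

enum ast_kind { AST_ID, AST_NUMBER, AST_BINOP };

struct ast {
    enum ast_kind kind;
    char name;                        /* AST_ID: single-letter identifier */
    int value;                        /* AST_NUMBER: literal value */
    char op;                          /* AST_BINOP: '+' or '-' */
    const struct ast *left, *right;   /* AST_BINOP: operands */
};

/* Evaluates the tree; vars[c - 'a'] supplies the value of identifier c. */
int ast_eval(const struct ast *t, const int vars[26])
{
    assert(t != NULL);
    switch (t->kind) {
    case AST_ID:
        return vars[t->name - 'a'];
    case AST_NUMBER:
        return t->value;
    case AST_BINOP: {
        int l = ast_eval(t->left, vars);
        int r = ast_eval(t->right, vars);
        return t->op == '+' ? l + r : l - r;
    }
    }
    return 0;   /* not reached for well-formed trees */
}
```

The root is the `-` node, whose left child is the `+` node over `<id,x>` and `<number,2>` and whose right child is `<id,y>`. The grammar symbols goal, expr, term and op from the parse tree do not appear at all.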
48
The Back End
•Translate IR into target machine code.
•Choose machine (assembly) instructions to implement
each IR operation
•Ensure conformance with system interfaces
•Decide which values to keep in registers
INTRODUCTION
Phases of a Compiler
Code Generation and Machine-Dependent Optimization
• From the intermediate code, assembly language statements can be generated.
• Instead of temporaries in memory, CPU registers may be assigned.
Code Optimization:
• Intermediate code often leaves room for improvement, for example:
◦ Unnecessary stores and loads
◦ The order of a sequence of statements
• A better intermediate code could be:
load b
add c
mult a
store d
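One simple way to remove unnecessary stores and loads is a peephole pass over the instruction list. The C sketch below implements only that single step, under the labeled assumption that names starting with `t` are temporaries that are read at most once. `struct instr` and `remove_store_load_pairs` are hypothetical names. Reaching the four-instruction version above additionally requires reordering the operands of the commutative `mult`, which this sketch does not attempt.

```c
#include <assert.h>
#include <string.h>

struct instr {
    const char *op;    /* "load", "add", "mult" or "store" */
    const char *arg;   /* variable or temporary name */
};

/* Removes every "store tN" that is immediately followed by "load tN",
   compacting code[0..n) in place, and returns the new length.
   Assumption: names starting with 't' are temporaries that are never
   read again after that load, so the stored copy is not needed. */
size_t remove_store_load_pairs(struct instr *code, size_t n)
{
    size_t in = 0, out = 0;
    assert(code != NULL || n == 0);
    while (in < n) {
        if (in + 1 < n &&
            strcmp(code[in].op, "store") == 0 &&
            strcmp(code[in + 1].op, "load") == 0 &&
            code[in].arg[0] == 't' &&
            strcmp(code[in].arg, code[in + 1].arg) == 0) {
            in += 2;                    /* value is already in the accumulator */
        } else {
            code[out++] = code[in++];
        }
    }
    return out;
}
```

Applied to the eight-instruction unoptimized sequence for d = a * (b+c), this pass drops the `store t2` / `load t2` pair and leaves six instructions.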
49
INTRODUCTION
Phases of a Compiler
Optimization
• Machine dependent
• Machine independent
• Possibilities are:
◦ Register allocation
◦ Moving loop-invariant code outside loops
◦ Operator strength reduction
◦ Constant expression evaluation (constant folding)
◦ Removal of dead code
◦ Loop unrolling
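Two of these possibilities, evaluating constant expressions and moving loop-invariant code out of a loop, can be illustrated by applying them by hand to a small C function. The functions `sum_before` and `sum_after` are hypothetical. An optimizing compiler may perform equivalent transformations automatically on its intermediate code.

```c
#include <assert.h>
#include <stddef.h>

/* Before: limit * factor + 60 * 60 is written inside the loop body,
   so naive code would recompute it in every iteration. */
long sum_before(const int *a, size_t n, int limit, int factor)
{
    long sum = 0;
    assert(a != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        sum += a[i] + (limit * factor + 60 * 60);
    return sum;
}

/* After: the constant expression is folded (60 * 60 becomes 3600) and the
   loop-invariant expression is computed once, before the loop. */
long sum_after(const int *a, size_t n, int limit, int factor)
{
    const int bias = limit * factor + 3600;
    long sum = 0;
    assert(a != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        sum += a[i] + bias;
    return sum;
}
```

Both versions return the same result for the same arguments, which is the correctness requirement stated earlier: an optimization may change how a value is computed, never what is computed.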
50
INTRODUCTION
Phases of a Compiler
How to develop optimized code?
• Get your code working completely error-free without optimization.
• Put the code to be optimized (the most time-consuming functions) in a
separate source file.
• Be aware of the general nature of the optimizations (code changes)
introduced by the different compiler switches.
• Start with the lowest level of optimization and get the assembly output
(e.g. the -S switch in GCC).
• Inspect it carefully: does it still do the job the way you wanted?
• If yes, apply the next higher level of optimization and repeat.
51
