Compilers Are Databases
JVM Languages Summit
Martin Odersky
TypeSafe and EPFL
Keynote, JVM Languages Summit 2015


  1. 1. Compilers Are Databases JVM Languages Summit Martin Odersky TypeSafe and EPFL
  2. 2. Compilers... 2
  3. 3. Compilers and Data Bases 3
  4. 4. Compilers are Data Bases? 4 Put a square peg in a round hole?
  5. 5. This Talk ... ... reports on a new compiler architecture for dsc, the Dotty Scala Compiler. • It has a mostly functional architecture, but uses a lot of low-level tricks for speed. • Some of its concepts are inspired by functional databases.
  6. 6. My Early Involvement in Compilers • 80s: Pascal, Modula-2; single pass, following the school of Niklaus Wirth. • 95-96: Espresso, the 2nd Java compiler -> E Compiler -> Borland’s JBuilder; used an OO AST with one class per node and all processing distributed between methods on these nodes. • 96-99: Pizza -> GJ -> javac (1.3+) -> scalac (1.x); replaced OO AST with pattern matching.
  7. 7. Current Scala Compiler 2004-12 nsc compiler for Scala (2.0-2.10) Made (some) use of functional capabilities of Scala Added: – REPL – presentation compiler for IDEs (Eclipse, Ensime) – run-time meta programming with toolboxes It’s the codebase for the official scalac compiler for 2.11, 2.12 and beyond. 7
  8. 8. Next Generation Scala Compiler 2012 – now: Dotty • Rethink compiler architecture from the ground up. • Introduce some language changes with the aim of better regularity. • Status: – Close to bootstrap – But still rough around the edges 8
  9. 9. Compilers – Traditional View 9
  10. 10. Compilers – Traditional View 10
  11. 11. Add Separate Compilation 11
  12. 12. Challenges A compiler for a language like Scala faces quite a few challenges. Among the most important are: » Complexity » Speed » Latency » Reusability
  13. 13. Challenge: Complex Transformations • Input language (Scala) is complicated. • Output language (JVM) is also complicated. • Semantic gap between the two is large. Compare with compilers to simple low-level languages such as System F or SSA. 13
  14. 14. Deep Transformation Pipeline [Diagram of the phases between Source and Bytecode: Parser, Typer, FirstTransform, ValueClasses, Mixin, LazyVals, Memoize, CapturedVars, Constructors, LambdaLift, Flatten, ElimStaticThis, RestoreScopes, GenBCode, RefChecks, ElimRepeated, NormalizeFlags, ExtensionMethods, TailRec, PatternMatcher, ExplicitOuter, ExpandSAMs, Splitter, SeqLiterals, InterceptedMeths, Literalize, Getters, ClassTags, ElimByName, AugmentS2Traits, ResolveSuper, Erasure] To achieve reliability, need – excellent modularity – minimized side effects -> Functional code rules!
  15. 15. Challenge: Speed • Current scalac achieves 500-700 loc/sec on idiomatic Scala code. • Can be much lower, depending on input. • Everyone would like it to be faster. • But this is very hard to achieve. - FP does have costs. - Optimizations are ineffective. - No hotspots, costs are smeared out widely. 15
  16. 16. Challenge: Latency • Some applications require fast turnaround for small changes more than high throughput. • Examples: – REPL – Worksheet – IDE Presentation Compiler -> Need to keep things loaded (program + data)
  17. 17. Challenge: Reusability • A compiler has many clients: – Command line – Build tools – IDEs – REPL – Meta-programming -> Abstractions must not leak. (FP helps)
  18. 18. A Question
      Every compiler has to answer questions like this. Say I have a class
          class C[T] { def f(x: T): T = ... }
      At some point I change it to:
          class C[T] { def f(x: T)(y: T): T = ... }
      What is the type signature of C.f? Clearly, it depends on the time when the question is asked!
  19. 19. Time-Varying Answers
      Initially:       (x: T): T
      After erasure:   (x: Any): Any
      After the edit:  (x: T)(y: T): T
      After uncurry:   (x: T, y: T): T
      After erasure:   (x: Any, y: Any): Any
  20. 20. Naive Functional Approach
      World1 -> IR1,1 -> ... -> IRn,1 -> Output1
      World2 -> IR1,2 -> ... -> IRn,2 -> Output2
      ...
      Worldk -> IR1,k -> ... -> IRn,k -> Outputk
      How big is the world?
  21. 21. A More Practical Strategy Taking inspiration from FRP and functional databases: • Treat every value as a time-varying function. • So the question is not “What is the signature of C.f?” but “What is the signature of C.f at a given point in time?” -> Need to index every piece of information with the time at which it holds.
  22. 22. Time in dsc Period = (RunID, PhaseID) • RunID is incremented for each compiler run • PhaseID ranges from 1 (parser) to ~ 50 (backend) [Timeline: Run1, Run2, Run3]
  23. 23. Time-Indexed Values
      sig(C.f, (Run 1, parser))  = (x: T): T
      sig(C.f, (Run 1, erasure)) = (x: Any): Any
      sig(C.f, (Run 2, parser))  = (x: T)(y: T): T
      sig(C.f, (Run 2, uncurry)) = (x: T, y: T): T
      sig(C.f, (Run 2, erasure)) = (x: Any, y: Any): Any
  24. 24. Task of the Compiler • Compute all values needed for analysis and code generation over all periods where they are relevant. • Problem: The graph of this function is humongous! • More work is needed to make it efficiently explorable. • But for a start it looks like the right model. 24
  25. 25. Core Data Types Abstract Syntax Trees Types References Denotations Symbols 25
  26. 26. Abstract Syntax Trees • For instance, for x * 2: 26
  27. 27. Tree Attributes What about tree attributes? In dsc, we simplified as much as we could. We were left with just two attributes: – Position (intrinsic) – Type The job of the type checker is to transform untyped to typed trees.
  28. 28. Typed Abstract Syntax Trees For instance, for x * 2: The distinction whether a tree is typed or untyped is pretty important and merits being reflected in the type of the AST itself.
  29. 29. From Untyped to Typed Trees
      Idea: parameterize the type Tree of ASTs with the attribute info it carries.
      Typed tree:   tpd.Tree = Tree[Type]
      Untyped tree: untpd.Tree = Tree[Nothing]
      This leads to the following class:
          class Tree[T] {
            def tpe: T
            def withType(t: Type): Tree[Type]
          }
  30. 30. Question of Variance
      • Question: Which of the following two subtype relationships should hold?
          tpd.Tree <: untpd.Tree
          untpd.Tree <: tpd.Tree
      • What is the more useful relationship? (the first)
      • What relationship do the variance rules imply? (the second)
          class Tree[? T] { def tpe: T ... }
  31. 31. Fixing class Tree
          class Tree[-T] {
            def tpe: T @uncheckedVariance
            def withType(t: Type): Tree[Type]
          }
      Interesting exception to the variance rules related to the bottom type Nothing. What can go “wrong” here? Given an untpd.Tree, I expect Nothing, but I might get a Type. Shows that it’s good to have an escape hatch in the form of @uncheckedVariance.
  32. 32. Types • Types carry most of the essential information of trees and symbols. • Two kinds of types. – Value types: Int, Int => Int, (Boolean, String) – Types of definitions: (x: Int)Int, Lo..Hi, Class(...) • Represented as subtypes of the same type “Type” for convenience. 32
  33. 33. References
          case class Select(qual: Tree, name: Name) {
            // what is its tpe?
          }
          case class Ident(name: Name) {
            // what is its tpe?
          }
      • Normally, these tree nodes would carry a “symbol”, which acts as a reference to some definition.
      • But there are no symbol attributes in dsc, for good reason.
  34. 34. Traditional Scheme 34 That’s not very functional!
  35. 35. A Question of Meaning Question: What is the meaning of obj.fun ? It depends on the period! Does that mean that obj.fun has different types, depending on period? No, trees are immutable! 35
  36. 36. References 36 • A reference is a type • It contains (only) – a name – potentially a prefix • References are immutable, they exist forever.
  37. 37. What about Overloads? The name of a TermRef may be shared by several overloaded members of a class. How do we determine which member is meant? (In a nutshell, that’s why overloading is so universally hated by compiler writers) Trick: Allow “signature” as part of term names. 37
  38. 38. What Does A Reference Reference? Surely, a symbol? No! References capture more than a symbol And sometimes they do not refer to a unique symbol at all. 38
  39. 39. References capture more than a symbol.
      Consider:
          class C[T] { def f(x: T): T }
          val prefix = new C[Int]
      Then prefix.f resolves to C’s f, but at type (Int)Int, not (T)T. Both pieces of information are part of the meaning of prefix.f.
  40. 40. References
      Sometimes references point to no symbol at all. We have already seen overloading. Here’s another example using union types, which are newly supported by dsc:
          class A { def f: Int }
          class B { def f: Int }
          val prefix: A | B = if (...) new A else new B
          prefix.f
      What symbol is referenced by prefix.f?
  41. 41. Denotations The meaning of a reference is a denotation. Non-overloaded denotations carry symbols (maybe) and types (always). 41
  42. 42. What Then Is A Symbol? A symbol represents a declaration in some source file. It “lives” as long as the source file is unchanged. It has a denotation depending on the period. 42
  43. 43. Denotation Transformers • How do we compute new denotations from old ones? • For references pre.f: Can recompute the member at new phase. • For symbols? uncurry.transDenot(<(x: A)(y: B): C>) = <(x: A, y: B): C> 43
  44. 44. Caching Denotations Symbols are memoized functions: Period -> Denotation. Keep all denotations of a symbol at different phases as a ring.
  45. 45. Putting it all Together • ER diagram of core compiler architecture [diagram not reproduced in the transcript]
  46. 46. Lessons Learned (Not done yet, still learning) • Think databases for modeling. • Think FP for transformations. • Get efficiency through low-level techniques (caching) • But take care not to compromise the high-level semantics. 46
  47. 47. To Find Out More 47
  48. 48. How to make it Fast • Caching – Symbols cache last denotation – NamedTypes do the same – Caches are stamped with validity interval (current period until the next denotation transformer kicks in). – Need to update only if outside of validity period – Member lookup caches denotation Not yet tried: Parallelization. - Could be hard (similar to chess programs) 48
  49. 49. Many forms of Caches • Lazy vals • Memoization • LRU Caches • Rely on – Purely functional semantics – Access to low-level imperative implementation code. – Important to keep the levels of abstractions apart! 49
  50. 50. Optimization: Phase Fusion • For modularity reasons, phases should be small. Each phase should do one self-contained transform. • But that means we end up with many phases. • Problem: Repeated tree rewriting is a performance killer. • Solution: Automatically fuse phases into one tree traversal. – Relies on design pattern and some small amount of introspection.
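
The slides on time-indexed values, symbols as memoized functions from periods to denotations, and caches stamped with a validity interval can be illustrated with a small sketch. This is not dsc's actual code: the names `Period`, `Denotation`, `Sym`, `sigOfF`, the `recomputations` counter, and the concrete phase ids are all assumptions made up for illustration, and signatures are modelled as plain strings. A single cached denotation stands in for the ring of denotations mentioned in the talk.

```scala
// Hypothetical, simplified model; none of these names are dsc's real API.
object PeriodSketch {
  // A point in compiler time: which run, and which phase within that run.
  final case class Period(runId: Int, phaseId: Int)

  // A denotation stamped with the interval of phases (within one run) where it holds.
  final case class Denotation(signature: String, runId: Int, firstPhase: Int, lastPhase: Int) {
    def validAt(p: Period): Boolean =
      p.runId == runId && firstPhase <= p.phaseId && p.phaseId <= lastPhase
  }

  // A symbol behaves like a memoized function Period => Denotation.
  // It caches only the last denotation and recomputes when the requested
  // period falls outside that denotation's validity interval.
  final class Sym(val name: String, compute: Period => Denotation) {
    private var last: Option[Denotation] = None
    var recomputations = 0 // exposed only so the caching effect can be observed

    def denot(p: Period): Denotation = last match {
      case Some(d) if d.validAt(p) => d
      case _ =>
        recomputations += 1
        val d = compute(p)
        last = Some(d)
        d
    }
  }

  // Assumed phase ids; the talk only says they range from 1 (parser) to about 50 (backend).
  val Parser = 1
  val Uncurry = 20
  val Erasure = 30
  val Backend = 50

  // Signature of C.f from the slides. The edit that adds the second
  // parameter list is assumed to happen between run 1 and run 2.
  def sigOfF(p: Period): Denotation = {
    val edited = p.runId >= 2
    if (p.phaseId < Uncurry)
      Denotation(if (edited) "(x: T)(y: T): T" else "(x: T): T", p.runId, Parser, Uncurry - 1)
    else if (p.phaseId < Erasure)
      Denotation(if (edited) "(x: T, y: T): T" else "(x: T): T", p.runId, Uncurry, Erasure - 1)
    else
      Denotation(if (edited) "(x: Any, y: Any): Any" else "(x: Any): Any", p.runId, Erasure, Backend)
  }

  def main(args: Array[String]): Unit = {
    val f = new Sym("C.f", sigOfF)
    println(f.denot(Period(1, Parser)).signature)
    println(f.denot(Period(2, Erasure)).signature)
  }
}
```

Asking for `C.f` at two periods inside the same validity interval, for example `Period(2, Erasure)` and `Period(2, Backend)`, reuses the cached denotation, which is the point of stamping caches with an interval rather than a single period.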

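The variance question from the slides on `Tree` can be tried out in a few lines. The sketch below is a toy reconstruction under stated assumptions: `Type` is a stand-in case class, the `label` field, the `Untyped` and `Typed` aliases (playing the role of `untpd.Tree` and `tpd.Tree`), and the `describe` helper are invented for the example, and the stored type is kept as `Any` and cast on access, which is not necessarily how dsc does it.

```scala
import scala.annotation.unchecked.uncheckedVariance

// Hypothetical toy version of the Tree class from the slides.
object TreeSketch {
  final case class Type(show: String) // stand-in for the compiler's Type

  // Contravariant in T, so Tree[Type] <: Tree[Nothing]:
  // every typed tree can be used where an untyped tree is expected.
  class Tree[-T](val label: String, rawTpe: Any) {
    // T appears in covariant position here; the annotation is the escape
    // hatch that silences the variance check.
    def tpe: T @uncheckedVariance = rawTpe.asInstanceOf[T]

    def withType(t: Type): Tree[Type] = new Tree[Type](label, t)
  }

  type Untyped = Tree[Nothing] // plays the role of untpd.Tree
  type Typed = Tree[Type]      // plays the role of tpd.Tree

  // Written against untyped trees, but accepts typed trees as well.
  def describe(t: Untyped): String = t.label
}
```

With these definitions, `describe(untyped.withType(Type("Int")))` type-checks because of the contravariance. The price is the one named on the slide: calling `tpe` on a value statically typed as `Untyped` promises `Nothing` even though a `Type` may be stored, so such calls are best avoided.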