2. Scala is
• functional
• object-oriented
• statically typed
• interoperates well with Java and JavaScript
3. 1st Invariant: A Scalable Language
• Flexible Syntax
• Flexible Types
• User-definable operators
• Higher-order functions
• Implicits
...
Make it relatively easy to build new DSLs on top of
Scala
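Combined, these features let ordinary library code read like a new language. A minimal illustrative sketch (hypothetical example, not from the talk): a user-defined + operator plus an implicit class give a tiny duration DSL in plain Scala.

```scala
// Hypothetical mini-DSL: user-definable operators + implicits
// give "natural" syntax on top of plain Scala.
case class Duration(millis: Long) {
  // user-defined operator
  def +(other: Duration): Duration = Duration(millis + other.millis)
}

// Implicit enrichment: every Int gains .seconds and .minutes.
implicit class DurationInt(private val n: Int) {
  def seconds: Duration = Duration(n * 1000L)
  def minutes: Duration = Duration(n * 60000L)
}

val total = 2.minutes + 30.seconds // reads like a sentence
// total.millis == 150000
```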
4. DSLs on top of Scala
SBT
Chisel
Spray
Dispatch
Akka
ScalaTest
Squeryl
Specs
shapeless
Scalaz
Slick
Spiral
Opti{X}
5. Scala and Spark
[Architecture diagram: Scala REPL and Scala Compiler on top of the Scala
Runtime and the Spark Runtime; below them a Cluster Manager (e.g. Yarn,
Mesos) and a File System (e.g. HDFS, Cassandra); everything runs on the JVM]
Spark is:
- a domain-specific language
- implemented in Scala
- embedded in Scala as a host language
6. Two Reasons for DSLs
(1) Support new syntax.
(2) Support new functionality.
I find (2) much more interesting than (1).
7. What Kind of DSL is Spark?
• Centered around collections.
• Immutable data sets equipped with functional
transformers.
map reduce union
flatMap fold intersection
filter aggregate ...
These are exactly the Scala collection operations!
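Indeed, all of them are available on an ordinary List out of the box (a quick standard-library-only check):

```scala
val xs = List(1, 2, 3, 4, 5)

xs.map(_ * 2)               // List(2, 4, 6, 8, 10)
xs.flatMap(x => List(x, x)) // List(1, 1, 2, 2, 3, 3, 4, 4, 5, 5)
xs.filter(_ % 2 == 0)       // List(2, 4)
xs.reduce(_ + _)            // 15
xs.fold(0)(_ + _)           // 15
xs.union(List(6, 7))        // List(1, 2, 3, 4, 5, 6, 7)
xs.intersect(List(3, 4, 9)) // List(3, 4)
```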
8. Spark vs Scala Collections
• So, is Spark exactly Scala Collections, but
running in a cluster?
• Not quite. Two main differences:
• Spark is lazy, Scala collections are strict.
• Spark has added functionality, e.g. PairRDDs.
10. Strict vs Lazy
val xs = data.map(f)
xs.filter(p).sum
xs.take(10).toString
Strict: map is evaluated once,
produces intermediate list xs
Lazy: map is evaluated twice,
no intermediate list.
Lazy needs a purely functional architecture.
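The difference can be observed directly with a side-effecting counter (a sketch using Scala 2.13 views in the role of the lazy collection):

```scala
// Observing strict vs lazy evaluation with counters.
var strictCount = 0
var lazyCount = 0

val data = (1 to 5).toList

// Strict: map runs once and builds the intermediate list xs.
val xs = data.map { x => strictCount += 1; x * 2 }
xs.filter(_ > 4).sum // no re-evaluation of the mapping function
xs.take(3)           // no re-evaluation either
// strictCount == 5

// Lazy: each traversal re-runs map; no intermediate list is built.
val ys = data.view.map { x => lazyCount += 1; x * 2 }
ys.filter(_ > 4).sum // evaluates the mapping 5 times
ys.take(3).toList    // evaluates it 3 more times
// lazyCount == 8
```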
11. How Scala Can Learn from Spark
• Redo views for Scala collections:
val ys = xs
.view
.map(f)
.filter(p)
.force
Lifting to views makes it clear when things get
evaluated.
• Add cache() operation for persistence
12. How Scala Can Learn from Spark
• Redo views for Scala collections:
val ys = xs
.view
.map(f)
.filter(p)
.to(Vector)
Lifting to views makes it clear when things get
evaluated.
• Add cache() operation for persistence
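With the redesigned views that shipped in Scala 2.13, this pipeline works essentially as shown on the slide (a small runnable sketch):

```scala
val xs = (1 to 10).toList

val ys = xs
  .view              // lift: nothing is evaluated yet
  .map(_ * 2)
  .filter(_ % 3 == 0)
  .to(Vector)        // force back into a strict collection
// ys == Vector(6, 12, 18)
```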
13. How Scala Can Learn from Spark
• Add pairwise operations such as
reduceByKey
groupByKey
lookup
• These work on sequences of pairs,
analogously to PairRDDs.
• Provides a lightweight way to add map-like
operations to sequences.
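A lightweight sketch of how such operations could be retrofitted onto ordinary sequences with an implicit class (hypothetical names, mirroring the PairRDD API):

```scala
// Hypothetical PairRDD-style operations on plain sequences of pairs.
implicit class PairSeqOps[K, V](private val xs: Seq[(K, V)]) {
  def reduceByKey(f: (V, V) => V): Map[K, V] =
    xs.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).reduce(f) }
  def groupByKey: Map[K, Seq[V]] =
    xs.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2) }
  def lookup(key: K): Seq[V] =
    xs.collect { case (k, v) if k == key => v }
}

val words = Seq("a" -> 1, "b" -> 1, "a" -> 1)
words.reduceByKey(_ + _) // Map(a -> 2, b -> 1)
words.lookup("a")        // Seq(1, 1)
```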
14. 2nd Scala Invariant: It’s about the Types
• Most of the advanced type concepts are about flexibility,
less so about safety.
[Chart: Flexibility/Ease of Use vs. Safety, plotting Scala against the
trend in type systems and the goals of PL design, with an arrow toward
where we’d like it to move]
15. Spark is A Multi-Language Platform
Supported Programming Languages:
• Python
• Scala
• Java
• R
Why Scala instead of Python?
• Native to Spark, can use
everything without translation.
• Types help (a lot).
16. Why Types Help
Functional operations do not have hidden
dependencies
- inputs are parameters
- outputs are results
• Hence, every interaction is given a type.
• Logic errors usually translate into type errors.
• This is a great asset for programming, and the
same should hold for data science!
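A tiny illustration (hypothetical data, not from the talk): mixing up fields is caught at compile time instead of producing a silently wrong result.

```scala
case class User(name: String, age: Int)
val users = List(User("Ann", 34), User("Bo", 27))

val ages: List[Int] = users.map(_.age)     // ok: every interaction is typed
// val oops: List[Int] = users.map(_.name) // compile error: found List[String]
val meanAge = ages.sum.toDouble / ages.length
// meanAge == 30.5
```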
17. How can Scala help Spark?
• Developer experience
• Infrastructure
• Spores
• Fusion
19. Spores
• Problem: Closures use Java serialization, which can
drag along a very large dependency graph.
class C {
val data = someLargeCollection
val sum: Int = data.sum
doLater(x => println(sum))
}
All of data is serialized with the closure: the closure
captures this, and with it the data field.
20. Spores: Safely serializable closures
Idea: Make explicit what a closure captures:
class C {
  val data = someLargeCollection
  val sum: Int = data.sum
  doLater {
    val s = sum
    x => println(s)
  }
}
21. Spores
Idea: Make explicit what a closure captures:
class C {
  val data = someLargeCollection
  val sum: Int = data.sum
  doLater {
    Spore {
      val s = sum
      x => println(s)
    }
  }
}
Spores make sure no hidden dependencies
remain.
22. Fusion
• Problem: Many Spark jobs are compute bound.
Kay Ousterhout “Making sense of Spark performance”
- Lazy collections avoid intermediate results.
- Overhead because every operation is a separate closure.
23. Fusion
• Fusion cuts through this, can combine several
closures in a tight loop.
• Spark implements fusion for selected operations
using dataframes.
• Would like to do the same for general Scala code.
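By hand, fusion amounts to collapsing a pipeline of closures into one loop. The sketch below shows the transformation such an optimizer would perform automatically (illustrative only):

```scala
val xs = (1 to 1000).toList

// Unfused: two traversals, two closures, one intermediate list.
val unfused = xs.map(_ * 2).filter(_ % 3 == 0).sum

// Fused: a single tight loop, no intermediate collection.
var fused = 0
val it = xs.iterator
while (it.hasNext) {
  val y = it.next() * 2
  if (y % 3 == 0) fused += y
}
// unfused == fused
```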
24. The Dotty Linker
• A smart, whole program optimizer
• Analyses and tunes program and dependencies.
• removes dead code
• eliminates virtual dispatch, boxing, closures
PageRank example, 80K nodes:
            Bytecode size         Objects    Virtual  CPU branch  Running
                                  allocated  calls    miss rate   time
Original    60KB app + 5.5MB lib  3M         24M      12%         3.2 sec
Optimized   920KB                 490K       3M       3%          1.7 sec
25. Instead of a Summary
• Spark and Scala are beautiful examples of what
can be achieved by a bunch of dedicated grad
students using a language & system originally
written by another bunch of dedicated grad
students.
• I am looking forward to the next steps of their co-evolution.