2. Scala is
• functional
• object-oriented
• statically typed
• interoperates well with Java and JavaScript
3. 1st Invariant: A Scalable Language
• Flexible Syntax
• Flexible Types
• User-definable operators
• Higher-order functions
• Implicits
...
Make it relatively easy to build new DSLs on top of
Scala
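Combined, these features let ordinary library code read like a new language. A minimal illustrative sketch (hypothetical example, not from the talk): a user-defined + operator plus an implicit class give a tiny duration DSL in plain Scala.

```scala
// Hypothetical mini-DSL: user-definable operators + implicits
// give "natural" syntax on top of plain Scala.
case class Duration(millis: Long) {
  // user-defined operator
  def +(other: Duration): Duration = Duration(millis + other.millis)
}

// Implicit enrichment: every Int gains .seconds and .minutes.
implicit class DurationInt(private val n: Int) {
  def seconds: Duration = Duration(n * 1000L)
  def minutes: Duration = Duration(n * 60000L)
}

val total = 2.minutes + 30.seconds // reads like a sentence
// total.millis == 150000
```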
4. DSLs on top of Scala
SBT
Chisel
Spray
Dispatch
Akka
ScalaTest
Squeryl
Specs
shapeless
Scalaz
Slick
Spiral
Opti{X}
5. Scala and Spark
[Architecture diagram: Scala REPL and Scala Compiler on top of the Scala
Runtime and the Spark Runtime; below them a Cluster Manager (e.g. Yarn,
Mesos) and a File System (e.g. HDFS, Cassandra); everything runs on the JVM]
Spark is:
- a domain-specific language
- implemented in Scala
- embedded in Scala as a host language
6. Two Reasons for DSLs
(1) Support new syntax.
(2) Support new functionality.
I find (2) much more interesting than (1).
7. What Kind of DSL is Spark?
• Centered around collections.
• Immutable data sets equipped with functional
transformers.
map reduce union
flatMap fold intersection
filter aggregate ...
These are exactly the Scala collection operations!
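Indeed, all of them are available on an ordinary List out of the box (a quick standard-library-only check):

```scala
val xs = List(1, 2, 3, 4, 5)

xs.map(_ * 2)               // List(2, 4, 6, 8, 10)
xs.flatMap(x => List(x, x)) // List(1, 1, 2, 2, 3, 3, 4, 4, 5, 5)
xs.filter(_ % 2 == 0)       // List(2, 4)
xs.reduce(_ + _)            // 15
xs.fold(0)(_ + _)           // 15
xs.union(List(6, 7))        // List(1, 2, 3, 4, 5, 6, 7)
xs.intersect(List(3, 4, 9)) // List(3, 4)
```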
8. Spark vs Scala Collections
• So, is Spark exactly Scala Collections, but
running in a cluster?
• Not quite. Two main differences:
• Spark is lazy, Scala collections are strict.
• Spark has added functionality, e.g. PairRDDs.
10. Strict vs Lazy
val xs = data.map(f)
xs.filter(p).sum
xs.take(10).toString
Strict: map is evaluated once,
produces intermediate list xs
Lazy: map is evaluated twice,
no intermediate list.
Lazy needs a purely functional architecture.
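The difference can be observed directly with a side-effecting counter (a sketch using Scala 2.13 views in the role of the lazy collection):

```scala
// Observing strict vs lazy evaluation with counters.
var strictCount = 0
var lazyCount = 0

val data = (1 to 5).toList

// Strict: map runs once and builds the intermediate list xs.
val xs = data.map { x => strictCount += 1; x * 2 }
xs.filter(_ > 4).sum // no re-evaluation of the mapping function
xs.take(3)           // no re-evaluation either
// strictCount == 5

// Lazy: each traversal re-runs map; no intermediate list is built.
val ys = data.view.map { x => lazyCount += 1; x * 2 }
ys.filter(_ > 4).sum // evaluates the mapping 5 times
ys.take(3).toList    // evaluates it 3 more times
// lazyCount == 8
```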
11. How Scala Can Learn from Spark
• Redo views for Scala collections:
val ys = xs
.view
.map(f)
.filter(p)
.force
Lifting to views makes it clear when things get
evaluated.
• Add cache() operation for persistence
12. How Scala Can Learn from Spark
• Redo views for Scala collections:
val ys = xs
.view
.map(f)
.filter(p)
.to(Vector)
Lifting to views makes it clear when things get
evaluated.
• Add cache() operation for persistence
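With the redesigned views that shipped in Scala 2.13, this pipeline works essentially as shown on the slide (a small runnable sketch):

```scala
val xs = (1 to 10).toList

val ys = xs
  .view              // lift: nothing is evaluated yet
  .map(_ * 2)
  .filter(_ % 3 == 0)
  .to(Vector)        // force back into a strict collection
// ys == Vector(6, 12, 18)
```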
13. How Scala Can Learn from Spark
• Add pairwise operations such as
reduceByKey
groupByKey
lookup
• These work on sequences of pairs,
analogously to PairRDDs.
• Provides a lightweight way to add map-like
operations to sequences.
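A lightweight sketch of how such operations could be retrofitted onto ordinary sequences with an implicit class (hypothetical names, mirroring the PairRDD API):

```scala
// Hypothetical PairRDD-style operations on plain sequences of pairs.
implicit class PairSeqOps[K, V](private val xs: Seq[(K, V)]) {
  def reduceByKey(f: (V, V) => V): Map[K, V] =
    xs.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).reduce(f) }
  def groupByKey: Map[K, Seq[V]] =
    xs.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2) }
  def lookup(key: K): Seq[V] =
    xs.collect { case (k, v) if k == key => v }
}

val words = Seq("a" -> 1, "b" -> 1, "a" -> 1)
words.reduceByKey(_ + _) // Map(a -> 2, b -> 1)
words.lookup("a")        // Seq(1, 1)
```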
14. 2nd Scala Invariant: It’s about the Types
• Most of the advanced type concepts are about flexibility,
less so about safety.
[Chart: Flexibility/Ease of Use vs. Safety, plotting Scala against the
trend in type systems and the goals of PL design, with an arrow toward
where we’d like it to move]
15. Spark is A Multi-Language Platform
Supported Programming Languages:
• Python
• Scala
• Java
• R
Why Scala instead of Python?
• Native to Spark, can use
everything without translation.
• Types help (a lot).
16. Why Types Help
Functional operations do not have hidden
dependencies
- inputs are parameters
- outputs are results
• Hence, every interaction is given a type.
• Logic errors usually translate into type errors.
• This is a great asset for programming, and the
same should hold for data science!
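A tiny illustration (hypothetical data, not from the talk): mixing up fields is caught at compile time instead of producing a silently wrong result.

```scala
case class User(name: String, age: Int)
val users = List(User("Ann", 34), User("Bo", 27))

val ages: List[Int] = users.map(_.age)     // ok: every interaction is typed
// val oops: List[Int] = users.map(_.name) // compile error: found List[String]
val meanAge = ages.sum.toDouble / ages.length
// meanAge == 30.5
```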
17. How can Scala help Spark?
• Developer experience
• Infrastructure
• Spores
• Fusion
19. Spores
• Problem: Closures use Java serialization, which can
drag along a very large dependency graph.
class C {
val data = someLargeCollection
val sum: Int = data.sum
doLater(x => println(sum))
}
All of data is serialized with the closure: the closure
captures this, and with it the data field.
20. Spores: Safely serializable closures
Idea: Make explicit what a closure captures:
class C {
  val data = someLargeCollection
  val sum: Int = data.sum
  doLater {
    val s = sum
    x => println(s)
  }
}
21. Spores
Idea: Make explicit what a closure captures:
class C {
  val data = someLargeCollection
  val sum: Int = data.sum
  doLater {
    Spore {
      val s = sum
      x => println(s)
    }
  }
}
Spores make sure no hidden dependencies
remain.
22. Fusion
• Problem: Many Spark jobs are compute bound.
Kay Ousterhout “Making sense of Spark performance”
- Lazy collections avoid intermediate results.
- Overhead because every operation is a separate closure.
23. Fusion
• Fusion cuts through this, can combine several
closures in a tight loop.
• Spark implements fusion for selected operations
using dataframes.
• Would like to do the same for general Scala code.
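By hand, fusion amounts to collapsing a pipeline of closures into one loop. The sketch below shows the transformation such an optimizer would perform automatically (illustrative only):

```scala
val xs = (1 to 1000).toList

// Unfused: two traversals, two closures, one intermediate list.
val unfused = xs.map(_ * 2).filter(_ % 3 == 0).sum

// Fused: a single tight loop, no intermediate collection.
var fused = 0
val it = xs.iterator
while (it.hasNext) {
  val y = it.next() * 2
  if (y % 3 == 0) fused += y
}
// unfused == fused
```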
24. The Dotty Linker
• A smart, whole program optimizer
• Analyses and tunes program and dependencies.
• removes dead code
• eliminates virtual dispatch, boxing, closures
PageRank example, 80K nodes:
            Bytecode size         Objects    Virtual  CPU branch  Running
                                  allocated  calls    miss rate   time
Original    60KB app + 5.5MB lib  3M         24M      12%         3.2 sec
Optimized   920KB                 490K       3M       3%          1.7 sec
25. Instead of a Summary
• Spark and Scala are beautiful examples of what
can be achieved by a bunch of dedicated grad
students using a language & system originally
written by another bunch of dedicated grad
students.
• I am looking forward to the next steps of their co-evolution.