Spark - The Ultimate Scala Collections by Martin Odersky
1.
Spark
-
The Ultimate Scala
Collections
Martin Odersky
2.
Scala is
• functional
• object-oriented
• statically typed
• interoperates well with Java and JavaScript
3.
1st Invariant: A Scalable Language
• Flexible Syntax
• Flexible Types
• User-definable operators
• Higher-order functions
• Implicits
...
Make it relatively easy to build new DSLs on top of
Scala
4.
DSLs on top of Scala
SBT
Chisel
Spray
Dispatch
Akka
ScalaTest
Squeryl
Specs
shapeless
Scalaz
Slick
Spiral
Opti{X}
5.
Scala and Spark
[Architecture diagram: the Scala REPL and Scala Compiler sit on top of the Spark Runtime and Scala Runtime, which run on the JVM, over a cluster manager (e.g. YARN, Mesos) and a file system (e.g. HDFS, Cassandra).]
- A domain-specific language
- Implemented in Scala
- Embedded in Scala as a host language
6.
Two Reasons for DSLs
(1) Support new syntax.
(2) Support new functionality.
I find (2) much more interesting than (1).
7.
What Kind of DSL is Spark?
• Centered around collections.
• Immutable data sets equipped with functional
transformers.
map reduce union
flatMap fold intersection
filter aggregate ...
These are exactly the Scala collection operations!
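For illustration, a minimal sketch (assuming a live SparkContext named sc, as in the Spark shell): the same pipeline written against a Scala collection and against a Spark RDD is structurally identical.
val words = List("spark", "scala", "spark")

// Scala collections: runs locally
val local = words.map(w => (w, 1)).filter(_._2 > 0)

// Spark RDD: the same operations, distributed across the cluster
val distributed = sc.parallelize(words).map(w => (w, 1)).filter(_._2 > 0)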
8.
Spark vs Scala Collections
• So, is Spark exactly Scala Collections, but
running in a cluster?
• Not quite. Two main differences
• Spark is lazy, Scala collections are strict.
• Spark has added functionality, e.g. PairRDDs.
9.
Collections Design Choices
Imperative vs functional: java.util vs scala.collection.immutable
Strict vs lazy: Scala, OCaml, and C# collections are strict;
Spark, Scala Streams, and views are lazy.
10.
Strict vs Lazy
val xs = data.map(f)
xs.filter(p).sum
xs.take(10).toString
Strict: map is evaluated once,
produces intermediate list xs
Lazy: map is evaluated twice,
no intermediate list.
Lazy needs a purely functional architecture.
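A small self-contained demonstration of the difference, using plain Scala 2.13 and a counting function f (hypothetical, just to make the evaluations visible):
var calls = 0
val f = (x: Int) => { calls += 1; x * 2 }
val data = List(1, 2, 3)

val strictXs = data.map(f)      // f runs 3 times, builds an intermediate list
strictXs.sum                    // no further calls to f
strictXs.take(2)                // still calls == 3

calls = 0
val lazyXs = data.view.map(f)   // nothing evaluated yet
lazyXs.sum                      // f runs 3 times
lazyXs.take(2).toList           // f runs 2 more times: calls == 5, no intermediate list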
11.
How Scala Can Learn from Spark
• Redo views for Scala collections:
val ys = xs
.view
.map(f)
.filter(p)
.force
Lifting to views makes it clear when things get
evaluated.
• Add cache() operation for persistence
12.
How Scala Can Learn from Spark
• Redo views for Scala collections:
val ys = xs
.view
.map(f)
.filter(p)
.to(Vector)
Lifting to views makes it clear when things get
evaluated.
• Add cache() operation for persistence
13.
How Scala Can Learn from Spark
• Add pairwise operations such as
reduceByKey
groupByKey
lookup
• These work on sequences of pairs,
analogously to PairRDDs.
• Provides a lightweight way to add map-like
operations to sequences.
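A hypothetical sketch of what such operations could look like as extension methods on sequences of pairs (the names mirror Spark's PairRDD API; none of this is in the standard library, and groupMapReduce/groupMap require Scala 2.13+):
implicit class PairSeqOps[K, V](xs: Seq[(K, V)]) {
  def reduceByKey(op: (V, V) => V): Map[K, V] =
    xs.groupMapReduce(_._1)(_._2)(op)
  def groupByKey: Map[K, Seq[V]] =
    xs.groupMap(_._1)(_._2)
  def lookup(key: K): Seq[V] =
    xs.collect { case (`key`, v) => v }
}

// Usage:
val pairs = Seq("a" -> 1, "b" -> 2, "a" -> 3)
pairs.reduceByKey(_ + _)   // Map(a -> 4, b -> 2)
pairs.groupByKey           // Map(a -> Seq(1, 3), b -> Seq(2))
pairs.lookup("a")          // Seq(1, 3)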
14.
2nd Scala Invariant: It’s about the Types
• Most of the advanced type concepts are about flexibility,
less so about safety.
[Chart: Scala plotted against axes of flexibility/ease of use and safety, together with the trend in type systems, the goals of PL design, and where we’d like Scala to move.]
15.
Spark is a Multi-Language Platform
Supported Programming Languages:
• Python
• Scala
• Java
• R
Why Scala instead of Python?
• Native to Spark, can use
everything without translation.
• Types help (a lot).
16.
Why Types Help
Functional operations do not have hidden
dependencies
- inputs are parameters
- outputs are results
• Hence, every interaction is given a type.
• Logic errors usually translate into type errors.
• This is a great asset for programming, and the
same should hold for data science!
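A tiny illustration with hypothetical data: each functional step has a typed result, so mixing up stages fails at compile time rather than at run time.
val readings: List[(String, Double)] = List("a" -> 1.0, "b" -> 2.5)

val scaled: List[(String, Double)] =
  readings.map { case (k, v) => (k, v * 2) }

// scaled.map(_.toUpperCase)   // logic error caught as a type error:
//                             // elements are (String, Double), not String
val report: List[String] =
  scaled.map { case (k, v) => s"$k: $v" }   // ok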
17.
How can Scala help Spark?
• Developer experience
• Infrastructure
• Spores
• Fusion
19.
Spores
• Problem: Closures use Java serialization and can
drag in a very large dependency graph.
class C {
  val data = someLargeCollection
  val sum: Int = data.sum
  // `sum` is a field, so the closure captures `this`, and with it `data`
  doLater(x => println(sum))
}
All of data is serialized in the closure.
20.
Spores: Safely serializable closures
Idea: Make explicit what a closure captures:
class C {
  val data = someLargeCollection
  val sum: Int = data.sum
  doLater {
    val s = sum   // copy into a local so the closure no longer captures `this`
    x => println(s)
  }
}
21.
Spores
Idea: Make explicit what a closure captures:
class C {
  val data = someLargeCollection
  val sum: Int = data.sum
  doLater {
    Spore {
      val s = sum   // captured values are listed explicitly
      x => println(s)
    }
  }
}
Spores make sure no hidden dependencies
remain.
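A sketch of how this could look with the experimental spores library from SIP-21 (the package name, the spore macro, and the Spore type are taken from the proposal and may differ between versions; doLater is a hypothetical stand-in for shipping work elsewhere):
import scala.spores._

object SporeDemo {
  def doLater(job: Spore[Int, Unit]): Unit = ()   // e.g. serialize and run remotely

  val data: Vector[Int] = Vector.fill(1000000)(1)
  val sum: Int = data.sum

  doLater(spore {
    val s = sum                  // captures are declared explicitly up front
    (x: Int) => println(s + x)   // referring to `data` or `sum` here would not compile
  })
}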
22.
Fusion
• Problem: Many Spark jobs are compute bound.
Kay Ousterhout “Making sense of Spark performance”
- Lazy collections avoid intermediate results.
- Overhead because every operation is a separate closure.
23.
Fusion
• Problem: Many Spark jobs are compute bound.
Kay Ousterhout “Making sense of Spark performance”
- Lazy collections avoid intermediate results.
- Overhead because every operation is a separate closure.
• Fusion cuts through this, can combine several
closures in a tight loop.
• Spark implements fusion for selected operations
using dataframes.
• Would like to do the same for general Scala code.
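A hedged sketch of the contrast, assuming a live SparkSession named spark: each RDD operation below carries its own closure, while the DataFrame version is declarative, so Catalyst can plan and fuse it into generated code.
import org.apache.spark.sql.functions.col

// RDD version: map and filter are two separate closures invoked per element
val a = spark.sparkContext.parallelize(1 to 1000000)
  .map(_ * 2)
  .filter(_ % 3 == 0)
  .count()

// DataFrame version: expressions Catalyst can fuse into one tight loop
val b = spark.range(1000000L).toDF("n")
  .select(col("n") * 2 as "m")
  .filter(col("m") % 3 === 0)
  .count()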
24.
The Dotty Linker
• A smart, whole program optimizer
• Analyses and tunes program and dependencies.
• removes dead code
• eliminates virtual dispatch, boxing, closures
PageRank example, 80K nodes:

            Bytecode size          Objects    Virtual   CPU branch   Running
                                   allocated  calls     miss rate    time
Original    60KB app + 5.5MB lib   3M         24M       12%          3.2 sec
Optimized   920KB                  490K       3M        3%           1.7 sec
25.
Instead of a Summary
• Spark and Scala are beautiful examples of what
can be achieved by a bunch of dedicated grad
students using a language & system originally
written by another bunch of dedicated grad
students.
• I am looking forward to the next steps of their co-evolution.