Spark - The Ultimate Scala Collections by Martin Odersky

Spark Summit

Spark - The Ultimate Scala Collections
Martin Odersky
Scala is
• functional
• object-oriented
• statically typed
• interoperates well with Java and JavaScript
1st Invariant: A Scalable Language
• Flexible Syntax
• Flexible Types
• User-definable operators
• Higher-order functions
• Implicits
...
Make it relatively easy to build new DSLs on top of Scala.
DSLs on top of Scala
SBT
Chisel
Spray
Dispatch
Akka
ScalaTest
Squeryl
Specs
shapeless
Scalaz
Slick
Spiral
Opti{X}
Scala and Spark
[Architecture diagram: the Scala REPL and Scala Compiler sit on top of the Spark Runtime and Scala Runtime, which run on the JVM, alongside a Cluster Manager (e.g. YARN, Mesos) and a File System (e.g. HDFS, Cassandra).]
- A domain specific language
- Implemented in Scala
- Embedded in Scala as a host language
Two Reasons for DSLs
(1) Support new syntax.
(2) Support new functionality.
I find (2) much more interesting than (1).
What Kind of DSL is Spark?
• Centered around collections.
• Immutable data sets equipped with functional transformers:
map, flatMap, filter, reduce, fold, aggregate, union, intersection, ...
These are exactly the Scala collection operations!
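As a hedged illustration (not from the slides), here is the same small pipeline written twice: once on a plain Scala collection and once on a Spark RDD. The RDD half assumes a SparkContext named sc is already in scope.

  // Minimal sketch: the same vocabulary on a local collection and on an RDD.
  // The Spark half assumes an existing SparkContext named sc (not shown here).
  val nums: List[Int] = List(1, 2, 3, 4, 5)

  // Plain Scala collection (strict, local):
  val localResult: Int = nums.map(_ * 2).filter(_ > 4).reduce(_ + _)

  // Spark RDD (lazy, distributed), written with the same operations:
  val rdd = sc.parallelize(nums)
  val clusterResult: Int = rdd.map(_ * 2).filter(_ > 4).reduce(_ + _)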
Spark vs Scala Collections
• So, is Spark exactly Scala collections, but running in a cluster?
• Not quite. There are two main differences:
• Spark is lazy, Scala collections are strict.
• Spark has added functionality, e.g. PairRDDs.
Collections Design Choices
• Imperative vs functional: java.util vs scala.collection.immutable
• Strict vs lazy: Scala, OCaml and C# collections are strict; Spark, Scala Streams and views are lazy.
Strict vs Lazy
val xs = data.map(f)
xs.filter(p).sum
xs.take(10).toString
Strict: map is evaluated once and produces the intermediate list xs.
Lazy: map is evaluated twice (once per traversal), but no intermediate list is built.
Lazy needs a purely functional architecture.
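A hedged sketch (Scala 2.13 syntax) that makes the difference observable by counting how often f runs; the counter exists only for illustration.

  var calls = 0
  val f = (x: Int) => { calls += 1; x * 2 }
  val p = (x: Int) => x > 2
  val data = (1 to 100).toList

  // Strict: f runs once per element and the intermediate list xs is built.
  val xs = data.map(f)
  xs.filter(p).sum
  xs.take(10).mkString(",")
  println(calls)                  // 100

  // Lazy (view): no intermediate list, but f runs again on every traversal.
  calls = 0
  val xsView = data.view.map(f)
  xsView.filter(p).sum            // first traversal: 100 calls
  xsView.take(10).mkString(",")   // second traversal: 10 more calls
  println(calls)                  // 110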
How Scala Can Learn from Spark
• Redo views for Scala collections:
val ys = xs
.view
.map(f)
.filter(p)
.force
Lifting to views makes it clear when things get evaluated.
• Add cache() operation for persistence
How Scala Can Learn from Spark
• Redo views for Scala collections:
val ys = xs
.view
.map(f)
.filter(p)
.to(Vector)
Lifting to views makes it clear when things get evaluated.
• Add cache() operation for persistence
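To make the view pipeline above concrete, a hedged, runnable sketch with example definitions for xs, f and p (Scala 2.13 collections, where .to(Vector) forces the evaluation).

  // Nothing is computed until .to(Vector) at the end.
  val xs = (1 to 1000).toList
  def f(x: Int): Int = x * x
  def p(x: Int): Boolean = x % 3 == 0

  val pipeline = xs.view.map(f).filter(p)    // only a description, no work yet
  val ys: Vector[Int] = pipeline.to(Vector)  // evaluation happens here, in one pass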
How Scala Can Learn from Spark
• Add pairwise operations such as
reduceByKey
groupByKey
lookup
• These work on sequences of pairs, analogously to PairRDDs (a hypothetical sketch follows below).
• This provides a lightweight way to add map-like operations to sequences.
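There is no such API in the standard library yet; the following is a hypothetical sketch of what these pairwise operations on a sequence of pairs could look like. PairSeqOps and its methods are made-up names for illustration.

  // Hypothetical sketch, not an existing standard-library API:
  // pairwise operations on Seq[(K, V)], analogous to PairRDDs.
  implicit class PairSeqOps[K, V](pairs: Seq[(K, V)]) {
    def reduceByKey(op: (V, V) => V): Map[K, V] =
      pairs.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).reduce(op) }

    def groupByKey: Map[K, Seq[V]] =
      pairs.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2) }

    def lookup(key: K): Seq[V] =
      pairs.collect { case (k, v) if k == key => v }
  }

  // Usage:
  val sales = Seq(("apples", 3), ("pears", 2), ("apples", 5))
  sales.reduceByKey(_ + _)   // Map(apples -> 8, pears -> 2)
  sales.groupByKey           // Map(apples -> List(3, 5), pears -> List(2))
  sales.lookup("apples")     // List(3, 5)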
2nd Scala Invariant: It’s about the Types
• Most of the advanced type concepts are about flexibility, less so about safety.
[Diagram: a plane with axes Flexibility / Ease of Use and Safety, showing where Scala sits, the current trend in type systems, the goals of PL design, and where we’d like it to move.]
Spark is a Multi-Language Platform
Supported Programming Languages:
• Python
• Scala
• Java
• R
Why Scala instead of Python?
• Native to Spark, can use everything without translation.
• Types help (a lot).
Why Types Help
Functional operations do not have hidden dependencies:
- inputs are parameters
- outputs are results
• Hence, every interaction is given a type.
• Logic errors usually translate into type errors.
• This is a great asset for programming, and the same should hold for data science!
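A hedged illustration of the point: because inputs are parameters and outputs are results, a typical logic error (aggregating the wrong field) simply fails to type-check. The record and field names below are made up.

  // Illustrative only; Reading and its fields are made-up names.
  case class Reading(sensor: String, celsius: Double)

  val readings = List(Reading("a", 21.5), Reading("b", 19.0))

  // Intended computation: the average temperature.
  val avg: Double = readings.map(_.celsius).sum / readings.size

  // A typical logic error, averaging the wrong field, is already a type error:
  // readings.map(_.sensor).sum does not compile, because there is no way to
  // sum a List[String].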
How can Scala help Spark?
• Developer experience
• Infrastructure
• Spores
• Fusion
Infrastructure
[The same architecture diagram as before (Scala REPL and Compiler, Spark Runtime, Scala Runtime, JVM, Cluster Manager, File System), annotated with infrastructure work items: JLine handling and autocompletion in the REPL, use of lambdas and default methods, serialization, and Java 8 streams.]
Spores
• Problem: Closures use Java serialization and can drag in a very large dependency graph.
class C {
  val data = someLargeCollection
  val sum: Int = data.sum
  doLater(x => println(sum))
}
All of data is serialized with the closure: sum is a field, so the closure captures the whole enclosing instance of C, including data.
Spores: Safely serializable closures
Idea: Make explicit what a closure captures:
class C {
  val data = someLargeCollection
  val sum: Int = data.sum
  doLater {
    val s = sum
    x => println(s)
  }
}
Spores
Idea: Make explicit what a closure captures:
class C {
  val data = someLargeCollection
  val sum: Int = data.sum
  doLater {
    Spore {
      val s = sum
      x => println(s)
    }
  }
}
Spores make sure no hidden dependencies remain.
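To make the mechanism concrete, here is a hypothetical sketch of the idea (not the actual spores library API): what the function captures becomes an explicit, serializable value, so nothing from the enclosing instance can sneak in.

  // Hypothetical sketch, not the spores library: the captured environment
  // is an explicit value (env), carried by the function itself.
  class ExplicitCapture[E, T, R](val env: E, body: (E, T) => R)
      extends (T => R) with Serializable {
    def apply(x: T): R = body(env, x)
  }

  class C(someLargeCollection: Seq[Int]) {
    val data: Seq[Int] = someLargeCollection
    val sum: Int = data.sum

    // Only the Int value of sum travels with the function; neither data nor
    // the enclosing instance of C is referenced by it.
    val task: Int => Unit =
      new ExplicitCapture[Int, Int, Unit](sum, (s, x) => println(s + x))
  }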
Fusion
• Problem: Many Spark jobs are compute bound.
  (Kay Ousterhout, “Making sense of Spark performance”)
- Lazy collections avoid intermediate results.
- Overhead because every operation is a separate closure.
• Fusion cuts through this: it can combine several closures into one tight loop.
• Spark implements fusion for selected operations using DataFrames.
• We would like to do the same for general Scala code.
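A hedged sketch of what fusion means at the code level: the closures given to map and filter are combined into one loop with no intermediate collection. The fused version is written by hand here to show the shape of code an optimizer would aim to generate.

  // Unfused: two traversals and one intermediate list.
  def unfused(xs: List[Int], f: Int => Int, p: Int => Boolean): List[Int] =
    xs.map(f).filter(p)

  // Fused by hand: one tight loop, the two closures combined, no intermediate list.
  def fusedByHand(xs: List[Int], f: Int => Int, p: Int => Boolean): List[Int] = {
    val out = List.newBuilder[Int]
    var rest = xs
    while (rest.nonEmpty) {
      val y = f(rest.head)
      if (p(y)) out += y
      rest = rest.tail
    }
    out.result()
  }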
The Dotty Linker
• A smart, whole-program optimizer
• Analyses and tunes the program and its dependencies:
• removes dead code
• eliminates virtual dispatch, boxing, closures
PageRank example, 80K nodes:
            Bytecode size          Objects allocated   Virtual calls   CPU branch miss rate   Running time
Original    60KB app + 5.5MB lib   3M                  24M             12%                    3.2 sec
Optimized   920KB                  490K                3M              3%                     1.7 sec
Instead of a Summary
• Spark and Scala are beautiful examples of what can be achieved by a bunch of dedicated grad students, using a language & system originally written by another bunch of dedicated grad students.
• I am looking forward to the next steps of their co-evolution.
Thank You
Martin Odersky