Spark - The Ultimate Scala Collections
Martin Odersky
Published in: Data & Analytics
1. Spark - The Ultimate Scala Collections
   Martin Odersky
2. Scala is
   • functional
   • object-oriented
   • statically typed
   • interoperates well with Java and JavaScript
3. 1st Invariant: A Scalable Language
   • Flexible syntax
   • Flexible types
   • User-definable operators
   • Higher-order functions
   • Implicits
   • ...
   These make it relatively easy to build new DSLs on top of Scala.
4. DSLs on top of Scala
   SBT, Chisel, Spray, Dispatch, Akka, ScalaTest, Squeryl, Specs,
   shapeless, Scalaz, Slick, Spiral, Opti{X}
5. Scala and Spark
   [Stack diagram: Scala REPL and Scala Compiler on top of the Spark
   Runtime and Scala Runtime, running on the JVM, over a Cluster Manager
   (e.g. YARN, Mesos) and a File System (e.g. HDFS, Cassandra)]
   - A domain-specific language
   - Implemented in Scala
   - Embedded in Scala as a host language
6. Two Reasons for DSLs
   (1) Support new syntax.
   (2) Support new functionality.
   I find (2) much more interesting than (1).
7. What Kind of DSL is Spark?
   • Centered around collections.
   • Immutable data sets equipped with functional transformers:
     map, flatMap, filter, reduce, fold, aggregate, union, intersection, ...
   These are exactly the Scala collection operations!
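The overlap is easy to check in plain Scala: the transformers named above run unchanged on an ordinary local collection. A minimal word-count sketch (the data and pipeline are illustrative, not from the talk):

```scala
// Ordinary Scala collections offer the same functional transformers
// that Spark exposes on RDDs.
val words = List("spark", "scala", "spark", "collections")

val counts = words
  .map(w => (w, 1))                              // pair each word with a count
  .groupBy(_._1)                                 // analogous to groupByKey
  .map { case (w, ps) => (w, ps.map(_._2).sum) } // sum the counts per word

// counts: Map(spark -> 2, scala -> 1, collections -> 1)
```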
8. Spark vs Scala Collections
   • So, is Spark exactly Scala collections, but running on a cluster?
   • Not quite. Two main differences:
     • Spark is lazy, Scala collections are strict.
     • Spark has added functionality, e.g. PairRDDs.
9. Collections Design Choices
   Imperative vs functional, strict vs lazy:

              Imperative    Functional
   Strict     java.util     scala.collection.immutable (Scala, OCaml, C#)
   Lazy                     Spark, Scala Stream, views
10. Strict vs Lazy
    val xs = data.map(f)
    xs.filter(p).sum
    xs.take(10).toString

    Strict: map is evaluated once, produces intermediate list xs.
    Lazy: map is evaluated twice, no intermediate list.
    Lazy needs a purely functional architecture.
11. How Scala Can Learn from Spark
    • Redo views for Scala collections:
      val ys = xs
        .view
        .map(f)
        .filter(p)
        .force
      Lifting to views makes it clear when things get evaluated.
    • Add a cache() operation for persistence.
12. How Scala Can Learn from Spark
    • Redo views for Scala collections:
      val ys = xs
        .view
        .map(f)
        .filter(p)
        .to(Vector)
      Lifting to views makes it clear when things get evaluated.
    • Add a cache() operation for persistence.
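This design later landed in the Scala 2.13 collections. A minimal runnable sketch of the staged pipeline, where nothing is evaluated until `.to(Vector)` forces the view:

```scala
val xs = List(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

// .view stages map and filter lazily; no intermediate collection
// is built until .to(Vector) forces the pipeline (Scala 2.13 API).
val ys = xs
  .view
  .map(x => x * x)
  .filter(_ % 2 == 0)
  .to(Vector)

// ys: Vector(4, 16, 36, 64, 100)
```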
13. How Scala Can Learn from Spark
    • Add pairwise operations such as reduceByKey, groupByKey, lookup.
    • These work on sequences of pairs, analogously to PairRDDs.
    • Provides a lightweight way to add map-like operations to sequences.
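A minimal sketch of how such a pairwise operation could be defined on plain sequences (an illustrative helper, not a standard-library method):

```scala
// reduceByKey for ordinary sequences of pairs, analogous to
// Spark's PairRDD operation: group by key, then combine the
// values of each group with the supplied operator.
def reduceByKey[K, V](pairs: Seq[(K, V)])(op: (V, V) => V): Map[K, V] =
  pairs.groupBy(_._1).map { case (k, kvs) =>
    k -> kvs.map(_._2).reduce(op)
  }

val sales = Seq(("a", 3), ("b", 4), ("a", 5))
reduceByKey(sales)(_ + _)   // Map(a -> 8, b -> 4)
```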
14. 2nd Scala Invariant: It's about the Types
    • Most of the advanced type concepts are about flexibility, less so
      about safety.
    [Chart: flexibility/ease of use vs safety; the trend in type systems
    and the goal of PL design is to move Scala toward more of both.]
15. Spark is a Multi-Language Platform
    Supported programming languages:
    • Python
    • Scala
    • Java
    • R
    Why Scala instead of Python?
    • Native to Spark, can use everything without translation.
    • Types help (a lot).
16. Why Types Help
    • Functional operations do not have hidden dependencies:
      - inputs are parameters
      - outputs are results
    • Hence, every interaction is given a type.
    • Logic errors usually translate into type errors.
    • This is a great asset for programming, and the same should hold
      for data science!
17. How Can Scala Help Spark?
    • Developer experience
    • Infrastructure
    • Spores
    • Fusion
18. Infrastructure
    [Stack diagram as on slide 5, annotated with Scala-side improvements:
    JLine handling and autocompletion in the Scala REPL; lambdas, default
    methods, Java 8 streams, and serialization in the runtime.]
19. Spores
    • Problem: Closures use Java serialization and can drag in a very
      large dependency graph.

      class C {
        val data = someLargeCollection
        val sum: Int = data.sum
        doLater(x => println(sum))
      }

      All of data is serialized with the closure.
20. Spores: Safely Serializable Closures
    Idea: Make explicit what a closure captures:

      class C {
        val data = someLargeCollection
        val sum: Int = data.sum
        doLater {
          val s = sum
          x => println(s)
        }
      }
21. Spores
    Idea: Make explicit what a closure captures:

      class C {
        val data = someLargeCollection
        val sum: Int = data.sum
        doLater {
          Spore {
            val s = sum
            x => println(s)
          }
        }
      }

    Spores make sure no hidden dependencies remain.
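The capture problem and the spore-style fix can be sketched in runnable plain Scala (illustrative only: no serialization or spores library involved; closures returning values stand in for closures shipped to executors):

```scala
class C(data: Vector[Int]) {
  val sum: Int = data.sum

  // Bad: this closure refers to `sum` through `this`, so it
  // captures the whole C instance -- serializing it would drag
  // along all of `data`.
  val bad: () => Int = () => sum

  // Spore-style fix: copy the needed value into a local first,
  // so the closure captures only a single Int.
  val good: () => Int = {
    val s = sum
    () => s
  }
}

val c = new C(Vector(1, 2, 3))
c.good()   // 6, with no reference back to `data`
```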
22. Fusion
    • Problem: Many Spark jobs are compute bound.
      (Kay Ousterhout, "Making sense of Spark performance")
      - Lazy collections avoid intermediate results.
      - Overhead because every operation is a separate closure.
23. Fusion
    • Problem: Many Spark jobs are compute bound.
      (Kay Ousterhout, "Making sense of Spark performance")
      - Lazy collections avoid intermediate results.
      - Overhead because every operation is a separate closure.
    • Fusion cuts through this: it can combine several closures into a
      tight loop.
    • Spark implements fusion for selected operations using DataFrames.
    • We would like to do the same for general Scala code.
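What fusion buys can be sketched by hand: the strict pipeline below allocates an intermediate array and invokes one closure per element per stage, while the fused loop does the same work in one pass with no allocation. This is an illustrative hand-fused version, not the Dotty linker's actual output:

```scala
val xs = Array.tabulate(1000)(_.toLong)

// Unfused: two traversals, one intermediate array, two closures.
val slow = xs.map(x => x * x).filter(_ % 2 == 0).sum

// Hand-fused equivalent: a single tight loop -- the kind of code
// an optimizer could generate from the pipeline above.
var i = 0
var fast = 0L
while (i < xs.length) {
  val sq = xs(i) * xs(i)
  if (sq % 2 == 0) fast += sq
  i += 1
}

assert(slow == fast)
```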
24. The Dotty Linker
    • A smart, whole-program optimizer.
    • Analyzes and tunes the program and its dependencies:
      • removes dead code
      • eliminates virtual dispatch, boxing, closures

    PageRank example, 80K nodes:

                Bytecode size          Objects    Virtual   CPU branch   Running
                                       allocated  calls     miss rate    time
    Original    60KB app + 5.5MB lib   3M         24M       12%          3.2 sec
    Optimized   920KB                  490K       3M        3%           1.7 sec
25. Instead of a Summary
    • Spark and Scala are beautiful examples of what can be achieved by a
      bunch of dedicated grad students using a language & system
      originally written by another bunch of dedicated grad students.
    • I am looking forward to the next steps of their co-evolution.
26. Thank You
    Martin Odersky