Iterative Spark Development at Bloomberg
Nimbus Goehausen
Spark Platform Engineer
ngoehausen@bloomberg.net
Copyright 2016 Bloomberg L.P. All rights reserved.
A Data Pipeline
source → Process → output
output = Process(source)
Spark development cycle
• Make code change
• Deploy jar (1m)
• Start spark-submit / spark-shell / notebook (1m)
• Wait for execution (1m – several hours)
• Observe results
This should work, right?
source → Process → output
output = Process(source)
What have I done to deserve this?
source → Process → output
java.lang.HotGarbageException: You messed up
at some.obscure.library.Util$$anonfun$morefun$1.apply(Util.scala:eleventyBillion)
at another.library.InputStream$$somuchfun$mcV$sp.read(InputStream.scala:42)
at org.apache.spark.serializer.PotatoSerializer.readPotato(PotatoSerializer.scala:2792)
…
Split it up
source → Stage1 → Stage1 Output → Stage2 → output
(split into two jobs: Main1 runs Stage1 and saves its output; Main2 loads it and runs Stage2)
stage1Out = stage1(source)
stage1Out.save("stage1Path")
stage1Out = load("stage1Path")
output = stage2(stage1Out)
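For concreteness, a minimal sketch of this manual split using plain Spark APIs (the paths, element types, and the two driver entry points here are illustrative):

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Main1: run Stage1 and persist its output so later runs can skip it.
def runMain1(sc: SparkContext, stage1: RDD[String] => RDD[String]): Unit = {
  val source: RDD[String] = sc.textFile("hdfs:///data/source")      // illustrative path
  val stage1Out: RDD[String] = stage1(source)
  stage1Out.saveAsObjectFile("hdfs:///data/stage1Out")              // illustrative path
}

// Main2: reload Stage1's output and iterate on Stage2 alone.
def runMain2(sc: SparkContext, stage2: RDD[String] => RDD[String]): Unit = {
  val stage1Out: RDD[String] = sc.objectFile[String]("hdfs:///data/stage1Out")
  val output: RDD[String] = stage2(stage1Out)
  output.saveAsObjectFile("hdfs:///data/output")                    // illustrative path
}

This buys faster iteration on Stage2, but at the cost of managing intermediate paths and extra entry points by hand.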
Automatic checkpoint
source → Stage1 → Stage1 Output → Stage2 → output
stage1Out = stage1(source).checkpoint()
output = stage2(stage1Out)
First run
source → Stage1 → Stage1 Output → Stage2 → output
(nothing is checkpointed yet: both stages run and Stage1's output is saved)
stage1Out = stage1(source).checkpoint()
output = stage2(stage1Out)
Second run
source → Stage1 → Stage1 Output → Stage2 → output
(Stage1's output is loaded from the checkpoint, so only Stage2 runs)
stage1Out = stage1(source).checkpoint()
output = stage2(stage1Out)
Change logic
source → modified Stage1 → modified Stage1 Output → Stage2 → output
(Stage1's logic changed, so a new checkpoint is written; the old Stage1 Output is left untouched)
stage1Out = stage1(source).checkpoint()
output = stage2(stage1Out)
How do we do this?
• Need some way of automatically generating a path to save to or load from (sketched after this list).
• Should stay the same when we want to reload the same data.
• Should change when we want to generate new data.
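One way to satisfy all three requirements is to derive the path from a logical signature of the data, which the following slides develop. A minimal sketch, with an illustrative base directory:

// Same inputs + same logic => same signature => same path (reload).
// Different inputs or logic => new signature => new path (regenerate).
def checkpointPath(signature: String): String =
  s"hdfs:///checkpoints/$signature"   // illustrative base directory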
Refresher: what defines an RDD
Dependencies
Logic
Result
Refresher: what defines an RDD
Dependencies: [1,2,3,4,5…]
Logic: map(x => x*2)
Result: [2,4,6,8,10…]
A logical signature
result = logic(dependencies)
signature(result) = hash(signature(dependencies) + hash(logic))
A logical signature
val rdd2 = rdd1.map(f)
signature(rdd2) == hash(signature(rdd1) + hash(f))
val sourceRDD = sc.textFile(uri)
signature(sourceRDD) == hash("textFile:" + uri)
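A minimal sketch of these two rules (helper names are illustrative, not the actual implementation):

import java.security.MessageDigest

// Hex-encoded SHA-1 of a string; stands in for hash(...) above.
def hashString(s: String): String =
  MessageDigest.getInstance("SHA-1")
    .digest(s.getBytes("UTF-8"))
    .map("%02x".format(_))
    .mkString

// signature(rdd2) == hash(signature(rdd1) + hash(f))
def mapSignature(parentSignature: String, logicHash: String): String =
  hashString(parentSignature + logicHash)

// signature(sourceRDD) == hash("textFile:" + uri)
def textFileSignature(uri: String): String =
  hashString("textFile:" + uri)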
Wait, how do you hash a function?
val f = (x: Int) => x * 2
f compiles to a class file of JVM bytecode.
Hash those bytes.
Change f to
val f = (x: Int) => x * 3
And the class file will change.
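For example, one way to get at those bytes (a sketch that assumes the closure's class file is visible on the classpath, as it is for Scala 2.11-style anonymous function classes; not necessarily how the actual implementation works):

import java.security.MessageDigest

// Hash the bytecode of the class a function literal compiles to.
def hashFunctionClass(f: AnyRef): String = {
  val resource = f.getClass.getName.replace('.', '/') + ".class"
  val in = f.getClass.getClassLoader.getResourceAsStream(resource)
  val bytes = Stream.continually(in.read()).takeWhile(_ != -1).map(_.toByte).toArray
  in.close()
  MessageDigest.getInstance("SHA-1").digest(bytes).map("%02x".format(_)).mkString
}

val double = (x: Int) => x * 2
val triple = (x: Int) => x * 3
// hashFunctionClass(double) != hashFunctionClass(triple): different class files, different hashes.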
What if the function references other
functions or static methods?
• f = (x: Int) => Values.v + staticMethod(x) + g(x)
• This is the hard part.
• Must follow dependencies recursively, identify
all classfiles that the function depends on.
• Make a combined hash of the classfiles and
any runtime values.
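A rough sketch of that recursive walk; referencedClasses and classBytes are hypothetical helpers that would inspect the bytecode (for example with a bytecode library), and runtimeValues stands in for captured values such as Values.v:

import java.security.MessageDigest

// Hypothetical: names of classes referenced by a given class's bytecode.
def referencedClasses(className: String): Set[String] = ???
// Hypothetical: raw bytes of a class file, located via the classloader.
def classBytes(className: String): Array[Byte] = ???

// Collect every class reachable from the function's class, then fold all of
// their bytes plus any captured runtime values into a single hash.
def combinedHash(rootClass: String, runtimeValues: Seq[String]): String = {
  def reachable(todo: List[String], seen: Set[String]): Set[String] = todo match {
    case Nil                  => seen
    case c :: rest if seen(c) => reachable(rest, seen)
    case c :: rest            => reachable(referencedClasses(c).toList ++ rest, seen + c)
  }
  val digest = MessageDigest.getInstance("SHA-1")
  reachable(List(rootClass), Set.empty).toSeq.sorted.foreach(c => digest.update(classBytes(c)))
  runtimeValues.foreach(v => digest.update(v.getBytes("UTF-8")))
  digest.digest().map("%02x".format(_)).mkString
}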
Putting this together
source → Stage1 → Stage1 Output → Stage2 → output
stage1Out = stage1(source).checkpoint()
output = stage2(stage1Out)
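Conceptually, checkpoint() can now derive its own path from the logical signature: if data for that signature already exists it is reloaded, otherwise it is computed and saved. A rough sketch of that decision (paths and helper shape are illustrative):

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

// Reuse the checkpoint if one exists for this signature; otherwise compute and save it.
def checkpointBySignature[T: ClassTag](sc: SparkContext, signature: String,
                                       compute: => RDD[T]): RDD[T] = {
  val path = s"hdfs:///checkpoints/$signature"   // illustrative base directory
  val fs = FileSystem.get(sc.hadoopConfiguration)
  if (fs.exists(new Path(path))) {
    sc.objectFile[T](path)                       // later runs: load, skip recomputation
  } else {
    compute.saveAsObjectFile(path)               // first run: materialize
    sc.objectFile[T](path)
  }
}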
How to switch in checkpoints
• One option: change Spark itself.
• Another option: build on top of Spark.
Distributed Collection (DC)
• Mostly the same API as RDD: map, flatMap, and so on.
• Can be instantiated without a Spark context.
• Pass in a Spark context to get the corresponding RDD.
• Automatically computes its logical signature.
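For illustration, the surface of such an API might look roughly like this (a hypothetical sketch, not the actual interface):

import scala.reflect.ClassTag
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Hypothetical outline: defined without a SparkContext, carries a logical
// signature, and only materializes an RDD when a context is supplied.
trait DC[A] extends Serializable {
  def signature: String                    // hash of dependencies + logic
  def map[B: ClassTag](f: A => B): DC[B]
  def filter(p: A => Boolean): DC[A]
  def checkpoint(): DC[A]                  // save/reload keyed by signature
  def getRDD(sc: SparkContext): RDD[A]     // materialize with a Spark context
}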
RDD vs DC
val numbers: DC[Int] = parallelize(1 to 10)
val filtered: DC[Int] = numbers.filter(_ < 5)
val doubled: DC[Int] = filtered.map(_ * 2)
val doubledRDD: RDD[Int] = doubled.getRDD(sc)

val numbers: RDD[Int] = sc.parallelize(1 to 10)
val filtered: RDD[Int] = numbers.filter(_ < 5)
val doubled: RDD[Int] = filtered.map(_ * 2)
Easy checkpoint
val numbers: DC[Int] = parallelize(1 to 10)
val filtered: DC[Int] = numbers.filter(_ < 5).checkpoint()
val doubled: DC[Int] = filtered.map(_ * 2)
Problem: Non-lazy definition
val numbers: RDD[Int] = sc.parallelize(1 to 10)
val doubles: RDD[Double] = numbers.map(x => x.toDouble)
val sum: Double = doubles.sum()
val normalized: RDD[Double] = doubles.map(x => x / sum)
We can’t get a signature of `normalized` without computing the sum!
Solution: Deferred Result
• Defined by a function on an RDD
• Like a DC but contains only one element
• Pass in a Spark context to compute the result
• Also has a logical signature
• We can now express entirely lazy pipelines
Deferred Result example
val numbers: DC[Int] = parallelize(1 to 10)
val doubles: DC[Double] = numbers.map(x => x.toDouble)
val sum: DR[Double] = doubles.sum()
val normalized: DC[Double] = doubles.withResult(sum).map{case (x, s) => x / s}
We can compute a signature for `normalized` without computing a sum.
Improving the workflow
• Avoid adding complexity in order to speed things up.
• Get the benefits of avoiding redundant computation, but avoid the costs of managing intermediate results.
• Further automatic behaviors to explore, such as automatic failure traps and generation of common statistics.
Can I use this now?
• Not yet, come talk to me if you’re interested.
• Open source release to come.
Questions?
Nimbus Goehausen
ngoehausen@bloomberg.net
@nimbusgo on Twitter / Github
techatbloomberg.com



Editor's Notes

  • #2 Intro Bloomberg: premier data platform for financial professionals. Primary emphasis on accurate realtime results. Many data sources, lots of activity. Also a need for offline batch processing to build machine learning models and do backtesting on large datasets. Going to talk about data pipeline development
  • #3 Very broad view of an arbitrary data pipeline. Talking about Scala, in a library, making a jar.
  • #4 5 cycle steps, make code change … Typical dev cycle if you’re making edits to a library. Certainly you would want to avoid checking every change with a full cluster run. Compiler checks, unit tests, are your friend. Running locally with a small sample is great, but you won’t catch everything. Plenty of problems that are difficult to anticipate or test without running on the cluster. 1 in 10000 element issues, things that blow up with a lot of data. Oh it works locally? That’s great but I really want to know that it works on the full dataset. We want to speed this up, let’s target the worst bottleneck, typically execution time.
  • #5 We’re in this cycle, it compiles, we’ve tested locally, we’ve got unit tests and we’re about to run on the cluster.
  • #6 If this doesn’t take long to run, no worries, just iterate quickly. If this does take a while to run, you’ve got problems.
  • #7 We’ve solved one problem, iterate quicker on later stages. We’ve introduced new problems by persisting intermediate results.