This presentation will explore how Bloomberg uses Spark, with its formidable computational model for distributed, high-performance analytics, to take this process to the next level, and look into one of the innovative practices the team is currently developing to increase efficiency: the introduction of a logical signature for datasets.
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Iterative Spark Development at Bloomberg, Nimbus Goehausen, Senior Software Engineer, Bloomberg
1. Iterative Spark Development at Bloomberg
Nimbus Goehausen
Spark Platform Engineer
ngoehausen@bloomberg.net
Copyright 2016 Bloomberg L.P. All rights reserved.
3. Spark development cycle
• Make code change
• Deploy jar (1m)
• Start spark-submit / spark shell / notebook (1m)
• Wait for execution (1m – several hours)
• Observe results
4. This should work, right?
source → Process → output
output = Process(source)
5. What have I done to deserve this?
source → Process → output
java.lang.HotGarbageException: You messed up
at some.obscure.library.Util$$anonfun$morefun$1.apply(Util.scala:eleventyBillion)
at another.library.InputStream$$somuchfun$mcV$sp.read(InputStream.scala:42)
at org.apache.spark.serializer.PotatoSerializer.readPotato(PotatoSerializer.scala:2792)
…
6. Split it up
source → Stage1 → Stage2 → output
Main1:
stage1Out = stage1(source)
stage1Out.save("stage1Path")
Main2:
stage1Out = load("stage1Path")
output = stage2(stage1Out)
11. How do we do this?
• Need some way of automatically generating a path to save to or load from.
• Should stay the same when we want to reload the same data.
• Should change when we want to generate new data.
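The bullets above amount to content-addressed checkpointing. A minimal sketch of the idea in plain Scala (hypothetical helper names, a local filesystem and a string signature standing in for the real dataset and path scheme): derive the path from a logical signature, so reloading the same data hits the same path and changed logic produces a new one.

```scala
import java.nio.file.{Files, Path, Paths}
import java.security.MessageDigest

// Hex SHA-256 of a string; the signature scheme itself comes later in the talk.
def sha256Hex(s: String): String =
  MessageDigest.getInstance("SHA-256")
    .digest(s.getBytes("UTF-8"))
    .map("%02x".format(_)).mkString

// Same signature -> same path; new signature -> new path.
def checkpointPath(baseDir: String, signature: String): Path =
  Paths.get(baseDir, sha256Hex(signature))

// Load if a previous run already materialized this signature, else compute and save.
def loadOrCompute(baseDir: String, signature: String)(compute: => String): String = {
  val path = checkpointPath(baseDir, signature)
  if (Files.exists(path)) new String(Files.readAllBytes(path), "UTF-8")
  else {
    val result = compute
    Files.createDirectories(path.getParent)
    Files.write(path, result.getBytes("UTF-8"))
    result
  }
}
```

A string payload keeps the sketch self-contained; the real system would save and load an RDD instead.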
13. Refresher: what defines an RDD
• Dependencies: [1,2,3,4,5…]
• Logic: map(x => x*2)
• Result: [2,4,6,8,10…]
14. A logical signature
result = logic(dependencies)
signature(result) = hash(signature(dependencies) + hash(logic))
15. A logical signature
val rdd2 = rdd1.map(f)
signature(rdd2) == hash(signature(rdd1) + hash(f))
val sourceRDD = sc.textFile(uri)
signature(sourceRDD) == hash("textFile:" + uri)
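The two rules above can be sketched directly in plain Scala (hypothetical helper names; SHA-256 as the hash, and a string label standing in for the function hash described on the next slide):

```scala
import java.security.MessageDigest

// Hex SHA-256 of a string.
def hash(s: String): String =
  MessageDigest.getInstance("SHA-256")
    .digest(s.getBytes("UTF-8"))
    .map("%02x".format(_)).mkString

// A source dataset is signed by how it was read.
def sourceSignature(uri: String): String = hash("textFile:" + uri)

// A derived dataset combines its dependency's signature with a hash of its
// logic, exactly as on the slide: hash(signature(dep) + hash(logic)).
def derivedSignature(depSignature: String, logicHash: String): String =
  hash(depSignature + logicHash)
```

Same source and same logic always reproduce the same signature; changing either the source URI or the logic changes it.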
16. Wait, how do you hash a function?
val f = (x: Int) => x * 2
f compiles to a class file of JVM bytecode. Hash those bytes.
Change f to
val f = (x: Int) => x * 3
and the class file will change.
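A sketch of the bytecode-hashing step: load a class's classfile through the classloader and digest the bytes. Demonstrated on named JDK classes because, on Scala 2.12 and later, lambdas compile via invokedynamic and may not have their own classfile resource (this talk predates that change); a real implementation has to handle that case.

```scala
import java.security.MessageDigest

// Hash a class's bytecode by loading its .class resource.
def classfileHash(cls: Class[_]): String = {
  val name = cls.getName
  val simple = name.substring(name.lastIndexOf('.') + 1) + ".class"
  val in = cls.getResourceAsStream(simple)
  require(in != null, s"no classfile resource for $name")
  val bytes =
    try Iterator.continually(in.read()).takeWhile(_ != -1).map(_.toByte).toArray
    finally in.close()
  MessageDigest.getInstance("SHA-256").digest(bytes).map("%02x".format(_)).mkString
}
```

The same class always yields the same hash; a class with different bytecode yields a different one.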
17. What if the function references other functions or static methods?
• f = (x: Int) => Values.v + staticMethod(x) + g(x)
• This is the hard part.
• Must follow dependencies recursively, identifying all classfiles that the function depends on.
• Make a combined hash of the classfiles and any runtime values.
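Once the transitive set of classfiles (and captured runtime values, serialized to bytes) is collected, the pieces fold into one digest. A small sketch of that last step; sorting the per-piece digests first is one way (an assumption here, not necessarily the talk's choice) to make the result independent of traversal order:

```scala
import java.security.MessageDigest

// Combine many byte blobs (classfiles, serialized runtime values) into one
// stable hex digest. Per-piece digests are sorted so traversal order of the
// dependency graph doesn't change the result.
def combinedHash(pieces: Seq[Array[Byte]]): String = {
  val md = MessageDigest.getInstance("SHA-256")
  val digests = pieces.map(p => MessageDigest.getInstance("SHA-256").digest(p))
  digests.map(_.map("%02x".format(_)).mkString).sorted
    .foreach(h => md.update(h.getBytes("UTF-8")))
  md.digest().map("%02x".format(_)).mkString
}
```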
18. Putting this together
source → Stage1 → Stage2 → output
stage1Out = stage1(source).checkpoint()
output = stage2(stage1Out)
19. How to switch in checkpoints
• One option: change Spark.
• Other option: build on top of Spark.
20. Distributed Collection (DC)
• Mostly same API as RDD: maps, flatMaps and so on.
• Can be instantiated without a spark context.
• Pass in a spark context to get the corresponding RDD.
• Automatically computes logical signature.
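A toy model of these bullets in plain Scala (not the actual Bloomberg library, which is unreleased): a `DC` records its logic and its logical signature eagerly but materializes lazily. A `Seq` stands in for the RDD, and an explicit logic label stands in for bytecode hashing.

```scala
import java.security.MessageDigest

// Toy DC: lazy computation plus an eagerly available logical signature.
final class DC[A](compute: () => Seq[A], val signature: String) {
  def map[B](label: String)(f: A => B): DC[B] =
    new DC(() => compute().map(f), DC.hash(signature + DC.hash(label)))
  def filter(label: String)(p: A => Boolean): DC[A] =
    new DC(() => compute().filter(p), DC.hash(signature + DC.hash(label)))
  // The real thing takes a SparkContext here and returns an RDD.
  def materialize(): Seq[A] = compute()
}

object DC {
  def hash(s: String): String =
    MessageDigest.getInstance("SHA-256")
      .digest(s.getBytes("UTF-8")).map("%02x".format(_)).mkString
  def parallelize[A](xs: Seq[A]): DC[A] =
    new DC(() => xs, hash("parallelize:" + xs.toString))
}
```

Two independently built pipelines with the same source and logic agree on the signature, which is what makes signature-derived checkpoint paths reusable across runs.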
21. RDD vs DC
// With DC: no spark context needed until you ask for the RDD
val numbers: DC[Int] = parallelize(1 to 10)
val filtered: DC[Int] = numbers.filter(_ < 5)
val doubled: DC[Int] = filtered.map(_ * 2)
val doubledRDD: RDD[Int] = doubled.getRDD(sc)
// With RDD: the spark context is needed up front
val numbers: RDD[Int] = sc.parallelize(1 to 10)
val filtered: RDD[Int] = numbers.filter(_ < 5)
val doubled: RDD[Int] = filtered.map(_ * 2)
22. Easy checkpoint
val numbers: DC[Int] = parallelize(1 to 10)
val filtered: DC[Int] = numbers.filter(_ < 5).checkpoint()
val doubled: DC[Int] = filtered.map(_ * 2)
23. Problem: Non-lazy definition
val numbers: RDD[Int] = sc.parallelize(1 to 10)
val doubles: RDD[Double] = numbers.map(x => x.toDouble)
val sum: Double = doubles.sum()
val normalized: RDD[Double] = doubles.map(x => x / sum)
We can’t get a signature of `normalized` without computing the sum!
24. Solution: Deferred Result
• Defined by a function on an RDD
• Like a DC but contains only one element
• Pass in a spark context to compute the result
• Also has a logical signature
• We can now express entirely lazy pipelines
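A toy model of these bullets (hypothetical, mirroring the DC sketch rather than the real library): a `DR` holds a single deferred value, and its signature is available without running the computation.

```scala
import java.security.MessageDigest

// Toy Deferred Result: one lazy value plus an eager logical signature.
final class DR[A](compute: () => A, val signature: String) {
  def get(): A = compute() // the real thing takes a SparkContext here
}

def hashStr(s: String): String =
  MessageDigest.getInstance("SHA-256")
    .digest(s.getBytes("UTF-8")).map("%02x".format(_)).mkString

// A "sum" over a toy dataset: the signature derives from the dataset's
// signature plus a label for the aggregation logic, never from the data.
def deferredSum(data: Seq[Double], dataSignature: String): DR[Double] =
  new DR(() => data.sum, hashStr(dataSignature + hashStr("sum")))
```

Because the signature comes from the lineage rather than the value, a downstream dataset defined via a DR can be signed, and checkpoint-switched, before anything executes.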
25. Deferred Result example
val numbers: DC[Int] = parallelize(1 to 10)
val doubles: DC[Double] = numbers.map(x => x.toDouble)
val sum: DR[Double] = doubles.sum()
val normalized: DC[Double] = doubles.withResult(sum).map{case (x, s) => x / s}
We can compute a signature for `normalized` without computing a sum.
26. Improving the workflow
• Avoid adding complexity in order to speed things up.
• Get the benefits of avoiding redundant computation, but avoid the costs of managing intermediate results.
• Further automatic behaviors to explore, such as automatic failure traps and generation of common statistics.
27. Can I use this now?
• Not yet, come talk to me if you’re interested.
• Open source release to come.
Intro
Bloomberg: premier data platform for financial professionals. Primary emphasis on accurate real-time results. Many data sources, lots of activity. Also a need for offline batch processing to build machine learning models and do backtesting on large datasets.
Going to talk about data pipeline development
Very broad view of an arbitrary data pipeline
Talking about scala
In a library, making a jar
5 cycle steps, make code change …
Typical dev cycle if you’re making edits to a library.
Certainly you want to avoid checking every change with a full cluster run. Compiler checks and unit tests are your friends. Running locally with a small sample is great, but you won't catch everything: plenty of problems are difficult to anticipate or test without running on the cluster, such as 1-in-10000-element issues and things that blow up with a lot of data. It works locally? That's great, but I really want to know that it works on the full dataset.
We want to speed this up, so let's target the worst bottleneck: typically, execution time.
We’re in this cycle, it compiles, we’ve tested locally, we’ve got unit tests and we’re about to run on the cluster.
If this doesn’t take long to run, no worries, just iterate quickly.
If this does take a while to run, you’ve got problems.
We’ve solved one problem, iterate quicker on later stages.
We’ve introduced new problems by persisting intermediate results.