
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Iterative Spark Development at Bloomberg, Nimbus Goehausen, Senior Software Engineer, Bloomberg


This presentation explores how Bloomberg uses Spark, with its computational model for distributed, high-performance analytics, to speed up iterative development, and looks at one practice the team is currently developing to increase efficiency: the introduction of a logical signature for datasets.



  1. Iterative Spark Development at Bloomberg. Nimbus Goehausen, Spark Platform Engineer, ngoehausen@bloomberg.net. Copyright 2016 Bloomberg L.P. All rights reserved.
  2. A Data Pipeline: source → Process → output; output = Process(source)
  3. Spark development cycle: make code change → deploy jar (1m) → start spark-submit / spark shell / notebook (1m) → wait for execution (1m – several hours) → observe results.
  4. This should work, right? source → Process → output; output = Process(source)
  5. What have I done to deserve this? source → Process → error:
     java.lang.HotGarbageException: You messed up
       at some.obscure.library.Util$$anonfun$morefun$1.apply(Util.scala:eleventyBillion)
       at another.library.InputStream$$somuchfun$mcV$sp.read(InputStream.scala:42)
       at org.apache.spark.serializer.PotatoSerializer.readPotato(PotatoSerializer.scala:2792)
       …
  6. Split it up: source → Stage1 → Stage1 Output → Stage2 → output, split across two programs.
     Main1: stage1Out = stage1(source); stage1Out.save("stage1Path")
     Main2: stage1Out = load("stage1Path"); output = stage2(stage1Out)
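     A minimal sketch of how this manual split might look with plain Spark APIs, as two separate driver programs. The paths and the bodies of stage1/stage2 below are illustrative stand-ins, not from the deck:

     import org.apache.spark.SparkContext
     import org.apache.spark.rdd.RDD

     object Main1 {
       def main(args: Array[String]): Unit = {
         val sc = new SparkContext()
         val source: RDD[String] = sc.textFile("hdfs:///data/source")
         val stage1Out: RDD[String] = source.map(_.toUpperCase)   // stand-in for stage1
         stage1Out.saveAsTextFile("hdfs:///data/stage1Out")       // persist the intermediate result
       }
     }

     object Main2 {
       def main(args: Array[String]): Unit = {
         val sc = new SparkContext()
         val stage1Out: RDD[String] = sc.textFile("hdfs:///data/stage1Out")  // reload the intermediate
         val output: RDD[String] = stage1Out.filter(_.nonEmpty)   // stand-in for stage2
         output.saveAsTextFile("hdfs:///data/output")
       }
     }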
  7. Automatic checkpoint: source → Stage1 → Stage1 Output → Stage2 → output.
     stage1Out = stage1(source).checkpoint()
     output = stage2(stage1Out)
  8. First run: Stage1 executes and its output is saved to the checkpoint.
     stage1Out = stage1(source).checkpoint()
     output = stage2(stage1Out)
  9. Second run: Stage1's output is reloaded from the checkpoint, so only Stage2 executes.
     stage1Out = stage1(source).checkpoint()
     output = stage2(stage1Out)
  10. Change logic: with Stage1 and Stage2 modified, the old Stage1 Output no longer matches; a new Stage1 Output is generated.
     stage1Out = stage1(source).checkpoint()
     output = stage2(stage1Out)
  11. How do we do this?
     • Need some way of automatically generating a path to save to or load from.
     • Should stay the same when we want to reload the same data.
     • Should change when we want to generate new data.
  12. Refresher: what defines an RDD? Dependencies + Logic → Result
  13. Refresher: what defines an RDD? Dependencies [1,2,3,4,5…] + Logic map(x => x*2) → Result [2,4,6,8,10…]
  14. A logical signature:
     result = logic(dependencies)
     signature(result) = hash(signature(dependencies) + hash(logic))
  15. A logical signature:
     val rdd2 = rdd1.map(f)
     signature(rdd2) == hash(signature(rdd1) + hash(f))
     val sourceRDD = sc.textFile(uri)
     signature(sourceRDD) == hash("textFile:" + uri)
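     A minimal sketch of this signature rule using SHA-1 over strings; the helper names are illustrative, not the library's actual API:

     import java.security.MessageDigest

     def hashString(s: String): String =
       MessageDigest.getInstance("SHA-1")
         .digest(s.getBytes("UTF-8"))
         .map("%02x".format(_)).mkString

     // signature of a source dataset read from a URI
     def sourceSignature(uri: String): String = hashString("textFile:" + uri)

     // signature of a derived dataset: hash of the dependency's signature plus the logic's hash
     def derivedSignature(depSignature: String, logicHash: String): String =
       hashString(depSignature + logicHash)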
  16. Wait, how do you hash a function?
     val f = (x: Int) => x * 2
     f compiles to a class file of JVM bytecode; hash those bytes. Change f to
     val f = (x: Int) => x * 3
     and the class file will change.
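     A hedged sketch of hashing a function by hashing its compiled class file, assuming the closure compiles to its own class that is loadable as a classpath resource (as with Scala 2.11 anonymous function classes); an illustration, not the team's implementation:

     import java.security.MessageDigest

     // Read the .class file bytes for a loaded class from the classpath.
     def classBytes(cls: Class[_]): Array[Byte] = {
       val resource = "/" + cls.getName.replace('.', '/') + ".class"
       val in = cls.getResourceAsStream(resource)
       try Iterator.continually(in.read()).takeWhile(_ != -1).map(_.toByte).toArray
       finally in.close()
     }

     // Hash a function value by hashing the bytecode of the class it compiled to.
     def hashFunction(f: AnyRef): String =
       MessageDigest.getInstance("SHA-1")
         .digest(classBytes(f.getClass))
         .map("%02x".format(_)).mkString

     val f = (x: Int) => x * 2
     val sig = hashFunction(f)   // changes if the body changes to x * 3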
  17. What if the function references other functions or static methods?
     • f = (x: Int) => Values.v + staticMethod(x) + g(x)
     • This is the hard part.
     • Must follow dependencies recursively, identify all classfiles that the function depends on.
     • Make a combined hash of the classfiles and any runtime values.
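     One rough way to find such references (a sketch, not the team's approach) is to scan the function's bytecode with a library like ASM, collect the classes it calls into, and recurse over those; runtime values like Values.v would still need to be hashed separately:

     import org.objectweb.asm.{ClassReader, ClassVisitor, MethodVisitor, Opcodes}
     import scala.collection.mutable

     // Collect the owning classes of all method calls that appear in one class's bytecode.
     def referencedClasses(classfile: Array[Byte]): Set[String] = {
       val refs = mutable.Set[String]()
       new ClassReader(classfile).accept(new ClassVisitor(Opcodes.ASM5) {
         override def visitMethod(access: Int, name: String, desc: String,
                                  sig: String, exceptions: Array[String]): MethodVisitor =
           new MethodVisitor(Opcodes.ASM5) {
             override def visitMethodInsn(opcode: Int, owner: String, name: String,
                                          desc: String, itf: Boolean): Unit =
               refs += owner.replace('/', '.')   // record the class that owns each called method
           }
       }, 0)
       refs.toSet
     }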
  18. Putting this together: source → Stage1 → Stage1 Output → Stage2 → output.
     stage1Out = stage1(source).checkpoint()
     output = stage2(stage1Out)
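     A rough sketch of what putting this together could look like: derive the checkpoint path from the logical signature, reload on the second run, and recompute when the signature changes. The helper below is hypothetical, not the actual API, and builds on the hashString sketch above:

     import org.apache.hadoop.fs.{FileSystem, Path}
     import org.apache.spark.SparkContext
     import org.apache.spark.rdd.RDD

     // Hypothetical helper: reload the checkpoint if it exists, otherwise compute and save it.
     def checkpointOrReload(sc: SparkContext, signature: String)(compute: => RDD[String]): RDD[String] = {
       val path = s"hdfs:///checkpoints/$signature"   // path derived from the logical signature
       val fs = FileSystem.get(sc.hadoopConfiguration)
       if (fs.exists(new Path(path))) {
         sc.textFile(path)                            // second run: same signature, reload
       } else {
         val rdd = compute                            // first run, or changed logic (new signature)
         rdd.saveAsTextFile(path)
         rdd
       }
     }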
  19. How to switch in checkpoints
     • One option: change Spark.
     • Other option: build on top of Spark.
  20. Distributed Collection (DC)
     • Mostly the same API as RDD: map, flatMap, and so on.
     • Can be instantiated without a SparkContext.
     • Pass in a SparkContext to get the corresponding RDD.
     • Automatically computes its logical signature.
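     A simplified, illustrative sketch of what such a wrapper might look like, reusing the hashString and hashFunction helpers sketched above; a stand-in for the idea, not the actual Bloomberg API:

     import org.apache.spark.SparkContext
     import org.apache.spark.rdd.RDD
     import scala.reflect.ClassTag

     // A lazily defined distributed collection: carries a logical signature and only
     // produces a concrete RDD once a SparkContext is supplied.
     abstract class DC[T: ClassTag](val signature: String) extends Serializable {
       def getRDD(sc: SparkContext): RDD[T]

       def map[U: ClassTag](f: T => U): DC[U] = {
         val parent = this
         new DC[U](hashString(parent.signature + hashFunction(f))) {   // signature(rdd2) = hash(signature(rdd1) + hash(f))
           def getRDD(sc: SparkContext): RDD[U] = parent.getRDD(sc).map(f)
         }
       }
     }

     def parallelize[T: ClassTag](data: Seq[T]): DC[T] =
       new DC[T](hashString("parallelize:" + data.mkString(","))) {
         def getRDD(sc: SparkContext): RDD[T] = sc.parallelize(data)
       }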
  21. RDD vs DC
     With DC:
       val numbers: DC[Int] = parallelize(1 to 10)
       val filtered: DC[Int] = numbers.filter(_ < 5)
       val doubled: DC[Int] = filtered.map(_ * 2)
       val doubledRDD: RDD[Int] = doubled.getRDD(sc)
     With RDD:
       val numbers: RDD[Int] = sc.parallelize(1 to 10)
       val filtered: RDD[Int] = numbers.filter(_ < 5)
       val doubled: RDD[Int] = filtered.map(_ * 2)
  22. Easy checkpoint
     val numbers: DC[Int] = parallelize(1 to 10)
     val filtered: DC[Int] = numbers.filter(_ < 5).checkpoint()
     val doubled: DC[Int] = filtered.map(_ * 2)
  23. Problem: non-lazy definition
     val numbers: RDD[Int] = sc.parallelize(1 to 10)
     val doubles: RDD[Double] = numbers.map(x => x.toDouble)
     val sum: Double = doubles.sum()
     val normalized: RDD[Double] = doubles.map(x => x / sum)
     We can't get a signature of `normalized` without computing the sum!
  24. Solution: Deferred Result
     • Defined by a function on an RDD.
     • Like a DC, but contains only one element.
     • Pass in a SparkContext to compute the result.
     • Also has a logical signature.
     • We can now express entirely lazy pipelines.
  25. Deferred Result example
     val numbers: DC[Int] = parallelize(1 to 10)
     val doubles: DC[Double] = numbers.map(x => x.toDouble)
     val sum: DR[Double] = doubles.sum()
     val normalized: DC[Double] = doubles.withResult(sum).map { case (x, s) => x / s }
     We can compute a signature for `normalized` without computing a sum.
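     An illustrative sketch of the Deferred Result idea, building on the DC and hash helpers sketched above; the deck's own API (e.g. withResult) is not reproduced here, and the names below are stand-ins:

     import org.apache.spark.SparkContext

     // A deferred single value: like a DC, but holds one element, with its own signature.
     abstract class DR[T](val signature: String) extends Serializable {
       def get(sc: SparkContext): T
     }

     // Example: a deferred sum over a DC[Double]. Its signature depends only on its parent's,
     // so downstream signatures can be computed without actually running the sum.
     def deferredSum(parent: DC[Double]): DR[Double] =
       new DR[Double](hashString(parent.signature + "sum")) {
         def get(sc: SparkContext): Double = parent.getRDD(sc).sum()
       }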
  26. Improving the workflow
     • Avoid adding complexity in order to speed things up.
     • Get the benefits of avoiding redundant computation, but avoid the costs of managing intermediate results.
     • Further automatic behaviors to explore, such as automatic failure traps and generation of common statistics.
  27. Can I use this now?
     • Not yet; come talk to me if you're interested.
     • Open source release to come.
  28. Questions? Nimbus Goehausen, ngoehausen@bloomberg.net, @nimbusgo on Twitter / GitHub, techatbloomberg.com
