Beyond parallelize and collect - Spark Summit East 2016

As Spark jobs are used for more mission-critical tasks, beyond exploration, it is important to have effective tools for testing. This talk expands on “Effective Testing For Spark Programs” (which you need not have seen) to discuss how to create large-scale test jobs without depending on collect & parallelize, which limit the sizes of datasets we can work with. Testing Spark Streaming jobs can be especially challenging, as the normal techniques for loading test data don’t work and additional work must be done to collect the results and stop streaming. We will explore the difficulties of testing Streaming programs, options for setting up integration testing with Spark beyond just local mode, and best practices for acceptance tests.

  1. 1. Beyond Parallelize & Collect (Effective testing of Spark Programs) Now mostly “works”* *See developer for details. Does not imply warranty. :p
  2. 2. Who am I? My name is Holden Karau. Preferred pronouns are she/her. I’m a Software Engineer, currently at IBM and previously at Alpine, Databricks, Google, Foursquare & Amazon. Co-author of Learning Spark & Fast Data Processing with Spark. @holdenkarau, SlideShare, LinkedIn.
  3. 3. What is going to be covered: what I think I might know about you; a bit about why you should test your programs; using parallelize & collect for unit testing (quick skim); comparing datasets too large to fit in memory; considerations for Streaming & SQL (DataFrames & Datasets); cute & scary pictures (I promise at least one panda and one cat); “Future Work”.
  4. 4. Who I think you wonderful humans are: nice* people; like silly pictures; familiar with Apache Spark (if not, buy one of my books or watch Paco’s awesome video); familiar with one of Scala, Java, or Python (if you know R well I’d love to chat though); want to make better software (or models, or w/e).
  5. 5. So why should you test? Makes you a better person. Save $s: may help you avoid losing your employer all of their money (or “users” if we were in the Bay), and AWS is expensive. Waiting for our jobs to fail is a pretty long dev cycle. This is really just to guilt trip you & give you flashbacks to your QA internships.
  6. 6. So why should you test - continued Results from: Testing with Spark survey
  7. 7. So why should you test - continued Results from: Testing with Spark survey
  8. 8. Why don’t we test? It’s hard: faking data, setting up integration tests, urgh w/e. Our tests can get too slow. It takes a lot of time, and people always want everything done yesterday, or I just want to go home and see my partner, etc.
  9. 9. Cat photo from
  10. 10. An artisanal Spark unit test

          @transient private var _sc: SparkContext = _
          // Accessor used by the tests and by afterAll()
          def sc: SparkContext = _sc

          override def beforeAll() {
            _sc = new SparkContext("local[4]", "test")
            super.beforeAll()
          }

          override def afterAll() {
            if (sc != null) sc.stop()
            System.clearProperty("spark.driver.port") // rebind issue
            _sc = null
            super.afterAll()
          }

      Photo by morinesque
  11. 11. And on to the actual test...

          test("really simple transformation") {
            val input = List("hi", "hi holden", "bye")
            val expected = List(List("hi"), List("hi", "holden"), List("bye"))
            assert(tokenize(sc.parallelize(input)).collect().toList === expected)
          }

          // Split each line on spaces into a list of words
          def tokenize(f: RDD[String]) = {
            f.map(_.split(" ").toList)
          }

      Photo by morinesque
  12. 12. Wait, where were the batteries? Photo by Jim Bauer
  13. 13. Let’s get batteries! Spark unit testing: spark-testing-base, sscheck. Integration testing: spark-integration-tests (Spark internals). Performance: spark-perf (also for Spark internals). Spark job validation. Photo by Mike Mozart
  14. 14. A simple unit test re-visited (Scala)

          class SampleRDDTest extends FunSuite with SharedSparkContext {
            test("really simple transformation") {
              val input = List("hi", "hi holden", "bye")
              val expected = List(List("hi"), List("hi", "holden"), List("bye"))
              assert(SampleRDD.tokenize(sc.parallelize(input)).collect().toList === expected)
            }
          }
  15. 15. Ok but what about problems @ scale? Maybe our program works fine on our local-sized input. If we are using Spark, our actual workload is probably huge. How do we test workloads too large for a single machine? We can’t just use parallelize and collect. Qfamily
  16. 16. Distributed “set” operations to the rescue*. Pretty close, and already built into Spark. Doesn’t do so well with floating points :( damn floating points keep showing up everywhere :p Doesn’t really handle duplicates very well: {“coffee”, “coffee”, “panda”} != {“panda”, “coffee”} but with set operations... Matti Mattila
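A minimal sketch of the duplicate problem from this slide, assuming a local-mode SparkContext. `SetStyleComparison` and `looksEqual` are names made up for illustration; they are not part of Spark or spark-testing-base. The check uses `RDD.subtract` in both directions, which is one plausible reading of "set operations" here.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical helper for illustration only.
object SetStyleComparison {
  // True when neither RDD contains an element that is absent from the other.
  // subtract ignores how many times an element occurs, so duplicates go unnoticed.
  def looksEqual(expected: RDD[String], result: RDD[String]): Boolean =
    expected.subtract(result).isEmpty() && result.subtract(expected).isEmpty()

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("set-style-comparison")
    val sc = new SparkContext(conf)
    try {
      val expected = sc.parallelize(Seq("coffee", "coffee", "panda"))
      val result = sc.parallelize(Seq("panda", "coffee"))
      // The two RDDs differ in their number of "coffee" entries,
      // yet the set-style check cannot see that.
      println(s"set-style comparison says equal: ${looksEqual(expected, result)}")
    } finally {
      sc.stop()
    }
  }
}
```

Both differences come back empty for this input, which is why a count-aware comparison is needed when duplicates matter.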
  17. 17. Or use RDDComparisions:

          // Returns the first pair of elements that differ, if any
          def compareWithOrderSamePartitioner[T: ClassTag](expected: RDD[T], result: RDD[T]): Option[(T, T)] = {
            expected.zip(result).filter { case (x, y) => x != y }.take(1).headOption
          }

      Matti Mattila
  18. 18. Or use RDDComparisions:

          // Returns the first element whose occurrence counts differ, with both counts
          def compare[T: ClassTag](expected: RDD[T], result: RDD[T]): Option[(T, Int, Int)] = {
            val expectedKeyed = expected.map(x => (x, 1)).reduceByKey(_ + _)
            val resultKeyed = result.map(x => (x, 1)).reduceByKey(_ + _)
            expectedKeyed.cogroup(resultKeyed).filter { case (_, (i1, i2)) =>
              i1.isEmpty || i2.isEmpty || i1.head != i2.head
            }.take(1).headOption.map { case (v, (i1, i2)) =>
              (v, i1.headOption.getOrElse(0), i2.headOption.getOrElse(0))
            }
          }

      Matti Mattila
  19. 19. But where do we get the data for those tests? If you have production data you can sample, you are lucky! If possible, you can try and save it in the same format. If our data is a bunch of Vectors or Doubles, Spark’s got tools :) Coming up with good test data can take a long time. Lori Rielly
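The slide does not say which tools it means. One candidate, and this is my assumption, is MLlib's `RandomRDDs`, which generates distributed RDDs of random Doubles or Vectors without going through parallelize. A minimal sketch, with `RandomTestData` and its method names invented for illustration:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.random.RandomRDDs
import org.apache.spark.rdd.RDD

// Hypothetical wrapper; only RandomRDDs itself comes from Spark MLlib.
object RandomTestData {
  // Standard-normal doubles, generated on the executors across 4 partitions.
  def doubles(sc: SparkContext, size: Long, seed: Long): RDD[Double] =
    RandomRDDs.normalRDD(sc, size, 4, seed)

  // Vectors with `cols` standard-normal entries each.
  def vectors(sc: SparkContext, rows: Long, cols: Int, seed: Long): RDD[Vector] =
    RandomRDDs.normalVectorRDD(sc, rows, cols, 4, seed)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("random-test-data")
    val sc = new SparkContext(conf)
    try {
      println(s"generated ${doubles(sc, 100000L, seed = 42L).count()} doubles")
      println(s"first vector: ${vectors(sc, 1000L, 3, seed = 42L).first()}")
    } finally {
      sc.stop()
    }
  }
}
```

Passing a fixed seed is a design choice that helps keep a failing test reproducible.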
  20. 20. QuickCheck / ScalaCheck. QuickCheck generates test data under a set of constraints. The Scala version is ScalaCheck, supported by the two unit testing libraries for Spark: sscheck (awesome people*, supports generating DStreams too!) and spark-testing-base (also awesome people*, generates more pathological RDDs, e.g. with empty partitions). tara hunt
  21. 21. With spark-testing-base

          test("map should not change number of elements") {
            forAll(RDDGenerator.genRDD[String](sc)) { rdd =>
              rdd.map(_.length).count() == rdd.count()
            }
          }
  22. 22. Testing streaming…. Photo by Steve Jurvetson
  23. 23. Artisanal Stream Testing Code

          // Setup our Stream:
          class TestInputStream[T: ClassTag](@transient var sc: SparkContext,
              ssc_ : StreamingContext, input: Seq[Seq[T]], numPartitions: Int)
            extends FriendlyInputDStream[T](ssc_) {

            def start() {}
            def stop() {}

            def compute(validTime: Time): Option[RDD[T]] = {
              logInfo("Computing RDD for time " + validTime)
              val index = ((validTime - ourZeroTime) / slideDuration - 1).toInt
              val selectedInput = if (index < input.size) input(index) else Seq[T]()
              // lets us test cases where RDDs are not created
              if (selectedInput == null) {
                return None
              }
              val rdd = sc.makeRDD(selectedInput, numPartitions)
              logInfo("Created RDD " + rdd.id + " with " + selectedInput)
              Some(rdd)
            }
          }

          trait StreamingSuiteBase extends FunSuite with BeforeAndAfterAll with Logging
            with SharedSparkContext {
            // Name of the framework for Spark context
            def framework: String = this.getClass.getSimpleName
            // Master for Spark context
            def master: String = "local[4]"
            // Batch duration
            def batchDuration: Duration = Seconds(1)
            // Directory where the checkpoint data will be saved
            lazy val checkpointDir = {
              val dir = Utils.createTempDir()
              logDebug(s"checkpointDir: $dir")
              dir.toString
            }
            // Default after function for any streaming test suite. Override this
            // if you want to add your stuff to "after" (i.e., don't call after { } )
            override def afterAll() {
              System.clearProperty("spark.streaming.clock")
              super.afterAll()
            }

      Photo by Steve Jurvetson
  24. 24. and continued….

            /**
             * Create an input stream for the provided input sequence. This is done using
             * TestInputStream as queueStream's are not checkpointable.
             */
            def createTestInputStream[T: ClassTag](sc: SparkContext, ssc_ : TestStreamingContext,
                input: Seq[Seq[T]]): TestInputStream[T] = {
              new TestInputStream(sc, ssc_, input, numInputPartitions)
            }

            // Default before function for any streaming test suite. Override this
            // if you want to add your stuff to "before" (i.e., don't call before { } )
            override def beforeAll() {
              if (useManualClock) {
                logInfo("Using manual clock")
                // We can specify our own clock
                conf.set("spark.streaming.clock", "org.apache.spark.streaming.util.TestManualClock")
              } else {
                logInfo("Using real clock")
                conf.set("spark.streaming.clock", "org.apache.spark.streaming.util.SystemClock")
              }
              super.beforeAll()
            }

            /**
             * Run a block of code with the given StreamingContext and automatically
             * stop the context when the block completes or when an exception is thrown.
             */
            def withOutputAndStreamingContext[R](outputStreamSSC: (TestOutputStream[R], TestStreamingContext))
                (block: (TestOutputStream[R], TestStreamingContext) => Unit): Unit = {
              val outputStream = outputStreamSSC._1
              val ssc = outputStreamSSC._2
              try {
                block(outputStream, ssc)
              } finally {
                try {
                  ssc.stop(stopSparkContext = false)
                } catch {
                  case e: Exception => logError("Error stopping StreamingContext", e)
                }
              }
            }
          }
  25. 25. and now for the clock

          /*
           * Allows us access to a manual clock. Note that the manual clock changed between 1.1.1 and 1.3
           */
          class TestManualClock(var time: Long) extends Clock {
            def this() = this(0L)

            def getTime(): Long = getTimeMillis() // Compat
            def currentTime(): Long = getTimeMillis() // Compat
            def getTimeMillis(): Long = synchronized {
              time
            }

            def setTime(timeToSet: Long): Unit = synchronized {
              time = timeToSet
              notifyAll()
            }

            def advance(timeToAdd: Long): Unit = synchronized {
              time += timeToAdd
              notifyAll()
            }

            def addToTime(timeToAdd: Long): Unit = advance(timeToAdd) // Compat

            /**
             * @param targetTime block until the clock time is set or advanced to at least this time
             * @return current time reported by the clock when waiting finishes
             */
            def waitTillTime(targetTime: Long): Long = synchronized {
              while (time < targetTime) {
                wait(100)
              }
              getTimeMillis()
            }
          }
  26. 26. Testing streaming the happy panda way. Creating test data is hard: ssc.queueStream works, unless you need checkpoints (1.4.1+). Collecting the data locally is hard: foreachRDD & a var. Figuring out when your test is “done” is hard too. Let’s abstract all that away into testOperation.
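To make the pain concrete, here is a rough sketch of the hand-rolled approach the slide describes: `queueStream` for input, `foreachRDD` plus a mutable buffer for output, and a polling loop to decide when the test is "done". `QueueStreamSketch`, `collectBatches`, the 200 ms batch interval and the timeout are all assumptions for illustration; this is not how testOperation is implemented.

```scala
import scala.collection.mutable

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Milliseconds, StreamingContext}

// Hypothetical helper for illustration only.
object QueueStreamSketch {
  // Feeds each inner Seq as one batch and returns the non-empty batches seen.
  def collectBatches(sc: SparkContext, batches: Seq[Seq[String]], timeoutMs: Long): Seq[Seq[String]] = {
    val ssc = new StreamingContext(sc, Milliseconds(200))
    val queue = mutable.Queue[RDD[String]](batches.map(b => sc.parallelize(b)): _*)
    // "foreachRDD & a var": results are gathered on the driver in a shared buffer.
    val collected = mutable.ArrayBuffer[Seq[String]]()
    ssc.queueStream(queue).foreachRDD { rdd =>
      val batch = rdd.collect().toSeq
      // Once the queue is drained the stream keeps producing empty batches; skip them.
      if (batch.nonEmpty) collected.synchronized { collected += batch }
    }
    ssc.start()
    // "Done" here means: every input batch was seen, or we ran out of patience.
    val deadline = System.currentTimeMillis() + timeoutMs
    while (collected.synchronized(collected.size) < batches.size &&
        System.currentTimeMillis() < deadline) {
      Thread.sleep(50)
    }
    ssc.stop(stopSparkContext = false, stopGracefully = false)
    collected.synchronized(collected.toList)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("queue-stream-sketch")
    val sc = new SparkContext(conf)
    try {
      println(collectBatches(sc, Seq(Seq("hi"), Seq("hi holden"), Seq("bye")), timeoutMs = 10000L))
    } finally {
      sc.stop()
    }
  }
}
```

The sketch depends on wall-clock time, so it can be slow or flaky on a loaded machine, which is part of the motivation for a manual clock and a shared helper.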
  27. 27. We can hide all of that:

          test("really simple transformation") {
            val input = List(List("hi"), List("hi holden"), List("bye"))
            val expected = List(List("hi"), List("hi", "holden"), List("bye"))
            testOperation[String, String](input, tokenize _, expected, useSet = true)
          }

      Photo by An eye for my mind
  28. 28. What about DataFrames? We can do the same as we did for RDDs (.rdd). Inside of Spark, validation looks like: def checkAnswer(df: DataFrame, expectedAnswer: Seq[Row]). Sadly it’s not in a published package & local only; instead we expose: def equalDataFrames(expected: DataFrame, result: DataFrame) and def approxEqualDataFrames(e: DataFrame, r: DataFrame, tol: Double).
  29. 29. …. and Datasets. We can do the same as we did for RDDs (.rdd). Inside of Spark, validation looks like: def checkAnswer(df: Dataset[T], expectedAnswer: T*). Sadly it’s not in a published package & local only; instead we expose: def equalDatasets(expected: Dataset[U], result: Dataset[V]) and def approxEqualDatasets(e: Dataset[U], r: Dataset[V], tol: Double).
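The `tol` parameter exists because exact equality breaks down on floating point columns. A small plain-Scala sketch of the idea, treating a row as a `Seq[Any]`: `ApproxRowComparison` and `approxEqualRows` are hypothetical names, and this is not the spark-testing-base implementation.

```scala
// Hypothetical helper for illustration only.
object ApproxRowComparison {
  // Rows match when they have the same length and each pair of fields is equal,
  // with floating point fields allowed to differ by at most `tol`.
  def approxEqualRows(expected: Seq[Any], result: Seq[Any], tol: Double): Boolean =
    expected.length == result.length && expected.zip(result).forall {
      case (e: Double, r: Double) => math.abs(e - r) <= tol
      case (e: Float, r: Float) => math.abs(e - r) <= tol
      case (e, r) => e == r
    }

  def main(args: Array[String]): Unit = {
    val expected = Seq[Any]("panda", 1, 0.3)
    val result = Seq[Any]("panda", 1, 0.1 + 0.2)
    println(approxEqualRows(expected, result, tol = 1e-9)) // prints "true"
    println(expected == result) // prints "false", since 0.1 + 0.2 is not exactly 0.3
  }
}
```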
  30. 30. This is what it looks like:

          test("dataframe should be equal to its self") {
            val sqlCtx = sqlContext
            import sqlCtx.implicits._ // Yah I know this is ugly
            val input = sc.parallelize(inputList).toDF
            equalDataFrames(input, input)
          }

      *This may or may not be easier.
  31. 31. Which has “built-in” large support :)
  32. 32. Photo by allison
  33. 33. Let’s talk about local mode. It’s way better than you would expect*. It does its best to try and catch serialization errors. It’s still not the same as running on a “real” cluster, especially since, if we were just using local mode, parallelize and collect might be fine. Photo by: Bev Sykes
  34. 34. Options beyond local mode: Just point at your existing cluster (set master). Start one with your shell scripts & change the master: a really easy way to plug into existing integration testing. spark-docker: hack in our own tests. YarnMiniCluster (seYarnClusterSuite.scala): in Spark Testing Base, extend SharedMiniCluster (not recommended until after SPARK-10812, e.g. 1.5.2+ or 1.6+). Photo by Richard Masoner
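"Set master" can be as small as reading the master URL from the environment, so the same suite runs in local mode by default and against a real cluster in integration runs. A sketch under stated assumptions: `SPARK_TEST_MASTER` is a variable name invented here, and `TestMaster` is a hypothetical helper.

```scala
import org.apache.spark.SparkConf

// Hypothetical helper; SPARK_TEST_MASTER is an invented environment variable name.
object TestMaster {
  // Fall back to local mode when no cluster is configured.
  def masterFrom(env: Map[String, String]): String =
    env.getOrElse("SPARK_TEST_MASTER", "local[4]")

  // Build the SparkConf the test suite would hand to its SparkContext.
  def conf(env: Map[String, String] = sys.env): SparkConf =
    new SparkConf().setMaster(masterFrom(env)).setAppName("integration-tests")

  def main(args: Array[String]): Unit = {
    println(s"tests would run against: ${conf().get("spark.master")}")
  }
}
```

Keeping the lookup in a pure function means the selection logic itself can be checked without starting Spark.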
  35. 35. Validation. Validation can be really useful for catching errors before deploying a model; our tests can’t catch everything. For now, checking file sizes & execution time seems like the most common best practice (from survey). Accumulators have some challenges (see SPARK-12469 for progress) but are an interesting option. spark-validator is still in early stages and not ready for production use, but it is an interesting proof of concept. Photo by: Paul Schadler
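A rough sketch of accumulator-based validation: count total and unparseable records while the job runs, then decide afterwards whether the output is trustworthy. It assumes Spark 2.0 or later for `longAccumulator` (the talk predates that API), and `AccumulatorValidation`, `parseWithCounts` and the 10% threshold are made up for illustration. This is not how spark-validator works.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical helper for illustration only.
object AccumulatorValidation {
  // Returns the parsed values and the fraction of input lines that failed to parse.
  def parseWithCounts(sc: SparkContext, lines: Seq[String]): (Array[Int], Double) = {
    val total = sc.longAccumulator("total records")
    val invalid = sc.longAccumulator("invalid records")
    val parsed = sc.parallelize(lines).flatMap { line =>
      // Updates made inside a transformation can be applied more than once if a
      // task is retried, which is one of the challenges with this approach.
      total.add(1)
      try Some(line.trim.toInt)
      catch { case _: NumberFormatException => invalid.add(1); None }
    }.collect()
    val totalCount = total.value.longValue()
    val invalidCount = invalid.value.longValue()
    val fraction = if (totalCount == 0L) 0.0 else invalidCount.toDouble / totalCount
    (parsed, fraction)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("accumulator-validation")
    val sc = new SparkContext(conf)
    try {
      val (parsed, invalidFraction) = parseWithCounts(sc, Seq("1", "2", "oops", "4"))
      // Invented rule: refuse to publish results when more than 10% of records were bad.
      if (invalidFraction > 0.10) println(s"validation failed: $invalidFraction of records invalid")
      else println(s"validation passed with ${parsed.length} records")
    } finally {
      sc.stop()
    }
  }
}
```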
  36. 36. Related talks & blog posts: Testing Spark Best Practices (Spark Summit 2014); Every Day I’m Shuffling (Strata 2015) & slides; Spark and Spark Streaming Unit Testing; Making Spark Unit Testing With Spark Testing Base
  37. 37. Learning Spark; Fast Data Processing with Spark (Out of Date); Fast Data Processing with Spark (2nd edition); Advanced Analytics with Spark
  38. 38. Learning Spark; Fast Data Processing with Spark (Out of Date); Fast Data Processing with Spark (2nd edition); Advanced Analytics with Spark; Coming soon: Spark in Action; Coming soon: High Performance Spark
  39. 39. And the next book….. Still being written - sign up to be notified when it is available:
  40. 40. Related packages: spark-testing-base, sscheck, spark-validator (*ALPHA*), spark-perf, spark-integration-tests
  41. 41. “Future Work”: Better ScalaCheck integration (ala sscheck). Testing details in my next Spark book. Whatever* you all want (Testing with Spark survey). Semi-likely: integration testing (for now see @cfriegly’s Spark + Docker setup). Pretty unlikely: *That I feel like doing, or you feel like making a pull request for. Photo by bullet101
  42. 42. Cat wave photo by Quinn Dombrowski. k thnx bye! If you want to fill out the survey: Will update results in the Strata presentation & tweet eventually at @holdenkarau