Monitoring Spark Applications


Spark is quickly becoming the most popular framework in the MapReduce family. With better performance and much better APIs, it's easier than ever to perform the actual data wrangling. But as always, the challenges of operating, verifying and optimizing your application over time are much greater than the initial setup, and all the more so with distributed systems. At Kenshoo, we've used and developed some tools and techniques to monitor the state of our Spark application: health, correctness, performance, utilization, and business KPIs. We'll discuss some standard tools and less standard techniques to get the most information out of your Spark cluster.


  1. Monitoring Spark Applications. Tzach Zohar @ Kenshoo, March 2016
  2. Who am I: System Architect @ Kenshoo. Java backend for 10 years, working with Scala + Spark for 2 years. https://www.linkedin.com/in/tzachzohar
  3. Who’s Kenshoo: a 10-year-old, Tel-Aviv-based startup, industry leader in digital marketing, 500+ employees, heavy data shop. http://kenshoo.com/
  4. And who’re you?
  5. Agenda: Why Monitor, Spark UI, Spark REST API, Spark Metric Sinks, Applicative Metrics
  6. The Importance of Being Earnest
  7. Why Monitor: failures, performance, know your data, correctness of output
  8. Monitoring Distributed Systems: no single log file, no single user interface, and often no single framework (e.g. Spark + YARN + HDFS…)
  9. Spark UI
  10. Spark UI: see http://spark.apache.org/docs/latest/monitoring.html#web-interfaces. The first go-to tool for understanding what’s what. Created per SparkContext.
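     A minimal sketch (app name and port are placeholders, not from the talk): each SparkContext serves its own web UI, by default on port 4040, and the port can be changed via spark.ui.port:

        import org.apache.spark.{SparkConf, SparkContext}

        // Every SparkContext starts its own UI; spark.ui.port controls where it listens
        val conf = new SparkConf()
          .setAppName("my-monitored-app")  // placeholder
          .set("spark.ui.port", "4040")    // 4040 is the default
        val sc = new SparkContext(conf)
        // While sc is alive, the UI is served at http://<driver-host>:4040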
  11. Spark UI: Jobs -> Stages -> Tasks
  12. Spark UI: Jobs -> Stages -> Tasks
  13. Spark UI: use the “DAG Visualization” in Job Details to understand flow and detect caching opportunities
  14. Spark UI: Jobs -> Stages -> Tasks. Detect unbalanced stages, detect GC issues
  15. Spark UI: Jobs -> Stages -> Tasks -> “Event Timeline”. Detect stragglers, detect repartitioning opportunities
  16. Spark UI disadvantages: “ad-hoc”, no history*; human readable, but not machine readable; data points, not data trends
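     The asterisk presumably refers to the Spark History Server; a minimal sketch, assuming you enable event logging so the UI can be reconstructed after the application ends (the log directory is a placeholder):

        import org.apache.spark.SparkConf

        // With event logging enabled, a History Server pointed at the same directory
        // can re-render the web UI for completed applications
        val conf = new SparkConf()
          .set("spark.eventLog.enabled", "true")
          .set("spark.eventLog.dir", "hdfs:///spark-events") // placeholder path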
  17. Spark UI disadvantages: the UI can quickly become hard to use…
  18. Spark REST API
  19. Spark’s REST API: see http://spark.apache.org/docs/latest/monitoring.html#rest-api. Programmatic access to the UI’s data (jobs, stages, tasks, executors, storage…). Useful for aggregations over similar jobs.
  20. Spark’s REST API. Example: calculate total shuffle statistics:

        // Imports assumed by the slide’s snippet (json4s for JSON parsing):
        import scala.io.Source.fromURL
        import org.json4s.DefaultFormats
        import org.json4s.jackson.JsonMethods.parse

        object SparkAppStats {
          case class SparkStage(name: String, shuffleWriteBytes: Long, memoryBytesSpilled: Long, diskBytesSpilled: Long)
          implicit val formats = DefaultFormats
          val url = "http://<host>:4040/api/v1/applications/<app-name>/stages"

          def main(args: Array[String]) {
            val json = fromURL(url).mkString
            val stages: List[SparkStage] = parse(json).extract[List[SparkStage]]
            println("stages count: " + stages.size)
            println("shuffleWriteBytes: " + stages.map(_.shuffleWriteBytes).sum)
            println("memoryBytesSpilled: " + stages.map(_.memoryBytesSpilled).sum)
            println("diskBytesSpilled: " + stages.map(_.diskBytesSpilled).sum)
          }
        }
  21. Spark’s REST API. Example output for the total shuffle statistics above:

        stages count: 1435
        shuffleWriteBytes: 8488622429
        memoryBytesSpilled: 120107947855
        diskBytesSpilled: 1505616236
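     The same approach works for the other endpoints; a sketch against the executors endpoint (the ExecutorInfo field names are an assumed subset of the ExecutorSummary JSON, so verify them against your Spark version):

        import scala.io.Source.fromURL
        import org.json4s.DefaultFormats
        import org.json4s.jackson.JsonMethods.parse

        object SparkExecutorStats {
          // Assumed subset of the fields returned per executor
          case class ExecutorInfo(id: String, memoryUsed: Long, failedTasks: Int, completedTasks: Int)
          implicit val formats = DefaultFormats
          val url = "http://<host>:4040/api/v1/applications/<app-name>/executors"

          def main(args: Array[String]) {
            val executors = parse(fromURL(url).mkString).extract[List[ExecutorInfo]]
            println("executors: " + executors.size)
            println("total memoryUsed: " + executors.map(_.memoryUsed).sum)
            println("total failedTasks: " + executors.map(_.failedTasks).sum)
          }
        }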
  22. Spark’s REST API. Example: calculate total time per job name:

        import java.util.Date

        val url = "http://<host>:4040/api/v1/applications/<app-name>/jobs"

        case class SparkJob(jobId: Int, name: String, submissionTime: Date, completionTime: Option[Date], stageIds: List[Int]) {
          def getDurationMillis: Option[Long] = completionTime.map(_.getTime - submissionTime.getTime)
        }

        def main(args: Array[String]) {
          val json = fromURL(url).mkString
          parse(json)
            .extract[List[SparkJob]]
            .filter(j => j.getDurationMillis.isDefined) // only completed jobs
            .groupBy(_.name)
            .mapValues(list => (list.map(_.getDurationMillis.get).sum, list.size))
            .foreach { case (name, (time, count)) => println(s"TIME: $time\tAVG: ${time / count}\tNAME: $name") }
        }
  23. Spark’s REST API. Example output for total time per job name:

        TIME: 182570   AVG: 16597   NAME: count at MyAggregationService.scala:132
        TIME: 230973   AVG: 1297    NAME: parquet at MyRepository.scala:99
        TIME: 120393   AVG: 2188    NAME: collect at MyCollector.scala:30
        TIME: 5645     AVG: 627     NAME: collect at MyCollector.scala:103
  24. But that’s still ad-hoc, right?
  25. Spark Metric Sinks
  26. Metrics: see http://spark.apache.org/docs/latest/monitoring.html#metrics. Spark uses the popular dropwizard.metrics library (renamed from codahale.metrics and yammer.metrics), an easy Java API for creating and updating metrics stored in memory, e.g.:

        // Gauge for executor thread pool's actively executing task counts
        metricRegistry.register(name("threadpool", "activeTasks"), new Gauge[Int] {
          override def getValue: Int = threadPool.getActiveCount()
        })

  27. Metrics: what is metered? Couldn’t find any detailed documentation of this. This trick flushes most of them out: search the sources for “metricRegistry.register”.
  28. Where do these metrics go?
  29. Spark Metric Sinks: a “Sink” is an interface for viewing these metrics, at given intervals or ad-hoc. Available sinks: Console, CSV, SLF4J, Servlet, JMX, Graphite, Ganglia*. We use the Graphite sink to send all metrics to Graphite. $SPARK_HOME/metrics.properties:

        *.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
        *.sink.graphite.host=<your graphite hostname>
        *.sink.graphite.port=2003
        *.sink.graphite.period=30
        *.sink.graphite.unit=seconds
        *.sink.graphite.prefix=<token>.<app-name>.<host-name>
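     If the properties file lives somewhere else, spark.metrics.conf can point Spark at it; a minimal sketch (the path is a placeholder):

        import org.apache.spark.SparkConf

        // Point the metrics system at a custom properties file instead of the default location
        val conf = new SparkConf()
          .set("spark.metrics.conf", "/etc/spark/metrics.properties") // placeholder path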
  30. … and it’s in Graphite (+ Grafana)
  31. Graphite Sink: very useful for trend analysis. WARNING: not suitable for short-running applications (will pollute Graphite with new metrics for each application). Requires some Graphite tricks to get clear readings (wildcards, sums, derivatives, etc.)
  32. Applicative Metrics
  33. The Missing Piece: Spark meters its internals pretty thoroughly, but what about your internals? Applicative metrics are a great tool for knowing your data and verifying output correctness. We use Dropwizard Metrics + Graphite for this too (everywhere).
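     A minimal sketch of what that can look like in driver code, assuming the Dropwizard metrics-graphite module is on the classpath (host, prefix and metric names are placeholders):

        import java.net.InetSocketAddress
        import java.util.concurrent.TimeUnit
        import com.codahale.metrics.MetricRegistry
        import com.codahale.metrics.graphite.{Graphite, GraphiteReporter}

        // One registry for the application's own ("applicative") metrics
        val registry = new MetricRegistry()

        // Ship everything to Graphite every 30 seconds
        val graphite = new Graphite(new InetSocketAddress("<your graphite hostname>", 2003))
        val reporter = GraphiteReporter.forRegistry(registry)
          .prefixedWith("<token>.<app-name>.applicative")
          .convertRatesTo(TimeUnit.SECONDS)
          .convertDurationsTo(TimeUnit.MILLISECONDS)
          .build(graphite)
        reporter.start(30, TimeUnit.SECONDS)

        // Update from anywhere in the driver code
        registry.counter("input.records").inc()
        registry.meter("parsing.failures").mark()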
  34. Counting RDD Elements: rdd.count() might be costly (another action); Spark Accumulators are a good alternative. Trick: send accumulator results to Graphite, using “Counter-backed Accumulators”:

        // Imports assumed by the snippet (yammer/dropwizard Counter plus Spark accumulators):
        import scala.reflect.ClassTag
        import org.apache.spark.Accumulator
        import org.apache.spark.rdd.RDD
        import com.yammer.metrics.Metrics
        import com.yammer.metrics.core.{Counter, MetricName}

        /**
         * Call returned callback after acting on returned RDD to get counter updated
         */
        def countSilently[V: ClassTag](rdd: RDD[V], metricName: String, clazz: Class[_]): (RDD[V], Unit => Unit) = {
          val counter: Counter = Metrics.newCounter(new MetricName(clazz, metricName))
          val accumulator: Accumulator[Long] = rdd.sparkContext.accumulator(0, metricName)
          val countedRdd = rdd.map(v => { accumulator += 1; v })
          val callback: Unit => Unit = u => counter.inc(accumulator.value)
          (countedRdd, callback)
        }
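     A possible usage sketch (the RDD, metric name and output path are placeholders): wrap the RDD, run any action on the wrapped RDD, then invoke the callback so the counter picks up the accumulator’s value:

        val (countedRdd, updateCounter) = countSilently(parsedRecords, "outputRecords", getClass)
        countedRdd.saveAsTextFile("hdfs:///out/records") // placeholder action that materializes the RDD
        updateCounter(())                                // the counter now reflects the element count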
  35. Counting RDD Elements
  36. We Measure… input records, output records, parsing failures, average job time, data “freshness” histogram, much much more...
  37. WARNING: it’s addictive...
  38. Conclusions: Spark provides a wide variety of monitoring options. Each one should be used when appropriate; none is sufficient on its own. Metrics + Graphite + Grafana can give you visibility into any numeric timeseries.
  39. Questions?
  40. Thank you
