
Advanced Visualization of Spark jobs


Published in: Technology

  1. 1. Advanced visualization of Spark jobs Zoltán Zvara zoltan.zvara@sztaki.hu Márton Balassi marton.balassi@sztaki.hu
  2. 2. Agenda • Introduction • Motivation and the challenges • The Spark execution visualization • An elegant way to tackle data skew • The visualizer extended to Flink • Future plans
  3. 3. Introduction • Hungarian Academy of Sciences (MTA SZTAKI) • Research institute with strong industry ties • Big Data projects using Spark, Flink, Cassandra, Hadoop etc. • Multiple telco use cases lately, with challenging data volume and distribution
  4. 4. Motivation • We have developed an application aggregating telco data that tested well on toy data • When deploying it against the real dataset the application seemed healthy • However, it could become surprisingly slow or even crash • What went wrong?
  5. 5. Data skew • When some data-points are very frequent, and we need to process them on a single machine – aka the Bieber-effect • Power-law distributions are very common: from social-media to telecommunication, even in our beloved WordCount • Default hashing puts these into „random” buckets • We can do better than that
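The effect of default hashing on a power-law key distribution can be illustrated with a small simulation. This is a minimal sketch, not taken from the talk: the toy data set, the key names and the use of `zlib.crc32` as a stand-in for the engine's hash function are all assumptions made for illustration.

```python
import zlib


def hash_partition(key: str, num_partitions: int) -> int:
    """Map a key to a partition with a deterministic hash (crc32 is a stand-in)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def partition_sizes(records, num_partitions):
    """Count how many records land in each partition."""
    sizes = [0] * num_partitions
    for key in records:
        sizes[hash_partition(key, num_partitions)] += 1
    return sizes


# Zipf-like toy data (assumption): key "user<i>" occurs 1000 // i times,
# so "user1" plays the role of the very frequent "Bieber" key.
records = [f"user{i}" for i in range(1, 101) for _ in range(1000 // i)]
sizes = partition_sizes(records, num_partitions=8)

print("records per partition:", sizes)
print("largest / average:", max(sizes) / (sum(sizes) / len(sizes)))
```

All records of the most frequent key end up in one bucket, so that bucket is larger than the average no matter where the hash happens to place it.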
  6. 6. Aim 1. Most of the telco data-processing workloads suffer from inefficient communication patterns, introduced by skewed data 2. Help developers to write better applications by making issues in logical and physical execution plans easier to detect 3. Help newcomers to understand a distributed data-processing system faster 4. Demonstrate and guide the testing of adaptive (re)partitioning techniques
  7. 7. Spark’s execution model • Extended MapReduce model, where a previous stage has to complete all its tasks before the next stage can be scheduled (global synchronization) • Data skew can introduce slow tasks • In most cases, the data distribution is not known in advance • „Concept drifts” are common [Figure: stage boundary & shuffle; even partitioning vs. skewed data with slow tasks]
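Why a single slow task matters under global synchronization can be shown with simple arithmetic. The sketch below is a simplified model, not Spark code: it assumes every task of a stage runs in parallel on its own slot, and the task durations are made-up numbers.

```python
def stage_duration(task_durations):
    """A stage finishes only when its slowest task finishes."""
    return max(task_durations)


def job_duration(stages):
    """Stages run one after another because of the global synchronization."""
    return sum(stage_duration(tasks) for tasks in stages)


# Same total work (40 time units) in both cases, made-up numbers.
even_stage = [10, 10, 10, 10]
skewed_stage = [4, 4, 4, 28]

print("even partitioning:", stage_duration(even_stage))
print("skewed data:", stage_duration(skewed_stage))
print("two skewed stages in a row:", job_duration([skewed_stage, skewed_stage]))
```

In this model the skewed stage takes almost three times as long as the evenly partitioned one although the total amount of work is identical.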
  8. 8. Physical execution plan • Understanding the physical plan • can lead to designing a better application • can uncover bottlenecks that are hard to identify otherwise • can help to pick the right tool for the job at hand • can be quite difficult for newcomers • can give insights to new, adaptive, on-the-fly partitioning & optimization strategies
  9. 9. Current visualizations in Spark • Current visualization tools • are missing fine-grained input & output metrics – these are also not available in most systems, like Spark • do not capture the data characteristics – it is hard to build a lightweight and efficient solution • do not visualize the physical plan [Figure: DAG visualization in Spark]
  10. 10. Collecting information during execution • Data transfer between tasks is accomplished in the shuffle phase, in the form of shuffle block fetches • We distinguish local & remote block fetches • Data characteristics are collected on shuffle write & read
  11. 11. A scalable sampling • Key-distributions are approximated with a strategy that • is not sensitive to early or late concept drifts, • is lightweight and efficient, • is scalable by using a backoff strategy [Figure: key-frequency histograms and counters at sampling rates s_i, s_i/b and s_{i+j}/b; labels: truncate, increase by s_i, T_B]
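One way to picture sampling with backoff is the sketch below. It is not the authors' algorithm, whose details are not given on the slide: the class name `BackoffSampler`, the parameters `max_counters` and `backoff`, the inverse-rate weighting and the "keep the heaviest half" truncation rule are all assumptions chosen to keep the example small.

```python
import random


class BackoffSampler:
    """Approximate key frequencies with a bounded number of counters (hypothetical sketch)."""

    def __init__(self, max_counters=100, backoff=2.0, seed=0):
        self.rate = 1.0                  # current sampling rate
        self.backoff = backoff           # factor by which the rate is reduced
        self.max_counters = max_counters
        self.counters = {}               # key -> estimated frequency
        self._rng = random.Random(seed)  # seeded so runs are repeatable

    def observe(self, key):
        # Skip the record with probability 1 - rate.
        if self._rng.random() >= self.rate:
            return
        # Weight a sampled record by 1 / rate so counts estimate true frequency.
        self.counters[key] = self.counters.get(key, 0.0) + 1.0 / self.rate
        if len(self.counters) > self.max_counters:
            self._back_off()

    def _back_off(self):
        # Too many counters: sample less often and keep only the heaviest keys.
        self.rate /= self.backoff
        heaviest = sorted(self.counters.items(), key=lambda kv: kv[1], reverse=True)
        self.counters = dict(heaviest[: self.max_counters // 2])

    def top(self, n):
        return sorted(self.counters.items(), key=lambda kv: kv[1], reverse=True)[:n]


# One very frequent key mixed with 5000 keys that occur once each.
sampler = BackoffSampler(max_counters=100, backoff=2.0, seed=0)
for i in range(5000):
    sampler.observe("bieber")
    sampler.observe(f"key{i}")

print("heaviest key:", sampler.top(1)[0][0])
print("final sampling rate:", sampler.rate)
```

Memory stays bounded by `max_counters` regardless of how many distinct keys appear, while the frequent key keeps a counter throughout.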
  12. 12. Availability through the Spark REST API • Enhanced Spark REST API to provide block fetch & data characteristics information of tasks • New queries to the REST API, for example: „what happened in the last 3 seconds?” • /api/v1/applications/{appId}/{jobId}/tasks/{timestamp}
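A client could build the query for "the last 3 seconds" roughly as follows. Only the path template comes from the slide; the host and port, the helper names, and the reading of `{timestamp}` as a lower bound in epoch milliseconds are assumptions, and the endpoint exists only in the authors' enhanced REST API.

```python
import time


def tasks_since_url(base_url, app_id, job_id, since_ms):
    """Fill in the path template from the slide (timestamp unit is an assumption)."""
    return f"{base_url.rstrip('/')}/api/v1/applications/{app_id}/{job_id}/tasks/{since_ms}"


def last_seconds_url(base_url, app_id, job_id, seconds, now_ms=None):
    """URL asking for task information of the last `seconds` seconds."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    return tasks_since_url(base_url, app_id, job_id, now_ms - int(seconds * 1000))


# Hypothetical driver address and identifiers, fixed clock for a repeatable result.
url = last_seconds_url("http://localhost:4040", "app-1", 0, seconds=3, now_ms=10_000)
print(url)
```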
  13. 13. THE VISUALIZATION
  14. 14. Our way of tackling data skew • Goal: use the collected information to handle data skew dynamically, on-the-fly, on any workload (batch or streaming) • A paper is currently in progress [Figure: heavy keys create heavy partitions (slow tasks); Bieber-key is alone, size of the biggest partition is minimized]
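The idea of isolating a heavy key can be sketched with a greedy assignment. This is not the authors' method, which is described only as a paper in progress; it swaps in the classic "largest first onto the least loaded partition" heuristic, and the function names and key counts are made up for illustration.

```python
import heapq
import zlib


def assign_keys(key_counts, num_partitions):
    """Greedily place keys, heaviest first, onto the currently least loaded partition."""
    loads = [(0, p) for p in range(num_partitions)]  # (records so far, partition id)
    heapq.heapify(loads)
    assignment = {}
    for key in sorted(key_counts, key=key_counts.get, reverse=True):
        load, partition = heapq.heappop(loads)
        assignment[key] = partition
        heapq.heappush(loads, (load + key_counts[key], partition))
    return assignment


def partition_of(key, assignment, num_partitions):
    """Use the explicit placement if known, otherwise fall back to hashing."""
    if key in assignment:
        return assignment[key]
    return zlib.crc32(key.encode("utf-8")) % num_partitions


# Made-up frequencies: one heavy key and twenty light ones.
key_counts = {"bieber": 1000, **{f"key{i}": 50 for i in range(20)}}
assignment = assign_keys(key_counts, num_partitions=4)

loads = [0] * 4
for key, count in key_counts.items():
    loads[partition_of(key, assignment, 4)] += count
print("records per partition:", loads)
```

In a real system the counts would come from sampled key histograms and only the heaviest keys would be placed explicitly, with the remaining keys handled by the hashing fallback.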
  15. 15. Making visualization pluggable • Let us also support Flink • The execution model is quite different from Spark • Minimal code change on the data reader module of the visualization • Expose the necessary metrics through the Flink JM Rest API • Map Flink’s model to Spark’s
  16. 16. Computational models • Spark RDDs break down the computation into stages • Flink computation is fully pipelined by default [Figure: batch data (HDFS, JDBC ...) and stream data (Kafka, RabbitMQ ...)]
  17. 17. The currently available Flink visualizer
  18. 18. Granular visualization of Flink jobs
  19. 19. Future plans • Open-sourcing of the visualization framework • Dynamic repartitioning paper is in progress • Task-classification to make the visualization more compact • Visual improvements, more features • Opening PRs against Spark & Flink with the suggested metrics • A rewrite is coming for the Flink metrics, track it at FLINK-1504
  20. 20. Conclusions • The devil is in the details • Visualizations can help developers to better understand issues and bottlenecks of certain workloads • Data skew especially can delay the completion of a whole stage in a batch engine • Adaptive repartitioning strategies can be surprisingly efficient
  21. 21. Thank you for your attention • Q&A
