Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.
SPARK SUMMIT
EUROPE2016
Sparklint
a Tool for Identifying and Tuning Inefficient
Spark Jobs Across Your Cluster
Simon White...
Why Sparklint?
• A successful Spark cluster grows rapidly
• Capacity and capability mismatches arise
• Leads to resource c...
Sparklint provides:
• Live view of batch & streaming application stats
or
• Event by event analysis of historical event lo...
Sparklint Listener:
Sparklint Listener:
Sparklint Server:
Demo…
• Simulated workload analyzing site access logs:
– read text file as JSON
– convert to Record(ip, verb, status, time...
Job took 10m7s to finish
Already pretty good
distribution; low idle time
indicates good worker
usage, minimal driver node
...
Job took 15m14s to finish
Core usage increased, job is more
efficient, execution time increased,
but the app is not cpu bo...
Job took 9m24s to finish
Core utilization decreased
proportionally, trading execution time
for efficiency
Lots of IDLE sta...
Job took 11m34s to finish
Dynamic allocation only effective at
app start due to long
executorIdleTimeout setting
Core util...
Job took 33m5s to finish Core utilization is up, but execution time is up
dramatically due to reclaiming resources before
...
Job took 7m34s to finish
Core utilization way up,
with lower execution time
Flat tops show we are
becoming CPU bound
Paral...
Job took 5m6s to finish
Core utilization decreases,
trading execution time for
efficiency again here
Thanks to dynamic allocation the
utilization is high despite being a bi-
modal application
Data loading and mapping requir...
Future Features:
• Increased job & stage detail in UI
• History Server event sources
• Inline recommendations
• Auto-tunin...
The Credit:
• Lead developer is Robert Xue
• https://github.com/roboxue
• SDE @ Groupon
Contribute!
Sparklint is OSS:
https://github.com/groupon/sparklint
SPARK SUMMIT
EUROPE2016
THANK YOU.
swhitear@groupon.com
Upcoming SlideShare
Loading in …5
×

Spark Summit EU talk by Simon Whitear

1,233 views

Published on

SparkLint: a Tool for Monitoring, Identifying and Tuning Inefficient Spark Jobs Across Your Cluster

Published in: Data & Analytics
  • Be the first to comment

Spark Summit EU talk by Simon Whitear

  1. 1. SPARK SUMMIT EUROPE2016 Sparklint a Tool for Identifying and Tuning Inefficient Spark Jobs Across Your Cluster Simon Whitear Principal Engineer @ Groupon
  2. 2. Why Sparklint? • A successful Spark cluster grows rapidly • Capacity and capability mismatches arise • Leads to resource contention • Tuning process is non-trivial • Current UI operational in focus We wanted to understand application efficiency
  3. 3. Sparklint provides: • Live view of batch & streaming application stats or • Event by event analysis of historical event logs • Stats and graphs for: – Idle time – Core usage – Task locality
  4. 4. Sparklint Listener:
  5. 5. Sparklint Listener:
  6. 6. Sparklint Server:
  7. 7. Demo… • Simulated workload analyzing site access logs: – read text file as JSON – convert to Record(ip, verb, status, time) – countByIp, countByStatus, countByVerb
  8. 8. Job took 10m7s to finish Already pretty good distribution; low idle time indicates good worker usage, minimal driver node interaction in job But overallutilization is low Which is reflected in the common occurrence of the IDLE state (unused cores)
  9. 9. Job took 15m14s to finish Core usage increased, job is more efficient, execution time increased, but the app is not cpu bound
  10. 10. Job took 9m24s to finish Core utilization decreased proportionally, trading execution time for efficiency Lots of IDLE state shows we are over allocating resources
  11. 11. Job took 11m34s to finish Dynamic allocation only effective at app start due to long executorIdleTimeout setting Core utilization remains low, the config settings are not right for this workload.
  12. 12. Job took 33m5s to finish Core utilization is up, but execution time is up dramatically due to reclaiming resources before each short running task. IDLE state is reduced to a minimum, looks efficient, but execution is much slower due to dynamic allocation overhead Executor churn!
  13. 13. Job took 7m34s to finish Core utilization way up, with lower execution time Flat tops show we are becoming CPU bound Parallel execution is clearly visible in overlapping stages
  14. 14. Job took 5m6s to finish Core utilization decreases, trading execution time for efficiency again here
  15. 15. Thanks to dynamic allocation the utilization is high despite being a bi- modal application Data loading and mapping requires a large core count to get throughput Aggregation and IO of results optimized for end file size, therefore requires less cores
  16. 16. Future Features: • Increased job & stage detail in UI • History Server event sources • Inline recommendations • Auto-tuning • Streaming stage parameter delegation
  17. 17. The Credit: • Lead developer is Robert Xue • https://github.com/roboxue • SDE @ Groupon
  18. 18. Contribute! Sparklint is OSS: https://github.com/groupon/sparklint
  19. 19. SPARK SUMMIT EUROPE2016 THANK YOU. swhitear@groupon.com

×