Sparklint
A Tool for Identifying and Tuning Inefficient Spark Jobs Across Your Cluster
Simon Whitear
Principal Engineer
Why Sparklint?
• A successful Spark cluster grows rapidly
• Capacity and capability mismatches arise
• Leads to resource contention
• Tuning process is non-trivial
• The current Spark UI is operationally focused
We wanted to understand application efficiency
Sparklint provides:
• Live view of batch & streaming application stats, or
• Event-by-event analysis of historical event logs
• Stats and graphs for:
– Idle time
– Core usage
– Task locality
Sparklint Listener:
Sparklint Server:
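In live mode, Sparklint attaches to a running application as an extra SparkListener and serves its stats; the server mode replays historical event logs instead. A minimal sketch of wiring up the live listener, using the listener class name given in the project README (the exact sparklint artifact to put on the classpath depends on your Spark and Scala versions, so check the repo):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sketch: register the Sparklint listener so live stats are gathered from the
// application's event stream. The sparklint jar must be on the driver classpath
// (e.g. via --packages or --jars); see https://github.com/groupon/sparklint for
// the artifact matching your Spark and Scala versions.
val conf = new SparkConf()
  .setAppName("access-log-analysis")
  .set("spark.extraListeners", "com.groupon.sparklint.SparklintListener")

val sc = new SparkContext(conf)
```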
Demo…
• Simulated workload analyzing site access logs (see the sketch after this list):
– read text file as JSON
– convert to Record(ip, verb, status, time)
– countByIp, countByStatus, countByVerb
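A minimal sketch of that workload, assuming json4s for the per-line JSON parse and countByValue for the counts (the actual demo code is ReduceByKey.scala in the Sparklint repo and may differ):

```scala
import org.apache.spark.SparkContext
import org.json4s.{DefaultFormats, Formats}
import org.json4s.jackson.JsonMethods.parse

// Illustrative record type; field names as given on the slide.
case class Record(ip: String, verb: String, status: Int, time: Long)

def runWorkload(sc: SparkContext, path: String): Unit = {
  // Read the access log as text and JSON-parse each line into a Record.
  val records = sc.textFile(path).mapPartitions { lines =>
    implicit val formats: Formats = DefaultFormats
    lines.map(line => parse(line).extract[Record])
  }

  // Three independent jobs, each with one map stage (parsing) and one reduce stage.
  val byIp     = records.map(_.ip).countByValue()
  val byStatus = records.map(_.status).countByValue()
  val byVerb   = records.map(_.verb).countByValue()
}
```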
Job took 10m7s to finish.
Already a pretty good distribution; low idle time indicates good worker usage and minimal driver node interaction in the job.
But overall utilization is low, which is reflected in the common occurrence of the IDLE state (unused cores).
Job took 15m14s to finish.
Core usage increased and the job is more efficient; execution time increased, but the app is not CPU bound.
Job took 9m24s to finish.
Core utilization decreased proportionally, trading efficiency for execution time.
Lots of IDLE state shows we are over-allocating resources.
Job took 11m34s to finish.
Core utilization remains low; the config settings are not right for this workload.
Dynamic allocation is only effective at app start due to the long executorIdleTimeout setting.
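The dynamic allocation behaviour described here is controlled by a handful of settings; a hedged sketch (the executor counts are illustrative, and classic dynamic allocation also needs the external shuffle service):

```scala
import org.apache.spark.SparkConf

// Sketch of the dynamic allocation knobs exercised in these runs.
val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true")        // classic dynamic allocation needs the external shuffle service
  .set("spark.dynamicAllocation.minExecutors", "1")
  .set("spark.dynamicAllocation.maxExecutors", "32")   // illustrative cap
  // Executors are released only after being idle this long (default 60s), so with
  // tasks far shorter than this the allocation barely changes after start-up.
  .set("spark.dynamicAllocation.executorIdleTimeout", "60s")
```

The next run keeps everything else the same but drops executorIdleTimeout to 10s, so executors are reclaimed between the short tasks.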
Job took 33m5s to finish.
Core utilization is up, but execution time is up dramatically due to reclaiming resources before each short-running task.
IDLE state is reduced to a minimum and it looks efficient, but execution is much slower due to dynamic allocation overhead.
Job took 7m34s to finish.
Core utilization is way up, with lower execution time.
Parallel execution is clearly visible in overlapping stages.
Flat tops show we are becoming CPU bound.
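The gain here comes from submitting the three independent count jobs concurrently and letting the FAIR scheduler share cores between them. A minimal sketch of that pattern, building on the workload sketch above (the pool names and the use of plain Scala Futures are assumptions, not the original demo code):

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Assumes spark.scheduler.mode=FAIR was set on the SparkConf and that `Record`
// and the parsed `records` RDD come from the workload sketch earlier.
def runCountsInParallel(sc: SparkContext, records: RDD[Record]): Unit = {
  // Run a blocking Spark job on its own thread, assigned to its own fair pool.
  def inPool[T](pool: String)(job: => T): Future[T] = Future {
    sc.setLocalProperty("spark.scheduler.pool", pool) // thread-local property
    job
  }

  val byIp     = inPool("by-ip")     { records.map(_.ip).countByValue() }
  val byStatus = inPool("by-status") { records.map(_.status).countByValue() }
  val byVerb   = inPool("by-verb")   { records.map(_.verb).countByValue() }

  Await.result(byIp, Duration.Inf)
  Await.result(byStatus, Duration.Inf)
  Await.result(byVerb, Duration.Inf)
}
```

Under the default FIFO mode the first submitted job gets priority on the available cores; FAIR mode lets concurrently submitted jobs share them.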
Job took 5m6s to finish.
Core utilization decreases, trading efficiency for execution time again here.
Thanks to dynamic allocation, utilization is high despite this being a bi-modal application.
Data loading and mapping require a large core count to get throughput.
Aggregation and IO of results are optimized for end file size and therefore require fewer cores.
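This bi-modal shape (a wide parse/map phase followed by a narrow aggregate-and-write tail) is roughly the pattern sketched below; the input path, partition counts and per-line parse are illustrative:

```scala
import org.apache.spark.SparkContext

// Sketch of the bi-modal shape: a wide load/map phase, then a narrow aggregate/write tail.
def bimodalJob(sc: SparkContext, input: String, output: String): Unit = {
  // Wide phase: read and map with high parallelism, so a large core count pays off.
  val hits = sc.textFile(input, minPartitions = 512)   // illustrative partition count
    .map(line => (line.split(" ")(0), 1L))              // key by first field, e.g. the client IP

  // Narrow phase: aggregate, then coalesce to a partition count chosen for output
  // file size rather than CPU; dynamic allocation releases the now-idle executors.
  hits.reduceByKey(_ + _)
    .coalesce(16)
    .saveAsTextFile(output)
}
```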
Future Features:
• History Server event sources
• Inline recommendations
• Auto-tuning
• Streaming stage parameter delegation
• Replay capable listener
Credits:
• Lead developer is Robert Xue
• https://github.com/roboxue
• SDE @ Groupon
Contribute!
Sparklint is OSS:
https://github.com/groupon/sparklint
Q+A
Editor's Notes
1. Spark cluster success: the platform rolls out with a maximum supported load. Early projects ramp up and usage is fine. Early successes feed back into recommendations to use the platform. New users start loading up the platform just as the initial successes are being scaled. Platform limits are hit, and scaling requirements only now begin to be understood and planned for. Rough times follow whilst the platform operators learn to lead the application usage.
◦ The Spark UI provides masses of info, but by default only for recent jobs/stages/tasks and only while the job is alive.
◦ When the Spark UI is served from the History Server there is still little summary information for debugging the job config: have I used the right magic numbers (locality wait, cores, numPartitions, job scheduling mode, etc.)?
◦ It is difficult to compare different executions of the same job because of this missing level of summary (execution time is almost the only metric to compare).
2.
◦ A mechanism to listen to the Spark event log stream and accumulate lifetime stats without losing (too many) details, using constant memory in live mode thanks to the gauges we use.
◦ The mechanism also provides convenient replay when serving from a file.
◦ A set of stats and graphs that describe job performance uniformly: 1. idle time (the duration when all calculation is done on the driver node, something to avoid); 2. max core usage and core usage percentage (should be neither too high nor too low; we are thinking about supplementing it with the average number of tasks in wait); 3. task execution time for a given stage by locality (which honestly describes the opportunity cost of a lower locality and indicates the idle locality wait config).
3. We use ReduceByKey.scala in the repo as a sample to demo a series of attempts at optimizing a Spark application. The logs are included as well, and the highlights of each run are annotated in the screenshots in the attachment. The application basically reads a text file, JSON-parses each line into "Record(ip: String, verb: String, status: Int, time: Long)", then does countByIp, countByStatus and countByVerb on them, repeated 10 times. These are three independent map-reduce jobs, each with one map stage (parsing) and one reduce stage (countByXXX). Algorithm-level optimization is out of scope here. The app needs a constant number of CPU seconds and a floating but bounded amount of network I/O time (decided by job locality) to finish the execution.
4. We use 16 cores as the baseline. The job takes 10 minutes to finish. The annotations in the picture describe what we are running here and how to read the Sparklint graph. After reading the chart, we decided to decrease the core count to see whether the execution time doubles, to figure out whether we are bound by CPU.
5. Using 8 cores, the job took 15 minutes to finish, shorter than the 20-minute expectation, proving that we are not bound by CPU. This sawtooth pattern by itself indicates we are not CPU bound and can be used as a classic example; an example of a CPU-bound application can be found in the last demo slide. This leads to another angle of optimization: job scheduling tweaks.
6. Using 32 cores, the job took 9 minutes to finish, proving again that throwing more cores at the job doesn't provide commensurate performance gains. The graph is a classic example of over-allocating resources. We can assume we need no more than 24 cores to do the work effectively, so now we can look into other ways of tuning the job: dynamic allocation and increased parallelism.
7. We try to optimize the resource requirement by using dynamic allocation, initially with the default executorIdleTimeout of 1 minute. This also led us to try 1 core per executor. Since we don't usually have any task longer than 1 minute, this showed that dynamic allocation is not the key to optimizing this kind of app with short tasks.
8. We reduced executorIdleTimeout to 10s. In this way we decreased the resource footprint and increased utilization. However, this is a false saving for this job, because throughput is reduced by the low core supply and the overhead of acquiring executors. This example proved again that dynamic allocation doesn't solve the optimization challenge when we have short tasks. So, let's try parallelism inside the job using FAIR scheduling.
9. Using 16 cores and the FAIR scheduler, this simple tweak cut the execution time from 10 minutes to 7.5 minutes, and our job now becomes CPU bound (see annotation). Running the three count jobs in parallel under FAIR scheduling increases efficiency and reduces runtime, allowing us to become CPU bound.
10. Using 32 cores and the FAIR scheduler, the execution time becomes 5 minutes (compared with 9 minutes in picture 3 using the same resources). We reduce efficiency in order to gain execution time; this is a decision for the team. If there is a hard SLA to hit, it may be worth running with lower utilization. We can now call the job scheduling optimization done.
11. This demos the correct scenario for dynamic allocation, and shows that throwing more CPU at the job helps when it is CPU bound (the flat tops in the usage graph are the clear proof). In this case the partition count is chosen to optimize file size on HDFS, so the team is comfortable with the runtime.
12. Sparklint can easily distinguish CPU-bound from job-scheduling-bound applications (we are working on automating this judgment by using the average number of pending tasks). It is really easy to spot when a job is limited not by CPU but by job scheduling (which leads to low core usage) and driver-node operations (which lead to idle time). In theory your app will be 2x faster if you throw 2x the cores at it, but this is not always true. The point of Spark-level optimization is to make your job CPU bound, at which point you can decide freely between the dollars gained from a faster application and the dollars spent on providing more cores. If your job is CPU bound, simply add cores. If your job has a lot of idle time, try to decrease it by reducing unwanted/unintended driver-node operations (these can be simple things like doing a map on a large array instead of an RDD that someone forgot about). If your job is job-scheduling bound, you can both reduce waste by using dynamic allocation (which in turn gives you high throughput when needed) and submit independent jobs in parallel using Futures and the FAIR scheduler: http://spark.apache.org/docs/latest/configuration.html#scheduling