Sparklens: Understanding the Scalability Limits of Spark Applications with Rohit Karlupia

Rohit Karlupia, Qubole
@qubole #Sparklens #SAISDev17
Agenda
• Performance tuning pitfalls
• Theory behind Sparklens
• Qubole Sparklens tuning example
Spark Tuning: Usual Suspects
• Profiling
• Experiments
• Resource Utilization
Minimize Doing Nothing
[Figure: timeline of Driver, Stage 1, Stage 2 and Stage 3 across Core1 to Core4; x-axis: Time]
Driver Side Computations
[Figure: timeline of Driver, Stage 1, Stage 2 and Stage 3 across Core1 to Core4; x-axis: Time]
Driver does...
• File listing & split computation
• Loading of Hive tables
• FOC (file output committer)
• Collect
• df.toPandas()
Not Enough Tasks
[Figure: timeline of Driver, Stage 1, Stage 2 and Stage 3 across Core1 to Core4; x-axis: Time]
Controlling number of tasks
• HDFS block size
• Min/max split size
• Default Parallelism
• Shuffle Partitions
• Repartitions
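The first two knobs above interact through the split-size rule used by Hadoop-style file input formats, which can be sketched as follows. This is a rough estimate only: it ignores split slop and non-splittable files, the function names are hypothetical, and Spark SQL file sources size their partitions through their own settings (for example spark.sql.files.maxPartitionBytes).

```python
MB = 1024 * 1024


def split_size(block_size, min_size, max_size):
    # Same shape as Hadoop's FileInputFormat.computeSplitSize:
    # max(minSize, min(maxSize, blockSize)).
    return max(min_size, min(max_size, block_size))


def estimated_input_tasks(file_bytes, block_size, min_size=1, max_size=2**63 - 1):
    # One task per split; integer ceiling division avoids float rounding.
    size = split_size(block_size, min_size, max_size)
    return -(-file_bytes // size)


ten_gib = 10 * 1024 * MB
print(estimated_input_tasks(ten_gib, 128 * MB))                      # 80
print(estimated_input_tasks(ten_gib, 128 * MB, max_size=32 * MB))    # 320
print(estimated_input_tasks(ten_gib, 128 * MB, min_size=512 * MB))   # 20
```

Lowering the max split size raises the task count and raising the min split size lowers it, which is why these settings matter when a stage has fewer tasks than cores.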
Non-Uniform Tasks: Skew
[Figure: timeline of Driver, Stage 1, Stage 2 and Stage 3 across Core1 to Core4; x-axis: Time]
Critical Path: Limit to scalability
[Figure: timeline of Driver, Stage 1, Stage 2 and Stage 3 across Core1 to Core4; x-axis: Time]
Ideal Application Time
[Figure: timeline of Driver, Stage 1, Stage 2 and Stage 3 across Core1 to Core4; x-axis: Time]
Latency vs Compute vs Developer
• Critical Path Time
• Wall Clock Time
• Ideal* Application Time
Structure of a Spark application
• A Spark application is either executing in the driver or in parallel in the executors
• A child stage is not executed until all parent stages are complete
• A stage is not complete until all tasks of the stage are complete
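These three rules are enough to build a toy model of application time. The sketch below is not Sparklens code. It assumes stages run strictly one after another, that tasks go greedily to whichever core frees up first, that Critical Path Time means the time with unlimited cores, and that Ideal Application Time means the time if all task work were spread evenly over the cores. All names and numbers are made up for illustration.

```python
import heapq


def stage_wall_clock(task_secs, cores):
    # Greedy scheduling: each task starts on the core that frees up first.
    free_at = [0.0] * min(cores, max(len(task_secs), 1))
    heapq.heapify(free_at)
    for secs in task_secs:
        start = heapq.heappop(free_at)
        heapq.heappush(free_at, start + secs)
    # The stage completes only when its last task completes.
    return max(free_at)


def app_times(driver_secs, stages, cores):
    # Driver time and stage time never overlap in this model.
    wall = driver_secs + sum(stage_wall_clock(s, cores) for s in stages)
    # With unlimited cores, each stage still waits for its longest task.
    critical = driver_secs + sum(max(s) for s in stages)
    # With perfect packing, the task work is divided evenly across the cores.
    ideal = driver_secs + sum(sum(s) for s in stages) / cores
    return wall, critical, ideal


stages = [[4, 4, 4, 4], [8, 1, 1, 1, 1], [2, 2]]  # task durations in seconds
print(app_times(10, stages, cores=2))  # (28.0, 24, 26.0)
print(app_times(10, stages, cores=4))  # (24.0, 24, 18.0)
```

Going from 2 to 4 cores brings the wall clock down to the critical path, and no further cores would help: the 8-second task in the second stage and the 10 seconds of driver time set the floor.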
Sparklens in Action
Performance tuning 603 lines of unfamiliar Scala code
Sparklens: First Pass
Driver WallClock      41m 40s   26%
Executor WallClock   117m 03s   74%
Total WallClock      158m 44s
Critical Path        127m 41s
Ideal Application     43m 32s
Observations & Actions
• The application had too many stages (697).
• The Critical Path Time was 3X the Ideal Application Time.
• Instead of letting Spark write to the Hive table, the code was doing serial writes to each partition, in a loop.
• We changed the code to let Spark write to the partitions in parallel.
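The original Scala code is not shown in the slides, so the following is only a guess at the shape of the change, written against the PySpark DataFrame writer API. The function names, the table name and the partition column are hypothetical, and whether insertInto or saveAsTable is appropriate depends on how the target table was created.

```python
def write_partitions_serially(df, table, partition_col, values):
    """Anti-pattern: one Spark job per partition value, run back to back."""
    for value in values:
        # Each iteration filters the data and launches its own write job.
        one_partition = df.filter(df[partition_col] == value)
        one_partition.write.mode("append").insertInto(table)


def write_partitions_in_parallel(df, table, partition_col):
    """Single job: Spark routes rows to partitions and writes them concurrently."""
    df.write.mode("append").partitionBy(partition_col).saveAsTable(table)
```

The serial version multiplies the number of jobs and stages by the number of partition values and keeps the driver in the loop between them, which would be consistent with the 697 stages and the large driver share seen in the first pass.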
Sparklens: Second Pass
Driver WallClock     02m 28s    9%
Executor WallClock   24m 03s   91%
Total WallClock      26m 32s
Critical Path        25m 27s
Ideal Application    04m 48s
Sparklens Performance Prediction
Executor Count   Time   Utilisation
10               44m    51%
20               34m    33%
50               28m    16%
80               27m    10%
100              26m     8%
110              26m     8%
120              26m     7%
150              25m     5%
200              25m     4%
300              25m     3%
400              25m     2%
500              25m     1%
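One way to read the prediction table, assuming Utilisation means busy executor time divided by available executor time (executor count × time): the amount of useful work stays roughly constant, so adding executors mostly adds idle time. The sketch below checks this against the first three rows. The function names are hypothetical and the figures are the rounded values from the slide.

```python
def busy_executor_minutes(count, minutes, utilisation):
    # Useful work implied by one row: available executor time x utilisation.
    return count * minutes * utilisation


def predicted_utilisation(work_executor_minutes, count, minutes):
    # Inverse view: the same work spread over more executor time.
    return work_executor_minutes / (count * minutes)


rows = [(10, 44, 0.51), (20, 34, 0.33), (50, 28, 0.16)]
for count, minutes, utilisation in rows:
    print(count, round(busy_executor_minutes(count, minutes, utilisation), 1))
# 10 224.4
# 20 224.4
# 50 224.0

print(round(predicted_utilisation(224.0, 100, 26), 3))  # 0.086, shown as 8% in the table
```

Past roughly 100 executors the predicted time barely moves from 25 to 26 minutes, so every added executor lowers utilisation without reducing latency.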
Executor Utilisation
ECCH available   320h 50m
ECCH used         31h 00m    9%
ECCH wasted      289h 50m   91%
ECCH: Executor Core Compute Hour
Per Stage Metrics
Stage-ID   WallClock   Core           Task    PRatio   -----Task------
           Stage%      ComputeHours   Count            Skew   StageSkew
0           0.27       00h 00m         2      0.00     1.00   0.78
1           0.37       00h 00m        10      0.01     1.05   0.85
33         85.84       03h 18m        10      0.01     1.07   1.00

Stage-ID   OIRatio   |*   ShuffleWrite%   ReadFetch%   GC%    *|
0          0.00      |*   0.00            0.00         3.03   *|
1          0.00      |*   0.00            0.00         2.02   *|
33         0.00      |*   0.00            0.00         0.23   *|

CCH 3h 18m
Task Count 10
Total Cores 800
Observations & Actions
• 85% of the time was spent in a single stage with a very low number of tasks.
• 91% of compute was wasted on the executor side.
• Found that repartition(10) was called somewhere in the code, resulting in only 10 tasks. Removed it.
• Also increased spark.sql.shuffle.partitions from the default 200 to 800.
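The arithmetic behind the first two observations is simple enough to sketch. Assuming one task occupies one core at a time, a stage with fewer tasks than cores cannot keep more cores busy than it has tasks. The function name is hypothetical; the inputs are the task count and total cores from the per-stage metrics.

```python
def max_stage_utilisation(task_count, total_cores):
    # A task runs on one core, so at most min(tasks, cores) cores are busy.
    return min(task_count, total_cores) / total_cores


print(max_stage_utilisation(10, 800))   # 0.0125: at best 1.25% of cores busy
print(max_stage_utilisation(800, 800))  # 1.0: enough tasks to occupy every core
```

With repartition(10) in place, 790 of the 800 cores had nothing to do during the stage that took about 85% of the wall clock, which is in line with the 91% of compute reported as wasted.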
Sparklens: Third Pass
Driver WallClock     02m 34s   26%
Executor WallClock   07m 13s   74%
Total WallClock      09m 48s
Critical Path        07m 18s
Ideal Application    07m 09s
Using Sparklens
For inline processing, add the following extra command line options to spark-submit:
  --packages qubole:sparklens:0.2.0-s_2.11
  --conf spark.extraListeners=com.qubole.sparklens.QuboleJobListener
Old event log files (history server):
  --packages qubole:sparklens:0.2.0-s_2.11 --class com.qubole.sparklens.app.ReporterApp dummy-arg <eventLogFile> source=history
Special Sparklens output files (very small file with all the relevant data):
  --packages qubole:sparklens:0.2.0-s_2.11 --class com.qubole.sparklens.app.ReporterApp dummy-arg <eventLogFile>
https://github.com/qubole/sparklens
http://sparklens.qubole.net
Future Work
• Sparklens for Spark pipelines
• Elasticity-aware autoscaling