Spark Gotchas and Lessons Learned
Jen Waller, Ph.D.
Boulder/Denver Big Data Meetup
Feb 20, 2020
Boulder, CO
Overview
● Overall Dev Approach
● Useful Spark Built-Ins
● How to Fail at Scale
● Resource Utilization
“Strategery”
● Local machine; simulated
cluster
● Spark-shell/spark-submit
● Tiny subset of data (even
better: TDD w/
programmatically generated
data!)
● Real cluster
● Start tiny: test functions/configs
specific to the cloud
● Bigger cluster for load testing
● Spark-shell = handy for quick
iteration on manual cluster configs,
load testing one function at a time
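As a rough illustration of the "local machine, simulated cluster, programmatically generated data" approach, the sketch below runs a transformation against a tiny generated dataset in a `local[2]` session. The transformation `eventsPerUser`, the column names, and the object name are hypothetical and only stand in for whatever function is under test.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, count}

object LocalSmokeTest {
  // Hypothetical transformation under test: count events per user.
  def eventsPerUser(events: DataFrame): DataFrame =
    events.groupBy(col("user_id")).agg(count("*").as("n_events"))

  // local[2] simulates a two-core "cluster" inside a single JVM.
  def localSession(): SparkSession =
    SparkSession.builder()
      .master("local[2]")
      .appName("local-smoke-test")
      .config("spark.sql.shuffle.partitions", "1") // tiny data, one shuffle partition
      .config("spark.ui.enabled", "false")
      .getOrCreate()

  // Deterministic generated data: 100 events spread evenly over 10 users.
  def generatedEvents(spark: SparkSession): DataFrame =
    spark.range(0, 100)
      .select((col("id") % 10).as("user_id"), col("id").as("event_id"))
}
```

Because the input is generated, the expected result is known in advance (here, 10 users with 10 events each), which makes the check easy to automate before moving to a real cluster.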
What about notebooks?
By Sam Shere (1905–1982) - Zeppelin-ramp de Hindenburg / Hindenburg
zeppelin disaster, Public Domain,
https://commons.wikimedia.org/w/index.php?curid=19329337
Spark UI & Spark History Server
● Can access anywhere (local, cloud)
● Jobs/tasks, execution plans, memory
usage, configs
● Maximizing utility of metrics data
○ Label job groups and jobs via
sparkContext (setJobGroup, setJobDescription)
○ Break jobs and tasks apart by
repartitioning, or even by writing intermediate results to disk
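One way to label work so it is easier to find in the Spark UI is to wrap an action in a job group. This is a minimal sketch: `SparkContext.setJobGroup` and `clearJobGroup` are real Spark calls, while the helper `withJobLabel` and the object name are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

object LabeledJobs {
  // Runs `body` with a job group and description set on the current thread,
  // then clears the group so later jobs are not mislabeled.
  def withJobLabel[T](spark: SparkSession, group: String, description: String)(body: => T): T = {
    val sc = spark.sparkContext
    sc.setJobGroup(group, description)
    try body
    finally sc.clearJobGroup()
  }
}
```

For example, `LabeledJobs.withJobLabel(spark, "nightly-load", "count rows after dedup") { df.count() }` should show the description next to the resulting job in the Jobs tab.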
REST API & Metrics Sink(s)
● REST API
○ curl http://localhost:4040/api/v1/applications
● Can configure a set of sinks for:
○ Master, applications, worker, executor,
driver, shuffleService, applicationMaster
(YARN)
● And send metrics to:
○ Console, CSV file, JMX console, within
Spark UI as JSON, Graphite node, slf4j,
StatsD node
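The same endpoint that the curl command hits can be read from Scala. The sketch below only builds the URL and fetches the raw JSON; the object and method names are hypothetical, and it assumes the UI of a running application is reachable at the given base URL.

```scala
import scala.io.Source
import scala.util.Try

object SparkRest {
  // Builds the applications endpoint from a UI base URL such as http://localhost:4040.
  def applicationsUrl(uiBase: String): String =
    uiBase.stripSuffix("/") + "/api/v1/applications"

  // Fetches the raw JSON body; yields a Failure if the UI is not reachable.
  def fetchApplications(uiBase: String): Try[String] = Try {
    val source = Source.fromURL(applicationsUrl(uiBase), "UTF-8")
    try source.mkString
    finally source.close()
  }
}
```

Inside a running application, `spark.sparkContext.uiWebUrl` returns the base URL as an `Option[String]`, which avoids hard-coding port 4040.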
“It worked locally!”
Don’t Overload Data Store APIs
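Avoid full scans of all partitions:
import org.apache.spark.sql.functions.col
// Reads from the table root, then filters on the partition column
val df = spark
  .read
  .parquet("s3://mybucket/mydata")
  .filter(col("mycolumn").equalTo("someDate"))
You can still read in the data as partitioned without scanning the entire table:
// basePath marks the table root, so partition columns are kept in the schema
val df = spark
  .read
  .option("basePath", "s3://mybucket/mydata")
  .parquet("s3://mybucket/mydata/someDate")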
Use Built-In Optimizations for Reading Data
● Automatic detection of partitions and efficient data read
○ Provide the basePath option when reading in partitions
○ Always provide a schema to avoid repeated schema inference
● Columnar data: Parquet/ORC reader
○ Projection pushdown = only read the columns you need
○ Predicate/filter pushdown = use metadata to only read in the
rows you need
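The points above can be sketched as a single read: an explicit schema, a filter the Parquet reader can push down, and a narrow projection. The schema, column names, and threshold are hypothetical placeholders.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{DoubleType, StringType, StructField, StructType}

object PushdownRead {
  // Hypothetical schema; supplying it means Spark does not have to infer one.
  val schema: StructType = StructType(Seq(
    StructField("user_id", StringType, nullable = true),
    StructField("amount", DoubleType, nullable = true),
    StructField("country", StringType, nullable = true)
  ))

  def largeUsOrders(spark: SparkSession, path: String): DataFrame =
    spark.read
      .schema(schema)
      .parquet(path)
      // Simple comparisons like these are candidates for predicate pushdown.
      .filter(col("country") === "US" && col("amount") > 100.0)
      // Only the referenced columns need to be read from the files.
      .select("user_id", "amount")
}
```

Calling `.explain()` on the result is a quick way to check what happened: the scan node of the physical plan lists the pushed filters and the columns actually read.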
Beware the Shuffle!
● GroupBy, Join, Distinct…
● Amazon suggests avoiding shuffle
entirely.
● Do that! Find another way to aggregate
your data (e.g., aggregate it upstream in
Kafka/Kinesis/Flink, index it in
Elasticsearch - there are many good
options)
If you must shuffle… Know your data.
● Check for repeated values, nulls on join columns
○ Joining data with repeated values on both sides → gigantic result
○ Joining cols with nulls → massive skew.
■ Can “salt” nulls by pre-filling arbitrary values into empty cells
○ Cluster resource use could be throttling broadcast joins (check it!)
● Check for skew
○ Grouping by a skewed column → Spark partitions rows by key value, so every
row for a heavy value (level) lands in the same task
■ Application == dead (out of memory, network timeouts, lost nodes,
processes that never end)
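A minimal sketch of the checks described above: count null join keys, list the most repeated key values, and salt nulls so they no longer pile into one partition. All names are hypothetical, and the `__null_` prefix assumes no real key starts with that string.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{coalesce, col, concat, desc, floor, lit, rand}

object JoinKeyChecks {
  // Number of rows whose join key is null.
  def nullCount(df: DataFrame, key: String): Long =
    df.filter(col(key).isNull).count()

  // The n most frequent non-null key values with their row counts.
  def heaviestKeys(df: DataFrame, key: String, n: Int): Seq[(String, Long)] =
    df.filter(col(key).isNotNull)
      .groupBy(col(key)).count()
      .orderBy(desc("count"))
      .limit(n)
      .collect()
      .map(r => (String.valueOf(r.get(0)), r.getLong(1)))
      .toSeq

  // Adds "<key>_salted": the original key as a string, or "__null_<bucket>"
  // with a random bucket in [0, buckets) where the key is null.
  def saltNullKeys(df: DataFrame, key: String, buckets: Int): DataFrame = {
    val salt = concat(lit("__null_"), floor(rand() * buckets).cast("string"))
    df.withColumn(s"${key}_salted", coalesce(col(key).cast("string"), salt))
  }
}
```

Joining on the salted column keeps the semantics of an outer join on the original key as long as the other side contains no `__null_` values, since null keys would not have matched anyway.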
Controlling Spark Shuffles
● Partition your data so it’s spread evenly across the cluster
○ Partition by unique ID
○ Avoid partitioning on cols with a lot of nulls, missing or skewed values
● Partition data to match the job you’re running
○ Parallel transforms on many datasets: 200 partitions
○ Billions of pairwise comparisons: 4-10k partitions
○ Tests on single server/locally: 1 partition
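As a sketch of the partitioning advice above, the helpers below repartition by an ID column and report how many rows ended up in each partition, which is a quick evenness check. The object and method names are hypothetical; `repartition` and `spark_partition_id` are standard Spark calls.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, spark_partition_id}

object PartitionControl {
  // Hash-partitions the data by an (ideally unique) ID column.
  def repartitionById(df: DataFrame, idCol: String, numPartitions: Int): DataFrame =
    df.repartition(numPartitions, col(idCol))

  // Row count per partition ID; empty partitions do not appear in the map.
  def rowsPerPartition(df: DataFrame): Map[Int, Long] =
    df.groupBy(spark_partition_id().as("pid"))
      .count()
      .collect()
      .map(r => r.getInt(0) -> r.getLong(1))
      .toMap
}
```

The partition count used by shuffles in DataFrame operations is controlled separately, via `spark.sql.shuffle.partitions` (200 by default), so it may also need to match the job.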
Hacks for Shuffling Skewed Data
● Limit the job to a single level (value) of the skewed variable at a time
(i.e., run the levels serially).
● Manually set a small broadcast block size (spark.broadcast.blockSize) to fit
the size of the instance types in your cluster.
● Salt the data
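Salting is often done as a two-stage aggregation: first aggregate per (key, salt) so a hot key is split across several tasks, then combine the partial results per key. The sketch below shows this for a sum; the names are hypothetical, and the approach only applies to aggregations that can be combined from partials.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, floor, rand, sum}

object SaltedAggregation {
  // Sums `value` per `key`, spreading each key over `buckets` random salts first.
  def saltedSum(df: DataFrame, key: String, value: String, buckets: Int): DataFrame = {
    // Stage 1: a random salt in [0, buckets) splits heavy keys into smaller groups.
    val salted = df.withColumn("salt", floor(rand() * buckets).cast("int"))
    val partial = salted
      .groupBy(col(key), col("salt"))
      .agg(sum(col(value)).as("partial_sum"))
    // Stage 2: at most `buckets` partial rows per key remain, so this shuffle is small.
    partial.groupBy(col(key)).agg(sum(col("partial_sum")).as("total"))
  }
}
```

The result should match a plain `groupBy(key).agg(sum(value))`; the gain is that no single task has to process every row of the hottest key.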
Optimizing Resources
EMR Cluster Resource Gotchas
● By default, YARN assigns only one vCPU per executor.
● If maximizeResourceAllocation = true, you get only one executor on each
node (i.e., one YARN container/executor per machine).
● Poor use of resources.
● Lack of parallelism = bad for things that benefit from parallelism, like
broadcast joins.
This gets really messy if
multiple applications are
running on one cluster.
How to get Spark to use > 1 vCPU/machine?
Manually change memory
allocated to
executors/driver?
Nope.
How to get Spark to use > 1 vCPU/machine?
Change YARN configs!
Great! Except…
● Unless you manually set the
number of cores used by the
driver, it’s 1.
● Which is fine unless you switch
to larger instance types…
● Then you should manually
configure cluster resources.
Image credit: https://c2fo.io/c2fo/spark/aws/emr/2016/07/06/apache-spark-config-cheatsheet/
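Manually configuring cluster resources mostly comes down to arithmetic on the instance size. The sketch below is one possible heuristic, not the deck's or the cheatsheet's exact method: the defaults (5 cores per executor, 1 core and 1024 MB reserved per node, one executor slot left for the driver, driver cores equal to executor cores) are assumptions. The 10/11 factor leaves room for Spark's default memory overhead of 10% of executor memory, and ignores the 384 MB minimum that applies to small executors.

```scala
object ExecutorSizing {
  final case class Node(vcpus: Int, memoryMb: Int)
  final case class Plan(executorsPerNode: Int, executorCores: Int,
                        executorMemoryMb: Int, executorInstances: Int)

  def plan(node: Node, workerNodes: Int, coresPerExecutor: Int = 5,
           reservedCores: Int = 1, reservedMemoryMb: Int = 1024): Plan = {
    // Leave some cores and memory for the OS and node daemons (assumed amounts).
    val executorsPerNode = math.max(1, (node.vcpus - reservedCores) / coresPerExecutor)
    val containerMb = (node.memoryMb - reservedMemoryMb) / executorsPerNode
    // Container = heap + 10% overhead, so heap is container / 1.1.
    val heapMb = containerMb * 10 / 11
    // Assumption: keep one executor-sized slot free for the driver.
    val instances = math.max(1, executorsPerNode * workerNodes - 1)
    Plan(executorsPerNode, coresPerExecutor, heapMb, instances)
  }

  // Renders the plan as Spark properties (these keys are standard Spark configs).
  def toSparkConf(p: Plan): Map[String, String] = Map(
    "spark.executor.cores" -> p.executorCores.toString,
    "spark.executor.memory" -> s"${p.executorMemoryMb}m",
    "spark.executor.instances" -> p.executorInstances.toString,
    "spark.driver.cores" -> p.executorCores.toString // assumption: match executors
  )
}
```

For example, `ExecutorSizing.plan(ExecutorSizing.Node(16, 65536), workerNodes = 4)` gives 3 executors per node and 11 executor instances; the resulting properties can be passed with `--conf` on spark-submit and then checked against actual utilization.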
Summary
● Spark is awesome, but can be tricky.
● Read the docs! Use those helpful Spark built-ins.
● Avoid/manage shuffling.
● Use the Hadoop UI to check your resource utilization.
