Problem Solving Recipes
Learned from Supporting Spark
Justin Pihony & Stavros Kontopoulos
Lightbend
Table of Contents
1. OOM
2. NoSuchMethod
3. Perplexities of Size
4. Struggles in Speculation
5. Strategizing Your Joins
6. Safe Stream Recovery
7. Handling S3 w/o Hanging
Prologue
Memory Problems?
Out Of Memory (OOM)
“Thrown when the Java Virtual Machine cannot allocate an
object because it is out of memory, and no more memory
could be made available by the garbage collector.”
spark.memory.fraction vs spark.memory.storageFraction
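As a rough illustration of how the two settings interact, the sketch below computes the unified memory region and the storage share inside it. It assumes the unified memory model of Spark 1.6 and later, a fixed 300 MB reservation, and the Spark 2.x defaults (0.6 and 0.5); the object and method names are made up for this example.

```scala
object UnifiedMemorySketch {
  // Assumption: Spark sets aside a fixed 300 MB of heap before applying the fractions.
  val ReservedMb = 300L

  final case class Regions(unifiedMb: Long, storageMb: Long, executionMb: Long)

  // memoryFraction  ~ spark.memory.fraction         (assumed default 0.6)
  // storageFraction ~ spark.memory.storageFraction  (assumed default 0.5)
  def regions(heapMb: Long, memoryFraction: Double = 0.6, storageFraction: Double = 0.5): Regions = {
    val unified = ((heapMb - ReservedMb) * memoryFraction).toLong
    val storage = (unified * storageFraction).toLong
    Regions(unified, storage, unified - storage)
  }
}
```

For a 4 GB heap this gives roughly 2.2 GB shared between execution and storage. The storage share is the part of cached data that execution cannot evict, not a hard cap on caching.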
OOM Tips
• Don’t jump straight to parameter tuning
• Be aware of execution time object creation
rdd.mapPartitions { iterator =>
  // fetch remote file
}
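To make the object-creation point concrete, here is a minimal sketch in plain Scala (no Spark needed) of two functions that could be passed to mapPartitions. The `parse` helper is a hypothetical stand-in for whatever is done with the fetched data; the names are illustrative only.

```scala
object PartitionProcessing {
  // Hypothetical stand-in for an expensive per-record step.
  def parse(line: String): Int = line.length

  // Eager: materializes the whole partition on the heap before returning.
  def eager(records: Iterator[String]): Iterator[Int] =
    records.toList.map(parse).iterator

  // Streaming: transforms one record at a time as the consumer pulls from the iterator.
  def streaming(records: Iterator[String]): Iterator[Int] =
    records.map(parse)
}
```

Both return the same values, so `rdd.mapPartitions(PartitionProcessing.streaming)` is a drop-in replacement for the eager form, but only the eager one holds an entire partition in memory at once.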
OOM Tips
• Plan the resources needed from your cluster manager before deploying - when possible
YARN
– Cluster vs client mode
– yarn.nodemanager.resource.memory-mb
– yarn.scheduler.minimum-allocation-mb
– spark.yarn.driver.memoryOverhead
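A back-of-the-envelope sketch of how these settings combine into the container size YARN actually grants. It assumes the overhead default of max(384 MB, 10% of the heap) and that the scheduler rounds requests up to a multiple of yarn.scheduler.minimum-allocation-mb; both vary by Spark version and YARN scheduler, so treat the numbers as an estimate.

```scala
object YarnContainerSketch {
  // Assumed default for spark.yarn.driver.memoryOverhead: max(384 MB, 10% of the heap).
  def overheadMb(heapMb: Long, factor: Double = 0.10, minMb: Long = 384L): Long =
    math.max((heapMb * factor).toLong, minMb)

  // Assumed: the request is rounded up to a multiple of yarn.scheduler.minimum-allocation-mb.
  def containerMb(heapMb: Long, minAllocationMb: Long): Long = {
    val requested = heapMb + overheadMb(heapMb)
    ((requested + minAllocationMb - 1) / minAllocationMb) * minAllocationMb
  }
}
```

Under these assumptions a 4096 MB driver with a 1024 MB minimum allocation occupies a 5120 MB container, which has to fit within yarn.nodemanager.resource.memory-mb on the node.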
EXAMPLE
OOM
// Repartitioning may cause issues...
// The goal here is a single output file, but repartition(1) moves every row into one partition
private def saveToFile(dataFrame: DataFrame,
    filePath: String): Unit = {
  dataFrame.repartition(1).write.
    format("com.databricks.spark.csv")...save(filePath)
}
NoSuchMethod
“Thrown if an application tries to call a specified method of a
class (either static or instance), and that class no longer
has a definition of that method. Normally, this error is
caught by the compiler; this error can only occur at run time
if the definition of a class has incompatibly changed.”
Dependency Collision
java.lang.NoSuchMethodError: org.apache.commons.math3.util.MathArrays.natural
“...spark 1.4.1 uses math3 3.1.1 which, as it turns out,
doesn't have the natural method.”
Compiled against version A
Runtime used version B
Version B did not have the method!
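One way to confirm which version won at runtime is to ask the JVM where a class was loaded from. The helper below is a small sketch using only standard JDK calls; the object name is made up for this example.

```scala
object WhichJar {
  // Returns the jar or directory a class was loaded from, or None when the JVM
  // reports no code source (for example, JDK bootstrap classes).
  def locationOf(cls: Class[_]): Option[String] =
    for {
      domain <- Option(cls.getProtectionDomain)
      source <- Option(domain.getCodeSource)
      url    <- Option(source.getLocation)
    } yield url.toString
}
```

Calling `WhichJar.locationOf(classOf[org.apache.commons.math3.util.MathArrays])` on the driver, or inside a task for the executors, should show whether the class came from your jar or from the copy bundled with Spark.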
Solutions
• Shade your lib
– sbt: https://github.com/sbt/sbt-assembly#shading
– Maven: https://maven.apache.org/plugins/maven-shade-plugin/
• Upgrade Spark or downgrade your library
– If you’re lucky...
• Enforce lib load order
spark-submit --class "MAIN_CLASS" \
  --driver-class-path commons-math3-3.3.jar YOURJAR.jar
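For the shading option, a minimal build.sbt fragment might look like the following. It assumes the sbt-assembly plugin is already enabled and uses the sbt 1.x slash syntax (older builds write `assemblyShadeRules in assembly`); the target package name is arbitrary and chosen only for illustration.

```scala
// build.sbt (assumes sbt-assembly is declared in project/plugins.sbt)
assembly / assemblyShadeRules := Seq(
  // Relocate commons-math3 inside the fat jar so it cannot collide with Spark's copy.
  // "shaded.commons.math3" is an illustrative target package, not a required name.
  ShadeRule.rename("org.apache.commons.math3.**" -> "shaded.commons.math3.@1").inAll
)
```

With a rule like this the application's bytecode is rewritten to reference the relocated package, so the version on Spark's classpath no longer matters for your own calls.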
Perplexities of Size
https://josephderosa.files.wordpress.com/2016/04/too-big-by-half.jpg
Struggles in Speculation
• spark.speculation.interval
• spark.speculation.multiplier
• spark.speculation.quantile
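The sketch below is a simplified model of how the quantile and multiplier settings decide whether a running task gets a speculative copy. It assumes the documented defaults (quantile 0.75, multiplier 1.5) and leaves out details such as Spark's minimum threshold; spark.speculation.interval only controls how often a check like this runs. All names are illustrative.

```scala
object SpeculationSketch {
  // Simplified model: a task is a speculation candidate once enough tasks of the
  // stage have finished and it has run longer than multiplier * median duration.
  def shouldSpeculate(
      runningForMs: Long,
      finishedDurationsMs: Seq[Long],
      totalTasks: Int,
      quantile: Double = 0.75,   // spark.speculation.quantile (assumed default)
      multiplier: Double = 1.5   // spark.speculation.multiplier (assumed default)
  ): Boolean = {
    val minFinished = math.floor(quantile * totalTasks).toInt
    if (finishedDurationsMs.isEmpty || finishedDurationsMs.size < minFinished) false
    else {
      val sorted = finishedDurationsMs.sorted
      val median = sorted(sorted.size / 2)
      runningForMs > multiplier * median
    }
  }
}
```

With three of four tasks finished in about 110 ms, a task still running after 200 ms would be flagged, while one at 150 ms would not. Lowering the multiplier or the quantile makes speculation more aggressive and can waste cluster resources.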
Strategizing Your Joins
Common Issues
• Slow Joins
• Unavoidable Joins
Slow Joins
• Avoid shuffling if one side of the join is small enough
– size < SQLConf.AUTO_BROADCASTJOIN_THRESHOLD
val df = largeDF.join(broadcast(smallDF), "key")
• Check which strategy is actually used
df.explain
df.queryExecution.executedPlan
• For broadcast you should see:
== Physical Plan ==
BroadcastHashJoin ... BuildRight
...
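To show why a broadcast join avoids a shuffle, here is a plain-Scala sketch of the idea: the small side is turned into an in-memory hash table once, and the large side is streamed past it where it already sits. This models the technique only and is not Spark's implementation; the names are made up.

```scala
object BroadcastHashJoinSketch {
  // Inner join: build a hash table from the small side, then probe it per large-side row.
  def join[K, A, B](large: Iterator[(K, A)], small: Seq[(K, B)]): Iterator[(K, (A, B))] = {
    val table: Map[K, Seq[B]] =
      small.groupBy(_._1).map { case (k, rows) => k -> rows.map(_._2) }
    large.flatMap { case (k, a) =>
      table.getOrElse(k, Seq.empty).iterator.map(b => (k, (a, b)))
    }
  }
}
```

The large side is never regrouped by key, which is the saving. The cost is that the whole small side must fit in memory on every executor, hence the size threshold.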
Join Strategies
• Shuffle hash join
• Broadcast
• Sort merge
– spark.sql.join.preferSortMergeJoin (default = true)
Unavoidable Joins
import org.apache.spark.sql.functions.{first, max}
import spark.implicits._

val df = spark.sparkContext.parallelize(
  List(("Id1", 10, "London"), ("Id2", 20, "Paris"), ("Id2", 1, "NY"), ("Id2", 20, "London"))
).toDF("GroupId", "Amount", "City")
val grouped = df.groupBy("GroupId").agg(max("Amount"), first("City"))
grouped.collect().foreach(println)
[Id1,10,London]
[Id2,20,Paris]
Joining is the only way to retain all related columns for max after groupBy
val joined = df.as("a").join(grouped.as("b"),
  $"a.GroupId" === $"b.GroupId" && $"a.Amount" === $"b.max(Amount)", "inner")
joined.collect().foreach(println)
[Id1,10,London,Id1,10,London]
[Id2,20,Paris,Id2,20,Paris]
[Id2,20,London,Id2,20,Paris]
Safe Stream Recovery
val products = ssc.cassandraTable[(Int, String, Float)](...)
  .map { case (id, name, price) => (id, (name, price)) }
  .cache
orders.transform(rdd => {
  rdd.join(products)
})
....
ERROR JobScheduler: Error running job streaming job
org.apache.spark.SparkException: RDD transformations and actions can
only be invoked by the driver, not inside of other transformations; for
example, rdd1.map(x => rdd2.values.count() * x) is invalid because the
values transformation and count action cannot be performed inside of
the rdd1.map transformation. For more information, see SPARK-5063.
orders.transform(rdd => {
  // Build the products RDD from the batch's own SparkContext on every interval
  val products = rdd.sparkContext.cassandraTable(…).map(…)
  rdd.join(products)
})
orders.transform { rdd =>
  val sc = rdd.sparkContext
  // Reuse the cached RDD if one is still registered under this name
  val productsOption = sc.getPersistentRDDs
    .values.find(_.name == "foo")
  val products = productsOption match {
    case Some(persistedRDD) =>
      persistedRDD.asInstanceOf[CassandraTableScanRDD[…]]
    case None =>
      val productsRDD = sc.cassandraTable(…).map(…)
      productsRDD.setName("foo")
      productsRDD.cache
  }
  rdd.join(products)
}
Handling S3 w/o Hanging
Simple Streaming, Right???
val ssc = new StreamingContext(sc, Seconds(10))
val lines = ssc.textFileStream("s3n://some/folder")
lines.print
Wrong!!!
Solutions
• Increase the heap
• Explicit file handling
– http://stackoverflow.com/a/34681877/779513
• Custom s3n handler
• Pace it with a script
• Complain
THANK YOU.
justin.pihony@lightbend.com
stavros.kontopoulos@lightbend.com

Spark Summit EU Supporting Spark (Brussels 2016)

  • 1. Problem Solving Recipes Learned from Supporting Spark Justin Pihony & Stavros Kontopoulos Lightbend
  • 2. Table of Contents 1. OOM 2. NoSuchMethod 3. Perplexities of Size 4. Struggles in Speculation 5. Strategizing Your Joins 6. Safe Stream Recovery 7. Handling S3 w/o Hanging
  • 5. Out Of Memory (OOM) “Thrown when the Java Virtual Machine cannot allocate an object because it is out of memory, and no more memory could be made available by the garbage collector.”
  • 6.
  • 8. OOM Tips
    • Don’t jump straight to parameter tuning
    • Be aware of execution time object creation
      rdd.mapPartitions { iterator =>
        // fetch remote file
      }
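As a hedged illustration of the second tip on slide 8: the function passed to `rdd.mapPartitions` receives an `Iterator`, and whatever it allocates while running counts against executor memory. The sketch below is plain Scala with no Spark dependency. The names `PartitionSketch`, `parseLazily`, and `parseEagerly` are hypothetical, and it assumes the concern is buffering a whole partition rather than streaming it.

```scala
// Hypothetical stand-ins for the body of rdd.mapPartitions { iterator => ... }.
object PartitionSketch {
  // Streams records one at a time: only the current element is held in memory.
  def parseLazily(records: Iterator[String]): Iterator[Int] =
    records.map(_.trim.length)

  // Materializes the whole partition first: memory grows with partition size.
  def parseEagerly(records: Iterator[String]): Iterator[Int] =
    records.toList.map(_.trim.length).iterator
}
```

Both functions return the same values; the difference is how much of the partition is alive on the heap at once, which is the kind of execution-time allocation that parameter tuning alone does not fix.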
  • 9. OOM Tips
    • Plan the resources needed from your cluster manager before deploying, when possible
    Example: YARN
      – Cluster vs client mode
      – yarn.nodemanager.resource.memory-mb
      – yarn.scheduler.minimum-allocation-mb
      – spark.yarn.driver.memoryOverhead
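To make the YARN settings on slide 9 concrete, here is a rough sizing sketch in plain Scala. It rests on two assumptions that are not stated on the slide: that the memory overhead defaults to 10% of the requested heap with a 384 MB floor (the documented default for Spark around the time of the talk), and that YARN rounds each request up to a multiple of `yarn.scheduler.minimum-allocation-mb`. The object and method names are hypothetical.

```scala
object YarnSizingSketch {
  // Assumed defaults: overhead is 10% of the heap, never less than 384 MB.
  val OverheadFactor = 0.10
  val OverheadMinMb = 384

  // Off-heap overhead added on top of the requested heap, in MB.
  def overheadMb(heapMb: Int): Int =
    math.max((OverheadFactor * heapMb).toInt, OverheadMinMb)

  // Container size after (assumed) rounding up to a multiple of
  // yarn.scheduler.minimum-allocation-mb.
  def containerMb(heapMb: Int, minAllocationMb: Int): Int = {
    val requested = heapMb + overheadMb(heapMb)
    ((requested + minAllocationMb - 1) / minAllocationMb) * minAllocationMb
  }
}
```

Under these assumptions a 4096 MB driver heap requests 4096 + 409 = 4505 MB, which a 1024 MB minimum allocation rounds up to 5120 MB. That container still has to fit under `yarn.nodemanager.resource.memory-mb` on some node, and in cluster mode the driver competes for the same pool as the executors.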
  • 10. OOM
    // Re-partitioning may cause issues...
    // Here the target is to have one file as output, but...
    private def saveToFile(dataFrame: DataFrame, filePath: String): Unit = {
      dataFrame.repartition(1).write.
        format("com.databricks.spark.csv")...save(filePath)
    }
  • 12. NoSuchMethod “Thrown if an application tries to call a specified method of a class (either static or instance), and that class no longer has a definition of that method. Normally, this error is caught by the compiler; this error can only occur at run time if the definition of a class has incompatibly changed.”
  • 13. java.lang.NoSuchMethodError: org.apache.commons.math3.util.MathArrays.natural
    “...spark 1.4.1 uses math3 3.1.1 which, as it turns out, doesn't have the natural method.”
    Dependency Collision
      Compiled against version A
      Runtime used version B
      Version B did not have the method!
  • 14. Solutions
    • Shade your lib
      – sbt: https://github.com/sbt/sbt-assembly#shading
      – Maven: https://maven.apache.org/plugins/maven-shade-plugin/
    • Upgrade Spark or downgrade your library
      – If you’re lucky...
    • Enforce lib load order
      spark-submit --class "MAIN_CLASS" --driver-class-path commons-math3-3.3.jar YOURJAR.jar
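For the "Shade your lib" option on slide 14, a minimal `build.sbt` fragment might look like the following. It assumes the sbt-assembly plugin is already enabled in `project/plugins.sbt`, and the target package `shaded.commons.math3` is a hypothetical name chosen for illustration. Newer sbt versions spell the key as `assembly / assemblyShadeRules`.

```scala
// build.sbt fragment (sbt-assembly must be on the build's plugin classpath).
// Rewrites commons-math3 classes, and references to them, into a private
// package inside the fat jar so Spark's bundled math3 no longer collides.
// "shaded.commons.math3" is a hypothetical package name.
assemblyShadeRules in assembly := Seq(
  ShadeRule.rename("org.apache.commons.math3.**" -> "shaded.commons.math3.@1").inAll
)
```

With a rule like this, the application calls its own relocated copy of `MathArrays`, so the version Spark loads at run time should no longer matter for that code path.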
  • 19. Struggles in Speculation
    • spark.speculation.interval
    • spark.speculation.multiplier
    • spark.speculation.quantile
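How the three settings on slide 19 interact can be sketched as a small decision function. This is a simplified model in plain Scala, not Spark's scheduler code: it assumes the check runs once per `spark.speculation.interval`, that speculation only starts once a `quantile` fraction of the stage's tasks has finished, and that a task is a candidate when it has run longer than `multiplier` times the median finished duration. The default values shown (0.75 and 1.5) are the documented defaults from around the time of the talk; the object and method names are hypothetical.

```scala
object SpeculationSketch {
  // Simplified model of the periodic speculation check; not Spark's actual code.
  def shouldSpeculate(
      runningMs: Long,          // how long the candidate task has been running
      finishedMs: Seq[Long],    // durations of tasks that already succeeded
      totalTasks: Int,          // number of tasks in the stage
      quantile: Double = 0.75,  // spark.speculation.quantile
      multiplier: Double = 1.5  // spark.speculation.multiplier
  ): Boolean = {
    // Speculation is not considered until enough tasks have completed.
    val minFinished = math.floor(quantile * totalTasks).toInt
    if (finishedMs.isEmpty || finishedMs.size < minFinished) false
    else {
      val sorted = finishedMs.sorted
      val median = sorted(sorted.size / 2)
      // A task is a straggler when it exceeds multiplier x median duration.
      runningMs > multiplier * median
    }
  }
}
```

Read this way, raising the quantile delays speculation until more of the stage is done, and raising the multiplier makes the scheduler more tolerant of slow tasks before it launches a duplicate attempt.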
  • 21. Common Issues
    • Slow Joins
    • Unavoidable Joins
  • 22. Slow Joins
    • Avoid shuffling if one side of the join is small enough
      val df = largeDF.join(broadcast(smallDF), "key")
    • Check which strategy is actually used
      df.explain
      df.queryExecution.executedPlan
    • For broadcast you should see:
      == Physical Plan ==
      BroadcastHashJoin ... BuildRight ...
  • 23. Join Strategies
    • Shuffle hash join
    • Broadcast
      – size < SQLConf.AUTO_BROADCASTJOIN_THRESHOLD
    • Sort merge
      – spark.sql.join.preferSortMergeJoin (default = true)
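The two knobs named on slide 23 can be set at run time. The fragment below is a sketch meant for `spark-shell` or a Scala script, assuming Spark 2.x on the classpath. The app name, the local master, and the 50 MB value are placeholders chosen for illustration, and `spark.sql.autoBroadcastJoinThreshold` is the configuration key behind `SQLConf.AUTO_BROADCASTJOIN_THRESHOLD`.

```scala
import org.apache.spark.sql.SparkSession

// Placeholder session; in spark-shell a `spark` value already exists.
val spark = SparkSession.builder()
  .appName("join-strategy-sketch")
  .master("local[*]")
  .getOrCreate()

// Raise the automatic broadcast limit from its 10 MB default to 50 MB
// (illustrative value; the broadcast side must still fit in memory).
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 50L * 1024 * 1024)

// Setting the threshold to -1 would disable automatic broadcast joins instead.
// spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1L)

// Allow the planner to pick shuffle hash join where it qualifies,
// rather than always preferring sort merge join.
spark.conf.set("spark.sql.join.preferSortMergeJoin", "false")
```

After changing either setting, `df.explain` remains the way to confirm which strategy the planner actually chose.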
  • 24. Unavoidable Joins
    val df = spark.sparkContext.parallelize(
      List(("Id1", 10, "London"), ("Id2", 20, "Paris"), ("Id2", 1, "NY"), ("Id2", 20, "London"))
    ).toDF("GroupId", "Amount", "City")
    val grouped = df.groupBy("GroupId").agg(max("Amount"), first("City"))
    grouped.collect().foreach(println)
    [Id1,10,London]
    [Id2,20,Paris]
    Joining is the only way to retain all related columns for max after groupBy
    val joined = df.as("a").join(grouped.as("b"),
      $"a.GroupId" === $"b.GroupId" && $"a.Amount" === $"b.max(Amount)", "inner")
    joined.collect().foreach(println)
    [Id1,10,London,Id1,10,London]
    [Id2,20,Paris,Id2,20,Paris]
    [Id2,20,London,Id2,20,Paris]
  • 26.
    val products = ssc.cassandraTable[(Int, String, Float)](...)
      .map { case (id, name, price) => (id, (name, price)) }
      .cache
    orders.transform(rdd => {
      rdd.join(products)
    })
  • 28. ....
  • 29. ERROR JobScheduler: Error running job streaming job org.apache.spark.SparkException: RDD transformations and actions can only be invoked by the driver, not inside of other transformations; for example, rdd1.map(x => rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063.
  • 33.
    orders.transform(rdd => {
      val products = rdd.sparkContext.cassandraTable(…).map(…)
      rdd.join(products)
    })
  • 37. Safe Stream Recovery
    orders.transform { rdd =>
      val sc = rdd.sparkContext
      // Reuse the cached RDD if one named "foo" is already persisted
      val productsOption = sc.getPersistentRDDs
        .values.filter(rdd => rdd.name == "foo").headOption
      val products = productsOption match {
        case Some(persistedRDD) =>
          persistedRDD.asInstanceOf[CassandraTableScanRDD[…]]
        case None => {
          // Otherwise build it inside the transform, name it, and cache it
          val productsRDD = sc.cassandraTable(…).map(…)
          productsRDD.setName("foo")
          productsRDD.cache
        }
      }
      rdd.join(products)
    }
  • 39. Handling S3 w/o Hanging
  • 40. Simple Streaming
    val ssc = new StreamingContext(sc, Seconds(10))
    val lines = ssc.textFileStream("s3n://some/folder")
    lines.print
  • 41. Simple Streaming, Right???
    val ssc = new StreamingContext(sc, Seconds(10))
    val lines = ssc.textFileStream("s3n://some/folder")
    lines.print
  • 43.
  • 44. Solutions
    • Increase the heap
    • Explicit file handling
      – http://stackoverflow.com/a/34681877/779513
    • Custom s3n handler
    • Pace it with a script
    • Complain