Building a Product 
with Spark 
Common Technical and Performance 
Problems with Solutions
Format of Talk 
● Common spark exceptions with debugging 
tips 
● My personal use-case for spark and 
resulting cluster preferences 
● Optimizing code using Functional 
Programming 
● Data Science and Productization
Dreaded OOMs; heap space & GC overhead limit exceeded 
● INCREASE YOUR NUM PARTITIONS!!! Spark is not 
Hadoop, ergo more partitions does not necessarily cost 
extra resources. “Typically you want 2-4 slices for each 
CPU in your cluster” - Spark documentation, “100 - 
10,000 is reasonable” - Matei Zaharia 
● Set your memory fractions according to the job. E.g. if 
you do not use the cache, then set the cache fraction to 
0.0. 
● Ensure you're using the memory on each node of the 
cluster, not just the default 512m - check this in the UI. 
● Refactor code (see later)
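
A minimal sketch of the first two bullets above (Spark 1.x property names; the values are illustrative assumptions, not recommendations):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("my-job")
  .set("spark.executor.memory", "4g")          // use the node's RAM, not the 512m default
  .set("spark.default.parallelism", "200")     // 2-4 slices per CPU in the cluster
  .set("spark.storage.memoryFraction", "0.0")  // this job never calls cache()
val sc = new SparkContext(conf)

// Or raise the partition count per-RDD rather than cluster-wide:
val lines = sc.textFile("hdfs:///data/input", minPartitions = 1000)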
Shuffle File Problems - inodes 
● If you get java.io.IOException: No space left on device 
while df -h shows you have space, then check your 
inodes with df -i. 
● Shuffles by default create O(M * R) shuffle files, where 
M is the number of map tasks and R the number of 
reduce tasks. Solutions (by preference): 
A. Research “consolidating files”; decreases to O(R) 
B. Decrease R, and decrease M using coalesce with 
shuffle = false (sketch after this list) 
C. Get DevOps to increase inodes of FS
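
A hedged sketch of option B (names and numbers are illustrative only): decrease M with a shuffle-free coalesce, then decrease R by passing an explicit partition count to the reducing operation:

import org.apache.spark.rdd.RDD

val records: RDD[(String, Int)] =
  sc.textFile("hdfs:///data/big").map(line => (line, 1))

val fewerMaps = records.coalesce(200, shuffle = false) // M: many -> 200, no shuffle
val summed = fewerMaps.reduceByKey(_ + _, 50)          // R = 50 reduce tasks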
Shuffle File Problems - ‘Too many open 
files’ 
● Shuffles open O(C * R) shuffle files, where C is your 
number of cores. Solutions (by preference): 
A. Get DevOps to increase ulimit 
B. Decrease R 
● Consolidating files will not help
General Debugging/Troubleshooting Tips 
Don’t Always Trust the Message 
(skip) 
Some exceptions have been symptomatic of an OOM, a space/inode problem, or 
something seemingly unrelated to the stack trace. E.g. 
● making certain exceptions more meaningful is an open 
improvement (SPARK-3052) 
● some exceptions can 
be symptomatic of "no space left on device" 
● Initial job has not accepted any resources even though everything seems fine can be 
caused by accidentally configuring your RAM or Cores too high 
Use memory, JVM memory, space and inode monitoring, and dig around in 
the worker logs when exceptions are not clear. Ask on SO!
Notes on Broadcast Variables and the Driver Process 
● Broadcast variables are great for saving memory as 
they have one in-memory copy per node, whereas 
free variables captured in closures have one copy per 
task (e.g. huge hash maps and hash sets) 
● If you need to create a large broadcast variable then 
remember that your driver JVM will need to be 
spawned with increased RAM 
● Symptoms of forgetting this are 
○ Hangs 
○ OOMs
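
A minimal sketch (the lookup table and helper are made-up illustrations):

val bigLookup: Map[String, Int] = loadLookupTable() // hypothetical helper returning a huge map
val lookupBc = sc.broadcast(bigLookup)              // one in-memory copy per node, not per task

// records: RDD[(String, Int)] is assumed to exist
val tagged = records.map { case (k, v) => (k, v + lookupBc.value.getOrElse(k, 0)) }

// bigLookup is built on the driver, so launch with increased driver RAM, e.g.
//   spark-submit --driver-memory 8g ...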
Cluster Setup Problems & Dependency/Versioning Hell 
(rush through this) 
My personal experience/use-case: 
● Only wanted permanent production and research clusters that work with 
other Big Data tools/clusters/Hadoop-ecosystem components like 
HBase, Cassandra, HDFS, etc. for regular hourly/daily pipelines and 
ongoing data exploration 
● Used the spark-shell, and built fat jars for complex applications 
● Worked with 4 spark clusters over my first 6 - 9 months with Spark 
For the above I’ve come to have the following preferences: (Links on last slide) 
● use Cloudera packages to install Spark (via apt-get or similar) 
● with the Cloudera jar dependencies (e.g. 1.0.0-cdh5.1.2) 
● On Ubuntu Server 14.04 (apt-get > yum, bash >>>>>>>> cygwin or Posh) 
● On EC2 (not EMR), (powerful cloud features, super-cheap spot-instances)
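
For the jar dependencies, a sketch of the sbt wiring (the resolver URL and version string are assumptions; see the CDH packaging docs under Further Reading):

resolvers += "Cloudera" at "https://repository.cloudera.com/artifactory/cloudera-repos/"

libraryDependencies +=
  "org.apache.spark" %% "spark-core" % "1.0.0-cdh5.1.2" % "provided"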
Spark Configurables - Alternative defaults for increased stability 
● spark.executor.memory / spark.cores.max: whatever your cluster can support 
● spark.shuffle.consolidateFiles = true (for ext4 and xfs, which are common) 
● spark.cleaner.ttl = 86400 (to avoid the space issues mentioned earlier) 
● spark.akka.frameSize = 500 (prevents several errors) 
● spark.akka.askTimeout = 100 (seems to help with "high CPU/IO load") 
● spark.task.maxFailures = 16 (keep trying!) 
● spark.worker.timeout = 150 (a wee hack around GC hangs? NOTE: I can't find this in the latest docs) 
● spark.serializer = org.apache.spark.serializer.KryoSerializer (requires additional faff to get right, apparently; uncertain why it's not the default?!)
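
The same list as code, a sketch only (these are Spark 1.x property names taken straight from the list above; re-validate against your version):

val conf = new SparkConf()
  .set("spark.shuffle.consolidateFiles", "true")
  .set("spark.cleaner.ttl", "86400")
  .set("spark.akka.frameSize", "500")
  .set("spark.akka.askTimeout", "100")
  .set("spark.task.maxFailures", "16")
  .set("spark.worker.timeout", "150")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")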
Code level optimizations with FP 
● Sometimes you cannot fix a problem by 
tuning the cluster/job config 
● Sometimes you just need to refactor your 
code!
Definition - Strongly Pipelineable 
Let x: RDD[T], and let M := ⟨m1, …, mn⟩ be a valid sequence of 
method invocations on x with parameters, i.e. 
“x.m1. … .mn” is a valid line of code, then we say the 
sequence M is strongly pipelineable if and only if there 
exists a function f such that “x.flatMap(f)” is an 
‘equivalent’ line of code. 
Definition - Weakly Pipelineable 
… and M is weakly pipelineable if and only if there exists a 
function f such that “x.mapPartitions(f)” is an ‘equivalent’ 
line of code.
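
To make the definition concrete, a small sketch (x, f and p are assumed placeholders): the chain x.map(f).filter(p) strongly pipelines, because a single flatMap is an equivalent line of code:

def pipelined(x: RDD[Int], f: Int => Int, p: Int => Boolean): RDD[Int] =
  x.flatMap { y =>
    val z = f(y)
    if (p(z)) List(z) else Nil // behaves exactly like x.map(f).filter(p)
  }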
Practical Application 
● For both definitions such a sequence of invocations will 
not result in shuffle operations (so no net IO) 
● For Strong Pipelining we have the freedom to insert 
coalesce operations (see the sketch after this list); similarly, 
we could define a custom Partitioner that is like an 
"inverseCoalesce" to split partitions but preserve locality 
● With Weak Pipelining we do not ipso facto have the 
freedom to coalesce/inverseCoalesce but we can be 
sure the task will consist of a single stage
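
A sketch of that freedom (x, f, p and g are assumed placeholders): under strong pipelining a shuffle-free coalesce can be slotted anywhere in the chain without introducing a stage boundary:

val out = x.map(f).coalesce(100, shuffle = false).filter(p).flatMap(g)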
Theorem A - Strongly Pipelineable Methods 
… If M is a sequence constructed from any of the methods 
map, filter, flatMap, mapValues, flatMapValues or sample 
then M strongly pipelines if we can assume 
(1) RDD with flatMap is a “Monad”* 
*We need scare quotes here because we must abuse 
notation and consider TraversableOnce and RDD as the 
same type
Theorem B - Weakly Pipelineable Methods 
… If M is a sequence constructed from any of the methods 
map, filter, flatMap, mapValues, flatMapValues, sample, 
and additionally mapPartitions or glom then M weakly 
pipelines if we can assume 
(2) RDD with mapPartitions is a Functor
Example 1 
val docIdWords: RDD[(Int, MultiSet[String])] 
// 2 stage job 
val docIdToIndexesAndNum = docIdWords.mapValues(_.map(getIndex)) 
.reduceByKey(_ ++ _) 
.mapValues(indexes => (indexes, indexes.size)) 
// 1 stage job 
val docIdToIndexesAndNum = docIdWords.mapValues(_.map(getIndex)) 
.mapValues(indexes => (indexes, indexes.size)) 
.reduceByKey(addEachTupleElementTogether)
Example 2 
val xvalResults: RDD[(Classification, Confusion)] 
// Induces 2 stages 
val truePositives = xvalResults.filter(_._2 == TP).count() 
val trueNegatives = xvalResults.filter(_._2 == TN).count() 
// Induces 1 stage (and still easily readable!) 
val (truePositives, trueNegatives) = xvalResults.map(_._2).map { 
case `TP` => (1, 0) 
case `TN` => (0, 1) 
case _ => (0, 0) // other confusion outcomes count toward neither 
} 
.reduce(addEachTupleElementTogether) 
Seems addEachTupleElementTogether is a recurring theme?! Hmm ...
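
One plausible definition of that recurring helper; the slides never show it, so this is an assumption (with Scalaz in scope it is just the tuple monoid, _ |+| _):

def addEachTupleElementTogether(a: (Int, Int), b: (Int, Int)): (Int, Int) =
  (a._1 + b._1, a._2 + b._2)

Example 1 needs the analogous version for (MultiSet[String], Int), combining the first elements with ++ and the second with +.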
Proof of A 
By (1); 
x.flatMap(f).flatMap(g) = x.flatMap(y => f(y).flatMap(g)) 
for any f and g, then by induction on n it follows that if all mi 
are flatMaps then the theorem holds. 
Therefore it remains to show that map, filter, mapValues, 
flatMapValues and sample can be written in terms of 
flatMap. 
Firstly observe sample(fraction) = filter(r) where r 
randomly returns true with probability fraction
Now we can write x.filter(f) for any f as follows: 
x.flatMap { 
case y if f(y) => List(y) 
case _ => Nil 
} 
So filter and sample can be written in terms of flatMap. 
For map, for any f; 
x.map(f) = x.flatMap(y => List(f(y))) 
Then we leave mapValues and flatMapValues as exercises. 
QED
Proof of B 
Firstly by (2); for any f and g 
x.mapPartitions(f).mapPartitions(g) 
= x.mapPartitions(g compose f) 
Now by observing 
x.flatMap(f) = x.mapPartitions(_.flatMap(f)) 
and the equivalences we noted in the previous theorem, the 
rest of this proof should be obvious. QED
BY THE POWER OF MONOID!!! 
import scalaz.Monoid

case class MaxMin(max: Int, min: Int)
object MaxMin {
  def apply(pair: (Int, Int)): MaxMin = MaxMin(pair._1, pair._2)
  def apply(i: Int): MaxMin = MaxMin(i, i)

  implicit object MaxMinMonoid extends Monoid[MaxMin] {
    def zero: MaxMin = MaxMin(Int.MinValue, Int.MaxValue)
    def append(mm1: MaxMin, mm2: => MaxMin): MaxMin = MaxMin(
      if (mm1.max > mm2.max) mm1.max else mm2.max,
      if (mm1.min < mm2.min) mm1.min else mm2.min
    )
  }
}
Example 3 
import scalaz._, Scalaz._ // for |+|

implicit class PimpedIntRDD(rdd: RDD[Int]) {
  // Twice as fast as doing rdd.top(1) - rdd.top(1)(reverseOrd)
  def range: Int = {
    val MaxMin(max, min) = rdd.map(MaxMin.apply).reduce(_ |+| _)
    max - min
  }
  // Similarly faster than doing rdd.reduce(_ + _) / rdd.count()
  def average: Double = {
    val (sum, total) = rdd.map((_, 1)).reduce(_ |+| _)
    sum.toDouble / total
  }
  // TODO Using generics and type-classes much of what we have done
  // can be extended to include all Numeric types easily
}
Example 4 
val rdd: RDD[(MyKey, Int)] = ... 
// BAD! Shuffles a lot of data and traverses each group twice 
val ranges: RDD[(MyKey, Int)] = 
  rdd.groupByKey().mapValues(seq => seq.max - seq.min) 
// GOOD! 
val ranges = rdd.mapValues(MaxMin.apply) 
  .reduceByKey(_ |+| _) 
  .mapValues(mm => mm.max - mm.min) 
Let B be the number of bytes shuffled across the network, 
let N be num records, K num keys, and W num worker nodes, 
and let X be the size of a monoid element relative to a base record 
(e.g. here a MaxMin is ~twice the size of the records). 
Then in general for a groupByKey, B is O(N(1 - 1/W)); 
for a "monoidally refactored" reduceByKey, B is O((W - 1)KX). 
E.g. for the above, if N = 1,000,000, W = 10, K = 100, then the monoid version 
has net-IO complexity 500x less.
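
Checking the arithmetic: the groupByKey shuffles roughly 1,000,000 × (1 - 1/10) = 900,000 record-equivalents, while the reduceByKey shuffles roughly (10 - 1) × 100 × 2 = 1,800, and 900,000 / 1,800 = 500.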
Scalaz out-of-the-box Monoids 
Supports: 
● Some Numerics (e.g. Int) and collections (e.g. List) 
● Tuples of monoids (go back to Example 1 & 2 for a use) 
● Option of a monoid 
● Map[A, B] where B is a monoid (really cool, you must 
google!), useful as sparse vectors, datacubes, sparse 
histograms, etc. 
See also com.twitter.algebird
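
A sketch of the Map monoid in action as a sparse histogram (values merge with their own monoid, here Int addition):

import scalaz._, Scalaz._

val h1 = Map("spark" -> 2, "scala" -> 1)
val h2 = Map("scala" -> 4, "monoid" -> 1)
h1 |+| h2 // Map("spark" -> 2, "scala" -> 5, "monoid" -> 1)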
So 
Writing code that 
● exploits Spark's Monad- / Functor-like pipelining 
equivalences, and 
● uses and nests Monoids with reduce & reduceByKey, 
we can 
● Turn 10 stage jobs into 1 stage jobs 
● Shuffle Bytes of data rather than GigaBytes of data, and so 
● Optimize jobs by orders of magnitude 
● while keeping code super compact, modular and readable 
AND THAT IS NOT EVEN EVERYTHING ...
Super Unit Testing! 
With just two lines of code we can unit test our monoids for 
100s of test cases 
implicit val arbMaxMin: Arbitrary[MaxMin] = 
  Arbitrary(for (ij <- arbitrary[(Int, Int)]) yield MaxMin(ij)) 
checkAll(monoid.laws[MaxMin]) 
// Strictly speaking we need to define commutativeMonoid.laws 
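
A sketch of the full spec those two lines live in (class and package names are assumptions based on Scalaz 7's ScalaCheck bindings; monoid.laws also wants an Equal instance in scope):

import org.scalacheck.{Arbitrary, Properties}
import org.scalacheck.Arbitrary.arbitrary
import scalaz.Equal
import scalaz.scalacheck.ScalazProperties.monoid

object MaxMinSpec extends Properties("MaxMin") {
  implicit val eqMaxMin: Equal[MaxMin] = Equal.equalA
  implicit val arbMaxMin: Arbitrary[MaxMin] =
    Arbitrary(for (ij <- arbitrary[(Int, Int)]) yield MaxMin(ij))
  include(monoid.laws[MaxMin])
}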
Conclusion - Functional Programming in Scala 
Much faster, more readable, generic, modular, succinct, 
type-safe code that has insane test coverage practically for 
free!
Scala - The Production Language for Data Science 
1. Highly Interoperable with the rest of the Big Data world due 
to Java/JVM interoperability: 
○ Written in Scala: Kafka, Akka, Spray, Play, Spark, 
SparkStreaming, GraphX 
○ Written in Java: Cassandra, HBase, HDFS 
2. 1st-layer APIs are better maintained, richer, easier to use 
and understand. All open source can be stepped into with 
one IDE 
3. Highly scalable, elegant, functional - perfect for 
mathematicians, concurrency and writing DSLs. With Scalaz 
code has unparalleled brevity (excluding GolfScript, etc)
4. Scala is generally a production worthy language: 
a. Powerful static type system means powerful compiler 
checking and IDE features 
b. Compiler can optimize code 
5. ML & Stats libs growing very fast due to 3. 
6. The End of the Two Team Dev Cycle: 
a. 1 language for engineering, development, productization, 
research, small data, big data, 1 node, 100 nodes 
b. Product ownership, inception, design, flat-management 
c. More Agile Cross-Functional teams 
d. Faster release cycle and time to MVP 
e. Fewer human resources required 
f. No “lost in translation” problems of prototype --> prod
Further Reading 
http://www.meetup.com/spark-users/events/94101942/ 
http://blog.cloudera.com/blog/2014/03/why-apache-spark-is-a-crossover-hit-for-data-scientists/ 
http://darrenjw.wordpress.com/2013/12/23/scala-as-a-platform-for-statistical-computing-and-data-science/ 
Installing spark: 
http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH5/latest/CDH5-Installation-Guide/cdh5ig_spark_installation.html 
adding spark as a dependency (search for text "spark"): 
http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH5/latest/CDH-Version-and-Packaging-Information/cdhvd_cdh5_maven_repo.html 
Monoids 
https://github.com/rickynils/scalacheck/wiki/User-Guide 
http://timepit.eu/~frank/blog/2013/03/scalaz-monoid-example/
