Spark 4th Meetup London - Building a Product with Spark
1. Building a Product with Spark
Common Technical and Performance
Problems with Solutions
2. Format of Talk
● Common Spark exceptions with debugging tips
● My personal use-case for spark and
resulting cluster preferences
● Optimizing code using Functional
Programming
● Data Science and Productization
3. Dreaded OOMs; heap & GC overhead limit
● INCREASE YOUR NUM PARTITIONS!!! Spark is not
Hadoop, ergo more partitions does not necessarily cost
extra resources. “Typically you want 2-4 slices for each
CPU in your cluster” - Spark documentation, “100 -
10,000 is reasonable” - Matei Zaharia
● Set your memory fractions according to the job. E.g. if
you do not use the cache, then set the cache fraction to
0.0.
● Ensure you’re using the memory on each node of the
cluster, not just the default 512m - check this in the UI (see the sketch below).
● Refactor code (see later)
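A minimal sketch of those knobs (the app name, path and values are illustrative placeholders, not recommendations; the memoryFraction key is the Spark 1.x name):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("my-job")                          // hypothetical app name
  .set("spark.executor.memory", "8g")            // use the node's RAM, not the 512m default
  .set("spark.storage.memoryFraction", "0.0")    // the "cache fraction" - drop it if you never cache
val sc = new SparkContext(conf)

// "2-4 slices per CPU": ask for partitions up front, or repartition explicitly
val raw  = sc.textFile("hdfs:///path/to/data", 400)   // hypothetical path
val wide = raw.repartition(1000)                      // more partitions is the first OOM fix to try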
4. Shuffle File Problems - inodes
● If you get java.io.IOException: No space left on device
while df -h shows you have space, then check your
inodes with df -i.
● Shuffles by default create O(M * R) shuffle files, where
M is map tasks and R is reduce tasks. Solutions (by
pref):
A. Research “consolidating files”; decreases to O(R)
B. Decrease R, and decrease M using coalesce with
shuffle = false (see the sketch below)
C. Get DevOps to increase the inodes of the FS
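A sketch of solutions A and B in code (Spark 1.x era setting; someRdd and the partition count are illustrative):

// A. consolidate shuffle files: O(M * R) files becomes roughly O(R)
val conf = new SparkConf().set("spark.shuffle.consolidateFiles", "true")

// B. decrease M without triggering a shuffle (and so without extra shuffle files)
val fewerMaps = someRdd.coalesce(64, shuffle = false)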
5. Shuffle File Problems - ‘Too many open
files’
● Shuffles open O(C * R) shuffle files, where C is your
number of cores. Solutions (by pref):
A. Get DevOps to increase the ulimit
B. Decrease R (see the sketch below)
● Consolidating files will not help
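A sketch of decreasing R (sc and the input path are assumed from the surrounding job; 32 is illustrative): most *ByKey operations take an explicit numPartitions argument, which is exactly R.

val pairs   = sc.textFile("hdfs:///some/input").flatMap(_.split(" ")).map((_, 1))
val counts  = pairs.reduceByKey(_ + _, 32)   // fewer reduce tasks => fewer open files per core
val grouped = pairs.groupByKey(32)           // same idea for other shuffle operations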
6. General Debugging/Troubleshooting Tips
Don’t Always Trust the Message
(skip)
Some exceptions have been symptomatic of an OOM, a space/inode problem, or
something seemingly unrelated to the stack trace (ST). E.g.
● making more meaningful is an open
improvement (SPARK-3052)
● and can
be symptomatic of “no space left on device”
● Initial job has not accepted any resources even though everything seems fine can be
caused by accidentally configuring your RAM or Cores too high
Use memory, JVM memory, space and inode monitoring, and dig around in
the worker logs when exceptions are not clear. Ask on SO!
7. Notes on Broadcast Variables and the Driver Process
● Broadcast variables are great for saving memory as
they have one in-memory copy per node, whereas
free variables captured in closures have one copy per
task (e.g. huge hash maps and hash sets)
● If you need to create a large broadcast variable then
remember that your driver JVM will need to be
spawned with increased RAM (see the sketch below)
● Symptoms of forgetting this include:
○ Hangs
○ OOMs
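A sketch of the pattern (loadHugeDictionary, lines and the sizes are hypothetical):

val bigMap: Map[String, Int] = loadHugeDictionary()   // e.g. a multi-GB lookup table
val bigMapBC = sc.broadcast(bigMap)                   // the driver must hold and serialize this,
                                                      // so launch it with e.g. --driver-memory 8g
val tagged = lines.map(line => (line, bigMapBC.value.getOrElse(line, 0)))  // read via .value in tasks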
8. Cluster Setup Problems & Dependency/Versioning Hell
(rush through this)
My personal experience/use-case:
● Only wanted permanent production and research clusters that work with
other Big Data tools/clusters/the Hadoop ecosystem (HBase, Cassandra,
HDFS, etc.) for regular hourly/daily pipelines and ongoing data exploration
● Used the spark-shell, and built fat jars for complex applications
● Worked with 4 spark clusters over my first 6 - 9 months with Spark
For the above I’ve come to have the following preferences: (Links on last slide)
● use Cloudera packages to install Spark (via apt-get or similar)
● with the Cloudera jar dependencies (e.g. 1.0.0-cdh5.1.2)
● On Ubuntu Server 14.04 (apt-get > yum, bash >>>>>>>> cygwin or Posh)
● On EC2 (not EMR), (powerful cloud features, super-cheap spot-instances)
9. Spark Configurables - Alternative defaults for stability increase
Configurable | Value | Comment
spark.executor.memory | whatever your cluster can support |
spark.cores.max | whatever your cluster can support |
spark.shuffle.consolidateFiles | true | For ext4 and xfs, which are common
spark.cleaner.ttl | 86400 | To avoid space issues as mentioned
spark.akka.frameSize | 500 | Prevents errors including:
spark.akka.askTimeout | 100 | Seems to help with “high CPU/IO load”
spark.task.maxFailures | 16 | Keep trying!
spark.worker.timeout | 150 | A wee hack around gc hangs? NOTE: I can’t find this on the latest docs.
spark.serializer | org.apache.spark.serializer.KryoSerializer | Requires additional faff to get it right apparently. Uncertain why not default?!
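The same table expressed as a SparkConf, for copy-pasting (a sketch; the memory and cores values are placeholders, and several of these keys are specific to Spark 1.x):

val conf = new SparkConf()
  .set("spark.executor.memory", "16g")   // whatever your cluster can support
  .set("spark.cores.max", "48")          // ditto
  .set("spark.shuffle.consolidateFiles", "true")
  .set("spark.cleaner.ttl", "86400")
  .set("spark.akka.frameSize", "500")
  .set("spark.akka.askTimeout", "100")
  .set("spark.task.maxFailures", "16")
  .set("spark.worker.timeout", "150")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")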
10. Code level optimizations with FP
● Sometimes you cannot fix a problem by
tuning the cluster/job config
● Sometimes you just need to refactor your
code!
11. Definition - Strongly Pipelineable
Let x: RDD[T], let M := < m1, …, mn> be a valid sequence of
method invocations on x with parameters, i.e.
“x.m1. … .mn” is a valid line of code, then we say the
sequence M is strongly pipelineable if and only if there
exists a function f such that “x.flatMap(f)” is an
‘equivalent’ line of code.
Definition - Weakly Pipelineable
… and M is weakly pipelineable if and only if there exists a
function f such that “x.mapPartitions(f)” is an ‘equivalent’
line of code.
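A small illustration (x, f and g are hypothetical): the chain below is strongly pipelineable, i.e. it collapses into a single flatMap.

val chained = x.map(f).filter(g)

// one 'equivalent' flatMap
val pipelined = x.flatMap { y =>
  val z = f(y)
  if (g(z)) List(z) else Nil
}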
12. Practical Application
● For both definitions such a sequence of invocations will
not result in shuffle operations (so no net IO)
● For Strong Pipelining we have the freedom to insert
coalesce operations (similarly, we could define a custom
Partitioner that is like an “inverseCoalesce” to split
partitions but preserve location)
● With Weak Pipelining we do not ipso facto have the
freedom to coalesce/inverseCoalesce but we can be
sure the task will consist of a single stage
13. Theorem A - Strongly Pipelineable Methods
… If M is a sequence constructed from any of the methods
map, filter, flatMap, mapValues, flatMapValues or sample
then M strongly pipelines if we can assume
(1) RDD with flatMap is a “Monad”*
*We need scare quotes here because we must abuse
notation and consider TraversableOnce and RDD as the
same type
14. Theorem B - Weakly Pipelineable Methods
… If M is a sequence constructed from any of the methods
map, filter, flatMap, mapValues, flatMapValues, sample,
and additionally mapPartitions or glom then M weakly
pipelines if we can assume
(2) RDD with mapPartitions is a Functor
15. Example 1
val docIdWords: RDD[(Int, MultiSet[String])] = ...
// 2 stage job
val docIdToIndexesAndNum = docIdWords.mapValues(_.map(getIndex))
  .reduceByKey(_ ++ _)
  .mapValues(indexes => (indexes, indexes.size))
// 1 stage job
val docIdToIndexesAndNum = docIdWords.mapValues(_.map(getIndex))
  .mapValues(indexes => (indexes, indexes.size))
  .reduceByKey(addEachTupleElementTogether)
16. Example 2
val xvalResults: RDD[(Classification, Confusion)] = ...
// Induces 2 stages
val truePositives = xvalResults.filter(_._2 == TP).count()
val trueNegatives = xvalResults.filter(_._2 == TN).count()
// Induces 1 stage (and still easily readable!)
val (truePositives, trueNegatives) = xvalResults.map(_._2).map {
  case `TP` => (1, 0)
  case `TN` => (0, 1)
  case _    => (0, 0)   // other outcomes (e.g. FP/FN) count towards neither
}.reduce(addEachTupleElementTogether)
Seems addEachTupleElementTogether is a recurring theme?! Hmm ...
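For concreteness, a sketch of what that helper might look like (only the name comes from the slides; this is the (Int, Int) version used in Example 2, and Example 1 needs the same shape with ++ on the first element):

def addEachTupleElementTogether(a: (Int, Int), b: (Int, Int)): (Int, Int) =
  (a._1 + b._1, a._2 + b._2)

This per-tuple-shape boilerplate is exactly what the Monoid machinery on the next slides abstracts away.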
17. Proof of A
By (1);
x.flatMap(f).flatMap(g) = x.flatMap(y => f(y).flatMap(g))
for any f and g, then by induction on n it follows that if all mi
are flatMaps then the theorem holds.
Therefore it remains to show that map, filter, mapValues,
flatMapValues and sample can be written in terms of
flatMap.
Firstly observe sample(fraction) = filter(r) where r
randomly returns true with probability fraction
18. Now we can write x.filter(f) for any f as follows:
x.flatMap {
case y if f(y) => List(y)
case _ => Nil
}
So filter and sample can be written in terms of flatMap.
For map, for any f;
x.map(f) = x.flatMap(y => List(f(y)))
Then we leave mapValues and flatMapValues as exercises.
QED
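As a hint for that exercise, a sketch of the mapValues case (for x: RDD[(K, V)] and any f; this ignores that mapValues also preserves the partitioner, which flatMap does not):

x.mapValues(f) = x.flatMap { case (k, v) => List((k, f(v))) }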
19. Proof of B
Firstly by (2); for any f and g
x.mapPartitions(f).mapPartitions(g)
= x.mapPartitions(g compose f)   (i.e. f andThen g)
Now by observing
x.flatMap(f) = x.mapPartitions(_.flatMap(f))
and the equivalences we noted in the previous theorem, the
rest of this proof should be obvious. QED
20. BY THE POWER OF MONOID!!!
import scalaz.Monoid

case class MaxMin(max: Int, min: Int)

object MaxMin {
  def apply(pair: (Int, Int)): MaxMin = MaxMin(pair._1, pair._2)
  def apply(i: Int): MaxMin = MaxMin(i, i)

  implicit object MaxMinMonoid extends Monoid[MaxMin] {
    def zero: MaxMin = MaxMin(Int.MinValue, Int.MaxValue)
    def append(mm1: MaxMin, mm2: => MaxMin): MaxMin = MaxMin(
      if (mm1.max > mm2.max) mm1.max else mm2.max,
      if (mm1.min < mm2.min) mm1.min else mm2.min
    )
  }
}
21. Example 3
// Needs: org.apache.spark.rdd.RDD, scalaz.syntax.semigroup._ (for |+|),
// and scalaz.std.anyVal._ / scalaz.std.tuple._ (for the (Int, Int) monoid in average)
implicit class PimpedIntRDD(rdd: RDD[Int]) {
  // Twice as fast as doing rdd.top(1).head - rdd.top(1)(reverseOrd).head
  def range: Int = {
    val MaxMin(max, min) = rdd.map(MaxMin(_)).reduce(_ |+| _)
    max - min
  }
  // Similarly faster than doing rdd.reduce(_ + _) / rdd.count()
  def average: Double = {
    val (sum, total) = rdd.map((_, 1)).reduce(_ |+| _)
    sum.toDouble / total
  }
  // TODO Using generics and type-classes much of what we have done
  // can be extended to include all Numeric types easily
}
22. Example 4
val rdd: RDD[(MyKey, Int)] = ...
// BAD! Shuffles a lot of data and traverses each group twice
val ranges: RDD[(MyKey, Int)] =
  rdd.groupByKey().mapValues(seq => seq.max - seq.min)
// GOOD!
val ranges = rdd.mapValues(MaxMin(_)).reduceByKey(_ |+| _)
  .mapValues(mm => mm.max - mm.min)

Let B be the number of bytes shuffled across the network
Let N be the number of records, K the number of keys, and W the number of worker nodes
Let X be the relative size of an element of the monoid to the base records
E.g. here a MaxMin is ~twice the size of the records
Then in general for a groupByKey B is O(N(1 - 1/W)),
for a “monoidally refactored” reduceByKey B is O((W - 1)KX)
E.g. for the above, if N = 1,000,000, W = 10, K = 100, then the monoid version
has net-IO complexity 500x less.
23. Scalaz out of box Monoids
Supports:
● Some Numerics (e.g. Int) and collections (e.g. List)
● Tuples of monoids (go back to Example 1 & 2 for a use)
● Option of a monoid
● Map[A, B] where B is a monoid (really cool, you must
google!), useful as sparse vectors, datacubes, sparse
histograms, etc. (see the sketch below)
See also com.twitter.algebird
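A tiny sketch of that Map behaviour (assuming scalaz 7 imports): values sharing a key are combined with their own monoid, which is what makes Maps act like sparse vectors or histograms.

import scalaz.std.anyVal._       // Monoid[Int]
import scalaz.std.map._          // Monoid[Map[A, B]] when B has a Semigroup
import scalaz.syntax.semigroup._

val a = Map("spark" -> 2, "scala" -> 1)
val b = Map("scala" -> 3, "monoid" -> 1)
a |+| b                          // Map("spark" -> 2, "scala" -> 4, "monoid" -> 1)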
24. So
Writing code that
● exploits Spark’s utilization of Monad / Functor like
equivalences, and
● using and nesting Monoids with reduce & reduceByKey
we can
● Turn 10 stage jobs into 1 stage jobs
● Shuffle Bytes of data rather than GigaBytes of data, and so
● Optimize jobs by orders of magnitude
● while keeping code super compact, modular and readable
AND THAT IS NOT EVEN EVERYTHING ...
25. Super Unit Testing!
With just two lines of code we can unit test our monoids for
100s of test cases
// Needs: org.scalacheck.Arbitrary, org.scalacheck.Arbitrary.arbitrary and
// scalaz.scalacheck.ScalazProperties.monoid (from the scalaz-scalacheck-binding module)
implicit val arbMaxMin: Arbitrary[MaxMin] =
  Arbitrary(for (ij <- arbitrary[(Int, Int)]) yield MaxMin(ij))
checkAll(monoid.laws[MaxMin])
// Strictly speaking we need to define commutativeMonoid.laws
Conclusion - Functional Programming in Scala
Much faster, more readable, generic, modular, succinct,
type-safe code that has insane test coverage practically for
free!
26. Scala - The Production Language for Data Science
1. Highly Interoperable with the rest of the Big Data world due
to Java/JVM interoperability:
○ Written in Scala: Kafka, Akka, Spray, Play, Spark,
SparkStreaming, GraphX
○ Written in Java: Cassandra, HBase, HDFS
2. 1st-layer APIs are better maintained, richer, easier to use
and understand. All open source can be stepped into with
one IDE
3. Highly scalable, elegant, functional - perfect for
mathematicians, concurrency and writing DSLs. With Scalaz
code has unparalleled brevity (excluding GolfScript, etc)
27. 4. Scala is generally a production worthy language:
a. Powerful static type system means powerful compiler
checking and IDE features
b. Compiler can optimize code
5. ML & Stats libs growing very fast due to 3.
6. The End of the Two Team Dev Cycle:
a. 1 language for engineering, development, productization,
research, small data, big data, 1 node, 100 nodes
b. Product ownership, inception, design, flat-management
c. More Agile Cross-Functional teams
d. Faster release cycle and time to MVP
e. Fewer human resources required
f. No “lost in translation” problems of prototype --> prod
28. Further Reading
http://www.meetup.com/spark-users/events/94101942/
http://blog.cloudera.com/blog/2014/03/why-apache-spark-is-a-crossover-hit-for-data-scientists/
http://darrenjw.wordpress.com/2013/12/23/scala-as-a-platform-for-statistical-computing-and-data-science/
Installing Spark:
http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH5/latest/CDH5-Installation-Guide/cdh5ig_spark_installation.html
Adding Spark as a dependency (search for the text “spark”):
http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH5/latest/CDH-Version-and-Packaging-Information/cdhvd_cdh5_maven_repo.html
Monoids
https://github.com/rickynils/scalacheck/wiki/User-Guide
http://timepit.eu/~frank/blog/2013/03/scalaz-monoid-example/