© Cloudera, Inc. All rights reserved.
5 Spark tips in 5 Minutes
Imran Rashid| Cloudera Engineer, Apache Spark PMC
rdd.cache()
rdd.setName(…)
BAD:
sc.accumulator(0L)
GOOD:
sc.accumulator(0L, "my counter")
#1: Name Cached RDDs and Accumulators
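Both tips in one sketch (the input path and counter name are illustrative; `sc` is an existing SparkContext):

```scala
// Named RDDs and accumulators show up with readable labels in the
// Spark UI's Storage and Stages pages instead of anonymous ids.
val rdd = sc.textFile("hdfs:///data/input")
rdd.setName("raw input").cache()   // setName returns the RDD, so chaining works

val counter = sc.accumulator(0L, "my counter")  // Spark 1.x accumulator API
```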
#1b: MEMORY_AND_DISK
• BAD: rdd.cache()
• If partition is dropped, computed from scratch
• GOOD: rdd.persist(StorageLevel.MEMORY_AND_DISK)
(Diagram: Huge Raw Data → Filter → FlatMap → cache)
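A minimal sketch of the pipeline in the diagram, with the safer storage level (the filter/flatMap logic is illustrative):

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

def prepare(raw: RDD[String]): RDD[String] = {
  val cleaned = raw
    .filter(_.nonEmpty)
    .flatMap(_.split("\\s+"))
  // MEMORY_AND_DISK: partitions that don't fit in memory spill to
  // local disk instead of being dropped and recomputed from the
  // huge raw input, as plain cache() (MEMORY_ONLY) would do.
  cleaned.persist(StorageLevel.MEMORY_AND_DISK)
}
```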
• DAG Visualization
• Key Metrics
•Data Read / Written
•Shuffle Read / Write
•Stragglers / Outliers
• Cache Utilization
#2: Use Spark’s UI
• Use Sample Code
•Count Errors
•Sample Errors
• SparkListener to output updates
• https://gist.github.com/squito/2f7cc02c313e4c9e7df4
#3: Debug Counters
val parseErrors = ErrorTracker("parsing errors", sc)
val allParsed: RDD[T] =
  sc.textFile(inputFile).flatMap { line =>
    try {
      val r = Some(parser(line))
      parseErrors.localValue.ok()
      r
    } catch {
      case NonFatal(ex) =>
        parseErrors.localValue.error(line)
        None
    }
  }
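The ErrorTracker above comes from the linked gist. As a rough sketch of what such a helper might look like, assuming it simply wraps two named accumulators (the real version's SparkListener reporting and error sampling are omitted):

```scala
import org.apache.spark.{Accumulator, SparkContext}

// Hypothetical minimal ErrorTracker: counts parses that succeed and
// parses that fail. Named accumulators make both counts visible in
// the Spark UI while the job runs.
class ErrorTracker(name: String, sc: SparkContext) extends Serializable {
  val okCount: Accumulator[Long]    = sc.accumulator(0L, s"$name: ok")
  val errorCount: Accumulator[Long] = sc.accumulator(0L, s"$name: errors")

  // Handle used inside tasks; accumulators are add-only on executors.
  def localValue: ErrorTracker = this

  def ok(): Unit = okCount += 1L
  def error(line: String): Unit = errorCount += 1L  // real version also samples `line`
}

object ErrorTracker {
  def apply(name: String, sc: SparkContext) = new ErrorTracker(name, sc)
}
```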
#4: Avoid Driver Bottlenecks
• rdd.collect()
 •GOOD: Exploratory data analysis; merging a small set of results.
 •BAD: Sequentially scanning the entire data set on the driver: no parallelism, OOM on the driver. (rdd.toLocalIterator is better, but still not good.)
• rdd.reduce()
 •GOOD: Summarizing the results from a small dataset.
 •BAD: Big data structures, from lots of partitions.
• sc.accumulator()
 •GOOD: Small data types, e.g., counters.
 •BAD: Big data structures, from lots of partitions; e.g., a set of a million "most interesting" user ids from each partition.
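The collect-vs-reduce distinction above, as a sketch (assuming an RDD of per-record counts):

```scala
import org.apache.spark.rdd.RDD

def sumCounts(counts: RDD[Long]): Long = {
  // GOOD: aggregation happens on the executors in parallel; only
  // one partial sum per partition travels back to the driver.
  counts.reduce(_ + _)
}

def sumCountsBadly(counts: RDD[Long]): Long = {
  // BAD: every element is shipped to the driver first, then summed
  // in a single thread; risks OOM and loses all parallelism.
  counts.collect().sum
}
```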
• Try Scala!
•Much simpler code
•KISS
•sbt: ~compile, ~test-quick
•Template project with giter8
• Use Spark Testing Base
•Talk Wednesday by Holden K
• Run Spark Locally
•But try at scale periodically (you may hit
bottlenecks)
#5: Dev Environment
• I write bugs
• You write bugs
• Spark has bugs
• Long Pipelines should be restartable
•Bad: Bug in Stage 18 after 5 hours → rerun from scratch?
•Good: Write to stable storage (e.g., hdfs) periodically; restart from stage 17
• DiskCachedRDD
#6: Code for Fast Iterations
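DiskCachedRDD is not a stock Spark class; one sketch of the restartable-pipeline idea it names, using a plain saveAsObjectFile snapshot (the path and helper name are hypothetical):

```scala
import scala.reflect.ClassTag

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Recompute an expensive stage only if its HDFS snapshot is missing;
// on a rerun after a downstream bug, reload the snapshot instead of
// repeating hours of upstream work.
def cachedOnDisk[T: ClassTag](sc: SparkContext, path: String)
                             (compute: => RDD[T]): RDD[T] = {
  val fs = FileSystem.get(sc.hadoopConfiguration)
  if (fs.exists(new Path(path))) {
    sc.objectFile[T](path)        // restart: reuse the saved result
  } else {
    val rdd = compute
    rdd.saveAsObjectFile(path)    // snapshot for the next run
    rdd
  }
}
```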
#7: Narrow Joins & HDFS
• Narrow Joins
• Much cheaper
• Anytime RDDs share a Partitioner
• What about when reading from
hdfs?
• SPARK-1061
• Read from hdfs
• “Remember” data was written
with a partitioner
(Diagram: wide join vs. narrow join)
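Forcing the narrow case can be sketched like this, assuming two pair RDDs (the partition count is illustrative):

```scala
import org.apache.spark.HashPartitioner
import org.apache.spark.rdd.RDD

def narrowJoin(left: RDD[(Int, String)],
               right: RDD[(Int, String)]): RDD[(Int, (String, String))] = {
  val part = new HashPartitioner(64)
  // Pre-partition both sides with the SAME partitioner and persist,
  // so the partitioning is paid for once.
  val l = left.partitionBy(part).persist()
  val r = right.partitionBy(part).persist()
  // Because the partitioners match, the join is narrow: no shuffle,
  // each output partition reads exactly one partition from each side.
  l.join(r)
}
```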
Thank you