© Cloudera, Inc. All rights reserved.
5 Spark tips in 5 Minutes
Imran Rashid| Cloudera Engineer, Apache Spark PMC
rdd.cache()
rdd.setName(…)
BAD:
sc.accumulator(0L)
GOOD:
sc.accumulator(0L, "my counter")
#1: Name Cached RDDs and Accumulators
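Both tips in one sketch (the input path and counter name are illustrative; `sc` is an existing SparkContext):

```scala
// Named RDDs and accumulators show up with readable labels in the
// Spark UI's Storage and Stages pages instead of anonymous ids.
val rdd = sc.textFile("hdfs:///data/input")
rdd.setName("raw input").cache()   // setName returns the RDD, so chaining works

val counter = sc.accumulator(0L, "my counter")  // Spark 1.x accumulator API
```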
#1b: MEMORY_AND_DISK
• BAD: rdd.cache()
• If partition is dropped, computed from scratch
• GOOD: rdd.persist(StorageLevel.MEMORY_AND_DISK)
(Diagram: Huge Raw Data → Filter → FlatMap → cache)
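A minimal sketch of the pipeline in the diagram, with the safer storage level (the filter/flatMap logic is illustrative):

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

def prepare(raw: RDD[String]): RDD[String] = {
  val cleaned = raw
    .filter(_.nonEmpty)
    .flatMap(_.split("\\s+"))
  // MEMORY_AND_DISK: partitions that don't fit in memory spill to
  // local disk instead of being dropped and recomputed from the
  // huge raw input, as plain cache() (MEMORY_ONLY) would do.
  cleaned.persist(StorageLevel.MEMORY_AND_DISK)
}
```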
• DAG Visualization
• Key Metrics
•Data Read / Written
•Shuffle Read / Write
•Stragglers / Outliers
• Cache Utilization
#2: Use Spark’s UI
• Use Sample Code
•Count Errors
•Sample Errors
• SparkListener to output updates
• https://gist.github.com/squito/2f7cc02c313e4c9e7df4
#3: Debug Counters
val parseErrors = ErrorTracker("parsing errors", sc)
val allParsed: RDD[T] =
  sc.textFile(inputFile).flatMap { line =>
    try {
      val r = Some(parser(line))
      parseErrors.localValue.ok()
      r
    } catch {
      case NonFatal(ex) =>
        parseErrors.localValue.error(line)
        None
    }
  }
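The ErrorTracker above comes from the linked gist. As a rough sketch of what such a helper might look like, assuming it simply wraps two named accumulators (the real version's SparkListener reporting and error sampling are omitted):

```scala
import org.apache.spark.{Accumulator, SparkContext}

// Hypothetical minimal ErrorTracker: counts parses that succeed and
// parses that fail. Named accumulators make both counts visible in
// the Spark UI while the job runs.
class ErrorTracker(name: String, sc: SparkContext) extends Serializable {
  val okCount: Accumulator[Long]    = sc.accumulator(0L, s"$name: ok")
  val errorCount: Accumulator[Long] = sc.accumulator(0L, s"$name: errors")

  // Handle used inside tasks; accumulators are add-only on executors.
  def localValue: ErrorTracker = this

  def ok(): Unit = okCount += 1L
  def error(line: String): Unit = errorCount += 1L  // real version also samples `line`
}

object ErrorTracker {
  def apply(name: String, sc: SparkContext) = new ErrorTracker(name, sc)
}
```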
#4: Avoid Driver Bottlenecks
• rdd.collect()
 •GOOD: Exploratory data analysis; merging a small set of results.
 •BAD: Sequentially scanning the entire data set on the driver: no parallelism, OOM on the driver. (rdd.toLocalIterator is better, but still not good.)
• rdd.reduce()
 •GOOD: Summarizing the results from a small dataset.
 •BAD: Big data structures, from lots of partitions.
• sc.accumulator()
 •GOOD: Small data types, e.g., counters.
 •BAD: Big data structures, from lots of partitions; e.g., a set of a million "most interesting" user ids from each partition.
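The collect-vs-reduce distinction above, as a sketch (assuming an RDD of per-record counts):

```scala
import org.apache.spark.rdd.RDD

def sumCounts(counts: RDD[Long]): Long = {
  // GOOD: aggregation happens on the executors in parallel; only
  // one partial sum per partition travels back to the driver.
  counts.reduce(_ + _)
}

def sumCountsBadly(counts: RDD[Long]): Long = {
  // BAD: every element is shipped to the driver first, then summed
  // in a single thread; risks OOM and loses all parallelism.
  counts.collect().sum
}
```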
• Try Scala!
•Much simpler code
•KISS
•sbt: ~compile, ~test-quick
•Template project with giter8
• Use Spark Testing Base
•Talk Wednesday by Holden K
• Run Spark Locally
•But try at scale periodically (you may hit
bottlenecks)
#5: Dev Environment
• I write bugs
• You write bugs
• Spark has bugs
• Long Pipelines should be restartable
•Bad: Bug in Stage 18 after 5 hours → rerun from scratch?
•Good: Write to stable storage (e.g., hdfs) periodically; restart from stage 17
• DiskCachedRDD
#6: Code for Fast Iterations
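DiskCachedRDD is not a stock Spark class; one sketch of the restartable-pipeline idea it names, using a plain saveAsObjectFile snapshot (the path and helper name are hypothetical):

```scala
import scala.reflect.ClassTag

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Recompute an expensive stage only if its HDFS snapshot is missing;
// on a rerun after a downstream bug, reload the snapshot instead of
// repeating hours of upstream work.
def cachedOnDisk[T: ClassTag](sc: SparkContext, path: String)
                             (compute: => RDD[T]): RDD[T] = {
  val fs = FileSystem.get(sc.hadoopConfiguration)
  if (fs.exists(new Path(path))) {
    sc.objectFile[T](path)        // restart: reuse the saved result
  } else {
    val rdd = compute
    rdd.saveAsObjectFile(path)    // snapshot for the next run
    rdd
  }
}
```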
#7: Narrow Joins & HDFS
• Narrow Joins
• Much cheaper
• Anytime RDDs share a Partitioner
• What about when reading from
hdfs?
• SPARK-1061
• Read from hdfs
• “Remember” data was written
with a partitioner
(Diagram: wide join vs. narrow join)
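Forcing the narrow case can be sketched like this, assuming two pair RDDs (the partition count is illustrative):

```scala
import org.apache.spark.HashPartitioner
import org.apache.spark.rdd.RDD

def narrowJoin(left: RDD[(Int, String)],
               right: RDD[(Int, String)]): RDD[(Int, (String, String))] = {
  val part = new HashPartitioner(64)
  // Pre-partition both sides with the SAME partitioner and persist,
  // so the partitioning is paid for once.
  val l = left.partitionBy(part).persist()
  val r = right.partitionBy(part).persist()
  // Because the partitioners match, the join is narrow: no shuffle,
  // each output partition reads exactly one partition from each side.
  l.join(r)
}
```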
Thank you