At Databricks, we have a unique view into over a hundred companies trying out Spark for development and production use cases, through their support tickets and forum posts. Having seen so many different workflows and applications, discernible patterns emerge in the performance and scalability issues our users run into. This talk discusses some of these common issues from an engineering and operations perspective, describing solutions and clarifying misconceptions.
2. About Databricks
Offers a hosted service:
• Spark on EC2
• Notebooks
• Plot visualizations
• Cluster management
• Scheduled jobs
Founded by the creators of Spark; remains the largest contributor
3. What have we learned?
Focus on two types:
1. Lessons for Spark
2. Lessons for users
Hosted service + focus on Spark = lots of user feedback
Community!
4. Outline: What are the problems?
● Moving beyond Python performance
● Using Spark with new languages (R)
● Network and CPU-bound workloads
● Miscellaneous common pitfalls
13. Moving beyond Python performance
Using RDDs
data = sc.textFile(...).map(lambda line: line.split("\t"))
data.map(lambda x: (x[0], [int(x[1]), 1])) \
    .reduceByKey(lambda x, y: [x[0] + y[0], x[1] + y[1]]) \
    .map(lambda x: [x[0], x[1][0] / x[1][1]]) \
    .collect()
Using DataFrames
sqlCtx.table("people") \
    .groupBy("name") \
    .agg(avg("age")) \
    .collect()
(At least as much as possible!)
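The (sum, count) pair trick the RDD version relies on can be checked in plain Python, with no Spark needed: partial aggregates merge pairwise, and the average is taken only at the end. `to_pair`, `merge`, and `average` are illustrative names, not Spark API.

```python
from functools import reduce

def to_pair(age):
    # map step: each record becomes a (sum, count) partial aggregate
    return (age, 1)

def merge(a, b):
    # reduce step: combine two partial aggregates
    return (a[0] + b[0], a[1] + b[1])

def average(pair):
    # final step: total sum divided by total count
    return pair[0] / pair[1]

ages = [20, 30, 40]
print(average(reduce(merge, map(to_pair, ages))))  # 30.0
```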
14. Using Spark with other languages (R)
- Problem: difficult to run R programs on a cluster
- Technically challenging to rewrite algorithms to run on a cluster
- Requires a bigger paradigm shift than changing languages
- As adoption rises, new groups of people try Spark:
- People who never used Hadoop or distributed computing
- People who are familiar with statistical languages
15. SparkR interface
- A pattern emerges:
- Distributed computation for initial transformations in Scala/Python
- Bring a small dataset back to a single node for plotting and quick advanced analyses
- Result: R interface to Spark is mainly DataFrames
people <- read.df(sqlContext, "./people.json", "json")
teenagers <- filter(people, "age >= 13 AND age <= 19")
head(teenagers)
Spark R docs
See talk: Enabling exploratory data science with Spark and R
16. Network and CPU-bound workloads
- Databricks uses S3 heavily, instead of HDFS
- S3 is a key-value based blob store “in the cloud”
- Accessed over the network
- Intended for large object storage
- ~10-200 ms latency for reads and writes
- Adapters for HDFS-like access (s3n/s3a) through Spark
- Strong consistency with some caveats (updates and us-east-1)
17. S3 as data storage
[Diagram: in a "traditional" data warehouse, each instance co-locates executor JVMs with local HDFS storage; in the Databricks architecture, executor JVMs hold only caches and the data lives remotely in Amazon S3.]
18. S3(N): Not as advertised
- Had perf issues using S3N out of the box
- Could not saturate 1 Gb/s link using 8 cores
- Peaked around 800% CPU utilization and 100 MB/s
by oversubscribing cores
19. S3 Performance Problem #1
val bytes = new Array[Byte](256 * 1024)
val numRead = s3File.read(bytes)
numRead = ?
Successive reads returned: 8999, 1, 8999, 1, 8999, 1, ...
Answer: buffering! Each read() returns only what the stream already has buffered, not the full 256 KB requested.
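A minimal sketch of the fix, assuming a stream whose read() may return fewer bytes than requested (as the S3 input stream did): loop until the requested count is reached or EOF. `ChunkyStream` and `read_fully` are hypothetical helpers for illustration, not Spark or Hadoop API.

```python
import io

class ChunkyStream:
    """Toy stream whose read() returns at most 9 bytes at a time,
    mimicking a buffered S3 input stream that returns short reads."""
    def __init__(self, data, max_chunk=9):
        self._buf = io.BytesIO(data)
        self._max = max_chunk

    def read(self, n):
        return self._buf.read(min(n, self._max))

def read_fully(stream, n):
    # Keep calling read() until n bytes are accumulated or EOF is hit --
    # a single read() is not guaranteed to fill the caller's buffer.
    chunks, remaining = [], n
    while remaining > 0:
        chunk = stream.read(remaining)
        if not chunk:
            break
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)

print(len(ChunkyStream(b"x" * 100).read(100)))       # 9  -- a short read
print(len(read_fully(ChunkyStream(b"x" * 100), 100)))  # 100
```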
20. S3 Performance Problem #2
sc.textFile("/data").filter(s => doCompute(s)).count()
[Diagram: reads and compute alternate over time (Read 128KB, doCompute(), Read 128KB, doCompute(), ...), so network and CPU utilization peak at different times and neither is ever saturated.]
21. S3: Pipelining to the rescue
[Diagram: a dedicated S3 reading thread feeds data through a pipe/buffer to the user program, so reads and doCompute() calls overlap in time instead of alternating.]
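The pipelining idea can be sketched in plain Python with a reader thread and a bounded queue: the reader keeps the buffer full while the main thread computes, so I/O and CPU work overlap. `pipelined`, `read_chunk`, and `do_compute` are illustrative names, not Spark internals.

```python
import queue
import threading

def pipelined(read_chunk, do_compute, num_chunks, depth=4):
    """Overlap reads with computation: a dedicated reader thread fills a
    bounded queue while the caller's thread runs do_compute()."""
    buf = queue.Queue(maxsize=depth)
    DONE = object()  # sentinel marking end of input

    def reader():
        for i in range(num_chunks):
            buf.put(read_chunk(i))  # blocks when the buffer is full
        buf.put(DONE)

    threading.Thread(target=reader, daemon=True).start()

    results = []
    while (chunk := buf.get()) is not DONE:
        results.append(do_compute(chunk))
    return results

# Toy usage: "read" chunk i, "compute" its double.
print(pipelined(lambda i: i, lambda c: c * 2, 5))  # [0, 2, 4, 6, 8]
```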
22. S3: Results
● Max network throughput (1 Gb/s on our NICs)
● Use 100% of a core across 8 threads (largely SSL)
● With this optimization, S3 has worked well:
○ Spark hides latency via its inherent batching (except for
driver metadata lookups)
○ Network is pretty fast
23. Why is network “pretty fast?”
r3.2xlarge:
- 120 MiB/s network
- Single 250 MiB/s disk
- Max of 2x improvement to be gained from disk
More surprising: most workloads were CPU-bound on the read side
24. Why is Spark often CPU-bound?
- Users think more about high-level details than CPU efficiency
- Reasonable! Getting something to work at all is most important.
- Need the right tracing and visualization tools to find bottlenecks.
See talk: SparkUI visualization: a lens into your application
25. Why is Spark often CPU-bound?
- Just reading data may be expensive
- Decompression is not cheap - between snappy, lzf/lzo, and gzip,
be wary of gzip
- Need efficient primitives for common operations (Tungsten).
26. Conclusion
- DataFrames came up a lot
- Python perf problems? Use DataFrames.
- Want to use R + Spark? Use DataFrames.
- Want more perf with less work? Use DataFrames.
- DataFrames are key to Spark's progress:
- Expressivity in a language-neutral fashion
- Performance from knowledge about the structure of the data
27. Common pitfalls
● Avoid RDD groupByKey()
○ API requires all values for a single key to fit in memory
○ DataFrame groupBy() works as expected, though
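Why the RDD groupByKey() API is risky can be seen in a plain-Python analogy: the grouped version must hold every value for a key in memory before reducing, while a reduceByKey-style fold keeps only one running aggregate per key. (A sketch of the memory behavior, not Spark code.)

```python
from collections import defaultdict

pairs = [("a", 1), ("b", 2), ("a", 3), ("a", 4)]

# groupByKey-style: every value for a key is materialized at once
grouped = defaultdict(list)
for k, v in pairs:
    grouped[k].append(v)          # memory grows with values per key
sums_grouped = {k: sum(vs) for k, vs in grouped.items()}

# reduceByKey-style: only one running aggregate per key
sums_reduced = defaultdict(int)
for k, v in pairs:
    sums_reduced[k] += v          # constant memory per key

print(sums_grouped)                       # {'a': 8, 'b': 2}
print(dict(sums_reduced) == sums_grouped)  # True
```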
28. Common pitfalls
● Avoid Cartesian products in SQL
○ Always ensure you have a join condition! (Can check with
df.explain())
29. Common pitfalls
● Avoid overusing cache()
○ Avoid use of vanilla cache() when using data which does
not fit in memory or which will not be reused.
○ Starting in Spark 1.6, this can actually hurt performance
significantly.
○ Consider persist(MEMORY_AND_DISK) instead.
30. Common pitfalls (continued)
● Be careful when joining a small table with a large one
○ Broadcast join is by far the best option, so make sure
SparkSQL takes it
○ Cache the smaller table in memory, or use Parquet
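The idea behind a broadcast (map-side) join can be sketched in plain Python: the small table is shipped to every worker as a dictionary, so each row of the large side is joined locally and the large side never needs to shuffle. (Illustrative only, not Spark's implementation; the table contents are made up.)

```python
# Small dimension table, "broadcast" to every worker as a dict.
small = {1: "us", 2: "eu"}

# Large fact rows: (key, amount). Row with key 3 has no match.
large = [(1, 9.99), (2, 5.00), (1, 3.50), (3, 1.00)]

# Map-side inner join: probe the broadcast dict per row, no shuffle.
joined = [(k, amount, small[k]) for k, amount in large if k in small]

print(joined)  # [(1, 9.99, 'us'), (2, 5.0, 'eu'), (1, 3.5, 'us')]
```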
31. Common pitfalls (continued)
● Avoid using jets3t 1.9 (default in Hadoop 2)
○ Inexplicably terrible performance
32. Common pitfalls (continued)
● Prefer S3A to S3N (new in Hadoop 2.6.0)
○ Uses AWS SDK to allow for use of advanced features like
KMS encryption
○ Has some nice features, like reusing HTTP connections
○ Recently saw problem related to S3N buffering entire file!
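A hedged sketch of the Hadoop configuration involved in switching to S3A. The property names below are the standard hadoop-aws ones introduced with S3A in Hadoop 2.6.0, and the key values are placeholders; verify the exact names against your Hadoop version's documentation.

```xml
<!-- core-site.xml: route s3a:// URIs to the S3A filesystem (Hadoop 2.6+) -->
<property>
  <name>fs.s3a.impl</name>
  <value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
</property>
<property>
  <name>fs.s3a.access.key</name>
  <value>YOUR_ACCESS_KEY</value>
</property>
<property>
  <name>fs.s3a.secret.key</name>
  <value>YOUR_SECRET_KEY</value>
</property>
```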