
Spark Summit EU 2015: Lessons from 300+ production users

At Databricks, we have a unique view into over a hundred different companies trying out Spark for development and production use cases, from their support tickets and forum posts. Having seen so many different workflows and applications, some discernible patterns emerge when looking at common performance and scalability issues that our users run into. This talk will discuss some of these common issues from an engineering and operations perspective, describing solutions and clarifying misconceptions.



  1. Spark in Production: Lessons from 300+ production users. Aaron Davidson, October 28, 2015
  2. About Databricks: Offers a hosted service: • Spark on EC2 • Notebooks • Plot visualizations • Cluster management • Scheduled jobs. Founded by creators of Spark and remains largest contributor
  3. What have we learned? Hosted service + focus on Spark = lots of user feedback. Community! Focus on two types: 1. Lessons for Spark 2. Lessons for users
  4. Outline: What are the problems? ● Moving beyond Python performance ● Using Spark with new languages (R) ● Network and CPU-bound workloads ● Miscellaneous common pitfalls
  5. Python: Who uses it, anyway? (From Spark Survey 2015)
  6-10. PySpark Architecture (diagram built up across slides 6-10): sc.textFile("/data").filter(lambda s: "foobar" in s).count() reads /data; the takeaway of the diagram is that Java-to-Python communication is expensive!
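     To make the diagram concrete, here is a minimal PySpark sketch of the same pipeline (the local master and the /data path are illustrative). The point is only that the filter lambda executes in Python worker processes, so every record crosses the JVM-to-Python boundary.

        from pyspark import SparkContext

        # Minimal sketch of the pipeline from slides 6-10 (local master and path are illustrative).
        # The lambda below runs in Python worker processes, so every record read by the JVM
        # executors is serialized across the Java-to-Python boundary; that is the cost above.
        sc = SparkContext("local[*]", "pyspark-architecture-demo")
        count = sc.textFile("/data").filter(lambda s: "foobar" in s).count()
        print(count)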
  11-13. Moving beyond Python performance (at least as much as possible!). Using RDDs: data = sc.textFile(...).map(lambda x: x.split("\t")); data.map(lambda x: (x[0], [int(x[1]), 1])).reduceByKey(lambda x, y: [x[0] + y[0], x[1] + y[1]]).map(lambda x: [x[0], x[1][0] / x[1][1]]).collect(). Using DataFrames: sqlCtx.table("people").groupBy("name").agg("name", avg("age")).collect()
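     As a hedged, runnable-style sketch of the comparison above (assuming an existing sc and sqlCtx, a tab-separated file of name/age pairs at an illustrative path, and a registered "people" table), the RDD version pushes every lambda through Python workers while the DataFrame version keeps the aggregation inside the JVM:

        from pyspark.sql.functions import avg

        # RDD version: average age per name; every step runs Python lambdas in worker processes.
        rdd = sc.textFile("/people.tsv").map(lambda line: line.split("\t"))
        rdd_averages = (rdd.map(lambda x: (x[0], (int(x[1]), 1)))
                           .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))
                           .mapValues(lambda s: s[0] / float(s[1]))
                           .collect())

        # DataFrame version: same result, but the aggregation is planned and executed in the JVM.
        df_averages = (sqlCtx.table("people")
                             .groupBy("name")
                             .agg(avg("age"))
                             .collect())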
  14. Using Spark with other languages (R) - Problem: Difficult to run R programs on a cluster - Technically challenging to rewrite algorithms to run on cluster - Requires bigger paradigm shift than changing languages - As adoption rises, new groups of people try Spark: - People who never used Hadoop or distributed computing - People who are familiar with statistical languages
  15. SparkR interface - A pattern emerges: - Distributed computation for initial transformations in Scala/Python - Bring back a small dataset to a single node to do plotting and quick advanced analyses - Result: R interface to Spark is mainly DataFrames: people <- read.df(sqlContext, "./people.json", "json"); teenagers <- filter(people, "age >= 13 AND age <= 19"); head(teenagers). See the SparkR docs and the talk: Enabling exploratory data science with Spark and R
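     The same "transform distributed, analyze locally" pattern looks like this from PySpark (a hedged sketch; the file name and the 1000-row cap are illustrative):

        # Heavy filtering runs distributed; only a small result is pulled back to the driver
        # for plotting or quick local analysis (here via pandas).
        people = sqlCtx.read.json("./people.json")
        teenagers = people.filter("age >= 13 AND age <= 19")
        local_df = teenagers.limit(1000).toPandas()   # small, single-node dataset for plotting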
  16. Network and CPU-bound workloads - Databricks uses S3 heavily, instead of HDFS - S3 is a key-value based blob store “in the cloud” - Accessed over the network - Intended for large object storage - ~10-200 ms latency for reads and writes - Adapters for HDFS-like access (s3n/s3a) through Spark - Strong consistency with some caveats (updates and us-east-1)
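     For reference, a hedged sketch of what HDFS-like access to S3 looks like from Spark (the bucket and prefix are placeholders, and credentials are assumed to come from Hadoop configuration or IAM instance roles):

        # Read objects directly from S3 via the HDFS-compatible s3a:// (or s3n://) adapter.
        logs = sc.textFile("s3a://my-bucket/logs/2015/10/*")
        print(logs.count())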
  17. S3 as data storage (diagram: a “traditional” data warehouse keeps HDFS on each executor JVM, while Databricks executors keep only caches and read from Amazon S3 over the network)
  18. S3(N): Not as advertised - Had perf issues using S3N out of the box - Could not saturate 1 Gb/s link using 8 cores - Peaked around 800% CPU utilization and 100 MB/s by oversubscribing cores
  19. S3 Performance Problem #1: val bytes = new Array[Byte](256 * 1024); val numRead = s3File.read(bytes). numRead = ? The observed values alternate like 8999, 1, 8999, 1, ... Answer: buffering!
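     A minimal Python analogue of the problem and the fix (the trickling stream below is a stand-in for the raw S3 socket; the actual issue was in the JVM-side S3N client): an unbuffered read() returns whatever small fragment happens to be available, while a buffered wrapper keeps reading until the full request is filled.

        import io

        class TricklingStream(io.RawIOBase):
            """Stand-in for a raw network stream that returns only ~9 KB per read call."""
            def __init__(self, total):
                self.remaining = total
            def readable(self):
                return True
            def readinto(self, b):
                n = min(len(b), 8999, self.remaining)
                b[:n] = b"x" * n
                self.remaining -= n
                return n

        # Unbuffered: asking for 256 KB returns ~9 KB, the "numRead = ?" surprise above.
        print(TricklingStream(10 * 1024 * 1024).readinto(bytearray(256 * 1024)))

        # Buffered: the wrapper loops over the raw stream until the full 256 KB is available.
        buffered = io.BufferedReader(TricklingStream(10 * 1024 * 1024), buffer_size=256 * 1024)
        print(len(buffered.read(256 * 1024)))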
  20. S3 Performance Problem #2: sc.textFile("/data").filter(s => doCompute(s)).count() (timeline diagram: the task alternates "Read 128KB" and "doCompute()", so network and CPU utilization each sit idle half the time)
  21. S3: Pipelining to the rescue (diagram: a dedicated S3 reading thread fills a pipe/buffer while the user program runs doCompute(), so reads overlap with computation)
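     A hedged sketch of the pipelining idea in Python (Spark's actual fix lives in the JVM read path; the stream, do_compute, and queue size here are illustrative): a reader thread keeps fetching the next chunk while the consumer computes on the previous one.

        import queue
        import threading

        CHUNK = 128 * 1024

        def pipelined_consume(stream, do_compute):
            buf = queue.Queue(maxsize=4)          # bounded pipe/buffer between the two threads

            def reader():
                while True:
                    chunk = stream.read(CHUNK)    # network-bound work
                    buf.put(chunk)
                    if not chunk:                 # empty chunk doubles as the EOF sentinel
                        break

            threading.Thread(target=reader, daemon=True).start()
            while True:
                chunk = buf.get()
                if not chunk:
                    break
                do_compute(chunk)                 # CPU-bound work overlaps with the next read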
  22. S3: Results ● Max network throughput (1 Gb/s on our NICs) ● Use 100% of a core across 8 threads (largely SSL) ● With this optimization, S3 has worked well: ○ Spark hides latency via its inherent batching (except for driver metadata lookups) ○ Network is pretty fast
  23. Why is network “pretty fast”? r3.2xlarge: - 120 MiB/s network - Single 250 MiB/s disk - Max of 2x improvement to be gained from disk. More surprising: most workloads were CPU-bound on the read side
  24-25. Why is Spark often CPU-bound? - Users think more about the high-level details than CPU efficiency - Reasonable! Getting something to work at all is most important. - Need the right tracing and visualization tools to find bottlenecks. - Need efficient primitives for common operations (Tungsten). - Just reading data may be expensive - Decompression is not cheap: between snappy, lzf/lzo, and gzip, be wary of gzip. See talk: SparkUI visualization: a lens into your application
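     For the decompression point, a hedged example of steering stored data toward a cheaper codec (the config key applies to Parquet output; the output path, the choice of snappy, and `df` are illustrative):

        # Write re-read-often data with a cheap-to-decompress codec instead of gzip.
        sqlCtx.setConf("spark.sql.parquet.compression.codec", "snappy")
        df.write.parquet("/data/output")   # `df` is assumed to be an existing DataFrame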
  26. Conclusion - DataFrames came up a lot - Python perf problems? Use DataFrames. - Want to use R + Spark? Use DataFrames. - Want more perf with less work? Use DataFrames. - DataFrames are important for Spark to progress in: - Expressivity in language-neutral fashion - Performance from knowledge about structure of data
  27-29. Common pitfalls ● Avoid RDD groupByKey() ○ API requires all values for a single key to fit in memory ○ DataFrame groupBy() works as expected, though ● Avoid Cartesian products in SQL ○ Always ensure you have a join condition! (Can check with df.explain()) ● Avoid overusing cache() ○ Avoid vanilla cache() when using data which does not fit in memory or which will not be reused ○ Starting in Spark 1.6, this can actually hurt performance significantly ○ Consider persist(MEMORY_AND_DISK) instead
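     Hedged sketches of those three pitfalls in PySpark (pairs_rdd, orders, customers, and big_rdd are illustrative names, not objects from the talk):

        from pyspark import StorageLevel

        # 1. Per-key aggregation without groupByKey(): reduceByKey combines values map-side,
        #    so no single key's values ever have to fit in memory at once.
        word_counts = pairs_rdd.reduceByKey(lambda a, b: a + b)

        # 2. Guard against accidental Cartesian products: always join on a condition and
        #    inspect the physical plan for a CartesianProduct operator.
        joined = orders.join(customers, orders.customer_id == customers.id)
        joined.explain()

        # 3. Data that may not fit in memory: spill to disk rather than using plain cache().
        big_rdd.persist(StorageLevel.MEMORY_AND_DISK)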
  30-32. Common pitfalls (continued) ● Be careful when joining a small table with a large table ○ Broadcast join is by far the best option, so make sure SparkSQL takes it (see the sketch after this slide) ○ Cache the smaller table in memory, or use Parquet ● Avoid using jets3t 1.9 (default in Hadoop 2) ○ Inexplicably terrible performance ● Prefer S3A to S3N (new in Hadoop 2.6.0) ○ Uses the AWS SDK to allow for use of advanced features like KMS encryption ○ Has some nice features, like reusing HTTP connections ○ Recently saw a problem related to S3N buffering the entire file!
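     A hedged sketch of checking that Spark SQL actually picks the broadcast join (the threshold value, table names, and join key are illustrative; caching the small table or storing it as Parquet gives the optimizer the size statistics it needs):

        # Raise the size threshold under which Spark SQL broadcasts the smaller side (here 50 MB).
        sqlCtx.setConf("spark.sql.autoBroadcastJoinThreshold", str(50 * 1024 * 1024))
        small = sqlCtx.table("small_dim").cache()
        large = sqlCtx.table("big_facts")
        joined = large.join(small, large.key == small.key)
        joined.explain()   # look for a broadcast join, not a shuffled or cartesian join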
  33. Common pitfalls (continued) ● In RDD API, can manually reuse partitioner to avoid extra shuffles
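     A hedged sketch of that technique in PySpark (the path, key extraction, and 64 partitions are illustrative): partition once, cache, and subsequent per-key operations that keep the same partitioning avoid another shuffle.

        # Hash-partition the pair RDD once and keep it around.
        pairs = sc.textFile("/data").map(lambda line: (line.split(",")[0], line))
        partitioned = pairs.partitionBy(64).cache()

        # mapValues preserves the partitioner, and reduceByKey with the same number of
        # partitions reuses it, so this aggregation does not shuffle the data again.
        counts = partitioned.mapValues(lambda _: 1).reduceByKey(lambda a, b: a + b, 64)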
  34. Questions?
