Getting Apache Spark Customers to Production

from Kostas Sakellis


  1. Getting Spark Customers to Production • Kostas Sakellis • © Cloudera, Inc. All rights reserved.
  2. Me • Software Engineer at Cloudera • Contributor to Apache Spark • Before that, contributed to Cloudera Manager
  3. Our customers • Various degrees of sophistication with Spark • In all stages of development • From POC to production deployments • 95% use Spark on YARN* • Biweekly analysis of tickets
  4. WARNING: This is biased!
  5. Building a proof of concept! Courtesy of: http://www.nefloridadesign.com/mbimages/6.jpg
  6. “Why is my job failing?”
  7. “Why is my job slow?”
  8. Misconfiguration accounts for 20% of job failures. Courtesy of: http://blog.sdrock.com/pastors/files/2013/06/time-clock.jpg
  9. Resource Declaration • It is not easy to know what you need or how to specify it • Compute: --num-executors vs. --executor-cores • Memory: --executor-memory • Includes JVM overhead • Need to do the math yourself
  10. Dynamic Allocation • Let Spark do the work for you • Available since Spark 1.2* • No need to specify compute a priori • Limitation: still required to specify cores • In future: • Allow specification of “task size” • Dynamically allocate cores
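The settings behind this slide can be sketched as follows. This is a minimal illustration, assuming Spark 1.2 or later on YARN with the external shuffle service running on the NodeManagers. The property names are standard Spark configuration keys; the object name `DynamicAllocationConf`, the helper `toSubmitArgs`, and all values are placeholders introduced here, not recommendations from the talk.

```scala
object DynamicAllocationConf {
  // Property names are standard Spark settings; the values are illustrative placeholders.
  val settings: Seq[(String, String)] = Seq(
    "spark.dynamicAllocation.enabled"      -> "true",
    "spark.shuffle.service.enabled"        -> "true", // dynamic allocation relies on the external shuffle service
    "spark.dynamicAllocation.minExecutors" -> "2",
    "spark.dynamicAllocation.maxExecutors" -> "50",
    "spark.executor.cores"                 -> "4"     // cores per executor still have to be chosen by hand
  )

  // Render the settings as spark-submit arguments: --conf key=value pairs.
  def toSubmitArgs(conf: Seq[(String, String)]): Seq[String] =
    conf.flatMap { case (key, value) => Seq("--conf", s"$key=$value") }
}
```

The executor count is left to Spark within the min/max bounds, while `spark.executor.cores` remains a manual choice, which is the limitation the slide points out.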
  11. YARN Configuration mismatch • Compute: • yarn.nodemanager.resource.cpu-vcores • yarn.scheduler.maximum-allocation-vcores • Memory: • yarn.nodemanager.resource.memory-mb • yarn.scheduler.maximum-allocation-mb
  12. YARN Configuration mismatch • Common to ask for more resources than allowed • Future work: • Exposing relevant YARN configurations in Spark UI • Requires changes to YARN itself
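Until those limits are visible in the Spark UI, a pre-flight check along the following lines can catch an oversized request before submission. This is a sketch only: `YarnLimitCheck`, `YarnLimits`, and `validate` are hypothetical names, and the limit values are assumed to have been read by hand from the cluster's YARN configuration (yarn.scheduler.maximum-allocation-mb and yarn.scheduler.maximum-allocation-vcores).

```scala
object YarnLimitCheck {
  // Hypothetical holder for the two per-container scheduler limits.
  final case class YarnLimits(maxAllocationMb: Int, maxAllocationVcores: Int)

  // Returns a list of human-readable problems; an empty list means the request fits.
  def validate(executorMemoryMb: Int, overheadMb: Int, executorCores: Int, limits: YarnLimits): List[String] = {
    // The container must hold the executor heap plus the off-heap overhead.
    val containerMb = executorMemoryMb + overheadMb
    val memoryProblem =
      if (containerMb > limits.maxAllocationMb)
        Some(s"container needs $containerMb MB but yarn.scheduler.maximum-allocation-mb is ${limits.maxAllocationMb}")
      else None
    val coreProblem =
      if (executorCores > limits.maxAllocationVcores)
        Some(s"executor needs $executorCores vcores but yarn.scheduler.maximum-allocation-vcores is ${limits.maxAllocationVcores}")
      else None
    List(memoryProblem, coreProblem).flatten
  }
}
```

For example, an 8192 MB executor with 820 MB of overhead does not fit under an 8192 MB maximum allocation, even though the heap size alone matches the limit.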
  13. Another YARN goodie… Container [pid=63375,containerID=container_1388158490598_0001_01_000003] is running beyond physical memory limits. Current usage: 2.1 GB of 2 GB physical memory used; 2.8 GB of 4.2 GB virtual memory used. Killing container. [...]
  14. Memory allocation (diagram, outermost to innermost) • yarn.nodemanager.resource.memory-mb • Executor container • spark.yarn.executor.memoryOverhead (7%; 10% in 1.4) plus spark.executor.memory • Within spark.executor.memory: spark.shuffle.memoryFraction (0.4) and spark.storage.memoryFraction (0.6)
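The arithmetic behind the diagram can be sketched as below. It assumes the overhead is the larger of a fixed floor and a fraction of the executor memory, with the fraction taken from the slide (7%, or 10% in Spark 1.4). The 384 MB floor is an assumption about Spark's YARN defaults in these versions, and the object and function names are illustrative.

```scala
object ExecutorContainerMemory {
  // Assumed minimum overhead applied when the fractional overhead is smaller.
  val MinOverheadMb = 384

  // overheadFraction: 0.07 before Spark 1.4, 0.10 in 1.4, per the slide.
  def overheadMb(executorMemoryMb: Int, overheadFraction: Double): Int =
    math.max(MinOverheadMb, (executorMemoryMb * overheadFraction).toInt)

  // Total memory requested from YARN for one executor container.
  def containerMb(executorMemoryMb: Int, overheadFraction: Double): Int =
    executorMemoryMb + overheadMb(executorMemoryMb, overheadFraction)
}
```

Under these assumptions, `containerMb(2048, 0.07)` is 2432 and `containerMb(8192, 0.10)` is 9011, so the container YARN has to grant is always larger than the value passed to --executor-memory.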
  15. YARN Overhead • Future work: • Better understanding of off-heap allocations • Improve memory usage visibility
  16. Run program through all our data. Courtesy of: https://conniehallscott.files.wordpress.com/2013/01/411748_538971446114753_1125606225_o.jpg
  17. Data dependent tuning • As data rates change, re-tuning Spark is usually necessary • Spark is sensitive to shuffle spills • The most common knob we modify is…
  18. Partitions, Partitions, Partitions!
  19. GC Stalls
  20. Partitions • Smaller is often better • Parameterized partition size • reduceByKey(…, nPartitions) • Parameterize application • Future work: • Dynamically determine # of partitions (SPARK-4630)
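One way to parameterize the partition count is to pass it into the application and hand it to `reduceByKey` as its second argument. The sketch below assumes Spark 1.3 or later on the classpath and runs in local mode purely for illustration. The object name, the `countByLevel` helper, the default of 8, and the convention of reading the count from the first command-line argument are all assumptions made here.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ParameterizedPartitions {
  // Counts occurrences per key, with the shuffle partition count supplied by the caller.
  def countByLevel(sc: SparkContext, levels: Seq[String], nPartitions: Int): Map[String, Int] =
    sc.parallelize(levels)
      .map(level => (level, 1))
      .reduceByKey(_ + _, nPartitions) // second argument sets the number of reduce-side partitions
      .collect()
      .toMap

  def main(args: Array[String]): Unit = {
    // Hypothetical convention: the first CLI argument is the partition count, defaulting to 8.
    val nPartitions = args.headOption.map(_.toInt).getOrElse(8)
    // local[2] is only for trying the sketch out; on a cluster the master comes from spark-submit.
    val sc = new SparkContext(
      new SparkConf().setAppName("parameterized-partitions").setMaster("local[2]"))
    try println(countByLevel(sc, Seq("INFO", "WARN", "INFO"), nPartitions))
    finally sc.stop()
  }
}
```

Keeping the count out of the code means it can be re-tuned as data volumes change without rebuilding the job.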
  21. But for now? • Easy answer: • Keep multiplying by 1.5 and see what works • Harder answer:
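The "easy answer" amounts to trying a geometric sequence of partition counts. A small sketch of that sequence follows; the name `PartitionCandidates` and the choice to round up are assumptions made here, not something the slide specifies.

```scala
object PartitionCandidates {
  // Successive partition counts to try, each 1.5x the previous one, rounded up.
  def candidates(start: Int, attempts: Int): List[Int] =
    Iterator.iterate(start)(n => math.ceil(n * 1.5).toInt).take(attempts).toList
}
```

Starting from 100 partitions, `PartitionCandidates.candidates(100, 4)` gives `List(100, 150, 225, 338)`; each value would be fed into the job's partition parameter until spills and GC stalls subside.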
  22. Shuffle less!
  23. Shuffles • Wide Dependency • Narrow Dependencies
  24. ReduceByKey when Possible
      • reduceByKey allows a map-side combine:
          parsed.map { line => (line.level, 1) }.reduceByKey { (a, b) => a + b }.collect()
      • groupByKey transfers all the data:
          parsed.map { line => (line.level, 1) }.groupByKey.map { case (level, counts) => (level, counts.sum) }.collect()
  25. ReduceByKey when Possible • ReduceByKey • GroupByKey
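The difference between the two operators can be illustrated without a cluster. The sketch below is a plain Scala model, not Spark code: each inner sequence stands for one map-side partition, and the functions count how many records would cross the shuffle boundary with and without a map-side combine. All names are illustrative.

```scala
object ShuffleVolume {
  type Record = (String, Int)

  // Collapse one map-side partition to a single (key, sum) pair per key.
  def mapSideCombine(partition: Seq[Record]): Map[String, Int] =
    partition.groupBy(_._1).map { case (key, records) => key -> records.map(_._2).sum }

  // groupByKey-style: every input record is sent across the shuffle.
  def shuffledWithoutCombine(partitions: Seq[Seq[Record]]): Int =
    partitions.map(_.size).sum

  // reduceByKey-style: only the combined pairs from each partition are sent.
  def shuffledWithCombine(partitions: Seq[Seq[Record]]): Int =
    partitions.map(partition => mapSideCombine(partition).size).sum
}
```

With two partitions of three records each and two distinct keys per partition, six records are shuffled without combining and four with it. The gap widens as the number of records per key grows.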
  26. Security, now it’s getting serious. Courtesy of: https://www.iti.illinois.edu/sites/default/files/Cybersecurity_image.jpg
  27. Authentication • Kerberos – the necessary evil • Ubiquitous amongst other services • YARN, HDFS, Hive, HBase, etc. • Spark utilizes delegation tokens
  28. Encryption • Control plane: SPARK-6028 (replace with Netty) • File distribution: replace with Netty • Block Manager: Spark 1.4 • User UI / REST API: SPARK-2750 (SSL) • Data-at-rest (shuffle files): SPARK-5682
  29. Authorization • Enterprises have sensitive data • Beyond HDFS file permissions • Partial access to data • Column level granularity • Apache Sentry • HDFS-Sentry synchronization plugin
  30. Customers often have shared infrastructure. Courtesy of: https://radioglobalistic.files.wordpress.com/2011/02/lagos-traffic.jpg
  31. Multi-tenancy • Cluster utilization is top metric • Target: 70-80% utilization • Mixed workloads from mixed customers • We recommend YARN • Built in resource manager
  32. Underutilized Clusters. Courtesy of: http://media.nbclosangeles.com/images/1200*675/60-freeway-repair-dec16-2-empty.JPG
  33. Dynamic Allocation • Allows jobs to scale to size according to load • Knobs to control min, max and initial size • Future Work: • Target: Dynamic allocation enabled by default • Data locality & Caching • Open question with Streaming
  34. Thank you • We’re Hiring!
