This document covers Apache Spark on Amazon EMR and best practices for using Spark. It introduces the speaker and their experience with Spark at SmartNews, reviews recent Spark updates, shows how SmartNews uses Spark for tasks such as ad targeting and recommendation, and walks through a numbered list of best practices for Spark on EMR, including running Spark on YARN, tuning memory settings, minimizing data shuffle, and using dynamic scaling with Spark Streaming.
14. Spark at SmartNews
• ML experiments
• AD targeting
• User Clustering
• Recommendation
• …
15. Best Practices
#1
• Should you use the default Spark with EMR?
• Yes, sure
• EMR 4.0 is great! (Released today?!)
• Hadoop 2.6 + Hive 1.0 + Spark 1.4.1
16. Best Practices
#1
• Should you use the default Spark with EMR?
• Build your own only if you need a custom-built Spark
• Cutting Edge Version
• Native netlib-java ( mvn -Pnetlib-lgpl )
• Custom dependency version
• …
17. Best Practices
#1
• Should you use the default Spark with EMR?
• Build your own only if you need a custom-built Spark
• --bootstrap-actions bootstrap.json
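A hedged sketch of how the --bootstrap-actions flag above might be used with the AWS CLI to launch an EMR 4.0 cluster that installs a custom-built Spark. The S3 paths, script name, and instance settings are placeholders, not values from the talk.

```shell
# Sketch only: launch an EMR 4.0 cluster with a bootstrap action that
# installs a custom-built Spark. Paths and sizes are illustrative.
aws emr create-cluster \
  --release-label emr-4.0.0 \
  --applications Name=Hadoop Name=Hive \
  --instance-type m3.xlarge \
  --instance-count 5 \
  --bootstrap-actions file://bootstrap.json

# where bootstrap.json points at your install script, e.g.:
# [{"Path": "s3://YOUR_BUCKET/install-custom-spark.sh",
#   "Name": "Install custom-built Spark"}]
```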
18. Best Practices
#1
• Should you use the default Spark with EMR?
• Build your own only if you need a custom-built Spark
• Remember to start SparkHistoryServer
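When you install Spark yourself, the History Server is not started for you. A minimal sketch of starting it on the master node; the install path and log directory below are assumptions and depend on where your bootstrap action placed Spark.

```shell
# Sketch only: start the Spark History Server after installing a
# custom-built Spark. Paths are assumptions, adjust to your install.
export SPARK_HISTORY_OPTS="-Dspark.history.fs.logDirectory=hdfs:///var/log/spark/apps"
/usr/lib/spark/sbin/start-history-server.sh
```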
19. Best Practices
#2
• Run Spark on Yarn
• Use yarn-cluster mode to distribute drivers
• Use --jars and --files to distribute necessary resources
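The two bullets above can be combined into a single submission; a hedged sketch using Spark 1.4-era syntax, where the class name, jar names, and file names are placeholders:

```shell
# Sketch only: yarn-cluster submission. --jars and --files ship extra
# resources to the driver and executors; all names are illustrative.
spark-submit \
  --master yarn-cluster \
  --jars lib/dep1.jar,lib/dep2.jar \
  --files conf/app.conf \
  --class com.example.MyJob \
  my-job.jar
```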
20. Best Practices
#3
• Tuning Memory
• A CPU shortage only slows your program down, but a memory shortage makes it crash
• You can even set --executor-cores higher than your actual CPU count
• Cache-able heap != JVM’s Xmx
• (normally about 50%)
21. Best Practices
#3
• Tuning Memory
• A CPU shortage only slows your program down, but a memory shortage makes it crash
• Cache-able heap != JVM’s Xmx
Image from: http://0x0fff.com/spark-architecture/
22. Best Practices
#3
• Tuning Memory
• A CPU shortage only slows your program down, but a memory shortage makes it crash
• Cache-able heap != JVM’s Xmx
• spark.yarn.executor.memoryOverhead
• spark.executor.memory
• spark.storage.memoryFraction
• …
• Split your executors if HEAP_SIZE > 64GB (to keep GC pauses manageable)
• -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
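The "about 50%" figure for cache-able heap can be derived from Spark 1.x's legacy memory model (described in the 0x0fff article cited on the earlier slide). A minimal arithmetic sketch, assuming the Spark 1.x default values for the properties listed above:

```python
# Sketch: where "cache-able heap ≈ 50%" comes from under Spark 1.x's
# legacy memory model. All fractions below are Spark 1.x defaults.
executor_memory_gb = 8.0        # e.g. --executor-memory 8g (illustrative)

storage_memory_fraction = 0.6   # spark.storage.memoryFraction default
storage_safety_fraction = 0.9   # spark.storage.safetyFraction default

cacheable_gb = executor_memory_gb * storage_memory_fraction * storage_safety_fraction
print(f"cache-able heap: {cacheable_gb:.2f} GB "
      f"({cacheable_gb / executor_memory_gb:.0%} of the JVM heap)")
# → cache-able heap: 4.32 GB (54% of the JVM heap)

# On YARN the container must also hold off-heap overhead on top of the heap:
# spark.yarn.executor.memoryOverhead defaults to max(384 MB, 10% of heap).
overhead_gb = max(0.384, 0.10 * executor_memory_gb)
print(f"YARN container request: {executor_memory_gb + overhead_gb:.2f} GB")
# → YARN container request: 8.80 GB
```

So only a bit more than half of --executor-memory is available for caching, and YARN needs to grant roughly 10% more than the heap on top of that.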
23. Best Practices
#4
• If your ML job is really CPU-bound
• Try using OpenBLAS + netlib.NativeSystemBLAS
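A hedged sketch of wiring this up: netlib-java selects its backend via a JVM system property, which has to be set on both the driver and the executors. The package name and jar name below are assumptions and vary by distro.

```shell
# Sketch only: point netlib-java at system BLAS/LAPACK libraries
# (e.g. OpenBLAS) on every node. Package names are assumptions.
sudo yum install -y blas lapack

spark-submit \
  --driver-java-options "-Dcom.github.fommil.netlib.BLAS=com.github.fommil.netlib.NativeSystemBLAS" \
  --conf spark.executor.extraJavaOptions=-Dcom.github.fommil.netlib.BLAS=com.github.fommil.netlib.NativeSystemBLAS \
  my-ml-job.jar
```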
25. Best Practices
#5
• Minimize data shuffle
• Prefer reduceByKey over groupByKey+map
• RDD.repartition(NUM_OF_CORES) before cache
• Filter as early as possible
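Why reduceByKey beats groupByKey + map: it combines values per key on each partition *before* the shuffle, so far fewer records cross the network. A pure-Python sketch of the semantics (no Spark required); all names here are illustrative, not Spark APIs.

```python
from collections import defaultdict

def map_side_combine(partition, op):
    """What reduceByKey does locally on each partition before shuffling."""
    acc = {}
    for k, v in partition:
        acc[k] = op(acc[k], v) if k in acc else v
    return list(acc.items())

# Two illustrative partitions of (key, count) pairs.
partitions = [
    [("a", 1), ("b", 1), ("a", 1), ("a", 1)],
    [("b", 1), ("a", 1), ("b", 1)],
]

# groupByKey shuffles every record: 7 pairs cross the network.
shuffled_group = sum(len(p) for p in partitions)

# reduceByKey shuffles one pre-combined record per key per partition: 4 pairs.
combined = [map_side_combine(p, lambda x, y: x + y) for p in partitions]
shuffled_reduce = sum(len(p) for p in combined)
print(shuffled_group, shuffled_reduce)   # → 7 4

# The post-shuffle merge gives the same result either way.
totals = defaultdict(int)
for part in combined:
    for k, v in part:
        totals[k] += v
print(dict(totals))   # → {'a': 4, 'b': 3}
```

The saving grows with the number of duplicate keys per partition, which is why the same idea (filtering early) also shrinks the shuffle.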
31. Best Practices
#10
• Use dynamic scaling with Spark Streaming
• spark.dynamicAllocation.enabled = true
• spark.shuffle.service.enabled = true
• Be careful if you use cached data (removed executors lose their cache)
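The two properties above go together: the external shuffle service lets executors be decommissioned without losing shuffle files. A hedged sketch of a submission enabling both; the min/max executor counts are illustrative values, not recommendations from the talk.

```shell
# Sketch only: dynamic allocation plus the external shuffle service.
spark-submit \
  --master yarn-cluster \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.shuffle.service.enabled=true \
  --conf spark.dynamicAllocation.minExecutors=2 \
  --conf spark.dynamicAllocation.maxExecutors=20 \
  my-streaming-job.jar
```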
32. Best Practices
#11
• Use Spot Instance
• Be more aggressive with your bid price :p
• BID_PRICE != MONEY_TO_PAY
• Check Spot Instance Pricing History
• Find the instance type with relative stable price
• often Previous Generation Instance ?
• Prepare for failure; don’t use them for mission-critical jobs
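A hedged sketch of checking the pricing history mentioned above with the AWS CLI; the instance type, dates, and query are placeholders:

```shell
# Sketch only: inspect spot price history to find an instance type with a
# relatively stable price. All values are illustrative.
aws ec2 describe-spot-price-history \
  --instance-types m1.xlarge \
  --product-descriptions "Linux/UNIX" \
  --start-time 2015-08-01T00:00:00 \
  --end-time 2015-08-07T00:00:00 \
  --query 'SpotPriceHistory[*].[Timestamp,SpotPrice]' \
  --output text
```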
33. Further Reading
• To use Spark Streaming in Production
• http://www.slideshare.net/SparkSummit/recipes-for-running-spark-streaming-apploications-in-production-tathagata-daspptx
34. Further Reading
• If you’re interested in new ML pipelines
• http://www.slideshare.net/SparkSummit/building-debugging-and-tuning-spark-machine-leaning-pipelinesjoseph-bradley