Spark at SmartNews
• ML experiments
• Ad targeting
• User Clustering
• Recommendation
• …
Best Practices
#1
• Should you use the default Spark that ships with EMR?
• Yes, sure
• EMR 4.0 is great! (Released today?!)
• Hadoop 2.6 + Hive 1.0 + Spark 1.4.1
• Build your own only if you need a custom-built Spark:
• Cutting-edge version
• Native netlib-java (mvn -Pnetlib-lgpl)
• Custom dependency versions
• …
• Install it with a bootstrap action: --bootstrap-actions bootstrap.json
• Remember to start SparkHistoryServer
Best Practices
#2
• Run Spark on YARN
• Use yarn-cluster mode to distribute drivers across the cluster
• Specify --jars and --files to distribute the necessary resources
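The submission described above can be sketched as a small helper that assembles the spark-submit command line. This is only an illustration: the jar names, the main class and the application argument are hypothetical, and `--master yarn-cluster` assumes the Spark 1.x syntax used in this deck.

```python
import shlex


def build_spark_submit(app_jar, main_class, jars=(), files=(), app_args=()):
    """Return the argv list for a spark-submit call in yarn-cluster mode."""
    cmd = ["spark-submit", "--master", "yarn-cluster", "--class", main_class]
    if jars:
        # Extra jars are shipped to the driver and the executors.
        cmd += ["--jars", ",".join(jars)]
    if files:
        # Plain files are placed in each container's working directory.
        cmd += ["--files", ",".join(files)]
    cmd.append(app_jar)
    cmd += list(app_args)
    return cmd


# Hypothetical job: every name below is a placeholder.
cmd = build_spark_submit(
    app_jar="my-job.jar",
    main_class="com.example.MyJob",
    jars=["libs/dep-a.jar", "libs/dep-b.jar"],
    files=["conf/job.properties"],
    app_args=["2015-07-01"],
)
print(" ".join(shlex.quote(part) for part in cmd))
```

Because the driver runs inside the cluster in this mode, anything it reads from local disk has to travel with the job, which is why the jars and files are listed explicitly.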
Best Practices
#3
• Tuning Memory
• A CPU shortage only slows your program down, but a memory shortage makes it crash
• You can even set --executor-cores higher than the number of CPU cores you have
• Cache-able heap != JVM’s Xmx
• (normally about 50% of it)
Image from: http://0x0fff.com/spark-architecture/
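A rough sketch of why the cache-able heap is only about half of Xmx. It assumes the Spark 1.x memory model, where the storage region is the heap multiplied by spark.storage.memoryFraction (default 0.6) and spark.storage.safetyFraction (default 0.9). The 10 GB heap is just an example value.

```python
def cacheable_heap_mb(executor_memory_mb, memory_fraction=0.6, safety_fraction=0.9):
    """Estimate the heap available for cached RDD blocks (Spark 1.x model).

    memory_fraction mirrors spark.storage.memoryFraction and
    safety_fraction mirrors spark.storage.safetyFraction.
    """
    return executor_memory_mb * memory_fraction * safety_fraction


heap_mb = 10 * 1024  # example executor heap, i.e. spark.executor.memory=10g
cache_mb = cacheable_heap_mb(heap_mb)
print(f"{cache_mb:.0f} MB cache-able out of {heap_mb} MB ({cache_mb / heap_mb:.0%})")
```

With the defaults the ratio is 0.6 × 0.9 = 0.54, which matches the "about 50%" rule of thumb.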
• Cache-able heap != JVM’s Xmx
• spark.yarn.executor.memoryOverhead
• spark.executor.memory
• spark.storage.memoryFraction
• …
• Split your executors if HEAP_SIZE > 64 GB (to keep GC pauses manageable)
• Log GC activity with -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
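The settings above interact roughly as follows. This sketch assumes that YARN is asked for spark.executor.memory plus spark.yarn.executor.memoryOverhead, and that the overhead defaults to the larger of 384 MB and 10% of the executor memory; that default has changed between Spark releases, so treat it as an assumption. The helper names are hypothetical.

```python
import math

# GC logging flags from the slide, passed to executors via extraJavaOptions.
GC_LOGGING_FLAGS = "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps"


def yarn_container_mb(executor_memory_mb, overhead_mb=None,
                      overhead_factor=0.10, overhead_min_mb=384):
    """Memory YARN must grant per executor: heap plus off-heap overhead."""
    if overhead_mb is None:
        # Assumed default rule for spark.yarn.executor.memoryOverhead.
        overhead_mb = max(overhead_min_mb, int(executor_memory_mb * overhead_factor))
    return executor_memory_mb + overhead_mb


def split_executors(total_heap_gb, max_heap_gb=64):
    """Split one large heap into equal executors of at most max_heap_gb each."""
    count = max(1, math.ceil(total_heap_gb / max_heap_gb))
    return count, total_heap_gb / count


count, heap_gb = split_executors(200)
print(f"{count} executors with {heap_gb:.0f} GB heap each")
print(f"container for an 8 GB executor: {yarn_container_mb(8192)} MB")
print(f'--conf "spark.executor.extraJavaOptions={GC_LOGGING_FLAGS}"')
```

If the container size exceeds what a YARN node can offer, the executor is never scheduled, so it helps to check this number against the node's memory before submitting.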
Best Practices
#4
• If your ML job is really CPU-bound
• Try using OpenBLAS + netlib.NativeSystemBLAS
Best Practices
#10
• Use dynamic scaling with Spark Streaming
• spark.dynamicAllocation.enabled = true
• spark.shuffle.service.enabled = true
• Be careful if you use cached data (blocks cached on an executor are lost when it is removed)
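As an illustration, the two switches above can be kept in a dictionary and rendered in spark-defaults.conf format. The executor bounds are hypothetical values added for the example; spark.dynamicAllocation.minExecutors and spark.dynamicAllocation.maxExecutors are real settings, but the right numbers depend on the cluster.

```python
dynamic_allocation_conf = {
    "spark.dynamicAllocation.enabled": "true",
    "spark.shuffle.service.enabled": "true",
    # Hypothetical bounds, not taken from the slides.
    "spark.dynamicAllocation.minExecutors": "2",
    "spark.dynamicAllocation.maxExecutors": "20",
}


def to_spark_defaults(conf):
    """Render settings as 'key value' lines, as used in spark-defaults.conf."""
    return "\n".join(f"{key} {conf[key]}" for key in sorted(conf))


rendered = to_spark_defaults(dynamic_allocation_conf)
print(rendered)
```

On YARN the external shuffle service also has to be running on each NodeManager, otherwise shuffle files disappear together with the executors that wrote them.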
Best Practices
#11
• Use Spot Instances
• Be more aggressive with your bid price :p
• BID_PRICE != MONEY_TO_PAY
• Check the Spot Instance pricing history
• Find the instance type with a relatively stable price
• Often a previous-generation instance?
• Prepare for failure; don’t use them for mission-critical jobs
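A toy model of the points above, assuming you are billed the hourly market price rather than your bid, and that the instance is reclaimed once the market price exceeds the bid. The instance type names and price histories are made up for illustration; real histories come from the Spot Instance pricing history.

```python
from statistics import mean, pstdev


def spot_cost(hourly_prices, bid):
    """Return (cost, hours_run): pay the market price until it exceeds the bid."""
    cost = 0.0
    hours = 0
    for price in hourly_prices:
        if price > bid:
            break  # the instance is reclaimed at this point
        cost += price
        hours += 1
    return cost, hours


def price_volatility(hourly_prices):
    """Coefficient of variation: lower means a more stable price."""
    return pstdev(hourly_prices) / mean(hourly_prices)


# Hypothetical hourly price histories in USD.
history = {
    "type-a": [0.10, 0.11, 0.10, 0.12, 0.10],
    "type-b": [0.05, 0.30, 0.06, 0.45, 0.05],
}

most_stable = min(history, key=lambda name: price_volatility(history[name]))
cost, hours = spot_cost(history[most_stable], bid=0.50)
print(f"{most_stable}: ran {hours} h for ${cost:.2f} with a $0.50 bid")
```

In this model a high bid on a stable instance type costs no more than a low one; it only lowers the chance of being reclaimed, which is the sense in which BID_PRICE != MONEY_TO_PAY.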
Further Reading
• To use Spark Streaming in production
• http://www.slideshare.net/SparkSummit/recipes-for-running-spark-streaming-apploications-in-production-tathagata-daspptx
• If you’re interested in the new ML pipelines
• http://www.slideshare.net/SparkSummit/building-debugging-and-tuning-spark-machine-leaning-pipelinesjoseph-bradley