
AWS meetup「Apache Spark on EMR」


  1. Apache Spark on EMR - Yuyang Lan, SmartNews Inc.
  2. Agenda • Intro • Recent Spark • How we use Spark at SmartNews • Best Practices
  3. Who am I • @y2_lan • Engineer at SmartNews Inc. (Ad team) • Hacker, Data Engineer, Beer Lover
  4. If you have any requests or problems, ping @kaiseh :)
  5. About Apache Spark (maybe just skip this?)
  6. About Apache Spark - a quick catch-up: RDDs, transformations, and actions (see the sketch below)
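     As a minimal refresher (the input path here is a hypothetical placeholder): transformations such as filter and map are lazy and only describe a new RDD, while an action such as count actually runs the job.

        import org.apache.spark.{SparkConf, SparkContext}

        val sc = new SparkContext(new SparkConf().setAppName("quick-catch-up"))
        val lines  = sc.textFile("s3://your-bucket/logs/")    // hypothetical input path
        val errors = lines.filter(_.contains("ERROR"))        // transformation: lazy
                          .map(_.split("\t")(0))              // transformation: still lazy
        println(errors.count())                               // action: triggers execution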
  7. Recent Spark at a glance • Databricks Cloud goes public • Spark 1.4.x • Project Tungsten • AWS adds support for Apache Spark on EMR • …
  8. Spark 1.4.x • SparkR • DataFrame API • ML Pipeline • Streaming UI • …
  9. Spark at SmartNews • Ad CTR prediction (logistic regression)
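     The deck doesn't show SmartNews's actual pipeline; a minimal MLlib sketch of CTR prediction with logistic regression, using a toy stand-in for real impression logs, might look like:

        import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
        import org.apache.spark.mllib.linalg.Vectors
        import org.apache.spark.mllib.regression.LabeledPoint

        // toy stand-in for real impression logs: (clicked, featureVector)
        val impressions = sc.parallelize(Seq(
          (true,  Array(1.0, 0.0, 3.5)),
          (false, Array(0.0, 1.0, 0.7))
        ))
        val training = impressions.map { case (clicked, features) =>
          LabeledPoint(if (clicked) 1.0 else 0.0, Vectors.dense(features))
        }.cache()

        val model = new LogisticRegressionWithLBFGS().setNumClasses(2).run(training)
        model.clearThreshold()    // emit raw probabilities instead of 0/1 labels
        val ctr = model.predict(Vectors.dense(1.0, 0.0, 2.0))  // predicted click probability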
  10. Spark at SmartNews • Scoring articles with Kinesis + Spark Streaming (sketch below)
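     A rough sketch of the Kinesis-to-Spark-Streaming wiring, using roughly the Spark 1.4-era KinesisUtils API; the app, stream, and region names are hypothetical, and the actual scoring logic is elided:

        import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream
        import org.apache.spark.storage.StorageLevel
        import org.apache.spark.streaming.kinesis.KinesisUtils
        import org.apache.spark.streaming.{Seconds, StreamingContext}

        val ssc = new StreamingContext(sc, Seconds(10))  // assumes an existing SparkContext sc
        val records = KinesisUtils.createStream(
          ssc, "article-scoring-app", "article-stream",
          "kinesis.ap-northeast-1.amazonaws.com", "ap-northeast-1",
          InitialPositionInStream.LATEST, Seconds(10), StorageLevel.MEMORY_AND_DISK_2)

        records.map(bytes => new String(bytes, "UTF-8"))  // Kinesis hands records over as Array[Byte]
               .foreachRDD { rdd =>
                 val n = rdd.count()   // stand-in for the real scoring / stats logic
                 println(s"processed $n articles in this batch")
               }

        ssc.start()
        ssc.awaitTermination()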
  11. Spark at SmartNews • Ad-hoc analysis, faster (and Hive-compatible) SQL
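     For the Hive-compatible part, a HiveContext queries existing Hive tables directly; the table and column names below are hypothetical:

        import org.apache.spark.sql.hive.HiveContext

        val hiveContext = new HiveContext(sc)
        val daily = hiveContext.sql(
          "SELECT dt, COUNT(*) AS impressions FROM ad_logs GROUP BY dt ORDER BY dt")
        daily.show()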
  12. Spark at SmartNews • Real-time stats with Kinesis + Spark Streaming
  13. Spark at SmartNews • ML experiments • Ad targeting • User clustering • Recommendation • …
  14. Best Practices #1 • Should you use the default Spark that ships with EMR? • Yes, sure • EMR 4.0 is great! (released today?!) • Hadoop 2.6 + Hive 1.0 + Spark 1.4.1 (launch example below)
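     One way to bring up such a cluster with the built-in Spark; the instance type, count, and key name are placeholders:

        # EMR 4.x selects applications via --release-label / --applications
        aws emr create-cluster \
          --release-label emr-4.0.0 \
          --applications Name=Hadoop Name=Hive Name=Spark \
          --instance-type m3.xlarge --instance-count 3 \
          --use-default-roles --ec2-attributes KeyName=your-key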
  15. Best Practices #1 • Should you use the default Spark that ships with EMR? • Build your own only if you need a custom Spark: • a cutting-edge version • native netlib-java (mvn -Pnetlib-lgpl) • custom dependency versions • …
  16. Best Practices #1 • Should you use the default Spark that ships with EMR? • Build your own only if you need a custom Spark • Install it via a bootstrap action: --bootstrap-actions bootstrap.json (sketch below)
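     A minimal sketch of what bootstrap.json could contain; the install script and tarball locations are hypothetical placeholders (JSON cannot carry comments, so all names here are invented for illustration):

        [
          {
            "Path": "s3://your-bucket/bootstrap/install-custom-spark.sh",
            "Name": "Install custom-built Spark",
            "Args": ["s3://your-bucket/dist/spark-1.4.1-custom.tgz"]
          }
        ]

        # then pass it to the cluster at launch time
        aws emr create-cluster ... --bootstrap-actions file://./bootstrap.json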
  17. Best Practices #1 • Should you use the default Spark that ships with EMR? • Build your own only if you need a custom Spark • Remember to start the Spark history server yourself (one way shown below)
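     One way to do that on the master node, assuming Spark lands in the usual /usr/lib/spark location and event logging is enabled; the log directory is a placeholder:

        # start the history server and point it at the event-log directory
        sudo -u hadoop /usr/lib/spark/sbin/start-history-server.sh
        # relies on settings such as:
        #   spark.eventLog.enabled        true
        #   spark.history.fs.logDirectory hdfs:///var/log/spark/apps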
  18. Best Practices #2 • Run Spark on YARN • Use yarn-cluster mode to distribute drivers across the cluster • Specify --jars and --files to ship necessary resources (example below)
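     For instance, a submission along these lines; the class, jar, and file names are placeholders:

        # yarn-cluster mode runs the driver inside the cluster, not on the gateway box;
        # --jars / --files ship extra dependencies and resources to the containers
        spark-submit \
          --master yarn-cluster \
          --class com.example.YourApp \
          --jars lib/mysql-connector.jar,lib/extra.jar \
          --files conf/hive-site.xml,conf/app.conf \
          your-app.jar arg1 arg2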
  19. Best Practices #3 • Tuning memory • A CPU shortage only slows your program down, but a memory shortage makes it crash • You can even set --executor-cores higher than your actual CPU count • Cacheable heap != the JVM's Xmx • (normally about 50% of it)
  20. Best Practices #3 • Tuning memory • A CPU shortage only slows your program down, but a memory shortage makes it crash • Cacheable heap != the JVM's Xmx (executor memory-layout diagram; image from http://0x0fff.com/spark-architecture/)
  21. Best Practices #3 • Tuning memory • A CPU shortage only slows your program down, but a memory shortage makes it crash • Cacheable heap != the JVM's Xmx • spark.yarn.executor.memoryOverhead • spark.executor.memory • spark.storage.memoryFraction • … • Split your executors if heap size > 64 GB (GC pauses) • -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps (settings example below)
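     Putting those knobs together on the command line; the values are illustrative, not recommendations:

        # memoryOverhead is the off-heap cushion YARN adds on top of executor memory (MB);
        # storage.memoryFraction bounds the cacheable share of the heap;
        # the GC flags go through executor extraJavaOptions
        spark-submit \
          --conf spark.executor.memory=8g \
          --conf spark.yarn.executor.memoryOverhead=1024 \
          --conf spark.storage.memoryFraction=0.5 \
          --conf "spark.executor.extraJavaOptions=-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
          ...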
  22. Best Practices #4 • If your ML job is really CPU-bound • Try using OpenBLAS + netlib.NativeSystemBLAS
  23. Best Practices #4 • Try using OpenBLAS + netlib.NativeSystemBLAS: roughly 4-5x faster
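     A quick way to check which implementation netlib-java actually picked up: expect NativeSystemBLAS when OpenBLAS is wired in, or F2jBLAS if it fell back to pure Java.

        // prints e.g. com.github.fommil.netlib.NativeSystemBLAS or ...F2jBLAS
        println(com.github.fommil.netlib.BLAS.getInstance().getClass.getName)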
  24. Best Practices #5 • Minimize data shuffle • Prefer reduceByKey over groupByKey + map (see the sketch below) • RDD.repartition(NUM_OF_CORES) before cache • Try to filter early
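     Both lines below compute per-key sums on a toy pair RDD, but reduceByKey combines values map-side before the shuffle, while groupByKey ships every value across the network first:

        val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

        val slow = pairs.groupByKey().map { case (k, vs) => (k, vs.sum) }  // shuffles all values
        val fast = pairs.reduceByKey(_ + _)                                // pre-aggregates map-side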
  25. Best Practices #5 • Minimize data shuffle
  26. Best Practices #6 • Prefer the DataFrame API over low-level RDD APIs • Better DAG optimization • Same interface and same performance (example below)
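     A small illustration; the input path and columns are hypothetical. The DataFrame version hands Catalyst a whole logical plan to optimize, e.g. pushing the filter down, where equivalent RDD code would be opaque to the engine:

        import org.apache.spark.sql.SQLContext

        val sqlContext = new SQLContext(sc)
        val events = sqlContext.read.json("s3://your-bucket/events/")  // hypothetical input

        events.filter(events("age") > 20)   // Catalyst can push this filter down
              .groupBy("country")
              .count()
              .show()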
  27. Best Practices #7 • Use Kryo serialization if possible: --conf spark.serializer=org.apache.spark.serializer.KryoSerializer
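     The same can be set in code, and registering your classes with Kryo avoids embedding full class names in the serialized stream; AdEvent below is a hypothetical application class:

        import org.apache.spark.SparkConf

        val conf = new SparkConf()
          .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
          .registerKryoClasses(Array(classOf[AdEvent]))  // AdEvent: hypothetical app class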
  28. Best Practices #8 • Pick a notebook tool (IPython or Zeppelin or …?) • For notes, sharing, and visualization • Convenient for non-engineer users
  29. Best Practices #9 • Multiple small, task-driven EMR clusters
  30. Best Practices #10 • Use dynamic allocation (scaling) with Spark Streaming • spark.dynamicAllocation.enabled = true • spark.shuffle.service.enabled = true • Be careful if you use cached data (example below)
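     On the command line that pairing might look like the following; the min/max bounds are illustrative extras, not from the deck:

        # dynamic allocation requires the external shuffle service on each node,
        # so executors can be released without losing their shuffle files
        spark-submit \
          --conf spark.dynamicAllocation.enabled=true \
          --conf spark.shuffle.service.enabled=true \
          --conf spark.dynamicAllocation.minExecutors=2 \
          --conf spark.dynamicAllocation.maxExecutors=20 \
          ...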
  31. Best Practices #11 • Use spot instances • Be more aggressive with the bid price :p • BID_PRICE != MONEY_TO_PAY (you pay the market price, not your bid) • Check the spot instance pricing history (see below) • Find an instance type with a relatively stable price • often a previous-generation instance? • Prepare for failure; don't use them for critical missions
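     Checking the history and bidding per instance group could look like this; the instance type, count, and bid are placeholders:

        # inspect recent spot prices for a candidate type
        aws ec2 describe-spot-price-history \
          --instance-types m2.4xlarge \
          --product-descriptions "Linux/UNIX"

        # on EMR, spot instances are requested per instance group via BidPrice
        aws emr create-cluster ... \
          --instance-groups InstanceGroupType=TASK,InstanceType=m2.4xlarge,InstanceCount=4,BidPrice=0.50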
  32. Further Reading • To use Spark Streaming in production: http://www.slideshare.net/SparkSummit/recipes-for-running-spark-streaming-apploications-in-production-tathagata-daspptx
  33. Further Reading • If you're interested in the new ML pipelines: http://www.slideshare.net/SparkSummit/building-debugging-and-tuning-spark-machine-leaning-pipelinesjoseph-bradley
  34. Thanks! We're hiring! http://about.smartnews.com/ja/careers/ iOS Engineer / Android Engineer / Web Application Engineer / Productivity Engineer / Machine Learning / Natural Language Processing Engineer / Growth Hack Engineer / Server-side Engineer / Ad Engineer…
