AWS meetup: "Apache Spark on EMR"

  1. Apache Spark on EMR Yuyang Lan SmartNews Inc.
  2. Agenda • Intro • Recent Spark • How we use Spark at SmartNews • Best Practices
  3. Who am I • @y2_lan • Engineer at SmartNews Inc. (AD team) • Hacker, Data Engineer, Beer Lover
  4. If you have any requests or problems, contact @kaiseh :)
  5. About Apache Spark • Maybe just skip?
  6. About Apache Spark • Quick catch-up • RDD • Actions • Transformations
  7. Recent Spark at a glance • Databricks Cloud goes public • Spark 1.4.x • Project Tungsten • AWS adds support for Apache Spark on EMR • …
  8. Spark 1.4.x • SparkR • DataFrame API • ML Pipeline • Streaming UI • …
  9. Spark at SmartNews • AD CTR Prediction ( Logistic Regression )
  10. Spark at SmartNews • Scoring articles by Kinesis + Spark Streaming
  11. Spark at SmartNews • Ad-Hoc Analysis, Faster (& Hive-compatible) SQL
  12. Spark at SmartNews • Realtime Stats by Kinesis + Spark Streaming
  13. Spark at SmartNews • ML experiments • AD targeting • User Clustering • Recommendation • …
  14. Best Practices #1 • Should you use the default Spark that comes with EMR? • Yes, sure • EMR 4.0 is great! (Released today?!) • Hadoop 2.6 + Hive 1.0 + Spark 1.4.1
  15. Best Practices #1 • Should you use the default Spark that comes with EMR? • Build your own only if you need a custom-built Spark • Cutting-edge version • Native netlib-java ( mvn -Pnetlib-lgpl ) • Custom dependency versions • …
  16. Best Practices #1 • Should you use the default Spark that comes with EMR? • Build your own only if you need a custom-built Spark • --bootstrap-actions bootstrap.json
  17. Best Practices #1 • Should you use the default Spark that comes with EMR? • Build your own only if you need a custom-built Spark • Remember to start SparkHistoryServer
  18. Best Practices #2 • Run Spark on YARN • Use yarn-cluster mode to distribute drivers across the cluster • Specify jars and files to distribute the necessary resources
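A minimal sketch of what such a submission could look like, written as a small Python helper that assembles a `spark-submit` command line. The flags `--master yarn-cluster`, `--class`, `--jars` and `--files` are standard `spark-submit` options in Spark 1.x; the helper name, paths and class name are hypothetical and only serve as illustration.

```python
import shlex


def build_spark_submit(app_jar, main_class, jars=(), files=(), app_args=()):
    """Assemble a spark-submit argument list for yarn-cluster mode.

    In yarn-cluster mode the driver runs inside the cluster, so every
    extra jar or file it needs must be shipped with --jars / --files.
    """
    cmd = ["spark-submit", "--master", "yarn-cluster", "--class", main_class]
    if jars:
        # --jars takes a single comma-separated list
        cmd += ["--jars", ",".join(jars)]
    if files:
        # --files takes a single comma-separated list as well
        cmd += ["--files", ",".join(files)]
    cmd.append(app_jar)
    cmd += list(app_args)
    return cmd


# Hypothetical paths and class name, for illustration only.
cmd = build_spark_submit(
    app_jar="/home/hadoop/jobs/example-job.jar",
    main_class="com.example.ExampleJob",
    jars=["/home/hadoop/lib/dep-a.jar", "/home/hadoop/lib/dep-b.jar"],
    files=["/home/hadoop/conf/job.properties"],
    app_args=["2015-07-24"],
)
print(" ".join(shlex.quote(part) for part in cmd))
```

The returned list can be passed directly to `subprocess.run` on a node where `spark-submit` is installed.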
  19. Best Practices #3 • Tuning Memory • A CPU shortage only slows your program down, but a memory shortage makes it crash • You can even set --executor-cores higher than your CPU count • Cache-able heap != JVM’s Xmx • (normally about 50%)
  20. Best Practices #3 • Tuning Memory • A CPU shortage only slows your program down, but a memory shortage makes it crash • Cache-able heap != JVM’s Xmx • Image from: http://0x0fff.com/spark-architecture/
  21. Best Practices #3 • Tuning Memory • A CPU shortage only slows your program down, but a memory shortage makes it crash • Cache-able heap != JVM’s Xmx • spark.yarn.executor.memoryOverhead • spark.executor.memory • spark.storage.memoryFraction • … • Split your executors if HEAP_SIZE > 64GB (GC) • -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
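The "about 50%" figure can be reproduced with a little arithmetic. The sketch below assumes the Spark 1.x memory model, where the cache-able part of the heap is `spark.executor.memory * spark.storage.memoryFraction * spark.storage.safetyFraction`. The defaults used here (0.6, 0.9, and an overhead of 10% of the heap with a 384 MB floor) are assumptions about that Spark generation and are not stated on the slides; the function name is hypothetical. Check the values against the Spark version you actually run.

```python
def executor_memory_breakdown(executor_memory_mb,
                              storage_fraction=0.6,   # assumed default of spark.storage.memoryFraction
                              safety_fraction=0.9,    # assumed default of spark.storage.safetyFraction
                              overhead_mb=None):      # spark.yarn.executor.memoryOverhead
    """Estimate the YARN container size and the cache-able heap for one executor."""
    if overhead_mb is None:
        # Assumed default: 10% of the executor heap, but at least 384 MB.
        overhead_mb = max(384, executor_memory_mb // 10)
    cacheable_mb = executor_memory_mb * storage_fraction * safety_fraction
    return {
        "yarn_container_mb": executor_memory_mb + overhead_mb,
        "overhead_mb": overhead_mb,
        "cacheable_mb": cacheable_mb,
        "cacheable_share": cacheable_mb / executor_memory_mb,
    }


breakdown = executor_memory_breakdown(10240)  # a 10 GB executor heap
for name, value in breakdown.items():
    print(name, round(value, 2))
```

Under these assumptions a 10 GB heap asks YARN for an 11 GB container, while only a little more than half of the heap is available for cached RDDs.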
  22. Best Practices #4 • If your ML job is really CPU-bound • Try using OpenBLAS + netlib.NativeSystemBLAS
  23. Best Practices #4 • Try using OpenBLAS + netlib.NativeSystemBLAS • 4 to 5 times faster
  24. Best Practices #5 • Minimize data shuffle • Prefer reduceByKey over groupByKey+map • RDD.repartition(NUM_OF_CORES) before cache • Filter as early as possible
  25. Best Practices #5 • Minimize data shuffle
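Why reduceByKey shuffles less than groupByKey+map can be illustrated without a cluster. The following is a plain-Python simulation, not Spark code: it models partitions as lists of (key, value) pairs and counts how many records would cross the shuffle boundary. The function names and the sample data are hypothetical.

```python
def shuffled_records_group_by_key(partitions):
    """groupByKey style: every (key, value) record is sent through the shuffle."""
    return sum(len(part) for part in partitions)


def shuffled_records_reduce_by_key(partitions, reduce_fn):
    """reduceByKey style: combine inside each partition first (map-side combine),
    then shuffle only one record per distinct key per partition."""
    shuffled = []
    for part in partitions:
        combined = {}
        for key, value in part:
            combined[key] = reduce_fn(combined[key], value) if key in combined else value
        shuffled.extend(combined.items())
    # Final merge on the reduce side.
    totals = {}
    for key, value in shuffled:
        totals[key] = reduce_fn(totals[key], value) if key in totals else value
    return len(shuffled), totals


# Two partitions of made-up click events, (ad_id, 1) per click.
partitions = [
    [("ad1", 1), ("ad2", 1), ("ad1", 1), ("ad1", 1)],
    [("ad2", 1), ("ad2", 1), ("ad3", 1)],
]

print("groupByKey shuffles:", shuffled_records_group_by_key(partitions))
count, totals = shuffled_records_reduce_by_key(partitions, lambda a, b: a + b)
print("reduceByKey shuffles:", count, "result:", totals)
```

The gap grows with the number of repeated keys per partition, which is why the slide recommends reduceByKey for aggregations.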
  26. Best Practices #6 • Prefer DataFrame APIs over low-level RDD APIs • Better DAG optimization • Same interface & same performance
  27. Best Practices #7 • Use Kryo serialization if possible • --conf spark.serializer=org.apache.spark.serializer.KryoSerializer
  28. Best Practices #8 • Pick a notebook tool (IPython or Zeppelin or ?) • For notes, sharing, visualisation • Convenient for non-engineer users
  29. Best Practices #9 • Multiple small & task-driven EMR clusters
  30. Best Practices #10 • Use dynamic scaling with Spark Streaming • spark.dynamicAllocation.enabled = true • spark.shuffle.service.enabled = true • Be careful if you use cached data
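One way to keep these settings together is a small dictionary that is rendered into `--conf` flags for `spark-submit`. This is only a sketch: the two `enabled` keys come from the slide, `spark.dynamicAllocation.minExecutors` and `spark.dynamicAllocation.maxExecutors` are real Spark properties but their values here are hypothetical, and the helper name is made up.

```python
def to_conf_flags(settings):
    """Render a settings dict as a flat list of spark-submit --conf flags."""
    flags = []
    for key in sorted(settings):
        value = settings[key]
        if isinstance(value, bool):
            # Spark expects lowercase true/false
            value = "true" if value else "false"
        flags += ["--conf", f"{key}={value}"]
    return flags


dynamic_allocation = {
    "spark.dynamicAllocation.enabled": True,
    "spark.shuffle.service.enabled": True,
    # Hypothetical bounds, chosen only for illustration.
    "spark.dynamicAllocation.minExecutors": 2,
    "spark.dynamicAllocation.maxExecutors": 20,
}

print(" ".join(to_conf_flags(dynamic_allocation)))
```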
  31. Best Practices #11 • Use Spot Instances • Be more aggressive with your bid price :p • BID_PRICE != MONEY_TO_PAY • Check the Spot Instance pricing history • Find the instance type with a relatively stable price • Often a previous-generation instance? • Prepare for failure; don’t use them for mission-critical jobs
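The difference between the bid and the money actually paid can be sketched with a toy model. The code below assumes a simplified billing rule (each hour is charged at the market price, and the instance is reclaimed as soon as the market price exceeds the bid), which is an approximation and not the exact AWS billing logic. All prices and instance-type labels are made up, and the function names are hypothetical.

```python
from statistics import mean, pstdev


def hours_run_and_cost(hourly_prices, bid):
    """Walk through hourly market prices; stop when the price exceeds the bid.

    Returns (hours actually run, total cost). The cost is the sum of market
    prices, not bid * hours.
    """
    hours, cost = 0, 0.0
    for price in hourly_prices:
        if price > bid:
            break  # instance is reclaimed
        cost += price
        hours += 1
    return hours, cost


def most_stable(histories):
    """Pick the instance type with the lowest relative price fluctuation."""
    return min(histories, key=lambda name: pstdev(histories[name]) / mean(histories[name]))


# Made-up price histories (USD per hour) for two hypothetical instance types.
histories = {
    "type-a": [0.10, 0.11, 0.10, 0.12],
    "type-b": [0.05, 0.30, 0.06, 0.80],
}

hours, cost = hours_run_and_cost(histories["type-a"], bid=0.50)
print(f"ran {hours}h, paid {cost:.2f}, bid would imply {0.50 * hours:.2f}")
print("most stable:", most_stable(histories))
```

In this toy data an aggressive bid on the stable type costs far less than the bid suggests, while the volatile type is interrupted after its first hour with a lower bid.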
  32. Further Reading • To use Spark Streaming in production • http://www.slideshare.net/SparkSummit/recipes-for-running-spark-streaming-apploications-in-production-tathagata-daspptx
  33. Further Reading • If you’re interested in new ML pipelines • http://www.slideshare.net/SparkSummit/building-debugging-and-tuning-spark-machine-leaning-pipelinesjoseph-bradley
  34. Thanks! We’re hiring! http://about.smartnews.com/ja/careers/ iOS Engineer / Android Engineer / Web Application Engineer / Productivity Engineer / Machine Learning / Natural Language Processing Engineer / Growth Hack Engineer / Server-Side Engineer / Ad Engineer…