Memory in Apache Spark: Memory Design Techniques to Keep Your Application from Crashing


Presented at Ichigaya Geek★Night #11 (Spark study session) "ChristmaSpark" https://ichigayageek.connpass.com/event/45925/


  1. Apache Spark / @laclefyoshi / ysaeki@r.recruit.co.jp
  2. Agenda: memory in Apache Spark
  3. Speaker background:
     • 2011/04; 2015/09
     • Books: Druid (KDP, 2015); RDB/NoSQL (2016; HBase chapter); ESP8266 Wi-Fi IoT (KDP, 2016)
     • Talks: (WebDB Forum 2014); Spark Streaming (Spark Meetup December 2015); Kafka and AWS Kinesis (Apache Kafka Meetup Japan #1; 2016); (FutureOfData; 2016); Queryable State for Kafka Streams (Apache Kafka Meetup Japan #2; 2016)
  4. Why Spark?
  5. In-memory Computing vs. Disk-based Computing (diagram)
  6. Memory price trend: http://www.jcmit.com/memoryprice.htm (chart)
  7. In-memory computing products: Memcached (2003~), Hazelcast (2008~), Apache Spark (2009~), HANA / Exadata (2010~), Apache Ignite (2011~)
  8. Typical Apache Spark failure messages:
     Lost executor X on xxxx: remote Akka client disassociated
     Container marked as failed: container_xxxx on host: xxxx. Exit status: 1
     Container killed by YARN for exceeding memory limits
     shutting down JVM since 'akka.jvm-exit-on-fatal-error' is enabled for ActorSystem[Remote]
  9. How come?
  10. Apache Spark architecture: a Driver coordinating multiple Executors (diagram)
  11. Apache Spark architecture, continued (diagram)
  12. Apache Spark uses both Disk and Memory (diagram)
  13. Apache Spark memory is configured at submit time:
     $ spark-submit
         --MEMORY_OPTIONS1 --MEMORY_OPTIONS2 --MEMORY_OPTIONS3
         --conf ADDITIONAL_OPTIONS1 --conf ADDITIONAL_OPTIONS2
         --class jp.co.recruit.app.Main
         spark-project-1.0-SNAPSHOT.jar
  14. Apache Spark: Heap. On-heap is set with --executor-memory XXG or --conf spark.executor.memory=XXG; off-heap is set with --conf spark.memory.offHeap.size=XXX.
  15. Apache Spark: per-node layout. Each Executor has its own on-heap and off-heap regions plus local Disk, sharing the node with the OS and other applications (diagram).
  16. Apache Spark: on Mesos / YARN, each Executor runs in a Container that also reserves an Overhead region (diagram).
  17. Apache Spark: Overhead. On-heap is --executor-memory XXG or --conf spark.executor.memory=XXG; the Overhead is set with --conf spark.mesos.executor.memoryOverhead or --conf spark.yarn.executor.memoryOverhead, defaulting to max(XXG * 0.10, 384MB).
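The default overhead rule above can be checked with a small sketch in plain Scala. The 10% factor and the 384MB floor mirror the documented default of spark.yarn.executor.memoryOverhead; the object and parameter names here are made up for illustration:

```scala
// Sketch of Spark's default executor memoryOverhead rule:
// max(10% of executor memory, 384 MB). All values in megabytes.
object OverheadSketch {
  val MinOverheadMB = 384L

  def memoryOverheadMB(executorMemoryMB: Long): Long =
    math.max(executorMemoryMB / 10, MinOverheadMB)
}
```

For an 8GB executor this gives 819MB of overhead; for small executors the 384MB floor dominates, so the container request is always noticeably larger than --executor-memory alone.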
  18. Apache Spark: what the Overhead holds: Java VM internals such as interned strings and other native overheads (diagram).
  19. Apache Spark: Overhead (diagram)
  20. Apache Spark: Garbage Collection works on the on-heap region (diagram).
  21. Apache Spark: the Tachyon Block Store lives off-heap (diagram).
  22. Apache Spark: Tachyon Block Store, continued (diagram).
  23. Apache Spark: Project Tungsten manages data off-heap (diagram).
  24. Apache Spark: 300MB of on-heap is reserved by Spark itself: don't touch!
  25. Apache Spark: User Memory. --conf spark.memory.fraction=0.6 sets the Memory Fraction; the rest of on-heap (after the 300MB reserved) is User Memory, which holds user-defined data structures.

  26. Apache Spark: Execution and Storage. --conf spark.memory.storageFraction=0.5 splits the Memory Fraction into a Storage Fraction and an Execution Fraction.
  27. Apache Spark: Execution and Storage. The Storage Fraction holds cached RDDs, Broadcast variables, and Accumulators; the Execution Fraction holds working memory for Shuffle, Join, Sort, and Aggregate.
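The on-heap split described on slides 25-27 can be sketched numerically. The sketch assumes the defaults spark.memory.fraction=0.6 and spark.memory.storageFraction=0.5 and the fixed 300MB reserved region; the class and method names are illustrative, not Spark APIs:

```scala
// Sketch of the on-heap split (Spark 1.6+ unified memory model).
// All sizes in MB; fraction defaults mirror spark.memory.fraction=0.6
// and spark.memory.storageFraction=0.5.
case class HeapLayout(reservedMB: Double, userMB: Double,
                      storageMB: Double, executionMB: Double)

object HeapSplit {
  def layout(heapMB: Double,
             memoryFraction: Double = 0.6,
             storageFraction: Double = 0.5): HeapLayout = {
    val reserved = 300.0                      // fixed region reserved by Spark
    val usable   = heapMB - reserved
    val unified  = usable * memoryFraction    // storage + execution
    val user     = usable - unified           // user data structures
    val storage  = unified * storageFraction  // cached RDDs, broadcasts
    val exec     = unified - storage          // shuffle, join, sort, aggregate
    HeapLayout(reserved, user, storage, exec)
  }
}
```

With an 8192MB heap this yields roughly 3157MB of User Memory and about 2368MB each for the Storage and Execution fractions, which is why a "big" executor still caches far less than its nominal size.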
  28. Apache Spark: Unified Memory. Since Spark 1.6, the Storage and Execution fractions can borrow free space from each other.

  29. Examples
  30. Example: filling User Memory (diagram)
  31. Example: User Memory overflow (diagram)
  32. Example: filling the Storage Fraction (diagram)
  33. Example: Storage Fraction overflow (diagram)
  34. Example: filling every region (diagram)
  35. Example: no space left anywhere: OutOfMemoryError (diagram)
  36. How Spark helps keep our applications running
  37. Apache Spark: under pressure, Execution data can Spill to Disk, and Project Tungsten moves data off-heap (diagram).
  38. Apache Spark: Garbage Collection on the on-heap region (diagram).
  39. JVM: Garbage Collection options:
     -XX:+UseConcMarkSweepGC        // use the CMS collector
     -XX:+UseParNewGC               // parallel young-generation GC
     -XX:+CMSParallelRemarkEnabled  // parallelize the CMS remark phase
     -XX:+DisableExplicitGC         // ignore explicit System.gc() calls
  40. JVM: Garbage Collection diagnostics:
     -XX:+HeapDumpOnOutOfMemoryError  // dump the heap on OutOfMemoryError
     -XX:+PrintGCDetails              // log GC details
     -XX:+PrintGCDateStamps           // add timestamps to GC log entries
     -XX:+UseGCLogFileRotation        // rotate GC log files
  41. Passing JVM options:
     $ spark-submit
         --executor-memory 8G
         --num-executors 20
         --executor-cores 2
         --conf "spark.executor.extraJavaOptions=..."
         --conf spark.memory.offHeap.enabled=true
         --conf spark.memory.offHeap.size=1073741824
         --class jp.co.recruit.app.Main
         spark-project-1.0-SNAPSHOT.jar
  42. How we can keep our applications running ourselves
  43. Caching RDDs: rdd.cache() is shorthand for rdd.persist(), which defaults to rdd.persist(StorageLevel.MEMORY_ONLY).
  44. RDD storage levels: MEMORY_ONLY, MEMORY_ONLY_2, MEMORY_ONLY_SER, MEMORY_AND_DISK, MEMORY_AND_DISK_2, MEMORY_AND_DISK_SER, DISK_ONLY, OFF_HEAP
  45. RDD storage levels combine memory, disk, serialization, and replication (diagram).
  46. Estimating RDD size (1): SizeEstimator
     $ spark-shell
     > import org.apache.spark.util.SizeEstimator
     > SizeEstimator.estimate("1234")
     res0: Long = 48
     > val rdd = sc.makeRDD((1 to 100000).map(e => e.toString).toSeq)
     > SizeEstimator.estimate(rdd)
     res2: Long = 7246792
  47. Estimating RDD size (2): the Storage panel of the Web UI, after persisting:
     > SizeEstimator.estimate(rdd)
     res2: Long = 7246792
     > rdd.persist(StorageLevel.MEMORY_ONLY)
  48. Choosing a storage level per RDD:
     > val orders = sc.textFile("lineorder.csv")
     orders: org.apache.spark.rdd.RDD[String] = ...
     > val result = orders.map(...)
     result: org.apache.spark.rdd.RDD[String] = ...
     > orders.persist(StorageLevel.MEMORY_ONLY)
     > result.persist(StorageLevel.MEMORY_AND_DISK)
  49. > result.persist(StorageLevel.MEMORY_AND_DISK) (diagram)
  50. > orders.persist(StorageLevel.MEMORY_ONLY) (diagram)
  51. Warnings when a block cannot be cached:
     16/12/09 14:34:06 WARN MemoryStore: Not enough space to cache rdd_1_39 in memory! (computed 44.4 MB so far)
     16/12/09 14:34:06 WARN BlockManager: Block rdd_1_39 could not be removed as it was not found on disk or in memory
     16/12/09 14:34:06 WARN BlockManager: Putting block rdd_1_39 failed
  52. When a partition does not fit in the Storage Fraction (diagram)
  53. When a partition does not fit, continued (diagram)
  54. Shrinking partitions by repartitioning the RDD (note: repartition returns a new RDD):
     > orders.partitions.size
     res3: Int = 40
     > val repartitioned = orders.repartition(80)
     > repartitioned.persist(StorageLevel.MEMORY_ONLY)
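A rough way to pick a partition count like the 80 above is to divide the estimated RDD size by a target per-partition block size, so no single cached block overwhelms the Storage Fraction. This helper is hypothetical, not part of Spark:

```scala
// Hypothetical helper: choose a partition count so each cached
// partition stays under a target block size, which helps avoid
// "Not enough space to cache rdd_X" warnings on oversized partitions.
object PartitionSizing {
  def partitionsFor(rddSizeMB: Double, targetPartitionMB: Double): Int =
    math.max(1, math.ceil(rddSizeMB / targetPartitionMB).toInt)
}
```

For example, a 4000MB RDD with a 50MB-per-partition target would be split into 80 partitions; tiny RDDs fall back to a single partition.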
  55. If it still does not fit: OutOfMemoryError (diagram)
  56. Releasing cached RDDs:
     > rdd.unpersist(true)  // block until the data is removed
     > rdd.unpersist(false) // remove asynchronously
  57. When the Execution Fraction runs short, look at Garbage Collection (GC tuning) and Shuffle behavior.
  58. Apache Spark memory options at a glance:
     On-heap: --executor-memory / --conf spark.executor.memory
     Unified memory: --conf spark.memory.fraction
     Storage Fraction: --conf spark.memory.storageFraction
     Off-heap: --conf spark.memory.offHeap.size
     Overhead: --conf spark.mesos.executor.memoryOverhead / --conf spark.yarn.executor.memoryOverhead
  59. Sizing an Executor:
     • [A] Storage Fraction = total size of cached RDDs
     • [B] Execution Fraction = about the same as A
     • [C] On-heap = (A + B) / 0.6 + 300MB // 0.6 = spark.memory.fraction; the remainder is User Memory
     • [D] Off-heap = size of RDDs stored off-heap
     • [E] Overhead = max(C * 0.1, 384MB)
     • [F] = number of Containers (Executors) per node
     • [G] = memory reserved for the OS
     • [H] = physical memory per node
     Check: (C + D + E) * F + G < H
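The A-through-H checklist above can be turned into a small worked calculation. All numbers and helper names below are hypothetical, chosen only to make the arithmetic concrete:

```scala
// Worked instance of the A..H executor-sizing checklist. Sizes in MB.
object ExecutorSizing {
  // [C] on-heap: unified memory scaled back up by spark.memory.fraction
  // (0.6), plus Spark's fixed 300MB reserved region.
  def onHeapMB(storageMB: Double, executionMB: Double): Double =
    (storageMB + executionMB) / 0.6 + 300

  // [E] overhead: 10% of on-heap with a 384MB floor.
  def overheadMB(onHeap: Double): Double =
    math.max(onHeap * 0.1, 384)

  // [H] check: (C + D + E) * F + G must stay under the node's memory.
  def fitsOnNode(storageMB: Double, executionMB: Double, offHeapMB: Double,
                 executorsPerNode: Int, osMB: Double, nodeMB: Double): Boolean = {
    val c = onHeapMB(storageMB, executionMB)
    val e = overheadMB(c)
    (c + offHeapMB + e) * executorsPerNode + osMB < nodeMB
  }
}
```

With 2GB each of storage and execution demand, 1GB off-heap, and 4GB kept for the OS, two executors fit on a 32GB node but four do not, which is exactly the kind of trade-off the checklist is meant to surface before YARN starts killing containers.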
  60. What about the Driver? It has the same kinds of knobs:
     --driver-memory / --conf spark.driver.memory
     Overhead: --conf spark.mesos.driver.memoryOverhead / --conf spark.yarn.driver.memoryOverhead
     Actions (collect, reduce, take, ...) pull results back to the Driver; cap them with --conf spark.driver.maxResultSize=1G.
  61. Yes, it's all about Spark Memory.
  62. Enjoy In-memory Computing!
