Memory in Apache Spark: Memory Design Techniques That Keep Applications from Crashing


Published as presentation material for 市ヶ谷Geek★Night #11 [Spark study session] ChristmaSpark: https://ichigayageek.connpass.com/event/45925/


  1. Apache Spark: Memory Design Techniques That Keep Applications from Crashing / @laclefyoshi / ysaeki@r.recruit.co.jp
  2. • Apache Spark
  3. • 2011/04 • 2015/09 • Druid (KDP, 2015) • RDB NoSQL (2016; HBase) • ESP8266 Wi-Fi IoT (KDP, 2016) • (WebDB Forum 2014) • Spark Streaming (Spark Meetup December 2015) • Kafka AWS Kinesis (Apache Kafka Meetup Japan #1; 2016) • (FutureOfData; 2016) • Queryable State for Kafka Streams (Apache Kafka Meetup Japan #2; 2016)
  4. Why Spark?
  5. In-memory Computing vs. Disk-based Computing
  6. Memory price trend: http://www.jcmit.com/memoryprice.htm
  7. In-memory Computing: Memcached, Hazelcast, HANA, Exadata, Apache Ignite, Apache Spark (timeline: 2003 ~ / 2008 ~ / 2009 ~ / 2010 ~ / 2011 ~)
  8. Apache Spark failure messages:
     • Lost executor X on xxxx: remote Akka client disassociated
     • Container marked as failed: container_xxxx on host: xxxx. Exit status: 1
     • Container killed by YARN for exceeding memory limits
     • shutting down JVM since 'akka.jvm-exit-on-fatal-error' is enabled for ActorSystem[Remote]
  9. How come?
  10. Apache Spark: Driver and Executors (architecture diagram)
  11. Apache Spark: Driver and Executors (architecture diagram)
  12. Apache Spark: Disk and Memory
  13. Apache Spark memory settings:
      $ spark-submit --MEMORY_OPTIONS1 --MEMORY_OPTIONS2 --MEMORY_OPTIONS3 --conf ADDITIONAL_OPTIONS1 --conf ADDITIONAL_OPTIONS2 --class jp.co.recruit.app.Main spark-project-1.0-SNAPSHOT.jar
  14. Apache Spark: Heap. On-heap: --executor-memory XXG or --conf spark.executor.memory=XXG; Off-heap: --conf spark.memory.offHeap.size=XXX; plus Disk
  15. Apache Spark: Executor (diagram: each Executor has On-heap and Off-heap memory, alongside Disk, the OS, and other apps)
  16. Apache Spark: Container (diagram: a Mesos / YARN container holds an Executor's On-heap and Off-heap plus Overhead)
  17. Apache Spark: Overhead. On-heap: --executor-memory XXG or --conf spark.executor.memory=XXG. Overhead: --conf spark.mesos.executor.memoryOverhead or --conf spark.yarn.executor.memoryOverhead = max(XXG * 0.10, 384MB)
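The default overhead rule on this slide (10% of the on-heap size, with a 384MB floor) can be sketched as a quick calculation; the helper name and the example sizes below are illustrative assumptions, not from the slides:

```python
# Sketch of the default executor memoryOverhead rule on Mesos/YARN:
# max(10% of executor memory, 384 MB).
MIN_OVERHEAD_MB = 384      # floor applied when 10% would be too small
OVERHEAD_FACTOR = 0.10     # 10% of --executor-memory

def executor_overhead_mb(executor_memory_mb: int) -> int:
    """Return the default memoryOverhead in MB for an executor."""
    return max(int(executor_memory_mb * OVERHEAD_FACTOR), MIN_OVERHEAD_MB)

print(executor_overhead_mb(8192))   # 8 GB executor -> 819
print(executor_overhead_mb(2048))   # 2 GB executor -> 384 (the floor wins)
```

For small executors the 384MB floor dominates, which is why tiny containers still fail with less headroom than expected.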
  18. Apache Spark: Overhead • Java VM
  19. Apache Spark: Overhead (diagram)
  20. Apache Spark: Garbage Collection (diagram)
  21. Apache Spark: Tachyon (Tachyon Block Store, off-heap)
  22. Apache Spark: Tachyon (Tachyon Block Store, off-heap)
  23. Apache Spark: Project Tungsten (off-heap)
  24. Apache Spark: Reserved Memory (300MB of on-heap): Don't touch!
  25. Apache Spark: User Memory. --conf spark.memory.fraction=0.6 sets the Memory Fraction; the rest of the heap (after the 300MB reserve) is User Memory.
  26. Apache Spark: Execution / Storage. --conf spark.memory.storageFraction=0.5 splits the Memory Fraction into a Storage Fraction and an Execution Fraction.
  27. Apache Spark: Execution / Storage • Storage: caching, Broadcast, Accumulator • Execution: Shuffle, Join, Sort, Aggregate
  28. Apache Spark: Unified Memory. The Storage and Execution fractions can borrow free space from each other.
  29. Examples
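Before the slide examples, the on-heap layout from slides 24-28 (300MB reserved, then spark.memory.fraction and spark.memory.storageFraction carving up the remainder) can be checked numerically. The 8192MB heap and the function name are assumptions for illustration:

```python
# Sketch of Spark's unified on-heap memory layout:
# 300MB reserved, then spark.memory.fraction of the remainder is
# split into Storage and Execution by spark.memory.storageFraction.
RESERVED_MB = 300

def memory_regions_mb(heap_mb, memory_fraction=0.6, storage_fraction=0.5):
    usable = heap_mb - RESERVED_MB
    unified = usable * memory_fraction     # Execution + Storage (unified)
    user = usable - unified                # User Memory
    storage = unified * storage_fraction   # Storage Fraction
    execution = unified - storage          # Execution Fraction
    return {"user": user, "storage": storage, "execution": execution}

# With an assumed 8192MB heap: storage and execution each get
# roughly 2367.6MB, and User Memory gets roughly 3156.8MB.
print(memory_regions_mb(8192))
```

Raising spark.memory.fraction grows the unified region at the expense of User Memory; raising spark.memory.storageFraction protects cached blocks at the expense of execution headroom.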
  30. Example: User Memory (diagram)
  31. Example: User Memory (diagram)
  32. Example: Storage Fraction (diagram)
  33. Example: Storage Fraction (diagram)
  34. Example (diagram)
  35. Example: OutOfMemoryError (diagram)
  36. How Spark can help us not to stop our applications
  37. Apache Spark: Spill. Execution data spills to Disk; Project Tungsten moves data off-heap (diagram)
  38. Apache Spark: Garbage Collection (diagram)
  39. JVM: Garbage Collection
      -XX:+UseConcMarkSweepGC // use the CMS collector
      -XX:+UseParNewGC // parallel young-generation GC
      -XX:+CMSParallelRemarkEnabled // parallelize the CMS remark phase
      -XX:+DisableExplicitGC // ignore explicit GC calls (System.gc())
  40. JVM: Garbage Collection
      -XX:+HeapDumpOnOutOfMemoryError // dump the heap on OutOfMemoryError
      -XX:+PrintGCDetails // log GC details
      -XX:+PrintGCDateStamps // timestamp each GC log entry
      -XX:+UseGCLogFileRotation // rotate GC log files
  41. JVM options:
      $ spark-submit --executor-memory 8GB --num-executors 20 --executor-cores 2 --conf "spark.executor.extraJavaOptions=..." --conf spark.memory.offHeap.enabled=true --conf spark.memory.offHeap.size=1073741824 --class jp.co.recruit.app.Main spark-project-1.0-SNAPSHOT.jar
  42. How we can help ourselves not to stop our applications
  43. Caching RDDs: rdd.cache() / rdd.persist() / rdd.persist(StorageLevel.MEMORY_ONLY)
  44. RDD StorageLevels: MEMORY_ONLY, MEMORY_ONLY_2, MEMORY_ONLY_SER, MEMORY_AND_DISK, MEMORY_AND_DISK_2, MEMORY_AND_DISK_SER, DISK_ONLY, OFF_HEAP
  45. RDD (diagram)
  46. Estimating RDD size (1) • SizeEstimator
      $ spark-shell
      > import org.apache.spark.util.SizeEstimator
      > SizeEstimator.estimate("1234")
      res0: Long = 48
      > val rdd = sc.makeRDD((1 to 100000).map(e => e.toString).toSeq)
      > SizeEstimator.estimate(rdd)
      res2: Long = 7246792
  47. Estimating RDD size (2) • Web UI Storage panel
      > SizeEstimator.estimate(rdd)
      res2: Long = 7246792
      > rdd.persist(StorageLevel.MEMORY_ONLY)
  48. Persisting RDDs:
      > orders = sc.textFile("lineorder.csv")
      orders: org.apache.spark.rdd.RDD[String] = ...
      > result = orders.map(...)
      result: org.apache.spark.rdd.RDD[String] = ...
      > orders.persist(StorageLevel.MEMORY_ONLY)
      > result.persist(StorageLevel.MEMORY_AND_DISK)
  49. > result.persist(StorageLevel.MEMORY_AND_DISK)
  50. > orders.persist(StorageLevel.MEMORY_ONLY)
  51. Log output:
      16/12/09 14:34:06 WARN MemoryStore: Not enough space to cache rdd_1_39 in memory! (computed 44.4 MB so far)
      16/12/09 14:34:06 WARN BlockManager: Block rdd_1_39 could not be removed as it was not found on disk or in memory
      16/12/09 14:34:06 WARN BlockManager: Putting block rdd_1_39 failed
  52. (memory layout diagram)
  53. (memory layout diagram)
  54. Repartitioning RDDs:
      > orders.partitions.size
      res3: Int = 40
      > orders.repartition(80)
      > orders.persist(StorageLevel.MEMORY_ONLY)
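Repartitioning helps caching because each cached block is roughly the RDD's total size divided by its partition count, so smaller blocks fit into whatever Storage Fraction remains. A rough sketch using the SizeEstimator figure from slide 46 (the function name is illustrative):

```python
# Each cached block is roughly total RDD size / partition count,
# so doubling the partitions halves the size of each block that
# must fit into the remaining Storage Fraction.
def approx_block_size_bytes(rdd_bytes: int, num_partitions: int) -> float:
    return rdd_bytes / num_partitions

rdd_bytes = 7246792                                # SizeEstimator.estimate(rdd)
print(approx_block_size_bytes(rdd_bytes, 40))      # ~181 KB per block
print(approx_block_size_bytes(rdd_bytes, 80))      # ~91 KB after repartition(80)
```

This is only an average; skewed data can still produce one oversized block that triggers the MemoryStore warning on slide 51.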
  55. OutOfMemoryError (memory layout diagram)
  56. Unpersisting RDDs:
      > rdd.unpersist(true) // blocking: wait until all blocks are removed
      > rdd.unpersist(false) // non-blocking
  57. Execution Fraction • Garbage Collection • GC • Shuffle
  58. Apache Spark memory options:
      • Storage Fraction: --conf spark.memory.storageFraction
      • Memory Fraction: --conf spark.memory.fraction
      • Off-heap: --conf spark.memory.offHeap.size
      • On-heap: --executor-memory or --conf spark.executor.memory
      • Overhead: --conf spark.mesos.executor.memoryOverhead or --conf spark.yarn.executor.memoryOverhead
  59. Sizing an Executor:
      • [A] Storage Fraction = size of cached RDDs
      • [B] Execution Fraction = A
      • [C] On-heap = (A + B) / 0.6 + 300MB // 0.6 leaves the rest for User Memory
      • [D] Off-heap = size of off-heap RDDs
      • [E] Overhead = max(C * 0.1, 384MB)
      • [F] Containers (Executors) per host
      • [G] Memory reserved for the OS
      • [H] Host memory: check that (C + D + E) * F + G < H
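The checklist above can be turned into a worked calculation. Every concrete number below (cache sizes, executor count, host RAM) is an assumed example, and the function is only a sketch of the slide's inequality:

```python
# Worked example of the executor-sizing checklist (all inputs assumed).
def fits_on_host(storage_mb, execution_mb, offheap_mb,
                 executors_per_host, os_reserved_mb, host_ram_mb):
    onheap = (storage_mb + execution_mb) / 0.6 + 300   # [C]; /0.6 leaves User Memory
    overhead = max(onheap * 0.1, 384)                  # [E]
    per_executor = onheap + offheap_mb + overhead      # C + D + E
    total = per_executor * executors_per_host + os_reserved_mb
    return total < host_ram_mb                         # (C + D + E) * F + G < H

# Assumed: 2GB cached RDDs [A], 2GB execution [B], 1GB off-heap [D],
# 2 executors per host [F], 1GB reserved for the OS [G], 32GB host RAM [H].
print(fits_on_host(2048, 2048, 1024, 2, 1024, 32768))   # True: fits
print(fits_on_host(2048, 2048, 1024, 2, 1024, 16384))   # False: does not fit
```

With these inputs each executor needs roughly 8.7GB, so two of them plus the OS reserve fit in 32GB but not in 16GB.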
  60. What about the Driver? Driver Memory and Overhead:
      • --driver-memory or --conf spark.driver.memory
      • --conf spark.mesos.driver.memoryOverhead or --conf spark.yarn.driver.memoryOverhead
      • --conf spark.driver.maxResultSize=1G // Actions (collect, reduce, take) send results to the Driver
  61. Yes, It’s all about Spark Memory.
  62. Enjoy In-memory Computing!
