Memory in Apache Spark
- Memory design to keep your applications running -
/ @laclefyoshi / ysaeki@r.recruit.co.jp
• 2011/04
• 2015/09
Books
• Druid (KDP, 2015)
• RDB & NoSQL (2016; in charge of the HBase part)
• ESP8266 Wi-Fi IoT (KDP, 2016)
Talks
• WebDB Forum 2014
• Spark Streaming (Spark Meetup December 2015)
• Kafka and AWS Kinesis (Apache Kafka Meetup Japan #1; 2016)
• FutureOfData; 2016
• Queryable State for Kafka Streams (Apache Kafka Meetup Japan #2; 2016)
Why Spark?
In-memory Computing
Memory prices have fallen steadily (http://www.jcmit.com/memoryprice.htm), making it practical to replace disk-based computing with in-memory computing.
In-memory Computing
2003 ~ 2011: Memcached, Hazelcast, HANA, Exadata, Apache Ignite, Apache Spark
Apache Spark
Error messages often seen when Spark runs short of memory:
• Lost executor X on xxxx: remote Akka client disassociated
• Container marked as failed: container_xxxx on host: xxxx. Exit status: 1
• Container killed by YARN for exceeding memory limits
• shutting down JVM since 'akka.jvm-exit-on-fatal-error' is enabled for ActorSystem[Remote]
How come?
Apache Spark
[Diagram: a Driver coordinating multiple Executors]
Apache Spark
[Diagram: each Executor uses both Memory and Disk]
$ spark-submit \
  --MEMORY_OPTIONS1 \
  --MEMORY_OPTIONS2 \
  --MEMORY_OPTIONS3 \
  --conf ADDITIONAL_OPTIONS1 \
  --conf ADDITIONAL_OPTIONS2 \
  --class jp.co.recruit.app.Main \
  spark-project-1.0-SNAPSHOT.jar
Apache Spark : Heap
On-heap (the JVM heap):
--executor-memory XXG or
--conf spark.executor.memory=XXG
Off-heap:
--conf spark.memory.offHeap.size=XXX // in bytes
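Put together in one submit command, the two regions might be sized like this sketch (the 4G heap and 1 GiB off-heap values are arbitrary examples, not recommendations):

```shell
# Sketch: sizing on-heap vs. off-heap at submit time.
# 4G and 1073741824 (1 GiB, in bytes) are example values.
spark-submit \
  --executor-memory 4G \
  --conf spark.memory.offHeap.enabled=true \
  --conf spark.memory.offHeap.size=1073741824 \
  --class jp.co.recruit.app.Main \
  spark-project-1.0-SNAPSHOT.jar
```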
Apache Spark : Executor
[Diagram: each Executor owns an on-heap region, an off-heap region, and local disk; Executors share the machine with the OS and other applications]
Apache Spark : Container
[Diagram: on Mesos / YARN, each Executor runs inside a container, which adds an Overhead region on top of on-heap and off-heap]
Apache Spark : Overhead
On-heap:
--executor-memory XXG or
--conf spark.executor.memory=XXG
Overhead:
--conf spark.mesos.executor.memoryOverhead
--conf spark.yarn.executor.memoryOverhead
default = max(executor memory * 0.1, 384MB)
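The default can be traced with shell arithmetic; the 8192 MB heap below is a hypothetical example, not a Spark default:

```shell
# Default container overhead: max(10% of executor memory, 384 MB).
# EXECUTOR_MEM_MB=8192 is an example value.
EXECUTOR_MEM_MB=8192

TENTH=$((EXECUTOR_MEM_MB / 10))   # 10% of the executor heap
FLOOR=384                         # minimum overhead, in MB

if [ "$TENTH" -gt "$FLOOR" ]; then
  OVERHEAD_MB=$TENTH
else
  OVERHEAD_MB=$FLOOR
fi

echo "memoryOverhead = ${OVERHEAD_MB} MB"   # 819 MB for an 8 GB heap
```

For small executors (below about 3.8 GB) the 384 MB floor wins instead.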
Apache Spark : Overhead
The Overhead region covers what the Java VM needs outside the heap:
• VM internals, thread stacks, interned strings, and other native allocations
Apache Spark : Garbage Collection
Garbage collection applies only to the on-heap region; off-heap memory and disk are outside the GC's reach.
Apache Spark : Tachyon
Tachyon (now Alluxio) provides a Block Store in off-heap memory, so cached blocks escape GC pressure.
Apache Spark : Project Tungsten
Project Tungsten manages memory explicitly in the off-heap region, avoiding JVM object overhead and garbage collection.
Apache Spark : Reserved Memory
300MB of the on-heap region is reserved by Spark itself. Don't touch!
Apache Spark : User Memory
--conf spark.memory.fraction=0.6
Memory Fraction = (on-heap - 300MB) * 0.6
User Memory = the remaining part, used for:
• user data structures created in transformations
• Spark's internal metadata
• headroom against OOM caused by unusually large records
Apache Spark : Execution & Storage
--conf spark.memory.storageFraction=0.5
The Memory Fraction is divided into a Storage Fraction and an Execution Fraction.
Apache Spark : Execution & Storage
• Storage Fraction: cached RDDs, Broadcast variables, Accumulators
• Execution Fraction: working memory for Shuffle, Join, Sort, Aggregate
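Walking the defaults (spark.memory.fraction=0.6, spark.memory.storageFraction=0.5) through with integer shell arithmetic; the 8192 MB heap is an assumed example:

```shell
# Sketch of how a hypothetical 8 GB on-heap executor is carved up
# by the default fraction (0.6) and storageFraction (0.5). Values in MB.
HEAP_MB=8192
RESERVED_MB=300                        # fixed Reserved Memory

USABLE_MB=$((HEAP_MB - RESERVED_MB))   # 7892 MB usable
SPARK_MB=$((USABLE_MB * 6 / 10))       # Memory Fraction (0.6)
USER_MB=$((USABLE_MB - SPARK_MB))      # User Memory (the remaining 0.4)
STORAGE_MB=$((SPARK_MB / 2))           # Storage Fraction (0.5)
EXEC_MB=$((SPARK_MB - STORAGE_MB))     # Execution Fraction (the rest)

echo "spark=${SPARK_MB} user=${USER_MB} storage=${STORAGE_MB} exec=${EXEC_MB}"
```

So with these defaults, only a little over half of --executor-memory is actually available for caching and shuffle combined.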
Apache Spark : Unified Memory
Since Spark 1.6, Storage and Execution form a unified region: each side can borrow unused memory from the other, and cached blocks may be evicted when Execution needs the space.
Examples
User Memory
[Diagram: user objects accumulate until User Memory overflows]
Storage Fraction
[Diagram: cached blocks accumulate until the Storage Fraction overflows]
[Diagram: with no room left in any fraction, the next allocation fails]
OutOfMemoryError
How Spark can help us
not to stop our applications
Apache Spark : Spill
[Diagram: when the Execution Fraction is full, Spark spills its contents to disk instead of failing]
Project Tungsten
[Diagram: Tungsten-managed off-heap pages spill to disk the same way]
Apache Spark : Garbage Collection
[Diagram: GC works on the on-heap region only]
JVM : Garbage Collection
-XX:+UseConcMarkSweepGC
// use the CMS (Concurrent Mark Sweep) collector for the old generation
-XX:+UseParNewGC
// use a parallel collector for the young generation
-XX:+CMSParallelRemarkEnabled
// parallelize the CMS Remark phase
-XX:+DisableExplicitGC
// ignore explicit GC requests (System.gc())
JVM : Garbage Collection
-XX:+HeapDumpOnOutOfMemoryError
// write a heap dump when an OutOfMemoryError occurs
-XX:+PrintGCDetails
// log details of each GC event
-XX:+PrintGCDateStamps
// prefix each GC log entry with a date stamp
-XX:+UseGCLogFileRotation
// rotate the GC log files
Passing JVM options via spark-submit:
$ spark-submit \
  --executor-memory 8G \
  --num-executors 20 \
  --executor-cores 2 \
  --conf "spark.executor.extraJavaOptions=..." \
  --conf spark.memory.offHeap.enabled=true \
  --conf spark.memory.offHeap.size=1073741824 \
  --class jp.co.recruit.app.Main \
  spark-project-1.0-SNAPSHOT.jar
How we can help ourselves
not to stop our applications
RDD
Cached RDDs live in the Storage Fraction:
rdd.cache()
rdd.persist()
rdd.persist(StorageLevel.MEMORY_ONLY)
// the three calls are equivalent; MEMORY_ONLY is the default level
RDD
Available storage levels:
MEMORY_ONLY
MEMORY_ONLY_2
MEMORY_ONLY_SER
MEMORY_AND_DISK
MEMORY_AND_DISK_2
MEMORY_AND_DISK_SER
DISK_ONLY
OFF_HEAP
Estimating RDD size (1)
• SizeEstimator
$ spark-shell
> import org.apache.spark.util.SizeEstimator
> SizeEstimator.estimate("1234")
res0: Long = 48
> val rdd = sc.makeRDD((1 to 100000).map(e => e.toString).toSeq)
> SizeEstimator.estimate(rdd)
res2: Long = 7246792
Estimating RDD size (2)
• the Storage panel of the Web UI shows the actual in-memory size of each persisted RDD
> rdd.persist(StorageLevel.MEMORY_ONLY)
RDD
Choose a storage level per RDD:
> val orders = sc.textFile("lineorder.csv")
orders: org.apache.spark.rdd.RDD[String] = ...
> val result = orders.map(...)
result: org.apache.spark.rdd.RDD[String] = ...
> orders.persist(StorageLevel.MEMORY_ONLY)
> result.persist(StorageLevel.MEMORY_AND_DISK)
RDD
> orders.persist(StorageLevel.MEMORY_ONLY)
If the Storage Fraction is too small, caching fails with warnings:
16/12/09 14:34:06 WARN MemoryStore: Not enough space to cache rdd_1_39 in memory! (computed 44.4 MB so far)
16/12/09 14:34:06 WARN BlockManager: Block rdd_1_39 could not be removed as it was not found on disk or in memory
16/12/09 14:34:06 WARN BlockManager: Putting block rdd_1_39 failed
When a block does not fit in the Storage Fraction:
• enlarge the Storage Fraction (bigger executors, or raise spark.memory.fraction / spark.memory.storageFraction)
• choose a storage level that serializes or falls back to disk (MEMORY_ONLY_SER, MEMORY_AND_DISK)
• repartition the RDD so each partition is small enough to cache:
> orders.partitions.size
res3: Int = 40
> val orders2 = orders.repartition(80) // repartition returns a new RDD
> orders2.persist(StorageLevel.MEMORY_ONLY)
[Diagram: a single partition larger than the free Storage Fraction]
OutOfMemoryError
RDD
Release cached RDDs as soon as they are no longer needed:
> rdd.unpersist(true) // blocking: wait until every block is removed
> rdd.unpersist(false) // non-blocking: remove blocks asynchronously
Execution Fraction
When the Execution Fraction runs short:
• Garbage Collection runs more often, and long GC pauses slow the job
• Shuffle data spills to disk
Apache Spark : options by memory region
On-heap: --executor-memory or --conf spark.executor.memory
• Reserved: 300MB
• Memory Fraction: --conf spark.memory.fraction
  • Storage Fraction: --conf spark.memory.storageFraction
  • Execution Fraction: the rest of the Memory Fraction
• User Memory: the rest of the on-heap
Off-heap: --conf spark.memory.offHeap.size
Overhead: --conf spark.mesos.executor.memoryOverhead or --conf spark.yarn.executor.memoryOverhead
Sizing an Executor
• [A] Storage Fraction = total size of cached RDDs
• [B] Execution Fraction = roughly the same as A
• [C] On-heap = (A + B) / 0.6 + 300MB // 0.6 leaves the rest for User Memory
• [D] Off-heap = total size of RDDs kept off-heap
• [E] Overhead = max(C * 0.1, 384MB)
• [F] number of containers (Executors) per machine
• [G] memory for the OS and other applications
• [H] physical memory of the machine
(C + D + E) * F + G < H
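A worked instance of the checklist, sketched in shell with integer MB arithmetic; every input (2 GB of cached RDDs, 4 executors per machine, a 64 GB machine) is a hypothetical example:

```shell
# Worked example of the executor sizing checklist; all inputs are hypothetical.
A=2048                           # [A] cached RDD size (Storage Fraction), MB
B=2048                           # [B] Execution Fraction, roughly equal to A
C=$(( (A + B) * 10 / 6 + 300 ))  # [C] on-heap = (A+B)/0.6 + 300MB reserved
D=1024                           # [D] off-heap RDD size, MB

TENTH=$((C / 10))                # [E] overhead = max(C * 0.1, 384MB)
if [ "$TENTH" -gt 384 ]; then E=$TENTH; else E=384; fi

F=4                              # [F] executors (containers) per machine
G=4096                           # [G] memory for the OS and other apps, MB
H=65536                          # [H] physical memory of the machine, MB

TOTAL=$(( (C + D + E) * F + G ))
if [ "$TOTAL" -lt "$H" ]; then
  echo "fits: ${TOTAL} MB < ${H} MB"
else
  echo "does not fit: ${TOTAL} MB >= ${H} MB"
fi
```

Here each executor needs about 7.1 GB on-heap plus 1 GB off-heap and 0.7 GB overhead, so four of them plus the OS comfortably fit in 64 GB.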
Sizing the Driver?
The Driver has the same layout: memory plus overhead.
--conf spark.mesos.driver.memoryOverhead
--conf spark.yarn.driver.memoryOverhead
--driver-memory or --conf spark.driver.memory
--conf spark.driver.maxResultSize=1G
Actions (collect, reduce, take, ...) send their results back to the Driver; spark.driver.maxResultSize caps the total result size so a huge result does not take the Driver down!
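Combining the driver-side options into one submit command might look like this sketch (the 4G / 1G / 512 values are arbitrary examples):

```shell
# Sketch: driver-side memory options; all values are example settings.
spark-submit \
  --driver-memory 4G \
  --conf spark.driver.maxResultSize=1G \
  --conf spark.yarn.driver.memoryOverhead=512 \
  --class jp.co.recruit.app.Main \
  spark-project-1.0-SNAPSHOT.jar
```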
Yes, It’s all about Spark Memory.
Enjoy In-memory Computing!

Memory in Apache Spark - Memory design to keep your applications running -