A Developer’s View Into
Spark’s Memory Model
Wenchen Fan
2017-6-7
About Me
• Software Engineer @ Databricks
• Apache Spark Committer
• One of the most active Spark contributors
About Databricks
TEAM
Started Spark project (now Apache Spark) at UC Berkeley in
2009
MISSION
Make Big Data Simple
PRODUCT
Unified Analytics Platform
[Diagram: a cluster with a Master and Workers 1, 2 and 3; the Driver launches Executors on the workers; each Executor has a Memory Manager and a thread pool]
Memory Model inside Executor
[Diagram: data is kept in an internal binary format; the Cache Manager and the operators allocate memory pages from the Memory Manager]
Memory Model inside Executor
[Recap of the executor memory diagram]
Memory Allocation
• Allocation happens at page granularity.
• Off-heap is supported!
• Pages are not fixed-size, but have a lower and an upper bound.
• No pooling: pages are freed as soon as no data lives on them.
Why var-length pages and no pooling?
• Pros:
• simpler implementation. (no single record ever spans pages)
• memory is freed immediately, so the OS can reuse it for file buffers, etc.
• Cons:
• cannot handle a super-big single record. (very rare in reality)
• fragmentation for records bigger than the page-size lower bound. (the lower bound is several megabytes, so this is also rare)
• allocation overhead. (most malloc algorithms handle it well)
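The allocation policy above can be sketched as a toy model. This is illustrative only, not Spark's actual allocator: the `ToyMemoryManager` name and the scaled-down page bounds are assumptions for the sketch (Spark's lower bound is a few megabytes and its upper bound is close to 16 GB).

```python
# Toy sketch of page-granularity allocation with variable-length pages.
# Assumed, scaled-down bounds; real Spark pages are bounded differently.
PAGE_MIN = 4 << 20    # 4 MB lower bound (assumption for the sketch)
PAGE_MAX = 64 << 20   # 64 MB upper bound (assumption; Spark's is ~16 GB)

class Page:
    def __init__(self, size):
        self.data = bytearray(size)  # stands in for an off-heap allocation

class ToyMemoryManager:
    def __init__(self, total):
        self.total = total       # memory budget this manager may hand out
        self.allocated = 0

    def allocate_page(self, requested):
        # Clamp the request into the [PAGE_MIN, PAGE_MAX] bounds, so a page
        # is variable-length but never smaller/larger than the bounds.
        size = max(PAGE_MIN, min(requested, PAGE_MAX))
        if self.allocated + size > self.total:
            raise MemoryError("executor memory exhausted")
        self.allocated += size
        return Page(size)

    def free_page(self, page):
        # No pooling: once no data lives on the page it is released
        # immediately, so the OS can reuse the memory (e.g. file buffers).
        self.allocated -= len(page.data)
        page.data = None
```

The clamping step is what makes single records never span pages: a record larger than `PAGE_MAX` simply cannot be allocated, which is the "super-big record" limitation listed above.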
Memory Model inside Executor
[Recap of the executor memory diagram]
Java Objects Based Row Format
(123, “data”, “bricks”)
[Diagram: a Row object holding an Array of BoxedInteger(123), String(“data”), String(“bricks”)]
• 5+ objects
• high space overhead
• slow value access
• expensive hashCode()
Data objects? No!
• It is hard to monitor and control memory usage when we have a lot of objects.
• Garbage collection becomes the killer.
• High serialization cost when transferring data across the cluster.
Efficient Binary Format
(123, “data”, “bricks”)
[Diagram, built up over four slides: a null-tracking word (0x0), a fixed-length slot holding 123, a slot holding the offset and length of “data” (32, 4), a slot holding the offset and length of “bricks” (40, 6), then the variable-length bytes “data” and “bricks”]
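A rough Python sketch of this layout, as an analogy to Spark's UnsafeRow: a null-tracking word, one 8-byte slot per field (fixed-width values inline; strings as packed offset and length, relative to the row start), then the variable-length bytes padded to 8-byte words. The helper names `encode_row` / `get_int` / `get_string` are assumptions for the sketch.

```python
import struct

def encode_row(values):
    """Pack a tuple of ints/strings/None into one contiguous byte row."""
    n = len(values)
    null_bitmap = 0
    fixed = []
    var_data = b""
    var_offset = 8 + 8 * n          # variable section starts after bitmap + slots
    for i, v in enumerate(values):
        if v is None:
            null_bitmap |= 1 << i   # mark the field null in the bitmap
            fixed.append(0)
        elif isinstance(v, int):
            fixed.append(v)         # fixed-width value stored inline in its slot
        else:                       # string: slot holds (offset << 32) | length
            b = v.encode()
            fixed.append(((var_offset + len(var_data)) << 32) | len(b))
            var_data += b + b"\x00" * (-len(b) % 8)  # pad to 8-byte words
    return struct.pack(f"<q{n}q", null_bitmap, *fixed) + var_data

def get_int(row, ordinal):
    # Read a fixed-width value straight out of its slot -- no objects created.
    return struct.unpack_from("<q", row, 8 + 8 * ordinal)[0]

def get_string(row, ordinal):
    # Decode (offset, length) from the slot, then slice the raw bytes.
    slot = struct.unpack_from("<q", row, 8 + 8 * ordinal)[0]
    offset, length = slot >> 32, slot & 0xFFFFFFFF
    return row[offset:offset + length].decode()
```

With `(123, "data", "bricks")` this sketch reproduces the numbers on the slide: “data” at offset 32 with length 4, “bricks” at offset 40 with length 6 (the word padding after “data” is what pushes “bricks” from 36 to 40).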
Memory Model inside Executor
[Recap of the executor memory diagram]
Operate On Binary
[Diagram: rows read from JSON files are kept in the binary format; operators such as a filter (123 > …) and substring (turning “data” into “da”) work directly on the binary representation]
How to process binary
data more efficiently?
Understanding CPU Cache
Memory is becoming slower and slower relative to the CPU.
Understanding CPU Cache
Pre-fetch frequently accessed data into the CPU cache.
The 2 most important algorithms in big data are ...
Sort and Hash!
Naive Sort
[Diagram, built up over four slides: an array of pointers to the records “bbb”, “aaa”, “ccc”; sorting swaps the pointers while the records stay in place]
Naive Sort
Each comparison needs to access 2 different memory regions, which makes it hard for the CPU cache to pre-fetch data: poor cache locality!
Cache-aware Sort
[Diagram, built up over three slides: each record pointer is stored together with a key-prefix in one compact array; comparisons mostly use the key-prefixes and only follow the pointers on ties]
Cache-aware Sort
Most of the time we just scan the key-prefixes linearly: good cache locality!
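The idea can be sketched in a few lines of Python. This is illustrative, not Spark's actual sorter (which stores 8-byte prefixes next to 8-byte record pointers); `key_prefix` and `prefix_sort` are assumed names.

```python
from functools import cmp_to_key

def key_prefix(s, n=4):
    # Pack the first n bytes of the key as a big-endian integer, so that
    # prefix order agrees with full-key order.
    return int.from_bytes(s.encode()[:n].ljust(n, b"\x00"), "big")

def prefix_sort(records):
    # Each sort entry is (key-prefix, pointer); the compact entries array
    # is what the sort mostly scans, giving good cache locality.
    entries = [(key_prefix(r), i) for i, r in enumerate(records)]

    def cmp(a, b):
        if a[0] != b[0]:                       # decided by the prefix alone
            return -1 if a[0] < b[0] else 1
        ra, rb = records[a[1]], records[b[1]]  # rare: follow the pointers
        return (ra > rb) - (ra < rb)

    entries.sort(key=cmp_to_key(cmp))
    return [records[i] for _, i in entries]
```

Only when two prefixes tie does the comparator dereference into the records region, which is exactly the “compare 2 records with the first several bytes” observation from the editor's notes.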
Naive Hash Map
[Diagram, built up over six slides: a pointer array plus a separate region of (key, value) entries (key1→v1, key2→v2, key3→v3); looking up key3 computes hash(key) % size, dereferences a pointer, compares keys, and on collision continues with quadratic probing and further pointer dereferences]
Naive Hash Map
Each lookup needs many pointer dereferences and key comparisons when hash collisions happen, and jumps between 2 memory regions: bad cache locality!
Cache-aware Hash Map
[Diagram, built up over five slides: each pointer is stored together with the key's full hash; a lookup computes hash(key) % size and compares the stored full hash (also during quadratic probing), dereferencing the pointer to compare the actual keys only when the full hashes match]
Cache-aware Hash Map
Each lookup usually needs only one pointer dereference and one key comparison (full-hash collisions are rare), and accesses data mostly in a single memory region: better cache locality!
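The scheme can be sketched in Python. This is illustrative only (Spark's real structure is the off-heap BytesToBytesMap); `CacheAwareMap` is an assumed name, and growth/resizing is omitted.

```python
class CacheAwareMap:
    # Open-addressing table whose slots hold (full_hash, entry_index), so
    # probing compares full hashes within one compact array and only
    # dereferences into the entries region on a full-hash match.
    def __init__(self, capacity=16):   # power-of-two capacity, no resizing
        self.table = [None] * capacity
        self.entries = []              # the (key, value) records

    def _probe(self, full_hash, key):
        size = len(self.table)
        i, step = full_hash % size, 1
        while True:
            slot = self.table[i]
            if slot is None:
                return i, None                 # empty slot: key is absent
            if slot[0] == full_hash:           # full hashes match: rare
                k, _ = self.entries[slot[1]]   # now dereference the entry
                if k == key:
                    return i, slot[1]
            i = (i + step) % size              # triangular (quadratic) probing
            step += 1

    def put(self, key, value):
        h = hash(key) & 0x7FFFFFFF
        i, idx = self._probe(h, key)
        if idx is None:
            self.table[i] = (h, len(self.entries))
            self.entries.append((key, value))
        else:
            self.entries[idx] = (key, value)

    def get(self, key):
        _, idx = self._probe(hash(key) & 0x7FFFFFFF, key)
        return None if idx is None else self.entries[idx][1]
```

On a slot collision the probe loop only reads the stored full hash, staying inside the table array; the entries region is touched only when the full hashes agree, which almost always means the keys are equal.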
Recap: Cache-aware data structures
How to improve cache locality ...
• store the key-prefix with the pointer.
• store the key's full hash with the pointer.
Store extra information to keep memory accesses within a single region.
Memory Model inside Executor
[Recap of the executor memory diagram]
Future Work
• SPARK-19489: Stable serialization format for external &
native code integration.
• SPARK-15689: Data source API v2.
• SPARK-15687: Columnar execution engine.
Try Apache Spark in Databricks!
UNIFIED ANALYTICS PLATFORM
• Collaborative cloud environment
• Free version (community edition)
DATABRICKS RUNTIME 3.0
• Apache Spark - optimized for the cloud
• Caching and optimization layer - DBIO
• Enterprise security - DBES
Try for free today.
databricks.com
Thank You

A Developer’s View into Spark's Memory Model with Wenchen Fan

Editor's Notes

  • #4 I’m not lying!
  • #6 First of all, Spark is a distributed engine, so let's look at memory usage at the cluster level. A typical Spark cluster has one master node and several worker nodes. The entry point of a Spark application is the driver process, which can run on any node. When the driver launches, it asks the master for executors, and the master asks the workers to launch executors for that driver. Each executor is a JVM process with a memory manager and a thread pool to run tasks. Distributing resources across Spark applications is out of scope; this talk focuses on memory management inside executors. Let's zoom in!
  • #7 Keep data as binary and operate on binary. Execution and storage are the 2 major memory consumers in Spark. The memory manager only manages part of the total available memory.
  • #10 16 GB upper bound
  • #11 Why use an internal format? Why not just use objects?
  • #13 Some more reasons why Spark doesn't use data objects.
  • #18 How to operate on binary data directly.
  • #19 The whole data flow looks efficient, but in Spark we keep asking ourselves: can it run faster?
  • #22 Shall we re-implement all operators in Spark to be CPU-cache friendly? That's a lot of work! Instead, we focus on just 2 algorithms, because ...
  • #23 many fancy algorithms/systems are fundamentally based on sort and hash, improving these 2 algorithms can benefit a lot of operators in Spark.
  • #24 Most sort algorithms work by picking 2 records, comparing and swapping them, again and again.
  • #27 Then repeat.
  • #29 The idea behind this algorithm is that most of the time we can compare 2 records using just their first several bytes.
  • #36 Collisions are common.
  • #39 For a hash table, collisions happen frequently when there are many entries.
  • #40 The idea behind this algorithm: although collisions are common in a hash table, full-hash-value collisions are very rare, so we can usually identify a key by its full hash value.
  • #47 The data flow is efficient, but its starting point, the data source, may not be. The data source is a public API: RDD[Row].