There are a few reasons why @volatile is not used here for the lock-free counters:1. @volatile only guarantees visibility of writes across threads, it does not prevent concurrent modifications to the field which could lead to race conditions. Using AtomicLong and compareAndSet provides true thread-safety.2. @volatile would still require locking (synchronized) to safely read and update the field atomically. Using AtomicLong allows lock-free updates via compareAndSet. 3. Storing the two counters in a single AtomicLong field allows them to be read and updated together atomically in a single machine operation. With @volatile there would need to be two separate volatile reads/writes which are not atomic
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
Similar to There are a few reasons why @volatile is not used here for the lock-free counters:1. @volatile only guarantees visibility of writes across threads, it does not prevent concurrent modifications to the field which could lead to race conditions. Using AtomicLong and compareAndSet provides true thread-safety.2. @volatile would still require locking (synchronized) to safely read and update the field atomically. Using AtomicLong allows lock-free updates via compareAndSet. 3. Storing the two counters in a single AtomicLong field allows them to be read and updated together atomically in a single machine operation. With @volatile there would need to be two separate volatile reads/writes which are not atomic
Zurich, Berlin, Vienna Spark and Big Data Meetup Nov 02 2015Chris Fregly
Similar to There are a few reasons why @volatile is not used here for the lock-free counters:1. @volatile only guarantees visibility of writes across threads, it does not prevent concurrent modifications to the field which could lead to race conditions. Using AtomicLong and compareAndSet provides true thread-safety.2. @volatile would still require locking (synchronized) to safely read and update the field atomically. Using AtomicLong allows lock-free updates via compareAndSet. 3. Storing the two counters in a single AtomicLong field allows them to be read and updated together atomically in a single machine operation. With @volatile there would need to be two separate volatile reads/writes which are not atomic (20)
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
There are a few reasons why @volatile is not used here for the lock-free counters:1. @volatile only guarantees visibility of writes across threads, it does not prevent concurrent modifications to the field which could lead to race conditions. Using AtomicLong and compareAndSet provides true thread-safety.2. @volatile would still require locking (synchronized) to safely read and update the field atomically. Using AtomicLong allows lock-free updates via compareAndSet. 3. Storing the two counters in a single AtomicLong field allows them to be read and updated together atomically in a single machine operation. With @volatile there would need to be two separate volatile reads/writes which are not atomic
1. Power of data. Simplicity of design. Speed of innovation.
IBM Spark
spark.tc
Project Tungsten
Advanced Apache Spark Meetup
Chris Fregly
Principal Data Solutions Engineer
We’re Hiring - Only Nice People!
Nov 12, 2015
2. Power of data. Simplicity of design. Speed of innovation.
IBM Spark
spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Who Am I?
2
Streaming Data Engineer
Open Source Committer
Data Solutions Engineer
Apache Contributor
Principal Data Solutions Engineer
IBM Technology Center
Founder
Advanced Apache Meetup
Author
Advanced .
Due 2016
My Ma’s First Time in California
3. Power of data. Simplicity of design. Speed of innovation.
IBM Spark
spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Random Slide: More Ma “First Time” Pics
3
In California
Using Chopsticks
Using “New” iPhone
4. Power of data. Simplicity of design. Speed of innovation.
IBM Spark
spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Upcoming Meetups and Conferences
London Spark Meetup (Oct 12th)
Scotland Data Science Meetup (Oct 13th)
Dublin Spark Meetup (Oct 15th)
Barcelona Spark Meetup (Oct 20th)
Madrid Big Data Meetup (Oct 22nd)
Paris Spark Meetup (Oct 26th)
Amsterdam Spark Summit (Oct 27th)
Brussels Spark Meetup (Oct 30th)
Zurich Big Data Meetup (Nov 2nd)
Geneva Spark Meetup (Nov 5th)
San Francisco Datapalooza.io (Nov 10th)
4
San Francisco Advanced Spark (Nov 12th)
Oslo Big Data Hadoop Meetup (Nov 18th)
Helsinki Spark Meetup (Nov 20th)
Stockholm Spark Meetup (Nov 23rd)
Copenhagen Spark Meetup (Nov 26th)
Budapest Spark Meetup (Nov 27th)
Singapore Strata Conference (Dec 1st)
San Francisco Advanced Spark (Dec 8th)
Mountain View Advanced Spark (Dec 10th)
Toronto Spark Meetup (Dec 14th)
Washington DC Spark Meetup (Jan 2016)
5. Power of data. Simplicity of design. Speed of innovation.
IBM Spark
spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Advanced Apache Spark Meetup
Meetup Metrics
~1600 Members in just 4 mos!
4th Most Active Spark Meetup!!
Meetup Goals
Dig deep into codebase of Spark and related projects
Study integrations of Cassandra, ElasticSearch,
Tachyon, S3, BlinkDB, Mesos, YARN, Kafka, R
Surface and share patterns and idioms of these
well-designed, distributed, big data components
THANKS TO ALL OF YOU!!
6. Power of data. Simplicity of design. Speed of innovation.
IBM Spark
spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
All Slides and Code Are Available!
slideshare.net/cfregly
github.com/fluxcapacitor
hub.docker.com/r/fluxcapacitor
6
7. Power of data. Simplicity of design. Speed of innovation.
IBM Spark
spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Themes of this Talk
Filter
Off-Heap
Parallelize
Approximate
Find Similarity
Minimize Seeks
Maximize Scans
Customize for Workload
Tune Performance At Every Layer
7
Be Nice, Collaborate!
Like a Mom!!
8. Power of data. Simplicity of design. Speed of innovation.
IBM Spark
spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Outline
① Mechanical Sympathy
② Recap of 100TB GraySort Challenge
③ Project Tungsten Deep Dive
8
9. Power of data. Simplicity of design. Speed of innovation.
IBM Spark
spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Mechanical Sympathy
Hardware and software working together in harmony.
- Martin Thompson
http://mechanical-sympathy.blogspot.com
Whatever your data structure, my array will beat it.
- Scott Meyers
Every C++ Book, basically
9
Hair
Sympathy
- Bruce Jenner
10. Power of data. Simplicity of design. Speed of innovation.
IBM Spark
spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Spark and Mechanical Sympathy
10
Project
Tungsten
(Spark 1.4-1.6+)
GraySort
Challenge
(Spark 1.1-1.2)
Minimize Memory and GC
Maximize CPU Cache Locality
Saturate Network I/O
Saturate Disk I/O
11. Power of data. Simplicity of design. Speed of innovation.
IBM Spark
spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
AlphaSort Technique: Sort 100 Bytes Recs
11
Value
Ptr
Key
Dereference Not Required!
AlphaSort
List [(Key, Pointer)]
Key is directly available for comparison
Naïve
List [Pointer]
Must dereference key for comparison
Ptr
Dereference for Key Comparison
Key
12. Power of data. Simplicity of design. Speed of innovation.
IBM Spark
spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
CPU Cache Line and Memory Sympathy
Key (10 bytes)+Pointer (*4 bytes)*Compressed OOPs
= 14 bytes
12
Key
Ptr
Not CPU Cache-line Friendly!
Ptr
Key-Prefix
2x CPU Cache-line Friendly!
Key-Prefix (4 bytes) + Pointer (4 bytes)
= 8 bytes
Key (10 bytes)+Pad (2 bytes)+Pointer (4 bytes)
= 16 bytes
Key
Ptr
Pad
/Pad
CPU Cache-line Friendly!
13. Power of data. Simplicity of design. Speed of innovation.
IBM Spark
spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Performance Comparison
13
14. Power of data. Simplicity of design. Speed of innovation.
IBM Spark
spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Similar Trick: Direct Cache Access (DCA)
Pull out packet header along side pointer to payload
14
15. Power of data. Simplicity of design. Speed of innovation.
IBM Spark
spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
CPU Cache Lines: Sequential vs. Random
15
16. Power of data. Simplicity of design. Speed of innovation.
IBM Spark
spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
CPU Cache Naïve Matrix Multiplication
// Dot product of each row & column vector
for (i <- 0 until numRowA)
for (j <- 0 until numColsB)
for (k <- 0 until numColsA)
res[ i ][ j ] += matA[ i ][ k ] * matB[ k ][ j ];
16
Bad: Row-wise traversal,
not using CPU cache line,
ineffective pre-fetching
17. Power of data. Simplicity of design. Speed of innovation.
IBM Spark
spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
CPU Cache Friendly Matrix Multiplication
// Transpose B
for (i <- 0 until numRowsB)
for (j <- 0 until numColsB)
matBT[ i ][ j ] = matB[ j ][ i ];
// Modify dot product calculation for B Transpose
for (i <- 0 until numRowsA)
for (j <- 0 until numColsB)
for (k <- 0 until numColsA)
res[ i ][ j ] += matA[ i ][ k ] * matBT[ j ][ k ];
17
Good: Full CPU cache line,
effective prefetching
OLD: res[ i ][ j ] += matA[ i ][ k ] * matB [ k ] [ j ];
Reference j
before k
18. Power of data. Simplicity of design. Speed of innovation.
IBM Spark
spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Instrumenting and Monitoring CPU
Use Linux perf command!
18
http://www.brendangregg.com/blog/2015-11-06/java-mixed-mode-flame-graphs.html
19. Power of data. Simplicity of design. Speed of innovation.
IBM Spark
spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Results of Matrix Multiply Comparison
Naïve Matrix Multiply
Cache-Friendly Matrix Multiply
~72x
~8x
~3x
~3x
~2x
~7x
~10x
perf stat -XX:+PreserveFramePointer -XX:-Inline
–event L1-dcache-load-misses,L1-dcache-prefetch-misses,LLC-load-misses,
LLC-prefetch-misses,cache-misses,stalled-cycles-frontend
~10x
55 hp
550 hp
20. Power of data. Simplicity of design. Speed of innovation.
IBM Spark
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
spark.tc
Demo!
Compare CPU Naïve & Cache-Friendly Matrix Multiplication
20
21. Power of data. Simplicity of design. Speed of innovation.
IBM Spark
spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
CPU Cache Naïve Tuple Counters
object CacheNaiveTupleIncrement {
var tuple = (0,0)
…
def increment(leftIncrement: Int, rightIncrement: Int) : (Int, Int) = {
this.synchronized {
tuple = (tuple._1 + leftIncrement, tuple._2 + rightIncrement)
tuple
}
}
}
21
22. Power of data. Simplicity of design. Speed of innovation.
IBM Spark
spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
CPU Cache Naïve Case Class Counters
case class MyTuple(left: Int, right: Int)
object CacheNaiveCaseClassCounters {
var tuple = new MyTuple(0,0)
…
def increment(leftIncrement: Int, rightIncrement: Int) : MyTuple = {
this.synchronized {
tuple = new MyTuple(tuple.left + leftIncrement,
tuple.right + rightIncrement)
tuple
}
}
}
22
23. Power of data. Simplicity of design. Speed of innovation.
IBM Spark
spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
CPU Cache Friendly Lock-Free Counters
object CacheFriendlyLockFreeCounters {
// a single Long (8-bytes) will maintain 2 separate Ints (4-bytes each)
val tuple = new AtomicLong()
…
def increment(leftIncrement: Int, rightIncrement: Int) : Long = {
var originalLong = 0L
var updatedLong = 0L
do {
originalLong = tuple.get()
val originalRightInt = originalLong.toInt // cast originalLong to Int to get right counter
val originalLeftInt = (originalLong >>> 32).toInt // shift right to get left counter
val updatedRightInt = originalRightInt + rightIncrement // increment right counter
val updatedLeftInt = originalLeftInt + leftIncrement // increment left counter
updatedLong = updatedLeftInt // update the new long with the left counter
updatedLong = updatedLong << 32 // shift the new long left
updatedLong += updatedRightInt // update the new long with the right counter
} while (tuple.compareAndSet(originalLong, updatedLong) == false)
updatedLong
}
23
Quiz: Why not @volatile?
24. Power of data. Simplicity of design. Speed of innovation.
IBM Spark
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
spark.tc
Demo!
Compare CPU Naïve & Cache-Friendly Tuple Counter Sync
24
25. Power of data. Simplicity of design. Speed of innovation.
IBM Spark
spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Results of Counters Comparison
Naïve Tuple Counters
Naïve Case Class Counters
Cache Friendly Lock-Free Counters
~2x
~1.5x
~3.5x
~2x
~2x
~1.5x
~1.5x
~1.5x
26. Power of data. Simplicity of design. Speed of innovation.
IBM Spark
spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Profiling Visualizations: Flame Graphs
With Java Stack Traces!!
26
Example: Spark Word Count
Java Stack Traces
are Good!
Plateaus
are Bad!!
27. Power of data. Simplicity of design. Speed of innovation.
IBM Spark
spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Outline
① Mechanical Sympathy
② Recap of 100TB GraySort Challenge
③ Project Tungsten Deep Dive
27
28. Power of data. Simplicity of design. Speed of innovation.
IBM Spark
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
spark.tc
100TB GraySort Challenge
Sort 100TB of 100-Byte Records with 10-byte Keys
Custom Data Structs & Algos for Sort & Shuffle
Saturate Network and Disk I/O Controllers
28
29. Power of data. Simplicity of design. Speed of innovation.
IBM Spark
spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
100TB GraySort Challenge Results
29
Performance Goals
Saturate Network I/O
Saturate Disk I/O
(2013) (2014)
30. Power of data. Simplicity of design. Speed of innovation.
IBM Spark
spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Winning Hardware Configuration
Compute
206 Workers, 1 Master (AWS EC2 i2.8xlarge)
32 Intel Xeon CPU E5-2670 @ 2.5 Ghz
244 GB RAM, 8 x 800GB SSD, RAID 0 striping, ext4
3 GBps mixed read/write disk I/O per node
Network
AWS Placement Groups, VPC, Enhanced Networking
Single Root I/O Virtualization (SR-IOV)
10 Gbps, low latency, low jitter (iperf: ~9.5 Gbps)
30
31. Power of data. Simplicity of design. Speed of innovation.
IBM Spark
spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Winning Software Configuration
Spark 1.2, OpenJDK 1.7
Disable caching, compression, spec execution, shuffle spill
Force NODE_LOCAL task scheduling for optimal data locality
HDFS 2.4.1 short-circuit local reads, 2x replication
Empirically chose between 4-6 partitions per cpu
206 nodes * 32 cores = 6592 cores
6592 cores * 4 = 26,368 partitions
6592 cores * 6 = 39,552 partitions
6592 cores * 4.25 = 28,000 partitions (empirical best)
Range partitioning takes advantage of sequential keyspace
Required ~10s of sampling 79 keys from in each partition
31
32. Power of data. Simplicity of design. Speed of innovation.
IBM Spark
spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
New Sort Shuffle Manager for Spark 1.2
Original “hash-based”
New “sort-based”
① Use less OS resources (socket buffers, file descriptors)
② TimSort partitions in-memory
③ MergeSort partitions on-disk into a single master file
④ Serve partitions from master file: seek once, sequential scan
32
33. Power of data. Simplicity of design. Speed of innovation.
IBM Spark
spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Asynchronous Network Module
Switch to asyncronous Netty vs. synchronous java.nio
Switch to zero-copy epoll
Use only kernel-space between disk and network controllers
Custom memory management
spark.shuffle.blockTransferService=netty
Spark-Netty Performance Tuning
spark.shuffle.io.preferDirectBuffers=true
Reuse off-heap buffers
spark.shuffle.io.numConnectionsPerPeer=8 (for example)
Increase to saturate hosts with multiple disks (8x800 SSD)
33
Details in
SPARK-2468
34. Power of data. Simplicity of design. Speed of innovation.
IBM Spark
spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Custom Algorithms and Data Structures
Optimized for sort & shuffle workloads
o.a.s.util.collection.TimSort[K,V]
Based on JDK 1.7 TimSort
Performs best with partially-sorted runs
Optimized for elements of (K,V) pairs
Sorts impl of SortDataFormat (ie. KVArraySortDataFormat)
o.a.s.util.collection.AppendOnlyMap
Open addressing hash, quadratic probing
Array of [(K, V), (K, V)]
Good memory locality
Keys never removed, values only append
34
35. Power of data. Simplicity of design. Speed of innovation.
IBM Spark
spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Daytona GraySort Challenge Goal Success
1.1 Gbps/node network I/O (Reducers)
Theoretical max = 1.25 Gbps for 10 GB ethernet
3 GBps/node disk I/O (Mappers)
35
Aggregate
Cluster
Network I/O!
220 Gbps / 206 nodes ~= 1.1 Gbps per node
36. Power of data. Simplicity of design. Speed of innovation.
IBM Spark
spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Shuffle Performance Tuning Tips
Hash Shuffle Manager (Deprecated)
spark.shuffle.consolidateFiles (Mapper)
o.a.s.shuffle.FileShuffleBlockResolver
Intermediate Files
Increase spark.shuffle.file.buffer (Reducer)
Increase spark.reducer.maxSizeInFlight if memory allows
Use Smaller Number of Larger Executors
Minimizes intermediate files and overall shuffle
More opportunity for PROCESS_LOCAL
SQL: BroadcastHashJoin vs. ShuffledHashJoin
spark.sql.autoBroadcastJoinThreshold
Use DataFrame.explain(true) or EXPLAIN to verify
36
Many Threads
(1 per CPU)
37. Power of data. Simplicity of design. Speed of innovation.
IBM Spark
spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Outline
① Mechanical Sympathy
② Recap of 100TB GraySort Challenge
③ Project Tungsten Deep Dive
37
38. Power of data. Simplicity of design. Speed of innovation.
IBM Spark
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
spark.tc
Project Tungsten
Data Struts & Algos Operate Directly on Byte Arrays
Maximize CPU Cache Locality, Minimize GC
Utilize Dynamic Code Generation
38
SPARK-7076
(Spark 1.4)
39. Power of data. Simplicity of design. Speed of innovation.
IBM Spark
spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Quick Review of Project Tungsten Jiras
39
SPARK-7076
(Spark 1.4)
40. Power of data. Simplicity of design. Speed of innovation.
IBM Spark
spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Why is CPU the Bottleneck?
CPU is used for serialization, hashing, compression!
Network and Disk I/O bandwidth are relatively high
GraySort optimizations improved network & shuffle
Partitioning, pruning, and predicate pushdowns
Binary, compressed, columnar file formats (Parquet)
40
41. Power of data. Simplicity of design. Speed of innovation.
IBM Spark
spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Yet Another Spark Shuffle Manager!
spark.shuffle.manager =
hash (Deprecated)
< 10,000 reducers
Output partition file hashes the key of (K,V) pair
Mapper creates an output file per partition
Leads to M*P output files for all partitions
sort (GraySort Challenge)
> 10,000 reducers
Default from Spark 1.2-1.5
Mapper creates single output file for all partitions
Minimizes OS resources, netty + epoll optimizes network I/O, disk I/O, and memory
Uses custom data structures and algorithms for sort-shuffle workload
Wins Daytona GraySort Challenge
tungsten-sort (Project Tungsten)
Default since 1.5
Modification of existing sort-based shuffle
Uses com.misc.Unsafe for self-managed memory and garbage collection
Maximize CPU utilization and cache locality with AlphaSort-inspired binary data structures/algorithms
Perform joins, sorts, and other operators on both serialized and compressed byte buffers
41
42. Power of data. Simplicity of design. Speed of innovation.
IBM Spark
spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
CPU & Memory Optimizations
Custom Managed Memory
Reduces GC overhead
Both on and off heap
Exact size calculations
Direct Binary Processing
Operate on serialized/compressed arrays
Kryo can reorder/sort serialized records
LZF can reorder/sort compressed records
More CPU Cache-aware Data Structs & Algorithms
o.a.s.sql.catalyst.expression.UnsafeRow
o.a.s.unsafe.map.BytesToBytesMap
Code Generation (default in 1.5)
Generate source code from overall query plan
100+ UDFs converted to use code generation
42
UnsafeFixedWithAggregationMap
TungstenAggregationIterator
CodeGenerator
GeneratorUnsafeRowJoiner
UnsafeSortDataFormat
UnsafeShuffleSortDataFormat
PackedRecordPointer
UnsafeRow
UnsafeInMemorySorter
UnsafeExternalSorter
UnsafeShuffleWriter
Mostly Same Join Code,
UnsafeProjection
UnsafeShuffleManager
UnsafeShuffleInMemorySorter
UnsafeShuffleExternalSorter
Details in
SPARK-7075
43. Power of data. Simplicity of design. Speed of innovation.
IBM Spark
spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
sun.misc.Unsafe
43
Info
addressSize()
pageSize()
Objects
allocateInstance()
objectFieldOffset()
Classes
staticFieldOffset()
defineClass()
defineAnonymousClass()
ensureClassInitialized()
Synchronization
monitorEnter()
tryMonitorEnter()
monitorExit()
compareAndSwapInt()
putOrderedInt()
Arrays
arrayBaseOffset()
arrayIndexScale()
Memory
allocateMemory()
copyMemory()
freeMemory()
getAddress() – not guaranteed after GC
getInt()/putInt()
getBoolean()/putBoolean()
getByte()/putByte()
getShort()/putShort()
getLong()/putLong()
getFloat()/putFloat()
getDouble()/putDouble()
getObjectVolatile()/putObjectVolatile()
Used by
Tungsten
45. Power of data. Simplicity of design. Speed of innovation.
IBM Spark
spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Traditional Java Object Row Layout
4-byte String
Multi-field Object
45
46. Power of data. Simplicity of design. Speed of innovation.
IBM Spark
spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Custom Data Structures for Workload
UnsafeRow
(Dense Binary Row)
TaskMemoryManager
(Virtual Memory Address)
BytesToBytesMap
(Dense Binary HashMap)
46
Dense, 8-bytes per field (word-aligned)
Key
Ptr
AlphaSort-Style (Key + Pointer)
OS-Style Memory Paging
47. Power of data. Simplicity of design. Speed of innovation.
IBM Spark
spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
UnsafeRow Layout Example
47
Pre-Tungsten
Tungsten
48. Power of data. Simplicity of design. Speed of innovation.
IBM Spark
spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Custom Memory Management
o.a.s.memory.
TaskMemoryManager & MemoryConsumer
Memory management: virtual memory allocation, pageing
Off-heap: direct 64-bit address
On-heap: 13-bit page num + 27-bit page offset
o.a.s.shuffle.sort.
PackedRecordPointer
64-bit word
(24-bit partition key, (13-bit page num, 27-bit page offset))
o.a.s.unsafe.types.
UTF8String
Primitive Array[Byte]
48
2^13 pages * 2^27 page size = 1 TB RAM per Task
49. Power of data. Simplicity of design. Speed of innovation.
IBM Spark
spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
UnsafeFixedWidthAggregationMap
Aggregations
o.a.s.sql.execution.
UnsafeFixedWidthAggregationMap
Uses BytesToBytesMap
In-place updates of serialized data
No object creation on hot-path
Improved external agg support
No OOM’s for large, single key aggs
o.a.s.sql.catalyst.expression.codegen.
GenerateUnsafeRowJoiner
Combine 2 UnsafeRows into 1
o.a.s.sql.execution.aggregate.
TungstenAggregate & TungstenAggregationIterator
Operates directly on serialized, binary UnsafeRow
2 Steps: hash-based agg (grouping), then sort-based agg
Supports spilling and external merge sorting
49
50. Power of data. Simplicity of design. Speed of innovation.
IBM Spark
spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Equality
Bitwise comparison on UnsafeRow
No need to calculate equals(), hashCode()
Row 1
Equals!
Row 2
50
51. Power of data. Simplicity of design. Speed of innovation.
IBM Spark
spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Joins
Surprisingly, not many code changes
o.a.s.sql.catalyst.expressions.
UnsafeProjection
Converts InternalRow to UnsafeRow
51
52. Power of data. Simplicity of design. Speed of innovation.
IBM Spark
spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Sorting
o.a.s.util.collection.unsafe.sort.
UnsafeSortDataFormat
UnsafeInMemorySorter
UnsafeExternalSorter
RecordPointerAndKeyPrefix
UnsafeShuffleWriter
AlphaSort-Style Cache Friendly
52
Ptr
Key-Prefix
2x CPU Cache-line Friendly!
Using multiple subclasses of SortDataFormat
simultaneously will prevent JIT inlining.
This affects sort & shuffle performance.
Supports merging compressed records
if compression CODEC supports it (LZF)
53. Power of data. Simplicity of design. Speed of innovation.
IBM Spark
spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Spilling
Efficient Spilling
Exact data size is known
No need to maintain heuristics & approximations
Controls amount of spilling
Spill merge on compressed, binary records!
If compression CODEC supports it
53
UnsafeFixedWidthAggregationMap.getPeakMemoryUsedBytes()
Exact Peak Memory
for Spark Jobs
54. Power of data. Simplicity of design. Speed of innovation.
IBM Spark
spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Code Generation
Problem
Boxing causes excessive object creation
Expensive expression tree evals per row
JVM can’t inline polymorphic impls
Solution
Codegen by-passes virtual function calls
Defer source code generation to each operator, UDF, UDAF
Use Scala quasiquote macros for Scala AST source code gen
Rewrite and optimize code for overall plan, 8-byte align, etc
Use Janino to compile generated source code into bytecode
54
55. Power of data. Simplicity of design. Speed of innovation.
IBM Spark
spark.tc
IBM | spark.tc
Spark SQL UDF Code Generation
100+ UDFs now generating code
More to come in Spark 1.6+
Details in
SPARK-8159, SPARK-9571
Each Implements
Expression.genCode()!
56. Power of data. Simplicity of design. Speed of innovation.
IBM Spark
spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Creating a Custom UDF with Codegen
Study existing implementations
https://github.com/apache/spark/pull/7214/files
Extend base trait
o.a.s.sql.catalyst.expressions.Expression.genCode()
Register the function
o.a.s.sql.catalyst.analysis.FunctionRegistry.registerFunction()
Augment DataFrame with new UDF (Scala implicits)
o.a.s.sql.functions.scala
Don’t forget about Python!
python.pyspark.sql.functions.py
56
57. Power of data. Simplicity of design. Speed of innovation.
IBM Spark
spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Who Benefits from Project Tungsten?
Users of DataFrames
All Spark SQL Queries
Catalyst
All RDDs
Serialization, Compression, and Aggregations
57
58. Power of data. Simplicity of design. Speed of innovation.
IBM Spark
spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Performance Results
Query Time
Garbage
Collection
58
OOM’d on
Large Dataset!
59. Power of data. Simplicity of design. Speed of innovation.
IBM Spark
spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Thank You!!!
Chris Fregly @cfregly
IBM Spark Technology Center
San Francisco, California
Relevant Links
advancedspark.com
Signup for the book & global meetup!
github.com/fluxcapacitor/pipeline
Clone, contribute, and commit code!
hub.docker.com/r/fluxcapacitor/pipeline/wiki
Run all demos in your own environment with Docker!
59
60. Power of data. Simplicity of design. Speed of innovation.
IBM Spark
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark