Power of data. Simplicity of design. Speed of innovation.
IBM Spark
spark.tc
Project Tungsten
Advanced Apache Spark Meetup
Chris Fregly
Principal Data Solutions Engineer
We’re Hiring - Only Nice People!
Nov 12, 2015

IBM Spark
spark.tc
spark.tc
IBM Spark
Who Am I?
2

Streaming Data Engineer
Open Source Committer 

Data Solutions Engineer 
Apache Contributor
Principal Data Solutions Engineer
IBM Technology Center
Founder
Advanced Apache Meetup
Author
Advanced .
Due 2016
My Ma’s First Time in California

IBM Spark
spark.tc
spark.tc
IBM Spark
Random Slide: More Ma “First Time” Pics
3
In California
Using Chopsticks
Using “New” iPhone

IBM Spark
spark.tc
spark.tc
IBM Spark
Upcoming Meetups and Conferences
London Spark Meetup (Oct 12th)
Scotland Data Science Meetup (Oct 13th)
Dublin Spark Meetup (Oct 15th)
Barcelona Spark Meetup (Oct 20th)
Madrid Big Data Meetup (Oct 22nd)
Paris Spark Meetup (Oct 26th)
Amsterdam Spark Summit (Oct 27th)
Brussels Spark Meetup (Oct 30th)
Zurich Big Data Meetup (Nov 2nd)
Geneva Spark Meetup (Nov 5th)
San Francisco Datapalooza.io (Nov 10th)
4
San Francisco Advanced Spark (Nov 12th)
Oslo Big Data Hadoop Meetup (Nov 18th)
Helsinki Spark Meetup (Nov 20th)
Stockholm Spark Meetup (Nov 23rd)
Copenhagen Spark Meetup (Nov 26th)
Budapest Spark Meetup (Nov 27th)
Singapore Strata Conference (Dec 1st)
San Francisco Advanced Spark (Dec 8th)
Mountain View Advanced Spark (Dec 10th)
Toronto Spark Meetup (Dec 14th)
Washington DC Spark Meetup (Jan 2016)

IBM Spark
spark.tc
spark.tc
IBM Spark
Advanced Apache Spark Meetup
Meetup Metrics
~1600 Members in just 4 mos!
4th Most Active Spark Meetup!!

Meetup Goals
  Dig deep into codebase of Spark and related projects
  Study integrations of Cassandra, ElasticSearch,

Tachyon, S3, BlinkDB, Mesos, YARN, Kafka, R
  Surface and share patterns and idioms of these

well-designed, distributed, big data components
THANKS TO ALL OF YOU!!

IBM Spark
spark.tc
spark.tc
IBM Spark
All Slides and Code Are Available!

slideshare.net/cfregly
github.com/fluxcapacitor
hub.docker.com/r/ﬂuxcapacitor

6

IBM Spark
spark.tc
spark.tc
IBM Spark
Themes of this Talk
 Filter
 Oﬀ-Heap
 Parallelize
 Approximate
 Find Similarity
 Minimize Seeks
 Maximize Scans
 Customize for Workload
 Tune Performance At Every Layer
7
  Be Nice, Collaborate!
Like a Mom!!

IBM Spark
spark.tc
spark.tc
IBM Spark
Outline
①  Mechanical Sympathy
②  Recap of 100TB GraySort Challenge
③  Project Tungsten Deep Dive
8

IBM Spark
spark.tc
spark.tc
IBM Spark
Mechanical Sympathy
Hardware and software working together in harmony.

- Martin Thompson

http://mechanical-sympathy.blogspot.com

Whatever your data structure, my array will beat it.

- Scott Meyers

Every C++ Book, basically

9
Hair
Sympathy
- Bruce Jenner

IBM Spark
spark.tc
spark.tc
IBM Spark
Spark and Mechanical Sympathy
10
Project  
Tungsten
(Spark 1.4-1.6+)
GraySort
Challenge
(Spark 1.1-1.2)
Minimize Memory and GC
Maximize CPU Cache Locality
Saturate Network I/O
Saturate Disk I/O

IBM Spark
spark.tc
spark.tc
IBM Spark
AlphaSort Technique: Sort 100 Bytes Recs
11
Value
Ptr
Key
Dereference Not Required!
AlphaSort

List [(Key, Pointer)]

Key is directly available for comparison
Naïve

List [Pointer]

Must dereference key for comparison
Ptr
Dereference for Key Comparison
Key

IBM Spark
spark.tc
spark.tc
IBM Spark
CPU Cache Line and Memory Sympathy
Key (10 bytes)+Pointer (*4 bytes)*Compressed OOPs

= 14 bytes

12
Key
Ptr
Not CPU Cache-line Friendly!
Ptr
Key-Preﬁx
2x CPU Cache-line Friendly!
Key-Preﬁx (4 bytes) + Pointer (4 bytes)

= 8 bytes
Key (10 bytes)+Pad (2 bytes)+Pointer (4 bytes) 
= 16 bytes
Key
Ptr
Pad
/Pad
CPU Cache-line Friendly!

IBM Spark
spark.tc
spark.tc
IBM Spark
Performance Comparison
13

IBM Spark
spark.tc
spark.tc
IBM Spark
Similar Trick: Direct Cache Access (DCA)
Pull out packet header along side pointer to payload
14

IBM Spark
spark.tc
spark.tc
IBM Spark
CPU Cache Lines: Sequential vs. Random
15

IBM Spark
spark.tc
spark.tc
IBM Spark
CPU Cache Naïve Matrix Multiplication
// Dot product of each row & column vector
for (i <- 0 until numRowA)
for (j <- 0 until numColsB)
for (k <- 0 until numColsA)
res[ i ][ j ] += matA[ i ][ k ] * matB[ k ][ j ];

16
Bad: Row-wise traversal,

not using CPU cache line, 
ineﬀective pre-fetching

IBM Spark
spark.tc
spark.tc
IBM Spark
CPU Cache Friendly Matrix Multiplication

// Transpose B
for (i <- 0 until numRowsB)

matBT[ i ][ j ] = matB[ j ][ i ];
 
// Modify dot product calculation for B Transpose
for (i <- 0 until numRowsA)
for (k <- 0 until numColsA)
res[ i ][ j ] += matA[ i ][ k ] * matBT[ j ][ k ];
17
Good: Full CPU cache line, 
eﬀective prefetching
OLD: res[ i ][ j ] += matA[ i ][ k ] * matB [ k ] [ j ];
Reference j 
before k

IBM Spark
spark.tc
spark.tc
IBM Spark
Instrumenting and Monitoring CPU
Use Linux perf command!
18
http://www.brendangregg.com/blog/2015-11-06/java-mixed-mode-flame-graphs.html

IBM Spark
spark.tc
spark.tc
IBM Spark
Results of Matrix Multiply Comparison
Naïve Matrix Multiply
Cache-Friendly Matrix Multiply
~72x
~8x
~3x
~3x
~2x
~7x
~10x
perf stat -XX:+PreserveFramePointer -XX:-Inline
–event L1-dcache-load-misses,L1-dcache-prefetch-misses,LLC-load-misses,
LLC-prefetch-misses,cache-misses,stalled-cycles-frontend
~10x
55 hp
550 hp

IBM Spark
spark.tc
IBM Spark
spark.tc
Demo!
Compare CPU Naïve & Cache-Friendly Matrix Multiplication
20

IBM Spark
spark.tc
spark.tc
IBM Spark
CPU Cache Naïve Tuple Counters
object CacheNaiveTupleIncrement {
var tuple = (0,0)
…

def increment(leftIncrement: Int, rightIncrement: Int) : (Int, Int) = {
this.synchronized {
tuple = (tuple._1 + leftIncrement, tuple._2 + rightIncrement)
tuple
}
}
}
21

IBM Spark
spark.tc
spark.tc
IBM Spark
CPU Cache Naïve Case Class Counters
case class MyTuple(left: Int, right: Int)

object CacheNaiveCaseClassCounters {
var tuple = new MyTuple(0,0)
…

def increment(leftIncrement: Int, rightIncrement: Int) : MyTuple = {
this.synchronized {
tuple = new MyTuple(tuple.left + leftIncrement,

tuple.right + rightIncrement)
tuple
}
}
}
22

IBM Spark
spark.tc
spark.tc
IBM Spark
CPU Cache Friendly Lock-Free Counters
object CacheFriendlyLockFreeCounters {
// a single Long (8-bytes) will maintain 2 separate Ints (4-bytes each)
val tuple = new AtomicLong()
…
def increment(leftIncrement: Int, rightIncrement: Int) : Long = {
var originalLong = 0L
var updatedLong = 0L
do {

originalLong = tuple.get()

val originalRightInt = originalLong.toInt // cast originalLong to Int to get right counter

val originalLeftInt = (originalLong >>> 32).toInt // shift right to get left counter

val updatedRightInt = originalRightInt + rightIncrement // increment right counter

val updatedLeftInt = originalLeftInt + leftIncrement // increment left counter

updatedLong = updatedLeftInt // update the new long with the left counter

updatedLong = updatedLong << 32 // shift the new long left

updatedLong += updatedRightInt // update the new long with the right counter
} while (tuple.compareAndSet(originalLong, updatedLong) == false)
updatedLong
}
23
Quiz: Why not @volatile?

IBM Spark
spark.tc
IBM Spark
spark.tc
Demo!
Compare CPU Naïve & Cache-Friendly Tuple Counter Sync
24

IBM Spark
spark.tc
spark.tc
IBM Spark
Results of Counters Comparison
Naïve Tuple Counters

Naïve Case Class Counters

Cache Friendly Lock-Free Counters
~2x
~1.5x
~3.5x
~2x
~2x
~1.5x
~1.5x
~1.5x

IBM Spark
spark.tc
spark.tc
IBM Spark
Proﬁling Visualizations: Flame Graphs
With Java Stack Traces!!
26
Example: Spark Word Count
Java Stack Traces
are Good!
Plateaus 
are Bad!!

IBM Spark
spark.tc
spark.tc
IBM Spark
Outline
27

IBM Spark
spark.tc
IBM Spark
spark.tc
100TB GraySort Challenge
Sort 100TB of 100-Byte Records with 10-byte Keys
Custom Data Structs & Algos for Sort & Shuﬄe
Saturate Network and Disk I/O Controllers
28

IBM Spark
spark.tc
spark.tc
IBM Spark
100TB GraySort Challenge Results
29
Performance Goals
  Saturate Network I/O
  Saturate Disk I/O
(2013) (2014)

IBM Spark
spark.tc
spark.tc
IBM Spark
Winning Hardware Conﬁguration
Compute

206 Workers, 1 Master (AWS EC2 i2.8xlarge)

32 Intel Xeon CPU E5-2670 @ 2.5 Ghz

244 GB RAM, 8 x 800GB SSD, RAID 0 striping, ext4

3 GBps mixed read/write disk I/O per node
Network

AWS Placement Groups, VPC, Enhanced Networking

Single Root I/O Virtualization (SR-IOV)

10 Gbps, low latency, low jitter (iperf: ~9.5 Gbps)
30

IBM Spark
spark.tc
spark.tc
IBM Spark
Winning Software Conﬁguration
Spark 1.2, OpenJDK 1.7
Disable caching, compression, spec execution, shuﬄe spill
Force NODE_LOCAL task scheduling for optimal data locality
HDFS 2.4.1 short-circuit local reads, 2x replication
Empirically chose between 4-6 partitions per cpu

206 nodes * 32 cores = 6592 cores

6592 cores * 4 = 26,368 partitions

6592 cores * 6 = 39,552 partitions

6592 cores * 4.25 = 28,000 partitions (empirical best)
Range partitioning takes advantage of sequential keyspace

Required ~10s of sampling 79 keys from in each partition
31

IBM Spark
spark.tc
spark.tc
IBM Spark
New Sort Shuffle Manager for Spark 1.2
Original “hash-based”

New “sort-based”

①  Use less OS resources (socket buffers, file descriptors)
②  TimSort partitions in-memory
③  MergeSort partitions on-disk into a single master file
④  Serve partitions from master file: seek once, sequential scan
32

IBM Spark
spark.tc
spark.tc
IBM Spark
Asynchronous Network Module
Switch to asyncronous Netty vs. synchronous java.nio
Switch to zero-copy epoll

Use only kernel-space between disk and network controllers
Custom memory management

spark.shuffle.blockTransferService=netty
Spark-Netty Performance Tuning

spark.shuffle.io.preferDirectBuffers=true

Reuse off-heap buffers

spark.shuffle.io.numConnectionsPerPeer=8 (for example)

Increase to saturate hosts with multiple disks (8x800 SSD)
33
Details in
SPARK-2468

IBM Spark
spark.tc
spark.tc
IBM Spark
Custom Algorithms and Data Structures
Optimized for sort & shuﬄe workloads
o.a.s.util.collection.TimSort[K,V]

Based on JDK 1.7 TimSort

Performs best with partially-sorted runs

Optimized for elements of (K,V) pairs

Sorts impl of SortDataFormat (ie. KVArraySortDataFormat)
o.a.s.util.collection.AppendOnlyMap

Open addressing hash, quadratic probing

Array of [(K, V), (K, V)]

Good memory locality

Keys never removed, values only append
34

IBM Spark
spark.tc
spark.tc
IBM Spark
Daytona GraySort Challenge Goal Success

1.1 Gbps/node network I/O (Reducers) 
Theoretical max = 1.25 Gbps for 10 GB ethernet
3 GBps/node disk I/O (Mappers)
35
Aggregate  
Cluster
Network I/O!
220 Gbps / 206 nodes ~= 1.1 Gbps per node

IBM Spark
spark.tc
spark.tc
IBM Spark
Shuffle Performance Tuning Tips
Hash Shuffle Manager (Deprecated)

spark.shuffle.consolidateFiles (Mapper)

o.a.s.shuffle.FileShuffleBlockResolver
Intermediate Files

Increase spark.shuffle.file.buffer (Reducer)

Increase spark.reducer.maxSizeInFlight if memory allows
Use Smaller Number of Larger Executors

Minimizes intermediate files and overall shuffle

More opportunity for PROCESS_LOCAL
SQL: BroadcastHashJoin vs. ShuffledHashJoin

spark.sql.autoBroadcastJoinThreshold

Use DataFrame.explain(true) or EXPLAIN to verify

36
Many Threads
(1 per CPU)

IBM Spark
spark.tc
spark.tc
IBM Spark
Outline
37

IBM Spark
spark.tc
IBM Spark
spark.tc
Project Tungsten
Data Struts & Algos Operate Directly on Byte Arrays
Maximize CPU Cache Locality, Minimize GC
Utilize Dynamic Code Generation
38
SPARK-7076
(Spark 1.4)

IBM Spark
spark.tc
spark.tc
IBM Spark
Quick Review of Project Tungsten Jiras

39
SPARK-7076
(Spark 1.4)

IBM Spark
spark.tc
spark.tc
IBM Spark
Why is CPU the Bottleneck?
CPU is used for serialization, hashing, compression!

Network and Disk I/O bandwidth are relatively high

GraySort optimizations improved network & shuﬄe

Partitioning, pruning, and predicate pushdowns

Binary, compressed, columnar ﬁle formats (Parquet)
40

IBM Spark
spark.tc
spark.tc
IBM Spark
Yet Another Spark Shuffle Manager!
spark.shuffle.manager =

hash (Deprecated)

< 10,000 reducers

Output partition file hashes the key of (K,V) pair

Mapper creates an output file per partition

Leads to M*P output files for all partitions

sort (GraySort Challenge)

> 10,000 reducers

Default from Spark 1.2-1.5

Mapper creates single output file for all partitions

Minimizes OS resources, netty + epoll optimizes network I/O, disk I/O, and memory

Uses custom data structures and algorithms for sort-shuffle workload

Wins Daytona GraySort Challenge

tungsten-sort (Project Tungsten)

Default since 1.5

Modification of existing sort-based shuffle

Uses com.misc.Unsafe for self-managed memory and garbage collection

Maximize CPU utilization and cache locality with AlphaSort-inspired binary data structures/algorithms

Perform joins, sorts, and other operators on both serialized and compressed byte buffers
41

IBM Spark
spark.tc
spark.tc
IBM Spark
CPU & Memory Optimizations
Custom Managed Memory

Reduces GC overhead

Both on and oﬀ heap

Exact size calculations
Direct Binary Processing

Operate on serialized/compressed arrays

Kryo can reorder/sort serialized records

LZF can reorder/sort compressed records
More CPU Cache-aware Data Structs & Algorithms

o.a.s.sql.catalyst.expression.UnsafeRow

o.a.s.unsafe.map.BytesToBytesMap
Code Generation (default in 1.5)

Generate source code from overall query plan

100+ UDFs converted to use code generation
42
UnsafeFixedWithAggregationMap
TungstenAggregationIterator
CodeGenerator
GeneratorUnsafeRowJoiner
UnsafeSortDataFormat
UnsafeShuffleSortDataFormat
PackedRecordPointer
UnsafeRow
UnsafeInMemorySorter
UnsafeExternalSorter
UnsafeShuffleWriter
Mostly Same Join Code,
UnsafeProjection
UnsafeShuffleManager
UnsafeShuffleInMemorySorter
UnsafeShuffleExternalSorter
Details in
SPARK-7075

IBM Spark
spark.tc
spark.tc
IBM Spark
sun.misc.Unsafe
43
Info

addressSize()

pageSize()
Objects

allocateInstance()

objectFieldOffset()
Classes

staticFieldOffset()

defineClass()

defineAnonymousClass()

ensureClassInitialized()
Synchronization

monitorEnter()

tryMonitorEnter()

monitorExit()

compareAndSwapInt()

putOrderedInt()
Arrays

arrayBaseOffset()

arrayIndexScale()
Memory

allocateMemory()

copyMemory()

freeMemory()

getAddress() – not guaranteed after GC

getInt()/putInt()

getBoolean()/putBoolean()

getByte()/putByte()

getShort()/putShort()

getLong()/putLong()

getFloat()/putFloat()

getDouble()/putDouble()

getObjectVolatile()/putObjectVolatile()
Used by  
Tungsten

IBM Spark
spark.tc
spark.tc
IBM Spark
Spark + com.misc.Unsafe
44
org.apache.spark.sql.execution.
aggregate.SortBasedAggregate
aggregate.TungstenAggregate
aggregate.AggregationIterator
aggregate.udaf
aggregate.utils
SparkPlanner
rowFormatConverters
UnsafeFixedWidthAggregationMap
UnsafeExternalRowSorter
UnsafeKeyValueSorter
UnsafeKVExternalSorter
local.ConvertToUnsafeNode
local.ConvertToSafeNode
local.HashJoinNode
local.ProjectNode
local.LocalNode
local.BinaryHashJoinNode
local.NestedLoopJoinNode
joins.HashJoin
joins.HashSemiJoin
joins.HashedRelation
joins.BroadcastHashJoin
joins.ShuffledHashOuterJoin (not yet converted)
joins.BroadcastHashOuterJoin
joins.BroadcastLeftSemiJoinHash
joins.BroadcastNestedLoopJoin
joins.SortMergeJoin
joins.LeftSemiJoinBNL
joins.SortMergerOuterJoin
Exchange
SparkPlan
UnsafeRowSerializer
SortPrefixUtils
sort
basicOperators
aggregate.SortBasedAggregationIterator
aggregate.TungstenAggregationIterator
datasources.WriterContainer
datasources.json.JacksonParser
datasources.jdbc.JDBCRDD
org.apache.spark.
unsafe.Platform
unsafe.KVIterator
unsafe.array.LongArray
unsafe.array.ByteArrayMethods
unsafe.array.BitSet
unsafe.bitset.BitSetMethods
unsafe.hash.Murmur3_x86_32
unsafe.map.BytesToBytesMap
unsafe.map.HashMapGrowthStrategy
unsafe.memory.TaskMemoryManager
unsafe.memory.ExecutorMemoryManager
unsafe.memory.MemoryLocation
unsafe.memory.UnsafeMemoryAllocator
unsafe.memory.MemoryAllocator (trait/interface)
unsafe.memory.MemoryBlock
unsafe.memory.HeapMemoryAllocator
unsafe.memory.ExecutorMemoryManager
unsafe.sort.RecordComparator
unsafe.sort.PrefixComparator
unsafe.sort.PrefixComparators
unsafe.sort.UnsafeSorterSpillWriter
serializer.DummySerializationInstance
shuffle.unsafe.UnsafeShuffleManager
shuffle.unsafe.UnsafeShuffleSortDataFormat
shuffle.unsafe.SpillInfo
shuffle.unsafe.UnsafeShuffleWriter
shuffle.unsafe.UnsafeShuffleExternalSorter
shuffle.unsafe.PackedRecordPointer
shuffle.ShuffleMemoryManager
util.collection.unsafe.sort.UnsafeSorterSpillMerger
util.collection.unsafe.sort.UnsafeSorterSpillReader
util.collection.unsafe.sort.UnsafeSorterSpillWriter
util.collection.unsafe.sort.UnsafeShuffleInMemorySorter
util.collection.unsafe.sort.UnsafeInMemorySorter
util.collection.unsafe.sort.RecordPointerAndKeyPrefix
util.collection.unsafe.sort.UnsafeSorterIterator
network.shuffle.ExternalShuffleBlockResolver
scheduler.Task
rdd.SqlNewHadoopRDD
executor.Executor
org.apache.spark.sql.catalyst.expressions.
regexpExpressions
BoundAttribute
SortOrder
SpecializedGetters
ExpressionEvalHelper
UnsafeArrayData
UnsafeReaders
UnsafeMapData
Projection
LiteralGeneartor
UnsafeRow
JoinedRow
SpecializedGetters
InputFileName
SpecificMutableRow
codegen.CodeGenerator
codegen.GenerateProjection
codegen.GenerateUnsafeRowJoiner
codegen.GenerateSafeProjection
codegen.GenerateUnsafeProjection
codegen.BufferHolder
codegen.UnsafeRowWriter
codegen.UnsafeArrayWriter
complexTypeCreator
rows
literals
misc
stringExpressions
Over 200 source
files affected!!

IBM Spark
spark.tc
spark.tc
IBM Spark
Traditional Java Object Row Layout
4-byte String

Multi-ﬁeld Object

45

IBM Spark
spark.tc
spark.tc
IBM Spark
Custom Data Structures for Workload

UnsafeRow
(Dense Binary Row)

TaskMemoryManager
(Virtual Memory Address)

BytesToBytesMap
(Dense Binary HashMap)
46
Dense, 8-bytes per ﬁeld (word-aligned)
Key
Ptr
AlphaSort-Style (Key + Pointer)
OS-Style Memory Paging

IBM Spark
spark.tc
spark.tc
IBM Spark
UnsafeRow Layout Example
47
Pre-Tungsten

Tungsten

IBM Spark
spark.tc
spark.tc
IBM Spark
Custom Memory Management
o.a.s.memory. 

TaskMemoryManager & MemoryConsumer

Memory management: virtual memory allocation, pageing

Off-heap: direct 64-bit address

On-heap: 13-bit page num + 27-bit page offset
o.a.s.shuffle.sort.

PackedRecordPointer

64-bit word

(24-bit partition key, (13-bit page num, 27-bit page offset))
o.a.s.unsafe.types.

UTF8String

Primitive Array[Byte]
48
2^13 pages * 2^27 page size = 1 TB RAM per Task

IBM Spark
spark.tc
spark.tc
IBM Spark
Aggregations
o.a.s.sql.execution. 


Uses BytesToBytesMap

In-place updates of serialized data

No object creation on hot-path

Improved external agg support

No OOM’s for large, single key aggs
o.a.s.sql.catalyst.expression.codegen.

GenerateUnsafeRowJoiner

Combine 2 UnsafeRows into 1
o.a.s.sql.execution.aggregate.

TungstenAggregate & TungstenAggregationIterator

Operates directly on serialized, binary UnsafeRow

2 Steps: hash-based agg (grouping), then sort-based agg

Supports spilling and external merge sorting
49

IBM Spark
spark.tc
spark.tc
IBM Spark
Equality
Bitwise comparison on UnsafeRow

No need to calculate equals(), hashCode()

Row 1
Equals!
Row 2
50

IBM Spark
spark.tc
spark.tc
IBM Spark
Joins
Surprisingly, not many code changes

o.a.s.sql.catalyst.expressions.

UnsafeProjection

Converts InternalRow to UnsafeRow
51

IBM Spark
spark.tc
spark.tc
IBM Spark
Sorting
o.a.s.util.collection.unsafe.sort.

UnsafeSortDataFormat

UnsafeInMemorySorter


RecordPointerAndKeyPrefix 

UnsafeShuffleWriter
AlphaSort-Style Cache Friendly

52
Ptr
Key-Prefix
2x CPU Cache-line Friendly!
Using multiple subclasses of SortDataFormat
simultaneously will prevent JIT inlining.
This affects sort & shuffle performance.
Supports merging compressed records
if compression CODEC supports it (LZF)

IBM Spark
spark.tc
spark.tc
IBM Spark
Spilling
Eﬃcient Spilling

Exact data size is known

No need to maintain heuristics & approximations

Controls amount of spilling
Spill merge on compressed, binary records!

If compression CODEC supports it

53
UnsafeFixedWidthAggregationMap.getPeakMemoryUsedBytes()
Exact Peak Memory
for Spark Jobs

IBM Spark
spark.tc
spark.tc
IBM Spark
Code Generation
Problem
Boxing causes excessive object creation
Expensive expression tree evals per row
JVM can’t inline polymorphic impls
Solution
Codegen by-passes virtual function calls
Defer source code generation to each operator, UDF, UDAF
Use Scala quasiquote macros for Scala AST source code gen
Rewrite and optimize code for overall plan, 8-byte align, etc
Use Janino to compile generated source code into bytecode
54

IBM Spark
spark.tc
IBM | spark.tc
Spark SQL UDF Code Generation
100+ UDFs now generating code
More to come in Spark 1.6+
Details in
SPARK-8159, SPARK-9571
Each Implements
Expression.genCode()!

IBM Spark
spark.tc
spark.tc
IBM Spark
Creating a Custom UDF with Codegen
Study existing implementations

https://github.com/apache/spark/pull/7214/ﬁles
Extend base trait

o.a.s.sql.catalyst.expressions.Expression.genCode()
Register the function

o.a.s.sql.catalyst.analysis.FunctionRegistry.registerFunction()
Augment DataFrame with new UDF (Scala implicits)

o.a.s.sql.functions.scala
Don’t forget about Python!

python.pyspark.sql.functions.py

56

IBM Spark
spark.tc
spark.tc
IBM Spark
Who Beneﬁts from Project Tungsten?
Users of DataFrames

All Spark SQL Queries

Catalyst

All RDDs

Serialization, Compression, and Aggregations
57

IBM Spark
spark.tc
spark.tc
IBM Spark
Performance Results
Query Time

Garbage
Collection
58
OOM’d on
Large Dataset!

IBM Spark
spark.tc
spark.tc
IBM Spark
Thank You!!!
Chris Fregly @cfregly
IBM Spark Technology Center
San Francisco, California
Relevant Links
advancedspark.com
Signup for the book & global meetup!
github.com/ﬂuxcapacitor/pipeline
Clone, contribute, and commit code!
hub.docker.com/r/ﬂuxcapacitor/pipeline/wiki
Run all demos in your own environment with Docker!

59

IBM Spark
spark.tc
IBM Spark

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Viewers also liked

Viewers also liked (6)

More from Chris Fregly

More from Chris Fregly (20)

Recently uploaded

Recently uploaded (20)