Helsinki Spark Meetup Nov 20 2015

IBM Spark
spark.tc

After Dark 1.5
High Performance, Real-time, Streaming,
Machine Learning, Natural Language Processing,
Text Analytics, and Recommendations

Chris Fregly
Principal Data Solutions Engineer
IBM Spark Technology Center
** We’re Hiring -- Only Nice People, Please!! **
November 20, 2015

Power of data. Simplicity of design. Speed of innovation.
Who Am I?

Streaming Data Engineer
Open Source Committer 

Data Solutions Engineer 
Apache Contributor
Principal Data Solutions Engineer
IBM Technology Center
Founder
Advanced Apache Meetup
Author
Advanced .
Due 2016
My Ma’s First Time in California

Random Slide: More Ma “First Time” Pics
In California
Using Chopsticks
Using “New” iPhone

Upcoming Meetups and Conferences
London Spark Meetup (Oct 12th)
Scotland Data Science Meetup (Oct 13th)
Dublin Spark Meetup (Oct 15th)
Barcelona Spark Meetup (Oct 20th)
Madrid Big Data Meetup (Oct 22nd)
Paris Spark Meetup (Oct 26th)
Amsterdam Spark Summit (Oct 27th)
Brussels Spark Meetup (Oct 30th)
Zurich Big Data Meetup (Nov 2nd)
Geneva Spark Meetup (Nov 5th)
San Francisco Datapalooza.io (Nov 10th)
San Francisco Advanced Spark (Nov 12th)
Oslo Big Data Hadoop Meetup (Nov 19th)
Helsinki Spark Meetup (Nov 20th)
Stockholm Spark Meetup (Nov 23rd)
Copenhagen Spark Meetup (Nov 25th)
Budapest Spark Meetup (Nov 26th)
Singapore Strata Conference (Dec 1st)
San Francisco Advanced Spark (Dec 8th)
Mountain View Advanced Spark (Dec 10th)
Toronto Spark Meetup (Dec 14th)
Austin Data Days Conference (Jan 2016)

Advanced Apache Spark Meetup
Meetup Metrics
1600+ Members in just 4 mos!
Top 5 Most Active Spark Meetup!!

Meetup Goals
  Dig deep into codebase of Spark and related projects
  Study integrations of Cassandra, ElasticSearch, Tachyon, S3, BlinkDB, Mesos, YARN, Kafka, R
  Surface and share patterns and idioms of these well-designed, distributed, big data components

All Slides and Code Are Available!

advancedspark.com
slideshare.net/cfregly
github.com/fluxcapacitor
hub.docker.com/r/fluxcapacitor

What is “After Dark”?
Spark-based, Advanced Analytics Reference App
End-to-End, Scalable, Real-time Big Data Pipeline
Demonstration of Spark & Related Big Data Projects

Tools of This Talk
  Kafka
  Redis
  Docker
  Ganglia
  Cassandra
  Parquet, JSON, ORC, Avro
  Apache Zeppelin Notebooks
  Spark SQL, DataFrames, Hive
  ElasticSearch, Logstash, Kibana
  Spark ML, GraphX, Stanford CoreNLP
…
hub.docker.com/r/fluxcapacitor

Themes of this Talk
 Filter
 Off-Heap
 Parallelize
 Approximate
 Find Similarity
 Minimize Seeks
 Maximize Scans
 Customize for Workload
 Tune Performance At Every Layer
  Be Nice, Collaborate!
Like a Mom!!

Presentation Outline
 Spark Core: Tuning & Mechanical Sympathy
 Spark SQL: Query Optimizing & Catalyst
 Spark Streaming: Scaling & Approximations
 Spark ML: Featurizing & Recommendations

Spark Core: Tuning & Mechanical Sympathy
Understand and Acknowledge Mechanical Sympathy

Study AlphaSort and 100 TB GraySort Challenge

Dive Deep into Project Tungsten

Mechanical Sympathy
Hardware and software working together in harmony.

- Martin Thompson

http://mechanical-sympathy.blogspot.com

Whatever your data structure, my array will beat it.

- Scott Meyers

Every C++ Book, basically

Hair
Sympathy
- Bruce Jenner

Spark and Mechanical Sympathy
Project  
Tungsten
(Spark 1.4-1.6+)
GraySort
Challenge
(Spark 1.1-1.2)
Minimize Memory and GC
Maximize CPU Cache Locality
Saturate Network I/O
Saturate Disk I/O

AlphaSort Technique: Sort 100 Bytes Recs
Value
Ptr
Key
Dereference Not Required!
AlphaSort

List [(Key, Pointer)]

Key is directly available for comparison
Naïve

List [Pointer]

Must dereference key for comparison
Ptr
Dereference for Key Comparison
Key

CPU Cache Line and Memory Sympathy
Key (10 bytes) + Pointer (4 bytes*) = 14 bytes
*4 bytes with Compressed OOPs

Key
Ptr
Not CPU Cache-line Friendly!
Ptr
Key-Prefix
2x CPU Cache-line Friendly!
Key-Prefix (4 bytes) + Pointer (4 bytes)

= 8 bytes
Key (10 bytes)+Pad (2 bytes)+Pointer (4 bytes) 
= 16 bytes
Key
Ptr
Pad
/Pad
CPU Cache-line Friendly!
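
The key-prefix + pointer layout can be sketched in plain Scala. This is an illustrative sketch, not Spark's implementation: `AlphaSortSketch` and its helpers are hypothetical names, records are assumed to hold at least 4 key bytes, and the "pointer" is simply an array index. Each record is reduced to one 8-byte word (4-byte prefix, 4-byte index), the dense array of words is sorted, and full keys are dereferenced only when prefixes tie.

```scala
import java.util.Arrays

object AlphaSortSketch {
  // Pack the first 4 key bytes (big-endian, unsigned) into the high 32 bits
  // and the record index (the "pointer") into the low 32 bits.
  def pack(record: Array[Byte], index: Int): Long = {
    var prefix = 0L
    var i = 0
    while (i < 4) { prefix = (prefix << 8) | (record(i) & 0xFFL); i += 1 }
    (prefix << 32) | (index & 0xFFFFFFFFL)
  }

  // Slow path: unsigned lexicographic comparison of the full key.
  def compareKeys(a: Array[Byte], b: Array[Byte], keyLen: Int): Int = {
    var i = 0
    var c = 0
    while (c == 0 && i < keyLen) { c = (a(i) & 0xFF) - (b(i) & 0xFF); i += 1 }
    c
  }

  // Returns record indices in ascending key order.
  def sortedIndices(records: Array[Array[Byte]], keyLen: Int): Array[Int] = {
    // Flipping the sign bit makes signed Long order match unsigned prefix order.
    val packed = Array.tabulate(records.length)(i => pack(records(i), i) ^ Long.MinValue)
    Arrays.sort(packed) // compares 8-byte words only, no dereferencing
    val prefixes = packed.map(p => (p ^ Long.MinValue) >>> 32)
    val out = packed.map(p => (p ^ Long.MinValue).toInt)
    // Dereference full keys only inside runs of equal prefixes.
    var start = 0
    while (start < out.length) {
      var end = start + 1
      while (end < out.length && prefixes(end) == prefixes(start)) end += 1
      if (end - start > 1) {
        val run = out.slice(start, end)
          .sortWith((x, y) => compareKeys(records(x), records(y), keyLen) < 0)
        System.arraycopy(run, 0, out, start, run.length)
      }
      start = end
    }
    out
  }
}
```

With a 64-byte cache line, eight of these 8-byte entries fit per line, versus four 16-byte padded entries, which is where the "2x" on the slide comes from.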

Performance Comparison

Similar Trick: Direct Cache Access (DCA)
Pull out packet header alongside pointer to payload

CPU Cache Line Sizes
My 
Laptop
My 
SoftLayer 
BareMetal

Cache Hits: Sequential v Random Access

Mechanical Sympathy
CPU Cache Lines and Matrix Multiplication

CPU Cache Naïve Matrix Multiplication
// Dot product of each row of A with each column of B
for (i <- 0 until numRowsA)
  for (j <- 0 until numColsB)
    for (k <- 0 until numColsA)
      res(i)(j) += matA(i)(k) * matB(k)(j)

Bad: the inner loop walks down a column of matB,
so each access touches a different CPU cache line
and pre-fetching is ineffective

CPU Cache Friendly Matrix Multiplication

// Transpose B
for (i <- 0 until numRowsB)
  for (j <- 0 until numColsB)
    matBT(j)(i) = matB(i)(j)

// Modify dot product calculation for B Transpose
for (i <- 0 until numRowsA)
  for (j <- 0 until numColsB)
    for (k <- 0 until numColsA)
      res(i)(j) += matA(i)(k) * matBT(j)(k)
Good: Full CPU cache line, effective prefetching
OLD: res(i)(j) += matA(i)(k) * matB(k)(j)
Reference j before k
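
A self-contained version of the two loops may help when reproducing the comparison. This is a minimal sketch under stated assumptions: `MatMulSketch`, `naive`, and `transposed` are hypothetical names, and matrices are `Array[Array[Double]]`, where each row is its own contiguous array.

```scala
object MatMulSketch {
  type Matrix = Array[Array[Double]]

  // Naive: the inner loop reads b(k)(j) for increasing k, jumping between rows of b.
  def naive(a: Matrix, b: Matrix): Matrix = {
    val (rowsA, colsA, colsB) = (a.length, b.length, b(0).length)
    val res = Array.ofDim[Double](rowsA, colsB)
    for (i <- 0 until rowsA; j <- 0 until colsB; k <- 0 until colsA)
      res(i)(j) += a(i)(k) * b(k)(j)
    res
  }

  // Cache-friendly: transpose b once, then scan a row of a and a row of bt sequentially.
  def transposed(a: Matrix, b: Matrix): Matrix = {
    val (rowsA, colsA, colsB) = (a.length, b.length, b(0).length)
    val bt = Array.ofDim[Double](colsB, colsA)
    for (k <- 0 until colsA; j <- 0 until colsB) bt(j)(k) = b(k)(j)
    val res = Array.ofDim[Double](rowsA, colsB)
    for (i <- 0 until rowsA; j <- 0 until colsB; k <- 0 until colsA)
      res(i)(j) += a(i)(k) * bt(j)(k)
    res
  }
}
```

Both functions return the same product; only the memory access pattern of the inner loop differs, which is what the perf counters in the demo measure.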

Instrumenting and Monitoring CPU
Use Linux perf command!
http://www.brendangregg.com/blog/2015-11-06/java-mixed-mode-flame-graphs.html

Demo!
Compare CPU Naïve & Cache-Friendly Matrix Multiplication

Results of Matrix Multiply Comparison
Naïve Matrix Multiply
Cache-Friendly Matrix Multiply
~27x
~13x
~13x
~2x
perf stat --event L1-dcache-load-misses,L1-dcache-prefetch-misses,LLC-load-misses,LLC-prefetch-misses,cache-misses,stalled-cycles-frontend
(JVM flag: -XX:-Inline)
~10x
55 hp
550 hp

Mechanical Sympathy
CPU Cache Lines and Lock-Free Thread Sync

CPU Cache Naïve Tuple Counters
object CacheNaiveTupleIncrement {
var tuple = (0,0)
…

def increment(leftIncrement: Int, rightIncrement: Int) : (Int, Int) = {
this.synchronized {
tuple = (tuple._1 + leftIncrement, tuple._2 + rightIncrement)
tuple
}
}
}

CPU Cache Naïve Case Class Counters
case class MyTuple(left: Int, right: Int)

object CacheNaiveCaseClassCounters {
var tuple = new MyTuple(0,0)
…

def increment(leftIncrement: Int, rightIncrement: Int) : MyTuple = {
this.synchronized {
tuple = new MyTuple(tuple.left + leftIncrement,

tuple.right + rightIncrement)
tuple
}
}
}

CPU Cache Friendly Lock-Free Counters
object CacheFriendlyLockFreeCounters {
  // a single Long (8-bytes) will maintain 2 separate Ints (4-bytes each)
  val tuple = new AtomicLong()
  …
  def increment(leftIncrement: Int, rightIncrement: Int) : Long = {
    var originalLong = 0L
    var updatedLong = 0L
    do {
      originalLong = tuple.get()
      val originalRightInt = originalLong.toInt // cast originalLong to Int to get right counter
      val originalLeftInt = (originalLong >>> 32).toInt // shift right to get left counter
      val updatedRightInt = originalRightInt + rightIncrement // increment right counter
      val updatedLeftInt = originalLeftInt + leftIncrement // increment left counter
      // left counter goes in the high 32 bits, right counter in the low 32 bits;
      // the mask stops a negative right counter from sign-extending into the left half
      updatedLong = (updatedLeftInt.toLong << 32) | (updatedRightInt & 0xFFFFFFFFL)
    } while (!tuple.compareAndSet(originalLong, updatedLong)) // retry if another thread won
    updatedLong
  }
}
Q: Why not @volatile long?
A: @volatile makes each single read or write of a 64-bit long
atomic and visible, but a read-modify-write such as an
increment is still not atomic; compareAndSet is needed

Demo!
Compare CPU Naïve & Cache-Friendly Tuple Counter Sync

Results of Counters Comparison
Naïve Tuple Counters

Naïve Case Class Counters

Cache Friendly Lock-Free Counters
~2x
~1.5x
~3.5x
~2x
~2x
~1.5x
~1.5x
~1.5x

Profiling Visualizations: Flame Graphs
Example: Spark Word Count
Java Stack Traces
(-XX:+PreserveFramePointer)
Plateaus 
are Bad!!

100TB Daytona GraySort Challenge
Focus on Network and Disk I/O Optimizations
Improve Data Structs/Algos for Sort & Shuffle
Saturate Network and Disk Controllers

Winning Results
Spark Goals
  Saturate Network I/O
  Saturate Disk I/O
(2013) (2014)

Winning Hardware Configuration
Compute

206 Workers, 1 Master (AWS EC2 i2.8xlarge)

32 Intel Xeon CPU E5-2670 @ 2.5 GHz

244 GB RAM, 8 x 800GB SSD, RAID 0 striping, ext4

3 GBps mixed read/write disk I/O per node
Network

AWS Placement Groups, VPC, Enhanced Networking

Single Root I/O Virtualization (SR-IOV)

10 Gbps, low latency, low jitter (iperf: ~9.5 Gbps)

Winning Software Configuration
Spark 1.2, OpenJDK 1.7
Disable caching, compression, speculative execution, shuffle spill
Force NODE_LOCAL task scheduling for optimal data locality
HDFS 2.4.1 short-circuit local reads, 2x replication
Empirically chose between 4-6 partitions per cpu

206 nodes * 32 cores = 6592 cores

6592 cores * 4 = 26,368 partitions

6592 cores * 6 = 39,552 partitions

6592 cores * 4.25 ≈ 28,000 partitions (empirical best)
Range partitioning takes advantage of sequential keyspace

Required ~10 s of sampling 79 keys from each partition
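
Range partitioning from a key sample can be sketched as follows. This is a simplified illustration, not Spark's `RangePartitioner`: `RangePartitionSketch`, `splitPoints`, and `partitionFor` are hypothetical names, and keys are assumed to be `Long` values. Split points are taken at even intervals from the sorted sample, and each key is routed by binary search.

```scala
import java.util.Arrays

object RangePartitionSketch {
  // Pick numPartitions - 1 split points at even intervals of the sorted sample.
  def splitPoints(sample: Array[Long], numPartitions: Int): Array[Long] = {
    val sorted = sample.sorted
    Array.tabulate(numPartitions - 1) { i =>
      sorted(((i + 1).toLong * sorted.length / numPartitions).toInt)
    }
  }

  // Keys <= splits(0) go to partition 0, keys in (splits(0), splits(1)] to 1, and so on.
  def partitionFor(key: Long, splits: Array[Long]): Int = {
    val pos = Arrays.binarySearch(splits, key)
    if (pos >= 0) pos else -pos - 1
  }
}
```

Because partition i only holds keys smaller than those of partition i + 1, sorting each partition locally yields a globally sorted output with no final merge.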

New Sort Shuffle Manager for Spark 1.2
Original “hash-based”

New “sort-based”

①  Use less OS resources (socket buffers, file descriptors)
②  TimSort partitions in-memory
③  MergeSort partitions on-disk into a single master file
④  Serve partitions from master file: seek once, sequential scan
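
The "single master file plus index" idea in steps ③ and ④ can be illustrated in memory. This is a toy sketch, not Spark's shuffle writer: `SortShuffleFileSketch` is a hypothetical name, a byte array stands in for the file, and the TimSort and on-disk merge steps are replaced by one in-memory sort by partition id.

```scala
import java.util.Arrays

object SortShuffleFileSketch {
  // Order records by partition id, concatenate them into one "file",
  // and record where each partition starts (offsets has numPartitions + 1 entries).
  def write(records: Seq[(Int, Array[Byte])], numPartitions: Int): (Array[Byte], Array[Int]) = {
    val sorted = records.sortBy(_._1) // stable sort keeps arrival order within a partition
    val offsets = new Array[Int](numPartitions + 1)
    sorted.foreach { case (p, bytes) => offsets(p + 1) += bytes.length }
    for (p <- 1 to numPartitions) offsets(p) += offsets(p - 1) // prefix sums
    (sorted.flatMap(_._2.toSeq).toArray, offsets)
  }

  // A reducer needs one seek (offsets(partition)) followed by one sequential read.
  def read(file: Array[Byte], offsets: Array[Int], partition: Int): Array[Byte] =
    Arrays.copyOfRange(file, offsets(partition), offsets(partition + 1))
}
```

Each mapper then holds one data file and one small index instead of one file per reducer, which is what reduces file descriptors and socket buffers.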

Asynchronous Network Module
Switch to asynchronous Netty vs. synchronous java.nio
Switch to zero-copy epoll

Use only kernel-space between disk and network controllers
Custom memory management

spark.shuffle.blockTransferService=netty
Spark-Netty Performance Tuning

spark.shuffle.io.preferDirectBuffers=true

Reuse off-heap buffers

spark.shuffle.io.numConnectionsPerPeer=8 (for example)

Increase to saturate hosts with multiple disks (8x800 SSD)
Details in
SPARK-2468

Custom Algorithms and Data Structures
Optimized for sort & shuffle workloads
o.a.s.util.collection.TimSort[K,V]

Based on JDK 1.7 TimSort

Performs best with partially-sorted runs

Optimized for elements of (K,V) pairs

Sorts impl of SortDataFormat (e.g. KVArraySortDataFormat)
o.a.s.util.collection.AppendOnlyMap

Open addressing hash, quadratic probing

Array of [(K, V), (K, V)]

Good memory locality

Keys never removed, values only append
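
The AppendOnlyMap bullets describe an open-addressing table that keeps keys and values side by side in one array. A minimal sketch of that layout follows; it is not the Spark class. `AppendOnlyMapSketch.CountMap` is a hypothetical name, keys are `String`, values are `Int`, the capacity is fixed (a real implementation would grow and rehash), and the probe step grows by one on each collision, which is an assumed form of quadratic probing.

```scala
object AppendOnlyMapSketch {
  final class CountMap(capacity: Int) {
    require(capacity > 1 && (capacity & (capacity - 1)) == 0, "capacity must be a power of two")
    private val mask = capacity - 1
    // Keys and values interleaved in one array: [k0, v0, k1, v1, ...]
    private val data = new Array[AnyRef](2 * capacity)
    private var size = 0

    // Insert a new key or update an existing value in place; keys are never removed.
    def changeValue(key: String, update: Option[Int] => Int): Int = {
      var pos = key.hashCode & mask
      var delta = 1
      var result: Option[Int] = None
      while (result.isEmpty) {
        val cur = data(2 * pos)
        if (cur == null) { // empty slot: insert
          require(size < capacity - 1, "full: a real implementation would grow and rehash")
          val v = update(None)
          data(2 * pos) = key
          data(2 * pos + 1) = Int.box(v)
          size += 1
          result = Some(v)
        } else if (cur == key) { // existing key: update the neighbouring value slot
          val v = update(Some(data(2 * pos + 1).asInstanceOf[Int]))
          data(2 * pos + 1) = Int.box(v)
          result = Some(v)
        } else { // collision: probe with a step that grows by one each time
          pos = (pos + delta) & mask
          delta += 1
        }
      }
      result.get
    }
  }
}
```

A word count is then `words.foreach(w => m.changeValue(w, _.getOrElse(0) + 1))`; because the value sits next to its key, a lookup and its update land in the same region of the array.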

Daytona GraySort Challenge Goal Success

1.1 GBps/node network I/O (Reducers)
Theoretical max = 1.25 GBps for 10 Gbps Ethernet
3 GBps/node disk I/O (Mappers)
Aggregate Cluster Network I/O!
220 GBps / 206 nodes ≈ 1.1 GBps per node

Shuffle Performance Tuning Tips
Hash Shuffle Manager (Deprecated)

spark.shuffle.consolidateFiles (Mapper)

o.a.s.shuffle.FileShuffleBlockResolver
Intermediate Files

Increase spark.shuffle.file.buffer (Reducer)

Increase spark.reducer.maxSizeInFlight if memory allows
Use Smaller Number of Larger Executors

Minimizes intermediate files and overall shuffle

More opportunity for PROCESS_LOCAL
SQL: BroadcastHashJoin vs. ShuffledHashJoin

spark.sql.autoBroadcastJoinThreshold

Use DataFrame.explain(true) or EXPLAIN to verify

Many Threads
(1 per CPU)

Project Tungsten
Data Structs & Algos Operate Directly on Byte Arrays
Maximize CPU Cache Locality, Minimize GC
Utilize Dynamic Code Generation
SPARK-7076
(Spark 1.4)

Quick Review of Project Tungsten Jiras

SPARK-7076
(Spark 1.4)

Why is CPU the Bottleneck?
CPU is used for serialization, hashing, compression!

Network and Disk I/O bandwidth are relatively high

GraySort optimizations improved network & shuffle

Partitioning, pruning, and predicate pushdowns

Binary, compressed, columnar file formats (Parquet)

Yet Another Spark Shuffle Manager!
spark.shuffle.manager =

hash (Deprecated)

< 10,000 reducers

Output partition file hashes the key of (K,V) pair

Mapper creates an output file per partition

Leads to M*P output files for all partitions

sort (GraySort Challenge)

> 10,000 reducers

Default from Spark 1.2-1.5

Mapper creates single output file for all partitions

Minimizes OS resources, netty + epoll optimizes network I/O, disk I/O, and memory

Uses custom data structures and algorithms for sort-shuffle workload

Wins Daytona GraySort Challenge

tungsten-sort (Project Tungsten)

Default since 1.5

Modification of existing sort-based shuffle

Uses sun.misc.Unsafe for self-managed memory and garbage collection

Maximize CPU utilization and cache locality with AlphaSort-inspired binary data structures/algorithms

Perform joins, sorts, and other operators on both serialized and compressed byte buffers

CPU & Memory Optimizations
Custom Managed Memory

Reduces GC overhead

Both on and off heap

Exact size calculations
Direct Binary Processing

Operate on serialized/compressed arrays

Kryo can reorder/sort serialized records

LZF can reorder/sort compressed records
More CPU Cache-aware Data Structs & Algorithms

o.a.s.sql.catalyst.expressions.UnsafeRow

o.a.s.unsafe.map.BytesToBytesMap
Code Generation (default in 1.5)

Generate source code from overall query plan

100+ UDFs converted to use code generation
UnsafeFixedWidthAggregationMap
TungstenAggregationIterator
CodeGenerator
GenerateUnsafeRowJoiner
UnsafeSortDataFormat
UnsafeShuffleSortDataFormat
PackedRecordPointer
UnsafeRow
UnsafeInMemorySorter
UnsafeExternalSorter
UnsafeShuffleWriter
Mostly Same Join Code,
UnsafeProjection
UnsafeShuffleManager
UnsafeShuffleInMemorySorter
UnsafeShuffleExternalSorter
Details in
SPARK-7075

sun.misc.Unsafe
Info

addressSize()

pageSize()
Objects

allocateInstance()

objectFieldOffset()
Classes

staticFieldOffset()

defineClass()

defineAnonymousClass()

ensureClassInitialized()
Synchronization

monitorEnter()

tryMonitorEnter()

monitorExit()

compareAndSwapInt()

putOrderedInt()
Arrays

arrayBaseOffset()

arrayIndexScale()
Memory

allocateMemory()

copyMemory()

freeMemory()

getAddress() – not guaranteed after GC

getInt()/putInt()

getBoolean()/putBoolean()

getByte()/putByte()

getShort()/putShort()

getLong()/putLong()

getFloat()/putFloat()

getDouble()/putDouble()

getObjectVolatile()/putObjectVolatile()
Used by  
Tungsten

Spark + sun.misc.Unsafe
org.apache.spark.sql.execution.
aggregate.SortBasedAggregate
aggregate.TungstenAggregate
aggregate.AggregationIterator
aggregate.udaf
aggregate.utils
SparkPlanner
rowFormatConverters
UnsafeFixedWidthAggregationMap
UnsafeExternalRowSorter
UnsafeKeyValueSorter
UnsafeKVExternalSorter
local.ConvertToUnsafeNode
local.ConvertToSafeNode
local.HashJoinNode
local.ProjectNode
local.LocalNode
local.BinaryHashJoinNode
local.NestedLoopJoinNode
joins.HashJoin
joins.HashSemiJoin
joins.HashedRelation
joins.BroadcastHashJoin
joins.ShuffledHashOuterJoin (not yet converted)
joins.BroadcastHashOuterJoin
joins.BroadcastLeftSemiJoinHash
joins.BroadcastNestedLoopJoin
joins.SortMergeJoin
joins.LeftSemiJoinBNL
joins.SortMergeOuterJoin
Exchange
SparkPlan
UnsafeRowSerializer
SortPrefixUtils
sort
basicOperators
aggregate.SortBasedAggregationIterator
aggregate.TungstenAggregationIterator
datasources.WriterContainer
datasources.json.JacksonParser
datasources.jdbc.JDBCRDD
org.apache.spark.
unsafe.Platform
unsafe.KVIterator
unsafe.array.LongArray
unsafe.array.ByteArrayMethods
unsafe.array.BitSet
unsafe.bitset.BitSetMethods
unsafe.hash.Murmur3_x86_32
unsafe.map.BytesToBytesMap
unsafe.map.HashMapGrowthStrategy
unsafe.memory.TaskMemoryManager
unsafe.memory.ExecutorMemoryManager
unsafe.memory.MemoryLocation
unsafe.memory.UnsafeMemoryAllocator
unsafe.memory.MemoryAllocator (trait/interface)
unsafe.memory.MemoryBlock
unsafe.memory.HeapMemoryAllocator
unsafe.memory.ExecutorMemoryManager
unsafe.sort.RecordComparator
unsafe.sort.PrefixComparator
unsafe.sort.PrefixComparators
unsafe.sort.UnsafeSorterSpillWriter
serializer.DummySerializationInstance
shuffle.unsafe.UnsafeShuffleManager
shuffle.unsafe.UnsafeShuffleSortDataFormat
shuffle.unsafe.SpillInfo
shuffle.unsafe.UnsafeShuffleWriter
shuffle.unsafe.UnsafeShuffleExternalSorter
shuffle.unsafe.PackedRecordPointer
shuffle.ShuffleMemoryManager
util.collection.unsafe.sort.UnsafeSorterSpillMerger
util.collection.unsafe.sort.UnsafeSorterSpillReader
util.collection.unsafe.sort.UnsafeSorterSpillWriter
util.collection.unsafe.sort.UnsafeShuffleInMemorySorter
util.collection.unsafe.sort.UnsafeInMemorySorter
util.collection.unsafe.sort.RecordPointerAndKeyPrefix
util.collection.unsafe.sort.UnsafeSorterIterator
network.shuffle.ExternalShuffleBlockResolver
scheduler.Task
rdd.SqlNewHadoopRDD
executor.Executor
org.apache.spark.sql.catalyst.expressions.
regexpExpressions
BoundAttribute
SortOrder
SpecializedGetters
ExpressionEvalHelper
UnsafeArrayData
UnsafeReaders
UnsafeMapData
Projection
LiteralGenerator
UnsafeRow
JoinedRow
SpecializedGetters
InputFileName
SpecificMutableRow
codegen.CodeGenerator
codegen.GenerateProjection
codegen.GenerateUnsafeRowJoiner
codegen.GenerateSafeProjection
codegen.GenerateUnsafeProjection
codegen.BufferHolder
codegen.UnsafeRowWriter
codegen.UnsafeArrayWriter
complexTypeCreator
rows
literals
misc
stringExpressions
Over 200 source
files affected!!

Traditional Java Object Row Layout
4-byte String

Multi-field Object

Custom Data Structures for Workload

UnsafeRow
(Dense Binary Row)

TaskMemoryManager
(Virtual Memory Address)

BytesToBytesMap
(Dense Binary HashMap)
Dense, 8-bytes per field (word-aligned)
Key
Ptr
AlphaSort-Style (Key + Pointer)
OS-Style Memory Paging

UnsafeRow Layout Example
Pre-Tungsten

Tungsten

Custom Memory Management
o.a.s.memory. 

TaskMemoryManager & MemoryConsumer

Memory management: virtual memory allocation, paging

Off-heap: direct 64-bit address

On-heap: 13-bit page num + 27-bit page offset
o.a.s.shuffle.sort.

PackedRecordPointer

64-bit word

(24-bit partition key, (13-bit page num, 27-bit page offset))
o.a.s.unsafe.types.

UTF8String

Primitive Array[Byte]
2^13 pages * 2^27 page size = 1 TB RAM per Task
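
The 64-bit word described above (24-bit partition, 13-bit page number, 27-bit page offset) can be packed and unpacked with shifts and masks. The sketch below follows the bit widths on the slide; `PackedPointerSketch` and its method names are hypothetical and this is not the Spark class itself.

```scala
object PackedPointerSketch {
  val OffsetBits = 27    // 2^27 bytes = 128 MB per page
  val PageBits = 13      // 2^13 = 8192 pages
  val PartitionBits = 24 // 2^24 partitions

  // Layout, high to low: [24-bit partition][13-bit page number][27-bit offset in page]
  def pack(partition: Int, page: Int, offset: Long): Long = {
    require(partition >= 0 && partition < (1 << PartitionBits))
    require(page >= 0 && page < (1 << PageBits))
    require(offset >= 0 && offset < (1L << OffsetBits))
    (partition.toLong << (PageBits + OffsetBits)) | (page.toLong << OffsetBits) | offset
  }

  def partition(p: Long): Int = (p >>> (PageBits + OffsetBits)).toInt
  def page(p: Long): Int = ((p >>> OffsetBits) & ((1 << PageBits) - 1)).toInt
  def offset(p: Long): Long = p & ((1L << OffsetBits) - 1)
}
```

The page number and offset together span 13 + 27 = 40 bits, and 2^40 bytes is the 1 TB per task quoted above. With the partition id in the high bits, sorting the packed words groups records by partition without touching the records themselves.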

Aggregations
o.a.s.sql.execution. 


Uses BytesToBytesMap

In-place updates of serialized data

No object creation on hot-path

Improved external agg support

No OOM’s for large, single key aggs
o.a.s.sql.catalyst.expressions.codegen.

GenerateUnsafeRowJoiner

Combine 2 UnsafeRows into 1
o.a.s.sql.execution.aggregate.

TungstenAggregate & TungstenAggregationIterator

Operates directly on serialized, binary UnsafeRow

2 Steps: hash-based agg (grouping), then sort-based agg

Supports spilling and external merge sorting

Equality
Bitwise comparison on UnsafeRow

No need to calculate equals(), hashCode()

Row 1
Equals!
Row 2
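
Why bitwise comparison is enough can be shown with a simplified dense row. This sketch is an assumption-laden stand-in for UnsafeRow: `DenseRowSketch` is a hypothetical name, all fields are nullable `Long` values, a single 8-byte word holds the null bits, and there is no variable-length section.

```scala
import java.nio.ByteBuffer
import java.util.Arrays

object DenseRowSketch {
  // One 8-byte null-bitset word followed by one 8-byte word per field.
  def encode(fields: Seq[Option[Long]]): Array[Byte] = {
    require(fields.length <= 64, "this sketch uses a single bitset word")
    val buf = ByteBuffer.allocate(8 * (1 + fields.length))
    var nulls = 0L
    fields.zipWithIndex.foreach { case (f, i) => if (f.isEmpty) nulls |= 1L << i }
    buf.putLong(nulls)
    fields.foreach(f => buf.putLong(f.getOrElse(0L))) // null fields are zeroed
    buf.array()
  }

  // Two rows are equal exactly when their bytes are equal.
  def rowsEqual(a: Array[Byte], b: Array[Byte]): Boolean = Arrays.equals(a, b)
}
```

Zeroing null fields is what makes the encoding canonical, so equal rows always produce identical bytes and no per-field equals() or hashCode() call is needed.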

Joins
Surprisingly, not many code changes

o.a.s.sql.catalyst.expressions.

UnsafeProjection

Converts InternalRow to UnsafeRow

Sorting
o.a.s.util.collection.unsafe.sort.

UnsafeSortDataFormat

UnsafeInMemorySorter


RecordPointerAndKeyPrefix 

UnsafeShuffleWriter
AlphaSort-Style Cache Friendly

Ptr
Key-Prefix
2x CPU Cache-line Friendly!
Using multiple subclasses of SortDataFormat
simultaneously will prevent JIT inlining.
This affects sort & shuffle performance.
Supports merging compressed records
if compression CODEC supports it (LZF)

Spilling
Efficient Spilling

Exact data size is known

No need to maintain heuristics & approximations

Controls amount of spilling
Spill merge on compressed, binary records!

If compression CODEC supports it

UnsafeFixedWidthAggregationMap.getPeakMemoryUsedBytes()
Exact Peak Memory
for Spark Jobs

Code Generation
Problem
Boxing causes excessive object creation
Expensive expression tree evals per row
JVM can’t inline polymorphic impls
Solution
Codegen by-passes virtual function calls
Defer source code generation to each operator, UDF, UDAF
Use Scala quasiquote macros for Scala AST source code gen
Rewrite and optimize code for overall plan, 8-byte align, etc
Use Janino to compile generated source code into bytecode
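
The contrast between per-row tree evaluation and generated source can be sketched with a toy expression tree. Everything here is hypothetical (`CodegenSketch`, `Col`, `Lit`, `Add`, `Mul`, `genMethod`); it only mirrors the idea that each expression emits a source fragment, and it stops short of compiling the string, which in Spark is Janino's job.

```scala
object CodegenSketch {
  sealed trait Expr {
    def eval(row: Array[Long]): Long // interpreted path: one virtual call per node per row
    def genCode: String              // codegen path: a Java source fragment for this node
  }
  final case class Col(i: Int) extends Expr {
    def eval(row: Array[Long]): Long = row(i)
    def genCode: String = s"row[$i]"
  }
  final case class Lit(v: Long) extends Expr {
    def eval(row: Array[Long]): Long = v
    def genCode: String = s"${v}L"
  }
  final case class Add(l: Expr, r: Expr) extends Expr {
    def eval(row: Array[Long]): Long = l.eval(row) + r.eval(row)
    def genCode: String = s"(${l.genCode} + ${r.genCode})"
  }
  final case class Mul(l: Expr, r: Expr) extends Expr {
    def eval(row: Array[Long]): Long = l.eval(row) * r.eval(row)
    def genCode: String = s"(${l.genCode} * ${r.genCode})"
  }

  // The whole tree collapses into one flat method body with no virtual calls.
  def genMethod(e: Expr): String =
    s"public long apply(long[] row) { return ${e.genCode}; }"
}
```

For `Add(Col(0), Mul(Col(1), Lit(2L)))` the interpreted path makes five `eval` calls per row, while the generated method is a single arithmetic expression over primitives that the JIT can inline.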

Spark SQL UDF Code Generation
100+ UDFs now generating code
More to come in Spark 1.6+
Details in
SPARK-8159, SPARK-9571
Each Implements
Expression.genCode()!

Creating a Custom UDF with Codegen
Study existing implementations

https://github.com/apache/spark/pull/7214/files
Extend base trait

o.a.s.sql.catalyst.expressions.Expression.genCode()
Register the function

o.a.s.sql.catalyst.analysis.FunctionRegistry.registerFunction()
Augment DataFrame with new UDF (Scala implicits)

o.a.s.sql.functions.scala
Don’t forget about Python!

python.pyspark.sql.functions.py

Who Benefits from Project Tungsten?
Users of DataFrames

All Spark SQL Queries

Catalyst

All RDDs

Serialization, Compression, and Aggregations

Project Tungsten Performance Results
Query Time

Garbage
Collection
OOM’d on
Large Dataset!

Spark SQL: Query Optimizing & Catalyst
Explore DataFrames/Datasets/DataSources, Catalyst

Review Partitions, Pruning, Pushdowns, File Formats

Create a Custom DataSource API Implementation

DataFrames
Inspired by R and Pandas DataFrames

Schema-aware
Cross language support

SQL, Python, Scala, Java, R
Levels performance of Python, Scala, Java, and R

Generates JVM bytecode vs serializing to Python
DataFrame is container for logical plan

Lazy transformations represented as tree
Only logical plan is sent from Python -> JVM

Only results returned from JVM -> Python
UDF and UDAF Support

Custom UDF support using registerFunction()

Experimental UDAF support (e.g. HyperLogLog)
Supports existing Hive metastore if available

Small, file-based Hive metastore created if not available
*DataFrame.rdd returns underlying RDD if needed
Use DataFrames
instead of RDDs!!

Spark and Hive
Early days, Shark was “Hive on Spark”
Hive Optimizer slowly replaced with Catalyst
Always use HiveContext – even if not using Hive!

If no Hive, a small Hive metastore file is created
Spark 1.5+ supports all Hive versions 0.12+

Separate classloaders for isolation

Breaks dependency between Spark internal Hive version and User’s external Hive version

Catalyst Optimizer

Optimize DataFrame Transformation Tree
Subquery elimination: use aliases to collapse subqueries
Constant folding: replace expression with constant
Simplify filters: remove unnecessary filters
Predicate/filter pushdowns: avoid unnecessary data load
Projection collapsing: avoid unnecessary projections
Create Custom Rules
Rules are Scala Case Classes
val newPlan = MyFilterRule(analyzedPlan)
Implements
o.a.s.sql.catalyst.rules.Rule
Apply to any plan stage
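
The shape of a rule (a pattern match that rewrites part of a tree) can be shown without Spark on the classpath. This is a toy model, not Catalyst: `CatalystRuleSketch`, `transformUp`, and the three node types are hypothetical, and the rule shown is constant folding over integer addition only.

```scala
object CatalystRuleSketch {
  sealed trait Expr
  final case class Lit(v: Int) extends Expr
  final case class Attr(name: String) extends Expr
  final case class Add(l: Expr, r: Expr) extends Expr

  // Apply a rule bottom-up: rewrite the children first, then the node itself.
  def transformUp(e: Expr)(rule: PartialFunction[Expr, Expr]): Expr = {
    val withNewChildren = e match {
      case Add(l, r) => Add(transformUp(l)(rule), transformUp(r)(rule))
      case leaf      => leaf
    }
    rule.applyOrElse(withNewChildren, (x: Expr) => x)
  }

  // Constant folding: an addition of two literals becomes one literal.
  val constantFolding: PartialFunction[Expr, Expr] = {
    case Add(Lit(a), Lit(b)) => Lit(a + b)
  }
}
```

Because the rule only names the pattern it cares about, nodes it does not match pass through unchanged, which is why small rules compose into an optimizer.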

DataSources API
Relations (o.a.s.sql.sources.interfaces.scala)

BaseRelation (abstract class): Provides schema of data

TableScan (impl): Read all data from source

PrunedFilteredScan (impl): Column pruning & predicate pushdowns

InsertableRelation (impl): Insert/overwrite data based on SaveMode

RelationProvider (trait/interface): Handle options, BaseRelation factory
Execution (o.a.s.sql.execution.commands.scala)

RunnableCommand (trait/interface): Common commands like EXPLAIN

ExplainCommand(impl: case class)

CacheTableCommand(impl: case class)
Filters (o.a.s.sql.sources.filters.scala)

Filter (abstract class): Handles all predicates/filters supported by this source

EqualTo (impl)

GreaterThan (impl)

StringStartsWith (impl)
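
What a PrunedFilteredScan-style source receives and returns can be modelled in a few lines. This is a toy model that reuses the filter names from the slide but none of Spark's classes: `PrunedFilteredScanSketch`, the `Row` alias, and this `buildScan` signature are assumptions for illustration.

```scala
object PrunedFilteredScanSketch {
  sealed trait Filter
  final case class EqualTo(attribute: String, value: Any) extends Filter
  final case class GreaterThan(attribute: String, value: Int) extends Filter
  final case class StringStartsWith(attribute: String, value: String) extends Filter

  type Row = Map[String, Any]

  // Evaluate one pushed-down filter against one row at the source.
  def matches(row: Row, f: Filter): Boolean = f match {
    case EqualTo(a, v)          => row.get(a).contains(v)
    case GreaterThan(a, v)      => row.get(a).exists { case i: Int => i > v; case _ => false }
    case StringStartsWith(a, p) => row.get(a).exists { case s: String => s.startsWith(p); case _ => false }
  }

  // Predicate pushdown (drop rows early) plus column pruning (return only requested columns).
  def buildScan(data: Seq[Row], requiredColumns: Seq[String], filters: Seq[Filter]): Seq[Seq[Any]] =
    data.filter(r => filters.forall(matches(r, _)))
      .map(r => requiredColumns.map(c => r.getOrElse(c, null)))
}
```

The source sees only the columns and predicates the query needs, so rows and fields that would be discarded later are never materialized.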

Native Spark SQL DataSources

Query Plan Debugging
gendersCsvDF.select($"id", $"gender").filter("gender != 'F'").filter("gender != 'M'").explain(true)
DataFrame.queryExecution.logical
DataFrame.queryExecution.analyzed
DataFrame.queryExecution.optimizedPlan
DataFrame.queryExecution.executedPlan

Query Plan Visualization & Metrics
Effectiveness
of Filter
CPU Cache  
Friendly
Binary Format
Cost-based
Join Optimization
Similar to
MapReduce
Map-side Join
Peak Memory for
Joins and Aggs

JSON Data Source
DataFrame

val ratingsDF = sqlContext.read.format("json")
.load("file:/root/pipeline/datasets/dating/ratings.json.bz2")
-- or --
val ratingsDF = sqlContext.read.json 
("file:/root/pipeline/datasets/dating/ratings.json.bz2")
SQL Code
CREATE TABLE genders USING json
OPTIONS
(path "file:/root/pipeline/datasets/dating/genders.json.bz2")

json() convenience method

JDBC Data Source
Add Driver to Spark JVM System Classpath

$ export SPARK_CLASSPATH=<jdbc-driver.jar>

DataFrame

val jdbcConfig = Map("driver" -> "org.postgresql.Driver",
  "url" -> "jdbc:postgresql://hostname:port/database",
  "dbtable" -> "schema.tablename")

sqlContext.read.format("jdbc").options(jdbcConfig).load()

SQL

CREATE TABLE genders USING jdbc  

OPTIONS (url, dbtable, driver, …)

Parquet Data Source
Configuration

spark.sql.parquet.filterPushdown=true

spark.sql.parquet.mergeSchema=true

spark.sql.parquet.cacheMetadata=true

spark.sql.parquet.compression.codec=[uncompressed,snappy,gzip,lzo]
DataFrames

val gendersDF = sqlContext.read.format("parquet")

.load("file:/root/pipeline/datasets/dating/genders.parquet")

gendersDF.write.format("parquet").partitionBy("gender")

.save("file:/root/pipeline/datasets/dating/genders.parquet")
SQL

CREATE TABLE genders USING parquet

OPTIONS

(path "file:/root/pipeline/datasets/dating/genders.parquet")

ORC Data Source
Configuration

spark.sql.orc.filterPushdown=true
DataFrames

val gendersDF = sqlContext.read.format("orc")

.load("file:/root/pipeline/datasets/dating/genders")

gendersDF.write.format("orc").partitionBy("gender")

.save("file:/root/pipeline/datasets/dating/genders")
SQL

CREATE TABLE genders USING orc

OPTIONS

(path "file:/root/pipeline/datasets/dating/genders")

Third-Party Spark SQL DataSources
spark-packages.org

CSV DataSource (Databricks)
Github
https://github.com/databricks/spark-csv
Maven

com.databricks:spark-csv_2.10:1.2.0
Code

val gendersCsvDF = sqlContext.read

.format("com.databricks.spark.csv")

.load("file:/root/pipeline/datasets/dating/gender.csv.bz2")

.toDF("id", "gender")
toDF() is required if CSV does not contain header

ElasticSearch DataSource (Elastic.co)
Github

https://github.com/elastic/elasticsearch-hadoop

Maven

org.elasticsearch:elasticsearch-spark_2.10:2.1.0

Code

val esConfig = Map("pushdown" -> "true", "es.nodes" -> "<hostname>",
  "es.port" -> "<port>")

df.write.format("org.elasticsearch.spark.sql").mode(SaveMode.Overwrite)
  .options(esConfig).save("<index>/<document-type>")

Elasticsearch Tips
Change id field to not_analyzed to avoid indexing
Use term filter to build and cache the query
Perform multiple aggregations in a single request
Adapt scoring function to current trends at query time

AWS Redshift Data Source (Databricks)
Github

https://github.com/databricks/spark-redshift

Maven

com.databricks:spark-redshift:0.5.0

Code

val df: DataFrame = sqlContext.read

.format("com.databricks.spark.redshift")

.option("url", "jdbc:redshift://<hostname>:<port>/<database>…")

.option("query", "select x, count(*) from my_table group by x")

.option("tempdir", "s3n://tmpdir")

.load(...)
UNLOAD and copy to tmp
bucket in S3 enables
parallel reads

DB2 and BigSQL DataSources (IBM)
Coming Soon!

Cassandra DataSource (DataStax)
Github

https://github.com/datastax/spark-cassandra-connector

Maven

com.datastax.spark:spark-cassandra-connector_2.10:1.5.0-M1

Code

ratingsDF.write

.format("org.apache.spark.sql.cassandra")

.mode(SaveMode.Append)

.options(Map("keyspace"->"<keyspace>",

"table"->"<table>")).save(…)

Cassandra Pushdown Support
spark-cassandra-connector/…/o.a.s.sql.cassandra.PredicatePushDown.scala

Pushdown Predicate Rules

1. Only push down non-partition-key column predicates with =, >, <, >=, <= predicate

2. Only push down primary key column predicates with = or IN predicate.

3. If there are regular columns in the pushdown predicates, they should have
at least one EQ expression on an indexed column and no IN predicates.

4. All partition column predicates must be included in the predicates to be pushed down,
only the last part of the partition key can be an IN predicate. For each partition column,

only one predicate is allowed.

5. For cluster column predicates, only last predicate can be non-EQ predicate

including IN predicate, and preceding column predicates must be EQ predicates.

If there is only one cluster column predicate, the predicates could be any non-IN predicate.

6. There are no pushdown predicates if there is any OR condition or NOT IN condition.

7. We're not allowed to push down multiple predicates for the same column if any of them

is equality or IN predicate.

New Cassandra DataSource
By-pass CQL optimized for transactional data

Instead, do bulk reads/writes directly on SSTables

Similar to 5 year old Netflix Open Source project Aegisthus

Promotes Cassandra to first-class Analytics Option

Potentially only part of DataStax Enterprise?!

Please mail a nasty letter to your local DataStax office

Rumor of REST DataSource (Databricks)
Coming Soon?

Ask Michael Armbrust
Spark SQL Lead @ Databricks

Custom DataSource (Me and You!)
Coming Right Now!
DEMO ALERT!!

Create a Custom DataSource
Study Existing Native & Third-Party Data Sources
Native
Spark JDBC (o.a.s.sql.execution.datasources.jdbc)

class JDBCRelation extends BaseRelation

with PrunedFilteredScan

with InsertableRelation
Third-Party
DataStax Cassandra (o.a.s.sql.cassandra)

class CassandraSourceRelation extends BaseRelation

with PrunedFilteredScan

with InsertableRelation!

Demo!
Create a Custom DataSource
Contribute a Custom Data Source
spark-packages.org

Managed by

Contains links to external github projects

Ratings and comments

Declare Spark version support for each package
Examples

https://github.com/databricks/spark-csv

https://github.com/databricks/spark-avro

https://github.com/databricks/spark-redshift
Parquet Columnar File Format
Based on Google Dremel
Collaboration with Twitter and Cloudera
Self-describing, evolving schema
Fast columnar aggregation
Supports filter pushdowns
Columnar storage format
Excellent compression
Types of Compression
Run Length Encoding: Repeated data
Dictionary Encoding: Fixed set of values
Delta, Prefix Encoding: Sorted data
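The three encodings above can be illustrated with a few lines of Python. This is a toy sketch of the ideas only; Parquet's actual on-disk encodings are more involved, and the function names are made up for the example.

```python
from itertools import groupby


def run_length_encode(values):
    """Collapse runs of repeated values into (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


def dictionary_encode(values):
    """Replace each value with its index in a small dictionary of distinct values."""
    dictionary = sorted(set(values))
    index = {value: i for i, value in enumerate(dictionary)}
    return dictionary, [index[value] for value in values]


def delta_encode(sorted_values):
    """Keep the first value, then store only the differences between neighbors."""
    if not sorted_values:
        return []
    deltas = [b - a for a, b in zip(sorted_values, sorted_values[1:])]
    return [sorted_values[0]] + deltas
```

For example, `delta_encode([100, 101, 103, 110])` gives `[100, 1, 2, 7]`, which needs fewer bits per value than the original numbers.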
Demo!
Demonstrate File Formats, Partition Schemes, and Query Plans
Hive JDBC/ODBC ThriftServer
Allow BI Tools to Query and Process Spark Data
Register Permanent Table

CREATE TABLE ratings(fromuserid INT, touserid INT, rating INT)

USING org.apache.spark.sql.json

OPTIONS (path "datasets/dating/ratings.json.bz2")
Register Temp Table

ratingsDF.registerTempTable("ratings_temp")
Configuration

spark.sql.thriftServer.incrementalCollect=true

spark.driver.maxResultSize > 10gb (default: 1g)
Demo!
Query and Process Spark Data from BI Tools
Spark Streaming: Scaling & Approximations
Discuss Delivery Guarantees, Parallelism, and Stability

Compare Receiver and Receiver-less Impls

Demonstrate Stream Approximations

Non-Parallel Receiver Implementation
Receiver Implementation (Kinesis)
  KinesisRDD partitions store relevant offsets
  Single receiver required to see all data/offsets
  Kinesis offsets not deterministic like Kafka
  Partitions rebuild from Kinesis using offsets
  No Write Ahead Log (WAL) needed
  Optimizes happy path by avoiding the WAL
  At least once delivery guarantee
Parallel Receiver-less Implementation (Kafka)
Receiver-less Implementation (Kafka)
  KafkaRDD partitions store relevant offsets
  Each partition acts as a Receiver
  Tasks/Executors pull from Kafka in parallel
  Partitions rebuild from Kafka using offsets
  No Write Ahead Log (WAL) needed
  Optimizes happy path by avoiding the WAL
  At least once delivery guarantee
Maintain Stability of Stream Processing
Rate Limiting

Since Spark 1.2

Fixed limit on number of messages per second

Potential to drop messages on the floor

Back Pressure

Since Spark 1.5 (Typesafe contribution)

More dynamic than rate limiting

Push back on reliable, buffered source (Kafka, Kinesis)

Fundamentals of Control Theory and Observability
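Both mechanisms above are switched on through Spark configuration. The sketch below builds the settings as a plain dict and renders them as `spark-submit` arguments. The configuration keys are Spark's own, while the helper names and the example rate are assumptions for illustration.

```python
def streaming_stability_conf(max_rate_per_sec, backpressure):
    """Build rate-limiting and back-pressure settings as a plain dict."""
    return {
        # Fixed cap on records per second for receiver-based streams.
        "spark.streaming.receiver.maxRate": str(max_rate_per_sec),
        # Fixed cap per Kafka partition for the receiver-less (direct) stream.
        "spark.streaming.kafka.maxRatePerPartition": str(max_rate_per_sec),
        # Dynamic rate control, available since Spark 1.5.
        "spark.streaming.backpressure.enabled": str(backpressure).lower(),
    }


def as_submit_args(conf):
    """Render a conf dict as a list of spark-submit --conf arguments."""
    args = []
    for key in sorted(conf):
        args += ["--conf", f"{key}={conf[key]}"]
    return args


conf = streaming_stability_conf(max_rate_per_sec=1000, backpressure=True)
submit_args = as_submit_args(conf)
```

With back pressure enabled, the fixed rates act as an upper bound rather than the only control.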
Streaming Approximations
HyperLogLog and CountMin Sketch
HyperLogLog (HLL) Approx Distinct Count
  Approximate count distinct
  Twitter’s Algebird
  Better than HashSet
  Low, fixed memory
  Only 1.5K, 2% error, 10^9 counts (tunable)
  Redis HLL: 12K per key, 0.81%, 2^64 counts
  Spark’s countApproxDistinctByKey()
  Streaming example in Spark codebase
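To show why the memory stays low and fixed, here is a minimal HyperLogLog sketch in pure Python. It is not Algebird's or Spark's implementation: the SHA-1 hash, the default of 4096 registers, and the class name are assumptions chosen for readability.

```python
import hashlib
import math


class HyperLogLog:
    """Minimal HyperLogLog; memory is fixed at 2**p small registers."""

    def __init__(self, p=12):
        self.p = p                      # index bits; use p >= 7 for the alpha below
        self.m = 1 << p                 # number of registers
        self.registers = [0] * self.m

    def add(self, item):
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        x = int.from_bytes(digest[:8], "big")            # 64-bit hash value
        idx = x >> (64 - self.p)                         # first p bits pick a register
        rest = x & ((1 << (64 - self.p)) - 1)            # remaining 64 - p bits
        rank = (64 - self.p) - rest.bit_length() + 1     # position of leftmost 1-bit
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)            # bias correction, m >= 128
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small cardinalities: fall back to linear counting.
            return self.m * math.log(self.m / zeros)
        return raw
```

Adding the same item twice never changes a register, so duplicates do not inflate the estimate. More registers lower the error at the cost of memory.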
http://research.neustar.biz/
CountMin Sketch (CMS) Approx Count
 Approximate count
 Twitter’s Algebird
 Better than HashMap
 Low, fixed memory
 Known error bounds
 Large num counters
 Streaming example in Spark codebase
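A CountMin Sketch can be illustrated the same way. The version below is a simplified sketch, not Algebird's implementation: the width, depth, and the use of SHA-1 with a row prefix as the hash family are assumptions.

```python
import hashlib


class CountMinSketch:
    """Minimal CountMin Sketch: depth rows of width counters, fixed memory."""

    def __init__(self, width=1000, depth=5):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, item):
        # One hash per row, derived by prefixing the row number.
        for row in range(self.depth):
            digest = hashlib.sha1(f"{row}:{item}".encode("utf-8")).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        for row, col in self._cells(item):
            self.table[row][col] += count

    def estimate(self, item):
        # Collisions only add to a counter, so the minimum is the tightest estimate.
        return min(self.table[row][col] for row, col in self._cells(item))
```

The estimate can overcount because of collisions but never undercounts, which is where the known error bounds come from.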
Demo!
Using HLL and CMS for Streaming Count Approximations
Monte Carlo Simulations
From Manhattan Project (Atomic bomb)
Simulate movement of neutrons
Law of Large Numbers (LLN)
Average of results of many trials 
Converge on expected value
SparkPi example in Spark codebase

1 Argument: # of trials

Pi ~= (# red dots / # total dots) * 4
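The formula above can be tried on a single machine before distributing it. This sketch assumes the red dots are the random points that land inside the quarter circle of radius 1. The function name and the fixed seed are choices made for the example, not part of SparkPi.

```python
import random


def estimate_pi(trials, seed=42):
    """Estimate Pi from random points in the unit square."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(trials):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:   # the point falls inside the quarter circle
            inside += 1
    # (# points inside / # total points) approximates Pi / 4.
    return 4.0 * inside / trials
```

By the Law of Large Numbers, more trials pull the estimate closer to Pi. In Spark the trials are split across partitions and the hit counts are summed.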
Demo!
Using a Monte Carlo Simulation to Estimate Pi
Streaming Best Practices
Get Data Out of Streaming ASAP

Processing interval may exceed batch interval

Leads to unstable streaming system
Please Don’t…

Use updateStateByKey() like an in-memory DB

Put streaming jobs on the request/response hot path
Use Separate Jobs for Different Batch Intervals

Small Batch Interval: Store raw data (Redis, Cassandra, etc)

Medium Batch Interval: Transform, join, process data

High Batch Interval: Model training
Gotchas

Tune streamingContext.remember()
Use Approximations!!
Spark ML: Featurizing & Recommendations
Understand Similarity and Dimension Reduction

Demonstrate Sampling and Bucketing

Generate Recommendations
Live, Interactive Demo!
sparkafterdark.com
Audience Participation Needed!!
Audience Instructions
  Navigate to sparkafterdark.com
  Click 3 actresses and 3 actors

  Wait for us to analyze together!
Note: This is totally anonymous!!

Project Links
  https://github.com/fluxcapacitor/pipeline
  https://hub.docker.com/r/fluxcapacitor

Similarity
Types of Similarity
Euclidean
Linear-based measure
Suffers from Magnitude bias
Cosine
Angle-based measure
Adjusts for magnitude bias
Jaccard
Set intersection / union
Suffers from popularity bias
Log Likelihood
Netflix “Shawshank” Problem
Adjusts for popularity bias
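The first three measures above are short enough to write out directly. The sketch below uses plain lists and sets; log likelihood is left out because it needs co-occurrence counts across the whole dataset. The function names are chosen for the example.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance; grows with vector magnitude."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; ignores magnitude."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def jaccard_similarity(a, b):
    """Size of the set intersection divided by the size of the union."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)
```

The vectors `[1, 2, 3]` and `[2, 4, 6]` point the same way, so their cosine similarity is 1 even though their Euclidean distance is not 0. That is the magnitude bias the slide refers to.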

[Figure: like matrix; rows Kimberly, Leslie, Meredith, Lisa, Holden; columns Ali, Matei, Reynold, Patrick, Andy]
All-Pairs Similarity Comparison
Compare everything to everything
aka. “pair-wise similarity” or “similarity join”
Naïve shuffle: O(m*n^2); m=rows, n=cols

Minimize shuffle through approximations!
Reduce m (rows)
Sampling and bucketing
Reduce n (cols)
Remove most frequent value (i.e., 0)
Principal Component Analysis
Dimension reduction!!
Dimension Reduction
Sampling and Bucketing
Reduce m: DIMSUM Sampling
“Dimension Independent Matrix Square using MapReduce”
Remove rows with low similarity probability
MLlib: RowMatrix.columnSimilarities(…)

Twitter: 40% efficiency gain vs. Cosine Similarity

Reduce m: LSH Bucketing
“Locality Sensitive Hashing”
Split m into b buckets
Use similarity hash algorithm
Requires pre-processing of data
Parallel compare bucket contents
O(m*n^2) -> O((m*n/b)*b^2); m=rows, n=cols, b=buckets
e.g. 500k x 500k matrix: O(1.25e17) -> O(1.25e13); b=50
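The two figures for the 500k x 500k example can be checked with plain arithmetic. The snippet reads the bucketed expression left to right, as ((m*n)/b) * b^2, which is the reading that reproduces the slide's numbers.

```python
m = n = 500_000   # rows and columns
b = 50            # buckets

naive = m * n ** 2                 # O(m*n^2)
bucketed = m * n // b * b ** 2     # ((m*n)/b) * b^2, evaluated left to right

reduction = naive // bucketed      # how many times fewer comparisons
```

Here `naive` is 1.25e17 and `bucketed` is 1.25e13, so bucketing cuts the work by a factor of 10,000 in this example.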
github.com/mrsqueeze/spark-hash
Reduce n: Remove Most Frequent Value
Eliminate most-frequent value
Represent other values with (index,value) pairs
Converts O(m*n^2) -> O(m*nnz^2);  
nnz=num nonzeros, nnz << n

Note: Choose most frequent value (may not be 0)
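A small sketch of the (index,value) representation described above. The helper names are hypothetical, and the most frequent value is detected from the row itself rather than assumed to be 0.

```python
from collections import Counter


def to_sparse(row):
    """Drop the most frequent value and keep the rest as (index, value) pairs."""
    most_frequent, _ = Counter(row).most_common(1)[0]
    pairs = [(i, v) for i, v in enumerate(row) if v != most_frequent]
    return most_frequent, len(row), pairs


def to_dense(most_frequent, length, pairs):
    """Rebuild the original row from its sparse form."""
    row = [most_frequent] * length
    for i, v in pairs:
        row[i] = v
    return row
```

For `[0, 0, 3, 0, 0, 5, 0]` only two pairs are stored, `(2, 3)` and `(5, 5)`, so later comparisons touch nnz entries instead of n.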
Recommendations
Summary Statistics and Top-K Historical Analysis
Collaborative Filtering and Clustering
Text Featurization and NLP
Types of Recommendations
Non-personalized 
No preference or behavior data for user, yet
aka “Cold Start Problem”

Personalized 
User-Item Similarity 
Items that others with similar prefs have liked
Item-Item Similarity 
Items similar to your previously-liked items
Recommendation Terminology
Feedback
Explicit: like, rating
Implicit: search, click, hover, view, scroll
Feature Engineering
Dimension reduction, polynomial expansion
Hyper-parameter Tuning
K-Folds Cross Validation, Grid Search
Pipelines/Workflows
Chaining together Transformers and Estimators
Single Machine ML Algorithms
Stay Local, Distribute As Needed
Helps migration of existing single-node algos to Spark
Convert between Spark and Pandas DataFrames
New “pdspark” package: integration w/ scikit-learn, R
Non-Personalized Recommendations
Use Aggregate Data to Generate Recommendations
  Top Users by Like Count

“I might like users who have the most-likes overall
based on historical data.”
SparkSQL, DataFrames: Summary Stat, Aggs
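The aggregation behind this recommendation is a group-by count followed by a top-K. Below is a single-machine sketch using (fromuserid, touserid, rating) tuples. It assumes that a rating of 1 means a like, and the function name is made up for the example.

```python
from collections import Counter


def top_users_by_likes(ratings, k):
    """Return the k most-liked users as (touserid, like_count) pairs."""
    likes = Counter(
        to_user for _from_user, to_user, rating in ratings if rating == 1
    )
    return likes.most_common(k)


ratings = [
    (1, 10, 1), (2, 10, 1), (3, 11, 1),
    (1, 11, 0), (2, 12, 1), (3, 10, 1),
]
top = top_users_by_likes(ratings, 1)
```

On a DataFrame the same steps would be a filter, a group-by with a count, and an ordered limit.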

  Top Influencers by Like Graph

“I might like the most-influential users in overall like graph.”
GraphX: PageRank
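To make the idea concrete, here is a minimal power-iteration PageRank over a list of (liker, liked) edges. It is a textbook variant in which ranks sum to 1, not GraphX's implementation. The damping factor and iteration count are conventional defaults assumed for the example.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Rank nodes of a directed like graph by power iteration."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / len(nodes) for node in nodes}
        for src in nodes:
            # Nodes with no outgoing likes spread their rank over everyone.
            targets = out_links[src] or nodes
            share = damping * rank[src] / len(targets)
            for dst in targets:
                new_rank[dst] += share
        rank = new_rank
    return rank


likes = [("a", "c"), ("b", "c"), ("d", "c"), ("c", "a")]
ranks = pagerank(likes)
```

User "c" is liked by three others and ends up with the highest rank. User "a" ranks second because its only like comes from the influential "c".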

Demo!
Generate Non-Personalized Recommendations
Personalized Recommendations
Understand Similarity and Personalized Recommendations
  Like Behavior of Similar Users
“I like the same people that you like.  
What other people did you like that I haven’t seen?”
MLlib: Matrix Factorization, User-Item Similarity
Demo!
Generate Personalized Recommendations using  
Collaborative Filtering & Matrix Factorization
  Similar Text-based Profiles as Me

“Our profiles have similar keywords and named entities.
We might like each other!”
MLlib: Word2Vec, TF/IDF, k-skip n-grams
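Keyword similarity between profiles can be sketched with TF/IDF weights and cosine similarity. This uses one common smoothed TF/IDF variant and whitespace tokenization as simplifying assumptions. The sample profiles are invented, and Word2Vec and n-grams are not covered.

```python
import math
from collections import Counter


def tf_idf_vectors(documents):
    """Turn each document into a sparse {term: weight} TF/IDF vector."""
    tokenized = [doc.lower().split() for doc in documents]
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    n_docs = len(documents)
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log((n_docs + 1) / (doc_freq[term] + 1))
            for term, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0


profiles = ["hiking sauna coffee", "sauna coffee books", "opera chess wine"]
me, similar, different = tf_idf_vectors(profiles)
```

Here `cosine(me, similar)` is positive because two keywords overlap, while `cosine(me, different)` is 0.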
  Similar Profiles to Previous Likes 

“Your profile text has similar keywords and named entities to
other profiles of people I like. I might like you, too!”
MLlib: Word2Vec, TF/IDF, Doc Similarity

  Relevant, High-Value Emails

“Your initial email references a lot of things in my profile.
I might like you for making the effort!”
MLlib: Word2Vec, TF/IDF, Entity Recognition

Demo!
Feature Engineering for Text/NLP Use Cases
The Future of Recommendations
  Eigenfaces: Facial Recognition
“Your face looks similar to others that I’ve liked. 
I might like you.”
MLlib: RowMatrix, PCA, Item-Item Similarity

Image courtesy of http://crockpotveggies.com/2015/02/09/automating-tinder-with-eigenfaces.html
  NLP Conversation Starter Bot!
“If your responses to my generic opening
lines are positive, I may read your profile.”
MLlib: TF/IDF, DecisionTrees,
Sentiment Analysis
Maintaining the Spark

  Recommendations for Couples
“I want Mad Max. You want Message In a Bottle.
Let’s find something in between to watch tonight.”
MLlib: RowMatrix, Item-Item Similarity 
GraphX: Nearest Neighbors, Shortest Path

Final Recommendation!
  Get Off the Computer & Meet People!
Thank you, Helsinki!!
Chris Fregly @cfregly
IBM Spark Technology Center
San Francisco, CA, USA
Relevant Links
advancedspark.com
Sign up for the book & global meetup!
github.com/fluxcapacitor/pipeline
Clone, contribute, and commit code!
hub.docker.com/r/fluxcapacitor/pipeline/wiki
Run all demos in your own environment with Docker!
More Relevant Links
http://meetup.com/Advanced-Apache-Spark-Meetup
http://advancedspark.com
http://github.com/fluxcapacitor/pipeline
http://hub.docker.com/r/fluxcapacitor/pipeline
http://sortbenchmark.org/ApacheSpark2014.pdf
https://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html
http://0x0fff.com/spark-architecture-shuffle/
http://www.cs.berkeley.edu/~kubitron/courses/cs262a-F13/projects/reports/project16_report.pdf
http://stackoverflow.com/questions/763262/how-does-one-write-code-that-best-utilizes-the-cpu-cache-to-improve-performance
http://www.aristeia.com/TalkNotes/ACCU2011_CPUCaches.pdf
http://mishadoff.com/blog/java-magic-part-4-sun-dot-misc-dot-unsafe/
http://docs.scala-lang.org/overviews/quasiquotes/intro.html
http://lwn.net/Articles/252125/ (Memory Part 2: CPU Caches)
http://lwn.net/Articles/255364/ (Memory Part 5: What Programmers Can Do)
https://www.safaribooksonline.com/library/view/java-performance-the/9781449363512/ch04.html
http://web.eece.maine.edu/~vweaver/projects/perf_events/perf_event_open.html
http://www.brendangregg.com/perf.html
https://perf.wiki.kernel.org/index.php/Tutorial
http://techblog.netflix.com/2015/07/java-in-flames.html
http://techblog.netflix.com/2015/04/introducing-vector-netflixs-on-host.html
http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html#Java

What’s Next?
Autoscaling Spark Workers

Completely Docker-based

Docker Compose and Docker Machine
Lots of Demos and Examples!

Zeppelin & IPython/Jupyter notebooks

Advanced streaming use cases

Advanced ML, Graph, and NLP use cases
Performance Tuning and Profiling

Work closely with Brendan Gregg & Netflix

Surface & share more low-level details of Spark internals
Upcoming Meetups and Conferences
London Spark Meetup (Oct 12th)
Scotland Data Science Meetup (Oct 13th)
Dublin Spark Meetup (Oct 15th)
Barcelona Spark Meetup (Oct 20th)
Madrid Big Data Meetup (Oct 22nd)
Paris Spark Meetup (Oct 26th)
Amsterdam Spark Summit (Oct 27th)
Brussels Spark Meetup (Oct 30th)
Zurich Big Data Meetup (Nov 2nd)
Geneva Spark Meetup (Nov 5th)
San Francisco Datapalooza.io (Nov 10th)
San Francisco Advanced Spark (Nov 12th)
Oslo Big Data Hadoop Meetup (Nov 19th)
Helsinki Spark Meetup (Nov 20th)
Stockholm Spark Meetup (Nov 23rd)
Copenhagen Spark Meetup (Nov 25th)
Budapest Spark Meetup (Nov 26th)
Singapore Strata Conference (Dec 1st)
San Francisco Advanced Spark (Dec 8th)
Mountain View Advanced Spark (Dec 10th)
Toronto Spark Meetup (Dec 14th)
Austin Data Days Conference (Jan 2016)
