Brussels Spark Meetup Oct 30, 2015: Spark After Dark 1.5: Real-time, Advanced Analytics with Spark 1.5, Kafka, Cassandra, ElasticSearch, Zeppelin, and Docker

IBM Spark
spark.tc

After Dark 1.5
High Performance, Real-time, Streaming,
Machine Learning, Natural Language Processing,
Text Analytics, and Recommendations

Chris Fregly
Principal Data Solutions Engineer
IBM Spark Technology Center
** We’re Hiring -- Only Nice People, Please!! **
Brussels Spark Meetup
October 30, 2015

Power of data. Simplicity of design. Speed of innovation.
Follow Along – Slides Are Now Available!

https://www.slideshare.net/cfregly/

Who Am I?

Streaming Data Engineer
Netflix Open Source Committer

Data Solutions Engineer
Apache Contributor
Principal Data Solutions Engineer
IBM Spark Technology Center
Meetup Organizer
Advanced Apache Spark Meetup
Book Author
Advanced (2016)

Upcoming Meetups and Conferences
London Spark Meetup (Oct 12th)
Scotland Data Science Meetup (Oct 13th)
Dublin Spark Meetup (Oct 15th)
Barcelona Spark Meetup (Oct 20th)
Madrid Spark/Big Data Meetup (Oct 22nd)
Paris Spark Meetup (Oct 26th)
Amsterdam Spark Summit & Meetup (Oct 27th)
Delft Dutch Data Science Meetup (Oct 29th)
Brussels Spark Meetup (Oct 30th)
Zurich Big Data Developers Meetup (Nov 2nd)
Geneva Spark Meetup (Nov 5th)
San Francisco Datapalooza (Nov 10th)
San Francisco Advanced Apache Spark Meetup (Nov 12th)
Oslo Big Data Hadoop Meetup (Nov 18th)
Helsinki Spark Meetup (Nov 20th)
Stockholm Spark Meetup (Nov 23rd)
Copenhagen Spark Meetup (Nov 25th)
Budapest Spark Meetup (Nov 27th)
Singapore Strata Conference (Dec 1st)
San Francisco Advanced Apache Spark Meetup (Dec 8th)
Mountain View Advanced Apache Spark Meetup (Dec 10th)
Washington DC Advanced Apache Spark Meetup (Dec 17th)

Freg-a-palooza!

Advanced Apache Spark Meetup
Meetup Metrics
1500 members in just 3 mos!
4th most active Spark Meetup!!
meetup.com/Advanced-Apache-Spark-Meetup
Meetup Goals
Dig deep into codebases of Spark & related projects
Study integrations of Cassandra, ElasticSearch, Tachyon, S3, BlinkDB, Mesos, YARN, Kafka, R
Surface & share patterns & idioms of these well-designed, distributed, big data components

What is Spark After Dark?
Fun, Spark-based dating reference application
*Not a movie recommendation engine!!
Generate recommendations based on user similarity
Demonstrate Apache Spark & related big data projects

Tools of this Talk (github.com/fluxcapacitor)
  Redis
  Docker
  Ganglia
  Streaming, Kafka
  Cassandra, NoSQL
  Parquet, JSON, ORC, Avro
  Apache Zeppelin Notebooks
  Spark SQL, DataFrames, Hive
  ElasticSearch, Logstash, Kibana
  Spark ML, GraphX, Stanford CoreNLP
and…

Overall Themes of this Talk
  Filter Early, Filter Deep
  Approximations are OK
  Minimize Random Seeks
  Maximize Sequential Scans
  Go Off-Heap when Possible
  Parallelism is Required at Scale
  Must Reduce Dimensions at Scale
  Seek Performance Gains at all Layers
  Customize Data Structs for your Workload
  Be Nice and Collaborate with your Peers!

Outline
Spark Core: Mechanical Sympathy & Tuning
Spark SQL: Catalyst & DataSources API
Spark Streaming: Scaling & Approximating
Spark ML: Featurizing & Recommending

Understanding & Acknowledging Mechanical Sympathy

100TB GraySort Challenge, Project Tungsten

Shuffle Service and Dynamic Allocation

Spark and Mechanical Sympathy
http://mechanical-sympathy.blogspot.com

“Hardware and software working together in harmony”

-Martin Thompson

Daytona GraySort (Spark 1.1-1.2)
Saturate Network I/O
Saturate Disk I/O

Project Tungsten (Spark 1.4-1.6)
Minimize Memory and GC
Maximize CPU Cache Locality

AlphaSort Trick for Sorting
AlphaSort paper, 1995

Chris Nyberg and Jim Gray

Naïve

List (Pointer-to-Record)

Requires Key to be dereferenced for comparison

AlphaSort

List (Key, Pointer)

Key is directly available for comparison


CPU Cache Line and Memory Sympathy
Key (10 bytes) + Pointer (4 bytes*) = 14 bytes
*4 bytes when using compressed OOPs (<32 GB heap)
Not a power of two in size
Not CPU-cache friendly (not cache-line friendly)

Add Padding (2 bytes)
Key (10 bytes) + Pad (2 bytes) + Pointer (4 bytes) = 16 bytes
Cache-line friendly

Key-Prefix, Pointer
Prefix (4 bytes) + Pointer (4 bytes) = 8 bytes
2x cache-line friendly
Key distribution affects perf
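
The (key-prefix, pointer) layout above can be sketched in plain Scala. This is a minimal illustration, not Spark's sorter: the `Record` type, the packing into a single 8-byte `Long`, and the tie-break pass are all assumptions made for the example. The pointer is simply an array index here.

```scala
object KeyPrefixSort {
  // Hypothetical record: a byte-array key plus some payload.
  final case class Record(key: Array[Byte], payload: String)

  // Unsigned, lexicographic comparison of two full keys.
  def compareKeys(a: Array[Byte], b: Array[Byte]): Int = {
    val n = math.min(a.length, b.length)
    var i = 0
    while (i < n) {
      val c = (a(i) & 0xFF) - (b(i) & 0xFF)
      if (c != 0) return c
      i += 1
    }
    a.length - b.length
  }

  // Pack a 4-byte key prefix (high half) and the record index (low half) into 8 bytes.
  // Flipping the sign bit makes signed Long order match unsigned prefix order.
  def pack(key: Array[Byte], index: Int): Long = {
    var prefix = 0L
    var i = 0
    while (i < 4) {
      val b = if (i < key.length) key(i) & 0xFF else 0
      prefix = (prefix << 8) | b
      i += 1
    }
    ((prefix << 32) | (index & 0xFFFFFFFFL)) ^ Long.MinValue
  }

  // Sort the packed 8-byte entries, then dereference full keys only to break prefix ties.
  def sortedIndices(records: Array[Record]): Array[Int] = {
    val packed = Array.tabulate(records.length)(i => pack(records(i).key, i))
    java.util.Arrays.sort(packed)
    val idx = packed.map(p => (p & 0xFFFFFFFFL).toInt)
    var start = 0
    while (start < idx.length) {
      var end = start + 1
      while (end < idx.length && (packed(end) >>> 32) == (packed(start) >>> 32)) end += 1
      if (end - start > 1) {
        val run = idx.slice(start, end)
          .sortWith((x, y) => compareKeys(records(x).key, records(y).key) < 0)
        Array.copy(run, 0, idx, start, run.length)
      }
      start = end
    }
    idx
  }
}
```

The more keys share a 4-byte prefix, the more often the full key must be dereferenced, which is why key distribution affects performance.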

Performance Comparison

Similar Technique: Direct Cache Access
Packet header placed into CPU cache


CPU Cache Lines

Instrumenting and Monitoring CPU
Linux perf command!

CPU Cache Naïve Matrix Multiplication
// Dot product of each row & column vector
for (i <- 0 until numRowsA)
  for (j <- 0 until numColsB)
    for (k <- 0 until numColsA)
      res(i)(j) += matA(i)(k) * matB(k)(j)

Bad: matB is traversed column-wise (a new row on every step),
not using the full CPU cache line, ineffective pre-fetching

CPU Cache Friendly Matrix Multiplication

// Transpose B
for (i <- 0 until numColsB)
  for (j <- 0 until numRowsB)
    matBT(i)(j) = matB(j)(i)

// Modify dot product calculation for B Transpose
for (i <- 0 until numRowsA)
  for (j <- 0 until numColsB)
    for (k <- 0 until numColsA)
      res(i)(j) += matA(i)(k) * matBT(j)(k)

Good: Full CPU cache line, effective prefetching
(Before: res(i)(j) += matA(i)(k) * matB(k)(j))
matBT is indexed by j before k
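
For readers who want to try the comparison outside the demo, here is a self-contained sketch of both variants. The object and method names are made up for this example and this is not the code from the demo repository; it only shows that the two loops compute the same product while touching memory differently.

```scala
object MatrixMultiply {
  type Matrix = Array[Array[Double]]

  // Naive: the inner loop walks down a column of b, jumping to a new row array each step.
  def naive(a: Matrix, b: Matrix): Matrix = {
    val (n, m, p) = (a.length, b.length, b(0).length)
    val res = Array.ofDim[Double](n, p)
    for (i <- 0 until n; j <- 0 until p; k <- 0 until m)
      res(i)(j) += a(i)(k) * b(k)(j)
    res
  }

  // Cache-friendly: transpose b once so the inner loop scans two rows sequentially.
  def cacheFriendly(a: Matrix, b: Matrix): Matrix = {
    val (n, m, p) = (a.length, b.length, b(0).length)
    val bT = Array.ofDim[Double](p, m)
    for (k <- 0 until m; j <- 0 until p) bT(j)(k) = b(k)(j)
    val res = Array.ofDim[Double](n, p)
    for (i <- 0 until n; j <- 0 until p) {
      val rowA = a(i)
      val rowB = bT(j)
      var sum = 0.0
      var k = 0
      while (k < m) { sum += rowA(k) * rowB(k); k += 1 }
      res(i)(j) = sum
    }
    res
  }
}
```

The transpose costs one extra pass over B, which pays off once the matrices no longer fit in cache.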

Demo!
Comparing CPU Naïve & Cache-Friendly Matrix Multiplication

Results of Naïve vs. Cache Friendly
Naïve Matrix Multiply vs. Cache Friendly Matrix Multiply
Improvement factors shown on the chart: ~72x, ~8x, ~3x, ~3x, ~2x, ~7x, ~10x
perf stat --repeat 5 --scale --event L1-dcache-load-misses,L1-dcache-prefetch-misses,LLC-load-misses,LLC-prefetch-misses,cache-misses,stalled-cycles-frontend \
java -Xmx13G -XX:-Inline -jar ~/sbt/bin/sbt-launch.jar "tungsten/run-main com.advancedspark.tungsten.matrix.Cache[Friendly|Naïve]MatrixMultiply 256 1"

Visualizing and Finding Hotspots
Flame Graphs with Java Stack Traces
Images courtesy of http://techblog.netflix.com/2015/07/java-in-flames.html

100TB Daytona GraySort Challenge
Focus on Network and Disk I/O Optimizations
Improve Data Structs/Algos for Sort & Shuffle
Saturate Network and Disk Controllers

Winning Results
Spark Goals:
  Saturate Network I/O
  Saturate Disk I/O
(2013) (2014)

Winning Hardware Configuration
Compute

206 EC2 Worker nodes, 1 Master node

AWS i2.8xlarge

32 cores: Intel Xeon CPU E5-2670 @ 2.5 GHz

244 GB RAM, 8 x 800GB SSD, RAID 0 striping, ext4

NOOP I/O scheduler: FIFO, request merging, no reordering

3 GBps mixed read/write disk I/O per node

Network

Deployed within Placement Group/VPC

Using AWS Enhanced Networking

Single Root I/O Virtualization (SR-IOV): extension of PCIe

10 Gbps, low latency, low jitter (iperf showed ~9.5 Gbps)


Winning Software Configuration
Spark 1.2, OpenJDK 1.7_<amazon-something>_u65-b17
Disable caching, compression, speculative execution, shuffle spill
Force NODE_LOCAL task scheduling for optimal data locality
HDFS 2.4.1 short-circuit for local reads, 2x replication
Spark recommends 4-6 partitions (tasks) per core

206 nodes * 32 cores = 6592 cores

6592 cores * 4 = 26,368 partitions

6592 cores * 6 = 39,552 partitions

6592 cores * 4.25 = 28,000 partitions was empirically best
Range partitioning takes advantage of sequential keyspace

New Shuffle Manager
New “Sort-based” shuffle manager replaces Hash-based

New Data Structures and Algos for Shuffle Sort

e.g. New TimSort for Arrays of (K,V) Pairs

New Network Module
Replaces old java.nio, low-level, socket-based code

Zero-copy epoll: kernel-space between disk & network

Custom memory management

spark.shuffle.blockTransferService=netty

Spark-Netty Performance Tuning

spark.shuffle.io.numConnectionsPerPeer

Increase to saturate hosts with multiple disks

spark.shuffle.io.preferDirectBufs

On or Off-heap (Off-heap is default)


New Algorithms and Data Structures
Optimized for sort and shuffle
o.a.s.util.collection.TimSort

Based on JDK 1.7 TimSort

Performs best on partially-sorted datasets

Optimized for elements of (K,V) pairs

Sorts impl of SortDataFormat (ie. KVArraySortDataFormat)
o.a.s.util.collection.AppendOnlyMap

Open addressing hash, quadratic probing

Array of [(K, V), (K, V)]

Good memory locality

Keys never removed, values only append
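
To make the AppendOnlyMap bullets concrete, here is a much-simplified sketch of an append-only, open-addressing map that keeps keys and values side by side in one flat array. `TinyAppendOnlyMap` is a hypothetical name; the growth threshold (0.7) and the use of the key's own hash code are assumptions for this example, and Spark's real class differs in detail.

```scala
// Keys and values live in one array: [k0, v0, k1, v1, ...] for good memory locality.
final class TinyAppendOnlyMap[K, V](initialCapacity: Int = 8) {
  require(initialCapacity > 0 && (initialCapacity & (initialCapacity - 1)) == 0,
    "capacity must be a power of two")
  private var capacity = initialCapacity
  private var mask = capacity - 1
  private var data = new Array[Any](2 * capacity)
  private var used = 0

  def size: Int = used

  // Probe with growing steps (1, 2, 3, ...) until the key or an empty slot is found.
  private def findSlot(arr: Array[Any], m: Int, key: Any): Int = {
    var pos = key.hashCode & m
    var delta = 1
    while (arr(2 * pos) != null && arr(2 * pos) != key) {
      pos = (pos + delta) & m
      delta += 1
    }
    pos
  }

  // Insert or merge a value; updateFn receives (hadValue, oldValue). Keys are never removed.
  def changeValue(key: K, updateFn: (Boolean, V) => V): V = {
    val pos = findSlot(data, mask, key)
    val isNew = data(2 * pos) == null
    val newValue =
      if (isNew) updateFn(false, null.asInstanceOf[V])
      else updateFn(true, data(2 * pos + 1).asInstanceOf[V])
    data(2 * pos) = key
    data(2 * pos + 1) = newValue
    if (isNew) { used += 1; if (used > 0.7 * capacity) grow() }
    newValue
  }

  def get(key: K): Option[V] = {
    val pos = findSlot(data, mask, key)
    if (data(2 * pos) == null) None else Some(data(2 * pos + 1).asInstanceOf[V])
  }

  // Double the table and re-insert every existing pair.
  private def grow(): Unit = {
    val newCapacity = capacity * 2
    val newMask = newCapacity - 1
    val newData = new Array[Any](2 * newCapacity)
    var i = 0
    while (i < capacity) {
      if (data(2 * i) != null) {
        val pos = findSlot(newData, newMask, data(2 * i))
        newData(2 * pos) = data(2 * i)
        newData(2 * pos + 1) = data(2 * i + 1)
      }
      i += 1
    }
    data = newData; capacity = newCapacity; mask = newMask
  }
}
```

A word count is the typical use: `m.changeValue(word, (had, old) => if (had) old + 1 else 1)`.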

Met Performance Goals!
Reducers: 1.1 GB/s per node network I/O
(theoretical max = 1.25 GB/s for 10 Gbps Ethernet)
Mappers: 3 GB/s per node disk I/O (8 x 800 GB SSD)
206 nodes * 1.1 GB/s per node ~= 220 GB/s

Shuffle Performance Tuning Tips
Hash Shuffle Manager (no longer default)

spark.shuffle.consolidateFiles: mapper output files

o.a.s.shuffle.FileShuffleBlockResolver
Intermediate Files

Increase spark.shuffle.file.buffer: reduce seeks & sys calls

Increase spark.reducer.maxSizeInFlight if memory allows

Use smaller number of larger workers to reduce total files
SQL: BroadcastHashJoin vs. ShuffledHashJoin

spark.sql.autoBroadcastJoinThreshold

Use DataFrame.explain(true) or EXPLAIN to verify


Project Tungsten
Focus on CPU Cache and Memory Optimizations
Further Improve Data Structures and Algorithms
Operate on Serialized/Compressed Data
Provide Path to Off Heap

Why is CPU the Bottleneck?
Network and Disk I/O bandwidth are relatively high

GraySort optimizations improved network & shuffle

More partitioning, pruning, and predicate pushdowns

Popularity of columnar file formats like Parquet/ORC

CPU is used for serialization, hashing, compression!

Spark Shuffle Managers
spark.shuffle.manager =

hash < 10,000 Reducers

Output file determined by hashing the key of (K,V) pair

Each mapper creates an output buffer/file per reducer

Leads to M*R number of output buffers/files per shuffle

sort >= 10,000 Reducers

Default since Spark 1.2

Minimizes OS resources

Uses Netty to optimize Network I/O

Created custom Data Structs/Algos

Wins Daytona GraySort Challenge

unsafe -> Tungsten, Default in Spark 1.5

Uses sun.misc.Unsafe to self-manage binary array buffers

Uses custom serialization format

Can operate on compressed and serialized buffers

New Data Structures
“I like your data structure, but my array will beat it!”

New Data Structures for Sort/Shuﬄe Workload
UnsafeRow:

BytesToBytesMap:

sun.misc.Unsafe
Info

addressSize()

pageSize()
Objects

allocateInstance()

objectFieldOffset()
Classes

staticFieldOffset()

defineClass()

defineAnonymousClass()

ensureClassInitialized()
Synchronization

monitorEnter()

tryMonitorEnter()

monitorExit()

compareAndSwapInt()

putOrderedInt()
Arrays

arrayBaseOffset()

arrayIndexScale()
Memory

allocateMemory()

copyMemory()

freeMemory()

getAddress() – not guaranteed after GC

getInt()/putInt()

getBoolean()/putBoolean()

getByte()/putByte()

getShort()/putShort()

getLong()/putLong()

getFloat()/putFloat()

getDouble()/putDouble()

getObjectVolatile()/putObjectVolatile()
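
A minimal sketch of the Memory group of calls above: allocate an off-heap block, write and read longs by raw address, and free the block by hand. The `OffHeapLongs` object is hypothetical, and obtaining the instance through the private `theUnsafe` field is a common workaround rather than a supported API, so newer JDKs may print deprecation warnings.

```scala
object OffHeapLongs {
  // sun.misc.Unsafe has no public constructor; read the singleton via reflection.
  private val unsafe: sun.misc.Unsafe = {
    val field = classOf[sun.misc.Unsafe].getDeclaredField("theUnsafe")
    field.setAccessible(true)
    field.get(null).asInstanceOf[sun.misc.Unsafe]
  }

  // Copy the values off-heap, sum them from raw memory, and always free the block.
  def sumOffHeap(values: Array[Long]): Long = {
    val base = unsafe.allocateMemory(values.length.toLong * 8L)
    try {
      var i = 0
      while (i < values.length) { unsafe.putLong(base + i * 8L, values(i)); i += 1 }
      var sum = 0L
      i = 0
      while (i < values.length) { sum += unsafe.getLong(base + i * 8L); i += 1 }
      sum
    } finally unsafe.freeMemory(base)
  }
}
```

Nothing here is visible to the garbage collector, which is the point: the memory is sized exactly and released explicitly.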

Spark + sun.misc.Unsafe
org.apache.spark.sql.execution.
aggregate.SortBasedAggregate
aggregate.TungstenAggregate
aggregate.AggregationIterator
aggregate.udaf
aggregate.utils
SparkPlanner
rowFormatConverters
UnsafeFixedWidthAggregationMap
UnsafeExternalSorter
UnsafeExternalRowSorter
UnsafeKeyValueSorter
UnsafeKVExternalSorter
local.ConvertToUnsafeNode
local.ConvertToSafeNode
local.HashJoinNode
local.ProjectNode
local.LocalNode
local.BinaryHashJoinNode
local.NestedLoopJoinNode
joins.HashJoin
joins.HashSemiJoin
joins.HashedRelation
joins.BroadcastHashJoin
joins.ShuffledHashOuterJoin (not yet converted)
joins.BroadcastHashOuterJoin
joins.BroadcastLeftSemiJoinHash
joins.BroadcastNestedLoopJoin
joins.SortMergeJoin
joins.LeftSemiJoinBNL
joins.SortMergeOuterJoin
Exchange
SparkPlan
UnsafeRowSerializer
SortPrefixUtils
sort
basicOperators
aggregate.SortBasedAggregationIterator
aggregate.TungstenAggregationIterator
datasources.WriterContainer
datasources.json.JacksonParser
datasources.jdbc.JDBCRDD
org.apache.spark.
unsafe.Platform
unsafe.KVIterator
unsafe.array.LongArray
unsafe.array.ByteArrayMethods
unsafe.array.BitSet
unsafe.bitset.BitSetMethods
unsafe.hash.Murmur3_x86_32
unsafe.map.BytesToBytesMap
unsafe.map.HashMapGrowthStrategy
unsafe.memory.TaskMemoryManager
unsafe.memory.ExecutorMemoryManager
unsafe.memory.MemoryLocation
unsafe.memory.UnsafeMemoryAllocator
unsafe.memory.MemoryAllocator (trait/interface)
unsafe.memory.MemoryBlock
unsafe.memory.HeapMemoryAllocator
unsafe.sort.RecordComparator
unsafe.sort.PrefixComparator
unsafe.sort.PrefixComparators
unsafe.sort.UnsafeSorterSpillWriter
serializer.DummySerializationInstance
shuffle.unsafe.UnsafeShuffleManager
shuffle.unsafe.UnsafeShuffleSortDataFormat
shuffle.unsafe.SpillInfo
shuffle.unsafe.UnsafeShuffleWriter
shuffle.unsafe.UnsafeShuffleExternalSorter
shuffle.unsafe.PackedRecordPointer
shuffle.ShuffleMemoryManager
util.collection.unsafe.sort.UnsafeSorterSpillMerger
util.collection.unsafe.sort.UnsafeSorterSpillReader
util.collection.unsafe.sort.UnsafeSorterSpillWriter
util.collection.unsafe.sort.UnsafeShuffleInMemorySorter
util.collection.unsafe.sort.UnsafeInMemorySorter
util.collection.unsafe.sort.RecordPointerAndKeyPrefix
util.collection.unsafe.sort.UnsafeSorterIterator
network.shuffle.ExternalShuffleBlockResolver
scheduler.Task
rdd.SqlNewHadoopRDD
executor.Executor
org.apache.spark.sql.catalyst.expressions.
regexpExpressions
BoundAttribute
SortOrder
SpecializedGetters
ExpressionEvalHelper
UnsafeArrayData
UnsafeReaders
UnsafeMapData
Projection
LiteralGenerator
UnsafeRow
JoinedRow
InputFileName
SpecificMutableRow
codegen.CodeGenerator
codegen.GenerateProjection
codegen.GenerateUnsafeRowJoiner
codegen.GenerateSafeProjection
codegen.GenerateUnsafeProjection
codegen.BufferHolder
codegen.UnsafeRowWriter
codegen.UnsafeArrayWriter
complexTypeCreator
rows
literals
misc
stringExpressions
Over 200 source
files affected!!

CPU & Memory Optimizations
Custom Managed Memory

Reduces GC overhead

Both on and oﬀ heap

Exact size calculations
Direct Binary Processing

Operate on serialized/compressed arrays

Kryo can reorder serialized records

LZF can reorder compressed records
More CPU Cache-aware Data Structs & Algorithms

o.a.s.unsafe.map.BytesToBytesMap vs. j.u.HashMap
Code Generation (default in 1.5)

Generate source code from overall query plan

Janino generates bytecode from source code

100+ UDFs converted to use code generation
Related classes:
UnsafeFixedWidthAggregationMap & TungstenAggregationIterator
CodeGenerator & GenerateUnsafeRowJoiner
UnsafeSortDataFormat & UnsafeShuffleSortDataFormat & PackedRecordPointer & UnsafeRow
UnsafeInMemorySorter & UnsafeExternalSorter & UnsafeShuffleWriter
Mostly same join code, added if (isUnsafeMode)
UnsafeShuffleManager & UnsafeShuffleInMemorySorter & UnsafeShuffleExternalSorter
Details in SPARK-7075

Code Generation (Default in 1.5)
Problem
Generic expression evaluation
Expensive on JVM
JVM can’t inline polymorphic impls
Code generation bypasses polymorphism
Virtual function calls
Branches based on expression type
Boxing causes excessive object creation
Implementation
Defer source code generation to each operator, type, etc
Scala quasiquotes provide AST manipulation & rewriting
Generates source code, compiled to bytecode w/ Janino
100+ UDFs now using code gen

Code Generation: Spark SQL UDFs
100+ UDFs now using code gen – More to come in Spark 1.6!
Details in
SPARK-8159

Project Tungsten in Other Spark Libraries
SortDataFormat<K, Buffer>: Base trait

UncompressedInBlockSort: MLlib.ALS

EdgeArraySortDataFormat: GraphX.Edge

Outline

Spark SQL: Catalyst and DataSources API
Explore DataFrames, Datasets, DataSources, Catalyst

Creating a Custom DataSource API Implementation

Review Partitions, Pruning, Pushdowns, Formats


DataFrames API
Inspired by R and Pandas DataFrames

Schema-aware
Cross language support

SQL, Python, Scala, Java, R
Levels performance of Python, Scala, Java, and R

Generates JVM bytecode vs serializing to Python
DataFrame is container for logical plan

Lazy transformations represented as tree
Catalyst optimizer creates physical plan

Moves expressions up/down tree
UDF and UDAF Support

Custom UDF using registerFunction()

New, experimental UDAF support
Supports existing Hive metastore if available

Small, file-based Hive metastore created if not available
*DataFrame.rdd returns underlying RDD if needed
Use DataFrames
instead of RDDs!!

DataSources API
Relations (o.a.s.sql.sources.interfaces.scala)

BaseRelation (abstract class): Provides schema of data

TableScan (impl): Read all data from source

PrunedFilteredScan (impl): Column pruning & predicate pushdowns

InsertableRelation (impl): Insert/overwrite data based on SaveMode

RelationProvider (trait/interface): Handle options, BaseRelation factory
Execution (o.a.s.sql.execution.commands.scala)

RunnableCommand (trait/interface): Common commands like EXPLAIN

ExplainCommand(impl: case class)

CacheTableCommand(impl: case class)
Filters (o.a.s.sql.sources.filters.scala)

Filter (abstract class): Handles all predicates/filters supported by this source

EqualTo (impl)

GreaterThan (impl)

StringStartsWith (impl)
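
The idea behind PrunedFilteredScan can be modeled without Spark on the classpath. The sketch below uses simplified stand-ins, not the real o.a.s.sql.sources classes: `MiniDataSource`, `InMemoryRelation`, the `Row` alias, and the locally defined filter case classes are all hypothetical. It shows a relation receiving the required columns and the pushed-down filters, and returning only what was asked for.

```scala
object MiniDataSource {
  // Stand-ins that mirror the names of Spark's source filters (not the real classes).
  sealed trait Filter
  final case class EqualTo(attribute: String, value: Any) extends Filter
  final case class GreaterThan(attribute: String, value: Int) extends Filter
  final case class StringStartsWith(attribute: String, prefix: String) extends Filter

  type Row = Map[String, Any]

  final class InMemoryRelation(val schema: Seq[String], rows: Seq[Row]) {
    private def matches(row: Row, f: Filter): Boolean = f match {
      case EqualTo(a, v)          => row.get(a).contains(v)
      case GreaterThan(a, v)      => row.get(a).exists { case i: Int => i > v; case _ => false }
      case StringStartsWith(a, p) => row.get(a).exists { case s: String => s.startsWith(p); case _ => false }
    }

    // Predicate pushdown: filter at the source. Column pruning: return only requested columns.
    def buildScan(requiredColumns: Seq[String], filters: Seq[Filter]): Seq[Seq[Any]] =
      rows.filter(r => filters.forall(matches(r, _)))
        .map(r => requiredColumns.map(r))
  }
}
```

In real Spark the relation returns an RDD of rows and may ignore filters it cannot handle, because Spark re-applies them afterwards.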

Native Spark SQL DataSources

JSON Data Source
DataFrame
val ratingsDF = sqlContext.read.format("json")
  .load("file:/root/pipeline/datasets/dating/ratings.json.bz2")
-- or, using the json() convenience method --
val ratingsDF = sqlContext.read.json("file:/root/pipeline/datasets/dating/ratings.json.bz2")
SQL Code
CREATE TABLE genders USING json
OPTIONS
(path "file:/root/pipeline/datasets/dating/genders.json.bz2")

JDBC Data Source
Add Driver to Spark JVM System Classpath

$ export SPARK_CLASSPATH=<jdbc-driver.jar>

DataFrame

val jdbcConfig = Map("driver" -> "org.postgresql.Driver",
  "url" -> "jdbc:postgresql://<hostname>:<port>/<database>",
  "dbtable" -> "schema.tablename")

sqlContext.read.format("jdbc").options(jdbcConfig).load()

SQL

CREATE TABLE genders USING jdbc  

OPTIONS (url, dbtable, driver, …)


Parquet Data Source
Configuration

spark.sql.parquet.filterPushdown=true

spark.sql.parquet.mergeSchema=true

spark.sql.parquet.cacheMetadata=true

spark.sql.parquet.compression.codec=[uncompressed,snappy,gzip,lzo]
DataFrames

val gendersDF = sqlContext.read.format("parquet")

.load("file:/root/pipeline/datasets/dating/genders.parquet")

gendersDF.write.format("parquet").partitionBy("gender")

.save("file:/root/pipeline/datasets/dating/genders.parquet")
SQL

CREATE TABLE genders USING parquet

OPTIONS

(path "file:/root/pipeline/datasets/dating/genders.parquet")


ORC Data Source
Configuration

spark.sql.orc.filterPushdown=true
DataFrames

val gendersDF = sqlContext.read.format("orc")

.load("file:/root/pipeline/datasets/dating/genders")

gendersDF.write.format("orc").partitionBy("gender")

.save("file:/root/pipeline/datasets/dating/genders")
SQL

CREATE TABLE genders USING orc

OPTIONS

(path "file:/root/pipeline/datasets/dating/genders")


Third-Party Spark SQL DataSources

CSV DataSource (Databricks)
Github
https://github.com/databricks/spark-csv
Maven

com.databricks:spark-csv_2.10:1.2.0
Code

val gendersCsvDF = sqlContext.read
  .format("com.databricks.spark.csv")
  .load("file:/root/pipeline/datasets/dating/gender.csv.bz2")
  .toDF("id", "gender")
toDF() is required if CSV does not contain header

Avro DataSource (Databricks)
Github

https://github.com/databricks/spark-avro

Maven

com.databricks:spark-avro_2.10:2.0.1

Code

val df = sqlContext.read

.format("com.databricks.spark.avro")

.load("file:/root/pipeline/datasets/dating/gender.avro")

ElasticSearch DataSource (Elastic.co)
Github

https://github.com/elastic/elasticsearch-hadoop

Maven

org.elasticsearch:elasticsearch-spark_2.10:2.1.0

Code

val esConfig = Map("pushdown" -> "true", "es.nodes" -> "<hostname>",
  "es.port" -> "<port>")

df.write.format("org.elasticsearch.spark.sql").mode(SaveMode.Overwrite)
  .options(esConfig).save("<index>/<document-type>")

AWS Redshift Data Source (Databricks)
Github

https://github.com/databricks/spark-redshift

Maven

com.databricks:spark-redshift:0.5.0

Code

val df: DataFrame = sqlContext.read

.format("com.databricks.spark.redshift")

.option("url", "jdbc:redshift://<hostname>:<port>/<database>…")

.option("query", "select x, count(*) from my_table group by x")

.option("tempdir", "s3n://tmpdir")

.load(...)
UNLOAD and copy to tmp bucket in S3 enables parallel reads

Cassandra DataSource (DataStax)
Github

https://github.com/datastax/spark-cassandra-connector

Maven

com.datastax.spark:spark-cassandra-connector_2.10:1.5.0-M1

Code

ratingsDF.write

.format("org.apache.spark.sql.cassandra")

.mode(SaveMode.Append)

.options(Map("keyspace"->"<keyspace>",

"table"->"<table>")).save(…)


Cassandra Pushdown Support
spark-cassandra-connector/…/o.a.s.sql.cassandra.PredicatePushDown.scala

Pushdown Predicate Rules

1. Only push down non-partition key column predicates with =, >, <, >=, <= predicate.
2. Only push down primary key column predicates with = or IN predicate.
3. If there are regular columns in the pushdown predicates, they should have at least one EQ expression on an indexed column and no IN predicates.
4. All partition column predicates must be included in the predicates to be pushed down, only the last part of the partition key can be an IN predicate. For each partition column, only one predicate is allowed.
5. For cluster column predicates, only last predicate can be non-EQ predicate including IN predicate, and preceding column predicates must be EQ predicates. If there is only one cluster column predicate, the predicates could be any non-IN predicate.
6. There are no pushdown predicates if there is any OR condition or NOT IN condition.
7. We're not allowed to push down multiple predicates for the same column if any of them is equality or IN predicate.

Rumor of New Cassandra DataSource
By-pass CQL front door used for transactional data

Bulk read/write directly from/to SSTables

Similar to existing Netflix Open Source project

https://github.com/Netflix/aegisthus

Promotes Cassandra to first-class Analytics Option

Potentially only part of DataStax Enterprise?!

Please mail a nasty letter to your local DataStax office


Cloudant DataSource (IBM)
Github

http://spark-packages.org/package/cloudant/spark-cloudant

Maven

com.datastax.spark:spark-cassandra-connector_2.10:1.5.0-M1

Code

ratingsDF.write.format("com.cloudant.spark")

.mode(SaveMode.Append)

.options(Map("cloudant.host"->"<account>.cloudant.com",

"cloudant.username"->"<username>",

"cloudant.password"->"<password>"))

.save("<filename>")

DB2 and BigSQL DataSources (IBM)
Coming Soon!

Rumor of REST DataSource (Databricks)
Coming Soon?

Ask Michael Armbrust
Spark SQL Lead @ Databricks

Custom DataSource (Me and You All!)
Coming Right Now!
DEMO ALERT!!

Creating a New DataSource
Study Existing Native and Third-Party Data Source Impls

Native: JDBC (o.a.s.sql.execution.datasources.jdbc)

class JDBCRelation extends BaseRelation

with PrunedFilteredScan

with InsertableRelation
Third-Party: Cassandra (o.a.s.sql.cassandra)

class CassandraSourceRelation extends BaseRelation

with PrunedFilteredScan

with InsertableRelation

<Insert Your Custom Data Source Here!>


Demo!
Create a Custom DataSource

Contributing a Custom Data Source
spark-packages.org

Managed by Databricks

Contains links to externally-managed github projects

Ratings and comments

Requires supported Spark version for each package
Examples

https://github.com/databricks/spark-csv

https://github.com/databricks/spark-avro

https://github.com/databricks/spark-redshift


Catalyst Optimizer

Optimize DataFrame Transformation Tree
Subquery elimination: use aliases to collapse subqueries
Constant folding: replace expression with constant
Simplify filters: remove unnecessary filters
Predicate/filter pushdowns: avoid unnecessary data load
Projection collapsing: avoid unnecessary projections
Create Custom Rules
Rules are Scala Case Classes
val newPlan = MyFilterRule(analyzedPlan)

Implements o.a.s.sql.catalyst.rules.Rule
Apply to any stage
JVM code
generation
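
The flavor of a Catalyst rule can be shown on a toy expression tree. This is a simplified model, not Catalyst itself: `MiniCatalyst`, its `Expr` case classes, and the `transformUp` helper are hypothetical names that only imitate the pattern-matching style of real rules. The rule below performs constant folding.

```scala
object MiniCatalyst {
  sealed trait Expr
  final case class Literal(value: Int) extends Expr
  final case class Attribute(name: String) extends Expr
  final case class Add(left: Expr, right: Expr) extends Expr
  final case class Multiply(left: Expr, right: Expr) extends Expr

  // Bottom-up rewrite: transform the children first, then try the rule on the node itself.
  def transformUp(e: Expr)(rule: PartialFunction[Expr, Expr]): Expr = {
    val withNewChildren = e match {
      case Add(l, r)      => Add(transformUp(l)(rule), transformUp(r)(rule))
      case Multiply(l, r) => Multiply(transformUp(l)(rule), transformUp(r)(rule))
      case leaf           => leaf
    }
    rule.applyOrElse(withNewChildren, (x: Expr) => x)
  }

  // Constant folding: replace an operation on two literals with its result.
  val ConstantFolding: PartialFunction[Expr, Expr] = {
    case Add(Literal(a), Literal(b))      => Literal(a + b)
    case Multiply(Literal(a), Literal(b)) => Literal(a * b)
  }
}
```

Here `x + 2 * (3 + 4)` folds to `x + 14`, and the attribute reference is left untouched because its value is only known at run time.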

Query Plan Debugging
gendersCsvDF.select($"id", $"gender").filter("gender != 'F'").filter("gender != 'M'").explain(true)
DataFrame.queryExecution.logical
DataFrame.queryExecution.analyzed
DataFrame.queryExecution.optimizedPlan
DataFrame.queryExecution.executedPlan

Query Plan Visualization & Query Metrics
Effectiveness of Filter
CPU Cache Friendly Binary Format
Cost-based Join Optimization
Similar to MapReduce Map-side Join
Peak Memory for Joins and Aggs

Columnar Storage Format
Skip whole chunks with min-max heuristics stored in each chunk (sorted data only)

Parquet File Format
  Based on Google Dremel
  Implemented by Twitter and Cloudera
  Columnar storage format
  Optimized for fast columnar aggregations
  Tight compression
  Supports pushdowns
  Nested, self-describing, evolving schema

Types of Compression
  Run Length Encoding: Repeated data
  Dictionary Encoding: Fixed set of values
  Delta, Prefix Encoding: Sorted data
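
The three encodings can be sketched in a few lines each. These are toy versions for illustration only, assuming in-memory Scala collections; `ColumnEncodings` is a hypothetical name and this is not how Parquet lays out bytes on disk.

```scala
object ColumnEncodings {
  // Run-length encoding: collapse repeated neighbors into (value, runLength) pairs.
  def runLength[A](xs: Seq[A]): List[(A, Int)] =
    xs.foldRight(List.empty[(A, Int)]) {
      case (x, (y, n) :: tail) if x == y => (x, n + 1) :: tail
      case (x, acc)                      => (x, 1) :: acc
    }

  // Dictionary encoding: store each distinct value once, then small integer codes.
  def dictionary[A](xs: Seq[A]): (Vector[A], Seq[Int]) = {
    val dict = xs.distinct.toVector
    val index = dict.zipWithIndex.toMap
    (dict, xs.map(index))
  }

  // Delta encoding: keep the first value, then differences between neighbors.
  def delta(xs: Seq[Long]): Seq[Long] =
    if (xs.isEmpty) Seq.empty
    else xs.head +: xs.zip(xs.tail).map { case (a, b) => b - a }
}
```

On the gender column of the sample dataset, dictionary encoding needs only three dictionary entries (M, F, U) no matter how many rows there are.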

Demo!
Demonstrate File Formats, Partition Schemes, and Query Plans

Sample Dataset
RATINGS: UserID, ProfileID, Rating (1-10)
GENDERS: UserID, Gender (M, F, U)

Hive JDBC ODBC ThriftServer
Allows BI Tools to connect to Spark DataSources
Must register data in Hive Metastore

Configuration

spark.sql.thriftServer.incrementalCollect=true

spark.driver.maxResultSize > 10gb (default)

Demo!
Accessing Cassandra Data through Beeline and Tableau

Outline

Understand Parallelism, Recovery, and Back Pressure

Demo Common Streaming Count Approximations

Difference between Receiver and Receiver-less

Receiver-less “Direct” Kafka Streaming
  KafkaRDD partitions store relevant offsets
  Each partition acts as a Receiver
  Tasks/workers pull from Kafka in parallel
  Partitions rebuild from Kafka using offsets
  No Write Ahead Log (WAL) needed
  Optimizes happy path by avoiding the WAL
  At least once delivery guarantee

Parallelism of Direct Kafka Streaming

Receiver-based Kinesis Streaming
  KinesisRDD partitions store relevant offsets
  Single receiver required to see all data/offsets
  Kinesis offsets not deterministic like Kafka
  Partitions rebuild from Kinesis using offsets
  No Write Ahead Log (WAL) needed
  Optimizes happy path by avoiding the WAL
  At least once delivery guarantee

Streaming Back Pressure
More than Throttling

Push back on the source

Requires buffered source (Kafka, Kinesis)

Based on fundamentals of Control Theory

Contributed by Typesafe

HyperLogLog
  Approximate cardinality (approx count distinct)
  Fixed, low memory
  Tunable error percentage
  Only 1.5KB @ 2% error,10^9 elements
  Twitter’s Algebird
  Streaming example in Spark codebase
  Spark’s countApproxDistinctByKey()
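
For intuition, here is a bare-bones HyperLogLog in plain Scala. It is a teaching sketch, not Algebird's or Spark's implementation: the class name, the 32-bit MurmurHash3 string hash from the Scala standard library, and the omission of the large-range correction are simplifications chosen for this example.

```scala
import scala.util.hashing.MurmurHash3

// p index bits give m = 2^p one-byte registers; standard error is roughly 1.04 / sqrt(m).
final class HyperLogLogSketch(p: Int) {
  require(p >= 7 && p <= 16, "p must be in [7, 16] for the alpha constant used here")
  private val m = 1 << p
  private val registers = new Array[Byte](m)
  private val alpha = 0.7213 / (1.0 + 1.079 / m)

  def add(item: String): Unit = {
    val h = MurmurHash3.stringHash(item)
    val idx = h >>> (32 - p)                 // top p bits choose the register
    val rest = h << p                        // remaining bits, shifted to the top
    val rank = math.min(Integer.numberOfLeadingZeros(rest), 32 - p) + 1
    if (rank > registers(idx)) registers(idx) = rank.toByte
  }

  def estimate: Double = {
    val sum = registers.map(r => math.pow(2.0, -r)).sum
    val raw = alpha * m * m / sum
    val zeros = registers.count(_ == 0)
    // Small-range correction: fall back to linear counting while many registers are empty.
    if (raw <= 2.5 * m && zeros > 0) m * math.log(m.toDouble / zeros) else raw
  }
}
```

With p = 11 the sketch keeps 2048 registers (about 2 KB) regardless of how many items are added, and adding the same item twice never changes the estimate.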
http://research.neustar.biz/

Count Min Sketch
  Approximate counters
  Better than HashMap
  Low, fixed memory
  Known error bounds
  Large num of counters
  From Twitter Algebird
  Streaming example in Spark codebase
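
A Count-Min Sketch fits in a dozen lines. The version below is an illustrative sketch rather than Algebird's implementation; the class name and the use of seeded MurmurHash3 string hashes as the row hash functions are assumptions made for this example.

```scala
import scala.util.hashing.MurmurHash3

// depth rows of width counters; memory is fixed no matter how many distinct items arrive.
final class CountMinSketch(depth: Int, width: Int) {
  private val table = Array.ofDim[Long](depth, width)

  // One hash function per row, obtained by seeding the hash with the row number.
  private def bucket(item: String, row: Int): Int =
    (MurmurHash3.stringHash(item, row) & Int.MaxValue) % width

  def add(item: String, count: Long = 1L): Unit =
    for (row <- 0 until depth) table(row)(bucket(item, row)) += count

  // Collisions only ever inflate a counter, so the minimum across rows is the best estimate.
  def estimate(item: String): Long =
    (0 until depth).map(row => table(row)(bucket(item, row))).min
}
```

The estimate never undercounts: it is at least the true count and at most the true count plus whatever collided in every row.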

Monte Carlo Simulations
From Manhattan Project (A-bomb)
Simulate movement of neutrons

Law of Large Numbers (LLN)
Average of results of many trials 
Converge on expected value

SparkPi example in
Spark codebase 

Pi ~ (# red dots / # total dots) * 4
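
A single-machine version of the same estimate, as a sketch: throw random darts at the square [-1, 1] x [-1, 1] and count how many land inside the unit circle. The `MonteCarloPi` object and the fixed seed are choices made for this example; the SparkPi example distributes the same loop across partitions.

```scala
object MonteCarloPi {
  // The fraction of darts inside the circle approaches pi / 4 as samples grow (LLN).
  def estimate(samples: Int, seed: Long): Double = {
    val rng = new scala.util.Random(seed)
    var inside = 0
    var i = 0
    while (i < samples) {
      val x = rng.nextDouble() * 2 - 1
      val y = rng.nextDouble() * 2 - 1
      if (x * x + y * y <= 1.0) inside += 1
      i += 1
    }
    4.0 * inside / samples
  }
}
```

The error shrinks roughly with the square root of the number of samples, so 100x more darts buys about one more decimal digit.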

Outline

Understand Similarity and Dimension Reduction

Approximate with Sampling and Bucketing

Generate 10 Recommendations

Live, Interactive Demo!
sparkafterdark.com

Audience Participation Needed!!
Audience Instructions
  Navigate to sparkafterdark.com
  Click 3 actresses and 3 actors

  Wait for us to analyze together!
Note: This is totally anonymous!!

Project Links
  https://github.com/fluxcapacitor/pipeline
  https://hub.docker.com/r/fluxcapacitor

Similarity

Types of Similarity
Euclidean: linear measure
Magnitude bias
Cosine: angle measure
Adjust for magnitude bias
Jaccard: (intersection / union)
Popularity bias
Log Likelihood
Adjust for popularity bias

Ali Matei Reynold Patrick Andy
Kimberly 1 1 1 1
Leslie 1 1
Meredith 1 1 1
Lisa 1 1 1
Holden 1 1 1 1 1
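
Three of the measures above, sketched on plain arrays and sets. The `Similarity` object is a hypothetical helper written for this example; MLlib has its own distributed implementations.

```scala
object Similarity {
  // Euclidean distance: straight-line distance, sensitive to vector magnitude.
  def euclidean(a: Array[Double], b: Array[Double]): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  // Cosine similarity: angle between vectors, so magnitude is factored out.
  def cosine(a: Array[Double], b: Array[Double]): Double = {
    val dot = a.zip(b).map { case (x, y) => x * y }.sum
    val norm = math.sqrt(a.map(x => x * x).sum) * math.sqrt(b.map(x => x * x).sum)
    if (norm == 0.0) 0.0 else dot / norm
  }

  // Jaccard similarity: size of the intersection divided by size of the union.
  def jaccard[A](a: Set[A], b: Set[A]): Double =
    if (a.isEmpty && b.isEmpty) 0.0
    else (a intersect b).size.toDouble / (a union b).size
}
```

Two users who liked {Ali, Matei, Reynold} and {Matei, Reynold, Patrick} share 2 of 4 distinct likes, so their Jaccard similarity is 0.5.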

All-Pairs Similarity Comparison
Compare everything to everything
aka. “pair-wise similarity” or “similarity join”
Naïve shuffle: O(m*n^2); m=rows, n=cols

Minimize shuffle through approximations!
Reduce m (rows)
Sampling and bucketing
Reduce n (cols)
Remove most frequent value (i.e. 0)
Principal Component Analysis
Dimension reduction!!

Dimension Reduction
Sampling and Bucketing

Reduce m: DIMSUM Sampling
“Dimension Independent Matrix Square Using MapReduce”
Remove rows with low similarity probability
MLlib: RowMatrix.columnSimilarities(…)

Twitter: 40% efficiency gain vs. Cosine Similarity

Reduce m: LSH Bucketing
“Locality Sensitive Hashing”
Split m into b buckets
Use similarity hash algorithm
Requires pre-processing of data
Compare bucket contents in parallel
Converts O(m*n^2) -> O(m*n/b * b^2); m=rows, n=cols, b=buckets

e.g., 500k x 500k matrix: O(1.25e17) -> O(1.25e13); b=50

github.com/mrsqueeze/spark-hash
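One common similarity hash is MinHash with banding. The sketch below is an assumption about how bucketing could look, not the spark-hash code: item IDs are assumed to be integers, every set is assumed non-empty, and all names and parameters are hypothetical.

```python
import random
from itertools import combinations

PRIME = 2_147_483_647  # 2^31 - 1, modulus for the hash family


def make_hash_functions(count, seed=0):
    """Random (a, b) pairs defining h(x) = (a * x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME))
            for _ in range(count)]


def minhash_signature(items, hash_functions):
    """Smallest hash value per function; similar sets share many entries."""
    return tuple(min((a * x + b) % PRIME for x in items)
                 for a, b in hash_functions)


def lsh_candidate_pairs(sets_by_key, num_hashes=8, band_size=2, seed=0):
    """Bucket keys by bands of their signature; compare only within buckets."""
    hash_functions = make_hash_functions(num_hashes, seed)
    buckets = {}
    for key, items in sets_by_key.items():
        signature = minhash_signature(items, hash_functions)
        for start in range(0, num_hashes, band_size):
            band = (start, signature[start:start + band_size])
            buckets.setdefault(band, []).append(key)
    candidates = set()
    for members in buckets.values():
        candidates.update(combinations(sorted(members), 2))
    return candidates


# Hypothetical users mapped to the IDs of the items they liked
likes = {
    "user_a": {1, 2, 3, 4},
    "user_b": {1, 2, 3, 4},
    "user_c": {10, 20},
}
print(lsh_candidate_pairs(likes))
```

Identical sets always share every band, so they always become a candidate pair; dissimilar sets rarely collide, which is what removes most of the pair-wise comparisons.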

Reduce n: Remove Most Frequent Value
Eliminate most-frequent value
Represent other values with (index,value) pairs
Converts O(m*n^2) -> O(m*nnz^2);  
nnz=num nonzeros, nnz << n

Note: Choose most frequent value (may not be 0)
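The (index,value) representation can be sketched as follows. The helper names `to_sparse` and `sparse_dot` and the sample rows are assumptions for illustration; `sparse_dot` additionally assumes the dropped value is 0.

```python
from collections import Counter


def to_sparse(dense):
    """Drop the most frequent value; keep the rest as (index, value) pairs."""
    default = Counter(dense).most_common(1)[0][0]
    pairs = [(i, v) for i, v in enumerate(dense) if v != default]
    return default, len(dense), pairs


def sparse_dot(pairs_a, pairs_b):
    """Dot product that touches only stored entries (dropped value is 0)."""
    values_b = dict(pairs_b)
    return sum(v * values_b[i] for i, v in pairs_a if i in values_b)


# Hypothetical rows where 0 is the most frequent value
row_a = [0, 0, 3, 0, 0, 5, 0, 0]
row_b = [0, 2, 4, 0, 0, 0, 0, 0]

_, _, sparse_a = to_sparse(row_a)
_, _, sparse_b = to_sparse(row_b)
print(sparse_a, sparse_b, sparse_dot(sparse_a, sparse_b))
```

The work per row now depends on the number of stored entries (nnz) rather than on the full column count n.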

Recommendations
Summary Statistics and Top-K Historical Analysis
Collaborative Filtering and Clustering
Text Featurization and NLP

Types of Recommendations
Non-personalized 
No preference or behavior data for user, yet
aka “Cold Start Problem”

Personalized 
User-Item Similarity 
Items that others with similar prefs have liked
Item-Item Similarity 
Items similar to your previously-liked items

Recommendation Terminology
Feedback
Explicit: like, rating
Implicit: search, click, hover, view, scroll
Feature Engineering
Dimension reduction, polynomial expansion
Hyper-parameter Tuning
K-Fold Cross-Validation, Grid Search
Pipelines/Workflows
Chaining together Transformers and Estimators
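Two of these terms are easy to show in plain Python: polynomial expansion of a feature vector, and the train/test splits behind k-fold cross-validation. This is a sketch only; the function names are hypothetical, and the output ordering is not meant to match any particular library.

```python
from itertools import combinations_with_replacement
from math import prod


def polynomial_expansion(features, degree):
    """All products of 1..degree features, e.g. [x, y] -> [x, y, xx, xy, yy]."""
    expanded = []
    for d in range(1, degree + 1):
        for combo in combinations_with_replacement(features, d):
            expanded.append(prod(combo))
    return expanded


def k_fold_splits(num_rows, k):
    """Yield (train_indices, test_indices) once per fold."""
    indices = list(range(num_rows))
    for fold in range(k):
        test = indices[fold::k]
        train = [i for i in indices if i % k != fold]
        yield train, test


print(polynomial_expansion([2, 3], degree=2))
for train, test in k_fold_splits(6, k=3):
    print(train, test)
```

Grid search then repeats this loop once per candidate hyper-parameter setting and keeps the setting with the best average score across folds.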

Non-Personalized Recommendations
Use Aggregate Data to Generate Recommendations

  Top Users by Like Count

“I might like users who have the most likes overall,
based on historical data.”
Spark SQL, DataFrames: summary statistics, aggregations
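The aggregation amounts to a group-by count followed by a top-K sort. A plain-Python sketch, with a hypothetical list of (liker, liked) pairs standing in for the like events:

```python
from collections import Counter


def top_k_by_likes(likes, k):
    """Count likes received per user and return the k most-liked users."""
    counts = Counter(liked for _liker, liked in likes)
    return counts.most_common(k)


# Hypothetical (liker, liked) pairs
likes = [
    ("u1", "u3"), ("u2", "u3"), ("u4", "u3"),
    ("u1", "u2"), ("u3", "u2"),
    ("u2", "u1"),
]
print(top_k_by_likes(likes, 2))
```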

  Top Influencers by Like Graph

“I might like the most influential users in the overall like graph.”
GraphX: PageRank
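PageRank can be sketched with a simple power iteration over the like graph. This is not the GraphX implementation; the damping factor, iteration count, and the hypothetical (liker, liked) edges are assumptions.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over directed (source, target) edges."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for source, target in edges:
        out_links[source].append(target)
    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / len(nodes) for node in nodes}
        for source in nodes:
            # Nodes without outgoing likes spread their rank evenly
            targets = out_links[source] or nodes
            share = damping * rank[source] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical (liker, liked) edges
likes = [
    ("u1", "u3"), ("u2", "u3"), ("u4", "u3"),
    ("u1", "u2"), ("u3", "u2"),
    ("u2", "u1"),
]
ranks = pagerank(likes)
print(sorted(ranks.items(), key=lambda item: -item[1]))
```

In this toy graph u3 receives the most likes, yet u2 ends up with the highest rank, because u3's only like goes to u2. That is the difference between counting likes and measuring influence.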

Demo!
Generate Recommendations using Summary Stats & PageRank

Personalized Recommendations
Use Similarity to Generate Personalized Recommendations

  Like Behavior of Similar Users
“I like the same people that you like.  
What other people did you like that I haven’t seen?”
MLlib: Matrix Factorization, User-Item Similarity
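The quote above describes collaborative filtering. The sketch below uses a simple neighborhood-based approach (Jaccard similarity between users) as a stand-in; it is not matrix factorization and not the MLlib API. The names and the sample data are hypothetical.

```python
def jaccard(a, b):
    """Overlap between two sets of liked items."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(likes_by_user, user, k=3):
    """Score items the user has not seen by the similarity of their fans."""
    mine = likes_by_user[user]
    scores = {}
    for other, theirs in likes_by_user.items():
        if other == user:
            continue
        similarity = jaccard(mine, theirs)
        for item in theirs - mine:
            scores[item] = scores.get(item, 0.0) + similarity
    ranked = sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))
    return [item for item, score in ranked[:k] if score > 0]


# Hypothetical likes: "you" overlaps with "me", "them" does not
likes_by_user = {
    "me": {"a", "b", "c"},
    "you": {"a", "b", "d"},
    "them": {"e", "f"},
}
print(recommend(likes_by_user, "me"))
```

Matrix factorization reaches the same goal by learning low-dimensional user and item vectors instead of comparing users directly, which scales better when the like matrix is large and sparse.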

Demo!
Generate Recommendations using  
Collaborative Filtering and Matrix Factorization

  Text-based Profiles Similar to Mine

“Our profiles have similar keywords and named entities.
We might like each other!”
MLlib: Word2Vec, TF/IDF, k-skip n-grams
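Keyword similarity between profiles can be sketched with TF-IDF weights and cosine similarity. This plain-Python sketch assumes whitespace tokenization and a smoothed IDF of log((N + 1) / (df + 1)); the profile texts and function names are hypothetical.

```python
import math
from collections import Counter


def tf_idf_vectors(documents):
    """Turn each document into a {term: tf * idf} dictionary."""
    tokenized = [doc.lower().split() for doc in documents]
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    num_docs = len(documents)
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: count * math.log((num_docs + 1) / (doc_freq[term] + 1))
            for term, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical profile texts
profiles = [
    "spark scala hiking",
    "spark scala cooking",
    "yoga cooking wine",
]
vectors = tf_idf_vectors(profiles)
print(cosine(vectors[0], vectors[1]), cosine(vectors[0], vectors[2]))
```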

  Similar Profiles to Previous Likes 

“Your profile text has similar keywords and named entities to
other profiles of people I like. I might like you, too!”
MLlib: Word2Vec, TF/IDF, Doc Similarity

  Relevant, High-Value Emails

“Your initial email references a lot of things in my profile.
I might like you for making the effort!”
MLlib: Word2Vec, TF/IDF, Entity Recognition

[Figure: Her Email, My Profile]

Demo!
Feature Engineering for Text/NLP Use Cases

The Future of Recommendations

  Eigenfaces: Facial Recognition
“Your face looks similar to others that I’ve liked. 
I might like you.”
MLlib: RowMatrix, PCA, Item-Item Similarity

Image courtesy of http://crockpotveggies.com/2015/02/09/automating-tinder-with-eigenfaces.html
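Eigenfaces are the principal components of a set of face images. The PCA step can be sketched on toy 2-D points with power iteration. This is an illustration only, not MLlib's RowMatrix PCA: real eigenfaces work on pixel vectors with thousands of dimensions, and the sketch assumes the starting vector is not orthogonal to the top component.

```python
import math


def top_principal_component(rows, iterations=100):
    """Return (column means, unit vector of the largest-variance direction)."""
    n_rows, n_cols = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n_rows for j in range(n_cols)]
    centered = [[row[j] - means[j] for j in range(n_cols)] for row in rows]
    cov = [[sum(r[i] * r[j] for r in centered) / (n_rows - 1)
            for j in range(n_cols)] for i in range(n_cols)]
    vector = [1.0] + [0.0] * (n_cols - 1)
    for _ in range(iterations):
        # Multiply by the covariance matrix, then re-normalize
        product = [sum(cov[i][j] * vector[j] for j in range(n_cols))
                   for i in range(n_cols)]
        norm = math.sqrt(sum(x * x for x in product))
        if norm == 0:
            break
        vector = [x / norm for x in product]
    return means, vector


def project(row, means, component):
    """Coordinate of a row along the principal component."""
    return sum((x - m) * c for x, m, c in zip(row, means, component))


# Hypothetical points lying close to the line y = x
points = [[1.0, 1.1], [2.0, 1.9], [3.0, 3.2], [4.0, 3.8]]
means, component = top_principal_component(points)
print(component, [project(p, means, component) for p in points])
```

Comparing faces by their projections onto a handful of components, instead of by raw pixels, is the dimension reduction that makes item-item similarity affordable.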

  NLP Conversation Starter Bot!
“If your responses to my generic opening
lines are positive, I may read your profile.”
MLlib: TF/IDF, DecisionTrees, Sentiment Analysis
[Figure: Positive, Negative]

Maintaining the Spark

⑨  Recommendations for Couples
“I want Mad Max. You want Message In a Bottle.
Let’s find something in between to watch tonight.”
MLlib: RowMatrix, Item-Item Similarity
GraphX: Nearest Neighbors, Shortest Path

[Figure: movies linked by similar plots and similar actors]
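One way to read “something in between” is the midpoint of the shortest path between the two movies in an item-item similarity graph. The sketch below uses breadth-first search in plain Python rather than GraphX; the graph, including the placeholder titles Movie A to Movie C, is hypothetical.

```python
from collections import deque


def shortest_path(graph, start, goal):
    """Breadth-first search over an unweighted similarity graph."""
    previous = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbor in graph.get(node, []):
            if neighbor not in previous:
                previous[neighbor] = node
                queue.append(neighbor)
    return None


def something_in_between(graph, mine, yours):
    """Middle item on the shortest path, or None if nothing lies between."""
    path = shortest_path(graph, mine, yours)
    if path is None or len(path) < 3:
        return None
    return path[len(path) // 2]


# Hypothetical similarity graph; an edge means "similar plot or actors"
graph = {
    "Mad Max": ["Movie A"],
    "Movie A": ["Mad Max", "Movie B"],
    "Movie B": ["Movie A", "Movie C"],
    "Movie C": ["Movie B", "Message In a Bottle"],
    "Message In a Bottle": ["Movie C"],
}
print(something_in_between(graph, "Mad Max", "Message In a Bottle"))
```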

Final Recommendation!

  Get Off the Computer & Meet People!
Thank you, Brussels!!
Chris Fregly @cfregly
IBM Spark Technology Center
San Francisco, CA, USA
Relevant Links
advancedspark.com
Sign up for the book & global meetup!
github.com/fluxcapacitor/pipeline
Clone, contribute, and commit code!
hub.docker.com/r/fluxcapacitor/pipeline/wiki
Run all demos in your own environment with Docker!

More Relevant Links
http://meetup.com/Advanced-Apache-Spark-Meetup
http://advancedspark.com
http://github.com/fluxcapacitor/pipeline
http://hub.docker.com/r/fluxcapacitor/pipeline
https://www.safaribooksonline.com/library/view/java-performance-the/9781449363512/ch04.html
http://web.eece.maine.edu/~vweaver/projects/perf_events/perf_event_open.html
http://www.brendangregg.com/perf.html
https://perf.wiki.kernel.org/index.php/Tutorial
http://techblog.netflix.com/2015/07/java-in-flames.html
http://techblog.netflix.com/2015/04/introducing-vector-netflixs-on-host.html
http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html#Java
http://sortbenchmark.org/ApacheSpark2014.pdf
https://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html
http://0x0fff.com/spark-architecture-shuffle/
http://www.cs.berkeley.edu/~kubitron/courses/cs262a-F13/projects/reports/project16_report.pdf
http://stackoverflow.com/questions/763262/how-does-one-write-code-that-best-utilizes-the-cpu-cache-to-improve-performance
http://www.aristeia.com/TalkNotes/ACCU2011_CPUCaches.pdf
http://mishadoff.com/blog/java-magic-part-4-sun-dot-misc-dot-unsafe/
http://docs.scala-lang.org/overviews/quasiquotes/intro.html
http://lwn.net/Articles/252125/ (Memory Part 2: CPU Caches)
http://lwn.net/Articles/255364/ (Memory Part 5: What Programmers Can Do)

What’s Next?

What’s Next?
Autoscaling Spark Workers

Completely Docker-based

Docker Compose and Docker Machine
Lots of Demos and Examples!

Zeppelin & IPython/Jupyter notebooks

Advanced streaming use cases

Advanced ML, Graph, and NLP use cases
Performance Tuning and Profiling

Work closely with Brendan Gregg & Netflix

Surface & share more low-level details of Spark internals

