Toronto Spark Meetup Dec 14 2015

Chris Fregly
Chris FreglyAI and Machine Learning @ AWS, O'Reilly Author @ Data Science on AWS, Founder @ PipelineAI, Formerly Databricks, Netflix,
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
After Dark 1.5+End-to-End, Real-time, Advanced Analytics and ML

Big Data Reference Pipeline
Toronto Apache Spark Meetup
Dec 14th, 2015
Chris Fregly
Principal Data Solutions Engineer
We’re Hiring - Only Nice People!
advancedspark.com!
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Who Am I?
2

Streaming Data Engineer
Netflix Open Source Committer


Data Solutions Engineer

Apache Contributor
Principal Data Solutions Engineer
IBM Technology Center
Meetup Founder
Advanced Apache Meetup
Book Author
Advanced .
Due 2016
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Advanced Apache Spark Meetup
Meetup Metrics
Top 5 Most Active Spark Meetup!
1800+ Members in just 4 mos!!
2100+ Downloads of Docker Image

 with many Meetup Demos!!!

 @ advancedspark.com
Meetup Goals
Code dive deep into Spark and related open source code bases
Study integrations with Cassandra, ElasticSearch,Tachyon, S3,

 
 BlinkDB, Mesos, YARN, Kafka, R, etc
Surface and share patterns and idioms of well-designed, 

 
 distributed, big data processing systems
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
World Tour: Freg-a-Palooza!
London Spark Meetup (Oct 12th)
Scotland Data Science Meetup (Oct 13th)
Dublin Spark Meetup (Oct 15th)
Barcelona Spark Meetup (Oct 20th)
Madrid Big Data Meetup (Oct 22nd)
Paris Spark Meetup (Oct 26th)
Amsterdam Spark Summit (Oct 27th)
Brussels Spark Meetup (Oct 30th)
Zurich Big Data Meetup (Nov 2nd)
Geneva Spark Meetup (Nov 5th)
4
Oslo Big Data Hadoop Meetup (Nov 19th)
Helsinki Spark Meetup (Nov 20th)
Stockholm Spark Meetup (Nov 23rd)
Copenhagen Spark Meetup (Nov 25th)
Budapest Spark Meetup (Nov 26th)
Istanbul Spark Meetup (Nov 28th)
Singapore Spark Meetup (Dec 1st)
Sydney Spark Meetup (Dec 8th)
Melbourne Spark Meetup (Dec 9th)
Toronto Spark Meetup (Dec 14th)
LAST MEETUP
OF 2015!!
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Topics of This Talk (20-30 mins each)
  Spark Streaming and Spark ML

Generating Real-time Recommendations
  Spark Core

Tuning and Profiling
  Spark SQL

Tuning and Customizing
5
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
Live, Interactive Demo!
Kafka, Cassandra, ElasticSearch, Redis, Spark ML

6
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Audience Participation Required!
7
You -> 
Audience Instructions
 Navigate to 

sparkafterdark.com
 Swipe software used in

Production Only!




Data ->
Scientist
This is Totally Anonymous!!
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Topics of This Talk (15-20 mins each)
  Spark Streaming and Spark ML

Kafka, Cassandra, ElasticSearch, Redis, Docker
  Spark Core

Tuning and Profiling
  Spark SQL

Tuning and Customizing
8
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Mechanical Sympathy
“Hardware and software working together.”

- Martin Thompson

 http://mechanical-sympathy.blogspot.com


“Whatever your data structure, my array wins.”

- Scott Meyers

 Every C++ Book, basically

9
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Spark and Mechanical Sympathy
10
Project 

Tungsten
(Spark 1.4-1.6+)
100TB

GraySort
Challenge
(Spark 1.1-1.2)
Minimize Memory & GC
Maximize CPU Cache
Saturate Network I/O
Saturate Disk I/O
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
CPU Cache Refresher
11
My

Laptop
aka
“LLC”
My

SoftLayer

BareMetal
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
CPU Cache Sympathy (AlphaSort paper)
Key (10 bytes) + Pointer (4 bytes) = 14 bytes

12
Key
 Ptr
Pre-process & Pull Key From Record
Ptr
Key-Prefix
2x CPU Cache-line Friendly!
Key-Prefix (4 bytes) + Pointer (4 bytes) = 8 bytes
Key (10 bytes) + Pad (2 bytes) + Pointer (4 bytes) = 16 bytes
Ptr
Must Dereference to Compare Key
Key
Pointer (4 bytes) = 4 bytes

Key
 Ptr
Pad
/Pad
 CPU Cache-line Friendly!
Dereference full key only 
to resolve prefix duplicates
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Sort Performance Comparison
13
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Sequential vs Random Cache Misses
14
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
Demo!
Sorting

15
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Instrumenting and Monitoring CPU
Use Linux perf command!
16
http://www.brendangregg.com/blog/2015-11-06/java-mixed-mode-flame-graphs.html
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Results of Random vs. Sequential Sort
17
Naïve Random Access
Pointer Sort
Cache Friendly Sequential
Key/Pointer Sort
Ptr
 Key
Must Dereference to Compare Key
Key
 Ptr
Pre-process & Pull Key From Record
-35%
-90%
-68%
-26%
% Change
-55%
perf stat –event 
L1-dcache-load-misses,L1-dcache-prefetch-misses,LLC-load-misses,LLC-prefetch-misses
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
Demo!
Matrix Multiplication

18
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
CPU Cache Naïve Matrix Multiplication
// Dot product of each row & column vector
for (i <- 0 until numRowA)
for (j <- 0 until numColsB)
for (k <- 0 until numColsA)
res[ i ][ j ] += matA[ i ][ k ] * matB[ k ][ j ];

19
Bad: Row-wise traversal,

 not using full CPU cache line,

ineffective pre-fetching
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
CPU Cache Friendly Matrix Multiplication
// Transpose B
for (i <- 0 until numRowsB)
for (j <- 0 until numColsB)
matBT[ i ][ j ] = matB[ j ][ i ];
20
Good: Full CPU Cache Line,

Effective Prefetching
OLD: matB [ k ][ j ];
// Modify algo for Transpose B
for (i <- 0 until numRowsA)
for (j <- 0 until numColsB)
for (k <- 0 until numColsA)
res[ i ][ j ] += matA[ i ][ k ] * matBT[ j ][ k ];
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Results Of Matrix Multiplication
Cache-Friendly Matrix Multiply
21
Naïve Matrix Multiply
perf stat –event 
L1-dcache-load-misses,L1-dcache-prefetch-misses,LLC-load-misses, 
LLC-prefetch-misses,cache-misses,stalled-cycles-frontend
-96%
-93%
-93%
-70%
-53%
% Change
-63%
+8543%?
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
Demo!
Thread Synchronization
22
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Thread and Context Switch Sympathy
Problem
Atomically Increment 2 Counters 

(each at different increments) by
1000’s of Simultaneous Threads

23
Possible Solutions
  Synchronized Immutable
  Synchronized Mutable
  AtomicReference CAS
  Volatile?
Context Switches are Expen$ive!!
aka
“LLC”
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Synchronized Immutable Counters
case class Counters(left: Int, right: Int)

object SynchronizedImmutableCounters {
var counters = new Counters(0,0)
def getCounters(): Counters = {

this.synchronized { counters }
}
def increment(leftIncrement: Int, rightIncrement: Int): Unit = {
this.synchronized {
counters = new Counters(counters.left + leftIncrement, 

 
 
 
 
 
 
 
 
 counters.right + rightIncrement)
}
}
}
24
Locks whole
outer object!!
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Synchronized Mutable Counters
class MutableCounters(left: Int, right: Int) {
def increment(leftIncrement: Int, rightIncrement: Int): Unit={
this.synchronized {…}
}
def getCountersTuple(): (Int, Int) = {
this.synchronized{ (counters.left, counters.right) }
}
}
object SynchronizedMutableCounters {
val counters = new MutableCounters(0,0)
…
def increment(leftIncrement: Int, rightIncrement: Int): Unit = {
counters.increment(leftIncrement, rightIncrement)
}
}
25
Locks just
MutableCounters
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Lock-Free AtomicReference Counters
object LockFreeAtomicReferenceCounters {
val counters = new AtomicReference[Counters](new Counters(0,0))
…
def increment(leftIncrement: Int, rightIncrement: Int) : Long = {
var originalCounters: Counters = null
var updatedCounters: Counters = null
do {
originalCounters = getCounters()
updatedCounters = new Counters(originalCounters.left+ leftIncrement,
originalCounters.right+ rightIncrement)
}

 // Retry lock-free, optimistic compareAndSet() until AtomicRef updates
while !(counters.compareAndSet(originalCounters, updatedCounters))
}
} 
26
Lock Free!!
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Lock-Free AtomicLong Counters
object LockFreeAtomicLongCounters {
// a single Long (64-bit) will maintain 2 separate Ints (32-bits each)
val counters = new AtomicLong()
…
def increment(leftIncrement: Int, rightIncrement: Int): Unit = {
var originalCounters = 0L
var updatedCounters = 0L
do {
originalCounters = counters.get()

 …

 // Store two 32-bit Int into one 64-bit Long

 // Use >>> 32 and << 32 to set and retrieve each Int from the Long

 } 
// Retry lock-free, optimistic compareAndSet() until AtomicLong updates
while !(counters.compareAndSet(originalCounters, updatedCounters))
}
} 
27
Q: Why not 

@volatile long?
A: The JVM does not
guarantee atomic

updates of 64-bit 
longs and doubles
Lock Free!!
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Results of Thread Synchronization
Immutable Case Class





28
Lock-Free AtomicLong
-64%
-46%
-17%
-33%
perf stat –event 
context-switches,L1-dcache-load-misses,L1-dcache-prefetch-misses, 
LLC-load-misses, LLC-prefetch-misses,cache-misses,stalled-cycles-frontend
% Change
-31%
-32%
-33%
-27%
case class Counters(left: Int, right: Int)
...
this.synchronized {
counters = new Counters(counters.left + leftIncrement, 

 
 
 
 counters.right + rightIncrement)
}
val counters = new AtomicLong()
…
do {
…
} while !(counters.compareAndSet(originalCounters, 

updatedCounters))
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Profile Visualizations: Flame Graphs
29
Example: Spark Word Count
Java Stack Traces are Good! JDK 1.8+
(-XX:-Inline -XX:+PreserveFramePointer)
Plateaus are Bad!
I/O stalls, Heavy CPU

serialization, etc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Project Tungsten: CPU and Memory
Create Custom Data Structures & Algorithms

Operate on serialized and compressed ByteArrays!
Minimize Garbage Collection

Reuse ByteArrays

In-place updates for aggregations
Maximize CPU Cache Effectiveness

8-byte alignment

AlphaSort-based Key-Prefix
Utilize Catalyst Dynamic Code Generation

Dynamic optimizations using entire query plan

Developer implements genCode() to create Scala source code (String)

Project Janino compiles source code into JVM bytecode
30
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Why is CPU the Bottleneck?
CPU for serialization, hashing, & compression

Spark 1.2 updates saturated Network, Disk I/O

10x increase in I/O throughput relative to CPU

More partition, pruning, and pushdown support

Newer columnar file formats help reduce I/O
31
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Custom Data Structs & Algos: Aggs
UnsafeFixedWidthAggregationMap

Uses BytesToBytesMap internally

In-place updates of serialized aggregation

No object creation on hot-path


TungstenAggregate & TungstenAggregationIterator

Operates directly on serialized, binary UnsafeRow

2 steps to avoid single-key OOMs
  Hash-based (grouping) agg spills to disk if needed
  Sort-based agg performs external merge sort on spills
32
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Custom Data Structures & Algorithms
o.a.s.util.collection.unsafe.sort. 


UnsafeSortDataFormat

UnsafeExternalSorter

UnsafeShuffleWriter

UnsafeInMemorySorter

RecordPointerAndKeyPrefix



33
Ptr
Key-Prefix
2x CPU Cache-line Friendly!
SortDataFormat<RecordPointerAndKeyPrefix, Long[ ]>

Note: Mixing multiple subclasses of SortDataFormat 
simultaneously will prevent JIT inlining.
Supports merging compressed records
(if compression CODEC supports it, ie. LZF)
In-place external sorting of spilled BytesToBytes data
AlphaSort-based, 8-byte aligned sort key
In-place sorting of BytesToBytesMap data
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Code Generation
Problem

Boxing creates excessive objects

Expression tree evaluations are costly

JVM can’t inline polymorphic impls

 
Lack of polymorphism == poor code design
Solution

Code generation enables inlining

Rewrite and optimize code using overall plan, 8-byte align

Defer source code generation to each operator, UDF, UDAF

Use Janino to compile generated source code into bytecode

 
(More IDE friendly than Scala quasiquotes)
34
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Autoscaling Spark Workers (Spark 1.5+)
Scaling up is easy J

SparkContext.addExecutors() until max is reached
Scaling down is hard L

SparkContext.removeExecutors()

Lose RDD cache inside Executor JVM

Must rebuild active RDD partitions in another Executor JVM
Uses External Shuffle Service from Spark 1.1-1.2

If Executor JVM dies/restarts, shuffle keeps shufflin’!
35
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
“Hidden” Spark Submit REST API
http://arturmkrtchyan.com/apache-spark-hidden-rest-api
Submit Spark Job

curl -X POST http://127.0.0.1:6066/v1/submissions/create 

 
--header "Content-Type:application/json;charset=UTF-8" 

 
--data ’{"action" : "CreateSubmissionRequest”, 
"mainClass" : "org.apache.spark.examples.SparkPi”,

 
 
"sparkProperties" : {

 
 
 "spark.jars" : "file:/spark/lib/spark-examples-1.5.1.jar",

 
 "spark.app.name" : "SparkPi",…

 
}}’
Get Spark Job Status

curl http://127.0.0.1:6066/v1/submissions/status/<job-id-from-submit-request>
Kill Spark Job

curl -X POST http://127.0.0.1:6066/v1/submissions/kill/<job-id-from-submit-request>
36
(the snitch)
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Outline
  Spark Streaming and Spark ML

Kafka, Cassandra, ElasticSearch, Redis, Docker
  Spark Core

Tuning and Profiling
  Spark SQL

Tuning and Customizing
37
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Parquet Columnar File Format
Based on Google Dremel paper ~2010
Collaboration with Twitter and Cloudera
Columnar storage format for fast columnar aggs
Supports evolving schema
Supports pushdowns
Support nested partitions
Tight compression
Min/max heuristics enable file and chunk skipping
38
Min/Max Heuristics
For Chunk Skipping
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Partitions
Partition Based on Data Access Patterns

/genders.parquet/gender=M/…

 
 
 
 
 
 
 /gender=F/… <-- Use Case: Access Users by Gender

 
 
 
 
 
 /gender=U/…
Dynamic Partition Creation (Write)

Dynamically create partitions on write based on column (ie. Gender)

SQL: INSERT TABLE genders PARTITION (gender) SELECT …

DF: gendersDF.write.format("parquet").partitionBy("gender")

.save("/genders.parquet")
Partition Discovery (Read)

Dynamically infer partitions on read based on paths (ie. /gender=F/…)

SQL: SELECT id FROM genders WHERE gender=F

DF: gendersDF.read.format("parquet").load("/genders.parquet/").select($"id"). 


 
 
.where("gender=F")
39
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Pruning
Partition Pruning

Filter out rows by partition


 
SELECT id, gender FROM genders WHERE gender = ‘F’

Column Pruning

Filter out columns by column filter

Extremely useful for columnar storage formats (ie. Parquet)

 
Skip entire blocks of columns


 
SELECT id, gender FROM genders

40
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Pushdowns
aka. Predicate or Filter Pushdowns 
Predicate returns true or false for given function
Filters rows deep into the data source
Reduces number of rows returned
Data Source must implement PrunedFilteredScan


 
def buildScan(requiredColumns: Array[String], 
filters: Array[Filter]): RDD[Row]
41
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
Demo!
File Formats, Partitions, Pushdowns, and Joins
42
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Predicate Pushdowns & Filter Collapsing
43
Filter pushdown!
No Extra Filter Pass
Filter collapse
1 Extra Filter Pass
2 Extra Filter Passes
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Join Between Partitioned & Unpartitioned
44
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Join Between Partitioned & Partitioned
45
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
46
Cartesian Join vs. Inner Join
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
47
Broadcast Join vs. Normal Shuffle Join
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Visualizing the Query Plan
48
Effectiveness 
of Filter
Cost-based
Join Optimization
Similar to
MapReduce

Map-side Join
& DistributedCache
Peak Memory for
Joins and Aggs
UnsafeFixedWidthAggregationMap

getPeakMemoryUsedBytes()
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Data Source API
Relations (o.a.s.sql.sources.interfaces.scala)

BaseRelation (abstract class): Provides schema of data

 
TableScan (impl): Read all data from source 


 
PrunedFilteredScan (impl): Column pruning & predicate pushdowns

 
InsertableRelation (impl): Insert/overwrite data based on SaveMode

RelationProvider (trait/interface): Handle options, BaseRelation factory

Filters (o.a.s.sql.sources.filters.scala)

Filter (abstract class): Handles all filters supported by this source

 
EqualTo (impl)

 
GreaterThan (impl)

 
StringStartsWith (impl)
49
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Native Spark SQL Data Sources
50
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
JSON Data Source
DataFrame

val ratingsDF = sqlContext.read.format("json")
.load("file:/root/pipeline/datasets/dating/ratings.json.bz2")
-- or –
val ratingsDF = sqlContext.read.json

("file:/root/pipeline/datasets/dating/ratings.json.bz2")
SQL Code
CREATE TABLE genders USING json
OPTIONS 
(path "file:/root/pipeline/datasets/dating/genders.json.bz2")

51
json() convenience method
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Parquet Data Source
Configuration

spark.sql.parquet.filterPushdown=true

spark.sql.parquet.mergeSchema=false (unless your schema is evolving)

spark.sql.parquet.cacheMetadata=true (requires sqlContext.refreshTable())

spark.sql.parquet.compression.codec=[uncompressed,snappy,gzip,lzo]
DataFrames

val gendersDF = sqlContext.read.format("parquet")

 .load("file:/root/pipeline/datasets/dating/genders.parquet")

gendersDF.write.format("parquet").partitionBy("gender")

 .save("file:/root/pipeline/datasets/dating/genders.parquet") 
SQL

CREATE TABLE genders USING parquet

OPTIONS 

 
(path "file:/root/pipeline/datasets/dating/genders.parquet")

52
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
ElasticSearch Data Source
Github

https://github.com/elastic/elasticsearch-hadoop

Maven

org.elasticsearch:elasticsearch-spark_2.10:2.1.0

Code

val esConfig = Map("pushdown" -> "true", "es.nodes" -> "<hostname>", 


 
 
 
 
 
 
 "es.port" -> "<port>")

df.write.format("org.elasticsearch.spark.sql”).mode(SaveMode.Overwrite)

 
.options(esConfig).save("<index>/<document-type>")

53
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Cassandra Data Source
Github

https://github.com/datastax/spark-cassandra-connector

Maven

com.datastax.spark:spark-cassandra-connector_2.10:1.5.0-M1

Code

ratingsDF.write

 
.format("org.apache.spark.sql.cassandra")

 
.mode(SaveMode.Append)

 
.options(Map("keyspace"->"<keyspace>",

 
 
 
 
 
 "table"->"<table>")).save(…)

54
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Tips for Cassandra Analytics
By-pass Cassandra CQL “front door”

CQL Optimized for Transactions

Bulk read and write directly against SSTables

Check out Netflix OSS project “Aegisthus”

Cassandra becomesa first-class analytics option

Replicated analytics cluster no longer needed
55
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
DB2 and BigSQL Data Sources (IBM)
Coming Soon!
56
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Creating a Custom Data Source
① Study existing implementations

o.a.s.sql.execution.datasources.jdbc.JDBCRelation
② Extend base traits & implement required methods

o.a.s.sql.sources.{BaseRelation,PrunedFilterScan}

Spark JDBC (o.a.s.sql.execution.datasources.jdbc)

 class JDBCRelation extends BaseRelation

 
with PrunedFilteredScan 


 
with InsertableRelation
DataStax Cassandra (o.a.s.sql.cassandra)

 class CassandraSourceRelation extends BaseRelation

 
with PrunedFilteredScan 


 
with InsertableRelation!
57
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
Demo!
Create a Custom Data Source
58
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Publishing Custom Data Sources
59
spark-packages.org
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
IBM | spark.tc
Spark SQL UDF Code Generation
100+ UDFs now generating code
More to come in Spark 1.6+
Details in 

SPARK-8159, SPARK-9571
Every UDF must use Expressions and

implement Expression.genCode()
to participate in the fun
Lambdas (RDD or Dataset API)
and sqlContext.udf.registerFunction()
are not enough!!
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Creating a Custom UDF with Code Gen
① Study existing implementations
o.a.s.sql.catalyst.expressions.Substring


② Extend and implement base trait

 o.a.s.sql.catalyst.expressions.Expression.genCode



③ Don’t forget about Python!

 python.pyspark.sql.functions.py
61
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
Demo!
Creating a Custom UDF participating in Code Generation
62
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
Spark 1.6 Improvements
Adaptiveness, Metrics, Datasets, and Streaming State 
63
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Adaptive Query Execution
Adapt query execution using data from previous stages
Dynamically choose spark.sql.shuffle.partitions (default 200)
64
Broadcast Join
(popular keys)
Shuffle Join
(not-so-popular keys)
Adaptive
Hybrid

Join
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Adaptive Memory Management
Spark <1.6

Manual configure between 2 memory regions

 
Spark execution engine (shuffles, joins, sorts, aggs)

 
 
spark.shuffle.memoryFraction

 
RDD Data Cache

 
 
spark.storage.memoryFraction 
Spark 1.6+

Unified memory regions

Dynamically expand/contract memory regions

Supports minimum for RDD storage (LRU Cache)
65
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Metrics
Shows exact memory usage per operator & per node
Helps debugging and identifying skew
66
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Spark SQL API
Datasets type safe API (similar to RDDs) utilizing Tungsten

val ds = sqlContext.read.text("ratings.csv").as[String]

val df = ds.flatMap(_.split(",")).filter(_ != "").toDF() // RDD API, convert to DF

val agg = df.groupBy($"rating").agg(count("*") as "ct”).orderBy($"ct" desc)
Typed Aggregators used alongside UDFs and UDAFs

val simpleSum = new Aggregator[Int, Int, Int] with Serializable {

 def zero: Int = 0

 def reduce(b: Int, a: Int) = b + a 

 def merge(b1: Int, b2: Int) = b1 + b2

 def finish(b: Int) = b

}.toColumn

val sum = Seq(1,2,3,4).toDS().select(simpleSum)
Query files directly without registerTempTable()

%sql SELECT * FROM json.`/datasets/movielens/ml-latest/movies.json`
67
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Spark Streaming State Management
New trackStateByKey()

Store deltas, compact later

More efficient per-key state update

Session TTL

Integrated A/B Testing (?!)

Show Failed Output in Admin UI

Better debugging
68
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Thank You!!!
Chris Fregly
IBM Spark Tech Center 
(http://spark.tc)
San Francisco, California, USA
advancedspark.com
Sign up for the Meetup and Book
Contribute on Github
Run All Demos in Docker
2100+ Docker Downloads!!
Find me on LinkedIn, Twitter, Github, Email, Fax
69
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
What’s Next?
70
Advanced Spark
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Upcoming Features of Advanced Spark
Autoscaling Spark Workers

Completely Docker-based

Docker Compose, Google Kubernetes
Lots of Demos and Examples

More Zeppelin, iPython/Jupyter notebooks

More advanced analytics use cases
Performance Profiling

Work closely with Netflix and Databricks

Identify and fix performance bottlenecks

71
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
World Tour: Freg-a-Palooza!
London Spark Meetup (Oct 12th)
Scotland Data Science Meetup (Oct 13th)
Dublin Spark Meetup (Oct 15th)
Barcelona Spark Meetup (Oct 20th)
Madrid Big Data Meetup (Oct 22nd)
Paris Spark Meetup (Oct 26th)
Amsterdam Spark Summit (Oct 27th)
Brussels Spark Meetup (Oct 30th)
Zurich Big Data Meetup (Nov 2nd)
Geneva Spark Meetup (Nov 5th)
72
Oslo Big Data Hadoop Meetup (Nov 19th)
Helsinki Spark Meetup (Nov 20th)
Stockholm Spark Meetup (Nov 23rd)
Copenhagen Spark Meetup (Nov 25th)
Budapest Spark Meetup (Nov 26th)
Istanbul Spark Meetup (Nov 28th)
Singapore Spark Meetup (Dec 1st)
Sydney Spark Meetup (Dec 8th)
Melbourne Spark Meetup (Dec 9th)
Toronto Spark Meetup (Dec 14th)
I’M DONE 
FOR 2015!!
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
advancedspark.com
@cfregly
1 of 73

Recommended

Melbourne Spark Meetup Dec 09 2015 by
Melbourne Spark Meetup Dec 09 2015Melbourne Spark Meetup Dec 09 2015
Melbourne Spark Meetup Dec 09 2015Chris Fregly
533 views73 slides
DC Spark Users Group March 15 2016 - Spark and Netflix Recommendations by
DC Spark Users Group March 15 2016 - Spark and Netflix RecommendationsDC Spark Users Group March 15 2016 - Spark and Netflix Recommendations
DC Spark Users Group March 15 2016 - Spark and Netflix RecommendationsChris Fregly
1.7K views117 slides
Singapore Spark Meetup Dec 01 2015 by
Singapore Spark Meetup Dec 01 2015Singapore Spark Meetup Dec 01 2015
Singapore Spark Meetup Dec 01 2015Chris Fregly
1.1K views66 slides
Dallas DFW Data Science Meetup Jan 21 2016 by
Dallas DFW Data Science Meetup Jan 21 2016Dallas DFW Data Science Meetup Jan 21 2016
Dallas DFW Data Science Meetup Jan 21 2016Chris Fregly
505 views39 slides
Istanbul Spark Meetup Nov 28 2015 by
Istanbul Spark Meetup Nov 28 2015Istanbul Spark Meetup Nov 28 2015
Istanbul Spark Meetup Nov 28 2015Chris Fregly
1.3K views108 slides
Copenhagen Spark Meetup Nov 25, 2015 by
Copenhagen Spark Meetup Nov 25, 2015Copenhagen Spark Meetup Nov 25, 2015
Copenhagen Spark Meetup Nov 25, 2015Chris Fregly
770 views109 slides

More Related Content

What's hot

Advanced Analytics and Recommendations with Apache Spark - Spark Maryland/DC ... by
Advanced Analytics and Recommendations with Apache Spark - Spark Maryland/DC ...Advanced Analytics and Recommendations with Apache Spark - Spark Maryland/DC ...
Advanced Analytics and Recommendations with Apache Spark - Spark Maryland/DC ...Chris Fregly
1.6K views84 slides
Spark Summit East NYC Meetup 02-16-2016 by
Spark Summit East NYC Meetup 02-16-2016  Spark Summit East NYC Meetup 02-16-2016
Spark Summit East NYC Meetup 02-16-2016 Chris Fregly
1.1K views82 slides
Spark, Similarity, Approximations, NLP, Recommendations - Boulder Denver Spar... by
Spark, Similarity, Approximations, NLP, Recommendations - Boulder Denver Spar...Spark, Similarity, Approximations, NLP, Recommendations - Boulder Denver Spar...
Spark, Similarity, Approximations, NLP, Recommendations - Boulder Denver Spar...Chris Fregly
3.4K views85 slides
Advanced Apache Spark Meetup Spark and Elasticsearch 02-15-2016 by
Advanced Apache Spark Meetup Spark and Elasticsearch 02-15-2016Advanced Apache Spark Meetup Spark and Elasticsearch 02-15-2016
Advanced Apache Spark Meetup Spark and Elasticsearch 02-15-2016Chris Fregly
1.4K views28 slides
Chicago Spark Meetup 03 01 2016 - Spark and Recommendations by
Chicago Spark Meetup 03 01 2016 - Spark and RecommendationsChicago Spark Meetup 03 01 2016 - Spark and Recommendations
Chicago Spark Meetup 03 01 2016 - Spark and RecommendationsChris Fregly
1K views85 slides
Advanced Apache Spark Meetup Approximations and Probabilistic Data Structures... by
Advanced Apache Spark Meetup Approximations and Probabilistic Data Structures...Advanced Apache Spark Meetup Approximations and Probabilistic Data Structures...
Advanced Apache Spark Meetup Approximations and Probabilistic Data Structures...Chris Fregly
1.6K views39 slides

What's hot(20)

Advanced Analytics and Recommendations with Apache Spark - Spark Maryland/DC ... by Chris Fregly
Advanced Analytics and Recommendations with Apache Spark - Spark Maryland/DC ...Advanced Analytics and Recommendations with Apache Spark - Spark Maryland/DC ...
Advanced Analytics and Recommendations with Apache Spark - Spark Maryland/DC ...
Chris Fregly1.6K views
Spark Summit East NYC Meetup 02-16-2016 by Chris Fregly
Spark Summit East NYC Meetup 02-16-2016  Spark Summit East NYC Meetup 02-16-2016
Spark Summit East NYC Meetup 02-16-2016
Chris Fregly1.1K views
Spark, Similarity, Approximations, NLP, Recommendations - Boulder Denver Spar... by Chris Fregly
Spark, Similarity, Approximations, NLP, Recommendations - Boulder Denver Spar...Spark, Similarity, Approximations, NLP, Recommendations - Boulder Denver Spar...
Spark, Similarity, Approximations, NLP, Recommendations - Boulder Denver Spar...
Chris Fregly3.4K views
Advanced Apache Spark Meetup Spark and Elasticsearch 02-15-2016 by Chris Fregly
Advanced Apache Spark Meetup Spark and Elasticsearch 02-15-2016Advanced Apache Spark Meetup Spark and Elasticsearch 02-15-2016
Advanced Apache Spark Meetup Spark and Elasticsearch 02-15-2016
Chris Fregly1.4K views
Chicago Spark Meetup 03 01 2016 - Spark and Recommendations by Chris Fregly
Chicago Spark Meetup 03 01 2016 - Spark and RecommendationsChicago Spark Meetup 03 01 2016 - Spark and Recommendations
Chicago Spark Meetup 03 01 2016 - Spark and Recommendations
Chris Fregly1K views
Advanced Apache Spark Meetup Approximations and Probabilistic Data Structures... by Chris Fregly
Advanced Apache Spark Meetup Approximations and Probabilistic Data Structures...Advanced Apache Spark Meetup Approximations and Probabilistic Data Structures...
Advanced Apache Spark Meetup Approximations and Probabilistic Data Structures...
Chris Fregly1.6K views
USF Seminar Series: Apache Spark, Machine Learning, Recommendations Feb 05 2016 by Chris Fregly
USF Seminar Series:  Apache Spark, Machine Learning, Recommendations Feb 05 2016USF Seminar Series:  Apache Spark, Machine Learning, Recommendations Feb 05 2016
USF Seminar Series: Apache Spark, Machine Learning, Recommendations Feb 05 2016
Chris Fregly1.6K views
Stockholm Spark Meetup Nov 23 2015 Spark After Dark 1.5 by Chris Fregly
Stockholm Spark Meetup Nov 23 2015 Spark After Dark 1.5Stockholm Spark Meetup Nov 23 2015 Spark After Dark 1.5
Stockholm Spark Meetup Nov 23 2015 Spark After Dark 1.5
Chris Fregly665 views
Spark After Dark 2.0 - Apache Big Data Conf - Vancouver - May 11, 2016 by Chris Fregly
Spark After Dark 2.0 - Apache Big Data Conf - Vancouver - May 11, 2016Spark After Dark 2.0 - Apache Big Data Conf - Vancouver - May 11, 2016
Spark After Dark 2.0 - Apache Big Data Conf - Vancouver - May 11, 2016
Chris Fregly887 views
5th Athens Big Data Meetup - PipelineIO Workshop - Real-Time Training and Dep... by Athens Big Data
5th Athens Big Data Meetup - PipelineIO Workshop - Real-Time Training and Dep...5th Athens Big Data Meetup - PipelineIO Workshop - Real-Time Training and Dep...
5th Athens Big Data Meetup - PipelineIO Workshop - Real-Time Training and Dep...
Athens Big Data123 views
Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLl... by Chris Fregly
Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLl...Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLl...
Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLl...
Chris Fregly2.1K views
Zurich, Berlin, Vienna Spark and Big Data Meetup Nov 02 2015 by Chris Fregly
Zurich, Berlin, Vienna Spark and Big Data Meetup Nov 02 2015Zurich, Berlin, Vienna Spark and Big Data Meetup Nov 02 2015
Zurich, Berlin, Vienna Spark and Big Data Meetup Nov 02 2015
Chris Fregly1.1K views
Helsinki Spark Meetup Nov 20 2015 by Chris Fregly
Helsinki Spark Meetup Nov 20 2015Helsinki Spark Meetup Nov 20 2015
Helsinki Spark Meetup Nov 20 2015
Chris Fregly899 views
Boston Spark Meetup May 24, 2016 by Chris Fregly
Boston Spark Meetup May 24, 2016Boston Spark Meetup May 24, 2016
Boston Spark Meetup May 24, 2016
Chris Fregly2.1K views
Brussels Spark Meetup Oct 30, 2015: Spark After Dark 1.5:  Real-time, Advanc... by Chris Fregly
Brussels Spark Meetup Oct 30, 2015:  Spark After Dark 1.5:  Real-time, Advanc...Brussels Spark Meetup Oct 30, 2015:  Spark After Dark 1.5:  Real-time, Advanc...
Brussels Spark Meetup Oct 30, 2015: Spark After Dark 1.5:  Real-time, Advanc...
Chris Fregly793 views
Madrid Spark Big Data Bluemix Meetup - Spark Versus Hadoop @ 100 TB Daytona G... by Chris Fregly
Madrid Spark Big Data Bluemix Meetup - Spark Versus Hadoop @ 100 TB Daytona G...Madrid Spark Big Data Bluemix Meetup - Spark Versus Hadoop @ 100 TB Daytona G...
Madrid Spark Big Data Bluemix Meetup - Spark Versus Hadoop @ 100 TB Daytona G...
Chris Fregly1.2K views
Advanced Apache Spark Meetup Data Sources API Cassandra Spark Connector Spark... by Chris Fregly
Advanced Apache Spark Meetup Data Sources API Cassandra Spark Connector Spark...Advanced Apache Spark Meetup Data Sources API Cassandra Spark Connector Spark...
Advanced Apache Spark Meetup Data Sources API Cassandra Spark Connector Spark...
Chris Fregly2.2K views
Scotland Data Science Meetup Oct 13, 2015: Spark SQL, DataFrames, Catalyst, ... by Chris Fregly
Scotland Data Science Meetup Oct 13, 2015:  Spark SQL, DataFrames, Catalyst, ...Scotland Data Science Meetup Oct 13, 2015:  Spark SQL, DataFrames, Catalyst, ...
Scotland Data Science Meetup Oct 13, 2015: Spark SQL, DataFrames, Catalyst, ...
Chris Fregly2K views
Paris Spark Meetup Oct 26, 2015 - Spark After Dark v1.5 - Best of Advanced Ap... by Chris Fregly
Paris Spark Meetup Oct 26, 2015 - Spark After Dark v1.5 - Best of Advanced Ap...Paris Spark Meetup Oct 26, 2015 - Spark After Dark v1.5 - Best of Advanced Ap...
Paris Spark Meetup Oct 26, 2015 - Spark After Dark v1.5 - Best of Advanced Ap...
Chris Fregly2.7K views

Similar to Toronto Spark Meetup Dec 14 2015

Advanced Apache Spark Meetup Project Tungsten Nov 12 2015 by
Advanced Apache Spark Meetup Project Tungsten Nov 12 2015Advanced Apache Spark Meetup Project Tungsten Nov 12 2015
Advanced Apache Spark Meetup Project Tungsten Nov 12 2015Chris Fregly
12.1K views60 slides
Advanced Apache Spark Meetup: How Spark Beat Hadoop @ 100 TB Daytona GraySor... by
Advanced Apache Spark Meetup:  How Spark Beat Hadoop @ 100 TB Daytona GraySor...Advanced Apache Spark Meetup:  How Spark Beat Hadoop @ 100 TB Daytona GraySor...
Advanced Apache Spark Meetup: How Spark Beat Hadoop @ 100 TB Daytona GraySor...Chris Fregly
3.7K views49 slides
Advanced Apache Spark Meetup Spark SQL + DataFrames + Catalyst Optimizer + Da... by
Advanced Apache Spark Meetup Spark SQL + DataFrames + Catalyst Optimizer + Da...Advanced Apache Spark Meetup Spark SQL + DataFrames + Catalyst Optimizer + Da...
Advanced Apache Spark Meetup Spark SQL + DataFrames + Catalyst Optimizer + Da...Chris Fregly
5.2K views41 slides
London Spark Meetup Project Tungsten Oct 12 2015 by
London Spark Meetup Project Tungsten Oct 12 2015London Spark Meetup Project Tungsten Oct 12 2015
London Spark Meetup Project Tungsten Oct 12 2015Chris Fregly
1.1K views55 slides
Build a deep learning pipeline on apache spark for ads optimization by
Build a deep learning pipeline on apache spark for ads optimizationBuild a deep learning pipeline on apache spark for ads optimization
Build a deep learning pipeline on apache spark for ads optimizationCraig Chao
1.8K views42 slides
Advertising Fraud Detection at Scale at T-Mobile by
Advertising Fraud Detection at Scale at T-MobileAdvertising Fraud Detection at Scale at T-Mobile
Advertising Fraud Detection at Scale at T-MobileDatabricks
450 views35 slides

Similar to Toronto Spark Meetup Dec 14 2015(15)

Advanced Apache Spark Meetup Project Tungsten Nov 12 2015 by Chris Fregly
Advanced Apache Spark Meetup Project Tungsten Nov 12 2015Advanced Apache Spark Meetup Project Tungsten Nov 12 2015
Advanced Apache Spark Meetup Project Tungsten Nov 12 2015
Chris Fregly12.1K views
Advanced Apache Spark Meetup: How Spark Beat Hadoop @ 100 TB Daytona GraySor... by Chris Fregly
Advanced Apache Spark Meetup:  How Spark Beat Hadoop @ 100 TB Daytona GraySor...Advanced Apache Spark Meetup:  How Spark Beat Hadoop @ 100 TB Daytona GraySor...
Advanced Apache Spark Meetup: How Spark Beat Hadoop @ 100 TB Daytona GraySor...
Chris Fregly3.7K views
Advanced Apache Spark Meetup Spark SQL + DataFrames + Catalyst Optimizer + Da... by Chris Fregly
Advanced Apache Spark Meetup Spark SQL + DataFrames + Catalyst Optimizer + Da...Advanced Apache Spark Meetup Spark SQL + DataFrames + Catalyst Optimizer + Da...
Advanced Apache Spark Meetup Spark SQL + DataFrames + Catalyst Optimizer + Da...
Chris Fregly5.2K views
London Spark Meetup Project Tungsten Oct 12 2015 by Chris Fregly
London Spark Meetup Project Tungsten Oct 12 2015London Spark Meetup Project Tungsten Oct 12 2015
London Spark Meetup Project Tungsten Oct 12 2015
Chris Fregly1.1K views
Build a deep learning pipeline on apache spark for ads optimization by Craig Chao
Build a deep learning pipeline on apache spark for ads optimizationBuild a deep learning pipeline on apache spark for ads optimization
Build a deep learning pipeline on apache spark for ads optimization
Craig Chao1.8K views
Advertising Fraud Detection at Scale at T-Mobile by Databricks
Advertising Fraud Detection at Scale at T-MobileAdvertising Fraud Detection at Scale at T-Mobile
Advertising Fraud Detection at Scale at T-Mobile
Databricks450 views
An Insider’s Guide to Maximizing Spark SQL Performance by Takuya UESHIN
 An Insider’s Guide to Maximizing Spark SQL Performance An Insider’s Guide to Maximizing Spark SQL Performance
An Insider’s Guide to Maximizing Spark SQL Performance
Takuya UESHIN6.1K views
Apache Spark 2.0: Faster, Easier, and Smarter by Databricks
Apache Spark 2.0: Faster, Easier, and SmarterApache Spark 2.0: Faster, Easier, and Smarter
Apache Spark 2.0: Faster, Easier, and Smarter
Databricks10.5K views
Ac922 watson 180208 v1 by IBM Sverige
Ac922 watson 180208 v1Ac922 watson 180208 v1
Ac922 watson 180208 v1
IBM Sverige541 views
Spark Summit EU 2015: Lessons from 300+ production users by Databricks
Spark Summit EU 2015: Lessons from 300+ production usersSpark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production users
Databricks10.5K views
Jump Start with Apache Spark 2.0 on Databricks by Anyscale
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on Databricks
Anyscale1.1K views
ETL to ML: Use Apache Spark as an end to end tool for Advanced Analytics by Miklos Christine
ETL to ML: Use Apache Spark as an end to end tool for Advanced AnalyticsETL to ML: Use Apache Spark as an end to end tool for Advanced Analytics
ETL to ML: Use Apache Spark as an end to end tool for Advanced Analytics
Miklos Christine1.2K views
Big Data Processing with .NET and Spark (SQLBits 2020) by Michael Rys
Big Data Processing with .NET and Spark (SQLBits 2020)Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)
Michael Rys337 views
Just enough DevOps for Data Scientists (Part II) by Databricks
Just enough DevOps for Data Scientists (Part II)Just enough DevOps for Data Scientists (Part II)
Just enough DevOps for Data Scientists (Part II)
Databricks1.1K views

More from Chris Fregly

AWS reInvent 2022 reCap AI/ML and Data by
AWS reInvent 2022 reCap AI/ML and DataAWS reInvent 2022 reCap AI/ML and Data
AWS reInvent 2022 reCap AI/ML and DataChris Fregly
338 views79 slides
Pandas on AWS - Let me count the ways.pdf by
Pandas on AWS - Let me count the ways.pdfPandas on AWS - Let me count the ways.pdf
Pandas on AWS - Let me count the ways.pdfChris Fregly
191 views32 slides
Smokey and the Multi-Armed Bandit Featuring BERT Reynolds Updated by
Smokey and the Multi-Armed Bandit Featuring BERT Reynolds UpdatedSmokey and the Multi-Armed Bandit Featuring BERT Reynolds Updated
Smokey and the Multi-Armed Bandit Featuring BERT Reynolds UpdatedChris Fregly
1.9K views15 slides
Amazon reInvent 2020 Recap: AI and Machine Learning by
Amazon reInvent 2020 Recap:  AI and Machine LearningAmazon reInvent 2020 Recap:  AI and Machine Learning
Amazon reInvent 2020 Recap: AI and Machine LearningChris Fregly
1.2K views25 slides
Waking the Data Scientist at 2am: Detect Model Degradation on Production Mod... by
Waking the Data Scientist at 2am:  Detect Model Degradation on Production Mod...Waking the Data Scientist at 2am:  Detect Model Degradation on Production Mod...
Waking the Data Scientist at 2am: Detect Model Degradation on Production Mod...Chris Fregly
900 views39 slides
Quantum Computing with Amazon Braket by
Quantum Computing with Amazon BraketQuantum Computing with Amazon Braket
Quantum Computing with Amazon BraketChris Fregly
1K views35 slides

More from Chris Fregly(20)

AWS reInvent 2022 reCap AI/ML and Data by Chris Fregly
AWS reInvent 2022 reCap AI/ML and DataAWS reInvent 2022 reCap AI/ML and Data
AWS reInvent 2022 reCap AI/ML and Data
Chris Fregly338 views
Pandas on AWS - Let me count the ways.pdf by Chris Fregly
Pandas on AWS - Let me count the ways.pdfPandas on AWS - Let me count the ways.pdf
Pandas on AWS - Let me count the ways.pdf
Chris Fregly191 views
Smokey and the Multi-Armed Bandit Featuring BERT Reynolds Updated by Chris Fregly
Smokey and the Multi-Armed Bandit Featuring BERT Reynolds UpdatedSmokey and the Multi-Armed Bandit Featuring BERT Reynolds Updated
Smokey and the Multi-Armed Bandit Featuring BERT Reynolds Updated
Chris Fregly1.9K views
Amazon reInvent 2020 Recap: AI and Machine Learning by Chris Fregly
Amazon reInvent 2020 Recap:  AI and Machine LearningAmazon reInvent 2020 Recap:  AI and Machine Learning
Amazon reInvent 2020 Recap: AI and Machine Learning
Chris Fregly1.2K views
Waking the Data Scientist at 2am: Detect Model Degradation on Production Mod... by Chris Fregly
Waking the Data Scientist at 2am:  Detect Model Degradation on Production Mod...Waking the Data Scientist at 2am:  Detect Model Degradation on Production Mod...
Waking the Data Scientist at 2am: Detect Model Degradation on Production Mod...
Chris Fregly900 views
Quantum Computing with Amazon Braket by Chris Fregly
Quantum Computing with Amazon BraketQuantum Computing with Amazon Braket
Quantum Computing with Amazon Braket
Chris Fregly1K views
15 Tips to Scale a Large AI/ML Workshop - Both Online and In-Person by Chris Fregly
15 Tips to Scale a Large AI/ML Workshop - Both Online and In-Person15 Tips to Scale a Large AI/ML Workshop - Both Online and In-Person
15 Tips to Scale a Large AI/ML Workshop - Both Online and In-Person
Chris Fregly2.6K views
AWS Re:Invent 2019 Re:Cap by Chris Fregly
AWS Re:Invent 2019 Re:CapAWS Re:Invent 2019 Re:Cap
AWS Re:Invent 2019 Re:Cap
Chris Fregly2.1K views
KubeFlow + GPU + Keras/TensorFlow 2.0 + TF Extended (TFX) + Kubernetes + PyTo... by Chris Fregly
KubeFlow + GPU + Keras/TensorFlow 2.0 + TF Extended (TFX) + Kubernetes + PyTo...KubeFlow + GPU + Keras/TensorFlow 2.0 + TF Extended (TFX) + Kubernetes + PyTo...
KubeFlow + GPU + Keras/TensorFlow 2.0 + TF Extended (TFX) + Kubernetes + PyTo...
Chris Fregly3.9K views
Swift for TensorFlow - Tanmay Bakshi - Advanced Spark and TensorFlow Meetup -... by Chris Fregly
Swift for TensorFlow - Tanmay Bakshi - Advanced Spark and TensorFlow Meetup -...Swift for TensorFlow - Tanmay Bakshi - Advanced Spark and TensorFlow Meetup -...
Swift for TensorFlow - Tanmay Bakshi - Advanced Spark and TensorFlow Meetup -...
Chris Fregly1.2K views
Hands-on Learning with KubeFlow + Keras/TensorFlow 2.0 + TF Extended (TFX) + ... by Chris Fregly
Hands-on Learning with KubeFlow + Keras/TensorFlow 2.0 + TF Extended (TFX) + ...Hands-on Learning with KubeFlow + Keras/TensorFlow 2.0 + TF Extended (TFX) + ...
Hands-on Learning with KubeFlow + Keras/TensorFlow 2.0 + TF Extended (TFX) + ...
Chris Fregly3.7K views
Spark SQL Catalyst Optimizer, Custom Expressions, UDFs - Advanced Spark and T... by Chris Fregly
Spark SQL Catalyst Optimizer, Custom Expressions, UDFs - Advanced Spark and T...Spark SQL Catalyst Optimizer, Custom Expressions, UDFs - Advanced Spark and T...
Spark SQL Catalyst Optimizer, Custom Expressions, UDFs - Advanced Spark and T...
Chris Fregly597 views
PipelineAI Continuous Machine Learning and AI - Rework Deep Learning Summit -... by Chris Fregly
PipelineAI Continuous Machine Learning and AI - Rework Deep Learning Summit -...PipelineAI Continuous Machine Learning and AI - Rework Deep Learning Summit -...
PipelineAI Continuous Machine Learning and AI - Rework Deep Learning Summit -...
Chris Fregly1.1K views
PipelineAI Real-Time Machine Learning - Global Artificial Intelligence Confer... by Chris Fregly
PipelineAI Real-Time Machine Learning - Global Artificial Intelligence Confer...PipelineAI Real-Time Machine Learning - Global Artificial Intelligence Confer...
PipelineAI Real-Time Machine Learning - Global Artificial Intelligence Confer...
Chris Fregly607 views
Hyper-Parameter Tuning Across the Entire AI Pipeline GPU Tech Conference San ... by Chris Fregly
Hyper-Parameter Tuning Across the Entire AI Pipeline GPU Tech Conference San ...Hyper-Parameter Tuning Across the Entire AI Pipeline GPU Tech Conference San ...
Hyper-Parameter Tuning Across the Entire AI Pipeline GPU Tech Conference San ...
Chris Fregly5.3K views
PipelineAI Optimizes Your Enterprise AI Pipeline from Distributed Training to... by Chris Fregly
PipelineAI Optimizes Your Enterprise AI Pipeline from Distributed Training to...PipelineAI Optimizes Your Enterprise AI Pipeline from Distributed Training to...
PipelineAI Optimizes Your Enterprise AI Pipeline from Distributed Training to...
Chris Fregly2.5K views
Advanced Spark and TensorFlow Meetup - Dec 12 2017 - Dong Meng, MapR + Kubern... by Chris Fregly
Advanced Spark and TensorFlow Meetup - Dec 12 2017 - Dong Meng, MapR + Kubern...Advanced Spark and TensorFlow Meetup - Dec 12 2017 - Dong Meng, MapR + Kubern...
Advanced Spark and TensorFlow Meetup - Dec 12 2017 - Dong Meng, MapR + Kubern...
Chris Fregly963 views
High Performance Distributed TensorFlow in Production with GPUs - NIPS 2017 -... by Chris Fregly
High Performance Distributed TensorFlow in Production with GPUs - NIPS 2017 -...High Performance Distributed TensorFlow in Production with GPUs - NIPS 2017 -...
High Performance Distributed TensorFlow in Production with GPUs - NIPS 2017 -...
Chris Fregly3.9K views
PipelineAI + TensorFlow AI + Spark ML + Kuberenetes + Istio + AWS SageMaker +... by Chris Fregly
PipelineAI + TensorFlow AI + Spark ML + Kuberenetes + Istio + AWS SageMaker +...PipelineAI + TensorFlow AI + Spark ML + Kuberenetes + Istio + AWS SageMaker +...
PipelineAI + TensorFlow AI + Spark ML + Kuberenetes + Istio + AWS SageMaker +...
Chris Fregly1.4K views
PipelineAI + AWS SageMaker + Distributed TensorFlow + AI Model Training and S... by Chris Fregly
PipelineAI + AWS SageMaker + Distributed TensorFlow + AI Model Training and S...PipelineAI + AWS SageMaker + Distributed TensorFlow + AI Model Training and S...
PipelineAI + AWS SageMaker + Distributed TensorFlow + AI Model Training and S...
Chris Fregly2.5K views

Recently uploaded

DevsRank by
DevsRankDevsRank
DevsRankdevsrank786
11 views1 slide
DSD-INT 2023 Delft3D FM Suite 2024.01 2D3D - New features + Improvements - Ge... by
DSD-INT 2023 Delft3D FM Suite 2024.01 2D3D - New features + Improvements - Ge...DSD-INT 2023 Delft3D FM Suite 2024.01 2D3D - New features + Improvements - Ge...
DSD-INT 2023 Delft3D FM Suite 2024.01 2D3D - New features + Improvements - Ge...Deltares
17 views12 slides
DSD-INT 2023 - Delft3D User Days - Welcome - Day 3 - Afternoon by
DSD-INT 2023 - Delft3D User Days - Welcome - Day 3 - AfternoonDSD-INT 2023 - Delft3D User Days - Welcome - Day 3 - Afternoon
DSD-INT 2023 - Delft3D User Days - Welcome - Day 3 - AfternoonDeltares
15 views43 slides
Copilot Prompting Toolkit_All Resources.pdf by
Copilot Prompting Toolkit_All Resources.pdfCopilot Prompting Toolkit_All Resources.pdf
Copilot Prompting Toolkit_All Resources.pdfRiccardo Zamana
8 views4 slides
Citi TechTalk Session 2: Kafka Deep Dive by
Citi TechTalk Session 2: Kafka Deep DiveCiti TechTalk Session 2: Kafka Deep Dive
Citi TechTalk Session 2: Kafka Deep Diveconfluent
17 views60 slides
DSD-INT 2023 Simulating a falling apron in Delft3D 4 - Engineering Practice -... by
DSD-INT 2023 Simulating a falling apron in Delft3D 4 - Engineering Practice -...DSD-INT 2023 Simulating a falling apron in Delft3D 4 - Engineering Practice -...
DSD-INT 2023 Simulating a falling apron in Delft3D 4 - Engineering Practice -...Deltares
6 views15 slides

Recently uploaded(20)

DSD-INT 2023 Delft3D FM Suite 2024.01 2D3D - New features + Improvements - Ge... by Deltares
DSD-INT 2023 Delft3D FM Suite 2024.01 2D3D - New features + Improvements - Ge...DSD-INT 2023 Delft3D FM Suite 2024.01 2D3D - New features + Improvements - Ge...
DSD-INT 2023 Delft3D FM Suite 2024.01 2D3D - New features + Improvements - Ge...
Deltares17 views
DSD-INT 2023 - Delft3D User Days - Welcome - Day 3 - Afternoon by Deltares
DSD-INT 2023 - Delft3D User Days - Welcome - Day 3 - AfternoonDSD-INT 2023 - Delft3D User Days - Welcome - Day 3 - Afternoon
DSD-INT 2023 - Delft3D User Days - Welcome - Day 3 - Afternoon
Deltares15 views
Copilot Prompting Toolkit_All Resources.pdf by Riccardo Zamana
Copilot Prompting Toolkit_All Resources.pdfCopilot Prompting Toolkit_All Resources.pdf
Copilot Prompting Toolkit_All Resources.pdf
Riccardo Zamana8 views
Citi TechTalk Session 2: Kafka Deep Dive by confluent
Citi TechTalk Session 2: Kafka Deep DiveCiti TechTalk Session 2: Kafka Deep Dive
Citi TechTalk Session 2: Kafka Deep Dive
confluent17 views
DSD-INT 2023 Simulating a falling apron in Delft3D 4 - Engineering Practice -... by Deltares
DSD-INT 2023 Simulating a falling apron in Delft3D 4 - Engineering Practice -...DSD-INT 2023 Simulating a falling apron in Delft3D 4 - Engineering Practice -...
DSD-INT 2023 Simulating a falling apron in Delft3D 4 - Engineering Practice -...
Deltares6 views
DSD-INT 2023 Thermobaricity in 3D DCSM-FM - taking pressure into account in t... by Deltares
DSD-INT 2023 Thermobaricity in 3D DCSM-FM - taking pressure into account in t...DSD-INT 2023 Thermobaricity in 3D DCSM-FM - taking pressure into account in t...
DSD-INT 2023 Thermobaricity in 3D DCSM-FM - taking pressure into account in t...
Deltares9 views
Generic or specific? Making sensible software design decisions by Bert Jan Schrijver
Generic or specific? Making sensible software design decisionsGeneric or specific? Making sensible software design decisions
Generic or specific? Making sensible software design decisions
Neo4j y GenAI by Neo4j
Neo4j y GenAI Neo4j y GenAI
Neo4j y GenAI
Neo4j45 views
.NET Developer Conference 2023 - .NET Microservices mit Dapr – zu viel Abstra... by Marc Müller
.NET Developer Conference 2023 - .NET Microservices mit Dapr – zu viel Abstra....NET Developer Conference 2023 - .NET Microservices mit Dapr – zu viel Abstra...
.NET Developer Conference 2023 - .NET Microservices mit Dapr – zu viel Abstra...
Marc Müller38 views
El Arte de lo Possible by Neo4j
El Arte de lo PossibleEl Arte de lo Possible
El Arte de lo Possible
Neo4j39 views
Navigating container technology for enhanced security by Niklas Saari by Metosin Oy
Navigating container technology for enhanced security by Niklas SaariNavigating container technology for enhanced security by Niklas Saari
Navigating container technology for enhanced security by Niklas Saari
Metosin Oy12 views
Unmasking the Dark Art of Vectored Exception Handling: Bypassing XDR and EDR ... by Donato Onofri
Unmasking the Dark Art of Vectored Exception Handling: Bypassing XDR and EDR ...Unmasking the Dark Art of Vectored Exception Handling: Bypassing XDR and EDR ...
Unmasking the Dark Art of Vectored Exception Handling: Bypassing XDR and EDR ...
Donato Onofri773 views

Toronto Spark Meetup Dec 14 2015

  • 1. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc After Dark 1.5+End-to-End, Real-time, Advanced Analytics and ML
 Big Data Reference Pipeline Toronto Apache Spark Meetup Dec 14th, 2015 Chris Fregly Principal Data Solutions Engineer We’re Hiring - Only Nice People! advancedspark.com!
  • 2. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Who Am I? 2 Streaming Data Engineer Netflix Open Source Committer
 Data Solutions Engineer
 Apache Contributor Principal Data Solutions Engineer IBM Technology Center Meetup Founder Advanced Apache Meetup Book Author Advanced . Due 2016
  • 3. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Advanced Apache Spark Meetup Meetup Metrics Top 5 Most Active Spark Meetup! 1800+ Members in just 4 mos!! 2100+ Downloads of Docker Image with many Meetup Demos!!! @ advancedspark.com Meetup Goals Code dive deep into Spark and related open source code bases Study integrations with Cassandra, ElasticSearch,Tachyon, S3, BlinkDB, Mesos, YARN, Kafka, R, etc Surface and share patterns and idioms of well-designed, distributed, big data processing systems
  • 4. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark World Tour: Freg-a-Palooza! London Spark Meetup (Oct 12th) Scotland Data Science Meetup (Oct 13th) Dublin Spark Meetup (Oct 15th) Barcelona Spark Meetup (Oct 20th) Madrid Big Data Meetup (Oct 22nd) Paris Spark Meetup (Oct 26th) Amsterdam Spark Summit (Oct 27th) Brussels Spark Meetup (Oct 30th) Zurich Big Data Meetup (Nov 2nd) Geneva Spark Meetup (Nov 5th) 4 Oslo Big Data Hadoop Meetup (Nov 19th) Helsinki Spark Meetup (Nov 20th) Stockholm Spark Meetup (Nov 23rd) Copenhagen Spark Meetup (Nov 25th) Budapest Spark Meetup (Nov 26th) Istanbul Spark Meetup (Nov 28th) Singapore Spark Meetup (Dec 1st) Sydney Spark Meetup (Dec 8th) Melbourne Spark Meetup (Dec 9th) Toronto Spark Meetup (Dec 14th) LAST MEETUP OF 2015!!
  • 5. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Topics of This Talk (20-30 mins each)   Spark Streaming and Spark ML
 Generating Real-time Recommendations   Spark Core
 Tuning and Profiling   Spark SQL
 Tuning and Customizing 5
  • 6. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc Live, Interactive Demo! Kafka, Cassandra, ElasticSearch, Redis, Spark ML 6
  • 7. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Audience Participation Required! 7 You -> Audience Instructions  Navigate to 
 sparkafterdark.com  Swipe software used in
 Production Only! Data -> Scientist This is Totally Anonymous!!
  • 8. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Topics of This Talk (15-20 mins each)   Spark Streaming and Spark ML
 Kafka, Cassandra, ElasticSearch, Redis, Docker   Spark Core
 Tuning and Profiling   Spark SQL
 Tuning and Customizing 8
  • 9. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Mechanical Sympathy “Hardware and software working together.” - Martin Thompson http://mechanical-sympathy.blogspot.com “Whatever your data structure, my array wins.” - Scott Meyers Every C++ Book, basically 9
  • 10. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Spark and Mechanical Sympathy 10 Project 
 Tungsten (Spark 1.4-1.6+) 100TB
 GraySort Challenge (Spark 1.1-1.2) Minimize Memory & GC Maximize CPU Cache Saturate Network I/O Saturate Disk I/O
  • 11. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark CPU Cache Refresher 11 My
 Laptop aka “LLC” My
 SoftLayer
 BareMetal
  • 12. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark CPU Cache Sympathy (AlphaSort paper) Key (10 bytes) + Pointer (4 bytes) = 14 bytes 12 Key Ptr Pre-process & Pull Key From Record Ptr Key-Prefix 2x CPU Cache-line Friendly! Key-Prefix (4 bytes) + Pointer (4 bytes) = 8 bytes Key (10 bytes) + Pad (2 bytes) + Pointer (4 bytes) = 16 bytes Ptr Must Dereference to Compare Key Key Pointer (4 bytes) = 4 bytes Key Ptr Pad /Pad CPU Cache-line Friendly! Dereference full key only to resolve prefix duplicates
  • 13. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Sort Performance Comparison 13
  • 14. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Sequential vs Random Cache Misses 14
  • 15. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc Demo! Sorting 15
  • 16. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Instrumenting and Monitoring CPU Use Linux perf command! 16 http://www.brendangregg.com/blog/2015-11-06/java-mixed-mode-flame-graphs.html
  • 17. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Results of Random vs. Sequential Sort 17 Naïve Random Access Pointer Sort Cache Friendly Sequential Key/Pointer Sort Ptr Key Must Dereference to Compare Key Key Ptr Pre-process & Pull Key From Record -35% -90% -68% -26% % Change -55% perf stat –event L1-dcache-load-misses,L1-dcache-prefetch-misses,LLC-load-misses,LLC-prefetch-misses
  • 18. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc Demo! Matrix Multiplication 18
  • 19. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark CPU Cache Naïve Matrix Multiplication // Dot product of each row & column vector for (i <- 0 until numRowA) for (j <- 0 until numColsB) for (k <- 0 until numColsA) res[ i ][ j ] += matA[ i ][ k ] * matB[ k ][ j ]; 19 Bad: Row-wise traversal, not using full CPU cache line,
 ineffective pre-fetching
  • 20. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark CPU Cache Friendly Matrix Multiplication // Transpose B for (i <- 0 until numRowsB) for (j <- 0 until numColsB) matBT[ i ][ j ] = matB[ j ][ i ]; 20 Good: Full CPU Cache Line,
 Effective Prefetching OLD: matB [ k ][ j ]; // Modify algo for Transpose B for (i <- 0 until numRowsA) for (j <- 0 until numColsB) for (k <- 0 until numColsA) res[ i ][ j ] += matA[ i ][ k ] * matBT[ j ][ k ];
  • 21. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Results Of Matrix Multiplication Cache-Friendly Matrix Multiply 21 Naïve Matrix Multiply perf stat –event L1-dcache-load-misses,L1-dcache-prefetch-misses,LLC-load-misses, LLC-prefetch-misses,cache-misses,stalled-cycles-frontend -96% -93% -93% -70% -53% % Change -63% +8543%?
  • 22. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc Demo! Thread Synchronization 22
  • 23. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Thread and Context Switch Sympathy Problem Atomically Increment 2 Counters 
 (each at different increments) by 1000’s of Simultaneous Threads 23 Possible Solutions   Synchronized Immutable   Synchronized Mutable   AtomicReference CAS   Volatile? Context Switches are Expen$ive!! aka “LLC”
  • 24. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Synchronized Immutable Counters case class Counters(left: Int, right: Int) object SynchronizedImmutableCounters { var counters = new Counters(0,0) def getCounters(): Counters = { this.synchronized { counters } } def increment(leftIncrement: Int, rightIncrement: Int): Unit = { this.synchronized { counters = new Counters(counters.left + leftIncrement, counters.right + rightIncrement) } } } 24 Locks whole outer object!!
  • 25. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Synchronized Mutable Counters class MutableCounters(left: Int, right: Int) { def increment(leftIncrement: Int, rightIncrement: Int): Unit={ this.synchronized {…} } def getCountersTuple(): (Int, Int) = { this.synchronized{ (counters.left, counters.right) } } } object SynchronizedMutableCounters { val counters = new MutableCounters(0,0) … def increment(leftIncrement: Int, rightIncrement: Int): Unit = { counters.increment(leftIncrement, rightIncrement) } } 25 Locks just MutableCounters
  • 26. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Lock-Free AtomicReference Counters object LockFreeAtomicReferenceCounters { val counters = new AtomicReference[Counters](new Counters(0,0)) … def increment(leftIncrement: Int, rightIncrement: Int) : Long = { var originalCounters: Counters = null var updatedCounters: Counters = null do { originalCounters = getCounters() updatedCounters = new Counters(originalCounters.left+ leftIncrement, originalCounters.right+ rightIncrement) } // Retry lock-free, optimistic compareAndSet() until AtomicRef updates while !(counters.compareAndSet(originalCounters, updatedCounters)) } } 26 Lock Free!!
  • 27. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Lock-Free AtomicLong Counters object LockFreeAtomicLongCounters { // a single Long (64-bit) will maintain 2 separate Ints (32-bits each) val counters = new AtomicLong() … def increment(leftIncrement: Int, rightIncrement: Int): Unit = { var originalCounters = 0L var updatedCounters = 0L do { originalCounters = counters.get() … // Store two 32-bit Int into one 64-bit Long // Use >>> 32 and << 32 to set and retrieve each Int from the Long } // Retry lock-free, optimistic compareAndSet() until AtomicLong updates while !(counters.compareAndSet(originalCounters, updatedCounters)) } } 27 Q: Why not 
 @volatile long? A: The JVM does not guarantee atomic
 updates of 64-bit longs and doubles Lock Free!!
  • 28. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Results of Thread Synchronization Immutable Case Class 28 Lock-Free AtomicLong -64% -46% -17% -33% perf stat –event context-switches,L1-dcache-load-misses,L1-dcache-prefetch-misses, LLC-load-misses, LLC-prefetch-misses,cache-misses,stalled-cycles-frontend % Change -31% -32% -33% -27% case class Counters(left: Int, right: Int) ... this.synchronized { counters = new Counters(counters.left + leftIncrement, counters.right + rightIncrement) } val counters = new AtomicLong() … do { … } while !(counters.compareAndSet(originalCounters, 
 updatedCounters))
  • 29. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Profile Visualizations: Flame Graphs 29 Example: Spark Word Count Java Stack Traces are Good! JDK 1.8+ (-XX:-Inline -XX:+PreserveFramePointer) Plateaus are Bad! I/O stalls, Heavy CPU
 serialization, etc
  • 30. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Project Tungsten: CPU and Memory Create Custom Data Structures & Algorithms Operate on serialized and compressed ByteArrays! Minimize Garbage Collection Reuse ByteArrays In-place updates for aggregations Maximize CPU Cache Effectiveness 8-byte alignment AlphaSort-based Key-Prefix Utilize Catalyst Dynamic Code Generation Dynamic optimizations using entire query plan Developer implements genCode() to create Scala source code (String) Project Janino compiles source code into JVM bytecode 30
  • 31. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Why is CPU the Bottleneck? CPU for serialization, hashing, & compression Spark 1.2 updates saturated Network, Disk I/O 10x increase in I/O throughput relative to CPU More partition, pruning, and pushdown support Newer columnar file formats help reduce I/O 31
  • 32. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Custom Data Structs & Algos: Aggs UnsafeFixedWidthAggregationMap Uses BytesToBytesMap internally In-place updates of serialized aggregation No object creation on hot-path TungstenAggregate & TungstenAggregationIterator Operates directly on serialized, binary UnsafeRow 2 steps to avoid single-key OOMs   Hash-based (grouping) agg spills to disk if needed   Sort-based agg performs external merge sort on spills 32
  • 33. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Custom Data Structures & Algorithms o.a.s.util.collection.unsafe.sort. UnsafeSortDataFormat UnsafeExternalSorter UnsafeShuffleWriter UnsafeInMemorySorter RecordPointerAndKeyPrefix
 33 Ptr Key-Prefix 2x CPU Cache-line Friendly! SortDataFormat<RecordPointerAndKeyPrefix, Long[ ]>
 Note: Mixing multiple subclasses of SortDataFormat simultaneously will prevent JIT inlining. Supports merging compressed records (if compression CODEC supports it, ie. LZF) In-place external sorting of spilled BytesToBytes data AlphaSort-based, 8-byte aligned sort key In-place sorting of BytesToBytesMap data
  • 34. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Code Generation Problem Boxing creates excessive objects Expression tree evaluations are costly JVM can’t inline polymorphic impls Lack of polymorphism == poor code design Solution Code generation enables inlining Rewrite and optimize code using overall plan, 8-byte align Defer source code generation to each operator, UDF, UDAF Use Janino to compile generated source code into bytecode (More IDE friendly than Scala quasiquotes) 34
  • 35. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Autoscaling Spark Workers (Spark 1.5+) Scaling up is easy J SparkContext.addExecutors() until max is reached Scaling down is hard L SparkContext.removeExecutors() Lose RDD cache inside Executor JVM Must rebuild active RDD partitions in another Executor JVM Uses External Shuffle Service from Spark 1.1-1.2 If Executor JVM dies/restarts, shuffle keeps shufflin’! 35
  • 36. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark “Hidden” Spark Submit REST API http://arturmkrtchyan.com/apache-spark-hidden-rest-api Submit Spark Job curl -X POST http://127.0.0.1:6066/v1/submissions/create --header "Content-Type:application/json;charset=UTF-8" --data ’{"action" : "CreateSubmissionRequest”, "mainClass" : "org.apache.spark.examples.SparkPi”, "sparkProperties" : { "spark.jars" : "file:/spark/lib/spark-examples-1.5.1.jar", "spark.app.name" : "SparkPi",… }}’ Get Spark Job Status curl http://127.0.0.1:6066/v1/submissions/status/<job-id-from-submit-request> Kill Spark Job curl -X POST http://127.0.0.1:6066/v1/submissions/kill/<job-id-from-submit-request> 36 (the snitch)
  • 37. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Outline   Spark Streaming and Spark ML
 Kafka, Cassandra, ElasticSearch, Redis, Docker   Spark Core
 Tuning and Profiling   Spark SQL
 Tuning and Customizing 37
  • 38. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Parquet Columnar File Format Based on Google Dremel paper ~2010 Collaboration with Twitter and Cloudera Columnar storage format for fast columnar aggs Supports evolving schema Supports pushdowns Support nested partitions Tight compression Min/max heuristics enable file and chunk skipping 38 Min/Max Heuristics For Chunk Skipping
  • 39. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Partitions Partition Based on Data Access Patterns /genders.parquet/gender=M/… /gender=F/… <-- Use Case: Access Users by Gender /gender=U/… Dynamic Partition Creation (Write) Dynamically create partitions on write based on column (ie. Gender) SQL: INSERT TABLE genders PARTITION (gender) SELECT … DF: gendersDF.write.format("parquet").partitionBy("gender")
 .save("/genders.parquet") Partition Discovery (Read) Dynamically infer partitions on read based on paths (ie. /gender=F/…) SQL: SELECT id FROM genders WHERE gender=F DF: gendersDF.read.format("parquet").load("/genders.parquet/").select($"id"). .where("gender=F") 39
  • 40. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Pruning Partition Pruning Filter out rows by partition SELECT id, gender FROM genders WHERE gender = ‘F’ Column Pruning Filter out columns by column filter Extremely useful for columnar storage formats (ie. Parquet) Skip entire blocks of columns SELECT id, gender FROM genders 40
  • 41. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Pushdowns aka. Predicate or Filter Pushdowns Predicate returns true or false for given function Filters rows deep into the data source Reduces number of rows returned Data Source must implement PrunedFilteredScan def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row] 41
  • 42. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc Demo! File Formats, Partitions, Pushdowns, and Joins 42
  • 43. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Predicate Pushdowns & Filter Collapsing 43 Filter pushdown! No Extra Filter Pass Filter collapse 1 Extra Filter Pass 2 Extra Filter Passes
  • 44. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Join Between Partitioned & Unpartitioned 44
  • 45. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Join Between Partitioned & Partitioned 45
  • 46. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark 46 Cartesian Join vs. Inner Join
  • 47. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark 47 Broadcast Join vs. Normal Shuffle Join
  • 48. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Visualizing the Query Plan 48 Effectiveness of Filter Cost-based Join Optimization Similar to MapReduce
 Map-side Join & DistributedCache Peak Memory for Joins and Aggs UnsafeFixedWidthAggregationMap
 getPeakMemoryUsedBytes()
  • 49. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Data Source API Relations (o.a.s.sql.sources.interfaces.scala) BaseRelation (abstract class): Provides schema of data TableScan (impl): Read all data from source PrunedFilteredScan (impl): Column pruning & predicate pushdowns InsertableRelation (impl): Insert/overwrite data based on SaveMode RelationProvider (trait/interface): Handle options, BaseRelation factory Filters (o.a.s.sql.sources.filters.scala) Filter (abstract class): Handles all filters supported by this source EqualTo (impl) GreaterThan (impl) StringStartsWith (impl) 49
  • 50. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Native Spark SQL Data Sources 50
  • 51. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark JSON Data Source DataFrame val ratingsDF = sqlContext.read.format("json") .load("file:/root/pipeline/datasets/dating/ratings.json.bz2") -- or – val ratingsDF = sqlContext.read.json
 ("file:/root/pipeline/datasets/dating/ratings.json.bz2") SQL Code CREATE TABLE genders USING json OPTIONS (path "file:/root/pipeline/datasets/dating/genders.json.bz2") 51 json() convenience method
  • 52. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Parquet Data Source Configuration spark.sql.parquet.filterPushdown=true spark.sql.parquet.mergeSchema=false (unless your schema is evolving) spark.sql.parquet.cacheMetadata=true (requires sqlContext.refreshTable()) spark.sql.parquet.compression.codec=[uncompressed,snappy,gzip,lzo] DataFrames val gendersDF = sqlContext.read.format("parquet") .load("file:/root/pipeline/datasets/dating/genders.parquet") gendersDF.write.format("parquet").partitionBy("gender") .save("file:/root/pipeline/datasets/dating/genders.parquet") SQL CREATE TABLE genders USING parquet OPTIONS (path "file:/root/pipeline/datasets/dating/genders.parquet") 52
  • 53. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark ElasticSearch Data Source Github https://github.com/elastic/elasticsearch-hadoop Maven org.elasticsearch:elasticsearch-spark_2.10:2.1.0 Code val esConfig = Map("pushdown" -> "true", "es.nodes" -> "<hostname>", 
 "es.port" -> "<port>") df.write.format("org.elasticsearch.spark.sql”).mode(SaveMode.Overwrite) .options(esConfig).save("<index>/<document-type>") 53
  • 54. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Cassandra Data Source Github https://github.com/datastax/spark-cassandra-connector Maven com.datastax.spark:spark-cassandra-connector_2.10:1.5.0-M1 Code ratingsDF.write .format("org.apache.spark.sql.cassandra") .mode(SaveMode.Append) .options(Map("keyspace"->"<keyspace>", "table"->"<table>")).save(…) 54
  • 55. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Tips for Cassandra Analytics By-pass Cassandra CQL “front door” CQL Optimized for Transactions Bulk read and write directly against SSTables Check out Netflix OSS project “Aegisthus” Cassandra becomesa first-class analytics option Replicated analytics cluster no longer needed 55
  • 56. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark DB2 and BigSQL Data Sources (IBM) Coming Soon! 56
  • 57. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Creating a Custom Data Source ① Study existing implementations o.a.s.sql.execution.datasources.jdbc.JDBCRelation ② Extend base traits & implement required methods o.a.s.sql.sources.{BaseRelation,PrunedFilterScan} Spark JDBC (o.a.s.sql.execution.datasources.jdbc) class JDBCRelation extends BaseRelation with PrunedFilteredScan with InsertableRelation DataStax Cassandra (o.a.s.sql.cassandra) class CassandraSourceRelation extends BaseRelation with PrunedFilteredScan with InsertableRelation! 57
  • 58. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc Demo! Create a Custom Data Source 58
  • 59. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Publishing Custom Data Sources 59 spark-packages.org
  • 60. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc IBM | spark.tc Spark SQL UDF Code Generation 100+ UDFs now generating code More to come in Spark 1.6+ Details in 
 SPARK-8159, SPARK-9571 Every UDF must use Expressions and
 implement Expression.genCode() to participate in the fun Lambdas (RDD or Dataset API) and sqlContext.udf.registerFunction() are not enough!!
  • 61. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Creating a Custom UDF with Code Gen ① Study existing implementations o.a.s.sql.catalyst.expressions.Substring ② Extend and implement base trait o.a.s.sql.catalyst.expressions.Expression.genCode ③ Don’t forget about Python! python.pyspark.sql.functions.py 61
  • 62. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc Demo! Creating a Custom UDF participating in Code Generation 62
  • 63. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc Spark 1.6 Improvements Adaptiveness, Metrics, Datasets, and Streaming State 63
  • 64. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Adaptive Query Execution Adapt query execution using data from previous stages Dynamically choose spark.sql.shuffle.partitions (default 200) 64 Broadcast Join (popular keys) Shuffle Join (not-so-popular keys) Adaptive Hybrid
 Join
  • 65. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Adaptive Memory Management Spark <1.6 Manual configure between 2 memory regions Spark execution engine (shuffles, joins, sorts, aggs) spark.shuffle.memoryFraction RDD Data Cache spark.storage.memoryFraction Spark 1.6+ Unified memory regions Dynamically expand/contract memory regions Supports minimum for RDD storage (LRU Cache) 65
  • 66. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Metrics Shows exact memory usage per operator & per node Helps debugging and identifying skew 66
  • 67. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Spark SQL API Datasets type safe API (similar to RDDs) utilizing Tungsten val ds = sqlContext.read.text("ratings.csv").as[String] val df = ds.flatMap(_.split(",")).filter(_ != "").toDF() // RDD API, convert to DF val agg = df.groupBy($"rating").agg(count("*") as "ct”).orderBy($"ct" desc) Typed Aggregators used alongside UDFs and UDAFs val simpleSum = new Aggregator[Int, Int, Int] with Serializable { def zero: Int = 0 def reduce(b: Int, a: Int) = b + a def merge(b1: Int, b2: Int) = b1 + b2 def finish(b: Int) = b }.toColumn val sum = Seq(1,2,3,4).toDS().select(simpleSum) Query files directly without registerTempTable() %sql SELECT * FROM json.`/datasets/movielens/ml-latest/movies.json` 67
  • 68. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Spark Streaming State Management New trackStateByKey() Store deltas, compact later More efficient per-key state update Session TTL Integrated A/B Testing (?!) Show Failed Output in Admin UI Better debugging 68
  • 69. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Thank You!!! Chris Fregly IBM Spark Tech Center (http://spark.tc) San Francisco, California, USA advancedspark.com Sign up for the Meetup and Book Contribute on Github Run All Demos in Docker 2100+ Docker Downloads!! Find me on LinkedIn, Twitter, Github, Email, Fax 69
  • 70. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc What’s Next? 70 Advanced Spark
  • 71. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Upcoming Features of Advanced Spark Autoscaling Spark Workers Completely Docker-based Docker Compose, Google Kubernetes Lots of Demos and Examples More Zeppelin, iPython/Jupyter notebooks More advanced analytics use cases Performance Profiling Work closely with Netflix and Databricks Identify and fix performance bottlenecks 71
  • 72. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark World Tour: Freg-a-Palooza! London Spark Meetup (Oct 12th) Scotland Data Science Meetup (Oct 13th) Dublin Spark Meetup (Oct 15th) Barcelona Spark Meetup (Oct 20th) Madrid Big Data Meetup (Oct 22nd) Paris Spark Meetup (Oct 26th) Amsterdam Spark Summit (Oct 27th) Brussels Spark Meetup (Oct 30th) Zurich Big Data Meetup (Nov 2nd) Geneva Spark Meetup (Nov 5th) 72 Oslo Big Data Hadoop Meetup (Nov 19th) Helsinki Spark Meetup (Nov 20th) Stockholm Spark Meetup (Nov 23rd) Copenhagen Spark Meetup (Nov 25th) Budapest Spark Meetup (Nov 26th) Istanbul Spark Meetup (Nov 28th) Singapore Spark Meetup (Dec 1st) Sydney Spark Meetup (Dec 8th) Melbourne Spark Meetup (Dec 9th) Toronto Spark Meetup (Dec 14th) I’M DONE FOR 2015!!
  • 73. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark advancedspark.com @cfregly