IBM Spark
 spark.tc
After Dark 1.5
High Performance, Real-time, Streaming,
Machine Learning, Natural Language Processing,
Text Analytics, and Recommendations

Chris Fregly
Principal Data Solutions Engineer
IBM Spark Technology Center
** We’re Hiring -- Only Nice People, Please!! **
Paris Spark Meetup
October 26, 2015
Power of data. Simplicity of design. Speed of innovation.
Who Am I?

Streaming Data Engineer
Netflix Open Source Committer

Data Solutions Engineer

Apache Contributor

Principal Data Solutions Engineer
IBM Spark Technology Center
Meetup Organizer
Advanced Apache Spark Meetup
Book Author
Advanced (2016)
Advanced Apache Spark Meetup
Meetup Metrics
1400+ members in just 3 mos!
4th most active Spark Meetup!!
meetup.com/Advanced-Apache-Spark-Meetup
Meetup Goals
  Dig deep into Spark & extended-Spark codebase
  Study integrations incl. Cassandra, ElasticSearch, Tachyon, S3, BlinkDB, Mesos, YARN, Kafka, R
  Surface & share patterns & idioms of these well-designed, distributed, big data components
Upcoming Meetups and Conferences
London Spark Meetup (Oct 12th)
Scotland Data Science Meetup (Oct 13th)
Dublin Spark Meetup (Oct 15th)
Barcelona Spark Meetup (Oct 20th)
Madrid Spark/Big Data Meetup (Oct 22nd)
Paris Spark Meetup (Oct 26th)
Amsterdam Spark Summit & Meetup (Oct 27th)
Delft Dutch Data Science Meetup (Oct 29th) 
Brussels Spark Meetup (Oct 30th)
Zurich Big Data Developers Meetup (Nov 2nd)
Geneva Spark Meetup (Nov 5th)
San Francisco Datapalooza (Nov 10th)
San Francisco Advanced Apache Spark Meetup (Nov 12th)
Oslo Big Data Hadoop Meetup (Nov 18th)
Helsinki Spark Meetup (Nov 20th)
Stockholm Spark Meetup (Nov 23rd)
Copenhagen Spark Meetup (Nov 25th)
Budapest Spark Meetup (Nov 27th)
Singapore Strata Conference (Dec 1st)
San Francisco Advanced Apache Spark Meetup (Dec 8th)
Mountain View Advanced Apache Spark Meetup (Dec 10th)
Washington DC Advanced Apache Spark Meetup (Dec 17th)

Freg-a-palooza!
What is Spark After Dark?
Fun, Spark-based dating reference application 
*Not a movie recommendation engine!!
Generate recommendations based on user similarity
Demonstrate Apache Spark and related big data projects
Tools of this Talk
  Redis
  Docker
  Ganglia
  Streaming, Kafka
  Cassandra, NoSQL
  Parquet, JSON, ORC, Avro
  Apache Zeppelin Notebooks
  Spark SQL, DataFrames, Hive
  ElasticSearch, Logstash, Kibana
  Spark ML, GraphX, Stanford CoreNLP
and…
Overall Themes of this Talk
  Filter Early, Filter Deep
  Approximations are OK
  Minimize Random Seeks
  Maximize Sequential Scans
  Go Off-Heap when Possible
  Parallelism is Required at Scale
  Must Reduce Dimensions at Scale
  Seek Performance Gains at all Layers
  Customize Data Structs for your Workload
  Be Nice and Collaborate with your Peers!
High-Level Sections
Spark Core: Performance Tuning
Spark SQL: DataSources and Tuning
Spark Streaming: Scale, Tuning, Approx
Spark ML: Scale, Dim Reduce, NLP 
Spark Core: Performance Tuning
Acknowledging Mechanical Sympathy

100TB Daytona GraySort Challenge

Project Tungsten
Acknowledging Mechanical Sympathy
“Hardware and software working together in harmony”
  -Martin Thompson, http://mechanical-sympathy.blogspot.com

Spark Mechanical Sympathy Concerns
  Saturate Network I/O
  Saturate Disk I/O
  Minimize Memory Footprint and GC
  Maximize CPU Cache Locality
Spark and Mechanical Sympathy
Daytona GraySort (Spark 1.1-1.2)
  Saturate Network I/O
  Saturate Disk I/O
Project Tungsten (Spark 1.4-1.6)
  Minimize Memory and GC
  Maximize CPU Cache Locality
AlphaSort Trick for Sorting
AlphaSort paper, 1995: Chris Nyberg and Jim Gray
Naïve: List (Pointer-to-Record)
  Requires Key to be dereferenced for comparison
AlphaSort: List (Key, Pointer)
  Key is directly available for comparison
CPU Cache Line and Memory Sympathy
Key (10 bytes) + Pointer (4 bytes*) = 14 bytes
  *4 bytes when using compressed OOPs (<32 GB heap)
  Not a power of two in size
  Not CPU-cache friendly
Add Padding (2 bytes)
  Key (10 bytes) + Pad (2 bytes) + Pointer (4 bytes) = 16 bytes
  Cache-line friendly
Key-Prefix, Pointer
  Prefix (4 bytes) + Pointer (4 bytes) = 8 bytes
  2x cache-line friendly
  Key distribution affects perf
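The key-prefix layout can be sketched in plain Java. This is an illustration only, not Spark's implementation: the class name KeyPrefixSort is hypothetical, keys are assumed to be ASCII strings, and a 4-byte prefix is packed with a 4-byte record index into one 8-byte long so the sort touches only the packed array.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class KeyPrefixSort {
    // Pack a 4-byte key prefix (high 32 bits) with the record index (low 32 bits).
    // Assumes ASCII keys, so the prefix never sets the sign bit of the long.
    static long pack(byte[] key, int index) {
        long prefix = 0;
        for (int i = 0; i < 4; i++) {
            int b = i < key.length ? (key[i] & 0xFF) : 0; // zero-pad short keys
            prefix = (prefix << 8) | b;
        }
        return (prefix << 32) | (index & 0xFFFFFFFFL);
    }

    static String[] sort(String[] records) {
        int n = records.length;
        long[] packed = new long[n];
        for (int i = 0; i < n; i++) {
            packed[i] = pack(records[i].getBytes(StandardCharsets.US_ASCII), i);
        }
        Arrays.sort(packed); // compares 8-byte values only; no record is dereferenced

        String[] out = new String[n];
        for (int i = 0; i < n; i++) {
            out[i] = records[(int) packed[i]]; // low 32 bits hold the record index
        }
        // Equal prefixes are ambiguous: compare full keys within each such run.
        int start = 0;
        for (int i = 1; i <= n; i++) {
            if (i == n || (packed[i] >>> 32) != (packed[start] >>> 32)) {
                if (i - start > 1) Arrays.sort(out, start, i);
                start = i;
            }
        }
        return out;
    }
}
```

The fallback for equal prefixes is why key distribution matters: if many keys share their first 4 bytes, most comparisons end up dereferencing the full key anyway.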
Performance Comparison
Similar Technique: Direct Cache Access
Packet header placed into CPU cache

CPU Cache Lines
Instrumenting and Monitoring CPU
Linux perf command!
CPU Cache Naïve Matrix Multiplication
// Find dot product of each row and column vector
for (i = 0; i < N; ++i)
  for (j = 0; j < N; ++j)
    for (k = 0; k < N; ++k)
      res[i][j] += matA[i][k] * matB[k][j];

Skipping row-wise, not using full CPU cache line, ineffective pre-fetching
CPU Cache Friendly Matrix Multiplication
// Transpose B
for (i = 0; i < N; ++i)
  for (j = 0; j < N; ++j)
    matBtran[i][j] = matB[j][i];

// Modify dot product calculation for B transpose
for (i = 0; i < N; ++i)
  for (j = 0; j < N; ++j)
    for (k = 0; k < N; ++k)
      res[i][j] += matA[i][k] * matBtran[j][k];

Good use of CPU cache line, effective prefetching
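Both loop nests can be put side by side in a small self-contained Java class. This is a sketch, not the talk's demo code: the class name MatrixMultiply is hypothetical and square matrices are assumed. Both methods compute the same product; only the order in which memory is read differs.

```java
class MatrixMultiply {
    // Naive: the inner loop walks matB column-wise, jumping to a new row on every step.
    static double[][] naive(double[][] matA, double[][] matB) {
        int n = matA.length;
        double[][] res = new double[n][n];
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < n; ++j)
                for (int k = 0; k < n; ++k)
                    res[i][j] += matA[i][k] * matB[k][j];
        return res;
    }

    // Cache friendly: transpose B once so the inner loop scans both operands sequentially.
    static double[][] cacheFriendly(double[][] matA, double[][] matB) {
        int n = matA.length;
        double[][] matBtran = new double[n][n];
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < n; ++j)
                matBtran[i][j] = matB[j][i];

        double[][] res = new double[n][n];
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < n; ++j)
                for (int k = 0; k < n; ++k)
                    res[i][j] += matA[i][k] * matBtran[j][k];
        return res;
    }
}
```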
Demo!
Comparing CPU Naïve & Cache-Friendly Matrix Multiplication
Results of Naïve vs. Cache Friendly
Naïve Matrix Multiply vs. Cache Friendly Matrix Multiply
Ratios shown on the slide: ~72x, ~8x, ~3x, ~3x, ~2x, ~7x, ~10x
perf stat --repeat 5 --scale --event L1-dcache-load-misses,L1-dcache-prefetch-misses,LLC-load-misses,LLC-prefetch-misses,cache-misses,stalled-cycles-frontend \
  java -Xmx13G -XX:-Inline -jar ~/sbt/bin/sbt-launch.jar "tungsten/run-main com.advancedspark.tungsten.matrix.Cache[Friendly|Naïve]MatrixMultiply 256 1"
Visualizing and Finding Hotspots
Flame Graphs with Java Stack Traces
Images courtesy of http://techblog.netflix.com/2015/07/java-in-flames.html
100TB Daytona GraySort Challenge
Focus on Network and Disk I/O Optimizations
Improve Data Structs/Algos for Sort & Shuffle
Saturate Network and Disk Controllers
Winning Results
Spark Goals:
  Saturate Network I/O
  Saturate Disk I/O
(Table: 2013 entry vs. 2014 entry)
Winning Hardware Configuration
Compute
  206 EC2 Worker nodes, 1 Master node
  AWS i2.8xlarge
  32 Intel Xeon CPU E5-2670 @ 2.5 GHz
  244 GB RAM, 8 x 800GB SSD, RAID 0 striping, ext4
  NOOP I/O scheduler: FIFO, request merging, no reordering
  3 GBps mixed read/write disk I/O per node
Network
  Deployed within Placement Group/VPC
  Using AWS Enhanced Networking
  Single Root I/O Virtualization (SR-IOV): extension of PCIe
  10 Gbps, low latency, low jitter (iperf showed ~9.5 Gbps)
Winning Software Configuration
Spark 1.2, OpenJDK 1.7_<amazon-something>_u65-b17
Disable caching, compression, speculative execution, shuffle spill
Force NODE_LOCAL task scheduling for optimal data locality
HDFS 2.4.1 short-circuit for local reads, 2x replication
Spark recommendation: 4-6 partitions (tasks) per core
  206 nodes * 32 cores = 6592 cores
  6592 cores * 4 = 26,368 partitions
  6592 cores * 6 = 39,552 partitions
  6592 cores * 4.25 = ~28,000 partitions was empirically best
Range partitioning takes advantage of sequential keyspace
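The partition-count arithmetic can be reproduced with a few lines of Java. The class and method names are hypothetical; the node count, cores per node and per-core factors are the ones quoted on the slide.

```java
class PartitionSizing {
    // Total cores available across the cluster.
    static int totalCores(int nodes, int coresPerNode) {
        return nodes * coresPerNode;
    }

    // Partition count for a given number of partitions per core.
    static long partitions(int cores, double partitionsPerCore) {
        return Math.round(cores * partitionsPerCore);
    }

    public static void main(String[] args) {
        int cores = totalCores(206, 32);
        System.out.println("cores = " + cores);
        System.out.println("4 per core    = " + partitions(cores, 4));
        System.out.println("6 per core    = " + partitions(cores, 6));
        System.out.println("4.25 per core = " + partitions(cores, 4.25));
    }
}
```

With a factor of 4.25 the exact product is 28,016, which the slide rounds to 28,000.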
New Shuffle Manager
New “Sort-based” shuffle manager replaces Hash-based
New Data Structures and Algos for Shuffle Sort
  i.e. New TimSort for Arrays of (K,V) Pairs
New Network Module
Replaces old java.nio, low-level, socket-based code
Zero-copy epoll: data stays in kernel space between disk & network
Custom memory management
spark.shuffle.blockTransferService=netty
Spark-Netty Performance Tuning
  spark.shuffle.io.numConnectionsPerPeer
    Increase to saturate hosts with multiple disks
  spark.shuffle.io.preferDirectBuffers
    On or Off-heap (Off-heap is default)
New Algorithms and Data Structures
Optimized for sort and shuffle
o.a.s.util.collection.TimSort
  Based on JDK 1.7 TimSort
  Performs best on partially-sorted datasets
  Optimized for elements of (K,V) pairs
  Sorts impl of SortDataFormat (i.e. KVArraySortDataFormat)
o.a.s.util.collection.AppendOnlyMap
  Open addressing hash, quadratic probing
  Array of [(K, V), (K, V)]
  Good memory locality
  Keys never removed, values only append
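The AppendOnlyMap idea (open addressing, quadratic probing, keys and values interleaved in one flat array) can be sketched as follows. This is a simplified illustration rather than Spark's source: the class name FlatAppendOnlyMap, the 0.7 load factor and the hash spreading step are assumptions, and null keys are not supported.

```java
class FlatAppendOnlyMap {
    private Object[] data = new Object[16]; // [k0, v0, k1, v1, ...]
    private int capacity = 8;               // always a power of two
    private int size = 0;

    int size() { return size; }

    Object get(Object key) {
        int mask = capacity - 1;
        int pos = spread(key) & mask;
        for (int step = 1; ; step++) {
            Object k = data[2 * pos];
            if (k == null) return null;           // empty slot: key is absent
            if (k.equals(key)) return data[2 * pos + 1];
            pos = (pos + step) & mask;            // offsets 1, 3, 6, 10, ... (quadratic)
        }
    }

    void put(Object key, Object value) {
        if ((size + 1) * 10 > capacity * 7) grow(); // keep load factor at or below 0.7
        if (insert(data, capacity, key, value)) size++;
    }

    private void grow() {
        int newCapacity = capacity * 2;
        Object[] newData = new Object[2 * newCapacity];
        for (int i = 0; i < capacity; i++) {
            if (data[2 * i] != null) insert(newData, newCapacity, data[2 * i], data[2 * i + 1]);
        }
        data = newData;
        capacity = newCapacity;
    }

    private static int spread(Object key) {
        int h = key.hashCode();
        return h ^ (h >>> 16);
    }

    // Returns true when a new key was added, false when an existing value was replaced.
    private static boolean insert(Object[] arr, int cap, Object key, Object value) {
        int mask = cap - 1;
        int pos = spread(key) & mask;
        for (int step = 1; ; step++) {
            Object k = arr[2 * pos];
            if (k == null) {
                arr[2 * pos] = key;
                arr[2 * pos + 1] = value;
                return true;
            }
            if (k.equals(key)) {
                arr[2 * pos + 1] = value;
                return false;
            }
            pos = (pos + step) & mask;
        }
    }
}
```

Because there is no remove operation, no tombstones are needed, and a key and its value sit next to each other in memory, so a lookup that finds the key usually finds the value in the same cache line.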
Met Performance Goals!
Reducers: 1.1 GBps/node network I/O
(theoretical max = 1.25 GBps for 10 Gbps Ethernet)
Mappers: 3 GBps/node disk I/O (8 x 800GB SSD)
206 nodes * 1.1 GBps/node ~= 220 GBps
Shuffle Performance Tuning Tips
Hash Shuffle Manager (no longer default)
  spark.shuffle.consolidateFiles: mapper output files
  o.a.s.shuffle.FileShuffleBlockResolver
Intermediate Files
  Increase spark.shuffle.file.buffer: reduce seeks & sys calls
  Increase spark.reducer.maxSizeInFlight if memory allows
  Use smaller number of larger workers to reduce total files
SQL: BroadcastHashJoin vs. ShuffledHashJoin
  spark.sql.autoBroadcastJoinThreshold
  Use DataFrame.explain(true) or EXPLAIN to verify
Project Tungsten
Focus on CPU Cache and Memory Optimizations
Further Improve Data Structures and Algorithms
Operate on Serialized/Compressed Data
Provide Path to Off Heap
Why is CPU the Bottleneck?
Network and Disk I/O bandwidth are relatively high
GraySort optimizations improved network & shuffle
More partitioning, pruning, and predicate pushdowns
Popularity of columnar file formats like Parquet/ORC
CPU is used for serialization, hashing, compression!
Spark Shuffle Managers
spark.shuffle.manager =
hash: < 10,000 Reducers
  Output file determined by hashing the key of (K,V) pair
  Each mapper creates an output buffer/file per reducer
  Leads to M*R number of output buffers/files per shuffle
sort: >= 10,000 Reducers
  Default since Spark 1.2
  Minimizes OS resources
  Uses Netty to optimize Network I/O
  Created custom Data Structs/Algos
  Wins Daytona GraySort Challenge
unsafe -> Tungsten, Default in Spark 1.5
  Uses sun.misc.Unsafe to self-manage binary array buffers
  Uses custom serialization format
  Can operate on compressed and serialized buffers
New Data Structures
“I don’t know your data structure, but my array[] will beat it!”
Custom Data Structures for Sort/Shuffle Workload
  UnsafeRow
  BytesToBytesMap
sun.misc.Unsafe
Info
  addressSize()
  pageSize()
Objects
  allocateInstance()
  objectFieldOffset()
Classes
  staticFieldOffset()
  defineClass()
  defineAnonymousClass()
  ensureClassInitialized()
Synchronization
  monitorEnter()
  tryMonitorEnter()
  monitorExit()
  compareAndSwapInt()
  putOrderedInt()
Arrays
  arrayBaseOffset()
  arrayIndexScale()
Memory
  allocateMemory()
  copyMemory()
  freeMemory()
  getAddress() – not guaranteed after GC
  getInt()/putInt()
  getBoolean()/putBoolean()
  getByte()/putByte()
  getShort()/putShort()
  getLong()/putLong()
  getFloat()/putFloat()
  getDouble()/putDouble()
  getObjectVolatile()/putObjectVolatile()
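A few of the Memory calls listed above are enough to build an off-heap array. The sketch below is illustrative only: the class name OffHeapLongArray is hypothetical, obtaining the instance through the private theUnsafe field is a common but unsupported technique, and recent JDKs may print deprecation warnings for these methods.

```java
import java.lang.reflect.Field;
import sun.misc.Unsafe;

class OffHeapLongArray {
    private static final Unsafe UNSAFE = loadUnsafe();
    private final long address; // raw address of the off-heap block
    private final int length;

    private static Unsafe loadUnsafe() {
        try {
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            return (Unsafe) f.get(null);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("sun.misc.Unsafe is not available", e);
        }
    }

    OffHeapLongArray(int length) {
        this.length = length;
        this.address = UNSAFE.allocateMemory(8L * length); // not tracked by the GC
        UNSAFE.setMemory(address, 8L * length, (byte) 0);  // allocateMemory does not zero
    }

    long get(int i) {
        check(i);
        return UNSAFE.getLong(address + 8L * i);
    }

    void set(int i, long value) {
        check(i);
        UNSAFE.putLong(address + 8L * i, value);
    }

    // Must be called explicitly; the garbage collector never reclaims this memory.
    void free() {
        UNSAFE.freeMemory(address);
    }

    private void check(int i) {
        if (i < 0 || i >= length) throw new IndexOutOfBoundsException("index " + i);
    }
}
```

The bounds check is the sketch's own addition: Unsafe performs none, so an out-of-range address can crash the JVM instead of throwing an exception.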
Spark + sun.misc.Unsafe
org.apache.spark.sql.execution.
aggregate.SortBasedAggregate
aggregate.TungstenAggregate
aggregate.AggregationIterator
aggregate.udaf
aggregate.utils
SparkPlanner
rowFormatConverters
UnsafeFixedWidthAggregationMap
UnsafeExternalSorter
UnsafeExternalRowSorter
UnsafeKeyValueSorter
UnsafeKVExternalSorter
local.ConvertToUnsafeNode
local.ConvertToSafeNode
local.HashJoinNode
local.ProjectNode
local.LocalNode
local.BinaryHashJoinNode
local.NestedLoopJoinNode
joins.HashJoin
joins.HashSemiJoin
joins.HashedRelation
joins.BroadcastHashJoin
joins.ShuffledHashOuterJoin (not yet converted)
joins.BroadcastHashOuterJoin
joins.BroadcastLeftSemiJoinHash
joins.BroadcastNestedLoopJoin
joins.SortMergeJoin
joins.LeftSemiJoinBNL
joins.SortMergeOuterJoin
Exchange
SparkPlan
UnsafeRowSerializer
SortPrefixUtils
sort
basicOperators
aggregate.SortBasedAggregationIterator
aggregate.TungstenAggregationIterator
datasources.WriterContainer
datasources.json.JacksonParser
datasources.jdbc.JDBCRDD
org.apache.spark.
unsafe.Platform
unsafe.KVIterator
unsafe.array.LongArray
unsafe.array.ByteArrayMethods
unsafe.array.BitSet
unsafe.bitset.BitSetMethods
unsafe.hash.Murmur3_x86_32
unsafe.map.BytesToBytesMap
unsafe.map.HashMapGrowthStrategy
unsafe.memory.TaskMemoryManager
unsafe.memory.ExecutorMemoryManager
unsafe.memory.MemoryLocation
unsafe.memory.UnsafeMemoryAllocator
unsafe.memory.MemoryAllocator (trait/interface)
unsafe.memory.MemoryBlock
unsafe.memory.HeapMemoryAllocator
unsafe.sort.RecordComparator
unsafe.sort.PrefixComparator
unsafe.sort.PrefixComparators
unsafe.sort.UnsafeSorterSpillWriter
serializer.DummySerializationInstance
shuffle.unsafe.UnsafeShuffleManager
shuffle.unsafe.UnsafeShuffleSortDataFormat
shuffle.unsafe.SpillInfo
shuffle.unsafe.UnsafeShuffleWriter
shuffle.unsafe.UnsafeShuffleExternalSorter
shuffle.unsafe.PackedRecordPointer
shuffle.ShuffleMemoryManager
util.collection.unsafe.sort.UnsafeSorterSpillMerger
util.collection.unsafe.sort.UnsafeSorterSpillReader
util.collection.unsafe.sort.UnsafeSorterSpillWriter
util.collection.unsafe.sort.UnsafeShuffleInMemorySorter
util.collection.unsafe.sort.UnsafeInMemorySorter
util.collection.unsafe.sort.RecordPointerAndKeyPrefix
util.collection.unsafe.sort.UnsafeSorterIterator
network.shuffle.ExternalShuffleBlockResolver
scheduler.Task
rdd.SqlNewHadoopRDD
executor.Executor
org.apache.spark.sql.catalyst.expressions.
regexpExpressions
BoundAttribute
SortOrder
SpecializedGetters
ExpressionEvalHelper
UnsafeArrayData
UnsafeReaders
UnsafeMapData
Projection
LiteralGenerator
UnsafeRow
JoinedRow
InputFileName
SpecificMutableRow
codegen.CodeGenerator
codegen.GenerateProjection
codegen.GenerateUnsafeRowJoiner
codegen.GenerateSafeProjection
codegen.GenerateUnsafeProjection
codegen.BufferHolder
codegen.UnsafeRowWriter
codegen.UnsafeArrayWriter
complexTypeCreator
rows
literals
misc
stringExpressions
Over 200 source files affected!!
CPU & Memory Optimizations
Custom Managed Memory
  Reduces GC overhead
  Both on and off heap
  Exact size calculations
Direct Binary Processing
  Operate on serialized/compressed arrays
  Kryo can reorder serialized records
  LZF can reorder compressed records
More CPU Cache-aware Data Structs & Algorithms
  o.a.s.unsafe.map.BytesToBytesMap vs. j.u.HashMap
Code Generation (default in 1.5)
  Generate source code from overall query plan
  Janino generates bytecode from source code
  100+ UDFs converted to use code generation
Classes called out on the slide
  UnsafeFixedWidthAggregationMap & TungstenAggregationIterator
  CodeGenerator & GenerateUnsafeRowJoiner
  UnsafeSortDataFormat & UnsafeShuffleSortDataFormat & PackedRecordPointer & UnsafeRow
  UnsafeInMemorySorter & UnsafeExternalSorter & UnsafeShuffleWriter
  Mostly Same Join Code, added if (isUnsafeMode)
  UnsafeShuffleManager & UnsafeShuffleInMemorySorter & UnsafeShuffleExternalSorter
Details in SPARK-7075
Code Generation (Default in 1.5)
Problem
Generic expression evaluation
Expensive on JVM
Virtual func calls
Branches based on expression type
Boxing causes excessive object creation 
Implementation
Defer source code generation to each operator, type, etc
Scala quasiquotes provide AST manipulation & rewriting
Generates source code, compiled to bytecode w/ Janino
100+ UDFs now using code gen
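The cost of generic expression evaluation can be illustrated without Spark. In this sketch (all names hypothetical) the interpreted path pays a virtual call per node and boxes every intermediate value, while generatedAddAB1 stands in for the kind of specialized, straight-line code a generator could emit for the expression a + b + 1. It is hand-written here and not produced by Janino.

```java
class CodegenSketch {
    // Generic, interpreted evaluation: one virtual call per node, boxed results.
    interface Expression {
        Object eval(Object[] row);
    }

    static Expression column(int ordinal) {
        return row -> row[ordinal];
    }

    static Expression literal(Object value) {
        return row -> value;
    }

    static Expression add(Expression left, Expression right) {
        // Unboxes both operands, adds them, and boxes the result again.
        return row -> (Integer) left.eval(row) + (Integer) right.eval(row);
    }

    // Stand-in for generated code for "a + b + 1": primitives only, no dispatch.
    static int generatedAddAB1(int[] row) {
        return row[0] + row[1] + 1;
    }
}
```

Both paths give the same answer for the same row; the difference is the per-row overhead, which is what code generation removes.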
Code Generation: Spark SQL UDFs
100+ UDFs now using code gen – More to come in Spark 1.6!
Details in SPARK-8159
Project Tungsten in Other Spark Libraries
SortDataFormat<K, Buffer>: Base trait
  UncompressedInBlockSort: MLlib.ALS
  EdgeArraySortDataFormat: GraphX.Edge
Spark SQL: DataSources and Tuning
Understand Partitions, Pruning, Predicate Pushdowns

Understand DataFrames, Catalyst, DataSources

Create a DataSource Implementation

Partitions
Partition based on data usage patterns
  /genders.parquet/gender=M/…
                  /gender=F/… <-- Use case: access users by gender
                  /gender=U/…
Partition Discovery (Read Path)
  Infer partitions from organization of data (i.e. gender=F)
Dynamic Partitions (Write Path)
  Dynamically create partitions based on given column(s)
  SQL: INSERT INTO TABLE genders PARTITION (gender) SELECT …
  DF: gendersDF.write.format("parquet").partitionBy("gender").save(…)
Pruning
Partition Pruning
  Skip entire partitions of pre-partitioned rows that cannot match the filter
  SELECT id, gender FROM genders WHERE gender = 'U'   (gender = partition key)
Column Pruning
  Filter out entire columns for all rows if not required
  Optimized for columnar storage formats (Parquet)
  Minimize data shuffle during joins
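Partition pruning amounts to choosing directories before any file is opened. The sketch below models that with plain strings; the class name, method names and the part-0 file names are hypothetical, and only the gender=<value> directory convention comes from the slides.

```java
import java.util.ArrayList;
import java.util.List;

class PartitionPruning {
    // Extract the value of a partition column from a path segment such as "gender=F".
    static String partitionValue(String path, String column) {
        for (String segment : path.split("/")) {
            if (segment.startsWith(column + "=")) {
                return segment.substring(column.length() + 1);
            }
        }
        return null; // path is not partitioned by this column
    }

    // Keep only the paths whose partition value matches the filter.
    static List<String> prune(List<String> paths, String column, String wanted) {
        List<String> kept = new ArrayList<>();
        for (String path : paths) {
            if (wanted.equals(partitionValue(path, column))) {
                kept.add(path);
            }
        }
        return kept;
    }
}
```

For the query WHERE gender = 'U', only the gender=U directory survives, so the files under gender=M and gender=F are never read.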
Predicate Pushdowns
“Predicate” == “Filter”
Filters rows as deep into the data source as possible
Predicate returns [true|false] for given func/condition
Putting It All Together
Reduce Columns: Column Pruning
Reduce Rows: Partitioning, Predicate Pushdown
SELECT b FROM table WHERE a IN (a2, a3)

DataFrames Overview
Inspired by R and Pandas DataFrames
Cross language support

SQL, Python, Scala, Java, R
Levels performance of Python, Scala, Java, and R

Generates JVM bytecode vs serialize/pickle to Python
DataFrame is Container for Logical Plan

Lazy transformations represented as tree

Catalyst Optimizer creates physical plan
DataFrame.rdd returns the underlying RDD if needed
Custom UDF using registerFunction()
New, experimental UDAF support
Use DataFrames instead of RDDs!!
Catalyst Optimizer



Optimize DataFrame Transformation Tree
Subquery elimination: use aliases to collapse subqueries
Constant folding: replace expression with constant
Simplify filters: remove unnecessary filters
Predicate/filter pushdowns: avoid unnecessary data load
Projection collapsing: avoid unnecessary projections
Create Custom Rules
Rules are Scala Case Classes
  Implement o.a.s.sql.catalyst.rules.Rule
  Apply to any stage
val newPlan = MyFilterRule(analyzedPlan)
JVM code generation
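The shape of a rule, a function from one plan tree to an equivalent tree, can be shown with a toy expression tree. This is not Catalyst's API: MiniCatalyst and its Expr, Lit, Attr, Add and Rule types are hypothetical, and constant folding is used because it is the simplest optimization on the list.

```java
class MiniCatalyst {
    interface Expr {}

    static final class Lit implements Expr {
        final int value;
        Lit(int value) { this.value = value; }
        @Override public String toString() { return Integer.toString(value); }
    }

    static final class Attr implements Expr {
        final String name;
        Attr(String name) { this.name = name; }
        @Override public String toString() { return name; }
    }

    static final class Add implements Expr {
        final Expr left;
        final Expr right;
        Add(Expr left, Expr right) { this.left = left; this.right = right; }
        @Override public String toString() { return "(" + left + " + " + right + ")"; }
    }

    // A rule rewrites a tree into an equivalent tree.
    interface Rule {
        Expr apply(Expr plan);
    }

    // Constant folding: replace Add(Lit, Lit) with a single Lit, bottom-up.
    static final class ConstantFolding implements Rule {
        @Override public Expr apply(Expr plan) {
            if (!(plan instanceof Add)) return plan;
            Add add = (Add) plan;
            Expr l = apply(add.left);
            Expr r = apply(add.right);
            if (l instanceof Lit && r instanceof Lit) {
                return new Lit(((Lit) l).value + ((Lit) r).value);
            }
            return new Add(l, r);
        }
    }
}
```

Applied to a + (1 + 2), the rule returns a + 3: the literal subtree collapses once at planning time instead of being evaluated for every row.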
Columnar Storage Format
Skip whole chunks with min-max heuristics stored in each chunk (sorted data only)
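Min-max chunk skipping can be sketched as follows. The class and its Chunk type are hypothetical and are not Parquet's actual metadata structures: each chunk records its minimum and maximum, and a chunk is scanned only when the target value falls inside that range.

```java
class MinMaxChunkSkipping {
    static final class Chunk {
        final int[] values;
        final int min;
        final int max;

        Chunk(int... values) {
            this.values = values;
            int lo = Integer.MAX_VALUE;
            int hi = Integer.MIN_VALUE;
            for (int v : values) {
                lo = Math.min(lo, v);
                hi = Math.max(hi, v);
            }
            this.min = lo;
            this.max = hi;
        }
    }

    // Returns {number of matching values, number of chunks actually scanned}.
    static int[] countEquals(Chunk[] chunks, int target) {
        int matches = 0;
        int scanned = 0;
        for (Chunk chunk : chunks) {
            if (target < chunk.min || target > chunk.max) {
                continue; // statistics prove the chunk cannot contain the target
            }
            scanned++;
            for (int v : chunk.values) {
                if (v == target) matches++;
            }
        }
        return new int[] {matches, scanned};
    }
}
```

The heuristic pays off on sorted data because chunk ranges do not overlap; on unsorted data every chunk tends to span the whole value range and nothing gets skipped.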
Parquet File Format
  Based on Google Dremel 
  Implemented by Twitter and Cloudera
  Columnar storage format
  Optimized for fast columnar aggregations
  Tight compression
  Supports pushdowns
  Nested, self-describing, evolving schema
Types of Compression
  Run Length Encoding: Repeated data
  Dictionary Encoding: Fixed set of values
  Delta, Prefix Encoding: Sorted data
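The three encodings can be sketched in a few lines each. These are toy versions with hypothetical names, not Parquet's on-disk formats: run-length encoding collapses repeats, dictionary encoding replaces values from a small fixed set with integer codes, and delta encoding stores differences between neighbouring sorted values.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class ColumnEncodings {
    // Run-length: each run of equal values becomes "value*count".
    static List<String> runLength(String[] column) {
        List<String> runs = new ArrayList<>();
        int i = 0;
        while (i < column.length) {
            int j = i;
            while (j < column.length && column[j].equals(column[i])) j++;
            runs.add(column[i] + "*" + (j - i));
            i = j;
        }
        return runs;
    }

    // Dictionary: returns one code per value and fills `dictionary` (assumed empty on entry).
    static int[] dictionaryEncode(String[] column, List<String> dictionary) {
        Map<String, Integer> codes = new HashMap<>();
        int[] encoded = new int[column.length];
        for (int i = 0; i < column.length; i++) {
            Integer code = codes.get(column[i]);
            if (code == null) {
                code = dictionary.size();
                codes.put(column[i], code);
                dictionary.add(column[i]);
            }
            encoded[i] = code;
        }
        return encoded;
    }

    // Delta: first value as-is, then the difference to the previous value.
    static long[] delta(long[] sorted) {
        long[] out = new long[sorted.length];
        for (int i = 0; i < sorted.length; i++) {
            out[i] = (i == 0) ? sorted[0] : sorted[i] - sorted[i - 1];
        }
        return out;
    }
}
```

A gender column limited to M, F and U suits dictionary encoding, and becomes run-length friendly once the data is sorted or partitioned by gender.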
Query Plan Debugging
gendersCsvDF.select($"id", $"gender").filter("gender != 'F'").filter("gender != 'M'").explain(true)
DataFrame.queryExecution.logical
DataFrame.queryExecution.analyzed
DataFrame.queryExecution.optimizedPlan
DataFrame.queryExecution.executedPlan
Query Plan Visualization & Query Metrics
Effectiveness of Filter
CPU Cache Friendly Binary Format
Cost-based Join Optimization
Similar to MapReduce Map-side Join
Peak Memory for Joins and Aggs
Demo!
Show Various File Formats, Partitioning Schemes, DataSource Implementations, and Query Plans
Anonymous, Public Dating Dataset
  RATINGS: UserID, ProfileID, Rating (1-10)
  GENDERS: UserID, Gender (M, F, U)
DataSources API
Relations (o.a.s.sql.sources.interfaces.scala)
  BaseRelation (abstract class): Provides schema of data
    TableScan (impl): Read all data from source
    PrunedFilteredScan (impl): Column pruning & predicate pushdowns
    InsertableRelation (impl): Insert/overwrite data based on SaveMode
  RelationProvider (trait/interface): Handle options, BaseRelation factory
Execution (o.a.s.sql.execution.commands.scala)
  RunnableCommand (trait/interface): Common commands like EXPLAIN
    ExplainCommand (impl: case class)
    CacheTableCommand (impl: case class)
Filters (o.a.s.sql.sources.filters.scala)
  Filter (abstract class): Handles all predicates/filters supported by this source
    EqualTo (impl)
    GreaterThan (impl)
    StringStartsWith (impl)
Native Spark SQL DataSources
JSON Data Source
DataFrame
  val ratingsDF = sqlContext.read.format("json")
    .load("file:/root/pipeline/datasets/dating/ratings.json.bz2")
  -- or --
  // json() convenience method
  val ratingsDF = sqlContext.read.json("file:/root/pipeline/datasets/dating/ratings.json.bz2")
SQL Code
  CREATE TABLE genders USING json
  OPTIONS (path "file:/root/pipeline/datasets/dating/genders.json.bz2")
JDBC Data Source
Add Driver to Spark JVM System Classpath
  $ export SPARK_CLASSPATH=<jdbc-driver.jar>
DataFrame
  val jdbcConfig = Map("driver" -> "org.postgresql.Driver",
    "url" -> "jdbc:postgresql://hostname:port/database",
    "dbtable" -> "schema.tablename")
  val df = sqlContext.read.format("jdbc").options(jdbcConfig).load()
SQL
  CREATE TABLE genders USING jdbc
  OPTIONS (url, dbtable, driver, …)
Parquet Data Source
Configuration
  spark.sql.parquet.filterPushdown=true
  spark.sql.parquet.mergeSchema=true
  spark.sql.parquet.cacheMetadata=true
  spark.sql.parquet.compression.codec=[uncompressed,snappy,gzip,lzo]
DataFrames
  val gendersDF = sqlContext.read.format("parquet")
    .load("file:/root/pipeline/datasets/dating/genders.parquet")
  gendersDF.write.format("parquet").partitionBy("gender")
    .save("file:/root/pipeline/datasets/dating/genders.parquet")
SQL
  CREATE TABLE genders USING parquet
  OPTIONS (path "file:/root/pipeline/datasets/dating/genders.parquet")
ORC Data Source
Configuration

spark.sql.orc.filterPushdown=true
DataFrames

val gendersDF = sqlContext.read.format("orc")

 
.load("file:/root/pipeline/datasets/dating/genders")

gendersDF.write.format("orc").partitionBy("gender")

 
.save("file:/root/pipeline/datasets/dating/genders")
SQL

CREATE TABLE genders USING orc
OPTIONS
(path "file:/root/pipeline/datasets/dating/genders")

Third-Party Spark SQL DataSources
CSV DataSource (Databricks)
Github
https://github.com/databricks/spark-csv
Maven

com.databricks:spark-csv_2.10:1.2.0
Code

val gendersCsvDF = sqlContext.read
  .format("com.databricks.spark.csv")
  .load("file:/root/pipeline/datasets/dating/gender.csv.bz2")
  .toDF("id", "gender")
toDF() is required if CSV does not contain header
Avro DataSource (Databricks)
Github

https://github.com/databricks/spark-avro

Maven

com.databricks:spark-avro_2.10:2.0.1

Code

val df = sqlContext.read
  .format("com.databricks.spark.avro")
  .load("file:/root/pipeline/datasets/dating/gender.avro")

ElasticSearch DataSource (Elastic.co)
Github

https://github.com/elastic/elasticsearch-hadoop

Maven

org.elasticsearch:elasticsearch-spark_2.10:2.1.0

Code

val esConfig = Map("pushdown" -> "true", "es.nodes" -> "<hostname>",
  "es.port" -> "<port>")

df.write.format("org.elasticsearch.spark.sql").mode(SaveMode.Overwrite)
  .options(esConfig).save("<index>/<document-type>")

AWS Redshift Data Source (Databricks)
Github

https://github.com/databricks/spark-redshift

Maven

com.databricks:spark-redshift:0.5.0

Code

val df: DataFrame = sqlContext.read
  .format("com.databricks.spark.redshift")
  .option("url", "jdbc:redshift://<hostname>:<port>/<database>…")
  .option("query", "select x, count(*) from my_table group by x")
  .option("tempdir", "s3n://tmpdir")
  .load(...)
UNLOAD and copy to tmp bucket in S3 enables parallel reads
Cassandra DataSource (DataStax)
Github

https://github.com/datastax/spark-cassandra-connector

Maven

com.datastax.spark:spark-cassandra-connector_2.10:1.5.0-M1

Code

ratingsDF.write
  .format("org.apache.spark.sql.cassandra")
  .mode(SaveMode.Append)
  .options(Map("keyspace" -> "<keyspace>",
               "table" -> "<table>")).save(…)

Cassandra Pushdown Support
spark-cassandra-connector/…/o.a.s.sql.cassandra.PredicatePushDown.scala


Pushdown Predicate Rules
1. Only push down no-partition key column predicates with =, >, <, >=, <= predicate
2. Only push down primary key column predicates with = or IN predicate.
3. If there are regular columns in the pushdown predicates, they should have at least one EQ expression on an indexed column and no IN predicates.
4. All partition column predicates must be included in the predicates to be pushed down, only the last part of the partition key can be an IN predicate. For each partition column, only one predicate is allowed.
5. For cluster column predicates, only last predicate can be non-EQ predicate including IN predicate, and preceding column predicates must be EQ predicates. If there is only one cluster column predicate, the predicates could be any non-IN predicate.
6. There is no pushdown predicates if there is any OR condition or NOT IN condition.
7. We're not allowed to push down multiple predicates for the same column if any of them is equality or IN predicate.

Rumor of New Cassandra DataSource
Bypass CQL front door used for transactional data
Bulk read/write directly from/to SSTables
Similar to existing Netflix Open Source project
https://github.com/Netflix/aegisthus
Promotes Cassandra to first-class Analytics Option
Potentially only part of DataStax Enterprise?!
Please mail a nasty letter to your local DataStax office

Creating a Custom Data Source
Study Existing Native and Third-Party Data Source Impls
Native: JDBC (o.a.s.sql.execution.datasources.jdbc)
class JDBCRelation extends BaseRelation
  with PrunedFilteredScan
  with InsertableRelation
Third-Party: Cassandra (o.a.s.sql.cassandra)
class CassandraSourceRelation extends BaseRelation
  with PrunedFilteredScan
  with InsertableRelation
<Insert Your Custom Data Source Here!>

Cloudant DataSource (IBM)
Github

http://spark-packages.org/package/cloudant/spark-cloudant

Maven

cloudant:spark-cloudant:<version>

Code

 
ratingsDF.write.format("com.cloudant.spark")
  .mode(SaveMode.Append)
  .options(Map("cloudant.host" -> "<account>.cloudant.com",
               "cloudant.username" -> "<username>",
               "cloudant.password" -> "<password>"))
  .save("<filename>")
DB2 and BigSQL DataSources (IBM)
Coming Soon!
Rumor of REST DataSource (Databricks)
Coming Soon?







Ask Michael Armbrust
Spark SQL Lead @ Databricks
Custom DataSource (Me and You All!)
Coming Right Now!
DEMO ALERT!!
Demo!
Create a Custom DataSource
Contributing a Custom Data Source
spark-packages.org
Managed by Databricks
Contains links to externally-managed GitHub projects
Ratings and comments
Requires supported Spark version for each package
Examples

https://github.com/databricks/spark-csv

https://github.com/databricks/spark-avro

https://github.com/databricks/spark-redshift





Spark Streaming: Scaling & Approximations
Understand Parallelism, Recovery, and Back Pressure

Describe Common Streaming Count Approximations
Direct Kafka Streaming
  KafkaRDD partitions store relevant offsets
  Each partition acts as a Receiver
  Tasks/workers pull from Kafka in parallel
  Partitions rebuild from Kafka using offsets
  No Write Ahead Log (WAL) needed
  Optimizes happy path by avoiding the WAL
  At least once delivery guarantee
Parallelism of Direct Kafka Streaming
Not-so-direct Kinesis Streaming
  KinesisRDD partitions store relevant offsets
  Single receiver required to see all data/offsets
  Kinesis offsets not deterministic like Kafka
  Partitions rebuild from Kinesis using offsets
  No Write Ahead Log (WAL) needed
  Optimizes happy path by avoiding the WAL
  At least once delivery guarantee
Streaming Back Pressure
More than Throttling

Push back on the source

Requires buffered source (Kafka, Kinesis)

Based on fundamentals of Control Theory

Contributed by Typesafe
HyperLogLog
  Approximate cardinality (approx count distinct)
  Fixed, low memory
  Tunable error percentage 
  Only 1.5KB @ 2% error,10^9 elements
  Twitter’s Algebird
  Streaming example in Spark codebase
  Spark’s countApproxDistinctByKey()
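
To make the register idea behind HyperLogLog concrete, here is a minimal sketch in plain Scala. It is not Algebird's or Spark's implementation: the class name `HyperLogLog`, the precision parameter `p`, and the choice of `MurmurHash3.stringHash` as a 32-bit hash are assumptions made for illustration. With p = 11 (2048 registers) the expected standard error is roughly 1.04 / sqrt(2048), about 2.3%, and 2048 registers packed at 6 bits each come to 1.5KB, which lines up with the figure above. This sketch spends one byte per register for simplicity.

```scala
import scala.util.hashing.MurmurHash3

// Illustrative HyperLogLog: the top p bits of a hash pick one of m = 2^p
// registers; each register remembers the largest "rank" (position of the
// first 1-bit in the remaining bits) it has seen.
final class HyperLogLog(p: Int = 11) {
  require(p >= 7 && p <= 16, "p must be in [7, 16]")
  private val m = 1 << p
  private val registers = new Array[Byte](m)
  // Bias-correction constant for m >= 128 (from the original HLL paper).
  private val alpha = 0.7213 / (1.0 + 1.079 / m)

  def add(item: String): Unit = {
    val h = MurmurHash3.stringHash(item) // 32-bit hash (assumed choice)
    val idx = h >>> (32 - p)             // top p bits select the register
    val rest = h << p                    // remaining 32 - p bits
    val rank = math.min(Integer.numberOfLeadingZeros(rest) + 1, 32 - p + 1)
    if (rank > registers(idx)) registers(idx) = rank.toByte
  }

  def estimate: Double = {
    val sum = registers.map(r => math.pow(2.0, -r.toDouble)).sum
    val raw = alpha * m * m / sum
    val zeros = registers.count(_ == 0)
    // Small-range correction: fall back to linear counting.
    if (raw <= 2.5 * m && zeros > 0) m * math.log(m.toDouble / zeros) else raw
  }
}
```

Memory stays fixed at m registers no matter how many items are added, and re-adding items that were already seen never changes the estimate, because each register only keeps a maximum.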
http://research.neustar.biz/
Count Min Sketch
  Approximate counters
  Better than HashMap
  Low, fixed memory
  Known error bounds
  Large num of counters
  From Twitter Algebird
  Streaming example in Spark codebase
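
The "approximate counters" idea can be sketched in a few lines of plain Scala. This is an illustration, not Algebird's implementation: the class name `CountMinSketch`, the `depth` and `width` defaults, and the use of seeded `MurmurHash3.stringHash` as the per-row hash are all assumptions.

```scala
import scala.util.hashing.MurmurHash3

// Illustrative Count-Min Sketch: `depth` rows of `width` counters,
// with one seeded hash function per row.
final class CountMinSketch(depth: Int = 4, width: Int = 1024) {
  private val table = Array.ofDim[Long](depth, width)

  private def bucket(item: String, row: Int): Int = {
    val h = MurmurHash3.stringHash(item, row) // row index doubles as the seed
    ((h % width) + width) % width             // map to a non-negative column
  }

  def add(item: String, count: Long = 1L): Unit =
    for (row <- 0 until depth) table(row)(bucket(item, row)) += count

  // Collisions can only inflate a counter, so the minimum across rows
  // is the tightest available upper bound on the true count.
  def estimate(item: String): Long =
    (0 until depth).map(row => table(row)(bucket(item, row))).min
}
```

Memory is fixed at depth * width counters regardless of how many distinct keys arrive, which is the trade-off against an exact HashMap: estimates never undercount, but they may overcount when keys collide in every row.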
Monte Carlo Simulations
From Manhattan Project (A-bomb)
Simulate movement of neutrons

Law of Large Numbers (LLN)
Average of results of many trials

Converge on expected value

SparkPi example in Spark codebase
Pi ~ (# red dots / # total dots) * 4
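
The dot-counting estimate can be tried without a cluster. The following is a single-threaded sketch in plain Scala, not the SparkPi source: the object name `MonteCarloPi`, the fixed seed, and sampling only the unit square (a quarter circle) are assumptions made to keep the example small and repeatable.

```scala
import scala.util.Random

object MonteCarloPi {
  // Throw `samples` random points into the unit square and count those
  // landing inside the quarter circle of radius 1 (the "red dots").
  def estimate(samples: Int, seed: Long = 42L): Double = {
    val rng = new Random(seed)
    val inside = (1 to samples).count { _ =>
      val x = rng.nextDouble()
      val y = rng.nextDouble()
      x * x + y * y <= 1.0
    }
    // The quarter circle covers Pi/4 of the square, hence the factor of 4.
    4.0 * inside / samples
  }
}
```

By the Law of Large Numbers the estimate tightens as the number of trials grows; the standard deviation shrinks roughly with 1 / sqrt(samples), so each extra digit of accuracy costs about 100x more dots.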
Spark ML: High Scale Machine Learning
Define Similarity and Dimension Reduction

Describe Sampling and Bucketing

Generate 10 Recommendations
Live, Interactive Demo!
sparkafterdark.com
Audience Participation Needed!!
Audience Instructions
  Navigate to sparkafterdark.com
  Click 3 actresses and 3 actors

  Wait for us to analyze together!
Note: This is totally anonymous!!

Project Links
  https://github.com/fluxcapacitor/pipeline
  https://hub.docker.com/r/fluxcapacitor
Similarity
Types of Similarity
Euclidean: linear measure
Magnitude bias
Cosine: angle measure
Adjust for magnitude bias
Jaccard: (intersection / union)
Popularity bias
Log Likelihood
Adjust for popularity bias

		 Ali	 Matei	 Reynold	 Patrick	 Andy	
Kimberly	 1	 1	 1	 1	
Leslie	 1	 1
Meredith	 1	 1	 1	
Lisa	 1	 1	 1	
Holden	 1	 1	 1	 1	 1	
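
The first three measures are easy to compare side by side. The following plain-Scala sketch uses hypothetical inputs rather than the like matrix above; the object name `Similarity` and its method names are assumptions for illustration.

```scala
object Similarity {
  // Euclidean: straight-line distance, so it grows with vector magnitude.
  def euclidean(a: Seq[Double], b: Seq[Double]): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  // Cosine: compares direction only, which removes the magnitude bias.
  def cosine(a: Seq[Double], b: Seq[Double]): Double = {
    val dot = a.zip(b).map { case (x, y) => x * y }.sum
    val norm = math.sqrt(a.map(x => x * x).sum) * math.sqrt(b.map(x => x * x).sum)
    if (norm == 0.0) 0.0 else dot / norm
  }

  // Jaccard: size of the intersection divided by size of the union.
  def jaccard[T](a: Set[T], b: Set[T]): Double =
    if (a.isEmpty && b.isEmpty) 0.0
    else (a intersect b).size.toDouble / (a union b).size
}
```

For example, the vectors (1, 2) and (2, 4) point in the same direction, so their cosine similarity is 1.0 even though their Euclidean distance is sqrt(5). Two users who like {Ali, Matei, Reynold} and {Matei, Reynold, Patrick} share 2 of 4 distinct likes, giving a Jaccard similarity of 0.5.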
All-Pairs Similarity Comparison
Compare everything to everything
aka. “pair-wise similarity” or “similarity join”
Naïve shuffle: O(m*n^2); m=rows, n=cols

Minimize shuffle through approximations!
Reduce m (rows)
Sampling and bucketing 
Reduce n (cols)
Remove most frequent value (i.e., 0)
Principal Component Analysis
Dimension reduction!!
Dimension Reduction
Sampling and Bucketing
Reduce m: DIMSUM Sampling
“Dimension Independent Matrix Square Using MR”
Remove rows with low similarity probability
MLlib: RowMatrix.columnSimilarities(…)




Twitter: 40% efficiency gain over Cosine Similarity

Reduce m: LSH Bucketing
“Locality Sensitive Hashing”
Split m into b buckets 
Use similarity hash algorithm
Requires pre-processing of data
Compare bucket contents in parallel
Converts O(m*n^2) -> O(m*n/b*b^2); m=rows, n=cols, b=buckets
i.e., 500k x 500k matrix: O(1.25e17) -> O(1.25e13); b=50

github.com/mrsqueeze/spark-hash
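
One common way to realize the "similarity hash" step is MinHash with banding, sketched below in plain Scala. This is an assumed choice of LSH family for illustration, not necessarily what the project linked above does; the object name `MinHashLsh`, its method names, and the parameters `numHashes` and `rowsPerBand` are hypothetical.

```scala
import scala.util.hashing.MurmurHash3

object MinHashLsh {
  // MinHash signature: for each seeded hash function keep the minimum hash
  // over the set's elements. Similar sets agree on many positions.
  def signature(items: Set[String], numHashes: Int): Vector[Int] = {
    require(items.nonEmpty, "cannot sign an empty set")
    Vector.tabulate(numHashes)(seed => items.map(MurmurHash3.stringHash(_, seed)).min)
  }

  // Banding: cut the signature into bands of `rowsPerBand` values and hash
  // each band into a bucket key of (band index, band hash).
  def bucketKeys(sig: Vector[Int], rowsPerBand: Int): Seq[(Int, Int)] =
    sig.grouped(rowsPerBand).zipWithIndex.map { case (band, i) =>
      (i, MurmurHash3.orderedHash(band))
    }.toSeq

  // Only rows that collide in at least one bucket become candidate pairs.
  def candidatePairs(rows: Map[String, Set[String]],
                     numHashes: Int,
                     rowsPerBand: Int): Set[(String, String)] = {
    val buckets = rows.toSeq
      .flatMap { case (id, items) =>
        bucketKeys(signature(items, numHashes), rowsPerBand).map(key => key -> id)
      }
      .groupBy(_._1)
      .values
      .map(_.map(_._2))
    buckets.flatMap(ids => for { a <- ids; b <- ids if a < b } yield (a, b)).toSet
  }
}
```

The signature computation is the pre-processing pass; after it, each bucket can be compared independently, which is what allows the bucket contents to be processed in parallel. Rows that never share a bucket are never compared at all, so the saving comes at the cost of occasionally missing a truly similar pair.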
Reduce n: Remove Most Frequent Value
Eliminate most-frequent value
Represent other values with (index,value) pairs
Converts O(m*n^2) -> O(m*nnz^2); nnz=num nonzeros, nnz << n





Note: Choose most frequent value (may not be 0)
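
A small plain-Scala sketch of this encoding follows. The object name `SparseRow` and the method `compress` are hypothetical, and a real pipeline would more likely use MLlib's sparse vector types; the point here is only to show the (index, value) representation.

```scala
object SparseRow {
  // Drop the row's most frequent value and keep every other entry as an
  // (index, value) pair. The dropped value is returned alongside the pairs
  // so the dense row can still be reconstructed.
  def compress(row: Seq[Int]): (Int, Seq[(Int, Int)]) = {
    require(row.nonEmpty, "row must not be empty")
    val mostFrequent = row.groupBy(identity).maxBy(_._2.size)._1
    val pairs = row.zipWithIndex.collect {
      case (value, index) if value != mostFrequent => (index, value)
    }
    (mostFrequent, pairs)
  }
}
```

For the row 0, 0, 3, 0, 5 this keeps only (2, 3) and (4, 5). For the row 1, 1, 1, 0 the most frequent value is 1, so the single stored pair is (3, 0), which is the case the note above warns about.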
Recommendations
Summary Statistics and Historical Analysis
Collaborative Filtering and Clustering
Text Featurization and NLP
Types of Recommendations
Non-personalized

No preference or behavior data for user, yet
aka “Cold Start Problem”

Personalized

User-Item Similarity

Items that others with similar prefs have liked
Item-Item Similarity

Items similar to your previously-liked items
Recommendation Terminology
User
User seeking recommendations
Item
Item that has been liked or rated
Feedback
Explicit: like, rating
Implicit: search, click, hover, view, scroll
Feature Engineering
Dimension reduction
Non-Personalized Recommendations
Use Aggregate Data to Generate Recommendations
  Top Users by Like Count

“I might like users who have the most likes overall, based on historical data.”
SparkSQL, DataFrames: Summary Stat, Aggs






  Top Influencers by Like Graph


“I might like the most-influential users in overall like graph.”
GraphX: PageRank







Demo!
Generate Recommendations using Summary Stats & PageRank
Personalized Recommendations
Use Similarity to Generate Personalized Recommendations
  Like Behavior of Similar Users
“I like the same people that you like. 

What other people did you like that I haven’t seen?” 
MLlib: Matrix Factorization, User-Item Similarity
Demo!
Generate Recommendations using Collaborative Filtering and Matrix Factorization
  Similar Text-based Profiles as Me


“Our profiles have similar keywords and named entities. 

We might like each other!”
MLlib: Word2Vec, TF/IDF, k-skip n-grams
  Similar Profiles to Previous Likes


“Your profile text has similar keywords and named entities to
other profiles of people I like. I might like you, too!”
MLlib: Word2Vec, TF/IDF, Doc Similarity
  Relevant, High-Value Emails

 
 “Your initial email references a lot of things in my profile.

I might like you for making the effort!”
MLlib: Word2Vec, TF/IDF, Entity Recognition






Her Email < My Profile
The Future of Recommendations
  Eigenfaces: Facial Recognition
“Your face looks similar to others that I’ve liked.

I might like you.”
MLlib: RowMatrix, PCA, Item-Item Similarity




Image courtesy of http://crockpotveggies.com/2015/02/09/automating-tinder-with-eigenfaces.html
  NLP Conversation Starter Bot! 
“If your responses to my generic opening lines are positive, I may read your profile.”
MLlib: TF/IDF, DecisionTrees, Sentiment Analysis
Positive Negative
Maintaining the Spark
  Recommendations for Couples
“I want Mad Max. You want Message In a Bottle. 

Let’s find something in between to watch tonight.”
MLlib: RowMatrix, Item-Item Similarity

GraphX: Nearest Neighbors, Shortest Path



 
similar plots ->  <- similar actors
Final Recommendation!
  Get Off the Computer & Meet People!
Thank you, Paris!!
Chris Fregly @cfregly
IBM Spark Technology Center 
San Francisco, CA, USA
Relevant Links
advancedspark.com
Sign up for the book & global meetup!
github.com/fluxcapacitor/pipeline
Clone, contribute, and commit code!
hub.docker.com/r/fluxcapacitor/pipeline/wiki
Run all demos in your own environment with Docker!
More Relevant Links
http://meetup.com/Advanced-Apache-Spark-Meetup
http://advancedspark.com
http://github.com/fluxcapacitor/pipeline
http://hub.docker.com/r/fluxcapacitor/pipeline
https://www.safaribooksonline.com/library/view/java-performance-the/9781449363512/ch04.html
http://web.eece.maine.edu/~vweaver/projects/perf_events/perf_event_open.html
http://www.brendangregg.com/perf.html
https://perf.wiki.kernel.org/index.php/Tutorial
http://techblog.netflix.com/2015/07/java-in-flames.html
http://techblog.netflix.com/2015/04/introducing-vector-netflixs-on-host.html
http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html#Java
http://sortbenchmark.org/ApacheSpark2014.pdf
https://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html
http://0x0fff.com/spark-architecture-shuffle/
http://www.cs.berkeley.edu/~kubitron/courses/cs262a-F13/projects/reports/project16_report.pdf
http://stackoverflow.com/questions/763262/how-does-one-write-code-that-best-utilizes-the-cpu-cache-to-improve-performance
http://www.aristeia.com/TalkNotes/ACCU2011_CPUCaches.pdf
http://mishadoff.com/blog/java-magic-part-4-sun-dot-misc-dot-unsafe/
http://docs.scala-lang.org/overviews/quasiquotes/intro.html
http://lwn.net/Articles/252125/ <-- Memory Part 2: CPU Caches
http://lwn.net/Articles/255364/ <-- Memory Part 5: What Programmers Can Do

What’s Next?
What’s Next?
Autoscaling Spark Workers

Completely Docker-based

Docker Compose and Docker Machine
Lots of Demos and Examples!

Zeppelin & IPython/Jupyter notebooks

Advanced streaming use cases

Advanced ML, Graph, and NLP use cases
Performance Tuning and Profiling

Work closely with Brendan Gregg & Netflix

Surface & share more low-level details of Spark internals
Upcoming Meetups and Conferences
London Spark Meetup (Oct 12th)
Scotland Data Science Meetup (Oct 13th)
Dublin Spark Meetup (Oct 15th)
Barcelona Spark Meetup (Oct 20th)
Madrid Spark/Big Data Meetup (Oct 22nd)
Paris Spark Meetup (Oct 26th)
Amsterdam Spark Summit & Meetup (Oct 27th)
Delft Dutch Data Science Meetup (Oct 29th) 
Brussels Spark Meetup (Oct 30th)
Zurich Big Data Developers Meetup (Nov 2nd)
Geneva Spark Meetup (Nov 5th)
San Francisco Datapalooza (Nov 10th)
San Francisco Advanced Apache Spark Meetup (Nov 12th)
Oslo Big Data Hadoop Meetup (Nov 18th)
Helsinki Spark Meetup (Nov 20th)
Stockholm Spark Meetup (Nov 23rd)
Copenhagen Spark Meetup (Nov 25th)
Budapest Spark Meetup (Nov 27th)
Singapore Strata Conference (Dec 1st)
San Francisco Advanced Apache Spark Meetup (Dec 8th)
Mountain View Advanced Apache Spark Meetup (Dec 10th)
Washington DC Advanced Apache Spark Meetup (Dec 17th)

Freg-a-palooza!
Click to edit Master text styles
Click to edit Master text styles
IBM Spark
 spark.tc
Click to edit Master text styles
Power of data. Simplicity of design. Speed of innovation.
IBM Spark

More Related Content

What's hot

Advanced Apache Spark Meetup: How Spark Beat Hadoop @ 100 TB Daytona GraySor...
Advanced Apache Spark Meetup:  How Spark Beat Hadoop @ 100 TB Daytona GraySor...Advanced Apache Spark Meetup:  How Spark Beat Hadoop @ 100 TB Daytona GraySor...
Advanced Apache Spark Meetup: How Spark Beat Hadoop @ 100 TB Daytona GraySor...Chris Fregly
 
Practical Data Science Workshop - Recommendation Systems - Collaborative Filt...
Practical Data Science Workshop - Recommendation Systems - Collaborative Filt...Practical Data Science Workshop - Recommendation Systems - Collaborative Filt...
Practical Data Science Workshop - Recommendation Systems - Collaborative Filt...Chris Fregly
 
Advanced Apache Spark Meetup Project Tungsten Nov 12 2015
Advanced Apache Spark Meetup Project Tungsten Nov 12 2015Advanced Apache Spark Meetup Project Tungsten Nov 12 2015
Advanced Apache Spark Meetup Project Tungsten Nov 12 2015Chris Fregly
 
Spark after Dark by Chris Fregly of Databricks
Spark after Dark by Chris Fregly of DatabricksSpark after Dark by Chris Fregly of Databricks
Spark after Dark by Chris Fregly of DatabricksData Con LA
 
Performant data processing with PySpark, SparkR and DataFrame API
Performant data processing with PySpark, SparkR and DataFrame APIPerformant data processing with PySpark, SparkR and DataFrame API
Performant data processing with PySpark, SparkR and DataFrame APIRyuji Tamagawa
 
Spark r under the hood with Hossein Falaki
Spark r under the hood with Hossein FalakiSpark r under the hood with Hossein Falaki
Spark r under the hood with Hossein FalakiDatabricks
 
Zurich, Berlin, Vienna Spark and Big Data Meetup Nov 02 2015
Zurich, Berlin, Vienna Spark and Big Data Meetup Nov 02 2015Zurich, Berlin, Vienna Spark and Big Data Meetup Nov 02 2015
Zurich, Berlin, Vienna Spark and Big Data Meetup Nov 02 2015Chris Fregly
 
Helsinki Spark Meetup Nov 20 2015
Helsinki Spark Meetup Nov 20 2015Helsinki Spark Meetup Nov 20 2015
Helsinki Spark Meetup Nov 20 2015Chris Fregly
 
Spark After Dark 2.0 - Apache Big Data Conf - Vancouver - May 11, 2016
Spark After Dark 2.0 - Apache Big Data Conf - Vancouver - May 11, 2016Spark After Dark 2.0 - Apache Big Data Conf - Vancouver - May 11, 2016
Spark After Dark 2.0 - Apache Big Data Conf - Vancouver - May 11, 2016Chris Fregly
 
Stockholm Spark Meetup Nov 23 2015 Spark After Dark 1.5
Stockholm Spark Meetup Nov 23 2015 Spark After Dark 1.5Stockholm Spark Meetup Nov 23 2015 Spark After Dark 1.5
Stockholm Spark Meetup Nov 23 2015 Spark After Dark 1.5Chris Fregly
 
Python and Bigdata - An Introduction to Spark (PySpark)
Python and Bigdata -  An Introduction to Spark (PySpark)Python and Bigdata -  An Introduction to Spark (PySpark)
Python and Bigdata - An Introduction to Spark (PySpark)hiteshnd
 
PySpark Best Practices
PySpark Best PracticesPySpark Best Practices
PySpark Best PracticesCloudera, Inc.
 
Apache Spark and Apache Ignite: Where Fast Data Meets the IoT with Denis Magda
Apache Spark and Apache Ignite: Where Fast Data Meets the IoT with Denis MagdaApache Spark and Apache Ignite: Where Fast Data Meets the IoT with Denis Magda
Apache Spark and Apache Ignite: Where Fast Data Meets the IoT with Denis MagdaDatabricks
 
Large scale, interactive ad-hoc queries over different datastores with Apache...
Large scale, interactive ad-hoc queries over different datastores with Apache...Large scale, interactive ad-hoc queries over different datastores with Apache...
Large scale, interactive ad-hoc queries over different datastores with Apache...jaxLondonConference
 
Programming in Spark using PySpark
Programming in Spark using PySpark      Programming in Spark using PySpark
Programming in Spark using PySpark Mostafa
 
Boston Spark Meetup May 24, 2016
Boston Spark Meetup May 24, 2016Boston Spark Meetup May 24, 2016
Boston Spark Meetup May 24, 2016Chris Fregly
 
Parallelizing Existing R Packages with SparkR
Parallelizing Existing R Packages with SparkRParallelizing Existing R Packages with SparkR
Parallelizing Existing R Packages with SparkRDatabricks
 
Apache Spark for Everyone - Women Who Code Workshop
Apache Spark for Everyone - Women Who Code WorkshopApache Spark for Everyone - Women Who Code Workshop
Apache Spark for Everyone - Women Who Code WorkshopAmanda Casari
 
Migrating from Redshift to Spark at Stitch Fix: Spark Summit East talk by Sky...
Migrating from Redshift to Spark at Stitch Fix: Spark Summit East talk by Sky...Migrating from Redshift to Spark at Stitch Fix: Spark Summit East talk by Sky...
Migrating from Redshift to Spark at Stitch Fix: Spark Summit East talk by Sky...Spark Summit
 
Machine Learning by Example - Apache Spark
Machine Learning by Example - Apache SparkMachine Learning by Example - Apache Spark
Machine Learning by Example - Apache SparkMeeraj Kunnumpurath
 

Sending Calendar Invites on SES and Calendarsnack.pdf31events.com
 
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)jennyeacort
 
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...Matt Ray
 
Machine Learning Software Engineering Patterns and Their Engineering
Machine Learning Software Engineering Patterns and Their EngineeringMachine Learning Software Engineering Patterns and Their Engineering
Machine Learning Software Engineering Patterns and Their EngineeringHironori Washizaki
 
Comparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdfComparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdfDrew Moseley
 
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...Natan Silnitsky
 
CRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. SalesforceCRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. SalesforceBrainSell Technologies
 
Real-time Tracking and Monitoring with Cargo Cloud Solutions.pptx
Real-time Tracking and Monitoring with Cargo Cloud Solutions.pptxReal-time Tracking and Monitoring with Cargo Cloud Solutions.pptx
Real-time Tracking and Monitoring with Cargo Cloud Solutions.pptxRTS corp
 

Recently uploaded (20)

How to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion ApplicationHow to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion Application
 
Unveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New FeaturesUnveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New Features
 
Patterns for automating API delivery. API conference
Patterns for automating API delivery. API conferencePatterns for automating API delivery. API conference
Patterns for automating API delivery. API conference
 
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
 
Exploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdf
Exploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdfExploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdf
Exploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdf
 
Cloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEECloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEE
 
Powering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data StreamsPowering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data Streams
 
Lecture # 8 software design and architecture (SDA).ppt
Lecture # 8 software design and architecture (SDA).pptLecture # 8 software design and architecture (SDA).ppt
Lecture # 8 software design and architecture (SDA).ppt
 
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company OdishaBalasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
 
Innovate and Collaborate- Harnessing the Power of Open Source Software.pdf
Innovate and Collaborate- Harnessing the Power of Open Source Software.pdfInnovate and Collaborate- Harnessing the Power of Open Source Software.pdf
Innovate and Collaborate- Harnessing the Power of Open Source Software.pdf
 
SpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at RuntimeSpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at Runtime
 
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
 
Sending Calendar Invites on SES and Calendarsnack.pdf
Sending Calendar Invites on SES and Calendarsnack.pdfSending Calendar Invites on SES and Calendarsnack.pdf
Sending Calendar Invites on SES and Calendarsnack.pdf
 
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
 
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
 
Machine Learning Software Engineering Patterns and Their Engineering
Machine Learning Software Engineering Patterns and Their EngineeringMachine Learning Software Engineering Patterns and Their Engineering
Machine Learning Software Engineering Patterns and Their Engineering
 
Comparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdfComparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdf
 
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
 
CRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. SalesforceCRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. Salesforce
 
Real-time Tracking and Monitoring with Cargo Cloud Solutions.pptx
Real-time Tracking and Monitoring with Cargo Cloud Solutions.pptxReal-time Tracking and Monitoring with Cargo Cloud Solutions.pptx
Real-time Tracking and Monitoring with Cargo Cloud Solutions.pptx
 

Paris Spark Meetup Oct 26, 2015 - Spark After Dark v1.5 - Best of Advanced Apache Spark Meetup to Date

  • 1. IBM Spark (spark.tc)
    After Dark 1.5
    High Performance, Real-time, Streaming, Machine Learning, Natural Language Processing, Text Analytics, and Recommendations
    Chris Fregly, Principal Data Solutions Engineer, IBM Spark Technology Center
    ** We’re Hiring -- Only Nice People, Please!! **
    Paris Spark Meetup, October 26, 2015
  • 2. Who Am I?
    IBM Spark (spark.tc): Power of data. Simplicity of design. Speed of innovation.
    Streaming Data Engineer, Netflix Open Source Committer
    Data Solutions Engineer, Apache Spark Contributor
    Principal Data Solutions Engineer, IBM Spark Technology Center
    Meetup Organizer, Advanced Apache Spark Meetup
    Book Author, Advanced (2016)
  • 3. Advanced Apache Spark Meetup
    Meetup Metrics
      1400+ members in just 3 mos!
      4th most active Spark Meetup!!
      meetup.com/Advanced-Apache-Spark-Meetup
    Meetup Goals
      Dig deep into Spark & extended-Spark codebase
      Study integrations incl Cassandra, ElasticSearch, Tachyon, S3, BlinkDB, Mesos, YARN, Kafka, R
      Surface & share patterns & idioms of these well-designed, distributed, big data components
  • 4. Upcoming Meetups and Conferences
    London Spark Meetup (Oct 12th)
    Scotland Data Science Meetup (Oct 13th)
    Dublin Spark Meetup (Oct 15th)
    Barcelona Spark Meetup (Oct 20th)
    Madrid Spark/Big Data Meetup (Oct 22nd)
    Paris Spark Meetup (Oct 26th)
    Amsterdam Spark Summit & Meetup (Oct 27th)
    Delft Dutch Data Science Meetup (Oct 29th)
    Brussels Spark Meetup (Oct 30th)
    Zurich Big Data Developers Meetup (Nov 2nd)
    Geneva Spark Meetup (Nov 5th)
    San Francisco Datapalooza (Nov 10th)
    San Francisco Advanced Apache Spark Meetup (Nov 12th)
    Oslo Big Data Hadoop Meetup (Nov 18th)
    Helsinki Spark Meetup (Nov 20th)
    Stockholm Spark Meetup (Nov 23rd)
    Copenhagen Spark Meetup (Nov 25th)
    Budapest Spark Meetup (Nov 27th)
    Singapore Strata Conference (Dec 1st)
    San Francisco Advanced Apache Spark Meetup (Dec 8th)
    Mountain View Advanced Apache Spark Meetup (Dec 10th)
    Washington DC Advanced Apache Spark Meetup (Dec 17th)
    Freg-a-palooza!
  • 5. What is Spark After Dark?
    Fun, Spark-based dating reference application
      *Not a movie recommendation engine!!
    Generate recommendations based on user similarity
    Demonstrate Apache Spark and related big data projects
  • 6. Tools of this Talk
    Redis
    Docker
    Ganglia
    Streaming, Kafka
    Cassandra, NoSQL
    Parquet, JSON, ORC, Avro
    Apache Zeppelin Notebooks
    Spark SQL, DataFrames, Hive
    ElasticSearch, Logstash, Kibana
    Spark ML, GraphX, Stanford CoreNLP and…
  • 7. Overall Themes of this Talk
    Filter Early, Filter Deep
    Approximations are OK
    Minimize Random Seeks
    Maximize Sequential Scans
    Go Off-Heap when Possible
    Parallelism is Required at Scale
    Must Reduce Dimensions at Scale
    Seek Performance Gains at all Layers
    Customize Data Structs for your Workload
    Be Nice and Collaborate with your Peers!
  • 8. High-Level Sections
    Spark Core: Performance Tuning
    Spark SQL: DataSources and Tuning
    Spark Streaming: Scale, Tuning, Approx
    Spark ML: Scale, Dim Reduce, NLP
  • 9. Spark Core: Performance Tuning
    Acknowledging Mechanical Sympathy
    100TB Daytona GraySort Challenge
    Project Tungsten
  • 10. Acknowledging Mechanical Sympathy
    “Hardware and software working together in harmony” -Martin Thompson, http://mechanical-sympathy.blogspot.com
    Spark Mechanical Sympathy Concerns
      Saturate Network I/O
      Saturate Disk I/O
      Minimize Memory Footprint and GC
      Maximize CPU Cache Locality
  • 11. Spark and Mechanical Sympathy
    Daytona GraySort (Spark 1.1-1.2)
      Saturate Network I/O
      Saturate Disk I/O
    Project Tungsten (Spark 1.4-1.6)
      Minimize Memory and GC
      Maximize CPU Cache Locality
  • 12. AlphaSort Trick for Sorting
    AlphaSort paper, 1995, Chris Nyberg and Jim Gray
    Naïve List (Pointer-to-Record)
      Requires Key to be dereferenced for comparison
    AlphaSort List (Key, Pointer)
      Key is directly available for comparison
  • 13. CPU Cache Line and Memory Sympathy
    Key (10 bytes) + Pointer (4 bytes*) = 14 bytes
      *4 bytes when using compressed OOPs (<32 GB heap)
      Not a power of two in size
      Not CPU-cache friendly
    Add Padding (2 bytes)
      Key (10 bytes) + Pad (2 bytes) + Pointer (4 bytes) = 16 bytes
      Cache-line friendly
    Key-Prefix, Pointer
      Prefix (4 bytes) + Pointer (4 bytes) = 8 bytes
      2x cache-line friendly
      Key distribution affects perf
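The (key-prefix, pointer) layout from slides 12-13 can be sketched in plain Java. This is only an illustration of the idea, not Spark's sorter: the class name `KeyPrefixSort`, the helper `keysOf`, and the choice to pack a 4-byte prefix and a 4-byte record index into one `long` are assumptions made for the example. Most comparisons touch only the packed 8-byte entries; the full key is dereferenced only when two prefixes tie.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class KeyPrefixSort {
    /** First 4 key bytes as an unsigned 32-bit prefix; short keys are zero-padded. */
    static long prefixOf(byte[] key) {
        long prefix = 0;
        for (int i = 0; i < 4; i++) {
            prefix = (prefix << 8) | (i < key.length ? (key[i] & 0xFF) : 0);
        }
        return prefix;
    }

    /** Full unsigned lexicographic comparison; only needed when prefixes tie. */
    static int compareKeys(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int c = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (c != 0) return c;
        }
        return a.length - b.length;
    }

    /** Returns record indices (the "pointers") in ascending key order. */
    static int[] sortedOrder(byte[][] keys) {
        int n = keys.length;
        long[] packed = new long[n];
        for (int i = 0; i < n; i++) {
            // 8-byte entry: prefix in the high 32 bits, record index in the low 32 bits.
            // Flipping the top bit makes signed long order match unsigned prefix order.
            packed[i] = ((prefixOf(keys[i]) << 32) | i) ^ Long.MIN_VALUE;
        }
        Arrays.sort(packed); // compares primitives only, no record dereference

        // Resolve runs of equal prefixes by dereferencing the full keys.
        for (int start = 0; start < n; ) {
            int end = start + 1;
            while (end < n && (packed[end] >>> 32) == (packed[start] >>> 32)) end++;
            for (int i = start + 1; i < end; i++) { // insertion sort within the run
                long cur = packed[i];
                int j = i - 1;
                while (j >= start && compareKeys(keys[(int) packed[j]], keys[(int) cur]) > 0) {
                    packed[j + 1] = packed[j];
                    j--;
                }
                packed[j + 1] = cur;
            }
            start = end;
        }

        int[] order = new int[n];
        for (int i = 0; i < n; i++) order[i] = (int) packed[i];
        return order;
    }

    /** Hypothetical helper: encodes strings as byte-array keys. */
    static byte[][] keysOf(String... keys) {
        byte[][] out = new byte[keys.length][];
        for (int i = 0; i < keys.length; i++) out[i] = keys[i].getBytes(StandardCharsets.UTF_8);
        return out;
    }

    public static void main(String[] args) {
        byte[][] keys = keysOf("pear", "peach", "apple", "pea", "zebra", "peas", "peace");
        System.out.println(Arrays.toString(sortedOrder(keys)));
    }
}
```

As the slide notes, key distribution matters: when many keys share the same first four bytes, the tie-breaking pass does most of the work and the cache benefit shrinks.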
  • 14. Performance Comparison
  • 15. Similar Technique: Direct Cache Access
    Packet header placed into CPU cache
  • 16. CPU Cache Lines
  • 17. Instrumenting and Monitoring CPU
    Linux perf command!
  • 18. CPU Cache Naïve Matrix Multiplication
    // Find dot product of each row and column vector
    for (i = 0; i < N; ++i)
      for (j = 0; j < N; ++j)
        for (k = 0; k < N; ++k)
          res[i][j] += matA[i][k] * matB[k][j];
    Skipping row-wise, not using full CPU cache line, ineffective pre-fetching
  • 19. CPU Cache Friendly Matrix Multiplication
    // Transpose B
    for (i = 0; i < N; ++i)
      for (j = 0; j < N; ++j)
        matBtran[i][j] = matB[j][i];
    // Modify dot product calculation for B transpose
    for (i = 0; i < N; ++i)
      for (j = 0; j < N; ++j)
        for (k = 0; k < N; ++k)
          res[i][j] += matA[i][k] * matBtran[j][k];
    Good use of CPU cache line, effective prefetching
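The two loops from slides 18-19 can be wrapped into a small self-contained Java class to check that the transposed variant computes the same product. The class and method names here are illustrative and are not the ones used in the talk's demo code.

```java
import java.util.Arrays;

final class MatrixMultiply {
    /** Naive version: matB is read column-wise, so each step lands on a different row array. */
    static double[][] naive(double[][] matA, double[][] matB) {
        int n = matA.length;
        double[][] res = new double[n][n];
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < n; ++j)
                for (int k = 0; k < n; ++k)
                    res[i][j] += matA[i][k] * matB[k][j];
        return res;
    }

    /** Cache-friendly version: transpose B once, then read both operands sequentially. */
    static double[][] cacheFriendly(double[][] matA, double[][] matB) {
        int n = matA.length;
        double[][] matBtran = new double[n][n];
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < n; ++j)
                matBtran[i][j] = matB[j][i];

        double[][] res = new double[n][n];
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < n; ++j)
                for (int k = 0; k < n; ++k)
                    res[i][j] += matA[i][k] * matBtran[j][k];
        return res;
    }

    public static void main(String[] args) {
        double[][] a = {{1, 2}, {3, 4}};
        double[][] b = {{5, 6}, {7, 8}};
        System.out.println(Arrays.deepToString(naive(a, b)));
        System.out.println(Arrays.deepEquals(naive(a, b), cacheFriendly(a, b)));
    }
}
```

Both methods accumulate the same terms in the same order, so the results match exactly; only the memory access pattern differs.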
  • 20. Demo! Comparing CPU Naïve & Cache-Friendly Matrix Multiplication
  • 21. Results of Naïve vs. Cache Friendly
    Naïve Matrix Multiply vs. Cache Friendly Matrix Multiply
    (Slide figure: perf stat output for both runs; the annotated differences are ~72x, ~8x, ~3x, ~3x, ~2x, ~7x, and ~10x.)
    perf stat --repeat 5 --scale --event L1-dcache-load-misses,L1-dcache-prefetch-misses,LLC-load-misses,LLC-prefetch-misses,cache-misses,stalled-cycles-frontend java -Xmx13G -XX:-Inline -jar ~/sbt/bin/sbt-launch.jar "tungsten/run-main com.advancedspark.tungsten.matrix.Cache[Friendly|Naïve]MatrixMultiply 256 1"
  • 22. Visualizing and Finding Hotspots
    Flame Graphs with Java Stack Traces
    Images courtesy of http://techblog.netflix.com/2015/07/java-in-flames.html
  • 23. 100TB Daytona GraySort Challenge
    Focus on Network and Disk I/O Optimizations
    Improve Data Structs/Algos for Sort & Shuffle
    Saturate Network and Disk Controllers
  • 24. Winning Results
    Spark Goals:
      Saturate Network I/O
      Saturate Disk I/O
    (Slide figure: results table with columns for 2013 and 2014.)
  • 25. Winning Hardware Configuration
    Compute
      206 EC2 Worker nodes, 1 Master node
      AWS i2.8xlarge
      32 cores, Intel Xeon CPU E5-2670 @ 2.5 GHz
      244 GB RAM, 8 x 800 GB SSD, RAID 0 striping, ext4
      NOOP I/O scheduler: FIFO, request merging, no reordering
      3 GBps mixed read/write disk I/O per node
    Network
      Deployed within Placement Group/VPC
      Using AWS Enhanced Networking
      Single Root I/O Virtualization (SR-IOV): extension of PCIe
      10 Gbps, low latency, low jitter (iperf showed ~9.5 Gbps)
  • 26. Winning Software Configuration
    Spark 1.2, OpenJDK 1.7_<amazon-something>_u65-b17
    Disable caching, compression, spec execution, shuffle spill
    Force NODE_LOCAL task scheduling for optimal data locality
    HDFS 2.4.1 short-circuit for local reads, 2x replication
    4-6 partitions (tasks) per core is the Spark recommendation
      206 nodes * 32 cores = 6592 cores
      6592 cores * 4 = 26,368 partitions
      6592 cores * 6 = 39,552 partitions
      6592 cores * 4.25 ≈ 28,000 partitions was empirically best
    Range partitioning takes advantage of sequential keyspace
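The partition arithmetic on slide 26 can be reproduced with a few lines of Java. The helper name `partitions` is made up for this sketch; the inputs (206 nodes, 32 cores per node, 4 to 6 partitions per core) come from the slide.

```java
final class PartitionMath {
    /** Partition count for a cluster, given a partitions-per-core multiplier. */
    static long partitions(int nodes, int coresPerNode, double partitionsPerCore) {
        return Math.round(nodes * coresPerNode * partitionsPerCore);
    }

    public static void main(String[] args) {
        System.out.println("cores      = " + (206 * 32));
        System.out.println("low  (4x)  = " + partitions(206, 32, 4.0));
        System.out.println("high (6x)  = " + partitions(206, 32, 6.0));
        System.out.println("4.25x      = " + partitions(206, 32, 4.25));
    }
}
```

A multiplier of 4.25 gives 28,016, which the slide rounds to the 28,000 partitions that worked best empirically.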
  • 27. New Shuffle Manager
    New “Sort-based” shuffle manager replaces Hash-based
    New Data Structures and Algos for Shuffle Sort
      e.g. New TimSort for Arrays of (K,V) Pairs
  • 28. New Network Module
    Replaces old java.nio, low-level, socket-based code
    Zero-copy epoll: kernel-space between disk & network
    Custom memory management
    spark.shuffle.blockTransferService=netty
    Spark-Netty Performance Tuning
      spark.shuffle.io.numConnectionsPerPeer
        Increase to saturate hosts with multiple disks
      spark.shuffle.io.preferDirectBufs
        On or Off-heap (Off-heap is default)
  • 29. New Algorithms and Data Structures
    Optimized for sort and shuffle
    o.a.s.util.collection.TimSort
      Based on JDK 1.7 TimSort
      Performs best on partially-sorted datasets
      Optimized for elements of (K,V) pairs
      Sorts impl of SortDataFormat (e.g. KVArraySortDataFormat)
    o.a.s.util.collection.AppendOnlyMap
      Open addressing hash, quadratic probing
      Array of [(K, V), (K, V)]
      Good memory locality
      Keys never removed, values only append
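The AppendOnlyMap properties listed on slide 29 (open addressing, quadratic probing, one flat array of alternating keys and values, no key removal) can be sketched as follows. This is a simplified stand-in rather than Spark's implementation: the class name `AppendOnlyMapSketch`, the `merge` method, the 0.7 load factor, and the hash-spreading step are all assumptions for illustration, and null keys are not supported.

```java
import java.util.function.BinaryOperator;

final class AppendOnlyMapSketch<K, V> {
    private static final double LOAD_FACTOR = 0.7; // assumed growth threshold
    private int capacity = 64;                     // always a power of two
    private Object[] data = new Object[2 * capacity]; // [k0, v0, k1, v1, ...]
    private int size = 0;

    int size() { return size; }

    @SuppressWarnings("unchecked")
    V get(K key) {
        int mask = capacity - 1;
        int pos = spread(key.hashCode()) & mask;
        for (int delta = 1; ; delta++) {
            Object k = data[2 * pos];
            if (k == null) return null;                    // empty slot: key is absent
            if (k.equals(key)) return (V) data[2 * pos + 1];
            pos = (pos + delta) & mask;                    // steps of 1, 2, 3, ... (quadratic)
        }
    }

    /** Inserts the value, or combines it with the existing value for this key. */
    @SuppressWarnings("unchecked")
    void merge(K key, V value, BinaryOperator<V> combine) {
        int mask = capacity - 1;
        int pos = spread(key.hashCode()) & mask;
        for (int delta = 1; ; delta++) {
            Object k = data[2 * pos];
            if (k == null) {
                data[2 * pos] = key;
                data[2 * pos + 1] = value;
                size++;
                if (size > LOAD_FACTOR * capacity) grow();
                return;
            }
            if (k.equals(key)) {
                data[2 * pos + 1] = combine.apply((V) data[2 * pos + 1], value);
                return;
            }
            pos = (pos + delta) & mask;
        }
    }

    /** Doubles the table and re-inserts every entry; keys are never removed. */
    private void grow() {
        Object[] old = data;
        int oldCapacity = capacity;
        capacity *= 2;
        data = new Object[2 * capacity];
        int mask = capacity - 1;
        for (int i = 0; i < oldCapacity; i++) {
            Object k = old[2 * i];
            if (k == null) continue;
            int pos = spread(k.hashCode()) & mask;
            for (int delta = 1; data[2 * pos] != null; delta++) {
                pos = (pos + delta) & mask;
            }
            data[2 * pos] = k;
            data[2 * pos + 1] = old[2 * i + 1];
        }
    }

    private static int spread(int h) { return h ^ (h >>> 16); }

    public static void main(String[] args) {
        AppendOnlyMapSketch<String, Integer> counts = new AppendOnlyMapSketch<>();
        for (String w : "a b a c a b".split(" ")) counts.merge(w, 1, Integer::sum);
        System.out.println(counts.get("a") + " " + counts.get("b") + " " + counts.get("c"));
    }
}
```

Keeping each key next to its value in one array is what gives the good memory locality the slide mentions: a successful probe finds both in adjacent slots.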
  • 30. Met Performance Goals!
    Reducers: 1.1 GBps/node network I/O (theoretical max = 1.25 GBps for 10 Gbps Ethernet)
    Mappers: 3 GBps/node disk I/O (8 x 800 GB SSD)
    206 nodes * 1.1 GBps/node ~= 220 GBps
  • 31. Shuffle Performance Tuning Tips
    Hash Shuffle Manager (no longer default)
      spark.shuffle.consolidateFiles: mapper output files
      o.a.s.shuffle.FileShuffleBlockResolver
    Intermediate Files
      Increase spark.shuffle.file.buffer: reduce seeks & sys calls
      Increase spark.reducer.maxSizeInFlight if memory allows
      Use smaller number of larger workers to reduce total files
    SQL: BroadcastHashJoin vs. ShuffledHashJoin
      spark.sql.autoBroadcastJoinThreshold
      Use DataFrame.explain(true) or EXPLAIN to verify
  • 32. Project Tungsten
    Focus on CPU Cache and Memory Optimizations
    Further Improve Data Structures and Algorithms
    Operate on Serialized/Compressed Data
    Provide Path to Off Heap
  • 33. Why is CPU the Bottleneck?
    Network and Disk I/O bandwidth are relatively high
    GraySort optimizations improved network & shuffle
    More partitioning, pruning, and predicate pushdowns
    Popularity of columnar file formats like Parquet/ORC
    CPU is used for serialization, hashing, compression!
  • 34. Spark Shuffle Managers
    spark.shuffle.manager =
    hash (< 10,000 Reducers)
      Output file determined by hashing the key of (K,V) pair
      Each mapper creates an output buffer/file per reducer
      Leads to M*R number of output buffers/files per shuffle
    sort (>= 10,000 Reducers)
      Default since Spark 1.2
      Minimizes OS resources
      Uses Netty to optimize Network I/O
      Created custom Data Structs/Algos
      Wins Daytona GraySort Challenge
    unsafe -> Tungsten, Default in Spark 1.5
      Uses sun.misc.Unsafe to self-manage binary array buffers
      Uses custom serialization format
      Can operate on compressed and serialized buffers
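The hash shuffle behaviour on slide 34 comes down to two small calculations, sketched here in Java. The names `reducerFor` and `outputFiles` are illustrative; the non-negative modulo of the key's hash code is the usual way to map a key to a reducer, and the M*R file count is taken from the slide.

```java
final class HashShuffleSketch {
    /** Picks the reducer (and therefore the output file) for a key by hashing it. */
    static int reducerFor(Object key, int numReducers) {
        // floorMod keeps the result in [0, numReducers) even for negative hash codes.
        return Math.floorMod(key.hashCode(), numReducers);
    }

    /** Each of M mappers writes one buffer/file per reducer: M * R in total. */
    static long outputFiles(long mappers, long reducers) {
        return mappers * reducers;
    }

    public static void main(String[] args) {
        System.out.println("reducer for key \"a\" of 4: " + reducerFor("a", 4));
        System.out.println("files for 1,000 x 1,000:  " + outputFiles(1_000, 1_000));
        System.out.println("files for 28,000 x 28,000: " + outputFiles(28_000, 28_000));
    }
}
```

The quadratic growth in file count is why the hash manager is listed only for smaller reducer counts, and why the sort-based manager minimizes OS resources.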
  • 35. New Data Structures
    “I don’t know your data structure, but my array[] will beat it!”
    Custom Data Structures for Sort/Shuffle Workload
      UnsafeRow
      BytesToBytesMap
  • 36. sun.misc.Unsafe
    Info: addressSize(), pageSize()
    Objects: allocateInstance(), objectFieldOffset()
    Classes: staticFieldOffset(), defineClass(), defineAnonymousClass(), ensureClassInitialized()
    Synchronization: monitorEnter(), tryMonitorEnter(), monitorExit(), compareAndSwapInt(), putOrderedInt()
    Arrays: arrayBaseOffset(), arrayIndexScale()
    Memory: allocateMemory(), copyMemory(), freeMemory(), getAddress() (not guaranteed after GC), getInt()/putInt(), getBoolean()/putBoolean(), getByte()/putByte(), getShort()/putShort(), getLong()/putLong(), getFloat()/putFloat(), getDouble()/putDouble(), getObjectVolatile()/putObjectVolatile()
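A few of the memory methods listed on slide 36 can be exercised with a minimal off-heap array. This is a sketch under assumptions: the class `OffHeapLongArray` is invented for the example, the instance is obtained through the well-known reflective `theUnsafe` field, and it relies on a JDK that still permits these calls (recent JDKs deprecate the memory-access methods of sun.misc.Unsafe and may warn or refuse).

```java
import java.lang.reflect.Field;
import sun.misc.Unsafe;

final class OffHeapLongArray implements AutoCloseable {
    private static final Unsafe UNSAFE = loadUnsafe();
    private final long base;   // raw address of the off-heap block
    private final int length;

    OffHeapLongArray(int length) {
        this.length = length;
        this.base = UNSAFE.allocateMemory(8L * length);   // not tracked by the GC
        UNSAFE.setMemory(base, 8L * length, (byte) 0);    // allocateMemory does not zero
    }

    void set(int index, long value) {
        check(index);
        UNSAFE.putLong(base + 8L * index, value);
    }

    long get(int index) {
        check(index);
        return UNSAFE.getLong(base + 8L * index);
    }

    @Override
    public void close() {
        UNSAFE.freeMemory(base);   // must be freed manually
    }

    private void check(int index) {
        if (index < 0 || index >= length) {
            throw new IndexOutOfBoundsException("index " + index + ", length " + length);
        }
    }

    private static Unsafe loadUnsafe() {
        try {
            Field field = Unsafe.class.getDeclaredField("theUnsafe");
            field.setAccessible(true);
            return (Unsafe) field.get(null);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("sun.misc.Unsafe is not available", e);
        }
    }

    public static void main(String[] args) {
        try (OffHeapLongArray array = new OffHeapLongArray(4)) {
            array.set(2, 42L);
            System.out.println(array.get(2));
        }
    }
}
```

The bounds check is the sketch's own safety net; Unsafe itself performs none, which is where both the speed and the risk come from.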
  • 37. Spark + sun.misc.Unsafe
    org.apache.spark.sql.execution.
      aggregate.SortBasedAggregate, aggregate.TungstenAggregate, aggregate.AggregationIterator, aggregate.udaf, aggregate.utils, SparkPlanner, rowFormatConverters, UnsafeFixedWidthAggregationMap, UnsafeExternalSorter, UnsafeExternalRowSorter, UnsafeKeyValueSorter, UnsafeKVExternalSorter, local.ConvertToUnsafeNode, local.ConvertToSafeNode, local.HashJoinNode, local.ProjectNode, local.LocalNode, local.BinaryHashJoinNode, local.NestedLoopJoinNode, joins.HashJoin, joins.HashSemiJoin, joins.HashedRelation, joins.BroadcastHashJoin, joins.ShuffledHashOuterJoin (not yet converted), joins.BroadcastHashOuterJoin, joins.BroadcastLeftSemiJoinHash, joins.BroadcastNestedLoopJoin, joins.SortMergeJoin, joins.LeftSemiJoinBNL, joins.SortMergeOuterJoin, Exchange, SparkPlan, UnsafeRowSerializer, SortPrefixUtils, sort, basicOperators, aggregate.SortBasedAggregationIterator, aggregate.TungstenAggregationIterator, datasources.WriterContainer, datasources.json.JacksonParser, datasources.jdbc.JDBCRDD
    org.apache.spark.
      unsafe.Platform, unsafe.KVIterator, unsafe.array.LongArray, unsafe.array.ByteArrayMethods, unsafe.array.BitSet, unsafe.bitset.BitSetMethods, unsafe.hash.Murmur3_x86_32, unsafe.map.BytesToBytesMap, unsafe.map.HashMapGrowthStrategy, unsafe.memory.TaskMemoryManager, unsafe.memory.ExecutorMemoryManager, unsafe.memory.MemoryLocation, unsafe.memory.UnsafeMemoryAllocator, unsafe.memory.MemoryAllocator (trait/interface), unsafe.memory.MemoryBlock, unsafe.memory.HeapMemoryAllocator, unsafe.sort.RecordComparator, unsafe.sort.PrefixComparator, unsafe.sort.PrefixComparators, unsafe.sort.UnsafeSorterSpillWriter, serializer.DummySerializationInstance, shuffle.unsafe.UnsafeShuffleManager, shuffle.unsafe.UnsafeShuffleSortDataFormat, shuffle.unsafe.SpillInfo, shuffle.unsafe.UnsafeShuffleWriter, shuffle.unsafe.UnsafeShuffleExternalSorter, shuffle.unsafe.PackedRecordPointer, shuffle.ShuffleMemoryManager, util.collection.unsafe.sort.UnsafeSorterSpillMerger, util.collection.unsafe.sort.UnsafeSorterSpillReader, util.collection.unsafe.sort.UnsafeSorterSpillWriter, util.collection.unsafe.sort.UnsafeShuffleInMemorySorter, util.collection.unsafe.sort.UnsafeInMemorySorter, util.collection.unsafe.sort.RecordPointerAndKeyPrefix, util.collection.unsafe.sort.UnsafeSorterIterator, network.shuffle.ExternalShuffleBlockResolver, scheduler.Task, rdd.SqlNewHadoopRDD, executor.Executor
    org.apache.spark.sql.catalyst.expressions.
      regexpExpressions, BoundAttribute, SortOrder, SpecializedGetters, ExpressionEvalHelper, UnsafeArrayData, UnsafeReaders, UnsafeMapData, Projection, LiteralGenerator, UnsafeRow, JoinedRow, InputFileName, SpecificMutableRow, codegen.CodeGenerator, codegen.GenerateProjection, codegen.GenerateUnsafeRowJoiner, codegen.GenerateSafeProjection, codegen.GenerateUnsafeProjection, codegen.BufferHolder, codegen.UnsafeRowWriter, codegen.UnsafeArrayWriter, complexTypeCreator, rows, literals, misc, stringExpressions
    Over 200 source files affected!!
  • 38. CPU & Memory Optimizations
    Custom Managed Memory
      Reduces GC overhead
      Both on and off heap
      Exact size calculations
    Direct Binary Processing
      Operate on serialized/compressed arrays
      Kryo can reorder serialized records
      LZF can reorder compressed records
    More CPU Cache-aware Data Structs & Algorithms
      o.a.s.unsafe.map.BytesToBytesMap vs. j.u.HashMap
    Code Generation (default in 1.5)
      Generate source code from overall query plan
      Janino generates bytecode from source code
      100+ UDFs converted to use code generation
    Classes called out on the slide
      UnsafeFixedWidthAggregationMap & TungstenAggregationIterator
      CodeGenerator & GenerateUnsafeRowJoiner
      UnsafeSortDataFormat & UnsafeShuffleSortDataFormat & PackedRecordPointer & UnsafeRow
      UnsafeInMemorySorter & UnsafeExternalSorter & UnsafeShuffleWriter
      Mostly Same Join Code, added if (isUnsafeMode)
      UnsafeShuffleManager & UnsafeShuffleInMemorySorter & UnsafeShuffleExternalSorter
    Details in SPARK-7075
  • 39. Code Generation (Default in 1.5)
    Problem
      Generic expression evaluation
      Expensive on JVM
      Virtual func calls
      Branches based on expression type
      Boxing causes excessive object creation
    Implementation
      Defer source code generation to each operator, type, etc
      Scala quasiquotes provide AST manipulation & rewriting
      Generates source code, compiled to bytecode w/ Janino
      100+ UDFs now using code gen
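The problem and the remedy on slide 39 can be illustrated with a toy expression tree in Java. This is a sketch only: the types `Expr`, `Col`, `Lit`, and `Add` are invented here and are not Catalyst classes, and the example stops at producing source text rather than compiling it with Janino. `eval` shows the generic path (a virtual call and boxed values at every node), while `gen` shows each node contributing a fragment of specialized source code.

```java
interface Expr {
    Object eval(Object[] row); // generic path: virtual call per node, boxed values
    String gen();              // code-gen path: source text over a primitive int[] row
}

final class Col implements Expr {
    private final int ordinal;
    Col(int ordinal) { this.ordinal = ordinal; }
    public Object eval(Object[] row) { return row[ordinal]; }
    public String gen() { return "row[" + ordinal + "]"; }
}

final class Lit implements Expr {
    private final int value;
    Lit(int value) { this.value = value; }
    public Object eval(Object[] row) { return value; } // boxes on every call
    public String gen() { return Integer.toString(value); }
}

final class Add implements Expr {
    private final Expr left;
    private final Expr right;
    Add(Expr left, Expr right) { this.left = left; this.right = right; }
    public Object eval(Object[] row) {
        int l = (Integer) left.eval(row);   // unbox
        int r = (Integer) right.eval(row);  // unbox
        return l + r;                       // box again
    }
    public String gen() { return "(" + left.gen() + " + " + right.gen() + ")"; }
}

final class CodegenSketch {
    public static void main(String[] args) {
        Expr expr = new Add(new Col(0), new Add(new Col(1), new Lit(3)));
        System.out.println("interpreted: " + expr.eval(new Object[] {1, 2}));
        System.out.println("generated:   int result = " + expr.gen() + ";");
    }
}
```

Once the generated text is compiled, the whole expression becomes straight-line primitive arithmetic with no virtual dispatch or boxing, which is the saving the slide describes.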
  • 40. Code Generation: Spark SQL UDFs
    100+ UDFs now using code gen. More to come in Spark 1.6!
    Details in SPARK-8159
  • 41. Project Tungsten in Other Spark Libraries
    SortDataFormat<K, Buffer>: Base trait
    UncompressedInBlockSort: MLlib.ALS
    EdgeArraySortDataFormat: GraphX.Edge
  • 42. Click to edit Master text styles Click to edit Master text styles IBM Spark spark.tc Click to edit Master text styles Spark SQL: DataSources and Tuning Understand Partitions, Pruning, Predicate Pushdowns Understand DataFrames, Catalyst, DataSources Create a DataSource Implementation 42
  • 43. Partitions. Partition based on data usage patterns: /genders.parquet/gender=M/…, /gender=F/… (use case: access users by gender), /gender=U/…. Partition Discovery (Read Path): infer partitions from organization of data (ie. gender=F). Dynamic Partitions (Write Path): dynamically create partitions based on given column(s). SQL: INSERT INTO TABLE genders PARTITION (gender) SELECT … DF: gendersDF.write.format("parquet").partitionBy("gender").save(…)
  • 44. Pruning. Partition Pruning: filter out entire rows that have been pre-partitioned, e.g. SELECT id, gender FROM genders WHERE gender = 'U' (gender = partition key). Column Pruning: filter out entire columns for all rows if not required; optimized for columnar storage formats (Parquet); minimize data shuffle during joins
  • 45. Predicate Pushdowns. “Predicate” == “Filter”; filters rows as deep into the data source as possible; predicate returns [true|false] for given func/condition
  • 46. Putting It All Together. Reduce Columns: Column Pruning. Reduce Rows: Partitioning, Predicate Pushdown. SELECT b FROM table WHERE a IN (a2, a3)
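The combination of partition pruning and column pruning can be sketched in plain Scala, with no Spark dependency. This is only a toy model under stated assumptions: the table, its partition key `a`, the columns `b` and `c`, and the `PruningSketch` object are all hypothetical, and real pruning happens inside the data source and Catalyst rather than in user code.

```scala
// Toy model of a partitioned table; all names and values are hypothetical.
object PruningSketch {
  // One row is a map from column name to value.
  type Row = Map[String, String]

  // Data laid out by partition key `a`, like /table/a=a1/, /table/a=a2/, ...
  val partitions: Map[String, Seq[Row]] = Map(
    "a1" -> Seq(Map("b" -> "b1", "c" -> "c1")),
    "a2" -> Seq(Map("b" -> "b2", "c" -> "c2")),
    "a3" -> Seq(Map("b" -> "b3", "c" -> "c3"))
  )

  // Models: SELECT b FROM table WHERE a IN (a2, a3)
  def selectB(wanted: Set[String]): Seq[String] =
    partitions
      .filter { case (key, _) => wanted.contains(key) } // partition pruning: skip whole partitions
      .toSeq
      .sortBy(_._1)                                      // deterministic output order
      .flatMap { case (_, rows) => rows.map(row => row("b")) } // column pruning: read only column b
}
```

Calling `PruningSketch.selectB(Set("a2", "a3"))` never touches partition `a1` or column `c`, which is the point of the slide.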
  • 47. DataFrames Overview. Inspired by R and Pandas DataFrames. Cross language support: SQL, Python, Scala, Java, R. Levels performance of Python, Scala, Java, and R: generates JVM bytecode vs serialize/pickle to Python. DataFrame is Container for Logical Plan: lazy transformations represented as tree; Catalyst Optimizer creates physical plan. DataFrame.rdd returns the underlying RDD if needed. Custom UDF using registerFunction(). New, experimental UDAF support. Use DataFrames instead of RDDs!!
  • 48. Catalyst Optimizer. Optimize DataFrame Transformation Tree: subquery elimination (use aliases to collapse subqueries); constant folding (replace expression with constant); simplify filters (remove unnecessary filters); predicate/filter pushdowns (avoid unnecessary data load); projection collapsing (avoid unnecessary projections). Create Custom Rules: rules are Scala case classes (implements o.a.s.sql.catalyst.rules.Rule; apply to any stage), e.g. val newPlan = MyFilterRule(analyzedPlan). JVM code generation
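Constant folding is easy to illustrate on a tiny expression tree. The sketch below is plain Scala and deliberately does not use Catalyst's `Rule` API; `Expr`, `Literal`, `Column`, `Add`, and `foldConstants` are hypothetical stand-ins chosen for this example.

```scala
object ConstantFoldingSketch {
  // A tiny expression tree, standing in for part of a logical plan.
  sealed trait Expr
  case class Literal(value: Int) extends Expr
  case class Column(name: String) extends Expr
  case class Add(left: Expr, right: Expr) extends Expr

  // Rule: rewrite Add(Literal, Literal) into a single Literal, bottom-up.
  def foldConstants(expr: Expr): Expr = expr match {
    case Add(l, r) =>
      (foldConstants(l), foldConstants(r)) match {
        case (Literal(a), Literal(b)) => Literal(a + b) // both sides constant: fold
        case (fl, fr)                 => Add(fl, fr)    // otherwise keep the Add node
      }
    case other => other // literals and columns are already as simple as possible
  }
}
```

For `x + (1 + 2)` the rule returns `x + 3`, so the addition of the two constants is done once at planning time instead of once per row.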
  • 49. Columnar Storage Format. Skip whole chunks with min-max heuristics stored in each chunk (sorted data only)
  • 50. Parquet File Format. Based on Google Dremel; implemented by Twitter and Cloudera; columnar storage format; optimized for fast columnar aggregations; tight compression; supports pushdowns; nested, self-describing, evolving schema
  • 51. Types of Compression. Run Length Encoding: repeated data; Dictionary Encoding: fixed set of values; Delta, Prefix Encoding: sorted data
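The three encodings can be sketched in a few lines of plain Scala. This is an illustration of the ideas only, not Parquet's actual on-disk encoders; the `EncodingSketch` object and its method names are hypothetical.

```scala
object EncodingSketch {
  // Run-length encoding: collapse runs of repeated values into (value, count).
  def runLength[A](xs: Seq[A]): List[(A, Int)] =
    xs.foldLeft(List.empty[(A, Int)]) {
      case ((v, n) :: rest, x) if v == x => (v, n + 1) :: rest // extend the current run
      case (acc, x)                      => (x, 1) :: acc      // start a new run
    }.reverse

  // Dictionary encoding: store each distinct value once, then small integer codes.
  def dictionary[A](xs: Seq[A]): (Vector[A], Seq[Int]) = {
    val dict = xs.distinct.toVector
    val index = dict.zipWithIndex.toMap
    (dict, xs.map(index))
  }

  // Delta encoding: store the first value, then differences between neighbors.
  // Works best on sorted data, where the differences are small.
  def delta(xs: Seq[Long]): Seq[Long] =
    if (xs.isEmpty) Seq.empty
    else xs.head +: xs.zip(xs.tail).map { case (prev, next) => next - prev }
}
```

A low-cardinality column such as gender (M, F, U) suits run-length and dictionary encoding, while a sorted ID column suits delta encoding.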
  • 52. Query Plan Debugging. gendersCsvDF.select($"id", $"gender").filter("gender != 'F'").filter("gender != 'M'").explain(true); DataFrame.queryExecution.logical; DataFrame.queryExecution.analyzed; DataFrame.queryExecution.optimizedPlan; DataFrame.queryExecution.executedPlan
  • 53. Query Plan Visualization & Query Metrics. Effectiveness of Filter; CPU Cache Friendly Binary Format; Cost-based Join Optimization (similar to MapReduce Map-side Join); Peak Memory for Joins and Aggs
  • 54. Demo! Show Various File Formats, Partitioning Schemes, DataSource Implementations, and Query Plans. Anonymous, Public Dating Dataset. RATINGS: UserID, ProfileID, Rating (1-10). GENDERS: UserID, Gender (M, F, U)
  • 55. DataSources API. Relations (o.a.s.sql.sources.interfaces.scala): BaseRelation (abstract class): provides schema of data; TableScan (impl): read all data from source; PrunedFilteredScan (impl): column pruning & predicate pushdowns; InsertableRelation (impl): insert/overwrite data based on SaveMode; RelationProvider (trait/interface): handle options, BaseRelation factory. Execution (o.a.s.sql.execution.commands.scala): RunnableCommand (trait/interface): common commands like EXPLAIN; ExplainCommand (impl: case class); CacheTableCommand (impl: case class). Filters (o.a.s.sql.sources.filters.scala): Filter (abstract class): handles all predicates/filters supported by this source; EqualTo (impl); GreaterThan (impl); StringStartsWith (impl)
  • 56. Native Spark SQL DataSources
  • 57. JSON Data Source. DataFrame: val ratingsDF = sqlContext.read.format("json").load("file:/root/pipeline/datasets/dating/ratings.json.bz2") -- or -- val ratingsDF = sqlContext.read.json("file:/root/pipeline/datasets/dating/ratings.json.bz2") (json() convenience method). SQL: CREATE TABLE genders USING json OPTIONS (path "file:/root/pipeline/datasets/dating/genders.json.bz2")
  • 58. JDBC Data Source. Add Driver to Spark JVM System Classpath: $ export SPARK_CLASSPATH=<jdbc-driver.jar>. DataFrame: val jdbcConfig = Map("driver" -> "org.postgresql.Driver", "url" -> "jdbc:postgresql://hostname:port/database", "dbtable" -> "schema.tablename"); sqlContext.read.format("jdbc").options(jdbcConfig).load(). SQL: CREATE TABLE genders USING jdbc OPTIONS (url, dbtable, driver, …)
  • 59. Parquet Data Source. Configuration: spark.sql.parquet.filterPushdown=true; spark.sql.parquet.mergeSchema=true; spark.sql.parquet.cacheMetadata=true; spark.sql.parquet.compression.codec=[uncompressed,snappy,gzip,lzo]. DataFrames: val gendersDF = sqlContext.read.format("parquet").load("file:/root/pipeline/datasets/dating/genders.parquet"); gendersDF.write.format("parquet").partitionBy("gender").save("file:/root/pipeline/datasets/dating/genders.parquet"). SQL: CREATE TABLE genders USING parquet OPTIONS (path "file:/root/pipeline/datasets/dating/genders.parquet")
  • 60. ORC Data Source. Configuration: spark.sql.orc.filterPushdown=true. DataFrames: val gendersDF = sqlContext.read.format("orc").load("file:/root/pipeline/datasets/dating/genders"); gendersDF.write.format("orc").partitionBy("gender").save("file:/root/pipeline/datasets/dating/genders"). SQL: CREATE TABLE genders USING orc OPTIONS (path "file:/root/pipeline/datasets/dating/genders")
  • 61. Third-Party Spark SQL DataSources
  • 62. CSV DataSource (Databricks). Github: https://github.com/databricks/spark-csv. Maven: com.databricks:spark-csv_2.10:1.2.0. Code: val gendersCsvDF = sqlContext.read.format("com.databricks.spark.csv").load("file:/root/pipeline/datasets/dating/gender.csv.bz2").toDF("id", "gender") (toDF() is required if CSV does not contain header)
  • 63. Avro DataSource (Databricks). Github: https://github.com/databricks/spark-avro. Maven: com.databricks:spark-avro_2.10:2.0.1. Code: val df = sqlContext.read.format("com.databricks.spark.avro").load("file:/root/pipeline/datasets/dating/gender.avro")
  • 64. ElasticSearch DataSource (Elastic.co). Github: https://github.com/elastic/elasticsearch-hadoop. Maven: org.elasticsearch:elasticsearch-spark_2.10:2.1.0. Code: val esConfig = Map("pushdown" -> "true", "es.nodes" -> "<hostname>", "es.port" -> "<port>"); df.write.format("org.elasticsearch.spark.sql").mode(SaveMode.Overwrite).options(esConfig).save("<index>/<document-type>")
  • 65. AWS Redshift Data Source (Databricks). Github: https://github.com/databricks/spark-redshift. Maven: com.databricks:spark-redshift:0.5.0. Code: val df: DataFrame = sqlContext.read.format("com.databricks.spark.redshift").option("url", "jdbc:redshift://<hostname>:<port>/<database>…").option("query", "select x, count(*) from my_table group by x").option("tempdir", "s3n://tmpdir").load(...) (UNLOAD and copy to tmp bucket in S3 enables parallel reads)
  • 66. Cassandra DataSource (DataStax). Github: https://github.com/datastax/spark-cassandra-connector. Maven: com.datastax.spark:spark-cassandra-connector_2.10:1.5.0-M1. Code: ratingsDF.write.format("org.apache.spark.sql.cassandra").mode(SaveMode.Append).options(Map("keyspace"->"<keyspace>", "table"->"<table>")).save(…)
  • 67. Cassandra Pushdown Support. spark-cassandra-connector/…/o.a.s.sql.cassandra.PredicatePushDown.scala. Pushdown Predicate Rules: 1. Only push down non-partition key column predicates with =, >, <, >=, <= predicate. 2. Only push down primary key column predicates with = or IN predicate. 3. If there are regular columns in the pushdown predicates, they should have at least one EQ expression on an indexed column and no IN predicates. 4. All partition column predicates must be included in the predicates to be pushed down; only the last part of the partition key can be an IN predicate. For each partition column, only one predicate is allowed. 5. For cluster column predicates, only the last predicate can be a non-EQ predicate (including IN predicate), and preceding column predicates must be EQ predicates. If there is only one cluster column predicate, the predicate could be any non-IN predicate. 6. There are no pushdown predicates if there is any OR condition or NOT IN condition. 7. We're not allowed to push down multiple predicates for the same column if any of them is an equality or IN predicate.
  • 68. Rumor of New Cassandra DataSource. By-pass CQL front door used for transactional data; bulk read/write directly from/to SSTables; similar to existing Netflix Open Source project https://github.com/Netflix/aegisthus; promotes Cassandra to first-class Analytics Option; potentially only part of DataStax Enterprise?! Please mail a nasty letter to your local DataStax office
  • 69. Creating a Custom Data Source. Study Existing Native and Third-Party Data Source Impls. Native: JDBC (o.a.s.sql.execution.datasources.jdbc): class JDBCRelation extends BaseRelation with PrunedFilteredScan with InsertableRelation. Third-Party: Cassandra (o.a.s.sql.cassandra): class CassandraSourceRelation extends BaseRelation with PrunedFilteredScan with InsertableRelation. <Insert Your Custom Data Source Here!>
  • 70. Cloudant DataSource (IBM). Github: http://spark-packages.org/package/cloudant/spark-cloudant. Maven: com.datastax.spark:spark-cassandra-connector_2.10:1.5.0-M1. Code: ratingsDF.write.format("com.cloudant.spark").mode(SaveMode.Append).options(Map("cloudant.host"->"<account>.cloudant.com", "cloudant.username"->"<username>", "cloudant.password"->"<password>")).save("<filename>")
  • 71. DB2 and BigSQL DataSources (IBM). Coming Soon!
  • 72. Rumor of REST DataSource (Databricks). Coming Soon? Ask Michael Armbrust, Spark SQL Lead @ Databricks
  • 73. Custom DataSource (Me and You All!). Coming Right Now! DEMO ALERT!!
  • 74. Demo! Create a Custom DataSource
  • 75. Contributing a Custom Data Source. spark-packages.org: managed by [logo]; contains links to externally-managed github projects; ratings and comments; requires supported Spark version for each package. Examples: https://github.com/databricks/spark-csv, https://github.com/databricks/spark-avro, https://github.com/databricks/spark-redshift
  • 76. Spark Streaming: Scaling & Approximations. Understand Parallelism, Recovery, and Back Pressure; Describe Common Streaming Count Approximations
  • 77. Direct Kafka Streaming. KafkaRDD partitions store relevant offsets; each partition acts as a Receiver; tasks/workers pull from Kafka in parallel; partitions rebuild from Kafka using offsets; no Write Ahead Log (WAL) needed; optimizes happy path by avoiding the WAL; at least once delivery guarantee
  • 78. Parallelism of Direct Kafka Streaming
  • 79. Not-so-direct Kinesis Streaming. KinesisRDD partitions store relevant offsets; single receiver required to see all data/offsets; Kinesis offsets not deterministic like Kafka; partitions rebuild from Kinesis using offsets; no Write Ahead Log (WAL) needed; optimizes happy path by avoiding the WAL; at least once delivery guarantee
  • 80. Streaming Back Pressure. More than Throttling; push back on the source; requires buffered source (Kafka, Kinesis); based on fundamentals of Control Theory; contributed by TypeSafe
  • 81. HyperLogLog. Approximate cardinality (approx count distinct); fixed, low memory; tunable error percentage; only 1.5KB @ 2% error, 10^9 elements; Twitter’s Algebird; streaming example in Spark codebase; Spark’s countApproxDistinctByKey() (http://research.neustar.biz/)
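The "1.5KB @ 2% error" figure can be sanity-checked with a little arithmetic. The sketch below assumes the usual HyperLogLog standard-error approximation of 1.04 / sqrt(m) for m registers, and further assumes 2048 registers of 6 bits each; the register layout is an assumption made for this example, not something stated on the slide.

```scala
object HllSizing {
  // Standard error of a HyperLogLog estimate with m registers: about 1.04 / sqrt(m).
  def standardError(registers: Int): Double =
    1.04 / math.sqrt(registers.toDouble)

  // Memory footprint in bytes, given the number of bits per register.
  def memoryBytes(registers: Int, bitsPerRegister: Int): Int =
    registers * bitsPerRegister / 8
}
```

Under those assumptions, `HllSizing.memoryBytes(2048, 6)` is 1536 bytes (1.5 KB) and `HllSizing.standardError(2048)` is roughly 0.023, which is in line with the slide's numbers. Quadrupling the register count halves the error.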
  • 82. Count Min Sketch. Approximate counters; better than HashMap; low, fixed memory; known error bounds; large num of counters; from Twitter Algebird; streaming example in Spark codebase
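A minimal Count-Min Sketch fits in a few lines of plain Scala. This is a teaching sketch, not Algebird's implementation: the class name, the default `depth` and `width`, and the choice of seeding `MurmurHash3.stringHash` with the row index are all assumptions made for this example.

```scala
import scala.util.hashing.MurmurHash3

// depth = number of hash rows, width = counters per row (hypothetical defaults).
class CountMinSketch(depth: Int = 4, width: Int = 1024) {
  // Fixed memory: depth * width counters, no matter how many items are added.
  private val table = Array.ofDim[Long](depth, width)

  // One hash function per row, derived by using the row index as the seed.
  private def bucket(item: String, row: Int): Int =
    Math.floorMod(MurmurHash3.stringHash(item, row), width)

  def add(item: String, count: Long = 1L): Unit =
    for (row <- 0 until depth) table(row)(bucket(item, row)) += count

  // Collisions only ever inflate a counter, so the minimum across rows
  // is the tightest estimate and never undercounts.
  def estimate(item: String): Long =
    (0 until depth).map(row => table(row)(bucket(item, row))).min
}
```

Unlike a HashMap, memory stays constant as the number of distinct keys grows; the price is that estimates can overcount when keys collide in every row.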
  • 83. Monte Carlo Simulations. From Manhattan Project (A-bomb): simulate movement of neutrons. Law of Large Numbers (LLN): average of results of many trials converges on expected value. SparkPi example in Spark codebase: Pi ~ (# red dots / # total dots * 4)
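The Pi estimate can be reproduced without a cluster. The sketch below is a single-machine version in plain Scala, not the SparkPi source; it samples the unit square and a quarter circle (the hit ratio is the same Pi / 4), and the object name, the fixed seed, and the trial count are choices made for this example.

```scala
import scala.util.Random

object MonteCarloPi {
  // Throw `trials` random darts at the unit square and count those landing
  // inside the quarter circle of radius 1; the hit ratio approaches Pi / 4.
  def estimate(trials: Int, seed: Long = 42L): Double = {
    val rng = new Random(seed) // fixed seed keeps the run reproducible
    val hits = (1 to trials).count { _ =>
      val x = rng.nextDouble()
      val y = rng.nextDouble()
      x * x + y * y <= 1.0
    }
    4.0 * hits / trials
  }
}
```

By the Law of Large Numbers the estimate tightens as `trials` grows; in Spark the same loop is simply spread across partitions and the hit counts are summed.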
  • 84. Spark ML: High Scale Machine Learning. Define Similarity and Dimension Reduction; Describe Sampling and Bucketing; Generate 10 Recommendations
  • 85. Live, Interactive Demo! sparkafterdark.com
  • 86. Audience Participation Needed!! Audience Instructions: navigate to sparkafterdark.com; click 3 actresses and 3 actors; wait for us to analyze together! Note: This is totally anonymous!! Project Links: https://github.com/fluxcapacitor/pipeline, https://hub.docker.com/r/fluxcapacitor
  • 87. Similarity
  • 88. Types of Similarity. Euclidean: linear measure; magnitude bias. Cosine: angle measure; adjust for magnitude bias. Jaccard: (intersection / union); popularity bias. Log Likelihood: adjust for popularity bias. [Figure: like matrix with rows Kimberly, Leslie, Meredith, Lisa, Holden and columns Ali, Matei, Reynold, Patrick, Andy]
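Three of these measures are short enough to write out. The sketch below is plain Scala for illustration; the object and method names are hypothetical, the inputs are assumed to be non-empty and non-zero, and the like sets used in the usage note are made up rather than taken from the slide's matrix.

```scala
object SimilaritySketch {
  // Euclidean distance: straight-line distance, sensitive to vector magnitude.
  def euclidean(a: Seq[Double], b: Seq[Double]): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  // Cosine similarity: angle between vectors, so magnitude cancels out.
  // Assumes neither vector is all zeros.
  def cosine(a: Seq[Double], b: Seq[Double]): Double = {
    val dot = a.zip(b).map { case (x, y) => x * y }.sum
    val norm = math.sqrt(a.map(x => x * x).sum) * math.sqrt(b.map(x => x * x).sum)
    dot / norm
  }

  // Jaccard similarity on sets of liked items: |intersection| / |union|.
  // Assumes at least one of the sets is non-empty.
  def jaccard[A](a: Set[A], b: Set[A]): Double =
    (a intersect b).size.toDouble / (a union b).size
}
```

For example, `jaccard(Set("Ali", "Matei", "Reynold"), Set("Matei", "Reynold", "Patrick"))` is 0.5, and two rating vectors that differ only in scale have a cosine similarity of about 1.0 even though their Euclidean distance is non-zero, which is the magnitude bias the slide refers to.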
  • 89. All-Pairs Similarity Comparison. Compare everything to everything, aka “pair-wise similarity” or “similarity join”. Naïve shuffle: O(m*n^2); m=rows, n=cols. Minimize shuffle through approximations! Reduce m (rows): sampling and bucketing. Reduce n (cols): remove most frequent value (ie. 0); Principal Component Analysis. Dimension reduction!!
  • 90. Dimension Reduction. Sampling and Bucketing
  • 91. Reduce m: DIMSUM Sampling. “Dimension Independent Matrix Square Using MR”; remove rows with low similarity probability; MLlib: RowMatrix.columnSimilarities(…); Twitter: 40% efficiency gain over Cosine Similarity
  • 92. Reduce m: LSH Bucketing. “Locality Sensitive Hashing”; split m into b buckets; use similarity hash algorithm; requires pre-processing of data; compare bucket contents in parallel. Converts O(m*n^2) -> O(m*n/b*b^2); m=rows, n=cols, b=buckets; ie. 500k x 500k matrix: O(1.25e17) -> O(1.25e13); b=50. github.com/mrsqueeze/spark-hash
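The slide's numbers can be reproduced with a quick calculation. The sketch below evaluates both cost expressions exactly as written, reading m*n/b*b^2 left to right as (m*n/b)*b^2; that reading is an assumption, chosen because it is the one that matches the slide's 1.25e13 figure. The `LshCost` object is hypothetical.

```scala
object LshCost {
  // Naive all-pairs cost from the slide: m * n^2.
  def naive(m: Double, n: Double): Double = m * n * n

  // Bucketed cost as written on the slide, read left to right: (m * n / b) * b^2.
  def bucketed(m: Double, n: Double, b: Double): Double = m * n / b * b * b
}
```

With m = n = 500,000 and b = 50, `LshCost.naive` gives 1.25e17 and `LshCost.bucketed` gives 1.25e13, a reduction of four orders of magnitude.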
  • 93. Reduce n: Remove Most Frequent Value. Eliminate most-frequent value; represent other values with (index,value) pairs. Converts O(m*n^2) -> O(m*nnz^2); nnz=num nonzeros, nnz << n. Note: choose most frequent value (may not be 0)
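The (index,value) representation can be sketched as follows. This is a plain Scala illustration, not MLlib's SparseVector; the `SparseSketch` object is hypothetical and the input is assumed to be non-empty.

```scala
object SparseSketch {
  // Drop the most frequent value and keep the rest as (index, value) pairs.
  // Returns the dropped value too, so the dense vector can be rebuilt later.
  // Assumes `dense` is non-empty.
  def toSparse(dense: Seq[Int]): (Int, Seq[(Int, Int)]) = {
    val mostFrequent = dense.groupBy(identity).maxBy(_._2.size)._1
    val pairs = dense.zipWithIndex.collect {
      case (value, index) if value != mostFrequent => (index, value)
    }
    (mostFrequent, pairs)
  }
}
```

For `Seq(0, 0, 5, 0, 3, 0)` this keeps only `(2, 5)` and `(4, 3)`, so any pairwise comparison works on two entries instead of six.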
  • 94. Recommendations. Summary Statistics and Historical Analysis; Collaborative Filtering and Clustering; Text Featurization and NLP
  • 95. Types of Recommendations. Non-personalized: no preference or behavior data for user, yet (aka “Cold Start Problem”). Personalized: User-Item Similarity (items that others with similar prefs have liked); Item-Item Similarity (items similar to your previously-liked items)
  • 96. Recommendation Terminology. User: user seeking recommendations. Item: item that has been liked or rated. Feedback: explicit (like, rating); implicit (search, click, hover, view, scroll). Feature Engineering: dimension reduction
  • 97. Non-Personalized Recommendations. Use Aggregate Data to Generate Recommendations
  • 98. Top Users by Like Count. “I might like users who have the most-likes overall based on historical data.” SparkSQL, DataFrames: Summary Stat, Aggs
  • 99. Top Influencers by Like Graph. “I might like the most-influential users in overall like graph.” GraphX: PageRank
  • 100. Demo! Generate Recommendations using Summary Stats & PageRank
  • 101. Personalized Recommendations. Use Similarity to Generate Personalized Recommendations
  • 102. Like Behavior of Similar Users. “I like the same people that you like. What other people did you like that I haven’t seen?” MLlib: Matrix Factorization, User-Item Similarity
  • 103. Demo! Generate Recommendations using Collaborative Filtering and Matrix Factorization
  • 104. Similar Text-based Profiles as Me. “Our profiles have similar keywords and named entities. We might like each other!” MLlib: Word2Vec, TF/IDF, k-skip n-grams
  • 105. Similar Profiles to Previous Likes. “Your profile text has similar keywords and named entities to other profiles of people I like. I might like you, too!” MLlib: Word2Vec, TF/IDF, Doc Similarity
  • 106. Relevant, High-Value Emails. “Your initial email references a lot of things in my profile. I might like you for making the effort!” MLlib: Word2Vec, TF/IDF, Entity Recognition. [Figure: Her Email vs. My Profile]
  • 107. The Future of Recommendations
  • 108. Eigenfaces: Facial Recognition. “Your face looks similar to others that I’ve liked. I might like you.” MLlib: RowMatrix, PCA, Item-Item Similarity. Image courtesy of http://crockpotveggies.com/2015/02/09/automating-tinder-with-eigenfaces.html
  • 109. NLP Conversation Starter Bot! “If your responses to my generic opening lines are positive, I may read your profile.” MLlib: TF/IDF, DecisionTrees, Sentiment Analysis (Positive / Negative)
  • 110. Maintaining the Spark
  • 111. Recommendations for Couples. “I want Mad Max. You want Message In a Bottle. Let’s find something in between to watch tonight.” MLlib: RowMatrix, Item-Item Similarity. GraphX: Nearest Neighbors, Shortest Path (similar plots, similar actors)
  • 112. Final Recommendation!
  • 113. Get Off the Computer & Meet People! Thank you, Paris!! Chris Fregly @cfregly, IBM Spark Technology Center, San Francisco, CA, USA. Relevant Links: advancedspark.com (signup for the book & global meetup!); github.com/fluxcapacitor/pipeline (clone, contribute, and commit code!); hub.docker.com/r/fluxcapacitor/pipeline/wiki (run all demos in your own environment with Docker!)
  • 114. More Relevant Links. http://meetup.com/Advanced-Apache-Spark-Meetup http://advancedspark.com http://github.com/fluxcapacitor/pipeline http://hub.docker.com/r/fluxcapacitor/pipeline http://sortbenchmark.org/ApacheSpark2014.pdf https://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html http://0x0fff.com/spark-architecture-shuffle/ http://www.cs.berkeley.edu/~kubitron/courses/cs262a-F13/projects/reports/project16_report.pdf http://stackoverflow.com/questions/763262/how-does-one-write-code-that-best-utilizes-the-cpu-cache-to-improve-performance http://www.aristeia.com/TalkNotes/ACCU2011_CPUCaches.pdf http://mishadoff.com/blog/java-magic-part-4-sun-dot-misc-dot-unsafe/ http://docs.scala-lang.org/overviews/quasiquotes/intro.html http://lwn.net/Articles/252125/ (Memory Part 2: CPU Caches) http://lwn.net/Articles/255364/ (Memory Part 5: What Programmers Can Do) https://www.safaribooksonline.com/library/view/java-performance-the/9781449363512/ch04.html http://web.eece.maine.edu/~vweaver/projects/perf_events/perf_event_open.html http://www.brendangregg.com/perf.html https://perf.wiki.kernel.org/index.php/Tutorial http://techblog.netflix.com/2015/07/java-in-flames.html http://techblog.netflix.com/2015/04/introducing-vector-netflixs-on-host.html http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html#Java
  • 115. What’s Next?
  • 116. What’s Next?
    Autoscaling Spark Workers
    Completely Docker-based
    Docker Compose and Docker Machine
    Lots of Demos and Examples!
    Zeppelin & IPython/Jupyter notebooks
    Advanced streaming use cases
    Advanced ML, Graph, and NLP use cases
    Performance Tuning and Profiling
    Work closely with Brendan Gregg & Netflix
    Surface & share more low-level details of Spark internals
  • 117. Upcoming Meetups and Conferences (Freg-a-palooza!)
    London Spark Meetup (Oct 12th)
    Scotland Data Science Meetup (Oct 13th)
    Dublin Spark Meetup (Oct 15th)
    Barcelona Spark Meetup (Oct 20th)
    Madrid Spark/Big Data Meetup (Oct 22nd)
    Paris Spark Meetup (Oct 26th)
    Amsterdam Spark Summit & Meetup (Oct 27th)
    Delft Dutch Data Science Meetup (Oct 29th)
    Brussels Spark Meetup (Oct 30th)
    Zurich Big Data Developers Meetup (Nov 2nd)
    Geneva Spark Meetup (Nov 5th)
    San Francisco Datapalooza (Nov 10th)
    San Francisco Advanced Apache Spark Meetup (Nov 12th)
    Oslo Big Data Hadoop Meetup (Nov 18th)
    Helsinki Spark Meetup (Nov 20th)
    Stockholm Spark Meetup (Nov 23rd)
    Copenhagen Spark Meetup (Nov 25th)
    Budapest Spark Meetup (Nov 27th)
    Singapore Strata Conference (Dec 1st)
    San Francisco Advanced Apache Spark Meetup (Dec 8th)
    Mountain View Advanced Apache Spark Meetup (Dec 10th)
    Washington DC Advanced Apache Spark Meetup (Dec 17th)