Madrid Spark Big Data Bluemix Meetup - Spark Versus Hadoop @ 100 TB Daytona GraySort Challenge - Oct 22, 2015

Chris Fregly
AI and Machine Learning @ AWS, O'Reilly Author @ Data Science on AWS, Founder @ PipelineAI, Formerly Databricks, Netflix
How Spark Beat Hadoop @ 100 TB Sort
+ 
Project Tungsten
Madrid Spark, Big Data, Bluemix Meetup
Chris Fregly, Principal Data Solutions Engineer
IBM Spark Technology Center
Oct 22, 2015
Power of data. Simplicity of design. Speed of innovation.
IBM | spark.tc
Who am I?!

Streaming Data Engineer
Netflix Open Source Committer

Data Solutions Engineer
Apache Contributor

Principal Data Solutions Engineer
IBM Spark Technology Center
Meetup Organizer
Advanced Apache Spark Meetup
Book Author
Advanced Spark (2016)
Advanced Apache Spark Meetup
Total Spark Experts: ~1400 in only 3 mos!
4th most active Spark Meetup in the world!

Goals
Dig deep into the Spark & extended-Spark codebase

Study integrations such as Cassandra, ElasticSearch, Tachyon, S3, BlinkDB, Mesos, YARN, Kafka, R, etc.

Surface and share the patterns and idioms of these well-designed, distributed, big data components
Freg-a-palooza Upcoming World Tour
  London Spark Meetup (Oct 12th)
  Scotland Data Science Meetup (Oct 13th)
  Dublin Spark Meetup (Oct 15th)
  Barcelona Spark Meetup (Oct 20th)
  Madrid Spark/Big Data Meetup (Oct 22nd)
  Paris Spark Meetup (Oct 26th)
  Amsterdam Spark Summit (Oct 27th – Oct 29th)
  Delft Dutch Data Science Meetup (Oct 29th)
  Brussels Spark Meetup (Oct 30th)
  Zurich Big Data Developers Meetup (Nov 2nd)
Daytona GraySort Challenge
sortbenchmark.org
Topics of this Talk: Mechanical Sympathy
Tungsten => Bare Metal
Seek Once, Scan Sequentially!
CPU Cache Locality and Efficiency
Use Data Structs Customized to Your Workload
Go Off-Heap Whenever Possible
spark.unsafe.offHeap=true
What is the Daytona GraySort Challenge?
Key Metric
Throughput of sorting 100 TB of 100-byte records, each with a 10-byte key
Total time includes launching app and writing output file

Daytona
App must be general purpose

Gray
Named after Jim Gray
Daytona GraySort Challenge: Input and Resources
Input
Records are 100 bytes in length
First 10 bytes are random key
Input generator: ordinal.com/gensort.html
28,000 fixed-size partitions for 100 TB sort
250,000 fixed-size partitions for 1 PB sort
1 partition = 1 HDFS block = 1 executor
Aligned to avoid partial read I/O, i.e. imaginary data
Hardware and Runtime Resources
Commercially available and off-the-shelf
Unmodified, no over/under-clocking
Generates 500 TB of disk I/O, 200 TB network I/O
Daytona GraySort Challenge: Rules
Must sort to/from OS files in secondary storage

No raw disk since I/O subsystem is being tested

File and device striping (RAID 0) are encouraged

Output file(s) must have correct key order
Daytona GraySort Challenge: Task Scheduling
Types of Data Locality (in decreasing level of read performance)
PROCESS_LOCAL
NODE_LOCAL
RACK_LOCAL
ANY

Delay Scheduling
spark.locality.wait.node: time to wait before falling back to the next, less-local level
Set to infinite to keep tasks from falling back; forces NODE_LOCAL
Straggling Executor JVMs naturally fade away on each run
Daytona GraySort Challenge: Winning Results
On-disk only, in-memory caching disabled!
[Results table not recovered from the slide image. Surviving labels: EC2 (i2.8xlarge) for both runs; 28,000 partitions; 250,000 partitions (!!); (3 GBps/node * 206 nodes)]
Daytona GraySort Challenge: EC2 Configuration
206 EC2 Worker nodes, 1 Master node
AWS i2.8xlarge
32 cores, Intel Xeon CPU E5-2670 @ 2.5 GHz
244 GB RAM, 8 x 800 GB SSD, RAID 0 striping, ext4
NOOP I/O scheduler: FIFO, request merging, no reordering
3 GBps mixed read/write disk I/O per node
Deployed within Placement Group/VPC
Enhanced Networking
Single Root I/O Virtualization (SR-IOV): extension of PCIe
10 Gbps, low latency, low jitter (iperf showed ~9.5 Gbps)
Daytona GraySort Challenge: Winning Configuration
Spark 1.2, OpenJDK 1.7_<amazon-something>_u65-b17
Disabled in-memory caching -- all on-disk!
HDFS 2.4.1 short-circuit local reads, 2x replication
Writes flushed after each of the 5 runs
28,000 partitions / (206 nodes * 32 cores) = ~4.25 waves of tasks, rounded up to 5 runs
Netty 4.0.23.Final with native epoll
Speculative Execution disabled: spark.speculation=false
Force NODE_LOCAL: spark.locality.wait.node=Infinite
Force Netty Off-Heap: spark.shuffle.io.preferDirectBuffers
Spilling disabled: spark.shuffle.spill=false
All compression disabled (network, on-disk, etc.)
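The "5 runs" figure in the winning configuration follows from simple slot arithmetic. A minimal sketch, using only the numbers quoted on the slide (the variable names are illustrative):

```python
import math

partitions = 28_000                  # 100 TB sort, from the slide
nodes, cores_per_node = 206, 32      # worker nodes and cores per node

slots = nodes * cores_per_node       # tasks that can run at the same time
waves = partitions / slots           # fractional number of task waves
runs = math.ceil(waves)              # a partial wave still costs a full run

print(f"{slots} slots, {waves:.2f} waves, {runs} runs")
```

The last wave is only about a quarter full, which is why the run count is rounded up rather than to the nearest integer.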
Daytona GraySort Challenge: Partitioning
Range Partitioning (vs. Hash Partitioning)
Take advantage of sequential key space
Similar keys grouped together within a partition
Ranges defined by sampling 79 values per partition
Driver sorts samples and defines range boundaries
Sampling took ~10 seconds for 28,000 partitions
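The sampling-based range partitioning described above can be sketched in a few lines. This is an illustration of the idea only, not Spark's RangePartitioner; the function names, the input sizes, and the choice of 4 output ranges are assumptions made for the example, while the 79 samples per partition comes from the slide.

```python
import bisect
import random

def range_boundaries(partitions, num_ranges, samples_per_partition=79):
    """Driver side: sample every partition, sort the samples, pick split points."""
    samples = []
    for part in partitions:
        samples.extend(random.sample(part, min(samples_per_partition, len(part))))
    samples.sort()
    step = len(samples) / num_ranges
    return [samples[int(step * i)] for i in range(1, num_ranges)]

def range_of(key, boundaries):
    """Index of the output range for a key: binary search over the split points."""
    return bisect.bisect_right(boundaries, key)

random.seed(0)
# Hypothetical input: 8 partitions, each holding 1,000 random 10-byte keys.
inputs = [[bytes(random.getrandbits(8) for _ in range(10)) for _ in range(1000)]
          for _ in range(8)]
bounds = range_boundaries(inputs, num_ranges=4)

ranges = [[] for _ in range(4)]
for part in inputs:
    for key in part:
        ranges[range_of(key, bounds)].append(key)

# Each range can be sorted on its own; concatenating them gives a total order.
total_order = [key for r in ranges for key in sorted(r)]
print([len(r) for r in ranges])
```

Because every key in range i is smaller than every key in range i+1, no merge step is needed after the per-range sorts, which is the property hash partitioning cannot offer.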
Daytona GraySort Challenge: Why Bother?
Sorting relies heavily on shuffle, I/O subsystem

Shuffle is major bottleneck in big data processing
Large number of partitions can exhaust OS resources

Shuffle optimization benefits all high-level libraries

Goal is to saturate network controller on all nodes
~125 MB/s (1 Gb Ethernet), 1.25 GB/s (10 Gb Ethernet)
Daytona GraySort Challenge: Per Node Results
Reducers: ~1.1 GB/s/node network I/O
(max 1.25 GB/s for 10 Gb Ethernet)
Mappers: 3 GB/s/node disk I/O (8 x 800 GB SSD)
206 nodes * 1.1 GB/s/node ~= 220 GB/s
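The per-node figures are easier to compare once bits and bytes are separated: link speeds are quoted in gigabits per second, throughput in gigabytes per second. A small worked calculation, using the slide's rounded numbers (so the aggregate comes out slightly above the slide's ~220):

```python
link_gbps = 10                     # 10 Gb Ethernet, gigabits per second
link_ceiling = link_gbps / 8       # same link in gigabytes per second
measured = 1.1                     # ~GB/s per node on the reducer side (slide)
nodes = 206

utilization = measured / link_ceiling   # share of the link actually used
aggregate = nodes * measured            # GB/s summed over the cluster

print(f"ceiling {link_ceiling} GB/s per node")
print(f"utilization {utilization:.0%}, aggregate {aggregate:.0f} GB/s")
```

Under these assumptions each reducer node runs close to the ceiling of its network link, which is the stated goal of the shuffle work.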
Quick Shuffle Refresher
Shuffle Overview
All to All, Cartesian Product Operation
[Figure: least useful example I could find]
Spark Shuffle Overview
[Figure: most confusing example I could find]
Stages are Defined by Shuffle Boundaries
Shuffle Intermediate Data: Spill to Disk
Intermediate shuffle data stored in memory
Spill to Disk
spark.shuffle.spill=true
spark.shuffle.memoryFraction = fraction of executor heap available to shuffle buffers
Competes with spark.storage.memoryFraction
Bump this up from default! Will help Spark SQL, too.
Skipped Stages
Reuse intermediate shuffle data found on reducer
DAG for that partition can be truncated
Shuffle Intermediate Data: Compression
spark.shuffle.compress
Compress outputs (mapper)

spark.shuffle.spill.compress
Compress spills (reducer)

spark.io.compression.codec
LZF: Most workloads (the original default)
Snappy: LARGE workloads (less memory required to compress; the default codec since Spark 1.1)
Spark Shuffle Operations
join
distinct
cogroup
coalesce
repartition
sortByKey
groupByKey
reduceByKey
aggregateByKey
Spark Shuffle Managers
spark.shuffle.manager = {
hash < 10,000 Reducers
Output file determined by hashing the key of (K,V) pair
Each mapper creates an output buffer/file per reducer
Leads to M*R number of output buffers/files per shuffle
sort >= 10,000 Reducers
Default since Spark 1.2
Wins Daytona GraySort Challenge w/ 250,000 reducers!!
tungsten-sort -> New in Spark 1.5 (opt-in; sort remains the default)
Uses sun.misc.Unsafe for direct access to off heap
}
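The gap between the hash and sort shuffle managers is easiest to see as a file count. A rough sketch, assuming for illustration that the 100 TB run used as many reducers as mappers (28,000 each; only the partition count is on the slides) and ignoring any index or spill files:

```python
def hash_shuffle_files(mappers, reducers):
    # Hash shuffle: every mapper opens one output file per reducer.
    return mappers * reducers

def sort_shuffle_files(mappers, reducers):
    # Sort shuffle: every mapper writes a single sorted file, whatever R is.
    return mappers

m = r = 28_000   # assumed M and R for the example
hash_files = hash_shuffle_files(m, r)
sort_files = sort_shuffle_files(m, r)
print(f"hash: {hash_files:,} files, sort: {sort_files:,} files")
```

The hash count grows with the product of M and R, which is what exhausts file handles and buffers at large reducer counts.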
Shuffle Managers!
Shuffle Performance Tuning
Hash Shuffle Manager (no longer default)
spark.shuffle.consolidateFiles: mapper output files
o.a.s.shuffle.FileShuffleBlockResolver
Intermediate Files
Increase spark.shuffle.file.buffer: reduce seeks & sys calls
Increase spark.reducer.maxSizeInFlight if memory allows
Use smaller number of larger workers to reduce total files
SQL: BroadcastHashJoin vs. ShuffledHashJoin
spark.sql.autoBroadcastJoinThreshold
Use DataFrame.explain(true) or EXPLAIN to verify
Mechanical Sympathy
Mechanical Sympathy
Use as much of the CPU cache line as possible!
Naïve Matrix Multiplication: Not Cache Friendly
Naive:
for (i = 0; i < N; ++i)
  for (j = 0; j < N; ++j)
    for (k = 0; k < N; ++k)
      res[i][j] += mat1[i][k] * mat2[k][j];  /* mat2 is read down a column */
Clever:
double mat2transpose[N][N];
for (i = 0; i < N; ++i)
  for (j = 0; j < N; ++j)
    mat2transpose[i][j] = mat2[j][i];
for (i = 0; i < N; ++i)
  for (j = 0; j < N; ++j)
    for (k = 0; k < N; ++k)
      res[i][j] += mat1[i][k] * mat2transpose[j][k];  /* both operands are read along a row */
Naive: prefetch is not effective on the column-wise traversal of mat2
Clever: force all row-wise traversal by transposing matrix 2
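A runnable rendering of the two loops, to check that the transposed variant computes the same product. Python lists do not have C's contiguous row-major layout, so this sketch demonstrates equivalence only, not the cache speedup; the function names are illustrative.

```python
def matmul_naive(a, b):
    n = len(a)
    res = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                res[i][j] += a[i][k] * b[k][j]   # b is walked down a column
    return res

def matmul_transposed(a, b):
    n = len(a)
    bt = [[b[j][i] for j in range(n)] for i in range(n)]   # bt[i][j] = b[j][i]
    res = [[0.0] * n for _ in range(n)]
    for i in range(n):
        row_a = a[i]
        for j in range(n):
            row_b = bt[j]
            acc = 0.0
            for k in range(n):
                acc += row_a[k] * row_b[k]       # both operands walked along a row
            res[i][j] = acc
    return res

a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]
print(matmul_naive(a, b) == matmul_transposed(a, b))
```

The transpose costs one extra pass over the second matrix, which the inner loop then amortizes over N row-wise scans.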
Winning Optimizations 
Deployed across Spark 1.1 and 1.2
Daytona GraySort Challenge: Winning Optimizations
CPU-Cache Locality: Mechanical Sympathy & Cache Locality/Alignment

Optimized Sort Algorithm: Elements of (K, V) Pairs

Reduce Network Overhead: Async Netty, epoll

Reduce OS Resource Utilization: Sort Shuffle
CPU-Cache Locality: Mechanical Sympathy
AlphaSort paper ~1995
Chris Nyberg and Jim Gray

Naïve
List (Pointer-to-Record)
Requires Key to be dereferenced for comparison

AlphaSort
List (Key, Pointer-to-Record)
Key is directly available for comparison

[Diagrams: (Key, Ptr) entry for AlphaSort; (Ptr) entry for naive]
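The difference between the two list layouts can be sketched as follows. This is an illustration of the idea rather than the AlphaSort or Spark implementation: records live in one byte buffer, a "pointer" is simply a byte offset into it, and the function names are made up for the example.

```python
import random
from functools import cmp_to_key

RECORD, KEY = 100, 10   # record and key sizes from the benchmark

def pointer_sort(buf):
    """Naive: sort offsets only; every comparison reaches back into the records."""
    def compare(a, b):
        ka, kb = buf[a:a + KEY], buf[b:b + KEY]   # dereference on each comparison
        return (ka > kb) - (ka < kb)
    return sorted(range(0, len(buf), RECORD), key=cmp_to_key(compare))

def key_pointer_sort(buf):
    """AlphaSort style: copy (key, offset) pairs out once, then sort the pairs."""
    pairs = [(buf[off:off + KEY], off) for off in range(0, len(buf), RECORD)]
    pairs.sort()                                  # keys sit next to their pointers
    return [off for _, off in pairs]

random.seed(1)
buf = bytes(random.getrandbits(8) for _ in range(RECORD * 50))   # 50 random records
order = key_pointer_sort(buf)
print(order[:5])
```

Both functions return the same ordering; the second touches the 100-byte records once instead of on every comparison, which is where the cache benefit comes from on real hardware.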
CPU-Cache Locality: Cache Locality/Alignment
Key (10 bytes) + Pointer (4 bytes*) = 14 bytes
*4 bytes when using compressed OOPs (<32 GB heap)
Not a power of 2 in size
Not CPU-cache friendly
Cache Alignment Options
Add Padding (2 bytes)
Key (10 bytes) + Pad (2 bytes) + Pointer (4 bytes) = 16 bytes
(Key-Prefix, Pointer-to-Record)
Key distribution affects performance
Prefix (4 bytes) + Pointer (4 bytes) = 8 bytes
[Diagram: (Key, Ptr) unpadded; (Key, Pad, Ptr) with padding; (Key-Prefix, Ptr) cache-line friendly]
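The three entry layouts can be checked with Python's struct module, which reports the packed size of each. The 64-byte cache line is an assumption (typical for x86, not stated on the slide); an entry size that divides the line evenly never straddles two lines.

```python
import struct

CACHE_LINE = 64   # bytes; assumed, typical for x86

layouts = {
    "key + pointer":       struct.calcsize("<10sI"),     # 10 + 4
    "key + pad + pointer": struct.calcsize("<10s2xI"),   # 10 + 2 + 4
    "prefix + pointer":    struct.calcsize("<II"),       # 4 + 4
}

for name, size in layouts.items():
    aligned = CACHE_LINE % size == 0          # True if entries never straddle a line
    print(f"{name}: {size} bytes, {CACHE_LINE // size} whole entries per line, "
          f"aligned={aligned}")
```

The key-prefix layout doubles the number of entries per cache line compared with the padded one, at the cost of falling back to the full key when two prefixes tie.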
CPU-Cache Locality: Performance Comparison
Similar Technique: Direct Cache Access
^ Packet header placed into CPU cache ^
Optimized Sort Algorithm: Elements of (K, V) Pairs
o.a.s.util.collection.TimSort
Based on JDK 1.7 TimSort
Performs best on partially-sorted datasets
Optimized for elements of (K,V) pairs
Sorts impl of SortDataFormat (e.g. KVArraySortDataFormat)

o.a.s.util.collection.AppendOnlyMap
Open addressing hash, quadratic probing
Array of [(K, V), (K, V)]
Good memory locality
Keys never removed, values only append
[Diagram: quadratic (^2) probing]
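A minimal sketch of an append-only, open-addressing map with keys and values interleaved in one flat array. It is written from the bullet points above, not copied from Spark's AppendOnlyMap; the class name, the 70% growth threshold, and the triangular probe sequence (a common form of quadratic probing for power-of-2 tables) are choices made for this example.

```python
class FlatAppendOnlyMap:
    """Keys and values interleaved in one list: [k0, v0, k1, v1, ...]."""

    _EMPTY = object()

    def __init__(self, capacity=8):            # capacity must be a power of 2
        self._cap = capacity
        self._data = [self._EMPTY] * (2 * capacity)
        self._size = 0

    def _slot(self, key):
        """Slot holding key, or the first empty slot on its probe path."""
        mask = self._cap - 1
        pos = hash(key) & mask
        step = 1
        while True:
            k = self._data[2 * pos]
            if k is self._EMPTY or k == key:
                return pos
            pos = (pos + step) & mask           # offsets 1, 3, 6, 10, ... from home
            step += 1

    def update(self, key, fn, default):
        """Set map[key] = fn(old value), starting from default if key is new."""
        pos = self._slot(key)
        if self._data[2 * pos] is self._EMPTY:
            self._data[2 * pos] = key
            self._data[2 * pos + 1] = default
            self._size += 1
        self._data[2 * pos + 1] = fn(self._data[2 * pos + 1])
        if self._size * 10 > self._cap * 7:     # grow once the table is 70% full
            self._grow()

    def _grow(self):
        old = self._data
        self._cap *= 2
        self._data = [self._EMPTY] * (2 * self._cap)
        for i in range(0, len(old), 2):         # re-insert every live pair
            if old[i] is not self._EMPTY:
                pos = self._slot(old[i])
                self._data[2 * pos] = old[i]
                self._data[2 * pos + 1] = old[i + 1]

    def items(self):
        d = self._data
        return [(d[i], d[i + 1]) for i in range(0, len(d), 2)
                if d[i] is not self._EMPTY]

counts = FlatAppendOnlyMap()
for word in "a b a c b a".split():
    counts.update(word, lambda v: v + 1, default=0)
print(sorted(counts.items()))
```

Because keys are never removed, the map needs no tombstones, and a lookup touches one contiguous array instead of chasing per-entry node objects.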
Reduce Network Overhead: Async Netty, epoll
New Network Module based on Async Netty
Replaces old java.nio, low-level, socket-based code
Zero-copy epoll uses kernel-space between disk & network
Custom memory management reduces GC pauses
spark.shuffle.blockTransferService=netty
Spark-Netty Performance Tuning
spark.shuffle.io.numConnectionsPerPeer
Increase to saturate hosts with multiple disks
spark.shuffle.io.preferDirectBuffers
On or Off-heap (Off-heap is default)
Hash Shuffle Manager!
M*R num open files per shuffle; M = num mappers, R = num reducers
Mapper Opens 1 File per Partition/Reducer
[Diagram not recovered; surviving labels: HDFS (2x repl), shown twice]
Reduce OS Resource Utilization: Sort Shuffle
M open files per shuffle; M = num of mappers
spark.shuffle.sort.bypassMergeThreshold
Mapper Merge Sorts Partitions into 1 Master File Indexed by Partition Range Offsets
Reducers seek and scan from range offset of Master File on Mapper
TimSort (RAM), Merge Sort (Disk)
SPARK-2926: Replace TimSort w/ Merge Sort (Memory)
[Diagram not recovered; surviving labels: HDFS (2x repl), shown twice; Master File]
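The "one master file plus range offsets" layout can be sketched with an in-memory buffer standing in for the mapper's file. This is a simplified illustration under stated assumptions, not Spark's writer: records are fixed-size byte strings, the partition function is supplied by the caller, and the function names are invented for the example.

```python
import io

def write_map_output(records, num_partitions, partition_of):
    """Mapper side: one output 'file', grouped by partition, plus an offset index."""
    buckets = [[] for _ in range(num_partitions)]
    for rec in records:
        buckets[partition_of(rec)].append(rec)
    data = io.BytesIO()
    offsets = [0]                        # offsets[p] = where partition p starts
    for bucket in buckets:
        for rec in sorted(bucket):       # records sorted within each partition
            data.write(rec)
        offsets.append(data.tell())      # end of p, which is the start of p + 1
    return data.getvalue(), offsets

def read_partition(data, offsets, p):
    """Reducer side: seek once to the range offset, then scan sequentially."""
    f = io.BytesIO(data)
    f.seek(offsets[p])
    return f.read(offsets[p + 1] - offsets[p])

records = [b"d1", b"a1", b"c1", b"b1"]
data, offsets = write_map_output(records, 2, lambda r: 0 if r < b"c" else 1)
print(offsets, read_partition(data, offsets, 1))
```

Each mapper keeps a single file open regardless of the number of reducers, and each reducer performs one seek per mapper followed by a sequential read.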
Project Tungsten
Deployed across Spark 1.4 and 1.5
Significant Spark Core Changes
Disk
Network
CPU
Memory
Daytona GraySort Optimizations (Spark 1.1-1.2, Late 2014)
Tungsten Optimizations (Spark 1.4-1.5, Late 2015)
Why is CPU the Bottleneck?
Network and Disk I/O bandwidth are relatively high

GraySort optimizations improved network & shuffle

Predicate pushdowns and partition pruning

Columnar file formats like Parquet and ORC

CPU used for serialization, hashing, compression
tungsten-sort Shuffle Manager
“I don’t know your data structure, but my array[] will beat it!”
Custom Data Structures for Sort/Shuffle Workload
UnsafeRow:
Rows are 8-byte aligned
Primitives are inlined
Row.equals(), Row.hashCode() operate on raw bytes
Offset (Int) and Length (Int) stored in a single Long
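Storing an offset and a length in one 64-bit word is plain bit packing. A minimal sketch, assuming the offset goes in the upper 32 bits and the length in the lower 32 (the exact bit layout is an assumption for illustration, and Python integers are unbounded, so the range check stands in for Java's fixed-width types):

```python
import struct

MASK32 = 0xFFFFFFFF

def pack(offset, length):
    """Upper 32 bits hold the offset, lower 32 bits hold the length."""
    if not (0 <= offset <= MASK32 and 0 <= length <= MASK32):
        raise ValueError("offset and length must each fit in 32 bits")
    return (offset << 32) | length

def unpack(word):
    return word >> 32, word & MASK32

word = pack(24, 11)                    # hypothetical field: 11 bytes at offset 24
raw = struct.pack("<Q", word)          # the packed word occupies exactly 8 bytes
print(unpack(word), len(raw))
```

One 8-byte slot per variable-length field keeps the fixed-width part of a row a multiple of 8 bytes, which fits the 8-byte alignment listed above.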
sun.misc.Unsafe
Info
addressSize()
pageSize()
Objects
allocateInstance()
objectFieldOffset()
Classes
staticFieldOffset()
defineClass()
defineAnonymousClass()
ensureClassInitialized()
Synchronization
monitorEnter()
tryMonitorEnter()
monitorExit()
compareAndSwapInt()
putOrderedInt()
Arrays
arrayBaseOffset()
arrayIndexScale()
Memory
allocateMemory()
copyMemory()
freeMemory()
getAddress() (not guaranteed correct if GC occurs)
getInt()/putInt()
getBoolean()/putBoolean()
getByte()/putByte()
getShort()/putShort()
getLong()/putLong()
getFloat()/putFloat()
getDouble()/putDouble()
getObjectVolatile()/putObjectVolatile()
[Legend: methods used by Spark were highlighted on the slide]
Spark + sun.misc.Unsafe
org.apache.spark.sql.execution.
aggregate.SortBasedAggregate
aggregate.TungstenAggregate
aggregate.AggregationIterator
aggregate.udaf
aggregate.utils
SparkPlanner
rowFormatConverters
UnsafeFixedWidthAggregationMap
UnsafeExternalSorter
UnsafeExternalRowSorter
UnsafeKeyValueSorter
UnsafeKVExternalSorter
local.ConvertToUnsafeNode
local.ConvertToSafeNode
local.HashJoinNode
local.ProjectNode
local.LocalNode
local.BinaryHashJoinNode
local.NestedLoopJoinNode
joins.HashJoin
joins.HashSemiJoin
joins.HashedRelation
joins.BroadcastHashJoin
joins.ShuffledHashOuterJoin (not yet converted)
joins.BroadcastHashOuterJoin
joins.BroadcastLeftSemiJoinHash
joins.BroadcastNestedLoopJoin
joins.SortMergeJoin
joins.LeftSemiJoinBNL
joins.SortMergeOuterJoin
Exchange
SparkPlan
UnsafeRowSerializer
SortPrefixUtils
sort
basicOperators
aggregate.SortBasedAggregationIterator
aggregate.TungstenAggregationIterator
datasources.WriterContainer
datasources.json.JacksonParser
datasources.jdbc.JDBCRDD
Window
org.apache.spark.
unsafe.Platform
unsafe.KVIterator
unsafe.array.LongArray
unsafe.array.ByteArrayMethods
unsafe.array.BitSet
unsafe.bitset.BitSetMethods
unsafe.hash.Murmur3_x86_32
unsafe.map.BytesToBytesMap
unsafe.map.HashMapGrowthStrategy
unsafe.memory.TaskMemoryManager
unsafe.memory.ExecutorMemoryManager
unsafe.memory.MemoryLocation
unsafe.memory.UnsafeMemoryAllocator
unsafe.memory.MemoryAllocator (trait/interface)
unsafe.memory.MemoryBlock
unsafe.memory.HeapMemoryAllocator
unsafe.sort.RecordComparator
unsafe.sort.PrefixComparator
unsafe.sort.PrefixComparators
unsafe.sort.UnsafeSorterSpillWriter
serializer.DummySerializationInstance
shuffle.unsafe.UnsafeShuffleManager
shuffle.unsafe.UnsafeShuffleSortDataFormat
shuffle.unsafe.SpillInfo
shuffle.unsafe.UnsafeShuffleWriter
shuffle.unsafe.UnsafeShuffleExternalSorter
shuffle.unsafe.PackedRecordPointer
shuffle.ShuffleMemoryManager
util.collection.unsafe.sort.UnsafeSorterSpillMerger
util.collection.unsafe.sort.UnsafeSorterSpillReader
util.collection.unsafe.sort.UnsafeSorterSpillWriter
util.collection.unsafe.sort.UnsafeShuffleInMemorySorter
util.collection.unsafe.sort.UnsafeInMemorySorter
util.collection.unsafe.sort.RecordPointerAndKeyPrefix
util.collection.unsafe.sort.UnsafeSorterIterator
network.shuffle.ExternalShuffleBlockResolver
scheduler.Task
rdd.SqlNewHadoopRDD
executor.Executor
org.apache.spark.sql.catalyst.expressions.
regexpExpressions
BoundAttribute
SortOrder
SpecializedGetters
ExpressionEvalHelper
UnsafeArrayData
UnsafeReaders
UnsafeMapData
Projection
LiteralGenerator
UnsafeRow
JoinedRow
InputFileName
SpecificMutableRow
codegen.CodeGenerator
codegen.GenerateProjection
codegen.GenerateUnsafeRowJoiner
codegen.GenerateSafeProjection
codegen.GenerateUnsafeProjection
codegen.BufferHolder
codegen.UnsafeRowWriter
codegen.UnsafeArrayWriter
complexTypeCreator
rows
literals
misc
stringExpressions
Over 200 source files affected!!
CPU and Memory Optimizations
Custom Managed Memory
Reduces GC overhead
Both on and off heap
Exact size calculations
Direct Binary Processing
Operate on serialized/compressed arrays
Kryo can reorder serialized records
LZF can reorder compressed records
More CPU Cache-aware Data Structs & Algorithms
o.a.s.unsafe.map.BytesToBytesMap vs. j.u.HashMap
Code Generation (default in 1.5)
Generate source code from overall query plan
Janino generates bytecode from source code
100+ UDFs converted to use code generation
Details in SPARK-7075
Slide callouts (class names):
UnsafeFixedWidthAggregationMap & TungstenAggregationIterator
CodeGenerator & GenerateUnsafeRowJoiner
UnsafeSortDataFormat & UnsafeShuffleSortDataFormat & PackedRecordPointer & UnsafeRow
UnsafeInMemorySorter & UnsafeExternalSorter & UnsafeShuffleWriter
Mostly Same Join Code, added if (isUnsafeMode)
UnsafeShuffleManager & UnsafeShuffleInMemorySorter & UnsafeShuffleExternalSorter
Code Generation
Turned on by default in Spark 1.5
Problem: Generic expression evaluation
Expensive on JVM
Virtual function calls
Branches based on expression type
Excessive object creation due to primitive boxing
Implementation
Defer the source code generation to each operator, type, etc.
Scala quasiquotes provide Scala AST manipulation/rewriting
Generated source code is compiled to bytecode w/ Janino
100+ UDFs now using code gen
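The contrast between generic expression evaluation and generated code can be shown in miniature. This is an analogy only: Spark generates Java source and compiles it with Janino, whereas the sketch below generates Python source and uses the built-in compile() as a stand-in. The tuple-based expression tree and all function names are invented for the example.

```python
# Expression tree for (a + 1) * b, as nested tuples.
expr = ("mul", ("add", ("col", "a"), ("lit", 1)), ("col", "b"))

def interpret(node, row):
    """Generic evaluation: one branch on the node type per node, per row."""
    op = node[0]
    if op == "col":
        return row[node[1]]
    if op == "lit":
        return node[1]
    left, right = interpret(node[1], row), interpret(node[2], row)
    return left + right if op == "add" else left * right

def to_source(node):
    """Each node type emits its own fragment of source code."""
    op = node[0]
    if op == "col":
        return f"row[{node[1]!r}]"
    if op == "lit":
        return repr(node[1])
    symbol = {"add": "+", "mul": "*"}[op]
    return f"({to_source(node[1])} {symbol} {to_source(node[2])})"

def compile_expr(node):
    """Build the source once, compile it once, reuse the function for every row."""
    source = f"lambda row: {to_source(node)}"
    return eval(compile(source, "<generated>", "eval")), source

fn, source = compile_expr(expr)
row = {"a": 2, "b": 5}
print(source)
print(fn(row), interpret(expr, row))
```

The tree walk and the type dispatch happen once at compile time instead of once per row, which is the saving the slide attributes to code generation.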
Code Generation: Spark SQL UDFs
100+ UDFs now using code gen; more to come in Spark 1.6
Details in SPARK-8159
Project Tungsten: Beyond Core and Spark SQL!
SortDataFormat<K, Buffer>: Base trait
UncompressedInBlockSort: MLlib.ALS
EdgeArraySortDataFormat: GraphX.Edge
Relevant Links
  http://sortbenchmark.org/ApacheSpark2014.pdf
  https://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html
  http://0x0fff.com/spark-architecture-shuffle/
  http://www.cs.berkeley.edu/~kubitron/courses/cs262a-F13/projects/reports/project16_report.pdf
  http://stackoverflow.com/questions/763262/how-does-one-write-code-that-best-utilizes-the-cpu-cache-to-improve-performance
  http://www.aristeia.com/TalkNotes/ACCU2011_CPUCaches.pdf
  http://mishadoff.com/blog/java-magic-part-4-sun-dot-misc-dot-unsafe/
  http://docs.scala-lang.org/overviews/quasiquotes/intro.html
  http://lwn.net/Articles/252125/ <-- Memory Part 2: CPU Caches
  http://lwn.net/Articles/255364/ <-- Memory Part 5: What Programmers Can Do
Sign up for the book and meetup
advancedspark.com
Clone all code used today
github.com/fluxcapacitor/pipeline
Run all demos presented today
hub.docker.com/r/fluxcapacitor/pipeline
Sign up for our newsletter at
Thank You, Madrid!
IBM Spark
1 of 51

Recommended

Advanced Apache Spark Meetup Data Sources API Cassandra Spark Connector Spark... by
Advanced Apache Spark Meetup Data Sources API Cassandra Spark Connector Spark...Advanced Apache Spark Meetup Data Sources API Cassandra Spark Connector Spark...
Advanced Apache Spark Meetup Data Sources API Cassandra Spark Connector Spark...Chris Fregly
2.2K views55 slides
Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLl... by
Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLl...Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLl...
Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLl...Chris Fregly
2.1K views59 slides
Paris Spark Meetup Oct 26, 2015 - Spark After Dark v1.5 - Best of Advanced Ap... by
Paris Spark Meetup Oct 26, 2015 - Spark After Dark v1.5 - Best of Advanced Ap...Paris Spark Meetup Oct 26, 2015 - Spark After Dark v1.5 - Best of Advanced Ap...
Paris Spark Meetup Oct 26, 2015 - Spark After Dark v1.5 - Best of Advanced Ap...Chris Fregly
2.7K views118 slides
Scotland Data Science Meetup Oct 13, 2015: Spark SQL, DataFrames, Catalyst, ... by
Scotland Data Science Meetup Oct 13, 2015:  Spark SQL, DataFrames, Catalyst, ...Scotland Data Science Meetup Oct 13, 2015:  Spark SQL, DataFrames, Catalyst, ...
Scotland Data Science Meetup Oct 13, 2015: Spark SQL, DataFrames, Catalyst, ...Chris Fregly
2K views56 slides
Dublin Ireland Spark Meetup October 15, 2015 by
Dublin Ireland Spark Meetup October 15, 2015Dublin Ireland Spark Meetup October 15, 2015
Dublin Ireland Spark Meetup October 15, 2015Chris Fregly
729 views59 slides
Brussels Spark Meetup Oct 30, 2015: Spark After Dark 1.5:  Real-time, Advanc... by
Brussels Spark Meetup Oct 30, 2015:  Spark After Dark 1.5:  Real-time, Advanc...Brussels Spark Meetup Oct 30, 2015:  Spark After Dark 1.5:  Real-time, Advanc...
Brussels Spark Meetup Oct 30, 2015: Spark After Dark 1.5:  Real-time, Advanc...Chris Fregly
793 views121 slides

More Related Content

What's hot

Advanced Apache Spark Meetup: How Spark Beat Hadoop @ 100 TB Daytona GraySor... by
Advanced Apache Spark Meetup:  How Spark Beat Hadoop @ 100 TB Daytona GraySor...Advanced Apache Spark Meetup:  How Spark Beat Hadoop @ 100 TB Daytona GraySor...
Advanced Apache Spark Meetup: How Spark Beat Hadoop @ 100 TB Daytona GraySor...Chris Fregly
3.7K views49 slides
Spark After Dark: Real time Advanced Analytics and Machine Learning with Spark by
Spark After Dark:  Real time Advanced Analytics and Machine Learning with SparkSpark After Dark:  Real time Advanced Analytics and Machine Learning with Spark
Spark After Dark: Real time Advanced Analytics and Machine Learning with SparkChris Fregly
6.1K views55 slides
Zurich, Berlin, Vienna Spark and Big Data Meetup Nov 02 2015 by
Zurich, Berlin, Vienna Spark and Big Data Meetup Nov 02 2015Zurich, Berlin, Vienna Spark and Big Data Meetup Nov 02 2015
Zurich, Berlin, Vienna Spark and Big Data Meetup Nov 02 2015Chris Fregly
1.1K views121 slides
Helsinki Spark Meetup Nov 20 2015 by
Helsinki Spark Meetup Nov 20 2015Helsinki Spark Meetup Nov 20 2015
Helsinki Spark Meetup Nov 20 2015Chris Fregly
899 views146 slides
Stockholm Spark Meetup Nov 23 2015 Spark After Dark 1.5 by
Stockholm Spark Meetup Nov 23 2015 Spark After Dark 1.5Stockholm Spark Meetup Nov 23 2015 Spark After Dark 1.5
Stockholm Spark Meetup Nov 23 2015 Spark After Dark 1.5Chris Fregly
665 views104 slides
Cassandra Summit Sept 2015 - Real Time Advanced Analytics with Spark and Cass... by
Cassandra Summit Sept 2015 - Real Time Advanced Analytics with Spark and Cass...Cassandra Summit Sept 2015 - Real Time Advanced Analytics with Spark and Cass...
Cassandra Summit Sept 2015 - Real Time Advanced Analytics with Spark and Cass...Chris Fregly
4.8K views42 slides

What's hot(20)

Advanced Apache Spark Meetup: How Spark Beat Hadoop @ 100 TB Daytona GraySor... by Chris Fregly
Advanced Apache Spark Meetup:  How Spark Beat Hadoop @ 100 TB Daytona GraySor...Advanced Apache Spark Meetup:  How Spark Beat Hadoop @ 100 TB Daytona GraySor...
Advanced Apache Spark Meetup: How Spark Beat Hadoop @ 100 TB Daytona GraySor...
Chris Fregly3.7K views
Spark After Dark: Real time Advanced Analytics and Machine Learning with Spark by Chris Fregly
Spark After Dark:  Real time Advanced Analytics and Machine Learning with SparkSpark After Dark:  Real time Advanced Analytics and Machine Learning with Spark
Spark After Dark: Real time Advanced Analytics and Machine Learning with Spark
Chris Fregly6.1K views
Zurich, Berlin, Vienna Spark and Big Data Meetup Nov 02 2015 by Chris Fregly
Zurich, Berlin, Vienna Spark and Big Data Meetup Nov 02 2015Zurich, Berlin, Vienna Spark and Big Data Meetup Nov 02 2015
Zurich, Berlin, Vienna Spark and Big Data Meetup Nov 02 2015
Chris Fregly1.1K views
Helsinki Spark Meetup Nov 20 2015 by Chris Fregly
Helsinki Spark Meetup Nov 20 2015Helsinki Spark Meetup Nov 20 2015
Helsinki Spark Meetup Nov 20 2015
Chris Fregly899 views
Stockholm Spark Meetup Nov 23 2015 Spark After Dark 1.5 by Chris Fregly
Stockholm Spark Meetup Nov 23 2015 Spark After Dark 1.5Stockholm Spark Meetup Nov 23 2015 Spark After Dark 1.5
Stockholm Spark Meetup Nov 23 2015 Spark After Dark 1.5
Chris Fregly665 views
Cassandra Summit Sept 2015 - Real Time Advanced Analytics with Spark and Cass... by Chris Fregly
Cassandra Summit Sept 2015 - Real Time Advanced Analytics with Spark and Cass...Cassandra Summit Sept 2015 - Real Time Advanced Analytics with Spark and Cass...
Cassandra Summit Sept 2015 - Real Time Advanced Analytics with Spark and Cass...
Chris Fregly4.8K views
Spark, Similarity, Approximations, NLP, Recommendations - Boulder Denver Spar... by Chris Fregly
Spark, Similarity, Approximations, NLP, Recommendations - Boulder Denver Spar...Spark, Similarity, Approximations, NLP, Recommendations - Boulder Denver Spar...
Spark, Similarity, Approximations, NLP, Recommendations - Boulder Denver Spar...
Chris Fregly3.4K views
Copenhagen Spark Meetup Nov 25, 2015 by Chris Fregly
Copenhagen Spark Meetup Nov 25, 2015Copenhagen Spark Meetup Nov 25, 2015
Copenhagen Spark Meetup Nov 25, 2015
Chris Fregly770 views
Spark After Dark 2.0 - Apache Big Data Conf - Vancouver - May 11, 2016 by Chris Fregly
Spark After Dark 2.0 - Apache Big Data Conf - Vancouver - May 11, 2016Spark After Dark 2.0 - Apache Big Data Conf - Vancouver - May 11, 2016
Spark After Dark 2.0 - Apache Big Data Conf - Vancouver - May 11, 2016
Chris Fregly887 views
Advanced Apache Spark Meetup Approximations and Probabilistic Data Structures... by Chris Fregly
Advanced Apache Spark Meetup Approximations and Probabilistic Data Structures...Advanced Apache Spark Meetup Approximations and Probabilistic Data Structures...
Advanced Apache Spark Meetup Approximations and Probabilistic Data Structures...
Chris Fregly1.6K views
Toronto Spark Meetup Dec 14 2015 by Chris Fregly
Toronto Spark Meetup Dec 14 2015Toronto Spark Meetup Dec 14 2015
Toronto Spark Meetup Dec 14 2015
Chris Fregly1.3K views
Sydney Spark Meetup Dec 08, 2015 by Chris Fregly
Sydney Spark Meetup Dec 08, 2015Sydney Spark Meetup Dec 08, 2015
Sydney Spark Meetup Dec 08, 2015
Chris Fregly539 views
DC Spark Users Group March 15 2016 - Spark and Netflix Recommendations by Chris Fregly
DC Spark Users Group March 15 2016 - Spark and Netflix RecommendationsDC Spark Users Group March 15 2016 - Spark and Netflix Recommendations
DC Spark Users Group March 15 2016 - Spark and Netflix Recommendations
Chris Fregly1.7K views
London Spark Meetup Project Tungsten Oct 12 2015 by Chris Fregly
London Spark Meetup Project Tungsten Oct 12 2015London Spark Meetup Project Tungsten Oct 12 2015
London Spark Meetup Project Tungsten Oct 12 2015
Chris Fregly1.1K views
Melbourne Spark Meetup Dec 09 2015 by Chris Fregly
Melbourne Spark Meetup Dec 09 2015Melbourne Spark Meetup Dec 09 2015
Melbourne Spark Meetup Dec 09 2015
Chris Fregly533 views
Budapest Big Data Meetup Nov 26 2015 by Chris Fregly
Budapest Big Data Meetup Nov 26 2015Budapest Big Data Meetup Nov 26 2015
Budapest Big Data Meetup Nov 26 2015
Chris Fregly812 views
Boston Spark Meetup May 24, 2016 by Chris Fregly
Boston Spark Meetup May 24, 2016Boston Spark Meetup May 24, 2016
Boston Spark Meetup May 24, 2016
Chris Fregly2.1K views
Advanced Apache Spark Meetup Spark and Elasticsearch 02-15-2016 by Chris Fregly
Advanced Apache Spark Meetup Spark and Elasticsearch 02-15-2016Advanced Apache Spark Meetup Spark and Elasticsearch 02-15-2016
Advanced Apache Spark Meetup Spark and Elasticsearch 02-15-2016
Chris Fregly1.4K views
Spark after Dark by Chris Fregly of Databricks by Data Con LA
Spark after Dark by Chris Fregly of DatabricksSpark after Dark by Chris Fregly of Databricks
Spark after Dark by Chris Fregly of Databricks
Data Con LA4K views
Singapore Spark Meetup Dec 01 2015 by Chris Fregly
Singapore Spark Meetup Dec 01 2015Singapore Spark Meetup Dec 01 2015
Singapore Spark Meetup Dec 01 2015
Chris Fregly1.1K views

Viewers also liked

USF Seminar Series: Apache Spark, Machine Learning, Recommendations Feb 05 2016 by
USF Seminar Series:  Apache Spark, Machine Learning, Recommendations Feb 05 2016USF Seminar Series:  Apache Spark, Machine Learning, Recommendations Feb 05 2016
USF Seminar Series: Apache Spark, Machine Learning, Recommendations Feb 05 2016Chris Fregly
1.6K views74 slides
Chicago Spark Meetup 03 01 2016 - Spark and Recommendations by
Chicago Spark Meetup 03 01 2016 - Spark and RecommendationsChicago Spark Meetup 03 01 2016 - Spark and Recommendations
Chicago Spark Meetup 03 01 2016 - Spark and RecommendationsChris Fregly
1K views85 slides
Atlanta MLconf Machine Learning Conference 09-23-2016 by
Atlanta MLconf Machine Learning Conference 09-23-2016Atlanta MLconf Machine Learning Conference 09-23-2016
Atlanta MLconf Machine Learning Conference 09-23-2016Chris Fregly
1.1K views42 slides
Atlanta Spark User Meetup 09 22 2016 by
Atlanta Spark User Meetup 09 22 2016Atlanta Spark User Meetup 09 22 2016
Atlanta Spark User Meetup 09 22 2016Chris Fregly
576 views72 slides
Spark - The beginnings by
Spark -  The beginningsSpark -  The beginnings
Spark - The beginningsDaniel Leon
340 views27 slides
Big Data Day LA 2015 - Spark after Dark by Chris Fregly of Databricks by
Big Data Day LA 2015 - Spark after Dark by Chris Fregly of DatabricksBig Data Day LA 2015 - Spark after Dark by Chris Fregly of Databricks
Big Data Day LA 2015 - Spark after Dark by Chris Fregly of DatabricksData Con LA
906 views55 slides

Viewers also liked(20)

USF Seminar Series: Apache Spark, Machine Learning, Recommendations Feb 05 2016 by Chris Fregly
USF Seminar Series:  Apache Spark, Machine Learning, Recommendations Feb 05 2016USF Seminar Series:  Apache Spark, Machine Learning, Recommendations Feb 05 2016
USF Seminar Series: Apache Spark, Machine Learning, Recommendations Feb 05 2016
Chris Fregly1.6K views
Chicago Spark Meetup 03 01 2016 - Spark and Recommendations by Chris Fregly
Chicago Spark Meetup 03 01 2016 - Spark and RecommendationsChicago Spark Meetup 03 01 2016 - Spark and Recommendations
Chicago Spark Meetup 03 01 2016 - Spark and Recommendations
Chris Fregly1K views
Madrid Spark Big Data Bluemix Meetup - Spark Versus Hadoop @ 100 TB Daytona GraySort Challenge - Oct 22, 2015

  • 1. How Spark Beat Hadoop @ 100 TB Sort + Project Tungsten Madrid Spark, Big Data, Bluemix Meetup Chris Fregly, Principal Data Solutions Engineer IBM Spark Technology Center Oct 22, 2015 Power of data. Simplicity of design. Speed of innovation. IBM | spark.tc
  • 2. Who am I?
      Streaming Data Engineer
      Netflix Open Source Committer

      Data Solutions Engineer
      Apache Contributor

      Principal Data Solutions Engineer
      IBM Technology Center
      Meetup Organizer
      Advanced Apache Meetup
      Book Author
      Advanced Spark (2016)
  • 3. Advanced Apache Spark Meetup
      Total Spark Experts: ~1400 in only 3 months!
      4th most active Spark Meetup in the world!
      Goals
        Dig deep into the Spark & extended-Spark codebase
        Study integrations such as Cassandra, ElasticSearch, Tachyon, S3, BlinkDB, Mesos, YARN, Kafka, R, etc.
        Surface and share the patterns and idioms of these well-designed, distributed, big data components
  • 4. Freg-a-palooza Upcoming World Tour
      London Spark Meetup (Oct 12th)
      Scotland Data Science Meetup (Oct 13th)
      Dublin Spark Meetup (Oct 15th)
      Barcelona Spark Meetup (Oct 20th)
      Madrid Spark/Big Data Meetup (Oct 22nd)
      Paris Spark Meetup (Oct 26th)
      Amsterdam Spark Summit (Oct 27th – Oct 29th)
      Delft Dutch Data Science Meetup (Oct 29th)
      Brussels Spark Meetup (Oct 30th)
      Zurich Big Data Developers Meetup (Nov 2nd)
  • 6. Topics of this Talk: Mechanical Sympathy
      Tungsten => Bare Metal
      Seek Once, Scan Sequentially!
      CPU Cache Locality and Efficiency
      Use Data Structs Customized to Your Workload
      Go Off-Heap Whenever Possible
        spark.unsafe.offHeap=true
  • 7. What is the Daytona GraySort Challenge?
      Key Metric
        Throughput of sorting 100 TB of 100-byte records with 10-byte keys
        Total time includes launching the app and writing the output file
      Daytona
        App must be general purpose
      Gray
        Named after Jim Gray
  • 8. Daytona GraySort Challenge: Input and Resources
      Input
        Records are 100 bytes in length
        First 10 bytes are a random key
        Input generator: ordinal.com/gensort.html
        28,000 fixed-size partitions for 100 TB sort
        250,000 fixed-size partitions for 1 PB sort
        1 partition = 1 HDFS block = 1 executor
        Aligned to avoid partial read I/O, i.e. imaginary data
      Hardware and Runtime Resources
        Commercially available and off-the-shelf
        Unmodified, no over/under-clocking
        Generates 500 TB of disk I/O, 200 TB of network I/O
  • 9. Daytona GraySort Challenge: Rules
      Must sort to/from OS files in secondary storage
      No raw disk since I/O subsystem is being tested
      File and device striping (RAID 0) are encouraged
      Output file(s) must have correct key order
  • 10. Daytona GraySort Challenge: Task Scheduling
      Types of Data Locality (listed in decreasing level of read performance)
        PROCESS_LOCAL
        NODE_LOCAL
        RACK_LOCAL
        ANY
      Delay Scheduling
        spark.locality.wait.node: time to wait for next shitty level
        Set to infinite to reduce shittiness, force NODE_LOCAL
        Straggling Executor JVMs naturally fade away on each run
  • 11. Daytona GraySort Challenge: Winning Results
      On-disk only, in-memory caching disabled!
      [Results table: both Spark runs on EC2 (i2.8xlarge); 28,000 partitions and 250,000 partitions (!!); 3 GBps/node * 206 nodes]
  • 12. Daytona GraySort Challenge: EC2 Configuration
      206 EC2 Worker nodes, 1 Master node
      AWS i2.8xlarge
        32 Intel Xeon CPU E5-2670 @ 2.5 GHz
        244 GB RAM, 8 x 800 GB SSD, RAID 0 striping, ext4
        NOOP I/O scheduler: FIFO, request merging, no reordering
        3 GBps mixed read/write disk I/O per node
      Deployed within Placement Group/VPC
      Enhanced Networking
        Single Root I/O Virtualization (SR-IOV): extension of PCIe
        10 Gbps, low latency, low jitter (iperf showed ~9.5 Gbps)
  • 13. Daytona GraySort Challenge: Winning Configuration
      Spark 1.2, OpenJDK 1.7_<amazon-something>_u65-b17
      Disabled in-memory caching -- all on-disk!
      HDFS 2.4.1 short-circuit local reads, 2x replication
      Writes flushed after each of the 5 runs
        28,000 partitions / (206 nodes * 32 cores) = 4.25 runs, rounded up to 5 runs
      Netty 4.0.23.Final with native epoll
      Speculative Execution disabled: spark.speculation=false
      Force NODE_LOCAL: spark.locality.wait.node=Infinite
      Force Netty Off-Heap: spark.shuffle.io.preferDirectBuffers
      Spilling disabled: spark.shuffle.spill=false
      All compression disabled (network, on-disk, etc.)
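The "5 runs" figure on the slide above is a ceiling division of partitions by concurrent task slots. A minimal Java sketch of that arithmetic, assuming one task per core at a time (the class and method names are illustrative and not part of Spark):

```java
class TaskWaveSketch {
    /** Number of scheduling waves ("runs") when each core executes one task at a time. */
    static int waves(int partitions, int nodes, int coresPerNode) {
        int slots = nodes * coresPerNode;          // concurrent task slots in the cluster
        return (partitions + slots - 1) / slots;   // integer ceiling division
    }

    public static void main(String[] args) {
        int slots = 206 * 32;
        System.out.println("slots  = " + slots);
        System.out.println("exact  = " + 28_000 / (double) slots);
        System.out.println("waves  = " + waves(28_000, 206, 32));
    }
}
```

With 206 nodes of 32 cores there are 6,592 slots, so 28,000 partitions need a little over four full waves and the last, partial wave counts as a fifth.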
  • 14. Daytona GraySort Challenge: Partitioning
      Range Partitioning (vs. Hash Partitioning)
        Take advantage of sequential key space
        Similar keys grouped together within a partition
        Ranges defined by sampling 79 values per partition
        Driver sorts samples and defines range boundaries
        Sampling took ~10 seconds for 28,000 partitions
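The range-partitioning steps above (sample keys, sort the sample on the driver, derive boundaries, route each key by range) can be sketched in a few lines of Java. This is an illustration under simplifying assumptions, not Spark's RangePartitioner: keys are plain `long` values standing in for the 10-byte keys, and the class and method names are hypothetical.

```java
import java.util.Arrays;

class RangePartitionSketch {
    /** Derives numPartitions - 1 upper bounds from a sample of keys (the driver-side step). */
    static long[] boundaries(long[] sampledKeys, int numPartitions) {
        long[] sorted = sampledKeys.clone();
        Arrays.sort(sorted);
        long[] bounds = new long[numPartitions - 1];
        for (int i = 1; i < numPartitions; i++) {
            // Evenly spaced quantiles of the sorted sample become the range boundaries.
            bounds[i - 1] = sorted[(int) ((long) i * sorted.length / numPartitions)];
        }
        return bounds;
    }

    /** Returns the partition whose key range contains the key. */
    static int partitionFor(long key, long[] bounds) {
        int pos = Arrays.binarySearch(bounds, key);
        // binarySearch returns -(insertionPoint) - 1 when the key is not itself a boundary.
        return pos >= 0 ? pos : -pos - 1;
    }

    public static void main(String[] args) {
        long[] bounds = boundaries(new long[] {50, 10, 40, 20, 30, 60, 70, 80}, 4);
        System.out.println(Arrays.toString(bounds));
        System.out.println(partitionFor(31, bounds));
    }
}
```

Because every key in partition p is less than or equal to every key in partition p + 1, sorting each partition independently yields a globally sorted output.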
  • 15. Daytona GraySort Challenge: Why Bother?
      Sorting relies heavily on shuffle, I/O subsystem
      Shuffle is a major bottleneck in big data processing
        Large number of partitions can exhaust OS resources
      Shuffle optimization benefits all high-level libraries
      Goal is to saturate the network controller on all nodes
        ~125 MB/s (1 Gbps Ethernet), 1.25 GB/s (10 Gbps Ethernet)
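The saturation targets above come from dividing the link speed in bits by 8. A small Java sketch of that conversion, assuming decimal units and ignoring protocol overhead (the helper name is illustrative):

```java
class LinkBandwidthSketch {
    /** Converts a link speed in gigabits per second to megabytes per second (decimal units). */
    static double gbpsToMegabytesPerSecond(double gigabitsPerSecond) {
        return gigabitsPerSecond * 1_000.0 / 8.0;   // 1 Gb = 1,000 Mb; 8 bits per byte
    }

    public static void main(String[] args) {
        System.out.println(gbpsToMegabytesPerSecond(1));    // 1 Gbps link
        System.out.println(gbpsToMegabytesPerSecond(10));   // 10 Gbps link
    }
}
```

A 10 Gbps link therefore tops out at 1,250 MB/s, i.e. 1.25 GB/s, before any overhead.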
  • 16. Daytona GraySort Challenge: Per Node Results
      Reducers: ~1.1 GB/s/node network I/O (max 1.25 GB/s for 10 Gbps Ethernet)
      Mappers: 3 GB/s/node disk I/O (8 x 800 GB SSD)
      206 nodes * 1.1 GB/s/node ~= 220 GB/s
  • 18. Shuffle Overview
      All to All, Cartesian Product Operation
      [Diagram, captioned: Least Useful Example I Could Find]
  • 19. Spark Shuffle Overview
      [Diagram, captioned: Most Confusing Example I Could Find]
      Stages are Defined by Shuffle Boundaries
  • 20. Shuffle Intermediate Data: Spill to Disk
      Intermediate shuffle data stored in memory
      Spill to Disk
        spark.shuffle.spill=true
        spark.shuffle.memoryFraction=% of all shuffle buffers
        Competes with spark.storage.memoryFraction
        Bump this up from default! Will help Spark SQL, too.
      Skipped Stages
        Reuse intermediate shuffle data found on reducer
        DAG for that partition can be truncated
  • 21. Shuffle Intermediate Data: Compression
      spark.shuffle.compress
        Compress outputs (mapper)
      spark.shuffle.spill.compress
        Compress spills (reducer)
      spark.io.compression.codec
        LZF: Most workloads (new default for Spark)
        Snappy: LARGE workloads (less memory required to compress)
  • 22. Spark Shuffle Operations
      join
      distinct
      cogroup
      coalesce
      repartition
      sortByKey
      groupByKey
      reduceByKey
      aggregateByKey
  • 23. Spark Shuffle Managers
      spark.shuffle.manager = {
        hash: < 10,000 Reducers
          Output file determined by hashing the key of (K,V) pair
          Each mapper creates an output buffer/file per reducer
          Leads to M*R number of output buffers/files per shuffle
        sort: >= 10,000 Reducers
          Default since Spark 1.2
          Wins Daytona GraySort Challenge w/ 250,000 reducers!!
        tungsten-sort: Default in Spark 1.5
          Uses sun.misc.Unsafe for direct access to off heap
      }
  • 25. Shuffle Performance Tuning
      Hash Shuffle Manager (no longer default)
        spark.shuffle.consolidateFiles: mapper output files
        o.a.s.shuffle.FileShuffleBlockResolver
      Intermediate Files
        Increase spark.shuffle.file.buffer: reduce seeks & sys calls
        Increase spark.reducer.maxSizeInFlight if memory allows
        Use smaller number of larger workers to reduce total files
      SQL: BroadcastHashJoin vs. ShuffledHashJoin
        spark.sql.autoBroadcastJoinThreshold
        Use DataFrame.explain(true) or EXPLAIN to verify
  • 27. Mechanical Sympathy
      Use as much of the CPU cache line as possible!
  • 28. Naïve Matrix Multiplication: Not Cache Friendly
      Naive (prefetch is not effective on the column-wise traversal of mat2):
        for (i = 0; i < N; ++i)
          for (j = 0; j < N; ++j)
            for (k = 0; k < N; ++k)
              res[i][j] += mat1[i][k] * mat2[k][j];
      Clever (force all row-wise traversal by transposing matrix 2):
        double mat2transpose[N][N];
        for (i = 0; i < N; ++i)
          for (j = 0; j < N; ++j)
            mat2transpose[i][j] = mat2[j][i];
        for (i = 0; i < N; ++i)
          for (j = 0; j < N; ++j)
            for (k = 0; k < N; ++k)
              res[i][j] += mat1[i][k] * mat2transpose[j][k];
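For readers who want to run the comparison, here is a Java port of the two loops from the slide above, offered as a sketch rather than a benchmark. It assumes square matrices stored as `double[][]`; in Java each row is its own contiguous array, so walking `b[k][j]` with `k` in the inner loop still jumps between rows on every step. The class and method names are illustrative.

```java
class MatMulSketch {
    /** Inner loop walks down a column of b, touching a different row array on every step. */
    static double[][] multiplyNaive(double[][] a, double[][] b) {
        int n = a.length;
        double[][] res = new double[n][n];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                for (int k = 0; k < n; k++)
                    res[i][j] += a[i][k] * b[k][j];
        return res;
    }

    /** Transposes b first so the inner loop scans two rows sequentially. */
    static double[][] multiplyTransposed(double[][] a, double[][] b) {
        int n = a.length;
        double[][] bT = new double[n][n];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                bT[i][j] = b[j][i];
        double[][] res = new double[n][n];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                for (int k = 0; k < n; k++)
                    res[i][j] += a[i][k] * bT[j][k];
        return res;
    }

    public static void main(String[] args) {
        double[][] a = {{1, 2}, {3, 4}};
        double[][] b = {{5, 6}, {7, 8}};
        System.out.println(java.util.Arrays.deepToString(multiplyNaive(a, b)));
        System.out.println(java.util.Arrays.deepToString(multiplyTransposed(a, b)));
    }
}
```

Both methods add the same products in the same order, so the results are identical; only the memory access pattern differs. Any speedup depends on matrix size and hardware and is not measured here.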
  • 29. Winning Optimizations Deployed across Spark 1.1 and 1.2
  • 30. Daytona GraySort Challenge: Winning Optimizations
      CPU-Cache Locality: Mechanical Sympathy & Cache Locality/Alignment
      Optimized Sort Algorithm: Elements of (K, V) Pairs
      Reduce Network Overhead: Async Netty, epoll
      Reduce OS Resource Utilization: Sort Shuffle
  • 31. CPU-Cache Locality: Mechanical Sympathy
      AlphaSort paper ~1995
        Chris Nyberg and Jim Gray
      Naïve
        List (Pointer-to-Record)
        Requires Key to be dereferenced for comparison
      AlphaSort
        List (Key, Pointer-to-Record)
        Key is directly available for comparison
      [Diagram labels: Ptr entries vs. (Key, Ptr) entries]
  • 32. CPU-Cache Locality: Cache Locality/Alignment
      Key (10 bytes) + Pointer (4 bytes*) = 14 bytes
        *4 bytes when using compressed OOPs (<32 GB heap)
        Not a power of 2 in size
        Not CPU-cache friendly
      Cache Alignment Options
        Add Padding (2 bytes)
          Key (10 bytes) + Pad (2 bytes) + Pointer (4 bytes) = 16 bytes
        (Key-Prefix, Pointer-to-Record)
          Key distribution affects performance
          Prefix (4 bytes) + Pointer (4 bytes) = 8 bytes
      [Diagram labels: (Key, Ptr); (Key, Pad, Ptr) with padding; (Key-Prefix, Ptr), cache-line friendly]
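The (Key-Prefix, Pointer-to-Record) option above can be sketched in Java by packing a 4-byte prefix and a 4-byte record index into one `long`, sorting the primitive array, and dereferencing full keys only where prefixes tie. This is an illustration of the idea, not Spark's implementation; it assumes every key has at least 4 bytes, and all names are hypothetical.

```java
import java.util.Arrays;

class KeyPrefixSortSketch {
    /** Packs the first 4 key bytes (high 32 bits) and the record index (low 32 bits) into 8 bytes. */
    static long pack(byte[] key, int recordIndex) {
        long prefix = 0;
        for (int i = 0; i < 4; i++) {
            prefix = (prefix << 8) | (key[i] & 0xFFL);
        }
        // Flipping the sign bit makes signed long order match unsigned byte order.
        return ((prefix << 32) | (recordIndex & 0xFFFFFFFFL)) ^ Long.MIN_VALUE;
    }

    /** Full unsigned, byte-wise key comparison; only needed when two prefixes tie. */
    static int compareKeys(byte[] a, byte[] b) {
        for (int i = 0; i < Math.min(a.length, b.length); i++) {
            int diff = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (diff != 0) return diff;
        }
        return a.length - b.length;
    }

    /** Returns record indices in ascending key order. */
    static int[] sortedOrder(byte[][] keys) {
        long[] entries = new long[keys.length];
        for (int i = 0; i < keys.length; i++) entries[i] = pack(keys[i], i);
        Arrays.sort(entries);   // compares compact 8-byte entries, no record dereferencing

        int[] order = new int[keys.length];
        for (int i = 0; i < order.length; i++) order[i] = (int) entries[i];   // low 32 bits

        // Insertion sort inside each run of equal prefixes, using the full keys.
        for (int i = 1; i < order.length; i++) {
            int current = order[i];
            long prefix = entries[i] >>> 32;
            int j = i - 1;
            while (j >= 0 && (entries[j] >>> 32) == prefix
                    && compareKeys(keys[order[j]], keys[current]) > 0) {
                order[j + 1] = order[j];
                j--;
            }
            order[j + 1] = current;
        }
        return order;
    }

    public static void main(String[] args) {
        byte[][] keys = {{9, 0, 0, 0, 1}, {1, 0, 0, 0, 7}, {1, 0, 0, 0, 3}};
        System.out.println(Arrays.toString(sortedOrder(keys)));
    }
}
```

The tie-breaking pass is where "key distribution affects performance" shows up: if many keys share a prefix, the runs grow and more comparisons fall back to the full records.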
  • 33. CPU-Cache Locality: Performance Comparison
  • 34. Similar Technique: Direct Cache Access
      ^ Packet header placed into CPU cache ^
  • 35. Optimized Sort Algorithm: Elements of (K, V) Pairs
      o.a.s.util.collection.TimSort
        Based on JDK 1.7 TimSort
        Performs best on partially-sorted datasets
        Optimized for elements of (K, V) pairs
        Sorts impl of SortDataFormat (e.g. KVArraySortDataFormat)
      o.a.s.util.collection.AppendOnlyMap
        Open addressing hash, quadratic probing (^2 probing)
        Array of [(K, V), (K, V)]
        Good memory locality
        Keys never removed, values only append
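The AppendOnlyMap properties listed above (open addressing, quadratic probing, keys and values interleaved in one array, no removal) can be illustrated with a small Java sketch. This is not Spark's class: it uses triangular-number probe steps, one common form of quadratic probing, and it has a fixed power-of-two capacity with no growth. Whether Spark uses exactly this probe sequence is an assumption.

```java
class AppendOnlyMapSketch {
    private final Object[] data;   // keys and values interleaved: [k0, v0, k1, v1, ...]
    private final int mask;        // capacity - 1, valid because capacity is a power of two
    private int size;

    AppendOnlyMapSketch(int capacity) {
        if (capacity < 2 || Integer.bitCount(capacity) != 1) {
            throw new IllegalArgumentException("capacity must be a power of two >= 2");
        }
        data = new Object[2 * capacity];
        mask = capacity - 1;
    }

    /** Inserts or overwrites a value; there is deliberately no remove(). */
    void put(Object key, Object value) {
        int pos = key.hashCode() & mask;
        int delta = 1;
        while (true) {
            Object existing = data[2 * pos];
            if (existing == null) {
                // Keep one slot free so probing for an absent key always terminates.
                if (size == mask) throw new IllegalStateException("full (sketch does not grow)");
                data[2 * pos] = key;
                data[2 * pos + 1] = value;
                size++;
                return;
            }
            if (existing.equals(key)) {
                data[2 * pos + 1] = value;
                return;
            }
            pos = (pos + delta) & mask;   // offsets 1, 3, 6, 10, ... from the home slot
            delta++;
        }
    }

    /** Returns the stored value, or null if the key is absent. */
    Object get(Object key) {
        int pos = key.hashCode() & mask;
        int delta = 1;
        while (true) {
            Object existing = data[2 * pos];
            if (existing == null) return null;
            if (existing.equals(key)) return data[2 * pos + 1];
            pos = (pos + delta) & mask;
            delta++;
        }
    }

    int size() { return size; }

    public static void main(String[] args) {
        AppendOnlyMapSketch counts = new AppendOnlyMapSketch(8);
        counts.put("spark", 1);
        counts.put("spark", 2);
        System.out.println(counts.get("spark") + " " + counts.size());
    }
}
```

Storing each value right next to its key means a successful probe usually finds both in the same cache line, which is the "good memory locality" point on the slide.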
  • 36. Reduce Network Overhead: Async Netty, epoll
      New Network Module based on Async Netty
        Replaces old java.nio, low-level, socket-based code
        Zero-copy epoll uses kernel-space between disk & network
        Custom memory management reduces GC pauses
        spark.shuffle.blockTransferService=netty
      Spark-Netty Performance Tuning
        spark.shuffle.io.numConnectionsPerPeer
          Increase to saturate hosts with multiple disks
        spark.shuffle.io.preferDirectBuffers
          On or Off-heap (Off-heap is default)
  • 37. Hash Shuffle Manager
      M*R num open files per shuffle; M = num mappers, R = num reducers
      Mapper Opens 1 File per Partition/Reducer
      [Diagram labels: HDFS (2x repl), HDFS (2x repl)]
  • 38. Reduce OS Resource Utilization: Sort Shuffle
      M open files per shuffle; M = num of mappers
      spark.shuffle.sort.bypassMergeThreshold
      Mapper Merge Sorts Partitions into 1 Master File Indexed by Partition Range Offsets
      Reducers seek and scan from range offset of Master File on Mapper
      SPARK-2926: Replace TimSort w/ Merge Sort (Memory)
      [Diagram labels: HDFS (2x repl); TimSort (RAM); Merge Sort (Disk); Master File; HDFS (2x repl)]
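The sort-shuffle layout above (one master file per mapper, indexed by partition offsets, so a shuffle needs M files instead of M*R) can be mimicked in memory. The Java sketch below is an illustration only: a byte array stands in for the on-disk file, partition ids are assumed to lie in `[0, numPartitions)`, and all names are hypothetical rather than Spark's.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

class SortShuffleSketch {
    final byte[] masterFile;   // all partitions concatenated, ordered by partition id
    final int[] offsets;       // bytes offsets[p] .. offsets[p + 1] belong to partition p

    SortShuffleSketch(int numPartitions, int[] partitionIds, byte[][] records) {
        // Sort record positions by target partition (stable, so arrival order is kept).
        Integer[] order = new Integer[records.length];
        for (int i = 0; i < order.length; i++) order[i] = i;
        Arrays.sort(order, (x, y) -> Integer.compare(partitionIds[x], partitionIds[y]));

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        offsets = new int[numPartitions + 1];
        int next = 0;
        for (int p = 0; p < numPartitions; p++) {
            offsets[p] = out.size();   // index entry: where partition p starts
            while (next < order.length && partitionIds[order[next]] == p) {
                byte[] rec = records[order[next++]];
                out.write(rec, 0, rec.length);
            }
        }
        offsets[numPartitions] = out.size();
        masterFile = out.toByteArray();
    }

    /** What a reducer does: seek once to its offset, then scan sequentially. */
    byte[] readPartition(int p) {
        return Arrays.copyOfRange(masterFile, offsets[p], offsets[p + 1]);
    }

    public static void main(String[] args) {
        byte[][] records = {"c1".getBytes(), "a1".getBytes(), "c2".getBytes(), "b1".getBytes()};
        SortShuffleSketch shuffle = new SortShuffleSketch(3, new int[] {2, 0, 2, 1}, records);
        System.out.println(new String(shuffle.readPartition(2)));
        System.out.println(Arrays.toString(shuffle.offsets));
    }
}
```

With 250,000 mappers and 250,000 reducers (illustrative numbers), the hash layout would need 250,000 * 250,000 = 62.5 billion files per shuffle, while this layout needs 250,000 data files plus their indexes.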
  • 39. Project Tungsten Deployed across Spark 1.4 and 1.5
  • 40. Significant Spark Core Changes
      Disk, Network, CPU, Memory
      Daytona GraySort Optimizations (Spark 1.1-1.2, Late 2014)
      Tungsten Optimizations (Spark 1.4-1.5, Late 2015)
  • 41. Why is CPU the Bottleneck?
      Network and Disk I/O bandwidth are relatively high
      GraySort optimizations improved network & shuffle
      Predicate pushdowns and partition pruning
      Columnar file formats like Parquet and ORC
      CPU used for serialization, hashing, compression
  • 42. tungsten-sort Shuffle Manager
      “I don’t know your data structure, but my array[] will beat it!”
      Custom Data Structures for Sort/Shuffle Workload
      UnsafeRow:
        Rows are 8-byte aligned
        Primitives are inlined
        Row.equals(), Row.hashCode() operate on raw bytes
        Offset (Int) and Length (Int) stored in a single Long
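Two of the UnsafeRow points above, 8-byte alignment and an (offset, length) pair stored in a single long, come down to a few bit operations. A Java sketch follows; the field order (offset in the high 32 bits, length in the low 32 bits) is an assumption made for illustration, and the names are not Spark's.

```java
class UnsafeRowLayoutSketch {
    /** Stores a 32-bit offset in the high half and a 32-bit length in the low half of one long. */
    static long packOffsetAndLength(int offset, int length) {
        return ((long) offset << 32) | (length & 0xFFFFFFFFL);
    }

    static int offsetOf(long packed) {
        return (int) (packed >>> 32);
    }

    static int lengthOf(long packed) {
        return (int) packed;   // the cast keeps the low 32 bits
    }

    /** Rounds a byte count up to the next multiple of 8 (one 64-bit word). */
    static int roundUpTo8(int numBytes) {
        return (numBytes + 7) & ~7;
    }

    public static void main(String[] args) {
        long packed = packOffsetAndLength(24, 13);
        System.out.println(offsetOf(packed) + " " + lengthOf(packed) + " " + roundUpTo8(13));
    }
}
```

Keeping both values in one 8-byte word lets a variable-length field occupy a fixed-width slot in the row, with the actual bytes stored at the recorded offset.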
  • 44. Spark + sun.misc.Unsafe: over 200 source files affected!
    • org.apache.spark.sql.execution: aggregate.SortBasedAggregate, aggregate.TungstenAggregate, aggregate.AggregationIterator, aggregate.udaf, aggregate.utils, SparkPlanner, rowFormatConverters, UnsafeFixedWidthAggregationMap, UnsafeExternalSorter, UnsafeExternalRowSorter, UnsafeKeyValueSorter, UnsafeKVExternalSorter, local.ConvertToUnsafeNode, local.ConvertToSafeNode, local.HashJoinNode, local.ProjectNode, local.LocalNode, local.BinaryHashJoinNode, local.NestedLoopJoinNode, joins.HashJoin, joins.HashSemiJoin, joins.HashedRelation, joins.BroadcastHashJoin, joins.ShuffledHashOuterJoin (not yet converted), joins.BroadcastHashOuterJoin, joins.BroadcastLeftSemiJoinHash, joins.BroadcastNestedLoopJoin, joins.SortMergeJoin, joins.LeftSemiJoinBNL, joins.SortMergeOuterJoin, Exchange, SparkPlan, UnsafeRowSerializer, SortPrefixUtils, sort, basicOperators, aggregate.SortBasedAggregationIterator, aggregate.TungstenAggregationIterator, datasources.WriterContainer, datasources.json.JacksonParser, datasources.jdbc.JDBCRDD, Window
    • org.apache.spark: unsafe.Platform, unsafe.KVIterator, unsafe.array.LongArray, unsafe.array.ByteArrayMethods, unsafe.array.BitSet, unsafe.bitset.BitSetMethods, unsafe.hash.Murmur3_x86_32, unsafe.map.BytesToBytesMap, unsafe.map.HashMapGrowthStrategy, unsafe.memory.TaskMemoryManager, unsafe.memory.ExecutorMemoryManager, unsafe.memory.MemoryLocation, unsafe.memory.UnsafeMemoryAllocator, unsafe.memory.MemoryAllocator (trait/interface), unsafe.memory.MemoryBlock, unsafe.memory.HeapMemoryAllocator, unsafe.sort.RecordComparator, unsafe.sort.PrefixComparator, unsafe.sort.PrefixComparators, unsafe.sort.UnsafeSorterSpillWriter, serializer.DummySerializationInstance, shuffle.unsafe.UnsafeShuffleManager, shuffle.unsafe.UnsafeShuffleSortDataFormat, shuffle.unsafe.SpillInfo, shuffle.unsafe.UnsafeShuffleWriter, shuffle.unsafe.UnsafeShuffleExternalSorter, shuffle.unsafe.PackedRecordPointer, shuffle.ShuffleMemoryManager, util.collection.unsafe.sort.UnsafeSorterSpillMerger, util.collection.unsafe.sort.UnsafeSorterSpillReader, util.collection.unsafe.sort.UnsafeSorterSpillWriter, util.collection.unsafe.sort.UnsafeShuffleInMemorySorter, util.collection.unsafe.sort.UnsafeInMemorySorter, util.collection.unsafe.sort.RecordPointerAndKeyPrefix, util.collection.unsafe.sort.UnsafeSorterIterator, network.shuffle.ExternalShuffleBlockResolver, scheduler.Task, rdd.SqlNewHadoopRDD, executor.Executor
    • org.apache.spark.sql.catalyst.expressions: regexpExpressions, BoundAttribute, SortOrder, SpecializedGetters, ExpressionEvalHelper, UnsafeArrayData, UnsafeReaders, UnsafeMapData, Projection, LiteralGenerator, UnsafeRow, JoinedRow, InputFileName, SpecificMutableRow, codegen.CodeGenerator, codegen.GenerateProjection, codegen.GenerateUnsafeRowJoiner, codegen.GenerateSafeProjection, codegen.GenerateUnsafeProjection, codegen.BufferHolder, codegen.UnsafeRowWriter, codegen.UnsafeArrayWriter, complexTypeCreator, rows, literals, misc, stringExpressions
  • 45. CPU and Memory Optimizations
    • Custom managed memory
      • Reduces GC overhead
      • Both on and off heap
      • Exact size calculations
    • Direct binary processing
      • Operate on serialized/compressed arrays
      • Kryo can reorder serialized records
      • LZF can reorder compressed records
    • More CPU cache-aware data structs & algorithms
      • o.a.s.unsafe.map.BytesToBytesMap vs. j.u.HashMap
    • Code generation (default in 1.5)
      • Generate source code from overall query plan
      • Janino generates bytecode from source code
      • 100+ UDFs converted to use code generation
    • Details in SPARK-7075
    • Callouts (classes involved)
      • UnsafeFixedWidthAggregationMap & TungstenAggregationIterator
      • CodeGenerator & GenerateUnsafeRowJoiner
      • UnsafeSortDataFormat & UnsafeShuffleSortDataFormat & PackedRecordPointer & UnsafeRow
      • UnsafeInMemorySorter & UnsafeExternalSorter & UnsafeShuffleWriter
      • UnsafeShuffleManager & UnsafeShuffleInMemorySorter & UnsafeShuffleExternalSorter
      • Mostly same join code, added if (isUnsafeMode)
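One cache-aware idea behind names like PrefixComparator and RecordPointerAndKeyPrefix is to sort small (key prefix, pointer) pairs and touch the full record only when two prefixes tie. The sketch below illustrates that general technique in plain Python; it is not Spark's implementation, and `key_prefix` and `prefix_sort` are hypothetical names.

```python
from functools import cmp_to_key


def key_prefix(key):
    """First 8 bytes of the key as an unsigned big-endian int, zero-padded."""
    return int.from_bytes(key[:8].ljust(8, b"\x00"), "big")


def prefix_sort(records):
    """Sort byte-string records by comparing 8-byte prefixes first."""
    # The sort array holds only (prefix, index) pairs, which stay compact.
    entries = [(key_prefix(rec), i) for i, rec in enumerate(records)]

    def compare(a, b):
        if a[0] != b[0]:  # common case: the prefixes decide the order
            return -1 if a[0] < b[0] else 1
        # Tie on prefix: fall back to the full records (the "random access").
        rec_a, rec_b = records[a[1]], records[b[1]]
        return (rec_a > rec_b) - (rec_a < rec_b)

    entries.sort(key=cmp_to_key(compare))
    return [records[i] for _, i in entries]
```

In a JVM implementation the prefix and the pointer would sit side by side in one long array, so most comparisons never leave the cache line being scanned.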
  • 46. Code Generation
    • Turned on by default in Spark 1.5
    • Problem: generic expression evaluation is expensive on the JVM
      • Virtual function calls
      • Branches based on expression type
      • Excessive object creation due to primitive boxing
    • Implementation
      • Defer the source code generation to each operator, type, etc.
      • Scala quasiquotes provide Scala AST manipulation/rewriting
      • Generated source code is compiled to bytecode w/ Janino
    • 100+ UDFs now using code gen
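The contrast between generic expression evaluation and generated code can be shown with a toy expression tree. This is a Python analogy under assumptions, not Spark's generator: Python's built-in `compile` stands in for Janino, and the tuple-based tree format and function names are hypothetical.

```python
def interpret(expr, row):
    """Generic evaluation: walk the tree and branch on node type per row."""
    op = expr[0]
    if op == "lit":
        return expr[1]
    if op == "col":
        return row[expr[1]]
    left, right = interpret(expr[1], row), interpret(expr[2], row)
    if op == "add":
        return left + right
    if op == "mul":
        return left * right
    raise ValueError(f"unknown op: {op}")


def gen_source(expr):
    """Code generation: emit one flat source expression for the whole tree."""
    op = expr[0]
    if op == "lit":
        return repr(expr[1])
    if op == "col":
        return f"row[{expr[1]}]"
    symbol = {"add": "+", "mul": "*"}[op]
    return f"({gen_source(expr[1])} {symbol} {gen_source(expr[2])})"


def compile_expr(expr):
    """Compile the generated source once; reuse the function for every row."""
    source = f"def evaluate(row):\n    return {gen_source(expr)}\n"
    namespace = {}
    exec(compile(source, "<generated>", "exec"), namespace)
    return namespace["evaluate"]
```

The tree is walked once at compile time, so the per-row path has no type branches or recursive calls left in it.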
  • 47. Code Generation: Spark SQL UDFs
    • 100+ UDFs now using code gen; more to come in Spark 1.6
    • Details in SPARK-8159
  • 48. Project Tungsten: Beyond Core and Spark SQL
    • SortDataFormat<K, Buffer>: base trait
    • UncompressedInBlockSort: MLlib.ALS
    • EdgeArraySortDataFormat: GraphX.Edge
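The point of a SortDataFormat-style base trait is that one sort routine can work directly on a workload-specific buffer layout. Here is a rough Python analogue; the method names, the `PairArrayFormat` layout, and the use of insertion sort are all assumptions made for illustration, not Spark's API.

```python
from abc import ABC, abstractmethod


class SortDataFormat(ABC):
    """Tells a generic sort how to read keys and move elements in a buffer."""

    @abstractmethod
    def length(self, data):
        """Number of logical elements stored in the buffer."""

    @abstractmethod
    def get_key(self, data, pos):
        """Sort key of the element at logical position pos."""

    @abstractmethod
    def swap(self, data, pos0, pos1):
        """Exchange the elements at two logical positions, in place."""


class PairArrayFormat(SortDataFormat):
    """Keys and values interleaved in one flat list: [k0, v0, k1, v1, ...]."""

    def length(self, data):
        return len(data) // 2

    def get_key(self, data, pos):
        return data[2 * pos]

    def swap(self, data, pos0, pos1):
        a, b = 2 * pos0, 2 * pos1
        data[a], data[b] = data[b], data[a]
        data[a + 1], data[b + 1] = data[b + 1], data[a + 1]


def insertion_sort(data, fmt):
    """Sort the buffer in place, touching it only through the format."""
    for i in range(1, fmt.length(data)):
        j = i
        while j > 0 and fmt.get_key(data, j - 1) > fmt.get_key(data, j):
            fmt.swap(data, j - 1, j)
            j -= 1
```

Because the sort never allocates per-element objects, the same routine can serve a flat key-value array, an edge array, or any other packed layout that supplies its own format.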
  • 49. Relevant Links
    • http://sortbenchmark.org/ApacheSpark2014.pdf
    • https://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html
    • http://0x0fff.com/spark-architecture-shuffle/
    • http://www.cs.berkeley.edu/~kubitron/courses/cs262a-F13/projects/reports/project16_report.pdf
    • http://stackoverflow.com/questions/763262/how-does-one-write-code-that-best-utilizes-the-cpu-cache-to-improve-performance
    • http://www.aristeia.com/TalkNotes/ACCU2011_CPUCaches.pdf
    • http://mishadoff.com/blog/java-magic-part-4-sun-dot-misc-dot-unsafe/
    • http://docs.scala-lang.org/overviews/quasiquotes/intro.html
    • http://lwn.net/Articles/252125/ (Memory Part 2: CPU Caches)
    • http://lwn.net/Articles/255364/ (Memory Part 5: What Programmers Can Do)
  • 50. Sign up for the book and meetup: advancedspark.com
    • Clone all code used today: github.com/fluxcapacitor/pipeline
    • Run all demos presented today: hub.docker.com/r/fluxcapacitor/pipeline
    • IBM | spark.tc
    • Sign up for our newsletter at
    • Thank You, Madrid!
  • 51. Power of data. Simplicity of design. Speed of innovation. IBM Spark