
Helsinki Spark Meetup Nov 20 2015

Helsinki Spark Meetup Nov 20, 2015: Spark After Dark 1.5: Real-time, Advanced Analytics with Spark 1.5, Kafka, Cassandra, ElasticSearch, Zeppelin, and Docker

Published in: Software

  1. IBM Spark (spark.tc): Spark After Dark 1.5. High Performance, Real-time, Streaming, Machine Learning, Natural Language Processing, Text Analytics, and Recommendations. Chris Fregly, Principal Data Solutions Engineer, IBM Spark Technology Center. ** We’re Hiring -- Only Nice People, Please!! ** November 20, 2015
  2. Who Am I? (IBM Spark, spark.tc: Power of data. Simplicity of design. Speed of innovation.)
       - Streaming Data Engineer, Open Source Committer
       - Data Solutions Engineer, Apache Contributor
       - Principal Data Solutions Engineer, IBM Spark Technology Center
       - Founder, Advanced Apache Spark Meetup
       - Author of a forthcoming "Advanced" title, due 2016
       - Photo: My Ma’s First Time in California
  3. Random Slide: More Ma “First Time” Pics: In California; Using Chopsticks; Using “New” iPhone
  4. Upcoming Meetups and Conferences
       - London Spark Meetup (Oct 12th)
       - Scotland Data Science Meetup (Oct 13th)
       - Dublin Spark Meetup (Oct 15th)
       - Barcelona Spark Meetup (Oct 20th)
       - Madrid Big Data Meetup (Oct 22nd)
       - Paris Spark Meetup (Oct 26th)
       - Amsterdam Spark Summit (Oct 27th)
       - Brussels Spark Meetup (Oct 30th)
       - Zurich Big Data Meetup (Nov 2nd)
       - Geneva Spark Meetup (Nov 5th)
       - San Francisco Datapalooza.io (Nov 10th)
       - San Francisco Advanced Spark (Nov 12th)
       - Oslo Big Data Hadoop Meetup (Nov 19th)
       - Helsinki Spark Meetup (Nov 20th)
       - Stockholm Spark Meetup (Nov 23rd)
       - Copenhagen Spark Meetup (Nov 25th)
       - Budapest Spark Meetup (Nov 26th)
       - Singapore Strata Conference (Dec 1st)
       - San Francisco Advanced Spark (Dec 8th)
       - Mountain View Advanced Spark (Dec 10th)
       - Toronto Spark Meetup (Dec 14th)
       - Austin Data Days Conference (Jan 2016)
  5. Advanced Apache Spark Meetup
       - Meetup Metrics: 1600+ members in just 4 months! Top 5 most active Spark Meetup!!
       - Meetup Goals:
           - Dig deep into the codebase of Spark and related projects
           - Study integrations of Cassandra, ElasticSearch, Tachyon, S3, BlinkDB, Mesos, YARN, Kafka, R
           - Surface and share patterns and idioms of these well-designed, distributed, big data components
  6. All Slides and Code Are Available! advancedspark.com, slideshare.net/cfregly, github.com/fluxcapacitor, hub.docker.com/r/fluxcapacitor
  7. What is “Spark After Dark”?
       - Spark-based, Advanced Analytics Reference App
       - End-to-End, Scalable, Real-time Big Data Pipeline
       - Demonstration of Spark & Related Big Data Projects
       - github.com/fluxcapacitor
  8. Tools of This Talk
       - Kafka
       - Redis
       - Docker
       - Ganglia
       - Cassandra
       - Parquet, JSON, ORC, Avro
       - Apache Zeppelin Notebooks
       - Spark SQL, DataFrames, Hive
       - ElasticSearch, Logstash, Kibana
       - Spark ML, GraphX, Stanford CoreNLP
       - …
       - github.com/fluxcapacitor, hub.docker.com/r/fluxcapacitor
  9. Themes of this Talk
       - Filter
       - Off-Heap
       - Parallelize
       - Approximate
       - Find Similarity
       - Minimize Seeks
       - Maximize Scans
       - Customize for Workload
       - Tune Performance At Every Layer
       - Be Nice, Collaborate! Like a Mom!!
  10. Presentation Outline
       - Spark Core: Tuning & Mechanical Sympathy
       - Spark SQL: Query Optimizing & Catalyst
       - Spark Streaming: Scaling & Approximations
       - Spark ML: Featurizing & Recommendations
  11. Spark Core: Tuning & Mechanical Sympathy
       - Understand and Acknowledge Mechanical Sympathy
       - Study AlphaSort and 100 TB GraySort Challenge
       - Dive Deep into Project Tungsten
  12. Mechanical Sympathy
       - "Hardware and software working together in harmony." - Martin Thompson, http://mechanical-sympathy.blogspot.com
       - "Whatever your data structure, my array will beat it." - Scott Meyers, Every C++ Book, basically
       - "Hair Sympathy" - Bruce Jenner
  13. Spark and Mechanical Sympathy
       - Project Tungsten (Spark 1.4-1.6+): Minimize Memory and GC; Maximize CPU Cache Locality
       - GraySort Challenge (Spark 1.1-1.2): Saturate Network I/O; Saturate Disk I/O
  14. AlphaSort Technique: Sort 100-Byte Records
       - Naïve: List[Pointer]. Each comparison must dereference the pointer to reach the key.
       - AlphaSort: List[(Key, Pointer)]. The key is directly available for comparison; no dereference required!
  15. CPU Cache Line and Memory Sympathy
       - Key (10 bytes) + Pointer (4 bytes*) = 14 bytes: not CPU cache-line friendly!
       - Key (10 bytes) + Pad (2 bytes) + Pointer (4 bytes) = 16 bytes: CPU cache-line friendly!
       - Key-Prefix (4 bytes) + Pointer (4 bytes) = 8 bytes: 2x CPU cache-line friendly!
       - *4 bytes with Compressed OOPs
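The key-prefix layout above can be sketched in a few lines of Scala. This is only an illustration under stated assumptions, not Spark's or AlphaSort's actual code: the `Record` type, the 10-byte key, and the choice to pack a 4-byte prefix and a 4-byte array index (standing in for the pointer) into one `Long` are all hypothetical. Comparisons use the prefix alone and only dereference the record when two prefixes tie.

```scala
object AlphaSortSketch {
  // Hypothetical 100-byte record: a key of at least 4 bytes plus an opaque payload.
  final case class Record(key: Array[Byte], payload: Array[Byte])

  // First 4 key bytes packed big-endian, so prefix order matches unsigned key order.
  def keyPrefix(key: Array[Byte]): Int =
    ((key(0) & 0xFF) << 24) | ((key(1) & 0xFF) << 16) | ((key(2) & 0xFF) << 8) | (key(3) & 0xFF)

  // Unsigned lexicographic comparison of full keys; used only when prefixes tie.
  def compareKeys(a: Array[Byte], b: Array[Byte]): Int = {
    val n = math.min(a.length, b.length)
    var i = 0
    while (i < n && a(i) == b(i)) i += 1
    if (i < n) (a(i) & 0xFF) - (b(i) & 0xFF) else a.length - b.length
  }

  // Returns record indices ("pointers") in ascending key order.
  def sortedPointers(records: Array[Record]): Array[Int] = {
    // One 8-byte entry per record: prefix in the high 32 bits, index in the low 32 bits.
    val entries: Array[Long] = Array.tabulate(records.length) { i =>
      (keyPrefix(records(i).key).toLong << 32) | i.toLong
    }
    val sorted = entries.sortWith { (x, y) =>
      val px = x >>> 32 // unsigned prefix of x
      val py = y >>> 32 // unsigned prefix of y
      if (px != py) px < py // common case: no dereference needed
      else compareKeys(records(x.toInt).key, records(y.toInt).key) < 0
    }
    sorted.map(_.toInt)
  }
}
```

Because each entry is a single 8-byte word, eight entries fit in a typical 64-byte cache line, which is the point of the "2x cache-line friendly" variant.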
  16. Performance Comparison (chart)
  17. Similar Trick: Direct Cache Access (DCA). Pull out the packet header alongside a pointer to the payload.
  18. CPU Cache Line Sizes (screenshots): My Laptop; My SoftLayer BareMetal
  19. Cache Hits: Sequential vs. Random Access (chart)
  20. Mechanical Sympathy: CPU Cache Lines and Matrix Multiplication
  21. CPU Cache Naïve Matrix Multiplication
        // Dot product of each row of A with each column of B
        for (i <- 0 until numRowsA)
          for (j <- 0 until numColsB)
            for (k <- 0 until numColsA)
              res(i)(j) += matA(i)(k) * matB(k)(j)
      Bad: the inner loop walks matB down a column (a different row on every step), so it does not use the full CPU cache line and pre-fetching is ineffective.
  22. CPU Cache Friendly Matrix Multiplication
        // Transpose B (matBT has numColsB rows and numRowsB columns)
        for (i <- 0 until numColsB)
          for (j <- 0 until numRowsB)
            matBT(i)(j) = matB(j)(i)
        // Modify dot product calculation for B Transpose
        for (i <- 0 until numRowsA)
          for (j <- 0 until numColsB)
            for (k <- 0 until numColsA)
              res(i)(j) += matA(i)(k) * matBT(j)(k)
      Good: full CPU cache line, effective prefetching.
      OLD: res(i)(j) += matA(i)(k) * matB(k)(j). The new version references j before k when indexing B.
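The two loop orders can be put side by side in a self-contained form. This is a minimal sketch rather than the demo code from the talk: the object and method names are made up for illustration, and it assumes non-empty, rectangular matrices stored as `Array[Array[Double]]` (so each row is contiguous, but rows themselves are separate JVM objects).

```scala
object MatrixMultiplySketch {
  type Matrix = Array[Array[Double]]

  // Naive version: the innermost loop reads b(k)(j), touching a different row of b each step.
  def multiplyNaive(a: Matrix, b: Matrix): Matrix = {
    val res = Array.ofDim[Double](a.length, b(0).length)
    for (i <- a.indices; j <- b(0).indices; k <- b.indices)
      res(i)(j) += a(i)(k) * b(k)(j)
    res
  }

  // bt(j)(i) = b(i)(j); bt has as many rows as b has columns.
  def transpose(b: Matrix): Matrix = {
    val bt = Array.ofDim[Double](b(0).length, b.length)
    for (i <- b.indices; j <- b(0).indices) bt(j)(i) = b(i)(j)
    bt
  }

  // Cache-friendly version: both a(i) and bt(j) are scanned sequentially in the inner loop.
  def multiplyCacheFriendly(a: Matrix, b: Matrix): Matrix = {
    val bt = transpose(b)
    val res = Array.ofDim[Double](a.length, bt.length)
    for (i <- a.indices; j <- bt.indices; k <- bt(j).indices)
      res(i)(j) += a(i)(k) * bt(j)(k)
    res
  }
}
```

Both functions return the same product; only the memory access pattern of the inner loop differs. The one-time transpose costs an extra pass over B, which pays off once the matrices are large enough that B no longer fits in cache.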
  23. Instrumenting and Monitoring CPU: use the Linux perf command! http://www.brendangregg.com/blog/2015-11-06/java-mixed-mode-flame-graphs.html
  24. Demo! Compare CPU Naïve & Cache-Friendly Matrix Multiplication
  25. Results of Matrix Multiply Comparison: Naïve Matrix Multiply vs. Cache-Friendly Matrix Multiply
       - Improvement factors shown on the slide: ~27x, ~13x, ~13x, ~2x, ~10x
       - Measured with: perf stat --event L1-dcache-load-misses,L1-dcache-prefetch-misses,LLC-load-misses,LLC-prefetch-misses,cache-misses,stalled-cycles-frontend (JVM run with -XX:-Inline)
       - 55 hp vs. 550 hp
  26. Mechanical Sympathy: CPU Cache Lines and Lock-Free Thread Sync
  27. CPU Cache Naïve Tuple Counters
        object CacheNaiveTupleIncrement {
          var tuple = (0, 0)
          …
          def increment(leftIncrement: Int, rightIncrement: Int): (Int, Int) = {
            this.synchronized {
              tuple = (tuple._1 + leftIncrement, tuple._2 + rightIncrement)
              tuple
            }
          }
        }
  28. CPU Cache Naïve Case Class Counters
        case class MyTuple(left: Int, right: Int)

        object CacheNaiveCaseClassCounters {
          var tuple = new MyTuple(0, 0)
          …
          def increment(leftIncrement: Int, rightIncrement: Int): MyTuple = {
            this.synchronized {
              tuple = new MyTuple(tuple.left + leftIncrement, tuple.right + rightIncrement)
              tuple
            }
          }
        }
  29. CPU Cache Friendly Lock-Free Counters
        import java.util.concurrent.atomic.AtomicLong

        object CacheFriendlyLockFreeCounters {
          // a single Long (8 bytes) maintains 2 separate Ints (4 bytes each)
          val tuple = new AtomicLong()
          …
          def increment(leftIncrement: Int, rightIncrement: Int): Long = {
            var originalLong = 0L
            var updatedLong = 0L
            do {
              originalLong = tuple.get()
              val originalRightInt = originalLong.toInt           // cast originalLong to Int to get right counter
              val originalLeftInt = (originalLong >>> 32).toInt   // shift right to get left counter
              val updatedRightInt = originalRightInt + rightIncrement // increment right counter
              val updatedLeftInt = originalLeftInt + leftIncrement    // increment left counter
              updatedLong = updatedLeftInt                        // start the new long with the left counter
              updatedLong = updatedLong << 32                     // shift the new long left
              updatedLong |= (updatedRightInt & 0xFFFFFFFFL)      // add the right counter; the mask stops a negative value from overwriting the left half
            } while (!tuple.compareAndSet(originalLong, updatedLong))
            updatedLong
          }
        }
      Q: Why not a @volatile Long?
      A: @volatile makes each individual read and write of the 64-bit value atomic and visible, but the read-modify-write increment as a whole is still not atomic. (Without @volatile, the Java Memory Model does not even guarantee atomic writes of 64-bit longs or doubles.)
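A compact, runnable variant of the packed-counter idea is sketched below. It is an illustration only: the class name `PackedCounters` and its helper methods are hypothetical, and it assumes both counters fit in 32 bits each. The only library type used is `java.util.concurrent.atomic.AtomicLong`, whose `compareAndSet` retries until no other thread has changed the value in between.

```scala
import java.util.concurrent.atomic.AtomicLong

final class PackedCounters {
  // High 32 bits hold the left counter, low 32 bits hold the right counter.
  private val packed = new AtomicLong(0L)

  private def pack(left: Int, right: Int): Long =
    (left.toLong << 32) | (right.toLong & 0xFFFFFFFFL) // mask so a negative right value cannot spill into the left half

  private def leftOf(p: Long): Int = (p >>> 32).toInt
  private def rightOf(p: Long): Int = p.toInt

  // Current (left, right) pair, read from a single atomic load.
  def current: (Int, Int) = {
    val p = packed.get()
    (leftOf(p), rightOf(p))
  }

  // Adds to both counters in one atomic step and returns the updated pair.
  def increment(leftIncrement: Int, rightIncrement: Int): (Int, Int) = {
    var updated = 0L
    var done = false
    while (!done) {
      val original = packed.get()
      updated = pack(leftOf(original) + leftIncrement, rightOf(original) + rightIncrement)
      done = packed.compareAndSet(original, updated) // retry if another thread won the race
    }
    (leftOf(updated), rightOf(updated))
  }
}
```

Keeping both counters in one 8-byte word means an update touches a single cache line and allocates nothing, in contrast to the tuple and case-class versions, which create a new object on every increment while holding a lock.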
  30. Demo! Compare CPU Naïve & Cache-Friendly Tuple Counter Sync
  31. Results of Counters Comparison: Naïve Tuple Counters vs. Naïve Case Class Counters vs. Cache Friendly Lock-Free Counters. Improvement factors shown on the slide: ~2x, ~1.5x, ~3.5x, ~2x, ~2x, ~1.5x, ~1.5x, ~1.5x
  32. Profiling Visualizations: Flame Graphs. Example: Spark Word Count, Java stack traces (-XX:+PreserveFramePointer). Plateaus are Bad!!
  33. 100 TB Daytona GraySort Challenge
       - Focus on Network and Disk I/O Optimizations
       - Improve Data Structs/Algos for Sort & Shuffle
       - Saturate Network and Disk Controllers
  34. Winning Results (table comparing the 2013 and 2014 records). Spark Goals: Saturate Network I/O; Saturate Disk I/O
  35. Winning Hardware Configuration
       - Compute
           - 206 Workers, 1 Master (AWS EC2 i2.8xlarge)
           - 32 cores, Intel Xeon CPU E5-2670 @ 2.5 GHz
           - 244 GB RAM, 8 x 800 GB SSD, RAID 0 striping, ext4
           - 3 GBps mixed read/write disk I/O per node
       - Network
           - AWS Placement Groups, VPC, Enhanced Networking
           - Single Root I/O Virtualization (SR-IOV)
           - 10 Gbps, low latency, low jitter (iperf: ~9.5 Gbps)
  36. Winning Software Configuration
       - Spark 1.2, OpenJDK 1.7
       - Disable caching, compression, speculative execution, shuffle spill
       - Force NODE_LOCAL task scheduling for optimal data locality
       - HDFS 2.4.1 short-circuit local reads, 2x replication
       - Empirically chose between 4-6 partitions per CPU core
           - 206 nodes * 32 cores = 6592 cores
           - 6592 cores * 4 = 26,368 partitions
           - 6592 cores * 6 = 39,552 partitions
           - 6592 cores * ~4.25 ≈ 28,000 partitions (empirical best)
       - Range partitioning takes advantage of sequential keyspace
       - Required ~10s of sampling 79 keys from each partition
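The partition arithmetic on this slide can be checked with a few lines of Scala. The object and method names are hypothetical; the node count, cores per node, and the 28,000-partition figure are the ones quoted above. Note that 28,000 / 6592 is about 4.25, so the "4.25" multiplier is a rounded ratio rather than an exact factor.

```scala
object GraySortPartitions {
  val nodes = 206
  val coresPerNode = 32
  val totalCores: Int = nodes * coresPerNode // 206 * 32

  // Partition count for a given number of partitions per core.
  def partitions(perCore: Int): Int = totalCores * perCore

  // Partitions per core implied by a chosen partition count.
  def partitionsPerCore(partitionCount: Int): Double = partitionCount.toDouble / totalCores
}
```

For example, `GraySortPartitions.partitions(4)` and `GraySortPartitions.partitions(6)` give the lower and upper bounds of the range that was explored, and `GraySortPartitions.partitionsPerCore(28000)` gives the ratio for the empirically best setting.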
  37. New Sort Shuffle Manager for Spark 1.2: original “hash-based” vs. new “sort-based”
       1. Uses fewer OS resources (socket buffers, file descriptors)
       2. TimSort partitions in-memory
       3. MergeSort partitions on-disk into a single master file
       4. Serve partitions from the master file: seek once, sequential scan
  38. Asynchronous Network Module
       - Switch to asynchronous Netty vs. synchronous java.nio
           - Switch to zero-copy epoll
           - Use only kernel-space between disk and network controllers
           - Custom memory management
           - spark.shuffle.blockTransferService=netty
       - Spark-Netty Performance Tuning
           - spark.shuffle.io.preferDirectBuffers=true: reuse off-heap buffers
           - spark.shuffle.io.numConnectionsPerPeer=8 (for example): increase to saturate hosts with multiple disks (8 x 800 GB SSD)
       - Details in SPARK-2468
  39. Custom Algorithms and Data Structures, optimized for sort & shuffle workloads
       - o.a.s.util.collection.TimSort[K,V]
           - Based on JDK 1.7 TimSort
           - Performs best with partially-sorted runs
           - Optimized for elements of (K,V) pairs
           - Sorts impl of SortDataFormat (i.e. KVArraySortDataFormat)
       - o.a.s.util.collection.AppendOnlyMap
           - Open addressing hash, quadratic probing
           - Array of [(K, V), (K, V)]
           - Good memory locality
           - Keys never removed, values only append
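To make the AppendOnlyMap bullet points concrete, here is a small open-addressing map in the same spirit. It is a sketch under assumptions, not Spark's implementation: the class name, the 0.7 load factor, and the use of plain `hashCode` are illustrative choices, and null keys are not supported. Keys and values sit next to each other in one flat array, collisions are resolved by probing with a growing step (a form of quadratic probing), and there is no remove operation.

```scala
final class AppendOnlySketchMap[K <: AnyRef, V <: AnyRef](initialCapacity: Int = 64) {
  require(initialCapacity > 0 && (initialCapacity & (initialCapacity - 1)) == 0,
    "capacity must be a power of two")

  private var capacity = initialCapacity
  // Flat layout: key i at index 2*i, its value at index 2*i + 1.
  private var data = new Array[AnyRef](2 * capacity)
  private var count = 0

  def size: Int = count

  // Finds the slot holding `key`, or the first empty slot on its probe sequence.
  private def findSlot(arr: Array[AnyRef], cap: Int, key: AnyRef): Int = {
    val mask = cap - 1
    var pos = key.hashCode & mask
    var delta = 1
    while (arr(2 * pos) != null && arr(2 * pos) != key) {
      pos = (pos + delta) & mask // steps of 1, 2, 3, ... visit every slot when cap is a power of two
      delta += 1
    }
    pos
  }

  // Inserts the key if absent, then sets (or replaces) its value.
  def update(key: K, value: V): Unit = {
    val pos = findSlot(data, capacity, key)
    if (data(2 * pos) == null) {
      data(2 * pos) = key
      count += 1
    }
    data(2 * pos + 1) = value
    if (count > capacity * 0.7) grow()
  }

  def get(key: K): Option[V] = {
    val pos = findSlot(data, capacity, key)
    if (data(2 * pos) == null) None else Some(data(2 * pos + 1).asInstanceOf[V])
  }

  // Doubles the table and re-inserts every existing pair.
  private def grow(): Unit = {
    val newCapacity = capacity * 2
    val newData = new Array[AnyRef](2 * newCapacity)
    var i = 0
    while (i < capacity) {
      if (data(2 * i) != null) {
        val pos = findSlot(newData, newCapacity, data(2 * i))
        newData(2 * pos) = data(2 * i)
        newData(2 * pos + 1) = data(2 * i + 1)
      }
      i += 1
    }
    data = newData
    capacity = newCapacity
  }
}
```

Because nothing is ever removed, the table needs no tombstones, and a lookup stops at the first empty slot it meets.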
  40. Daytona GraySort Challenge Goal: Success
       - 1.1 GBps/node network I/O (Reducers); theoretical max = 1.25 GBps for 10 Gbps Ethernet
       - 3 GBps/node disk I/O (Mappers)
       - Aggregate cluster network I/O: 220 GBps / 206 nodes ~= 1.1 GBps per node
  41. Shuffle Performance Tuning Tips
       - Hash Shuffle Manager (Deprecated)
           - spark.shuffle.consolidateFiles (Mapper)
           - o.a.s.shuffle.FileShuffleBlockResolver
       - Intermediate Files
           - Increase spark.shuffle.file.buffer (Reducer)
           - Increase spark.reducer.maxSizeInFlight if memory allows
       - Use Smaller Number of Larger Executors: many threads (1 per CPU)
           - Minimizes intermediate files and overall shuffle
           - More opportunity for PROCESS_LOCAL
       - SQL: BroadcastHashJoin vs. ShuffledHashJoin
           - spark.sql.autoBroadcastJoinThreshold
           - Use DataFrame.explain(true) or EXPLAIN to verify
  42. Project Tungsten: SPARK-7076 (Spark 1.4)
       - Data Structs & Algos Operate Directly on Byte Arrays
       - Maximize CPU Cache Locality, Minimize GC
       - Utilize Dynamic Code Generation
  43. Quick Review of Project Tungsten Jiras: SPARK-7076 (Spark 1.4)
  44. Why is CPU the Bottleneck?
       - CPU is used for serialization, hashing, compression!
       - Network and Disk I/O bandwidth are relatively high
       - GraySort optimizations improved network & shuffle
       - Partitioning, pruning, and predicate pushdowns
       - Binary, compressed, columnar file formats (Parquet)
  45. Yet Another Spark Shuffle Manager! spark.shuffle.manager =
       - hash (Deprecated)
           - < 10,000 reducers
           - Output partition file hashes the key of the (K,V) pair
           - Mapper creates an output file per partition
           - Leads to M*P output files for all partitions
       - sort (GraySort Challenge)
           - > 10,000 reducers
           - Default from Spark 1.2-1.5
           - Mapper creates a single output file for all partitions
           - Minimizes OS resources; netty + epoll optimizes network I/O, disk I/O, and memory
           - Uses custom data structures and algorithms for the sort-shuffle workload
           - Wins Daytona GraySort Challenge
       - tungsten-sort (Project Tungsten)
           - Default since 1.5
           - Modification of the existing sort-based shuffle
           - Uses sun.misc.Unsafe for self-managed memory and garbage collection
           - Maximizes CPU utilization and cache locality with AlphaSort-inspired binary data structures/algorithms
           - Performs joins, sorts, and other operators on both serialized and compressed byte buffers
  46. CPU & Memory Optimizations
       - Custom Managed Memory: reduces GC overhead; both on and off heap; exact size calculations
       - Direct Binary Processing: operate on serialized/compressed arrays; Kryo can reorder/sort serialized records; LZF can reorder/sort compressed records
       - More CPU Cache-aware Data Structs & Algorithms: o.a.s.sql.catalyst.expressions.UnsafeRow, o.a.s.unsafe.map.BytesToBytesMap
       - Code Generation (default in 1.5): generate source code from overall query plan; 100+ UDFs converted to use code generation
       - Classes shown: UnsafeFixedWidthAggregationMap, TungstenAggregationIterator, CodeGenerator, GenerateUnsafeRowJoiner, UnsafeSortDataFormat, UnsafeShuffleSortDataFormat, PackedRecordPointer, UnsafeRow, UnsafeInMemorySorter, UnsafeExternalSorter, UnsafeShuffleWriter, UnsafeProjection (mostly same join code), UnsafeShuffleManager, UnsafeShuffleInMemorySorter, UnsafeShuffleExternalSorter
       - Details in SPARK-7075
  47. sun.misc.Unsafe
       - Info: addressSize(), pageSize()
       - Objects: allocateInstance(), objectFieldOffset()
       - Classes: staticFieldOffset(), defineClass(), defineAnonymousClass(), ensureClassInitialized()
       - Synchronization: monitorEnter(), tryMonitorEnter(), monitorExit(), compareAndSwapInt(), putOrderedInt()
       - Arrays: arrayBaseOffset(), arrayIndexScale()
       - Memory (used by Tungsten): allocateMemory(), copyMemory(), freeMemory(), getAddress() (not guaranteed after GC), getInt()/putInt(), getBoolean()/putBoolean(), getByte()/putByte(), getShort()/putShort(), getLong()/putLong(), getFloat()/putFloat(), getDouble()/putDouble(), getObjectVolatile()/putObjectVolatile()
  48. Spark + sun.misc.Unsafe: over 200 source files affected!!
       - org.apache.spark.sql.execution: aggregate.SortBasedAggregate, aggregate.TungstenAggregate, aggregate.AggregationIterator, aggregate.udaf, aggregate.utils, SparkPlanner, rowFormatConverters, UnsafeFixedWidthAggregationMap, UnsafeExternalSorter, UnsafeExternalRowSorter, UnsafeKeyValueSorter, UnsafeKVExternalSorter, local.ConvertToUnsafeNode, local.ConvertToSafeNode, local.HashJoinNode, local.ProjectNode, local.LocalNode, local.BinaryHashJoinNode, local.NestedLoopJoinNode, joins.HashJoin, joins.HashSemiJoin, joins.HashedRelation, joins.BroadcastHashJoin, joins.ShuffledHashOuterJoin (not yet converted), joins.BroadcastHashOuterJoin, joins.BroadcastLeftSemiJoinHash, joins.BroadcastNestedLoopJoin, joins.SortMergeJoin, joins.LeftSemiJoinBNL, joins.SortMergeOuterJoin, Exchange, SparkPlan, UnsafeRowSerializer, SortPrefixUtils, sort, basicOperators, aggregate.SortBasedAggregationIterator, aggregate.TungstenAggregationIterator, datasources.WriterContainer, datasources.json.JacksonParser, datasources.jdbc.JDBCRDD
       - org.apache.spark: unsafe.Platform, unsafe.KVIterator, unsafe.array.LongArray, unsafe.array.ByteArrayMethods, unsafe.array.BitSet, unsafe.bitset.BitSetMethods, unsafe.hash.Murmur3_x86_32, unsafe.map.BytesToBytesMap, unsafe.map.HashMapGrowthStrategy, unsafe.memory.TaskMemoryManager, unsafe.memory.ExecutorMemoryManager, unsafe.memory.MemoryLocation, unsafe.memory.UnsafeMemoryAllocator, unsafe.memory.MemoryAllocator (trait/interface), unsafe.memory.MemoryBlock, unsafe.memory.HeapMemoryAllocator, unsafe.sort.RecordComparator, unsafe.sort.PrefixComparator, unsafe.sort.PrefixComparators, unsafe.sort.UnsafeSorterSpillWriter, serializer.DummySerializationInstance, shuffle.unsafe.UnsafeShuffleManager, shuffle.unsafe.UnsafeShuffleSortDataFormat, shuffle.unsafe.SpillInfo, shuffle.unsafe.UnsafeShuffleWriter, shuffle.unsafe.UnsafeShuffleExternalSorter, shuffle.unsafe.PackedRecordPointer, shuffle.ShuffleMemoryManager, util.collection.unsafe.sort.UnsafeSorterSpillMerger, util.collection.unsafe.sort.UnsafeSorterSpillReader, util.collection.unsafe.sort.UnsafeSorterSpillWriter, util.collection.unsafe.sort.UnsafeShuffleInMemorySorter, util.collection.unsafe.sort.UnsafeInMemorySorter, util.collection.unsafe.sort.RecordPointerAndKeyPrefix, util.collection.unsafe.sort.UnsafeSorterIterator, network.shuffle.ExternalShuffleBlockResolver, scheduler.Task, rdd.SqlNewHadoopRDD, executor.Executor
       - org.apache.spark.sql.catalyst.expressions: regexpExpressions, BoundAttribute, SortOrder, SpecializedGetters, ExpressionEvalHelper, UnsafeArrayData, UnsafeReaders, UnsafeMapData, Projection, LiteralGenerator, UnsafeRow, JoinedRow, InputFileName, SpecificMutableRow, codegen.CodeGenerator, codegen.GenerateProjection, codegen.GenerateUnsafeRowJoiner, codegen.GenerateSafeProjection, codegen.GenerateUnsafeProjection, codegen.BufferHolder, codegen.UnsafeRowWriter, codegen.UnsafeArrayWriter, complexTypeCreator, rows, literals, misc, stringExpressions
  49. Traditional Java Object Row Layout (diagrams): 4-byte String; Multi-field Object
  50. Custom Data Structures for Workload
       - UnsafeRow (Dense Binary Row): dense, 8 bytes per field (word-aligned)
       - TaskMemoryManager (Virtual Memory Address): OS-style memory paging
       - BytesToBytesMap (Dense Binary HashMap): AlphaSort-style (Key + Pointer)
  51. UnsafeRow Layout Example (diagram): Pre-Tungsten vs. Tungsten
  52. Custom Memory Management
       - o.a.s.memory.TaskMemoryManager & MemoryConsumer: memory management with virtual memory allocation and paging
           - Off-heap: direct 64-bit address
           - On-heap: 13-bit page num + 27-bit page offset
       - o.a.s.shuffle.sort.PackedRecordPointer: 64-bit word (24-bit partition key, (13-bit page num, 27-bit page offset))
       - o.a.s.unsafe.types.UTF8String: primitive Array[Byte]
       - 2^13 pages * 2^27 page size = 1 TB RAM per Task
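The 24/13/27-bit split described for PackedRecordPointer can be illustrated with plain bit arithmetic. This is a sketch of the layout as stated on the slide, not Spark's source: the object name, method names, and the decision to place the partition id in the highest bits are assumptions made for the example.

```scala
object PackedPointerSketch {
  val PartitionBits = 24
  val PageBits = 13
  val OffsetBits = 27 // 24 + 13 + 27 = 64 bits in total

  // Packs (partition id, page number, offset within page) into one 64-bit word.
  def pack(partitionId: Int, page: Int, offset: Int): Long = {
    require(partitionId >= 0 && partitionId < (1 << PartitionBits), "partition id needs 24 bits or fewer")
    require(page >= 0 && page < (1 << PageBits), "page number needs 13 bits or fewer")
    require(offset >= 0 && offset < (1 << OffsetBits), "page offset needs 27 bits or fewer")
    (partitionId.toLong << (PageBits + OffsetBits)) | (page.toLong << OffsetBits) | offset.toLong
  }

  def partitionId(packed: Long): Int = (packed >>> (PageBits + OffsetBits)).toInt
  def page(packed: Long): Int = ((packed >>> OffsetBits) & ((1L << PageBits) - 1)).toInt
  def offset(packed: Long): Int = (packed & ((1L << OffsetBits) - 1)).toInt

  // 2^13 pages * 2^27 bytes per page = 2^40 bytes addressable per task.
  val addressableBytesPerTask: Long = 1L << (PageBits + OffsetBits)
}
```

With the partition id in the top bits, ordering records by partition only needs to compare the packed words themselves, and the remaining 40 bits locate the record without a separate object reference.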
  53. Aggregations
       - o.a.s.sql.execution.UnsafeFixedWidthAggregationMap
           - Uses BytesToBytesMap
           - In-place updates of serialized data
           - No object creation on the hot path
           - Improved external agg support
           - No OOMs for large, single-key aggs
       - o.a.s.sql.catalyst.expressions.codegen.GenerateUnsafeRowJoiner
           - Combines 2 UnsafeRows into 1
       - o.a.s.sql.execution.aggregate.TungstenAggregate & TungstenAggregationIterator
           - Operates directly on serialized, binary UnsafeRow
           - 2 steps: hash-based agg (grouping), then sort-based agg
           - Supports spilling and external merge sorting
  54. Equality: bitwise comparison on UnsafeRow. No need to calculate equals(), hashCode(). (Diagram: Row 1 equals Row 2!)
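The idea of comparing rows as raw bytes can be shown with ordinary byte arrays. This sketch is not UnsafeRow: the fixed 8-bytes-per-field encoding and all names are assumptions for illustration, and `java.util.Arrays.hashCode` stands in for whatever hash the real implementation uses. The point is that once two rows share one canonical binary encoding, equality is a byte-for-byte comparison with no per-field `equals()` calls.

```scala
import java.nio.ByteBuffer
import java.util.Arrays

object BinaryRowEquality {
  // Hypothetical fixed-width encoding: every field occupies one 8-byte word.
  def encode(fields: Long*): Array[Byte] = {
    val buffer = ByteBuffer.allocate(8 * fields.length)
    fields.foreach(f => buffer.putLong(f))
    buffer.array()
  }

  // Rows are equal exactly when their encoded bytes are identical.
  def rowsEqual(a: Array[Byte], b: Array[Byte]): Boolean = Arrays.equals(a, b)

  // A hash computed over the same bytes stays consistent with rowsEqual.
  def rowHash(row: Array[Byte]): Int = Arrays.hashCode(row)
}
```

This only works if equal logical values always produce identical bytes, which is why a dense, canonical layout matters for this trick.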
  55. Joins: surprisingly, not many code changes. o.a.s.sql.catalyst.expressions.UnsafeProjection converts InternalRow to UnsafeRow.
  56. Sorting
       - o.a.s.util.collection.unsafe.sort: UnsafeSortDataFormat, UnsafeInMemorySorter, UnsafeExternalSorter, RecordPointerAndKeyPrefix, UnsafeShuffleWriter
       - AlphaSort-style, cache friendly: Key-Prefix + Ptr, 2x CPU cache-line friendly!
       - Using multiple subclasses of SortDataFormat simultaneously will prevent JIT inlining. This affects sort & shuffle performance.
       - Supports merging compressed records if the compression codec supports it (LZF)
  57. Spilling
       - Efficient spilling
           - Exact data size is known
           - No need to maintain heuristics & approximations
           - Controls amount of spilling
       - Spill merge on compressed, binary records, if the compression codec supports it
       - Exact peak memory for Spark jobs: UnsafeFixedWidthAggregationMap.getPeakMemoryUsedBytes()
  58. Code Generation
       - Problem
           - Boxing causes excessive object creation
           - Expensive expression tree evals per row
           - JVM can’t inline polymorphic impls
       - Solution
           - Codegen bypasses virtual function calls
           - Defer source code generation to each operator, UDF, UDAF
           - Use Scala quasiquote macros for Scala AST source code gen
           - Rewrite and optimize code for overall plan, 8-byte align, etc.
           - Use Janino to compile generated source code into bytecode
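The contrast between evaluating an expression tree per row and generating source for it can be sketched with a toy expression type. Everything here is hypothetical and far simpler than Catalyst: the `Expr` classes, the `long[] row` convention, and the emitted method shape are made up, and the generated string is only produced, not compiled (the slide names Janino for that step).

```scala
object CodegenSketch {
  sealed trait Expr
  final case class Col(index: Int) extends Expr
  final case class Lit(value: Long) extends Expr
  final case class Add(left: Expr, right: Expr) extends Expr
  final case class Mul(left: Expr, right: Expr) extends Expr

  // Interpreted path: walks the whole tree again for every row.
  def eval(e: Expr, row: Array[Long]): Long = e match {
    case Col(i)    => row(i)
    case Lit(v)    => v
    case Add(l, r) => eval(l, row) + eval(r, row)
    case Mul(l, r) => eval(l, row) * eval(r, row)
  }

  // Codegen path: walks the tree once and emits a flat Java expression.
  def genCode(e: Expr): String = e match {
    case Col(i)    => s"row[$i]"
    case Lit(v)    => s"${v}L"
    case Add(l, r) => s"(${genCode(l)} + ${genCode(r)})"
    case Mul(l, r) => s"(${genCode(l)} * ${genCode(r)})"
  }

  // Wraps the expression in a method body that a runtime compiler could turn into bytecode.
  def genMethod(e: Expr): String =
    s"public long eval(long[] row) { return ${genCode(e)}; }"
}
```

The generated method contains no pattern matching, no virtual dispatch, and no boxed values, which is what lets the JIT compile it into tight straight-line code.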
  59. Spark SQL UDF Code Generation
       - 100+ UDFs now generating code
       - More to come in Spark 1.6+
       - Details in SPARK-8159, SPARK-9571
       - Each implements Expression.genCode()!
  60. Creating a Custom UDF with Codegen
       - Study existing implementations: https://github.com/apache/spark/pull/7214/files
       - Extend base trait: o.a.s.sql.catalyst.expressions.Expression.genCode()
       - Register the function: o.a.s.sql.catalyst.analysis.FunctionRegistry.registerFunction()
       - Augment DataFrame with new UDF (Scala implicits): o.a.s.sql.functions.scala
       - Don’t forget about Python! python.pyspark.sql.functions.py
  61. 61. Click to edit Master text styles Click to edit Master text styles IBM Spark spark.tc Click to edit Master text styles spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Who Benefits from Project Tungsten? Users of DataFrames All Spark SQL Queries Catalyst All RDDs Serialization, Compression, and Aggregations 61
  62. 62. Click to edit Master text styles Click to edit Master text styles IBM Spark spark.tc Click to edit Master text styles spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Project Tungsten Performance Results Query Time Garbage Collection 62 OOM’d on Large Dataset!
  63. Presentation Outline
      - Spark Core: Tuning & Mechanical Sympathy
      - Spark SQL: Query Optimizing & Catalyst
      - Spark Streaming: Scaling & Approximations
      - Spark ML: Featurizing & Recommendations
  64. Spark SQL: Query Optimizing & Catalyst
      - Explore DataFrames/Datasets/DataSources, Catalyst
      - Review Partitions, Pruning, Pushdowns, File Formats
      - Create a Custom DataSource API Implementation
  65. DataFrames
      - Inspired by R and Pandas DataFrames
      - Schema-aware
      - Cross-language support
        - SQL, Python, Scala, Java, R
      - Levels performance of Python, Scala, Java, and R
        - Generates JVM bytecode vs serializing to Python
      - DataFrame is container for logical plan
        - Lazy transformations represented as tree
        - Only logical plan is sent from Python -> JVM
        - Only results returned from JVM -> Python
      - UDF and UDAF support
        - Custom UDF support using registerFunction()
        - Experimental UDAF support (e.g. HyperLogLog)
      - Supports existing Hive metastore if available
        - Small, file-based Hive metastore created if not available
      - DataFrame.rdd returns underlying RDD if needed
      - Use DataFrames instead of RDDs!!
  66. Spark and Hive
      - Early days, Shark was "Hive on Spark"
      - Hive optimizer slowly replaced with Catalyst
      - Always use HiveContext, even if not using Hive!
        - If no Hive, a small Hive metastore file is created
      - Spark 1.5+ supports all Hive versions 0.12+
        - Separate classloaders for isolation
        - Breaks dependency between Spark's internal Hive version and the user's external Hive version
  67. Catalyst Optimizer
      - Optimize DataFrame transformation tree
        - Subquery elimination: use aliases to collapse subqueries
        - Constant folding: replace expression with constant
        - Simplify filters: remove unnecessary filters
        - Predicate/filter pushdowns: avoid unnecessary data load
        - Projection collapsing: avoid unnecessary projections
      - Create custom rules
        - Rules are Scala case classes
        - Implement o.a.s.sql.catalyst.rules.Rule
        - Apply to any plan stage
        - val newPlan = MyFilterRule(analyzedPlan)
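To make "rules are tree transformations" concrete, here is a minimal Python sketch of constant folding on a toy expression tree. It is only in the spirit of a Catalyst rule: the node types (`Literal`, `Attribute`, `Add`) and the `transform_up` helper are hypothetical stand-ins, not Catalyst classes.

```python
# Toy rule-based optimizer: constant folding over an expression tree.
# Assumption: node types and transform_up() are illustrative only.
from collections import namedtuple

Literal = namedtuple("Literal", "value")
Attribute = namedtuple("Attribute", "name")
Add = namedtuple("Add", "left right")


def transform_up(node, rule):
    """Apply `rule` bottom-up: rewrite children first, then the node itself."""
    if isinstance(node, Add):
        node = Add(transform_up(node.left, rule), transform_up(node.right, rule))
    return rule(node)


def constant_folding(node):
    """Replace Add(Literal, Literal) with a single pre-computed Literal."""
    if (isinstance(node, Add)
            and isinstance(node.left, Literal)
            and isinstance(node.right, Literal)):
        return Literal(node.left.value + node.right.value)
    return node


# x + (1 + 2)  ->  x + 3
plan = Add(Attribute("x"), Add(Literal(1), Literal(2)))
optimized = transform_up(plan, constant_folding)
print(optimized)
```

A rule is just a function from a plan node to a (possibly rewritten) plan node, which is why the same rule can be applied at any plan stage.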
  68. DataSources API
      - Relations (o.a.s.sql.sources.interfaces.scala)
        - BaseRelation (abstract class): Provides schema of data
          - TableScan (impl): Read all data from source
          - PrunedFilteredScan (impl): Column pruning & predicate pushdowns
          - InsertableRelation (impl): Insert/overwrite data based on SaveMode
        - RelationProvider (trait/interface): Handle options, BaseRelation factory
      - Execution (o.a.s.sql.execution.commands.scala)
        - RunnableCommand (trait/interface): Common commands like EXPLAIN
          - ExplainCommand (impl: case class)
          - CacheTableCommand (impl: case class)
      - Filters (o.a.s.sql.sources.filters.scala)
        - Filter (abstract class): Handles all predicates/filters supported by this source
          - EqualTo (impl)
          - GreaterThan (impl)
          - StringStartsWith (impl)
  69. Native Spark SQL DataSources
  70. Query Plan Debugging
      gendersCsvDF.select($"id", $"gender").filter("gender != 'F'").filter("gender != 'M'").explain(true)
      DataFrame.queryExecution.logical
      DataFrame.queryExecution.analyzed
      DataFrame.queryExecution.optimizedPlan
      DataFrame.queryExecution.executedPlan
  71. Query Plan Visualization & Metrics
      - Effectiveness of filter
      - CPU cache-friendly binary format
      - Cost-based join optimization
        - Similar to MapReduce map-side join
      - Peak memory for joins and aggs
  72. JSON Data Source
      - DataFrame
          val ratingsDF = sqlContext.read.format("json")
            .load("file:/root/pipeline/datasets/dating/ratings.json.bz2")
          // or, using the json() convenience method
          val ratingsDF = sqlContext.read.json("file:/root/pipeline/datasets/dating/ratings.json.bz2")
      - SQL
          CREATE TABLE genders USING json
          OPTIONS (path "file:/root/pipeline/datasets/dating/genders.json.bz2")
  73. JDBC Data Source
      - Add driver to Spark JVM system classpath
          $ export SPARK_CLASSPATH=<jdbc-driver.jar>
      - DataFrame
          val jdbcConfig = Map("driver" -> "org.postgresql.Driver",
            "url" -> "jdbc:postgresql://hostname:port/database",
            "dbtable" -> "schema.tablename")
          sqlContext.read.format("jdbc").options(jdbcConfig).load()
      - SQL
          CREATE TABLE genders USING jdbc
          OPTIONS (url, dbtable, driver, …)
  74. Parquet Data Source
      - Configuration
          spark.sql.parquet.filterPushdown=true
          spark.sql.parquet.mergeSchema=true
          spark.sql.parquet.cacheMetadata=true
          spark.sql.parquet.compression.codec=[uncompressed,snappy,gzip,lzo]
      - DataFrames
          val gendersDF = sqlContext.read.format("parquet")
            .load("file:/root/pipeline/datasets/dating/genders.parquet")
          gendersDF.write.format("parquet").partitionBy("gender")
            .save("file:/root/pipeline/datasets/dating/genders.parquet")
      - SQL
          CREATE TABLE genders USING parquet
          OPTIONS (path "file:/root/pipeline/datasets/dating/genders.parquet")
  75. ORC Data Source
      - Configuration
          spark.sql.orc.filterPushdown=true
      - DataFrames
          val gendersDF = sqlContext.read.format("orc")
            .load("file:/root/pipeline/datasets/dating/genders")
          gendersDF.write.format("orc").partitionBy("gender")
            .save("file:/root/pipeline/datasets/dating/genders")
      - SQL
          CREATE TABLE genders USING orc
          OPTIONS (path "file:/root/pipeline/datasets/dating/genders")
  76. Third-Party Spark SQL DataSources
      - spark-packages.org
  77. CSV DataSource (Databricks)
      - GitHub: https://github.com/databricks/spark-csv
      - Maven: com.databricks:spark-csv_2.10:1.2.0
      - Code
          val gendersCsvDF = sqlContext.read
            .format("com.databricks.spark.csv")
            .load("file:/root/pipeline/datasets/dating/gender.csv.bz2")
            .toDF("id", "gender")
      - toDF() is required if the CSV does not contain a header
  78. ElasticSearch DataSource (Elastic.co)
      - GitHub: https://github.com/elastic/elasticsearch-hadoop
      - Maven: org.elasticsearch:elasticsearch-spark_2.10:2.1.0
      - Code
          val esConfig = Map("pushdown" -> "true", "es.nodes" -> "<hostname>",
            "es.port" -> "<port>")
          df.write.format("org.elasticsearch.spark.sql").mode(SaveMode.Overwrite)
            .options(esConfig).save("<index>/<document-type>")
  79. Elasticsearch Tips
      - Change id field to not_analyzed to avoid indexing
      - Use term filter to build and cache the query
      - Perform multiple aggregations in a single request
      - Adapt scoring function to current trends at query time
  80. AWS Redshift Data Source (Databricks)
      - GitHub: https://github.com/databricks/spark-redshift
      - Maven: com.databricks:spark-redshift:0.5.0
      - Code
          val df: DataFrame = sqlContext.read
            .format("com.databricks.spark.redshift")
            .option("url", "jdbc:redshift://<hostname>:<port>/<database>…")
            .option("query", "select x, count(*) from my_table group by x")
            .option("tempdir", "s3n://tmpdir")
            .load(...)
      - UNLOAD and copy to tmp bucket in S3 enables parallel reads
  81. DB2 and BigSQL DataSources (IBM)
      - Coming soon!
  82. Cassandra DataSource (DataStax)
      - GitHub: https://github.com/datastax/spark-cassandra-connector
      - Maven: com.datastax.spark:spark-cassandra-connector_2.10:1.5.0-M1
      - Code
          ratingsDF.write
            .format("org.apache.spark.sql.cassandra")
            .mode(SaveMode.Append)
            .options(Map("keyspace"->"<keyspace>", "table"->"<table>")).save(…)
  83. Cassandra Pushdown Support
      - spark-cassandra-connector/…/o.a.s.sql.cassandra.PredicatePushDown.scala
      - Pushdown predicate rules
        1. Only push down non-partition-key column predicates with =, >, <, >=, or <=.
        2. Only push down primary key column predicates with = or IN.
        3. If there are regular columns in the pushdown predicates, they should have at least one EQ expression on an indexed column and no IN predicates.
        4. All partition column predicates must be included in the predicates to be pushed down; only the last part of the partition key can be an IN predicate. For each partition column, only one predicate is allowed.
        5. For cluster column predicates, only the last predicate can be a non-EQ predicate (including IN), and preceding column predicates must be EQ predicates. If there is only one cluster column predicate, it can be any non-IN predicate.
        6. No predicates are pushed down if there is any OR condition or NOT IN condition.
        7. Multiple predicates for the same column cannot be pushed down if any of them is an equality or IN predicate.
  84. New Cassandra DataSource
      - Bypass CQL, which is optimized for transactional data
      - Instead, do bulk reads/writes directly on SSTables
      - Similar to the 5-year-old Netflix open source project Aegisthus
      - Promotes Cassandra to first-class analytics option
      - Potentially only part of DataStax Enterprise?!
      - Please mail a nasty letter to your local DataStax office
  85. Rumor of REST DataSource (Databricks)
      - Coming soon?
      - Ask Michael Armbrust, Spark SQL Lead @ Databricks
  86. Custom DataSource (Me and You!)
      - Coming right now!
      - DEMO ALERT!!
  87. Create a Custom DataSource
      - Study existing native & third-party data sources
      - Native: Spark JDBC (o.a.s.sql.execution.datasources.jdbc)
          class JDBCRelation extends BaseRelation
            with PrunedFilteredScan
            with InsertableRelation
      - Third-party: DataStax Cassandra (o.a.s.sql.cassandra)
          class CassandraSourceRelation extends BaseRelation
            with PrunedFilteredScan
            with InsertableRelation
  88. Demo! Create a Custom DataSource
  89. Contribute a Custom Data Source
      - spark-packages.org
        - Managed by (logo not captured in transcript)
        - Contains links to external GitHub projects
        - Ratings and comments
        - Declare Spark version support for each package
      - Examples
        - https://github.com/databricks/spark-csv
        - https://github.com/databricks/spark-avro
        - https://github.com/databricks/spark-redshift
  90. Parquet Columnar File Format
      - Based on Google Dremel
      - Collaboration with Twitter and Cloudera
      - Self-describing, evolving schema
      - Fast columnar aggregation
      - Supports filter pushdowns
      - Columnar storage format
      - Excellent compression
  91. Types of Compression
      - Run Length Encoding: Repeated data
      - Dictionary Encoding: Fixed set of values
      - Delta, Prefix Encoding: Sorted data
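The three encodings on the Types of Compression slide are easy to see on small columns. The following is a minimal Python sketch of each idea; it is not Parquet's on-disk encoding, and the function names are made up for illustration.

```python
# Toy versions of three columnar encodings (not Parquet's actual formats).
from itertools import groupby


def run_length_encode(values):
    """Repeated data: store each run as (value, run_length)."""
    return [(value, len(list(run))) for value, run in groupby(values)]


def dictionary_encode(values):
    """Fixed set of values: store a small dictionary plus integer codes."""
    dictionary = sorted(set(values))
    code_of = {value: code for code, value in enumerate(dictionary)}
    return dictionary, [code_of[value] for value in values]


def delta_encode(sorted_values):
    """Sorted data: store the first value, then the small differences."""
    if not sorted_values:
        return []
    deltas = [sorted_values[0]]
    for previous, current in zip(sorted_values, sorted_values[1:]):
        deltas.append(current - previous)
    return deltas


print(run_length_encode(["F", "F", "F", "M", "M"]))   # [('F', 3), ('M', 2)]
print(dictionary_encode(["M", "F", "U", "F"]))        # (['F', 'M', 'U'], [1, 0, 2, 0])
print(delta_encode([1000, 1003, 1004, 1010]))         # [1000, 3, 1, 6]
```

Each encoding pays off only when the column matches its assumption, which is why sorting or partitioning data before writing can change the compression ratio considerably.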
  92. Demo! Demonstrate File Formats, Partition Schemes, and Query Plans
  93. Hive JDBC ODBC ThriftServer
      - Allow BI tools to query and process Spark data
      - Register permanent table
          CREATE TABLE ratings(fromuserid INT, touserid INT, rating INT)
          USING org.apache.spark.sql.json
          OPTIONS (path "datasets/dating/ratings.json.bz2")
      - Register temp table
          ratingsDF.registerTempTable("ratings_temp")
      - Configuration
          spark.sql.thriftServer.incrementalCollect=true
          spark.driver.maxResultSize: set above 10gb (Spark's default is 1g)
  94. Demo! Query and Process Spark Data from BI Tools
  95. Presentation Outline
      - Spark Core: Tuning & Mechanical Sympathy
      - Spark SQL: Query Optimizing & Catalyst
      - Spark Streaming: Scaling & Approximations
      - Spark ML: Featurizing & Recommendations
  96. Spark Streaming: Scaling & Approximations
      - Discuss Delivery Guarantees, Parallelism, and Stability
      - Compare Receiver and Receiver-less Impls
      - Demonstrate Stream Approximations
  97. Non-Parallel Receiver Implementation
  98. Receiver Implementation (Kinesis)
      - KinesisRDD partitions store relevant offsets
      - Single receiver required to see all data/offsets
      - Kinesis offsets not deterministic like Kafka
      - Partitions rebuild from Kinesis using offsets
      - No Write Ahead Log (WAL) needed
      - Optimizes happy path by avoiding the WAL
      - At least once delivery guarantee
  99. Parallel Receiver-less Implementation (Kafka)
  100. Receiver-less Implementation (Kafka)
       - KafkaRDD partitions store relevant offsets
       - Each partition acts as a Receiver
       - Tasks/Executors pull from Kafka in parallel
       - Partitions rebuild from Kafka using offsets
       - No Write Ahead Log (WAL) needed
       - Optimizes happy path by avoiding the WAL
       - At least once delivery guarantee
  101. Maintain Stability of Stream Processing
       - Rate limiting
         - Since Spark 1.2
         - Fixed limit on number of messages per second
         - Potential to drop messages on the floor
       - Back pressure
         - Since Spark 1.5 (Typesafe contribution)
         - More dynamic than rate limiting
         - Push back on reliable, buffered source (Kafka, Kinesis)
         - Fundamentals of control theory and observability
  102. Streaming Approximations
       - HyperLogLog and CountMin Sketch
  103. HyperLogLog (HLL) Approx Distinct Count
       - Approximate count distinct
       - Twitter's Algebird
       - Better than HashSet
       - Low, fixed memory
         - Only 1.5K, 2% error, 10^9 counts (tunable)
         - Redis HLL: 12K per key, 0.81% error, 2^64 counts
       - Spark's countApproxDistinctByKey()
       - Streaming example in Spark codebase
       - http://research.neustar.biz/
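The memory and error figures on the HLL slide follow from the usual rule of thumb that the standard error is about 1.04/sqrt(m) for m registers: 2048 six-bit registers take about 1.5 KB for roughly 2.3% error, and 16384 registers take 12 KB for about 0.81%. The sketch below is a minimal, unoptimized HyperLogLog in plain Python to show the mechanics. It is not Algebird's or Spark's implementation, and the class name, the SHA-1 hash choice, and the `p` parameter are assumptions made for this example.

```python
# Minimal HyperLogLog sketch (illustrative, not Algebird's or Spark's code).
import hashlib
import math


class HyperLogLog:
    def __init__(self, p=11):
        self.p = p                      # number of index bits
        self.m = 1 << p                 # number of registers (2048 for p=11)
        self.registers = [0] * self.m
        # Bias-correction constant, valid for m >= 128.
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item):
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")          # 64-bit hash
        index = h >> (64 - self.p)                     # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)          # remaining 64-p bits
        # Rank = position of the leftmost 1-bit in `rest` (1-based).
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[index] = max(self.registers[index], rank)

    def count(self):
        harmonic = sum(2.0 ** -r for r in self.registers)
        estimate = self.alpha * self.m * self.m / harmonic
        zeros = self.registers.count(0)
        if estimate <= 2.5 * self.m and zeros > 0:
            # Small-range correction: fall back to linear counting.
            estimate = self.m * math.log(self.m / zeros)
        return estimate


hll = HyperLogLog(p=11)
for _ in range(2):                      # duplicates must not change the count
    for i in range(50000):
        hll.add("user-%d" % i)

estimate = hll.count()
print("estimated distinct users:", round(estimate))
```

The sketch holds 2048 small integers no matter how many items are added, which is the "low, fixed memory" property that a HashSet cannot offer.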
  104. CountMin Sketch (CMS) Approx Count
       - Approximate count
       - Twitter's Algebird
       - Better than HashMap
       - Low, fixed memory
       - Known error bounds
       - Large num counters
       - Streaming example in Spark codebase
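A Count-Min Sketch keeps a small grid of counters instead of one counter per key. The following is a minimal Python sketch of the idea, not Algebird's implementation; the class name, the MD5-based row hashing, and the default `width` and `depth` are assumptions chosen for readability.

```python
# Minimal Count-Min Sketch (illustrative, not Algebird's code).
import hashlib


class CountMinSketch:
    def __init__(self, width=1000, depth=5):
        self.width = width              # counters per row
        self.depth = depth              # number of independent hash rows
        self.table = [[0] * width for _ in range(depth)]

    def _buckets(self, item):
        """Yield one (row, column) pair per hash row for this item."""
        for row in range(self.depth):
            key = ("%d:%s" % (row, item)).encode("utf-8")
            digest = hashlib.md5(key).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        for row, col in self._buckets(item):
            self.table[row][col] += count

    def estimate(self, item):
        # Collisions only ever inflate counters, so the minimum across rows
        # is the tightest available upper bound on the true count.
        return min(self.table[row][col] for row, col in self._buckets(item))


cms = CountMinSketch()
for _ in range(100):
    cms.add("user-a")
for _ in range(5):
    cms.add("user-b")

print(cms.estimate("user-a"), cms.estimate("user-b"), cms.estimate("never-seen"))
```

Estimates can overshoot but never undershoot the true count, and memory stays fixed at `width * depth` counters regardless of how many distinct keys arrive.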
  105. Demo! Using HLL and CMS for Streaming Count Approximations
  106. Monte Carlo Simulations
       - From the Manhattan Project (atomic bomb)
         - Simulate movement of neutrons
       - Law of Large Numbers (LLN)
         - Average of results of many trials converges on expected value
       - SparkPi example in Spark codebase
         - 1 argument: # of trials
         - Pi ~= (# red dots / # total dots) * 4
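The Pi estimate on the Monte Carlo slide fits in a few lines on a single machine. This is a minimal Python sketch of the same idea, not the SparkPi source: it throws random points at the unit square and counts how many land inside the quarter circle (the "red dots"). The function name and the fixed seed are assumptions made for reproducibility.

```python
# Single-machine Monte Carlo estimate of Pi (illustrative, not SparkPi itself).
import math
import random


def estimate_pi(num_trials, seed=42):
    rng = random.Random(seed)           # seeded so repeated runs agree
    inside = 0
    for _ in range(num_trials):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:        # point fell inside the quarter circle
            inside += 1
    # Quarter-circle area / square area = Pi / 4
    return 4.0 * inside / num_trials


for trials in (1000, 100000):
    approx = estimate_pi(trials)
    print(trials, approx, abs(approx - math.pi))
```

By the Law of Large Numbers the error shrinks as the number of trials grows, roughly in proportion to 1/sqrt(trials), and the trials are independent, so they split cleanly across partitions.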
  107. Demo! Using a Monte Carlo Simulation to Estimate Pi
  108. Streaming Best Practices
       - Get data out of streaming ASAP
         - Processing interval may exceed batch interval
         - Leads to unstable streaming system
       - Please don't…
         - Use updateStateByKey() like an in-memory DB
         - Put streaming jobs on the request/response hot path
       - Use separate jobs for different batch intervals
         - Small batch interval: Store raw data (Redis, Cassandra, etc.)
         - Medium batch interval: Transform, join, process data
         - High batch interval: Model training
       - Gotchas
         - Tune streamingContext.remember()
       - Use approximations!!
  109. Presentation Outline
       - Spark Core: Tuning & Mechanical Sympathy
       - Spark SQL: Query Optimizing & Catalyst
       - Spark Streaming: Scaling & Approximations
       - Spark ML: Featurizing & Recommendations
  110. Spark ML: Featurizing & Recommendations
       - Understand Similarity and Dimension Reduction
       - Demonstrate Sampling and Bucketing
       - Generate Recommendations
  111. Live, Interactive Demo!
       - sparkafterdark.com
  112. Audience Participation Needed!!
       - Audience instructions
         - Navigate to sparkafterdark.com
         - Click 3 actresses and 3 actors
         - Wait for us to analyze together!
         - Note: This is totally anonymous!!
       - Project links
         - https://github.com/fluxcapacitor/pipeline
         - https://hub.docker.com/r/fluxcapacitor
       - Diagram callout: You are here
  113. Similarity
  114. Types of Similarity
       - Euclidean
         - Linear-based measure
         - Suffers from magnitude bias
       - Cosine
         - Angle-based measure
         - Adjusts for magnitude bias
       - Jaccard
         - Set intersection / union
         - Suffers from popularity bias
       - Log Likelihood
         - Netflix "Shawshank" problem
         - Adjusts for popularity bias
       - Example like matrix (column positions of the 1s were lost in extraction)
         - Columns: Ali, Matei, Reynold, Patrick, Andy
         - Rows: Kimberly (4 likes), Leslie (2), Meredith (3), Lisa (3), Holden (5)
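The magnitude-bias point is easiest to see with two vectors that point the same way but differ in scale. Below is a minimal Python sketch of three of the measures from the slide (log likelihood is left out for brevity). The function names and sample data are made up for this example and are not MLlib APIs.

```python
# Euclidean, cosine, and Jaccard measures on toy data (not MLlib code).
import math


def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def jaccard_similarity(a, b):
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / float(len(union)) if union else 0.0


# Same taste, different activity level: a light user and a heavy user.
light_user = [1, 2]
heavy_user = [10, 20]
print(euclidean_distance(light_user, heavy_user))   # large: magnitude dominates
print(cosine_similarity(light_user, heavy_user))    # ~1.0: same direction

# Jaccard works on sets of liked items: 1 shared item out of 4 distinct.
print(jaccard_similarity({"Ali", "Matei"}, {"Matei", "Reynold", "Patrick"}))  # 0.25
```

Euclidean distance treats the two users as far apart even though their preferences are proportional, while cosine similarity ignores the scale difference.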
  115. All-Pairs Similarity Comparison
       - Compare everything to everything
         - aka "pair-wise similarity" or "similarity join"
       - Naïve shuffle: O(m*n^2); m=rows, n=cols
       - Minimize shuffle through approximations!
         - Reduce m (rows)
           - Sampling and bucketing
         - Reduce n (cols)
           - Remove most frequent value (e.g. 0)
           - Principal Component Analysis
       - Dimension reduction!!
  116. Dimension Reduction
       - Sampling and Bucketing
  117. Reduce m: DIMSUM Sampling
       - "Dimension Independent Matrix Square Using MR"
       - Remove rows with low similarity probability
       - MLlib: RowMatrix.columnSimilarities(…)
       - Twitter: 40% efficiency gain vs. Cosine Similarity
  118. Reduce m: LSH Bucketing
       - "Locality Sensitive Hashing"
       - Split m into b buckets
         - Use similarity hash algorithm
         - Requires pre-processing of data
       - Parallel compare bucket contents
       - O(m*n^2) -> O(m*n/b*b^2); m=rows, n=cols, b=buckets
       - e.g. 500k x 500k matrix: O(1.25e17) -> O(1.25e13); b=50
       - github.com/mrsqueeze/spark-hash
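One common way to realize "split into buckets with a similarity hash" is MinHash with banding, which suits Jaccard similarity. The slide does not say which hash the demo uses, so treat the following Python sketch as an assumption-laden illustration rather than the spark-hash implementation; the function names and the `num_hashes` / `rows_per_band` parameters are hypothetical.

```python
# MinHash + banding LSH sketch (one possible similarity hash; illustrative only).
import hashlib
from collections import defaultdict
from itertools import combinations


def minhash_signature(items, num_hashes=20):
    """One minimum per seeded hash function; `items` must be non-empty."""
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(
                hashlib.md5(("%d:%s" % (seed, item)).encode("utf-8")).digest()[:8],
                "big")
            for item in items))
    return tuple(signature)


def lsh_candidate_pairs(named_sets, num_hashes=20, rows_per_band=2):
    """Bucket each set by every band of its signature; only sets sharing a
    bucket become candidate pairs for the expensive exact comparison."""
    buckets = defaultdict(set)
    for name, items in named_sets.items():
        signature = minhash_signature(items, num_hashes)
        for start in range(0, num_hashes, rows_per_band):
            band_key = (start, signature[start:start + rows_per_band])
            buckets[band_key].add(name)
    pairs = set()
    for names in buckets.values():
        pairs.update(combinations(sorted(names), 2))
    return pairs


likes = {
    "u1": {"Ali", "Matei", "Reynold"},
    "u2": {"Ali", "Matei", "Reynold"},   # identical to u1
    "u3": {"Patrick", "Andy"},           # shares nothing with u1/u2
}
pairs = lsh_candidate_pairs(likes)
print(pairs)
```

Identical sets always land in the same buckets, and sets with higher Jaccard similarity are more likely to collide in at least one band, so most dissimilar pairs are never compared.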
  119. Reduce n: Remove Most Frequent Value
       - Eliminate most-frequent value
       - Represent other values with (index,value) pairs
       - Converts O(m*n^2) -> O(m*nnz^2); nnz=num nonzeros, nnz << n
       - Note: Choose most frequent value (may not be 0)
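The (index,value) representation can be sketched in a few lines of Python. This is an illustration of the idea only, not MLlib's SparseVector; the function names are made up, and the row is assumed to be non-empty.

```python
# Sparse (index, value) encoding around the most frequent value (illustrative).
from collections import Counter


def to_sparse(row):
    """Drop the most frequent value; keep (index, value) for everything else."""
    most_frequent, _ = Counter(row).most_common(1)[0]
    pairs = [(i, v) for i, v in enumerate(row) if v != most_frequent]
    return most_frequent, len(row), pairs


def to_dense(most_frequent, length, pairs):
    """Rebuild the original row from the sparse form."""
    row = [most_frequent] * length
    for i, v in pairs:
        row[i] = v
    return row


ratings = [0, 0, 5, 0, 0, 0, 3, 0]
print(to_sparse(ratings))               # (0, 8, [(2, 5), (6, 3)])

# The most frequent value does not have to be 0.
defaults = [7, 7, 7, 1, 7]
print(to_sparse(defaults))              # (7, 5, [(3, 1)])
```

Pairwise comparisons then only touch the stored pairs, which is where the drop from n to nnz in the complexity estimate comes from.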
  120. Recommendations
       - Summary Statistics and Top-K Historical Analysis
       - Collaborative Filtering and Clustering
       - Text Featurization and NLP
  121. Types of Recommendations
       - Non-personalized
         - No preference or behavior data for user, yet
         - aka "Cold Start Problem"
       - Personalized
         - User-Item Similarity: items that others with similar prefs have liked
         - Item-Item Similarity: items similar to your previously-liked items
  122. Recommendation Terminology
       - Feedback
         - Explicit: like, rating
         - Implicit: search, click, hover, view, scroll
       - Feature engineering
         - Dimension reduction, polynomial expansion
       - Hyper-parameter tuning
         - K-folds cross validation, grid search
       - Pipelines/workflows
         - Chaining together Transformers and Estimators
  123. Single Machine ML Algorithms
       - Stay local, distribute as needed
       - Helps migration of existing single-node algos to Spark
       - Convert between Spark and Pandas DataFrames
       - New "pdspark" package: integration w/ scikit-learn, R
  124. Non-Personalized Recommendations
       - Use Aggregate Data to Generate Recommendations
  125. ① Top Users by Like Count
       - "I might like users who have the most likes overall based on historical data."
       - SparkSQL, DataFrames: Summary Stats, Aggs
  126. ② Top Influencers by Like Graph
       - "I might like the most-influential users in the overall like graph."
       - GraphX: PageRank
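To show what "most influential in the like graph" means, here is a minimal power-iteration PageRank in plain Python. It is a textbook version for illustration, not GraphX's implementation (which scales ranks differently); the function name, the damping factor of 0.85, and the sample edges are assumptions.

```python
# Textbook PageRank by power iteration on a tiny "like" graph (not GraphX code).

def pagerank(edges, damping=0.85, iterations=100):
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        # Everyone gets the same small "teleport" share to start with.
        new_rank = {node: (1.0 - damping) / len(nodes) for node in nodes}
        for src in nodes:
            # A user who likes nobody spreads their rank evenly over everyone.
            targets = out_links[src] or nodes
            share = damping * rank[src] / len(targets)
            for dst in targets:
                new_rank[dst] += share
        rank = new_rank
    return rank


# (liker, liked): a, b, and d all like c; c likes a back.
likes = [("a", "c"), ("b", "c"), ("c", "a"), ("d", "c")]
ranks = pagerank(likes)
for user, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(user, round(score, 3))
```

User `c` ranks first because three users like `c`, and `a` comes second because it is liked by the most influential user, which a plain like count would not capture.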
  127. Demo! Generate Non-Personalized Recommendations
  128. Personalized Recommendations
       - Understand Similarity and Personalized Recommendations
  129. ③ Like Behavior of Similar Users
       - "I like the same people that you like. What other people did you like that I haven't seen?"
       - MLlib: Matrix Factorization, User-Item Similarity
  130. Demo! Generate Personalized Recommendations using Collaborative Filtering & Matrix Factorization
  131. ④ Similar Text-based Profiles as Me
       - "Our profiles have similar keywords and named entities. We might like each other!"
       - MLlib: Word2Vec, TF/IDF, k-skip n-grams
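The keyword-matching idea behind text-based profile similarity can be sketched with TF-IDF vectors and cosine similarity. This is a minimal plain-Python illustration, not the MLlib pipeline used in the demo; the whitespace tokenizer, the smoothed IDF formula log((N+1)/(df+1)), the function names, and the sample profiles are all assumptions.

```python
# TF-IDF vectors + cosine similarity for short profile texts (illustrative).
import math
from collections import Counter


def tf_idf_vectors(docs):
    tokenized = [doc.lower().split() for doc in docs]
    num_docs = len(tokenized)
    # Document frequency: in how many profiles does each term appear?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        vectors.append({
            term: count * math.log((num_docs + 1.0) / (doc_freq[term] + 1.0))
            for term, count in term_freq.items()
        })
    return vectors


def cosine(u, v):
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0


profiles = [
    "hiking coffee spark",
    "hiking coffee travel",
    "opera chess wine",
]
me, similar, different = tf_idf_vectors(profiles)
print(cosine(me, similar), cosine(me, different))
```

Rare shared terms contribute more to the score than common ones, which is the property TF-IDF adds over simple keyword overlap.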
  132. ⑤ Similar Profiles to Previous Likes
       - "Your profile text has similar keywords and named entities to other profiles of people I like. I might like you, too!"
       - MLlib: Word2Vec, TF/IDF, Doc Similarity
  133. ⑥ Relevant, High-Value Emails
       - "Your initial email references a lot of things in my profile. I might like you for making the effort!"
       - MLlib: Word2Vec, TF/IDF, Entity Recognition
       - Figure labels: Her Email, My Profile
  134. Demo! Feature Engineering for Text/NLP Use Cases
  135. 135. Click to edit Master text styles Click to edit Master text styles IBM Spark spark.tc Click to edit Master text styles The Future of Recommendations 135
  136. 136. Click to edit Master text styles Click to edit Master text styles IBM Spark spark.tc Click to edit Master text styles spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark   Eigenfaces: Facial Recognition “Your face looks similar to others that I’ve liked.
 I might like you.” MLlib: RowMatrix, PCA, Item-Item Similarity 136 Image courtesy of http://crockpotveggies.com/2015/02/09/automating-tinder-with-eigenfaces.html
  137. NLP Conversation Starter Bot!
 “If your responses to my generic opening lines are positive, I may read your profile.”
 MLlib: TF/IDF, DecisionTrees, Sentiment Analysis
 Figure labels: Positive, Negative
  138. Maintaining the Spark
  139. Recommendations for Couples
 “I want Mad Max. You want Message In a Bottle. Let’s find something in between to watch tonight.”
 MLlib: RowMatrix, Item-Item Similarity
 GraphX: Nearest Neighbors, Shortest Path
 Figure labels: similar plots, similar actors
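 One way to read "something in between" is as an item-item similarity problem: pick the candidate whose weaker similarity to either partner's choice is as high as possible. The sketch below shows that reading in plain Python. Everything in it is assumed for illustration: the three genre-style feature scores, the candidate labels, and the helper name `best_compromise` are made up, and the slide's GraphX nearest-neighbor and shortest-path approach is not reproduced here.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_compromise(pick_a, pick_b, candidates):
    """Return the candidate name maximizing min(sim to pick_a, sim to pick_b)."""
    return max(
        candidates,
        key=lambda name: min(
            cosine(candidates[name], pick_a),
            cosine(candidates[name], pick_b),
        ),
    )


# Made-up feature scores in the order [action, romance, drama].
my_pick = [0.9, 0.1, 0.3]    # stands in for "Mad Max"
your_pick = [0.1, 0.9, 0.6]  # stands in for "Message In a Bottle"

candidates = {
    "Action-only pick": [1.0, 0.0, 0.1],
    "Romance-only pick": [0.0, 1.0, 0.4],
    "Romantic adventure pick": [0.6, 0.6, 0.5],
}

tonight = best_compromise(my_pick, your_pick, candidates)
```

 Maximizing the minimum similarity keeps either partner from being left with a title far from their own choice. Averaging the two similarities would be another reasonable choice, with a different trade-off.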
  140. Final Recommendation!
  141. Get Off the Computer & Meet People!
 Thank you, Helsinki!!
 Chris Fregly, @cfregly
 IBM Spark Technology Center, San Francisco, CA, USA
 Relevant Links
 advancedspark.com: Sign up for the book & global meetup!
 github.com/fluxcapacitor/pipeline: Clone, contribute, and commit code!
 hub.docker.com/r/fluxcapacitor/pipeline/wiki: Run all demos in your own environment with Docker!
  142. More Relevant Links
 http://meetup.com/Advanced-Apache-Spark-Meetup
 http://advancedspark.com
 http://github.com/fluxcapacitor/pipeline
 http://hub.docker.com/r/fluxcapacitor/pipeline
 http://sortbenchmark.org/ApacheSpark2014.pdf
 https://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html
 http://0x0fff.com/spark-architecture-shuffle/
 http://www.cs.berkeley.edu/~kubitron/courses/cs262a-F13/projects/reports/project16_report.pdf
 http://stackoverflow.com/questions/763262/how-does-one-write-code-that-best-utilizes-the-cpu-cache-to-improve-performance
 http://www.aristeia.com/TalkNotes/ACCU2011_CPUCaches.pdf
 http://mishadoff.com/blog/java-magic-part-4-sun-dot-misc-dot-unsafe/
 http://docs.scala-lang.org/overviews/quasiquotes/intro.html
 http://lwn.net/Articles/252125/ (Memory Part 2: CPU Caches)
 http://lwn.net/Articles/255364/ (Memory Part 5: What Programmers Can Do)
 https://www.safaribooksonline.com/library/view/java-performance-the/9781449363512/ch04.html
 http://web.eece.maine.edu/~vweaver/projects/perf_events/perf_event_open.html
 http://www.brendangregg.com/perf.html
 https://perf.wiki.kernel.org/index.php/Tutorial
 http://techblog.netflix.com/2015/07/java-in-flames.html
 http://techblog.netflix.com/2015/04/introducing-vector-netflixs-on-host.html
 http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html#Java
  143. What’s Next?
  144. What’s Next?
 Autoscaling Spark Workers
   Completely Docker-based
   Docker Compose and Docker Machine
 Lots of Demos and Examples!
   Zeppelin & IPython/Jupyter notebooks
   Advanced streaming use cases
   Advanced ML, Graph, and NLP use cases
 Performance Tuning and Profiling
   Work closely with Brendan Gregg & Netflix
   Surface & share more low-level details of Spark internals