
Zurich, Berlin, Vienna Spark and Big Data Meetup Nov 02 2015

Zurich, Berlin, Vienna Spark Meetup Nov 02 2015

* Title *

Spark After Dark 1.5: Real-time, Advanced Analytics with Spark 1.5, Kafka, Cassandra, ElasticSearch, Zeppelin, and Docker

* Abstract *

Combining the most popular and most technically deep material from his wildly popular Advanced Apache Spark Meetup, Chris Fregly will provide code-level deep dives into the latest performance and scalability advancements within the Apache Spark ecosystem by exploring the following:

1) Building a Scalable and Performant Spark SQL/DataFrames Data Source Connector such as Spark-CSV, Spark-Cassandra, Spark-ElasticSearch, and Spark-Redshift

2) Speeding Up Spark SQL Queries using Partition Pruning and Predicate Pushdowns with CSV, JSON, Parquet, Avro, and ORC

3) Tuning Spark Streaming Performance and Fault Tolerance with KafkaRDD and KinesisRDD

4) Maintaining Stability during High Scale Streaming Ingestion using Approximations and Probabilistic Data Structures from Spark, Redis, and Twitter's Algebird

5) Building Effective Machine Learning Models using Feature Engineering, Dimension Reduction, and Natural Language Processing with MLlib/GraphX, ML Pipelines, DIMSUM, Locality Sensitive Hashing, and Stanford's CoreNLP

6) Tuning Core Spark Performance by Acknowledging Mechanical Sympathy for the Physical Limitations of OS and Hardware Resources such as CPU, Memory, Network, and Disk with Project Tungsten, Asynchronous Netty, and Linux epoll

* Demos *

This talk features many interesting, audience-interactive demos, as well as code-level deep dives into many of the projects listed above.

All demo code is available on GitHub at the following link: https://github.com/fluxcapacitor/pipeline/wiki

In addition, the entire demo environment has been Dockerized and made available for download on Docker Hub at the following link: https://hub.docker.com/r/fluxcapacitor/pipeline/

* Speaker Bio *

Chris Fregly is a Principal Data Solutions Engineer at the newly formed IBM Spark Technology Center, an Apache Spark contributor, a Netflix Open Source committer, the organizer of the global Advanced Apache Spark Meetup, and the author of the upcoming book, Advanced Spark.

Previously, Chris was a Data Solutions Engineer at Databricks and a Streaming Data Engineer at Netflix.

When Chris isn’t contributing to Spark and other open source projects, he’s creating book chapters, slides, and demos to share knowledge with his peers at meetups and conferences throughout the world.

  1. Spark After Dark 1.5: High Performance, Real-time, Streaming, Machine Learning, Natural Language Processing, Text Analytics, and Recommendations. Chris Fregly, Principal Data Solutions Engineer, IBM Spark Technology Center. ** We're Hiring -- Only Nice People, Please!! ** Zurich Spark Meetup, November 2, 2015
  2. Follow Along – Slides Are Now Available! https://www.slideshare.net/cfregly/
  3. Who Am I? Streaming Data Engineer, Netflix; Netflix Open Source Committer; Data Solutions Engineer, Databricks; Apache Spark Contributor; Principal Data Solutions Engineer, IBM Spark Technology Center; Meetup Organizer, Advanced Apache Spark Meetup; Book Author, Advanced Spark (2016)
  4. Upcoming Meetups and Conferences: London Spark Meetup (Oct 12th), Scotland Data Science Meetup (Oct 13th), Dublin Spark Meetup (Oct 15th), Barcelona Spark Meetup (Oct 20th), Madrid Spark/Big Data Meetup (Oct 22nd), Paris Spark Meetup (Oct 26th), Amsterdam Spark Summit & Meetup (Oct 27th), Delft Dutch Data Science Meetup (Oct 29th), Brussels Spark Meetup (Oct 30th), Zurich Big Data Developers Meetup (Nov 2nd), Geneva Spark Meetup (Nov 5th), San Francisco Datapalooza (Nov 10th), San Francisco Advanced Apache Spark (Nov 12th), Oslo Big Data Hadoop Meetup (Nov 18th), Helsinki Spark Meetup (Nov 20th), Stockholm Spark Meetup (Nov 23rd), Copenhagen Spark Meetup (Nov 25th), Budapest Spark Meetup (Nov 27th), Singapore Strata Conference (Dec 1st), San Francisco Advanced Apache Spark (Dec 8th), Mountain View Advanced Apache Spark (Dec 10th), Washington DC Spark Meetup (Dec 17th)
  5. Advanced Apache Spark Meetup. Meetup Metrics: 1500 members in just 3 mos! 4th most active Spark Meetup!! meetup.com/Advanced-Apache-Spark-Meetup. Meetup Goals: dig deep into codebases of Spark & related projects; study integrations of Cassandra, ElasticSearch, Tachyon, S3, BlinkDB, Mesos, YARN, Kafka, R; surface & share patterns & idioms of these well-designed, distributed, big data components
  6. What is Spark After Dark? A fun, Spark-based dating reference application (*not a movie recommendation engine!!) that generates recommendations based on user similarity and demonstrates Apache Spark & related big data projects
  7. Tools of this Talk (github.com/fluxcapacitor): Redis; Docker; Ganglia; Streaming, Kafka; Cassandra, NoSQL; Parquet, JSON, ORC, Avro; Apache Zeppelin Notebooks; Spark SQL, DataFrames, Hive; ElasticSearch, Logstash, Kibana; Spark ML, GraphX, Stanford CoreNLP; and…
  8. Overall Themes of this Talk: Filter Early, Filter Deep; Approximations are OK; Minimize Random Seeks; Maximize Sequential Scans; Go Off-Heap when Possible; Parallelism is Required at Scale; Must Reduce Dimensions at Scale; Seek Performance Gains at all Layers; Customize Data Structs for your Workload; Be Nice and Collaborate!
  9. Outline: Spark Core (Mechanical Sympathy & Tuning); Spark SQL (Catalyst & DataSources API); Spark Streaming (Scaling & Approximating); Spark ML (Featurizing & Recommending)
  10. Spark Core: Mechanical Sympathy & Tuning. Understanding & Acknowledging Mechanical Sympathy; 100TB GraySort Challenge; Project Tungsten; Shuffle Service and Dynamic Allocation
  11. Spark and Mechanical Sympathy (http://mechanical-sympathy.blogspot.com). "Hardware and software working together in harmony" – Martin Thompson. Saturate Network I/O; Saturate Disk I/O; Minimize Memory and GC; Maximize CPU Cache Locality. Daytona GraySort (Spark 1.1-1.2); Project Tungsten (Spark 1.4-1.6)
  12. AlphaSort Trick for Sorting (AlphaSort paper, 1995, Chris Nyberg and Jim Gray). Naïve List (Pointer-to-Record): requires the key to be dereferenced for comparison – not cache-line friendly. AlphaSort List (Key, Pointer): the key is directly available for comparison
  13. CPU Cache Line and Memory Sympathy. Key (10 bytes) + Pointer (4 bytes*) = 14 bytes: not cache-line friendly. Key (10 bytes) + Pad (2 bytes) + Pointer (4 bytes) = 16 bytes: cache-line friendly. Key-Prefix (4 bytes) + Pointer (4 bytes) = 8 bytes: 2x cache-line friendly
  14. Performance Comparison
  15. Similar Technique: Direct Cache Access – packet header placed into the CPU cache
  16. CPU Cache Lines: Sequential vs Random
  17. CPU Cache: Naïve Matrix Multiplication. // Dot product of each row & column vector: for (i <- 0 until numRowsA) for (j <- 0 until numColsB) for (k <- 0 until numColsA) res[i][j] += matA[i][k] * matB[k][j]. Bad: row-wise traversal does not use the CPU cache line; ineffective pre-fetching
  18. CPU Cache-Friendly Matrix Multiplication. // Transpose B: for (i <- 0 until numRowsB) for (j <- 0 until numColsB) matBT[i][j] = matB[j][i]. // Modify the dot product calculation to use B-Transpose: for (i <- 0 until numRowsA) for (j <- 0 until numColsB) for (k <- 0 until numColsA) res[i][j] += matA[i][k] * matBT[j][k]. Good: full CPU cache line, effective prefetching. (The old version res[i][j] += matA[i][k] * matB[k][j] referenced j before k.)
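
A minimal, self-contained Scala sketch of the two access patterns compared on slides 17-18; the matrix size and timing loop are illustrative, not from the deck:

```scala
object MatrixMultiply {
  val n = 512  // illustrative size, not from the deck

  // Naïve: inner loop walks matB column-wise, striding across cache lines
  def naive(matA: Array[Array[Double]], matB: Array[Array[Double]]): Array[Array[Double]] = {
    val res = Array.ofDim[Double](n, n)
    for (i <- 0 until n; j <- 0 until n; k <- 0 until n)
      res(i)(j) += matA(i)(k) * matB(k)(j)
    res
  }

  // Cache-friendly: transpose B first so both operands are read row-wise (sequential scans)
  def cacheFriendly(matA: Array[Array[Double]], matB: Array[Array[Double]]): Array[Array[Double]] = {
    val matBT = Array.ofDim[Double](n, n)
    for (i <- 0 until n; j <- 0 until n) matBT(i)(j) = matB(j)(i)
    val res = Array.ofDim[Double](n, n)
    for (i <- 0 until n; j <- 0 until n; k <- 0 until n)
      res(i)(j) += matA(i)(k) * matBT(j)(k)
    res
  }

  def main(args: Array[String]): Unit = {
    val rnd = new scala.util.Random(42)
    val a = Array.fill(n, n)(rnd.nextDouble())
    val b = Array.fill(n, n)(rnd.nextDouble())
    def time[T](label: String)(f: => T): T = {
      val t0 = System.nanoTime(); val r = f
      println(f"$label: ${(System.nanoTime() - t0) / 1e9}%.2f s"); r
    }
    time("naive")(naive(a, b))
    time("cache-friendly")(cacheFriendly(a, b))
  }
}
```
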
  19. Demo! Comparing CPU Naïve & Cache-Friendly Matrix Multiplication
  20. Instrumenting and Monitoring CPU: the Linux perf command!
  21. Results of Cache-Friendly vs. Naïve Matrix Multiply (chart of perf counter ratios: ~72x, ~8x, ~3x, ~3x, ~2x, ~7x, ~10x). perf stat --repeat 5 --scale --event L1-dcache-load-misses,L1-dcache-prefetch-misses,LLC-load-misses,LLC-prefetch-misses,cache-misses,stalled-cycles-frontend java -Xmx13G -XX:-Inline -jar ~/sbt/bin/sbt-launch.jar "tungsten/run-main com.advancedspark.tungsten.matrix.Cache[Friendly|Naïve]MatrixMultiply 256 1"
  22. Profile Visualizations: Flame Graphs with Java Stack Traces. Plateaus are bad! Images courtesy of http://techblog.netflix.com/2015/07/java-in-flames.html
  23. 100TB Daytona GraySort Challenge: focus on Network and Disk I/O Optimizations; improve Data Structs/Algos for Sort & Shuffle; saturate Network and Disk Controllers
  24. Winning Results (2013 vs. 2014). Spark Goals: Saturate Network I/O; Saturate Disk I/O
  25. Winning Hardware Configuration. Compute: 206 EC2 Worker nodes, 1 Master node; AWS i2.8xlarge; 32 Intel Xeon CPU E5-2670 @ 2.5 GHz; 244 GB RAM, 8 x 800GB SSD, RAID 0 striping, ext4; NOOP I/O scheduler (FIFO, request merging, no reordering); 3 GBps mixed read/write disk I/O per node. Network: deployed within a Placement Group/VPC using AWS Enhanced Networking; Single Root I/O Virtualization (SR-IOV), an extension of PCIe; 10 Gbps, low latency, low jitter (iperf showed ~9.5 Gbps)
  26. Winning Software Configuration. Spark 1.2, OpenJDK 1.7_<amazon-something>_u65-b17. Disable caching, compression, speculative execution, and shuffle spill. Force NODE_LOCAL task scheduling for optimal data locality. HDFS 2.4.1 short-circuit for local reads, 2x replication. 4-6 tasks (partitions) per core is the Spark recommendation: 206 nodes * 32 cores = 6592 cores; 6592 cores * 4 = 26,368 partitions; 6592 cores * 6 = 39,552 partitions; 6592 cores * 4.25 = 28,000 partitions was empirically best. Range partitioning takes advantage of the sequential keyspace
  27. New Shuffle Manager: replace the hash-based shuffle manager with a "sort-based" one. Uses fewer OS resources. Pre-sorts keys in memory on the Mapper; merge-sorts keys into a single master file on disk; the Mapper serves each partition with a single seek and a sequential scan
  28. New Network Module: replaces the old java.nio, low-level, socket-based code. Zero-copy epoll: stay in kernel space between disk & network. Custom memory management. spark.shuffle.blockTransferService=netty. Spark-Netty Performance Tuning: spark.shuffle.io.numConnectionsPerPeer (increase to saturate hosts with multiple disks); spark.shuffle.io.preferDirectBuffers (on- or off-heap; off-heap is the default)
  29. New Algorithms and Data Structures, optimized for sort and shuffle. o.a.s.util.collection.TimSort[K,V]: based on JDK 1.7 TimSort; performs best on partially-sorted datasets; optimized for elements of (K,V) pairs; sorts impls of SortDataFormat (i.e. KVArraySortDataFormat). o.a.s.util.collection.AppendOnlyMap: open-addressing hash with quadratic probing; array of [(K, V), (K, V)]; good memory locality; keys never removed, values only appended
  30. Met Performance Goals! Reducers: 1.1 GB/s/node network I/O (theoretical max = 1.25 GB/s for 10 Gbps Ethernet). Mappers: 3 GB/s/node disk I/O (8 x 800GB SSD). 206 nodes * 1.1 GB/s/node ~= 220 GB/s
  31. Shuffle Performance Tuning Tips. Hash Shuffle Manager (no longer the default): spark.shuffle.consolidateFiles (mapper output files); o.a.s.shuffle.FileShuffleBlockResolver (intermediate files). Increase spark.shuffle.file.buffer to reduce seeks & sys calls. Increase spark.reducer.maxSizeInFlight if memory allows. Use a smaller number of larger workers to reduce the total number of files. SQL: BroadcastHashJoin vs. ShuffledHashJoin; spark.sql.autoBroadcastJoinThreshold; use DataFrame.explain(true) or EXPLAIN to verify
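
A brief Scala sketch, against the Spark 1.5-era SQLContext API, of raising the auto-broadcast threshold and verifying the join strategy with explain() as suggested on slide 31; the join column names are assumptions, not from the deck:

```scala
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)  // sc: existing SparkContext

// Broadcast any table smaller than ~50 MB instead of shuffling it (the default threshold is 10 MB)
sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", (50 * 1024 * 1024).toString)

val ratingsDF = sqlContext.read.json("file:/root/pipeline/datasets/dating/ratings.json.bz2")
val gendersDF = sqlContext.read.json("file:/root/pipeline/datasets/dating/genders.json.bz2")

// explain(true) prints the logical, analyzed, optimized, and physical plans;
// look for BroadcastHashJoin (vs. ShuffledHashJoin/SortMergeJoin) in the physical plan
ratingsDF.join(gendersDF, ratingsDF("userId") === gendersDF("id")).explain(true)
```
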
  32. Project Tungsten: focus on CPU Cache and Memory Optimizations; further improve Data Structures and Algorithms; operate on Serialized/Compressed Data; provide a path to Off-Heap
  33. Why is CPU the Bottleneck? Network and Disk I/O bandwidth are relatively high; the GraySort optimizations improved network & shuffle; more partitioning, pruning, and predicate pushdowns; popularity of columnar file formats like Parquet/ORC; CPU is used for serialization, hashing, and compression!
  34. Spark Shuffle Managers (spark.shuffle.manager). hash (< 10,000 reducers): the output partition file hashes the key of the (K,V) pair; the Mapper creates an output file per partition, leading to M*P output files for all partitions. sort (>= 10,000 reducers; default since Spark 1.2): the Mapper creates a single output file for all partitions; minimizes OS resources; Netty and epoll optimize network I/O and memory usage; uses custom data structures and algorithms for the sort-shuffle workload; won the Daytona GraySort Challenge. unsafe (Tungsten; default in Spark 1.5): uses sun.misc.Unsafe to enable self-managed byte buffers; custom serialization format; operates on both serialized and compressed byte buffers
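
A minimal sketch of selecting a shuffle manager explicitly, assuming the Spark 1.5-era configuration keys and values behind slide 34 (in 1.5 the Unsafe/Tungsten shuffle is exposed as the tungsten-sort manager):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Valid values in the Spark 1.5 timeframe: "hash", "sort" (default since 1.2),
// and "tungsten-sort" (the Unsafe/Tungsten shuffle described on the slide)
val conf = new SparkConf()
  .setAppName("shuffle-manager-demo")
  .set("spark.shuffle.manager", "tungsten-sort")
  .set("spark.shuffle.blockTransferService", "netty")  // Netty-based transfer (slide 28)

val sc = new SparkContext(conf)
```
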
  35. New Data Structures: "My array will beat your data structure!" New data structures for the sort/shuffle workload: UnsafeRow, BytesToBytesMap
  36. sun.misc.Unsafe (used by Tungsten). Info: addressSize(), pageSize(). Objects: allocateInstance(), objectFieldOffset(). Classes: staticFieldOffset(), defineClass(), defineAnonymousClass(), ensureClassInitialized(). Synchronization: monitorEnter(), tryMonitorEnter(), monitorExit(), compareAndSwapInt(), putOrderedInt(). Arrays: arrayBaseOffset(), arrayIndexScale(). Memory: allocateMemory(), copyMemory(), freeMemory(), getAddress() – not guaranteed after GC, getInt()/putInt(), getBoolean()/putBoolean(), getByte()/putByte(), getShort()/putShort(), getLong()/putLong(), getFloat()/putFloat(), getDouble()/putDouble(), getObjectVolatile()/putObjectVolatile()
  37. Spark + sun.misc.Unsafe – over 200 source files affected!! org.apache.spark.sql.execution: aggregate.SortBasedAggregate, aggregate.TungstenAggregate, aggregate.AggregationIterator, aggregate.udaf, aggregate.utils, aggregate.SortBasedAggregationIterator, aggregate.TungstenAggregationIterator, SparkPlanner, rowFormatConverters, UnsafeFixedWidthAggregationMap, UnsafeExternalSorter, UnsafeExternalRowSorter, UnsafeKeyValueSorter, UnsafeKVExternalSorter, local.ConvertToUnsafeNode, local.ConvertToSafeNode, local.HashJoinNode, local.ProjectNode, local.LocalNode, local.BinaryHashJoinNode, local.NestedLoopJoinNode, joins.HashJoin, joins.HashSemiJoin, joins.HashedRelation, joins.BroadcastHashJoin, joins.ShuffledHashOuterJoin (not yet converted), joins.BroadcastHashOuterJoin, joins.BroadcastLeftSemiJoinHash, joins.BroadcastNestedLoopJoin, joins.SortMergeJoin, joins.LeftSemiJoinBNL, joins.SortMergeOuterJoin, Exchange, SparkPlan, UnsafeRowSerializer, SortPrefixUtils, sort, basicOperators, datasources.WriterContainer, datasources.json.JacksonParser, datasources.jdbc.JDBCRDD. org.apache.spark: unsafe.Platform, unsafe.KVIterator, unsafe.array.LongArray, unsafe.array.ByteArrayMethods, unsafe.array.BitSet, unsafe.bitset.BitSetMethods, unsafe.hash.Murmur3_x86_32, unsafe.map.BytesToBytesMap, unsafe.map.HashMapGrowthStrategy, unsafe.memory.TaskMemoryManager, unsafe.memory.ExecutorMemoryManager, unsafe.memory.MemoryLocation, unsafe.memory.UnsafeMemoryAllocator, unsafe.memory.MemoryAllocator (trait/interface), unsafe.memory.MemoryBlock, unsafe.memory.HeapMemoryAllocator, unsafe.sort.RecordComparator, unsafe.sort.PrefixComparator, unsafe.sort.PrefixComparators, unsafe.sort.UnsafeSorterSpillWriter, serializer.DummySerializationInstance, shuffle.unsafe.UnsafeShuffleManager, shuffle.unsafe.UnsafeShuffleSortDataFormat, shuffle.unsafe.SpillInfo, shuffle.unsafe.UnsafeShuffleWriter, shuffle.unsafe.UnsafeShuffleExternalSorter, shuffle.unsafe.PackedRecordPointer, shuffle.ShuffleMemoryManager, util.collection.unsafe.sort.UnsafeSorterSpillMerger, util.collection.unsafe.sort.UnsafeSorterSpillReader, util.collection.unsafe.sort.UnsafeSorterSpillWriter, util.collection.unsafe.sort.UnsafeShuffleInMemorySorter, util.collection.unsafe.sort.UnsafeInMemorySorter, util.collection.unsafe.sort.RecordPointerAndKeyPrefix, util.collection.unsafe.sort.UnsafeSorterIterator, network.shuffle.ExternalShuffleBlockResolver, scheduler.Task, rdd.SqlNewHadoopRDD, executor.Executor. org.apache.spark.sql.catalyst.expressions: regexpExpressions, BoundAttribute, SortOrder, SpecializedGetters, ExpressionEvalHelper, UnsafeArrayData, UnsafeReaders, UnsafeMapData, Projection, LiteralGenerator, UnsafeRow, JoinedRow, InputFileName, SpecificMutableRow, codegen.CodeGenerator, codegen.GenerateProjection, codegen.GenerateUnsafeRowJoiner, codegen.GenerateSafeProjection, codegen.GenerateUnsafeProjection, codegen.BufferHolder, codegen.UnsafeRowWriter, codegen.UnsafeArrayWriter, complexTypeCreator, rows, literals, misc, stringExpressions
  38. CPU & Memory Optimizations. Custom Managed Memory: reduces GC overhead; both on- and off-heap; exact size calculations. Direct Binary Processing: operate on serialized/compressed arrays; Kryo can reorder serialized records; LZF can reorder compressed records. More CPU cache-aware Data Structs & Algorithms: o.a.s.unsafe.map.BytesToBytesMap vs. j.u.HashMap. Code Generation (default in 1.5): generate source code from the overall query plan; Janino generates bytecode from the source code; 100+ UDFs converted to use code generation. Related classes: UnsafeFixedWidthAggregationMap & TungstenAggregationIterator; CodeGenerator & GenerateUnsafeRowJoiner; UnsafeSortDataFormat & UnsafeShuffleSortDataFormat & PackedRecordPointer & UnsafeRow; UnsafeInMemorySorter & UnsafeExternalSorter & UnsafeShuffleWriter; mostly the same join code, with an added if (isUnsafeMode); UnsafeShuffleManager & UnsafeShuffleInMemorySorter & UnsafeShuffleExternalSorter. Details in SPARK-7075
  39. Code Generation (Default in 1.5). Problem: generic expression evaluation is expensive on the JVM; the JVM can't inline polymorphic impls (code generation bypasses polymorphism); virtual function calls; branches based on expression type; boxing causes excessive object creation. Implementation: defer source code generation to each operator, type, etc.; Scala quasiquotes provide AST manipulation & rewriting; generates source code, compiled to bytecode w/ Janino; 100+ UDFs now using code gen
  40. Code Generation: Spark SQL UDFs. 100+ UDFs now using code gen – more to come in Spark 1.6! Details in SPARK-8159
  41. Outline: Spark Core (Mechanical Sympathy & Tuning); Spark SQL (Catalyst & DataSources API); Spark Streaming (Scaling & Approximating); Spark ML (Featurizing & Recommending)
  42. Spark SQL: Catalyst and DataSources API. Explore DataFrames, Datasets, DataSources, and Catalyst; create a custom DataSources API implementation; review Partitions, Pruning, Pushdowns, and Formats
  43. DataFrames API. Inspired by R and Pandas DataFrames; schema-aware; cross-language support (SQL, Python, Scala, Java, R); levels the performance of Python, Scala, Java, and R by generating JVM bytecode instead of serializing to Python. A DataFrame is a container for a logical plan: lazy transformations are represented as a tree, and the Catalyst optimizer creates the physical plan by moving expressions up/down the tree. UDF and UDAF support: custom UDFs via registerFunction(); new, experimental UDAF support. Supports an existing Hive metastore if available; a small, file-based Hive metastore is created if not. *DataFrame.rdd returns the underlying RDD if needed. Use DataFrames instead of RDDs!!
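
A short Scala sketch of the DataFrame API described on slide 43, assuming the Spark 1.5-era SQLContext; udf.register is the Scala counterpart of the registerFunction() mentioned above, and the column names (id, gender) are assumed from the CSV example later in the deck:

```scala
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)  // sc: existing SparkContext

val gendersDF = sqlContext.read.json("file:/root/pipeline/datasets/dating/genders.json.bz2")
gendersDF.printSchema()
gendersDF.registerTempTable("genders")

// Register a custom UDF and use it from SQL
sqlContext.udf.register("isMale", (gender: String) => gender == "M")
sqlContext.sql("SELECT id FROM genders WHERE isMale(gender)").show(10)

// Drop down to the underlying RDD only when needed
val gendersRDD = gendersDF.rdd
```
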
  44. DataSources API. Relations (o.a.s.sql.sources.interfaces.scala): BaseRelation (abstract class) provides the schema of the data; TableScan (impl) reads all data from the source; PrunedFilteredScan (impl) does column pruning & predicate pushdowns; InsertableRelation (impl) inserts/overwrites data based on SaveMode; RelationProvider (trait/interface) handles options and acts as a BaseRelation factory. Execution (o.a.s.sql.execution.commands.scala): RunnableCommand (trait/interface) for common commands like EXPLAIN; ExplainCommand (impl: case class); CacheTableCommand (impl: case class). Filters (o.a.s.sql.sources.filters.scala): Filter (abstract class) handles all predicates/filters supported by the source; EqualTo (impl); GreaterThan (impl); StringStartsWith (impl)
  45. Native Spark SQL DataSources
  46. JSON Data Source. DataFrame: val ratingsDF = sqlContext.read.format("json").load("file:/root/pipeline/datasets/dating/ratings.json.bz2") – or – val ratingsDF = sqlContext.read.json("file:/root/pipeline/datasets/dating/ratings.json.bz2") (json() convenience method). SQL: CREATE TABLE genders USING json OPTIONS (path "file:/root/pipeline/datasets/dating/genders.json.bz2")
  47. JDBC Data Source. Add the driver to the Spark JVM system classpath: $ export SPARK_CLASSPATH=<jdbc-driver.jar>. DataFrame: val jdbcConfig = Map("driver" -> "org.postgresql.Driver", "url" -> "jdbc:postgresql:hostname:port/database", "dbtable" -> "schema.tablename"); sqlContext.read.format("jdbc").options(jdbcConfig).load(). SQL: CREATE TABLE genders USING jdbc OPTIONS (url, dbtable, driver, …)
  48. Parquet Data Source. Configuration: spark.sql.parquet.filterPushdown=true; spark.sql.parquet.mergeSchema=true; spark.sql.parquet.cacheMetadata=true; spark.sql.parquet.compression.codec=[uncompressed,snappy,gzip,lzo]. DataFrames: val gendersDF = sqlContext.read.format("parquet").load("file:/root/pipeline/datasets/dating/genders.parquet"); gendersDF.write.format("parquet").partitionBy("gender").save("file:/root/pipeline/datasets/dating/genders.parquet"). SQL: CREATE TABLE genders USING parquet OPTIONS (path "file:/root/pipeline/datasets/dating/genders.parquet")
  49. ORC Data Source. Configuration: spark.sql.orc.filterPushdown=true. DataFrames: val gendersDF = sqlContext.read.format("orc").load("file:/root/pipeline/datasets/dating/genders"); gendersDF.write.format("orc").partitionBy("gender").save("file:/root/pipeline/datasets/dating/genders"). SQL: CREATE TABLE genders USING orc OPTIONS (path "file:/root/pipeline/datasets/dating/genders")
  50. Third-Party Spark SQL DataSources
  51. CSV DataSource (Databricks). GitHub: https://github.com/databricks/spark-csv. Maven: com.databricks:spark-csv_2.10:1.2.0. Code: val gendersCsvDF = sqlContext.read.format("com.databricks.spark.csv").load("file:/root/pipeline/datasets/dating/gender.csv.bz2").toDF("id", "gender") – toDF() is required if the CSV does not contain a header
  52. Avro DataSource (Databricks). GitHub: https://github.com/databricks/spark-avro. Maven: com.databricks:spark-avro_2.10:2.0.1. Code: val df = sqlContext.read.format("com.databricks.spark.avro").load("file:/root/pipeline/datasets/dating/gender.avro")
  53. ElasticSearch DataSource (Elastic.co). GitHub: https://github.com/elastic/elasticsearch-hadoop. Maven: org.elasticsearch:elasticsearch-spark_2.10:2.1.0. Code: val esConfig = Map("pushdown" -> "true", "es.nodes" -> "<hostname>", "es.port" -> "<port>"); df.write.format("org.elasticsearch.spark.sql").mode(SaveMode.Overwrite).options(esConfig).save("<index>/<document-type>")
  54. AWS Redshift Data Source (Databricks). GitHub: https://github.com/databricks/spark-redshift. Maven: com.databricks:spark-redshift:0.5.0. Code: val df: DataFrame = sqlContext.read.format("com.databricks.spark.redshift").option("url", "jdbc:redshift://<hostname>:<port>/<database>…").option("query", "select x, count(*) from my_table group by x").option("tempdir", "s3n://tmpdir").load(...) – UNLOAD and copy to a tmp bucket in S3 enables parallel reads
  55. Cassandra DataSource (DataStax). GitHub: https://github.com/datastax/spark-cassandra-connector. Maven: com.datastax.spark:spark-cassandra-connector_2.10:1.5.0-M1. Code: ratingsDF.write.format("org.apache.spark.sql.cassandra").mode(SaveMode.Append).options(Map("keyspace" -> "<keyspace>", "table" -> "<table>")).save(…)
  56. Cassandra Pushdown Support (spark-cassandra-connector/…/o.a.s.sql.cassandra.PredicatePushDown.scala). Pushdown Predicate Rules: 1. Only push down non-partition-key column predicates with =, >, <, >=, <= predicates. 2. Only push down primary-key column predicates with = or IN predicates. 3. If there are regular columns in the pushdown predicates, they should have at least one EQ expression on an indexed column and no IN predicates. 4. All partition-column predicates must be included in the predicates to be pushed down; only the last part of the partition key can be an IN predicate, and for each partition column only one predicate is allowed. 5. For clustering-column predicates, only the last predicate can be a non-EQ predicate (including IN), and the preceding column predicates must be EQ predicates; if there is only one clustering-column predicate, it can be any non-IN predicate. 6. There are no pushdown predicates if there is any OR condition or NOT IN condition. 7. We're not allowed to push down multiple predicates for the same column if any of them is an equality or IN predicate
  57. Rumor of a New Cassandra DataSource: by-pass the CQL front door used for transactional data; bulk read/write directly from/to SSTables; similar to the existing Netflix Open Source project https://github.com/Netflix/aegisthus; promotes Cassandra to a first-class analytics option. Potentially only part of DataStax Enterprise?! Please mail a nasty letter to your local DataStax office
  58. Cloudant DataSource (IBM). Package: http://spark-packages.org/package/cloudant/spark-cloudant. Maven: com.datastax.spark:spark-cassandra-connector_2.10:1.5.0-M1. Code: ratingsDF.write.format("com.cloudant.spark").mode(SaveMode.Append).options(Map("cloudant.host" -> "<account>.cloudant.com", "cloudant.username" -> "<username>", "cloudant.password" -> "<password>")).save("<filename>")
  59. DB2 and BigSQL DataSources (IBM): Coming Soon!
  60. Rumor of a REST DataSource (Databricks): Coming Soon? Ask Michael Armbrust, Spark SQL Lead @ Databricks
  61. Custom DataSource (Me and You All!): Coming Right Now! DEMO ALERT!!
  62. Demo! Create a Custom DataSource
  63. Creating a New DataSource. Study existing native and third-party data source impls. Native: JDBC (o.a.s.sql.execution.datasources.jdbc) – class JDBCRelation extends BaseRelation with PrunedFilteredScan with InsertableRelation. Third-Party: Cassandra (o.a.s.sql.cassandra) – class CassandraSourceRelation extends BaseRelation with PrunedFilteredScan with InsertableRelation. <Insert Your Custom Data Source Here!>
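
As a starting point, here is a minimal Scala sketch of a custom data source against the Spark 1.5-era DataSources API – a read-only TableScan rather than the full PrunedFilteredScan/InsertableRelation named on slide 63; the package, class names, and two-column schema are illustrative:

```scala
package com.example.datasource  // illustrative package name

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, RelationProvider, TableScan}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Spark looks for a class named DefaultSource in the package passed to .format(...)
class DefaultSource extends RelationProvider {
  override def createRelation(sqlContext: SQLContext,
                              parameters: Map[String, String]): BaseRelation =
    new GenderRelation(sqlContext)
}

// A read-only relation: provide a schema and a way to build an RDD[Row]
class GenderRelation(val sqlContext: SQLContext) extends BaseRelation with TableScan {
  override def schema: StructType = StructType(Seq(
    StructField("id", IntegerType, nullable = false),
    StructField("gender", StringType, nullable = true)))

  override def buildScan(): RDD[Row] =
    sqlContext.sparkContext.parallelize(Seq(Row(1, "M"), Row(2, "F"), Row(3, "U")))
}

// Usage: sqlContext.read.format("com.example.datasource").load().show()
```
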
  64. Contributing a Custom Data Source: spark-packages.org. Managed by Databricks; contains links to external GitHub projects; ratings and comments; declare Spark version support for each package. Examples: https://github.com/databricks/spark-csv, https://github.com/databricks/spark-avro, https://github.com/databricks/spark-redshift
  65. Catalyst Optimizer: optimizes the DataFrame transformation tree. Subquery elimination: use aliases to collapse subqueries. Constant folding: replace an expression with a constant. Simplify filters: remove unnecessary filters. Predicate/filter pushdowns: avoid unnecessary data loads. Projection collapsing: avoid unnecessary projections. Create custom rules: rules are Scala case classes implementing o.a.s.sql.catalyst.rules.Rule and can be applied at any stage; val newPlan = MyFilterRule(analyzedPlan). JVM code generation
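
A hedged Scala sketch of a custom Catalyst rule in the spirit of the MyFilterRule example on slide 65. Rule and the logical Filter operator are Spark-internal APIs whose shape varies by version (this follows the Spark 1.5-era layout), and the rule itself – collapsing an exactly duplicated adjacent filter – is purely illustrative:

```scala
import org.apache.spark.sql.catalyst.plans.logical.{Filter, LogicalPlan}
import org.apache.spark.sql.catalyst.rules.Rule

// Collapse Filter(c, Filter(c, child)) into a single Filter(c, child)
object CollapseDuplicateFilters extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case Filter(condition, Filter(innerCondition, child)) if condition == innerCondition =>
      Filter(condition, child)
  }
}

// Apply the rule by hand to an analyzed plan, as on the slide:
//   val newPlan = CollapseDuplicateFilters(df.queryExecution.analyzed)
```
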
  66. Query Plan Debugging: gendersCsvDF.select($"id", $"gender").filter("gender != 'F'").filter("gender != 'M'").explain(true). Also: DataFrame.queryExecution.logical, DataFrame.queryExecution.analyzed, DataFrame.queryExecution.optimizedPlan, DataFrame.queryExecution.executedPlan
  67. Query Plan Visualization & Query Metrics: effectiveness of a Filter; CPU cache-friendly binary format; cost-based join optimization (similar to a MapReduce map-side join); peak memory for joins and aggs
  68. Parquet Columnar File Format: based on Google Dremel; collaboration with Twitter and Cloudera; columnar storage format; fast columnar aggregations; tight compression; supports pushdowns; nested, self-describing, evolving schema
  69. Types of Compression: Run Length Encoding (repeated data); Dictionary Encoding (fixed set of values); Delta, Prefix Encoding (sorted data)
  70. Demo! Demonstrate File Formats, Partition Schemes, and Query Plans
  71. Sample Dataset. RATINGS: UserID, ProfileID, Rating (1-10). GENDERS: UserID, Gender (M, F, U)
  72. Hive JDBC/ODBC ThriftServer: allows BI tools to connect to Spark DataSources; must register data in the Hive metastore. Configuration: spark.sql.thriftServer.incrementalCollect=true; spark.driver.maxResultSize > 10gb (default)
  73. Demo! Accessing Cassandra Data through Beeline and Tableau
  74. Outline: Spark Core (Mechanical Sympathy & Tuning); Spark SQL (Catalyst & DataSources API); Spark Streaming (Scaling & Approximating); Spark ML (Featurizing & Recommending)
  75. Spark Streaming: Scaling & Approximating. Understand Parallelism, Recovery, and Back Pressure; compare Receiver and Receiver-less Implementations; describe Common Streaming Count Approximations
  76. Receiver Impl: Kinesis. KinesisRDD partitions store the relevant offsets; a single receiver is required to see all data/offsets; Kinesis offsets are not deterministic like Kafka's; partitions rebuild from Kinesis using offsets; no Write Ahead Log (WAL) needed; optimizes the happy path by avoiding the WAL; at-least-once delivery guarantee
  77. Non-Parallelism of the Receiver Implementation
  78. Receiver-less, "Direct" Impl: Kafka. KafkaRDD partitions store the relevant offsets; each partition acts as a receiver; tasks/workers pull from Kafka in parallel; partitions rebuild from Kafka using offsets; no Write Ahead Log (WAL) needed; optimizes the happy path by avoiding the WAL; at-least-once delivery guarantee
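
A minimal Scala sketch of the receiver-less, direct Kafka stream described on slide 78, using the Spark 1.3+ KafkaUtils.createDirectStream API; the broker address and topic name are placeholders:

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val conf = new SparkConf().setAppName("direct-kafka-ratings")
val ssc = new StreamingContext(conf, Seconds(2))

// One KafkaRDD partition per Kafka topic partition; offsets are tracked by the RDD itself
val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")  // placeholder broker
val topics = Set("ratings")                                        // placeholder topic

val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topics)

stream.map { case (_, value) => value }.count().print()

ssc.start()
ssc.awaitTermination()
```
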
  79. Parallelism of Direct Kafka Streaming
  80. Streaming Back Pressure: more than just throttling; pushes back on the source; requires a buffered source (Kafka, Kinesis); based on the fundamentals of Control Theory; contributed by Typesafe
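
Back pressure is switched on with a single configuration flag starting in Spark 1.5; a brief sketch (the per-partition max-rate cap is an optional, illustrative setting for the direct Kafka stream):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("backpressure-demo")
  // Let Spark Streaming adapt ingestion rates to current processing times (Spark 1.5+)
  .set("spark.streaming.backpressure.enabled", "true")
  // Optional upper bound per Kafka partition for the direct stream (records/sec)
  .set("spark.streaming.kafka.maxRatePerPartition", "10000")
```
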
  81. Approximations: HyperLogLog. Approximate cardinality (approx count distinct); fixed, low memory; tunable error percentage; only 1.5KB @ 2% error for 10^9 elements; Twitter's Algebird; streaming example in the Spark codebase; Spark's countApproxDistinctByKey(). (http://research.neustar.biz/)
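
A small Scala sketch of the built-in HyperLogLog-backed API named above, countApproxDistinctByKey; the sample data and the 2% relative error are illustrative:

```scala
// (profileId, ratingUserId) pairs: how many distinct users rated each profile?
val likes = sc.parallelize(Seq(
  (42, "userA"), (42, "userB"), (42, "userA"),
  (7,  "userC"), (7,  "userC")))

// HyperLogLog under the hood: fixed, small memory per key; relativeSD tunes the error
val approxDistinctRaters = likes.countApproxDistinctByKey(relativeSD = 0.02)

approxDistinctRaters.collect().foreach { case (profileId, approxCount) =>
  println(s"profile $profileId ~ $approxCount distinct raters")
}
```
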
  82. Approximations: Count-Min Sketch. Approximate counters; better than a HashMap at scale; low, fixed memory; known error bounds; supports a large number of counters; from Twitter's Algebird; streaming example in the Spark codebase
  83. Approximations: Monte Carlo Simulations. From the Manhattan Project (atomic bomb), used to simulate the movement of neutrons. Law of Large Numbers (LLN): the average of the results of many trials converges on the expected value. SparkPi example in the Spark codebase: Pi ~ 4 * (# red dots / # total dots)
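
A compact Scala sketch of the SparkPi-style Monte Carlo estimate described on slide 83; the sample count is arbitrary:

```scala
import scala.math.random

val numSamples = 10000000  // arbitrary; more samples -> better estimate

// Throw random darts at the unit square; count those landing inside the unit circle
val inCircle = sc.parallelize(1 to numSamples).map { _ =>
  val x = random * 2 - 1
  val y = random * 2 - 1
  if (x * x + y * y <= 1) 1 else 0
}.reduce(_ + _)

// Area ratio circle/square = pi/4, so pi ~ 4 * (# inside / # total)
println(s"Pi is roughly ${4.0 * inCircle / numSamples}")
```
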
  84. Outline: Spark Core (Mechanical Sympathy & Tuning); Spark SQL (Catalyst & DataSources API); Spark Streaming (Scaling & Approximating); Spark ML (Featurizing & Recommending)
  85. Spark ML: Featurizing & Recommending. Understand Similarity and Dimension Reduction; approximate with Sampling and Bucketing; generate 10 Recommendations
  86. Live, Interactive Demo! sparkafterdark.com
  87. Audience Participation Needed!! Audience Instructions: navigate to sparkafterdark.com; click 3 actresses and 3 actors; wait for us to analyze together! Note: this is totally anonymous!! Project Links: https://github.com/fluxcapacitor/pipeline, https://hub.docker.com/r/fluxcapacitor
  88. Similarity
  89. Types of Similarity. Euclidean: linear-based measure; suffers from magnitude bias. Cosine: angle-based measure; adjusts for magnitude bias. Jaccard: set intersection / union; suffers from popularity bias. Log Likelihood: the Netflix "Shawshank" problem; adjusts for popularity bias. (Example like-matrix with users Ali, Matei, Reynold, Patrick, Andy as columns and Kimberly, Leslie, Meredith, Lisa, Holden as rows.)
  90. All-Pairs Similarity Comparison: compare everything to everything, aka "pair-wise similarity" or "similarity join". Naïve shuffle: O(m*n^2), where m=rows, n=cols. Minimize shuffle through approximations! Reduce m (rows): sampling and bucketing. Reduce n (cols): remove the most frequent value (i.e. 0); Principal Component Analysis. Dimension reduction!!
  91. Dimension Reduction: Sampling and Bucketing
  92. Reduce m: DIMSUM Sampling ("Dimension Independent Matrix Square Using MapReduce"): remove rows with a low similarity probability. MLlib: RowMatrix.columnSimilarities(…). Twitter: 40% efficiency gain vs. Cosine Similarity
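
A small Scala sketch of the MLlib entry point named on slide 92, RowMatrix.columnSimilarities; passing a threshold switches on the DIMSUM sampling, and the toy vectors and threshold value are illustrative:

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// Rows are users, columns are profiles (toy data)
val rows = sc.parallelize(Seq(
  Vectors.dense(1.0, 0.0, 3.0, 0.0),
  Vectors.dense(0.0, 2.0, 0.0, 1.0),
  Vectors.dense(4.0, 0.0, 1.0, 0.0)))

val mat = new RowMatrix(rows)

// Exact cosine similarities between all column pairs (expensive at scale)
val exact = mat.columnSimilarities()

// DIMSUM: sample away low-magnitude contributions; pairs below ~0.1 similarity may be dropped
val approx = mat.columnSimilarities(threshold = 0.1)

approx.entries.collect().foreach(println)  // CoordinateMatrix of (i, j, similarity)
```
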
  93. Reduce m: LSH Bucketing ("Locality Sensitive Hashing"): split m into b buckets using a similarity hash algorithm; requires pre-processing of the data; compare bucket contents in parallel. O(m*n^2) -> O(m*n/b*b^2), where m=rows, n=cols, b=buckets; e.g. a 500k x 500k matrix: O(1.25e17) -> O(1.25e13) with b=50. github.com/mrsqueeze/spark-hash
  94. Reduce n: Remove the Most Frequent Value. Eliminate the most frequent value and represent the other values as (index, value) pairs. Converts O(m*n^2) -> O(m*nnz^2), where nnz = number of non-zeros and nnz << n. Note: choose the most frequent value (it may not be 0)
  95. Recommendations: Summary Statistics and Top-K; Historical Analysis; Collaborative Filtering and Clustering; Text Featurization and NLP
  96. Types of Recommendations. Non-personalized: no preference or behavior data for the user yet, aka the "Cold Start Problem". Personalized: User-Item Similarity (items that others with similar prefs have liked); Item-Item Similarity (items similar to your previously-liked items)
  97. Recommendation Terminology. Feedback: explicit (like, rating); implicit (search, click, hover, view, scroll). Feature Engineering: dimension reduction, polynomial expansion. Hyper-parameter Tuning: k-folds cross validation, grid search. Pipelines/Workflows: chaining together Transformers and Evaluators
  98. Single-Machine ML Algorithms: stay local, distribute as needed; helps migration of existing single-node algos to Spark; convert between Spark and Pandas DataFrames; new "pdspark" package: integration w/ scikit-learn and R
  99. Non-Personalized Recommendations: use aggregate data to generate recommendations
  100. Top Users by Like Count: "I might like users who have the most likes overall based on historical data." Spark SQL, DataFrames: summary stats, aggregations
  101. Top Influencers by Like Graph: "I might like the most-influential users in the overall like graph." GraphX: PageRank
  102. Demo! Generate Non-Personalized Recommendations
  103. Personalized Recommendations: understand Similarity and Personalized Recommendations
  104. Like Behavior of Similar Users: "I like the same people that you like. What other people did you like that I haven't seen?" MLlib: Matrix Factorization, User-Item Similarity
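
A minimal Scala sketch of the matrix-factorization approach named on slide 104, using MLlib's ALS on a ratings file laid out like the sample dataset from slide 71; the file path, CSV layout, and hyper-parameters are assumptions for illustration:

```scala
import org.apache.spark.mllib.recommendation.{ALS, Rating}

// UserID,ProfileID,Rating(1-10) as in the sample dataset (path assumed, not from the deck)
val ratings = sc.textFile("file:/root/pipeline/datasets/dating/ratings.csv")
  .map(_.split(","))
  .map { case Array(userId, profileId, rating) =>
    Rating(userId.toInt, profileId.toInt, rating.toDouble) }

// Factorize the user x profile matrix into rank-10 latent factors
val model = ALS.train(ratings, rank = 10, iterations = 10, lambda = 0.01)

// Top 10 profile recommendations for user 1 (the "10 Recommendations" goal from slide 85)
val top10 = model.recommendProducts(1, 10)
top10.foreach(println)
```
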
  105. Demo! Generate Personalized Recommendations using Collaborative Filtering & Matrix Factorization
  106. Similar Text-based Profiles as Me: "Our profiles have similar keywords and named entities. We might like each other!" MLlib: Word2Vec, TF/IDF, k-skip n-grams
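
A short Scala sketch of the MLlib text featurization named on slide 106 (TF/IDF and Word2Vec); the profile file path, the naive whitespace tokenization, and the example query word are illustrative assumptions:

```scala
import org.apache.spark.mllib.feature.{HashingTF, IDF, Word2Vec}

// One profile per line, tokenized naively on whitespace (illustrative preprocessing)
val profiles = sc.textFile("file:/root/pipeline/datasets/dating/profiles.txt")  // assumed path
  .map(_.toLowerCase.split("\\s+").toSeq)
  .cache()

// TF/IDF vectors, one per profile
val tf = new HashingTF(1 << 18).transform(profiles)
tf.cache()
val idfModel = new IDF().fit(tf)
val tfidf = idfModel.transform(tf)  // RDD[Vector]

// Word2Vec embeddings over the same corpus; find related keywords
val w2vModel = new Word2Vec().fit(profiles)
val related = w2vModel.findSynonyms("hiking", 5)  // illustrative query word
related.foreach { case (word, cosine) => println(s"$word ($cosine)") }
```
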
  107. Similar Profiles to Previous Likes: "Your profile text has similar keywords and named entities to other profiles of people I like. I might like you, too!" MLlib: Word2Vec, TF/IDF, Doc Similarity
  108. Relevant, High-Value Emails: "Your initial email references a lot of things in my profile. I might like you for making the effort!" MLlib: Word2Vec, TF/IDF, Entity Recognition. (Screenshot comparing her email with my profile.)
  109. Demo! Feature Engineering for Text/NLP Use Cases
  110. The Future of Recommendations
  111. Eigenfaces: Facial Recognition: "Your face looks similar to others that I've liked. I might like you." MLlib: RowMatrix, PCA, Item-Item Similarity. Image courtesy of http://crockpotveggies.com/2015/02/09/automating-tinder-with-eigenfaces.html
  112. NLP Conversation Starter Bot! "If your responses to my generic opening lines are positive, I may read your profile." MLlib: TF/IDF, DecisionTrees, Sentiment Analysis (positive vs. negative)
  113. Maintaining the Spark
  114. Recommendations for Couples: "I want Mad Max. You want Message In a Bottle. Let's find something in between to watch tonight." MLlib: RowMatrix, Item-Item Similarity. GraphX: Nearest Neighbors, Shortest Path (similar plots, similar actors)
  115. Final Recommendation!
  116. Get Off the Computer & Meet People! Thank you, Zurich!! Chris Fregly, @cfregly, IBM Spark Technology Center, San Francisco, CA, USA. Relevant Links: advancedspark.com (sign up for the book & global meetup!); github.com/fluxcapacitor/pipeline (clone, contribute, and commit code!); hub.docker.com/r/fluxcapacitor/pipeline/wiki (run all demos in your own environment with Docker!)
  117. More Relevant Links: http://meetup.com/Advanced-Apache-Spark-Meetup; http://advancedspark.com; http://github.com/fluxcapacitor/pipeline; http://hub.docker.com/r/fluxcapacitor/pipeline; http://sortbenchmark.org/ApacheSpark2014.pdf; https://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html; http://0x0fff.com/spark-architecture-shuffle/; http://www.cs.berkeley.edu/~kubitron/courses/cs262a-F13/projects/reports/project16_report.pdf; http://stackoverflow.com/questions/763262/how-does-one-write-code-that-best-utilizes-the-cpu-cache-to-improve-performance; http://www.aristeia.com/TalkNotes/ACCU2011_CPUCaches.pdf; http://mishadoff.com/blog/java-magic-part-4-sun-dot-misc-dot-unsafe/; http://docs.scala-lang.org/overviews/quasiquotes/intro.html; http://lwn.net/Articles/252125/ (Memory Part 2: CPU Caches); http://lwn.net/Articles/255364/ (Memory Part 5: What Programmers Can Do); https://www.safaribooksonline.com/library/view/java-performance-the/9781449363512/ch04.html; http://web.eece.maine.edu/~vweaver/projects/perf_events/perf_event_open.html; http://www.brendangregg.com/perf.html; https://perf.wiki.kernel.org/index.php/Tutorial; http://techblog.netflix.com/2015/07/java-in-flames.html; http://techblog.netflix.com/2015/04/introducing-vector-netflixs-on-host.html; http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html#Java
  118. What's Next?
  119. What's Next? Autoscaling Spark Workers; completely Docker-based (Docker Compose and Docker Machine); lots of demos and examples (Zeppelin & IPython/Jupyter notebooks, advanced streaming use cases, advanced ML, Graph, and NLP use cases); Performance Tuning and Profiling (work closely with Brendan Gregg & Netflix; surface & share more low-level details of Spark internals)
  120. Upcoming Meetups and Conferences: London Spark Meetup (Oct 12th), Scotland Data Science Meetup (Oct 13th), Dublin Spark Meetup (Oct 15th), Barcelona Spark Meetup (Oct 20th), Madrid Spark/Big Data Meetup (Oct 22nd), Paris Spark Meetup (Oct 26th), Amsterdam Spark Summit & Meetup (Oct 27th), Delft Dutch Data Science Meetup (Oct 29th), Brussels Spark Meetup (Oct 30th), Zurich Big Data Developers Meetup (Nov 2nd), Geneva Spark Meetup (Nov 5th), San Francisco Datapalooza (Nov 10th), San Francisco Advanced Apache Spark (Nov 12th), Oslo Big Data Hadoop Meetup (Nov 18th), Helsinki Spark Meetup (Nov 20th), Stockholm Spark Meetup (Nov 23rd), Copenhagen Spark Meetup (Nov 25th), Budapest Spark Meetup (Nov 27th), Singapore Strata Conference (Dec 1st), San Francisco Advanced Apache Spark (Dec 8th), Mountain View Advanced Apache Spark (Dec 10th), Washington DC Spark Meetup (Dec 17th)
  121. IBM Spark: Power of data. Simplicity of design. Speed of innovation.
