
Paris Spark Meetup Oct 26, 2015 - Spark After Dark v1.5 - Best of Advanced Apache Spark Meetup to Date


* Title *

Spark After Dark 1.5: Deep Dive Into Latest Perf and Scale Improvements in Spark Ecosystem

* Abstract *

Combining the most popular and technically deep material from his Advanced Apache Spark Meetup, Chris Fregly provides code-level deep dives into the latest performance and scalability advancements in the Apache Spark ecosystem, covering the following:

1) Building a Scalable and Performant Spark SQL/DataFrames Data Source Connector such as Spark-CSV, Spark-Cassandra, Spark-ElasticSearch, and Spark-Redshift

2) Speeding Up Spark SQL Queries using Partition Pruning and Predicate Pushdowns with CSV, JSON, Parquet, Avro, and ORC

3) Tuning Spark Streaming Performance and Fault Tolerance with KafkaRDD and KinesisRDD

4) Maintaining Stability during High Scale Streaming Ingestion using Approximations and Probabilistic Data Structures from Spark, Redis, and Twitter's Algebird

5) Building Effective Machine Learning Models using Feature Engineering, Dimension Reduction, and Natural Language Processing with MLlib/GraphX, ML Pipelines, DIMSUM, Locality Sensitive Hashing, and Stanford's CoreNLP

6) Tuning Core Spark Performance by Acknowledging Mechanical Sympathy for the Physical Limitations of OS and Hardware Resources such as CPU, Memory, Network, and Disk with Project Tungsten, Asynchronous Netty, and Linux epoll


* Demos *

This talk features many interesting and audience-interactive demos - as well as code-level deep dives into many of the projects listed above.

All demo code is available on GitHub at the following link: https://github.com/fluxcapacitor/pipeline/wiki

In addition, the entire demo environment has been Dockerized and made available for download on Docker Hub at the following link: https://hub.docker.com/r/fluxcapacitor/pipeline/


* Speaker Bio *

Chris Fregly is a Principal Data Solutions Engineer for the newly-formed IBM Spark Technology Center, an Apache Spark Contributor, a Netflix Open Source Committer, as well as the Organizer of the global Advanced Apache Spark Meetup and Author of the Upcoming Book, Advanced Spark.

Previously, Chris was a Data Solutions Engineer at Databricks and a Streaming Data Engineer at Netflix.

When Chris isn’t contributing to Spark and other open source projects, he’s creating book chapters, slides, and demos to share knowledge with his peers at meetups and conferences throughout the world.

Published in: Software


  1. IBM Spark | spark.tc. Spark After Dark 1.5: High Performance, Real-time, Streaming, Machine Learning, Natural Language Processing, Text Analytics, and Recommendations. Chris Fregly, Principal Data Solutions Engineer, IBM Spark Technology Center. ** We’re Hiring -- Only Nice People, Please!! ** Paris Spark Meetup, October 26, 2015
  2. Power of data. Simplicity of design. Speed of innovation. Who Am I? Streaming Data Engineer, Netflix Open Source Committer; Data Solutions Engineer, Apache Spark Contributor; Principal Data Solutions Engineer, IBM Spark Technology Center; Meetup Organizer, Advanced Apache Spark Meetup; Book Author, Advanced Spark (2016)
  3. Advanced Apache Spark Meetup (meetup.com/Advanced-Apache-Spark-Meetup). Meetup Metrics: 1400+ members in just 3 mos! 4th most active Spark Meetup!! Meetup Goals: dig deep into Spark & extended-Spark codebase; study integrations incl Cassandra, ElasticSearch, Tachyon, S3, BlinkDB, Mesos, YARN, Kafka, R; surface & share patterns & idioms of these well-designed, distributed, big data components
  4. Upcoming Meetups and Conferences: London Spark Meetup (Oct 12th); Scotland Data Science Meetup (Oct 13th); Dublin Spark Meetup (Oct 15th); Barcelona Spark Meetup (Oct 20th); Madrid Spark/Big Data Meetup (Oct 22nd); Paris Spark Meetup (Oct 26th); Amsterdam Spark Summit & Meetup (Oct 27th); Delft Dutch Data Science Meetup (Oct 29th); Brussels Spark Meetup (Oct 30th); Zurich Big Data Developers Meetup (Nov 2nd); Geneva Spark Meetup (Nov 5th); San Francisco Datapalooza (Nov 10th); San Francisco Advanced Apache Spark Meetup (Nov 12th); Oslo Big Data Hadoop Meetup (Nov 18th); Helsinki Spark Meetup (Nov 20th); Stockholm Spark Meetup (Nov 23rd); Copenhagen Spark Meetup (Nov 25th); Budapest Spark Meetup (Nov 27th); Singapore Strata Conference (Dec 1st); San Francisco Advanced Apache Spark Meetup (Dec 8th); Mountain View Advanced Apache Spark Meetup (Dec 10th); Washington DC Advanced Apache Spark Meetup (Dec 17th). Freg-a-palooza!
  5. What is Spark After Dark? Fun, Spark-based dating reference application (*not a movie recommendation engine!!). Generate recommendations based on user similarity. Demonstrate Apache Spark and related big data projects.
  6. Tools of this Talk: Redis; Docker; Ganglia; Streaming, Kafka; Cassandra, NoSQL; Parquet, JSON, ORC, Avro; Apache Zeppelin Notebooks; Spark SQL, DataFrames, Hive; ElasticSearch, Logstash, Kibana; Spark ML, GraphX, Stanford CoreNLP; and…
  7. Overall Themes of this Talk: Filter Early, Filter Deep; Approximations are OK; Minimize Random Seeks; Maximize Sequential Scans; Go Off-Heap when Possible; Parallelism is Required at Scale; Must Reduce Dimensions at Scale; Seek Performance Gains at all Layers; Customize Data Structs for your Workload; Be Nice and Collaborate with your Peers!
  8. High-Level Sections: Spark Core: Performance Tuning; Spark SQL: DataSources and Tuning; Spark Streaming: Scale, Tuning, Approx; Spark ML: Scale, Dim Reduce, NLP
  9. Spark Core: Performance Tuning. Acknowledging Mechanical Sympathy; 100TB Daytona GraySort Challenge; Project Tungsten
  10. Acknowledging Mechanical Sympathy: “Hardware and software working together in harmony” -Martin Thompson, http://mechanical-sympathy.blogspot.com. Spark Mechanical Sympathy Concerns: Saturate Network I/O; Saturate Disk I/O; Minimize Memory Footprint and GC; Maximize CPU Cache Locality
  11. Spark and Mechanical Sympathy. Daytona GraySort (Spark 1.1-1.2): Saturate Network I/O; Saturate Disk I/O. Project Tungsten (Spark 1.4-1.6): Minimize Memory and GC; Maximize CPU Cache Locality
  12. AlphaSort Trick for Sorting (AlphaSort paper, 1995, Chris Nyberg and Jim Gray). Naïve List (Pointer-to-Record): requires the key to be dereferenced for comparison. AlphaSort List (Key, Pointer): key is directly available for comparison
  13. CPU Cache Line and Memory Sympathy. (Key, Pointer): Key (10 bytes) + Pointer (4 bytes*) = 14 bytes; not a power of 2 in size, so not cache-line friendly. (*4 bytes when using compressed OOPs, <32 GB heap.) Add Padding (2 bytes): Key (10 bytes) + Pad (2 bytes) + Pointer (4 bytes) = 16 bytes; cache-line friendly. (Key-Prefix, Pointer): Prefix (4 bytes) + Pointer (4 bytes) = 8 bytes; 2x cache-line friendly, but key distribution affects perf
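The (Key-Prefix, Pointer) layout on this slide can be sketched in plain Java. This is a minimal illustration, not Spark's implementation: the class `KeyPrefixSort` and its methods are hypothetical names, the "pointer" is simply an index into a key array, and ties on the 4-byte prefix fall back to a full key comparison. Each sort entry is one 8-byte `long`, so the main sort never dereferences a key.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class KeyPrefixSort {

    /** First 4 key bytes as an unsigned big-endian value; short keys are zero-padded. */
    static long prefixOf(byte[] key) {
        long prefix = 0;
        for (int i = 0; i < 4; i++) {
            int b = i < key.length ? (key[i] & 0xFF) : 0;
            prefix = (prefix << 8) | b;
        }
        return prefix;
    }

    /** Returns record indices ordered by ascending (unsigned, lexicographic) key. */
    static int[] sortedOrder(byte[][] keys) {
        long[] packed = new long[keys.length];
        for (int i = 0; i < keys.length; i++) {
            // Prefix in the high 32 bits, record index ("pointer") in the low 32 bits.
            // Flipping the sign bit makes signed long order match unsigned prefix order.
            packed[i] = ((prefixOf(keys[i]) << 32) | i) ^ Long.MIN_VALUE;
        }
        Arrays.sort(packed); // compares 8-byte entries only; no key is dereferenced here

        int[] order = new int[keys.length];
        for (int i = 0; i < packed.length; i++) {
            order[i] = (int) packed[i]; // low 32 bits hold the record index
        }
        // Equal prefixes are ambiguous, so only those runs use full key comparison.
        int start = 0;
        while (start < packed.length) {
            int end = start + 1;
            while (end < packed.length && (packed[end] >>> 32) == (packed[start] >>> 32)) {
                end++;
            }
            if (end - start > 1) {
                Integer[] run = new Integer[end - start];
                for (int i = 0; i < run.length; i++) run[i] = order[start + i];
                Arrays.sort(run, (a, b) -> Arrays.compareUnsigned(keys[a], keys[b]));
                for (int i = 0; i < run.length; i++) order[start + i] = run[i];
            }
            start = end;
        }
        return order;
    }

    public static void main(String[] args) {
        String[] words = {"pear", "apple", "applesauce", "fig", "apricot"};
        byte[][] keys = new byte[words.length][];
        for (int i = 0; i < words.length; i++) {
            keys[i] = words[i].getBytes(StandardCharsets.UTF_8);
        }
        System.out.println(Arrays.toString(sortedOrder(keys)));
    }
}
```

The fallback pass is where "key distribution affects perf" shows up: if many keys share the same first four bytes, most comparisons end up dereferencing the full key again.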
  14. Performance Comparison
  15. Similar Technique: Direct Cache Access. Packet header placed into CPU cache
  16. CPU Cache Lines
  17. Instrumenting and Monitoring CPU: Linux perf command
  18. CPU Cache Naïve Matrix Multiplication: skipping row-wise, not using full CPU cache line, ineffective pre-fetching
      // Find dot product of each row and column vector
      for (i = 0; i < N; ++i)
        for (j = 0; j < N; ++j)
          for (k = 0; k < N; ++k)
            res[i][j] += matA[i][k] * matB[k][j];
  19. CPU Cache Friendly Matrix Multiplication: good use of CPU cache line, effective prefetching
      // Transpose B
      for (i = 0; i < N; ++i)
        for (j = 0; j < N; ++j)
          matBtran[i][j] = matB[j][i];
      // Modify dot product calculation for B transpose
      for (i = 0; i < N; ++i)
        for (j = 0; j < N; ++j)
          for (k = 0; k < N; ++k)
            res[i][j] += matA[i][k] * matBtran[j][k];
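The two loop nests from slides 18 and 19 can be put side by side in a runnable form. This is a sketch, not the demo's `com.advancedspark.tungsten.matrix` code: the class name `MatrixMultiply` and the 256x256 size in `main` are assumptions for illustration. Both versions accumulate in the same `k` order, so they produce identical results and differ only in how memory is walked.

```java
import java.util.Arrays;
import java.util.Random;

final class MatrixMultiply {

    /** Naive version: the inner loop reads b down a column, touching a new row array each step. */
    static double[][] naive(double[][] a, double[][] b) {
        int n = a.length;
        double[][] res = new double[n][n];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                for (int k = 0; k < n; k++)
                    res[i][j] += a[i][k] * b[k][j];
        return res;
    }

    /** Cache-friendly version: transpose b once so the inner loop scans two rows sequentially. */
    static double[][] cacheFriendly(double[][] a, double[][] b) {
        int n = a.length;
        double[][] bT = new double[n][n];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                bT[i][j] = b[j][i];

        double[][] res = new double[n][n];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                for (int k = 0; k < n; k++)
                    res[i][j] += a[i][k] * bT[j][k];
        return res;
    }

    public static void main(String[] args) {
        int n = 256;
        Random random = new Random(42);
        double[][] a = new double[n][n];
        double[][] b = new double[n][n];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) {
                a[i][j] = random.nextDouble();
                b[i][j] = random.nextDouble();
            }

        long t0 = System.nanoTime();
        double[][] r1 = naive(a, b);
        long t1 = System.nanoTime();
        double[][] r2 = cacheFriendly(a, b);
        long t2 = System.nanoTime();

        System.out.println("naive ms:          " + (t1 - t0) / 1_000_000);
        System.out.println("cache-friendly ms: " + (t2 - t1) / 1_000_000);
        System.out.println("same result:       " + Arrays.deepEquals(r1, r2));
    }
}
```

Wall-clock timings from a single un-warmed run like this are only indicative; hardware counters (for example via `perf stat`) give a more direct view of cache misses.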
  20. Demo! Comparing CPU Naïve & Cache-Friendly Matrix Multiplication
  21. Results of Naïve vs. Cache Friendly Matrix Multiply. Improvement factors shown on the slide: ~72x, ~8x, ~3x, ~3x, ~2x, ~7x, ~10x. Command used:
      perf stat --repeat 5 --scale --event L1-dcache-load-misses,L1-dcache-prefetch-misses,LLC-load-misses,LLC-prefetch-misses,cache-misses,stalled-cycles-frontend java -Xmx13G -XX:-Inline -jar ~/sbt/bin/sbt-launch.jar "tungsten/run-main com.advancedspark.tungsten.matrix.Cache[Friendly|Naïve]MatrixMultiply 256 1"
  22. Visualizing and Finding Hotspots: Flame Graphs with Java Stack Traces. Images courtesy of http://techblog.netflix.com/2015/07/java-in-flames.html
  23. 100TB Daytona GraySort Challenge. Focus on Network and Disk I/O Optimizations; Improve Data Structs/Algos for Sort & Shuffle; Saturate Network and Disk Controllers
  24. Winning Results (2013 vs. 2014). Spark Goals: Saturate Network I/O; Saturate Disk I/O
  25. Winning Hardware Configuration. Compute: 206 EC2 worker nodes, 1 master node; AWS i2.8xlarge; 32 cores (Intel Xeon CPU E5-2670 @ 2.5 GHz); 244 GB RAM; 8 x 800 GB SSD, RAID 0 striping, ext4; NOOP I/O scheduler: FIFO, request merging, no reordering; 3 GBps mixed read/write disk I/O per node. Network: deployed within Placement Group/VPC; using AWS Enhanced Networking; Single Root I/O Virtualization (SR-IOV): extension of PCIe; 10 Gbps, low latency, low jitter (iperf showed ~9.5 Gbps)
  26. Winning Software Configuration. Spark 1.2, OpenJDK 1.7_<amazon-something>_u65-b17. Disable caching, compression, spec execution, shuffle spill. Force NODE_LOCAL task scheduling for optimal data locality. HDFS 2.4.1 short-circuit for local reads, 2x replication. Spark recommendation is 4-6 partitions (tasks) per core: 206 nodes * 32 cores = 6592 cores; 6592 cores * 4 = 26,368 partitions; 6592 cores * 6 = 39,552 partitions; 6592 cores * 4.25 ~= 28,000 partitions was empirically best. Range partitioning takes advantage of sequential keyspace
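The partition arithmetic on this slide is easy to reproduce. A small sketch follows; `PartitionSizing` and its parameter names are hypothetical, and the 4 to 6 partitions-per-core range is simply the figure quoted on the slide.

```java
final class PartitionSizing {

    /** Total partitions for a cluster, given a target number of partitions per core. */
    static long partitions(int nodes, int coresPerNode, double partitionsPerCore) {
        return Math.round(nodes * coresPerNode * partitionsPerCore);
    }

    public static void main(String[] args) {
        int nodes = 206;        // worker nodes from the winning configuration
        int coresPerNode = 32;  // cores per i2.8xlarge node
        System.out.println("cores:        " + (nodes * coresPerNode));
        System.out.println("low  (4x):    " + partitions(nodes, coresPerNode, 4));
        System.out.println("high (6x):    " + partitions(nodes, coresPerNode, 6));
        System.out.println("used (4.25x): " + partitions(nodes, coresPerNode, 4.25));
    }
}
```

The exact product for 4.25 partitions per core is 28,016, which the slide rounds to 28,000.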
  27. New Shuffle Manager. New “Sort-based” shuffle manager replaces Hash-based. New Data Structures and Algos for Shuffle Sort, i.e. new TimSort for arrays of (K,V) pairs
  28. New Network Module. Replaces old java.nio, low-level, socket-based code; zero-copy epoll: data stays in kernel space between disk & network; custom memory management; spark.shuffle.blockTransferService=netty. Spark-Netty Performance Tuning: spark.shuffle.io.numConnectionsPerPeer: increase to saturate hosts with multiple disks; spark.shuffle.io.preferDirectBuffers: on- or off-heap (off-heap is default)
  29. New Algorithms and Data Structures, optimized for sort and shuffle. o.a.s.util.collection.TimSort: based on JDK 1.7 TimSort; performs best on partially-sorted datasets; optimized for elements of (K,V) pairs; sorts impl of SortDataFormat (i.e. KVArraySortDataFormat). o.a.s.util.collection.AppendOnlyMap: open addressing hash, quadratic probing; array of [(K, V), (K, V)]; good memory locality; keys never removed, values only appended
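The AppendOnlyMap idea (open addressing, quadratic probing, keys and values interleaved in one array, no removal) can be sketched as follows. This is an illustrative simplification, not Spark's source: `AppendOnlyCounts`, the 0.7 load factor, and the hash-spreading step are assumptions, and values are restricted to running `long` totals to keep the example short.

```java
final class AppendOnlyCounts {
    private Object[] data;      // [k0, v0, k1, v1, ...]: each key sits next to its value
    private int capacity = 8;   // number of slots, always a power of two
    private int size = 0;

    AppendOnlyCounts() {
        data = new Object[2 * capacity];
    }

    /** Finds the slot holding key, or the empty slot where it belongs. */
    private static int slotFor(Object key, Object[] table, int cap) {
        int mask = cap - 1;
        int h = key.hashCode();
        int pos = (h ^ (h >>> 16)) & mask;
        int delta = 1;
        while (true) {
            Object current = table[2 * pos];
            if (current == null || current.equals(key)) {
                return pos;
            }
            pos = (pos + delta) & mask; // quadratic (triangular) probing: steps of 1, 2, 3, ...
            delta++;
        }
    }

    /** Adds amount to the running total for key, inserting the key if it is new. */
    void add(String key, long amount) {
        int pos = slotFor(key, data, capacity);
        if (data[2 * pos] == null) {
            data[2 * pos] = key;
            data[2 * pos + 1] = amount;
            size++;
            if (size > capacity * 0.7) {
                grow();
            }
        } else {
            data[2 * pos + 1] = (Long) data[2 * pos + 1] + amount;
        }
    }

    /** Returns the running total for key, or 0 if the key was never added. */
    long get(String key) {
        int pos = slotFor(key, data, capacity);
        return data[2 * pos] == null ? 0L : (Long) data[2 * pos + 1];
    }

    int size() {
        return size;
    }

    /** Doubles the table and re-inserts every pair; there are no tombstones since keys are never removed. */
    private void grow() {
        int newCapacity = capacity * 2;
        Object[] newData = new Object[2 * newCapacity];
        for (int i = 0; i < capacity; i++) {
            Object key = data[2 * i];
            if (key != null) {
                int pos = slotFor(key, newData, newCapacity);
                newData[2 * pos] = key;
                newData[2 * pos + 1] = data[2 * i + 1];
            }
        }
        data = newData;
        capacity = newCapacity;
    }

    public static void main(String[] args) {
        AppendOnlyCounts counts = new AppendOnlyCounts();
        for (String word : "to be or not to be".split(" ")) {
            counts.add(word, 1);
        }
        System.out.println("to=" + counts.get("to") + " be=" + counts.get("be") + " size=" + counts.size());
    }
}
```

Because deletion is not supported, a probe can stop at the first empty slot, which keeps lookups simple and avoids tombstone bookkeeping.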
  30. Met Performance Goals! Reducers: 1.1 GBps/node network I/O (theoretical max = 1.25 GBps for 10 Gbps Ethernet); 206 nodes * 1.1 GBps/node ~= 227 GBps. Mappers: 3 GBps/node disk I/O (8 x 800 GB SSD)
  31. Shuffle Performance Tuning Tips. Hash Shuffle Manager (no longer default): spark.shuffle.consolidateFiles: mapper output files; o.a.s.shuffle.FileShuffleBlockResolver. Intermediate Files: increase spark.shuffle.file.buffer to reduce seeks & sys calls; increase spark.reducer.maxSizeInFlight if memory allows; use a smaller number of larger workers to reduce total files. SQL: BroadcastHashJoin vs. ShuffledHashJoin: spark.sql.autoBroadcastJoinThreshold; use DataFrame.explain(true) or EXPLAIN to verify
  32. Project Tungsten. Focus on CPU Cache and Memory Optimizations; Further Improve Data Structures and Algorithms; Operate on Serialized/Compressed Data; Provide Path to Off Heap
  33. Why is CPU the Bottleneck? Network and disk I/O bandwidth are relatively high; GraySort optimizations improved network & shuffle; more partitioning, pruning, and predicate pushdowns; popularity of columnar file formats like Parquet/ORC; CPU is used for serialization, hashing, compression!
  34. Spark Shuffle Managers (spark.shuffle.manager). hash (< 10,000 reducers): output file determined by hashing the key of the (K,V) pair; each mapper creates an output buffer/file per reducer; leads to M*R output buffers/files per shuffle. sort (>= 10,000 reducers): default since Spark 1.2; minimizes OS resources; uses Netty to optimize network I/O; created custom data structs/algos; wins Daytona GraySort Challenge. unsafe -> Tungsten (default in Spark 1.5): uses sun.misc.Unsafe to self-manage binary array buffers; uses custom serialization format; can operate on compressed and serialized buffers
  35. New Data Structures: “I don’t know your data structure, but my array[] will beat it!” Custom Data Structures for Sort/Shuffle Workload: UnsafeRow; BytesToBytesMap
  36. sun.misc.Unsafe. Info: addressSize(), pageSize(). Objects: allocateInstance(), objectFieldOffset(). Classes: staticFieldOffset(), defineClass(), defineAnonymousClass(), ensureClassInitialized(). Synchronization: monitorEnter(), tryMonitorEnter(), monitorExit(), compareAndSwapInt(), putOrderedInt(). Arrays: arrayBaseOffset(), arrayIndexScale(). Memory: allocateMemory(), copyMemory(), freeMemory(), getAddress() (not guaranteed after GC), getInt()/putInt(), getBoolean()/putBoolean(), getByte()/putByte(), getShort()/putShort(), getLong()/putLong(), getFloat()/putFloat(), getDouble()/putDouble(), getObjectVolatile()/putObjectVolatile()
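The memory methods listed above (allocateMemory(), getLong()/putLong(), getInt()/putInt()) amount to storing records as raw bytes at computed offsets instead of as Java objects. The sketch below shows the same idea with `java.nio.ByteBuffer.allocateDirect`, a supported off-heap API swapped in here for `sun.misc.Unsafe`, which is internal and increasingly restricted on recent JDKs. The class `OffHeapRows` and its fixed (userId, rating) row layout are hypothetical.

```java
import java.nio.ByteBuffer;

final class OffHeapRows {
    // Fixed-width row: 8-byte user id followed by a 4-byte rating.
    static final int ROW_BYTES = Long.BYTES + Integer.BYTES;

    private final ByteBuffer buffer; // direct buffer: storage lives outside the Java heap
    private int rows = 0;

    OffHeapRows(int maxRows) {
        buffer = ByteBuffer.allocateDirect(maxRows * ROW_BYTES);
    }

    /** Appends one row by writing its fields at computed byte offsets. */
    void append(long userId, int rating) {
        int offset = rows * ROW_BYTES;
        buffer.putLong(offset, userId);
        buffer.putInt(offset + Long.BYTES, rating);
        rows++;
    }

    long userId(int row) {
        return buffer.getLong(row * ROW_BYTES);
    }

    int rating(int row) {
        return buffer.getInt(row * ROW_BYTES + Long.BYTES);
    }

    /** Scans the rating field sequentially without materializing any row objects. */
    long sumRatings() {
        long sum = 0;
        for (int row = 0; row < rows; row++) {
            sum += rating(row);
        }
        return sum;
    }

    public static void main(String[] args) {
        OffHeapRows table = new OffHeapRows(3);
        table.append(1001L, 7);
        table.append(1002L, 9);
        table.append(1003L, 4);
        System.out.println("user of row 1: " + table.userId(1));
        System.out.println("sum of ratings: " + table.sumRatings());
    }
}
```

Since the rows are plain bytes with an exactly known size, the garbage collector has a single buffer object to track rather than one object per record.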
  37. Spark + sun.misc.Unsafe: over 200 source files affected!!
      org.apache.spark.sql.execution: aggregate.SortBasedAggregate, aggregate.TungstenAggregate, aggregate.AggregationIterator, aggregate.udaf, aggregate.utils, SparkPlanner, rowFormatConverters, UnsafeFixedWidthAggregationMap, UnsafeExternalSorter, UnsafeExternalRowSorter, UnsafeKeyValueSorter, UnsafeKVExternalSorter, local.ConvertToUnsafeNode, local.ConvertToSafeNode, local.HashJoinNode, local.ProjectNode, local.LocalNode, local.BinaryHashJoinNode, local.NestedLoopJoinNode, joins.HashJoin, joins.HashSemiJoin, joins.HashedRelation, joins.BroadcastHashJoin, joins.ShuffledHashOuterJoin (not yet converted), joins.BroadcastHashOuterJoin, joins.BroadcastLeftSemiJoinHash, joins.BroadcastNestedLoopJoin, joins.SortMergeJoin, joins.LeftSemiJoinBNL, joins.SortMergeOuterJoin, Exchange, SparkPlan, UnsafeRowSerializer, SortPrefixUtils, sort, basicOperators, aggregate.SortBasedAggregationIterator, aggregate.TungstenAggregationIterator, datasources.WriterContainer, datasources.json.JacksonParser, datasources.jdbc.JDBCRDD
      org.apache.spark: unsafe.Platform, unsafe.KVIterator, unsafe.array.LongArray, unsafe.array.ByteArrayMethods, unsafe.array.BitSet, unsafe.bitset.BitSetMethods, unsafe.hash.Murmur3_x86_32, unsafe.map.BytesToBytesMap, unsafe.map.HashMapGrowthStrategy, unsafe.memory.TaskMemoryManager, unsafe.memory.ExecutorMemoryManager, unsafe.memory.MemoryLocation, unsafe.memory.UnsafeMemoryAllocator, unsafe.memory.MemoryAllocator (trait/interface), unsafe.memory.MemoryBlock, unsafe.memory.HeapMemoryAllocator, unsafe.sort.RecordComparator, unsafe.sort.PrefixComparator, unsafe.sort.PrefixComparators, unsafe.sort.UnsafeSorterSpillWriter, serializer.DummySerializationInstance, shuffle.unsafe.UnsafeShuffleManager, shuffle.unsafe.UnsafeShuffleSortDataFormat, shuffle.unsafe.SpillInfo, shuffle.unsafe.UnsafeShuffleWriter, shuffle.unsafe.UnsafeShuffleExternalSorter, shuffle.unsafe.PackedRecordPointer, shuffle.ShuffleMemoryManager, util.collection.unsafe.sort.UnsafeSorterSpillMerger, util.collection.unsafe.sort.UnsafeSorterSpillReader, util.collection.unsafe.sort.UnsafeSorterSpillWriter, util.collection.unsafe.sort.UnsafeShuffleInMemorySorter, util.collection.unsafe.sort.UnsafeInMemorySorter, util.collection.unsafe.sort.RecordPointerAndKeyPrefix, util.collection.unsafe.sort.UnsafeSorterIterator, network.shuffle.ExternalShuffleBlockResolver, scheduler.Task, rdd.SqlNewHadoopRDD, executor.Executor
      org.apache.spark.sql.catalyst.expressions: regexpExpressions, BoundAttribute, SortOrder, SpecializedGetters, ExpressionEvalHelper, UnsafeArrayData, UnsafeReaders, UnsafeMapData, Projection, LiteralGenerator, UnsafeRow, JoinedRow, InputFileName, SpecificMutableRow, codegen.CodeGenerator, codegen.GenerateProjection, codegen.GenerateUnsafeRowJoiner, codegen.GenerateSafeProjection, codegen.GenerateUnsafeProjection, codegen.BufferHolder, codegen.UnsafeRowWriter, codegen.UnsafeArrayWriter, complexTypeCreator, rows, literals, misc, stringExpressions
  38. CPU & Memory Optimizations. Custom Managed Memory: reduces GC overhead; both on and off heap; exact size calculations. Direct Binary Processing: operate on serialized/compressed arrays; Kryo can reorder serialized records; LZF can reorder compressed records. More CPU Cache-aware Data Structs & Algorithms: o.a.s.unsafe.map.BytesToBytesMap vs. j.u.HashMap. Code Generation (default in 1.5): generate source code from overall query plan; Janino generates bytecode from source code; 100+ UDFs converted to use code generation. Related classes: UnsafeFixedWidthAggregationMap & TungstenAggregationIterator; CodeGenerator & GenerateUnsafeRowJoiner; UnsafeSortDataFormat & UnsafeShuffleSortDataFormat & PackedRecordPointer & UnsafeRow; UnsafeInMemorySorter & UnsafeExternalSorter & UnsafeShuffleWriter; mostly same join code, added if (isUnsafeMode); UnsafeShuffleManager & UnsafeShuffleInMemorySorter & UnsafeShuffleExternalSorter. Details in SPARK-7075
  39. Code Generation (Default in 1.5). Problem: generic expression evaluation is expensive on the JVM: virtual func calls; branches based on expression type; boxing causes excessive object creation. Implementation: defer source code generation to each operator, type, etc; Scala quasiquotes provide AST manipulation & rewriting; generates source code, compiled to bytecode w/ Janino; 100+ UDFs now using code gen
  40. Code Generation: Spark SQL UDFs. 100+ UDFs now using code gen – more to come in Spark 1.6! Details in SPARK-8159
  41. Project Tungsten in Other Spark Libraries. SortDataFormat<K, Buffer>: base trait; UncompressedInBlockSort: MLlib.ALS; EdgeArraySortDataFormat: GraphX.Edge
  42. Spark SQL: DataSources and Tuning. Understand Partitions, Pruning, Predicate Pushdowns; Understand DataFrames, Catalyst, DataSources; Create a DataSource Implementation
  43. Partitions. Partition based on data usage patterns (use case: access users by gender): /genders.parquet/gender=M/…, /gender=F/…, /gender=U/…. Partition Discovery (Read Path): infer partitions from organization of data (i.e. gender=F). Dynamic Partitions (Write Path): dynamically create partitions based on given column(s)
      SQL: INSERT INTO TABLE genders PARTITION (gender) SELECT …
      DF:  gendersDF.write.format("parquet").partitionBy("gender").save(…)
  44. Pruning. Partition Pruning: filter out entire partitions of rows that have been pre-partitioned, e.g. SELECT id, gender FROM genders WHERE gender = 'U' (gender = partition key). Column Pruning: filter out entire columns for all rows if not required; optimized for columnar storage formats (Parquet); minimize data shuffle during joins
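Partition discovery and partition pruning can be illustrated without Spark: the partition value is read from the `key=value` directory name, and paths whose value cannot match the filter are never opened. This is a sketch under stated assumptions; `PartitionPruning` and the `part-0.parquet` file names are hypothetical, and real engines handle typed values and nested partition columns.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class PartitionPruning {

    /** Partition discovery: collects key=value directory segments from a path. */
    static Map<String, String> partitionValues(String path) {
        Map<String, String> values = new LinkedHashMap<>();
        for (String segment : path.split("/")) {
            int eq = segment.indexOf('=');
            if (eq > 0) {
                values.put(segment.substring(0, eq), segment.substring(eq + 1));
            }
        }
        return values;
    }

    /** Partition pruning: keeps only paths whose partition column equals the wanted value. */
    static List<String> prune(List<String> paths, String column, String wanted) {
        List<String> kept = new ArrayList<>();
        for (String path : paths) {
            if (wanted.equals(partitionValues(path).get(column))) {
                kept.add(path);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> paths = Arrays.asList(
                "/genders.parquet/gender=M/part-0.parquet",
                "/genders.parquet/gender=F/part-0.parquet",
                "/genders.parquet/gender=U/part-0.parquet");
        // Equivalent in spirit to: SELECT id, gender FROM genders WHERE gender = 'U'
        System.out.println(prune(paths, "gender", "U"));
    }
}
```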
  45. Predicate Pushdowns. “Predicate” == “Filter”. Filters rows as deep into the data source as possible. Predicate returns [true|false] for given func/condition
  46. Putting It All Together. Reduce Columns: Column Pruning. Reduce Rows: Partitioning, Predicate Pushdown. Example: SELECT b FROM table WHERE a IN (a2, a3)
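The query on this slide combines both reductions: only column b is returned (column pruning) and only rows where a is a2 or a3 survive (predicate pushdown). The sketch below mimics a source that applies both while scanning, loosely in the spirit of a pruned, filtered scan. `PushdownScan` and its `scan` signature are hypothetical and are not the Spark DataSources API.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class PushdownScan {

    /**
     * Scans rows, dropping non-matching rows at the source (predicate pushdown)
     * and emitting only the requested columns (column pruning).
     */
    static List<List<String>> scan(List<Map<String, String>> rows,
                                   List<String> requiredColumns,
                                   String filterColumn,
                                   Set<String> allowedValues) {
        List<List<String>> out = new ArrayList<>();
        for (Map<String, String> row : rows) {
            String value = row.get(filterColumn);
            if (value == null || !allowedValues.contains(value)) {
                continue; // row never leaves the data source
            }
            List<String> projected = new ArrayList<>();
            for (String column : requiredColumns) {
                projected.add(row.get(column));
            }
            out.add(projected);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Map<String, String>> table = List.of(
                Map.of("a", "a1", "b", "b1", "c", "c1"),
                Map.of("a", "a2", "b", "b2", "c", "c2"),
                Map.of("a", "a3", "b", "b3", "c", "c3"));
        // SELECT b FROM table WHERE a IN (a2, a3)
        Set<String> wanted = new HashSet<>(List.of("a2", "a3"));
        System.out.println(scan(table, List.of("b"), "a", wanted));
    }
}
```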
  47. DataFrames Overview. Use DataFrames instead of RDDs!! Inspired by R and Pandas DataFrames. Cross language support: SQL, Python, Scala, Java, R. Levels performance of Python, Scala, Java, and R: generates JVM bytecode vs serialize/pickle to Python. DataFrame is container for logical plan: lazy transformations represented as tree; Catalyst Optimizer creates physical plan; DataFrame.rdd returns the underlying RDD if needed. Custom UDF using registerFunction(). New, experimental UDAF support
  48. Catalyst Optimizer. Optimize DataFrame Transformation Tree: subquery elimination: use aliases to collapse subqueries; constant folding: replace expression with constant; simplify filters: remove unnecessary filters; predicate/filter pushdowns: avoid unnecessary data load; projection collapsing: avoid unnecessary projections; JVM code generation. Create Custom Rules: rules are Scala case classes that implement o.a.s.sql.catalyst.rules.Rule and can be applied to any stage, e.g. val newPlan = MyFilterRule(analyzedPlan)
  49. Columnar Storage Format. Skip whole chunks with min-max heuristics stored in each chunk (sorted data only)
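Min-max chunk skipping can be sketched in a few lines: each chunk stores the minimum and maximum of its values, and a scan consults those two numbers before touching the data. `MinMaxSkipping` and `Chunk` are hypothetical names, and the chunk size of three values is only for illustration; chunks are assumed to be non-empty.

```java
import java.util.Arrays;

final class MinMaxSkipping {

    /** A column chunk with min/max statistics computed once at write time. */
    static final class Chunk {
        final int[] values;
        final int min;
        final int max;

        Chunk(int... values) {
            this.values = values;
            this.min = Arrays.stream(values).min().getAsInt();
            this.max = Arrays.stream(values).max().getAsInt();
        }
    }

    /** Returns {matchCount, chunksActuallyScanned} for the predicate value == target. */
    static int[] countEquals(Chunk[] chunks, int target) {
        int matches = 0;
        int scanned = 0;
        for (Chunk chunk : chunks) {
            if (target < chunk.min || target > chunk.max) {
                continue; // statistics prove no value in this chunk can match
            }
            scanned++;
            for (int v : chunk.values) {
                if (v == target) {
                    matches++;
                }
            }
        }
        return new int[] {matches, scanned};
    }

    public static void main(String[] args) {
        Chunk[] sorted = {new Chunk(1, 2, 3), new Chunk(4, 5, 6), new Chunk(7, 8, 9)};
        Chunk[] unsorted = {new Chunk(1, 9, 5), new Chunk(2, 8, 5), new Chunk(3, 7, 4)};
        System.out.println("sorted:   " + Arrays.toString(countEquals(sorted, 5)));
        System.out.println("unsorted: " + Arrays.toString(countEquals(unsorted, 5)));
    }
}
```

The unsorted case shows why the slide says "sorted data only": when every chunk spans nearly the full value range, the statistics rule nothing out and every chunk is scanned.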
  50. Parquet File Format: based on Google Dremel; implemented by Twitter and Cloudera; columnar storage format; optimized for fast columnar aggregations; tight compression; supports pushdowns; nested, self-describing, evolving schema
  51. Types of Compression. Run Length Encoding: repeated data. Dictionary Encoding: fixed set of values. Delta, Prefix Encoding: sorted data
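The three encodings named on this slide are simple enough to sketch directly. The class `ColumnEncodings` and its method names are hypothetical; real columnar formats such as Parquet combine these ideas with bit-packing and page-level layout that this sketch leaves out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class ColumnEncodings {

    /** Run-length encoding: each run of repeated values becomes one {value, runLength} pair. */
    static List<int[]> runLengthEncode(int[] values) {
        List<int[]> runs = new ArrayList<>();
        for (int v : values) {
            int[] last = runs.isEmpty() ? null : runs.get(runs.size() - 1);
            if (last != null && last[0] == v) {
                last[1]++;
            } else {
                runs.add(new int[] {v, 1});
            }
        }
        return runs;
    }

    /** Dictionary encoding: distinct values go into the (initially empty) dictionary; rows store small codes. */
    static int[] dictionaryEncode(String[] values, List<String> dictionary) {
        Map<String, Integer> codes = new HashMap<>();
        int[] out = new int[values.length];
        for (int i = 0; i < values.length; i++) {
            Integer code = codes.get(values[i]);
            if (code == null) {
                code = dictionary.size();
                dictionary.add(values[i]);
                codes.put(values[i], code);
            }
            out[i] = code;
        }
        return out;
    }

    /** Delta encoding: first value as-is, then differences; small numbers when the data is sorted. */
    static int[] deltaEncode(int[] values) {
        int[] out = new int[values.length];
        for (int i = 0; i < values.length; i++) {
            out[i] = i == 0 ? values[0] : values[i] - values[i - 1];
        }
        return out;
    }

    /** Inverse of deltaEncode: a running sum restores the original values. */
    static int[] deltaDecode(int[] deltas) {
        int[] out = new int[deltas.length];
        for (int i = 0; i < deltas.length; i++) {
            out[i] = i == 0 ? deltas[0] : out[i - 1] + deltas[i];
        }
        return out;
    }

    public static void main(String[] args) {
        for (int[] run : runLengthEncode(new int[] {7, 7, 7, 2, 2, 9})) {
            System.out.println("value " + run[0] + " x " + run[1]);
        }
        List<String> dictionary = new ArrayList<>();
        int[] codes = dictionaryEncode(new String[] {"M", "F", "M", "U"}, dictionary);
        System.out.println(dictionary + " " + Arrays.toString(codes));
        System.out.println(Arrays.toString(deltaEncode(new int[] {100, 103, 104, 110})));
    }
}
```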
  52. Query Plan Debugging
      gendersCsvDF.select($"id", $"gender").filter("gender != 'F'").filter("gender != 'M'").explain(true)
      DataFrame.queryExecution.logical
      DataFrame.queryExecution.analyzed
      DataFrame.queryExecution.optimizedPlan
      DataFrame.queryExecution.executedPlan
  53. Query Plan Visualization & Query Metrics. Effectiveness of filter; CPU cache friendly binary format; cost-based join optimization (similar to MapReduce map-side join); peak memory for joins and aggs
  54. Demo! Show Various File Formats, Partitioning Schemes, DataSource Implementations, and Query Plans. Anonymous, Public Dating Dataset. RATINGS: UserID, ProfileID, Rating (1-10). GENDERS: UserID, Gender (M, F, U)
  55. DataSources API. Relations (o.a.s.sql.sources.interfaces.scala): BaseRelation (abstract class): provides schema of data; TableScan (impl): read all data from source; PrunedFilteredScan (impl): column pruning & predicate pushdowns; InsertableRelation (impl): insert/overwrite data based on SaveMode; RelationProvider (trait/interface): handle options, BaseRelation factory. Execution (o.a.s.sql.execution.commands.scala): RunnableCommand (trait/interface): common commands like EXPLAIN; ExplainCommand (impl: case class); CacheTableCommand (impl: case class). Filters (o.a.s.sql.sources.filters.scala): Filter (abstract class): handles all predicates/filters supported by this source; EqualTo (impl); GreaterThan (impl); StringStartsWith (impl)
  56. Native Spark SQL DataSources
  57. JSON Data Source
      DataFrame
        val ratingsDF = sqlContext.read.format("json")
          .load("file:/root/pipeline/datasets/dating/ratings.json.bz2")
        -- or --
        val ratingsDF = sqlContext.read.json("file:/root/pipeline/datasets/dating/ratings.json.bz2")
        Note: json() is a convenience method
      SQL Code
        CREATE TABLE genders USING json
        OPTIONS (path "file:/root/pipeline/datasets/dating/genders.json.bz2")
  58. JDBC Data Source
      Add Driver to Spark JVM System Classpath
        $ export SPARK_CLASSPATH=<jdbc-driver.jar>
      DataFrame
        val jdbcConfig = Map("driver" -> "org.postgresql.Driver",
          "url" -> "jdbc:postgresql://hostname:port/database",
          "dbtable" -> "schema.tablename")
        sqlContext.read.format("jdbc").options(jdbcConfig).load()
      SQL
        CREATE TABLE genders USING jdbc
        OPTIONS (url, dbtable, driver, …)
  59. Parquet Data Source
      Configuration
        spark.sql.parquet.filterPushdown=true
        spark.sql.parquet.mergeSchema=true
        spark.sql.parquet.cacheMetadata=true
        spark.sql.parquet.compression.codec=[uncompressed,snappy,gzip,lzo]
      DataFrames
        val gendersDF = sqlContext.read.format("parquet")
          .load("file:/root/pipeline/datasets/dating/genders.parquet")
        gendersDF.write.format("parquet").partitionBy("gender")
          .save("file:/root/pipeline/datasets/dating/genders.parquet")
      SQL
        CREATE TABLE genders USING parquet
        OPTIONS (path "file:/root/pipeline/datasets/dating/genders.parquet")
  60. ORC Data Source
      Configuration
        spark.sql.orc.filterPushdown=true
      DataFrames
        val gendersDF = sqlContext.read.format("orc")
          .load("file:/root/pipeline/datasets/dating/genders")
        gendersDF.write.format("orc").partitionBy("gender")
          .save("file:/root/pipeline/datasets/dating/genders")
      SQL
        CREATE TABLE genders USING orc
        OPTIONS (path "file:/root/pipeline/datasets/dating/genders")
  61. Third-Party Spark SQL DataSources
  62. CSV DataSource (Databricks)
      Github: https://github.com/databricks/spark-csv
      Maven: com.databricks:spark-csv_2.10:1.2.0
      Code
        val gendersCsvDF = sqlContext.read
          .format("com.databricks.spark.csv")
          .load("file:/root/pipeline/datasets/dating/gender.csv.bz2")
          .toDF("id", "gender")
        Note: toDF() is required if the CSV does not contain a header
  63. Avro DataSource (Databricks)
      Github: https://github.com/databricks/spark-avro
      Maven: com.databricks:spark-avro_2.10:2.0.1
      Code
        val df = sqlContext.read
          .format("com.databricks.spark.avro")
          .load("file:/root/pipeline/datasets/dating/gender.avro")
  64. ElasticSearch DataSource (Elastic.co)
      Github: https://github.com/elastic/elasticsearch-hadoop
      Maven: org.elasticsearch:elasticsearch-spark_2.10:2.1.0
      Code
        val esConfig = Map("pushdown" -> "true", "es.nodes" -> "<hostname>",
          "es.port" -> "<port>")
        df.write.format("org.elasticsearch.spark.sql").mode(SaveMode.Overwrite)
          .options(esConfig).save("<index>/<document-type>")
  65. AWS Redshift Data Source (Databricks)
      Github: https://github.com/databricks/spark-redshift
      Maven: com.databricks:spark-redshift:0.5.0
      Code
        val df: DataFrame = sqlContext.read
          .format("com.databricks.spark.redshift")
          .option("url", "jdbc:redshift://<hostname>:<port>/<database>…")
          .option("query", "select x, count(*) from my_table group by x")
          .option("tempdir", "s3n://tmpdir")
          .load(...)
        Note: UNLOAD and copy to tmp bucket in S3 enables parallel reads
  66. Cassandra DataSource (DataStax)
      Github: https://github.com/datastax/spark-cassandra-connector
      Maven: com.datastax.spark:spark-cassandra-connector_2.10:1.5.0-M1
      Code
        ratingsDF.write
          .format("org.apache.spark.sql.cassandra")
          .mode(SaveMode.Append)
          .options(Map("keyspace"->"<keyspace>", "table"->"<table>")).save(…)
  67. Cassandra Pushdown Support
      spark-cassandra-connector/…/o.a.s.sql.cassandra.PredicatePushDown.scala
      Pushdown Predicate Rules
        1. Only push down no-partition key column predicates with =, >, <, >=, <= predicate.
        2. Only push down primary key column predicates with = or IN predicate.
        3. If there are regular columns in the pushdown predicates, they should have at least one EQ expression on an indexed column and no IN predicates.
        4. All partition column predicates must be included in the predicates to be pushed down; only the last part of the partition key can be an IN predicate. For each partition column, only one predicate is allowed.
        5. For cluster column predicates, only the last predicate can be a non-EQ predicate (including an IN predicate), and preceding column predicates must be EQ predicates. If there is only one cluster column predicate, it can be any non-IN predicate.
        6. There are no pushdown predicates if there is any OR condition or NOT IN condition.
        7. Multiple predicates for the same column cannot be pushed down if any of them is an equality or IN predicate.
  68. Rumor of New Cassandra DataSource
      By-pass CQL front door used for transactional data
      Bulk read/write directly from/to SSTables
      Similar to existing Netflix Open Source project
        https://github.com/Netflix/aegisthus
      Promotes Cassandra to first-class Analytics Option
      Potentially only part of DataStax Enterprise?!
      Please mail a nasty letter to your local DataStax office
  69. Creating a Custom Data Source
      Study Existing Native and Third-Party Data Source Impls
      Native: JDBC (o.a.s.sql.execution.datasources.jdbc)
        class JDBCRelation extends BaseRelation
          with PrunedFilteredScan
          with InsertableRelation
      Third-Party: Cassandra (o.a.s.sql.cassandra)
        class CassandraSourceRelation extends BaseRelation
          with PrunedFilteredScan
          with InsertableRelation
      <Insert Your Custom Data Source Here!>
  70. Cloudant DataSource (IBM)
      Github: http://spark-packages.org/package/cloudant/spark-cloudant
      Maven: com.datastax.spark:spark-cassandra-connector_2.10:1.5.0-M1
      Code
        ratingsDF.write.format("com.cloudant.spark")
          .mode(SaveMode.Append)
          .options(Map("cloudant.host"->"<account>.cloudant.com",
            "cloudant.username"->"<username>",
            "cloudant.password"->"<password>"))
          .save("<filename>")
  71. DB2 and BigSQL DataSources (IBM)
      Coming Soon!
  72. Rumor of REST DataSource (Databricks)
      Coming Soon?
      Ask Michael Armbrust, Spark SQL Lead @ Databricks
  73. Custom DataSource (Me and You All!)
      Coming Right Now!
      DEMO ALERT!!
  74. Demo! Create a Custom DataSource
  75. Contributing a Custom Data Source
      spark-packages.org
        Managed by
        Contains links to externally-managed github projects
        Ratings and comments
        Requires supported Spark version for each package
      Examples
        https://github.com/databricks/spark-csv
        https://github.com/databricks/spark-avro
        https://github.com/databricks/spark-redshift
  76. Spark Streaming: Scaling & Approximations
      Understand Parallelism, Recovery, and Back Pressure
      Describe Common Streaming Count Approximations
  77. Direct Kafka Streaming
      KafkaRDD partitions store relevant offsets
      Each partition acts as a Receiver
      Tasks/workers pull from Kafka in parallel
      Partitions rebuild from Kafka using offsets
      No Write Ahead Log (WAL) needed
      Optimizes happy path by avoiding the WAL
      At least once delivery guarantee
  78. Parallelism of Direct Kafka Streaming
  79. Not-so-direct Kinesis Streaming
      KinesisRDD partitions store relevant offsets
      Single receiver required to see all data/offsets
      Kinesis offsets not deterministic like Kafka
      Partitions rebuild from Kinesis using offsets
      No Write Ahead Log (WAL) needed
      Optimizes happy path by avoiding the WAL
      At least once delivery guarantee
  80. Streaming Back Pressure
      More than Throttling
      Push back on the source
      Requires buffered source (Kafka, Kinesis)
      Based on fundamentals of Control Theory
      Contributed by Typesafe
  81. HyperLogLog
      Approximate cardinality (approx count distinct)
      Fixed, low memory
      Tunable error percentage
      Only 1.5KB @ 2% error, 10^9 elements
      Twitter’s Algebird
      Streaming example in Spark codebase
      Spark’s countApproxDistinctByKey()
      Image source: http://research.neustar.biz/
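To make the "fixed, low memory" and "tunable error" points concrete, here is a minimal HyperLogLog sketch in plain Scala. It is an illustration only, not Algebird's or Spark's implementation. The assumptions are mine: 32-bit MurmurHash3 hashes, one byte per register, the published bias constant for 128 or more registers, and no large-range correction. With m = 2^p registers the expected standard error is roughly 1.04 / sqrt(m).

```scala
import scala.util.hashing.MurmurHash3

// Minimal HyperLogLog: 2^p one-byte registers over 32-bit hashes.
// Illustrative only; the large-range correction for 32-bit hashes is omitted.
class HyperLogLog(p: Int) {
  require(p >= 7 && p <= 16, "p must be in [7, 16]")
  private val m = 1 << p
  private val registers = new Array[Byte](m)
  // Bias-correction constant, valid for m >= 128.
  private val alpha = 0.7213 / (1.0 + 1.079 / m)

  def add(item: String): Unit = {
    val hash = MurmurHash3.stringHash(item)
    val index = hash >>> (32 - p)   // first p bits select a register
    val rest = hash << p            // remaining bits, left-aligned
    // Rank = position of the first 1-bit in the remaining (32 - p) bits.
    val rank = math.min(Integer.numberOfLeadingZeros(rest) + 1, 32 - p + 1)
    if (rank > registers(index)) registers(index) = rank.toByte
  }

  def estimate: Double = {
    val harmonicSum = registers.map(r => math.pow(2.0, -r.toDouble)).sum
    val raw = alpha * m * m / harmonicSum
    val zeros = registers.count(_ == 0)
    // Small-range correction: fall back to linear counting.
    if (raw <= 2.5 * m && zeros > 0) m * math.log(m.toDouble / zeros) else raw
  }
}

object HyperLogLogDemo {
  def main(args: Array[String]): Unit = {
    val hll = new HyperLogLog(12)   // 4096 registers, about 1.6% standard error
    (0 until 100000).foreach(i => hll.add("user-" + i))
    println(f"estimated distinct count: ${hll.estimate}%.0f")
  }
}
```

Memory stays at m bytes no matter how many items are added, and re-adding an item never changes the estimate, which is why the structure suits high-volume streams.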
  82. Count Min Sketch
      Approximate counters
      Better than HashMap
      Low, fixed memory
      Known error bounds
      Large num of counters
      From Twitter Algebird
      Streaming example in Spark codebase
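A Count-Min Sketch can be sketched in a few lines of plain Scala. This is an illustration under my own assumptions (MurmurHash3 seeded with the row number as the per-row hash, hypothetical `depth` and `width` parameters), not Algebird's implementation. It shows why memory is fixed and why estimates can only over-count.

```scala
import scala.util.hashing.MurmurHash3

// Minimal Count-Min Sketch: `depth` hash rows, each with `width` counters.
class CountMinSketch(depth: Int, width: Int) {
  require(depth > 0 && width > 0, "depth and width must be positive")
  private val table = Array.ofDim[Long](depth, width)

  // Row-specific column, derived by seeding the hash with the row number.
  private def column(item: String, row: Int): Int = {
    val h = MurmurHash3.stringHash(item, row)
    ((h % width) + width) % width   // non-negative modulo
  }

  def add(item: String, count: Long = 1L): Unit =
    for (row <- 0 until depth) table(row)(column(item, row)) += count

  // Collisions only inflate counters, so the minimum is the tightest estimate.
  def estimate(item: String): Long =
    (0 until depth).map(row => table(row)(column(item, row))).min
}

object CountMinSketchDemo {
  def main(args: Array[String]): Unit = {
    val cms = new CountMinSketch(depth = 4, width = 1024)
    Seq("spark", "spark", "kafka", "spark").foreach(item => cms.add(item))
    println(cms.estimate("spark"))
  }
}
```

Unlike a HashMap, the table never grows with the number of distinct keys. The price is that an estimate may exceed the true count when keys collide in every row.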
  83. Monte Carlo Simulations
      From Manhattan Project (A-bomb)
      Simulate movement of neutrons
      Law of Large Numbers (LLN)
        Average of results of many trials
        Converge on expected value
      SparkPi example in Spark codebase
        Pi ~ (# red dots / # total dots * 4)
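The Pi estimate above can be reproduced without a cluster. The sketch below is plain Scala, not the SparkPi source: it samples points in the unit square and counts those inside the quarter circle (the "red dots"). The `trials` and `seed` parameters are my own additions for reproducibility.

```scala
import scala.util.Random

object MonteCarloPi {
  // Pi ~ 4 * (points inside the quarter circle / total points).
  def estimatePi(trials: Int, seed: Long): Double = {
    val rng = new Random(seed)
    val inside = (1 to trials).count { _ =>
      val x = rng.nextDouble()
      val y = rng.nextDouble()
      x * x + y * y <= 1.0
    }
    4.0 * inside / trials
  }

  def main(args: Array[String]): Unit = {
    // More trials tighten the estimate, as the Law of Large Numbers predicts.
    Seq(1000, 100000, 1000000).foreach { n =>
      println(s"trials=$n estimate=${estimatePi(n, seed = 42L)}")
    }
  }
}
```

Each trial is independent, so the loop parallelizes trivially: every worker counts its own hits and only the counts need to be summed.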
  84. Spark ML: High Scale Machine Learning
      Define Similarity and Dimension Reduction
      Describe Sampling and Bucketing
      Generate 10 Recommendations
  85. Live, Interactive Demo!
      sparkafterdark.com
  86. Audience Participation Needed!!
      -> You are here ->
      Audience Instructions
        Navigate to sparkafterdark.com
        Click 3 actresses and 3 actors
        Wait for us to analyze together!
        Note: This is totally anonymous!!
      Project Links
        https://github.com/fluxcapacitor/pipeline
        https://hub.docker.com/r/fluxcapacitor
  87. Similarity
  88. Types of Similarity
      Euclidean: linear measure
        Magnitude bias
      Cosine: angle measure
        Adjust for magnitude bias
      Jaccard: (intersection / union)
        Popularity bias
      Log Likelihood
        Adjust for popularity bias
      (Figure: like matrix with columns Ali, Matei, Reynold, Patrick, Andy and rows Kimberly, Leslie, Meredith, Lisa, Holden; cell positions not recoverable)
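Three of the four measures are short enough to write out directly. The sketch below is plain Scala with hypothetical function names, not MLlib's API. It shows why cosine ignores magnitude while Euclidean distance does not; log likelihood is omitted.

```scala
object Similarity {
  // Straight-line distance: sensitive to vector magnitude.
  def euclideanDistance(a: Seq[Double], b: Seq[Double]): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  // Cosine of the angle between the vectors: scaling a vector leaves it unchanged.
  def cosine(a: Seq[Double], b: Seq[Double]): Double = {
    val dot = a.zip(b).map { case (x, y) => x * y }.sum
    val norms = math.sqrt(a.map(x => x * x).sum) * math.sqrt(b.map(x => x * x).sum)
    if (norms == 0.0) 0.0 else dot / norms
  }

  // Size of the intersection divided by size of the union.
  def jaccard[T](a: Set[T], b: Set[T]): Double =
    if (a.isEmpty && b.isEmpty) 0.0
    else (a intersect b).size.toDouble / (a union b).size

  def main(args: Array[String]): Unit = {
    val light = Seq(1.0, 1.0, 0.0)
    val heavy = Seq(2.0, 2.0, 0.0)   // same direction, twice the magnitude
    println(s"euclidean=${euclideanDistance(light, heavy)} cosine=${cosine(light, heavy)}")
    println(s"jaccard=${jaccard(Set("Matei", "Reynold"), Set("Reynold", "Patrick"))}")
  }
}
```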
  89. All-Pairs Similarity Comparison
      Compare everything to everything
      aka “pair-wise similarity” or “similarity join”
      Naïve shuffle: O(m*n^2); m=rows, n=cols
      Minimize shuffle through approximations!
      Reduce m (rows)
        Sampling and bucketing
      Reduce n (cols): dimension reduction!!
        Remove most frequent value (i.e., 0)
        Principal Component Analysis
  90. Dimension Reduction
      Sampling and Bucketing
  91. Reduce m: DIMSUM Sampling
      “Dimension Independent Matrix Square Using MR”
      Remove rows with low similarity probability
      MLlib: RowMatrix.columnSimilarities(…)
      Twitter: 40% efficiency gain over Cosine Similarity
  92. Reduce m: LSH Bucketing
      “Locality Sensitive Hashing”
      Split m into b buckets
        Use similarity hash algorithm
        Requires pre-processing of data
      Compare bucket contents in parallel
      Converts O(m*n^2) -> O(m*n/b*b^2); m=rows, n=cols, b=buckets
      e.g., 500k x 500k matrix: O(1.25e17) -> O(1.25e13); b=50
      github.com/mrsqueeze/spark-hash
  93. Reduce n: Remove Most Frequent Value
      Eliminate most-frequent value
      Represent other values with (index,value) pairs
      Converts O(m*n^2) -> O(m*nnz^2); nnz=num nonzeros, nnz << n
      Note: Choose most frequent value (may not be 0)
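The (index,value) encoding can be illustrated in plain Scala. The `encode` and `decode` helpers below are hypothetical names for this sketch, not an MLlib API. The most frequent value is stored once and every other entry is kept as a pair.

```scala
object SparseEncoding {
  // Drops the most frequent value and keeps the rest as (index, value) pairs.
  def encode(dense: Seq[Double]): (Double, Seq[(Int, Double)]) = {
    require(dense.nonEmpty, "vector must not be empty")
    val mostFrequent = dense.groupBy(identity).maxBy(_._2.size)._1
    val pairs = dense.zipWithIndex.collect { case (v, i) if v != mostFrequent => (i, v) }
    (mostFrequent, pairs)
  }

  // Rebuilds the dense vector: any index without a pair gets the most frequent value.
  def decode(size: Int, mostFrequent: Double, pairs: Seq[(Int, Double)]): Seq[Double] = {
    val lookup = pairs.toMap
    (0 until size).map(i => lookup.getOrElse(i, mostFrequent))
  }

  def main(args: Array[String]): Unit = {
    val dense = Seq(0.0, 0.0, 3.0, 0.0, 5.0)
    val (fill, pairs) = encode(dense)
    println(s"fill=$fill pairs=$pairs")
    println(decode(dense.size, fill, pairs))
  }
}
```

Pairwise comparisons then only need to touch the stored pairs, which is where the drop from n to nnz in the complexity comes from.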
  94. Recommendations
      Summary Statistics and Historical Analysis
      Collaborative Filtering and Clustering
      Text Featurization and NLP
  95. Types of Recommendations
      Non-personalized
        No preference or behavior data for user, yet
        aka “Cold Start Problem”
      Personalized
        User-Item Similarity
          Items that others with similar prefs have liked
        Item-Item Similarity
          Items similar to your previously-liked items
  96. Recommendation Terminology
      User
        User seeking recommendations
      Item
        Item that has been liked or rated
      Feedback
        Explicit: like, rating
        Implicit: search, click, hover, view, scroll
      Feature Engineering
        Dimension reduction
  97. Non-Personalized Recommendations
      Use Aggregate Data to Generate Recommendations
  98. Top Users by Like Count
      “I might like users who have the most-likes overall based on historical data.”
      SparkSQL, DataFrames: Summary Stat, Aggs
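The aggregation behind "top users by like count" is a group-and-count. The sketch below uses plain Scala collections rather than Spark SQL. The `Like` case class and its field names are hypothetical, chosen only for this illustration.

```scala
object TopByLikes {
  // Hypothetical record: fromUserId liked toUserId.
  case class Like(fromUserId: Int, toUserId: Int)

  // Returns the n most-liked users as (userId, likeCount), highest count first.
  def topLiked(likes: Seq[Like], n: Int): Seq[(Int, Int)] =
    likes
      .groupBy(_.toUserId)
      .map { case (user, received) => (user, received.size) }
      .toSeq
      .sortBy { case (user, count) => (-count, user) }   // ties broken by user id
      .take(n)

  def main(args: Array[String]): Unit = {
    val likes = Seq(Like(1, 3), Like(2, 3), Like(4, 3), Like(1, 2), Like(3, 2), Like(2, 1))
    println(topLiked(likes, 2))
  }
}
```

Because it needs no per-user history, this kind of aggregate works even for a brand-new user, which is what makes it a non-personalized recommendation.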
  99. Top Influencers by Like Graph
      “I might like the most-influential users in overall like graph.”
      GraphX: PageRank
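PageRank over a like graph can be sketched as a simple power iteration. This is plain Scala, not the GraphX implementation. The un-normalized update `(1 - d) + d * sum`, the default damping of 0.85, and the fixed iteration count are assumptions of this sketch, and rank held by users with no outgoing likes is simply dropped.

```scala
object PageRankSketch {
  // edges: (from, to) means "from likes to".
  def pageRank(edges: Seq[(Int, Int)],
               iterations: Int = 20,
               damping: Double = 0.85): Map[Int, Double] = {
    val nodes = edges.flatMap { case (from, to) => Seq(from, to) }.distinct
    val outLinks = edges.groupBy(_._1).map { case (from, es) => from -> es.map(_._2) }
    var ranks: Map[Int, Double] = nodes.map(n => n -> 1.0).toMap

    for (_ <- 1 to iterations) {
      // Each user splits its current rank evenly among the users it likes.
      val contributions = for {
        (from, targets) <- outLinks.toSeq
        to <- targets
      } yield (to, ranks(from) / targets.size)
      val received = contributions.groupBy(_._1).map { case (n, cs) => n -> cs.map(_._2).sum }
      ranks = nodes.map(n => n -> ((1 - damping) + damping * received.getOrElse(n, 0.0))).toMap
    }
    ranks
  }

  def main(args: Array[String]): Unit = {
    val likes = Seq((1, 3), (2, 3), (4, 3), (3, 1))
    pageRank(likes).toSeq.sortBy(-_._2).foreach(println)
  }
}
```

Unlike the plain like count, a like from a highly ranked user is worth more here, so rank flows through the graph.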
  100. Demo! Generate Recommendations using Summary Stats & PageRank
  101. Personalized Recommendations
      Use Similarity to Generate Personalized Recommendations
  102. Like Behavior of Similar Users
      “I like the same people that you like. What other people did you like that I haven’t seen?”
      MLlib: Matrix Factorization, User-Item Similarity
  103. Demo! Generate Recommendations using Collaborative Filtering and Matrix Factorization
  104. Similar Text-based Profiles as Me
      “Our profiles have similar keywords and named entities. We might like each other!”
      MLlib: Word2Vec, TF/IDF, k-skip n-grams
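Of the techniques listed, TF/IDF is the easiest to show in isolation. The sketch below is plain Scala, not MLlib's HashingTF/IDF pipeline. The tokenizer and function names are hypothetical, and the smoothed IDF form log((N + 1) / (df + 1)) is my assumption for this sketch. Word2Vec and n-grams are not covered.

```scala
object TfIdfSketch {
  // Lowercase and split on anything that is not a letter or digit.
  def tokenize(text: String): Seq[String] =
    text.toLowerCase.split("[^a-z0-9]+").filter(_.nonEmpty).toSeq

  // Returns one term -> weight map per document.
  def tfIdf(docs: Seq[String]): Seq[Map[String, Double]] = {
    val tokenized = docs.map(tokenize)
    // Document frequency: in how many documents each term appears.
    val docFreq: Map[String, Int] =
      tokenized.flatMap(_.distinct).groupBy(identity).map { case (t, occ) => t -> occ.size }
    tokenized.map { tokens =>
      tokens.groupBy(identity).map { case (term, occ) =>
        val idf = math.log((docs.size + 1.0) / (docFreq(term) + 1.0))
        term -> (occ.size * idf)
      }
    }
  }

  def main(args: Array[String]): Unit = {
    val profiles = Seq("I love Spark and hiking", "I love Spark and cooking")
    tfIdf(profiles).foreach(println)
  }
}
```

Words that appear in every profile get a weight of zero, so only distinguishing keywords contribute when two profiles are compared, for example with cosine similarity.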
  105. Similar Profiles to Previous Likes
      “Your profile text has similar keywords and named entities to other profiles of people I like. I might like you, too!”
      MLlib: Word2Vec, TF/IDF, Doc Similarity
  106. Relevant, High-Value Emails
      “Your initial email references a lot of things in my profile. I might like you for making the effort!”
      MLlib: Word2Vec, TF/IDF, Entity Recognition
      (Figure labels: Her Email, My Profile)
  107. The Future of Recommendations
  108. Eigenfaces: Facial Recognition
      “Your face looks similar to others that I’ve liked. I might like you.”
      MLlib: RowMatrix, PCA, Item-Item Similarity
      Image courtesy of http://crockpotveggies.com/2015/02/09/automating-tinder-with-eigenfaces.html
  109. NLP Conversation Starter Bot!
      “If your responses to my generic opening lines are positive, I may read your profile.”
      MLlib: TF/IDF, DecisionTrees, Sentiment Analysis
      (Figure labels: Positive, Negative)
  110. Maintaining the Spark
  111. Recommendations for Couples
      “I want Mad Max. You want Message In a Bottle. Let’s find something in between to watch tonight.”
      MLlib: RowMatrix, Item-Item Similarity
      GraphX: Nearest Neighbors, Shortest Path
      (Figure labels: similar plots -> <- similar actors)
  112. Final Recommendation!
  113. Get Off the Computer & Meet People!
      Thank you, Paris!!
      Chris Fregly @cfregly
      IBM Spark Technology Center
      San Francisco, CA, USA
      Relevant Links
        advancedspark.com
          Signup for the book & global meetup!
        github.com/fluxcapacitor/pipeline
          Clone, contribute, and commit code!
        hub.docker.com/r/fluxcapacitor/pipeline/wiki
          Run all demos in your own environment with Docker!
  114. More Relevant Links
      http://meetup.com/Advanced-Apache-Spark-Meetup
      http://advancedspark.com
      http://github.com/fluxcapacitor/pipeline
      http://hub.docker.com/r/fluxcapacitor/pipeline
      http://sortbenchmark.org/ApacheSpark2014.pdf
      https://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html
      http://0x0fff.com/spark-architecture-shuffle/
      http://www.cs.berkeley.edu/~kubitron/courses/cs262a-F13/projects/reports/project16_report.pdf
      http://stackoverflow.com/questions/763262/how-does-one-write-code-that-best-utilizes-the-cpu-cache-to-improve-performance
      http://www.aristeia.com/TalkNotes/ACCU2011_CPUCaches.pdf
      http://mishadoff.com/blog/java-magic-part-4-sun-dot-misc-dot-unsafe/
      http://docs.scala-lang.org/overviews/quasiquotes/intro.html
      http://lwn.net/Articles/252125/ (Memory Part 2: CPU Caches)
      http://lwn.net/Articles/255364/ (Memory Part 5: What Programmers Can Do)
      https://www.safaribooksonline.com/library/view/java-performance-the/9781449363512/ch04.html
      http://web.eece.maine.edu/~vweaver/projects/perf_events/perf_event_open.html
      http://www.brendangregg.com/perf.html
      https://perf.wiki.kernel.org/index.php/Tutorial
      http://techblog.netflix.com/2015/07/java-in-flames.html
      http://techblog.netflix.com/2015/04/introducing-vector-netflixs-on-host.html
      http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html#Java
  115. What’s Next?
  116. What’s Next?
      Autoscaling Spark Workers
        Completely Docker-based
        Docker Compose and Docker Machine
      Lots of Demos and Examples!
        Zeppelin & IPython/Jupyter notebooks
        Advanced streaming use cases
        Advanced ML, Graph, and NLP use cases
      Performance Tuning and Profiling
        Work closely with Brendan Gregg & Netflix
        Surface & share more low-level details of Spark internals
  117. Upcoming Meetups and Conferences (Freg-a-palooza!)
      London Spark Meetup (Oct 12th)
      Scotland Data Science Meetup (Oct 13th)
      Dublin Spark Meetup (Oct 15th)
      Barcelona Spark Meetup (Oct 20th)
      Madrid Spark/Big Data Meetup (Oct 22nd)
      Paris Spark Meetup (Oct 26th)
      Amsterdam Spark Summit & Meetup (Oct 27th)
      Delft Dutch Data Science Meetup (Oct 29th)
      Brussels Spark Meetup (Oct 30th)
      Zurich Big Data Developers Meetup (Nov 2nd)
      Geneva Spark Meetup (Nov 5th)
      San Francisco Datapalooza (Nov 10th)
      San Francisco Advanced Apache Spark Meetup (Nov 12th)
      Oslo Big Data Hadoop Meetup (Nov 18th)
      Helsinki Spark Meetup (Nov 20th)
      Stockholm Spark Meetup (Nov 23rd)
      Copenhagen Spark Meetup (Nov 25th)
      Budapest Spark Meetup (Nov 27th)
      Singapore Strata Conference (Dec 1st)
      San Francisco Advanced Apache Spark Meetup (Dec 8th)
      Mountain View Advanced Apache Spark Meetup (Dec 10th)
      Washington DC Advanced Apache Spark Meetup (Dec 17th)
  118. IBM Spark (spark.tc): Power of data. Simplicity of design. Speed of innovation.
