Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
After Dark 1.5
Stockholm Spark Meetup
Chris ...
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design...
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design...
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design...
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design...
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design...
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design...
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
Power of data. Simplicity of design. Speed o...
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design...
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design...
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design...
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design...
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
Power of data. Simplicity of design. Speed o...
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design...
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design...
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design...
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design...
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design...
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design...
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design...
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design...
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
Power of data. Simplicity of design. Speed o...
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design...
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design...
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design...
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
Power of data. Simplicity of design. Speed o...
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design...
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
Power of data. Simplicity of design. Speed o...
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design...
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design...
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design...
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
Power of data. Simplicity of design. Speed o...
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design...
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design...
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
Power of data. Simplicity of design. Speed o...
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design...
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design...
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design...
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design...
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design...
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design...
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design...
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design...
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
Power of data. Simplicity of design. Speed o...
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design...
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design...
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design...
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design...
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design...
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design...
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design...
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design...
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design...
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design...
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design...
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design...
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design...
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design...
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design...
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
IBM | spark.tc
Spark SQL UDF Code Generation...
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design...
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design...
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design...
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design...
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design...
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
Power of data. Simplicity of design. Speed o...
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design...
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design...
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design...
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design...
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design...
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design...
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design...
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design...
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design...
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design...
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design...
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design...
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design...
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design...
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design...
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design...
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design...
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design...
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design...
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design...
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design...
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design...
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design...
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design...
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
Power of data. Simplicity of design. Speed o...
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design...
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design...
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design...
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
Power of data. Simplicity of design. Speed o...
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design...
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
Power of data. Simplicity of design. Speed o...
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design...
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
Power of data. Simplicity of design. Speed o...
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design...
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design...
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design...
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design...
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
Power of data. Simplicity of design. Speed o...
Upcoming SlideShare
Loading in …5
×

Stockholm Spark Meetup Nov 23 2015 Spark After Dark 1.5

533 views

Published on

High Performance, Reali-time, Stream Processing, Machine Learning, Natural Language Processing, Text Analytics, and Recommendations

Published in: Software
  • Be the first to comment

  • Be the first to like this

Stockholm Spark Meetup Nov 23 2015 Spark After Dark 1.5

  1. 1. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc After Dark 1.5 Stockholm Spark Meetup Chris Fregly Principal Data Solutions Engineer We’re Hiring - Only Nice People! Nov 23rd, 2015
  2. 2. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Who Am I? 2 Streaming Data Engineer Open Source Committer
 Data Solutions Engineer
 Apache Contributor Principal Data Solutions Engineer IBM Technology Center Founder Advanced Apache Meetup Author Advanced . Due 2016 My Ma’s First Time in California
  3. 3. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Random Slide: More Ma “First Time” Pics 3 In California Using Chopsticks Using “New” iPhone
  4. 4. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Upcoming Meetups and Conferences London Spark Meetup (Oct 12th) Scotland Data Science Meetup (Oct 13th) Dublin Spark Meetup (Oct 15th) Barcelona Spark Meetup (Oct 20th) Madrid Big Data Meetup (Oct 22nd) Paris Spark Meetup (Oct 26th) Amsterdam Spark Summit (Oct 27th) Brussels Spark Meetup (Oct 30th) Zurich Big Data Meetup (Nov 2nd) Geneva Spark Meetup (Nov 5th) San Francisco Datapalooza (Nov 10th) San Francisco Advanced Spark (Nov 12th) 4 Oslo Big Data Hadoop Meetup (Nov 19th) Helsinki Spark Meetup (Nov 20th) Stockholm Spark Meetup (Nov 23rd) Copenhagen Spark Meetup (Nov 25th) Budapest Spark Meetup (Nov 26th) Istanbul Spark Meetup (Nov 28th) Singapore Strata Conference (Dec 1st) Sydney Spark Meetup (Dec 7th) Melbourne Spark Meetup (Dec 9th) San Francisco Advanced Spark (Dec 10th) Toronto Spark Meetup (Dec 14th) Austin Data Days Conference (Jan 16th)
  5. 5. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Advanced Apache Spark Meetup Meetup Metrics 1600+ Members in just 4 mos! Top 5 Most Active Spark Meetup!! Meetup Goals   Dig deep into codebase of Spark and related projects   Study integrations of Cassandra, ElasticSearch, Tachyon, S3, BlinkDB, Mesos, YARN, Kafka, R   Surface and share patterns and idioms of these well-designed, distributed, big data components
  6. 6. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark All Slides and Code Are Available! advancedspark.com slideshare.net/cfregly github.com/fluxcapacitor hub.docker.com/r/fluxcapacitor 6
  7. 7. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark What is “ After Dark”? Spark-based, Advanced Analytics Reference App End-to-End, Scalable, Real-time Big Data Pipeline Demo Spark and Related Open Source Projects 7 github.com/fluxcapacitor
  8. 8. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc Live, Interactive Demo! sparkafterdark.com 8
  9. 9. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Audience Participation Needed!! 9 You -> Audience Instructions   Go to sparkafterdark.com   Click 3 actresses and 3 actors   Wait for us to analyze together! Links To Do This Yourself!   github.com/fluxcapacitor   hub.docker.com/r/fluxcapacitor Data -> Scientist EU Safe Harbor Disclaimer
 This is Totally Anonymous!
  10. 10. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Tools of This Talk 10   Kafka   Redis   Docker   Ganglia   Cassandra   Parquet, JSON, ORC, Avro   Apache Zeppelin Notebooks   Spark SQL, DataFrames, Hive   ElasticSearch, Logstash, Kibana   Spark ML, GraphX, Stanford CoreNLP … github.com/fluxcapacitor hub.docker.com/r/fluxcapacitor
  11. 11. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Themes of this Talk  Filter  Off-Heap  Parallelize  Approximate  Find Similarity  Minimize Seeks  Maximize Scans  Customize for Workload  Tune Performance At Every Layer 11   Be Nice, Collaborate! Like my Ma!!
  12. 12. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Presentation Outline  Spark Core: Tuning & Mechanical Sympathy  Spark SQL: Query Optimizing & Catalyst 12
  13. 13. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc Spark Core: Tuning & Mechanical Sympathy Understand and Acknowledge Mechanical Sympathy Study AlphaSort and 100TB GraySort Challenge Dive Deep into Project Tungsten 13
  14. 14. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Mechanical Sympathy Hardware and software working together in harmony. - Martin Thompson http://mechanical-sympathy.blogspot.com Whatever your data structure, my array will beat it. - Scott Meyers Every C++ Book, basically 14 Hair Sympathy - Bruce Jenner
  15. 15. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Spark and Mechanical Sympathy 15 Project 
 Tungsten (Spark 1.4-1.6+) GraySort Challenge (Spark 1.1-1.2) Minimize Memory and GC Maximize CPU Cache Locality Saturate Network I/O Saturate Disk I/O
  16. 16. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark AlphaSort Technique: Sort 100 Bytes Recs 16 Value Ptr Key Dereference Not Required! AlphaSort List [(Key, Pointer)] Key is directly available for comparison Naïve List [Pointer] Must dereference key for comparison Ptr Dereference for Key Comparison Key
  17. 17. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark CPU Cache Line and Memory Sympathy Key (10 bytes)+Pointer (*4 bytes)*Compressed OOPs = 14 bytes 17 Key Ptr Not CPU Cache-line Friendly! Ptr Key-Prefix 2x CPU Cache-line Friendly! Key-Prefix (4 bytes) + Pointer (4 bytes) = 8 bytes Key (10 bytes)+Pad (2 bytes)+Pointer (4 bytes)
 = 16 bytes Key Ptr Pad /Pad CPU Cache-line Friendly!
  18. 18. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Performance Comparison 18
  19. 19. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Similar Trick: Direct Cache Access (DCA) Pull out packet header along side pointer to payload 19
  20. 20. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark CPU Cache Line Sizes 20 My
 Laptop My
 SoftLayer
 BareMetal
  21. 21. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Cache Hits: Sequential v Random Access 21
  22. 22. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc Mechanical Sympathy CPU Cache Lines and Matrix Multiplication 22
  23. 23. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark CPU Cache Naïve Matrix Multiplication // Dot product of each row & column vector for (i <- 0 until numRowA) for (j <- 0 until numColsB) for (k <- 0 until numColsA) res[ i ][ j ] += matA[ i ][ k ] * matB[ k ][ j ]; 23 Bad: Row-wise traversal, not using CPU cache line,
 ineffective pre-fetching
  24. 24. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark CPU Cache Friendly Matrix Multiplication // Transpose B for (i <- 0 until numRowsB) for (j <- 0 until numColsB) matBT[ i ][ j ] = matB[ j ][ i ]; 
 // Modify dot product calculation for B Transpose for (i <- 0 until numRowsA) for (j <- 0 until numColsB) for (k <- 0 until numColsA) res[ i ][ j ] += matA[ i ][ k ] * matBT[ j ][ k ]; 24 Good: Full CPU cache line,
 effective prefetching OLD: res[ i ][ j ] += matA[ i ][ k ] * matB [ k ] [ j ]; Reference j
 before k
  25. 25. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Instrumenting and Monitoring CPU Use Linux perf command! 25 http://www.brendangregg.com/blog/2015-11-06/java-mixed-mode-flame-graphs.html
  26. 26. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc Demo! Compare CPU Naïve & Cache-Friendly Matrix Multiplication 26
  27. 27. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Results of Matrix Multiply Comparison Cache-Friendly 
 Matrix Multiply 27 Naive 
 Matrix Multiply perf stat –event L1-dcache-load-misses,L1-dcache-prefetch-misses,LLC-load-misses, LLC-prefetch-misses,cache-misses,stalled-cycles-frontend ~27x ~13x ~13x ~2x ~10x 550 hp 55 hp
  28. 28. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc Mechanical Sympathy Updating a Tuple of Counters in Thread-safe Manner 28
  29. 29. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark CPU Cache Naïve Case Class Counters case class MyTuple(left: Int, right: Int) object CacheNaiveCaseClassCounters { var tuple = new MyTuple(0,0) … def increment(leftIncrement: Int, rightIncrement: Int) : MyTuple = { tuple.synchronized { tuple = new MyTuple(tuple.left + leftIncrement, tuple.right + rightIncrement) tuple } } } 29
  30. 30. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark CPU Cache Naïve Tuple Counters object CacheNaiveTupleIncrement { var tuple = (0,0) … def increment(leftIncrement: Int, rightIncrement: Int) : (Int, Int) = { tuple.synchronized { tuple = (tuple._1 + leftIncrement, tuple._2 + rightIncrement) tuple } } } 30
  31. 31. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark CPU Cache Friendly Lock-Free Counters object CacheFriendlyLockFreeCounters { // a single Long (8-bytes) will maintain 2 separate Ints (4-bytes each) val tuple = new AtomicLong() … def increment(leftIncrement: Int, rightIncrement: Int) : Long = { var originalLong = 0L var updatedLong = 0L do { originalLong = tuple.get() val originalRightInt = originalLong.toInt // cast originalLong to Int to get right counter val originalLeftInt = (originalLong >>> 32).toInt // shift right to get left counter val updatedRightInt = originalRightInt + rightIncrement // increment right counter val updatedLeftInt = originalLeftInt + leftIncrement // increment left counter updatedLong = updatedLeftInt // update the new long with the left counter updatedLong = updatedLong << 32 // shift the new long left updatedLong += updatedRightInt // update the new long with the right counter } while (tuple.compareAndSet(originalLong, updatedLong) == false) updatedLong } 31 Q: Why not @volatile long? A: JVM (Java Memory Model)
 does not guarantee atomic
 updates of 64-bit longs, doubles Lock-free, compare and set
  32. 32. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc Demo! Compare Naïve & Cache-Friendly Tuple Counter Synchronization 32
  33. 33. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Results of Counters Comparison Naïve Case Class Counters Naïve Tuple Counters Cache Friendly,
 Lock-Free Counters ~2x ~1.5x ~3.5x ~2x ~2x ~1.5x ~1.5x ~1.5x
  34. 34. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Profiling Visualizations: Flame Graphs 34 Example: Spark Word Count Java Stack Traces are Good! (-XX:-Inline -XX:+PreserveFramePointer) Plateaus
 are Bad!!
  35. 35. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc 100TB GraySort Challenge Sort 100TB of 100-Byte Records with 10-byte Keys Custom Data Structs & Algos for Sort & Shuffle Saturate Network and Disk I/O Controllers 35
  36. 36. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark 100TB GraySort Challenge Results 36 Performance Goals  Saturate Network I/O  Saturate Disk I/O  Maxminize Throughput (2013) (2014) EC2 (i2.8xlarge) (2014)
  37. 37. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Winning Hardware Configuration Compute 206 Workers, 1 Master (AWS EC2 i2.8xlarge) 32 Intel Xeon CPU E5-2670 @ 2.5 Ghz 244 GB RAM, 8 x 800GB SSD, RAID 0 striping, ext4 3 GBps mixed read/write disk I/O per node Network AWS Placement Groups, VPC, Enhanced Networking Single Root I/O Virtualization (SR-IOV) 10 Gbps, low latency, low jitter (iperf: ~9.5 Gbps) 37 Q: Why only 206? A: Network is saturated @ 206
  38. 38. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Winning Software Configuration Spark 1.2, OpenJDK 1.7 Disable caching, compression, spec execution, shuffle spill Force NODE_LOCAL task scheduling for optimal data locality HDFS 2.4.1 short-circuit local reads, 2x replication Empirically chose between 4-6 partitions per cpu 206 nodes * 32 cores = 6592 cores 6592 cores * 4 = 26,368 partitions 6592 cores * 6 = 39,552 partitions 6592 cores * 4.25 = 28,000 partitions (empirical best) Range partitioning takes advantage of sequential keyspace Required ~10s of sampling 79 keys from in each partition 38
  39. 39. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark New Sort Shuffle Manager for Spark 1.2 Original “hash-based” New “sort-based” ①  Use less OS resources (socket buffers, file descriptors) ②  TimSort partitions in-memory ③  MergeSort partitions on-disk into a single master file ④  Serve partitions from master file: seek once, sequential scan 39
  40. 40. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Asynchronous Network Module Switch to asyncronous Netty vs. synchronous java.nio Switch to zero-copy epoll Use only kernel-space between disk and network controllers Custom memory management spark.shuffle.blockTransferService=netty Spark-Netty Performance Tuning spark.shuffle.io.preferDirectBuffers=true Reuse off-heap buffers spark.shuffle.io.numConnectionsPerPeer=8 (for example) Increase to saturate hosts with multiple disks (8x800 SSD) 40 Details in SPARK-2468
  41. 41. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Custom Algorithms and Data Structures Optimized for sort & shuffle workloads o.a.s.util.collection.TimSort[K,V] Based on JDK 1.7 TimSort Performs best with partially-sorted runs Optimized for elements of (K,V) pairs Sorts impl of SortDataFormat (ie. KVArraySortDataFormat) o.a.s.util.collection.AppendOnlyMap Open addressing hash, quadratic probing Array of [(K, V), (K, V)] Good memory locality Keys never removed, values only append 41
  42. 42. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Daytona GraySort Challenge Goal Success 1.1 Gbps/node network I/O (Reducers)
 Theoretical max = 1.25 Gbps for 10 GB ethernet 3 GBps/node disk I/O (Mappers) 42 Aggregate 
 Cluster Network I/O! 220 Gbps / 206 nodes ~= 1.1 Gbps per node
  43. 43. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Shuffle Performance Tuning Tips Hash Shuffle Manager (Deprecated) spark.shuffle.consolidateFiles (Mapper) o.a.s.shuffle.FileShuffleBlockResolver Intermediate Files Increase spark.shuffle.file.buffer (Reducer) Increase spark.reducer.maxSizeInFlight if memory allows Use Smaller Number of Larger Executors Minimizes intermediate files and overall shuffle More opportunity for PROCESS_LOCAL SQL: BroadcastHashJoin vs. ShuffledHashJoin spark.sql.autoBroadcastJoinThreshold Use DataFrame.explain(true) or EXPLAIN to verify 43 Many Threads (1 per CPU)
  44. 44. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc Project Tungsten Data Struts & Algos Operate Directly on Byte Arrays Maximize CPU Cache Locality, Minimize GC Utilize Dynamic Code Generation 44 SPARK-7076 (Spark 1.4)
  45. 45. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Why is CPU the Bottleneck? CPU is used for serialization, hashing, compression GraySort optimizations improved network & shuffle Network and Disk I/O bandwidth are relatively high More partitioning, pruning, predicate pushdowns Better columnar formats reduce disk I/O bottleneck 45
  46. 46. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Yet Another Spark Shuffle Manager! spark.shuffle.manager = hash (Deprecated) < 10,000 reducers Output partition file hashes the key of (K,V) pair Mapper creates an output file per partition Leads to M*P output files for all partitions sort (GraySort Challenge) > 10,000 reducers Default from Spark 1.2-1.5 Mapper creates single output file for all partitions Minimizes OS resources, netty + epoll optimizes network I/O, disk I/O, and memory Uses custom data structures and algorithms for sort-shuffle workload Wins Daytona GraySort Challenge tungsten-sort (Project Tungsten) Default since 1.5 Modification of existing sort-based shuffle Uses com.misc.Unsafe for self-managed memory and garbage collection Maximize CPU utilization and cache locality with AlphaSort-inspired binary data structures/algorithms Perform joins, sorts, and other operators on both serialized and compressed byte buffers 46
  47. 47. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark CPU & Memory Optimizations Custom Managed Memory Reduces GC overhead Both on and off heap Exact size calculations Direct Binary Processing Operate on serialized/compressed arrays Kryo can reorder/sort serialized records LZF can reorder/sort compressed records More CPU Cache-aware Data Structs & Algorithms o.a.s.sql.catalyst.expression.UnsafeRow o.a.s.unsafe.map.BytesToBytesMap Code Generation (default in 1.5) Generate source code from overall query plan 100+ UDFs converted to use code generation 47 UnsafeFixedWithAggregationMap TungstenAggregationIterator CodeGenerator GeneratorUnsafeRowJoiner UnsafeSortDataFormat UnsafeShuffleSortDataFormat PackedRecordPointer UnsafeRow UnsafeInMemorySorter UnsafeExternalSorter UnsafeShuffleWriter Mostly Same Join Code, UnsafeProjection UnsafeShuffleManager UnsafeShuffleInMemorySorter UnsafeShuffleExternalSorter Details in SPARK-7075
  48. 48. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark sun.misc.Unsafe 48 Info addressSize() pageSize() Objects allocateInstance() objectFieldOffset() Classes staticFieldOffset() defineClass() defineAnonymousClass() ensureClassInitialized() Synchronization monitorEnter() tryMonitorEnter() monitorExit() compareAndSwapInt() putOrderedInt() Arrays arrayBaseOffset() arrayIndexScale() Memory allocateMemory() copyMemory() freeMemory() getAddress() – not guaranteed after GC getInt()/putInt() getBoolean()/putBoolean() getByte()/putByte() getShort()/putShort() getLong()/putLong() getFloat()/putFloat() getDouble()/putDouble() getObjectVolatile()/putObjectVolatile() Used by 
 Tungsten
  49. 49. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Spark + com.misc.Unsafe 49 org.apache.spark.sql.execution. aggregate.SortBasedAggregate aggregate.TungstenAggregate aggregate.AggregationIterator aggregate.udaf aggregate.utils SparkPlanner rowFormatConverters UnsafeFixedWidthAggregationMap UnsafeExternalSorter UnsafeExternalRowSorter UnsafeKeyValueSorter UnsafeKVExternalSorter local.ConvertToUnsafeNode local.ConvertToSafeNode local.HashJoinNode local.ProjectNode local.LocalNode local.BinaryHashJoinNode local.NestedLoopJoinNode joins.HashJoin joins.HashSemiJoin joins.HashedRelation joins.BroadcastHashJoin joins.ShuffledHashOuterJoin (not yet converted) joins.BroadcastHashOuterJoin joins.BroadcastLeftSemiJoinHash joins.BroadcastNestedLoopJoin joins.SortMergeJoin joins.LeftSemiJoinBNL joins.SortMergerOuterJoin Exchange SparkPlan UnsafeRowSerializer SortPrefixUtils sort basicOperators aggregate.SortBasedAggregationIterator aggregate.TungstenAggregationIterator datasources.WriterContainer datasources.json.JacksonParser datasources.jdbc.JDBCRDD org.apache.spark. unsafe.Platform unsafe.KVIterator unsafe.array.LongArray unsafe.array.ByteArrayMethods unsafe.array.BitSet unsafe.bitset.BitSetMethods unsafe.hash.Murmur3_x86_32 unsafe.map.BytesToBytesMap unsafe.map.HashMapGrowthStrategy unsafe.memory.TaskMemoryManager unsafe.memory.ExecutorMemoryManager unsafe.memory.MemoryLocation unsafe.memory.UnsafeMemoryAllocator unsafe.memory.MemoryAllocator (trait/interface) unsafe.memory.MemoryBlock unsafe.memory.HeapMemoryAllocator unsafe.memory.ExecutorMemoryManager unsafe.sort.RecordComparator unsafe.sort.PrefixComparator unsafe.sort.PrefixComparators unsafe.sort.UnsafeSorterSpillWriter serializer.DummySerializationInstance shuffle.unsafe.UnsafeShuffleManager shuffle.unsafe.UnsafeShuffleSortDataFormat shuffle.unsafe.SpillInfo shuffle.unsafe.UnsafeShuffleWriter shuffle.unsafe.UnsafeShuffleExternalSorter shuffle.unsafe.PackedRecordPointer shuffle.ShuffleMemoryManager util.collection.unsafe.sort.UnsafeSorterSpillMerger util.collection.unsafe.sort.UnsafeSorterSpillReader util.collection.unsafe.sort.UnsafeSorterSpillWriter util.collection.unsafe.sort.UnsafeShuffleInMemorySorter util.collection.unsafe.sort.UnsafeInMemorySorter util.collection.unsafe.sort.RecordPointerAndKeyPrefix util.collection.unsafe.sort.UnsafeSorterIterator network.shuffle.ExternalShuffleBlockResolver scheduler.Task rdd.SqlNewHadoopRDD executor.Executor org.apache.spark.sql.catalyst.expressions. regexpExpressions BoundAttribute SortOrder SpecializedGetters ExpressionEvalHelper UnsafeArrayData UnsafeReaders UnsafeMapData Projection LiteralGeneartor UnsafeRow JoinedRow SpecializedGetters InputFileName SpecificMutableRow codegen.CodeGenerator codegen.GenerateProjection codegen.GenerateUnsafeRowJoiner codegen.GenerateSafeProjection codegen.GenerateUnsafeProjection codegen.BufferHolder codegen.UnsafeRowWriter codegen.UnsafeArrayWriter complexTypeCreator rows literals misc stringExpressions Over 200 source files affected!!
  50. 50. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Traditional Java Object Row Layout 4-byte String Multi-field Object 50
  51. 51. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Custom Data Structures for Workload UnsafeRow (Dense Binary Row) TaskMemoryManager (Virtual Memory Address) BytesToBytesMap (Dense Binary HashMap) 51 Dense, 8-bytes per field (word-aligned) Key Ptr AlphaSort-Style (Key + Pointer) OS-Style Memory Paging
  52. 52. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark UnsafeRow Layout Example 52 Pre-Tungsten Tungsten
  53. 53. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Custom Memory Management o.a.s.memory.
 TaskMemoryManager & MemoryConsumer Memory management: virtual memory allocation, pageing Off-heap: direct 64-bit address On-heap: 13-bit page num + 27-bit page offset o.a.s.shuffle.sort. PackedRecordPointer 64-bit word (24-bit partition key, (13-bit page num, 27-bit page offset)) o.a.s.unsafe.types. UTF8String Primitive Array[Byte] 53 2^13 pages * 2^27 page size = 1 TB RAM per Task
  54. 54. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark UnsafeFixedWidthAggregationMap Aggregations o.a.s.sql.execution.
 UnsafeFixedWidthAggregationMap Uses BytesToBytesMap In-place updates of serialized data No object creation on hot-path Improved external agg support No OOM’s for large, single key aggs o.a.s.sql.catalyst.expression.codegen. GenerateUnsafeRowJoiner Combine 2 UnsafeRows into 1 o.a.s.sql.execution.aggregate. TungstenAggregate & TungstenAggregationIterator Operates directly on serialized, binary UnsafeRow 2 Steps: hash-based agg (grouping), then sort-based agg Supports spilling and external merge sorting 54
  55. 55. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Equality Bitwise comparison on UnsafeRow No need to calculate equals(), hashCode() Row 1 Equals! Row 2 55
  56. 56. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Joins Surprisingly, not many code changes o.a.s.sql.catalyst.expressions. UnsafeProjection Converts InternalRow to UnsafeRow 56
  57. 57. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Sorting o.a.s.util.collection.unsafe.sort. UnsafeSortDataFormat UnsafeInMemorySorter UnsafeExternalSorter RecordPointerAndKeyPrefix
 UnsafeShuffleWriter AlphaSort-Style Cache Friendly 57 Ptr Key-Prefix 2x CPU Cache-line Friendly! Using multiple subclasses of SortDataFormat simultaneously will prevent JIT inlining. This affects sort & shuffle performance. Supports merging compressed records if compression CODEC supports it (LZF)
  58. 58. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Spilling More Efficient Spilling Exact data size is known No need to maintain heuristics & approximations Controls amount of spilling Spill Merge on Compressed Records!! (If compression CODEC supports it - ie. LZF) 58 UnsafeFixedWidthAggregationMap.getPeakMemoryUsedBytes() Exact Memory ByteArray Count
  59. 59. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Code Generation Problem Boxing creates excessive objects Expression tree evaluations are costly JVM can’t inline polymorphic impls Solution Codegen by-passes virtual functions Defer source code generation to each operator, UDF, UDAF Rewrite and optimize code for overall plan, 8-byte align, etc Use Scala quasiquote macros for Scala AST source code gen Use Janino to compile generated source code into bytecode 59
  60. 60. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc IBM | spark.tc Spark SQL UDF Code Generation 100+ UDFs now generating code More to come in Spark 1.6+ Details in SPARK-8159, SPARK-9571 Each Implements Expression.genCode()!
  61. 61. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Creating a Custom UDF with Codegen Study existing implementations https://github.com/apache/spark/pull/7214/files Extend base trait o.a.s.sql.catalyst.expressions.Expression.genCode() Register the function o.a.s.sql.catalyst.analysis.FunctionRegistry.registerFunction() Augment DataFrame with new UDF (Scala implicits) o.a.s.sql.functions.scala Don’t forget about Python! python.pyspark.sql.functions.py 61
  62. 62. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Who Benefits from Project Tungsten? Users of DataFrames All Spark SQL Queries Catalyst All RDDs Serialization, Compression, and Aggregations 62
  63. 63. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Project Tungsten Performance Results Query Time Garbage Collection 63 OOM’d on Large Dataset!
  64. 64. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Autoscaling Spark Workers Spark 1.5+ spark-submit --max-executors API SparkContext.addExecutors() SparkContext.removeExecutors() Scaling up is easy J Scaling down is tricky L RDD cache within Executor JVM is lost RDD partitions need rebuilding in another Executor JVM Made possible by External Shuffle Service (Spark 1.2) 64
  65. 65. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Presentation Outline  Spark Core: Tuning & Mechanical Sympathy  Spark SQL: Query Optimizing & Catalyst 65
  66. 66. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc Spark SQL: Query Optimizing & Catalyst Explore DataFrames/Datasets/DataSources, Catalyst Review Partitions, Pruning, Pushdowns, File Formats Create a Custom DataSource API Implementation 66
  67. 67. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark DataFrames Inspired by R and Pandas DataFrames Schema-aware Cross language support SQL, Python, Scala, Java, R Equal performance between all languages DataFrame is container for logical plan Lazy transformations represented as tree Only logical plan is sent from Python -> JVM Only results returned from JVM -> Python Supports existing Hive metastore Small, file-based Hive metastore created by default DataFrame.rdd returns underlying RDD if needed 67 Use DataFrames instead of RDDs!!
  68. 68. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Custom UDF and UDAF Support Study existing implementations https://github.com/apache/spark/pull/7214/files Extend base trait
 o.a.s.sql.catalyst.expressions.Expression.genCode() Register the function o.a.s.sql.catalyst.analysis.FunctionRegistry.registerFunc() Augment DataFrame with new UDF (Scala implicits) o.a.s.sql.functions.scala Don’t forget about Python! python.pyspark.sql.functions.py 68
  69. 69. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Spark and Hive Sparks’ Early Days Shark: “Hive on Spark” Spork: “Pig on Spark” Catalyst Optimizer replaces Hive Optimizer Always use HiveContext – even if not using Hive! If no Hive, a small Hive metastore file is created Spark 1.5+ supports all Hive versions 0.12+ Separate classloaders for isolation Separate Spark-internal Hive vs User-application Hive 69
  70. 70. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Catalyst Optimizer DataFrame Abstract Syntax Tree Transformation Subquery elimination: use aliases to collapse subqueries Constant folding: replace expression with constant Simplify filters: remove unnecessary filters Predicate/filter pushdowns: avoid unnecessary data load Projection collapsing: avoid unnecessary projections Create Custom Rules Rules are Scala Case Classes val newPlan = MyFilterRule(analyzedPlan) 70 Implements oas.sql.catalyst.rules.Rule Apply Rule at any stage
  71. 71. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark DataSources API Relations (o.a.s.sql.sources.interfaces.scala) BaseRelation (abstract class): Provides schema of data TableScan (impl): Read all data from source PrunedFilteredScan (impl): Column pruning & predicate pushdowns InsertableRelation (impl): Insert/overwrite data based on SaveMode RelationProvider (trait/interface): Handle options, BaseRelation factory Execution (o.a.s.sql.execution.commands.scala) RunnableCommand (trait/interface): Common commands like EXPLAIN ExplainCommand(impl: case class) CacheTableCommand(impl: case class) Filters (o.a.s.sql.sources.filters.scala) Filter (abstract class): Handles all predicates/filters supported by this source EqualTo (impl) GreaterThan (impl) StringStartsWith (impl) 71
  72. 72. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Native Spark SQL DataSources 72
  73. 73. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Debugging the Query Plan 73 gendersCsvDF.select($"id", $"gender").filter("gender != 'F'").filter("gender != 'M'").explain(true) DataFrame.queryExecution.logical DataFrame.queryExecution.analyzed DataFrame.queryExecution.optimizedPlan DataFrame.queryExecution.executedPlan
  74. 74. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Visualizing the Query Plan 74 Effectiveness of Filter CPU Cache 
 Friendly Binary Format Cost-based Join Optimization Similar to MapReduce Map-side Join Peak Memory for Joins and Aggs
  75. 75. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark JSON Data Source DataFrame val ratingsDF = sqlContext.read.format("json") .load("file:/root/pipeline/datasets/dating/ratings.json.bz2") -- or – val ratingsDF = sqlContext.read.json
 ("file:/root/pipeline/datasets/dating/ratings.json.bz2") SQL Code CREATE TABLE genders USING json OPTIONS (path "file:/root/pipeline/datasets/dating/genders.json.bz2") 75 json() convenience method
  76. 76. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark JDBC Data Source Add Driver to Spark JVM System Classpath $ export SPARK_CLASSPATH=<jdbc-driver.jar> DataFrame val jdbcConfig = Map("driver" -> "org.postgresql.Driver", "url" -> "jdbc:postgresql:hostname:port/database", "dbtable" -> ”schema.tablename") df.read.format("jdbc").options(jdbcConfig).load() SQL CREATE TABLE genders USING jdbc 
 OPTIONS (url, dbtable, driver, …) 76
  77. 77. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Parquet Data Source Configuration spark.sql.parquet.filterPushdown=true spark.sql.parquet.mergeSchema=true spark.sql.parquet.cacheMetadata=true spark.sql.parquet.compression.codec=[uncompressed,snappy,gzip,lzo] DataFrames val gendersDF = sqlContext.read.format("parquet") .load("file:/root/pipeline/datasets/dating/genders.parquet") gendersDF.write.format("parquet").partitionBy("gender") .save("file:/root/pipeline/datasets/dating/genders.parquet") SQL CREATE TABLE genders USING parquet OPTIONS (path "file:/root/pipeline/datasets/dating/genders.parquet") 77
  78. 78. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark ORC Data Source Configuration spark.sql.orc.filterPushdown=true DataFrames val gendersDF = sqlContext.read.format("orc") .load("file:/root/pipeline/datasets/dating/genders") gendersDF.write.format("orc").partitionBy("gender") .save("file:/root/pipeline/datasets/dating/genders") SQL CREATE TABLE genders USING orc OPTIONS (path "file:/root/pipeline/datasets/dating/genders") 78
  79. 79. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Third-Party Spark SQL DataSources 79 spark-packages.org
  80. 80. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark CSV DataSource (Databricks) Github https://github.com/databricks/spark-csv Maven com.databricks:spark-csv_2.10:1.2.0 Code val gendersCsvDF = sqlContext.read .format("com.databricks.spark.csv") .load("file:/root/pipeline/datasets/dating/gender.csv.bz2") .toDF("id", "gender") 80 toDF() is required if CSV does not contain header
  81. 81. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark ElasticSearch DataSource (Elastic.co) Github https://github.com/elastic/elasticsearch-hadoop Maven org.elasticsearch:elasticsearch-spark_2.10:2.1.0 Code val esConfig = Map("pushdown" -> "true", "es.nodes" -> "<hostname>", 
 "es.port" -> "<port>") df.write.format("org.elasticsearch.spark.sql”).mode(SaveMode.Overwrite) .options(esConfig).save("<index>/<document-type>") 81
  82. 82. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Elasticsearch Tips Change id field to not_analyzed to avoid indexing Use term filter to build and cache the query Perform multiple aggregations in a single request Adapt scoring function to current trends at query time 82
  83. 83. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark AWS Redshift Data Source (Databricks) Github https://github.com/databricks/spark-redshift Maven com.databricks:spark-redshift:0.5.0 Code val df: DataFrame = sqlContext.read .format("com.databricks.spark.redshift") .option("url", "jdbc:redshift://<hostname>:<port>/<database>…") .option("query", "select x, count(*) my_table group by x") .option("tempdir", "s3n://tmpdir") .load(...) 83 UNLOAD and copy to tmp bucket in S3 enables parallel reads
  84. 84. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark DB2 and BigSQL DataSources (IBM) Coming Soon! 84
  85. 85. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Cassandra DataSource (DataStax) Github https://github.com/datastax/spark-cassandra-connector Maven com.datastax.spark:spark-cassandra-connector_2.10:1.5.0-M1 Code ratingsDF.write .format("org.apache.spark.sql.cassandra") .mode(SaveMode.Append) .options(Map("keyspace"->"<keyspace>", "table"->"<table>")).save(…) 85
  86. 86. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Cassandra Pushdown Support spark-cassandra-connector/…/o.a.s.sql.cassandra.PredicatePushDown.scala Pushdown Predicate Rules 1. Only push down no-partition key column predicates with =, >, <, >=, <= predicate 2. Only push down primary key column predicates with = or IN predicate. 3. If there are regular columns in the pushdown predicates, they should have at least one EQ expression on an indexed column and no IN predicates. 4. All partition column predicates must be included in the predicates to be pushed down, only the last part of the partition key can be an IN predicate. For each partition column, only one predicate is allowed. 5. For cluster column predicates, only last predicate can be non-EQ predicate including IN predicate, and preceding column predicates must be EQ predicates. If there is only one cluster column predicate, the predicates could be any non-IN predicate. 6. There is no pushdown predicates if there is any OR condition or NOT IN condition. 7. We're not allowed to push down multiple predicates for the same column if any of them is equality or IN predicate. 86
  87. 87. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark New Cassandra DataSource (?) By-pass CQL optimized for transactional data Instead, do bulk reads/writes directly on SSTables Similar to 5 year old Netflix Open Source project Aegisthus Promotes Cassandra to first-class Analytics Option Potentially only part of DataStax Enterprise?! Please mail a nasty letter to your local DataStax office 87
  88. 88. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Rumor of REST DataSource (Databricks) Coming Soon? Ask Michael Armbrust Spark SQL Lead @ Databricks 88
  89. 89. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Custom DataSource (Me and You!) Coming Right Now! 89 DEMO ALERT!!
  90. 90. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Create a Custom DataSource Study Existing Native & Third-Party Data Sources Native Spark JDBC (o.a.s.sql.execution.datasources.jdbc) class JDBCRelation extends BaseRelation with PrunedFilteredScan with InsertableRelation Third-Party DataStax Cassandra (o.a.s.sql.cassandra) class CassandraSourceRelation extends BaseRelation with PrunedFilteredScan with InsertableRelation! 90
  91. 91. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc Demo! Create a Custom DataSource 91
  92. 92. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Contribute a Custom Data Source spark-packages.org Managed by Contains links to external github projects Ratings and comments Declare supported Spark version per package Kind of like a package manager Custom Maven Repo ----> Examples https://github.com/databricks/spark-csv https://github.com/datastax/spark-cassandra-connector 92
  93. 93. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Parquet Columnar File Format Based on Google Dremel Collaboration with Twitter and Cloudera Self-describing, evolving schema Fast columnar aggregation Supports filter pushdowns Columnar storage format Excellent compression 93
  94. 94. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Types of Compression Run Length Encoding: Repeated data Dictionary Encoding: Fixed set of values Delta, Prefix Encoding: Sorted data 94
  95. 95. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc Demo! Demonstrate File Formats, Partition Schemes, and Query Plans 95
  96. 96. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Hive JDBC ODBC ThriftServer Allow BI Tools to Query and Process Spark Data Register Permanent Table CREATE TABLE ratings(fromuserid INT, touserid INT, rating INT) USING org.apache.spark.sql.json OPTIONS (path "datasets/dating/ratings.json.bz2") Register Temp Table ratingsDF.registerTempTable("ratings_temp") Configuration spark.sql.thriftServer.incrementalCollect=true spark.driver.maxResultSize > 10gb (default) 96
  97. 97. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc Demo! Query and Process Spark Data from Beeline and/or Tableau 97
  98. 98. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark More Relevant Links http://meetup.com/Advanced-Apache-Spark-Meetup http://advancedspark.com http://github.com/fluxcapacitor/pipeline http://hub.docker.com/r/fluxcapacitor/pipeline http://sortbenchmark.org/ApacheSpark2014.pd https://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-sca le-sorting.html http://0x0fff.com/spark-architecture-shuffle/ http://www.cs.berkeley.edu/~kubitron/courses/cs262a-F13/projects/reports/project16_ report.pdf http://stackoverflow.com/questions/763262/how-does-one-write-code-that-best-utilize s-the-cpu-cache-to-improve-performance http://www.aristeia.com/TalkNotes/ACCU2011_CPUCaches.pdf http://mishadoff.com/blog/java-magic-part-4-sun-dot-misc-dot-unsafe/ http://docs.scala-lang.org/overviews/quasiquotes/intro.html http://lwn.net/Articles/252125/ (Memory Part 2: CPU Caches) http://lwn.net/Articles/255364/ (Memory Part 5: What Programmers Can Do) https://www.safaribooksonline.com/library/view/java-performance-the/9781449363512/ ch04.html http://web.eece.maine.edu/~vweaver/projects/perf_events/perf_event_open.html http://www.brendangregg.com/perf.html https://perf.wiki.kernel.org/index.php/Tutorial http://techblog.netflix.com/2015/07/java-in-flames.html http://techblog.netflix.com/2015/04/introducing-vector-netflixs-on-host.html http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html#Java http://sortbenchmark.org/ApacheSpark2014.pdf https://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-sca le-sorting.html http://0x0fff.com/spark-architecture-shuffle/ http://www.cs.berkeley.edu/~kubitron/courses/cs262a-F13/projects/reports/project16_ report.pdf http://stackoverflow.com/questions/763262/how-does-one-write-code-that-best-utilize s-the-cpu-cache-to-improve-performance http://www.aristeia.com/TalkNotes/ACCU2011_CPUCaches.pdf http://mishadoff.com/blog/java-magic-part-4-sun-dot-misc-dot-unsafe/ http://docs.scala-lang.org/overviews/quasiquotes/intro.html 98 http://lwn.net/Articles/252125/ <-- Memory Part 2: CPU Caches http://lwn.net/Articles/255364/ <-- Memory Part 5: What Programmers Can Do http://antirez.com/news/75 http://esumitra.github.io/algebird-boston-spark/#/ https://github.com/fluxcapacitor/pipeline http://www.cs.umd.edu/~samir/498/Amazon-Recommendations.pdf http://blog.echen.me/2011/10/24/winning-the-netflix-prize-a-summary/ http://spark.apache.org/docs/latest/ml-guide.html http://techblog.netflix.com/2012/04/netflix-recommendations-beyond-5-stars.html (part 1) http://techblog.netflix.com/2012/06/netflix-recommendations-beyond-5-stars.html (part 2) http://www.brendangregg.com/blog/2015-11-06/java-mixed-mode-flame-graphs.html
  99. 99. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc What’s Next? After Dark 1.6 99
  100. 100. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Incorporate New Features of Spark 1.6 https://docs.cloud.databricks.com/docs/spark/1.6/index.html 100
  101. 101. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark What’s Next? Autoscaling Docker/Spark Workers Completely Docker-based Docker Compose, Google Kubernetes Lots of Demos and Examples More Zeppelin & IPython/Jupyter notebooks More advanced analytics use cases Performance Tuning and Profiling Work closely with Netflix & Databricks Identify & fix Spark performance bottlenecks 101
  102. 102. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Upcoming Meetups and Conferences London Spark Meetup (Oct 12th) Scotland Data Science Meetup (Oct 13th) Dublin Spark Meetup (Oct 15th) Barcelona Spark Meetup (Oct 20th) Madrid Big Data Meetup (Oct 22nd) Paris Spark Meetup (Oct 26th) Amsterdam Spark Summit (Oct 27th) Brussels Spark Meetup (Oct 30th) Zurich Big Data Meetup (Nov 2nd) Geneva Spark Meetup (Nov 5th) San Francisco Datapalooza (Nov 10th) San Francisco Advanced Spark (Nov 12th) 102 Oslo Big Data Hadoop Meetup (Nov 19th) Helsinki Spark Meetup (Nov 20th) Stockholm Spark Meetup (Nov 23rd) Copenhagen Spark Meetup (Nov 25th) Budapest Spark Meetup (Nov 26th) Istanbul Spark Meetup (Nov 28th) Singapore Strata Conference (Dec 1st) Sydney Spark Meetup (Dec 7th) Melbourne Spark Meetup (Dec 9th) San Francisco Advanced Spark (Dec 10th) Toronto Spark Meetup (Dec 14th) Austin Data Days Conference (Jan 16th)
  103. 103. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Thank You!!! Chris Fregly IBM Spark Technology Center San Francisco, California (Find me on LinkedIn, Twitter, Github) Relevant Links advancedspark.com Signup for the book & global meetup! github.com/fluxcapacitor/pipeline Clone, contribute, and commit code! hub.docker.com/r/fluxcapacitor/pipeline/wiki Run all demos in your own environment with Docker! 103
  104. 104. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

×