
Deconstructing Recommendations on Spark (Ilya Ganelin, Capital One)



Presentation at Spark Summit 2015



  1. Lessons from Spark (or: How I learned to stop worrying and love the shuffle)
     By: Ilya Ganelin
  2. What:
     •  Recommendations (MLlib)
        –  Collaborative Filtering (CF)
        –  65 million × 10 million
        –  2.5 TB, ~12 PFLOPs
        –  All recs (the long tail)
     •  You're mad!
        –  Don't usually generate all possible recs
        –  6 data nodes (~40 GB RAM)
        –  Shared cluster
        –  Spark stability
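For concreteness, collaborative filtering in MLlib looks roughly like the sketch below. This is a minimal, hedged example: the input path, rank, iteration count, and lambda are hypothetical placeholders, not the settings from the talk.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.recommendation.{ALS, Rating}

// Minimal MLlib CF sketch; hyperparameters are illustrative only.
val sc = new SparkContext(new SparkConf().setAppName("cf-sketch"))

val ratings = sc.textFile("hdfs:///ratings.csv").map { line =>
  val Array(user, item, score) = line.split(",")
  Rating(user.toInt, item.toInt, score.toDouble)
}

val model = ALS.train(ratings, 10 /* rank */, 10 /* iterations */, 0.01 /* lambda */)

// Scoring a single (user, item) pair: the baseline that the talk later
// speeds up 75x with caching, broadcasts, and driver-side threading.
val score = model.predict(42, 7)
```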
  3. Overview
     •  Goal:
        –  Understand how Spark internals drive design and configuration
     •  Contents:
        –  Background
           •  Partitions
           •  Caching
           •  Serialization
           •  Shuffle
        –  Lessons 1-4
  4. Background
  5. Partitions, Caching, and Serialization
     •  Partitions
        –  How data is split on disk
        –  Affects memory usage and shuffle size
        –  Count ~ speed, count ~ 1/memory: more partitions mean smaller, faster tasks that each hold less data in memory
     •  Caching
        –  Persist RDDs in distributed memory
        –  Major speedup for repeated operations
     •  Serialization
        –  Efficient movement of data
        –  Java vs. Kryo
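A minimal sketch pulling the three levers together, assuming Spark 1.x-era APIs; the input path, partition count, and storage level are illustrative:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// Kryo instead of default Java serialization: more compact, faster.
val conf = new SparkConf()
  .setAppName("partitions-caching-serialization")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
val sc = new SparkContext(conf)

// Partition count is a speed/memory trade-off: more partitions means
// smaller, cheaper tasks, at the cost of more scheduling overhead.
val data = sc.textFile("hdfs:///input").repartition(400)

// Cache an RDD that is reused by more than one action.
data.persist(StorageLevel.MEMORY_AND_DISK_SER)
data.count()  // first action materializes the cache
data.count()  // second action reads from the cache
```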
  6. Shuffle?
  7. Shuffle!
     •  All-to-all operations
        –  reduceByKey, groupByKey
     •  Data movement
        –  Serialization
        –  Akka
     •  Memory overhead
        –  Dumps to disk when OOM
        –  Garbage collection
     •  EXPENSIVE!
     (Slide diagram: Map → Reduce)
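The canonical illustration of shuffle cost, as a small sketch with toy data: both operations below compute per-key sums, but they move very different amounts of data across the network.

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("shuffle-sketch"))
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

// groupByKey ships every (key, value) pair across the network before
// aggregating: maximum shuffle traffic.
val sums1 = pairs.groupByKey().mapValues(_.sum)

// reduceByKey combines values map-side first, so only one partial sum
// per key per partition crosses the wire.
val sums2 = pairs.reduceByKey(_ + _)
```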
  8. Lessons
  9. Lesson 1: Spark is a problem child!
     •  Memory
        –  You're using more than you think
           •  JVM overhead
           •  Spark metadata (shuffles, long-running jobs)
           •  Scala vs. Java
        –  Shuffle & heap
     •  Debugging is hard
        –  Distributed logs
        –  Hundreds of tasks
  10. Lesson 1: Discipline
      •  Tame the beast (memory)
         –  Partition wisely
         –  Know your data! (size, types, distribution)
         –  Kryo serialization
      •  Cleanup
         –  Long-running jobs consume memory indefinitely
         –  Spark context cleanup fails in production environments
         –  Solution: YARN!
            •  Separate spark-submit per batch
            •  A stable Spark-based job that runs for weeks
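A sketch of the serialization discipline, assuming a Spark 1.2+ SparkConf; the `Rec` case class is a hypothetical stand-in for whatever record type dominates your shuffles and caches:

```scala
import org.apache.spark.SparkConf

// Kryo plus explicit class registration keeps serialized shuffle and
// cache data compact; unregistered classes pay for full class names.
case class Rec(user: Int, item: Int, score: Double)

val conf = new SparkConf()
  .setAppName("disciplined-batch")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[Rec]))
```

The cleanup fix is operational rather than code-level: running each batch as its own spark-submit under YARN gives every batch a fresh JVM, so leaked memory dies with the batch instead of accumulating for weeks.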
  11. Lesson 2: Avoid shuffles!
      •  Why?
         –  Speed up execution
         –  Increase stability
         –  ????
         –  Profit!
      •  How?
         –  Use the driver!
            •  Collect
            •  Broadcast
            •  Accumulators
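A sketch of the driver-side pattern with toy data, assuming the small side of the join fits comfortably in driver and executor memory:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("driver-tricks"))
val big = sc.parallelize(1 to 1000000).map(i => (i % 100, i))

// Collect a *small* RDD to the driver, then broadcast it: every
// executor receives the table once, and the join becomes a local
// lookup instead of a shuffle-heavy join().
val lookup = sc.broadcast(
  sc.parallelize(Seq((1, "a"), (2, "b"))).collectAsMap())

val joined = big.flatMap { case (k, v) =>
  lookup.value.get(k).map(label => (k, v, label))
}

// Accumulators send small per-task results back to the driver.
val matched = sc.accumulator(0L)
joined.foreach { _ => matched += 1L }
println(matched.value)
```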
  12. Lesson 3: Using the driver is hard!
      •  Limited memory
         –  Collected RDDs
         –  Metadata
         –  Results (accumulators)
      •  Akka messaging
         –  10e6 records × 120 bytes ≈ 1.2 GB across 20 partitions
         –  ~60 MB read per partition (default frame size is 10 MB)
         –  Solution: partition appropriately and set akka.frameSize: know your data!
      •  Big data
         –  Solution: batch processing
         –  Problem: cleanup and long-term stability
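A hedged sketch of both fixes, using the Spark 1.x config key spark.akka.frameSize (value in MB); the frame size, path, and partition counts are illustrative:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Results that flow through the driver travel over Akka, so the frame
// size must exceed the largest per-partition payload. Mirroring the
// slide's numbers: 10e6 records x 120 bytes ~ 1.2 GB over 20
// partitions is ~60 MB each, far above the 10 MB default.
val conf = new SparkConf()
  .setAppName("sized-for-the-driver")
  .set("spark.akka.frameSize", "128") // in MB

val sc = new SparkContext(conf)

// Or shrink each partition instead: 200 partitions => ~6 MB apiece.
val recs = sc.textFile("hdfs:///recs").repartition(200)
```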
  13. Lesson 4: Speed!
      •  Cache, but cache wisely
         –  If you use it twice, cache it
      •  Broadcast variables
         –  Visible to all executors
         –  Serialized only once
         –  Blazing-fast lookups!
      •  Threading
         –  Thread pool on the driver
         –  Fast operations, many tasks
      •  75x speedup over MLlib ALS predict()
         –  Start: 1 rec / 1.5 seconds
         –  End: 50 recs / second
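A minimal sketch of driver-side threading, assuming independent per-user jobs; the pool size and the per-user work are placeholders for the talk's rec-scoring calls:

```scala
import java.util.concurrent.Executors
import scala.concurrent.duration._
import scala.concurrent.{Await, ExecutionContext, Future}
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("threaded-driver"))

// SparkContext is thread-safe: a driver-side pool can submit many
// small, independent jobs concurrently and keep the cluster saturated.
implicit val ec = ExecutionContext.fromExecutorService(
  Executors.newFixedThreadPool(8))

// Placeholder per-user work standing in for per-user rec scoring.
val jobs = (1 to 100).map { user =>
  Future { sc.parallelize(1 to 1000).map(_ * user).sum() }
}
val results = Await.result(Future.sequence(jobs), 10.minutes)
ec.shutdown()
```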
  14. Questions?
