
Sorry - How Bieber broke Google Cloud at Spotify

Talk at Scala Up North Jul 21 2017
We will talk about Spotify's story with Scala and big data: our journey migrating our entire data infrastructure to Google Cloud, and how Justin Bieber contributed to breaking it. We'll cover Scio, a Scala API for Apache Beam and Google Cloud Dataflow, and the technology behind it, including macros, Algebird, Chill and shapeless. There'll also be a live coding demo.


  1. Sorry How Bieber broke Google Cloud at Spotify Neville Li @sinisa_lyh
  2. Who am I? ● Spotify NYC since 2011 ● Formerly Yahoo! Search ● Music recommendations ● Data & ML infrastructure ● Scala since 2013
  3. About Us ● 100M+ active users ● 40M+ subscribers ● 30M+ songs, 20K new per day ● 2B+ playlists ● 1B+ plays per day
  4. And We Have Data
  5. Hadoop at Spotify ● On-premise → Amazon EMR → On-premise ● ~2,500 nodes, largest in EU ● 100PB+ disk, 100TB+ RAM ● 60TB+ per day log ingestion ● 20K+ jobs per day
  6. Data Processing ● Luigi, Python M/R, circa 2011 ● Scalding, Spark, circa 2013 ● 200+ Scala users, 1K+ unique jobs ● Storm for real-time ● Hive for ad-hoc analysis
  7. Moving to Google Cloud, Early 2015
  8. Hive → BigQuery ● Full Avro scans → Columnar storage ● M/R jobs → Optimized execution engine ● Batch → Interactive query ● $$$ → Pay for bytes processed ● Beam/Dataflow integration (Dremel paper, 2010)
  9. Top Tracks in Vancouver, Jun 2017 ● 30 date-partitioned tables, 60TB ● 1 metadata table, 418GB ● 94.2s, 4.82TB processed
     Track             | Artist         | Count
     Despacito - Remix | Luis Fonsi     | 349188
     I'm the One       | DJ Khaled      | 214863
     HUMBLE.           | Kendrick Lamar | 200690
     Unforgettable     | French Montana | 192141
     XO TOUR Llif3     | Lil Uzi Vert   | 148546
     Slide             | Calvin Harris  | 134236
     Attention         | Charlie Puth   | 131514
     Strip That Down   | Liam Payne     | 125191
     2U                | David Guetta   | 124998
  10. Top Bieber Tracks, Jun 2017 ● 30 date-partitioned tables, 60TB ● 1 metadata table, 418GB ● 81.8s, 2.92TB processed
      Track                  | Count
      Love Yourself          | 28400153
      Sorry                  | 27269157
      What Do You Mean?      | 20047452
      Company                | 11794409
      Purpose                | 8314381
      I'll Show You          | 6627706
      Baby                   | 6161099
      Boyfriend              | 5854955
      The Feeling            | 5415162
      As Long As You Love Me | 4738529
  11. Adoption ● ~500 unique users ● ~640K queries per month ● ~500PB queried per month
  12. ● Cluster management, multi-tenancy & resource utilization ● Tuning and scaling ● 3 sets of separate APIs ● Not cloud native & missing Google Cloud connectors
  13. What is Beam and Cloud Dataflow?
  14. The Evolution of Apache Beam: MapReduce, BigTable, Dremel, Colossus, Flume, Megastore, Spanner, PubSub, Millwheel → Apache Beam → Google Cloud Dataflow
  15. What is Apache Beam? 1. The Beam Programming Model 2. SDKs for writing Beam pipelines - Java & Python 3. Runners for existing distributed processing backends ○ Apache Flink (thanks to data Artisans) ○ Apache Spark (thanks to Cloudera and PayPal) ○ Google Cloud Dataflow (fully managed service) ○ Local runner for testing
  16. The Beam Model: Asking the Right Questions ● What results are calculated? ● Where in event time are results calculated? ● When in processing time are results materialized? ● How do refinements of results relate?
  17. Customizing What / Where / When / How ● 1 Classic Batch ● 2 Windowed Batch ● 3 Streaming ● 4 Streaming + Accumulation
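The "Where in event time" question can be illustrated with a toy fixed-window aggregation on plain Scala collections. This is a sketch with illustrative names, not Beam's API: events are bucketed by their event timestamp, and the "What" per window is a sum.

```scala
// Sketch: fixed one-minute event-time windows over in-memory events.
case class Event(key: String, value: Int, timestampMs: Long)

def fixedWindows(events: Seq[Event], sizeMs: Long): Map[(String, Long), Int] =
  events
    .groupBy(e => (e.key, e.timestampMs / sizeMs * sizeMs)) // key + window start
    .map { case (k, es) => k -> es.map(_.value).sum }       // "What" = a sum per window

val events = Seq(Event("a", 1, 0L), Event("a", 2, 30000L), Event("a", 3, 65000L))
val result = fixedWindows(events, 60000L)
// the first two events land in window 0, the third in window 60000
```

In a real streaming pipeline the "When" and "How" questions (triggers and accumulation) decide when each window's sum is emitted and whether late refinements replace or add to earlier ones; a batch job simply sees all events at once.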
  18. Why Beam and Cloud Dataflow?
  19. Beam + Cloud Dataflow ● Unified batch and streaming model ● Hosted, fully managed, no ops ● Auto-scaling, dynamic work re-balance ● GCP ecosystem - BigQuery, Bigtable, Datastore, Pubsub
  20. No More PagerDuty!
  21. Why Beam and Scala?
  22. Beam and Scala ● High level DSL ● Familiarity with Scalding, Spark and Flink ● Functional programming natural fit for data ● Numerical libraries - Breeze, Algebird ● Macros & shapeless for code generation
  23. Scio - A Scala API for Apache Beam and Google Cloud Dataflow
  24. Scio - Ecclesiastical Latin IPA: /ˈʃi.o/, [ˈʃiː.o], [ˈʃi.i̯o] Verb: I can, know, understand, have knowledge.
  25. github.com/spotify/scio - Apache License 2.0
  26. Scio Scala API on top of the Dataflow Java SDK ● Batch, Streaming, Interactive REPL ● I/O: Cloud Storage, Pub/Sub, Datastore, Bigtable, BigQuery ● Scala libraries ● Extra features
  27. WordCount
      val sc = ScioContext()
      sc.textFile("shakespeare.txt")
        .flatMap(_.split("[^a-zA-Z']+").filter(_.nonEmpty))
        .countByValue
        .saveAsTextFile("wordcount.txt")
      sc.close()
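For readers new to the API, the countByValue step behaves like this plain-Scala equivalent, a sketch on in-memory collections rather than Scio code:

```scala
// Word count on plain Scala collections: same per-key semantics as the
// slide's pipeline, minus the distributed I/O.
val lines = Seq("to be or not to be")
val counts = lines
  .flatMap(_.split("[^a-zA-Z']+").filter(_.nonEmpty))
  .groupBy(identity)
  .map { case (w, ws) => (w, ws.size.toLong) }
// counts("to") == 2L, counts("be") == 2L
```

In Scio the same chain runs over an SCollection, so countByValue is a distributed shuffle-and-count rather than an in-memory groupBy.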
  28. PageRank
      def pageRank(in: SCollection[(String, String)]) = {
        val links = in.groupByKey()
        var ranks = links.mapValues(_ => 1.0)
        for (i <- 1 to 10) {
          val contribs = links.join(ranks).values
            .flatMap { case (urls, rank) =>
              val size = urls.size
              urls.map((_, rank / size))
            }
          ranks = contribs.sumByKey.mapValues((1 - 0.85) + 0.85 * _)
        }
        ranks
      }
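The update rule can be checked on in-memory data. This sketch runs a single iteration of the same (1 - 0.85) + 0.85 * contribution formula with plain Scala Maps (a toy three-page graph, not Scio code):

```scala
// One PageRank step: each page's new rank is (1 - d) + d * incoming
// contributions, with damping d = 0.85 as on the slide.
val links = Map("a" -> Seq("b", "c"), "b" -> Seq("c"), "c" -> Seq("a"))
var ranks = links.map { case (k, _) => k -> 1.0 }

val contribs = links.toSeq
  .flatMap { case (url, outs) => outs.map(_ -> ranks(url) / outs.size) }
  .groupBy(_._1)
  .map { case (k, vs) => k -> vs.map(_._2).sum } // in-memory sumByKey

ranks = links.map { case (k, _) => k -> ((1 - 0.85) + 0.85 * contribs.getOrElse(k, 0.0)) }
// "c" receives 0.5 from "a" and 1.0 from "b": 0.15 + 0.85 * 1.5 = 1.425
```

The distributed version repeats this ten times, with join and sumByKey doing the shuffles.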
  29. Why Scio?
  30. Type safe BigQuery - Macro generated case classes, schemas and converters
      @BigQueryType.fromQuery("SELECT id, name FROM [users] WHERE ...")
      class User // look mom no code!
      sc.typedBigQuery[User]().map(u => (u.id, u.name))

      @BigQueryType.toTable
      case class Score(id: String, score: Double)
      data.map(kv => Score(kv._1, kv._2)).saveAsTypedBigQuery("table")
  31. BigQuery + Scio ● Best of both worlds ● BigQuery for slicing and dicing huge datasets ● Scala for custom logic ● Seamless integration
  32. REPL
      $ scio-repl
      Welcome to Scio version 0.3.4
      Using Scala version 2.11.11 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_121)
      Type in expressions to have them evaluated.
      Type :help for more information.
      Using 'scio-test' as your BigQuery project.
      BigQuery client available as 'bq'
      Scio context available as 'sc'
      scio> _
      Available in github.com/spotify/homebrew-public
  33. Other goodies ● DAG and source code visualization ● BigQuery caching, legacy & SQL 2011 support ● HDFS, Protobuf, TensorFlow, Bigtable, Elasticsearch & Cassandra I/O ● Join optimizations - hash, skewed, sparse ● Job metrics ● DistCache, job orchestration
  34. Demo Time!
  35. Adoption ● At Spotify ○ 200+ users, 700+ unique production pipelines (from ~70 10 months ago) ○ Most new to Scala and Scio ○ Both batch and streaming jobs ● Externally ○ ~10 companies, several fairly large ones
  36. Use Cases
  37. Fan Insights ● Listener stats: [artist|track] × [context|geography|demography] × [day|week|month] ● BigQuery, GCS, Datastore ● TBs daily ● 150+ Java jobs → ~10 Scio jobs
  38. Master Metadata ● n1-standard-1 workers: 1 core, 3.75GB RAM ● Autoscaling 2-35 workers ● 26 Avro sources - artist, album, track, disc, cover art, ... ● 120GB out, 70M records ● 200 LOC vs original Java 600 LOC
  39. And we broke Google
  40. Listening history ● user × track × [day|week|month|all time] ● 300B elements ● 800 n1-highmem-32 workers: 32 cores, 208GB RAM ● 240TB in - Bigtable ● 90TB out - GCS
  41. BigDiffy ● Pairwise field-level statistical diff ● Diff 2 SCollection[T] given keyFn: T => String ● T: Avro, BigQuery JSON, Protobuf ● Leaf field Δ - numeric, string (Levenshtein), vector (Cosine) ● Δ statistics - min, max, μ, σ, etc. ● Non-deterministic fields ○ ignore field ○ treat "repeated" field (List) as unordered list (Set) ● Part of github.com/spotify/ratatool
  42. Dataset Diff ● Diff stats ○ Global: # of SAME, DIFF, MISSING LHS/RHS ○ Key: key → SAME, DIFF, MISSING LHS/RHS ○ Field: field → min, max, μ, σ, etc. ● Use cases ○ Validating pipeline migration ○ Sanity checking ML models
  43. Pairwise field-level deltas
      val lKeyed = lhs.keyBy(keyFn)
      val rKeyed = rhs.keyBy(keyFn)
      val deltas = (lKeyed outerJoin rKeyed).map { case (k, (lOpt, rOpt)) =>
        (lOpt, rOpt) match {
          case (Some(l), Some(r)) =>
            val ds = diffy(l, r) // Seq[Delta]
            val dt = if (ds.isEmpty) SAME else DIFFERENT
            (k, (ds, dt))
          case (_, _) =>
            val dt = if (lOpt.isDefined) MISSING_RHS else MISSING_LHS
            (k, (Nil, dt))
        }
      }
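The diffy call is not shown on the slide. As a minimal sketch of what a leaf-field delta might compute, assuming absolute difference for numeric fields and Levenshtein distance for strings (illustrative helpers, not BigDiffy's actual classes):

```scala
// Levenshtein edit distance via the classic dynamic-programming table.
def levenshtein(a: String, b: String): Int = {
  val d = Array.tabulate(a.length + 1, b.length + 1) { (i, j) =>
    if (i == 0) j else if (j == 0) i else 0
  }
  for (i <- 1 to a.length; j <- 1 to b.length) {
    val cost = if (a(i - 1) == b(j - 1)) 0 else 1
    d(i)(j) = math.min(math.min(d(i - 1)(j) + 1, d(i)(j - 1) + 1),
                       d(i - 1)(j - 1) + cost)
  }
  d(a.length)(b.length)
}

// Numeric leaf delta: absolute difference between the two sides.
def numericDelta(l: Double, r: Double): Double = math.abs(l - r)
```

A delta of 0 for every leaf field would put the key in the SAME bucket; any non-zero delta makes it DIFFERENT.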
  44. Summing deltas
      import com.twitter.algebird._
      // convert deltas to map of (field → summable stats)
      def deltasToMap(ds: Seq[Delta], dt: DeltaType)
        : Map[String, (Long, Option[(DeltaType, Min[Double], Max[Double], Moments)])] = {
        // ...
      }
      deltas
        .map { case (_, (ds, dt)) => deltasToMap(ds, dt) }
        .sum // Semigroup!
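The .sum step relies on Algebird's map semigroup: maps are merged key-by-key, and values under the same key are combined with the value semigroup. Its effect can be sketched without Algebird by hand-rolling the merge for a simplified (count, min, max) stats tuple (field names here are illustrative):

```scala
// A simplified per-field stats value and its semigroup.
type Stats = (Long, Double, Double) // (count, min, max)
def plus(a: Stats, b: Stats): Stats =
  (a._1 + b._1, math.min(a._2, b._2), math.max(a._3, b._3))

// Map semigroup: union of keys, combining values present on both sides.
def merge(a: Map[String, Stats], b: Map[String, Stats]): Map[String, Stats] =
  (a.keySet ++ b.keySet).map { k =>
    (a.get(k), b.get(k)) match {
      case (Some(x), Some(y)) => k -> plus(x, y)
      case (x, y)             => k -> x.orElse(y).get
    }
  }.toMap

val perRecord = Seq(
  Map("score" -> ((1L, 0.1, 0.1))),
  Map("score" -> ((1L, 0.5, 0.5)), "count" -> ((1L, 2.0, 2.0)))
)
val total = perRecord.reduce(merge) // what .sum does with Algebird's semigroup
```

Because the merge is associative, the runner can combine partial results on any worker in any order, which is what makes a single .sum over the whole dataset safe.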
  45. 1. Copy & paste from legacy codebase to Scio 2. Verify with BigDiffy 3. Profit!
  46. Featran ● a.k.a. Featran77 or F77 ● Type safe feature transformer for machine learning ● Column-wise aggregation and transformation ● 1 shuffle stage (Algebird aggregation) ● In-memory, Flink, Scalding, Scio & Spark runners ● Scala collection, array, Breeze, TensorFlow & NumPy output formats ● Property-based testing, 100% coverage ● https://github.com/spotify/featran
  47. Transforming features
      case class Record(d1: Double, d2: Option[Double], d3: Double, s: Seq[String])
      val spec = FeatureSpec.of[Record]
        .required(_.d1)(Binarizer("bin", threshold = 0.5))
        .optional(_.d2)(Bucketizer("bucket", Array(0.0, 10.0, 20.0)))
        .required(_.d3)(StandardScaling("std"))
        .required(_.d3)(QuantileDiscretizer("q", 4))
        .required(_.s)(NHotEncoder("nhot"))
      val f = spec.extract(records)
      f.featureNames // human readable names
      f.featureValues[DenseVector[Double]] // output format
      f.featureSettings // settings can be reloaded
  48. shapeless-datatype ● Case class ↔ Avro, BigQuery, Datastore & TensorFlow types ● Compile time and extendable type mapping ● Generic mapper between case class types ● Type and lens based record matcher ● https://github.com/nevillelyh/shapeless-datatype
  49. protobuf-generic ● Generic Protobuf manipulation without compiled classes ● Similar to Avro GenericRecord ● Protobuf type T → JSON schema ● Bytes ↔ JSON with JSON schema ● Command line tool to inspect binary files ● https://github.com/nevillelyh/protobuf-generic
  50. Other uses ● AB testing ○ Statistical analysis with bootstrap and DimSum ● Monetization ○ Ads targeting ○ User conversion analysis ● User understanding ○ Diversity ○ Session analysis ○ Behavior analysis ● Home page ranking ● Audio fingerprint analysis
  51. We're developing so fast that Google is having a hard time keeping up with capacity
  52. Google Cloud Capacity ● Almost too easy to write big complex jobs leveraging BigQuery & Dataflow ● BigQuery export limit 10TB/day/project ● Compute Engine datacenter capacity ● Shuffle disk size & throughput ● Bigtable throttling
  53. And We Have Data
  54. Scala Challenges
  55. Serialization ● Data ser/de ○ Scalding, Spark and Storm use Kryo and Twitter Chill ○ Dataflow/Beam requires an explicit Coder[T], sometimes inferable via Guava TypeToken ○ ClassTag to the rescue, fallback to Kryo/Chill ● Lambda ser/de ○ Custom ClosureCleaner, Serializable and @transient lazy val ○ Scala 2.12, Java 8 lambda issues #10232 #10233
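The @transient lazy val trick for lambda ser/de can be demonstrated with plain Java serialization: the non-serializable field is skipped when the closure is shipped and rebuilt lazily on the worker. A self-contained sketch with illustrative names (MessageDigest is genuinely non-serializable):

```scala
import java.io._

// A "lambda-like" serializable function holding non-serializable state.
class HashFn extends Serializable {
  // MessageDigest is not serializable; @transient lazy val skips it during
  // serialization and re-creates it on first use after deserialization.
  @transient lazy val md = java.security.MessageDigest.getInstance("MD5")
  def apply(s: String): String =
    md.digest(s.getBytes("UTF-8")).map("%02x".format(_)).mkString
}

// Round-trip through Java serialization, as a runner shipping a closure would.
def roundTrip[T](t: T): T = {
  val bos = new ByteArrayOutputStream()
  val oos = new ObjectOutputStream(bos)
  oos.writeObject(t)
  oos.close()
  new ObjectInputStream(new ByteArrayInputStream(bos.toByteArray))
    .readObject().asInstanceOf[T]
}

val fn = roundTrip(new HashFn)
fn("bieber") // the digest is rebuilt after deserialization, no exception
```

Without @transient, serializing the closure would throw NotSerializableException; without lazy, the field would be null after deserialization.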
  56. REPL ● Spark REPL transports lambda bytecode via HTTP ● Dataflow requires a job jar for execution (masterless) ● Custom class loader and ILoop ● Interpreted classes → job jar → job submission ● SCollection[T]#closeAndCollect(): Iterator[T] to mimic Spark actions
  57. Macros and IntelliJ IDEA ● IntelliJ IDEA does not see macro expanded classes https://youtrack.jetbrains.com/issue/SCL-8834 ● @BigQueryType.{fromTable, fromQuery} class MyRecord ● Scio IDEA plugin https://github.com/spotify/scio-idea-plugin
  58. Make IntelliJ Intelligent Again!
  59. Scio in Apache Zeppelin - Local Zeppelin server, remote managed Dataflow cluster, NO OPS
  60. The End - Thank You - Neville Li @sinisa_lyh
