
Scio - Moving to Google Cloud, A Spotify Story

Talk at Philly ETE, Apr 28, 2017
We will talk about Spotify's story of migrating our big data infrastructure to Google Cloud. Over the past year or so we moved away from maintaining our own 2,500+ node Hadoop cluster to managed services in the cloud. We replaced two key components in our data processing stack, Hive and Scalding, with BigQuery and Scio, and are able to iterate at a much faster pace. We will focus on the technical aspects of Scio, a Scala API for Apache Beam and Google Cloud Dataflow, and how it changed the way we process data.


  1. Scio - Moving to Google Cloud, A Spotify Story. Neville Li, @sinisa_lyh
  2. Who am I? ● Spotify NYC since 2011 ● Formerly Yahoo! Search ● Music recommendations ● Data & ML infrastructure ● Scala since 2013
  3. About Us ● 100M+ active users ● 40M+ subscribers ● 30M+ songs, 20K new per day ● 2B+ playlists ● 1B+ plays per day
  4. And We Have Data
  5. Hadoop at Spotify ● On-premise → Amazon EMR → On-premise ● ~2,500 nodes, largest in EU ● 100PB+ disk, 100TB+ RAM ● 60TB+ per day log ingestion ● 20K+ jobs per day
  6. Data Processing ● Luigi, Python M/R, circa 2011 ● Scalding, Spark, circa 2013 ● 200+ Scala users, 1K+ unique jobs ● Storm for real-time ● Hive for ad-hoc analysis
  7. Moving to Google Cloud, Early 2015
  8. Hive ● Row-oriented Avro/TSV → full file scans ● Translates to M/R → disk IO between steps ● Slow and a resource hog ● Immature Parquet support ● Weak UDF and programmatic API
  9. Hive → BigQuery ● Columnar storage ● Optimized execution engine ● Interactive queries ● JavaScript UDFs ● Pay for bytes processed ● Beam/Dataflow integration (Dremel paper, 2010)
  10. Typical Workloads

      Query type                               Hive/Hadoop   BigQuery
      KPI by specific ad hoc parameter         ~1,200s       ~10-20s
      FB audience list for social targeting    ~4,000s       ~15-30s
      Top tracks by age, gender & market       ~18,000s      ~500s
  11. Adoption ● ~500 unique users ● ~640K queries per month ● ~500PB queried per month
  12. ● Scalding for most pipelines ○ Key metrics, Discover Weekly, AB testing analysis, ... ○ Stable and proven at Twitter, Stripe, Etsy, eBay, ... ○ M/R disk IO overhead, Tez not ready ○ No streaming (Summingbird?)
  13. ● Spark ○ Interactive ML, user behavior modelling & prediction, ... ○ Batch, streaming, notebook, SQL and ML ○ Separate APIs for batch and streaming ○ Many operation modes, hard to tune and scale
  14. ● Storm ○ Ads targeting, new user recommendations ○ Separate cluster and ops ○ Low-level API, missing window and join primitives
  15. ● Cluster management ● Multi-tenancy, resource utilization and contention ● 3 sets of separate APIs ● Missing Google Cloud connectors
  16. What are Beam and Cloud Dataflow?
  17. The Evolution of Apache Beam (Google lineage: MapReduce, BigTable, Dremel, Colossus, Flume, Megastore, Spanner, PubSub, Millwheel → Apache Beam → Google Cloud Dataflow)
  18. What is Apache Beam? 1. The Beam Programming Model 2. SDKs for writing Beam pipelines, starting with Java 3. Runners for existing distributed processing backends ○ Apache Flink (thanks to data Artisans) ○ Apache Spark (thanks to Cloudera and PayPal) ○ Google Cloud Dataflow (fully managed service) ○ Local runner for testing
  19. The Beam Model: Asking the Right Questions ● What results are calculated? ● Where in event time are results calculated? ● When in processing time are results materialized? ● How do refinements of results relate?
  20. Customizing What / Where / When / How: 1. Classic Batch 2. Windowed Batch 3. Streaming 4. Streaming + Accumulation
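      To make the four questions concrete, a minimal Scio sketch (not from the talk; the Pub/Sub topics are hypothetical, and triggering and accumulation are left at their defaults): the What is a per-element count, the Where is a 10-minute fixed event-time window.

      import com.spotify.scio._
      import org.joda.time.Duration

      val sc = ScioContext()
      sc.pubsubTopic("projects/my-project/topics/plays")   // hypothetical topic
        .withFixedWindows(Duration.standardMinutes(10))    // Where: event-time windows
        .countByValue                                      // What: per-element counts
        .map { case (track, n) => s"$track,$n" }
        .saveAsPubsub("projects/my-project/topics/counts") // When: as windows trigger
      sc.close()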
  21. The Apache Beam Vision 1. End users: who want to write pipelines in a language that's familiar. 2. SDK writers: who want to make Beam concepts available in new languages. 3. Runner writers: who have a distributed processing environment and want to support Beam pipelines. (Diagram: Beam Java, Beam Python and other-language SDKs feed the Beam Model: Pipeline Construction; the Beam Model: Fn Runners then execute on Apache Flink, Apache Spark or Cloud Dataflow.)
  22. Data Model
      Spark: ● RDD for batch, DStream for streaming ● Two sets of APIs ● Explicit caching semantics
      Beam / Cloud Dataflow: ● PCollection for batch and streaming ● One unified API ● Windowed and timestamped values
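      As a small illustration of "one unified API" with "windowed and timestamped values" in Scio terms (a sketch; the collection is illustrative): every element of a PCollection carries an event timestamp that the same transforms can observe, whether the source is bounded or unbounded.

      import com.spotify.scio._
      import com.spotify.scio.values.SCollection

      val plays: SCollection[String] = sc.textFile("gs://bucket/plays.txt")
      plays
        .withTimestamp                             // SCollection[(String, Instant)]
        .map { case (play, ts) => s"$play @ $ts" } // same code for batch or streaming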
  23. Execution
      Spark: ● One driver, n executors ● Dynamic execution from driver ● Transforms and actions
      Beam / Cloud Dataflow: ● No master ● Static execution planning ● Transforms only, no actions
  24. Why Beam and Cloud Dataflow?
  25. Beam and Cloud Dataflow ● Unified batch and streaming model ● Hosted, fully managed, no ops ● Auto-scaling, dynamic work re-balance ● GCP ecosystem: BigQuery, Bigtable, Datastore, Pubsub
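      A sketch of what "hosted, fully managed, no ops" looks like from the pipeline author's side, assuming Beam-era Dataflow runner flags (the job name, project and worker values are hypothetical): the pipeline code stays the same and the runner is chosen on the command line.

      import com.spotify.scio._

      // Launch with, e.g.:
      //   --project=my-gcp-project --runner=DataflowRunner \
      //   --autoscalingAlgorithm=THROUGHPUT_BASED --maxNumWorkers=50
      object MyScioJob { // hypothetical job
        def main(cmdlineArgs: Array[String]): Unit = {
          val (sc, args) = ContextAndArgs(cmdlineArgs)
          // ... build the pipeline on sc ...
          sc.close()
        }
      }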
  26. No More PagerDuty!
  27. Why Beam and Scala?
  28. Beam and Scala ● High-level DSL ● Familiarity with Scalding, Spark and Flink ● Functional programming, a natural fit for data ● Numerical libraries: Breeze, Algebird ● Macros for code generation
  29. Scio: A Scala API for Apache Beam and Google Cloud Dataflow
  30. Scio. Ecclesiastical Latin IPA: /ˈʃi.o/, [ˈʃiː.o], [ˈʃi.i̯o]. Verb: I can, know, understand, have knowledge.
  31. github.com/spotify/scio - Apache License 2.0
  32. (Architecture diagram: the Scio Scala API sits on the Dataflow Java SDK plus Scala libraries and extra features; it supports batch, streaming and interactive REPL modes, and connects to Cloud Storage, Pub/Sub, Datastore, Bigtable and BigQuery.)
  33. WordCount

      val sc = ScioContext()
      sc.textFile("shakespeare.txt")
        .flatMap { _
          .split("[^a-zA-Z']+")
          .filter(_.nonEmpty)
        }
        .countByValue
        .saveAsTextFile("wordcount.txt")
      sc.close()
  34. PageRank

      def pageRank(in: SCollection[(String, String)]) = {
        val links = in.groupByKey()
        var ranks = links.mapValues(_ => 1.0)
        for (i <- 1 to 10) {
          val contribs = links.join(ranks).values
            .flatMap { case (urls, rank) =>
              val size = urls.size
              urls.map((_, rank / size))
            }
          ranks = contribs.sumByKey.mapValues((1 - 0.85) + 0.85 * _)
        }
        ranks
      }
  35. Why Scio?
  36. Type safe BigQuery: macro generated case classes, schemas and converters

      @BigQueryType.fromQuery("SELECT id, name FROM [users] WHERE ...")
      class User // look mom no code!
      sc.typedBigQuery[User]().map(u => (u.id, u.name))

      @BigQueryType.toTable
      case class Score(id: String, score: Double)
      data.map(kv => Score(kv._1, kv._2)).saveAsTypedBigQuery("table")
  37. BigQuery + Scio ● Best of both worlds ● BigQuery for slicing and dicing huge datasets ● Scala for custom logic ● Seamless integration
  38. REPL

      $ scio-repl
      Welcome to
      (Scio ASCII art banner) version 0.3.0-beta3
      Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_121)
      Type in expressions to have them evaluated.
      Type :help for more information.
      Using 'scio-test' as your BigQuery project.
      BigQuery client available as 'bq'
      Scio context available as 'sc'
      scio> _

      Available in github.com/spotify/homebrew-public
  39. Future based orchestration

      // Job 1
      val f: Future[Tap[String]] = data1.saveAsTextFile("output")
      sc1.close() // submit job
      val t: Tap[String] = Await.result(f, Duration.Inf)
      t.value.foreach(println) // Iterator[String]

      // Job 2
      val sc2 = ScioContext(options)
      val data2: SCollection[String] = t.open(sc2)
  40. DistCache

      val sw = sc.distCache("gs://bucket/stopwords.txt") { f =>
        Source.fromFile(f).getLines().toSet
      }
      sc.textFile("gs://bucket/shakespeare.txt")
        .flatMap { _
          .split("[^a-zA-Z']+")
          .filter(w => w.nonEmpty && !sw().contains(w))
        }
        .countByValue
        .saveAsTextFile("wordcount.txt")
  41. Other goodies ● DAG and source code visualization ● BigQuery caching, legacy & SQL 2011 support ● HDFS, Protobuf, TensorFlow and Bigtable I/O ● Join optimizations: hash, skewed, sparse (see the sketch below) ● Job metrics and accumulators
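      For the join optimizations, a minimal sketch of Scio's hashJoin, assuming the right-hand side is small (the collections are illustrative): the small side is distributed to every worker as a side input, so the join avoids a full shuffle but must fit in worker memory.

      import com.spotify.scio.values.SCollection

      val plays: SCollection[(String, String)] = ??? // userId -> trackId (large)
      val users: SCollection[(String, String)] = ??? // userId -> country (small)

      // (userId, (trackId, country)) without shuffling the large side
      val joined: SCollection[(String, (String, String))] = plays.hashJoin(users)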
  42. Scio is the Swiss Army knife of data processing on Google Cloud
  43. Demo Time!
  44. Adoption ● At Spotify ○ 200+ users, 400+ production pipelines (up from ~70 six months ago) ○ Most of them new to Scala and Scio ○ Both batch and streaming jobs ● Externally ○ ~10 companies, several fairly large ones
  45. Collaboration ● Open source model ○ Discussion on Slack & Gitter ○ Mailing lists ○ Issue tracking on public GitHub ○ Community driven feature development: type safe BigQuery, Bigtable, Datastore, Protobuf
  46. 3: # of developers behind Scio
  47. Use Cases
  48. Release Radar ● 50 n1-standard-1 workers ● 1 core, 3.75GB RAM each ● 130GB in - Avro & Bigtable ● 130GB out x 2 - Bigtable in US+EU ● 110M Bigtable mutations ● 120 LOC
  49. Fan Insights ● Listener stats: [artist|track] × [context|geography|demography] × [day|week|month] ● BigQuery, GCS, Datastore ● TBs daily ● 150+ Java jobs → ~10 Scio jobs
  50. Master Metadata ● n1-standard-1 workers ● 1 core, 3.75GB RAM each ● Autoscaling 2-35 workers ● 26 Avro sources - artist, album, track, disc, cover art, ... ● 120GB out, 70M records ● 200 LOC vs original Java 600 LOC
  51. BigDiffy ● Pairwise field-level statistical diff ● Diff 2 SCollection[T] given keyFn: T => String ● T: Avro, BigQuery, Protobuf ● Leaf field Δ - numeric, string, vector ● Δ statistics - min, max, μ, σ, etc. ● Non-deterministic fields ○ ignore field ○ treat "repeated" field as unordered list ● Part of github.com/spotify/ratatool
  52. Dataset Diff ● Diff stats ○ Global: # of SAME, DIFF, MISSING LHS/RHS ○ Key: key → SAME, DIFF, MISSING LHS/RHS ○ Field: field → min, max, μ, σ, etc. ● Use cases ○ Validating pipeline migration ○ Sanity checking ML models
  53. Pairwise field-level deltas

      val lKeyed = lhs.keyBy(keyFn)
      val rKeyed = rhs.keyBy(keyFn)
      val deltas = (lKeyed outerJoin rKeyed).map { case (k, (lOpt, rOpt)) =>
        (lOpt, rOpt) match {
          case (Some(l), Some(r)) =>
            val ds = diffy(l, r) // Seq[Delta]
            val dt = if (ds.isEmpty) SAME else DIFFERENT
            (k, (ds, dt))
          case (_, _) =>
            val dt = if (lOpt.isDefined) MISSING_RHS else MISSING_LHS
            (k, (Nil, dt))
        }
      }
  54. Summing deltas

      import com.twitter.algebird._

      // convert deltas to map of (field → summable stats)
      def deltasToMap(ds: Seq[Delta], dt: DeltaType)
      : Map[String, (Long, Option[(DeltaType, Min[Double], Max[Double], Moments)])] = {
        // ...
      }

      deltas
        .map { case (_, (ds, dt)) => deltasToMap(ds, dt) }
        .sum // Semigroup!
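      The .sum on the last line works because Algebird derives a Semigroup for Map[K, V] whenever V has one, so the per-record maps combine field-wise; a minimal illustration of the principle, with made-up values:

      import com.twitter.algebird._

      // Each field maps to (count, Min, Max); the tuple and the Map both have
      // Semigroup instances, so two records' stats merge per field.
      val a = Map("score" -> (1L, Min(0.5), Max(0.5)))
      val b = Map("score" -> (1L, Min(0.2), Max(0.9)))
      Semigroup.plus(a, b) // Map("score" -> (2L, Min(0.2), Max(0.9)))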
  55. 1. Copy & paste from legacy codebase to Scio 2. Verify with BigDiffy 3. Profit!
  56. Other uses ● AB testing ○ Statistical analysis with bootstrap and DimSum ● Monetization ○ Ads targeting ○ User conversion analysis ● User understanding ○ Diversity ○ Session analysis ○ Behavior analysis ● Home page ranking ● Audio fingerprint analysis
  57. The End. Thank You! Neville Li, @sinisa_lyh
