
Data processing at Spotify using Scio



Two years ago, Spotify introduced Scio, an open-source Scala framework to develop data pipelines and deploy them on Google Dataflow. In this talk, we will discuss the evolution of Scio and share highlights from running it in production for two years. We will showcase several interesting data processing workflows run at Spotify, what we learned from running them in production, and how we leveraged that knowledge to make Scio faster, safer, and easier to use.


  1. Data processing @Spotify using Scio. Julien Tournay | Scala Matsuri 2019
  2. about:jto − 1.5 years Spotifier (🇸🇪 Stockholm) − Data engineer at #flatmap (Data Infrastructure)
  3. about:spotify − Audio streaming subscription service − 100M+ Subscribers − 217M+ Monthly Active Users − 50M+ Songs − 3B+ Playlists − 79 Markets https://newsroom.spotify.com/company-info/
  4. Data-processing
  5. scio: use cases − Discover Weekly − Release Radar − Daily Mixes − Your 2018 Wrapped − Fan insights − ... (Personalized features: weekly playlist updates, new releases, recommendations, and more.)
  6. scalability − Volume of data − Number of datasets − Number of data-engineers / data-scientists
  7. challenges − Scheduling − Orchestration − Processing (batch & streaming) − Testing − Storage − 💰 (runtime & storage)
  8. challenges − Scheduling − Orchestration − Lineage − Permissions − Processing (batch & streaming) − Encryption (GDPR) − Monitoring − Data quality − Testing − Latency − Incident handling − Storage − Discovery − Lifecycle − Atomicity − 💰💰💰 (runtime, storage, incidents, lateness, skewness, productivity, ...) (Expenses grow together with the list of challenges.)
  9. DATA PROCESSING is HARD & EXPENSIVE
  10. data-processing @spotify ☑ It works ☑ It works reliably ☐ It works reliably and efficiently ☐ It works reliably, efficiently and easily (We are at the "works reliably" stage; efficiency and ease of use are the goals.)
  11. Scio
  12. about:scio − A Scala API for data processing − Based on Apache Beam − Unified batch and streaming − Open source (Apache v2.0) − Runs on Dataflow, Spark, Flink, ... https://github.com/spotify/scio
      val sc = ScioContext()
      sc.textFile(sourcePath)
        .flatMap { _.split("\\s+").filter(_.nonEmpty) }
        .countByValue
        .saveAsTextFile(target)
      sc.close()
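The snippet above is a word count; because the pipeline body is ordinary Scala, its transformation logic can be checked on a plain in-memory collection without any Beam runner. A minimal stand-in (the `countWords` name and `Seq`-based signature are illustrative, not Scio's API):

```scala
object WordCount {
  // Same logic as the Scio snippet: split lines on whitespace,
  // drop empty tokens, and count occurrences of each word.
  def countWords(lines: Seq[String]): Map[String, Long] =
    lines
      .flatMap(_.split("\\s+").filter(_.nonEmpty))
      .groupBy(identity)
      .map { case (word, occurrences) => word -> occurrences.size.toLong }
}
```

In real pipelines, Scio additionally ships in-memory testing utilities, which is part of the testability story on the next slides.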
  13. Scio − Good documentation − Predictability − Performant − Productivity − Type safety
  14. Scala @spotify − Scala for data engineering − Scala Center advisory board member − Open source :) https://github.com/spotify/
  15. Scio @Spotify − 350+ users (data engineers, data scientists, backend engineers) − 1000s of unique production jobs − Batch and streaming
  16. Efficiency: serialization
  17. Scio 0.7.0 − Released Jan 18 − New static Coders − Redesigned BQ client − Refactored IO − Better modularity − Bugfixes − etc.
  18. Coders in 0.7 − From Kryo to type-safe coders − Predictability − Testability − Performance − Granularity − Automatic derivation at compile time − Fallback to Kryo
  19. 0.6 vs 0.7
      0.6: def map[U: ClassTag](f: T => U): SCollection[U] (unsafe, Kryo-based, runtime black magic)
      0.7: def map[U: Coder](f: T => U): SCollection[U] (safer, simpler, compile-time black magic)
      (Mostly) automated migration using scalafix
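The shift from a `ClassTag` to a `Coder` context bound means serialization is resolved by implicit search at compile time rather than by reflection at runtime. A toy typeclass in the same spirit (the `Coder` trait and instances below are a hand-rolled sketch, not Scio's actual implementation, which also derives instances for case classes automatically):

```scala
import java.nio.ByteBuffer

// A minimal serialization typeclass: one instance per element type.
trait Coder[T] {
  def encode(t: T): Array[Byte]
  def decode(bs: Array[Byte]): T
}

object Coder {
  implicit val intCoder: Coder[Int] = new Coder[Int] {
    def encode(t: Int): Array[Byte] = ByteBuffer.allocate(4).putInt(t).array()
    def decode(bs: Array[Byte]): Int = ByteBuffer.wrap(bs).getInt
  }
  implicit val stringCoder: Coder[String] = new Coder[String] {
    def encode(t: String): Array[Byte] = t.getBytes("UTF-8")
    def decode(bs: Array[Byte]): String = new String(bs, "UTF-8")
  }
  // Composition: a pair coder built from the component coders,
  // length-prefixing the first element so decode knows where to split.
  implicit def pairCoder[A, B](implicit ca: Coder[A], cb: Coder[B]): Coder[(A, B)] =
    new Coder[(A, B)] {
      def encode(t: (A, B)): Array[Byte] = {
        val head = ca.encode(t._1)
        ByteBuffer.allocate(4).putInt(head.length).array() ++ head ++ cb.encode(t._2)
      }
      def decode(bs: Array[Byte]): (A, B) = {
        val len = ByteBuffer.wrap(bs).getInt
        (ca.decode(bs.slice(4, 4 + len)), cb.decode(bs.drop(4 + len)))
      }
    }
}

object CoderDemo {
  // Mirrors the 0.7 signature: the element type needs a Coder at compile time,
  // so a missing instance is a compile error, not a runtime Kryo fallback.
  def roundTrip[T](t: T)(implicit c: Coder[T]): T = c.decode(c.encode(t))
}
```

If no `Coder[T]` instance is in scope, `roundTrip` simply does not compile, which is the predictability win the slide describes.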
  20. Results
  21. Anonymisation job − Encrypt all personal data − Each user has unique keys − Runs hourly https://labs.spotify.com/2018/09/18/scalable-user-privacy/
  22. Anonymisation optimisation − Replaced Kryo with custom coders for Avro's GenericRecord − Kryo was really inefficient − Only possible in Scio > 0.7 − Scio now emits a compile-time warning for GenericRecord
  23. Anonymisation job cost (down 60% for the largest event)
  24. Anonymisation job runtime (in minutes, largest event)
  25. (YMMV but) DO NOT OVERLOOK SERIALIZATION
  26. Efficiency: joins
  27. joins − Really common use case − 😎 Large x Small − 😅 Large x Medium − 😱 Large x Large(-ish) → 💰💰💰💰💰💰 (Shuffle)
  28. SMB join − Sort Merge Bucket join − Store data bucketed (sharded by key) − Sort the content of each bucket by key
  29. Bucketing [diagram: record ids are hashed into B = 3 buckets; each bucket stores its keys in sorted order]
  30. Joining [diagram: corresponding sorted buckets of datasets L and R are merge-joined pairwise]
  31. SMB join − Shuffle once, join everywhere → amortized cost − PR in Apache Beam (https://github.com/apache/beam/pull/8486) − Goal: handle gotchas automatically: store and check bucketing metadata; handle skewness; support joining datasets with a different number of buckets − Bonus: storage is more efficient (better compression)
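The two diagrams above condense into a small in-memory sketch of the technique, assuming integer keys and the same bucketing scheme on both sides. All names below are illustrative; the production implementation is the Beam PR linked above:

```scala
object SmbJoin {
  // Write side: assign each record to one of numBuckets buckets by key hash,
  // then sort each bucket by key. This is done once, when the dataset is written.
  def bucketAndSort[V](records: Seq[(Int, V)], numBuckets: Int): Vector[Vector[(Int, V)]] = {
    val buckets = Vector.fill(numBuckets)(Vector.newBuilder[(Int, V)])
    records.foreach { case r @ (k, _) => buckets(math.floorMod(k.##, numBuckets)) += r }
    buckets.map(_.result().sortBy(_._1))
  }

  // Read side: classic merge join of two buckets already sorted by key; no shuffle.
  def mergeJoin[A, B](left: Vector[(Int, A)], right: Vector[(Int, B)]): Vector[(Int, (A, B))] = {
    val out = Vector.newBuilder[(Int, (A, B))]
    var i = 0
    var j = 0
    while (i < left.length && j < right.length) {
      val kl = left(i)._1
      val kr = right(j)._1
      if (kl < kr) i += 1
      else if (kl > kr) j += 1
      else {
        // Emit the cross product of the runs of records sharing this key.
        val iEnd = left.indexWhere(_._1 != kl, i) match { case -1 => left.length; case n => n }
        val jEnd = right.indexWhere(_._1 != kl, j) match { case -1 => right.length; case n => n }
        for ((_, a) <- left.slice(i, iEnd); (_, b) <- right.slice(j, jEnd))
          out += ((kl, (a, b)))
        i = iEnd
        j = jEnd
      }
    }
    out.result()
  }

  // Join two datasets bucketed with the same scheme, bucket by bucket:
  // matching keys always land in the same bucket index on both sides.
  def smbJoin[A, B](left: Seq[(Int, A)], right: Seq[(Int, B)],
                    numBuckets: Int): Vector[(Int, (A, B))] = {
    val lb = bucketAndSort(left, numBuckets)
    val rb = bucketAndSort(right, numBuckets)
    (0 until numBuckets).toVector.flatMap(b => mergeJoin(lb(b), rb(b)))
  }
}
```

The amortization on the slide comes from doing `bucketAndSort` at write time: every later join only pays for the cheap `mergeJoin` pass.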
  32. Ease of use: BeamSQL
  33. Scio 0.8 − SchemaCoder (structure-aware coders) − BeamSQL − Automatic type conversion − Better coder support for Java classes − Simpler job completion API (removes Futures / ExecutionContext) − Bugfixes − etc.
  34. BeamSQL
      val coll: SCollection[User] = ???
      val r: SCollection[(String, List[String])] =
        sql"""
          SELECT username, emails
          FROM ${coll}
        """.as[(String, List[String])]
      − Is the query valid SQL?
      − Are `username` and `emails` valid fields in `User`?
      − What's the type of `username`?
      − What's the type of `emails`?
      − Can the result be converted to the expected type?
  35. BeamSQL
      val r =
      -   sql"""
      +   tsql"""
            SELECT username, emails
            FROM ${coll}
          """.as[(String, List[String])]
  36. BeamSQL
      tsql"""
        SELECT username, emails
        FOM ${coll}
      """.as[(String, List[String])]
  37. BeamSQL
      tsql"""
        SELECT username, emails
        FOM ${coll}
      """.as[(String, List[String])]
      ParseException: Encountered "PCOLLECTION" at line 4, column 8. Was expecting one of:
        <EOF> "ORDER" … "OFFSET" … "FETCH" … "FROM" … "," …
      Query: Select username, emails fom PCOLLECTION
      (The typo is caught at compile time.)
  38. BeamSQL
      val coll: SCollection[User] = ???
      tsql"""
        SELECT name, emails
        FROM ${coll}
      """.as[(String, List[String])]
      SqlValidatorException: Column 'name' not found in any table
      PCOLLECTION schema:
      ┌──────────┬──────────┬──────────┐
      │ NAME     │ TYPE     │ NULLABLE │
      ├──────────┼──────────┼──────────┤
      │ username │ STRING   │ NO       │
      │ emails   │ STRING[] │ NO       │
      └──────────┴──────────┴──────────┘
      Expected schema:
      ┌──────────┬──────────┬──────────┐
      │ NAME     │ TYPE     │ NULLABLE │
      ├──────────┼──────────┼──────────┤
      │ _1       │ STRING   │ NO       │
      │ _2       │ STRING[] │ NO       │
      └──────────┴──────────┴──────────┘
  39. BeamSQL
      tsql"""
        SELECT username, emails
        FROM ${coll}
      """.as[(Int, List[String])]
  40. BeamSQL
      tsql"""
        SELECT username, emails
        FROM ${coll}
      """.as[(Int, List[String])]
      Inferred schema for query is not compatible with the expected schema.
      Query result schema (inferred):
      ┌──────────┬──────────┬──────────┐
      │ NAME     │ TYPE     │ NULLABLE │
      ├──────────┼──────────┼──────────┤
      │ username │ STRING   │ NO       │
      │ emails   │ STRING[] │ NO       │
      └──────────┴──────────┴──────────┘
      Expected schema:
      ┌──────────┬──────────┬──────────┐
      │ NAME     │ TYPE     │ NULLABLE │
      ├──────────┼──────────┼──────────┤
      │ _1       │ INT32    │ NO       │
      │ _2       │ STRING[] │ NO       │
      └──────────┴──────────┴──────────┘
  41. Automatic type conversion
      val in: SCollection[A] = ???
      val r: SCollection[B] = in.to[B](To.safe)
      − Convert between classes without boilerplate
      − Supports Java beans
      − Supports Scala case classes
      − Supports Avro SpecificRecord
  42. Type conversion
      val in: SCollection[A] = ???
      val r: SCollection[B] = in.to[B](To.safe)
      Schemas are not compatible:
      A schema:
      ┌──────┬─────────┬──────────┐
      │ NAME │ TYPE    │ NULLABLE │
      ├──────┼─────────┼──────────┤
      │ i    │ INT32   │ NO       │
      │ s    │ STRING  │ NO       │
      │ e    │ ROW     │ NO       │
      │ e.xs │ INT64[] │ NO       │
      │ e.q  │ STRING  │ NO       │
      └──────┴─────────┴──────────┘
      B schema:
      ┌──────┬─────────┬──────────┐
      │ NAME │ TYPE    │ NULLABLE │
      ├──────┼─────────┼──────────┤
      │ q    │ STRING  │ NO       │
      │ xs   │ INT64[] │ NO       │
      └──────┴─────────┴──────────┘
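Slide 42 shows `To.safe` rejecting the conversion at compile time: B's fields do exist in A, but nested under `e` rather than at the top level, so the schemas do not line up. With hypothetical case classes modeled on those schemas, the mapping the compiler refuses to derive would have to be written by hand, e.g.:

```scala
// Hypothetical case classes matching slide 42's A and B schemas.
case class E(xs: List[Long], q: String)
case class A(i: Int, s: String, e: E)
case class B(q: String, xs: List[Long])

object Convert {
  // Explicit conversion: pick B's fields out of the nested A.e by name.
  // To.safe only derives conversions where fields match at the same level.
  def aToB(a: A): B = B(q = a.e.q, xs = a.e.xs)
}
```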
  43. github.com/spotify/scio spotify.github.io/scio @skaalf gitter.im/spotify/scio spotifyjobs.com