Sorry
How Bieber broke
Google Cloud at Spotify
Neville Li
@sinisa_lyh
Who am I?
● Spotify NYC since 2011
● Formerly Yahoo! Search
● Music recommendations
● Data & ML infrastructure
● Scala since 2013
About Us
● 100M+ active users
● 40M+ subscribers
● 30M+ songs, 20K new per day
● 2B+ playlists
● 1B+ plays per day
And We Have Data
Hadoop at Spotify
● On-premise → Amazon EMR → On-premise
● ~2,500 nodes, largest in EU
● 100PB+ Disk, 100TB+ RAM
● 60TB+ per day log ingestion
● 20K+ jobs per day
Data Processing
● Luigi, Python M/R, circa 2011
● Scalding, Spark, circa 2013
● 200+ Scala users, 1K+ unique jobs
● Storm for real-time
● Hive for ad-hoc analysis
Moving to Google Cloud
Early 2015
Hive → BigQuery
● Full Avro scans → Columnar storage
● M/R jobs → Optimized execution engine
● Batch → interactive query
● $$$ → Pay for bytes processed
● Beam/Dataflow integration
Dremel Paper, 2010
Top Tracks in Vancouver Jun 2017
● 30 date partitioned tables, 60TB
● 1 metadata table, 418GB
● 94.2s, 4.82TB processed
Track | Artist | Count
Despacito - Remix | Luis Fonsi | 349188
I'm the One | DJ Khaled | 214863
HUMBLE. | Kendrick Lamar | 200690
Unforgettable | French Montana | 192141
XO TOUR Llif3 | Lil Uzi Vert | 148546
Slide | Calvin Harris | 134236
Attention | Charlie Puth | 131514
Strip That Down | Liam Payne | 125191
2U | David Guetta | 124998
Top Bieber Tracks Jun 2017
● 30 date partitioned tables, 60TB
● 1 metadata table, 418GB
● 81.8s, 2.92TB processed
Track | Count
Love Yourself | 28400153
Sorry | 27269157
What Do You Mean? | 20047452
Company | 11794409
Purpose | 8314381
I'll Show You | 6627706
Baby | 6161099
Boyfriend | 5854955
The Feeling | 5415162
As Long As You Love Me | 4738529
Adoption
● ~500 unique users
● ~640K queries per month
● ~500PB queried per month
● Cluster management, multi-tenancy & resource utilization
● Tuning and scaling
● 3 sets of separate APIs
● Not cloud native & missing Google Cloud connectors
What is Beam and
Cloud Dataflow?
The Evolution of Apache Beam
MapReduce, BigTable, Dremel, Colossus,
Flume, Megastore, Spanner, PubSub, Millwheel
↓
Apache Beam
Google Cloud Dataflow
What is Apache Beam?
1. The Beam Programming Model
2. SDKs for writing Beam pipelines -- Java & Python
3. Runners for existing distributed processing backends
○ Apache Flink (thanks to data Artisans)
○ Apache Spark (thanks to Cloudera and PayPal)
○ Google Cloud Dataflow (fully managed service)
○ Local runner for testing
The Beam Model: Asking the Right Questions
What results are calculated?
Where in event time are results calculated?
When in processing time are results materialized?
How do refinements of results relate?
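The four questions map directly onto Scio/Beam primitives. A minimal sketch of answering What and Where (assuming Scio's 0.3-era API; the element values and the one-minute window size are made up for illustration):

```scala
import com.spotify.scio._
import org.joda.time.{Duration, Instant}

val sc = ScioContext()
sc.parallelize(Seq(("sorry", 5L), ("sorry", 42L), ("baby", 70L))) // (track, seconds)
  // Where in event time: assign timestamps, then fixed 1-minute windows
  .timestampBy { case (_, s) => new Instant(s * 1000) }
  .withFixedWindows(Duration.standardMinutes(1))
  // What is calculated: play counts per track, per window
  .keys
  .countByValue
  .saveAsTextFile("counts.txt")
sc.close()
```

When and How come in via triggers and accumulation modes on the window, which Beam exposes for streaming pipelines.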
Customizing What / Where / When / How
1. Classic Batch
2. Windowed Batch
3. Streaming
4. Streaming + Accumulation
Why Beam and
Cloud Dataflow?
● Unified batch and streaming model
● Hosted, fully managed, no ops
● Auto-scaling, dynamic work re-balance
● GCP ecosystem - BigQuery, Bigtable, Datastore, Pubsub
Beam + Cloud Dataflow
No More
PagerDuty!
Why Beam and
Scala?
● High level DSL
● Familiarity with Scalding, Spark and Flink
● Functional programming natural fit for data
● Numerical libraries - Breeze, Algebird
● Macros & shapeless for code generation
Beam and Scala
Scio
A Scala API for Apache Beam and
Google Cloud Dataflow
Scio
Ecclesiastical Latin IPA: /ˈʃi.o/, [ˈʃiː.o], [ˈʃi.i̯o]
Verb: I can, know, understand, have knowledge.
github.com/spotify/scio
Apache Licence 2.0
Cloud Storage, Pub/Sub, Datastore, Bigtable, BigQuery
Batch | Streaming | Interactive | REPL
Scio Scala API
Dataflow Java SDK | Scala Libraries
Extra features
WordCount
val sc = ScioContext()
sc.textFile("shakespeare.txt")
  .flatMap(_.split("[^a-zA-Z']+").filter(_.nonEmpty))
  .countByValue
  .saveAsTextFile("wordcount.txt")
sc.close()
PageRank
def pageRank(in: SCollection[(String, String)]) = {
  val links = in.groupByKey()
  var ranks = links.mapValues(_ => 1.0)
  for (i <- 1 to 10) {
    val contribs = links.join(ranks).values
      .flatMap { case (urls, rank) =>
        val size = urls.size
        urls.map((_, rank / size))
      }
    ranks = contribs.sumByKey.mapValues((1 - 0.85) + 0.85 * _)
  }
  ranks
}
Why Scio?
Type safe BigQuery
Macro generated case classes, schemas and converters
@BigQuery.fromQuery("SELECT id, name FROM [users] WHERE ...")
class User // look mom no code!
sc.typedBigQuery[User]().map(u => (u.id, u.name))
@BigQuery.toTable
case class Score(id: String, score: Double)
data.map(kv => Score(kv._1, kv._2)).saveAsTypedBigQuery("table")
● Best of both worlds
● BigQuery for slicing and
dicing huge datasets
● Scala for custom logic
● Seamless integration
BigQuery + Scio
REPL
$ scio-repl
Welcome to
                _____
 ________________(_)_____
 __  ___/  ___/_  /_  __ \
 _(__  )/ /__  _  / / /_/ /
 /____/ \___/  /_/  \____/   version 0.3.4
Using Scala version 2.11.11 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_121)
Type in expressions to have them evaluated.
Type :help for more information.
Using 'scio-test' as your BigQuery project.
BigQuery client available as 'bq'
Scio context available as 'sc'
scio> _
Available in github.com/spotify/homebrew-public
● DAG and source code visualization
● BigQuery caching, legacy SQL & standard SQL (SQL 2011) support
● HDFS, Protobuf, TensorFlow, Bigtable,
Elasticsearch & Cassandra I/O
● Join optimizations - hash, skewed, sparse
● Job metrics
● DistCache, job orchestration
Other goodies
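As a sketch of how one of the join optimizations is used (hash join; the method name is from Scio's API, the pipeline contents are made up): the small side is shipped to every worker as a side input, so the large side never has to be shuffled.

```scala
import com.spotify.scio._

val sc = ScioContext()
val plays  = sc.parallelize(Seq(("t1", 100L), ("t2", 200L), ("t1", 50L))) // large side
val titles = sc.parallelize(Seq(("t1", "Sorry"), ("t2", "Baby")))         // small side
// hashJoin broadcasts `titles` as a side input instead of co-shuffling both
val joined = plays.hashJoin(titles) // SCollection[(String, (Long, String))]
sc.close()
```

skewedJoin and sparse joins follow the same pattern for hot keys and mostly-missing keys respectively.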
Demo Time!
Adoption
● At Spotify
○ 200+ users, 700+ unique production pipelines (up from ~70 ten months ago)
○ Most new to Scala and Scio
○ Both batch and streaming jobs
● Externally
○ ~10 companies, several fairly large ones
Use Cases
Fan Insights
● Listener stats
[artist|track] ×
[context|geography|demography] ×
[day|week|month]
● BigQuery, GCS, Datastore
● TBs daily
● 150+ Java jobs → ~10 Scio jobs
Master Metadata
● n1-standard-1 workers
● 1 core 3.75GB RAM
● Autoscaling 2-35 workers
● 26 Avro sources - artist, album, track, disc, cover art, ...
● 120GB out, 70M records
● 200 LOC vs original Java 600 LOC
And we broke Google
Listening history
● user × track ×
[day|week|month|all time]
● 300B elements
● 800 n1-highmem-32 workers
● 32 core 208GB RAM
● 240TB in - Bigtable
● 90TB out - GCS
BigDiffy
● Pairwise field-level statistical diff
● Diff 2 SCollection[T] given keyFn: T => String
● T: Avro, BigQuery JSON, Protobuf
● Leaf field Δ - numeric, string (Levenshtein), vector (Cosine)
● Δ statistics - min, max, μ, σ, etc.
● Non-deterministic fields
○ ignore field
○ treat "repeated" field (List) as unordered list (Set)
Part of github.com/spotify/ratatool
Dataset Diff
● Diff stats
○ Global: # of SAME, DIFF, MISSING LHS/RHS
○ Key: key → SAME, DIFF, MISSING LHS/RHS
○ Field: field → min, max, μ, σ, etc.
● Use cases
○ Validating pipeline migration
○ Sanity checking ML models
Pairwise field-level deltas
val lKeyed = lhs.keyBy(keyFn)
val rKeyed = rhs.keyBy(keyFn)
val deltas = (lKeyed outerJoin rKeyed).map { case (k, (lOpt, rOpt)) =>
  (lOpt, rOpt) match {
    case (Some(l), Some(r)) =>
      val ds = diffy(l, r) // Seq[Delta]
      val dt = if (ds.isEmpty) SAME else DIFFERENT
      (k, (ds, dt))
    case (_, _) =>
      val dt = if (lOpt.isDefined) MISSING_RHS else MISSING_LHS
      (k, (Nil, dt))
  }
}
Summing deltas
import com.twitter.algebird._

// convert deltas to map of (field → summable stats)
def deltasToMap(ds: Seq[Delta], dt: DeltaType)
  : Map[String,
        (Long, Option[(DeltaType, Min[Double], Max[Double], Moments)])] = {
  // ...
}

deltas
  .map { case (_, (ds, dt)) => deltasToMap(ds, dt) }
  .sum // Semigroup!
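The `.sum` works because per-field stats form a semigroup: they merge associatively, so partial results from any subset of workers combine to the same total. A stdlib-only sketch of that merge for a single field's (count, min, max), with Algebird's types swapped out for a plain tuple:

```scala
// Associative merge of per-record stats -- the property Algebird's
// Semigroup abstracts; (count, min, max) for one field.
def mergeStats(a: (Long, Double, Double),
               b: (Long, Double, Double)): (Long, Double, Double) =
  (a._1 + b._1, math.min(a._2, b._2), math.max(a._3, b._3))

val perRecord = Seq((1L, 0.2, 0.2), (1L, 0.9, 0.9), (1L, 0.5, 0.5))
val total = perRecord.reduce(mergeStats) // (3, 0.2, 0.9)
```

Algebird's Min, Max and Moments bundle exactly this kind of merge, which is what lets the whole diff reduce in one shuffle stage.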
1. Copy & paste from legacy
codebase to Scio
2. Verify with BigDiffy
3. Profit!
Featran
● a.k.a. Featran77 or F77
● Type safe feature transformer for machine learning
● Column-wise aggregation and transformation
● 1 shuffle stage (Algebird aggregation)
● In memory, Flink, Scalding, Scio & Spark runner
● Scala collection, array, Breeze, TensorFlow & NumPy output formats
● Property-based testing, 100% coverage
https://github.com/spotify/featran
Transforming features
case class Record(d1: Double, d2: Option[Double], d3: Double,
s: Seq[String])
val spec = FeatureSpec.of[Record]
.required(_.d1)(Binarizer("bin", threshold = 0.5))
.optional(_.d2)(Bucketizer("bucket", Array(0.0, 10.0, 20.0)))
.required(_.d3)(StandardScaling("std"))
.required(_.d3)(QuantileDiscretizer("q", 4))
.required(_.s)(NHotEncoder("nhot"))
val f = spec.extract(records)
f.featureNames // human readable names
f.featureValues[DenseVector[Double]] // output format
f.featureSettings // settings can be reloaded
shapeless-datatype
● Case class ↔ Avro, BigQuery, Datastore & TensorFlow types
● Compile time and extendable type mapping
● Generic mapper between case class types
● Type and lens based record matcher
https://github.com/nevillelyh/shapeless-datatype
protobuf-generic
● Generic Protobuf manipulation without compiled classes
● Similar to Avro GenericRecord
● Protobuf type T → JSON schema
● Bytes ↔ JSON with JSON schema
● Command line tool to inspect binary files
https://github.com/nevillelyh/protobuf-generic
Other uses
● AB testing
○ Statistical analysis with bootstrap
and DimSum
● Monetization
○ Ads targeting
○ User conversion analysis
● User understanding
○ Diversity
○ Session analysis
○ Behavior analysis
● Home page ranking
● Audio fingerprint analysis
We're developing so
fast that Google's
having a hard time
keeping up with capacity
Google Cloud Capacity
● Almost too easy to write big complex jobs leveraging
BigQuery & Dataflow
● BigQuery export limit 10TB/day/project
● Compute engine datacenter capacity
● Shuffle disk size & throughput
● Bigtable throttling
Scala Challenges
Serialization
● Data ser/de
○ Scalding, Spark and Storm use Kryo and Twitter Chill
○ Dataflow/Beam requires an explicit Coder[T],
sometimes inferable via Guava TypeToken
○ ClassTag to the rescue, fallback to Kryo/Chill
● Lambda ser/de
○ Custom ClosureCleaner, Serializable and @transient lazy val
○ Scala 2.12, Java 8 lambda issues #10232 #10233
REPL
● Spark REPL transports lambda bytecode via HTTP
● Dataflow requires job jar for execution (masterless)
● Custom class loader and ILoop
● Interpreted classes → job jar → job submission
● SCollection[T]#closeAndCollect(): Iterator[T]
to mimic Spark actions
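In the REPL that looks like (a sketch; the values are arbitrary):

```scala
val sc = ScioContext()
val doubled = sc.parallelize(1 to 5).map(_ * 2)
// closeAndCollect() submits the job, waits for it to finish, and
// pulls the results back into the driver -- like Spark's collect()
val xs: Iterator[Int] = doubled.closeAndCollect()
```

Since Dataflow has no long-lived master to stream results from, materializing them after job completion is the closest analogue to a Spark action.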
Macros and IntelliJ IDEA
● IntelliJ IDEA does not see macro expanded classes
https://youtrack.jetbrains.com/issue/SCL-8834
● @BigQueryType.{fromTable, fromQuery}
class MyRecord
● Scio IDEA plugin
https://github.com/spotify/scio-idea-plugin
Make IntelliJ
Intelligent Again!
Scio in Apache Zeppelin
Local Zeppelin server, remote managed Dataflow cluster, NO OPS
The End
Thank You
Neville Li
@sinisa_lyh
