Speakers: Igor Maravić & Neville Li, Spotify
From stream to recommendation with Cloud Pub/Sub and Cloud Dataflow
DATA & ANALYTICS
Current Event Delivery System
[Diagram. Components: Client (four), Gateway, Syslog, Syslog Producer, Service Discovery, Liveness Monitor, Groupers, Realtime Brokers, Brokers, Syslog Consumer, ETL job, Checkpoint Monitor, Hadoop. Regions: Any Data Centre, Hadoop Data Center. Edge label: ACK.]
Complex
[Diagram: same as the current event delivery system diagram above.]
Stateless
[Diagram: same as the current event delivery system diagram above.]
Delivered data growth
[Chart: x-axis in years, 2007 to 2015.]
Redesigning Event Delivery
[Diagram. Components: Client (four), Gateway, Syslog, File Tailer, Event Delivery Service, Reliable Persistent Queue, ETL, Hadoop. Region: Any data centre.]
Same API
[Diagram: same as the redesigned event delivery diagram above.]
Persistence
[Diagram: same as the redesigned event delivery diagram above.]
Keep it simple
[Diagram: same as the redesigned event delivery diagram above.]
Build it!
Choosing reliable persistent queue
Kafka 0.8
Proven technology
Strong community
Reliable persistent queue
Event delivery with Kafka 0.8
[Diagram. Components: Client (four), Gateway, Syslog, File Tailer, Event Delivery Service, Brokers, Mirror Makers, Brokers, Camus (ETL), Hadoop. Regions: Any data centre, Hadoop data centre.]
Cloud Pub/Sub
Retains undelivered data
At least once delivery
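At-least-once delivery means a subscriber may receive the same message more than once, so whatever consumes the stream has to tolerate duplicates. The talk does not say how this is handled; the following is only a minimal sketch of one common approach, assuming each event carries a unique ID. `DedupingConsumer` and its methods are hypothetical names and are not part of the Cloud Pub/Sub API.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Cloud Pub/Sub: skips redelivered events by ID.
final class DedupingConsumer {
    private final Set<String> seenIds = new HashSet<>();
    private final List<String> processed = new ArrayList<>();

    // Returns true if the event was processed, false if it was a duplicate.
    boolean handle(String eventId, String payload) {
        if (!seenIds.add(eventId)) {
            return false; // already seen: skip the work, the message can still be acknowledged
        }
        processed.add(payload);
        return true;
    }

    // Payloads processed so far, in arrival order.
    List<String> processed() {
        return processed;
    }

    public static void main(String[] args) {
        DedupingConsumer consumer = new DedupingConsumer();
        consumer.handle("id-1", "play");
        consumer.handle("id-2", "pause");
        consumer.handle("id-1", "play"); // redelivery of the first event
        System.out.println(consumer.processed());
    }
}
```

The in-memory set grows without bound in this sketch; a real consumer would need to expire old IDs or make the downstream write idempotent instead.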
Globally available
Simple REST API
No operational responsibility*
SHUT UP AND TAKE MY MONEY!
Caution advised!
Building up trust in Cloud Pub/Sub
Delivered data growth
[Chart: x-axis in years, 2007 to 2015.]
Demo time!
2M events per second.
Cloud Pub/Sub, Spotify chooses You!
Event delivery with Cloud Pub/Sub
[Diagram. Components: Client (four), Gateway, Syslog, File Tailer, Event Delivery Service, Cloud Pub/Sub, Dataflow, Cloud Storage, Hadoop. Region: Any data centre.]
ETL using Cloud Dataflow
Streaming ETL job with Cloud Dataflow
Dataflow SDK is a framework
Cloud Dataflow is a managed service
ETL job
Single Cloud Pub/Sub subscription
GCS and HDFS in parallel.
Event time based hourly buckets
[Diagram: six hourly buckets, 2016-03-21 23H through 2016-03-22 04H.]
Incremental bucket fill
[Diagram: six hourly buckets, 2016-03-21 23H through 2016-03-22 04H.]
Bucket completeness
[Diagram: six hourly buckets, 2016-03-21 23H through 2016-03-22 04H.]
Late data handling
[Diagram: six hourly buckets, 2016-03-21 23H through 2016-03-22 04H.]
Event time based hourly buckets
Incremental bucket fill
Bucket completeness
Late data handling
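The four requirements above can be modelled without any streaming framework. The sketch below uses only the JDK and is an illustration, not the job's actual code: the class and method names are hypothetical, UTC is assumed for bucket labels, and the one-day lateness limit is taken from the windowing slide in this deck.

```java
import java.time.Duration;
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.time.temporal.ChronoUnit;

// Illustrative plain-JDK model of event-time hourly buckets (hypothetical names).
final class HourlyBuckets {
    // Label format mirrors the slides, e.g. "2016-03-22 03H" (UTC is an assumption).
    private static final DateTimeFormatter LABEL =
        DateTimeFormatter.ofPattern("yyyy-MM-dd HH'H'").withZone(ZoneOffset.UTC);

    // Late events are accepted for one day after their bucket ends.
    static final Duration ALLOWED_LATENESS = Duration.ofDays(1);

    // The bucket is chosen by when the event happened, not when it arrived.
    static String bucketFor(Instant eventTime) {
        return LABEL.format(eventTime.truncatedTo(ChronoUnit.HOURS));
    }

    // Exclusive end of the hourly bucket that contains the event.
    static Instant bucketEnd(Instant eventTime) {
        return eventTime.truncatedTo(ChronoUnit.HOURS).plus(Duration.ofHours(1));
    }

    // A bucket is treated as complete once the watermark reaches its end.
    static boolean isComplete(Instant eventTime, Instant watermark) {
        return !watermark.isBefore(bucketEnd(eventTime));
    }

    // An event is dropped once the watermark is past bucket end plus allowed lateness.
    static boolean isDropped(Instant eventTime, Instant watermark) {
        return watermark.isAfter(bucketEnd(eventTime).plus(ALLOWED_LATENESS));
    }

    public static void main(String[] args) {
        Instant event = Instant.parse("2016-03-22T03:59:30Z");
        Instant watermark = Instant.parse("2016-03-22T04:10:00Z");
        System.out.println(bucketFor(event));
        System.out.println(isComplete(event, watermark));
        System.out.println(isDropped(event, watermark));
    }
}
```

An event that arrives after its bucket is complete but before `isDropped` turns true is late data: it still belongs to its original bucket and is written there incrementally.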
Windowing
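@Override
public PCollection<KV<String, Iterable<EventMessage>>> apply(
    final PCollection<KV<String, EventMessage>> shardedEvents) {
  return shardedEvents
      .apply("Assign Hourly Windows",
          // One fixed window per hour of event time.
          Window.<KV<String, EventMessage>>into(
                  FixedWindows.of(ONE_HOUR))
              // Keep accepting events for up to a day after the window ends.
              .withAllowedLateness(ONE_DAY)
              .triggering(
                  // On-time firing when the watermark passes the end of the hour.
                  AfterWatermark.pastEndOfWindow()
                      // Before that, fire whenever a file's worth of events has arrived.
                      .withEarlyFirings(AfterPane.elementCountAtLeast(maxEventsInFile))
                      // For late data, fire on a full file or ten seconds after the
                      // first late element in the pane, whichever comes first.
                      .withLateFirings(AfterFirst.of(
                          AfterPane.elementCountAtLeast(maxEventsInFile),
                          AfterProcessingTime.pastFirstElementInPane()
                              .plusDelayOf(TEN_SECONDS))))
              // Each pane holds only the events that arrived since the last firing.
              .discardingFiredPanes())
      .apply("Aggregate Events", GroupByKey.create());
}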
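To make the interaction of element-count firings and discarded panes concrete, here is a small plain-Java model. It is a deliberate simplification and not the Dataflow SDK: it fires at exactly `maxEventsInFile` elements (the real trigger only guarantees at least that many), it ignores late firings and processing-time delays, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of count-based firings with discarded panes; not the Dataflow SDK.
final class CountTriggeredPane {
    private final int maxEventsInFile;
    private List<String> pane = new ArrayList<>();
    private final List<List<String>> fired = new ArrayList<>();

    CountTriggeredPane(int maxEventsInFile) {
        this.maxEventsInFile = maxEventsInFile;
    }

    // Buffer one event and fire as soon as the pane holds maxEventsInFile events.
    void add(String event) {
        pane.add(event);
        if (pane.size() >= maxEventsInFile) {
            fire();
        }
    }

    // Models the watermark passing the end of the window: flush whatever is left.
    void onWatermark() {
        if (!pane.isEmpty()) {
            fire();
        }
    }

    // Emit the current pane and start from empty, as in discarding mode.
    private void fire() {
        fired.add(pane);
        pane = new ArrayList<>();
    }

    // All panes emitted so far, in firing order.
    List<List<String>> fired() {
        return fired;
    }

    public static void main(String[] args) {
        CountTriggeredPane window = new CountTriggeredPane(2);
        window.add("e1");
        window.add("e2"); // early firing: the pane is full
        window.add("e3");
        window.onWatermark(); // on-time firing with the remainder
        System.out.println(window.fired());
    }
}
```

Because fired panes are discarded, no event appears in more than one pane, so each pane can be written out as its own file.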
Streaming
Where are we right now?
Preliminary results
[Chart: watermark lag, in minutes.]
Scio
Scala API for Google Cloud Dataflow
Origin story
Scalding and Spark popular for ML, recommendations, analytics @ Spotify
50+ users, 400+ unique jobs
Early 2015 - Dataflow Scala hack project
Why not Scalding on GCE
Pros
● Big community - Twitter, eBay, Etsy, Stripe, LinkedIn, SoundCloud
● Stable and proven
Cons
● Hadoop cluster operations
● Multi-tenancy, resource contention and utilization
● No streaming mode
Why not Spark on GCE
Pros
● Batch, streaming, interactive and SQL
● MLlib, GraphX
● Scala, Python, and R support
Cons
● Hard to tune and scale
● Cluster lifecycle management
Why Dataflow with Scala
Dataflow
● Hosted solution, no operations
● Ecosystem: GCS, BigQuery, Pub/Sub, Datastore, Bigtable
● Simple unified model for batch and streaming
Scala
● High level DSL, easy transition for developers
● Reusable and composable code via functional programming
● Numerical libraries: Breeze, Algebird
[Diagram: Scio stack, one layer per line.]
Cloud Storage, Pub/Sub, Datastore, Bigtable, BigQuery
Batch, Streaming, Interactive REPL
Scio Scala API
Dataflow Java SDK, Scala Libraries, Extra features
Scio
Ecclesiastical Latin IPA: /ˈʃi.o/, [ˈʃiː.o], [ˈʃi.i̯o]
Verb: I can, know, understand, have knowledge.
Core API similar to spark-core, some ideas from scalding
github.com/spotify/scio
WordCount
Almost identical to Spark version
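import com.spotify.scio._

val sc = ScioContext()
sc.textFile("shakespeare.txt")
  .flatMap(_.split("[^a-zA-Z']+").filter(_.nonEmpty))  // split lines into words
  .countByValue()                                      // (word, count) pairs
  .saveAsTextFile("wordcount.txt")
sc.close()  // runs the pipeline; omitted on the slide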
PageRank in 13 lines
def pageRank(in: SCollection[(String, String)]) = {
  val links = in.groupByKey()
  var ranks = links.mapValues(_ => 1.0)
  for (i <- 1 to 10) {
    val contribs = links.join(ranks).values
      .flatMap { case (urls, rank) =>
        val size = urls.size
        urls.map((_, rank / size))
      }
    ranks = contribs.sumByKey.mapValues((1 - 0.85) + 0.85 * _)
  }
  ranks
}
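For readers less familiar with the Scala collection operators, the same ten iterations can be written out in plain Java on in-memory maps. This is an illustrative rendering only, not Scio code; `PageRankSketch` is a hypothetical name, and edges are passed as two-element arrays of source and destination URL.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java rendering of the iteration above, for illustration; not Scio code.
final class PageRankSketch {
    static Map<String, Double> pageRank(List<String[]> edges) {
        // links: source URL -> destination URLs (the groupByKey step).
        Map<String, List<String>> links = new HashMap<>();
        for (String[] edge : edges) {
            links.computeIfAbsent(edge[0], k -> new ArrayList<>()).add(edge[1]);
        }
        // Every URL with outgoing links starts at rank 1.0.
        Map<String, Double> ranks = new HashMap<>();
        for (String url : links.keySet()) {
            ranks.put(url, 1.0);
        }
        for (int i = 0; i < 10; i++) {
            // Each URL splits its rank evenly among its destinations.
            Map<String, Double> sums = new HashMap<>();
            for (Map.Entry<String, List<String>> entry : links.entrySet()) {
                Double rank = ranks.get(entry.getKey());
                if (rank == null) {
                    continue; // inner join: URLs without a rank contribute nothing
                }
                int size = entry.getValue().size();
                for (String dest : entry.getValue()) {
                    sums.merge(dest, rank / size, Double::sum);
                }
            }
            // Damping: a base of 0.15 plus 85% of the summed contributions.
            Map<String, Double> next = new HashMap<>();
            for (Map.Entry<String, Double> sum : sums.entrySet()) {
                next.put(sum.getKey(), (1 - 0.85) + 0.85 * sum.getValue());
            }
            ranks = next;
        }
        return ranks;
    }

    public static void main(String[] args) {
        List<String[]> edges = new ArrayList<>();
        edges.add(new String[] {"a", "b"});
        edges.add(new String[] {"a", "c"});
        edges.add(new String[] {"b", "a"});
        edges.add(new String[] {"c", "a"});
        System.out.println(pageRank(edges));
    }
}
```

As in the Scala version, a URL that receives no contributions in an iteration drops out of `ranks`, because the join only keeps keys present on both sides.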
SQL and Big Data Pipelines
SQL is easier to write than data pipelines, but
Hive with TSV or Avro
● Row based storage, inefficient full scan
● No integration with other frameworks
Parquet
● Inspired by Google Dremel which powers BigQuery
● Immature Hive integration, hard to scale with Spark SQL
● Poor impedance matching with Scalding, Avro, etc.
BigQuery and Scio
BigQuery
● Slicing and dicing, aggregation, etc.
● Scaling independently
● Web UI, Tableau, QlikView etc.
Scio
● Custom logic hard to express in SQL
● Seamless integration with BigQuery IO
● Scala macros for type safety
JSON vs Type Safe BigQuery
JSON approach, a.k.a. everything is Object
sc.bigQuerySelect("...").map { r =>
  (r.get("track").asInstanceOf[TableRow]
     .get("name").asInstanceOf[String],
   r.get("audio").asInstanceOf[TableRow]
     .get("tempo").toString.toInt)
}
Compile → Run job → Wait → NullPointerException or ClassCastException → Repeat
Type safe approach
@BigQueryType.fromQuery("...")
class TrackTempo
sc.typedBigQuery[TrackTempo]().map { t =>
  (t.track.name, t.audio.tempo.getOrElse(-1))
}
Compile → Run → Profit
Spotify Running
60 million tracks
30 million users * 10 tempo buckets * 25 personalized tracks
Audio: tempo, energy, time signature ...
Metadata: genres, categories
Latent vectors from collaborative filtering
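Multiplying out the per-user figures above gives a sense of the output size, assuming every user receives a full list for every tempo bucket:

```latex
30 \times 10^{6}\ \text{users} \times 10\ \text{tempo buckets} \times 25\ \text{tracks}
  = 7.5 \times 10^{9}\ \text{user-bucket-track entries}
```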
Rapid prototyping with BigQuery
Spotify Running
[Pipeline diagram]
Queries:
  SELECT user_id, vector FROM UserEntity WHERE ...
  SELECT track_id, audio.tempo ... FROM TrackEntity WHERE ...
Stages: most popular per recording; top N tracks per artist; bucket by tempo; vector LSH per bucket; top tracks per user + bucket
Other labels: GBK (three times), RBK, side input, Cloud Datastore
What’s the catch?
Early stage, some rough edges
No interactive mode → Scio REPL (WIP), BigQuery + Datalab
No machine learning → TensorFlow
Licensed under Apache 2, contributions welcome!
Learnings?
Blog posts @ labs.spotify.com
Spotify’s Event Delivery - The Road To The Cloud
Part I, Part II, Part III
Thank You
Igor Maravić <igor@spotify.com>
Neville Li <neville@spotify.com>