From stream to recommendation using Apache Beam with Cloud Pub/Sub and Cloud Dataflow

Speakers: Igor Maravić & Neville Li, Spotify
From stream to recommendation with Cloud Pub/Sub and Cloud Dataflow
DATA & ANALYTICS

2
Current Event Delivery System

3
Current event delivery system
[Diagram labels: Client (x4), Gateway, Syslog, Syslog Producer, Any Data Centre, Groupers, Realtime Brokers, ETL job, Checkpoint Monitor, Hadoop, Hadoop Data Center, Service Discovery, ACK, Brokers, Syslog Consumer, Liveness Monitor, Brokers]

4
Complex
[Diagram: current event delivery system, repeated from slide 3]

5
Stateless
[Diagram: current event delivery system, repeated from slide 3]

6
Delivered data growth
[Chart: delivered data growth, x-axis 2007 to 2015]

8
Redesigning event delivery
[Diagram labels: Client (x4), Gateway, Syslog, File Tailer, Any data centre, Event Delivery Service, Reliable Persistent Queue, ETL, Hadoop]

9
Same API
[Diagram: redesigned event delivery, repeated from slide 8]

10
Persistence
[Diagram: redesigned event delivery, repeated from slide 8]

11
Keep it simple
[Diagram: redesigned event delivery, repeated from slide 8]

13
Choosing reliable persistent queue

17
Reliable persistent queue

18
Event delivery with Kafka 0.8
[Diagram labels: Client (x4), Gateway, Syslog, File Tailer, Event Delivery Service, Any data centre, Brokers, Mirror Makers, Brokers, Camus (ETL), Hadoop, Hadoop data centre]

19
Event delivery with Kafka 0.8
[Diagram: repeated from slide 18]

25
No operational responsibility*

26
SHUT UP AND TAKE MY MONEY!

Building up trust in Cloud Pub/Sub
28

29
Delivered data growth
[Chart: delivered data growth, x-axis 2007 to 2015]

Cloud Pub/Sub,
Spotify chooses You!
32

33
Event delivery with Cloud Pub/Sub
[Diagram labels: Client (x4), Gateway, Syslog, File Tailer, Event Delivery Service, Any data centre, Cloud Pub/Sub, Dataflow (ETL using Cloud Dataflow), Cloud Storage, Hadoop]

34
Streaming ETL job with Cloud Dataflow

35
Dataflow SDK is a framework

36
Cloud Dataflow is a managed service

38
Single Cloud Pub/Sub subscription

40
Event time based hourly buckets
[Diagram: hourly buckets from 2016-03-21 23H to 2016-03-22 04H]

41
Incremental bucket fill
[Diagram: hourly buckets from 2016-03-21 23H to 2016-03-22 04H]

42
Bucket completeness
[Diagram: hourly buckets from 2016-03-21 23H to 2016-03-22 04H]

43
Late data handling
[Diagram: hourly buckets from 2016-03-21 23H to 2016-03-22 04H]

44
Event time based hourly buckets
Incremental bucket fill
Bucket completeness
Late data handling
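
The first and third requirements above can be made concrete with a small plain-Java sketch. It is not the Spotify implementation: the class and method names are hypothetical, UTC is an assumption, and the "watermark" here is just a timestamp passed in by the caller. It shows how an event's own timestamp selects its hourly bucket, and one simple way to decide that a bucket is complete.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.time.temporal.ChronoUnit;

// Illustrative sketch only; class and method names are hypothetical.
final class HourlyBuckets {
    // Label format mirrors the slide labels, e.g. "2016-03-22 03H". UTC is an assumption.
    private static final DateTimeFormatter LABEL =
        DateTimeFormatter.ofPattern("yyyy-MM-dd HH'H'").withZone(ZoneOffset.UTC);

    // Start of the hourly bucket that contains the event's own timestamp (event time).
    static Instant bucketStart(Instant eventTime) {
        return eventTime.truncatedTo(ChronoUnit.HOURS);
    }

    // Human-readable label of the bucket that contains the event.
    static String bucketLabel(Instant eventTime) {
        return LABEL.format(bucketStart(eventTime));
    }

    // Treat a bucket as complete once the watermark has reached the end of its hour.
    static boolean isComplete(Instant bucketStart, Instant watermark) {
        return !watermark.isBefore(bucketStart.plus(1, ChronoUnit.HOURS));
    }

    public static void main(String[] args) {
        Instant event = Instant.parse("2016-03-22T03:15:00Z");
        System.out.println(bucketLabel(event)); // prints "2016-03-22 03H"
        // Watermark still inside the hour: the bucket is not complete yet.
        System.out.println(isComplete(bucketStart(event), Instant.parse("2016-03-22T03:59:00Z"))); // prints "false"
        // Watermark past the end of the hour: the bucket can be marked complete.
        System.out.println(isComplete(bucketStart(event), Instant.parse("2016-03-22T04:05:00Z"))); // prints "true"
    }
}
```

Because the bucket is derived from the event timestamp rather than the arrival time, an event that arrives hours late still maps to its original bucket.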

46
Windowing
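// ONE_HOUR, ONE_DAY, TEN_SECONDS and maxEventsInFile are defined elsewhere (not shown on the slide).
@Override
public PCollection<KV<String, Iterable<EventMessage>>> apply(
    final PCollection<KV<String, EventMessage>> shardedEvents) {
  return shardedEvents
      .apply("Assign Hourly Windows",
          // Group events into one-hour windows by event time.
          Window.<KV<String, EventMessage>>into(
                  FixedWindows.of(ONE_HOUR))
              // Keep accepting data for one day after the watermark passes the window end.
              .withAllowedLateness(ONE_DAY)
              .triggering(
                  AfterWatermark.pastEndOfWindow()
                      // Before the watermark: emit a pane once maxEventsInFile elements have accumulated.
                      .withEarlyFirings(AfterPane.elementCountAtLeast(maxEventsInFile))
                      // After the watermark: emit on element count or ten seconds after the first late element.
                      .withLateFirings(AfterFirst.of(
                          AfterPane.elementCountAtLeast(maxEventsInFile),
                          AfterProcessingTime.pastFirstElementInPane()
                              .plusDelayOf(TEN_SECONDS))))
              // Each pane carries only elements that were not already emitted.
              .discardingFiredPanes())
      .apply("Aggregate Events", GroupByKey.create());
}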
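
To illustrate how a one-hour window with one day of allowed lateness treats an arriving event, here is a simplified plain-Java sketch. It does not use the Dataflow SDK and is not how the service decides internally: the class, enum and method names are hypothetical, and the boundary handling is an approximation of the SDK's rules.

```java
import java.time.Duration;
import java.time.Instant;

// Simplified model of window state versus watermark; names are hypothetical.
final class LatenessSketch {
    enum Arrival { ON_TIME_OR_EARLY, LATE, DROPPED }

    static final Duration ONE_HOUR = Duration.ofHours(1);
    static final Duration ONE_DAY = Duration.ofDays(1);

    // Classify an element for the window starting at windowStart,
    // given the watermark at the moment the element arrives.
    static Arrival classify(Instant windowStart, Instant watermark) {
        Instant windowEnd = windowStart.plus(ONE_HOUR);
        if (watermark.isBefore(windowEnd)) {
            // Watermark has not passed the window end: goes into an early or on-time pane.
            return Arrival.ON_TIME_OR_EARLY;
        }
        if (watermark.isBefore(windowEnd.plus(ONE_DAY))) {
            // Within allowed lateness: still accepted, emitted by a late firing.
            return Arrival.LATE;
        }
        // Beyond allowed lateness: the window has expired and the element is discarded.
        return Arrival.DROPPED;
    }

    public static void main(String[] args) {
        Instant windowStart = Instant.parse("2016-03-22T03:00:00Z");
        System.out.println(classify(windowStart, Instant.parse("2016-03-22T03:30:00Z"))); // prints "ON_TIME_OR_EARLY"
        System.out.println(classify(windowStart, Instant.parse("2016-03-22T05:00:00Z"))); // prints "LATE"
        System.out.println(classify(windowStart, Instant.parse("2016-03-23T05:00:00Z"))); // prints "DROPPED"
    }
}
```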

49
Preliminary results
[Chart: watermark lag, in minutes]

50
Scio
Scala API for Google Cloud Dataflow

51
Origin story
Scalding and Spark popular for ML, recommendations, analytics @ Spotify
50+ users, 400+ unique jobs
Early 2015 - Dataflow Scala hack project

52
Why not Scalding on GCE
Pros
● Big community - Twitter, eBay, Etsy, Stripe, LinkedIn, SoundCloud
● Stable and proven
Cons
● Hadoop cluster operations
● Multi-tenancy, resource contention and utilization
● No streaming mode

53
Why not Spark on GCE
Pros
● Batch, streaming, interactive and SQL
● MLlib, GraphX
● Scala, Python, and R support
Cons
● Hard to tune and scale
● Cluster lifecycle management

54
Why Dataflow with Scala
Dataflow
● Hosted solution, no operations
● Ecosystem: GCS, BigQuery, Pub/Sub, Datastore, Bigtable
● Simple unified model for batch and streaming
Scala
● High level DSL, easy transition for developers
● Reusable and composable code via functional programming
● Numerical libraries: Breeze, Algebird

55
[Diagram labels: Cloud Storage, Pub/Sub, Datastore, Bigtable, BigQuery; Batch, Streaming, Interactive, REPL; Scio Scala API; Dataflow Java SDK; Scala Libraries; Extra features]

56
Scio
Ecclesiastical Latin IPA: /ˈʃi.o/, [ˈʃiː.o], [ˈʃi.i̯o]
Verb: I can, know, understand, have knowledge.
Core API similar to spark-core, some ideas from scalding
github.com/spotify/scio

57
WordCount
Almost identical to Spark version
val sc = ScioContext()
sc.textFile("shakespeare.txt")
  .flatMap(_.split("[^a-zA-Z']+").filter(_.nonEmpty))
  .countByValue()
  .saveAsTextFile("wordcount.txt")

58
PageRank in 13 lines
def pageRank(in: SCollection[(String, String)]) = {
  val links = in.groupByKey()
  var ranks = links.mapValues(_ => 1.0)
  for (i <- 1 to 10) {
    val contribs = links.join(ranks).values
      .flatMap { case (urls, rank) =>
        val size = urls.size
        urls.map((_, rank / size))
      }
    ranks = contribs.sumByKey.mapValues((1 - 0.85) + 0.85 * _)
  }
  ranks
}
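
To make the per-iteration arithmetic concrete, the same update rule can be traced on in-memory maps. This is a minimal plain-Java sketch, not Scio or Dataflow code; the class name and the `iterations` parameter are hypothetical. Each page splits its rank evenly across its outgoing links, and a page's new rank is (1 - 0.85) + 0.85 times the sum of the contributions it receives.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// In-memory trace of the PageRank update; names are hypothetical.
final class PageRankSketch {
    static Map<String, Double> pageRank(Map<String, List<String>> links, int iterations) {
        // Every page with outgoing links starts with rank 1.0.
        Map<String, Double> ranks = new HashMap<>();
        for (String url : links.keySet()) {
            ranks.put(url, 1.0);
        }
        for (int i = 0; i < iterations; i++) {
            // Sum of contributions received by each destination page.
            Map<String, Double> contribs = new HashMap<>();
            for (Map.Entry<String, List<String>> e : links.entrySet()) {
                Double rank = ranks.get(e.getKey());
                if (rank == null) {
                    continue; // inner-join behaviour: pages without a rank contribute nothing
                }
                List<String> urls = e.getValue();
                for (String dst : urls) {
                    contribs.merge(dst, rank / urls.size(), Double::sum);
                }
            }
            // Apply the damping factor of 0.85 to the summed contributions.
            Map<String, Double> next = new HashMap<>();
            for (Map.Entry<String, Double> c : contribs.entrySet()) {
                next.put(c.getKey(), (1 - 0.85) + 0.85 * c.getValue());
            }
            ranks = next;
        }
        return ranks;
    }

    public static void main(String[] args) {
        // Two pages linking to each other: by symmetry both ranks stay at 1.0.
        Map<String, List<String>> links = new HashMap<>();
        links.put("a", Arrays.asList("b"));
        links.put("b", Arrays.asList("a"));
        System.out.println(pageRank(links, 10));
    }
}
```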

59
SQL and Big Data Pipelines
SQL is easier to write than data pipelines, but
Hive with TSV or Avro
● Row based storage, inefficient full scan
● No integration with other frameworks
Parquet
● Inspired by Google Dremel which powers BigQuery
● Immature Hive integration, hard to scale with Spark SQL
● Poor impedance matching with Scalding, Avro, etc.

60
BigQuery and Scio
BigQuery
● Slicing and dicing, aggregation, etc.
● Scaling independently
● Web UI, Tableau, QlikView etc.
Scio
● Custom logic hard to express in SQL
● Seamless integration with BigQuery IO
● Scala macros for type safety

61
JSON vs Type Safe BigQuery
JSON approach, a.k.a. everything is Object
sc.bigQuerySelect("...").map { r =>
  (r.get("track").asInstanceOf[TableRow]
     .get("name").asInstanceOf[String],
   r.get("audio").asInstanceOf[TableRow]
     .get("tempo").toString.toInt
  )
}
Compile → Run job → Wait → NullPointerException or ClassCastException → Repeat
Type safe approach
@BigQueryType.fromQuery("...")
class TrackTempo
sc.typedBigQuery[TrackTempo]().map { t =>
  (t.track.name, t.audio.tempo.getOrElse(-1))
}
Compile → Run → Profit

62
Spotify Running
60 million tracks
30 million users * 10 tempo buckets * 25 personalized tracks
Audio: tempo, energy, time signature ...
Metadata: genres, categories
Latent vectors from collaborative filtering

63
Rapid prototyping with BigQuery

64
Spotify Running
SELECT user_id, vector
FROM UserEntity
WHERE ...
SELECT track_id, audio.tempo ...
FROM TrackEntity
WHERE ...
[Pipeline diagram labels: most popular per recording; top N tracks per artist; bucket by tempo; vector LSH per bucket; GBK, GBK, GBK, RBK; top tracks per user + bucket; side input; Cloud Datastore]

67
What’s the catch?
Early stage, some rough edges
No interactive mode → Scio REPL (WIP), BigQuery + Datalab
No machine learning → TensorFlow
Licensed under Apache License 2.0, contributions welcome!

69
Blog posts @ labs.spotify.com
Spotify’s Event Delivery - The Road To The Cloud
Part I, Part II, Part III

70
Thank You
Igor Maravić <igor@spotify.com>
Neville Li <neville@spotify.com>
