Lambda Architecture for Real-time Big Data, by Trieu Nguyen
Lambda Architecture in Real-time Big Data Projects
Concepts & Techniques: “Thinking with Lambda”
Case studies from real projects
Why is Lambda Architecture the right solution for big data?
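The core idea behind the Lambda Architecture is simple: a batch layer periodically recomputes views over an immutable master dataset, a speed layer absorbs events that arrived since the last batch run, and a serving layer merges both at query time. A minimal Python sketch of that flow (all names here are illustrative, not from the slides):

```python
# Minimal Lambda Architecture sketch: the batch layer recomputes views from
# the immutable master dataset; the speed layer counts events that arrived
# after the last batch run; the serving layer merges both at query time.
from collections import Counter

master_dataset = []        # append-only log of (user, event) pairs
batch_view = Counter()     # recomputed periodically from master_dataset
realtime_view = Counter()  # incremental counts since the last batch run

def ingest(user, event):
    """New events land in the master dataset AND the speed layer."""
    master_dataset.append((user, event))
    realtime_view[user] += 1

def run_batch():
    """Batch layer: full recomputation; the speed layer is then reset."""
    global batch_view
    batch_view = Counter(user for user, _ in master_dataset)
    realtime_view.clear()

def query(user):
    """Serving layer: merge the batch view with the real-time delta."""
    return batch_view[user] + realtime_view[user]

ingest("alice", "click")
ingest("alice", "view")
run_batch()
ingest("alice", "click")   # arrives after the batch run
print(query("alice"))      # 3: two from the batch view plus one real-time
```

Note that the speed layer is disposable: its state is cleared every time the batch layer catches up, which is what keeps the real-time path simple.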
Big Data Day LA 2016 / Hadoop / Spark / Kafka track: Building an Event-oriented... (Data Con LA)
While we frequently talk about how to build interesting products on top of machine and event data, the reality is that collecting, organizing, providing access to, and managing this data is where most people get stuck. In this session, we’ll follow the flow of data through an end-to-end system built to handle tens of terabytes per day of event-oriented data, providing real-time streaming, in-memory, SQL, and batch access to this data. We’ll go into detail on how open source systems such as Hadoop, Kafka, Solr, and Impala/Hive are actually stitched together; describe how and where to perform data transformation and aggregation; provide a simple and pragmatic way of managing event metadata; and talk about how applications built on top of this platform get access to data and extend its functionality. This session is especially recommended for data infrastructure engineers and architects planning, building, or maintaining similar systems.
The right architecture is key for any IT project. This is especially the case for big data projects, where there are no standard architectures which have proven their suitability over years. This session discusses the different big data architectures which have evolved over time, including traditional Big Data architecture and Streaming Analytics architecture as well as the Lambda and Kappa architectures, and presents the mapping of components from both the Open Source and the Oracle stack onto these architectures.
Leveraging Spark to Democratize Data for Omni-Commerce, with Shafaq Abdullah (Databricks)
Insnap, a hyper-personalized ML-based platform acquired by The Honest Company, has been used to build a real-time data platform based on Apache Spark, Cassandra and Redshift. Users’ behavioral and transactional data have been used to build data models and ML models, and to drive use cases for marketing, growth, finance and operations.
Learn how Honest Company has used Spark as a workhorse for 1) collecting, transforming (ETL) and storing data from various sources including MySQL, Mongo, JDE, Google Analytics, Facebook, Localytics and REST APIs; 2) building data models, and aggregating and generating reports on revenue, order fulfillment tracking, data pipeline monitoring and subscriptions; 3) using ML to build models for user acquisition, LTV and recommendation use cases. Spark replaced the monolithic codebase with flexible, scalable and robust pipelines. Databricks helped The Honest Company focus on data instead of maintaining infrastructure. While Honest users got delightful recommendations that improved their experience, data users at Honest understood users much better in terms of segmenting with behavioral information and advanced ML models, leading to increased revenue and retention.
Real-Time Robot Predictive Maintenance in Action (DataWorks Summit)
Industry 4.0 IoT applications promise vast gains in productivity from reduced downtime, higher product quality and higher efficiency. Modern industrial robots integrate hundreds of sensors of all kinds, generating tremendous volumes of data rich in valuable information. However, the reality is that some of the most advanced industrial makers in the world are barely getting started making use of this data, with relatively rudimentary, bespoke monitoring systems built at tremendous cost.
We believe that it is now possible, using a well-chosen selection of enterprise open source big data projects, to successfully deploy Industry 4.0 pilot use cases in a matter of months, at a small fraction of the cost of equivalent projects at leading high-tech makers. We propose to show a working prototype of just such a system, and explain in some detail how it was made.
Our presentation describes a working real-time ML-based anomaly detection system. We show a working industrial robot-analog installed with a wireless movement sensor. Our system scores the data in a cloud-based cluster. For added realism, the system we demonstrate live includes a working augmented-reality headset that can show the real-time status overlaid on the working robot.
This talk demonstrates a concrete example of a real-time predictive maintenance system, built as a series of microservices connected by Kafka streams and powered by the excellent H2O distributed machine learning tool. Our goal is for our attendees to get a feel for what can be realistically achieved by a few non-genius-level engineers in a few months of effort, using the best in open source technology for real-time streams (Kafka) and machine learning (H2O).
Where appropriate, we’ll mention how our choice of using the MapR Converged Data Platform made the development easier thanks to some of its unique features.
Speaker
Cao Yi, MapR
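As a toy stand-in for the kind of scoring such a pipeline performs (this is not the H2O model from the talk, just an illustrative rolling z-score detector), one microservice consuming sensor readings might flag anomalies like this:

```python
# Illustrative stream anomaly scoring: flag a sensor reading as anomalous
# when it deviates from the rolling mean by more than 3 standard deviations.
from collections import deque
from statistics import mean, stdev

class RollingAnomalyDetector:
    def __init__(self, window=20, threshold=3.0):
        self.window = deque(maxlen=window)  # recent readings only
        self.threshold = threshold

    def score(self, value):
        """Return True if `value` is anomalous w.r.t. the rolling window."""
        anomalous = False
        if len(self.window) >= 2:
            mu, sigma = mean(self.window), stdev(self.window)
            if sigma > 0 and abs(value - mu) > self.threshold * sigma:
                anomalous = True
        self.window.append(value)
        return anomalous

detector = RollingAnomalyDetector(window=10)
readings = [1.0, 1.1, 0.9, 1.0, 1.05, 0.95, 1.0, 1.1, 0.9, 1.0, 9.9]
flags = [detector.score(r) for r in readings]
print(flags.index(True))  # prints 10: only the spike at 9.9 is flagged
```

In a production setup this scoring step would sit behind a Kafka consumer, with the flagged events published to a downstream alerting topic.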
Cloud Experience: Data-driven Applications Made Simple and Fast (Databricks)
A complex real-time data workflow implementation is very challenging. This session will describe the architecture of a data platform that provides a single, secure, high-performance system that can be deployed in hybrid cloud architectures. We will present how to support simultaneous, consistent and high-performance access through multiple industry open source and cloud-compatible standards of streaming, table, TSDB, object, and file APIs. A new serverless technology is also used in the architecture to support dynamic and flexible implementations. The presenter will also outline how the platform was integrated with the Spark ecosystem, including AI and ML tools, to simplify the development process.
A series of tweets I posted about my 11-hour struggle to make a cup of tea with my WiFi kettle ended up going viral, was picked up by the national and then international press, and led to thousands of retweets, comments and references in the media. In this session we’ll take the data I recorded on this Twitter activity over the period and use Oracle Big Data Graph and Spatial to understand what caused the breakout and made the tweet go viral, who the key influencers and connectors were, and how the tweet spread over time and geography from my original series of posts in Hove, England.
A thorough review of what makes a great ecosystem for entrepreneurs. It’s not enough to have a great idea or a great team; as an entrepreneur, you need a supporting environment that will help you succeed. In this presentation you can find the main ingredients that create a good ecosystem for startups, along with a review of the main European startup ecosystems.
Spotify in the Cloud: An Evolution of Data Infrastructure, Strata NYC (Josh Baer)
Slides from a presentation given by Alison Gilles and Josh Baer during Strata NYC 2017.
Covers the decisions, challenges and strategy (technical, organizational, people) for migrating the data and processing of Spotify's 2,500-node Hadoop cluster to Google Cloud.
Finally, it touches on Spotify's resulting infrastructure on GCP.
Alluxio 2.0 & Near Real-time Big Data Platform w/ Spark & Alluxio (Alluxio, Inc.)
Alluxio Bay Area Meetup March 14th
Join the Alluxio Meetup group: https://www.meetup.com/Alluxio
Alluxio Community slack: https://www.alluxio.org/slack
In this presentation, we start by briefly talking about why configuration management and automation tools are becoming increasingly important, along with our general approach and the community that supports it. We will also provide a comprehensive overview of the technologies used with Puppet, so expect to learn more about Puppet Enterprise, Puppet, PuppetDB, MCollective, Forge and more. Other programs that help people learn about Puppet, like training and certification programs, are also covered.
This presentation explains how open-source Apache Nifi can be used to easily consume AWS Cloud Services. Featuring drag and drop interactions with many cloud capabilities, it enables teams to quickly start handling their big data on the cloud. Both small agile and large enterprise teams can benefit from this easy to learn, rapid to implement approach to data processing. For more information, go to www.calculatedsystems.com.
Marco Pozzan
Power BI consultant & Trainer
A real-time usage scenario for Power BI. This session introduces the theory behind the real-time dashboarding offered by Power BI, then focuses on a practical case of a real-time dataset in hybrid mode, used to build a monitoring dashboard with write-back support so the user can perform what-if analysis.
Designing a social network offers some exciting challenges to engineers. The system needs to operate at scale, to provide a responsive user experience and to be able to inspect user activity in order to both generate new content and improve how the existing content is delivered.
Event-driven architectures are particularly suitable for handling this kind of challenge, and highly scalable messaging systems such as Apache Kafka have been designed specifically to support the requirements of modern high-volume applications.
In this talk we describe how the Crowdmix back-end has been designed as an event-based system running on top of Kafka. We present the overall system architecture and discuss in more detail some of the sub-components processing those events in different fashions, from stream processing to batch processing, passing through a lambda-style cooperation of batch and stream.
We conclude by describing some lessons learned from our one-year journey in implementing and operating the system.
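The batch-and-stream cooperation described above rests on one property of Kafka-like logs: the log is append-only and each consumer tracks its own offset, so a fast streaming consumer and a slow batch consumer can read the same events independently. A minimal in-memory sketch of that idea (illustrative only, no Kafka API involved):

```python
# Toy append-only event log with Kafka-like offset semantics: producers
# append, and each consumer polls from its own offset independently.
class EventLog:
    def __init__(self):
        self._events = []

    def append(self, event):
        self._events.append(event)

    def read_from(self, offset):
        """Return all events at or after `offset`, plus the next offset."""
        return self._events[offset:], len(self._events)

log = EventLog()
stream_offset = 0  # streaming consumer: polls often, small batches
batch_offset = 0   # batch consumer: polls rarely, replays large ranges

for e in ("follow", "post", "like"):
    log.append(e)

# The streaming consumer catches up first...
stream_batch, stream_offset = log.read_from(stream_offset)

log.append("comment")

# ...while the batch consumer later replays everything it has not seen.
batch_batch, batch_offset = log.read_from(batch_offset)

print(len(stream_batch), len(batch_batch))  # prints: 3 4
```

Because consumption never mutates the log, the batch path can always be re-run from offset 0, which is exactly the reprocessing guarantee a lambda-style design relies on.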
Discover how the world of big data is evolving and becoming faster, more reliable and better organized, powering many of the cooler new features that you see in the client today!
Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/2l2Rr6L.
Doug Daniels discusses the cloud-based platform they have built at Datadog and how it differs from a traditional datacenter-based analytics stack. He walks through the decisions they have made at each layer, covers the pros and cons of these decisions and discusses the tooling they have built. Filmed at qconsf.com.
Doug Daniels is a Director of Engineering at Datadog, where he works on high-scale data systems for monitoring, data science, and analytics. Prior to joining Datadog, he was CTO at Mortar Data and an architect and developer at Wireless Generation, where he designed data systems to serve more than 4 million students in 49 states.
Similar to Shortening the Feedback Loop: How Spotify’s Big Data Ecosystem has evolved to produce real-time insights, by Josh Baer
Big Data, Big Quality? by Irene Gonzálvez at Big Data Spain 2017 (Big Data Spain)
Insights can only be as good as the data. The data quality domain is enormously large, so you need to understand your company's pain points to know what to focus on first.
https://www.bigdataspain.org/2017/talk/big-data-big-quality
Big Data Spain 2017
November 16th - 17th Kinépolis Madrid
Scaling a backend for a big data and blockchain environment by Rafael Ríos at... (Big Data Spain)
2gether is a financial platform based on Blockchain, Big Data and Artificial Intelligence that allows interaction between users and third-party services in a single interface.
https://www.bigdataspain.org/2017/talk/scaling-a-backend-for-a-big-data-and-blockchain-environment
Disaster Recovery for Big Data by Carlos Izquierdo at Big Data Spain 2017 (Big Data Spain)
All modern Big Data solutions, like Hadoop, Kafka or the rest of the ecosystem tools, are designed as distributed processes and as such include some sort of redundancy for High Availability.
https://www.bigdataspain.org/2017/talk/disaster-recovery-for-big-data
Presentation: Boost Hadoop and Spark with in-memory technologies by Akmal Cha... (Big Data Spain)
In this presentation, attendees will see how to speed up existing Hadoop and Spark deployments by just making Apache Ignite responsible for RAM utilization. No code modifications, no new architecture from scratch!
https://www.bigdataspain.org/2017/talk/boost-hadoop-and-spark-with-in-memory-technologies
Data science for lazy people, Automated Machine Learning by Diego Hueltes at ... (Big Data Spain)
This talk shows the power of a new set of tools for data science. It is really easy to start applying these techniques in your current workflow.
https://www.bigdataspain.org/2017/talk/data-science-for-lazy-people-automated-machine-learning
Training Deep Learning Models on Multiple GPUs in the Cloud by Enrique Otero ... (Big Data Spain)
GPUs in the cloud as Infrastructure as a Service (IaaS) seem like a commodity. However, efficiently distributing deep learning tasks across several GPUs is challenging.
https://www.bigdataspain.org/2017/talk/training-deep-learning-models-on-multiple-gpus-in-the-cloud
Unbalanced data: Same algorithms, different techniques by Eric Martín at Big D... (Big Data Spain)
Unbalanced data is a specific data configuration that appears commonly in nature. Applying machine learning techniques to this kind of data is a difficult process, usually addressed by unbalanced reduction techniques.
https://www.bigdataspain.org/2017/talk/unbalanced-data-same-algorithms-different-techniques
State of the art time-series analysis with deep learning by Javier Ordóñez at... (Big Data Spain)
Time series related problems have traditionally been solved using engineered features obtained by heuristic processes.
https://www.bigdataspain.org/2017/talk/state-of-the-art-time-series-analysis-with-deep-learning
Trading at market speed with the latest Kafka features by Iñigo González at B... (Big Data Spain)
Not long ago only banks and hedge funds could afford automated and high-frequency trading, that is, the ability to send orders to buy commodities at microsecond intervals.
https://www.bigdataspain.org/2017/talk/trading-at-market-speed-with-the-latest-kafka-features
Unified Stream Processing at Scale with Apache Samza by Jake Maes at Big Data... (Big Data Spain)
The shift to stream processing at LinkedIn has accelerated over the past few years. We now have over 200 Samza applications in production processing more than 260B events per day.
https://www.bigdataspain.org/2017/talk/apache-samza-jake-maes
The Analytic Platform behind IBM’s Watson Data Platform by Luciano Resende a... (Big Data Spain)
IBM has built a “Data Science Experience” cloud service that exposes Notebook services at web scale.
https://www.bigdataspain.org/2017/talk/the-analytic-platform-behind-ibms-watson-data-platform
Artificial Intelligence and Data-centric businesses by Óscar Méndez at Big Da... (Big Data Spain)
Artificial Intelligence and Data-centric businesses.
https://www.bigdataspain.org/2017/talk/tbc
Why big data didn’t end causal inference by Totte Harinen at Big Data Spain 2017 (Big Data Spain)
Ten years ago there were rumours of the death of causal inference. Big data was supposed to enable us to rely on purely correlational data to predict and control the world.
https://www.bigdataspain.org/2017/talk/why-big-data-didnt-end-causal-inference
Meme Index. Analyzing fads and sensations on the Internet by Miguel Romero at... (Big Data Spain)
The Meme Index will become the standard way to analyze and predict the facts and sensations circulating on the Internet.
https://www.bigdataspain.org/2017/talk/meme-index-analyzing-fads-and-sensations-on-the-internet
Vehicle Big Data that Drives Smart City Advancement by Mike Branch at Big Dat... (Big Data Spain)
Geotab is a leader in the expanding world of Internet of Things (IoT) and telematics industry with Big Data.
https://www.bigdataspain.org/2017/talk/vehicle-big-data-that-drives-smart-city-advancement
End of the Myth: Ultra-Scalable Transactional Management by Ricardo Jiménez-P... (Big Data Spain)
The talk will focus on explaining why operational databases do not scale due to limitations in legacy transactional management.
https://www.bigdataspain.org/2017/talk/end-of-the-myth-ultra-scalable-transactional-management
Attacking Machine Learning used in AntiVirus with Reinforcement by Rubén Mart... (Big Data Spain)
In recent years Machine Learning (ML) and especially Deep Learning (DL) have achieved great success in many areas such as visual recognition, NLP or even aiding in medical research.
https://www.bigdataspain.org/2017/talk/attacking-machine-learning-used-in-antivirus-with-reinforcement
More people, less banking: Blockchain by Salvador Casquero at Big Data Spain ... (Big Data Spain)
The primary function of the banking sector is promoting economic activity, which means “commerce”: exchanging what someone produces or has for something that someone else consumes or desires.
https://www.bigdataspain.org/2017/talk/more-people-less-banking-blockchain
Make the elephant fly, once again by Sourygna Luangsay at Big Data Spain 2017 (Big Data Spain)
Bol.com has been an early Hadoop user: since 2008, when the cluster was first built for a recommendation algorithm.
https://www.bigdataspain.org/2017/talk/make-the-elephant-fly-once-again
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024 (Tobias Schneck)
As AI technology pushes into IT, I wondered, as an “infrastructure container Kubernetes guy”, how this fancy AI technology gets managed from an infrastructure and operations point of view. Is it possible to apply our lovely cloud-native principles as well? What benefits could both technologies bring to each other?
Let me take these questions and provide a short journey through existing deployment models and use cases for AI software. Using practical examples, we discuss what cloud/on-premises strategy we may need to apply them to our own infrastructure and make them work from an enterprise perspective. I want to give an overview of infrastructure requirements and technologies, and of what could be beneficial or limiting for your AI use cases in an enterprise environment. An interactive demo will give you some insights into the approaches I already have working for real.
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do... (UiPath Community)
💥 Speed, accuracy, and scaling – discover the superpowers of GenAI in action with UiPath Document Understanding and Communications Mining™:
See how to accelerate model training and optimize model performance with active learning
Learn about the latest enhancements to out-of-the-box document processing – with little to no training required
Get an exclusive demo of the new family of UiPath LLMs – GenAI models specialized for processing different types of documents and messages
This is a hands-on session specifically designed for automation developers and AI enthusiasts seeking to enhance their knowledge in leveraging the latest intelligent document processing capabilities offered by UiPath.
Speakers:
👨🏫 Andras Palfi, Senior Product Manager, UiPath
👩🏫 Lenka Dulovicova, Product Program Manager, UiPath
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -... (DanBrown980551)
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses;
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
Essentials of Automations: Optimizing FME Workflows with Parameters (Safe Software)
Are you looking to streamline your workflows and boost your projects’ efficiency? Do you find yourself searching for ways to add flexibility and control over your FME workflows? If so, you’re in the right place.
Join us for an insightful dive into the world of FME parameters, a critical element in optimizing workflow efficiency. This webinar marks the beginning of our three-part “Essentials of Automation” series. This first webinar is designed to equip you with the knowledge and skills to utilize parameters effectively: enhancing the flexibility, maintainability, and user control of your FME projects.
Here’s what you’ll gain:
- Essentials of FME Parameters: Understand the pivotal role of parameters, including Reader/Writer, Transformer, User, and FME Flow categories. Discover how they are the key to unlocking automation and optimization within your workflows.
- Practical Applications in FME Form: Delve into key user parameter types including choice, connections, and file URLs. Allow users to control how a workflow runs, making your workflows more reusable. Learn to import values and deliver the best user experience for your workflows while enhancing accuracy.
- Optimization Strategies in FME Flow: Explore the creation and strategic deployment of parameters in FME Flow, including the use of deployment and geometry parameters, to maximize workflow efficiency.
- Pro Tips for Success: Gain insights on parameterizing connections and leveraging new features like Conditional Visibility for clarity and simplicity.
We’ll wrap up with a glimpse into future webinars, followed by a Q&A session to address your specific questions surrounding this topic.
Don’t miss this opportunity to elevate your FME expertise and drive your projects to new heights of efficiency.
The Art of the Pitch: WordPress Relationships and SalesLaura Byrne
Clients don’t know what they don’t know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if sometime changes?
All these questions and more will be explored as we talk about matching clients’ needs with what your agency offers without pulling teeth or pulling your hair out. Practical tips, and strategies for successful relationship building that leads to closing the deal.
Key Trends Shaping the Future of Infrastructure.pdfCheryl Hung
Keynote at DIGIT West Expo, Glasgow on 29 May 2024.
Cheryl Hung, ochery.com
Sr Director, Infrastructure Ecosystem, Arm.
The key trends across hardware, cloud and open-source; exploring how these areas are likely to mature and develop over the short and long-term, and then considering how organisations can position themselves to adapt and thrive.
State of ICS and IoT Cyber Threat Landscape Report 2024 previewPrayukth K V
The IoT and OT threat landscape report has been prepared by the Threat Research Team at Sectrio using data from Sectrio, cyber threat intelligence farming facilities spread across over 85 cities around the world. In addition, Sectrio also runs AI-based advanced threat and payload engagement facilities that serve as sinks to attract and engage sophisticated threat actors, and newer malware including new variants and latent threats that are at an earlier stage of development.
The latest edition of the OT/ICS and IoT security Threat Landscape Report 2024 also covers:
State of global ICS asset and network exposure
Sectoral targets and attacks as well as the cost of ransom
Global APT activity, AI usage, actor and tactic profiles, and implications
Rise in volumes of AI-powered cyberattacks
Major cyber events in 2024
Malware and malicious payload trends
Cyberattack types and targets
Vulnerability exploit attempts on CVEs
Attacks on counties – USA
Expansion of bot farms – how, where, and why
In-depth analysis of the cyber threat landscape across North America, South America, Europe, APAC, and the Middle East
Why are attacks on smart factories rising?
Cyber risk predictions
Axis of attacks – Europe
Systemic attacks in the Middle East
Download the full report from here:
https://sectrio.com/resources/ot-threat-landscape-reports/sectrio-releases-ot-ics-and-iot-security-threat-landscape-report-2024/
Connector Corner: Automate dynamic content and events by pushing a buttonDianaGray10
Here is something new! In our next Connector Corner webinar, we will demonstrate how you can use a single workflow to:
Create a campaign using Mailchimp with merge tags/fields
Send an interactive Slack channel message (using buttons)
Have the message received by managers and peers along with a test email for review
But there’s more:
In a second workflow supporting the same use case, you’ll see:
Your campaign sent to target colleagues for approval
If the “Approve” button is clicked, a Jira/Zendesk ticket is created for the marketing design team
But—if the “Reject” button is pushed, colleagues will be alerted via Slack message
Join us to learn more about this new, human-in-the-loop capability, brought to you by Integration Service connectors.
And...
Speakers:
Akshay Agnihotri, Product Manager
Charlie Greenberg, Host
3. Who am I?
• Technical Product Owner at Spotify
• Working with fast processing infrastructure
• Previously, building out Spotify’s 2500 node
Hadoop cluster
@l_phant
4. Spotify Launches in 2008
• Instant access to a gigantic catalog of music
• Click to play, instantaneously!
20. In the Beginning…
• Spotify was almost completely on-premise/bare metal
• 2500 node Hadoop cluster, over 10K machines in
production at four globally distributed data centers
• Grew with users: from 1M in 2009, over 100M in 2016
21. Why Move to the Cloud?
• Cloud providers have matured, decreasing in cost while increasing in reliability and variety of services offered
• Owning and operating physical machines is not a
competitive advantage for Spotify
22. Why Google’s Cloud?
• We believe Google's industry-leading background in Big Data technologies will give us a data processing advantage
24. BigQuery
• Ad-hoc and interactive querying service for massive datasets
• Like Hive, but without needing to manage Hadoop and servers
• Leverages Google’s internal tech
• Dremel (query execution engine)
• Colossus (distributed storage)
• Borg (distributed compute)
• Juniper (network)
Source: https://cloud.google.com/blog/big-data/2016/01/bigquery-under-the-hood
25. BigQuery vs. Hive
• Example Query: Find the top 10 songs by
popularity in Spain during October
• BigQuery (1.50 TB processed): 108s
• Hive (15.5 TB processed): 2,647s
Note: Hive performance unoptimized. Version used (0.14), input format (Avro), run on a ~2500-node Yarn cluster. This is not considered to be a thorough benchmark.
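The shape of this query — count streams per track in one country, then take the top N — can be sketched in plain Python. This is an illustrative stand-in for the BigQuery/Hive SQL, with made-up event tuples, not Spotify's actual schema:

```python
from collections import Counter

def top_tracks(plays, country, n=10):
    """Count streams per track for one country and return the top n.

    `plays` is an iterable of (country, track) stream events -- a
    stand-in for the event rows the BigQuery/Hive query would scan.
    """
    counts = Counter(track for c, track in plays if c == country)
    return counts.most_common(n)

plays = [("ES", "Safari"), ("ES", "Safari"), ("ES", "Closer"), ("SE", "Closer")]
print(top_tracks(plays, "ES", n=2))  # [('Safari', 2), ('Closer', 1)]
```

The real queries scan terabytes of event logs; the logic, however, is exactly this group-count-sort pattern.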
26. BigQuery vs. Hive (example #2)
• Example Query: Find the total hours of music
listening in Spain during October
• BigQuery (780 GB processed): 33s
• Hive (15.5 TB processed): 969s
Note: Hive performance unoptimized. Version used (0.14), input format (Avro), run on a ~2500-node Yarn cluster. This is not considered to be a thorough benchmark.
27. Top 10 Tracks in Spain during October 2016
Rank Artist(s) Track Name
1 J Balvin Safari
2 DJ Snake Let Me Love You
3 Ricky Martin Vente Pa' Ca
4 Sebastian Yatra Traicionera
5 Zion & Lennox (feat. J Balvin) Otra Vez
6 Carlos Vives, Shakira La Bicicleta
7 The Chainsmokers Closer
8 Major Lazer (feat. Justin Bieber & MØ) Cold Water
9 Sia The Greatest
10 IAmChino (feat. Pitbull, Yandel & Chacal) Ay Mi Dios
28. Time Spent Listening to
Spotify by users in Spain
during October
Nearly 10,000 Years!
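The "nearly 10,000 years" figure is just a unit conversion on total listening hours. A quick sanity check (the hour total below is illustrative, chosen to match the stated figure, not Spotify's actual number):

```python
def hours_to_years(hours):
    # 1 year ≈ 365.25 days × 24 hours = 8,766 hours
    return hours / (365.25 * 24)

# ~87.7 million listening hours corresponds to ~10,000 years
print(round(hours_to_years(87_660_000)))  # 10000
```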
29. BigQuery at Spotify
• Interactive and ad-hoc querying immediately started to transfer to BigQuery once the data was available on the cloud
• The pace of learning increases as the friction to asking questions decreases
30. Cloud Pub/Sub
• At least once globally distributed message queue
• For high-volume, low-topic-count (<10,000) publish/subscribe workloads
• Like Kafka, but without needing to operate servers and supporting services (ZooKeeper)
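"At least once" is the key semantic above: a message stays in flight until the subscriber acks it, and an unacked message is redelivered, possibly as a duplicate. A toy in-memory model (class and method names are hypothetical, not the Pub/Sub client API):

```python
import collections

class AtLeastOnceQueue:
    """Toy model of at-least-once delivery: messages stay in flight
    until acked; an unacked message is redelivered (possibly as a
    duplicate -- which is why subscribers must be idempotent)."""

    def __init__(self):
        self.pending = collections.deque()
        self.in_flight = {}
        self.next_id = 0

    def publish(self, msg):
        self.pending.append((self.next_id, msg))
        self.next_id += 1

    def pull(self):
        if not self.pending:
            return None
        msg_id, msg = self.pending.popleft()
        self.in_flight[msg_id] = msg
        return msg_id, msg

    def ack(self, msg_id):
        self.in_flight.pop(msg_id, None)

    def nack(self, msg_id):
        # Ack deadline expired / subscriber crashed: redeliver.
        if msg_id in self.in_flight:
            self.pending.appendleft((msg_id, self.in_flight.pop(msg_id)))

q = AtLeastOnceQueue()
q.publish("track_played")
msg_id, msg = q.pull()
q.nack(msg_id)           # subscriber failed before acking
print(q.pull()[1])       # "track_played" is delivered again
```

The real service adds global distribution and automatic ack deadlines, but the contract consumers program against is this one.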
31. Cloud Pub/Sub at Spotify
• 800K events/second? No problem
• P99 Latency of ingestions into ES: 500ms
• Ingestion from globally distributed non-GCP
datacenters is painless
32. • Managed Service for running batch and streaming jobs
• Unified API for batch and streaming modes
• Inspired by internal Google tools like FlumeJava and MillWheel
• Programming model open-sourced as Apache Beam (currently incubating)
Cloud Dataflow
33. • Usually run via Scio: https://github.com/spotify/scio
• Scio provides a Scala API for running Dataflow jobs and provides easy integrations with BigQuery
• New batch processing jobs @Spotify are being
written in Scio/Dataflow
Cloud Dataflow (Batch) at Spotify
34. • Exactly-once stream processing framework
• A replacement for Spark/Flink streaming and Storm workloads at Spotify
• Optimizes for consistency, which can complicate real-time workloads
Cloud Dataflow (Streaming) at Spotify
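One common way to get exactly-once results on top of an at-least-once feed is to deduplicate by message id before aggregating. A minimal sketch of that idea (a simplified illustration, not how Dataflow implements it internally):

```python
def dedup_sum(events):
    """Sum `value` per key, skipping messages whose id was already seen.

    `events` is an iterable of (msg_id, key, value) tuples; duplicates
    (redeliveries) share a msg_id and must not be counted twice.
    """
    seen = set()
    totals = {}
    for msg_id, key, value in events:
        if msg_id in seen:
            continue  # redelivered duplicate -- already counted
        seen.add(msg_id)
        totals[key] = totals.get(key, 0) + value
    return totals

events = [(1, "plays", 1), (2, "plays", 1), (2, "plays", 1)]  # id 2 redelivered
print(dedup_sum(events))  # {'plays': 2}
```

Tracking seen ids is the consistency cost the slide alludes to: the state must itself be stored durably and consistently, which is what complicates real-time workloads.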
35. Spotify + Google Cloud Timeline (2015–2016)
• Beginning of Google Cloud evaluation
• Spotify + Google Cloud announcement
• BigQuery begins to replace Hive
• Cloud Pub/Sub begins to replace Kafka
• Dataflow (streaming) begins to replace Storm
• Dataflow (batch) begins to replace Map/Reduce
Note: Dates are approximations
39. Introducing “Pulsar”
• An internal name for the system aggregating data from Access Points and feeding it into Cloud Pub/Sub
• Replaces the Kafka real-time event feed
42. Dataflow
• Subscribes to critical event Pub/Sub topics
• Aggregates events into minute windows
• Always running, no need to schedule or wait for results
49. Problem
As a developer, I want to be able to instantly explore data being logged by the clients.
50. Solution
• Produce a topic for all employee client events
• Store in Elasticsearch
• Visualize in Kibana
53. Benefits
• Able to understand what's being sent by the client as it happens
• Exploring events, visualizing distributions (e.g. does this field actually get populated?)
• Prototyping analysis based on a sample
• Dashboards for Employee Releases
57. Dataflow to the Rescue!
• We created a library that allows teams to build maps/filters with simple Java code
• Code gets translated into a Dataflow job
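The idea — users chain simple map/filter functions and the library turns the chain into a pipeline — can be sketched with a tiny builder. All names here are hypothetical illustrations, not Spotify's internal library (which is Java, per the slide):

```python
class MiniPipeline:
    """Toy map/filter pipeline builder. Users chain steps; the recorded
    step list is what a real library would translate into a Dataflow job."""

    def __init__(self):
        self.steps = []

    def map(self, fn):
        self.steps.append(("map", fn))
        return self  # return self so calls chain fluently

    def filter(self, pred):
        self.steps.append(("filter", pred))
        return self

    def run(self, records):
        # Locally, just apply the steps in order over an iterable.
        out = records
        for kind, fn in self.steps:
            out = (map if kind == "map" else filter)(fn, out)
        return list(out)

p = MiniPipeline().filter(lambda e: e["type"] == "play").map(lambda e: e["track"])
print(p.run([{"type": "play", "track": "Safari"},
             {"type": "skip", "track": "Closer"}]))  # ['Safari']
```

The payoff of this pattern is that teams write only the per-event logic; execution, scaling, and operations stay with the managed runner.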
60. No Ops!
• For our users:
• Event-feed managed through Cloud Pub/Sub
• Dataflow managed by Google
• Shared Elasticsearch cluster (managed by an
infra team)
61. Low Ops :/
• Dataflow is improving, but it’s had some stability
issues with streaming jobs
• Teams may need to set up their own Elasticsearch cluster if they require a higher SLA than the default
65. Live Results for X-Factor
• X-Factor: television music
competition
• Contest songs get loaded onto Spotify immediately after the show airs
• Listener behavior determines the
order of contestants on the playlist
70. Cloud to the Rescue!
• Spotify has leveled up our ability to gain actionable insights by leveraging Google Cloud tools such as Pub/Sub, Dataflow, and BigQuery
71. The Value of a Fast Feedback Loop
• Detecting problems in data early avoids long backfills or long-term data loss
• Instant insights on newly developed features allow teams to iterate more quickly and take risks
• Providing a faster ad-hoc querying engine allows teams to ask more questions and learn faster
72. Use Anything and Everything
• Open source and other cloud providers offer many alternatives to the stack we've used
• Open-source tools, like Elasticsearch/Kibana, and proprietary solutions, like Tableau, have also been useful additions
74. Stream Processing First
• The sun never sets on Spotify, so why impose boundaries on our datasets?
• What's the shortest distance between two lines? Zero!
• Can we reduce the feedback cycle to zero?