
Shortening the feedback loop

How Spotify's big data ecosystem has evolved through Google Cloud Platform building blocks to leverage actionable insights

Published in: Technology

Shortening the feedback loop

  1. 1. Shortening the Feedback Loop: How Spotify’s Big Data Ecosystem Has Evolved to Leverage Actionable Insights. Josh Baer (jbx@spotify.com). Note: opinions expressed in these slides are the author’s and not necessarily those of Spotify
  2. 2. Who am I? • Technical Product Owner at Spotify • Working with fast processing infrastructure • Previously, building out Spotify’s 2,500-node Hadoop cluster @l_phant
  3. 3. In 2008 • Spotify launches • Access to a gigantic catalog of music • Click to play, instantaneously!
  4. 4. Behind the Scenes: Days to Insights
  5. 5. Behind the Scenes
  6. 6. Behind the Scenes Minutes to transfer Hours to Clean and Bucket Hours to Run Jobs or Ad Hoc Queries DAYS TO INSIGHTS
  7. 7. “Continuous Analytics: Stream Query Processing in Practice”, Michael J. Franklin, Professor, UC Berkeley, Dec 2009
  8. 8. Real-time Processing vs. Batch Processing (Hadoop, Hive, BigQuery). Source: “Continuous Analytics: Stream Query Processing in Practice”, Michael J. Franklin, Professor, UC Berkeley, Dec 2009
  9. 9. To leverage actionable insights, we need a faster feedback loop!
  10. 10. What is Spotify? • Music Streaming Service • Launched in 2008 • Premium and Free Tiers • Available in 60 Countries
  11. 11. Over 100 Million Active Users
  12. 12. Over 30 Million Songs
  13. 13. Over 1 Billion Plays Per Day
  14. 14. And we have Data
  15. 15. Hadoop at Spotify • ~2,500 Nodes • >100 PB Capacity • >100 TB Memory accessible by jobs • 20K Jobs/Day
  16. 16. Apache Kafka at Spotify • 500 Kafka-related machines • 40TB/day from logs
  17. 17. Real-Time at Spotify • Storm topologies fed via Kafka • Mostly used for hack ideas or proofs of concept
  18. 18. Migrating to the Cloud
  19. 19. In the Beginning… • Spotify was almost completely on-premise/bare metal • Grew to a 2,500-node Hadoop cluster and over 10K total machines in production at four globally distributed data centers • “Flirted” with cloud providers at various times
  20. 20. In 2014 • Maybe we should try this cloud thing for real
  21. 21. Why Move to the Cloud? • Cloud providers have matured, decreasing in cost and increasing in reliability and variety of services offered • Owning and operating physical machines is not a competitive advantage for Spotify
  22. 22. Why Google’s Cloud? • We believe Google’s industry-leading background in Big Data technologies will give us a data processing advantage
  23. 23. Google Cloud Data Building Blocks
  24. 24. BigQuery • Ad-hoc and interactive querying service for massive datasets • Like Hive, but without needing to manage Hadoop and servers • Leverages Google’s internal tech • Dremel (query execution engine) • Colossus (distributed storage) • Borg (distributed compute) • Jupiter (network) Source: https://cloud.google.com/blog/big-data/2016/01/bigquery-under-the-hood
  25. 25. BigQuery vs. Hive • Example Queries: • What are the top 10 songs by popularity in Spain during October 2016? • How many hours did users in Spain spend listening to Spotify during October?
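The two example questions on slide 25 can be sketched in a few lines of Python over an in-memory list. This is only an illustration of what the queries compute, not the Hive or BigQuery query used in the talk; the record layout (`track`, `country`, `played_at`, `ms_played`) and every value in `PLAYS` are invented for the example.

```python
from collections import Counter
from datetime import datetime

# Hypothetical play records: (track, country, played_at, ms_played).
# The layout and the values are assumptions made for this sketch.
PLAYS = [
    ("Track A", "ES", datetime(2016, 10, 1, 9, 0), 200_000),
    ("Track B", "ES", datetime(2016, 10, 2, 9, 0), 180_000),
    ("Track A", "ES", datetime(2016, 10, 3, 9, 0), 210_000),
    ("Track A", "SE", datetime(2016, 10, 3, 9, 0), 190_000),
    ("Track C", "ES", datetime(2016, 11, 1, 9, 0), 240_000),
]


def in_scope(country, played_at, want_country, year, month):
    """True if the play happened in the requested country and calendar month."""
    return country == want_country and (played_at.year, played_at.month) == (year, month)


def top_tracks(plays, country, year, month, limit=10):
    """Ranks tracks by play count within one country and month."""
    counts = Counter(
        track for track, c, ts, _ in plays if in_scope(c, ts, country, year, month)
    )
    return counts.most_common(limit)


def hours_listened(plays, country, year, month):
    """Sums listening time in hours within one country and month."""
    total_ms = sum(ms for _, c, ts, ms in plays if in_scope(c, ts, country, year, month))
    return total_ms / 3_600_000


print(top_tracks(PLAYS, "ES", 2016, 10))
print(hours_listened(PLAYS, "ES", 2016, 10))
```

Both questions filter on the same country and month and differ only in the aggregate (a count per track versus a sum of durations).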
  26. 26. BigQuery vs. Hive • What are the top 10 songs by popularity in Spain during October 2016? • Hive • 2647s (44min, 7sec) • 15.5TB processed • BigQuery • 108s (1min, 48sec) • 1.50TB processed Note: Hive performance unoptimized. Version used (0.14), input format (Avro), run on a ~2500 node YARN cluster. This is not considered to be a thorough benchmark
  27. 27. Top 10 Tracks in Spain during October 2016
      Rank | Artist(s) | Track Name
      1 | J Balvin | Safari
      2 | DJ Snake | Let Me Love You
      3 | Ricky Martin | Vente Pa' Ca
      4 | Sebastian Yatra | Traicionera
      5 | Zion & Lennox (feat. J Balvin) | Otra Vez
      6 | Carlos Vives, Shakira | La Bicicleta
      7 | The Chainsmokers | Closer
      8 | Major Lazer (feat. Justin Bieber & MØ) | Cold Water
      9 | Sia | The Greatest
      10 | IAmChino (feat. Pitbull, Yandel & Chacal) | Ay Mi Dios
  28. 28. BigQuery vs. Hive • How much time did users in Spain spend listening to Spotify during October? • Hive • 969s (16min, 9sec) • 15.5TB processed • BigQuery • 33s • 780 GB processed Note: Hive performance unoptimized. Version used (0.14), input format (Avro), run on a ~2500 node YARN cluster. This is not considered to be a thorough benchmark
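To put the two benchmark slides side by side, the ratios can be worked out directly from the numbers quoted on them. The sketch below assumes 1 TB = 1000 GB when converting the 780 GB figure; the slides themselves note that this is not a thorough benchmark, so the ratios are indicative only.

```python
# Figures copied from the two "BigQuery vs. Hive" slides.
# Assumption: 780 GB is treated as 0.780 TB (decimal units).
BENCHMARKS = {
    "top 10 songs": {"hive_s": 2647, "bq_s": 108, "hive_tb": 15.5, "bq_tb": 1.50},
    "listening time": {"hive_s": 969, "bq_s": 33, "hive_tb": 15.5, "bq_tb": 0.780},
}


def ratios(row):
    """Returns (runtime ratio, data-scanned ratio) of Hive relative to BigQuery."""
    return row["hive_s"] / row["bq_s"], row["hive_tb"] / row["bq_tb"]


for name, row in BENCHMARKS.items():
    speedup, scan_reduction = ratios(row)
    print(f"{name}: {speedup:.1f}x faster, {scan_reduction:.1f}x less data scanned")
# top 10 songs: 24.5x faster, 10.3x less data scanned
# listening time: 29.4x faster, 19.9x less data scanned
```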
  29. 29. Nearly 10,000 Years!
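The question on the previous slide asks for hours, while the answer is given in years. A quick conversion, assuming an average year of 365.25 days (the slide does not say which year length was used) and treating 10,000 years as an upper bound because the slide says "nearly":

```python
HOURS_PER_YEAR = 365.25 * 24  # assumption: average Julian year
HOURS_IN_OCTOBER = 31 * 24    # October has 31 days


def years_to_hours(years):
    """Converts a duration in years to hours."""
    return years * HOURS_PER_YEAR


total_hours = years_to_hours(10_000)
print(f"{total_hours:,.0f} hours")  # 87,660,000 hours

# Derived figure, not from the slides: total listening hours divided by the
# hours in the month gives the average number of simultaneous listeners.
average_concurrent = total_hours / HOURS_IN_OCTOBER
print(f"about {average_concurrent:,.0f} listeners at any moment, on average")
```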
  30. 30. BigQuery at Spotify • Interactive and ad-hoc querying immediately started to transfer to BQ once the data was available on the cloud • Pace of learning increases as the friction of asking questions decreases
  31. 31. Cloud Pub/Sub • At-least-once, globally distributed message queue • For high-volume, low-topic-count (<10,000) publish-subscribe behavior • Like Kafka, but without needing to operate servers and supporting services (ZooKeeper)
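At-least-once delivery means a subscriber may see the same message more than once, so consumers are usually written to be idempotent. The following is a minimal pure-Python simulation of that idea. It does not use the Cloud Pub/Sub client library, and the class name, message IDs, and payloads are all hypothetical.

```python
class IdempotentConsumer:
    """Processes each message ID at most once, even if it is redelivered."""

    def __init__(self, handler):
        self._handler = handler
        # Unbounded in this sketch; a long-running consumer would need to
        # expire old IDs or keep them in an external store.
        self._seen = set()

    def receive(self, message_id, payload):
        """Returns True if the payload was processed, False if it was a duplicate."""
        if message_id in self._seen:
            return False
        self._handler(payload)
        self._seen.add(message_id)
        return True


processed = []
consumer = IdempotentConsumer(processed.append)

# Simulated at-least-once delivery: message "m2" arrives twice.
deliveries = [("m1", "play"), ("m2", "skip"), ("m2", "skip"), ("m3", "play")]
results = [consumer.receive(mid, body) for mid, body in deliveries]

print(processed)  # ['play', 'skip', 'play']
print(results)    # [True, True, False, True]
```

The ID is recorded only after the handler returns, so a handler that raises leaves the message eligible for reprocessing on redelivery.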
  32. 32. Cloud Pub/Sub at Spotify • 800K events/second? No problem • P99 Latency of ingestions into ES: 500ms • Ingestion from globally distributed non-GCP datacenters is painless
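A P99 latency of 500ms means that 99% of ingestions finish within 500ms. One common way to compute such a figure is the nearest-rank percentile, sketched below. The sample latencies are invented, and the talk does not say which percentile method Spotify uses.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Hypothetical latencies in milliseconds: 98 fast requests and 2 slow ones.
latencies_ms = [20] * 98 + [450, 900]

print(percentile(latencies_ms, 50))  # 20
print(percentile(latencies_ms, 99))  # 450
```

The median hides the slow tail entirely here, which is why tail percentiles such as P99 are the usual way to report ingestion latency.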
  33. 33. Cloud Dataflow • Managed service for running batch and streaming jobs • Unified API for batch and streaming mode • Inspired by internal Google tools like FlumeJava and MillWheel • Programming model open-sourced as Apache Beam (currently incubating)
  34. 34. Cloud Dataflow (Batch) at Spotify • Usually run via Scio: https://github.com/spotify/scio • Scio provides a Scala API for running Dataflow jobs and provides easy integrations with BigQuery • New batch processing jobs at Spotify are being written in Scio/Dataflow
  35. 35. Cloud Dataflow (Streaming) at Spotify • Exactly-once stream processing framework • A replacement for Spark/Flink streaming and Storm workloads at Spotify • Optimizes for consistency, which can complicate real-time workloads
  36. 36. Spotify + Google Cloud Timeline (2015 to 2016) • Beginning of Google Cloud evaluation • BigQuery begins to replace Hive • Cloud Pub/Sub begins to replace Kafka • Dataflow (streaming) begins to replace Storm • Dataflow (batch) replacing Map/Reduce Note: Dates are approximations
  37. 37. Putting It All Together
  38. 38. The Problem • We want to detect within minutes if we’ve introduced a bug in a client release that affects important event logging behavior
  39. 39. Before… Minutes to transfer Hours to Clean and Bucket Hours to Run Jobs or Ad Hoc Queries DAYS TO INSIGHTS
  40. 40. Getting Data from Clients to Pub/Sub • Built Pulsar, a simple service aggregating data from Access Points and feeding it into Cloud Pub/Sub • Replaces the Kafka real-time event feed
  41. 41. Pulsar
  42. 42. Dataflow • Subscribes to important event Pub/Sub topics • Aggregates events into minute windows • Always running, no need to schedule or wait for results
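Aggregating events into minute windows amounts to flooring each timestamp to the start of its minute and counting per window. The sketch below shows that bucketing in plain Python over a finite list. It is not Dataflow or Scio code, the event names and timestamps are made up, and it ignores what a real streaming engine must also handle, such as late or out-of-order data.

```python
from collections import Counter

WINDOW_SECONDS = 60


def window_start(epoch_seconds, size=WINDOW_SECONDS):
    """Floors a timestamp to the start of its fixed (tumbling) window."""
    return epoch_seconds - (epoch_seconds % size)


def count_per_window(events):
    """Counts events per (window start, event type).

    events: iterable of (epoch_seconds, event_type) pairs.
    """
    counts = Counter()
    for ts, event_type in events:
        counts[(window_start(ts), event_type)] += 1
    return counts


# Hypothetical events: timestamps in seconds and invented event names.
events = [(0, "play"), (59, "play"), (60, "play"), (61, "skip"), (125, "play")]
counts = count_per_window(events)

for (start, event_type), n in sorted(counts.items()):
    print(start, event_type, n)
```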
  43. 43. BigQuery • Receives aggregates from Dataflow • Allows for ad-hoc inspection or slicing on different dimensions
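One way the per-minute aggregates could be sliced to catch a logging bug in a client release is to compare each client version's current event rate against a baseline. The sketch below is an assumption about how such a check might look, not the check used at Spotify; the version strings, counts, and the 50% threshold are all hypothetical.

```python
def flag_drops(baseline, current, threshold=0.5):
    """Returns versions whose current events-per-minute fell below threshold * baseline."""
    flagged = []
    for version, expected in baseline.items():
        observed = current.get(version, 0)
        if expected > 0 and observed < threshold * expected:
            flagged.append(version)
    return sorted(flagged)


# Hypothetical per-minute event counts sliced by client version.
baseline = {"1.0": 1000, "1.1": 800, "1.2": 50}
current = {"1.0": 980, "1.1": 120, "1.2": 48}

print(flag_drops(baseline, current))  # ['1.1']
```

Slicing by version matters because a drop confined to one release can be invisible in the overall total.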
  44. 44. Tableau • Data visualization tool that integrates nicely with BigQuery • Pulls data from BigQuery periodically and caches for quick inspection
  45. 45. Milliseconds to transfer Milliseconds to process Seconds to Query SECONDS TO INSIGHTS
  46. 46. Faster Insights to Client Behavior
  47. 47. Problem As a developer, I want to be able to instantly explore data being logged by the clients.
  48. 48. Solution • Produce a topic for all employee client events • Store in Elasticsearch • Visualize in Kibana
  49. 49. Benefits • Able to understand what’s being sent by the client as it happens • Exploring events, visualizing distribution (e.g., does this field actually get populated?) • Prototyping analysis based on a sample • Dashboards for Employee Releases
  50. 50. Other Uses
  51. 51. Ad Targeting • Real-time genre targeting • Session insights — explicit filter
  52. 52. Real-time Recommendations
  53. 53. Live Results for X-Factor • X-Factor: music competition • Songs available on Spotify immediately after show airs • Listener behavior determines the order of contestants on the playlist
  54. 54. Review
  55. 55. Real-time Processing Batch Processing (Hadoop, Hive, BigQuery) “Continuous Analytics: Stream Query Processing in Practice”, Michael J Franklin, Professor, UC Berkley, Dec 2009
  56. 56. Behind the Scenes Minutes to transfer Hours to Clean and Bucket Hours to Run Jobs or Ad Hoc Queries DAYS TO INSIGHTS
  57. 57. To leverage actionable insights, we need a faster feedback loop!
  58. 58. Putting it all together Milliseconds to transfer Milliseconds to process Seconds to Query SECONDS TO INSIGHTS
  59. 59. The Value of a Fast Feedback Loop • Detecting problems early in data avoids long backfills or long-term data loss • Instant insights on newly developed features allow teams to iterate more quickly and take risks • Providing a quicker ad-hoc querying engine allows teams to ask more questions and learn faster
  60. 60. Use Anything and Everything • Spotify has leveraged Google Cloud tools, such as Pub/Sub, Dataflow and BigQuery • Open-source projects and other cloud providers offer many alternatives to this stack • Open-source tools (Elasticsearch/Kibana) and proprietary solutions (Tableau) have also been useful additions
  61. 61. Where Are We Going? • The real-time mission is in the early stages at Spotify
  62. 62. Stream Processing First • The sun never sets on Spotify, so why impose boundaries on our datasets? • What’s the shortest distance between two points? Zero! • Can we reduce the feedback cycle to zero?
  63. 63. We’re Hiring! Engineers, Managers, Product Owners needed in NYC and Stockholm https://www.spotify.com/jobs
  64. 64. Thanks, Big Data Spain!
