Spotify in the Cloud - An evolution of data infrastructure - Strata NYC

Slides from a presentation given by Alison Gilles and Josh Baer at Strata NYC 2017.

Covers the decision, the challenge, and the strategy (technical, organizational, and people-related) behind migrating the data and processing workloads of Spotify's 2,500-node Hadoop cluster to Google Cloud.

Finally, it touches on Spotify's resulting infrastructure on GCP.


1. Spotify in the Cloud: An evolution of data infrastructure
2. Alison Gilles, Director of Engineering, Data Infrastructure; Josh Baer, Technical Product Owner, Data Infrastructure
3. Moving to the Cloud
4. Music Streaming Service • Launched in 2008 • Premium and Free Tiers • Available in 60 Countries
5. Over 140M Active Users
6. More than 30M Songs
7. Over 1 billion plays per day
8. We have lots of data
9. Data Infrastructure • Spotify tribe (department) responsible for the data processing platform • Organized into 10 squads (engineering teams), split between Stockholm and NYC • Squads each own a problem space, e.g. event delivery, real-time processing
10. Data Processing Backend Services
11. The Decision
12. Some history… • Spotify was completely on-premise/bare metal • Previously had experimented with cloud, but nothing stuck • By 2014 we had launched in the US, growth was rapid, and we had trouble keeping up
13. Why move to the Cloud? Owning and operating physical machines is not a competitive advantage for Spotify.
14. Why Google? • Google’s Big Data toolset is best in class • Collaborating with Google engineers was a great fit for us
15. [Imagine a gif of Captain Picard saying “Make it so!” in this blank space]
16. The Challenge
17. Hadoop at Spotify • ~2,500 Nodes (50K CPU Cores) • >100 PB Capacity • >100 TB Memory accessible by jobs • 20K Jobs/Day from 2K unique workflows from 100 different teams
18. But wait, there’s more! • We can’t move it ourselves • People hate to be blocked • Everyone had quite a bit of other work to do
19. And… • Everything’s going to change • All new technology, and we’re the experts • PLEASE no one quit!
20. The Strategy
21. Everybody stop everything
22. Everybody stop everything
23. Keep it simple for Spotify engineers through tooling
24. Copy all the things! • Unblock teams by ensuring they have the data they need where they expect it • Last month: 80,000 jobs on-premise -> cloud; 30,000 jobs cloud -> on-premise
25. Everybody stop everything new
26. Two paths to choose between, and a strategy of not blocking teams
27. The Forklifting Path
28. Forklifting Path Challenges • We want to treat our machines more like cattle, less like pets • Our previous data setup relied on a very custom “pet-like” approach • How could we replicate the function of our previous setup while adopting our new approach?
29. Forklifting Path Technology: Styx • A batch job scheduler for Kubernetes • Containerize your workflows, then define a schedule (see the sketch below) • https://github.com/spotify/styx
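To make "containerize your workflows, then define a schedule" concrete, here is a minimal Scala sketch of the kind of self-contained batch entry point that could be packaged into a Docker image and triggered by a Styx-style scheduler. The convention of passing the partition as a command-line argument is an illustrative assumption, not Styx's actual contract; see the Styx repository for its real workflow configuration.

    // Hypothetical example: a self-contained batch job that processes one
    // partition per scheduled run. The argument convention is assumed for
    // illustration only; Styx's real parameter-passing contract is documented
    // in its repository.
    object HourlyPlaycountJob {
      def main(args: Array[String]): Unit = {
        // Assume the scheduler passes the partition to process, e.g. "2017-09-26T14".
        val partition = args.headOption.getOrElse(
          sys.error("usage: HourlyPlaycountJob <partition, e.g. 2017-09-26T14>"))

        // Placeholder for the real work: read this partition's input from GCS,
        // transform it, and write the result back out.
        println(s"Processing partition $partition")
      }
    }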
30. Forklifting Path Technology: Spydra • “Hadoop Cluster as a Service” utilizing Google Cloud Dataproc and GCS • Spin up clusters to run workflows, or point workflows at pre-existing clusters • https://github.com/spotify/spydra
31. Forklifting Path Learnings • Forklifting still requires time from teams • Forklifting means not cleaning up tech debt
32. The Rewrite Path
33. The Rewrite Path • Update your workflow using BigQuery or Scio (Dataflow) • Uses the latest and greatest • Fully managed
34. The Rewrite Path: Scio • Scala API around Apache Beam • Currently used by six other companies • https://github.com/spotify/scio (see the example below)
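For a sense of what the rewrite path looks like in practice, here is roughly the shape of Scio's canonical word-count pipeline. This is a sketch against the 2017-era API (newer Scio releases use sc.run() instead of sc.close()); the --input and --output values are ordinary pipeline arguments, and whether it runs locally or on Cloud Dataflow depends on the Beam runner options passed on the command line.

    import com.spotify.scio._

    // A minimal Scio pipeline in the style of the project's WordCount example.
    object WordCount {
      def main(cmdlineArgs: Array[String]): Unit = {
        val (sc, args) = ContextAndArgs(cmdlineArgs)

        sc.textFile(args("input"))                        // read lines of text
          .flatMap(_.split("\\W+").filter(_.nonEmpty))    // split lines into words
          .countByValue                                   // (word, count) pairs
          .map { case (word, count) => s"$word\t$count" }
          .saveAsTextFile(args("output"))                 // write tab-separated results

        sc.close()                                        // run the pipeline
      }
    }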
35. Hofstadter's Law: It always takes longer than you expect, even when you take into account Hofstadter's Law.
36. The Task Force
37. The Task Force: We need a push • Some teams started rewriting immediately • Some teams forklifted themselves • Others needed a “push”
38. The Task Force: What were they? • A couple of dedicated infrastructure engineers • A focused 1-2 week sprint • Move ALL THE THINGS
39. What about people stuff? Teams with a mission… Life gets even better
40. The Stack
41. Ad-Hoc / Interactive Analysis: BigQuery • MAU: 25% of all Spotify employees • Over 3 million queries and scheduled jobs per month (see the sketch below)
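As a hedged sketch of how such ad-hoc analysis can feed a batch pipeline, the scio-bigquery module lets a job start from a query result. The dataset, table, and column names below are invented for illustration, and the exact signature of bigQuerySelect (here taking a plain legacy-SQL string) has shifted across Scio versions.

    import com.spotify.scio._
    import com.spotify.scio.bigquery._

    // Sketch: run an ad-hoc BigQuery query and post-process the rows in a pipeline.
    // Dataset, table, and column names are hypothetical.
    object TopTracksReport {
      def main(cmdlineArgs: Array[String]): Unit = {
        val (sc, args) = ContextAndArgs(cmdlineArgs)

        sc.bigQuerySelect(
            "SELECT track_id, COUNT(*) AS plays FROM [example_dataset.play_events] GROUP BY track_id")
          .map(row => s"${row.get("track_id")}\t${row.get("plays")}")  // TableRow field access
          .saveAsTextFile(args("output"))

        sc.close()
      }
    }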
42. Event Delivery: Cloud Pub/Sub • 1 trillion requests per day • P99 latencies less than 400 ms
43. Big Data Processing: Cloud Dataflow and Dataproc • 5,000 Dataflow jobs run per day • Most via Scio • Dataproc spins up Hadoop clusters per workflow
44. Machine Learning • An area of development • Currently users of Spark, TensorFlow, Keras, and scikit-learn • Evaluating ML Engine
45. Other future endeavors… • Data quality tooling • Data management • User privacy infrastructure
46. Key Takeaways
47. Lesson #1: Know Thy Org. Understand how your organization operates and tailor your migration path to it.
48. Lesson #2: Embrace Change. Be honest with teams, and challenge people.
49. Lesson #3: Open Source. A blend of Google primitives and Spotify-specific tools; open source when possible! spotify.github.io
50. We’re Hiring! Engineers, Managers, and Product Owners needed in NYC and Stockholm. https://www.spotifyjobs.com/
51. Q&A • Alison Gilles, Director of Engineering, Data Infrastructure ([email redacted], @agilles) • Josh Baer, Technical Product Owner, Data Infrastructure ([email redacted], @l_phant)
