Slides from a presentation given by Alison Gilles and Josh Baer during StrataNYC 2017.
Covers the decision, the challenges and the strategy (technical, organizational, people) for migrating the data and processing of Spotify's 2,500-node Hadoop cluster to Google Cloud.
Finally, touches on Spotify's resulting infrastructure on GCP.
9. • Spotify tribe (department) responsible for the data
processing platform
• Organized into 10 squads (engineering teams),
split between Stockholm and NYC
• Squads each own a problem space,
e.g. event delivery, real-time processing
Data Infrastructure
12. • Spotify was completely on-premise/bare metal
• Previously had experimented with cloud, but
nothing stuck
• By 2014 we had launched in the US, growth was
rapid and we had trouble keeping up
Some history…
13. Owning and operating physical machines is not a
competitive advantage for Spotify.
Why move to the Cloud?
14. • Google’s Big Data toolset is best in class
• Collaborating with Google engineers was a great
fit for us
Why Google?
15. [Imagine a gif of Captain Picard
saying “Make it so!” in this blank
space]
17. • ~ 2,500 Nodes (50K CPU Cores)
• > 100 PB Capacity
• > 100 TB Memory accessible by jobs
• 20K Jobs/Day from 2K unique workflows from 100
different teams
Hadoop at Spotify
19. • We can’t move it ourselves
• People hate to be blocked
• Everyone had quite a bit of other work to do
But wait, there’s more!
20. • Everything’s going to change
• All new technology and we’re the experts
• PLEASE no one quit!
And…
25. • Unblock teams by ensuring they have the data
they need where they expect it
• Last month:
• 80,000 jobs: on-premise -> cloud
• 30,000 jobs: cloud -> on-premise
Copy all the things!
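The bidirectional copying described on this slide can be pictured with a small planning routine. This is a minimal sketch under stated assumptions, not Spotify's actual tooling: the `Dataset` and `CopyJob` types, the `plan_copies` function and the environment labels are all hypothetical names introduced for illustration. The idea is that a dataset is copied to every environment where a consumer expects it and where it was not produced.

```python
from dataclasses import dataclass
from typing import FrozenSet, List

# Hypothetical environment labels used only in this sketch.
ON_PREM = "on-premise"
CLOUD = "cloud"


@dataclass(frozen=True)
class Dataset:
    name: str
    produced_in: str             # environment where the producing job runs
    consumed_in: FrozenSet[str]  # environments where consumers expect the data


@dataclass(frozen=True)
class CopyJob:
    dataset: str
    source: str
    destination: str


def plan_copies(datasets: List[Dataset]) -> List[CopyJob]:
    """Return one copy job per (dataset, environment) pair that lacks the data."""
    jobs = []
    for ds in datasets:
        # sorted() keeps the plan deterministic across runs
        for target in sorted(ds.consumed_in):
            if target != ds.produced_in:
                jobs.append(CopyJob(ds.name, ds.produced_in, target))
    return jobs


if __name__ == "__main__":
    plan = plan_copies([
        Dataset("plays", ON_PREM, frozenset({ON_PREM, CLOUD})),
        Dataset("reports", CLOUD, frozenset({ON_PREM})),
    ])
    for job in plan:
        print(f"{job.dataset}: {job.source} -> {job.destination}")
```

A dataset that is produced and consumed in the same environment yields no copy job, so the plan shrinks as producers and consumers finish migrating.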
29. • We want to treat our machines more like cattle,
less like pets
• Our previous data setup relied on a very custom
“pet-like” approach
• How could we replicate the function of our
previous setup while adopting our new approach?
Forklifting Path Challenges
30. • A batch job scheduler for Kubernetes
• Containerize your workflows, then define a schedule
• https://github.com/spotify/styx
Styx
Forklifting Path Technology
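The "containerize, then define a schedule" model can be sketched as follows. This does not use Styx's real configuration format or API (see the repository above for those). The `Workflow` fields, the schedule names and the `{}` partition placeholder are assumptions made for this illustration only.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List, Tuple

# Hypothetical schedule names mapped to the interval between runs.
SCHEDULE_STEP = {"hourly": timedelta(hours=1), "daily": timedelta(days=1)}


@dataclass(frozen=True)
class Workflow:
    workflow_id: str
    docker_image: str             # the containerized workflow
    docker_args: Tuple[str, ...]  # "{}" stands for the partition being processed
    schedule: str                 # a key of SCHEDULE_STEP


def due_partitions(wf: Workflow, start: datetime, end: datetime) -> List[datetime]:
    """List the partitions a scheduler would trigger in the window [start, end)."""
    step = SCHEDULE_STEP[wf.schedule]
    partitions = []
    current = start
    while current < end:
        partitions.append(current)
        current += step
    return partitions


def run_request(wf: Workflow, partition: datetime) -> Dict[str, object]:
    """Describe one container run; a real scheduler would submit it to Kubernetes."""
    args = [arg.replace("{}", partition.isoformat()) for arg in wf.docker_args]
    return {"workflow": wf.workflow_id, "image": wf.docker_image, "args": args}


if __name__ == "__main__":
    wf = Workflow("top-tracks", "example/top-tracks:1.0", ("--date", "{}"), "daily")
    for p in due_partitions(wf, datetime(2017, 9, 1), datetime(2017, 9, 4)):
        print(run_request(wf, p))
```

Keeping the workflow definition to an image, arguments and a schedule is what lets the scheduler stay ignorant of what the job does internally.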
31. • “Hadoop Cluster as a Service” utilizing Google
Cloud Dataproc and GCS
• Spin up clusters to run workflows or point
workflows at pre-existing clusters
• https://github.com/spotify/spydra
Spydra
Forklifting Path Technology
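The two modes on this slide (spin up a cluster per workflow, or point the workflow at a pre-existing cluster) can be sketched with a context manager. This is an illustration only and does not call Spydra or Cloud Dataproc: `FakeClusterService`, `cluster_for` and `run_workflow` are hypothetical names, and the service is an in-memory stand-in.

```python
from contextlib import contextmanager
from typing import Callable, Iterator, Optional, Set


class FakeClusterService:
    """Hypothetical in-memory stand-in for a managed Hadoop cluster service."""

    def __init__(self) -> None:
        self.active: Set[str] = set()
        self._counter = 0

    def create(self) -> str:
        self._counter += 1
        name = f"ephemeral-{self._counter}"
        self.active.add(name)
        return name

    def delete(self, name: str) -> None:
        self.active.discard(name)


@contextmanager
def cluster_for(service: FakeClusterService,
                existing: Optional[str] = None) -> Iterator[str]:
    """Yield a cluster name: reuse `existing` if given, else create and tear down."""
    if existing is not None:
        yield existing  # pre-existing cluster: its lifecycle is not ours to manage
        return
    name = service.create()
    try:
        yield name
    finally:
        service.delete(name)  # ephemeral cluster is removed even if the job fails


def run_workflow(service: FakeClusterService,
                 job: Callable[[str], str],
                 existing: Optional[str] = None) -> str:
    with cluster_for(service, existing) as cluster:
        return job(cluster)


if __name__ == "__main__":
    service = FakeClusterService()
    print(run_workflow(service, lambda c: f"ran on {c}"))
    print(run_workflow(service, lambda c: f"ran on {c}", existing="shared-cluster"))
```

Tearing the cluster down in a `finally` block is one way to get the "cattle, not pets" behaviour: no ephemeral cluster outlives the workflow that requested it.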
32. • Forklifting still requires time from teams
• Forklifting means not cleaning up tech debt
Forklifting Path Learnings