An evolution of data infrastructure
Spotify in the Cloud
Alison Gilles
Director of Engineering,
Data Infrastructure
Josh Baer
Technical Product Owner,
Data Infrastructure
Moving to the Cloud
Music Streaming Service
Launched in 2008
Premium and Free Tiers
Available in 60 Countries
Over 140M Active Users
More than 30M Songs
Over 1 billion plays per day
We have lots of data
• Spotify tribe (department) responsible for data
processing platform
• Organized into 10 squads (engineering teams),
split between Stockholm and NYC
• Squads each own a problem space —
e.g. event delivery, real-time processing
Data Infrastructure
Data Processing Backend Services
The Decision
• Spotify was completely on-premise/bare metal
• We had previously experimented with the cloud, but
nothing stuck
• By 2014 we had launched in the US; growth was
rapid and we had trouble keeping up
Some history…
Owning and operating physical machines is not a
competitive advantage for Spotify.
Why move to the Cloud?
• Google’s Big Data toolset is best in class
• Collaborating with Google engineers was a great
fit for us
Why Google?
[Imagine a gif of Captain Picard
saying “Make it so!” in this blank
space]
The Challenge
• ~ 2,500 Nodes (50K CPU Cores)
• > 100 PB Capacity
• > 100 TB Memory accessible by jobs
• 20K Jobs/Day from 2K unique workflows from 100
different teams
Hadoop at Spotify
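To put the cluster figures above in perspective, here is a rough back-of-the-envelope sketch of the per-node and per-workflow averages they imply. It assumes the "~" and ">" figures are taken at face value and that decimal units apply (1 PB = 1,000 TB, 1 TB = 1,000 GB); the variable and function names are illustrative only, and real nodes were unlikely to be this uniform.

```python
# Figures as quoted on the slide; "~" and ">" bounds are treated as exact (an assumption).
NODES = 2_500
CPU_CORES = 50_000
STORAGE_PB = 100
MEMORY_TB = 100
JOBS_PER_DAY = 20_000
WORKFLOWS = 2_000
TEAMS = 100


def average(total, count):
    """Plain average; hides any skew between large and small nodes or teams."""
    return total / count


cores_per_node = average(CPU_CORES, NODES)
storage_tb_per_node = average(STORAGE_PB * 1_000, NODES)   # decimal PB -> TB
memory_gb_per_node = average(MEMORY_TB * 1_000, NODES)     # decimal TB -> GB
runs_per_workflow_per_day = average(JOBS_PER_DAY, WORKFLOWS)
workflows_per_team = average(WORKFLOWS, TEAMS)

print(f"CPU cores per node:        {cores_per_node:.0f}")
print(f"Storage per node (TB):     {storage_tb_per_node:.0f}")
print(f"Job memory per node (GB):  {memory_gb_per_node:.0f}")
print(f"Runs per workflow per day: {runs_per_workflow_per_day:.0f}")
print(f"Workflows per team:        {workflows_per_team:.0f}")
```

On these averages, each team owns roughly 20 workflows, which helps explain why the migration could not be done centrally.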
• We can’t move it ourselves
• People hate to be blocked
• Everyone had quite a bit of other work to do
But wait, there’s more!
• Everything’s going to change
• All new technology and we’re the experts
• PLEASE no one quit!
And…
The Strategy
Everybody stop everything
Keep it simple for Spotify
engineers through tooling
• Unblock teams by ensuring they have the data
they need where they expect it
• Last month:
• 80,000 jobs: on-premise -> cloud
• 30,000 jobs: cloud -> on-premise
Copy all the things!
Everybody stop everything new
Two paths to choose between
And a strategy of not blocking teams
The Forklifting Path
• We want to treat our machines more like cattle,
less like pets
• Our previous data setup relied on a very custom
“pet-like” approach
• How could we replicate the function of our
previous setup and adopt our new approach?
Forklifting Path Challenges
• A batch job scheduler for Kubernetes
• Containerize your workflows, then define a schedule
• https://github.com/spotify/styx
Styx
Forklifting Path Technology
• “Hadoop Cluster as a Service” utilizing Google
Cloud Dataproc and GCS
• Spin up clusters to run workflows or point
workflows at pre-existing clusters
• https://github.com/spotify/spydra
Spydra
Forklifting Path Technology
• Forklifting still requires time from teams
• Forklifting means not cleaning up tech debt
Forklifting Path Learnings
The Rewrite Path
• Update your workflow using
BigQuery or Scio (Dataflow)
• Uses the latest and greatest
• Fully managed
The Rewrite Path
• Scala API around Apache Beam
• Currently used by six other
companies
• https://github.com/spotify/scio
Scio
The Rewrite Path
It always takes longer than you expect,
even when you take into account
Hofstadter's Law.
Hofstadter's Law
The Task Force
• Some teams started rewriting immediately
• Some teams forklifted themselves
• Others needed a “push”
We need a push
The Task Force
• A couple of dedicated infrastructure engineers
• A focused 1-2 week sprint
• Move ALL THE THINGS
What were they?
The Task Force
Teams with a
mission…
What about people stuff?
Life gets even better
The Stack
BigQuery
• MAU: 25% of total Spotify
employees
• Over 3 million queries and
scheduled jobs per month
Ad-Hoc / Interactive Analysis
Pub/Sub
• 1 trillion requests per day
• P99 latencies less than 400 ms
Event Delivery
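As a rough illustration of what the daily request figure above means as a sustained rate, the sketch below converts it to requests per second. It assumes the load is spread evenly over the day, which real traffic is not, so actual peaks would be higher; the names are illustrative only.

```python
# Daily request volume as quoted on the slide.
REQUESTS_PER_DAY = 1_000_000_000_000  # 1 trillion
SECONDS_PER_DAY = 24 * 60 * 60        # 86,400

# Average rate, assuming perfectly even load across the day (an assumption).
requests_per_second = REQUESTS_PER_DAY / SECONDS_PER_DAY

print(f"Average requests per second: {requests_per_second:,.0f}")
```

That works out to an average of roughly 11.6 million requests per second.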
Cloud Dataflow and Dataproc
• 5000 Dataflow jobs run per day
• Most via Scio
• Dataproc spins up Hadoop
clusters per workflow
Big Data Processing
• An area of development
• Currently users of Spark,
TensorFlow, Keras, scikit-learn
• Evaluating ML Engine
Machine Learning
• Data quality tooling
• Data management
• User privacy infrastructure
Other future endeavors…
Key Takeaways
Lesson #1: Know Thy Org
Understand how your organization operates and
tailor your migration path towards that
Lesson #2: Embrace Change
Be honest with teams, challenge people
Lesson #3: Open Source
Blend of Google primitives and Spotify-specific
things. Open source when possible!
spotify.github.io
Engineers, Managers, Product
Owners needed in NYC and
Stockholm
https://www.spotifyjobs.com/
We’re Hiring!
[email redacted]
@l_phant
Q&A
Alison Gilles
Director of Engineering,
Data Infrastructure
Josh Baer
Technical Product Owner,
Data Infrastructure
[email redacted]
@agilles

Spotify in the Cloud - An evolution of data infrastructure - Strata NYC
