Migrating pipelines
into Docker
Noa Resare, Spotify
@blippie
Welcome!
‣(let’s define pipeline)
‣Background
‣Docker improving engineering
experience
‣Docker as a piece of the puzzle
for handling growth
‣Practical advice
Spotify & me
‣Spotify
Streaming music
Celebrates 10 years this summer
30m subscribers, most users on free tier
millions of concurrent users
‣..me
at Spotify for 6 years
from fewer than 50 engineers to more than 1000
operations engineering
backend development
Free Software
Data Infrastructure
Big Data at
Spotify
Humble beginnings
‣Counting stream playbacks
‣Stack of servers in the Fußball-room
‣Hadoop Streaming, Python
‣Quick excursion to Amazon in 2012
The new cluster
‣One large cluster, early 2012
‣60 nodes!
‣Luigi development starts
More technologies
‣Python code
‣Pure Java MapReduce
‣Apache Crunch
‣Scala, Scalding
Spotify engineering org
‣A lot of autonomy
‣Big data touches many different
teams
Finance
Analytics
Feature development (A/B testing)
Recommendations
Payments and fraud
Shared resources, packaging
‣Started out with some shared edge
nodes
chaos ensued
‣More edge nodes!
more chaos? more chaos!
‣Shared execution environment
from .deb to .jar
still a lot of one-off edge nodes
Docker for
pipelines
Brief introduction to docker
‣Containers seem like virtual machines
‣docker run -it <image_name>
‣Filesystem reset between invocations
‣Typically built using a Dockerfile
‣Image inheritance
Docker at Spotify
‣Big bet on Docker for services: Helios
‣Lots of useful infrastructure
‣Solves some immediate packaging
problems
What does Docker provide?
‣Useful abstraction to reason about
‣an incremental way out of dependency
hell
‣Artefact distribution, caching
‣Image inheritance mechanism for sharing
infrastructure
Switching to docker in practice
‣Previously
maven project with java, python, cron file
build step to upload resulting jar to artifactory
build step to copy cron file to execution cluster
‣Now
add Dockerfile, data infrastructure base image
build step to build and upload image
Problems with cron cluster execution
‣Implicit deployment via CI/CD
declaration
‣Status reported via output materialising
‣Who / what triggered job X?
‣Where does it run?
‣Debugging is a pain
Our solution: execution as a service
‣RESTful API for pipeline execution
‣List your job invocations
‣Explicitly schedule execution on node
‣Don’t rerun successful execution
‣Interface: docker image
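The bullets above can be sketched as a small in-memory model. This is only an illustration, not Styx or any Spotify API: the names `ExecutionService`, `Invocation`, `run_image`, `partition` and `trigger` are all hypothetical, and `run_image` stands in for whatever actually starts a Docker image on a node.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Invocation:
    image: str      # the Docker image is the whole interface to the job
    partition: str  # hypothetical: the slice of data to process, e.g. a date
    status: str     # "success" or "failed"
    trigger: str    # who or what asked for this run


@dataclass
class ExecutionService:
    # Hypothetical hook: runs the image for a partition, returns True on success.
    run_image: Callable[[str, str], bool]
    history: List[Invocation] = field(default_factory=list)
    _succeeded: Dict[Tuple[str, str], Invocation] = field(default_factory=dict)

    def execute(self, image: str, partition: str, trigger: str) -> Invocation:
        """Run image for partition unless that pair has already succeeded."""
        key = (image, partition)
        if key in self._succeeded:
            # Don't rerun a successful execution; hand back the earlier record.
            return self._succeeded[key]
        ok = self.run_image(image, partition)
        invocation = Invocation(image, partition, "success" if ok else "failed", trigger)
        self.history.append(invocation)
        if ok:
            self._succeeded[key] = invocation
        return invocation

    def invocations(self, image: str) -> List[Invocation]:
        """List every recorded invocation of an image, including failures."""
        return [inv for inv in self.history if inv.image == image]
```

Because each invocation records its trigger and status, "who / what triggered job X?" becomes a lookup in `history` rather than guesswork. Failed runs are kept in the history but never block a retry.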
Data growth,
or cluster
day of doom
Scaling is hard
‣2000 nodes
‣100 PB storage
‣800 000 000 files in HDFS
‣180 GB heap, 10 GB young generation
‣Adding 100 TB of data per day
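A rough back-of-envelope reading of these numbers, as a sketch only. It assumes decimal units (1 PB = 10^15 bytes) and that the heap figure refers to the HDFS NameNode, which keeps file metadata in memory. The slide does not say either of those things.

```python
PB = 10**15
TB = 10**12
GB = 10**9
MB = 10**6

storage_bytes = 100 * PB
file_count = 800_000_000
heap_bytes = 180 * GB          # assumed to be the NameNode heap
daily_growth_bytes = 100 * TB

# Average file size: total storage divided by number of files.
avg_file_mb = storage_bytes / file_count / MB

# Heap available per file if the whole heap were spent on file metadata.
heap_bytes_per_file = heap_bytes / file_count

# Daily growth as a percentage of what is already stored.
daily_growth_pct = 100 * daily_growth_bytes / storage_bytes

# Days until storage doubles if the daily growth stayed constant.
days_to_double = storage_bytes / daily_growth_bytes

print(f"average file size: {avg_file_mb:.0f} MB")
print(f"heap per file: {heap_bytes_per_file:.0f} bytes")
print(f"daily growth: {daily_growth_pct:.1f}% of current storage")
print(f"days to double at constant growth: {days_to_double:.0f}")
```

Under these assumptions the average file is about 125 MB, and the heap works out to roughly 225 bytes per file, which hints at why the file count, not just the raw storage, is the pressure point.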
Docker as vehicle for migration
‣Our path forward: Google Cloud
‣Decouple storage from compute
‣Transparent switch from on-premises
Hadoop to Dataproc and Cloud Storage
‣Entry point executable in base image
‣Auth, config, dynamic cluster allocation
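One way to picture an entry point that hides the migration from pipeline authors is sketched below. Everything here is an assumption made for illustration: the `PIPELINE_TARGET` variable, the `resolve_backend` and `build_command` helpers, and the runner names are hypothetical and not the actual base image. Only the `hdfs://` and `gs://` URI schemes are real.

```python
from typing import Dict, List, Mapping


def resolve_backend(env: Mapping[str, str]) -> Dict[str, str]:
    """Pick compute and storage settings from the environment.

    PIPELINE_TARGET is a hypothetical variable set by the platform,
    not by the pipeline author.
    """
    target = env.get("PIPELINE_TARGET", "onprem")
    if target == "onprem":
        return {"runner": "hadoop", "storage_prefix": "hdfs://"}
    if target == "gcp":
        return {"runner": "dataproc", "storage_prefix": "gs://"}
    raise ValueError(f"unknown PIPELINE_TARGET: {target!r}")


def build_command(job_args: List[str], env: Mapping[str, str]) -> List[str]:
    """Wrap the job's own arguments with backend-specific settings."""
    backend = resolve_backend(env)
    return [
        backend["runner"],
        "--storage-prefix",
        backend["storage_prefix"],
        *job_args,
    ]
```

The pipeline's image and arguments stay the same in both cases. Only the environment supplied by the platform changes, which is what makes the switch transparent to the team that owns the pipeline.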
Where are we now?
‣Two squads are using dockerized
pipelines in production
‣Still using Luigi, pull-based
dependencies
‣Styx, execution as a service, soon in prod
‣Google cloud migration as we speak
‣Docker drives transparent migration
Some practical docker advice
‣Reproducible normalised builds
‣Explicit versioning
‣Split code, configuration, secrets
‣github.com/spotify/dockerfile-maven
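Two of these points, explicit versioning and keeping configuration and secrets out of the code, can be sketched in a few lines. This is a minimal illustration under stated assumptions: the helper names, the `PIPELINE_OUTPUT_PATH` variable and the `api_token` file are hypothetical, and image references pinned by digest are not handled.

```python
from pathlib import Path
from typing import Dict, Mapping


def require_explicit_tag(image_ref: str) -> str:
    """Return the tag of an image reference, rejecting missing or 'latest' tags."""
    # Only look at the last path segment so a registry port is not mistaken for a tag.
    last_segment = image_ref.rsplit("/", 1)[-1]
    _name, sep, tag = last_segment.partition(":")
    if not sep or not tag or tag == "latest":
        raise ValueError(f"image needs an explicit version tag: {image_ref!r}")
    return tag


def load_settings(env: Mapping[str, str], secrets_dir: Path) -> Dict[str, str]:
    """Read configuration from the environment and secrets from mounted files.

    Neither ends up in the image, so the same image can run anywhere.
    """
    return {
        "output_path": env["PIPELINE_OUTPUT_PATH"],  # hypothetical config key
        "api_token": (secrets_dir / "api_token").read_text().strip(),  # hypothetical secret
    }
```

A check like `require_explicit_tag` could run in a build step, so that an unversioned image never reaches the execution cluster.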
Thank you!
Don’t be a stranger
noa@spotify.com
@blippie
