June 12, 2014
Danielle Jabin
Data Engineer, A/B Testing
Data at Spotify
I’m Danielle Jabin
•  Data Engineer in the Stockholm office
•  A/B testing infrastructure
•  California born & raised
•  If I can survive a Swedish winter, so can you!
•  Studied Computer Science, Statistics, and Real Estate
through the M&T program at the University of
Pennsylvania
Over 40 million active users
As of June 9, 2014
Access to more than 20 million songs
Big Data
•  40 million Monthly Active Users
•  20+ million tracks
•  1.5 TB of compressed data from users per day
•  64 TB of data generated in Hadoop each day (including
replication factor of 3)
So how much data is that?
Let’s compare: 64 TB
•  293,203,072 books (200 pages, or 240,000 characters, each)
•  16,777,216 MP3 files (with 4MB average file size)
•  22,369,600 images (with 3MB average file size)
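The arithmetic behind these comparisons can be checked directly, assuming binary units (1 TB = 2^40 bytes); the slide's own figures for books and images round slightly differently:

```python
# Back-of-the-envelope check of the 64 TB comparison,
# using binary units: 1 TB = 2**40 bytes, 1 MB = 2**20 bytes.
TB = 2 ** 40
MB = 2 ** 20
total_bytes = 64 * TB

mp3s = total_bytes // (4 * MB)    # 4 MB average MP3 -> 16,777,216 files
images = total_bytes // (3 * MB)  # 3 MB average image -> ~22.4 million
books = total_bytes // 240_000    # 240,000 characters (~bytes) per book -> ~293 million
```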
That’s a lot of selfies
How do we use this data?
Use Cases
•  Reporting
•  Business Analytics
•  Operational Analytics
•  Product Features
Reporting
•  Reporting to labels, licensors, partners, and advertisers
•  We support our partners
Business Analytics
•  Analyzing growth, user behavior, sign-up funnels, etc.
•  Company KPIs
•  NPS analysis
Operational Metrics
•  Root cause analysis
•  Latency analysis
•  Better capacity planning (servers, people, bandwidth)
Product Features
•  Discover and Radio
•  Top lists
•  Personalized recommendations
•  A/B Testing
How do we collect this
data?
The three pillars of our Data Infrastructure:
•  Kafka: Collection
•  Hadoop: Processing
•  Databases: Analytics/Visualization
This is Dave. Data Engineer at
Spotify by day…
…chiptune DJ Demoscene Time
Machine by night.
Let’s listen to Dave’s song
Kafka
•  High volume pub-sub
system
•  “Producers publish messages to
Kafka topics, and consumers
subscribe to these topics and
consume the messages.”
Kafka
•  Robust and scalable solution for collection of logs
•  Fast data transfer
•  Low CPU overhead
•  Built-in partitioning, replication, and fault-tolerance
•  Consumers can pull data at different rates
•  Able to handle extremely high volumes
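The pub-sub pattern above can be sketched with a toy in-process broker. This is illustrative only: real Kafka is a distributed, partitioned, replicated commit log, and the `Broker`, `publish`, and `consume` names here are hypothetical, not the Kafka API. The key idea shown is that each consumer group keeps its own offset, so consumers can pull at different rates:

```python
from collections import defaultdict

# Toy in-process model of the pub-sub pattern (not the real Kafka API).
class Broker:
    def __init__(self):
        self.topics = defaultdict(list)   # topic -> append-only message log
        self.offsets = defaultdict(int)   # (group, topic) -> next offset to read

    def publish(self, topic, message):
        """Producers append messages to a topic's log."""
        self.topics[topic].append(message)

    def consume(self, group, topic, max_messages=10):
        """Each consumer group tracks its own offset, so groups
        can pull at different rates without affecting each other."""
        log = self.topics[topic]
        start = self.offsets[(group, topic)]
        batch = log[start:start + max_messages]
        self.offsets[(group, topic)] = start + len(batch)
        return batch

broker = Broker()
broker.publish("plays", {"user": 1, "track": "t1"})
broker.publish("plays", {"user": 2, "track": "t2"})

fast = broker.consume("reporting", "plays")                  # reads both messages
slow = broker.consume("analytics", "plays", max_messages=1)  # reads only the first
```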
Other people listened too!
Hadoop
•  Process and store massive amounts of unstructured data
across a distributed cluster
•  Grown from one cluster of 37 nodes to 690 nodes today
•  28 PB of storage
•  The largest Hadoop cluster in Europe
Hadoop
•  Entering the land of optimizations
•  Data retention policy
•  Move to JVM-based languages
•  MapReduce languages
•  Moving to Crunch, JVM-based, for speed and scalability
•  Python with Hadoop Streaming, Java, Hive, Pig, Scala
•  Sprunch: Crunch wrapper for Scala, open sourced by Spotify
•  Spotify open-sourced scheduler, Luigi, written in Python
•  Simple and easy way to chain jobs
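The MapReduce style used with Hadoop Streaming can be sketched locally: the mapper and reducer are plain functions over records, and Hadoop's shuffle/sort between them is simulated here by an in-memory sort. The play-log lines and their field layout are made up for illustration:

```python
import itertools

# Mapper: one input line -> (key, count) pairs.
def mapper(line):
    user, track = line.split("\t")
    yield track, 1                 # emit one count per play

# Reducer: one key plus all its counts -> (key, total).
def reducer(track, counts):
    yield track, sum(counts)

log_lines = ["u1\ttrackA", "u2\ttrackA", "u1\ttrackB"]

# Map phase.
mapped = [kv for line in log_lines for kv in mapper(line)]
# Shuffle/sort phase (Hadoop does this between map and reduce).
mapped.sort(key=lambda kv: kv[0])
# Reduce phase: group by key, feed each group's values to the reducer.
results = dict(
    out
    for key, group in itertools.groupby(mapped, key=lambda kv: kv[0])
    for out in reducer(key, (v for _, v in group))
)
# results == {"trackA": 2, "trackB": 1}
```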
What if we want to know more?
Databases
•  Aggregates from Hadoop put into PostgreSQL or Cassandra via Sqoop
•  Core data can be used and manipulated for various needs
•  Ad hoc queries
•  Dashboards
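The pattern of landing Hadoop aggregates in a relational store for ad hoc queries can be sketched with SQLite standing in for PostgreSQL (in practice the bulk transfer is done with Sqoop, and the `daily_plays` table and its numbers here are made up for illustration):

```python
import sqlite3

# In-memory SQLite as a stand-in for the PostgreSQL target.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_plays (day TEXT, country TEXT, plays INTEGER)")

# Hypothetical daily aggregates produced by a Hadoop job.
aggregates = [
    ("2014-06-09", "SE", 1200),
    ("2014-06-09", "US", 3400),
    ("2014-06-10", "SE", 1500),
]
conn.executemany("INSERT INTO daily_plays VALUES (?, ?, ?)", aggregates)

# An ad hoc query a dashboard might run: total plays per country.
rows = conn.execute(
    "SELECT country, SUM(plays) FROM daily_plays "
    "GROUP BY country ORDER BY country"
).fetchall()
# rows == [("SE", 2700), ("US", 3400)]
```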
Questions?
A/B testing questions? Find me!
Thank you!
