June 12, 2014
Danielle Jabin
Data Engineer, A/B Testing
Data at Spotify
I’m Danielle Jabin
•  Data Engineer in the Stockholm office
•  A/B testing infrastructure
•  California born & raised
•  If I can survive a Swedish winter, so can you!
•  Studied Computer Science, Statistics, and Real Estate
through the M&T program at the University of
Pennsylvania
Over 40 million active users
As of June 9, 2014
Access to more than 20 million songs
Big Data
•  40 million Monthly Active Users
•  20+ million tracks
•  1.5 TB of compressed data from users per day
•  64 TB of data generated in Hadoop each day (including
replication factor of 3)
So how much data is that?
Let’s compare: 64 TB
•  293,203,072 books (200 pages, or 240,000 characters, each)
•  16,777,216 MP3 files (with 4MB average file size)
•  22,369,600 images (with 3MB average file size)
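The arithmetic behind these comparisons can be checked directly, assuming binary units (1 TB = 2^40 bytes); the slide's own figures for books and images round slightly differently:

```python
# Back-of-the-envelope check of the 64 TB comparison,
# using binary units: 1 TB = 2**40 bytes, 1 MB = 2**20 bytes.
TB = 2 ** 40
MB = 2 ** 20
total_bytes = 64 * TB

mp3s = total_bytes // (4 * MB)    # 4 MB average MP3 -> 16,777,216 files
images = total_bytes // (3 * MB)  # 3 MB average image -> ~22.4 million
books = total_bytes // 240_000    # 240,000 characters (~bytes) per book -> ~293 million
```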
That’s a lot of selfies
How do we use this data?
Use Cases
•  Reporting
•  Business Analytics
•  Operational Analytics
•  Product Features
Reporting
•  Reporting to labels, licensors, partners, and advertisers
•  We support our partners
Business Analytics
•  Analyzing growth, user behavior, sign-up funnels, etc.
•  Company KPIs
•  NPS analysis
Operational Metrics
•  Root cause analysis
•  Latency analysis
•  Better capacity planning (servers, people, bandwidth)
Product Features
•  Discover and Radio
•  Top lists
•  Personalized recommendations
•  A/B Testing
How do we collect this
data?
The three pillars of our Data Infrastructure:
•  Kafka: Collection
•  Hadoop: Processing
•  Databases: Analytics/Visualization
This is Dave. Data Engineer at
Spotify by day…
…chiptune DJ Demoscene Time
Machine by night.
Let’s listen to Dave’s song
Kafka
•  High volume pub-sub
system
•  “Producers publish messages to
Kafka topics, and consumers
subscribe to these topics and
consume the messages.”
Kafka
•  Robust and scalable solution for collection of logs
•  Fast data transfer
•  Low CPU overhead
•  Built-in partitioning, replication, and fault-tolerance
•  Consumers can pull data at different rates
•  Able to handle extremely high volumes
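The pub-sub pattern above can be sketched with a toy in-process broker. This is illustrative only: real Kafka is a distributed, partitioned, replicated commit log, and the `Broker`, `publish`, and `consume` names here are hypothetical, not the Kafka API. The key idea shown is that each consumer group keeps its own offset, so consumers can pull at different rates:

```python
from collections import defaultdict

# Toy in-process model of the pub-sub pattern (not the real Kafka API).
class Broker:
    def __init__(self):
        self.topics = defaultdict(list)   # topic -> append-only message log
        self.offsets = defaultdict(int)   # (group, topic) -> next offset to read

    def publish(self, topic, message):
        """Producers append messages to a topic's log."""
        self.topics[topic].append(message)

    def consume(self, group, topic, max_messages=10):
        """Each consumer group tracks its own offset, so groups
        can pull at different rates without affecting each other."""
        log = self.topics[topic]
        start = self.offsets[(group, topic)]
        batch = log[start:start + max_messages]
        self.offsets[(group, topic)] = start + len(batch)
        return batch

broker = Broker()
broker.publish("plays", {"user": 1, "track": "t1"})
broker.publish("plays", {"user": 2, "track": "t2"})

fast = broker.consume("reporting", "plays")                  # reads both messages
slow = broker.consume("analytics", "plays", max_messages=1)  # reads only the first
```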
Other people listened too!
Hadoop
•  Process and store massive amounts of unstructured data
across a distributed cluster
•  Grown from one cluster of 37 nodes to 690 nodes today
•  28 PB of storage
•  The largest Hadoop cluster in Europe
Hadoop
•  Entering the land of optimizations
•  Data retention policy
•  Move to JVM-based languages
•  MapReduce languages
•  Moving to Crunch, JVM-based, for speed and scalability
•  Python with Hadoop Streaming, Java, Hive, Pig, Scala
•  Sprunch: Crunch wrapper for Scala, open sourced by Spotify
•  Spotify open-sourced scheduler, Luigi, written in Python
•  Simple and easy way to chain jobs
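The MapReduce style used with Hadoop Streaming can be sketched locally: the mapper and reducer are plain functions over records, and Hadoop's shuffle/sort between them is simulated here by an in-memory sort. The play-log lines and their field layout are made up for illustration:

```python
import itertools

# Mapper: one input line -> (key, count) pairs.
def mapper(line):
    user, track = line.split("\t")
    yield track, 1                 # emit one count per play

# Reducer: one key plus all its counts -> (key, total).
def reducer(track, counts):
    yield track, sum(counts)

log_lines = ["u1\ttrackA", "u2\ttrackA", "u1\ttrackB"]

# Map phase.
mapped = [kv for line in log_lines for kv in mapper(line)]
# Shuffle/sort phase (Hadoop does this between map and reduce).
mapped.sort(key=lambda kv: kv[0])
# Reduce phase: group by key, feed each group's values to the reducer.
results = dict(
    out
    for key, group in itertools.groupby(mapped, key=lambda kv: kv[0])
    for out in reducer(key, (v for _, v in group))
)
# results == {"trackA": 2, "trackB": 1}
```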
What if we want to know more?
Databases
•  Aggregates from Hadoop put into PostgreSQL or Cassandra via Sqoop
•  Core data can be used and manipulated for various needs
•  Ad hoc queries
•  Dashboards
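The pattern of landing Hadoop aggregates in a relational store for ad hoc queries can be sketched with SQLite standing in for PostgreSQL (in practice the bulk transfer is done with Sqoop, and the `daily_plays` table and its numbers here are made up for illustration):

```python
import sqlite3

# In-memory SQLite as a stand-in for the PostgreSQL target.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_plays (day TEXT, country TEXT, plays INTEGER)")

# Hypothetical daily aggregates produced by a Hadoop job.
aggregates = [
    ("2014-06-09", "SE", 1200),
    ("2014-06-09", "US", 3400),
    ("2014-06-10", "SE", 1500),
]
conn.executemany("INSERT INTO daily_plays VALUES (?, ?, ?)", aggregates)

# An ad hoc query a dashboard might run: total plays per country.
rows = conn.execute(
    "SELECT country, SUM(plays) FROM daily_plays "
    "GROUP BY country ORDER BY country"
).fetchall()
# rows == [("SE", 2700), ("US", 3400)]
```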
Questions?
A/B testing questions? Find me!
Thank you!
