The Evolution of Big Data at Spotify

TheEvolution
ofBigDataat
Spotify
Josh Baer (jbx@spotify.com)

WhoAm I?
‣ Technical Product Owner at
Spotify, responsible for
Hadoop
@l_phant

Overview
• Creating Music Charts in three parts:
• Playing Music
• Collecting Data
• Processing Data
• The Future

Building Music Charts
Play Music Collect Data Process

What is Spotify?
• Music Streaming Service
• Browse and Discover Millions of
Songs,Artists andAlbums
• Bythe end of 2014
• 60 Million Monthly Users
• 15 Million Paid Subscribers

What is Spotify?
• Data Infrastructure
• 1300 Hadoop Nodes
• 42 PB Storage
• 30TB data ingested via Kafka/day
• 400TB generated by Hadoop/day

Powered by Data
• RunningApp
• Matches music to running tempo
• Personalized running playlists in
multiple tempos for millions of
active users
http://www.theverge.com/2015/6/1/8696659/spotify-running-is-great-for-discovery

Powered by Data
• Now Page
• Shows, podcasts and playlists
based on day-parts
• Personalized layout so you always
have the right music forthe right
moment

Building Music Charts
10.123.133.333 - - [Mon, 3 June 2015 11:31:33 GMT] "GET /api/admin/job/
aggregator/status HTTP/1.1" 200 1847 "https://my.analytics.app/admin"
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_4) AppleWebKit/537.36
(KHTML, like Gecko) Chrome/43.0.2357.81 Safari/537.36"
10.123.133.222 - - [Mon, 3 June 2015 11:31:43 GMT] "GET /api/admin/job/
aggregator/status HTTP/1.1" 200 1984 "https://my.analytics.app/admin"
(KHTML, like Gecko) Chrome/43.0.2357.81 Safari/537.36”
10.123.133.222 - - [Mon, 3 June 2015 11:33:02 GMT] "GET /dashboard/
courses/1291726 HTTP/1.1" 304 - "https://my.analytics.app/admin"
10.321.145.111 - - [Mon, 3 June 2015 11:33:03 GMT] "GET /api/loggedInUser
HTTP/1.1" 304 - "https://my.analytics.app/dashboard/courses/1291726"
10.112.322.111 - - [Mon, 3 June 2015 11:33:03 GMT] "POST /api/
instrumentation/events/new HTTP/1.1" 200 2 "https://my.analytics.app/
dashboard/courses/1291726" "Mozilla/5.0 (Macintosh; Intel Mac OS X
10_10_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.81
Safari/537.36”
10.123.133.222 - - [Mon, 3 June 2015 11:33:02 GMT] "GET /dashboard/
courses/1291726 HTTP/1.1" 304 - "https://my.analytics.app/admin"
• Raw data is complicated
• Often dirty
• Evolving structure
• Duplication all over
• Getting data to a central
processing point is HARD

“It’s simple, we just
throw the data into
Hadoop”
A naive data engineer

LogArchiver
• Original method to transport logs fromAPs to HDFS
• Lasted from 2009 - 2013
• Relies on rsynch/scp to move files around
• Regularly scheduled via cron

LogArchiver Fails
• Worked well with small number ofAPs
• Issues with scale
• Manual Processes of adding new hosts
• Frequent dying of hosts or network issues caused massive congestion
• Manual process of overrides

Apache Kafka to the rescue!
• Apublish-subscribe messaging system open sourced by LinkedIn in 2011
• High level overview:
• Topic: Feeds of messages
• Producer:Amessage publisher
• Consumer :Asubscriber oftopics

Apache Kafka to the rescue!
• Log -> HDFS latency reduced from hours to seconds!
• Benefits:
• Community supported
• Division of responsibilities
• Allowed for enhanced streaming use-cases

Workflow Management Fail!
0
*
*
*
*

spotify-‐core

hadoop
jar
merge_hourly.jar

15
*
*
*
*

spotify-‐core

hadoop
jar
aggregate_song_plays.jar

30
*
*
*
*

spotify-‐analytics
hadoop
jar
merge_artist_song.jar

*
1
*
*
*

spotify-‐core

hadoop
jar
daily_aggregate.jar

*
2
*
*
*

spotify-‐core

hadoop
jar
calculate_royalties.jar

*/2
22
*
*
*
spotify-‐radio

hadoop
jar
generate_radio.jar

Handles the ‘plumbing’ for Hadoop jobs
https://github.com/spotify/luigi
Luigi - Python Workflow Manager
Easy to get started, no xml like Oozie

Hadoop Availability
• In 2013:
• Hadoop expanded to 200 nodes
• It was business critical
• It was not very reliable :-(
• Created a ‘squad’ with two missions:
• Migrate to a new distribution withYarn
• Make Hadoop reliable

How did we do?
HadoopUptime
90%
92%
94%
96%
98%
100%
Q3-2012 Q4-2012 Q1-2013 Q2-2013 Q3-2013 Q4-2013 Q1-2014 Q2-2014 Q3-2014 Q4-2014 Q1-2015 Q2-2015
Hadoop ownerless Dedicated
squad launches
Upgrade
instability
Continually improving

Going from Python to Crunch
• Most of our jobs were Hadoop (python) streaming
• Lots of failures, slow performance
• Had to find a betterway

Moving from Python to Crunch
• Investigated several frameworks*
• Selected Crunch:
• Real types - compile time error detection, bettertestability
• Higher levelAPI - let the framework optimize foryou
• Better performance #JVM_FTW
*thewit.ch/scalding_crunchy_pig

Play Music Collect Data Process
Data driven features that
allows for new ways to
play and discover music
Lower latency and
enhanced reliability for
passing data from Access
Points to HDFS via Kafka
Increased Hadoop
reliability, Luigi scheduling
and better performance
with Crunch
Improving Charts!

Growth%
0
500
1000
1500
2000
2500
3000
3500
2012 2013 2014 2015
Hadoop Usage Spotify Users
Growth of Hadoop vs. Spotify Users

Explosive Growth
• Increased Spotify Users
• More users listening to more music -> more data -> longer running jobs
• Increased Use Cases
• Beyond simple analytics into Machine Learning, advanced processing
• Increased Engineers
• In 2014, growth of data and machine learning engineers grew rapidly

Scaling Machines: Easy
Scaling People: Hard

Inviso
Developed by Netflix: https://github.com/Netflix/inviso

Hadoop Report Card
• Contains Statistics
• Guidelines and Best
Practices
• Sent Quarterly

RealTime Use Cases
• Expanding our use of Storm for:
• TargetingAds based on genres
• Visualizing Data
• Quicker recommendations
• More information:
• https://labs.spotify.com/2015/01/05/how-spotify-scales-apache-storm/

Two takeaways
• Getting data into Hadoop is halfthe challenge.Think early
and often about scale.
• Increasing infrastructure reliability and performance leads to
expanded use.This adds challenges but it’s a good problem
to have.

Join The Band!
Engineers needed in NYC, Stockholm
http://spotify.com/jobs

The Evolution of Big Data at Spotify

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Viewers also liked

Viewers also liked (15)

Similar to The Evolution of Big Data at Spotify

Similar to The Evolution of Big Data at Spotify (20)

Recently uploaded

Recently uploaded (20)

The Evolution of Big Data at Spotify