The Evolution of
Hadoop at Spotify
Through Failures and Pain
Josh Baer (jbx@spotify.com)
Rafal Wojdyla (rav@spotify.com)
1
Note: Our views are our own and don't necessarily represent those of Spotify.
2
• Growing Pains (2009-2012)
• Gaining Focus (2013 - 2014)
• The Future (2015+)
Overview
3
Our First Major
Hadoop Bug
4
Cluster 1.0
What is Spotify?
• Music Streaming Service
• Browse and Discover Millions of
Songs, Artists and Albums
• Launched in October 2008
• December 2014:
• 60 Million Monthly Users
• 15 Million Paid Subscribers
5
What is Spotify?
• Data Infrastructure:
• 1300 Hadoop Nodes
• 42 PB Storage
• 20 TB data ingested via Kafka/day
• 200 TB generated by Hadoop/day
6
7
select artist_id, count(1)
from user_activities
where play_seconds > 30
group by artist_id;
0   *  * * *   spotify-core        hadoop jar hourly_import.jar
15  *  * * *   spotify-core        hadoop jar hourly_listeners.jar
30  *  * * *   spotify-analytics   hadoop jar user_funnel_hourly.jar
*   1  * * *   spotify-core        hadoop jar daily_aggregate.jar
*   2  * * *   spotify-core        hadoop jar calculate_royalties.jar
*/2 22 * * *   spotify-radio       hadoop jar generate_radio.jar
8
9
Handles the ‘plumbing’ for Hadoop jobs
https://github.com/spotify/luigi
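To make the "plumbing" concrete, here is a stdlib-only sketch of the core idea a workflow tool like Luigi handles — this is NOT Luigi's API, and the task names are hypothetical; it only illustrates that each task declares its dependencies and runs exactly once, after everything it requires is complete:

```python
# Illustration only: a toy dependency runner, not Luigi itself.
COMPLETED = set()

class Task:
    name = "task"
    def requires(self):
        return []  # upstream Task instances this task depends on
    def complete(self):
        return self.name in COMPLETED
    def run(self):
        COMPLETED.add(self.name)

class HourlyImport(Task):
    name = "hourly_import"

class HourlyListeners(Task):
    name = "hourly_listeners"
    def requires(self):
        return [HourlyImport()]

def build(task, order):
    """Depth-first: satisfy dependencies, then run the task itself."""
    for dep in task.requires():
        if not dep.complete():
            build(dep, order)
    if not task.complete():
        task.run()
        order.append(task.name)

order = []
build(HourlyListeners(), order)
print(order)  # ['hourly_import', 'hourly_listeners']
```

Contrast this with the cron tab: cron fires jobs at fixed times with no knowledge of whether upstream data actually landed, which is exactly the gap Luigi fills.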
10
11
To the Cloud!
12
# sudo addgroup hadoop
# sudo adduser --ingroup hadoop hdfs
# sudo adduser --ingroup hadoop yarn
# cp /tmp/configs/*.xml /etc/hadoop/conf/
# apt-get update
…
[hdfs@sj-hadoop-b20 ~] $ apt-get install hadoop-hdfs-datanode
…
[yarn@sj-hadoop-b20 ~] $ apt-get install hadoop-yarn-nodemanager
13
Automated
Config Management
(via Puppet)
14
[data-sci@sj-edge-a1 ~] $ hdfs dfs -ls /data
Found 3 items
drwxr-xr-x   - hdfs hadoop        0 2015-01-01 12:00 lake
drwxr-xr-x   - hdfs hadoop        0 2015-01-01 12:00 pond
drwxr-xr-x   - hdfs hadoop        0 2015-01-01 12:00 ocean
[data-sci@sj-edge-a1 ~] $ hdfs dfs -ls /data/lake
Found 1 items
drwxr-xr-x   - hdfs hadoop  1321451 2015-01-01 12:00 boats.txt
[data-sci@sj-edge-a1 ~] $ hdfs dfs -cat /data/lake/boats.txt
…
15
$ time for i in {1..100}; do hadoop fs -ls / > /dev/null; done
real    3m32.014s
user    6m15.891s
sys     0m18.821s

$ time for i in {1..100}; do snakebite ls / > /dev/null; done
real    0m34.760s
user    0m29.962s
sys     0m4.512s
16
17
Gaining Focus
(2013-2014)
18
• In 2013, expanded to 200 nodes
• Hadoop critical
• Needed a team totally focused on it
• Created a ‘squad’ with two missions:
• Migrate to a new distribution with Yarn
• Make Hadoop reliable
Forming a team
19
Hadoop ownerless → Squad → Upgrades → Getting there
20
• Alert on service-level problems (e.g. no jobs running)
• Keep your alarm channel clean. Beware of alert fatigue.
Alerting
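A minimal sketch of the first bullet, with hypothetical names: page on the service-level symptom ("no job has completed recently") rather than on every per-node hiccup, which is what fills the alarm channel with noise:

```python
# Hypothetical helper: alert on a service-level symptom instead of
# per-node noise. Thresholds and names are illustrative only.
def should_page(last_job_completion_ts, now, max_silence_s=3600):
    """Page only if no job has completed within the silence window."""
    return (now - last_job_completion_ts) > max_silence_s

print(should_page(0, 7200))     # True: nothing finished for two hours
print(should_page(7000, 7200))  # False: jobs are flowing, stay quiet
```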
21
Uhh ohh…..
I think I made a mistake
[data-sci@sj-edge-a1 ~] $ snakebite rm -R /team/disco/ CF/test-10/
22
Goodbye Data (1PB)
[data-sci@sj-edge-a1 ~] $ snakebite rm -R /team/disco/ CF/test-10/
OK: Deleted /team/disco
23
• “Sit on your hands before you type” - Wouter de Bie
• Users will always want to retain data!
• Remove superusers from ‘edgenodes’
• "Move to trash" is a client-side implementation — alternative clients can bypass it
Lessons Learned
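Because HDFS trash lives in the client (`hadoop fs -rm` moves paths into a trash directory before anything is really gone), a non-JVM client like snakebite could delete a petabyte outright. A stdlib-only sketch of the defensive pattern, using local paths purely for illustration:

```python
import os
import shutil
import tempfile
import time

def safe_delete(path, trash_root):
    """Move `path` into a timestamped trash directory instead of
    removing it -- mirroring the client-side trash behavior that
    `hadoop fs -rm` provides and raw clients can skip."""
    dest = os.path.join(trash_root, str(int(time.time())))
    os.makedirs(dest, exist_ok=True)
    shutil.move(path, dest)
    return os.path.join(dest, os.path.basename(path))

# Demo on a scratch directory (no HDFS involved).
work = tempfile.mkdtemp()
victim = os.path.join(work, "boats.txt")
with open(victim, "w") as f:
    f.write("precious data")
trashed = safe_delete(victim, os.path.join(work, ".Trash"))
print(os.path.exists(victim), os.path.exists(trashed))  # False True
```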
24
The Wild Wild West
25
• Same hardware profile as production cluster
• Similar configuration
• Staging environment
• Reliable
Pre-Production Cluster
26
Automated Testing
27
28
Moving Data
29
• Features:
• Data discovery
• Lineage
• Lifecycle management
• More
• We use it for data movement
• Uses Oozie behind the scenes
Apache Falcon
• Most of our jobs were Hadoop Streaming jobs written in Python
• Lots of failures, slow performance
• Had to find a better way….
30
Improving Performance
31
• Investigated several frameworks
• Selected Crunch:
• Real types!
• Higher level API
• Easier to test
• Better performance #JVM_FTW
*Dave Whiting’s analysis of systems: http://thewit.ch/scalding_crunchy_pig
Improving Performance
32
33
Let’s Review
34
The Future
(2015+)
35
Note: Spotify user counts are based on publicly released numbers only
36
Explosive Growth
• Increased Spotify users
• Increased use cases
• Increased engineers
37
http://everynoise.com/engenremap.html
38
Scaling machines: easy
Scaling people: hard
39
User Feedback:
Automate IT!
40
Data Management
41
• Data-discovery tool
• Luigi Integration
• Find and browse datasets
• View schemas
• Trace lineage
• Open-source plans? :-(
Raynor
42
Two Takeaways
• Automate Everything
• More time to play FIFA build cool tools
• Listen to your users
• Fail fast, don’t be afraid to scrap work
43
Join the Band!
Engineers wanted in NYC & Stockholm
http://spotify.com/jobs
