SlideShare a Scribd company logo
1 of 56
Download to read offline
The Evolution of
Hadoop at Spotify
Through Failures and Pain
Josh Baer (jbx@spotify.com)
Rafal Wojdyla (rav@spotify.com) 1
Note: Our views are our own and don't necessarily represent those of Spotify.
2
• Growing Pains (2009-2012)
• Gaining Focus (2013 - 2014)
• The Future (2015+)
Overview
3
Our First Major
Hadoop Bug
4
Cluster 1.0
What is Spotify?
• Music Streaming Service
• Browse and Discover Millions of
Songs, Artists and Albums
• Launched in October 2008
• December 2014:
• 60 Million Monthly Users
• 15 Million Paid Subscribers
5
What is Spotify?
• Data Infrastructure:
• 1300 Hadoop Nodes
• 42 PB Storage
• 20 TB data ingested via Kafka/day
• 200 TB generated by Hadoop/day
6
7
select artist_id, count(1)
from user_activities
where play_seconds > 30
group by artist_id;
7
select artist_id, count(1)
from user_activities
where play_seconds > 30
group by artist_id;
7
0	
  *	
  *	
  *	
  *	
  	
  	
  	
  spotify-­‐core	
  	
  	
  	
  	
  	
  hadoop	
  jar	
  hourly_import.jar	
  
15	
  *	
  *	
  *	
  *	
  	
  	
  spotify-­‐core	
  	
  	
  	
  	
  	
  hadoop	
  jar	
  hourly_listeners.jar	
  
30	
  *	
  *	
  *	
  *	
  	
  	
  spotify-­‐analytics	
  hadoop	
  jar	
  user_funnel_hourly.jar	
  
*	
  1	
  *	
  *	
  *	
  	
  	
  	
  spotify-­‐core	
  	
  	
  	
  	
  	
  hadoop	
  jar	
  daily_aggregate.jar	
  
*	
  2	
  *	
  *	
  *	
  	
  	
  	
  spotify-­‐core	
  	
  	
  	
  	
  	
  hadoop	
  jar	
  calculate_royalties.jar	
  
*/2	
  22	
  *	
  *	
  *	
  spotify-­‐radio	
  	
  	
  	
  	
  hadoop	
  jar	
  generate_radio.jar	
  
8
0	
  *	
  *	
  *	
  *	
  	
  	
  	
  spotify-­‐core	
  	
  	
  	
  	
  	
  hadoop	
  jar	
  hourly_import.jar	
  
15	
  *	
  *	
  *	
  *	
  	
  	
  spotify-­‐core	
  	
  	
  	
  	
  	
  hadoop	
  jar	
  hourly_listeners.jar	
  
30	
  *	
  *	
  *	
  *	
  	
  	
  spotify-­‐analytics	
  hadoop	
  jar	
  user_funnel_hourly.jar	
  
*	
  1	
  *	
  *	
  *	
  	
  	
  	
  spotify-­‐core	
  	
  	
  	
  	
  	
  hadoop	
  jar	
  daily_aggregate.jar	
  
*	
  2	
  *	
  *	
  *	
  	
  	
  	
  spotify-­‐core	
  	
  	
  	
  	
  	
  hadoop	
  jar	
  calculate_royalties.jar	
  
*/2	
  22	
  *	
  *	
  *	
  spotify-­‐radio	
  	
  	
  	
  	
  hadoop	
  jar	
  generate_radio.jar	
  
8
9
Handles the ‘plumbing’ for Hadoop jobs
https://github.com/spotify/luigi
10
10
1111
To the Cloud!
1111
To the Cloud!
1111
12
#	
  sudo	
  addgroup	
  hadoop	
  
#	
  sudo	
  adduser	
  —ingroup	
  hadoop	
  hdfs	
  
#	
  sudo	
  adduser	
  —ingroup	
  hadoop	
  yarn	
  
#	
  cp	
  /tmp/configs/*.xml	
  /etc/hadoop/conf/	
  
#	
  apt-­‐get	
  update	
  
…	
  
[hdfs@sj-­‐hadoop-­‐b20	
  ~]	
  $	
  apt-­‐get	
  install	
  hadoop-­‐hdfs-­‐datanode	
  
…	
  
[yarn@sj-­‐hadoop-­‐b20	
  ~]	
  $	
  apt-­‐get	
  install	
  hadoop-­‐yarn-­‐nodemanager	
  
12
#	
  sudo	
  addgroup	
  hadoop	
  
#	
  sudo	
  adduser	
  —ingroup	
  hadoop	
  hdfs	
  
#	
  sudo	
  adduser	
  —ingroup	
  hadoop	
  yarn	
  
#	
  cp	
  /tmp/configs/*.xml	
  /etc/hadoop/conf/	
  
#	
  apt-­‐get	
  update	
  
…	
  
[hdfs@sj-­‐hadoop-­‐b20	
  ~]	
  $	
  apt-­‐get	
  install	
  hadoop-­‐hdfs-­‐datanode	
  
…	
  
[yarn@sj-­‐hadoop-­‐b20	
  ~]	
  $	
  apt-­‐get	
  install	
  hadoop-­‐yarn-­‐nodemanager	
  
13
Automated
Config Management
(via Puppet)
14
[data-­‐sci@sj-­‐edge-­‐a1	
  ~]	
  $	
  hdfs	
  dfs	
  -­‐ls	
  /data	
  
Found	
  3	
  items	
  
drwxr-­‐xr-­‐x	
  	
  	
  -­‐	
  hdfs	
  hadoop	
  	
  	
  	
  	
  0	
  2015-­‐01-­‐01	
  12:00	
  lake	
  
drwxr-­‐xr-­‐x	
  	
  	
  -­‐	
  hdfs	
  hadoop	
  	
  	
  	
  	
  0	
  2015-­‐01-­‐01	
  12:00	
  pond	
  
drwxr-­‐xr-­‐x	
  	
  	
  -­‐	
  hdfs	
  hadoop	
  	
  	
  	
  	
  0	
  2015-­‐01-­‐01	
  12:00	
  ocean	
  
[data-­‐sci@sj-­‐edge-­‐a1	
  ~]	
  $	
  hdfs	
  dfs	
  -­‐ls	
  /data/lake	
  
Found	
  1	
  items	
  
drwxr-­‐xr-­‐x	
  	
  	
  -­‐	
  hdfs	
  hadoop	
  	
  	
  	
  	
  1321451	
  2015-­‐01-­‐01	
  12:00	
  boats.txt	
  
[data-­‐sci@sj-­‐edge-­‐a1	
  ~]	
  $	
  hdfs	
  dfs	
  -­‐cat	
  /data/lake/boats.txt	
  
…
14
[data-­‐sci@sj-­‐edge-­‐a1	
  ~]	
  $	
  hdfs	
  dfs	
  -­‐ls	
  /data	
  
Found	
  3	
  items	
  
drwxr-­‐xr-­‐x	
  	
  	
  -­‐	
  hdfs	
  hadoop	
  	
  	
  	
  	
  0	
  2015-­‐01-­‐01	
  12:00	
  lake	
  
drwxr-­‐xr-­‐x	
  	
  	
  -­‐	
  hdfs	
  hadoop	
  	
  	
  	
  	
  0	
  2015-­‐01-­‐01	
  12:00	
  pond	
  
drwxr-­‐xr-­‐x	
  	
  	
  -­‐	
  hdfs	
  hadoop	
  	
  	
  	
  	
  0	
  2015-­‐01-­‐01	
  12:00	
  ocean	
  
[data-­‐sci@sj-­‐edge-­‐a1	
  ~]	
  $	
  hdfs	
  dfs	
  -­‐ls	
  /data/lake	
  
Found	
  1	
  items	
  
drwxr-­‐xr-­‐x	
  	
  	
  -­‐	
  hdfs	
  hadoop	
  	
  	
  	
  	
  1321451	
  2015-­‐01-­‐01	
  12:00	
  boats.txt	
  
[data-­‐sci@sj-­‐edge-­‐a1	
  ~]	
  $	
  hdfs	
  dfs	
  -­‐cat	
  /data/lake/boats.txt	
  
…
15
$	
  time	
  for	
  i	
  in	
  {1..100};	
  do	
  hadoop	
  fs	
  -­‐ls	
  /	
  >	
  /dev/null;	
  done	
  
real	
   3m32.014s	
  
user	
   6m15.891s	
  
sys	
  	
  	
  	
  0m18.821s	
  
$	
  time	
  for	
  i	
  in	
  {1..100};	
  do	
  snakebite	
  ls	
  /	
  >	
  /dev/null;	
  done	
  
real	
   0m34.760s	
  
user	
   0m29.962s	
  
sys	
  	
  	
  	
  0m4.512s	
  
16
17
Gaining Focus
(2013-2014)
18
• In 2013, expanded to 200 nodes
• Hadoop critical
• Needed a team totally focused on it
• Created a ‘squad’ with two missions:
• Migrate to a new distribution with Yarn
• Make Hadoop reliable
Forming a team
19
19
Hadoop ownerless
19
Hadoop ownerless
Squad
19
Hadoop ownerless Upgrades
Squad
19
Hadoop ownerless Upgrades Getting there
Squad
20
• Alert on service level problems (i.e. no jobs running)
• Keep your alarm channel clean. Beware of alert fatigue.
Alerting
21
Uhh ohh…..
I think I made a mistake
[data-­‐sci@sj-­‐edge-­‐a1	
  ~]	
  $	
  snakebite	
  rm	
  -­‐R	
  /team/disco/	
  CF/test-­‐10/	
  
22
Goodbye Data (1PB)
[data-­‐sci@sj-­‐edge-­‐a1	
  ~]	
  $	
  snakebite	
  rm	
  -­‐R	
  /team/disco/	
  CF/test-­‐10/	
  
22
OK:	
  Deleted	
  /team/disco	
  
23
• “Sit on your hands before you type” - Wouter de Bie
• Users will always want to retain data!
• Remove superusers from ‘edgenodes’
• Moving to trash = client-side implementation
Lessons Learned
24
The Wild Wild West
25
• Same hardware profile as production cluster
• Similar configuration
• Staging environment
• Reliable
Pre-Production Cluster
26
Automated Testing
27
28
Moving Data
29
• Features:
• Data discovery
• Lineage
• Lifecycle management
• More
• We use it for data movement
• Uses Oozie behind the scenes
Apache Falcon
• Most of our jobs were Hadoop (python) Streaming
• Lots of failures, slow performance
• Had to find a better way….
30
Improving Performance
31
• Investigated several frameworks
• Selected Crunch:
• Real types!
• Higher level API
• Easier to test
• Better performance #JVM_FTW
*Dave Whiting’s analysis of systems: http://thewit.ch/scalding_crunchy_pig
Improving Performance
32
33
Let’s Review
34
The Future
(2015+)
35
Note: Spotify users is based on publicly released numbers only
36
Explosive Growth
• Increased Spotify Users
• Increased use cases
• Increased Engineers
37
http://everynoise.com/engenremap.html
38
Scaling machines: easy
Scaling people: hard
39
User Feedback:
Automate IT!
40
Data Management
41
• Data-discovery tool
• Luigi Integration
• Find and browse datasets
• View schemas
• Trace lineage
• Open-source plans? :-(
Raynor
42
Two Takeaways
• Automate Everything
• More time to play FIFA build cool tools
• Listen to your users
• Fail fast, don’t be afraid to scrap work
43
Join the Band!
Engineers wanted in NYC & Stockholm
http://spotify.com/jobs

More Related Content

What's hot

Scala Data Pipelines for Music Recommendations
Scala Data Pipelines for Music RecommendationsScala Data Pipelines for Music Recommendations
Scala Data Pipelines for Music RecommendationsChris Johnson
 
Music Personalization At Spotify
Music Personalization At SpotifyMusic Personalization At Spotify
Music Personalization At SpotifyVidhya Murali
 
Collaborative Filtering at Spotify
Collaborative Filtering at SpotifyCollaborative Filtering at Spotify
Collaborative Filtering at SpotifyErik Bernhardsson
 
Algorithmic Music Recommendations at Spotify
Algorithmic Music Recommendations at SpotifyAlgorithmic Music Recommendations at Spotify
Algorithmic Music Recommendations at SpotifyChris Johnson
 
Music Recommendations at Scale with Spark
Music Recommendations at Scale with SparkMusic Recommendations at Scale with Spark
Music Recommendations at Scale with SparkChris Johnson
 
Interactive real-time dashboards on data streams using Kafka, Druid, and Supe...
Interactive real-time dashboards on data streams using Kafka, Druid, and Supe...Interactive real-time dashboards on data streams using Kafka, Druid, and Supe...
Interactive real-time dashboards on data streams using Kafka, Druid, and Supe...DataWorks Summit
 
Scala Data Pipelines @ Spotify
Scala Data Pipelines @ SpotifyScala Data Pipelines @ Spotify
Scala Data Pipelines @ SpotifyNeville Li
 
Talk on Spotify: Large Scale, Low Latency, P2P Music-on-Demand Streaming
Talk on Spotify: Large Scale, Low Latency, P2P Music-on-Demand StreamingTalk on Spotify: Large Scale, Low Latency, P2P Music-on-Demand Streaming
Talk on Spotify: Large Scale, Low Latency, P2P Music-on-Demand StreamingSameera Horawalavithana
 
Spotify: Horizontal Scalability for Great Success
Spotify: Horizontal Scalability for Great SuccessSpotify: Horizontal Scalability for Great Success
Spotify: Horizontal Scalability for Great SuccessNick Barkas
 
Music Personalization : Real time Platforms.
Music Personalization : Real time Platforms.Music Personalization : Real time Platforms.
Music Personalization : Real time Platforms.Esh Vckay
 
Machine learning @ Spotify - Madison Big Data Meetup
Machine learning @ Spotify - Madison Big Data MeetupMachine learning @ Spotify - Madison Big Data Meetup
Machine learning @ Spotify - Madison Big Data MeetupAndy Sloane
 
From Idea to Execution: Spotify's Discover Weekly
From Idea to Execution: Spotify's Discover WeeklyFrom Idea to Execution: Spotify's Discover Weekly
From Idea to Execution: Spotify's Discover WeeklyChris Johnson
 
Spotify architecture - Pressing play
Spotify architecture - Pressing playSpotify architecture - Pressing play
Spotify architecture - Pressing playNiklas Gustavsson
 
ML+Hadoop at NYC Predictive Analytics
ML+Hadoop at NYC Predictive AnalyticsML+Hadoop at NYC Predictive Analytics
ML+Hadoop at NYC Predictive AnalyticsErik Bernhardsson
 
Music Recommendation 2018
Music Recommendation 2018Music Recommendation 2018
Music Recommendation 2018Fabien Gouyon
 
CF Models for Music Recommendations At Spotify
CF Models for Music Recommendations At SpotifyCF Models for Music Recommendations At Spotify
CF Models for Music Recommendations At SpotifyVidhya Murali
 
Spotify Discover Weekly: The machine learning behind your music recommendations
Spotify Discover Weekly: The machine learning behind your music recommendationsSpotify Discover Weekly: The machine learning behind your music recommendations
Spotify Discover Weekly: The machine learning behind your music recommendationsSophia Ciocca
 
(BDT318) How Netflix Handles Up To 8 Million Events Per Second
(BDT318) How Netflix Handles Up To 8 Million Events Per Second(BDT318) How Netflix Handles Up To 8 Million Events Per Second
(BDT318) How Netflix Handles Up To 8 Million Events Per SecondAmazon Web Services
 

What's hot (20)

Scala Data Pipelines for Music Recommendations
Scala Data Pipelines for Music RecommendationsScala Data Pipelines for Music Recommendations
Scala Data Pipelines for Music Recommendations
 
Music Personalization At Spotify
Music Personalization At SpotifyMusic Personalization At Spotify
Music Personalization At Spotify
 
Collaborative Filtering at Spotify
Collaborative Filtering at SpotifyCollaborative Filtering at Spotify
Collaborative Filtering at Spotify
 
Algorithmic Music Recommendations at Spotify
Algorithmic Music Recommendations at SpotifyAlgorithmic Music Recommendations at Spotify
Algorithmic Music Recommendations at Spotify
 
Music Recommendations at Scale with Spark
Music Recommendations at Scale with SparkMusic Recommendations at Scale with Spark
Music Recommendations at Scale with Spark
 
Interactive real-time dashboards on data streams using Kafka, Druid, and Supe...
Interactive real-time dashboards on data streams using Kafka, Druid, and Supe...Interactive real-time dashboards on data streams using Kafka, Druid, and Supe...
Interactive real-time dashboards on data streams using Kafka, Druid, and Supe...
 
Spotify: P2P music streaming
Spotify: P2P music streamingSpotify: P2P music streaming
Spotify: P2P music streaming
 
Search @ Spotify
Search @ Spotify Search @ Spotify
Search @ Spotify
 
Scala Data Pipelines @ Spotify
Scala Data Pipelines @ SpotifyScala Data Pipelines @ Spotify
Scala Data Pipelines @ Spotify
 
Talk on Spotify: Large Scale, Low Latency, P2P Music-on-Demand Streaming
Talk on Spotify: Large Scale, Low Latency, P2P Music-on-Demand StreamingTalk on Spotify: Large Scale, Low Latency, P2P Music-on-Demand Streaming
Talk on Spotify: Large Scale, Low Latency, P2P Music-on-Demand Streaming
 
Spotify: Horizontal Scalability for Great Success
Spotify: Horizontal Scalability for Great SuccessSpotify: Horizontal Scalability for Great Success
Spotify: Horizontal Scalability for Great Success
 
Music Personalization : Real time Platforms.
Music Personalization : Real time Platforms.Music Personalization : Real time Platforms.
Music Personalization : Real time Platforms.
 
Machine learning @ Spotify - Madison Big Data Meetup
Machine learning @ Spotify - Madison Big Data MeetupMachine learning @ Spotify - Madison Big Data Meetup
Machine learning @ Spotify - Madison Big Data Meetup
 
From Idea to Execution: Spotify's Discover Weekly
From Idea to Execution: Spotify's Discover WeeklyFrom Idea to Execution: Spotify's Discover Weekly
From Idea to Execution: Spotify's Discover Weekly
 
Spotify architecture - Pressing play
Spotify architecture - Pressing playSpotify architecture - Pressing play
Spotify architecture - Pressing play
 
ML+Hadoop at NYC Predictive Analytics
ML+Hadoop at NYC Predictive AnalyticsML+Hadoop at NYC Predictive Analytics
ML+Hadoop at NYC Predictive Analytics
 
Music Recommendation 2018
Music Recommendation 2018Music Recommendation 2018
Music Recommendation 2018
 
CF Models for Music Recommendations At Spotify
CF Models for Music Recommendations At SpotifyCF Models for Music Recommendations At Spotify
CF Models for Music Recommendations At Spotify
 
Spotify Discover Weekly: The machine learning behind your music recommendations
Spotify Discover Weekly: The machine learning behind your music recommendationsSpotify Discover Weekly: The machine learning behind your music recommendations
Spotify Discover Weekly: The machine learning behind your music recommendations
 
(BDT318) How Netflix Handles Up To 8 Million Events Per Second
(BDT318) How Netflix Handles Up To 8 Million Events Per Second(BDT318) How Netflix Handles Up To 8 Million Events Per Second
(BDT318) How Netflix Handles Up To 8 Million Events Per Second
 

Viewers also liked

Moving into movies - using video in E-Learning
Moving into movies - using video in E-Learning Moving into movies - using video in E-Learning
Moving into movies - using video in E-Learning Aurion Learning
 
We’ve created a monster! Truth and fiction in SOA
We’ve created a monster! Truth and fiction in SOAWe’ve created a monster! Truth and fiction in SOA
We’ve created a monster! Truth and fiction in SOAJon Collins
 
A thousand fronts: on the architectures I like
A thousand fronts: on the architectures I likeA thousand fronts: on the architectures I like
A thousand fronts: on the architectures I likeDavide Tommaso Ferrando
 
360i Idea Safari: The Hunt of the Mysterious BIG IDEA (Presented at Cannes 2012)
360i Idea Safari: The Hunt of the Mysterious BIG IDEA (Presented at Cannes 2012)360i Idea Safari: The Hunt of the Mysterious BIG IDEA (Presented at Cannes 2012)
360i Idea Safari: The Hunt of the Mysterious BIG IDEA (Presented at Cannes 2012)360i
 
4.4 mb portfolio print 2012-2016
4.4 mb portfolio print 2012-20164.4 mb portfolio print 2012-2016
4.4 mb portfolio print 2012-2016Meghan Garnett
 
Story, Sci-Fi & Transmedia to develop Corporate Technology Strategies.
Story, Sci-Fi & Transmedia to develop Corporate Technology Strategies.Story, Sci-Fi & Transmedia to develop Corporate Technology Strategies.
Story, Sci-Fi & Transmedia to develop Corporate Technology Strategies.Hubbub Media
 
Big Idea: FIction & Non-Fiction
Big Idea: FIction & Non-FictionBig Idea: FIction & Non-Fiction
Big Idea: FIction & Non-FictionAngela Maiers
 
Spotify's Music Recommendations Lambda Architecture
Spotify's Music Recommendations Lambda ArchitectureSpotify's Music Recommendations Lambda Architecture
Spotify's Music Recommendations Lambda ArchitectureEsh Vckay
 
When Brands Attack
When Brands AttackWhen Brands Attack
When Brands AttackDirk Herbert
 
Nature, Architecture And Music (V M )
Nature, Architecture And Music (V M )Nature, Architecture And Music (V M )
Nature, Architecture And Music (V M )Valeriu Margescu
 
exploring architecture and music
exploring architecture and musicexploring architecture and music
exploring architecture and musicmichielmoyaert
 
Structural Project 1 Group 1
Structural Project 1 Group 1Structural Project 1 Group 1
Structural Project 1 Group 1guestb2749c7
 
But Today We Collect Bullshit: Architecture and Storytelling in the Age of So...
But Today We Collect Bullshit: Architecture and Storytelling in the Age of So...But Today We Collect Bullshit: Architecture and Storytelling in the Age of So...
But Today We Collect Bullshit: Architecture and Storytelling in the Age of So...Davide Tommaso Ferrando
 
Architectural structures world Wide
Architectural structures   world WideArchitectural structures   world Wide
Architectural structures world WideSagun Rakibe
 
Notaable Buildings Around The World
Notaable Buildings Around The WorldNotaable Buildings Around The World
Notaable Buildings Around The WorldFakhrul Afifi
 

Viewers also liked (20)

Postmodernism
PostmodernismPostmodernism
Postmodernism
 
Barnes and Noble
Barnes and NobleBarnes and Noble
Barnes and Noble
 
Moving into movies - using video in E-Learning
Moving into movies - using video in E-Learning Moving into movies - using video in E-Learning
Moving into movies - using video in E-Learning
 
We’ve created a monster! Truth and fiction in SOA
We’ve created a monster! Truth and fiction in SOAWe’ve created a monster! Truth and fiction in SOA
We’ve created a monster! Truth and fiction in SOA
 
A thousand fronts: on the architectures I like
A thousand fronts: on the architectures I likeA thousand fronts: on the architectures I like
A thousand fronts: on the architectures I like
 
360i Idea Safari: The Hunt of the Mysterious BIG IDEA (Presented at Cannes 2012)
360i Idea Safari: The Hunt of the Mysterious BIG IDEA (Presented at Cannes 2012)360i Idea Safari: The Hunt of the Mysterious BIG IDEA (Presented at Cannes 2012)
360i Idea Safari: The Hunt of the Mysterious BIG IDEA (Presented at Cannes 2012)
 
Math music and architecture
Math music and architectureMath music and architecture
Math music and architecture
 
4.4 mb portfolio print 2012-2016
4.4 mb portfolio print 2012-20164.4 mb portfolio print 2012-2016
4.4 mb portfolio print 2012-2016
 
Story, Sci-Fi & Transmedia to develop Corporate Technology Strategies.
Story, Sci-Fi & Transmedia to develop Corporate Technology Strategies.Story, Sci-Fi & Transmedia to develop Corporate Technology Strategies.
Story, Sci-Fi & Transmedia to develop Corporate Technology Strategies.
 
Big Idea: FIction & Non-Fiction
Big Idea: FIction & Non-FictionBig Idea: FIction & Non-Fiction
Big Idea: FIction & Non-Fiction
 
Hamburg 2012
Hamburg 2012Hamburg 2012
Hamburg 2012
 
Spotify's Music Recommendations Lambda Architecture
Spotify's Music Recommendations Lambda ArchitectureSpotify's Music Recommendations Lambda Architecture
Spotify's Music Recommendations Lambda Architecture
 
When Brands Attack
When Brands AttackWhen Brands Attack
When Brands Attack
 
Nature, Architecture And Music (V M )
Nature, Architecture And Music (V M )Nature, Architecture And Music (V M )
Nature, Architecture And Music (V M )
 
exploring architecture and music
exploring architecture and musicexploring architecture and music
exploring architecture and music
 
Structural Project 1 Group 1
Structural Project 1 Group 1Structural Project 1 Group 1
Structural Project 1 Group 1
 
Urban Design
Urban DesignUrban Design
Urban Design
 
But Today We Collect Bullshit: Architecture and Storytelling in the Age of So...
But Today We Collect Bullshit: Architecture and Storytelling in the Age of So...But Today We Collect Bullshit: Architecture and Storytelling in the Age of So...
But Today We Collect Bullshit: Architecture and Storytelling in the Age of So...
 
Architectural structures world Wide
Architectural structures   world WideArchitectural structures   world Wide
Architectural structures world Wide
 
Notaable Buildings Around The World
Notaable Buildings Around The WorldNotaable Buildings Around The World
Notaable Buildings Around The World
 

Similar to The Evolution of Hadoop at Spotify - Through Failures and Pain

Unleash your cluster with YARN
Unleash your cluster with YARNUnleash your cluster with YARN
Unleash your cluster with YARNFerran Galí Reniu
 
Yarn by default (Spark on YARN)
Yarn by default (Spark on YARN)Yarn by default (Spark on YARN)
Yarn by default (Spark on YARN)Ferran Galí Reniu
 
EclipseCon Keynote: Apache Hadoop - An Introduction
EclipseCon Keynote: Apache Hadoop - An IntroductionEclipseCon Keynote: Apache Hadoop - An Introduction
EclipseCon Keynote: Apache Hadoop - An IntroductionCloudera, Inc.
 
Hecuba2: Cassandra Operations Made Easy (Radovan Zvoncek, Spotify) | C* Summi...
Hecuba2: Cassandra Operations Made Easy (Radovan Zvoncek, Spotify) | C* Summi...Hecuba2: Cassandra Operations Made Easy (Radovan Zvoncek, Spotify) | C* Summi...
Hecuba2: Cassandra Operations Made Easy (Radovan Zvoncek, Spotify) | C* Summi...DataStax
 
Getting started with R & Hadoop
Getting started with R & HadoopGetting started with R & Hadoop
Getting started with R & HadoopJeffrey Breen
 
Amebaサービスのログ解析基盤
Amebaサービスのログ解析基盤Amebaサービスのログ解析基盤
Amebaサービスのログ解析基盤Toshihiro Suzuki
 
GOTO 2011 preso: 3x Hadoop
GOTO 2011 preso: 3x HadoopGOTO 2011 preso: 3x Hadoop
GOTO 2011 preso: 3x Hadoopfvanvollenhoven
 
TriHUG Feb: Hive on spark
TriHUG Feb: Hive on sparkTriHUG Feb: Hive on spark
TriHUG Feb: Hive on sparktrihug
 
Hadoop installation with an example
Hadoop installation with an exampleHadoop installation with an example
Hadoop installation with an exampleNikita Kesharwani
 
Hadoop Operations Powered By ... Hadoop (Hadoop Summit 2014 Amsterdam)
Hadoop Operations Powered By ... Hadoop (Hadoop Summit 2014 Amsterdam)Hadoop Operations Powered By ... Hadoop (Hadoop Summit 2014 Amsterdam)
Hadoop Operations Powered By ... Hadoop (Hadoop Summit 2014 Amsterdam)Adam Kawa
 
RMLL 2013 : Build Your Personal Search Engine using Crawlzilla
RMLL 2013 : Build Your Personal Search Engine using CrawlzillaRMLL 2013 : Build Your Personal Search Engine using Crawlzilla
RMLL 2013 : Build Your Personal Search Engine using CrawlzillaJazz Yao-Tsung Wang
 
Druid at Strata Conf NY 2016.pdf
Druid at Strata Conf NY 2016.pdfDruid at Strata Conf NY 2016.pdf
Druid at Strata Conf NY 2016.pdfHimanshuGupta936
 
Why Managed Service Providers Should Embrace Container Technology
Why Managed Service Providers Should Embrace Container TechnologyWhy Managed Service Providers Should Embrace Container Technology
Why Managed Service Providers Should Embrace Container TechnologySagi Brody
 

Similar to The Evolution of Hadoop at Spotify - Through Failures and Pain (20)

Unleash your cluster with YARN
Unleash your cluster with YARNUnleash your cluster with YARN
Unleash your cluster with YARN
 
Yarn by default (Spark on YARN)
Yarn by default (Spark on YARN)Yarn by default (Spark on YARN)
Yarn by default (Spark on YARN)
 
EclipseCon Keynote: Apache Hadoop - An Introduction
EclipseCon Keynote: Apache Hadoop - An IntroductionEclipseCon Keynote: Apache Hadoop - An Introduction
EclipseCon Keynote: Apache Hadoop - An Introduction
 
Hecuba2: Cassandra Operations Made Easy (Radovan Zvoncek, Spotify) | C* Summi...
Hecuba2: Cassandra Operations Made Easy (Radovan Zvoncek, Spotify) | C* Summi...Hecuba2: Cassandra Operations Made Easy (Radovan Zvoncek, Spotify) | C* Summi...
Hecuba2: Cassandra Operations Made Easy (Radovan Zvoncek, Spotify) | C* Summi...
 
Running R on Hadoop - CHUG - 20120815
Running R on Hadoop - CHUG - 20120815Running R on Hadoop - CHUG - 20120815
Running R on Hadoop - CHUG - 20120815
 
Getting started with R & Hadoop
Getting started with R & HadoopGetting started with R & Hadoop
Getting started with R & Hadoop
 
Hadoop bootcamp getting started
Hadoop bootcamp getting startedHadoop bootcamp getting started
Hadoop bootcamp getting started
 
Amebaサービスのログ解析基盤
Amebaサービスのログ解析基盤Amebaサービスのログ解析基盤
Amebaサービスのログ解析基盤
 
GOTO 2011 preso: 3x Hadoop
GOTO 2011 preso: 3x HadoopGOTO 2011 preso: 3x Hadoop
GOTO 2011 preso: 3x Hadoop
 
TriHUG Feb: Hive on spark
TriHUG Feb: Hive on sparkTriHUG Feb: Hive on spark
TriHUG Feb: Hive on spark
 
Hadoop installation with an example
Hadoop installation with an exampleHadoop installation with an example
Hadoop installation with an example
 
20080529dublinpt1
20080529dublinpt120080529dublinpt1
20080529dublinpt1
 
Backups
BackupsBackups
Backups
 
Big Data
Big DataBig Data
Big Data
 
Hadoop Operations Powered By ... Hadoop (Hadoop Summit 2014 Amsterdam)
Hadoop Operations Powered By ... Hadoop (Hadoop Summit 2014 Amsterdam)Hadoop Operations Powered By ... Hadoop (Hadoop Summit 2014 Amsterdam)
Hadoop Operations Powered By ... Hadoop (Hadoop Summit 2014 Amsterdam)
 
RMLL 2013 : Build Your Personal Search Engine using Crawlzilla
RMLL 2013 : Build Your Personal Search Engine using CrawlzillaRMLL 2013 : Build Your Personal Search Engine using Crawlzilla
RMLL 2013 : Build Your Personal Search Engine using Crawlzilla
 
Druid at Strata Conf NY 2016.pdf
Druid at Strata Conf NY 2016.pdfDruid at Strata Conf NY 2016.pdf
Druid at Strata Conf NY 2016.pdf
 
#WeSpeakLinux Session
#WeSpeakLinux Session#WeSpeakLinux Session
#WeSpeakLinux Session
 
Apache hadoop
Apache hadoopApache hadoop
Apache hadoop
 
Why Managed Service Providers Should Embrace Container Technology
Why Managed Service Providers Should Embrace Container TechnologyWhy Managed Service Providers Should Embrace Container Technology
Why Managed Service Providers Should Embrace Container Technology
 

Recently uploaded

Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptxAmazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptxAbdelrhman abooda
 
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degreeyuu sss
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxStephen266013
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一F sss
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)jennyeacort
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degreeyuu sss
 
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfBoston Institute of Analytics
 
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...soniya singh
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDRafezzaman
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queensdataanalyticsqueen03
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsappssapnasaifi408
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfLars Albertsson
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...dajasot375
 
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls DubaiDubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls Dubaihf8803863
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Cantervoginip
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfJohn Sterrett
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdfHuman37
 
办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一
办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一
办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一F La
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFAAndrei Kaleshka
 

Recently uploaded (20)

Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptxAmazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
 
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docx
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
 
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
 
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queens
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
 
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdf
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
 
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls DubaiDubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Canter
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdf
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf
 
办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一
办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一
办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFA
 

The Evolution of Hadoop at Spotify - Through Failures and Pain

  • 1. The Evolution of Hadoop at Spotify Through Failures and Pain Josh Baer (jbx@spotify.com) Rafal Wojdyla (rav@spotify.com) 1 Note: Our views are our own and don't necessarily represent those of Spotify.
  • 2. 2 • Growing Pains (2009-2012) • Gaining Focus (2013 - 2014) • The Future (2015+) Overview
  • 5. What is Spotify? • Music Streaming Service • Browse and Discover Millions of Songs, Artists and Albums • Launched in October 2008 • December 2014: • 60 Million Monthly Users • 15 Million Paid Subscribers 5
  • 6. What is Spotify? • Data Infrastructure: • 1300 Hadoop Nodes • 42 PB Storage • 20 TB data ingested via Kafka/day • 200 TB generated by Hadoop/day 6
  • 7. 7 select artist_id, count(1) from user_activities where play_seconds > 30 group by artist_id;
  • 8. 7 select artist_id, count(1) from user_activities where play_seconds > 30 group by artist_id;
  • 9. 7
  • 10. 0  *  *  *  *        spotify-­‐core            hadoop  jar  hourly_import.jar   15  *  *  *  *      spotify-­‐core            hadoop  jar  hourly_listeners.jar   30  *  *  *  *      spotify-­‐analytics  hadoop  jar  user_funnel_hourly.jar   *  1  *  *  *        spotify-­‐core            hadoop  jar  daily_aggregate.jar   *  2  *  *  *        spotify-­‐core            hadoop  jar  calculate_royalties.jar   */2  22  *  *  *  spotify-­‐radio          hadoop  jar  generate_radio.jar   8
  • 11. 0  *  *  *  *        spotify-­‐core            hadoop  jar  hourly_import.jar   15  *  *  *  *      spotify-­‐core            hadoop  jar  hourly_listeners.jar   30  *  *  *  *      spotify-­‐analytics  hadoop  jar  user_funnel_hourly.jar   *  1  *  *  *        spotify-­‐core            hadoop  jar  daily_aggregate.jar   *  2  *  *  *        spotify-­‐core            hadoop  jar  calculate_royalties.jar   */2  22  *  *  *  spotify-­‐radio          hadoop  jar  generate_radio.jar   8
  • 12. 9 Handles the ‘plumbing’ for Hadoop jobs https://github.com/spotify/luigi
  • 13. 10
  • 14. 10
  • 15. 1111
  • 18. 12 #  sudo  addgroup  hadoop   #  sudo  adduser  —ingroup  hadoop  hdfs   #  sudo  adduser  —ingroup  hadoop  yarn   #  cp  /tmp/configs/*.xml  /etc/hadoop/conf/   #  apt-­‐get  update   …   [hdfs@sj-­‐hadoop-­‐b20  ~]  $  apt-­‐get  install  hadoop-­‐hdfs-­‐datanode   …   [yarn@sj-­‐hadoop-­‐b20  ~]  $  apt-­‐get  install  hadoop-­‐yarn-­‐nodemanager  
  • 19. 12 #  sudo  addgroup  hadoop   #  sudo  adduser  —ingroup  hadoop  hdfs   #  sudo  adduser  —ingroup  hadoop  yarn   #  cp  /tmp/configs/*.xml  /etc/hadoop/conf/   #  apt-­‐get  update   …   [hdfs@sj-­‐hadoop-­‐b20  ~]  $  apt-­‐get  install  hadoop-­‐hdfs-­‐datanode   …   [yarn@sj-­‐hadoop-­‐b20  ~]  $  apt-­‐get  install  hadoop-­‐yarn-­‐nodemanager  
  • 21. 14 [data-­‐sci@sj-­‐edge-­‐a1  ~]  $  hdfs  dfs  -­‐ls  /data   Found  3  items   drwxr-­‐xr-­‐x      -­‐  hdfs  hadoop          0  2015-­‐01-­‐01  12:00  lake   drwxr-­‐xr-­‐x      -­‐  hdfs  hadoop          0  2015-­‐01-­‐01  12:00  pond   drwxr-­‐xr-­‐x      -­‐  hdfs  hadoop          0  2015-­‐01-­‐01  12:00  ocean   [data-­‐sci@sj-­‐edge-­‐a1  ~]  $  hdfs  dfs  -­‐ls  /data/lake   Found  1  items   drwxr-­‐xr-­‐x      -­‐  hdfs  hadoop          1321451  2015-­‐01-­‐01  12:00  boats.txt   [data-­‐sci@sj-­‐edge-­‐a1  ~]  $  hdfs  dfs  -­‐cat  /data/lake/boats.txt   …
  • 22. 14 [data-­‐sci@sj-­‐edge-­‐a1  ~]  $  hdfs  dfs  -­‐ls  /data   Found  3  items   drwxr-­‐xr-­‐x      -­‐  hdfs  hadoop          0  2015-­‐01-­‐01  12:00  lake   drwxr-­‐xr-­‐x      -­‐  hdfs  hadoop          0  2015-­‐01-­‐01  12:00  pond   drwxr-­‐xr-­‐x      -­‐  hdfs  hadoop          0  2015-­‐01-­‐01  12:00  ocean   [data-­‐sci@sj-­‐edge-­‐a1  ~]  $  hdfs  dfs  -­‐ls  /data/lake   Found  1  items   drwxr-­‐xr-­‐x      -­‐  hdfs  hadoop          1321451  2015-­‐01-­‐01  12:00  boats.txt   [data-­‐sci@sj-­‐edge-­‐a1  ~]  $  hdfs  dfs  -­‐cat  /data/lake/boats.txt   …
  • 23. 15
  • 24. $  time  for  i  in  {1..100};  do  hadoop  fs  -­‐ls  /  >  /dev/null;  done   real   3m32.014s   user   6m15.891s   sys        0m18.821s   $  time  for  i  in  {1..100};  do  snakebite  ls  /  >  /dev/null;  done   real   0m34.760s   user   0m29.962s   sys        0m4.512s   16
  • 26. 18 • In 2013, expanded to 200 nodes • Hadoop critical • Needed a team totally focused on it • Created a ‘squad’ with two missions: • Migrate to a new distribution with Yarn • Make Hadoop reliable Forming a team
  • 27. 19
  • 31. 19 Hadoop ownerless Upgrades Getting there Squad
  • 32. 20 • Alert on service level problems (i.e. no jobs running) • Keep your alarm channel clean. Beware of alert fatigue. Alerting
  • 33. 21 Uhh ohh….. I think I made a mistake
  • 34. [data-­‐sci@sj-­‐edge-­‐a1  ~]  $  snakebite  rm  -­‐R  /team/disco/  CF/test-­‐10/   22
  • 35. Goodbye Data (1PB) [data-­‐sci@sj-­‐edge-­‐a1  ~]  $  snakebite  rm  -­‐R  /team/disco/  CF/test-­‐10/   22 OK:  Deleted  /team/disco  
  • 36. 23 • “Sit on your hands before you type” - Wouter de Bie • Users will always want to retain data! • Remove superusers from ‘edgenodes’ • Moving to trash = client-side implementation Lessons Learned
  • 38. 25 • Same hardware profile as production cluster • Similar configuration • Staging environment • Reliable Pre-Production Cluster
  • 40. 27
  • 42. 29 • Features: • Data discovery • Lineage • Lifecycle management • More • We use it for data movement • Uses Oozie behind the scenes Apache Falcon
  • 43. • Most of our jobs were Hadoop (python) Streaming • Lots of failures, slow performance • Had to find a better way…. 30 Improving Performance
  • 44. 31 • Investigated several frameworks • Selected Crunch: • Real types! • Higher level API • Easier to test • Better performance #JVM_FTW *Dave Whiting’s analysis of systems: http://thewit.ch/scalding_crunchy_pig Improving Performance
  • 45. 32
  • 48. 35 Note: Spotify users is based on publicly released numbers only
  • 49. 36 Explosive Growth • Increased Spotify Users • Increased use cases • Increased Engineers
  • 54. 41 • Data-discovery tool • Luigi Integration • Find and browse datasets • View schemas • Trace lineage • Open-source plans? :-( Raynor
  • 55. 42 Two Takeaways • Automate Everything • More time to play FIFA build cool tools • Listen to your users • Fail fast, don’t be afraid to scrap work
  • 56. 43 Join the Band! Engineers wanted in NYC & Stockholm http://spotify.com/jobs