Spotify: From 1 to 100 Hadoop developers

How Spotify scaled their Hadoop cluster, and the people working on it, from 1 to over 100 developers and from 1 node to over 690 nodes, making it the largest Hadoop cluster in Europe.

Transcript

  • 1. From 1 to 100 developers: scaling for developer productivity at Spotify. @dawhiting, HUG UK @ Strata, 11/11/2013
  • 2. How do I scale? How many nodes? How much data? How many records?
  • 3. How do I scale my development? How many developers? How many teams? How many Hadoop jobs? How much code? (Data Infrastructure, July 2013)
  • 4. A brief history of Hadoop development at Spotify
    • 2008: Spotify launches in Sweden
    • 2009: First Hadoop cluster, for royalties; 2 developers
    • 2010: Up to 37 nodes; BI team formed; 3 devs / 3 analysts
    • 2011: Move to Elastic MapReduce
    • 2012: Back to our own cluster, 60 -> 190 nodes; Infrastructure/Insights/Tools team split
    • 2013: 6 teams just for data infrastructure; ~100 developers using the Hadoop cluster
  • 5. Issues: what could possibly go wrong?
    • Contention for resources
    • Repetition of code, repetition of data
    • Poor code quality / technical debt
    • Disorganised HDFS
    • Data cataloguing
  • 6. Contention for resources: priority and isolation
    • Priority: what is important?
    • Hadoop scheduler: Capacity Scheduler, queue isolation (see the sample configuration after this slide)
    • YARN: resource allocation
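
    A minimal sketch of what queue isolation with the Capacity Scheduler could look like in capacity-scheduler.xml; the queue names and percentages are illustrative assumptions, not Spotify's actual configuration:

        <configuration>
          <property>
            <name>yarn.scheduler.capacity.root.queues</name>
            <value>production,adhoc</value>
          </property>
          <property>
            <!-- Guaranteed share for scheduled production jobs. -->
            <name>yarn.scheduler.capacity.root.production.capacity</name>
            <value>70</value>
          </property>
          <property>
            <!-- Guaranteed share for ad-hoc analysis. -->
            <name>yarn.scheduler.capacity.root.adhoc.capacity</name>
            <value>30</value>
          </property>
          <property>
            <!-- Cap how far ad-hoc work can expand into idle capacity. -->
            <name>yarn.scheduler.capacity.root.adhoc.maximum-capacity</name>
            <value>50</value>
          </property>
        </configuration>

    Jobs are then submitted with mapreduce.job.queuename pointing at one of the queues, so ad-hoc analysis cannot starve production pipelines.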
  • 7. Don't Repeat Yourself: refactor data, not just code
    • Make popular data available pre-joined
    • Analyse code to find jobs with the same dependencies
    • Work at a higher level: MapReduce out, (S)Crunch in (see the sketch after this slide)
    • Allow substitution of operations for cached data
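
    To make "MapReduce out, (S)Crunch in" concrete, here is a hypothetical job written against the Crunch Java API; the input path, record format, and class names are assumptions for illustration:

        import org.apache.crunch.MapFn;
        import org.apache.crunch.PCollection;
        import org.apache.crunch.PTable;
        import org.apache.crunch.Pipeline;
        import org.apache.crunch.impl.mr.MRPipeline;
        import org.apache.crunch.types.writable.Writables;

        // Count plays per track without writing raw MapReduce. One log
        // line per play; assume the track id is the first tab-separated field.
        public class PlayCounts {
          public static void main(String[] args) {
            Pipeline pipeline = new MRPipeline(PlayCounts.class);
            PCollection<String> logs =
                pipeline.readTextFile("/logs/playback/2013-11-11");

            PTable<String, Long> counts = logs
                .parallelDo(new MapFn<String, String>() {
                  @Override
                  public String map(String line) {
                    return line.split("\t")[0];
                  }
                }, Writables.strings())
                .count();

            pipeline.writeTextFile(counts, "/data/track-play-counts/2013-11-11");
            pipeline.done();
          }
        }

    Because the job is expressed as collection transformations rather than explicit map/reduce stages, a planner or wrapper library can substitute a read of cached, pre-joined data for a recomputation.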
  • 8. Code quality & technical debt
    • Stable platform: Python -> JVM
    • Abolish custom infrastructure: off-the-shelf is often good enough (e.g. Sqoop, Kafka, ...)
    • Testing: make testing easier than running (an in-memory test sketch follows below)
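
    One way to make testing easier than running, sketched here with Crunch's in-memory MemPipeline so the logic above can be verified without a cluster; the test data and class names are assumptions for illustration:

        import static org.junit.Assert.assertEquals;

        import java.util.Map;
        import org.apache.crunch.MapFn;
        import org.apache.crunch.PCollection;
        import org.apache.crunch.PTable;
        import org.apache.crunch.impl.mem.MemPipeline;
        import org.apache.crunch.types.writable.Writables;
        import org.junit.Test;

        public class PlayCountsTest {
          @Test
          public void countsPlaysPerTrack() {
            // In-memory input: no HDFS, no MapReduce, runs in milliseconds.
            PCollection<String> logs = MemPipeline.typedCollectionOf(
                Writables.strings(),
                "track-1\tuser-a", "track-1\tuser-b", "track-2\tuser-a");

            PTable<String, Long> counts = logs
                .parallelDo(new MapFn<String, String>() {
                  @Override
                  public String map(String line) {
                    return line.split("\t")[0];
                  }
                }, Writables.strings())
                .count();

            Map<String, Long> result = counts.materializeToMap();
            assertEquals(Long.valueOf(2), result.get("track-1"));
            assertEquals(Long.valueOf(1), result.get("track-2"));
          }
        }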
  • 9. HDFS
    • Retention policy: automatic deletion of old intermediate data; opt-out, not opt-in
    • Establish convention: can you correctly guess the path to the data you need?
    • Enforce structure: path literals are a code smell (see the path-helper sketch after this slide)
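
    One hypothetical way to enforce path structure is to encode the convention in a single helper, so no job contains an HDFS path literal. The /data/<dataset>/<yyyy-MM-dd> layout is an assumption for illustration, not Spotify's actual scheme:

        import java.time.LocalDate;

        public final class DatasetPaths {
          private static final String ROOT = "/data";

          private DatasetPaths() {}

          // The canonical path for one daily partition of a dataset;
          // LocalDate prints as yyyy-MM-dd.
          public static String daily(String dataset, LocalDate date) {
            return ROOT + "/" + dataset + "/" + date;
          }
        }

        // Usage: DatasetPaths.daily("track-play-counts", LocalDate.of(2013, 11, 11))
        // returns "/data/track-play-counts/2013-11-11".

    A retention job can walk the same convention to find and delete expired intermediate partitions, which is what makes opt-out deletion practical.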
  • 10. Data library
    • Core datasets: identify, catalogue, document, monitor
    • Data library as code library: easy to use, synced with release cycles (a sketch follows below)
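
    A sketch of "data library as code library": each catalogued core dataset gets a typed accessor in a versioned library, so consumers pick up schema and path changes through a release rather than by editing path strings. All names here are illustrative assumptions:

        import java.time.LocalDate;
        import org.apache.crunch.PCollection;
        import org.apache.crunch.Pipeline;

        public final class CoreDatasets {
          private CoreDatasets() {}

          // Raw playback log lines for one day, as documented in the
          // data catalogue. Reuses the path convention helper above.
          public static PCollection<String> playbackLogs(Pipeline pipeline, LocalDate date) {
            return pipeline.readTextFile(DatasetPaths.daily("playback-logs", date));
          }
        }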
  • 11. You can have it easier than us
    • Act now: Big Data technical debt is worse than normal technical debt; rewriting 10 jobs is easier than rewriting 300
    • Plan to decentralise: at some point it won't be enough to trust your developers, and you won't be able to review every job forever
    • Make it simpler to do things the right way, for example with build tools
  • 12. Want to join the band? We're hiring for Stockholm and NYC. Check out http://www.spotify.com/jobs for more information.