Spotify: From 1 to 100 Hadoop developers
 

How Spotify scaled their Hadoop cluster, and the people working on it, from 1 to over 100 developers, and from 1 node to over 690 nodes, making it the largest Hadoop cluster in Europe.

Presentation Transcript

    • From 1 to 100 developers
      Scaling for developer productivity at Spotify
      @dawhiting
      HUG UK @ Strata, 11/11/2013
    • How do I scale?
        • How many nodes?
        • How much data?
        • How many records?
    • How do I scale my development?
        • How many developers?
        • How many teams?
        • How many Hadoop jobs?
        • How much code?
      (Chart: Data Infrastructure, July 2013)
    • A brief history of Hadoop development at Spotify
        • 2008 - Spotify launches in Sweden
        • 2009 - First Hadoop cluster, for royalties; 2 developers
        • 2010 - Up to 37 nodes; BI team formed, 3 devs / 3 analysts
        • 2011 - Move to Elastic MapReduce
        • 2012 - Back to own cluster, 60 -> 190 nodes; Infrastructure/Insights/Tools team split
        • 2013 - 6 teams just for data infrastructure, ~100 developers using the Hadoop cluster
    • Issues - What could possibly go wrong?
        • Contention for resources
        • Repetition of code, repetition of data
        • Poor code quality / technical debt
        • Disorganised HDFS
        • Data cataloguing
    • Contention for resources
      Priority and isolation
        • What is important?
      Hadoop scheduler
        • Capacity scheduler
        • Queue isolation
      YARN
        • Resource allocation
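      With the Capacity Scheduler, each team submits to its own isolated queue with a guaranteed share of the cluster. A minimal sketch of how a job targets a queue via the standard mapreduce.job.queuename property; the queue name "insights" and the job name are made-up examples, not Spotify's actual setup:

          import org.apache.hadoop.conf.Configuration;
          import org.apache.hadoop.mapreduce.Job;

          public class QueuedJob {
              public static void main(String[] args) throws Exception {
                  Configuration conf = new Configuration();
                  // Each team gets its own Capacity Scheduler queue;
                  // "insights" is a hypothetical queue name.
                  conf.set("mapreduce.job.queuename", "insights");

                  Job job = Job.getInstance(conf, "example-report");
                  // ... configure mapper, reducer, input and output paths as usual ...
                  System.exit(job.waitForCompletion(true) ? 0 : 1);
              }
          }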
    • Don't Repeat Yourself
      Refactor data, not just code
        • Make popular data available pre-joined
        • Analyse code to find jobs with the same dependencies
      Work at a higher level
        • MapReduce out, (S)Crunch in
        • Allow substitution of operations for cached data
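      "Work at a higher level" means expressing jobs as pipeline operations instead of hand-written mappers and reducers. A minimal Crunch sketch counting plays per track; the one-track-id-per-line input format is an assumption for illustration:

          import org.apache.crunch.PCollection;
          import org.apache.crunch.PTable;
          import org.apache.crunch.Pipeline;
          import org.apache.crunch.impl.mr.MRPipeline;

          public class TrackPlayCounts {
              public static void main(String[] args) {
                  Pipeline pipeline = new MRPipeline(TrackPlayCounts.class);

                  // Hypothetical input: one track id per line of a play log.
                  PCollection<String> plays = pipeline.readTextFile(args[0]);

                  // count() expresses "group by key and sum" in one call;
                  // Crunch plans the underlying MapReduce jobs.
                  PTable<String, Long> playsPerTrack = plays.count();

                  pipeline.writeTextFile(playsPerTrack, args[1]);
                  pipeline.done();
              }
          }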
    • Code Quality & Technical Debt
      Stable platform
        • Python -> JVM
      Abolish custom infrastructure
        • Off-the-shelf is often good enough
        • E.g. Sqoop, Kafka, ...
      Testing
        • Make testing easier than running
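      One way "easier to test than to run" works out in practice with Crunch: the same pipeline logic can execute in memory through MemPipeline, with no cluster or HDFS involved. A sketch of a JUnit test for the play-count logic above (test data is invented):

          import static org.junit.Assert.assertEquals;

          import java.util.Map;
          import org.apache.crunch.PCollection;
          import org.apache.crunch.impl.mem.MemPipeline;
          import org.apache.crunch.types.writable.Writables;
          import org.junit.Test;

          public class TrackPlayCountsTest {
              @Test
              public void countsPlaysPerTrack() {
                  // In-memory collection: the test runs in milliseconds,
                  // faster than submitting the job to a cluster.
                  PCollection<String> plays = MemPipeline.typedCollectionOf(
                      Writables.strings(), "track-1", "track-2", "track-1");

                  Map<String, Long> counts = plays.count().materializeToMap();

                  assertEquals(Long.valueOf(2), counts.get("track-1"));
                  assertEquals(Long.valueOf(1), counts.get("track-2"));
              }
          }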
    • HDFS
      Retention policy
        • Automatic deletion of old intermediate data
        • Opt-out, not opt-in
      Establish convention
        • Can you correctly guess the path to the data you need?
      Enforce structure
        • Path literals are a code smell
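      One way to enforce the convention is to route every path lookup through a single helper, so path literals never spread through job code. A hypothetical sketch; the DatasetPaths name and the /data layout are invented for illustration:

          import java.time.LocalDate;

          /**
           * Hypothetical helper: the HDFS layout convention lives in exactly
           * one place, so jobs never hard-code path literals.
           */
          public final class DatasetPaths {
              private static final String ROOT = "/data";

              private DatasetPaths() {}

              /** e.g. daily("playlog", LocalDate.of(2013, 11, 11)) -> "/data/playlog/2013/11/11" */
              public static String daily(String dataset, LocalDate date) {
                  return String.format("%s/%s/%04d/%02d/%02d",
                      ROOT, dataset, date.getYear(), date.getMonthValue(), date.getDayOfMonth());
              }
          }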
    • Data Library
      Core datasets
        • Identify
        • Catalogue
        • Document
        • Monitor
      Data library as code library
        • Easy to use
        • Synced with release cycles
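      A sketch of what "data library as code library" can look like: each catalogued core dataset carries its name, documentation and owner in code that ships with the normal release cycle, and resolves its paths through the shared convention. All names and fields here are illustrative (reusing the hypothetical DatasetPaths helper above), not Spotify's actual library:

          import java.time.LocalDate;

          /** Illustrative data-library entry; not Spotify's real catalogue. */
          public enum CoreDataset {
              PLAY_LOG("playlog", "One record per track play", "data-infrastructure"),
              USER_PROFILE("userprofile", "Current attributes per user", "insights");

              private final String name;
              private final String description;
              private final String owningTeam;

              CoreDataset(String name, String description, String owningTeam) {
                  this.name = name;
                  this.description = description;
                  this.owningTeam = owningTeam;
              }

              /** Resolve a daily partition via the shared path convention. */
              public String dailyPath(LocalDate date) {
                  return DatasetPaths.daily(name, date);
              }
          }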
    • You can have it easier than us
      Act now
        • Big Data technical debt is worse than normal technical debt
        • Rewriting 10 jobs is easier than rewriting 300
      Plan to decentralise
        • At some point it won't be enough to trust your developers
        • You won't be able to review every job forever
      Make it simpler to do things the right way
        • Example: build tools
    • Want to join the band? We're hiring for Stockholm and NYC. Check out http://www.spotify.com/jobs for more information.