Hadoop at Last.fm
June 2010

About us
Last.fm you say?

Last.fm is a music discovery website powered by scrobbling that provides personalized radio

Music discovery website
Each month we get:
- over 40M unique visitors
- over 500M pageviews
Each pageview leads to at least one log line
Clicks and other interactions sometimes lead to log lines too

Powered by scrobbling
scrobble: skrob·bul (ˈskrɒbəl) [verb] To automatically add the tracks you play to your Last.fm profile with a piece of software called a Scrobbler
Stats:
- Up to 800 scrobbles per second
- More than 40 million scrobbles per day
- Over 40 billion scrobbles so far
Each scrobble leads to a log line

Personalized radio
Via Flash player, Xbox, desktop and mobile apps
Stats:
- Over 10 million streaming hours per month
- Over 400 thousand unique stations per day
Each stream leads to at least one log line

And it’s not just logs…
So we gather a lot of logs, but also:
- Tags
- Shouts
- Journals
- Wikis
- Friend connections
- Fingerprints
- …
Hadoop is the infrastructure we use for storing and processing our flood of data

OUR SETUP
How many nodes?

Our herd of elephants
Current specs of our production cluster:
- 44 nodes
- 8 cores per node
- 16 GB memory per node
- 4 disks of 1 TB spinning at 7200 RPM per node
Unpatched CDH2 using:
- Fair scheduler with preemption
- Slightly patched hadoop-lzo
- RecordIO, Avro
- Hive, Dumbo, Pig

We often avoid Java with Dumbo

def mapper(key, value):
    for word in value.split():
        yield word, 1

def reducer(key, values):
    yield key, sum(values)

if __name__ == "__main__":
    import dumbo
    dumbo.run(mapper, reducer, combiner=reducer)

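Saved as wordcount.py, a script like this is launched with Dumbo's command-line tool; the Hadoop path and input/output names below are hypothetical:

dumbo start wordcount.py -hadoop /usr/lib/hadoop -input logs/access.log -output wordcounts

Leaving out the -hadoop option makes Dumbo run the script locally instead, which is handy for testing on a small sample of the data.
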
Or go even more high-level with Hive

hive> CREATE TABLE name_counts (gender STRING, name STRING, occurrences INT);
hive> INSERT OVERWRITE TABLE name_counts
      SELECT lower(sex), lower(split(realname, ' ')[0]), count(1) FROM meta_user_info
      WHERE lower(sex) <> 'n' GROUP BY lower(sex), lower(split(realname, ' ')[0]);
hive> CREATE TABLE gender_likelihoods (name STRING, gender STRING, likelihood FLOAT);
hive> INSERT OVERWRITE TABLE gender_likelihoods
      SELECT b.name, b.gender, b.occurrences / a.occurrences FROM
      (SELECT name, sum(occurrences) AS occurrences FROM name_counts GROUP BY name) a
      JOIN name_counts b ON (a.name = b.name);
hive> SELECT * FROM gender_likelihoods WHERE (name = 'klaas') OR (name = 'sam');
klaas  m  0.99038464
klaas  f  0.009615385
sam    m  0.7578873
sam    f  0.24211268

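The likelihood computation itself is easy to sanity-check in plain Python; a minimal sketch, assuming name_counts is an in-memory list of (gender, name, occurrences) tuples:

from collections import defaultdict

def gender_likelihoods(name_counts):
    # total occurrences of each name, summed across genders
    totals = defaultdict(int)
    for gender, name, occurrences in name_counts:
        totals[name] += occurrences
    # likelihood = occurrences for this (name, gender) / total for the name
    for gender, name, occurrences in name_counts:
        yield name, gender, float(occurrences) / totals[name]

Note that the likelihoods for each name sum to 1, as the klaas rows above illustrate (0.99038464 + 0.009615385 ≈ 1).
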
Mixed usage of tools is common

import os
import dumbo

def starter(prog):
    month = prog.delopt("month")  # is expected to be YYYY/MM
    hql = "INSERT OVERWRITE DIRECTORY 'cool/stuff_hive' ...".format(month)
    if os.system('hive -e "{0}"'.format(hql)) != 0:
        raise dumbo.Error("hive query failed")
    prog.addopt("input", "cool/stuff_hive")  # will be text delimited by '\x01'
    prog.addopt("output", "cool/stuff/" + month)
…
if __name__ == "__main__":
    dumbo.main(runner, starter)

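The elided runner typically does little more than register the map and reduce functions; a sketch with hypothetical mapper and reducer names:

def runner(job):
    # one MapReduce iteration over the Hive-produced input
    job.additer(mapper, reducer, combiner=reducer)

dumbo.main runs starter first to rewrite the program's options, and then executes the iterations that runner registers.
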
Running out of DFS space is common too

[DFS usage graph, annotated: "Data deleted", "New nodes", "More compression and nodes", "Not finalized yet after upgrade"]

Possible solutions:
- Bigger and/or more disks
- HDFS RAID

Hitting I/O and CPU limits is less common

[Cluster load graph, annotated: "Upgraded to 0.20"]

Our cluster can be pretty busy at times
But DFS space is our main worry

USE CASES
What do you use it for?

Things we do with Hadoop
- Site stats and metrics
- Charts
- Reporting
- Metadata corrections
- Neighbours
- Recommendations
- Indexing for search
- Evaluations
- Data insights

And also scaring our ops…

Example: Website traffic stats

We compute a lot of site metrics, mostly from Apache logs

[Browser-share graph, annotated: "Google Chrome is gaining ground"]

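A browser-share graph like this one boils down to a simple counting job. A hedged Dumbo sketch, assuming Apache combined log format in which the user agent is the last quoted field:

def mapper(key, value):
    agent = value.rsplit('"', 2)[-2]  # user-agent string
    # test Chrome before Safari: Chrome's user agent also contains "Safari"
    for browser in ("Chrome", "Firefox", "MSIE", "Opera", "Safari"):
        if browser in agent:
            yield browser, 1
            break

def reducer(key, values):
    yield key, sum(values)

if __name__ == "__main__":
    import dumbo
    dumbo.run(mapper, reducer, combiner=reducer)

Running one such job per day of logs and plotting the resulting counts gives trend lines like the ones above.
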
Example: Overall charts

Charts for a single user can be shown in real time and are computed on the fly
But computing overall charts is a pretty big job and is done on Hadoop

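Conceptually the overall job is the same aggregation a single user's chart needs, just over every user's scrobbles at once. A minimal sketch (not the production job), assuming tab-delimited scrobble lines that start with user, artist and track fields:

def mapper(key, value):
    user, artist, track = value.split("\t")[:3]  # assumed field layout
    yield (artist, track), 1

def reducer(key, values):
    yield key, sum(values)

if __name__ == "__main__":
    import dumbo
    dumbo.run(mapper, reducer, combiner=reducer)

Sorting the reducer output by count then yields the overall track chart.
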
Example: World charts

This “world chart” for the Belgian band “Hooverphonic” also required Hadoop because it’s based on data from many different users

Example: Overall wave graphs

Overall visualizations also typically require Hadoop for getting the data they visualize
The “wave graphs” in our “Best of 2009” newspaper were good examples

[Wave graph visualization]

Example: Death and scrobbles graphs

[Graph legend: − scrobbles, − listeners]

Example: Radio stats

Graphs for several metrics that can be broken down by various attributes
Used extensively for A/B testing

[Metric graphs, annotated: "Significant differences", "Overheated data centre", "DB maintenance that went bad"]

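Breaking a radio metric down by an attribute is again a grouping job. A hypothetical sketch computing the average listening time per stream for each A/B test variant, assuming tab-delimited streaming log lines with user, variant and seconds fields:

def mapper(key, value):
    user, variant, seconds = value.split("\t")[:3]  # assumed field layout
    yield variant, (int(seconds), 1)

def reducer(key, values):
    total_seconds = total_streams = 0
    for seconds, streams in values:
        total_seconds += seconds
        total_streams += streams
    # average listening time per stream for this variant
    yield key, float(total_seconds) / total_streams

if __name__ == "__main__":
    import dumbo
    dumbo.run(mapper, reducer)
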
Thanks!

klaas@last.fm   @klbostee
marc@last.fm    @lantti
ms@last.fm      @roserpens
