Hadoop at Last.fm
June 2010

About us
Last.fm you say?

Last.fm is a music discovery website powered by scrobbling that provides personalized radio

Music discovery website
Each month we get:
- over 40M unique visitors
- over 500M pageviews
Each pageview leads to at least one log line
Clicks and other interactions sometimes lead to log lines too

Powered by scrobbling
scrobble: skrob·bul (ˈskrɒbəl) [verb] To automatically add the tracks you play to your Last.fm profile with a piece of software called a Scrobbler
Stats:
- Up to 800 scrobbles per second
- More than 40 million scrobbles per day
- Over 40 billion scrobbles so far
Each scrobble leads to a log line

Personalized radio
Via Flash player, Xbox, desktop and mobile apps
Stats:
- Over 10 million streaming hours per month
- Over 400 thousand unique stations per day
Each stream leads to at least one log line

And it’s not just logs…
So we gather a lot of logs, but also:
- Tags
- Shouts
- Journals
- Wikis
- Friend connections
- Fingerprints
- …
Hadoop is the infrastructure we use for storing and processing our flood of data

OUR SETUP
How many nodes?

Our herd of elephants
Current specs of our production cluster:
- 44 nodes
- 8 cores per node
- 16 GB memory per node
- 4 disks of 1 TB spinning at 7200 RPM per node
Unpatched CDH2 using:
- Fair scheduler with preemption
- Slightly patched hadoop-lzo
- RecordIO, Avro
- Hive, Dumbo, Pig

We often avoid Java with Dumbo

def mapper(key, value):
    for word in value.split():
        yield word, 1

def reducer(key, values):
    yield key, sum(values)

if __name__ == "__main__":
    import dumbo
    dumbo.run(mapper, reducer, combiner=reducer)

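Saved as wordcount.py, a script like this is launched with Dumbo's command-line tool; the Hadoop path and input/output names below are hypothetical:

dumbo start wordcount.py -hadoop /usr/lib/hadoop -input logs/access.log -output wordcounts

Leaving out the -hadoop option makes Dumbo run the script locally instead, which is handy for testing on a small sample of the data.
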
Or go even more high-level with Hive

hive> CREATE TABLE name_counts (gender STRING, name STRING, occurrences INT);
hive> INSERT OVERWRITE TABLE name_counts
      SELECT lower(sex), lower(split(realname, ' ')[0]), count(1) FROM meta_user_info
      WHERE lower(sex) <> 'n' GROUP BY lower(sex), lower(split(realname, ' ')[0]);
hive> CREATE TABLE gender_likelihoods (name STRING, gender STRING, likelihood FLOAT);
hive> INSERT OVERWRITE TABLE gender_likelihoods
      SELECT b.name, b.gender, b.occurrences / a.occurrences FROM
      (SELECT name, sum(occurrences) AS occurrences FROM name_counts GROUP BY name) a
      JOIN name_counts b ON (a.name = b.name);
hive> SELECT * FROM gender_likelihoods WHERE (name = 'klaas') OR (name = 'sam');
klaas  m  0.99038464
klaas  f  0.009615385
sam    m  0.7578873
sam    f  0.24211268

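The likelihood computation itself is easy to sanity-check in plain Python; a minimal sketch, assuming name_counts is an in-memory list of (gender, name, occurrences) tuples:

from collections import defaultdict

def gender_likelihoods(name_counts):
    # total occurrences of each name, summed across genders
    totals = defaultdict(int)
    for gender, name, occurrences in name_counts:
        totals[name] += occurrences
    # likelihood = occurrences for this (name, gender) / total for the name
    for gender, name, occurrences in name_counts:
        yield name, gender, float(occurrences) / totals[name]

Note that the likelihoods for each name sum to 1, as the klaas rows above illustrate (0.99038464 + 0.009615385 ≈ 1).
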
Mixed usage of tools is common

import os
import dumbo

def starter(prog):
    month = prog.delopt("month")  # is expected to be YYYY/MM
    hql = "INSERT OVERWRITE DIRECTORY 'cool/stuff_hive' ...".format(month)
    if os.system('hive -e "{0}"'.format(hql)) != 0:
        raise dumbo.Error("hive query failed")
    prog.addopt("input", "cool/stuff_hive")  # will be text delimited by '\x01'
    prog.addopt("output", "cool/stuff/" + month)
…
if __name__ == "__main__":
    dumbo.main(runner, starter)

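The elided runner typically does little more than register the map and reduce functions; a sketch with hypothetical mapper and reducer names:

def runner(job):
    # one MapReduce iteration over the Hive-produced input
    job.additer(mapper, reducer, combiner=reducer)

dumbo.main runs starter first to rewrite the program's options, and then executes the iterations that runner registers.
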
Running out of DFS space is common too

[DFS usage graph, annotated: "Data deleted", "New nodes", "More compression and nodes", "Not finalized yet after upgrade"]

Possible solutions:
- Bigger and/or more disks
- HDFS RAID

Hitting I/O and CPU limits is less common

[Cluster load graph, annotated: "Upgraded to 0.20"]

Our cluster can be pretty busy at times
But DFS space is our main worry

USE CASES
What do you use it for?

Things we do with Hadoop
- Site stats and metrics
- Charts
- Reporting
- Metadata corrections
- Neighbours
- Recommendations
- Indexing for search
- Evaluations
- Data insights

And also scaring our ops…

Example: Website traffic stats

We compute a lot of site metrics, mostly from Apache logs

[Browser-share graph, annotated: "Google Chrome is gaining ground"]

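A browser-share graph like this one boils down to a simple counting job. A hedged Dumbo sketch, assuming Apache combined log format in which the user agent is the last quoted field:

def mapper(key, value):
    agent = value.rsplit('"', 2)[-2]  # user-agent string
    # test Chrome before Safari: Chrome's user agent also contains "Safari"
    for browser in ("Chrome", "Firefox", "MSIE", "Opera", "Safari"):
        if browser in agent:
            yield browser, 1
            break

def reducer(key, values):
    yield key, sum(values)

if __name__ == "__main__":
    import dumbo
    dumbo.run(mapper, reducer, combiner=reducer)

Running one such job per day of logs and plotting the resulting counts gives trend lines like the ones above.
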
Example: Overall charts

Charts for a single user can be shown in real time and are computed on the fly
But computing overall charts is a pretty big job and is done on Hadoop

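Conceptually the overall job is the same aggregation a single user's chart needs, just over every user's scrobbles at once. A minimal sketch (not the production job), assuming tab-delimited scrobble lines that start with user, artist and track fields:

def mapper(key, value):
    user, artist, track = value.split("\t")[:3]  # assumed field layout
    yield (artist, track), 1

def reducer(key, values):
    yield key, sum(values)

if __name__ == "__main__":
    import dumbo
    dumbo.run(mapper, reducer, combiner=reducer)

Sorting the reducer output by count then yields the overall track chart.
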
Example: World charts

This “world chart” for the Belgian band “Hooverphonic” also required Hadoop because it’s based on data from many different users

Example: Overall wave graphs

Overall visualizations also typically require Hadoop for getting the data they visualize
The “wave graphs” in our “Best of 2009” newspaper were good examples

[Wave graph visualization]

Example: Death and scrobbles graphs

[Graph legend: − scrobbles, − listeners]

Example: Radio stats

Graphs for several metrics that can be broken down by various attributes
Used extensively for A/B testing

[Metric graphs, annotated: "Significant differences", "Overheated data centre", "DB maintenance that went bad"]

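Breaking a radio metric down by an attribute is again a grouping job. A hypothetical sketch computing the average listening time per stream for each A/B test variant, assuming tab-delimited streaming log lines with user, variant and seconds fields:

def mapper(key, value):
    user, variant, seconds = value.split("\t")[:3]  # assumed field layout
    yield variant, (int(seconds), 1)

def reducer(key, values):
    total_seconds = total_streams = 0
    for seconds, streams in values:
        total_seconds += seconds
        total_streams += streams
    # average listening time per stream for this variant
    yield key, float(total_seconds) / total_streams

if __name__ == "__main__":
    import dumbo
    dumbo.run(mapper, reducer)
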
Thanks!

klaas@last.fm   @klbostee
marc@last.fm    @lantti
ms@last.fm      @roserpens
