Hadoop at Last.fm

This talk is about the usage of Hadoop at Last.fm, a community-driven music discovery website. We will go through the main types of data Last.fm stores in Hadoop, explain why we need Hadoop to store and process our data, give examples of what we do with it, and mention some of the additional tools from the Hadoop ecosystem on which we rely for getting these things done.

Published in: Technology


  1. 1. Hadoop at Last.fm<br />June 2010<br />
  2. 2. About us<br />Last.fm you say?<br />
  3. 3. Last.fm is a<br />music discovery website<br />powered by scrobbling<br />that provides personalized radio<br />
  4. 4. Music discovery website<br />Each month we get:<br />over 40M unique visitors<br />over 500M pageviews<br />Each pageview leads to at least one log line<br />Clicks and other interactions sometimes lead to log lines too<br />
  5. 5. Powered by scrobbling<br />scrobble: skrob·bul (ˈskrɒbəll)<br />[verb] To automatically add the tracks you play to your Last.fm profile with a piece of software called a Scrobbler<br />Stats:<br />Up to 800 scrobbles per second<br />More than 40 million scrobbles per day<br />Over 40 billion scrobbles so far<br />Each scrobble leads to a log line<br />
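To put the scrobble figures above in proportion, the following sketch converts the daily volume into an average per-second rate and compares it with the quoted peak. The function name is illustrative, and the inputs are the slide's lower bounds ("more than 40 million per day", "up to 800 per second"), so the result is only a rough estimate.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def average_rate(events_per_day):
    """Average events per second for a given daily volume."""
    return events_per_day / float(SECONDS_PER_DAY)


if __name__ == "__main__":
    scrobbles_per_day = 40000000   # "more than 40 million scrobbles per day"
    peak_per_second = 800          # "up to 800 scrobbles per second"

    average = average_rate(scrobbles_per_day)
    print("average scrobbles/s: %.0f" % average)
    print("peak-to-average ratio: %.2f" % (peak_per_second / average))
```

On these numbers the average is a little over half the peak, which suggests the log volume is fairly steady over the day rather than arriving in short bursts.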
  6. 6. Personalized radio<br />Via flash player, Xbox, desktop and mobile apps<br />Stats:<br />Over 10 million streaming hours per month<br />Over 400 thousand unique stations per day<br />Each stream leads to at least one log line<br />
  7. 7. And it’s not just logs…<br />So we gather a lot of logs, but also:<br />Tags<br />Shouts<br />Journals<br />Wikis<br />Friend connections<br />Fingerprints<br />…<br />Hadoop is the infrastructure we use for storing and processing our flood of data<br />
  8. 8. OUR SETUP<br />How many nodes?<br />
  9. 9. Our herd of elephants<br />Current specs of our production cluster:<br />44 nodes<br />8 cores per node<br />16 GB memory per node<br />4 disks of 1 TB spinning at 7200 RPM per node<br />Unpatched CDH2 using:<br />Fair scheduler with preemption<br />Slightly patched hadoop-lzo<br />RecordIO, Avro<br />Hive, Dumbo, Pig<br />
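As a rough illustration of what these specs add up to, the sketch below totals the cluster's cores, memory and disk. The replication factor of 3 is an assumption (it is the HDFS default; the slides do not say what Last.fm used), and the usable figure ignores space taken by the OS, temporary MapReduce output and compression.

```python
# Figures taken from the slide.
NODES = 44
CORES_PER_NODE = 8
MEMORY_GB_PER_NODE = 16
DISKS_PER_NODE = 4
DISK_TB = 1

# Assumption: the HDFS default replication factor; not stated in the slides.
REPLICATION = 3


def cluster_totals(nodes=NODES, replication=REPLICATION):
    """Return aggregate capacity figures for the cluster."""
    raw_tb = nodes * DISKS_PER_NODE * DISK_TB
    return {
        "cores": nodes * CORES_PER_NODE,
        "memory_gb": nodes * MEMORY_GB_PER_NODE,
        "raw_tb": raw_tb,
        # Each block is stored `replication` times, so divide the raw space.
        "usable_tb": raw_tb / float(replication),
    }


if __name__ == "__main__":
    for name, value in sorted(cluster_totals().items()):
        print("%s: %.1f" % (name, value))
```

Under that assumption, 176 TB of raw disk holds under 60 TB of distinct data, which gives some context for why DFS space comes up as the main worry later in the deck.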
  10. 10. We often avoid Java with Dumbo

      def mapper(key, value):
          # Emit a count of 1 for every word on the input line.
          for word in value.split():
              yield word, 1

      def reducer(key, values):
          # Sum the counts for one word; also reused as the combiner.
          yield key, sum(values)

      if __name__ == "__main__":
          import dumbo
          dumbo.run(mapper, reducer, combiner=reducer)
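To see what the word count on this slide computes without a Hadoop cluster, the mapper and reducer can be driven by a few lines of plain Python. This is only a sketch: `run_locally` is a hypothetical helper, not part of Dumbo, and it imitates the shuffle phase with an in-memory sort.

```python
from itertools import groupby
from operator import itemgetter


def mapper(key, value):
    for word in value.split():
        yield word, 1


def reducer(key, values):
    yield key, sum(values)


def run_locally(records, mapper, reducer):
    """Hypothetical stand-in for a MapReduce run: map, sort by key, reduce."""
    mapped = [pair for key, value in records for pair in mapper(key, value)]
    mapped.sort(key=itemgetter(0))  # plays the role of the shuffle/sort phase
    output = []
    for key, group in groupby(mapped, key=itemgetter(0)):
        output.extend(reducer(key, (value for _, value in group)))
    return output


if __name__ == "__main__":
    # Made-up input: (offset, line) pairs like those a text input would give.
    lines = [(0, "hadoop at last fm"), (1, "last fm loves hadoop")]
    for word, count in run_locally(lines, mapper, reducer):
        print("%s\t%d" % (word, count))
```

Because the reducer only sums, it is associative and commutative, which is why the slide can safely pass it as the combiner as well.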
  11. 11. Or go even more high-level with Hive

      hive> CREATE TABLE name_counts (gender STRING, name STRING, occurrences INT);
      hive> INSERT OVERWRITE TABLE name_counts
            SELECT lower(sex), lower(split(realname, ' ')[0]), count(1) FROM meta_user_info
            WHERE lower(sex) <> 'n' GROUP BY lower(sex), lower(split(realname, ' ')[0]);
      hive> CREATE TABLE gender_likelihoods (name STRING, gender STRING, likelihood FLOAT);
      hive> INSERT OVERWRITE TABLE gender_likelihoods
            SELECT b.name, b.gender, b.occurrences / a.occurrences FROM
            (SELECT name, sum(occurrences) as occurrences FROM name_counts GROUP BY name) a
            JOIN name_counts b ON (a.name = b.name);
      hive> SELECT * FROM gender_likelihoods WHERE (name = 'klaas') OR (name = 'sam');
      klaas  m  0.99038464
      klaas  f  0.009615385
      sam    m  0.7578873
      sam    f  0.24211268
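The two Hive statements on this slide boil down to counting (gender, first name) pairs and then dividing each count by the total for that name. The sketch below mirrors that logic in plain Python on a handful of made-up rows; the function name and sample data are illustrative, and it does not handle NULL values the way Hive would.

```python
from collections import Counter, defaultdict


def gender_likelihoods(users):
    """users: iterable of (sex, realname) pairs, like rows of meta_user_info."""
    # Step 1 (name_counts): occurrences per (gender, first name), skipping 'n'.
    name_counts = Counter()
    for sex, realname in users:
        gender = sex.lower()
        if gender == "n":
            continue
        first_name = realname.lower().split(" ")[0]
        name_counts[(gender, first_name)] += 1

    # Step 2 (subquery a): total occurrences per name across genders.
    totals = defaultdict(int)
    for (gender, name), occurrences in name_counts.items():
        totals[name] += occurrences

    # Step 3 (the join): share of each gender within a name.
    return dict(
        ((name, gender), occurrences / float(totals[name]))
        for (gender, name), occurrences in name_counts.items()
    )


if __name__ == "__main__":
    # Made-up sample rows, not Last.fm data.
    sample = [
        ("m", "Sam Smith"), ("m", "Sam Jones"), ("M", "sam lee"),
        ("f", "Sam Brown"), ("n", "Sam Unknown"), ("m", "Klaas B"),
    ]
    for (name, gender), likelihood in sorted(gender_likelihoods(sample).items()):
        print("%s\t%s\t%.2f" % (name, gender, likelihood))
```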
  12. 12. Mixed usage of tools is common

      import os
      import dumbo

      def starter(prog):
          month = prog.delopt("month")  # is expected to be YYYY/MM
          hql = "INSERT OVERWRITE DIRECTORY 'cool/stuff_hive' ...".format(month)
          # Run the Hive query first; abort the Dumbo job if it fails.
          if os.system('hive -e "{0}"'.format(hql)) != 0:
              raise dumbo.Error("hive query failed")
          prog.addopt("input", "cool/stuff_hive")  # will be text delimited by '\x01'
          prog.addopt("output", "cool/stuff/" + month)
      …
      if __name__ == "__main__":
          dumbo.main(runner, starter)
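The comment on this slide notes that the Hive output directory contains text delimited by `\x01` (Ctrl-A, Hive's default field separator). A downstream mapper therefore has to split on that byte. The sketch below shows one way to do so; the column layout (a name followed by a count) is hypothetical, since the slide elides the actual query, and treating `\N` as NULL assumes Hive's default text serialization.

```python
HIVE_FIELD_DELIMITER = "\x01"  # Ctrl-A, Hive's default field separator
HIVE_NULL = "\\N"              # assumption: default NULL marker in text output


def parse_hive_line(line):
    """Split one line of Hive text output into fields, mapping \\N to None."""
    fields = line.rstrip("\n").split(HIVE_FIELD_DELIMITER)
    return [None if field == HIVE_NULL else field for field in fields]


def mapper(key, value):
    # Hypothetical layout: first column is a name, second an integer count.
    name, count = parse_hive_line(value)[:2]
    if name is not None and count is not None:
        yield name, int(count)


if __name__ == "__main__":
    print(parse_hive_line("sam\x01m\x013\n"))
    print(list(mapper(0, "sam\x013")))
```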
  13. 13. Running out of DFS space is common too<br />Possible solutions:<br />Bigger and/or more disks<br />HDFS RAID<br />
  14. 14. Running out of DFS space is common too<br />Data deleted <br />New nodes<br />More compression and nodes<br />Not finalized yet after upgrade<br />Possible solutions:<br />Bigger and/or more disks<br />HDFS RAID<br />
  15. 15. Hitting I/O and CPU limits is less common<br />Our cluster can be pretty busy at times<br />But DFS space is our main worry<br />
  16. 16. Hitting I/O and CPU limits is less common<br />Upgraded to 0.20<br />Our cluster can be pretty busy at times<br />But DFS space is our main worry<br />
  17. 17. USE CASES<br />What do you use it for?<br />
  18. 18. Things we do with Hadoop<br />Site stats and metrics<br />Charts<br />Reporting<br />Metadata corrections<br />Neighbours<br />Recommendations<br />Indexing for search<br />Evaluations<br />Data insights<br />
  19. 19. And also scaring our ops…<br />
  20. 20. Example: Website traffic stats<br />We compute a lot of site metrics, mostly from apache logs<br />
  21. 21. Example: Website traffic stats<br />Google Chrome is gaining ground<br />We compute a lot of site metrics, mostly from apache logs<br />
  22. 22. Example: Overall charts<br />Charts for a single user can be shown in real time and are computed on the fly<br />But computing overall charts is a pretty big job and is done on Hadoop<br />
  23. 23. Example: World charts<br />This “world chart” for the Belgian band “Hooverphonic” also required Hadoop because it’s based on data from many different users<br />
  24. 24. Example: Overall wave graphs <br />Overall visualizations also typically require Hadoop for getting the data they visualize<br />The “wave graphs” in our “Best of 2009” newspaper were good examples<br />
  25. 25. Example: Overall wave graphs <br />
  26. 26. Example: Death and scrobbles graphs<br />−scrobbles<br />− listeners<br />
  27. 27. Example: Radio stats<br />Graphs for several metrics that can be broken down by various attributes<br />Used extensively for A/B testing<br />
  28. 28. Example: Radio stats<br />Graphs for several metrics that can be broken down by various attributes<br />Used extensively for A/B testing<br />Significant differences<br />
  29. 29. Example: Radio stats<br />Graphs for several metrics that can be broken down by various attributes<br />Used extensively for A/B testing<br />Overheated data centre<br />Significant differences<br />DB maintenance that went bad<br />
  30. 30. Thanks!<br />klaas@last.fm @klbostee<br />marc@last.fm @lanttims@last.fm @roserpens<br />