Hive at Last.fm

2,800 views

Published on

This talk is about using Hive in practice. We will go through some of the specific use cases for which Hive is currently being used at Last.fm, highlighting its strengths and weaknesses along the way.

Published in: Technology, Education
0 Comments
5 Likes
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total views
2,800
On SlideShare
0
From Embeds
0
Number of Embeds
209
Actions
Shares
0
Downloads
0
Comments
0
Likes
5
Embeds 0
No embeds

No notes for slide

Hive at Last.fm

  1. 1. Hive at Last.fm March 2010
  2. 2. What is Last.fm? A music community website, powered by scrobbling that provides personalised radio. We aggregate scrobbles. A single scrobble is the smallest unit of music attention data. 1 scrobble = (track, artist, timestamp).
  3. 3. In numbers • 40 million users visit the site every month • 39 billion scrobbles (600 per second) • 400k personalised radio stations per day enter hadoop...
  4. 4. Hadoop cluster • 44 nodes • 8 cores per node • 16 gig ram per node • 4x 1TB 7200rpm disks per node
  5. 5. Hadoop what is it good for? • Charts • Reporting • Corrections • Site stats / metrics • Neighbours • Recommendations
  6. 6. But wait, can you tell us about <stuff/>? • How many? • When? • Where? • Who? • Why? Why not?
  7. 7. Ad hoc questions • We get them all the time. • Questions are good things, but answers take up time. • We would typically write programs once, run once. enter Hive...
  8. 8. What is Hive? "Hive is a data warehouse infrastructure built on top of Hadoop" You get an SQL-like language for queries. Start queries from a shell, file, jdbc, thrift.
  9. 9. Hive: SQL warehouse Hadoop
  10. 10. Why we chose Hive? • SQL familiarity suits non data engineers. • It integrates well with existing data sets. • It worked.
  11. 11. Johan set it up..
  12. 12. eg: http://www.flickr.com/photos/lozzd/4203345000/
  13. 13. Example: SELECT artistid, insertdate, count(1) FROM scrobbles WHERE (trackid = 10019 OR trackid = 368575614) AND insertdate >= '2009-12-01' AND insertdate <= '2009-12-31' GROUP BY artistid, insertdate ORDER BY artistid, insertdate;
  14. 14. Example: Users that scrobble ? Users that use the radio
  15. 15. Example: SELECT count(1) FROM scrobbles GROUP BY userid; SELECT count(1) FROM radiologs GROUP BY userid; SELECT count(1) FROM radiologs r JOIN scrobbles s ON r.userid = s.userid GROUP BY r.userid;
  16. 16. Example: Consider a user's scrobbles and radio listens for just one track First scrobble! Scrobbles Radio Time
  17. 17. Example: SELECT r.userid, r.trackid, count(1) FROM ( SELECT userid, trackid, min(unixtime) as unixtime FROM scrobbles GROUP BY userid, trackid ) s JOIN radiologs r ON r.userid = s.userid AND r.trackid = r.trackid WHERE s.unixtime < r.unixtime GROUP BY r.userid, r.trackid
  18. 18. Other nice things about hive • Joins are really really easy (most of the time).
  19. 19. Preparing a search index The crowd Labels cloud scrobbles Charts corrections Catalogue Hive Solr artists albums tracks
  20. 20. Not so great • No recordio. • Really huge joins can cause out of memory exceptions.

×