Hive at Last.fm
Transcript

  • 1. Hive at Last.fm
    Omar Ali, Data Developer
    March 2012
  • 2. Overview
    • Hadoop at Last.fm
    • Hive
    • Examples
    What I want to show you:
    • How it fits with a Hadoop infrastructure
    • Typical workflow with Hive
    • Ease of use for experiments and prototypes
  • 3. Hadoop
    • Brief overview of our infrastructure
    • How we use it
  • 4. Hadoop
    64-node cluster
  • 5. Charts
  • 6. Hive
    • What is Hive?
    • How does it fit in with the rest of our system?
    • Using existing data in Hive
    • Example query
  • 7. What is Hive?
    • Data Warehouse
    • You see your data in the form of tables
    • Query language very similar to SQL

    hive> show tables like 'omar_charts_*';
    OK
    omar_charts_globaltags_album
    omar_charts_globaltags_artist
    omar_charts_globaltags_track
    omar_charts_tagcloud_album
    omar_charts_tagcloud_artist
    omar_charts_tagcloud_track

    hive> describe omar_charts_tagcloud_album;
    OK
    albumid  int
    tagid    int
    weight   double
  • 8. What is a table?
    Standard:
    • Metadata stored by Hive
    • Table data stored by Hive
    • Deleting the table deletes the data and the metadata
    External:
    • Metadata stored by Hive
    • Table data referenced by Hive
    • Deleting the table only deletes the metadata
  • 9. What is a table?
    Standard (e.g. database tables):
    • Metadata stored by Hive
    • Table data stored by Hive
    • Deleting the table deletes the data and the metadata
    External (e.g. log files):
    • Metadata stored by Hive
    • Table data referenced by Hive
    • Deleting the table only deletes the metadata
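The standard/external distinction can be sketched in HiveQL. The table and column names below are illustrative, not from the deck:

```sql
-- Standard (managed) table: Hive owns both the metadata and the data.
-- DROP TABLE removes the underlying files as well as the definition.
CREATE TABLE plays_managed (
    userid  INT,
    trackid INT
);

-- External table: Hive records only the metadata; the files under
-- LOCATION are left untouched when the table is dropped.
CREATE EXTERNAL TABLE plays_external (
    userid  INT,
    trackid INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/plays';
```

External tables are what let existing log files be queried in place, without copying them into Hive's warehouse directory.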
  • 10. Example: scrobbles
    Scrobble log:
    13364451  30886670  217803052  358001787  0 0 0 1 0 0  1319068581
    42875138  1717      3776668    4641276    0 0 0 1 0 0  1319068445
    43108664  1003811   2237730    1019632    0 0 0 1 0 0  1319068783
    36107186  1033304   2393940    13409429   0 0 0 0 0 1  1319068524
    23842745  1261965   2349564    14091069   0 0 0 0 0 1  1319068594
    Directory structure:
    /data/submissions/2002/01/01
    ...
    /data/submissions/2012/03/20
    /data/submissions/2012/03/21
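One way to expose a dated directory layout like this to Hive is an external table partitioned by date. The DDL below is a hedged sketch, not the actual Last.fm schema; the column names are guesses based on the fields the later queries reference:

```sql
CREATE EXTERNAL TABLE data_submissions (
    userid   INT,
    trackid  INT,
    albumid  INT,
    artistid INT,
    scrobble INT,
    listen   INT,
    unixtime BIGINT
)
PARTITIONED BY (insertdate STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

-- Register one day's directory as a partition. Queries that filter on
-- insertdate then read only the matching directories, not the whole
-- decade of logs.
ALTER TABLE data_submissions ADD PARTITION (insertdate = '2012-03-21')
    LOCATION '/data/submissions/2012/03/21';
```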
  • 11. A Hive Query
    select
        track.title, size(collect_set(s.userid)) as reach
    from
        meta_track track
        join data_submissions s on (s.trackid = track.id)
    where
        s.insertdate = "2012-03-01" and (s.scrobble + s.listen > 0)
        and s.artistid = 57976724 -- Lana Del Rey
    group by
        track.title
    order by
        reach desc
    limit 5;
  • 12. A Hive Query
    (the same query, running)
    Total MapReduce jobs = 3
    Launching Job 1 out of 3
    Number of reduce tasks not specified. Estimated from input data size: 52
    2012-03-19 23:28:58,613 Stage-1 map = 0%,  reduce = 0%
    2012-03-19 23:29:08,765 Stage-1 map = 3%,  reduce = 0%
    2012-03-19 23:29:10,794 Stage-1 map = 9%,  reduce = 0%
  • 13. A Hive Query
    (the same query, results)
    Born to Die        10765
    Video Games        9382
    Off to the Races   6569
    Blue Jeans         6266
    National Anthem    5795
    ~300 seconds
  • 14. Examples
    • Trends in UK Listening
    • Hadoop User Group Charts
  • 15. Trends in UK Listening
  • 16. Trends in UK Listening
  • 17. Trends in UK Listening
  • 18. select
        artistid, hourOfDay,
        meanPlays, stdPlays, meanReach, stdReach, hoursInExistence,
        meanPlays / sqrt(hoursInExistence) as stdErrPlays,
        meanReach / sqrt(hoursInExistence) as stdErrReach
    from
        (select
            artistCounts.artistid as artistid, artistCounts.hourOfDay,
            avg(artistCounts.plays) as meanPlays, stddev_samp(artistCounts.plays) as stdPlays,
            avg(artistCounts.reach) as meanReach, stddev_samp(artistCounts.reach) as stdReach,
            size(collect_set(concat(artistCounts.insertdate, hourOfDay))) as hoursInExistence
        from
            (select
                artistid, insertdate, hour(from_unixtime(unixtime)) as hourOfDay,
                count(*) as plays, size(collect_set(s.userid)) as reach
            from
                lookups_userid_geo g
                join data_submissions s on (g.userid = s.userid)
            where
                insertdate >= "2011-01-01" and insertdate < "2012-01-01"
                and (listen + scrobble) > 0
                and lower(g.countrycode) = "gb"
            group by
                artistid, insertdate, hour(from_unixtime(unixtime))
            ) artistCounts
        group by
            artistCounts.artistid, artistCounts.hourOfDay
        ) artistStats
    where
        meanReach > 25;
  • 19. (the same query as slide 18)
  • 20. (the same query as slide 18)
  • 21. So far
    • Test data: listening statistics for each artist, in each hour of the day
    • Base data: averaged hourly statistics for each artist
    • Next step: compare them
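The two datasets named above can each be materialized with CREATE TABLE ... AS SELECT, so the comparison query can join them by name. The deck doesn't show the actual definitions; as a hedged sketch, the per-artist baseline could be derived from the hourly table like this:

```sql
-- Sketch only: derive the per-artist baseline from the hourly table.
-- Assumes omar_uk_artist_hours holds the output of the slide-18 query,
-- with columns (artistid, hourOfDay, meanPlays, stdPlays, meanReach,
-- stdReach, stdErrPlays, stdErrReach).
CREATE TABLE omar_uk_artist_base AS
SELECT
    artistid,
    avg(meanPlays)   AS meanPlays,
    avg(stdPlays)    AS stdPlays,
    avg(meanReach)   AS meanReach,
    avg(stdReach)    AS stdReach,
    avg(stdErrPlays) AS stdErrPlays,
    avg(stdErrReach) AS stdErrReach
FROM omar_uk_artist_hours
GROUP BY artistid;
```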
  • 22. Comparison
    select
        test.artistid,
        test.meanReach, base.meanReach,
        test.stdReach, base.stdReach,
        test.stdErrReach, base.stdErrReach,
        (test.meanReach - base.meanReach) / (base.stdReach) as zScore,
        (test.meanReach - base.meanReach) / (base.stdErrReach * test.stdErrReach) as deviation
    from
        omar_uk_artist_base base
        join omar_uk_artist_hours test on (base.artistid = test.artistid)
    where
        test.hourOfDay = 15
    order by
        deviation desc
    limit 5;
  • 23. Trends in UK Listening
  • 24. Summary
    • Hive is easy to use
    • It sits comfortably on top of a Hadoop infrastructure
    • Familiar if you know SQL
    • Can ask big questions
    • Can ask wide-ranging questions
    • Allows analyses that would otherwise need a lot of preliminary work
  • 25. HUG Charts
  • 26. Any Questions?
