Hive at Last.fm

Transcript

  • 1. Hive at Last.fm
    Omar Ali - Data Developer
    March 2012
  • 2. Overview
    • Hadoop at Last.fm
    • Hive
    • Examples
    What I want to show you:
    • How it fits with a Hadoop infrastructure
    • Typical workflow with Hive
    • Ease of use for experiments and prototypes
  • 3. Hadoop
    • Brief overview of our infrastructure
    • How we use it
  • 4. Hadoop
    64-node cluster
  • 5. Charts [chart slide]
  • 6. Hive
    • What is Hive?
    • How does it fit in with the rest of our system?
    • Using existing data in Hive
    • Example query
  • 7. What is Hive?
    • Data Warehouse
    • You see your data in the form of tables
    • Query language very similar to SQL

    hive> show tables like 'omar_charts_*';
    OK
    omar_charts_globaltags_album
    omar_charts_globaltags_artist
    omar_charts_globaltags_track
    omar_charts_tagcloud_album
    omar_charts_tagcloud_artist
    omar_charts_tagcloud_track

    hive> describe omar_charts_tagcloud_album;
    OK
    albumid  int
    tagid    int
    weight   double
  • 8. What is a table?
    Standard:
    • Metadata stored by Hive
    • Table data stored by Hive
    • Deleting the table deletes the data and the metadata
    External:
    • Metadata stored by Hive
    • Table data referenced by Hive
    • Deleting the table only deletes the metadata
  • 9. What is a table?
    (Same comparison as slide 8, shown with a diagram: Database Tables, Log Files)
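    The standard/external distinction maps directly onto DDL. A minimal sketch,
    assuming a hypothetical plays table with illustrative column names (not the
    actual Last.fm schema):

    -- Standard (managed) table: Hive owns both the metadata and the data;
    -- dropping the table deletes the underlying files as well.
    create table plays_managed (
        userid   int,
        trackid  int,
        unixtime bigint
    );

    -- External table: Hive records only metadata and points at existing
    -- files; dropping the table leaves the files in place.
    create external table plays_external (
        userid   int,
        trackid  int,
        unixtime bigint
    )
    row format delimited fields terminated by '\t'
    location '/data/submissions';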
  • 10. Example: scrobbles
    Scrobble log:
    13364451  30886670  217803052  358001787  0  0  0  1  0  0  1319068581
    42875138  1717      3776668    4641276    0  0  0  1  0  0  1319068445
    43108664  1003811   2237730    1019632    0  0  0  1  0  0  1319068783
    36107186  1033304   2393940    13409429   0  0  0  0  0  1  1319068524
    23842745  1261965   2349564    14091069   0  0  0  0  0  1  1319068594
    Directory structure:
    /data/submissions/2002/01/01
    ...
    /data/submissions/2012/03/20
    /data/submissions/2012/03/21
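    A dated directory tree like this is typically exposed to Hive as a
    partitioned external table, with one partition registered per day. A hedged
    sketch; the column list is illustrative (the real log lines carry more
    fields than are named here):

    create external table data_submissions (
        userid    int,
        artistid  int,
        albumid   int,
        trackid   int,
        scrobble  int,
        listen    int,
        unixtime  bigint
    )
    partitioned by (insertdate string)
    row format delimited fields terminated by '\t'
    location '/data/submissions';

    -- Register one day's directory as a partition:
    alter table data_submissions
        add partition (insertdate = '2012-03-21')
        location '/data/submissions/2012/03/21';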
  • 11. A Hive Query
    select
        track.title, size(collect_set(s.userid)) as reach
    from
        meta_track track
        join data_submissions s on (s.trackid = track.id)
    where
        s.insertdate = "2012-03-01" and (s.scrobble + s.listen > 0)
        and s.artistid = 57976724 -- Lana Del Rey
    group by
        track.title
    order by
        reach desc
    limit 5;
  • 12. A Hive Query
    (Same query as slide 11; running it produces:)
    Total MapReduce jobs = 3
    Launching Job 1 out of 3
    Number of reduce tasks not specified. Estimated from input data size: 52
    2012-03-19 23:28:58,613 Stage-1 map = 0%,  reduce = 0%
    2012-03-19 23:29:08,765 Stage-1 map = 3%,  reduce = 0%
    2012-03-19 23:29:10,794 Stage-1 map = 9%,  reduce = 0%
  • 13. A Hive Query
    (Same query as slide 11; the result:)
    Born to Die       10765
    Video Games        9382
    Off to the Races   6569
    Blue Jeans         6266
    National Anthem    5795
    ~300 seconds
  • 14. Examples
    • Trends in UK Listening
    • Hadoop User Group Charts
  • 15. Trends in UK Listening [chart slide]
  • 16. Trends in UK Listening [chart slide]
  • 17. Trends in UK Listening [chart slide]
  • 18.
    select
        artistid, hourOfDay,
        meanPlays, stdPlays, meanReach, stdReach, hoursInExistence,
        meanPlays / sqrt(hoursInExistence) as stdErrPlays,
        meanReach / sqrt(hoursInExistence) as stdErrReach
    from
        (select
            artistCounts.artistid as artistid, artistCounts.hourOfDay,
            avg(artistCounts.plays) as meanPlays, stddev_samp(artistCounts.plays) as stdPlays,
            avg(artistCounts.reach) as meanReach, stddev_samp(artistCounts.reach) as stdReach,
            size(collect_set(concat(artistCounts.insertdate, hourOfDay))) as hoursInExistence
        from
            (select
                artistid, insertdate, hour(from_unixtime(unixtime)) as hourOfDay,
                count(*) as plays, size(collect_set(s.userid)) as reach
            from
                lookups_userid_geo g
                join data_submissions s on (g.userid = s.userid)
            where
                insertdate >= "2011-01-01" and insertdate < "2012-01-01"
                and (listen + scrobble) > 0
                and lower(g.countrycode) = "gb"
            group by
                artistid, insertdate, hour(from_unixtime(unixtime))
            ) artistCounts
        group by
            artistCounts.artistid, artistCounts.hourOfDay
        ) artistStats
    where
        meanReach > 25;
  • 19. (Same query as slide 18.)
  • 20. (Same query as slide 18.)
  • 21. So far
    • Test data: listening statistics for each artist, in each hour of the day
    • Base data: averaged hourly statistics for each artist
    • Next step: compare them
  • 22. Comparison
    select
        test.artistid,
        test.meanReach, base.meanReach,
        test.stdReach, base.stdReach,
        test.stdErrReach, base.stdErrReach,
        (test.meanReach - base.meanReach) / (base.stdReach) as zScore,
        (test.meanReach - base.meanReach) / (base.stdErrReach * test.stdErrReach) as deviation
    from
        omar_uk_artist_base base
        join omar_uk_artist_hours test on (base.artistid = test.artistid)
    where
        test.hourOfDay = 15
    order by
        deviation desc
    limit 5;
  • 23. Trends in UK Listening [chart slide]
  • 24. Summary
    • Hive is easy to use
    • It sits comfortably on top of a Hadoop infrastructure
    • Familiar if you know SQL
    • Can ask big questions
    • Can ask wide-ranging questions
    • Allows analyses that would otherwise need a lot of preliminary work
  • 25. HUG Charts [chart slide]
  • 26. Any Questions?