Hive at Last.fm
Presentation Transcript

• Hive at Last.fm
  Omar Ali - Data Developer
  March 2012
• Overview
  - Hadoop at Last.fm
  - Hive
  - Examples
  What I want to show you:
  - How it fits with a Hadoop infrastructure
  - Typical workflow with Hive
  - Ease of use for experiments and prototypes
• Hadoop
  - Brief overview of our infrastructure
  - How we use it
• Hadoop: 64 node cluster (cluster diagram)
• Charts (chart slide)
• Hive
  - What is Hive?
  - How does it fit in with the rest of our system?
  - Using existing data in Hive
  - Example query
• What is Hive?
  - Data Warehouse
  - You see your data in the form of tables
  - Query language very similar to SQL

  hive> show tables like 'omar_charts_*';
  OK
  omar_charts_globaltags_album
  omar_charts_globaltags_artist
  omar_charts_globaltags_track
  omar_charts_tagcloud_album
  omar_charts_tagcloud_artist
  omar_charts_tagcloud_track

  hive> describe omar_charts_tagcloud_album;
  OK
  albumid  int
  tagid    int
  weight   double
• What is a table?
  Standard:
  - Metadata stored by Hive
  - Table data stored by Hive
  - Deleting the table deletes the data and the metadata
  External:
  - Metadata stored by Hive
  - Table data referenced by Hive
  - Deleting the table only deletes the metadata
  (Diagram: standard tables backed by database tables, external tables backed by log files)
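The distinction above can be sketched in HiveQL. This is a minimal sketch, not the deck's actual DDL; the table name, columns, and delimiter are hypothetical:

```sql
-- Standard table: Hive owns both metadata and data.
-- DROP TABLE removes the files as well as the definition.
CREATE TABLE example_standard (
    id    INT,
    name  STRING
);

-- External table: Hive owns only the metadata and merely
-- references files that already exist; DROP TABLE leaves them in place.
CREATE EXTERNAL TABLE example_external (
    id    INT,
    name  STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/example';
```

External tables are what let Hive sit on top of pre-existing log files without copying them.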
• Example: scrobbles
  Scrobble log:
  13364451  30886670  217803052  358001787  0 0 0 1 0 0  1319068581
  42875138  1717      3776668    4641276    0 0 0 1 0 0  1319068445
  43108664  1003811   2237730    1019632    0 0 0 1 0 0  1319068783
  36107186  1033304   2393940    13409429   0 0 0 0 0 1  1319068524
  23842745  1261965   2349564    14091069   0 0 0 0 0 1  1319068594
  Directory structure:
  /data/submissions/2002/01/01
  ...
  /data/submissions/2012/03/20
  /data/submissions/2012/03/21
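A dated directory layout like this is typically exposed to Hive as an external table partitioned by date. The sketch below is an assumption, not the real Last.fm schema: the column names are guessed from the fields used in the later queries (userid, trackid, artistid, scrobble, listen, unixtime, insertdate), and the column order and types are made up:

```sql
-- Sketch only: schema and delimiter are assumptions based on the queries
-- elsewhere in this deck, not the actual production definition.
CREATE EXTERNAL TABLE data_submissions (
    userid    INT,
    trackid   INT,
    artistid  INT,
    albumid   INT,
    scrobble  INT,
    listen    INT,
    unixtime  BIGINT
)
PARTITIONED BY (insertdate STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/submissions';

-- Each dated directory is then registered as a partition, e.g.:
ALTER TABLE data_submissions ADD PARTITION (insertdate = '2012-03-21')
    LOCATION '/data/submissions/2012/03/21';
```

Queries that filter on insertdate then read only the matching directories rather than the full history.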
• A Hive Query
  select
      track.title, size(collect_set(s.userid)) as reach
  from
      meta_track track
      join data_submissions s on (s.trackid = track.id)
  where
      s.insertdate = '2012-03-01' and (s.scrobble + s.listen > 0)
      and s.artistid = 57976724 -- Lana Del Rey
  group by
      track.title
  order by
      reach desc
  limit 5;
• A Hive Query (running; same query as above)
  Total MapReduce jobs = 3
  Launching Job 1 out of 3
  Number of reduce tasks not specified. Estimated from input data size: 52
  2012-03-19 23:28:58,613 Stage-1 map = 0%,  reduce = 0%
  2012-03-19 23:29:08,765 Stage-1 map = 3%,  reduce = 0%
  2012-03-19 23:29:10,794 Stage-1 map = 9%,  reduce = 0%
• A Hive Query (results; same query as above)
  Born to Die       10765
  Video Games        9382
  Off to the Races   6569
  Blue Jeans         6266
  National Anthem    5795
  ~300 seconds
• Examples
  - Trends in UK Listening
  - Hadoop User Group Charts
• Trends in UK Listening (three chart slides)
• select
      artistid, hourOfDay,
      meanPlays, stdPlays, meanReach, stdReach, hoursInExistence,
      meanPlays / sqrt(hoursInExistence) as stdErrPlays,
      meanReach / sqrt(hoursInExistence) as stdErrReach
  from
      (select
          artistCounts.artistid as artistid, artistCounts.hourOfDay,
          avg(artistCounts.plays) as meanPlays, stddev_samp(artistCounts.plays) as stdPlays,
          avg(artistCounts.reach) as meanReach, stddev_samp(artistCounts.reach) as stdReach,
          size(collect_set(concat(artistCounts.insertdate, hourOfDay))) as hoursInExistence
      from
          (select
              artistid, insertdate, hour(from_unixtime(unixtime)) as hourOfDay,
              count(*) as plays, size(collect_set(s.userid)) as reach
          from
              lookups_userid_geo g
              join data_submissions s on (g.userid = s.userid)
          where
              insertdate >= '2011-01-01' and insertdate < '2012-01-01'
              and (listen + scrobble) > 0
              and lower(g.countrycode) = 'gb'
          group by
              artistid, insertdate, hour(from_unixtime(unixtime))
          ) artistCounts
      group by
          artistCounts.artistid, artistCounts.hourOfDay
      ) artistStats
  where
      meanReach > 25;
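The comparison step later in the deck queries tables named omar_uk_artist_base and omar_uk_artist_hours, so the output of a statistics query like the one above is presumably materialised first. A hedged sketch of the pattern, with a deliberately shortened inner query rather than the full one:

```sql
-- Sketch of the CREATE TABLE ... AS SELECT pattern for keeping results around.
-- The table name matches the one queried later in the deck; the SELECT here
-- is a simplified stand-in for the full hourly-statistics query above.
CREATE TABLE omar_uk_artist_hours AS
SELECT
    artistid,
    hour(from_unixtime(unixtime)) AS hourOfDay,
    count(*) AS plays
FROM data_submissions
GROUP BY artistid, hour(from_unixtime(unixtime));
```

Materialising intermediate results this way means the expensive scan over a year of submissions runs once, and the comparison query joins two small summary tables.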
• So far
  - Test data: listening statistics for each artist, in each hour of the day
  - Base data: averaged hourly statistics for each artist
  - Next step: compare them
• Comparison
  select
      test.artistid,
      test.meanReach, base.meanReach,
      test.stdReach, base.stdReach,
      test.stdErrReach, base.stdErrReach,
      (test.meanReach - base.meanReach) / (base.stdReach) as zScore,
      (test.meanReach - base.meanReach) / (base.stdErrReach * test.stdErrReach) as deviation
  from
      omar_uk_artist_base base
      join omar_uk_artist_hours test on (base.artistid = test.artistid)
  where
      test.hourOfDay = 15
  order by
      deviation desc
  limit 5;
• Trends in UK Listening (chart slide)
• Summary
  - Hive is easy to use
  - It sits comfortably on top of a Hadoop infrastructure
  - Familiar if you know SQL
  - Can ask big questions
  - Can ask wide-ranging questions
  - Allows analyses that would otherwise need a lot of preliminary work
• HUG Charts (chart slide)
• Any Questions?