This document provides an overview of how Hive is used at Last.fm to query and analyze log data stored in Hadoop. It describes Hive as a data warehouse that presents data as tables and lets you query them with a language very similar to SQL. Examples show how scrobble log data stored in Hadoop is queried with Hive to compute metrics such as the reach of an artist's songs on a given date. The workflow supports experiments and analysis of streaming music data at scale using Hive and Hadoop.
2. Overview
• Hadoop at Last.fm
• Hive
• Examples

What I want to show you:
• How it fits with a Hadoop infrastructure
• Typical workflow with Hive
• Ease of use for experiments and prototypes
3. Hadoop
• Brief overview of our infrastructure
• How we use it
7. Hive
• What is Hive?
• How does it fit in with the rest of our system?
• Using existing data in Hive
• Example query
8. What is Hive?
• Data warehouse
• You see your data in the form of tables
• Query language very similar to SQL

    hive> show tables like 'omar_charts_*';
    OK
    omar_charts_globaltags_album
    omar_charts_globaltags_artist
    omar_charts_globaltags_track
    omar_charts_tagcloud_album
    omar_charts_tagcloud_artist
    omar_charts_tagcloud_track

    hive> describe omar_charts_tagcloud_album;
    OK
    albumid    int
    tagid      int
    weight     double
9. What is a table?

Standard:
• Metadata stored by Hive
• Table data stored by Hive
• Deleting the table deletes the data and the metadata

External:
• Metadata stored by Hive
• Table data referenced by Hive
• Deleting the table only deletes the metadata
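An external table over existing log files can be declared without copying the data into Hive; a minimal sketch, where the table name, columns, and HDFS path are hypothetical and not our actual schema:

```sql
-- Hypothetical example: expose scrobble logs already in HDFS to Hive.
-- Dropping this table removes only the metadata; the files stay put.
CREATE EXTERNAL TABLE scrobble_log (
  userid   INT,
  trackid  INT,
  unixtime BIGINT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/logs/scrobbles';
```

This is the property that makes "using existing data in Hive" cheap: the log files keep their original location and format, and Hive only records how to read them.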
10. What is a table?
(diagram: database tables and log files, both exposed as Hive tables)
12. A Hive Query

    select track.title, size(collect_set(s.userid)) as reach
    from meta_track track
    join data_submissions s on (s.trackid = track.id)
    where s.insertdate = "2012-03-01"
      and (s.scrobble + s.listen > 0)
      and s.artistid = 57976724 -- Lana Del Rey
    group by track.title
    order by reach desc
    limit 5;
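`size(collect_set(s.userid))` builds the set of distinct listeners per title and takes its size. A hedged alternative formulation of the same query using `count(distinct ...)`, assuming the same tables:

```sql
-- Equivalent distinct-listener count for the same query shape.
select track.title, count(distinct s.userid) as reach
from meta_track track
join data_submissions s on (s.trackid = track.id)
where s.insertdate = "2012-03-01"
  and (s.scrobble + s.listen > 0)
  and s.artistid = 57976724 -- Lana Del Rey
group by track.title
order by reach desc
limit 5;
```

The `collect_set` form is handy when the set itself is needed elsewhere in the query; for a plain distinct count, either works.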
13. A Hive Query
The same query, running as MapReduce jobs:

    Total MapReduce jobs = 3
    Launching Job 1 out of 3
    Number of reduce tasks not specified. Estimated from input data size: 52
    2012-03-19 23:28:58,613 Stage-1 map = 0%, reduce = 0%
    2012-03-19 23:29:08,765 Stage-1 map = 3%, reduce = 0%
    2012-03-19 23:29:10,794 Stage-1 map = 9%, reduce = 0%
14. A Hive Query
The same query, with its result:

    Born to Die         10765
    Video Games          9382
    Off to the Races     6569
    Blue Jeans           6266
    National Anthem      5795

~300 seconds
15. Examples
• Trends in UK Listening
• Hadoop User Group Charts
19.
    select artistid, hourOfDay,
           meanPlays, stdPlays, meanReach, stdReach,
           hoursInExistence,
           meanPlays / sqrt(hoursInExistence) as stdErrPlays,
           meanReach / sqrt(hoursInExistence) as stdErrReach
    from
      (select artistCounts.artistid as artistid,
              artistCounts.hourOfDay,
              avg(artistCounts.plays) as meanPlays,
              stddev_samp(artistCounts.plays) as stdPlays,
              avg(artistCounts.reach) as meanReach,
              stddev_samp(artistCounts.reach) as stdReach,
              size(collect_set(concat(artistCounts.insertdate, hourOfDay))) as hoursInExistence
       from
         (select artistid, insertdate,
                 hour(from_unixtime(unixtime)) as hourOfDay,
                 count(*) as plays,
                 size(collect_set(s.userid)) as reach
          from lookups_userid_geo g
          join data_submissions s on (g.userid = s.userid)
          where insertdate >= '2011-01-01'
            and insertdate < '2012-01-01'
            and (listen + scrobble) > 0
            and lower(g.countrycode) = 'gb'
          group by artistid, insertdate, hour(from_unixtime(unixtime))
         ) artistCounts
       group by artistCounts.artistid, artistCounts.hourOfDay
      ) artistStats
    where meanReach > 25;
20. (the query from slide 19, repeated)
21. (the query from slide 19, repeated)
22. So far
• Test data: listening statistics for each artist, in each hour of the day
• Base data: averaged hourly statistics for each artist
• Next step: compare them
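The intermediate tables joined on the next slide (omar_uk_artist_base and omar_uk_artist_hours) can be materialised with CREATE TABLE ... AS SELECT; a hedged sketch, where artist_hourly_stats is a hypothetical stand-in for the statistics query on slide 19:

```sql
-- Hedged sketch: store the hourly statistics once so later
-- comparisons can join against them instead of recomputing.
CREATE TABLE omar_uk_artist_hours AS
SELECT artistid, hourOfDay,
       meanPlays, stdPlays, meanReach, stdReach,
       meanReach / sqrt(hoursInExistence) AS stdErrReach
FROM artist_hourly_stats;
```

Materialising once keeps the expensive year-long scan out of the comparison step.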
23. Comparison

    select test.artistid,
           test.meanReach, base.meanReach,
           test.stdReach, base.stdReach,
           test.stdErrReach, base.stdErrReach,
           (test.meanReach - base.meanReach) / (base.stdReach) as zScore,
           (test.meanReach - base.meanReach)
             / (base.stdErrReach * test.stdErrReach) as deviation
    from omar_uk_artist_base base
    join omar_uk_artist_hours test on (base.artistid = test.artistid)
    where test.hourOfDay = 15
    order by deviation desc
    limit 5;
25. Summary
• Hive is easy to use
• It sits comfortably on top of a Hadoop infrastructure
• Familiar if you know SQL
• Can ask big questions
• Can ask wide-ranging questions
• Allows analyses that would otherwise need a lot of preliminary work