This document provides an overview of how Hive is used at Last.fm to query and analyze log data stored in Hadoop. It describes Hive as a data warehouse that presents data as tables and lets you query them with a language very similar to SQL. Examples show how scrobble log data stored in Hadoop is queried with Hive to compute metrics such as the reach of an artist's songs on a given date. The workflow supports experiments and analysis of streaming music data at scale using Hive and Hadoop.
2. Overview
• Hadoop at Last.fm
• Hive
• Examples

What I want to show you:
• How it fits with a Hadoop infrastructure
• Typical workflow with Hive
• Ease of use for experiments and prototypes
3. Hadoop
• Brief overview of our infrastructure
• How we use it
7. Hive
• What is Hive?
• How does it fit in with the rest of our system?
• Using existing data in Hive
• Example query
8. What is Hive?
• Data warehouse
• You see your data in the form of tables
• Query language very similar to SQL

    hive> show tables like 'omar_charts_*';
    OK
    omar_charts_globaltags_album
    omar_charts_globaltags_artist
    omar_charts_globaltags_track
    omar_charts_tagcloud_album
    omar_charts_tagcloud_artist
    omar_charts_tagcloud_track

    hive> describe omar_charts_tagcloud_album;
    OK
    albumid    int
    tagid      int
    weight     double
9. What is a table?

Standard:
• Metadata stored by Hive
• Table data stored by Hive
• Deleting the table deletes the data and the metadata

External:
• Metadata stored by Hive
• Table data referenced by Hive
• Deleting the table only deletes the metadata
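An external table over existing log files can be declared without copying the data into Hive; a minimal sketch, where the table name, columns, and HDFS path are hypothetical and not our actual schema:

```sql
-- Hypothetical example: expose scrobble logs already in HDFS to Hive.
-- Dropping this table removes only the metadata; the files stay put.
CREATE EXTERNAL TABLE scrobble_log (
  userid   INT,
  trackid  INT,
  unixtime BIGINT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/logs/scrobbles';
```

This is the property that makes "using existing data in Hive" cheap: the log files keep their original location and format, and Hive only records how to read them.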
10. What is a table?
(diagram: database tables and log files, both exposed as Hive tables)
12. A Hive Query

    select track.title, size(collect_set(s.userid)) as reach
    from meta_track track
    join data_submissions s on (s.trackid = track.id)
    where s.insertdate = "2012-03-01"
      and (s.scrobble + s.listen > 0)
      and s.artistid = 57976724 -- Lana Del Rey
    group by track.title
    order by reach desc
    limit 5;
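`size(collect_set(s.userid))` builds the set of distinct listeners per title and takes its size. A hedged alternative formulation of the same query using `count(distinct ...)`, assuming the same tables:

```sql
-- Equivalent distinct-listener count for the same query shape.
select track.title, count(distinct s.userid) as reach
from meta_track track
join data_submissions s on (s.trackid = track.id)
where s.insertdate = "2012-03-01"
  and (s.scrobble + s.listen > 0)
  and s.artistid = 57976724 -- Lana Del Rey
group by track.title
order by reach desc
limit 5;
```

The `collect_set` form is handy when the set itself is needed elsewhere in the query; for a plain distinct count, either works.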
13. A Hive Query
The same query, running as MapReduce jobs:

    Total MapReduce jobs = 3
    Launching Job 1 out of 3
    Number of reduce tasks not specified. Estimated from input data size: 52
    2012-03-19 23:28:58,613 Stage-1 map = 0%, reduce = 0%
    2012-03-19 23:29:08,765 Stage-1 map = 3%, reduce = 0%
    2012-03-19 23:29:10,794 Stage-1 map = 9%, reduce = 0%
14. A Hive Query
The same query, with its result:

    Born to Die         10765
    Video Games          9382
    Off to the Races     6569
    Blue Jeans           6266
    National Anthem      5795

~300 seconds
15. Examples
• Trends in UK Listening
• Hadoop User Group Charts
19.
    select artistid, hourOfDay,
           meanPlays, stdPlays, meanReach, stdReach,
           hoursInExistence,
           meanPlays / sqrt(hoursInExistence) as stdErrPlays,
           meanReach / sqrt(hoursInExistence) as stdErrReach
    from
      (select artistCounts.artistid as artistid,
              artistCounts.hourOfDay,
              avg(artistCounts.plays) as meanPlays,
              stddev_samp(artistCounts.plays) as stdPlays,
              avg(artistCounts.reach) as meanReach,
              stddev_samp(artistCounts.reach) as stdReach,
              size(collect_set(concat(artistCounts.insertdate, hourOfDay))) as hoursInExistence
       from
         (select artistid, insertdate,
                 hour(from_unixtime(unixtime)) as hourOfDay,
                 count(*) as plays,
                 size(collect_set(s.userid)) as reach
          from lookups_userid_geo g
          join data_submissions s on (g.userid = s.userid)
          where insertdate >= '2011-01-01'
            and insertdate < '2012-01-01'
            and (listen + scrobble) > 0
            and lower(g.countrycode) = 'gb'
          group by artistid, insertdate, hour(from_unixtime(unixtime))
         ) artistCounts
       group by artistCounts.artistid, artistCounts.hourOfDay
      ) artistStats
    where meanReach > 25;
20. (the query from slide 19, repeated)
21. (the query from slide 19, repeated)
22. So far
• Test data: listening statistics for each artist, in each hour of the day
• Base data: averaged hourly statistics for each artist
• Next step: compare them
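The intermediate tables joined on the next slide (omar_uk_artist_base and omar_uk_artist_hours) can be materialised with CREATE TABLE ... AS SELECT; a hedged sketch, where artist_hourly_stats is a hypothetical stand-in for the statistics query on slide 19:

```sql
-- Hedged sketch: store the hourly statistics once so later
-- comparisons can join against them instead of recomputing.
CREATE TABLE omar_uk_artist_hours AS
SELECT artistid, hourOfDay,
       meanPlays, stdPlays, meanReach, stdReach,
       meanReach / sqrt(hoursInExistence) AS stdErrReach
FROM artist_hourly_stats;
```

Materialising once keeps the expensive year-long scan out of the comparison step.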
23. Comparison

    select test.artistid,
           test.meanReach, base.meanReach,
           test.stdReach, base.stdReach,
           test.stdErrReach, base.stdErrReach,
           (test.meanReach - base.meanReach) / (base.stdReach) as zScore,
           (test.meanReach - base.meanReach)
             / (base.stdErrReach * test.stdErrReach) as deviation
    from omar_uk_artist_base base
    join omar_uk_artist_hours test on (base.artistid = test.artistid)
    where test.hourOfDay = 15
    order by deviation desc
    limit 5;
25. Summary
• Hive is easy to use
• It sits comfortably on top of a Hadoop infrastructure
• Familiar if you know SQL
• Can ask big questions
• Can ask wide-ranging questions
• Allows analyses that would otherwise need a lot of preliminary work