2. Overview
• Hadoop at Last.fm
• Hive
• Examples
What I want to show you:
• How Hive fits with a Hadoop infrastructure
• Typical workflow with Hive
• Ease of use for experiments and prototypes
3. Hadoop
• Brief overview of our infrastructure
• How we use it
7. Hive
• What is Hive?
• How does it fit in with the rest of our system?
• Using existing data in Hive
• Example query
8. What is Hive?
• Data warehouse
• You see your data in the form of tables
• Query language very similar to SQL
hive> show tables like 'omar_charts_*';
OK
omar_charts_globaltags_album
omar_charts_globaltags_artist
omar_charts_globaltags_track
omar_charts_tagcloud_album
omar_charts_tagcloud_artist
omar_charts_tagcloud_track

hive> describe omar_charts_tagcloud_album;
OK
albumid    int
tagid      int
weight     double
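Querying one of these tables is plain SQL-style HiveQL. As a sketch (not from the deck; the album id is invented), pulling the strongest tags for a single album from the tagcloud table described above might look like:

```sql
-- Hypothetical example: top tags for one album, using the
-- omar_charts_tagcloud_album schema shown above (albumid, tagid, weight).
-- The albumid value is made up for illustration.
select tagid, weight
from omar_charts_tagcloud_album
where albumid = 1234
order by weight desc
limit 10;
```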
9. What is a table?
Standard:
• Metadata stored by Hive
• Table data stored by Hive
• Deleting the table deletes the data and the metadata
External:
• Metadata stored by Hive
• Table data referenced by Hive
• Deleting the table only deletes the metadata
10. What is a table?
[Diagram: a standard table's data lives in Hive-managed database tables; an external table references existing log files]
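The external case is what makes existing log data queryable in place. A minimal DDL sketch, assuming a tab-delimited submissions log on HDFS — the table name, column types, layout and path here are guesses for illustration, not the deck's actual definition:

```sql
-- Sketch (assumptions throughout): an external table over existing log
-- files. Dropping it deletes only the Hive metadata; the files survive.
create external table data_submissions_ext (
  userid   int,
  trackid  int,
  artistid int,
  scrobble int,
  listen   int,
  unixtime bigint
)
partitioned by (insertdate string)
row format delimited fields terminated by '\t'
location '/data/submissions';
```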
12. A Hive Query
select track.title, size(collect_set(s.userid)) as reach
from meta_track track
join data_submissions s on (s.trackid = track.id)
where s.insertdate = "2012-03-01"
  and (s.scrobble + s.listen > 0)
  and s.artistid = 57976724  -- Lana Del Rey
group by track.title
order by reach desc
limit 5;
13. A Hive Query
Total MapReduce jobs = 3
Launching Job 1 out of 3
Number of reduce tasks not specified. Estimated from input data size: 52
2012-03-19 23:28:58,613 Stage-1 map = 0%, reduce = 0%
2012-03-19 23:29:08,765 Stage-1 map = 3%, reduce = 0%
2012-03-19 23:29:10,794 Stage-1 map = 9%, reduce = 0%
14. A Hive Query
Born to Die         10765
Video Games          9382
Off to the Races     6569
Blue Jeans           6266
National Anthem      5795
~300 seconds
15. Examples
• Trends in UK Listening
• Hadoop User Group Charts
19. select artistid, hourOfDay,
       meanPlays, stdPlays, meanReach, stdReach, hoursInExistence,
       meanPlays / sqrt(hoursInExistence) as stdErrPlays,
       meanReach / sqrt(hoursInExistence) as stdErrReach
from
  (select artistCounts.artistid as artistid,
          artistCounts.hourOfDay,
          avg(artistCounts.plays) as meanPlays,
          stddev_samp(artistCounts.plays) as stdPlays,
          avg(artistCounts.reach) as meanReach,
          stddev_samp(artistCounts.reach) as stdReach,
          size(collect_set(concat(artistCounts.insertdate, hourOfDay))) as hoursInExistence
   from
     (select artistid, insertdate,
             hour(from_unixtime(unixtime)) as hourOfDay,
             count(*) as plays,
             size(collect_set(s.userid)) as reach
      from lookups_userid_geo g
      join data_submissions s on (g.userid = s.userid)
      where insertdate >= '2011-01-01'
        and insertdate < '2012-01-01'
        and (listen + scrobble) > 0
        and lower(g.countrycode) = 'gb'
      group by artistid, insertdate, hour(from_unixtime(unixtime))
     ) artistCounts
   group by artistCounts.artistid, artistCounts.hourOfDay
  ) artistStats
where meanReach > 25;
22. So far
• Test data: listening statistics for each artist, in each hour of the day
• Base data: averaged hourly statistics for each artist
• Next step: compare them
23. Comparison
select test.artistid,
       test.meanReach, base.meanReach,
       test.stdReach, base.stdReach,
       test.stdErrReach, base.stdErrReach,
       (test.meanReach - base.meanReach) / (base.stdReach) as zScore,
       (test.meanReach - base.meanReach) / (base.stdErrReach * test.stdErrReach) as deviation
from omar_uk_artist_base base
join omar_uk_artist_hours test on (base.artistid = test.artistid)
where test.hourOfDay = 15
order by deviation desc
limit 5;
25. Summary
• Hive is easy to use
• It sits comfortably on top of a Hadoop infrastructure
• Familiar if you know SQL
• Can ask big questions
• Can ask wide-ranging questions
• Allows analyses that would otherwise need a lot of preliminary work