At MediaMath, we deal with billions of records every day. One of our biggest challenges is hourly reporting of attribution data: joining billions of records to millions of events. How did we solve this hourly attribution reporting problem? We will walk through our evaluation, testing, and fine-tuning of a variety of tools, including Netezza, Hive, and Pig, and how we ultimately chose Cloudera's Impala.
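To make the join concrete, here is a minimal sketch of last-touch attribution in Python: for each conversion event, find the most recent ad impression by the same user before the event. Function and record names are illustrative, not MediaMath's actual schema or implementation.

```python
from bisect import bisect_left
from collections import defaultdict

def attribute(impressions, events):
    """Hypothetical last-touch attribution sketch.

    impressions: iterable of (user_id, timestamp)
    events: iterable of (user_id, timestamp)
    Returns a list of (event, attributed_impression_or_None).
    """
    # Index impressions by user, then sort timestamps for binary search.
    by_user = defaultdict(list)
    for user, ts in impressions:
        by_user[user].append(ts)
    for ts_list in by_user.values():
        ts_list.sort()

    results = []
    for user, ev_ts in events:
        ts_list = by_user.get(user, [])
        # Index of the last impression strictly before the event time.
        i = bisect_left(ts_list, ev_ts) - 1
        attributed = (user, ts_list[i]) if i >= 0 else None
        results.append(((user, ev_ts), attributed))
    return results
```

At our scale this per-user lookup becomes a distributed join of the bid/impression logs against the event logs, which is exactly the workload the tools below were evaluated on.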
3. About MediaMath: Overview of Real-Time Bidding
[Diagram: a real-time auction completes in <30 ms, connecting the advertiser (client) to a user viewing an ad on www.cnn.com]
4. About MediaMath: Overview of Real-Time Bidding
[Diagram: the user sees the ad on www.cnn.com, later purchases on www.shoes.com ($$, "Purchased!"), and these interactions are captured in event logs]
5. About MediaMath
• Ad opportunities: 80-100 billion per day
  (1.2 million opportunities per second at peak)
• We bid on 30-40 billion ads per day
• We serve 1-2 billion ads per day
• 15-20 million events (click, sale, online sign-up) per hour
• 2 TB of data daily (compressed)
  Note: this only counts our wins. If we count losses, we easily reach PBs.
10. Evaluated Options: Round 1
• Hive
  Run time: took 5-6 hours to complete
  Stability: High
• Pig
  Run time: took 4-5 hours to complete
  Stability: High
• Impala Beta (0.6)
  Run time: took 2-3 hours to complete
  Stability: Low
11. Evaluated Options: Round 2
• Hive: post-tuning (map joins, bucketing, split size, etc.)
  Run time: took 2-3 hours to complete
  Stability: High
• Impala GA (1.0) (LZO compression, slicing, tuning, hardware upgrade)
  Run time: took 30 minutes to complete
  Stability: High
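The Hive tuning levers named above might look like the following in HiveQL. This is an illustrative sketch, not our actual job: table names, column names, and bucket counts are hypothetical, and the settings shown (`mapred.max.split.size`, `hive.auto.convert.join`, `hive.optimize.bucketmapjoin`, the `MAPJOIN` hint) are standard Hive knobs of that era.

```sql
-- Smaller input splits => more mappers scanning the large bid logs.
SET mapred.max.split.size=67108864;

-- Let Hive convert the join to a map join when the small side
-- (millions of events vs. billions of bids) fits in memory.
SET hive.auto.convert.join=true;

-- Bucketing both tables on the join key enables bucketed map joins.
SET hive.optimize.bucketmapjoin=true;

-- Hypothetical bucketed table for the large side of the join.
CREATE TABLE bids_bucketed (user_id STRING, ts BIGINT, cost DOUBLE)
CLUSTERED BY (user_id) INTO 256 BUCKETS;

-- Hint the small events table into memory on each mapper.
SELECT /*+ MAPJOIN(e) */ b.user_id, b.ts, e.event_type
FROM bids_bucketed b
JOIN events e ON b.user_id = e.user_id;
```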
12. Data Warehouse Architecture 2011
[Diagram: bid logs, pixel logs, and metadata flow through an ELT pipeline into Netezza (2011), which performs attribution, aggregation, and reports, feeding multiple reporting data marts]
13. Data Warehouse Architecture 2013
[Diagram: the same pipeline, but attribution, aggregation, and reports have moved from Netezza (2011) to Hadoop (2013); bid logs, pixel logs, and metadata still feed the reporting data marts via ELT]