Analyzing 1.4 trillion events with Hadoop
1. Using Hadoop to Process a Trillion+ Events
Michael Brown, CTO | March 2012
© comScore, Inc. Proprietary.
2. comScore is a leading internet technology company that provides Analytics for a Digital World™
NASDAQ: SCOR
Clients: 2,100+ worldwide
Employees: 1,000+
Headquarters: Reston, Virginia, USA
Global Coverage: Measurement from 172 countries; 44 markets reported
Local Presence: 32 locations in 23 countries
Big Data: Over 1.5 trillion digital interactions captured monthly
3. Some of our Clients
Media Agencies Telecom/Mobile Financial Retail Travel CPG Pharma Technology
4. The Trusted Source for Digital Intelligence Across Vertical Markets
9 out of the top 10 investment banks
9 out of the top 10 auto insurers
4 out of the top 4 wireless carriers
11 out of the top 12 internet service providers
47 out of the top 50 online properties
14 out of the top 15 pharmaceutical companies
45 out of the top 50 advertising agencies
11 out of the top 12 consumer finance companies
9 out of the top 10 major media companies
8 out of the top 10 CPG companies
5. Vocabulary for Measuring Information
If a Grain of Sand were One Byte of Information . . .
1 Megabyte = 1 million bytes: a tablespoon of sand
1 Gigabyte = 1 billion bytes: a patch of sand, 9" square, 1' deep
1 Terabyte = 1 trillion bytes: a sandbox, 24' square, 1' deep
1 Petabyte = 1,000 terabytes: a mile-long beach, 100' wide, 1' deep
1 Exabyte = 1,000 petabytes: the same beach, from Maine to North Carolina
1 Zettabyte = 1,000 exabytes: the same beach, along the entire US coast
1 Yottabyte = 1,000 zettabytes (24 zeroes): enough info to bury the entire US under 296 feet of sand
6. Worldwide Tags per Month
[Chart: number of records per month, 2009 to 2013; y-axis "# of records" from 0 to 1,600,000,000,000; series: Panel Records, Beacon Records]
8. Our Event Volume in Perspective
Top 65 WW Properties – Cumulative Page Views
[Chart: cumulative page views of the top 65 worldwide properties; y-axis from 0 to 1,600,000]
Source: comScore MediaMetrix Worldwide December 2012
9. Daily Records Collection Trend
[Chart: daily records collected, Jul 2009 to Jan 2014; one axis "# of census records" up to 50,000,000,000, the other "# of panel records" up to 5,000,000,000; series: Beacon Records and Panel Records, each with a linear trend line (R² = 0.940 and R² = 0.822)]
12. The Problem Statement
Calculate the number of events and unique cookies for each reportable
campaign element
Key takeaways
Input data will be aggregated daily
Need to process all data for 3 months
Need to calculate values for every day in the 92-day period, spanning all reportable campaign elements
13. Structure of the Required Output
Client Campaign Population Location Cookie Ct Period
1234 160873284 840 1 863,185 1
1234 160873284 840 1 1,719,738 2
1234 160873284 840 1 2,631,624 3
1234 160873284 840 1 3,572,163 4
1234 160873284 840 1 4,445,508 5
1234 160873284 840 1 5,308,532 6
1234 160873284 840 1 6,032,073 7
1234 160873284 840 1 6,710,645 8
1234 160873284 840 1 7,421,258 9
1234 160873284 840 1 8,154,543 10
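Cookie Ct grows with Period in the rows above, which suggests a running count of distinct cookies from day 1 through each period. The following is a minimal Python sketch under that assumption. The record layout and the function name `cumulative_uniques` are hypothetical and are not taken from the talk.

```python
from collections import defaultdict

def cumulative_uniques(records, num_periods):
    """records: iterable of (group_key, cookie, period), period in 1..num_periods.

    Returns {group_key: [uniques_through_period_1, ..., uniques_through_period_N]}.
    """
    # Earliest period in which each (group, cookie) pair was seen.
    first_seen = {}
    for group, cookie, period in records:
        key = (group, cookie)
        if key not in first_seen or period < first_seen[key]:
            first_seen[key] = period

    # Number of cookies that appear for the first time in each period.
    new_per_period = defaultdict(lambda: [0] * num_periods)
    for (group, _cookie), period in first_seen.items():
        new_per_period[group][period - 1] += 1

    # A running total turns "new in period p" into "unique through period p".
    result = {}
    for group, new_counts in new_per_period.items():
        running, totals = 0, []
        for n in new_counts:
            running += n
            totals.append(running)
        result[group] = totals
    return result
```

This in-memory version keeps every (group, cookie) pair in a dictionary, so it only illustrates the shape of the output, not a way to process data at the volumes described in the talk.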
14. Counting Uniques from a Time Ordered Log File
[Diagram: time-ordered log with entries A, D, B, C, B, A, A]
Major Downsides:
Need to keep all key elements in memory.
Constrained to one machine for final aggregation.
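A minimal Python sketch of counting uniques when keys arrive in time order. The function name is hypothetical. It shows why memory is the constraint: the set must hold every distinct key seen so far.

```python
def count_uniques_unsorted(keys):
    """Count distinct keys in a stream that arrives in arbitrary (time) order."""
    seen = set()  # grows with the number of distinct keys
    for key in keys:
        seen.add(key)
    return len(seen)
```

For the sequence on the slide, `count_uniques_unsorted(["A", "D", "B", "C", "B", "A", "A"])` finds 4 distinct keys, and the set holds all 4 at the end.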
15. First Version
Java Map-Reduce application that processes pre-aggregated data from 92 days
Map reads the data and emits each cookie as the key of the key-value pair
All 170B records go through the shuffle
Each Reducer gets all the data for a particular campaign, sorted by cookie
Reducer aggregates the data by grouping key (Client / Campaign / Population) and calculates unique cookies for periods 1-92
Volume grew rapidly, to the point that the daily processing took more than a day
16. M/R Data Flow
[Diagram: three mappers read unsorted input keys (B, C, A, B, C, A); the shuffle groups the map output by key (A A, B B, C C); three reducers each emit one key (A, B, C)]
17. Scaling Issue
As our volume has grown we have the following stats:
Over 500 billion events per month
Daily Aggregate 1.5 billion (and growing)
170 billion aggregate records for 92 days
70K Campaigns
Over 50 countries
We see 15 billion distinct cookies in a month
We only need to output 25 million rows
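The gap between input and output size can be made concrete with a small calculation. This is only arithmetic on the figures quoted above; the variable names are illustrative.

```python
input_records = 170_000_000_000  # aggregate records for 92 days
output_rows = 25_000_000         # rows in the final output

# How many input records are collapsed into each output row, on average.
reduction_ratio = input_records / output_rows
print(f"{reduction_ratio:,.0f} input records per output row")
```

On average, several thousand input records collapse into each output row, which is why pushing every record through the shuffle is so costly relative to the size of the result.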
18. Basic Approach Retrospective
Processing speed is not scaling to our needs on a sample of the input data
Diagnosis
Most aggregations could not take significant advantage of combiners.
Large shuffles caused poor job performance. In some cases large aggregations ran slower on the Hadoop cluster because of the shuffle and skew in the data for keys.
Conclusion
A new approach is required to reduce the shuffle
19. Counting Uniques from a Key Ordered Log File
[Diagram: key-ordered log with entries A, A, A, B, B, C, D]
Major Downsides:
Need to sort data in advance.
The sort time increases as volume grows.
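A minimal Python sketch of the key-ordered case. The function name is hypothetical. Because equal keys are adjacent, only the previous key has to be remembered, so memory use stays constant regardless of how many distinct keys there are.

```python
def count_uniques_sorted(sorted_keys):
    """Count distinct keys in a stream that is already sorted by key."""
    count = 0
    previous = object()  # sentinel that compares unequal to any real key
    for key in sorted_keys:
        if key != previous:
            count += 1
            previous = key
    return count
```

For the sequence on the slide, `count_uniques_sorted(["A", "A", "A", "B", "B", "C", "D"])` finds 4 distinct keys. The cost has moved into the sort that must happen before this function is called.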
22. Solution to reduce the shuffle
The Problem:
Aggregations cannot take advantage of combiners, leading to large shuffles and job performance issues
The Idea:
Partition and sort the data by cookie on a daily basis
Create a custom InputFormat to merge daily partitions for monthly aggregations
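The idea of merging daily partitions that are each sorted by cookie can be sketched in plain Python with a k-way merge. This is not Hadoop code and not the InputFormat from the talk; the function names and the in-memory lists standing in for daily partition files are assumptions for illustration.

```python
import heapq

def merge_daily_partitions(daily_partitions):
    """Lazily merge per-day sequences that are each sorted by cookie."""
    return heapq.merge(*daily_partitions)

def count_unique_cookies(daily_partitions):
    """Count distinct cookies across all days without a global re-sort."""
    count = 0
    previous = object()  # sentinel that compares unequal to any real cookie
    for cookie in merge_daily_partitions(daily_partitions):
        if cookie != previous:
            count += 1
            previous = cookie
    return count

# Three days of data for one partition, each already sorted by cookie.
days = [["a", "c"], ["a", "b"], ["b", "d"]]
print(count_unique_cookies(days))
```

Since each day is sorted once when it is written, the monthly job only merges already-sorted streams and sees all records for a cookie together, which is what lets the aggregation happen on the map side.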
23. Custom Input Format with Map Side Aggregation
[Diagram: the custom input format delivers the input keys (B, C, A, B, C, A) so that each of three mappers receives a single key (A, B, or C); each mapper feeds a combiner, and three reducers emit A, B, and C]
24. Risks for Partitioning
Data locality
Custom InputFormat requires reading blocks of the partitioned data over the network
This was solved using a feature of the MapR file system: we created volumes and set the chunk size to zero, which guarantees that data written to a volume stays on one node
Map failures might result in long run times
Size of the map inputs is no longer set by block size
This was solved by creating a large number (10K) of volumes to limit the size of the data processed by each mapper
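For every day's data for a cookie to land in the same volume, the cookie-to-volume assignment has to be deterministic. The talk does not say how cookies were assigned, so the sketch below assumes simple hash partitioning; `volume_for_cookie` and the use of CRC32 are illustrative choices, not the talk's method.

```python
import zlib

NUM_VOLUMES = 10_000  # the slide mentions roughly 10K volumes

def volume_for_cookie(cookie_id: str, num_volumes: int = NUM_VOLUMES) -> int:
    """Map a cookie to a volume index that is stable across days and processes."""
    # CRC32 is used instead of Python's built-in hash(), which is randomized
    # per process for strings and would not give stable assignments.
    return zlib.crc32(cookie_id.encode("utf-8")) % num_volumes
```

With a stable assignment like this, each mapper can read one volume's daily files and see every record for the cookies in that volume.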
25. Partitioning Summary
Benefits:
A large portion of the aggregation can be completed in the map phase
Applications can now take advantage of combiners
Shuffle sizes are minimal
Results:
Took a job from 35 hours to 3 hours with no hardware changes
26. Our Cluster
Production Hadoop Cluster
120 nodes: Mix of Dell 720xd, R710 and R510 servers
Each R510 has 12x2TB drives, 64GB RAM, and 24 cores
3000+ total CPUs
6.0TB total memory
2PB total disk space
Our distro is MapR M5 2.1.0
27. Useful Factoids
Colorful, bite-sized graphical representations of the best discoveries we unearth.
Visit www.comscoredatamine.com or follow @datagems for the latest gems.
28. Thank You!
Michael Brown
CTO
comScore, Inc.
mbrown@comscore.com
29. Diagram
Editor's Notes
Key Message: comScore is a global internet technology company providing customers with Analytics for a Digital World.
Supporting Talking Points: Founded in 1999, comScore is best known as the gold standard for measuring digital activity, including website visitation, search, video, social, and digital advertising. comScore's data and technologies are well-established, crucial components in measuring and analyzing the rapidly evolving digital world, and are widely deployed at a broad range of publishers, advertising agencies, advertisers, retailers and telecom operators, both in the US and internationally.
In 2011, 400 exabytes of storage was shipped by drive manufacturers.
April 2012 Data