30 Billion Events a Day with Hadoop
Michael Brown, CTO, comScore, Inc.
May 10th, 2012

30B events a day with hadoop

  1. 30 Billion Events a Day with Hadoop
     Michael Brown, CTO, comScore, Inc.
     May 10th, 2012
  2. comScore is a Global Leader in Measuring the Digital World
       NASDAQ:           SCOR
       Clients:          1860+ worldwide
       Employees:        1000+
       Headquarters:     Reston, VA
       Global Coverage:  170+ countries under measurement; 43 markets reported
       Local Presence:   32 locations in 23 countries
     © comScore, Inc. Proprietary.
  3. Some of our Clients
     Media, Agencies, Telecom/Mobile, Financial, Retail, Travel, CPG, Pharma, Technology
  4. The Trusted Source for Digital Intelligence Across Vertical Markets
       9 out of the top 10    INVESTMENT BANKS
       9 out of the top 10    AUTO INSURERS
       4 out of the top 4     WIRELESS CARRIERS
       11 out of the top 12   INTERNET SERVICE PROVIDERS
       47 out of the top 50   ONLINE PROPERTIES
       14 out of the top 15   PHARMACEUTICAL COMPANIES
       45 out of the top 50   ADVERTISING AGENCIES
       11 out of the top 12   CONSUMER FINANCE COMPANIES
       9 out of the top 10    MAJOR MEDIA COMPANIES
       8 out of the top 10    CPG COMPANIES
  5. Unified Digital Measurement™ (UDM) Establishes Platform For Panel + Census Data Integration
       • Global PERSON Measurement: PANEL
       • Global DEVICE Measurement: CENSUS
       • Unified Digital Measurement (UDM): Patent-Pending Methodology
       • Adopted by 90% of Top 100 U.S. Media Properties
  6. Beacon Heat Map
     [Figure: beacon heat map]
  7. Worldwide Tags per Month
     [Chart: Monthly Records Collection. Y-axis: # of records, 0 to 1,000,000,000,000. X-axis: months, 2009 to 2012. Series: Panel Records, Beacon Records.]
  8. Our Event Volume in Perspective
       Property        Page Views (MM)
       FACEBOOK.COM    472,814
       Google Sites    302,802
       Yahoo! Sites    90,448
       Total           866,064
     Source: comScore MediaMetrix Worldwide April 2012
  9. Growth Slides
     [Chart: growth trend. Y-axis: 0 to 1,600,000,000,000. Trend line R² = 0.9335.]
  10. The Project: Census Web Agg
  11. The Problem Statement
       • Calculate the number of events and unique cookies for each key
       • Key takeaways
         – Data on input will be sessionized daily
         – Need to process all data for a month
         – Need to calculate values for Total Internet and for each site under measurement
  12. Counting Uniques from a Time Ordered Log File
     [Diagram: records arrive in time order: A, D, B, C, B, A, A]
       • Major Downsides:
         – Need to keep all key elements in memory.
         – Constrained to one machine for final aggregation.
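
A minimal Python sketch of the time-ordered approach on slide 12, assuming each log record is a hypothetical (key, cookie) pair. The function name and the sample cookies are illustrative and not from the talk. It shows where the memory cost comes from: every key keeps its full cookie set until the input ends.

```python
from collections import defaultdict


def count_time_ordered(records):
    """Count events and unique cookies per key from records in arrival order.

    records: iterable of (key, cookie) pairs (hypothetical record shape).
    Returns {key: (events, unique_cookies)}.
    """
    events = defaultdict(int)
    cookies = defaultdict(set)  # one cookie set per key, all held in memory
    for key, cookie in records:
        events[key] += 1
        cookies[key].add(cookie)
    return {key: (events[key], len(cookies[key])) for key in events}


# Keys follow the order on the slide (A, D, B, C, B, A, A); cookies are made up.
sample = [("A", "c1"), ("D", "c2"), ("B", "c1"), ("C", "c3"),
          ("B", "c1"), ("A", "c2"), ("A", "c1")]
result = count_time_ordered(sample)
```

Nothing can be emitted for a key until the whole log has been read, which is why this version stays on one machine with all key elements in memory.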
  13. Counting Uniques from a Key Ordered Log File
     [Diagram: records sorted by key: A, A, A, B, B, C, D]
       • Major Downsides:
         – Need to sort data in advance.
         – The sort time increases as volume grows.
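
For contrast, a sketch of the key-ordered approach on slide 13, again assuming hypothetical (key, cookie) pairs, this time already sorted by key and then by cookie. The names are illustrative. Because equal records are adjacent, the loop only has to remember the previous cookie, and each key's result can be emitted as soon as the key changes.

```python
from itertools import groupby
from operator import itemgetter

_UNSET = object()  # sentinel that never equals a real cookie


def count_key_ordered(sorted_records):
    """Stream over (key, cookie) pairs sorted by key, then cookie.

    Yields (key, events, unique_cookies) with constant memory per key.
    """
    for key, group in groupby(sorted_records, key=itemgetter(0)):
        events = 0
        uniques = 0
        previous = _UNSET
        for _, cookie in group:
            events += 1
            if cookie != previous:  # a new cookie starts a new run
                uniques += 1
                previous = cookie
        yield key, events, uniques


sample = [("A", "c1"), ("D", "c2"), ("B", "c1"), ("C", "c3"),
          ("B", "c1"), ("A", "c2"), ("A", "c1")]
result = list(count_key_ordered(sorted(sample)))  # the sort is the up-front cost
```

The `sorted(...)` call stands in for the advance sort that the slide lists as the major downside.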
  14. Scaling Issue
       • As our volume has grown we have the following stats:
         – Over 900 billion events per month
         – Over 150 billion sessions per month
         – Over 5,000 reportable sites
         – Over 50 countries
         – We see 15 billion distinct cookies in a month
         – 5 sites have over 1 billion cookies in a month
         – The sum of all distinct cookies is 377 billion
         – We only need to output 15 million rows
  15. Counting Uniques from a Key Ordered Log File
     [Figure]
  16. Windows v1 (Single Server)
       • Time to process data for first few months
           Month       Wall Time (hours)
           Jul 2009    8
           Aug 2009    10
           Sep 2009    11
           Oct 2009    16
           Nov 2009    37
       • V1 processed sessions at roughly 250K rows/sec
       • Problems with this version:
         – Slow
         – Not scalable
         – Dedicated server
         – Bottleneck for delivering production
  17. Counting Uniques from Sharded Key Ordered Log Files
     [Figure]
  18. Windows v2
       • Features of this version
         – Distributed (32 servers)
         – Multithreaded
         – Data localization
         – Very low network data transfer
         – Handling the data growth
       • The V2 code processed data at over 8 million rows/sec
         – 1 hour for Dec 2009; 5 hours for April 2012
       • Issues
         – Data is distributed by ID into 64 parts
         – Possible skew in the distribution key, which hurts performance and causes high disk usage on a node
         – All data replication is manual, along with recovery
         – Results cannot be calculated if any node is down
         – Adding new servers or changing the number of parts takes a lot of effort
         – Overhead to maintain the framework that runs distributed jobs
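
A rough sketch of the sharding idea on slide 18. The talk only says that data is distributed "by ID into 64 parts"; this example assumes the ID is the cookie ID and uses a CRC32 hash, both of which are assumptions rather than details from the slides. The point it illustrates is that when every cookie lands in exactly one part, per-part unique counts can simply be added.

```python
import zlib

NUM_PARTS = 64  # "64 parts" from the slide


def part_for(cookie_id, num_parts=NUM_PARTS):
    """Map a cookie ID to a part with a stable hash (assumed scheme)."""
    return zlib.crc32(cookie_id.encode("utf-8")) % num_parts


def count_uniques_sharded(records, num_parts=NUM_PARTS):
    """Count unique cookies per key by summing independent per-part counts."""
    parts = [dict() for _ in range(num_parts)]  # per part: key -> set of cookies
    for key, cookie in records:
        parts[part_for(cookie, num_parts)].setdefault(key, set()).add(cookie)

    totals = {}
    for part in parts:  # each part could be counted on a different server
        for key, cookies in part.items():
            totals[key] = totals.get(key, 0) + len(cookies)
    return totals


sample = [("A", "c1"), ("D", "c2"), ("B", "c1"), ("C", "c3"),
          ("B", "c1"), ("A", "c2"), ("A", "c1")]
totals = count_uniques_sharded(sample)
```

The same property explains two of the listed issues: a hot ID range skews one part, and a missing part makes every total incomplete.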
  19. Enter the Elephant
       • Why Hadoop?
         – Scalable
         – Low risk of data loss thanks to replication
         – Runs on a shared production cluster
         – No overhead to maintain a framework
         – Easy job submission and management
  20. Basic Approach
       • Leverage Pig for POC
         – Pig Latin is easy for developers and data analysts to learn
         – Rapid application development vs. M/R applications (i.e. 1 line of Pig Latin = 20 lines in Java Map/Reduce)
         – Extendable via UDFs
  21. Performance of Basic Approach on Various Samples
     [Chart: Aggregation Performance. Y-axis: Time (minutes), 0 to 80. X-axis: Input data size: 372 GB (3%), 744 GB (6%), 1116 GB (9%).]
     Note: Target data size is over 10 TB
  22. M/R Data Flow
     [Diagram: unsorted input (B, C, A, B, C, A) is read by three mappers; the shuffle groups records by key (A A, B B, C C); three reducers output A, B and C.]
  23. Basic Approach Retrospective
       • Processing speed is not scaling to our needs on a sample of the input data
       • Diagnosis
         – Most aggregations could not take significant advantage of combiners. Not a Pig issue.
         – Large shuffles caused poor job performance. In some cases large aggregations ran slower on the Hadoop cluster than on the current architecture
       • Conclusion
         – A new approach is required to reduce the shuffle
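
The diagnosis on slide 23 can be illustrated with a toy model in Python. This is not Hadoop code: the function names, the (key, cookie) record shape and the sample splits are all hypothetical. It counts how many rows a mapper still has to send to the shuffle after local combining, first when each mapper sees an arbitrary slice of the log and then when each split holds complete keys.

```python
def shuffled_rows_arbitrary_splits(splits):
    """Rows left to shuffle when each mapper sees an arbitrary slice of the log.

    A combiner can only collapse repeats of the same (key, cookie) pair inside
    its own split, so every locally distinct pair must still be sent on.
    """
    return sum(len(set(split)) for split in splits)


def shuffled_rows_key_partitioned_splits(splits):
    """Rows left to shuffle when each split holds all records for its keys.

    The mapper can finish the count itself and emit one
    (key, events, uniques) row per key.
    """
    return sum(len({key for key, _ in split}) for split in splits)


arbitrary = [
    [("A", "c1"), ("B", "c1"), ("A", "c2")],
    [("A", "c1"), ("B", "c2"), ("A", "c2")],
]
by_key = [
    [("A", "c1"), ("A", "c2"), ("A", "c1"), ("A", "c2")],
    [("B", "c1"), ("B", "c2")],
]
rows_arbitrary = shuffled_rows_arbitrary_splits(arbitrary)
rows_by_key = shuffled_rows_key_partitioned_splits(by_key)
```

With unique-cookie counts the first number grows with the number of distinct cookies, while the second grows only with the number of keys, which is the gap a new approach would need to close.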
  24. Solution to reduce the shuffle
       • The Problem:
         – Most aggregations within comScore cannot take advantage of combiners, leading to large shuffles and job performance issues
       • The Idea:
         – Partition and sort data on a daily basis
         – Create a custom input format to merge daily partitions for monthly aggregations
  25. Custom Input Format with Map Side Aggregation
     [Diagram: input (B, C, A, B, C, A) is delivered through the custom input format so that each mapper receives one key (A, B or C); each mapper is followed by a combiner; three reducers output A, B and C.]
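
A sketch of the merge idea from slides 24 and 25, written in plain Python rather than as a Hadoop input format. It assumes each daily partition is already sorted by (key, cookie); the record shape, function names and sample days are hypothetical. `heapq.merge` stands in for the custom input format by lazily merging the sorted daily inputs, and the aggregation then runs as a single streaming pass, which is the map-side work the diagram describes.

```python
import heapq
from itertools import groupby
from operator import itemgetter

_UNSET = object()  # sentinel that never equals a real cookie


def merge_daily_partitions(daily_partitions):
    """Lazily merge daily inputs that are each sorted by (key, cookie)."""
    return heapq.merge(*daily_partitions)


def aggregate_sorted(records):
    """Yield (key, events, unique_cookies) from a merged, sorted stream."""
    for key, group in groupby(records, key=itemgetter(0)):
        events = 0
        uniques = 0
        previous = _UNSET
        for _, cookie in group:
            events += 1
            if cookie != previous:
                uniques += 1
                previous = cookie
        yield key, events, uniques


day1 = [("A", "c1"), ("A", "c2"), ("B", "c1")]
day2 = [("A", "c1"), ("B", "c1"), ("C", "c3")]
day3 = [("D", "c2")]
monthly = list(aggregate_sorted(merge_daily_partitions([day1, day2, day3])))
```

Sorting once per day keeps the expensive step small, and the monthly job only pays for a merge.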
  26. Performance of v2 on Various Samples
     [Chart: Aggregation Performance. Y-axis: Time (minutes), 0 to 120. X-axis: Input data size: 372 GB (3%), 744 GB (6%), 1116 GB (9%), 10304 GB (100%). Series: Pig, Custom Input Format.]
  27. Partitioning Summary
       • Benefits:
         – A large portion of the aggregation can be completed in the map phase
         – Applications can now take advantage of combiners
         – Shuffle sizes are minimal
       • Risks:
         – Data locality loss
         – Map failures might result in long run times. This depends on the size of the partitions
  28. Full Sample Performance
       • Full set of data analysis
         – 10 TB of input data
         – 150 billion session rows
       • Total Time
         – 1 hour, 45 minutes
         – Over 23,000,000 rows/sec
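
The throughput figure on slide 28 can be checked with a few lines of Python. The Hadoop numbers come straight from the slide; the Windows v2 comparison assumes that April 2012 held roughly the same 150 billion session rows, which the slides suggest but do not state.

```python
SESSION_ROWS = 150_000_000_000        # "150 billion session rows" (slide 28)
HADOOP_SECONDS = (1 * 60 + 45) * 60   # "1 hour, 45 minutes" (slide 28)

hadoop_rows_per_sec = SESSION_ROWS / HADOOP_SECONDS

# Assumption: Windows v2 at "over 8 million rows/sec" on the same row count.
WINDOWS_V2_ROWS_PER_SEC = 8_000_000
windows_v2_hours = SESSION_ROWS / WINDOWS_V2_ROWS_PER_SEC / 3600

print(f"Hadoop: {hadoop_rows_per_sec:,.0f} rows/sec")
print(f"Windows v2 estimate: {windows_v2_hours:.1f} hours")
```

This gives about 23.8 million rows/sec for the Hadoop run, in line with the "over 23,000,000" on the slide, and about 5.2 hours for Windows v2, in line with the 5 hours quoted for April 2012.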
  29. Future Ideas
       • HBase
         – Unique cookie calculations are free as data is more organized
         – How will data loading fare?
       • Data Locality
         – Ideally we could provide additional hints about where the data is stored
         – Not sure if it will be included in Hadoop
       • Connection to an MPP DB
         – We also leverage Greenplum DB; we could connect to each sharded instance
  30. Hadoop Cluster
       • Production Hadoop Cluster
         – 80 nodes: mix of Dell R710 and R510
         – Each R510 has 12x2TB drives, 64GB RAM and 24 cores
         – 1768 total CPUs
         – 4.7TB total memory
         – 1200TB total disk space
         – Our distro is MapR M5 1.2.7
  31. Useful Factoids
     Colorful, bite-sized graphical representations of the best discoveries we unearth. Visit www.comscoredatamine.com or follow @datagems for the latest gems.
  32. Thank You!
     Michael Brown, CTO, comScore, Inc.
     mbrown@comscore.com
