Analyzing 1.4 trillion events with Hadoop

Published in: Technology, Business
  • Key Message: comScore is a global internet technology company providing customers with Analytics for a Digital World. Supporting Talking Points: Founded in 1999, comScore is best known as the gold standard for measuring digital activity, including website visitation, search, video, social, and digital advertising. comScore's data and technologies are well-established, crucial components in measuring and analyzing the rapidly evolving digital world, and are widely deployed at a broad range of publishers, advertising agencies, advertisers, retailers and telecom operators, both in the US and internationally.
  • In 2011, 400 exabytes of storage were shipped by drive manufacturers.
  • April 2012 Data
  • Analyzing 1.4 trillion events with Hadoop

    1. Using Hadoop to Process a Trillion+ Events. Michael Brown, CTO | March 2012. © comScore, Inc. Proprietary.
    2. comScore is a leading internet technology company that provides Analytics for a Digital World™
        • NASDAQ: SCOR
        • Clients: 2,100+ worldwide
        • Employees: 1,000+
        • Headquarters: Reston, Virginia, USA
        • Global Coverage: measurement from 172 countries; 44 markets reported
        • Local Presence: 32 locations in 23 countries
        • Big Data: over 1.5 trillion digital interactions captured monthly
    3. Some of our Clients: Media, Agencies, Telecom/Mobile, Financial, Retail, Travel, CPG, Pharma, Technology
    4. The Trusted Source for Digital Intelligence Across Vertical Markets
        • 9 out of the top 10 investment banks
        • 9 out of the top 10 auto insurers
        • 4 out of the top 4 wireless carriers
        • 11 out of the top 12 internet service providers
        • 47 out of the top 50 online properties
        • 14 out of the top 15 pharmaceutical companies
        • 45 out of the top 50 advertising agencies
        • 11 out of the top 12 consumer finance companies
        • 9 out of the top 10 major media companies
        • 8 out of the top 10 CPG companies
    5. Vocabulary for Measuring Information: If a Grain of Sand were One Byte of Information . . .
        • 1 Megabyte = 1 million bytes: a tablespoon of sand
        • 1 Gigabyte = 1 billion bytes: a patch of sand, 9” square, 1’ deep
        • 1 Terabyte = 1 trillion bytes: a sandbox, 24’ square, 1’ deep
        • 1 Petabyte = 1,000 terabytes: a mile-long beach, 100’ wide, 1’ deep
        • 1 Exabyte = 1,000 petabytes: the same beach from Maine to North Carolina
        • 1 Zettabyte = 1,000 exabytes: the same beach along the entire US coast
        • 1 Yottabyte = 1,000 zettabytes (24 zeroes): enough info to bury the entire US under 296 feet of sand
    6. Worldwide Tags per Month
        [Chart: number of records per month, 2009 to 2013, on an axis from 0 to 1.6 trillion. Series: Panel Records, Beacon Records.]
    7. Beacon Heat Map
    8. Our Event Volume in Perspective
        [Chart: Top 65 WW Properties – Cumulative Page Views, on an axis from 0 to 1,600,000. Source: comScore MediaMetrix Worldwide, December 2012.]
    9. Daily Records Collection Trend
        [Chart: daily number of census (beacon) records, on an axis up to 50 billion, and daily number of panel records, on an axis up to 5 billion, from Jul 2009 to Jan 2014. Series: Beacon Records, Panel Records, each with a linear trend line. Reported fits: R² = 0.940 and R² = 0.822.]
    10. The Project: vCE – Validated Campaign Essentials
    11. comScore - vCE
    12. The Problem Statement
        Calculate the number of events and unique cookies for each reportable campaign element.
        Key takeaways:
        • Data on input will be aggregated daily
        • Need to process all data for 3 months
        • Need to calculate values for every day in the 92-day period spanning all reportable campaign elements
    13. Structure of the Required Output
        Client  Campaign   Population  Location  Cookie Ct  Period
        1234    160873284  840         1           863,185       1
        1234    160873284  840         1         1,719,738       2
        1234    160873284  840         1         2,631,624       3
        1234    160873284  840         1         3,572,163       4
        1234    160873284  840         1         4,445,508       5
        1234    160873284  840         1         5,308,532       6
        1234    160873284  840         1         6,032,073       7
        1234    160873284  840         1         6,710,645       8
        1234    160873284  840         1         7,421,258       9
        1234    160873284  840         1         8,154,543      10
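The Cookie Ct column in slide 13 rises with every period, which suggests a running count of distinct cookies seen up to and including that day. The slides do not spell out the definition, so the following Python sketch rests on that assumption; the function name and the `first_seen_day` mapping are hypothetical and only illustrate the arithmetic.

```python
from collections import Counter
from itertools import accumulate


def cumulative_uniques(first_seen_day, num_periods):
    """Return the running count of distinct cookies for periods 1..num_periods.

    first_seen_day maps each cookie to the first (1-based) day it appeared.
    Assumption: Cookie Ct for period p counts cookies first seen on day <= p.
    """
    new_per_day = Counter(first_seen_day.values())
    daily_new = [new_per_day.get(day, 0) for day in range(1, num_periods + 1)]
    return list(accumulate(daily_new))


# Hypothetical data: two cookies first seen on day 1, one on day 2, one on day 4.
first_seen = {"c1": 1, "c2": 1, "c3": 2, "c4": 4}
print(cumulative_uniques(first_seen, 4))  # [2, 3, 3, 4]
```

Under this reading, a cookie only ever contributes to the period in which it first appears, so the hard part of the job is finding each cookie's first day across all 92 days.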
    14. Counting Uniques from a Time Ordered Log File
        [Diagram: a time-ordered log with keys A, D, B, C, B, A, A.]
        Major Downsides:
        • Need to keep all key elements in memory.
        • Constrained to one machine for final aggregation.
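The downside named in slide 14 can be made concrete with a small Python sketch. This is an illustration rather than comScore's code: when keys arrive in time order, a repeat can show up at any point, so every distinct key has to stay in memory until the end.

```python
def count_uniques_time_ordered(keys):
    """Count distinct keys in an arbitrarily ordered stream.

    Memory grows with the number of distinct keys, because any key
    may reappear later in the log.
    """
    seen = set()
    for key in keys:
        seen.add(key)
    return len(seen)


# The key sequence read off the slide's diagram.
log = ["A", "D", "B", "C", "B", "A", "A"]
print(count_uniques_time_ordered(log))  # 4
```

With the 15 billion distinct cookies per month quoted later in the deck, a set like this would not fit comfortably on one machine.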
    15. First Version
        • Java Map-Reduce application which processes pre-aggregated data from 92 days
        • Map reads the data and emits each cookie as the key of the key-value pair
        • All 170B records go through the shuffle
        • Each Reducer will get all the data for a particular campaign sorted by cookie
        • Reducer aggregates the data by grouping key (Client / Campaign / Population) and calculates unique cookies for periods 1-92
        • Volume grew rapidly, to the point that the daily processing took more than a day
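The data flow of this first version can be imitated in a few lines of Python. This is an in-memory sketch, not the Java job from the talk: the record layout, function names, and the use of `sorted` as a stand-in for the shuffle are all assumptions made for illustration.

```python
from collections import Counter
from itertools import accumulate, groupby


def map_record(record):
    """Emit a (group + cookie) key and the day as the value."""
    client, campaign, population, cookie, day = record
    return (client, campaign, population, cookie), day


def reduce_group(cookie_days, num_days):
    """cookie_days: (cookie, day) pairs for one group, sorted by cookie."""
    new_per_day = Counter()
    for _cookie, rows in groupby(cookie_days, key=lambda r: r[0]):
        first_day = min(day for _, day in rows)
        new_per_day[first_day] += 1
    return list(accumulate(new_per_day.get(d, 0) for d in range(1, num_days + 1)))


def run_job(records, num_days):
    # Sorting every mapped record stands in for the shuffle: all data moves.
    shuffled = sorted(map(map_record, records))
    output = {}
    for group, rows in groupby(shuffled, key=lambda kv: kv[0][:3]):
        output[group] = reduce_group(((k[3], day) for k, day in rows), num_days)
    return output


# Hypothetical records: (client, campaign, population, cookie, day).
records = [
    (1234, 1, 840, "a", 1),
    (1234, 1, 840, "b", 1),
    (1234, 1, 840, "a", 2),
    (1234, 1, 840, "c", 3),
]
print(run_job(records, 3))  # {(1234, 1, 840): [2, 2, 3]}
```

The `sorted` call touches every input record, which mirrors the slide's point that all 170B records pass through the shuffle.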
    16. M/R Data Flow
        [Diagram: unsorted input keys (B, C, A, B, C, A) are read by three mappers; the shuffle groups the map output by key (A A, B B, C C); three reducers each emit one key (A, B, C).]
    17. Scaling Issue
        As our volume has grown we have the following stats:
        • Over 500 billion events per month
        • Daily aggregate of 1.5 billion records (and growing)
        • 170 billion aggregate records for 92 days
        • 70K campaigns
        • Over 50 countries
        • We see 15 billion distinct cookies in a month
        • We only need to output 25 million rows
    18. Basic Approach Retrospective
        Processing speed is not scaling to our needs on a sample of the input data.
        Diagnosis:
        • Most aggregations could not take significant advantage of combiners.
        • Large shuffles caused poor job performance. In some cases large aggregations ran slower on the Hadoop cluster due to shuffle and skew in data for keys.
        Diagnosis:
        • A new approach is required to reduce the shuffle
    19. Counting Uniques from a Key Ordered Log File
        [Diagram: a key-ordered log with keys A, A, A, B, B, C, D.]
        Major Downsides:
        • Need to sort data in advance.
        • The sort time increases as volume grows.
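By contrast with the time-ordered case, a key-ordered log can be counted while remembering only the previous key, because all repeats of a key are adjacent. A minimal Python sketch of that idea (illustrative, not taken from the talk):

```python
def count_uniques_key_ordered(sorted_keys):
    """Count distinct keys in a stream that is already sorted by key.

    Only the previous key is held in memory, at the cost of needing
    the input sorted in advance.
    """
    count = 0
    previous = object()  # sentinel that is unequal to every real key
    for key in sorted_keys:
        if key != previous:
            count += 1
            previous = key
    return count


# The key sequence read off the slide's diagram.
log = ["A", "A", "A", "B", "B", "C", "D"]
print(count_uniques_key_ordered(log))  # 4
```

The memory cost drops to a constant, and the work moves into the up-front sort that the slide lists as the downside.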
    20. Counting Uniques from a Key Ordered Log File
    21. Counting Uniques from Sharded Key Ordered Log Files
    22. Solution to reduce the shuffle
        The Problem:
        • Aggregations cannot take advantage of combiners, leading to large shuffles and job performance issues
        The Idea:
        • Partition and sort the data by cookie on a daily basis
        • Create a custom InputFormat to merge daily partitions for monthly aggregations
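The idea in slide 22 can be sketched in Python with in-memory lists standing in for files. The custom InputFormat itself is not shown in the slides, so this is only an approximation under stated assumptions: the CRC32-based partitioner, the function names, and the use of `heapq.merge` for the merge step are all illustrative choices.

```python
import heapq
import zlib


def partition_of(cookie, num_partitions):
    """Stable partitioner: the same cookie lands in the same partition every day."""
    return zlib.crc32(cookie.encode("utf-8")) % num_partitions


def write_daily_partitions(cookies, num_partitions):
    """Split one day's cookies into partitions, each sorted by cookie."""
    parts = [[] for _ in range(num_partitions)]
    for cookie in cookies:
        parts[partition_of(cookie, num_partitions)].append(cookie)
    return [sorted(part) for part in parts]


def uniques_in_partition(daily_sorted_runs):
    """Merge one partition's sorted daily runs and count distinct cookies."""
    count, previous = 0, None
    for cookie in heapq.merge(*daily_sorted_runs):
        if cookie != previous:
            count += 1
            previous = cookie
    return count


def monthly_uniques(days, num_partitions):
    daily = [write_daily_partitions(day, num_partitions) for day in days]
    # Partition p of every day holds the same cookie range, so each
    # partition can be counted independently with no cross-partition shuffle.
    return sum(
        uniques_in_partition([day[p] for day in daily])
        for p in range(num_partitions)
    )


# Hypothetical three days of cookie sightings.
days = [["a", "b", "c"], ["b", "d"], ["a", "e", "e"]]
print(monthly_uniques(days, num_partitions=4))  # 5
```

Because a cookie always hashes to the same partition, the per-partition counts can simply be added, which is what lets most of the aggregation happen on the map side.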
    23. Custom Input Format with Map Side Aggregation
        [Diagram: input keys (B, C, A, B, C, A) are routed by the custom InputFormat so that each mapper receives a single key (A, B, or C); each mapper feeds a combiner, and three reducers emit A, B, C.]
    24. Risks for Partitioning
        Data locality
        • Custom InputFormat requires reading blocks of the partitioned data over the network
        • This was solved using a feature of the MapR file system. We created volumes and set the chunk size to zero, which guarantees that the data written to a volume will stay on one node
        Map failures might result in long run times
        • Size of the map inputs is no longer set by block size
        • This was solved by creating a large number (10K) of volumes to limit the size of data processed by each mapper
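A rough sense of what 10K volumes means per mapper can be had by combining this slide with the 170 billion record figure from slide 17. The Python calculation below assumes, hypothetically, that all 92 days of aggregate records are spread evenly over the volumes and that one mapper handles one volume; the slides state neither.

```python
def records_per_mapper(total_records, num_volumes):
    """Average records per mapper, assuming an even spread and one mapper per volume."""
    return total_records // num_volumes


# 170 billion aggregate records (slide 17) over 10K volumes (slide 24).
print(f"{records_per_mapper(170_000_000_000, 10_000):,}")  # 17,000,000
```

Real partitions would deviate from this average wherever cookie activity is skewed.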
    25. Partitioning Summary
        Benefits:
        • A large portion of the aggregation can be completed in the map phase
        • Applications can now take advantage of combiners
        • Shuffle sizes are minimal
        Results:
        • Took a job from 35 hours to 3 hours with no hardware changes
    26. Our Cluster
        Production Hadoop Cluster
        • 120 nodes: mix of Dell 720xd, R710 and R510 servers
        • Each R510 has 12 x 2TB drives, 64GB RAM, and 24 cores
        • 3000+ total CPUs
        • 6.0TB total memory
        • 2PB total disk space
        • Our distro is MapR M5 2.1.0
    27. Useful Factoids: Colorful, bite-sized graphical representations of the best discoveries we unearth. Visit www.comscoredatamine.com or follow @datagems for the latest gems.
    28. Thank You! Michael Brown, CTO, comScore, Inc. mbrown@comscore.com
    29. Diagram
