BigData @ comScore

    BigData @ comScore Presentation Transcript

    • BigData @ comScore. Michael Brown, CTO, comScore, Inc. March 25th, 2011
    • comScore is a Global Leader in Measuring the Digital World. NASDAQ: SCOR. Clients: 1,600+ worldwide. Employees: 1,000+. Headquarters: Reston, VA. Global Coverage: 170+ countries under measurement; 43 markets reported. Local Presence: 30+ locations in 21 countries. © comScore, Inc. Proprietary.
    • Broad Client Base and Deep Expertise Across Key Industries: Media, Agencies, Telecom/Mobile, Financial, Retail, Travel, CPG, Pharma, Technology.
    • The Trusted Source for Digital Intelligence Across Vertical Markets: 47 out of the top 50 ONLINE PROPERTIES; 45 out of the top 50 ADVERTISING AGENCIES; 4 out of the top 4 WIRELESS CARRIERS; 9 out of the top 10 INVESTMENT BANKS; 9 out of the top 10 INTERNET SERVICE PROVIDERS; 9 out of the top 10 AUTO INSURERS; 9 out of the top 10 MAJOR MEDIA COMPANIES; 9 out of the top 10 PHARMACEUTICAL COMPANIES; 9 out of the top 10 CONSUMER FINANCE COMPANIES; 9 out of the top 10 CPG COMPANIES.
    • comScore History of Leadership and Innovation. 1st: to measure the search market; to measure video streaming; to provide behavioral ad effectiveness; to meter mobile user behavior; to unify census + panel measurement; to build and project from a 2 million+ longitudinal panel; to monitor and report e-commerce data; to deliver a worldwide Internet audience measurement. Global Shaper Company 2010.
    • Average Records Captured per Day (2005-2009) [chart: records per day, y-axis from 0 to 1,800,000,000]
    • Launching the 3rd Generation. In 2009, in the midst of the recession, comScore decided to build and release its 3rd Generation Product: Unified Digital Measurement (UDM or Hybrid). Technology Goals – Ramp up data collection – Deploy new methodologies for data processing and analysis – Scale the environment linearly to support growth – Have yesterday's data available today. And one more thing … do it in 4 months or less.
    • Unified Digital Measurement™ (UDM) Establishes Platform For Panel + Census Data Integration. PANEL: Global PERSON Measurement. PAGE TAGS: Global MACHINE Measurement. Unified Digital Measurement (UDM): Patent-Pending Methodology, Adopted by 88% of Top U.S. Media Properties.
    • How Does the Hybrid Process Work? Collect traffic from PCs and devices. Clean traffic: remove non-human traffic and bots, apply edit rules. Apply the comScore URL Dictionary. [diagram labels: Total Traffic, Filtered Traffic]
    • URL Dictionary (CFD): Advertising Industry “Currency”. Intelligent grouping of Properties with 7+ levels of detail – Property (e.g., Yahoo! Properties, Microsoft Sites) – Media Title (e.g., Yahoo!, MSN) – Channel (e.g., Yahoo! Search, MSN Homepages) – Subchannel (e.g., Yahoo! Image Search, MSNBC) – Group/Subgroup (e.g., Yahoo! Calendar, Today)
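The slide does not say how the dictionary is matched against traffic. As a minimal sketch of the idea, assuming a lookup keyed on host name where the most specific matching suffix wins, a hierarchical classification could look like the following. The `Entity` type, the `DICTIONARY` entries, the host names, and `classify` are all hypothetical and only illustrate the Property / Media Title / Channel levels named above.

```python
from typing import NamedTuple, Optional


class Entity(NamedTuple):
    """One dictionary entry; only three of the 7+ levels are modeled here."""
    property_name: str
    media_title: str
    channel: Optional[str] = None


# Hypothetical patterns for illustration; the real dictionary is not public.
DICTIONARY = {
    "yahoo.com": Entity("Yahoo! Properties", "Yahoo!"),
    "search.yahoo.com": Entity("Yahoo! Properties", "Yahoo!", "Yahoo! Search"),
    "msn.com": Entity("Microsoft Sites", "MSN"),
}


def classify(host: str) -> Optional[Entity]:
    """Return the entry for the most specific domain suffix that matches."""
    labels = host.lower().split(".")
    # Try the full host first, then drop leading labels one at a time.
    for start in range(len(labels)):
        candidate = ".".join(labels[start:])
        if candidate in DICTIONARY:
            return DICTIONARY[candidate]
    return None


print(classify("images.search.yahoo.com"))
print(classify("www.msn.com"))
print(classify("example.org"))
```

Trying the longest suffix first means a more detailed entry (a Channel) takes precedence over a coarser one (a Media Title) for the same site.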
    • URL Dictionary (CFD) Coverage Statistics. 11MM unique domains on average per month in 2010 – Over 80% of pages viewed came from the top 131K domains in 2010 vs. 392K in 2009 – 2,360K patterns in January 2011 represent 85% of all pages – 1,254K syndicated entities in January 2010 – 41K patterns added per month in 2010
    • Worldwide UDM™ Penetration: Percentage of Machines Included in UDM Measurement (July 2010 Penetration Data). Europe: Austria 80%, Belgium 85%, Switzerland 84%, Germany 84%, Denmark 82%, Spain 90%, Finland 85%, France 91%, Ireland 91%, Italy 80%, Netherlands 88%, Norway 84%, Portugal 86%, Sweden 85%, United Kingdom 90%. Asia Pacific: Australia 91%, Hong Kong 88%, India 84%, Japan 73%, Malaysia 87%, New Zealand 88%, Singapore 91%. North America: Canada 94%, United States 91%. Latin America: Argentina 94%, Brazil 92%, Chile 94%, Colombia 95%, Mexico 93%, Puerto Rico 92%. Middle East & Africa: Israel 93%, South Africa 73%.
    • Worldwide Tags per Day [chart: number of records per day, Jul 2009 to Feb 2011, y-axis from 0 to 25,000,000,000; series: Beacon Records, Panel Records]
    • Monthly Totals [chart: number of records per month, Jul 2009 to Feb 2011, y-axis from 0 to 600,000,000,000; series: Beacon Records, Panel Records]
    • High Level Data Flow [diagram labels: Panel, Census, ETL, Delivery]
    • Enterprise Data Warehouse: Sybase IQ 15.2 Multiplex. The EDW currently comprises 20 servers running Windows 2003 R2 x64 – Currently 220 Intel CPUs – Dedicated EDW technical team of 3 DBAs and 1 Administrator – Ability to grow compute capacity and storage capacity independently. EDW data repository housed on both EMC VMAX and CLARiiON – 4 EDW instances (2 in Virginia and 2 in Illinois) – One EDW instance is 147TB usable (approx. 200TB of raw data) – Production EDW drive layout: 416 x 1TB SATA, RAID6, 14+2; 42 x 600GB 15K, RAID1; 8 x 400GB Flash, RAID5, 7+1. Current Capacity and Performance Metrics – 1,835,412,793,799 rows loaded – 140TB in 14,168 tables – Capable of loading 56 billion rows per hour
    • Subsystem. System designed using multiple subsystems. Components can easily be taken out and replaced as demands change. In some cases, such as first-stage tag processing, we moved from a single server to a cluster of servers in a few months. We periodically redesign different subsystems to support increased processing demands. Many systems are on their third generation of technology.
    • Homegrown Distributed Processing. 2002: comScore distributed processing framework. Reduced core aggregation from 48 hours to 7 hours. Reduced final product creation from 24 hours to 2 hours. [timeline labels: Scalability Wall; Open Source Hadoop framework]
    • GreenPlum MPP – 80 Node Cluster: 1 Master; 6 ETL; 72 Workers – Using Dell R510 with 12 x 600GB 15K RAID, 64GB RAM, 24 cores (HT) – Supports analytic end users with access to record-level data through a SQL interface – Ability to load over 400 billion rows in 8 hours – Hourly data loading in place – Allows the analysts to mine the data for business uses – Used for quick analysis of raw event data and for the ideation and creation of new products
    • Hadoop – Dev: 6x Dell 2950 w/ 6 x 1TB – Prod: 10x Dell R710 w/ 6 x 600GB – Prod in 2 weeks: 10x Dell R710 w/ 6 x 600GB & 20x Dell R510 w/ 12 x 2TB – Moving large processing jobs that are constrained by our current framework to Hadoop. We have some large analytical runs that currently go for over 40 hours on 32 servers, and we are re-engineering them to reduce processing time. – We have found that the Fair Scheduler works well for our job loads – We use a “homegrown” workflow system (BORG) that manages tasks inside and outside Hadoop.
    • Sharding. Sharding divides work across multiple systems using different mechanisms. Shard data as far upstream as possible. Breaking data into multiple chunks early in processing makes it possible to add compute capacity downstream to accommodate large volume increases in data ingest.
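The slide does not name the sharding mechanisms used. One common option is hash-based sharding on a record key, sketched below under that assumption. The record layout (cookie id, URL), the choice of CRC32, and the names `shard_for` and `shard_records` are illustrative, not taken from the talk.

```python
import zlib
from collections import defaultdict


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard index using a hash that is stable across runs."""
    return zlib.crc32(key.encode("utf-8")) % num_shards


def shard_records(records, num_shards):
    """Split (cookie_id, url) records into per-shard lists keyed by shard index."""
    shards = defaultdict(list)
    for cookie_id, url in records:
        shards[shard_for(cookie_id, num_shards)].append((cookie_id, url))
    return shards


records = [
    ("cookie-a", "/home"),
    ("cookie-b", "/news"),
    ("cookie-a", "/mail"),
    ("cookie-c", "/home"),
]
shards = shard_records(records, num_shards=4)
for index in sorted(shards):
    print(index, shards[index])
```

Because every record for a given key lands in the same shard, downstream per-key work such as counting uniques can run on each shard independently, and capacity can be added by raising the shard count.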
    • Sorting. We use DMExpress from SyncSort across hundreds of servers, which allows for efficient data processing. We sort input data on a column in advance. To calculate uniques, check whether the current value differs from the prior value and, if so, increment a counter. We now have aggregation systems that can process over 50 GB of data with 357 million rows in less than an hour on a Dell R710 2U server.
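The unique-counting step described on this slide can be sketched in a few lines. This is a minimal illustration that uses Python's built-in `sorted` in place of DMExpress; the function name and sample data are made up.

```python
def count_uniques_sorted(sorted_values):
    """Count distinct values in an already sorted sequence in one pass."""
    count = 0
    previous = object()  # sentinel that never equals a real value
    for value in sorted_values:
        if value != previous:  # value changed, so a new unique begins
            count += 1
            previous = value
    return count


cookies = ["c3", "c1", "c2", "c1", "c3", "c3"]
unique_count = count_uniques_sorted(sorted(cookies))
print(unique_count)
```

The pass keeps only the previous value in memory, so it works on inputs far larger than RAM as long as they arrive sorted.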
    • Compression w/Sorting. Compress log files when processing large volumes of log data. Several advantages to sorting data first: – Reduces the size of the data – Improves application performance. Examples: – 1 hour of our data (313 GB raw, 815 million rows) – Standard compression of time-ordered data is 93GB (30% of original) – Standard compression on a 2-key sorted set is 56GB (18% of original) – For one day it saves 800GB – For one month it saves 25TB – For 90 days it saves 75TB
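The effect can be tried on synthetic data. The sketch below assumes a made-up two-column log (site, cookie) and uses `zlib` as a stand-in for "standard compression"; the actual ratios depend on the data and the compressor, so it is not expected to reproduce the percentages on the slide.

```python
import random
import zlib


def compressed_size(rows):
    """Return the zlib-compressed size in bytes of newline-joined rows."""
    return len(zlib.compress("\n".join(rows).encode("utf-8")))


rng = random.Random(42)  # fixed seed so the run is repeatable
sites = [f"site{n:03d}.example" for n in range(50)]

# Synthetic rows in arrival order: site and cookie id, tab-separated.
rows = [
    f"{rng.choice(sites)}\tcookie{rng.randrange(2000):05d}"
    for _ in range(20000)
]

arrival_order_size = compressed_size(rows)
sorted_size = compressed_size(sorted(rows))  # ordered by site, then cookie

print("arrival order:", arrival_order_size, "bytes")
print("sorted on 2 keys:", sorted_size, "bytes")
```

Sorting places rows with equal or similar keys next to each other, which gives the compressor longer and closer repeated runs to exploit.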
    • Big data makes you think differently. Question: How many distinct cookies over 3 months? Data: 3 monthly tables with distinct cookies, indexed. Size: 10B records per table. Platform: Sybase IQ. Attempt: select count(cookies) over a UNION of the 3 monthly tables (the UNION operator removes duplicates). Result: FAIL. Out of temp space. Out of luck. Failed after 30 minutes. Why? UNION performs a SELECT and then a DISTINCT (sorting 30B rows).
    • Rethink the problem! INNER joins are cheaper: no sort, they use existing indexes. Remember set theory? Of course you do! Let months be {A, B, C} [Venn diagram of A, B, C]. |A ∪ B ∪ C| = |A| + |B| + |C| – |A ∩ B| – |A ∩ C| – |C ∩ B| + |A ∩ B ∩ C|, where A ∩ B ∩ C = (A ∩ B) ∩ (A ∩ C) ∩ (C ∩ B). INNER join on only 2 tables of data at a time. The 2-month intersections took 2 hours each and were less taxing on memory. The intersection of the intermediate (indexed!) results took 5 mins. Total query time: 6.5 hours.
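The inclusion-exclusion identity used on this slide can be checked on a small example. The sketch below uses Python sets in place of the indexed monthly tables and set intersection in place of INNER joins; the cookie values and the function name are made up.

```python
def distinct_count_three(a, b, c):
    """Count distinct members of three sets using only pairwise intersections."""
    ab = a & b  # stands in for an INNER join of two monthly tables
    ac = a & c
    cb = c & b
    abc = ab & ac & cb  # triple intersection from the intermediate results
    return len(a) + len(b) + len(c) - len(ab) - len(ac) - len(cb) + len(abc)


month_a = {"c1", "c2", "c3", "c4"}
month_b = {"c3", "c4", "c5"}
month_c = {"c1", "c4", "c6", "c7"}

total = distinct_count_three(month_a, month_b, month_c)
print(total, len(month_a | month_b | month_c))
```

The formula never materializes the union, which is the step that ran out of temp space in the failed attempt.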
    • TCO with Large Cluster Systems. Examine replication factor and disk configuration for systems with replication built into the framework to support redundancy and concurrency. Example: Hadoop cluster that supports 108TB of base compressed data. Hypothetical Configurations: – Replication Factor of 3: R710 (6x drives, JBOD) requires 162 servers; R510 (12x drives, JBOD) requires 68 servers – Replication Factor of 2: R710 (6x drives, RAID 5) requires 129 servers; R510 (12x drives, RAID 5) requires 54 servers
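The slide does not state the usable capacity per server behind these counts. As a rough sketch of the sizing arithmetic, the function below divides replicated data volume by an assumed usable capacity per server. The 2.0 TB figure is a hypothetical value that happens to be consistent with the first configuration; the other counts depend on RAID overhead and headroom that the slide does not give, so they are not reproduced here.

```python
import math


def servers_required(base_tb, replication_factor, usable_tb_per_server):
    """Servers needed to hold base_tb of data stored replication_factor times."""
    total_tb = base_tb * replication_factor
    return math.ceil(total_tb / usable_tb_per_server)


# Assumption: about 2.0 TB usable for data on an R710 with 6 drives in JBOD.
r710_rf3 = servers_required(base_tb=108, replication_factor=3,
                            usable_tb_per_server=2.0)
print(r710_rf3)
```

The same function shows why a lower replication factor or denser servers cut the server count: both shrink the ratio of stored volume to per-server capacity.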
    • Useful Factoids. Colorful, bite-sized graphical representations of the best discoveries we unearth. Visit www.comscoredatamine.com or follow @datagems for the latest gems.
    • Thank You! Michael Brown, CTO, comScore, Inc. mbrown@comscore.com