Utilizing Aster nCluster to support processing in excess of 100 Billion rows per month

Presentation Transcript

  • A Plan for Large Scale Data Analytics: Utilizing Aster nCluster to support processing in excess of 100 Billion rows per month
    Will Duckworth, Director Software Engineering (wduckworth@comscore.com)
  • Agenda
      • comScore – Introduction and Technology
      • MM360 Initiative
      • The Challenge
      • Our Analysis
      • Plans
    © comScore, Inc. Proprietary and Confidential.
  • comScore, Inc.
      • Founded in 1999
      • Publicly traded on NASDAQ (SCOR)
      • Acquired MediaMetrix in 2002, M:Metrics in 2007, Certifica in 2009, and ARSgroup in 2010
      • Corporate headquarters: Reston, VA
          – Offices in Chicago, NYC, San Francisco, Seattle, Toronto, London, Tokyo, and Paris
          – 500+ full-time employees
      • Experienced senior leadership team with a unique record of innovation in the market research industry
      • More than 1200 clients across many industries
  • Advising Hundreds of Leading Businesses (partial list)
    Sectors: Internet, Agencies, Telecom, Financial, Retail, Travel, CPG, Pharma, Technology
  • Powerful Platform: Massive Database and Cost Effective Technology Infrastructure
    Database and Computational Infrastructure
      • Continuous Operation
          – 24/7
          – 99.99% Uptime
      • Massive Operational Scale
          – Largest Windows Data Warehouse in the World
          – 1,000 TB of storage
          – 1,100 Servers
          – 30 TB per month
      • Technology with Strong IP Protection
          – Patents: 3 Issued, 24 Pending
      • Highly Scalable, Proprietary Distributed Processing Architecture
      • Cost Effective
          – System Capex < $7M/Year
    Sophisticated Technology to Keep Up With Internet Advancements
  • Even for us this is getting big…
    [Chart: New Rows per Day (panel vs. non-panel). Y-axis: millions of rows, 0 to 12,000. X-axis: 6/24/2009 to 3/24/2010. Series: beacon, panel.]
  • Where we come from …
      • Our skill set came from a need to measure Win32
      • We chose technologies and built a core team around our mandate to have accurate consumer Internet measurement
          – All Intel Based
          – 2/3 Microsoft OS, 1/3 Linux OS
          – C++
      • Now very much a “best tool for the job” organization
  • MM360 Initiative
  • Internet = “The Most Measurable Medium”
    100% Accurate count of server requests, but…
      • How many real users?
      • What kind of users are they?
      • Which request is a valid Page View?
      • How long did the users spend on my site?
  • Basic Problem with Servers: No Unique User ID
    Web Analytics Approximation: Unique User = Cookie ID (if Cookies can be set) or IP Address + User Agent
    Sounds Simple, But Major Problems:
      • Cookies are deleted frequently, and the same person can be counted multiple times
      • IP Addresses change frequently, causing inflation of user counts
      • In any case, servers identify a machine (or a browser), which can represent multiple persons or a fraction of the usage of a single person
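The approximation on this slide can be sketched in a few lines. This is only an illustration of the cookie-or-IP-plus-User-Agent rule and of how it inflates counts; the function names, the record layout, and the sample requests are all hypothetical and are not taken from comScore's systems.

```python
from typing import Optional


def visitor_key(cookie_id: Optional[str], ip: str, user_agent: str) -> str:
    """Approximate a 'unique user': the cookie ID if one was set,
    otherwise the IP address combined with the User Agent."""
    if cookie_id:
        return f"cookie:{cookie_id}"
    return f"ipua:{ip}|{user_agent}"


def count_unique_visitors(requests) -> int:
    """Count distinct visitor keys in a list of request records."""
    return len(
        {visitor_key(r.get("cookie_id"), r["ip"], r["user_agent"]) for r in requests}
    )


# Hypothetical traffic from two real people.
requests = [
    # Person A, first visit.
    {"cookie_id": "abc", "ip": "10.0.0.1", "user_agent": "UA1"},
    # Person A again after deleting cookies: counted as a new user.
    {"cookie_id": "xyz", "ip": "10.0.0.1", "user_agent": "UA1"},
    # Person B with cookies disabled.
    {"cookie_id": None, "ip": "10.0.0.2", "user_agent": "UA1"},
    # Person B again after an IP change: counted as a new user.
    {"cookie_id": None, "ip": "10.0.0.3", "user_agent": "UA1"},
]

unique_visitors = count_unique_visitors(requests)
print(unique_visitors)
```

In this made-up sample the server-side rule reports four unique users for two people, which is the inflation the slide describes.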
  • Media Metrix 360: Key Benefits for Participating Sites
      • Comprehensive coverage: 100% of activity
          – New “Universe Report” covers mobile and public machines
          – Census-adjusted metrics in current Media Metrix reports (Home and Work)
          – Coverage Calculation for beaconing sites
      • Improved coverage of At-Work population
      • Harmonization / Reconciliation of panel vs. server
      • More granularity
      • More timely reporting
      • Transparency
  • The Challenge
  • Goals
      • Be able to scale to support an initial monthly volume of 160 Billion records
          – Store 3 months of data online
      • Be able to add incrementally to the environment to support growth
      • Support advanced analytics
          – 150 analysts
      • Support end user access to record level data, preferably through a SQL interface
      • Support the storage of row level data
      • Have yesterday's data available today
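To put the volume goals in perspective, the arithmetic below converts the stated 160 billion records per month and 3 months of online retention into rows held online and a sustained ingest rate. The 30-day month is an assumption made for the calculation, not a figure from the slides.

```python
ROWS_PER_MONTH = 160_000_000_000  # initial monthly volume from the Goals slide
MONTHS_ONLINE = 3                 # retention goal from the Goals slide
DAYS_PER_MONTH = 30               # assumption: a 30-day month

# Rows that must be queryable online at any time.
rows_online = ROWS_PER_MONTH * MONTHS_ONLINE

# Average load the system must sustain to keep up.
rows_per_day = ROWS_PER_MONTH / DAYS_PER_MONTH
rows_per_second = rows_per_day / (24 * 60 * 60)

print(f"rows online:     {rows_online:,}")
print(f"rows per day:    {rows_per_day:,.0f}")
print(f"rows per second: {rows_per_second:,.0f}")
```

Under these assumptions the environment holds 480 billion rows online and must absorb roughly 5.3 billion rows per day, or about 62 thousand rows per second on average.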
  • Existing Internal Systems
      • NGUA
          – Ability to run specific queries for a given time period very quickly because all processing is parallelized
          – Currently holding 560+ days of data; 800B+ rows
          – All traffic for a machine for a month: 1 minute run time (140k records)
          – All traffic for pizzahut.com for a month: 4 minutes run time (1.9 million records)
          – All traffic from google.com where toys is in the URL: 1 hour 15 minutes (400k records)
      • Fusion
          – Primary System used for processing and providing the data behind the majority of comScore’s products and analysis
          – Runs on 32 servers
          – For one month we read over 8TB of compressed log files with over 40B rows
          – Produces 1.3 B rows and 120 GB of output for load into a DW
          – Can turn around the processing in less than 8 hours
      • Both systems leverage the same core concepts of locality to data and distributed processing
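As a rough reading of the Fusion figures above, the sketch below derives an approximate throughput from "over 40B rows", "less than 8 hours", and "32 servers". Because the slide gives only bounds, the results are lower bounds, and the even split across servers is an assumption.

```python
ROWS_READ = 40_000_000_000  # "over 40B rows" read for one month
SERVERS = 32                # Fusion runs on 32 servers
MAX_HOURS = 8               # processing turns around in "less than 8 hours"

# Lower bound on aggregate throughput: at least this many rows per second.
rows_per_second = ROWS_READ / (MAX_HOURS * 3600)

# Assumption: work is spread evenly across the servers.
rows_per_second_per_server = rows_per_second / SERVERS

print(f"aggregate:  {rows_per_second:,.0f} rows/s")
print(f"per server: {rows_per_second_per_server:,.0f} rows/s")
```

That works out to at least about 1.39 million rows per second overall, or roughly 43 thousand rows per second per server.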
  • Aster Data nCluster
      • Current Aster environments
          – Dev: 1 Queen; 3 Workers; 650+GB total storage
          – Prod: 1 Queen; 4 Loaders; 10 Workers; 32TB total storage
      • Plans
          – Building new Prod environment: 1 Queen, 70 Workers, and 10 Loaders / Staging servers
          – 350TB total storage
          – 432 Cores
  • Aster Data nCluster
      • Table design is key with data of this size
          – What is the end user going to do 80% of the time?
      • On the web, no matter how clean you think your data set is, there are still going to be issues
          – 6 Sigma on 10 billion records a day is still nearly 35,000 “bad rows”
          – Staging Servers
      • Looking at using Aster-Hadoop Data Connector for integration with in-house Hadoop environment
          – Aster Data for the analysts
          – Hadoop for the developers
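The "bad rows" figure on this slide can be checked with a short calculation. It assumes the conventional Six Sigma defect rate of 3.4 defects per million opportunities, which the slide does not state explicitly.

```python
RECORDS_PER_DAY = 10_000_000_000  # 10 billion records a day, from the slide

# Assumption: the conventional Six Sigma rate of 3.4 defects per million,
# written as 34 per 10 million to keep the arithmetic in integers.
DEFECTS_PER_TEN_MILLION = 34

bad_rows_per_day = RECORDS_PER_DAY * DEFECTS_PER_TEN_MILLION // 10_000_000
print(f"{bad_rows_per_day:,} bad rows per day")  # 34,000 bad rows per day
```

The result, 34,000 rows a day, is consistent with the slide's "nearly 35,000".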
  • Critical Cost Drivers to factor into the Analysis
      • Data Centers
          – Power is the big issue at data centers today. Allocations for power and space are based on the number of circuits, and the cost per circuit is expected to rise
      • Servers
          – Even high end servers have reached relative commodity prices if you stay with the 2U footprint and standard components