How to Succeed in Hadoop: comScore’s Deceptively Simple Secrets to Deploying Hadoop for Deeper Consumer Insights


Get an insider's view into one of the most talked-about Hadoop deployments in the world!

As more enterprises realize the value of big data, Hadoop is moving from lab curiosity to genuine competitive advantage. But how can you confidently deploy it in a production environment?

In this joint webinar with Syncsort, learn firsthand from industry thought leader, Mike Brown, CTO of comScore, how to offload critical data and optimize your enterprise data architecture with Hadoop to increase performance while lowering costs.


Transcript

  • 1. © comScore, Inc. Proprietary. Syncsort & MapR @ comScore Michael Brown, CTO | July 9th, 2014
  • 2. The comScore Story: Analytics for a Digital World™
  • 3. The Digital World is Complex
  • 4. comScore’s Mission: Be the Leader in Digital Media Analytics. Measure all forms of media (content and advertising) at scale, across all platforms, in real time, globally.
  • 5. comScore Brings it Together: Tablet, PC/Mac, TV, Smartphone, Gaming
  • 6. comScore is a leading internet technology company that provides Analytics for a Digital World™
      • NASDAQ: SCOR
      • Clients: 2,400+ worldwide
      • Employees: 1,200+
      • Headquarters: Reston, Virginia, USA
      • Global coverage: measurement from 172 countries; 44 markets reported
      • Local presence: 32 locations in 23 countries
  • 7. Providing Analytics for More Than 2,400 Clients Globally: Media, Agencies, Telecom/Mobile, Financial, Retail, Travel, CPG, Health, Technology
  • 8. Turning Big Data into Powerful Insight [diagram]
      • Inputs: Census (tags and data feeds); Panels (PC, iOS, Android); Survey (non-behavioral elements)
      • Stage labels: Collection, Calibration, Delivery
      • Other labels: Methods, Aggregation, Dictionaries, Taxonomies, Consulting, Analysis, Models, Weighting, Projection, De-Duplication, Attribution
      • Platforms: Syndicated Data Platform (Media Metrix, vCE); Client Analytics Platform (Digital Analytix)
  • 9. [Slide without text]
  • 10. Panel Heat Map
  • 11. Average Records Captured per Day (2005-2009) [chart: y-axis from 0 to 1,800,000,000 records per day; x-axis monthly from 9/26/2005 to 3/26/2009]
  • 12. Unified Digital Measurement™ (UDM) Establishes Platform for Panel + Census Data Integration. Patent-pending methodology. Adopted by 90% of the top 100 U.S. media properties. [diagram labels: PANEL, CENSUS, Global PERSON Measurement, Global DEVICE Measurement]
  • 13. Beacon Heat Map
  • 14. Monthly Records Collection [chart: number of records per month, y-axis from 0 to 2,000 billion; series: Beacon Records, Panel Records]
      • Total records collected in June 2014 = 1,726,563,202,649
      • Total records collected YTD 2014 = 10,037,131,368,475
  • 15. DMX @ comScore
  • 16. DMX use at comScore
      • Purchased our first 4 licenses in 2000!
      • We use DMX from Syncsort across hundreds of servers for efficient data processing and aggregation.
      • We currently run more than 100 unique jobs every day.
      • With these jobs we process over 150 billion rows of data through DMX!
      • [graphic labels: Connect, Design, Process, Accelerate]
  • 17. Compression w/ Sorting: compress log files when processing large volumes of log data. Sorting the data first has several advantages:
      • Reduces the size of the data
      • Improves application performance
    Example: 1 hour of one source of our data is 2,315 GB raw (2.9 billion rows)
      • Standard compression of time-ordered data: 509 GB (22% of original)
      • Standard compression of a sorted set: 324 GB (14% of original)
    When applied to all our sources we save:
      • 5.0 TB per day
      • 155 TB per month
      • 460 TB per quarter
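The sort-before-compress effect on slide 17 can be reproduced at toy scale. The sketch below is only an illustration: the row layout (`cookie,url`), the synthetic data, and the use of `zlib` are assumptions, not comScore's actual log format or the compressor behind the 22% and 14% figures. It compresses the same rows twice, once in arrival order and once sorted, and compares the sizes.

```python
import random
import zlib


def make_log_lines(n_rows: int, seed: int = 7) -> list[bytes]:
    """Build synthetic 'cookie,url' rows in arrival (unsorted) order."""
    rng = random.Random(seed)
    # Hypothetical identifiers: 1,000 cookies and 50 URLs.
    cookies = [f"{rng.getrandbits(64):016x}" for _ in range(1_000)]
    urls = [f"/section/{i:03d}/index.html" for i in range(50)]
    return [
        f"{rng.choice(cookies)},{rng.choice(urls)}\n".encode()
        for _ in range(n_rows)
    ]


def compressed_size(lines: list[bytes]) -> int:
    """Size in bytes of the zlib-compressed concatenation of the rows."""
    return len(zlib.compress(b"".join(lines), 6))


rows = make_log_lines(20_000)
raw_size = sum(len(r) for r in rows)
arrival_size = compressed_size(rows)          # rows as they arrived
sorted_size = compressed_size(sorted(rows))   # same rows, sorted first

print(f"raw:           {raw_size} bytes")
print(f"arrival order: {arrival_size / raw_size:.0%} of raw")
print(f"sorted:        {sorted_size / raw_size:.0%} of raw")
```

Sorting places rows with the same cookie next to each other, so the compressor finds long repeated runs inside its window. The exact ratios depend entirely on the data and will not match the slide's numbers.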
  • 18. Hadoop @ comScore
  • 19. Why Hadoop?
      • comScore built our own distributed computing stack in 2002.
      • In 2009 we decided it was better to leverage the efforts of the Hadoop community instead of building our own stack.
      • We recognized that switching to Hadoop would allow seamless scaling of our infrastructure to meet the needs of the business.
      • Hadoop allows us to add compute, storage and memory linearly and to process data at tremendous scale.
      • Partnered with Syncsort on their Hadoop efforts from October 2010.
      • Evaluated the beta of MapR in the fall of 2011.
  • 20. 90 Days of Data [chart: y-axis from 0 to 6,000 trillion; x-axis labels 2009, 2010, 2011, 2012, 2013, 2014, 2016; data labels 1,148; 1,919; 3,049; 4,862; 5,084]
  • 21. High Level Data Flow [diagram labels: Panel, Census, Custom Code + ADW, EDW, Delivery]
  • 22. Our Cluster: Production Hadoop Cluster
      • 400+ nodes: mix of Dell R720xd, R710 and R510 servers
      • Each R720xd has 24 x 1.2 TB drives, 128 GB RAM and 32 cores
      • 13,800+ total CPUs
      • 31.6 TB total memory
      • 8.2 PB total disk space
      • Our distro is MapR M5 2.1.3
  • 23. Leveraging Partitions from MapR
  • 24. [Slide without text]
  • 25. Validation Funnel & Target Effectiveness
  • 26. Our growth: as our volume has grown we have the following stats:
      • Over 683 billion events per month
      • Daily aggregate: 1.8 billion
      • 160 billion aggregate records for 92 days
      • 146K campaigns
      • Over 50 countries
      • We see 15 billion distinct cookies in a month
      • We only need to output 26 million rows
  • 27. Solution to reduce the shuffle
    The Problem:
      • Most aggregations within comScore cannot take advantage of combiners, leading to large shuffles and job performance issues
    The Idea:
      • Partition and sort the data by cookie on a daily basis
      • Create a custom InputFormat to merge daily partitions for monthly aggregations
  • 28. Custom Input Format with Map Side Aggregation [diagram labels: inputs keyed A, B and C; three Mappers/Maps; three Combiners; three Reduces]
  • 29. Risks for Partitioning
    Data locality:
      • A custom InputFormat requires reading blocks of the partitioned data over the network
      • This was solved using a feature of the MapR file system. We created volumes and set the chunk size to zero, which guarantees that the data written to a volume will stay on one node
    Map failures might result in long run times:
      • Size of the map inputs is no longer set by block size
      • This was solved by creating a large number (10K) of volumes to limit the size of data processed by each mapper
  • 30. Partitioning Summary
    Benefits:
      • A large portion of the aggregation can be completed in the map phase
      • Applications can now take advantage of combiners
      • Shuffle sizes are minimal
    Results:
      • Took a job from 35 hours to 3 hours with no hardware changes
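The partitioning idea from slides 27 to 30 (partition and sort by cookie daily, then merge the daily partitions for a monthly aggregate) can be sketched in a few lines. This is an in-memory illustration, not the custom Hadoop InputFormat the slides describe: the function names, the CRC-based volume assignment, the `(cookie, count)` event shape, and the tiny volume count are all assumptions made for the example.

```python
import heapq
import zlib
from collections import defaultdict
from itertools import groupby
from operator import itemgetter

NUM_VOLUMES = 4  # the slides mention about 10K volumes; kept tiny here


def volume_for(cookie: str) -> int:
    """Stable hash so a given cookie always lands in the same volume."""
    return zlib.crc32(cookie.encode()) % NUM_VOLUMES


def write_daily_partitions(events):
    """Split one day's (cookie, count) events by volume, sorted by cookie."""
    partitions = defaultdict(list)
    for cookie, count in events:
        partitions[volume_for(cookie)].append((cookie, count))
    return {vol: sorted(rows) for vol, rows in partitions.items()}


def aggregate_volume(daily_partitions, volume):
    """Merge every day's sorted partition for one volume and sum per cookie.

    All inputs are already sorted by cookie, so a streaming k-way merge
    brings each cookie's rows together without a shuffle step.
    """
    streams = [day.get(volume, []) for day in daily_partitions]
    merged = heapq.merge(*streams, key=itemgetter(0))
    return {
        cookie: sum(count for _, count in rows)
        for cookie, rows in groupby(merged, key=itemgetter(0))
    }


days = [
    write_daily_partitions([("c1", 2), ("c2", 1), ("c3", 5)]),
    write_daily_partitions([("c2", 4), ("c1", 1)]),
    write_daily_partitions([("c3", 1), ("c4", 7), ("c1", 3)]),
]

monthly = {}
for vol in range(NUM_VOLUMES):
    # Each volume plays the role of one mapper's input.
    monthly.update(aggregate_volume(days, vol))

print(sorted(monthly.items()))
```

Because a cookie always maps to the same volume, each per-volume pass sees every row for its cookies and can finish their totals on its own, which is why most of the aggregation can complete in the map phase. Here the totals come out as c1 = 6, c2 = 5, c3 = 6 and c4 = 7.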
  • 31. DMX-h @ comScore
  • 32. Reasons for comScore selecting DMX-h
    Performance:
      • DMX-h as the pluggable sort in Hadoop allows us to increase throughput on our existing platform; this reduces capital and ongoing operational expenses.
      • The increase in throughput also allows us to deliver our data more quickly to our customers, which makes the data more valuable to them.
    Speed of Development:
      • The ability to quickly build out applications in the DMX-h GUI allows us to iterate and respond more quickly to the needs of the business.
      • The ease of development also allows us to democratize access to the Hadoop platform by leveraging a point-and-click GUI.
  • 33. Performance: DMX Pluggable Sort Testing Results. First comparison run on our dev cluster, using Pig scripts called with the Syncsort plug-in.
    GroupBy / Distinct operations:
      • Counting uniques
      • These have large shuffle steps, which leads to more data to sort
      • Observed up to a 20% decrease in job runtime
    Filter operations:
      • Searching for a specific value
      • Observed a 5% to 10% decrease in job runtime
      • Dependent on type of filter and size of job output
    Example: 40 GB compressed data; base run is 86 min, test run is 68 min; savings of 20%
    Results from 7 nodes; 56 cores; 433 GB RAM; 28 TB disk; MapR M5 3.0.2; DMX-h 7.12
  • 34. Speed of Development: POC. We took an existing process that runs in our Hadoop cluster and converted it to DMX-h to validate the new capabilities. The existing process:
      • Written in 75 lines of Pig with 3 Java UDFs
      • Developed in about 25 hours
      • Processes 3.5 billion input rows per day
      • Takes 35 minutes to run on a daily basis
  • 35. DMX-h Process
  • 36. Speed of Development: POC. The new process in DMX-h:
      • Developed a new job with 13 tasks
      • No Java UDF required
      • Runs on the same data and in the same environment
      • Developed in 12 hours
      • Runs in 11 minutes! 1/3 of the time of the Pig & Java code
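The before and after figures quoted on slides 30, 33, 34 and 36 can be checked with a few lines of arithmetic. The helper name `reduction` and the case labels are made up for this sketch; the input numbers are taken directly from the slides.

```python
def reduction(before: float, after: float) -> float:
    """Fraction of the original value saved by the change."""
    return (before - after) / before


# (before, after) pairs as quoted on the slides.
cases = {
    "partitioned job, hours (slide 30)": (35, 3),
    "pluggable sort test, minutes (slide 33)": (86, 68),
    "POC daily runtime, minutes (slides 34 and 36)": (35, 11),
    "POC development time, hours (slides 34 and 36)": (25, 12),
}

for label, (before, after) in cases.items():
    saved = reduction(before, after)
    print(f"{label}: {before} -> {after}, {saved:.1%} less, "
          f"{after / before:.0%} of original")
```

The pluggable sort test works out to a saving of about 21%, in line with the slide's rounded 20%, and the POC runtime is about 31% of the original, matching the "1/3 of the time" claim.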
  • 37. Useful Factoids: visit www.comscoredatamine.com or follow @datagems for the latest gems. Colorful, bite-sized graphical representations of the best discoveries we unearth.
  • 38. Thank You! Michael Brown, CTO, comScore, Inc. mbrown@comscore.com
  • 39. © 2014 MapR Technologies
  • 40. Today’s Presenters: Steve Wooledge, VP - Product Marketing (@swooledge); Jorge Lopez, Director - Product Marketing (@zanilli); Mike Brown, CTO
  • 41. comScore
  • 42. Syncsort & MapR @ comScore. Michael Brown, CTO | July 9th, 2014
  • 43. Leveraging MapR and Syncsort
  • 44. Trend 1: Big Data is Overwhelming Traditional Systems [diagram: Enterprise Data Architecture, with Enterprise Users, Outside Sources, Operational Systems and Analytical Systems]
      • Production requirements: Mission-critical reliability; Transaction guarantees; Deep security; Real-time performance; Backup and recovery
      • Production requirements: Interactive SQL; Rich analytics; Workload management; Data governance; Backup and recovery
  • 45. Trend 2: Hadoop: The Disruptive Technology at the Core of Big Data [chart: job trends from Indeed.com, Jan ‘06 to Jan ‘14]
  • 46. Reality 1: Hadoop Relieves the Pressure from Enterprise Systems [diagram: Operational Systems, Analytical Systems, Enterprise Users]
      • Uses: Data staging; Archive; Data transformation; Data exploration; Streaming, interactions
      • Keys for Production Success: 1. Reliability and DR; 2. Interoperability; 3. High performance; 4. Supports operations and analytics
  • 47. Reality 2: Architecture Matters for Success [diagram]
      • Foundation: Data protection & security; High performance; Multi-tenancy; Operational & analytical workloads; Open standards for integration
      • Outcomes: New applications; SLAs; Trusted information; Lower TCO
  • 48. The Power of the Open Source Community [diagram: Apache Hadoop and OSS ecosystem on the MapR Data Platform, with Management]
      • Category labels: Execution Engines; Data Governance and Operations; Batch; Streaming; NoSQL & Search; ML, Graph; SQL; Data Integration & Access; Workflow & Data Governance; Provisioning & coordination; Security
      • Components: YARN, MapReduce v1 & v2, Pig, Cascading, Spark, Spark Streaming, Storm*, HBase, Solr, Accumulo*, Mahout, MLLib, GraphX, Tez*, Hive, Impala, Shark, Drill*, Sentry*, Knox*, Oozie, Falcon*, ZooKeeper, Sqoop, Flume, HttpFS, Hue, Whirr, Juju, Savannah*
      • * Certification/support planned for 2014
  • 49. MapR Distribution for Hadoop [same ecosystem diagram as slide 48, plus the following capabilities]
      • Enterprise-grade: High availability; Data protection; Disaster recovery
      • Interoperability: Standard file access; Standard database access; Pluggable services; Broad developer support
      • Security: Enterprise security authorization; Wire-level authentication; Data governance
      • Operational: Ability to support predictive analytics, real-time database operations, and high arrival rate data
      • Multi-tenancy: Ability to logically divide a cluster to support different use cases, job types, user groups, and administrators
      • Performance: 2X to 7X higher performance; Consistent, low latency
  • 50. MapR: Best Solution for Customer Success [labels: Top Ranked; Exponential Growth; 500+ Customers; Premier Investors]
      • 3X bookings Q1 ‘13 – Q1 ‘14
      • 80% of accounts expand 3X
      • 90% software licenses
      • <1% lifetime churn
      • >$1B in incremental revenue generated by 1 customer
  • 51. MapR and Syncsort Reference Architecture [diagram]
      • Sources: Relational, SaaS, mainframe; Documents, emails; Log files, clickstreams; Blogs, tweets, link data
      • MapR Distribution for Hadoop: Batch (MR, Spark, Hive, Pig, …); Interactive (Impala, Drill, …); Streaming (Spark Streaming, Storm…)
      • MapR Data Platform: MapR-DB, MapR-FS
      • Other labels: Data marts; Data warehouse; Business Intelligence / Visualization
  • 52. Do You Know Syncsort?
      • Syncsort provides fast, secure, enterprise-grade software spanning “Big Iron to Big Data”
      • Fastest sort technology in the market: powering 50% of mainframes’ sort
      • A history of innovation: 25+ issued & pending patents
      • Large global customer base: 12,000+ deployments in 80 countries and serving 87 of the Fortune 100
      • First-to-market, fully integrated approach to Hadoop ETL
      • Top 7 contributor to Hadoop, based on number of lines of code changed in 2013
      • Our customers are achieving the impossible, every day!
      • [graphic label: Key Partners]
  • 53. The Hadoop Challenge: most organizations use Hadoop to Extract, Transform, Load [diagram: Collect; Process (Sort, Join, Aggregate, Copy, Merge); Distribute]
  • 54. Turning Hadoop into a Feature-rich ETL Solution (DMX-h ETL)
    Collect:
      • Broad-based connectivity with automated parallelism
      • Best-in-class mainframe data access & translation
    Process & Distribute:
      • No manual coding. GUI for developing & maintaining MR jobs
      • No code generation. Engine runs natively on each node
      • Develop & test locally in Windows; run natively on Hadoop
    Optimize & Secure:
      • Faster throughput per node
      • Full support for Kerberos & LDAP
      • Web-based monitoring console
      • Sort-work compression for storage savings
  • 55. A Roadmap to Hadoop Success [diagram labels: Cheap Storage; Offload Data Warehouse; Agile Data Exploration & Visualization; Next-gen Analytics; Solving The Intractable IT Problem; Enabling The Data-driven Organization]
  • 56. MapR + Syncsort Solutions
      • Data Warehouse Optimization: Shift ELT Workloads to Hadoop
      • Click-stream Analysis: Collect, Process & Analyze More Data from Your Website
      • Mainframe Offload: Access, Translate & Analyze Mainframe Data with Hadoop
  • 57. Q&A. Engage with us!
      1. Download the MapR Sandbox for Hadoop: www.mapr.com/sandbox
      2. Try Syncsort’s Hadoop ETL in the MapR Sandbox: www.syncsort.com/mapr
      3. Learn best practices for Hadoop ETL: www.mapr.com/EDH