Using Hadoop to Expand Data Warehousing


Neustar is a fast-growing provider of enterprise services in telecommunications, online advertising, Internet infrastructure, and advanced technology. Neustar engaged Think Big Analytics to leverage Hadoop to expand its data analysis capacity. This session describes how Hadoop has expanded Neustar's data warehouse capacity, improved agility for data analysis, reduced costs, and enabled new data products. We look at the challenges and opportunities in capturing hundreds of terabytes of compact binary network data, ad hoc analysis, integration with a scale-out relational database, more agile data development, and building new products that integrate multiple big data sets.


  1. Using Hadoop to Expand Data Warehousing — Mike Peterson, VP of Platforms and Data Architecture, Neustar; Ron Bodkin, CEO and Founder, Think Big Analytics (ron.bodkin@thinkbiganalytics.com). June 13, 2012. Copyright © Think Big Analytics and Neustar Inc.
  2. Agenda: Overview, Technology, Process, Conclusion
  3. Big Data Highlights at Neustar — timeline 2010–2012: Hadoop cluster rollout at Quova; Hadoop cluster rollout with UltraDNS
  4. The Business Case — 3-year cost to store 100 TB of incremental data (millions US $): Hadoop $0.2; Netezza $6.3; Teradata $9.6; Oracle $9.6
  5. Big Data Warehouse Challenges and Goals:
     » Cost to store unstructured data
     » Integrate unstructured data with the EDW
     » Poor response time to changing BI needs
     » Predictive analytics based on data science
     » Data warehouse access for departments
     » Access to the cluster for all users
  6. Data Agility
     Classic warehouse: » ETL » Pre-parse all data » Normalize up front » Feed data marts » New ideas = IT projects » Aggregate/summarize to optimize
     Big data warehouse: » Store raw data » Parse only when proven » Approximate parse on demand » Analysis on demand » Provide ideas before projects
  7. Change to Technology Focus: » New data platforms unlock innovation » Not just package implementation » More open source technology » Rethink assumptions » Increase technology skills » Focus data teams
  8. Working Together: » Expertise in delivery » Trusted partner » Collaborative development » Open source leader » Invested in client success » Price/performance
  9. Technology
  10. Architecture Overview (diagram) — components: capture scripts feeding the cluster via rsync/scp; a master server running the HDFS NameNode, JobTracker, and Cronacle scheduler; a secondary NameNode; slave nodes each running HDFS, a TaskTracker, PostgreSQL, and Hive with UDFs; a PostgreSQL ETL server and a Postgres Hive metastore; a backup server; ad-hoc queries and BI on top; management, monitoring, and LDAP alongside
  11. Initial Hadoop Cluster — current configuration:
     » 40 servers
     » Hadoop and PostgreSQL data nodes
     » 2 × 12 cores
     » 64 GB memory
     » 24 × 3 TB SATA drives
     » Mixed nodes: RAID 6 storage
     » HDFS-only nodes: JBOD
     » 10 Gbit Ethernet
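A back-of-envelope check on what this hardware adds up to. The sketch below assumes all 40 servers contribute their full 24 × 3 TB of disk to HDFS with the default 3× replication; the slide notes some nodes use RAID 6, which would reduce the real figure.

```python
# Rough capacity implied by the cluster configuration above.
# Assumption: every node contributes all 24 x 3 TB drives to HDFS.
servers = 40
drives_per_server = 24
tb_per_drive = 3

raw_tb = servers * drives_per_server * tb_per_drive  # total raw disk
replication = 3                                      # HDFS default replication
usable_tb = raw_tb / replication                     # effective capacity

print(raw_tb, usable_tb)  # 2880 960.0
```

At roughly 960 TB usable under these assumptions, the 45% storage utilization mentioned later would leave comfortable headroom for the 20% free-space reserve in the storage policy.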
  12. System Scale:
     » Query volume: light but ramping — 10,000 MapReduce processes/day
     » Ingesting over 40 B rows a day: 1.5 TB with 7× compression
     » Storage utilization at 45%
     » Core utilization spikes when running machine learning algorithms
     » 100% capture of multiple large product data sets
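The ingest figures above imply some useful derived numbers. The sketch below assumes decimal terabytes and exactly 40 billion rows; the slide says "over 40 B", so treat the per-row figure as an upper bound.

```python
# Daily ingest arithmetic from the figures on the slide above.
rows_per_day = 40e9      # "over 40B rows a day" (assumed exactly 40B here)
compressed_tb = 1.5      # stored on disk per day
compression = 7          # "7x compression"

raw_tb_per_day = compressed_tb * compression              # raw data per day
stored_bytes_per_row = compressed_tb * 1e12 / rows_per_day  # on-disk cost/row

print(raw_tb_per_day, stored_bytes_per_row)  # 10.5 37.5
```

Roughly 10.5 TB of raw data per day compressing to about 37 bytes stored per row is consistent with the compact binary network records described on the next slides.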
  13. Software Choices by Layer — current configuration (Cronacle is from Redwood Software; GridSQL is outside the Hadoop ecosystem; moves planned to Oracle JDK 6 and Red Hat):
     » Data ingestion: custom script scheduler, FTP
     » Data transformation & aggregation: Hive
     » Data publication: Hive
     » Workflow management: Cronacle
     » Management & monitoring: Ganglia
     » Cluster security: LDAP
     » Low-latency data access: GridSQL
     » Resource management: Fair Scheduler
     » Platform software: Hortonworks Data Platform, Oracle JDK 6, RHEL 6
     » Networking: 10 GigE
     » Infrastructure: HP servers
     » Cluster provisioning: on-premise shared Hadoop and GridSQL
  14. Massive Binary Format Data Query — e.g. SELECT * FROM datafile WHERE dt=2012-06-15; over a large partitioned binary file (100s of TBs of compressed binary records):
     1. A binary InputFormat parses the file into records
     2. A binary SerDe parses records into fields lazily, via a bean Object Inspector
     3. Fields are determined by Java beans
     » Parse on the fly: don't duplicate or lose the original data
     » Reused an open source parser with custom extensions
     » Optimized with profiling; lazy parsing minimizes object creation
     » CPU bound due to parsing the compact structure
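The lazy-field-parsing idea on this slide can be sketched outside Hive. The real implementation is a Java SerDe over Neustar's own record format; the Python sketch below, including its field layout, is purely illustrative: a field is decoded from the raw bytes only when a query first touches it, and untouched fields cost nothing.

```python
import struct

# Illustrative fixed record layout (an assumption, not Neustar's format):
# field name -> (byte offset, struct format)
LAYOUT = {"ts": (0, "<I"), "ip": (4, "<I"), "port": (8, "<H")}

class LazyRecord:
    """Decode a field only on first access, mirroring the lazy SerDe:
    a SELECT that reads one field never pays to parse the others."""
    def __init__(self, raw: bytes):
        self._raw = raw
        self._cache = {}

    def __getitem__(self, field):
        if field not in self._cache:  # decode at most once per field
            offset, fmt = LAYOUT[field]
            self._cache[field] = struct.unpack_from(fmt, self._raw, offset)[0]
        return self._cache[field]

# One packed record: timestamp, IPv4 address as an int, port
raw = struct.pack("<IIH", 1339545600, 0x7F000001, 53)
rec = LazyRecord(raw)
print(rec["port"])  # 53 -- only "port" is ever decoded
```

The same shape explains why the slide calls the workload CPU bound: with records this compact, I/O is cheap relative to the per-field decode work, so minimizing object creation per row dominates performance.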
  15. Disk Failures and Recovery: » 5 drives failed in 9 months; 3 were DOA » Hadoop handled failure perfectly » RAID 6 PostgreSQL and GridSQL working fine
  16. Storage Policy: » Storage still isn't free! » Newer data = 3 replicas » Older data = 2 replicas » Data retention = 1 year » Free space = 20% reserve
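A tiered policy like this is typically enforced by a periodic job that walks dated partitions and lowers their replication as they age. The slide gives the tiers but not the cutoff at which data counts as "older", so the 90-day threshold below is a hypothetical; only the 1-year retention comes from the slide.

```python
from datetime import date, timedelta
from typing import Optional

OLD_AFTER_DAYS = 90    # hypothetical cutoff for dropping to 2 replicas
RETENTION_DAYS = 365   # "data retention = 1 year" from the slide

def replication_for(ingest_date: date, today: date) -> Optional[int]:
    """Target HDFS replication factor for a dated partition,
    or None if it has aged past retention and should be deleted."""
    age_days = (today - ingest_date).days
    if age_days > RETENTION_DAYS:
        return None                               # past retention: delete
    return 2 if age_days > OLD_AFTER_DAYS else 3  # older data = 2 replicas

today = date(2012, 6, 13)
print(replication_for(today - timedelta(days=10), today))   # 3
print(replication_for(today - timedelta(days=200), today))  # 2
print(replication_for(today - timedelta(days=400), today))  # None
```

The resulting factor would be applied per path with `hadoop fs -setrep`, which changes replication on existing HDFS files without rewriting them.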
  17. Process
  18. The Big Data Journey:
     Phase 1: Enterprise cluster dev & deployment; ingestion; cost savings
     Phase 2: Comprehensive ingestion by service; data science; monetization offerings
     Phase 3: Develop big data capabilities » New applications » Data science » Advanced analytics » Cost savings » Data services strategy
  19. Organizational Approach: » Executive support » Roles and organization » Data governance » Outreach » Training » Product definition » Data sharing » Data science / analytics » External data acquisition » Internal data acquisition » Platform build
  20. Organizational Investment:
     » Database Administrator → Big Data Administrator
     » Data Developers → Big Data Engineer (new workloads & tools; distributed development)
     » Data Architect → Big Data Architect (big data modeling; varying data structures)
     » Data Scientist (math, programming & analysis)
  21. Training to Match the Organization — Fundamentals Track, Tools Track, Applications Track. Guiding principles: • Making Hadoop and big data "relevant" to the company and each job • Cross-train math and engineering skill bases • Lab-team exposure to new and emerging technology
  22. Building Data Science Capability
  23. Use of Capability (decision-tree diagram) — technology selection criteria:
     » Structure: structured data only vs complex structure
     » Compute scale: under vs at least 10 PetaFLOP of calculations
     » Data volume: under 10 TB → EDW; 10–100 TB → depends; 100 TB or more → big data platform
     » Latency: under a minute → EDW / fast analytics; a minute or more → batch
     » Analysis type: simple vs parallel/complex structural; tightly integrated with existing data → existing EDW platform
     Use cases shown: basic reporting, data ingestion, batch data processing, production analysis, data enrichment.
     Data science trends: • Compute model scores faster • Analyze full data sets • Incorporate new data • Build new services from data
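The garbled decision tree on this slide can be approximated as a small routing function. The 10 TB, 100 TB, and one-minute thresholds appear on the slide; the exact routing logic below is my reading of the diagram and should be treated as an assumption.

```python
# Toy router for the platform-selection criteria above.
# Thresholds come from the slide; the routing is an assumed reconstruction.
def choose_platform(data_tb: float, latency_seconds: float) -> str:
    if data_tb < 10:
        return "EDW"        # small data stays in the warehouse
    if data_tb >= 100:
        return "Hadoop"     # 100 TB or more: big data platform
    # 10-100 TB: required latency decides
    return "EDW" if latency_seconds < 60 else "Hadoop"

print(choose_platform(5, 3600))   # EDW
print(choose_platform(500, 30))   # Hadoop
print(choose_platform(50, 30))    # EDW
```

A real version would also weigh the other criteria the slide lists, such as data structure, compute scale, and how tightly the workload integrates with existing warehouse data.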
  24. Conclusion
  25. Getting Value from Big Data — key takeaways: » Expand warehousing capability with Hadoop » Enable data science to create new value » Organizational change is a journey
  26. Thank You — Mike Peterson, VP of Platforms and Data Architecture, Neustar; Ron Bodkin, CEO and Founder, Think Big Analytics (ron.bodkin@thinkbiganalytics.com)
