Hadoop World 2011: Hadoop and Netezza Deployment Models and Case Study - Krishnan Parasuraman, Netezza & Greg Rokita, Edmunds

Hadoop is rapidly emerging as a viable platform for big data analytics. Thanks to early adoption by organizations like Yahoo and Facebook, and an active open source community, we have seen significant innovation around this platform. Given its support for relational constructs and a SQL-like query interface, many experts believe that Hadoop will subsume some data warehousing tasks at some point in the future. Even though Hadoop and parallel databases have some architectural similarities, they are designed to solve different problems. In this session, you will be introduced to Hadoop architecture, its salient differences from Netezza, and typical use cases. You will learn about common co-existence deployment models that Netezza's customers have put into practice to benefit from both technologies. You will also learn about Netezza's current support for Hadoop and its future strategy. If you have already deployed Hadoop within your organization, or are in the early stages of learning about and evaluating Hadoop, you will benefit from attending this session. It will give you an opportunity to interact with practitioners and industry experts who have successfully deployed Hadoop and Netezza within their organizations.

Hadoop World 2011: Hadoop and Netezza Deployment Models and Case Study - Krishnan Parasuraman, Netezza & Greg Rokita, Edmunds

  1. 1. Building Scalable Data Platforms: Hadoop and Netezza Deployment Models. Krishnan Parasuraman (Netezza), Greg Rokita (Edmunds.com)
  2. 2. Talking Points
       • Building scalable data platforms
         – Architectural considerations
       • Hadoop and Massively Parallel Databases
         – Similarities and differences
         – Usage patterns
       • Practitioner’s View Point
         – Edmunds.com data warehouse platform
       Hadoop World 2011
  3. 3. Building scalable data platforms: Typical Digital Media Information Processing Pipeline
       [Diagram: inputs (Clicks, Visits, Page Views, Likes, Tweets, Impressions, Locations) feed three stages: Real Time Decision Engine, Data Processing, Reporting.]
       Activities named on the slide: • Scoring • Yield optimization • Audience Analytics • Display Ads • Recommendation • Personalized Content • Correlate • Structure • Consolidate • Aggregate • Summarize • Ad-hoc analysis
  4. 4. Building scalable data platforms
       [Diagram: the same inputs (Clicks, Visits, Page Views, Likes, Tweets, Impressions, Locations) and the same three stages (Real Time Decision Engine, Data Processing, Reporting), now resting on a shared DATA PLATFORM.]
  5. 5. Building scalable data platforms
       [Diagram: a requirements matrix for the DATA PLATFORM beneath the three pipeline stages (Real Time Decision Engine, Data Processing, Reporting).]
       • Workloads: real time, high velocity, compute intensive, cached queries, high concurrency, transactional, linearly scalable, full table scans, low latency, disk bound, high throughput
       • Data: structured, mostly structured, some unstructured, unstructured, relational, key-value pairs, machine generated
       • Capability: stream processing, low disk I/O, in-DB computation, OLAP, memory resident, fast processing, SQL and MR, columnar, key-based lookups, low cost/TB, analytic libraries
  6. 6. Building scalable data platforms
       [Diagram: the matrix from slide 5, overlaid with candidate technologies: Hadoop, Massively Parallel DB, NoSQL Databases, In-Memory DB, Graph DB, OLAP DB, Plain Ole’ DB on steroids.]
  7. 7. Building scalable data platforms
       [Diagram: repeats slide 6.]
  8. 8. Myth: A single technology will meet all the considerations for our scalable data platform needs
       Best Practices
       • Workloads scale differently – Monolithic architectures don’t work
       • Minimize components – Data movement is painful
       • Understand tradeoffs – Performance vs. Price vs. Effort
       • Start with the core architecture and work in the edge cases
  9. 9. Massively parallel data warehouses
       [Diagram, top to bottom: SQL and MR; Hosts (host controllers); Network fabric; Massively parallel compute nodes, each with FPGA, CPU and Memory; Distributed Storage.]
  10. 10. Hadoop
       [Diagram, top to bottom: Map Reduce; Master Node (Job Tracker, Name Node); Network fabric; Parallel compute nodes, each with a Task Tracker and a Data Node; Distributed Storage.]
  11. 11. There are striking similarities…
       • Massive parallelism
       • Execute code & algorithms next to data
       • Scalable
       • Highly Available
       • Map Reduce
       [Diagram: the Hadoop layout from slide 10 (Job Tracker, Name Node, Task Tracker and Data Node pairs).]
  12. 12. But also key differences
       Hadoop
       • Schema on Read – Data loading is fast
       • Batch Mode data access
       • Lower cost of data storage
       • Process unstructured data
       • Data Loading = File copy (Look Ma, No ETL)
       Netezza
       • Optimized for Performance
       • Real time access, random reads, query optimizer, co-located joins
       • Hardware Accelerated queries
       • SQL and Map Reduce
  13. 13. These differences lead to opportunities for co-existence for Hadoop in a Netezza environment
       1. Scalable ETL engine
         – Complex data
         – Relationships not defined
         – Evolving schema
       2. Queryable Archive
         – Moving computation is cheaper than moving data
       3. Analytics sandbox
         – Exploratory analysis
  14. 14. Netezza-Hadoop: Deployment Patterns
       • Unstructured data: create context (classification, text mining); analyze
       • Semi-structured data: parse, aggregate; analyze, report
       • Structured data: active archival, long running queries; analyze, report
  15. 15. Pattern 1: Data Processing Engine (ETL)
       [Diagram: Raw Weblogs enter a Hadoop Cluster (NameNode, JobTracker, three DataNode/TaskTracker pairs) that feeds the Netezza Environment.]
  16. 16. Pattern 2: Low cost storage and dynamic provisioning
       [Diagram: Amazon Cloud (Amazon S3, Elastic MapReduce) and the Netezza Environment, connected by steps numbered 1 to 3.]
  17. 17. Pattern 3: Queryable Archive
       [Diagram: Data Sources and the Netezza Environment, connected by steps numbered 1 to 3.]
  18. 18. About Greg Rokita
       o Director, Software Architecture at Edmunds, Inc
       o M.S. in Computer Science, Stanford University
       o Research interests
         – Large scale programming paradigms
         – Domain specific Data Stores
         – Semi-structured data representation and search
       o Designs & Implementations of Core Frameworks
         – Publishing & Messaging infrastructure
         – Content & Digital Asset Management systems
         – Reviews & Ratings system
         – Search APIs
         – Big Data Analytics
       No part of this document or the information it contains may be used, or disclosed to any person or entity, for any purpose other than advancing the best interests of the Edmunds Inc., and any such disclosure requires the express approval of Edmunds Inc.
  19. 19. Edmunds.com and Scale
       o Premier online resource for automotive information, launched in 1995 as the first automotive information Web site
       o 15 million unique visitors
       o 210 million page views
       o 1 million+ new inventory items per day
       o 2 TB of new data every month
       o 40 node Hadoop cluster aggregating logs, advertising, vehicle, pricing, inventory and other data sets
  20. 20. Edmunds Proposition: We have developed an iterative approach to data warehouse development that has cut the time it takes us to deliver reports to our users from months to weeks.
  21. 21. How did we do it?
       o Process
       o Technology
       o Understanding of Value
  22. 22. Process: agile approach
       o Continuous and fast delivery of new features
       o Collaboration between users and developers
       o Make new data available quickly and inexpensively
       o Quick problem resolution
       o No wasting of an entire development cycle if data is not useful
       o Encouragement of exploration and creation of new applications
  23. 23. Process
       Pre-process: • Complete • Raw • Modeled as source data • Generically loaded • Quick turn-around • Low retention • Slower performance
       Post-process: • Filtered • Transformed • Modeled as star schema • Optimized • Slow turn-around • High retention • Fast performance
  24. 24. Post-Process
       [Flowchart: Sandbox: load data in an ad hoc manner, then use the pre-process data and prototype (change, enhance). Two decisions follow: "Data has value?" and "Schema is stable?"]
       • No: Discard
         – prevents shadow production schema (by users or developers)
         – little effort lost
       • Yes: Develop Optimized Pipeline
         – data is confirmed to be useful
         – effort is warranted
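The sandbox flow on slide 24 reads as a two-question decision. The sketch below is one possible reading of it, not the Edmunds implementation: the flowchart arrows did not survive in this transcript, so the ordering of the two questions is an assumption, and `NextStep` and `next_step` are hypothetical names introduced only for illustration.

```python
from enum import Enum


class NextStep(Enum):
    DISCARD = "discard"
    KEEP_PROTOTYPING = "keep prototyping"
    BUILD_OPTIMIZED_PIPELINE = "build optimized pipeline"


def next_step(data_has_value: bool, schema_is_stable: bool) -> NextStep:
    """Pick the next action for a data set sitting in the sandbox.

    Assumed ordering: value is judged first, schema stability second.
    """
    if not data_has_value:
        # Little effort is lost and no shadow production schema builds up.
        return NextStep.DISCARD
    if not schema_is_stable:
        # Keep changing and enhancing the prototype on pre-process data.
        return NextStep.KEEP_PROTOTYPING
    # Data is confirmed useful and the schema has settled: effort is warranted.
    return NextStep.BUILD_OPTIMIZED_PIPELINE
```

The point of the ordering is that the expensive star-schema work is only reached after both cheap checks pass.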
  25. 25. Technology
       Publishing System: • All Data • Generic • Thrift IDL with Versioning
       Hadoop Stack: • HBase raw data • Oozie job coordinator • HDFS storage of pre-optimized data and replica of RDBMS in files
       Netezza: • All data loaded from Hadoop in batch • Analysis and data exploration - use the speed and power • Report generation
  26. 26. Edmunds Publishing System
  27. 27. Generic flow for pre-process
       [Diagram: Producers (Inventory, Pricing, Vehicle, Dealer, Leads), Broker, Consumer, HBase, Map-Reduce, Netezza Action. A vertical "Generic" label marks the generic part of the flow.]
  28. 28. What architecture enables generic consumer?
       Components: Thrift, Camel, ActiveMQ
       Capabilities listed: o Message o Retries o Delivery o Throttling o Routing o Persistence o Versioning o Durability o Monitoring
  29. 29. Flexibility for Producers and Consumers: Support for Topologies
       Field (Example Values): Purpose
       o Environment (PROD, TEST, DEV): Promotion cycle of deployment units
       o Index (Blue, Green, Stage): Environment Index
       o Data Center (LAX1, EC2): The data center where deployment unit is located
       o Site (Edmunds, Insideline): Company’s Product
       o Application (HBase, Digital Asset Manager): Deployment Unit
  30. 30. Producer-Consumer matching
       [Diagram: a Producer publishes to a Topic; a Destination Interceptor in the Broker routes to the Consumer's Virtual Queue. Topic Name and Virtual Queue Name are both "Publish Inventory". Result: Match!]
       Producer
         – I am: Prod, Lax, Edmunds, Inventory
         – Send To: Prod, Test; Lax, EC2; Edmunds; Dealer
       Consumer
         – I am: Test, EC2, Edmunds, Dealer
         – Receive From: Prod; Lax, EC2; Edmunds; Inventory
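One way to read slide 30 is that a producer and a consumer match when each side's filter accepts the other side's identity. The sketch below illustrates that reading only. The rule itself is inferred from the slide, the Index field from slide 29 is left out because slide 30 does not show it, and `Identity`, `TopologyFilter` and `matches` are hypothetical names, not part of the Edmunds broker.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Identity:
    """The "I am" block: where a deployment unit lives."""
    environment: str
    data_center: str
    site: str
    application: str


@dataclass(frozen=True)
class TopologyFilter:
    """The "Send To" / "Receive From" block: accepted values per field."""
    environments: frozenset
    data_centers: frozenset
    sites: frozenset
    applications: frozenset

    def accepts(self, who: Identity) -> bool:
        return (who.environment in self.environments
                and who.data_center in self.data_centers
                and who.site in self.sites
                and who.application in self.applications)


def matches(producer: Identity, send_to: TopologyFilter,
            consumer: Identity, receive_from: TopologyFilter) -> bool:
    # Assumed rule: both directions must agree before a message is routed.
    return send_to.accepts(consumer) and receive_from.accepts(producer)


# Values taken from slide 30 (upper-cased for consistency).
producer = Identity("PROD", "LAX", "Edmunds", "Inventory")
send_to = TopologyFilter(frozenset({"PROD", "TEST"}), frozenset({"LAX", "EC2"}),
                         frozenset({"Edmunds"}), frozenset({"Dealer"}))
consumer = Identity("TEST", "EC2", "Edmunds", "Dealer")
receive_from = TopologyFilter(frozenset({"PROD"}), frozenset({"LAX", "EC2"}),
                              frozenset({"Edmunds"}), frozenset({"Inventory"}))

is_match = matches(producer, send_to, consumer, receive_from)
```

With the slide's values both checks pass, which agrees with the "Match!" label. A consumer in an environment the producer does not list would be filtered out.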
  31. 31. HBase: how to handle data generically
       Column Family (Columns): Role
       o Binary (Serialized Thrift Object): System of record
       o Hashcode (Hashcode of the Thrift Object): Check if updates are necessary (optimization)
       o Discrete (Thrift Object Field 1, Field 2, Field 3): Versioning at the most granular level
       o Type 2 (Start Date, End Date, List of fields): Versioning for dimension tables, optimized for lookups
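The Hashcode family on slide 31 exists to skip writes when nothing changed. A minimal sketch of that idea follows, under stated assumptions: a plain dict stands in for the HBase table, SHA-256 stands in for whatever hash the original system computed over the Thrift object, and `upsert_if_changed` is a hypothetical helper name.

```python
import hashlib


def upsert_if_changed(table: dict, row_key: str, serialized: bytes) -> bool:
    """Store the serialized object only when its hash differs from the stored one.

    Returns True when a write happened, False when the update was skipped.
    """
    digest = hashlib.sha256(serialized).hexdigest()
    row = table.get(row_key)
    if row is not None and row["hashcode"] == digest:
        # Same content as the system of record: no update necessary.
        return False
    # "binary" mirrors the Binary family, "hashcode" the Hashcode family.
    table[row_key] = {"binary": serialized, "hashcode": digest}
    return True


table = {}
first_write = upsert_if_changed(table, "modelyear:1368", b"celica-1993")
repeat_write = upsert_if_changed(table, "modelyear:1368", b"celica-1993")
changed_write = upsert_if_changed(table, "modelyear:1368", b"celica-1994")
```

Comparing a short digest is cheaper than deserializing and diffing the full object, which is presumably why the slide labels this column an optimization.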
  32. 32. Generic Thrift Persistence in HBase
       Column Name: Value
       o [ModelYear]|F:id|T:long|I:0: 1368
       o [ModelYear]|F:midYear|T:boolean|I:1: false
       o [ModelYear]|F:year|T:int|I:2: 1993
       o [ModelYear]|F:name|T:java.lang.String|I:4: Celica
       o [ModelYear]#[attributss][0]|F:_key|T:java.lang.Long: 64
       o [ModelYear]#[styles][3]#[attributeGroupsMap][10]#[attributes][0]|F:value|T:java.lang.String|I:1: Standard Sport
       o [ModelYear]#[styles][3]#[attributeGroupsMap][10]#[attributes][1]|F:value|T:java.lang.String|I:1: V:GT-S 2dr Hatchback
       o [ModelYear]#[styles][3]#[attributeGroupsMap][10]#[attributes][1]|F:id|T:long|I:2: 441
       o [ModelYear]#[styles][3]#[attributeGroupsMap][10]#[attributes][3]|F:value|T:java.lang.String|I:1: V:GT-S
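The column names on slide 32 look like a path through the object (`[Root]#[collection][key]...`) followed by field metadata (`|F:name|T:type|I:id`). The sketch below flattens a nested Python structure in that spirit. It is an illustration inferred from the examples, not the Edmunds serializer: it works on plain dicts and lists rather than Thrift objects, emits Python type names instead of Java ones, and omits the `I:` part because plain dicts carry no Thrift field ids. `flatten` and `_walk` are hypothetical names, and every list or map is assumed to contain structs (dicts) only.

```python
from typing import Any, Iterator, Tuple


def flatten(root: str, struct: dict) -> Iterator[Tuple[str, Any]]:
    """Yield (column_name, value) pairs for a nested struct."""
    yield from _walk(f"[{root}]", struct)


def _walk(path: str, struct: dict) -> Iterator[Tuple[str, Any]]:
    for field, value in struct.items():
        if isinstance(value, list):
            children = list(enumerate(value))   # list index becomes the key
        elif isinstance(value, dict):
            children = list(value.items())      # map key becomes the key
        else:
            # Scalar field: one column carrying the field name and type.
            yield f"{path}|F:{field}|T:{type(value).__name__}", value
            continue
        for key, child in children:
            # Descend into each contained struct, extending the path.
            yield from _walk(f"{path}#[{field}][{key}]", child)


# Scalar values come from slide 32; the "styles" entry is made up for illustration.
model_year = {
    "id": 1368,
    "midYear": False,
    "year": 1993,
    "name": "Celica",
    "styles": [{"name": "GT-S"}],
}
columns = dict(flatten("ModelYear", model_year))
```

Because every column name is self-describing, a single generic consumer can persist any message type without a per-type schema, which is the property the slide title points at.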
  33. 33. Netezza: Time is Money
       Compared to Oracle: Business Value
       o Up to 12x faster load times:
         – Can reload data more frequently
         – Failed workflows are no longer a big problem
         – Helps in transition to real time system: We can now create intraday reports for Leads!
       o Up to 400x faster query times:
         – More productive Business Intelligence
         – Queries that could ‘never’ finish in Oracle are now providing business value
  34. 34. Generic and reusable Oozie actions for Netezza
       o Oozie Load and Remove Action
       o Apache CLI
       o Nzload and Nzsql (provisioned on worker nodes using Chef)
  35. 35. Value
       o Data warehouse proves product value both internally and to our customers
       o Failing fast and quick turn-around allow us to know when we are building the right reporting and analytical products without a large up-front investment
       o By combining all data in a single system we are enabling new products to be developed that we previously could not
  36. 36. Building Scalable Data Platforms: Hadoop and Netezza Deployment Models. Krishnan Parasuraman (@kparasuraman), Greg Rokita (Edmunds.com)
