Hadoop World 2011: Building Scalable Data Platforms; Hadoop & Netezza Deployment Models
 


Hadoop has rapidly emerged as a viable platform for Big Data analytics. Many experts believe Hadoop will subsume many of the data warehousing tasks presently done by traditional relational systems. In this presentation, you will learn about the similarities and differences between Hadoop and parallel data warehouses, and typical best practices. Edmunds will discuss how they increased delivery speed, reduced risk, and achieved faster reporting by combining ELT and ETL. For example, Edmunds ingests raw data into Hadoop and HBase, then reprocesses the raw data in Netezza. You will also learn how Edmunds uses prototyping in Netezza to let the company’s Analytics Team work on nearly raw data.

Upload Details

Uploaded as Microsoft PowerPoint

Usage Rights

© All Rights Reserved


    Presentation Transcript

    • Building Scalable Data Platforms: Hadoop and Netezza Deployment Models. Krishnan Parasuraman (Netezza), Greg Rokita (Edmunds.com)
    • Talking Points
      • Building scalable data platforms
        – Architectural considerations
      • Hadoop and Massively Parallel Databases
        – Similarities and differences
        – Usage patterns
      • Practitioner’s View Point
        – Edmunds.com data warehouse platform
    • Building scalable data platforms: Typical Digital Media Information Processing Pipeline
      [Diagram: input events (Clicks, Visits, Page Views, Likes, Tweets, Impressions, Locations) feed Data Processing, a Real Time Decision Engine, and Reporting. Activities labeled on the diagram: Scoring, Yield optimization, Audience Analytics, Display Ads, Recommendation, Personalized Content, Correlate, Structure, Consolidate, Aggregate, Summarize, Ad-hoc analysis.]
    • Building scalable data platforms
      [Diagram: the same input events (Clicks, Visits, Page Views, Likes, Tweets, Impressions, Locations) with Data Processing, Real Time Decision Engine, and Reporting, labeled DATA PLATFORM.]
    • Building scalable data platforms: Workloads, Data, and Capability across Real Time Decision Engine, Data Processing, and Reporting
      [Grid: four columns under the three stage headings; each row below lists the columns left to right, separated by "|".]
      • Workloads: Real Time, Transactional, Low Latency, High Concurrency | High Velocity, Linearly Scalable, Disk bound, High Throughput | Compute intensive, Full table scans, Disk bound | Cached Queries, High Concurrency
      • Data: Structured, Un-Structured, Key-Value pairs | Structured, Un-Structured, Machine Generated | Mostly Structured, Some unstructured | Structured, Relational
      • Capability: Stream Processing, Memory resident, Key based lookups | Low Disk I/O, Fast Processing, Low Cost/TB | In-DB computation, SQL and MR, Analytic Libraries | OLAP, Columnar
    • Building scalable data platforms
      [Grid: the same Workloads, Data, and Capability grid as the previous slide, overlaid with technology labels: Hadoop; Massively Parallel DB; NoSQL Databases; In-Memory DB; Graph DB; Plain Ole’ DB on steroids.]
    • Myth: A single technology will meet all the considerations for our scalable data platform needs
      Best Practices
      • Workloads scale differently – Monolithic architectures don’t work
      • Minimize components – Data movement is painful
      • Understand tradeoffs – Performance, Price, Effort
      • Start with the core architecture and work in the edge cases
    • Massively parallel data warehouses
      [Diagram: SQL and MR run on the Hosts (host controllers), connected through a Network fabric to massively parallel compute nodes, each with FPGA, CPU, and Memory, over Distributed Storage.]
    • Hadoop
      [Diagram: Map Reduce runs on a Master Node (Job Tracker, Name Node), connected through a Network fabric to parallel compute nodes, each with a Task Tracker and a Data Node, over Distributed Storage.]
    • There are striking similarities…
      • Massive parallelism
      • Execute code & algorithms next to data
      • Scalable
      • Highly Available
      • Map Reduce
    • But also key differences
      • Hadoop
        – Schema on Read – Data loading is fast
        – Batch Mode data access
        – Lower cost of data storage
        – Process unstructured data
      • Netezza
        – Optimized for Performance
        – Real time access, random reads, query optimizer, co-located joins
        – Hardware Accelerated queries
      • Callouts on the slide: Data Loading = File copy; SQL and Map Reduce; Look Ma, No ETL
    • These differences lead to opportunities for co-existence for Hadoop in a Netezza environment
      1. Scalable ETL engine
         – Complex data
         – Relationships not defined
         – Evolving schema
      2. Queryable Archive
         – Moving computation is cheaper than moving data
      3. Analytics sandbox
         – Exploratory analysis
    • Netezza-Hadoop: Deployment Patterns
      • Unstructured data (classification, text mining): Create context; Analyze
      • Semi-structured data: Parse, aggregate; Analyze, report
      • Structured data: Active archival, Long running queries; Analyze, report
    • Pattern 1: Data Processing Engine (ETL)
      [Diagram: Raw Weblogs, a Hadoop Cluster (NameNode, JobTracker, and three DataNode/TaskTracker workers), and the Netezza Environment.]
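As a rough illustration of Pattern 1, the sketch below parses raw weblog lines in a map step, sums page hits in a reduce step, and emits delimited rows that a warehouse bulk loader could pick up. It runs in a single process rather than on a cluster. The log layout (Common Log Format), the `|` delimiter, the sample lines, and every function name are assumptions made for this example; the slides do not specify any of them.

```python
import re
from collections import Counter

# Assumed Common Log Format; the slides do not say how the weblogs are laid out.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<day>[^:]+):[^\]]+\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+'
)

def map_line(line):
    """Map step: emit ((day, path), 1) for each parseable request with status 200."""
    m = LOG_PATTERN.match(line)
    if m and m.group("status") == "200":
        yield (m.group("day"), m.group("path")), 1

def reduce_counts(pairs):
    """Reduce step: sum the counts for each (day, path) key."""
    totals = Counter()
    for key, count in pairs:
        totals[key] += count
    return totals

def to_delimited(totals, delim="|"):
    """Format the aggregated rows as delimited text for a bulk load."""
    return [delim.join([day, path, str(n)])
            for (day, path), n in sorted(totals.items())]

# Hypothetical sample input: two good hits, one 404, one unparseable line.
raw = [
    '10.0.0.1 - - [07/Nov/2011:10:00:01 -0800] "GET /toyota/celica HTTP/1.1" 200 512',
    '10.0.0.2 - - [07/Nov/2011:10:00:05 -0800] "GET /toyota/celica HTTP/1.1" 200 512',
    '10.0.0.3 - - [07/Nov/2011:10:01:00 -0800] "GET /missing HTTP/1.1" 404 0',
    'not a log line',
]
pairs = [kv for line in raw for kv in map_line(line)]
rows = to_delimited(reduce_counts(pairs))
print(rows)
```

Lines that fail to parse are skipped rather than rejected, which is one way to cope with the "complex data" and "evolving schema" cases the earlier slide assigns to Hadoop.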
    • Pattern 2: Low cost storage and dynamic provisioning
      [Diagram: Amazon Cloud (Amazon S3, Elastic MapReduce) and the Netezza Environment, with numbered steps 1 to 3.]
    • Pattern 3: Queryable Archive
      [Diagram: Data Sources and the Netezza Environment, with numbered steps 1 to 3.]
    • Edmunds.com and Scale
      o Premier online resource for automotive information, launched in 1995 as the first automotive information Web site
      o 15 million unique visitors
      o 210 million page views
      o 1 million+ new inventory items per day
      o 2 TB of new data every month
      o 40 node Hadoop cluster aggregating logs, advertising, vehicle, pricing, inventory and other data sets
      No part of this document or the information it contains may be used, or disclosed to any person or entity, for any purpose other than advancing the best interests of the Edmunds.com, Inc., and any such disclosure requires the express approval of Edmunds.com, Inc.
    • Edmunds Proposition: We have developed an iterative approach to data warehouse development that has dropped the time it takes for us to deliver reports to our users from months to weeks.
    • How did we do it?
      o Process
      o Technology
      o Understanding of Value
    • Process: agile approach
      o Continuous and fast delivery of new features
      o Collaboration between users and developers
      o Make new data available quickly and inexpensively
      o Quick problem resolution
      o No wasting of entire development cycle if data is not useful
      o Encouragement of exploration and creation of new applications
    • Process
      Pre-process:
      • Complete
      • Raw
      • Modeled as source data
      • Generically loaded
      • Quick turn-around
      • Low retention
      • Slower performance
      Post-process:
      • Filtered
      • Transformed
      • Modeled as star schema
      • Optimized
      • Slow turn-around
      • High retention
      • Fast performance
    • Post-Process and Sandbox
      [Diagram: decision flow]
      • Sandbox: use pre-process data; load data in ad-hoc manner; prototype
      • Data has value?
        – No: Discard. Prevents shadow production (by users or developers); little effort lost
        – Yes: check whether the schema is stable
      • Schema is stable?
        – No: Change schema, enhance
        – Yes: Develop Optimized Pipeline (Post-Process). Data is confirmed to be useful; effort is warranted
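The decision flow on this slide can be written down as a small function. This is only one reading of the flattened diagram: the two questions ("Data has value?", "Schema is stable?") come from the slide, while the function name, the boolean inputs, and the returned strings are made up for illustration.

```python
def next_step(data_has_value: bool, schema_is_stable: bool) -> str:
    """Pick the next action for a prototype built on pre-process data."""
    if not data_has_value:
        # Discard early: little effort is lost and no shadow production appears.
        return "discard"
    if not schema_is_stable:
        # Stay in the sandbox and keep changing the schema.
        return "change schema"
    # Value is confirmed and the schema has settled, so the effort is warranted.
    return "develop optimized pipeline"

print(next_step(data_has_value=True, schema_is_stable=False))
```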
    • Technology Stack
      • Publishing System
        – All Data
        – Generic
        – Thrift IDL with Versioning
      • Hadoop
        – HBase raw data
        – Oozie job coordinator
        – HDFS storage of pre and optimized data, replica of RDBMS in files
      • Netezza
        – All data loaded from Hadoop in batch
        – Analysis and data exploration - use the speed and power
        – Report generation
    • Edmunds Publishing System
    • Generic flow for pre-process
      [Diagram: Producers (Inventory, Pricing, Vehicle, Dealer, Leads), Broker, Consumer, HBase, Map-Reduce, Netezza Action, with a vertical label "Generic".]
    • What architecture enables generic consumer?
      • Components: Thrift, Camel, ActiveMQ
      • Features listed: Message, Retries, Delivery, Throttling, Routing, Persistence, Versioning, Durability, Monitoring
    • Flexibility for Producers and Consumers: Support for Topologies
      Field (example values): purpose
      • Environment (PROD, TEST, DEV): Promotion cycle of deployment units
      • Index (Blue, Green, Stage): Environment Index
      • Data Center (LAX1, EC2): The data center where deployment unit is located
      • Site (Edmunds, Insideline): Company’s Product
      • Application (HBase, Digital Asset Manager): Deployment Unit
    • Producer-Consumer matching
      [Diagram: a Producer and a Consumer connected through the Broker’s Destination Interceptor, marked "Match!"]
      • Producer
        – Topic Name: Publish Inventory
        – I am: Prod, Lax, Edmunds, Inventory
        – Send To: Prod, Test; Lax, EC2; Edmunds; Dealer
      • Consumer
        – Virtual Queue Name: Publish Inventory
        – I am: Test, EC2, Edmunds, Dealer
        – Receive From: Prod; Lax, EC2; Edmunds; Inventory
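One plausible reading of the matching slide is that a producer and a consumer are paired when each side's "I am" values fall inside the other side's "Send To" or "Receive From" lists. The sketch below encodes that reading. The `Endpoint` class, the field names, and the rule itself are assumptions for illustration, not the Destination Interceptor's actual logic, and the Index field from the topology table is left out because the diagram does not show it.

```python
from dataclasses import dataclass

# Topology fields compared in this sketch (Index is omitted, see above).
FIELDS = ("environment", "data_center", "site", "application")

@dataclass
class Endpoint:
    identity: dict  # "I am": one value per topology field
    accepts: dict   # "Send To" or "Receive From": allowed values per field

def allows(accepts, identity):
    """True when every identity value is in the allowed set for its field."""
    return all(identity[f] in accepts[f] for f in FIELDS)

def matches(producer, consumer):
    """Both directions must agree: producer sends to consumer, consumer receives from producer."""
    return (allows(producer.accepts, consumer.identity)
            and allows(consumer.accepts, producer.identity))

# Values transcribed from the slide's example.
producer = Endpoint(
    identity={"environment": "Prod", "data_center": "Lax",
              "site": "Edmunds", "application": "Inventory"},
    accepts={"environment": {"Prod", "Test"}, "data_center": {"Lax", "EC2"},
             "site": {"Edmunds"}, "application": {"Dealer"}},
)
consumer = Endpoint(
    identity={"environment": "Test", "data_center": "EC2",
              "site": "Edmunds", "application": "Dealer"},
    accepts={"environment": {"Prod"}, "data_center": {"Lax", "EC2"},
             "site": {"Edmunds"}, "application": {"Inventory"}},
)
# Hypothetical consumer in an environment the producer does not send to.
dev_consumer = Endpoint(
    identity={**consumer.identity, "environment": "Dev"},
    accepts=consumer.accepts,
)

print(matches(producer, consumer))      # the slide's "Match!" case
print(matches(producer, dev_consumer))  # rejected: Dev is not in Send To
```

Keeping the rule symmetric means either side can narrow the pairing without the other side being redeployed.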
    • HBase: how to handle data generically
      • Column family Binary
        – Serialized Thrift Object: system of record
        – Hashcode of the Thrift Object: check if updates are necessary (optimization)
      • Column family Discrete
        – Thrift Object Field 1, Field 2, Field 3: versioning at the most granular level for lookups
      • Column family Type 2
        – Start Date, End Date, List of fields: versioning for optimized dimension tables
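Two ideas on this slide lend themselves to a short sketch: using a hash of the serialized object to skip unnecessary updates, and keeping start and end dates per version in the style of a Type 2 dimension. The code below uses an in-memory dict as a stand-in for HBase and SHA-256 in place of the unspecified "hashcode". The function names and the record layout are hypothetical.

```python
import hashlib
from datetime import date

def content_hash(serialized: bytes) -> str:
    """Stand-in for the slide's 'hashcode of the Thrift object' (SHA-256 is an assumption)."""
    return hashlib.sha256(serialized).hexdigest()

def upsert(store, key, serialized, today):
    """Write a new version only when the content changed.

    store maps a row key to a list of version dicts, oldest first.
    Returns True when a new version was written, False when skipped.
    """
    versions = store.setdefault(key, [])
    new_hash = content_hash(serialized)
    if versions and versions[-1]["hash"] == new_hash:
        return False  # unchanged: the hash check avoids a needless update
    if versions:
        versions[-1]["end_date"] = today  # close the current version
    versions.append({
        "binary": serialized,  # serialized object kept as the system of record
        "hash": new_hash,
        "start_date": today,
        "end_date": None,      # open-ended until a newer version arrives
    })
    return True

store = {}
first = upsert(store, "model-year:1368", b"celica-1993", date(2011, 11, 1))
repeat = upsert(store, "model-year:1368", b"celica-1993", date(2011, 11, 2))
changed = upsert(store, "model-year:1368", b"celica-1993-gts", date(2011, 11, 3))
print(first, repeat, changed, len(store["model-year:1368"]))
```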
    • Generic Thrift Persistence in HBase
      Column Name → Value
      • [ModelYear]|F:id|T:long|I:0 → 1368
      • [ModelYear]|F:midYear|T:boolean|I:1 → false
      • [ModelYear]|F:year|T:int|I:2 → 1993
      • [ModelYear]|F:name|T:java.lang.String|I:4 → Celica
      • [ModelYear]#[attributss][0]|F:_key|T:java.lang.Long → 64
      • [ModelYear]#[styles][3]#[attributeGroupsMap][10]#[attributes][0]|F:value|T:java.lang.String|I:1 → Standard Sport
      • [ModelYear]#[styles][3]#[attributeGroupsMap][10]#[attributes][1]|F:value|T:java.lang.String|I:1 → V:GT-S 2dr Hatchback
      • [ModelYear]#[styles][3]#[attributeGroupsMap][10]#[attributes][1]|F:id|T:long|I:2 → 441
      • [ModelYear]#[styles][3]#[attributeGroupsMap][10]#[attributes][3]|F:value|T:java.lang.String|I:1 → V:GT-S
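The column names on this slide appear to encode the path through a nested object, followed by the field name and type. The sketch below imitates that layout loosely for plain Python dicts and lists. It is not Edmunds' serializer: it uses Python type names instead of Java ones, leaves out the `I:` part because plain dicts carry no Thrift field ids, and assumes that every list or map entry is itself a struct-like dict. The `flatten` function and the sample data are hypothetical.

```python
def flatten(struct, path):
    """Yield (column_name, value) pairs for a nested dict.

    Scalars become '<path>|F:<field>|T:<type>' columns. Lists and dicts are
    treated as containers whose entries are nested structs, and each entry
    adds a '#[field][index-or-key]' segment to the path.
    """
    for field, value in struct.items():
        if isinstance(value, list):
            children = enumerate(value)
        elif isinstance(value, dict):
            children = value.items()
        else:
            yield f"{path}|F:{field}|T:{type(value).__name__}", value
            continue
        for key, child in children:
            yield from flatten(child, f"{path}#[{field}][{key}]")

# Hypothetical object, loosely modeled on the slide's ModelYear example.
model_year = {
    "id": 1368,
    "midYear": False,
    "year": 1993,
    "name": "Celica",
    "styles": [{"id": 441, "value": "GT-S"}],
}
columns = dict(flatten(model_year, "[ModelYear]"))
for name, value in columns.items():
    print(name, "=>", value)
```

Because each column name is self-describing, a generic consumer can persist any object type without a per-type schema, which seems to be the point of the slide.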
    • Netezza: Time is Money (compared to Oracle, with business value)
      • Up to 12x faster load times
        – Can reload data more frequently
        – Failed workflows are no longer a big problem
        – Helps in transition to real time system: We can now create intraday reports for Leads!
      • Up to 400x faster query times
        – More productive Business Intelligence
        – Queries that could ‘never’ finish in Oracle are now providing business value
    • Generic and reusable Oozie actions for Netezza
      • Oozie Load and Remove Action
      • Apache CLI
      • Nzload and Nzsql (provisioned on worker nodes using Chef)
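A reusable load action mostly comes down to assembling an `nzload` command line from a few parameters. The sketch below only builds and prints such a command and does not run it. The slide mentions Apache CLI, so Edmunds' action is probably Java; Python is used here purely for illustration. The host, database, user, table, and file names are hypothetical, the password is assumed to be supplied outside the command line, and the option names should be checked against the `nzload` documentation for the installed Netezza release.

```python
import shlex

def nzload_command(host, database, user, table, data_file, delimiter="|"):
    """Build an nzload argument list (a sketch; verify options for your release)."""
    return [
        "nzload",
        "-host", host,
        "-db", database,
        "-u", user,
        "-t", table,
        "-df", data_file,
        "-delim", delimiter,
    ]

# Hypothetical values for illustration only.
cmd = nzload_command("nz01", "warehouse", "loader", "inventory", "/tmp/inventory.psv")
print(shlex.join(cmd))  # shell-quoted form, e.g. for logging from a workflow action
```

Returning a list instead of a single string keeps the delimiter and file path from being re-parsed by a shell if the command is later passed to a process launcher.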
    • Value
      o Data warehouse proves product value both internally and to our customers
      o Failing fast and quick turn-around allow us to know when we are building the right reporting and analytical products without a large up-front investment
      o By combining all data in a single system we are enabling new products to be developed that we previously could not
    • Building Scalable Data Platforms: Hadoop and Netezza Deployment Models. Krishnan Parasuraman (@kparasuraman), Greg Rokita (Edmunds.com)