Integrating Hadoop in Your Existing DW and BI Environment

Integrating Hadoop in your existing data warehouse and business intelligence environment. Speakers: Jeff Hammerbacher (Cloudera) and Anil Madan (eBay).

Recording of webinar on https://www1.gotomeeting.com/register/515000760

Published in: Technology


  4. 4. Presentation Outline
     • 1. The standard model
     • 2. The 3 stages of Hadoop adoption
     • 3. Cloudera partnerships
     • 4. Analytics at eBay
     • Questions and Discussion
     Wednesday, November 17, 2010
  5. 5. 1. The Standard Model: Data Warehousing and Business Intelligence
  6. 6. Diagram labels: Application Database Application Requests
  7. 7. Diagram labels: Application Database Application Requests Data Warehouse
  8. 8. Diagram labels: Application Database Application Requests Data Warehouse ETL
  9. 9. Diagram labels: Application Database Application Requests Data Warehouse ETL Business Intelligence
  10. 10. Diagram labels: Application Database Application Requests Data Warehouse ETL Business Intelligence Analytics
  11. 11. 2. The 3 Stages of Hadoop Adoption
  12. 12. Stage 1: Off the Critical Path
  13. 13. Stage 1: Copy or Archive. Diagram labels: Application Database Application Requests Data Warehouse ETL Business Intelligence Analytics Hadoop
  14. 14. Stage 1: Add Unstructured Data. Diagram labels: Application Database Application Requests Data Warehouse ETL Business Intelligence Analytics Hadoop
  15. 15. Stage 1: Consolidate Multiple Data Warehouses. Diagram labels: Application Database Data Warehouse ETL Hadoop Application Database Data Warehouse ETL
  16. 16. Stage 2: On the Critical Path
  17. 17. Stage 2: Structure and Store. Diagram labels: Application Database Application Requests Data Warehouse Business Intelligence Analytics Hadoop
  18. 18. Stage 3: Ad Hoc Query Support
  19. 19. Diagram labels: Application Database Application Requests Data Warehouse Business Intelligence Analytics Hadoop + Hive Business Intelligence Analytics
  20. 20. Cloudera’s Distribution for Hadoop: The Industry-leading Hadoop Distribution
  21. 21. 3. Cloudera Partnerships
  22. 22. Cloudera Partnerships: Cloud, Hardware, and OS
      • Processor: AMD, Intel
      • Server: Acer, HP, Supermicro
      • OS: Canonical
      • Cloud: VMware vCloud
      • CDH runs on AWS and Rackspace Cloud as well
  23. 23. Cloudera Partnerships: Data Integration
      • Informatica
      • Talend
      • Pentaho Data Integration
  24. 24. Cloudera Partnerships: Database
      • Aster Data
      • Greenplum
      • Membase
      • Netezza
      • Quest Software (OraOop)
      • Teradata
      • Vertica
  25. 25. Cloudera Partnerships: Business Intelligence
      • Jaspersoft
      • MicroStrategy
      • Pentaho BI Suite
  26. 26. 4. Analytics at eBay
  27. 27. eBay’s Data Scale
      • eBay manages …
        • Over 90 million active users worldwide
        • Over 220 million items for sale
        • Over 10 billion URL requests per day
      • … in a dynamic environment
        • Tens of new features each week
        • Roughly 10% of items are listed or ended every day
      • Collect Everything
        • eBay processes 40TB of new, incremental data per day
        • eBay analyzes 40PB of data per day
        • Store every historical item and purchase
      eBay has one of the largest EDW systems and is building one of the world’s largest Hadoop clusters
  28. 28. Where it fits in our Data Platform…
  29. 29. Integration into Existing Warehouse. Diagram labels: Click Stream EDW Images Search Indices Analytics Reporting Algorithmic Models Acquisition Item Description Data Acquisition BI Generation Insight Delivery
  30. 30. Data Sourcing Patterns
      Columns: Source, Preparation Format, Pattern / Learning
      • Click Stream (Session, Event, Session Container)
        • Session/Event: streamed as Gzip/Binary, prepared as LZO/Text. Build an index and use LzoTextInputFormat for splits.
        • Session Container: a join of Session and corresponding Event data, prepared as SequenceFiles. Secondary sort with reduce-side join.
      • EDW (Item, Transaction, User, Feedback, Bids)
        • Incremental feed streamed and maintained as GZIP/Text. Smaller data set, keep it in the original format.
        • Prepare a snapshot as SequenceFile. Rebuild daily snapshot with previous snapshot and incremental day’s data.
        • Build a Hive table on snapshot data. Create external Hive table which points to SequenceFile.
      • HBase
        • a) Leverage TotalOrderPartitioner with RandomSamplers to identify partition ranges for reducers.
        • b) Create HBase regions using HFile.
        • c) Update RegionServers using ruby script loadtable.rb.
      • Learning
        • a) Incremental data not temporal/sparse, hence not suitable as versions in a column-oriented DB.
        • b) HBase insert vs. append performance, 120K vs. 12K rows per sec.
        • c) HFile flush durability issues (HBASE-1923).
  31. 31. Hadoop Ecosystem
      Stack: Hadoop Core (HDFS, Common); MapReduce (Java, Streaming, Pipes, Scala); Data Access (HBase, Pig, Hive); Tools & Libraries (HUE, UC4, Oozie, Mobius, Mahout); Monitoring & Alerting (Ganglia, Nagios)
      • MapReduce
        • Sourcing data primarily Java
        • Applications using Perl, Scala, Python…
      • Data Access Frameworks
        • Pig: data pipelines
        • Hive: ad hoc queries
        • MQL: Mobius Query Language
      • Monitoring & Alerting
        • Ganglia, Nagios, Cloudera Enterprise
      • Tools & Libraries
        • HUE/Mobius: lifecycle of user jobs
        • UC4: scheduling
        • Oozie: user workflow and data pipelines
        • Mahout: data mining
  33. 33. Metadata: Data Discovery & Management. Diagram labels: Clients Data Sourcing Data Access Layer HDFS Metadata Data Discovery Data Monitoring Logical Type System Provisioning Tools Metadata Store Hive, Java Pig Schemas Pig load UDFs Hive Tables Java POJO Validation Load HBase Tables Extract Transform
  34. 34. Administration
      • Groups
      • Cloudera Enterprise
      • Workload Management
        • Allocation, Weights, Preemption, Speculative Execution, Data Locality
      • Security
        • Integrate Hadoop security spec with corporate policies
      • Authentication
        • HUE: custom module to use corp. credentials
        • Command Line Interface: PAM custom module
      • Authorization
        • Establish roles based on data classification and access patterns
  35. 35. Platform & Metrics
      Metrics (Details)
      • Data Sourcing: Latency, Data Load Status, Integrity, Quality, Availability
      • Consumption: Cloudera’s Bean Counter, Job Statistics, System consumption
      • Budgeting: Resource Allocation Models, Forecasting, Chargeback
      • Utilization: Cloudera’s Activity Monitor, Efficiency, Performance
      Platform (Description)
      • Availability: Standby Nodes - Checkpoint, Backup, Avatar Node, SLAs
      • Manageability: Installation, Provisioning, De-Provisioning, Version upgrades
      • Scalability: Federated NameNode, Metadata Replication, ZooKeeper
      • Data Movement: Publish/Subscribe ETL tools, low latency, self-service
      • Storage: Consistency, Partitioning, Compression, Replication
      • Workload: Concurrency, Resource Sharing, Schedulers, Allocation
      • Policies: Retention, Archival, Backup, Quotas
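
Slide 30 names "secondary sort with reduce-side join" as the pattern for building the Session Container from Session and Event data. The slides do not show the job itself, so the following is only a minimal in-memory sketch of the idea in Python rather than actual MapReduce code. The field names (`session_id`, `user`, `page`) and the function names are hypothetical, and dropping events that have no matching session is an assumption made for the example.

```python
from itertools import groupby
from operator import itemgetter

# Sort tags (hypothetical): the session record must sort before its events.
SESSION, EVENT = 0, 1


def map_phase(sessions, events):
    """Emit ((session_id, tag), record) pairs, as a mapper would."""
    for s in sessions:
        yield (s["session_id"], SESSION), s
    for e in events:
        yield (e["session_id"], EVENT), e


def build_session_containers(sessions, events):
    """Join each session with its events into one container per session."""
    # Shuffle: sort on the full composite key (the "secondary sort") ...
    shuffled = sorted(map_phase(sessions, events), key=itemgetter(0))
    containers = []
    # ... but group on session_id only, like a grouping comparator would.
    for _, group in groupby(shuffled, key=lambda kv: kv[0][0]):
        session = None
        joined_events = []
        for (_, tag), record in group:
            if tag == SESSION:
                session = record  # arrives first thanks to the sort order
            elif session is not None:
                joined_events.append(record)
        # Assumption: events without a session record are dropped.
        if session is not None:
            containers.append({"session": session, "events": joined_events})
    return containers
```

Because the session record is guaranteed to arrive before its events, the reducer loop can stream through the events without buffering them to look for the session, which is the usual motivation for adding the secondary sort.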

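For the EDW feeds, slide 30 says the daily snapshot is rebuilt from the previous snapshot plus the incremental day's data. A rough sketch of that merge, assuming each row carries a unique key and that a newer row simply replaces the older one, could look like this. The key name `item_id` and the upsert-only semantics (no deletes) are assumptions for illustration, not details from the talk.

```python
def rebuild_snapshot(previous, incremental, key="item_id"):
    """Return a new snapshot: previous rows overlaid with the day's changes.

    Assumes upsert semantics: a row in `incremental` replaces the row with
    the same key in `previous`, and rows are never deleted.
    """
    merged = {row[key]: row for row in previous}
    for row in incremental:
        # Later rows win, so the order of the incremental feed matters.
        merged[row[key]] = row
    # Emit rows in key order so that repeated rebuilds are deterministic.
    return [merged[k] for k in sorted(merged)]
```

At eBay's scale this would run as a distributed job over SequenceFiles rather than in memory, but the per-key logic would be similar under the same assumptions.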
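
Slide 30 also mentions sampling keys to identify partition ranges for the reducers before loading HBase. The sketch below illustrates the general idea of range partitioning from a random sample. It is plain Python, not the Hadoop TotalOrderPartitioner API, and the function names, the `sample_size` default, and the fixed `seed` are all hypothetical choices made for the example.

```python
import bisect
import random


def split_points(keys, num_reducers, sample_size=1000, seed=0):
    """Pick num_reducers - 1 boundary keys from a random sample of keys."""
    keys = list(keys)
    if not keys or num_reducers < 1:
        raise ValueError("need at least one key and one reducer")
    rng = random.Random(seed)
    sample = sorted(rng.sample(keys, min(sample_size, len(keys))))
    # Boundaries sit at evenly spaced quantiles of the sorted sample.
    return [sample[len(sample) * i // num_reducers] for i in range(1, num_reducers)]


def partition_for(key, boundaries):
    """Return the reducer index whose key range contains `key`."""
    # A key equal to a boundary goes to the upper partition.
    return bisect.bisect_right(boundaries, key)
```

Since every key sent to reducer i sorts before every key sent to reducer i + 1, concatenating the reducer outputs in order gives a totally ordered data set, which is what makes the output usable as contiguous key ranges.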