Large Scale Data Warehousing at Yahoo!
Bohan Chen (bchen@yahoo-inc.com)
Database Architect at Yahoo!
Oracle Certified Master

Agenda
• Project Requirements
• POC Candidates
• Goals
• Tests
• Architecture and Configuration
  – Database Server
  – Network/Cluster Interconnects
  – Storage
• Critical Success Factors
• Parallel Query on RAC
• Lessons Learned and Challenges
• Future Plans

"Pie DB" Project Requirements
• Yahoo Product Intelligence Engineering – Pie DB
  – Several billion page views per day
  – A unified data warehouse that can support click stream, page view, and link view data
• Main requirements:
  – Support > 1PB of data
  – Linear scalability when adding storage or CPU
  – Store data in a compressed format
  – Standard SQL access
  – Integrate with 3rd-party BI tools
  – Support ~60 concurrent queries
  – Resource management
  – Reasonable and affordable cost

POC Candidates
• Oracle
• Greenplum
• Netezza
• Data Allegro
• Hadoop
• And others…

Goals
• High data compression rate
  – Hadoop pre-processing improves the compression rate to 4-5x!
• ~4GB/s of reads (sustained)
  – ~20GB/s effective read rate, based on the 5x compression rate
• Load 10TB in 3 hours
  – A 3.5TB/hr load rate; that is ~1GB/s of writes
• No indexes for queries
  – Avoid the additional space needed for indexes
  – Avoid index build/rebuild time after data loading

Goals
• No SQL hints!
• Standard hardware / software stack
  – Avoid proprietary solutions as much as possible
  – Easily repurpose if necessary
• Delete / expire / roll off old data
  – Truncate / drop old partitions
  – No vacuum process
• Leverage the hardware investment before deciding on ETL tools
  – Use the database as the transformation engine in the initial phase (ELT instead of ETL; see the sketch below)

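A minimal ELT sketch of that last point, with hypothetical names throughout (pageviews_ext standing in for an external table over the raw logs, pageviews for the compressed target): the transform runs inside the database as a parallel insert-select.

-- Hypothetical ELT step: all table and column names are illustrative only.
ALTER SESSION ENABLE PARALLEL DML;

INSERT INTO pageviews (view_date, pvid, url)
SELECT view_date, pvid, url      -- cleansing / transformation logic goes here
FROM   pageviews_ext;            -- external table over the raw log files

COMMIT;
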
Tests
• Load 3 months of click, page view, and link view historical data
  – Almost 100TB of raw data
  – 21TB in the database (due to compression)
• Load and transform data
  – Load raw data
  – Create dimension tables and merge with existing dimensions
• 20 base queries to test the system
  – Typical queries we will see in production
  – Run queries serially and concurrently
  – The concurrent test has to finish faster than the serial test

Tests
• Scalability
  – Performance increases close to linearly as we add RAC nodes
• Deep analytical queries
• Ad hoc queries
  – Allow users to submit random queries to the system and see if it breaks!

-----------------------------------------------------------------------------------
| Id  | Operation                 | Rows | Bytes | TempSpc | Cost (%CPU) | Time     |
-----------------------------------------------------------------------------------
|   0 | SELECT STATEMENT          |  16M | 7980M |         |  610K  (16) | 02:02:03 |
|*  1 |  VIEW                     |  16M | 7980M |         |  610K  (16) | 02:02:03 |
| ... | ...                       |      |       |         |             |          |
|  10 |   PX PARTITION HASH ALL   |  16M | 3959M |         |  610K  (16) | 02:02:03 |
|* 11 |    HASH JOIN RIGHT OUTER  |  16M | 3959M |    932M |  610K  (16) | 02:02:03 |
|* 12 |     TABLE ACCESS FULL     |  11G |  804G |         |  25036 (7)  | 00:05:01 |
|* 13 |     HASH JOIN             |  16M | 2794M |         |  543K  (17) | 01:48:43 |
|* 14 |      TABLE ACCESS FULL    |  16M | 1894M |         |  69951 (1)  | 00:14:00 |
|* 15 |      TABLE ACCESS FULL    | 597G |   31T |         |  471K  (19) | 01:34:13 |
-----------------------------------------------------------------------------------

System Requirements
• Network/cluster interconnects
  – GigE does not meet the bandwidth requirement
  – 10GigE is still too expensive
  – InfiniBand is chosen (up to 20Gb/s)
• Storage
  – Block-based storage / SAN solution
  – Price/performance justified for warehouse workloads
• Oracle 10.2.0.3 x86_64 (RAC)
  – Native IB support
  – Many improvements and fixes on "warehousing" features
  – Latest 10.2 patch set at that time
• Oracle Automatic Storage Management
  – Provides LVM-style striping of data
  – Supports clustered access (required for RAC)

Overall System Topology
[Diagram] NAS storage for raw data and the applications connect over private and public LANs (2 GigE NICs per server) to 16 IBM x3850 M2 nodes (Node 1 … Node 16); the nodes share a redundant InfiniBand network and a dual-switch Storage Area Network (4x4Gb FCP to SP-A and SP-B) in front of 6 EMC CX3-40s. Legend: 1000TX public (primary), 20Gb full-duplex IB, 4Gb FCP (Switch 1), 4Gb FCP (Switch 2).

Database Server Configuration
• IBM x3850 M2
  – 64GB RAM (DDR2 SDRAM)
  – 4 x Intel Xeon E7330 @ 2.40GHz (quad core)
    • 4 x 4 = 16 cores per node
  – One of the fastest servers in its class; power efficient
• 3 x QLogic QLE 2462 HBAs (dual port)
  – 4Gb FCP per port (for the EMC SAN)
• 2 x QLogic 7104-HCA-128LPX-DDR
  – 20Gb (for InfiniBand)
• RHEL4 Update 6
  – Large SMP kernel for x86_64 (2.6.9-67.ELlargesmp x86_64)
• Oracle 10.2.0.3 x86_64 Clusterware/ASM/RDBMS (with patches)

Database Server
Hardware Configuration (Simplified)
[Diagram] Each IBM x3850 M2 has three paths out: GigE (public / Oracle VIP) to a Cisco 4948 Ethernet switch, Fibre Channel HBAs to a Brocade 4900 SAN switch in front of the EMC CX3-40, and an HCA running RDS over IB or IP over IB to a QLogic/SilverStorm 9024 IB switch.

Database Server
Software Architecture (Simplified)
[Diagram] Oracle layer: Clusterware, ASM, RDBMS. Operating system layer: SCSI multipath, IP, RDS/IB and IP/IB. Hardware layer: HBA, GigE NIC, HCA.

Database Server
Init Parameters
• _PX_use_large_pool = TRUE
• db_block_size = 8192
• db_cache_size = 8048M
• db_file_multiblock_read_count = 128
• large_pool_size = 4G
• parallel_adaptive_multi_user = FALSE
• parallel_execution_message_size = 16384
• parallel_max_servers = 32
• parallel_threads_per_cpu = 2
• pga_aggregate_target = 38G
• sga_max_size = 18512M
• shared_pool_size = 6G

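A minimal sketch of how a few of these could be applied cluster-wide, assuming the RAC instances share a single SPFILE (the values simply mirror the list above):

-- SID = '*' targets every RAC instance; hidden parameters need double quotes.
ALTER SYSTEM SET "_PX_use_large_pool" = TRUE SCOPE = SPFILE SID = '*';
ALTER SYSTEM SET parallel_max_servers = 32 SCOPE = SPFILE SID = '*';
ALTER SYSTEM SET parallel_execution_message_size = 16384 SCOPE = SPFILE SID = '*';
-- With SCOPE = SPFILE the new values take effect at the next instance restart;
-- SCOPE = BOTH would apply dynamic parameters immediately as well.
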
Network/Cluster Interconnects
InfiniBand Architecture
[Diagram] Two interconnect stacks side by side. IP over IB: RAC database and IPC library in user space, then UDP, IP, and IPoIB in the kernel, down to the NIC/HCA. RDS over IB: RAC database and IPC library in user space, then IB/RDS in the kernel, directly to the HCA.

Network/Cluster Interconnects
InfiniBand Architecture
• InfiniBand switch is required
• HCA is required
  – Run the INSTALL script to provide the IP and netmask
• Relink Oracle
  – cd $ORACLE_HOME/rdbms/lib
  – make -f ins_rdbms.mk ipc_rds ioracle
• Oracle patch 6643259
  – Intermittent hang for inter-instance parallel query using RDS over IB
  – Patch available for 10.2.0.3 and 11.1.0.6
• Kernel panic on an idle system / IB hang at reboot
  – Fixed by upgrading the HCA driver

$ cat /proc/iba/mt25218/config
SilverStorm Technologies Inc. MT25218/MT25204 Verbs Provider Driver, version 4.2.0.5.2
for SilverStorm Technologies Inc. InfiniBand(tm) Transport Driver, version 4.2.0.5.2
Built for Linux Kernel 2.6.9-67.ELlargesmp

Network/Cluster Interconnects
InfiniBand Architecture
• Oracle verification
  – "cluster interconnect IPC version: Oracle RDS/IP (generic)" in the alert log
• Linux verification
  – cat /proc/driver/rds/info
  – cat /proc/driver/rds/stats
  – cat /proc/driver/rds/config

$ cat /proc/driver/rds/stats
Rds Statistics:
  Sockets open:        205
  End Nodes connected: 15
Performance Counters: ON
Transmit:
  Xmit bytes   268914077203
  Xmit packets 250454334

Storage
EMC SAN Architecture
[Diagram of the EMC SAN layout]

Storage
EMC SAN Details
• 6 x CX3-40F arrays
  – 900 x 400GB 10K drives (150 drives @ RAID5 4+1 = 40TB usable per array)
  – 96GB cache (16GB per array)
  – 48 x 4Gb ports (8 per array)
  – Capable of ~7.5GB/s read throughput (1.25GB/s per array)
• 240TB usable storage capacity
  – 200TB for Oracle data (1PB logical with 5:1 Oracle compression)
  – 40TB additional storage required for Oracle TEMP space

Storage
EMC SAN Details
• 2 x EMC Brocade 4900 departmental switches
  – 128 x 4Gb ports (64 per switch)
  – Simple dual-fabric design
• Ability to expand by adding drives and/or arrays
• Linear scaling with 6 arrays
• Oracle ASM rebalances data when adding storage
• Best price/performance at the time

Storage
Oracle Automatic Storage Management
• Only stores metadata about where data lives – an LVM for Oracle data
• Stripe size is 1MB (_asm_stripesize=1048576)
• Stripes a datafile evenly across all storage arrays to use all spindles
• Vendor agnostic; can add / remove storage as needed
[Diagram] ASM software layer striping 1MB extents across SAN-based storage (iSCSI / FCP)

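For illustration, a minimal ASM sketch of that add/remove flexibility (the disk group name and device paths are hypothetical; statements run from the ASM instance):

-- External redundancy: protection is left to the EMC RAID5 sets;
-- ASM stripes 1MB extents across every disk in the group.
CREATE DISKGROUP PIEDATA EXTERNAL REDUNDANCY
  DISK '/dev/mapper/emcdisk01',
       '/dev/mapper/emcdisk02';

-- Adding (or dropping) storage later triggers an online rebalance,
-- re-spreading the 1MB stripes across all spindles.
ALTER DISKGROUP PIEDATA ADD DISK '/dev/mapper/emcdisk03';
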
Critical Success Factors (Oracle)
• gzip support for external tables (see the sketch after this slide)
  – Feature added by Oracle to make the POC succeed
  – Patch 6522622: External tables need to read compressed files
• Compression
  – Reduce required disk space
  – More effective throughput (5x)
• Automatic Storage Management
  – Distribute IO evenly; scale IO linearly
• Features and enhancements for data warehousing
  – Partitioning and composite partitioning
  – Patch 6402957: Adaptive aggregation push-down

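A hedged external-table sketch of the gzip point (directory object, file name, and columns are all hypothetical); with patch 6522622 in place, a table of this shape could read the compressed log files directly:

CREATE TABLE pageviews_ext (
  view_date  DATE,
  pvid       NUMBER,
  url        VARCHAR2(1000)
)
ORGANIZATION EXTERNAL (
  TYPE ORACLE_LOADER
  DEFAULT DIRECTORY raw_logs              -- hypothetical directory object on the NAS mount
  ACCESS PARAMETERS (
    RECORDS DELIMITED BY NEWLINE
    FIELDS TERMINATED BY '|'
  )
  LOCATION ('pageviews_20080101.dat.gz')  -- gzip'd log file; readable per the patch (assumption)
)
REJECT LIMIT UNLIMITED;

A parallel clause on the external table then lets the load itself run with parallel query slaves.
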
Critical Success Factors
• InfiniBand interconnect
  – Provides the bandwidth needed
  – Reduces latency / cluster waits
  – Highest utilization observed is 7Gb/s, but only for brief periods (when using RDS over IB)
  – 1~2Gb/s is more typical under load
• EMC SAN solution
  – IO throughput to support the full table scans
  – Max 1.25GB/s per array

Oracle Parallel Query (Simplified)
select * from table …
[Diagram] A query coordinator (QC) sits on top of producer / consumer pairs of PX slaves, which scan the Link Views table across its partitions P1-P4.

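Because SQL hints were ruled out (see the Goals slide), the degree of parallelism has to come from the objects or the session instead; a hedged sketch with an illustrative table name and degree:

-- Default DOP used for full scans of this table; no per-query hints needed.
ALTER TABLE linkviews PARALLEL 16;

-- Confirm the attribute from the data dictionary.
SELECT table_name, degree FROM user_tables WHERE table_name = 'LINKVIEWS';
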
PQ and RAC
[Diagram] The same layout as the previous slide (query coordinator over producer / consumer pairs of PX slaves scanning Link Views partitions P1-P4), shown in a RAC context.

PQ and RAC scaling issue
• All architectures, including parallel shared-nothing systems, eventually need a funnel point (the query coordinator)
  – Lots of "select * from petabyte_table order by 1" will kill everyone
• During the POC, we had to ensure that Oracle could parallelize ALL operations, otherwise parallel query becomes useless
  – This is a common source of PQ scaling problems, as it requires too much data to traverse the interconnect

Scaling PQ on RAC
• A large number of sub-partitions is required to achieve a high degree of parallelism and good performance (see the DDL sketch after this list)
• Reduce interconnect traffic
• Need an interconnect that can support the throughput requirements of the QC
• Avoid "broadcast" redistribution of PQ results

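A hedged DDL sketch of the kind of composite partitioning this implies (table, column, partition names, counts, and dates are all hypothetical): range partitions by day for roll-off, hash sub-partitions on the join key for parallelism and partition-wise joins.

CREATE TABLE pageviews (
  view_date  DATE,
  pvid       NUMBER,
  url        VARCHAR2(1000)
)
PARTITION BY RANGE (view_date)
SUBPARTITION BY HASH (pvid) SUBPARTITIONS 64
(
  PARTITION p20080101 VALUES LESS THAN (DATE '2008-01-02'),
  PARTITION p20080102 VALUES LESS THAN (DATE '2008-01-03')
  -- one partition per day; old days are dropped rather than vacuumed
);

-- A matching linkviews table, hash sub-partitioned on the same key (pvid)
-- with the same sub-partition count, enables full partition-wise joins.
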
Oracle Parallel Query (More Realistic)
select … from pageviews, linkviews where pageviews.pvid = … group by date;
[Diagram] The QC sits above a group-by row of PX slaves; below them, a hash join / table scan row of PX slaves reads the Link Views and Page Views tables, each split into PVID partitions P1 and P2.

Need to Avoid
[Diagram] Spread across Node 1 and Node 2: the QC and group-by PX slaves on top, the hash join / table scan PX slaves below, reading the Link Views and Page Views PVID partitions P1 and P2. The producer and consumer slaves working on the same data sit on different nodes, so the joined rows have to cross the interconnect.

Best Scenario
[Diagram] The same two-node layout, but each node handles matching partitions end to end: Node 1 works on LVS and PVS partition P1, Node 2 on LVS and PVS partition P2, so each producer / consumer pair and each partition-wise join stays local to one node.

How PQ Survives in a RAC Environment
• Node affinity to avoid interconnect traffic
  – The consumer / producer pair always lives on the same node
• Joining tables that have the same partition key and the same number of partitions results in a partition-wise join (see the query sketch below)
  – This is the key to scaling!
  – Queries that join large tables that are not partitioned on the same key require "brute force" interconnects to survive

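Under the hypothetical schema sketched a few slides back (pageviews and linkviews hash sub-partitioned on pvid with equal sub-partition counts), the "More Realistic" query shape can run as a full partition-wise join:

-- Each producer / consumer pair joins one matching pvid sub-partition from
-- both tables locally; only the group-by results funnel up to the QC.
SELECT p.view_date, COUNT(*) AS views
FROM   pageviews p, linkviews l
WHERE  p.pvid = l.pvid
GROUP  BY p.view_date;

In the execution plan this shows up as PX PARTITION HASH ALL above the hash join, as in the plan on the earlier Tests slide.
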
Lessons Learned and Challenges
• Parallel shared nothing does not always scale linearly
• Although most data warehouse technology did very well within 25TB, things started to change quickly at 100TB
• At this data volume, do not expect any commercial solution to work without some growing pains
  – Expect to see bugs!
• Avoiding proprietary solutions and staying open means multiple vendors may be involved
  – Working with multiple vendors/teams can be challenging
  – Select vendors with quality support and knowledge transfer
  – Dedication from the Oracle support and development teams helped make the POC successful

Backup and Restore Challenges
• Web logs/events (the fact tables) can be reloaded; no need to back them up
• Aggregation/summary data is backed up
  – Range-partitioned by date
  – Historical partitions are set read-only
  – Only back up new partitions; skip read-only partitions (sketch below)
• Backup and restore
  – Oracle RMAN: 6 channels; level 0
  – NetVault with 6 tapes
  – 300+ MB/s backup and 200+ MB/s restore

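A hedged sketch of the read-only idea (the tablespace name is hypothetical; in 10g a partition becomes read-only by putting its tablespace into read-only mode):

-- Freeze a completed month so backups no longer need to touch it.
ALTER TABLESPACE summary_2008_01 READ ONLY;

-- The RMAN level-0 run ("BACKUP INCREMENTAL LEVEL 0 DATABASE SKIP READONLY"
-- over the six allocated channels) then bypasses datafiles that belong to
-- read-only tablespaces.
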
Challenges for Oracle
• The degree of parallelism (DOP) is fixed at query startup
• AWR reports have no aggregation for parallel executions yet
• ORA-12805: parallel query server died unexpectedly
  – Once that happens all work is abandoned, and resubmitting is the only solution so far
  – Hope to see an "auto-recovery" feature in the future!
• No DOP information is available in the execution plan
  – Improved in 11g (AUTOTRACE can see the DOP!)
• Lacking detailed information on parallel server activity and progress
  – Improved in 11g (GV$SQL_MONITOR; example below)

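As a hedged 11g illustration of the GV$SQL_MONITOR point (column names assumed from the standard view definition):

-- Shows, per running statement, how many PX servers were requested
-- versus actually allocated across the cluster.
SELECT inst_id, sql_id, status,
       px_servers_requested, px_servers_allocated
FROM   gv$sql_monitor
WHERE  status = 'EXECUTING';
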
Major Oracle Enhancements / Patches for Data Warehouse
• 6522622 – External tables need to read compressed files
• 6643259 – Intermittent hang for inter-instance parallel query using RDS over IB
• 6748058 – Transformed query does not parallelize
• 6402957 – Predicate pushdown not working with window functions in some cases
• 6808773 – Suboptimal hash distribution when joining on highly skewed columns
• 6471770 – Parallel servers die unexpectedly

Future Plans
• Near future:
  – ETL tool
  – Backup/restore throughput enhancement
  – Resource plans for different users and workloads
• Further collaboration/integration with Hadoop
• Oracle 11g evaluation and upgrade
• EMC CX4-960
  – Up to 2x IO and 2x capacity (vs CX3)
  – Upgrade without migrating data
• Intel 7400 series 6-core CPU "Dunnington"
  – Up to 50% more performance and 10% less power consumption vs the 7300 series
• 10 GigE evaluation

Next Stop

10 Petabytes!

Thank You!