VLDB 2005 31st International Conference on Very Large Databases
 

  • From an application and user perspective, the main concept of grid computing is viewing computing as a utility. It should not matter where data resides or which computer processes a task; you simply request information or computation and have as much as is needed delivered whenever it is needed. From an implementation perspective, grid means much more: it is about data virtualization, resource provisioning and high availability. Virtualization enables components on all levels, such as storage devices, processors and database servers, to collaborate without rigidity. Provisioning means that everyone requesting resources gets what they need to complete their tasks and that no resources remain idle. High availability ensures that resources are always available, even during component outages and upgrades.
  • There are many research projects that are successfully using grid computing; I will name just two of them. SETI@home, a project at the University of California, Berkeley, uses idle PCs on the Internet to mine radio telescope data for signs of extraterrestrial intelligence. It uses a proprietary architecture of data servers located at Berkeley, which send data for analysis to subscribed computers on the Internet; those computers mine the data and send the results back. IBM, together with representatives of the world's leading science organizations, launched the World Community Grid. They use subscribed computers around the globe to find cures for diseases like malaria and tuberculosis. Applying this grid concept to corporate data warehouses is appealing but difficult because of data security concerns and network performance bottlenecks. Corporate data warehouses contain business intelligence that cannot be shared with thousands of computers around the world, nor does network bandwidth allow us to ship terabytes of data to these computers for scanning. Traditionally, corporate data warehouses have been implemented as islands of systems: each department has its own data warehouse or data mart implemented on its own hardware. It is very difficult to share resources between these islands, the whole setup is too rigid to respond to rapidly changing business needs, and it does not scale beyond a certain point. Grid-based data warehouses can solve this dilemma. In the following slides we will show how HP and Oracle apply the grid concept to corporate data warehouses.
  • There are many features built into Oracle 10g to support grid computing. In this talk I will focus on those that are significant for implementing a data warehouse grid, in the following three areas: 1) dynamic parallel processing, 2) data virtualization and dynamic resource provisioning in data warehouses, and 3) smart inter-node parallelism.
  • Parallel execution of data warehouse queries is pertinent because of the amount of data that needs to be scanned to answer them, and many databases today offer this feature. In Oracle 10g, queries are automatically parallelized to maximize resource utilization. The query coordinator parses the statement and decides the optimal parallel query execution plan and the degree of parallelism. The degree of parallelism (DOP) is not static but is determined at parse time according to resource availability and computing demands. The DOP is adjusted when the number of concurrent users on the system changes, when nodes are taken down for hardware or software upgrades or simply exchanged, when the grid is scaled out due to increased computing demands and nodes are added, or when nodes are assigned to different applications.
  • The Oracle 10g grid is implemented as a shared-everything architecture, which is very suitable for data virtualization and dynamic resource provisioning in data warehouses. Conceptually, a data warehouse grid consists of a number of nodes. Oracle recommends homogeneous nodes, such as 1-, 2-, 4- or 8-processor systems, but non-homogeneous nodes work as well, with some limitations. The nodes are connected via an interconnect; on commodity hardware Oracle supports Gigabit Ethernet and InfiniBand running TCP/IP, UDP or uDAPL. The interconnect is used to transfer data between nodes during parallel operations that span nodes, which is very important, for instance, during hash join operations. The nodes are then connected to a disk subsystem, typically a SAN solution. The I/O connections are usually Fibre Channel, but can also be InfiniBand based. I do not want to go deeper into the system side of this because Raghu will cover it in later slides. However, I would like to emphasize that in this shared-everything architecture all nodes can access all data, which is a very important feature for data virtualization and dynamic resource provisioning.
  • In order to run operations with a very high degree of parallelism in a grid system, operations need to run across multiple nodes. This is called inter-node parallelism. Inter-node parallelism can be very expensive if data needs to be shipped from one node to another; in that case the interconnect can become the bottleneck. The optimizer therefore tries to avoid inter-node parallelism whenever possible, which leads to reduced network traffic and faster execution times. 1) Node locality: if possible, the optimizer executes each query on one node. This is possible if the requested DOP is lower than twice the number of CPUs per node. However, if the DOP is too high for one node, the query is executed on two or more nodes, and the following two optimizations might be used. 2) Full partition-wise join: an optimization that can be applied when two tables are equipartitioned on the join key, that is, both are partitioned on their join key into the same number of partitions. The optimizer then generates a plan that allows partitions to be joined locally without generating any traffic across the interconnect, which is a very efficient way to execute a join. 3) Partial partition-wise join: in case only one of the two tables to be joined is partitioned on the join key, the optimizer might decide to dynamically repartition the other table. During the repartitioning, data gets transferred across nodes, but the final join is again executed efficiently. (A conceptual sketch of the node-locality rule follows these notes.)
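A minimal sketch of the node-locality rule described in the notes above, assuming homogeneous nodes. The function name `nodes_needed` and the single "two slots per CPU" threshold are illustrative simplifications, not Oracle's actual placement logic.

```python
# Conceptual model of node locality: run a parallel operation on a single node
# whenever the requested degree of parallelism (DOP) can be satisfied there
# (DOP <= 2 x CPUs per node), and spill over to more nodes only when it cannot.
import math

def nodes_needed(requested_dop: int, cpus_per_node: int) -> int:
    """Return how many homogeneous nodes an operation with the given DOP spans."""
    slots_per_node = 2 * cpus_per_node          # rule of thumb quoted in the notes
    if requested_dop <= slots_per_node:
        return 1                                # node locality: stay on one node
    return math.ceil(requested_dop / slots_per_node)

if __name__ == "__main__":
    # With 4-CPU nodes, a DOP of 8 stays local while a DOP of 24 spans 3 nodes.
    print(nodes_needed(8, 4))    # -> 1
    print(nodes_needed(24, 4))   # -> 3
```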

Presentation Transcript

  • Large Scale Data Warehouses on Grid: Oracle Database 10g and HP ProLiant Systems. Raghunath Othayoth Nambiar, Hewlett-Packard Company; Meikel Poess, Oracle Corporation
  • Agenda
    • Grid Computing
    • Hardware Support
    • Software Support
    • TPC-H Result
  • Grid Computing
    • 1) application and user perspective:
      • just like the power grid: Have computing power delivered as requested
    • 2) implementation perspective:
      • Data virtualization
      • Resource provisioning
      • High availability
  • From Research to Industry
    • Research projects using grid technology:
      • SETI@home
      • World Community Grid
    • Traditionally companies used islands of systems to implement corporate data warehouses
      • Unable to share resources
      • Too rigid to answer rapidly changing business needs
      • Cannot be scaled indefinitely
    • → HP and Oracle are applying the grid concept to industry data warehouses (DW)
  • Commercial Grid Market
    • IDC calls grid computing the fifth generation of computing
    • Commercial grid computing revenue was
      • 2003: 1 Billion USD
      • 2008: 12 Billion USD [estimate]
    • Forrester Research:
      • 37% of enterprises are piloting, rolling out or have implemented some form of grid computing.
      • 30% of firms are considering grid technology.
    (IDC, 2004. www.oracle.com/technology/tech/grid/collateral/idc_oracle10g.pdf) (Forrester, 2004. www.forrester.com/go?docid=34449)
  • N-tier vs. Grid Computing
  • Commercial Grid Components
    • Commodity hardware (x86-based servers)
    • Linux OS - cost effective
    • SAN – highly scalable
    • High speed interconnect (Gigabit Ethernet, InfiniBand)
    • Management software (manage as individual servers or as one large virtual server)
    • Database layer (ties the resources together, dynamic resource allocation, parallel processing)
  • Commercial Grid benefits
    • High scalability
    • High flexibility
    • Low total cost of ownership
    • High availability
    • Easy manageability
  • Oracle Features for a Data Warehouse Grid
    • Dynamic parallel processing
    • Data virtualization and dynamic resource provisioning in DW
    • Smart inter node parallelism
  • Dynamic Parallel Processing
    • Queries are automatically parallelized to maximize resource utilization
    • Degree of Parallelism (DOP) is adjusted according to resource availability and computing demands at parse time (see the sketch after this slide)
    • DOP is automatically adjusted when:
      • Number of concurrent users changes
      • Nodes are taken down for maintenance
      • Nodes are added due to increased computing demand (scale-out)
      • Nodes are assigned to different applications
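The bullets above list when the DOP is recomputed; the toy model below sketches one way a parse-time DOP could be derived from the resources that are currently available. The formula, the `threads_per_cpu` default and the function name are assumptions for illustration, not Oracle 10g's actual algorithm.

```python
# Toy model of parse-time DOP selection: start from the total CPU capacity of
# the nodes currently available to the grid and scale the DOP back as the
# number of concurrent users grows.

def choose_dop(available_nodes: int, cpus_per_node: int, concurrent_users: int,
               threads_per_cpu: int = 2) -> int:
    """Pick a degree of parallelism from currently available resources."""
    total_slots = available_nodes * cpus_per_node * threads_per_cpu
    # Share the parallel execution slots among the active users.
    return max(1, total_slots // max(1, concurrent_users))

if __name__ == "__main__":
    # 8 nodes with 4 CPUs each: a lone user gets DOP 64, 16 users get DOP 4.
    print(choose_dop(available_nodes=8, cpus_per_node=4, concurrent_users=1))   # -> 64
    print(choose_dop(available_nodes=8, cpus_per_node=4, concurrent_users=16))  # -> 4
    # Taking two nodes down for maintenance lowers the DOP automatically.
    print(choose_dop(available_nodes=6, cpus_per_node=4, concurrent_users=16))  # -> 3
```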
  • Data Virtualization and Dynamic Resource Provisioning in DW
    • Oracle's shared-everything architecture provides data virtualization and provisioning in data warehouses
    [Diagram: nodes 1-8 connected via an interconnect and attached to a shared disk subsystem]
  • Data Virtualization and Dynamic Resource Provisioning in DW
    • Oracle's shared-everything architecture provides data virtualization and provisioning in data warehouses
    [Diagram: workload types OLAP, Reports and ETL mapped onto nodes 1-8, sharing the interconnect and disk subsystem]
  • Data Virtualization and Dynamic Resource Provisioning in DW
    • Oracle's shared-everything architecture provides data virtualization and provisioning in data warehouses
    [Diagram: workload-to-node assignment during peak working hours]
  • Data Virtualization and Dynamic Resource Provisioning in DW
    • Oracle's shared-everything architecture provides data virtualization and provisioning in data warehouses
    [Diagram: workload-to-node assignment during the night]
  • Data Virtualization and Dynamic Resource Provisioning in DW
    • Oracle's shared-everything architecture provides data virtualization and provisioning in data warehouses
    [Diagram: workload-to-node assignment during short intervals when the DW is synchronized with the OLTP system]
  • Data Virtualization and Dynamic Resource Provisioning in DW
    • Oracle's shared-everything architecture provides data virtualization and provisioning in data warehouses
    [Diagram: without response time requirements, all workload types can run on all nodes]
  • Data Virtualization and Dynamic Resource Provisioning in DW
    • This concept can be extended to different applications
    [Diagram: applications OLTP, DW and DM mapped onto subsets of nodes 1-8]
  • Data Virtualization and Dynamic Resource Provisioning in DW
    • This concept can be extended to different applications (a conceptual sketch follows this slide)
    [Diagram: an alternative assignment of nodes 1-8 to the OLTP, DW and DM applications]
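As a rough illustration of the provisioning idea in the diagrams above, the sketch below assigns the eight nodes to workload types differently per time window. The time windows, node counts, and the names `PROVISIONING_PLAN` and `assign_nodes` are illustrative assumptions, not an Oracle feature or configuration.

```python
# Conceptual sketch of dynamic resource provisioning: the same pool of nodes is
# divided among workload types (OLAP, reports, ETL) differently depending on
# the time window, as in the diagrams above.

PROVISIONING_PLAN = {
    # time window           OLAP        reports      ETL
    "peak working hours": {"OLAP": 5, "Reports": 2, "ETL": 1},
    "night":              {"OLAP": 1, "Reports": 2, "ETL": 5},
    "DW/OLTP sync":       {"OLAP": 2, "Reports": 1, "ETL": 5},
}

def assign_nodes(window: str, nodes=range(1, 9)) -> dict:
    """Map node IDs 1-8 to workload types according to the plan for a window."""
    plan, assignment, node_iter = PROVISIONING_PLAN[window], {}, iter(nodes)
    for workload, count in plan.items():
        assignment[workload] = [next(node_iter) for _ in range(count)]
    return assignment

if __name__ == "__main__":
    print(assign_nodes("peak working hours"))  # OLAP gets most of the nodes
    print(assign_nodes("night"))               # ETL gets most of the nodes
```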
  • Smart Inter Node Parallelism
    • Optimizer avoids inter node parallelism when possible → reduced interconnect traffic → faster execution time
    • 1) node locality
      • If possible, operations are executed on one node
      • When the DOP of an operation can be satisfied with the resources of one server, it executes locally
    • 2) full partition-wise join
      • If two tables are equipartitioned on their join key, the join can be divided into smaller joins between partitions
    • 3) partial partition-wise join
      • If only one table is partitioned on the join key, the other table is dynamically repartitioned on the join key to break the large join into smaller joins (a conceptual sketch of both variants follows this slide)
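The sketch below illustrates the partition-wise join idea in plain Python: both inputs are hash-partitioned on the join key and matching partitions are joined pair by pair (the full variant), or the non-partitioned side is repartitioned on the fly first (the partial variant). The sample data, helper names and single-process execution are illustrative assumptions; Oracle's parallel executor distributes these partition pairs across nodes instead.

```python
# Conceptual sketch of partition-wise joins: hash-partition rows on the join
# key, then join matching partitions pair by pair. When both tables are already
# equipartitioned on the key, no rows need to move between partitions (full
# partition-wise join); when only one is, the other is repartitioned first
# (partial partition-wise join). Simplified illustration, not Oracle's executor.
from collections import defaultdict

def hash_partition(rows, key, n_parts):
    """Split rows into n_parts buckets by hashing the join key."""
    parts = defaultdict(list)
    for row in rows:
        parts[hash(row[key]) % n_parts].append(row)
    return parts

def partition_wise_join(left_parts, right, key, n_parts, right_parts=None):
    """Join partition i of the left side only with partition i of the right side."""
    if right_parts is None:                       # partial: repartition the right side
        right_parts = hash_partition(right, key, n_parts)
    joined = []
    for i in range(n_parts):                      # each pair could run on its own node
        lookup = defaultdict(list)
        for r in right_parts.get(i, []):
            lookup[r[key]].append(r)
        for l in left_parts.get(i, []):
            joined.extend({**l, **r} for r in lookup[l[key]])
    return joined

if __name__ == "__main__":
    orders = [{"orderkey": k, "custkey": k % 3} for k in range(6)]
    lineitem = [{"orderkey": k % 6, "price": 10 * k} for k in range(12)]
    o_parts = hash_partition(orders, "orderkey", 4)
    print(len(partition_wise_join(o_parts, lineitem, "orderkey", 4)))  # -> 12
```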
  • TPC-H Benchmark
    • The industry standard benchmark for data warehouse applications
    • Stresses grid-based data warehouses:
      • Complex queries
        • Sequential scans of large amounts of data
        • Aggregations of large amounts of data
        • Multi-table joins
        • Extensive sorting of very large sets of data
      • Single-user test
      • Multi-user test (see the sketch after this slide)
      • Parallel insert operations
      • Parallel delete operations
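A rough sketch of how the single-user and multi-user tests listed above exercise a system: the single-user (power) run executes the query set serially, while the multi-user (throughput) run executes several query streams concurrently. The `run_query` stub, the sleep-based stand-in for query work and the stream layout are illustrative assumptions, not the official TPC-H driver.

```python
# Illustrative single-user vs. multi-user test structure for a TPC-H-style run.
import time
from concurrent.futures import ThreadPoolExecutor

def run_query(stream_id: int, query_id: int) -> None:
    """Stand-in for submitting one query to the database under test."""
    time.sleep(0.01)  # pretend the query scans, joins, aggregates and sorts

def single_user_test(num_queries: int = 22) -> float:
    """One stream runs all queries back to back."""
    start = time.time()
    for q in range(1, num_queries + 1):
        run_query(0, q)
    return time.time() - start

def multi_user_test(num_streams: int = 4, num_queries: int = 22) -> float:
    """Several streams run their query sets concurrently."""
    start = time.time()
    with ThreadPoolExecutor(max_workers=num_streams) as pool:
        for s in range(1, num_streams + 1):
            pool.submit(lambda s=s: [run_query(s, q) for q in range(1, num_queries + 1)])
    return time.time() - start

if __name__ == "__main__":
    print(f"single-user elapsed: {single_user_test():.2f}s")
    print(f"multi-user elapsed:  {multi_user_test():.2f}s")
```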
  • Benchmarked Configuration
  • Current results
  • Result Analysis
    • Leadership performance
      • Query performance of 35,141 QphH @ 1000GB
      • Price-to-performance ratio of $60/QphH @ 1000GB
    • A database grid of ProLiant systems with multiple Opteron x86 processors delivers performance comparable to large SMP systems
    • The Linux operating system delivers the throughput and processing demands necessary to achieve the benchmark result
    • Oracle 10g + RAC database delivers consistent, high-performance query execution in large grid environments
  • Future Hardware for Grid – HP BladeSystems
  • Conclusion
    • Grid is ready for prime time
    • In grid computing resources are provisioned on demand and virtualized for applications to meet today’s challenging business needs
    • Commodity x86-based servers and blade servers offer reduced total cost of ownership
    • Overcomes the natural limitations of SMP systems such as number of processors, memory and disk arrays