Historical data + Detailed data + Diverse data = Lots of data
Splitting data over multiple storage media based on frequency of usage
Archival storage is very similar to near-line storage, except that in archival storage the probability of access drops very low. To put the probability of access in perspective, consider the following simple chart:
High-performance disk storage: access a unit of data once a month
Near-line storage: access 0.5 units of data every year
Archival storage: access 0.1 units of data every decade
Near-line storage can be thought of as a logical extension of the data warehouse. Archival storage cannot be thought of as a logical extension.
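To compare the three rows of the chart on one scale, the figures can be converted to units of data accessed per year. This is only a rough sketch: the names ACCESS_CHART and units_per_year are made up for illustration, and the numbers are the ones given in the chart.

```python
# Access rates from the chart, as (units of data accessed, period in months).
ACCESS_CHART = {
    "high-performance disk": (1.0, 1),   # one unit, once a month
    "near-line": (0.5, 12),              # 0.5 units every year
    "archival": (0.1, 120),              # 0.1 units every decade
}


def units_per_year(units, period_months):
    """Convert 'units per period' into units of data accessed per year."""
    return units * 12 / period_months


rates = {tier: units_per_year(u, m) for tier, (u, m) in ACCESS_CHART.items()}

for tier, rate in rates.items():
    print(f"{tier}: about {rate:.2f} units per year")
```

On this scale, disk storage is accessed roughly 12 units per year, near-line 0.5, and archival 0.01, which is why archival storage is treated separately from the warehouse.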
Options for Moving Data:
ADVANTAGES
◦ Manual: very simple; available immediately; operates at the row level
◦ HSM: relatively simple; not too expensive; fully automated
◦ CMSM: fully automated; operates at the row level
DISADVANTAGES
◦ Manual: prone to error; requires human interaction
◦ HSM: operates at the data set level
◦ CMSM: expensive; complex to implement and operate
Third-party monitors are much better because the monitors supplied by the DBMS vendors require far more resources than those supplied by third parties.
The Extension of the Data Warehouse across Different Storage Media: The data warehouse can grow to petabytes of data (a petabyte is a quadrillion bytes) and can still be effective and still be managed.
Lecture 12 The Really Large Data Warehouse
Building the Data Warehouse by Inmon
Chapter 12: The Really Large Data Warehouse
http://it-slideshares.blogspot.com/
Why the Rapid Growth?
The Impact of Large Volumes of Data
Disk Storage in the Face of Data Separation
Moving Data from One Environment to Another
Inverting the Data Warehouse
Total Cost
Maximum Capacity
Summary
Why the Rapid Growth?
The data warehouse contains history.
Data warehouses collect data at the most granular level.
The need to bring lots of different kinds of data together.
The Impact of Large Volumes of Data
Basic Data-Management Activities
◦ As data volumes grow large, normal database functions require increasingly larger amounts of resources.
The Cost of Storage
◦ As the volume of data grows, the cost of the data increases dramatically.
The Impact of Large Volumes of Data
The Real Costs of Storage
◦ There are many components to disk storage aside from the storage device itself: the disk controller, communications lines, processor, and software.
The Impact of Large Volumes of Data
The Usage Pattern of Data in the Face of Large Volumes
◦ Over time, as the volume of data grows, the percentage of data actually used drops.
The Impact of Large Volumes of Data
A Simple Calculation
◦ Usage ratio = Actual bytes used / Total data warehouse bytes
◦ As the volume of data found in your data warehouse goes up, the actual percentage used goes down.
Two Classes of Data
◦ Infrequently used data is often called dormant data or inactive data.
◦ Frequently used data is often called actively used data.
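The usage ratio from the simple calculation can be written as a small function. The sketch below is illustrative only: the function name and the byte counts are hypothetical, chosen to show the ratio falling when the warehouse grows while the actively used portion stays the same size.

```python
def usage_ratio(bytes_used, total_bytes):
    """Usage ratio = actual bytes used / total data warehouse bytes."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    if not 0 <= bytes_used <= total_bytes:
        raise ValueError("bytes_used must be between 0 and total_bytes")
    return bytes_used / total_bytes


# Hypothetical figures (in GB): the same 200 GB is actively used,
# but the warehouse grows from 500 GB to 5,000 GB.
small_warehouse = usage_ratio(200, 500)
large_warehouse = usage_ratio(200, 5000)

print(f"small warehouse: {small_warehouse:.0%} of the data is used")
print(f"large warehouse: {large_warehouse:.0%} of the data is used")
```

Everything outside the used portion is the dormant (inactive) data that the following slides propose moving off high-performance disk.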
The Impact of Large Volumes of Data Implications of Separating Data into Two Classes
Disk Storage in the Face of Data Separation
Near-Line Storage
◦ Near-line storage (depending on the vendor) is sequential storage.
◦ Characteristics: robotically controlled; inexpensive; holds bulk amounts of data; reliable over a long period of time; seconds to access the first record
Disk Storage in the Face of Data Separation
Access Speed and Disk Storage
◦ The difference between freely flowing blood and blood with many restricting components
Disk Storage in the Face of Data Separation
Archival Storage
◦ Split storage is needed to manage large amounts of data.
◦ Archival storage is a further option besides disk storage and near-line (bulk) storage.
◦ Archival storage is different from near-line storage.
Disk Storage in the Face of Data Separation
Implications of Transparency
◦ A record or row in the data warehouse is identical to a record or row in near-line storage.
Moving Data from One Environment to Another
Many ways:
◦ Have a database administrator manually move data
◦ Hierarchical storage management (HSM)
◦ The cross-media storage management (CMSM) option
Moving Data from One Environment to Another
The CMSM Approach
◦ The CMSM technology is fully automated.
◦ The CMSM is software that makes the physical location of the data transparent.
◦ The end user does not need to know where data is, whether in the data warehouse or on near-line storage.
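The idea of location transparency can be sketched in a few lines. This is not the interface of any real CMSM product: the class TransparentStore, its methods, and the sample rows are all hypothetical, and two in-memory dictionaries stand in for disk and near-line storage.

```python
class TransparentStore:
    """Toy model of location transparency: callers ask for a row by key
    and never say which storage medium holds it."""

    def __init__(self, disk_rows, near_line_rows):
        self.disk = dict(disk_rows)            # actively used rows
        self.near_line = dict(near_line_rows)  # dormant rows

    def get(self, key):
        """Return the row for `key`, wherever it physically lives."""
        if key in self.disk:
            return self.disk[key]
        # Falls through to near-line storage; raises KeyError if unknown.
        return self.near_line[key]

    def locate(self, key):
        """Administrative helper: report where a row is stored."""
        if key in self.disk:
            return "disk"
        if key in self.near_line:
            return "near-line"
        raise KeyError(key)


store = TransparentStore(
    disk_rows={"order-2": {"amount": 75}},
    near_line_rows={"order-1": {"amount": 40}},
)

# The end user issues the same call for both rows.
recent = store.get("order-2")
old = store.get("order-1")
```

Because a row in near-line storage is identical to a row in the warehouse, the same get call can serve both without the caller changing the query.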
Moving Data from One Environment to Another
A Data Warehouse Usage Monitor
◦ Streamlines the operations of the CMSM environment
◦ Two types: those supplied by the DBMS vendor and those supplied by third parties
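A usage monitor, at its simplest, counts which rows are touched so that dormant rows can be found and moved. The sketch below is an assumption about how such a monitor might work, not a description of any vendor or third-party product: UsageMonitor, record, dormant, and the min_hits threshold are hypothetical names.

```python
from collections import Counter


class UsageMonitor:
    """Toy usage monitor: counts row accesses and reports dormant rows."""

    def __init__(self):
        self.hits = Counter()

    def record(self, row_id):
        """Note one access to the given row."""
        self.hits[row_id] += 1

    def dormant(self, all_row_ids, min_hits=1):
        """Return rows accessed fewer than `min_hits` times (sorted)."""
        return sorted(r for r in all_row_ids if self.hits[r] < min_hits)


monitor = UsageMonitor()
for row_id in ["r1", "r2", "r1"]:
    monitor.record(row_id)

all_rows = ["r1", "r2", "r3"]
never_used = monitor.dormant(all_rows)               # rows with no access
rarely_used = monitor.dormant(all_rows, min_hits=2)  # rows accessed < 2 times
```

The dormant list is the kind of input a CMSM could use to decide which rows to move to near-line storage.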
Inverting the Data Warehouse
Inverted data warehouse: consider a normal data warehouse. To build a data warehouse:
◦ Normal way: data is first put into disk storage and, after the data ages, moved to near-line or archival storage.
◦ Alternative way: data is first entered into near-line storage (not disk storage). Data is “staged” from the near-line environment to the disk environment to be accessed and analyzed; after the analysis is over, the data is returned to near-line storage.
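The inverted flow (load into near-line storage, stage to disk for analysis, then return) can be sketched as follows. This is a minimal illustration under stated assumptions: InvertedWarehouse and its methods are hypothetical, dictionaries stand in for the two storage environments, and staging is modeled as a move rather than a copy.

```python
class InvertedWarehouse:
    """Toy model of an inverted data warehouse."""

    def __init__(self):
        self.near_line = {}  # where new data lands first
        self.disk = {}       # holds only data staged for analysis

    def load(self, key, row):
        """New data enters near-line storage, not disk storage."""
        self.near_line[key] = row

    def stage(self, keys):
        """Move the requested rows to disk so they can be analyzed."""
        for key in keys:
            self.disk[key] = self.near_line.pop(key)

    def release(self, keys):
        """After analysis is over, return the rows to near-line storage."""
        for key in keys:
            self.near_line[key] = self.disk.pop(key)


warehouse = InvertedWarehouse()
warehouse.load("q1", {"sales": 10})
warehouse.load("q2", {"sales": 12})

warehouse.stage(["q1"])
total_sales = sum(row["sales"] for row in warehouse.disk.values())
warehouse.release(["q1"])
```

In this model, disk holds data only for the duration of an analysis, which is the reverse of the normal way of building a warehouse.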
Total Cost
With the introduction of near-line and archival storage, the growing costs of a data warehouse can be mitigated.
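A back-of-the-envelope comparison shows why splitting storage mitigates cost. All figures below are hypothetical placeholders, not real prices or measurements: the per-terabyte costs, the warehouse size, and the active share are chosen only to illustrate the arithmetic.

```python
def storage_cost(active_tb, dormant_tb, disk_cost_per_tb, bulk_cost_per_tb):
    """Cost when active data sits on disk and dormant data on bulk storage."""
    return active_tb * disk_cost_per_tb + dormant_tb * bulk_cost_per_tb


# Hypothetical inputs: a 100 TB warehouse of which 5 TB is actively used,
# with disk costing 10 cost units per TB and near-line costing 1 unit per TB.
total_tb = 100
active_tb = 5
disk_price = 10
bulk_price = 1

# Everything kept on disk: treat all 100 TB as if it were active.
all_on_disk = storage_cost(total_tb, 0, disk_price, bulk_price)

# Split storage: only the active 5 TB stays on disk.
split_storage = storage_cost(active_tb, total_tb - active_tb,
                             disk_price, bulk_price)

print(f"all on disk:   {all_on_disk} units")
print(f"split storage: {split_storage} units")
```

The saving depends entirely on the usage ratio and on the price gap between media, so the inputs should be replaced with measured values before drawing conclusions.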
Maximum Capacity
“XYZ machine can handle up to nnn terabytes of data.”
Parameters that measure a machine's capacity:
◦ Volumes of data
◦ Number of users
◦ Workload complexity
The balanced case is where there is a fair amount of data, a fair number of users, and a reasonably complex workload.
Summary
Data warehouses grow large explosively.
The data inside the warehouse separates into one of two classes: frequently used data or infrequently used data.
Without near-line and/or archival storage, the costs of the data warehouse skyrocket as the data warehouse grows large.