Lecture 05 - The Data Warehouse and Technology



Building the data warehouse


  1. Building the Data Warehouse, by Inmon. Chapter 5: The Data Warehouse and Technology. http://it-slideshares.blogspot.com/
  2. 5.0 Overview
     The data warehouse requires a simpler set of technological features than its operational predecessors:
     ◦ Online updating: not needed.
     ◦ Locking and integrity: needs are minimal.
     ◦ Teleprocessing interface: only a very basic one is required.
     This chapter outlines some of the technological requirements for the data warehouse.
  3. MANAGING LARGE AMOUNTS OF DATA
     1. Manage large volumes of data
     2. Manage multiple storage media
     3. Index and monitor data
     4. Interface with other technologies to retrieve and pass data
  4. Managing Multiple Media
     Following is a hierarchy of data storage in terms of speed of access and cost of storage:
     Medium            Speed of access   Cost of storage
     Main memory       Very fast         Very expensive
     Expanded memory   Very fast         Expensive
     Cache             Very fast         Expensive
     DASD              Fast              Moderate
     Magnetic tape     Not fast          Not expensive
     Near line         Not fast*         Not expensive
     Optical disk      Not slow          Not expensive
     Fiche             Slow              Cheap
     *Not fast to find the first record sought; very fast to find all other records in the block.
  5. Indexing and Monitoring Data
     Monitoring data warehouse data determines such factors as the following:
     ◦ If a reorganization needs to be done
     ◦ If an index is poorly structured
     ◦ If too much or not enough data is in overflow
     ◦ The statistical composition of the access of the data
     ◦ Available remaining space
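A few of the monitoring factors listed on this slide can be sketched in Python. This is only an illustration: the `TableStats` fields, the 20% overflow threshold, and the table names are assumptions made for the example, not values from the chapter.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class TableStats:
    """Hypothetical per-table counters a monitor might collect."""
    name: str
    rows_in_primary: int   # rows stored in their home blocks
    rows_in_overflow: int  # rows that spilled into overflow
    allocated_bytes: int
    used_bytes: int


def overflow_ratio(stats: TableStats) -> float:
    """Fraction of all rows that currently live in overflow."""
    total = stats.rows_in_primary + stats.rows_in_overflow
    return stats.rows_in_overflow / total if total else 0.0


def needs_reorganization(stats: TableStats, max_overflow: float = 0.20) -> bool:
    """Flag a table once overflow passes an (assumed) threshold."""
    return overflow_ratio(stats) > max_overflow


def remaining_space(stats: TableStats) -> int:
    """Bytes still available in the table's allocation."""
    return stats.allocated_bytes - stats.used_bytes


def access_profile(access_log):
    """Statistical composition of access: how often each table was read."""
    return Counter(access_log)


stats = TableStats("sales_history", 900_000, 300_000, 10_000_000, 8_500_000)
profile = access_profile(["sales_history", "customer", "sales_history"])
print(needs_reorganization(stats), remaining_space(stats), profile.most_common(1))
```

In this made-up case a quarter of the rows sit in overflow, so the table is flagged for reorganization.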
  6. Interfaces to Many Technologies
     The interface to different technologies requires several considerations:
     ◦ Does the data pass from one DBMS to another easily?
     ◦ Does it pass from one operating system to another easily?
     ◦ Does it change its basic format in passage (EBCDIC, ASCII, and so forth)?
     ◦ Can passage into multidimensional processing be done easily?
     ◦ Can selected increments of data, such as changed data capture (CDC), be passed rather than entire tables?
     ◦ Is the context of data lost in translation as data is moved to other environments?
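The format question on this slide (EBCDIC versus ASCII) can be illustrated with Python's built-in codecs. This is a minimal sketch: the sample string and the `ebcdic_to_ascii` helper are invented for the example, and `cp037` (EBCDIC US/Canada) is assumed as the code page. It covers character data only; packed-decimal and binary fields cannot be translated character by character this way.

```python
def ebcdic_to_ascii(raw: bytes) -> bytes:
    """Re-encode character data from EBCDIC (cp037 assumed) to ASCII."""
    return raw.decode("cp037").encode("ascii")


text = "ACCT 00123"                   # hypothetical field value
ebcdic_bytes = text.encode("cp037")   # as a mainframe source might store it
ascii_bytes = text.encode("ascii")    # as the target platform expects it

# Same text, different bytes: the data changes its basic format in passage.
print(ebcdic_bytes.hex(), ascii_bytes.hex())
print(ebcdic_to_ascii(ebcdic_bytes) == ascii_bytes)
```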
  7. PROGRAMMER OR DESIGNER CONTROL OF DATA PLACEMENT
     ◦ Place data at the block/page level
     ◦ Manage data in parallel
     ◦ Solid metadata control
     ◦ Rich language interface
  8. Parallel Storage and Management of Data; Metadata Management
     Metadata for the data warehouse typically covers the following:
     ◦ Data warehouse table structures
     ◦ Data warehouse table attribution
     ◦ Data warehouse source data (the system of record)
     ◦ Mapping from the system of record to the data warehouse
     ◦ Data model specification
     ◦ Extract logging
     ◦ Common routines for access of data
     ◦ Definitions and/or descriptions of data
     ◦ Relationships of one unit of data to another
  9. Language Interface
     Typically, the language interface to the data warehouse should do the following:
     ◦ Be able to access data a set at a time
     ◦ Be able to access data a record at a time
     ◦ Specifically ensure that one or more indexes will be used in the satisfaction of a query
     ◦ Have an SQL interface
     ◦ Be able to insert, delete, or update data
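The difference between set-at-a-time and record-at-a-time access can be sketched with Python's standard `sqlite3` module. SQLite stands in here for whatever DBMS hosts the warehouse, and the `sales` table and its rows are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (sale_id INTEGER PRIMARY KEY, region TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO sales (region, amount) VALUES (?, ?)",
    [("east", 100.0), ("west", 250.0), ("east", 50.0)],
)

# Set at a time: one SQL statement describes the whole result.
totals = dict(conn.execute("SELECT region, SUM(amount) FROM sales GROUP BY region"))

# Record at a time: the program walks the rows one by one.
cursor = conn.execute("SELECT sale_id, amount FROM sales ORDER BY sale_id")
first = cursor.fetchone()
rest = [row for row in cursor]

# Insert, update, and delete through the same interface.
conn.execute("UPDATE sales SET amount = 75.0 WHERE sale_id = 3")
conn.execute("DELETE FROM sales WHERE sale_id = 2")
remaining = conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]

print(totals, first, len(rest), remaining)
```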
  10. EFFICIENT LOADING OF DATA
      ◦ Load efficiently
      ◦ Use indexes efficiently
      ◦ Store data in a compact way
      ◦ Support compound keys
  11. Efficient Index Utilization
      Technology can support efficient index access in several ways:
      ◦ Using bit maps
      ◦ Having multileveled indexes
      ◦ Storing all or parts of an index in main memory
      ◦ Compacting the index entries when the order of the data being indexed allows such compaction
      ◦ Creating selective indexes and range indexes
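The first technique on this slide, the bit map, can be sketched in a few lines of Python. This is a toy illustration, not how any particular DBMS implements it: each distinct column value gets one integer whose bit *i* is set when row *i* holds that value, so combining predicates becomes a bitwise AND. The column data and function names are assumptions for the example.

```python
def build_bitmap_index(values):
    """Map each distinct value to an int whose bit i marks row i."""
    index = {}
    for row_number, value in enumerate(values):
        index[value] = index.get(value, 0) | (1 << row_number)
    return index


def rows_matching(bitmap):
    """Expand a bitmap back into the list of row numbers it marks."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


# Two hypothetical low-cardinality columns of a five-row table.
region = ["east", "west", "east", "north", "west"]
status = ["open", "open", "closed", "open", "closed"]

region_idx = build_bitmap_index(region)
status_idx = build_bitmap_index(status)

# WHERE region = 'west' AND status = 'open' is a single bitwise AND.
hits = rows_matching(region_idx["west"] & status_idx["open"])
print(hits)
```

Bit maps tend to pay off when a column has few distinct values, since the index then needs only one bitmap per value.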
  12. Compaction of Data
      Compaction helps manage large amounts of data. The programmer gets the most out of a given I/O when data is stored compactly.
  13. Compound Keys
      Compound keys must be supported because of:
      ◦ The time variancy of data warehouse data
      ◦ Key-foreign key relationships, which are quite common in the atomic data
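Why time-variant data leads to compound keys can be sketched with `sqlite3`. The `customer_snapshot` table, its columns, and its rows are hypothetical: because the same customer appears once per snapshot, the customer identifier alone cannot be the key, and the snapshot date has to be part of it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE customer_snapshot (
        customer_id   INTEGER NOT NULL,
        snapshot_date TEXT    NOT NULL,
        credit_limit  REAL,
        PRIMARY KEY (customer_id, snapshot_date)  -- compound key
    )
    """
)
conn.executemany(
    "INSERT INTO customer_snapshot VALUES (?, ?, ?)",
    [
        (42, "2023-01-31", 1000.0),
        (42, "2023-02-28", 1500.0),  # same customer, later point in time
        (7, "2023-01-31", 300.0),
    ],
)

# The history of one customer is a range over the second key column.
history = conn.execute(
    "SELECT snapshot_date, credit_limit FROM customer_snapshot "
    "WHERE customer_id = ? ORDER BY snapshot_date",
    (42,),
).fetchall()

# A second row for the same customer AND date violates the compound key.
try:
    conn.execute("INSERT INTO customer_snapshot VALUES (42, '2023-01-31', 9999.0)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

print(history, duplicate_rejected)
```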
  14. VARIABLE-LENGTH DATA
      ◦ Manage variable-length data efficiently
      ◦ Lock manager: explicit control at the programmer level
      ◦ Support index-only processing
      ◦ Restore data in bulk efficiently
  15. Lock Management
      The lock manager ensures that two or more people are not updating the same record at the same time. The ability to turn the lock manager off and on is necessary.
  16. Index-Only Processing
      A request is serviced by looking in an index (or indexes), without going to the primary source of data.
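Index-only processing can be observed in SQLite, used here purely as a stand-in DBMS. In this sketch the table, index name, and rows are invented; the index contains every column the query needs, so the query plan reports a covering index and the base table is never read.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (sale_id INTEGER PRIMARY KEY, region TEXT, amount REAL, notes TEXT)"
)
# The index holds both the search column and the column being returned.
conn.execute("CREATE INDEX idx_region_amount ON sales (region, amount)")
conn.executemany(
    "INSERT INTO sales (region, amount, notes) VALUES (?, ?, ?)",
    [("east", 100.0, "n1"), ("west", 250.0, "n2"), ("east", 50.0, "n3")],
)

query = "SELECT amount FROM sales WHERE region = 'east'"

# The last column of each EXPLAIN QUERY PLAN row is a readable description.
plan = conn.execute("EXPLAIN QUERY PLAN " + query).fetchall()
plan_text = " ".join(str(row[-1]) for row in plan)
uses_index_only = "COVERING INDEX" in plan_text

amounts = sorted(row[0] for row in conn.execute(query))
print(plan_text, uses_index_only, amounts)
```

Asking for the `notes` column as well would force a lookup in the table itself, because that column is not in the index.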
  17. Fast Restore
      The capability to quickly restore a data warehouse table from non-DASD storage.
  18. Other Technological Features
      Some features of the classical transaction-processing DBMS play only a small role in the data warehouse. They include the following:
      ◦ Transaction integrity
      ◦ High-speed buffering
      ◦ Row- or page-level locking
      ◦ Referential integrity
      ◦ VIEWs of data
      ◦ Partial block loading
  19. DBMS Types and the Data Warehouse
      Data warehouses manage massive amounts of data because they contain:
      ◦ Granular, atomic detail
      ◦ Historical information
      ◦ Summary as well as detailed data
      Because record-level, transaction-based updates are a regular feature of the general-purpose DBMS, it must offer facilities for:
      ◦ Locking
      ◦ COMMITs
      ◦ Checkpoints
      ◦ Log tape processing
      ◦ Deadlock
      ◦ Backout
  20. Changing DBMS Technology
      Such a change may be in order for several reasons:
      ◦ Newer DBMS technologies may be available.
      ◦ The size of the warehouse has grown.
      ◦ Use of the warehouse has escalated and changed.
      ◦ The basic DBMS decision must be revisited from time to time.
      Should the decision be made to go to a new DBMS technology, what are the considerations?
      ◦ Will the new DBMS technology meet the foreseeable requirements?
      ◦ How will the conversion from the older DBMS technology to the newer DBMS technology be done?
      ◦ How will the transformation programs be converted?
  21. Multidimensional DBMS and the Data Warehouse
  22. Multidimensional DBMS and the Data Warehouse (cont'd)
      The multidimensional DBMS:
      1. holds at least an order of magnitude less data.
      2. is geared for very heavy and unpredictable access and analysis of data.
      3. holds a much shorter time horizon of data.
      4. allows unfettered access.
      The data warehouse:
      1. holds massive amounts of data.
      2. is geared for a limited amount of flexible access.
      3. contains data with a very lengthy time horizon (from 5 to 10 years).
      4. allows analysts to access its data in a constrained fashion.
      5. Rather than the data warehouse being housed in a multidimensional DBMS, the two enjoy a complementary relationship.
  23. Multidimensional DBMS and the Data Warehouse (cont'd)
      The relational foundation for multidimensional DBMS data marts has the following strengths and weaknesses.
      Strengths:
      ◦ Can support a lot of data.
      ◦ Can support dynamic joining of data.
      ◦ Has proven technology.
      ◦ Is capable of supporting general-purpose update processing.
      ◦ If there is no known pattern of usage of data, then the relational structure is as good as any other.
      Weaknesses:
      ◦ Has performance that is less than optimal.
      ◦ Cannot be purely optimized for access.
  24. Multidimensional DBMS and the Data Warehouse (cont'd)
      The cube foundation for multidimensional DBMS data marts has the following strengths and weaknesses.
      Strengths:
      ◦ Performance that is optimal for DSS processing.
      ◦ Can be optimized for very fast access of data.
      ◦ If the pattern of access of data is known, then the structure of data can be optimized.
      ◦ Can easily be sliced and diced.
      ◦ Can be examined in many ways.
      Weaknesses:
      ◦ Cannot handle nearly as much data as a standard relational format.
      ◦ Does not support general-purpose update processing.
      ◦ May take a long time to load.
      ◦ If access is desired on a path not supported by the design of the data, the structure is not flexible.
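The cube idea, including both its fast slicing and its fixed access paths, can be sketched with a plain dictionary. This is a toy model rather than how a multidimensional DBMS stores data; the three dimensions (region, product, quarter) and the fact rows are assumptions for the example.

```python
from collections import defaultdict

# Hypothetical detailed facts: (region, product, quarter, amount).
facts = [
    ("east", "widget", "Q1", 100),
    ("east", "gadget", "Q1", 40),
    ("west", "widget", "Q1", 70),
    ("east", "widget", "Q2", 60),
    ("east", "widget", "Q1", 25),
]


def build_cube(rows):
    """Pre-aggregate amounts into one cell per dimension combination."""
    cube = defaultdict(float)
    for region, product, quarter, amount in rows:
        cube[(region, product, quarter)] += amount
    return cube


def slice_cube(cube, region=None, product=None, quarter=None):
    """Sum every cell matching the fixed dimensions (None means 'all')."""
    wanted = (region, product, quarter)
    return sum(
        value
        for key, value in cube.items()
        if all(w is None or w == k for w, k in zip(wanted, key))
    )


cube = build_cube(facts)
east_total = slice_cube(cube, region="east")                     # slice
widget_q1 = slice_cube(cube, product="widget", quarter="Q1")     # dice
print(east_total, widget_q1)
```

Loading is where the work happens: every fact is aggregated up front, which matches the "may take a long time to load" weakness. A question along a dimension that was never built into the cube (say, by salesperson) cannot be answered without rebuilding it.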
  25. Multidimensional DBMS and the Data Warehouse (cont'd)
  26. Multidimensional DBMS and the Data Warehouse (cont'd)
  28. Data Warehousing across Multiple Storage Media
      A large amount of data is spread across more than one storage medium.
      ◦ One processing environment is the DASD environment, where online, interactive processing is done.
      ◦ The other processing environment is often a tape or mass store environment.
  29. The Role of Metadata in the Data Warehouse Environment
  30. The Role of Metadata in the Data Warehouse Environment
  31. The Role of Metadata in the Data Warehouse Environment
  32. Context and Content
      The context of the reports must be explained for their contents to be understood.
  33. Three Types of Contextual Information
      Three levels of contextual information must be managed:
      ◦ Simple contextual information
      ◦ Complex contextual information
      ◦ External contextual information
      Simple contextual information relates to the basic structure of data itself, and includes such things as these:
      ◦ The structure of data
      ◦ The encoding of data
      ◦ The naming conventions used for data
      ◦ The metrics describing the data, such as:
        ◦ How much data there is
        ◦ How fast the data is growing
        ◦ What sectors of the data are growing
  34. Three Types of Contextual Information (cont'd)
      Complex contextual information addresses such aspects of data as these:
      ◦ Product definitions
      ◦ Marketing territories
      ◦ Pricing
      ◦ Packaging
      ◦ Organization structure
      ◦ Distribution
  35. Three Types of Contextual Information (cont'd)
      Some examples of external contextual information include the following:
      ◦ Economic forecasts:
        ◦ Inflation
        ◦ Financial trends
        ◦ Taxation
        ◦ Economic growth
      ◦ Political information
      ◦ Competitive information
      ◦ Technological advancements
      ◦ Consumer demographic movements
  36. Capturing and Managing Contextual Information
      Complex and external contextual types of information are hard to capture and quantify because they are so unstructured.
  37. Looking at the Past
      Some shortcomings of past attempts to manage contextual information are as follows:
      ◦ The information management attempts were aimed at the information systems developer, not the end user.
      ◦ Attempts at contextual management were passive.
      ◦ Attempts at contextual information management were in many cases removed from the development effort.
      ◦ Attempts to manage contextual
  38. Refreshing the Data Warehouse
      Reading a log tape is no small matter, however. Many obstacles are in the way, including the following:
      ◦ The log tape contains much extraneous data.
      ◦ The log tape format is often arcane.
      ◦ The log tape contains spanned records.
      ◦ The log tape often contains addresses instead of data values.
      ◦ The log tape reflects the idiosyncrasies of
  39. Testing
      It is very unusual to find a similar test environment in the world of the data warehouse, for the following reasons:
      ◦ Data warehouses are so large that a corporation has a hard time justifying one of them, much less two of them.
      ◦ The nature of the development life cycle for the data warehouse is iterative.
      ◦ For the most part, programs are run in a heuristic manner, not in a repetitive
  40. Summary
      Some technological features are required:
      ◦ Robust language interface
      ◦ Compound keys
      ◦ Variable-length data
      ◦ The abilities to do the following:
        ◦ Manage large amounts of data
        ◦ Manage data on diverse media
        ◦ Easily index and monitor data
        ◦ Interface with a wide number of technologies
        ◦ Allow the programmer to place the data directly on the physical device
        ◦ Store and access data in parallel
        ◦ Have metadata control of the warehouse
        ◦ Efficiently load the warehouse
        ◦ Efficiently use indexes
        ◦ Store data in a compact way
        ◦ Support compound keys
        ◦ Selectively turn off the lock manager
        ◦ Do index-only processing
        ◦ Quickly restore from bulk storage
  41. Summary (cont'd)
      The data architect must recognize the differences between a transaction-based DBMS and a data warehouse-based DBMS.
  42. Summary (cont'd)
      Multidimensional OLAP technology is suited for data mart processing and not data warehouse processing.
      When the data mart approach is used, many problems become evident:
      ◦ The number of extract programs grows large.
      ◦ Each new multidimensional database must return to the legacy operational environment for its own data.
      ◦ There is no basis for reconciliation of differences in analysis.
      ◦ A tremendous amount of redundant data among different multidimensional DBMS environments exists.
  43. Summary (cont'd)
      Metadata in the data warehouse environment plays a very different role than metadata in the operational legacy environment.