Data warehouse physical design

Data Warehouse Physical Design, Physical Data Model, Tablespaces, Integrity Constraints, ETL (Extract-Transform-Load), OLAP Server Architectures, MOLAP vs. ROLAP, Distributed Data Warehouse

  1. 1. Er. Nawaraj Bhandari Data Warehouse/Data Mining Chapter 3: Data Warehouse Physical Design
  2. 2. Physical Design Physical design is the phase of a database design following the logical design that identifies the actual database tables and index structures used to implement the logical design. In the physical design, you look at the most effective way of storing and retrieving the objects as well as handling them from a transportation and backup/recovery perspective.
  3. 3. Physical design decisions are mainly driven by query performance and database maintenance aspects. During the logical design phase, you defined a model for your data warehouse consisting of entities, attributes, and relationships. The entities are linked together using relationships. Attributes are used to describe the entities. The unique identifier (UID) distinguishes between one instance of an entity and another.
  4. 4. Figure: Logical Design Compared with Physical Design
  5. 5. During the physical design process, you translate the expected schemas into actual database structures. At this time, you have to map: ■ Entities to tables ■ Relationships to foreign key constraints ■ Attributes to columns ■ Primary unique identifiers to primary key constraints ■ Unique identifiers to unique key constraints
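The mapping above can be sketched with Python's built-in `sqlite3` module. This is only an illustration: the `customer` and `sales_order` entities and their columns are assumptions made for the example, not part of the slides.

```python
import sqlite3

# Hypothetical entities for illustration: customer and sales_order are not from the slides.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled

conn.executescript("""
CREATE TABLE customer (                    -- entity -> table
    customer_id INTEGER PRIMARY KEY,       -- primary UID -> primary key constraint
    email       TEXT NOT NULL UNIQUE,      -- other UID -> unique key constraint
    name        TEXT NOT NULL              -- attribute -> column
);
CREATE TABLE sales_order (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL
        REFERENCES customer(customer_id),  -- relationship -> foreign key constraint
    amount      REAL NOT NULL
);
""")

conn.execute("INSERT INTO customer VALUES (1, 'a@example.com', 'Asha')")
conn.execute("INSERT INTO sales_order VALUES (10, 1, 250.0)")


def try_insert(sql):
    """Return True if the statement succeeds, False if a constraint rejects it."""
    try:
        conn.execute(sql)
        return True
    except sqlite3.IntegrityError:
        return False


# An order for a customer that does not exist violates the foreign key.
orphan_ok = try_insert("INSERT INTO sales_order VALUES (11, 99, 10.0)")
# A second customer with the same email violates the unique key.
duplicate_ok = try_insert("INSERT INTO customer VALUES (2, 'a@example.com', 'Bina')")
```

Both `orphan_ok` and `duplicate_ok` come back `False`, because the constraints derived from the logical model reject the rows.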
  6. 6. Physical Data Model Features of a physical data model include: Specification of all tables and columns. Specification of foreign keys. De-normalization may be performed if necessary. At this level, the specification of the logical data model is realized in the database.
  7. 7. The steps for physical data model design involve: conversion of entities into tables, conversion of relationships into foreign keys, conversion of attributes into columns, and changes to the physical data model based on the physical constraints.
  8. 8. Figure: Logical model and physical model
  9. 9. Physical Design Objectives Involves tradeoffs among: ■ Performance ■ Flexibility ■ Scalability ■ Ease of Administration ■ Data Integrity ■ Data Consistency ■ Data Availability ■ User Satisfaction
  10. 10. Physical Design Structures Once you have converted your logical design to a physical one, you will need to create some or all of the following structures: ■ Tablespaces ■ Tables and Partitioned Tables ■ Views ■ Integrity Constraints ■ Dimensions Some of these structures require disk space. Others exist only in the data dictionary. Additionally, the following structures may be created for performance improvement: ■ Indexes and Partitioned Indexes ■ Materialized Views
  11. 11. Tablespaces ■ A tablespace consists of one or more datafiles, which are physical structures within the operating system you are using. ■ A datafile is associated with only one tablespace. ■ From a design perspective, tablespaces are containers for physical design structures.
  12. 12. Tables and Partitioned Tables ■ Tables are the basic unit of data storage. They are the container for the expected amount of raw data in your data warehouse. ■ Using partitioned tables instead of non-partitioned ones addresses the key problem of supporting very large data volumes by allowing you to divide them into smaller and more manageable pieces. ■ Partitioning large tables can improve performance because queries and maintenance operations can work on only the relevant partitions instead of the whole table.
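As a rough sketch of why partitioning helps, the following plain-Python example splits hypothetical sales rows into monthly partitions. The data, the monthly partitioning key, and the `monthly_total` helper are assumptions for illustration; a real warehouse would declare partitions in the database itself.

```python
from collections import defaultdict
from datetime import date

# Hypothetical sales rows: (sale_date, amount).
rows = [
    (date(2023, 1, 5), 100.0),
    (date(2023, 1, 20), 50.0),
    (date(2023, 2, 3), 75.0),
    (date(2023, 3, 9), 20.0),
]

# Range-partition by month: each partition holds only the rows for one month.
partitions = defaultdict(list)
for sale_date, amount in rows:
    partitions[(sale_date.year, sale_date.month)].append((sale_date, amount))


def monthly_total(year, month):
    """Scan only the partition for the requested month, not the whole table."""
    return sum(amount for _, amount in partitions.get((year, month), []))


january_total = monthly_total(2023, 1)

# Maintenance also works per partition, e.g. dropping one month of old data.
partitions.pop((2023, 1), None)
remaining_partitions = sorted(partitions)
```

The query for January touches two rows instead of four, and removing old data is a single operation on one partition.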
  13. 13. Views ■ A view is a tailored presentation of the data contained in one or more tables or other views. ■ A view takes the output of a query and treats it as a table. ■ Views store no data of their own, so they require no storage space beyond their definition in the data dictionary.
  14. 14. Improving Performance with the Use of Views Figure: a query accesses a view of selected rows or columns of Table 1, Table 2, and Table 3.
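  15. 15. View ■ A view is a virtual table that can be queried like a real table. ■ Views can be used to simplify and standardize data access. ■ Views can be used to combine tables, so that instead of repeating the join in every query, the query just accesses the view. The join is still executed when the view is queried, so the gain is mainly in simpler queries; storing the result in advance is the job of a materialized view.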
  16. 16. View ■ We can run SQL queries against a view just as against a table, for example: ■ DESC department_worker_view;
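A minimal sketch of such a view, using Python's built-in `sqlite3` module. The view name `department_worker_view` comes from the slide; the `department` and `worker` tables, their columns, and the sample rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE department (dept_id INTEGER PRIMARY KEY, dept_name TEXT NOT NULL);
CREATE TABLE worker (
    worker_id INTEGER PRIMARY KEY,
    name      TEXT NOT NULL,
    dept_id   INTEGER REFERENCES department(dept_id)
);
INSERT INTO department VALUES (1, 'Sales'), (2, 'Finance');
INSERT INTO worker VALUES (1, 'Asha', 1), (2, 'Bikram', 1), (3, 'Chandra', 2);

-- The view stores only the query text, not the joined rows.
CREATE VIEW department_worker_view AS
SELECT w.worker_id, w.name, d.dept_name
FROM worker AS w
JOIN department AS d ON d.dept_id = w.dept_id;
""")

# Query the view like a table; the join runs behind the scenes.
sales_workers = conn.execute(
    "SELECT name FROM department_worker_view WHERE dept_name = ? ORDER BY name",
    ("Sales",),
).fetchall()

# SQLite has no DESC command; PRAGMA table_info lists the view's columns instead.
view_columns = [
    row[1] for row in conn.execute("PRAGMA table_info(department_worker_view)")
]
```

In a database that supports it, `DESC department_worker_view;` plays the role of the `PRAGMA table_info` call used here.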
  17. 17. Integrity Constraints ■ Integrity constraints are used to enforce business rules associated with your database and to prevent having invalid information in the tables. ■ In data warehousing environments, where data is usually validated before loading, constraints are used mainly for query rewrite. ■ NOT NULL constraints are particularly common in data warehouses.
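The effect of such constraints can be sketched with `sqlite3`. The `sales_fact` table, its columns, and the rule that quantity must be positive are hypothetical and stand in for whatever business rules a real warehouse enforces.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE sales_fact (
    sale_id  INTEGER PRIMARY KEY,
    product  TEXT NOT NULL,                         -- value must be present
    quantity INTEGER NOT NULL CHECK (quantity > 0)  -- assumed business rule
)""")


def load(row):
    """Try to load one row and report whether a constraint rejected it."""
    try:
        conn.execute("INSERT INTO sales_fact VALUES (?, ?, ?)", row)
        return "loaded"
    except sqlite3.IntegrityError as exc:
        return f"rejected: {exc}"


results = [
    load((1, "Laptop", 2)),  # valid row
    load((2, None, 1)),      # violates NOT NULL on product
    load((3, "Mouse", 0)),   # violates the CHECK on quantity
]
```

Only the first row is loaded; the other two are reported as rejected along with the database's error message.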
  18. 18. Indexes and Partitioned Indexes ■ Indexes are optional structures associated with tables. ■ Indexes are just like tables in that you can partition them (but the partitioning strategy is not dependent upon the table structure). ■ Partitioning indexes makes it easier to manage the data warehouse during refresh and improves query performance.
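To illustrate how an optional index changes the way a query is executed, the sketch below asks SQLite for its query plan before and after creating an index. The `sales` table and the index name `idx_sales_date` are assumptions; SQLite does not support partitioned indexes, so only the plain index is shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (sale_id INTEGER PRIMARY KEY, sale_date TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales (sale_date, amount) VALUES (?, ?)",
    [("2023-01-05", 100.0), ("2023-01-20", 50.0), ("2023-02-03", 75.0)],
)

QUERY = "SELECT amount FROM sales WHERE sale_date = '2023-01-05'"


def plan_text(sql):
    """Return SQLite's query-plan description for a statement as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[3] for row in rows)  # column 3 holds the plan detail


plan_before = plan_text(QUERY)  # no index yet: the table is scanned

conn.execute("CREATE INDEX idx_sales_date ON sales (sale_date)")
plan_after = plan_text(QUERY)   # the planner can now search via the index
```

The exact wording of the plan text differs between SQLite versions, but after the index is created the plan refers to `idx_sales_date` rather than a full scan.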
  19. 19. Materialized Views ■ Materialized views are query results that have been stored in advance so long-running calculations are not necessary when you actually execute your SQL statements. ■ From a physical design point of view, materialized views resemble tables or partitioned tables and behave like indexes in that they are used transparently and improve performance.
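SQLite has no materialized views, so the sketch below only emulates the idea with a summary table that is rebuilt on demand. The table names and the `refresh_sales_by_region` helper are hypothetical, and unlike a real materialized view the summary is not used transparently by the optimizer: queries must name it explicitly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (region TEXT NOT NULL, amount REAL NOT NULL);
INSERT INTO sales VALUES ('East', 100.0), ('East', 50.0), ('West', 70.0);
""")


def refresh_sales_by_region(conn):
    """Rebuild the stored summary from the detail table (a complete refresh)."""
    conn.executescript("""
    DROP TABLE IF EXISTS sales_by_region_mv;
    CREATE TABLE sales_by_region_mv AS
    SELECT region, SUM(amount) AS total_amount FROM sales GROUP BY region;
    """)


def read_summary(conn):
    """Read the precomputed totals without touching the detail rows."""
    return dict(conn.execute("SELECT region, total_amount FROM sales_by_region_mv"))


refresh_sales_by_region(conn)
conn.execute("INSERT INTO sales VALUES ('West', 30.0)")

stale = read_summary(conn)   # the stored result does not see the new row yet
refresh_sales_by_region(conn)
fresh = read_summary(conn)   # after a refresh the totals are current again
```

The gap between `stale` and `fresh` shows the tradeoff: reads are cheap because the aggregation is stored in advance, but the stored result has to be refreshed when the underlying data changes.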
  20. 20. Data Warehouse: A Multi-Tiered Architecture Figure: (1) Data Storage: operational DBs and other data sources are extracted, transformed, loaded, and refreshed into the data warehouse and data marts, supported by a monitor & integrator and metadata. (2) OLAP Engine: OLAP server (ROLAP server, MOLAP server). (3) Front-End Tools: analysis, query/reports, data mining.
  21. 21. ETL (Extract-Transform-Load) ■ ETL comes from data warehousing and stands for Extract-Transform-Load. It covers the process by which data is loaded from the source system into the data warehouse. ■ ETL now commonly includes cleaning as a separate step, so the sequence becomes Extract-Clean-Transform-Load.
  22. 22. Extract ■ The Extract step covers the data extraction from the source system and makes it accessible for further processing. ■ The main objective of the extract step is to retrieve all the required data from the source system using as few resources as possible. ■ The extract step should be designed so that it does not negatively affect the source system in terms of performance, response time, or any kind of locking.
  23. 23. Extract There are several ways to perform the extract: ■ Update notification - if the source system is able to provide a notification that a record has been changed and describe the change, this is the easiest way to get the data. ■ Incremental extract - some systems may not be able to provide notification that an update has occurred, but they are able to identify which records have been modified and provide an extract of such records. During further ETL steps, the system needs to identify the changes and propagate them downstream. Note that with a daily incremental extract, deleted records may not be handled properly. ■ Full extract - some systems are not able to identify which data has been changed at all, so a full extract is the only way to get the data out of the system. The full extract requires keeping a copy of the last extract in the same format in order to identify changes. A full extract handles deletions as well.
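The change detection needed for a full extract can be sketched as a comparison between the previous and current extracts. The dictionary layout (primary key mapped to the row's values), the sample records, and the function name `diff_full_extracts` are assumptions for illustration.

```python
def diff_full_extracts(previous, current):
    """Compare two full extracts keyed by primary key and classify the changes."""
    inserted = {k: v for k, v in current.items() if k not in previous}
    deleted = {k: v for k, v in previous.items() if k not in current}
    updated = {
        k: v for k, v in current.items() if k in previous and previous[k] != v
    }
    return inserted, updated, deleted


# Hypothetical customer extracts: key -> (name, city).
previous_extract = {
    1: ("Asha", "Kathmandu"),
    2: ("Bikram", "Pokhara"),
    3: ("Chandra", "Lalitpur"),
}
current_extract = {
    1: ("Asha", "Kathmandu"),   # unchanged
    2: ("Bikram", "Butwal"),    # city changed
    4: ("Dipa", "Biratnagar"),  # new record; record 3 has disappeared
}

inserted, updated, deleted = diff_full_extracts(previous_extract, current_extract)
```

Record 3 shows up in `deleted` only because the previous full extract was kept, which is why a full extract can handle deletions while an incremental extract of modified records may miss them.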
  24. 24. Clean The cleaning step is one of the most important as it ensures the quality of the data in the data warehouse. Cleaning should apply basic data unification rules, such as: ■ Making identifiers unique (sex categories Male/Female/Unknown, M/F/null, Man/Woman/Not Available are translated to the standard Male/Female/Unknown) ■ Converting null values into a standardized Not Available/Not Provided value ■ Converting phone numbers and ZIP codes to a standardized form ■ Validating address fields and converting them to a consistent naming, e.g. Street/St/St./Str./Str ■ Validating address fields against each other (State/Country, City/State, City/ZIP code, City/Street).
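A few of these unification rules can be sketched as small Python functions. The code table, the digits-only phone format, and the function names are assumptions; real cleaning rules depend on the source systems.

```python
import re

# Assumed code table covering the variants listed in the slide.
SEX_CODES = {
    "male": "Male", "m": "Male", "man": "Male",
    "female": "Female", "f": "Female", "woman": "Female",
}

# Matches the abbreviations St, St., Str, Str. as whole words.
STREET_RE = re.compile(r"\b(St|Str)\b\.?", re.IGNORECASE)


def clean_sex(value):
    """Map the different source encodings to Male/Female/Unknown."""
    if value is None:
        return "Unknown"
    return SEX_CODES.get(str(value).strip().lower(), "Unknown")


def clean_text(value):
    """Replace nulls and blank strings with a standardized placeholder."""
    if value is None or not str(value).strip():
        return "Not Available"
    return str(value).strip()


def clean_phone(value):
    """Keep digits only; the digits-only target format is an assumption."""
    if value is None:
        return "Not Available"
    digits = re.sub(r"\D", "", str(value))
    return digits or "Not Available"


def clean_street(value):
    """Expand street abbreviations to the full word."""
    return STREET_RE.sub("Street", clean_text(value))
```

For example, `clean_sex(" Woman ")` gives `"Female"` and `clean_street("12 Main St.")` gives `"12 Main Street"`. Cross-field validation, such as checking a city against its ZIP code, needs reference data and is left out of this sketch.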
  25. 25. Transform The transform step applies a set of rules to transform the data from the source to the target. This includes converting any measured data to the same dimension (i.e. a conformed dimension) using the same units so that they can later be joined. The transformation step also requires joining data from several sources, generating aggregates, generating surrogate keys (system-generated keys that stand in for the sources' natural keys), sorting, deriving new calculated values, and applying advanced validation rules.
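A small sketch of three of these transformations: unit conversion, surrogate key generation, and aggregation across sources. The source rows, the `TO_KG` conversion table, and the helper names are hypothetical.

```python
from collections import defaultdict
from itertools import count

# Hypothetical rows from two sources that report weight in different units.
source_a = [
    {"product_code": "P-1", "weight": 2.0, "unit": "kg"},
    {"product_code": "P-2", "weight": 500.0, "unit": "g"},
]
source_b = [
    {"product_code": "P-1", "weight": 1500.0, "unit": "g"},
]

TO_KG = {"kg": 1.0, "g": 0.001}  # conversion factors to the common unit

surrogate_keys = {}  # natural key -> surrogate key
next_key = count(1)  # simple integer sequence for new keys


def surrogate_key_for(natural_key):
    """Return the existing surrogate key or assign the next one."""
    if natural_key not in surrogate_keys:
        surrogate_keys[natural_key] = next(next_key)
    return surrogate_keys[natural_key]


# Combine both sources, convert to kilograms, and swap in surrogate keys.
transformed = [
    {
        "product_sk": surrogate_key_for(row["product_code"]),
        "weight_kg": row["weight"] * TO_KG[row["unit"]],
    }
    for row in source_a + source_b
]

# Generate an aggregate: total weight per product.
total_weight_kg = defaultdict(float)
for row in transformed:
    total_weight_kg[row["product_sk"]] += row["weight_kg"]
```

Because both sources end up with the same surrogate key for `P-1` and the same unit, their rows can be added together in the aggregate.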
  26. 26. OLAP Server Architectures Types of OLAP Servers ■ Relational OLAP (ROLAP) ■ Multidimensional OLAP (MOLAP) ■ Hybrid OLAP (HOLAP)
  27. 27. Relational OLAP (ROLAP) ■ Relational OLAP servers are placed between the relational back-end server and the client front-end tools. To store and manage the warehouse data, relational OLAP uses a relational or extended-relational DBMS. ■ ROLAP servers can easily be used with an existing RDBMS. ■ ROLAP tools do not use pre-calculated data cubes.
  28. 28. Multidimensional OLAP (MOLAP) ■ Multidimensional OLAP (MOLAP) uses array-based multidimensional storage engines for multidimensional views of data. With multidimensional data stores, the storage utilization may be low if the data set is sparse. Therefore, many MOLAP servers use two levels of data storage representation to handle dense and sparse data sets. ■ MOLAP allows fast indexing into the pre-computed summarized data. ■ MOLAP is easier to use and is therefore suitable for inexperienced users.
  29. 29. MOLAP vs. ROLAP
      MOLAP | ROLAP
      Information retrieval is fast. | Information retrieval is comparatively slow.
      Uses sparse arrays to store data sets. | Uses relational tables.
      Best suited for inexperienced users, since it is very easy to use. | Best suited for experienced users.
      Maintains a separate database for data cubes. | May not require space beyond what is available in the data warehouse.
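The storage difference in this comparison can be sketched on a tiny, hypothetical fact set. The ROLAP side aggregates relational rows at query time with `sqlite3`; the MOLAP side is approximated by a nested-list array with pre-computed totals. Neither is a real OLAP server, and all names and values are assumptions.

```python
import sqlite3

# Hypothetical facts: (product, region, quantity).
facts = [("Laptop", "East", 3), ("Laptop", "West", 1), ("Mouse", "East", 5)]

# ROLAP-style: keep the facts in a relational table and aggregate at query time.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (product TEXT, region TEXT, quantity INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", facts)
rolap_total = conn.execute(
    "SELECT SUM(quantity) FROM sales WHERE product = ?", ("Laptop",)
).fetchone()[0]

# MOLAP-style: store the facts in an array addressed by dimension positions
# and keep summaries that were computed in advance.
products = ["Laptop", "Mouse"]
regions = ["East", "West"]
cube = [[0] * len(regions) for _ in products]
for product, region, quantity in facts:
    cube[products.index(product)][regions.index(region)] += quantity

product_totals = [sum(row) for row in cube]  # pre-computed summary per product
molap_total = product_totals[products.index("Laptop")]

# The array reserves a cell for every combination, even those with no sales.
empty_cells = sum(cell == 0 for row in cube for cell in row)
```

Both approaches return the same total for `Laptop`. The array answers by direct lookup but carries an empty cell for the combination with no sales, which hints at why storage utilization drops on sparse data.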
  30. 30. Hybrid OLAP (HOLAP) ■ Hybrid OLAP is a combination of both ROLAP and MOLAP. It offers the higher scalability of ROLAP and the faster computation of MOLAP. ■ HOLAP servers allow large volumes of detailed information to be stored, while the aggregations are stored separately in a MOLAP store.
  31. 31. Distributed Data Warehouse ■ In a distributed data warehouse (DDW), data is shared across multiple data repositories for the purpose of OLAP. Each data warehouse may belong to one or many organizations. The sharing implies a common format or definition of data elements (e.g. using XML). ■ Distributed data warehousing encompasses a complete enterprise DW but has smaller data stores that are built separately and joined physically over a network, providing users with access to relevant reports without impacting performance. ■ A distributed DW, the nucleus of all enterprise data, sends relevant data to individual data marts from which users can access information for order management, customer billing, sales analysis, and other reporting and analytic functions.
  32. 32. Data Warehouse Manager ■ Collects data inputs from a variety of sources, including legacy operational systems, third-party data suppliers, and informal sources. ■ Assures the quality of these data inputs by correcting spelling, removing mistakes, eliminating null data, and combining multiple sources. ■ Releases the data from the data staging area to the individual data marts on a regular schedule. ■ Estimates and measures the costs and benefits.
  33. 33. Virtual Warehouse ■ The data warehouse is a great idea, but it is complex to build and requires investment. Why not use a cheap and fast approach by eliminating the transformation steps of repositories for metadata and another database? ■ This approach is termed the 'virtual data warehouse'. To accomplish this, there is a need to define four kinds of information: ■ A data dictionary containing the definitions of the various databases. ■ A description of the relationships among the data elements. ■ A description of the way users will interface with the system. ■ The algorithms and business rules that define what to do and how to do it.