Course: Data Warehousing Concepts & Design Owner: Technical Training Cell (TTC), Mastek Ltd. Document Version: 1.0
What Is Business Intelligence (BI)?
Business Intelligence is the process of transforming relevant business data into information, information into knowledge, and, through iterative discovery, knowledge into intelligence.
Data: Represents reality, facts, and figures. Also referred to as ‘raw data’.
Information: Relevant data that has been processed and interpreted.
Knowledge: The meaning and understanding that result from the processed information.
Objective of BI
BI can also be defined as taking ‘decisions based on data’. The purpose of BI is to turn large volumes of data into information, and to link bits of information together within a decision context so that they become knowledge that can be used for decision making.
Evolution of BI
EIS applications were developed by the IS team and written in a 3GL, a 4GL, or some other structured programming language such as C++. They offered predefined, restrictive queries whose results were delivered in tabular or chart form. Generally, the information provided was limited.
(1GL: machine language; 2GL: assembly; 3GL: C, C++, COBOL, Java; 4GL: VB, PB, SQL, Access; 5GL: Prolog)
MIS provided management with reports to assess the performance of the business. Report requests were submitted to the MIS team, and the reports were delivered to users after a period of time: a few days, weeks, or even months.
DSS applications (VisiCalc, Lotus 1-2-3, VPP, Excel) were the first generation of packaged software that provided dynamically generated SQL, enabling users to extract data from relational databases. This data was relevant to their business needs and focus.
BI, the next generation of DSS, is primarily used by analysts and provides the capability to format reports easily. Additionally, multiple sources and multiple subject matters can be used simultaneously to provide an accurate assessment of the business. What-if scenarios can be run against the multi-dimensional models, and Web-enabled applications provide decentralization of the warehouse.
Information
Information is one of the most important assets of the organization. Within an organization it exists in two different forms:
- Operational systems (OLTP)
- Data warehouse (DWH)
The two systems differ in business needs, purpose, and users.
Features of OLTP Systems OLTP systems handle day-to-day transactions and operations of the business. They are high performance, high throughput Systems. They run mission critical applications. OLTP systems store, update and retrieve Operational Data. Operational Data is the data that runs the business. Accounting system, banking software, payroll package, Order-processing system, SAP, airline reservation system are some typical examples of OLTP systems. OLTP systems are built over a period of time. They are not built under a single project. They evolve/change as business grows and new lines of business get added. They are built using different tools and technologies. E.g. A bank could start its operations with savings and current account only and later with market changes and time, might add Demat, Forex and home loans services to its operations. All these systems could get built at different intervals in isolation.
Why OLTP Is Not Suitable for Analytical Reporting
Operational systems largely exist to support transactions. Decision support involves complex analysis and is different from OLTP. Most OLTP transactions require a single record in a database to be located and updated, or one or more new records to be added. The solution is to separate the two kinds of system, production systems and analytical systems, by building a data warehouse.
Data Warehouse Versus Online Transaction Processing (OLTP)
Response Time and Data Operations
Data warehouses are constructed for very different reasons than online transaction processing (OLTP) systems. OLTP systems are optimized for getting data in, that is, for storing data as a transaction occurs. Data warehouses are optimized for getting data out, that is, for providing quick responses for analysis purposes. Because there tends to be a high volume of activity in the OLTP environment, rapid response is critical. Data warehouse applications are analytical rather than operational, so slower performance is acceptable.
Nature of Data
The data stored in each database varies in nature. The data warehouse contains snapshots of data over time to support time-series analysis, whereas the OLTP system stores very detailed data for a short time, such as 30 to 60 days or 1 to 2 years. Data in an OLTP system is current; data in a warehouse is historical.
Data Warehouse Versus Online Transaction Processing (OLTP) (continued)
Data Organization
The data warehouse is subject specific and supports analysis, so data is arranged accordingly. For the OLTP system to support sub-second response, the data must be arranged to optimize the application. For example, an order entry system may have tables that hold each of the elements of the order, whereas a data warehouse may hold the same data but arrange it by subject, such as customer, product, and so on.
Size
An operational database typically ranges in size from a few MB to GB. A data warehouse typically ranges from GB to TB.
Data Sources
Because the data warehouse is created to support analytical activities, data from a variety of sources can be integrated. The operational data store of the OLTP system holds only internal data, or data necessary to capture the operation or transaction.
Data Warehouse Versus Online Transaction Processing (OLTP) (continued)
Activities
OLTP is process oriented and designed to handle the day-to-day operations of the business. A data warehouse is subject oriented and built to handle querying and analysis.
Number of Records
An OLTP system deals with one transaction or one record at a time. A query fired on a data warehouse, however, may scan anywhere from a few thousand to millions of records.
Grain
Data in an OLTP system is at the finest grain: it exists at the detail or transaction (atomic) level. In a data warehouse, data can exist at the atomic level, the aggregate level, or both, depending on business needs.
Database Design
OLTP systems are typically highly normalized and are designed to store operational data, one transaction at a time. A data warehouse is designed differently: it uses a dimensional model (generally a star schema) with highly denormalized dimension tables. This design enables complex query analysis.
Note: Dimensional modeling is covered later in the session.
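The contrast in access patterns described above can be illustrated with a small sketch. It uses Python's built-in sqlite3 module purely for convenience; the orders table, its columns, and the sample rows are hypothetical and not part of the course material. The first statement pair is OLTP-style (locate and update one record by key), the second is warehouse-style (scan many records and aggregate).

```python
import sqlite3

# Hypothetical order table used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    "order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL, status TEXT)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [
        (1, "Asha", 120.0, "OPEN"),
        (2, "Ravi", 80.0, "OPEN"),
        (3, "Asha", 50.0, "SHIPPED"),
    ],
)

# OLTP-style access: find one record by its key and update it.
conn.execute("UPDATE orders SET status = 'SHIPPED' WHERE order_id = ?", (2,))
oltp_row = conn.execute(
    "SELECT status FROM orders WHERE order_id = ?", (2,)
).fetchone()

# Analytical access: read every row and summarize by subject (customer).
totals = conn.execute(
    "SELECT customer, SUM(amount) FROM orders "
    "GROUP BY customer ORDER BY customer"
).fetchall()

print(oltp_row)
print(totals)
```

In a real environment the two workloads would run on separate systems; they share one in-memory database here only to keep the sketch self-contained.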
Data Extract Processing Extract processing was a logical progression from decision support systems. It was seen as a way to move the data from the high-performance, high throughput online transaction processing systems onto client machines that are dedicated to analysis. Extract processing also gave the user ownership of the data. DSS and Degradation The problem of performance degradation was partially solved by using extract processing techniques that select data from one environment and transport it to another environment for user access (a data extract). Data Extract Program The data extract program searches through files and databases, gathering data according to specific criteria. The data is then placed into a separate set of files, which resides on another environment, for use by analysts for decision support activities.
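A data extract program of the kind described above might look like the following minimal sketch. It assumes a hypothetical sales table and a region-based selection criterion, and it writes to an in-memory text buffer that stands in for the separate set of files; none of these names come from the course material.

```python
import csv
import io
import sqlite3


def run_extract(source, region, target):
    """Select rows matching the criteria and write them to a separate target."""
    cursor = source.execute(
        "SELECT sale_id, region, amount FROM sales "
        "WHERE region = ? ORDER BY sale_id",
        (region,),
    )
    writer = csv.writer(target)
    writer.writerow(["sale_id", "region", "amount"])
    rows_written = 0
    for row in cursor:
        writer.writerow(row)
        rows_written += 1
    return rows_written


# Hypothetical operational source.
source = sqlite3.connect(":memory:")
source.execute(
    "CREATE TABLE sales (sale_id INTEGER PRIMARY KEY, region TEXT, amount REAL)"
)
source.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [(1, "WEST", 100.0), (2, "EAST", 40.0), (3, "WEST", 75.5)],
)

# The buffer plays the role of the extract file handed to analysts.
extract_file = io.StringIO()
rows_written = run_extract(source, "WEST", extract_file)
print(extract_file.getvalue())
```

Each such program embeds its own selection criteria, which is one reason a large collection of extracts becomes hard to manage.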
Management Issues with Data Extract Programs
Although the principle of extracts appears logical, and to some degree resembles the way a data warehouse works, there are problems with extract processing. Extract programs may become the source for other extracts, and extract management can become a full-time task for information systems departments. In some companies, hundreds of extract programs are running at any time.
Productivity Issues
- Extract effort is duplicated, because multiple extracts access the same data and use mainframe resources unnecessarily.
- The program designed to access the extracted data must encompass all technologies that are employed by the source data.
- There is no common metadata providing a standard way of extracting, integrating, and using the data.
Data Quality Issues with Extract Processing
The following data quality issues arise in an extract processing environment:
- The data has no time basis, so users cannot compare query results with confidence. The data extracts may have been taken at different points in time.
- Each data extract may use a different algorithm for calculating derived and computed values. This makes the data difficult to evaluate, compare, and communicate for managers who may not know the methods or algorithms used to create the data extract or reports.
- Data extract programs may use different levels of extraction.
- Access to external data may not be consistent, and the granularity of the external data may not be well defined.
- Data sources may be difficult to identify, and data elements may be repeated on many extracts.
- The data field names and values may have different meanings in the various systems in the enterprise (lack of semantic integrity).
- There are no data correction rules to ensure that the extracted data is correct and clean.
- The reports provide data rather than information, and offer no drill-down capability.
Data Warehousing and Business Intelligence For the past decade, there has been a transition from decision support using data extracts to decision support using the data warehouse. A data warehouse is a strategic collection of all types of data in support of the decision-making process at all levels of an enterprise. A Data warehouse is a single data store created for two primary reasons: Analytical reporting and Decision support.
Technology Needed to Support the Business Needs
Today’s information technology climate provides you with cost-effective computing resources in the hardware and software arena, Internet and intranet solutions, and databases that can hold very large volumes of data for analysis, using a multitude of data access technologies.
Technological Advances Enabling Data Warehousing
Technology (specifically open systems technology) is making it affordable to analyze vast amounts of data, and hardware solutions are now more cost-effective. Recent advances in parallelism have benefited all aspects of computing:
- Hardware environment
- Operating system environment
- Database management systems and all associated database operations
- Query tools and techniques
- Applications
W. H. Inmon states that “A data warehouse is a subject oriented, integrated, nonvolatile, and time variant collection of data in support of management’s decisions”.
Subject-oriented: Data that gives information about a particular subject instead of about a company's ongoing processes or operations.
Integrated: Data that is gathered into the data warehouse from a variety of sources and merged into a coherent whole.
Time-variant: All data in the data warehouse is identified with a particular time period.
Nonvolatile: Data is read-only and stable in a data warehouse. More data is added, but loaded data is not normally changed or removed. This enables management to gain a consistent picture of the business.
Data Warehouse Properties
Bill Inmon’s definition of a data warehouse refers to the main properties of a data warehouse:
- Subject-oriented
- Integrated
- Nonvolatile
- Time-variant
The following slides discuss these properties one by one in detail.
Subject-oriented data is organized around major subject areas of an enterprise, and is useful for an enterprisewide understanding of those subjects. For example, a banking operational system keeps independent records of customer savings, loans, and other transactions. A warehouse pulls this independent data together to provide financial information. You can access subject-oriented data related to any major subject area of an enterprise: Customer financial information Toll calls made in the telecommunications industry Airline passenger booking information Insurance claim data While the data in an OLTP system is stored to support a specific business process (for example, order entry, campaign management, and so on) as efficiently as possible, data in a data warehouse is stored based on common subject areas (for example, customer, product, and so on). That is because the complete set of questions to be posed to a data warehouse is never known. Every question the data warehouse answers spawns new questions. Thus, the focus of the design of a data warehouse is providing users easy access to the data so that current and future questions can be answered.
Integrated
In many organizations, data resides in diverse independent systems, making it difficult to integrate into one set of meaningful information for analysis. A key characteristic of a warehouse is that its data is completely integrated. Data is stored in a globally acceptable manner, even when the underlying source data is stored differently. The transformation and integration process can be time consuming and costly.
Data Consistency
You must deal with data inconsistencies and anomalies before the data is loaded into the warehouse. Consistency is applied to naming conventions, measurements, encoding structures, and physical attributes of the data.
Data Redundancy
Redundancy at the detail level in the warehouse is eliminated. However, selective redundancy in the form of aggregates and summaries is required to improve the performance of queries, especially drill-down analysis.
In many implementations, about 70% or more of the warehouse development resources are used in developing extraction, transformation, and loading (ETL) routines.
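To make the data consistency point concrete, here is a minimal sketch of standardizing encodings and measurements before loading. The gender codes, the two source systems "A" and "B", and the assumption that system "B" records lengths in inches are all hypothetical examples chosen for illustration.

```python
# Hypothetical mapping from source-specific encodings to one warehouse code.
GENDER_CODES = {
    "m": "M", "male": "M", "1": "M",
    "f": "F", "female": "F", "0": "F",
}
INCHES_TO_CM = 2.54


def standardize(record, source_system):
    """Return a record using the warehouse's agreed encoding and units."""
    raw_gender = str(record["gender"]).strip().lower()
    gender = GENDER_CODES.get(raw_gender, "U")  # "U" marks an unknown code
    length = float(record["length"])
    if source_system == "B":  # assumption: system B stores inches
        length *= INCHES_TO_CM
    return {"gender": gender, "length_cm": round(length, 2)}


from_a = standardize({"gender": "male", "length": 30}, "A")
from_b = standardize({"gender": 0, "length": 10}, "B")
print(from_a)
print(from_b)
```

Real integration rules are agreed with the business and are usually far more extensive; the point is only that each source's conventions are translated into a single warehouse convention.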
Time-Variant Warehouse data is by nature historical; it does not usually contain real-time transactional data. Data is represented over a long time horizon, from 2 to 10 years, compared with 1 to 3 months (sometimes 1 to 2 years) of data for a typical operational system. The data allows for analysis of past and present trends, and for forecasting using “what-if” scenarios. Time Element The data warehouse always contains a key element of time, such as quarter, month, week, or day, that determines when the data was loaded. The date may be a single snapshot date, such as 10-JAN-02, or a range, such as 01-JAN-02 to 31-JAN-02. Snapshots by Time Period Warehouse data is a series of snapshots by time periods that do not change.
Nonvolatile
Typically, data in the data warehouse is read-only (from the end-user or business-user perspective). Data is loaded into the data warehouse in a first-time load and then refreshed regularly. Warehouse operations typically involve:
- Loading the initial set of warehouse data (often called the first-time load)
- Refreshing the data regularly (called the refresh cycle)
Accessing the Data
Once a snapshot of data is loaded into the warehouse, it rarely changes. Therefore, data manipulation is not a consideration at the physical design level. The physical warehouse is optimized for data retrieval and analysis.
Refresh Cycle
The data in the warehouse is refreshed; that is, snapshots are added. The refresh cycle is determined by the business users. A refresh cycle need not be the same as the grain (the level at which the data is stored) of the data for that cycle. For example, you may choose to refresh the warehouse weekly, but the grain of the data may be daily.
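The weekly-refresh, daily-grain example can be sketched as follows. The table name, the single hypothetical product, and the unit figures are assumptions made for illustration; the sketch only shows that each refresh appends new snapshots and leaves earlier ones untouched.

```python
import sqlite3
from datetime import date, timedelta

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE daily_sales_snapshot ("
    "snapshot_date TEXT, product TEXT, units INTEGER, "
    "PRIMARY KEY (snapshot_date, product))"
)


def weekly_refresh(conn, week_start, daily_units):
    """Append one week of daily-grain snapshots; existing rows are not updated."""
    rows = [
        ((week_start + timedelta(days=offset)).isoformat(), "WIDGET", units)
        for offset, units in enumerate(daily_units)
    ]
    conn.executemany("INSERT INTO daily_sales_snapshot VALUES (?, ?, ?)", rows)
    conn.commit()


# Two refresh cycles, each adding seven daily rows.
weekly_refresh(conn, date(2002, 1, 7), [5, 3, 8, 2, 7, 4, 6])
weekly_refresh(conn, date(2002, 1, 14), [1, 9, 2, 6, 3, 5, 4])

row_count = conn.execute("SELECT COUNT(*) FROM daily_sales_snapshot").fetchone()[0]
date_range = conn.execute(
    "SELECT MIN(snapshot_date), MAX(snapshot_date) FROM daily_sales_snapshot"
).fetchone()
print(row_count, date_range)
```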
Changing Warehouse Data
The following operations are typical of a data warehouse:
- The initial set of data is loaded into the warehouse, often called the first-time load. This is the data by which you will measure the business, and the data containing the criteria by which you will analyze the business.
- Frequent snapshots of core data warehouse data (more occurrences) are added according to the refresh cycle, using data from the multiple source systems.
Warehouse data may need to be changed for a number of reasons:
- The data that you are using to analyze the business may change, and the data warehouse must be kept up to date to remain accurate.
- The business determines how much historical data is needed for analysis, say five years' worth. Older data is either archived or purged.
- Inappropriate or inaccurate data values may be deleted from or migrated out of the data warehouse.
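The retention rule mentioned above (keep a fixed window of history, archive or purge the rest) can be sketched as a simple split on a cutoff date. The snapshot records and the cutoff are hypothetical; how the expired rows are archived is left out.

```python
from datetime import date


def apply_retention(snapshots, cutoff):
    """Split snapshots into those kept online and those to archive or purge."""
    kept = [s for s in snapshots if s["snapshot_date"] >= cutoff]
    expired = [s for s in snapshots if s["snapshot_date"] < cutoff]
    return kept, expired


# Hypothetical month-end snapshots.
snapshots = [
    {"snapshot_date": date(1996, 12, 31), "revenue": 10.0},
    {"snapshot_date": date(1997, 1, 31), "revenue": 12.0},
    {"snapshot_date": date(2001, 12, 31), "revenue": 20.0},
]

# A five-year window ending 31-DEC-2001 starts on 01-JAN-1997.
kept, expired = apply_retention(snapshots, cutoff=date(1997, 1, 1))
print(len(kept), len(expired))
```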
As seen above, successful data warehousing demands a range of technical and business skills. Being only a DBA or a technician is not enough. According to Ralph Kimball, the skill set an individual needs to implement a data warehouse successfully is ½ MBA and ½ DBA.
Usage Curves
Operational systems and data warehouses have different usage curves. An operational system has a more predictable usage curve, whereas the warehouse has a less predictable, more varied, and more random usage curve. Access to the warehouse varies not just on a daily basis, but may even be affected by forces such as seasonal variations. For this reason, you cannot expect the operational system to handle heavy analytical queries (DSS) and at the same time handle the load of minute-by-minute transaction processing, which requires fast transaction rates.
User Expectations The difference in response time may be significant between a data warehouse and a client-server environment fronted by personal computers. You must control the user’s expectations regarding response. Set reasonable and achievable targets for query response time, which can be assessed and proved in the first increment of development. You can then define, specify, and agree on Service Level Agreements (SLA). If users are accustomed to fast PC-based systems, they may find the warehouse excessively slow. However, it is up to those educating the users to ensure that they are aware of just how big the warehouse is, how much data there is, and of what benefit the information is to both the user and the business. Exponential Growth and Use Once implemented, data warehouses continue to grow in size. Each time the warehouse is refreshed more data is added, deleted, or archived. The refresh happens on a regular cycle. Successful data warehouses grow very quickly. Once the success of the warehouse is proven, its use increases dramatically and it often grows faster than expected.
Enterprisewide Data Warehouse
To summarize, an enterprisewide warehouse stores data from all subject areas within the business for analysis by end users. The scope of the warehouse is the entire business and all operational aspects within the business. An enterprisewide warehouse is normally (and should be) created through a series of incrementally developed solutions. Never create an enterprisewide data warehouse under one project umbrella; it will not work. With an enterprisewide data warehouse, all users access the warehouse, which provides:
- A single source of corporate enterprisewide data
- A single source of synchronized data in the enterprisewide warehouse for each subject area
- A single point for distribution of data to dependent data marts
Grain of Data - Granularity
Granularity is usually mentioned in the context of dimensional data structures (that is, facts and dimensions) and refers to the level of detail in a given fact table. The more detail there is in the fact table, the higher its granularity, and vice versa. The meaning of a single row in a fact table is referred to as the grain of the data.
Fact table: Similar to a transaction table. It stores the facts or measures of the business. Facts are generally numeric and additive. Example: the SALES table in a dimensional model.
Dimension table: Similar to a master table. It stores the textual descriptors of the business. Examples: the CUSTOMER and PRODUCT tables in a dimensional model.
Note: Dimensional modeling is covered in detail later in the session.
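As a small illustration of grain, the sketch below stores a SALES fact table at daily grain with a PRODUCT dimension and then rolls it up to a coarser monthly grain. The column names and sample rows are assumptions made for this example, and sqlite3 is used only because it ships with Python.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Dimension table: textual descriptors of the business.
    CREATE TABLE product (
        product_key INTEGER PRIMARY KEY,
        product_name TEXT,
        category TEXT
    );
    -- Fact table: one row per product per day (the grain).
    CREATE TABLE sales (
        sale_date TEXT,
        product_key INTEGER REFERENCES product(product_key),
        units INTEGER,
        revenue REAL
    );
    """
)
conn.executemany(
    "INSERT INTO product VALUES (?, ?, ?)",
    [(1, "Pen", "Stationery"), (2, "Lamp", "Home")],
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        ("2002-01-10", 1, 10, 50.0),
        ("2002-01-11", 1, 4, 20.0),
        ("2002-01-11", 2, 1, 30.0),
        ("2002-02-01", 2, 2, 60.0),
    ],
)

# Roll the additive facts up from daily grain to month-by-category grain.
monthly = conn.execute(
    """
    SELECT substr(s.sale_date, 1, 7) AS month, p.category,
           SUM(s.units), SUM(s.revenue)
    FROM sales s
    JOIN product p ON p.product_key = s.product_key
    GROUP BY month, p.category
    ORDER BY month, p.category
    """
).fetchall()
print(monthly)
```

Because the facts are additive, the coarser grain can always be derived from the finer one, but not the other way around.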
Data Warehouse Versus Data Mart
Definition
A data mart is a subset of data warehouse fact and summary data that provides users with information specific to their departmental requirements. It can be a subject-oriented data warehouse for functional or departmental information needs, or it can be a mini enterprisewide data warehouse combining data from multiple subject areas and acting as a kernel to feed the enterprise warehouse.
Scope
A data warehouse deals with multiple subject areas and is typically implemented and controlled by a central organizational unit such as the Corporate Information Technology group. It is often called a central or enterprise data warehouse.
Subjects
A data mart is a departmental form of a data warehouse designed for a single line of business (LOB) or functional area such as sales, finance, or marketing.
Data Warehouse Versus Data Mart (continued)
Data Source
A data warehouse typically assembles data from multiple source systems. A data mart typically assembles data from fewer sources.
Implementation Time
Data marts are typically less complex than data warehouses and are therefore typically easier to build and maintain. A data mart can be built as a “proof of concept” (POC) step toward the creation of an enterprisewide warehouse.
Note
Size is not necessarily a differentiating factor when comparing a data warehouse to a data mart. In fact, data marts are sometimes as big as or even bigger than the enterprise data warehouse because they leverage third-party data sources. The main difference to highlight is that data marts are departmental in focus.
When There Exists...                          System Design Is
Several independent enterprise DWs or DMs     Distributed
Single enterprise DW                          Centralized
Enterprise DW with several dependent DMs      Multi-tiered
Dependent Data Marts
Data marts can be categorized into two types: dependent and independent. The categorization is based primarily on the data source that feeds the data mart.
Dependent Data Mart
Dependent data marts have the following characteristics:
- The source is the warehouse. Dependent data marts rely on the data warehouse for content.
- The extraction, transformation, and loading (ETL) process is easy. Dependent data marts draw data from a central data warehouse that has already been created. Thus, the main effort in building a mart, the data cleansing and extraction, has already been performed. The dependent data mart simply requires data to be moved from one database to another.
- The data mart is part of the enterprise plan. Dependent data marts are usually built to achieve improved performance and availability, better control, and lower telecommunication costs resulting from local access to data relevant to a specific department.
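Because a dependent data mart only moves already-cleansed data from one database to another, populating it can be as simple as a filtered copy. The sketch below assumes a hypothetical warehouse_sales table and a finance mart, and uses SQLite's ATTACH DATABASE to stand in for a second database.

```python
import sqlite3

# "main" plays the warehouse; the attached database plays the data mart.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS mart")

conn.execute(
    "CREATE TABLE main.warehouse_sales ("
    "sale_date TEXT, department TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO main.warehouse_sales VALUES (?, ?, ?)",
    [
        ("2002-01-10", "FINANCE", 500.0),
        ("2002-01-11", "SALES", 300.0),
        ("2002-01-12", "FINANCE", 250.0),
    ],
)

# No cleansing step: the mart just copies the department's slice.
conn.execute(
    """
    CREATE TABLE mart.finance_sales AS
    SELECT sale_date, amount
    FROM main.warehouse_sales
    WHERE department = 'FINANCE'
    """
)
mart_rows = conn.execute(
    "SELECT sale_date, amount FROM mart.finance_sales ORDER BY sale_date"
).fetchall()
print(mart_rows)
```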
Independent Data Marts Independent data marts are stand-alone systems built from scratch that draw data directly from operational or external sources of data. Independent data marts have the following characteristics: The sources are operational systems and external sources. The ETL process is difficult. Because independent data marts draw data from unclean or inconsistent data sources, efforts are directed towards error processing and integration of data. The data mart is built to satisfy analytical needs. The creation of independent data marts is often driven by the need for a quick solution to analysis demands. Note Many dependent data marts still obtain some of their internal and external data outside of the data warehouse. Independent data marts are often not seen as a good solution and should be avoided for a number of reasons (for example, different answers to the same business question from multiple data marts, duplication of ETL, and so on).
Warehouse Development Approaches
The most challenging aspect of data warehousing lies not in its technical difficulty, but in choosing the best approach to data warehousing for your company’s structure and culture, and in dealing with the organizational and political issues that will inevitably arise during implementation. The approaches to developing a data warehouse include:
- Top-down approach
- Bottom-up approach
- Hybrid approach
- Federated approach
There is still a great deal of confusion about the similarities and differences among these architectures, especially the “top-down” and “bottom-up” approaches.
Top-Down Approach
The top-down approach is also referred to as the “Big-Bang” approach. There are similarities and differences among the four data warehouse approaches, and this is especially true of the “top-down” and “bottom-up” approaches. These two approaches have existed the longest and occupy the polar ends of the development spectrum. They are championed by industry heavyweights Bill Inmon and Ralph Kimball, both prolific authors and consultants in the data warehousing field. Inmon, who is credited with coining the term “data warehousing” in the early 1990s, advocates a top-down approach, in which companies first build a data warehouse followed by data marts. Kimball’s approach, on the other hand, is often called bottom-up because it starts and ends with data marts, negating the need for a physical data warehouse altogether.
Top-Down Approach (Continued) In the top-down approach, the data warehouse holds atomic or transaction data that is extracted from one or more source systems and integrated within a normalized, enterprise data model. From there, the data is summarized, dimensionalized, and distributed to one or more “dependent” data marts. These data marts are “dependent” because they derive all their data from a centralized data warehouse. Sometimes, organizations supplement the data warehouse with a staging area to collect and store source system data before it can be moved and integrated within the data warehouse. A separate staging area is particularly useful if there are numerous source systems, large volumes of data, or small batch windows with which to extract data from source systems. The major benefit of a “top-down” approach is that it provides an integrated, flexible architecture to support downstream analytic data structures. First, this means the data warehouse provides a departure point for all data marts, enforcing consistency and standardization so that organizations can achieve a single version of the truth. Second, the atomic data in the warehouse lets organizations re-purpose that data in any number of ways to meet new and unexpected business needs. For example, a data warehouse can be used to create rich data sets for statisticians, deliver operational reports, or support operational data stores (ODS) and analytic applications. Moreover, users can query the data warehouse if they need cross-functional or enterprise views of the data. On the downside, a top-down approach may take longer and cost more to deploy than other approaches, especially in the initial increments. This is because organizations must create a reasonably detailed enterprise data model as well as the physical infrastructure to house the staging area, data warehouse, and the marts before deploying their applications or reports. 
(Of course, depending on the size of an implementation, organizations can deploy all three “tiers” within a single database.) This initial delay may cause some groups with their own IT budgets to build their own analytic applications. Also, it may not be intuitive or seamless for end users to drill through from a data mart to a data warehouse to find the details behind the summary data in their reports.
Bottom-Up Approach In a bottom-up approach, the goal is to deliver business value by deploying dimensional data marts as quickly as possible. Unlike the top-down approach, these data marts contain all the data—both atomic and summary—that users may want or need, now or in the future. Data is modeled in a star schema design to optimize usability and query performance. Each data mart builds on the next, reusing dimensions and facts so users can query across data marts, if desired, to obtain a single version of the truth as well as both summary and atomic data. The “bottom-up” approach consciously tries to minimize back-office operations, preferring to focus an organization’s effort on developing dimensional designs that meet end-user requirements. The “bottom-up” staging area is non-persistent, and may simply stream flat files from source systems to data marts using the file transfer protocol. In most cases, dimensional data marts are logically stored within a single database. This approach minimizes data redundancy and makes it easier to extend existing dimensional models to accommodate new subject areas.
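The idea of querying across data marts through reused dimensions can be sketched as follows. The two fact tables (one per hypothetical mart) and the shared product dimension are assumptions made for illustration; each fact is summarized on the shared key first, and the results are then joined.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Dimension shared by both hypothetical data marts.
    CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY, product_name TEXT);
    -- Fact table of a sales mart.
    CREATE TABLE sales_fact (product_key INTEGER, units_sold INTEGER);
    -- Fact table of an inventory mart.
    CREATE TABLE inventory_fact (product_key INTEGER, units_on_hand INTEGER);

    INSERT INTO product_dim VALUES (1, 'Pen'), (2, 'Lamp');
    INSERT INTO sales_fact VALUES (1, 10), (1, 5), (2, 3);
    INSERT INTO inventory_fact VALUES (1, 40), (2, 7);
    """
)

# Summarize each mart by the shared key, then combine the two result sets.
combined = conn.execute(
    """
    SELECT p.product_name, s.units_sold, i.units_on_hand
    FROM product_dim p
    JOIN (SELECT product_key, SUM(units_sold) AS units_sold
          FROM sales_fact GROUP BY product_key) s
      ON s.product_key = p.product_key
    JOIN (SELECT product_key, SUM(units_on_hand) AS units_on_hand
          FROM inventory_fact GROUP BY product_key) i
      ON i.product_key = p.product_key
    ORDER BY p.product_key
    """
).fetchall()
print(combined)
```

The query works only because both marts use the same product keys and definitions; without that reuse, the two marts could give answers that cannot be reconciled.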
Bottom-Up Approach This approach is similar to the top-down approach but the emphasis is on the data rather than the business benefit. This is a “proof of concept” type of approach, therefore appealing to IT. Both approaches advocate building a robust enterprise architecture that adapts easily to changing business needs and delivers a single version of the truth. In some cases, the differences are more semantic than substantive in nature. For example, both approaches collect data from source systems into a single data store, from which data marts are populated. But while “top-down” subscribers call this a data warehouse, “bottom-up” adherents often call this a “staging area.”
Hybrid Approach It attempts to capitalize on the speed and user-orientation of the “bottom-up” approach without sacrificing the integration enforced by a data warehouse in a “top down” approach. Pieter Mimno is currently the most vocal proponent of this approach. The hybrid approach recommends spending about two weeks developing an enterprise model before developing the first data mart. The first several data marts are also designed concurrently. After deploying the first few “dependent” data marts, an organization then backfills a data warehouse behind the data marts, instantiating the “fleshed out” version of the enterprise data model. The organization then transfers atomic data from the data marts to the data warehouse and consolidates redundant data feeds, saving the organization time, money, and processing resources. Organizations typically backfill a data warehouse once business users request views of atomic data across multiple data marts. Once the DM is proven to be a good investment, we have to avoid building too many DMs (generally no more than two to three) without implementing an enterprise DW. The main reason for this is to be able to create a common data model for the enterprise, by which smaller DMs can be refreshed, maintained and tuned for performance. It is here that we combine the strength of the top-down approach by reducing the data replication from the operational systems. Furthermore, it is essential to centralize the data extraction and removal process by having one common, shared repository. Data cleansing and transformation, which account for most of the unexpected cost, can now be minimized. At the end, this translates to less maintenance and better performance.
Federated Approach
The federated approach, as defined by its most vocal proponent, Doug Hackney, is not a methodology or architecture per se, but a concession to the natural forces that undermine the best laid plans for deploying a perfect system. A federated approach rationalizes the use of whatever means possible to integrate analytical resources to meet changing needs or business conditions. Hackney says the federated approach is “an architecture of architectures.” It recommends how to integrate a multiplicity of heterogeneous data warehouses, data marts, and packaged applications that companies have already deployed and will continue to implement in spite of the IT group’s best effort to enforce standards and adhere to a specific architecture. The major problem with the federated approach is that it is not well documented. There are only a few columns written on the subject. But perhaps this is enough, as it doesn’t prescribe a specific end-state or approach. Another potential problem is that without a specific architecture in mind, a federated approach can perpetuate the continued decentralization and fragmentation of analytical resources, making it harder to deliver an enterprise view in the end. Also, integrating metadata is a pernicious problem in a heterogeneous, ever-changing environment.
Typical Data Warehouse Components 1. Source Systems Source systems may be in the form of data existing in: production operational systems within an organization; archives; internal files not directly associated with company operational systems, such as individual spreadsheets and workbooks; and external data from outside the company. 2. Data Staging Area The data staging area is analogous to the kitchen of a restaurant, where raw food products are transformed into a fine meal. In a data warehouse, raw operational data is transformed into analytical data that is fit for query and consumption. The staging area is both a storage area and a set of processes commonly known as Extraction, Transformation, and Loading (ETL). It is off limits to business users and is not suitable for querying and reporting. ETL is covered in detail later in the session. The staging area may be an operational data store (ODS) environment, a set of flat files, a series of tables in a relational database server, or proprietary data structures used by data staging tools. A normalized database is acceptable for the data staging area.
Operational Data Store (ODS) An ODS is an integrated database of operational data. It is a subject-oriented, integrated, current-valued, and volatile collection of detailed data that provides a true enterprise view of information. An ODS is updated frequently (synchronously, hourly, daily, or in batches); the frequency of update and the degree of integration depend on the specific business requirements. The detailed current information in the ODS is transactional in nature. An ODS may contain 30 to 60 days of information. An ODS can also provide a stepping-stone to feed operational data into the data warehouse. The data warehouse industry has developed the ODS, usually presenting it as running on a separate platform, to satisfy tactical decision-making requirements. An ODS can deliver operational reports, especially when neither the legacy systems nor the OLTP systems provide adequate operational reports. It relieves the production system of reporting and analysis demands and provides access to current data. An ODS can be a third physical system sitting between the operational systems and the data warehouse, or a partition of the data warehouse itself. An ODS is optional, and it can be transient or persistent. 3. Data Presentation The data presentation area is where the data is stored and optimized for direct querying, reporting, and analysis. It consists of the data warehouse or a series of integrated data marts. For the business community, the presentation area is the data warehouse. Data in the presentation area is strictly in dimensional form. It must contain detailed, atomic data. It must adhere to the data warehouse bus architecture. While the logical design of the presentation area is a dimensional model, the physical implementation may be in an RDBMS or an MDDB. If an RDBMS is used, data is stored in tables referred to as star schemas. If an MDDB is used, data is stored in cubes.
4. Data Access Tools Data access tools are used by the business community to query the data warehouse’s presentation area. They may be ad hoc query tools, data mining tools, or application tools. Users may need to perform simple to complex business modeling, complex drill-down, simple queries on prepared summary information, what-if analysis, trend analysis and forecasting, and data mining using data spanning several time periods. Metadata Metadata is information about the actual data itself. Metadata explains what data exists, where it is located, and how to access it. It comes in many shapes and forms, and it supports the data warehouse’s technical, administrative, and business user groups.
Examining Data Sources The data source systems may comprise data existing in: Production: operational systems running on various operating system platforms, file systems (flat files, VSAM, ISAM), database systems (Oracle, SQL, Sybase), and vertical applications (SAP, PeopleSoft, Baan, Oracle Financials). Archives: historical data useful for the first-time load. It may require unique transformations, and clear details of the changes must be maintained in metadata. Internal: files not directly associated with company operational systems, such as individual spreadsheets and workbooks. Internal data must be transformed, documented in metadata, and mapped between the source and target databases. External: data collected from outside the company. External data is important for comparing the performance of your business against others. There are many sources of external data: periodicals and reports; external syndicated data feeds; competitive analysis information; newspapers; purchased marketing, competitive, and customer-related data; and free data from the Web.
Production Data Production data may come from a multitude of different sources: Operating system platforms File systems (flat files, VSAM, and ISAM) Database systems, for example, Oracle, DB2, dBase, Informix, and so on Vertical applications, such as Oracle Financials, SAP, PeopleSoft, Baan, and Dun and Bradstreet (D&B) (D&B is the leading provider of global business information, tools, and insight, that enables customers to decide with confidence and provide them with quality information whenever and wherever they need it)
Archive Data Archive data may be useful to the enterprise in supplying historical data. Historical data is needed if analysis over long periods of time is to be achieved. Archive data is not used consistently as a source for the warehouse; for example, it would not be used for regular data refreshes. However, for the initial implementation of a data warehouse (and for the first-time load), archived data is an important source of historical data. You need to consider this carefully when planning the data warehouse. How much historical data do you have available for the data warehouse? How much effort is necessary to transform it into an acceptable format? The archive data may need some careful and unique transformations, and clear details of the changes must be maintained in metadata.
Internal Data Internal data may be information prepared by planning, sales, or marketing organizations that contains data such as budgets, forecasts, or sales quotas. The data contains figures (numbers) that are used across the enterprise for comparison purposes. The data is maintained using software packages such as spreadsheets and word processors and uploaded into the warehouse. Internal data is treated like any other source system data. It must be transformed, documented in metadata, and mapped between the source and target databases.
External Data External data is important if you want to compare the performance of your business against others. There are many sources for external data: Periodicals and reports External syndicated data feeds Competitive analysis information Newspapers Purchased marketing, competitive, and customer related data Free data from the Web You must consider the following issues with external data: Frequency: There is no real pattern like that of internal data. Constant monitoring is required to determine when it is available. Format: The data may be different in format than internal data, and the granularity of the data may be an issue. In order to make it useful to the warehouse a certain amount of reformatting may be required. In addition, you may find that external data, particularly that available on the Web, comes with digital audio data, picture image data, and digital video data. These present an interesting challenge for storage and speed of access. Predictability: External data is not predictable; it can come from any source at any time, in any format, on any medium.
Extraction, Transformation, and Loading (ETL) These processes are fundamental to the creation of quality information in the data warehouse. You take data from source systems; clean, verify, validate, and convert it into a consistent state; then move it into the warehouse. Extraction: The process of selecting specific operational attributes from the various operational systems. Transformation: The process of integrating, verifying, validating, cleaning, and time stamping the selected data into a consistent and uniform format for the target databases. Rejected data is returned to the data owner for correction and reprocessing. Loading: The process of moving data from an intermediate storage area into the target warehouse database.
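The three steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the source rows, the field names, the table name sales_fact, and the rule that a non-numeric amount is rejected are all hypothetical, and an in-memory SQLite database stands in for the warehouse.

```python
import sqlite3
from datetime import datetime, timezone

# Hypothetical operational rows; in practice these come from source systems.
SOURCE_ROWS = [
    {"order_id": 1, "amount": "100.50", "country": "in"},
    {"order_id": 2, "amount": "n/a", "country": "US"},  # fails validation
    {"order_id": 3, "amount": "75", "country": " us "},
]


def extract(rows):
    """Extraction: select only the attributes the warehouse needs."""
    return [(r["order_id"], r["amount"], r["country"]) for r in rows]


def transform(records, load_time):
    """Transformation: validate, clean, and time-stamp; collect rejects."""
    clean, rejected = [], []
    for order_id, amount, country in records:
        try:
            value = float(amount)
        except ValueError:
            rejected.append(order_id)  # to be returned to the data owner
            continue
        clean.append((order_id, value, country.strip().upper(), load_time))
    return clean, rejected


def load(conn, records):
    """Loading: move the transformed rows into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales_fact "
        "(order_id INTEGER, amount REAL, country TEXT, loaded_at TEXT)"
    )
    conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?, ?)", records)
    conn.commit()


conn = sqlite3.connect(":memory:")  # stand-in for the warehouse database
stamp = datetime.now(timezone.utc).isoformat()
clean_rows, rejected_ids = transform(extract(SOURCE_ROWS), stamp)
load(conn, clean_rows)
loaded = conn.execute(
    "SELECT order_id, amount, country FROM sales_fact ORDER BY order_id"
).fetchall()
```

Keeping the rejected keys in a separate list mirrors the point that rejected data goes back to the data owner instead of being silently dropped.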
Staging Models Out of the possible staging models listed in the slide, the model you choose depends on operational and warehouse requirements, system availability, connectivity bandwidth, gateway access, and volume of data to be moved or transformed.
Remote Staging Model In this model, the staging area is not part of the Operational environment. You may choose to extract the data from the operational environment and transport it into the warehouse environment for transformation processing. The other option is to have a separate data staging area, which is neither a part of the operational system nor a part of the warehouse environment, which eliminates the negative impact on the performance of the warehouse.
On-site Staging Model Alternatively, you may choose to perform the cleansing, transformation, and summarization processes locally in the operational environment and then load to the warehouse. This model may conflict with the day-to-day working of the operational system. If chosen, this model’s process should be executed when the operational system is idle or less heavily used.
Extraction Methods The extraction method that you choose is highly dependent on the source system and the business needs in the target data warehouse environment. In addition, the estimated amount of data to be extracted and the stage in the ETL process (initial load or maintenance of data) may also affect the decision of how to extract, from both a logical and a physical perspective. Logical Extraction Methods Full Extraction: The data is extracted completely from the source system. Since this extraction reflects all the data currently available on the source system, there is no need to keep track of changes to the data source since the last successful extraction. The source data is provided as-is, and no additional logical information (for example, timestamps) is necessary on the source side. Incremental Extraction: At a specific point in time, only the data that has changed since a well-defined past event is extracted. This event may be the time of the last extraction or a more complex business event, such as the last booking day of a fiscal period. To identify this delta, the source must offer a way to identify all information that has changed since that event.
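The difference between the two logical methods can be illustrated with a short Python sketch. It assumes a hypothetical orders table that carries an updated_at timestamp column; without such a column (or an equivalent change log), incremental extraction as shown here would not be possible.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for a source system
conn.execute(
    "CREATE TABLE orders (order_id INTEGER, amount REAL, updated_at TEXT)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        (1, 10.0, "2002-03-30 09:00:00"),
        (2, 20.0, "2002-04-01 10:30:00"),
        (3, 30.0, "2002-04-02 08:15:00"),
    ],
)


def full_extract(conn):
    """Full extraction: read everything; no change tracking is needed."""
    return conn.execute("SELECT * FROM orders ORDER BY order_id").fetchall()


def incremental_extract(conn, last_extract_time):
    """Incremental extraction: only rows changed since the last extract."""
    return conn.execute(
        "SELECT * FROM orders WHERE updated_at > ? ORDER BY order_id",
        (last_extract_time,),
    ).fetchall()


all_rows = full_extract(conn)
delta_rows = incremental_extract(conn, "2002-03-31 23:59:59")
```

The timestamps are stored as ISO-style text so that a plain string comparison orders them correctly.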
Physical Extraction Methods Depending on the chosen logical extraction method and the capabilities and restrictions on the source side, the data can be physically extracted by two mechanisms: online from the source system, or from an offline structure. Such an offline structure might already exist, or it might be generated by an extraction routine. Online Extraction: The data is extracted directly from the source system. The extraction process can connect directly to the source system to access the source tables, or to an intermediate system that stores the data in a preconfigured manner (for example, snapshot logs or change tables). Note that the intermediate system is not necessarily physically different from the source system. Offline Extraction: The data is not extracted directly from the source system but is staged explicitly outside the original source system. The data may already have one of the following structures: flat files, redo and archive logs, (Oracle-specific) dump files, and so on.
Extraction Techniques You can extract data from different source systems to the warehouse in different ways: programmatically, using languages such as COBOL, C, C++, PL/SQL, or Java; using a gateway to access data sources (this method is acceptable only for small amounts of data; otherwise, the network traffic becomes unacceptably high); using in-house developed tools; or, though it is expensive, using a vendor’s data extraction tool. Note: The Oracle Open Gateway product set (previously known as SQL*Connect) is used to access data from non-Oracle databases and even non-relational sources like flat files.
Mapping Data Once you have determined your business subjects for the warehouse, you need to determine the required attributes from the source systems. On an attribute-by-attribute basis you must determine how the source data maps into the data warehouse, and what, if any, transformation rules to apply. This is known as mapping. There are mapping tools available. Mapping information should be maintained in the metadata that is server (RDBMS) resident, for ease of access, maintenance, and clarity.
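An attribute-by-attribute mapping can be represented as a small lookup structure. The sketch below is illustrative only: the source field names (CUST_NO, CUST_NM, CTRY), the target column names, and the transformation rules are hypothetical, and a real project would hold this mapping in server-resident metadata instead of in code.

```python
# Hypothetical attribute-level mapping: target column -> (source field, rule).
MAPPING = {
    "customer_id": ("CUST_NO", int),
    "customer_name": ("CUST_NM", lambda s: s.strip().title()),
    "country_code": ("CTRY", lambda s: s.strip().upper()),
}


def apply_mapping(source_row, mapping):
    """Build one warehouse row by applying each attribute's rule."""
    return {
        target: rule(source_row[source])
        for target, (source, rule) in mapping.items()
    }


source_row = {"CUST_NO": "109908", "CUST_NM": "  acme inc. ", "CTRY": "us"}
warehouse_row = apply_mapping(source_row, MAPPING)
```

Because each target column names its source field and rule in one place, the same structure can double as documentation of the mapping.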
Transformation Routines The transformation process uses many transformation routines to eliminate the inconsistencies and anomalies from the extracted data. These routines are designed to perform the following tasks: cleaning the data, also referred to as data cleansing or scrubbing; adding an element of time (timestamp) to the data, if it does not already exist; translating the formats of external and purchased data into something meaningful for the warehouse; merging rows or records in files; and integrating all the data into files and formats to be loaded into the warehouse. Transformation can be performed before the data is loaded into the warehouse, or in parallel (on larger databases, there is not enough time to perform this process as a single-threaded process). The transformation process should be self-documenting, generate summary statistics, and process exceptions. Note: The terms scrubbing, cleaning, cleansing, and data re-engineering are used interchangeably.
Transforming Data: Problems and Solutions The factors listed in the slide can potentially cause problems in the transformation process. Each of these problems, and the probable solutions are discussed in the following pages.
Data Anomalies Reasons for Data Anomalies One of the causes of inconsistencies within internal data is that in-house system development takes place over many years, often with different software and development standards for each implementation. There may be no consistent policy for the software used in the corporate environment. Systems may be upgraded or changed over the years. Each system may represent data in different ways. Source Data Anomalies Many potential problems can exist with source data: No unique key for individual records Anomalies within data fields, such as differences between naming and coding (data type) conventions Differences in the interpreted meaning of the data by different user groups Spelling errors and other textual inconsistencies (this is particularly relevant in the area of customer names and addresses)
Transforming Data: Problems and Solutions Multipart Keys Problem Many older operational systems used record key structures that had a built-in meaning. Solution To allow for decision support reporting, these keys must be broken down into atomic values. In the example, the key contains four atomic values. Key Code: 12M65431345, where: 12 is the country code; M is the sales territory; 654313 is the product code; 45 is the salesperson.
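The example key can be broken down with fixed-width slicing. The sketch below assumes the widths implied by the example (2, 1, 6, and 2 characters); the function and field names are hypothetical.

```python
from typing import NamedTuple


class KeyParts(NamedTuple):
    country_code: str
    sales_territory: str
    product_code: str
    salesperson: str


def split_key(key: str) -> KeyParts:
    """Split an 11-character key using the fixed widths 2, 1, 6, 2."""
    if len(key) != 11:
        raise ValueError(f"expected 11 characters, got {len(key)}")
    return KeyParts(key[0:2], key[2:3], key[3:9], key[9:11])


parts = split_key("12M65431345")
```

Each atomic value can then be loaded into its own warehouse column and used for grouping or filtering in decision support queries.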
Multiple Local Standards Problem This is particularly relevant for values entered in different countries. For example, some countries use imperial measurements and others metric; currencies, date formats, and character sets differ; and numeric precision may vary. Currency values are often stored in two formats: a local currency, such as sterling, French francs, or Australian dollars, and a global currency, such as U.S. dollars. Solution Typically, you use tools or filters to preprocess this data into a suitable format for the database, with the logic needed to interpret and reconstitute a value. You might employ steps that are similar to those identified for multiple encoding. You may also consider revising source applications to eliminate these inconsistencies early on.
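A preprocessing filter of this kind might look like the following Python sketch. The per-source date formats and the choice of ISO dates and kilograms as warehouse standards are assumptions made for illustration; real sources have to be profiled before such rules are fixed.

```python
from datetime import datetime

# Assumed per-source date formats (hypothetical source labels).
DATE_FORMATS = {"US": "%m/%d/%Y", "UK": "%d/%m/%Y", "ISO": "%Y-%m-%d"}

KG_PER_POUND = 0.45359237  # exact by international definition


def normalize_date(text, source):
    """Convert a source-specific date string to ISO format (YYYY-MM-DD)."""
    return datetime.strptime(text, DATE_FORMATS[source]).date().isoformat()


def to_kilograms(value, unit):
    """Convert a weight to kilograms, the assumed warehouse standard."""
    if unit == "kg":
        return value
    if unit == "lb":
        return value * KG_PER_POUND
    raise ValueError(f"unknown unit: {unit}")


us_date = normalize_date("04/03/2002", "US")  # April 3 in the US format
uk_date = normalize_date("04/03/2002", "UK")  # 4 March in the UK format
```

The same input string yields two different dates depending on its origin, which is why the source of each value has to travel with the data until it is normalized.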
Multiple Files Problem The source of information may be one file for one condition, and a set of files for another. Logic (normally procedural) must be in place to detect the right source. The complexity of integrating data increases sharply with the number of data sources being integrated. For example, if you are integrating data from two sources, there is a single point of integration where conflicts must be resolved. Integrate from three sources, and there are three points of conflict. Four sources provide six conflict points. In general, n sources give n(n-1)/2 pairwise conflict points, so the problem grows quadratically. Solution This is a complex problem that requires the use of tools or well-documented transformation mechanisms. Try not to integrate all the sources in the first instance. Start with two or three and then enhance the program to incorporate more sources. Build on your learning experiences.
Missing Values Problem Null and missing values are always an issue. NULL values may be valid entries where NULLs are allowed; otherwise, NULLs indicate missing values. Solution You must examine each occurrence of the condition to determine validity and decide whether these occurrences must be transformed; that is, identify whether a NULL is valid or invalid (missing data). You may choose to: Ignore the missing data; if the volume of records is relatively small, it may have little impact overall. Wait to extract the data until you are sure that missing values have been entered in the operational system. Mark rows when extracted, so that on the next extract you can select only those rows not previously extracted; this involves the overhead of SELECT and UPDATE, and if the extracted data forms the basis of a summary table, the summary table must be re-created. Extract data only when it is time-stamped.
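Telling valid NULLs from missing data needs a rule for each column. The sketch below assumes a hypothetical list of columns in which NULL is a legitimate value and reports every other NULL as missing; the column names and sample rows are invented for illustration.

```python
# Hypothetical rule set: columns in which NULL is a legitimate value.
NULLABLE_COLUMNS = {"middle_name", "fax"}


def find_missing(rows, nullable=NULLABLE_COLUMNS):
    """Return (row_index, column) pairs where None means missing data."""
    return [
        (i, col)
        for i, row in enumerate(rows)
        for col, value in row.items()
        if value is None and col not in nullable
    ]


rows = [
    {"customer_id": 1, "middle_name": None, "fax": None, "city": "Pune"},
    {"customer_id": 2, "middle_name": "K", "fax": None, "city": None},
]
missing = find_missing(rows)
```

The resulting list can drive whichever option is chosen: ignoring the rows, holding them back until the operational system is corrected, or reporting them to the data owner.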
Duplicate Values Problem You must eliminate duplicate values, which invariably exist. This can be time-consuming, although it is a simple task to perform. Solution You can use standard SQL self-join techniques or RDBMS constraint utilities to eliminate duplicates.
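A self-join that removes duplicates might look like the following sketch, run here against an in-memory SQLite database from Python. The customer table and its contents are hypothetical, and the use of SQLite's implicit rowid to decide which copy to keep is an assumption of this example; other databases need a different tie-breaker.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (customer_id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO customer VALUES (?, ?)",
    [(1, "Acme Inc."), (2, "Globex"), (1, "Acme Inc."), (1, "Acme Inc.")],
)

# Self-join: b is a duplicate of a when all columns match and b was
# inserted later (higher rowid). Delete every such b, keeping the first copy.
conn.execute(
    """
    DELETE FROM customer
    WHERE rowid IN (
        SELECT b.rowid
        FROM customer AS a
        JOIN customer AS b
          ON a.customer_id = b.customer_id
         AND a.name = b.name
         AND a.rowid < b.rowid
    )
    """
)
remaining = conn.execute(
    "SELECT customer_id, name FROM customer ORDER BY customer_id"
).fetchall()
```

This only catches exact duplicates; near-duplicates such as spelling variants of the same name need the cleansing techniques discussed for names and addresses.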
Element Names Problem Individual attributes, columns, or fields may vary in their naming conventions from one source to another. These variations need to be eliminated to ensure that one naming convention is applied to the value in the warehouse. If you are employing independent data marts, then you should ensure that the ETL solution is mirrored; should you plan to employ the data marts dependently in the future, they will all refer to the same object. Solution You must obtain agreement from all relevant user groups on renaming conventions, and rename the elements accordingly. Document the changes in metadata. The programs you use determine the solution. For example, if you are using the SQL CREATE TABLE AS SELECT (CTAS) statement, the new column name is specified in that statement.
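Renaming during a CTAS statement can be sketched as follows, using SQLite from Python as a stand-in for the warehouse RDBMS. The source and target table names and the column names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE src_orders (CUST_NO INTEGER, ORD_AMT REAL)")
conn.execute("INSERT INTO src_orders VALUES (109908, 250.0)")

# CTAS: the column aliases become the column names of the new table.
conn.execute(
    "CREATE TABLE wh_orders AS "
    "SELECT CUST_NO AS customer_id, ORD_AMT AS order_amount FROM src_orders"
)

# PRAGMA table_info returns one row per column; index 1 is the column name.
columns = [row[1] for row in conn.execute("PRAGMA table_info(wh_orders)")]
rows = conn.execute("SELECT customer_id, order_amount FROM wh_orders").fetchall()
```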
Element Meaning Problem The meaning of an element is often interpreted differently by different user groups. Variations in naming conventions typically drive this misinterpretation. Solution It is a difficult problem, often political, but you must ensure that the meaning is clear. Documenting the agreed meaning in metadata helps to solve this problem.
Input Format Problem Input formats vary considerably. For example, one entry may accept alphanumeric data, so the format may be “123-73”. Another entry may accept numeric data only, so the format may be “12373”. You may also need to convert from ASCII to EBCDIC, or even convert complex character sets such as Hebrew, Arabic, or Japanese. Solution First, ensure that you document the original and the resulting formats. Your program (or tool) must then convert those data types either dynamically or through a series of transforms into one acceptable format. You can use Oracle SQL*Loader to perform certain transformations, such as EBCDIC to ASCII conversions and assigning values to default or NULL values.
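The two conversions mentioned above can be sketched in Python. The digits-only rule for codes and the choice of cp037 (the US/Canada EBCDIC code page available in Python's standard codecs) are assumptions for this example; the correct code page depends on the source system.

```python
import re


def normalize_code(value: str) -> str:
    """Keep digits only, so '123-73' and '12373' compare equal."""
    return re.sub(r"\D", "", value)


def ebcdic_to_text(raw: bytes) -> str:
    """Decode EBCDIC bytes; cp037 is an assumed code page."""
    return raw.decode("cp037")


same = normalize_code("123-73") == normalize_code("12373")

# Round trip: encode a sample string to EBCDIC, then decode it again.
ebcdic_bytes = "ACME 123".encode("cp037")
round_trip = ebcdic_to_text(ebcdic_bytes)
```

The EBCDIC bytes differ from their ASCII counterparts (for example, the letter A is 0xC1 instead of 0x41), which is why the conversion must happen before any comparison or load.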
Referential Integrity Problem If the constraints at the application or database level have in the past been less than accurate, child and parent record relationships can suffer; orphaned records can exist. You must understand data relationships built into legacy systems. The biggest problem encountered here is that they are often undocumented. You must gain the support of users and technicians to help you with analysis and documentation of the source data. Solution This cleaning task is time-consuming and requires business experience to resolve the inconsistencies. You can use SQL anti-join query techniques, server constraint utilities, or dedicated tools to eliminate these inconsistencies.
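An anti-join that lists orphaned child records might look like the sketch below, again using SQLite from Python. The customer and orders tables and their contents are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (customer_id INTEGER PRIMARY KEY)")
conn.execute("CREATE TABLE orders (order_id INTEGER, customer_id INTEGER)")
conn.executemany("INSERT INTO customer VALUES (?)", [(1,), (2,)])
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(10, 1), (11, 3), (12, 2), (13, 9)],
)

# Anti-join: keep only the orders for which no matching customer exists.
orphans = conn.execute(
    """
    SELECT o.order_id, o.customer_id
    FROM orders AS o
    LEFT JOIN customer AS c ON c.customer_id = o.customer_id
    WHERE c.customer_id IS NULL
    ORDER BY o.order_id
    """
).fetchall()
```

The query only finds the orphans; deciding whether to repair the parent, reassign the child, or reject the row still requires business knowledge.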
Name and Address Problem One of the largest areas of concern, with regard to data quality, is how name and address information is held, and how to transform it. Name and address information has historically suffered from a lack of legacy standards. This information has been stored in many different formats, sometimes dependent upon the software or even the data processing center used. Some of the following data inconsistencies may appear: No unique key Missing data values (NULLs) Personal and commercial names mixed Different addresses for same member Different names and spelling for same member Many names on one line One name on two lines The data may be in a single field of no fixed format (example shown in the slide) Each component of an address may be in a specific field (example shown in the slide)
Transformation Timing and Location You need to consider carefully when and where you perform transformation. You must perform transformation before the data is loaded into the warehouse or in parallel; on larger databases, there is not enough time to perform this process as a single-threaded process. Consider the different places and points in time where transformation may take place. Transformation Points On the operational platform: This approach transforms the data on the operational platform, where the source data resides. The drawback of this approach is that the transformation operation conflicts with the day-to-day working of the operational system. If it is chosen, the process should be executed when the operational system is idle or less utilized. The impact of this approach is so great that it is very unlikely to be employed. In a separate staging area: This approach transforms data in a separate computing environment, the staging area, where summary data may also be created. This is a common approach because it does not affect either the operational or the warehouse environment.
Adding a Date Stamp: Fact Tables and Dimensions Fact Table Data Assume that you need to add the next set of records from the source systems to your fact table. You need to determine which records are to be moved into the fact table. You have added data for March 2002. Now you need to add data for April 2002. You need to find a mechanism to stamp records so that you pick up only April 2002 records for the next refresh. You might choose from a number of techniques to time-stamp data: Code application or database triggers at the operational level, which can then be extracted using date selection criteria Perform a comparison of tables, original and new, to identify differences Maintain a table containing copies of changed records to be loaded You must decide which are the best techniques for you to use according to your current system implementations. Dimension Table Data The techniques discussed for fact tables can also be used with dimensions. Dimensions change and you can employ one of the different techniques to trap changes.
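Of the techniques listed, the comparison of an original and a new table is the easiest to show in isolation. The sketch below assumes two hypothetical snapshots held as Python dictionaries keyed by primary key; the data values are invented for illustration.

```python
def diff_snapshots(old, new):
    """Compare two snapshots keyed by primary key and report the changes."""
    inserted = {k: v for k, v in new.items() if k not in old}
    updated = {k: v for k, v in new.items() if k in old and old[k] != v}
    deleted = [k for k in old if k not in new]
    return inserted, updated, deleted


# Hypothetical dimension snapshots taken at the end of March and April.
march = {1: ("Acme Inc.", "Pune"), 2: ("Globex", "Mumbai"), 3: ("Initech", "Delhi")}
april = {1: ("Acme Inc.", "Pune"), 2: ("Globex", "Chennai"), 4: ("Umbrella", "Goa")}

inserted, updated, deleted = diff_snapshots(march, april)
```

Comparing full snapshots needs no changes to the operational system, but it reads every row on each refresh, so it suits smaller tables such as dimensions better than large fact tables.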
Summarizing Data Creating summary data is essential for a data warehouse to perform well. Here summarization is classified under transformation because you are changing (transforming) the way the data exists in the source system, to be used with the data warehouse. You can summarize the data: At the time of extraction in batch routines: This reduces the amount of work performed by the data warehouse server, as all the effort is concentrated on the source systems. However, summarizing at this time increases: the complexity and time taken to perform the extract, the number of files created, the number of load routines, and the complexity of the scheduling process After the data is loaded into the warehouse server: In this process the fact data is queried, summarized, and placed into the requisite summary fact table. This method reduces the complexity and time taken for the extract tasks. However, it places all the CPU and I/O intensive work on the warehouse server, thus increasing the time that the warehouse is unavailable to the users.
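The second option, summarizing after the load, can be sketched as a query over the fact table that populates a summary table. The table names, columns, and monthly grain below are hypothetical, and SQLite stands in for the warehouse server.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales_fact (sale_date TEXT, product TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO sales_fact VALUES (?, ?, ?)",
    [
        ("2002-03-30", "A", 10.0),
        ("2002-04-01", "A", 20.0),
        ("2002-04-15", "A", 5.0),
        ("2002-04-02", "B", 7.5),
    ],
)

# Query the detailed fact data, summarize it by month and product,
# and place the result in a summary fact table.
conn.execute(
    """
    CREATE TABLE sales_month_summary AS
    SELECT substr(sale_date, 1, 7) AS sale_month,
           product,
           SUM(amount) AS total_amount,
           COUNT(*) AS sale_count
    FROM sales_fact
    GROUP BY substr(sale_date, 1, 7), product
    """
)
summary = conn.execute(
    "SELECT * FROM sales_month_summary ORDER BY sale_month, product"
).fetchall()
```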
Loading Data into the Warehouse The acronym ETL for “Extraction, Transformation, and Loading” is perhaps too simplistic, because it omits the transportation phase and implies that each of these processes is essentially distinct. This is not always true; sometimes the entire process, including data loading, is referred to as ETL. The transportation process involves moving the data from source data stores or an intermediate staging area to the data warehouse database. The loading process loads the data into the target warehouse database on the target system server. Transportation is often one of the simpler portions of the ETL process, and is often integrated with the loading process. These processes comprise a series of actions, such as moving the data and loading data into tables. There may also be some processing of objects after the load, often referred to as post-load processing. Moving and loading the data can be a time-consuming task, depending upon the volumes of data, the hardware, the connectivity setup, and whether parallel operations are in place. The time period within which the warehouse system can perform the load is called the load window. Loading should be scheduled and prioritized. You should also ensure that the loading process is automated as much as possible.
Load Window Requirements The load window is the amount of time available to extract, transform, load, and post-load process the data, and to make the data warehouse available to the user. The load performs many sequential tasks that take time to execute. You must ensure that every event that occurs during the load window is planned, tested, proven, and constantly monitored. Poor planning will extend the load time and prevent users from accessing the data when it is needed. Careful planning, defining, testing, and scheduling are critical.
Planning the Load Window Load Window Strategy The load time depends on a number of factors, such as data volumes, network capacity, and load utility capabilities. Do not forget that the aim is to ensure the currency of data for the users, who require access to the data for analysis. Determining the Load Window To work out an effective load window strategy, define the user access requirements first, and then work the load schedule backward from that point. Once the user access time is defined, you can establish the load cycles. Some of the processes overlap to enable all processes to run within the window. More realistically, almost 24-hour access is required, which means the load window is significantly smaller than the example shown here. In that event, you need to consider how to process the refresh and keep users presented with current, realistic data. This is where you can use partitioning strategies.
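Working the schedule backward from the user access time is simple arithmetic. The sketch below assumes purely illustrative step durations and, for simplicity, that the steps run one after another with no overlap; overlapping steps would allow a later start.

```python
from datetime import datetime, timedelta

# Hypothetical step durations; real values come from testing the load.
STEP_DURATIONS = {
    "extract": timedelta(hours=1),
    "transform": timedelta(hours=2),
    "load": timedelta(hours=1, minutes=30),
    "post_load": timedelta(minutes=30),
}


def latest_start(access_time, step_durations):
    """Latest time the load can start if the steps run sequentially."""
    return access_time - sum(step_durations.values(), timedelta())


# Users need the warehouse at 07:00; work backward from that point.
user_access = datetime(2002, 4, 2, 7, 0)
start_by = latest_start(user_access, STEP_DURATIONS)
```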
Initial Load and Refresh Initial Load: The initial load (also called first-time load) is a single event that occurs before implementation. It populates the data warehouse database with as much data as needed or available. The first-time load moves data in the same way as the regular refresh. However, the task is more complex than the refresh cycle, for these reasons: Data volumes may be very large (for example, a company decides to load the last five years of data, which may comprise millions of rows; the time taken to load the data may be days rather than hours). All fact tables, all dimension tables, and any other ancillary tables that you may have created, such as reference tables, must be populated. With all the issues surrounding the initial load, it is not a task to be taken lightly. Refresh: After the first-time load, the refresh is performed on a regular basis according to a cycle determined by the refresh policy. The cycle may be daily, weekly, monthly, quarterly, or any other business period. The refresh is a simpler task than the first-time load for these reasons: There is less fact data to load; you are moving a new snapshot of data, not all fact data, into the data warehouse. There is no complete set of dimension data to load (unless your model has changed, which would be an exception). There may, however, be some dimensional data changes to incorporate, and this can be complex.
Data Refresh Models: Extract Processing Environment First, to ensure that you understand how the warehouse data presentation differs from nonwarehouse data presentation, consider how up-to-date data is presented to users in two different decision support environments: a simple extract processing environment and a data warehouse environment. Extract Processing Environment A snapshot of operational data is taken at regular time intervals: T1, T2, and T3. At each interval a new snapshot of the database is created and presented to the user; the old snapshot is purged.
Data Refresh Models: Warehouse Processing Environment Warehouse Environment An initial snapshot is taken and the database is loaded with data. At regular time intervals, T1, T2, and T3, a delta database or file is created and the warehouse is refreshed. A delta contains only the changes made to operational data that need to be reflected in the data warehouse. The warehouse fact data is refreshed according to the refresh cycle that is determined by user requirements analysis. The warehouse dimension data is updated to reflect the current state of the business, only when changes are detected in the source systems. The older snapshot of data is not removed, ensuring that the warehouse contains the historical data that is needed for analysis. The oldest snapshots are archived or purged only when the data is not required any longer.
Post-Processing of Loaded Data You have seen how to extract data to an intermediate file store or staging area, where it is transformed into acceptable warehouse data and then loaded to the warehouse server. You have also seen how the ETL process is used for the first-time load, which requires all data to be loaded once, and for refreshing, which requires only changed data to be loaded. You now need to consider the different tasks that might take place once the data is loaded. Various terms are used for these tasks; this course uses the term post-processing. The post-processing tasks are not definitive; you may or may not have to perform them, depending on the volumes of data moved, the complexity of transformations, and the loading mechanism.
Unique Indexes If the index you are creating is a unique index (that forces unique values in key columns without any duplicates), then it is usual to load the data with the database constraints disabled, and then enable the constraints after the load process. Then you build the index, which may find duplicate values and fail. Ensure that the action catches the errors so that you can correct and reindex.
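The load-then-index sequence, including catching the failure, can be sketched as follows. SQLite from Python stands in for the warehouse database, and the table, index name, and data are hypothetical; the duplicate is removed by hand here only to show the correct-and-reindex step.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Load first, with no unique constraint in place.
conn.execute("CREATE TABLE customer_dim (customer_key INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO customer_dim VALUES (?, ?)",
    [(100, "Acme Inc."), (101, "Globex"), (100, "Acme Incorporated")],
)


def build_unique_index(conn):
    """Try to build the unique index; on failure, report duplicate keys."""
    try:
        conn.execute(
            "CREATE UNIQUE INDEX ux_customer_key "
            "ON customer_dim (customer_key)"
        )
        return []
    except sqlite3.DatabaseError:
        return conn.execute(
            "SELECT customer_key, COUNT(*) FROM customer_dim "
            "GROUP BY customer_key HAVING COUNT(*) > 1"
        ).fetchall()


duplicates = build_unique_index(conn)  # the first attempt fails

# Correct the data (here: drop the later duplicate), then re-index.
conn.execute("DELETE FROM customer_dim WHERE name = 'Acme Incorporated'")
after_fix = build_unique_index(conn)
```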
Creating Derived Keys A derived key (sometimes referred to as a generalized, artificial, synthetic, or warehouse key) may be used to guarantee that every row in the table is unique. Concatenate Operational Key with a Number For example, if a customer record key value contains six digits, such as 109908, the derived key may be 10990801. The last n digits are a sequential number generated automatically. The advantage of this method is that it is relatively easy to maintain and to set up the necessary programs to manage number allocation. The disadvantage is that the key may become long and cumbersome. There is no clean key value for retrieval of a record, unless you have another copy of the key. For example, if the operational Customer_Id is 109908 but the warehouse key is now 10990801, then extracting information about that customer from the warehouse using 109908 is impossible, unless the old value has been retained in another field, such as:
Customer_key   Customer_id   Customer_Name
10990801       109908        Acme Inc.
Creating Derived Keys (continued) Assign a Number from a List You can also assign the key sequentially from a simple list of numbers. For example, operational Customer_Id is 109908, the warehouse key can be 100, the key corresponding to the Customer_Id 109909 can be 101 and so on. The advantage of this method is that it does not occupy much space. The disadvantage of this method is that the keys therefore have no semantic or intuitive meaning. But this is industry preferred methodology. Irrespective of using any of the above methods, you must ensure the metadata is updated to register the latest key allocations. The method you choose to create derived keys, depends upon the extract methods, the tools available, and the hardware and network capability and availability.
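The two key-derivation methods can be sketched in a few lines. This is a minimal illustration, not a prescribed implementation: the names concatenated_key and KeyAllocator are hypothetical, and the in-memory key_map stands in for the metadata that registers key allocations.

```python
from itertools import count

def concatenated_key(operational_id, sequence, width=2):
    """Method 1: append an n-digit sequence number to the operational key."""
    return f"{operational_id}{sequence:0{width}d}"

class KeyAllocator:
    """Method 2: hand out keys sequentially from a simple list of numbers."""

    def __init__(self, start=100):
        self._next_key = count(start)
        self.key_map = {}  # operational id -> warehouse key (stand-in for metadata)

    def warehouse_key(self, operational_id):
        # Allocate a new key only the first time an operational id is seen.
        if operational_id not in self.key_map:
            self.key_map[operational_id] = next(self._next_key)
        return self.key_map[operational_id]

allocator = KeyAllocator(start=100)
first = allocator.warehouse_key(109908)
second = allocator.warehouse_key(109909)
repeat = allocator.warehouse_key(109908)
print(concatenated_key(109908, 1), first, second, repeat)
```

Keeping the operational id alongside the warehouse key, as key_map does here, is what allows a record to be retrieved by its original Customer_Id.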
Metadata Users In the warehouse, metadata is employed directly or indirectly by all warehouse users for many different tasks. End Users The decision support analyst (or user) uses metadata directly. The user does not have the high degree of knowledge that the IT professional has, and metadata is the map to the warehouse information. One measure of a successful warehouse is the strength and ease of use of end-user metadata. Developers and IT Professionals For the developer (or an IT professional), metadata contains information on the location, structure, and meaning of data, information on mappings, and a guide to the algorithms used for summarization between detail and summary data.
Metadata Documentation Approaches Regardless of the tools that you use to create a data warehouse, metadata must play a central role in the design, development, and ongoing evolution of the warehouse. Automated: Data modeling tools record metadata information as you perform modeling activities with the tool. ETL tools can also generate metadata. These tools also use the metadata repository as a resource to generate build and load scripts for the warehouse. Each of the tools used in your warehouse environment might generate its own set of metadata. The management and integration of different metadata repositories is one of the biggest challenges for the warehouse administrator. Manual: The manual approach provides flexibility; however, it is severely hampered by the labor-intensive nature of the ongoing maintenance of metadata content.
Data Warehouse Dimensional Model Phases: Identify the ‘Business Process’ Determine the ‘Grain’ Identify the ‘Facts’ Identify the ‘Dimensions’
Business Requirements Drive the Design Process The entire scope of the data warehouse initiative must be driven by business requirements. Business requirements determine: What data must be available in the warehouse How data is to be organized How often data is updated End-user application templates Maintenance and growth Primary Input The business requirements are the primary input to the design of the data warehouse. Information requirements as defined by the business people—the end users—will lay the foundation for the data warehouse content. Secondary Input Requirements gathered from user interviews can be clarified or augmented with existing source data information and further research regarding how data is currently used. Other sources may be: Legacy systems metadata, Source ERD from OLTP systems, Research
Performing Strategic Analysis Performed at the enterprise level, strategic analysis identifies, prioritizes, and selects the major business processes (also called business events or subject areas) that are most important to the overall corporate strategy. Strategic analysis includes the following steps: Identify the business processes that are most important to the overall corporate strategy. Examples of business processes are orders, invoices, shipments, inventory, sales, account administration, and the general ledger. Understand the business processes by drilling down on the dimensions that characterize each business process. The creation of a business process matrix can aid in this effort. Prioritize and select the business process to implement in the warehouse, based on which one will provide the quickest and largest return on investment (ROI).
Using a Business Process Matrix A useful tool to understand and quantify business processes is the business process matrix (also called the process/dimension matrix). This matrix establishes a blueprint for the data warehouse database design to ensure that the design is extensible over time. The business process matrix aids in the strategic analysis task in two ways: Helps identify high-level analytical information that is required to satisfy the analytical needs for each business process, and serves as a method of cross checking whether you have all of the required business dimensions for each business process. Helps identify common business dimensions shared by different business processes. Business dimensions that are shared by more than one business process should be modeled with particular rigor, so that the analytical requirements of all processes that depend on them are supported. This is true even if one or more of the potential business processes are not selected for the first increment of the warehouse. Model the shared business dimensions to support all processes, so that later increments of the warehouse will not require a redesign of these crucial dimensions. A sample business process matrix is developed and shown in the slide, with business processes across the top and dimensions down the column on the very left side.
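To illustrate how the matrix exposes shared dimensions, the following sketch represents a small matrix as a dictionary and lists the dimensions used by more than one business process. The process and dimension names are hypothetical sample values, not the contents of the slide.

```python
# Business process matrix: process -> dimensions that characterize it.
# All names below are illustrative assumptions.
process_matrix = {
    "Orders": {"Customer", "Product", "Date"},
    "Shipments": {"Customer", "Product", "Date", "Carrier"},
    "Inventory": {"Product", "Date", "Warehouse"},
}

def shared_dimensions(matrix):
    """Return each dimension used by more than one process, with those processes."""
    usage = {}
    for process, dimensions in matrix.items():
        for dimension in dimensions:
            usage.setdefault(dimension, set()).add(process)
    return {
        dimension: sorted(processes)
        for dimension, processes in sorted(usage.items())
        if len(processes) > 1
    }

shared = shared_dimensions(process_matrix)
for dimension, processes in shared.items():
    print(dimension, "is shared by", processes)
```

The dimensions reported here are the ones to model carefully in the first increment, even if only one of the processes is implemented at first.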
Conformed Dimensions Dimensions are conformed when they are exactly the same, including the keys, or when one is a perfect subset of the other. Shared dimensions are in effect "conformed" dimensions that can be shared across the data marts in an organization. A single Date dimension or Product dimension is an example. The DW bus architecture provides a standard set of conformed dimensions. Conformed dimensions will be replicated either physically or logically throughout the enterprise. However, they should be built once in the staging area.
Determining Granularity When gathering more specific information about measures and analytic parameters (dimensions), it is also important to understand the level of detail that is required for analysis and business decisions. Granularity is defined as the level of summarization (or detail) that will be maintained by your warehouse. The greater the level of detail, the finer the level of granularity. Grain is defined as the lowest level of detail that is retained in the warehouse, such as the transaction level. Such data is highly detailed and can then be summarized to any level that is required by the users. During your interviews, you should discern the level of detail that users need for near-term future analysis. After that is determined, identify whether there is a lower level of grain available in the source data. If so, you should design for at least one grain finer, and perhaps even to the lowest level of grain. Remember that you can always aggregate upward, but you cannot decompose the aggregate lower than the data that is stored in the warehouse. The level of granularity for each dimension determines the grain for the atomic level of the warehouse, which in turn will be used for rollups.
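The point that you can always aggregate upward from the grain, but never decompose below it, can be shown with a short sketch. The transaction rows and the roll_up helper are hypothetical and assume a grain of one row per transaction.

```python
from collections import defaultdict
from datetime import date

# Atomic (transaction-level) rows: (transaction date, sales amount). Sample data only.
transactions = [
    (date(2024, 1, 5), 10.0),
    (date(2024, 1, 20), 5.0),
    (date(2024, 2, 3), 7.5),
]

# How to derive the grouping key for each level of summarization.
LEVEL_KEYS = {
    "day": lambda d: d.isoformat(),
    "month": lambda d: d.strftime("%Y-%m"),
    "year": lambda d: str(d.year),
}

def roll_up(rows, level):
    """Aggregate atomic rows upward to the requested level."""
    key_of = LEVEL_KEYS[level]
    totals = defaultdict(float)
    for transaction_date, amount in rows:
        totals[key_of(transaction_date)] += amount
    return dict(totals)

monthly = roll_up(transactions, "month")
yearly = roll_up(transactions, "year")
print(monthly, yearly)
```

If only the monthly totals had been stored, the daily figures could not be recovered from them, which is why the grain is chosen at or below the finest level that users are likely to need.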
Granularity The single most important design consideration of a data warehouse is the issue of granularity. Granularity, or the grain, determines the lowest level of detail or summarization contained in the warehouse. A low level of granularity means greater detail; a high level of granularity means less detail. Granularity affects the warehouse size and the types of queries that can be performed (the dimensionality of the warehouse). Additionally, the type of query that is required determines the granularity of the data. The data can be stored at different levels of granularity: individual transactions, daily snapshots, monthly snapshots, quarterly snapshots, or yearly snapshots. Maintaining a low level of grain in the warehouse is expensive and requires more disk space and more processing in access operations. Because the data may already exist at the transaction level in the source, there is less demand during the ETL process than there would be with a higher level of granularity, which requires summarization. A high level of granularity requires less space and fewer resources for access, but may prevent users from accessing the level of detail that they require to answer their business questions. Consider starting your data warehouse with the lowest granularity that is available from the source systems. After the warehouse is running, users can determine what level of granularity is necessary. Keep in mind, however, that different users have different requirements.
Defining Time Granularity The grain you choose for the time dimension can have a significant impact on the size of your database. It is important to know how your data has been defined with regard to time, to accurately determine the grain. This is particularly true with external data that is commonly aggregated data. Even when you think there is no gap in the granularity of your systems, you may find that basic definitions between systems differ. The primary consideration here is to keep a low enough level of data in your warehouse to be able to aggregate and match values with other data that has a predetermined aggregate. The most common use of partitioning for a data warehouse is to range-partition the fact table of the star schema by time providing huge benefits in querying, manageability, and performance when loading. Note: A good practice is to store data at one level of granularity lower than your business user has requested. For example, if the user requests monthly data, you may want to store weekly data. However, the impact of the increased storage can be prohibitive.
Identifying Measures and Dimensions Measures A measure (or fact) contains a numeric value that measures an aspect of the business. Typical examples are gross sales dollars, total cost, profit, margin dollars, or quantity sold. A measure can be additive or partially additive across dimensions. Dimensions A dimension is an attribute by which measures can be characterized or analyzed. Dimensions bring meaning to raw data. Typical examples are customer name, date of order, or product brand. During the warehouse design, you must decide whether a piece of data is a measure or a dimension. You can use the following as a guide: If the data regularly changes value, it is a measure; for example, units sold or account balances. If the data is constant or it takes only a discrete number of values, it is a dimension. For example, the color of a product and the address of a customer are unlikely to change frequently. A need or capability to summarize often identifies a measure. These rules are not definitive but act as a guide where there is indecision.
Data Warehouse Environment Data Structures Warehouse environment table structures can take on a number of forms. The data modeling structures that are commonly encountered in a data warehouse environment are: Third normal form (3NF) Star schema Snowflake schema Hybrid schema Note: Today, most of the very large data warehouses mix 3NF and star schema. Normalized structures store the greatest amount of data in the least amount of space. Entity-relationship modeling (ERM) also seeks to eliminate data redundancy. This is immensely beneficial to transaction-processing (OLTP) systems. Dimensional modeling (DM) is a design that presents the data in an intuitive manner and allows for high-performance access. For these two reasons, dimensional modeling, in the form of star and snowflake schemas, has become the standard design for data marts and data warehouses.
Basic Form A star schema (dimensional model) is a typical warehouse model. It is portrayed as a star, with a central table containing fact data, and multiple dimensional tables radiating out from it, all connected by keys. Dimensional models are represented by two main types of data: Fact data Dimension data (This data is denormalized, unlike that in a typical relational model.) Instructor Note The main types of data are discussed later in greater length in the dimensional modeling lesson.
Star Schema Model A star schema model can be depicted as a simple star; a central table contains fact data, and multiple tables radiate out from it, connected by database primary and foreign keys. Unlike other database structures, a star schema has denormalized dimensions. The star schema is emerging as the predominant model for data warehouses/marts. Star dimensional modeling is a logical design technique that seeks to present the data in a standard framework that is intuitive and provides high performance. Every dimensional model is composed of one table called the fact table, and a set of smaller tables called dimension tables. This characteristic (denormalized, star-like structure) is commonly known as a star model. Within this star model, redundant data is posted from one object to another for performance considerations. A fact table has a multipart primary key composed of two or more foreign keys and expresses a many-to-many relationship. Each dimension table has a single-part primary key that corresponds exactly to one of the components of the multipart key in the fact table.
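As a small illustration of the structure described above, the sketch below builds a two-dimension star in an in-memory SQLite database and runs a typical query against it. The table names, column names, and rows are hypothetical; the fact table's multipart primary key is composed of the foreign keys to the dimension tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY, brand TEXT);
CREATE TABLE date_dim    (date_key INTEGER PRIMARY KEY, month TEXT);
-- Fact table: multipart primary key made up of the dimension foreign keys.
CREATE TABLE sales_fact (
    product_key INTEGER REFERENCES product_dim (product_key),
    date_key    INTEGER REFERENCES date_dim (date_key),
    units_sold  INTEGER,
    PRIMARY KEY (product_key, date_key)
);
INSERT INTO product_dim VALUES (1, 'BrandA'), (2, 'BrandB');
INSERT INTO date_dim    VALUES (20240105, '2024-01'), (20240210, '2024-02');
INSERT INTO sales_fact  VALUES (1, 20240105, 3), (2, 20240105, 4), (1, 20240210, 5);
""")

# Constrain and group on dimension attributes, sum the fact.
rows = conn.execute("""
    SELECT p.brand, d.month, SUM(f.units_sold)
    FROM sales_fact f
    JOIN product_dim p ON p.product_key = f.product_key
    JOIN date_dim d    ON d.date_key = f.date_key
    GROUP BY p.brand, d.month
    ORDER BY p.brand, d.month
""").fetchall()
print(rows)
```

Every query follows the same shape: join the fact table to the dimensions of interest, filter or group on dimension attributes, and aggregate the measures.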
Fact Table Characteristics Facts are the numerical measures of the business. The fact table is the largest table in the star schema and is composed of large volumes of data, usually making up 90% or more of the total database size. It can be viewed in two parts: a multipart primary key, and business metrics that are numeric and (usually) additive. Often a measure may be required in the warehouse, but it may not appear to be additive. These are known as semiadditive facts. Inventory and room temperature are two such numerical measurements. It does not make sense to add these numerical measurements over time, but they can be aggregated by using an SQL function other than sum, for example, average. Although a star schema typically contains one fact table, other DSS schemas can contain multiple fact tables.
Dimension Table Characteristics Dimensions are the textual descriptions of the business. Dimension tables are typically smaller than fact tables and the data changes much less frequently. Dimension tables give perspective regarding the whys and hows of the business and element transactions. Although dimensions generally contain relatively static data, customer dimensions are updated more frequently. Dimensions Are Essential for Analysis The key to a powerful dimensional model lies in the richness of the dimension attributes because they determine how facts can be analyzed. Dimensions can be considered as the entry point into “fact space.” Always name attributes in the users’ vocabulary. That way, the dimension will document itself and its expressive power will be apparent.
Advantages of Using a Star Dimensional Schema Provides rapid analysis across different dimensions for drilling down, rotation, and analytical calculations for the multidimensional cube. Creates a database design that improves performance. Enables database optimizers to work with a simpler database design to yield better execution plans. Provides an extensible design that supports changing business requirements. Broadens the choices for data access tools, because some products require a star schema design.
Base and Derived Data The fields in your fact tables are not just source data columns. There may be many more columns that the business wants to analyze, such as year-to-date sales or the percentage difference in sales from a year ago. You must keep track of the derived facts as well as the base facts. Derived Data Derived data is data that is calculated or created from two or more sources of data. A derived value can be stored for more efficient access, rather than being calculated at execution time. Derived data in the warehouse is very important because of its inherent support for queries. When you store derived data in the database, values are available immediately for analysis through queries. In this example, you see the employees’ monthly salary and monthly commission figures. In the OLTP system, you would not store the monthly compensation, but you might do so for the warehouse. Derived values can be calculated at an instant in time, as with an OLTP system. However, derived values feature more prominently in data warehouses.
Translating Business Measures into Fact Tables The slide shows the translation of some of the business measures in a business model into an Order fact table. In addition, each measure in the fact table is identified as either a base or derived measure, for example, Profit = Order_amount – Order_cost. Derived values can be created during the extraction or transformation processes.
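A derived measure such as Profit can be computed once during transformation and stored with the base measures. The following is a minimal sketch; the function name add_derived_measures and the sample order row are hypothetical.

```python
def add_derived_measures(order):
    """Return a fact row holding the base measures plus the derived Profit."""
    fact = dict(order)  # keep the base measures unchanged
    # Derived measure: Profit = Order_amount - Order_cost
    fact["profit"] = fact["order_amount"] - fact["order_cost"]
    return fact

# Hypothetical extracted order row.
fact_row = add_derived_measures(
    {"order_id": 1, "order_amount": 1200, "order_cost": 900}
)
print(fact_row)
```

Because the value is stored, queries can read profit directly instead of recalculating it for every row at execution time.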
Snowflake Schema Model According to Ralph Kimball “a dimension is said to be snowflaked when the low cardinality fields in the dimension have been removed to separate tables and linked back into the original table with artificial keys.” A snowflake model is closer to an entity relationship diagram than the classic star model because the dimension data is more normalized. Developing a snowflake model means building class hierarchies out of each dimension (normalizing the data). A snowflake model: Results in severe performance degradation because of its greater number of table joins Provides a structure that is easier to change as requirements change Is quicker at loading data into its smaller normalized tables, compared to loading into a star schema’s larger denormalized tables Allows using history tables for changing data, rather than level fields (indicators) Has a complex metadata structure that is harder for end user tools to support One of the major reasons why the star schema model has become more predominant than the snowflake model is its query performance advantage. In a warehouse environment, the snowflake’s quicker load performance is much less important than its slower query performance.
Snowflake Model Is an extension of a star design Is a normalized star model Can cause potential performance degradation A dimension is said to be snowflaked when the low cardinality fields in the dimension have been removed to separate tables and linked back to the original tables with artificial keys. The snowflake model is closer to an entity relationship diagram than the classic star model, because the dimension data is more normalized. Developing a snowflake model means building hierarchies from each dimension. The snowflake model is an extended, more normalized star model. The dimensions are normalized to form class hierarchies. For example, a Location Dimension may be normalized to a hierarchy showing Countries, Counties, and Cities; or Region, District, and Office. The slide shows an example where the Order dimension is normalized to a hierarchy of Channel and Web. The two tables are then joined together. Note: We do not recommend snowflaking, because this model almost always interferes with the user’s comprehension. The exception is when a user access tool requires the use of a snowflake model.
Snowflake Schema Model (continued) Besides the star and snowflake schemas, there are other models that can be considered. Constellation A constellation model (also called galaxy model) simply comprises a series of star models. Constellations are a useful design feature if you have a primary fact table, and summary tables of a different dimensionality. It can simplify design by allowing you to share dimensions among many fact tables. Third Normal Form Warehouse Some data warehouses consist of a set of relational tables that have been normalized to third normal form (3NF). Their data can be directly accessed by using SQL code. They may have more efficient data storage, at the price of slower query performance due to extensive table joins. Some large companies build a 3NF central data warehouse feeding dependent star data marts for specific lines of business.
Constellation (Summary) Configuration The constellation configuration consists of a central star surrounded by other stars. The central star comprises base-level or atomic data. The surrounding stars can be stars that share dimensions or summary data. The surrounding stars can share dimension attributes with the atomic star. A galaxy is composed of those constellations that do not share dimensions and a universe is composed of galaxies.
Fact Table Measures Additive: Additive data in a fact table can be added across all of the dimensions to provide an answer to your query. Additive facts are fundamental components of most queries. Typically, when you query the database you are requesting that data be summed to a particular level to meet the constraints of your query. Additive facts are numeric and can be logically used in calculations across all dimensions. Additive data examples are units sold and shipping charge. Semiadditive: Semiadditive data, such as a bank account balance, can be added along some but not all of the dimensions. The bank records balances at the end of each banking day for customers, by account, over time. This allows the bank to study deposits, as well as individual customers. In some cases, the account balance measure is additive. If a customer holds both checking and savings accounts, you can add together the balances of each account at the end of the day and get a meaningful combined balance. You can also add balances together across accounts at the same branch for a picture of the total deposits at each location; however, you cannot add together account balances for a single customer over multiple days. Nonadditive: Nonadditive data cannot logically be added between records. Nonadditive data can be numeric and is usually combined in a computation with other facts (to make it additive) before being added across records. Margin percent is an example of a nonadditive value. Factless fact tables are also nonadditive.
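The difference between the three kinds of measure can be seen with a few rows of sample data. The sketch below uses hypothetical balances and sales figures: balances are summed across accounts but averaged across days, and margin percent is recomputed from its additive components rather than averaged.

```python
# Semiadditive: end-of-day balances by (day, account). Sample data only.
balances = [
    ("2024-03-01", "checking", 500),
    ("2024-03-01", "savings", 1500),
    ("2024-03-02", "checking", 700),
    ("2024-03-02", "savings", 1500),
]

def combined_balance(day):
    """Adding across accounts on one day is meaningful."""
    return sum(amount for d, _, amount in balances if d == day)

days = sorted({d for d, _, _ in balances})
# Across days, use an average instead of a sum.
average_daily_balance = sum(combined_balance(d) for d in days) / len(days)

# Nonadditive: margin percent. Store additive components and derive the ratio.
sales_rows = [
    {"sales": 100, "margin": 20},  # 20% margin
    {"sales": 300, "margin": 30},  # 10% margin
]
overall_margin_pct = (
    sum(r["margin"] for r in sales_rows) / sum(r["sales"] for r in sales_rows)
)
naive_average_pct = sum(r["margin"] / r["sales"] for r in sales_rows) / len(sales_rows)

print(combined_balance("2024-03-01"), average_daily_balance)
print(overall_margin_pct, naive_average_pct)
```

The naive average of the two percentages differs from the margin computed on the totals, which is why the additive components are summed first and the ratio is derived afterward.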
More on Factless Fact Tables The factless fact table represents the many-to-many relationships between the dimensions, so that the characteristics of the event can be analyzed. Materialized views are generally built on factless fact tables to create summaries. Examples Retail store: Promotions are typical within the retail environment. An upscale retail chain wants to compare its customers who do not respond to direct mail promotion to those who make a purchase. A factless fact table supports the relationship between the customer, product, promotion, and time dimensions. Student attendance: Factless fact tables can be used to record student class attendance in a college or school system. There is no fact associated with this; it is a matter of whether the students attended. Note: FK = foreign key and PK = primary key
Factless Fact Tables A factless fact table is a fact table that does not contain numeric additive values, but is composed exclusively of keys. There are two types of factless fact tables: event-tracking and coverage. Event-tracking tables record and track events that have occurred, such as college student class attendance, whereas coverage factless tables support the dimensional model when the primary fact table is sparse, for example, a sales promotion factless table. In the latter case, the events did not occur. Although promotional fact data can be stored within the fact table itself, creating a coverage factless fact table is far more efficient, because representing the complex many-to-many relationship formed through the dimensions for a promotion in the primary fact table would require massive numbers of detail rows with zero values.
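Both uses of a factless fact table can be illustrated with sets of key tuples. In this sketch all keys and rows are hypothetical: the event-tracking table is analyzed by counting rows, and the coverage table is compared with the sales fact table to find what was promoted but did not sell.

```python
from collections import Counter

# Event-tracking factless table: (student_key, class_key, date_key), keys only.
attendance = [
    ("S1", "MATH101", "2024-03-01"),
    ("S2", "MATH101", "2024-03-01"),
    ("S1", "HIST200", "2024-03-01"),
]
# With no measure to sum, analysis is done by counting rows.
attendance_by_class = Counter(class_key for _, class_key, _ in attendance)

# Coverage factless table: (product_key, promotion_key, date_key), keys only.
promotion_coverage = {
    ("P1", "SPRING_MAILER", "2024-03-01"),
    ("P2", "SPRING_MAILER", "2024-03-01"),
    ("P3", "SPRING_MAILER", "2024-03-01"),
}
# Sales fact table: rows exist only where a sale actually happened.
sales_fact = {
    ("P1", "SPRING_MAILER", "2024-03-01"): 12,
    ("P3", "SPRING_MAILER", "2024-03-01"): 4,
}
# The events that did not occur: covered by the promotion but absent from sales.
promoted_but_unsold = sorted(promotion_coverage - set(sales_fact))

print(attendance_by_class["MATH101"], promoted_but_unsold)
```

The sales fact table stays sparse because the rows for unsold products live only in the coverage table, as keys without measures.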
Bracketed Dimensions To enhance performance and analytical capabilities, create bracketed dimensions for categorical attributes. Creating groups of values for attributes with many unique values, such as income, reduces the cardinality (fewer bitmap indexes) and creates ranges that are more meaningful to the end user. For example, even though the actual income amount is available at the customer level, the business requirements do not need discrete values for analysis. Income can then be grouped in $10,000 ranges. Similarly for ages you can have a 21–30 bracket and a 31–40 bracket. To extract a person’s actual age, you can perform a calculation on birth date. Attributes can be grouped into brackets as well, creating unique identifiers for combinations of attributes. For some queries, pre-aggregating the data can minimize the need for full table scans. The bracketed definition is selected as the query constraint and the index for the table is scanned for records that satisfy the query. If the data was not pre-aggregated into brackets, the table must be scanned multiple times to select each range of values. Assumptions: There are far fewer brackets than actual occurrences of the value bracketed The ranges are agreed upon Values that will constitute summary tables should be preaggregated .
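Assigning values to brackets is a simple calculation. The sketch below assumes whole-number incomes grouped in $10,000 ranges and ten-year age brackets of the form 21–30 and 31–40; the function names and the label format are hypothetical.

```python
from datetime import date

def income_bracket(income, width=10_000):
    """Return the label of the fixed-width range that contains a whole-number income."""
    low = (income // width) * width
    return f"{low}-{low + width - 1}"

def age_on(birth_date, as_of):
    """Actual age in completed years, calculated from the birth date."""
    had_birthday = (as_of.month, as_of.day) >= (birth_date.month, birth_date.day)
    return as_of.year - birth_date.year - (0 if had_birthday else 1)

def age_bracket(age):
    """Return ten-year brackets such as 21-30 and 31-40."""
    low = ((age - 1) // 10) * 10 + 1
    return f"{low}-{low + 9}"

age = age_on(date(1990, 6, 15), date(2024, 6, 14))
print(income_bracket(47_250), age, age_bracket(age))
```

A query that constrains on the bracket label reads one low-cardinality attribute instead of evaluating a range predicate against every distinct income or age value.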
Bracketing Dimensions Bracketed dimensions are typically used to support complex analytical models, because they: Contain information used to classify, categorize, or group multiple attributes Are frequently used in end-user queries where the actual value is too discrete to be meaningful Therefore, consider preaggregating to enhance performance and analytical power. If you have the data, you can use a query tool or data mining tool to determine and define the brackets based on actual values. If connected to the fact table, the Bracket_PK field must be part of the fact table composite key; otherwise, no modifications to the fact table are required. If the bracketed dimension is not connected to the fact table, it can be built at any time, as required. In this example, the bracketed dimension is created from a combination of attributes. The bracketed table is related to the fact table, requiring that the Bracket_PK field be part of the composite key. The example also presupposes that these brackets have been agreed upon during requirement analysis. Examples of other applications for which you might use bracketing include: Marketing applications with customer salary ranges Risk assessment applications containing credit card limits Note: Bracketing data in this way makes it more difficult to manage, load, or access with drill capabilities.
Example In the example on the slide, a store dimension has one hierarchy, the organization hierarchy.
Multiple Hierarchies Multiple hierarchies can be stored in a single dimension, using the flat model approach. The definition of the relationship between the data in this model must be well maintained and stored in the metadata. All the levels of the hierarchy have been collapsed into one table. Example: Multiple hierarchies within the same dimension are so common that they may be the rule rather than the exception. In this example, the store dimension has two hierarchies: Organization Geography These relationships were most likely created to manage a particular business process. The organization hierarchy may support the evaluation of sales groups or personnel, whereas the geography hierarchy may be used to manage distributors or estimate tax exposure.
Multiple Time Hierarchies Representation of time is critical in the data warehouse. You may decide to store multiple hierarchies in the data warehouse to satisfy the varied definitions of units of time. If you are using external data, you may find that you create a hierarchy or translation table simply to be able to integrate the data. Matching the granularity of time defined in external data to the time dimension in your own warehouse may be quite difficult. A simple time hierarchy corresponds to a calendar approach: days, months, quarters, years. A hierarchy based on weeks seems fairly simple as well: weeks, four-week period. What is the definition of a week? Does the week start on Sunday or Monday? Internally, you may define it one way; however, when you try to integrate external data that is defined in a different way, you may get unexpected or misleading results. Are there not 13 weeks in a quarter? Why can I not map 13-week periods to a quarter? Typically the start and stop date of a quarter corresponds to a calendar date—the first of the month, the last day of a month. Thirteen-week periods may start at any time but are usually consistent with the start day of the week definition. Example: The time dimension is described by multiple hierarchies and can be used to support both calendar and fiscal year.
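A time dimension row that carries both a calendar and a fiscal hierarchy can be derived from the date alone, once the definitions are fixed. The sketch below makes several assumptions that are not in the course material: the fiscal year starts in April and is named after the calendar year in which it ends, the week start day is configurable (0 = Monday, 6 = Sunday), and the column names are hypothetical.

```python
from datetime import date, timedelta

def time_dim_row(d, fiscal_start_month=4, week_starts_on=0):
    """Build one time dimension row with calendar, fiscal, and week attributes."""
    fiscal_month = (d.month - fiscal_start_month) % 12 + 1
    # Assumption: a fiscal year that starts after January is named for the year it ends in.
    crosses_year = fiscal_start_month > 1 and d.month >= fiscal_start_month
    days_into_week = (d.weekday() - week_starts_on) % 7
    return {
        "date_key": int(d.strftime("%Y%m%d")),
        # Calendar hierarchy: day -> month -> quarter -> year
        "calendar_month": d.month,
        "calendar_quarter": (d.month - 1) // 3 + 1,
        "calendar_year": d.year,
        # Fiscal hierarchy: day -> fiscal month -> fiscal quarter -> fiscal year
        "fiscal_month": fiscal_month,
        "fiscal_quarter": (fiscal_month - 1) // 3 + 1,
        "fiscal_year": d.year + 1 if crosses_year else d.year,
        # Week hierarchy depends on the agreed definition of the week start.
        "week_start": d - timedelta(days=days_into_week),
    }

row = time_dim_row(date(2024, 5, 15))
sunday_row = time_dim_row(date(2024, 5, 15), week_starts_on=6)
print(row)
print(sunday_row["week_start"])
```

The same date falls in a different week under the Monday and Sunday definitions, which is exactly the kind of mismatch that appears when external data is integrated.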
Drilling Up and Drilling Down Drilling refers to the investigation of data to greater or lesser detail from a starting point. Typically, in an analytical environment you start with less detail, at a higher level within a hierarchy, and investigate down through a hierarchy to greater detail. This process is drilling down (to more detailed data). Drilling down means retrieving data at a finer level of granularity. Using the market hierarchy as an example, you might start your analysis at the group level and drill down to the region or store level of detail. Drilling up is the reverse of that process. Consider the market hierarchy example. If your starting point were an analysis of districts, drilling up would mean looking at a lesser level of detail, such as the region or group level. Drilling can occur in three ways: Through a single hierarchy From a single hierarchy to nonhierarchical attributes From a single hierarchy to another hierarchy The example of the market hierarchy displays one drill path.
Drilling Across Beginning the analysis with the group level of the market hierarchy, you drill down to a region within the same hierarchy. Your analysis then leads you to find all stores (part of the market hierarchy) with greater than 20,000 square feet (a nonhierarchical attribute). Having identified those stores, you now want to identify the cities in which those stores are located. In this analysis you have gone from drilling in one hierarchy, to an independent attribute, and finally to another hierarchy. The risk in drilling across is that you are not necessarily going to enter another hierarchy at the same level, so the totals may be different. Therefore, you would not try to balance the results from your final query, which looked at the city level within the geography hierarchy, with totals at different levels within the market hierarchy. Drilling across can be a very powerful approach to analysis. If dimensions have been conformed, that is, if the grain and detail are identical, reliability for drilling across is greatly enhanced.
Using Time in the Data Warehouse Though it may seem obvious, real-life aggregations based on time can be quite complex. Which weeks roll up to which quarters? Is the first quarter the calendar months of January, February, and March, or the first 13 weeks of the year that begin on Monday? Some causes for nonstandardization are: Some countries start the work week on Monday, others on Sunday. Weeks do not cleanly roll up to years, because a calendar year is one day longer than 52 weeks (two days longer in leap years). There are differences between calendar and fiscal periods. Consider a warehouse that includes data from multiple organizations, each with its own calendars. Holidays are not the same for all organizations and all locations. Representing time is critical in the data warehouse. You may decide to store multiple hierarchies in the data warehouse to satisfy the varied definitions of units of time. If you are using external data, you may find that you create a hierarchy or translation table simply to be able to integrate the data. Matching the granularity of time defined in external data to the time dimension in your own warehouse may be quite difficult.
Date Dimension Automated:
Applying the Changes to Data There are a number of methods for applying changes to existing data in dimension tables: Overwrite a record Add a new record Add a current field Maintain history records Versioning of records
OLAP Models ROLAP: a two-dimensional table where queries are posed and run without the assistance of cubes, providing greater flexibility for drilling down, across, and pivoting results. Each row in the table holds data that pertains to a specific thing or a portion of that thing. Each column of the table contains data regarding an attribute. MOLAP: a cube (the equivalent of a table in a relational database) stores a precalculated data set that contains all possible answers to a given range of questions. Each cube has several dimensions (equivalent to index fields in relational tables). The cube acts like an array in a conventional programming language. Logically, the space for the entire cube is pre-allocated. To find or insert data, you use the dimension values to calculate the position. Summarized tables are quite common in MOLAP configurations. HOLAP: a hybrid of MOLAP and ROLAP that combines the capabilities of both by using both precalculated cubes and relational data sources. DOLAP: Desktop OLAP. It downloads a relatively small hypercube from a central point (usually a data mart or data warehouse) and performs multidimensional analyses while disconnected from the source. This functionality is particularly useful for mobile users who cannot always connect to the data warehouse. DOLAP is often contrasted with MOLAP (cube-based) and ROLAP (star schema based) business intelligence tools: a DOLAP product does everything on the desktop after extracting the data.
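The MOLAP description above (a pre-allocated cube that behaves like an array, with the position calculated from the dimension values) can be sketched in Python as follows. This is an illustration under assumptions, not how any particular MOLAP product stores its cubes: the row-major layout, the function name `cell_offset`, and the cube dimensions are all hypothetical.

```python
def cell_offset(coords, dim_sizes):
    """Row-major offset of a cell in a dense, pre-allocated cube."""
    offset = 0
    for index, size in zip(coords, dim_sizes):
        if not 0 <= index < size:
            raise IndexError("coordinate outside dimension")
        offset = offset * size + index
    return offset


# Hypothetical cube: 4 products x 3 stores x 12 months.
dim_sizes = (4, 3, 12)
cube = [0.0] * (4 * 3 * 12)   # space for every cell is allocated up front

# Insert and find a value by computing its position from the coordinates.
cube[cell_offset((2, 1, 5), dim_sizes)] = 1250.0
value = cube[cell_offset((2, 1, 5), dim_sizes)]
```

Because every lookup is a direct offset calculation, no search is needed, at the cost of allocating space for empty cells.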
Slowly Changing Dimensions (SCDs) Handling changes to the data in level or attribute fields is one of the most challenging design issues in multidimensional (star schema) data modeling. This issue is often referred to as the handling of Slowly Changing Dimensions. Ralph Kimball (and others) outline three ways to handle this situation: Type-1: Overwrite the data in the dimension table. This is easy and fast to implement, but history is lost. Type-2: Add new records to the dimension table that contain the new data. This partitions history perfectly but requires maintenance of surrogate keys. Type-3: Add new fields to the dimension table to contain the values before and after the change.
Overwriting a Record This method is easy to implement, but it is useful only if you are not interested in maintaining the history of data. If the data you are changing is critical to the context of information and analysis of the business, then overwriting a record is to be avoided at all costs. For example, by overwriting dimension data, you lose all track of history—you can never see that John Doe was single if the value “Single” is overwritten with the value “Married” from the operational system. The Customer_Id for John Doe remains constant throughout the life of the warehouse, because only one record for John Doe is stored.
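A minimal Python sketch of overwriting a record, using the John Doe example. The dictionary layout and the identifier `C100` are hypothetical.

```python
# One record per customer, keyed by the operational Customer_Id.
customer_dim = {
    "C100": {"customer_id": "C100", "name": "John Doe",
             "marital_status": "Single"},
}


def overwrite_attribute(dim, customer_id, attribute, new_value):
    """Replace the value in place; the old value is not kept anywhere."""
    dim[customer_id][attribute] = new_value


overwrite_attribute(customer_dim, "C100", "marital_status", "Married")
```

After the call there is still exactly one record for John Doe, and nothing in the dimension shows that he was ever "Single".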
Adding a New Record Using this method, you add another dimension record for John Doe. One record shows that he was “single” until December 31, 2000, another that he was “married” from January 1, 2001. Using this method, history is accurately preserved, but the dimension tables get bigger. A generalized (or derived) key is created for the second John Doe record. The generalized key is a derived value that ensures that a record remains unique. However, you now have more than one key to manage. You also need to ensure that the record keeps track of when the change has occurred. The Customer_Id for John Doe does not remain constant throughout the life of the warehouse, because each record added for John Doe contains a unique key. The key value is usually a combination of the operational system identifier with characters or digits appended to it. Consider using real data keys. The example here shows a method that is commonly identified in warehouse reference material.
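A minimal Python sketch of adding a new record, again using John Doe. The generalized key is built here by appending a two-digit sequence to the operational identifier, which is one of the options the text mentions. The identifier `C100`, the column names, and the starting date of the first record are hypothetical.

```python
from datetime import date, timedelta

customer_dim = [
    {"customer_key": "C100-01", "customer_id": "C100", "name": "John Doe",
     "marital_status": "Single",
     "effective_from": date(1990, 1, 1),   # hypothetical start date
     "effective_to": None},                # None marks the current record
]


def add_new_record(dim, customer_id, changes, change_date):
    """Close the current record and append a new one with a new key."""
    versions = [row for row in dim if row["customer_id"] == customer_id]
    current = max(versions, key=lambda row: row["effective_from"])
    current["effective_to"] = change_date - timedelta(days=1)
    new_row = dict(current, **changes)
    new_row["customer_key"] = f"{customer_id}-{len(versions) + 1:02d}"
    new_row["effective_from"] = change_date
    new_row["effective_to"] = None
    dim.append(new_row)
    return new_row


add_new_record(customer_dim, "C100", {"marital_status": "Married"},
               date(2001, 1, 1))
```

The dimension now holds two records for the same customer: "Single" until December 31, 2000, and "Married" from January 1, 2001.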
Adding a Current Field In this method, you add a new field to the dimension table to hold the current value of the attribute. Using this method, you can keep some track of history. You know that John Doe was “single” before he was “married.” Each time John’s marital status changes, the two status attributes are updated and a new Effective Date is entered. However, what you cannot see from this method is what changes have taken place between the two records you are storing for John Doe—intermediate values are lost. Consider using an Effective Date attribute to show when the status changed. Partitioning of data can then be performed by effective date. The method you choose is again determined by the business requirements. If you want to maintain history, this method is a logical choice that can be enhanced by using a generalized key.
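A minimal Python sketch of the current-field method. The column names are hypothetical, and the second change ("Divorced") is an invented value used only to show how an intermediate value is lost.

```python
from datetime import date

customer = {
    "customer_id": "C100",
    "name": "John Doe",
    "previous_marital_status": None,
    "current_marital_status": "Single",
    "effective_date": None,
}


def change_status(row, new_status, effective_date):
    """Shift the current value into the previous field, then store the new one."""
    row["previous_marital_status"] = row["current_marital_status"]
    row["current_marital_status"] = new_status
    row["effective_date"] = effective_date


change_status(customer, "Married", date(2001, 1, 1))
change_status(customer, "Divorced", date(2005, 6, 1))  # hypothetical later change
```

After the second change the row holds "Married" and "Divorced" only. The original "Single" value can no longer be recovered from the dimension.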
History Tables and One-to-Many Relationships Using this method, you keep one current record of the customer and many history records in the customer history table (a one-to-many relationship between the tables), thus maintaining history in a more normalized data model. In the CUSTOMER table the customer operational unique identifier is retained in the CUSTOMER.Id column. In the HIST_CUST table, the operational key is maintained in the HIST_CUST.Id column and the generalized key in the HIST_CUST.G_Id column. You can use this to keep all the keys needed and multiple records for the customer. The data in the table might appear.
Versioning You can also maintain a version number for the customer in the Customer dimension: You must ensure that the measures in the fact table, such as sales figures, also contain the customer version number to avoid any double counting: For Comer Version 1, the sales total is $36,000. For Comer Version 2, the sales total is $87,000.
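A minimal Python sketch of why the fact rows need the version number. The sales totals come from the example above; the customer identifier `7` and the table layouts are hypothetical.

```python
customer_dim = [
    {"customer_id": 7, "version": 1, "name": "Comer"},
    {"customer_id": 7, "version": 2, "name": "Comer"},
]
sales_fact = [
    {"customer_id": 7, "version": 1, "sales": 36_000},
    {"customer_id": 7, "version": 2, "sales": 87_000},
]


def sales_by_version(dim, fact, use_version=True):
    """Join fact rows to dimension rows and total the sales per version."""
    totals = {}
    for d in dim:
        for f in fact:
            same_customer = d["customer_id"] == f["customer_id"]
            same_version = d["version"] == f["version"]
            if same_customer and (same_version or not use_version):
                totals[d["version"]] = totals.get(d["version"], 0) + f["sales"]
    return totals


correct = sales_by_version(customer_dim, sales_fact)                   # joins on id and version
doubled = sales_by_version(customer_dim, sales_fact, use_version=False)  # joins on id only
```

Joining on the customer identifier alone matches every fact row to both versions, so each version is credited with the combined total.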
Rapidly Changing Dimensions (RCDs) A dimension that changes too rapidly poses a separate problem, often referred to as Rapidly Changing Dimensions. One powerful technique is to break off the frequently changing attributes into a separate dimension, referred to as a mini dimension. For example, a large customer dimension may have demographic attributes such as income level, age, marital status, education, job title, and number of children. There would be one row in the mini dimension table for each unique combination of the frequently changing attributes. Every time we build a fact table row, we include two foreign keys related to the customer: the customer key from the customer dimension and the demographics key from the mini dimension. The most recent value of the demographics key can exist as a foreign key in the customer dimension. If you embed the most recent demographics key in the customer dimension, you must treat it as a Type 1 attribute. If you treat it as Type 2, you have reintroduced the RCD problem. Mini dimension terminology applies when the mini dimension key is part of the fact table (the demographics key is part of the fact table composite key). Outrigger terminology applies when the mini dimension key is part of the dimension table (the demographics key is a foreign key in the customer dimension table). You may choose to store the mini dimension key in both the dimension and the fact table.
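A minimal Python sketch of a mini dimension. Only three of the demographic attributes are used, and the function names, key values, and sample amounts are hypothetical.

```python
# Mini dimension: one key per unique combination of the volatile attributes.
demographics_dim = {}


def demographics_key(income_level, marital_status, education):
    """Return the key for this combination, adding a row if it is new."""
    combo = (income_level, marital_status, education)
    if combo not in demographics_dim:
        demographics_dim[combo] = len(demographics_dim) + 1
    return demographics_dim[combo]


def build_fact_row(customer_key, demographics, amount):
    """Each fact row carries both the customer key and the demographics key."""
    return {
        "customer_key": customer_key,
        "demographics_key": demographics_key(*demographics),
        "amount": amount,
    }


first = build_fact_row(501, ("50-75K", "Single", "Graduate"), 120.0)
second = build_fact_row(502, ("50-75K", "Single", "Graduate"), 80.0)
third = build_fact_row(501, ("75-100K", "Married", "Graduate"), 200.0)
```

Customer 501 changes demographics between the first and third fact rows, yet the customer dimension itself gains no new rows. Only the demographics key on the fact row changes.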
Junk Dimension A junk dimension is an abstract dimension that holds the decodes for a group of low-cardinality flags and indicators, thereby removing them from the fact table. Remove all the flags and indicators from the fact table. Create a separate dimension table referred to as the junk dimension. E.g., an order processing system:
Junk Key | Order Mode | Order Type | Payment Type
1 | Web | Normal | Cash
2 | Web | Urgent | Cash
3 | Fax | Normal | Credit
4 | Fax | Urgent | Credit
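A minimal Python sketch of the order-processing example: the three flags are replaced on the fact row by a single junk key. The junk dimension rows are the four combinations from the example; the order record and its field names are hypothetical.

```python
# Junk dimension: (order mode, order type, payment type) -> junk key.
junk_dim = {
    ("Web", "Normal", "Cash"): 1,
    ("Web", "Urgent", "Cash"): 2,
    ("Fax", "Normal", "Credit"): 3,
    ("Fax", "Urgent", "Credit"): 4,
}


def to_fact_row(order):
    """Replace the flag columns of an order with one junk key."""
    flags = (order["order_mode"], order["order_type"], order["payment_type"])
    return {
        "order_id": order["order_id"],
        "amount": order["amount"],
        "junk_key": junk_dim[flags],
    }


fact_row = to_fact_row({
    "order_id": 9001, "amount": 250.0,          # hypothetical order
    "order_mode": "Fax", "order_type": "Urgent", "payment_type": "Credit",
})
```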
Secret of Success Your eventual goal may be the enterprisewide solution, but take small steps to achieve it. The enterprisewide warehouse is not a realistic objective for your first pass. Always use the proven low-risk incremental approach.
Objectives <ul><li>Data Warehousing Concepts </li></ul><ul><li>What is Business Intelligence (BI)? </li></ul><ul><li>Evolution of BI </li></ul><ul><li>Characteristics of an OLTP system </li></ul><ul><li>Why OLTP is not suitable for complex analysis? </li></ul><ul><li>Characteristics of a Data Warehouse </li></ul><ul><li>Define DWH and its properties – </li></ul><ul><li>Subject Oriented, Integrated, Time variant, Non-Volatile </li></ul><ul><li>Define Grain/Granularity </li></ul><ul><li>Differentiate between OLTP and Data Warehouse </li></ul><ul><li>User expectations and User community </li></ul><ul><li>Enterprise Data Warehouse </li></ul><ul><li>Data Warehouse versus Data marts </li></ul><ul><li>Dependent Data marts </li></ul><ul><li>Independent Data marts </li></ul><ul><li>Data Warehouse components – </li></ul><ul><li>Source systems, Staging area, Presentation area, Access tools </li></ul>
Objectives <ul><li>Data Warehousing Concepts </li></ul><ul><li>Goals of a Data Warehouse </li></ul><ul><li>Data Warehouse development approaches - </li></ul><ul><li>Top-down, Bottom-up, Hybrid, Federated </li></ul><ul><li>Incremental approach to warehouse development </li></ul><ul><li>Dimensional Modeling </li></ul><ul><li>Star Schema – Fact and Dimension tables </li></ul><ul><li>Dimensions and Measure objects </li></ul><ul><li>Snowflake Schema </li></ul><ul><li>Types of Fact tables </li></ul><ul><li>Factless Fact table </li></ul><ul><li>OLAP storage modes – MOLAP, ROLAP, HOLAP, DOLAP </li></ul><ul><li>Slowly and Rapidly changing Dimensions- Type I, II, III </li></ul><ul><li>Degenerate Dimension </li></ul><ul><li>Junk Dimension </li></ul><ul><li>CASE-STUDIES </li></ul>
What is Business Intelligence (BI)? <ul><ul><li>“ Business Intelligence (BI) is the process of transforming data into </li></ul></ul><ul><ul><li>information , information into knowledge and through iterative discoveries </li></ul></ul><ul><ul><li>turning knowledge into Intelligence .” </li></ul></ul><ul><ul><ul><ul><li>— Gartner group </li></ul></ul></ul></ul>
Objective of Business Intelligence Value Volume BI can be defined as taking ‘Decisions based on Data’. The objective of BI is to transform large volumes of data into useful information. Intelligence Knowledge Information Data
Evolution of BI <ul><ul><li>Executive information systems (EIS) </li></ul></ul><ul><ul><li>Management Information System (MIS) </li></ul></ul><ul><ul><li>Decision Support Systems (DSS) </li></ul></ul><ul><ul><li>Business Intelligence (BI) </li></ul></ul>EIS MIS DSS BI
Information <ul><li>Information in an organization can exist in two different types of systems: </li></ul><ul><ul><li>Online Transaction Processing (OLTP) systems </li></ul></ul><ul><ul><li>(Operational Systems) </li></ul></ul><ul><ul><li>Data Warehouse (DWH) systems </li></ul></ul><ul><li>OLTP and DWH systems have different purposes, business needs and users. </li></ul>
Features of OLTP Systems <ul><li>OLTP systems handle day-to-day transactions and operations of the business. They are </li></ul><ul><li>high performance, high throughput systems. They run mission critical applications. </li></ul><ul><li>OLTP systems store, update and retrieve Operational Data. </li></ul><ul><li>Operational Data is the data that runs the business. </li></ul><ul><li>Some of the Operational systems that we interact with are Net Banking system, Tax Accounting </li></ul><ul><li>system, Payroll package, Order-processing system, SAP, Airline reservation system etc. </li></ul>
Why OLTP systems are not suitable for analysis?
OLTP | Analytical Reporting
Supports day-to-day operations | Historical information to analyze
Data stored at transaction level | Data required at summary level
Islands of operational systems | Data needs to be integrated
Database design: Normalized | Database design: Dimensional
OLTP Versus Data Warehouse
Property | OLTP | Data Warehouse
Response Time | Sub seconds to seconds | Seconds to hours
Operations | DML; data goes in | Primarily read only; data goes out
Age of Data | Current; 30 to 60 days, or 1 to 2 years | Historical; snapshots over time (quarter, month, etc.)
Data Organization | Application | Subject, time
Size | Small to large; few MB to GB | Large to very large; few GB to TB
OLTP Versus Data Warehouse
Property | OLTP | Data Warehouse
Data Sources | Operational, Internal | Operational, Internal, External
Activities | Processes | Analysis
Grain | Atomic (detail), transactional level, highest granularity | Atomic and/or summarized (aggregate), less granularity
No. of records | One record at a time | Thousands to millions of records
Database Design | Normalized | De-normalized, star schema
Data Extract Processing <ul><ul><li>A logical progression towards a data warehouse – Data Extracts </li></ul></ul><ul><ul><li>End user computing offloaded from the operational environment </li></ul></ul><ul><ul><li>User’s own data </li></ul></ul>Decision makers Operational systems Extracts
Issues with Data Extract Programs Extracts Operational systems Decision makers Extract Explosion
Data Quality Issues with Extract Processing <ul><ul><li>No common time basis </li></ul></ul><ul><ul><li>Different calculation algorithms </li></ul></ul><ul><ul><li>Different levels of extraction </li></ul></ul><ul><ul><li>Different levels of granularity </li></ul></ul><ul><ul><li>Different data field names </li></ul></ul><ul><ul><li>Different data field meanings </li></ul></ul><ul><ul><li>Missing information </li></ul></ul><ul><ul><li>No data correction rules </li></ul></ul><ul><ul><li>No Metadata </li></ul></ul><ul><ul><li>No drill-down capability </li></ul></ul>
Advances Enabling Data Warehousing <ul><ul><li>Technology </li></ul></ul><ul><ul><li>Hardware </li></ul></ul><ul><ul><li>Operating system </li></ul></ul><ul><ul><li>Database </li></ul></ul><ul><ul><li>BI Tools & Applications </li></ul></ul><ul><ul><li>Business </li></ul></ul><ul><ul><li>Competition </li></ul></ul>
Definition of a Data Warehouse <ul><li>“ A data warehouse is a subject oriented , integrated , non-volatile , </li></ul><ul><li>and time-variant collection of data to support management decisions.” </li></ul><ul><li> — Bill Inmon </li></ul>
Data Warehouse Properties Integrated Time-variant Nonvolatile Subject- oriented Data Warehouse
Subject-Oriented <ul><li>Data is categorized and stored by business subject rather than by application. </li></ul>OLTP Applications Equity Plans Shares Insurance Loans Savings Data Warehouse Subject Customer financial information
Integrated <ul><li>Data on a given subject is collected from various sources and stored once. </li></ul>Data Warehouse OLTP Applications Customer Savings Current Accounts Loans
Time-Variant <ul><li>Data is stored as a series of snapshots, each representing a period of time. </li></ul>Data Warehouse
Non-volatile <ul><li>Typically data in the data warehouse is not updated or deleted. </li></ul>Warehouse Read Load Operational Insert, Update, Delete, or Read
Changing Warehouse Data Operational Databases Warehouse Database First time load Refresh Refresh Refresh Purge or Archive
Goals of a Data Warehouse <ul><li>The Data Warehouse must assist in the decision-making process </li></ul><ul><li>The Data Warehouse must meet the requirements of the business community </li></ul><ul><li>The Data Warehouse must provide easy access to information </li></ul><ul><li>The Data Warehouse must present information consistently and accurately </li></ul><ul><li>The Data Warehouse must be adaptive and resilient to change </li></ul><ul><li>The Data Warehouse must provide secure access to information </li></ul>
Usage Curves <ul><ul><li>Operational system is predictable </li></ul></ul><ul><ul><li>Data warehouse: </li></ul></ul><ul><ul><ul><li>Variable </li></ul></ul></ul><ul><ul><ul><li>Random </li></ul></ul></ul>
User Expectations <ul><ul><li>Control expectations </li></ul></ul><ul><ul><li>Set achievable targets for query response </li></ul></ul><ul><ul><li>Set SLAs </li></ul></ul><ul><ul><li>Educate business and end users </li></ul></ul><ul><ul><li>Growth and use is exponential </li></ul></ul>
Enterprisewide Data Warehouse <ul><li>Large scale implementation </li></ul><ul><li>Scopes the entire business </li></ul><ul><li>Data from all subject areas </li></ul><ul><li>Developed incrementally </li></ul><ul><li>Single source of enterprisewide data </li></ul><ul><li>Synchronized enterprisewide data </li></ul><ul><li>Single distribution point to dependent data marts </li></ul>
Data Warehouse Vocabulary <ul><li>Grain of Data - Granularity </li></ul><ul><li>Grain is defined as the level of detail of data captured in the data </li></ul><ul><li>warehouse. The more detail, the higher the granularity, and vice versa. </li></ul><ul><li>Fact table </li></ul><ul><li>It is similar to the transaction table in an OLTP system. </li></ul><ul><li>It stores the facts or measures of the business. </li></ul><ul><li>E.g.: SALES, ORDERS </li></ul><ul><li>Dimension table </li></ul><ul><li>It is similar to the master table in an OLTP system. </li></ul><ul><li>It stores the textual descriptors of the business. </li></ul><ul><li>E.g.: CUSTOMER, PRODUCT </li></ul>
Data Marts <ul><li>A Data mart is a subset of data warehouse. </li></ul><ul><li>A data mart is designed for a single line of business (LOB) or functional area </li></ul><ul><li>such as sales, finance, or marketing. </li></ul>
Data Warehouses Versus Data Marts
Property | Data Warehouse | Data Mart
Scope | Enterprise | Department
Subjects | Multiple | Single-subject, LOB
Data Source | Many | Few
Implementation time | Months to years | Months
Size | 100 GB to > 1 TB | < 100 GB
Initial effort, cost, risk | Higher | Lower
Next level of migration | Data Mart | Data Warehouse
Approach | Top-down | Bottom-up
Top-Down Approach Build the Data Warehouse Build the Data Marts
Top-Down Approach Data Warehouse Data Marts Marketing Sales Finance HR Flat Files Marketing Sales Finance Operational Systems External Data Operations Data Legacy Data External Data
Bottom-Up Approach Build Data Marts Build the Data Warehouse
Bottom-Up Approach Data Warehouse Data Marts Marketing Sales Finance Operational Systems External Data Operations Data Legacy Data
Hybrid Approach <ul><li>The hybrid approach tries to blend the best of both </li></ul><ul><ul><ul><ul><ul><li>“top-down” and “bottom-up” approaches </li></ul></ul></ul></ul></ul>Starts by designing DW and DM models synchronously. Build out the first 2-3 DMs that are mutually exclusive and critical. Backfill a DW behind the DMs. Build the enterprise model and move atomic data to the DW.
Federated Approach This approach is referred to as “an architecture of architectures”. Emphasizes the need to integrate new and existing heterogeneous BI environments.
Data Warehouse Components Source Systems Staging Area Presentation Area Access Tools Operational External Legacy Metadata Repository Data Marts Data Warehouse ODS
Examining Data Sources <ul><ul><li>Production </li></ul></ul><ul><ul><li>Archive </li></ul></ul><ul><ul><li>Internal </li></ul></ul><ul><ul><li>External </li></ul></ul>
Production Data <ul><li>Operating system platforms </li></ul><ul><li>File systems </li></ul><ul><li>Database systems </li></ul><ul><li>Vertical applications </li></ul>IMS DB2 Oracle Sybase Informix VSAM SAP Dun and Bradstreet Financials Oracle Financials Baan PeopleSoft
Archive Data <ul><ul><li>Historical data </li></ul></ul><ul><ul><li>Useful for analysis over long periods of time </li></ul></ul><ul><ul><li>Useful for first-time load </li></ul></ul>Operation databases Warehouse database
Internal Data <ul><ul><li>Planning, sales, and marketing organization data </li></ul></ul><ul><ul><li>Maintained in the form of: </li></ul></ul><ul><ul><ul><li>Spreadsheets (structured) </li></ul></ul></ul><ul><ul><ul><li>Documents (unstructured) </li></ul></ul></ul><ul><ul><li>Treated like any other source data </li></ul></ul>Warehouse database Planning Accounting Marketing
External Data <ul><ul><li>Information from outside the organization </li></ul></ul><ul><ul><li>Issues of frequency, format, and predictability </li></ul></ul><ul><ul><li>Described and tracked using metadata </li></ul></ul>A.C. Nielsen, IRI, IMRB, ORG-MARG Barron's Dun and Bradstreet Purchased databases Wall Street Journal Economic forecasts Competitive information Warehousing databases
Extraction, Transformation and Loading (ETL) <ul><li>“ Effective data extract, transform and load (ETL) processes represent the number one success factor for your data warehouse project and can absorb up to 70 percent of the time spent on a typical data warehousing project.” </li></ul><ul><ul><li>DM Review, March 2001 </li></ul></ul>Source Target Staging Area
Staging Models <ul><li>Remote staging model </li></ul><ul><li>Onsite staging model </li></ul>
Remote Staging Model Data staging area within the warehouse environment, or data staging area in its own independent environment. [Diagram, both variants: Operational system → Extract → Staging area (Transform) → Load → Warehouse]
On-site Staging Model <ul><li>Data staging area within the operational environment, possibly affecting the operational system </li></ul>Extract Load Warehouse Operational system Transform Staging area
Mapping Data <ul><li>Mapping data defines: </li></ul><ul><ul><li>Which operational attributes to use </li></ul></ul><ul><ul><li>How to transform the attributes for the warehouse </li></ul></ul><ul><ul><li>Where the attributes exist in the warehouse </li></ul></ul>Metadata mapping: File A F1 → Staging File One Number; F2 → Name; F3 → DOB. Example: File A (F1 = 123, F2 = Bloggs, F3 = 10/12/56) becomes Staging File One (Number = USA123, Name = Mr. Bloggs, DOB = 10-Dec-56).
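The File A to Staging File One mapping can be sketched in Python as follows. This is an illustration under assumptions: the slide does not say where the "USA" prefix or the "Mr." title come from, so both are passed in as hypothetical parameters, and F3 is assumed to be in DD/MM/YY format (which is consistent with 10/12/56 becoming 10-Dec-56).

```python
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def map_file_a_record(record, country_code="USA", title="Mr."):
    """Map one File A record (F1, F2, F3) to a Staging File One record."""
    day, month, year = record["F3"].split("/")   # assumed DD/MM/YY
    return {
        "Number": f"{country_code}{record['F1']}",
        "Name": f"{title} {record['F2']}",
        "DOB": f"{day}-{MONTHS[int(month) - 1]}-{year}",
    }


staged = map_file_a_record({"F1": "123", "F2": "Bloggs", "F3": "10/12/56"})
```

In practice the mapping rules themselves would be held as metadata rather than hard-coded, so that the same routine can serve several source files.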
Transformation Routines <ul><ul><li>Cleaning data </li></ul></ul><ul><ul><li>Eliminating inconsistencies </li></ul></ul><ul><ul><li>Adding elements </li></ul></ul><ul><ul><li>Merging data </li></ul></ul><ul><ul><li>Integrating data </li></ul></ul><ul><ul><li>Transforming data before load </li></ul></ul>
Data Anomalies <ul><ul><li>No unique key </li></ul></ul><ul><ul><li>Data naming and coding anomalies </li></ul></ul><ul><ul><li>Data meaning anomalies between groups </li></ul></ul><ul><ul><li>Spelling and text inconsistencies </li></ul></ul>
CUSNUM | NAME | ADDRESS
90233479 | Oracle Limited | 100 N.E. 1st St.
90233489 | Oracle Computing | 15 Main Road, Ft. Lauderdale
90234889 | Oracle Corp. UK | 15 Main Road, Ft. Lauderdale, FLA
90345672 | Oracle Corp UK Ltd | 181 North Street, Key West, FLA
Multipart Keys Problem <ul><li>Multipart keys </li></ul>Country code Sales territory Product number Salesperson code Product code = 12 M 65431 3 45
Multiple Local Standards Problem <ul><ul><li>Multiple local standards </li></ul></ul><ul><ul><li>Tools or filters to preprocess </li></ul></ul>cm inches cm USD 600 1,000 GBP FF 9,990 DD/MM/YY MM/DD/YY DD-Mon-YY
Multiple Source Files Problem <ul><ul><li>Added complexity of multiple source files </li></ul></ul>Transformed data Multiple source files Logic to detect correct source
Missing Values Problem <ul><li>Solution: </li></ul><ul><ul><li>Ignore </li></ul></ul><ul><ul><li>Wait </li></ul></ul><ul><ul><li>Mark rows </li></ul></ul><ul><ul><li>Extract when time-stamped </li></ul></ul>If NULL then field = ‘A’ A
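A minimal Python sketch that combines two of the options above: it marks the rows that had a missing value and applies the "If NULL then field = 'A'" rule. The field names and sample rows are hypothetical.

```python
def fill_missing(rows, field, default="A"):
    """Return copies of the rows with missing values replaced and flagged."""
    cleaned = []
    for row in rows:
        row = dict(row)                      # leave the extracted row untouched
        row["was_missing"] = row.get(field) is None
        if row["was_missing"]:
            row[field] = default
        cleaned.append(row)
    return cleaned


source_rows = [{"id": 1, "grade": None}, {"id": 2, "grade": "B"}]
cleaned_rows = fill_missing(source_rows, "grade")
```

Keeping the `was_missing` flag lets analysts distinguish a real "A" from a defaulted one later.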
Duplicate Values Problem <ul><li>Solution: </li></ul><ul><ul><li>SQL self-join techniques </li></ul></ul><ul><ul><li>RDBMS constraint utilities </li></ul></ul>Example of duplicated values: ACME Inc, ACME Inc, ACME Inc
SELECT ...
FROM table_a, table_b
WHERE table_a.key (+) = table_b.key
UNION
SELECT ...
FROM table_a, table_b
WHERE table_a.key = table_b.key (+);
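A self-join for finding duplicates can be sketched with Python's built-in `sqlite3` module. This is an illustration only: the table and column names and the "Other Co" row are hypothetical, and SQLite is used simply because it runs in memory without setup.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO customer (name) VALUES (?)",
    [("ACME Inc",), ("ACME Inc",), ("ACME Inc",), ("Other Co",)],
)

# Self-join: a row is a duplicate if an earlier row has the same name.
duplicates = conn.execute("""
    SELECT DISTINCT b.id, b.name
    FROM customer a
    JOIN customer b ON a.name = b.name AND a.id < b.id
    ORDER BY b.id
""").fetchall()
```

The first "ACME Inc" row is treated as the survivor, and the two later rows are reported as duplicates.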
Element Names Problem <ul><li>Solution: </li></ul><ul><ul><li>Common naming conventions </li></ul></ul>Customer Customer Client Contact Name
Element Meaning Problem <ul><ul><li>Avoid misinterpretation </li></ul></ul><ul><ul><li>Complex solution </li></ul></ul><ul><ul><li>Document meaning in metadata </li></ul></ul>Product number p_no Purchase order number Policy number
Input Format Problem <ul><li>Different character sets or data-types </li></ul>ASCII EBCDIC 12373 “123-73” ACME Co. ברוכים הבאים Beer (Pack of 8)
Referential Integrity Problem <ul><li>Solution: </li></ul><ul><ul><li>SQL anti-join (outer join) </li></ul></ul><ul><ul><li>Server constraints </li></ul></ul><ul><ul><li>Dedicated tools </li></ul></ul>
Department table: 10, 20, 30, 40
Emp | Name | Department
1099 | Smith | 10
1289 | Jones | 20
1234 | Doe | 50
6786 | Harris | 60
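The anti-join can be sketched with Python's built-in `sqlite3` module, using the department and employee values from the slide. The table and column names are assumptions, and the outer-join form shown is only one way to write an anti-join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE department (department INTEGER PRIMARY KEY);
    CREATE TABLE emp (emp INTEGER PRIMARY KEY, name TEXT, department INTEGER);
    INSERT INTO department VALUES (10), (20), (30), (40);
    INSERT INTO emp VALUES
        (1099, 'Smith', 10), (1289, 'Jones', 20),
        (1234, 'Doe', 50), (6786, 'Harris', 60);
""")

# Outer join, then keep only the employees with no matching department.
orphans = conn.execute("""
    SELECT e.emp, e.name, e.department
    FROM emp e
    LEFT JOIN department d ON d.department = e.department
    WHERE d.department IS NULL
    ORDER BY e.emp
""").fetchall()
```

Doe and Harris reference departments 50 and 60, which do not exist in the department table, so they are the rows that violate referential integrity.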
Name and Address Problem <ul><ul><li>Single-field format </li></ul></ul><ul><ul><li>Multiple-field format </li></ul></ul>
Single-field format: Mr. J. Smith,100 Main St., Bigtown, County Luth, 23565
Multiple-field format: Name = Mr. J. Smith; Street = 100 Main St.; Town = Bigtown; Country = County Luth; Code = 23565
Database 1 (NAME | LOCATION): DIANNE ZIEFELD | N100; HARRY H. ENFIELD | M300
Database 2 (NAME | LOCATION): ZIEFELD, DIANNE | 100; ENFIELD, HARRY H | 300
Transformation Timing and Location <ul><ul><li>Transformation is performed: </li></ul></ul><ul><ul><ul><li>Before load </li></ul></ul></ul><ul><ul><ul><li>In parallel while loading </li></ul></ul></ul><ul><ul><li>Can be initiated at different points: </li></ul></ul><ul><ul><ul><li>On the operational platform </li></ul></ul></ul><ul><ul><ul><li>In a separate staging area </li></ul></ul></ul>
Adding a Date Stamp: Fact Tables and Dimensions Item Table Item_id Dept_id Time_key Store Table Store_id District_id Time_key Sales Fact Table Item_id Store_id Time_key Sales_dollars Sales_units Time Table Week_id Period_id Year_id Time_key Product Table Product_id Time_key Product_desc
Summarizing Data <ul><li>1. During extraction on staging area </li></ul><ul><li>2. After loading to the warehouse server </li></ul>Operational databases Warehouse database Staging area
Loading Data into the Warehouse <ul><ul><li>Loading moves the data into the warehouse </li></ul></ul><ul><ul><li>Loading can be time-consuming: </li></ul></ul><ul><ul><ul><li>Consider the load window </li></ul></ul></ul><ul><ul><ul><li>Schedule and automate the loading </li></ul></ul></ul><ul><ul><li>Initial load moves large volumes of data </li></ul></ul><ul><ul><li>Subsequent refresh moves smaller volumes of data </li></ul></ul>Operational databases Warehouse database Staging area Extract Transform Transport, Load
Load Window Requirements <ul><ul><li>Time available for entire ETL process </li></ul></ul><ul><ul><li>Plan </li></ul></ul><ul><ul><li>Test </li></ul></ul><ul><ul><li>Prove </li></ul></ul><ul><ul><li>Monitor </li></ul></ul>[Figure: 24-hour timeline marked every 3 hours, showing the user access period and two load windows]
Planning the Load Window <ul><ul><li>Plan and build processes according to a strategy. </li></ul></ul><ul><ul><li>Consider volumes of data. </li></ul></ul><ul><ul><li>Identify technical infrastructure. </li></ul></ul><ul><ul><li>Ensure currency of data. </li></ul></ul><ul><ul><li>Consider user access requirements first. </li></ul></ul><ul><ul><li>High availability requirements may mean a small load window. </li></ul></ul>[Figure: 24-hour timeline marked every 3 hours, showing the user access period]
Initial Load and Refresh <ul><li>Initial Load: </li></ul><ul><ul><li>Single event that populates the database with historical data </li></ul></ul><ul><ul><li>Involves large volumes of data </li></ul></ul><ul><ul><li>Employs distinct ETL tasks </li></ul></ul><ul><ul><li>Involves large amounts of processing after load </li></ul></ul><ul><li>Refresh: </li></ul><ul><ul><li>Performed according to a business cycle </li></ul></ul><ul><ul><li>Less data to load than first-time load </li></ul></ul><ul><ul><li>complex ETL tasks </li></ul></ul><ul><ul><li>Smaller amounts of post-load processing </li></ul></ul>
Data Refresh Models <ul><li>Extract Processing Environment </li></ul><ul><ul><li>After each time interval, build a new snapshot of the database. </li></ul></ul><ul><ul><li>Purge old snapshots. </li></ul></ul>T1 T2 T3 Operational databases
Data Refresh Models <ul><li>Warehouse Environment </li></ul><ul><ul><li>Build a new database the first time. </li></ul></ul><ul><ul><li>After each time interval, add delta changes to database. </li></ul></ul><ul><ul><li>Archive or purge oldest data. </li></ul></ul>T1 T2 T3 Operational databases
Post-Processing of Loaded Data Post-processing of loaded data Extract Transform Load Warehouse Staging area Create indexes Generate keys Summarize Filter
Unique Indexes <ul><ul><li>Disable constraints before load. </li></ul></ul><ul><ul><li>Enable constraints after load. </li></ul></ul><ul><ul><li>Re-create index if necessary. </li></ul></ul>Load data Disable constraints Enable constraints Create index Reprocess Catch errors
Creating Derived Keys <ul><li>The use of a derived key (sometimes referred to as a generalized, artificial, synthetic, surrogate, or warehouse key) is recommended to maintain the uniqueness of a row. </li></ul><ul><li>Method </li></ul><ul><ul><li>Concatenate key </li></ul></ul><ul><ul><li>Assign a number sequentially from a list </li></ul></ul>109908 01 109908 109908 100
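The two methods can be sketched in Python as follows. The reading of the slide's figures is an assumption: 109908 is taken to be an operational key, 01 a suffix appended to it, and 100 the first number handed out from a sequential list. The function names are hypothetical.

```python
from itertools import count


def concatenated_key(operational_id, suffix):
    """Method 1: append a two-digit suffix to the operational key."""
    return f"{operational_id}{suffix:02d}"


_next_key = count(100)   # assumed starting value of the sequential list
_assigned = {}


def sequential_key(operational_id):
    """Method 2: hand out the next number the first time an id is seen."""
    if operational_id not in _assigned:
        _assigned[operational_id] = next(_next_key)
    return _assigned[operational_id]


key_a = concatenated_key("109908", 1)
key_b = sequential_key("109908")
```

A sequential key carries no meaning from the source system, so it stays stable even if the operational key format changes. A concatenated key is easier to trace back to its source row.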
Metadata Users Metadata repository End users Developers IT Professionals
Data Warehouse Design <ul><li>Dimensional Modeling </li></ul><ul><li>Identify the ‘ Business Process ’ </li></ul><ul><li>Determine the ‘ Grain ’ </li></ul><ul><li>Identify the ‘ Facts ’ </li></ul><ul><li>Identify the ‘ Dimensions ’ </li></ul>
Business Requirements Drive the Design Process <ul><ul><li>Primary input </li></ul></ul><ul><ul><li>Secondary input </li></ul></ul>Existing Metadata Production ERD Model Business Requirements Research
Perform Strategic Analysis <ul><ul><li>Identify crucial business processes </li></ul></ul><ul><ul><li>Understand business processes </li></ul></ul><ul><ul><li>Prioritize and select the business processes to implement </li></ul></ul>Business Benefit Low High Low High Feasibility
Using a Business Process Matrix DW Bus Architecture Promotion Channel Product Date Inventory Customer Returns Sales Business Processes Business Dimensions
Conformed Dimensions <ul><li>Dimensions are conformed when they are exactly the same including the keys or one is a perfect subset of the other. </li></ul><ul><li>DW bus architecture provides a standard set of conformed dimensions </li></ul>
Determine the Grain YEAR? QUARTER? MONTH? WEEK? DAY?
Documenting the Granularity <ul><li>Is an important design consideration </li></ul><ul><li>Determines the level of detail </li></ul><ul><li>Is determined by business needs </li></ul>Low-level grain (Transaction-level data) High-level grain (Summary data)
Defining Time Granularity Fiscal Time Hierarchy Current dimension grain Future dimension grain Fiscal Year Fiscal Quarter Fiscal Month Fiscal Week Day
Identify the Facts and Dimensions <ul><li>The attribute is perceived as constant or discrete: </li></ul><ul><ul><li>Product </li></ul></ul><ul><ul><li>Location </li></ul></ul><ul><ul><li>Time </li></ul></ul><ul><ul><li>Size </li></ul></ul><ul><li>The attribute varies continuously: </li></ul><ul><ul><li>Balance </li></ul></ul><ul><ul><li>Units Sold </li></ul></ul><ul><ul><li>Cost </li></ul></ul><ul><ul><li>Sales </li></ul></ul>Facts (Measures) Dimensions
Data Warehouse Environment Data Structures <ul><li>The data structures that are commonly found in a data warehouse environment: </li></ul><ul><ul><li>Third normal form (3NF) </li></ul></ul><ul><ul><li>Star schema </li></ul></ul><ul><ul><li>Snowflake schema </li></ul></ul>
Star Schema Customer Location Sales Supplier Product
Star Schema Model Central fact table: Sales Fact Table (Product_id, Store_id, Item_id, Day_id, Sales_amount, Sales_units, ...). Denormalized dimensions: Product Table (Product_id, Product_desc, ...); Time Table (Day_id, Month_id, Year_id, ...); Item Table (Item_id, Item_desc, ...); Store Table (Store_id, District_id, ...).
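A typical query against a star schema joins the central fact table to the dimensions it needs and aggregates the measures. The sketch below uses Python's built-in `sqlite3` module with a cut-down version of the model (store and time dimensions only). The table name `time_dim`, the key formats, and all sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE store (store_id INTEGER PRIMARY KEY, district_id INTEGER);
    CREATE TABLE time_dim (day_id INTEGER PRIMARY KEY,
                           month_id INTEGER, year_id INTEGER);
    CREATE TABLE sales_fact (
        store_id INTEGER REFERENCES store (store_id),
        day_id INTEGER REFERENCES time_dim (day_id),
        sales_amount REAL,
        sales_units INTEGER
    );
    INSERT INTO store VALUES (1, 10), (2, 10), (3, 20);
    INSERT INTO time_dim VALUES (20010105, 200101, 2001),
                                (20010212, 200102, 2001);
    INSERT INTO sales_fact VALUES
        (1, 20010105, 100.0, 4),
        (2, 20010105, 50.0, 2),
        (3, 20010105, 75.0, 3),
        (1, 20010212, 20.0, 1);
""")

# One join per dimension, then roll the measures up to district and month.
totals = conn.execute("""
    SELECT s.district_id, t.month_id,
           SUM(f.sales_amount), SUM(f.sales_units)
    FROM sales_fact f
    JOIN store s ON s.store_id = f.store_id
    JOIN time_dim t ON t.day_id = f.day_id
    GROUP BY s.district_id, t.month_id
    ORDER BY s.district_id, t.month_id
""").fetchall()
```

Each dimension is one join away from the fact table, which is what keeps star-schema queries short.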
Fact Table Characteristics <ul><ul><li>Contain numerical metrics of the business </li></ul></ul><ul><ul><li>Can hold large volumes of data </li></ul></ul><ul><ul><li>Can grow quickly </li></ul></ul><ul><ul><li>Can contain base, derived, and summarized data </li></ul></ul><ul><ul><li>Are typically additive </li></ul></ul><ul><ul><li>Are joined to dimension tables </li></ul></ul><ul><ul><li>through foreign keys that reference </li></ul></ul><ul><ul><li>Primary keys in the dimension tables </li></ul></ul>Sales Fact Table Product_id Store_id Item_id Day_id Sales_amount Sales_units ...
Dimension Table Characteristics <ul><ul><li>Contain descriptors of the business: textual information that represents the attributes of the business </li></ul></ul><ul><ul><li>Contain relatively static data </li></ul></ul><ul><ul><li>Are usually smaller than fact tables </li></ul></ul><ul><ul><li>Are joined to a fact table through a foreign key reference </li></ul></ul>Example: Item Table (Item_id, Item_desc, ...)
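The fact-to-dimension join described above can be tried out with Python's built-in `sqlite3` module. This is a sketch under stated assumptions: the table and column names follow the slide's Sales Fact, Store, and Item tables, while the sample rows and the district-level query are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE store (store_id INTEGER PRIMARY KEY, district_id INTEGER);
CREATE TABLE item  (item_id  INTEGER PRIMARY KEY, item_desc TEXT);
-- Fact table: foreign keys to the dimensions plus additive measures.
CREATE TABLE sales_fact (
    store_id     INTEGER REFERENCES store(store_id),
    item_id      INTEGER REFERENCES item(item_id),
    day_id       INTEGER,
    sales_amount REAL,
    sales_units  INTEGER
);
-- Hypothetical sample data.
INSERT INTO store VALUES (1, 10), (2, 10), (3, 20);
INSERT INTO item  VALUES (100, 'Widget'), (200, 'Gadget');
INSERT INTO sales_fact VALUES
    (1, 100, 1, 50.0, 5),
    (2, 100, 1, 30.0, 3),
    (3, 200, 1, 80.0, 4),
    (1, 200, 2, 40.0, 2);
""")

# One join from the fact table to a dimension, then aggregate the measures.
sales_by_district = conn.execute("""
    SELECT s.district_id, SUM(f.sales_amount), SUM(f.sales_units)
    FROM sales_fact AS f
    JOIN store AS s ON s.store_id = f.store_id
    GROUP BY s.district_id
    ORDER BY s.district_id
""").fetchall()
```

Because the Store dimension is denormalized, a single join is enough to report sales by district.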
Advantages of Using a Star Dimensional Model <ul><ul><li>Design improves performance by reducing table joins. </li></ul></ul><ul><ul><li>The model is easy for users to understand. </li></ul></ul><ul><ul><li>Supports multidimensional analysis. </li></ul></ul><ul><ul><li>Provides an extensible design </li></ul></ul><ul><ul><li>Primary keys represent a dimension. </li></ul></ul><ul><ul><li>Non-foreign key columns are values. </li></ul></ul><ul><ul><li>Facts are usually highly normalized. </li></ul></ul><ul><ul><li>Dimensions are completely de-normalized. </li></ul></ul><ul><ul><li>End users can express complex queries. </li></ul></ul>
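Base and Derived Data Payroll table
<table>
<tr><th>Emp_FK</th><th>Month_FK</th><th>Salary (base data)</th><th>Comm (base data)</th><th>Comp (derived data)</th></tr>
<tr><td>101</td><td>05</td><td>1,000</td><td>0</td><td>1,000</td></tr>
<tr><td>102</td><td>05</td><td>1,500</td><td>100</td><td>1,600</td></tr>
<tr><td>103</td><td>05</td><td>1,000</td><td>200</td><td>1,200</td></tr>
<tr><td>104</td><td>05</td><td>1,500</td><td>1,000</td><td>2,500</td></tr>
</table>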
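In the payroll table, Comp is derived from the two base columns: each Comp value equals Salary plus Comm. A minimal sketch of that derivation, using the slide's figures (the dictionary layout and the `add_comp` helper are hypothetical):

```python
# Base data from the payroll slide.
payroll_base = [
    {"emp": 101, "month": 5, "salary": 1000, "comm": 0},
    {"emp": 102, "month": 5, "salary": 1500, "comm": 100},
    {"emp": 103, "month": 5, "salary": 1000, "comm": 200},
    {"emp": 104, "month": 5, "salary": 1500, "comm": 1000},
]

def add_comp(row):
    """Return a copy of the row with the derived Comp column added."""
    return {**row, "comp": row["salary"] + row["comm"]}

payroll = [add_comp(row) for row in payroll_base]
```

Storing the derived value in the fact table trades some storage for simpler and faster queries.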
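Translating Business Measures into a Fact Table
<table>
<tr><th>Business measure</th><th>Fact</th><th>Type</th></tr>
<tr><td>Number of Items</td><td>Number of Items</td><td>Base</td></tr>
<tr><td>Amount</td><td>Item Amount</td><td>Base</td></tr>
<tr><td>Cost</td><td>Item Cost</td><td>Base</td></tr>
<tr><td>Profit</td><td>Profit</td><td>Derived</td></tr>
</table>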
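Snowflake Model <ul><li>Order (fact table): History_FK, Customer_FK, Product_FK, Channel_FK, Item_nbr, Item_desc, Quantity, Discnt_price, Unit-price, Order_amt, … </li></ul><ul><li>History: History_PK, . . . . </li></ul><ul><li>Customer: Customer_PK, . . . . </li></ul><ul><li>Product: Product_PK, . . . . </li></ul><ul><li>Channel: Channel_PK, Web_PK, Channel_desc </li></ul><ul><li>Web: Web_PK, Web_url </li></ul>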
Snowflake Schema Model <ul><ul><li>Provides for speedier data loading </li></ul></ul><ul><ul><li>Can become large and unmanageable </li></ul></ul><ul><ul><li>Degrades query performance </li></ul></ul><ul><ul><li>More complex metadata </li></ul></ul><ul><ul><li>Facts are usually highly normalized </li></ul></ul><ul><ul><li>Dimensions are also normalized </li></ul></ul>Example of normalized dimension tables: Country, State, County, City
Bracketed Dimensions <ul><ul><li>Enhance performance and analytical capabilities </li></ul></ul><ul><ul><li>Create groups of values for attributes with many unique values, such as income ranges and age brackets </li></ul></ul><ul><ul><li>Minimize the need for full table scans by pre-aggregating data </li></ul></ul>
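Bracketing Dimensions Diagram: Income fact (Customer_PK, Bracket_FK), Customer dimension (Customer_PK, Bracket_FK), Bracket dimension (Bracket_PK)
<table>
<tr><th>Bracket_PK</th><th>Income (10Ks)</th><th>Marital Status</th><th>Gender</th><th>Age</th></tr>
<tr><td>1</td><td>60-90</td><td>Single</td><td>Male</td><td><21</td></tr>
<tr><td>2</td><td>60-90</td><td>Single</td><td>Male</td><td>21-35</td></tr>
<tr><td>3</td><td>60-90</td><td>Single</td><td>Male</td><td>35-55</td></tr>
<tr><td>4</td><td>60-90</td><td>Single</td><td>Male</td><td>>55</td></tr>
<tr><td>5</td><td>60-90</td><td>Single</td><td>Female</td><td><21</td></tr>
<tr><td>6</td><td>60-90</td><td>Single</td><td>Female</td><td>21-35</td></tr>
</table>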
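Assigning a customer to a bracket amounts to mapping each raw attribute to its range and then looking up the bracket key. The sketch below covers only the six bracket rows shown on the slide. The boundary handling is an assumption, because the slide's ranges overlap at 35 and 55: here 21-35 means 21 to 34, and 35-55 means 35 to 55. The function names are hypothetical.

```python
from bisect import bisect_right

# Assumed cut points: <21, 21-34, 35-55, >55 (the slide's labels overlap at 35 and 55).
AGE_CUTS = [21, 35, 56]
AGE_LABELS = ["<21", "21-35", "35-55", ">55"]

def age_bracket(age):
    """Map an exact age to one of the slide's age ranges."""
    return AGE_LABELS[bisect_right(AGE_CUTS, age)]

# Bracket rows copied from the slide: (income, marital status, gender, age) -> Bracket_PK
BRACKETS = {
    ("60-90", "Single", "Male", "<21"): 1,
    ("60-90", "Single", "Male", "21-35"): 2,
    ("60-90", "Single", "Male", "35-55"): 3,
    ("60-90", "Single", "Male", ">55"): 4,
    ("60-90", "Single", "Female", "<21"): 5,
    ("60-90", "Single", "Female", "21-35"): 6,
}

def bracket_key(income_range, marital_status, gender, age):
    """Return the Bracket_PK to store as Bracket_FK for this customer."""
    return BRACKETS[(income_range, marital_status, gender, age_bracket(age))]
```

Queries can then filter or group on the small bracket dimension instead of scanning many distinct age and income values.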
Identifying Analytical Hierarchies Business hierarchies describe organizational structure and logical parent-child relationships within the data. Store dimension attributes: Store ID, Store Desc, Location, Size, Type, District ID, District Desc, Region ID, Region Desc. Organization hierarchy: Region → District → Store
Multiple Hierarchies Store ID Store Desc Location Size Type District ID District Desc Region ID Region Desc City ID City Desc County ID County Desc State ID State Desc Region District Store Organization hierarchy Store dimension Region District Store Geography hierarchy
Multiple Time Hierarchies <ul><li>Fiscal time hierarchy: Fiscal year → Fiscal quarter → Fiscal month → Fiscal week </li></ul><ul><li>Calendar time hierarchy: Calendar year → Calendar quarter → Calendar month → Calendar week </li></ul>
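A single date belongs to both hierarchies at once, so a time dimension row usually carries both sets of attributes. The sketch below derives them for one date. It assumes a fiscal year that starts in April and is named after the calendar year in which it starts. Both are illustrative assumptions, not rules from the course. Week levels are left out because week numbering conventions vary.

```python
from datetime import date

FISCAL_YEAR_START_MONTH = 4  # assumption: fiscal year runs April to March

def date_dimension_row(d, fiscal_start=FISCAL_YEAR_START_MONTH):
    """Derive calendar and fiscal hierarchy attributes for one date."""
    fiscal_month = (d.month - fiscal_start) % 12 + 1
    return {
        "day": d.isoformat(),
        "calendar_year": d.year,
        "calendar_quarter": (d.month - 1) // 3 + 1,
        "calendar_month": d.month,
        # Assumption: the fiscal year takes the calendar year in which it begins.
        "fiscal_year": d.year if d.month >= fiscal_start else d.year - 1,
        "fiscal_quarter": (fiscal_month - 1) // 3 + 1,
        "fiscal_month": fiscal_month,
    }

january_row = date_dimension_row(date(2001, 1, 15))
april_row = date_dimension_row(date(2001, 4, 1))
```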
Drilling Up and Drilling Down Store 5 Store 1 Store 2 Region 2 District 2 District 4 Store 4 Group Market Hierarchy Region 1 District 1 Store 6 Store 3 District 3
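Drilling up aggregates a measure to the parent level of a hierarchy, and drilling down lists the children behind a parent. A minimal sketch follows. The store-to-district and district-to-region assignments and the sales figures are hypothetical, since the slide's diagram does not survive in text form.

```python
from collections import defaultdict

# Hypothetical parent links for a Region > District > Store hierarchy.
STORE_TO_DISTRICT = {"Store 1": "District 1", "Store 2": "District 1", "Store 3": "District 2"}
DISTRICT_TO_REGION = {"District 1": "Region 1", "District 2": "Region 1"}

store_sales = {"Store 1": 100, "Store 2": 150, "Store 3": 200}

def drill_up(values, parent_of):
    """Aggregate values from one hierarchy level to its parent level."""
    totals = defaultdict(int)
    for child, value in values.items():
        totals[parent_of[child]] += value
    return dict(totals)

def drill_down(parent, parent_of):
    """List the members one level below the given parent."""
    return sorted(child for child, p in parent_of.items() if p == parent)

district_sales = drill_up(store_sales, STORE_TO_DISTRICT)
region_sales = drill_up(district_sales, DISTRICT_TO_REGION)
stores_in_district_1 = drill_down("District 1", STORE_TO_DISTRICT)
```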
Drilling Across Region District Stores > 20,000 sq. ft. Group Market hierarchy Region District Store Store City City City hierarchy
Using Time in the Data Warehouse <ul><ul><li>Defining standards for time is critical. </li></ul></ul><ul><ul><li>Aggregation based on time is complex. </li></ul></ul><ul><ul><li>Time is critical to the data warehouse. A consistent representation of time is required for extensibility. </li></ul></ul>Where should the element of time be stored? Time dimension Sales fact
Date Dimension <ul><ul><li>Should Date Dimension be modeled? </li></ul></ul>
Applying the Changes to Data <ul><li>You have a choice of techniques: </li></ul><ul><ul><li>Overwrite a record </li></ul></ul><ul><ul><li>Add a record </li></ul></ul><ul><ul><li>Add a field </li></ul></ul><ul><ul><li>Maintain history </li></ul></ul><ul><ul><li>Add version numbers </li></ul></ul>
Slowly Changing Dimensions (SCDs) <ul><li>What is an SCD? </li></ul><ul><li>It is a dimension whose attribute data changes slowly over time and therefore needs occasional updates. </li></ul><ul><li>Kimball (and others) outline three standard ways to handle this situation: </li></ul><ul><ul><li>Type-I </li></ul></ul><ul><ul><li>Type-II </li></ul></ul><ul><ul><li>Type-III </li></ul></ul>
Type I - Overwriting a Record <ul><ul><li>Easy to implement </li></ul></ul><ul><ul><li>Loses all history </li></ul></ul><ul><ul><li>Not recommended </li></ul></ul>Before: 42135, John Doe, Single. After: 42135, John Doe, Married.
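A Type I change is a plain in-place update. The sketch below replays the slide's example. The dictionary layout and the `scd_type1_update` helper are hypothetical.

```python
# Customer dimension keyed by customer key, holding the slide's example record.
customer_dim = {42135: {"name": "John Doe", "marital_status": "Single"}}

def scd_type1_update(dim, key, attribute, new_value):
    """Overwrite the attribute in place; the previous value is not kept anywhere."""
    dim[key][attribute] = new_value

scd_type1_update(customer_dim, 42135, "marital_status", "Married")
```

After the update there is no trace that John Doe was ever recorded as Single, so facts loaded earlier are now reported under the new value.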
Type II - Adding a New Record <ul><ul><li>History is preserved; dimensions grow. </li></ul></ul><ul><ul><li>Generalized key is created. </li></ul></ul>Original record: 42135, John Doe, Single. New record: 42135_01, John Doe, Married.
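A Type II change appends a new row with its own generalized key and leaves the old row untouched. The sketch below reproduces the slide's `42135_01` key by suffixing a two-digit version counter to the original key. That numbering scheme, the `customer_id` and `dim_key` field names, and the helper itself are assumptions made for illustration.

```python
customer_rows = [
    {"dim_key": "42135", "customer_id": 42135, "name": "John Doe", "marital_status": "Single"},
]

def scd_type2_update(rows, customer_id, attribute, new_value):
    """Append a new version of the customer; earlier versions stay unchanged.

    Assumes rows for a customer are stored in version order.
    """
    versions = [row for row in rows if row["customer_id"] == customer_id]
    new_row = dict(versions[-1])  # copy the latest version
    new_row[attribute] = new_value
    # Generalized key: original key plus a version suffix, e.g. 42135_01.
    new_row["dim_key"] = f"{customer_id}_{len(versions):02d}"
    rows.append(new_row)
    return new_row

married_row = scd_type2_update(customer_rows, 42135, "marital_status", "Married")
```

New facts reference `42135_01`, while older facts keep pointing at `42135`, so history is preserved at the cost of a growing dimension.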
Type III - Adding a Current Field <ul><ul><li>Maintains some history </li></ul></ul><ul><ul><li>Loses intermediate values </li></ul></ul><ul><ul><li>Is enhanced by adding an Effective Date field </li></ul></ul>Before: 42135, John Doe, Single. After: 42135, John Doe, Single, 1-Jan-01, Married.
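A Type III change keeps the original value in one field and writes the latest value, with its effective date, into another. The sketch below follows the slide's example. The field names and the `scd_type3_update` helper are hypothetical.

```python
from datetime import date

customer = {
    "customer_key": 42135,
    "name": "John Doe",
    "original_status": "Single",  # never overwritten
    "current_status": "Single",
    "effective_date": None,       # date the current value took effect
}

def scd_type3_update(row, new_status, effective_date):
    """Overwrite only the 'current' field and its effective date."""
    row["current_status"] = new_status
    row["effective_date"] = effective_date
    return row

scd_type3_update(customer, "Married", date(2001, 1, 1))
```

A further change would overwrite `current_status` again, so only the original and the latest value survive. This is the loss of intermediate values noted above.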
Maintain History <ul><li>History tables: </li></ul><ul><ul><li>One-to-many relationships </li></ul></ul><ul><ul><li>One current record and many history records </li></ul></ul>Product Time Sales HIST_CUST CUSTOMER
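Versioning <ul><ul><li>Avoid double counting </li></ul></ul><ul><ul><li>Facts hold version number </li></ul></ul>Diagram: Sales fact with Time, Product, and Customer dimensions
<table>
<tr><th>Sales.CustId</th><th>Version</th><th>Sales Facts</th></tr>
<tr><td>1234</td><td>1</td><td>$11,000</td></tr>
<tr><td>1234</td><td>2</td><td>$12,000</td></tr>
</table>
<table>
<tr><th>Customer.CustId</th><th>Version</th><th>Customer Name</th></tr>
<tr><td>1234</td><td>1</td><td>Comer</td></tr>
<tr><td>1234</td><td>2</td><td>Comer</td></tr>
</table>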
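The double counting that versioning avoids shows up as soon as the fact table is joined to a versioned dimension on the customer ID alone. The sketch below uses the slide's figures. The tuple layout and the `total_sales` helper are hypothetical.

```python
# (CustId, Version, Customer Name) and (CustId, Version, Sales) from the slide.
customers = [(1234, 1, "Comer"), (1234, 2, "Comer")]
sales = [(1234, 1, 11000), (1234, 2, 12000)]

def total_sales(join_on_version):
    """Sum sales over a nested-loop join of facts to customer versions."""
    total = 0
    for s_id, s_version, amount in sales:
        for c_id, c_version, _name in customers:
            if s_id == c_id and (not join_on_version or s_version == c_version):
                total += amount
    return total

double_counted = total_sales(join_on_version=False)  # each fact matches both versions
correct_total = total_sales(join_on_version=True)    # each fact matches one version
```

Joining on ID only matches every fact row to both customer versions and doubles the total. Carrying the version number in the fact row restricts each fact to exactly one dimension row.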
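Rapidly Changing Dimensions (RCDs) <ul><li>It is a dimension whose attribute data changes quickly over time and therefore needs frequent updates. </li></ul><ul><li>Also referred to as a rapidly changing monster dimension. </li></ul><ul><li>Create a separate dimension, referred to as a mini dimension. </li></ul>Mini Dimension
<table>
<tr><th>Demographics Key</th><th>Age</th><th>children</th><th>income</th></tr>
<tr><td>1</td><td>20-24</td><td>0</td><td><20000</td></tr>
<tr><td>2</td><td>20-24</td><td>1-2</td><td>20000 – 30000</td></tr>
<tr><td>3</td><td>20-24</td><td>> 2</td><td>>30000</td></tr>
<tr><td>4</td><td>25-30</td><td>0</td><td><20000</td></tr>
<tr><td>5</td><td>25-30</td><td>1-2</td><td>20000 – 30000</td></tr>
<tr><td>::::</td><td>::::</td><td>::::</td><td>::::</td></tr>
</table>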
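With a mini dimension, a change in a customer's age, number of children, or income only changes which demographics key is referenced; the customer dimension itself is not rewritten. The sketch below covers only the five rows shown on the slide. The band boundaries (for example, treating exactly 20000 and 30000 as part of the middle income band) and the function names are assumptions.

```python
def age_band(age):
    if 20 <= age <= 24:
        return "20-24"
    if 25 <= age <= 30:
        return "25-30"
    raise ValueError("age outside the bands shown on the slide")

def children_band(count):
    return "0" if count == 0 else "1-2" if count <= 2 else ">2"

def income_band(income):
    # Assumption: the boundary values 20000 and 30000 fall in the middle band.
    return "<20000" if income < 20000 else "20000-30000" if income <= 30000 else ">30000"

# Rows copied from the slide: (age, children, income) -> Demographics Key
DEMOGRAPHICS = {
    ("20-24", "0", "<20000"): 1,
    ("20-24", "1-2", "20000-30000"): 2,
    ("20-24", ">2", ">30000"): 3,
    ("25-30", "0", "<20000"): 4,
    ("25-30", "1-2", "20000-30000"): 5,
}

def demographics_key(age, children, income):
    """Return the mini-dimension key for a customer's current profile."""
    return DEMOGRAPHICS[(age_band(age), children_band(children), income_band(income))]
```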
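Junk Dimension <ul><li>A junk dimension is an abstract dimension that holds the decodes for a group of low-cardinality flags and indicators, thereby removing them from the fact table. </li></ul>Junk Dimension
<table>
<tr><th>Junk Key</th><th>Payment Type</th><th>Order type</th><th>Order Mode</th></tr>
<tr><td>1</td><td>Cash</td><td>Normal</td><td>Web</td></tr>
<tr><td>2</td><td>Cash</td><td>Urgent</td><td>Web</td></tr>
<tr><td>3</td><td>Credit</td><td>Normal</td><td>Fax</td></tr>
<tr><td>4</td><td>Credit</td><td>Urgent</td><td>Fax</td></tr>
<tr><td>::::</td><td>::::</td><td>::::</td><td>::::</td></tr>
</table>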
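One common way to populate a junk dimension is to generate every combination of the flag values up front and give each combination a surrogate key. The sketch below does this for the three indicators on the slide. The keys it assigns are arbitrary and do not match the slide's numbering, and the helper names are hypothetical.

```python
from itertools import product

PAYMENT_TYPES = ["Cash", "Credit"]
ORDER_TYPES = ["Normal", "Urgent"]
ORDER_MODES = ["Web", "Fax"]

# One row per combination of flag values, keyed by an arbitrary surrogate key.
junk_dimension = {
    key: combination
    for key, combination in enumerate(product(PAYMENT_TYPES, ORDER_TYPES, ORDER_MODES), start=1)
}

# Reverse lookup used at load time to replace three flags with one key in the fact row.
junk_key_for = {combination: key for key, combination in junk_dimension.items()}

fact_junk_key = junk_key_for[("Credit", "Urgent", "Fax")]
```

The fact table then stores a single junk key instead of three separate low-cardinality columns.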