Data Warehouse• The data warehouse is that portion of an overall Architected Data Environment that serves as the single integrated source of data for processing information.• Data ware house contains Historical data as well as Current data.
Relational Database• Its known as RDBMS.• Data is stored as Relations• Concept of Primary and Foreign Key• ACID Properties
ACID properties• ACID properties are an important concept for databases. The acronym stands for Atomicity, Consistency, Isolation, and Durability• In the context of databases, a single logical operation on the data is called a transaction. An example of a transaction is a transfer of funds from one account to another, even though it might consist of multiple individual operations (such as debiting one account and crediting another). The ACID properties that such transactions are processed reliably.
1. ATOMICITY• Atomicity refers to the ability of the DBMS to guarantee that either all of the tasks of a transaction are performed or none of them are.• The transfer of funds can be completed or it can fail for a multitude of reasons, but atomicity guarantees that one account wont be debited if the other is not credited as well.
Consistency• Consistency refers to the database being in a legal state when the transaction begins and when it ends.• This means that a transaction cant break the rules, or integrity constraints, of the database. If an integrity constraint states that all accounts must have a positive balance, then any transaction violating this rule will be aborted.
ISOLATION• Isolation refers to the ability of the application to make operations in a transaction appear isolated from all other operations.• This means no operation outside the transaction can ever see the data in an intermediate state
Durability• Refers to the guarantee that once the user has been notified of success, the transaction will persist, and not be undone
Data Warehouse Characteristics1. Subject Oriented.2. Integrated3. Non Volatile4. Time variant5. Accessible
1. Subject Oriented• Information is presented according to specific subjects or areas of interest, not simply as computer files. Data is manipulated to provide information about a particular subject.
2. Integrated• A single source of information for and about understanding multiple areas of interest.• The data warehouse provides one-stop shopping and contains information about a variety of subjects.• Thus the University data warehouse has information on students, faculty and staff, instructional workload, and student outcomes.
3. Non Volatile• Stable information that doesn’t change each time an operational process is executed.• Information is consistent regardless of when the warehouse is accessed.
4. Time variant• Containing a history of the subject, as well as current information.• Historical information is an important component of a data warehouse.
5. Accessible• The primary purpose of a data warehouse is to provide readily accessible information to end- users.
Some definitions• Data Warehouse: A data structure that is optimized for distribution. It collects and stores integrated sets of historical data from multiple operational systems and feeds them to one or more data marts. It may also provide end-user access to support enterprise views of data.• Data Mart: A data structure that is optimized for access. It is designed to facilitate end-user analysis of data. It typically supports a single, analytic application used by a distinct set of workers.• Staging Area: Any data store that is designed primarily to receive data into a warehousing environment.
• Operational Data Store: A collection of data that addresses operational needs of various operational units. It is not a component of a data warehousing architecture, but a solution to operational needs.• OLAP (On-Line Analytical Processing): A method by which multidimensional analysis occurs.• Multidimensional Analysis: The ability to manipulate information by a variety of relevant categories or “dimensions” to facilitate analysis and understanding of the underlying data. It is also sometimes referred to as “drilling-down”, “drilling-across” and “slicing and dicing”
• Hypercube: A means of visually representing multidimensional data.• OLAP Tools: A set of software products that attempt to facilitate multidimensional analysis. Can incorporate data acquisition, data access, data manipulation, or any combination thereof.
Star Schema• In computing, the star schema (also called star- join schema, data cube, or multi-dimensional schema) is the simplest style of data warehouse schema. The star schema consists of one or more fact tables referencing any number of dimension tables. The star schema is an important special case of the snowflake schema, and is more effective for handling simpler queries.• Benefit of Star Schema is ease of access in terms of writing queries.
Snow Flake Schema• In computing, a snowflake schema is a logical arrangement of tables in a multidimensional database such that the entity relationship diagram resembles a snowflake in shape. The snowflake schema is represented by centralized fact tables which are connected to multiple dimensions.• However, in the snowflake schema, dimensions are normalized into multiple related tables, whereas the star schemas dimensions are normalized with each dimension represented by a single table.• Star and snowflake schemas are most commonly found in dimensional data warehouses and data marts where speed of data retrieval is more important than the efficiency of data manipulations
OLAP and OLTP• OLTP vs. OLAP We can divide IT systems into transactional (OLTP) and analytical (OLAP). In general we can assume that OLTP systems provide source data to data warehouses, whereas OLAP systems help to analyze it.
• OLTP (On-line Transaction Processing) is characterized by a large number of short on-line transactions (INSERT, UPDATE, DELETE). The main emphasis for OLTP systems is put on very fast query processing, maintaining data integrity in multi-access environments and an effectiveness measured by number of transactions per second. In OLTP database there is detailed and current data, and schema used to store transactional databases is the entity model (usually Normalized).• OLAP (On-line Analytical Processing) is characterized by relatively low volume of transactions. Queries are often very complex and involve aggregations. For OLAP systems a response time is an effectiveness measure. OLAP applications are widely used by Data Mining techniques. In OLAP database there is aggregated, historical data, stored in multi- dimensional schemas (usually star schema).
Data Warehouse Approaches• 1. Bottom up• 2. Top down• 3. Hybrid• 4. Federated
• Bottom-up design• Built from a series of incremental, architected data marts.• Data marts are first created to provide reporting and analytical capabilities for specific business processes. Data marts contain atomic data and, if necessary, summarized data.• These data marts can eventually be unioned together to create a data warehouse.• An important benefit to the organization is, it allows the project team to develop the skills and techniques required for data warehousing in a much lower risk and exposure environment than a full scale EDW project.
• Bottom up• Advantages• * Offers low risk and low exposure• * Yields incremental design• * Requires lower level, shorter term• * Can be delivered quickly• * Enables a focused approach• * Provides faster ROI• Disadvantages• * requires to integrate incremental data marts• * Needs multiple team coordination
• Top-down design• An EDW is composed of multiple subject areas -- finance, human resources, marketing, sales, and manufacturing. In a top down scenario, the entire EDW is architected, and then a small slice of a subject area is chosen for construction.• The top down EDW is architected, designed and constructed in an iterative manner.• Data warehouse is as a centralized repository for the entire enterprise. The data warehouse is designed using a normalized enterprise data model. "Atomic" data, that is, data at the lowest level of detail, are stored in the data warehouse.• Dimensional data marts containing data needed for specific business processes or specific departments are created from the data warehouse.
• Top-down design• Advantages• * Coordinated environment• * Single point of control & development.• Disadvantages• * Time required to complete project• * Too much time spent analyzing problems• * Difficult to control the project• * Enterprise-wide nature of the project• * Too much risk
• Hybrid Approach• The hybrid approach tries to blend the best of both “top-down” and “bottom-up” approaches.• It attempts to capitalize on the speed and user- orientation of the “bottom-up” approach without sacrificing the integration enforced by a data warehouse in a “top down” approach.• The hybrid approach relies on ETL tool to store and manage the enterprise and local models in the data marts as well as synchronize the differences between them.
• Hybrid Approach• Advantage• It combines rapid development techniques within an enterprise architecture framework.• It develops an enterprise data model iteratively and only develops a heavyweight infrastructure once it’s really needed (e.g. when executives start asking for reports that cross data mart boundaries.)Disadvantages• Backfilling a data warehouse can be a highly disruptive process.• Few query tools can dynamically and intelligently query atomic data in one database (i.e. the data warehouse) and summary data in another database (i.e. the data marts.) Users may be confused when to query which database.• Relies heavily on an ETL tool ,although ETL tools have matured considerably, they can never enforce adherence to architecture.
• Federated Approach• Is not a methodology or architecture, but a concession to the natural forces that undermine the best laid plans for deploying a perfect system.• A federated approach rationalizes the use of whatever means possible to integrate analytical resources to meet changing needs or business conditions.• In short, it’s a salve for the soul of the stressed out data warehousing project manager who must sacrifice architectural purity to meet the immediate (and ever- changing) needs of his business users.
• Federated Approach• Advantage• No need to change their architecture as per the IT standard• Disadvantage• It is not well documented.• Without a specific architecture in mind, it may lead to the continued decentralization and fragmentation of analytical resources, making it harder to deliver an enterprise view in the end.• Also, integrating meta data is a very big problem in a heterogeneous, ever-changing environment.
• Summary of DW Methodology• Ultimately, organizations need to understand the strengths and limitations of each methodology and then pursue their own way through the data warehousing thicket.