Data warehouse: a subject-oriented, integrated, time-variant, non-volatile collection of data in support of management's decisions. Data warehouses are granular. They contain the bedrock data that forms the single source for all Decision Support System (DSS) / Executive Information System (EIS) processing. With a data warehouse there is reconcilability of information when differences of opinion arise. The atomic data found in the warehouse can be shaped in many ways, satisfying known requirements while standing ready to satisfy unknown ones.
Data warehouses became a distinct type of computer database during the late 1980s and early 1990s. They were developed to meet a growing demand for management information and analysis that could not be met by operational systems:
the extra processing load of reporting reduced the response time of the operational systems;
developing reports in operational systems required writing specific SQL queries, which placed a heavy load on the system.
Separate computer databases began to be built that were specifically designed to support management information and analysis purposes.
Data warehouses were able to bring in data from a range of different data sources: mainframe computers, minicomputers, personal computers, and office automation software such as spreadsheets. Data warehouses integrate this information in a single place.
User-friendly reporting tools and freedom from operational impacts have led to the growth of data warehousing systems.
As technology improved (lower cost for more performance) and user requirements increased (faster data load cycle times and more features), data warehouses have evolved through several fundamental stages:
Offline Operational Databases - Data warehouses in this initial stage are developed by simply copying the database of an operational system to an off-line server, where the processing load of reporting does not impact the operational system's performance.
Offline Data Warehouse - Data warehouses in this stage of evolution are updated on a regular time cycle (usually daily, weekly or monthly) from the operational systems and the data is stored in an integrated reporting-oriented data structure.
The real next generation of warehousing – not yet widely done in practice:
Real-Time Data Warehouse - Data warehouses at this stage are updated on a transaction or event basis, every time an operational system performs a transaction (e.g., an order, a delivery, or a booking).
Integrated Data Warehouse - Data warehouses at this stage are used to generate activity or transactions that are passed back into the operational systems for use in the daily activity of the organization.
The term data warehouse architecture describes the overall structure of the system.
Historical terms include decision support system (DSS) and management information system (MIS).
Newer terms include business intelligence competency center (BICC).
The data warehouse architecture describes the overall system components: infrastructure, data and processes.
The infrastructure technology stack perspective determines the hardware and software products needed to implement the components of the system.
The data perspective typically diagrams the source and target data structures and aids the user in understanding what data assets are available and how they are related.
The process perspective is primarily concerned with communicating the process and flow of data from the originating source system through the process of loading the data warehouse, and often the process that client products use to access and extract data from the warehouse.
The architecture defines the structure, function, and interrelationships of each component.
Data integration is the practice of combining diverse sources and giving the user a unified view of their data.
This important problem emerges in a variety of situations both commercial (when two similar companies need to merge their databases) and scientific (combining research results from different bioinformatics repositories).
Data integration appears with increasing frequency as both the volume of existing data and the need to share it explode.
It has been the focus of extensive theoretical work and numerous open problems remain to be solved.
In practice, data integration is frequently called Enterprise Information Integration (EII).
Operational Systems are the internal and external core systems that support the day-to-day business operations. They are accessed through application program interfaces (APIs) and are the source of data for the data warehouse and operational data store. (Encompasses all operational systems including ERP, relational and legacy.)
Data Acquisition is the set of processes that capture, integrate, transform, cleanse, reengineer and load source data into the data warehouse and operational data store. Data reengineering is the process of investigating, standardizing and providing clean consolidated data.
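A minimal sketch of such a load step, assuming hypothetical Stg_Customer, Dw_Customer and Ref_State tables (the names and cleansing rules are illustrative, not from the source):

INSERT INTO Dw_Customer (Customer_Id, Full_Name, State_Code, Load_Date)
SELECT
    s.Customer_Id,
    UPPER(TRIM(s.Full_Name)),                     -- standardize: trim and uppercase the name
    CASE WHEN s.State_Code IN (SELECT State_Code FROM Ref_State)
         THEN s.State_Code
         ELSE 'UNK'                               -- cleanse: default any code not in the reference list
    END,
    CURRENT_TIMESTAMP                             -- time-variant load stamp
FROM Stg_Customer s;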
The Data Warehouse is a subject-oriented, integrated, time-variant, non-volatile collection of data used to support the strategic decision-making process for the enterprise. It is the central point of data integration for business intelligence and is the source of data for the data marts, delivering a common view of enterprise data.
Primary Storage Management consists of the processes that manage data within and across the data warehouse and operational data store. It includes processes for backup and recovery, partitioning, summarization, aggregation, and archival and retrieval of data to and from alternative storage.
Alternative Storage is the set of devices used to cost-effectively store data warehouse and exploration warehouse data that is needed but not frequently accessed. These devices are less expensive than disks and still provide adequate performance when the data is needed.
Data Delivery is the set of processes that enable end users and their supporting IS group to build and manage views of the data warehouse within their data marts. It involves a three-step process consisting of filtering, formatting and delivering data from the data warehouse to the data marts.
The Data Mart is customized and/or summarized data derived from the data warehouse and tailored to support the specific analytical requirements of a business unit or function. It utilizes a common enterprise view of strategic data and provides business units more flexibility, control and responsibility. The data mart may or may not be on the same server or location as the data warehouse.
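A minimal sketch of the filter/format/deliver steps as a data mart view (the table and column names are illustrative assumptions; the same star-schema tables appear in the example query later in this section):

CREATE VIEW Mart_Sales_East AS
SELECT
    D.Year,
    D.Month,
    P.Product_Id,
    SUM(F.Units_Sold) AS Units_Sold               -- format: summarize to the grain the mart needs
FROM Fact_Sales F
INNER JOIN Dim_Date D ON F.Date_Id = D.Id
INNER JOIN Dim_Store S ON F.Store_Id = S.Id
INNER JOIN Dim_Product P ON F.Product_Id = P.Id
WHERE S.Region = 'East'                           -- filter: deliver only this business unit's slice
GROUP BY D.Year, D.Month, P.Product_Id;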
The Operational Data Store (ODS) is a subject-oriented, integrated, current, volatile collection of data used to support the tactical decision-making process for the enterprise. It is the central point of data integration for business management, delivering a common view of enterprise data.
Meta Data Management is the process for managing information needed to promote data legibility, use and administration. Contents are described in terms of data about data, activity and knowledge.
The Exploration Warehouse is a DSS architectural structure whose purpose is to provide a safe haven for exploratory and ad hoc processing. An exploration warehouse utilizes data compression to provide fast response times with the ability to access the entire database.
The Data Mining Warehouse is an environment created so analysts may test their hypotheses, assertions and assumptions developed in the exploration warehouse. Specialized data mining tools containing intelligent agents are used to perform these tasks.
Activities are the events captured by the enterprise legacy and/or ERP systems as well as external transactions such as Internet interactions.
Statistical Applications are set up to perform complex, difficult statistical analyses such as exception, means, average and pattern analyses. The data warehouse is the source of data for these analyses. These applications analyze massive amounts of detailed data and require a reasonably performing environment.
Analytic Applications are pre-designed, ready-to-install decision support applications. They generally require some customization to fit the specific requirements of the enterprise. The source of data is the data warehouse. Examples of these applications are risk analysis, database marketing (CRM) analyses, vertical industry “data marts in a box,” etc.
External Data is any data outside the normal data collected through an enterprise's internal applications. There can be any number of sources of external data such as demographic, credit, competitor and financial information. Generally, external data is purchased by the enterprise from a vendor of such information.
Normalization is a relational database modeling process where the relations or tables are progressively decomposed into smaller relations to a point where all attributes in a relation are very tightly coupled with the primary key of the relation. Most data modelers try to achieve the “Third Normal Form” with all of the relations before they de-normalize for performance, ease of query or other reasons.
First Normal Form : A relation is said to be in First Normal Form if it describes a single entity and it contains no arrays or repeating attributes. For example, an order table or relation with multiple line items would not be in First Normal Form because it would have repeating sets of attributes for each line item. The relational theory would call for separate tables for order and line items.
Second Normal Form : A relation is said to be in Second Normal Form if in addition to the First Normal Form properties, all attributes are fully dependent on the primary key for the relation.
Third Normal Form : A relation is in Third Normal Form if, in addition to Second Normal Form, all non-key attributes are completely independent of each other.
Source: http://www.sserve.com/ftp/dwintro.doc
Entity Relationship Diagram example: Third Normal Form
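A minimal sketch of the order/line-item decomposition described above (all names are illustrative):

-- First Normal Form: the repeating line items move out of Orders into their own relation.
CREATE TABLE Orders (
    Order_Id    INT  PRIMARY KEY,
    Customer_Id INT  NOT NULL,
    Order_Date  DATE NOT NULL
);

CREATE TABLE Order_Lines (
    Order_Id    INT NOT NULL REFERENCES Orders (Order_Id),
    Line_Number INT NOT NULL,
    Product_Id  INT NOT NULL,
    Quantity    INT NOT NULL,
    -- Second Normal Form: every attribute depends on the whole key (Order_Id, Line_Number).
    -- Third Normal Form: no non-key attribute depends on another non-key attribute
    -- (e.g., product price would live in a Product table, not here).
    PRIMARY KEY (Order_Id, Line_Number)
);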
Example query against a dimensional sales model: total units sold by brand and country for TV products in 2010 (the aggregated measure, Units_Sold, is illustrative):

SELECT B.Brand, G.Country, SUM(F.Units_Sold) AS Units_Sold
FROM Fact_Sales F WITH (NOLOCK)
INNER JOIN Dim_Date D WITH (NOLOCK) ON F.Date_Id = D.Id
INNER JOIN Dim_Store S WITH (NOLOCK) ON F.Store_Id = S.Id
INNER JOIN Dim_Geography G WITH (NOLOCK) ON S.Geography_Id = G.Id
INNER JOIN Dim_Product P WITH (NOLOCK) ON F.Product_Id = P.Id
INNER JOIN Dim_Product_Category C WITH (NOLOCK) ON P.Product_Category_Id = C.Id
INNER JOIN Dim_Brand B WITH (NOLOCK) ON P.Brand_Id = B.Id
WHERE D.Year = 2010
  AND C.Product_Category = 'TV'
GROUP BY B.Brand, G.Country;
-- WITH (NOLOCK) reads without shared locks (dirty reads), a common trade-off in reporting queries.
Basic EDW Data Model Design: Party, Account, Product & Service, Event. Each represents a subject area in the model, with third normal form tables to accommodate the data and its relationships and hierarchies.
Account, Customer & Address Relationships (diagram): Account, Contact, Party, and Address entities tied together through Account-Party and Address-Account-Party link tables. Account information is loaded from ALL source systems; the ETL process builds the relationship between Accounts and Customers (Party) based on the relationship file from the customer CRM system.
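A minimal sketch of the Account-Party link table such an ETL process might populate (names and relationship types are hypothetical):

CREATE TABLE Account_Party_Link (
    Account_Id        INT NOT NULL REFERENCES Account (Account_Id),
    Party_Id          INT NOT NULL REFERENCES Party (Party_Id),
    Relationship_Type VARCHAR(20) NOT NULL,       -- e.g., 'OWNER', 'CO-SIGNER'
    Effective_Date    DATE NOT NULL,              -- when the CRM relationship file asserted the link
    PRIMARY KEY (Account_Id, Party_Id, Relationship_Type)
);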
Architecture for an EDW or other large Data Warehouse
How do you get from where you are to implementing an actual system?
Start with defining your requirements
DO IT YOURSELF, DO NOT RELY ON THE EXPERTS – use staff augmentation and hire the talent internally.
EDW Process State (diagram): source systems (CPS, MANTAS, CRDB, MKTG, FIN, SALES, IMP, RM, OEC, ALS, AFS, ST, RE, DFP, SBA, V-PR) are cleansed and pre-processed into a staging area, where data profiling, data cleansing, and sync & sort occur; data then flows into the EDW and on to data marts and BI, underpinned throughout by metadata, data governance, and data management.
A DW saves data for longer periods than transactional/operational systems, enabling trend analysis (where I was vs. where I am); a minimal query sketch follows below.
Real-time DW vs. point-in-time snapshots.
A DW needs to be extensible and aligned with the business structure.
Source: http://www.sserve.com/ftp/dwintro.doc
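A minimal sketch of such a trending query, reusing the illustrative Fact_Sales and Dim_Date tables from the example query above; it works only because the warehouse retains history that operational systems typically purge:

SELECT D.Year, SUM(F.Units_Sold) AS Units_Sold    -- "where I was" vs. "where I am", year by year
FROM Fact_Sales F
INNER JOIN Dim_Date D ON F.Date_Id = D.Id
GROUP BY D.Year
ORDER BY D.Year;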
Enterprise Data Solution (diagram): source systems feed the enterprise data warehouse, which feeds data marts and OLAP; downstream uses include reporting, data mining, OLAP analysis, dashboards, and scorecards; master data management maintains the application master / reference data store.
EDW Objective: follow the process methodology to achieve these architectural aspects: Meta Data, Security, Scalability, Reliability and Supportability.
A DW is only successful if it provides the view of its data that the business needs.
A data warehouse is a structured extensible environment designed for the analysis of non-volatile data, logically and physically transformed from multiple source applications to align with business structure, updated and maintained for a long time period, expressed in simple business terms, and summarized for quick analysis.
- Vivek R. Gupta, Senior Consultant, System Services Corporation, Chicago, Illinois, http://www.system-services.com
Example of conforming data for a business view (source: http://www.sserve.com/ftp/dwintro.doc):
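As a minimal sketch (the source encodings and the conformed code set are hypothetical), two source systems that encode gender as 'M'/'F' and '1'/'2' can be conformed to one business view during the load:

SELECT
    Customer_Id,
    CASE Gender_Code                              -- conform divergent source encodings
        WHEN 'M' THEN 'Male'
        WHEN '1' THEN 'Male'
        WHEN 'F' THEN 'Female'
        WHEN '2' THEN 'Female'
        ELSE 'Unknown'                            -- anything else fails conformance
    END AS Gender
FROM Stg_Customer;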
Data Quality is measured as the degree of superiority, or excellence, of the various data that we use to create information products.
“Reason #1 for the failure of CRM projects: data is ignored. Enterprises must have a detailed understanding of the quality of their data: how to clean it up, how to keep it clean, where to source it, and what 3rd-party data is required. Action item: have a data quality strategy. Devote ½ of the total timeline of the CRM project to data elements.” - Gartner
Conformance: The degree to which data values are consistent with their agreed upon definitions.
A detailed definition must first exist before this can be measured.
Information quality begins with a comprehensive understanding of the data inventory. The information about the data is as important as the data itself.
A Data Dictionary must exist! An organized, authoritative collection of attributes is the equivalent of the old “Card Catalog” in a library, or the “Parts and List Description” section of an inventory system. It must contain all the known usage rules and an acceptable list of values. All known caveats and anomalies must be described.
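A minimal sketch of such a dictionary expressed as a table (the columns are illustrative):

CREATE TABLE Data_Dictionary (
    Table_Name   VARCHAR(128)  NOT NULL,
    Column_Name  VARCHAR(128)  NOT NULL,
    Definition   VARCHAR(1000) NOT NULL,          -- the agreed-upon business meaning
    Usage_Rules  VARCHAR(1000),                   -- all known usage rules
    Valid_Values VARCHAR(1000),                   -- the acceptable list of values
    Caveats      VARCHAR(1000),                   -- known caveats and anomalies
    PRIMARY KEY (Table_Name, Column_Name)
);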
Accuracy: The degree to which a piece of data is correct and believable. The value can be compared to the original source for correctness, but it can still be unbelievable. Conformed values can be compared to lists of reference values.
Zip code 35244 is correct and believable.
Zip code 3524B is incorrect and unbelievable.
Zip code 35290 is incorrect but believable (it looks right, but does not exist).
AL is a correct and believable state code (compared to the list of valid state codes)
A1 is an incorrect and unbelievable state code (compared to the list of valid state codes)
AA is an incorrect but believable state code (compared to the list of valid state codes)
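A minimal sketch of measuring conformance against such a reference list (Customer_Address and Ref_State are hypothetical tables):

SELECT
    100.0 * COUNT(r.State_Code) / COUNT(*) AS Pct_Conformant  -- COUNT(r.State_Code) counts only matched rows
FROM Customer_Address a
LEFT JOIN Ref_State r ON a.State_Code = r.State_Code;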
Data Management and Integration Topic, Gartner, http://www.gartner.com/it/products/research/asset_137953_2395.jsp
Gartner articles: Key Issues for Implementing an Enterprise-wide Data Quality Improvement Project (2008); Key Issues for Enterprise Information Management Initiatives (2008); Key Issues for Establishing Information Governance Policies, Processes and Organization (2008)
J. G. Geiger, Data Quality Management: The Most Critical Initiative You Can Implement, http://www2.sas.com/proceedings/sugi29/098-29.pdf
Information Management, How to Measure and Monitor the Quality of Master Data, http://www.information-management.com/issues/2007_58/master_data_management_mdm_quality-10015358-1.html
D. Jeffries, Critical Data Quality Controls, Data Management Association of Michigan Bits & Bytes, Fall 2006, http://dama-michigan.org/2%20Newsletter.pdf