DATABASE
i. Databases are built on the idea that data is one of the
critical materials of the Information Age.
ii. Information, which is created from data, becomes the basis for
decision making.
iii. A database is basically a collection of information organized in
such a way that a computer program can quickly select desired
pieces of data.
DATA WAREHOUSE
i. A data warehouse is a collection of integrated databases
designed to support a DSS (Decision Support System)
ii. A data warehouse is a relational database that is designed for
query and analysis rather than for transaction processing. It
usually contains historical data derived from transaction data,
but it can include data from other sources. It separates analysis
workload from transaction workload and enables an
organization to consolidate data from several sources.
iii. In addition to a relational database, a data warehouse
environment includes an extraction, transportation,
transformation, and loading (ETL) solution, an online
analytical processing (OLAP) engine, client analysis tools, and
other applications that manage the process of gathering data
and delivering it to business users.
DECISION SUPPORT SYSTEMS
i. Created to facilitate the decision making process
ii. So much information that it is difficult to extract it all from a
traditional database
iii. Need for a more comprehensive data storage facility
iv. Extract Information from data to use as the basis for decision
making
v. Used at all levels of the Organization
vi. Tailored to specific business areas
vii. Interactive
viii. Ad hoc queries to retrieve and display information
ix. Combines historical operational data with business activities
DATA WAREHOUSE ENVIRONMENT
The data warehouse environment consists of:
i. Data store
ii. Data mart
iii. Metadata
For the data to be effective, the DW must be:
i. Consistent.
ii. Well integrated.
iii. Well defined.
iv. Time stamped.
DATA STORE
i. Data come from internal and external nonintegrated
operational systems
ii. An operational data store (ODS) stores data for a specific
application. It feeds the data warehouse a stream of desired
raw data.
iii. It is the most common component of the DW environment.
iv. A data store is generally subject oriented, volatile, and current,
and is commonly focused on customers, products, orders, policies,
claims, etc.
v. Its day-to-day function is to store the data for a single, specific
set of operational applications.
vi. Its function is to feed the data warehouse data for the purpose
of analysis.
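The relationship between a volatile data store and the warehouse it feeds can be sketched as follows. This is a minimal illustration, not a real ODS: the names `ods`, `warehouse`, `update_order`, and `feed_warehouse`, and the sample dates, are all hypothetical.

```python
# Hypothetical sketch: the ODS keeps only the current state (volatile),
# while each feed appends a dated copy to the warehouse (history).
ods = {}        # order_id -> current status, overwritten in place
warehouse = []  # (as_of_date, order_id, status) rows, only ever appended


def update_order(order_id, status):
    """Operational update: the previous status is lost in the ODS."""
    ods[order_id] = status


def feed_warehouse(as_of):
    """Copy the ODS's current state into the warehouse, stamped with a date."""
    for order_id, status in sorted(ods.items()):
        warehouse.append((as_of, order_id, status))


update_order(1, "placed")
feed_warehouse("2013-08-01")
update_order(1, "shipped")
feed_warehouse("2013-08-02")

print(ods)        # current state only
print(warehouse)  # both states, each tied to a date
```

The ODS ends up holding only the latest status, while the warehouse retains both states for analysis.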
[Figure: data store and data warehouse]
DATA MART
i. Small Data Stores
ii. More manageable data sets
iii. Targeted to meet the needs of small groups within the
organization
iv. It is a lower-cost, scaled-down version of the DW.
v. Small, single-subject data warehouse subset that provides
decision support to a small group of people
vi. Data marts offer a targeted and less costly method of gaining the
advantages associated with data warehousing and can be scaled
up to a full DW environment over time.
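As a rough sketch of the "single-subject subset" idea, a data mart can be pictured as a filtered copy of warehouse rows for one subject. The row layout and the `build_data_mart` helper below are assumptions made for illustration only.

```python
# Hypothetical warehouse rows covering more than one subject.
warehouse_rows = [
    {"subject": "sales", "store": 4056, "amount": 223.45},
    {"subject": "inventory", "store": 4056, "units": 12},
    {"subject": "sales", "store": 4057, "amount": 80.0},
]


def build_data_mart(rows, subject):
    """Return copies of the rows that belong to a single subject."""
    return [dict(row) for row in rows if row["subject"] == subject]


sales_mart = build_data_mart(warehouse_rows, "sales")
print(sales_mart)
```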
METADATA
i. Last component of DW environments.
ii. It is information that is kept about the warehouse rather than
information kept within the warehouse.
iii. Legacy systems generally don’t keep a record of characteristics
of the data (such as what pieces of data exist and where they are
located).
iv. The metadata is simply data about data.
v. For example, a line in a sales database may contain:
4056 KJ596 223.45
vi. This is mostly meaningless until we consult the metadata that
tells us it was store number 4056, product KJ596 and sales of
$223.45
vii. The metadata are essential ingredients in the transformation of
raw data into knowledge. They are the “keys” that allow us to
handle the raw data.
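The sales-record example above can be sketched in code: the raw line is only interpretable once it is paired with metadata describing each field. The metadata layout (`SALES_METADATA`) and the `decode_record` helper are hypothetical, chosen just to illustrate the idea.

```python
from decimal import Decimal

# Hypothetical metadata: one entry per field of a sales record.
SALES_METADATA = [
    {"name": "store_number", "type": int, "description": "Store number"},
    {"name": "product_code", "type": str, "description": "Product code"},
    {"name": "sales_amount", "type": Decimal, "description": "Sales in dollars"},
]


def decode_record(line, metadata):
    """Pair each whitespace-separated value with the field the metadata describes."""
    values = line.split()
    if len(values) != len(metadata):
        raise ValueError("record does not match metadata")
    return {field["name"]: field["type"](value)
            for field, value in zip(metadata, values)}


record = decode_record("4056 KJ596 223.45", SALES_METADATA)
print(record)
```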
GENERAL METADATA ISSUES
General metadata issues associated with Data Warehouse use:
i. What tables, attributes and keys does the DW contain?
ii. Where did each set of data come from?
iii. What transformations were applied during cleansing?
iv. How have the metadata changed over time?
v. How often do the data get reloaded?
vi. Are there so many data elements that you need to be careful
what you ask for?
CHARACTERISTICS OF DATA WAREHOUSE
A common way of introducing data warehousing is to refer to the
characteristics of a data warehouse:
i. Subject Oriented
ii. Integrated
iii. Nonvolatile
iv. Time Variant
SUBJECT ORIENTED
i. Data warehouses are designed to help you analyze data. For
example, to learn more about your company's sales data, you
can build a warehouse that concentrates on sales. Using this
warehouse, you can answer questions like "Who was our best
customer for this item last year?" This ability to define a data
warehouse by subject matter, sales in this case, makes the data
warehouse subject oriented.
ii. Organized around major subjects, such as customer, product,
and sales.
iii. Focuses on the modeling and analysis of data for decision
makers, not on daily operations or transaction processing.
iv. Provides a simple and concise view of particular subject
issues by excluding data that are not useful in the decision
support process.
INTEGRATED
i. Integration is closely related to subject orientation. Data
warehouses must put data from disparate sources into a
consistent format. They must resolve such problems as naming
conflicts and inconsistencies among units of measure. When
they achieve this, they are said to be integrated.
ii. Data cleaning and data integration techniques are applied.
iii. The data warehouse is a centralized, consolidated database that
integrates data derived from the entire organization.
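A minimal sketch of what resolving naming conflicts and unit inconsistencies might look like, assuming two hypothetical source systems that name the customer field differently and report weight in different units:

```python
# Hypothetical source rows: same facts, different field names and units.
source_a = [{"cust_id": 17, "weight_lb": 10.0}]
source_b = [{"customerNumber": 17, "weight_kg": 2.5}]

LB_TO_KG = 0.45359237  # definition of the international pound in kilograms


def from_source_a(row):
    """Rename cust_id and convert pounds to kilograms."""
    return {"customer_id": row["cust_id"],
            "weight_kg": round(row["weight_lb"] * LB_TO_KG, 3)}


def from_source_b(row):
    """Rename customerNumber; the weight is already in kilograms."""
    return {"customer_id": row["customerNumber"],
            "weight_kg": round(row["weight_kg"], 3)}


integrated = ([from_source_a(r) for r in source_a]
              + [from_source_b(r) for r in source_b])
print(integrated)
```

After integration every row uses the same field names and the same unit, regardless of which system it came from.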
NONVOLATILE
i. Nonvolatile means that, once entered into the warehouse, data
should not change. This is logical because the purpose of a
warehouse is to enable you to analyze what has occurred.
ii. Once data is entered it is NEVER removed
iii. Read-Only database for data analysis and query processing
iv. Data are stored in read-only format.
v. Represents the company’s entire history
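One way to picture a nonvolatile store is a table that accepts inserts but rejects updates and deletes. The sketch below uses SQLite triggers purely as an illustrative stand-in; the table name `sales_fact` and the sample row are hypothetical, and real warehouses enforce this in their own ways.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales_fact (sale_date TEXT, store INTEGER, amount REAL);
-- Reject any attempt to change or remove rows once they are loaded.
CREATE TRIGGER sales_fact_no_update BEFORE UPDATE ON sales_fact
BEGIN SELECT RAISE(ABORT, 'warehouse rows are read-only'); END;
CREATE TRIGGER sales_fact_no_delete BEFORE DELETE ON sales_fact
BEGIN SELECT RAISE(ABORT, 'warehouse rows are never removed'); END;
""")
conn.execute("INSERT INTO sales_fact VALUES ('2013-08-01', 4056, 223.45)")


def try_statement(sql):
    """Return True if the statement ran, False if the database rejected it."""
    try:
        conn.execute(sql)
        return True
    except sqlite3.DatabaseError:
        return False


update_allowed = try_statement("UPDATE sales_fact SET amount = 0")
delete_allowed = try_statement("DELETE FROM sales_fact")
row_count = conn.execute("SELECT COUNT(*) FROM sales_fact").fetchone()[0]
print(update_allowed, delete_allowed, row_count)
```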
TIME VARIANT
i. In order to discover trends in business, analysts need large
amounts of data. This is very much in contrast to online
transaction processing (OLTP) systems, where performance
requirements demand that historical data be moved to an
archive. A data warehouse's focus on change over time is what is
meant by the term time variant.
ii. In an operational application system, the expectation is that all
data within the database are accurate as of the moment of
access. In the DW data are simply assumed to be accurate as of
some moment in time and not necessarily right now.
iii. One of the places where DW data display time variance is in the
structure of the record key. Every primary key contained within
the DW must contain, either implicitly or explicitly, an element
of time (day, week, month, etc.).
iv. Every piece of data contained within the warehouse must be
associated with a particular point in time if any useful analysis
is to be conducted with it.
v. Another aspect of time variance in DW data is that, once
recorded, data within the warehouse cannot be updated or
changed.
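The idea that every warehouse key carries an element of time can be sketched with a composite key. The key layout (store, product, month) and the `load_snapshot` helper are assumptions for illustration; an existing key is never overwritten, in line with the points above.

```python
from datetime import date

# Hypothetical warehouse keyed by (store, product, first day of the month).
warehouse = {}


def load_snapshot(store, product, month_start, amount):
    """Record one value per key; an existing snapshot is never overwritten."""
    key = (store, product, month_start)
    if key in warehouse:
        raise KeyError(f"snapshot {key} already recorded")
    warehouse[key] = amount


load_snapshot(4056, "KJ596", date(2013, 7, 1), 198.10)
load_snapshot(4056, "KJ596", date(2013, 8, 1), 223.45)

# Because time is part of the key, the history of one item is easy to list.
history = sorted((key[2], amount) for key, amount in warehouse.items()
                 if key[:2] == (4056, "KJ596"))
print(history)
```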
ONLINE TRANSACTION PROCESSING (OLTP)
OLTP systems are optimized for fast and reliable transaction
handling. Compared to data warehouse systems, most OLTP
interactions involve a relatively small number of rows, but a larger
group of tables.
In contrast, online analytical processing (OLAP) functionality is
characterized by dynamic, multidimensional analysis of historical
data, which supports activities such as the following:
i. Calculating across dimensions and through hierarchies
ii. Analyzing trends
iii. Drilling up and down through hierarchies
iv. Rotating to change the dimensional orientation
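The activities listed above can be sketched on a tiny in-memory data set. The fact rows and the `aggregate` helper are hypothetical; a real OLAP engine would do this over far larger, multidimensional structures.

```python
from collections import defaultdict

# Hypothetical fact rows: (year, quarter, region, sales).
facts = [
    (2012, "Q1", "East", 100), (2012, "Q1", "West", 80),
    (2012, "Q2", "East", 120), (2012, "Q2", "West", 90),
]


def aggregate(rows, key):
    """Sum the sales measure for each distinct value of key(row)."""
    totals = defaultdict(int)
    for row in rows:
        totals[key(row)] += row[3]
    return dict(totals)


# Drilling up: totals at the top of the time hierarchy (year).
by_year = aggregate(facts, lambda r: r[0])
# Drilling down: the same totals one level lower (year, quarter).
by_quarter = aggregate(facts, lambda r: (r[0], r[1]))
# Rotating: view the data by region first, then quarter.
by_region_quarter = aggregate(facts, lambda r: (r[2], r[1]))

print(by_year)
print(by_quarter)
print(by_region_quarter)
```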
DATA WAREHOUSE BASIC ARCHITECTURE
The basic architecture illustrates three things:
i. Data Sources (operational systems and flat files)
ii. Warehouse (metadata, summary data, and raw data)
iii. Users (analysis, reporting, and mining)
The metadata and raw data of a traditional OLTP system are present,
as is an additional type of data, summary data. Summaries are very
valuable in data warehouses because they compute long
operations in advance. For example, a typical data warehouse query
is to retrieve something like August sales. A summary in Oracle is
called a materialized view.
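The value of summary data can be sketched with a precomputed table. SQLite has no materialized views, so the example below uses an ordinary table built with CREATE TABLE ... AS SELECT as a stand-in; the table names and sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (sale_date TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", [
    ("2013-07-30", 50.0), ("2013-08-02", 100.0), ("2013-08-15", 25.5),
])

# Compute the monthly totals once, ahead of time (the "summary data").
conn.execute("""
    CREATE TABLE monthly_sales_summary AS
    SELECT substr(sale_date, 1, 7) AS month, SUM(amount) AS total
    FROM sales
    GROUP BY month
""")

# An "August sales" query now reads one summary row instead of rescanning sales.
august_total = conn.execute(
    "SELECT total FROM monthly_sales_summary WHERE month = ?", ("2013-08",)
).fetchone()[0]
print(august_total)
```

Unlike a true materialized view, this stand-in table is not refreshed automatically when new sales rows arrive.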
DATA WAREHOUSE ARCHITECTURE
(WITH A STAGING AREA)
i. Data Sources (operational systems and flat files)
ii. Staging Area (where data sources go before the warehouse)
iii. Warehouse (metadata, summary data, and raw data)
iv. Users (analysis, reporting, and mining)
We need to clean and process our operational data before putting it
into the warehouse. We can do this programmatically, although most
data warehouses use a staging area instead. A staging area simplifies
building summaries and general warehouse management.
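A staging step might look roughly like the sketch below: raw rows are cleaned, duplicates are dropped, and unparseable rows are set aside before anything reaches the warehouse. The row layout and the `clean` helper are assumptions for illustration.

```python
# Hypothetical raw rows arriving from operational systems and flat files.
raw_rows = [
    {"store": " 4056 ", "product": "kj596", "amount": "223.45"},
    {"store": "4056", "product": "KJ596", "amount": "223.45"},  # duplicate once cleaned
    {"store": "4057", "product": "KJ600", "amount": "n/a"},     # amount cannot be parsed
]


def clean(row):
    """Normalize one raw row, or return None if it cannot be parsed."""
    try:
        return (int(row["store"].strip()),
                row["product"].strip().upper(),
                float(row["amount"]))
    except ValueError:
        return None


staging, rejected = [], []
for row in raw_rows:
    cleaned = clean(row)
    if cleaned is None:
        rejected.append(row)      # kept aside for inspection
    elif cleaned not in staging:
        staging.append(cleaned)   # duplicates are dropped

# Only cleaned, de-duplicated rows move on to the warehouse.
warehouse_rows = list(staging)
print(warehouse_rows, len(rejected))
```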
DATA WAREHOUSE ARCHITECTURE (WITH A
STAGING AREA AND DATA MARTS)
i. Data Sources (operational systems and flat files)
ii. Staging Area (where data sources go before the warehouse)
iii. Warehouse (metadata, summary data, and raw data)
iv. Data Marts (purchasing, sales, and inventory)
v. Users (analysis, reporting, and mining)
DATA WAREHOUSING TYPOLOGY
i. The virtual data warehouse – the end users have direct
access to the data stores, using tools enabled at the data access
layer
ii. The central data warehouse – a single physical database
contains all of the data for a specific functional area
iii. The distributed data warehouse – the components are
distributed across several physical databases
DATA WAREHOUSE ETL TOOLS
ETL is short for Extract, Transform, Load: three database functions
that are combined into one tool to pull data out of one database and
place it into another database. ETL is used to migrate data from one
database to another, to form data marts and data warehouses, and
also to convert databases from one format or type to another.
i. Extract is the process of reading data from a database.
ii. Transform is the process of converting the extracted data from
its previous form into the form it needs to be in so that it can be
placed into another database. Transformation occurs by using
rules or lookup tables or by combining the data with other data.
iii. Load is the process of writing the data into the target database.
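The three steps above can be sketched end to end with two in-memory SQLite databases. Everything here is hypothetical (table names, the `REGION_LOOKUP` table, the sample rows); it shows the shape of an ETL job rather than any particular tool.

```python
import sqlite3

# Hypothetical source database (operational system).
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (id INTEGER, country TEXT, amount_cents INTEGER)")
source.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                   [(1, "US", 22345), (2, "DE", 1000)])

# Hypothetical target database (warehouse).
target = sqlite3.connect(":memory:")
target.execute("CREATE TABLE order_fact (id INTEGER, region TEXT, amount REAL)")

# Lookup table used by the transform step.
REGION_LOOKUP = {"US": "Americas", "DE": "Europe"}


def extract(conn):
    """Extract: read rows from the source database."""
    return conn.execute(
        "SELECT id, country, amount_cents FROM orders ORDER BY id").fetchall()


def transform(rows):
    """Transform: map country to region via lookup and convert cents to dollars."""
    return [(oid, REGION_LOOKUP[country], cents / 100)
            for oid, country, cents in rows]


def load(conn, rows):
    """Load: write the transformed rows into the target database."""
    conn.executemany("INSERT INTO order_fact VALUES (?, ?, ?)", rows)
    conn.commit()


load(target, transform(extract(source)))
loaded = target.execute(
    "SELECT id, region, amount FROM order_fact ORDER BY id").fetchall()
print(loaded)
```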
ETL TOOLS
Tool                                  Version     ETL Vendor
Oracle Warehouse Builder              11gR1       Oracle
Data Services                         XI 4.0      SAP Business Objects
IBM InfoSphere Information Server     9.1         IBM
SAS Data Integration Studio           9.4M1       SAS Institute
Informatica PowerCenter               9.5         Informatica
Elixir Repertoire                     7.2.2       Elixir
Data Migrator                         7.7         Information Builders
SQL Server Integration Services       10          Microsoft
Talend Studio for Data Integration    5.2         Talend
Data Flow Manager                     6.5         Pitney Bowes Business Insight
Pervasive Data Integrator             10.0        Actian (Pervasive Software)
Open Text Integration Center          7.1         Open Text
Oracle Data Integrator (ODI)          11.1.1.5    Oracle
Data Manager/Decision Stream          8.2         IBM (Cognos)
Clover ETL                            3.4.1       Javlin
Centerprise                           6.0         Astera
DB2 InfoSphere Warehouse Edition      9.1         IBM
Pentaho Data Integration              4.1         Pentaho
Adeptia Integration Suite             5.1         Adeptia
DMExpress                             5.5         Syncsort
Expressor Data Integration            3.7         QlikTech
DATA WAREHOUSE TECHNOLOGIES
i. No one currently offers an end-to-end DW solution.
Organizations buy bits and pieces from a number of vendors
and hopefully make them work together.
ii. SAS, IBM, Software AG, Information Builders and Platinum
offer solutions that are at least fairly comprehensive.
iii. The market is very competitive. Table 10-6 in the text lists 90
firms that produce DW products.
IMPLEMENTING THE DATA WAREHOUSE
Kozar's list of the “seven deadly sins” of data warehouse implementation:
i. “If you build it, they will come” – the DW needs to be
designed to meet people’s needs
ii. Omission of an architectural framework – you need to
consider the number of users, volume of data, update cycle,
etc.
iii. Underestimating the importance of documenting
assumptions – the assumptions and potential conflicts must
be included in the framework
iv. Failure to use the right tool – a DW project needs different
tools than those used to develop an application
v. Life cycle abuse – in a DW, the life cycle really never ends
vi. Ignorance about data conflicts – resolving these takes a lot
more effort than most people realize
vii. Failure to learn from mistakes – since one DW project tends
to beget another, learning from the early mistakes will yield
higher quality later
THE FUTURE OF DATA WAREHOUSING
As the DW becomes a standard part of an organization, there will be
efforts to find new ways to use the data. This will likely bring with it
several new challenges:
i. Regulatory constraints may limit the ability to combine
sources of disparate data.
ii. These disparate sources are likely to contain unstructured
data, which is hard to store.
iii. The Internet makes it possible to access data from virtually
“anywhere”. Of course, this just increases the disparity.