Data Warehouse
• Definition
• Importance of Data Warehouse
• Its Components
• Two Data Warehousing Strategies
• ETL Processes
• For a Successful Warehouse
• Data Warehouse Pitfalls
Data Warehouse
• A subject-oriented, integrated, time-variant, non-volatile collection of data in support of management decisions (Bill Inmon)
• Subject-oriented -- data are organized around subjects such as sales and products
• Integrated -- data from different sources are combined to provide a comprehensive view
• Time-variant -- historical data are maintained
• Non-volatile -- data are not updated by users
Limitations of Traditional Databases
• Lack of online historical data
• Data reside in different operational systems
• Extremely poor query performance
• Operational database designs are not suited to decision support
The Importance of Data Warehousing
• More cost-effective decision making
• Higher quality and flexibility of enterprise analysis, because the data warehouse contains accurate and reliable data
• Ability to maintain better customer relationships
• Unlimited analyses of enterprise information
Components of a Data Warehouse
• Summarized data
  - Basically of two types: 1) lightly summarized (departmental information) and 2) highly summarized (enterprise-wide decisions)
• Current detail
  - Comes directly from the operational systems
  - Stored by subject area and represents the entire organization, not a single department
• System of record
  - Maintains the source of record
• Integration and transformation programs
  - Programs that convert application-specific data to enterprise data
  - Perform many functions, such as:
    - Reformatting and recalculating
    - Adding a time element
    - Identifying default values
    - Summarizing and merging the data
    - Filling in blank fields
• Archives
  - Contain old data that still holds some significance for the organization
  - Used for trend analysis
• Metadata
  - Controls access to and analysis of the data warehouse contents
  - Used to manage and control data warehouse creation and maintenance
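The functions performed by integration and transformation programs can be illustrated with a minimal Python sketch. The record layout, the field names (`cust_id`, `amt_cents`, `region`), and the default region value are assumptions invented for this example; they do not come from any particular system.

```python
from datetime import date

# Hypothetical default used when a source record leaves the region blank.
DEFAULT_REGION = "UNKNOWN"

def to_enterprise_record(app_record, load_date):
    """Convert one application-specific record into an enterprise-style record."""
    region = (app_record.get("region") or "").strip()
    return {
        # Reformatting: the source stores the id as zero-padded text.
        "customer_id": int(app_record["cust_id"]),
        # Recalculating: the source stores cents, the warehouse stores currency units.
        "amount": int(app_record["amt_cents"]) / 100,
        # Filling in a blank field with a default value.
        "region": region or DEFAULT_REGION,
        # Adding the time element.
        "load_date": load_date.isoformat(),
    }

record = to_enterprise_record(
    {"cust_id": "0042", "amt_cents": "1999", "region": " "},
    date(2024, 1, 31),
)
```

Here `record` carries an integer id, an amount in currency units, the default region, and the load date, which is what lets the same customer be compared across loads.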
Two Data Warehousing Strategies
• Enterprise-wide warehouse: top-down, the Inmon methodology
• Data mart: bottom-up, the Kimball methodology
• When properly executed, both result in an enterprise-wide data warehouse
The Data Mart Strategy
• The most common approach
• Begins with a single mart; further marts are added over time to cover more subject areas
• Relatively inexpensive and easy to implement
• Can be used as a proof of concept for data warehousing
• Requires an overall integration plan
The Enterprise-wide Strategy
• A comprehensive warehouse is built initially
• An initial dependent data mart is built using a subset of the data in the warehouse
• Additional data marts are built using subsets of the data in the warehouse
• Like all complex projects, it is expensive, time-consuming, and prone to failure
• When successful, it results in an integrated, scalable warehouse
ETL Processes
• Extraction, transformation, and loading (ETL) processes
• The “plumbing” work of data warehousing
• Data are moved from source to target databases
• A very costly, time-consuming part of data warehousing
Sample ETL Tools
• Teradata Warehouse Builder from Teradata
• DataStage from Ascential Software
• SAS System from SAS Institute
• PowerMart/PowerCenter from Informatica
• Sagent Solution from Sagent Software
Reasons for “Dirty” Data
• Dummy Values
• Absence of Data
• Multipurpose Fields
• Inappropriate Use of Address Lines
• Violation of Business Rules
• Non-Unique Identifiers
• Data Integration Problems
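Three of these causes (dummy values, absence of data, and non-unique identifiers) lend themselves to simple automated checks. The sketch below is illustrative only: the field names and the set of dummy values are assumptions, and real source systems need rules tailored to their own data.

```python
from collections import Counter

# Hypothetical placeholder values that sometimes get typed into required fields.
DUMMY_VALUES = {"0000000000", "N/A", "XXXXX"}

def find_dirty_rows(rows, id_field, required_fields):
    """Return (row_index, problem) pairs for three of the causes listed above."""
    problems = []
    id_counts = Counter(row.get(id_field) for row in rows)
    for i, row in enumerate(rows):
        for field in required_fields:
            value = row.get(field)
            if value is None or str(value).strip() == "":
                problems.append((i, f"absence of data: {field}"))
            elif str(value).strip() in DUMMY_VALUES:
                problems.append((i, f"dummy value: {field}"))
        # The same identifier appearing on more than one row is flagged on each row.
        if id_counts[row.get(id_field)] > 1:
            problems.append((i, "non-unique identifier"))
    return problems

rows = [
    {"id": 1, "phone": "0000000000", "city": "Delhi"},
    {"id": 2, "phone": "26864848", "city": ""},
    {"id": 2, "phone": "26866968", "city": "Pune"},
]
issues = find_dirty_rows(rows, "id", ["phone", "city"])
```

The other causes, such as multipurpose fields or business-rule violations, usually need domain-specific rules rather than generic checks like these.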
I. Data Cleansing and Extracting
• Source systems contain “dirty data” that must be cleansed
• ETL software contains rudimentary data cleansing capabilities
• Specialized data cleansing software is often used; it is important for name and address correction and for householding functions
• Leading data cleansing vendors include Vality (Integrity), Harte-Hanks (Trillium), and Firstlogic (i.d.Centric)
Steps in Data Cleansing
• Parsing
• Correcting
• Standardizing
• Matching
• Consolidating
Parsing
• Parsing locates and identifies individual data elements in the source files and then isolates these data elements in the target files.
• Examples include parsing the first, middle, and last name; street number and street name; and city and state.
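A minimal Python sketch of parsing, assuming Western-style names separated by whitespace and street lines that begin with a number. Both functions and the regular expression are illustrative assumptions; commercial cleansing tools use far richer rules.

```python
import re

def parse_name(full_name):
    """Split a full name into first, middle, and last parts (naive whitespace split)."""
    parts = full_name.split()
    if not parts:
        return {"first": "", "middle": "", "last": ""}
    if len(parts) == 1:
        return {"first": parts[0], "middle": "", "last": ""}
    # Everything between the first and last token is treated as the middle name.
    return {"first": parts[0], "middle": " ".join(parts[1:-1]), "last": parts[-1]}

# Leading digits are the street number; the rest of the line is the street name.
STREET_RE = re.compile(r"^\s*(?P<number>\d+)\s+(?P<street>.+?)\s*$")

def parse_street(line):
    """Isolate street number and street name; return None if the line does not fit."""
    match = STREET_RE.match(line)
    return match.groupdict() if match else None

name = parse_name("John Q Public")
street = parse_street("221 Baker Street")
```

Returning `None` for lines that do not fit the pattern keeps unparsed input visible for manual review instead of guessing.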
Correcting
• Corrects parsed individual data components using sophisticated data algorithms and secondary data sources.
• Examples include replacing a vanity address and adding a zip code.
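Both examples can be sketched with lookup tables standing in for the secondary data sources. The place names and the postal code below are made up for illustration; a real implementation would consult an authoritative postal database.

```python
# Hypothetical secondary data source: (city, state) -> postal code.
POSTAL_LOOKUP = {("Exampleville", "XX"): "00000"}

# Hypothetical mapping from a vanity locality to the official postal city.
VANITY_LOCALITIES = {"Lakeview Heights": "Exampleville"}

def correct_address(address):
    """Return a corrected copy of a parsed address; the input is left unchanged."""
    corrected = dict(address)
    # Replace a vanity locality with the official city name, if one is known.
    corrected["city"] = VANITY_LOCALITIES.get(address["city"], address["city"])
    # Add a missing zip code from the secondary source, if the city is known.
    if not corrected.get("zip"):
        corrected["zip"] = POSTAL_LOOKUP.get((corrected["city"], corrected["state"]), "")
    return corrected

fixed = correct_address({"city": "Lakeview Heights", "state": "XX", "zip": ""})
```

Note that the zip lookup runs after the city has been corrected, so a vanity address can still receive the right code.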
Standardizing
• Standardizing applies conversion routines to transform data into its preferred (and consistent) format using both standard and custom business rules.
• Examples include adding a name prefix (title), replacing a nickname, and using a preferred street name.
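A small Python sketch of two such conversion routines: replacing a nickname and expanding a street suffix to a preferred form. The nickname and suffix tables are tiny illustrative assumptions, standing in for the standard and custom business rules an organization would maintain.

```python
# Hypothetical business-rule tables; a real deployment would maintain much larger ones.
NICKNAMES = {"bob": "Robert", "bill": "William", "liz": "Elizabeth"}
STREET_SUFFIXES = {
    "st": "Street", "st.": "Street",
    "ave": "Avenue", "ave.": "Avenue",
    "rd": "Road", "rd.": "Road",
}

def standardize_first_name(name):
    """Replace a known nickname; otherwise just normalize spacing and capitalization."""
    cleaned = name.strip()
    return NICKNAMES.get(cleaned.lower(), cleaned.title())

def standardize_street(street):
    """Expand an abbreviated street suffix (last word only) to its preferred form."""
    words = street.split()
    if words:
        words[-1] = STREET_SUFFIXES.get(words[-1].lower(), words[-1])
    return " ".join(words)

first = standardize_first_name(" bob ")
street = standardize_street("Main St.")
```

Applying the same routines to every source system is what makes the later matching step reliable.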
Matching
• Searches for and matches records within and across the parsed, corrected, and standardized data, based on predefined business rules, to eliminate duplicates.
• Examples include identifying similar names and addresses.
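One simple way to identify similar names and addresses is string similarity. The sketch below uses `difflib.SequenceMatcher` from the Python standard library; the 0.85 threshold and the record fields are assumptions, and production matching tools typically use more specialized algorithms than this pairwise comparison.

```python
from difflib import SequenceMatcher
from itertools import combinations

def similarity(a, b):
    """Case-insensitive similarity ratio between two strings, from 0.0 to 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def find_matches(records, threshold=0.85):
    """Return index pairs whose combined name and address are similar enough."""
    keys = [f"{r['name']} {r['address']}" for r in records]
    return [
        (i, j)
        for i, j in combinations(range(len(keys)), 2)
        if similarity(keys[i], keys[j]) >= threshold
    ]

records = [
    {"name": "Jon Smith", "address": "12 Main Street"},
    {"name": "John Smith", "address": "12 Main Street"},
    {"name": "Mary Jones", "address": "7 Oak Avenue"},
]
pairs = find_matches(records)
```

Comparing every pair is quadratic in the number of records, so this approach suits small batches only; larger sets are usually first grouped by a coarse key such as postal code.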
Consolidating
• Analyzing and identifying relationships between matched records and consolidating/merging them into ONE representation.
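A minimal Python sketch of merging matched records into one representation. The survivorship rule used here (keep the first non-empty value found for each field) is an assumption chosen for simplicity; real rules often prefer the most recent or most trusted source.

```python
def consolidate(matched_records):
    """Merge matched records into one, keeping the first non-empty value per field."""
    merged = {}
    for record in matched_records:
        for field, value in record.items():
            # Skip empty values, and never overwrite a value that was already kept.
            if value not in (None, "") and field not in merged:
                merged[field] = value
    return merged

golden = consolidate([
    {"name": "John Smith", "phone": "", "email": "js@example.com"},
    {"name": "Jon Smith", "phone": "555-0100", "email": ""},
])
```

Because earlier records win, the order of `matched_records` acts as the priority order of the sources.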
II. Data Transformation
• Transforms the data in accordance with the business rules and standards that have been established
• Examples include format changes, deduplication, splitting up fields, replacement of codes, derived values, and aggregates
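Several of these transformations can be shown in one short Python sketch. The row layout, the status code table, and the date format of the source (`DD/MM/YYYY`) are assumptions made for illustration.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical code table used for the "replacement of codes" step.
STATUS_CODES = {"A": "Active", "I": "Inactive"}

def transform(row):
    """Apply a few typical transformations to one source row."""
    first, _, last = row["name"].partition(" ")
    return {
        # Splitting up a field.
        "first_name": first,
        "last_name": last,
        # Format change: DD/MM/YYYY text to an ISO 8601 date string.
        "order_date": datetime.strptime(row["order_date"], "%d/%m/%Y").date().isoformat(),
        # Replacement of codes.
        "status": STATUS_CODES.get(row["status"], "Unknown"),
        # Derived value.
        "total": row["quantity"] * row["unit_price"],
        "region": row["region"],
    }

def total_by_region(rows):
    """Aggregate: sum the derived totals per region."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["region"]] += row["total"]
    return dict(totals)

source = [
    {"name": "Asha Rao", "order_date": "31/01/2024", "status": "A",
     "quantity": 3, "unit_price": 2.5, "region": "North"},
    {"name": "Ravi Mehta", "order_date": "01/02/2024", "status": "Z",
     "quantity": 2, "unit_price": 4.0, "region": "North"},
]
transformed = [transform(row) for row in source]
aggregates = total_by_region(transformed)
```

Unknown codes map to "Unknown" rather than raising an error, so a single bad row does not stop the whole run; whether that is the right policy depends on the business rules.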
III. Data Loading
• Data are physically moved to the data warehouse
• The loading takes place within a “load window”
• The trend is toward near-real-time updates of the data warehouse, as the warehouse is increasingly used for operational applications
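A batch load inside a load window can be sketched with Python's built-in `sqlite3` module standing in for the warehouse database. The `sales_fact` table and its columns are assumptions for this example; the point is only that a batch is applied atomically.

```python
import sqlite3

def load_batch(conn, rows):
    """Load one batch atomically: either every row is inserted or none is."""
    # Used as a context manager, the connection commits on success
    # and rolls back if an exception is raised inside the block.
    with conn:
        conn.executemany(
            "INSERT INTO sales_fact (sale_id, region, total) VALUES (?, ?, ?)",
            rows,
        )
    return conn.execute("SELECT COUNT(*) FROM sales_fact").fetchone()[0]

# An in-memory database plays the role of the warehouse in this sketch.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales_fact (sale_id INTEGER PRIMARY KEY, region TEXT, total REAL)"
)
row_count = load_batch(conn, [(1, "North", 7.5), (2, "North", 8.0)])
```

If a later batch contains a row that violates the primary key, the whole batch is rolled back, so the warehouse never holds a half-loaded batch when the load window closes.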
For a Successful Warehouse
• From day one, establish that warehousing is a joint user/builder project
• Establish that maintaining data quality will be an ONGOING joint user/builder responsibility
• Train the users one step at a time
• Consider building a high-level corporate data model in no more than three weeks
• Look closely at the data extracting, cleaning, and loading tools
• Determine a plan to test the integrity of the data in the warehouse
• From the start, get warehouse users in the habit of 'testing' complex queries
• Coordinate system roll-out with network administration personnel
• Implement a user-accessible automated directory to the information stored in the warehouse
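One possible element of a plan to test data integrity is a reconciliation check that compares the warehouse against its source after each load. The sketch below is only one such check; the field names are assumptions.

```python
def reconcile(source_rows, warehouse_rows, key, amount_field):
    """Compare keys and a control total between source and warehouse rows."""
    source_keys = {row[key] for row in source_rows}
    warehouse_keys = {row[key] for row in warehouse_rows}
    source_total = sum(row[amount_field] for row in source_rows)
    warehouse_total = sum(row[amount_field] for row in warehouse_rows)
    return {
        "missing_in_warehouse": sorted(source_keys - warehouse_keys),
        "unexpected_in_warehouse": sorted(warehouse_keys - source_keys),
        # Zero means the control totals agree.
        "amount_difference": warehouse_total - source_total,
    }

report = reconcile(
    [{"id": 1, "amount": 10}, {"id": 2, "amount": 20}, {"id": 3, "amount": 30}],
    [{"id": 1, "amount": 10}, {"id": 2, "amount": 20}],
    key="id",
    amount_field="amount",
)
```

A clean load yields empty lists and a zero difference; anything else points to rows lost or altered between extraction and loading.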
Data Warehouse Pitfalls
• Many warehouse end users will be trained and then never or seldom apply their training
• Large-scale data warehousing can become an exercise in data homogenizing
• Loading information only because it is available
• Providing no maintenance for the data warehouse
Contact Us
Business Name: Skyline Business School
Address: Hauz Khas Enclave, New Delhi - 110 016, India.
Phone: 91-11-26864848, 91-11-26866968
E-mail: info@skylinecollege.com
Resource: www.skylinecollege.com/our-programmes/pgp-data-warehousing
