Published on

  • Be the first to comment

  • Be the first to like this

No Downloads
Total views
On SlideShare
From Embeds
Number of Embeds
Embeds 0
No embeds

No notes for slide


  1. 1. Data Warehousing Dr. Anol Bhattacherjee University of South Florida E-mail: Web: Telephone: (813) 974-6760 Copyright  2006 Anol Bhattacherjee
  2. 2. Agenda <ul><li>Data warehouse: </li></ul><ul><ul><li>Production databases versus data warehouse. </li></ul></ul><ul><ul><li>Characteristics. </li></ul></ul><ul><ul><li>Structure: Dimensions, hierarchies. </li></ul></ul><ul><ul><li>Design: Star schema. </li></ul></ul><ul><ul><li>Building: ETL methodology. </li></ul></ul><ul><li>Online analytical processing (OLAP): </li></ul><ul><ul><li>Examples. </li></ul></ul><ul><ul><li>Techniques. </li></ul></ul><ul><li>Data mining: </li></ul><ul><ul><li>Techniques. </li></ul></ul><ul><li>Data warehouse administrator. </li></ul>
  3. 3. Two Categories of Business Systems <ul><li>Transaction processing systems (TPS): </li></ul><ul><ul><li>Application systems used by company employees for everyday operational tasks, such as sales, manufacturing, and customer support. </li></ul></ul><ul><ul><li>Employ production databases. </li></ul></ul><ul><li>Decision support systems (DSS): </li></ul><ul><ul><li>Systems specifically designed to aid managers in decision-making tasks, such as budgeting, forecasting, and planning. </li></ul></ul><ul><ul><li>Employ data warehouses and/or data marts. </li></ul></ul><ul><ul><li>Require analytical capabilities, such as data mining (OLAP) tools. </li></ul></ul><ul><ul><li>Also called business intelligence (BI) systems. </li></ul></ul>
  4. 4. Data Warehouse <ul><li>What it is: </li></ul><ul><ul><li>A subject-oriented, integrated, time-variant enterprise-wide repository of historical data designed to support executive decision-making. </li></ul></ul><ul><ul><li>Data is aggregated by business dimensions (e.g., region, year, product line), and can be analyzed along these dimensions. </li></ul></ul><ul><ul><li>Allows trend analysis, planning, etc. without complex SQL queries. </li></ul></ul><ul><li>Example: Wal-Mart’s RetailLink system: </li></ul><ul><ul><li>Gives suppliers full access to WM’s sales and inventory data in real-time for collaborative planning, forecasting, and replenishment (CPFR). </li></ul></ul><ul><ul><li>Powered by NCR’s Teradata servers: </li></ul></ul><ul><ul><ul><li>Runs 30+ business applications. </li></ul></ul></ul><ul><ul><ul><li>Supports 18,000+ users (WM managers). </li></ul></ul></ul><ul><ul><ul><li>Handles 120,000 queries/week. </li></ul></ul></ul><ul><ul><ul><li>Receives 8.4 million updates/minute (transactions) at peak-time. </li></ul></ul></ul>
  5. 5. Data Warehouse versus Databases Supports decision support systems used for managerial decision making Supports transaction processing systems used in everyday business operations Terabytes in size MB/GB in size Supports special analytical operations such as drill-down and slice and dice No special analytical operations supported Aggregated from production databases Exists independently Poor for data input/output, but uses vector arithmetic for fast computation Good for data input/output, but poor for computation (e.g., aggregate) Supports time-series/periodicity No specific support for time-series Data stored in multidimensional format Data stored in relational format Data Warehouse Production Databases
  6. 6. Characteristics of DW Data <ul><li>Subject-oriented: </li></ul><ul><ul><li>Data is organized around subjects or business dimensions, such as sales, customers, orders, claims, accounts, employees, etc. </li></ul></ul><ul><li>Integrated: </li></ul><ul><ul><li>Data is collected from several transactional databases, and integrated in a way to provide a unified picture of each subject over time. </li></ul></ul><ul><ul><li>Data from different databases is transformed into a common schema, measurement, code, data type. </li></ul></ul><ul><li>Aggregated: </li></ul><ul><ul><li>Data stored is not transaction-level, but aggregated by products, regions, months/years, or some other business dimension. </li></ul></ul>
  7. 7. Characteristics of DW Data <ul><li>Historical: </li></ul><ul><ul><li>Data updated at some time interval: weekly, monthly, etc. </li></ul></ul><ul><ul><li>Data stored by weeks, months, etc. for historical comparison and trend analysis. </li></ul></ul><ul><li>Time variant: </li></ul><ul><ul><li>Data always includes a timestamp (e.g., sales by weeks, months, quarters, or years). </li></ul></ul><ul><li>Non-volatile: </li></ul><ul><ul><li>Data is historical, and does not change with time. </li></ul></ul><ul><li>Denormalized: </li></ul><ul><ul><li>Denormalized data is used to improve query performance, though it also increases update time and introduces data integrity problems. </li></ul></ul><ul><ul><li>Works because historic data in the data warehouse is rarely updated. </li></ul></ul>
  8. 8. Data Warehouse versus Data Marts <ul><li>Enterprise data warehouse (EDW): </li></ul><ul><ul><li>Large-scale data repository that incorporates aggregated historical data for an entire company, division, or business unit. </li></ul></ul><ul><ul><li>Built around many subjects, can support a wide range of decision tasks. </li></ul></ul><ul><li>Data marts: </li></ul><ul><ul><li>Small-scale data repository serving the needs of one department. </li></ul></ul><ul><ul><li>Based on a limited number of subjects (sometimes one). </li></ul></ul><ul><ul><li>Constructed from few transactional databases or a subset of EDW data. </li></ul></ul><ul><ul><li>Provides a buffer between managers and EDW: managers work with DM data, so that even if the DM data is corrupted, EDW data is unchanged. </li></ul></ul><ul><li>Which is done first: </li></ul><ul><ul><li>Top-down development: EDW is created first, from which data is extracted to create one or more DMs. </li></ul></ul><ul><ul><li>Bottom-up approach: Build independent DMs as needed, overall EDW built later from existing DMs. </li></ul></ul>
  9. 9. Dimensions of a Data Warehouse Two-dimensional data warehouse Three-dimensional data warehouse Data warehouses can have four or more dimensions
  10. 10. Dimensions and Hierarchies <ul><li>Two key characteristics of a DW: </li></ul><ul><ul><li>Subject orientation of data. </li></ul></ul><ul><ul><li>Temporal nature of data (time dimension). </li></ul></ul><ul><li>Multidimensional databases: </li></ul><ul><ul><li>Each DW subject reflects a separate business dimension (e.g., product line, sales area, year, etc.), hence DW are often called multidimensional databases. </li></ul></ul><ul><ul><li>Multidimensional databases are not yet mature or sophisticated; hence multidimensional data is often stored in relational databases. </li></ul></ul><ul><li>Hierarchies: </li></ul><ul><ul><li>Business dimensions can be organized into hierarchies: </li></ul></ul><ul><ul><ul><li>Sales area by city, county state region, etc. </li></ul></ul></ul><ul><ul><ul><li>Time grouped by day, month, quarter, year, etc. </li></ul></ul></ul><ul><ul><li>Drill-down analysis: Extracting data from higher to lower hierarchy. </li></ul></ul><ul><ul><li>Slice and dice: Extracting data from two hierarchies. </li></ul></ul>
  11. 11. Hierarchies <ul><li>Products (hierarchy): </li></ul><ul><li>By product lines </li></ul><ul><li>By responsibility centers </li></ul><ul><li>By work centers </li></ul>Sales <ul><li>Sales area (hierarchy): </li></ul><ul><li>Region: Northeast </li></ul><ul><ul><li>State: NY </li></ul></ul><ul><ul><ul><li>Area: NYC </li></ul></ul></ul><ul><ul><ul><li>Area: Albany </li></ul></ul></ul><ul><ul><ul><li>Area: Buffalo </li></ul></ul></ul><ul><ul><ul><li>Area: Long Island </li></ul></ul></ul><ul><ul><li>State: NJ </li></ul></ul><ul><ul><li>State: PA </li></ul></ul><ul><li>Region: Midwest </li></ul><ul><li>Region: West </li></ul><ul><li>Time (hierarchy): </li></ul><ul><li>Year: 1995 </li></ul><ul><ul><li>Quarter: Q1 </li></ul></ul><ul><ul><ul><li>Month: January </li></ul></ul></ul><ul><ul><ul><ul><li>Day: 01 </li></ul></ul></ul></ul><ul><ul><ul><ul><li>Day: 02 </li></ul></ul></ul></ul><ul><ul><ul><li>Month: February </li></ul></ul></ul><ul><ul><li>Quarter: Q2 </li></ul></ul><ul><li>Year: 1996 </li></ul><ul><li>Year: 1997 </li></ul><ul><ul><li>Drill-down: Overall sales figures for NY vs. sales figures for NYC, Albany, Buffalo, etc. </li></ul></ul><ul><ul><li>Slice and dice: Sales of individual product lines in NYC vs. Albany, vs. Buffalo, etc. </li></ul></ul>
  12. 12. Designing a Data Warehouse <ul><li>Star schema: </li></ul><ul><ul><li>Design technique use to create multidimensional tables using a relational database. </li></ul></ul><ul><li>Two components: </li></ul><ul><ul><li>Fact table: Sales. </li></ul></ul><ul><ul><li>Dimension tables: Time Period, Salesperson, Products. </li></ul></ul><ul><li>Snowflake design: </li></ul><ul><ul><li>One dimension table (Car) leads to another dimension table (Manufacturer). </li></ul></ul>Lucky Rent-A-Car Data Warehouse Design
  13. 13. Star Schema Example Fact table provides sales statistics broken down by product, period and store dimensions Dimension tables contain descriptions about subjects of the business 1:N relationship between fact and dimension tables
  14. 14. Star Schema With Sample Data
  15. 15. Design Considerations in Star Schema <ul><li>Fact table: </li></ul><ul><ul><li>Should contain quantitative time-period data. </li></ul></ul><ul><ul><li>Granularity: what level of detail should you store in fact table? </li></ul></ul><ul><ul><li>Transactional grain (finest level) versus aggregated grain (summarized). </li></ul></ul><ul><ul><li>Finer grain provides better analysis capability, but require more rows in dimension and fact tables and hence, slower performance. </li></ul></ul><ul><li>Dimension table: </li></ul><ul><ul><li>Keys must be time-invariant (i.e., non-business dependent). </li></ul></ul><ul><ul><li>Should be denormalized to maximize performance. </li></ul></ul><ul><li>Relationship: </li></ul><ul><ul><li>1:N relationship between fact and dimension tables </li></ul></ul>
  16. 16. Building a Data Warehouse <ul><li>ETL Methodology : </li></ul><ul><li>Extract data </li></ul><ul><li>Transform data </li></ul><ul><li>Load data </li></ul>API Flat files Oracle 3 rd party feeds VSAM Load Transform Extract Temporary data hub Data warehouse Data marts
  17. 17. ETL Methodology <ul><li>Data extraction: </li></ul><ul><ul><li>Process of copying relevant data from a variety of transactional databases for inclusion in a DW. </li></ul></ul><ul><ul><li>May occur at regular intervals (e.g., weekly, monthly) to add new data. </li></ul></ul><ul><ul><li>Data from incompatible databases, flat files, text documents, etc. must be filtered through appropriate API (application programming interfaces) as needed. </li></ul></ul><ul><li>Data transformation: </li></ul><ul><ul><li>Next slide. </li></ul></ul><ul><li>Data loading: </li></ul><ul><ul><li>Extracted, cleaned, and transformed data is loaded into DW at a predetermined data refresh frequency. </li></ul></ul>
  18. 18. Building a Data Warehouse <ul><li>Data transformation/cleaning: </li></ul><ul><ul><li>Data extracted from transactional databases must be cleaned (“scrubbed”) and transformed before loading into a DW. </li></ul></ul><ul><ul><li>Format differences across different tables/databases must be reconciled. </li></ul></ul><ul><ul><li>Missing or misspelled data values must be resolved. </li></ul></ul><ul><ul><li>Erroneous data are identified using application programs, and scrutinized/ corrected by DW analysts using system-generated exception reports. </li></ul></ul><ul><ul><li>Transaction-level data is aggregated by business dimensions. </li></ul></ul><ul><ul><li>Key step in DW construction since DW is very sensitive to data errors. </li></ul></ul>PK: SS# (123-45-6789) Name (Robert G. Smith) Life Insurance Database PK: DL# (FL-B12345678) Name (Bob Smith) Auto Insurance Database PK: Acc# (12345678905) Name (R. G. Smith) Home Insurance Database Challenges of Data Reconciliation
  19. 19. Data Cleaning Example Good Reading Bookstores Questionable data: Is book quantity correct? Out-of-range data: A single book can’t cost $3,200.99 Referential integrity problem: Customer# 12738 does not exist in Customers table Possible misspelling: Do rows 3 & 8 refer to the same person? Missing data: City is blank. Questionable data: State for rows 2 & 6 could be the same
  20. 20. Using a Data Warehouse: OLAP <ul><li>Online analytic processing (OLAP): </li></ul><ul><ul><li>A decision support approach based on viewing data by dimensions. </li></ul></ul><ul><ul><li>Well suited for multidimensional data hierarchies in a DW. </li></ul></ul><ul><li>OLAP techniques: </li></ul><ul><ul><li>Drill-down: Retrieving finer levels of data detail. </li></ul></ul><ul><ul><li>Slice: Data subset based on a single value of one dimension. </li></ul></ul><ul><ul><li>Pivot or Rotation: Interchanging data dimensions in a slice. </li></ul></ul>Slice operation
  21. 21. Drill Down Drill-down by Package Size Drill-down by Package Size and Color
  22. 22. OLAP Reports: Yahoo Stores Yahoo Page Stats: Last 365 Days [ vitanet ] 50/16315 entries shown. [ See All ] [ See More ] [ See Fewer ] Sort: [By Hits] [ By Count of Items Sold ] [ By Count of Orders ] [ By Revenue ] Download: [ Spreadsheet ] Hits Items Sold Revenue Page 329,215 Vitanet Front Page 79,640 9,373 211567.34 Xenadrine RFA-1, 120 capsules, 24,790 Ind 23,147 Rate Us 16,626 Shop by 100s of Vitamin/Mineral 12,645 856 19776.90 Ripped Fuel 200 capsules Twinlab 11,885 TwinLab Products 7,471 11 172.45 Free Samples Drawing Win 6000 grams 7,446 Vita-net Nutritional Products 7,231 Creatine Monohydrate 6,917 Aphrodisiacs 6,896 Androstenedione 6,162 On Sale Items 6,149 55 1207.25 Natural Sex Woman 5,859 Growth Hormone 5,843 1,070 37115.60 Hydroxycut 240 Capsules MuscleTech 5,756 247 9345.35 Creatine Monohydrate 2000 grams
  23. 23. OLAP Examples <ul><li>Sears’ Strategic Performance Reporting System (SPRS): </li></ul><ul><ul><li>Goal: Daily tuning of buying, merchandising, and marketing strategies. </li></ul></ul><ul><ul><li>Tracks real-time sales; inventory in stores, transit, and distribution centers; promotion outcomes by item, location, promotion, etc. </li></ul></ul><ul><ul><li>Analytics: Price-reduction modeling to “move” products; inventory analysis; customer profitability analysis; store reconfiguration. </li></ul></ul><ul><ul><li>1.7 TB data warehouse, replacing 18 prior databases. </li></ul></ul><ul><li>British Telecom’s Interactive and Reporting Information System (IRIS): </li></ul><ul><ul><li>Goal: Tracking 10,000+ ongoing service projects using key performance indicators such as project cost, status, etc. </li></ul></ul><ul><ul><li>Permits precanned reports, modeling, forecasting, analytics, etc. </li></ul></ul><ul><ul><li>Implemented using SAP’s Strategic Enterprise Management suite using real-time data feed from SAP’s Business Warehouse application. </li></ul></ul>
  24. 24. Using a Data Warehouse: Data Mining <ul><li>Data mining: </li></ul><ul><ul><li>Searching for hidden patterns or knowledge in a company’s data using a blend of statistical, AI, and computer graphics techniques. </li></ul></ul><ul><ul><li>Goal is to discover new knowledge or explain observed events. </li></ul></ul><ul><li>Applications: </li></ul><ul><ul><li>Identify patterns of credit card fraud. </li></ul></ul><ul><ul><li>Identify patterns of consumer purchases. </li></ul></ul><ul><li>Data mining techniques: </li></ul><ul><ul><li>Decision trees. </li></ul></ul><ul><ul><li>Case-based reasoning. </li></ul></ul><ul><ul><li>Neural networks. </li></ul></ul><ul><ul><li>Genetic algorithms. </li></ul></ul>
  25. 25. Data Warehouse Administrator <ul><li>Specialized personnel in charge of the DW maintenance/upgrade. </li></ul><ul><li>Needs three kinds of expertise: </li></ul><ul><ul><li>Business expertise: </li></ul></ul><ul><ul><ul><li>Company’s business processes and transactional data/databases. </li></ul></ul></ul><ul><ul><ul><li>Company’s business goals to know what data should be stored in DW. </li></ul></ul></ul><ul><ul><li>Data expertise: </li></ul></ul><ul><ul><ul><li>Transactional data/databases for selection and integration into DW. </li></ul></ul></ul><ul><ul><ul><li>Designing and overseeing data cleaning/ transformation efforts. </li></ul></ul></ul><ul><ul><li>Technical expertise: </li></ul></ul><ul><ul><ul><li>Principles of data warehouse design. </li></ul></ul></ul><ul><ul><ul><li>Knowledge of OLAP and data mining techniques. </li></ul></ul></ul><ul><ul><ul><li>Experience with handling very large databases with unique needs for security, backup and recovery, data distribution, etc. </li></ul></ul></ul>
  26. 26. Challenges in Data Warehousing <ul><li>Data cleaning and finding more “dirty” data than expected. </li></ul><ul><li>Coordinating the regular appending of new data from transactional databases to the data warehouse. </li></ul><ul><li>Managing very large databases. </li></ul><ul><li>Building and maintaining the data dictionary. </li></ul>