Ch1 data-warehousing


material ==> data mining

Published in: Education, Technology, Business


  1. Data Warehousing
  2. Overview
      • What is a data warehouse?
      • Why a data warehouse?
      • Data reconciliation: ETL process
      • Data warehouse architectures
      • Star schema: dimensional modeling
      • Data analysis
  3. What is a Data Warehouse?
      • Defined in many different ways, but not rigorously.
          – A decision support database that is maintained separately from the organization’s operational database
          – Supports information processing by providing a solid platform of consolidated, historical data for analysis.
      • “A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision-making process.” (W. H. Inmon)
      • Data warehousing:
          – The process of constructing and using data warehouses
  4. Data Warehouse: Subject-Oriented
      • Organized around major subjects, such as customer, product, sales
      • Focusing on the modeling and analysis of data for decision makers, not on daily operations or transaction processing
      • Provide a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process
  5. Data Warehouse: Integrated
      • Constructed by integrating multiple, heterogeneous data sources
          – relational databases, flat files, on-line transaction records
      • Data cleaning and data integration techniques are applied.
          – Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources
              • E.g., Hotel price: currency, tax, breakfast covered, etc.
          – When data is moved to the warehouse, it is converted.
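The hotel price example can be sketched in a few lines of Python. This is only an illustration of converting records from two sources to one warehouse convention. The exchange rates, tax rate, breakfast surcharge, field names, and the `integrate` helper are all hypothetical values chosen for the sketch, not part of the slides.

```python
from dataclasses import dataclass

# Hypothetical conversion constants, chosen only for illustration.
RATES_TO_USD = {"USD": 1.0, "EUR": 1.10}
TAX_RATE = 0.10
BREAKFAST_USD = 12.0


@dataclass
class HotelPrice:
    hotel: str
    usd_per_night: float  # warehouse convention: USD, tax and breakfast included


def integrate(record: dict) -> HotelPrice:
    """Convert one source record to the single warehouse convention."""
    price = record["price"] * RATES_TO_USD[record["currency"]]
    if not record["tax_included"]:
        price *= 1 + TAX_RATE
    if not record["breakfast_included"]:
        price += BREAKFAST_USD
    return HotelPrice(record["hotel"], round(price, 2))


# Two sources that describe prices with different conventions.
source_a = {"hotel": "Alpha", "price": 100.0, "currency": "USD",
            "tax_included": False, "breakfast_included": True}
source_b = {"hotel": "Beta", "price": 100.0, "currency": "EUR",
            "tax_included": True, "breakfast_included": False}

warehouse_rows = [integrate(source_a), integrate(source_b)]
```

After conversion both rows are directly comparable, which is the point of the "integrated" property.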
  6. Data Warehouse: Time Variant
      • The time horizon for the data warehouse is significantly longer than that of operational systems
          – Operational database: current value data
          – Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years)
      • Every key structure in the data warehouse
          – Contains an element of time, explicitly or implicitly
          – But the key of operational data may or may not contain “time element”
  7. Data Warehouse: Nonvolatile
      • A physically separate store of data transformed from the operational environment
      • Operational update of data does not occur in the data warehouse environment
          – Does not require transaction processing, recovery, and concurrency control mechanisms
          – Requires only two operations in data accessing:
              • initial loading of data and access of data
  8. Trends in Organisations that encourage the need for data warehousing
      • No single system of record
      • Multiple systems are not synchronized
      • Organisations want to analyse the activities in a balanced way
      • Customer relationship management
      • Supplier relationship management
  9. Need for Data Warehousing
      • Integrated, company-wide view of high-quality information (from different databases)
      • Separation of operational and informational systems and data (for improved performance)
  10. Operational & Informational System
      The need to separate operational and informational systems is based on three primary factors:
      • A data warehouse centralizes data that are scattered throughout disparate operational systems and makes them available for decision support applications
      • A properly designed data warehouse adds value to data by improving their quality
      • A separate data warehouse eliminates much of the contention for resources that results when informational applications are mixed with operational processing
  12. Data Reconciliation
      • Typical operational data is:
          – Transient: not historical
          – Not normalised (perhaps due to denormalisation for performance)
          – Restricted in scope: not comprehensive
          – Sometimes poor quality: inconsistencies and errors
      • After ETL (Extract, Transform, Load), data should be:
          – Detailed: not summarized yet
          – Historical: periodic
          – Normalised: 3rd normal form or higher
          – Comprehensive: enterprise-wide perspective
          – Timely: data should be current enough to assist decision-making
          – Quality controlled: accurate with full integrity
  13. The ETL Process / Data Reconciliation: Main Steps
      • Capture/Extract
      • Scrub or data cleansing
      • Transform
      • Load and Index
  14. • Static extract = capturing a snapshot of the source data at a point in time
      • Incremental extract = capturing changes that have occurred since the last static extract
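The difference between the two extract types can be sketched as follows. This is a minimal illustration that assumes each source row carries an `updated_at` timestamp. That column, the sample rows, and both function names are hypothetical, and real systems may detect changes in other ways (for example from database logs).

```python
from datetime import datetime

# Hypothetical source table; "updated_at" is an assumed change-tracking column.
source_rows = [
    {"id": 1, "name": "Ann", "updated_at": datetime(2024, 1, 1)},
    {"id": 2, "name": "Bob", "updated_at": datetime(2024, 1, 5)},
    {"id": 3, "name": "Cy", "updated_at": datetime(2024, 1, 9)},
]


def static_extract(rows):
    """Snapshot: copy every source row as it exists at this point in time."""
    return [dict(r) for r in rows]


def incremental_extract(rows, last_extract_time):
    """Capture only the rows that changed since the previous extract."""
    return [dict(r) for r in rows if r["updated_at"] > last_extract_time]


snapshot = static_extract(source_rows)
changes = incremental_extract(source_rows, datetime(2024, 1, 4))
```

Here `snapshot` holds all three rows, while `changes` holds only the rows modified after 4 January 2024.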
  15. • Fixing errors: misspellings, erroneous dates, incorrect field usage, mismatched addresses, missing data, duplicate data, inconsistencies
      • Also: decoding, reformatting, time stamping, conversion, key generation, merging, error detection/logging, locating missing data
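A few of these cleansing tasks (misspellings, erroneous dates, duplicates, missing data, error logging) can be sketched in plain Python. The sample rows, the `SPELLING_FIXES` lookup, and the `scrub` function are hypothetical, and the rules are deliberately simplistic.

```python
from datetime import datetime

# Hypothetical raw extract containing typical quality problems.
raw = [
    {"id": 1, "city": "Lodnon", "joined": "2023-02-30"},  # misspelling, bad date
    {"id": 2, "city": "Paris", "joined": "2023-03-01"},
    {"id": 2, "city": "Paris", "joined": "2023-03-01"},   # duplicate record
    {"id": 3, "city": None, "joined": "2023-04-15"},      # missing data
]
SPELLING_FIXES = {"Lodnon": "London"}  # assumed correction table


def scrub(rows):
    """Return cleaned rows plus a log of every problem that was detected."""
    clean, error_log, seen = [], [], set()
    for row in rows:
        if row["id"] in seen:
            error_log.append((row["id"], "duplicate"))
            continue
        seen.add(row["id"])
        row = dict(row)  # do not modify the source record
        row["city"] = SPELLING_FIXES.get(row["city"], row["city"])
        if row["city"] is None:
            error_log.append((row["id"], "missing city"))
        try:
            datetime.strptime(row["joined"], "%Y-%m-%d")
        except ValueError:
            error_log.append((row["id"], "erroneous date"))
            row["joined"] = None
        clean.append(row)
    return clean, error_log


clean_rows, errors = scrub(raw)
```

The error log keeps a record of what was found, so that missing or rejected values can be followed up at the source.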
  16. • Record-level:
          – Selection: data partitioning
          – Joining: data combining
          – Aggregation: data summarization
      • Field-level:
          – single-field: from one field to one field
          – multi-field: from many fields to one, or one field to many
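Each transformation type listed here can be shown with a one-line Python equivalent. The sketch below uses hypothetical sales and store data. All names and values are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical source records.
sales = [
    {"store_id": 1, "amount": 20.0, "region": "N"},
    {"store_id": 1, "amount": 30.0, "region": "N"},
    {"store_id": 2, "amount": 50.0, "region": "S"},
]
stores = {1: "Downtown", 2: "Airport"}

# Record-level selection: partition the data by a condition.
north = [s for s in sales if s["region"] == "N"]

# Record-level joining: combine each sale with its store description.
joined = [{**s, "store_name": stores[s["store_id"]]} for s in sales]

# Record-level aggregation: summarize amounts per store.
totals = defaultdict(float)
for s in sales:
    totals[s["store_id"]] += s["amount"]

# Single-field transformation: one source field maps to one target field.
REGION_NAMES = {"N": "North", "S": "South"}
decoded = [REGION_NAMES[s["region"]] for s in sales]

# Multi-field transformation: one source field is split into two target fields.
first, last = "Ada Lovelace".split(" ", 1)
```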
  17. • Refresh mode: bulk rewriting of target data at periodic intervals
      • Update mode: only changes in source data are written to data warehouse
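The two load modes can be contrasted with a small sketch. It models the warehouse table as a dictionary keyed by `id`, which is a simplification. Both function names are hypothetical, and a warehouse that keeps history would typically append changed rows instead of overwriting by key.

```python
def refresh_load(source):
    """Refresh mode: rebuild the whole target from the current source."""
    return {row["id"]: dict(row) for row in source}


def update_load(target, changes):
    """Update mode: write only the changed source rows into the target."""
    for row in changes:
        target[row["id"]] = dict(row)
    return target


source = [{"id": 1, "qty": 5}, {"id": 2, "qty": 7}]

# Initial bulk load, then a later load that carries only the changes.
warehouse = refresh_load(source)
warehouse = update_load(warehouse, [{"id": 2, "qty": 9}, {"id": 3, "qty": 1}])
```

Refresh mode touches every row on each run, while update mode only touches rows 2 and 3 in this example.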
  18. Data Warehouse Architectures
      • Generic Two-Level Architecture
      • Independent Data Mart
      • Dependent Data Mart and Operational Data Store
      • Logical Data Mart and @ctive Warehouse
  19. Generic two-level architecture
      • One companywide warehouse
      • Periodic extraction → data is not completely current in warehouse
  20. Independent data mart
      • Data marts: mini-warehouses, limited in scope
      • Separate ETL for each independent data mart
      • Data access complexity due to multiple data marts
  21. Dependent data mart with operational data store
      • ODS provides option for obtaining current data
      • Single ETL for enterprise data warehouse (EDW)
      • Dependent data marts loaded from EDW
  22. • ODS and data warehouse are one and the same
      • Near real-time ETL for @ctive Data Warehouse
      • Data marts are NOT separate databases, but logical views of the data warehouse → Easier to create new data marts
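The idea of a data mart as a logical view can be sketched with Python's built-in `sqlite3` module. The table, the view name `north_mart`, and the sample rows are hypothetical. The sketch only shows that a view exposes a subset of the warehouse without copying any data.

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Hypothetical warehouse table with a few sample rows.
con.execute("CREATE TABLE sales (region TEXT, product TEXT, amount REAL)")
con.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("North", "Widget", 10.0), ("South", "Widget", 20.0), ("North", "Gadget", 5.0)],
)

# The logical data mart is only a view definition; no rows are duplicated.
con.execute(
    "CREATE VIEW north_mart AS "
    "SELECT product, amount FROM sales WHERE region = 'North'"
)

mart_rows = con.execute(
    "SELECT product, amount FROM north_mart ORDER BY product"
).fetchall()
```

Creating another mart is then a matter of defining another view, which is why this architecture makes new data marts easier to create.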
  23. Data Characteristics: Status vs. Event Data
      • Status
      • Event: a database action (create/update/delete) that results from a transaction
      • Status
  24. Data Characteristics: Transient vs. Periodic Data
      • Transient data: changes to existing records are written over previous records, thus destroying the previous data content
      • Periodic data: data are never physically altered or deleted once they have been added to the store
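A small sketch makes the contrast concrete. It writes the same two balance values to an overwrite-in-place store and to an append-only store. The store structures, the account key, and the values are hypothetical.

```python
from datetime import date

transient_store = {}  # one current value per key; old values are lost
periodic_store = []   # append-only; every version is kept with its date


def write_transient(key, value):
    """Overwrite the previous value, destroying the earlier content."""
    transient_store[key] = value


def write_periodic(key, value, as_of):
    """Add a new time-stamped record; existing records are never altered."""
    periodic_store.append({"key": key, "value": value, "as_of": as_of})


for value, day in [(100, date(2024, 1, 1)), (80, date(2024, 2, 1))]:
    write_transient("acct-1", value)
    write_periodic("acct-1", value, day)

history = [r["value"] for r in periodic_store]
```

Only the periodic store can still answer what the balance was in January.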
  25. Star schema
      • Fact tables contain factual or quantitative data
      • Dimension tables contain descriptions about the subjects of the business
      • 1:N relationship between dimension tables and fact tables
      • Dimension tables are denormalized to maximize performance
      • Star schema: simple database design in which dimensional data are separated from fact data. Excellent for queries, but bad for online transaction processing
  26. Star schema example
      • Fact table provides statistics for sales broken down by product, period and store dimensions
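Since the slide's diagram did not survive in the transcript, here is a rough stand-in using Python's built-in `sqlite3` module. It builds a sales fact table with product, period and store dimensions, as the slide describes. The column names, measures, and sample rows are assumptions and are not taken from the original figure.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- Dimension tables: descriptive attributes (hypothetical columns).
CREATE TABLE product (product_key INTEGER PRIMARY KEY, description TEXT);
CREATE TABLE period  (period_key  INTEGER PRIMARY KEY, year INTEGER, quarter INTEGER);
CREATE TABLE store   (store_key   INTEGER PRIMARY KEY, city TEXT);

-- Fact table: quantitative measures plus one foreign key per dimension (1:N).
CREATE TABLE sales (
    product_key  INTEGER REFERENCES product(product_key),
    period_key   INTEGER REFERENCES period(period_key),
    store_key    INTEGER REFERENCES store(store_key),
    units_sold   INTEGER,
    dollars_sold REAL
);

INSERT INTO product VALUES (1, 'Widget'), (2, 'Gadget');
INSERT INTO period  VALUES (1, 2024, 1), (2, 2024, 2);
INSERT INTO store   VALUES (1, 'Austin'), (2, 'Boston');
INSERT INTO sales   VALUES (1, 1, 1, 3, 30.0), (1, 2, 1, 2, 20.0), (2, 1, 2, 1, 15.0);
""")

# Typical star-schema query: join the fact table to a dimension and aggregate.
query = """
SELECT p.description, SUM(f.dollars_sold)
FROM sales AS f
JOIN product AS p ON p.product_key = f.product_key
GROUP BY p.description
ORDER BY p.description
"""
sales_by_product = con.execute(query).fetchall()
```

The same pattern, with a join to `period` or `store`, breaks sales down by the other two dimensions.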
  28. On-Line Analytical Processing (OLAP)
      • The use of a set of graphical tools that provides users with multidimensional views of their data and allows them to analyze the data using simple windowing techniques
      • Relational OLAP (ROLAP)
          – Traditional relational representation
      • Multidimensional OLAP (MOLAP)
          – Cube structure
      • OLAP Operations
          – Cube slicing: come up with 2-D view of data
          – Drill-down: going from summary to more detailed views
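The two OLAP operations named here can be sketched on a tiny in-memory cube. This is a toy model, not how ROLAP or MOLAP engines store data. The dimension order (product, quarter, store), the function names, and the cell values are hypothetical.

```python
from collections import defaultdict

# Toy cube: cells keyed by (product, quarter, store); the measure is units sold.
cube = {
    ("Widget", "Q1", "Austin"): 3,
    ("Widget", "Q2", "Austin"): 2,
    ("Gadget", "Q1", "Boston"): 1,
    ("Gadget", "Q1", "Austin"): 4,
}


def slice_cube(cube, product):
    """Fix one product to obtain a 2-D (quarter x store) view of the data."""
    return {(q, s): v for (p, q, s), v in cube.items() if p == product}


def summarize(cube, *dims):
    """Total the measure over the kept dimensions (0=product, 1=quarter, 2=store)."""
    out = defaultdict(int)
    for key, value in cube.items():
        out[tuple(key[d] for d in dims)] += value
    return dict(out)


widget_slice = slice_cube(cube, "Widget")
summary = summarize(cube, 0)    # summary view: totals by product
detail = summarize(cube, 0, 1)  # drill-down: totals by product and quarter
```

Drilling down simply adds a dimension to the grouping, so each summary total is split into more detailed cells.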
  29. Data Warehouse vs. Operational DBMS
      • OLTP (on-line transaction processing)
          – Major task of traditional relational DBMS
          – Day-to-day operations: purchasing, inventory, banking, manufacturing, payroll, registration, accounting, etc.
      • OLAP (on-line analytical processing)
          – Major task of data warehouse system
          – Data analysis and decision making
      • Distinct features (OLTP vs. OLAP):
          – User and system orientation: customer vs. market
          – Data contents: current, detailed vs. historical, consolidated
          – Database design: ER + application vs. star + subject
          – View: current, local vs. evolutionary, integrated
          – Access patterns: update vs. read-only but complex queries
  30. OLTP vs. OLAP (each row gives the OLTP value, then the OLAP value)
      • users: clerk, IT professional vs. knowledge worker
      • function: day to day operations vs. decision support
      • DB design: application-oriented vs. subject-oriented
      • data: current, up-to-date, detailed, flat relational, isolated vs. historical, summarized, multidimensional, integrated, consolidated
      • usage: repetitive vs. ad-hoc
      • access: read/write, index/hash on prim. key vs. lots of scans
      • unit of work: short, simple transaction vs. complex query
      • # records accessed: tens vs. millions
      • # users: thousands vs. hundreds
      • DB size: 100MB-GB vs. 100GB-TB
      • metric: transaction throughput vs. query throughput, response
  31. Slicing a data cube
  32. Example: Drill-down
      • Summary report
      • Drill-down with color added
  33. Data Warehouse Usage
      • Three kinds of data warehouse applications
          – Information processing
              • supports querying, basic statistical analysis, and reporting using crosstabs, tables, charts and graphs
          – Analytical processing
              • multidimensional analysis of data warehouse data
              • supports basic OLAP operations, slice-dice, drilling, pivoting
          – Data mining
              • knowledge discovery from hidden patterns
              • supports associations, constructing analytical models, performing classification and prediction, and presenting the mining results using visualization tools