Data Warehouse Modeling


  1. 1. Data Warehouse Modeling. Thijs Kupers, Vivek Jonnaganti
  2. 2. Agenda <ul><li>Introduction </li></ul><ul><li>Data Warehousing Concepts </li></ul><ul><li>OLAP </li></ul><ul><li>Dimensional Modeling </li></ul><ul><li>Conceptual Modeling </li></ul><ul><li>Indexing </li></ul><ul><li>Conclusion </li></ul>
  3. 3. Introduction
  4. 4. The Evolution <ul><li>1960 - DSS processing using Fortran or COBOL </li></ul><ul><li>1970 - DBMS systems and the advent of DASD </li></ul><ul><li>1975 - OLTP systems facilitating faster access to data </li></ul><ul><li>1980 - PC/4GL technology and the advent of MIS </li></ul><ul><li>1985 - OLAP systems and separation of analytical processing from transactional processing </li></ul><ul><li>1994 - Architected environments with integrated OLAP engines and tools </li></ul>
  5. 5. What is a Data Warehouse? <ul><li>A copy of transaction data specifically structured for query and analysis (Ralph Kimball, 1996) </li></ul><ul><li>A collection of integrated, subject-oriented databases designed to support the DSS function, where each unit of data is relevant to some moment in time (Bill Inmon, 1991) </li></ul><ul><li>The data characteristics of a Data Warehouse are: </li></ul><ul><ul><li>Subject-oriented </li></ul></ul><ul><ul><li>Time-variant </li></ul></ul><ul><ul><li>Non-volatile </li></ul></ul><ul><ul><li>Integrated </li></ul></ul>
  6. 6. What is a Data Warehouse? (cont’d) <ul><li>A single, complete and consistent store of data obtained from a variety of different sources, made available to end users in a way they can understand and use in a business context (Barry Devlin, 1992) </li></ul><ul><li>A process of transforming data into information and making it available to users in a timely enough manner to make a difference (Forrester Research, 1996) </li></ul>
  7. 7. Data Warehouse Goals/Characteristics <ul><li>It must make an organization’s information easily accessible (slicing and dicing) </li></ul><ul><li>It must present the organization’s information consistently </li></ul><ul><li>It must be adaptive and resilient to change </li></ul><ul><li>It must be a secure bastion that protects our information assets </li></ul><ul><li>It must serve as the foundation for improved decision making </li></ul><ul><li>The business community must accept the DW, if it is to be deemed successful </li></ul>
  8. 8. Data Warehouse Applications <ul><li>Retail Industry </li></ul><ul><ul><li>Forecasting, Market research, Merchandising etc. </li></ul></ul><ul><li>Manufacturing and distribution </li></ul><ul><ul><li>Sales history/trends, Market demand projections etc. </li></ul></ul><ul><li>Banks </li></ul><ul><ul><li>Spot market trends, Marketing, Credit cards etc. </li></ul></ul><ul><li>Insurance Companies </li></ul><ul><ul><li>Property and casualty fraud etc. </li></ul></ul><ul><li>Health Care Providers </li></ul><ul><ul><li>Fraud detection, Patient matching etc. </li></ul></ul>
  9. 9. Data Warehouse Applications <ul><li>Government Agencies </li></ul><ul><ul><li>Auditing tax records, information sharing across different agencies etc. </li></ul></ul><ul><li>Internet Companies </li></ul><ul><ul><li>Analyzing shopping behavior, CRM etc. </li></ul></ul><ul><li>Telecommunications </li></ul><ul><ul><li>Telemarketing, Product development etc. </li></ul></ul><ul><li>Sports </li></ul><ul><ul><li>Analyzing strategies, Winning player combinations etc. </li></ul></ul>
  10. 10. Data Warehouse Sizes <ul><li>Terabyte (10^12) - Walmart (24 TB) </li></ul><ul><li>Petabyte (10^15) - Geographic Information Systems </li></ul><ul><li>Exabyte (10^18) - National Medical Association </li></ul><ul><li>Zettabyte (10^21) - Weather Images </li></ul><ul><li>Yottabyte (10^24) - Intelligence Agency (Video) </li></ul>
  11. 11. Data Warehousing Concepts
  12. 12. Data Warehouse (OLAP) and OLTP
  13. 13. Data Warehouse Architecture <ul><li>Operational Source Systems </li></ul><ul><ul><li>Execution Systems: CRM, ERP, Legacy, e-Commerce </li></ul></ul><ul><ul><li>External Data: Purchased Market Data, Spreadsheets </li></ul></ul><ul><ul><li>Technologies: PeopleSoft, SAP, Siebel, Oracle Applications, Custom Systems </li></ul></ul><ul><li>Extract, Transformation, and Load (ETL) Layer </li></ul><ul><ul><li>Tasks: Cleanse Data, Filter Records, Standardize Values, Decode Values, Apply Business Rules, Householding, Dedupe Records, Merge Records </li></ul></ul><ul><ul><li>Technologies: Informatica PowerMart, Ab Initio, Data Stage, Oracle Warehouse Builder, Custom programs, SQL scripts </li></ul></ul><ul><li>Data and Metadata Repository Layer </li></ul><ul><ul><li>Components: ODS, Enterprise Data Warehouse, Data Marts, Metadata Repository </li></ul></ul><ul><ul><li>Technologies: Oracle, SQL Server, Teradata, DB2 </li></ul></ul><ul><li>Presentation Layer </li></ul><ul><ul><li>Components: Reporting Tools, OLAP Tools, Ad Hoc Query Tools, Data Mining Tools </li></ul></ul><ul><ul><li>Technologies: Custom Tools, HTML Reports, Cognos, Business Objects, MicroStrategy, Oracle Discoverer, Brio, Data Mining Tools, Portals </li></ul></ul>
  14. 14. Data Warehouse Structure (diagram, from data to information. Summarization levels: Atomic/Detailed, Lightly Summarized, Highly Summarized. Structure levels: Organizationally Structured Data Warehouse, Departmentally Structured, Individually Structured)
  15. 15. Data Warehouse Architecture Drivers <ul><li>The requirements that drive the DW architecture are: </li></ul><ul><li>Granularity of data </li></ul><ul><li>Data retention and timeliness </li></ul><ul><li>Reporting capability </li></ul><ul><li>Availability </li></ul><ul><li>Scalability </li></ul>
  16. 16. Data Mart Centric (diagram with Data Sources, Data Marts, Data Warehouse)
  17. 17. Data Mart Centric If you end up creating multiple warehouses, integrating them is a problem
  18. 18. Data Warehouse Centric (diagram with Data Sources, Data Warehouse, Data Marts)
  19. 19. OLAP
  20. 20. OLAP: 3 Tier DSS <ul><li>Database Layer (Data Warehouse): Store atomic data in an industry-standard Data Warehouse. </li></ul><ul><li>Application Logic Layer (OLAP Engine): Generate SQL execution plans in the OLAP engine to obtain OLAP functionality. </li></ul><ul><li>Presentation Layer (Decision Support Client): Obtain multi-dimensional reports from the DSS Client. </li></ul>
  21. 21. OLAP Servers <ul><li>Support multidimensional OLAP queries </li></ul><ul><li>Characterized by how the underlying data is stored </li></ul><ul><li>Multidimensional OLAP (MOLAP) Servers </li></ul><ul><ul><li>Data stored in array-based structures e.g. Hyperion Essbase </li></ul></ul><ul><li>Relational OLAP (ROLAP) Servers </li></ul><ul><ul><li>Data stored in relational tables e.g. MicroStrategy, IBM Informix </li></ul></ul><ul><li>Hybrid OLAP (HOLAP) Servers </li></ul><ul><ul><li>Data distributed between relational and specialized storage e.g. Cognos, Microsoft Analysis Services </li></ul></ul>
  22. 22. OLAP Operations <ul><li>Roll-up: summarize data </li></ul><ul><ul><li>E.g. given sales data, summarize sales for last year by product category and region </li></ul></ul><ul><li>Drill down: get more details </li></ul><ul><ul><li>E.g. given summarized sales as above, find the breakdown of sales within each region </li></ul></ul><ul><li>Slice and dice: select and project </li></ul><ul><ul><li>E.g. sales of soft drinks in Gothenburg over the last quarter </li></ul></ul><ul><li>Pivot: change the view of data </li></ul>
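The four operations above can be illustrated with a minimal Python sketch. The sales rows, field names, and helper functions (`rollup`, `slice_dice`, `pivot`) are hypothetical and only meant to show the idea on an in-memory list; a real OLAP server would run these against the warehouse.

```python
from collections import defaultdict

# Hypothetical sales facts, one row per (quarter, city, product).
FIELDS = ("year", "quarter", "region", "city", "category", "product", "amount")
SALES = [
    (2023, "Q4", "West", "Gothenburg", "Soft drinks", "Cola", 120.0),
    (2023, "Q4", "West", "Gothenburg", "Soft drinks", "Lemonade", 80.0),
    (2023, "Q4", "West", "Gothenburg", "Snacks", "Chips", 50.0),
    (2023, "Q3", "West", "Boras", "Soft drinks", "Cola", 70.0),
    (2023, "Q4", "East", "Stockholm", "Soft drinks", "Cola", 200.0),
    (2023, "Q4", "East", "Stockholm", "Snacks", "Chips", 90.0),
]
ROWS = [dict(zip(FIELDS, r)) for r in SALES]


def rollup(rows, keys):
    """Sum `amount` grouped by `keys`; fewer keys give a coarser summary."""
    totals = defaultdict(float)
    for row in rows:
        totals[tuple(row[k] for k in keys)] += row["amount"]
    return dict(totals)


def slice_dice(rows, **conditions):
    """Keep only the rows that match every fixed dimension value."""
    return [r for r in rows if all(r[k] == v for k, v in conditions.items())]


def pivot(totals):
    """Turn {(row_key, col_key): value} into {row_key: {col_key: value}}."""
    table = defaultdict(dict)
    for (row_key, col_key), value in totals.items():
        table[row_key][col_key] = value
    return dict(table)


# Roll-up: sales by product category and region.
by_category_region = rollup(ROWS, ["category", "region"])
# Drill down: the same summary, broken down by city within each region.
by_category_city = rollup(ROWS, ["category", "region", "city"])
# Slice and dice: soft drinks in Gothenburg in one quarter.
gothenburg_soft_q4 = slice_dice(
    ROWS, city="Gothenburg", category="Soft drinks", quarter="Q4"
)
# Pivot: categories as rows, regions as columns.
cross_tab = pivot(by_category_region)
```

Drill down is simply a roll-up with one more grouping key, which is why both reuse the same function here.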
  23. 23. Strengths of OLAP <ul><li>It is a powerful visualization tool </li></ul><ul><li>It provides fast, interactive response times </li></ul><ul><li>It is good for analyzing time series </li></ul><ul><li>It can be useful for finding clusters and outliers </li></ul><ul><li>Many vendors offer OLAP tools </li></ul>
  24. 24. Dimensional Modeling
  25. 25. What is Dimensional Modeling? <ul><li>Logical design technique that seeks to present the data in a standard, intuitive framework that allows for high-performance access. </li></ul><ul><li>Adheres to a discipline that uses the relational model with some important restrictions. </li></ul><ul><li>Composed of one table with a multi-part key, called the fact table, and a set of smaller tables called dimension tables. </li></ul>
  26. 26. DM v/s ER Models <ul><li>DM: used to design databases for Online Analytical Processing (OLAP); ER: used to design databases for Online Transaction Processing (OLTP) </li></ul><ul><li>DM: supports ad hoc end-user queries; ER: supports defined queries </li></ul><ul><li>DM: intuitive and facilitates high-performance retrieval of data; ER: removes redundancy of data </li></ul><ul><li>DM: de-normalized; ER: normalized </li></ul>
  27. 27. Fact Tables <ul><li>Primary table in the DM </li></ul><ul><li>Each row corresponds to a measurement </li></ul><ul><li>Facts in the fact table are numeric and additive </li></ul><ul><li>Narrow rows with a few columns </li></ul><ul><li>Large number of rows (billions) </li></ul><ul><li>Express many-to-many relationships between dimensions </li></ul>
  28. 28. Dimension Tables <ul><li>Define business in terms already familiar to users </li></ul><ul><li>Implement the user interface to the DW </li></ul><ul><li>Wide rows with lots of descriptive text </li></ul><ul><li>Small tables (about a million rows) </li></ul><ul><li>Joined to fact table by a foreign key </li></ul><ul><li>Heavily indexed </li></ul><ul><li>Examples of typical dimensions: </li></ul><ul><ul><li>time periods, geographic region (markets, cities), products, customers, salesperson, etc. </li></ul></ul>
  29. 29. Four Step Dimensional Design Process <ul><li>Step 1 - Select the business process to model </li></ul><ul><ul><li>The first step in converting an ER diagram to a set of DM diagrams is to separate the ER diagram into its discrete business processes and to model each one separately. </li></ul></ul><ul><li>Step 2 - Choose The Grain of the Business Process </li></ul><ul><ul><li>The grain is the fundamental atomic level of data to be represented in the fact table. </li></ul></ul>
  30. 30. Four Step Dimensional Design Process (cont’d) <ul><li>Step 3 - Designate the Fact Tables </li></ul><ul><ul><li>The third step is to select those many-to-many relationships in the ER model containing numeric and additive non-key facts and to designate them as fact tables. </li></ul></ul><ul><li>Step 4 - Choose the dimensions that will apply to each fact table record </li></ul><ul><ul><li>This involves de-normalizing all of the remaining tables into flat tables with single-part keys that connect directly to the fact tables. </li></ul></ul>
  31. 31. Classic Star Schema Model
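As a rough illustration of a star schema, the sketch below builds one fact table and three dimension tables in an in-memory SQLite database via Python's `sqlite3` module. The table and column names (`fact_sales`, `dim_date`, `dim_product`, `dim_store`) and the sample rows are assumptions made for this example, not taken from the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: single-part keys, descriptive attributes.
CREATE TABLE dim_date    (date_key INTEGER PRIMARY KEY, day TEXT, month TEXT, year INTEGER);
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_store   (store_key INTEGER PRIMARY KEY, city TEXT, region TEXT);
-- Fact table: multi-part key made of foreign keys, plus numeric additive facts.
CREATE TABLE fact_sales (
    date_key    INTEGER REFERENCES dim_date(date_key),
    product_key INTEGER REFERENCES dim_product(product_key),
    store_key   INTEGER REFERENCES dim_store(store_key),
    qty         INTEGER,
    amount      REAL,
    PRIMARY KEY (date_key, product_key, store_key)
);
""")
conn.executemany("INSERT INTO dim_date VALUES (?,?,?,?)", [
    (1, "2024-01-15", "2024-01", 2024),
    (2, "2024-02-10", "2024-02", 2024),
])
conn.executemany("INSERT INTO dim_product VALUES (?,?,?)", [
    (1, "Cola", "Soft drinks"),
    (2, "Chips", "Snacks"),
])
conn.executemany("INSERT INTO dim_store VALUES (?,?,?)", [
    (1, "Gothenburg", "West"),
    (2, "Stockholm", "East"),
])
conn.executemany("INSERT INTO fact_sales VALUES (?,?,?,?,?)", [
    (1, 1, 1, 10, 15.0),
    (1, 2, 1, 4, 8.0),
    (2, 1, 2, 6, 9.0),
    (2, 1, 1, 2, 3.0),
])

# A typical star join: constrain and group by dimension attributes, sum the facts.
rows = conn.execute("""
    SELECT p.category, s.region, SUM(f.qty), SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product p ON p.product_key = f.product_key
    JOIN dim_store   s ON s.store_key   = f.store_key
    GROUP BY p.category, s.region
    ORDER BY p.category, s.region
""").fetchall()
```

Every query follows the same shape: the fact table in the middle, one join per dimension, and aggregation over the additive facts.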
  32. 32. Snowflake Schema
  33. 33. Fact Constellation Schema
  34. 34. Slowly Changing Dimensions <ul><li>Type 1: Overwrite the value </li></ul>
  35. 35. Slowly Changing Dimensions (cont’d) <ul><li>Type 2: Add a Dimension row </li></ul><ul><li>Type 3: Add a Dimension column </li></ul>
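The three types of slowly changing dimension can be sketched on a small in-memory dimension table. The row layout (`sk` as surrogate key, `valid_from`, `valid_to`, `current`, `previous_city`) is a common convention assumed here for illustration, and the customer data is made up.

```python
import copy

# Hypothetical customer dimension: `sk` is a surrogate key,
# `customer_id` is the business key from the source system.
customer_dim = [
    {"sk": 1, "customer_id": "C42", "city": "Utrecht",
     "valid_from": "2020-01-01", "valid_to": None, "current": True},
]


def scd_type1(dim, customer_id, new_city):
    """Type 1: overwrite the value; the old value is lost."""
    for row in dim:
        if row["customer_id"] == customer_id:
            row["city"] = new_city


def scd_type2(dim, customer_id, new_city, change_date):
    """Type 2: close the current row and add a new dimension row."""
    for row in dim:
        if row["customer_id"] == customer_id and row["current"]:
            row["valid_to"], row["current"] = change_date, False
    dim.append({"sk": max(r["sk"] for r in dim) + 1, "customer_id": customer_id,
                "city": new_city, "valid_from": change_date,
                "valid_to": None, "current": True})


def scd_type3(dim, customer_id, new_city):
    """Type 3: keep the prior value in an extra column on the same row."""
    for row in dim:
        if row["customer_id"] == customer_id:
            row["previous_city"] = row["city"]
            row["city"] = new_city


# Apply the same change (customer moves to Eindhoven) under each policy.
t1, t2, t3 = (copy.deepcopy(customer_dim) for _ in range(3))
scd_type1(t1, "C42", "Eindhoven")
scd_type2(t2, "C42", "Eindhoven", "2024-06-01")
scd_type3(t3, "C42", "Eindhoven")
```

Type 2 is the only variant here that preserves full history, at the cost of extra rows and a surrogate key that differs from the business key.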
  36. 36. Conceptual Modeling
  37. 37. Graph Theory <ul><li>Directed, acyclic, weakly connected graph </li></ul><ul><li>Quasi-tree </li></ul>
  38. 38. The Dimensional Fact Model <ul><li>Fact Schemes </li></ul><ul><ul><li>Facts </li></ul></ul><ul><ul><li>Measures </li></ul></ul><ul><ul><li>Dimensions </li></ul></ul><ul><ul><li>Hierarchies </li></ul></ul><ul><ul><ul><li>Dimension attributes </li></ul></ul></ul><ul><ul><ul><li>Non-dimension attributes </li></ul></ul></ul>
  39. 39. The Dimensional Fact Model
  40. 40. Why Formalize?
  41. 41. Why Formalize? <ul><li>Give meaning to the model </li></ul><ul><li>Tool support </li></ul><ul><ul><li>Transformation Algorithms </li></ul></ul><ul><ul><li>CASE-Tool (Computer Aided Software Engineering) </li></ul></ul>
  42. 42. Fact Scheme <ul><li>M is a set of measures </li></ul><ul><li>A is a set of dimension attributes </li></ul><ul><li>N is a set of non-dimension attributes </li></ul><ul><li>R is a set of ordered couples, having the form (a_i, a_j), indicating the ‘edges’ of the scheme </li></ul>
  43. 43. Fact Scheme <ul><li>O is a set of optional relationships </li></ul><ul><li>S is a set of aggregation statements, in the form (m_j, d_i, Ω) </li></ul>
  44. 44. Fact Scheme <ul><li>We call the set Dim(f) a dimension pattern. Each element in Dim(f) is a dimension </li></ul>
  45. 45. Fact Scheme
  46. 46. Algorithm <ul><li>From ER to Conceptual Design </li></ul><ul><li>Define Facts </li></ul><ul><li>For each fact </li></ul><ul><ul><li>Build attribute tree </li></ul></ul><ul><ul><li>Prune & Graft </li></ul></ul><ul><ul><li>Define Dimensions </li></ul></ul><ul><ul><li>Define Measures </li></ul></ul><ul><ul><li>Define Hierarchies </li></ul></ul>
  47. 47. Sample Schema
  48. 48. Define Facts <ul><li>Entity F </li></ul><ul><li>Relationship R between entities E_1 … E_n </li></ul><ul><ul><li>Transform R into an entity F </li></ul></ul><ul><li>Frequently updated archives are good candidates for defining facts </li></ul><ul><ul><li>E.g. Sale </li></ul></ul><ul><ul><li>Not: Store, City </li></ul></ul><ul><li>Each Fact becomes a root in a fact scheme </li></ul>
  49. 49. Transform Relation
  50. 50. Build Attribute Tree <ul><li>Each vertex corresponds to an attribute of the scheme </li></ul><ul><li>Root corresponds to the identifier of F </li></ul>
  51. 51. Build Attribute Tree <ul><li>root=newVertex(identifier(F)); </li></ul><ul><li>translate(F, root); </li></ul>
  52. 52. Build Attribute Tree <ul><li>translate(E, v) { </li></ul><ul><li>for each attribute a ∈ E | a ≠ identifier(E) </li></ul><ul><li>addChild(v, newVertex({a})); </li></ul><ul><li>for each entity G connected to E by a </li></ul><ul><li>relationship R | max(E,R) = 1 { </li></ul><ul><li>for each attribute b ∈ R </li></ul><ul><li>addChild(v, newVertex({b})); </li></ul><ul><li>next = newVertex(identifier(G)); </li></ul><ul><li>addChild(v, next); </li></ul><ul><li>translate(G, next); </li></ul><ul><li>} </li></ul><ul><li>} </li></ul>
  53. 53. Example <ul><li>translate(E= SALE , v= sale ) </li></ul><ul><li>addChild(v, qty ); </li></ul><ul><li>addChild(v, unitPrice ); </li></ul><ul><li>for G= PURCHASE TICKET </li></ul><ul><li>addChild(v, ticketNumber ); </li></ul><ul><li>translate(PURCHASE TICKET, ticketNumber ) </li></ul><ul><li>for G= PRODUCT </li></ul><ul><li>addChild(v, product ); </li></ul><ul><li>translate(PRODUCT, product ); </li></ul>
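The `translate` procedure can be turned into a small runnable Python sketch. The ER schema encoding below (`ENTITIES`, `RELATIONSHIPS`) is an assumption made for this example, as are the attributes `date`, `weight`, and the entity `TYPE`; only `qty`, `unitPrice`, `ticketNumber`, and `product` come from the example slide. The sketch also assumes that following to-one relationships never loops back to an entity already visited.

```python
from dataclasses import dataclass, field


@dataclass
class Vertex:
    name: str
    children: list = field(default_factory=list)


# Hypothetical ER schema: entity -> (identifier, other attributes).
ENTITIES = {
    "SALE": ("sale", ["qty", "unitPrice"]),
    "PURCHASE TICKET": ("ticketNumber", ["date"]),
    "PRODUCT": ("product", ["weight"]),
    "TYPE": ("type", []),
}
# Each entry is (E, G, max(E, R), attributes of R).
RELATIONSHIPS = [
    ("SALE", "PURCHASE TICKET", 1, []),
    ("SALE", "PRODUCT", 1, []),
    ("PRODUCT", "TYPE", 1, []),
    ("PRODUCT", "SALE", "n", []),  # to-many direction, skipped by translate
]


def translate(entity, v):
    """Add the attributes of `entity`, then recurse along to-one relationships."""
    _, attributes = ENTITIES[entity]
    for a in attributes:                      # a in E, a != identifier(E)
        v.children.append(Vertex(a))
    for e, g, max_card, rel_attrs in RELATIONSHIPS:
        if e == entity and max_card == 1:     # relationship R with max(E, R) = 1
            for b in rel_attrs:               # attributes of R itself
                v.children.append(Vertex(b))
            nxt = Vertex(ENTITIES[g][0])      # identifier(G)
            v.children.append(nxt)
            translate(g, nxt)


def build_attribute_tree(fact):
    root = Vertex(ENTITIES[fact][0])          # root = identifier(F)
    translate(fact, root)
    return root


def show(v, depth=0):
    """Return the tree as indented lines, one vertex per line."""
    lines = ["  " * depth + v.name]
    for child in v.children:
        lines.extend(show(child, depth + 1))
    return lines


tree = build_attribute_tree("SALE")
print("\n".join(show(tree)))
```

The call order matches the example: `qty` and `unitPrice` are attached first, then `ticketNumber` and `product`, each followed by a recursive call on the connected entity.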
  54. 54. Attribute Tree
  55. 55. Attribute Tree <ul><li>Label the root with the name of the entity F instead of its identifier </li></ul><ul><li>Optional relationships are not covered by the algorithm </li></ul><ul><ul><li>if min(E,R)=0 </li></ul></ul>
  56. 56. From ER to Conceptual Design <ul><ul><li>Build attribute tree </li></ul></ul><ul><ul><li>Prune & Graft </li></ul></ul><ul><ul><li>Define Dimensions </li></ul></ul><ul><ul><li>Define Measures </li></ul></ul><ul><ul><li>Define Hierarchies </li></ul></ul>
  57. 57. Prune & Graft <ul><li>Prune or graft to eliminate unnecessary level of detail </li></ul><ul><li>Pruning: Drop a subtree from the quasi-tree </li></ul><ul><li>Grafting: Vertex contains uninteresting information but its descendants must be preserved </li></ul>
  58. 58. Graft <ul><li>graft(v) { </li></ul><ul><li>for each v’ | v’ is father of v </li></ul><ul><li>for each v’’ | v’’ is child of v </li></ul><ul><li>addChild(v’, v’’); </li></ul><ul><li>drop(v); </li></ul><ul><li>} </li></ul>
  59. 59. Graft <ul><li>A 1-to-1 relationship is a good candidate </li></ul><ul><li>When an optional vertex is grafted, all its children inherit the optional dash </li></ul>
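Pruning and grafting can be sketched on a quasi-tree stored as an adjacency dictionary (vertex name to list of children). The graph below and the vertex names `store`, `manager`, and `managerPhone` are hypothetical. The `prune` helper keeps any descendant that is still reachable through another father, which is one possible reading of "drop a subtree" in a quasi-tree.

```python
def fathers(graph, v):
    """All vertices that have v as a child (a quasi-tree vertex may have several)."""
    return [p for p, kids in graph.items() if v in kids]


def graft(graph, v):
    """Remove v but preserve its descendants by attaching them to v's fathers."""
    for parent in fathers(graph, v):
        position = graph[parent].index(v)
        moved = [c for c in graph[v] if c not in graph[parent]]
        graph[parent][position:position + 1] = moved   # v is replaced by its children
    del graph[v]                                       # drop(v)


def prune(graph, v):
    """Drop v and every descendant that is no longer reachable afterwards."""
    for parent in fathers(graph, v):
        graph[parent].remove(v)
    stack = [v]
    while stack:
        node = stack.pop()
        for child in graph.pop(node, []):
            if not fathers(graph, child):              # keep shared descendants
                stack.append(child)


# Hypothetical attribute tree for a SALE fact.
graph = {
    "sale": ["qty", "unitPrice", "ticketNumber", "product"],
    "qty": [], "unitPrice": [],
    "ticketNumber": ["date", "store"],
    "date": [],
    "store": ["city", "manager"],
    "city": [],
    "manager": ["managerPhone"],
    "managerPhone": [],
    "product": ["type"],
    "type": [],
}

graft(graph, "ticketNumber")   # ticket number is uninteresting, date and store are not
prune(graph, "manager")        # manager details are an unnecessary level of detail
```

After these two calls, `date` and `store` hang directly off `sale`, and the `manager` subtree is gone.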
  60. 60. Prune & Graft
  61. 61. Prune & Graft
  62. 62. Dimensions <ul><li>Determines the granularity of fact instances </li></ul><ul><li>Time is a key dimension </li></ul><ul><ul><li>Snapshot </li></ul></ul><ul><ul><li>Temporal </li></ul></ul>
  63. 63. Measures <ul><li>Numerical attributes of the attribute tree </li></ul><ul><li>Glossary </li></ul><ul><ul><li>How each measure can be calculated from the source scheme </li></ul></ul><ul><ul><li>e.g. qty sold, no. of customers </li></ul></ul>
  64. 64. Hierarchies <ul><li>The tree already has a kind of hierarchy </li></ul><ul><ul><li>We can still prune/graft details </li></ul></ul><ul><ul><li>Add new levels for aggregation </li></ul></ul><ul><ul><ul><li>E.g. month-quarter-year </li></ul></ul></ul><ul><li>Identify non-dimension attributes </li></ul><ul><ul><li>E.g. address </li></ul></ul>
  65. 65. Aggregation <ul><li>Primary fact instances </li></ul><ul><ul><li>Null assumption </li></ul></ul><ul><ul><li>Zero assumption </li></ul></ul><ul><li>Roll-up </li></ul><ul><li>Sum, Avg, Count, Min, Max, … </li></ul>
  66. 66. Aggregation <ul><li>Graphical Notation </li></ul><ul><ul><li>Sum </li></ul></ul>
  67. 67. Multi-Aggregation
  68. 68. Multi-Aggregation <ul><li>Order matters </li></ul><ul><ul><li>{week, product} → {month, type} </li></ul></ul><ul><ul><li>Time-Dimension: Min </li></ul></ul><ul><ul><li>Product-Dimension: Sum </li></ul></ul>
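A small worked example shows why the order matters when different operators are used along different dimensions. The measure values, the week-to-month and product-to-type mappings, and the `roll_up` helper are all made up for illustration.

```python
from collections import defaultdict

# Hypothetical hierarchies: both weeks fall in one month, both products in one type.
MONTH_OF = {"w1": "m1", "w2": "m1"}
TYPE_OF = {"p1": "t1", "p2": "t1"}

# Hypothetical measure (for example a stock level) at grain {week, product}.
level = {
    ("w1", "p1"): 5, ("w1", "p2"): 1,
    ("w2", "p1"): 2, ("w2", "p2"): 4,
}


def roll_up(cells, axis, mapping, op):
    """Aggregate one coordinate (0 = time, 1 = product) to its parent level with `op`."""
    groups = defaultdict(list)
    for key, value in cells.items():
        new_key = list(key)
        new_key[axis] = mapping[key[axis]]
        groups[tuple(new_key)].append(value)
    return {k: op(values) for k, values in groups.items()}


# Min along time first, then Sum along product: min(5, 2) + min(1, 4).
time_first = roll_up(roll_up(level, 0, MONTH_OF, min), 1, TYPE_OF, sum)
# Sum along product first, then Min along time: min(5 + 1, 2 + 4).
product_first = roll_up(roll_up(level, 1, TYPE_OF, sum), 0, MONTH_OF, min)

print(time_first, product_first)
```

Both results describe the cell {month m1, type t1}, yet they differ, so a fact scheme that mixes operators has to fix the order in which they are applied.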
  69. 69. Multi-Aggregation
  70. 70. Multi-Aggregation
  71. 77. Indexing
  72. 78. Cost Model <ul><li>Cost of answering a query is number of rows processed </li></ul><ul><li>Subcubes </li></ul><ul><ul><li>Powerset of the dimensions </li></ul></ul>
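The subcube lattice and the row-count cost model can be sketched as follows. The dimension names `p`, `s`, `c` and the row counts for `ps` and `psc` follow the example on a later slide; the remaining row counts are placeholders assumed for this sketch, as are the helper names.

```python
from itertools import combinations

DIMENSIONS = ("p", "s", "c")


def subcubes(dims):
    """Powerset of the dimensions, from the full cube down to the empty group-by."""
    return [frozenset(c)
            for r in range(len(dims), -1, -1)
            for c in combinations(dims, r)]


def can_answer(subcube, query_dims):
    """A subcube can answer a query if it keeps every dimension the query groups by."""
    return set(query_dims) <= subcube


# Rows per subcube; only ps and psc come from the slides, the rest are assumed.
ROWS = {
    frozenset("psc"): 6_000_000,
    frozenset("ps"): 800_000,
    frozenset("pc"): 6_000_000,
    frozenset("sc"): 6_000_000,
    frozenset("p"): 200_000,
    frozenset("s"): 10_000,
    frozenset("c"): 100_000,
    frozenset(): 1,
}


def query_cost(query_dims, materialized):
    """Cost = rows processed in the smallest materialized subcube that can answer."""
    return min(ROWS[v] for v in materialized if can_answer(v, query_dims))


lattice = subcubes(DIMENSIONS)
available = [frozenset("psc"), frozenset("ps")]
cost_by_p = query_cost({"p"}, available)   # answered from ps
cost_by_c = query_cost({"c"}, available)   # only psc retains c
```

With three dimensions the lattice has 2^3 = 8 subcubes, so the number of candidates grows exponentially with the number of dimensions.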
  73. 79. Cost Model
  74. 80. Indexes <ul><li>B-tree indexes to speed up query processing </li></ul><ul><li>E.g. for cube ps, we can construct the following indexes </li></ul><ul><ul><li>I_ps </li></ul></ul><ul><ul><li>I_sp </li></ul></ul>
  75. 81. Example <ul><li>Consider Q_1: </li></ul><ul><ul><li>Using subcube ps: 0.8M rows </li></ul></ul><ul><ul><li>Using subcube psc: 6M rows </li></ul></ul><ul><li>What if we use index I_sp on subcube ps? </li></ul><ul><li>80 rows </li></ul>
  76. 82. Indexes <ul><li>Ideal situation </li></ul><ul><ul><li>All subcubes </li></ul></ul><ul><ul><li>All indexes </li></ul></ul>
  77. 83. Algorithms <ul><li>Balance the available space between subcubes and indexes </li></ul><ul><li>Greedy Algorithm </li></ul><ul><ul><li>Given a set of queries </li></ul></ul><ul><ul><li>At every step, select the index or subcube with the highest benefit </li></ul></ul>
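The greedy idea can be sketched in a deliberately simplified form. This is not the algorithm from the cited paper: it selects subcubes only (no indexes), counts the budget in number of structures rather than in space, and assumes one equally likely query per subcube. The row counts other than `ps` and `psc` are placeholders.

```python
# Rows per subcube, keyed by its dimensions; only ps and psc come from the slides.
SIZE = {
    "psc": 6_000_000, "ps": 800_000, "pc": 6_000_000, "sc": 6_000_000,
    "p": 200_000, "s": 10_000, "c": 100_000, "": 1,
}
QUERIES = list(SIZE)  # assumption: one group-by query per subcube


def answers(view, query):
    """A subcube answers a query if it contains all of the query's dimensions."""
    return set(query) <= set(view)


def total_cost(selected):
    """Sum over queries of the rows in the cheapest selected subcube that answers it."""
    return sum(min(SIZE[v] for v in selected if answers(v, q)) for q in QUERIES)


def greedy(k):
    """Pick up to k extra subcubes, each time the one with the highest benefit."""
    selected = ["psc"]  # the base cube is always kept so every query is answerable
    for _ in range(k):
        current = total_cost(selected)
        candidates = [v for v in SIZE if v not in selected]
        best = max(candidates, key=lambda v: current - total_cost(selected + [v]))
        if current - total_cost(selected + [best]) <= 0:
            break       # nothing left that reduces the cost
        selected.append(best)
    return selected


chosen = greedy(2)
print(chosen, total_cost(["psc"]), total_cost(chosen))
```

Here the benefit of a candidate is the reduction in total query cost it brings given what has already been selected, so benefits are recomputed at every step.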
  78. 84. ?
  79. 85. References <ul><li>Text books </li></ul><ul><ul><li>Ralph Kimball, The Data Warehouse Toolkit, John Wiley and Sons, 1996 </li></ul></ul><ul><ul><li>W.H. Inmon, Building the Data Warehouse, Second Edition, John Wiley and Sons, 1996 </li></ul></ul><ul><ul><li>Barry Devlin, Data Warehouse: From Architecture to Implementation, Addison Wesley Longman, 1997 </li></ul></ul><ul><li>Research Papers/Whitepapers </li></ul><ul><ul><li>M. Golfarelli, D. Maio, S. Rizzi, The Dimensional Fact Model: a Conceptual Model for Data Warehouses, International Journal of Cooperative Information Systems, Vol. 7 (issue 2/3), pages 215-247, 1998. </li></ul></ul><ul><ul><li>H. Gupta, V. Harinarayan, A. Rajaraman, J.D. Ullman, Index Selection for OLAP, Proceedings of the Thirteenth International Conference on Data Engineering, April 07 - 11, pages 208-219, 1997. </li></ul></ul><ul><ul><li>S. Luján-Mora, J. Trujillo, A comprehensive method for data warehouse design, Proc. DMDW, 2003. </li></ul></ul>
  80. 86. References (cont’d) <ul><ul><li>Luján-Mora, S., Trujillo, J., and Song, I. Extending the UML for Multidimensional Modeling. Lecture Notes In Computer Science, Vol. 2460, pages 290-304, 2002. </li></ul></ul><ul><ul><li>Husemann, B., Lechtenborger, J., Vossen, G.: Conceptual Data Warehouse Design. In: Proc. of the 2nd Intl. Workshop on Design and Management of Data Warehouses (DMDW'2000), Stockholm, pages 3-9, 2000. </li></ul></ul><ul><ul><li>Lehner, W., Albrecht, J., and Wedekind, H. Normal Forms for Multidimensional Databases. In Proceedings of the 10th International Conference on Scientific and Statistical Database Management (July 01 – 03), pages 63-72, 1998. </li></ul></ul>
