Optimizing the design of your data warehouse 09222010



  1. Optimizing the Design of your Data Warehouse
     Michael Wacey, CSC
     mwacey@csc.com
  2. Introduction
     • Who am I?
       – Michael Wacey
       – Partner with CSC since 1986
       – Architected many large-scale data warehouses
     • What are we going to discuss today?
       – Motivation
       – Tools
       – Approach
  3. Motivation
     • Data Here, Data There, Data Everywhere
     • Solutions
       – Architecture: the SAP approach; very hard to sustain, and SAP cannot solve all problems
       – Data Integration: requires architecture on the boundaries and infrastructure, lots of infrastructure
       – Data Warehouse: periodically collect the data and bring it all together for one or more purposes; the best bet for the foreseeable future
     • Every solution is trying to answer the same question: how do we get this data to fit together?
  4. Motivation
     • Making data fit together is difficult
       – Individual countries report numbers in their local (possibly multiple) currencies, and there is no agreed-upon set of conversion rates
       – The Trust department would rather not share its data with Finance
       – The current policy administration system has serious data quality issues; a new system is being built and is scheduled to go online in June 2011, but that date may be in jeopardy
     • We need a way to collect and analyze all this knowledge about the data
  5. Motivation
     • A high-level view:
       [Figure: box-and-line diagram. Labels as extracted: "Customer Profitability Accounting Sales Data Warehouse Sales Forecasts Marketing"]
     • May help with scoping
     • Each line could represent many files or feeds
     • Each box could represent many applications
  6. Motivation
     • A detailed view (the excerpt is cut off on the slide):

         BEGIN
           SELECT ml.sequence, al.sequence, m.msgkey
             INTO mseq, aseq, mkey
             FROM mqseries.levelcodes ml, mqseries.messages m,
                  mqseries.appctl a, mqseries.levelcodes al
            WHERE m.msglevel = ml.levelcodekey
              AND m.msgcode = inmsgcode
              AND a.msglevel = al.levelcodekey
              AND a.appctlkey = 1;
           IF sql%ROWCOUNT = 1 THEN
             IF aseq <= mseq THEN
               SELECT statuscodekey INTO sck
                 FROM mqseries.statuscodes
                WHERE statuscode = n;
               insert into mqseries.msglog
                 (msglogkey, msgkey, msgdata, msgstatus, msgsqlcode, msgsqlerrm)
               values
                 (mqseries.msgseq.nextval, mkey, inmsgdata, sck, inmsgsqlcode, SUBSTR(inmsgsqlerrm,1,4000));
               IF incommit = true THEN
                 commit;
               END IF;
             END IF;
           ELSE

     • Too much detail to plan, analyze, and understand
     • As usual, we have a forest and trees problem
  7. Motivation
     • What to do?
       – PowerPoint?
       – Visio?
       – ERwin?
     • They all help, but none gives us the right picture
     • We need a way to see the problem and the solution at the right level of detail
  8. Motivation
     • What is a data warehouse?
     • It includes:
       – Sources of data
       – Processing of data
       – Storage of data, probably multiple times in different structures
       – Analytics
     • Except for Analytics, these are either static views of data or dynamic processing of data
     • ERwin DM is great for the static views of data; we just need a way to capture the dynamic processing
  9. Motivation
     • I have used many techniques to capture the dynamic processing
     • Spreadsheets to capture data mapping (who hasn't?)
     • Process flow diagrams in PowerPoint and Visio
     • UML diagrams in the IBM and Sparx tools
     • They all worked to an extent, but they were hard to maintain and did not provide a leveling mechanism
  10. Motivation
      • Many years ago, I used Data Flow Diagrams to describe systems under development
      • They provided insight into the flow of data and a way to level the processes
      • So I tried that, first in Visio and later in ERwin PM
      • The rest of this talk is an approach to using ERwin DM and ERwin PM together to model a data warehouse
      • I have used this approach for the past five years and have found it very successful
      • It provides information to both the user community and developers
  11. The Tools
      • ERwin Data Modeler (ERwin DM)
        – Used to model databases
        – Supports both Logical and Physical models
        – If needed, I create conceptual models in PowerPoint or Visio
        – Each model has to represent one type of database
        – But data warehouses use many: flat files, Oracle, SQL Server, cubes, etc.
        – I use a UDP (user-defined property) to record the actual type of an Entity/Table
        – For example, a table that represents a flat file would have that setting in a UDP
  12. The Tools
      • ERwin Process Modeler (ERwin PM)
        – Previously called BPwin
        – Supports several diagram types
        – I have only found the Data Flow diagrams useful for the design of a data warehouse
        – The other diagrams could be used in analysis to understand how the data warehouse will be used
  13. The Tools
      • ERwin DM and ERwin PM
      • There is a connection between the tools
      • I have not used it extensively
  14. The Tools
      • Other Tools
        – These are minor but needed
        – PDF Viewer
        – Microsoft Excel
        – Microsoft Word
  15. The Approach
      • So, we have two tools to design a data warehouse
      • ERwin DM will be used to design and document static data stores
      • ERwin PM will be used to design the processing
      • Let's take a look at an example and then discuss how it works
  16. The Approach
      • Start in ERwin PM
      • Create a new model that is a data flow model
      • First we will create a context model
      • This will provide a view of the sources and uses of data
      • On the left-hand side, the sources of data are listed, using the external entity symbol
        – Sources can be systems, databases, people, etc.
      • On the right-hand side, the uses of data are listed, using the external entity symbol
        – Uses can be reports, cubes, analytics, data feeds, etc.
  17. The Approach
      [Figure: context diagram, node A-0, title "Customer Profitability". A single central process, A0, sits between the external entities.
       Sources (left): E1 Allocation Factors, E2 Demand Deposit Accounts, E3 Consumer Loans, E4 Mortgages, E5 Commercial Loans, E6 Treasury, E7 Trust Accounts, E8 Organization, E9 General Ledger.
       Uses (right): E11 Exception Report, E12 Balancing Report, E13 Commercial Customer Analytics, E14 Retail Customer Analytics.]
  18. The Approach
      • The Context Diagram is a good start
      • It sets the scope
      • But it does not provide any details about what is going to be done
      • Those come in the next diagram, which details the central process
  19. The Approach
      [Figure: data flow diagram, node A0, title "Customer Profitability". Six activities, A1 to A6, covering sourcing, calculation, exception output, commercial BI (A4), retail BI, and balancing. Four data stores, D1 to D4, covering staging, the warehouse, exceptions, and balancing values. Legible flow labels include Source Exceptions, Calculation Exceptions, Commercial BI Data, Retail BI Data, and Input Balance Values.]
  20. The Approach
      • This level-one diagram shows all the key components of the solution
      • There is no magic formula for what should be included here
      • There need to be at least some sourcing, processing, and display/output activities
      • In this case, there is one source processing activity, one calculation activity, and four output activities
      • Each can be broken down into more detail
      • Let's look at the Commercial BI activity
  21. The Approach
      [Figure: data flow diagram, node A4, title "Commercial BI". Activities A4.1, A4.3, and A4.6 and data store D16. Legible labels include Load Commercial Cube, Commercial Profitability Reporting, Data for Cube, Commercial BI Data, Commercial Balancing Data, and Commercial Customer Data.]
  22. The Approach
      • This decomposition can continue until you are comfortable with the level of detail
      • I try to get to the point where one developer can implement an activity in one module
      • At this point, we will have a series of diagrams that show the flow of data through the system
      • The diagrams contain:
        – Activities
        – Data Stores (note that a single data store can be used on multiple diagrams)
        – Data Flows
        – External Entities
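The four element types on a data flow diagram can be pictured as a small data structure. The sketch below is only an illustration in Python, not how ERwin PM stores its models: the class names `Element`, `Flow`, and `Diagram` are hypothetical, and the sample entries are taken from the context diagram labels.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Element:
    """An activity, data store, or external entity on a diagram."""
    kind: str   # "activity", "store", or "external"
    ident: str  # for example "A0", "D1", "E5"
    name: str


@dataclass(frozen=True)
class Flow:
    """A named data flow between two elements, identified by their ids."""
    name: str
    source: str
    target: str


@dataclass
class Diagram:
    node: str
    title: str
    elements: dict = field(default_factory=dict)
    flows: list = field(default_factory=list)

    def add(self, kind, ident, name):
        self.elements[ident] = Element(kind, ident, name)

    def connect(self, name, source, target):
        # A flow may only join elements that are already on this diagram.
        for ident in (source, target):
            if ident not in self.elements:
                raise KeyError(f"unknown element: {ident}")
        self.flows.append(Flow(name, source, target))

    def inputs_of(self, ident):
        """Names of the flows arriving at the given element."""
        return [f.name for f in self.flows if f.target == ident]


# Illustrative fragment of the context diagram (hypothetical subset).
context = Diagram(node="A-0", title="Customer Profitability")
context.add("external", "E5", "Commercial Loans")
context.add("activity", "A0", "Customer Profitability")
context.add("external", "E13", "Commercial Customer Analytics")
context.connect("Commercial Loan Data", "E5", "A0")
context.connect("Commercial Customer Data", "A0", "E13")
```

Because elements are keyed by id, the same data store id can be added to several `Diagram` objects, which mirrors a single data store appearing on multiple diagrams.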
  23. The Approach
      • Each of the diagram elements, except for the Data Flows, can be further modeled in ERwin DM
      • This gives the developer a further level of detail about what is intended
      • It also provides the physical names that will be used
      • To maintain the mapping between the models, I use a naming convention for ERwin DM Subject Areas
      • The convention is:
        – A01.01.01 – {Activity Name}
        – D01 – {Data Store Name}
        – E01 – {External Name}
  24. The Approach
      • Some examples for External Entities and Data Stores from the model above:
        – D01 – Customer Profitability Staging
        – E05 – Commercial Loans
      • Each of these subject areas should contain the portion of the data model relevant to it
      • Note that these are just typical ER models
      • They can represent more than just tables; for example, an external entity could be a flat file
      • Below is an example: the E05 – Commercial Loans external entity
  25. The Approach
      [Figure: ERwin DM model for the E05 – Commercial Loans external entity]
  26. The Approach
      • Next we need to look at the activities
      • Because activities have a hierarchical numbering system, we need one for the subject areas too
      • We simply start with A and separate each level with a period
      • Combine Retail Loans is Activity 7 inside of Activity 2, so it is called A2.7 Combine Retail Loans in the model
      • The associated subject area will be:
        – A02.07 – Combine Retail Loans
      • The data model will show the input and output entities and how they are processed
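The naming convention is mechanical enough to write down as a short function. The following Python sketch is an illustration only: `subject_area_name` is a hypothetical helper, not part of ERwin, and it assumes that each level is padded to two digits and that only activities use the dotted hierarchy.

```python
import re

# Separator used in the subject area names on the slides (space, en dash, space).
SEPARATOR = " \u2013 "


def subject_area_name(ident, name):
    """Build a subject area name from a process model element id.

    ident is an id such as "A2.7", "D1", or "E5"; name is the element name.
    """
    match = re.fullmatch(r"([ADE])(\d+(?:\.\d+)*)", ident)
    if match is None:
        raise ValueError(f"unrecognized element id: {ident!r}")
    prefix, numbers = match.groups()
    if prefix != "A" and "." in numbers:
        # Assumption: only activities are numbered hierarchically.
        raise ValueError(f"only activities may be hierarchical: {ident!r}")
    # Pad every level to two digits so the names sort in diagram order.
    padded = ".".join(f"{int(part):02d}" for part in numbers.split("."))
    return f"{prefix}{padded}{SEPARATOR}{name}"


activity_area = subject_area_name("A2.7", "Combine Retail Loans")
store_area = subject_area_name("D1", "Customer Profitability Staging")
external_area = subject_area_name("E5", "Commercial Loans")
```

Zero-padding is what keeps A02.07 ahead of A02.10 in an alphabetical list of subject areas, which is presumably why the convention uses it.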
  27. The Approach
      [Figure: ERwin DM model for the A02.07 – Combine Retail Loans subject area]
  28. The Approach
      • With the diagrams from ERwin DM and ERwin PM, plus the narrative in ERwin PM, the developer has all the information needed to implement a portion of the solution
      • The diagrams and narratives are also accessible to technical users
      • Twice, I have had the user community write papers to explain the details of specific areas of the ERwin PM model
  29. The Approach
      • Notes
        – Using ERwin DM we can quickly build detailed reports with diagrams and descriptions
        – The developers use these reports to track what they have to do
        – The Project Managers use these reports as an inventory for project planning
        – The ERwin PM reports are like a roadmap that ties everything together
        – It takes some effort to keep everything synchronized, but it is well worth it
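Part of the synchronization effort can be checked mechanically if the element ids from the process model and the subject area names from the data model are available as plain lists. The Python sketch below assumes such lists have been produced by hand or by some export; `padded_key` and `sync_report` are hypothetical helpers, and nothing here calls an ERwin API.

```python
def padded_key(ident):
    """Turn a process model id such as 'A2.7' into the padded form 'A02.07'."""
    return ident[0] + ".".join(f"{int(p):02d}" for p in ident[1:].split("."))


def sync_report(pm_ids, dm_subject_areas):
    """Compare process model element ids with data model subject area names.

    Returns the keys that have no subject area and the subject areas
    that no longer match any process model element.
    """
    expected = {padded_key(ident) for ident in pm_ids}
    # The key is the first whitespace-delimited token of each subject area name.
    actual = {name.split()[0] for name in dm_subject_areas}
    return {
        "missing_in_dm": sorted(expected - actual),
        "orphaned_in_dm": sorted(actual - expected),
    }


# Illustrative input: E5 has no subject area yet, and E06 has no element.
report = sync_report(
    ["A2.7", "D1", "E5"],
    [
        "A02.07 \u2013 Combine Retail Loans",
        "D01 \u2013 Customer Profitability Staging",
        "E06 \u2013 Treasury",
    ],
)
```

A report like this only confirms that the names line up; whether the contents of each subject area still match the activity or data store is a review task.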
  30. The Approach
      • In Summary
        – A data warehouse is very much a store of data and a flow of data
        – ERwin DM and ERwin PM can model both of these areas
        – Use ERwin PM to decompose the solution
          • There is no right or best decomposition
          • Try it until it works
        – Use ERwin DM to model the internals of External Entities, Data Stores, and Activities
          • Tie the two models together through an appropriate naming convention
          • Do not worry if the entities model more than tables
        – The goal is to communicate with users and developers
  31. Questions?