Data Management & Warehousing

WHITE PAPER

Process Neutral Data Modelling

DAVID M WALKER
Version: 1.0
Date: 10/02/2009

Data Management & Warehousing
138 Finchampstead Road, Wokingham, Berkshire, RG41 2NU, United Kingdom
http://www.datamgmt.com
Table of Contents

Synopsis
Intended Audience
About Data Management & Warehousing
Introduction
The Problem
    The Example Company
    The Real World
The Customer Paradigm
Requirements of a Data Warehouse Data Model
    Assumptions
    Requirements
The Data Model
    Major Entities
    Type Tables
    Band Tables
    Property Tables
    Event Tables
    Link Tables
    Segment Tables
The Sub-Model
    History Tables
    Occurrences and Transactions
Implementation Issues
    The ‘Party’ Special Case
    Partitioning
    Data Cleansing
    Null Values
    Indexing Strategy
    Enforcing Referential Integrity
    Data Insert versus Data Update
    Row versus Set Based Loading in ETL
    Disk Space Utilisation
    Implementation Effort
Data Commutativity
Data Model Explosion and Compression
    How big does the data model get?
    Can the data model be compressed?
Which Results to Store?
The Holistic Approach
Summary
Appendix 1 – Data Modelling Standards
    General Conventions
    Table Conventions
    Column Conventions
    Index Conventions
    Standard Table Constructs
    Sequence Numbers For Primary Keys
Appendix 2 – Understanding Hierarchies
    Sales Regions
    Internal Organisation Structure
Appendix 3 – Industry Standard Data Models
Appendix 4 – Information Sparsity
Appendix 5 – Set Processing Techniques
Appendix 6 – Standing on the shoulders of giants
Further Reading
    Overview Architecture for Enterprise Data Warehouses
    Data Warehouse Governance
    Data Warehouse Project Management
    Data Warehouse Documentation Roadmap
    How Data Works
List of Figures
Copyright

© 2009 Data Management & Warehousing
Synopsis

This paper describes in detail the process for creating an enterprise data warehouse physical data model that is less susceptible to change. Change is one of the largest on-going costs in a data warehouse, and therefore reducing change reduces the total cost of ownership of the system. This is achieved by removing business process specific data and concentrating on core business information.

The white paper examines why data-modelling style is important and how issues arise when using a data model for reporting. It discusses a number of techniques and proposes a specific solution. The techniques should be considered when building a data warehouse solution even when an organisation decides against using the specific solution.

This paper is intended for a technical audience and project managers involved with the technical aspects of a data warehouse project.

Intended Audience

Reader                   Recommended Reading
Executive                Synopsis
Business Users           Synopsis
IT Management            Synopsis
IT Strategy              Entire Document
IT Project Management    Entire Document
IT Developers            Entire Document

About Data Management & Warehousing

Data Management & Warehousing is a specialist consultancy in data warehousing, based in Wokingham, Berkshire in the United Kingdom. Founded in 1995 by David M Walker, our consultants have worked for major corporations around the world including the US, Europe, Africa and the Middle East. Our clients are invariably large organisations with a pressing need for business intelligence. We have worked in many industry sectors but have specialists in telcos, manufacturing, retail, financial and transport, as well as technical expertise in many of the leading technologies.

For further information visit our website at: http://www.datamgmt.com

Crossword Clue: Expert Gives Us Real Understanding (4 letters)
Introduction

Commissioning a data warehouse system is a major undertaking. Organisations will invest significant capital in the development of the system. The data model is always a major consideration, and many projects will invest a significant part of the budget on developing and re-working the initial data model.

Unfortunately projects also often fail to look at the maintenance costs of the data model that they develop. A data model that is fit for purpose when developed will rapidly become an expensive overhead if it needs to change when the source systems change. The cost involved is not only in the change to the data model but also in the changes to the ETL that feeds the data model.

This problem is exacerbated by the fact that changes to the data model may be done in a way that is inconsistent with the original design approach. The data model loses transparency and becomes even more difficult to maintain.

For many large data warehouse solutions it is not uncommon to have a resource permanently assigned to maintaining the data model, and several more resources assigned to managing the change in the associated ETL, within a short time of going live.

By understanding the problem, and by using techniques imported from other areas of systems and software development as well as change management techniques, it is possible to define a method that will greatly reduce this overhead.

This white paper sets out an example of the issues from which to develop a statement of requirements for the data model, and then demonstrates a number of techniques which, when used together, can address those requirements in a sustainable way.
The Problem

Data modelling is the process of defining the database structures in which to hold information. To understand the Process Neutral Data Modelling approach, this paper first looks at why these database structures have such an impact on the data warehouse.

In order to demonstrate the issues with creating a data model for a data warehouse, more experienced readers are asked to bear with the necessarily simplistic examples that follow.

The Example Company

A company supplies and installs widgets. There are a number of different widget types, each having a name and specific colour. Each individual widget has a unique serial number and can have a number of red lamps and a number of green lamps plugged into it. The widgets are installed into cabinets at customer sites, and from time to time engineers come in and change the relative numbers of red and green lamps. The customer name and a customer cabinet number identify cabinets. For operational systems the data model might look something like this [1]:

Figure 1 - Initial Operational System Data Model [2]

This simple data model describes both the widget and the cabinet and provides the current combinations. It does not provide any historical context: “What was the previous configuration and when was it changed?”

Historical data can be recorded by simply adding a start date and an end date to each of the main tables. This provides the ability to report on the historical configuration [3]. In order to facilitate this a separate reporting environment would be set up, because retaining history in the operational system would unacceptably reduce the operational system’s performance. There are three consequences of doing this:

• Queries are now more complex. In order to report the information for a given date the query has to allow for the required date being between the start date and the end date of the record in each of the tables. The extra complexity slows the execution of the query.
• The volume of data stored has also increased. The storage of dates has a minor impact on the size of each row, but this is small when compared to the number of additional rows that need to be stored [4].
• Data has to be moved from the operational system to the reporting system via an extract, transform and load (ETL) process. This process has to extract the data from the operational system, compare the records to the current records in the reporting system to determine if there are any changes and, if so, make the required adjustments to the existing record (e.g. updating the end date) and insert the new record. Already the process is more complex and time consuming than simply copying the data across [5].

Figure 2 - Initial Reporting System Data Model

When the reporting system is built, it accurately reflects the current business processes and operational systems, and provides historical data. From a systems management perspective there is now an additional database, and a series of ETL or interface scripts that have to be run reliably every day.

The systems architecture may be further enhanced so that the reporting system becomes a data warehouse and the users make their queries on data marts, or sets of tables where the data has been re-structured in order to simplify the users’ query environment. The ‘data marts’ typically use star-schema or snowflake-schema data modelling techniques or tool-specific storage strategies [6]. This adds an additional layer of ETL to move between the data warehouse and the data mart.

However the company doesn’t stop here. The product development team create a new type of widget. This new widget allows amber lamps and can optionally be mounted in a rack that is in turn mounted in a cabinet. The IT director also insists that the new OLTP application is more flexible for other future developments.

Notes:
[1] Data models in this document are illustrative and therefore should be viewed as suitable for making specific points rather than complete production quality solutions. Some errors exist to explicitly demonstrate certain issues.
[2] There are several conventions for data modelling. In this and subsequent diagrams the link with a 1 and ∞ represents a one-to-many relationship where the ‘1’ record is a primary key field and the ‘∞’ represents the foreign key field.
[3] Note that the ‘WIDGET_LOCATIONS’ table requires an additional field called ‘INSTALL_SEQUENCE’ to allow for the case where a widget is re-installed in a cabinet.
[4] Assume that everything remains the same except that widgets are moved around (i.e. there are no new widgets and no new cabinet/customer combinations); then the WIDGET_LOCATIONS table grows in direct proportion to the number of changes. If each widget were modified in some way once a month then the reporting system table would be twelve times bigger than the operational system after one year, and this before any other change is handled.
[5] Additional functionality such as data cleansing will also impact the complexity of ETL and affect performance.
[6] This is accepted good practice, and the design and implementation of data marts is outside the scope of this paper.
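The history pattern described above (start and end dates on each record, an ETL step that closes the old record and inserts the new one, and point-in-time queries that check the required date falls between the two) can be sketched as follows. This is a minimal illustration using Python's sqlite3 module; the table and column names are assumptions based on the widget example, not a production model.

```python
import sqlite3

# Sketch of the start/end-date history pattern from the example company.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE widget_locations (
        widget_serial_no INTEGER,
        cabinet_id       INTEGER,
        install_sequence INTEGER,  -- allows a widget to be re-installed
        start_date       TEXT,
        end_date         TEXT      -- NULL marks the current record
    )""")

def move_widget(serial, cabinet, move_date):
    """The ETL-style adjustment: close the current record (update its
    end date) and insert the new record, rather than simply copying
    the data across."""
    conn.execute("""
        UPDATE widget_locations SET end_date = ?
        WHERE widget_serial_no = ? AND end_date IS NULL""",
        (move_date, serial))
    seq = conn.execute("""
        SELECT COALESCE(MAX(install_sequence), 0) + 1
        FROM widget_locations WHERE widget_serial_no = ?""",
        (serial,)).fetchone()[0]
    conn.execute("INSERT INTO widget_locations VALUES (?, ?, ?, ?, ?)",
                 (serial, cabinet, seq, move_date, None))

def location_as_of(serial, as_of_date):
    """The point-in-time query: the required date must lie between the
    start date and the end date of the record."""
    row = conn.execute("""
        SELECT cabinet_id FROM widget_locations
        WHERE widget_serial_no = ?
          AND start_date <= ?
          AND (end_date IS NULL OR end_date > ?)""",
        (serial, as_of_date, as_of_date)).fetchone()
    return row[0] if row else None

move_widget(1001, 1, "2008-01-01")   # installed in cabinet 1
move_widget(1001, 2, "2008-07-01")   # later moved to cabinet 2

print(location_as_of(1001, "2008-03-15"))  # 1
print(location_as_of(1001, "2008-09-01"))  # 2
```

Even this toy version shows the two costs the text identifies: every query against the table must carry the date-range predicate, and the load logic is an update-then-insert rather than a straight copy.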
These business process changes result in a new data model for the operational system.

Figure 3 - Second Version Operational System Data Model

The reporting system is also now a live system with a large amount of historical information. It too can be re-designed. The operational system will be implemented to meet the business requirements and timescales regardless of whether the reporting system is ready. It also may not be possible to create the history required for the new data model when it is changed [7].

If a data mart is built from the data warehouse there are two impacts: firstly, the data mart model will need to be changed to exploit the new data; secondly, the change to the data warehouse model will require the data mart ETL to be modified regardless of any changes to the data mart data model.

The example company does not stop here, however, as senior management decide to acquire a smaller competitor. The new subsidiary has its own systems that reflect its own business processes. The data warehouse was built with a promise of providing integrated management reporting, so there is an expectation that the data from the new source system will be quickly and seamlessly integrated into the data warehouse. From a technical perspective this could present issues around mapping the new source system data model to the existing data warehouse data model, critical information data types [8], duplication of keys [9], etc. that all cause problems with the integration of data and therefore slow down the processing.

Within a few short iterations of change it is possible to see the dramatic impact on the data warehouse, and that the system is likely to run into issues.

Notes:
[7] A common example of this is an organisation that captures the fact that an individual is married or not. Later the organisation decides to capture the name of the partner if someone is married. It is not possible to create the historical information systemically, so for a period of time the system has to support the continued use of the marital status and then possibly run other activities, such as outbound calling, to complete the missing historical data.
[8] The example database assumed that the serial number was numeric and used it as a primary key, but what happens if the acquired company uses alphanumeric serial numbers?
[9] If both companies use numbers starting from 1 for their customer ID then there will be two customers who have the same ‘unique’ ID, and customers that have two ‘unique’ IDs.
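The duplicate-key problem described above (both companies numbering their customers from 1, and one of them using alphanumeric IDs) is commonly handled by keeping the source system identifier alongside the source key and issuing a warehouse-wide surrogate key. This sketch is illustrative only; the table and system names are assumptions, not taken from the paper's model.

```python
import itertools
import sqlite3

# One common way to avoid key collisions when merging sources: keep
# (source_system, source_customer_id) as the natural key and issue a
# warehouse surrogate key. All names here are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customers (
        customer_key       INTEGER PRIMARY KEY,  -- warehouse surrogate key
        source_system      TEXT,
        source_customer_id TEXT,  -- TEXT copes with alphanumeric IDs
        customer_name      TEXT,
        UNIQUE (source_system, source_customer_id)
    )""")
next_key = itertools.count(1)

def load_customer(source_system, source_id, name):
    """Insert a customer, tolerating the same ID arriving from two sources."""
    conn.execute(
        "INSERT INTO customers VALUES (?, ?, ?, ?)",
        (next(next_key), source_system, str(source_id), name))

# Both companies number their customers from 1; the composite natural
# key keeps them distinct in the warehouse.
load_customer("ACME_CRM", 1, "Jones Ltd")
load_customer("ACQUIRED_ERP", 1, "Smith & Co")
load_customer("ACQUIRED_ERP", "A-17", "Brown plc")  # alphanumeric ID

rows = conn.execute(
    "SELECT customer_key, source_system, source_customer_id "
    "FROM customers").fetchall()
print(rows)
```

The surrogate key decouples the warehouse from each source's numbering scheme, which is exactly the kind of change-insulation the rest of this paper argues for.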
The Real World

The example above is designed to illustrate some of the issues that affect data warehouse data modelling. In reality business and technical analysts will handle some of these issues in the design phase, but how big is the data-modelling problem in the real world?

• A UK transport industry organisation has three mainframes, each of which is only allowed to perform one release a quarter. Each system also feeds the data warehouse. As a consequence the mainframe feeds require validation and change every month. Whilst the main data comes from these three systems, there are sixty-five other Unix-based operational systems that feed the data warehouse, and data from several hundred desktop-based applications that also provide data. Most of these source systems do not have good change control or governance procedures to assist in impact analysis. Change for this organisation is business as usual.
• A global ERP vendor supplies a system with over five thousand database objects and typically makes a major release every two years, a ‘dot’ release every six months, and has numerous patches and fixes in between each major release. This type of ERP system is in use in nearly every major company and the data is a critical source to most data warehouses.
• A global food and drink manufacturer that came into existence as a result of numerous mergers and acquisitions, and also divested some assets, found itself with one hundred and thirty-seven general ledger instances in ten countries with seventeen different ERP packages. Even where the ERP packages were the same they were not necessarily using the same version of the package. The business intelligence requirement was for a single data warehouse and a single data model.
• A European Telco purchased a three-hundred-table ‘industry standard’ enterprise data model from a major business intelligence vendor and then spent two years analysing it before they started the implementation. Within six months of implementation they had changed some sixty percent of the tables as a result of analysis omissions.
• A UK based banking and insurance business outsources all of its product management to business partners and only maintains the unified customer management systems (website, call centres and marketing). As a result nearly all of the ‘source systems’ are external to the organisation and, whilst there are contractual agreements about the format and data remaining fixed, in practice there is significant regular change in the format and information provided to both operational and reporting systems.

Obviously these issues cannot be fixed just by creating the correct data model for the data warehouse [10], but the objective of the data model design should be twofold:

• To ensure that all the required data can be stored effectively in the data warehouse.
• To ensure that the design of the data model does not impose cost and, where possible, actively reduces the cost of change on the system.

Notes:
[10] Data Management & Warehousing have published a number of other white papers that are available at http://www.datamgmt.com and look at other aspects of data warehousing and address some of these issues. See Further Reading at the end of this document for more details.
The Customer Paradigm

Data warehouse development often starts with a requirements gathering exercise. This may take the form of interviews or workshops where people try to define what the customer is. If a number of different parts of the business are involved then the definition of customer soon becomes confused and controversial, and negatively impacts the project. Most organisations have a sales funnel that describes the process of capturing, qualifying, converting and retaining customers.

Marketing say that the customer is anyone and everyone that they communicate with. The sales teams view the customer as those organisations in their qualified lead database or for whom they have account management responsibility post-sales. The customer services team are clear that the customer is only those organisations who have purchased a product and, where appropriate, have purchased a support agreement as well.

Other questions are asked in the workshops, such as “What about customers who are also suppliers or partners?” and “How do we deal with customers who have gone away and then come back after a long period of time?”

Figure 4 - The Sales Funnel

The most common solutions created as a result are either to add ‘flag’ or ‘indicator’ columns to the customer table to represent each category, or to create multiple tables for the different categories required and to repeat the data in each of the tables.

This example clearly demonstrates that the business process is being embedded into the data model. The current business process definition(s) of customer are defining how the data model is created. What has been forgotten is that these ‘customers’ exist outside the organisation, and it is their interaction with different parts of the organisation that defines their status of being a customer, supplier, etc.

In legal documents there is the concept of a ‘party’, where a party is a person or group of persons that compose a single entity that can be identified as one for the purposes of the law [11]. This definition is one that should be borrowed and used in the data model.

If users query a data mart that is loaded with data extracted from the transaction repository, and data marts are built for a specific team or function that only requires one definition of the data, then the current definition can be used to build that data mart and different definitions used for other departments [12].

Notes:
[11] http://en.wikipedia.org/wiki/Party_(law)
[12] This also allows flexibility as, when business processes change, it is possible at a cost to change the rules by which data is extracted. The cost of change is relatively much lower than trying to rebuild the data warehouse and data mart with a new definition.
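The ‘party’ idea described above can be sketched as follows: the external entity is stored once, and each relationship it has with the organisation is recorded separately, instead of embedding the current process definitions as flag columns on a customer table. This is a hypothetical illustration only; the paper develops its own table conventions later, and all names here are assumptions.

```python
import sqlite3

# Illustrative sketch of the 'party' approach: one row per external
# entity, with its roles (customer, supplier, prospect...) held in a
# separate table rather than as flag columns. Names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE parties (
        party_id   INTEGER PRIMARY KEY,
        party_name TEXT
    );
    CREATE TABLE party_roles (
        party_id   INTEGER REFERENCES parties(party_id),
        role       TEXT,
        start_date TEXT,
        end_date   TEXT          -- NULL: the role is still current
    );
    INSERT INTO parties VALUES (1, 'Jones Ltd');
    -- Jones Ltd is simultaneously a customer and a supplier; giving it a
    -- new role is an insert, not a schema change.
    INSERT INTO party_roles VALUES (1, 'CUSTOMER', '2007-05-01', NULL);
    INSERT INTO party_roles VALUES (1, 'SUPPLIER', '2008-02-01', NULL);
""")

# Each department's data mart can then apply its own qualified
# definition when extracting, e.g. customer services:
customer_services_view = conn.execute("""
    SELECT p.party_name FROM parties p
    JOIN party_roles r ON r.party_id = p.party_id
    WHERE r.role = 'CUSTOMER' AND r.end_date IS NULL""").fetchall()
print(customer_services_view)  # [('Jones Ltd',)]
```

Because the roles carry dates, the same structure also answers “customers who went away and came back”: a closed CUSTOMER role followed by a new one, with no change to the model.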
As a result of this approach two questions are common:

• Isn’t one of the purposes of building a data warehouse to have a single version of the truth? Yes. There is a single version of the truth in the data warehouse and this single version is perpetuated into the data marts; the difference is that the information in the data mart is qualified. Asking the question “How many customers do we have?” should get the answer “Customer Services have X active service contract customers” and not the answer “X” without any further qualification.

• What happens if different teams or departments have different data? People within the organisation work within different processes and with the same terminology but often different definitions. It is unlikely and impractical in the short term to change this, although it is possible that in the long term the data warehouse project will help with the standardisation process. In the meantime it is an education process to ensure that answers are qualified. It is important to recognise that different departments legitimately have different definitions, and therefore to recognise and understand the differences rather than fighting about who is right.

It might be argued that there are too many differences to put all individuals and organisations in a single table; this and other issues will be discussed later in the paper.
Requirements of a Data Warehouse Data Model

Having looked at the problems that can affect a data warehouse data model it is possible to describe the requirements that should be placed on any data model design.

Assumptions

1. The data model is for use in the architectural component called the transaction repository13 or data warehouse.
2. As the data model is used in the data warehouse it will not be a place where users go to query the data; instead users will query separate dependent data marts.
3. As the data model is used in the data warehouse, data will be extracted from it to populate the data marts by ETL tools.
4. As the data model is used in the data warehouse, data will be loaded into it from the source systems by ETL tools.
5. Direct updates (i.e. not through formally released ETL processes) will be prohibited; instead a separate application or applications will exist as a surrogate source.
6. The data model will not be used in a ‘mixed mode’ where some parts use one data modelling convention and other parts use another. (This is generally bad practice with any modelling technique but often the outcome where the responsibility for data modelling is distributed or re-assigned over time.)

Requirements

1. The data model will work on any standard business intelligence relational database14. This is to ensure that it can be deployed on any current platform and, if necessary, re-deployed on a future platform.
2. The data model will be process neutral, i.e. it will not reflect current business processes, practices or dependencies but will instead store the data items and relationships as defined by their use at the point in time when the information is acquired.
3. The data model will use a design pattern15, i.e. a general reusable solution to a commonly occurring problem.
A design pattern is not a finished design but a description or template for how to solve a problem that can be used in many different situations.13 For further information on Transaction Repositories see the Data Management & Warehousing whitepaper ”An Overview Architecture For Enterprise Data Warehouses”14 A typical list would (at the time of writing) include IBM DB2, Microsoft SQL Server, Netezza, Oracle,Sybase, Sybase IQ, and Teradata. For the purposes of this document it implies compliance with at leastthe SQL92 standard15 http://en.wikipedia.org/wiki/Software_design_pattern © 2009 Data Management & Warehousing Page 12
4. Convention over configuration16: this is a software design paradigm which seeks to decrease the number of decisions that developers need to make, gaining simplicity but not necessarily losing flexibility. It can be applied successfully to data modelling, reducing the number of decisions of the data modeller by ensuring that tables and columns use a standard naming convention and are populated and queried in a consistent fashion. This also has a significant impact on the efforts of an ETL developer.
5. The design should also follow the DRY (Don’t Repeat Yourself) principle. This is a process philosophy aimed at reducing duplication. The philosophy emphasizes that information should not be duplicated, because duplication increases the difficulty of change, may decrease clarity, and leads to opportunities for inconsistency17.
6. The data model should be significantly static over a long period of time, i.e. there should not be a need to add or modify tables on a regular basis. In this case there is a difference between designed and implemented: it is possible to have designed a table but not to implement it until it is actually required. This does not affect the static nature of the data model, as the placeholder already exists.
7. The data model should store data at the lowest possible level18 and avoid the storage of aggregates.
8. The data model should support the best use of platform-specific features whilst not compromising the design19.
9. The data model should be completely time-variant, i.e. it should be possible to reconstruct the information at any available point in time20.
10. The data model should act as a communication tool to aid the refinement of requirements and an explanation of possibilities.

16 For further information see http://en.wikipedia.org/wiki/Convention_over_Configuration and http://softwareengineering.vazexqi.com/files/pattern.html.
The Ruby on Rails framework (http://www.rubyonrails.org/) makes extensive use of this principle.
17 DRY is a core principle of Andy Hunt and Dave Thomas's book The Pragmatic Programmer. They apply it quite broadly to include "database schemas, test plans, the build system, even documentation." When the DRY principle is applied successfully, a modification of any single element of a system does not change other logically unrelated elements. Additionally, elements that are logically related all change predictably and uniformly, and are thus kept in sync. (http://en.wikipedia.org/wiki/DRY). This does not automatically imply database normalisation, but database normalisation is one method for ensuring ‘dryness’.
18 This is the origin of the term ‘Transaction Repository’ rather than ‘Data Warehouse’ in Data Management & Warehousing documentation. The transaction repository stores the lowest level of data that is practical and/or available. (See An Overview Architecture for Enterprise Data Warehouses)
19 This turns out to be both simple and very effective. For Oracle the most common features that need support include partitioning and materialized views. For Sybase IQ and Netezza there is a preference for inserts over updates due to their internal storage mechanisms. For all databases there is variation in indexing strategies. These and other features should be easily accommodated.
20 Also known as temporal. Most data warehouses are not linearly time-variant but quantum time-variant. If a status field is updated three times in a day and the data warehouse reflects all changes then it is linearly time-variant. If a data warehouse holds the first and last values only, because a batch process loads it once a day, then it is quantum time-variant where the quantum is, in this case, one day. Quantum time-variant solutions can only resolve data to the level of the quantum unit of measure.
The Data Model

As this white paper has defined requirements for the data model it is now possible to start looking at what is needed to design one. This is done by breaking down the tables that will be created into different groups depending on how they are used. The section below discusses the main elements of the data model. There are some basics, such as naming conventions, standard short names and the keys used in the data model, that are not described here; a complete set of data modelling rules and example models can be found in the appendices.

Major Entities

Party is, as described in the customer paradigm section above, an example of a type of table within the Process Neutral Data Modelling method known as a ‘Major Entity’. These are tables that deliver the placeholders for all major subject areas of the data model and around which other information is grouped. Each business transaction will relate to a number of major entities. Some major entities are global, i.e. they apply to all types of organisation (e.g. Calendar), and there are a number of major entities that are industry specific (e.g. for Telco, Manufacturing, Retail, Banking, etc.). It would be very unusual for an organisation to need a major entity that was not industry wide. Below is a list of some of the most common:

• Calendar
  Every data warehouse will need a calendar. It should always contain data to the day level and never to parts of the day. In some cases there is a need to support sub-types of calendar for non-Gregorian calendars21.

• Party
  Every organisation will have dealings between parties. This will normally include three major sub-types: individuals, organisations (any formal organisation such as a company, charity, trust, partnership, etc.) and organisational units (the components within an organisation, including the system owner's organisation).

• Geography
  The information about ‘where’.
This is normally sub-typed into two components, address and location22. Address information is often limited to postal addresses whilst location is normally described by longitude and latitude via GPS co-ordinates. Other specialist geographic models exist that may need to be taken into account23.

• Product_Service (also known as Product or as Service)
  This is the catalogue of the products and/or services that an organisation supplies.

• Account
  Every customer will have at least one account if financial transactions are involved (even those organisations that do not think they currently use the concept of an account will do so, as accounting systems always have the concept of a customer with one or more accounts).

21 See http://www.qppstudio.net/footnotes/non-gregorian.htm for various calendars; notably 2008 is the Muslim Year 1429 and the Jewish Year 5768
22 Some countries, such as the UK, have validated lists of all addresses (see the UK Post Office Postcode Address File at http://www.royalmail.com/portal/rm/jump2?mediaId=400085&catId=400084)
23 Network Rail in the UK use an Engineers Line Reference, which is based on a linear reference model and refers to a known distance from a fixed point on a track. In Switzerland they have an entire national co-ordinate system (http://en.wikipedia.org/wiki/Swiss_coordinate_system)
• Electronic_Address
  Any electronic address such as a telephone number, email address, web address, IP address etc. This is normally sub-typed by the categories used.

• Asset (also known as Equipment)
  A physical object that can be uniquely identified (normally by a serial number or similar). This may be used or incorporated in a PRODUCT_SERVICE, or sold to a customer, etc. In the example, Cabinet, Rack and Widget were all examples of Asset, whilst Widget Type was an example of PRODUCT_SERVICE.

• Component
  A physical object that cannot be uniquely identified by a serial number but has a part number and is used in the make-up of either an asset or a product service. In the example company there was no particular record of the serial numbers of the lamps; however they would all have had a part number that described the type of lamp to be used.

• Channel
  A conceptual route to market (e.g. direct, indirect, web-based, call-centre, etc.).

• Campaign
  A marketing exercise that is designed to promote the organisation, e.g. the running of a series of adverts on the television.

• Campaign Activities
  The running of a specific advert as part of a larger campaign.

• Contract
  Depending on the type of business, the relationship between the organisation and its supplier or its customer may require the concept of a contract as well as that of an account.
• Tariff (also known as Price_List)
  A set of charges and discounts that can be applied to product services at a point in time.

This list is not comprehensive, but if an organisation can effectively describe its major entities and combine this information with the interactions between them (the occurrences or transactions) then it has the basis of a very successful data warehouse.

Major Entities can have any meaningful name provided it is not a reserved word in the database or (as will be seen below) a reserved word within the design pattern of Process Neutral Data Modelling.

Some readers, who are familiar with the concepts of star schemas and data marts, will also be aware that these are very close to the basic dimensions that most data marts use. This should come as no surprise, as these are the major data items of any business regardless of their business processes or of their specific industry sector, and a data mart is only a simplification of the data presented for the user. This effect is called the “natural star schema” and will be explored in more detail later.
Lifetime Value

The next decision is which columns (attributes) should be included in the table. Much like the processes involved in normalising a database24, the objective is to minimise duplication of data, and there is also a requirement to minimise updates. To this end the attributes that are included should have ‘lifetime value’, i.e. they should remain constant once they have been inserted into the database. This means that variable data needs to be handled elsewhere. Using some of the major entities above as examples:

Calendar:
  Lifetime Value Attributes: Date, Public Holiday Flag

Geography:
  Lifetime Value Attributes: Address Line 1, Address Line 2, City, Postcode25, County, Country
  Non-Lifetime Value Attributes: Population

Party (Individuals):
  Lifetime Value Attributes: Forename, Surname26, Date of Birth, Date of Death, Gender27, State ID Number
  Non-Lifetime Value Attributes: Marital Status, Number of Children, Income

Party (Organisations):
  Lifetime Value Attributes: Name, Start Date, End Date, State ID Number
  Non-Lifetime Value Attributes: Number of Employees, Turnover, Shares Issued

Account:
  Lifetime Value Attributes: Account Number, Start Date, End Date
  Non-Lifetime Value Attributes: Balance

Other than this lifetime value requirement for columns, every table must comply with the general rules for any table.
For example, every table will have a key column that uses the table short name made up of six characters and the suffix _DWK28, a TIMESTAMP column and an ORIGIN column.

24 http://en.wikipedia.org/wiki/Database_normalization: Database normalization is a technique for designing relational database tables to minimize duplication of information and, in so doing, to safeguard the database against certain types of logical or structural problems, namely data anomalies.
25 This may occasionally be a special case as postal services do, from time to time, change postal codes that are normally static.
26 There is a specific special case that deals with the change of name for married women that will be dealt with in the section ‘The Party Special Case’ later.
27 One insurance company had to deal with updatable genders due to the fact that underwriting rules require assessment based on birth gender and not gender as a result of re-assignment surgery. Therefore for marketing it had to handle ‘current’ gender and for underwriting it had to deal with ‘birth’ gender.
28 See the data modelling rules appendix for how this name is created.
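The rules above can be sketched concretely. The following is a minimal illustration (using SQLite via Python, not the paper's actual DDL) of a major entity table: only lifetime-value attributes are stored, together with the standard key column (six-character short name plus the _DWK suffix) and the TIMESTAMP and ORIGIN audit columns. The exact column names here are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Major entity PARTIES: lifetime-value attributes only, plus the standard
# surrogate key (PARTIE_DWK) and the TIMESTAMP / ORIGIN audit columns.
conn.execute("""
    CREATE TABLE PARTIES (
        PARTIE_DWK     INTEGER PRIMARY KEY,  -- six-char short name + _DWK
        FORENAME       TEXT,
        SURNAME        TEXT,
        DATE_OF_BIRTH  TEXT,                 -- lifetime-value attributes
        GENDER         TEXT,
        LOAD_TIMESTAMP TEXT,                 -- when the row was loaded
        ORIGIN         TEXT                  -- which source system supplied it
    )
""")
conn.execute(
    "INSERT INTO PARTIES VALUES (1, 'John', 'Smith', '1970-01-01', 'M', "
    "'2009-01-01T00:00:00', 'CRM')"
)
row = conn.execute("SELECT FORENAME, SURNAME FROM PARTIES").fetchone()
print(row)  # ('John', 'Smith')
```

Note that non-lifetime-value data such as marital status is deliberately absent; it belongs in the property tables described later.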
Type Tables

There is often a need to categorise information into discrete sets of values. The valid set of categories will probably change over time and therefore each category record also needs to have lifetime value. Examples of this categorisation have already occurred with some of the major entities:

• Party: Individual, Organisation, Organisation Unit
• Geography: Postal Address, Location
• Electronic Address: Telephone, E-Mail

To support this, and to comply with the requirement for convention over configuration, all _TYPES tables have a standard data model as follows:

• The table will have the same name as the major entity but with the suffix _TYPES (e.g. PARTY_TYPES, GEOGRAPHY_TYPES, etc.).
• The table will always have a key column that uses the six character short code and the _DWK suffix.
• The table will have a _TYPE column that is the type name.
• The table will have a _DESC column that is a description of the type.
• The table will have a _GROUP column that groups certain types together.
• The table will have a _START_DATE column and an _END_DATE column.

This is a type table in its entirety. If a table needs more information (i.e. columns) then it is not a _TYPES table and must not have the _TYPES suffix, as it does not comply with the rules for a _TYPES table.

Examples of data in _TYPES tables might include:

PARTY_TYPES (one example row per column):
  PARTYP_DWK:            1 | 2 | 3 | 4
  PARTY_TYPE:            INDIVIDUAL | LTD COMPANY | PARTNERSHIP | DIVISION
  PARTY_TYPE_DESC:       An individual | A company in which the liability of the members in respect of the company’s debts is limited | A business owned by two or more people who are personally liable for all business debts | A division of a larger organisation
  PARTY_TYPE_GROUP:      INDIVIDUAL | ORGANISATION | ORGANISATION | ORGANISATION UNIT
  PARTY_TYPE_START_DATE: 01-JAN-1900 | 01-JAN-1900 | 01-JAN-1900 | 01-JAN-1900
  PARTY_TYPE_END_DATE:   (null) | (null) | (null) | (null)

Figure 5 - Example data for PARTY_TYPES

The start date has little initial value in this context, although it is a mandatory field29 and therefore has to be completed with a date before the earliest party in this example. Legal types of organisation do change over time and so it is possible that the start and end dates of these will become significant. These types do not describe the role that the party is performing (i.e. Customer, Supplier, etc.); they describe the type of the party (e.g. Individual, etc.). Describing the role comes later. The type and group columns are repeated for INDIVIDUAL, as there is no hierarchy of information for this value but the field is mandatory.

29 Start Dates in _TYPES tables are mandatory as, with only a few exceptions, they are required information. In order to be consistent they therefore have to be mandatory for all _TYPES tables
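The standard _TYPES layout can be sketched as follows (a minimal illustration using SQLite via Python; the description texts are abbreviated from the figure above). It also shows how the _GROUP column lets several types roll up into one category:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Standard _TYPES layout: key, _TYPE, _DESC, _GROUP, start/end dates -- nothing more.
conn.execute("""
    CREATE TABLE PARTY_TYPES (
        PARTYP_DWK            INTEGER PRIMARY KEY,
        PARTY_TYPE            TEXT NOT NULL,
        PARTY_TYPE_DESC       TEXT,
        PARTY_TYPE_GROUP      TEXT NOT NULL,
        PARTY_TYPE_START_DATE TEXT NOT NULL,
        PARTY_TYPE_END_DATE   TEXT
    )
""")
conn.executemany(
    "INSERT INTO PARTY_TYPES VALUES (?, ?, ?, ?, ?, NULL)",
    [
        (1, "INDIVIDUAL",  "An individual",                       "INDIVIDUAL",        "1900-01-01"),
        (2, "LTD COMPANY", "A company with limited liability",    "ORGANISATION",      "1900-01-01"),
        (3, "PARTNERSHIP", "A business owned by two or more",     "ORGANISATION",      "1900-01-01"),
        (4, "DIVISION",    "A division of a larger organisation", "ORGANISATION UNIT", "1900-01-01"),
    ],
)
# The _GROUP column rolls several types up into one category.
orgs = [r[0] for r in conn.execute(
    "SELECT PARTY_TYPE FROM PARTY_TYPES "
    "WHERE PARTY_TYPE_GROUP = 'ORGANISATION' ORDER BY PARTYP_DWK"
)]
print(orgs)  # ['LTD COMPANY', 'PARTNERSHIP']
```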
GEOGRAPHY_TYPES (one example row per column):
  GEOTYP_DWK:                1 | 2
  GEOGRAPHY_TYPE:            POSTAL | LOCATION
  GEOGRAPHY_TYPE_DESC:       An address as supported by the postal service | A point on the surface of the earth defined by its longitude and latitude
  GEOGRAPHY_TYPE_GROUP:      POSTAL | LOCATION
  GEOGRAPHY_TYPE_START_DATE: 01-JAN-1900 | 01-JAN-1900
  GEOGRAPHY_TYPE_END_DATE:   (null) | (null)

Figure 6 - Example Data for GEOGRAPHY_TYPES

The start date in this context has little initial value, although it is a mandatory field and therefore has to be completed with a date. These types do not describe the role that the geography is performing (i.e. home address, work address, etc.); they describe the type of the geography (postal address, point location, etc.). The type and group columns are repeated for both values, as there is no hierarchy of information for them.

CALENDAR_TYPES

The convention over configuration design aspect allows for this table; however it is rarely needed and can therefore be omitted. This is an example of a table that can be described as designed (i.e. it is known exactly what it looks like) but not implemented.

_TYPES tables will appear in other parts of the data model but they will always have the same function and format. The consequence of this design re-use30 is that implementing an application to manage the source of _TYPE data is easy. The system that manages the type data needs to have a single table with the same columns as a standard _TYPES table and an additional column called, for example, DOMAIN. This DOMAIN column holds the target system table name (e.g. PARTY_TYPES). The ETL then simply maps the data from the source system to the target system where the DOMAIN equals the target table name.
This is an example of re-use generating a significant saving in the implementation.30 This is a good use of a Warehouse Support Application as defined in “An Overview Architecture forEnterprise Data Warehouses” © 2009 Data Management & Warehousing Page 18
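The DOMAIN-driven mapping described above can be sketched as follows (a minimal illustration using SQLite via Python; the source table name SRC_TYPES and its reduced column set are assumptions). One shared source table feeds every _TYPES target, with the DOMAIN column selecting the destination:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# One source-side table holds every domain's type rows; the DOMAIN column
# names the target _TYPES table each row belongs to.
conn.execute("CREATE TABLE SRC_TYPES (DOMAIN TEXT, TYPE_NAME TEXT, TYPE_GROUP TEXT)")
conn.executemany("INSERT INTO SRC_TYPES VALUES (?, ?, ?)", [
    ("PARTY_TYPES",     "INDIVIDUAL", "INDIVIDUAL"),
    ("PARTY_TYPES",     "DIVISION",   "ORGANISATION UNIT"),
    ("GEOGRAPHY_TYPES", "POSTAL",     "POSTAL"),
])
for target in ("PARTY_TYPES", "GEOGRAPHY_TYPES"):
    conn.execute(f"CREATE TABLE {target} (TYPE_NAME TEXT, TYPE_GROUP TEXT)")
    # The generic ETL mapping: filter the shared source on DOMAIN = target name.
    conn.execute(
        f"INSERT INTO {target} SELECT TYPE_NAME, TYPE_GROUP "
        f"FROM SRC_TYPES WHERE DOMAIN = ?",
        (target,),
    )
party_count = conn.execute("SELECT COUNT(*) FROM PARTY_TYPES").fetchone()[0]
geo_count = conn.execute("SELECT COUNT(*) FROM GEOGRAPHY_TYPES").fetchone()[0]
print(party_count, geo_count)  # 2 1
```

Because every _TYPES table has the same shape, the same loop loads all of them; this is the saving that the design re-use produces.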
Band Tables

Whilst _TYPES tables classify information into discrete values, it is sometimes necessary to classify information into ranges or bands, i.e. between one value and another. The classic example of this is telephone calls, which are classified as ‘Off-Peak Rate’ if they are between 00:00 and 07:59 or between 18:00 and 23:59. Calls between 08:00 and 17:59 are classified as ‘Peak Rate’ and charged at a premium. _BANDS is a special case of the _TYPES table and would store the data as follows:

TIME_BANDS (one example row per column):
  TIMBAN_DWK:              1 | 2 | 3
  TIME_BAND:               Early Off Peak | Peak | Late Off Peak
  TIME_BAND_START_VALUE31: 0 | 480 | 1080
  TIME_BAND_END_VALUE:     479 | 1079 | 1439
  TIME_BAND_DESC:          Early Off Peak | Peak | Late Off Peak
  TIME_BAND_GROUP:         Off Peak | Peak | Off Peak
  TIME_BAND_START_DATE:    01-JAN-1900 | 01-JAN-1900 | 01-JAN-1900
  TIME_BAND_END_DATE:      (null) | (null) | (null)

Figure 7 - Example data for TIME_BANDS

Once again the _BANDS table has a standard format, as follows:

• The table will have the same name as the major entity but with the suffix _BANDS (e.g. TIME_BANDS, etc.).
• The table will always have a key column that uses the six character short code and the _DWK suffix.
• The table will have a _BAND column that is the band name.
• The table will have a _START_VALUE and an _END_VALUE column that represent the starting and finishing values of the band.
• The table will have a _DESC column that is a description of the band.
• The table will have a _GROUP column that groups certain bands together.
• The table will have a _START_DATE column and an _END_DATE column.

The table has to comply with this convention in order to be given the _BANDS suffix.

31 Note that values are stored as a number of minutes since midnight.
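Classifying a value against a _BANDS table is a simple range lookup. The sketch below (SQLite via Python; the helper function name is an assumption) uses the TIME_BANDS data above, with times expressed as minutes since midnight:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE TIME_BANDS (
        TIMBAN_DWK INTEGER PRIMARY KEY,
        TIME_BAND TEXT,
        TIME_BAND_START_VALUE INTEGER,  -- minutes since midnight
        TIME_BAND_END_VALUE INTEGER,
        TIME_BAND_GROUP TEXT
    )
""")
conn.executemany("INSERT INTO TIME_BANDS VALUES (?, ?, ?, ?, ?)", [
    (1, "Early Off Peak", 0,    479,  "Off Peak"),
    (2, "Peak",           480,  1079, "Peak"),
    (3, "Late Off Peak",  1080, 1439, "Off Peak"),
])

def classify(minutes_since_midnight: int) -> str:
    """Return the band group for a call starting at the given minute of day."""
    row = conn.execute(
        "SELECT TIME_BAND_GROUP FROM TIME_BANDS "
        "WHERE ? BETWEEN TIME_BAND_START_VALUE AND TIME_BAND_END_VALUE",
        (minutes_since_midnight,),
    ).fetchone()
    return row[0]

print(classify(9 * 60))   # a 09:00 call falls in the Peak band
print(classify(19 * 60))  # a 19:00 call falls in an Off Peak band
```

Because the bands partition the value range, exactly one row matches each input.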
Property Tables

In the discussion of major entities and lifetime value, the data that failed to meet the lifetime value principle was omitted from the major entity tables; however it still needs to be stored. This is handled via a property table. Property tables also help to support the extensibility aspects of the data model.

If we use PARTY as an example then, as already identified, marital status does not possess lifetime value and therefore is not included in the major entity. Everyone starts as single, some marry, some divorce and some are widowed; these ‘status changes’ occur throughout the lifetime of the individual. To deal with this, the property table can be modelled as follows:

Figure 8 - Party Properties Example

As can be seen from the example above, in order to handle the properties two new tables are created. The first is the PARTY_PROPERTIES table itself and the second a supporting PARTY_PROPERTY_TYPES table. In order to store the marital status of an individual a set of data needs to be entered in the PARTY_PROPERTY_TYPES table:

TYPE          GROUP
Single        Marital Status
Married       Marital Status
Divorced      Marital Status
Co-Habiting   Marital Status

Figure 9 - Example Party Property Data

The description, start and end date would be filled in appropriately. Note that the start and end date here represent the start and end date of the type and not of the individual's use of that type32. It is now possible to insert a row in the PARTY_PROPERTIES table that references the individual in the PARTY table and the appropriate PARTY_PROPERTY_TYPES entry (e.g. ‘Married’).
The PARTY_PROPERTIES table can also hold the start date and end date of this status and optionally where appropriate a text or numeric value that relates to that property.32 The need for start and end dates on such items is often questioned however experience shows thatlegislation changes supposed static values in most countries over the lifetime of the data warehouse.For example in December 2005 the UK permitted a new type of relationship called a civil partnership.http://en.wikipedia.org/wiki/Civil_partnerships_in_the_United_Kingdom. © 2009 Data Management & Warehousing Page 20
This means that not only the current marital status can be stored but also historical information33:

PARTY_DWK    PARTY_PROPERTY_DWK   START_DATE    END_DATE
John Smith   Single               01-Jan-1970   02-Feb-1990
John Smith   Married              03-Feb-1990   04-Mar-2000
John Smith   Divorced             05-Mar-2000   06-Apr-2005
John Smith   Co-Habiting          07-Apr-2005

Figure 10 - Example data for PARTY_PROPERTIES

The data shown here describes the complete history of an individual, with the last row showing the current state as the START_DATE is before ‘today’ and the END_DATE is null. There is also nothing to prevent future information from being held. If John Smith announces that he is going to get married on a specific date in the future then the current record can have its end date set appropriately and a new record added.

If another property is required (e.g. Number of Children) then no change is required to the data model. New rows are entered into the PARTY_PROPERTY_TYPES table:

TYPE     GROUP
Male     Number of Children
Female   Number of Children

Figure 11 - Example Data for PARTY_PROPERTY_TYPES

This allows data to be added to PARTY_PROPERTIES as follows:

PARTY_DWK    PARTY_PROPERTY_DWK   START_DATE    END_DATE      VALUE
John Smith   Single               01-Jan-1970   02-Feb-1990
John Smith   Married              03-Feb-1990   04-Mar-2000
John Smith   Divorced             05-Mar-2000   06-Apr-2005
John Smith   Co-Habiting          07-Apr-2005
John Smith   Male                 09-Jun-2001                 1
John Smith   Female               10-Jul-2002                 1

Figure 12 - Example Data for PARTY_PROPERTIES

In fact any number of new properties can be added to the tables as business processes and source systems change and new data requirements come about. The effect of this method, when compared to other methods of modelling this information, is to create very narrow (i.e. not many columns), long (i.e. many rows) tables instead of much wider, shorter tables. However the properties table is very effective.
Firstly, unlike the example, the two _DWK columns are integers34, as are the start and end dates. Many of the _VALUE fields will be NULL, and those that are not will be predominantly numeric rather than text values. The PARTY_PROPERTY_TYPE acts as a natural partitioning key in those databases that support table partitions. This method is very effective in terms of performance and storage of data in databases that use column or vector type storage.

33 Text from the related table is used in the _DWK column rather than the numeric key for clarity in these examples.
34 Integers are better than text strings for a number of reasons: they usually require less storage and there is less temptation to mix the requirements of identification and description (a problem clearly illustrated by car registration numbers in the UK). Keys are more reliable when implemented as integers because databases often have key generation mechanisms that deliver unique values. Integers do not suffer from upper/lower case ambiguities and can never contain special characters or ambiguities caused by different padding conventions (trailing spaces or leading zeros).
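The temporal queries implied by the figures above can be sketched as follows (SQLite via Python; for readability this sketch uses ISO date strings and a reduced column set rather than the integer keys and dates the text recommends). The current value of a property is the open-ended row, and the value as at any past date is the row whose interval covers that date:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE PARTY_PROPERTIES (
        PARTY_DWK INTEGER, PARTY_PROPERTY_TYPE_DWK INTEGER,
        START_DATE TEXT, END_DATE TEXT, VALUE NUMERIC
    )
""")
conn.execute("""
    CREATE TABLE PARTY_PROPERTY_TYPES (
        PARPRT_DWK INTEGER PRIMARY KEY, TYPE TEXT, TYPE_GROUP TEXT
    )
""")
conn.executemany("INSERT INTO PARTY_PROPERTY_TYPES VALUES (?, ?, ?)", [
    (1, "Single", "Marital Status"), (2, "Married", "Marital Status"),
    (3, "Divorced", "Marital Status"), (4, "Co-Habiting", "Marital Status"),
])
conn.executemany("INSERT INTO PARTY_PROPERTIES VALUES (?, ?, ?, ?, NULL)", [
    (1, 1, "1970-01-01", "1990-02-02"),
    (1, 2, "1990-02-03", "2000-03-04"),
    (1, 3, "2000-03-05", "2005-04-06"),
    (1, 4, "2005-04-07", None),
])
# Current status: the open-ended row (END_DATE IS NULL).
current = conn.execute("""
    SELECT t.TYPE FROM PARTY_PROPERTIES p
    JOIN PARTY_PROPERTY_TYPES t ON t.PARPRT_DWK = p.PARTY_PROPERTY_TYPE_DWK
    WHERE p.PARTY_DWK = 1 AND p.END_DATE IS NULL
""").fetchone()[0]
# Status as at a past date: the row whose interval covers that date.
as_at = conn.execute("""
    SELECT t.TYPE FROM PARTY_PROPERTIES p
    JOIN PARTY_PROPERTY_TYPES t ON t.PARPRT_DWK = p.PARTY_PROPERTY_TYPE_DWK
    WHERE p.PARTY_DWK = 1 AND p.START_DATE <= '1995-06-01'
      AND (p.END_DATE IS NULL OR p.END_DATE >= '1995-06-01')
""").fetchone()[0]
print(current, as_at)  # Co-Habiting Married
```

This is the sense in which the property table gives complete time variance: any historical state is a query away, with no extra modelling.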
The real saving in the number of rows is normally less than expected when compared to more conventional data model techniques that store duplicated rows for changed data. The example above has seven rows of data. The alternative approach of repeated sets of data requires six rows of data and considerably more storage because of the duplicated data:

PARTY_DWK    START_DATE    END_DATE      MARITAL_STATUS   MALE CHILDREN   FEMALE CHILDREN   UNKNOWN CHILDREN
John Smith   01-Jan-1970   02-Feb-1990   Single           0               0                 0
John Smith   03-Feb-1990   08-Jun-2001   Married          0               0                 0
John Smith   09-Jun-2001   09-Jul-2002   Married          1               0                 0
John Smith   10-Jul-2002   04-Mar-2000   Married          1               1                 0
John Smith   05-Mar-2000   06-Apr-2005   Divorced         1               1                 0
John Smith   07-Apr-2005                 Co-Habiting      1               1                 0

Figure 13 - Example Data for PARTY_PROPERTIES

The other main objection to this technique is often described as the cost of matrix transformation of the data: that is, changing the data from columns into rows in the ETL that loads the data warehouse, and then changing the rows back into columns in the ETL that loads the data mart(s). This objection is normally due to a lack of knowledge of appropriate ETL techniques that can make this very efficient, such as using the SQL set operations ‘UNION’, ‘MINUS’ and ‘INTERSECT’.

Event Tables

An event table is almost identical to a property table except that instead of having _START_DATE and _END_DATE columns it has a single _EVENT_DATE column. It also has the appropriate _EVENT_TYPES table. The table name has a suffix of _EVENTS. For example a wedding is an event (it happens at a single point in time), but ‘being married’ is a property (it happens over a period of time). Events can be stored in property tables simply by storing the same value in both the start date and end date columns, and this is a more common solution than creating a separate table.
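The rows-to-columns half of the matrix transformation mentioned above can be sketched with a single grouped query (SQLite via Python; the reduced column set is an assumption for illustration). One pass over the narrow property table produces the wide record a data mart dimension would hold:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE PARTY_PROPERTIES (PARTY_DWK INTEGER, PROP TEXT, VALUE INTEGER)"
)
conn.executemany("INSERT INTO PARTY_PROPERTIES VALUES (?, ?, ?)", [
    (1, "Male",   1),   # one male child recorded
    (1, "Female", 1),   # one female child recorded
])
# Pivot the narrow rows into one wide row per party: each conditional
# aggregate picks out a single property as a column.
row = conn.execute("""
    SELECT PARTY_DWK,
           MAX(CASE WHEN PROP = 'Male'   THEN VALUE END) AS MALE_CHILDREN,
           MAX(CASE WHEN PROP = 'Female' THEN VALUE END) AS FEMALE_CHILDREN
    FROM PARTY_PROPERTIES
    GROUP BY PARTY_DWK
""").fetchone()
print(row)  # (1, 1, 1)
```

The reverse transformation (columns into rows for loading the warehouse) is a UNION of one SELECT per property column, which is the technique the text alludes to.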
The use of _EVENTS tables is usually limited to places where events form a significant part of the data and the cost of storing the extra field becomes significant. It should be noted that an _EVENTS table is only required where the event may occur many times (e.g. a wedding date), rather than for information that can only happen once (e.g. first wedding date), which would be stored in the appropriate major entity as, once set, it would have lifetime value.

Figure 14 - Party Events Example

_EVENTS tables are a special case of _PROPERTIES tables.
Link Tables

Up to this point, attributes within a single major entity record have been examined. It is also possible that records within a major entity will relate to other records in the same major entity (e.g. John Smith is married to Jane Smith, both of whom are records within the PARTIES table). This is called a peer-to-peer relationship and is stored in a table with the suffix _LINKS, supported by the appropriate _LINK_TYPES table.

Figure 15 - Party Links Example

The significant difference in a _LINK table is that there are two relationships from the major entity (in this case PARTIES). This also allows hierarchies to be stored, so that:

John Smith (Individual) works in Sales (Organisational Unit)
Sales (Organisation Unit) is a division of ACME Enterprises (Organisation)

where ‘works in’ and ‘is a division of’ are examples of the _LINK_TYPE. It should also be noted that there is a direction to the relationship, because one of the linking fields is the main key (in this case PARTIE_DWK) and the other is the linked key (in this case LINKED_PARTIE_DWK). There are two options. One is to store the relationship in both directions (e.g. John Smith is married to Jane Smith and Jane Smith is married to John Smith). This can be made complete with a reversing view35 but defeats both the ‘Convention over Configuration’ principle and the ‘DRY (Don’t Repeat Yourself)’ principle. The second method is to have a convention and only store the relationship in one direction (e.g. John Smith is married to Jane Smith, therefore the convention could be that the male is stored in the main key and the female in the linked key).

35 A reversing view is one that has all the same columns as the underlying table except that the two key columns are swapped around. In this example PARTIE_DWK would be swapped with LINKED_PARTIE_DWK.
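The single-direction convention can be sketched as follows (SQLite via Python; a reduced column set, with the reversing view expressed as a UNION query rather than a database view). Each relationship is stored once, and a two-branch query traverses it from either end:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE PARTY_LINKS (
        PARTIE_DWK INTEGER,         -- main key
        LINKED_PARTIE_DWK INTEGER,  -- linked key
        LINK_TYPE TEXT
    )
""")
# Store each relationship once, in a single agreed direction (DRY).
conn.execute("INSERT INTO PARTY_LINKS VALUES (1, 2, 'is married to')")  # John -> Jane
conn.execute("INSERT INTO PARTY_LINKS VALUES (1, 3, 'works in')")       # John -> Sales
# A UNION plays the role of the reversing view: who is party 2 married to,
# whichever side of the link they were stored on?
partners = conn.execute("""
    SELECT LINKED_PARTIE_DWK FROM PARTY_LINKS
     WHERE PARTIE_DWK = 2 AND LINK_TYPE = 'is married to'
    UNION
    SELECT PARTIE_DWK FROM PARTY_LINKS
     WHERE LINKED_PARTIE_DWK = 2 AND LINK_TYPE = 'is married to'
""").fetchall()
print(partners)  # [(1,)]
```

Storing the row once and querying in both directions keeps the table DRY while still answering questions from either party's point of view.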
Segment Tables

The final type of information that might be required about a major entity is the segment. This is a collection of records from the major entity that share something in common where no further detail is known. The most common business example is the market segmentation performed on customers: such segments are normally the result of detailed statistical analysis whose results are then stored. In our example John Smith and Jane Smith could both be part of a segment of married people, along with any number of other individuals who are known to be married but for whom there is no information about when or to whom they are married. Where the _LINKS table provided the peer-to-peer relationship, the segment provides the peer-group relationship.

Figure 16 - Party Segments Example
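A segment, as described above, reduces to a simple membership table. The sketch below follows the paper's naming conventions; the specific columns and data are assumptions for illustration.

```python
import sqlite3

# Sketch of a _SEGMENTS table: membership of a named peer group where no
# further detail (such as who is married to whom) is known.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE PARTY_SEGMENT_TYPES (
    PARTY_SEGMENT_TYPE_DWK INTEGER PRIMARY KEY,
    DESCRIPTION            TEXT NOT NULL      -- e.g. 'Married People'
);
CREATE TABLE PARTY_SEGMENTS (
    PARTY_DWK              INTEGER NOT NULL,  -- FK to PARTIES
    PARTY_SEGMENT_TYPE_DWK INTEGER NOT NULL REFERENCES PARTY_SEGMENT_TYPES
);
""")
conn.execute("INSERT INTO PARTY_SEGMENT_TYPES VALUES (1, 'Married People')")
conn.executemany("INSERT INTO PARTY_SEGMENTS VALUES (?, 1)", [(1,), (2,), (7,)])

# All members of the peer group, with no link between individual members
members = [r[0] for r in conn.execute(
    "SELECT PARTY_DWK FROM PARTY_SEGMENTS "
    "WHERE PARTY_SEGMENT_TYPE_DWK = 1 ORDER BY PARTY_DWK")]
print(members)  # [1, 2, 7]
```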
The Sub-Model

The major entities and the six supporting data structures (_TYPES, _BANDS, _PROPERTIES, _EVENTS, _LINKS, and _SEGMENTS) provide sufficient design pattern structure to hold a large part of the information in the data warehouse. This is known as a Major Entity Sub-Model. Significantly, the information that has been stored for a single major entity sub-model is very close to the typical dimensions of a data mart. This design pattern provides complete temporal support and the ability to re-construct a dimension or dimensions based on a given set of business rules.

The set of a major entity and its supporting structures is known as a sub-model. For example the designed PARTY sub-model consists of:

• PARTIES
• PARTY_TYPES
• PARTY_BANDS
• PARTY_PROPERTIES
• PARTY_PROPERTY_TYPES
• PARTY_EVENTS
• PARTY_EVENT_TYPES
• PARTY_LINKS
• PARTY_LINK_TYPES
• PARTY_SEGMENTS
• PARTY_SEGMENT_TYPES

Those tables in bold italics might represent the implemented PARTY sub-model.

Importantly, what has not been provided is the relationships between major entities and the business transactions that occur as a result of the interaction between major entities.
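Because the sub-model is purely convention driven, the full table list for any major entity can be derived mechanically from its name. The helper below is hypothetical (no such function appears in the paper) but the suffix list is taken directly from the text, illustrating the 'Convention over Configuration' principle.

```python
# Hypothetical helper illustrating 'Convention over Configuration': the
# supporting table names for any major entity follow fixed suffixes.
SUPPORT_SUFFIXES = [
    "_TYPES", "_BANDS",
    "_PROPERTIES", "_PROPERTY_TYPES",
    "_EVENTS", "_EVENT_TYPES",
    "_LINKS", "_LINK_TYPES",
    "_SEGMENTS", "_SEGMENT_TYPES",
]

def sub_model_tables(entity_singular: str, entity_plural: str) -> list:
    """Return the major entity table plus its conventional supporting tables."""
    return [entity_plural.upper()] + [
        entity_singular.upper() + suffix for suffix in SUPPORT_SUFFIXES
    ]

tables = sub_model_tables("PARTY", "PARTIES")
print(tables[:3])   # ['PARTIES', 'PARTY_TYPES', 'PARTY_BANDS']
print(len(tables))  # 11
```

Applied to PARTY this reproduces the eleven-table list above; applied to, say, ACCOUNT it would generate the ACCOUNT sub-model the same way.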
History Tables

Extending the example above, it is noticeable that the party does not contain any address information; this is held in the geography major entity. This is another example where current business processes and requirements may change. At the outset the source system may provide a contract address and a billing address; a change in process may require the capture of additional information, e.g. contact addresses and installation addresses.

In practice the only difference between this type of relationship between major entities and the _LINKS relationship is that instead of two references to the same major entity there is one reference to each of two major entities. The data model is therefore relatively simple to construct:

Figure 17 – Party Geography History Example

There is one minor semantic difference between links and histories. _LINKS tables join back onto the major entity and therefore one half of the relationship has to be given priority. In a _HISTORY table there is no need for priority as each of the two attributes is associated with a different major entity.

Finally, note that in this example the major entity is shown without the rest of the sub-model, which can be assumed.
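A _HISTORY table relating two different major entities can be sketched as below. The column names, including the start and end dates that supply the temporal support mentioned earlier, are assumptions in the spirit of the paper rather than a published schema.

```python
import sqlite3

# Sketch of a _HISTORY table relating PARTIES and GEOGRAPHIES. Column names
# (including the start/end dates giving temporal support) are assumed.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE PARTY_GEOGRAPHY_HISTORY (
    PARTY_DWK                        INTEGER NOT NULL,  -- FK to PARTIES
    GEOGRAPHY_DWK                    INTEGER NOT NULL,  -- FK to GEOGRAPHIES
    PARTY_GEOGRAPHY_HISTORY_TYPE_DWK INTEGER NOT NULL,  -- e.g. billing vs contact
    START_DATE                       TEXT NOT NULL,
    END_DATE                         TEXT               -- NULL = current record
);
""")

# Party 1 moves from geography 50 to geography 51: both rows are kept,
# and neither key takes priority over the other
conn.execute("INSERT INTO PARTY_GEOGRAPHY_HISTORY "
             "VALUES (1, 50, 1, '2001-01-01', '2005-06-30')")
conn.execute("INSERT INTO PARTY_GEOGRAPHY_HISTORY "
             "VALUES (1, 51, 1, '2005-07-01', NULL)")

current = conn.execute("""
    SELECT GEOGRAPHY_DWK FROM PARTY_GEOGRAPHY_HISTORY
    WHERE PARTY_DWK = 1 AND END_DATE IS NULL
""").fetchone()[0]
print(current)  # 51
```

Note how, unlike a _LINKS table, neither key column is the 'main' key: each simply points at its own major entity.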
Occurrences and Transactions

The final part of the data model is to build up all the occurrence or transaction tables. In the data mart these are most akin to the fact tables, although as this is a relational model they may occur outside a pure star relationship. Like the major entities there is no standard suffix or prefix, just a meaningful name.

To demonstrate what is required, an example from a retail bank is described. The example is not nearly as complex as a real bank, but it is necessarily longer and more complex than most examples in order to demonstrate a number of features. Banking has been chosen because the concepts will be familiar to most readers. The example only looks at some core banking functions and not at activities such as marketing or specialist products such as insurance.

The Example

The bank has a number of regions and a central 'premium' account function that caters for some business customers. Each region has a number of branches. Branches have a manager and a number of staff. Each branch manager reports to a regional manager.

If a customer has a personal account then the account manager is a branch personal account manager; however, if the individual has a net worth in excess of USD1M the branch manager acts as the account manager. Personal accounts have contact and statement addresses and a range of telephone numbers, e-mails, addresses, etc.

If the account belongs to a business with less than USD1M turnover then the account manager is a business account manager at the branch who reports to the branch manager. If the account belongs to a business with a turnover of between USD1M and USD10M then the account manager is an individual at the regional office who reports to the regional manager. If the account belongs to a business with a turnover of more than USD10M then the account managers at the central office are responsible for the account.
Businesses have contact and statement addresses as well as a number of approved individuals who can use the company account, and contact details for them.

Branch and account managers periodically review the banding of accounts, by income for individuals and by turnover for companies, and if an account is likely to move band in the coming year it is added to the appropriate (future) category. Note that this is only partially fact based, the rest being based on subjective input from account managers.

The bank offers a range of services including current, loan and deposit accounts, credit and debit cards, EPOS (for business accounts only), foreign exchange, etc.

The bank has a number of channels including branches, a call centre service, a web service and the ability to use ATMs for certain transactions.

The bank offers a range of transaction types including cash, cheque, standing order, direct debit, interest, service charges, etc.
After the close of business on the last working day of each month the starting and ending balances, the average daily balance and any interest are calculated for each account.

On a daily basis the exposure (i.e. the sum of all account balances) is calculated for each customer, along with a risk factor: a number between 0 and 100 that is influenced by a number of factors reviewed from time to time by the risk management department. Risk factors might include sudden large deposits or withdrawals, closure of a number of accounts, long-term non-use of an account, etc. that might influence account managers' decisions.

Every transaction that is made is recorded every day and has three associated dates: the date of the transaction, the date it appeared on the system and the cleared date.

De-constructing the example

The bank has a number of regions and a central 'premium' account function that caters for some business customers. Each region has a number of branches. Branches have a manager. Each branch manager reports to a regional manager.

• The bank itself must be held as an organisation.
• The regions and the central 'premium' account function are held as organisation units.36
• The bank and the regions have links.
• The branches are held as organisation units.
• The regions and the branches have links.
• The branches have addresses via a history table.
• The branches have electronic addresses via a history table.
• There are a number of roles stored as organisation units.
• These roles and the individuals have links.
• The roles may have addresses via a history table.
• The roles may have electronic addresses via a history table.
• The individuals may have addresses via a history table.
• The individuals have electronic addresses via a history table.

At this point only existing major entities and history tables have been used.
Also this information would be re-usable in many places, just like the conformed dimensions concept of star schemas but with more flexibility.

If a customer has a personal account then the account manager is a branch personal account manager; however, if the individual has a net worth in excess of USD1M the branch manager acts as the account manager. Personal accounts have contact and statement addresses and a range of telephone numbers, e-mails, etc.

• Customers are held as parties, either individuals or organisations.
• Customers have addresses via a history table.
• Customers have electronic addresses via a history table.
• Accounts are held in the Accounts major entity.
• Customers are related to accounts via a history table.
• Branches are related to accounts via a history table.
• Accounts are associated with a role via a history table.
• An individual's net worth is generated elsewhere and stored as a property of the party.

36 See Appendix 2 – Understanding Hierarchies for an explanation as to why the regions are organisation units and not geography.
• A high net worth individual is a member of a similarly named segment.
• The accounts may have addresses via a history table.
• The accounts may have electronic addresses via a history table.

If the account belongs to a business with less than USD1M turnover then the account manager is a business account manager at the branch who reports to the branch manager. If the account belongs to a business with a turnover of between USD1M and USD10M then the account manager is an individual at the regional office who reports to the regional manager. If the account belongs to a business with a turnover over USD10M then the account managers at the central office are responsible for the account. Businesses have contact and statement addresses as well as a number of approved individuals who can use the company account, and contact details for them.

• Businesses are held as parties.
• The business turnover is held as a party property.
• The category membership based on turnover is held as a segment.
• The businesses may have addresses via a history table.
• The businesses may have electronic addresses via a history table.

Branch and account managers periodically review the banding of accounts by turnover for both individuals and companies and if they are likely to move band in the coming year then they are added to the appropriate (future) category. Note that this is only partially fact based, the rest being based on subjective input from account managers.

• There is a need to allow manual input via a warehouse support application for the party segments.

At this point only the PARTY, ADDRESS and ELECTRONIC ADDRESS sub-models and the associated _HISTORY tables have been used.

The bank offers a range of services including current, loan and deposit accounts, credit and debit cards, EPOS (for business accounts only), foreign exchange, etc.

• The product services are held in the Product Service major entity.
• The product services are associated with an account via a history table.

The bank has a number of channels including branches, a call centre service, a web service and the ability to use ATMs for certain transactions.

• The channels are held in the Channels major entity.
• The ability to use a channel for a specific product service is held in the history table that relates the two major entities.

This adds the PRODUCT_SERVICE and CHANNEL major entities into the model.

The bank offers a range of transaction types including cash, cheque, standing order, direct debit, interest, service charges, etc.

• This requires a TRANSACTION_TYPE table that will be added to the transaction table, which has not yet been defined.

After the close of business on the last working day of each month the starting and ending balances, the average daily balance and any interest are calculated for each account.

• This is stored as an account property (it may be an event).
On a daily basis the exposure (i.e. the sum of all account balances) is calculated for each customer, along with a risk factor: a number between 0 and 100 influenced by a number of factors that are reviewed from time to time by the risk management department. Risk factors might include sudden large deposits or withdrawals, closure of a number of accounts, long-term non-use of an account, etc. that might influence account managers' decisions.

• The exposure is stored as a party property (or event).
• The party risk factor is stored as a party property.

Everything that is required to describe the transaction table is now available.

Every transaction that is made is recorded every day and has three associated dates: the date of the transaction, the date it appeared on the system and the cleared date.

• The Transaction Table will have the following columns:
  o Transaction Date
  o Transaction System Date
  o Transaction Cleared Date
  o From Account
  o To Account
  o Transaction Type
  o Amount

This completes the model for the example. There are some interesting features to examine. The first is that all amounts would be positive. This is because for a credit to an account the 'from account' would be the sending party and the 'to account' would be the customer's account. For a debit the 'to account' would be the recipient and the 'from account' would be the customer's account.

This has a number of effects. Firstly it complies with the DRY (Don't Repeat Yourself) principle and means that extra data is not stored for the transaction. It also means that a collection of account information not related to any current party (e.g. a customer at another bank) is built up. This information is useful in the analysis of fraud, churn, market share, competitive analysis, etc. For a customer analysis data mart the data can be extracted and converted into the positive credit/negative debit arrangement required by the users.
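The positive-amount convention and its conversion into the signed view required by a data mart can be sketched as follows. The account identifiers and amounts are invented for illustration.

```python
# Sketch of the positive-amount convention described above: every transaction
# stores a positive amount plus 'from' and 'to' accounts, and the signed
# credit/debit view needed by a data mart is derived at extract time.
transactions = [
    # (from_account, to_account, amount) -- amounts are always positive
    ("EXTERNAL-999", "CUST-001", 500.00),   # credit to the customer
    ("CUST-001", "UTILITY-042", 120.00),    # debit from the customer
]

def signed_amounts(txns, account):
    """Convert to the positive-credit / negative-debit view for one account."""
    out = []
    for frm, to, amount in txns:
        if to == account:
            out.append(amount)    # money in: positive
        elif frm == account:
            out.append(-amount)   # money out: negative
    return out

print(signed_amounts(transactions, "CUST-001"))  # [500.0, -120.0]
```

Note that the external counterparty accounts (EXTERNAL-999, UTILITY-042) accumulate in the data without belonging to any current party, which is exactly the by-product the text highlights as useful for fraud and market-share analysis.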
The payment of bank charges and interest would also have accounts, and in a different data mart this information could be used to look at profitability, exposure, etc.

The process has used seven major entities' sub-models, an additional type table and an occurrence or transaction table. Storing this information should accommodate and absorb almost any change in business process or source system without the need to change the data warehouse model, and will allow multiple data marts to be built from a single data warehouse quickly and easily. In effect the type tables act as metadata for how to use and extend the data model rather than defining the business process explicitly in the data model, hence the name process neutral data modelling.

It also demonstrates the ability of the data model to support the requirements process. By knowing the major entities and using a storyboard approach similar to the example above, and familiar as an approach to agile developers, it is possible to quickly and easily identify business, data and query requirements.
Figure 18 - The Example Bank Data Model

(The figure shows the Party sub-model (Individuals, Organisations, Organisation Units, Roles), the Addresses sub-model (Postal Address, Point Location), the Electronic Addresses sub-model (Telephone Numbers, E-Mail Addresses, Telex), the Accounts, Product Service, Channel and Calendar sub-models and the Transaction Types table, connected through _HISTORY tables to the central Retail Banking Transactions table.)
The model above has been almost fully described in detail by this document, since the self-similar modelling for all the sub-model components has been described along with the history tables, most of the retail banking transactions and some of the lifetime attributes of the major entities. To complete the model, these additional attributes just need to be added.

Two other effects that will influence the creation of data marts from this model can also be seen. Firstly, the creation of dimensions will revolve around the de-normalisation of the attributes that are required from each of the major entities into one of the two dimensions associated with account, as these have the hierarchies for the customer, account manager, etc. associated with them.

The second effect is that of the natural star schema. It is clear from this diagram that the fact tables will be based around the 'Retail Banking Transactions' table. As has already been stated, there are several data marts that can be built from this fact table, probably at different levels of aggregation and with different dimensions.

The occurrence or transaction table above is one of perhaps twenty that a large enterprise would require, along with approximately thirty _HISTORY tables.
This would be combined with around twenty major entity sub-models to create an enterprise data warehouse data model.

Readers who are familiar with the Data Management & Warehousing white paper 'How Data Works'37, which describes natural star schemas in more detail along with a technique called left-to-right entity diagrams, will see the following correlation:

Level  Description
1      _TYPE and _BAND tables: simple, small volume reference data.
2      Major entities: complex, low volume data.
3      Some major entities that are dependent on others, along with _PROPERTIES and _SEGMENTS tables: less complex but with greater volume.
4      _HISTORY tables and some occurrence or transaction tables.
5      Occurrence or transaction tables: significant volume but low complexity data.

Figure 19 - Volume & Complexity Correlations

37 Available for download from http://www.datamgmt.com/whitepapers
