IOUG93 - Technical Architecture for the Data Warehouse - Paper


David Walker, Consultant, Data Management & Warehousing, Wokingham, United Kingdom

Summary

The paper aims to set out a brief description of the process of implementing a Data Warehouse and then to look in more detail at one particular aspect of the process, namely the Technical Architecture. Whilst it cannot be exhaustive, it does set out a model that can be used as a guide in the implementation of a system.

Introduction

When an organisation decides to build a Data Warehouse there are four key elements required of the process. These elements are:

• The Business Analysis
A process of gathering together users of information (this is not the IT department but other departments such as marketing, sales, finance, customer service, etc.) and asking them what their goals are and how they measure them. The measurements are the Key Performance Indicators (KPI) and will eventually become the answers required from the Data Warehouse. The constraints that they place on these KPI, such as time, location, etc., are the dimensions of the Data Warehouse. This basic design allows users to formulate natural language questions against a relational database [for example: What is the total sales (KPI) and average sales (KPI) by region (Dimension) over the past two years (Dimension)?]

Obtaining a common understanding between departments of the KPI and dimensions will in itself prove a challenge and is normally approached as a series of workshops. Users should understand that the Data Warehouse is an iterative design and should therefore not expect answers to all their questions on the first pass. It is also likely that the required information from the system will change over time, as decisions based on information that has become available will change the course of business and require different KPI to help further decision support.
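A natural language question of this form maps directly onto SQL against such a design. The following is a minimal sketch only, using SQLite and invented table names and figures (a real warehouse would of course run this against the Oracle schema):

```python
import sqlite3
from datetime import date, timedelta

# Hypothetical star-schema fragment: a sales fact table and a region dimension.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE region (region_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE sales  (region_id INTEGER, sale_date TEXT, amount REAL);
""")
conn.execute("INSERT INTO region VALUES (1, 'North'), (2, 'South')")
recent = (date.today() - timedelta(days=30)).isoformat()
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                 [(1, recent, 100.0), (1, recent, 300.0), (2, recent, 50.0)])

# "What is the total sales (KPI) and average sales (KPI)
#  by region (Dimension) over the past two years (Dimension)?"
rows = conn.execute("""
    SELECT r.name, SUM(s.amount), AVG(s.amount)
    FROM   sales s JOIN region r ON r.region_id = s.region_id
    WHERE  s.sale_date >= date('now', '-2 years')
    GROUP  BY r.name
    ORDER  BY r.name
""").fetchall()
print(rows)  # [('North', 400.0, 200.0), ('South', 50.0, 50.0)]
```

Each KPI becomes an aggregate in the SELECT list, and each dimension becomes either a GROUP BY column or a WHERE constraint.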
• The Database Schema Design
The next stage of the process is to take the requirements identified by the users and design a schema that will support the queries. By taking the questions identified by users and looking at the data available on the existing systems, a mapping can be produced. This will inevitably show a mismatch between the requirement and the information available. Decisions then have to be made to either supplement this from external sources, modify internal sources to provide such information, or omit the information until a later iteration.

The inevitable consequence of this approach is that one, or a number, of 'star schemas' will be derived. These are schemas that have a large central (or Fact) table containing the KPI and foreign keys to a number of smaller dimension tables. Each dimension table will have a relatively low number of rows but a large number of (highly indexed) columns that are to some extent de-normalised. A user may see a table called TIME that has columns 'Day of Year', 'Day of Week', 'Day of Month', 'Month', 'Week Number', 'Year' and 'Julian Date'. For five years of information this will only contain 1825 (5*365) rows. The primary key will be 'Julian Date'. This dimension will be one of perhaps eight that relate to the large central table. If each of these dimensions contains 2000 rows, then the central table could have up to 2.56 x 10^26 (2000^8) rows. In practice there is a high level of sparsity (no sales on Saturday or Sunday would reduce the TIME dimension by nearly 30% alone) and therefore it is practical to build systems to support the design. Each 'star schema' can be considered a Data Mart for one or more departments. These star schemas may also contain pre-calculated summaries, averages, etc.

The schema design must also include Meta Data, or data about data. This will allow users to interrogate the database to discover where a particular field is and, if appropriate, how it is derived (for example how was the gross profit calculated, which system provided the stock levels, etc.). The Meta Data can also be used to enforce security on the system by determining what a user can or cannot see.
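The TIME dimension and the central fact table described above might be declared as follows. This is a sketch of one possible rendering only (here in SQLite for illustration; the column names are a guess at how the listed attributes would be spelt in a real schema):

```python
import sqlite3
from datetime import date, timedelta

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE time_dim (                 -- the TIME dimension
        julian_date INTEGER PRIMARY KEY,    -- 'Julian Date' is the primary key
        day_of_year INTEGER, day_of_week INTEGER, day_of_month INTEGER,
        month INTEGER, week_number INTEGER, year INTEGER);
    CREATE TABLE sales_fact (               -- central Fact table: KPI plus foreign keys
        julian_date INTEGER REFERENCES time_dim (julian_date),
        region_id   INTEGER,                -- one of perhaps eight dimensions
        total_sales REAL);
""")

# Populate one (non-leap) year of the dimension: 365 rows.
start = date(2023, 1, 1)
for i in range(365):
    d = start + timedelta(days=i)
    conn.execute("INSERT INTO time_dim VALUES (?, ?, ?, ?, ?, ?, ?)",
                 (d.toordinal(), d.timetuple().tm_yday, d.isoweekday(),
                  d.day, d.month, d.isocalendar()[1], d.year))

count = conn.execute("SELECT COUNT(*) FROM time_dim").fetchone()[0]
print(count)  # 365 -- five such years give the 1825 rows quoted above
```

Note how the dimension deliberately de-normalises the calendar: every useful breakdown of the date is stored as its own indexed column.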
• The Technical Architecture
This is the substantive part of the paper and is covered in detail below. At this stage it is sufficient to say that the technical architecture is the design of a practical method to build, maintain and secure the Data Warehouse. As the system is rolled out it will grow in importance and will become mission critical. Whilst not being required on a seven by twenty-four basis, it will need to be available to run very large, and possibly long running, queries. Other issues will include backup and recovery, response, accuracy and presentation.

• Project Management
As has already been stated, the building of a Data Warehouse will be both inter-departmental and iterative. It is necessary for the process to have complete buy-in from management and strong project management to maintain the momentum over the long term. Not only does the project manager have to deal with the initial implementation, but also build the necessary infrastructure to keep the project rolling over the longer term. A failure to achieve this is more often the cause of the collapse of a Data Warehouse project than the technical hurdles encountered along the way.

The Technical Architecture
[Diagram: the overall technical architecture of the Data Warehouse, with the source systems shown at the bottom.]
Data Acquisition

As can be seen at the bottom of the diagram, there are a number of different types of source system. The first problem is not only to identify the sources, but to map and extract the data that you require from each of the systems. This may require complex mappings to obtain all the information. Data acquisition is made up of a number of components, each of which is required to ensure the successful loading of data. In some cases (such as data loading via SQL*Net) a number of these components are performed together.

• Extraction
This is the physical process of extracting the data from the source system, either to flat files or to tapes.

• Transformation
This is the action required to change coding into common formats. A common problem is that the support desk customer id may be different from that of the finance or marketing database. In order for the data to be accessible by all, it has to be converted into a common format.

• Collation
Once a file has been extracted and transformed it can be collated with other files. It may be advantageous at this stage to pre-sort some of the data.

• Migration
This is the moving of the data from the source system to the Data Warehouse. This can take place either via a network or by tape loads. Networking issues arise if the volumes to be moved cannot be supported by the available bandwidth. If tapes are used then there are the compatibility and reliability of media, and the ability of operators to consistently put the tapes into the system in the correct order.

• Loading
Once obtained, the data must be loaded into the system. This may require the use of Oracle SQL*Loader. Consideration needs to be given to whether the parallel and/or direct load options can be supported, with the requirement to re-build the indexes afterwards. Another consideration is whether users will be on the system whilst the load takes place, and, where aggregates have been used, how they are re-calculated after loading.

• Exception Processing
Despite the best efforts of the IT department there will, after loading, be exceptions. Examples of this range from the simple (a new customer key was not inserted and therefore all transactions associated with that customer have not yet been loaded) to the complex (a particular branch office was unable to generate their load file, but all other branches have. How are the gaps accounted for?). This process must handle both the expected and the unexpected.

• Quality Assurance
Once the data has been loaded it is necessary to perform some form of Quality Assurance on it. The load and exception processing may have all succeeded, but how good is the data? Some key indicators need to be checked after each load to ensure that they are within the expected norms (i.e. sales in Central London did not rise by 700% overnight). This can normally be achieved by setting high and low bounds on KPI; the exceptions are then manually checked to see if they are valid and, if so, the bounds changed. If not, the changes either need to be fixed or rolled back.

• Publication
After the completion of the Quality Assurance, end users need to be notified of the state of the database. This can be achieved in a number of ways: a mail message, a message of the day, or an automatic initial query that returns the status of the system. These messages should range from the fact that the Data Warehouse is either available or unavailable, to the date of the last load and the percentage of feeds used (e.g. if one branch of twenty failed to load and a statistical process was used to populate the data, then the data might be described as available with 95% source information included). This information is important to the users as it may affect the way in which they use the outputs from the system.
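The Quality Assurance step described above — high and low bounds on each KPI, with out-of-range values flagged for manual review — can be sketched as follows. The KPI names and bound values are invented for illustration:

```python
# High/low bounds per KPI; values outside these are exceptions to be
# checked manually and either validated (bounds widened) or fixed/rolled back.
KPI_BOUNDS = {
    "central_london_sales": (50_000.0, 250_000.0),
    "new_customers":        (0.0, 500.0),
}

def quality_check(loaded_kpis):
    """Return (kpi, value) pairs that fall outside their expected norms."""
    exceptions = []
    for kpi, value in loaded_kpis.items():
        low, high = KPI_BOUNDS.get(kpi, (float("-inf"), float("inf")))
        if not low <= value <= high:
            exceptions.append((kpi, value))
    return exceptions

# A 700% overnight rise in Central London sales is caught:
flagged = quality_check({"central_london_sales": 700_000.0,
                         "new_customers": 120})
print(flagged)  # [('central_london_sales', 700000.0)]
```

In practice the bounds table would itself live in the warehouse, so that validated exceptions can widen the norms without a code change.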
Data acquisition is performed in two stages. The first is the initial data load, which can be effectively managed by a series of bespoke programs that directly load data into the database (for example PL/SQL queries from the database to other Oracle databases, or to databases that are accessible via SQL*Net and Oracle Gateways). It may also be possible to use one of the growing number of tools on the market designed for this purpose. These include Platinum InfoPump, Prism Warehouse Manager and HP Intelligent Warehouse. This exercise can be considered a one-off effort and, whilst it is desirable to re-use the technology for the second stage, the differences in volumes and change control may not make this possible.

The second stage is the day to day updating of the system. It is unlikely that a single tool will be able to perform this task entirely. However, one should try to use a small number of tools rather than build a complex set of bespoke code to perform the data acquisition. There are two reasons for this:

The first is the changing dynamics of the systems supplying the information. It is unlikely that the source systems will remain static; indeed they are likely to have changed in some form before the end of the first phase of the Data Warehouse project. Therefore a tool that allows the Warehouse Manager to update the relationships between the source and target entities without having to re-craft large code segments is desirable.

The second reason is the speed of development that is required. It is unacceptable to say that the Data Warehouse will not be updated for six weeks whilst new COBOL extraction code is written and tested. Thus traditional software development techniques do not meet the needs of the Data Warehouse.

The Transaction Repository

The Transaction Repository is the core of the system. This is where the data that has been loaded is held. The data here is the lowest level that will be accessible by the users.
The volumes, as have already been described, can be very large. A 250 million row table with 100 bytes per row will require some 25Gb, and this is without any indexes, rollback space or other associated database requirements. Further drains on disk resources will include the use of mirroring (doubling the disk space requirement), or RAID technologies (which can add up to 20% to the disk space requirement). With a large Data Warehouse this may start to cause a problem, as the upper bound of most large UNIX systems (for example Sequent, Pyramid, HP and IBM) is around 3.5Tb of raw disk. This may also not all be usable due to system design considerations such as channel bandwidth to the disk array, or slots available in the system cabinets. The total amount of disk space may be expanded by the use of external mass storage devices supplied by companies such as EMC, DG (Clariion), etc.

An alternative strategy is to distribute the Data Warehouse over a number of systems. One might build a system that has a Fact Table and its associated Dimensions on a single system; the next fact table would reside on another system, and if they shared any dimension tables then one would use Oracle Replication Server to ensure consistency between the copies.

A third solution may be to build a completely distributed system and then use a tool such as MicroStrategy DSS Server or HP Intelligent Warehouse to co-ordinate the query. This provides users with a single entry point, but allows the back-end power of a distributed system.

It is important that, in the rush to build a disk system large enough to hold all the data requirements, one does not preclude processing power. Often SMP architectures can take either processor or disk controller cards up to a combined limit, whilst MPP architectures may need to dedicate nodes to manage the virtual disk array presented to the other nodes. Finally, it is important to realise that where clustering technologies and MPP systems are utilised, there is an overhead involved in returning information from the nodes to the processor that spawned the parallel query. This means that simply adding additional nodes does not mean linear growth in system capacity.

The performance of the data warehouse can now also be dramatically improved by the use of bit-mapped and hash indexes, pre-joined tables and parallel query options. These technologies, whilst very powerful, are not panaceas and can only be used in a controlled fashion within the Data Warehouse.

Data Aggregation

It is possible to improve the performance of the system by the use of Data Aggregation. This allows often-used bands of data (for example quarterly figures) to be pre-calculated and stored in the database.

The level of aggregation may also be matched to the level of user. For example a CEO will want to have a high level view (traditionally associated with EIS systems) and drill down into a lower level of data (traditional Decision Support Systems).
However, the CEO's DSS system may be the entry point for the managers who work under him, and they drill down from the DSS level into the transaction repository to track individual items or trends on a lower scale.

The use of aggregates, whilst improving overall speed, may in itself occupy large amounts of disk space. For example a schema with three dimensions (Region 50 rows, Customer 10,000 rows and Time 365 rows) will have a fact table of total sales with up to 182.5 million (50*10,000*365) rows. If data is now stored not only for each day, but also for 52 weeks, 12 months and 4 quarters, the total size rises to 216.5 million (50*10,000*{365+52+12+4}) rows, a growth of nearly 20% in the total database size. This does not take into account any aggregations of the other two dimensions, which will also cause large amounts of growth in the database size.

It is therefore possible to produce a system that contains Data Marts, suitable for use by front end tools, that query against a smaller database made up of the required aggregates and submit jobs in batch to the transaction repository that run over a much longer time on a separate system. This allows users to get good local response for a large proportion of the queries and gives them an understanding that certain types of highly detailed queries will result in batch jobs.

If queries are monitored then it is possible to change the aggregation strategy over a period of time. This is because, once implemented, the users will start to manipulate the data in ways that were not envisaged in the original design.
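The growth arithmetic for the three-dimension example above can be checked directly:

```python
# Fact-table sizes for the worked example: Region (50), Customer (10,000)
# and Time (365) dimensions, before and after adding weekly, monthly and
# quarterly time aggregates.
region_rows, customer_rows, day_rows = 50, 10_000, 365

daily_only = region_rows * customer_rows * day_rows
with_time_aggregates = region_rows * customer_rows * (365 + 52 + 12 + 4)

growth_pct = 100 * (with_time_aggregates - daily_only) / daily_only
print(daily_only, with_time_aggregates, round(growth_pct, 1))
# 182500000 216500000 18.6
```

The 68 extra time rows (52+12+4) add 34 million fact rows, an 18.6% growth, before any aggregation of the other two dimensions is considered.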
Meta Data & Security

The value of the Data Warehouse is defined by the ability of end users to obtain business benefit from the information they glean from the system. This requires an understanding of what data is there and, more importantly, how it was derived. Different departments, even different individuals within a department, may understand the information in different ways. When the system returns an answer that is unexpected, the user must be able to not only view the results but pursue the unexpected element until they are satisfied with its derivation.

The Meta Data also allows a way for new and interesting perspectives to be obtained on data that is in the system. For example an insurance underwriter may be able to combine factors such as location, weather and car type as a basis for new underwriting decisions, whereas historically the information, whilst held within the organisation, may not have been accessible from a single point and required too much effort to collate.

The value of this type of cross-functional fertilisation raises the problem of security. The Data Warehouse is at its best when queries range across the system bringing back a new perspective. However there are internal problems (for example sales quotas between different departments being confidential) and external problems (for example key queries being run by an employee who then leaves to join a competitor).

There is no easy answer to this, but defining access rights against the Meta Data does enable some level of security to be introduced. It will be a matter of balance between the needs of the company as a whole and the risks of allowing such broad access to staff. The outcome of this challenge will depend on the culture of the organisation and in particular the co-operation of end user managers.
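A minimal sketch of access rights defined against the Meta Data follows; the field names, derivations and roles are invented for illustration:

```python
# Each field's metadata records its derivation, its source system and
# which roles may see it; both browsing and security are driven from
# this one catalogue.
META_DATA = {
    "gross_profit": {"derivation": "net_sales - cost_of_sales",
                     "source": "finance ledger",
                     "roles": {"finance", "board"}},
    "sales_quota":  {"derivation": "set quarterly by sales management",
                     "source": "sales system",
                     "roles": {"sales_mgmt"}},
}

def visible_fields(role):
    """Fields a given role is allowed to see."""
    return sorted(f for f, m in META_DATA.items() if role in m["roles"])

def derivation(field, role):
    """How a field was derived -- or None if the role may not see it."""
    meta = META_DATA[field]
    return meta["derivation"] if role in meta["roles"] else None

print(visible_fields("finance"))             # ['gross_profit']
print(derivation("sales_quota", "finance"))  # None
```

The same lookup that tells a user how gross profit was calculated also decides whether they may see it at all.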
Middle-ware and Presentation

The wide variety of tools on the market makes it almost impossible to make any sensible comments in such a small space. What is clear is that there are a number of different types of query tool. These can be classified into Summary Query, Trawling (or Mining), Simple Data Access and Data Manipulation. Some of these tools will exist in the organisation today (often spreadsheets, and in larger organisations SAS may well exist). Added to this are the user requirements about how they want the data to be presented (for example graphically, text with dimensions, etc.).

The only way in which to choose a tool is to build up a shortlist from current market literature (the Internet and magazines provide countless references), based on the users' stated requirements and some more traditional constraints such as cost and support. Most vendors are more than happy to provide evaluation and trial copies which end users can then evaluate for themselves. It is certain that individual preference and differing requirements will mean that no one tool will suffice, and given the cost of PC tools in general one can afford to allow a number to be tried and accepted or rejected.

The choice of middle-ware is somewhat dependent on the final choice of tool; however there are two clear alternatives within the Oracle arena: SQL*Net and ODBC. Tools such as the HP Intelligent Warehouse will also provide their own connectivity that routes the query through the query manager or co-ordinator. The choice of middle-ware should be limited to one or two, forcing the rejection of front end tools that do not meet this requirement.

Conclusion

The Data Warehouse is an end user tool, which the IT department must build and support. It is a dynamic, iterative, long-term project that requires different techniques from those traditionally employed by information systems professionals.

Staff are required to work in cross-functional teams that will discover business critical facts that may be the key to the success of your employer (and consequently you) in the longer term.

The scale of the infrastructure required to support such a venture cannot and should not be underestimated. It is certain that the Data Warehouse will involve larger volumes of data and new querying techniques that will not be envisaged at the initial design phase. It is therefore critical that the system is built with a flexible and scalable technical architecture designed to meet the unexpected.