Data Warehousing and Data Mining


  • Data warehouse: a repository (or archive) of information gathered from multiple sources and stored under a unified schema. OLAP (on-line analytical processing): a method of analysing data based on multi-dimensional databases. Data mining: analyses data to discover rules and patterns. Both OLAP and data mining are end-user tools for data analysis.
  • DBMSs are pervasive throughout industry. OLTP systems are designed to handle high transaction throughput, where transactions typically make small changes to the operational data used in the day-to-day running of the organisation. They range in size: small databases may be a few megabytes, whereas large ones can require terabytes or even petabytes. Decision makers require access to all of the organisation's data to provide a comprehensive analysis of the organisation, its business, its requirements and its trends. This needs access to both current and past data. A data warehouse differs from an OLTP system in that it fits the definition given on this slide. Subject-oriented: organised around the major subjects of the enterprise (e.g. customers, products, sales) rather than the major application areas (e.g. customer invoicing, stock control, product sales). Integrated: the source data comes together from different enterprise-wide application systems and is often inconsistent, e.g. in different formats. Time-variant: data in the warehouse is only accurate at some point in time or over some time interval; it represents a series of snapshots. Non-volatile: data is not updated in real time but refreshed from operational systems on a regular basis; new data is added to the existing data rather than replacing it.
  • These are the benefits. Organisations normally have to commit a huge amount of investment and resources to developing a data warehouse, but the potential returns can be very large. The IDC figure quoted is that 90% of companies reported a return of more than 40% over three years (a normal ROI would be 8-15%). Competitive advantage comes from giving management decision makers access to data that was previously unavailable, e.g. for forecasting trends. Productivity increases because the data is integrated from previously incompatible systems.
  • This slide shows the differences between how normal business databases (OLTP systems) and data warehouses work. OLTP systems (e.g. inventory control, invoicing) are designed to maximise transaction processing capacity. Data warehouses are designed to support ad hoc query processing, so they are organised according to the requirements of potential questions and support long-term strategic decision making. OLTP systems often provide the data for data warehouses. However, the data held in OLTP systems can be inconsistent or fragmented, and can contain duplicate or missing entries. It must be cleaned up before it can be used in a data warehouse.
  • This shows the typical architecture of a data warehouse. Data sources: these vary from mainframes to departmental databases to external data, so there are many different sources and data types. Load manager (or front end): performs the extraction and loading of data into the warehouse. Warehouse manager: performs all the operations associated with managing the data in the warehouse, including ensuring consistency of data, maintaining indexes and views, denormalising, aggregating data, backing up and archiving. Query manager (or back end): manages the user queries. Its complexity depends on the flexibility of the end-user access tools. Its operations can include directing queries to tables, scheduling the execution of queries, and generating query profiles to help the warehouse manager manage indexes and views. Detailed data: all the detailed data in the schema, normally stored offline and aggregated into the next level of data. Lightly/highly summarised data: the aggregated data generated by the warehouse manager. It changes on an on-going basis depending on the types of queries, and its purpose is to speed up queries. Meta-data: the description of the data in the warehouse, which changes as the structure of the data in the warehouse changes.
  • There are several types of data in a data warehouse. Detailed data is the actual data pulled in from the various sources. Summarised data provides various views of the detailed data to answer specific queries; summaries are needed because there is such a large amount of data. Because these views can change, meta-data is also needed to describe them. Archive/backup: the data warehouse will always grow, so some of the older data can be archived in a way that still allows it to be included in queries if required.
  • This diagram shows the main flows of data in the warehouse. The next slide explains each of the five flows.
  • This explains the processes associated with each of the information flows. Inflow: clean dirty data, restructure data to suit new requirements, and ensure the source data is consistent with data already in the warehouse. Upflow: summarise data into more convenient views, package data into more useful formats, and distribute data to increase availability and accessibility. Downflow: archive data of limited value, and back up data so it can be restored following a crash. Outflow has two activities: accessing (satisfying individual end-user requests for data from tools) and delivering (delivering information to end users). Metaflow: the process that moves metadata in response to changing needs, i.e. updating the metadata accordingly.
  • These are the typical problems with a data warehouse; most arise from the problems of integrating the source data (see pages 1050-1052, Connolly & Begg). Data loading: 80% of development time is spent on it. Problems with source systems: e.g. fields that allow nulls lead to incomplete data, which needs to be fixed. Required data not captured: OLTP systems may not store the data needed, so they may need to be altered. Increased end-user demands: as users become aware of the capabilities, they need better tools. Homogenisation: this can lessen the value of the data by emphasising similarities over differences. High demand for resources: disk space and a large number of indexes. Data ownership: data becomes accessible to all users. Maintenance: reorganisation of business processes means changes to the DW. Long projects: a warehouse can take 3 years to build; data marts support only one department so may be quicker. Integration: all the tools need to be integrated to ensure the warehouse benefits the organisation.
  • The types of queries we need to be able to perform differ from those in an OLTP system: they are more factual, analytical and temporal. An example is given on the slide; try doing this in a relational system. Normal modelling techniques (the E-R model) are not suitable because the relationships between the data can be too complex. We therefore use dimensionality modelling: a logical design technique that aims to present the data in a standard, intuitive form that allows for high-performance access.
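The inflow step described above (cleaning dirty data and keeping it consistent with what is already loaded) could be sketched roughly as follows. This is only an illustration: the field names `name` and `city`, the `cleanse` function and the sample rows are all hypothetical, and a real load manager would apply far richer rules.

```python
def cleanse(rows):
    """Standardise formats, reject incomplete rows, and drop duplicates."""
    seen = set()
    clean, rejected = [], []
    for row in rows:
        # Standardise inconsistent formats (stray spaces, mixed case).
        name = (row.get("name") or "").strip().title()
        city = (row.get("city") or "").strip().title()
        if not name or not city:
            rejected.append(row)  # missing entry: send back for fixing
            continue
        key = (name, city)
        if key in seen:
            continue  # duplicate of a row already accepted
        seen.add(key)
        clean.append({"name": name, "city": city})
    return clean, rejected


# Hypothetical rows pulled from two source systems.
source_rows = [
    {"name": "ann smith", "city": "glasgow"},
    {"name": "Ann Smith ", "city": "Glasgow"},   # duplicate, different format
    {"name": "Bob Jones", "city": None},         # incomplete
    {"name": "Carol White", "city": "LONDON"},
]
clean_rows, rejected_rows = cleanse(source_rows)
print(clean_rows)
print(rejected_rows)
```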
  • Our dimensional model is based on the E-R model but with some restrictions, to support the types of queries required. A model with these restrictions is called a Star Schema. The use of surrogate keys aids performance in joins.
  • A star schema contains two types of tables, as defined on the slide; an example is given on the next couple of slides. Fact tables contain factual data. Each foreign key can be either intelligent (an identifier derived from the actual data) or unintelligent (a surrogate key with no business meaning). Fact tables are unlikely to change, as facts normally occurred in the past. Dimension tables contain reference data, i.e. the data which supports the facts. Star schemas give increased performance because dimension tables are denormalised, thus minimising the number of joins.
  • This is an E-R model taken from Connolly and Begg for the DreamHome database. Notice that it contains complex relationships between the various objects, which would make it difficult to answer the types of queries required. So we redesign using a star schema, as given on the next slide.
  • This is the star schema version. Now we have one table in the centre (the fact table) which contains all the links to the dimension tables, which contain the reference data. The fact table is just like an M:N relationship in a relational database. Note that there can be more than one fact table in a star schema.
  • Star schemas are denormalised. Snowflake: snowflake schemas contain no denormalised data; the dimension data is normalised. Starflake: a combination of normalised and denormalised data. This is the most appropriate schema to use.
  • There are two tools commonly used for data analysis: OLAP and data mining. OLAP tools are query-centric; their database schemas are array-oriented and multi-dimensional in nature, e.g. for market analysis. OLAP tools work on the concept of multi-dimensional (i.e. more than 2 dimensions) data to support complex analytical applications. This has led to a new type of database, the multi-dimensional database, because there is a need to retrieve large numbers of records from very large data sets and summarise this data on the fly. For example, the dimensions of a database could be property type, city and time. Typical operations which can be performed are as follows. Consolidation: aggregate data (roll up), e.g. branch offices -> city, city -> country. Drill-down: the reverse of consolidation; displays detailed data. Slicing and dicing (aka pivoting): look at data from different viewpoints, normally along a time axis.
  • Codd defined a number of rules for OLAP systems: (1) they must be intuitively analytical and easy to use; (2) they must be transparent to users, who are familiar with particular front-end tools; (3) all data sources must be accessible (network, hierarchical, relational, etc.); (4) performance must remain consistent as the number of dimensions increases; (5) they must operate efficiently in a client-server architecture; (6) there must be no bias towards any one dimension; (7) there may be instances where a large number of nulls are stored, and this must not have any adverse impact on accuracy or speed of access; (8) they must support concurrent users; (9) they must support the typical operations on the last slide, e.g. performing roll-up within and across dimensions; (10) slicing and dicing, drill-down and consolidation must be intuitive, e.g. via a point-and-click interface or drag-and-drop operations on a data cube; (11) they must be able to retrieve any view of the data; (12) the number of dimensions should be unlimited.
  • Types of tools are categorised according to the architecture of the underlying database. There are three main categories. MOLAP: uses specialised data structures and MDDBMSs to organise, navigate and analyse data. Data is typically aggregated to enhance performance, and the tools use array technology and sparse data management. ROLAP: this is the fastest growing technology. It works by providing multi-dimensional views of 2D data; SQL is enhanced to increase performance and support complex operations on multiple dimensions. MQE: this is the newest technology. Data can be delivered either directly from the RDB or from a MOLAP/ROLAP server in the form of a datacube. The datacube is stored and analysed locally, so these tools are simple to install and each user can build a custom datacube.
  • Various end-user tools can be used with data warehouses. Data mining is one such set of tools, used for analysing data to find hidden or unexpected information within a database.
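As a rough illustration of the star schema idea, the sketch below builds one fact table and two dimension tables in an in-memory SQLite database and runs a query in the spirit of the design slide's example (average number of properties per month with a monthly rent above 700, per branch). The table names, column names and rows are invented for this sketch and are not the DreamHome schema from Connolly and Begg; for brevity it also omits the "last six months" filter and ignores months with no qualifying rentals.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: simple integer surrogate keys plus reference data.
CREATE TABLE dim_branch (branch_key INTEGER PRIMARY KEY, city TEXT);
CREATE TABLE dim_time   (time_key INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
-- Fact table: one foreign key per dimension, composite primary key.
CREATE TABLE fact_rental (
    branch_key   INTEGER REFERENCES dim_branch(branch_key),
    time_key     INTEGER REFERENCES dim_time(time_key),
    property_key INTEGER,
    monthly_rent REAL,
    PRIMARY KEY (branch_key, time_key, property_key)
);
""")
conn.executemany("INSERT INTO dim_branch VALUES (?, ?)",
                 [(1, "Glasgow"), (2, "London")])
conn.executemany("INSERT INTO dim_time VALUES (?, ?, ?)",
                 [(1, 2006, 1), (2, 2006, 2)])
conn.executemany("INSERT INTO fact_rental VALUES (?, ?, ?, ?)", [
    (1, 1, 101, 750.0), (1, 1, 102, 800.0), (1, 2, 101, 750.0),
    (2, 1, 201, 650.0), (2, 2, 202, 900.0),
])

# Count qualifying rentals per branch per month, then average per branch.
rows = conn.execute("""
    SELECT city, AVG(n) FROM (
        SELECT b.city AS city, t.time_key AS time_key, COUNT(*) AS n
        FROM fact_rental f
        JOIN dim_branch b ON b.branch_key = f.branch_key
        JOIN dim_time   t ON t.time_key   = f.time_key
        WHERE f.monthly_rent > 700
        GROUP BY b.city, t.time_key
    ) AS per_month
    GROUP BY city
    ORDER BY city
""").fetchall()
print(rows)  # [('Glasgow', 1.5), ('London', 1.0)]
```

Every join goes directly from the fact table to a dimension table, which is the property the notes credit for the star schema's query performance.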
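The operations named above can be sketched on a tiny in-memory cube. This is a toy illustration in plain Python, not how an OLAP server stores data; the dimension names follow the example in the notes (property type, city, time), while the cell values and function names are made up.

```python
from collections import defaultdict

DIMENSIONS = ("property_type", "city", "month")

# Each cell maps one member per dimension to a measure (e.g. rentals).
cube = {
    ("Flat", "Glasgow", "2006-01"): 10,
    ("Flat", "Glasgow", "2006-02"): 12,
    ("House", "Glasgow", "2006-01"): 5,
    ("Flat", "London", "2006-01"): 20,
    ("House", "London", "2006-02"): 8,
}


def roll_up(cells, keep):
    """Consolidation: sum the measure over every dimension not in `keep`."""
    positions = [DIMENSIONS.index(d) for d in keep]
    totals = defaultdict(int)
    for key, value in cells.items():
        totals[tuple(key[i] for i in positions)] += value
    return dict(totals)


def slice_cube(cells, dimension, member):
    """Slice: keep only the cells with one fixed member of a dimension."""
    position = DIMENSIONS.index(dimension)
    return {k: v for k, v in cells.items() if k[position] == member}


by_city = roll_up(cube, ["city"])                  # consolidated view
by_city_month = roll_up(cube, ["city", "month"])   # drill down one level
january = slice_cube(cube, "month", "2006-01")     # slice along time

print(by_city)  # {('Glasgow',): 27, ('London',): 28}
print(by_city_month)
print(january)
```

Drill-down is shown here simply as rolling up to a finer set of dimensions; a hierarchy such as branch office -> city -> country would need an extra mapping between levels.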
  • These are some examples of typical applications
  • Although the four operations can be used independently, many applications work well when several operations are used in combination. There are specific techniques associated with each operation.
  • Predictive modelling uses observations to form a model of the important characteristics of some phenomenon. It can be used to analyse an existing database to determine some essential characteristics of the data set. There are two main techniques. Classification: used to establish a specific predetermined class for each record in a database from a finite set of possible class values, e.g. if a customer has rented for more than 2 years and is more than 25 years old, then they are most likely to buy property. It can use neural induction or tree induction. Value prediction: used to estimate a continuous numeric value that is associated with a database record; it uses statistical techniques, e.g. linear or non-linear regression. An example of classification is given on the next slide.
  • The second operation is database segmentation. It aims to cluster records so that the records in each cluster share a number of properties, i.e. are homogeneous. It uses unsupervised learning to discover sub-populations in the database. There are two types: demographic and neural clustering. A scatterplot example is given on the next slide.
  • Link analysis establishes associations between records. An example is given on the slide. Various techniques look for associations, patterns or similar time sequences. Association discovery: items which imply the presence of other items in the same event. Sequential pattern discovery: the presence of one set of items implies the presence of another set over a period of time (e.g. long-term customer buying behaviour). Similar time sequence discovery: a link between two sets of data that are time dependent, e.g. buying property is followed by buying household goods within 2 months.
  • Deviation detection identifies records where a value is out of the ordinary. It can be done either statistically (e.g. linear regression) or by visualisation (e.g. graphically), as in the example on the next slide. It is good for fraud detection.
  • To finish off on warehousing: if we look at the requirements for a data mining tool and compare them with what a data warehouse provides, we can see that the ideal data source for data mining is a data warehouse. DW data is clean and consistent, which is a prerequisite for data mining. Multiple sources allow as many inter-relationships as possible to be discovered. Query capabilities allow relevant subsets of records and fields to be selected. The ability to go back to the data source allows patterns uncovered by data mining to be investigated further.
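The classification example can be written out as the decision tree it describes. Note that this sketch hand-codes the finished tree (renting for more than 2 years, then age over 25) rather than inducing it from training data, which is what a tree induction tool would do; the function name and return labels are hypothetical.

```python
def predict_customer(years_renting, age):
    """Walk the two-level tree from the classification example."""
    if years_renting > 2:
        if age > 25:
            return "buy property"
        return "rent property"
    return "rent property"


print(predict_customer(years_renting=3, age=30))  # buy property
print(predict_customer(years_renting=3, age=22))  # rent property
print(predict_customer(years_renting=1, age=40))  # rent property
```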
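To make the idea of unsupervised clustering concrete, here is a minimal k-means sketch. K-means is a stand-in chosen for brevity; it is not the demographic or neural clustering named in the notes, and unlike the slide's definition it needs the number of segments up front. The points and starting centroids are invented.

```python
import math


def k_means(points, centroids, iterations=10):
    """Assign each point to its nearest centroid, then recompute centroids."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for point in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(point, centroids[i]))
            clusters[nearest].append(point)
        # New centroid = mean of its cluster; keep the old one if empty.
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return clusters, centroids


# Hypothetical records reduced to two numeric features each.
records = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5),
           (8.0, 8.0), (9.0, 9.0), (8.5, 9.5)]
segments, centres = k_means(records, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(segments)
print(centres)
```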
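Association discovery is usually summarised by two numbers for a rule "if A then B": support (how often A and B occur together across all records) and confidence (how often B occurs when A does). The sketch below computes both over made-up customer records, arranged so that the confidence matches the 40% figure in the slide's example; the item labels and function name are hypothetical.

```python
def rule_stats(records, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent -> consequent."""
    with_antecedent = [r for r in records if antecedent <= r]
    with_both = [r for r in with_antecedent if consequent <= r]
    support = len(with_both) / len(records)
    confidence = len(with_both) / len(with_antecedent) if with_antecedent else 0.0
    return support, confidence


# Ten hypothetical customers, each described by a set of properties.
customers = [
    {"rents>2y", "age>25", "buys"},
    {"rents>2y", "age>25", "buys"},
    {"rents>2y", "age>25"},
    {"rents>2y", "age>25"},
    {"rents>2y", "age>25"},
    {"rents>2y"},
    {"age>25"},
    {"age>25", "buys"},
    set(),
    {"rents>2y"},
]
support, confidence = rule_stats(customers, {"rents>2y", "age>25"}, {"buys"})
print(support, confidence)  # 0.2 0.4
```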
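A simple statistical form of deviation detection is to flag values that lie far from the mean. The sketch below uses a z-score threshold, which is a simpler stand-in for the regression approach mentioned in the notes; the transaction amounts and the threshold of 2 standard deviations are assumptions chosen for illustration.

```python
import statistics


def find_outliers(values, threshold=2.0):
    """Return the values more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    spread = statistics.pstdev(values)
    if spread == 0:
        return []  # all values identical: nothing deviates
    return [v for v in values if abs(v - mean) / spread > threshold]


# Hypothetical card transaction amounts; one is far out of the ordinary.
amounts = [100, 102, 98, 101, 99, 100, 103, 97, 100, 500]
outliers = find_outliers(amounts)
print(outliers)  # [500]
```

A single extreme value inflates both the mean and the standard deviation, so on small samples a more robust measure (e.g. one based on the median) may be a better choice.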

    1. Data Warehousing and Data Mining (May 2006)
    2. Contents
        • Data Warehousing
        • OLAP
        • Data Mining
        • Further Reading
    3. Data Warehousing
        • OLTP (online transaction processing) systems
            • range in size from megabytes to terabytes
            • high transaction throughput
        • Decision makers require access to all data
            • Historical and current
            • 'A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management’s decision-making process' (Inmon 1993)
    4. Benefits
        • Potential high returns on investment
            • 90% of companies in 1996 reported return of investment (over 3 years) of > 40%
        • Competitive advantage
            • Data can reveal previously unknown, unavailable and untapped information
        • Increased productivity of corporate decision-makers
            • Integration allows more substantive, accurate and consistent analysis
    5. Comparison. Source: Connolly and Begg p1153
    6. Typical Architecture [diagram; components: operational data sources (mainframe operational n/w, h/w data; departmental RDBMS data; private data; external data), load manager, warehouse manager, query manager, DBMS holding meta-data, highly summarized data, lightly summarized data and detailed data, archive/backup, and end-user tools (reporting, query, app development and EIS tools; OLAP tools; data-mining tools)]. Source: Connolly and Begg p1157
    7. Data Warehouses
        • Types of Data
            • Detailed
            • Summarised
            • Meta-data
            • Archive/Back-up
    8. Information Flows [diagram; the architecture of slide 6 with operational data sources 1 to n, annotated with five flows: inflow, upflow, downflow, outflow and meta-flow]. Source: Connolly and Begg p1162
    9. Information Flow Processes
        • Five primary information flows
            • Inflow - extraction, cleansing and loading of data from source systems into warehouse
            • Upflow - adding value to data in warehouse through summarizing, packaging and distributing data
            • Downflow - archiving and backing up data in warehouse
            • Outflow - making data available to end users
            • Metaflow - managing the metadata
    10. Problems of Data Warehousing
        • Underestimation of resources for data loading
        • Hidden problems with source systems
        • Required data not captured
        • Increased end-user demands
        • Data homogenization
        • High demand for resources
        • Data ownership
        • High maintenance
        • Long duration projects
        • Complexity of integration
    11. Data Warehouse Design
        • Data must be designed to allow ad-hoc queries to be answered with acceptable performance constraints
        • Queries usually require access to factual data generated by business transactions
            • e.g. find the average number of properties rented out with a monthly rent greater than £700 at each branch office over the last six months
        • Uses Dimensionality Modelling
    12. Dimensionality Modelling
        • Similar to E-R modelling but with constraints
            • composed of one fact table with a composite primary key
            • dimension tables have a simple primary key which corresponds exactly to one foreign key in the fact table
            • uses surrogate keys based on integer values
            • Can efficiently and easily support ad-hoc end-user queries
    13. Star Schemas
        • The most common dimensional model
        • A fact table surrounded by dimension tables
        • Fact tables
            • contains FK for each dimension table
            • large relative to dimension tables
            • read-only
        • Dimension tables
            • reference data
            • query performance speeded up by denormalising into a single dimension table
    14. E-R Model Example [diagram]. Source: Connolly and Begg
    15. Star Schema Example [diagram]. Source: Connolly and Begg
    16. Other Schemas
        • Snowflake schemas
            • variant of star schema
            • each dimension can have its own dimensions
        • Starflake schemas
            • hybrid structure
            • contains mixture of (denormalised) star and (normalised) snowflake schemas
    17. OLAP
        • Online Analytical Processing
            • dynamic synthesis, analysis and consolidation of large volumes of multi-dimensional data
            • normally implemented using specialized multi-dimensional DBMS
                • a method of visualising and manipulating data with many inter-relationships
            • Support common analytical operations such as
                • consolidation
                • drill-down
                • slicing and dicing
    18. Codd’s OLAP Rules
        • 1. Multi-dimensional conceptual view
        • 2. Transparency
        • 3. Accessibility
        • 4. Consistent reporting performance
        • 5. Client-server architecture
        • 6. Generic dimensionality
        • 7. Dynamic sparse matrix handling
        • 8. Multi-user support
        • 9. Unrestricted cross-dimensional operations
        • 10. Intuitive data manipulation
        • 11. Flexible reporting
        • 12. Unlimited dimensions and aggregation levels
    19. OLAP Tools
        • Categorised according to architecture of underlying database
            • Multi-dimensional OLAP
                • data typically aggregated and stored according to predicted usage
                • use array technology
            • Relational OLAP
                • use of relational meta-data layer with enhanced SQL
            • Managed Query Environment
                • deliver data direct from DBMS or MOLAP server to desktop in form of a datacube
    20. MOLAP [diagram; components: RDB server, MOLAP server; flows: Load, Request, Result; layers: Presentation Layer, Database/Application Logic Layer]
    21. ROLAP [diagram; components: RDB server, ROLAP server; flows: Request, Result, SQL, Result; layers: Presentation Layer, Application Logic Layer, Database Layer]
    22. MQE [diagram; components: RDB server, MOLAP server, end-user tools; flows: Load, Request, Result, SQL, Result]
    23. Data Mining
        • ‘The process of extracting valid, previously unknown, comprehensible and actionable information from large databases and using it to make crucial business decisions’ (Simoudis, 1996)
            • focus is to reveal information which is hidden or unexpected
            • patterns and relationships are identified by examining the underlying rules and features of the data
            • work from data up
            • require large volumes of data
    24. Example Data Mining Applications
        • Retail/Marketing
            • Identifying buying patterns of customers
            • Finding associations among customer demographic characteristics
            • Predicting response to mailing campaigns
            • Market basket analysis
    25. Example Data Mining Applications
        • Banking
            • Detecting patterns of fraudulent credit card use
            • Identifying loyal customers
            • Predicting customers likely to change their credit card affiliation
            • Determining credit card spending by customer groups
    26. Data Mining Techniques
        • Four main techniques
            • Predictive Modelling
            • Database Segmentation
            • Link Analysis
            • Deviation Detection
    27. Data Mining Techniques
        • Predictive Modelling
            • using observations to form a model of the important characteristics of some phenomenon
        • Techniques:
            • Classification
            • Value Prediction
    28. Classification Example: Tree Induction [decision tree: Customer renting property > 2 years? No: rent property. Yes: Customer age > 25 years? No: rent property. Yes: buy property.] Source: Connolly and Begg
    29. Data Mining Techniques
        • Database Segmentation:
            • to partition a database into an unknown number of segments (or clusters) of records which share a number of properties
        • Techniques:
            • Demographic clustering
            • Neural clustering
    30. Segmentation: Scatterplot Example [diagram]. Source: Connolly and Begg
    31. Data Mining Techniques
        • Link Analysis
            • establish associations between individual records (or sets of records) in a database
                • e.g. ‘when a customer rents property for more than two years and is more than 25 years old, then in 40% of cases, the customer will buy the property’
            • Techniques
                • Association discovery
                • Sequential pattern discovery
                • Similar time sequence discovery
    32. Data Mining Techniques
        • Deviation Detection
            • identify ‘outliers’, something which deviates from some known expectation or norm
            • Statistics
            • Visualisation
    33. Deviation Detection: Visualisation Example [diagram]. Source: Connolly and Begg
    34. Mining and Warehousing
        • Data mining needs single, separate, clean, integrated, self-consistent data source
        • Data warehouse well equipped:
            • populated with clean, consistent data
            • contains multiple sources
            • utilises query capabilities
            • capability to go back to data source
    35. Further Reading
        • Connolly and Begg, chapters 31 to 34.
        • W H Inmon, Building the Data Warehouse, New York, Wiley and Sons, 1993.
        • P Beynon-Davies, Database Systems (2nd ed), Macmillan Press, 2000, ch 34, 35 & 36.