OLAP and Data Mining


  1. 1. Chapter 32 OLAP and Data Mining Transparencies
  2. 2. Chapter 32 - Objectives <ul><li>Purpose of online analytical processing (OLAP) and how OLAP differs from data warehousing. </li></ul><ul><li>Key features of OLAP applications. </li></ul><ul><li>Potential benefits associated with successful OLAP applications. </li></ul><ul><li>Rules for OLAP tools and main types of tools including: multi-dimensional OLAP (MOLAP), relational OLAP (ROLAP), and managed query environment (MQE). </li></ul>
  3. 3. Chapter 32 - Objectives <ul><li>OLAP extensions to SQL. </li></ul><ul><li>Concepts associated with data mining. </li></ul><ul><li>Main data mining operations including predictive modeling, database segmentation, link analysis, and deviation detection. </li></ul><ul><li>Relationship between data mining and data warehousing. </li></ul>
  4. 4. Data Warehousing and End-User Access Tools <ul><li>Accompanying the growth in data warehouses is an increasing demand for more powerful access tools that provide advanced analytical capabilities. </li></ul><ul><li>Key developments include: </li></ul><ul><ul><li>Online analytical processing (OLAP). </li></ul></ul><ul><ul><li>SQL extensions for complex data analysis. </li></ul></ul><ul><ul><li>Data mining tools. </li></ul></ul>
  5. 5. Introducing OLAP <ul><li>The dynamic synthesis, analysis, and consolidation of large volumes of multi-dimensional data, Codd (1993). </li></ul><ul><li>Describes a technology that uses a multi-dimensional view of aggregate data to provide quick access to strategic information for purposes of advanced analysis. </li></ul>
  6. 6. Introducing OLAP <ul><li>Enables users to gain a deeper understanding and knowledge about various aspects of their corporate data through fast, consistent, interactive access to a wide variety of possible views of the data. </li></ul><ul><li>Allows users to view corporate data in such a way that it is a better model of the true dimensionality of the enterprise. </li></ul>
  7. 7. Introducing OLAP <ul><li>Can easily answer ‘who?’ and ‘what?’ questions; however, it is the ability to answer ‘what if?’ and ‘why?’ questions that distinguishes OLAP from general-purpose query tools. </li></ul><ul><li>Types of analysis range from basic navigation and browsing (slicing and dicing), to calculations, to more complex analyses such as time series and complex modeling. </li></ul>
  8. 8. OLAP Benchmarks <ul><li>OLAP Council published an analytical processing benchmark referred to as the APB-1 (OLAP Council, 1998). </li></ul><ul><li>Aim is to measure a server’s overall OLAP performance rather than the performance of individual tasks. </li></ul>
  9. 9. OLAP Benchmarks <ul><li>APB-1 assesses most common business operations including: </li></ul><ul><ul><li>bulk loading of data from internal/external data sources; </li></ul></ul><ul><ul><li>incremental loading of data from operational systems; </li></ul></ul><ul><ul><li>aggregation of input/level data along hierarchies; </li></ul></ul><ul><ul><li>calculation of new data based on business models; </li></ul></ul><ul><ul><li>time series analysis; </li></ul></ul><ul><ul><li>queries with a high degree of complexity; </li></ul></ul><ul><ul><li>drill-down through hierarchies; </li></ul></ul><ul><ul><li>ad hoc queries; </li></ul></ul><ul><ul><li>multiple online sessions. </li></ul></ul>
  10. 10. OLAP Benchmarks <ul><li>OLAP applications are judged on their ability to provide just-in-time (JIT) information, a core requirement of supporting effective decision-making. </li></ul><ul><li>Assessing a server’s ability to satisfy this requirement involves more than measuring processing performance; it also covers the server’s ability to model complex business relationships and to respond to changing business requirements. </li></ul>
  11. 11. OLAP Benchmarks <ul><li>APB-1 uses a standard benchmark metric called AQM (Analytical Queries per Minute). </li></ul><ul><li>AQM represents the number of analytical queries processed per minute, including data loading and computation time. Thus, AQM incorporates data loading performance, calculation performance, and query performance into a single metric. </li></ul>
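The AQM definition above can be illustrated with a small calculation. This is only a sketch, assuming AQM is the query count divided by the total elapsed minutes for loading, computation, and querying; the function name and the sample figures are hypothetical and this is not the official APB-1 procedure.

```python
def analytical_queries_per_minute(num_queries, load_seconds, compute_seconds, query_seconds):
    """Queries completed divided by total elapsed minutes (load + compute + query)."""
    total_minutes = (load_seconds + compute_seconds + query_seconds) / 60.0
    if total_minutes <= 0:
        raise ValueError("total elapsed time must be positive")
    return num_queries / total_minutes

# Hypothetical run: 1,000 queries, 10 minutes loading, 5 minutes computing, 5 minutes querying.
aqm = analytical_queries_per_minute(1000, 600, 300, 300)
print(aqm)
```

Because loading and computation time sit in the denominator, a server that answers queries quickly but loads data slowly still scores poorly under this reading of the metric.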
  12. 12. OLAP Benchmarks <ul><li>Publication of APB-1 benchmark results must include both the database schema and all code required for executing the benchmark. </li></ul><ul><li>An essential requirement of all OLAP applications is ability to provide users with JIT information, to make effective decisions about an organization’s strategic directions. </li></ul>
  13. 13. OLAP Applications <ul><li>JIT information is computed data that usually reflects complex relationships and is often calculated on the fly. </li></ul><ul><li>Also, as data relationships may not be known in advance, the data model must be flexible. </li></ul>
  14. 14. Examples of OLAP Applications in Various Functional Areas
  15. 15. OLAP Applications <ul><li>Although OLAP applications are found in widely divergent functional areas, all have following key features: </li></ul><ul><ul><li>multi-dimensional views of data; </li></ul></ul><ul><ul><li>support for complex calculations; </li></ul></ul><ul><ul><li>time intelligence. </li></ul></ul>
  16. 16. OLAP Applications - Multi-Dimensional Views of Data <ul><li>Core requirement of building a ‘realistic’ business model. </li></ul><ul><li>Provides basis for analytical processing through flexible access to corporate data. </li></ul><ul><li>The underlying database design that provides the multi-dimensional view of data should treat all dimensions equally. </li></ul>
  17. 17. OLAP Applications - Support for Complex Calculations <ul><li>Must provide a range of powerful computational methods such as those required by sales forecasting, which uses trend algorithms such as moving averages and percentage growth. </li></ul><ul><li>Mechanisms for implementing computational methods should be clear and non-procedural. </li></ul>
  18. 18. OLAP Applications – Time Intelligence <ul><li>Key feature of almost any analytical application as performance is almost always judged over time. </li></ul><ul><li>Time hierarchy is not always used in same manner as other hierarchies. </li></ul><ul><li>Concepts such as year-to-date and period-over-period comparisons should be easily defined. </li></ul>
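As a rough illustration of the time-intelligence concepts named above, the sketch below computes year-to-date totals and a period-over-period comparison. The quarterly figures and the function names are hypothetical; a real OLAP tool would define these declaratively rather than in procedural code.

```python
from itertools import accumulate

# Hypothetical quarterly sales for two consecutive years (illustrative values only).
sales_1996 = [100, 120, 90, 110]
sales_1997 = [110, 150, 90, 121]

def year_to_date(values):
    """Running total from the start of the year up to each period."""
    return list(accumulate(values))

def period_over_period(current, previous):
    """Percentage change of each period against the same period one year earlier."""
    return [round(100.0 * (c - p) / p, 1) for c, p in zip(current, previous)]

ytd_1997 = year_to_date(sales_1997)
growth = period_over_period(sales_1997, sales_1996)
print(ytd_1997)
print(growth)
```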
  19. 19. OLAP Benefits <ul><li>Increased productivity of end-users. </li></ul><ul><li>Reduced backlog of applications development for IT staff. </li></ul><ul><li>Retention of organizational control over the integrity of corporate data. </li></ul><ul><li>Reduced query drag and network traffic on OLTP systems or on the data warehouse. </li></ul><ul><li>Improved potential revenue and profitability. </li></ul>
  20. 20. Representing Multi-Dimensional Data <ul><li>Example of two-dimensional query. </li></ul><ul><ul><li>‘What is the total revenue generated by property sales in each city, in each quarter of 1997?’ </li></ul></ul><ul><li>Choice of representation is based on types of queries end-user may ask. </li></ul><ul><li>Compare representation - three-field relational table versus two-dimensional matrix. </li></ul>
  21. 21. Multi-Dimensional Data as Three-Field Table versus Two-Dimensional Matrix
  22. 22. Representing Multi-Dimensional Data <ul><li>Example of three-dimensional query. </li></ul><ul><ul><li>‘What is the total revenue generated by property sales for each type of property (Flat or House) in each city, in each quarter of 1997?’ </li></ul></ul><ul><li>Compare representation - four-field relational table versus three-dimensional cube. </li></ul>
  23. 23. Multi-Dimensional Data as Four-Field Table versus Three-Dimensional Cube
  24. 24. Representing Multi-Dimensional Data <ul><li>Cube represents data as cells in an array. </li></ul><ul><li>Relational table only represents multi-dimensional data in two dimensions. </li></ul>
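The contrast between a four-field relational table and a three-dimensional cube can be sketched as follows. The revenue figures, city names, and the `collapse` helper are made up for illustration; the point is only that a cube cell is addressed by one coordinate per dimension, and that a lower-dimensional matrix falls out by summing a dimension away.

```python
from collections import defaultdict

# Four-field relational representation: (property type, city, quarter, revenue).
# All figures are hypothetical.
rows = [
    ("Flat", "Glasgow", "Q1", 15056),
    ("House", "Glasgow", "Q1", 14670),
    ("Flat", "London", "Q1", 12879),
    ("House", "London", "Q1", 14003),
    ("Flat", "Glasgow", "Q2", 14670),
    ("House", "Glasgow", "Q2", 15890),
]

# Cube representation: each cell is addressed by one coordinate per dimension.
cube = defaultdict(int)
for prop_type, city, quarter, revenue in rows:
    cube[(prop_type, city, quarter)] += revenue

def collapse(cells, keep):
    """Sum out every dimension whose index is not listed in `keep`."""
    out = defaultdict(int)
    for coords, value in cells.items():
        out[tuple(coords[i] for i in keep)] += value
    return dict(out)

# Two-dimensional matrix of revenue by city and quarter (property type summed out).
city_by_quarter = collapse(cube, keep=(1, 2))
print(city_by_quarter)
```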
  25. 25. Multi-Dimensional OLAP Servers <ul><li>Use multi-dimensional structures to store data and relationships between data. </li></ul><ul><li>Multi-dimensional structures are best visualized as cubes of data, and cubes within cubes of data. Each side of cube is a dimension. </li></ul><ul><li>A cube can be expanded to include other dimensions. </li></ul>
  26. 26. Multi-Dimensional OLAP Servers <ul><li>A cube supports matrix arithmetic. </li></ul><ul><li>Multi-dimensional query response time depends on how many cells have to be added ‘on the fly’. </li></ul><ul><li>As number of dimensions increases, number of the cube’s cells increases exponentially. </li></ul>
  27. 27. Multi-Dimensional OLAP Servers <ul><li>However, majority of multi-dimensional queries use summarized, high-level data. </li></ul><ul><li>Solution is to pre-aggregate (consolidate) all logical subtotals and totals along all dimensions. </li></ul><ul><li>Pre-aggregation is valuable, as typical dimensions are hierarchical in nature. </li></ul><ul><ul><li>(e.g. Time dimension hierarchy - years, quarters, months, weeks, and days) </li></ul></ul>
  28. 28. Multi-Dimensional OLAP Servers <ul><li>Predefined hierarchy allows logical pre-aggregation and, conversely, allows for a logical ‘drill-down’. </li></ul><ul><li>Supports common analytical operations </li></ul><ul><ul><li>Consolidation. </li></ul></ul><ul><ul><li>Drill-down. </li></ul></ul><ul><ul><li>Slicing and dicing. </li></ul></ul>
  29. 29. Multi-Dimensional OLAP Servers <ul><li>Consolidation - aggregation of data such as simple ‘roll-ups’ or complex expressions involving inter-related data. </li></ul><ul><li>Drill-Down - is reverse of consolidation and involves displaying the detailed data that comprises the consolidated data. </li></ul><ul><li>Slicing and Dicing - (also called pivoting) refers to the ability to look at the data from different viewpoints. </li></ul>
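The three analytical operations can be sketched over a tiny two-dimensional data set with a month-to-quarter hierarchy. The data, the hierarchy mapping, and the function names are hypothetical, and a real multi-dimensional server would hold the consolidated values pre-computed rather than deriving them per call.

```python
from collections import defaultdict

# Hypothetical monthly revenue per city; month -> quarter is a predefined hierarchy.
month_to_quarter = {"Jan": "Q1", "Feb": "Q1", "Mar": "Q1", "Apr": "Q2"}
monthly = {
    ("Glasgow", "Jan"): 10, ("Glasgow", "Feb"): 20, ("Glasgow", "Mar"): 30,
    ("Glasgow", "Apr"): 40, ("London", "Jan"): 5, ("London", "Apr"): 15,
}

def consolidate(cells):
    """Consolidation: roll monthly cells up to quarterly totals."""
    totals = defaultdict(int)
    for (city, month), revenue in cells.items():
        totals[(city, month_to_quarter[month])] += revenue
    return dict(totals)

def drill_down(cells, city, quarter):
    """Drill-down: the monthly cells that make up one (city, quarter) total."""
    return {month: revenue for (c, month), revenue in cells.items()
            if c == city and month_to_quarter[month] == quarter}

def slice_by_city(cells, city):
    """Slicing: fix one member of the city dimension and keep the time dimension."""
    return {month: revenue for (c, month), revenue in cells.items() if c == city}

quarterly = consolidate(monthly)
print(quarterly)
print(drill_down(monthly, "Glasgow", "Q1"))
print(slice_by_city(monthly, "London"))
```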
  30. 30. Multi-Dimensional OLAP servers <ul><li>Can store data in a compressed form by dynamically selecting physical storage organizations and compression techniques that maximize space utilization. </li></ul><ul><li>Dense data (i.e., data that exists for high percentage of cells) can be stored separately from sparse data (i.e., significant percentage of cells are empty). </li></ul>
  31. 31. Multi-Dimensional OLAP Servers <ul><li>Ability to omit empty or repetitive cells can greatly reduce the size of the cube and the amount of processing. </li></ul><ul><li>Allows analysis of exceptionally large amounts of data. </li></ul>
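The saving from omitting empty cells can be illustrated by comparing a dense layout with a sparse one. This is a simplified sketch with a hypothetical 3 x 4 x 5 cube; actual MOLAP servers use their own storage organizations and compression techniques, which are not shown here.

```python
# Hypothetical cube with 3 x 4 x 5 = 60 possible cells, only three of which hold data.
shape = (3, 4, 5)
populated = {(0, 1, 2): 150, (2, 3, 4): 90, (1, 0, 0): 30}

# Dense layout: one slot per possible cell, whether empty or not.
dense = [[[None for _ in range(shape[2])] for _ in range(shape[1])] for _ in range(shape[0])]
for (i, j, k), value in populated.items():
    dense[i][j][k] = value

# Sparse layout: store only the populated coordinates.
sparse = dict(populated)

def read_cell(coords):
    """Look up a cell in the sparse layout; missing cells read as None."""
    return sparse.get(coords)

dense_slots = shape[0] * shape[1] * shape[2]
sparse_slots = len(sparse)
print(dense_slots, sparse_slots)
```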
  32. 32. Multi-Dimensional OLAP Servers <ul><li>In summary, pre-aggregation, dimensional hierarchy, and sparse data management can significantly reduce the size of the cube and the need to calculate values ‘on-the-fly’. </li></ul><ul><li>Removes need for multi-table joins and provides quick and direct access to arrays of data, thus significantly speeding up execution of multi-dimensional queries. </li></ul>
  33. 33. Codd’s Rules for OLAP Systems <ul><li>In 1993, E.F. Codd formulated twelve rules as the basis for selecting OLAP tools. </li></ul><ul><ul><li>Multi-dimensional conceptual view </li></ul></ul><ul><ul><li>Transparency </li></ul></ul><ul><ul><li>Accessibility </li></ul></ul><ul><ul><li>Consistent reporting performance </li></ul></ul><ul><ul><li>Client-server architecture </li></ul></ul><ul><ul><li>Generic dimensionality </li></ul></ul>
  34. 34. Codd’s Rules for OLAP <ul><ul><li>Dynamic sparse matrix handling </li></ul></ul><ul><ul><li>Multi-user support </li></ul></ul><ul><ul><li>Unrestricted cross-dimensional operations </li></ul></ul><ul><ul><li>Intuitive data manipulation </li></ul></ul><ul><ul><li>Flexible reporting </li></ul></ul><ul><ul><li>Unlimited dimensions and aggregation levels. </li></ul></ul>
  35. 35. Codd’s Rules for OLAP Systems <ul><li>There are proposals to re-define or extend the rules. For example, to also include: </li></ul><ul><ul><li>Comprehensive database management tools. </li></ul></ul><ul><ul><li>Ability to drill down to detail (source record) level. </li></ul></ul><ul><ul><li>Incremental database refresh. </li></ul></ul><ul><ul><li>SQL interface to the existing enterprise environment. </li></ul></ul>
  36. 36. Categories of OLAP Tools <ul><li>OLAP tools are categorized according to the architecture of the underlying database. </li></ul><ul><li>Three main categories of OLAP tools include </li></ul><ul><ul><li>Multi-dimensional OLAP (MOLAP or MD-OLAP) </li></ul></ul><ul><ul><li>Relational OLAP (ROLAP), also called multi-relational OLAP </li></ul></ul><ul><ul><li>Managed query environment (MQE) </li></ul></ul>
  37. 37. Multi-Dimensional OLAP (MOLAP) <ul><li>Uses specialized data structures and multi-dimensional Database Management Systems (MDDBMSs) to organize, navigate, and analyze data. </li></ul><ul><li>Data is typically aggregated and stored according to predicted usage to enhance query performance. </li></ul>
  38. 38. Multi-Dimensional OLAP (MOLAP) <ul><li>Use array technology and efficient storage techniques that minimize the disk space requirements through sparse data management. </li></ul><ul><li>Provides excellent performance when data is used as designed, and the focus is on data for a specific decision-support application. </li></ul>
  39. 39. Multi-Dimensional OLAP (MOLAP) <ul><li>Traditionally, require a tight coupling with the application layer and presentation layer. </li></ul><ul><li>Recent trends segregate the OLAP from the data structures through the use of published application programming interfaces (APIs). </li></ul>
  40. 40. Typical Architecture for MOLAP Tools
  41. 41. MOLAP Tools - Development Issues <ul><li>Underlying data structures are limited in their ability to support multiple subject areas and to provide access to detailed data. </li></ul><ul><li>Navigation and analysis of data is limited because the data is designed according to previously determined requirements. </li></ul>
  42. 42. MOLAP Tools - Development Issues <ul><li>MOLAP products require a different set of skills and tools to build and maintain the database, thus increasing the cost and complexity of support. </li></ul>
  43. 43. Relational OLAP (ROLAP) <ul><li>Fastest growing style of OLAP technology. </li></ul><ul><li>Supports RDBMS products using a metadata layer - avoids need to create a static multi-dimensional data structure - facilitates the creation of multiple multi-dimensional views of the two-dimensional relation. </li></ul>
  44. 44. Relational OLAP (ROLAP) <ul><li>To improve performance, some products use SQL engines to support complexity of multi-dimensional analysis, while others recommend, or require, the use of highly denormalized database designs such as the star schema. </li></ul>
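A star schema and the kind of multi-dimensional question it answers can be sketched with an in-memory SQLite database. The table names, column names, and figures below are hypothetical, and SQLite merely stands in for whichever RDBMS a ROLAP product would sit on.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# One fact table surrounded by denormalized dimension tables (a minimal star schema).
conn.executescript("""
CREATE TABLE DimCity (cityId INTEGER PRIMARY KEY, city TEXT);
CREATE TABLE DimTime (timeId INTEGER PRIMARY KEY, year INTEGER, quarter TEXT);
CREATE TABLE FactSales (cityId INTEGER, timeId INTEGER, revenue INTEGER);
INSERT INTO DimCity VALUES (1, 'Glasgow'), (2, 'London');
INSERT INTO DimTime VALUES (1, 1997, 'Q1'), (2, 1997, 'Q2');
INSERT INTO FactSales VALUES (1, 1, 100), (1, 1, 50), (1, 2, 70), (2, 1, 30), (2, 2, 90);
""")

# Total revenue in each city, in each quarter of 1997.
result = conn.execute("""
SELECT c.city, t.quarter, SUM(f.revenue)
FROM FactSales f
JOIN DimCity c ON c.cityId = f.cityId
JOIN DimTime t ON t.timeId = f.timeId
WHERE t.year = 1997
GROUP BY c.city, t.quarter
ORDER BY c.city, t.quarter
""").fetchall()
conn.close()
print(result)
```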
  45. 45. Typical Architecture for ROLAP Tools
  46. 46. ROLAP Tools - Development Issues <ul><li>Middleware to facilitate the development of multi-dimensional applications. (Software that converts the two-dimensional relation into a multi-dimensional structure). </li></ul><ul><li>Development of an option to create persistent, multi-dimensional structures with facilities to assist in the administration of these structures. </li></ul>
  47. 47. Hybrid OLAP (HOLAP) <ul><li>Can use data from either a RDBMS directly or a multi-dimension server. </li></ul>
  48. 48. Managed Query Environment (MQE) <ul><li>Relatively new development. </li></ul><ul><li>Provide limited analysis capability, either directly against RDBMS products, or by using an intermediate MOLAP server. </li></ul>
  49. 49. Managed Query Environment (MQE) <ul><li>Deliver selected data directly from DBMS or via a MOLAP server to desktop (or local server) in form of a datacube, where it is stored, analyzed, and maintained locally. </li></ul><ul><li>Promoted as being relatively simple to install and administer with reduced cost and maintenance. </li></ul>
  50. 50. Typical Architecture for MQE Tools
  51. 51. MQE Tools - Development Issues <ul><li>Architecture results in significant data redundancy and may cause problems for networks that support many users. </li></ul><ul><li>Ability of each user to build a custom datacube may cause a lack of data consistency among users. </li></ul><ul><li>Only a limited amount of data can be efficiently maintained. </li></ul>
  52. 52. OLAP Extensions to SQL <ul><li>SQL promoted as easy to learn, non-procedural, free-format, DBMS-independent, and international standard. </li></ul><ul><li>However, major disadvantage has been inability to represent many of the questions most commonly asked by business analysts. </li></ul><ul><li>IBM and Oracle jointly proposed OLAP extensions to SQL early in 1999, adopted as an amendment to SQL. </li></ul>
  53. 53. OLAP Extensions to SQL <ul><li>Many database vendors including IBM, Oracle, Informix, and Red Brick Systems have already implemented portions of specifications in their DBMSs. </li></ul><ul><li>Red Brick Systems was first to implement many essential OLAP functions (as Red Brick Intelligent SQL (RISQL)) , albeit in advance of the standard. </li></ul>
  54. 54. OLAP Extensions to SQL - RISQL <ul><li>Designed for business analysts. </li></ul><ul><li>Set of extensions that augments SQL with a variety of powerful operations appropriate to data analysis and decision-support applications such as ranking, moving averages, comparisons, market share, this year versus last year. </li></ul>
  55. 55. Use of the RISQL CUME Function <ul><li>Show the quarterly sales for branch office B003, along with the year-to-date figures. </li></ul><ul><ul><li>SELECT quarter, quarterlySales, CUME(quarterlySales) AS "Year-to-Date" </li></ul></ul><ul><ul><li>FROM BranchSales </li></ul></ul><ul><ul><li>WHERE branchNo = 'B003'; </li></ul></ul>
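What a cumulative function such as CUME returns can be sketched as a running total. This is plain Python rather than RISQL, and the quarterly figures are hypothetical.

```python
from itertools import accumulate

# Hypothetical quarterly sales for one branch.
quarterly_sales = [("Q1", 200), ("Q2", 250), ("Q3", 180), ("Q4", 300)]

def cume(values):
    """Running total: each entry is the sum of all values up to and including it."""
    return list(accumulate(values))

year_to_date = cume([sales for _, sales in quarterly_sales])
for (quarter, sales), ytd in zip(quarterly_sales, year_to_date):
    print(quarter, sales, ytd)
```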
  56. 56. Use of the RISQL MOVINGAVG / MOVINGSUM Function <ul><li>Show the first six monthly sales for branch office B003 without the effect of seasonality. </li></ul><ul><li>SELECT month, monthlySales, </li></ul><ul><ul><li>MOVINGAVG(monthlySales) AS "3-MonthMovingAvg", </li></ul></ul><ul><ul><li>MOVINGSUM(monthlySales) AS "3-MonthMovingSum" </li></ul></ul><ul><ul><li>FROM BranchSales </li></ul></ul><ul><ul><li>WHERE branchNo = 'B003'; </li></ul></ul>
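A three-month moving average and moving sum can be sketched in plain Python as below. The monthly figures are hypothetical, and the choice to return `None` until a full three-month window is available is an assumption of this sketch, not a statement about how RISQL treats the leading rows.

```python
def moving_window(values, size, combine):
    """Apply `combine` to each trailing window of `size` values; None until the window is full."""
    out = []
    for i in range(len(values)):
        if i + 1 < size:
            out.append(None)
        else:
            out.append(combine(values[i + 1 - size:i + 1]))
    return out

# Hypothetical sales for the first six months of the year.
monthly_sales = [30, 60, 90, 60, 30, 90]
moving_sum = moving_window(monthly_sales, 3, sum)
moving_avg = moving_window(monthly_sales, 3, lambda window: sum(window) / len(window))
print(moving_sum)
print(moving_avg)
```

Averaging over a trailing window smooths out month-to-month swings, which is why the slide describes the result as sales without the effect of seasonality.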
  57. 57. Data Mining <ul><li>The process of extracting valid, previously unknown, comprehensible, and actionable information from large databases and using it to make crucial business decisions (Simoudis, 1996). </li></ul><ul><li>Involves analysis of data and use of software techniques for finding hidden and unexpected patterns and relationships in sets of data. </li></ul>
  58. 58. Data Mining <ul><li>Reveals information that is hidden and unexpected, as there is little value in finding patterns and relationships that are already intuitive. </li></ul><ul><li>Patterns and relationships are identified by examining the underlying rules and features in the data. </li></ul><ul><li>Tends to work from the data up, and the most accurate results normally require large volumes of data. </li></ul>
  59. 59. Data Mining <ul><li>Starts by developing an optimal representation of structure of sample data, during which time knowledge is acquired and extended to larger sets of data. </li></ul><ul><li>Data mining can provide huge paybacks for companies that have made a significant investment in data warehousing. </li></ul><ul><li>Relatively new technology; however, it is already used in a number of industries. </li></ul>
  60. 60. Examples of Applications of Data Mining <ul><li>Retail / Marketing </li></ul><ul><ul><li>Identifying buying patterns of customers. </li></ul></ul><ul><ul><li>Finding associations among customer demographic characteristics. </li></ul></ul><ul><ul><li>Predicting response to mailing campaigns. </li></ul></ul><ul><ul><li>Market basket analysis. </li></ul></ul>
  61. 61. Examples of Applications of Data Mining <ul><li>Banking </li></ul><ul><ul><li>Detecting patterns of fraudulent credit card use. </li></ul></ul><ul><ul><li>Identifying loyal customers. </li></ul></ul><ul><ul><li>Predicting customers likely to change their credit card affiliation. </li></ul></ul><ul><ul><li>Determining credit card spending by customer groups. </li></ul></ul>
  62. 62. Examples of Applications of Data Mining <ul><li>Insurance </li></ul><ul><ul><li>Claims analysis. </li></ul></ul><ul><ul><li>Predicting which customers will buy new policies. </li></ul></ul><ul><li>Medicine </li></ul><ul><ul><li>Characterizing patient behavior to predict surgery visits. </li></ul></ul><ul><ul><li>Identifying successful medical therapies for different illnesses. </li></ul></ul>
  63. 63. Data Mining Operations <ul><li>Four main operations include: </li></ul><ul><ul><li>Predictive modeling. </li></ul></ul><ul><ul><li>Database segmentation. </li></ul></ul><ul><ul><li>Link analysis. </li></ul></ul><ul><ul><li>Deviation detection. </li></ul></ul><ul><li>There are recognized associations between the applications and the corresponding operations. </li></ul><ul><ul><li>e.g. Direct marketing strategies use database segmentation. </li></ul></ul>
  64. 64. Data Mining Techniques <ul><li>Techniques are specific implementations of the data mining operations. </li></ul><ul><li>Each technique has its own strengths and weaknesses. </li></ul><ul><li>Data mining tools sometimes offer a choice of techniques to implement an operation. </li></ul>
  65. 65. Data Mining Techniques <ul><li>Criteria for selection of a tool include: </li></ul><ul><ul><li>Suitability for certain input data types. </li></ul></ul><ul><ul><li>Transparency of the mining output. </li></ul></ul><ul><ul><li>Tolerance of missing variable values. </li></ul></ul><ul><ul><li>Level of accuracy possible. </li></ul></ul><ul><ul><li>Ability to handle large volumes of data. </li></ul></ul>
  66. 66. Data Mining Operations and Associated Techniques
  67. 67. Predictive Modeling <ul><li>Similar to the human learning experience </li></ul><ul><ul><li>uses observations to form a model of the important characteristics of some phenomenon. </li></ul></ul><ul><li>Uses generalizations of ‘real world’ and ability to fit new data into a general framework. </li></ul><ul><li>Can analyze a database to determine essential characteristics (model) about the data set. </li></ul>
  68. 68. Predictive Modeling <ul><li>Model is developed using a supervised learning approach, which has two phases: training and testing. </li></ul><ul><ul><li>Training builds a model using a large sample of historical data called a training set. </li></ul></ul><ul><ul><li>Testing involves trying out the model on new, previously unseen data to determine its accuracy and physical performance characteristics. </li></ul></ul>
  69. 69. Predictive Modeling <ul><li>Applications of predictive modeling include customer retention management, credit approval, cross selling, and direct marketing. </li></ul><ul><li>Two techniques associated with predictive modeling: classification and value prediction, distinguished by nature of the variable being predicted. </li></ul>
  70. 70. Predictive Modeling - Classification <ul><li>Used to establish a specific predetermined class for each record in a database from a finite set of possible class values. </li></ul><ul><li>Two specializations of classification: tree induction and neural induction. </li></ul>
  71. 71. Example of Classification using Tree Induction
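The training and testing phases of supervised learning can be sketched with the simplest possible tree: a single split on one attribute. This decision stump is a deliberately reduced stand-in for tree induction, and the attribute (years renting), the labels, and all data values are hypothetical.

```python
def train_stump(samples):
    """Training: pick the threshold t that minimizes errors for the rule 'value > t'."""
    best_t, best_errors = None, None
    for t in sorted({value for value, _ in samples}):
        errors = sum((value > t) != label for value, label in samples)
        if best_errors is None or errors < best_errors:
            best_t, best_errors = t, errors
    return best_t

def accuracy(threshold, samples):
    """Testing: fraction of previously unseen samples the rule classifies correctly."""
    correct = sum((value > threshold) == label for value, label in samples)
    return correct / len(samples)

# Hypothetical records: (years renting, did the customer buy a property?)
training_set = [(0.5, False), (1.0, False), (1.5, False), (2.5, True), (3.0, True), (4.0, True)]
test_set = [(0.8, False), (2.2, True), (3.5, True), (1.8, False)]

threshold = train_stump(training_set)
test_accuracy = accuracy(threshold, test_set)
print(threshold, test_accuracy)
```

A full tree-induction algorithm would repeat this split selection recursively on each resulting subset and over several attributes; the stump shows only one level of that process.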
  72. 72. Example of Classification using Neural Induction
  73. 73. Predictive Modeling - Value Prediction <ul><li>Used to estimate a continuous numeric value that is associated with a database record. </li></ul><ul><li>Uses the traditional statistical techniques of linear regression and nonlinear regression. </li></ul><ul><li>Relatively easy to use and understand. </li></ul>
  74. 74. Predictive Modeling - Value Prediction <ul><li>Linear regression attempts to fit a straight line through a plot of the data, such that the line is the best representation of the average of all observations at that point in the plot. </li></ul><ul><li>Problem is that the technique only works well with linear data and is sensitive to the presence of outliers (i.e., data values that do not conform to the expected norm). </li></ul>
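The sensitivity of linear regression to outliers can be seen in a small worked example using ordinary least squares. The data points are hypothetical and the `fit_line` helper is written out by hand purely for illustration.

```python
def fit_line(points):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Perfectly linear data (y = 2x), then the same data with one outlier added.
clean = [(1, 2), (2, 4), (3, 6), (4, 8)]
with_outlier = clean + [(5, 40)]

clean_slope, clean_intercept = fit_line(clean)
skewed_slope, skewed_intercept = fit_line(with_outlier)
print(clean_slope, clean_intercept)
print(skewed_slope, skewed_intercept)
```

A single non-conforming point moves the fitted slope from 2 to 8, so the line no longer describes the four well-behaved observations.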
  75. 75. Predictive Modeling - Value Prediction <ul><li>Although nonlinear regression avoids the main problems of linear regression, it is still not flexible enough to handle all possible shapes of the data plot. </li></ul><ul><li>Statistical measurements are fine for building linear models that describe predictable data points; however, most data is not linear in nature. </li></ul>
  76. 76. Predictive Modeling - Value Prediction <ul><li>Data mining requires statistical methods that can accommodate non-linearity, outliers, and non-numeric data. </li></ul><ul><li>Applications of value prediction include credit card fraud detection or target mailing list identification. </li></ul>
  77. 77. Database Segmentation <ul><li>Aim is to partition a database into an unknown number of segments, or clusters, of similar records. </li></ul><ul><li>Uses unsupervised learning to discover homogeneous sub-populations in a database to improve the accuracy of the profiles. </li></ul>
  78. 78. Database Segmentation <ul><li>Less precise than other operations and thus less sensitive to redundant and irrelevant features. </li></ul><ul><li>Sensitivity can be reduced by ignoring a subset of the attributes that describe each instance or by assigning a weighting factor to each variable. </li></ul><ul><li>Applications of database segmentation include customer profiling, direct marketing, and cross selling. </li></ul>
  79. 79. Example of Database Segmentation using a Scatterplot
  80. 80. Database Segmentation <ul><li>Associated with demographic or neural clustering techniques, distinguished by: </li></ul><ul><ul><li>Allowable data inputs. </li></ul></ul><ul><ul><li>Methods used to calculate the distance between records. </li></ul></ul><ul><ul><li>Presentation of the resulting segments for analysis. </li></ul></ul>
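Partitioning records into an unknown number of segments using a distance measure can be sketched with a simple leader-style clustering pass. This is a swapped-in illustrative technique, not the demographic or neural clustering named above; the records, the `radius` parameter, and the function name are all hypothetical.

```python
from math import dist

def segment(records, radius):
    """Put each record in the first segment whose founding record lies within `radius`;
    start a new segment when none is close enough."""
    leaders, segments = [], []
    for record in records:
        for i, leader in enumerate(leaders):
            if dist(record, leader) <= radius:
                segments[i].append(record)
                break
        else:
            # No existing segment is close enough, so this record founds a new one.
            leaders.append(record)
            segments.append([record])
    return segments

# Hypothetical two-attribute records (for example, scaled age and scaled income).
records = [(1, 1), (1.5, 1.2), (8, 8), (8.2, 7.9), (1.2, 0.8), (15, 1)]
segments = segment(records, radius=2)
print(segments)
```

The number of segments is not supplied in advance; it emerges from the data and the chosen radius, which mirrors the unsupervised nature of the operation.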
  81. 81. Link Analysis <ul><li>Aims to establish links (associations) between records, or sets of records, in a database. </li></ul><ul><li>There are three specializations </li></ul><ul><ul><li>Associations discovery. </li></ul></ul><ul><ul><li>Sequential pattern discovery. </li></ul></ul><ul><ul><li>Similar time sequence discovery. </li></ul></ul><ul><li>Applications include product affinity analysis, direct marketing, and stock price movement. </li></ul>
  82. 82. Link Analysis - Associations Discovery <ul><li>Finds items that imply the presence of other items in the same event. </li></ul><ul><li>Affinities between items are represented by association rules. </li></ul><ul><ul><li>e.g. ‘When customer rents property for more than 2 years and is more than 25 years old, in 40% of cases, customer will buy a property. Association happens in 35% of all customers who rent properties’. </li></ul></ul>
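The two percentages attached to an association rule are commonly called confidence (how often the rule holds when its condition is met) and support (how often the whole association occurs across all records). The sketch below computes both under those usual definitions; the ten customer records are hypothetical and chosen only so the arithmetic is easy to follow.

```python
def rule_metrics(records, antecedent, consequent):
    """Return (support, confidence) for the rule 'antecedent implies consequent'."""
    matching_antecedent = [r for r in records if antecedent(r)]
    matching_both = [r for r in matching_antecedent if consequent(r)]
    support = len(matching_both) / len(records)
    confidence = len(matching_both) / len(matching_antecedent) if matching_antecedent else 0.0
    return support, confidence

# Hypothetical renters: (years renting, age, bought a property?)
customers = [
    (3, 30, True), (4, 40, True), (3, 28, False), (5, 35, False), (2.5, 26, False),
    (1, 30, False), (3, 22, True), (0.5, 45, False), (2, 50, True), (1, 20, False),
]

support, confidence = rule_metrics(
    customers,
    antecedent=lambda c: c[0] > 2 and c[1] > 25,  # rents for more than 2 years, older than 25
    consequent=lambda c: c[2],                    # buys a property
)
print(support, confidence)
```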
  83. 83. Link Analysis - Sequential Pattern Discovery <ul><li>Finds patterns between events such that the presence of one set of items is followed by another set of items in a database of events over a period of time. </li></ul><ul><ul><li>e.g. Used to understand long-term customer buying behavior. </li></ul></ul>
  84. 84. Link Analysis - Similar Time Sequence Discovery <ul><li>Finds links between two sets of data that are time-dependent, and is based on the degree of similarity between the patterns that both time series demonstrate. </li></ul><ul><ul><li>e.g. Within three months of buying property, new home owners will purchase goods such as cookers, freezers, and washing machines. </li></ul></ul>
  85. 85. Deviation Detection <ul><li>Relatively new operation in terms of commercially available data mining tools. </li></ul><ul><li>Often a source of true discovery because it identifies outliers, which express deviation from some previously known expectation and norm. </li></ul>
  86. 86. Deviation Detection <ul><li>Can be performed using statistics and visualization techniques or as a by-product of data mining. </li></ul><ul><li>Applications include fraud detection in the use of credit cards and insurance claims, quality control, and defects tracing. </li></ul>
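One statistical way to flag deviations from an expected norm is a z-score test, sketched below. The claim amounts and the threshold of two standard deviations are hypothetical choices for illustration; commercial tools may use quite different methods.

```python
from statistics import mean, pstdev

def find_deviations(values, z_threshold=2.0):
    """Return the values lying more than `z_threshold` standard deviations from the mean."""
    centre = mean(values)
    spread = pstdev(values)
    if spread == 0:
        return []  # all values identical, so nothing deviates
    return [v for v in values if abs(v - centre) / spread > z_threshold]

# Hypothetical insurance claim amounts with one suspicious entry.
claims = [102, 98, 101, 99, 100, 97, 103, 250]
outliers = find_deviations(claims)
print(outliers)
```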
  87. 87. Example of Database Segmentation using a Visualization
  88. 88. Data Mining Tools <ul><li>There are a growing number of commercial data mining tools on the marketplace. </li></ul><ul><li>Important characteristics of data mining tools include: </li></ul><ul><ul><li>Data preparation facilities. </li></ul></ul><ul><ul><li>Selection of data mining operations. </li></ul></ul><ul><ul><li>Product scalability and performance. </li></ul></ul><ul><ul><li>Facilities for visualization of results. </li></ul></ul>
  89. 89. Data Mining and Data Warehousing <ul><li>Major challenge to exploit data mining is identifying suitable data to mine. </li></ul><ul><li>Data mining requires single, separate, clean, integrated, and self-consistent source of data. </li></ul>
  90. 90. Data Mining and Data Warehousing <ul><li>A data warehouse is well equipped for providing data for mining. </li></ul><ul><li>Data quality and consistency is a prerequisite for mining to ensure the accuracy of the predictive models. Data warehouses are populated with clean, consistent data. </li></ul>
  91. 91. Data Mining and Data Warehousing <ul><li>Advantageous to mine data from multiple sources to discover as many interrelationships as possible. Data warehouses contain data from a number of sources. </li></ul><ul><li>Selecting relevant subsets of records and fields for data mining requires query capabilities of the data warehouse. </li></ul>
  92. 92. Data Mining and Data Warehousing <ul><li>Results of a data mining study are useful if there is some way to further investigate the uncovered patterns. Data warehouses provide capability to go back to the data source. </li></ul>