Data mining



  1. Lecture Thirteen: Data Warehousing, OLAP, and Data Mining (Prepared by BR-Cheung)

     Readings:
     • Required: Connolly and Begg, sections 25.1, 26.1.1, 26.2.1, and 26.2.8.
     • Elective: Connolly and Begg, the remainder of chapter 26.

     Supporting OLAP and Data Mining

     Besides supporting the day-to-day decisions of a large number of operational users, a DBMS also needs to support analysis and strategic decisions for managerial users. For example, a manager may compare sales performance before and after a marketing campaign, or analyze sales performance in order to forecast product sales and plan product ordering and storage capacity accordingly.
  2. Such on-line analytical processing (OLAP) can be enhanced with the use of data mining tools.

     Data mining
     • Data mining tools discover new patterns or rules that cannot necessarily be found by querying alone.
     • They use machine learning techniques from AI that automatically classify the data into different groups based on different criteria.

     For example, from data on product sales it is possible to derive a rule that if a customer shops on Sunday before 11 a.m. and buys milk, the customer also buys a newspaper and chocolate cookies. A store manager who wishes to promote a particular brand of chocolate cookies can use this rule and stack the cookies next to the Sunday newspaper stand.

     OLAP and data mining involve no modifications, and require ad hoc access to all of the organization's data, current as well as historical. This points to the need for new data models for the organization and storage of historical data: models that optimize query processing rather than transaction processing.

  3. What is a Data Warehouse?

     A data warehouse integrates data from multiple data sources and organizes them for efficient query processing and presentation. The best definition was given by Inmon (1992) when he introduced the term: a data warehouse is a subject-oriented, integrated, non-volatile, time-variant collection of data in support of management decisions.

     • Subject-oriented collection means that data are organized around subjects such as customers, products, and sales. In databases, by contrast, the stored data are organized around activities. For example, we use databases to store detailed data about purchasing orders and product acquisition. We use a data warehouse to store the summarization of the detailed data based on a subject.

  4. A summarization can be produced, for example, by applying aggregate functions on groups of rows that are specified using the GROUP BY clause. For instance, a summarization around a product can be product sales,

       SELECT Product, SUM(Total) AS ProductSale
       FROM PurchaseOrders
       GROUP BY Product;

     and a summarization around a sale can be daily sales,

       SELECT WorkDay, SUM(Total) AS DailySale
       FROM PurchaseOrders
       GROUP BY WorkDay;

     An aggregation of daily sales yields the example of time series that we discussed in 5.1.2.
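As a runnable illustration of the two summarizations above, the following sketch loads a few rows into an in-memory SQLite table and runs the same GROUP BY queries. The column types, the sample rows, and the helper name `summarize` are assumptions made for this example; only the two queries come from the lecture, and an ORDER BY clause has been added so that the output order is deterministic.

```python
import sqlite3

def summarize(rows):
    """Return (product_sales, daily_sales) for rows of (Product, WorkDay, Total)."""
    conn = sqlite3.connect(":memory:")
    # Assumed layout of PurchaseOrders; the lecture only names the three columns.
    conn.execute("CREATE TABLE PurchaseOrders (Product TEXT, WorkDay TEXT, Total REAL)")
    conn.executemany("INSERT INTO PurchaseOrders VALUES (?, ?, ?)", rows)

    # Summarization around a product.
    product_sales = conn.execute(
        "SELECT Product, SUM(Total) AS ProductSale "
        "FROM PurchaseOrders GROUP BY Product ORDER BY Product"
    ).fetchall()

    # Summarization around a sale (one row per work day).
    daily_sales = conn.execute(
        "SELECT WorkDay, SUM(Total) AS DailySale "
        "FROM PurchaseOrders GROUP BY WorkDay ORDER BY WorkDay"
    ).fetchall()

    conn.close()
    return product_sales, daily_sales

# Made-up sample data, for illustration only.
SAMPLE_ORDERS = [
    ("Milk", "2024-01-01", 10.0),
    ("Cookies", "2024-01-01", 5.0),
    ("Milk", "2024-01-02", 20.0),
]

product_sales, daily_sales = summarize(SAMPLE_ORDERS)
```

With the sample rows, `daily_sales` holds one total per work day, which is the kind of time series the slide refers to.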
  5. • Integrated collection means that a data warehouse integrates and stores data from multiple sources, not all of which are necessarily databases. A data source can also be, for example, an application file. Note that a data warehouse is not an example of a multi-database system that just provides access to data stored in heterogeneous databases. A data warehouse actually stores the integrated data after they have been cleaned to remove inconsistencies such as different formats and erroneous values. In this way, users are presented with a consistent, unified view of the data in the data warehouse.
     • Non-volatile collection means that the data warehouse is not updated in real time and in coordination with the updates on the data sources. Updates on a data source are grouped and applied together on the data warehouse by a maintenance transaction. Maintenance transactions execute periodically or on demand.
     • Time-variant collection means that data in a data warehouse are historical data whose values have temporal validity. This means that data warehouses must support time series.
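The non-volatile property above can be sketched as a batch of pending source updates applied in one maintenance transaction. This is only an illustration under stated assumptions: the table `WarehouseSales`, the function `apply_maintenance_batch`, and the sample rows are hypothetical, and SQLite stands in for whatever system actually hosts the warehouse.

```python
import sqlite3

def apply_maintenance_batch(conn, pending_updates):
    """Apply all pending (product, work_day, total) rows in a single transaction."""
    with conn:  # commits on success, rolls back if any statement fails
        conn.executemany(
            "INSERT INTO WarehouseSales (Product, WorkDay, Total) VALUES (?, ?, ?)",
            pending_updates,
        )
    return conn.execute("SELECT COUNT(*) FROM WarehouseSales").fetchone()[0]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE WarehouseSales ("
    "Product TEXT NOT NULL, WorkDay TEXT NOT NULL, Total REAL NOT NULL)"
)

# Updates collected from the sources since the last maintenance run (made-up data).
pending = [("Milk", "2024-01-01", 10.0), ("Cookies", "2024-01-01", 5.0)]
rows_after_batch = apply_maintenance_batch(conn, pending)

# A batch that contains an invalid row is rejected as a whole.
bad_batch = [("Milk", "2024-01-02", 20.0), ("Cookies", "2024-01-02", None)]
try:
    apply_maintenance_batch(conn, bad_batch)
except sqlite3.IntegrityError:
    pass  # the transaction was rolled back; the warehouse is unchanged
rows_after_bad_batch = conn.execute("SELECT COUNT(*) FROM WarehouseSales").fetchone()[0]
```

Running the batch periodically or on demand, rather than on every source update, is what keeps the warehouse contents stable between maintenance runs.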
  6. Data Warehouse Architecture (figure not included in this transcript)
  7. Multidimensional Modeling

     Multidimensional models populate data in multidimensional matrices. Three-dimensional (3-d) matrices are called data cubes, and matrices with more than three dimensions are called hypercubes. As an example of a cube, consider the dimensions: fiscal period, product, region. As we have mentioned earlier, we can use a 2-d matrix, such as a spreadsheet, to represent regional sales for a fixed period.

                | R1   R2   R3  ...
           -----|-------------------> Region
            P1  |
            P2  |
            P3  |
            .   |
            .   |
                V
             Product

     A spreadsheet can be converted into a cube by adding a time dimension to the spreadsheet, for example month intervals.
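One way to picture the step from spreadsheet to cube is to key each sales figure by all three coordinates and recover the 2-d spreadsheet by fixing the period. The sketch below is a plain-Python illustration, not how a warehouse engine stores cubes; the function names and the sample figures are assumptions.

```python
from collections import defaultdict

# Hypothetical sales facts: (month, product, region, sales amount).
facts = [
    ("Jan", "P1", "R1", 100),
    ("Jan", "P1", "R2", 50),
    ("Jan", "P2", "R1", 70),
    ("Feb", "P1", "R1", 120),
    ("Feb", "P2", "R2", 30),
]

def build_cube(rows):
    """Store each cell under its (month, product, region) coordinates."""
    cube = defaultdict(int)
    for month, product, region, sales in rows:
        cube[(month, product, region)] += sales
    return cube

def spreadsheet_for(cube, month):
    """Fix the time dimension to get the 2-d product-by-region matrix."""
    sheet = defaultdict(dict)
    for (m, product, region), sales in cube.items():
        if m == month:
            sheet[product][region] = sales
    return {product: dict(cells) for product, cells in sheet.items()}

cube = build_cube(facts)
january = spreadsheet_for(cube, "Jan")
```

Here `january` is the spreadsheet for one fixed period, and the full `cube` is the stack of such spreadsheets along the time dimension.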
  8. Data-Warehouse-Enhanced Query Operations

     Data warehouses provide a multidimensional conceptual view with unlimited dimensions and aggregation levels. They offer several operators to facilitate both querying and visualizing data in a multidimensional view.

     • Pivot or Rotation: Cubes can be visualized and reoriented along different axes. In the example above, product and region are presented in the front. Using rotation, we can bring time and product to the front, pushing region to the back.
     • Roll-Up Display: It can be used to derive a coarse-grain view, that is, a summarization and grouping along a dimension. For example, months can be grouped into years along the time axis, and products can be grouped into categories such as toys and computer games.
     • Drill-Down Display: It can be used to derive a finer-grain view, that is, a disaggregation along a dimension. For example, regions can be disaggregated into cities within each region, and months can be disaggregated into weeks or days.

  9. • Slice and Dice: It can be used to specify projection on dimensions, creating smaller cubes. For example, retrieve all the toy products in cities in Pennsylvania during the winter months.
     • Selection: This is similar to the standard select in SQL and can be used to retrieve data by value and range.
     • Sorting: It can be used to specify the ordering of data along a dimension.
     • Derived Attributes: It allows the specification of attributes that are computed from stored and other derived values.

     Multidimensional Storage Model

     Data warehouses support the summarization provided by the drill-down and roll-up operations in one of two ways, both of which involve a trade-off between time and space.
  10. • They maintain smaller summary tables that are retrieved to display a summarization.
      • They encode all the different levels along a dimension (e.g., weekly, quarterly, yearly) into existing tables. Using the appropriate encoding, a summarization is computed from the detailed data when needed.

      Tables in a data warehouse are logically organized in a so-called star schema. A star schema consists of a central fact table containing the factual data that can be analyzed in a variety of ways, and a dimension table for each dimension, containing reference data. Detailed data are stored in the dimension tables and are referenced by foreign keys in the fact table.

      For example, a star schema that can support the above example would consist of a fact table surrounded by three dimension tables: one for products, one for regional sales, and one for month intervals.

      Fact table:
  11.   SALE_SUMMARY (Product, Month, Region, Sales);
          Product -> PRODUCT(PID)
          Month   -> MONTH_INTERVAL(Month)
          Region  -> REGIONAL_SALES(Region)

      Dimension tables:

        PRODUCT (PID, Pname, PCategory, PDescription);
        REGIONAL_SALES (Region, County, City);
        MONTH_INTERVAL (MonthNo, Month, Year);

      In the star schema, dimension tables might not be normalized and so may contain redundant data (recall our discussion on normalization). The motivation for this redundancy is to speed up query processing by eliminating the need for joining tables. On the other hand, an unnormalized table can grow very large, and the overhead of scanning the table may offset any gain in query processing. In such a case, dimension tables can be normalized by decomposing them into smaller dimension tables and referencing them within the original dimension table. This decomposition leads to a hierarchical "star" schema called the snowflake schema.

  12. As in databases, data warehouses utilize different forms of indexing to speed up access to data. Further, they implement efficient handling of dynamic sparse matrices.

      Issues and Categories of Data Warehouses

      Compared to databases, data warehouses are very costly to build, in terms of both time and money. Furthermore, they are very costly to maintain.

      • Data warehouses have huge sizes and grow at enormous rates. They are at least one order of magnitude larger than their data sources. They range from hundreds of gigabytes to terabytes and even petabytes.
      • Resolving semantic heterogeneity between data sources, and converting different bodies of data from their data source models to the data warehouse model, is a complex and time-consuming process. This process is not executed only once, but is repeated almost every time that the data warehouse is synchronized with its data sources. An example of a heterogeneity that must be resolved each time is when two data sources use different currencies and follow different fiscal calendars.

  13. • Cleaning data to ensure quality and validity is another complex process. In fact, it has been identified as the most labor-intensive process of data warehouse construction. The reason is that recognizing incomplete or erroneous data is difficult to automate, at least in the beginning. Once recognized, erroneous or missing entries that follow a specific pattern can be corrected automatically. For example, if a data source uses -99 to represent the NULL value in a specific column, then all -99 values in that column can be converted into NULL. Another example: using the zip code, one can correct the misspelling of a city. For instance, Pittsburgh, PA ends with "h" whereas Pittsburg, CA does not.
      • The decision of what data to summarize and how to organize it is another very critical process. It affects both the utility of the data warehouse and its performance.
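The two pattern-based corrections mentioned in the cleaning bullet above can be sketched as a small function. The row layout, the function name `clean_rows`, the column names, and the two-entry zip lookup table are assumptions chosen for this example; a real cleaning step would draw on a complete reference table.

```python
def clean_rows(rows, sentinel_columns, city_by_zip):
    """Return cleaned copies of the given rows (each row is a dict)."""
    cleaned = []
    for row in rows:
        fixed = dict(row)  # leave the source row untouched
        # Rule 1: -99 in the listed columns stands for a missing value.
        for column in sentinel_columns:
            if fixed.get(column) == -99:
                fixed[column] = None
        # Rule 2: when the zip code is known, it determines the city spelling.
        expected_city = city_by_zip.get(fixed.get("zip"))
        if expected_city is not None:
            fixed["city"] = expected_city
        cleaned.append(fixed)
    return cleaned

# Small illustrative lookup table (sample entries only).
CITY_BY_ZIP = {"15213": "Pittsburgh", "94565": "Pittsburg"}

source_rows = [
    {"zip": "15213", "city": "Pittsburg", "age": -99},
    {"zip": "94565", "city": "Pittsburg", "age": 41},
]

cleaned_rows = clean_rows(source_rows, sentinel_columns=["age"], city_by_zip=CITY_BY_ZIP)
```

Restricting the -99 rule to named columns matters: in another column, -99 could be a legitimate value.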
  14. • The sheer volume of data makes data loading and synchronization a significant task. The data warehouse must also be recoverable from incomplete or incorrect updates.

      It is clear that administration and management tools are essential in such a complex environment. It seems, nowadays, that data administration is shifting away from database systems to data warehouses.

      In order to reduce the severity of the above issues, two alternatives to an enterprise-wide data warehouse have been proposed:

      • Data Marts: These are small, highly focused data warehouses at the level of a department. An enterprise-wide data warehouse can be constructed by forming a federation of data marts.
      • Virtual Data Warehouses: These are persistent collections of views of operational databases that are materialized for efficient access and complex query processing.
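The idea behind a virtual data warehouse can be sketched with a view over an operational table and a stored copy of that view. SQLite has no materialized views, so this example emulates materialization by copying the view's result into a table; that emulation, the names `ProductSalesView`, `ProductSalesSnapshot`, and `materialize`, and the sample rows are all assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Operational table (made-up sample data).
    CREATE TABLE PurchaseOrders (Product TEXT, WorkDay TEXT, Total REAL);
    INSERT INTO PurchaseOrders VALUES
        ('Milk', '2024-01-01', 10.0),
        ('Milk', '2024-01-02', 20.0),
        ('Cookies', '2024-01-01', 5.0);

    -- View that defines the summarization over the operational data.
    CREATE VIEW ProductSalesView AS
        SELECT Product, SUM(Total) AS ProductSale
        FROM PurchaseOrders
        GROUP BY Product;
    """
)

def materialize(conn):
    """Refresh the stored snapshot of the view (emulated materialization)."""
    with conn:
        conn.execute("DROP TABLE IF EXISTS ProductSalesSnapshot")
        conn.execute(
            "CREATE TABLE ProductSalesSnapshot AS SELECT * FROM ProductSalesView"
        )

materialize(conn)

# A later operational update is visible through the view immediately,
# but not in the snapshot until it is refreshed.
conn.execute("INSERT INTO PurchaseOrders VALUES ('Cookies', '2024-01-03', 7.0)")
conn.commit()

view_rows = conn.execute("SELECT * FROM ProductSalesView ORDER BY Product").fetchall()
snapshot_rows = conn.execute(
    "SELECT * FROM ProductSalesSnapshot ORDER BY Product"
).fetchall()
```

Queries against the snapshot avoid recomputing the aggregation, at the price of reading data that is only as fresh as the last refresh.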
  15. Examination Revision:
      • Transaction Scheduling and Concurrency.
      • Performance Improvement.
      • In exam 3, you might be asked to extend your E-Commerce project implementation. Instead of extending your implementation, you could extend the sample library project. Make sure you have the skills to build a web-based database application. The best way to master the skills and be prepared for exam 3 is to complete your project.