# Data Warehouse


### Data Warehouse

1. Data Warehouse and Data Cube. Lecture notes for Chapter 3 of Introduction to Data Mining, by Tan, Steinbach, and Kumar, and Data Mining, by Han and Kamber, 2nd Edition. Revised by QY.
2. OLAP
    - On-Line Analytical Processing (OLAP) was proposed by E. F. Codd, the father of the relational database.
    - Relational databases store data in tables, while OLAP uses a multidimensional array representation.
        - Such representations of data previously existed in statistics and other fields.
    - A number of data analysis and data exploration operations are easier with such a representation.
3. Creating a Multidimensional Array
    - Converting tabular data into a multidimensional array takes two key steps.
        - First, identify which attributes are to be the dimensions and which attribute is to be the target attribute, whose values appear as entries in the multidimensional array.
            - The attributes used as dimensions must have discrete values.
            - The target value is typically a count or a continuous value, e.g., the cost of an item.
            - There may be no target attribute at all, in which case each entry is the count of objects that share the same set of attribute values.
        - Second, find the value of each entry in the multidimensional array by summing the values of the target attribute (or counting the objects) over all objects whose attribute values correspond to that entry.
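
The two steps above can be sketched in a few lines of Python. This is a minimal illustration, not a method taken from the slides: the function name `build_array` and the sample records are hypothetical, and a plain dictionary keyed by tuples stands in for the multidimensional array.

```python
from collections import defaultdict

def build_array(records, dimensions, target=None):
    """Map each combination of dimension values to a sum of the target
    attribute, or to a count of objects when no target is given."""
    cells = defaultdict(float)
    for rec in records:
        key = tuple(rec[d] for d in dimensions)            # step 1: pick the cell
        cells[key] += rec[target] if target is not None else 1  # step 2: sum or count
    return dict(cells)

# Hypothetical records, invented for illustration only.
records = [
    {"item": "pen", "store": "A", "cost": 2.0},
    {"item": "pen", "store": "A", "cost": 3.0},
    {"item": "ink", "store": "B", "cost": 5.0},
]

sums = build_array(records, ["item", "store"], target="cost")
counts = build_array(records, ["item", "store"])
```

Combinations that never occur in the data are simply absent from the dictionary, which corresponds to an entry of 0 in a dense array.
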
4. Example: Iris data
    - We show how the attributes petal length, petal width, and species type of the Iris data (http://archive.ics.uci.edu/ml/datasets/Iris) can be converted to a multidimensional array.
        - First, we discretize petal width and petal length into the categorical values low, medium, and high.
5. Example: Iris data (continued)
    - Each unique tuple of petal width, petal length, and species type identifies one element of the array.
    - This element is assigned the corresponding count value.
    - The figure illustrates the result.
    - All tuples that are not listed have a count of 0.
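
A small sketch of this conversion follows. The cut points and the five sample flowers are assumptions made for illustration; the slides do not state the thresholds used, and a real run would load the full Iris data set instead.

```python
from collections import Counter

def discretize(value, low_cut, high_cut):
    """Map a continuous measurement to low / medium / high."""
    if value < low_cut:
        return "low"
    if value < high_cut:
        return "medium"
    return "high"

# Assumed cut points in centimetres (not given in the slides).
LENGTH_CUTS = (2.5, 5.0)
WIDTH_CUTS = (0.75, 1.75)

# (petal_length, petal_width, species): illustrative rows, not the real data set.
flowers = [
    (1.4, 0.2, "setosa"),
    (1.5, 0.3, "setosa"),
    (4.5, 1.5, "versicolor"),
    (5.1, 1.6, "versicolor"),
    (6.0, 2.5, "virginica"),
]

# Each (width, length, species) tuple identifies one array element;
# its value is the number of flowers that fall into it.
counts = Counter(
    (discretize(width, *WIDTH_CUTS), discretize(length, *LENGTH_CUTS), species)
    for length, width, species in flowers
)
```

A `Counter` returns 0 for any key it has not seen, which matches the convention that unlisted tuples have a count of 0.
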
6. OLAP Operations: Data Cube
    - The key operation of OLAP is the formation of a data cube.
        - A data cube is a multidimensional representation of data, together with all possible aggregates.
        - Aggregates (similar to a class attribute):
            - An aggregate results from selecting a proper subset of the dimensions and summing over all remaining dimensions.
            - Aggregates are cached to improve speed and support online computation.
        - For example, if we choose the species type dimension of the Iris data and sum over all other dimensions, the result is a one-dimensional array with three entries, each of which gives the number of flowers of one type.
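
Computing an aggregate amounts to projecting every cell onto the kept dimensions and summing. The sketch below assumes the array is stored as a dictionary keyed by tuples; the helper name `aggregate` and the counts are hypothetical and are not the figures from the slides.

```python
from collections import defaultdict

DIMS = ("width", "length", "species")

# Hypothetical counts, invented for illustration.
cells = {
    ("low", "low", "setosa"): 4,
    ("low", "medium", "setosa"): 1,
    ("medium", "medium", "versicolor"): 3,
    ("medium", "high", "versicolor"): 2,
    ("high", "high", "virginica"): 5,
}

def aggregate(cells, dims, keep):
    """Sum cell values over every dimension not listed in `keep`."""
    positions = [dims.index(d) for d in keep]
    totals = defaultdict(int)
    for key, value in cells.items():
        totals[tuple(key[p] for p in positions)] += value
    return dict(totals)

by_species = aggregate(cells, DIMS, ("species",))   # one entry per species
grand_total = aggregate(cells, DIMS, ())            # the overall total
```

Keeping no dimensions at all yields a single entry under the empty tuple, which is the overall total.
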
7. From Tables and Spreadsheets to Data Cubes (May 10, 2010, Data Mining: Concepts and Techniques)
    - A data warehouse is based on a multidimensional data model, which views data in the form of a data cube.
    - A data cube, such as sales, allows data to be modeled and viewed in multiple dimensions.
        - Dimension tables, such as item (item_name, brand, type) or time (day, week, month, quarter, year).
        - The fact table contains measures (such as dollars_sold) and keys to each of the related dimension tables.
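
The relationship between a fact table and its dimension tables can be sketched with plain dictionaries. The attribute names `item_name`, `brand`, `type`, `day`, `week`, `month`, `quarter`, `year`, and `dollars_sold` come from the slide; the key columns `item_key` and `time_key`, the helper `total_by`, and every data value are assumptions made for illustration.

```python
from collections import defaultdict

# Dimension tables, keyed by a hypothetical surrogate key.
item_dim = {
    1: {"item_name": "TV-A", "brand": "BrandX", "type": "TV"},
    2: {"item_name": "PC-B", "brand": "BrandY", "type": "PC"},
}
time_dim = {
    10: {"day": "2010-01-05", "week": 1, "month": 1, "quarter": 1, "year": 2010},
    11: {"day": "2010-04-09", "week": 14, "month": 4, "quarter": 2, "year": 2010},
}

# Fact table: one measure plus a key into each dimension table.
sales_fact = [
    {"item_key": 1, "time_key": 10, "dollars_sold": 400.0},
    {"item_key": 1, "time_key": 11, "dollars_sold": 250.0},
    {"item_key": 2, "time_key": 10, "dollars_sold": 900.0},
    {"item_key": 1, "time_key": 10, "dollars_sold": 100.0},
]

def total_by(fact_rows, item_attr, time_attr):
    """Sum dollars_sold grouped by one item attribute and one time attribute."""
    totals = defaultdict(float)
    for row in fact_rows:
        item_value = item_dim[row["item_key"]][item_attr]
        time_value = time_dim[row["time_key"]][time_attr]
        totals[(item_value, time_value)] += row["dollars_sold"]
    return dict(totals)

by_type_quarter = total_by(sales_fact, "type", "quarter")
```

Because the fact rows carry only keys, the same facts can be viewed by brand, by type, by month, or by year without changing the fact table.
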
8. Cube: A Lattice of Cuboids
    - 0-D (apex) cuboid: all
    - 1-D cuboids: time; item; location; supplier
    - 2-D cuboids: time,item; time,location; time,supplier; item,location; item,supplier; location,supplier
    - 3-D cuboids: time,item,location; time,item,supplier; time,location,supplier; item,location,supplier
    - 4-D (base) cuboid: time,item,location,supplier
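
Each cuboid corresponds to one subset of the dimensions, so the lattice can be enumerated with `itertools.combinations`. This is a minimal sketch; the function name `cuboid_lattice` is hypothetical.

```python
from itertools import combinations

def cuboid_lattice(dimensions):
    """Return {k: [cuboids with k dimensions]} for k = 0..len(dimensions)."""
    return {
        k: list(combinations(dimensions, k))
        for k in range(len(dimensions) + 1)
    }

lattice = cuboid_lattice(("time", "item", "location", "supplier"))

for k, cuboids in lattice.items():
    # The empty combination is the apex cuboid, conventionally written "all".
    print(f"{k}-D:", [",".join(c) or "all" for c in cuboids])
```

With n dimensions (and no concept hierarchies) there are 2^n cuboids in total, which is 16 for the four dimensions used here.
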
9. A Concept Hierarchy: Dimension (location)
    - Levels, from most general to most specific: all, region, country, city, office
    - all: all
    - region: Europe, ..., North_America
    - country: Germany, ..., Spain, Canada, ..., Mexico
    - city: Frankfurt, ..., Vancouver, ..., Toronto
    - office: L. Chan, ..., M. Wind
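
One simple way to hold such a hierarchy is a parent mapping per level. The sketch below covers only the city, country, and region levels named on the slide; the dictionary and function names are hypothetical, and the office level is left out because the slide text does not say which city each office belongs to.

```python
# Parent mappings between adjacent levels of the location hierarchy.
CITY_TO_COUNTRY = {
    "Vancouver": "Canada",
    "Toronto": "Canada",
    "Frankfurt": "Germany",
}
COUNTRY_TO_REGION = {
    "Canada": "North_America",
    "Mexico": "North_America",
    "Germany": "Europe",
    "Spain": "Europe",
}

def generalize(city):
    """Return the path from a city up to the top of the hierarchy."""
    country = CITY_TO_COUNTRY[city]
    region = COUNTRY_TO_REGION[country]
    return [city, country, region, "all"]

path = generalize("Toronto")
```
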
10. A Sample Data Cube
    - Dimensions: Date (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum), Product (TV, VCR, PC, sum), Country (U.S.A, Canada, Mexico, sum)
    - Example aggregate cell: total annual sales of TV in U.S.A.
    - The (All, All, All) cell holds the overall sum.
11. Cuboids Corresponding to the Cube
    - 0-D (apex) cuboid: all
    - 1-D cuboids: product; date; country
    - 2-D cuboids: product,date; product,country; date,country
    - 3-D (base) cuboid: product,date,country
12. Data Cube Example (continued)
    - The following table shows one of the two-dimensional aggregates, along with two of the one-dimensional aggregates and the overall total.
13. OLAP Operations: Slicing and Dicing
    - Slicing selects a group of cells from the entire multidimensional array by fixing a specific value for one or more dimensions.
    - Dicing selects a subset of cells by specifying a range of attribute values.
        - This is equivalent to defining a subarray of the complete array.
    - In practice, both operations can also be accompanied by aggregation over some dimensions.
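
Both operations are filters over the cells of the array. The sketch below assumes a dictionary keyed by (date, product, country) tuples; the helper names `slice_cube` and `dice_cube` and all cell values are hypothetical, and a range of attribute values is represented as an explicit list of allowed values.

```python
DIMS = ("date", "product", "country")

# Hypothetical sales cells, invented for illustration.
cells = {
    ("1Qtr", "TV", "U.S.A"): 10,
    ("1Qtr", "PC", "Canada"): 4,
    ("2Qtr", "TV", "U.S.A"): 7,
    ("2Qtr", "VCR", "Mexico"): 3,
    ("3Qtr", "TV", "Canada"): 5,
}

def slice_cube(cells, dims, dim, value):
    """Slice: keep the cells where one dimension equals a single value."""
    p = dims.index(dim)
    return {key: v for key, v in cells.items() if key[p] == value}

def dice_cube(cells, dims, allowed):
    """Dice: keep the cells whose value on each listed dimension is allowed."""
    checks = [(dims.index(d), set(values)) for d, values in allowed.items()]
    return {
        key: v
        for key, v in cells.items()
        if all(key[p] in values for p, values in checks)
    }

tv_only = slice_cube(cells, DIMS, "product", "TV")
first_half = dice_cube(
    cells, DIMS,
    {"date": ["1Qtr", "2Qtr"], "country": ["U.S.A", "Canada"]},
)
```

Summing the values of either result gives the aggregation over the remaining dimensions that often accompanies these operations in practice.
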
14. OLAP Operations: Roll-up and Drill-down
    - The hierarchical structure of a dimension (such as the concept hierarchy for location) gives rise to the roll-up and drill-down operations.
        - For sales data, we can aggregate (roll up) the sales across all the dates in a month.
        - Conversely, given a view of the data where the time dimension is broken into months, we could split the monthly sales totals (drill down) into daily sales totals.
        - Likewise, we can drill down or roll up on the location or product ID attributes.
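
A minimal sketch of rolling daily sales up to months and drilling back down follows. The daily figures are invented, dates are assumed to be ISO strings (YYYY-MM-DD) so that the first seven characters identify the month, and the function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical daily sales totals keyed by ISO date string.
daily_sales = {
    "2010-04-29": 120.0,
    "2010-04-30": 80.0,
    "2010-05-01": 50.0,
    "2010-05-10": 200.0,
}

def month_of(day):
    """Map a day to its parent in the time hierarchy (day -> month)."""
    return day[:7]

def roll_up(daily):
    """Aggregate daily totals into monthly totals."""
    monthly = defaultdict(float)
    for day, amount in daily.items():
        monthly[month_of(day)] += amount
    return dict(monthly)

def drill_down(daily, month):
    """Split one month's total back into its daily totals."""
    return {day: amt for day, amt in daily.items() if month_of(day) == month}

monthly_sales = roll_up(daily_sales)
may_days = drill_down(daily_sales, "2010-05")
```

Drill-down needs the finer-grained data to be available; a monthly total on its own cannot be split into days.
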