Dbm630_Lecture02-03

305 views

Published on

0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total views
305
On SlideShare
0
From Embeds
0
Number of Embeds
2
Actions
Shares
0
Downloads
5
Comments
0
Likes
0
Embeds 0
No embeds

No notes for slide

Dbm630_Lecture02-03

  1. 1. DBM630: Data Mining and Data Warehousing MS.IT. Rangsit University Semester 2/2011 Lecture 2&3 Data Warehouse and OLAP Technology by Kritsada Sriphaew (sriphaew.k AT gmail.com)1
  2. 2. OutlinePart I: Basic Knowledge about Data Warehousing Dimensional Modeling Data Cube ArchitecturePart II: OLAP and Cube Computations OLAP, Data Cube and Data Analysis Cube Computations Demo PivotTable (bring laptop with MS Excel, if you have) 2 Data Warehousing and Data Mining by Kritsada Sriphaew
  3. 3. What is Data Warehouse? Defined in many different ways, but not rigorously.  A decision support database that is maintained separately from the organization’s operational DB  Support information processing by providing a solid platform of consolidated, historical data for analysis. “A data warehouse is a subject-oriented, integrated, time- variant, and nonvolatile collection of data in support of management’s decision-making process.” (definition by W. H. Inmon) Data warehousing:  Process of constructing and using data warehouses 3 Data Warehousing and Data Mining by Kritsada Sriphaew
  4. 4. Four Properties of Data Warehouses subject-oriented จัดเก็บเป็นเรื่องๆ integrated รวบรวมอยู่ในรูปแบบเดียวกัน time-variant มีข้อมูลตามมิติเวลา non-volatile มีเสถียรภาพ ข้อมูลไม่สูญหาย4 Data Warehousing and Data Mining by Kritsada Sriphaew
  5. 5. Subject-Oriented Subject-Oriented Property  Organized around major subjects, such as customer, product, sales.  Focusing on the modeling and analysis of data for decision makers, not on daily operations or transaction processing.  Provide a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process. OPERATIONAL DB DATA WAREHOUSE • Loans • Customer • Savings • Vendor • Bank card • Product • Trust • Activity An application orientation A subject orientation5 Data Warehousing and Data Mining by Kritsada Sriphaew
  6. 6. Integrated Integrated Property  Constructed by integrating multiple, heterogeneous data sources  Relational databases, flat files, on-line transaction records  Data cleaning and data integration techniques are applied.  Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources  e.g., Hotel price: currency, tax, breakfast covered, etc.  When data is moved to the warehouse, it is converted.6 Data Warehousing and Data Mining by Kritsada Sriphaew
  7. 7. Time Variant/Non-Volatile Time Variant Property  The time horizon for the data warehouse is significantly longer than that of operational systems.  Operational database: current value data.  Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years)  Every key structure in the data warehouse  Contains an element of time, explicitly or implicitly  But the key of operational data may or may not contain “time element”. Non-Volatile Property  A physically separated store of data transformed from the operational environment.  Operational update of data does not occur in the data warehouse environment.  Does not require transaction processing, recovery, and concurrency control mechanisms  Requires only two operations in data accessing: initial loading of data and access of data.7 Data Warehousing and Data Mining by Kritsada Sriphaew
  8. 8. Operational DBMS vs. Data Warehouse(OLTP vs. OLAP) OLTP (on-line transaction processing)  Major task of traditional relational DBMS  Day-to-day operations: purchasing, inventory, banking, manufacturing, payroll, registration, accounting, etc. OLAP (on-line analytical processing)  Major task of data warehouse system  Data analysis and decision making Distinct features (OLTP vs. OLAP):  User and system orientation: customer vs. market  Data contents: current, detailed vs. historical, consolidated  Design: ER + application vs. star + subject  View: current, local vs. evolutionary, integrated  Access patterns: create/select/insert/delete/update vs. read-only but complex queries 8 Data Warehousing and Data Mining by Kritsada Sriphaew
  9. 9. OLTP vs. OLAP OLTP OLAPusers clerk, IT professional knowledge workerfunction day to day operations decision supportDB design application-oriented subject-orienteddata current, up-to-date historical, detailed, flat relational summarized, isolated multidimensional integrated, consolidatedusage repetitive ad-hocaccess read/write lots of scans index/hash on primary keyunit of work short, simple transaction complex query# records accessed tens millions#users thousands tensDB size 100MB-GB 100GB-TBmetric transaction throughput query throughput, response9 Data Warehousing and Data Mining by Kritsada Sriphaew
  10. 10. Heterogeneous DBMS vs. Data Warehouse Traditional heterogeneous DB integration  Build wrappers/mediators on top of heterogeneous DBs  Query-driven approach  When a query is posed to a client site, a meta-dictionary is used to translate the query into queries appropriate for individual heterogeneous sites involved, and the results are integrated into a global answer set  Complex information filtering, compete for resources Data warehouse (DW)  Another database for high-performance analysis  Update-driven approach  Information from heterogeneous sources is integrated in advance and stored in warehouses for direct query/analysis 10 Data Warehousing and Data Mining by Kritsada Sriphaew
  11. 11. Heterogeneous DBMS vs. Data Warehouse OLTP Query-driven Approach Sales Wrapper/ Purchasing Mediators Production Heterogeneous Operational DatabaseOLTP Update-driven Approach Sales Data Purchasing Warehouse Query Production OLAPHeterogeneous Operational Database 11 Data Warehousing and Data Mining by Kritsada Sriphaew
  12. 12. Why Separate Data Warehouse? High performance for both systems  DBMS— tuned for OLTP: access methods, indexing, concurrency control, recovery  Warehouse—tuned for OLAP: complex OLAP queries, multidimensional view, consolidation. Different functions and different data:  Missing data: Decision support (OLAP) requires historical data which operational DBs (OLTP) do not typically maintain  Data consolidation: DS requires consolidation (aggregation, summarization) of data from heterogeneous sources  Data quality: different sources typically use inconsistent data representations, codes and formats which have to be reconciled12 Data Warehousing and Data Mining by Kritsada Sriphaew
  13. 13. * A multidimensional data model From Tables and Spreadsheets to Data Cubes  A data warehouse is based on a multidimensional data model which views data in the form of a data cube  A data cube, such as sales, allows data to be modeled and viewed in multiple dimensions  Fact table contains measures (such as units_sold, dollars_sold) and keys to each of the related dimension tables  Dimension tables, such as location (branch, country, continent), or time (day, week, month, quarter, year) Date Branch Item Buyer Units sold Dollars sold Branch Country Continent1/1/2008 London VCD First Company 20 5000 London UK Europe1/1/2008 Bangkok TV First Company 30 9000 Glasgow UK Europe10/1/2008 London Ham First Company 20 1000 Berlin Germany Europe4/2/2008 London Milk First Company 80 1600 Bangkok Thailand Asia15/2/2008 Bangkok VCD Best Company 30 7500 Phuket Thailand Asia2/5/2008 Bangkok Orange Best Company 20 500 Tokyo Japan Asia 13 Fact Table Dimension Table Data Warehousing and Data Mining by Kritsada Sriphaew
  14. 14. Three Concepts in Data Cubes Three concepts in data cubes are (1) Multidimension (2) Hierarchy Location Dimension Table (3 levels) (3) Measure Fact Table Branch Country Continent 4 dimensions 2 measures London UK Europe Glasgow UK Europe Date Branch Item Buyer Units sold Dollars sold Berlin Germany Europe1/1/2008 London VCD First Company 20 5000 Bangkok Thailand Asia1/1/2008 Bangkok TV First Company 30 9000 Phuket Thailand Asia10/1/2008 London Ham First Company 20 1000 Tokyo Japan Asia4/2/2008 London Milk First Company 80 1600 Item Subcategory Category15/2/2008 Bangkok VCD Best Company 30 7500 VCD Electric Non-Food2/5/2008 Bangkok Orange Best Company 20 500 TV Electric Non-Food Buyer Buyer Group Shirt Clothes Non-Food First Company Group 1 Customer Dimension Table Ham Process food FoodSecond Company Group 1 (2 levels) Milk Fresh food Food Third Company Group 1 Orange Fresh food Food Best Company Group 2 Good Company Group 2 Product Dimension Table (3 levels) 14 Data Warehousing and Data Mining by Kritsada Sriphaew
  15. 15. Cuboids in Data Cubes Cuboid concept is formed by the number of dimensions. The top most 0-D cuboid, which holds the highest- level of summarization, is called an apex cuboid. In data warehousing literature, an n-D base cube where n is the total number of dimensions is called a base cuboid. The lattice of cuboids forms a data cube. 15 Data Warehousing and Data Mining by Kritsada Sriphaew
  16. 16. An Example of Cuboids (Dimension) (add vs. delete dimensions) Time Sales Sales 0-D (apex) cuboid Sales(all) Q1 5000 24000Add dim. Delete dim. Q2 4000 Sales(Q1), Sales(Q2), Q3 80001-D cuboid Sales(Q3), Sales(Q4) Q4 7000Add dim. Delete dim. Time Thailand Japan Sales(Q1, Thailand), Sales(Q1, Japan) Sales(Q2, Thailand), Sales(Q2, Japan) Q1 2000 3000 2-D cuboid Sales(Q3, Thailand), Sales(Q3, Japan) Q2 1500 2500 Sales(Q4, Thailand), Sales(Q4, Japan)Add dim. Q3 2000 6000 Delete dim. Q4 3000 4000 3-D cuboid Thailand JapanSales(Q1, Thailand, Food), Sales(Q1, Thailand, NonFood) TimeSales(Q2, Thailand, Food), Sales(Q2, Thailand, NonFood) Food NonFood Food NonFoodSales(Q3, Thailand, Food), Sales(Q3, Thailand, NonFood) Q1 1500 500 2000 1000Sales(Q4, Thailand, Food), Sales(Q4, Thailand, NonFood)Sales(Q1, Japan, Food), Sales(Q1, Japan, NonFood) Q2 900 600 1500 1000Sales(Q2, Japan, Food), Sales(Q2, Japan, NonFood)Sales(Q3, Japan, Food), Sales(Q3, Japan, NonFood) Q3 1200 800 4000 2000Sales(Q4, Japan, Food), Sales(Q4, Japan, NonFood) Q4 2000 1000 2500 1500 16 Data Warehousing and Data Mining by Kritsada Sriphaew
  17. 17. Data Cube: A Lattice of Cuboids (Dimension) all 0-D (apex) cuboid time product location customer 1-D cuboidstime, item time, location item, location location, customer 2-D cuboids time, customer item, customer time, item, location time, location, customer 3-D cuboids time, item, customer item, location, customer 4-D(base) cuboid time, item, location, customer 17 Data Warehousing and Data Mining by Kritsada Sriphaew
  18. 18. Designing Data Warehouses Conceptual Modeling (Star schema, Snowflake, Fact constellations), Concept Hierarchy Physical Modeling (Data sources, Data storage, OLAP engine)18 Data Warehousing and Data Mining by Kritsada Sriphaew
  19. 19. Conceptual Modeling of Data Warehouses  Modeling data warehouses: dimensions & measures  Star schema: A fact table in the middle connected to a set of dimension tables  Snowflake schema: A refinement of star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to snowflake  Fact constellations: Multiple fact tables share dimension tables, viewed as a collection of stars, therefore called galaxy schema or fact constellation19 Data Warehousing and Data Mining by Kritsada Sriphaew
  20. 20. Example of Star Schema time time_key item day item_key day_of_the_week Sales Fact Table item_name month brand quarter time_key type year supplier_type item_key branch_key branch location location_key branch_key location_key branch_name units_sold street branch_type city dollars_sold province_or_street country avg_sales Measures20 Data Warehousing and Data Mining by Kritsada Sriphaew
  21. 21. Example of Snowflake Schematimetime_key itemday item_key supplierday_of_the_week Sales Fact Table item_name supplier_keymonth brand supplier_typequarter time_key typeyear item_key supplier_key branch_key branch location location_key location_key branch_key units_sold street branch_name city_key city branch_type dollars_sold city_key avg_sales city province_or_street Measures country21 Data Warehousing and Data Mining by Kritsada Sriphaew
  22. 22. Example of Fact Constellation time Shipping Fact Table item time_key time_key day Sales Fact Table item_key day_of_the_week item_name item_key month time_key brand quarter type shipper_key year item_key supplier_type from_location branch_key to_location branch location_key location branch_key dollars_cost units_sold location_key branch_name street units_shipped branch_type dollars_sold city province_or_street shipper avg_sales country Measures shipper_key shipper_name location_key shipper_type22 Data Warehousing and Data Mining by Kritsada Sriphaew
  23. 23. Measures: Three Categories Distributive: if the result derived by applying the function to n aggregate values is the same as that derived by applying the function on all the data without partitioning.  e.g., count(), sum(), min(), max(). Algebraic: if it can be computed by an algebraic function with M arguments (where M is a bounded integer), each of which is obtained by applying a distributive aggregate function.  e.g., avg(), standard_deviation(). Holistic: if there is no constant bound on the storage size needed to describe a sub-aggregate.  e.g., median(), mode(), rank(). 23 Data Warehousing and Data Mining by Kritsada Sriphaew
  24. 24. A Concept Hierarchy: Dimension (location)all allregion Europe ... North_Americacountry Germany ... Spain Canada ... Mexicocity Frankfurt ... Vancouver ... Torontooffice L. Chan ... M. Wind 24 Data Warehousing and Data Mining by Kritsada Sriphaew
  25. 25. A Concept Hierarchy: Dimension (Distributive - count(), sum(), min(), max())all all sum = 2100 sum = 1400 sum = 700region Europe North_America sum = 900 sum = 500 sum = 600 sum = 100country Germany Spain Canada Mexicocity Frankfurt Berlin Aechen Segovia Madrid Vancouver Toronto 400 200 300 400 100 200 400 Mexico 100 25 Data Warehousing and Data Mining by Kritsada Sriphaew
  26. 26. A Concept Hierarchy: Dimension (Algebraic - avg(), standard_deviation())all all avg = 262.5 count=8 avg = 280 count=5 avg = 233.33region Europe North_America count=3 avg = 300 avg = 250 avg = 300 avg = 100country Germany Spain Canada Mexico count=3 count=2 count=1 count=2city Frankfurt Berlin Aechen Segovia Madrid Vancouver Toronto 400 200 300 400 100 200 400 Mexico 100 26 Data Warehousing and Data Mining by Kritsada Sriphaew
  27. 27. A Concept Hierarchy: Dimension (Holistic - median(), mode(), rank())all all median = 250 median = 300 median = 200region Europe North_America median = 300 median = 250 median = 300country Germany Spain Canada Mexico median = 100city Frankfurt Berlin Aechen Segovia Madrid Vancouver Toronto 400 200 300 400 100 200 400 Mexico 100 27 Data Warehousing and Data Mining by Kritsada Sriphaew
  28. 28. Multidimensional Data  Sales volume as a function of product, month, and region Dimensions: Product, Location, Time Hierarchical summarization paths Industry Region YearProduct Category Country Quarter Product City Month Week Office Day Month 28 Data Warehousing and Data Mining by Kritsada Sriphaew
  29. 29. A Sample Data Cube Total annual sales Time of TV in U.S.A. 1Qtr 2Qtr 3Qtr 4Qtr sum TV PC U.S.A VCR Country sum Canada Mexico sum29 Data Warehousing and Data Mining by Kritsada Sriphaew
  30. 30. Browsing a Data Cube Visualization OLAP capabilities Interactive manipulation 30 Data Warehousing and Data Mining by Kritsada Sriphaew
  31. 31. Typical OLAP Operations Drill up (roll up): summarize data  by climbing up hierarchy or by dimension reduction Drill down (roll down): reverse of drill up  from higher level summary to lower level summary or detailed data, or introducing new dimensions Slice and dice:  project and select  Slice (select on 1 dim.), dice (select on 2 or more dim.) Pivot (rotate):  reorient the cube, visualization, 3D to series of 2D planes. Other operations  drill across: involving (across) more than one fact table  drill through: through the bottom level of the cube to its back-end relational tables (using SQL) 31 Data Warehousing and Data Mining by Kritsada Sriphaew
  32. 32. A Star-Net Query Model Each circle is called Shipping Method Customer Orders a footprint All All Shipping type Customer Sex group Contracts Shipping type All Order Male/FemaleTime Annually Product line All All Quarterly Daily Item Product group Product City Salesperson Country District Region Promotion type Division All All All Location Promotion Organization32 Data Warehousing and Data Mining by Kritsada Sriphaew
  33. 33. An Example Three concepts in data cubes are * A multi-dimensional data model (1) Multidimension (2) Hierarchy Location Dimension Table (3 levels) (3) Measure Fact Table Branch Country Continent 4 dimensions 2 measures London UK Europe Glasgow UK Europe Date Branch Item Buyer Units sold Dollars sold Berlin Germany Europe1/1/2008 London VCD First Company 20 5000 Dimension Table Bangkok Thailand Asia1/1/2008 Bangkok TV First Company 30 9000 Phuket Thailand Asia10/1/2008 London Ham First Company 20 1000 Tokyo Japan Asia4/2/2008 London Milk First Company 80 1600 Item Subcategory Category15/2/2008 Bangkok VCD Best Company 30 7500 VCD Electric Non-Food2/5/2008 Bangkok Orange Best Company 20 500 TV Electric Non-Food Buyer Buyer Group Shirt Clothes Non-Food First Company Group 1 Customer Dimension Table Ham Process food FoodSecond Company Group 1 (2 levels) Milk Fresh food Food Third Company Group 1 Orange Fresh food Food Best Company Group 2 Good Company Group 2 Product Dimension Table (3 levels) 33 Data Warehousing and Data Mining by Kritsada Sriphaew
  34. 34. OLAP Operations (Drill down vs. Drill up) All customers All customers Thailand Japan Total Thailand Japan TotalTime Time Food NonFood Food NonFood Drill down Food NonFood Food NonFood2006 2400 2200 11000 5000 20600 Roll down Q1 1500 500 2000 1000 50002007 4500 3200 12000 6000 25700 Q2 900 600 1500 1000 40002008 5600 2900 10000 5500 24000 Q3 1200 800 4000 2000 8000Total 12500 8300 33000 16500 70300 Drill up Q4 2000 1000 2500 1500 7000 Location Roll up Total 2008 5600 2900 10000 5500 24000 all Drill down Drill up Roll down Roll up continent All customers country Thailand Japan TotalTime branch Product Time all year month subcategory all Food NonFood Food NonFood quarter day item category January 700 200 800 500 2200 buyer February 500 100 700 200 1500 buyer group March 300 200 500 300 1300 all Total Q1 1500 500 2000 1000 5000 Customer 34 Data Warehousing and Data Mining by Kritsada Sriphaew
  35. 35. OLAP Operations (Slice and Dice) All customers Buyer Group 1 Thailand Japan Total Thailand Japan TotalTime Time Food NonFood Food NonFood Food NonFood Food NonFood2006 2400 2200 11000 5000 20600 Slice 2006 1100 1500 7000 2000 116002007 4500 3200 12000 6000 25700 2007 1200 1200 8000 3000 134002008 5600 2900 10000 5500 24000 2008 3000 1200 5000 1800 11000Total 12500 8300 33000 16500 70300 Total 5300 3900 20000 6800 36000 Location all January Buyer Group 1 continent Dice Thailand Japan Total country Time Food NonFood Food NonFoodTime branch 2006 200 200 800 300 1500 all year month subcategory all Product 2007 100 100 900 400 1500 quarter day item category 2008 300 200 400 100 1000 buyer buyer group Total 600 500 2100 800 4000 all Customer 35 Data Warehousing and Data Mining by Kritsada Sriphaew
  36. 36. OLAP Operations (Pivot) All customers Thailand Japan TotalTime Food NonFood Food NonFood2006 2400 2200 11000 5000 206002007 4500 3200 12000 6000 25700 Pivot2008 5600 2900 10000 5500 24000Total 12500 8300 33000 16500 70300 All customers Location Thailand Japan Total Time all 2006 2007 2008 2006 2007 2008 continent Food 2400 4500 5600 11000 12000 10000 45500 country NonFood 2200 3200 2900 5000 6000 5500 24800 Total 4600 7700 8500 16000 18000 15500 70300Time branch all year month subcategory all quarter day item category Product buyer buyer group all Customer 36 Data Warehousing and Data Mining by Kritsada Sriphaew
  37. 37. OLAP Operations (Drill Across) All customers All customers Thailand Japan Total Thailand Japan TotalTime Time Food NonFood Food NonFood Food NonFood Food NonFood2006 2400 2200 11000 5000 20600 2006 1500 1500 6000 3000 120002007 4500 3200 12000 6000 25700 Drill Across 2007 3500 2500 8000 4000 180002008 5600 2900 10000 5500 24000 2008 4000 1500 5000 3000 13500Total 12500 8300 33000 16500 70300 Total 9000 5500 19000 10000 43500 Sales Purchase Product Product Drill Across Location Location 37 Data Warehousing and Data Mining by Kritsada Sriphaew
  38. 38. OLAP Operations (Drill Through) All customers Thailand Japan TotalTime Food NonFood Food NonFood2006 2400 2200 11000 5000 206002007 4500 3200 12000 6000 257002008 5600 2900 10000 5500 24000 In order to see the detail at the relationalTotal 12500 8300 33000 16500 70300 table, we can perform ‘drill through’. Drill Through Sales Date Branch Item Buyer Units sold Dollars sold 1/1/2008 Phuket VCD First Company 10 250 1/1/2008 Bangkok TV First Company 30 900 Product 10/1/2008 Phuket TV First Company 10 300 4/2/2008 Phuket Stereo First Company 40 200 15/2/2008 Bangkok VCD Best Company 30 750 2/5/2008 Bangkok Computer Best Company 20 600 Location 38 Data Warehousing and Data Mining by Kritsada Sriphaew
  39. 39. An Example of OLAP tools39 Data Warehousing and Data Mining by Kritsada Sriphaew
  40. 40. An Example of OLAP tools Organization (continent level) Measures Time (year level) Dimension 2 Dimensions: (1) Time (2) Organization40 Data Warehousing and Data Mining by Kritsada Sriphaew
  41. 41. An Example of OLAP tools41 Data Warehousing and Data Mining by Kritsada Sriphaew
  42. 42. An Example of OLAP tools42 Data Warehousing and Data Mining by Kritsada Sriphaew
  43. 43. Designing Data Warehouses Conceptual Modeling (Star schema, Snowflake, Fact constellations), Concept Hierarchy Physical Modeling (Data sources, Data storage, OLAP engine)43 Data Warehousing and Data Mining by Kritsada Sriphaew
  44. 44. Multi-Tiered Architecture Bottom Tier Middle Tier Top Tier Monitor & OLAP Server other Metadata sources Integrator AnalysisOperational Extract QueryDBs Transform Data Serve Reports Load Warehouse Data mining Data MartsData Sources Data Storage OLAP Engine Front-End Tools44 Data Warehousing and Data Mining by Kritsada Sriphaew
  45. 45. Three Physical Data Warehouse Models Enterprise Warehouse  collects all of the information about subjects spanning the entire organization Data Mart  a subset of corporate-wide data that is of value to a specific groups of users. Its scope is confined to specific, selected groups, such as marketing data mart  Independent vs. dependent (directly from warehouse) data mart Virtual Warehouse  A set of views over operational databases  Only some of the possible summary views may be materialized  Easy to build but require excess capacity on operational DB server. 45 Data Warehousing and Data Mining by Kritsada Sriphaew
  46. 46. OLAP Server Architectures Relational OLAP (ROLAP)  Use relational or extended-relational DBMS to store and manage warehouse data and OLAP middle ware to support missing pieces  Include optimization of DBMS backend, implementation of aggregation navigation logic, and additional tools/services  greater scalability, we may keep summary in relational DB Multidimensional OLAP (MOLAP)  Array-based multidimensional storage engine(sparse matrix techniques)  fast indexing to pre-computed summarized data Hybrid OLAP (HOLAP)  User flexibility, e.g., low level: relational, high-level: array Specialized SQL servers  specialized support for SQL queries over star/snowflake schemas in read- only environment. 46 Data Warehousing and Data Mining by Kritsada Sriphaew
  47. 47. Efficient Data Cube Computation Data cube can be viewed as a lattice of cuboids  The bottom-most cuboid is the base cuboid  The top-most cuboid (apex) contains only one cell  How many cuboids in an n-dimensional cube with L levels? n T   ( Li 1) i 1 Materialization of data cube  Materialize every (cuboid) (full materialization), none (no materialization), or some (partial materialization)  Selection of which cuboids to materialize  Based on size, sharing, access frequency, etc.47 Data Warehousing and Data Mining by Kritsada Sriphaew
  48. 48. Cube: A Lattice of Cuboids (Dimension) all 0-D (apex) cuboid time product location customer 1-D cuboidstime, item time, location item, location location, customer 2-D cuboids time, customer item, customer time, item, location time, location, customer 3-D cuboids time, item, customer item, location, customer 4-D(base) cuboid time, item, location, customer 48 Data Warehousing and Data Mining by Kritsada Sriphaew
  49. 49. Multi-way Array Aggregation for CubeComputation Partition arrays into chunks (a small subcube which fits in memory). Compressed sparse array addressing: (chunk_id, offset) Compute aggregates in “multiway” by visiting cube cells in the order which minimizes the # of times to visit each cell, and reduces memory access and storage cost. C c3 61 c2 45 62 63 64 46 47 48 c1 29 30 31 32 c0 b3 B13 14 15 16 60 What is the best traversing order 44 28 56 to do multi-way aggregation? b2 9B 40 24 52 b1 5 36 The size of the dimensions A, B, 20 and C is 6000, 1000, and 100000. b0 1 2 3 4 a0 a1 a2 a3 A 49 Data Warehousing and Data Mining by Kritsada Sriphaew
  50. 50. Multi-way Array Aggregation for CubeComputation C c3 61 c2 45 62 63 64 46 47 48 c1 29 30 31 32 c0 B13 14 15 16 60 b3 44 B 28 56 b2 9 40 24 52 b1 5 36 20 b0 1 2 3 4 a0 a1 a2 a3 A50 Data Warehousing and Data Mining by Kritsada Sriphaew
  51. 51. Multi-way Array Aggregation for CubeComputation C c3 61 c2 45 62 63 64 46 47 48 c1 29 30 31 32 c0 B13 14 15 16 60 b3 44 B 28 56 b2 9 40 24 52 b1 5 36 20 b0 1 2 3 4 a0 a1 a2 a3 A51 Data Warehousing and Data Mining by Kritsada Sriphaew
  52. 52. Multi-Way Array Aggregation for CubeComputation (Cont.) Method: the planes should be sorted and computed according to their size in ascending order.  Idea: keep the smallest plane in the main memory, fetch and compute only one chunk at a time for the largest plane Limitation of the method: computing well only for a small number of dimensions  If there are a large number of dimensions, “bottom-up computation” and iceberg cube computation methods can be explored. Iceberg cubes store only cube partitions where the aggregate value is above some min. support. 52 Data Warehousing and Data Mining by Kritsada Sriphaew
  53. 53. Multi-Way Array Aggregation(An Example) Suppose that a base cuboid has three dimensions A, B, C with the following numbers of cells: |A| = 6,000, |B| = 1,000, |C| = 50,000. Suppose that the dimensions A, B and C are partitioned into 6, 5 and 1000 portions for chunking, respectively.  If each cube cell stores one measure with 4 bytes, what is the total size of the computed cube if the cube is very dense? That is, calculate the size of a base cuboid.  The number of cells in the computed cube is  6 x 103 x 103 x 5 x 104 = 3x1011 cells.  The total size of the computed cube is  4 x 3 x 1011 = 1.2x1012 bytes.  State the order for computing the chunks in the cube that requires the least amount of space, and compute the total amount of main memory space required for computing the 2-D planes. 53 Data Warehousing and Data Mining by Kritsada Sriphaew
  54. 54. Multi-Way Array Aggregation for CubeComputation (Cont.) COne chunk includesA: 6000/6 = 1000 AB: 1000/5 = 200C: 50000/1000 = 50The space we need for Bcomputing the cube is (3) One chunk of plane AC:(1) All elements of plane AB: = 1000x50 cells = 6000x1000 cells = 50000 cells = 6000000 cells = 200000 bytes = 24000000 bytes (4) One cube of ABC:(2) One row of plane BC: = 1000x200x50 cells = 1000x50 cells = 10000000 cells = 50000 cells = 40000000 bytes = 200000 bytes Total memory for keeping the result = 64400000 bytes = 6.44x107 bytes.54 Data Warehousing and Data Mining by Kritsada Sriphaew
  55. 55. Efficient Processing OLAP Queries Determine which operations should be performed on the available cuboids:  transform drill, roll, etc. into corresponding SQL and/or OLAP operations, e.g, dice = selection + projection Determine to which materialized cuboid(s) the relevant operations should be applied. Exploring indexing structures and compressed vs. dense array structures in MOLAP 55 Data Warehousing and Data Mining by Kritsada Sriphaew
  56. 56. Metadata Repository Meta data is the data defining warehouse objects. It has the following kinds  The algorithms used for summarization  The mapping from operational environment to the data warehouse  Data related to system performance  warehouse schema, view and derived data definitions  Business data  business terms and definitions, ownership of data, charging policies56 Data Warehousing and Data Mining by Kritsada Sriphaew
  57. 57. Data Warehouse Back-End Tools and Utilities Extraction:  get data from multiple, heterogeneous, and external sources Transformation:  convert data from legacy or host format to warehouse format Load:  sort, summarize, consolidate, compute views, check integrity, and build indices and partitions Cleaning:  detect errors in the data and rectify them when possible Refresh:  propagate the updates from the data sources to the warehouse 57 Data Warehousing and Data Mining by Kritsada Sriphaew
  58. 58. From OLAP to On Line Analytical Mining (OLAM) Why online analytical mining?  High quality of data in data warehouses DW contains integrated, consistent, cleaned data  Available information processing structure surrounding data warehouses ODBC, OLEDB, Web accessing, service facilities, reporting and OLAP tools  OLAP-based exploratory data analysis mining with drilling, dicing, pivoting, etc.  On-line selection of data mining functions integration and swapping of multiple mining functions, algorithms, and tasks.58 Data Warehousing and Data Mining by Kritsada Sriphaew
  59. 59. An OLAM Architecture (Online Analytical Mining) Mining query Mining result Layer4 User Interface User GUI API Layer3 OLAM OLAP Engine Engine OLAP/OLAM Data Cube API Layer2 MDDB MDDB Meta Data Filtering&Integration Database API Filtering Layer1 Data cleaning Data Data Repository Databases Warehouse Data integration59 Data Warehousing and Data Mining by Kritsada Sriphaew
  60. 60. Major Applications of Data WarehousesApplications Online Analytical Processing (OLAP) Data Mining Customer Relationship Management (CRM)60 Data Warehousing and Data Mining by Kritsada Sriphaew
  61. 61. Some commercial tools forData Warehouse Microsoft Excel’s PivotTable Crystal Reports’ Business Objects IBM Cognos Microsoft SQL Server Analysis Services Oracle Express Microsoft Proclarity Etc. 61 Data Warehousing and Data Mining by Kritsada Sriphaew
  62. 62. Exercise 1 Given the following fact table Date Branch Item Buyer Units Dollars sold sold 1/01/2008 London Chair First Company 20 5000 1/01/2008 Bangkok TV First Company 30 9000 10/01/2008 London Ham Second Company 20 1000 4/02/2008 London Milk First Company 80 1600 15/02/2008 Bangkok VCD Best Company 30 7500 2/05/2008 Berlin Orange Best Company 20 500 12/06/2008 London VCD Good Company 20 5000 14/06/2008 Bangkok TV First Company 10 12000 16/07/2008 Phuket VCD Good Company 20 1000 24/07/2008 London Orange First Company 30 6000 2/08/2008 Bangkok Table Best Company 30 7500 12/08/2008 Bangkok Ham Best Company 20 500 12/10/2008 London Table Good Company 20 5000 14/11/2008 Bangkok TV First Company 8 2000 16/11/2008 Phuket Ham Good Company 5 2000 24/12/2008 London Milk First Company 30 600062 Knowledge Management and Discovery © Kritsada Sriphaew
  63. 63. Exercise 1 (cont.) Given the following dimensions of the data Location all continent country Time all year month branch subcategory Product all quarter day item category buyer buyer group all Customer63 Knowledge Management and Discovery © Kritsada Sriphaew
  64. 64. Exercise 1 (cont.) Question 1: Write concept hierarchies of time, location, product and customer.  Buyer groups are  Group1: Quality Group; good company, best company  Group2: Number Group; first company, second company  Product’s categories are  Cat1: Food  Subcategories are drink and non-drink  Cat2: NonFood  Subcategories are electronic and non-electronic64 Knowledge Management and Discovery © Kritsada Sriphaew
  65. 65. Exercise 1 (cont.) Question 2: Write a result table of this star-net query model with sum dollars sold. Location all continent country Time all year month branch subcategory Product all quarter day item category buyer buyer group all Customer65 Knowledge Management and Discovery © Kritsada Sriphaew
  66. 66. Exercise 1 (cont.) Question 3: Write a result table of this star-net query model with average units sold. Location all continent country Time all year month branch subcategory Product all quarter day item category buyer buyer group all Customer66 Knowledge Management and Discovery © Kritsada Sriphaew
  67. 67. Exercise 1 (cont.) Question 4: Write a result table of this star-net query model with sum dollars sold. Slice only Bangkok location. Location all continent country Time all year month branch Product subcategory all quarter day item category buyer buyer group all Customer67 Knowledge Management and Discovery © Kritsada Sriphaew
  68. 68. Exercise 2 (Data Cube Computation) Suppose that a base cuboid has three dimensions A, B, C with the following numbers of cells: |A| = 20,000, |B| = 8,000, |C| = 1,000. Also suppose that the dimensions A, B and C are partitioned into 10, 8 and 4 portions for chunking, respectively.  Question 1: If each cube cell stores one measure with 4 bytes, what is the total size of the computed cube if the cube is very dense? That is, calculate the size of a base cuboid.  The number of cells in the computed cube ?  The total size (bytes) of the computed cube ?  Question 2: State the order for computing the chunks in the cube that requires the least amount of space, and compute the total amount of main memory space required for computing the 2-D planes. 68 Knowledge Management and Discovery © Kritsada Sriphaew

×