Data Warehousing and OLAP


Published on

  • Be the first to comment

No Downloads
Total views
On SlideShare
From Embeds
Number of Embeds
Embeds 0
No embeds

No notes for slide

Data Warehousing and OLAP

  1. 1. Data Warehouse A subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of decision- making process Modeling and analysis of data for decision makers, not for Data Warehousing and OLAP transaction processing (OLAP vs. OLTP) Constructed by integrating multiple heterogeneous data sources Dr. Weining Zhang Keep data with a historical perspective Permanently store data imported from the operational data sources W. Zhang Data Warehousing & OLAP 2 DW vs. DBMS OLTP vs. OLAP Wrapper/mediator vs. DW for information integration OLTP OLAP Mediator is query driven. Data are stored in heterogeneous users clerk, IT professional knowledge worker resources. Mediator distributes queries and integrates function day to day operations decision support answers DB design application-oriented subject-oriented DW is update-driven. Data are loaded to DW for query data current, up-to-date historical, detailed, flat relational summarized, multidimensional processing isolated integrated, consolidated usage repetitive ad-hoc DW vs. Operational DBMS access read/write lots of scans DBMS is for OLTP (on-line transaction processing) and index/hash on prim. key unit of work short, simple transaction complex query daily operation # records accessed tens millions DW is for OLAP (on-line analytical processing) and #users thousands hundreds decision-making DB size 100MB-GB 100GB-TB metric transaction throughput query throughput, response W. Zhang Data Warehousing & OLAP 3 W. Zhang Data Warehousing & OLAP 4 Why Separate DW From DBMS? Data Cube Model High performance for both systems Fact table and dimension tables DBMS is tuned for OLTP Sales Location Warehouse is tuned for OLAP Time Location Product Amount cid city state country Special requirements for decision making d1 c1 p1 12 c1 Dallas TX USA Query historical data not found in typical databases d1 c3 p1 50 c2 Toronto ON Canada d1 c1 p2 11 Consolidate (aggregate, summarize) data from heterogeneous c2 Chicago IL USA d1 c2 p2 8 sources … … … … d2 c1 p1 44 Must reconcile inconsistency in formats representations, and d2 c2 p1 4 codes for data from heterogeneous data sources … … … … dimensions Facts/measure W. Zhang Data Warehousing & OLAP 5 W. Zhang Data Warehousing & OLAP 6 Data Mining: Concepts and Techniques 1
  2. 2. Data Cube Model Define DW Schema View data as cubes Types of schemas Sales Star schema. A fact table with a set of simple dimension Time Location Product Amount tables d1 c1 p1 12 Snowflake schema. Refine a star schema by allowing some d1 c3 p1 50 all dimension to be modeled by a set of tables 56 4 50 110 d1 c1 p2 11 d2 44 12 4 50 48 50 Fact constellation. Multiple fact tables share dimension d1 c2 p2 8 d1 12 11 8 50 50 p1 12 11 8 50 62 tables. Viewed as a collection of stars, thus called galaxy d2 c1 p1 44 11 8 50 schema or fact constellation d2 c2 p1 4 p2 11 8 19 … … … … all 23 8 50 81 all c1 p1 56 c1 c2 c3 all … … … … all all all 129 W. Zhang Data Warehousing & OLAP 7 W. Zhang Data Warehousing & OLAP 8 Example of Star Schema Defining a Star Schema time define cube sales_star [time, item, branch, location]: time_key item day dimensions dollars_sold = sum(sales_in_dollars), avg_sales = item_key day_of_the_week Sales Fact Table item_name avg(sales_in_dollars), units_sold = count(*) month brand define dimension time as (time_key, day, day_of_week, month, quarter time_key type quarter, year) year supplier_type item_key define dimension item as (item_key, item_name, brand, type, branch_key supplier_type) branch location define dimension branch as (branch_key, branch_name, location_key branch_type) branch_key location_key branch_name units_sold street define dimension location as (location_key, street, city, branch_type city province_or_state, country) dollars_sold province_or_street country avg_sales Facts W. Zhang Data Warehousing & OLAP 9 W. Zhang Data Warehousing & OLAP 10 Example of Snowflake Schema Example of Fact Constellation time time time_key item Shipping Fact Table time_key item day item_key day_of_the_week Sales Fact Table item_name time_key day item_key supplier month Sales Fact Table brand day_of_the_week item_name supplier_key quarter time_key type item_key month brand supplier_type year supplier_type time_key shipper_key quarter type item_key year item_key supplier_key from_location branch_key branch_key location_key to_location location branch location branch location_key branch_key dollars_cost location_key units_sold location_key branch_key branch_name street units_sold street units_shipped branch_name branch_type dollars_sold city city_key city province_or_street branch_type dollars_sold avg_sales country shipper city_key avg_sales city Measures shipper_key province_or_street shipper_name Measures country location_key shipper_type W. Zhang Data Warehousing & OLAP 11 W. Zhang Data Warehousing & OLAP 12 Data Mining: Concepts and Techniques 2
  3. 3. Types of Aggregate of Measures A Sample Concept Hierarchy Distributive. Obtain identical result no matter applied to aggregated values or to fact values e.g., count(), all all sum(), min(), max(). Algebraic. Can be computed by an algebraic function region Europe ... North_America with M arguments (where M is a bounded integer), each of which is obtained by applying a distributive country Germany ... Spain Canada ... Mexico aggregate function, e.g., avg(), min_N(), standard_deviation(). Holistic. Has no constant bound on the storage size city Frankfurt ... Vancouver ... Toronto needed to describe a subaggregate, e.g.,median(), mode(), Rank(). office L. Chan ... M. Wind W. Zhang Data Warehousing & OLAP 13 W. Zhang Data Warehousing & OLAP 14 Lattice of Cuboids Cube-based OLAP Operations Each cube is a cuboid Roll-up, drill-down, dice, slice, pivot, etc all 56 4 50 110 Roll up Cuboids form a lattice based on dimensions in the cubes d2 44 4 d1 12 50 48 50 Drill down 12 11 8 50 50 p1 12 11 8 50 62 11 8 50 all all 0-D(apex) cuboid p2 11 8 19 all 67 12 50 all 23 8 50 81 product location c1 c2 c3 time 1-D cuboids c1 c2 c3 all Drill down Roll up on time on product all product,time product,location time, location all 129 2-D cuboids all all all p1 56 4 50 p1 110 11 8 Roll up on 3-D(base) cuboid p2 11 8 location p2 19 Roll up on product, time, location produce c1 c2 c3 all W. Zhang Data Warehousing & OLAP 15 W. Zhang Data Warehousing & OLAP 16 Cube-based OLAP Operations Design of a Data Warehouse all 56 4 50 110 A difficult and expensive process d2 d1 44 12 4 50 48 50 d1 12 11 8 50 50 p1 12 50 Involve business and technical considerations. Viewed 8 50 11 8 p1 12 11 62 11 8 from different perspectives: 11 8 50 Slice on p2 p2 11 8 19 d1 c1 c2 c3 What information are needed all 23 8 50 81 Models of data sources c1 c2 c3 all pivot Models of warehouse data Dice on d1, d2; How to use the data p1, p2; and c1, d1 c2 Need many experiments and prototypes d2 44 4 c3 50 d1 12 8 c2 p1 12 11 11 8 c1 11 12 p2 11 8 p2 p1 c1 c2 W. Zhang Data Warehousing & OLAP 17 W. Zhang Data Warehousing & OLAP 18 Data Mining: Concepts and Techniques 3
  4. 4. Data Warehouse Design Process Data Warehouse Architecture Typically Monitor Choose a business process to model, e.g., orders, invoices, Metadata & OLAP other etc. Integrator Server sources Choose the grain (atomic level of data) of the business Analysis process Operational Extract Query Data Serve Reports Choose the dimensions that will apply to each fact table DBs Transform record Load Warehouse Data mining Choose the measure that will populate each fact table record Refresh Data Marts Data Sources Data Storage OLAP Engine Front-End Tools W. Zhang Data Warehousing & OLAP 19 W. Zhang Data Warehousing & OLAP 20 Three Data Warehouse Models Data Warehouse Development Enterprise warehouse Multi-Tier Data collects all of the information about subjects spanning the Warehouse entire organization Distributed Data Marts Data Mart a subset of corporate-wide data that is of value to a specific groups of users. Enterprise Data Data • Independent vs. dependent (directly from warehouse) data Data Mart Mart mart Warehouse Virtual warehouse Model refinement Model refinement A set of views over operational databases Only some of the possible summary views may Define a high-level corporate data model W. Zhang Data Warehousing & OLAP 21 W. Zhang Data Warehousing & OLAP 22 OLAP Server Architectures Compute Cube in ROLAP Relational OLAP (ROLAP) Extend SQL to have a cube operator Use RDBMS or ORDBMS to store data and OLAP define cube sales[item, city, year]: middleware to support missing pieces sum(sales_in_dollars) Greater scalability compute cube sales Multidimensional OLAP (MOLAP) Compute each cuboid using a group by Store cubes in multidimensional arrays Time consuming Fast indexing to summarized data Pre-compute cuboids Hybrid OLAP (HOLAP) For n dimensions, there are at least 2n cuboids, mach more if E.g., low level: relational, high-level: array dimensions have hierarchies Specialized SQL servers • Takes too much space, is costly to maintain Support SQL queries over star/snowflake schemas W. Zhang Data Warehousing & OLAP 23 W. Zhang Data Warehousing & OLAP 24 Data Mining: Concepts and Techniques 4
  5. 5. Partial Materialization Compute Cube in MOLAP Pre-compute some cuboids and compute other cuboids Partition a multidimensional array into chunks that fits when they are needed in memory Select cuboid for materialization Compress sparse array to conserve space • Tradeoff among space limits, maintenance cost, & How to compute the entire cube by reading each chunk usefulness to query processing only once? Exploit materialized cuboids to answer queries Read chunks in some ordering • Choose cuboids to use, using indexes, translate cube Different ordering needs different buffer space (chunk operations on chosen cuboids memory) View (cuboid) maintenance A multi-way array aggregation tries to computer all K-D • Update the materialized cuboids when the source data is cuboids in parallel changed W. Zhang Data Warehousing & OLAP 25 W. Zhang Data Warehousing & OLAP 26 Multi-way Array Aggregation Multi-way Array Aggregation A all Assume reading chunks 1 to 64 in order For BC plane, compute 1 chunk at a time. Need to keep c3 61 62 63 64 partial result for each cell in the chunk C c2 45 46 47 48 B c1 29 130 2 31 2 32 2 For AC plane, compute 4 chunks in parallel. Must keep c0 1 1 1 1 60 1 1 1 1 2 B partial results for each cell in the 4 chunks b3 13 14 15 16 1 44 2 56 1 2840 For AB plane, compute all 16 chunks in parallel. Must keep b2 9 10 11 12 1 B 1 2436 2 52 partial results for all cells in that plane b1 5 6 7 8 220 Size=AB plane + 1 row of AC plane + 1 chunk of BC plane 2 b0 1 2 3 4 C a0 a1 a2 a3A all A C all W. Zhang Data Warehousing & OLAP 27 W. Zhang Data Warehousing & OLAP 28 Indexing Cube Using Bitmaps Indexing Cube Using Join Index Each value in the column has a bit vector Join index: JI(R-id, S-id) where R (R-id, …) joins with Each tuple has a bit in the bit vector S (S-id, …) on a join condition The i-th bit is set if the i-th tuple of the base table has the Relates the values of the dimensions of a star schema to value for the indexed column rows in the fact table. Use bit operations to search for tuples E.g. fact table Sales and two dimensions location and product • A join index on location maintains for each distinct city a Base table Index on Region Index on Type list of sales tuples recording the Sales in the city Cust Region Type RecIDAsia Europe America RecID Retail Dealer Join indices can span multiple dimensions C1 Asia Retail 1 1 0 0 1 1 0 C2 Europe Dealer 2 0 1 0 2 0 1 C3 Asia Dealer 3 1 0 0 3 0 1 C4 America Retail 4 0 0 1 4 1 0 C5 Europe Dealer 5 0 1 0 5 0 1 W. Zhang Data Warehousing & OLAP 29 W. Zhang Data Warehousing & OLAP 30 Data Mining: Concepts and Techniques 5
  6. 6. Processing OLAP Queries Data Warehouse Tools & Utilities Determine cube operations to perform on available Data extraction: cuboids: get data from heterogeneous external sources transform drill, roll, etc. into corresponding SQL and/or Data cleaning: OLAP operations, e.g, dice = selection + projection detect & rectify errors in the data Determine to which materialized cuboid(s) the relevant Data transformation: operations should be applied. convert data from host format to DW format Exploring indexing structures and compressed vs. Load: dense array structures in MOLAP sort, summarize, consolidate, compute views, check integrity, & build indicies and partitions Refresh: propagate source updates to the warehouse W. Zhang Data Warehousing & OLAP 31 W. Zhang Data Warehousing & OLAP 32 DW and Data Mining Summary DW prepares data for mining Data warehouse Integrated, consistent, cleaned A subject-oriented, integrated, time-variant, and nonvolatile DW provides infrastructure for data collection and collection of data in support of management’s decision- making process analysis A multi-dimensional model of a data warehouse OLAP is a simple form of mining & data exploration Star schema, snowflake schema, fact constellations Mining should be a part of DW operations A data cube consists of dimensions & measures Provide more powerful tools to analyze data and extract knowledge W. Zhang Data Warehousing & OLAP 33 W. Zhang Data Warehousing & OLAP 34 Summary OLAP operations: drilling, rolling, slicing, dicing and pivoting OLAP servers: ROLAP, MOLAP, HOLAP Efficient computation of data cubes Partial vs. full vs. no materialization Multi-way array aggregation Bitmap index and join index implementations Further development of data cube technology Discovery-drive and multi-feature cubes From OLAP to OLAM (on-line analytical mining) W. Zhang Data Warehousing & OLAP 35 Data Mining: Concepts and Techniques 6