
Data Mining Techniques: Data Warehousing and OLAP



  1. Slide 1. Data Mining Techniques: Data Warehousing and OLAP
       Mirek Riedewald
       Slides based on presentation by Han and Kamber

     Slide 2. Overview
       • What is a data warehouse?
       • A multi-dimensional data model
       • Data warehouse architecture
       • Data warehouse implementation
       • From data warehousing to data mining

     Slide 3. What is a Data Warehouse?
       • Many definitions
           • A decision support database that is maintained separately from the organization’s operational database
           • Supports information processing by providing a solid platform of consolidated, historical data for analysis
       • “A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision-making process.” (W. H. Inmon)
       • Data warehousing: the process of constructing and using data warehouses

     Slide 4. Data Warehouse: Subject-Oriented
       • Organized around major subjects, such as customer, product, sales
       • Focus: modeling and analysis of data for decision makers, not daily operations or transaction processing
       • Provides a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process

     Slide 5. Data Warehouse: Integrated
       • Integrates multiple heterogeneous data sources
           • Relational databases, flat files, on-line transaction records
       • Requires data cleaning and integration techniques
           • Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources
               • E.g., hotel price: currency, tax, breakfast covered, etc.
           • Data is converted when loaded into the warehouse

     Slide 6. Data Warehouse: Time-Variant
       • Longer time horizon than operational systems
           • Operational database: current value of data
               • E.g., current address of customer
           • Data warehouse: information from a historical perspective (e.g., past 5-10 years)
       • Every key structure in the data warehouse contains an element of time, explicitly or implicitly
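
As a rough illustration of the integration step on slide 5 (data is converted when loaded into the warehouse), the sketch below maps hotel prices from two sources onto one representation. Everything specific here is an assumption made for the example and not taken from the slides: the source record layouts, the field names, the exchange rate `USD_PER_EUR`, and the tax rate `TAX_RATE`.

```python
from dataclasses import dataclass

# Hypothetical conversion constants, for illustration only.
USD_PER_EUR = 1.10
TAX_RATE = 0.10


@dataclass(frozen=True)
class WarehousePrice:
    """Unified warehouse representation: USD, tax included, boolean breakfast flag."""
    hotel: str
    usd_incl_tax: float
    breakfast_included: bool


def from_source_a(rec: dict) -> WarehousePrice:
    """Source A (hypothetical): USD, price excludes tax, breakfast coded as 'Y'/'N'."""
    return WarehousePrice(
        hotel=rec["hotel_name"],
        usd_incl_tax=round(rec["price_usd"] * (1 + TAX_RATE), 2),
        breakfast_included=rec["breakfast"] == "Y",
    )


def from_source_b(rec: dict) -> WarehousePrice:
    """Source B (hypothetical): EUR, price already includes tax, breakfast as bool."""
    return WarehousePrice(
        hotel=rec["name"],
        usd_incl_tax=round(rec["eur_gross"] * USD_PER_EUR, 2),
        breakfast_included=rec["with_breakfast"],
    )


# Conversion happens once, at load time; queries then see one consistent format.
rows = [
    from_source_a({"hotel_name": "Alpha", "price_usd": 100.0, "breakfast": "Y"}),
    from_source_b({"name": "Beta", "eur_gross": 100.0, "with_breakfast": False}),
]
for row in rows:
    print(row)
```

The point of the design is that currency, tax treatment, and encoding differences are reconciled once during loading, so analysis queries never need to know which source a row came from.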
  2. Slide 7. Data Warehouse: Nonvolatile
       • A physically separate store of data transformed from the operational environment
       • No operational updates
           • New value added with new timestamp, instead of overwriting old value
           • Does not require transaction processing, recovery, and concurrency control mechanisms
           • Only two operations: loading of new data and reading data

     Slide 8. Data Warehouse vs. Heterogeneous DBMS
       • Heterogeneous DB integration: query-driven approach
           • Wrappers/mediators for heterogeneous databases
           • Query processing: a meta-dictionary is used to translate the user query into queries for the individual heterogeneous sites involved; results are integrated into a global answer set
           • Complex information filtering
       • Data warehouse: update-driven, high performance
           • Information from heterogeneous sources is integrated in advance and stored in warehouses for direct query and analysis

     Slide 9. Data Warehouse vs. Operational DBMS
       • OLTP (on-line transaction processing)
           • Major task of traditional relational DBMS
           • Day-to-day operations: purchasing, inventory, banking, manufacturing, payroll, registration, accounting
       • OLAP (on-line analytical processing)
           • Major task of data warehouse system
           • Data analysis and decision making

     Slide 10. OLTP vs. OLAP

                            OLTP                              OLAP
       users                clerk, IT professional            knowledge worker
       function             day-to-day operations             decision support
       DB design            application-oriented              subject-oriented
       data                 current, up-to-date, detailed,    historical, summarized, multidimensional,
                            flat relational, isolated         integrated, consolidated
       usage                repetitive                        ad-hoc
       access               read/write,                       lots of scans
                            index/hash on prim. key
       unit of work         short, simple transaction         complex query
       # records accessed   tens                              millions
       # users              thousands                         hundreds
       DB size              100MB-GB                          100GB-TB
       metric               transaction throughput            query throughput, response time

     Slide 11. Why a Separate Data Warehouse?
       • High performance for different workloads
           • DBMS: tuned for OLTP (fewer, simpler indices; concurrency control; recovery)
           • Warehouse: tuned for OLAP (sophisticated read-optimized indices, multidimensional view, consolidation)
       • Different functions and different data:
           • Historic data: needed for decision support (DS)
           • Data consolidation: DS requires consolidation (aggregation, summarization) of data from heterogeneous sources
           • Data quality: different sources typically use inconsistent data representations, codes and formats which have to be reconciled
       • Note: more and more systems perform OLAP analysis directly on relational databases

     Slide 12. Overview (outline repeated from slide 2)
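
The nonvolatile and time-variant properties (a new value is added with a new timestamp instead of overwriting the old one, and the only operations are loading and reading) can be sketched with a toy append-only store. The class name `AppendOnlyStore`, its methods, and the customer-address data are hypothetical and chosen only for illustration.

```python
from datetime import date


class AppendOnlyStore:
    """Toy nonvolatile store: rows are only ever loaded and read, never updated."""

    def __init__(self):
        self._rows = []  # list of (key, valid_from, value)

    def load(self, key, valid_from, value):
        """Loading appends a new timestamped row; old rows stay untouched."""
        self._rows.append((key, valid_from, value))

    def history(self, key):
        """All (timestamp, value) pairs for a key, oldest first."""
        return sorted((d, v) for k, d, v in self._rows if k == key)

    def as_of(self, key, when):
        """Most recent value with timestamp <= when, or None if there is none."""
        past = [(d, v) for d, v in self.history(key) if d <= when]
        return past[-1][1] if past else None


store = AppendOnlyStore()
store.load("customer-1", date(2019, 1, 1), "12 Oak St")
store.load("customer-1", date(2021, 6, 1), "7 Elm Ave")  # a move: new row, no overwrite

print(store.as_of("customer-1", date(2020, 1, 1)))  # address as of early 2020
print(store.as_of("customer-1", date(2022, 1, 1)))  # address as of early 2022
```

An operational database would typically keep only the current address; keeping every timestamped version is what lets the warehouse answer historical questions.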
  3. Slide 13. From Tables and Spreadsheets to Data Cubes
       • Multidimensional data model: attributes are all dimensions or measures
       • Dimension tables, e.g., item(item_name, brand, type) or time(day, week, month, quarter, year)
       • Fact table contains measures (e.g., dollar_amount) and keys to each of the related dimension tables

     Slide 14. Data Cube: A Lattice of Cuboids
       • 0-D (apex) cuboid: all
       • 1-D cuboids: time; item; location; supplier
       • 2-D cuboids: time,item; time,location; time,supplier; item,location; item,supplier; location,supplier
       • 3-D cuboids: time,item,location; time,item,supplier; time,location,supplier; item,location,supplier
       • 4-D (base) cuboid: time, item, location, supplier

     Slide 15. A Sample Data Cube
       [Figure: data cube with dimensions Date (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum), Product (TV, PC, VCR, sum), and Country (U.S.A., Canada, Mexico, sum); one aggregate cell is labeled "Total annual sales of TV in U.S.A."]

     Slide 16. Conceptual Modeling of Data Warehouses
       • Star schema: fact table in the middle, connected to a set of dimension tables
       • Snowflake schema: refinement of star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables
       • Fact constellation: multiple fact tables share dimension tables, viewed as a collection of stars
           • Also called galaxy schema

     Slide 17. Example of Star Schema
       • Sales Fact Table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales
       • time: time_key, day, day_of_the_week, month, quarter, year
       • item: item_key, item_name, brand, type, supplier_type
       • branch: branch_key, branch_name, branch_type
       • location: location_key, street, city, state_or_province, country

     Slide 18. Example of Snowflake Schema
       • Sales Fact Table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales
       • time: time_key, day, day_of_the_week, month, quarter, year
       • item: item_key, item_name, brand, type, supplier_key
       • supplier: supplier_key, supplier_type
       • branch: branch_key, branch_name, branch_type
       • location: location_key, street, city_key
       • city: city_key, city, state_or_province, country
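
The lattice of cuboids on slide 14 is simply the set of all subsets of the dimensions, one cuboid per group-by. A minimal sketch that enumerates it for the four dimensions named on the slide follows; the function name `cuboid_lattice` is made up for this example, and concept hierarchies within a dimension are ignored.

```python
from itertools import combinations

DIMENSIONS = ("time", "item", "location", "supplier")


def cuboid_lattice(dims):
    """Map k to the list of k-D cuboids, each a tuple of dimension names."""
    return {k: list(combinations(dims, k)) for k in range(len(dims) + 1)}


lattice = cuboid_lattice(DIMENSIONS)
for k, cuboids in lattice.items():
    # The empty group-by is the apex cuboid, conventionally labeled "all".
    labels = [",".join(c) if c else "all" for c in cuboids]
    print(f"{k}-D: {labels}")

total = sum(len(cuboids) for cuboids in lattice.values())
print("total cuboids:", total)  # 2**4 = 16 when each dimension has a single level
```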
  4. Slide 19. Example of Fact Constellation
       • Sales Fact Table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales
       • Shipping Fact Table: time_key, item_key, shipper_key, from_location, to_location; measures: dollars_cost, units_shipped
       • time: time_key, day, day_of_the_week, month, quarter, year
       • item: item_key, item_name, brand, type, supplier_type
       • branch: branch_key, branch_name, branch_type
       • location: location_key, street, city, province_or_state, country
       • shipper: shipper_key, shipper_name, location_key, shipper_type

     Slide 20. Concept Hierarchy: Dimension (location)
       • all: all
       • region: Europe ... North_America
       • country: Germany ... Spain (Europe); Canada ... Mexico (North_America)
       • city: Frankfurt ... (Germany); Vancouver ... Toronto (Canada)
       • office: L. Chan ... M. Wind

     Slide 21. Multidimensional Data
       • Sales volume as a function of product, month, and region
       • Dimensions: Product, Location, Time
       • Hierarchical summarization paths (coarsest to finest):
           • Industry, Category, Product
           • Region, Country, City, Office
           • Year, Quarter, Month or Week, Day
       [Figure: cube with axes Product, Region, and Month]

     Slide 22. Browsing a Data Cube
       • Visualization
       • OLAP capabilities
       • Interactive manipulation

     Slide 23. Typical OLAP Operations
       • Roll-up (drill-up): summarize data
           • Move up hierarchy for one or more dimensions
       • Drill-down (roll-down): reverse of roll-up
           • Move down hierarchy for one or more dimensions
       • Slice and dice: project and select
       • Pivot (rotate): reorient the cube, visualization, 3D to series of 2D planes
       • Other operations
           • Drill across: involving more than one fact table
           • Drill through: through the bottom level of the cube to its back-end relational tables (using SQL)

     Slide 24. [No text recovered for this slide.]
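
Roll-up, slice, and dice from slide 23 can be sketched on a tiny in-memory fact list. The city-to-country mapping follows the location hierarchy on slide 20; the fact rows, the sales figures, and the function names are invented for illustration, and a real OLAP engine would of course not scan a Python list.

```python
from collections import defaultdict

# One level of the location concept hierarchy: city -> country.
CITY_TO_COUNTRY = {"Frankfurt": "Germany", "Vancouver": "Canada", "Toronto": "Canada"}

# Hypothetical facts: (city, product, quarter, units_sold).
FACTS = [
    ("Frankfurt", "TV", "Q1", 10),
    ("Vancouver", "TV", "Q1", 5),
    ("Toronto", "TV", "Q1", 7),
    ("Toronto", "PC", "Q1", 3),
    ("Toronto", "TV", "Q2", 4),
]


def roll_up_city_to_country(facts):
    """Roll-up: move the location dimension up one level and re-aggregate."""
    totals = defaultdict(int)
    for city, product, quarter, units in facts:
        totals[(CITY_TO_COUNTRY[city], product, quarter)] += units
    return dict(totals)


def slice_quarter(facts, quarter):
    """Slice: fix one dimension to a single value and drop it from the result."""
    return [(city, product, units) for city, product, q, units in facts if q == quarter]


def dice(facts, cities, products):
    """Dice: select a sub-cube by restricting several dimensions at once."""
    return [f for f in facts if f[0] in cities and f[1] in products]


rolled = roll_up_city_to_country(FACTS)
print(rolled[("Canada", "TV", "Q1")])          # Vancouver + Toronto
print(slice_quarter(FACTS, "Q2"))
print(dice(FACTS, {"Toronto"}, {"TV", "PC"}))
```

Drill-down is the reverse of the roll-up shown here: it needs the finer-grained rows, so it can only be answered from data stored at (or below) the target level.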
  5. Slide 25. Overview (outline repeated from slide 2)

     Slide 26. [No text recovered for this slide.]

     Slide 27. Data Warehouse Architecture
       [Figure: four-tier architecture. Data Sources (operational DBs, other sources) are extracted, transformed, loaded, and refreshed, under a monitor and integrator, into Data Storage (data warehouse, data marts, metadata). Data Storage serves the OLAP Engine (OLAP server), which feeds the Front-End Tools (analysis, query, reports, data mining).]

     Slide 28. Three Data Warehouse Models
       • Enterprise warehouse
           • Information about subjects spanning the entire organization
       • Data mart
           • Subset of corporate-wide data that is of value to a specific group of users, e.g., marketing data mart
       • Virtual warehouse
           • A set of views over operational databases
           • Only some of the summary views materialized

     Slide 29. OLAP Server Architectures
       • Relational OLAP (ROLAP)
           • Use relational or extended-relational DBMS to store and manage warehouse data and OLAP middleware
           • Include optimization of DBMS backend, implementation of aggregation navigation logic
           • Great scalability, works well for sparse data
       • Multidimensional OLAP (MOLAP)
           • Sparse array-based multidimensional storage engine
           • Fast indexing to pre-computed summarized data
       • Hybrid OLAP (HOLAP)
           • Flexibility, e.g., low level: relational, high-level: array
       • Specialized SQL servers
           • Specialized support for SQL queries over star or snowflake schemas

     Slide 30. Overview (outline repeated from slide 2)
  6. Slide 31. Efficient Data Cube Computation
       • How many cuboids in an n-dimensional cube with $L_i$ levels in dimension $i$?
           $\prod_{i=1}^{n} (L_i + 1)$
       • Materialization of data cube
           • Every cuboid (full materialization)
           • No cuboid (no materialization)
           • Some cuboids (partial materialization)
       • Which cuboids to materialize?
           • Benefit (access frequency, sharing) vs. cost (storage, updates)

     Slide 32. Efficient Cube Computation
       • Can compute every cuboid from base cuboid
       • Cheaper: compute from appropriate aggregate cuboid
       [Figure: lattice of cuboids over time, item, location, supplier, from the apex cuboid "all" down to the base cuboid time, item, location, supplier]

     Slide 33. Categories of Aggregate Functions
       • Data set D = D1 ∪ … ∪ Dk, aggregate function F()
       • Distributive: F(D) = G(F(D1), …, F(Dk)) for some function G()
           • E.g., count(), sum(), min(), max()
       • Algebraic: F(D) = H(G(D1), …, G(Dk)) for some function H() and some M-tuple valued function G()
           • M = constant that is independent of |D|
           • E.g., avg(), standard_deviation()
       • Holistic: no constant bound on the storage size needed to describe a sub-aggregate F(Di)
           • E.g., median(), mode(), rank()
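
The three categories on slide 33 can be checked on a small partitioned data set. In this sketch the partitions `D1` and `D2` and all function names are made up for illustration: sum() is merged directly from partial sums (distributive), avg() is merged from a 2-tuple of (sum, count) per partition (algebraic, M = 2), and median() shows that a naive merge of per-partition results does not work for a holistic function.

```python
import statistics

# Hypothetical partitioning D = D1 ∪ D2.
D1 = [1, 2, 3]
D2 = [10, 20]
D = D1 + D2


def merge_sums(partial_sums):
    """Distributive: F = sum, and the combiner G is also sum."""
    return sum(partial_sums)


def avg_partial(chunk):
    """Algebraic: G returns a fixed-size tuple (sum, count), independent of len(chunk)."""
    return (sum(chunk), len(chunk))


def avg_merge(partials):
    """Algebraic: H combines the (sum, count) tuples into the overall average."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count


# Distributive: partial sums are enough.
assert merge_sums([sum(D1), sum(D2)]) == sum(D)

# Algebraic: two numbers per partition are enough.
assert avg_merge([avg_partial(D1), avg_partial(D2)]) == sum(D) / len(D)

# Holistic: the median of per-partition medians is not the overall median.
median_of_medians = statistics.median([statistics.median(D1), statistics.median(D2)])
true_median = statistics.median(D)
print(median_of_medians, true_median)
```

This distinction is why distributive and algebraic measures are cheap to compute from an already aggregated cuboid, while holistic measures generally need access to the underlying data.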