Data Mining:
Transcript

  • 1. Data Mining: Concepts and Techniques, Chapter 3 (September 22, 2009)
    Slides slightly adapted from the original slides by Jiawei Han, Department of Computer Science, University of Illinois at Urbana-Champaign, www.cs.uiuc.edu/~hanj. ©2006 Jiawei Han and Micheline Kamber, All rights reserved.
    Chapter 3: Data Warehousing and OLAP Technology: An Overview
      • What is a data warehouse?
      • A multi-dimensional data model
      • Data warehouse architecture
      • Data warehouse implementation
      • From data warehousing to data mining
    What is Data Warehouse?
      • Defined in many different ways, but not rigorously.
          • A decision support database that is maintained separately from the organization's operational database
          • Supports information processing by providing a solid platform of consolidated, historical data for analysis
      • "A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision-making process." (W. H. Inmon)
      • Data warehousing: the process of constructing and using data warehouses
    Data Warehouse: Subject-Oriented
      • Organized around major subjects, such as customer, product, sales; or patient, disease, gene, protein-class, etc.
      • Focuses on the modeling and analysis of data for decision makers, not on daily operations or transaction processing
      • Provides a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process
  • 2. Data Warehouse: Integrated
      • Constructed by integrating multiple, heterogeneous data sources
          • relational databases, flat files, on-line transaction records
      • Data cleaning and data integration techniques are applied.
          • Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources
          • E.g., hotel prices at different international locations: currency, tax, breakfast covered, etc.
          • When data is moved to the warehouse, it is converted.
    Data Warehouse: Time Variant
      • The time horizon for the data warehouse is significantly longer than that of operational systems
          • Operational database: current value data
          • Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years)
      • Every key structure in the data warehouse
          • Contains an element of time, explicitly or implicitly
          • But the key of operational data may or may not contain a "time element" => time derived
    Data Warehouse: Nonvolatile
      • A physically separate store of data transformed from the operational environment
      • Operational update of data does not occur in the data warehouse environment but in the operational data sources themselves
          • Does not require transaction processing, recovery, and concurrency control mechanisms
          • Requires only two operations in data accessing: initial loading of data and access of data
    Data Warehouse vs. Heterogeneous DBMS
      • Traditional heterogeneous DB integration: a query-driven approach
          • Build wrappers/mediators on top of heterogeneous databases
          • When a query is posed to a client site, a meta-dictionary is used to translate the query into queries appropriate for the individual heterogeneous sites involved, and the results are integrated into a global answer set
          • => Complex information filtering, compete for resources
      • Data warehouse: update-driven, high performance
          • Information from heterogeneous sources is integrated in advance and stored in warehouses for direct query and analysis
  • 3. Data Warehouse vs. Operational DBMS
      • OLTP (on-line transaction processing)
          • Major task of traditional relational DBMS
          • Day-to-day operations: purchasing, inventory, banking, manufacturing, payroll, registration, accounting, etc.
      • OLAP (on-line analytical processing)
          • Major task of data warehouse system
          • Data analysis and decision making
      • Distinct features (OLTP vs. OLAP):
          • User and system orientation: customer vs. market
          • Data contents: current, detailed vs. historical, consolidated
          • Database design: ER + application vs. star + subject
          • View: current, local vs. evolutionary, integrated
          • Access patterns: updates vs. read-only but complex queries
    OLTP vs. OLAP
                              OLTP                                  OLAP
        users                 clerk, IT professional                knowledge worker
        function              day to day operations                 decision support
        DB design             application-oriented                  subject-oriented
        data                  current, up-to-date; detailed,        historical; summarized,
                              flat relational; isolated             multidimensional; integrated, consolidated
        usage                 repetitive                            ad-hoc
        access                read/write; index/hash on prim. key   lots of scans
        unit of work          short, simple transaction             complex query
        # records accessed    tens                                  millions
        # users               thousands                             hundreds
        DB size               100MB-GB                              100GB-TB
        metric                transaction throughput                query throughput, response
    Why Separate Data Warehouse?
      • High performance for both systems
          • DBMS: tuned for OLTP (access methods, indexing, concurrency control, recovery)
          • Warehouse: tuned for OLAP (complex OLAP queries, multidimensional view, consolidation)
      • Different functions and different data:
          • missing data: decision support (DS) requires historical data which operational DBs do not typically maintain
          • data consolidation: DS requires consolidation (aggregation, summarization) of data from heterogeneous sources
          • data quality: different sources typically use inconsistent data representations, codes and formats which have to be reconciled
      • Note: There are more and more systems which perform OLAP analysis directly on relational databases (… one size fits all?)
  • 4. From Tables and Spreadsheets to Data Cubes
      • A data warehouse is based on a multidimensional data model which views data in the form of a data cube
      • A data cube, such as Sales, allows data to be modeled and viewed in multiple dimensions
          • Dimension tables, such as item (item_name, brand, type), or time (day, week, month, quarter, year), location (…), etc.
          • Fact table contains measures (such as dollars_sold) and keys to each of the related dimension tables
      • In data warehousing literature, an n-dimensional base cube is called a base cuboid. The topmost 0-dimensional cuboid, which holds the highest level of summarization, is called the apex cuboid. The lattice of cuboids forms a data cube.
    Cube: A Lattice of Cuboids (Sales Data Cube)
      • 0-D (apex) cuboid: all
      • 1-D cuboids: time; item; location; supplier
      • 2-D cuboids: time,item; time,location; time,supplier; item,location; item,supplier; location,supplier
      • 3-D cuboids: time,item,location; time,item,supplier; time,location,supplier; item,location,supplier
      • 4-D (base) cuboid: time, item, location, supplier
    Conceptual Modeling of Data Warehouses
      • Modeling data warehouses: dimensions & measures
          • Star schema: a fact table (e.g., sales) in the middle connected to a set of dimension tables (e.g., time, item, location, etc.)
          • Snowflake schema: a refinement of a star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to a snowflake
          • Fact constellations: multiple fact tables share dimension tables, viewed as a collection of stars, therefore called galaxy schema or fact constellation
    Example of Star Schema
      • time: time_key, day, day_of_the_week, month, quarter, year
      • item: item_key, item_name, brand, type, supplier_type
      • branch: branch_key, branch_name, branch_type
      • location: location_key, street, city, state_or_province, country
      • Sales Fact Table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales
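The star schema on the slide above can be tried out with Python's built-in sqlite3 module. This is only a minimal sketch, not part of the original slides: the column names follow the slide, but the table names `time_dim`, `item_dim` and `sales_fact`, the column types, the sample rows, and the use of SQLite are all assumptions made for illustration, and only two of the four dimensions are created.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dim (time_key INTEGER PRIMARY KEY, day INTEGER,
                       month INTEGER, quarter TEXT, year INTEGER);
CREATE TABLE item_dim (item_key INTEGER PRIMARY KEY, item_name TEXT,
                       brand TEXT, type TEXT, supplier_type TEXT);
-- Fact table: one foreign key per dimension, plus the measures.
CREATE TABLE sales_fact (time_key INTEGER REFERENCES time_dim(time_key),
                         item_key INTEGER REFERENCES item_dim(item_key),
                         units_sold INTEGER, dollars_sold REAL);
""")

# Hypothetical sample rows, invented for this sketch.
conn.executemany("INSERT INTO time_dim VALUES (?, ?, ?, ?, ?)",
                 [(1, 15, 1, "Q1", 2004), (2, 20, 4, "Q2", 2004)])
conn.executemany("INSERT INTO item_dim VALUES (?, ?, ?, ?, ?)",
                 [(1, "TV-100", "Acme", "home entertainment", "wholesale"),
                  (2, "PC-200", "Zenith", "computer", "retail")])
conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?, ?)",
                 [(1, 1, 2, 800.0), (1, 2, 1, 1200.0), (2, 1, 3, 1200.0)])

# A typical star join: the fact table is joined to each dimension it
# needs, then the measures are aggregated per (quarter, brand).
rows = conn.execute("""
    SELECT t.quarter, i.brand, SUM(s.dollars_sold), SUM(s.units_sold)
    FROM sales_fact AS s
    JOIN time_dim AS t ON s.time_key = t.time_key
    JOIN item_dim AS i ON s.item_key = i.item_key
    GROUP BY t.quarter, i.brand
    ORDER BY t.quarter, i.brand
""").fetchall()

for row in rows:
    print(row)
```

A snowflake variant would move `supplier_type` into its own supplier table that `item_dim` references by key. The fact table would stay the same.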
  • 5. Cube Definition Syntax (BNF) in DMQL
      • Cube Definition (Fact Table)
          define cube <cube_name> [<dimension_list>]: <measure_list>
      • Dimension Definition (Dimension Table)
          define dimension <dimension_name> as (<attribute_or_subdimension_list>)
      • Special Case (Shared Dimension Tables)
          • First time as "cube definition"
          define dimension <dimension_name> as <dimension_name_first_time> in cube <cube_name_first_time>
    Example of Star Schema
      • time: time_key, day, day_of_the_week, month, quarter, year
      • item: item_key, item_name, brand, type, supplier_type
      • branch: branch_key, branch_name, branch_type
      • location: location_key, street, city, state_or_province, country
      • Sales Fact Table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales
    Example 1: Defining Star Schema in DMQL
          define cube sales_star [time, item, branch, location]:
              dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*)
          define dimension time as (time_key, day, day_of_week, month, quarter, year)
          define dimension item as (item_key, item_name, brand, type, supplier_type)
          define dimension branch as (branch_key, branch_name, branch_type)
          define dimension location as (location_key, street, city, province_or_state, country)
    Example of Snowflake Schema
      • time: time_key, day, day_of_the_week, month, quarter, year
      • item: item_key, item_name, brand, type, supplier_key
      • supplier: supplier_key, supplier_type
      • branch: branch_key, branch_name, branch_type
      • location: location_key, street, city_key
      • city: city_key, city, state_or_province, country
      • Sales Fact Table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales
  • 6. Example 2: Defining Snowflake Schema in DMQL
          define cube sales_snowflake [time, item, branch, location]:
              dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*)
          define dimension time as (time_key, day, day_of_week, month, quarter, year)
          define dimension item as (item_key, item_name, brand, type, supplier(supplier_key, supplier_type))
          define dimension branch as (branch_key, branch_name, branch_type)
          define dimension location as (location_key, street, city(city_key, province_or_state, country))
    Example of Fact Constellation
      • Sales Fact Table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales
      • Shipping Fact Table: time_key, item_key, shipper_key, from_location, to_location; measures: dollars_cost, units_shipped
      • time: time_key, day, day_of_the_week, month, quarter, year
      • item: item_key, item_name, brand, type, supplier_type
      • branch: branch_key, branch_name, branch_type
      • location: location_key, street, city, province_or_state, country
      • shipper: shipper_key, shipper_name, location_key, shipper_type
    Example 3: Defining Fact Constellation in DMQL
          define cube sales [time, item, branch, location]:
              dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*)
          define dimension time as (time_key, day, day_of_week, month, quarter, year)
          define dimension item as (item_key, item_name, brand, type, supplier_type)
          define dimension branch as (branch_key, branch_name, branch_type)
          define dimension location as (location_key, street, city, province_or_state, country)
          define cube shipping [time, item, shipper, from_location, to_location]:
              dollar_cost = sum(cost_in_dollars), unit_shipped = count(*)
          define dimension time as time in cube sales
          define dimension item as item in cube sales
          define dimension shipper as (shipper_key, shipper_name, location as location in cube sales, shipper_type)
          define dimension from_location as location in cube sales
          define dimension to_location as location in cube sales
    Measures of Data Cube: Three Categories
      • Distributive: if the result derived by applying the function to n aggregate values is the same as that derived by applying the function on all the data without partitioning
          • E.g., count(), sum(), min(), max()
          • sum(all) = sum(europe) + sum(america) + sum(asia) + …
      • Algebraic: if it can be computed by an algebraic function with M arguments (where M is a bounded integer), each of which is obtained by applying a distributive aggregate function
          • E.g., avg(), min_N(), standard_deviation()
          • avg(all) = sum(all) / #items (arguments: sum(all) and #items)
      • Holistic: if there is no constant bound on the storage size needed to describe a subaggregate
          • E.g., median(), mode(), rank()
          • median(all) = … no constant-sized subaggregates for computing median
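The three measure categories can be checked on a small partitioned data set. The sketch below is illustrative only: the partition names echo the slide's sum(europe) + sum(america) + sum(asia) example, but the numbers are made up.

```python
from statistics import median

# Hypothetical sales values, partitioned by region.
partitions = {"europe": [4, 8, 15], "america": [16, 23], "asia": [42]}
all_values = [v for part in partitions.values() for v in part]

# Distributive: the sum of the per-partition sums equals the sum over all data.
total = sum(sum(part) for part in partitions.values())

# Algebraic: avg needs only two distributive subaggregates, sum and count.
count = sum(len(part) for part in partitions.values())
average = total / count

# Holistic: per-partition medians are not enough to recover the overall
# median; here combining them gives a different answer.
median_of_medians = median(median(part) for part in partitions.values())
overall_median = median(all_values)

print(total, average, median_of_medians, overall_median)
```

For the median, the only safe subaggregate is the partition's full list of values, and its size grows with the data. That is what the slide means by "no constant bound".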
  • 7. A Concept Hierarchy: Dimension (location)
      • all: all
      • region: Europe ... North_America
      • country: Germany ... Spain (under Europe); Canada ... Mexico (under North_America)
      • city: Frankfurt ... (under Germany); Vancouver ... Toronto (under Canada)
      • office: L. Chan ... M. Wind
    View of Warehouses and Hierarchies
      • Specification of hierarchies
          • Schema hierarchy: day < {month < quarter; week} < year
          • Set_grouping hierarchy: {1..10} < inexpensive
    Multidimensional Data
      • Sales volume as a function of product, month, and region
      • Dimensions: Product, Location, Time
      • Hierarchical summarization paths
          • Product: Industry, Category, Product
          • Location: Region, Country, City, Office
          • Time: Year, Quarter, Month or Week, Day
    A Sample Data Cube
      • [Figure: a 3-D cuboid with axes Date (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum), Product (TV, PC, VCR, sum) and Country (U.S.A., Canada, Mexico, sum). Annotated cells: total sales of TVs in the 1st quarter in the U.S.A.; total annual sales of TVs in the U.S.A.; total sales in the U.S.A. in 1Qtr. Summarizing leads from the 3-D cuboid to the 2-D, 1-D and 0-D cuboids.]
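A concept hierarchy such as the location dimension above can be stored as a child-to-parent map. The sketch below is an assumption about representation, not something the slides prescribe. The member names come from the slide, and the parent links assume the usual geography (Frankfurt in Germany, Vancouver and Toronto in Canada).

```python
# Child -> parent links for the location dimension (city < country < region < all).
PARENT = {
    "Frankfurt": "Germany", "Vancouver": "Canada", "Toronto": "Canada",
    "Germany": "Europe", "Spain": "Europe",
    "Canada": "North_America", "Mexico": "North_America",
    "Europe": "all", "North_America": "all",
}

def ancestors(member):
    """Return the chain of higher-level concepts above `member`, nearest first."""
    path = []
    while member in PARENT:
        member = PARENT[member]
        path.append(member)
    return path

print(ancestors("Toronto"))
```

Rolling a measure up from city to country then amounts to replacing each city by the first entry of its ancestor chain and re-aggregating.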
  • 8. Cuboids Corresponding to the Cube
      • 0-D (apex) cuboid: all
      • 1-D cuboids: product; date; country
      • 2-D cuboids: product,date; product,country; date,country
      • 3-D (base) cuboid: product, date, country
    Browsing a Data Cube
      • Visualization
      • OLAP capabilities
      • Interactive manipulation
    Typical OLAP Operations
      • Roll up (drill-up): summarize data
          • by climbing up hierarchy or by dimension reduction
      • Drill down (roll down): reverse of roll-up
          • from higher level summary to lower level summary or detailed data, or introducing new dimensions
      • Slice and dice: project and select
      • Pivot (rotate):
          • reorient the cube, visualization, 3D to series of 2D planes
      • Other operations
          • drill across: involving (across) more than one fact table
          • drill through: through the bottom level of the cube to its back-end relational tables (using SQL)
    Typical OLAP Operations (figure)
      • [Figure: a cube with dimensions location (cities: Chicago, New York, Toronto, Vancouver), time (quarters: Q1, Q2, Q3, Q4) and item (types: home entertainment, computer, phone, security). Values shown on the central cube: Chicago 440, New York 1560, Toronto 395, and the Q1 row 605, 825, 14, 400. Operations shown: dice for (location = "Toronto" or "Vancouver") and (time = "Q1" or "Q2") and (item = "home entertainment" or "computer"); roll-up on location (from cities to countries), giving values 2000 and 1000 for USA and Canada; slice for time = "Q1"; drill-down on time (from quarters to months), with January 150, February 100, March 150; pivot.]
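The roll-up, slice and dice operations can be illustrated on a tiny cube stored as a dictionary. This is a rough sketch only. The cell values are the ones legible in the figure, but which cells they belong to is an assumption, chosen because 395 + 605 = 1000 and 440 + 1560 = 2000 match the rolled-up totals. The function names are hypothetical.

```python
from collections import defaultdict

# (city, quarter, item) -> sales; only a handful of cells are filled in.
cube = {
    ("Chicago", "Q1", "home entertainment"): 440,
    ("New York", "Q1", "home entertainment"): 1560,
    ("Toronto", "Q1", "home entertainment"): 395,
    ("Vancouver", "Q1", "home entertainment"): 605,
    ("Vancouver", "Q1", "computer"): 825,
    ("Vancouver", "Q1", "phone"): 14,
    ("Vancouver", "Q1", "security"): 400,
}
COUNTRY = {"Chicago": "USA", "New York": "USA",
           "Toronto": "Canada", "Vancouver": "Canada"}

def roll_up_location(cells):
    """Roll-up: climb the location hierarchy from cities to countries."""
    out = defaultdict(int)
    for (city, quarter, item), value in cells.items():
        out[(COUNTRY[city], quarter, item)] += value
    return dict(out)

def slice_time(cells, quarter):
    """Slice: fix one value of the time dimension, leaving a 2-D table."""
    return {(c, i): v for (c, q, i), v in cells.items() if q == quarter}

def dice(cells, cities, quarters, items):
    """Dice: select a sub-cube by restricting two or more dimensions."""
    return {k: v for k, v in cells.items()
            if k[0] in cities and k[1] in quarters and k[2] in items}

rolled = roll_up_location(cube)
q1 = slice_time(cube, "Q1")
diced = dice(cube, {"Toronto", "Vancouver"}, {"Q1", "Q2"},
             {"home entertainment", "computer"})
print(rolled[("Canada", "Q1", "home entertainment")], len(q1), len(diced))
```

Drill-down is the reverse of `roll_up_location` and needs the finer-grained data to be available. Pivot only changes how the same cells are laid out for display.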
  • 9. Typical OLAP Operations (figure, continued)
      • [Figure: the slice, dice, roll-up, drill-down and pivot diagram on the location (cities) / time (quarters) / item (types) cube, repeated across two slides.]
    A Star-Net Query Model
      • [Figure: radial lines, one per dimension, with abstraction levels marked along each line. Dimension labels: Customer Orders, Shipping Method, Customer, Time, Product, Location, Promotion, Organization. Level labels include CONTRACTS, ORDER; AIR-EXPRESS, TRUCK; ANNUALLY, QTRLY, DAILY; PRODUCT LINE, PRODUCT ITEM, PRODUCT GROUP; CITY, COUNTRY, REGION; SALES PERSON, DISTRICT, DIVISION.]
      • Each circle is called a footprint
  • 10. Design of Data Warehouse: A Business Analysis Framework
      • Four views regarding the design of a data warehouse
          • Top-down view: allows selection of the relevant information necessary for the data warehouse
          • Data source view: exposes the information being captured, stored, and managed by operational systems
          • Data warehouse view: consists of fact tables and dimension tables
          • Business query view: sees the perspectives of data in the warehouse from the view of the end user
    Data Warehouse Design Process
      • Top-down, bottom-up approaches or a combination of both
          • Top-down: starts with overall design and planning (mature)
          • Bottom-up: starts with experiments and prototypes (rapid)
      • From a software engineering point of view
          • Waterfall: structured and systematic analysis at each step before proceeding to the next
          • Spiral: rapid generation of increasingly functional systems, short turnaround time
      • Typical data warehouse design process
          • Choose a business process to model, e.g., orders, invoices, etc.
          • Choose the grain (atomic level of data) of the business process
          • Choose the dimensions that will apply to each fact table record
          • Choose the measure that will populate each fact table record
    Data Warehouse: A Multi-Tiered Architecture
      • [Figure: Data Sources (operational DBs, other sources) feed Extract / Transform / Load / Refresh, with Monitor & Integrator and Metadata, into Data Storage (data warehouse, data marts). Data Storage serves the OLAP Engine (OLAP server), which feeds the Front-End Tools (analysis, query, reports, data mining).]
    Three Data Warehouse Models
      • Enterprise warehouse
          • collects all of the information about subjects spanning the entire organization
      • Data mart
          • a subset of corporate-wide data that is of value to a specific group of users. Its scope is confined to specific, selected groups, such as a marketing data mart
          • Independent vs. dependent (directly from warehouse) data mart
      • Virtual warehouse
          • A set of views over operational databases
          • Only some of the possible summary views may be materialized
  • 11. Data Warehouse Development: A Recommended Approach
      • [Figure: define a high-level corporate data model; then, with model refinement, build data marts and the enterprise data warehouse; these lead to distributed data marts and finally to a multi-tier data warehouse.]
    Data Warehouse Back-End Tools and Utilities
      • Data extraction: get data from multiple, heterogeneous, and external sources
      • Data cleaning: detect errors in the data and rectify them when possible
      • Data transformation: convert data from legacy or host format to warehouse format
      • Load: sort, summarize, consolidate, compute views, check integrity, and build indices and partitions
      • Refresh: propagate the updates from the data sources to the warehouse
    Metadata Repository
      • Metadata is the data defining warehouse objects. It stores:
          • Description of the structure of the data warehouse: schema, view, dimensions, hierarchies, derived data definition, data mart locations and contents
          • Operational metadata: data lineage (history of migrated data and transformation path), currency of data (active, archived, or purged), monitoring information (warehouse usage statistics, error reports, audit trails)
          • The algorithms used for summarization
          • Mapping from operational environment to the data warehouse
          • Data related to system performance: warehouse schema, view and derived data definitions
          • Business data: business terms and definitions, ownership of data, charging policies
    OLAP Server Architectures
      • Relational OLAP (ROLAP)
          • Use relational or extended-relational DBMS to store and manage warehouse data and OLAP middleware
          • Include optimization of DBMS backend, implementation of aggregation navigation logic, and additional tools and services
          • Greater scalability
      • Multidimensional OLAP (MOLAP)
          • Sparse array-based multidimensional storage engine
          • Fast indexing to pre-computed summarized data
      • Hybrid OLAP (HOLAP) (e.g., Microsoft SQLServer)
          • Flexibility, e.g., low level: relational, high level: array
      • Specialized SQL servers (e.g., Redbricks)
          • Specialized support for SQL queries over star/snowflake schemas
  • 12. Efficient Data Cube Computation
      • Data cube can be viewed as a lattice of cuboids
          • The bottom-most cuboid is the base cuboid
          • The top-most cuboid (apex) contains only one cell
          • How many cuboids in an n-dimensional cube with L levels?
              $T = \prod_{i=1}^{n} (L_i + 1)$, where $L_i$ is the number of conceptual levels associated with dimension i.
      • Materialization of data cube
          • Materialize every cuboid (full materialization), none (no materialization), or some (partial materialization)
          • Selection of which cuboids to materialize: based on size, sharing, access frequency, etc.
    Cube Operation
      • Cube definition and computation in DMQL
          define cube sales[item, city, year]: sum(sales_in_dollars)
          compute cube sales
      • Transform it into a SQL-like language (with a new operator cube by, introduced by Gray et al.'96)
          SELECT item, city, year, SUM (amount)
          FROM SALES
          CUBE BY item, city, year
      • Need to compute the following group-bys:
          (date, product, customer), (date, product), (date, customer), (product, customer), (date), (product), (customer), ()
      • [Figure: lattice of group-bys: (); (city), (item), (year); (city, item), (city, year), (item, year); (city, item, year).]
    Iceberg Cube
      • Computing only the cuboid cells whose count or other aggregates satisfy a condition like
          HAVING COUNT(*) >= minsup
      • Motivation
          • Only a small portion of cube cells may be "above the water" in a sparse cube
          • Only calculate "interesting" cells: data above a certain threshold
          • Avoid explosive growth of the cube
          • Suppose 100 dimensions, only 1 base cell. How many aggregate cells if count >= 1? What about count >= 2?
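The cuboid count and the list of group-bys implied by CUBE BY can be reproduced in a few lines. This is a sketch under stated assumptions: the helper names `total_cuboids` and `group_bys` are hypothetical, and each L_i is taken to exclude the virtual top level "all", which is what the "+ 1" in the formula accounts for.

```python
from itertools import combinations
from math import prod

def total_cuboids(levels):
    """T = product over dimensions of (L_i + 1), with L_i levels in dimension i."""
    return prod(l + 1 for l in levels)

def group_bys(dimensions):
    """All 2^n subsets of the dimensions, from the base cuboid down to the apex ()."""
    return [combo
            for k in range(len(dimensions), -1, -1)
            for combo in combinations(dimensions, k)]

cuboids = group_bys(("item", "city", "year"))
for g in cuboids:
    print(g)

# With one level per dimension (no hierarchies) the formula gives 2^n.
print(total_cuboids([1, 1, 1]))
# Hypothetical hierarchy depths: 3 levels on one dimension, 4 on another.
print(total_cuboids([3, 4]))
```

A partial materialization strategy would pick a subset of `cuboids` to precompute rather than all eight.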
  • 13. Indexing OLAP Data: Bitmap Index
      • Index on a particular column
      • Each value in the column has a bit vector: bit-op is fast
      • There is one bit vector per distinct value in the domain; the length of each bit vector is the number of records in the base table
      • The i-th bit is set if the i-th row of the base table has the value for the indexed column
      • Not suitable for high-cardinality domains
      Base table
        Cust   Region    Type
        C1     Asia      Retail
        C2     Europe    Dealer
        C3     Asia      Dealer
        C4     America   Retail
        C5     Europe    Dealer
      Index on Region
        RecID  Asia  Europe  America
        1      1     0       0
        2      0     1       0
        3      1     0       0
        4      0     0       1
        5      0     1       0
      Index on Type
        RecID  Retail  Dealer
        1      1       0
        2      0       1
        3      0       1
        4      1       0
        5      0       1
    Indexing OLAP Data: Join Indices
      • Join index: JI(R-id, S-id) where R (R-id, …) ⋈ S (S-id, …)
      • Traditional indices map the values to a list of record ids
          • A join index materializes the relational join in a JI file and speeds up the relational join
      • In data warehouses, a join index relates the values of the dimensions (e.g., location, item) of a star schema to rows (e.g., sales) in the fact table.
          • E.g., fact table Sales and two dimensions, city and product
          • A join index on city maintains for each distinct city a list of R-IDs of the tuples recording the Sales in the city
      • Join indices can span multiple dimensions
    Efficient Processing OLAP Queries
      • Determine which operations should be performed on the available cuboids
          • Transform drill, roll, etc. into corresponding SQL and/or OLAP operations, e.g., dice = selection + projection
      • Determine which materialized cuboid(s) should be selected for the OLAP operation
          • Let the query to be processed be on {brand, province_or_state} with the condition "year = 2004", and there are 4 materialized cuboids available:
              1) {year, item_name, city}
              2) {year, brand, country}
              3) {year, brand, province_or_state}
              4) {item_name, province_or_state} where year = 2004
            Which should be selected to process the query?
      • Explore indexing structures and compressed vs. dense array structures in MOLAP
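The bitmap index from the Base table / Index on Region / Index on Type example can be rebuilt in a few lines. This is a minimal sketch using plain Python lists of 0/1 as bit vectors. The function name `build_bitmap_index` is hypothetical, and a real system would pack the bits into machine words.

```python
# Base table from the slide: (Cust, Region, Type).
base_table = [
    ("C1", "Asia", "Retail"),
    ("C2", "Europe", "Dealer"),
    ("C3", "Asia", "Dealer"),
    ("C4", "America", "Retail"),
    ("C5", "Europe", "Dealer"),
]

def build_bitmap_index(rows, column):
    """One bit vector per distinct value; bit i is set if row i has that value."""
    index = {}
    for i, row in enumerate(rows):
        index.setdefault(row[column], [0] * len(rows))[i] = 1
    return index

region_index = build_bitmap_index(base_table, 1)
type_index = build_bitmap_index(base_table, 2)

# A conjunctive selection becomes a bitwise AND of two bit vectors.
hits = [a & b for a, b in zip(region_index["Europe"], type_index["Dealer"])]
matching = [base_table[i][0] for i, bit in enumerate(hits) if bit]
print(region_index["Asia"], hits, matching)
```

Each distinct value adds one vector as long as the table, which is why the slide warns against high-cardinality domains.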
  • 14. Data Warehouse Usage
      • Three kinds of data warehouse applications
          • Information processing
              • supports querying, basic statistical analysis, and reporting using crosstabs, tables, charts and graphs
          • Analytical processing
              • multidimensional analysis of data warehouse data
              • supports basic OLAP operations, slice-dice, drilling, pivoting
          • Data mining
              • knowledge discovery from hidden patterns
              • supports associations, constructing analytical models, performing classification and prediction, and presenting the mining results using visualization tools
    From On-Line Analytical Processing (OLAP) to On-Line Analytical Mining (OLAM)
      • Why online analytical mining?
          • High quality of data in data warehouses
              • DW contains integrated, consistent, cleaned data
          • Available information processing structure surrounding data warehouses
              • ODBC, OLEDB, Web accessing, service facilities, reporting and OLAP tools
          • OLAP-based exploratory data analysis
              • Mining with drilling, dicing, pivoting, etc.
          • On-line selection of data mining functions
              • Integration and swapping of multiple mining functions, algorithms, and tasks
    An OLAM System Architecture
      • [Figure: four layers. Layer 4, User Interface: User GUI API, receiving mining queries and returning mining results. Layer 3, OLAP/OLAM: OLAM Engine and OLAP Engine, on top of the Data Cube API. Layer 2, MDDB: MDDB and Meta Data, reached through the Database API with Filtering & Integration. Layer 1, Data Repository: Databases and Data Warehouse, with data cleaning, data integration and filtering.]
  • 15. Summary: Data Warehouse and OLAP Technology
      • Why data warehousing?
      • A multi-dimensional model of a data warehouse
          • Star schema, snowflake schema, fact constellations
          • A data cube consists of dimensions & measures
      • OLAP operations: drilling, rolling, slicing, dicing and pivoting
      • Data warehouse architecture
      • OLAP servers: ROLAP, MOLAP, HOLAP
      • Efficient computation of data cubes
          • Partial vs. full vs. no materialization
          • Indexing OLAP data: bitmap index and join index
          • OLAP query processing
      • From OLAP to OLAM (on-line analytical mining)
    Data Mining Tools and Links
      • See the website on knowledge discovery: http://www.kdnuggets.com
      • Commercial and free data mining tools: http://www.kdnuggets.com/software/suites.html
    References (I)
      • S. Agarwal, R. Agrawal, P. M. Deshpande, A. Gupta, J. F. Naughton, R. Ramakrishnan, and S. Sarawagi. On the computation of multidimensional aggregates. VLDB'96
      • D. Agrawal, A. E. Abbadi, A. Singh, and T. Yurek. Efficient view maintenance in data warehouses. SIGMOD'97
      • R. Agrawal, A. Gupta, and S. Sarawagi. Modeling multidimensional databases. ICDE'97
      • S. Chaudhuri and U. Dayal. An overview of data warehousing and OLAP technology. ACM SIGMOD Record, 26:65-74, 1997
      • E. F. Codd, S. B. Codd, and C. T. Salley. Beyond decision support. Computer World, 27, July 1993
      • J. Gray, et al. Data cube: A relational aggregation operator generalizing group-by, cross-tab and sub-totals. Data Mining and Knowledge Discovery, 1:29-54, 1997
      • A. Gupta and I. S. Mumick. Materialized Views: Techniques, Implementations, and Applications. MIT Press, 1999
      • J. Han. Towards on-line analytical mining in large databases. ACM SIGMOD Record, 27:97-107, 1998
      • V. Harinarayan, A. Rajaraman, and J. D. Ullman. Implementing data cubes efficiently. SIGMOD'96
    References (II)
      • C. Imhoff, N. Galemmo, and J. G. Geiger. Mastering Data Warehouse Design: Relational and Dimensional Techniques. John Wiley, 2003
      • W. H. Inmon. Building the Data Warehouse. John Wiley, 1996
      • R. Kimball and M. Ross. The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling. 2ed. John Wiley, 2002
      • P. O'Neil and D. Quass. Improved query performance with variant indexes. SIGMOD'97
      • Microsoft. OLEDB for OLAP programmer's reference version 1.0. In http://www.microsoft.com/data/oledb/olap, 1998
      • A. Shoshani. OLAP and statistical databases: Similarities and differences. PODS'00
      • S. Sarawagi and M. Stonebraker. Efficient organization of large multidimensional arrays. ICDE'94
      • OLAP council. MDAPI specification version 2.0. In http://www.olapcouncil.org/research/apily.htm, 1998
      • E. Thomsen. OLAP Solutions: Building Multidimensional Information Systems. John Wiley, 1997
      • P. Valduriez. Join indices. ACM Trans. Database Systems, 12:218-246, 1987
      • J. Widom. Research problems in data warehousing. CIKM'95