Data Warehousing and elements of Data Mining

  1. Data Warehousing and elements of Data Mining
     prof. Maurizio Pighin, e-mail: maurizio.pighin@uniud.it
     Dipartimento di Matematica e Informatica, Università di Udine - Italy
     Copyright © 2008 by Maurizio Pighin
     Slide 2: Motivation: “Necessity is the Mother of Invention”
       • Data explosion problem
         – Automated data collection tools and mature database technology lead to tremendous amounts of data stored in databases and other information repositories
       • Data are difficult to analyze
         – Complex queries, long analysis times
       • We are drowning in data, but starving for knowledge!
       • Solution: data warehousing and data mining
         – Data warehousing and on-line analytical processing
         – Extraction of interesting knowledge (rules, regularities, patterns, constraints) from data in large databases
  2. Slide 3: Evolution of Database Technology
       • 1960s: Data collection, database creation, IMS and network DBMS
       • 1970s: Relational data model, relational DBMS implementation
       • 1980s: RDBMS, advanced data models (extended-relational, OO, deductive, etc.) and application-oriented DBMS (spatial, scientific, engineering, etc.)
       • 1990s-2000s: Data mining and data warehousing, multimedia databases, and Web databases
     Slide 4: Evolution of data analysis
       • 1960s: Batch reports
         – Data are difficult to find and analyze
         – Expensive: every request needs a new report (many systems today still offer only this kind of analysis)
       • 1970s: First procedures to support the decision process
         – Usually very poor and not integrated with office automation tools
       • 1980s: Office automation tools
         – Query tools, spreadsheets, GUIs
         – Access to operational data (usually very complex)
       • 1990s: Data warehousing and data mining
  3. Slide 5: Data Warehousing and Data Mining
       • What is a data warehouse?
       • A multi-dimensional data model
       • Data warehouse architecture
       • Data warehouse implementation
       • OLAP analysis
       • From data warehousing to data mining
       • Principles of data mining
     Slide 6: What is a Data Warehouse?
       • Defined in many different ways, but not rigorously.
         – A decision support database that is maintained separately from the organization’s operational database
         – Supports information processing by providing a solid platform of consolidated, historical data for analysis
  4. Slide 7: What is a Data Warehouse?
       • “A data warehouse is a subject-oriented, integrated, time-variant, and non-volatile collection of data in support of management’s decision-making process.” - W. H. Inmon (1985)
       • “A single, complete and consistent data warehouse, obtained by different sources, available to final users to be immediately utilized” - IBM Systems Journal (1990)
       • Data warehousing:
         – The process of constructing and using data warehouses
     Slide 8: Data Warehouse - Subject-Oriented
       • Organized around major subjects, such as customer, product, sales.
       • Focuses on the modeling and analysis of data for decision makers, not on daily operations or transaction processing.
       • Provides a simple and concise view of particular subject issues by excluding data that are not useful in the decision support process.
  5. Slide 9: Data Warehouse - Integrated
       • Constructed by integrating multiple, heterogeneous data sources
         – Relational databases, flat files, on-line transaction records
       • Data cleaning and data integration techniques are applied.
         – Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources
           • E.g., hotel price: currency, tax, breakfast covered, etc.
         – When data is moved to the warehouse, it is converted.
     Slide 10: Data Warehouse - Time Variant
       • The time horizon for the data warehouse is significantly longer than that of operational systems.
         – Operational database: current value data
         – Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years)
       • Every key structure in the data warehouse
         – Contains an element of time, explicitly or implicitly
         – The key of operational data, by contrast, may or may not contain a time element
  6. Slide 11: Data Warehouse - Non-Volatile
       • A physically separate store of data transformed from the operational environment.
       • Operational update of data does not occur in the data warehouse environment.
         – Does not require transaction processing, recovery, and concurrency control mechanisms
         – Requires only two operations in data accessing:
           • initial loading of data
           • access of data
     Slide 12: Data Warehouse
       • Data analysis system characteristics: FASMI (OLAP Report, 1995)
         – Fast
         – Analytical
         – Shared
         – Multidimensional
         – Informational
  7. Slide 13: Why do we need all that?
       • Operational databases are for On-Line Transaction Processing (OLTP)
         – Automate day-to-day operations (purchasing, banking, etc.)
         – Transactions access (and modify!) a few records at a time
         – Database design is application (process) oriented
         – Metric: transactions/sec
     Slide 14: Why do we need all that?
       • A data warehouse is for On-Line Analytical Processing (OLAP)
         – Complex queries that access millions of records
         – Need historical data for trend analysis
         – Long scans would interfere with normal operations
         – Synchronizing data-intensive queries among physically separated databases would be a nightmare!
         – Metric: query response time
  8. Slide 15: Examples of OLAP
       • Comparisons (this period vs. last period)
         – Show me the sales per region for this year and compare them to those of the previous year to identify discrepancies
       • Multidimensional ratios (percent to total)
         – Show me the contribution to weekly profit made by all items sold in the northeast stores between May 1 and May 7
     Slide 16: Examples of OLAP
       • Ranking and statistical profiles (top N/bottom N)
         – Show me sales, profit and average call volume per day for my 10 most profitable salespeople
       • Custom consolidation (market segments, ad hoc groups)
         – Show me an abbreviated income statement by quarter for the last four quarters for my northeast region operations
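
The first three query types above (comparison, percent to total, top N) can be sketched in a few lines of plain Python. This is only an illustration: the region names, the figures and the helper functions `compare_periods`, `percent_of_total` and `top_n` are invented here and do not come from the slides.

```python
# Hypothetical yearly sales per region; the figures are invented for illustration.
sales = {
    ("North", 2007): 120.0, ("North", 2008): 150.0,
    ("South", 2007): 80.0,  ("South", 2008): 60.0,
    ("East", 2007): 100.0,  ("East", 2008): 90.0,
}

def compare_periods(sales, this_year, last_year):
    """Comparison: change in sales per region between two years."""
    regions = sorted({region for region, _ in sales})
    return {r: sales[(r, this_year)] - sales[(r, last_year)] for r in regions}

def percent_of_total(sales, year):
    """Multidimensional ratio: each region's share of the year's total."""
    by_region = {r: v for (r, y), v in sales.items() if y == year}
    total = sum(by_region.values())
    return {r: 100.0 * v / total for r, v in by_region.items()}

def top_n(sales, year, n):
    """Ranking: the n regions with the highest sales in the given year."""
    by_region = {r: v for (r, y), v in sales.items() if y == year}
    return sorted(by_region, key=by_region.get, reverse=True)[:n]

changes = compare_periods(sales, 2008, 2007)
shares = percent_of_total(sales, 2008)
best = top_n(sales, 2008, 2)
```

In a real warehouse these would be queries against the OLAP server rather than dictionary lookups; the sketch only shows what each query type computes.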
  9. Slide 17: Data Warehouse vs. Heterogeneous DBMS
       • Traditional heterogeneous DB integration:
         – Build wrappers/mediators on top of heterogeneous databases
         – Query-driven approach
           • When a query is posed to a client site, a meta-dictionary is used to translate the query into queries appropriate for the individual heterogeneous sites involved, and the results are integrated into a global answer set
           • Complex information filtering, competition for resources
       • Data warehouse: update-driven, high performance
         – Information from heterogeneous sources is integrated in advance and stored in warehouses for direct query and analysis
     Slide 18: Data Warehouse vs. Operational DBMS
       • OLTP (on-line transaction processing)
         – Day-to-day operations: purchasing, inventory, banking, manufacturing, payroll, registration, accounting, etc.
       • OLAP (on-line analytical processing)
         – Data analysis and decision making
       • Distinct features (OLTP vs. OLAP):
         – System orientation: process vs. business subject
         – Data contents: current, detailed vs. historical, consolidated
         – Database design: ER + application vs. multidimensional + subject
         – View: current, local vs. evolutionary, integrated
         – Access patterns: update vs. read-only but complex queries
  10. Slide 19: OLTP vs. OLAP

                            OLTP                                 OLAP
       users                clerk, IT professional               knowledge worker
       function             day-to-day operations                decision support
       DB design            application-oriented                 subject-oriented
       data                 current, up-to-date;                 historical;
                            detailed, flat relational;           summarized, multidimensional;
                            isolated                             integrated, consolidated
       usage                repetitive                           ad-hoc
       access               read/write;                          lots of scans
                            index/hash on prim. key
       unit of work         short, simple transaction            complex query
       # records accessed   tens                                 millions
       # users              thousands                            hundreds
       DB size              100MB-GB                             100GB-TB
       metric               transaction throughput               query throughput, response

     Slide 20: Why Separate Data Warehouse?
       • High performance for both systems
         – DBMS, tuned for OLTP: access methods, indexing, concurrency control, recovery
         – Warehouse, tuned for OLAP: complex OLAP queries, multidimensional view, consolidation
       • Different functions and different data:
         – Missing data: decision support requires historical data, which operational DBs do not typically maintain
         – Data consolidation: decision support requires consolidation (aggregation, summarization) of data from heterogeneous sources
         – Data quality: different sources typically use inconsistent data representations, codes and formats, which have to be reconciled
  11. Slide 21: Data Warehousing and Data Mining (outline repeated from Slide 5)
     Slide 22: Multidimensional model
       • A data warehouse is based on a multidimensional data model, which views data in the form of a data cube (hypercube)
       • A hypercube is a multidimensional array that represents a particular event
       • A “fact” is a point of this multidimensional array, obtained by crossing existing coordinates
         – Dimension: a coordinate of the fact
         – Measure: a numerical value characterizing the event
  12. Slide 23: Multidimensional model - example
       • A data cube, such as sales, allows numerical data (measures) to be modeled and viewed in multiple dimensions
         – Measures, such as transaction value (dollars_sold) or quantity (item_quantity)
         – Dimensions, such as item (item_name, brand, type), time (day, week, month, quarter, year), or customer (customer_name, city, region, state)
     Slide 24: Measures
       • Every fact can contain more than one measure
       • A measure may be
         – Stored in the data warehouse (effective)
         – Evaluated at run time from effective measures
         – Implicit (presence or absence of a fact)
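
A minimal sketch of these ideas, assuming a cube held in a Python dictionary: each key is a tuple of dimension coordinates and each value holds the stored (effective) measures. The coordinates, the values and the helpers `unit_price` and `was_sold` are hypothetical; only the measure names `dollars_sold` and `item_quantity` come from the slide.

```python
# Each fact is a point of the cube, addressed by its dimension coordinates
# (item, day, customer city). All values are invented for illustration.
cube = {
    ("TV", "2008-05-01", "Udine"): {"dollars_sold": 900.0, "item_quantity": 3},
    ("PC", "2008-05-01", "Udine"): {"dollars_sold": 1200.0, "item_quantity": 2},
    ("TV", "2008-05-02", "Rome"): {"dollars_sold": 300.0, "item_quantity": 1},
}

def unit_price(fact):
    """Derived measure, evaluated at run time from two stored measures."""
    return fact["dollars_sold"] / fact["item_quantity"]

def was_sold(cube, item, day, city):
    """Implicit measure: the mere presence of a fact at these coordinates."""
    return (item, day, city) in cube

tv_price = unit_price(cube[("TV", "2008-05-01", "Udine")])
```

A sparse dictionary is used because most coordinate combinations carry no fact; a dense array would work equally well for a small, fully populated cube.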
  13. Slide 25: Fact aggregation
       • Elementary facts can be aggregated to obtain synthetic facts
       • The measures of the synthetic facts can be obtained with aggregation operators
         – Sum, mean, max, min, …
       • For each measure-dimension pair it is possible to define a different aggregation operator
     Slide 26: Fact aggregation
       • Measures can be
         – Additive: can be aggregated by sum on every dimension (for instance, total income)
         – Semi-additive: can be aggregated by sum on some dimensions but not on others (for instance, quantity can be summed on “item” but not on “store”, where different items are present)
         – Non-additive: can never be summed; other operators must be used (mean, median, max, min) (for instance, unit price)
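
One way to picture "one operator per measure-dimension pair" is a lookup table consulted whenever a dimension is collapsed. The sketch below is an assumption-laden illustration: the facts, the `OPERATORS` table and the `aggregate` helper are invented, and only the additive case (income) and the non-additive case (unit price) are shown.

```python
from statistics import mean

# Elementary facts: (item, store, month) -> measures. Values are invented.
facts = {
    ("TV", "S1", "Jan"): {"income": 900.0, "unit_price": 300.0},
    ("TV", "S2", "Jan"): {"income": 600.0, "unit_price": 300.0},
    ("PC", "S1", "Jan"): {"income": 1200.0, "unit_price": 600.0},
}
DIMENSIONS = ("item", "store", "month")

# One aggregation operator per measure-dimension pair (assumed choices).
OPERATORS = {
    ("income", "store"): sum,       # additive: totals make sense
    ("income", "item"): sum,
    ("unit_price", "store"): mean,  # non-additive: never summed
    ("unit_price", "item"): mean,
}

def aggregate(facts, measure, dimension):
    """Collapse one dimension, combining the measure with its operator."""
    axis = DIMENSIONS.index(dimension)
    groups = {}
    for coords, measures in facts.items():
        key = coords[:axis] + coords[axis + 1:]
        groups.setdefault(key, []).append(measures[measure])
    op = OPERATORS[(measure, dimension)]
    return {key: op(values) for key, values in groups.items()}

income_by_item = aggregate(facts, "income", "store")
price_by_item = aggregate(facts, "unit_price", "store")
```

A semi-additive measure would simply map to `sum` for some dimensions and to another operator, or to no entry at all, for the rest.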
  14. Slide 27: Dimension hierarchy
       • Hierarchy
         – Set of dimensional attributes hierarchically linked to one dimension
         – Dimensional attributes
           • Are used to aggregate elementary facts
           • Are uniquely determined by a dimension
           • Represent a “classification” of the dimension
     Slide 28: Example of dimension hierarchy (figure: tree with one level per row)
       all:      all
       region:   Europe ... North_America
       country:  Germany ... Spain, Canada ... Mexico
       city:     Frankfurt ..., Vancouver ... Toronto
       office:   L. Chan ... M. Wind
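
Aggregating along such a hierarchy amounts to mapping each member to its parent and summing. The sketch below assumes the parent links suggested by the example tree (Frankfurt in Germany, Vancouver and Toronto in Canada); the sales figures and the `roll_up` helper are invented.

```python
# Dimension hierarchy from the example: city -> country -> region.
CITY_TO_COUNTRY = {"Frankfurt": "Germany", "Vancouver": "Canada", "Toronto": "Canada"}
COUNTRY_TO_REGION = {"Germany": "Europe", "Spain": "Europe",
                     "Canada": "North_America", "Mexico": "North_America"}

# Elementary facts at city level; the sales figures are invented.
sales_by_city = {"Frankfurt": 40.0, "Vancouver": 25.0, "Toronto": 35.0}

def roll_up(values, mapping):
    """Aggregate a measure one level up the hierarchy by summing."""
    result = {}
    for member, value in values.items():
        parent = mapping[member]
        result[parent] = result.get(parent, 0.0) + value
    return result

sales_by_country = roll_up(sales_by_city, CITY_TO_COUNTRY)
sales_by_region = roll_up(sales_by_country, COUNTRY_TO_REGION)
```

Summing is appropriate here only because sales is treated as an additive measure.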
  15. Slide 29: View of Warehouses and Hierarchies (figure)
     Slide 30: Multidimensional Data
       • Sales volume as a function of Product, Location, and Time
       • Dimensions: Product, Location, Time
       • Hierarchical summarization paths (figure)
         – Product: Industry → Category → Item
         – Location: Region → Country → City → Office
         – Time: Year → Quarter → Month / Week → Day
  16. Slide 31: Data Warehousing and Data Mining (outline repeated from Slide 5)
     Slide 32: OLAP Server Architectures
       • Relational OLAP (ROLAP)
         – Uses a relational or extended-relational DBMS to store and manage warehouse data, and OLAP middleware to support missing pieces
         – Includes optimization of the DBMS backend, implementation of aggregation navigation logic, and additional tools and services
         – Greater scalability
  17. Slide 33: OLAP Server Architectures
       • Multidimensional OLAP (MOLAP)
         – Array-based multidimensional storage engine (sparse matrix techniques)
         – Fast indexing to pre-computed summarized data
       • Hybrid OLAP (HOLAP)
         – User flexibility, e.g., low level: relational, high level: array
     Slide 34: Conceptual Modeling of Data Warehouses
       • Modeling data warehouses: dimensions & measures on ROLAP systems
         – Star schema: a fact table in the middle connected to a set of dimension tables
         – Snowflake schema: a refinement of the star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to a snowflake
         – Fact constellations: multiple fact tables share dimension tables, viewed as a collection of stars, therefore called galaxy schema or fact constellation
  18. Slide 35: Components of Star Schema (figure callouts)
       • Fact tables contain factual or quantitative data
       • 1:N relationship between dimension tables and fact tables
       • Dimension tables are denormalized to maximize performance
       • Dimension tables contain descriptions about the subjects of the business
       • Excellent for ad-hoc queries, but bad for online transaction processing
     Slide 36: Star Schema example (figure)
       • The fact table provides statistics for sales broken down by product, period and store dimensions
  19. Slide 37: Star Schema with sample data (figure)
     Slide 38: Another example of Star Schema (figure)
       • Sales Fact Table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales
       • time: time_key, day, day_of_the_week, month, quarter, year
       • item: item_key, item_name, brand, type, supplier_type
       • branch: branch_key, branch_name, branch_type
       • location: location_key, street, city, province_or_street, country
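
A reduced version of this star schema can be tried out with Python's built-in `sqlite3` module. The sketch keeps the table and column names of the slide but drops the branch dimension and several attributes for brevity; the rows are invented. The query at the end is the typical star join: the fact table joined to its dimensions, then grouped by dimensional attributes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time     (time_key INTEGER PRIMARY KEY, day TEXT, month TEXT, quarter TEXT, year INTEGER);
CREATE TABLE item     (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT, type TEXT);
CREATE TABLE location (location_key INTEGER PRIMARY KEY, city TEXT, country TEXT);
-- Fact table: one foreign key per dimension plus the measures.
CREATE TABLE sales (
    time_key     INTEGER REFERENCES time(time_key),
    item_key     INTEGER REFERENCES item(item_key),
    location_key INTEGER REFERENCES location(location_key),
    units_sold   INTEGER,
    dollars_sold REAL
);
-- Invented sample rows.
INSERT INTO time VALUES (1, '2008-01-15', 'Jan', 'Q1', 2008), (2, '2008-04-10', 'Apr', 'Q2', 2008);
INSERT INTO item VALUES (1, 'TV-100', 'Acme', 'TV'), (2, 'PC-200', 'Acme', 'PC');
INSERT INTO location VALUES (1, 'Toronto', 'Canada'), (2, 'Frankfurt', 'Germany');
INSERT INTO sales VALUES (1, 1, 1, 2, 600.0), (1, 2, 2, 1, 1000.0), (2, 1, 2, 3, 900.0);
""")

# Star join: dollars sold per quarter and item type.
rows = conn.execute("""
    SELECT t.quarter, i.type, SUM(s.dollars_sold)
    FROM sales s
    JOIN time t ON t.time_key = s.time_key
    JOIN item i ON i.item_key = s.item_key
    GROUP BY t.quarter, i.type
    ORDER BY t.quarter, i.type
""").fetchall()
```

Because the dimension tables are denormalized, each dimension needs only one join, which is what makes ad-hoc queries of this kind straightforward.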
  20. Slide 39: Example of Snowflake Schema (figure)
       • Sales Fact Table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales
       • time: time_key, day, day_of_the_week, month, quarter, year
       • item: item_key, item_name, brand, type, supplier_key
       • supplier: supplier_key, supplier_type
       • branch: branch_key, branch_name, branch_type
       • location: location_key, street, city_key
       • city: city_key, city, province_or_street, country
     Slide 40: Example of Fact Constellation (figure)
       • Sales Fact Table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales
       • Shipping Fact Table: time_key, item_key, shipper_key, from_location, to_location; measures: dollars_cost, units_shipped
       • time: time_key, day, day_of_the_week, month, quarter, year
       • item: item_key, item_name, brand, type, supplier_type
       • branch: branch_key, branch_name, branch_type
       • location: location_key, street, city, province_or_street, country
       • shipper: shipper_key, shipper_name, location_key, shipper_type
  21. Slide 41: Main Data Warehouse Architectures
       • Architectures
         – Generic Two-Level Architecture
         – Independent Data Mart
         – Dependent Data Mart and Operational Data Store: Three-Level Architecture
       • All involve some form of extraction, transformation and loading (ETL)
     Slide 42: Generic Two-Level Data Warehousing Architecture (figure callouts)
       • ETL
       • One, company-wide warehouse
       • Periodic extraction: data is not completely current in the warehouse
  22. Slide 43: Independent Data Mart Data Warehousing Architecture (figure callouts)
       • Data marts: mini-warehouses, limited in scope
       • ETL: separate ETL for each independent data mart
       • Data access complexity due to multiple data marts
     Slide 44: Dependent data mart with operational data store: a three-level architecture (figure callouts)
       • ETL: single ETL for the enterprise data warehouse (EDW)
       • Dependent data marts loaded from the EDW
       • Simpler data access
  23. Slide 45: General Architecture (figure)
       • Data Sources: operational DBs, other sources
       • Extract, Transform, Load, Refresh
       • Data Storage: data warehouse, data marts, metadata, monitor & integrator
       • OLAP Engine: OLAP server
       • Front-End: analysis, query, reports, data mining
     Slide 46: General Architecture
       • Enterprise warehouse
         – Collects all of the information about subjects spanning the entire organization
       • Data mart
         – A subset of corporate-wide data that is of value to a specific group of users. Its scope is confined to specific, selected groups, such as a marketing data mart
       • Independent vs. dependent (fed directly from the warehouse) data mart
  24. Slide 47: ETL function
       • Data extraction:
         – Get data from multiple, heterogeneous, and external sources
       • Data cleaning:
         – Detect errors in the data and rectify them when possible
       • Data transformation:
         – Convert data from legacy or host format to warehouse format
       • Load:
         – Sort, summarize, consolidate, compute views, check integrity, and build indices and partitions
       • Refresh:
         – Propagate the updates from the data sources to the warehouse
     Slide 48: Data Warehousing and Data Mining (outline repeated from Slide 5)
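
The extraction, cleaning, transformation and load steps can be sketched as four small functions. Everything specific here is an assumption made for illustration: the two sources, the hotel names, the fixed exchange rate `EUR_TO_USD` and the target table `hotel_price` are invented, and the refresh step is left out.

```python
import csv
import io
import sqlite3

# Two hypothetical sources with inconsistent formats (invented data).
CSV_SOURCE = "hotel,price,currency\nAlpha,100,EUR\nBeta,,EUR\n"
DICT_SOURCE = [{"name": "gamma", "usd_price": 150.0}]
EUR_TO_USD = 1.5  # assumed fixed rate, for illustration only

def extract():
    """Extraction: read rows from heterogeneous sources."""
    rows = list(csv.DictReader(io.StringIO(CSV_SOURCE)))
    return rows, DICT_SOURCE

def clean(csv_rows):
    """Cleaning: drop rows whose price is missing."""
    return [r for r in csv_rows if r["price"].strip()]

def transform(csv_rows, dict_rows):
    """Transformation: one naming convention, one currency (USD)."""
    unified = [(r["hotel"].title(), float(r["price"]) * EUR_TO_USD) for r in csv_rows]
    unified += [(r["name"].title(), r["usd_price"]) for r in dict_rows]
    return unified

def load(rows):
    """Load: sort, store, and build an index on the warehouse table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE hotel_price (hotel TEXT, usd_price REAL)")
    conn.executemany("INSERT INTO hotel_price VALUES (?, ?)", sorted(rows))
    conn.execute("CREATE INDEX idx_hotel ON hotel_price (hotel)")
    conn.commit()
    return conn

csv_rows, dict_rows = extract()
warehouse = load(transform(clean(csv_rows), dict_rows))
loaded = warehouse.execute(
    "SELECT hotel, usd_price FROM hotel_price ORDER BY hotel"
).fetchall()
```

The row for "Beta" is dropped by the cleaning step because its price is missing; a production pipeline would more likely log or repair such rows than discard them silently.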
  25. Slide 49: Design of a Data Warehouse: A Business Analysis Framework
       • Four views regarding the design of a data warehouse
         – Top-down view
           • Allows selection of the relevant information necessary for the data warehouse
         – Data source view
           • Exposes the information being captured, stored, and managed by operational systems
         – Data warehouse view
           • Consists of fact tables and dimension tables
         – Business query view
           • Sees the data in the warehouse from the viewpoint of the end user
     Slide 50: Data Warehouse Design Process
       • Top-down, bottom-up approaches or a combination of both
         – Top-down: starts with overall design and planning (mature)
         – Bottom-up: starts with experiments and prototypes (rapid)
       • From a software engineering point of view
         – Waterfall: structured and systematic analysis at each step before proceeding to the next (top-down)
         – Spiral: rapid generation of increasingly functional systems, short turnaround time (bottom-up)
  26. Slide 51: Data Warehouse Design Process
       • Typical data warehouse design process with the bottom-up approach
         – Choose a business process to model, e.g., orders, invoices, etc.
         – Choose the grain (atomic level of data) of the business process
         – Choose the dimensions that will apply to each fact table record
         – Choose the measures that will populate each fact table record
         – Design the architecture of the DW
         – Design the ETL
         – Install and test
       • Advantages
         – Results in a short time
         – Not too expensive
         – Gives management a clear view of the OLAP world
     Slide 52: Data Warehousing and Data Mining (outline repeated from Slide 5)
  27. Slide 53: Exploration of Data Cubes
       • OLAP
         – Interactive navigation through data
       • Two models
         – Hypothesis-driven: exploration driven by hypotheses formulated by the user
         – Discovery-driven: pre-compute measures indicating exceptions and guide the user in the data analysis, at all levels of aggregation; users then continue with hypothesis-driven exploration
     Slide 54: A Sample Data Cube (figure)
       • Date: 1Qtr, 2Qtr, 3Qtr, 4Qtr, sum
       • Product: TV, PC, VCR, sum
       • Country: U.S.A, Canada, Mexico, sum
       • Highlighted cell: total annual sales of TV in U.S.A.
  28. Slide 55: Typical OLAP Operations
       • Roll up (drill-up): summarize data
         – By climbing up a hierarchy or by dimension reduction
       • Drill down (roll down): reverse of roll-up
         – From higher-level summary to lower-level summary or detailed data, or by introducing new dimensions
     Slide 56: Roll-up/Drill-down (figure: cubes over Product, Date and Country; each roll-up step replaces one dimension with All, each drill-down step restores it)
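
Roll-up by dimension reduction and its reverse can be sketched on a small cube held in a dictionary. The member names follow the sample data cube (quarters, TV/PC, U.S.A/Canada); the cell values and the helpers `roll_up` and `drill_down` are invented for this illustration.

```python
# Sample cube cells keyed by (date, product, country); values are invented.
cube = {
    ("1Qtr", "TV", "U.S.A"): 10, ("2Qtr", "TV", "U.S.A"): 20,
    ("1Qtr", "PC", "U.S.A"): 5,  ("1Qtr", "TV", "Canada"): 7,
    ("2Qtr", "PC", "Canada"): 3,
}
DIMS = ("date", "product", "country")

def roll_up(cube, drop):
    """Roll up by dimension reduction: sum away the dimension `drop`."""
    axis = DIMS.index(drop)
    out = {}
    for coords, value in cube.items():
        key = coords[:axis] + coords[axis + 1:]
        out[key] = out.get(key, 0) + value
    return out

def drill_down(cube, coarse_key, dropped):
    """Drill down: list the detailed cells behind one rolled-up cell."""
    axis = DIMS.index(dropped)
    return {c: v for c, v in cube.items() if c[:axis] + c[axis + 1:] == coarse_key}

annual = roll_up(cube, "date")
tv_usa_detail = drill_down(cube, ("TV", "U.S.A"), "date")
```

Drill-down is only possible because the detailed cube is still available; a summary alone cannot be disaggregated.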
  29. Slide 57: OLAP Operations - drill-down (figure)
     Slide 58: OLAP Operations - drill-down (figure)
  30. Slide 59: OLAP Operations - drill-down (figure)
     Slide 60: OLAP Operations - roll-up (figure)
  31. Slide 61: OLAP Operations - roll-up (figure)
     Slide 62: OLAP Operations - roll-up (figure)
  32. Slide 63: OLAP Operations
       • Slice and dice: select and project on one or more dimensions
       • Figure: cube over country, product and date, selected with customer = “Smith”
     Slide 64: Slice (figure: a cube over Country, Product and Date with 4 quarters is reduced to one with 2 quarters)
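
As a rough sketch, slice and dice can be written as two filters over the cube. The distinction used here (slice fixes one dimension to a single member, dice restricts several dimensions to sets of members) is the common textbook reading rather than something the slide spells out, and the data and helper names are invented.

```python
DIMS = ("date", "product", "country")
# Cube cells keyed by (date, product, country); values are invented.
cube = {
    ("1Qtr", "TV", "U.S.A"): 10, ("2Qtr", "TV", "U.S.A"): 20,
    ("3Qtr", "PC", "Canada"): 4, ("4Qtr", "VCR", "Mexico"): 6,
}

def slice_cube(cube, dim, member):
    """Slice: fix one dimension to a single member and drop that axis."""
    axis = DIMS.index(dim)
    return {c[:axis] + c[axis + 1:]: v for c, v in cube.items() if c[axis] == member}

def dice(cube, **allowed):
    """Dice: keep the sub-cube whose coordinates fall in the allowed sets."""
    def keep(coords):
        return all(coords[DIMS.index(d)] in members for d, members in allowed.items())
    return {c: v for c, v in cube.items() if keep(c)}

tv_slice = slice_cube(cube, "product", "TV")
first_half = dice(cube, date={"1Qtr", "2Qtr"}, country={"U.S.A", "Canada"})
```

The slice has one dimension fewer than the original cube, while the dice keeps all three dimensions and only narrows their ranges.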
  33. Slide 65: OLAP Operations - slice-and-dice (figure)
     Slide 66: OLAP Operations - slice-and-dice (figure)
  34. Slide 67: OLAP Operations - slice-and-dice (figure)
     Slide 68: OLAP Operations
       • Pivot (rotate):
         – Reorient the cube visualization, e.g., from 3D to a series of 2D planes
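
On a single 2D plane of the cube, pivoting reduces to swapping the row and column axes. The sketch below shows only that simplest case; the products, stores and values are invented, and `pivot` is a hypothetical helper.

```python
# A 2D view of the cube: rows are products, columns are stores (invented data).
view = {
    "TV": {"Store1": 10, "Store2": 20},
    "PC": {"Store1": 5, "Store2": 8},
}

def pivot(view):
    """Rotate the view: swap the row axis and the column axis."""
    rotated = {}
    for row, cells in view.items():
        for col, value in cells.items():
            rotated.setdefault(col, {})[row] = value
    return rotated

by_store = pivot(view)
```

No value changes under a pivot; applying it twice returns the original view.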
  35. Slide 69: OLAP Operations (figure: pivoting a cube with axes Time, Store and Product)
     Slide 70: OLAP Operations - pivoting (figure)
  36. Slide 71: OLAP Operations - pivoting (figure)
     Slide 72: OLAP Operations - pivoting (figure)
  37. Slide 73: OLAP Operations
       • Drill across: queries involving (across) more than one fact table
     Slide 74: OLAP Operations - drill-across (figure)
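
Drill-across can be pictured as lining up two fact tables on the dimensions they share and reading their measures side by side. In this sketch the sales and shipping facts, their values and the `drill_across` helper are all invented; only the measure names `units_sold` and `units_shipped` echo the fact constellation example.

```python
# Two fact tables sharing the dimensions (item, quarter); values are invented.
sales = {("TV", "Q1"): {"units_sold": 30}, ("PC", "Q1"): {"units_sold": 12}}
shipping = {("TV", "Q1"): {"units_shipped": 28}, ("VCR", "Q1"): {"units_shipped": 5}}

def drill_across(left, right):
    """Combine measures of two fact tables on their shared coordinates."""
    merged = {}
    for coords in left.keys() & right.keys():
        merged[coords] = {**left[coords], **right[coords]}
    return merged

combined = drill_across(sales, shipping)
```

Only coordinates present in both fact tables survive, which corresponds to an inner join; keeping unmatched cells would be an equally valid design choice.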
  38. Slide 75: OLAP Operations - drill-across (figure)
     Slide 76: Exploration of Data Cubes
       • Hypothesis-driven
         – Exploration by the user, huge search space
       • Discovery-driven
         – Pre-compute measures indicating exceptions and guide the user in the data analysis, at all levels of aggregation
         – Exception: a value significantly different from the one anticipated, based on a statistical model
         – Visual cues such as background color are used to reflect the degree of exception of each cell
         – Computation of the exception indicator can be overlapped with cube construction
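
The slide does not say which statistical model produces the anticipated values. As a stand-in, the sketch below assumes a simple additive model (row mean plus column mean minus grand mean) and flags cells whose residual exceeds a multiple of the residuals' standard deviation. The data, the threshold and the `exceptions` helper are invented for illustration.

```python
from statistics import mean, pstdev

# Sales by (product, quarter); the 90 is a planted anomaly (invented data).
cells = {
    ("TV", "Q1"): 10.0, ("TV", "Q2"): 12.0, ("TV", "Q3"): 11.0,
    ("PC", "Q1"): 20.0, ("PC", "Q2"): 90.0, ("PC", "Q3"): 21.0,
    ("VCR", "Q1"): 5.0, ("VCR", "Q2"): 6.0, ("VCR", "Q3"): 5.0,
}

def exceptions(cells, threshold=1.5):
    """Flag cells that deviate strongly from an additive row/column model."""
    rows = sorted({r for r, _ in cells})
    cols = sorted({c for _, c in cells})
    grand = mean(cells.values())
    row_mean = {r: mean(cells[(r, c)] for c in cols) for r in rows}
    col_mean = {c: mean(cells[(r, c)] for r in rows) for c in cols}
    # Residual = observed value minus the value the model anticipates.
    residual = {
        (r, c): cells[(r, c)] - (row_mean[r] + col_mean[c] - grand)
        for (r, c) in cells
    }
    spread = pstdev(residual.values())
    return [key for key, res in residual.items() if abs(res) > threshold * spread]

flagged = exceptions(cells)
```

The size of each residual could drive the background color mentioned above; a more robust model would be needed in practice, since a large outlier also shifts the means it is compared against.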
  39. Examples: Discovery-Driven Data Cubes (Slide 77)
      [Figure: discovery-driven data cube examples]
      Data Warehousing and Data Mining (Slide 78)
      • What is a data warehouse?
      • A multi-dimensional data model
      • Data warehouse architecture
      • Data warehouse implementation
      • OLAP analysis
      • From data warehousing to data mining
      • Principles of data mining
  40. Data Warehouse Usage (Slide 79)
      • Three kinds of data warehouse applications
        – Information processing
          • supports querying, basic statistical analysis, and reporting using crosstabs, tables, charts and graphs
        – Analytical processing
          • multidimensional analysis of data warehouse data
          • supports basic OLAP operations: slice-dice, drilling, pivoting
        – Data mining
          • knowledge discovery from hidden patterns
          • supports associations, constructing analytical models, performing classification and prediction, and presenting the mining results using visualization tools
      • Differences among the three tasks
      From On-Line Analytical Processing to On-Line Analytical Mining (OLAM) (Slide 80)
      • Why online analytical mining?
        – High quality of data in data warehouses
          • DW contains integrated, consistent, cleaned data
        – Available information processing structure surrounding data warehouses
          • ODBC, OLEDB, Web accessing, service facilities, reporting and OLAP tools
        – OLAP-based exploratory data analysis
          • mining with drilling, dicing, pivoting, etc.
        – On-line selection of data mining functions
  41. Data Warehousing and Data Mining (Slide 81)
      [Outline slide, same agenda as Slide 78]
      What Is Data Mining? (Slide 82)
      • Data mining (knowledge discovery in databases):
        – Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from data in large databases
      • Alternative names:
        – Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
  42. What Is Data Mining? (Slide 83)
      • Other Definitions
        – Non-trivial extraction of implicit, previously unknown and potentially useful information from data
        – Exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns
      Why Mine Data? Commercial Viewpoint (Slide 84)
      • Lots of data is being collected and warehoused
        – Web data, e-commerce
        – Purchases at department stores
        – Bank/credit card transactions
      • Computers have become cheaper and more powerful
      • Competitive pressure is strong
        – Provide better, customized services (e.g. in Customer Relationship Management)
  43. Mining Large Data Sets: Motivation (Slide 85)
      • There is often information “hidden” in the data that is not readily evident
      • Human analysts may take weeks to discover useful information
      • Much of the data is never analyzed at all
      Why Data Mining? Potential Applications (Slide 86)
      • Database analysis and decision support
        – Market analysis and management
          • target marketing, customer relationship management, market basket analysis, cross selling, market segmentation
        – Risk analysis and management
          • forecasting, customer retention, quality control, competitive analysis
        – Fraud detection and management
      • Other Applications
        – Text mining (newsgroups, email, documents) and Web analysis
  44. Market Analysis and Management (Slide 87)
      • Where are the data sources for analysis?
        – Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies
      • Target marketing
        – Find clusters of “model” customers who share the same characteristics: interest, income level, spending habits, etc.
      Market Analysis and Management (Slide 88)
      • Determine customer purchasing patterns over time
        – Changes in customer habits with age
      • Cross-market analysis
        – Associations/correlations between product sales
        – Prediction based on the association information
      • Customer profiling
        – Identifying what types of customers buy what products (clustering or classification)
      • Identifying customer requirements
        – identifying the best products for different customers
        – using prediction to find what factors will attract new customers
  45. Corporate Analysis and Risk Management (Slide 89)
      • Finance planning and asset evaluation
        – cash flow analysis and prediction
        – cross-sectional and time series analysis (financial-ratio, trend analysis, etc.)
      • Resource planning
        – summarize and compare the resources and spending
      • Competition
        – monitor competitors and market directions
        – group customers into classes and define a class-based pricing procedure
        – set pricing strategy in a highly competitive market
      Fraud Detection and Management (Slide 90)
      • Applications
        – widely used in health care, retail, credit card services, telecommunications (phone card fraud), etc.
        – approach: use historical data to build models of fraudulent behavior and use data mining to help identify similar instances
      • Examples
        – auto insurance: detect a group of people who stage accidents to collect on insurance
        – money laundering: detect suspicious money transactions
  46. Data Mining Tasks (Slide 91)
      • Prediction Methods
        – Use some variables to predict unknown or future values of other variables.
      • Description Methods
        – Find human-interpretable patterns that describe the data.
      Principal Data Mining Tasks (Slide 92)
      • Classification [Predictive]
      • Clustering [Descriptive]
      • Association Rule Discovery [Descriptive]
      • Regression [Predictive]
      • Deviation Detection [Predictive]
  47. Classification: Definition (Slide 93)
      • Given a collection of records (training set)
      • Each record contains a set of attributes; one of the attributes is the class.
      • Find a model for the class attribute as a function of the values of the other attributes.
      • Goal: previously unseen records should be assigned a class as accurately as possible.
      • Methodology: a test set is used to determine the accuracy of the model. Usually, the given data set is randomly divided into training and test sets, with the training set used to build the model and the test set used to validate it.
      Classification Example (Slide 94)
      Attribute types: Refund (categorical), Marital Status (categorical), Taxable Income (continuous), Cheat (class)
      Training set:
        Tid  Refund  Marital Status  Taxable Income  Cheat
        1    Yes     Single          125K            No
        2    No      Married         100K            No
        3    No      Single          70K             No
        4    Yes     Married         120K            No
        5    No      Divorced        95K             Yes
        6    No      Married         60K             No
        7    Yes     Divorced        220K            No
        8    No      Single          85K             Yes
        9    No      Married         75K             No
        10   No      Single          90K             Yes
      Test set:
        Refund  Marital Status  Taxable Income  Cheat
        No      Single          75K             ?
        Yes     Married         50K             ?
        No      Married         150K            ?
        Yes     Divorced        90K             ?
        No      Single          40K             ?
        No      Married         80K             ?
      [Figure: the training set is used to learn a classifier (model), which is then applied to the test set]
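The train/test methodology above can be sketched on the ten labeled records from the slide. The classifier here is an assumption chosen for brevity (a lookup of the majority class for each Refund and Marital Status pair, ignoring Taxable Income); the slides do not prescribe a particular model, and the 7/3 split and the seed are arbitrary.

```python
import random
from collections import Counter, defaultdict

# (refund, marital_status, cheat) for Tid 1..10 of the slide's training table.
records = [
    ("Yes", "Single", "No"), ("No", "Married", "No"), ("No", "Single", "No"),
    ("Yes", "Married", "No"), ("No", "Divorced", "Yes"), ("No", "Married", "No"),
    ("Yes", "Divorced", "No"), ("No", "Single", "Yes"), ("No", "Married", "No"),
    ("No", "Single", "Yes"),
]

def train(rows):
    """Learn the majority class for each (refund, marital_status) pair."""
    by_key = defaultdict(Counter)
    overall = Counter()
    for refund, status, label in rows:
        by_key[(refund, status)][label] += 1
        overall[label] += 1
    default = overall.most_common(1)[0][0]  # fallback for unseen pairs
    table = {key: counts.most_common(1)[0][0] for key, counts in by_key.items()}
    return lambda refund, status: table.get((refund, status), default)

def accuracy(model, rows):
    """Fraction of rows whose predicted class equals the known class."""
    hits = sum(model(refund, status) == label for refund, status, label in rows)
    return hits / len(rows)

# Randomly divide the known data into a training set and a test set.
rng = random.Random(0)  # fixed seed only to make the sketch repeatable
shuffled = records[:]
rng.shuffle(shuffled)
train_set, test_set = shuffled[:7], shuffled[7:]

model = train(train_set)
print("accuracy on held-out records:", accuracy(model, test_set))
```

With only ten records the accuracy estimate is very noisy; the point is the workflow of building on one subset and validating on the other.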
  48. Classification: Application (Slide 95)
      • Direct Marketing
        – Goal: reduce the cost of mailing by targeting a set of consumers likely to buy a new cell-phone product.
        – Approach:
          • Use the data for a similar product introduced before.
          • We know which customers decided to buy and which decided otherwise. This {buy, don’t buy} decision forms the class attribute.
          • Collect various demographic, lifestyle, and company-interaction related information about all such customers.
            – Type of business, where they live, how much they earn, etc.
          • Use this information as input attributes to learn a classifier model.
      Clustering Definition (Slide 96)
      • Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that
        – Data points in one cluster are more similar to one another.
        – Data points in separate clusters are less similar to one another.
      • Similarity Measures
        – Euclidean distance if attributes are continuous.
        – Other problem-specific measures
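Clustering with Euclidean distance can be sketched with a small k-means loop. k-means is one common algorithm and is an assumption here, since the slides define the clustering task but do not name an algorithm. The 3-D points and the naive initialization from the first k points are also hypothetical.

```python
import math

def kmeans(points, k, iterations=20):
    """Plain k-means with Euclidean distance; centroids start at the first k points."""
    centroids = [points[i] for i in range(k)]
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = [
            tuple(sum(axis) / len(members) for axis in zip(*members)) if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # assignments are stable
        centroids = new_centroids
    return centroids, clusters

# Two hypothetical groups of points in 3-D space.
points = [
    (1.0, 1.0, 1.0), (9.0, 9.0, 9.0), (1.2, 0.8, 1.1),
    (0.9, 1.1, 1.0), (8.8, 9.1, 9.2), (9.1, 8.9, 9.0),
]
centroids, clusters = kmeans(points, k=2)
```

The result depends on the initial centroids; practical implementations usually try several random initializations and keep the best one.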
  49. Illustrating Clustering (Slide 97)
      Euclidean distance based clustering in 3-D space.
      [Figure: intracluster distances are minimized; intercluster distances are maximized]
      Clustering: Application (Slide 98)
      • Market Segmentation:
        – Goal: subdivide a market into distinct subsets of customers where any subset may conceivably be selected as a market target to be reached with a distinct marketing mix.
        – Approach:
          • Collect different attributes of customers based on their geographical and lifestyle related information.
          • Find clusters of similar customers.
          • Measure the clustering quality by observing buying patterns of customers in the same cluster vs. those from different clusters.
  50. Association Rule Discovery: Definition (Slide 99)
      • Given a set of records, each of which contains some number of items from a given collection:
        – Produce dependency rules which will predict the occurrence of an item based on occurrences of other items.
      Transactions:
        TID  Items
        1    Bread, Coke, Milk
        2    Beer, Bread
        3    Beer, Coke, Diaper, Milk
        4    Beer, Bread, Diaper, Milk
        5    Coke, Diaper, Milk
      Rules discovered:
        {Milk} --> {Coke}
        {Diaper, Milk} --> {Beer}
      Association Rule Discovery: Application 1 (Slide 100)
      • Marketing and Sales Promotion:
        – Let the rule discovered be {Bagels, … } --> {Potato Chips}
        – Potato Chips as consequent => can be used to determine what should be done to boost its sales.
        – Bagels in the antecedent => can be used to see which products would be affected if the store discontinues selling bagels.
        – Bagels in antecedent and Potato Chips in consequent => can be used to see what products should be sold with Bagels to promote the sale of Potato Chips!
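The two rules listed for the five transactions can be checked with the usual support and confidence measures. These measures are standard in association rule mining but are not defined on the slide, so their use here is an assumption; the helper names are hypothetical.

```python
# The five transactions from the slide.
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]

def count(itemset):
    """Number of transactions that contain every item of the itemset."""
    return sum(itemset <= t for t in transactions)

def support(itemset):
    """Fraction of all transactions that contain the itemset."""
    return count(itemset) / len(transactions)

def confidence(antecedent, consequent):
    """Among transactions containing the antecedent, the fraction that also contain the consequent."""
    return count(antecedent | consequent) / count(antecedent)

# {Milk} --> {Coke}: support 3/5 = 0.6, confidence 3/4 = 0.75
milk_coke = (support({"Milk", "Coke"}), confidence({"Milk"}, {"Coke"}))

# {Diaper, Milk} --> {Beer}: support 2/5 = 0.4, confidence 2/3
diaper_milk_beer = (
    support({"Diaper", "Milk", "Beer"}),
    confidence({"Diaper", "Milk"}, {"Beer"}),
)
```

A rule miner would enumerate candidate itemsets and keep only the rules whose support and confidence exceed chosen thresholds.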
  51. Association Rule Discovery: Application 2 (Slide 101)
      • Supermarket shelf management.
        – Goal: to identify items that are bought together by sufficiently many customers.
        – Approach: process the point-of-sale data collected with barcode scanners to find dependencies among items.
        – A classic rule:
          • If a customer buys diapers and milk, then he is very likely to buy beer.
          • So, don’t be surprised if you find six-packs stacked next to diapers!
      Regression (Slide 102)
      • To identify unknown values in a continuous domain
      • Build trend functions by fitting the known points (regression)
      • Different models
        – Linear regression (two variables)
          • $Y = q + mX$
        – Multi-linear regression (more variables)
          • $Y = q + m_1 X_1 + m_2 X_2 + m_3 X_3$
        – Non-linear regression (polynomial, exponential, logarithmic ...)
          • $Y = q + m_1 X + m_2 X^2 + m_3 X^3$
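The linear model Y = q + mX can be fitted to known points with ordinary least squares. Least squares is assumed here as the fitting criterion, since the slide does not state one, and the sample points are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of Y = q + m*X; returns (q, m)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    m = sxy / sxx            # slope
    q = mean_y - m * mean_x  # intercept
    return q, m

# Hypothetical known points, roughly on the line Y = 1 + 2X.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]
q, m = fit_line(xs, ys)

def predict(x):
    """Estimate the unknown value of Y for a new X."""
    return q + m * x
```

For these points the fit gives a slope of about 1.97 and an intercept of about 1.09. The multi-linear and polynomial models are fitted with the same criterion but need a small linear system instead of two closed-form sums.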
  52. Regression (Slide 103)
      • Example
      [Figure: regression example]
      Deviation Detection (Slide 104)
      • The search for outliers
      • Outlier: exception, element out of range
      • The search is based on the same principles as clustering
      • It concentrates on finding elements “far” from the others
      • Search methods
        – Statistical
          • Can be used if a statistical distribution can be estimated
        – Distance based
          • Search for elements that maximize the distance from the other elements of the set
        – Deviation based
          • Search for elements that maximize the deviance from the other elements of the set
      • Example: fraud detection
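The distance-based search method can be sketched by scoring each element with its average Euclidean distance from all the others and picking the maximum. The scoring rule and the transaction amounts (a nod to the fraud detection example) are assumptions made for illustration.

```python
import math

def outlier_scores(points):
    """Average Euclidean distance from each point to all the other points."""
    scores = []
    for i, p in enumerate(points):
        others = [q for j, q in enumerate(points) if j != i]
        scores.append(sum(math.dist(p, q) for q in others) / len(others))
    return scores

def most_outlying(points):
    """The element that maximizes the distance from the other elements."""
    scores = outlier_scores(points)
    best = max(range(len(points)), key=scores.__getitem__)
    return points[best]

# Hypothetical one-dimensional transaction amounts.
amounts = [(20.0,), (22.0,), (19.0,), (21.0,), (250.0,), (23.0,)]
suspect = most_outlying(amounts)  # the 250.0 transaction
```

This pairwise scan is quadratic in the number of elements, which is fine for a sketch but would need indexing or sampling on warehouse-sized data.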
