Session7part1.ppt

  • Start with a real-life scenario
  • CHECK ON THE PRODUCTS INTERESTING ALGORITHMS
  • Cognos and MicroStrategy are next in line. Market size: $1.4B in 1997, 40% growth from 1994-97, expected to be $3B in 2000. Source: http://www.olapreport.com/Market.htm
  • Each topic is a talk.
  • Absolute: $40M, expected to grow 10 times by 2000 (Forrester Research)

    1. Data warehousing and mining. Session VII (Part 1), 15:45 - 16:10. Sunita Sarawagi, School of IT, IIT Bombay
    2. Introduction
         • Organizations are getting larger and amassing ever-increasing amounts of data.
         • Historical data encodes useful information about the working of an organization.
         • However, the data is scattered across multiple sources, in multiple formats.
         • Data warehousing: the process of consolidating data in a centralized location.
         • Data mining: the process of analyzing data to find useful patterns and relationships.
    3. Typical data analysis tasks
         • Report the per-capita deposits broken down by region and profession.
         • Are deposits from rural coastal areas increasing over the last five years?
         • What percent of small business loans were cleared?
         • Why is it less than last year’s? How did similar businesses that did not take loans perform?
         • What should be the new rules for loan eligibility?
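The first task above is a plain group-by aggregate. A minimal sketch, assuming one record per customer with a region, a profession and a deposit amount (the records and the `per_capita_deposits` helper are hypothetical, not from the talk):

```python
from collections import defaultdict

# Hypothetical records: (region, profession, deposit amount), one row per customer.
accounts = [
    ("West", "Farmer", 1000.0),
    ("West", "Farmer", 3000.0),
    ("West", "Trader", 5000.0),
    ("North", "Farmer", 2000.0),
]

def per_capita_deposits(rows):
    """Average deposit per customer for each (region, profession) group."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for region, profession, amount in rows:
        totals[(region, profession)] += amount
        counts[(region, profession)] += 1
    return {key: totals[key] / counts[key] for key in totals}

report = per_capita_deposits(accounts)
for (region, profession), average in sorted(report.items()):
    print(f"{region:6} {profession:8} {average:10.2f}")
```

The later questions in the list (why a figure dropped, what the new eligibility rules should be) cannot be answered by a single aggregate like this, which is where OLAP navigation and mining come in.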
    4. [Figure: warehouse architecture. Operational data (Bombay branch, Delhi branch, Calcutta branch, census data, GIS data; sources such as Oracle, IMS, SAS) is merged, cleaned and summarized into a data warehouse of detailed transactional data (relational DBMS+, e.g. Redbrick). Decision support tools on top: direct query, reporting tools (Crystal Reports), OLAP (Essbase), mining tools (Intelligent Miner).]
    5. Data warehouse construction
         • Heterogeneous data integration
             • merge from various sources, fuzzy matches
             • remove inconsistencies
         • Data cleaning:
             • missing data, outliers, clean fields e.g. names/addresses
             • data mining techniques
         • Data loading: summarize, create indices
         • Products: Prism Warehouse Manager, Platinum Info Refiner, Info Pump, QDB, Vality
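Fuzzy matching of name fields during integration can be illustrated with the standard library. This is only a sketch under assumptions: the normalization rules, the `fuzzy_match` helper and the 0.85 threshold are illustrative choices, not what the listed products do.

```python
from difflib import SequenceMatcher

def normalize(name):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    kept = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in name.lower())
    return " ".join(kept.split())

def similarity(a, b):
    """Similarity in [0, 1] between two normalized names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

def fuzzy_match(name, candidates, threshold=0.85):
    """Return the most similar candidate at or above the threshold, else None."""
    best = max(candidates, key=lambda c: similarity(name, c), default=None)
    if best is not None and similarity(name, best) >= threshold:
        return best
    return None

# Hypothetical customer names from two branch databases.
print(fuzzy_match("R.K. Sharma", ["R K Sharma", "S. Mehta"]))
```

Comparing every pair of records this way is quadratic, so a real integration step would first group candidates (for example by city or by the first letters of the name) before scoring them.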
    6. Warehouse maintenance
         • Data refresh
             • when to refresh, and in what form to send updates?
         • Materialized view maintenance with batch updates
         • Query evaluation using materialized views
         • Monitoring and reporting tools
             • HP Intelligent Warehouse Advisor
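The idea behind maintaining a materialized view with batch updates is that the summary is patched from the batch of changes instead of being recomputed from all detailed data. A minimal sketch for a SUM-per-branch view (the view definition and both function names are assumptions made for illustration):

```python
from collections import defaultdict

def build_view(transactions):
    """Full recomputation: total amount per branch."""
    view = defaultdict(float)
    for branch, amount in transactions:
        view[branch] += amount
    return dict(view)

def apply_batch(view, inserted, deleted=()):
    """Incrementally refresh the SUM view from a batch of inserted and deleted rows."""
    refreshed = dict(view)
    for branch, amount in inserted:
        refreshed[branch] = refreshed.get(branch, 0.0) + amount
    for branch, amount in deleted:
        refreshed[branch] = refreshed.get(branch, 0.0) - amount
    return refreshed

# Hypothetical detailed data and one nightly batch of new rows.
base = [("Bombay", 100.0), ("Delhi", 50.0)]
batch = [("Bombay", 25.0), ("Calcutta", 10.0)]
view = apply_batch(build_view(base), batch)
print(view)
```

SUM and COUNT can be patched this way; a view over MIN or MAX cannot always be repaired after a deletion without going back to the detailed data, which is one reason view maintenance is listed as a topic of its own.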
    7. [Figure: warehouse architecture diagram, repeated from slide 4.]
    8. OLAP
         • Fast, interactive answers to large aggregate queries.
         • Multidimensional model: dimensions with hierarchies
             • Dim 1: Bank location: branch --> city --> state
             • Dim 2: Customer: sub-profession --> profession
             • Dim 3: Time: month --> quarter --> year
         • Measures: loan amount, #transactions, balance
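Rolling a measure up a dimension hierarchy amounts to mapping each member to its parent and re-aggregating. A small sketch for the bank-location dimension (the branch names, the two mapping tables and the fact rows are hypothetical):

```python
from collections import defaultdict

# Hypothetical hierarchy for the bank-location dimension: branch --> city --> state.
BRANCH_TO_CITY = {"Andheri": "Mumbai", "Dadar": "Mumbai", "Karol Bagh": "Delhi"}
CITY_TO_STATE = {"Mumbai": "Maharashtra", "Delhi": "Delhi"}

# Fact rows: (location member, month, loan amount).
facts = [
    ("Andheri", "1998-01", 10),
    ("Dadar", "1998-01", 20),
    ("Andheri", "1998-02", 5),
    ("Karol Bagh", "1998-01", 7),
]

def roll_up(rows, mapping):
    """Sum the measure after moving the location dimension one level up."""
    totals = defaultdict(int)
    for member, month, amount in rows:
        totals[(mapping[member], month)] += amount
    return [(parent, month, total) for (parent, month), total in sorted(totals.items())]

by_city = roll_up(facts, BRANCH_TO_CITY)
by_state = roll_up(by_city, CITY_TO_STATE)
print(by_city)
print(by_state)
```

Because a sum of sums is still a sum, the state level can be computed from the city level instead of from the raw facts, which is what makes precomputed aggregates useful for interactive response.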
    9. OLAP
         • Navigational operators: pivot, drill-down, roll-up, select.
         • Hypothesis-driven search, e.g. factors affecting defaulters:
             • view the defaulting rate by age, aggregated over the other dimensions
             • for a particular age segment, detail along profession
         • Needs interactive response to aggregate queries.
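The two-step defaulter analysis above can be mimicked on a toy table: first aggregate over everything except age, then select one age segment and drill down along profession. The loan records and the `default_rate` helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical loan records: (age segment, profession, defaulted?).
loans = [
    ("20-30", "Trader", True),
    ("20-30", "Trader", True),
    ("20-30", "Clerk", False),
    ("20-30", "Clerk", False),
    ("30-40", "Trader", False),
    ("30-40", "Clerk", False),
    ("30-40", "Clerk", True),
    ("30-40", "Trader", False),
]

def default_rate(rows, key):
    """Fraction of defaulted loans per group, where key(row) names the group."""
    defaults = defaultdict(int)
    counts = defaultdict(int)
    for row in rows:
        counts[key(row)] += 1
        defaults[key(row)] += row[2]  # True counts as 1
    return {group: defaults[group] / counts[group] for group in counts}

# Step 1: defaulting rate by age segment, aggregated over profession.
by_age = default_rate(loans, key=lambda row: row[0])
# Step 2: select the "20-30" segment and drill down along profession.
segment = [row for row in loans if row[0] == "20-30"]
by_profession = default_rate(segment, key=lambda row: row[1])
print(by_age)
print(by_profession)
```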
    10. OLAP products
         • About 30 OLAP vendors
         • Dominant ones:
             • Oracle Express: largest market share (20%)
             • Arbor Essbase: technology leader
             • Microsoft Plato: introduced late last year, rapidly taking over...
    11. Microsoft OLAP strategy
         • Plato: OLAP server; powerful, integrates various operational sources
         • OLE-DB for OLAP: emerging industry standard based on MDX, an extension of SQL for OLAP
         • Pivot-table services: integrate with Office 2000
             • Every desktop will have OLAP capability.
         • Client-side caching and calculations
         • Partitioned and virtual cubes
         • Hybrid relational and multidimensional storage
    12. Data mining
         • Process of semi-automatically analyzing large databases to find interesting and useful patterns
         • Overlaps with machine learning, statistics, artificial intelligence and databases, but
             • more scalable in number of features and instances
             • more automated to handle heterogeneous data
    13. Some basic operations
         • Predictive:
             • Regression
             • Classification
         • Descriptive:
             • Clustering / similarity matching
             • Association rules and variants
             • Deviation detection
    14. Classification
         • Given old data about customers and payments, predict a new applicant’s loan eligibility.
         • [Figure: records of previous customers (age, salary, profession, location, customer type) feed a classifier, which produces decision rules (Salary > 5 L, Prof. = Exec). The rules are applied to a new applicant’s data to predict good/bad.]
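To make the "previous customers to decision rules" step concrete, here is a minimal sketch that learns a single rule of the form "salary > t means good" from labelled history. It is a one-level decision rule (a stump), far simpler than a real classifier; the salaries (in lakhs) and labels are hypothetical.

```python
def learn_salary_rule(customers):
    """Pick the threshold t whose rule 'salary > t => good' misclassifies the fewest customers."""
    best_threshold, best_errors = None, len(customers) + 1
    for threshold in sorted({salary for salary, _ in customers}):
        errors = sum((salary > threshold) != is_good for salary, is_good in customers)
        if errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold

# Hypothetical previous customers: (salary in lakhs, repaid well?).
previous = [(2.0, False), (3.5, False), (5.0, False), (6.0, True), (8.0, True)]
threshold = learn_salary_rule(previous)

def predict(salary):
    """Apply the learned rule to a new applicant."""
    return salary > threshold

print(threshold, predict(7.0), predict(4.0))
```

A decision tree classifier repeats this kind of split search recursively over all attributes (profession, location and so on), which is how multi-condition rules arise.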
    15. Classification methods
         • Nearest neighbor
         • Regression (linear or any polynomial):
             • a*salary + b*age + c = eligibility score
         • Decision tree classifier
         • Probabilistic/generative models
         • Neural networks
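Two of the listed methods fit in a few lines. The sketch below shows a k-nearest-neighbor vote and the linear eligibility score from the slide; the training points, the labels and the coefficients a, b, c are all made-up values for illustration, and in practice the coefficients would be fitted to data.

```python
from collections import Counter
import math

def knn_predict(train, query, k=3):
    """Majority label among the k training points closest to the query."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

def eligibility_score(salary, age, a=1.0, b=0.1, c=-8.0):
    """Linear score a*salary + b*age + c; the coefficients here are hypothetical."""
    return a * salary + b * age + c

# Hypothetical previous customers: ((salary in lakhs, age), label).
train = [
    ((2.0, 25), "bad"), ((3.0, 30), "bad"), ((2.5, 45), "bad"),
    ((7.0, 35), "good"), ((8.0, 40), "good"), ((6.5, 28), "good"),
]

print(knn_predict(train, (7.5, 38)), eligibility_score(7.5, 38) > 0)
print(knn_predict(train, (2.5, 28)), eligibility_score(2.5, 28) > 0)
```

The features are left unscaled here, so age differences dominate the distance. Rescaling each feature to a comparable range is usually needed before nearest-neighbor methods give sensible results.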
    16. Clustering
         • Unsupervised learning: used when old data with class labels is not available, e.g. when introducing a new product.
         • Group/cluster existing customers based on the time series of their payment history, so that similar customers fall in the same cluster.
         • Key requirement: a good measure of similarity between instances.
         • Identify micro-markets and develop policies for each.
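As one possible illustration, the sketch below clusters short payment histories with k-means, using Euclidean distance as the similarity measure. Both choices are assumptions (the talk does not name an algorithm or a measure), and the histories are hypothetical fractions of dues paid on time per quarter.

```python
import math

def kmeans(points, k, iterations=20):
    """Plain k-means with a naive initialization: the first k points are the initial centroids."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assign every point to its closest centroid.
        new_labels = [min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in points]
        # Move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, label in zip(points, new_labels) if label == j]
            if members:
                centroids[j] = [sum(column) / len(members) for column in zip(*members)]
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centroids

# Hypothetical payment histories, one tuple per customer.
histories = [(1.0, 1.0, 1.0), (0.9, 1.0, 0.9), (0.2, 0.1, 0.0), (0.1, 0.2, 0.1)]
labels, centroids = kmeans(histories, k=2)
print(labels)
```

Euclidean distance compares histories position by position. Time series that are similar but shifted in time would need a different measure, which is the point of the "key requirement" bullet.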
    17. Association rules
         • Given a set T of groups of items
         • Example: set of item sets purchased
         • Goal: find all rules on itemsets of the form a --> b such that
             • support of a and b > user threshold s
             • conditional probability (confidence) of b given a > user threshold c
         • Example: Milk --> bread
         • Purchase of product A --> service B
         • [Figure: example set T of purchased item sets: {Milk, cereal}, {Tea, milk}, {Tea, rice, bread}, {cereal}]
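Support and confidence as defined above can be computed directly on a small example. This sketch reads the slide's example T as four baskets (an assumption about the flattened figure) and only enumerates rules between single items. It is a brute-force illustration, not a scalable algorithm such as Apriori, and the thresholds are arbitrary.

```python
from itertools import permutations

def support(transactions, items):
    """Fraction of transactions that contain every item in items."""
    items = set(items)
    return sum(items <= basket for basket in transactions) / len(transactions)

def find_rules(transactions, min_support, min_confidence):
    """All single-item rules a --> b with support > min_support and confidence > min_confidence."""
    all_items = sorted(set().union(*transactions))
    rules = []
    for a, b in permutations(all_items, 2):
        both = support(transactions, {a, b})
        if both > min_support and both / support(transactions, {a}) > min_confidence:
            rules.append((a, b))
    return rules

# Assumed reading of the slide's example set T.
T = [{"milk", "cereal"}, {"tea", "milk"}, {"tea", "rice", "bread"}, {"cereal"}]
rules = find_rules(T, min_support=0.2, min_confidence=0.6)
print(rules)
```

On this tiny T the rule milk --> cereal fails the confidence test: milk appears in two baskets but only one of them also contains cereal, giving a confidence of 0.5.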
    18. Mining market
         • Around 20 to 30 mining tool vendors
         • Major players:
             • Clementine
             • IBM’s Intelligent Miner
             • SGI’s MineSet
             • SAS’s Enterprise Miner
         • All offer pretty much the same set of tools
         • Many embedded products: fraud detection, electronic commerce applications
    19. Conclusions
         • The value of warehousing and mining in effective decision making based on concrete evidence from old data
         • Challenges of heterogeneity and scale in warehouse construction and maintenance
         • Grades of data analysis tools: straight querying, reporting tools, multidimensional analysis and mining
