Session7part1
  • Data Warehousing & Mining Dr. Sunita Sarawagi
  • Start with a real-life scenario
  • CHECK ON THE PRODUCTS INTERESTING ALGORITHMS
  • Cognos and MicroStrategy next in line. 1.4B in 1997, 40% growth from 1994-97, expected to be 3B in 2000. Source: http://www.olapreport.com/Market.htm
  • Each topic is a talk.
  • Absolute: 40M$, expected to grow 10 times by 2000 (Forrester Research)

Presentation Transcript

  • Data warehousing and mining. Session VII (Part 1), 15:45 - 16:10. Sunita Sarawagi, School of IT, IIT Bombay
  • Introduction
    • Organizations are getting larger and amassing ever-increasing amounts of data
    • Historic data encodes useful information about the working of an organization.
    • However, the data is scattered across multiple sources, in multiple formats.
    • Data warehousing: process of consolidating data in a centralized location
    • Data mining: process of analyzing data to find useful patterns and relationships
    Dr. Sunita Sarawagi Data Warehousing & Mining
  • Typical data analysis tasks
    • Report the per-capita deposits broken down by region and profession.
    • Are deposits from rural coastal areas increasing over the last five years?
    • What percent of small business loans were cleared?
    • Why is it less than last year’s? How did similar businesses that did not take loans perform?
    • What should be the new rules for loan eligibility?
  • [Figure: warehouse architecture. Operational data from the Bombay, Delhi, and Calcutta branches, plus census and GIS data, is merged, cleaned, and summarized into a data warehouse of detailed transactional data, which feeds direct query, reporting tools, OLAP, and mining tools for decision support. Products named in the figure: Oracle, SAS, IMS, relational DBMS+ (e.g. Redbrick), Crystal Reports, Essbase, Intelligent Miner.]
  • Data warehouse construction
    • Heterogeneous data integration
      • merge from various sources, fuzzy matches
      • remove inconsistencies
    • Data cleaning:
      • missing data, outliers, clean fields e.g. names/addresses
      • Data mining techniques
    • Data loading: summarize, create indices
    • Products: Prism warehouse manager, Platinum info refiner, info pump, QDB, Vality
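The "fuzzy matches" step in heterogeneous integration can be illustrated with a minimal sketch. It assumes customer names are the join key and uses the standard-library `difflib.SequenceMatcher` as the similarity measure; the function names and the 0.85 threshold are illustrative choices, not something the slides prescribe.

```python
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    kept = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in name.lower())
    return " ".join(kept.split())


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two cleaned names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def fuzzy_merge(left, right, threshold=0.85):
    """Pair each name in `left` with its best match in `right`.

    A pair is kept only if its similarity reaches `threshold`
    (an assumed cut-off that would need tuning on real data).
    """
    if not right:
        return []
    matches = []
    for name in left:
        best = max(right, key=lambda candidate: similarity(name, candidate))
        if similarity(name, best) >= threshold:
            matches.append((name, best))
    return matches
```

Comparing every pair is quadratic in the number of records, so a real warehouse load would likely block candidates first (for example by city or first letter) before scoring them.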
  • Warehouse maintenance
    • Data refresh
      • when to refresh, and in what form to send updates?
    • Materialized view maintenance with batch updates.
    • Query evaluation using materialized views
    • Monitoring and reporting tools
      • HP intelligent warehouse advisor
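Materialized view maintenance with batch updates can be sketched as follows, assuming the view is a per-branch summary that stores the running sum and count of deposits. The view layout and the function names are hypothetical; the point is only that a batch of inserted and deleted rows can be folded into the summary without rescanning the base data.

```python
from collections import defaultdict


def build_view(rows):
    """Compute the summary from scratch: branch -> [sum, count]."""
    view = defaultdict(lambda: [0.0, 0])
    for branch, amount in rows:
        view[branch][0] += amount
        view[branch][1] += 1
    return view


def apply_batch(view, inserted=(), deleted=()):
    """Fold one batch of base-table changes into the existing view."""
    for branch, amount in inserted:
        view[branch][0] += amount
        view[branch][1] += 1
    for branch, amount in deleted:
        view[branch][0] -= amount
        view[branch][1] -= 1
        if view[branch][1] == 0:
            del view[branch]  # no rows left for this branch


def average(view, branch):
    """Averages are derived from the stored sum and count."""
    total, count = view[branch]
    return total / count
```

Storing sum and count rather than the average itself is what makes the summary incrementally maintainable; aggregates such as max or min need extra care on deletions.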
  • OLAP
    • Fast, interactive answers to large aggregate queries.
    • Multidimensional model: dimensions with hierarchies
      • Dim 1: Bank location:
        • branch --> city --> state
      • Dim 2: Customer:
        • sub profession --> profession
      • Dim 3: Time:
        • month --> quarter --> year
    • Measures: loan amount, #transactions, balance
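The multidimensional model above can be sketched as a small roll-up over two of the dimensions. The branch names, the hierarchy tables, and the `"YYYY-MM"` month format are made-up assumptions for illustration; the measure is the loan amount.

```python
from collections import defaultdict

# Hypothetical location hierarchy: branch --> city --> state.
BRANCH_TO_CITY = {"Andheri": "Bombay", "Dadar": "Bombay", "Karol Bagh": "Delhi"}
CITY_TO_STATE = {"Bombay": "Maharashtra", "Delhi": "Delhi"}


def location_at(branch, level):
    """Map a branch to the requested level of the location hierarchy."""
    if level == "branch":
        return branch
    city = BRANCH_TO_CITY[branch]
    return city if level == "city" else CITY_TO_STATE[city]


def time_at(month, level):
    """Map a 'YYYY-MM' month to the requested level: month, quarter, or year."""
    year, mm = month.split("-")
    if level == "month":
        return month
    if level == "quarter":
        return f"{year}-Q{(int(mm) - 1) // 3 + 1}"
    return year


def roll_up(facts, loc_level, time_level):
    """Sum loan amounts over (location, time) at the chosen hierarchy levels."""
    totals = defaultdict(float)
    for branch, month, amount in facts:
        totals[(location_at(branch, loc_level), time_at(month, time_level))] += amount
    return dict(totals)
```

Calling `roll_up(facts, "city", "quarter")` and then `roll_up(facts, "state", "year")` on the same fact list corresponds to moving up one level in each hierarchy.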
  • OLAP
    • Navigational operators: Pivot, drill-down, roll-up, select.
    • Hypothesis driven search: E.g. factors affecting defaulters
      • view defaulting rate on age aggregated over other dimensions
      • for a particular age segment, drill down along profession
    • Need interactive response to aggregate queries.
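The hypothesis-driven search on defaulters can be sketched as two calls to one aggregate function: first the defaulting rate by age segment over all other dimensions, then a select on one segment with detail along profession. The record layout (`age_segment`, `profession`, `defaulted`) is an assumption made for this example.

```python
from collections import defaultdict


def default_rate(loans, group_by, where=None):
    """Defaulting rate per value of `group_by`.

    loans: list of dicts with the grouping fields and a boolean 'defaulted'.
    where: optional dict of field -> value used to select a slice first.
    """
    counts = defaultdict(lambda: [0, 0])  # key -> [defaults, total]
    for loan in loans:
        if where and any(loan[field] != value for field, value in where.items()):
            continue
        key = loan[group_by]
        counts[key][0] += int(loan["defaulted"])
        counts[key][1] += 1
    return {key: defaults / total for key, (defaults, total) in counts.items()}
```

A typical session would call `default_rate(loans, "age_segment")`, spot a segment with a high rate, and then call `default_rate(loans, "profession", where={"age_segment": segment})` to drill down. An OLAP server answers such queries from precomputed aggregates rather than by scanning, which is what makes the interaction fast.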
  • OLAP products
    • About 30 OLAP vendors
    • Dominant ones:
      • Oracle Express: largest market share: 20%
      • Arbor Essbase: technology leader
      • Microsoft Plato: introduced late last year, rapidly taking over...
  • Microsoft OLAP strategy
    • Plato: OLAP server: powerful, integrating various operational sources
    • OLE-DB for OLAP: emerging industry standard based on MDX, an SQL-like query language for OLAP
    • Pivot-table services: integrate with Office 2000
      • Every desktop will have OLAP capability.
    • Client side caching and calculations
    • Partitioned and virtual cubes
    • Hybrid relational and multidimensional storage
  • Data mining
    • Process of semi-automatically analyzing large databases to find interesting and useful patterns
    • Overlaps with machine learning, statistics, artificial intelligence and databases but
      • more scalable in number of features and instances
      • more automated to handle heterogeneous data
  • Some basic operations
    • Predictive:
      • Regression
      • Classification
    • Descriptive:
      • Clustering / similarity matching
      • Association rules and variants
      • Deviation detection
  • Classification
    • Given old data about customers and payments, predict new applicant’s loan eligibility.
    [Figure: records of previous customers (age, salary, profession, location, customer type) train a classifier, which produces decision rules (e.g. Salary > 5 L, Prof. = Exec). A new applicant's data is then classified as good/bad.]
  • Classification methods
    • Nearest neighbor
    • Regression (linear or any polynomial):
      • a*salary + b*age + c = eligibility score.
    • Decision tree classifier
    • Probabilistic/generative models
    • Neural networks
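Of the methods listed, nearest neighbor is the simplest to sketch. The version below assumes each previous customer is a numeric feature tuple such as (age, salary in lakhs) with a good/bad label, and uses plain Euclidean distance with a majority vote over the k closest records; the feature choice and k=3 are illustrative assumptions.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Predict a label for `query` by majority vote of its k nearest neighbours.

    train: list of (features, label) pairs, where features is a numeric tuple
           of the same length as `query`.
    """
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```

Because distance is computed on raw values, a feature with a large range (for example salary in rupees) would dominate one with a small range (age), so features are usually rescaled before use.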
  • Clustering
    • Unsupervised learning, used when old data with class labels is not available, e.g. when introducing a new product.
    • Group/cluster existing customers based on time series of payment history such that similar customers fall in the same cluster.
    • Key requirement: Need a good measure of similarity between instances.
    • Identify micro-markets and develop policies for each
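As one possible instance of this, the sketch below runs a basic k-means over payment histories, assuming each customer is a fixed-length tuple (for example, days late in each month) and that Euclidean distance is an acceptable similarity measure. The slides do not name a clustering algorithm, so k-means and the caller-supplied starting centroids are choices made for this example.

```python
import math


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    return tuple(sum(column) / len(vectors) for column in zip(*vectors))


def k_means(points, centroids, iterations=10):
    """Assign points to the nearest centroid, then recompute the centroids.

    Returns (centroids, clusters), where clusters[i] holds the points
    assigned to centroid i in the last assignment step.
    """
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for point in points:
            distances = [math.dist(point, c) for c in centroids]
            clusters[distances.index(min(distances))].append(point)
        # An empty cluster keeps its previous centroid.
        centroids = [
            mean_vector(members) if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
    return centroids, clusters
```

For real payment series the similarity measure matters more than the algorithm: customers whose late payments are merely shifted by a month look far apart under Euclidean distance, which is why the slide stresses a good similarity measure.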
  • Association rules
    • Given set T of groups of items
    • Example: set of item sets purchased
    • Goal: find all rules on itemsets of the form a-->b such that
      • support of a and b > user threshold s
      • conditional probability (confidence) of b given a > user threshold c
    • Example: Milk --> bread
    • Purchase of product A --> service B
    [Figure: example set T of purchased item sets over the items milk, cereal, tea, rice, and bread, e.g. {Milk, cereal}, {Tea, milk}.]
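The support and confidence conditions can be made concrete with a short sketch restricted to rules between single items. Transactions are assumed to be Python sets, and the thresholds s and c are passed in as `min_support` and `min_confidence`; this is a brute-force enumeration for illustration, not a scalable mining algorithm.

```python
from itertools import permutations


def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)


def pair_rules(transactions, min_support, min_confidence):
    """All rules a --> b between single items that pass both thresholds.

    Returns tuples (a, b, support of {a, b}, confidence of b given a).
    Assumes `transactions` is a non-empty list of sets.
    """
    all_items = sorted(set().union(*transactions))
    rules = []
    for a, b in permutations(all_items, 2):
        s_ab = support(transactions, {a, b})
        confidence = s_ab / support(transactions, {a})
        if s_ab > min_support and confidence > min_confidence:
            rules.append((a, b, s_ab, confidence))
    return rules
```

Note that a --> b and b --> a share the same support but generally differ in confidence, so both directions are checked.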
  • Mining market
    • Around 20 to 30 mining tool vendors
    • Major players:
      • Clementine,
      • IBM’s Intelligent Miner,
      • SGI’s MineSet,
      • SAS’s Enterprise Miner.
    • All offer pretty much the same set of tools
    • Many embedded products: fraud detection, electronic commerce applications
  • Conclusions
    • The value of warehousing and mining in effective decision making based on concrete evidence from old data
    • Challenges of heterogeneity and scale in warehouse construction and maintenance
    • Grades of data analysis tools: straight querying, reporting tools, multidimensional analysis and mining.