Data warehousing and mining

  Session VII (Part 1) 15:45 - 16:10

         Sunita Sarawagi
     School of IT, IIT Bombay
Introduction
• Organizations are getting larger and amassing
  ever-increasing amounts of data
• Historic data encodes useful information about
  the working of an organization.
• However, the data is scattered across multiple
  sources, in multiple formats.
• Data warehousing: process of consolidating data
  in a centralized location
• Data mining: process of analyzing data to find
  useful patterns and relationships
Dr. Sunita Sarawagi    Data Warehousing & Mining     2
Typical data analysis tasks
• Report the per-capita deposits broken down by
  region and profession.
• Are deposits from rural coastal areas increasing
  over the last five years?
• What percent of small business loans were cleared?
• Why is it less than last year’s? How did similar
  businesses that did not take loans perform?
• What should be the new rules for loan eligibility?

Decision support tools
[Figure: three-layer decision support architecture]
• Top layer, decision support tools: direct query; reporting tools
  (e.g. Crystal Reports); OLAP (e.g. Essbase); mining tools
  (e.g. Intelligent Miner)
• Middle layer, data warehouse: a relational DBMS+ (e.g. Redbrick),
  fed by merge, clean and summarize steps
• Bottom layer, operational data: detailed transactional data from the
  Bombay, Delhi and Calcutta branches, plus GIS data and Census data
  (source systems shown: Oracle, IMS, SAS)
Data warehouse construction
• Heterogeneous data integration
     – merge from various sources, fuzzy matches
     – remove inconsistencies
• Data cleaning:
     – missing data, outliers, clean fields e.g. names/addresses
     – Data mining techniques
• Data loading: summarize, create indices
• Products: Prism Warehouse Manager, Platinum InfoRefiner,
   InfoPump, QDB, Vality
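The fuzzy-match step in heterogeneous integration can be sketched as below. This is a minimal illustration, not the method of any product listed above: it assumes customer names are the matching key, uses Python's standard `difflib.SequenceMatcher` as the similarity measure, and the 0.85 threshold, the function names and the sample names are all hypothetical.

```python
from difflib import SequenceMatcher

def normalize(name: str) -> str:
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    kept = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in name.lower())
    return " ".join(kept.split())

def similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1] between two normalized strings."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

def fuzzy_merge(left, right, threshold=0.85):
    """Pair each left record with its best right match at or above threshold."""
    pairs = []
    for l in left:
        best = max(right, key=lambda r: similarity(l, r), default=None)
        if best is not None and similarity(l, best) >= threshold:
            pairs.append((l, best))
    return pairs

# Hypothetical customer names from two branch databases.
bombay = ["Ramesh K. Gupta", "Anita Desai"]
delhi = ["ramesh k gupta", "Sunil Mehta"]
matches = fuzzy_merge(bombay, delhi)
print(matches)
```

Normalizing before comparing removes the cheap inconsistencies (case, punctuation), so the threshold only has to absorb genuine spelling variation.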

Warehouse maintenance
• Data refresh
     – when to refresh? in what form to send updates?
• Materialized view maintenance with batch
  updates.
• Query evaluation using materialized views
• Monitoring and reporting tools
     – HP Intelligent Warehouse Advisor
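Materialized view maintenance with batch updates can be illustrated with a small sketch. It assumes the view is a per-branch SUM, which can be maintained from the batch of changes alone without rescanning the base data; the class name, method name and figures are hypothetical.

```python
from collections import defaultdict

class SumView:
    """Materialized view: total deposit amount per branch."""

    def __init__(self):
        self.totals = defaultdict(float)

    def apply_batch(self, inserts=(), deletes=()):
        """Fold a batch of (branch, amount) changes into the stored totals."""
        for branch, amount in inserts:
            self.totals[branch] += amount
        for branch, amount in deletes:
            self.totals[branch] -= amount

view = SumView()
# Initial load, then one refresh batch with an insert and a delete.
view.apply_batch(inserts=[("Bombay", 100.0), ("Delhi", 50.0), ("Bombay", 25.0)])
view.apply_batch(inserts=[("Delhi", 10.0)], deletes=[("Bombay", 25.0)])

# Query evaluation using the view: the grand total is read from the
# stored aggregates, not from the detailed transactions.
grand_total = sum(view.totals.values())
print(dict(view.totals), grand_total)
```

SUM and COUNT can be updated this way; aggregates such as MIN or MAX may need the base data again when the current extreme value is deleted.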

OLAP
Fast, interactive answers to large aggregate queries.
• Multidimensional model: dimensions with hierarchies
     – Dim 1: Bank location:
          • branch --> city --> state
     – Dim 2: Customer:
          • sub-profession --> profession
     – Dim 3: Time:
          • month --> quarter --> year
• Measures: loan amount, #transactions, balance
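A roll-up along the location hierarchy (branch --> city --> state) can be sketched as follows. The branch names, the mapping tables and the loan figures are hypothetical, and a real OLAP server would precompute and index such aggregates rather than scan the facts each time.

```python
from collections import defaultdict

# Hypothetical location hierarchy: branch -> city -> state.
BRANCH_TO_CITY = {"Andheri": "Bombay", "Dadar": "Bombay", "Karol Bagh": "Delhi"}
CITY_TO_STATE = {"Bombay": "Maharashtra", "Delhi": "Delhi"}

# Fact rows: (branch, profession, month, loan_amount).
FACTS = [
    ("Andheri", "Teacher", "1998-01", 200),
    ("Dadar", "Teacher", "1998-02", 300),
    ("Dadar", "Farmer", "1998-02", 100),
    ("Karol Bagh", "Farmer", "1998-03", 400),
]

def roll_up(facts, level):
    """Total loan amount by location at level 'branch', 'city' or 'state'."""
    totals = defaultdict(int)
    for branch, _profession, _month, amount in facts:
        key = branch
        if level in ("city", "state"):
            key = BRANCH_TO_CITY[key]
        if level == "state":
            key = CITY_TO_STATE[key]
        totals[key] += amount
    return dict(totals)

by_city = roll_up(FACTS, "city")
by_state = roll_up(FACTS, "state")
print(by_city, by_state)
```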
OLAP
• Navigational operators: Pivot, drill-down,
  roll-up, select.
• Hypothesis-driven search: e.g. factors
  affecting defaulters
     – view defaulting rate by age, aggregated over the other
       dimensions
     – for a particular age segment, drill down along profession
• Need interactive response to aggregate queries.
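The hypothesis-driven search above can be walked through in code: first the defaulting rate by age, aggregated over everything else, then a select on one age segment with a drill-down along profession. The loan records, age segments and function name are hypothetical.

```python
from collections import defaultdict

# Hypothetical loan records: (age_segment, profession, defaulted).
LOANS = [
    ("20-30", "Trader", True),
    ("20-30", "Trader", True),
    ("20-30", "Teacher", False),
    ("20-30", "Teacher", False),
    ("30-40", "Trader", False),
    ("30-40", "Teacher", False),
    ("30-40", "Teacher", True),
    ("30-40", "Trader", False),
]

def default_rate(loans, group_by, where=lambda row: True):
    """Defaulting rate per group; group_by picks the group key from a row."""
    counts = defaultdict(lambda: [0, 0])  # key -> [defaults, total]
    for row in loans:
        if where(row):
            key = group_by(row)
            counts[key][0] += row[2]
            counts[key][1] += 1
    return {k: d / n for k, (d, n) in counts.items()}

# Step 1: rate by age, aggregated over all other dimensions.
by_age = default_rate(LOANS, group_by=lambda r: r[0])
# Step 2: select the 20-30 segment and drill down along profession.
detail = default_rate(LOANS, group_by=lambda r: r[1], where=lambda r: r[0] == "20-30")
print(by_age, detail)
```

Each step is one aggregate query over the same facts, which is why interactive response times matter for this style of exploration.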
OLAP products
• About 30 OLAP vendors
• Dominant ones:
     – Oracle Express: largest market share: 20%
     – Arbor Essbase: technology leader
     – Microsoft Plato: introduced late last year,
       rapidly taking over...



Microsoft OLAP strategy
• Plato: OLAP server: powerful, integrating various
  operational sources
• OLE DB for OLAP: emerging industry standard based on
  MDX (Multidimensional Expressions), an SQL-like query language for OLAP
• PivotTable Service: integrates with Office 2000
     – Every desktop will have OLAP capability.
• Client-side caching and calculations
• Partitioned and virtual cubes
• Hybrid relational and multidimensional storage
Data mining
• Process of semi-automatically analyzing large
  databases to find interesting and useful
  patterns
• Overlaps with machine learning, statistics,
  artificial intelligence and databases but
     – more scalable in number of features and instances
     – more automated to handle heterogeneous data

Some basic operations
• Predictive:
      – Regression
      – Classification
• Descriptive:
      – Clustering / similarity matching
      – Association rules and variants
      – Deviation detection


Classification
• Given old data about customers and payments,
  predict new applicant’s loan eligibility.
[Figure: previous customers' records (age, salary, profession, location,
customer type) train a classifier; the resulting decision rules
(e.g. Salary > 5 L, Prof. = Exec) label a new applicant's data good/bad]
Classification methods
• Nearest neighbor
• Regression: (linear or any polynomial)
     – a*salary + b*age + c = eligibility score.
• Decision tree classifier
• Probabilistic/generative models
• Neural networks
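Two of the methods listed above, nearest neighbor and a linear eligibility score, can be sketched in a few lines. The training records, the division of age by 10 to make the features comparable, and the coefficients a, b, c are all made-up assumptions for illustration; in practice the coefficients would be fitted to old data.

```python
import math

# Hypothetical previous customers: (salary in lakhs, age) -> label.
PREVIOUS = [
    ((8.0, 45), "good"),
    ((6.5, 38), "good"),
    ((1.5, 23), "bad"),
    ((2.0, 52), "bad"),
]

def nearest_neighbor(applicant, training=PREVIOUS):
    """Return the label of the closest previous customer (1-NN)."""
    def dist(a, b):
        # Age is divided by 10 so both features have a comparable range.
        return math.hypot(a[0] - b[0], (a[1] - b[1]) / 10)
    _features, label = min(training, key=lambda item: dist(item[0], applicant))
    return label

def linear_score(salary, age, a=1.0, b=0.05, c=-5.0):
    """Eligibility score a*salary + b*age + c with illustrative coefficients."""
    return a * salary + b * age + c

print(nearest_neighbor((7.0, 40)), linear_score(7.0, 40))
print(nearest_neighbor((1.8, 30)), linear_score(1.5, 23))
```

A positive score could be read as eligible and a negative one as not eligible, though where to put that cut-off is itself a modelling decision.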

Clustering
• Unsupervised learning: used when old data with class
  labels is not available, e.g. when introducing a new
  product.
• Group/cluster existing customers based on time
  series of payment history so that similar customers
  fall in the same cluster.
• Key requirement: Need a good measure of similarity
  between instances.
• Identify micro-markets and develop policies for each
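Grouping customers by payment history can be sketched with plain k-means. The slide does not name a clustering algorithm or a similarity measure, so both choices here are assumptions: Euclidean distance between equal-length payment series, and k-means with hand-picked starting centers. The payment figures are hypothetical.

```python
import math

def distance(a, b):
    """Euclidean distance between two equal-length payment histories."""
    return math.dist(a, b)

def k_means(points, centers, rounds=10):
    """Plain k-means: assign each point to its nearest center, then re-average."""
    for _ in range(rounds):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: distance(p, centers[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous center.
        centers = [
            tuple(sum(col) / len(c) for col in zip(*c)) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return clusters

# Hypothetical monthly payments (in thousands) for six customers.
HISTORIES = [
    (10, 10, 11), (9, 10, 10), (10, 11, 10),  # steady payers
    (10, 2, 0), (9, 1, 0), (11, 0, 1),        # payments drop off
]
clusters = k_means(HISTORIES, centers=[HISTORIES[0], HISTORIES[3]])
print(clusters)
```

The result depends heavily on the distance function, which is the point of the key requirement above: a different similarity measure would produce different micro-markets.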
Association rules
• Given set T of groups of items
• Example: set of item sets purchased
     T:
       Milk, cereal
       Tea, milk
       Tea, rice, bread
       cereal
• Goal: find all rules on itemsets of the
  form a --> b such that
     – support of a and b > user threshold s
     – conditional probability (confidence) of b
       given a > user threshold c
• Example: Milk --> bread
• Purchase of product A --> service B
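Support and confidence, as defined above, can be computed directly on a small T. The baskets follow the slide's table as best it can be read; treating "cereal" as a separate fourth basket is an assumption, and the function names are hypothetical.

```python
# Transactions T: each basket is a set of items.
T = [
    {"milk", "cereal"},
    {"tea", "milk"},
    {"tea", "rice", "bread"},
    {"cereal"},  # assumed fourth basket
]

def support(itemset, transactions=T):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(a, b, transactions=T):
    """Estimated P(b | a) = support(a and b) / support(a).

    Raises ZeroDivisionError if no transaction contains a.
    """
    return support(set(a) | set(b), transactions) / support(a, transactions)

print(support({"milk"}), support({"milk", "cereal"}))
print(confidence({"milk"}, {"cereal"}), confidence({"tea"}, {"bread"}))
```

A rule a --> b would be reported only if both values clear the user thresholds s and c. Practical miners avoid enumerating every itemset by pruning on support first.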
Mining market
• Around 20 to 30 mining tool vendors
• Major players:
     – Clementine,
     – IBM’s Intelligent Miner,
     – SGI’s MineSet,
     – SAS’s Enterprise Miner.
• All offer pretty much the same set of tools
• Many embedded products: fraud detection, electronic
   commerce applications
Conclusions
• The value of warehousing and mining in
  effective decision making based on concrete
  evidence from old data
• Challenges of heterogeneity and scale in
  warehouse construction and maintenance
• Grades of data analysis tools: straight
  querying, reporting tools, multidimensional
  analysis and mining.

Editor's Notes

  1. Start with a real-life scenario
  2. CHECK ON THE PRODUCTS INTERESTING ALGORITHMS
  3. Cognos and MicroStrategy next in line. 1.4B in 1997, 40% growth from 1994-97, expected to be 3B in 2000. Source: http://www.olapreport.com/Market.htm
  4. Each topic is a talk..
  5. Absolute: 40M$, expected to grow 10 times by 2000 (Forrester Research)