Data Warehousing/Mining 1
Data Warehousing/Mining
Comp 150
Data Warehousing Introduction
(not in book)
Instructor: Dan Hebert
Data Warehousing/Mining 2
Outline of Lecture
• Data Warehousing and Information Integration
• Brief History of Data Warehousing
• What is a Data Warehouse?
• Types of Data and Their Uses
• Data Warehouse Architectures
• Issues in Data Warehousing
Data Warehousing/Mining 3
Problem: Heterogeneous Information Sources
“Heterogeneities are everywhere”
• Different interfaces
• Different data representations
• Duplicate and inconsistent information
[Figure: heterogeneous sources: Personal Databases, Digital Libraries, Scientific Databases, World Wide Web]
Data Warehousing/Mining 4
Problem: Data Management in Large Enterprises
• Vertical fragmentation of informational systems (vertical stove pipes)
• Result of application (user)-driven development of operational systems
[Figure: one stove pipe per department (Sales, Administration, Finance, Manufacturing, ...), each with its own applications, e.g., Sales Planning, Stock Mngmt, Suppliers, Debt Mngmt, Num. Control, Inventory]
Data Warehousing/Mining 5
Goal: Unified Access to Data
Integration System
• Collects and combines information
• Provides integrated view, uniform user interface
• Supports sharing
[Figure: the integration system sits above the World Wide Web, Digital Libraries, Scientific Databases, and Personal Databases]
Data Warehousing/Mining 6
The Traditional Research Approach
• Query-driven (lazy, on-demand)
[Figure: clients query an integration system (with metadata), which reaches each source through its own wrapper]
Data Warehousing/Mining 7
Disadvantages of Query-Driven Approach
• Delay in query processing
– Slow or unavailable information sources
– Complex filtering and integration
• Inefficient and potentially expensive for frequent queries
• Competes with local processing at sources
• Hasn’t caught on in industry
Data Warehousing/Mining 8
The Warehousing Approach
• Information integrated in advance
• Stored in warehouse for direct querying and analysis
[Figure: each source has an extractor/monitor that feeds an integration system (with metadata); the integration system loads the data warehouse, which clients query]
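The difference between the query-driven and warehousing approaches can be illustrated with a small sketch. The `Source` and `Warehouse` classes and the sample rows are hypothetical stand-ins, not part of the lecture; the point is only that the lazy approach contacts every source on every client query, while the warehouse extracts once and then answers locally.

```python
class Source:
    """Stand-in for an autonomous information source (hypothetical)."""

    def __init__(self, rows):
        self.rows = rows
        self.queries_served = 0  # how often this source was contacted

    def fetch(self):
        self.queries_served += 1
        return list(self.rows)


def query_driven_total(sources, customer):
    """Lazy approach: contact every source each time a client asks."""
    return sum(row["amount"]
               for source in sources
               for row in source.fetch()
               if row["customer"] == customer)


class Warehouse:
    """Eager approach: integrate once in advance, then answer locally."""

    def __init__(self, sources):
        self.rows = [row for source in sources for row in source.fetch()]

    def total(self, customer):
        return sum(row["amount"] for row in self.rows
                   if row["customer"] == customer)


sales = Source([{"customer": "Jane Doe", "amount": 120}])
web = Source([{"customer": "Jane Doe", "amount": 30},
              {"customer": "John Roe", "amount": 75}])

# Three client queries, query-driven: the source is hit three times.
lazy_answers = [query_driven_total([sales, web], "Jane Doe") for _ in range(3)]
hits_after_lazy = sales.queries_served

# Three client queries against a warehouse: the source is hit once, at load time.
warehouse = Warehouse([sales, web])
eager_answers = [warehouse.total("Jane Doe") for _ in range(3)]
hits_for_warehouse = sales.queries_served - hits_after_lazy
```

The warehouse answers are only as current as the last load, which is the trade-off the next slide names.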
Data Warehousing/Mining 9
Advantages of Warehousing Approach
• High query performance
– But not necessarily most current information
• Doesn’t interfere with local processing at sources
– Complex queries at warehouse
– OLTP at information sources
• Information copied at warehouse
– Can modify, annotate, summarize, restructure, etc.
– Can store historical information
– Security, no auditing
• Has caught on in industry
Data Warehousing/Mining 10
Not Either-Or Decision
• Query-driven approach still better for
– Rapidly changing information
– Rapidly changing information sources
– Truly vast amounts of data from large numbers of sources
– Clients with unpredictable needs
Data Warehousing/Mining 11
Data Warehouse Evolution
[Figure: timeline with a TIME axis marked 1960, 1975, 1980, 1985, 1990, 1995, 2000. Era labels: “Prehistoric Times”, “Middle Ages”, Data Revolution, Information-Based Management. Milestones shown: Relational Databases; PCs and Spreadsheets; End-user Interfaces; 1st DW Article; Data Replication Tools; “Building the DW”, Inmon (1992); DW Confs.; Vendor DW Frameworks; Company DWs]
Data Warehousing/Mining 12
What is a Data Warehouse?
A Practitioner’s Viewpoint
“A data warehouse is simply a single, complete, and consistent store of data obtained from a variety of sources and made available to end users in a way they can understand and use it in a business context.”
-- Barry Devlin, IBM Consultant
Data Warehousing/Mining 13
A Data Warehouse is...
• Stored collection of diverse data
– A solution to data integration problem
– Single repository of information
• Subject-oriented
– Organized by subject, not by application
– Used for analysis, data mining, etc.
• Optimized differently from transaction-oriented db
• User interface aimed at executives
Data Warehousing/Mining 14
A Data Warehouse is... (continued)
• Large volume of data (GB, TB)
• Non-volatile
– Historical
– Time attributes are important
• Updates infrequent
• May be append-only
• Examples
– All transactions ever at WalMart
– Complete client histories at insurance firm
– Stockbroker financial information and portfolios
Data Warehousing/Mining 15
Summary
[Figure: components shown: Operational Systems, Enterprise Modeling, Business Information Guide, Data Warehouse Catalog, Data Warehouse Population, Data Warehouse, Business Information Interface]
Data Warehousing/Mining 16
Warehouse is a Specialized DB
Standard DB                                  Warehouse
• Mostly updates                             • Mostly reads
• Many small transactions                    • Queries are long and complex
• MB to GB of data                           • GB to TB of data
• Current snapshot                           • History
• Index/hash on p.k.                         • Lots of scans
• Raw data                                   • Summarized, reconciled data
• Thousands of users (e.g., clerical users)  • Hundreds of users (e.g., decision-makers, analysts)
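The two workload styles in the comparison above can be sketched with Python's built-in `sqlite3` module. The `sales` table and its rows are invented for illustration; a real warehouse would use a separate, differently tuned database rather than the same table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales ("
    "id INTEGER PRIMARY KEY, store TEXT, year INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO sales (store, year, amount) VALUES (?, ?, ?)",
    [("A", 1996, 10.0), ("A", 1997, 20.0), ("B", 1996, 5.0), ("B", 1997, 15.0)],
)

# Standard DB style: a small transaction that updates one row,
# located through the primary-key index.
conn.execute("UPDATE sales SET amount = amount + 1.0 WHERE id = ?", (1,))
conn.commit()

# Warehouse style: a read-only query that scans the history
# and returns summarized data.
summary = conn.execute(
    "SELECT store, SUM(amount) FROM sales GROUP BY store ORDER BY store"
).fetchall()
conn.close()
```

The update touches one row and finishes quickly; the summary reads every row, which is why the slide lists "lots of scans" on the warehouse side.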
Data Warehousing/Mining 17
Warehousing and Industry
• Warehousing is big business
– $2 billion in 1995
– $3.5 billion in early 1997
– Predicted: $8 billion in 1998 [Metagroup]
• WalMart has largest warehouse
– 900-CPU, 2,700 disk, 23 TB Teradata system
– ~7TB in warehouse
– 40-50GB per day
Data Warehousing/Mining 18
Types of Data
• Business Data - represents meaning
– Real-time data (ultimate source of all business data)
– Reconciled data
– Derived data
• Metadata - describes meaning
– Build-time metadata
– Control metadata
– Usage metadata
• Data as a product* - intrinsic meaning
– Produced and stored for its own intrinsic value
– e.g., the contents of a textbook
Data Warehousing/Mining 19
Data Warehouse Architectures: Conceptual View
• Single-layer
– Every data element is stored once only
– Virtual warehouse
• Two-layer
– Real-time + derived data
– Most commonly used approach in industry today
[Figure, single-layer: operational systems and informational systems both work on one store of “real-time data”]
[Figure, two-layer: operational systems work on real-time data; informational systems work on derived data]
Data Warehousing/Mining 20
Three-layer Architecture: Conceptual View
• Transformation of real-time data to derived data really requires two steps
[Figure: three layers between operational systems and informational systems: real-time data, reconciled data (labelled “Physical Implementation of the Data Warehouse”), and derived data (labelled “View level” and “Particular informational needs”)]
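The two steps can be sketched as two functions: one that reconciles real-time data from different operational systems into a common form, and one that derives a summary for a particular informational need. The record layouts, field names, and the choice of "total per customer" as the derived view are assumptions made for this example.

```python
from collections import defaultdict

# Real-time data as two operational systems might hold it (hypothetical).
billing = [{"cust": "C1", "amount_cents": 1250, "date": "1997-03-01"}]
orders = [
    {"customer_id": "c1", "total": "7.50", "day": "1997-03-02"},
    {"customer_id": "c2", "total": "3.00", "day": "1997-03-02"},
]


def reconcile(billing_rows, order_rows):
    """Step 1: bring both sources to one schema, one key format, one unit."""
    rows = [
        {"customer": r["cust"].upper(),
         "amount": r["amount_cents"] / 100,
         "date": r["date"]}
        for r in billing_rows
    ]
    rows += [
        {"customer": r["customer_id"].upper(),
         "amount": float(r["total"]),
         "date": r["day"]}
        for r in order_rows
    ]
    return rows


def derive(reconciled_rows):
    """Step 2: summarize reconciled rows for one informational need."""
    totals = defaultdict(float)
    for row in reconciled_rows:
        totals[row["customer"]] += row["amount"]
    return dict(totals)


reconciled = reconcile(billing, orders)
derived = derive(reconciled)
```

Keeping the reconciled layer separate means a second derived view (say, totals per day) can be built from the same cleaned rows without going back to the sources.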
Data Warehousing/Mining 21
Data Warehousing: Two Distinct Issues
(1) How to get information into warehouse: “Data warehousing”
(2) What to do with data once it’s in warehouse: “Warehouse DBMS”
• Both rich research areas
• Industry has focused on (2)
Data Warehousing/Mining 22
Issues in Data Warehousing
• Warehouse Design
• Extraction
– Wrappers, monitors (change detectors)
• Integration
– Cleansing & merging
• Warehousing specification & Maintenance
• Optimizations
• Miscellaneous (e.g., evolution)
Data Warehousing/Mining 23
Data Extraction
• Source types
– Relational, flat file, WWW, etc.
• How to get data out?
– Replication tool
– Dump file
– Create report
– ODBC or third-party “wrappers”
Data Warehousing/Mining 24
Warehouse Architecture
[Figure: each source has an extractor/monitor; the extractors/monitors feed an integrator, which loads the warehouse (with metadata); clients access the warehouse through a query & analysis layer]
Data Warehousing/Mining 25
Issues (1)
• Warehouse uses relational data model or multi-dimensional data model (e.g., data cube)
• On the other hand, source types
– Relational, OO, hierarchical, legacy
– Semistructured: flat file, WWW
• How do we get the data out?
Data Warehousing/Mining 26
Issues (2)
• Warehouse must be kept current in light of changes to underlying sources
• How do we detect updates in sources?
Data Warehousing/Mining 27
Wrapper
• Converts data and queries from one data model to another
• Extends query capabilities for sources with limited capabilities
[Figure: a wrapper sits in front of a source and translates queries and data between Data Model A and Data Model B]
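Both roles of a wrapper can be shown with a minimal sketch over a flat-file source. The `FlatFileWrapper` class, its `select` method, and the sample file are hypothetical: the source can only hand over its whole file, and the wrapper adds relational-style projection and selection on top of it.

```python
import csv
import io

# A source with limited capabilities: all it offers is the whole file.
FLAT_FILE = "id,name,city\n1,Jane Doe,Boston\n2,John Roe,Medford\n"


class FlatFileWrapper:
    """Presents a flat file as if it were a queryable relation."""

    def __init__(self, text):
        self._text = text

    def select(self, columns, where=None):
        """Return tuples of the requested columns for rows passing `where`."""
        reader = csv.DictReader(io.StringIO(self._text))
        rows = []
        for record in reader:
            if where is None or where(record):
                rows.append(tuple(record[c] for c in columns))
        return rows


wrapper = FlatFileWrapper(FLAT_FILE)
result = wrapper.select(["name"], where=lambda r: r["city"] == "Medford")
```

Here the filtering happens inside the wrapper because the source cannot do it; for a source that accepts queries, a wrapper would translate the query and pass it through instead.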
Data Warehousing/Mining 28
Wrapper Generation
• Solution 1: Hard code for each source
• Solution 2: Automatic wrapper generation
[Figure: a wrapper generator takes a definition and produces a wrapper]
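As a rough illustration of Solution 2, the sketch below builds a wrapper from a declarative definition instead of hand-written parsing code. The fixed-width record layout, the `make_wrapper` function, and the field names are assumptions for the example; real wrapper generators work from much richer source descriptions.

```python
def make_wrapper(definition):
    """Build a wrapper function from a list of (name, start, end, convert)."""

    def wrapper(line):
        return {
            name: convert(line[start:end].strip())
            for name, start, end, convert in definition
        }

    return wrapper


# Hypothetical definition of a fixed-width legacy record layout.
ACCOUNT_DEFINITION = [
    ("account", 0, 6, str),
    ("owner", 6, 16, str),
    ("balance", 16, 24, float),
]

parse_account = make_wrapper(ACCOUNT_DEFINITION)

# Build one sample record that follows the layout above.
line = "A00017" + "Jane Doe".ljust(10) + "1250.50".rjust(8)
record = parse_account(line)
```

Supporting a new source of the same kind then means writing a new definition, not new code.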
Data Warehousing/Mining 29
Wrapper Approach
• Source-specific adapter (a.k.a. wrapper, translator)
• “Thickness” of adapter depends on source
– Data model used (e.g., rel. schema vs. unstructured)
– Interface (e.g., query language, API)
– Active capabilities (e.g., triggers)
– Degree of autonomy (e.g., same owner & modifiable vs. controlled by external entity & no changes possible)
– Cooperation (e.g., friendly vs. uncooperative)
Data Warehousing/Mining 30
Routine When...
• Many tools for dealing with “standard situations”
– Standard sources with full/many capabilities (e.g., most commercial DBMSs, all ODBC-compliant sources)
– Standard interactions (e.g., pass-through queries, extraction from rel. tables, replication)
– Cooperative sources or sources under our control
• Tools
– Replication tools, ODBC, report writers, third-party “wrappers”
Data Warehousing/Mining 31
Not So Routine When...
• “Non-standard situations”
– Unstructured or semistructured sources with little or no explicit schema
– Uncooperative sources
– Sources with limited capabilities (e.g., legacy sources, WWW)
• Few commercial tools
• Mostly research
Data Warehousing/Mining 32
Data Transformations
• Convert data to uniform format
– Byte ordering, string termination
– Internal layout
• Remove, add & reorder attributes
– Add key
– Add data to get history
• Sort tuples
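A few of these transformations can be sketched with Python's standard `struct` module. The binary record layout (a big-endian 32-bit integer followed by a NUL-terminated ASCII name) is an assumption chosen to show byte ordering and string termination; the surrogate key is simply a running number.

```python
import struct


def decode(raw):
    """Convert one source record to a uniform (name, amount) tuple."""
    # Byte ordering: ">i" reads a big-endian signed 32-bit integer.
    (amount,) = struct.unpack(">i", raw[:4])
    # String termination: keep only the bytes before the first NUL.
    name = raw[4:].split(b"\x00", 1)[0].decode("ascii")
    return (name, amount)


# Two hypothetical source records; the second is padded with extra NULs.
raw_records = [
    struct.pack(">i", 300) + b"widget\x00",
    struct.pack(">i", 120) + b"bolt\x00\x00\x00",
]

decoded = [decode(r) for r in raw_records]

# Sort tuples, then add a key as a new leading attribute.
rows = [
    (key, name, amount)
    for key, (name, amount) in enumerate(sorted(decoded), start=1)
]
```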
Data Warehousing/Mining 33
Monitors
• Goal: Detect changes of interest and propagate to integrator
• How?
– Triggers
– Replication server
– Log sniffer
– Compare query results
– Compare snapshots/dumps
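The last option, comparing snapshots, is the easiest to sketch because it needs no cooperation from the source. The function below assumes each snapshot is a dictionary from primary key to row; the function name and sample data are illustrative only.

```python
def diff_snapshots(old, new):
    """Compare two snapshots (key -> row) and report the changes."""
    inserted = {k: new[k] for k in new.keys() - old.keys()}
    deleted = {k: old[k] for k in old.keys() - new.keys()}
    updated = {
        k: (old[k], new[k])
        for k in old.keys() & new.keys()
        if old[k] != new[k]
    }
    return inserted, deleted, updated


yesterday = {1: ("Jane Doe", "Boston"), 2: ("John Roe", "Medford")}
today = {1: ("Jane Doe", "Salem"), 3: ("Ann Poe", "Boston")}

inserted, deleted, updated = diff_snapshots(yesterday, today)
```

This sketch reads both snapshots in full, so its cost grows with the size of the source rather than with the number of changes; triggers and log sniffing avoid that but need more access to the source.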
Data Warehousing/Mining 34
Data Integration
• Receive data (changes) from multiple wrappers/monitors and integrate into warehouse
• Rule-based
• Actions
– Resolve inconsistencies
– Eliminate duplicates
– Integrate into warehouse (may not be empty)
– Summarize data
– Fetch more data from sources (warehouse updates)
– etc.
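A minimal sketch of a rule-based integrator follows, covering three of the actions listed: resolving inconsistencies, eliminating duplicates, and integrating into a warehouse that already holds data. The `prefer_newer` rule, the `as_of` field, and the row layout are assumptions for the example; the rule is passed in as a parameter so it can be swapped.

```python
def prefer_newer(existing, incoming):
    """Example rule: on conflict, keep the row with the later as_of date."""
    return incoming if incoming["as_of"] > existing["as_of"] else existing


def integrate(warehouse, changes, resolve=prefer_newer):
    """Apply incoming rows to a (possibly non-empty) warehouse in place."""
    stats = {"added": 0, "duplicates": 0, "resolved": 0}
    for row in changes:
        key = row["customer"]
        existing = warehouse.get(key)
        if existing is None:
            warehouse[key] = row            # new information
            stats["added"] += 1
        elif existing == row:
            stats["duplicates"] += 1        # exact duplicate, drop it
        else:
            warehouse[key] = resolve(existing, row)  # inconsistency
            stats["resolved"] += 1
    return stats


# The warehouse is not empty when the changes arrive.
warehouse = {"C1": {"customer": "C1", "city": "Boston", "as_of": "1997-01-10"}}
changes = [
    {"customer": "C1", "city": "Medford", "as_of": "1997-02-01"},
    {"customer": "C2", "city": "Salem", "as_of": "1997-02-01"},
    {"customer": "C2", "city": "Salem", "as_of": "1997-02-01"},
]

stats = integrate(warehouse, changes)
```

The date comparison relies on ISO-formatted strings sorting in chronological order.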
Data Warehousing/Mining 35
Data Cleansing
• Find (& remove) duplicate tuples
– e.g., Jane Doe vs. Jane Q. Doe
• Detect inconsistent, wrong data
– Attribute values that don’t match
• Patch missing, unreadable data
• Notify sources of errors found
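The Jane Doe example can be turned into a small duplicate-detection sketch. The matching rule used here (ignore case, punctuation, and single-letter middle initials) is an assumption made for illustration; it produces candidates for review, since two different people can share a name.

```python
import re
from collections import defaultdict


def name_key(name):
    """Normalize a name so that trivial variants map to the same key."""
    tokens = re.sub(r"[^\w\s]", "", name).lower().split()
    last = len(tokens) - 1
    # Assumption: a single-letter token that is neither first nor last
    # is a middle initial and can be ignored for matching.
    core = [t for i, t in enumerate(tokens) if len(t) > 1 or i in (0, last)]
    return " ".join(core)


def find_duplicates(names):
    """Group names by normalized key and return groups with several members."""
    groups = defaultdict(list)
    for name in names:
        groups[name_key(name)].append(name)
    return [group for group in groups.values() if len(group) > 1]


candidates = find_duplicates(["Jane Doe", "John Roe", "Jane Q. Doe"])
```

In practice the name key would be combined with other attributes, such as address or date of birth, before any tuple is removed.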

