SlideShare a Scribd company logo
ANHAI DOAN ALON HALEVY ZACHARY IVES
CHAPTER 10: DATA
WAREHOUSING & CACHING
PRINCIPLES OF
DATA INTEGRATION
Data Warehousing and
Materialization
 We have mostly focused on techniques for virtual
data integration (see Ch. 1)
 Queries are composed with mappings on the fly and data
is fetched on demand
 This represents one extreme point
 In this chapter, we consider cases where data is
transformed and materialized “in advance” of the
queries
 The main scenario: the data warehouse
What Is a Data Warehouse?
 In many organizations, we want a central “store” of
all of our entities, concepts, metadata, and historical
information
 For doing data validation, complex mining, analysis,
prediction, …
 This is the data warehouse
 To this point we’ve focused on scenarios where the
data “lives” in the sources – here we may have a
“master” version (and archival version) in a central
database
 For performance reasons, availability reasons, archival
reasons, …
In the Rest of this Chapter…
 The data warehouse as the master data instance
 Data warehouse architectures, design, loading
 Data exchange: declarative data warehousing
 Hybrid models: caching and partial materialization
 Querying externally archived data
Outline
 The data warehouse
 Motivation: Master data management
 Physical design
 Extract/transform/load
 Data exchange
 Caching & partial materialization
 Operating on external data
Master Data Management
 One of the “modern” uses of the data warehouse is
not only to support analytics but to serve as a
reference to all of the entities in the organization
 A cleaned, validated repository of what we know
… which can be linked to by data sources
… which may help with data cleaning
… and which may be the basis of data governance
(processes by which data is created and modified in a
systematic way, e.g., to comply with gov’t regulations)
 There is an emerging field called master data
management out the process of creating these
Data Warehouse Architecture
 At the top – a centralized
database
 Generally configured for
queries and appends –
not transactions
 Many indices,
materialized views, etc.
 Data is loaded and
periodically updated via
Extract/Transform/Load
(ETL) tools
Data Warehouse
ETL ETL ETL ETL
RDBMS1 RDBMS2
HTML1 XML1
ETL pipeline
outputs
ETL
ETL Tools
 ETL tools are the equivalent of schema mappings in
virtual integration, but are more powerful
 Arbitrary pieces of code to take data from a source,
convert it into data for the warehouse:
 import filters – read and convert from data sources
 data transformations – join, aggregate, filter, convert data
 de-duplication – finds multiple records referring to the
same entity, merges them
 profiling – builds tables, histograms, etc. to summarize
data
 quality management – test against master values, known
business rules, constraints, etc.
Example ETL Tool Chain
 This is an example for e-commerce loading
 Note multiple stages of filtering (using selection or
join-like operations), logging bad records, before we
group and load
Invoice
line items
Split
Date-
time
Filter
invalid
Join
Filter
invalid
Invalid
dates/times
Invalid
items
Item
records
Filter
non -
match
Invalid
customers
Group by
customer
Customer
balance
Customer
records
Basic Data Warehouse – Summary
 Two aspects:
 A central DBMS optimized for appends and querying
 The “master data” instance
 Or the instance for doing mining, analytics, and prediction
 A set of procedural ETL “pipelines” to fetch, transform,
filter, clean, and load data
 Often these tools are more expressive than standard conjunctive
queries (as in Chapters 2-3)
 … But not always!
This raises a question – can we do warehousing with declarative
mappings?
Outline
 The data warehouse
 Data exchange
 Caching & partial materialization
 Operating on external data
Data Exchange
 Intuitively, a declarative setup for data warehousing
 Declarative schema mappings as in Ch. 2-3
 Materialized database as in the previous section
 Also allow for unknown values when we map from
source to target (warehouse) instance
 If we know a professor teaches a student, then there must
exist a course C that the student took and the professor
taught – but we may not know which…
Data Exchange Formulation
A data exchange setting (S,T,M,CT) has:
 S, source schema representing all of the source tables
jointly
 T, target schema
 A set of mappings or tuple-generating dependencies
relating S and T
 A set of constraints (equality-generating dependencies)
("X)s1(X1), ..., sm (Xm ) ® ($Y) t1(Y1), ..., tk (Yk )
)
Y
(Y
)
Y
(
t
...,
,
)
Y
(
)t
Y
( j
i
l
l
1
1 


An Example
Source S has
Teaches(prof, student)
Adviser(adviser, student)
Target T has
Advise(adviser, student)
TeachesCourse(prof, course)
Takes(course, student)
)
,
(
),
,
(
.
,
)
,
(
:
)
,
(
)
,
(
:
)
,
(
),
,
(
.
)
,
(
:
)
,
(
.
)
,
(
:
4
3
2
1
stud
C
Takes
C
D
rse
TeachesCou
D
C
stud
prof
Adviser
r
stud
prof
Advise
stud
prof
Adviser
r
stud
C
Takes
C
prof
rse
TeachesCou
C
stud
prof
Teaches
r
stud
D
Advise
D
stud
prof
Teaches
r







existential variables represent unknown
The Data Exchange Solution
 The goal of data exchange is to compute an instance
of the target schema, given a data exchange setting
D = (S,T,M,CT) and an instance I(S)
 An instance J of Schema T is a data exchange
solution for D and I if
1. the pair (I,J) satisfies schema mapping M, and
2. J satisfies constraints CT
Instance I(S) has
Teaches
Adviser
Back to the Example, Now with Data
prof student
Ann Bob
Chloe David
Instance J(T) has
Advise
TeachesCourse
Takes
adviser student
Ellen Bob
Felicia David
adviser student
Ellen Bob
Felicia David
course student
C1 Bob
C2 David
prof course
Ann C1
Chloe C2
variables or labeled nulls
represent unknown values
Instance I(S) has
Teaches
Adviser
This Is also a Solution
prof student
Ann Bob
Chloe David
Instance J(T) has
Advise
TeachesCourse
Takes
adviser student
Ellen Bob
Felicia David
adviser student
Ellen Bob
Felicia David
course student
C1 Bob
C1 David
prof course
Ann C1
Chloe C1
this time the labeled
nulls are all the same!
Universal Solutions
 Intuitively, the first solution should be better than
the second
 The first solution uses the same variable for the course
taught by Ann and by Chloe – they are the same course
 But this was not specified in the original schema!
 We formalize that through the notion of the
universal solution, which must not lose any
information
Formalizing the Universal Solution
First we define instance homomorphism:
 Let J1, J2 be two instances of schema T
 A mapping h: J1  J2 is a homomorphism from J1 to J2 if
 h(c) = c for every c ∈ C,
 for every tuple R(a1,…,an) ∈ J1 the tuple R(h(a1),…,h(an)) ∈ J2
 J1, J2 are homomorphically equivalent if there are
homomorphisms h: J1  J2 and h’: J2  J1
Def: Universal solution for data exchange setting
D = (S,T,M,CT), where I is an instance of S.
A data exchange solution J for D and I is a universal
solution if, for every other data exchange solution J’ for D
and I, there exists a homomorphism h: J  J’
Computing Universal Solutions
 The standard process is to use a procedure called the
chase
 Informally:
 Consider every formula r of M in turn:
 If there is a variable substitution for the left-hand side (lhs) of r
where the right-hand side (rhs) is not in the solution – add it
 If we create a new tuple, for every existential variable in the rhs,
substitute a new fresh variable
 See Chapter 10 Algorithm 10 for full pseudocode
Core Universal Solutions
 Universal solutions may be of arbitrary size
 The core universal solution is the minimal universal
solution
Data Exchange and Querying
 As with the data warehouse, all queries are directly
posed over the target database – no reformulation
necessary
 However, we typically assume certain answers
semantics
 To get the certain answers (which are the same as in the
virtual integration setting with GLAV/TGD mappings) –
compute the query answers and then drop any tuples with
labeled nulls (variables)
Data Exchange vs. Warehousing
 From an external perspective, exchange and
warehousing are essentially equivalent
 But there are different trade-offs in procedural vs.
declarative mappings
 Procedural – more expressive
 Declarative – easier to reason about, compose, invert,
create matieralized views for, etc. (see Chapter 6)
Outline
 The data warehouse
 Data exchange
 Caching & partial materialization
 Operating on external data
The Spectrum of Materialization
 Many real EII systems compute and maintain
materialized views, or cache results
A “hybrid” point between the fully virtual and fully
materialized approaches
Virtual integration
(EII)
Data exchange /
data warehouse
sources materialized all mediated relations
materialized
caching or partial materialization –
some views materialized
Possible Techniques for Choosing
What to Materialize
Cache results of prior queries
 Take the results of each query, materialize them
 Use answering queries using views to reuse
 Expire using time-to-live… May not always be fresh!
Administrator-selected views
 Someone manually specifies views to compute and
maintain, as with a relational DBMS
 System automatically maintains
Automatic view selection
 Using query workload, update frequencies – a view
materialization wizard chooses what to materialize
Outline
 The data warehouse
 Data exchange
 Caching & partial materialization
 Operating on external data
Many “Integration-Like” Scenarios
over Historical Data
 Many Web scenarios where we have large logs of
data accesses, created by the server
 Goal: put these together and query them!
 Looks like a very simple data integration scenario –
external data, but single schema
 A common approach: use programming
environments like MapReduce (or SQL layers above)
to query the data on a cluster
 MapReduce reliably runs large jobs across 100s or 1000s
of “shared nothing” nodes in a cluster
MapReduce Basics
 MapReduce is essentially a template for writing
distributed programs – corresponding to a single SQL
SELECT..FROM..WHERE..GROUP BY..HAVING block
with user-defined functions
 The MapReduce runtime calls a set of functions:
 map is given a tuple, outputs 0 or more tuples in response
 roughly like the WHERE clause
 shuffle is a stage for doing sort-based grouping on a key
(specified by the map)
 reduce is an aggregate function called over the set of
tuples with the same grouping key
MapReduce Dataflow “Template”: Tuples 
Map “worker”  Shuffle  Reduce “worker”
30
Map
Worker
Map
Worker
Map
Worker
Map
Worker
Reduce
Worker
Reduce
Worker
Reduce
Worker
Reduce
Worker
Reduce
Worker
emit tuples
emit aggregate
results
MapReduce as ETL
 Some people use MapReduce to take data,
transform it, and load it into a warehouse
 … which is basically what ETL tools do!
 The dividing line between DBMSs, EII, MapReduce is
blurring as of the development of this book
 SQL  MapReduce
 MapReduce over SQL engines
 Shared-nothing DBMSs
 NoSQL
Warehousing & Materialization Wrap-up
 There are benefits to centralizing & materializing data
 Performance, especially for analytics / mining
 Archival
 Standardization / canonicalization
 Data warehouses typically use procedural ETL tools to
extract, transform, load (and clean) data
 Data exchange replaces ETL with declarative
mappings (where feasible)
 Hybrid schemes exist for partial materialization
 Increasingly we are integrating via MapReduce and its
cousins

More Related Content

Similar to Warehousing_Ch10.ppt

Dimensional data model
Dimensional data modelDimensional data model
Dimensional data model
Vnktp1
 
Introduction to Data Science With R Notes
Introduction to Data Science With R NotesIntroduction to Data Science With R Notes
Introduction to Data Science With R Notes
LakshmiSarvani6
 
Whats a datawarehouse
Whats a datawarehouseWhats a datawarehouse
Whats a datawarehouse
vijjudarling
 
Hibernate
HibernateHibernate
UNIT-5 DATA WAREHOUSING.docx
UNIT-5 DATA WAREHOUSING.docxUNIT-5 DATA WAREHOUSING.docx
UNIT-5 DATA WAREHOUSING.docx
DURGADEVIL
 
DBMS an Example
DBMS an ExampleDBMS an Example
DBMS an Example
Dr. C.V. Suresh Babu
 
Cdocumentsandsettingsuser1desktop2 dbmsexamples-091012013049-phpapp01
Cdocumentsandsettingsuser1desktop2 dbmsexamples-091012013049-phpapp01Cdocumentsandsettingsuser1desktop2 dbmsexamples-091012013049-phpapp01
Cdocumentsandsettingsuser1desktop2 dbmsexamples-091012013049-phpapp01
Raza Baloch
 
Preprocessing.ppt
Preprocessing.pptPreprocessing.ppt
Preprocessing.ppt
Revathy V R
 
The best ETL questions in a nut shell
The best ETL questions in a nut shellThe best ETL questions in a nut shell
The best ETL questions in a nut shell
Srinimf-Slides
 
12.Data processing and concepts.pdf
12.Data processing and concepts.pdf12.Data processing and concepts.pdf
12.Data processing and concepts.pdf
Ayele40
 
Machine Learning - Dataset Preparation
Machine Learning - Dataset PreparationMachine Learning - Dataset Preparation
Machine Learning - Dataset Preparation
Andrew Ferlitsch
 
Datamining
DataminingDatamining
Datamining
Debashis Pradhan
 
Summary introduction to data engineering
Summary introduction to data engineeringSummary introduction to data engineering
Summary introduction to data engineering
Novita Sari
 
extract, transform, load_Data Analyt.ppt
extract, transform, load_Data Analyt.pptextract, transform, load_Data Analyt.ppt
extract, transform, load_Data Analyt.ppt
Neerupa Chauhan
 
Data Structure Introduction for Beginners
Data Structure Introduction for BeginnersData Structure Introduction for Beginners
Data Structure Introduction for Beginners
ramanathan2006
 
Lec1
Lec1Lec1
Lec1
Lec1Lec1
Lec1
Saad Gabr
 
Data Preprocessing
Data PreprocessingData Preprocessing
Data Preprocessing
VijayasankariS
 
BI - Data warehousing in practice
BI - Data warehousing in practiceBI - Data warehousing in practice
BI - Data warehousing in practice
Sjors Otten
 
data analytics lecture 3.2.ppt
data analytics lecture 3.2.pptdata analytics lecture 3.2.ppt
data analytics lecture 3.2.ppt
RutujaPatil247341
 

Similar to Warehousing_Ch10.ppt (20)

Dimensional data model
Dimensional data modelDimensional data model
Dimensional data model
 
Introduction to Data Science With R Notes
Introduction to Data Science With R NotesIntroduction to Data Science With R Notes
Introduction to Data Science With R Notes
 
Whats a datawarehouse
Whats a datawarehouseWhats a datawarehouse
Whats a datawarehouse
 
Hibernate
HibernateHibernate
Hibernate
 
UNIT-5 DATA WAREHOUSING.docx
UNIT-5 DATA WAREHOUSING.docxUNIT-5 DATA WAREHOUSING.docx
UNIT-5 DATA WAREHOUSING.docx
 
DBMS an Example
DBMS an ExampleDBMS an Example
DBMS an Example
 
Cdocumentsandsettingsuser1desktop2 dbmsexamples-091012013049-phpapp01
Cdocumentsandsettingsuser1desktop2 dbmsexamples-091012013049-phpapp01Cdocumentsandsettingsuser1desktop2 dbmsexamples-091012013049-phpapp01
Cdocumentsandsettingsuser1desktop2 dbmsexamples-091012013049-phpapp01
 
Preprocessing.ppt
Preprocessing.pptPreprocessing.ppt
Preprocessing.ppt
 
The best ETL questions in a nut shell
The best ETL questions in a nut shellThe best ETL questions in a nut shell
The best ETL questions in a nut shell
 
12.Data processing and concepts.pdf
12.Data processing and concepts.pdf12.Data processing and concepts.pdf
12.Data processing and concepts.pdf
 
Machine Learning - Dataset Preparation
Machine Learning - Dataset PreparationMachine Learning - Dataset Preparation
Machine Learning - Dataset Preparation
 
Datamining
DataminingDatamining
Datamining
 
Summary introduction to data engineering
Summary introduction to data engineeringSummary introduction to data engineering
Summary introduction to data engineering
 
extract, transform, load_Data Analyt.ppt
extract, transform, load_Data Analyt.pptextract, transform, load_Data Analyt.ppt
extract, transform, load_Data Analyt.ppt
 
Data Structure Introduction for Beginners
Data Structure Introduction for BeginnersData Structure Introduction for Beginners
Data Structure Introduction for Beginners
 
Lec1
Lec1Lec1
Lec1
 
Lec1
Lec1Lec1
Lec1
 
Data Preprocessing
Data PreprocessingData Preprocessing
Data Preprocessing
 
BI - Data warehousing in practice
BI - Data warehousing in practiceBI - Data warehousing in practice
BI - Data warehousing in practice
 
data analytics lecture 3.2.ppt
data analytics lecture 3.2.pptdata analytics lecture 3.2.ppt
data analytics lecture 3.2.ppt
 

Recently uploaded

一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理
aqzctr7x
 
Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...
Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...
Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...
Kaxil Naik
 
REUSE-SCHOOL-DATA-INTEGRATED-SYSTEMS.pptx
REUSE-SCHOOL-DATA-INTEGRATED-SYSTEMS.pptxREUSE-SCHOOL-DATA-INTEGRATED-SYSTEMS.pptx
REUSE-SCHOOL-DATA-INTEGRATED-SYSTEMS.pptx
KiriakiENikolaidou
 
How To Control IO Usage using Resource Manager
How To Control IO Usage using Resource ManagerHow To Control IO Usage using Resource Manager
How To Control IO Usage using Resource Manager
Alireza Kamrani
 
Template xxxxxxxx ssssssssssss Sertifikat.pptx
Template xxxxxxxx ssssssssssss Sertifikat.pptxTemplate xxxxxxxx ssssssssssss Sertifikat.pptx
Template xxxxxxxx ssssssssssss Sertifikat.pptx
TeukuEriSyahputra
 
一比一原版卡尔加里大学毕业证(uc毕业证)如何办理
一比一原版卡尔加里大学毕业证(uc毕业证)如何办理一比一原版卡尔加里大学毕业证(uc毕业证)如何办理
一比一原版卡尔加里大学毕业证(uc毕业证)如何办理
oaxefes
 
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data
Predictably Improve Your B2B Tech Company's Performance by Leveraging DataPredictably Improve Your B2B Tech Company's Performance by Leveraging Data
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data
Kiwi Creative
 
原版一比一爱尔兰都柏林大学毕业证(UCD毕业证书)如何办理
原版一比一爱尔兰都柏林大学毕业证(UCD毕业证书)如何办理 原版一比一爱尔兰都柏林大学毕业证(UCD毕业证书)如何办理
原版一比一爱尔兰都柏林大学毕业证(UCD毕业证书)如何办理
tzu5xla
 
一比一原版澳洲西澳大学毕业证(uwa毕业证书)如何办理
一比一原版澳洲西澳大学毕业证(uwa毕业证书)如何办理一比一原版澳洲西澳大学毕业证(uwa毕业证书)如何办理
一比一原版澳洲西澳大学毕业证(uwa毕业证书)如何办理
aguty
 
一比一原版(曼大毕业证书)曼尼托巴大学毕业证如何办理
一比一原版(曼大毕业证书)曼尼托巴大学毕业证如何办理一比一原版(曼大毕业证书)曼尼托巴大学毕业证如何办理
一比一原版(曼大毕业证书)曼尼托巴大学毕业证如何办理
ytypuem
 
一比一原版格里菲斯大学毕业证(Griffith毕业证书)学历如何办理
一比一原版格里菲斯大学毕业证(Griffith毕业证书)学历如何办理一比一原版格里菲斯大学毕业证(Griffith毕业证书)学历如何办理
一比一原版格里菲斯大学毕业证(Griffith毕业证书)学历如何办理
lzdvtmy8
 
Drownings spike from May to August in children
Drownings spike from May to August in childrenDrownings spike from May to August in children
Drownings spike from May to August in children
Bisnar Chase Personal Injury Attorneys
 
一比一原版莱斯大学毕业证(rice毕业证)如何办理
一比一原版莱斯大学毕业证(rice毕业证)如何办理一比一原版莱斯大学毕业证(rice毕业证)如何办理
一比一原版莱斯大学毕业证(rice毕业证)如何办理
zsafxbf
 
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Aggregage
 
一比一原版南十字星大学毕业证(SCU毕业证书)学历如何办理
一比一原版南十字星大学毕业证(SCU毕业证书)学历如何办理一比一原版南十字星大学毕业证(SCU毕业证书)学历如何办理
一比一原版南十字星大学毕业证(SCU毕业证书)学历如何办理
slg6lamcq
 
A gentle exploration of Retrieval Augmented Generation
A gentle exploration of Retrieval Augmented GenerationA gentle exploration of Retrieval Augmented Generation
A gentle exploration of Retrieval Augmented Generation
dataschool1
 
The Ipsos - AI - Monitor 2024 Report.pdf
The  Ipsos - AI - Monitor 2024 Report.pdfThe  Ipsos - AI - Monitor 2024 Report.pdf
The Ipsos - AI - Monitor 2024 Report.pdf
Social Samosa
 
Experts live - Improving user adoption with AI
Experts live - Improving user adoption with AIExperts live - Improving user adoption with AI
Experts live - Improving user adoption with AI
jitskeb
 
一比一原版多伦多大学毕业证(UofT毕业证书)学历如何办理
一比一原版多伦多大学毕业证(UofT毕业证书)学历如何办理一比一原版多伦多大学毕业证(UofT毕业证书)学历如何办理
一比一原版多伦多大学毕业证(UofT毕业证书)学历如何办理
eoxhsaa
 
Cell The Unit of Life for NEET Multiple Choice Questions.docx
Cell The Unit of Life for NEET Multiple Choice Questions.docxCell The Unit of Life for NEET Multiple Choice Questions.docx
Cell The Unit of Life for NEET Multiple Choice Questions.docx
vasanthatpuram
 

Recently uploaded (20)

一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理
 
Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...
Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...
Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...
 
REUSE-SCHOOL-DATA-INTEGRATED-SYSTEMS.pptx
REUSE-SCHOOL-DATA-INTEGRATED-SYSTEMS.pptxREUSE-SCHOOL-DATA-INTEGRATED-SYSTEMS.pptx
REUSE-SCHOOL-DATA-INTEGRATED-SYSTEMS.pptx
 
How To Control IO Usage using Resource Manager
How To Control IO Usage using Resource ManagerHow To Control IO Usage using Resource Manager
How To Control IO Usage using Resource Manager
 
Template xxxxxxxx ssssssssssss Sertifikat.pptx
Template xxxxxxxx ssssssssssss Sertifikat.pptxTemplate xxxxxxxx ssssssssssss Sertifikat.pptx
Template xxxxxxxx ssssssssssss Sertifikat.pptx
 
一比一原版卡尔加里大学毕业证(uc毕业证)如何办理
一比一原版卡尔加里大学毕业证(uc毕业证)如何办理一比一原版卡尔加里大学毕业证(uc毕业证)如何办理
一比一原版卡尔加里大学毕业证(uc毕业证)如何办理
 
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data
Predictably Improve Your B2B Tech Company's Performance by Leveraging DataPredictably Improve Your B2B Tech Company's Performance by Leveraging Data
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data
 
原版一比一爱尔兰都柏林大学毕业证(UCD毕业证书)如何办理
原版一比一爱尔兰都柏林大学毕业证(UCD毕业证书)如何办理 原版一比一爱尔兰都柏林大学毕业证(UCD毕业证书)如何办理
原版一比一爱尔兰都柏林大学毕业证(UCD毕业证书)如何办理
 
一比一原版澳洲西澳大学毕业证(uwa毕业证书)如何办理
一比一原版澳洲西澳大学毕业证(uwa毕业证书)如何办理一比一原版澳洲西澳大学毕业证(uwa毕业证书)如何办理
一比一原版澳洲西澳大学毕业证(uwa毕业证书)如何办理
 
一比一原版(曼大毕业证书)曼尼托巴大学毕业证如何办理
一比一原版(曼大毕业证书)曼尼托巴大学毕业证如何办理一比一原版(曼大毕业证书)曼尼托巴大学毕业证如何办理
一比一原版(曼大毕业证书)曼尼托巴大学毕业证如何办理
 
一比一原版格里菲斯大学毕业证(Griffith毕业证书)学历如何办理
一比一原版格里菲斯大学毕业证(Griffith毕业证书)学历如何办理一比一原版格里菲斯大学毕业证(Griffith毕业证书)学历如何办理
一比一原版格里菲斯大学毕业证(Griffith毕业证书)学历如何办理
 
Drownings spike from May to August in children
Drownings spike from May to August in childrenDrownings spike from May to August in children
Drownings spike from May to August in children
 
一比一原版莱斯大学毕业证(rice毕业证)如何办理
一比一原版莱斯大学毕业证(rice毕业证)如何办理一比一原版莱斯大学毕业证(rice毕业证)如何办理
一比一原版莱斯大学毕业证(rice毕业证)如何办理
 
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
 
一比一原版南十字星大学毕业证(SCU毕业证书)学历如何办理
一比一原版南十字星大学毕业证(SCU毕业证书)学历如何办理一比一原版南十字星大学毕业证(SCU毕业证书)学历如何办理
一比一原版南十字星大学毕业证(SCU毕业证书)学历如何办理
 
A gentle exploration of Retrieval Augmented Generation
A gentle exploration of Retrieval Augmented GenerationA gentle exploration of Retrieval Augmented Generation
A gentle exploration of Retrieval Augmented Generation
 
The Ipsos - AI - Monitor 2024 Report.pdf
The  Ipsos - AI - Monitor 2024 Report.pdfThe  Ipsos - AI - Monitor 2024 Report.pdf
The Ipsos - AI - Monitor 2024 Report.pdf
 
Experts live - Improving user adoption with AI
Experts live - Improving user adoption with AIExperts live - Improving user adoption with AI
Experts live - Improving user adoption with AI
 
一比一原版多伦多大学毕业证(UofT毕业证书)学历如何办理
一比一原版多伦多大学毕业证(UofT毕业证书)学历如何办理一比一原版多伦多大学毕业证(UofT毕业证书)学历如何办理
一比一原版多伦多大学毕业证(UofT毕业证书)学历如何办理
 
Cell The Unit of Life for NEET Multiple Choice Questions.docx
Cell The Unit of Life for NEET Multiple Choice Questions.docxCell The Unit of Life for NEET Multiple Choice Questions.docx
Cell The Unit of Life for NEET Multiple Choice Questions.docx
 

Warehousing_Ch10.ppt

  • 1. ANHAI DOAN ALON HALEVY ZACHARY IVES CHAPTER 10: DATA WAREHOUSING & CACHING PRINCIPLES OF DATA INTEGRATION
  • 2. Data Warehousing and Materialization  We have mostly focused on techniques for virtual data integration (see Ch. 1)  Queries are composed with mappings on the fly and data is fetched on demand  This represents one extreme point  In this chapter, we consider cases where data is transformed and materialized “in advance” of the queries  The main scenario: the data warehouse
  • 3. What Is a Data Warehouse?  In many organizations, we want a central “store” of all of our entities, concepts, metadata, and historical information  For doing data validation, complex mining, analysis, prediction, …  This is the data warehouse  To this point we’ve focused on scenarios where the data “lives” in the sources – here we may have a “master” version (and archival version) in a central database  For performance reasons, availability reasons, archival reasons, …
  • 4. In the Rest of this Chapter…  The data warehouse as the master data instance  Data warehouse architectures, design, loading  Data exchange: declarative data warehousing  Hybrid models: caching and partial materialization  Querying externally archived data
  • 5. Outline  The data warehouse  Motivation: Master data management  Physical design  Extract/transform/load  Data exchange  Caching & partial materialization  Operating on external data
  • 6. Master Data Management  One of the “modern” uses of the data warehouse is not only to support analytics but to serve as a reference to all of the entities in the organization  A cleaned, validated repository of what we know … which can be linked to by data sources … which may help with data cleaning … and which may be the basis of data governance (processes by which data is created and modified in a systematic way, e.g., to comply with gov’t regulations)  There is an emerging field called master data management out the process of creating these
  • 7. Data Warehouse Architecture  At the top – a centralized database  Generally configured for queries and appends – not transactions  Many indices, materialized views, etc.  Data is loaded and periodically updated via Extract/Transform/Load (ETL) tools Data Warehouse ETL ETL ETL ETL RDBMS1 RDBMS2 HTML1 XML1 ETL pipeline outputs ETL
  • 8. ETL Tools  ETL tools are the equivalent of schema mappings in virtual integration, but are more powerful  Arbitrary pieces of code to take data from a source, convert it into data for the warehouse:  import filters – read and convert from data sources  data transformations – join, aggregate, filter, convert data  de-duplication – finds multiple records referring to the same entity, merges them  profiling – builds tables, histograms, etc. to summarize data  quality management – test against master values, known business rules, constraints, etc.
  • 9. Example ETL Tool Chain  This is an example for e-commerce loading  Note multiple stages of filtering (using selection or join-like operations), logging bad records, before we group and load Invoice line items Split Date- time Filter invalid Join Filter invalid Invalid dates/times Invalid items Item records Filter non - match Invalid customers Group by customer Customer balance Customer records
  • 10. Basic Data Warehouse – Summary  Two aspects:  A central DBMS optimized for appends and querying  The “master data” instance  Or the instance for doing mining, analytics, and prediction  A set of procedural ETL “pipelines” to fetch, transform, filter, clean, and load data  Often these tools are more expressive than standard conjunctive queries (as in Chapters 2-3)  … But not always! This raises a question – can we do warehousing with declarative mappings?
  • 11. Outline  The data warehouse  Data exchange  Caching & partial materialization  Operating on external data
  • 12. Data Exchange  Intuitively, a declarative setup for data warehousing  Declarative schema mappings as in Ch. 2-3  Materialized database as in the previous section  Also allow for unknown values when we map from source to target (warehouse) instance  If we know a professor teaches a student, then there must exist a course C that the student took and the professor taught – but we may not know which…
  • 13. Data Exchange Formulation A data exchange setting (S,T,M,CT) has:  S, source schema representing all of the source tables jointly  T, target schema  A set of mappings or tuple-generating dependencies relating S and T  A set of constraints (equality-generating dependencies) ("X)s1(X1), ..., sm (Xm ) ® ($Y) t1(Y1), ..., tk (Yk ) ) Y (Y ) Y ( t ..., , ) Y ( )t Y ( j i l l 1 1   
  • 14. An Example Source S has Teaches(prof, student) Adviser(adviser, student) Target T has Advise(adviser, student) TeachesCourse(prof, course) Takes(course, student) ) , ( ), , ( . , ) , ( : ) , ( ) , ( : ) , ( ), , ( . ) , ( : ) , ( . ) , ( : 4 3 2 1 stud C Takes C D rse TeachesCou D C stud prof Adviser r stud prof Advise stud prof Adviser r stud C Takes C prof rse TeachesCou C stud prof Teaches r stud D Advise D stud prof Teaches r        existential variables represent unknown
  • 15. The Data Exchange Solution  The goal of data exchange is to compute an instance of the target schema, given a data exchange setting D = (S,T,M,CT) and an instance I(S)  An instance J of Schema T is a data exchange solution for D and I if 1. the pair (I,J) satisfies schema mapping M, and 2. J satisfies constraints CT
  • 16. Instance I(S) has Teaches Adviser Back to the Example, Now with Data prof student Ann Bob Chloe David Instance J(T) has Advise TeachesCourse Takes adviser student Ellen Bob Felicia David adviser student Ellen Bob Felicia David course student C1 Bob C2 David prof course Ann C1 Chloe C2 variables or labeled nulls represent unknown values
  • 17. Instance I(S) has Teaches Adviser This Is also a Solution prof student Ann Bob Chloe David Instance J(T) has Advise TeachesCourse Takes adviser student Ellen Bob Felicia David adviser student Ellen Bob Felicia David course student C1 Bob C1 David prof course Ann C1 Chloe C1 this time the labeled nulls are all the same!
  • 18. Universal Solutions  Intuitively, the first solution should be better than the second  The first solution uses the same variable for the course taught by Ann and by Chloe – they are the same course  But this was not specified in the original schema!  We formalize that through the notion of the universal solution, which must not lose any information
  • 19. Formalizing the Universal Solution First we define instance homomorphism:  Let J1, J2 be two instances of schema T  A mapping h: J1  J2 is a homomorphism from J1 to J2 if  h(c) = c for every c ∈ C,  for every tuple R(a1,…,an) ∈ J1 the tuple R(h(a1),…,h(an)) ∈ J2  J1, J2 are homomorphically equivalent if there are homomorphisms h: J1  J2 and h’: J2  J1 Def: Universal solution for data exchange setting D = (S,T,M,CT), where I is an instance of S. A data exchange solution J for D and I is a universal solution if, for every other data exchange solution J’ for D and I, there exists a homomorphism h: J  J’
  • 20. Computing Universal Solutions  The standard process is to use a procedure called the chase  Informally:  Consider every formula r of M in turn:  If there is a variable substitution for the left-hand side (lhs) of r where the right-hand side (rhs) is not in the solution – add it  If we create a new tuple, for every existential variable in the rhs, substitute a new fresh variable  See Chapter 10 Algorithm 10 for full pseudocode
  • 21. Core Universal Solutions  Universal solutions may be of arbitrary size  The core universal solution is the minimal universal solution
  • 22. Data Exchange and Querying  As with the data warehouse, all queries are directly posed over the target database – no reformulation necessary  However, we typically assume certain answers semantics  To get the certain answers (which are the same as in the virtual integration setting with GLAV/TGD mappings) – compute the query answers and then drop any tuples with labeled nulls (variables)
  • 23. Data Exchange vs. Warehousing  From an external perspective, exchange and warehousing are essentially equivalent  But there are different trade-offs in procedural vs. declarative mappings  Procedural – more expressive  Declarative – easier to reason about, compose, invert, create matieralized views for, etc. (see Chapter 6)
  • 24. Outline  The data warehouse  Data exchange  Caching & partial materialization  Operating on external data
  • 25. The Spectrum of Materialization  Many real EII systems compute and maintain materialized views, or cache results A “hybrid” point between the fully virtual and fully materialized approaches Virtual integration (EII) Data exchange / data warehouse sources materialized all mediated relations materialized caching or partial materialization – some views materialized
  • 26. Possible Techniques for Choosing What to Materialize Cache results of prior queries  Take the results of each query, materialize them  Use answering queries using views to reuse  Expire using time-to-live… May not always be fresh! Administrator-selected views  Someone manually specifies views to compute and maintain, as with a relational DBMS  System automatically maintains Automatic view selection  Using query workload, update frequencies – a view materialization wizard chooses what to materialize
  • 27. Outline  The data warehouse  Data exchange  Caching & partial materialization  Operating on external data
  • 28. Many “Integration-Like” Scenarios over Historical Data  Many Web scenarios where we have large logs of data accesses, created by the server  Goal: put these together and query them!  Looks like a very simple data integration scenario – external data, but single schema  A common approach: use programming environments like MapReduce (or SQL layers above) to query the data on a cluster  MapReduce reliably runs large jobs across 100s or 1000s of “shared nothing” nodes in a cluster
  • 29. MapReduce Basics  MapReduce is essentially a template for writing distributed programs – corresponding to a single SQL SELECT..FROM..WHERE..GROUP BY..HAVING block with user-defined functions  The MapReduce runtime calls a set of functions:  map is given a tuple, outputs 0 or more tuples in response  roughly like the WHERE clause  shuffle is a stage for doing sort-based grouping on a key (specified by the map)  reduce is an aggregate function called over the set of tuples with the same grouping key
  • 30. MapReduce Dataflow “Template”: Tuples  Map “worker”  Shuffle  Reduce “worker” 30 Map Worker Map Worker Map Worker Map Worker Reduce Worker Reduce Worker Reduce Worker Reduce Worker Reduce Worker emit tuples emit aggregate results
  • 31. MapReduce as ETL  Some people use MapReduce to take data, transform it, and load it into a warehouse  … which is basically what ETL tools do!  The dividing line between DBMSs, EII, MapReduce is blurring as of the development of this book  SQL  MapReduce  MapReduce over SQL engines  Shared-nothing DBMSs  NoSQL
  • 32. Warehousing & Materialization Wrap-up  There are benefits to centralizing & materializing data  Performance, especially for analytics / mining  Archival  Standardization / canonicalization  Data warehouses typically use procedural ETL tools to extract, transform, load (and clean) data  Data exchange replaces ETL with declarative mappings (where feasible)  Hybrid schemes exist for partial materialization  Increasingly we are integrating via MapReduce and its cousins