ANHAI DOAN ALON HALEVY ZACHARY IVES
CHAPTER 10: DATA
WAREHOUSING & CACHING
PRINCIPLES OF
DATA INTEGRATION
Data Warehousing and
Materialization
 We have mostly focused on techniques for virtual
data integration (see Ch. 1)
 Queries are composed with mappings on the fly and data
is fetched on demand
 This represents one extreme point
 In this chapter, we consider cases where data is
transformed and materialized “in advance” of the
queries
 The main scenario: the data warehouse
What Is a Data Warehouse?
 In many organizations, we want a central “store” of
all of our entities, concepts, metadata, and historical
information
 For doing data validation, complex mining, analysis,
prediction, …
 This is the data warehouse
 To this point we’ve focused on scenarios where the
data “lives” in the sources – here we may have a
“master” version (and archival version) in a central
database
 For performance reasons, availability reasons, archival
reasons, …
In the Rest of this Chapter…
 The data warehouse as the master data instance
 Data warehouse architectures, design, loading
 Data exchange: declarative data warehousing
 Hybrid models: caching and partial materialization
 Querying externally archived data
Outline
 The data warehouse
 Motivation: Master data management
 Physical design
 Extract/transform/load
 Data exchange
 Caching & partial materialization
 Operating on external data
Master Data Management
 One of the “modern” uses of the data warehouse is
not only to support analytics but to serve as a
reference to all of the entities in the organization
 A cleaned, validated repository of what we know
… which can be linked to by data sources
… which may help with data cleaning
… and which may be the basis of data governance
(processes by which data is created and modified in a
systematic way, e.g., to comply with gov’t regulations)
 There is an emerging field called master data
management around the process of creating these repositories
Data Warehouse Architecture
 At the top – a
centralized database
 Generally configured for
queries and appends –
not transactions
 Many indices,
materialized views, etc.
 Data is loaded and
periodically updated via
Extract/Transform/Load
(ETL) tools
[Figure: sources RDBMS1, RDBMS2, HTML1, and XML1 each feed an ETL pipeline; the ETL pipeline outputs are loaded into the data warehouse]
ETL Tools
 ETL tools are the equivalent of schema mappings in
virtual integration, but are more powerful
 Arbitrary pieces of code to take data from a source,
convert it into data for the warehouse:
 import filters – read and convert from data sources
 data transformations – join, aggregate, filter, convert
data
 de-duplication – finds multiple records referring to the
same entity, merges them
 profiling – builds tables, histograms, etc. to summarize
data
 quality management – test against master values, known
Example ETL Tool Chain
 This is an example for e-commerce loading
 Note multiple stages of filtering (using selection or
join-like operations), logging bad records, before we
group and load
[Figure: invoice line items → split date-time → filter invalid (invalid dates/times logged) → join with item records → filter invalid (invalid items logged) → filter non-match against customer records (invalid customers logged) → group by customer → customer balance]
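The tool chain above can be sketched in a few lines of Python. This is only an illustration under assumed record layouts: the field names (`timestamp`, `item_id`, `customer_id`, `quantity`, `price`) and the function `run_pipeline` are hypothetical and do not come from any particular ETL product.

```python
from collections import defaultdict
from datetime import datetime

def run_pipeline(line_items, item_records, customer_records):
    """Filter, join, and aggregate invoice line items; return (balances, rejects)."""
    rejects = {"invalid_dates": [], "invalid_items": [], "invalid_customers": []}
    balances = defaultdict(float)
    for row in line_items:
        # Split date-time, then filter out rows whose timestamp does not parse.
        try:
            date_part, time_part = row["timestamp"].split(" ")
            datetime.strptime(date_part + " " + time_part, "%Y-%m-%d %H:%M:%S")
        except ValueError:
            rejects["invalid_dates"].append(row)
            continue
        # Join with item records; log line items that reference unknown items.
        item = item_records.get(row["item_id"])
        if item is None:
            rejects["invalid_items"].append(row)
            continue
        # Filter line items with no matching customer record.
        if row["customer_id"] not in customer_records:
            rejects["invalid_customers"].append(row)
            continue
        # Group by customer and accumulate the balance.
        balances[row["customer_id"]] += item["price"] * row["quantity"]
    return dict(balances), rejects

# Hypothetical sample data.
items = {"i1": {"price": 2.5}, "i2": {"price": 10.0}}
customers = {"c1": {"name": "Ann"}, "c2": {"name": "Bob"}}
lines = [
    {"timestamp": "2024-01-05 10:00:00", "item_id": "i1", "customer_id": "c1", "quantity": 2},
    {"timestamp": "2024-01-05 11:30:00", "item_id": "i2", "customer_id": "c1", "quantity": 1},
    {"timestamp": "not a date", "item_id": "i1", "customer_id": "c2", "quantity": 1},
    {"timestamp": "2024-01-06 09:00:00", "item_id": "i9", "customer_id": "c2", "quantity": 1},
    {"timestamp": "2024-01-06 09:05:00", "item_id": "i1", "customer_id": "c7", "quantity": 4},
]
balances, rejects = run_pipeline(lines, items, customers)
print(balances)  # {'c1': 15.0}
```

Each bad record lands in exactly one reject list, which mirrors the "logging bad records" stages before the final group and load.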
Basic Data Warehouse – Summary
 Two aspects:
 A central DBMS optimized for appends and querying
 The “master data” instance
 Or the instance for doing mining, analytics, and prediction
 A set of procedural ETL “pipelines” to fetch, transform,
filter, clean, and load data
 Often these tools are more expressive than standard conjunctive
queries (as in Chapters 2-3)
 … But not always!
This raises a question – can we do warehousing with declarative
mappings?
Outline
The data warehouse
 Data exchange
 Caching & partial materialization
 Operating on external data
Data Exchange
 Intuitively, a declarative setup for data warehousing
 Declarative schema mappings as in Ch. 2-3
 Materialized database as in the previous section
 Also allow for unknown values when we map from
source to target (warehouse) instance
 If we know a professor teaches a student, then there must
exist a course C that the student took and the professor
taught – but we may not know which…
Data Exchange Formulation
A data exchange setting (S,T,M,CT) has:
 S, source schema representing all of the source tables
jointly
 T, target schema
 A set of mappings or tuple-generating dependencies
relating S and T
 A set of constraints (equality-generating dependencies)
Each mapping in M is a tgd of the form (∀X) s1(X1), ..., sm(Xm) → (∃Y) t1(Y1), ..., tk(Yk)
Each constraint in CT is an egd of the form (∀Y) t1(Y1), ..., tl(Yl) → (Yi = Yj)
An Example
Source S has
Teaches(prof, student)
Adviser(adviser, student)
Target T has
Advise(adviser, student)
TeachesCourse(prof, course)
Takes(course, student)
r1: Teaches(prof, stud) → ∃D. Advise(D, stud)
r2: Teaches(prof, stud) → ∃C. TeachesCourse(prof, C), Takes(C, stud)
r3: Adviser(prof, stud) → Advise(prof, stud)
r4: Adviser(prof, stud) → ∃C, D. TeachesCourse(D, C), Takes(C, stud)
existential variables represent unknowns
The Data Exchange Solution
 The goal of data exchange is to compute an instance
of the target schema, given a data exchange setting
D = (S,T,M,CT) and an instance I(S)
 An instance J of Schema T is a data exchange
solution for D and I if
1. the pair (I,J) satisfies schema mapping M, and
2. J satisfies constraints CT
Back to the Example, Now with Data
Instance I(S) has
Teaches
prof     student
Ann      Bob
Chloe    David
Adviser
adviser  student
Ellen    Bob
Felicia  David
Instance J(T) has
Advise
adviser  student
Ellen    Bob
Felicia  David
TeachesCourse
prof     course
Ann      C1
Chloe    C2
Takes
course   student
C1       Bob
C2       David
C1 and C2 are variables, or labeled nulls, that represent unknown values
This Is also a Solution
Instance I(S) has
Teaches
prof     student
Ann      Bob
Chloe    David
Adviser
adviser  student
Ellen    Bob
Felicia  David
Instance J(T) has
Advise
adviser  student
Ellen    Bob
Felicia  David
TeachesCourse
prof     course
Ann      C1
Chloe    C1
Takes
course   student
C1       Bob
C1       David
this time the labeled nulls are all the same!
Universal Solutions
 Intuitively, the first solution should be better than
the second
 The second solution uses the same variable for the course
taught by Ann and by Chloe – it asserts they are the same course
 But this was not specified in the original schema!
 We formalize that through the notion of the
universal solution, which must not lose any
information
Formalizing the Universal Solution
First we define instance homomorphism:
 Let J1, J2 be two instances of schema T
 A mapping h: J1 → J2 is a homomorphism from J1 to J2 if
 h(c) = c for every constant c ∈ C,
 for every tuple R(a1,…,an) ∈ J1 the tuple R(h(a1),…,h(an)) ∈ J2
 J1, J2 are homomorphically equivalent if there are
homomorphisms h: J1 → J2 and h’: J2 → J1
Def: Universal solution for data exchange setting
D = (S,T,M,CT), where I is an instance of S.
A data exchange solution J for D and I is a universal
solution if, for every other data exchange solution J’ for D
and I, there exists a homomorphism h: J → J’
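To make the definition concrete, here is a brute-force homomorphism search in Python, applied to the two example solutions. It is a sketch only: representing an instance as a set of (relation, tuple) facts and marking labeled nulls with a leading underscore are assumptions made for illustration, and the exhaustive search is only practical for tiny instances.

```python
from itertools import product

def is_null(value):
    # Assumed convention: labeled nulls are strings starting with "_".
    return isinstance(value, str) and value.startswith("_")

def find_homomorphism(j1, j2):
    """Return a dict h on the nulls of j1 that maps every fact of j1 into j2, or None."""
    nulls = sorted({v for _, args in j1 for v in args if is_null(v)})
    targets = sorted({v for _, args in j2 for v in args}, key=str)
    for image in product(targets, repeat=len(nulls)):
        h = dict(zip(nulls, image))
        # Constants map to themselves; nulls map according to h.
        mapped = {(rel, tuple(h.get(v, v) for v in args)) for rel, args in j1}
        if mapped <= j2:
            return h
    return None

common = {("Advise", ("Ellen", "Bob")), ("Advise", ("Felicia", "David"))}
# Solution with a distinct null per course.
distinct = common | {
    ("TeachesCourse", ("Ann", "_C1")), ("TeachesCourse", ("Chloe", "_C2")),
    ("Takes", ("_C1", "Bob")), ("Takes", ("_C2", "David")),
}
# Solution that reuses a single null for both courses.
merged = common | {
    ("TeachesCourse", ("Ann", "_C1")), ("TeachesCourse", ("Chloe", "_C1")),
    ("Takes", ("_C1", "Bob")), ("Takes", ("_C1", "David")),
}

forward = find_homomorphism(distinct, merged)   # {'_C1': '_C1', '_C2': '_C1'}
backward = find_homomorphism(merged, distinct)  # None
```

A homomorphism exists from the solution with distinct nulls into the one with a shared null, but not in the other direction, so the shared-null solution cannot be universal.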
Computing Universal Solutions
 The standard process is to use a procedure called
the chase
 Informally:
 Consider every formula r of M in turn:
 If there is a variable substitution for the left-hand side (lhs) of r
where the right-hand side (rhs) is not in the solution – add it
 If we create a new tuple, for every existential variable in the rhs,
substitute a new fresh variable
 See Chapter 10 Algorithm 10 for full pseudocode
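The informal description can be sketched in Python as follows. This is a simplified illustration, not the full algorithm referenced above: it handles only source-to-target tgds (no egds), it assumes every argument in a tgd is a variable, and the `_N` naming of fresh nulls is an arbitrary choice.

```python
from itertools import count

def match(atoms, facts, binding=None):
    """Yield every variable binding that maps all atoms onto facts."""
    binding = binding or {}
    if not atoms:
        yield dict(binding)
        return
    (rel, args), rest = atoms[0], atoms[1:]
    for fact_rel, fact_args in sorted(facts):  # sorted for a repeatable order
        if fact_rel != rel or len(fact_args) != len(args):
            continue
        new = dict(binding)
        if all(new.setdefault(a, v) == v for a, v in zip(args, fact_args)):
            yield from match(rest, facts, new)

def chase(source, tgds):
    """Apply each tgd (lhs, rhs, existential variables) to the source instance."""
    target = set()
    fresh = count(1)
    for lhs, rhs, existentials in tgds:
        for b in list(match(lhs, source)):
            # Skip the step if some extension of b already satisfies the rhs.
            if any(True for _ in match(rhs, target, b)):
                continue
            # Otherwise add the rhs with a fresh labeled null per existential variable.
            b = dict(b)
            for var in existentials:
                b[var] = f"_N{next(fresh)}"
            target |= {(rel, tuple(b[a] for a in args)) for rel, args in rhs}
    return target

tgds = [
    # r1: Teaches(prof, stud) -> exists D. Advise(D, stud)
    ([("Teaches", ("prof", "stud"))], [("Advise", ("D", "stud"))], ["D"]),
    # r2: Teaches(prof, stud) -> exists C. TeachesCourse(prof, C), Takes(C, stud)
    ([("Teaches", ("prof", "stud"))],
     [("TeachesCourse", ("prof", "C")), ("Takes", ("C", "stud"))], ["C"]),
    # r3: Adviser(prof, stud) -> Advise(prof, stud)
    ([("Adviser", ("prof", "stud"))], [("Advise", ("prof", "stud"))], []),
    # r4: Adviser(prof, stud) -> exists C, D. TeachesCourse(D, C), Takes(C, stud)
    ([("Adviser", ("prof", "stud"))],
     [("TeachesCourse", ("D", "C")), ("Takes", ("C", "stud"))], ["C", "D"]),
]
source = {
    ("Teaches", ("Ann", "Bob")), ("Teaches", ("Chloe", "David")),
    ("Adviser", ("Ellen", "Bob")), ("Adviser", ("Felicia", "David")),
}
solution = chase(source, tgds)
```

On the example instance this produces eight facts. r4 adds nothing because r2 has already supplied a course for each student, while r1 contributes Advise facts with nulls that sit alongside the concrete advisers added by r3.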
Core Universal Solutions
 Universal solutions may be of arbitrary size
 The core universal solution is the minimal universal
solution
Data Exchange and Querying
 As with the data warehouse, all queries are directly
posed over the target database – no reformulation
necessary
 However, we typically assume certain answers
semantics
 To get the certain answers (which are the same as in the
virtual integration setting with GLAV/TGD mappings) –
compute the query answers and then drop any tuples
with labeled nulls (variables)
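A small Python sketch of this recipe over the example target instance follows. The underscore prefix for labeled nulls and the two sample queries are assumptions chosen for illustration; nulls are compared by label during the join, which is the usual treatment for conjunctive queries.

```python
def is_null(value):
    # Assumed convention: labeled nulls are strings starting with "_".
    return isinstance(value, str) and value.startswith("_")

def certain(answers):
    """Keep only the answer tuples that contain no labeled nulls."""
    return {t for t in answers if not any(is_null(v) for v in t)}

teaches_course = {("Ann", "_C1"), ("Chloe", "_C2")}
takes = {("_C1", "Bob"), ("_C2", "David")}

# Q1(prof, student) :- TeachesCourse(prof, c), Takes(c, student)
q1 = {(p, s) for p, c in teaches_course for c2, s in takes if c == c2}
# Q2(prof, course) :- TeachesCourse(prof, course)
q2 = set(teaches_course)

certain_q1 = certain(q1)  # {('Ann', 'Bob'), ('Chloe', 'David')}
certain_q2 = certain(q2)  # set()
```

Q1 joins through the unknown course but projects it away, so its answers are certain. Q2 returns the unknown course itself, so every tuple is dropped.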
Data Exchange vs. Warehousing
 From an external perspective, exchange and
warehousing are essentially equivalent
 But there are different trade-offs in procedural vs.
declarative mappings
 Procedural – more expressive
 Declarative – easier to reason about, compose, invert,
create materialized views for, etc. (see Chapter 6)
Outline
The data warehouse
Data exchange
 Caching & partial materialization
 Operating on external data
The Spectrum of Materialization
 Many real EII systems compute and maintain
materialized views, or cache results
A “hybrid” point between the fully virtual and fully
materialized approaches
[Figure: a spectrum running from virtual integration (EII), where only the sources are materialized, through caching or partial materialization, where some views are materialized, to data exchange / data warehouse, where all mediated relations are materialized]
Possible Techniques for Choosing
What to Materialize
Cache results of prior queries
 Take the results of each query, materialize them
 Reuse them for later queries via answering queries using views
 Expire using time-to-live… May not always be fresh!
Administrator-selected views
 Someone manually specifies views to compute and
maintain, as with a relational DBMS
 The system maintains them automatically
Automatic view selection
 Using query workload, update frequencies – a view
materialization wizard chooses what to materialize
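The first technique, caching prior query results with a time-to-live, might look like the Python sketch below. The class name `QueryCache` and its interface are hypothetical, and the cache only reuses results for an identical query string; reuse through answering queries using views would need a containment check that is not shown here.

```python
import time

class QueryCache:
    """Cache query results and expire them after a fixed time-to-live."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self.entries = {}  # query text -> (expiry time, rows)

    def get(self, query, compute):
        now = self.clock()
        hit = self.entries.get(query)
        if hit is not None and hit[0] > now:
            return hit[1]          # still within its TTL, but possibly stale
        rows = compute(query)      # miss or expired: go back to the sources
        self.entries[query] = (now + self.ttl, rows)
        return rows

# Usage with a fake clock so that expiry is easy to observe.
calls = []
def fetch(query):
    calls.append(query)
    return [("Ann", "Bob")]

now = [0.0]
cache = QueryCache(ttl_seconds=10, clock=lambda: now[0])
cache.get("Q", fetch)   # miss: computed from the sources
cache.get("Q", fetch)   # hit: served from the cache
now[0] = 11.0
cache.get("Q", fetch)   # expired: recomputed
print(len(calls))       # 2
```

Between the first computation and the expiry the sources may have changed, which is the freshness caveat noted above.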
Outline
The data warehouse
Data exchange
Caching & partial materialization
 Operating on external data
Many “Integration-Like” Scenarios
over Historical Data
 Many Web scenarios where we have large logs of
data accesses, created by the server
 Goal: put these together and query them!
 Looks like a very simple data integration scenario –
external data, but single schema
 A common approach: use programming
environments like MapReduce (or SQL layers above)
to query the data on a cluster
 MapReduce reliably runs large jobs across 100s or 1000s
of “shared nothing” nodes in a cluster
MapReduce Basics
 MapReduce is essentially a template for writing
distributed programs – corresponding to a single SQL
SELECT..FROM..WHERE..GROUP BY..HAVING block
with user-defined functions
 The MapReduce runtime calls a set of functions:
 map is given a tuple, outputs 0 or more tuples in response
 roughly like the WHERE clause
 shuffle is a stage for doing sort-based grouping on a key
(specified by the map)
 reduce is an aggregate function called over the set of
tuples with the same grouping key
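The three stages can be mimicked in a single Python process, as sketched below. This is not how a real MapReduce runtime is implemented (there is no distribution or fault tolerance here); the function `map_reduce` and the sample access log are hypothetical and only show how map, shuffle, and reduce correspond to WHERE, GROUP BY, and an aggregate.

```python
from collections import defaultdict

def map_reduce(records, map_fn, reduce_fn):
    groups = defaultdict(list)
    for record in records:
        # map: each input tuple yields zero or more (key, value) pairs
        for key, value in map_fn(record):
            # shuffle: collect values under their grouping key
            groups[key].append(value)
    # reduce: one aggregate call per grouping key
    return {key: reduce_fn(key, values) for key, values in sorted(groups.items())}

# Hypothetical web access log: (url, HTTP status).
log = [("/home", 200), ("/cart", 500), ("/home", 200), ("/cart", 200), ("/home", 404)]

def map_ok(record):
    url, status = record
    if status == 200:   # plays the role of WHERE status = 200
        yield url, 1

def count_hits(url, values):
    return sum(values)  # plays the role of COUNT(*) ... GROUP BY url

hits = map_reduce(log, map_ok, count_hits)
print(hits)  # {'/cart': 1, '/home': 2}
```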
MapReduce Dataflow “Template”: Tuples → Map “worker” → Shuffle → Reduce “worker”
[Figure: several map workers emit tuples, which are shuffled to several reduce workers that emit aggregate results]
MapReduce as ETL
 Some people use MapReduce to take data,
transform it, and load it into a warehouse
 … which is basically what ETL tools do!
 The dividing line between DBMSs, EII, MapReduce is
blurring as of the development of this book
 SQL ↔ MapReduce
 MapReduce over SQL engines
 Shared-nothing DBMSs
 NoSQL
Warehousing & Materialization Wrap-up
 There are benefits to centralizing & materializing data
 Performance, especially for analytics / mining
 Archival
 Standardization / canonicalization
 Data warehouses typically use procedural ETL tools to
extract, transform, load (and clean) data
 Data exchange replaces ETL with declarative
mappings (where feasible)
 Hybrid schemes exist for partial materialization
 Increasingly we are integrating via MapReduce and its
cousins

More Related Content

What's hot

6. Linked list - Data Structures using C++ by Varsha Patil
6. Linked list - Data Structures using C++ by Varsha Patil6. Linked list - Data Structures using C++ by Varsha Patil
6. Linked list - Data Structures using C++ by Varsha Patil
widespreadpromotion
 
Lect07
Lect07Lect07
5. Queue - Data Structures using C++ by Varsha Patil
5. Queue - Data Structures using C++ by Varsha Patil5. Queue - Data Structures using C++ by Varsha Patil
5. Queue - Data Structures using C++ by Varsha Patil
widespreadpromotion
 
I- Extended Databases
I- Extended DatabasesI- Extended Databases
I- Extended Databases
Zakaria Zubi
 
14. Files - Data Structures using C++ by Varsha Patil
14. Files - Data Structures using C++ by Varsha Patil14. Files - Data Structures using C++ by Varsha Patil
14. Files - Data Structures using C++ by Varsha Patil
widespreadpromotion
 
Lecture 07 Data Structures - Basic Sorting
Lecture 07 Data Structures - Basic SortingLecture 07 Data Structures - Basic Sorting
Lecture 07 Data Structures - Basic Sorting
Haitham El-Ghareeb
 
Introduction to database-Formal Query language and Relational calculus
Introduction to database-Formal Query language and Relational calculusIntroduction to database-Formal Query language and Relational calculus
Introduction to database-Formal Query language and Relational calculus
Ajit Nayak
 
Database Programming using SQL
Database Programming using SQLDatabase Programming using SQL
Database Programming using SQL
Ajit Nayak
 
DATA STRUCTURE AND ALGORITHMS
DATA STRUCTURE AND ALGORITHMS DATA STRUCTURE AND ALGORITHMS
DATA STRUCTURE AND ALGORITHMS
Adams Sidibe
 
DBMS _Relational model
DBMS _Relational modelDBMS _Relational model
DBMS _Relational model
Azizul Mamun
 
10. Search Tree - Data Structures using C++ by Varsha Patil
10. Search Tree - Data Structures using C++ by Varsha Patil10. Search Tree - Data Structures using C++ by Varsha Patil
10. Search Tree - Data Structures using C++ by Varsha Patil
widespreadpromotion
 
Adt
AdtAdt
Adt
MrSaem
 
DBMS_Ch1
 DBMS_Ch1 DBMS_Ch1
DBMS_Ch1
Azizul Mamun
 
Bc0058 data warehousing
Bc0058   data warehousingBc0058   data warehousing
Bc0058 data warehousing
smumbahelp
 
Semi-automatic Discovery of Mappings Between Heterogeneous Data Warehouse Dim...
Semi-automatic Discovery of Mappings Between Heterogeneous Data Warehouse Dim...Semi-automatic Discovery of Mappings Between Heterogeneous Data Warehouse Dim...
Semi-automatic Discovery of Mappings Between Heterogeneous Data Warehouse Dim...
IDES Editor
 
DSA - Lecture 04
DSA - Lecture 04DSA - Lecture 04
DSA - Lecture 04
Haitham El-Ghareeb
 
3. Stack - Data Structures using C++ by Varsha Patil
3. Stack - Data Structures using C++ by Varsha Patil3. Stack - Data Structures using C++ by Varsha Patil
3. Stack - Data Structures using C++ by Varsha Patil
widespreadpromotion
 
9. Searching & Sorting - Data Structures using C++ by Varsha Patil
9. Searching & Sorting - Data Structures using C++ by Varsha Patil9. Searching & Sorting - Data Structures using C++ by Varsha Patil
9. Searching & Sorting - Data Structures using C++ by Varsha Patil
widespreadpromotion
 
DBMS_intermediate sql
DBMS_intermediate sqlDBMS_intermediate sql
DBMS_intermediate sql
Azizul Mamun
 
Bt0066 dbms
Bt0066 dbmsBt0066 dbms
Bt0066 dbms
smumbahelp
 

What's hot (20)

6. Linked list - Data Structures using C++ by Varsha Patil
6. Linked list - Data Structures using C++ by Varsha Patil6. Linked list - Data Structures using C++ by Varsha Patil
6. Linked list - Data Structures using C++ by Varsha Patil
 
Lect07
Lect07Lect07
Lect07
 
5. Queue - Data Structures using C++ by Varsha Patil
5. Queue - Data Structures using C++ by Varsha Patil5. Queue - Data Structures using C++ by Varsha Patil
5. Queue - Data Structures using C++ by Varsha Patil
 
I- Extended Databases
I- Extended DatabasesI- Extended Databases
I- Extended Databases
 
14. Files - Data Structures using C++ by Varsha Patil
14. Files - Data Structures using C++ by Varsha Patil14. Files - Data Structures using C++ by Varsha Patil
14. Files - Data Structures using C++ by Varsha Patil
 
Lecture 07 Data Structures - Basic Sorting
Lecture 07 Data Structures - Basic SortingLecture 07 Data Structures - Basic Sorting
Lecture 07 Data Structures - Basic Sorting
 
Introduction to database-Formal Query language and Relational calculus
Introduction to database-Formal Query language and Relational calculusIntroduction to database-Formal Query language and Relational calculus
Introduction to database-Formal Query language and Relational calculus
 
Database Programming using SQL
Database Programming using SQLDatabase Programming using SQL
Database Programming using SQL
 
DATA STRUCTURE AND ALGORITHMS
DATA STRUCTURE AND ALGORITHMS DATA STRUCTURE AND ALGORITHMS
DATA STRUCTURE AND ALGORITHMS
 
DBMS _Relational model
DBMS _Relational modelDBMS _Relational model
DBMS _Relational model
 
10. Search Tree - Data Structures using C++ by Varsha Patil
10. Search Tree - Data Structures using C++ by Varsha Patil10. Search Tree - Data Structures using C++ by Varsha Patil
10. Search Tree - Data Structures using C++ by Varsha Patil
 
Adt
AdtAdt
Adt
 
DBMS_Ch1
 DBMS_Ch1 DBMS_Ch1
DBMS_Ch1
 
Bc0058 data warehousing
Bc0058   data warehousingBc0058   data warehousing
Bc0058 data warehousing
 
Semi-automatic Discovery of Mappings Between Heterogeneous Data Warehouse Dim...
Semi-automatic Discovery of Mappings Between Heterogeneous Data Warehouse Dim...Semi-automatic Discovery of Mappings Between Heterogeneous Data Warehouse Dim...
Semi-automatic Discovery of Mappings Between Heterogeneous Data Warehouse Dim...
 
DSA - Lecture 04
DSA - Lecture 04DSA - Lecture 04
DSA - Lecture 04
 
3. Stack - Data Structures using C++ by Varsha Patil
3. Stack - Data Structures using C++ by Varsha Patil3. Stack - Data Structures using C++ by Varsha Patil
3. Stack - Data Structures using C++ by Varsha Patil
 
9. Searching & Sorting - Data Structures using C++ by Varsha Patil
9. Searching & Sorting - Data Structures using C++ by Varsha Patil9. Searching & Sorting - Data Structures using C++ by Varsha Patil
9. Searching & Sorting - Data Structures using C++ by Varsha Patil
 
DBMS_intermediate sql
DBMS_intermediate sqlDBMS_intermediate sql
DBMS_intermediate sql
 
Bt0066 dbms
Bt0066 dbmsBt0066 dbms
Bt0066 dbms
 

Similar to Warehousing

Warehousing_Ch10.ppt
Warehousing_Ch10.pptWarehousing_Ch10.ppt
Warehousing_Ch10.ppt
NoThanks63
 
Hibernate
HibernateHibernate
Whats a datawarehouse
Whats a datawarehouseWhats a datawarehouse
Whats a datawarehouse
vijjudarling
 
data analytics lecture 3.2.ppt
data analytics lecture 3.2.pptdata analytics lecture 3.2.ppt
data analytics lecture 3.2.ppt
RutujaPatil247341
 
Etl interview questions
Etl interview questionsEtl interview questions
Etl interview questions
ashokvirtual
 
Dwh faqs
Dwh faqsDwh faqs
Dwh faqs
infor123
 
Oracle tutorial
Oracle tutorialOracle tutorial
Oracle tutorial
Lalit Shaktawat
 
Data Warehouse Modeling
Data Warehouse ModelingData Warehouse Modeling
Data Warehouse Modeling
vivekjv
 
Dawak f v.6camera-1
Dawak f v.6camera-1Dawak f v.6camera-1
Dawak f v.6camera-1
Mohammed El malki
 
Introduction to Data Warehousing
Introduction to Data WarehousingIntroduction to Data Warehousing
Introduction to Data Warehousing
Jason S
 
The best ETL questions in a nut shell
The best ETL questions in a nut shellThe best ETL questions in a nut shell
The best ETL questions in a nut shell
Srinimf-Slides
 
Schemaless Databases
Schemaless DatabasesSchemaless Databases
Schemaless Databases
Dan Gunter
 
Datawarehousing & DSS
Datawarehousing & DSSDatawarehousing & DSS
Datawarehousing & DSS
Deepali Raut
 
CS 542 Parallel DBs, NoSQL, MapReduce
CS 542 Parallel DBs, NoSQL, MapReduceCS 542 Parallel DBs, NoSQL, MapReduce
CS 542 Parallel DBs, NoSQL, MapReduce
J Singh
 
Database Management System, Lecture-1
Database Management System, Lecture-1Database Management System, Lecture-1
Database Management System, Lecture-1
Sonia Mim
 
127556030 bisp-informatica-question-collections
127556030 bisp-informatica-question-collections127556030 bisp-informatica-question-collections
127556030 bisp-informatica-question-collections
Amit Sharma
 
Introduction to Data Science With R Notes
Introduction to Data Science With R NotesIntroduction to Data Science With R Notes
Introduction to Data Science With R Notes
LakshmiSarvani6
 
Dimensional data model
Dimensional data modelDimensional data model
Dimensional data model
Vnktp1
 
ch02models.pptx
ch02models.pptxch02models.pptx
ch02models.pptx
dreamboy6060
 
ch02models.pptx
ch02models.pptxch02models.pptx
ch02models.pptx
dreamboy6060
 

Similar to Warehousing (20)

Warehousing_Ch10.ppt
Warehousing_Ch10.pptWarehousing_Ch10.ppt
Warehousing_Ch10.ppt
 
Hibernate
HibernateHibernate
Hibernate
 
Whats a datawarehouse
Whats a datawarehouseWhats a datawarehouse
Whats a datawarehouse
 
data analytics lecture 3.2.ppt
data analytics lecture 3.2.pptdata analytics lecture 3.2.ppt
data analytics lecture 3.2.ppt
 
Etl interview questions
Etl interview questionsEtl interview questions
Etl interview questions
 
Dwh faqs
Dwh faqsDwh faqs
Dwh faqs
 
Oracle tutorial
Oracle tutorialOracle tutorial
Oracle tutorial
 
Data Warehouse Modeling
Data Warehouse ModelingData Warehouse Modeling
Data Warehouse Modeling
 
Dawak f v.6camera-1
Dawak f v.6camera-1Dawak f v.6camera-1
Dawak f v.6camera-1
 
Introduction to Data Warehousing
Introduction to Data WarehousingIntroduction to Data Warehousing
Introduction to Data Warehousing
 
The best ETL questions in a nut shell
The best ETL questions in a nut shellThe best ETL questions in a nut shell
The best ETL questions in a nut shell
 
Schemaless Databases
Schemaless DatabasesSchemaless Databases
Schemaless Databases
 
Datawarehousing & DSS
Datawarehousing & DSSDatawarehousing & DSS
Datawarehousing & DSS
 
CS 542 Parallel DBs, NoSQL, MapReduce
CS 542 Parallel DBs, NoSQL, MapReduceCS 542 Parallel DBs, NoSQL, MapReduce
CS 542 Parallel DBs, NoSQL, MapReduce
 
Database Management System, Lecture-1
Database Management System, Lecture-1Database Management System, Lecture-1
Database Management System, Lecture-1
 
127556030 bisp-informatica-question-collections
127556030 bisp-informatica-question-collections127556030 bisp-informatica-question-collections
127556030 bisp-informatica-question-collections
 
Introduction to Data Science With R Notes
Introduction to Data Science With R NotesIntroduction to Data Science With R Notes
Introduction to Data Science With R Notes
 
Dimensional data model
Dimensional data modelDimensional data model
Dimensional data model
 
ch02models.pptx
ch02models.pptxch02models.pptx
ch02models.pptx
 
ch02models.pptx
ch02models.pptxch02models.pptx
ch02models.pptx
 

Recently uploaded

MARY JANE WILSON, A “BOA MÃE” .
MARY JANE WILSON, A “BOA MÃE”           .MARY JANE WILSON, A “BOA MÃE”           .
MARY JANE WILSON, A “BOA MÃE” .
Colégio Santa Teresinha
 
The basics of sentences session 6pptx.pptx
The basics of sentences session 6pptx.pptxThe basics of sentences session 6pptx.pptx
The basics of sentences session 6pptx.pptx
heathfieldcps1
 
How to deliver Powerpoint Presentations.pptx
How to deliver Powerpoint  Presentations.pptxHow to deliver Powerpoint  Presentations.pptx
How to deliver Powerpoint Presentations.pptx
HajraNaeem15
 
BÀI TẬP BỔ TRỢ TIẾNG ANH 8 CẢ NĂM - GLOBAL SUCCESS - NĂM HỌC 2023-2024 (CÓ FI...
BÀI TẬP BỔ TRỢ TIẾNG ANH 8 CẢ NĂM - GLOBAL SUCCESS - NĂM HỌC 2023-2024 (CÓ FI...BÀI TẬP BỔ TRỢ TIẾNG ANH 8 CẢ NĂM - GLOBAL SUCCESS - NĂM HỌC 2023-2024 (CÓ FI...
BÀI TẬP BỔ TRỢ TIẾNG ANH 8 CẢ NĂM - GLOBAL SUCCESS - NĂM HỌC 2023-2024 (CÓ FI...
Nguyen Thanh Tu Collection
 
Pengantar Penggunaan Flutter - Dart programming language1.pptx
Pengantar Penggunaan Flutter - Dart programming language1.pptxPengantar Penggunaan Flutter - Dart programming language1.pptx
Pengantar Penggunaan Flutter - Dart programming language1.pptx
Fajar Baskoro
 
Film vocab for eal 3 students: Australia the movie
Film vocab for eal 3 students: Australia the movieFilm vocab for eal 3 students: Australia the movie
Film vocab for eal 3 students: Australia the movie
Nicholas Montgomery
 
Reimagining Your Library Space: How to Increase the Vibes in Your Library No ...
Reimagining Your Library Space: How to Increase the Vibes in Your Library No ...Reimagining Your Library Space: How to Increase the Vibes in Your Library No ...
Reimagining Your Library Space: How to Increase the Vibes in Your Library No ...
Diana Rendina
 
How to Create a More Engaging and Human Online Learning Experience
How to Create a More Engaging and Human Online Learning Experience How to Create a More Engaging and Human Online Learning Experience
How to Create a More Engaging and Human Online Learning Experience
Wahiba Chair Training & Consulting
 
How to Fix the Import Error in the Odoo 17
How to Fix the Import Error in the Odoo 17How to Fix the Import Error in the Odoo 17
How to Fix the Import Error in the Odoo 17
Celine George
 
RHEOLOGY Physical pharmaceutics-II notes for B.pharm 4th sem students
RHEOLOGY Physical pharmaceutics-II notes for B.pharm 4th sem studentsRHEOLOGY Physical pharmaceutics-II notes for B.pharm 4th sem students
RHEOLOGY Physical pharmaceutics-II notes for B.pharm 4th sem students
Himanshu Rai
 
clinical examination of hip joint (1).pdf
clinical examination of hip joint (1).pdfclinical examination of hip joint (1).pdf
clinical examination of hip joint (1).pdf
Priyankaranawat4
 
The Diamonds of 2023-2024 in the IGRA collection
The Diamonds of 2023-2024 in the IGRA collectionThe Diamonds of 2023-2024 in the IGRA collection
The Diamonds of 2023-2024 in the IGRA collection
Israel Genealogy Research Association
 
বাংলাদেশ অর্থনৈতিক সমীক্ষা (Economic Review) ২০২৪ UJS App.pdf
বাংলাদেশ অর্থনৈতিক সমীক্ষা (Economic Review) ২০২৪ UJS App.pdfবাংলাদেশ অর্থনৈতিক সমীক্ষা (Economic Review) ২০২৪ UJS App.pdf
বাংলাদেশ অর্থনৈতিক সমীক্ষা (Economic Review) ২০২৪ UJS App.pdf
eBook.com.bd (প্রয়োজনীয় বাংলা বই)
 
Your Skill Boost Masterclass: Strategies for Effective Upskilling
Your Skill Boost Masterclass: Strategies for Effective UpskillingYour Skill Boost Masterclass: Strategies for Effective Upskilling
Your Skill Boost Masterclass: Strategies for Effective Upskilling
Excellence Foundation for South Sudan
 
Chapter 4 - Islamic Financial Institutions in Malaysia.pptx
Chapter 4 - Islamic Financial Institutions in Malaysia.pptxChapter 4 - Islamic Financial Institutions in Malaysia.pptx
Chapter 4 - Islamic Financial Institutions in Malaysia.pptx
Mohd Adib Abd Muin, Senior Lecturer at Universiti Utara Malaysia
 
คำศัพท์ คำพื้นฐานการอ่าน ภาษาอังกฤษ ระดับชั้น ม.1
คำศัพท์ คำพื้นฐานการอ่าน ภาษาอังกฤษ ระดับชั้น ม.1คำศัพท์ คำพื้นฐานการอ่าน ภาษาอังกฤษ ระดับชั้น ม.1
คำศัพท์ คำพื้นฐานการอ่าน ภาษาอังกฤษ ระดับชั้น ม.1
สมใจ จันสุกสี
 
writing about opinions about Australia the movie
writing about opinions about Australia the moviewriting about opinions about Australia the movie
writing about opinions about Australia the movie
Nicholas Montgomery
 
What is Digital Literacy? A guest blog from Andy McLaughlin, University of Ab...
What is Digital Literacy? A guest blog from Andy McLaughlin, University of Ab...What is Digital Literacy? A guest blog from Andy McLaughlin, University of Ab...
What is Digital Literacy? A guest blog from Andy McLaughlin, University of Ab...
GeorgeMilliken2
 
spot a liar (Haiqa 146).pptx Technical writhing and presentation skills
spot a liar (Haiqa 146).pptx Technical writhing and presentation skillsspot a liar (Haiqa 146).pptx Technical writhing and presentation skills
spot a liar (Haiqa 146).pptx Technical writhing and presentation skills
haiqairshad
 
BÀI TẬP DẠY THÊM TIẾNG ANH LỚP 7 CẢ NĂM FRIENDS PLUS SÁCH CHÂN TRỜI SÁNG TẠO ...
BÀI TẬP DẠY THÊM TIẾNG ANH LỚP 7 CẢ NĂM FRIENDS PLUS SÁCH CHÂN TRỜI SÁNG TẠO ...BÀI TẬP DẠY THÊM TIẾNG ANH LỚP 7 CẢ NĂM FRIENDS PLUS SÁCH CHÂN TRỜI SÁNG TẠO ...
BÀI TẬP DẠY THÊM TIẾNG ANH LỚP 7 CẢ NĂM FRIENDS PLUS SÁCH CHÂN TRỜI SÁNG TẠO ...
Nguyen Thanh Tu Collection
 

Recently uploaded (20)

MARY JANE WILSON, A “BOA MÃE” .
MARY JANE WILSON, A “BOA MÃE”           .MARY JANE WILSON, A “BOA MÃE”           .
MARY JANE WILSON, A “BOA MÃE” .
 
The basics of sentences session 6pptx.pptx
The basics of sentences session 6pptx.pptxThe basics of sentences session 6pptx.pptx
The basics of sentences session 6pptx.pptx
 
How to deliver Powerpoint Presentations.pptx
How to deliver Powerpoint  Presentations.pptxHow to deliver Powerpoint  Presentations.pptx
How to deliver Powerpoint Presentations.pptx
 
BÀI TẬP BỔ TRỢ TIẾNG ANH 8 CẢ NĂM - GLOBAL SUCCESS - NĂM HỌC 2023-2024 (CÓ FI...
BÀI TẬP BỔ TRỢ TIẾNG ANH 8 CẢ NĂM - GLOBAL SUCCESS - NĂM HỌC 2023-2024 (CÓ FI...BÀI TẬP BỔ TRỢ TIẾNG ANH 8 CẢ NĂM - GLOBAL SUCCESS - NĂM HỌC 2023-2024 (CÓ FI...
BÀI TẬP BỔ TRỢ TIẾNG ANH 8 CẢ NĂM - GLOBAL SUCCESS - NĂM HỌC 2023-2024 (CÓ FI...
 
Pengantar Penggunaan Flutter - Dart programming language1.pptx
Pengantar Penggunaan Flutter - Dart programming language1.pptxPengantar Penggunaan Flutter - Dart programming language1.pptx
Pengantar Penggunaan Flutter - Dart programming language1.pptx
 
Film vocab for eal 3 students: Australia the movie
Film vocab for eal 3 students: Australia the movieFilm vocab for eal 3 students: Australia the movie
Film vocab for eal 3 students: Australia the movie
 
Reimagining Your Library Space: How to Increase the Vibes in Your Library No ...
Reimagining Your Library Space: How to Increase the Vibes in Your Library No ...Reimagining Your Library Space: How to Increase the Vibes in Your Library No ...
Reimagining Your Library Space: How to Increase the Vibes in Your Library No ...
 
How to Create a More Engaging and Human Online Learning Experience
How to Create a More Engaging and Human Online Learning Experience How to Create a More Engaging and Human Online Learning Experience
How to Create a More Engaging and Human Online Learning Experience
 
How to Fix the Import Error in the Odoo 17
How to Fix the Import Error in the Odoo 17How to Fix the Import Error in the Odoo 17
How to Fix the Import Error in the Odoo 17
 
RHEOLOGY Physical pharmaceutics-II notes for B.pharm 4th sem students
RHEOLOGY Physical pharmaceutics-II notes for B.pharm 4th sem studentsRHEOLOGY Physical pharmaceutics-II notes for B.pharm 4th sem students
RHEOLOGY Physical pharmaceutics-II notes for B.pharm 4th sem students
 
clinical examination of hip joint (1).pdf
clinical examination of hip joint (1).pdfclinical examination of hip joint (1).pdf
clinical examination of hip joint (1).pdf
 
The Diamonds of 2023-2024 in the IGRA collection
The Diamonds of 2023-2024 in the IGRA collectionThe Diamonds of 2023-2024 in the IGRA collection
The Diamonds of 2023-2024 in the IGRA collection
 

Warehousing

  • 1. ANHAI DOAN ALON HALEVY ZACHARY IVES CHAPTER 10: DATA WAREHOUSING & CACHING PRINCIPLES OF DATA INTEGRATION
  • 2. Data Warehousing and Materialization  We have mostly focused on techniques for virtual data integration (see Ch. 1)  Queries are composed with mappings on the fly and data is fetched on demand  This represents one extreme point  In this chapter, we consider cases where data is transformed and materialized “in advance” of the queries  The main scenario: the data warehouse
  • 3. What Is a Data Warehouse?  In many organizations, we want a central “store” of all of our entities, concepts, metadata, and historical information  For doing data validation, complex mining, analysis, prediction, …  This is the data warehouse  To this point we’ve focused on scenarios where the data “lives” in the sources – here we may have a “master” version (and archival version) in a central database  For performance reasons, availability reasons, archival reasons, …
  • 4. In the Rest of this Chapter…  The data warehouse as the master data instance  Data warehouse architectures, design, loading  Data exchange: declarative data warehousing  Hybrid models: caching and partial materialization  Querying externally archived data
  • 5. Outline  The data warehouse  Motivation: Master data management  Physical design  Extract/transform/load  Data exchange  Caching & partial materialization  Operating on external data
  • 6. Master Data Management  One of the “modern” uses of the data warehouse is not only to support analytics but to serve as a reference to all of the entities in the organization  A cleaned, validated repository of what we know … which can be linked to by data sources … which may help with data cleaning … and which may be the basis of data governance (processes by which data is created and modified in a systematic way, e.g., to comply with gov’t regulations)  There is an emerging field called master data management about the process of creating these
  • 7. Data Warehouse Architecture  At the top – a centralized database  Generally configured for queries and appends – not transactions  Many indices, materialized views, etc.  Data is loaded and periodically updated via Extract/Transform/Load (ETL) tools [Figure: sources RDBMS1, RDBMS2, HTML1, and XML1 feed ETL pipelines, whose outputs are loaded into the Data Warehouse]
  • 8. ETL Tools  ETL tools are the equivalent of schema mappings in virtual integration, but are more powerful  Arbitrary pieces of code to take data from a source, convert it into data for the warehouse:  import filters – read and convert from data sources  data transformations – join, aggregate, filter, convert data  de-duplication – finds multiple records referring to the same entity, merges them  profiling – builds tables, histograms, etc. to summarize data  quality management – test against master values, known
  • 9. Example ETL Tool Chain  This is an example for e-commerce loading  Note multiple stages of filtering (using selection or join-like operations), logging bad records, before we group and load [Figure: invoice line items pass through split date-time, filter invalid, join with item records, filter invalid, filter non-match against customer records, and group by customer to produce the customer balance; rejected rows are logged as invalid dates/times, invalid items, and invalid customers]
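As a rough illustration of the tool chain on slide 9, the following Python sketch validates, joins, and groups invoice line items while logging rejected records. The field names (`timestamp`, `item_id`, `customer_id`, `quantity`, `price`), the timestamp format, and the function name `run_pipeline` are assumptions made for this example and do not come from the slides; the balance is assumed to be the sum of price times quantity.

```python
from collections import defaultdict
from datetime import datetime


def run_pipeline(line_items, item_records, customer_records):
    """Filter, join, and group invoice line items.

    Returns (balances, rejects): balances maps customer_id to a total,
    rejects maps a reason to the list of rows logged for that reason.
    """
    rejects = {"invalid_datetime": [], "invalid_item": [], "invalid_customer": []}
    balances = defaultdict(float)

    for row in line_items:
        # Stage 1: parse the combined date-time; unparseable rows are logged.
        try:
            datetime.strptime(row["timestamp"], "%Y-%m-%d %H:%M:%S")
        except ValueError:
            rejects["invalid_datetime"].append(row)
            continue

        # Stage 2: join with item records; unknown items are logged.
        item = item_records.get(row["item_id"])
        if item is None:
            rejects["invalid_item"].append(row)
            continue

        # Stage 3: match against customer records; non-matches are logged.
        if row["customer_id"] not in customer_records:
            rejects["invalid_customer"].append(row)
            continue

        # Stage 4: group by customer and accumulate the balance.
        balances[row["customer_id"]] += item["price"] * row["quantity"]

    return dict(balances), rejects
```

Keeping a separate reject list per stage mirrors the "logging bad records" step: nothing is silently dropped, so rejected rows can be inspected and reloaded later.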
  • 10. Basic Data Warehouse – Summary  Two aspects:  A central DBMS optimized for appends and querying  The “master data” instance  Or the instance for doing mining, analytics, and prediction  A set of procedural ETL “pipelines” to fetch, transform, filter, clean, and load data  Often these tools are more expressive than standard conjunctive queries (as in Chapters 2-3)  … But not always! This raises a question – can we do warehousing with declarative mappings?
  • 11. Outline The data warehouse  Data exchange  Caching & partial materialization  Operating on external data
  • 12. Data Exchange  Intuitively, a declarative setup for data warehousing  Declarative schema mappings as in Ch. 2-3  Materialized database as in the previous section  Also allow for unknown values when we map from source to target (warehouse) instance  If we know a professor teaches a student, then there must exist a course C that the student took and the professor taught – but we may not know which…
  • 13. Data Exchange Formulation A data exchange setting (S,T,M,CT) has:  S, source schema representing all of the source tables jointly  T, target schema  M, a set of mappings or tuple-generating dependencies relating S and T: (∀X) s1(X1), ..., sm(Xm) → (∃Y) t1(Y1), ..., tk(Yk)  CT, a set of constraints (equality-generating dependencies) on T: (∀Y) t1(Y1), ..., tl(Yl) → (Yi = Yj)
  • 14. An Example Source S has Teaches(prof, student) and Adviser(adviser, student). Target T has Advise(adviser, student), TeachesCourse(prof, course), and Takes(course, student). Mappings: r1: Teaches(prof, stud) → ∃D. Advise(D, stud); r2: Teaches(prof, stud) → ∃C. TeachesCourse(prof, C), Takes(C, stud); r3: Adviser(prof, stud) → Advise(prof, stud); r4: Adviser(prof, stud) → ∃C, D. TeachesCourse(D, C), Takes(C, stud). Existential variables represent unknowns.
  • 15. The Data Exchange Solution  The goal of data exchange is to compute an instance of the target schema, given a data exchange setting D = (S,T,M,CT) and an instance I(S)  An instance J of Schema T is a data exchange solution for D and I if 1. the pair (I,J) satisfies schema mapping M, and 2. J satisfies constraints CT
  • 16. Back to the Example, Now with Data Instance I(S) has Teaches(prof, student) = {(Ann, Bob), (Chloe, David)} and Adviser(adviser, student) = {(Ellen, Bob), (Felicia, David)}. Instance J(T) has Advise(adviser, student) = {(Ellen, Bob), (Felicia, David)}, TeachesCourse(prof, course) = {(Ann, C1), (Chloe, C2)}, and Takes(course, student) = {(C1, Bob), (C2, David)}. Variables or labeled nulls (here C1 and C2) represent unknown values.
  • 17. This Is also a Solution Instance I(S) has Teaches(prof, student) = {(Ann, Bob), (Chloe, David)} and Adviser(adviser, student) = {(Ellen, Bob), (Felicia, David)}. Instance J(T) has Advise(adviser, student) = {(Ellen, Bob), (Felicia, David)}, TeachesCourse(prof, course) = {(Ann, C1), (Chloe, C1)}, and Takes(course, student) = {(C1, Bob), (C1, David)}. This time the labeled nulls are all the same!
  • 18. Universal Solutions  Intuitively, the first solution should be better than the second  The second solution uses the same variable for the course taught by Ann and by Chloe – it asserts that they are the same course  But this was not specified in the original schema!  We formalize that through the notion of the universal solution, which must not lose any information
  • 19. Formalizing the Universal Solution First we define instance homomorphism:  Let J1, J2 be two instances of schema T  A mapping h: J1 → J2 is a homomorphism from J1 to J2 if  h(c) = c for every constant c ∈ C,  for every tuple R(a1,…,an) ∈ J1 the tuple R(h(a1),…,h(an)) ∈ J2  J1, J2 are homomorphically equivalent if there are homomorphisms h: J1 → J2 and h’: J2 → J1 Def: Universal solution for data exchange setting D = (S,T,M,CT), where I is an instance of S. A data exchange solution J for D and I is a universal solution if, for every other data exchange solution J’ for D and I, there exists a homomorphism h: J → J’
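To make the homomorphism definition concrete, here is a small brute-force search in Python, applied to the two solutions from slides 16 and 17. The encoding (facts as `(relation, args)` tuples, labeled nulls given as a set of names) and the function name `find_homomorphism` are assumptions of this sketch; it is meant for tiny instances only, since the search is exponential in the worst case.

```python
def find_homomorphism(j1, j2, nulls):
    """Return a dict mapping labeled nulls of j1 to values of j2 such that
    every fact of j1 lands on a fact of j2, or None if no such map exists.
    Constants (anything not in `nulls`) must map to themselves."""
    facts = sorted(j1)

    def extend(index, h):
        if index == len(facts):
            return h
        rel, args = facts[index]
        for rel2, args2 in j2:
            if rel2 != rel or len(args2) != len(args):
                continue
            candidate = dict(h)
            consistent = True
            for a, b in zip(args, args2):
                if a in nulls:
                    # A null may map anywhere, but always to the same value.
                    if candidate.setdefault(a, b) != b:
                        consistent = False
                        break
                elif a != b:
                    consistent = False
                    break
            if consistent:
                result = extend(index + 1, candidate)
                if result is not None:
                    return result
        return None

    return extend(0, {})


ADVISE = {("Advise", ("Ellen", "Bob")), ("Advise", ("Felicia", "David"))}

# Slide 16: distinct nulls C1 and C2.
FIRST = ADVISE | {
    ("TeachesCourse", ("Ann", "C1")), ("TeachesCourse", ("Chloe", "C2")),
    ("Takes", ("C1", "Bob")), ("Takes", ("C2", "David")),
}

# Slide 17: a single null C1 used everywhere.
SECOND = ADVISE | {
    ("TeachesCourse", ("Ann", "C1")), ("TeachesCourse", ("Chloe", "C1")),
    ("Takes", ("C1", "Bob")), ("Takes", ("C1", "David")),
}

forward = find_homomorphism(FIRST, SECOND, {"C1", "C2"})
backward = find_homomorphism(SECOND, FIRST, {"C1"})
```

Here `forward` maps both C1 and C2 to C1, while `backward` is `None`: the first solution can be mapped into the second but not the other way round, which is why the second solution cannot be universal.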
  • 20. Computing Universal Solutions  The standard process is to use a procedure called the chase  Informally:  Consider every formula r of M in turn:  If there is a variable substitution for the left-hand side (lhs) of r where the right-hand side (rhs) is not in the solution – add it  If we create a new tuple, for every existential variable in the rhs, substitute a new fresh variable  See Chapter 10 Algorithm 10 for full pseudocode
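The informal description of the chase can be sketched in Python as follows. This is a simplified version written for this example, not the book's pseudocode: it assumes source-to-target mappings only (left-hand sides over the source, right-hand sides over the target), ignores the target constraints CT, and uses a made-up encoding in which variables start with `?` and fresh labeled nulls are named `N1`, `N2`, and so on.

```python
from itertools import count


def is_var(term):
    return term.startswith("?")


def unify(args, values, binding):
    """Extend `binding` so that `args` maps onto `values`, or return None."""
    if len(args) != len(values):
        return None
    new = dict(binding)
    for term, value in zip(args, values):
        if is_var(term):
            if new.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return new


def match(atoms, facts, binding):
    """Yield every extension of `binding` that maps all `atoms` into `facts`."""
    if not atoms:
        yield binding
        return
    (rel, args), rest = atoms[0], atoms[1:]
    for fact_rel, values in facts:
        if fact_rel == rel:
            new = unify(args, values, binding)
            if new is not None:
                yield from match(rest, facts, new)


def chase(source, tgds):
    """Apply each mapping once per lhs substitution whose rhs is not yet satisfied."""
    target = set()
    fresh = count(1)
    for lhs, rhs in tgds:
        for binding in match(lhs, source, {}):
            if next(match(rhs, target, binding), None) is not None:
                continue  # the rhs already holds for this substitution
            full = dict(binding)
            for _, args in rhs:
                for term in args:
                    if is_var(term) and term not in full:
                        full[term] = f"N{next(fresh)}"  # fresh labeled null
            for rel, args in rhs:
                target.add((rel, tuple(full.get(t, t) for t in args)))
    return target


SOURCE = [
    ("Teaches", ("Ann", "Bob")), ("Teaches", ("Chloe", "David")),
    ("Adviser", ("Ellen", "Bob")), ("Adviser", ("Felicia", "David")),
]

# Mappings r1 to r4 from the example on slide 14.
TGDS = [
    ([("Teaches", ("?prof", "?stud"))], [("Advise", ("?D", "?stud"))]),
    ([("Teaches", ("?prof", "?stud"))],
     [("TeachesCourse", ("?prof", "?C")), ("Takes", ("?C", "?stud"))]),
    ([("Adviser", ("?prof", "?stud"))], [("Advise", ("?prof", "?stud"))]),
    ([("Adviser", ("?prof", "?stud"))],
     [("TeachesCourse", ("?D", "?C")), ("Takes", ("?C", "?stud"))]),
]

solution = chase(SOURCE, TGDS)
```

With the mappings applied in this order, r4 adds nothing because r2 has already produced matching TeachesCourse and Takes tuples. The result still contains tuples such as Advise(N1, Bob) next to Advise(Ellen, Bob), so it is larger than the solution shown on slide 16.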
  • 21. Core Universal Solutions  Universal solutions may be of arbitrary size  The core universal solution is the minimal universal solution
  • 22. Data Exchange and Querying  As with the data warehouse, all queries are directly posed over the target database – no reformulation necessary  However, we typically assume certain answers semantics  To get the certain answers (which are the same as in the virtual integration setting with GLAV/TGD mappings) – compute the query answers and then drop any tuples with labeled nulls (variables)
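The "compute the answers, then drop tuples with labeled nulls" step can be illustrated with a short Python sketch over the target instance from slide 16. The `Null` class, the helper `certain_answers`, and the two example queries are assumptions introduced here for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Null:
    """A labeled null standing for an unknown value."""
    label: str


def certain_answers(answers):
    """Keep only the answer tuples that contain no labeled nulls."""
    return {row for row in answers if not any(isinstance(v, Null) for v in row)}


C1, C2 = Null("C1"), Null("C2")
TEACHES_COURSE = {("Ann", C1), ("Chloe", C2)}
TAKES = {(C1, "Bob"), (C2, "David")}


def who_teaches_whom():
    """Join TeachesCourse and Takes on the course; project (prof, student)."""
    return {
        (prof, student)
        for prof, course in TEACHES_COURSE
        for taken, student in TAKES
        if course == taken
    }


def courses_of(prof_name):
    """Select (prof, course) pairs for one professor."""
    return {(prof, course) for prof, course in TEACHES_COURSE if prof == prof_name}
```

The join query returns (Ann, Bob) and (Chloe, David) as certain answers, because the unknown course is projected away. The query for Ann's courses has no certain answers: its only result tuple contains the null C1.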
  • 23. Data Exchange vs. Warehousing  From an external perspective, exchange and warehousing are essentially equivalent  But there are different trade-offs in procedural vs. declarative mappings  Procedural – more expressive  Declarative – easier to reason about, compose, invert, create materialized views for, etc. (see Chapter 6)
  • 24. Outline The data warehouse Data exchange  Caching & partial materialization  Operating on external data
  • 25. The Spectrum of Materialization  Many real EII systems compute and maintain materialized views, or cache results  A “hybrid” point between the fully virtual and fully materialized approaches [Figure: a spectrum from virtual integration (EII), with only the sources materialized, to data exchange / data warehouse, with all mediated relations materialized; caching or partial materialization, with some views materialized, lies in between]
  • 26. Possible Techniques for Choosing What to Materialize Cache results of prior queries  Take the results of each query, materialize them  Use answering queries using views to reuse  Expire using time-to-live… May not always be fresh! Administrator-selected views  Someone manually specifies views to compute and maintain, as with a relational DBMS  System automatically maintains Automatic view selection  Using query workload, update frequencies – a view materialization wizard chooses what to materialize
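The first technique, caching prior query results with a time-to-live, might look roughly like this in Python. The class name `TTLQueryCache` and the `run_query` callback are hypothetical. The sketch reuses a result only when the query text matches exactly, so it does not attempt the "answering queries using views" step mentioned above.

```python
import time


class TTLQueryCache:
    """Cache query results and expire them after a fixed time-to-live."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock          # injectable so tests can control time
        self._entries = {}           # query text -> (time stored, result)

    def get(self, query, run_query):
        """Return a cached result if still live, else run and cache the query."""
        now = self._clock()
        entry = self._entries.get(query)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]          # may be stale relative to the sources
        result = run_query(query)
        self._entries[query] = (now, result)
        return result
```

The TTL bounds how stale an answer can be but does not guarantee freshness: a source update that happens inside the window stays invisible until the entry expires.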
  • 27. Outline The data warehouse Data exchange Caching & partial materialization  Operating on external data
  • 28. Many “Integration-Like” Scenarios over Historical Data  Many Web scenarios where we have large logs of data accesses, created by the server  Goal: put these together and query them!  Looks like a very simple data integration scenario – external data, but single schema  A common approach: use programming environments like MapReduce (or SQL layers above) to query the data on a cluster  MapReduce reliably runs large jobs across 100s or 1000s of “shared nothing” nodes in a cluster
  • 29. MapReduce Basics  MapReduce is essentially a template for writing distributed programs – corresponding to a single SQL SELECT..FROM..WHERE..GROUP BY..HAVING block with user-defined functions  The MapReduce runtime calls a set of functions:  map is given a tuple, outputs 0 or more tuples in response  roughly like the WHERE clause  shuffle is a stage for doing sort-based grouping on a key (specified by the map)  reduce is an aggregate function called over the set of tuples with the same grouping key
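The three stages can be imitated on a single machine with a few lines of Python. This is only a sketch of the programming model, not of a distributed runtime; the log format `(url, status, bytes)` and the function names are assumptions made for the example.

```python
from itertools import groupby
from operator import itemgetter


def map_reduce(records, map_fn, reduce_fn):
    """Run map, a sort-based shuffle, and reduce over an in-memory list."""
    # Map: each record yields zero or more (key, value) pairs.
    pairs = [pair for record in records for pair in map_fn(record)]
    # Shuffle: sort by key so that equal keys become adjacent.
    pairs.sort(key=itemgetter(0))
    # Reduce: call the aggregate once per group of equal keys.
    return {
        key: reduce_fn(key, [value for _, value in group])
        for key, group in groupby(pairs, key=itemgetter(0))
    }


def map_ok_bytes(record):
    """Acts like a WHERE clause: keep successful requests, emit (url, bytes)."""
    url, status, size = record
    if status == 200:
        yield url, size


def reduce_sum(key, values):
    """Acts like SUM(bytes) for one GROUP BY url group."""
    return sum(values)


LOG = [
    ("/home", 200, 512),
    ("/home", 200, 256),
    ("/login", 500, 0),
    ("/login", 200, 128),
]

bytes_per_url = map_reduce(LOG, map_ok_bytes, reduce_sum)
```

This corresponds to a query of the form SELECT url, SUM(bytes) FROM log WHERE status = 200 GROUP BY url, with the filter in the map function and the aggregate in the reduce function.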
  • 30. MapReduce Dataflow “Template”: Tuples → Map “worker” → Shuffle → Reduce “worker” [Figure: map workers emit tuples, which are shuffled to reduce workers that emit aggregate results]
  • 31. MapReduce as ETL  Some people use MapReduce to take data, transform it, and load it into a warehouse  … which is basically what ETL tools do!  The dividing line between DBMSs, EII, and MapReduce is blurring as of the writing of this book  SQL  MapReduce  MapReduce over SQL engines  Shared-nothing DBMSs  NoSQL
  • 32. Warehousing & Materialization Wrap-up  There are benefits to centralizing & materializing data  Performance, especially for analytics / mining  Archival  Standardization / canonicalization  Data warehouses typically use procedural ETL tools to extract, transform, load (and clean) data  Data exchange replaces ETL with declarative mappings (where feasible)  Hybrid schemes exist for partial materialization  Increasingly we are integrating via MapReduce and its cousins