SlideShare a Scribd company logo
+919892900103 |
info@vibranttechnologies.co.in
| www.
Vibranttechnologies.co.in
Contact us on :
∗ We have mostly focused on techniques for virtual
data integration (see Ch. 1)
∗ Queries are composed with mappings on the fly and
data is fetched on demand
∗ This represents one extreme point
∗ In this chapter, we consider cases where data is
transformed and materialized “in advance” of the
queries
∗ The main scenario: the data warehouse
Data Warehousing and Materialization
∗ In many organizations, we want a central “store” of all of
our entities, concepts, metadata, and historical
information
∗ For doing data validation, complex mining, analysis,
prediction, …
∗ This is the data warehouse
∗ To this point we’ve focused on scenarios where the data
“lives” in the sources – here we may have a “master”
version (and archival version) in a central database
∗ For performance reasons, availability reasons, archival
reasons, …
What Is a Data Warehouse?
∗ The data warehouse as the master data instance
∗ Data warehouse architectures, design, loading
∗ Data exchange: declarative data warehousing
∗ Hybrid models: caching and partial materialization
∗ Querying externally archived data
In the Rest of this Chapter…
The data warehouse
∗ Motivation: Master data management
∗ Physical design
∗ Extract/transform/load
∗ Data exchange
∗ Caching & partial materialization
∗ Operating on external data
Outline
∗ One of the “modern” uses of the data warehouse is not
only to support analytics but to serve as a reference to all
of the entities in the organization
∗ A cleaned, validated repository of what we know
… which can be linked to by data sources
… which may help with data cleaning
… and which may be the basis of data governance (processes
by which data is created and modified in a systematic way,
e.g., to comply with gov’t regulations)
∗ There is an emerging field called master data management
out the process of creating these
Master Data Management
Data Warehouse Architecture
∗ At the top – a centralized
database
∗ Generally configured for
queries and appends – not
transactions
∗ Many indices, materialized
views, etc.
∗ Data is loaded and
periodically updated via
Extract/Transform/Load
(ETL) tools
Data Warehouse
ETL ETL ETL ETL
RDBMS1 RDBMS2
HTML1 XML1
ETL pipeline
outputs
ETL
∗ ETL tools are the equivalent of schema mappings in virtual
integration, but are more powerful
∗ Arbitrary pieces of code to take data from a source,
convert it into data for the warehouse:
∗ import filters – read and convert from data sources
∗ data transformations – join, aggregate, filter, convert data
∗ de-duplication – finds multiple records referring to the same
entity, merges them
∗ profiling – builds tables, histograms, etc. to summarize data
∗ quality management – test against master values, known
business rules, constraints, etc.
ETL Tools
Example ETL Tool Chain
∗ This is an example for e-commerce loading
∗ Note multiple stages of filtering (using selection or join-like
operations), logging bad records, before we group and load
Invoice
line items
Split
Date-
time
Filter
invalid
Join
Filter
invalid
Invalid
dates/times
Invalid
items
Item
records
Filter
non -
match
Invalid
customers
Group by
customer
Customer
balance
Customer
records
∗ Two aspects:
∗ A central DBMS optimized for appends and querying
∗ The “master data” instance
∗ Or the instance for doing mining, analytics, and prediction
∗ A set of procedural ETL “pipelines” to fetch, transform, filter,
clean, and load data
∗ Often these tools are more expressive than standard
conjunctive queries (as in Chapters 2-3)
∗ … But not always!
This raises a question – can we do warehousing with declarative
mappings?
Basic Data Warehouse – Summary
The data warehouse
Data exchange
∗ Caching & partial materialization
∗ Operating on external data
Outline
∗ Intuitively, a declarative setup for data warehousing
∗ Declarative schema mappings as in Ch. 2-3
∗ Materialized database as in the previous section
∗ Also allow for unknown values when we map from
source to target (warehouse) instance
∗ If we know a professor teaches a student, then there
must exist a course C that the student took and the
professor taught – but we may not know which…
Data Exchange
A data exchange setting (S,T,M,CT) has:
∗ S, source schema representing all of the source tables
jointly
∗ T, target schema
∗ A set of mappings or tuple-generating dependencies
relating S and T
∗ A set of constraints (equality-generating dependencies)
Data Exchange Formulation
(∀X)s1(X1), ..., sm (Xm ) → (∃Y) t1(Y1), ..., tk (Yk )
)Y(Y)Y(t...,,)Y()tY( jill11 =→∃
An Example
Source S has
Teaches(prof, student)
Adviser(adviser, student)
Target T has
Advise(adviser, student)
TeachesCourse(prof, course)
Takes(course, student)
),(),,(.,),(:
),(),(:
),(),,(.),(:
),(.),(:
4
3
2
1
studCTakesCDrseTeachesCouDCstudprofAdviserr
studprofAdvisestudprofAdviserr
studCTakesCprofrseTeachesCouCstudprofTeachesr
studDAdviseDstudprofTeachesr
∃→
→
∃→
∃→
existential variables represent unknowns
∗ The goal of data exchange is to compute an instance
of the target schema, given a data exchange setting D
= (S,T,M,CT) and an instance I(S)
∗ An instance J of Schema T is a data exchange solution
for D and I if
1. the pair (I,J) satisfies schema mapping M, and
2. J satisfies constraints CT
The Data Exchange Solution
Instance I(S) has
Teaches
Adviser
prof student
Ann Bob
Chloe David
Back to the Example, Now with Data
adviser student
Ellen Bob
Felicia David
adviser student
Ellen Bob
Felicia David
course student
C1 Bob
C2 David
prof course
Ann C1
Chloe C2
Instance J(T) has
Advise
TeachesCourse
Takes
variables or labeled nulls
represent unknown values
Instance I(S) has
Teaches
Adviser
prof student
Ann Bob
Chloe David
This Is also a Solution
adviser student
Ellen Bob
Felicia David
adviser student
Ellen Bob
Felicia David
course student
C1 Bob
C1 David
prof course
Ann C1
Chloe C1
Instance J(T) has
Advise
TeachesCourse
Takes
this time the labeled
nulls are all the same!
∗ Intuitively, the first solution should be better than the
second
∗ The first solution uses the same variable for the course
taught by Ann and by Chloe – they are the same course
∗ But this was not specified in the original schema!
∗ We formalize that through the notion of the universal
solution, which must not lose any information
Universal Solutions
First we define instance homomorphism:
∗ Let J1, J2 be two instances of schema T
∗ A mapping h: J1  J2 is a homomorphism from J1 to J2 if
∗ h(c) = c for every c ∈ C,
∗ for every tuple R(a1,…,an) ∈ J1 the tuple R(h(a1),…,h(an)) ∈ J2
∗ J1, J2 are homomorphically equivalent if there are homomorphisms
h: J1  J2 and h’: J2  J1
Def: Universal solution for data exchange setting
D = (S,T,M,CT), where I is an instance of S.
A data exchange solution J for D and I is a universal solution if, for
every other data exchange solution J’ for D and I, there exists a
homomorphism h: J  J’
Formalizing the Universal Solution
∗ The standard process is to use a procedure called the chase
∗ Informally:
∗ Consider every formula r of M in turn:
∗ If there is a variable substitution for the left-hand side (lhs) of r
where the right-hand side (rhs) is not in the solution – add it
∗ If we create a new tuple, for every existential variable in the
rhs, substitute a new fresh variable
∗ See Chapter 10 Algorithm 10 for full pseudocode
Computing Universal Solutions
∗ Universal solutions may be of arbitrary size
∗ The core universal solution is the minimal universal
solution
Core Universal Solutions
∗ As with the data warehouse, all queries are directly posed
over the target database – no reformulation necessary
∗ However, we typically assume certain answers semantics
∗ To get the certain answers (which are the same as in the
virtual integration setting with GLAV/TGD mappings) –
compute the query answers and then drop any tuples with
labeled nulls (variables)
Data Exchange and Querying
∗ From an external perspective, exchange and
warehousing are essentially equivalent
∗ But there are different trade-offs in procedural vs.
declarative mappings
∗ Procedural – more expressive
∗ Declarative – easier to reason about, compose, invert,
create matieralized views for, etc. (see Chapter 6)
Data Exchange vs. Warehousing
The data warehouse
Data exchange
Caching & partial materialization
∗ Operating on external data
Outline
∗ Many real EII systems compute and maintain materialized views,
or cache results
A “hybrid” point between the fully virtual and fully materialized
approaches
The Spectrum of Materialization
Virtual integration
(EII)
Data exchange /
data warehouse
sources materialized all mediated relations
materialized
caching or partial materialization –
some views materialized
Cache results of prior queries
∗ Take the results of each query, materialize them
∗ Use answering queries using views to reuse
∗ Expire using time-to-live… May not always be fresh!
Administrator-selected views
∗ Someone manually specifies views to compute and maintain,
as with a relational DBMS
∗ System automatically maintains
Automatic view selection
∗ Using query workload, update frequencies – a view
materialization wizard chooses what to materialize
Possible Techniques for Choosing
What to Materialize
The data warehouse
Data exchange
Caching & partial materialization
Operating on external data
Outline
∗ Many Web scenarios where we have large logs of data
accesses, created by the server
∗ Goal: put these together and query them!
∗ Looks like a very simple data integration scenario –
external data, but single schema
∗ A common approach: use programming environments like
MapReduce (or SQL layers above) to query the data on a
cluster
∗ MapReduce reliably runs large jobs across 100s or 1000s of
“shared nothing” nodes in a cluster
Many “Integration-Like” Scenarios
over Historical Data
∗ MapReduce is essentially a template for writing distributed
programs – corresponding to a single SQL
SELECT..FROM..WHERE..GROUP BY..HAVING block with
user-defined functions
∗ The MapReduce runtime calls a set of functions:
∗ map is given a tuple, outputs 0 or more tuples in response
∗ roughly like the WHERE clause
∗ shuffle is a stage for doing sort-based grouping on a key
(specified by the map)
∗ reduce is an aggregate function called over the set of tuples
with the same grouping key
MapReduce Basics
31
MapReduce Dataflow “Template”: Tuples  Map
“worker”  Shuffle  Reduce “worker”
Map
Worker
Map
Worker
Map
Worker
Map
Worker
Reduce
Worker
Reduce
Worker
Reduce
Worker
Reduce
Worker
Reduce
Worker
emit tuples
emit aggregate
results
∗ Some people use MapReduce to take data, transform it, and
load it into a warehouse
∗ … which is basically what ETL tools do!
∗ The dividing line between DBMSs, EII, MapReduce is blurring as
of the development of this book
∗ SQL  MapReduce
∗ MapReduce over SQL engines
∗ Shared-nothing DBMSs
∗ NoSQL
MapReduce as ETL
∗ There are benefits to centralizing & materializing data
∗ Performance, especially for analytics / mining
∗ Archival
∗ Standardization / canonicalization
∗ Data warehouses typically use procedural ETL tools to
extract, transform, load (and clean) data
∗ Data exchange replaces ETL with declarative mappings
(where feasible)
∗ Hybrid schemes exist for partial materialization
∗ Increasingly we are integrating via MapReduce and its
cousins
Warehousing & Materialization Wrap-
up
Thank you

More Related Content

What's hot

1. Fundamental Concept - Data Structures using C++ by Varsha Patil
1. Fundamental Concept - Data Structures using C++ by Varsha Patil1. Fundamental Concept - Data Structures using C++ by Varsha Patil
1. Fundamental Concept - Data Structures using C++ by Varsha Patil
widespreadpromotion
 
2. Linear Data Structure Using Arrays - Data Structures using C++ by Varsha P...
2. Linear Data Structure Using Arrays - Data Structures using C++ by Varsha P...2. Linear Data Structure Using Arrays - Data Structures using C++ by Varsha P...
2. Linear Data Structure Using Arrays - Data Structures using C++ by Varsha P...
widespreadpromotion
 
Data structure
Data structureData structure
Data structure
Prof. Dr. K. Adisesha
 
Data structure-question-bank
Data structure-question-bankData structure-question-bank
Data structure-question-bank
Jagan Mohan Bishoyi
 
1 list datastructures
1 list datastructures1 list datastructures
1 list datastructures
Nguync91368
 
Introduction to data structure and algorithms
Introduction to data structure and algorithmsIntroduction to data structure and algorithms
Introduction to data structure and algorithms
Research Scholar in Manonmaniam Sundaranar University
 
Abstract data types
Abstract data typesAbstract data types
Abstract data types
Poojith Chowdhary
 
Data structure lecture 1
Data structure lecture 1Data structure lecture 1
Data structure lecture 1Kumar
 
Introduction to database-Formal Query language and Relational calculus
Introduction to database-Formal Query language and Relational calculusIntroduction to database-Formal Query language and Relational calculus
Introduction to database-Formal Query language and Relational calculus
Ajit Nayak
 
2 b queues
2 b queues2 b queues
2 b queues
Nguync91368
 
Database Programming using SQL
Database Programming using SQLDatabase Programming using SQL
Database Programming using SQL
Ajit Nayak
 
7. Tree - Data Structures using C++ by Varsha Patil
7. Tree - Data Structures using C++ by Varsha Patil7. Tree - Data Structures using C++ by Varsha Patil
7. Tree - Data Structures using C++ by Varsha Patil
widespreadpromotion
 
Computer Science-Data Structures :Abstract DataType (ADT)
Computer Science-Data Structures :Abstract DataType (ADT)Computer Science-Data Structures :Abstract DataType (ADT)
Computer Science-Data Structures :Abstract DataType (ADT)
St Mary's College,Thrissur,Kerala
 
IJERD (www.ijerd.com) International Journal of Engineering Research and Devel...
IJERD (www.ijerd.com) International Journal of Engineering Research and Devel...IJERD (www.ijerd.com) International Journal of Engineering Research and Devel...
IJERD (www.ijerd.com) International Journal of Engineering Research and Devel...IJERD Editor
 
2 a stacks
2 a stacks2 a stacks
2 a stacks
Nguync91368
 
Abstract Data Types
Abstract Data TypesAbstract Data Types
Abstract Data Types
karthikeyanC40
 
Bsc cs ii dfs u-1 introduction to data structure
Bsc cs ii dfs u-1 introduction to data structureBsc cs ii dfs u-1 introduction to data structure
Bsc cs ii dfs u-1 introduction to data structure
Rai University
 
Python for beginners
Python for beginnersPython for beginners
Python for beginners
Ali Huseyn Aliyev
 
Data structure,abstraction,abstract data type,static and dynamic,time and spa...
Data structure,abstraction,abstract data type,static and dynamic,time and spa...Data structure,abstraction,abstract data type,static and dynamic,time and spa...
Data structure,abstraction,abstract data type,static and dynamic,time and spa...
Hassan Ahmed
 

What's hot (20)

1. Fundamental Concept - Data Structures using C++ by Varsha Patil
1. Fundamental Concept - Data Structures using C++ by Varsha Patil1. Fundamental Concept - Data Structures using C++ by Varsha Patil
1. Fundamental Concept - Data Structures using C++ by Varsha Patil
 
2. Linear Data Structure Using Arrays - Data Structures using C++ by Varsha P...
2. Linear Data Structure Using Arrays - Data Structures using C++ by Varsha P...2. Linear Data Structure Using Arrays - Data Structures using C++ by Varsha P...
2. Linear Data Structure Using Arrays - Data Structures using C++ by Varsha P...
 
Data structure
Data structureData structure
Data structure
 
Data structure-question-bank
Data structure-question-bankData structure-question-bank
Data structure-question-bank
 
1 list datastructures
1 list datastructures1 list datastructures
1 list datastructures
 
Introduction to data structure and algorithms
Introduction to data structure and algorithmsIntroduction to data structure and algorithms
Introduction to data structure and algorithms
 
Abstract data types
Abstract data typesAbstract data types
Abstract data types
 
Data structure lecture 1
Data structure lecture 1Data structure lecture 1
Data structure lecture 1
 
Introduction to database-Formal Query language and Relational calculus
Introduction to database-Formal Query language and Relational calculusIntroduction to database-Formal Query language and Relational calculus
Introduction to database-Formal Query language and Relational calculus
 
2 b queues
2 b queues2 b queues
2 b queues
 
Database Programming using SQL
Database Programming using SQLDatabase Programming using SQL
Database Programming using SQL
 
7. Tree - Data Structures using C++ by Varsha Patil
7. Tree - Data Structures using C++ by Varsha Patil7. Tree - Data Structures using C++ by Varsha Patil
7. Tree - Data Structures using C++ by Varsha Patil
 
Computer Science-Data Structures :Abstract DataType (ADT)
Computer Science-Data Structures :Abstract DataType (ADT)Computer Science-Data Structures :Abstract DataType (ADT)
Computer Science-Data Structures :Abstract DataType (ADT)
 
List moderate
List   moderateList   moderate
List moderate
 
IJERD (www.ijerd.com) International Journal of Engineering Research and Devel...
IJERD (www.ijerd.com) International Journal of Engineering Research and Devel...IJERD (www.ijerd.com) International Journal of Engineering Research and Devel...
IJERD (www.ijerd.com) International Journal of Engineering Research and Devel...
 
2 a stacks
2 a stacks2 a stacks
2 a stacks
 
Abstract Data Types
Abstract Data TypesAbstract Data Types
Abstract Data Types
 
Bsc cs ii dfs u-1 introduction to data structure
Bsc cs ii dfs u-1 introduction to data structureBsc cs ii dfs u-1 introduction to data structure
Bsc cs ii dfs u-1 introduction to data structure
 
Python for beginners
Python for beginnersPython for beginners
Python for beginners
 
Data structure,abstraction,abstract data type,static and dynamic,time and spa...
Data structure,abstraction,abstract data type,static and dynamic,time and spa...Data structure,abstraction,abstract data type,static and dynamic,time and spa...
Data structure,abstraction,abstract data type,static and dynamic,time and spa...
 

Similar to Professional dataware-housing-training-in-mumbai

Warehousing_Ch10.ppt
Warehousing_Ch10.pptWarehousing_Ch10.ppt
Warehousing_Ch10.ppt
NoThanks63
 
Machine Learning - Dataset Preparation
Machine Learning - Dataset PreparationMachine Learning - Dataset Preparation
Machine Learning - Dataset Preparation
Andrew Ferlitsch
 
DS-UNIT 1 FINAL (2).pptx
DS-UNIT 1 FINAL (2).pptxDS-UNIT 1 FINAL (2).pptx
DS-UNIT 1 FINAL (2).pptx
prakashvs7
 
Data Structures_Introduction
Data Structures_IntroductionData Structures_Introduction
Data Structures_Introduction
ThenmozhiK5
 
Data massage! databases scaled from one to one million nodes (ulf wendel)
Data massage! databases scaled from one to one million nodes (ulf wendel)Data massage! databases scaled from one to one million nodes (ulf wendel)
Data massage! databases scaled from one to one million nodes (ulf wendel)Zhang Bo
 
Data massage: How databases have been scaled from one to one million nodes
Data massage: How databases have been scaled from one to one million nodesData massage: How databases have been scaled from one to one million nodes
Data massage: How databases have been scaled from one to one million nodes
Ulf Wendel
 
Data Warehouse - What you know about etl process is wrong
Data Warehouse - What you know about etl process is wrongData Warehouse - What you know about etl process is wrong
Data Warehouse - What you know about etl process is wrong
Massimo Cenci
 
Etl interview questions
Etl interview questionsEtl interview questions
Etl interview questionsashokvirtual
 
ETL Process & Data Warehouse Fundamentals
ETL Process & Data Warehouse FundamentalsETL Process & Data Warehouse Fundamentals
ETL Process & Data Warehouse Fundamentals
SOMASUNDARAM T
 
The best ETL questions in a nut shell
The best ETL questions in a nut shellThe best ETL questions in a nut shell
The best ETL questions in a nut shell
Srinimf-Slides
 
algo 1.ppt
algo 1.pptalgo 1.ppt
algo 1.ppt
example43
 
Data warehouse physical design
Data warehouse physical designData warehouse physical design
Data warehouse physical design
Er. Nawaraj Bhandari
 
127556030 bisp-informatica-question-collections
127556030 bisp-informatica-question-collections127556030 bisp-informatica-question-collections
127556030 bisp-informatica-question-collectionsAmit Sharma
 
Data Warehouse Modeling
Data Warehouse ModelingData Warehouse Modeling
Data Warehouse Modelingvivekjv
 
Phd coursestatalez2datamanagement
Phd coursestatalez2datamanagementPhd coursestatalez2datamanagement
Phd coursestatalez2datamanagement
Marco Delogu
 
5 Coding Hacks to Reduce GC Overhead
5 Coding Hacks to Reduce GC Overhead5 Coding Hacks to Reduce GC Overhead
5 Coding Hacks to Reduce GC Overhead
Takipi
 
Whats a datawarehouse
Whats a datawarehouseWhats a datawarehouse
Whats a datawarehousevijjudarling
 
Elag 2012 - Under the hood of 3TU.Datacentrum.
Elag 2012 - Under the hood of 3TU.Datacentrum.Elag 2012 - Under the hood of 3TU.Datacentrum.
Elag 2012 - Under the hood of 3TU.Datacentrum.
Egbert Gramsbergen
 

Similar to Professional dataware-housing-training-in-mumbai (20)

Warehousing_Ch10.ppt
Warehousing_Ch10.pptWarehousing_Ch10.ppt
Warehousing_Ch10.ppt
 
Machine Learning - Dataset Preparation
Machine Learning - Dataset PreparationMachine Learning - Dataset Preparation
Machine Learning - Dataset Preparation
 
DS-UNIT 1 FINAL (2).pptx
DS-UNIT 1 FINAL (2).pptxDS-UNIT 1 FINAL (2).pptx
DS-UNIT 1 FINAL (2).pptx
 
Data Structures_Introduction
Data Structures_IntroductionData Structures_Introduction
Data Structures_Introduction
 
Data massage! databases scaled from one to one million nodes (ulf wendel)
Data massage! databases scaled from one to one million nodes (ulf wendel)Data massage! databases scaled from one to one million nodes (ulf wendel)
Data massage! databases scaled from one to one million nodes (ulf wendel)
 
Dwh faqs
Dwh faqsDwh faqs
Dwh faqs
 
Data massage: How databases have been scaled from one to one million nodes
Data massage: How databases have been scaled from one to one million nodesData massage: How databases have been scaled from one to one million nodes
Data massage: How databases have been scaled from one to one million nodes
 
Data Warehouse - What you know about etl process is wrong
Data Warehouse - What you know about etl process is wrongData Warehouse - What you know about etl process is wrong
Data Warehouse - What you know about etl process is wrong
 
Etl interview questions
Etl interview questionsEtl interview questions
Etl interview questions
 
ETL Process & Data Warehouse Fundamentals
ETL Process & Data Warehouse FundamentalsETL Process & Data Warehouse Fundamentals
ETL Process & Data Warehouse Fundamentals
 
The best ETL questions in a nut shell
The best ETL questions in a nut shellThe best ETL questions in a nut shell
The best ETL questions in a nut shell
 
algo 1.ppt
algo 1.pptalgo 1.ppt
algo 1.ppt
 
Data Warehouse
Data WarehouseData Warehouse
Data Warehouse
 
Data warehouse physical design
Data warehouse physical designData warehouse physical design
Data warehouse physical design
 
127556030 bisp-informatica-question-collections
127556030 bisp-informatica-question-collections127556030 bisp-informatica-question-collections
127556030 bisp-informatica-question-collections
 
Data Warehouse Modeling
Data Warehouse ModelingData Warehouse Modeling
Data Warehouse Modeling
 
Phd coursestatalez2datamanagement
Phd coursestatalez2datamanagementPhd coursestatalez2datamanagement
Phd coursestatalez2datamanagement
 
5 Coding Hacks to Reduce GC Overhead
5 Coding Hacks to Reduce GC Overhead5 Coding Hacks to Reduce GC Overhead
5 Coding Hacks to Reduce GC Overhead
 
Whats a datawarehouse
Whats a datawarehouseWhats a datawarehouse
Whats a datawarehouse
 
Elag 2012 - Under the hood of 3TU.Datacentrum.
Elag 2012 - Under the hood of 3TU.Datacentrum.Elag 2012 - Under the hood of 3TU.Datacentrum.
Elag 2012 - Under the hood of 3TU.Datacentrum.
 

More from Unmesh Baile

java-corporate-training-institute-in-mumbai
java-corporate-training-institute-in-mumbaijava-corporate-training-institute-in-mumbai
java-corporate-training-institute-in-mumbai
Unmesh Baile
 
Php mysql training-in-mumbai
Php mysql training-in-mumbaiPhp mysql training-in-mumbai
Php mysql training-in-mumbai
Unmesh Baile
 
Java course-in-mumbai
Java course-in-mumbaiJava course-in-mumbai
Java course-in-mumbai
Unmesh Baile
 
Robotics corporate-training-in-mumbai
Robotics corporate-training-in-mumbaiRobotics corporate-training-in-mumbai
Robotics corporate-training-in-mumbai
Unmesh Baile
 
Corporate-training-for-msbi-course-in-mumbai
Corporate-training-for-msbi-course-in-mumbaiCorporate-training-for-msbi-course-in-mumbai
Corporate-training-for-msbi-course-in-mumbai
Unmesh Baile
 
Linux corporate-training-in-mumbai
Linux corporate-training-in-mumbaiLinux corporate-training-in-mumbai
Linux corporate-training-in-mumbai
Unmesh Baile
 
Best-embedded-corporate-training-in-mumbai
Best-embedded-corporate-training-in-mumbaiBest-embedded-corporate-training-in-mumbai
Best-embedded-corporate-training-in-mumbai
Unmesh Baile
 
Selenium-corporate-training-in-mumbai
Selenium-corporate-training-in-mumbaiSelenium-corporate-training-in-mumbai
Selenium-corporate-training-in-mumbai
Unmesh Baile
 
Weblogic-clustering-failover-and-load-balancing-training
Weblogic-clustering-failover-and-load-balancing-trainingWeblogic-clustering-failover-and-load-balancing-training
Weblogic-clustering-failover-and-load-balancing-training
Unmesh Baile
 
Advance-excel-professional-trainer-in-mumbai
Advance-excel-professional-trainer-in-mumbaiAdvance-excel-professional-trainer-in-mumbai
Advance-excel-professional-trainer-in-mumbai
Unmesh Baile
 
Best corporate-r-programming-training-in-mumbai
Best corporate-r-programming-training-in-mumbaiBest corporate-r-programming-training-in-mumbai
Best corporate-r-programming-training-in-mumbai
Unmesh Baile
 
R-programming-training-in-mumbai
R-programming-training-in-mumbaiR-programming-training-in-mumbai
R-programming-training-in-mumbai
Unmesh Baile
 
Corporate-data-warehousing-training
Corporate-data-warehousing-trainingCorporate-data-warehousing-training
Corporate-data-warehousing-training
Unmesh Baile
 
Sas-training-in-mumbai
Sas-training-in-mumbaiSas-training-in-mumbai
Sas-training-in-mumbai
Unmesh Baile
 
Microsoft-business-intelligence-training-in-mumbai
Microsoft-business-intelligence-training-in-mumbaiMicrosoft-business-intelligence-training-in-mumbai
Microsoft-business-intelligence-training-in-mumbai
Unmesh Baile
 
Linux-training-for-beginners-in-mumbai
Linux-training-for-beginners-in-mumbaiLinux-training-for-beginners-in-mumbai
Linux-training-for-beginners-in-mumbai
Unmesh Baile
 
Corporate-informatica-training-in-mumbai
Corporate-informatica-training-in-mumbaiCorporate-informatica-training-in-mumbai
Corporate-informatica-training-in-mumbai
Unmesh Baile
 
Corporate-informatica-training-in-mumbai
Corporate-informatica-training-in-mumbaiCorporate-informatica-training-in-mumbai
Corporate-informatica-training-in-mumbai
Unmesh Baile
 
Best-robotics-training-in-mumbai
Best-robotics-training-in-mumbaiBest-robotics-training-in-mumbai
Best-robotics-training-in-mumbai
Unmesh Baile
 
Best-embedded-system-classes-in-mumbai
Best-embedded-system-classes-in-mumbaiBest-embedded-system-classes-in-mumbai
Best-embedded-system-classes-in-mumbai
Unmesh Baile
 

More from Unmesh Baile (20)

java-corporate-training-institute-in-mumbai
java-corporate-training-institute-in-mumbaijava-corporate-training-institute-in-mumbai
java-corporate-training-institute-in-mumbai
 
Php mysql training-in-mumbai
Php mysql training-in-mumbaiPhp mysql training-in-mumbai
Php mysql training-in-mumbai
 
Java course-in-mumbai
Java course-in-mumbaiJava course-in-mumbai
Java course-in-mumbai
 
Robotics corporate-training-in-mumbai
Robotics corporate-training-in-mumbaiRobotics corporate-training-in-mumbai
Robotics corporate-training-in-mumbai
 
Corporate-training-for-msbi-course-in-mumbai
Corporate-training-for-msbi-course-in-mumbaiCorporate-training-for-msbi-course-in-mumbai
Corporate-training-for-msbi-course-in-mumbai
 
Linux corporate-training-in-mumbai
Linux corporate-training-in-mumbaiLinux corporate-training-in-mumbai
Linux corporate-training-in-mumbai
 
Best-embedded-corporate-training-in-mumbai
Best-embedded-corporate-training-in-mumbaiBest-embedded-corporate-training-in-mumbai
Best-embedded-corporate-training-in-mumbai
 
Selenium-corporate-training-in-mumbai
Selenium-corporate-training-in-mumbaiSelenium-corporate-training-in-mumbai
Selenium-corporate-training-in-mumbai
 
Weblogic-clustering-failover-and-load-balancing-training
Weblogic-clustering-failover-and-load-balancing-trainingWeblogic-clustering-failover-and-load-balancing-training
Weblogic-clustering-failover-and-load-balancing-training
 
Advance-excel-professional-trainer-in-mumbai
Advance-excel-professional-trainer-in-mumbaiAdvance-excel-professional-trainer-in-mumbai
Advance-excel-professional-trainer-in-mumbai
 
Best corporate-r-programming-training-in-mumbai
Best corporate-r-programming-training-in-mumbaiBest corporate-r-programming-training-in-mumbai
Best corporate-r-programming-training-in-mumbai
 
R-programming-training-in-mumbai
R-programming-training-in-mumbaiR-programming-training-in-mumbai
R-programming-training-in-mumbai
 
Corporate-data-warehousing-training
Corporate-data-warehousing-trainingCorporate-data-warehousing-training
Corporate-data-warehousing-training
 
Sas-training-in-mumbai
Sas-training-in-mumbaiSas-training-in-mumbai
Sas-training-in-mumbai
 
Microsoft-business-intelligence-training-in-mumbai
Microsoft-business-intelligence-training-in-mumbaiMicrosoft-business-intelligence-training-in-mumbai
Microsoft-business-intelligence-training-in-mumbai
 
Linux-training-for-beginners-in-mumbai
Linux-training-for-beginners-in-mumbaiLinux-training-for-beginners-in-mumbai
Linux-training-for-beginners-in-mumbai
 
Corporate-informatica-training-in-mumbai
Corporate-informatica-training-in-mumbaiCorporate-informatica-training-in-mumbai
Corporate-informatica-training-in-mumbai
 
Corporate-informatica-training-in-mumbai
Corporate-informatica-training-in-mumbaiCorporate-informatica-training-in-mumbai
Corporate-informatica-training-in-mumbai
 
Best-robotics-training-in-mumbai
Best-robotics-training-in-mumbaiBest-robotics-training-in-mumbai
Best-robotics-training-in-mumbai
 
Best-embedded-system-classes-in-mumbai
Best-embedded-system-classes-in-mumbaiBest-embedded-system-classes-in-mumbai
Best-embedded-system-classes-in-mumbai
 

Recently uploaded

IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptxIOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
Abida Shariff
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
Sri Ambati
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Tobias Schneck
 
ODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User GroupODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User Group
CatarinaPereira64715
 
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
Product School
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
Product School
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Thierry Lestable
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
DianaGray10
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
ThousandEyes
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
Alison B. Lowndes
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
DianaGray10
 
Search and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical FuturesSearch and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical Futures
Bhaskar Mitra
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
DianaGray10
 
JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
RTTS
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
DanBrown980551
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
UiPathCommunity
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
Kari Kakkonen
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
Prayukth K V
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
Jemma Hussein Allen
 

Recently uploaded (20)

IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptxIOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
 
ODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User GroupODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User Group
 
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
 
Search and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical FuturesSearch and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical Futures
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
 
JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
 

Professional dataware-housing-training-in-mumbai

  • 1.
  • 3. ∗ We have mostly focused on techniques for virtual data integration (see Ch. 1) ∗ Queries are composed with mappings on the fly and data is fetched on demand ∗ This represents one extreme point ∗ In this chapter, we consider cases where data is transformed and materialized “in advance” of the queries ∗ The main scenario: the data warehouse Data Warehousing and Materialization
  • 4. ∗ In many organizations, we want a central “store” of all of our entities, concepts, metadata, and historical information ∗ For doing data validation, complex mining, analysis, prediction, … ∗ This is the data warehouse ∗ To this point we’ve focused on scenarios where the data “lives” in the sources – here we may have a “master” version (and archival version) in a central database ∗ For performance reasons, availability reasons, archival reasons, … What Is a Data Warehouse?
  • 5. ∗ The data warehouse as the master data instance ∗ Data warehouse architectures, design, loading ∗ Data exchange: declarative data warehousing ∗ Hybrid models: caching and partial materialization ∗ Querying externally archived data In the Rest of this Chapter…
  • 6. The data warehouse ∗ Motivation: Master data management ∗ Physical design ∗ Extract/transform/load ∗ Data exchange ∗ Caching & partial materialization ∗ Operating on external data Outline
  • 7. ∗ One of the “modern” uses of the data warehouse is not only to support analytics but to serve as a reference to all of the entities in the organization ∗ A cleaned, validated repository of what we know … which can be linked to by data sources … which may help with data cleaning … and which may be the basis of data governance (processes by which data is created and modified in a systematic way, e.g., to comply with gov’t regulations) ∗ There is an emerging field called master data management out the process of creating these Master Data Management
  • 8. Data Warehouse Architecture ∗ At the top – a centralized database ∗ Generally configured for queries and appends – not transactions ∗ Many indices, materialized views, etc. ∗ Data is loaded and periodically updated via Extract/Transform/Load (ETL) tools Data Warehouse ETL ETL ETL ETL RDBMS1 RDBMS2 HTML1 XML1 ETL pipeline outputs ETL
  • 9. ∗ ETL tools are the equivalent of schema mappings in virtual integration, but are more powerful ∗ Arbitrary pieces of code to take data from a source, convert it into data for the warehouse: ∗ import filters – read and convert from data sources ∗ data transformations – join, aggregate, filter, convert data ∗ de-duplication – finds multiple records referring to the same entity, merges them ∗ profiling – builds tables, histograms, etc. to summarize data ∗ quality management – test against master values, known business rules, constraints, etc. ETL Tools
  • 10. Example ETL Tool Chain ∗ This is an example for e-commerce loading ∗ Note multiple stages of filtering (using selection or join-like operations), logging bad records, before we group and load Invoice line items Split Date- time Filter invalid Join Filter invalid Invalid dates/times Invalid items Item records Filter non - match Invalid customers Group by customer Customer balance Customer records
  • 11. ∗ Two aspects: ∗ A central DBMS optimized for appends and querying ∗ The “master data” instance ∗ Or the instance for doing mining, analytics, and prediction ∗ A set of procedural ETL “pipelines” to fetch, transform, filter, clean, and load data ∗ Often these tools are more expressive than standard conjunctive queries (as in Chapters 2-3) ∗ … But not always! This raises a question – can we do warehousing with declarative mappings? Basic Data Warehouse – Summary
  • 12. The data warehouse Data exchange ∗ Caching & partial materialization ∗ Operating on external data Outline
  • 13. ∗ Intuitively, a declarative setup for data warehousing ∗ Declarative schema mappings as in Ch. 2-3 ∗ Materialized database as in the previous section ∗ Also allow for unknown values when we map from source to target (warehouse) instance ∗ If we know a professor teaches a student, then there must exist a course C that the student took and the professor taught – but we may not know which… Data Exchange
  • 14. A data exchange setting (S,T,M,CT) has: ∗ S, source schema representing all of the source tables jointly ∗ T, target schema ∗ A set of mappings or tuple-generating dependencies relating S and T ∗ A set of constraints (equality-generating dependencies) Data Exchange Formulation (∀X)s1(X1), ..., sm (Xm ) → (∃Y) t1(Y1), ..., tk (Yk ) )Y(Y)Y(t...,,)Y()tY( jill11 =→∃
  • 15. An Example Source S has Teaches(prof, student) Adviser(adviser, student) Target T has Advise(adviser, student) TeachesCourse(prof, course) Takes(course, student) ),(),,(.,),(: ),(),(: ),(),,(.),(: ),(.),(: 4 3 2 1 studCTakesCDrseTeachesCouDCstudprofAdviserr studprofAdvisestudprofAdviserr studCTakesCprofrseTeachesCouCstudprofTeachesr studDAdviseDstudprofTeachesr ∃→ → ∃→ ∃→ existential variables represent unknowns
  • 16. ∗ The goal of data exchange is to compute an instance of the target schema, given a data exchange setting D = (S,T,M,CT) and an instance I(S) ∗ An instance J of Schema T is a data exchange solution for D and I if 1. the pair (I,J) satisfies schema mapping M, and 2. J satisfies constraints CT The Data Exchange Solution
  • 17. Instance I(S) has Teaches Adviser prof student Ann Bob Chloe David Back to the Example, Now with Data adviser student Ellen Bob Felicia David adviser student Ellen Bob Felicia David course student C1 Bob C2 David prof course Ann C1 Chloe C2 Instance J(T) has Advise TeachesCourse Takes variables or labeled nulls represent unknown values
  • 18. Instance I(S) has Teaches Adviser prof student Ann Bob Chloe David This Is also a Solution adviser student Ellen Bob Felicia David adviser student Ellen Bob Felicia David course student C1 Bob C1 David prof course Ann C1 Chloe C1 Instance J(T) has Advise TeachesCourse Takes this time the labeled nulls are all the same!
  • 19. ∗ Intuitively, the first solution should be better than the second ∗ The first solution uses the same variable for the course taught by Ann and by Chloe – they are the same course ∗ But this was not specified in the original schema! ∗ We formalize that through the notion of the universal solution, which must not lose any information Universal Solutions
  • 20. First we define instance homomorphism: ∗ Let J1, J2 be two instances of schema T ∗ A mapping h: J1  J2 is a homomorphism from J1 to J2 if ∗ h(c) = c for every c ∈ C, ∗ for every tuple R(a1,…,an) ∈ J1 the tuple R(h(a1),…,h(an)) ∈ J2 ∗ J1, J2 are homomorphically equivalent if there are homomorphisms h: J1  J2 and h’: J2  J1 Def: Universal solution for data exchange setting D = (S,T,M,CT), where I is an instance of S. A data exchange solution J for D and I is a universal solution if, for every other data exchange solution J’ for D and I, there exists a homomorphism h: J  J’ Formalizing the Universal Solution
  • 21. ∗ The standard process is to use a procedure called the chase ∗ Informally: ∗ Consider every formula r of M in turn: ∗ If there is a variable substitution for the left-hand side (lhs) of r where the right-hand side (rhs) is not in the solution – add it ∗ If we create a new tuple, for every existential variable in the rhs, substitute a new fresh variable ∗ See Chapter 10 Algorithm 10 for full pseudocode Computing Universal Solutions
  • 22. ∗ Universal solutions may be of arbitrary size ∗ The core universal solution is the minimal universal solution Core Universal Solutions
  • 23. ∗ As with the data warehouse, all queries are directly posed over the target database – no reformulation necessary ∗ However, we typically assume certain answers semantics ∗ To get the certain answers (which are the same as in the virtual integration setting with GLAV/TGD mappings) – compute the query answers and then drop any tuples with labeled nulls (variables) Data Exchange and Querying
  • 24. ∗ From an external perspective, exchange and warehousing are essentially equivalent ∗ But there are different trade-offs in procedural vs. declarative mappings ∗ Procedural – more expressive ∗ Declarative – easier to reason about, compose, invert, create matieralized views for, etc. (see Chapter 6) Data Exchange vs. Warehousing
  • 25. The data warehouse Data exchange Caching & partial materialization ∗ Operating on external data Outline
  • 26. ∗ Many real EII systems compute and maintain materialized views, or cache results A “hybrid” point between the fully virtual and fully materialized approaches The Spectrum of Materialization Virtual integration (EII) Data exchange / data warehouse sources materialized all mediated relations materialized caching or partial materialization – some views materialized
  • 27. Cache results of prior queries ∗ Take the results of each query, materialize them ∗ Use answering queries using views to reuse ∗ Expire using time-to-live… May not always be fresh! Administrator-selected views ∗ Someone manually specifies views to compute and maintain, as with a relational DBMS ∗ System automatically maintains Automatic view selection ∗ Using query workload, update frequencies – a view materialization wizard chooses what to materialize Possible Techniques for Choosing What to Materialize
  • 28. The data warehouse Data exchange Caching & partial materialization Operating on external data Outline
  • 29. ∗ Many Web scenarios where we have large logs of data accesses, created by the server ∗ Goal: put these together and query them! ∗ Looks like a very simple data integration scenario – external data, but single schema ∗ A common approach: use programming environments like MapReduce (or SQL layers above) to query the data on a cluster ∗ MapReduce reliably runs large jobs across 100s or 1000s of “shared nothing” nodes in a cluster Many “Integration-Like” Scenarios over Historical Data
  • 30. ∗ MapReduce is essentially a template for writing distributed programs – corresponding to a single SQL SELECT..FROM..WHERE..GROUP BY..HAVING block with user-defined functions ∗ The MapReduce runtime calls a set of functions: ∗ map is given a tuple, outputs 0 or more tuples in response ∗ roughly like the WHERE clause ∗ shuffle is a stage for doing sort-based grouping on a key (specified by the map) ∗ reduce is an aggregate function called over the set of tuples with the same grouping key MapReduce Basics
  • 31. 31 MapReduce Dataflow “Template”: Tuples  Map “worker”  Shuffle  Reduce “worker” Map Worker Map Worker Map Worker Map Worker Reduce Worker Reduce Worker Reduce Worker Reduce Worker Reduce Worker emit tuples emit aggregate results
  • 32. ∗ Some people use MapReduce to take data, transform it, and load it into a warehouse ∗ … which is basically what ETL tools do! ∗ The dividing line between DBMSs, EII, MapReduce is blurring as of the development of this book ∗ SQL  MapReduce ∗ MapReduce over SQL engines ∗ Shared-nothing DBMSs ∗ NoSQL MapReduce as ETL
  • 33. ∗ There are benefits to centralizing & materializing data ∗ Performance, especially for analytics / mining ∗ Archival ∗ Standardization / canonicalization ∗ Data warehouses typically use procedural ETL tools to extract, transform, load (and clean) data ∗ Data exchange replaces ETL with declarative mappings (where feasible) ∗ Hybrid schemes exist for partial materialization ∗ Increasingly we are integrating via MapReduce and its cousins Warehousing & Materialization Wrap- up