The document discusses data warehousing and materialization approaches for data integration. It describes:
1) The data warehouse approach, where data is transformed, cleaned and stored in a central database optimized for queries. Extract-Transform-Load tools are used to periodically load and update the warehouse.
2) Data exchange, which takes a declarative approach using schema mappings to materialize a target warehouse instance from source data, handling unknown values.
3) Hybrid approaches involving caching query results or partially materializing views to trade off between virtual and fully materialized integration.
Warehousing_Ch10.ppt
1. AnHai Doan, Alon Halevy, Zachary Ives
Principles of Data Integration
Chapter 10: Data Warehousing & Caching
2. Data Warehousing and Materialization
We have mostly focused on techniques for virtual data integration (see Ch. 1)
Queries are composed with mappings on the fly and data is fetched on demand
This represents one extreme point
In this chapter, we consider cases where data is transformed and materialized “in advance” of the queries
The main scenario: the data warehouse
3. What Is a Data Warehouse?
In many organizations, we want a central “store” of all of our entities, concepts, metadata, and historical information
For doing data validation, complex mining, analysis, prediction, …
This is the data warehouse
To this point we’ve focused on scenarios where the data “lives” in the sources – here we may have a “master” version (and archival version) in a central database
For performance reasons, availability reasons, archival reasons, …
4. In the Rest of this Chapter…
The data warehouse as the master data instance
Data warehouse architectures, design, loading
Data exchange: declarative data warehousing
Hybrid models: caching and partial materialization
Querying externally archived data
5. Outline
The data warehouse
Motivation: Master data management
Physical design
Extract/transform/load
Data exchange
Caching & partial materialization
Operating on external data
6. Master Data Management
One of the “modern” uses of the data warehouse is not only to support analytics but to serve as a reference to all of the entities in the organization
A cleaned, validated repository of what we know
… which can be linked to by data sources
… which may help with data cleaning
… and which may be the basis of data governance (processes by which data is created and modified in a systematic way, e.g., to comply with gov’t regulations)
There is an emerging field called master data management focused on the process of creating these repositories
7. Data Warehouse Architecture
At the top – a centralized database
Generally configured for queries and appends – not transactions
Many indices, materialized views, etc.
Data is loaded and periodically updated via Extract/Transform/Load (ETL) tools
[Diagram: sources RDBMS1, RDBMS2, HTML1, and XML1 each feed an ETL pipeline; the ETL pipeline outputs are loaded into the central Data Warehouse]
8. ETL Tools
ETL tools are the equivalent of schema mappings in virtual integration, but are more powerful
Arbitrary pieces of code to take data from a source, convert it into data for the warehouse:
import filters – read and convert from data sources
data transformations – join, aggregate, filter, convert data
de-duplication – finds multiple records referring to the same entity, merges them
profiling – builds tables, histograms, etc. to summarize data
quality management – test against master values, known business rules, constraints, etc.
9. Example ETL Tool Chain
This is an example for e-commerce loading
Note multiple stages of filtering (using selection or join-like operations), logging bad records, before we group and load; a code sketch of the same pipeline follows below
[Diagram: invoice line items → split date-time → filter invalid (logging invalid dates/times) → join with item records → filter invalid (logging invalid items) → filter non-matching customers against customer records (logging invalid customers) → group by customer → customer balance]
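These stages map naturally onto ordinary code. Below is a minimal, hypothetical Python sketch of the pipeline above; the record fields (timestamp, item_id, customer_id, amount) are assumptions for illustration, and a real ETL tool would add logging, scheduling, incremental loading, and error recovery.

```python
from datetime import datetime

def split_datetime(row):
    """Split a combined timestamp field into separate date and time fields."""
    ts = datetime.fromisoformat(row["timestamp"])  # raises ValueError if malformed
    return {**row, "date": ts.date(), "time": ts.time()}

def run_pipeline(invoice_lines, item_records, customer_records):
    """item_records / customer_records: dicts keyed by item_id / customer_id."""
    invalid_dates, invalid_items, invalid_customers = [], [], []
    valid = []
    for row in invoice_lines:
        # Split date-time; filter and log rows with invalid dates/times
        try:
            row = split_datetime(row)
        except (KeyError, ValueError):
            invalid_dates.append(row)
            continue
        # Join against item records; filter and log invalid items
        item = item_records.get(row["item_id"])
        if item is None:
            invalid_items.append(row)
            continue
        row = {**row, **item}
        # Filter and log rows whose customer has no master record
        if row["customer_id"] not in customer_records:
            invalid_customers.append(row)
            continue
        valid.append(row)
    # Group by customer to compute customer balances, then load
    balances = {}
    for row in valid:
        balances[row["customer_id"]] = balances.get(row["customer_id"], 0) + row["amount"]
    return balances, (invalid_dates, invalid_items, invalid_customers)
```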
10. Basic Data Warehouse – Summary
Two aspects:
A central DBMS optimized for appends and querying
The “master data” instance
Or the instance for doing mining, analytics, and prediction
A set of procedural ETL “pipelines” to fetch, transform, filter, clean, and load data
Often these tools are more expressive than standard conjunctive queries (as in Chapters 2-3)
… But not always!
This raises a question – can we do warehousing with declarative mappings?
11. Outline
The data warehouse
Data exchange
Caching & partial materialization
Operating on external data
12. Data Exchange
Intuitively, a declarative setup for data warehousing
Declarative schema mappings as in Ch. 2-3
Materialized database as in the previous section
Also allow for unknown values when we map from source to target (warehouse) instance
If we know a professor teaches a student, then there must exist a course C that the student took and the professor taught – but we may not know which…
13. Data Exchange Formulation
A data exchange setting (S,T,M,CT) has:
S, a source schema representing all of the source tables jointly
T, a target schema
M, a set of mappings or tuple-generating dependencies (tgds) relating S and T, of the form
(∀X) s1(X1), ..., sm(Xm) → (∃Y) t1(Y1), ..., tk(Yk)
CT, a set of constraints (equality-generating dependencies, egds) on the target, of the form
(∀Y) t1(Y1), ..., tl(Yl) → Yi = Yj
14. An Example
Source S has
Teaches(prof, student)
Adviser(adviser, student)
Target T has
Advise(adviser, student)
TeachesCourse(prof, course)
Takes(course, student)
The mappings M (existential variables represent unknown values):
m1: Teaches(prof, stud) → ∃D. Advise(D, stud)
m2: Teaches(prof, stud) → ∃C. TeachesCourse(prof, C) ∧ Takes(C, stud)
m3: Adviser(advisr, stud) → Advise(advisr, stud)
m4: Adviser(advisr, stud) → ∃C, D. TeachesCourse(D, C) ∧ Takes(C, stud)
15. The Data Exchange Solution
The goal of data exchange is to compute an instance of the target schema, given a data exchange setting D = (S,T,M,CT) and an instance I(S)
An instance J of schema T is a data exchange solution for D and I if
1. the pair (I,J) satisfies schema mapping M, and
2. J satisfies constraints CT
16. Back to the Example, Now with Data
Instance I(S) has
Teaches(prof, student): {(Ann, Bob), (Chloe, David)}
Adviser(adviser, student): {(Ellen, Bob), (Felicia, David)}
Instance J(T) has
Advise(adviser, student): {(Ellen, Bob), (Felicia, David)}
TeachesCourse(prof, course): {(Ann, C1), (Chloe, C2)}
Takes(course, student): {(C1, Bob), (C2, David)}
variables or labeled nulls (C1, C2) represent unknown values
17. This Is also a Solution
Instance I(S) has
Teaches(prof, student): {(Ann, Bob), (Chloe, David)}
Adviser(adviser, student): {(Ellen, Bob), (Felicia, David)}
Instance J(T) has
Advise(adviser, student): {(Ellen, Bob), (Felicia, David)}
TeachesCourse(prof, course): {(Ann, C1), (Chloe, C1)}
Takes(course, student): {(C1, Bob), (C1, David)}
this time the labeled nulls are all the same!
18. Universal Solutions
Intuitively, the first solution should be better than the second
The second solution uses the same variable for the course taught by Ann and by Chloe – it asserts they are the same course
But this was not specified in the original mappings!
We formalize that through the notion of the universal solution, which must not lose any information
19. Formalizing the Universal Solution
First we define instance homomorphism:
Let J1, J2 be two instances of schema T
A mapping h: J1 → J2 is a homomorphism from J1 to J2 if
h(c) = c for every constant c ∈ C,
for every tuple R(a1,…,an) ∈ J1, the tuple R(h(a1),…,h(an)) ∈ J2
J1, J2 are homomorphically equivalent if there are homomorphisms h: J1 → J2 and h’: J2 → J1
Def: Universal solution for data exchange setting D = (S,T,M,CT), where I is an instance of S.
A data exchange solution J for D and I is a universal solution if, for every other data exchange solution J’ for D and I, there exists a homomorphism h: J → J’
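As a concrete illustration (not from the book), here is a brute-force Python sketch that searches for a homomorphism between two instances, restricting to the Takes relation of the example for brevity; constants are plain values and labeled nulls are wrapped in a small Null class.

```python
from itertools import product

class Null:
    """A labeled null (variable); anything else is treated as a constant."""
    def __init__(self, label):
        self.label = label
    def __eq__(self, other):
        return isinstance(other, Null) and other.label == self.label
    def __hash__(self):
        return hash(("Null", self.label))
    def __repr__(self):
        return f"Null({self.label!r})"

def find_homomorphism(j1, j2):
    """Search for h: J1 -> J2 that is the identity on constants and maps
    every tuple of J1 into J2. Instances: dict of relation name -> set of
    tuples. Exponential brute force; fine for tiny examples only."""
    nulls = sorted({v.label for ts in j1.values() for t in ts for v in t
                    if isinstance(v, Null)})
    targets = {v for ts in j2.values() for t in ts for v in t}
    for assignment in product(targets, repeat=len(nulls)):
        h = dict(zip(nulls, assignment))
        if all(tuple(h[v.label] if isinstance(v, Null) else v for v in t)
               in j2.get(rel, set())
               for rel, ts in j1.items() for t in ts):
            return h
    return None

# The first solution maps into the second (C1 -> C1, C2 -> C1) but not the
# reverse, which is why the first is universal and the second is not:
first = {"Takes": {(Null("C1"), "Bob"), (Null("C2"), "David")}}
second = {"Takes": {(Null("C1"), "Bob"), (Null("C1"), "David")}}
assert find_homomorphism(first, second) is not None
assert find_homomorphism(second, first) is None
```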
20. Computing Universal Solutions
The standard process is to use a procedure called the chase
Informally:
Consider every formula r of M in turn:
If there is a variable substitution for the left-hand side (lhs) of r where the right-hand side (rhs) is not in the solution – add it
If we create a new tuple, for every existential variable in the rhs, substitute a new fresh variable
See Algorithm 10 in Chapter 10 for the full pseudocode
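Below is a hedged, minimal Python sketch of this procedure over the running example; it is not the book's algorithm, just a toy under the conventions that strings beginning with '?' are variables and fresh labeled nulls are named _N1, _N2, and so on. It may not terminate for arbitrary rule sets, as is true of the chase in general.

```python
import itertools

_fresh = itertools.count(1)

def match(pattern, tup, binding):
    """Try to extend binding so that pattern matches tup.
    Variables start with '?'; everything else must match exactly."""
    binding = dict(binding)
    for p, v in zip(pattern, tup):
        if isinstance(p, str) and p.startswith("?"):
            if binding.setdefault(p, v) != v:
                return None
        elif p != v:
            return None
    return binding

def all_matches(instance, atoms, binding):
    """All extensions of binding under which every atom matches some tuple."""
    bindings = [binding]
    for rel, pattern in atoms:
        bindings = [b2 for b in bindings for t in instance.get(rel, set())
                    if (b2 := match(pattern, t, b)) is not None]
    return bindings

def chase(instance, tgds):
    """Fire tgds (lhs_atoms, rhs_atoms) until a fixpoint; mutates instance.
    For each lhs substitution whose rhs is not yet satisfied, add the rhs
    tuples, inventing a fresh labeled null per existential variable."""
    changed = True
    while changed:
        changed = False
        for lhs, rhs in tgds:
            for b in all_matches(instance, lhs, {}):
                if all_matches(instance, rhs, b):
                    continue  # rhs already witnessed, e.g. by existing nulls
                for var in {p for _, pat in rhs for p in pat
                            if isinstance(p, str) and p.startswith("?")} - b.keys():
                    b = {**b, var: f"_N{next(_fresh)}"}  # fresh labeled null
                for rel, pattern in rhs:
                    instance.setdefault(rel, set()).add(
                        tuple(b.get(p, p) for p in pattern))
                changed = True
    return instance

# The running example: chasing I(S) with the four mappings from slide 14.
I = {"Teaches": {("Ann", "Bob"), ("Chloe", "David")},
     "Adviser": {("Ellen", "Bob"), ("Felicia", "David")}}
tgds = [
    ([("Teaches", ("?p", "?s"))], [("Advise", ("?d", "?s"))]),
    ([("Teaches", ("?p", "?s"))],
     [("TeachesCourse", ("?p", "?c")), ("Takes", ("?c", "?s"))]),
    ([("Adviser", ("?a", "?s"))], [("Advise", ("?a", "?s"))]),
    ([("Adviser", ("?a", "?s"))],
     [("TeachesCourse", ("?d", "?c")), ("Takes", ("?c", "?s"))]),
]
J = chase(I, tgds)  # a universal solution, though not necessarily the core
```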
21. Core Universal Solutions
Universal solutions may be of arbitrary size
The core universal solution is the minimal universal
solution
22. Data Exchange and Querying
As with the data warehouse, all queries are directly posed over the target database – no reformulation necessary
However, we typically assume certain answers semantics
To get the certain answers (which are the same as in the virtual integration setting with GLAV/tgd mappings): compute the query answers, then drop any tuples with labeled nulls (variables)
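Under the conventions of the chase sketch above (labeled nulls named _N...), this filtering step is a one-liner; a minimal sketch:

```python
def certain_answers(tuples,
                    is_null=lambda v: isinstance(v, str) and v.startswith("_N")):
    """Certain answers semantics: drop any tuple containing a labeled null."""
    return {t for t in tuples if not any(is_null(v) for v in t)}

# For the query q(c, s) :- Takes(c, s) over the chased instance, every course
# is a labeled null, so no Takes tuple survives:
# certain_answers(J["Takes"]) == set()
```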
23. Data Exchange vs. Warehousing
From an external perspective, exchange and warehousing are essentially equivalent
But there are different trade-offs in procedural vs. declarative mappings
Procedural – more expressive
Declarative – easier to reason about, compose, invert, create materialized views for, etc. (see Chapter 6)
24. Outline
The data warehouse
Data exchange
Caching & partial materialization
Operating on external data
25. The Spectrum of Materialization
Many real EII systems compute and maintain materialized views, or cache results
A “hybrid” point between the fully virtual and fully materialized approaches
[Spectrum: at one end, virtual integration (EII), where only the sources are materialized; at the other, data exchange / the data warehouse, where all mediated relations are materialized; in between, caching or partial materialization, where some views are materialized]
26. Possible Techniques for Choosing What to Materialize
Cache results of prior queries
Take the results of each query, materialize them
Use answering queries using views to reuse them
Expire using time-to-live… may not always be fresh! (a cache sketch follows below)
Administrator-selected views
Someone manually specifies views to compute and maintain, as with a relational DBMS
The system automatically maintains them
Automatic view selection
Using the query workload and update frequencies, a view materialization wizard chooses what to materialize
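A minimal sketch of the first option, caching prior query results with a time-to-live; execute_fn is a hypothetical stand-in for whatever actually runs the query against the sources, and this version reuses a result only for an identical query string rather than doing answering-queries-using-views reuse.

```python
import time

class QueryResultCache:
    """Cache query results keyed by query text, expiring after ttl seconds."""
    def __init__(self, execute_fn, ttl_seconds=300.0):
        self.execute_fn = execute_fn  # hypothetical: runs the query for real
        self.ttl = ttl_seconds
        self._store = {}              # query text -> (fetched_at, result)

    def query(self, text):
        entry = self._store.get(text)
        if entry is not None and time.time() - entry[0] < self.ttl:
            return entry[1]           # within ttl: reuse, possibly stale
        result = self.execute_fn(text)  # miss or expired: go to the sources
        self._store[text] = (time.time(), result)
        return result
```

Answering queries using views (Chapter 2) would sit on top of such a cache, letting the system reuse a stored result for other queries it subsumes rather than only for exact repeats.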
27. Outline
The data warehouse
Data exchange
Caching & partial materialization
Operating on external data
28. Many “Integration-Like” Scenarios over Historical Data
Many Web scenarios where we have large logs of data accesses, created by the server
Goal: put these together and query them!
Looks like a very simple data integration scenario – external data, but single schema
A common approach: use programming environments like MapReduce (or SQL layers above it) to query the data on a cluster
MapReduce reliably runs large jobs across 100s or 1000s of “shared nothing” nodes in a cluster
29. MapReduce Basics
MapReduce is essentially a template for writing distributed programs – corresponding to a single SQL SELECT..FROM..WHERE..GROUP BY..HAVING block with user-defined functions
The MapReduce runtime calls a set of functions:
map is given a tuple, outputs 0 or more tuples in response – roughly like the WHERE clause
shuffle is a stage for doing sort-based grouping on a key (specified by the map)
reduce is an aggregate function called over the set of tuples with the same grouping key
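A single-machine model of that template, as a hedged sketch; real MapReduce distributes each phase across nodes and persists intermediate data, and the log record format here is hypothetical.

```python
from collections import defaultdict

def map_reduce(records, map_fn, reduce_fn):
    """Toy MapReduce: map emits (key, value) pairs (roughly WHERE plus
    projection), shuffle groups them by key via sorting, and reduce
    aggregates each group (roughly GROUP BY with an aggregate)."""
    groups = defaultdict(list)
    for record in records:                # map phase
        for key, value in map_fn(record):
            groups[key].append(value)
    results = []
    for key in sorted(groups):            # shuffle: sort-based grouping
        results.extend(reduce_fn(key, groups[key]))  # reduce phase
    return results

# Counting hits per URL in a web access log:
log = [{"url": "/a"}, {"url": "/b"}, {"url": "/a"}]
hits = map_reduce(log,
                  map_fn=lambda r: [(r["url"], 1)],
                  reduce_fn=lambda url, ones: [(url, sum(ones))])
# hits == [("/a", 2), ("/b", 1)]
```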
31. MapReduce as ETL
Some people use MapReduce to take data, transform it, and load it into a warehouse
… which is basically what ETL tools do!
The dividing line between DBMSs, EII, and MapReduce is blurring as of this book's writing:
SQL MapReduce
MapReduce over SQL engines
Shared-nothing DBMSs
NoSQL
32. Warehousing & Materialization Wrap-up
There are benefits to centralizing & materializing data
Performance, especially for analytics / mining
Archival
Standardization / canonicalization
Data warehouses typically use procedural ETL tools to
extract, transform, load (and clean) data
Data exchange replaces ETL with declarative
mappings (where feasible)
Hybrid schemes exist for partial materialization
Increasingly we are integrating via MapReduce and its
cousins