Data Integration
* Data integration involves combining data from several
disparate sources, which are stored using various technologies,
to provide a unified view of the data.
* Such a unified store is often called a data warehouse.
* It merges data from multiple data stores (data
sources).
* These may include multiple databases, data cubes or flat
files.
* Metadata, correlation analysis, data conflict detection
and resolution of semantic heterogeneity contribute towards
smooth data integration.
Data Integration Defined :
It combines data from multiple sources into a coherent
data store, as in data warehousing. These sources may include
multiple databases, data cubes, or flat files.
A data integration system is formally defined as a triple <G, S, M>,
where G: the global schema
S: the heterogeneous set of source schemas
M: the mapping between queries over the sources and the global
schema
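As a toy illustration of the <G, S, M> triple, the sketch below encodes a global customer schema, two source schemas, and attribute-level mappings between them; every schema and attribute name here is hypothetical, invented only for illustration:

```python
# A toy sketch of the <G, S, M> formalization; all names are hypothetical.
G = {"customer": ["id", "name", "city"]}            # global schema
S = {                                               # source schemas
    "db_a": {"client": ["client_no", "full_name"]},
    "db_b": {"cust":   ["cust_number", "town"]},
}
M = {                                               # mapping: global attribute -> source attributes
    ("customer", "id"):   [("db_a", "client", "client_no"),
                           ("db_b", "cust", "cust_number")],
    ("customer", "name"): [("db_a", "client", "full_name")],
    ("customer", "city"): [("db_b", "cust", "town")],
}
```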
Advantages :
1. Independence.
2. Faster query processing.
3. Complex query processing.
4. Advanced data summarization & storage possible.
5. High volume data processing.
Disadvantages :
1. Latency (since data needs to be loaded using ETL).
2. Costlier (data localization, infrastructure, security).
Data Integration Approach:
There are two major approaches to data
integration: the “tight coupling approach” and the
“loose coupling approach”.
Tight Coupling:
• Here, a data warehouse is treated as an information retrieval component.
• In this coupling, data is combined from different sources into a single physical
location through the process of ETL – Extraction, Transformation and Loading.
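A minimal sketch of the tight coupling idea follows; the sources, rows, and the in-memory "warehouse" are all hypothetical stand-ins, not a real ETL tool:

```python
# A minimal sketch of tight coupling via ETL; the source rows and the
# warehouse are hypothetical in-memory structures, not a real system.
warehouse = []  # the single physical location

def extract():
    # Pull rows from two heterogeneous sources.
    return [("db_a", {"client_no": 1, "full_name": "Ada"}),
            ("db_b", {"cust_number": 2, "town": "Agra"})]

def transform(source, row):
    # Map each source's vocabulary onto one global schema.
    if source == "db_a":
        return {"customer_id": row["client_no"], "name": row["full_name"]}
    return {"customer_id": row["cust_number"], "city": row["town"]}

def load(rows):
    warehouse.extend(rows)

load(transform(src, row) for src, row in extract())
print(warehouse)
```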
Loose Coupling:
• Here, an interface is provided that takes the query from the user, transforms it in a
way the source database can understand and then sends the query directly to the
source databases to obtain the result.
• The data remains only in the actual source databases.
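The sketch below illustrates the loose coupling idea of query translation; the source names and column names are hypothetical:

```python
# A minimal sketch of loose coupling: a mediator rewrites the user's
# query into each source's own vocabulary; all names are hypothetical.
SOURCE_KEY = {"db_a": "client_no", "db_b": "cust_number"}

def translate(customer_id, source):
    # The global attribute customer_id becomes the source's own column.
    key = SOURCE_KEY[source]
    return f"SELECT * FROM customers WHERE {key} = {customer_id}"

# The query is sent to each source; the data itself stays where it is.
for src in SOURCE_KEY:
    print(src, "->", translate(42, src))
```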
There are a number of issues to consider during data integration:
1. Schema Integration.
2. Redundancy.
3. Detection and resolution of data value conflicts.
Schema integration :
Matching up real-world entities from multiple data sources
is referred to as the entity identification problem.
For example,
How can the data analyst or the computer be sure that customer_id in
one database and cust_number in another refer to the same
entity? Databases and data warehouses typically have metadata, that
is, data about the data, which can be used to help avoid errors in
schema integration.
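As a small illustration, the sketch below uses the metadata knowledge that cust_number and customer_id denote the same entity; the tables, columns and rows are hypothetical:

```python
# A minimal sketch of resolving the entity identification problem with
# pandas; the tables and rows are hypothetical.
import pandas as pd

db_a = pd.DataFrame({"customer_id": [1, 2], "name": ["Ada", "Ben"]})
db_b = pd.DataFrame({"cust_number": [1, 2], "city": ["Pune", "Agra"]})

# Metadata tells us the two key columns denote the same entity, so we
# rename one of them before merging into a coherent view.
unified = db_a.merge(db_b.rename(columns={"cust_number": "customer_id"}),
                     on="customer_id")
print(unified)
```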
Redundancy :
* It is another important issue.
* An attribute may be redundant if it can be “derived”
from another table, such as annual revenue.
* Some redundancies can be detected by correlation
analysis.
For example,
Given two attributes, such analysis can measure how
strongly one attribute implies the other based on the
available data.
The correlation between attributes A and B can be measured by

    r(A,B) = Σ (a - Ā)(b - B̄) / ((n - 1) σ_A σ_B)

where n is the number of tuples, Ā and B̄ are the means of A and B,
and σ_A and σ_B are their standard deviations. The higher the value,
the more strongly the attributes are correlated, and one of them may
be removed as redundant.
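A minimal sketch of this check, implementing the r(A,B) formula above on hypothetical attribute values:

```python
# A minimal sketch of redundancy detection by correlation analysis,
# implementing r(A,B) on hypothetical values.
import statistics

annual_revenue = [120, 150, 180, 210, 240]
monthly_avg    = [10.0, 12.5, 15.0, 17.5, 20.0]  # annual_revenue / 12

def correlation(a, b):
    n = len(a)
    mean_a, mean_b = statistics.mean(a), statistics.mean(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    return cov / ((n - 1) * statistics.stdev(a) * statistics.stdev(b))

# r = 1.0: one attribute is fully derivable from the other (redundant).
print(correlation(annual_revenue, monthly_avg))
```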
Detection and resolution of data value conflicts :
* A third important issue in data integration is the
detection and resolution of data value conflicts.
* For the same real-world entity, attribute values from
different sources may differ. This may be due to differences in
representation, scaling, or encoding.
* An attribute in one system may be recorded at a
lower level of abstraction than the “same” attribute in another.
* For example, the total sales in one database may
refer to one branch of All Electronics, while an attribute of the same
name in another database may refer to the total sales for All
Electronics stores in a given region.
Data Transformation
* In data transformation, the data are transformed or
consolidated into forms appropriate for mining.
* Data transformation can involve
1. Smoothing.
2. Aggregation.
3. Generalization.
4. Normalization.
5. Attribute construction.
Smoothing :
Works to remove noise from the data. Such
techniques include binning, clustering and regression.
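A minimal sketch of one such technique, smoothing by bin means; the price list is a hypothetical example, partitioned into equal-frequency bins:

```python
# A minimal sketch of smoothing by bin means on hypothetical prices.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]

bin_size = 3
smoothed = []
for i in range(0, len(prices), bin_size):
    bin_vals = prices[i:i + bin_size]
    mean = sum(bin_vals) / len(bin_vals)
    smoothed.extend([mean] * len(bin_vals))  # each value -> its bin mean

print(smoothed)  # bin means: 9.0, 22.0, 29.0
```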
Aggregation :
* Where summary or aggregation operations are applied to
the data.
* For example, the daily sales data may be aggregated so
as to compute monthly and annual total amounts.
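A minimal sketch of this roll-up with pandas; the dates and sales figures are hypothetical:

```python
# A minimal sketch of aggregation: hypothetical daily sales rolled up
# into monthly totals with pandas.
import pandas as pd

daily = pd.DataFrame({
    "date":  pd.to_datetime(["2024-01-05", "2024-01-20", "2024-02-11"]),
    "sales": [100, 150, 200],
})

monthly = daily.groupby(daily["date"].dt.to_period("M"))["sales"].sum()
print(monthly)  # 2024-01: 250, 2024-02: 200
```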
Generalization :
* Where low-level or “primitive” data are replaced
by higher-level concepts through the use of concept
hierarchies.
* For example, an attribute like street can be generalized
to the higher-level concept city or country, and numeric
attributes like age can be mapped to higher-level concepts such as
young, middle-aged and senior.
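A minimal sketch of climbing such a hierarchy (street -> city -> country); the hierarchy entries are hypothetical:

```python
# A minimal sketch of generalization via a concept hierarchy;
# the entries are hypothetical.
city_of = {"MG Road": "Pune", "Park Street": "Kolkata"}
country_of = {"Pune": "India", "Kolkata": "India"}

def generalize(street, level):
    city = city_of[street]
    return city if level == "city" else country_of[city]

print(generalize("MG Road", "city"))     # Pune
print(generalize("MG Road", "country"))  # India
```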
Normalization :
Where the attribute data are scaled so as to fall within
a specified range, such as -1.0 to 1.0 or 0.0 to 1.0.
Attribute construction :
Where new attributes are constructed and added
from the given set of attributes to help the mining
process.
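A minimal sketch of attribute construction; the records and the derived "area" attribute are hypothetical:

```python
# A minimal sketch of attribute construction: a new attribute "area"
# is derived from existing width and height (hypothetical records).
records = [{"width": 2.0, "height": 3.0}, {"width": 4.0, "height": 1.5}]

for r in records:
    r["area"] = r["width"] * r["height"]  # the constructed attribute

print(records)
```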
There are several methods for data normalization:
* Min-Max normalization.
* Z-Score normalization.
* Normalization by decimal scaling.
Min-Max Normalization :
It performs a linear transformation on the original data.
Suppose that min_A and max_A are the minimum and
maximum values of an attribute A. Min-max
normalization maps a value v of A to v' in the range
[new_min_A, new_max_A] by computing

    v' = ((v - min_A) / (max_A - min_A)) * (new_max_A - new_min_A) + new_min_A
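A minimal sketch of this mapping with the target range [0, 1]; the income values are hypothetical:

```python
# A minimal sketch of min-max normalization to [0, 1] on
# hypothetical income values.
values = [12000, 54000, 73600, 98000]

min_a, max_a = min(values), max(values)
new_min, new_max = 0.0, 1.0  # target range

normalized = [(v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min
              for v in values]
print(normalized)  # 73600 maps to about 0.716
```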
Z-Score Normalization :
In z-score normalization, the values of an attribute A
are normalized based on the mean and standard deviation of
A. A value v of A is normalized to v' by computing

    v' = (v - Ā) / σ_A

where Ā is the mean and σ_A the standard deviation of A.
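A minimal sketch of the same computation on hypothetical values:

```python
# A minimal sketch of z-score normalization on hypothetical values.
import statistics

values = [12000, 54000, 73600, 98000]
mean_a = statistics.mean(values)
std_a = statistics.stdev(values)  # sample standard deviation

normalized = [(v - mean_a) / std_a for v in values]
print(normalized)  # values above the mean come out positive
```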
Normalization by Decimal Scaling :
Normalization by decimal scaling normalizes by moving
the decimal point of values of attribute A.
The number of decimal places moved depends on the
maximum absolute value of A. A value v of A is normalized
to v' by computing

    v' = v / 10^j

where j is the smallest integer such that max(|v'|) < 1.
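A minimal sketch of finding j and scaling; the values are hypothetical:

```python
# A minimal sketch of normalization by decimal scaling on
# hypothetical values.
import math

values = [-986, 217, 917]

# j is the smallest integer such that max(|v / 10**j|) < 1.
j = math.ceil(math.log10(max(abs(v) for v in values) + 1))
normalized = [v / 10**j for v in values]
print(j, normalized)  # 3, [-0.986, 0.217, 0.917]
```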