1. MODULE 2
• Data Preprocessing: Data Preprocessing Concepts, Data Cleaning, Data
integration and transformation, Data Reduction, Discretization and concept
hierarchy
6/30/2020 NIMMY RAJU,AP,VKCET,TVM 1
3. Data Quality: Why Preprocess the Data?
• Measures for data quality: A multidimensional view
– Accuracy: correct or wrong, accurate or not
– Completeness: not recorded, unavailable, …
– Consistency: some modified but some not, dangling, …
– Timeliness: timely update?
– Believability: how trustworthy is the data?
– Interpretability: how easily the data can be understood?
4. There are many possible reasons for inaccurate data (i.e., having incorrect attribute
values):
o The data collection instruments used may be faulty.
o There may have been human or computer errors occurring at data entry.
o Users may purposely submit incorrect data values for mandatory fields when they
do not wish to submit personal information (e.g., by choosing the default value
“January 1” displayed for birthday). This is known as disguised missing data.
o Errors in data transmission can also occur.
o There may be technology limitations such as limited buffer size for coordinating
synchronized data transfer and consumption.
o Incorrect data may also result from inconsistencies in naming conventions or data
codes, or inconsistent formats for input fields (e.g., date).
o Duplicate tuples also require data cleaning.
5. Major Tasks in Data Preprocessing
• Data cleaning
– Fill in missing values, smooth noisy data, identify or remove outliers,
and resolve inconsistencies
• Data integration
– Integration of multiple databases, data cubes, or files
• Data reduction
– Dimensionality reduction
– Numerosity reduction
– Data compression
• Data transformation and data discretization
– Normalization
– Concept hierarchy generation
7. Data Cleaning
• Real-world data tend to be incomplete, noisy, and inconsistent.
– incomplete: lacking attribute values….
• e.g., Occupation=“ ” (missing data)
– noisy: containing noise, errors, or outliers
• e.g., Salary=“−10” (an error)
– inconsistent: containing discrepancies in codes or names,
• e.g., Age=“42”, Birthday=“03/07/2010”
• Was rating “1, 2, 3”, now rating “A, B, C”
• duplicate records
8. • Data cleaning (or data cleansing)
routines work to “clean” the data by
filling in missing values, smoothing
noisy data, identifying or removing
outliers, and resolving
inconsistencies
9. Incomplete (Missing) Data
• Data is not always available
– E.g., many tuples have no recorded value for several
attributes, such as customer income in sales data
• Missing data may be due to
– equipment malfunction
– inconsistent with other recorded data and thus deleted
– data not entered due to misunderstanding
– certain data may not be considered important at the time
of entry
10. How to Handle Missing Data?
• Ignore the tuple: usually done when class label is missing (when
doing classification)
• Fill in the missing value manually: tedious, and often infeasible for large data sets
• Fill it in automatically with
– a global constant, e.g., “unknown”; the mining program may then
mistake it for an interesting concept (a new class)
– the attribute mean
– the attribute mean for all samples belonging to the same
class: smarter
– the most probable value: inference-based such as Bayesian
formula or decision tree.
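The class-conditional mean strategy above can be sketched in a few lines of Python. This is a minimal illustration, not a library routine; the dict-based row layout and the key names are assumptions.

```python
from statistics import mean

def impute_by_class_mean(rows, value_key, class_key):
    """Fill missing values (None) with the mean of the same class;
    fall back to the overall attribute mean when a class has no data."""
    present = [r for r in rows if r[value_key] is not None]
    overall = mean(r[value_key] for r in present)
    by_class = {}
    for r in present:
        by_class.setdefault(r[class_key], []).append(r[value_key])
    class_means = {c: mean(vs) for c, vs in by_class.items()}
    for r in rows:
        if r[value_key] is None:
            r[value_key] = class_means.get(r[class_key], overall)
    return rows
```

A missing income in class "A" is replaced by the mean income of the other class-"A" tuples, which is usually smarter than using the global mean.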
11. Noisy Data
• Noise: random error or variance in a measured variable
• Incorrect attribute values may be due to
– faulty data collection instruments
– data entry problems
– data transmission problems
– technology limitation
– inconsistency in naming convention
• Other data problems which require data cleaning
– duplicate records
– incomplete data
– inconsistent data
12. How to Handle Noisy Data?
• Binning
– first sort data and partition into (equal-frequency) bins
– then one can smooth by bin means, smooth by bin median,
smooth by bin boundaries, etc.
– In smoothing by bin means, each value in a bin is replaced
by the mean value of the bin.
– In smoothing by bin medians, each bin value is replaced by
the bin median.
– In smoothing by bin boundaries, the minimum and
maximum values in a given bin are identified as the bin
boundaries. Each bin value is then replaced by the closest
boundary value
14. How to Handle Noisy Data?
• Regression
– a technique that conforms data values to a function,
e.g., linear regression, multiple linear regression
• Clustering
– Similar values are organized into groups, or “clusters.”
– Values that fall outside of the set of clusters may be
considered outliers.
• Combined computer and human inspection
– detect suspicious values and check by human
15. Data Integration
• Data integration:
– Combines data from multiple sources into a coherent store
• Careful integration can help reduce and avoid redundancies and
inconsistencies in the resulting data set.
• This can help improve the accuracy and speed of the subsequent data mining
process.
• Issues in data Integration
• 1.Entity identification problem:
– Schema integration and object matching -How can equivalent real-world
entities from multiple data sources be matched up?
– For example, how can the data analyst or the computer be sure that
customer id in one database and cust number in another refer to the
same attribute?
16. Data Integration
– 2. Detecting and resolving data value conflicts
• For the same real world entity, attribute values from different
sources are different
• Possible reasons: different representations, different scales, e.g.,
metric vs. British units
– 3. Redundancy
• An attribute may be redundant if it can be “derived” from another
attribute or set of attributes.
• Inconsistencies in attribute or dimension naming can also cause
redundancies in the resulting data set.
• Some redundancies can be detected by correlation analysis
17. Handling Redundancy in Data Integration
• Redundant attributes may be able to be detected by
correlation analysis and covariance analysis
• Given two attributes, such analysis can measure how strongly
one attribute implies the other, based on the
available data.
• For nominal data, we use the Χ2 (chi-square) test.
• For numeric attributes, use the correlation coefficient and
covariance, both of which assess how one attribute’s values
vary from those of another.
18. Correlation Analysis (Nominal Data)
• Χ2 (chi-square) test
• The larger the Χ2 value, the more likely the variables are related
• Suppose A has c distinct values, namely a1,a2, …………..ac
• B has r distinct values, namely b1,b2, …………..br .
• The data tuples described by A and B can be shown as a contingency table,
with the c values of A making up the columns and the r values of B making
up the rows.
χ² = Σᵢ Σⱼ (oᵢⱼ − eᵢⱼ)² / eᵢⱼ
i.e., the sum over all table cells of (Observed − Expected)² / Expected
19. Correlation Analysis (Nominal Data)
• The expected frequency of the joint event (A = aᵢ, B = bⱼ) is
eᵢⱼ = count(A = aᵢ) × count(B = bⱼ) / n
where n is the number of data tuples, count(A = aᵢ) is the number of tuples
having value aᵢ for A, and count(B = bⱼ) is the number of tuples having value bⱼ
for B.
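A minimal sketch of the χ² computation over a contingency table; the counts in the usage note are hypothetical, chosen only to exercise the formula.

```python
def chi_square(table):
    """Pearson chi-square for an r x c contingency table given as rows
    of observed counts: sum over cells of (o - e)^2 / e, where
    e_ij = (row total i) * (column total j) / n."""
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, o in enumerate(row):
            e = row_tot[i] * col_tot[j] / n
            chi2 += (o - e) ** 2 / e
    return chi2
```

For example, `chi_square([[250, 200], [50, 1000]])` evaluates to roughly 507.9; the larger the χ² value, the more likely the two attributes are related.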
21. Correlation Analysis (Numeric Data)
• Correlation coefficient (also called Pearson’s product moment
coefficient):
r(A,B) = Σᵢ₌₁ⁿ (aᵢ − Ā)(bᵢ − B̄) / (n σA σB) = (Σᵢ₌₁ⁿ aᵢbᵢ − n Ā B̄) / (n σA σB)
where n is the number of tuples, Ā and B̄ are the respective means of A
and B, σA and σB are the respective standard deviations of A and B, and
Σ aᵢbᵢ is the sum of the AB cross-product.
• If r(A,B) > 0, A and B are positively correlated (A’s values increase as
B’s do). The higher the value, the stronger the correlation.
• r(A,B) = 0: independent; r(A,B) < 0: negatively correlated
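The formula above translates directly into Python; this sketch uses population standard deviations (division by n), matching the slide’s definition.

```python
def pearson_r(a, b):
    """r(A,B) = (sum(a_i*b_i) - n*mean_A*mean_B) / (n*sigma_A*sigma_B),
    using population standard deviations as in the formula above."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sa = (sum((x - ma) ** 2 for x in a) / n) ** 0.5
    sb = (sum((y - mb) ** 2 for y in b) / n) ** 0.5
    return (sum(x * y for x, y in zip(a, b)) - n * ma * mb) / (n * sa * sb)
```

Perfectly linearly related attributes give r = 1 (or r = −1 when one decreases as the other increases).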
22. Covariance (Numeric Data)
• Covariance is similar to correlation:
Cov(A, B) = E[(A − Ā)(B − B̄)] = Σᵢ₌₁ⁿ (aᵢ − Ā)(bᵢ − B̄) / n
and the correlation coefficient can be expressed as
r(A,B) = Cov(A, B) / (σA σB)
where n is the number of tuples, Ā and B̄ are the respective mean or
expected values of A and B, and σA and σB are the respective standard
deviations of A and B.
• Positive covariance: If Cov(A, B) > 0, then A and B both tend to be larger than their
expected values.
• Negative covariance: If Cov(A, B) < 0, then if A is larger than its expected value, B is
likely to be smaller than its expected value.
• Independence: If A and B are independent, Cov(A, B) = 0 (the converse does not
hold in general).
23. • It can be simplified in computation as
Cov(A, B) = E(A·B) − Ā B̄
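The simplified form is a one-liner; the two price series in the usage note are hypothetical values for illustration.

```python
def covariance(a, b):
    """Cov(A, B) = E(A*B) - mean_A * mean_B (the simplified form above)."""
    n = len(a)
    return sum(x * y for x, y in zip(a, b)) / n - (sum(a) / n) * (sum(b) / n)
```

For example, `covariance([6, 5, 4, 3, 2], [20, 10, 14, 5, 5])` is 7.0: a positive covariance, so the two series tend to rise above their means together.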
25. Data Reduction
• Complex data analysis and mining on huge amounts
of data can take a long time, making such analysis
impractical or infeasible.
• Data reduction: Obtain a reduced representation of
the data set that is much smaller in volume but yet
produces the same (or almost the same) analytical
results
• Data reduction strategies include dimensionality
reduction, numerosity reduction, and data
compression
26. Overview of Data Reduction Strategies
• Dimensionality reduction: the process of reducing the number of
random variables or attributes under consideration.
• Wavelet transforms
• Principal Components Analysis (PCA)
• Attribute subset selection
• Wavelet transforms and principal components analysis which
transform or project the original data onto a smaller space.
• Attribute subset selection is a method of dimensionality
reduction in which irrelevant, weakly relevant, or redundant
attributes or dimensions are detected and removed.
27. – Numerosity reduction: replace the original data volume by
alternative, smaller forms of data representation.
– These techniques may be parametric or nonparametric.
– For parametric methods, a model is used to estimate the data,
so that typically only the data parameters need to be stored,
instead of the actual data.
• Regression and log-linear models are examples.
– Nonparametric methods for storing reduced representations
of the data include histograms, clustering, sampling and data
cube aggregation .
28. – Data compression:
– In data compression, transformations are applied so as to
obtain a reduced or “compressed” representation of the
original data.
– If the original data can be reconstructed from the
compressed data without any information loss, the data
reduction is called lossless.
– If, instead, we can reconstruct only an approximation of
the original data, then the data reduction is called lossy.
29. 1. Wavelet transforms
• The discrete wavelet transform (DWT) is a linear signal
processing technique that, when applied to a data vector X,
transforms it to a numerically different vector, X′, of wavelet
coefficients.
• The two vectors are of the same length.
• “How can this technique be useful for data reduction if the
wavelet transformed data are of the same length as the original
data?”
• A compressed approximation of the data can be retained by
storing only a small fraction of the strongest of the wavelet
coefficients.
• For example, all wavelet coefficients larger than some user-
specified threshold can be retained. All other coefficients are set
to 0.
30. •The technique removes noise without smoothing out the main features of
the data, making it effective for data cleaning as well.
•Given a set of coefficients, an approximation of the original data can be
constructed by applying the inverse of the DWT used.
•The DWT is closely related to the discrete Fourier transform (DFT), a
signal processing technique involving sines and cosines.
•In general, however, the DWT achieves better lossy compression.
•That is, if the same number of coefficients is retained for a DWT and a
DFT of a given data vector, the DWT version will provide a more accurate
approximation of the original data.
•DWT requires less space than the DFT.
•Unlike the DFT, wavelets are quite localized in space.
•There is only one DFT, yet there are several families of DWTs.
31. The general procedure for applying a discrete wavelet transform
uses a hierarchical pyramid algorithm that halves the data at
each iteration, resulting in fast computational speed. The method
is as follows:
1. The length, L, of the input data vector must be an integer power
of 2. This condition can be met by padding the data vector with
zeros as necessary (L ≥ n, where n is the length of the original data).
2. Each transform involves applying two functions. The first
applies some data smoothing, such as a sum or weighted average.
The second performs a weighted difference, which acts to bring out
the detailed features of the data.
3. The two functions are applied to pairs of data points in X, that
is, to all pairs of measurements .This results in two data sets of
length L/2.
32. 4. The two functions are recursively applied to the data sets
obtained in the previous loop, until the resulting data sets obtained
are of length 2.
5. Selected values from the data sets obtained in the previous
iterations are designated the wavelet coefficients of the transformed
data.
• Equivalently, a matrix multiplication can be applied to the input
data in order to obtain the wavelet coefficients, where the matrix used
depends on the given DWT
33. Wavelet Decomposition
• S = [2, 2, 0, 2, 3, 5, 4, 4] can be transformed to S^ = [23/4, -11/4, 1/2,
0, 0, -1, -1, 0]
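The pyramid algorithm can be sketched with the unnormalized Haar wavelet: each pass replaces pairs of values by their average (smoothing) and half-difference (detail). Note this is only one of several conventions; the coefficient values depend on the normalization chosen, so they may differ by scaling or sign from those quoted on the slide.

```python
def haar_dwt(x):
    """Unnormalized Haar pyramid: repeatedly replace each pair of values
    by its average (smoothing) and its half-difference (detail).
    len(x) must be a power of 2."""
    coeffs = []
    while len(x) > 1:
        avgs = [(x[i] + x[i + 1]) / 2 for i in range(0, len(x), 2)]
        dets = [(x[i] - x[i + 1]) / 2 for i in range(0, len(x), 2)]
        coeffs = dets + coeffs        # finer-level details go further right
        x = avgs
    return x + coeffs                 # overall average first, then details
```

Retaining only the largest-magnitude coefficients (and setting the rest to 0) gives the compressed approximation described above.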
34. 2.Principal Components Analysis (PCA)
• Principal components analysis searches for k n-
dimensional orthogonal vectors that can best be
used to represent the data, where k ≤ n.
• The original data are thus projected onto a much
smaller space, resulting in dimensionality
reduction.
• PCA “combines” the essence of attributes by
creating an alternative, smaller set of variables.
The initial data can then be projected onto this
smaller set.
35. Principal Component Analysis (Steps)
• Given N data vectors from n-dimensions, find k ≤ n orthogonal vectors
(principal components) that can be best used to represent data
– Normalize input data: Each attribute falls within the same range
– Compute k orthonormal (unit) vectors, i.e., principal components
– Each input data (vector) is a linear combination of the k principal
component vectors
– The principal components are sorted in order of decreasing “significance”
or strength
– Since the components are sorted, the size of the data can be reduced by
eliminating the weak components, i.e., those with low variance (i.e., using
the strongest principal components, it is possible to reconstruct a good
approximation of the original data)
• Works for numeric data only.
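The steps above can be sketched with NumPy via an eigendecomposition of the covariance matrix; this is a bare-bones illustration, not a production PCA (libraries typically use the SVD instead).

```python
import numpy as np

def pca(X, k):
    """Project the n-dimensional rows of X onto the k strongest principal
    components: center the data, eigendecompose the covariance matrix,
    and keep the k eigenvectors with the largest eigenvalues."""
    Xc = X - X.mean(axis=0)                 # normalize input to zero mean
    eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    order = np.argsort(eigvals)[::-1][:k]   # sort by decreasing "significance"
    return Xc @ eigvecs[:, order]           # reduced representation
```

For data that lie on a line in 2-D, the second component carries no variance, so dropping it loses nothing, which is exactly the reduction PCA exploits.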
36. 3. Attribute subset selection
• Data sets for analysis may contain hundreds of attributes,
many of which may be irrelevant to the mining task or
redundant.
• Leaving out relevant attributes or keeping irrelevant attributes
may be detrimental, causing confusion for the mining
algorithm employed.
• This can result in discovered patterns of poor quality.
• In addition, the added volume of irrelevant or redundant
attributes can slow down the mining process.
37. • Attribute subset selection reduces the data set size by
removing such attributes or dimensions from it.
• The goal of attribute subset selection is to find a
minimum set of attributes such that the resulting
probability distribution of the data classes is as close as
possible to the original distribution obtained using all
attributes.
• Mining on a reduced set of attributes has an additional
benefit: It reduces the number of attributes appearing in
the discovered patterns, helping to make the patterns
easier to understand.
38. • “How can we find a ‘good’ subset of the original
attributes?”
• For n attributes, there are 2^n possible subsets.
• Heuristic methods that explore a reduced search space
are commonly used for attribute subset selection.
• Their strategy is to make a locally optimal choice
in the hope that this will lead to a globally optimal
solution.
39. • Basic heuristic methods of attribute subset selection include the
techniques that follow.
• 1. Stepwise forward selection: The procedure starts with an empty
set of attributes as the reduced set. The best of the original attributes
is determined and added to the reduced set. At each subsequent
iteration or step, the best of the remaining original attributes is
added to the set.
• 2. Stepwise backward elimination: The procedure starts with the
full set of attributes. At each step, it removes the worst attribute
remaining in the set.
• 3. Combination of forward selection and backward elimination:
The stepwise forward selection and backward elimination methods
can be combined so that, at each step, the procedure selects the best
attribute and removes the worst from among the remaining
attributes.
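Stepwise forward selection (method 1 above) can be sketched as a greedy loop. The `score` argument is an assumed user-supplied subset-evaluation function (e.g., classification accuracy on held-out data); `max_k` caps the size of the reduced set.

```python
def forward_selection(attributes, score, max_k):
    """Stepwise forward selection: start with an empty reduced set and,
    at each step, add the attribute that most improves score(subset);
    stop when no attribute helps or max_k attributes are selected."""
    selected, remaining = [], list(attributes)
    while remaining and len(selected) < max_k:
        best = max(remaining, key=lambda a: score(selected + [a]))
        if score(selected + [best]) <= score(selected):
            break                       # locally optimal: no improvement left
        selected.append(best)
        remaining.remove(best)
    return selected
```

Each step makes the locally optimal choice, in the hope (not the guarantee) that it leads to a globally good subset; backward elimination is the mirror image, starting from the full set.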
40. • 4. Decision tree induction:
• Each internal (nonleaf) node denotes a test on an
attribute, each branch corresponds to an outcome of
the test, and each external (leaf) node denotes a
class prediction.
• At each node, the algorithm chooses the “best”
attribute to partition the data into individual classes.
• When decision tree induction is used for attribute
subset selection, a tree is constructed from the given
data.
• All attributes that do not appear in the tree are
assumed to be irrelevant.
• The set of attributes appearing in the tree form the
reduced subset of attributes.
42. 4. Regression and log-linear models
• Regression and log-linear models can be used to approximate the
given data.
• In linear regression, the data are modeled to fit a straight line.
• For example, a random variable, Y (called a response variable), can
be modeled as a linear function of another random variable, x
(called a predictor variable), with the equation
• y = wx + b,
where the variance of y is assumed to be constant. In the context of
data mining, x and y are numeric database attributes. The
coefficients, w and b (called regression coefficients), specify the
slope of the line and the y-intercept, respectively.
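The regression coefficients w and b have closed-form least-squares estimates, sketched here for a single predictor:

```python
def fit_line(xs, ys):
    """Least-squares estimates for y = w*x + b:
    w = sum((x - mx)(y - my)) / sum((x - mx)^2),  b = my - w*mx."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    w = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    b = my - w * mx
    return w, b
```

Once w and b are stored, the original (x, y) tuples can be discarded, which is exactly why parametric models count as numerosity reduction.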
43. • Multiple linear regression is an extension of (simple) linear
regression, which allows a response variable, y, to be modeled
as a linear function of two or more predictor variables.
• Log-linear models approximate discrete multidimensional
probability distributions.
• Given a set of tuples in n dimensions (e.g., described by n
attributes), consider each tuple as a point in an n-dimensional
space.
44. 5. Histogram
• Histograms use binning.
• A histogram for an attribute, A, partitions the data distribution
of A into disjoint subsets, referred to as buckets or bins.
• If each bucket represents only a single attribute–
value/frequency pair, the buckets are called singleton buckets
46. • Equal-width: In an equal-width histogram,
the width of each bucket range is uniform
(e.g., the width of $10 for the buckets)
• Equal-frequency (or equal-depth): In an
equal-frequency histogram, the buckets are
created so that, roughly, the frequency of each
bucket is constant (i.e., each bucket contains
roughly the same number of contiguous data
samples).
48. 6. Clustering
• Clustering techniques consider data tuples as objects.
• They partition the objects into groups, or clusters, so that objects
within a cluster are “similar” to one another and “dissimilar” to
objects in other clusters.
• Similarity is defined in terms of how “close” the objects are in
space, based on a distance function.
• The “quality” of a cluster may be represented by its diameter,
the maximum distance between any two objects in the cluster.
• Centroid distance is an alternative measure of cluster quality and
is defined as the average distance of each cluster object from the
cluster centroid .
49. 7.Sampling
• Sampling can be used as a data reduction technique because it
allows a large data set to be represented by a much smaller
random data sample (or subset). Suppose that a large data set, D,
contains N tuples.
• Simple random sample without replacement (SRSWOR) of
size s: This is created by drawing s of the N tuples from D (s <
N), where the probability of drawing any tuple in D is 1/N,
that is, all tuples are equally likely to be sampled.
• Simple random sample with replacement (SRSWR) of size
s: Here each time a tuple is drawn from D, it is recorded and
then replaced. That is, after a tuple is drawn, it is placed back
in D so that it may be drawn again.
50. • Cluster sample: If the tuples in D are grouped into M
mutually disjoint “clusters,” then an SRS of s clusters can
be obtained, where s < M.
• Stratified sample: If D is divided into mutually disjoint
parts called strata, a stratified sample of D is generated by
obtaining an SRS at each stratum. This helps ensure a
representative sample, especially when the data are
skewed.
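The four sampling schemes map onto Python's standard `random` module almost directly; `stratum_of` and `frac` below are assumed helper parameters for illustration (a key function identifying each tuple's stratum, and the sampling fraction per stratum).

```python
import random

def srswor(data, s):
    """SRSWOR: draw s distinct tuples, each equally likely."""
    return random.sample(data, s)

def srswr(data, s):
    """SRSWR: each drawn tuple is replaced, so it may be drawn again."""
    return [random.choice(data) for _ in range(s)]

def stratified_sample(data, stratum_of, frac):
    """Draw an SRS within each stratum; helps when the data are skewed."""
    strata = {}
    for t in data:
        strata.setdefault(stratum_of(t), []).append(t)
    sample = []
    for tuples in strata.values():
        sample.extend(random.sample(tuples, max(1, round(len(tuples) * frac))))
    return sample
```

A cluster sample follows the same pattern as SRSWOR, but with `data` holding the M clusters rather than individual tuples.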
52. 8. Data aggregation
Data can be aggregated so that the resulting data summarize the total
sales per year instead of per quarter
53. • Data cubes store multidimensional aggregated information.
• For example, Figure shows a data cube for multidimensional
analysis of sales data with respect to annual sales per item type .
54. • Each cell holds an aggregate data value, corresponding to the data point in
multidimensional space.
• Data cubes provide fast access to precomputed, summarized data, thereby
benefiting online analytical processing as well as data mining.
• The cube created at the lowest abstraction level is referred to as the base
cuboid.
• Lowest level should be usable, or useful for the analysis.
• A cube at the highest level of abstraction is the apex cuboid.
• The apex cuboid would give one total—the total sales for all three years, for all
item types, and for all branches.
• Each higher abstraction level further reduces the resulting data size.
55. Data Transformation
• A function that maps the entire set of values of a given attribute to a new
set of replacement values such that each old value can be identified with
one of the new values
• Methods
– Smoothing: Remove noise from data
– Attribute/feature construction
• New attributes constructed from the given ones
– Aggregation: Summarization, data cube construction
– Normalization: Scaled to fall within a smaller, specified range
• min-max normalization
• z-score normalization
• normalization by decimal scaling
– Discretization: Concept hierarchy climbing
56. Normalization
• Min-max normalization: to [new_minA, new_maxA]
v′ = ((v − minA) / (maxA − minA)) × (new_maxA − new_minA) + new_minA
– Ex. Let income range $12,000 to $98,000 be normalized to [0.0, 1.0].
Then $73,600 is mapped to
((73,600 − 12,000) / (98,000 − 12,000)) × (1.0 − 0) + 0 = 0.716
• Z-score normalization (μ: mean, σ: standard deviation):
v′ = (v − μA) / σA
– Ex. Let μ = 54,000, σ = 16,000. Then
(73,600 − 54,000) / 16,000 = 1.225
• Normalization by decimal scaling:
v′ = v / 10^j, where j is the smallest integer such that max(|v′|) < 1
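The three normalization methods are one-liners in Python; the values in the decimal-scaling usage note are hypothetical.

```python
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Min-max normalization of v into [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score(v, mu, sigma):
    """Z-score normalization: how many std deviations v lies from the mean."""
    return (v - mu) / sigma

def decimal_scaling(values):
    """Divide by 10^j, with j the smallest integer making max(|v'|) < 1."""
    j = 0
    while max(abs(v) for v in values) / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]
```

These reproduce the slide's income examples: `min_max(73600, 12000, 98000)` is about 0.716 and `z_score(73600, 54000, 16000)` is 1.225; for values ranging from −986 to 917, decimal scaling divides by 1,000.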
57. Discretization
• Three types of attributes
– Nominal—values from an unordered set, e.g., color, profession
– Ordinal—values from an ordered set, e.g., military or academic rank
– Numeric—real numbers, e.g., integer or real numbers
• Discretization: Divide the range of a continuous attribute into intervals
– Interval labels can then be used to replace actual data values
– Reduce data size by discretization
– Supervised vs. unsupervised
– Split (top-down) vs. merge (bottom-up)
– Discretization can be performed recursively on an attribute
– Prepare for further analysis, e.g., classification
58. Data Discretization Methods
• Typical methods: All the methods can be applied recursively
– Binning
• Top-down split, unsupervised
– Histogram analysis
• Top-down split, unsupervised
– Clustering analysis (unsupervised, top-down split or bottom-
up merge)
– Decision-tree analysis (supervised, top-down split)
– Correlation (e.g., χ²) analysis (unsupervised, bottom-up
merge)
59. Simple Discretization: Binning
• Equal-width (distance) partitioning
– Divides the range into N intervals of equal size: uniform grid
– if A and B are the lowest and highest values of the attribute, the width of
intervals will be: W = (B − A)/N
– The most straightforward, but outliers may dominate presentation
– Skewed data is not handled well
• Equal-depth (frequency) partitioning
– Divides the range into N intervals, each containing approximately same
number of samples
– Good data scaling
– Managing categorical attributes can be tricky
60. Binning Methods for Data Smoothing
Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into equal-frequency (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
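The worked example above can be reproduced directly; bin means are rounded to the nearest integer, as on the slide.

```python
def equal_frequency_bins(values, n_bins):
    """Sort the data and split it into bins of (roughly) equal count."""
    values = sorted(values)
    size = len(values) // n_bins
    return [values[i * size:(i + 1) * size] for i in range(n_bins)]

def smooth_by_means(bins):
    """Replace every value in a bin by the (rounded) bin mean."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value by the closer of the bin's min/max boundary."""
    out = []
    for b in bins:
        lo, hi = min(b), max(b)
        out.append([lo if v - lo <= hi - v else hi for v in b])
    return out
```

Running these on the price list reproduces the three bins and both smoothed versions shown above.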
61. Discretization Without Using Class Labels
(Binning vs. Clustering)
[Figure: the same data set discretized by equal-interval-width binning, equal-frequency binning, and K-means clustering; K-means clustering leads to better results]
62. Discretization by Classification & Correlation Analysis
• Classification (e.g., decision tree analysis)
– Supervised: Given class labels, e.g., cancerous vs. benign
– Using entropy to determine split point (discretization point)
– Top-down, recursive split
– Details to be covered in Chapter 7
• Correlation analysis (e.g., Chi-merge: χ2-based discretization)
– Supervised: use class information
– Bottom-up merge: find the best neighboring intervals (those having
similar distributions of classes, i.e., low χ2 values) to merge
– Merge performed recursively, until a predefined stopping condition
63. Concept Hierarchy Generation
• Concept hierarchy organizes concepts (i.e., attribute values) hierarchically
and is usually associated with each dimension in a data warehouse
• Concept hierarchies facilitate drilling and rolling in data warehouses to view
data in multiple granularity
• Concept hierarchy formation: Recursively reduce the data by collecting and
replacing low level concepts (such as numeric values for age) by higher level
concepts (such as youth, adult, or senior)
• Concept hierarchies can be explicitly specified by domain experts and/or data
warehouse designers
• Concept hierarchy can be automatically formed for both numeric and
nominal data. For numeric data, use discretization methods shown.
64. Concept Hierarchy Generation
for Nominal Data
• Specification of a partial/total ordering of attributes explicitly at
the schema level by users or experts
– street < city < state < country
• Specification of a hierarchy for a set of values by explicit data
grouping
– {Urbana, Champaign, Chicago} < Illinois
• Specification of only a partial set of attributes
– E.g., only street < city, not others
• Automatic generation of hierarchies (or attribute levels) by the
analysis of the number of distinct values
– E.g., for a set of attributes: {street, city, state, country}
65. Automatic Concept Hierarchy Generation
• Some hierarchies can be automatically generated based on
the analysis of the number of distinct values per attribute in
the data set
– The attribute with the most distinct values is placed at
the lowest level of the hierarchy
– Exceptions, e.g., weekday, month, quarter, year
country: 15 distinct values
province_or_state: 365 distinct values
city: 3,567 distinct values
street: 674,339 distinct values
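The distinct-value heuristic above amounts to a single sort; this sketch assumes the distinct counts have already been computed.

```python
def hierarchy_by_distinct_count(distinct_counts):
    """Order attributes from most distinct values (lowest hierarchy level)
    to fewest (highest level). A heuristic only: weekday < month < quarter
    < year is a known exception, since a year has fewer distinct weekday
    values (7) than months (12)."""
    return sorted(distinct_counts, key=distinct_counts.get, reverse=True)
```

Applied to the attribute set above, it yields street < city < province_or_state < country, matching the location hierarchy.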