MODULE 2
• Data Preprocessing: Data Preprocessing Concepts, Data Cleaning, Data
integration and transformation, Data Reduction, Discretization and concept
hierarchy
Data Quality: Why Preprocess the Data?
• Measures for data quality: A multidimensional view
– Accuracy: correct or wrong, accurate or not
– Completeness: not recorded, unavailable, …
– Consistency: some modified but some not, dangling, …
– Timeliness: timely update?
– Believability: how much are the data trusted by users?
– Interpretability: how easily the data can be understood?
• There are many possible reasons for inaccurate data (i.e., having incorrect attribute values):
o The data collection instruments used may be faulty.
o There may have been human or computer errors occurring at data entry.
o Users may purposely submit incorrect data values for mandatory fields when they
do not wish to submit personal information (e.g., by choosing the default value
“January 1” displayed for birthday). This is known as disguised missing data.
o Errors in data transmission can also occur.
o There may be technology limitations such as limited buffer size for coordinating
synchronized data transfer and consumption.
o Incorrect data may also result from inconsistencies in naming conventions or data
codes, or inconsistent formats for input fields (e.g., date).
o Duplicate tuples also require data cleaning.
Major Tasks in Data Preprocessing
• Data cleaning
– Fill in missing values, smooth noisy data, identify or remove outliers,
and resolve inconsistencies
• Data integration
– Integration of multiple databases, data cubes, or files
• Data reduction
– Dimensionality reduction
– Numerosity reduction
– Data compression
• Data transformation and data discretization
– Normalization
– Concept hierarchy generation
Data Cleaning
• Real-world data tend to be incomplete, noisy, and inconsistent.
– incomplete: lacking attribute values….
• e.g., Occupation=“ ” (missing data)
– noisy: containing noise, errors, or outliers
• e.g., Salary=“−10” (an error)
– inconsistent: containing discrepancies in codes or names,
• e.g., Age=“42”, Birthday=“03/07/2010”
• Was rating “1, 2, 3”, now rating “A, B, C”
• duplicate records
• Data cleaning (or data cleansing)
routines work to “clean” the data by
filling in missing values, smoothing
noisy data, identifying or removing
outliers, and resolving
inconsistencies
Incomplete (Missing) Data
• Data is not always available
– E.g., many tuples have no recorded value for several
attributes, such as customer income in sales data
• Missing data may be due to
– equipment malfunction
– inconsistent with other recorded data and thus deleted
– data not entered due to misunderstanding
– certain data may not be considered important at the time
of entry
How to Handle Missing Data?
• Ignore the tuple: usually done when class label is missing (when
doing classification)
• Fill in the missing value manually: tedious + infeasible?
• Fill in it automatically with
– a global constant : e.g., “unknown”, a new class?!
– the attribute mean
– the attribute mean for all samples belonging to the same
class: smarter
– the most probable value: inference-based such as Bayesian
formula or decision tree.
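A minimal pandas sketch of the automatic fill-in strategies above; the column names and values are made up for illustration:

```python
import pandas as pd

# Toy tuples with missing incomes (hypothetical attribute names).
df = pd.DataFrame({
    "customer_class": ["gold", "gold", "silver", "silver", "silver"],
    "income":         [52000, None,   31000,    None,     35000],
})

# Global constant: flag missing values with a placeholder.
filled_const = df["income"].fillna(-1)

# Attribute mean: replace missing incomes with the overall mean.
filled_mean = df["income"].fillna(df["income"].mean())

# Class-conditional mean: mean income of samples in the same class (smarter).
filled_class_mean = df.groupby("customer_class")["income"].transform(
    lambda s: s.fillna(s.mean()))
print(filled_class_mean.tolist())   # [52000.0, 52000.0, 31000.0, 33000.0, 35000.0]
```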
Noisy Data
• Noise: random error or variance in a measured variable
• Incorrect attribute values may be due to
– faulty data collection instruments
– data entry problems
– data transmission problems
– technology limitation
– inconsistency in naming convention
• Other data problems which require data cleaning
– duplicate records
– incomplete data
– inconsistent data
How to Handle Noisy Data?
• Binning
– first sort data and partition into (equal-frequency) bins
– then one can smooth by bin means, smooth by bin median,
smooth by bin boundaries, etc.
– In smoothing by bin means, each value in a bin is replaced
by the mean value of the bin.
– In smoothing by bin medians, each bin value is replaced by
the bin median.
– In smoothing by bin boundaries, the minimum and
maximum values in a given bin are identified as the bin
boundaries. Each bin value is then replaced by the closest
boundary value
How to Handle Noisy Data?
• Regression
– a technique that conforms data values to a function.
eg: Linear regression , Multiple linear regression
• Clustering
– Similar values are organized into groups, or “clusters.”
– Values that fall outside of the set of clusters may be
considered outliers.
• Combined computer and human inspection
– detect suspicious values and check by human
Data Integration
• Data integration:
– Combines data from multiple sources into a coherent store
• Careful integration can help reduce and avoid redundancies and
inconsistencies in the resulting data set.
• This can help improve the accuracy and speed of the subsequent data mining
process.
• Issues in data Integration
• 1.Entity identification problem:
– Schema integration and object matching -How can equivalent real-world
entities from multiple data sources be matched up?
– For example, how can the data analyst or the computer be sure that
customer id in one database and cust number in another refer to the
same attribute?
Data Integration
– 2. Detecting and resolving data value conflicts
• For the same real world entity, attribute values from different
sources are different
• Possible reasons: different representations, different scales, e.g.,
metric vs. British units
– 3. Redundancy
• An attribute may be redundant if it can be “derived” from another
attribute or set of attributes.
• Inconsistencies in attribute or dimension naming can also cause
redundancies in the resulting data set.
• Some redundancies can be detected by correlation analysis
Handling Redundancy in Data Integration
• Redundant attributes may be detected by
correlation analysis and covariance analysis
• Given two attributes, such analysis can measure how strongly
one attribute implies the other, based on the
available data.
• For nominal data, we use the Χ2 (chi-square) test.
• For numeric attributes, use the correlation coefficient and
covariance, both of which assess how one attribute’s values
vary from those of another.
Correlation Analysis (Nominal Data)
• Χ2 (chi-square) test
• The larger the Χ2 value, the more likely the variables are related
• Suppose A has c distinct values, namely a1,a2, …………..ac
• B has r distinct values, namely b1,b2, …………..br .
• The data tuples described by A and B can be shown as a contingency table,
with the c values of A making up the columns and the r values of B making
up the rows.
$$\chi^2 = \sum_{i=1}^{c}\sum_{j=1}^{r} \frac{(o_{ij} - e_{ij})^2}{e_{ij}}$$
summed over all cells of the contingency table, where oij is the observed (actual) count and eij the expected count of the joint event (A = ai, B = bj).
Correlation Analysis (Nominal Data)
$$e_{ij} = \frac{count(A = a_i) \times count(B = b_j)}{n}$$
where n is the number of data tuples, count(A = ai) is the number of tuples having value ai for A, and count(B = bj) is the number of tuples having value bj for B.
Chi-Square Calculation: An Example
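The worked example itself is a figure and is not reproduced here; the sketch below shows the computation on a hypothetical 2 × 2 contingency table (the counts are illustrative only):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical contingency table: rows = values of B, columns = values of A.
observed = np.array([[250,  200],
                     [ 50, 1000]])

# Expected count e_ij = count(A = a_i) * count(B = b_j) / n
n = observed.sum()
expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n

chi2 = ((observed - expected) ** 2 / expected).sum()
print(round(chi2, 2))                       # 507.94 -> strongly related

# SciPy gives the same statistic plus the p-value and degrees of freedom.
stat, p, dof, exp = chi2_contingency(observed, correction=False)
print(round(stat, 2), dof)
```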
Correlation Analysis (Numeric Data)
• Correlation coefficient (also called Pearson’s product moment
coefficient)
$$r_{A,B} = \frac{\sum_{i=1}^{n}(a_i - \bar{A})(b_i - \bar{B})}{n\,\sigma_A \sigma_B} = \frac{\sum_{i=1}^{n} a_i b_i - n\bar{A}\bar{B}}{n\,\sigma_A \sigma_B}$$
where n is the number of tuples, Ā and B̄ are the respective means of A and B, σA and σB are the respective standard deviations of A and B, and Σaibi is the sum of the AB cross-product.
• If rA,B > 0, A and B are positively correlated (A’s values increase as B’s do). The higher the value, the stronger the correlation.
• rA,B = 0: independent; rA,B < 0: negatively correlated
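A short NumPy sketch of the coefficient; the attribute values are hypothetical:

```python
import numpy as np

# Hypothetical paired observations of two numeric attributes A and B.
A = np.array([6, 5, 4, 3, 2], dtype=float)
B = np.array([20, 10, 14, 5, 5], dtype=float)

n = len(A)
# r_{A,B} = sum((a_i - mean_A)(b_i - mean_B)) / (n * sigma_A * sigma_B)
r = ((A - A.mean()) * (B - B.mean())).sum() / (n * A.std() * B.std())
print(round(r, 3))                          # positive -> A and B rise together
print(round(np.corrcoef(A, B)[0, 1], 3))    # NumPy agrees
```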
Covariance (Numeric Data)
• Covariance is similar to correlation:
$$Cov(A,B) = E[(A - \bar{A})(B - \bar{B})] = \frac{\sum_{i=1}^{n}(a_i - \bar{A})(b_i - \bar{B})}{n}$$
where n is the number of tuples, Ā and B̄ are the respective mean or expected values of A and B, and σA and σB are the respective standard deviations of A and B.
• Positive covariance: If CovA,B > 0, then A and B both tend to be larger than their expected values.
• Negative covariance: If CovA,B < 0, then if A is larger than its expected value, B is likely to be smaller than its expected value.
• Independence: If A and B are independent, then CovA,B = 0.
• Correlation coefficient: rA,B = Cov(A,B) / (σA σB)
• It can be simplified in computation as
$$Cov(A,B) = E(A \cdot B) - \bar{A}\,\bar{B}$$
Co-Variance: An Example
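The worked example is a figure and is not reproduced here; the sketch below applies the simplified form to the same hypothetical values used in the correlation sketch above:

```python
import numpy as np

A = np.array([6, 5, 4, 3, 2], dtype=float)
B = np.array([20, 10, 14, 5, 5], dtype=float)

# Cov(A, B) = E(A·B) - mean_A * mean_B   (the simplified form)
cov = (A * B).mean() - A.mean() * B.mean()
print(round(cov, 2))                        # 7.0 -> positive covariance

# NumPy's population covariance (bias=True divides by n) agrees.
print(round(np.cov(A, B, bias=True)[0, 1], 2))
```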
Data Reduction
• Complex data analysis and mining on huge amounts
of data can take a long time, making such analysis
impractical or infeasible.
• Data reduction: Obtain a reduced representation of
the data set that is much smaller in volume but yet
produces the same (or almost the same) analytical
results
• Data reduction strategies include dimensionality
reduction, numerosity reduction, and data
compression
Overview of Data Reduction Strategies
• Dimensionality reduction : process of reducing the number of
random variables or attributes under consideration .
• Wavelet transforms
• Principal Components Analysis (PCA)
• Attribute subset selection
• Wavelet transforms and principal components analysis which
transform or project the original data onto a smaller space.
• Attribute subset selection is a method of dimensionality
reduction in which irrelevant, weakly relevant, or redundant
attributes or dimensions are detected and removed.
– Numerosity reduction: replace the original data volume by
alternative, smaller forms of data representation.
– These techniques may be parametric or nonparametric.
– For parametric methods, a model is used to estimate the data,
so that typically only the data parameters need to be stored,
instead of the actual data.
• Regression and log-linear models are examples.
– Nonparametric methods for storing reduced representations
of the data include histograms, clustering, sampling and data
cube aggregation .
– Data compression:
– In data compression, transformations are applied so as to
obtain a reduced or “compressed” representation of the
original data.
– If the original data can be reconstructed from the
compressed data without any information loss, the data
reduction is called lossless.
– If, instead, we can reconstruct only an approximation of
the original data, then the data reduction is called lossy.
1. Wavelet transforms
• The discrete wavelet transform (DWT) is a linear signal
processing technique that, when applied to a data vector X,
transforms it to a numerically different vector, X′, of wavelet
coefficients.
• The two vectors are of the same length.
• “How can this technique be useful for data reduction if the
wavelet transformed data are of the same length as the original
data?”
• A compressed approximation of the data can be retained by
storing only a small fraction of the strongest of the wavelet
coefficients.
• For example, all wavelet coefficients larger than some user-
specified threshold can be retained. All other coefficients are set
to 0.
•The technique removes noise without smoothing out the main features of
the data, making it effective for data cleaning as well.
•Given a set of coefficients, an approximation of the original data can be
constructed by applying the inverse of the DWT used.
•The DWT is closely related to the discrete Fourier transform (DFT), a
signal processing technique involving sines and cosines.
•In general, however, the DWT achieves better lossy compression.
•That is, if the same number of coefficients is retained for a DWT and a
DFT of a given data vector, the DWT version will provide a more accurate
approximation of the original data.
•DWT requires less space than the DFT.
•Unlike the DFT, wavelets are quite localized in space.
•There is only one DFT, yet there are several families of DWTs.
The general procedure for applying a discrete wavelet transform
uses a hierarchical pyramid algorithm that halves the data at
each iteration, resulting in fast computational speed. The method
is as follows:
1. The length, L, of the input data vector must be an integer power
of 2. This condition can be met by padding the data vector with
zeros as necessary (L ≥ n).
2. Each transform involves applying two functions. The first
applies some data smoothing, such as a sum or weighted average.
The second performs a weighted difference, which acts to bring out
the detailed features of the data.
3. The two functions are applied to pairs of data points in X, that
is, to all pairs of measurements .This results in two data sets of
length L/2.
4. The two functions are recursively applied to the data sets
obtained in the previous loop, until the resulting data sets obtained
are of length 2.
5. Selected values from the data sets obtained in the previous
iterations are designated the wavelet coefficients of the transformed
data.
• Equivalently, a matrix multiplication can be applied to the input
data in order to obtain the wavelet coefficients, where the matrix used
depends on the given DWT
Wavelet Decomposition
• S = [2, 2, 0, 2, 3, 5, 4, 4] can be transformed to S′ = [2.75, −1.25, 0.5, 0, 0, −1, −1, 0]
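A small sketch of the pyramid algorithm for the (unnormalized) Haar wavelet, reproducing the decomposition above; other DWT families use different smoothing and differencing functions:

```python
def haar_dwt(x):
    """Haar pyramid: pairwise averages (smoothing) and differences (detail)."""
    coeffs = []
    data = list(x)
    while len(data) > 1:
        avgs  = [(a + b) / 2 for a, b in zip(data[0::2], data[1::2])]
        diffs = [(a - b) / 2 for a, b in zip(data[0::2], data[1::2])]
        coeffs = diffs + coeffs          # details of the current level, finest last
        data = avgs                      # recurse on the smoothed half-length data
    return data + coeffs                 # overall average followed by all details

print(haar_dwt([2, 2, 0, 2, 3, 5, 4, 4]))
# [2.75, -1.25, 0.5, 0.0, 0.0, -1.0, -1.0, 0.0]
```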
2.Principal Components Analysis (PCA)
• Principal components analysis searches for k n-
dimensional orthogonal vectors that can best be
used to represent the data, where k ≤ n.
• The original data are thus projected onto a much
smaller space, resulting in dimensionality
reduction.
• PCA “combines” the essence of attributes by
creating an alternative, smaller set of variables.
The initial data can then be projected onto this
smaller set.
Principal Component Analysis (Steps)
• Given N data vectors from n-dimensions, find k ≤ n orthogonal vectors
(principal components) that can be best used to represent data
– Normalize input data: Each attribute falls within the same range
– Compute k orthonormal (unit) vectors, i.e., principal components
– Each input data (vector) is a linear combination of the k principal
component vectors
– The principal components are sorted in order of decreasing “significance”
or strength
– Since the components are sorted, the size of the data can be reduced by
eliminating the weak components, i.e., those with low variance (i.e., using
the strongest principal components, it is possible to reconstruct a good
approximation of the original data)
• Works for numeric data only.
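A brief scikit-learn sketch of these steps; the data matrix is hypothetical and scikit-learn is assumed to be available:

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical numeric data: 6 tuples with n = 3 attributes.
X = np.array([[2.5, 2.4, 0.5],
              [0.5, 0.7, 1.9],
              [2.2, 2.9, 0.4],
              [1.9, 2.2, 0.8],
              [3.1, 3.0, 0.1],
              [2.3, 2.7, 0.6]])

pca = PCA(n_components=2)              # keep k = 2 of the n = 3 components
X_reduced = pca.fit_transform(X)       # project onto the 2 strongest components
print(X_reduced.shape)                 # (6, 2)
print(pca.explained_variance_ratio_)   # components sorted by "significance"

# Reconstruct an approximation of the original data from the reduced space.
X_approx = pca.inverse_transform(X_reduced)
```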
3. Attribute subset selection
• Data sets for analysis may contain hundreds of attributes,
many of which may be irrelevant to the mining task or
redundant.
• Leaving out relevant attributes or keeping irrelevant attributes
may be detrimental, causing confusion for the mining
algorithm employed.
• This can result in discovered patterns of poor quality.
• In addition, the added volume of irrelevant or redundant
attributes can slow down the mining process.
• Attribute subset selection reduces the data set size by
removing such attributes or dimensions from it.
• The goal of attribute subset selection is to find a
minimum set of attributes such that the resulting
probability distribution of the data classes is as close as
possible to the original distribution obtained using all
attributes.
• Mining on a reduced set of attributes has an additional
benefit: It reduces the number of attributes appearing in
the discovered patterns, helping to make the patterns
easier to understand.
• “How can we find a ‘good’ subset of the original
attributes?”
• For n attributes, there are 2^n possible subsets.
• Heuristic methods that explore a reduced search space
are commonly used for attribute subset selection.
• Their strategy is to make a locally optimal choice
in the hope that this will lead to a globally optimal
solution.
• Basic heuristic methods of attribute subset selection include the techniques that follow.
• 1. Stepwise forward selection: The procedure starts with an empty
set of attributes as the reduced set. The best of the original attributes
is determined and added to the reduced set. At each subsequent
iteration or step, the best of the remaining original attributes is
added to the set.
• 2. Stepwise backward elimination: The procedure starts with the
full set of attributes. At each step, it removes the worst attribute
remaining in the set.
• 3. Combination of forward selection and backward elimination:
The stepwise forward selection and backward elimination methods
can be combined so that, at each step, the procedure selects the best
attribute and removes the worst from among the remaining
attributes.
• 4. Decision tree induction:
• Each internal (nonleaf) node denotes a test on an
attribute, each branch corresponds to an outcome of
the test, and each external (leaf) node denotes a
class prediction.
• At each node, the algorithm chooses the “best”
attribute to partition the data into individual classes.
• When decision tree induction is used for attribute
subset selection, a tree is constructed from the given
data.
• All attributes that do not appear in the tree are
assumed to be irrelevant.
• The set of attributes appearing in the tree form the
reduced subset of attributes.
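A minimal sketch of stepwise forward selection; it assumes cross-validated decision-tree accuracy as the measure of attribute "goodness", which is only one possible choice:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
selected, remaining = [], list(range(X.shape[1]))
best_overall = -1.0

while remaining:
    # Greedy step: try adding each remaining attribute and keep the best one.
    scores = {a: cross_val_score(DecisionTreeClassifier(random_state=0),
                                 X[:, selected + [a]], y, cv=5).mean()
              for a in remaining}
    best_attr, best_score = max(scores.items(), key=lambda kv: kv[1])
    if best_score <= best_overall:     # stop when no attribute improves the score
        break
    selected.append(best_attr)
    remaining.remove(best_attr)
    best_overall = best_score

print(selected, round(best_overall, 3))
```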
4. Regression and log-linear models
• Regression and log-linear models can be used to approximate the
given data.
• In linear regression, the data are modeled to fit a straight line.
• For example, a random variable, Y (called a response variable), can
be modeled as a linear function of another random variable, x
(called a predictor variable), with the equation
• y = wx + b,
where the variance of y is assumed to be constant. In the context of
data mining, x and y are numeric database attributes. The
coefficients, w and b (called regression coefficients), specify the
slope of the line and the y-intercept, respectively.
• Multiple linear regression is an extension of (simple) linear
regression, which allows a response variable, y, to be modeled
as a linear function of two or more predictor variables.
• Log-linear models approximate discrete multidimensional
probability distributions.
• Given a set of tuples in n dimensions (e.g., described by n attributes), consider each tuple as a point in an n-dimensional space. Log-linear models can then be used to estimate the probability of each point in this space, based on a smaller subset of dimensional combinations.
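A small NumPy sketch of parametric numerosity reduction with simple linear regression; the (x, y) pairs are hypothetical:

```python
import numpy as np

# Hypothetical (x, y) pairs to be replaced by the fitted line y = w*x + b.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 6.2, 8.4, 9.9])

w, b = np.polyfit(x, y, deg=1)         # least-squares regression coefficients
print(round(w, 3), round(b, 3))

# Numerosity reduction: store only (w, b) and regenerate approximate y values.
y_approx = w * x + b
```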
5.Histogram
• Histograms use binning.
• A histogram for an attribute, A, partitions the data distribution
of A into disjoint subsets, referred to as buckets or bins.
• If each bucket represents only a single attribute–
value/frequency pair, the buckets are called singleton buckets
• Equal-width: In an equal-width histogram,
the width of each bucket range is uniform
(e.g., the width of $10 for the buckets)
• Equal-frequency (or equal-depth): In an
equal-frequency histogram, the buckets are
created so that, roughly, the frequency of each
bucket is constant (i.e., each bucket contains
roughly the same number of contiguous data
samples).
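A short sketch of an equal-width histogram with $10-wide buckets; the price list is hypothetical:

```python
import numpy as np

prices = [1, 1, 5, 5, 5, 8, 8, 10, 10, 10, 12, 14, 14, 14, 15, 15, 18, 18, 20, 21]
counts, edges = np.histogram(prices, bins=np.arange(0, 31, 10))   # $10-wide buckets

for lo, hi, c in zip(edges[:-1], edges[1:], counts):
    print(f"${lo}-${hi}: {c} values")   # frequency stored per bucket, not per value
```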
6.Clustering
• Clustering techniques consider data tuples as objects.
• They partition the objects into groups, or clusters, so that objects
within a cluster are “similar” to one another and “dissimilar” to
objects in other clusters.
• Similarity is defined in terms of how “close” the objects are in
space, based on a distance function.
• The “quality” of a cluster may be represented by its diameter,
the maximum distance between any two objects in the cluster.
• Centroid distance is an alternative measure of cluster quality and
is defined as the average distance of each cluster object from the
cluster centroid .
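A minimal scikit-learn sketch on hypothetical 2-D tuples that clusters the data and reports each cluster's diameter and average centroid distance:

```python
import numpy as np
from scipy.spatial.distance import pdist
from sklearn.cluster import KMeans

# Hypothetical tuples reduced to k = 2 cluster representatives.
X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]], dtype=float)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

for c in range(2):
    members = X[km.labels_ == c]
    diameter = pdist(members).max()     # maximum distance between any two objects
    centroid_dist = np.linalg.norm(members - km.cluster_centers_[c], axis=1).mean()
    print(c, round(diameter, 2), round(centroid_dist, 2))
```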
7.Sampling
• Sampling can be used as a data reduction technique because it
allows a large data set to be represented by a much smaller
random data sample (or subset). Suppose that a large data set, D, contains N tuples.
• Simple random sample without replacement (SRSWOR) of
size s: This is created by drawing s of the N tuples from D (s <
N), where the probability of drawing any tuple in D is 1/N,
that is, all tuples are equally likely to be sampled.
• Simple random sample with replacement (SRSWR) of size
s: Here each time a tuple is drawn from D, it is recorded and
then replaced. That is, after a tuple is drawn, it is placed back
in D so that it may be drawn again.
• Cluster sample: If the tuples in D are grouped into M
mutually disjoint “clusters,” then an SRS of s clusters can
be obtained, where s < M.
• Stratified sample: If D is divided into mutually disjoint
parts called strata, a stratified sample of D is generated by
obtaining an SRS at each stratum. This helps ensure a
representative sample, especially when the data are
skewed.
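A pandas sketch of SRSWOR, SRSWR, and stratified sampling on a hypothetical data set D:

```python
import pandas as pd

# Hypothetical data set D with N = 100 tuples; "age_group" defines the strata.
D = pd.DataFrame({"age_group": ["youth"] * 20 + ["adult"] * 70 + ["senior"] * 10,
                  "income": range(100)})

srswor = D.sample(n=10, replace=False, random_state=0)   # SRSWOR of size s = 10
srswr  = D.sample(n=10, replace=True,  random_state=0)   # SRSWR: tuples may repeat

# Stratified sample: an SRS drawn within each stratum (10% of every age group).
stratified = D.groupby("age_group").sample(frac=0.1, random_state=0)
print(len(srswor), len(srswr), len(stratified))          # 10 10 10
```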
8. Data aggregation
Data can be aggregated so that the resulting data summarize the total
sales per year instead of per quarter
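A small pandas sketch of this aggregation step; the sales figures are hypothetical:

```python
import pandas as pd

# Hypothetical quarterly sales, aggregated to annual totals (a coarser cuboid).
sales = pd.DataFrame({
    "year":    [2022, 2022, 2022, 2022, 2023, 2023, 2023, 2023],
    "quarter": ["Q1", "Q2", "Q3", "Q4", "Q1", "Q2", "Q3", "Q4"],
    "amount":  [224, 408, 350, 586, 230, 420, 360, 600],
})

annual = sales.groupby("year", as_index=False)["amount"].sum()
print(annual)   # one row per year instead of one per quarter
```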
• Data cubes store multidimensional aggregated information.
• For example, Figure shows a data cube for multidimensional
analysis of sales data with respect to annual sales per item type .
• Each cell holds an aggregate data value, corresponding to the data point in
multidimensional space.
• Data cubes provide fast access to precomputed, summarized data, thereby
benefiting online analytical processing as well as data mining.
• The cube created at the lowest abstraction level is referred to as the base
cuboid.
• Lowest level should be usable, or useful for the analysis.
• A cube at the highest level of abstraction is the apex cuboid.
• The apex cuboid would give one total—the total sales for all three years, for all
item types, and for all branches.
• Each higher abstraction level further reduces the resulting data size.
Data Transformation
• A function that maps the entire set of values of a given attribute to a new
set of replacement values s.t. each old value can be identified with one of
the new values
• Methods
– Smoothing: Remove noise from data
– Attribute/feature construction
• New attributes constructed from the given ones
– Aggregation: Summarization, data cube construction
– Normalization: Scaled to fall within a smaller, specified range
• min-max normalization
• z-score normalization
• normalization by decimal scaling
– Discretization: Concept hierarchy climbing
Normalization
• Min-max normalization: to [new_minA, new_maxA]
$$v' = \frac{v - min_A}{max_A - min_A}(new\_max_A - new\_min_A) + new\_min_A$$
– Ex. Let income range $12,000 to $98,000 be normalized to [0.0, 1.0]. Then $73,600 is mapped to
(73,600 − 12,000) / (98,000 − 12,000) × (1.0 − 0) + 0 = 0.716
• Z-score normalization (μ: mean, σ: standard deviation):
$$v' = \frac{v - \bar{A}}{\sigma_A}$$
– Ex. Let μ = 54,000, σ = 16,000. Then (73,600 − 54,000) / 16,000 = 1.225
• Normalization by decimal scaling:
$$v' = \frac{v}{10^j}$$ where j is the smallest integer such that Max(|v′|) < 1
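A short sketch that reproduces the three normalizations above; the decimal-scaling values are illustrative:

```python
import numpy as np

v, min_a, max_a = 73_600.0, 12_000.0, 98_000.0

# Min-max normalization to [0.0, 1.0]
v_minmax = (v - min_a) / (max_a - min_a) * (1.0 - 0.0) + 0.0
print(round(v_minmax, 3))        # 0.716

# Z-score normalization with mean 54,000 and standard deviation 16,000
v_zscore = (v - 54_000.0) / 16_000.0
print(round(v_zscore, 3))        # 1.225

# Decimal scaling: divide by 10^j, j the smallest integer with max(|v'|) < 1
values = np.array([-986.0, 917.0])
j = int(np.ceil(np.log10(np.abs(values).max())))   # j = 3 for these values
print(values / 10 ** j)          # [-0.986  0.917]
```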
Discretization
• Three types of attributes
– Nominal—values from an unordered set, e.g., color, profession
– Ordinal—values from an ordered set, e.g., military or academic rank
– Numeric—real numbers, e.g., integer or real numbers
• Discretization: Divide the range of a continuous attribute into intervals
– Interval labels can then be used to replace actual data values
– Reduce data size by discretization
– Supervised vs. unsupervised
– Split (top-down) vs. merge (bottom-up)
– Discretization can be performed recursively on an attribute
– Prepare for further analysis, e.g., classification
Data Discretization Methods
• Typical methods: All the methods can be applied recursively
– Binning
• Top-down split, unsupervised
– Histogram analysis
• Top-down split, unsupervised
– Clustering analysis (unsupervised, top-down split or bottom-
up merge)
– Decision-tree analysis (supervised, top-down split)
– Correlation (e.g., χ2) analysis (unsupervised, bottom-up
merge)
Simple Discretization: Binning
• Equal-width (distance) partitioning
– Divides the range into N intervals of equal size: uniform grid
– if A and B are the lowest and highest values of the attribute, the width of
intervals will be: W = (B –A)/N.
– The most straightforward, but outliers may dominate presentation
– Skewed data is not handled well
• Equal-depth (frequency) partitioning
– Divides the range into N intervals, each containing approximately same
number of samples
– Good data scaling
– Managing categorical attributes can be tricky
Binning Methods for Data Smoothing
• Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into equal-frequency (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
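A small sketch that reproduces the smoothing above (bin means 22.75 and 29.25 appear rounded to 23 and 29 on the slide):

```python
from statistics import mean

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]        # already sorted
bins = [prices[i:i + 4] for i in range(0, len(prices), 4)]    # equal-frequency bins

# Smoothing by bin means: every value becomes its bin's mean.
by_means = [[mean(b)] * len(b) for b in bins]

# Smoothing by bin boundaries: replace each value by the closer of min and max.
by_bounds = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
             for b in bins]

print(by_means)    # [[9, 9, 9, 9], [22.75, ...], [29.25, ...]]
print(by_bounds)   # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```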
Discretization Without Using Class Labels
(Binning vs. Clustering)
• Figure: the same data discretized by equal interval width (binning), equal frequency (binning), and K-means clustering; K-means clustering leads to better results.
Discretization by Classification & Correlation
Analysis
• Classification (e.g., decision tree analysis)
– Supervised: Given class labels, e.g., cancerous vs. benign
– Using entropy to determine split point (discretization point)
– Top-down, recursive split
– Details to be covered in Chapter 7
• Correlation analysis (e.g., Chi-merge: χ2-based discretization)
– Supervised: use class information
– Bottom-up merge: find the best neighboring intervals (those having
similar distributions of classes, i.e., low χ2 values) to merge
– Merge performed recursively, until a predefined stopping condition
Concept Hierarchy Generation
• Concept hierarchy organizes concepts (i.e., attribute values) hierarchically
and is usually associated with each dimension in a data warehouse
• Concept hierarchies facilitate drilling and rolling in data warehouses to view
data in multiple granularity
• Concept hierarchy formation: Recursively reduce the data by collecting and
replacing low level concepts (such as numeric values for age) by higher level
concepts (such as youth, adult, or senior)
• Concept hierarchies can be explicitly specified by domain experts and/or data
warehouse designers
• Concept hierarchy can be automatically formed for both numeric and
nominal data. For numeric data, use discretization methods shown.
Concept Hierarchy Generation
for Nominal Data
• Specification of a partial/total ordering of attributes explicitly at
the schema level by users or experts
– street < city < state < country
• Specification of a hierarchy for a set of values by explicit data
grouping
– {Urbana, Champaign, Chicago} < Illinois
• Specification of only a partial set of attributes
– E.g., only street < city, not others
• Automatic generation of hierarchies (or attribute levels) by the
analysis of the number of distinct values
– E.g., for a set of attributes: {street, city, state, country}
Automatic Concept Hierarchy Generation
• Some hierarchies can be automatically generated based on
the analysis of the number of distinct values per attribute in
the data set
– The attribute with the most distinct values is placed at
the lowest level of the hierarchy
– Exceptions, e.g., weekday, month, quarter, year
– E.g., street (674,339 distinct values) < city (3,567 distinct values) < province_or_state (365 distinct values) < country (15 distinct values)
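A tiny sketch of this heuristic using the distinct-value counts above:

```python
# Order attributes by distinct-value count: fewest values => highest level.
distinct_counts = {"street": 674_339, "city": 3_567,
                   "province_or_state": 365, "country": 15}

hierarchy = sorted(distinct_counts, key=distinct_counts.get)   # top level first
print(" < ".join(reversed(hierarchy)))
# street < city < province_or_state < country
```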
More Related Content

What's hot

Association rule mining
Association rule miningAssociation rule mining
Association rule miningUtkarsh Sharma
 
Data mining: Classification and prediction
Data mining: Classification and predictionData mining: Classification and prediction
Data mining: Classification and predictionDataminingTools Inc
 
5.3 mining sequential patterns
5.3 mining sequential patterns5.3 mining sequential patterns
5.3 mining sequential patternsKrish_ver2
 
Cs501 classification prediction
Cs501 classification predictionCs501 classification prediction
Cs501 classification predictionKamal Singh Lodhi
 
Fp growth algorithm
Fp growth algorithmFp growth algorithm
Fp growth algorithmPradip Kumar
 
1.11.association mining 3
1.11.association mining 31.11.association mining 3
1.11.association mining 3Krish_ver2
 
Data mining query languages
Data mining query languagesData mining query languages
Data mining query languagesMarcy Morales
 
Cluster analysis
Cluster analysisCluster analysis
Cluster analysisAcad
 
Association rule mining.pptx
Association rule mining.pptxAssociation rule mining.pptx
Association rule mining.pptxmaha797959
 
Chapter 6. Mining Frequent Patterns, Associations and Correlations Basic Conc...
Chapter 6. Mining Frequent Patterns, Associations and Correlations Basic Conc...Chapter 6. Mining Frequent Patterns, Associations and Correlations Basic Conc...
Chapter 6. Mining Frequent Patterns, Associations and Correlations Basic Conc...Subrata Kumer Paul
 
Lect6 Association rule & Apriori algorithm
Lect6 Association rule & Apriori algorithmLect6 Association rule & Apriori algorithm
Lect6 Association rule & Apriori algorithmhktripathy
 

What's hot (20)

Association rule mining
Association rule miningAssociation rule mining
Association rule mining
 
Data Mining: Data processing
Data Mining: Data processingData Mining: Data processing
Data Mining: Data processing
 
Data mining primitives
Data mining primitivesData mining primitives
Data mining primitives
 
Data mining: Classification and prediction
Data mining: Classification and predictionData mining: Classification and prediction
Data mining: Classification and prediction
 
5.3 mining sequential patterns
5.3 mining sequential patterns5.3 mining sequential patterns
5.3 mining sequential patterns
 
Cs501 classification prediction
Cs501 classification predictionCs501 classification prediction
Cs501 classification prediction
 
Fp growth algorithm
Fp growth algorithmFp growth algorithm
Fp growth algorithm
 
02 Data Mining
02 Data Mining02 Data Mining
02 Data Mining
 
Data Preprocessing
Data PreprocessingData Preprocessing
Data Preprocessing
 
1.11.association mining 3
1.11.association mining 31.11.association mining 3
1.11.association mining 3
 
01 Data Mining: Concepts and Techniques, 2nd ed.
01 Data Mining: Concepts and Techniques, 2nd ed.01 Data Mining: Concepts and Techniques, 2nd ed.
01 Data Mining: Concepts and Techniques, 2nd ed.
 
data-mining-tutorial.ppt
data-mining-tutorial.pptdata-mining-tutorial.ppt
data-mining-tutorial.ppt
 
Apriori
AprioriApriori
Apriori
 
3. mining frequent patterns
3. mining frequent patterns3. mining frequent patterns
3. mining frequent patterns
 
Data mining query languages
Data mining query languagesData mining query languages
Data mining query languages
 
Cluster analysis
Cluster analysisCluster analysis
Cluster analysis
 
Association rule mining.pptx
Association rule mining.pptxAssociation rule mining.pptx
Association rule mining.pptx
 
5desc
5desc5desc
5desc
 
Chapter 6. Mining Frequent Patterns, Associations and Correlations Basic Conc...
Chapter 6. Mining Frequent Patterns, Associations and Correlations Basic Conc...Chapter 6. Mining Frequent Patterns, Associations and Correlations Basic Conc...
Chapter 6. Mining Frequent Patterns, Associations and Correlations Basic Conc...
 
Lect6 Association rule & Apriori algorithm
Lect6 Association rule & Apriori algorithmLect6 Association rule & Apriori algorithm
Lect6 Association rule & Apriori algorithm
 

Similar to CS 402 DATAMINING AND WAREHOUSING -MODULE 2

Data Preparation and Preprocessing , Data Cleaning
Data Preparation and Preprocessing , Data CleaningData Preparation and Preprocessing , Data Cleaning
Data Preparation and Preprocessing , Data CleaningShivarkarSandip
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessingkayathri02
 
Data preprocessing in Data Mining
Data preprocessing in Data MiningData preprocessing in Data Mining
Data preprocessing in Data MiningDHIVYADEVAKI
 
03Preprocessing01.pdf
03Preprocessing01.pdf03Preprocessing01.pdf
03Preprocessing01.pdfAlireza418370
 
Data preprocessing 2
Data preprocessing 2Data preprocessing 2
Data preprocessing 2extraganesh
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessingTony Nguyen
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessingJames Wong
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessingHarry Potter
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessingFraboni Ec
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessingHoang Nguyen
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessingYoung Alista
 
Preprocessing
PreprocessingPreprocessing
Preprocessingmmuthuraj
 

Similar to CS 402 DATAMINING AND WAREHOUSING -MODULE 2 (20)

Data Preparation and Preprocessing , Data Cleaning
Data Preparation and Preprocessing , Data CleaningData Preparation and Preprocessing , Data Cleaning
Data Preparation and Preprocessing , Data Cleaning
 
Data Preprocessing
Data PreprocessingData Preprocessing
Data Preprocessing
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
Data preprocessing in Data Mining
Data preprocessing in Data MiningData preprocessing in Data Mining
Data preprocessing in Data Mining
 
03Preprocessing01.pdf
03Preprocessing01.pdf03Preprocessing01.pdf
03Preprocessing01.pdf
 
Preprocessing
PreprocessingPreprocessing
Preprocessing
 
Unit 3-2.ppt
Unit 3-2.pptUnit 3-2.ppt
Unit 3-2.ppt
 
Data processing
Data processingData processing
Data processing
 
Data processing
Data processingData processing
Data processing
 
Data1
Data1Data1
Data1
 
Data1
Data1Data1
Data1
 
Data preprocessing 2
Data preprocessing 2Data preprocessing 2
Data preprocessing 2
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
Preprocessing
PreprocessingPreprocessing
Preprocessing
 

Recently uploaded

Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...Dr.Costas Sachpazis
 
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝soniya singh
 
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).pptssuser5c9d4b1
 
Porous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingPorous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingrakeshbaidya232001
 
Coefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxCoefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxAsutosh Ranjan
 
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur High Profile
 
IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...
IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...
IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...RajaP95
 
HARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IVHARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IVRajaP95
 
Microscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxMicroscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxpurnimasatapathy1234
 
UNIT-III FMM. DIMENSIONAL ANALYSIS
UNIT-III FMM.        DIMENSIONAL ANALYSISUNIT-III FMM.        DIMENSIONAL ANALYSIS
UNIT-III FMM. DIMENSIONAL ANALYSISrknatarajan
 
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptx
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptxthe ladakh protest in leh ladakh 2024 sonam wangchuk.pptx
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptxhumanexperienceaaa
 
Extrusion Processes and Their Limitations
Extrusion Processes and Their LimitationsExtrusion Processes and Their Limitations
Extrusion Processes and Their Limitations120cr0395
 
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICSAPPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICSKurinjimalarL3
 
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxDecoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxJoão Esperancinha
 
Introduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptxIntroduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptxupamatechverse
 
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICS
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICSHARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICS
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICSRajkumarAkumalla
 
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...ranjana rawat
 
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...Call Girls in Nagpur High Profile
 
Introduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptxIntroduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptxupamatechverse
 

Recently uploaded (20)

Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
 
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
 
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
 
Porous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingPorous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writing
 
Coefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxCoefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptx
 
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
 
IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...
IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...
IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...
 
HARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IVHARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IV
 
Microscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxMicroscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptx
 
Roadmap to Membership of RICS - Pathways and Routes
Roadmap to Membership of RICS - Pathways and RoutesRoadmap to Membership of RICS - Pathways and Routes
Roadmap to Membership of RICS - Pathways and Routes
 
UNIT-III FMM. DIMENSIONAL ANALYSIS
UNIT-III FMM.        DIMENSIONAL ANALYSISUNIT-III FMM.        DIMENSIONAL ANALYSIS
UNIT-III FMM. DIMENSIONAL ANALYSIS
 
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptx
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptxthe ladakh protest in leh ladakh 2024 sonam wangchuk.pptx
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptx
 
Extrusion Processes and Their Limitations
Extrusion Processes and Their LimitationsExtrusion Processes and Their Limitations
Extrusion Processes and Their Limitations
 
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICSAPPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
 
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxDecoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
 
Introduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptxIntroduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptx
 
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICS
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICSHARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICS
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICS
 
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
 
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
 
Introduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptxIntroduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptx
 

CS 402 DATAMINING AND WAREHOUSING -MODULE 2

  • 1. MODULE 2 • Data Preprocessing: Data Preprocessing Concepts, Data Cleaning, Data integration and transformation, Data Reduction, Discretization and concept hierarchy 6/30/2020 NIMMY RAJU,AP,VKCET,TVM 1
  • 3. Data Quality: Why Preprocess the Data? • Measures for data quality: A multidimensional view – Accuracy: correct or wrong, accurate or not – Completeness: not recorded, unavailable, … – Consistency: some modified but some not, dangling, … – Timeliness: timely update? – Believability: how trustable the data are correct? – Interpretability: how easily the data can be understood? 36/30/2020 NIMMY RAJU,AP,VKCET,TVM
  • 4. 6/30/2020 NIMMY RAJU,AP,VKCET,TVM 4  . There are many possible reasons for inaccurate data (i.e., having incorrect attribute values). o The data collection instruments used may be faulty. o There may have been human or computer errors occurring at data entry. o Users may purposely submit incorrect data values for mandatory fields when they do not wish to submit personal information (e.g., by choosing the default value “January 1” displayed for birthday). This is known as disguised missing data. o Errors in data transmission can also occur. o There may be technology limitations such as limited buffer size for coordinating synchronized data transfer and consumption. o Incorrect data may also result from inconsistencies in naming conventions or data codes, or inconsistent formats for input fields (e.g., date). o Duplicate tuples also require data cleaning.
  • 5. Major Tasks in Data Preprocessing • Data cleaning – Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies • Data integration – Integration of multiple databases, data cubes, or files • Data reduction – Dimensionality reduction – Numerosity reduction – Data compression • Data transformation and data discretization – Normalization – Concept hierarchy generation 56/30/2020 NIMMY RAJU,AP,VKCET,TVM
  • 7. Data Cleaning • Real-world data tend to be incomplete, noisy, and inconsistent. – incomplete: lacking attribute values…. • e.g., Occupation=“ ” (missing data) – noisy: containing noise, errors, or outliers • e.g., Salary=“−10” (an error) – inconsistent: containing discrepancies in codes or names, • e.g., Age=“42”, Birthday=“03/07/2010” • Was rating “1, 2, 3”, now rating “A, B, C” • duplicate records 76/30/2020 NIMMY RAJU,AP,VKCET,TVM
  • 8. • Data cleaning (or data cleansing) routines work to “clean” the data by filling in missing values, smoothing noisy data, identifying or removing outliers, and resolving inconsistencies 86/30/2020 NIMMY RAJU,AP,VKCET,TVM
  • 9. Incomplete (Missing) Data • Data is not always available – E.g., many tuples have no recorded value for several attributes, such as customer income in sales data • Missing data may be due to – equipment malfunction – inconsistent with other recorded data and thus deleted – data not entered due to misunderstanding – certain data may not be considered important at the time of entry 96/30/2020 NIMMY RAJU,AP,VKCET,TVM
  • 10. How to Handle Missing Data? • Ignore the tuple: usually done when class label is missing (when doing classification) • Fill in the missing value manually: tedious + infeasible? • Fill in it automatically with – a global constant : e.g., “unknown”, a new class?! – the attribute mean – the attribute mean for all samples belonging to the same class: smarter – the most probable value: inference-based such as Bayesian formula or decision tree. 106/30/2020 NIMMY RAJU,AP,VKCET,TVM
  • 11. 11 Noisy Data • Noise: random error or variance in a measured variable • Incorrect attribute values may be due to – faulty data collection instruments – data entry problems – data transmission problems – technology limitation – inconsistency in naming convention • Other data problems which require data cleaning – duplicate records – incomplete data – inconsistent data 6/30/2020 NIMMY RAJU,AP,VKCET,TVM 11
  • 12. How to Handle Noisy Data? • Binning – first sort data and partition into (equal-frequency) bins – then one can smooth by bin means, smooth by bin median, smooth by bin boundaries, etc. – In smoothing by bin means, each value in a bin is replaced by the mean value of the bin. – In smoothing by bin medians, each bin value is replaced by the bin median. – In smoothing by bin boundaries, the minimum and maximum values in a given bin are identified as the bin boundaries. Each bin value is then replaced by the closest boundary value 126/30/2020 NIMMY RAJU,AP,VKCET,TVM
  • 14. How to Handle Noisy Data? • Regression – a technique that conforms data values to a function. eg: Linear regression , Multiple linear regression – Clustering – Similar values are organized into groups, or “clusters.” – Values that fall outside of the set of clusters may be considered outliers. • Combined computer and human inspection – detect suspicious values and check by human 146/30/2020 NIMMY RAJU,AP,VKCET,TVM
  • 15. Data Integration • Data integration: – Combines data from multiple sources into a coherent store • Careful integration can help reduce and avoid redundancies and inconsistencies in the resulting data set. • This can help improve the accuracy and speed of the subsequent data mining process. • Issues in data Integration • 1.Entity identification problem: – Schema integration and object matching -How can equivalent real-world entities from multiple data sources be matched up? – For example, how can the data analyst or the computer be sure that customer id in one database and cust number in another refer to the same attribute? 15 15 6/30/2020 NIMMY RAJU,AP,VKCET,TVM
  • 16. Data Integration – 2. Detecting and resolving data value conflicts • For the same real world entity, attribute values from different sources are different • Possible reasons: different representations, different scales, e.g., metric vs. British units – 3. Redundancy • An attribute may be redundant if it can be “derived” from another attribute or set of attributes. • Inconsistencies in attribute or dimension naming can also cause redundancies in the resulting data set. • Some redundancies can be detected by correlation analysis 166/30/2020 NIMMY RAJU,AP,VKCET,TVM
  • 17. Handling Redundancy in Data Integration • Redundant attributes may be able to be detected by correlation analysis and covariance analysis • Given two attributes, such analysis can measure how strongly one attribute implies the other, based on the available data. • For nominal data, we use the Χ2 (chi-square) test. • For numeric attributes, use the correlation coefficient and covariance, both of which access how one attribute’s values vary from those of another. 17 17 6/30/2020 NIMMY RAJU,AP,VKCET,TVM
  • 18. Correlation Analysis (Nominal Data) • Χ2 (chi-square) test • The larger the Χ2 value, the more likely the variables are related • Suppose A has c distinct values, namely a1,a2, …………..ac • B has r distinct values, namely b1,b2, …………..br . • The data tuples described by A and B can be shown as a contingency table, with the c values of A making up the columns and the r values of B making up the rows.    Expected ExpectedObserved 2 2 )(  186/30/2020 NIMMY RAJU,AP,VKCET,TVM
  • 19. Correlation Analysis (Nominal Data) 19 where n is the number of data tuples, count.A D ai/ is the number of tuples having value ai for A, and count.B D bj/ is the number of tuples having value bj for B.6/30/2020 NIMMY RAJU,AP,VKCET,TVM
  • 20. Chi-Square Calculation: An Example 6/30/2020 NIMMY RAJU,AP,VKCET,TVM 20
  • 21. Correlation Analysis (Numeric Data) • Correlation coefficient (also called Pearson’s product moment coefficient) where n is the number of tuples, and are the respective means of A and B, σA and σB are the respective standard deviation of A and B, and Σ(aibi) is the sum of the AB cross-product. • If rA,B > 0, A and B are positively correlated (A’s values increase as B’s). The higher, the stronger correlation. • rA,B = 0: independent; rAB < 0: negatively correlated BA n i ii BA n i ii BA n BAnba n BbAa r  )( )( )( ))(( 11 ,       A 21 B 6/30/2020 NIMMY RAJU,AP,VKCET,TVM
  • 22. Covariance (Numeric Data) • Covariance is similar to correlation where n is the number of tuples, and are the respective mean or expected values of A and B, σA and σB are the respective standard deviation of A and B. • Positive covariance: If CovA,B > 0, then A and B both tend to be larger than their expected values. • Negative covariance: If CovA,B < 0 then if A is larger than its expected value, B is likely to be smaller than its expected value. • Independence: CovA,B = 0 22 A B Correlation coefficient: 6/30/2020 NIMMY RAJU,AP,VKCET,TVM
  • 23. • It can be simplified in computation as 6/30/2020 NIMMY RAJU,AP,VKCET,TVM 23
  • 24. Co-Variance: An Example 6/30/2020 NIMMY RAJU,AP,VKCET,TVM 24
  • 25. Data Reduction • Complex data analysis and mining on huge amounts of data can take a long time, making such analysis impractical or infeasible. • Data reduction: Obtain a reduced representation of the data set that is much smaller in volume but yet produces the same (or almost the same) analytical results • Data reduction strategies include dimensionality reduction, numerosity reduction, and data compression 256/30/2020 NIMMY RAJU,AP,VKCET,TVM
  • 26. Overview of Data Reduction Strategies • Dimensionality reduction : process of reducing the number of random variables or attributes under consideration . • Wavelet transforms • Principal Components Analysis (PCA) • Attribute subset selection • Wavelet transforms and principal components analysis which transform or project the original data onto a smaller space. • Attribute subset selection is a method of dimensionality reduction in which irrelevant, weakly relevant, or redundant attributes or dimensions are detected and removed. 266/30/2020 NIMMY RAJU,AP,VKCET,TVM
  • 27. – Numerosity reduction:replace the original data volume by alternative, smaller forms of data representation. – These techniques may be parametric or nonparametric. – For parametric methods, a model is used to estimate the data, so that typically only the data parameters need to be stored, instead of the actual data. • Regression and log-linear models are examples. – Nonparametric methods for storing reduced representations of the data include histograms, clustering, sampling and data cube aggregation . 276/30/2020 NIMMY RAJU,AP,VKCET,TVM
  • 28. – Data compression: – In data compression, transformations are applied so as to obtain a reduced or “compressed” representation of the original data. – If the original data can be reconstructed from the compressed data without any information loss, the data reduction is called lossless. – If, instead, we can reconstruct only an approximation of the original data, then the data reduction is called lossy. 286/30/2020 NIMMY RAJU,AP,VKCET,TVM
  • 29. 29 1. Wavelet transforms • The discrete wavelet transform (DWT) is a linear signal processing technique that, when applied to a data vector X, transforms it to a numerically different vector, X/, of wavelet coefficients. • The two vectors are of the same length. • “How can this technique be useful for data reduction if the wavelet transformed data are of the same length as the original data?” • A compressed approximation of the data can be retained by storing only a small fraction of the strongest of the wavelet coefficients. • For example, all wavelet coefficients larger than some user- specified threshold can be retained. All other coefficients are set to 0.6/30/2020 NIMMY RAJU,AP,VKCET,TVM
  • 30. • The technique removes noise without smoothing out the main features of the data, making it effective for data cleaning as well. • Given a set of coefficients, an approximation of the original data can be constructed by applying the inverse of the DWT used. • The DWT is closely related to the discrete Fourier transform (DFT), a signal processing technique involving sines and cosines. • In general, however, the DWT achieves better lossy compression. • That is, if the same number of coefficients is retained for a DWT and a DFT of a given data vector, the DWT version will provide a more accurate approximation of the original data. • The DWT therefore requires less space than the DFT. • Unlike the DFT, wavelets are quite localized in space. • There is only one DFT, yet there are several families of DWTs. 6/30/2020 NIMMY RAJU,AP,VKCET,TVM 30
  • 31. The general procedure for applying a discrete wavelet transform uses a hierarchical pyramid algorithm that halves the data at each iteration, resulting in fast computational speed. The method is as follows: 1. The length, L, of the input data vector must be an integer power of 2. This condition can be met by padding the data vector with zeros as necessary (L ≥ n). 2. Each transform involves applying two functions. The first applies some data smoothing, such as a sum or weighted average. The second performs a weighted difference, which acts to bring out the detailed features of the data. 3. The two functions are applied to pairs of data points in X, that is, to all pairs of measurements. This results in two data sets of length L/2. 6/30/2020 NIMMY RAJU,AP,VKCET,TVM 31
  • 32. 4. The two functions are recursively applied to the data sets obtained in the previous loop, until the resulting data sets obtained are of length 2. 5. Selected values from the data sets obtained in the previous iterations are designated the wavelet coefficients of the transformed data. • Equivalently, a matrix multiplication can be applied to the input data in order to obtain the wavelet coefficients, where the matrix used depends on the given DWT 6/30/2020 NIMMY RAJU,AP,VKCET,TVM 32
  • 33. 33 Wavelet Decomposition • S = [2, 2, 0, 2, 3, 5, 4, 4] can be transformed (using the Haar wavelet) to S^ = [2¾, −1¼, ½, 0, 0, −1, −1, 0] 6/30/2020 NIMMY RAJU,AP,VKCET,TVM
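A minimal sketch of how such a decomposition can be computed, assuming the standard Haar averaging/differencing pyramid (an illustrative helper, not a production DWT):

```python
# A minimal sketch of a Haar wavelet decomposition: repeatedly replace pairs by
# their average (smoothed part) and half-difference (detail part).
def haar_dwt(signal):
    # assumes len(signal) is an integer power of 2 (pad with zeros otherwise)
    output = list(signal)
    scratch = list(signal)
    length = len(output)
    while length > 1:
        half = length // 2
        for i in range(half):
            a, b = output[2 * i], output[2 * i + 1]
            scratch[i] = (a + b) / 2         # smoothing (average)
            scratch[half + i] = (a - b) / 2  # detail (weighted difference)
        output[:length] = scratch[:length]
        length = half
    return output

coeffs = haar_dwt([2, 2, 0, 2, 3, 5, 4, 4])
print(coeffs)                                            # [2.75, -1.25, 0.5, 0.0, 0.0, -1.0, -1.0, 0.0]
compressed = [c if abs(c) >= 1 else 0 for c in coeffs]   # keep only the strongest coefficients
print(compressed)                                        # [2.75, -1.25, 0, 0, 0, -1.0, -1.0, 0]
```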
  • 34. 2. Principal Components Analysis (PCA) • Principal components analysis searches for k n-dimensional orthogonal vectors that can best be used to represent the data, where k ≤ n. • The original data are thus projected onto a much smaller space, resulting in dimensionality reduction. • PCA “combines” the essence of attributes by creating an alternative, smaller set of variables. The initial data can then be projected onto this smaller set. 34 6/30/2020 NIMMY RAJU,AP,VKCET,TVM
  • 35. Principal Component Analysis (Steps) • Given N data vectors from n-dimensions, find k ≤ n orthogonal vectors (principal components) that can be best used to represent data – Normalize input data: Each attribute falls within the same range – Compute k orthonormal (unit) vectors, i.e., principal components – Each input data (vector) is a linear combination of the k principal component vectors – The principal components are sorted in order of decreasing “significance” or strength – Since the components are sorted, the size of the data can be reduced by eliminating the weak components, i.e., those with low variance (i.e., using the strongest principal components, it is possible to reconstruct a good approximation of the original data) • Works for numeric data only. 356/30/2020 NIMMY RAJU,AP,VKCET,TVM
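A minimal NumPy sketch of these steps via an eigen-decomposition of the covariance matrix (illustrative data, not from the slides):

```python
# A minimal PCA sketch: center the data, find orthonormal principal components,
# keep the k strongest, and project onto the smaller space.
import numpy as np

def pca_reduce(X, k):
    X_centered = X - X.mean(axis=0)            # normalize each attribute around 0
    cov = np.cov(X_centered, rowvar=False)     # attribute covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)     # orthonormal (unit) eigenvectors
    order = np.argsort(eigvals)[::-1]          # sort by decreasing "significance"
    components = eigvecs[:, order[:k]]         # keep the k strongest components
    return X_centered @ components             # project onto the smaller space

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0]])
print(pca_reduce(X, k=1))
```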
  • 36. 3. Attribute subset selection • Data sets for analysis may contain hundreds of attributes, many of which may be irrelevant to the mining task or redundant. • Leaving out relevant attributes or keeping irrelevant attributes may be detrimental, causing confusion for the mining algorithm employed. • This can result in discovered patterns of poor quality. • In addition, the added volume of irrelevant or redundant attributes can slow down the mining process. 366/30/2020 NIMMY RAJU,AP,VKCET,TVM
  • 37. • Attribute subset selection reduces the data set size by removing such attributes or dimensions from it. • The goal of attribute subset selection is to find a minimum set of attributes such that the resulting probability distribution of the data classes is as close as possible to the original distribution obtained using all attributes. • Mining on a reduced set of attributes has an additional benefit: It reduces the number of attributes appearing in the discovered patterns, helping to make the patterns easier to understand. 6/30/2020 NIMMY RAJU,AP,VKCET,TVM 37
  • 38. • “How can we find a ‘good’ subset of the original attributes?” • For n attributes, there are 2^n possible subsets. • Heuristic methods that explore a reduced search space are commonly used for attribute subset selection. • Their strategy is to make a locally optimal choice in the hope that this will lead to a globally optimal solution. 6/30/2020 NIMMY RAJU,AP,VKCET,TVM 38
  • 39. • Basic heuristic methods of attribute subset selection include the techniques that follow, some of which are illustrated. • 1. Stepwise forward selection: The procedure starts with an empty set of attributes as the reduced set. The best of the original attributes is determined and added to the reduced set. At each subsequent iteration or step, the best of the remaining original attributes is added to the set (see the code sketch after method 4 below). • 2. Stepwise backward elimination: The procedure starts with the full set of attributes. At each step, it removes the worst attribute remaining in the set. • 3. Combination of forward selection and backward elimination: The stepwise forward selection and backward elimination methods can be combined so that, at each step, the procedure selects the best attribute and removes the worst from among the remaining attributes. 6/30/2020 NIMMY RAJU,AP,VKCET,TVM 39
  • 40. • 4. Decision tree induction: • Each internal (nonleaf) node denotes a test on an attribute, each branch corresponds to an outcome of the test, and each external (leaf) node denotes a class prediction. • At each node, the algorithm chooses the “best” attribute to partition the data into individual classes. • When decision tree induction is used for attribute subset selection, a tree is constructed from the given data. • All attributes that do not appear in the tree are assumed to be irrelevant. • The set of attributes appearing in the tree form the reduced subset of attributes. 6/30/2020 NIMMY RAJU,AP,VKCET,TVM 40
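A minimal sketch of stepwise forward selection (method 1 above); `score` is a hypothetical caller-supplied evaluation function (e.g., cross-validated accuracy), not part of any specific library:

```python
# A minimal sketch of stepwise forward selection: greedily add the attribute
# that improves the (hypothetical) score function the most at each step.
def forward_selection(attributes, score, k):
    selected = []
    remaining = list(attributes)
    while remaining and len(selected) < k:
        # locally optimal choice, in the hope of a globally good subset
        best = max(remaining, key=lambda a: score(selected + [a]))
        selected.append(best)
        remaining.remove(best)
    return selected
```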
  • 42. 4. Regression and log-linear models • Regression and log-linear models can be used to approximate the given data. • In linear regression, the data are modeled to fit a straight line. • For example, a random variable, Y (called a response variable), can be modeled as a linear function of another random variable, x (called a predictor variable), with the equation • y = wx + b, where the variance of y is assumed to be constant. In the context of data mining, x and y are numeric database attributes. The coefficients, w and b (called regression coefficients), specify the slope of the line and the y-intercept, respectively. 426/30/2020 NIMMY RAJU,AP,VKCET,TVM
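A minimal sketch of fitting such a line by least squares (illustrative x and y values; NumPy's polyfit is just one convenient way to obtain w and b):

```python
# A minimal sketch of simple linear regression y = w*x + b via least squares.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 6.2, 8.1, 9.9])

w, b = np.polyfit(x, y, deg=1)   # regression coefficients: slope and intercept
print(w, b)                      # only (w, b) need to be stored, not the raw data
```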
  • 43. • Multiple linear regression is an extension of (simple) linear regression, which allows a response variable, y, to be modeled as a linear function of two or more predictor variables. • Log-linear models approximate discrete multidimensional probability distributions. • Given a set of tuples in n dimensions (e.g., described by n attributes), consider each tuple as a point in an n-dimensional space. 6/30/2020 NIMMY RAJU,AP,VKCET,TVM 43
  • 44. 5. Histogram • Histograms use binning. • A histogram for an attribute, A, partitions the data distribution of A into disjoint subsets, referred to as buckets or bins. • If each bucket represents only a single attribute–value/frequency pair, the buckets are called singleton buckets. 44 6/30/2020 NIMMY RAJU,AP,VKCET,TVM
  • 46. • Equal-width: In an equal-width histogram, the width of each bucket range is uniform (e.g., the width of $10 for the buckets) • Equal-frequency (or equal-depth): In an equal-frequency histogram, the buckets are created so that, roughly, the frequency of each bucket is constant (i.e., each bucket contains roughly the same number of contiguous data samples). 6/30/2020 NIMMY RAJU,AP,VKCET,TVM 46
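A minimal sketch contrasting the two bucket types on illustrative prices (equal-width buckets span fixed ranges; equal-frequency buckets hold roughly the same number of sorted values):

```python
# A minimal sketch of equal-width vs. equal-frequency bucketing (illustrative prices).
prices = [5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215]
num_buckets = 3

# Equal-width: every bucket range has the same width
lo, width = min(prices), (max(prices) - min(prices)) / num_buckets
equal_width = {}
for p in prices:
    b = min(int((p - lo) / width), num_buckets - 1)
    equal_width.setdefault(b, []).append(p)

# Equal-frequency (equal-depth): every bucket holds roughly the same count
sorted_prices = sorted(prices)
depth = len(sorted_prices) // num_buckets
equal_freq = [sorted_prices[i:i + depth] for i in range(0, len(sorted_prices), depth)]

print(equal_width)   # skewed: most values fall into the first bucket
print(equal_freq)    # balanced: four values per bucket
```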
  • 48. 6. Clustering • Clustering techniques consider data tuples as objects. • They partition the objects into groups, or clusters, so that objects within a cluster are “similar” to one another and “dissimilar” to objects in other clusters. • Similarity is defined in terms of how “close” the objects are in space, based on a distance function. • The “quality” of a cluster may be represented by its diameter, the maximum distance between any two objects in the cluster. • Centroid distance is an alternative measure of cluster quality, defined as the average distance of each cluster object from the cluster centroid. 6/30/2020 NIMMY RAJU,AP,VKCET,TVM 48
  • 49. 7. Sampling • Sampling can be used as a data reduction technique because it allows a large data set to be represented by a much smaller random data sample (or subset). Suppose that a large data set, D, contains N tuples. • Simple random sample without replacement (SRSWOR) of size s: This is created by drawing s of the N tuples from D (s < N), where the probability of drawing any tuple in D is 1/N, that is, all tuples are equally likely to be sampled. • Simple random sample with replacement (SRSWR) of size s: Here each time a tuple is drawn from D, it is recorded and then replaced. That is, after a tuple is drawn, it is placed back in D so that it may be drawn again. 49 6/30/2020 NIMMY RAJU,AP,VKCET,TVM
  • 50. • Cluster sample: If the tuples in D are grouped into M mutually disjoint “clusters,” then an SRS of s clusters can be obtained, where s < M. • Stratified sample: If D is divided into mutually disjoint parts called strata, a stratified sample of D is generated by obtaining an SRS at each stratum. This helps ensure a representative sample, especially when the data are skewed. 6/30/2020 NIMMY RAJU,AP,VKCET,TVM 50
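A minimal sketch of the two simple random sampling variants using Python's standard library (D here is just an illustrative list standing in for the data set's tuples):

```python
# A minimal sketch of SRSWOR and SRSWR over an illustrative data set D of N tuples.
import random

D = list(range(1, 101))   # pretend these are N = 100 tuples
s = 10

srswor = random.sample(D, s)                    # without replacement: no tuple repeats
srswr = [random.choice(D) for _ in range(s)]    # with replacement: repeats are possible

print(srswor)
print(srswr)
```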
  • 52. 52 8. Data aggregation • Data can be aggregated so that the resulting data summarize, for example, total sales per year instead of per quarter. 6/30/2020 NIMMY RAJU,AP,VKCET,TVM
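A minimal sketch of such a roll-up, with hypothetical quarterly sales figures aggregated into annual totals:

```python
# A minimal sketch: aggregate quarterly sales (hypothetical figures) into yearly totals.
quarterly_sales = {
    (2018, "Q1"): 224, (2018, "Q2"): 408, (2018, "Q3"): 350, (2018, "Q4"): 586,
    (2019, "Q1"): 231, (2019, "Q2"): 421, (2019, "Q3"): 399, (2019, "Q4"): 612,
}

annual_sales = {}
for (year, _quarter), amount in quarterly_sales.items():
    annual_sales[year] = annual_sales.get(year, 0) + amount

print(annual_sales)   # {2018: 1568, 2019: 1663}
```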
  • 53. • Data cubes store multidimensional aggregated information. • For example, the figure shows a data cube for multidimensional analysis of sales data with respect to annual sales per item type. 6/30/2020 NIMMY RAJU,AP,VKCET,TVM 53
  • 54. • Each cell holds an aggregate data value, corresponding to the data point in multidimensional space. • Data cubes provide fast access to precomputed, summarized data, thereby benefiting online analytical processing as well as data mining. • The cube created at the lowest abstraction level is referred to as the base cuboid. • Lowest level should be usable, or useful for the analysis. • A cube at the highest level of abstraction is the apex cuboid. • The apex cuboid would give one total—the total sales for all three years, for all item types, and for all branches. • Each higher abstraction level further reduces the resulting data size. 6/30/2020 NIMMY RAJU,AP,VKCET,TVM 54
  • 55. 55 Data Transformation • A function that maps the entire set of values of a given attribute to a new set of replacement values s.t. each old value can be identified with one of the new values • Methods – Smoothing: Remove noise from data – Attribute/feature construction • New attributes constructed from the given ones – Aggregation: Summarization, data cube construction – Normalization: Scaled to fall within a smaller, specified range • min-max normalization • z-score normalization • normalization by decimal scaling – Discretization: Concept hierarchy climbing 6/30/2020 NIMMY RAJU,AP,VKCET,TVM
  • 56. Normalization • Min-max normalization: to [new_min_A, new_max_A]: v' = \frac{v - min_A}{max_A - min_A}(new\_max_A - new\_min_A) + new\_min_A – Ex. Let income range $12,000 to $98,000 be normalized to [0.0, 1.0]. Then $73,600 is mapped to (73,600 − 12,000) / (98,000 − 12,000) × (1.0 − 0) + 0 = 0.716 • Z-score normalization (μ: mean, σ: standard deviation): v' = \frac{v - \mu_A}{\sigma_A} – Ex. Let μ = 54,000, σ = 16,000. Then (73,600 − 54,000) / 16,000 = 1.225 • Normalization by decimal scaling: v' = \frac{v}{10^j}, where j is the smallest integer such that Max(|v'|) < 1 56 6/30/2020 NIMMY RAJU,AP,VKCET,TVM
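A minimal sketch of the three methods, reusing the slide's income example:

```python
# A minimal sketch of min-max, z-score, and decimal-scaling normalization.
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score(v, mean_a, std_a):
    return (v - mean_a) / std_a

def decimal_scaling(v, j):
    # j is the smallest integer such that max(|v'|) < 1 over the whole attribute
    return v / (10 ** j)

print(round(min_max(73600, 12000, 98000), 3))   # 0.716
print(round(z_score(73600, 54000, 16000), 3))   # 1.225
print(decimal_scaling(73600, 5))                # 0.736
```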
  • 57. Discretization • Three types of attributes – Nominal—values from an unordered set, e.g., color, profession – Ordinal—values from an ordered set, e.g., military or academic rank – Numeric—real numbers, e.g., integer or real numbers • Discretization: Divide the range of a continuous attribute into intervals – Interval labels can then be used to replace actual data values – Reduce data size by discretization – Supervised vs. unsupervised – Split (top-down) vs. merge (bottom-up) – Discretization can be performed recursively on an attribute – Prepare for further analysis, e.g., classification 576/30/2020 NIMMY RAJU,AP,VKCET,TVM
  • 58. 58 Data Discretization Methods • Typical methods (all can be applied recursively): – Binning • Top-down split, unsupervised – Histogram analysis • Top-down split, unsupervised – Clustering analysis (unsupervised, top-down split or bottom-up merge) – Decision-tree analysis (supervised, top-down split) – Correlation (e.g., χ²) analysis (supervised, bottom-up merge) 6/30/2020 NIMMY RAJU,AP,VKCET,TVM 58
  • 59. Simple Discretization: Binning • Equal-width (distance) partitioning – Divides the range into N intervals of equal size: uniform grid – if A and B are the lowest and highest values of the attribute, the width of intervals will be: W = (B –A)/N. – The most straightforward, but outliers may dominate presentation – Skewed data is not handled well • Equal-depth (frequency) partitioning – Divides the range into N intervals, each containing approximately same number of samples – Good data scaling – Managing categorical attributes can be tricky 596/30/2020 NIMMY RAJU,AP,VKCET,TVM
  • 60. Binning Methods for Data Smoothing  Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34 * Partition into equal-frequency (equi-depth) bins: - Bin 1: 4, 8, 9, 15 - Bin 2: 21, 21, 24, 25 - Bin 3: 26, 28, 29, 34 * Smoothing by bin means: - Bin 1: 9, 9, 9, 9 - Bin 2: 23, 23, 23, 23 - Bin 3: 29, 29, 29, 29 * Smoothing by bin boundaries: - Bin 1: 4, 4, 4, 15 - Bin 2: 21, 21, 25, 25 - Bin 3: 26, 26, 26, 34 606/30/2020 NIMMY RAJU,AP,VKCET,TVM
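A minimal sketch reproducing the example above: equal-frequency partitioning followed by smoothing by bin means and by bin boundaries:

```python
# A minimal sketch of equal-frequency binning and two smoothing strategies
# (prices taken from the example above).
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
depth = 4

bins = [prices[i:i + depth] for i in range(0, len(prices), depth)]

# Smoothing by bin means: every value is replaced by its bin's mean
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

# Smoothing by bin boundaries: every value is replaced by the closer of min/max
by_boundaries = [
    [min(b) if abs(v - min(b)) <= abs(v - max(b)) else max(b) for v in b]
    for b in bins
]

print(bins)            # [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
print(by_means)        # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(by_boundaries)   # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```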
  • 61. 61 Discretization Without Using Class Labels (Binning vs. Clustering) • [Figure: the same data discretized by equal interval width (binning), by equal frequency (binning), and by K-means clustering] • K-means clustering leads to better results. 6/30/2020 NIMMY RAJU,AP,VKCET,TVM 61
  • 62. 62 Discretization by Classification & Correlation Analysis • Classification (e.g., decision tree analysis) – Supervised: Given class labels, e.g., cancerous vs. benign – Using entropy to determine split point (discretization point) – Top-down, recursive split – Details to be covered in Chapter 7 • Correlation analysis (e.g., Chi-merge: χ2-based discretization) – Supervised: use class information – Bottom-up merge: find the best neighboring intervals (those having similar distributions of classes, i.e., low χ2 values) to merge – Merge performed recursively, until a predefined stopping condition 6/30/2020 NIMMY RAJU,AP,VKCET,TVM 62
  • 63. Concept Hierarchy Generation • Concept hierarchy organizes concepts (i.e., attribute values) hierarchically and is usually associated with each dimension in a data warehouse • Concept hierarchies facilitate drilling and rolling in data warehouses to view data in multiple granularity • Concept hierarchy formation: Recursively reduce the data by collecting and replacing low level concepts (such as numeric values for age) by higher level concepts (such as youth, adult, or senior) • Concept hierarchies can be explicitly specified by domain experts and/or data warehouse designers • Concept hierarchy can be automatically formed for both numeric and nominal data. For numeric data, use discretization methods shown. 636/30/2020 NIMMY RAJU,AP,VKCET,TVM
  • 64. Concept Hierarchy Generation for Nominal Data • Specification of a partial/total ordering of attributes explicitly at the schema level by users or experts – street < city < state < country • Specification of a hierarchy for a set of values by explicit data grouping – {Urbana, Champaign, Chicago} < Illinois • Specification of only a partial set of attributes – E.g., only street < city, not others • Automatic generation of hierarchies (or attribute levels) by the analysis of the number of distinct values – E.g., for a set of attributes: {street, city, state, country} 646/30/2020 NIMMY RAJU,AP,VKCET,TVM
  • 65. Automatic Concept Hierarchy Generation • Some hierarchies can be automatically generated based on the analysis of the number of distinct values per attribute in the data set – The attribute with the most distinct values is placed at the lowest level of the hierarchy – Exceptions, e.g., weekday, month, quarter, year – Example: street (674,339 distinct values) < city (3,567 distinct values) < province_or_state (365 distinct values) < country (15 distinct values) 65 6/30/2020 NIMMY RAJU,AP,VKCET,TVM
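A minimal sketch of that heuristic, ordering attributes by their (illustrative) distinct-value counts so that the most distinct attribute ends up at the lowest level:

```python
# A minimal sketch: derive a concept-hierarchy ordering from distinct-value counts.
distinct_counts = {
    "country": 15,
    "province_or_state": 365,
    "city": 3567,
    "street": 674339,
}

# most distinct values -> lowest level; fewest -> highest level
hierarchy = sorted(distinct_counts, key=distinct_counts.get, reverse=True)
print(" < ".join(hierarchy))   # street < city < province_or_state < country
```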