A Brief Presentation on Data Mining
Jason Rodrigues
Data Preprocessing
• Introduction
• Why data preprocessing?
• Data Cleaning
• Data Integration and
Transformation
• Data Reduction
• Discretization and concept
hierarchy generation
• Takeaways
Agenda
Why Data Preprocessing?
Data in the real world is dirty

incomplete: lacking attribute values, lacking certain attributes
of interest, or containing only aggregate data

noisy: containing errors or outliers

inconsistent: containing discrepancies in codes or names
No quality data, no quality mining results!

Quality decisions must be based on quality data

Data warehouse needs consistent integration of quality data
A multi-dimensional measure of data quality

A well-accepted multi-dimensional view:

accuracy, completeness, consistency, timeliness, believability, value
added, interpretability, accessibility
Broad categories

intrinsic, contextual, representational, and accessibility
Data Preprocessing
Major Tasks of Data Preprocessing
Data cleaning

Fill in missing values, smooth noisy data, identify or remove outliers, and
resolve inconsistencies
Data integration

Integration of multiple databases, data cubes, or files
Data transformation

Normalization (scaling to a specific range)

Aggregation
Data reduction

Obtains a reduced representation that is much smaller in volume yet
produces the same or similar analytical results

Data discretization: part of data reduction, of particular importance
for numerical data

Data aggregation, dimensionality reduction, data compression,
generalization
Data Preprocessing
Major Tasks of Data Preprocessing
[Diagram: the knowledge-discovery pipeline. Data Cleaning and Data
Integration turn Databases into a Data Warehouse; Selection yields
Task-relevant Data, followed by Data Mining and Pattern Evaluation]
Data Cleaning
Tasks of Data Cleaning

Fill in missing values

Identify outliers and smooth noisy data

Correct inconsistent data
Data Cleaning
Manage Missing Data

Ignore the tuple: usually done when class label is missing (assuming the
task is classification—not effective in certain cases)

Fill in the missing value manually: tedious, and often infeasible

Use a global constant to fill in the missing value: e.g., “unknown”
(in effect, a new class)

Use the attribute mean to fill in the missing value

Use the attribute mean for all samples of the same class to fill in
the missing value: smarter

Use the most probable value to fill in the missing value: inference-
based such as regression, Bayesian formula, decision tree
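As a small illustration of the mean-based strategies above, here is a hedged pandas sketch; the DataFrame, column names, and values are invented for the example:

```python
import numpy as np
import pandas as pd

# Hypothetical data: "cls" is the class label, "income" has gaps.
df = pd.DataFrame({
    "cls":    ["a", "a", "b", "b", "b"],
    "income": [30.0, np.nan, 50.0, np.nan, 70.0],
})

# Fill with the global attribute mean
filled_global = df["income"].fillna(df["income"].mean())

# Smarter: fill with the mean of samples belonging to the same class
filled_by_class = df["income"].fillna(
    df.groupby("cls")["income"].transform("mean"))

print(filled_global.tolist())    # [30.0, 50.0, 50.0, 50.0, 70.0]
print(filled_by_class.tolist())  # [30.0, 30.0, 50.0, 60.0, 70.0]
```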
Data Cleaning
Manage Noisy Data
Binning method (see the sketch at the end of this slide):

first sort the data and partition it into (equi-depth) bins

then smooth by bin means, by bin medians,
or by bin boundaries, etc.
Clustering:

detect and remove outliers
Semi-automated:

combined computer and manual intervention
Regression

Use regression functions
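A minimal sketch of equi-depth binning with smoothing by bin means, assuming plain numeric input; the function name and example data are illustrative:

```python
import numpy as np

def smooth_by_bin_means(values, n_bins):
    """Equi-depth binning: sort, split into (nearly) equal-count bins,
    then replace every value in a bin by its bin mean."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values)
    smoothed = np.empty_like(values)
    start = 0
    for bin_vals in np.array_split(values[order], n_bins):
        # write the bin mean back to the original positions
        smoothed[order[start:start + len(bin_vals)]] = bin_vals.mean()
        start += len(bin_vals)
    return smoothed

# Illustrative sorted prices split into 3 equi-depth bins
print(smooth_by_bin_means([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34], 3))
# → [9. 9. 9. 9. 22.75 22.75 22.75 22.75 29.25 29.25 29.25 29.25]
```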
Data Cleaning
Cluster Analysis
Data Cleaning
Regression Analysis
[Plot: regression line y = x + 1 fit to data; for a point with
observed value Y1 at X1, Y1’ is the value predicted by the line]
• Linear regression (best line to fit
two variables)
• Multiple linear regression (more
than two variables, fit to a
multidimensional surface)
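A brief numpy sketch of fitting such a line by least squares; the data points are invented to roughly follow y = x + 1:

```python
import numpy as np

# Invented noisy data roughly following y = x + 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 1.9, 3.2, 3.8, 5.1])

slope, intercept = np.polyfit(x, y, deg=1)  # least-squares best-fit line
y_smoothed = slope * x + intercept          # replace noisy y by predictions
print(round(slope, 3), round(intercept, 3))
```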
Data Cleaning
Inconsistent Data

Manual correction using external
references

Semi-automatic using various tools
− To detect violation of known functional
dependencies and data constraints
− To correct redundant data
Data integration and transformation
Tasks of Data Integration and transformation

Data integration:
− combines data from multiple sources into a coherent
store

Schema integration
− integrate metadata from different sources
− Entity identification problem: identify real world entities
from multiple data sources, e.g., A.cust-id ≡ B.cust-#

Detecting and resolving data value conflicts
− for the same real world entity, attribute values from
different sources are different
− possible reasons: different representations, different
scales, e.g., metric vs. British units, different currency
Manage Data Integration
Data integration and transformation

Redundant data occur often when integrating multiple DBs
− The same attribute may have different names in different databases
− One attribute may be a “derived” attribute in another table, e.g., annual
revenue

Redundant data can often be detected by correlation analysis, e.g.,
the correlation coefficient shown below

Careful integration can help reduce/avoid redundancies and
inconsistencies and improve mining speed and quality
$$ r_{A,B} = \frac{\sum (A - \bar{A})(B - \bar{B})}{(n - 1)\,\sigma_A \sigma_B} $$

where n is the number of tuples, and σ_A, σ_B are the standard
deviations of A and B
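A small numpy sketch of this coefficient for flagging a possibly redundant attribute pair; the data and the redundancy interpretation threshold are illustrative assumptions:

```python
import numpy as np

def corr_coefficient(a, b):
    """Pearson correlation r_{A,B} as defined above (sample std, ddof=1)."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    n = len(a)
    return ((a - a.mean()) * (b - b.mean())).sum() / (
        (n - 1) * a.std(ddof=1) * b.std(ddof=1))

monthly = np.array([10.0, 12.0, 9.0, 15.0, 11.0])
annual = 12 * monthly                  # a "derived" attribute
print(corr_coefficient(monthly, annual))  # 1.0 → likely redundant
```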
Manage Data Transformation
Data integration and transformation
 Smoothing: remove noise from data (binning, clustering,
regression)
 Aggregation: summarization, data cube construction
 Generalization: concept hierarchy climbing
 Normalization: scaled to fall within a small, specified range
− min-max normalization
− z-score normalization
− normalization by decimal scaling
 Attribute/feature construction
− New attributes constructed from the given ones
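A hedged sketch of the three normalizations named above for a 1-D numpy array; the target range and function names are assumptions for the example:

```python
import numpy as np

def min_max(v, new_min=0.0, new_max=1.0):
    # Min-max normalization: linearly rescale to [new_min, new_max]
    return (v - v.min()) / (v.max() - v.min()) * (new_max - new_min) + new_min

def z_score(v):
    # Z-score normalization: zero mean, unit standard deviation
    return (v - v.mean()) / v.std()

def decimal_scaling(v):
    # Divide by 10^j for the smallest integer j with max(|v'|) < 1
    j = 0
    while np.abs(v).max() / 10 ** j >= 1:
        j += 1
    return v / 10 ** j

v = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])
print(min_max(v), z_score(v), decimal_scaling(v), sep="\n")
```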
Manage Data Reduction
Data reduction
Data reduction: reduced representation, while still retaining critical
information

Data cube aggregation

Dimensionality reduction

Data compression

Numerosity reduction

Discretization and concept hierarchy generation
Data Cube Aggregation
Data reduction

Multiple levels of aggregation in data cubes
− Further reduce the size of data to deal with

Reference appropriate levels: use the smallest representation capable
of solving the task
Data Compression
Data reduction

String compression
− There are extensive theories and well-tuned algorithms
− Typically lossless
− But only limited manipulation is possible without expansion

Audio/video, image compression
− Typically lossy compression, with progressive refinement
− Sometimes small fragments of signal can be reconstructed
without reconstructing the whole

Time sequences are not audio
− typically short, and they vary slowly with time
Decision Tree
Data reduction
[Figure: example decision tree]
Similarities and Dissimilarities
Proximity

Proximity is used to refer to either similarity or
dissimilarity, since the proximity between two
objects is a function of the proximity between
the corresponding attributes of the two objects.

Similarity: Numeric measure of the degree
to which the two objects are alike.

Dissimilarity: Numeric measure of the
degree to which the two objects are
different.
Dissimilarities between Data Objects?

Similarity
− Numerical measure of how alike two data
objects are.
− Is higher when objects are more alike.
− Often falls in the range [0,1]

Dissimilarity
− Numerical measure of how different are two data
objects
− Lower when objects are more alike
− Minimum dissimilarity is often 0
− Upper limit varies

Proximity refers to a similarity or dissimilarity
Euclidean Distance

Euclidean Distance
$$ \mathrm{dist}(p, q) = \sqrt{\sum_{k=1}^{n} (p_k - q_k)^2} $$

where n is the number of dimensions (attributes) and p_k and q_k are,
respectively, the kth attributes (components) of data objects p and q.

Standardization is necessary if scales differ.
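A minimal sketch of this distance, assuming numpy arrays and z-score standardization of each attribute when scales differ; the example attributes are invented:

```python
import numpy as np

def euclidean(p, q):
    p, q = np.asarray(p, float), np.asarray(q, float)
    return np.sqrt(((p - q) ** 2).sum())

# When scales differ, standardize each attribute (column) first.
X = np.array([[180.0, 70.0],     # e.g., height in cm, weight in kg
              [170.0, 80.0],
              [160.0, 60.0]])
Z = (X - X.mean(axis=0)) / X.std(axis=0)
print(euclidean(Z[0], Z[1]))
```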
Euclidean Distance

Euclidean Distance
$$ \mathrm{dist}(p, q) = \sqrt{\sum_{k=1}^{n} (p_k - q_k)^2} $$

[Scatter plot: example points in the plane; their pairwise distances
appear in the tables on the next slide]
Minkowski Distance
 r = 1. City block (Manhattan, taxicab, L1 norm) distance.
− A common example of this is the Hamming distance, which is just
the number of bits that are different between two binary vectors

r = 2. Euclidean distance
 r → ∞. “supremum” (Lmax norm, L∞ norm) distance.
− This is the maximum difference between any component of the
vectors
− Example: L_infinity of (1, 0, 2) and (6, 0, 3) = ??
− Do not confuse r with n, i.e., all these distances are
defined for all numbers of dimensions.
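For reference, the general Minkowski form that these cases specialize (standard definition, stated here for completeness):

$$ \mathrm{dist}(p, q) = \left( \sum_{k=1}^{n} |p_k - q_k|^r \right)^{1/r} $$

Worked answer to the example above: the supremum distance of (1, 0, 2) and (6, 0, 3) is max(|1 − 6|, |0 − 0|, |2 − 3|) = 5.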
Minkowski Distance
point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1

L1    p1  p2  p3  p4
p1     0   4   4   6
p2     4   0   2   4
p3     4   2   0   2
p4     6   4   2   0

L2    p1     p2     p3     p4
p1    0      2.828  3.162  5.099
p2    2.828  0      1.414  3.162
p3    3.162  1.414  0      2
p4    5.099  3.162  2      0

L∞    p1  p2  p3  p4
p1     0   2   3   5
p2     2   0   1   3
p3     3   1   0   2
p4     5   3   2   0
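A short sketch reproducing these matrices with scipy, whose cdist supports all three metrics under the names shown; the rounding is only for display:

```python
import numpy as np
from scipy.spatial.distance import cdist

pts = np.array([[0, 2], [2, 0], [3, 1], [5, 1]], dtype=float)

for name in ("cityblock", "euclidean", "chebyshev"):  # L1, L2, L-infinity
    print(name)
    print(np.round(cdist(pts, pts, metric=name), 3))
```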
Euclidean Distance Properties
• Distances, such as the Euclidean distance,
have some well known properties.
1. d(x, y) ≥ 0 for all x and y, and d(x, y) = 0 only if
x = y. (Positive definiteness)
2. d(x, y) = d(y, x) for all x and y. (Symmetry)
3. d(x, z) ≤ d(x, y) + d(y, z) for all points x, y, and z.
(Triangle Inequality)
where d(x, y) is the distance (dissimilarity) between
points (data objects), x and y.
• A distance that satisfies these properties is a
metric, and a space is called a metric space
Non Metric Dissimilarities – Set Differences

non-metric measures are often robust
(resistant to outliers, errors in objects, etc.)
− symmetry and, especially, the triangle inequality
are often violated

hence they cannot be directly used with
MAMs (metric access methods)
[Diagram: triangle-inequality violation (a > b + c) and asymmetry (a ≠ b)]
Non Metric Dissimilarities – Time

various k-median distances
− measure distance between the two (k-th) most
similar portions in objects

COSIMIR
− back-propagation network with single output
neuron serving as a distance, allows training

Dynamic Time Warping distance (see the sketch after this list)
− sequence alignment technique
− minimizes the sum of distances between
sequence elements
 fractional Lp distances
− generalization of Minkowski distances (p<1)
− more robust to extreme differences in coordinates
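A minimal dynamic-programming sketch of the DTW idea described above, assuming 1-D numeric sequences and absolute difference as the per-element cost; names are illustrative:

```python
import numpy as np

def dtw(a, b):
    """Dynamic Time Warping: minimal cumulative cost of a monotonic,
    continuous alignment of sequence a to sequence b."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j],      # stretch a
                                 D[i, j - 1],      # stretch b
                                 D[i - 1, j - 1])  # match
    return D[n, m]

print(dtw([1, 2, 3], [1, 2, 2, 3]))  # 0.0: the repeated 2 is absorbed
```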
Jaccard Coefficient

Recall: Jaccard coefficient is a commonly used
measure of overlap of two sets A and B
jaccard(A,B) = |A ∩ B| / |A ∪ B|
jaccard(A,A) = 1
jaccard(A,B) = 0 if A ∩ B = ∅

A and B don’t have to be the same size.

JC always assigns a number between 0 and 1.
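A one-function sketch using Python sets; the convention for two empty sets is an assumption, since the slide leaves jaccard(∅, ∅) undefined:

```python
def jaccard(a: set, b: set) -> float:
    # |A ∩ B| / |A ∪ B|; define jaccard(∅, ∅) = 1 by convention (assumption)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

print(jaccard({1, 2, 3}, {2, 3, 4}))  # 2/4 = 0.5
```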
Takeaways
Why Data Preprocessing?
Data Cleaning
Data Integration and Transformation
Data Reduction
Discretization and concept hierarchy
generation