Data preprocessing is a data mining technique
that involves transforming raw data into an understandable format.
Data in the real world is dirty: incomplete, noisy, and inconsistent.
Measures for data quality: a multidimensional view
◦ Accuracy: correct or wrong, accurate or not
◦ Completeness: not recorded, unavailable, …
◦ Consistency: some modified but some not, dangling, …
◦ Timeliness: timely update?
◦ Believability: how trustable the data are
◦ Interpretability: how easily the data can be understood
Major Tasks in Data Preprocessing
Data cleaning, data integration, data reduction, and
data transformation and data discretization
Data Cleaning: Incomplete (Missing) Data
Data is not always available
Missing data may be due to
◦ equipment malfunction
◦ inconsistent with other recorded data and thus deleted
◦ data not entered due to misunderstanding
◦ certain data may not be considered important at the
time of entry
◦ Handling missing data may increase the amount of storage space required.
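One common way to handle incomplete data is to fill in missing values, e.g., with the attribute mean. A minimal sketch (the function name and sample ages are hypothetical):

```python
def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

# Hypothetical attribute with two missing entries
ages = [25, None, 31, 40, None]
print(impute_mean(ages))  # missing ages replaced by 32.0, the mean of 25, 31, 40
```

Mean imputation is simple but biases the distribution toward the center; more careful methods predict the missing value from the other attributes.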
Data Cleaning as a Process
Missing values, noise, and inconsistencies contribute to
inaccurate data.
The first step in data cleaning as a process is discrepancy detection.
Discrepancies can be caused by several factors:
◦ poorly designed data entry forms
◦ human error in data entry
The data should also be examined regarding:
o Unique rule:
Each attribute value must be different from all other attribute values.
o Consecutive rule:
No missing values between the lowest and highest values of the attribute.
o Null rule:
Specifies the use of blanks, question marks, special characters, or other strings that may indicate the null condition.
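The three rules above can be sketched as simple checks; the function names and the set of null markers are assumptions for illustration:

```python
def check_unique(values):
    # Unique rule: every value must differ from all others
    return len(values) == len(set(values))

def check_consecutive(values):
    # Consecutive rule: no gaps between the lowest and highest value
    return sorted(values) == list(range(min(values), max(values) + 1))

NULL_MARKERS = {"", "?", "N/A"}  # assumed strings that indicate the null condition

def check_null(value):
    # Null rule: recognize the agreed markers for missing values
    return value in NULL_MARKERS

print(check_unique([101, 102, 103]))     # IDs all distinct
print(check_consecutive([3, 1, 2]))      # 1..3 with no gaps
print(check_null("?"))                   # "?" is an agreed null marker
```

Running such checks over each attribute flags discrepancies for later correction.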
Data Integration: the merging of data from multiple data stores.
It can help reduce and avoid redundancies and inconsistencies.
It improves the accuracy and speed of the subsequent
data mining process.
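A minimal sketch of integration: merging two hypothetical record stores keyed on the same entity identifier, so each entity ends up with a single record instead of redundant duplicates:

```python
# Two hypothetical data stores keyed on customer id
store_a = {1: {"name": "Ada"}, 2: {"name": "Bo"}}
store_b = {2: {"name": "Bo"}, 3: {"name": "Cy"}}

# Merge on the shared key; the later store wins on collisions,
# so the duplicate record for customer 2 is kept only once
merged = {**store_a, **store_b}
print(sorted(merged))  # each customer appears once: [1, 2, 3]
```

Real integration also has to resolve schema mismatches (the same attribute under different names) and conflicting values for the same entity, which this sketch sidesteps.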
Data Reduction: to obtain a reduced representation of the data set that is
much smaller in volume, yet closely maintains the integrity of the original data.
Strategies for data reduction include the following:
Data cube aggregation, where aggregation operations
are applied to the data in the construction of a data cube.
Attribute subset selection, where irrelevant, weakly
relevant, or redundant attributes or dimensions may be
detected and removed.
Dimensionality reduction, where encoding mechanisms are
used to reduce the data set size.
Numerosity reduction, where the data are replaced or
estimated by alternative, smaller data representations such as
parametric models, or nonparametric methods such as clustering,
sampling, and the use of histograms.
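Two of the nonparametric numerosity-reduction methods above, sampling and histograms, can be sketched as follows (the data and bucket count are hypothetical):

```python
import random

data = list(range(1, 101))  # hypothetical attribute values 1..100

# Numerosity reduction by simple random sampling without replacement:
# keep a small representative subset instead of all 100 values
random.seed(0)
sample = random.sample(data, 10)
print(len(sample))

# ...and by an equal-width histogram: store only bucket counts,
# not the raw values
def histogram(values, buckets):
    lo, hi = min(values), max(values)
    width = (hi - lo) / buckets
    counts = [0] * buckets
    for v in values:
        i = min(int((v - lo) / width), buckets - 1)
        counts[i] += 1
    return counts

print(histogram(data, 4))  # four buckets of 25 values each
```

In both cases the reduced representation (10 values, or 4 counts) stands in for the full data set during analysis.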
In data transformation, the data are transformed or
consolidated into forms appropriate for mining.
Data transformation can involve the following:
Smoothing: remove noise from data
Aggregation: summarization, data cube construction
Generalization: concept hierarchy climbing
Normalization: scaled to fall within a small, specified range
Discretization: divide the range of a continuous attribute into intervals
◦ Interval labels can then be used to replace actual data values
◦ Reduce data size by discretization
◦ Split (top-down) vs. merge (bottom-up)
◦ Discretization can be performed recursively on an attribute
◦ Prepare for further analysis, e.g., classification
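Normalization and top-down (split) discretization from the list above can be sketched as follows; the price data, function names, and bin count are assumptions for illustration:

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    # Normalization: scale values to fall within a small, specified range
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min
            for v in values]

def equal_width_bins(values, k):
    # Split (top-down) discretization into k equal-width intervals,
    # replacing each value with an interval label (0..k-1)
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [min(int((v - lo) / width), k - 1) for v in values]

prices = [10, 20, 30, 40, 50]
print(min_max_normalize(prices))    # [0.0, 0.25, 0.5, 0.75, 1.0]
print(equal_width_bins(prices, 2))  # interval labels replace the raw prices
```

Equal-width binning is the simplest split method; merge (bottom-up) methods instead start from many fine intervals and combine neighbors.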
Three types of attributes
◦ Nominal—values from an unordered set, e.g., color, profession
◦ Ordinal—values from an ordered set, e.g., military or academic rank
◦ Numeric—quantitative values, e.g., integer or real numbers