Data Preprocessing
By
S. Dinesh Babu
II MCA
Definition
 Data preprocessing is a data mining technique
that involves transforming raw data into an
understandable format.
 Data in the real world is dirty: it is often incomplete, noisy, and inconsistent.
Measures for data quality: a multidimensional view
◦ Accuracy: correct or wrong, accurate or not
◦ Completeness: not recorded, unavailable, …
◦ Consistency: some modified but some not,
dangling, …
◦ Timeliness: timely update?
◦ Believability: how trustworthy are the data?
◦ Interpretability: how easily the data can be
understood?
Major Tasks in Data Preprocessing
 Data Cleaning
 Data Integration
 Data Reduction
 Data Transformation and Data
Discretization
Data Cleaning: Incomplete
 Data is not always available
 Ex: Age = ” ” (the value was never recorded)
 Missing data may be due to
◦ equipment malfunction
◦ inconsistent with other recorded data and thus deleted
◦ data not entered due to misunderstanding
◦ certain data may not be considered important at the
time of entry
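As an illustration, here is a minimal sketch of two common ways to handle missing values using pandas; the DataFrame and the Age column are hypothetical examples, not data from these slides.

import numpy as np
import pandas as pd

# Hypothetical records with a missing Age value
df = pd.DataFrame({"Name": ["Anu", "Ravi", "Mala"],
                   "Age": [23, np.nan, 31]})

# Option 1: ignore the tuple (drop rows whose Age is missing)
df_dropped = df.dropna(subset=["Age"])

# Option 2: fill in the missing value with the attribute mean
df_filled = df.fillna({"Age": df["Age"].mean()})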
Noisy Data
 Noisy data is corrupted or meaningless data, often unstructured.
 It needlessly increases the amount of storage space required.
Causes:
◦ Hardware failure
◦ Programming errors
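One simple way to spot noisy numeric values is the interquartile-range (IQR) rule, a standard technique not named on these slides; the sketch below uses hypothetical sensor readings, one of which is clearly corrupted.

import pandas as pd

readings = pd.Series([10.1, 9.8, 10.3, 250.0, 10.0])  # 250.0 is likely noise

q1, q3 = readings.quantile([0.25, 0.75])
iqr = q3 - q1
# Flag values far outside the interquartile range as probable noise
noisy = (readings < q1 - 1.5 * iqr) | (readings > q3 + 1.5 * iqr)
print(readings[noisy])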
Data Cleaning as a Process
 Missing values, noise, and inconsistencies contribute to
inaccurate data.
 The first step in data cleaning as a process is
discrepancy detection.
 Discrepancies can be caused by several factors:
◦ Poorly designed data entry forms
◦ Human error in data entry
The data should also be examined regarding:
o Unique rule:
Each attribute value must be different from all other attribute
values.
o Consecutive rule:
No missing values between the lowest and highest values of the
attribute.
o Null rule:
Specifies how blanks, question marks, or special characters
indicate the null condition.
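A minimal sketch of checking the three rules on a single attribute with pandas; the id values are hypothetical.

import pandas as pd

ids = pd.Series([101, 102, 104, 104, None])
known = ids.dropna().astype(int)

# Unique rule: report values that appear more than once
duplicates = known[known.duplicated()]

# Consecutive rule: report gaps between the lowest and highest values
gaps = set(range(known.min(), known.max() + 1)) - set(known)

# Null rule: count entries recorded as null (blanks, "?", etc. would be
# mapped to NaN first, per whatever convention the null rule specifies)
null_count = ids.isna().sum()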
Data Integration
 Data integration merges data from multiple data stores.
 It can help reduce and avoid redundancies and
inconsistencies.
 It improves the accuracy and speed of the subsequent
data mining process.
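For example, a minimal sketch of integrating two hypothetical data stores on a shared key with pandas:

import pandas as pd

customers = pd.DataFrame({"cust_id": [1, 2], "name": ["Anu", "Ravi"]})
orders = pd.DataFrame({"cust_id": [1, 1, 2], "amount": [250, 90, 400]})

# Merge the two stores on the common key attribute
merged = customers.merge(orders, on="cust_id", how="inner")
# Redundant attributes (the same information under different names)
# would be detected and dropped after the merge.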
Data Reduction
 To obtain a reduced representation of the data set that is
much smaller in volume, yet produces the same (or almost the
same) analytical results.
Strategies for data reduction include the following:
 Data cube aggregation, where aggregation operations
are applied to the data in the construction of a data cube.
 Attribute subset selection, where irrelevant, weakly
relevant, or redundant attributes or dimensions may be
detected and removed.
 Dimensionality reduction, where encoding mechanisms are
used to reduce the data set size.
 Numerosity reduction, where the data are replaced or
estimated by alternative, smaller data representations, such as:
◦ Parametric models
◦ Nonparametric methods such as clustering, sampling,
and the use of histograms.
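A minimal sketch of two of these strategies with pandas and NumPy; the DataFrame and column names are hypothetical.

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df = pd.DataFrame({"x": rng.random(10_000),
                   "irrelevant": rng.random(10_000)})

# Attribute subset selection: drop an attribute judged irrelevant
reduced = df.drop(columns=["irrelevant"])

# Numerosity reduction by sampling: keep a 10% random sample of the rows
sample = df.sample(frac=0.10, random_state=42)

# Numerosity reduction with a histogram: 20 bin counts stand in
# for the 10,000 raw values of x
counts, bin_edges = np.histogram(df["x"], bins=20)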
Data Transformation
 In data transformation, the data are transformed or
consolidated into forms appropriate for mining.
Data transformation can involve the following:
 Smoothing: remove noise from data
 Aggregation: summarization, data cube construction
 Generalization: concept hierarchy climbing
 Normalization: scaled to fall within a small,
specified range
◦ min-max normalization (see the sketch below)
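Min-max normalization maps a value v of attribute A to v' = (v - min_A) / (max_A - min_A) * (new_max - new_min) + new_min. A minimal sketch with hypothetical values, scaling to the range [0, 1]:

import pandas as pd

values = pd.Series([200, 300, 400, 600, 1000])
new_min, new_max = 0.0, 1.0

normalized = ((values - values.min()) / (values.max() - values.min())
              * (new_max - new_min) + new_min)
# 200 -> 0.0, 1000 -> 1.0, the remaining values fall in between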
Data Discretization
 Discretization: Divide the range of a continuous attribute
into intervals
◦ Interval labels can then be used to replace actual data
values
◦ Reduces data size
◦ Split (top-down) vs. merge (bottom-up)
◦ Discretization can be performed recursively on an
attribute
◦ Prepare for further analysis, e.g., classification
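A minimal sketch of top-down (splitting) discretization into equal-width intervals with pandas; the ages and interval labels are hypothetical.

import pandas as pd

ages = pd.Series([3, 17, 25, 38, 52, 71])

# Divide the continuous range into three equal-width intervals and
# replace each raw value with its interval label
labels = pd.cut(ages, bins=3, labels=["young", "middle_aged", "senior"])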
 Three types of attributes
◦ Nominal—values from an unordered set, e.g., color, profession
◦ Ordinal—values from an ordered set, e.g., military or academic
rank
◦ Numeric—quantitative values, e.g., integers or real numbers
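In pandas, the three attribute types can be represented explicitly; the columns and category values below are hypothetical.

import pandas as pd

df = pd.DataFrame({
    # Nominal: values from an unordered set
    "color": pd.Categorical(["red", "blue"]),
    # Ordinal: values from an ordered set
    "rank": pd.Categorical(["major", "captain"],
                           categories=["captain", "major"], ordered=True),
    # Numeric: integer or real numbers
    "salary": [52000.0, 61000.0],
})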
Thank You
