Data Preprocessing
V. Saranya
AP/CSE
Sri Vidya College of Engineering & Technology, Virudhunagar
Preprocessing
• Real-world databases are highly prone to noise
– Causes: missing values, inconsistent data, huge size
– Data comes from multiple sources
– Result: low-quality data
– Low-quality data leads to low-quality mining results
Techniques
1. Data Cleaning
2. Data Transformation
3. Data Integration
4. Data Reduction
Data Cleaning
• Remove noise and correct inconsistencies by:
– Filling in missing values
– Smoothing noisy data
– Identifying outliers
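
The three cleaning steps can be sketched in a few lines of Python. This is only an illustrative sketch, not a prescribed method: the function names, the choice of the mean as the fill value, smoothing by bin means, and the 1.5 × IQR outlier rule are all assumptions made for the example.

```python
from statistics import mean, quantiles

def fill_missing(values):
    """Replace None entries with the mean of the observed values (assumed strategy)."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

def smooth_by_bin_means(values, bin_size):
    """Sort the values, split them into equal-size bins, replace each value by its bin mean."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        smoothed.extend([mean(bin_values)] * len(bin_values))
    return smoothed

def find_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR] (assumed outlier rule)."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical sample data
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
filled = fill_missing([10, None, 30])
smoothed = smooth_by_bin_means(prices, bin_size=3)
outliers = find_outliers(prices + [200])
```

Here each bin of three sorted prices is replaced by its mean, and the value 200 is the only one flagged as an outlier.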
Data Integration
• Merge data from multiple sources into a single coherent store, such as a data warehouse (DW).
Issue:
• The same attribute may have different names in different databases, which can cause inconsistencies.
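
A minimal sketch of the naming issue, assuming two hypothetical sources that call the same attribute `cust_id` and `customer_number`. The field names, sample records, and helper functions are invented for illustration. The idea is to map source-specific names onto one shared name before merging on the key.

```python
def rename_fields(record, mapping):
    """Return a copy of the record with source-specific names mapped to shared names."""
    return {mapping.get(name, name): value for name, value in record.items()}

def integrate(source_a, source_b, key):
    """Merge records from two sources that share the same key value."""
    merged = {}
    for record in source_a + source_b:
        merged.setdefault(record[key], {}).update(record)
    return list(merged.values())

# Hypothetical sample data: the same customer key under two different names
sales = [{"cust_id": 1, "amount": 250}, {"cust_id": 2, "amount": 90}]
crm = [{"customer_number": 1, "city": "Chennai"},
       {"customer_number": 2, "city": "Madurai"}]

crm_renamed = [rename_fields(r, {"customer_number": "cust_id"}) for r in crm]
integrated = integrate(sales, crm_renamed, key="cust_id")
```

Without the renaming step the two sources would share no key, so their records could not be matched.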
Data Transformation
• Normalization
– Ex: -2, 32, 100 → -0.02, 0.32, 1.00 (each value divided by 100)
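
The slide does not name the normalization method. The sketch below assumes simple scaling by a power of ten, which reproduces the values -2, 32, 100 divided by 100, and also shows min-max normalization as one common alternative. Both function names are hypothetical.

```python
def scale_by_power_of_ten(values, j):
    """Divide every value by 10**j so the results fall in a small range."""
    divisor = 10 ** j
    return [v / divisor for v in values]

def min_max(values, new_min=0.0, new_max=1.0):
    """Map values linearly from [min, max] onto [new_min, new_max]."""
    lo, hi = min(values), max(values)
    span = hi - lo  # assumes the values are not all equal
    return [(v - lo) / span * (new_max - new_min) + new_min for v in values]

data = [-2, 32, 100]
scaled = scale_by_power_of_ten(data, j=2)
rescaled = min_max(data)
```

With `j=2` the first function yields -0.02, 0.32, 1.0. Min-max normalization instead maps the smallest value to 0 and the largest to 1.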
Data Reduction
• Reduce data size by aggregating or
clustering.
Data reduction includes
1. Data aggregation:
– building a data cube
2. Data generalization:
– using concept hierarchies
3. Attribute subset selection:
– removing irrelevant attributes through correlation analysis
4. Dimensionality reduction:
– reducing the number of dimensions
5. Numerosity reduction:
– replacing the data with smaller alternative representations
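
Two of the reduction techniques listed above, aggregation and attribute subset selection, can be sketched as follows. The sample data, the function names, the use of Pearson correlation against a target attribute, and the 0.5 threshold are all assumptions made for illustration.

```python
from collections import defaultdict
from math import sqrt

def aggregate(rows, group_key, value_key):
    """Sum value_key per group_key, e.g. quarterly sales rolled up to yearly sales."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[group_key]] += row[value_key]
    return dict(totals)

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length, non-constant lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def select_attributes(columns, target, threshold=0.5):
    """Keep attributes whose absolute correlation with the target meets the threshold."""
    return [name for name, values in columns.items()
            if abs(pearson(values, target)) >= threshold]

# Hypothetical sample data
quarterly = [
    {"year": 2023, "sales": 100}, {"year": 2023, "sales": 150},
    {"year": 2023, "sales": 120}, {"year": 2023, "sales": 130},
    {"year": 2024, "sales": 110}, {"year": 2024, "sales": 160},
    {"year": 2024, "sales": 140}, {"year": 2024, "sales": 190},
]
yearly = aggregate(quarterly, "year", "sales")

columns = {"ad_spend": [2, 4, 6, 8, 10], "shoe_size": [3, 1, 2, 1, 3]}
kept = select_attributes(columns, target=[1, 2, 3, 4, 5])
```

Aggregation shrinks eight quarterly rows to two yearly totals. The selection step keeps `ad_spend`, which tracks the target, and drops `shoe_size`, which is uncorrelated with it.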
Need for preprocessing
• Incomplete, noisy, and inconsistent data are common in large real-world databases.
Reasons for incomplete data
• Attributes of interest may not be available
• Data may not be recorded because of a misunderstanding
• Data may have been deleted
• Missing data
• Faulty data
• Errors in data transformation
Data Discretization
• A form of data reduction: raw numeric values are replaced by a small number of intervals or higher-level labels
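
A minimal sketch of discretization, assuming equal-width binning. The function name, the sample ages, and the labels are hypothetical. Each numeric value is replaced by the index of the interval it falls into, so many distinct values collapse into a few categories.

```python
def equal_width_bins(values, n_bins):
    """Assign each value the index (0 .. n_bins-1) of its equal-width interval."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:  # all values identical: everything falls in the first bin
        return [0] * len(values)
    # The maximum value would land in bin n_bins, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

# Hypothetical sample data and labels
ages = [5, 17, 25, 40, 62, 80]
labels = ["young", "middle", "senior"]
bin_indexes = equal_width_bins(ages, n_bins=3)
age_groups = [labels[i] for i in bin_indexes]
```

Equal-width binning is only one option. Equal-frequency binning or cluster-based intervals could be substituted without changing the overall idea.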
