This document discusses why data preprocessing is important before data mining or analysis. It involves cleaning data by filling in missing values and correcting inconsistencies. It also includes integrating and transforming data from multiple sources, as well as reducing data through techniques like aggregation, dimensionality reduction, and discretization of continuous attributes. The goal of preprocessing is to produce clean, consistent data to ensure quality mining and analytical results.
3. Why preprocess the data?
Data cleaning
Data integration and transformation
Data reduction
Discretization and concept hierarchy generation
4. Why Data Preprocessing?
Data in the real world is dirty
incomplete: lacking attribute values, lacking certain
attributes of interest, or containing only aggregate data
noisy: containing errors or outliers
inconsistent: containing discrepancies in codes or
names
No quality data, no quality mining results
Quality decisions must be based on quality data
Data warehouse needs consistent integration of quality
data
5. O Data cleaning tasks
O Fill in missing values
O Identify outliers and smooth out noisy data
O Correct inconsistent data
duplicate records
incomplete data
inconsistent data
6. O Data integration:
O combines data from multiple sources into a coherent store
O Schema integration
O integrate metadata from different sources
O Entity identification problem: identify real world entities from
multiple data sources,
O Detecting and resolving data value conflicts
O for the same real world entity, attribute values from different
sources are different
O possible reasons: different representations, different scales,
e.g., metric vs. British units
7. O Smoothing: remove noise from data
O Aggregation: summarization, data cube construction
O Generalization: concept hierarchy climbing
O Normalization: scaled to fall within a small, specified
range
O min-max normalization
O z-score normalization
O normalization by decimal scaling
8. O Data reduction
O Obtains a reduced representation of the data set
that is much smaller in volume but yet produces
the same (or almost the same) analytical results
O Data reduction strategies
O Data cube aggregation
O Dimensionality reduction
O Numerosity reduction
O Discretization and concept hierarchy generation
9. O Discretization:
divide the range of a continuous attribute into intervals
O Some classification algorithms only accept categorical
attributes.
O Reduce data size by discretization
O Prepare for further analysis
O reduce the number of values for a given continuous
attribute by dividing the range of the attribute into
intervals. Interval labels can then be used to replace
actual data values.