Data Preprocessing

BY:
K.KOTTAISAMY
II MCA
Data Preprocessing

Data cleaning
Data integration
Data transformation
Data reduction

Data discretization
Major tasks in Data Preprocessing

Data cleaning
Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies

Data integration
Integration of multiple databases, data cubes, or files

Data transformation
Normalization and aggregation
Data reduction
Obtain a reduced representation that is smaller in volume but
produces the same or similar analytical results

Data discretization
Part of data reduction but with particular importance,
especially for numerical data
Data Preprocessing
Data Cleaning
Data in the Real World Is Dirty:
Lots of incorrect data, e.g., faulty instruments, human or
computer error, transmission error
incomplete:
lacking attribute values, containing only aggregate data
e.g., Occupation = “ ” (missing data)
noisy: containing noise, errors, or outliers
e.g., Salary = “−10” (an error)
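The two dirty-data cases above (a missing occupation and a negative salary) can be handled with a minimal cleaning sketch. The records, field names, and the choice to fill errors with the attribute mean are illustrative assumptions, not a prescribed method:

```python
# Minimal data-cleaning sketch: fill a missing value and replace an
# erroneous one with the attribute mean. Field names are illustrative.
records = [
    {"occupation": "clerk",    "salary": 42000},
    {"occupation": "",         "salary": 51000},  # missing occupation
    {"occupation": "engineer", "salary": -10},    # noisy: negative salary
]

valid_salaries = [r["salary"] for r in records if r["salary"] >= 0]
mean_salary = sum(valid_salaries) / len(valid_salaries)

for r in records:
    if r["occupation"] == "":
        r["occupation"] = "unknown"   # fill in the missing value
    if r["salary"] < 0:
        r["salary"] = mean_salary     # smooth out the impossible value
```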
Data Integration
Data integration:
combines data from multiple sources

Schema integration
integrate metadata from different sources

Detecting and resolving data value conflicts
for the same real-world entity, attribute values from
different sources may differ,
e.g., different scales (metric vs. British units)

Removing duplicates and redundant data
Data Transformation
Smoothing:

remove noise from data

Aggregation: summarization, data cube construction
Generalization: concept hierarchy climbing
Normalization: scaled to fall within a small, specified
range
min-max normalization
z-score normalization
normalization by decimal scaling
Data Transformation: Normalization
Min-max normalization: maps v to the range [new_min_A, new_max_A]

    v' = ((v − min_A) / (max_A − min_A)) × (new_max_A − new_min_A) + new_min_A

Ex.

Let income range from $12,000 to $98,000, normalized to
[0.0, 1.0]. Then $73,600 is mapped to

    ((73,600 − 12,000) / (98,000 − 12,000)) × (1.0 − 0) + 0 = 0.716
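The min-max mapping above can be written as a small helper; the function name and default target range are illustrative:

```python
def min_max_normalize(v, min_a, max_a, new_min=0.0, new_max=1.0):
    # v' = ((v - min_A) / (max_A - min_A)) * (new_max_A - new_min_A) + new_min_A
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

# Income example from the slide: $73,600 in the range [$12,000, $98,000]
result = min_max_normalize(73_600, 12_000, 98_000)  # ≈ 0.716
```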
Data Transformation: Normalization

Z-score normalization (μ_A: mean, σ_A: standard
deviation of attribute A):

    v' = (v − μ_A) / σ_A

Ex. Let μ_A = 54,000, σ_A = 16,000. Then 73,600 is normalized to

    (73,600 − 54,000) / 16,000 = 1.225
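The same worked example as code (function name is illustrative):

```python
def z_score_normalize(v, mean, std):
    # v' = (v - mu_A) / sigma_A
    return (v - mean) / std

# Slide example: mu = 54,000, sigma = 16,000, v = 73,600
result = z_score_normalize(73_600, 54_000, 16_000)  # 1.225
```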
Data Transformation: Normalization

Normalization by decimal scaling

    v' = v / 10^j

where j is the smallest integer such that max(|v'|) < 1
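A sketch of decimal scaling that searches for the smallest such j; the function name and sample values are illustrative:

```python
def decimal_scale(values):
    # Find the smallest j such that max(|v / 10**j|) < 1, then scale.
    j = 0
    while max(abs(v) for v in values) / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values], j

# E.g., values ranging over [-986, 917] need j = 3 (divide by 1,000)
scaled, j = decimal_scale([917, -986])
```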
Data reduction
Why data reduction?
A database/data warehouse may store terabytes of data
Complex data analysis/mining may take a very long time to
run on the complete data set
Data reduction
Obtain a data set that is much smaller in volume yet produces
the same analytical results
Data reduction strategies
Data cube aggregation
Dimensionality reduction — e.g., remove unimportant
attributes
Data Compression
Discretization and concept hierarchy generation
Data cube aggregation

The lowest level of a data cube (base cuboid)
The aggregated data for an individual entity of
interest
E.g., a customer in a phone calling data warehouse
Multiple levels of aggregation in data cubes
Further reduce the size of the data that must be processed
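The phone-call example above can be sketched as a roll-up from individual call records (the base cuboid) to per-customer yearly totals. The records and field layout are illustrative:

```python
# Roll call records up from individual calls to (customer, year) totals.
# Each record: (customer, month, minutes); values are illustrative.
calls = [
    ("alice", "2023-01", 12.5),
    ("alice", "2023-02", 7.0),
    ("bob",   "2023-01", 3.2),
]

yearly = {}
for customer, month, minutes in calls:
    year = month.split("-")[0]
    yearly[(customer, year)] = yearly.get((customer, year), 0.0) + minutes
```

The aggregated `yearly` table is far smaller than the base call records, yet still answers yearly-usage queries exactly.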
Dimensionality Reduction

Feature selection (attribute subset selection):
Select a minimum set of attributes that is sufficient for
the data mining task.

Heuristic methods
step-wise forward selection
step-wise backward elimination
combining forward selection and backward elimination
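Step-wise forward selection can be sketched as a greedy loop. The `score` function stands in for whatever evaluation the mining task uses (e.g., cross-validated accuracy); both the function name and the toy score in the usage note are illustrative assumptions:

```python
def forward_selection(attributes, score):
    # Greedily add the attribute that most improves the score,
    # stopping when no single addition helps.
    selected = []
    best = score(selected)
    improved = True
    while improved:
        improved = False
        for a in set(attributes) - set(selected):
            s = score(selected + [a])
            if s > best:
                best, choice, improved = s, a, True
        if improved:
            selected.append(choice)
    return selected
```

With a toy score that rewards only `age` and `income`, the loop selects exactly those two attributes; step-wise backward elimination is the mirror image, starting from the full set and dropping attributes.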
Data Discretization

Three types of attributes:
Nominal — values from an unordered set
Ordinal — values from an ordered set
Continuous — real numbers

Discretization:
Some classification algorithms accept only
categorical attributes.
Reduce data size by discretization
Prepare for further analysis
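One common discretization scheme for continuous attributes is equal-width binning, sketched below; the function name and bin rule are illustrative assumptions (it also assumes max > min):

```python
def equal_width_bins(values, k):
    # Assign each value a bin index in 0..k-1 over k equal-width
    # intervals spanning [min(values), max(values)].
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [min(int((v - lo) / width), k - 1) for v in values]

# Two bins over [0, 10]: values below 5 fall in bin 0, the rest in bin 1
bins = equal_width_bins([0, 5, 10], 2)  # [0, 1, 1]
```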