1. Data Preprocessing
MS. T.K. ANUSUYA
ASSISTANT PROFESSOR
DEPARTMENT OF COMPUTER SCIENCE
BON SECOURS COLLEGE FOR WOMEN, THANJAVUR.
2. Why Data Pre-processing?
Data in the real world is dirty:
Noisy – containing errors or outliers
Missing/incomplete – lacking attribute values, e.g., name=""
Duplicate tuples
Inconsistent – discrepancies are common, partly due to the typically huge size of the data
Low-quality data leads to low-quality mining results.
Data from different sources must therefore be extracted, cleaned, and transformed.
2
Data Pre-processing
3. Multi Dimensional Measure of Data Quality
Accuracy
Completeness
Consistency
Timeliness
Believability
Interpretability
5. Data Pre-processing Techniques
Data Cleaning
Fill in missing values, smooth noisy data, identify outliers – real-world data are dirty
Data Integration
Integration of multiple databases, data cubes, or files
Data Transformation
Normalization and aggregation
Data Reduction
Reduce the data size by compressing, aggregating, or eliminating redundant features
Dimensionality reduction – removing irrelevant attributes
Numerosity reduction – data replaced by smaller alternative representations:
parametric models (regression / log-linear models) or
non-parametric models (e.g., histograms, clusters, sampling, and data aggregation)
6. Data Cleaning
Fills in missing values, smooths out noisy data while identifying outliers, and corrects inconsistencies in the data.
• Missing Values
• Ignore the tuple – usually done when the class label is missing
• Fill in the missing value manually – tedious and often infeasible
• Use a global constant to fill in the missing value – e.g., "Unknown", which effectively forms a new class
• Use a measure of central tendency (mean or median) for the attribute
• Use the attribute mean or median for all samples belonging to the same class as the given tuple
• Use the most probable value to fill in the missing value – regression, Bayesian inference, decision trees
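The fill-in strategies above can be sketched with pandas (a library choice the slides do not make; the tiny table and column names are illustrative assumptions):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "income": [30000, np.nan, 50000, 70000],
    "class":  ["low", "low", "high", "high"],
})

# Strategy 1: ignore (drop) tuples with a missing value.
dropped = df.dropna(subset=["income"])

# Strategy 2: fill with a global constant.
constant = df["income"].fillna(-1)

# Strategy 3: fill with a measure of central tendency (here the mean).
mean_filled = df["income"].fillna(df["income"].mean())

# Strategy 4: fill with the mean of samples in the same class as the tuple.
class_filled = df.groupby("class")["income"].transform(lambda s: s.fillna(s.mean()))
```

Filling by class (strategy 4) preserves per-group structure that a global mean would blur.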
7. Data Cleaning
• Noisy Data
• Noise is a random error or variance in a measured variable.
• Binning: sort the data and partition it into bins; smooth by bin means, bin medians, or bin boundaries
• Clustering – detect and remove outliers
• Semi-automated – computer detection combined with manual intervention
• Regression – smooth by fitting the data to regression functions
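A minimal sketch of the binning method in plain Python, using equal-frequency bins of size 3 (the values are illustrative):

```python
# Sort, partition into equal-frequency bins, then smooth.
data = sorted([4, 8, 15, 21, 21, 24, 25, 28, 34])
bins = [data[i:i + 3] for i in range(0, len(data), 3)]

# Smoothing by bin means: every value becomes its bin's mean.
by_means = [[sum(b) / len(b)] * len(b) for b in bins]

# Smoothing by bin boundaries: every value moves to the nearer boundary.
by_bounds = [[b[0] if v - b[0] <= b[-1] - v else b[-1] for v in b] for b in bins]
```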
8. Data Integration
Data Integration
Merging of data from multiple data stores.
Reducing and avoiding redundancies and inconsistencies improves the accuracy and speed of the subsequent mining process.
Entity identification problem
Identify real-world entities from multiple data sources, e.g., Bill Clinton = William Clinton
Redundant attributes can be detected by correlation analysis and covariance analysis
9. Correlation Analysis (Nominal Data)
Χ2 (chi-square) test
The larger the Χ2 value, the more likely the variables are related
The cells that contribute the most to the Χ2 value are those whose actual
count is very different from the expected count
Correlation does not imply causality
# of hospitals and # of car-theft in a city are correlated
Both are causally linked to the third variable: population
Χ² = Σ (Observed − Expected)² / Expected
10. Chi-square Calculation – Example
                         | Play chess | Not play chess | Sum (row)
Like science fiction     | 250 (90)   | 200 (360)      | 450
Not like science fiction | 50 (210)   | 1000 (840)     | 1050
Sum (col.)               | 300        | 1200           | 1500
Χ² (chi-square) calculation (the numbers in parentheses are the expected counts, computed from the marginal totals of the two categories)
It shows that like_science_fiction and play_chess are correlated in the group.
Χ² = (250 − 90)²/90 + (50 − 210)²/210 + (200 − 360)²/360 + (1000 − 840)²/840 = 507.93
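The calculation above can be reproduced in a few lines of plain Python (the slides themselves give no code):

```python
# Observed counts from the contingency table; expected counts are
# (row total x column total) / grand total under independence.
observed = {
    ("fiction", "chess"): 250, ("fiction", "no_chess"): 200,
    ("no_fiction", "chess"): 50, ("no_fiction", "no_chess"): 1000,
}
row_tot = {"fiction": 450, "no_fiction": 1050}
col_tot = {"chess": 300, "no_chess": 1200}
n = 1500

chi2 = 0.0
for (r, c), obs in observed.items():
    exp = row_tot[r] * col_tot[c] / n   # expected count under independence
    chi2 += (obs - exp) ** 2 / exp      # accumulate (O - E)^2 / E

# chi2 comes out at approximately 507.93, matching the slide.
```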
11. Correlation Analysis (Numerical Data)
Correlation coefficient (also called Pearson's product-moment coefficient):
where n is the number of tuples, Ā and B̄ are the respective means of A and B, σA and σB are the respective standard deviations of A and B, and Σ(ab) is the sum of the AB cross-products.
If rA,B > 0, A and B are positively correlated (A's values increase as B's do); the higher the value, the stronger the correlation.
rA,B = 0: no linear correlation (the variables may be independent); rA,B < 0: negatively correlated
rA,B = Σ(aᵢ − Ā)(bᵢ − B̄) / ((n − 1) σA σB) = (Σ aᵢbᵢ − n Ā B̄) / ((n − 1) σA σB)
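As a quick sanity check of the formula, here is a small sketch (numpy and the sample data are assumptions, not from the slides):

```python
import numpy as np

A = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
B = np.array([2.0, 4.0, 6.0, 8.0, 10.0])  # B = 2A, so r should be exactly 1

n = len(A)
# Sample standard deviations (ddof=1) to match the (n - 1) in the formula.
r = ((A - A.mean()) * (B - B.mean())).sum() / ((n - 1) * A.std(ddof=1) * B.std(ddof=1))

# Agrees with numpy's built-in Pearson correlation.
r_np = np.corrcoef(A, B)[0, 1]
```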
12. Data Reduction
Obtain a reduced representation of the data set that is much smaller in volume, yet closely maintains the integrity of the original data.
Data cube aggregation
Dimensionality reduction – reducing the number of random variables or attributes under consideration (e.g., wavelet transforms)
Numerosity reduction – regression and log-linear models, histograms, clustering, sampling, data cube aggregation
Data compression
13. Wavelet Transform
Data are transformed so as to preserve the relative distance between objects at different levels of resolution
Used for image compression
14. Numerosity Reduction
Reduce data volume by choosing alternative, smaller forms of data representation
Parametric methods (regression)
Assume the data fits a model: estimate the model parameters and store only the parameters
Linear regression – straight line
Multiple regression – multidimensional vector
Log-linear model – discrete multidimensional distributions
Non-parametric methods
Do not assume models (histograms, clustering, sampling, …)
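A minimal sketch of the parametric idea: fit a straight line to the data and keep only its two parameters instead of the raw points (numpy and the sample data are assumptions):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.1, 5.0, 6.9, 9.0])   # roughly y = 2x + 1

# Least-squares straight line: store only (slope, intercept), not the points.
slope, intercept = np.polyfit(x, y, 1)

# Any value can now be approximated as slope * x + intercept.
```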
15. Histograms
A popular data reduction technique
Divide the data into buckets (e.g., equal-width or equal-frequency) and store the average for each bucket
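A sketch of an equal-width histogram used this way (bucket width and data are illustrative):

```python
# Group values into equal-width buckets and keep only (range, average) per
# bucket, instead of the raw values.
data = [1, 1, 5, 5, 5, 8, 10, 10, 14, 14, 18, 20]
width = 5

buckets = {}
for v in data:
    lo = (v // width) * width           # bucket covers [lo, lo + width)
    buckets.setdefault(lo, []).append(v)

summary = {(lo, lo + width): sum(vs) / len(vs) for lo, vs in buckets.items()}
```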
16. Data Cube Aggregation
The lowest level of a data cube is the base cuboid; the highest level of abstraction is the apex cuboid
Multiple levels of aggregation in data cubes
Provide fast access to precomputed, summarized data
Reduce the size of data
17. Data Transformation
A pre-processing step in which data are transformed or consolidated so that the resulting mining process is more efficient and the patterns found are easier to understand.
Smoothing – remove noise from the data (binning, regression, and clustering)
Attribute construction – new attributes constructed from the given ones
Aggregation – summarization, data cube construction
Normalization – scaling values into a small range (min-max, z-score)
Discretization – raw values replaced by interval labels (concept hierarchy climbing)
Concept hierarchy generation for nominal data
18. Normalization
Min-max normalization (to a new range [new_minA, new_maxA]):
v' = ((v − minA) / (maxA − minA)) · (new_maxA − new_minA) + new_minA
Ex. Let income range from $12,000 to $98,000, normalized to [0.0, 1.0]. Then $73,000 is mapped to (73,000 − 12,000) / (98,000 − 12,000) ≈ 0.709.
Z-score normalization (μA: mean, σA: standard deviation of A):
v' = (v − μA) / σA
Normalization by decimal scaling:
v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1
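The three formulas can be checked directly; the min-max numbers come from the income example above, while the z-score and decimal-scaling inputs are illustrative assumptions:

```python
def min_max(v, mn, mx, new_mn=0.0, new_mx=1.0):
    # v' = (v - min) / (max - min) * (new_max - new_min) + new_min
    return (v - mn) / (mx - mn) * (new_mx - new_mn) + new_mn

def z_score(v, mu, sigma):
    # v' = (v - mean) / standard deviation
    return (v - mu) / sigma

def decimal_scale(v, j):
    # v' = v / 10^j, with j chosen so that max(|v'|) < 1
    return v / 10 ** j

income = min_max(73_000, 12_000, 98_000)      # ~0.709
z = z_score(73_000, mu=54_000, sigma=16_000)  # 1.1875
d = decimal_scale(-986, j=3)                  # -0.986
```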
19. Data Discretization
Three types of attributes:
Nominal — values from an unordered set, e.g., color, profession
Ordinal — values from an ordered set, e.g., military or academic rank
Continuous — numeric values, e.g., integers or real numbers
Discretization:
Divide the range of a continuous attribute into intervals
Some classification algorithms only accept categorical attributes.
Reduce data size by discretization
20. Data Discretization
Discretization
Reduce the number of values for a given continuous attribute by dividing the range of the attribute
into intervals
Interval labels can then be used to replace actual data values
Supervised vs. unsupervised
Split (top-down) vs. merge (bottom-up)
Discretization can be performed recursively on an attribute
Concept hierarchy formation
Recursively reduce the data by collecting and replacing low level concepts (such as numeric values
for age) by higher level concepts (such as young, middle-aged, or senior)
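The age example above can be sketched as a tiny concept-hierarchy mapping (the interval boundaries are assumptions, not taken from the slide):

```python
def age_concept(age):
    # Replace a low-level numeric value with a higher-level concept label.
    if age < 35:
        return "young"
    elif age < 60:
        return "middle-aged"
    return "senior"

labels = [age_concept(a) for a in (22, 40, 67)]
```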
21. Data Discretization Methods
Typical methods (all can be applied recursively):
Binning
Top-down split, unsupervised
Histogram analysis
Top-down split, unsupervised
Clustering analysis (unsupervised, top-down split or bottom-up merge)
Decision-tree analysis (supervised, top-down split)
Correlation (e.g., Χ²) analysis (unsupervised, bottom-up merge)