Module - 2: Data Preprocessing
By Dr. Ramkumar T.
ramkumar.thirunavukarasu@vit.ac.in
Types of Data Sets
• Record
– Relational records
– Data matrix, e.g., numerical
matrix, crosstabs
– Document data: text documents:
term-frequency vector
– Transaction data
• Graph and network
– World Wide Web
– Social or information networks
– Molecular Structures
• Ordered
– Video data: sequence of images
– Temporal data: time-series
– Sequential Data: transaction
sequences
– Genetic sequence data
• Spatial, image and multimedia:
– Spatial data: maps
– Image data:
– Video data:
Example – document data represented as term-frequency vectors:

             team  coach  play  ball  score  game  win  lost  timeout  season
Document 1     3     0     5     0      2     6     0     2      0       2
Document 2     0     7     0     2      1     0     0     3      0       0
Document 3     0     1     0     0      1     2     2     0      3       0
TID Items
1 Bread, Coke, Milk
2 Beer, Bread
3 Beer, Coke, Diaper, Milk
4 Beer, Bread, Diaper, Milk
5 Coke, Diaper, Milk
Attributes
• Attribute (or dimensions, features, variables): a data field,
representing a characteristic or feature of a data object.
– E.g., customer_ID, name, address
• Types:
– Nominal
– Binary
– Ordinal
– Numeric: quantitative
• Interval-scaled
• Ratio-scaled
Attribute Types
• Nominal: categories, states, or "names of things"
– Hair_color = {auburn, black, blond, brown, grey, red, white}
– occupation
• Binary
– Nominal attribute with only 2 states (0 and 1)
– Symmetric binary: both outcomes equally important
• e.g., gender
– Asymmetric binary: outcomes not equally important.
• e.g., medical test (positive vs. negative)
• Convention: assign 1 to most important outcome
• Ordinal
– Values have a meaningful order (ranking)
– Size = {small, medium, large}, grades, army rankings
Numeric Attribute Types
• Quantity (integer or real-valued)
• Interval
– Measured on a scale of equal-sized units
– Values have order
– E.g., temperature in C° or F°, calendar dates
• Ratio
– We can speak of values as being an order of magnitude larger than the
unit of measurement (10 K° is twice as high as 5 K°)
– E.g., temperature in Kelvin, length, counts, monetary quantities
Why Data Preprocessing?
• Data in the real world is dirty
– incomplete: lacking attribute values, lacking certain attributes
of interest, or containing only aggregate data
– noisy: containing errors or outliers
– inconsistent: containing discrepancies in codes or names
• No quality data, no quality mining results!
– Quality decisions must be based on quality data
– Data warehouse needs consistent integration of quality data
• A multi-dimensional measure of data quality:
– A well-accepted multi-dimensional view:
• accuracy, completeness, consistency, timeliness, believability, value
added, interpretability, accessibility
– Broad categories:
• intrinsic, contextual, representational, and accessibility.
Major Tasks in Data Preprocessing
• Data cleaning
– Fill in missing values, smooth noisy data, identify or remove outliers, and
resolve inconsistencies
• Data integration
– Integration of multiple databases, data cubes, files, or notes
• Data transformation
– Normalization (scaling to a specific range)
– Aggregation
• Data reduction
– Obtains reduced representation in volume but produces the same or similar
analytical results
– Data discretization: with particular importance, especially for numerical data
– Data aggregation, dimensionality reduction, data compression, generalization
Forms of data preprocessing
Data Cleaning
• Data cleaning tasks
– Fill in missing values
– Identify outliers and smooth out noisy data
– Correct inconsistent data
Missing Data
• Data is not always available
– E.g., many tuples have no recorded value for several attributes, such
as customer income in sales data
• Missing data may be due to
– equipment malfunction
– inconsistent with other recorded data and thus deleted
– data not entered due to misunderstanding
– certain data may not be considered important at the time of entry
– not register history or changes of the data
• Missing data may need to be inferred
How to Handle Missing Data?
• Ignore the tuple: usually done when class label is missing (assuming the
task is classification—not effective in certain cases)
• Fill in the missing value manually: tedious + infeasible?
• Use a global constant to fill in the missing value: e.g., "unknown",
a new class?!
• Use the attribute mean to fill in the missing value
• Use the attribute mean for all samples of the same class to fill
in the missing value: smarter
• Use the most probable value to fill in the missing value:
inference-based such as regression, Bayesian formula, decision tree
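As a quick illustration of the fill-in strategies above (not the slide's own example), here is a minimal pandas sketch; the column names and values are hypothetical:

```python
import pandas as pd

# Hypothetical sales data with a missing income value
df = pd.DataFrame({
    "customer_ID": [1, 2, 3, 4],
    "class":       ["gold", "silver", "gold", "silver"],
    "income":      [50000, None, 62000, 48000],
})

# Global constant: mark the value as "unknown" (turns the column into objects)
filled_constant = df["income"].fillna("unknown")

# Attribute mean over all samples
filled_mean = df["income"].fillna(df["income"].mean())

# Attribute mean for samples of the same class ("smarter")
filled_class_mean = df.groupby("class")["income"].transform(lambda s: s.fillna(s.mean()))

print(filled_mean.tolist())        # [50000.0, 53333.33..., 62000.0, 48000.0]
print(filled_class_mean.tolist())  # [50000.0, 48000.0, 62000.0, 48000.0]
```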
Example
Noisy Data
• Question : What is noise?
• Answer : Random error in a measured variable.
• Incorrect attribute values may be due to
– faulty data collection instruments
– data entry problems
– data transmission problems
– technology limitation
– inconsistency in naming convention
How to Handle Noisy Data?
• Binning method:
– first sort data and partition into (equi-depth) bins
– then one can smooth by bin means, smooth by bin median,
smooth by bin boundaries, etc.
• Clustering
– detect and remove outliers
• Regression
– smooth by fitting the data into regression functions
Descriptive Statistics
Mean – The mean of a set of data is the sum of the data values
divided by the number of values.
Problem :
Marci's exam scores for her last math class were 79, 86, 82, and
94. What would the mean of these values be?
Answer:
(79 + 86 + 82 + 94) / 4 = 341 / 4 = 85.25. Typically we round means to one more
decimal place than the original data had. In this case, we would
round 85.25 to 85.3.
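A minimal Python sketch of the same calculation:

```python
scores = [79, 86, 82, 94]
mean = sum(scores) / len(scores)   # (79 + 86 + 82 + 94) / 4 = 85.25
print(round(mean, 1))              # 85.3, one more decimal place than the data
```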
Descriptive Statistics
• Median - Middle value of the variable once it has been sorted
from low to high
• If the number of data values, N, is odd, then the median is
the middle data value. This value can be found by
rounding N/2 up to the next whole number.
• If the number of data values is even, there is no single
middle value. Hence we take the mean of the two middle
values, at positions N/2 and N/2 + 1.
Data Given : 3,4,7,2,3,7,4,2,4,7,4
Sorted data : 2,2,3,3,4,4,4,4,7,7,7
Median : 4
Data Given : 3,4,7,2,3,7,4,2
Median ?
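A small sketch implementing the odd/even rule above; it also answers the second exercise (the even-length data set):

```python
def median(values):
    """Median by the rule on the slide: middle value if N is odd,
    mean of the two middle values if N is even."""
    s = sorted(values)
    n = len(s)
    mid = n // 2
    if n % 2 == 1:
        return s[mid]
    return (s[mid - 1] + s[mid]) / 2

print(median([3, 4, 7, 2, 3, 7, 4, 2, 4, 7, 4]))  # 4
print(median([3, 4, 7, 2, 3, 7, 4, 2]))           # 3.5  (sorted: 2,2,3,3,4,4,7,7)
```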
Descriptive Statistics
• Mode – Most commonly reported value for a particular
variable
3,4,5,6,7,7,7,8,8,9
Mode = 7
Find the mode of the data set
3,4,5,6,7,7,7,8,8,8,9
Modes = 7 and 8 (the data set is bimodal: both values occur three times)
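A sketch that reports every value tied for the highest frequency, which handles the bimodal case above:

```python
from collections import Counter

def modes(values):
    counts = Counter(values)
    top = max(counts.values())
    return [v for v, c in counts.items() if c == top]

print(modes([3, 4, 5, 6, 7, 7, 7, 8, 8, 9]))     # [7]
print(modes([3, 4, 5, 6, 7, 7, 7, 8, 8, 8, 9]))  # [7, 8]  (bimodal)
```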
Descriptive Statistics
• Range – Simple measure of the variation for a particular
variable. It is calculated as the difference between the highest
and lowest values.
2,3,4,6,7,7,8,9
Range = 9-2 = 7
• Variance – It is the measure of the deviation of a variable from
the mean.
For a variable that does not represent the entire population, the
sample variance formula is: s² = Σ(xᵢ − x̄)² / (n − 1)
Descriptive Statistics
• Calculate the variance for the below values :
3,4,4,5,5,5,6,6,6,7,7,8,9
Descriptive Statistics
• Standard Deviation – The square root of the variance; it can be viewed as
the root-mean-square deviation of the values from their mean.
• For a sample, the formula is s = √( Σ(xᵢ − x̄)² / (n − 1) )
• For the previous problem, the standard deviation is ≈ 1.69 (see the sketch below)
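A short sketch computing the sample variance and standard deviation for the data set above (values shown are approximate):

```python
def sample_variance(values):
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)   # divide by n-1 for a sample

data = [3, 4, 4, 5, 5, 5, 6, 6, 6, 7, 7, 8, 9]
var = sample_variance(data)
std = var ** 0.5
print(round(var, 2), round(std, 2))   # ≈ 2.86 and ≈ 1.69
```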
Descriptive Statistics
• Covariance – Measure of how much two dimensions (variables) vary from
their means with respect to each other
• The formula for calculating the sample covariance of two variables X
and Y is cov(X, Y) = Σ(xᵢ − x̄)(yᵢ − ȳ) / (n − 1)
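A sketch of the sample covariance; the two variables used here are hypothetical illustration data, not from the slides:

```python
def sample_covariance(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

# Hypothetical two-dimensional data (e.g., hours studied vs. marks obtained)
hours = [2, 4, 6, 8]
marks = [50, 60, 70, 90]
print(sample_covariance(hours, marks))   # positive => the two dimensions increase together
```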
Exercises
1) Find Mean, Median, Variance, Standard
Deviation for the following Data Set.
12,23,34,44,59,70,98
2)
Binning
• Equal-width partitioning:
– It divides the range into N intervals of equal size: uniform
grid
– if A and B are the lowest and highest values of the attribute,
the width of intervals will be: W = (B-A)/N.
– The most straightforward
– Skewed data is not handled well.
• Equal-depth partitioning:
– It divides the range into N intervals, each containing
approximately the same number of samples
– Good data scaling
– Managing categorical attributes can be tricky.
Binning Methods for Data Smoothing
(Equi-Depth)
* Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25,
26, 28, 29, 34
* Partition into (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
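A sketch reproducing the equi-depth partitioning and the two smoothing rules above, assuming three bins of four values each:

```python
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]  # already sorted

depth = 4
bins = [prices[i:i + depth] for i in range(0, len(prices), depth)]

# Smoothing by bin means: every value in a bin is replaced by the bin mean
by_means = [[round(sum(b) / len(b)) for _ in b] for b in bins]

# Smoothing by bin boundaries: each value is replaced by the closer bin boundary
by_bounds = [[min(b) if x - min(b) <= max(b) - x else max(b) for x in b] for b in bins]

print(by_means)   # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(by_bounds)  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```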
Equi-Width Binning
for Data Smoothing
• Given data set: 5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215
• The width of each equal-width bin is
W = (max − min) / N
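The slide leaves N open; assuming N = 3 bins, a sketch of equal-width partitioning for this data set (note how the skewed values crowd into the first bin):

```python
data = [5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215]

N = 3                                   # number of bins (assumed; the slide leaves N open)
width = (max(data) - min(data)) / N     # (215 - 5) / 3 = 70
edges = [min(data) + i * width for i in range(N + 1)]   # [5, 75, 145, 215]

bins = [[] for _ in range(N)]
for x in data:
    i = min(int((x - min(data)) // width), N - 1)       # clamp the maximum into the last bin
    bins[i].append(x)

print(edges)
print(bins)   # most values fall into the first bin -- equal width handles skewed data poorly
```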
Data Smoothening by Three Point Summary
• For any instance I, the three-point moving average is
calculated as follows:
smoothed value at I = ( X(I−1) + X(I) + X(I+1) ) / 3
• Consider the following data set : 3,4,7,2,14,0,21,9,1,5
• The data set would be smoothed as
3, 4.67, 4.33, 7.67, 5.33, 11.67, 10, 10.33, 5, 5
(the first and last values are left unchanged)
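A sketch of the three-point moving average, leaving the endpoints unchanged as in the example above:

```python
def three_point_moving_average(xs):
    """Replace each interior value by the mean of itself and its two neighbours;
    the first and last values are left unchanged."""
    out = list(xs)
    for i in range(1, len(xs) - 1):
        out[i] = round((xs[i - 1] + xs[i] + xs[i + 1]) / 3, 2)
    return out

print(three_point_moving_average([3, 4, 7, 2, 14, 0, 21, 9, 1, 5]))
# [3, 4.67, 4.33, 7.67, 5.33, 11.67, 10.0, 10.33, 5.0, 5]
```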
Data Smoothening by Five Point Summary
• Quartiles, outliers and Boxplot
• Quartiles: Q1 (25th percentile), Q3 (75th percentile)
– Inter-quartile range: IQR = Q3 – Q1
– The lower inner fence (min) = Q1 – (1.5 × IQR)
– The upper inner fence (max) = Q3 + (1.5 × IQR)
– Five-number summary: min, Q1, median, Q3, max
– Boxplot: the ends of the box are the quartiles, the median is
marked, whiskers extend from the box, and outliers are plotted individually
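A small sketch of the IQR and inner-fence calculation; the quartile values and candidate points are hypothetical:

```python
def iqr_fences(q1, q3):
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr   # lower inner fence
    upper = q3 + 1.5 * iqr   # upper inner fence
    return lower, upper

# Hypothetical quartiles for some attribute
q1, q3 = 61.0, 68.0
lower, upper = iqr_fences(q1, q3)
print(lower, upper)          # 50.5 and 78.5
outliers = [x for x in [45, 59, 60, 66, 70, 85] if x < lower or x > upper]
print(outliers)              # [45, 85] -- values outside the fences are plotted individually
```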
Box and Whisker Diagrams
Looking at the box plot on its own: the box spans from the lower quartile (LQ)
to the upper quartile (UQ) with the median marked inside it, and the whiskers
extend out to the minimum and maximum values.
Quartiles
• Quartiles are values that divide the data into quarters.
• The first quartile (Q1) is the value so that 25% of the data values
are below it; the third quartile (Q3) is the value so that 75% of
the data values are below it. You may have guessed that the
second quartile is the same as the median, since the median is
the value so that 50% of the data values are below it.
• This divides the data into quarters; 25% of the data is between
the minimum and Q1, 25% is between Q1 and the median, 25%
is between the median and Q3, and 25% is between Q3 and the
maximum value.
Problem
Suppose we have measured 9 females, and
their heights (in inches) sorted from smallest to
largest are:
59 60 62 64 66 67 69 70 72
What are the first and third quartiles?
Solution
To find the first quartile we first compute the locator: 25% of
9 is L = 0.25(9) = 2.25. Since this value is not a whole
number, we round up to 3. The first quartile will be the third
data value: 62 inches.
To find the third quartile, we again compute the locator: 75%
of 9 is 0.75(9) = 6.75. Since this value is not a whole number,
we round up to 7. The third quartile will be the seventh data
value: 69 inches.
Try Again for the Data Set
Data:
11, 13, 13, 15, 15, 16, 19, 20, 20, 20, 21, 21, 22, 23, 24,
30, 40, 45, 45, 45, 45, 71, 72, 73, 75
Solution in 'R'
Problem - 2
Suppose we had measured 8 females, and
their heights (in inches) sorted from smallest
to largest are:
59 60 62 64 66 67 69 70
What are the first and third quartiles? What is
the 5 number summary?
Solution
To find the first quartile we first compute the locator: 25% of 8
is L = 0.25(8) = 2. Since this value is a whole number, we will
find the mean of the 2nd and 3rd data values: (60+62)/2 = 61,
so the first quartile is 61 inches.
The third quartile is computed similarly, using 75% instead of
25%. L = 0.75(8) = 6. This is a whole number, so we will find
the mean of the 6th and 7th data values: (67+69)/2 = 68, so Q3
is 68.
Note that the median could be computed the same way, using
50%.
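A sketch of the locator rule used in these solutions; note that statistical software (e.g., R's quantile() or NumPy) may use a different interpolation rule and give slightly different quartiles:

```python
import math

def quartile(sorted_values, p):
    """Locator rule from the slides: L = p * N; if L is whole, average the
    L-th and (L+1)-th values, otherwise round L up and take that value."""
    n = len(sorted_values)
    L = p * n
    if L == int(L):
        i = int(L)
        return (sorted_values[i - 1] + sorted_values[i]) / 2
    return sorted_values[math.ceil(L) - 1]

heights9 = [59, 60, 62, 64, 66, 67, 69, 70, 72]
print(quartile(heights9, 0.25), quartile(heights9, 0.75))   # 62 69

heights8 = [59, 60, 62, 64, 66, 67, 69, 70]
print(quartile(heights8, 0.25), quartile(heights8, 0.75))   # 61.0 68.0
print(quartile(heights8, 0.50))                             # 65.0 (the median, using 50%)
```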
Cluster Analysis
Regression Analysis
• Regression analysis: A collective name for
techniques for the modeling and analysis of
numerical data consisting of values of a
dependent variable (also called response
variable or measurement) and of one or more
independent variables (aka. explanatory
variables or predictors)
• The parameters are estimated so as to give a
"best fit" of the data
• Most commonly the best fit is evaluated by
using the least squares method, but other
criteria have also been used
• Used for prediction (including
forecasting of time-series data),
inference, hypothesis testing,
and modeling of causal
relationships
(Figure: data points in the x–y plane with a fitted line y = x + 1; for an
observation at X1, Y1 is the observed value and Y1' is the value predicted by
the line.)
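A minimal sketch of a least-squares fit for a single predictor; the data points are hypothetical, scattered around the line y = x + 1:

```python
def least_squares_fit(xs, ys):
    """Ordinary least squares fit of y = a*x + b for a single predictor x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    a = num / den
    b = mean_y - a * mean_x
    return a, b

# Hypothetical points scattered around y = x + 1
xs = [1, 2, 3, 4, 5]
ys = [2.1, 2.9, 4.2, 4.8, 6.0]
a, b = least_squares_fit(xs, ys)
print(round(a, 2), round(b, 2))      # slope and intercept close to 1 and 1
y1_pred = a * 2 + b                  # predicted value Y1' at X1 = 2
```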
How to Handle Inconsistent Data?
• Manual correction using external references
• Semi-automatic using various tools
– To detect violation of known functional
dependencies and data constraints
– To correct redundant data
Data Integration
• Data integration:
– combines data from multiple sources into a coherent store
• Schema integration
– integrate metadata from different sources
– Entity identification problem: identify real world entities from
multiple data sources, e.g., A.cust-id ≡ B.cust-#
• Detecting and resolving data value conflicts
– for the same real world entity, attribute values from different
sources are different
– possible reasons: different representations, different scales,
e.g., metric vs. British units, different currency
Handling Redundant Data in
Data Integration
• Redundant data occur often when integrating multiple DBs
– The same attribute may have different names in different databases
– One attribute may be a "derived" attribute in another table, e.g.,
annual revenue
– Careful integration can help reduce/avoid redundancies and
inconsistencies and improve mining speed and quality
• Redundant data can often be detected by correlation analysis,
e.g., Pearson's correlation coefficient for numerical data
Correlation Analysis
• The term correlation refers to the strength and direction of a linear
relationship between two or more variables
• If a change in one variable effects a change in the other variable,
the variables are said to be correlated.
• Is there a relationship between the number of hours a student
spends studying for a calculus test and the student's score on that
calculus test?
• There are basically three types of correlation, namely positive
correlation, negative correlation and zero correlation
Pearson's correlation coefficient
The formula, r = Σ(xᵢ − x̄)(yᵢ − ȳ) / √( Σ(xᵢ − x̄)² · Σ(yᵢ − ȳ)² ), returns a
value between -1 and 1, where:
• If r is close to 1, we say that the variables are positively correlated. This means there
is likely a strong linear relationship between the two variables, with a positive slope.
• If r is close to -1, we say that the variables are negatively correlated. This means
there is likely a strong linear relationship between the two variables, with a negative
slope.
• If r is close to 0, we say that the variables are not correlated. This means that there is
likely no linear relationship between the two variables.
Pearson's correlation coefficient
X   Y
5 25
3 20
4 21
10 35
15 38
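A sketch computing Pearson's r for the table above:

```python
def pearson_r(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = (sum((x - mean_x) ** 2 for x in xs) * sum((y - mean_y) ** 2 for y in ys)) ** 0.5
    return num / den

x = [5, 3, 4, 10, 15]
y = [25, 20, 21, 35, 38]
print(round(pearson_r(x, y), 2))   # ≈ 0.97: strong positive linear relationship
```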
Pearson's correlation coefficient
Exercise
Correlation Analysis – Chi Square Test (Nominal
Data)
• A hypothesis is an idea that can be tested.
• Null hypothesis: there is no association between the two variables; they are
unrelated to each other. Denoted H0.
• Alternative hypothesis: assumes that there is an association between the two
variables. Denoted H1.
• Degrees of Freedom & Level of Significance
• The significance level, also denoted alpha or 'α', is the probability of
rejecting the null hypothesis when it is true, usually 0.05.
• Degrees of freedom refers to the maximum number of logically independent
values, i.e., values that have the freedom to vary:
DF = (r-1) x (c-1)
Chi-Square Calculation: An Example
• Numbers in parentheses are expected counts calculated based on
the data distribution in the two categories

                           Playing Cricket   Not playing Cricket   Sum (row)
Like science fiction           250 (90)           200 (360)           450
Not like science fiction        50 (210)         1000 (840)          1050
Sum (col.)                     300                1200               1500
Chi Square Test (Nominal Data)
• The Observed values are those we gather ourselves. The
expected values are the frequencies expected, based on our null
hypothesis.
• If the observed chi-square test statistic is greater than the critical
value (found in the chi-square Table), the null hypothesis can be
rejected.
χ² = Σ (Observed − Expected)² / Expected

χ² = (250 − 90)²/90 + (50 − 210)²/210 + (200 − 360)²/360 + (1000 − 840)²/840
   = 507.93
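A sketch that recomputes the expected counts from the row and column totals and then the chi-square statistic; scipy.stats.chi2_contingency(observed, correction=False) should give the same value:

```python
observed = [[250, 200],     # like science fiction:     playing / not playing cricket
            [50, 1000]]     # not like science fiction

row_totals = [sum(r) for r in observed]          # [450, 1050]
col_totals = [sum(c) for c in zip(*observed)]    # [300, 1200]
grand = sum(row_totals)                          # 1500

chi2 = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        expected = row_totals[i] * col_totals[j] / grand   # e.g. 450 * 300 / 1500 = 90
        chi2 += (obs - expected) ** 2 / expected

print(round(chi2, 2))   # 507.93 -- far above 3.84 (critical value at α = 0.05, df = 1),
                        # so the null hypothesis of no association is rejected
```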
Chi-Square Table
Data Reduction
• Problem:
Data Warehouse may store terabytes of data:
Complex data analysis/mining may take a very
long time to run on the complete data set
• Solution?
– Data reduction…
Obtain a reduced representation of the data set
that is much smaller in volume but yet
produces the same (or almost the same)
analytical results
• Data reduction strategies
– Data cube aggregation
– Dimensionality reduction
– Data compression
– Numerosity reduction
– Discretization and concept hierarchy generation
Principal Component Analysis – The need
• When the data has too many dimensions, it becomes a problem for
data mining:
– High compute and execution time
– A risk of compromising the quality of the model fit
• When the dimension of data is too high, we need to find a way to reduce
it.
• But that reduction has to be done in such a way that we maintain the
original pattern of the data.
• PCA – A well-known dimensionality reduction technique that maps the
data into a space of lower dimensionality
• It transforms a set of correlated variables into a new set of uncorrelated
variables
• A form of unsupervised learning, also used as a data pre-processing
technique
Principal Component Analysis
• Principal component analysis is a method of extracting important
variables known as principal components from a large set of variables
available in a data set.
• It represents the original data in terms of its principal components in a
new dimension space.
• Principal components (PC) are the directions where there is the most
variance, the directions where the data is most spread out.
• There are multiple principal components of a data set – each capturing a
different share of the variance in the data. They are arranged in decreasing
order of variance.
• The first PC captures the most variance, i.e. the most information about
the data, followed by the second, third and so on.
Principal Component Analysis
• Mathematically, the principal
components are the eigenvectors of
the covariance matrix of the original
dataset.
• Each eigenvector corresponds to a
direction, and each eigenvector has a
corresponding eigenvalue.
• An eigenvalue is a number that
indicates how much variance there is
in the data along that eigenvector.
• A larger eigenvalue means that the
corresponding principal component
explains a larger share of the
variance in the data.
PCA – How it Works ?
• The main purpose of a principal component analysis is the
analysis of data to identify and find patterns to reduce the
dimensions of the dataset with a minimal loss of information.
• Start with N-dimensional data
• Obtain the covariance matrix
• Calculate its N eigenvalues and eigenvectors
• Order the eigenvectors by decreasing eigenvalue
• The eigenvector with the highest eigenvalue is the first principal
component of the given data set
• Thus, the given N-dimensional data is finally reduced to
p principal components (p < N).
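A minimal NumPy sketch of these steps on hypothetical two-dimensional data:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical 2-D data with strongly correlated columns
x = rng.normal(size=200)
data = np.column_stack([x, 2 * x + rng.normal(scale=0.3, size=200)])

centered = data - data.mean(axis=0)          # centre each dimension
cov = np.cov(centered, rowvar=False)         # covariance matrix of the original data
eigvals, eigvecs = np.linalg.eigh(cov)       # eigenvalues / eigenvectors (ascending)

order = np.argsort(eigvals)[::-1]            # order eigenvectors by decreasing eigenvalue
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

p = 1                                        # keep only the first principal component
reduced = centered @ eigvecs[:, :p]          # N-dimensional data mapped to p dimensions

print(eigvals / eigvals.sum())               # share of variance explained by each PC
print(reduced.shape)                         # (200, 1)
```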
Data Transformation
• Smoothing: remove noise from data (binning,
clustering, regression)
• Aggregation: summarization, data cube construction
• Generalization: concept hierarchy climbing
• Normalization: scaled to fall within a small,
specified range
– min-max normalization
– z-score normalization
– normalization by decimal scaling
• Attribute/feature construction
– New attributes constructed from the given ones
Data Transformation: Normalization
• min-max normalization:
v' = (v − min_A) / (max_A − min_A) × (new_max_A − new_min_A) + new_min_A
• z-score normalization:
v' = (v − mean_A) / stand_dev_A
• normalization by decimal scaling:
v' = v / 10^j, where j is the smallest integer such that Max(|v'|) < 1
Particularly useful for classification (neural networks, distance measurements,
nearest-neighbour classification, etc.)
Normalization
• Min-max normalization: to [new_min_A, new_max_A]
v' = (v − min_A) / (max_A − min_A) × (new_max_A − new_min_A) + new_min_A
Ex. Let income range $12,000 to $98,000 be normalized to [0.0, 1.0].
Then $73,600 is mapped to
(73,600 − 12,000) / (98,000 − 12,000) × (1.0 − 0) + 0 = 0.716
• Z-score normalization (μ: mean, σ: standard deviation):
v' = (v − μ_A) / σ_A
Ex. Let μ = 54,000, σ = 16,000. Then
(73,600 − 54,000) / 16,000 = 1.225
• Normalization by decimal scaling:
v' = v / 10^j, where j is the smallest integer such that Max(|v'|) < 1
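A sketch of the three normalization methods, checked against the worked example above; treating $98,000 as the largest absolute value for decimal scaling is an assumption:

```python
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score(v, mean_a, std_a):
    return (v - mean_a) / std_a

def decimal_scaling(v, max_abs):
    j = 0
    while max_abs / 10 ** j >= 1:      # smallest j such that max(|v'|) < 1
        j += 1
    return v / 10 ** j

print(round(min_max(73_600, 12_000, 98_000), 3))   # 0.716
print(round(z_score(73_600, 54_000, 16_000), 3))   # 1.225
print(decimal_scaling(73_600, 98_000))             # 0.736  (j = 5, divide by 100,000)
```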