Module - 2: Data Preprocessing
By Dr. Ramkumar T.
ramkumar.thirunavukarasu@vit.ac.in
Types of Data Sets
• Record
– Relational records
– Data matrix, e.g., numerical
matrix, crosstabs
– Document data: text documents:
term-frequency vector
– Transaction data
• Graph and network
– World Wide Web
– Social or information networks
– Molecular Structures
• Ordered
– Video data: sequence of images
– Temporal data: time-series
– Sequential Data: transaction
sequences
– Genetic sequence data
• Spatial, image and multimedia:
– Spatial data: maps
– Image data:
– Video data:
Example – document data represented as term-frequency vectors:

             team  coach  play  ball  score  game  win  lost  timeout  season
Document 1     3     0     5     0      2     6     0     2      0       2
Document 2     0     7     0     2      1     0     0     3      0       0
Document 3     0     1     0     0      1     2     2     0      3       0
TID Items
1 Bread, Coke, Milk
2 Beer, Bread
3 Beer, Coke, Diaper, Milk
4 Beer, Bread, Diaper, Milk
5 Coke, Diaper, Milk
Attributes
• Attribute (or dimensions, features, variables): a data field,
representing a characteristic or feature of a data object.
– E.g., customer_ID, name, address
• Types:
– Nominal
– Binary
– Ordinal
– Numeric: quantitative
• Interval-scaled
• Ratio-scaled
Attribute Types
• Nominal: categories, states, or "names of things"
– Hair_color = {auburn, black, blond, brown, grey, red, white}
– occupation
• Binary
– Nominal attribute with only 2 states (0 and 1)
– Symmetric binary: both outcomes equally important
• e.g., gender
– Asymmetric binary: outcomes not equally important.
• e.g., medical test (positive vs. negative)
• Convention: assign 1 to most important outcome
• Ordinal
– Values have a meaningful order (ranking)
– Size = {small, medium, large}, grades, army rankings
Numeric Attribute Types
• Quantity (integer or real-valued)
• Interval
– Measured on a scale of equal-sized units
– Values have order
– E.g., temperature in C° or F°, calendar dates
• Ratio
– We can speak of values as being an order of magnitude larger than the
unit of measurement (10 K° is twice as high as 5 K°)
– E.g., temperature in Kelvin, length, counts, monetary quantities
Why Data Preprocessing?
• Data in the real world is dirty
– incomplete: lacking attribute values, lacking certain attributes
of interest, or containing only aggregate data
– noisy: containing errors or outliers
– inconsistent: containing discrepancies in codes or names
• No quality data, no quality mining results!
– Quality decisions must be based on quality data
– Data warehouse needs consistent integration of quality data
• A multi-dimensional measure of data quality:
– A well-accepted multi-dimensional view:
• accuracy, completeness, consistency, timeliness, believability, value
added, interpretability, accessibility
– Broad categories:
• intrinsic, contextual, representational, and accessibility.
Major Tasks in Data Preprocessing
• Data cleaning
– Fill in missing values, smooth noisy data, identify or remove outliers, and
resolve inconsistencies
• Data integration
– Integration of multiple databases, data cubes, files, or notes
• Data transformation
– Normalization (scaling to a specific range)
– Aggregation
• Data reduction
– Obtains reduced representation in volume but produces the same or similar
analytical results
– Data discretization: with particular importance, especially for numerical data
– Data aggregation, dimensionality reduction, data compression, generalization
Forms of data preprocessing
Data Cleaning
• Data cleaning tasks
– Fill in missing values
– Identify outliers and smooth out noisy data
– Correct inconsistent data
Missing Data
• Data is not always available
– E.g., many tuples have no recorded value for several attributes, such
as customer income in sales data
• Missing data may be due to
– equipment malfunction
– inconsistent with other recorded data and thus deleted
– data not entered due to misunderstanding
– certain data may not be considered important at the time of entry
– not register history or changes of the data
• Missing data may need to be inferred
How to Handle Missing Data?
• Ignore the tuple: usually done when class label is missing (assuming the
task is classification—not effective in certain cases)
• Fill in the missing value manually: tedious + infeasible?
• Use a global constant to fill in the missing value: e.g., "unknown",
a new class?!
• Use the attribute mean to fill in the missing value
• Use the attribute mean for all samples of the same class to fill
in the missing value: smarter
• Use the most probable value to fill in the missing value:
inference-based such as regression, Bayesian formula, decision tree
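As a quick illustration of the fill-in strategies above (not the slide's own example), here is a minimal pandas sketch; the column names and values are hypothetical:

```python
import pandas as pd

# Hypothetical sales data with a missing income value
df = pd.DataFrame({
    "customer_ID": [1, 2, 3, 4],
    "class":       ["gold", "silver", "gold", "silver"],
    "income":      [50000, None, 62000, 48000],
})

# Global constant: mark the value as "unknown" (turns the column into objects)
filled_constant = df["income"].fillna("unknown")

# Attribute mean over all samples
filled_mean = df["income"].fillna(df["income"].mean())

# Attribute mean for samples of the same class ("smarter")
filled_class_mean = df.groupby("class")["income"].transform(lambda s: s.fillna(s.mean()))

print(filled_mean.tolist())        # [50000.0, 53333.33..., 62000.0, 48000.0]
print(filled_class_mean.tolist())  # [50000.0, 48000.0, 62000.0, 48000.0]
```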
Example
Noisy Data
• Question : What is noise?
• Answer : Random error in a measured variable.
• Incorrect attribute values may be due to
– faulty data collection instruments
– data entry problems
– data transmission problems
– technology limitation
– inconsistency in naming convention
How to Handle Noisy Data?
• Binning method:
– first sort data and partition into (equi-depth) bins
– then one can smooth by bin means, smooth by bin median,
smooth by bin boundaries, etc.
• Clustering
– detect and remove outliers
• Regression
– smooth by fitting the data into regression functions
Descriptive Statistics
Mean – The mean of a set of data is the sum of the data values
divided by the number of values.
Problem :
Marci's exam scores for her last math class were 79, 86, 82, and
94. What would the mean of these values be?
Answer:
(79 + 86 + 82 + 94) / 4 = 341 / 4 = 85.25. Typically we round means to one more
decimal place than the original data had. In this case, we would
round 85.25 to 85.3.
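A minimal Python sketch of the same calculation:

```python
scores = [79, 86, 82, 94]
mean = sum(scores) / len(scores)   # (79 + 86 + 82 + 94) / 4 = 85.25
print(round(mean, 1))              # 85.3, one more decimal place than the data
```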
Descriptive Statistics
• Median - Middle value of the variable once it has been sorted
from low to high
• If the number of data values, N, is odd, then the median is
the middle data value. This value can be found by
rounding N/2 up to the next whole number.
• If the number of data values is even, there is no single
middle value. Hence we take the mean of the two middle
values, at positions N/2 and N/2 + 1.
Data Given : 3,4,7,2,3,7,4,2,4,7,4
Sorted data : 2,2,3,3,4,4,4,4,7,7,7
Median : 4
Data Given : 3,4,7,2,3,7,4,2
Median ?
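A small sketch implementing the odd/even rule above; it also answers the second exercise (the even-length data set):

```python
def median(values):
    """Median by the rule on the slide: middle value if N is odd,
    mean of the two middle values if N is even."""
    s = sorted(values)
    n = len(s)
    mid = n // 2
    if n % 2 == 1:
        return s[mid]
    return (s[mid - 1] + s[mid]) / 2

print(median([3, 4, 7, 2, 3, 7, 4, 2, 4, 7, 4]))  # 4
print(median([3, 4, 7, 2, 3, 7, 4, 2]))           # 3.5  (sorted: 2,2,3,3,4,4,7,7)
```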
Descriptive Statistics
• Mode – Most commonly reported value for a particular
variable
3,4,5,6,7,7,7,8,8,9
Mode = 7
Find the mode of the data set
3,4,5,6,7,7,7,8,8,8,9
Modes = 7 and 8 (the data set is bimodal: both values occur three times)
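A sketch that reports every value tied for the highest frequency, which handles the bimodal case above:

```python
from collections import Counter

def modes(values):
    counts = Counter(values)
    top = max(counts.values())
    return [v for v, c in counts.items() if c == top]

print(modes([3, 4, 5, 6, 7, 7, 7, 8, 8, 9]))     # [7]
print(modes([3, 4, 5, 6, 7, 7, 7, 8, 8, 8, 9]))  # [7, 8]  (bimodal)
```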
Descriptive Statistics
• Range – Simple measure of the variation for a particular
variable. It is calculated as the difference between the highest
and lowest values.
2,3,4,6,7,7,8,9
Range = 9-2 = 7
• Variance – It is the measure of the deviation of a variable from
the mean.
For a variable that does not represent the entire population, the
sample variance formula is: s² = Σ(xᵢ − x̄)² / (n − 1)
Descriptive Statistics
• Calculate the variance for the below values :
3,4,4,5,5,5,6,6,6,7,7,8,9
Descriptive Statistics
• Standard Deviation – The square root of the variance; it can be viewed as
the root-mean-square deviation of the values from their mean.
• For a sample, the formula is s = √( Σ(xᵢ − x̄)² / (n − 1) )
• For the previous problem, the standard deviation is ≈ 1.69 (see the sketch below)
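A short sketch computing the sample variance and standard deviation for the data set above (values shown are approximate):

```python
def sample_variance(values):
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)   # divide by n-1 for a sample

data = [3, 4, 4, 5, 5, 5, 6, 6, 6, 7, 7, 8, 9]
var = sample_variance(data)
std = var ** 0.5
print(round(var, 2), round(std, 2))   # ≈ 2.86 and ≈ 1.69
```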
Descriptive Statistics
• Covariance – Measure of how much two dimensions (variables) vary from
their means with respect to each other
• The formula for calculating the sample covariance of two variables X
and Y is cov(X, Y) = Σ(xᵢ − x̄)(yᵢ − ȳ) / (n − 1)
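A sketch of the sample covariance; the two variables used here are hypothetical illustration data, not from the slides:

```python
def sample_covariance(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

# Hypothetical two-dimensional data (e.g., hours studied vs. marks obtained)
hours = [2, 4, 6, 8]
marks = [50, 60, 70, 90]
print(sample_covariance(hours, marks))   # positive => the two dimensions increase together
```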
Exercises
1) Find Mean, Median, Variance, Standard
Deviation for the following Data Set.
12,23,34,44,59,70,98
2)
Binning
• Equal-width partitioning:
– It divides the range into N intervals of equal size: uniform
grid
– if A and B are the lowest and highest values of the attribute,
the width of intervals will be: W = (B-A)/N.
– The most straightforward
– Skewed data is not handled well.
• Equal-depth partitioning:
– It divides the range into N intervals, each containing
approximately the same number of samples
– Good data scaling
– Managing categorical attributes can be tricky.
Binning Methods for Data Smoothing
(Equi-Depth)
* Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25,
26, 28, 29, 34
* Partition into (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
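A sketch reproducing the equi-depth partitioning and the two smoothing rules above, assuming three bins of four values each:

```python
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]  # already sorted

depth = 4
bins = [prices[i:i + depth] for i in range(0, len(prices), depth)]

# Smoothing by bin means: every value in a bin is replaced by the bin mean
by_means = [[round(sum(b) / len(b)) for _ in b] for b in bins]

# Smoothing by bin boundaries: each value is replaced by the closer bin boundary
by_bounds = [[min(b) if x - min(b) <= max(b) - x else max(b) for x in b] for b in bins]

print(by_means)   # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(by_bounds)  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```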
Equi-Width Binning
for Data Smoothing
• Given data set: 5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215
• The width of each equal-width bin is
W = (max − min) / N
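The slide leaves N open; assuming N = 3 bins, a sketch of equal-width partitioning for this data set (note how the skewed values crowd into the first bin):

```python
data = [5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215]

N = 3                                   # number of bins (assumed; the slide leaves N open)
width = (max(data) - min(data)) / N     # (215 - 5) / 3 = 70
edges = [min(data) + i * width for i in range(N + 1)]   # [5, 75, 145, 215]

bins = [[] for _ in range(N)]
for x in data:
    i = min(int((x - min(data)) // width), N - 1)       # clamp the maximum into the last bin
    bins[i].append(x)

print(edges)
print(bins)   # most values fall into the first bin -- equal width handles skewed data poorly
```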
Data Smoothening by Three Point Summary
• For any instance I, the three-point moving average is
calculated as follows:
smoothed value at I = ( X(I−1) + X(I) + X(I+1) ) / 3
• Consider the following data set : 3,4,7,2,14,0,21,9,1,5
• The data set would be smoothed as
3, 4.67, 4.33, 7.67, 5.33, 11.67, 10, 10.33, 5, 5
(the first and last values are left unchanged)
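A sketch of the three-point moving average, leaving the endpoints unchanged as in the example above:

```python
def three_point_moving_average(xs):
    """Replace each interior value by the mean of itself and its two neighbours;
    the first and last values are left unchanged."""
    out = list(xs)
    for i in range(1, len(xs) - 1):
        out[i] = round((xs[i - 1] + xs[i] + xs[i + 1]) / 3, 2)
    return out

print(three_point_moving_average([3, 4, 7, 2, 14, 0, 21, 9, 1, 5]))
# [3, 4.67, 4.33, 7.67, 5.33, 11.67, 10.0, 10.33, 5.0, 5]
```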
Data Smoothening by Five Point Summary
• Quartiles, outliers and Boxplot
• Quartiles: Q1 (25th percentile), Q3 (75th percentile)
– Inter-quartile range: IQR = Q3 – Q1
– The lower inner fence (min) = Q1 – (1.5 × IQR)
– The upper inner fence (max) = Q3 + (1.5 × IQR)
– Five-number summary: min, Q1, median, Q3, max
– Boxplot: the ends of the box are the quartiles, the median is
marked, whiskers extend from the box, and outliers are plotted individually
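A small sketch of the IQR and inner-fence calculation; the quartile values and candidate points are hypothetical:

```python
def iqr_fences(q1, q3):
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr   # lower inner fence
    upper = q3 + 1.5 * iqr   # upper inner fence
    return lower, upper

# Hypothetical quartiles for some attribute
q1, q3 = 61.0, 68.0
lower, upper = iqr_fences(q1, q3)
print(lower, upper)          # 50.5 and 78.5
outliers = [x for x in [45, 59, 60, 66, 70, 85] if x < lower or x > upper]
print(outliers)              # [45, 85] -- values outside the fences are plotted individually
```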
Box and Whisker Diagrams
Looking at the box plot on its own: the box spans from the lower quartile (LQ)
to the upper quartile (UQ) with the median marked inside it, and the whiskers
extend out to the minimum and maximum values.
Quartiles
• Quartiles are values that divide the data into quarters.
• The first quartile (Q1) is the value so that 25% of the data values
are below it; the third quartile (Q3) is the value so that 75% of
the data values are below it. You may have guessed that the
second quartile is the same as the median, since the median is
the value so that 50% of the data values are below it.
• This divides the data into quarters; 25% of the data is between
the minimum and Q1, 25% is between Q1 and the median, 25%
is between the median and Q3, and 25% is between Q3 and the
maximum value.
Problem
Suppose we have measured 9 females, and
their heights (in inches) sorted from smallest to
largest are:
59 60 62 64 66 67 69 70 72
What are the first and third quartiles?
Solution
To find the first quartile we first compute the locator: 25% of
9 is L = 0.25(9) = 2.25. Since this value is not a whole
number, we round up to 3. The first quartile will be the third
data value: 62 inches.
To find the third quartile, we again compute the locator: 75%
of 9 is 0.75(9) = 6.75. Since this value is not a whole number,
we round up to 7. The third quartile will be the seventh data
value: 69 inches.
Try Again for the Data Set
Data:
11, 13, 13, 15, 15, 16, 19, 20, 20, 20, 21, 21, 22, 23, 24,
30, 40, 45, 45, 45, 45, 71, 72, 73, 75
Solution in 'R'
Problem - 2
Suppose we had measured 8 females, and
their heights (in inches) sorted from smallest
to largest are:
59 60 62 64 66 67 69 70
What are the first and third quartiles? What is
the 5 number summary?
Solution
To find the first quartile we first compute the locator: 25% of 8
is L = 0.25(8) = 2. Since this value is a whole number, we will
find the mean of the 2nd and 3rd data values: (60+62)/2 = 61,
so the first quartile is 61 inches.
The third quartile is computed similarly, using 75% instead of
25%. L = 0.75(8) = 6. This is a whole number, so we will find
the mean of the 6th and 7th data values: (67+69)/2 = 68, so Q3
is 68.
Note that the median could be computed the same way, using
50%.
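A sketch of the locator rule used in these solutions; note that statistical software (e.g., R's quantile() or NumPy) may use a different interpolation rule and give slightly different quartiles:

```python
import math

def quartile(sorted_values, p):
    """Locator rule from the slides: L = p * N; if L is whole, average the
    L-th and (L+1)-th values, otherwise round L up and take that value."""
    n = len(sorted_values)
    L = p * n
    if L == int(L):
        i = int(L)
        return (sorted_values[i - 1] + sorted_values[i]) / 2
    return sorted_values[math.ceil(L) - 1]

heights9 = [59, 60, 62, 64, 66, 67, 69, 70, 72]
print(quartile(heights9, 0.25), quartile(heights9, 0.75))   # 62 69

heights8 = [59, 60, 62, 64, 66, 67, 69, 70]
print(quartile(heights8, 0.25), quartile(heights8, 0.75))   # 61.0 68.0
print(quartile(heights8, 0.50))                             # 65.0 (the median, using 50%)
```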
Cluster Analysis
Regression Analysis
• Regression analysis: A collective name for
techniques for the modeling and analysis of
numerical data consisting of values of a
dependent variable (also called response
variable or measurement) and of one or more
independent variables (aka. explanatory
variables or predictors)
• The parameters are estimated so as to give a
"best fit" of the data
• Most commonly the best fit is evaluated by
using the least squares method, but other
criteria have also been used
• Used for prediction (including
forecasting of time-series data),
inference, hypothesis testing,
and modeling of causal
relationships
(Figure: data points in the x–y plane with a fitted line y = x + 1; for an
observation at X1, Y1 is the observed value and Y1' is the value predicted by
the line.)
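A minimal sketch of a least-squares fit for a single predictor; the data points are hypothetical, scattered around the line y = x + 1:

```python
def least_squares_fit(xs, ys):
    """Ordinary least squares fit of y = a*x + b for a single predictor x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    a = num / den
    b = mean_y - a * mean_x
    return a, b

# Hypothetical points scattered around y = x + 1
xs = [1, 2, 3, 4, 5]
ys = [2.1, 2.9, 4.2, 4.8, 6.0]
a, b = least_squares_fit(xs, ys)
print(round(a, 2), round(b, 2))      # slope and intercept close to 1 and 1
y1_pred = a * 2 + b                  # predicted value Y1' at X1 = 2
```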
How to Handle Inconsistent Data?
• Manual correction using external references
• Semi-automatic using various tools
– To detect violation of known functional
dependencies and data constraints
– To correct redundant data
Data Integration
• Data integration:
– combines data from multiple sources into a coherent store
• Schema integration
– integrate metadata from different sources
– Entity identification problem: identify real world entities from
multiple data sources, e.g., A.cust-id ≡ B.cust-#
• Detecting and resolving data value conflicts
– for the same real world entity, attribute values from different
sources are different
– possible reasons: different representations, different scales,
e.g., metric vs. British units, different currency
Handling Redundant Data in
Data Integration
• Redundant data occur often when integrating multiple DBs
– The same attribute may have different names in different databases
– One attribute may be a "derived" attribute in another table, e.g.,
annual revenue
– Careful integration can help reduce/avoid redundancies and
inconsistencies and improve mining speed and quality
• Redundant data can often be detected by correlation analysis,
e.g., Pearson's correlation coefficient for numerical data
Correlation Analysis
• The term correlation refers to the strength and direction of a linear
relationship between two or more variables
• If a change in one variable effects a change in the other variable,
the variables are said to be correlated.
• Is there a relationship between the number of hours a student
spends studying for a calculus test and the student's score on that
calculus test?
• There are basically three types of correlation, namely positive
correlation, negative correlation and zero correlation
Pearson's correlation coefficient
The formula, r = Σ(xᵢ − x̄)(yᵢ − ȳ) / √( Σ(xᵢ − x̄)² · Σ(yᵢ − ȳ)² ), returns a
value between -1 and 1, where:
• If r is close to 1, we say that the variables are positively correlated. This means there
is likely a strong linear relationship between the two variables, with a positive slope.
• If r is close to -1, we say that the variables are negatively correlated. This means
there is likely a strong linear relationship between the two variables, with a negative
slope.
• If r is close to 0, we say that the variables are not correlated. This means that there is
likely no linear relationship between the two variables.
Pearson's correlation coefficient
X   Y
5 25
3 20
4 21
10 35
15 38
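A sketch computing Pearson's r for the table above:

```python
def pearson_r(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = (sum((x - mean_x) ** 2 for x in xs) * sum((y - mean_y) ** 2 for y in ys)) ** 0.5
    return num / den

x = [5, 3, 4, 10, 15]
y = [25, 20, 21, 35, 38]
print(round(pearson_r(x, y), 2))   # ≈ 0.97: strong positive linear relationship
```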
Pearson's correlation coefficient
Exercise
Correlation Analysis – Chi Square Test (Nominal
Data)
• A hypothesis is an idea that can be tested.
• Null hypothesis: there is no association between the two variables; they are
unrelated to each other. Denoted H0.
• Alternative hypothesis: assumes that there is an association between the two
variables. Denoted H1.
• Degrees of Freedom & Level of Significance
• The significance level, also denoted alpha or 'α', is the probability of
rejecting the null hypothesis when it is true, usually 0.05.
• Degrees of freedom refers to the maximum number of logically independent
values, i.e., values that have the freedom to vary:
DF = (r-1) x (c-1)
Chi-Square Calculation: An Example
• Numbers in parentheses are expected counts calculated based on
the data distribution in the two categories

                           Playing Cricket   Not playing Cricket   Sum (row)
Like science fiction           250 (90)           200 (360)           450
Not like science fiction        50 (210)         1000 (840)          1050
Sum (col.)                     300                1200               1500
Chi Square Test (Nominal Data)
• The Observed values are those we gather ourselves. The
expected values are the frequencies expected, based on our null
hypothesis.
• If the observed chi-square test statistic is greater than the critical
value (found in the chi-square Table), the null hypothesis can be
rejected.
χ² = Σ (Observed − Expected)² / Expected

χ² = (250 − 90)²/90 + (50 − 210)²/210 + (200 − 360)²/360 + (1000 − 840)²/840
   = 507.93
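A sketch that recomputes the expected counts from the row and column totals and then the chi-square statistic; scipy.stats.chi2_contingency(observed, correction=False) should give the same value:

```python
observed = [[250, 200],     # like science fiction:     playing / not playing cricket
            [50, 1000]]     # not like science fiction

row_totals = [sum(r) for r in observed]          # [450, 1050]
col_totals = [sum(c) for c in zip(*observed)]    # [300, 1200]
grand = sum(row_totals)                          # 1500

chi2 = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        expected = row_totals[i] * col_totals[j] / grand   # e.g. 450 * 300 / 1500 = 90
        chi2 += (obs - expected) ** 2 / expected

print(round(chi2, 2))   # 507.93 -- far above 3.84 (critical value at α = 0.05, df = 1),
                        # so the null hypothesis of no association is rejected
```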
Chi-Square Table
Data Reduction
• Problem:
Data Warehouse may store terabytes of data:
Complex data analysis/mining may take a very
long time to run on the complete data set
• Solution?
– Data reduction…
Obtain a reduced representation of the data set
that is much smaller in volume but yet
produces the same (or almost the same)
analytical results
• Data reduction strategies
– Data cube aggregation
– Dimensionality reduction
– Data compression
– Numerosity reduction
– Discretization and concept hierarchy generation
Principal Component Analysis – The need
• When the data has too many dimensions, it becomes a problem for
data mining:
– High compute and execution time
– A risk of compromising the quality of the model fit
• When the dimension of data is too high, we need to find a way to reduce
it.
• But that reduction has to be done in such a way that we maintain the
original pattern of the data.
• PCA – A well-known dimensionality reduction technique that maps the
data into a space of lower dimensionality
• It transforms a set of correlated variables into a new set of uncorrelated
variables
• A form of unsupervised learning, also used as a data pre-processing
technique
Principal Component Analysis
• Principal component analysis is a method of extracting important
variables known as principal components from a large set of variables
available in a data set.
• It represents the original data in terms of its principal components in a
new dimension space.
• Principal components (PC) are the directions where there is the most
variance, the directions where the data is most spread out.
• There are multiple principal components of a data set – each capturing a
different share of the variance in the data. They are arranged in decreasing
order of variance.
• The first PC captures the most variance, i.e. the most information about
the data, followed by the second, third and so on.
Principal Component Analysis
• Mathematically, the principal
components are the eigenvectors of
the covariance matrix of the original
dataset.
• Each eigenvector corresponds to a
direction, and each eigenvector has a
corresponding eigenvalue.
• An eigenvalue is a number that
indicates how much variance there is
in the data along that eigenvector.
• A larger eigenvalue means that the
corresponding principal component
explains a larger share of the
variance in the data.
PCA – How it Works ?
• The main purpose of a principal component analysis is the
analysis of data to identify and find patterns to reduce the
dimensions of the dataset with a minimal loss of information.
• Start with N-dimensional data
• Obtain the covariance matrix
• Calculate its N eigenvalues and eigenvectors
• Order the eigenvectors by decreasing eigenvalue
• The eigenvector with the highest eigenvalue is the first principal
component of the given data set
• Thus, the given N-dimensional data is finally reduced to
p principal components (p < N).
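A minimal NumPy sketch of these steps on hypothetical two-dimensional data:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical 2-D data with strongly correlated columns
x = rng.normal(size=200)
data = np.column_stack([x, 2 * x + rng.normal(scale=0.3, size=200)])

centered = data - data.mean(axis=0)          # centre each dimension
cov = np.cov(centered, rowvar=False)         # covariance matrix of the original data
eigvals, eigvecs = np.linalg.eigh(cov)       # eigenvalues / eigenvectors (ascending)

order = np.argsort(eigvals)[::-1]            # order eigenvectors by decreasing eigenvalue
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

p = 1                                        # keep only the first principal component
reduced = centered @ eigvecs[:, :p]          # N-dimensional data mapped to p dimensions

print(eigvals / eigvals.sum())               # share of variance explained by each PC
print(reduced.shape)                         # (200, 1)
```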
Data Transformation
• Smoothing: remove noise from data (binning,
clustering, regression)
• Aggregation: summarization, data cube construction
• Generalization: concept hierarchy climbing
• Normalization: scaled to fall within a small,
specified range
– min-max normalization
– z-score normalization
– normalization by decimal scaling
• Attribute/feature construction
– New attributes constructed from the given ones
Data Transformation: Normalization
• min-max normalization:
v' = (v − min_A) / (max_A − min_A) × (new_max_A − new_min_A) + new_min_A
• z-score normalization:
v' = (v − mean_A) / stand_dev_A
• normalization by decimal scaling:
v' = v / 10^j, where j is the smallest integer such that Max(|v'|) < 1
Particularly useful for classification (neural networks, distance measurements,
nearest-neighbour classification, etc.)
Normalization
• Min-max normalization: to [new_min_A, new_max_A]
v' = (v − min_A) / (max_A − min_A) × (new_max_A − new_min_A) + new_min_A
Ex. Let income range $12,000 to $98,000 be normalized to [0.0, 1.0].
Then $73,600 is mapped to
(73,600 − 12,000) / (98,000 − 12,000) × (1.0 − 0) + 0 = 0.716
• Z-score normalization (μ: mean, σ: standard deviation):
v' = (v − μ_A) / σ_A
Ex. Let μ = 54,000, σ = 16,000. Then
(73,600 − 54,000) / 16,000 = 1.225
• Normalization by decimal scaling:
v' = v / 10^j, where j is the smallest integer such that Max(|v'|) < 1
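A sketch of the three normalization methods, checked against the worked example above; treating $98,000 as the largest absolute value for decimal scaling is an assumption:

```python
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score(v, mean_a, std_a):
    return (v - mean_a) / std_a

def decimal_scaling(v, max_abs):
    j = 0
    while max_abs / 10 ** j >= 1:      # smallest j such that max(|v'|) < 1
        j += 1
    return v / 10 ** j

print(round(min_max(73_600, 12_000, 98_000), 3))   # 0.716
print(round(z_score(73_600, 54_000, 16_000), 3))   # 1.225
print(decimal_scaling(73_600, 98_000))             # 0.736  (j = 5, divide by 100,000)
```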