Statistical Preliminaries

R. Akerkar
TMRF, Kolhapur, India

Data Mining - R. Akerkar

 Data mining: tools, methodologies, and theories for
revealing patterns in data, a critical step in
knowledge discovery.

 Driving forces:
 Explosive growth of data in a great variety of fields
 Cheaper storage devices with higher capacity
 Faster communication
 Better database management systems
 Rapidly increasing computing power
 Make data work for us


 Categorization
 Supervised learning vs. unsupervised learning
 Is Y available in the training data?
 Regression vs. classification
 Is Y quantitative or qualitative?


Supervised learning

 Learning from examples: a training set is given
whose members act as examples of the classes.
 The system finds a description for each class.
 Once the description, and hence the classification
rule, has been formulated, it is used to predict
the class of previously unseen objects.


Classification Rule

 Domestic flights in the country were operated by
Air Canada.
 Recently, many new airlines began operating.
 Some Air Canada customers started flying with
these private airlines.
 As a result, Air Canada is losing customers.

 Question: why do some customers remain loyal while
others leave?
 To predict: which customers the airline is most likely
to lose to its competitors.
 Build a model based on the historical data of loyal
customers versus customers who have left.


Statistics

 A theory-rich approach to data analysis.

 Measures of central tendency, or averages
 A single value is selected to represent the whole group.
 In statistics this single value is known as the average.
 Averages generally lie in the central part of the distribution.
 They are therefore also called measures of central
tendency.


Types of measures of central tendency or
averages
 Arithmetic Mean (or simply mean)
 Median
 Mode
 Geometric Mean
 Harmonic Mean


 Arithmetic Mean: the ratio of the sum of all
observations to the total number of observations.

 Median: the middle value of the variable in a
set of observations when they are arranged in
ascending or descending order of magnitude.
It divides the data into two equal parts.

 Mode: the value in the series that occurs most
frequently. In a frequency distribution, the mode is
the value with the maximum frequency.


 Example: suppose we want to find the average height of a student in
a class.

 We can measure the heights of all the students, add them, and
divide the sum by the number of students in the class. This gives the mean height.

 We can ask the students to line up in order of height; the height
of the middle student is then the median. If there is an odd number of
students, there is a single middle student. If the number is even, the
median is the average of the heights of the two middle students.

 We can measure their heights and build a frequency distribution
table, with the height in one column and the frequency in the other.
Because our measuring instruments have limited precision, many students
will be recorded with the same height. The modal height is the one
shared by the largest number of students, that is, the height with the
maximum frequency.
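
The three procedures above can be sketched in Python with the standard `statistics` module. The heights below are hypothetical values in centimetres chosen only for illustration; they do not come from the slides. An even number of students is used so that the median falls between the two middle heights.

```python
from statistics import mean, median, mode

# Hypothetical heights in centimetres (illustrative values only).
heights_cm = [150, 152, 152, 155, 158, 160]

# Mean: add all heights and divide by the number of students.
mean_height = mean(heights_cm)

# Median: with an even count, the average of the two middle heights
# after sorting (here 152 and 155).
median_height = median(heights_cm)

# Mode: the height with the highest frequency (152 occurs twice).
modal_height = mode(heights_cm)

print(mean_height, median_height, modal_height)
```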


 Variance
 Defined as the mean of the squared
deviations (differences) from the mean.
 Procedure:
1. Calculate the mean of the observations.
2. Calculate the difference of each observation
from the mean.
3. Square the differences.
4. Add all the squares.
5. Divide the sum by the total number of
observations.


 Standard Deviation
 It is the square root of the variance.
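
A minimal sketch of the five-step variance procedure, followed by the standard deviation as its square root. The function names and the sample values are hypothetical and chosen so the arithmetic is easy to check by hand. Note that this divides by the total number of observations, so it is the population variance.

```python
import math

def population_variance(observations):
    """Mean of the squared deviations from the mean (divides by n)."""
    n = len(observations)
    mean = sum(observations) / n                    # step 1: mean
    deviations = [x - mean for x in observations]   # step 2: differences from the mean
    squares = [d * d for d in deviations]           # step 3: square the differences
    total = sum(squares)                            # step 4: add the squares
    return total / n                                # step 5: divide by the number of observations

def population_std(observations):
    """Standard deviation: square root of the variance."""
    return math.sqrt(population_variance(observations))

# Hypothetical observations (illustrative only): mean 5, squared deviations sum to 32.
sample = [2, 4, 4, 4, 5, 5, 7, 9]
variance = population_variance(sample)
std_dev = population_std(sample)
print(variance, std_dev)
```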


Exercise 1

 Find the median of the data in the above
figure.
 Find the standard deviation of the data in the
above figure.


Solutions
 There are 15 data points in the histogram.
Seven are smaller than 3 and seven are
greater than 3, so the median is 3.

 List the full set of observations in a
spreadsheet, repeating values as many times
as they occur: 0, 0, 0, 0, 1, 2, 2, 3, 4, 4, 4, 5,
5, 6, 7.
Apply the function STDEVP to the observations.
The result is 2.28.
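
The same check can be done outside a spreadsheet. The sketch below uses Python's `statistics.pstdev`, which, like STDEVP, computes the population standard deviation (dividing by n); the variable names are illustrative.

```python
from statistics import median, pstdev

# The 15 observations listed in the solution above.
observations = [0, 0, 0, 0, 1, 2, 2, 3, 4, 4, 4, 5, 5, 6, 7]

med = median(observations)   # middle (8th) value of the sorted data
sd = pstdev(observations)    # population standard deviation

print(med, round(sd, 2))
```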


Exercise 2


Solutions


Exercise 3


Solution


Normal Distribution

 Normal distributions are a family of
distributions.
 Normal distributions are symmetric, with
scores more concentrated in the middle than
in the tails.
 They are defined by two parameters: the
mean (μ) and the standard deviation (σ).


 A quantity that results from many small, independent
influences tends to be approximately normally distributed.
 For example, a very large number of factors determine
a person's height (thousands of genes, nutrition, diseases, etc.).
 Thus, height can be expected to be approximately normally
distributed in the population.

 The normal distribution function is given by

\[ f(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}}, \quad -\infty < x < \infty \]

 where μ is the mean
 σ is the standard deviation
 e is the base of the natural logarithm, sometimes called Euler's e
(2.71...)
 π is the constant Pi (3.14...)
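
As a rough illustration, the density can be evaluated directly from its definition, f(x) = 1/(sqrt(2π)·σ) · exp(-½·((x-μ)/σ)²). The function name `normal_pdf` and the parameter values (μ = 30, σ = 5) are assumptions made for this sketch. The result is compared against `statistics.NormalDist` from the Python standard library.

```python
import math
from statistics import NormalDist

def normal_pdf(x, mu, sigma):
    """Normal density at x for mean mu and standard deviation sigma."""
    coefficient = 1.0 / (math.sqrt(2.0 * math.pi) * sigma)
    exponent = -0.5 * ((x - mu) / sigma) ** 2
    return coefficient * math.exp(exponent)

# Illustrative parameters (assumed for this example).
mu, sigma = 30.0, 5.0

value = normal_pdf(30.0, mu, sigma)                 # density at the mean, the peak of the curve
reference = NormalDist(mu=mu, sigma=sigma).pdf(30.0)  # library value for comparison

print(value, reference)
```

The density is highest at x = μ and falls off symmetrically on both sides, which matches the description of scores concentrating in the middle.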


Null hypothesis
 The statistical hypothesis that is set up for testing is
known as the null hypothesis. It states that there is no difference
between the sample statistic and the population parameter.

 The purpose of hypothesis testing is to test the viability of the null
hypothesis in the light of experimental data.

 Consider a researcher interested in whether the time to respond
to a tone is affected by the consumption of alcohol. The null
hypothesis is that µ1 - µ2 = 0
 where µ1 is the mean time to respond after consuming alcohol and
µ2 is the mean time to respond otherwise.
 Thus, the null hypothesis concerns the parameter µ1 - µ2 and
states that this parameter equals zero.


Null Hypothesis vs. Experimental data

 The null hypothesis is often the reverse of what the
experimenter actually believes;
 it is put forward to allow the data to contradict it.
 In the experiment on the effect of alcohol, the
experimenter probably expects alcohol to have a
harmful effect.
 If the experimental data show a sufficiently large
effect of alcohol, then the null hypothesis that
alcohol has no effect can be rejected.


Hypothesis testing
 Hypothesis testing is a method of inferential statistics.

 An experimenter starts with a hypothesis about a population
parameter called the null hypothesis.

 Data are then collected and the viability of the null
hypothesis is determined in light of the data.
 If the data are very different from what would be expected
under the assumption that the null hypothesis is true, then
the null hypothesis is rejected.
 If the data are not greatly at variance with what would be
expected under the assumption that the null hypothesis is
true, then the null hypothesis is not rejected.


 A test of hypothesis shows whether the
difference between a sample statistic and the
corresponding hypothesized population parameter
is significant or not. For this reason a test of
hypothesis is also known as a test of significance.


A Classical Model for
Hypothesis Testing
\[ P = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{v_1/n_1 + v_2/n_2}} \]
where
P is the significance score;
\bar{X}_1 and \bar{X}_2 are the sample means of the independent samples;
v_1 and v_2 are the variances of the respective samples;
n_1 and n_2 are the corresponding sample sizes.
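
A minimal sketch of this computation, assuming the score is the difference of the two sample means divided by the square root of v1/n1 + v2/n2. The function name and all input numbers are hypothetical and serve only to show the arithmetic.

```python
import math

def significance_score(mean1, mean2, var1, var2, n1, n2):
    """Difference of sample means divided by the combined standard error."""
    standard_error = math.sqrt(var1 / n1 + var2 / n2)
    return (mean1 - mean2) / standard_error

# Hypothetical summary statistics for two independent samples.
score = significance_score(mean1=10.0, mean2=8.0,
                           var1=4.0, var2=5.0,
                           n1=100, n2=100)

# Standard error is sqrt(0.04 + 0.05) = 0.3, so the score is 2 / 0.3.
print(round(score, 2))
```

A larger score means the observed difference is large relative to the variation expected from sampling alone, which counts as evidence against the null hypothesis of no difference.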


Exercise


Solution


Exercise


Solution


Exercise

 If scores are normally distributed with a mean
of 30 and a standard deviation of 5, what
percent of the scores is:
(a) greater than 30?
(b) greater than 37?
(c) between 28 and 34?


Answers

 a. 50%
b. 8.08%
c. 44.35%
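
These answers can be checked numerically. The sketch below uses `statistics.NormalDist` from the Python standard library with the mean and standard deviation given in the exercise; the variable names are illustrative. The computed values may differ from table-based answers in the last digit because printed normal tables are rounded.

```python
from statistics import NormalDist

# Scores are normally distributed with mean 30 and standard deviation 5.
scores = NormalDist(mu=30, sigma=5)

above_30 = 1 - scores.cdf(30)                        # (a) fraction greater than 30
above_37 = 1 - scores.cdf(37)                        # (b) fraction greater than 37
between_28_and_34 = scores.cdf(34) - scores.cdf(28)  # (c) fraction between 28 and 34

for label, fraction in [("a", above_30), ("b", above_37), ("c", between_28_and_34)]:
    print(label, f"{100 * fraction:.2f}%")
```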

