2. Data mining: tools, methodologies, and theories for
revealing patterns in data—a critical step in
knowledge discovery.
Driving forces:
Explosive growth of data in a great variety of fields
Cheaper storage devices with higher capacity
Faster communication
Better database management systems
Rapidly increasing computing power
Make data work for us
Data Mining - R. Akerkar 2
3. Categorization
Supervised learning vs. unsupervised
learning
Is Y available in the training data?
Regression vs. classification
Is Y quantitative or qualitative?
4. Supervised learning
Learning from examples: a training set is given,
whose members act as examples of the classes.
The system finds a description for each class.
Once the description, and hence the classification
rule, has been formulated, it is used to predict
the class of previously unseen objects.
5. Classification Rule
The domestic flights in the country were operated by
Air Canada.
Recently, many new airlines began their operations.
Some of Air Canada's customers started flying with
these private airlines.
As a result, Air Canada is losing customers.
Question: Why do some customers remain loyal while
others leave?
To predict: which customers it is most likely to lose
to its competitors.
Build a model based on the historical data of loyal
customers versus customers who have left.
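One way to picture "build a model from historical data" is the minimal sketch below. It assumes a nearest-centroid rule, which the slides do not prescribe, and the two features (flights per year, complaints filed) and all data values are hypothetical. The "description" of each class is simply the mean feature vector of its historical customers.

```python
from statistics import mean

def fit_centroids(training_set):
    """Return one description per class: the mean of each feature."""
    by_class = {}
    for features, label in training_set:
        by_class.setdefault(label, []).append(features)
    return {
        label: tuple(mean(column) for column in zip(*rows))
        for label, rows in by_class.items()
    }

def predict(centroids, features):
    """Classification rule: choose the class whose centroid is closest."""
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(centroids, key=lambda label: squared_distance(centroids[label]))

# Hypothetical history: (flights per year, complaints filed) -> outcome
training_set = [
    ((12, 0), "loyal"),
    ((10, 1), "loyal"),
    ((3, 4), "left"),
    ((2, 5), "left"),
]

centroids = fit_centroids(training_set)
prediction = predict(centroids, (11, 1))  # a previously unseen customer
```

Any other supervised learner could stand in for the centroid rule; the point is only that the class labels in the training data drive the model.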
6. Statistics
A theory-rich approach to data analysis.
Measures of central tendency or Averages
A single expression representing the whole group is
selected.
This single expression is known in statistics as the
average.
Averages generally lie in the central part of the distribution,
and therefore they are also called measures of central
tendency.
7. Types of measures of central tendency or
averages
Arithmetic Mean (or simply mean)
Median
Mode
Geometric Mean
Harmonic Mean
8. Arithmetic Mean: It is the ratio of the sum of all
observations to the total number of observations.
Median: It is the middlemost value of the variable in a
set of observations when they are arranged in
ascending or descending order of magnitude.
Thus it divides the data into two equal parts.
Mode: The mode is the value in the series
that occurs most frequently. In a frequency
distribution, the mode is the value with the maximum
frequency.
9. Examples: Suppose we want to find the average height of a student in
a class.
We can measure the heights of all the students, add them, and
divide the sum by the number of students in the class. This gives the mean height.
We can ask the students to line up in order of height;
the height of the middlemost student is then the median. If there
is an odd number of students, there is a single middle one; if the number is
even, the median is the average of the heights of the two middle
students.
We can measure their heights and make a frequency distribution
table, with the heights of the students in one
column and the frequencies in the other. Given the limited precision of our
measuring instruments, many students will be recorded with the same height.
The modal height is the one shared by the largest number of students,
that is, the height with the maximum frequency.
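The three averages for the class example can be sketched with Python's standard `statistics` module. The heights below are hypothetical values chosen for illustration, with an even number of students so that the median is the average of the two middle heights.

```python
from statistics import mean, median, mode

# Hypothetical heights (cm) of six students, already in queue order
heights_cm = [150, 152, 152, 155, 158, 160]

mean_height = mean(heights_cm)      # sum of heights / number of students
median_height = median(heights_cm)  # average of the two middle heights
modal_height = mode(heights_cm)     # the most frequent height
```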
10. Variance
Variance is defined as the mean of the squared
deviations (differences) from the mean.
Procedure:
1. Calculate the mean of the observations.
2. Then calculate the difference of each observation
from the mean.
3. Then square the differences.
4. Add all the squares.
5. Divide the sum by the total number of
observations.
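The five steps can be followed line by line in a short sketch. The function name and the sample data are illustrative assumptions. Because step 5 divides by the total number of observations, this is the population variance.

```python
def population_variance(observations):
    """Variance computed by following the five-step procedure."""
    n = len(observations)
    mean = sum(observations) / n                    # step 1: the mean
    differences = [x - mean for x in observations]  # step 2: deviations
    squares = [d ** 2 for d in differences]         # step 3: square them
    total = sum(squares)                            # step 4: add the squares
    return total / n                                # step 5: divide by n

# Hypothetical observations with mean 5
example_variance = population_variance([2, 4, 4, 4, 5, 5, 7, 9])
```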
11. Standard Deviation
It is the square root of the variance.
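Written out as a formula, this definition reads as follows. The symbols are assumptions introduced here, not notation from the slides: N observations x_1, ..., x_N with mean x-bar.

```latex
\sigma = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( x_i - \bar{x} \right)^2},
\qquad
\bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i
```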
12. Exercise 1
Find the median of the data in the above
figure.
Find the standard deviation of the data in the
above figure.
13. Solutions
There are 15 data points in the histogram.
Seven are smaller than 3 and seven are
greater than 3, so the median is 3.
List the full set of observations in a
spreadsheet, repeating values as many times
as they occur: 0, 0, 0, 0, 1, 2, 2, 3, 4, 4, 4, 5,
5, 6, 7.
Apply the function STDEVP to the observations.
The result is 2.28.
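The same check can be done outside a spreadsheet. This sketch uses `statistics.pstdev`, the population standard deviation from Python's standard library, as the counterpart of STDEVP; the variable names are illustrative.

```python
from statistics import median, pstdev

observations = [0, 0, 0, 0, 1, 2, 2, 3, 4, 4, 4, 5, 5, 6, 7]

middle = median(observations)  # the 8th of the 15 sorted values
spread = pstdev(observations)  # population standard deviation

print(middle, round(spread, 2))  # prints: 3 2.28
```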
18. Normal Distribution
Normal distributions are a family of
distributions.
Normal distributions are symmetric with
scores more concentrated in the middle than
in the tails.
They are defined by two parameters: the
mean (μ) and the standard deviation (σ).
19. For example, a very large number of factors probably
determine a person's height
(thousands of genes, nutrition, diseases, etc.).
The sum of many small, independent influences tends toward a normal distribution.
Thus, height can be expected to be approximately normally
distributed in the population.
The normal density function is given by
$$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}}, \quad -\infty < x < \infty$$
where μ is the mean,
σ is the standard deviation,
e is the base of the natural logarithm, sometimes called Euler's e
(2.71...), and
π is the constant pi (3.14...).
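As a minimal sketch, the density can be evaluated directly from its two parameters. The function name `normal_pdf` and the sample arguments are illustrative assumptions.

```python
import math

def normal_pdf(x, mu, sigma):
    """Normal density with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma  # distance from the mean in standard deviations
    return math.exp(-0.5 * z * z) / (math.sqrt(2 * math.pi) * sigma)

# The density peaks at the mean and is symmetric around it
peak = normal_pdf(30, mu=30, sigma=5)
left = normal_pdf(25, mu=30, sigma=5)
right = normal_pdf(35, mu=30, sigma=5)
```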
20. Null hypothesis
The statistical hypothesis that is set up for testing is
known as the null hypothesis. It states that there is no difference
between the sample statistic and the population parameter.
The purpose of hypothesis testing is to test the viability of the null
hypothesis in the light of experimental data.
Consider a researcher interested in whether the time to respond
to a tone is affected by the consumption of alcohol. The null
hypothesis is that µ1 - µ2 = 0,
where µ1 is the mean time to respond after consuming alcohol and
µ2 is the mean time to respond otherwise.
Thus, the null hypothesis concerns the parameter µ1 - µ2 and
states that the parameter equals zero.
21. Null Hypothesis vs. Experimental data
The null hypothesis is often the reverse of what the
experimenter actually believes;
it is put forward to allow the data to contradict it.
In the experiment on the effect of alcohol, the
experimenter probably expects alcohol to have a
harmful effect.
If the experimental data show a sufficiently large
effect of alcohol, then the null hypothesis that
alcohol has no effect can be rejected.
22. Hypothesis testing
Hypothesis testing is a method of inferential statistics.
An experimenter starts with a hypothesis about a population
parameter called the null hypothesis.
Data are then collected and the viability of the null
hypothesis is determined in light of the data.
If the data are very different from what would be expected
under the assumption that the null hypothesis is true, then
the null hypothesis is rejected.
If the data are not greatly at variance with what would be
expected under the assumption that the null hypothesis is
true, then the null hypothesis is not rejected.
23. The test of hypothesis discloses
whether the difference between a sample
statistic and the corresponding hypothetical
population parameter is significant or not.
Thus the test of hypothesis is also
known as the test of significance.
24. A Classical Model for
Hypothesis Testing
$$P = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{v_1/n_1 + v_2/n_2}}$$
where
$P$ is the significance score;
$\bar{X}_1$ and $\bar{X}_2$ are the sample means for the independent samples;
$v_1$ and $v_2$ are the variance scores for the respective means;
$n_1$ and $n_2$ are the corresponding sample sizes.
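A minimal sketch of the significance score, computed as the difference of the sample means divided by the square root of v1/n1 + v2/n2. Several things here are assumptions rather than content from the slides: the function name, the use of the sample variance (`statistics.variance`, which divides by n - 1), and the response times, which are invented for the alcohol example.

```python
import math
from statistics import mean, variance

def significance_score(sample1, sample2):
    """Difference of sample means scaled by its estimated standard error."""
    x1, x2 = mean(sample1), mean(sample2)
    v1, v2 = variance(sample1), variance(sample2)  # assumed: sample variance
    n1, n2 = len(sample1), len(sample2)
    return (x1 - x2) / math.sqrt(v1 / n1 + v2 / n2)

# Hypothetical response times in seconds
with_alcohol = [0.52, 0.61, 0.58, 0.66, 0.55]
without_alcohol = [0.45, 0.50, 0.48, 0.53, 0.47]

score = significance_score(with_alcohol, without_alcohol)
```

A score far from zero means the data differ sharply from what the null hypothesis µ1 - µ2 = 0 would predict. For large samples, a common convention (not stated in the slides) is to reject the null hypothesis at the 5% level when the absolute score exceeds about 1.96.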
29. Exercise
If scores are normally distributed with a mean
of 30 and a standard deviation of 5, what
percent of the scores is:
(a) greater than 30?
(b) greater than 37?
(c) between 28 and 34?
30. Answers
a. 50%
b. 8.08%
c. 44.35%
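These answers can be checked with `statistics.NormalDist` from Python's standard library (available from Python 3.8); the variable names are illustrative. Each percentage is an area under the normal curve, obtained from the cumulative distribution function.

```python
from statistics import NormalDist

scores = NormalDist(mu=30, sigma=5)

pct_above_30 = 100 * (1 - scores.cdf(30))                # (a)
pct_above_37 = 100 * (1 - scores.cdf(37))                # (b)
pct_28_to_34 = 100 * (scores.cdf(34) - scores.cdf(28))   # (c)
```

Computed at full precision, part (c) comes out at roughly 44.36%. The figure 44.35 is what four-decimal z-table values give (0.7881 - 0.3446).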