Statistical Preliminaries

  1. Statistical Preliminaries. R. Akerkar, TMRF, Kolhapur, India. Data Mining - R. Akerkar
  2. Data mining: tools, methodologies, and theories for revealing patterns in data, a critical step in knowledge discovery. Driving forces: explosive growth of data in a great variety of fields; cheaper storage devices with higher capacity; faster communication; better database management systems; rapidly increasing computing power. Make data work for us.
  3. Categorization. Supervised learning vs. unsupervised learning: is Y available in the training data? Regression vs. classification: is Y quantitative or qualitative?
  4. Supervised learning: learning from examples, where a training set is given that acts as examples for the classes. The system finds a description for each class. Once the description, and hence the classification rule, has been formulated, it is used to predict the class of previously unseen objects.
  5. Classification rule. The domestic flights in the country were operated by Air Canada. Recently, many new airlines began their operations. Some of Air Canada's customers started flying with these private airlines. As a result, Air Canada is losing customers. Question: why do some customers remain loyal while others leave? To predict: which customers the airline is most likely to lose to its competitors. Build a model based on the historical data of loyal customers versus customers who have left.
  6. Statistics: a theory-rich approach for data analysis. Measures of central tendency, or averages: a single expression representing the whole group is selected. This single expression is known in statistics as the average. Averages generally lie in the central part of the distribution, and therefore they are also called measures of central tendency.
  7. Types of measures of central tendency or averages: arithmetic mean (or simply mean), median, mode, geometric mean, harmonic mean.
  8. Arithmetic mean: the ratio of the sum of all observations to the total number of observations. Median: the middle value of the variable in a set of observations when they are arranged in ascending or descending order of magnitude; it divides the data into two equal parts. Mode: the value in the series that occurs most frequently; in a frequency distribution, the mode is the value with the maximum frequency.
  9. Examples: suppose we want to find the average height of a student in a class. Mean: measure the height of every student, add the heights, and divide the sum by the number of students in the class; the result is the mean height. Median: ask the students to line up in order of height; the height of the middle student is the median. If the number of students is odd, there is a single middle student; if it is even, the median is the average of the heights of the two middle students. Mode: measure the heights and make a frequency distribution table, with the heights in one column and their frequencies in the other. Given the limited precision of our measuring instruments, many students will have the same recorded height. The modal height is the height with the maximum frequency, that is, the one shared by the largest number of students.
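The three averages in the height example can be sketched with Python's standard `statistics` module. The height values below are hypothetical and chosen only for illustration; they do not come from the slides.

```python
import statistics

# Hypothetical heights in centimetres for a class of seven students.
heights_cm = [150, 152, 152, 155, 158, 160, 152]

# Mean: sum of all heights divided by the number of students.
mean_height = statistics.mean(heights_cm)

# Median: middle value once the heights are sorted (odd count here).
median_height = statistics.median(heights_cm)

# Mode: the height with the highest frequency.
modal_height = statistics.mode(heights_cm)

# With an even number of students, the median is the average
# of the two middle heights.
even_class_median = statistics.median([150, 152, 156, 160])

print(mean_height, median_height, modal_height, even_class_median)
```

For this made-up class the median and the mode coincide at 152 cm, while the mean is pulled slightly higher by the taller students.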
  10. Variance is defined as the mean of the squares of the deviations (differences) from the mean. Procedure: (1) Calculate the mean of the observations. (2) Calculate the difference of each observation from the mean. (3) Square the differences. (4) Add all the squares. (5) Divide the sum by the total number of observations.
  11. Standard deviation: the square root of the variance.
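The five-step variance procedure, followed by the square root for the standard deviation, can be sketched as below. The function names and the sample values are illustrative assumptions, not part of the slides. Because the procedure divides by the total number of observations, this is the population form of the variance.

```python
import math

def population_variance(observations):
    """Mean of the squared deviations from the mean (divides by n)."""
    n = len(observations)
    mean = sum(observations) / n                     # step 1
    deviations = [x - mean for x in observations]    # step 2
    squares = [d * d for d in deviations]            # step 3
    total = sum(squares)                             # step 4
    return total / n                                 # step 5

def population_std_dev(observations):
    """Square root of the population variance."""
    return math.sqrt(population_variance(observations))

# Hypothetical observations with mean 5.
sample = [2, 4, 4, 4, 5, 5, 7, 9]
print(population_variance(sample))  # 4.0
print(population_std_dev(sample))   # 2.0
```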
  12. Exercise 1. Find the median of the data in the above figure. Find the standard deviation of the data in the above figure.
  13. Solutions. There are 15 data points in the histogram. Seven are smaller than 3 and seven are greater than 3, so the median is 3. List the full set of observations in a spreadsheet, repeating values as many times as they occur: 0, 0, 0, 0, 1, 2, 2, 3, 4, 4, 4, 5, 5, 6, 7. Apply the function STDEVP to the observations. The result is 2.28.
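The same solution can be checked outside a spreadsheet. This sketch uses Python's `statistics.pstdev`, which computes the population standard deviation and so plays the role of STDEVP here; the variable names are illustrative.

```python
import statistics

# The 15 observations listed in the solution.
data = [0, 0, 0, 0, 1, 2, 2, 3, 4, 4, 4, 5, 5, 6, 7]

# Median: the 8th of the 15 sorted values.
median_value = statistics.median(data)

# Population standard deviation (divides by n, like STDEVP).
std_dev = statistics.pstdev(data)

print(median_value)       # 3
print(round(std_dev, 2))  # 2.28
```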
  14. Exercise 2
  15. Solutions
  16. Exercise 3
  17. Solution
  18. Normal distribution. Normal distributions are a family of distributions. They are symmetric, with scores more concentrated in the middle than in the tails. They are defined by two parameters: the mean (μ) and the standard deviation (σ).
  19. For example, there are probably a nearly infinite number of factors that determine a person's height (thousands of genes, nutrition, diseases, etc.). Thus, height can be expected to be normally distributed in the population. The normal distribution function is $f(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}}$, for $-\infty < x < \infty$, where $\mu$ is the mean, $\sigma$ is the standard deviation, $e$ is the base of the natural logarithm, sometimes called Euler's number (2.71...), and $\pi$ is the constant pi (3.14...).
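The density formula on this slide translates almost term by term into code. This is a minimal sketch; the function name `normal_pdf` is an assumption, and the result is compared against `statistics.NormalDist` from the Python standard library (available from Python 3.8) as a sanity check.

```python
import math
import statistics

def normal_pdf(x, mu, sigma):
    """Normal density: 1 / (sqrt(2*pi) * sigma) * exp(-1/2 * ((x - mu) / sigma)**2)."""
    coefficient = 1.0 / (math.sqrt(2.0 * math.pi) * sigma)
    exponent = -0.5 * ((x - mu) / sigma) ** 2
    return coefficient * math.exp(exponent)

# Peak of the standard normal curve (mu = 0, sigma = 1).
peak = normal_pdf(0.0, 0.0, 1.0)

# The curve is symmetric about the mean.
left = normal_pdf(-1.5, 0.0, 1.0)
right = normal_pdf(1.5, 0.0, 1.0)

# Cross-check against the standard library implementation.
reference = statistics.NormalDist(mu=30.0, sigma=5.0).pdf(37.0)
manual = normal_pdf(37.0, 30.0, 5.0)

print(peak, left, right, manual, reference)
```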
  20. Null hypothesis. The statistical hypothesis that is set up for testing is known as the null hypothesis. It states that there is no difference between the sample statistic and the population parameter. The purpose of hypothesis testing is to test the viability of the null hypothesis in the light of experimental data. Consider a researcher interested in whether the time to respond to a tone is affected by the consumption of alcohol. The null hypothesis is that µ1 - µ2 = 0, where µ1 is the mean time to respond after consuming alcohol and µ2 is the mean time to respond otherwise. Thus, the null hypothesis concerns the parameter µ1 - µ2 and states that this parameter equals zero.
  21. Null hypothesis vs. experimental data. The null hypothesis is often the reverse of what the experimenter actually believes; it is put forward to allow the data to contradict it. In the experiment on the effect of alcohol, the experimenter probably expects alcohol to have a harmful effect. If the experimental data show a sufficiently large effect of alcohol, then the null hypothesis that alcohol has no effect can be rejected.
  22. Hypothesis testing. Hypothesis testing is a method of inferential statistics. An experimenter starts with a hypothesis about a population parameter, called the null hypothesis. Data are then collected and the viability of the null hypothesis is determined in light of the data. If the data are very different from what would be expected under the assumption that the null hypothesis is true, then the null hypothesis is rejected. If the data are not greatly at variance with what would be expected under the assumption that the null hypothesis is true, then the null hypothesis is not rejected.
  23. The test of hypothesis shows whether the difference between the sample statistic and the corresponding hypothetical population parameter is significant or not. Thus the test of hypothesis is also known as the test of significance.
  24. A classical model for hypothesis testing: $P = \dfrac{\bar{X}_1 - \bar{X}_2}{\sqrt{v_1/n_1 + v_2/n_2}}$, where $P$ is the significance score; $\bar{X}_1$ and $\bar{X}_2$ are the sample means for the independent samples; $v_1$ and $v_2$ are the variance scores for the respective means; and $n_1$ and $n_2$ are the corresponding sample sizes.
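The significance score on this slide is the difference of the two sample means divided by the square root of v1/n1 + v2/n2. A minimal sketch follows; the function name, parameter names, and the input numbers are hypothetical. The slide does not state a decision threshold; comparing the absolute score with 1.96 for a 95% confidence level is a common large-sample convention and is mentioned here only as an assumption.

```python
import math

def significance_score(mean_1, mean_2, var_1, var_2, n_1, n_2):
    """Difference of sample means divided by sqrt(v1/n1 + v2/n2)."""
    standard_error = math.sqrt(var_1 / n_1 + var_2 / n_2)
    return (mean_1 - mean_2) / standard_error

# Hypothetical summary statistics for two independent samples.
score = significance_score(
    mean_1=82.0, mean_2=78.0,
    var_1=64.0, var_2=36.0,
    n_1=100, n_2=100,
)
print(score)

# Assumed convention (not from the slide): treat the difference as
# significant at the 95% level when |score| is at least 1.96.
is_significant = abs(score) >= 1.96
print(is_significant)
```

With these made-up numbers the standard error is 1, so the score equals the raw difference of the means, 4.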
  25. Exercise
  26. Solution
  27. Exercise
  28. Solution
  29. Exercise. If scores are normally distributed with a mean of 30 and a standard deviation of 5, what percent of the scores is: (a) greater than 30? (b) greater than 37? (c) between 28 and 34?
  30. Answers: (a) 50% (b) 8.08% (c) 44.35%
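These answers can be reproduced with the cumulative distribution function of a normal distribution. The sketch below uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later); the variable names are illustrative.

```python
import statistics

# Scores are normally distributed with mean 30 and standard deviation 5.
scores = statistics.NormalDist(mu=30.0, sigma=5.0)

# (a) Share of scores greater than 30 (the mean).
greater_than_30 = 1.0 - scores.cdf(30.0)

# (b) Share of scores greater than 37, i.e. z = (37 - 30) / 5 = 1.4.
greater_than_37 = 1.0 - scores.cdf(37.0)

# (c) Share of scores between 28 and 34, i.e. z from -0.4 to 0.8.
between_28_and_34 = scores.cdf(34.0) - scores.cdf(28.0)

for share in (greater_than_30, greater_than_37, between_28_and_34):
    print(f"{share * 100:.2f}%")
```

Computed at full precision, (c) comes out at about 44.36%. The answer of 44.35 listed on the slide is what results from adding the four-decimal table areas 0.2881 and 0.1554, so the difference is most likely rounding.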
