Data analysis



  1. DATA ANALYSIS, Group 5
  2. The mean, median and mode (Presenter: Huu Loc). The mean, median and mode are all valid measures of central tendency, but under different conditions some of them are more appropriate to use than others.
  3. Mean: The mean (or average) is the most popular and best-known measure of central tendency. It can be used with both discrete and continuous data, although it is most often used with continuous data. The mean is equal to the sum of all the values in the data set divided by the number of values in the data set.
  4. So, if we have n values in a data set and they have values $x_1, x_2, \ldots, x_n$, then the sample mean, usually denoted by $\bar{x}$ (pronounced x bar), is: $\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}$
  5. The mean is essentially a model of your data set: a single value that stands in for all the observations, even though it is often not one of the values actually observed. An important property of the mean is that it includes every value in your data set as part of the calculation. In addition, the mean is the only measure of central tendency where the sum of the deviations of each value from the mean is always zero.
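The definition and the zero-sum property above can be checked with a short sketch. The `data` values and the helper name `mean` are made up for illustration and do not come from the slides.

```python
def mean(values):
    """Sum of the values divided by how many values there are."""
    return sum(values) / len(values)

# Hypothetical data set, chosen only for illustration
data = [2, 4, 4, 5, 10]

x_bar = mean(data)
deviations = [x - x_bar for x in data]

print(x_bar)            # 5.0
print(sum(deviations))  # 0.0, the deviations from the mean cancel out
```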
  6. Median: The median is the middle score for a set of data that has been arranged in order of magnitude. The median is less affected by outliers and skewed data than the mean. To calculate the median, we first rearrange the data into order of magnitude, smallest first. (The example data on the slide are not reproduced here.)
  7. Our median mark is the middle mark, in this case 56. It is the middle mark because there are 5 scores before it and 5 scores after it.
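As a minimal sketch of the procedure, the snippet below sorts eleven marks and picks the middle one. The `marks` list is illustrative only: it was chosen so that the median is 56 with 5 scores on each side, and should not be read as the data shown on the slide.

```python
import statistics

# Illustrative marks (an assumption, not the slide's own table)
marks = [65, 55, 89, 56, 35, 14, 56, 55, 87, 45, 92]

ordered = sorted(marks)       # smallest first
middle = len(ordered) // 2    # index 5 of 11 values
median_mark = ordered[middle]

print(ordered)      # [14, 35, 45, 55, 55, 56, 56, 65, 87, 89, 92]
print(median_mark)  # 56

# The standard library agrees, and also handles an even number of
# values by averaging the two middle ones.
print(statistics.median(marks))  # 56
```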
  8. Mode: The mode is the most frequent score in our data set. On a bar chart or histogram it is represented by the highest bar. You can, therefore, sometimes consider the mode as being the most popular option. (Example chart not reproduced.)
  9. Normally, the mode is used for categorical data where we wish to know which is the most common category. (Example chart not reproduced.)
  10. One of the problems with the mode is that it is not necessarily unique, which causes difficulty when two or more values share the highest frequency. (Example chart not reproduced.)
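The non-uniqueness problem is easy to see in code. This sketch counts frequencies and returns every value that ties for the highest count; the function name `modes` and the transport categories are hypothetical.

```python
from collections import Counter

def modes(values):
    """Return every value that shares the highest frequency."""
    counts = Counter(values)
    top = max(counts.values())
    return [value for value, count in counts.items() if count == top]

# Hypothetical categorical data: how people travel to work
one_mode = ["bus", "car", "car", "walk", "car"]
two_modes = ["bus", "car", "car", "bus", "walk"]

print(modes(one_mode))   # ['car']
print(modes(two_modes))  # ['bus', 'car'], no single most common category
```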
  11. Summary of when to use the mean, median and mode: Use the following summary table to find the best measure of central tendency for each type of variable. (Summary table not reproduced.)
  12. MEASURES OF DISPERSION (Presenter: Nguyen Ngoc Cam)
  13. Measures of Dispersion: Measures of central tendency give us good information about the scores in our distribution. However, distributions with very different shapes can have the same central tendency. Measures of dispersion, or variability, give us information about the spread of the scores in our distribution. Are the scores clustered close together over a small portion of the scale, or are they spread out over a large segment of the scale?
  14. Main points: 1. Range; 2. Standard Deviation; 3. Variance
  15. 1. Range: The difference between the largest and the smallest number in the data. The range tells you how spread out the data is.
  16. 1. Range
  17. 1. Range. Problems: (1) It changes drastically with the magnitude of the extreme scores. (2) It is an unstable measure, so it is rarely used for statistical analyses.
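A small sketch shows why the range is considered unstable: a single extreme score changes it drastically. The scores and the helper name `value_range` are invented for illustration.

```python
def value_range(values):
    """Largest value minus smallest value."""
    return max(values) - min(values)

# Hypothetical test scores
scores = [52, 55, 56, 58, 60]
with_outlier = scores + [98]  # one extreme score added

print(value_range(scores))        # 8
print(value_range(with_outlier))  # 46
```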
  18. 2. Standard Deviation: The standard deviation is the most frequently used measure of variability. It looks at the average variability of all the scores around the mean, so all the scores are taken into account.
  19. 2. Standard Deviation: The larger the standard deviation, the more variability from the central point in the distribution. The smaller the standard deviation, the closer the scores are to the central point.
  20. 2. Standard Deviation
  21. 2. Standard Deviation
  22. 2. Standard Deviation: The SD gives us a standard measure of how far from the point of central tendency the individual scores are distributed. It gives us information that the mean does not, and that information is as important as the mean or even more so.
  23. 3. Variance
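The formula slides for the standard deviation and variance did not survive in this transcript, so the sketch below assumes the usual sample definitions: the variance is the sum of squared deviations from the mean divided by n - 1, and the standard deviation is its square root. The data values are hypothetical.

```python
from math import sqrt
import statistics

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1 (assumed form)."""
    n = len(values)
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)

def sample_sd(values):
    """Standard deviation: the square root of the variance."""
    return sqrt(sample_variance(values))

# Hypothetical scores
data = [2, 4, 4, 5, 10]

print(sample_variance(data))  # 9.0
print(sample_sd(data))        # 3.0

# Cross-check against the standard library's sample statistics
print(statistics.variance(data), statistics.stdev(data))
```

If the slides used the population form (dividing by n instead of n - 1), `statistics.pvariance` and `statistics.pstdev` would be the matching library calls.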
  24. PAIRED T-TEST (Presenter: Tran Thi Ngan Giang)
  25. Introduction
     • A paired t-test is used to compare two population means where you have two samples in which observations in one sample can be paired with observations in the other sample.
     • For example: a diagnostic test was given before studying a particular module and then again after completing the module. We want to find out if, in general, our teaching leads to improvements in students' knowledge and skills.
  26. First, we see the descriptive statistics for both variables. The post-test mean scores are higher.
  27. Next, we see the correlation between the two variables. There is a strong positive correlation. People who did well on the pre-test also did well on the post-test.
  28. Finally, we see the t value, degrees of freedom, and significance.
     • Our significance is .053.
     • If the significance value is less than .05, there is a significant difference. If the significance value is greater than .05, the difference is not statistically significant.
     • Here, the value is approaching significance but does not reach it. We cannot conclude that pre- and post-test scores differ, so this analysis gives no evidence that our test preparation course helped.
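The output tables on these slides are not reproduced, so here is a minimal sketch of how the t value and degrees of freedom of a paired t-test are obtained: take the difference within each pair, then divide the mean difference by its standard error. The function name `paired_t` and the pre/post scores are hypothetical and are not the study's data.

```python
from math import sqrt
from statistics import mean, stdev

def paired_t(before, after):
    """Return (t, df) for a paired t-test on two equal-length samples."""
    if len(before) != len(after):
        raise ValueError("paired samples must have the same length")
    diffs = [a - b for a, b in zip(after, before)]
    n = len(diffs)
    standard_error = stdev(diffs) / sqrt(n)  # SD of differences / sqrt(n)
    return mean(diffs) / standard_error, n - 1

# Hypothetical scores for five students
pre_test = [50, 60, 55, 70, 65]
post_test = [54, 62, 60, 71, 73]

t_value, df = paired_t(pre_test, post_test)
print(f"t({df}) = {t_value:.2f}")
```

The significance value is then looked up from the t distribution with `df` degrees of freedom, which a statistics package does automatically.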
  29. INDEPENDENT SAMPLES T-TESTS (Presenter: Dinh Quoc Minh Dang)
  30. Outline: 1. Introduction; 2. Hypothesis for the independent t-test; 3. What do you need to run an independent t-test?; 4. Formula; 5. Example (Calculating + Reporting)
  31. Introduction: The independent t-test, also called the two-sample t-test or Student's t-test, is an inferential statistical test that determines whether there is a statistically significant difference between the means of two unrelated groups.
  32. Hypothesis for the independent t-test: The null hypothesis for the independent t-test is that the population means of the two unrelated groups are equal: H0: $\mu_1 = \mu_2$. In most cases, we are looking to see if we can reject the null hypothesis in favour of the alternative hypothesis, which is that the population means are not equal: HA: $\mu_1 \neq \mu_2$. To do this we need to set a significance level (alpha) that allows us to either reject or fail to reject the null hypothesis. Most commonly, this value is set at 0.05.
  33. What do you need to run an independent t-test? In order to run an independent t-test you need the following: 1. One independent, categorical variable that has two levels. 2. One dependent variable.
  34. Formula. M: mean (the average score of the group); SD: standard deviation; N: number of scores in each group; Exp: experimental group; Con: control group
  35. Formula
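The formula itself is missing from this transcript. As a sketch only, the code below assumes the common form built from the symbols in the legend: the difference between the group means divided by the standard error sqrt(SD_Exp^2/N_Exp + SD_Con^2/N_Con). The function name and the summary statistics are hypothetical.

```python
from math import sqrt

def independent_t(m_exp, sd_exp, n_exp, m_con, sd_con, n_con):
    """t statistic from group summaries (assumed unpooled standard error)."""
    standard_error = sqrt(sd_exp ** 2 / n_exp + sd_con ** 2 / n_con)
    return (m_exp - m_con) / standard_error

# Hypothetical group summaries: mean, SD, and size of each group
t_value = independent_t(m_exp=80, sd_exp=10, n_exp=25,
                        m_con=74, sd_con=10, n_con=25)
df = 25 + 25 - 2  # N_Exp + N_Con - 2, the equal-variance convention

print(f"t({df}) = {t_value:.2f}")
```

With equal group sizes, as here, this form gives the same t as the pooled-variance version of the test; with unequal sizes the two can differ.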
  36. Example
  37. Example
  38. Effect Size
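The content of the effect size slide is not preserved. One widely used effect size for two independent groups is Cohen's d, the difference between the means divided by the pooled standard deviation; the sketch below assumes that definition. The function name and input values are hypothetical.

```python
from math import sqrt

def cohens_d(m1, sd1, n1, m2, sd2, n2):
    """Cohen's d using the pooled standard deviation (assumed definition)."""
    pooled_variance = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2)
    return (m1 - m2) / sqrt(pooled_variance)

# Hypothetical group summaries
d = cohens_d(m1=80, sd1=10, n1=25, m2=74, sd2=10, n2=25)
print(f"d = {d:.2f}")
```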
  39. Reporting the Result of an Independent T-Test: When reporting the result of an independent t-test, you need to include the t-statistic value, the degrees of freedom (df) and the significance value of the test (p-value). The format of the test result is: t(df) = t-statistic, p = significance value.
  40. Example result (APA style): An independent samples t-test is reported in the same way as the one-sample t-test:
       t(75) = 2.11, p = .02 (one-tailed), d = .48
     Here 75 is the degrees of freedom, 2.11 is the value of the statistic, .02 is the significance of the statistic, "(one-tailed)" is included if the test is one-tailed, and .48 is the effect size, reported if available.
     Example: Survey respondents who were employed by the federal, state, or local government had significantly higher socioeconomic indices (M = 55.42, SD = 19.25) than survey respondents who were employed by a private employer (M = 47.54, SD = 18.94), t(255) = 2.363, p = .01 (one-tailed).
  41. Analysis of Variance (ANOVA) (Presenter: Minh Sang)
  42. Introduction: We already learned about the chi-square test for independence, which is useful for data measured at the nominal or ordinal level of analysis. If we have data measured at the interval level, we can compare two or more population groups in terms of their population means using a technique called analysis of variance, or ANOVA.
  43. Completely randomized design
       Population 1              Population 2              ...   Population k
       Mean = $\mu_1$            Mean = $\mu_2$            ...   Mean = $\mu_k$
       Variance = $\sigma_1^2$   Variance = $\sigma_2^2$   ...   Variance = $\sigma_k^2$
     We want to know something about how the populations compare. Do they have the same mean? We can collect random samples from each population, which gives us the following data.
  44. Completely randomized design
       Mean = $M_1$          Mean = $M_2$          ...   Mean = $M_k$
       Variance = $s_1^2$    Variance = $s_2^2$    ...   Variance = $s_k^2$
       $n_1$ cases           $n_2$ cases           ...   $n_k$ cases
     Suppose we want to compare 3 college majors in a business school by the average annual income people make 2 years after graduation. We collect the following data (in $1000s) based on random surveys.
  45. Completely randomized design
       Accounting   Marketing   Finance
       27           23          48
       22           36          35
       33           27          46
       25           44          36
       38           39          28
       29           32          29
  46. Completely randomized design: Can the dean conclude that there are differences among the majors' incomes? H0: $\mu_1 = \mu_2 = \mu_3$. HA: not all of $\mu_1$, $\mu_2$, $\mu_3$ are equal. In this problem we must take into account: 1) The variance between samples, or the actual differences by major. This is called the sum of squares for treatment (SST).
  47. Completely randomized design: 2) The variance within samples, or the variance of incomes within a single major. This is called the sum of squares for error (SSE). Recall that when we sample, there will always be a chance of getting something different from the population. We account for this through #2, the SSE.
  48. F-Statistic: For this test, we will calculate an F statistic, which is used to compare variances: $F = \frac{SST/(k-1)}{SSE/(n-k)}$, where SST = sum of squares for treatment, SSE = sum of squares for error, k = the number of populations, and n = total sample size.
  49. F-statistic: Intuitively, the F statistic is: F = explained variance / unexplained variance. Explained variance is the difference between majors. Unexplained variance is the difference based on random sampling for each group (see Figure 10-1, page 327).
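  50. Calculating SST: $SST = \sum_{i=1}^{k} n_i (M_i - \bar{\bar{x}})^2$, where $\bar{\bar{x}}$ is the grand mean: $\bar{\bar{x}} = \sum M_i / k$ when the samples are of equal size or, in general, the sum of all values for all groups divided by the total sample size. $M_i$ = mean for each sample, $n_i$ = size of each sample, k = the number of populations.
  51. Calculating SST, by major: Accounting $M_1 = 29$, $n_1 = 6$; Marketing $M_2 = 33.5$, $n_2 = 6$; Finance $M_3 = 37$, $n_3 = 6$. Grand mean: $\bar{\bar{x}} = (29 + 33.5 + 37)/3 = 33.17$. $SST = (6)(29 - 33.17)^2 + (6)(33.5 - 33.17)^2 + (6)(37 - 33.17)^2 = 193$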
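The SST calculation above can be reproduced from the income table on slide 45. This is a sketch; the variable names are illustrative, and the grand mean is computed as the sum of all values divided by the total sample size.

```python
# Income data from the slides, in $1000s
incomes = {
    "Accounting": [27, 22, 33, 25, 38, 29],
    "Marketing": [23, 36, 27, 44, 39, 32],
    "Finance": [48, 35, 46, 36, 28, 29],
}

all_values = [x for sample in incomes.values() for x in sample]
grand_mean = sum(all_values) / len(all_values)

# SST: each sample size times the squared distance of the sample mean
# from the grand mean, summed over the groups
sst = sum(
    len(sample) * (sum(sample) / len(sample) - grand_mean) ** 2
    for sample in incomes.values()
)

print(round(grand_mean, 2))  # 33.17
print(round(sst, 1))         # 193.0
```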
  52. Calculating SST: Note that when $M_1 = M_2 = M_3$, then SST = 0, which would support the null hypothesis. In this example, the samples are of equal size, but we can also run this analysis with samples of varying size.
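  53. Calculating SSE: $SSE = \sum_{i=1}^{k} \sum_{t=1}^{n_i} (X_{it} - M_i)^2$. In other words, it is the sum of squared deviations from the sample mean within each sample, added together across samples. $SSE = \sum_t (X_{1t} - M_1)^2 + \sum_t (X_{2t} - M_2)^2 + \sum_t (X_{3t} - M_3)^2$. $SSE = [(27-29)^2 + (22-29)^2 + \cdots + (29-29)^2] + [(23-33.5)^2 + (36-33.5)^2 + \cdots] + [(48-37)^2 + (35-37)^2 + \cdots + (29-37)^2]$. SSE = 819.5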
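The same income table gives the SSE directly. In this sketch the helper name `within_ss` is illustrative.

```python
# Income data from the slides, in $1000s
incomes = {
    "Accounting": [27, 22, 33, 25, 38, 29],
    "Marketing": [23, 36, 27, 44, 39, 32],
    "Finance": [48, 35, 46, 36, 28, 29],
}

def within_ss(sample):
    """Sum of squared deviations of each value from its own sample mean."""
    sample_mean = sum(sample) / len(sample)
    return sum((x - sample_mean) ** 2 for x in sample)

per_major = {major: within_ss(sample) for major, sample in incomes.items()}
sse = sum(per_major.values())

print(per_major)  # {'Accounting': 166.0, 'Marketing': 301.5, 'Finance': 352.0}
print(sse)        # 819.5
```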
  54. Statistical Output: When you estimate this information in a computer program, it will typically be presented in a table as follows:
       Source of variation   df      Sum of squares          Mean squares        F-ratio
       Treatment             k - 1   SST                     MST = SST/(k - 1)   F = MST/MSE
       Error                 n - k   SSE                     MSE = SSE/(n - k)
       Total                 n - 1   SS(total) = SST + SSE
  55. Calculating F for our example: $F = \frac{193/2}{819.5/15} = 1.77$. Our calculated F is compared to the critical value from the F-distribution, $F_{\alpha,\, k-1,\, n-k}$, with k - 1 numerator degrees of freedom and n - k denominator degrees of freedom.
  56. The Results: At the 95% confidence level ($\alpha = .05$), our critical F is 3.68 (averaging across the table values at 14 and 16 denominator degrees of freedom). In this case, 1.77 < 3.68, so we fail to reject the null hypothesis. The dean is puzzled by these results because just by eyeballing the data, it looks like finance majors make more money.
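Putting the pieces together, this sketch recomputes F from the SST and SSE values on the slides and compares it with the critical value quoted above. The variable names are illustrative; the critical value 3.68 is taken from the slide rather than computed.

```python
sst, sse = 193.0, 819.5  # sums of squares from the slides
k, n = 3, 18             # 3 majors, 18 observations in total
critical_f = 3.68        # quoted on the slide for alpha = .05, df = (2, 15)

mst = sst / (k - 1)      # mean square for treatment
mse = sse / (n - k)      # mean square for error
f_stat = mst / mse

print(round(f_stat, 2))  # 1.77
if f_stat > critical_f:
    decision = "reject H0"
else:
    decision = "fail to reject H0"
print(decision)          # fail to reject H0
```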
  57. The Results: Many other factors may determine the salary level, such as GPA. The dean decides to collect new data, selecting one student at random from each major at each of the following average grades.
  58. New data
       Average   Accounting       Marketing        Finance          M(b)
       A+        41               45               51               M(b1) = 45.67
       A         36               38               45               M(b2) = 39.67
       B+        27               33               31               M(b3) = 30.33
       B         32               29               35               M(b4) = 32
       C+        26               31               32               M(b5) = 29.67
       C         23               25               27               M(b6) = 25
       M(t)      M(t)1 = 30.83    M(t)2 = 33.5     M(t)3 = 36.83    grand mean = 33.72
  59. Randomized Block Design: Now the data in the 3 samples are not independent; they are matched by GPA level. Just like before, matched samples are superior to unmatched samples because they provide more information. In this case, we have added a factor that may account for some of the SSE.
  60. Two-way ANOVA: Now SS(total) = SST + SSB + SSE, where SSB is the variability among blocks, and a block is a matched group of observations from each of the populations. We can calculate a two-way ANOVA to test our null hypothesis. We will talk about this next week.
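The decomposition SS(total) = SST + SSB + SSE can be checked numerically on the new data. The slides defer the calculation to a later session, so this is only a sketch under the standard randomized block formulas: SST uses the major (treatment) means, SSB uses the grade (block) means, and SSE is what remains after both are removed. Variable names are illustrative.

```python
# New data from the slides: one row per grade block, columns are
# Accounting, Marketing, Finance (incomes in $1000s)
blocks = {
    "A+": [41, 45, 51],
    "A": [36, 38, 45],
    "B+": [27, 33, 31],
    "B": [32, 29, 35],
    "C+": [26, 31, 32],
    "C": [23, 25, 27],
}

rows = list(blocks.values())
b, k = len(rows), len(rows[0])  # 6 blocks, 3 treatments
grand = sum(sum(r) for r in rows) / (b * k)
block_means = [sum(r) / k for r in rows]
treat_means = [sum(r[j] for r in rows) / b for j in range(k)]

ss_total = sum((x - grand) ** 2 for r in rows for x in r)
sst = b * sum((m - grand) ** 2 for m in treat_means)  # between majors
ssb = k * sum((m - grand) ** 2 for m in block_means)  # between grade blocks
# Residual: what is left after removing both the major and block effects
sse = sum(
    (rows[i][j] - treat_means[j] - block_means[i] + grand) ** 2
    for i in range(b)
    for j in range(k)
)

print(round(sst, 2), round(ssb, 2), round(sse, 2))
print(abs(ss_total - (sst + ssb + sse)) < 1e-9)  # the decomposition holds
```

Most of the total variability is absorbed by the blocks here, which is the sense in which the added factor "may account for some of the SSE".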