
# Bgy5901



1. DATA ANALYSIS
2. Outline
   - types of data / variables
   - measures of central tendency
   - measures of dispersion, SE
   - assumptions of parametric tests
   - tests of normality
   - homogeneity of variance
   - hypothesis testing
   - t-test
   - ANOVA
3. Definitions of Statistics and Biostatistics
   - STATISTICS: the field of study relating to the collection, classification, summarization, analysis and interpretation of numerical information.
   - BIOSTATISTICS: the application of statistics to the analysis of biological and medical data.
4. Definitions
   - Experimental unit: an object, person, or anything upon which a treatment is applied.
   - Factor: a controlled independent variable; the experimenter determines the levels of the factor.
   - Level: the different values of a factor; level implies an amount or magnitude.
   - Response variable: the dependent variable, which depends on the factor(s).
5. Population vs. Sample
   - Population
     - a complete set of units of interest
     - can be general or specific; usually determined by the research question
     - parameter: any measure that tells us something about the entire population; denoted by lower-case Greek letters, e.g. μ
   - Sample
     - a fraction of the population (we cannot collect data from everyone in the population)
     - statistic: any measure that tells us something about the sample; denoted by Latin letters, e.g. X̄
     - used to estimate the parameters of the population
     - must be representative and random
6. Samples as good estimators of the population
   - representative: the sample should reflect the composition of the population of interest
   - random: every person (or unit) in the population from which the sample is drawn has an equal probability of being chosen
7. Descriptive vs. Inferential statistics
   - Descriptive
     - the use of graphical or numerical methods to summarize and identify patterns in a data set
     - only provides information on the data being analyzed
   - Inferential
     - the use of sample data to make generalizations about a larger set of data
     - provides estimates about the population of interest based on selected parts of the population
8. What is a Variable?
   A variable is any measured characteristic or attribute that differs between subjects. For example, if the lengths of 45 leaves were measured, then length would be a variable.
9. Types of data / variables: levels of precision in measurement
   - Nominal: names are assigned to categories, but no relation between the categories can be inferred
   - Ordinal: values are ranked (put in order)
   - Interval: the distance between any two adjacent values is the same, but the zero point is arbitrary
   - Ratio: similar to the interval level, but contains an absolute zero point
10. Example of an ordinal scale

    | Matric. No. | Marks | Position |
    |-------------|-------|----------|
    | 123456      | 98    | 1        |
    | 123457      | 95    | 2        |
    | 123458      | 72    | 3        |
    | 123459      | 71    | 4        |
    | 123460      | 60    | 5        |
11. Example of an interval scale: a scale from 1 to 10 in which every interval between adjacent values has the same length.
12. Measures of central tendency
    Central tendency is the point at which the distribution of scores is centred. There are three measures of central tendency:
    1. Mode
    2. Median
    3. Mean
13. Measures of central tendency: Mode
    - the most frequent value
    - for nominal data, the mode is the only measure of central tendency
    - easy to calculate and understand
    - a data set may have several modes
    - may not always represent the data well, and can change if a new value is added
14. Measures of central tendency: Median
    - the middle value of a distribution when the values are arranged in numerical order; for an even number of values, take the average of the two middle values
    - stable: relatively unaffected by extreme values and skewed distributions
    - can be used with ordinal, interval or ratio data
    - subject to sampling fluctuations: likely to differ between samples from the same population, so it can be misleading when comparing samples and is therefore less useful than the mean
15. Measures of central tendency: Mean (average)
    - the sum of all values of a variable divided by the number of values
    - uses every value (no loss of information)
    - the most accurate summary of the data
    - resistant to sampling variation (if several samples are taken from the same population, their means are likely to be similar)
    - can be influenced by extreme values (outliers)
    - can only be used with interval or ratio (continuous) data
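As a quick illustration, all three measures can be computed with Python's standard `statistics` module (the scores below are a hypothetical sample):

```python
import statistics

scores = [2, 3, 3, 3, 5, 5, 7, 8, 9]  # hypothetical sample of scores

mode = statistics.mode(scores)      # most frequent value -> 3
median = statistics.median(scores)  # middle value of the sorted data -> 5
mean = statistics.mean(scores)      # sum of values / number of values -> 5
```

Note that appending one extreme value (say 90) would pull the mean upwards while leaving the median at 5, which is why the median is described as the more stable measure.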
16. Measures of dispersion
    Dispersion refers to the variability of values in a data set, i.e. the extent to which the values differ from one another.
17. Measures of dispersion: Range
    - the difference between the highest and the lowest value
    - easy to compute
    - easily influenced by extreme values (outliers)
    - based on only two of the observations, so it gives no idea of how the other observations are arranged between these two numbers
    - tends to increase as the size of the sample increases
18. Measures of dispersion: Interquartile Range
    - the range of the middle 50% of values
    - less susceptible to outliers
    - uses only half of the data
19. Measures of dispersion: Standard deviation
    - roughly, the average difference between each value and the mean
    - measures the variability within the data set, i.e. how well the mean represents the data
    - uses every value
    - can be influenced by extreme values
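A sketch of the three dispersion measures with Python's standard `statistics` module (the data are hypothetical; note that the exact quartile values, and hence the IQR, depend on the quantile method used):

```python
import statistics

values = [2, 4, 5, 6, 7, 8, 9, 11]  # hypothetical data, already sorted

value_range = max(values) - min(values)         # highest - lowest = 9
q1, q2, q3 = statistics.quantiles(values, n=4)  # quartiles (default method)
iqr = q3 - q1                                   # spread of the middle 50%
sd = statistics.stdev(values)                   # sample standard deviation
```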
20. Sampling Distribution (diagram: a population with mean μ = 10, from which numbered samples 1-10 are drawn)
21. Distribution of the sample means

    | Sample | Mean |
    |--------|------|
    | 1      | 8    |
    | 2      | 10   |
    | 3      | 9    |
    | 4      | 10   |
    | 5      | 10   |
    | 6      | 11   |
    | 7      | 12   |
    | 8      | 9    |
    | 9      | 11   |
22. Sampling distribution of the sample means (histogram: frequency on the y-axis against sample mean, 8-12, on the x-axis)
23. How well does the sample represent the population?
    - To know how well the mean represents the data, we calculate the standard deviation.
    - Similarly, to estimate how accurately the sample represents the population, we calculate the standard deviation of the distribution of the sample means, i.e. the Standard Error (SE).
    - SE = standard deviation of the population / √n
    - Since the SD of the population is usually not known, the SD of the sample is used instead.
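The SE formula above, with the sample SD substituted for the population SD, is a one-liner (the sample is hypothetical):

```python
import math
import statistics

sample = [10, 12, 9, 11, 8, 10, 13, 9, 11, 10]  # hypothetical measurements

sd = statistics.stdev(sample)     # sample SD, used in place of the population SD
se = sd / math.sqrt(len(sample))  # SE = SD / sqrt(n)
```

Because of the √n in the denominator, the SE shrinks as the sample size grows: larger samples estimate the population mean more precisely.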
24. Standard Deviation vs. Standard Error of the Mean (SEM)
    When to use which? The SD describes the variability of the individual values; the SEM describes the precision of the sample mean as an estimate of the population mean.
25. Assumptions of Parametric Tests
    - Independent values: the value from one subject does not influence the value of another
    - Interval data: data should be measured at least at the interval level
    - Normally distributed: bell-shaped; tests of normality should be conducted
    - Homogeneity of variance: variances should be the same throughout the data
26. Tests of normality
    - skewness and kurtosis
    - histogram and stem-and-leaf plot
    - Kolmogorov-Smirnov & Shapiro-Wilk tests
    - normal probability plot (Q-Q plot)
    - box plot
27. Skewness and Kurtosis
    - in a normal distribution, the values of skewness and kurtosis are zero
    - divide the values of skewness and kurtosis by their respective standard errors to obtain z-scores
    - look for z-scores greater than 1.96 in absolute value; if |z| > 1.96, the data are NOT normally distributed (at the 0.05 level)
28. Skewness and Kurtosis (example output)
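A sketch of the z-score check described above, using the adjusted skewness formula and standard-error approximation reported by packages such as SPSS (the data set is hypothetical):

```python
import math
import statistics

def skewness(xs):
    """Adjusted Fisher-Pearson sample skewness."""
    n = len(xs)
    m = statistics.mean(xs)
    m2 = sum((x - m) ** 2 for x in xs) / n  # second central moment
    m3 = sum((x - m) ** 3 for x in xs) / n  # third central moment
    return math.sqrt(n * (n - 1)) / (n - 2) * m3 / m2 ** 1.5

def se_skewness(n):
    """Approximate standard error of skewness."""
    return math.sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))

data = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9]        # hypothetical, right-skewed
z = skewness(data) / se_skewness(len(data))  # |z| > 1.96 suggests non-normality
```

The same divide-by-its-standard-error recipe applies to kurtosis, with its own SE formula.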
30. Kolmogorov-Smirnov & Shapiro-Wilk
    - compare the scores in the sample to a normally distributed set of scores with the same mean and standard deviation
    - if the test is non-significant (p > 0.05), the distribution is not significantly different from a normal distribution, so normality can be assumed
    - if p < 0.05, the distribution is significantly different from a normal distribution, i.e. it is NOT normally distributed
31. Kolmogorov-Smirnov & Shapiro-Wilk (example output)
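In practice these tests come from a stats package (SPSS, or `scipy.stats.shapiro` / `scipy.stats.kstest` in Python). As a standard-library sketch of the idea, the Kolmogorov-Smirnov statistic D is the largest gap between the sample's empirical CDF and a normal CDF with the sample's own mean and SD (strictly, fitting the mean and SD from the sample calls for Lilliefors' critical values rather than the plain K-S ones):

```python
import math
import statistics

def ks_distance(xs):
    """Largest gap D between the empirical CDF of xs and a normal CDF
    fitted with the sample's own mean and standard deviation."""
    n = len(xs)
    m, s = statistics.mean(xs), statistics.stdev(xs)
    d = 0.0
    for i, x in enumerate(sorted(xs)):
        cdf = 0.5 * (1 + math.erf((x - m) / (s * math.sqrt(2))))  # normal CDF
        d = max(d, abs(cdf - i / n), abs(cdf - (i + 1) / n))      # ECDF steps
    return d

# the larger D is, the further the sample is from normality
```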
33. Box plot
    - in a normal distribution, the median should be in the middle of the box
34. Outliers
    - values that are widely separated from the rest
    - possible reasons for outliers:
      - invalid measurement (device not functioning, misrecorded value)
      - misclassified measurement: belongs to a population different from the one from which the rest of the sample was drawn
      - a rare or chance event
35. Homogeneity of Variance
    - Levene's Test
      - tests whether the variances in different groups are the same
      - if significant (p < 0.05), the variances are NOT equal
      - if non-significant (p > 0.05), the variances are equal
    - Variance Ratio (VR)
      - compares two or more groups: VR = largest variance / smallest variance
      - if VR < 2, homogeneity can be assumed
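Levene's test itself is best left to a package (SPSS, or `scipy.stats.levene`), but the variance-ratio rule of thumb is easy to sketch with the standard library (both groups are hypothetical):

```python
import statistics

group_a = [5.1, 4.8, 5.5, 5.0, 4.9]  # hypothetical group 1
group_b = [6.2, 5.9, 6.8, 6.0, 6.4]  # hypothetical group 2

variances = [statistics.variance(group_a), statistics.variance(group_b)]
vr = max(variances) / min(variances)  # largest variance / smallest variance
homogeneous = vr < 2                  # rule of thumb from the slide
```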
36. Hypothesis Testing
    Components of a hypothesis test:
    - null hypothesis (H₀)
    - alternative hypothesis (Hₐ)
    - test statistic
    - reject or accept? p-value vs. significance level
37. Hypothesis Testing
    A hypothesis is a tentative explanation for an observation, phenomenon, or scientific problem that can be tested by further investigation: you have some claim about a parameter, and you want to see whether the data support the claim or not.
38. Null and alternative hypotheses
    - Null hypothesis (H₀)
      - the statement being tested in a statistical test
      - usually a statement of no effect or no difference
    - Alternative hypothesis (Hₐ)
      - the experimental hypothesis: a hypothesis to be considered as an alternative to the null hypothesis
39. Test Statistics
    - Definition: a value used to decide whether or not the null hypothesis should be rejected in hypothesis testing
    - Sources of variation: in any experiment there are two basic sources of variation
      - systematic: variation due to the experimental manipulation
      - unsystematic: variation due to random factors
40. Test Statistics
    - We calculate a test statistic to find differences between samples:
      test statistic = systematic variation / unsystematic variation
    - We then calculate the probability of obtaining a value that large.
    - Why compare the amount of variance created by an experimental effect against the amount of variance due to random factors?
41. Test Statistics
    - if the experiment has had an effect, we expect it to create more variance than random factors alone
    - the bigger the test statistic, the less likely it is to have occurred by chance
    - when this probability falls below a pre-determined value, we accept that the test statistic is as large as it is because of the experimental manipulation and not because of random factors
42. p-value and significance level
    - significance level, α
      - the probability that the test rejects the null hypothesis given that the null hypothesis is true
      - a pre-determined value
    - p-value
      - the probability of obtaining a test statistic as large as, or larger than, the one actually observed by chance alone, if H₀ is true
      - the smaller the p-value, the stronger the evidence against H₀, i.e. the less likely it is that the observed result occurred just by chance
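For a test statistic that follows a standard normal distribution under H₀, the two-sided p-value can be computed directly from the normal CDF; a minimal sketch:

```python
import math

def two_sided_p(z):
    """Two-sided p-value for a standard normal test statistic z under H0."""
    upper_tail = 1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))
    return 2 * upper_tail  # both tails, since the test is two-sided

# the larger |z| is, the smaller the p-value
```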
43. Statistical Significance
    - in statistics, a result is called significant if it is unlikely to have occurred by chance
    - "a statistically significant difference" simply means there is statistical evidence that there is a difference
    - however, it does not mean the difference is necessarily large, important or meaningful
    - it means that the observed effects are unlikely to be due to chance, so the results are likely to be repeatable
44. Type I and II errors
    Two kinds of errors can be made in significance testing:
    - Type I error: a true null hypothesis is incorrectly rejected
      - the conclusion is drawn that the null hypothesis is false when in fact it is true
      - the probability of a Type I error (α) is the significance level
    - Type II error: a false null hypothesis is not rejected
      - we fail to reject a null hypothesis that is actually false, e.g. assuming a treatment had no effect when in fact it did
      - the probability of a Type II error is β
45. Power of a statistical test
    The power of a statistical test, 1 − β, is the ability of a study to find a significant difference if indeed one exists, i.e. the probability that you will reject the null hypothesis when it is false.
46. t-test
    - used to look for a difference in means between groups of subjects from two different experimental conditions
    - experimental condition: the procedure that is varied in order to estimate a variable's effect
    - if the mean difference between the groups is large, it could mean one of two things:
      - the two groups were taken from the same population but differ simply due to chance, or
      - the two groups come from different populations (if, for example, we have manipulated one of the groups, then this is evidence that the experimental manipulation caused the large difference between the groups)
47. Independent & dependent t-tests
    - Independent t-test
      - two experimental conditions
      - different subjects are assigned to each condition, i.e. each subject is tested under only one condition
    - Dependent t-test (paired t-test)
      - two experimental conditions
      - the same subjects take part in both conditions of the experiment
48. Independent t-test (SPSS output)
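SPSS reports the t statistic, degrees of freedom and p-value; the pooled-variance (equal-variances-assumed) t statistic itself is a short formula. A standard-library sketch with hypothetical groups:

```python
import math
import statistics

def independent_t(a, b):
    """Student's t for two independent samples, assuming equal variances."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * statistics.variance(a) +
                  (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    mean_diff = statistics.mean(a) - statistics.mean(b)
    return mean_diff / math.sqrt(pooled_var * (1 / na + 1 / nb))

# compare |t| against the t distribution with na + nb - 2 degrees of freedom
```

If Levene's test indicates unequal variances, SPSS's "equal variances not assumed" row corresponds to Welch's t-test, which uses a different denominator and degrees of freedom.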
49. ANOVA
    - compares means from three or more groups
    - first tests H₀ that all group means are equal; Hₐ is that the group means differ
    - if the null hypothesis is rejected, the group means are not all equal
    - a post-hoc test is then needed to determine which means differ significantly
50. Types of ANOVA
    - One-way independent ANOVA
      - one independent variable, e.g. age group (independent) and height (dependent); different participants are used in each condition
    - Two-way independent ANOVA
      - two independent variables, e.g. age group and gender (independent) and height (dependent); different participants are used in each condition
    - One-way repeated measures ANOVA
      - one independent variable, e.g. exercise type (independent) and blood pressure (dependent); the same participants are used in all conditions
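The one-way independent ANOVA F statistic is the between-group variance divided by the within-group variance, echoing the systematic/unsystematic ratio from the test-statistics slides. A standard-library sketch (the group data are hypothetical):

```python
import statistics

def one_way_f(*groups):
    """F statistic for a one-way independent ANOVA."""
    k = len(groups)                  # number of groups
    n = sum(len(g) for g in groups)  # total number of observations
    grand_mean = statistics.mean(x for g in groups for x in g)
    ss_between = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2
                     for g in groups)            # systematic variation
    ss_within = sum((x - statistics.mean(g)) ** 2
                    for g in groups for x in g)  # unsystematic variation
    return (ss_between / (k - 1)) / (ss_within / (n - k))

f = one_way_f([1, 2, 3], [4, 5, 6], [7, 8, 9])  # well-separated group means
```

The resulting F is compared against the F distribution with (k − 1, n − k) degrees of freedom; in practice SPSS or `scipy.stats.f_oneway` reports the p-value directly.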
51. Exercises: identify the experimental unit, factor(s), response variable, level(s) and the most appropriate statistical test.
    1. 240 chickens of four different breeds were randomly assigned to three different farms. After five weeks the weight of the chickens was measured.
    2. A researcher wanted to investigate whether different types of fertilizer mixture affect the growth of plants differently. 36 seeds were randomly assigned to two different fertilizer treatments. The height of each plant was measured after 3 weeks.