1. Summary of tools used for analysis
Compiled by:
Subodh Khanal
Asst. Professor
Paklihawa Campus
Institute of Agriculture and Animal Science
2. Steps
• Check your cases (individuals) and variables (characters).
• You may have entered cases as numeric (numbers) or string (characters).
• Make sure a variable is marked as numeric if you have entered numbers.
• Check whether each variable is categorical or continuous.
• Categorical variables are measured on nominal or ordinal scales.
• Continuous variables are measured on interval or ratio scales.
3. Some examples of nominal scales (refer to the level-of-measurement slide for detail)
• Treatments: 1 (control), 2 (normal diet), 3 (improved diet)
• Gender
• Yes/no responses
• Ethnicity
• Name of country
• Infection status (e.g., complicated vs. uncomplicated)
• Color
4. Some examples of ordinal scales
• Fertility: high, medium, low
• Education level
• Likert responses (strongly agree to strongly disagree)
• Ratings of a movie (5 star to 1 star)
• Feelings
• Satisfaction
5. Some examples of interval scale
• Temperature (degrees Celsius)
• Marks (percentage)
• Time
7. When to use frequencies?
• For nominal or ordinal variables
• For categorical options
• See the valid percent column in the output
8. When to use descriptives?
• For open-ended continuous variables
• See the mean, standard deviation, standard error, skewness and kurtosis
• Calculate z scores for shape (skewness divided by its s.e., and kurtosis divided by its s.e.)
• These should be within -1.96 to +1.96 for approximate normality
• Also calculate standardized values (z scores) for individual cases
• A value above 2.5 is an outlier and above 3.29 an extreme outlier.
• Also see the histogram and the normality of the curve.
• Remember to use Explore and also see the box plot and Q-Q plot.
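The skewness/kurtosis z-score check above can also be sketched outside SPSS. The snippet below uses made-up data and assumes NumPy and SciPy are installed; the standard-error formulas are the common large-sample approximations that SPSS reports.

```python
import numpy as np
from scipy import stats

# Hypothetical sample (made-up values, roughly symmetric)
data = np.array([4.1, 5.3, 5.0, 6.2, 4.8, 5.5, 5.9, 4.4, 5.1, 5.7])
n = len(data)

skew = stats.skew(data, bias=False)       # bias-corrected sample skewness (as SPSS reports)
kurt = stats.kurtosis(data, bias=False)   # bias-corrected excess kurtosis

# Standard errors: the usual SPSS-style approximations (an assumption of this sketch)
se_skew = np.sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))
se_kurt = 2 * se_skew * np.sqrt((n**2 - 1) / ((n - 3) * (n + 5)))

z_skew = skew / se_skew
z_kurt = kurt / se_kurt

# Approximately normal if both z values fall within -1.96 to +1.96
print(abs(z_skew) < 1.96, abs(z_kurt) < 1.96)
```

For this near-symmetric sample both z values fall well inside the ±1.96 band, so the normality assumption would not be rejected at the 5% level.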
9. When to use chi square?
• Non-parametric test
• Both the dependent and independent variables are categorical.
• Put the independent variable in columns.
• Remember: the variable whose effect you are examining is the independent variable, and the one on which the effect is imposed is the dependent variable.
• Select kappa
• Select phi and Cramér's V
• Phi is for 2×2 tables; Cramér's V for all others.
• See the values of kappa, phi and Cramér's V on the next slide.
• In the output, see Pearson's chi square, Fisher's exact test and the likelihood ratio.
10. Data requirements for chi square test
• Two categorical variables.
• Two or more categories (groups) for each variable.
• Independence of observations.
  • There is no relationship between the subjects in each group.
  • The categorical variables are not "paired" in any way (e.g., pre-test/post-test observations).
• Relatively large sample size.
  • Expected frequencies for each cell are at least 1.
  • Expected frequencies should be at least 5 for the majority (80%) of the cells.
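As a sketch of the same checks outside SPSS, `scipy.stats.chi2_contingency` returns the expected frequencies needed to verify the two rules above (the 3×2 table below is made up; SciPy is assumed installed):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Made-up 3x2 contingency table: treatment group (rows, independent)
# vs. a binary outcome (columns)
table = np.array([[30, 10],
                  [20, 20],
                  [10, 30]])

chi2, p, dof, expected = chi2_contingency(table)

# Check the expected-frequency rules: all cells >= 1, and at least 80% of cells >= 5
ok = (expected >= 1).all() and (expected >= 5).mean() >= 0.80
print(round(chi2, 2), dof, ok)
```

Here every expected count is 20, so both rules are satisfied and the Pearson chi square (with (3−1)(2−1) = 2 degrees of freedom) can be reported.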
11. When we compare proportions of a categorical outcome across different independent groups, we can consider several statistical tests, such as the chi-squared test, Fisher's exact test, or the z-test.
• Fisher's exact test is applied in practice mainly to small samples, but it is actually valid for all sample sizes.
• While the chi-squared test relies on an approximation, Fisher's exact test is an exact test.
• Especially when more than 20% of cells have expected frequencies < 5, we need to use Fisher's exact test, because the approximation method is inadequate.
• In SPSS, unless you have the SPSS Exact Tests module, you can only perform a Fisher's exact test on a 2×2 table, and those results are presented by default.
• https://www.socscistatistics.com/tests/chisquare2/default2.aspx performs Fisher's exact statistics for tables up to 5×5.
• If the expected counts are at least 5, see the Pearson chi square.
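Outside SPSS, a 2×2 Fisher's exact test can be sketched with `scipy.stats.fisher_exact` (made-up small-sample counts, where the chi-square approximation would be unreliable; SciPy assumed installed):

```python
from scipy.stats import fisher_exact

# Made-up 2x2 table with small counts (several expected frequencies < 5)
table = [[8, 2],
         [1, 5]]

odds_ratio, p = fisher_exact(table, alternative='two-sided')
print(round(odds_ratio, 2), round(p, 4))  # odds ratio = (8*5)/(2*1) = 20.0
```

The exact two-sided p here is about 0.035, so the association would be judged significant at the 5% level even though the sample is far too small for the chi-square approximation.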
12. So here is how you do it
• Go to Analyze
• Descriptive Statistics
• Click Crosstabs
• Select variables for rows (dependent) and columns (independent)
• Click Exact
• Select Exact (and set the time limit per test)
• Click Continue
• Click OK
13.
• If 100% of cells have an expected count less than 5, see the Fisher exact test.
• To rely on the chi square test, at least 80% of cells must have an expected count of 5 or more (i.e., no more than 20% of cells with an expected count below 5).
• The likelihood ratio (G-test) is also an option in this case, but the Fisher exact test is more common.
14. One sample t test
• The One Sample t Test determines whether the sample mean is statistically different from a known or hypothesized population mean. The One Sample t Test is a parametric test.
• Also known as the single sample t test.
• The variable used is called the test variable.
• In a One Sample t Test, the test variable is compared against a "test value", which is a known or hypothesized value of the mean in the population.
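A one-sample t test can be sketched outside SPSS with `scipy.stats.ttest_1samp` (made-up sample, hypothesized population mean of 50; SciPy assumed installed):

```python
import numpy as np
from scipy.stats import ttest_1samp

# Made-up test variable; the "test value" is 50.0
sample = np.array([52.0, 48.5, 55.1, 49.9, 53.3, 51.2, 47.8, 54.0])

t_stat, p = ttest_1samp(sample, popmean=50.0)
print(round(t_stat, 3), round(p, 4))
```

For this sample the mean (about 51.5) is above the test value, but with n = 8 the two-sided p is above 0.05, so the difference would not be declared significant.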
15. It is commonly used to test:
• Statistical difference between a sample mean and a known or hypothesized value of the mean in the population.
• Statistical difference between the sample mean and the sample midpoint of the test variable.
• Statistical difference between the sample mean of the test variable and chance.
  • This approach involves first calculating the chance level on the test variable. The chance level is then used as the test value against which the sample mean of the test variable is compared.
• Statistical difference between a change score and zero.
  • This approach involves creating a change score from two variables, and then comparing the mean change score to zero, which will indicate whether any change occurred between the two time points for the original measures. If the mean change score is not significantly different from zero, no significant change occurred.
16. Requirements for one sample t test
• Test variable that is continuous (i.e., interval or ratio level)
• Scores on the test variable are independent (i.e., independence of observations)
  • There is no relationship between scores on the test variable
  • Violation of this assumption will yield an inaccurate p value
• Random sample of data from the population
• Normal distribution (approximately) of the sample and population on the test variable
  • Non-normal population distributions, especially those that are thick-tailed or heavily skewed, considerably reduce the power of the test
  • Among moderate or large samples, a violation of normality may still yield accurate p values
• Homogeneity of variances (i.e., variances approximately equal in both the sample and population)
• No outliers
17. Paired sample t test
• The Paired Samples t Test compares two means that are from the same individual, object, or related units. The two means can represent things like:
  • A measurement taken at two different times (e.g., pre-test and post-test with an intervention administered between the two time points)
  • A measurement taken under two different conditions (e.g., completing a test under a "control" condition and an "experimental" condition)
  • Measurements taken from two halves or sides of a subject or experimental unit (e.g., measuring hearing loss in a subject's left and right ears).
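The pre-test/post-test case above can be sketched with `scipy.stats.ttest_rel` (made-up pre/post scores for the same eight subjects; SciPy assumed installed):

```python
import numpy as np
from scipy.stats import ttest_rel

# Made-up paired measurements: same subjects before and after an intervention
pre  = np.array([60, 62, 58, 65, 59, 61, 63, 60])
post = np.array([65, 66, 60, 70, 64, 63, 68, 66])

t_stat, p = ttest_rel(post, pre)  # equivalent to testing mean(post - pre) against 0
print(round(t_stat, 3), round(p, 5))
```

Because the difference scores here are consistently positive (mean difference 4.25), the test yields a large t and a p value well below 0.001.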
18. Also known as
• Dependent t Test
• Paired t Test
• Repeated Measures t Test
The variable used in this test is known as:
• Dependent variable, or test variable (continuous), measured at two different times or for two related conditions or units
19. Used for observing
• Statistical difference between two time points
• Statistical difference between two conditions
• Statistical difference between two measurements
• Statistical difference between a matched pair
Note: The Paired Samples t Test can only compare the means for two (and only two) related (paired) units on a continuous outcome that is normally distributed. The Paired Samples t Test is not appropriate for analyses involving the following: 1) unpaired data; 2) comparisons between more than two units/groups; 3) a continuous outcome that is not normally distributed; and 4) an ordinal/ranked outcome.
20. Moreover,
• To compare unpaired means between two groups on a continuous outcome that is normally distributed, choose the Independent Samples t Test.
• To compare unpaired means between more than two groups on a continuous outcome that is normally distributed, choose ANOVA.
• To compare paired means for continuous data that are not normally distributed, choose the nonparametric Wilcoxon Signed-Ranks Test.
• To compare paired means for ranked data, choose the nonparametric Wilcoxon Signed-Ranks Test.
21. Requirements for paired sample t test
• Dependent variable that is continuous (i.e., interval or ratio level)
  • Note: The paired measurements must be recorded in two separate variables.
• Related samples/groups (i.e., dependent observations)
  • The subjects in each sample, or group, are the same. This means that the subjects in the first group are also in the second group.
• Random sample of data from the population
• Normal distribution (approximately) of the difference between the paired values
• No outliers in the difference between the two related groups
• Note: When testing assumptions related to normality and outliers, you must use a variable that represents the difference between the paired values - not the original variables themselves.
• Note: When one or more of the assumptions for the Paired Samples t Test are not met, you may want to run the nonparametric Wilcoxon Signed-Ranks Test instead.
22. Independent sample t test
• The Independent Samples t Test compares the means of two independent groups in order to determine whether there is statistical evidence that the associated population means are significantly different. The Independent Samples t Test is a parametric test.
• This test is also known as:
  • Independent t Test
  • Independent Measures t Test
  • Independent Two-sample t Test
  • Student t Test
  • Two-Sample t Test
  • Uncorrelated Scores t Test
  • Unpaired t Test
  • Unrelated t Test
• The variables used in this test are known as:
  • Dependent variable, or test variable
  • Independent variable, or grouping variable
23. Used for testing the following
• Statistical differences between the means of two groups
• Statistical differences between the means of two interventions
• Statistical differences between the means of two change scores
Note: The Independent Samples t Test can only compare the means for two (and only two) groups. It cannot make comparisons among more than two groups. If you wish to compare the means across more than two groups, you will likely want to run an ANOVA.
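An independent-samples t test, including the Welch variant for unequal variances, can be sketched with `scipy.stats.ttest_ind` (two made-up groups; SciPy assumed installed):

```python
import numpy as np
from scipy.stats import ttest_ind

# Made-up measurements for two independent groups
group_a = np.array([23.1, 25.4, 22.8, 26.0, 24.5, 23.9])
group_b = np.array([27.2, 28.9, 26.5, 29.4, 27.8, 28.1])

t_pooled, p_pooled = ttest_ind(group_a, group_b)                   # assumes equal variances
t_welch,  p_welch  = ttest_ind(group_a, group_b, equal_var=False)  # Welch t test
print(round(p_pooled, 5), round(p_welch, 5))
```

With group means about 3.7 apart and small within-group spread, both versions return p well below 0.01; in practice the Welch line is the one to report when the homogeneity-of-variances assumption fails.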
24. Requirements
• Dependent variable that is continuous (i.e., interval or ratio level)
• Independent variable that is categorical (i.e., two groups)
• Cases that have values on both the dependent and independent variables
• Independent samples/groups (i.e., independence of observations)
  • There is no relationship between the subjects in each sample. This means that:
    • Subjects in the first group cannot also be in the second group
    • No subject in either group can influence subjects in the other group
    • No group can influence the other group
  • Violation of this assumption will yield an inaccurate p value
• Random sample of data from the population
• Normal distribution (approximately) of the dependent variable for each group
  • Non-normal population distributions, especially those that are thick-tailed or heavily skewed, considerably reduce the power of the test
  • Among moderate or large samples, a violation of normality may still yield accurate p values
• Homogeneity of variances (i.e., variances approximately equal across groups)
  • When this assumption is violated and the sample sizes for each group differ, the p value is not trustworthy. However, the Independent Samples t Test output also includes an approximate t statistic that is not based on assuming equal population variances. This alternative statistic, called the Welch t Test statistic, may be used when equal variances among populations cannot be assumed. The Welch t Test is also known as an Unequal Variance t Test or Separate Variances t Test.
• No outliers
• Note: When one or more of the assumptions for the Independent Samples t Test are not met, you may want to run the nonparametric Mann-Whitney U Test instead.
25. One way ANOVA
• One-Way ANOVA ("analysis of variance") compares the means of two or more independent groups in order to determine whether there is statistical evidence that the associated population means are significantly different. One-Way ANOVA is a parametric test.
• This test is also known as:
  • One-Factor ANOVA
  • One-Way Analysis of Variance
  • Between Subjects ANOVA
• The variables used in this test are known as:
  • Dependent variable
  • Independent variable (also known as the grouping variable, or factor)
    • This variable divides cases into two or more mutually exclusive levels, or groups
26. Used for
• Field studies
• Experiments
• Quasi-experiments
• The One-Way ANOVA is commonly used to test the following:
  • Statistical differences among the means of two or more groups
  • Statistical differences among the means of two or more interventions
  • Statistical differences among the means of two or more change scores
• Note: Both the One-Way ANOVA and the Independent Samples t Test can compare the means for two groups. However, only the One-Way ANOVA can compare the means across three or more groups.
• Note: If the grouping variable has only two groups, then the results of a one-way ANOVA and the independent samples t test will be equivalent. In fact, if you run both an independent samples t test and a one-way ANOVA in this situation, you should be able to confirm that t² = F.
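A one-way ANOVA across three (made-up) groups can be sketched with `scipy.stats.f_oneway`, and the two-group case verifies the t² = F note above (SciPy assumed installed):

```python
import numpy as np
from scipy.stats import f_oneway, ttest_ind

# Three made-up groups (e.g., three treatment levels)
g1 = np.array([5.1, 4.8, 5.5, 5.0, 4.9])
g2 = np.array([6.2, 5.9, 6.5, 6.0, 6.3])
g3 = np.array([7.0, 7.4, 6.8, 7.2, 7.1])

F, p = f_oneway(g1, g2, g3)
print(round(F, 2), round(p, 8))

# Two-group case: the ANOVA F equals the squared pooled-variance t
F2, _ = f_oneway(g1, g2)
t, _ = ttest_ind(g1, g2)
print(np.isclose(F2, t**2))  # True
```

The three group means are clearly separated, so the overall F test is highly significant; the final line confirms the t² = F identity numerically.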
27. Requirements
• Dependent variable that is continuous (i.e., interval or ratio level)
• Independent variable that is categorical (i.e., two or more groups)
• Cases that have values on both the dependent and independent variables
• Independent samples/groups (i.e., independence of observations)
  • There is no relationship between the subjects in each sample. This means that:
    • subjects in the first group cannot also be in the second group
    • no subject in either group can influence subjects in the other group
    • no group can influence the other group
• Random sample of data from the population
• Normal distribution (approximately) of the dependent variable for each group (i.e., for each level of the factor)
  • Non-normal population distributions, especially those that are thick-tailed or heavily skewed, considerably reduce the power of the test
  • Among moderate or large samples, a violation of normality may yield fairly accurate p values
28. Continued ...
• Homogeneity of variances (i.e., variances approximately equal across groups)
  • When this assumption is violated and the sample sizes differ among groups, the p value for the overall F test is not trustworthy. These conditions warrant using alternative statistics that do not assume equal variances among populations, such as the Brown-Forsythe or Welch statistics (available via Options in the One-Way ANOVA dialog box).
  • When this assumption is violated, regardless of whether the group sample sizes are fairly equal, the results may not be trustworthy for post hoc tests. When variances are unequal, post hoc tests that do not assume equal variances should be used.
• No outliers
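The homogeneity-of-variances assumption itself can be checked with Levene's test, sketched below on made-up groups with very different spreads (SciPy assumed installed; `center='median'` gives the robust Brown-Forsythe variant of the test):

```python
import numpy as np
from scipy.stats import levene

g1 = np.array([5.1, 4.8, 5.5, 5.0, 4.9])   # tightly clustered
g2 = np.array([6.2, 3.0, 9.5, 1.0, 8.3])   # much more spread out

stat, p = levene(g1, g2, center='median')  # Brown-Forsythe variant of Levene's test
print(round(stat, 3), round(p, 4))
```

A small p value here (below 0.05 for these data) signals unequal variances, so the Welch or Brown-Forsythe statistics and unequal-variance post hoc tests would be the safer choice.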
29. Correlation
• Pearson Correlation produces a sample correlation coefficient, r, which measures the strength and direction of linear relationships between pairs of continuous variables. By extension, the Pearson Correlation evaluates whether there is statistical evidence for a linear relationship among the same pairs of variables in the population, represented by a population correlation coefficient, ρ ("rho"). The Pearson Correlation is a parametric measure.
• This measure is also known as:
  • Pearson's correlation
  • Pearson product-moment correlation (PPMC)
30. Used for
• Correlations among pairs of variables
• Correlations within and between sets of variables
• The bivariate Pearson correlation indicates the following:
  • Whether a statistically significant linear relationship exists between two continuous variables
  • The strength of a linear relationship (i.e., how close the relationship is to being a perfectly straight line)
  • The direction of a linear relationship (increasing or decreasing)
• Note: The bivariate Pearson Correlation cannot address non-linear relationships or relationships among categorical variables. If you wish to understand relationships that involve categorical variables and/or non-linear relationships, you will need to choose another measure of association.
• Note: The bivariate Pearson Correlation only reveals associations among continuous variables. The bivariate Pearson Correlation does not provide any inferences about causation, no matter how large the correlation coefficient is.
31. Requirements
• Two or more continuous variables (i.e., interval or ratio level)
• Cases that have values on both variables
• Linear relationship between the variables
• Independent cases (i.e., independence of observations)
  • There is no relationship between the values of variables between cases. This means that:
    • the values for all variables across cases are unrelated
    • for any case, the value for any variable cannot influence the value of any variable for other cases
    • no case can influence another case on any variable
  • The bivariate Pearson correlation coefficient and corresponding significance test are not robust when independence is violated.
• Bivariate normality
  • Each pair of variables is bivariately normally distributed
  • Each pair of variables is bivariately normally distributed at all levels of the other variable(s)
  • This assumption ensures that the variables are linearly related; violations of this assumption may indicate that non-linear relationships among variables exist. Linearity can be assessed visually using a scatterplot of the data.
• Random sample of data from the population
• No outliers
32. Linear regression analysis
• Linear regression is the next step up after correlation.
• It is used when we want to predict the value of a variable based on the value of another variable.
• The variable we want to predict is called the dependent variable (or sometimes, the outcome variable).
• The variable we are using to predict the other variable's value is called the independent variable (or sometimes, the predictor variable).
• For example, you could use linear regression to understand whether yield performance can be predicted based on the dose and practices of manure application, or whether cigarette consumption can be predicted based on smoking duration; and so forth.
• If you have two or more independent variables, rather than just one, you need to use multiple regression.
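The yield-from-manure-dose example above can be sketched as a simple linear regression with `scipy.stats.linregress` (all values are made up for illustration; SciPy assumed installed):

```python
import numpy as np
from scipy.stats import linregress

# Made-up data: manure dose (independent) vs. yield (dependent)
dose   = np.array([0.0, 10.0, 20.0, 30.0, 40.0, 50.0])
yield_ = np.array([2.0, 2.9, 4.1, 4.8, 6.2, 6.9])

res = linregress(dose, yield_)
print(round(res.slope, 3), round(res.intercept, 3), round(res.rvalue, 3))

# Predict the yield at a new dose using the fitted line
predicted = res.intercept + res.slope * 25
print(round(predicted, 2))
```

The fitted slope (about 0.1 yield units per dose unit) and the near-1 r value show a strong linear fit, and the fitted equation is then used directly for prediction.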
34. Used for
• Model multiple independent variables
• Include continuous and categorical variables
• Use polynomial terms to model curvature
• Assess interaction terms to determine whether the effect of one independent variable depends on the value of another variable
35. Requirements
• Your two variables should be measured at the continuous level (interval or ratio scales).
• There needs to be a linear relationship between the two variables (check through scatter plots).
  • If the relationship displayed in your scatterplot is not linear, you will have to either run a non-linear regression analysis, perform a polynomial regression, or "transform" your data, which you can do using SPSS Statistics.
• There should be no significant outliers.
• You should have independence of observations, which you can easily check using the Durbin-Watson statistic, a simple test to run in SPSS Statistics.
• Your data need to show homoscedasticity, which is where the variances along the line of best fit remain similar as you move along the line.
• Finally, you need to check that the residuals (errors) of the regression line are approximately normally distributed (we explain these terms in our enhanced linear regression guide). Two common methods to check this assumption include using either a histogram (with a superimposed normal curve) or a Normal P-P Plot.
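The Durbin-Watson check mentioned above can be sketched in pure NumPy from the regression residuals (made-up data; the statistic ranges roughly from 0 to 4, with values near 2 suggesting independent errors):

```python
import numpy as np

# Made-up x/y data for a simple linear fit
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([1.2, 2.1, 2.8, 4.2, 4.9, 6.1, 6.8, 8.2])

slope, intercept = np.polyfit(x, y, 1)          # least-squares line
residuals = y - (intercept + slope * x)

# Durbin-Watson: sum of squared successive differences over sum of squares
dw = np.sum(np.diff(residuals) ** 2) / np.sum(residuals ** 2)
print(round(dw, 2))
```

A dw value far below 2 would flag positive autocorrelation (observations not independent); SPSS reports the same statistic in its regression output.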
37. Binary logistic regression
• Binary logistic regression models the relationship between a set of predictors and a binary response variable. A binary response has only two possible values, such as win and lose.
• Use a binary logistic regression model to understand how changes in the predictor values are associated with changes in the probability of an event occurring.
• Where the dependent variable is dichotomous or binary in nature, we cannot use simple linear regression. Logistic regression is the statistical technique used to predict the relationship between predictors (our independent variables) and a predicted variable (the dependent variable) where the dependent variable is binary (e.g., sex [male vs. female], response [yes vs. no], score [high vs. low], etc.).
• There must be one or more independent variables, or predictors, for a logistic regression. The IVs, or predictors, can be continuous (interval/ratio) or categorical (ordinal/nominal).
• All predictor variables are tested in one block to assess their predictive ability while controlling for the effects of other predictors in the model.
38. Uses
• Logistic regression is a powerful statistical way of modeling a binomial outcome (which takes the value 0 or 1, like having or not having a disease) with one or more explanatory variables.
• Logistic regression provides a quantified value for the strength of the association, adjusting for other variables (it removes confounding effects).
• The exponentials of the coefficients correspond to odds ratios for the given factor.
40. Requirements
• The dependent variable must be categorical and dichotomous.
• The error terms need to be independent.
• Linearity between the continuous predictors (independent variables) and the log odds of the outcome.
41. Interpreting the odds ratio
• If the odds ratio is > 1, subtract 1 from it: e.g., if the odds ratio is 4.5, the odds are 4.5 times the odds for the other option, i.e., 4.5 - 1 = 3.5 times (a 350% increase) higher.
• If the odds ratio is < 1, subtract it from 1: e.g., if the odds ratio is 0.07, then 1 - 0.07 = 0.93, i.e., a 93% decrease in the odds.
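The odds-ratio arithmetic above can be sketched as a tiny helper (the function name is illustrative, not an SPSS or library function):

```python
def odds_change(odds_ratio):
    """Percentage change in the odds implied by an odds ratio.

    Positive values are increases, negative values are decreases,
    relative to the reference (other) option.
    """
    return (odds_ratio - 1) * 100

print(odds_change(4.5))   # 350.0 -> odds are 350% higher (4.5 times the reference odds)
print(odds_change(0.07))  # about -93 -> odds are 93% lower
```

Note that the same formula covers both cases: an odds ratio above 1 gives a positive percentage change, and one below 1 gives a negative (decreased) change.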