Data Analysis
Quantitative Analysis
Analyzing Quantitative Data (for brief review only)
 Parametric Statistics:
 -appropriate for interval/ratio data
 -generalizable to a population
 -assumes normal distributions
 Non-Parametric Statistics:
 -used with nominal/ordinal data
 -less readily generalizable to a population
 -does not assume normal distributions
Tables and Graphs
 Frequency tables with percentages give a
numerical description of the cases on a
variable.
 Bar and pie graphs are used to display
nominal or ordinal data.
 Histograms and line graphs (frequency
polygons) can display interval/ratio level data.
 Bivariate relationships can be displayed
using contingency tables (nominal or ordinal data).
 Relationships at the interval/ratio level are
displayed using a scatterplot.
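A minimal sketch of these displays, written in Python with matplotlib (an illustrative choice; the slides assume no particular package, and all data below are hypothetical):

import matplotlib.pyplot as plt

# Hypothetical data
categories = ["Single", "Married", "Divorced"]        # a nominal variable
counts = [40, 45, 15]                                 # frequencies
ages = [22, 25, 31, 31, 34, 40, 41, 45, 52, 60]       # interval/ratio
incomes = [28, 30, 39, 37, 42, 50, 49, 55, 61, 70]    # interval/ratio

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
axes[0].bar(categories, counts)     # bar graph: nominal/ordinal data
axes[0].set_title("Marital status")
axes[1].hist(ages, bins=5)          # histogram: interval/ratio data
axes[1].set_title("Age")
axes[2].scatter(ages, incomes)      # scatterplot: bivariate interval/ratio
axes[2].set_title("Age vs. income")
plt.tight_layout()
plt.show()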
Basic Descriptive Statistics
 Use summary measures such as mean
(interval), median (ordinal), or mode (nominal)
to describe central tendency of a distribution
 For dispersion (variability) use standard
deviation, variance, and range to tell you how
spread out the data are about the mean.
 Can use z-scores to compare scores across
two distributions
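For illustration, a minimal Python sketch of these summary measures (hypothetical scores; any statistics package gives the same numbers):

import numpy as np
from collections import Counter

scores = np.array([55, 60, 62, 67, 70, 70, 74, 80, 85, 92])  # hypothetical

print("mean:", scores.mean())                      # central tendency (interval)
print("median:", np.median(scores))                # central tendency (ordinal)
print("mode:", Counter(scores.tolist()).most_common(1)[0][0])  # (nominal)
print("range:", scores.max() - scores.min())       # dispersion
print("variance:", scores.var(ddof=1))             # sample variance
print("std dev:", scores.std(ddof=1))              # sample standard deviation

# z-score: how many standard deviations a score lies from its mean;
# this lets you compare scores drawn from two different distributions
z = (78 - scores.mean()) / scores.std(ddof=1)
print("z-score of 78:", z)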
Contingency (Cross-Tabs) Analysis
and Related Statistics
 A non-parametric technique (no assumption
of normal distributions)
 Assumptions
 Nominal or ordinal (categorical) data
 Any type of distribution
 The hypothesis test: the null hypothesis is
that the two (or more) samples come from the
same distribution.
Contingency (cont.)
Conducting the Analysis:
 a. calculate percentages within the categories of
the IV and compare across the categories of the DV.
Are there differences in the outcomes?
 b. for nominal
 Chi-square statistic: is the relationship (the above
differences) real?
 Phi, Cramer's V, etc.: how strong is the relationship?
 c. for ordinal
 t-test for gamma: is the relationship (the above differences)
real?
 Gamma: how strong and what direction?
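A minimal sketch of steps a and b in Python with scipy (illustrative only; the counts in the table are hypothetical):

import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical cross-tab: rows = categories of the IV,
# columns = categories of the DV
table = np.array([[30, 20],
                  [15, 35]])

# Step a: percentages within IV categories, compared on the DV
print(table / table.sum(axis=1, keepdims=True) * 100)

# Step b: chi-square -- is the relationship "real"?
chi2, p, dof, expected = chi2_contingency(table)
print("chi-square:", chi2, "p-value:", p)

# Cramer's V -- how strong is the relationship? (0 = none, 1 = perfect)
n = table.sum()
v = np.sqrt(chi2 / (n * (min(table.shape) - 1)))
print("Cramer's V:", v)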
T-Tests (parametric) for Means and
Proportions
 The t-test is used to determine whether
sample means differ. Essentially, the t-test is
the ratio of the sample mean difference to the
standard error of that difference (see the sketch
after this list). The t-test makes some important
assumptions:
 Interval/Ratio level data
 one or two levels of one or two variables
 normal distributions
 (approximately) equal variances.
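A minimal sketch of that ratio, assuming Python with numpy/scipy and two small hypothetical samples; the hand computation (pooled variance) should match scipy's result:

import numpy as np
from scipy import stats

a = np.array([5.1, 4.8, 6.2, 5.5, 5.9])   # hypothetical sample 1
b = np.array([4.2, 4.9, 4.4, 5.0, 4.6])   # hypothetical sample 2

# t = (difference between sample means) / (standard error of that
# difference), here with a pooled variance, i.e. assuming
# (approximately) equal variances
n1, n2 = len(a), len(b)
sp2 = ((n1 - 1) * a.var(ddof=1) + (n2 - 1) * b.var(ddof=1)) / (n1 + n2 - 2)
se = np.sqrt(sp2 * (1 / n1 + 1 / n2))
t_manual = (a.mean() - b.mean()) / se

t_scipy, p = stats.ttest_ind(a, b)   # equal_var=True is the default
print(t_manual, t_scipy, p)          # the two t values agree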
T-tests (cont.)
 a. The one sample t-test:
 tests a sample mean against a known population
mean
 b. The independent samples t-test:
 tests whether the mean of one sample is different
from the mean of another sample.
 c. The paired group t-test (dependent or
related samples):
 tests whether the same cases (or matched pairs)
measured twice differ on the dependent variable.
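The three variants, sketched with scipy.stats (all data hypothetical):

import numpy as np
from scipy import stats

sample  = np.array([101, 98, 105, 110, 97, 103])
group_a = np.array([12.1, 13.4, 11.8, 12.9, 13.0])
group_b = np.array([11.2, 12.0, 11.5, 11.9, 12.2])
before  = np.array([140, 150, 145, 160, 155])
after   = np.array([135, 148, 140, 152, 150])

# a. one-sample: sample mean vs. a known population mean (100 here)
print(stats.ttest_1samp(sample, popmean=100))
# b. independent samples: two separate groups
print(stats.ttest_ind(group_a, group_b))
# c. paired (dependent/related) samples: same cases measured twice
print(stats.ttest_rel(before, after))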
ANOVA (parametric)
 Analysis of Variance, or ANOVA, tests for
differences in the means among 3 or more
samples.
 One-way ANOVA Assumptions:
 One independent variable -- categorical with
two+ levels
 Dependent variable -- interval or ratio
 ANOVA tests the ratio (F) of the mean
squares between groups to the mean squares
within groups. Evaluated against its degrees of
freedom, the F statistic shows whether the
means differ among the groups.
ANOVA (cont.)
 One-way ANOVA will provide you with an F-ratio
and its corresponding p-value.
 If there is a large enough difference between the
between groups mean squares and the within
groups mean squares, then the null hypothesis will
be rejected, indicating that there is a difference in
the mean scores among the groups.
 However, the F-ratio does not tell you where those
differences are, only that at least one group mean
differs significantly from the others; a post-hoc test
is needed to locate the specific differences.
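A minimal sketch in Python with scipy (hypothetical group scores; scipy.stats.tukey_hsd requires a recent SciPy version):

from scipy import stats

g1 = [23, 25, 27, 22, 26]   # hypothetical scores, group 1
g2 = [30, 31, 29, 33, 28]   # group 2
g3 = [24, 26, 25, 27, 23]   # group 3

# F = mean square between groups / mean square within groups
f, p = stats.f_oneway(g1, g2, g3)
print("F:", f, "p:", p)

# A significant F says only that at least one mean differs;
# a post-hoc test such as Tukey's HSD locates the difference(s)
print(stats.tukey_hsd(g1, g2, g3))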
Correlation (parametric)
 Used to test the presence, strength and direction
of a linear relationship among variables.
 Correlation is a numerical expression that signifies
the relationship between two variables. Correlation
allows you to explore this relationship by
'measuring the association' between the variables.
 Correlation is a 'measure of association' because
the correlation coefficient gives the degree of
the relationship between the variables. Correlation
does not imply causation! Typically, you need at
least interval- or ratio-level data. However, you can run
a correlation with ordinal-level data with 5 or more
categories.
Correlation (cont.)
 The Correlation Coefficient: Pearson's r, the
correlation coefficient, is the numeric value of the
relationship between variables. The correlation
coefficient is a standardized measure that can vary
between -1 and +1. If no relationship exists, then the
correlation coefficient equals 0. Pearson's r
provides (1) an estimate of the strength of the
relationship and (2) an estimate of its direction.
 If the correlation coefficient lies between -1 and 0,
the relationship is negative (inverse); between 0 and
+1, it is positive; and if it is 0, there is no
relationship. The closer the coefficient lies to -1 or
+1, the stronger the relationship.
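A minimal Pearson's r sketch in Python with scipy (hypothetical data; the r-squared line anticipates the coefficient of determination on the next slide):

import numpy as np
from scipy.stats import pearsonr

hours = np.array([1, 2, 3, 4, 5, 6, 7, 8])           # hypothetical study hours
grade = np.array([52, 55, 61, 64, 70, 74, 79, 83])   # hypothetical grades

r, p = pearsonr(hours, grade)
print("r:", r, "p:", p)       # strength/direction, and whether it is "real"
print("r squared:", r ** 2)   # coefficient of determination (next slide)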
Correlation (cont.)
 Coefficient of determination: gives the proportion
of the variance in one variable (y) accounted for by
the other (x). To calculate the coefficient of
determination, you square the r value. In other
words, if you had an r of .90, your coefficient of
determination would be .81, meaning the variables
share 81 percent of their variance.
Regression
 Regression is used to model, calculate, and predict the
pattern of a linear relationship among two or more
variables.
 There are two types of regression -- simple & multiple
 a. Assumptions
 Note: Variables should be approximately normally
distributed. If not, recode and use non-parametric
measures.
 Dependent Variable: at least interval (can use ordinal if using
a summated scale)
 Independent Variable: should be interval. Independent
variables should not be highly correlated with one
another (no multicollinearity). You can use a nominal
variable if it is recoded as a binary 'dummy' variable
(0,1), as in the sketch below.
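A minimal sketch of two of these checks in Python with scipy (hypothetical data): a Shapiro-Wilk normality test and 0/1 dummy coding of a binary nominal variable:

import numpy as np
from scipy.stats import shapiro

income = np.array([28, 30, 39, 37, 42, 50, 49, 55, 61, 70])  # hypothetical DV

# Normality check: a small Shapiro-Wilk p-value suggests the variable
# departs from normality (then recode / use non-parametric measures)
stat, p = shapiro(income)
print("Shapiro-Wilk p:", p)

# Dummy coding: a binary nominal IV recoded as 0/1 for regression
sex = np.array(["m", "f", "f", "m", "f", "m", "m", "f", "f", "m"])
female = (sex == "f").astype(int)
print(female)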
Regression (cont.)
 b. Tests
 Overall: the null hypothesis is that the regression
(estimated) line predicts the dependent variable no
better than the mean line
 Coefficients (slope "b", etc.): the null is that the
estimated coefficient equals 0
 c. Statistics
 Overall: R-squared, F-test
 Coefficient: t tests
 d. Limitations
 Only addresses linear patterns
 Variables should be normally distributed
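A minimal sketch of these tests and statistics for a simple regression, assuming Python with statsmodels (hypothetical data):

import numpy as np
import statsmodels.api as sm

x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([52, 55, 61, 64, 70, 74, 79, 83], dtype=float)

X = sm.add_constant(x)        # add the intercept term
model = sm.OLS(y, X).fit()    # simple linear regression

print(model.rsquared)                 # overall fit: R-squared
print(model.fvalue, model.f_pvalue)   # overall F-test vs. the mean line
print(model.params)                   # intercept and slope "b"
print(model.tvalues, model.pvalues)   # coefficient t-tests: does b = 0?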
Using Computer Software to
Analyze Quantitative Data
 Special statistical software is available to
analyze large quantities of data and to do
more complex analyses
 The most common software packages used
in sociology are SPSS and SAS.
 SPSS is available in both the King’s and
Brescia computer labs, as well as in various
computer labs on main campus.
