2. •Empirical research usually uses some type of
statistical analysis
•Mathematics – Language for accomplishing
logical operations inherent in good data analysis
•Statistics – Branch of math appropriate to
research
•Descriptive statistics – Method for describing data
in manageable forms
•Inferential statistics – Assist in forming conclusions
from our observations
•About a population, based on studying the sample
3. •Univariate Analysis: Only one variable at a time
•Bivariate Analysis: Two variables
•Multivariate Analysis: Three or more variables
•Distributions – Reporting all individual cases
•Marginals – Frequency distributions of grouped data
(age of students)
•Frequency Distribution: (2, 7, 11, 14, 16)
Class     Frequency
1-9       2
10-19     3
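A minimal sketch of how the grouped table above could be produced from the five individual cases. The function name `frequency_distribution` and the explicit class boundaries are illustrative assumptions, not something the slides specify.

```python
# The five individual cases from the slide.
ages = [2, 7, 11, 14, 16]

# Class boundaries copied from the slide's table (inclusive on both ends).
classes = [(1, 9), (10, 19)]

def frequency_distribution(values, classes):
    """Count how many values fall in each inclusive (low, high) class."""
    return {
        f"{low}-{high}": sum(low <= v <= high for v in values)
        for low, high in classes
    }

marginals = frequency_distribution(ages, classes)
for label, count in marginals.items():
    print(label, count)
```

Listing the five ages is the distribution; the two counts per class are the marginals.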
4. •“Summary Averages”
•Mode: Most frequent attribute
•Mean: Sum of all values divided by the total
number of values
•Median: Middle attribute of ranked data
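As a quick illustration, the three averages can be computed with Python's standard `statistics` module. The seven values used here are the ones worked through in the Editor's Notes (10, 18, 18, 24, 30, 36, 70); the variable names are only for illustration.

```python
import statistics

# Seven cases taken from the worked example in the Editor's Notes.
values = [10, 18, 18, 24, 30, 36, 70]

mode = statistics.mode(values)      # most frequent attribute
mean = statistics.mean(values)      # sum of values / number of values
median = statistics.median(values)  # middle value of the ranked data

print(mode, round(mean, 1), median)
```

Note how the single extreme case (70) pulls the mean above the median, which is why the choice of average matters for skewed data.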
5. •Range: Distance separating the highest value
from the lowest value
•Standard Deviation: The typical amount of
variation about the mean; the square root of the variance
•Variance: Sum of squared deviations
from the mean divided by the total number of cases
•Percentile: What percentage of cases fall at or
below some value; can be grouped into quartiles
•Rates: Used to standardize some measure for
comparative purposes
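The dispersion measures above can be sketched in a few lines, again using the seven values from the Editor's Notes. Dividing by the total number of cases follows the definition on the slide (a sample-based estimate would divide by n - 1 instead); `percentile_rank` is a hypothetical helper name.

```python
import math

values = [10, 18, 18, 24, 30, 36, 70]
n = len(values)
mean = sum(values) / n

# Range: distance between the highest and lowest value.
value_range = max(values) - min(values)

# Variance: sum of squared deviations from the mean, divided by n.
variance = sum((v - mean) ** 2 for v in values) / n

# Standard deviation: square root of the variance.
std_dev = math.sqrt(variance)

def percentile_rank(values, x):
    """Percentage of cases that fall at or below x."""
    return 100 * sum(v <= x for v in values) / len(values)

print(value_range, round(variance, 2), round(std_dev, 2))
print(round(percentile_rank(values, 24), 1))
```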
6. •We are interested in how variables are related
(explanation)
•Contingency table: Used to compare subgroups;
percentage down the columns, then read across the rows
•Values of the dependent variable are contingent
on values of the independent variable
Own a Gun?    Men        Women
Yes           49%        24%
No            51         76
100% =        (1,270)    (633)
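A rough sketch of "percentaging down" follows. The slide reports only percentages and column bases, so the cell counts below are an assumption: they were chosen to be consistent with the published figures and are not the original data.

```python
# ASSUMED cell counts, chosen only so that they reproduce the slide's
# percentages and column bases (1,270 men, 633 women).
counts = {
    "Men":   {"Yes": 622, "No": 648},
    "Women": {"Yes": 152, "No": 481},
}

def percentage_down(counts):
    """Convert each column of counts to percentages of that column's total."""
    table = {}
    for group, cells in counts.items():
        total = sum(cells.values())
        table[group] = {k: round(100 * v / total) for k, v in cells.items()}
    return table

pct = percentage_down(counts)

# Read across the "Yes" row to compare the subgroups.
difference = pct["Men"]["Yes"] - pct["Women"]["Yes"]
print(pct, difference)
```

Percentages are computed within each category of the independent variable (gender), and the comparison is made across categories on the dependent variable (gun ownership).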
7. •Instead of explaining the dependent variable
on the basis of a single independent variable,
we seek an explanation through the use of more
than one independent variable
Assault Rates, Poverty, & Mobility in 60 Boston Neighborhoods
                      Mobility
Poverty    Low          High         Total
Low        12.2 (22)    19.5 (21)    15.8 (43)
High       43.8 (4)     25.0 (13)    29.4 (17)
8. •Indicates strength of relationship (0 to 1)
•Based on Proportionate Reduction of Error
(PRE):
•How much variation in y can be predicted by x;
how much you can reduce your error in
predicting y by knowing x
•The greater the relationship between two
variables, the greater the reduction of error
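One way to make the PRE logic concrete is to count guessing errors with and without knowledge of x. The sketch below uses a hypothetical cross-tabulation (the counts are invented for illustration) and the modal-category guessing rule on which lambda is based.

```python
def pre(errors_without_x, errors_with_x):
    """Proportionate reduction of error: (E1 - E2) / E1."""
    return (errors_without_x - errors_with_x) / errors_without_x

# HYPOTHETICAL counts: outer keys are categories of x, inner keys of y.
table = {
    "x=a": {"y=1": 40, "y=2": 10},
    "x=b": {"y=1": 15, "y=2": 35},
}
n = sum(sum(row.values()) for row in table.values())

# E1: errors made when guessing the overall modal category of y for everyone.
y_totals = {}
for row in table.values():
    for y, count in row.items():
        y_totals[y] = y_totals.get(y, 0) + count
e1 = n - max(y_totals.values())

# E2: errors made when guessing the modal category of y within each x.
e2 = sum(sum(row.values()) - max(row.values()) for row in table.values())

reduction = pre(e1, e2)
print(e1, e2, round(reduction, 3))
```

A value of 0 would mean that knowing x does not help at all; a value of 1 would mean that knowing x removes every error.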
9. •Nominal Variables: Gender, marital status, or
race
•Lambda (λ) : Based on your ability to guess values
on one of the variables
•Ordinal Variables: Occupational status,
education
•Gamma (γ) : Same as lambda, except based on the
ordinal arrangement of values
•Interval or Ratio Variables: Age, income
•Pearson’s product-moment correlation (r)
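For interval or ratio variables, Pearson's r can be computed directly from deviations about the two means. This is a minimal sketch with hypothetical age and income figures; `pearson_r` is an illustrative helper, not a library function.

```python
import math

def pearson_r(xs, ys):
    """Pearson's product-moment correlation from deviations about the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    co_deviation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return co_deviation / math.sqrt(ss_x * ss_y)

# HYPOTHETICAL data: age in years, income in thousands of dollars.
ages = [20, 30, 40, 50, 60]
incomes = [25, 35, 50, 55, 70]

r = pearson_r(ages, incomes)
print(round(r, 3), round(r ** 2, 3))
```

The squared value, r², is commonly read in PRE terms: the proportion of variation in one variable that can be predicted from the other.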
10. • Variables are linearly related:
• The mean of Y increases linearly with X
• Check scatterplot for general linear trend
• Watch out for non-linear relationships
(e.g., U-shaped)
• Y is normally distributed for every value of
X in the population; “Conditional normality”
Ex: Income = X, Happiness = Y
• Is a histogram of Happiness approximately
normal for those with X = $25K? $50K?
$100K?
• If all are roughly normal, the assumption is met
11. • Association between two variables: Y = f(X)
• Regression Line: If all four points lie on a
straight line, we can superimpose that line
over the points; Y' = a + b(X)
• “Unexplained Variation”: The sum of squared
differences between actual and estimated
values of Y
• Represents errors that exist even when estimates
are based on known values of X
• “Explained Variation”: The difference between
the total variation and the unexplained
variation
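The split between explained and unexplained variation can be sketched with an ordinary least-squares line Y' = a + b(X). The four data points are hypothetical, and `fit_line` is an illustrative helper.

```python
def fit_line(xs, ys):
    """Least-squares intercept a and slope b for Y' = a + b(X)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    a = mean_y - b * mean_x
    return a, b

# HYPOTHETICAL data: four cases that do not fall exactly on a line.
xs = [1, 2, 3, 4]
ys = [2, 4, 5, 9]

a, b = fit_line(xs, ys)
estimated = [a + b * x for x in xs]
mean_y = sum(ys) / len(ys)

# Total variation: squared differences between actual Y and the mean of Y.
total = sum((y - mean_y) ** 2 for y in ys)
# Unexplained variation: squared differences between actual and estimated Y.
unexplained = sum((y - e) ** 2 for y, e in zip(ys, estimated))
# Explained variation: total minus unexplained.
explained = total - unexplained

print(round(a, 2), round(b, 2), round(unexplained, 2), round(explained, 2))
```

Dividing the explained variation by the total variation gives the proportion of variation in Y accounted for by X.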
12. • When we generalize from samples to larger
populations, we use inferential statistics to
test the significance of an observed
relationship
• Data analysis & sampling
• Most research projects involve samples
• Ultimate purpose is to make inferences about that
larger (target) population
• Both univariate and multivariate findings can be
interpreted as a basis for inference
13. • Univariate Measures: Percentages & Means
• “Standard Error”: $s = \sqrt{\dfrac{p \times q}{n}}$
• Any statement of sampling error must contain
two essential components:
• Confidence Level
• Confidence Interval
• Inferential statistics apply to sampling error
only; they do not take account of nonsampling
errors
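A short sketch of how the standard error of a percentage feeds into a statement of sampling error. The sample figures (50% of 400 respondents) are hypothetical, q is taken as 1 - p, and the multiplier 1.96 is the conventional normal-curve value for a 95% confidence level.

```python
import math

def standard_error(p, n):
    """Standard error of a sample proportion: sqrt(p * q / n), with q = 1 - p."""
    q = 1 - p
    return math.sqrt(p * q / n)

# HYPOTHETICAL sample: 50% of 400 respondents give a particular answer.
p, n = 0.5, 400
se = standard_error(p, n)

# Confidence level: 95%, which corresponds to about 1.96 standard errors.
z = 1.96
# Confidence interval: the range around p implied by that level.
interval = (p - z * se, p + z * se)

print(round(se, 3), tuple(round(bound, 3) for bound in interval))
```

Both components appear in the result: the level (95%) and the interval (about 45.1% to 54.9%). Neither says anything about nonsampling errors such as poor question wording.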
14. • So, two variables are related? Is the
relationship a significant one?
• Parametric tests of significance can tell us
• We report probability that a parameter falls
within a certain range (confidence interval) &
that degree of uncertainty is due to normal
sampling error
15. •Statistical significance is expressed with
probabilities
•What does P < .05 or .01 or .001 mean?
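•Significance at the .05 level means that the probability
of obtaining a relationship this strong by chance alone,
when none exists in the population, is no more than 5 out of
100 (or 1 out of 100 at the .01 level)
•If chance is an unlikely explanation, we treat the result
as a real relationship between the variables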
16. • Based on the Null Hypothesis: the assumption
that there is no relationship between two
variables in a population
• Compares what you get (empirical) with what
you expect given a null hypothesis of no
relationship
•Computing: For each cell in the table, we
•Subtract the expected frequency for that cell
from the observed frequency
•Square this quantity, and
•Divide the squared difference by the expected
frequency
•Then sum the results across all cells to obtain chi square
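These per-cell steps match the usual computation of chi square, where the results are summed over all cells. The sketch below assumes that reading and uses a hypothetical 2 x 2 table. Expected frequencies are derived from the row and column totals, which is what the null hypothesis of no relationship implies.

```python
def chi_square(observed):
    """Sum over cells of (observed - expected)^2 / expected."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    total = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            # Frequency expected in this cell if the variables are unrelated.
            expected = row_totals[i] * col_totals[j] / n
            total += (obs - expected) ** 2 / expected
    return total

# HYPOTHETICAL observed frequencies for a 2 x 2 table.
observed = [
    [30, 20],
    [20, 30],
]

chi2 = chi_square(observed)
print(chi2)
```

For a 2 x 2 table there is one degree of freedom, and the standard critical value at the .05 level is 3.841, so a result of 4.0 would be judged significant at that level.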
17. • Significance tests are guidelines, not the ultimate
standard
•Dangers due to sampling error, sample size,
etc.
•Check and compare to other tests
•"Empirical research is, first and foremost, a
logical rather than a mathematical operation."
18. • What is a statistically discernible difference?
• A result from tests on a non-random sample that
would be considered statistically significant if
found in a random sample
• Such findings should be viewed as important, but not
as statistically significant
Editor's Notes
Mean = 29.4
Median = 24
Mode = 18
Range = 60
Value    Deviation from mean (29.4)    Squared deviation
10       -19.4                         376.36
18       -11.4                         129.96
18       -11.4                         129.96
24       -5.4                          29.16
30       0.6                           0.36
36       6.6                           43.56
70       40.6                          1648.36
Sum of squares = 2357.72
Sum of squares divided by 7 cases = 336.82 (variance)
Standard deviation = square root of variance = 18.35
Generally, both variables have to be continuous. “Normally distributed” = values form a symmetric, bell-shaped curve centered on the mean
Y is a function of X
Again, we rely on probability sampling theory to ask about the likelihood of a relationship occurring by chance
We have different levels of significance; the chosen level depends on a number of factors
As sample size increases, we are more likely to achieve statistical significance, since a greater n means a greater likelihood of detecting an effect (power)