  • PROBLEM STATEMENT, PURPOSES, BENEFITS
    What exactly do I want to find out?
    What is a researchable problem?
    What are the obstacles in terms of knowledge, data availability, time, or resources?
    Do the benefits outweigh the costs?
     
    THEORY, ASSUMPTIONS, BACKGROUND LITERATURE
    What does the relevant literature in the field indicate about this problem?
    Which theory or conceptual framework does the work fit within?
    What are the criticisms of this approach, or how does it constrain the research process?
    What do I know for certain about this area?
    What is the background to the problem that needs to be made available in reporting the work?
     
    VARIABLES AND HYPOTHESES
    What will I take as given in the environment, i.e., what is the starting point?
    Which are the independent and which are the dependent variables?
    Are there control variables?
    Is the hypothesis specific enough to be researchable yet still meaningful?
    How certain am I of the relationship(s) between variables?
     
    OPERATIONAL DEFINITIONS AND MEASUREMENT
    Does the problem need scoping/simplifying to make it achievable?
    Which variables will be measured, and how?
    What degree of error in the findings is tolerable?
    Is the approach defendable?
     
    RESEARCH DESIGN AND METHODOLOGY
    What is my overall strategy for doing this research?
    Will this design permit me to answer the research question?
    What constraints will the approach place on the work?
     
    INSTRUMENTATION/SAMPLING
    How will I get the data I need to test my hypothesis?
    What tools or devices will I use to make or record observations?
    Are valid and reliable instruments available, or must I construct my own?
    How will I choose the sample?
    Am I interested in representativeness?
    If so, of whom or what, and with what degree of accuracy or level of confidence?
     
    DATA ANALYSIS
    What combinations of analytical and statistical process will be applied to the data?
    Which of these will allow me to accept or reject my hypotheses?
    Do the findings show numerical differences, and are those differences important?
     
    CONCLUSIONS, INTERPRETATIONS, RECOMMENDATIONS
    Was my initial hypothesis supported?
    What if my findings are negative?
    What are the implications of my findings for the theory base, for the background assumptions, or relevant literature?
    What recommendations result from the work?
    What suggestions can I make for further research on this topic?
  • How do we determine our population of interest?
    Administrators can tell us
    We notice anecdotally or through qualitative research that a particular subgroup of students is experiencing higher risk
    We decide to do everyone and go from there
    3 factors that influence sample representativeness
    Sampling procedure
    Sample size
    Participation (response)
    When might you sample the entire population?
    When your population is very small
    When you have extensive resources
    When you don’t expect a very high response
  • Sampling frame errors: university versus personal email addresses; changing class rosters; are all students in your population of interest represented?
  • Picture of sampling breakdown
  • Two general approaches to sampling are used in social science research. With probability sampling, all elements (e.g., persons, households) in the population have some opportunity of being included in the sample, and the mathematical probability that any one of them will be selected can be calculated. With nonprobability sampling, in contrast, population elements are selected on the basis of their availability (e.g., because they volunteered) or because of the researcher's personal judgment that they are representative. The consequence is that an unknown portion of the population is excluded (e.g., those who did not volunteer). One of the most common types of nonprobability sample is called a convenience sample – not because such samples are necessarily easy to recruit, but because the researcher uses whatever individuals are available rather than selecting from the entire population.
    Because some members of the population have no chance of being sampled, the extent to which a convenience sample – regardless of its size – actually represents the entire population cannot be known
  • ***Get types of research bias from class notes***
  • Descriptive Statistics are used to describe the basic features of the data in a study. They provide simple summaries about the sample and the measures. Together with simple graphics analysis, they form the basis of virtually every quantitative analysis of data. With descriptive statistics you are simply describing what is, what the data shows.
    Inferential Statistics investigate questions, models and hypotheses. In many cases, the conclusions from inferential statistics extend beyond the immediate data alone. For instance, we use inferential statistics to try to infer from the sample data what the population thinks. Or, we use inferential statistics to make judgments of the probability that an observed difference between groups is a dependable one or one that might have happened by chance in this study. Thus, we use inferential statistics to make inferences from our data to more general conditions; we use descriptive statistics simply to describe what's going on in our data.
  • reach conclusions that extend beyond the immediate data alone
  • Data Analysis and Surveying 101: Basic research methods and ...

    1. 1. Data Analysis and Surveying 101: Basic research methods and biostatistics as they apply to the ...
        Theresa Jackson Hughes, MPH
        American College Health Association
        December 2006
    2. 2. What we will cover today
        • Research Methods
            - Sampling Frame and Sampling
            - Generalizability
            - Bias
            - Reliability and Validity
            - Levels of measurement
        • Biostatistics
            - Statistical significance
            - Other key terms
            - Appropriate statistical tests
            - Fun examples from the Spring 2005 dataset!
        Get excited! It’s data time!!!
    3. 3. Research Methods
    4. 4.
        • “To do successful research, you don't need to know everything, you just need to know of one thing that isn't known.”
            - Arthur Schawlow
        • “That's the nature of research - you don't know what in hell you're doing.”
            - Harold "Doc" Edgerton
        • “If we knew what it was we were doing, it would not be called research, would it?”
            - Albert Einstein
    5. 5. What exactly is research?
        • “Scientific research is systematic, controlled, empirical, and critical investigation of natural phenomena guided by theory and hypotheses about the presumed relations among such phenomena.”
            - Kerlinger, 1986
        • Research is an organized and systematic way of finding answers to questions
    6. 6. Important Components of Empirical Research
        • Problem statement, research questions, purposes, benefits
        • Theory, assumptions, background literature
        • Variables and hypotheses
        • Operational definitions and measurement
        • Research design and methodology
        • Instrumentation, sampling
        • Data analysis
        • Conclusions, interpretations, recommendations
    7. 7. Sampling
        • What is your population of interest?
            - To whom do you want to generalize your results?
                · All students (18 and over)
                · Undergraduates only
                · Greeks
                · Athletes
                · Other
        • Can you sample the entire population?
    8. 8. Sampling
        • A sample is “a smaller (but hopefully representative) collection of units from a population used to determine truths about that population” (Field, 2005)
        • Why sample?
            - Resources (time, money) and workload
            - Gives results with known accuracy that can be calculated mathematically
        • The sampling frame is the list from which the potential respondents are drawn
            - Registrar’s office
            - Class rosters
            - Must assess sampling frame errors
    9. 9. Types of Samples
        • Probability (Random) Samples
            - Simple random sample
            - Systematic random sample
            - Stratified random sample
                · Proportionate
                · Disproportionate
            - Cluster sample
        • Non-Probability Samples
            - Convenience sample
            - Purposive sample
            - Quota
    10. 10. Sample Size

        Size of Campus     Final Desired N
        <600               All students
        600-2,999          600
        3,000-9,999        700
        10,000-19,999      800
        20,000-29,999      900
        ≥30,000            1,000

        • Depends on expected response rate
            - Average 85% for paper
                · FINAL SAMPLE DESIRED / .85 = SAMPLE
            - Average 25% for web
                · FINAL SAMPLE DESIRED / .25 = SAMPLE
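The response-rate adjustment on this slide can be sketched in a few lines of Python. The function name `invitations_needed` is hypothetical, and rounding up to a whole student is an assumption added here; the slide only gives the division.

```python
import math


def invitations_needed(final_n: int, response_rate: float) -> int:
    """Students to invite so that roughly final_n of them respond.

    Implements FINAL SAMPLE DESIRED / response rate = SAMPLE.
    Rounding up is an assumption, not something the slide states.
    """
    if not 0 < response_rate <= 1:
        raise ValueError("response_rate must be in (0, 1]")
    return math.ceil(final_n / response_rate)


# A campus of 600-2,999 students wants 600 completed surveys.
paper = invitations_needed(600, 0.85)  # paper survey, 85% expected response
web = invitations_needed(600, 0.25)    # web survey, 25% expected response
print(paper, web)
```

The lower the expected response rate, the larger the initial sample has to be, which is why the web figure is several times the paper figure.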
    11. 11. Bias and Error
    12. 12. Bias and Error
        • Systematic Error or Bias: unknown or unacknowledged error created during the design, measurement, sampling, procedure, or choice of problem studied
            - Error tends to go in one direction
                · Examples: Selection, Recall, Social desirability
        • Random Error
            - Unrelated to true measures
                · Example: Momentary fatigue
    13. 13. Reliability and Validity
        • Reliability
            - The extent to which a test is repeatable and yields consistent scores
            - Affected by random error
        • Validity
            - The extent to which a test measures what it is supposed to measure
            - A subjective judgment made on the basis of experience and empirical indicators
            - Asks “Is the test measuring what you think it’s measuring?”
            - Affected by systematic error (bias)
    14. 14. Reliability vs. Validity
        • In order to be valid, a test must be reliable; but reliability does not guarantee validity.
    15. 15. Levels of Measurement
    16. 16. Levels of Measurement
        • Nominal
            - Gender
                · Male, Female
            - Vaccinations
                · Yes, No, Unsure
        • Ordinal
            - Personal health status
                · Excellent, Very good, Good, Fair, Poor
            - Last 30 days
                · Never used, Not in last 30 days, 1-2 days, 3-5 days, 6-9 days, 10-19 days, 20-29 days, All 30 days
        • Interval
            - Body Mass Index (BMI)
        • Ratio
            - Number of drinks
            - Number of sexual partners
            - Perception percentages
            - Blood alcohol concentration (BAC)
    17. 17. Biostatistics
    18. 18.
        • “It is commonly believed that anyone who tabulates numbers is a statistician. This is like believing that anyone who owns a scalpel is a surgeon.”
            - R. Hooke
        • “Torture numbers, and they'll confess to anything.”
            - Gregg Easterbrook
        • “98% of all statistics are made up.”
            - Author Unknown
    19. 19. Types of Statistics
        • Descriptive statistics
            - Describe the basic features of data in a study
            - Provide summaries about the sample and measures
        • Inferential statistics
            - Investigate questions, models, and hypotheses
            - Infer population characteristics based on sample
            - Make judgments about what we observe
    20. 20. Descriptive Statistics
        • Central Tendency
            - Mode
            - Median
            - Mean
        • Variation
            - Range
            - Variance
            - Standard Deviation
        • Frequency
    21. 21. Descriptive Statistics Examples
        • Categorical Variables (Nominal/Ordinal)

        Q1 Gen health
                                  Frequency   Percent   Valid Percent   Cumulative Percent
        Valid    1 excellent           9145      16.9            17.0                 17.0
                 2 very good          23767      43.9            44.2                 61.2
                 3 good               16442      30.4            30.6                 91.8
                 4 fair                3737       6.9             6.9                 98.7
                 5 poor                 565       1.0             1.1                 99.8
                 6 don't know           132        .2              .2                100.0
                 Total                53788      99.4           100.0
        Missing  System                 323        .6
        Total                         54111     100.0
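As a rough illustration of how the three percentage columns in this frequency table relate, the sketch below recomputes them from the counts on the slide. It assumes that Percent uses all cases (including missing) as the denominator, that Valid Percent uses only non-missing cases, and that Cumulative Percent accumulates valid counts; variable names are illustrative.

```python
# Counts for Q1 Gen health, copied from the slide.
counts = {
    "excellent": 9145,
    "very good": 23767,
    "good": 16442,
    "fair": 3737,
    "poor": 565,
    "don't know": 132,
}
missing = 323

valid_total = sum(counts.values())   # respondents who answered the question
grand_total = valid_total + missing  # everyone in the dataset

rows = []
running = 0
for label, n in counts.items():
    running += n
    rows.append((
        label,
        n,
        round(100 * n / grand_total, 1),        # Percent
        round(100 * n / valid_total, 1),        # Valid Percent
        round(100 * running / valid_total, 1),  # Cumulative Percent
    ))

for row in rows:
    print(row)
```

Accumulating the counts before rounding avoids the small drift that can appear when already-rounded percentages are summed.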
    22. 22. Descriptive Statistics Examples
        • Categorical Variables (Nominal/Ordinal)

        Q49 Year in school * Q46 Sex Crosstabulation
                                                          Q46 Sex
        Q49 Year in school                         1 female    2 male     Total
        1 1st year undergrad        Count              7366      4154     11520
                                    % of Total        14.5%      8.2%     22.7%
        2 2nd year under            Count              6755      3678     10433
                                    % of Total        13.3%      7.2%     20.6%
        3 3rd year under            Count              6195      3333      9528
                                    % of Total        12.2%      6.6%     18.8%
        4 4th year under            Count              5192      2676      7868
                                    % of Total        10.2%      5.3%     15.5%
        5 5th year or more under    Count              1380       985      2365
                                    % of Total         2.7%      1.9%      4.7%
        6 graduate                  Count              5088      3246      8334
                                    % of Total        10.0%      6.4%     16.4%
        7 adult special             Count               203       105       308
                                    % of Total          .4%       .2%       .6%
        8 other                     Count               266       145       411
                                    % of Total          .5%       .3%       .8%
        Total                       Count             32445     18322     50767
                                    % of Total        63.9%     36.1%    100.0%
    23. 23. Descriptive Statistics Examples
        • Continuous Variables (Interval/Ratio)

        Descriptive Statistics
                                        N       Range   Minimum   Maximum   Mean      Std. Deviation   Variance
        Q48 Weight in pounds            51935   534     52        586       153.16    35.791           1281.031
        HT_INCH Height in Inches        52017   56.00   48.00     104.00    67.2035   4.01241          16.099
        Q13 How many drinks             53374   88      0         88        4.42      4.401            19.370
        Q12 Hours alcohol               53326   65      0         65        2.99      2.726            7.430
        BAC Blood Alcohol Content       50604   2.47    .00       2.47      .0731     .08357           .007
        Valid N (listwise)              50218
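Two of the columns in this table are simple functions of others: the range is the maximum minus the minimum, and the variance is the standard deviation squared. A minimal check against the slide's own numbers might look like the following; the dictionary layout is just an illustrative choice, and the BAC row is left out because its variance is printed with too few digits to compare.

```python
import math

# (minimum, maximum, range, std. deviation, variance) as printed on the slide.
table = {
    "Q48 Weight in pounds": (52, 586, 534, 35.791, 1281.031),
    "HT_INCH Height in Inches": (48.00, 104.00, 56.00, 4.01241, 16.099),
    "Q13 How many drinks": (0, 88, 88, 4.401, 19.370),
    "Q12 Hours alcohol": (0, 65, 65, 2.726, 7.430),
}

checks = {}
for name, (lo, hi, rng, sd, var) in table.items():
    range_ok = math.isclose(hi - lo, rng)
    # The printed SD is rounded, so allow a small relative tolerance.
    variance_ok = math.isclose(sd ** 2, var, rel_tol=1e-3)
    checks[name] = (range_ok, variance_ok)
    print(f"{name}: range ok={range_ok}, variance ok={variance_ok}")
```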
    24. 24. Hypotheses
        • Null hypotheses
            - Presumed true until statistical evidence in the form of a hypothesis test indicates otherwise
                · There is no effect/relationship
                · There is no difference in means
        • Alternative hypotheses
            - Tested using inferential statistics
                · There is an effect/relationship
                · There is a difference in means
    25. 25. Alpha, Beta, Power, Effect Size
        • Alpha: probability of making a Type I error
            - Reject null when null is true
            - Level of significance (the threshold the p value is compared against)
        • Beta: probability of making a Type II error
            - Fail to reject null when null is false
        • Power: probability of correctly rejecting null
            - 1 – Beta
        • Effect Size
            - Measure of the strength of the relationship between two variables

                               Null is true                        Null is false
        Reject null            Alpha (Type I error)                1 – Beta (Power): correct rejection
        Fail to reject null    1 – Alpha: correct non-rejection    Beta (Type II error)
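To make the link between alpha, beta, power, and effect size concrete, here is a minimal sketch for one simple case: a two-sided one-sample z-test with known standard deviation. That test, the function name `z_test_power`, and the example effect size of 0.2 are all assumptions chosen for illustration; none of them come from the slides.

```python
from statistics import NormalDist


def z_test_power(effect_size: float, n: int, alpha: float = 0.05) -> float:
    """Power of a two-sided one-sample z-test.

    effect_size is the true mean difference in standard deviation units
    (a hypothetical input); n is the sample size.
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)  # rejection cutoff for alpha
    shift = effect_size * n ** 0.5              # how far the true mean sits from the null
    # Probability the test statistic lands in either rejection region.
    return std_normal.cdf(-z_crit + shift) + std_normal.cdf(-z_crit - shift)


power_small_n = z_test_power(0.2, 50)
power_large_n = z_test_power(0.2, 200)
beta_small_n = 1 - power_small_n  # Type II error probability
print(round(power_small_n, 3), round(power_large_n, 3))
```

When the effect size is zero the null is true, so the "power" collapses to alpha, the Type I error rate. With a fixed nonzero effect, a larger sample raises power and lowers beta.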
    26. 26. Let’s test some hypotheses!!!
    27. 27. Test of the mean of one continuous variable
        • College students report drinking an average of 5 drinks the last time they “partied”/socialized
            - Hypotheses
                · Ho: µ = 5
                · HA: µ ≠ 5
            - Test: Two-tailed t-test
            - Result: Reject null

        One-Sample Statistics
                            N       Mean   Std. Deviation   Std. Error Mean
        How many drinks     53374   4.42   4.401            .019

        One-Sample Test (Test Value = 5)
                                                                                    95% Confidence Interval of the Difference
                            t         df      Sig. (2-tailed)   Mean Difference    Lower    Upper
        How many drinks     -30.352   53373   .000              -.578              -.62     -.54
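The one-sample t statistic on this slide can be approximately reproduced from the summary statistics alone. The sketch below uses the rounded mean and standard deviation printed on the slide, so the result differs slightly from the SPSS value of -30.352. Using 1.96 as the critical value for the confidence interval is an assumption that is reasonable only because the sample is very large.

```python
import math

n = 53374    # respondents
mean = 4.42  # sample mean number of drinks (rounded on the slide)
sd = 4.401   # sample standard deviation
mu0 = 5      # value stated in the null hypothesis

std_error = sd / math.sqrt(n)  # standard error of the mean
t_stat = (mean - mu0) / std_error
df = n - 1

# Large-sample 95% confidence interval for (mean - mu0); 1.96 is the
# normal critical value, assumed adequate at this sample size.
ci_low = (mean - mu0) - 1.96 * std_error
ci_high = (mean - mu0) + 1.96 * std_error

print(round(std_error, 3), round(t_stat, 2), df)
print(round(ci_low, 2), round(ci_high, 2))
```

The interval excludes zero, which is another way of seeing why the null hypothesis is rejected.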
    28. 28. Test of a single proportion of one categorical variable
        • 20% of college students report their health is excellent
            - Hypotheses
                · Ho: p = .20
                · HA: p < .20 (one-tailed)
            - Test: Z-test for a single proportion
            - Result: Reject null

        Binomial Test
                                Category   N       Observed Prop.   Test Prop.   Asymp. Sig. (1-tailed)
        Gen health   Group 1    <= 1       9145    .170             .2           .000 (a, b)
                     Group 2    > 1        44643   .830
                     Total                 53788   1.000
        a. Alternative hypothesis states that the proportion of cases in the first group < .2.
        b. Based on Z Approximation.
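A z-test for a single proportion can be sketched from the counts on this slide. The version below is the textbook normal approximation, with the standard error computed under the null proportion and no continuity correction. Whether SPSS applies a correction here is not stated on the slide, so treat this as an approximation rather than a reproduction of its output.

```python
import math
from statistics import NormalDist

excellent = 9145  # respondents reporting excellent health
n = 53788         # valid responses
p0 = 0.20         # proportion stated in the null hypothesis

p_hat = excellent / n
std_error = math.sqrt(p0 * (1 - p0) / n)  # standard error under the null
z_stat = (p_hat - p0) / std_error

# One-tailed p value for the alternative p < .20.
p_value = NormalDist().cdf(z_stat)

print(round(p_hat, 3), round(z_stat, 2), p_value)
```

The observed proportion is only three percentage points below 20%, yet the z statistic is enormous because the sample is so large.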
    29. 29. Test of a relationship between two continuous variables
        • There is a relationship between the number of drinks students report drinking the last time they drank and the number of sex partners they have had within the last school year
            - Hypotheses
                · Ho: ρ = 0
                · HA: ρ ≠ 0
            - Test: Pearson Product Moment Correlation
            - Result: Reject null

        Correlations
                                                    How many drinks   Partners you had
        How many drinks     Pearson Correlation     1                 .238**
                            Sig. (2-tailed)                           .000
                            N                       53374             52576
        Partners you had    Pearson Correlation     .238**            1
                            Sig. (2-tailed)         .000
                            N                       52576             52896
        **. Correlation is significant at the 0.01 level (2-tailed).
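For a Pearson correlation, the usual significance test converts r into a t statistic with n - 2 degrees of freedom. The sketch below applies that standard formula to the r and N printed on the slide, and also computes r squared as a rough gauge of how much variance the two variables share. Variable names are illustrative.

```python
import math

r = 0.238  # Pearson correlation from the slide
n = 52576  # pairs of observations

df = n - 2
t_stat = r * math.sqrt(df / (1 - r ** 2))  # test statistic for Ho: rho = 0
r_squared = r ** 2                         # proportion of shared variance

print(round(t_stat, 1), df, round(r_squared, 4))
```

The t statistic is very large, so the correlation is statistically significant, but r squared shows the two variables share under 6% of their variance.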
    30. 30. Test of the difference between two means
        • Men and women report significantly different numbers of sexual partners over the past 12 months
            - Hypotheses
                · Ho: µ1 = µ2
                · HA: µ1 ≠ µ2
            - Test: Independent Samples t-test OR One-way ANOVA
            - Result: Reject null

        Group Statistics
                            Sex      N       Mean   Std. Deviation   Std. Error Mean
        Partners you had    female   32687   1.34   2.017            .011
                            male     18474   1.82   3.627            .027

        Independent Samples Test (Partners you had)
                                       Levene's Test for
                                       Equality of Variances    t-test for Equality of Means
                                                                                                                                                95% Confidence Interval of the Difference
                                       F         Sig.           t         df          Sig. (2-tailed)   Mean Difference   Std. Error Difference   Lower    Upper
        Equal variances assumed        867.978   .000           -19.360   51159       .000              -.483             .025                    -.532    -.434
        Equal variances not assumed                             -16.704   25065.988   .000              -.483             .029                    -.540    -.426
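Levene's test on this slide is significant, so the "equal variances not assumed" row is the one to read. That row corresponds to Welch's t-test, which can be approximated from the group statistics as sketched below. Inputs are the rounded values printed on the slide, so the results are close to, but not identical with, the SPSS figures.

```python
import math

# Group statistics from the slide: (n, mean, standard deviation).
n_f, mean_f, sd_f = 32687, 1.34, 2.017  # female
n_m, mean_m, sd_m = 18474, 1.82, 3.627  # male

# Each group's contribution to the variance of the mean difference.
var_f = sd_f ** 2 / n_f
var_m = sd_m ** 2 / n_m

std_error = math.sqrt(var_f + var_m)
t_stat = (mean_f - mean_m) / std_error

# Welch-Satterthwaite approximation to the degrees of freedom.
welch_df = (var_f + var_m) ** 2 / (
    var_f ** 2 / (n_f - 1) + var_m ** 2 / (n_m - 1)
)

print(round(std_error, 3), round(t_stat, 2), round(welch_df))
```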
    31. 31. Test of the difference between two or more means
        • Mean BAC reported differs across student residences
            - Hypotheses
                · Ho: µ1 = µ2 = µ3 = µ4 = µ5 = µ6
                · HA: µi ≠ µj for at least one pair i, j
            - Test: One-way ANOVA
            - Result: Reject null

        Descriptives: Blood Alcohol Content
                                                                                 95% Confidence Interval for Mean
                                    N       Mean    Std. Deviation   Std. Error   Lower Bound   Upper Bound   Minimum   Maximum
        residence hall              21285   .0741   .08215           .00056       .0730         .0752         .00       1.27
        frat/sorority house         781     .1127   .09278           .00332       .1062         .1193         .00       .75
        other university housing    3620    .0622   .07357           .00122       .0598         .0646         .00       1.41
        off campus                  18151   .0773   .08539           .00063       .0760         .0785         .00       2.47
        with parents                4279    .0606   .08490           .00130       .0581         .0631         .00       1.17
        other                       2266    .0579   .08296           .00174       .0545         .0613         .00       1.26
        Total                       50382   .0731   .08357           .00037       .0724         .0738         .00       2.47

        ANOVA: Blood Alcohol Content
                          Sum of Squares   df      Mean Square   F        Sig.
        Between Groups    3.188            5       .638          92.123   .000
        Within Groups     348.695          50376   .007
        Total             351.884          50381
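The F ratio in the ANOVA table is the between-groups mean square divided by the within-groups mean square. The sketch below recomputes it from the sums of squares on the slide and adds eta squared as a simple effect-size measure. Eta squared is not reported on the slide and is included here only as an illustration.

```python
ss_between, df_between = 3.188, 5
ss_within, df_within = 348.695, 50376

ms_between = ss_between / df_between  # mean square between groups
ms_within = ss_within / df_within     # mean square within groups
f_ratio = ms_between / ms_within

# Eta squared: share of total variation explained by residence type.
eta_squared = ss_between / (ss_between + ss_within)

print(round(ms_between, 3), round(ms_within, 3))
print(round(f_ratio, 2), round(eta_squared, 4))
```

With more than 50,000 respondents the F test is highly significant even though residence type accounts for less than 1% of the variation in reported BAC.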
    32. 32. Test of the difference between two or more means

        Multiple Comparisons
        Dependent Variable: Blood Alcohol Content
        Games-Howell
                                                                                                                      95% Confidence Interval
        (I) Currently live          (J) Currently live          Mean Difference (I-J)   Std. Error   Sig.   Lower Bound   Upper Bound
        residence hall              frat/sorority house         -.03865*                .00337       .000   -.0483        -.0290
                                    other university housing    .01190*                 .00135       .000   .0081         .0157
                                    off campus                  -.00316*                .00085       .003   -.0056        -.0007
                                    with parents                .01350*                 .00141       .000   .0095         .0175
                                    other                       .01623*                 .00183       .000   .0110         .0215
        frat/sorority house         residence hall              .03865*                 .00337       .000   .0290         .0483
                                    other university housing    .05055*                 .00354       .000   .0404         .0606
                                    off campus                  .03548*                 .00338       .000   .0258         .0451
                                    with parents                .05215*                 .00356       .000   .0420         .0623
                                    other                       .05488*                 .00375       .000   .0442         .0656
        other university housing    residence hall              -.01190*                .00135       .000   -.0157        -.0081
                                    frat/sorority house         -.05055*                .00354       .000   -.0606        -.0404
                                    off campus                  -.01506*                .00138       .000   -.0190        -.0111
                                    with parents                .00160                  .00178       .947   -.0035        .0067
                                    other                       .00433                  .00213       .323   -.0017        .0104
        off campus                  residence hall              .00316*                 .00085       .003   .0007         .0056
                                    frat/sorority house         -.03548*                .00338       .000   -.0451        -.0258
                                    other university housing    .01506*                 .00138       .000   .0111         .0190
                                    with parents                .01667*                 .00144       .000   .0125         .0208
                                    other                       .01940*                 .00185       .000   .0141         .0247
        with parents                residence hall              -.01350*                .00141       .000   -.0175        -.0095
                                    frat/sorority house         -.05215*                .00356       .000   -.0623        -.0420
                                    other university housing    -.00160                 .00178       .947   -.0067        .0035
                                    off campus                  -.01667*                .00144       .000   -.0208        -.0125
                                    other                       .00273                  .00217       .809   -.0035        .0089
        other                       residence hall              -.01623*                .00183       .000   -.0215        -.0110
                                    frat/sorority house         -.05488*                .00375       .000   -.0656        -.0442
                                    other university housing    -.00433                 .00213       .323   -.0104        .0017
                                    off campus                  -.01940*                .00185       .000   -.0247        -.0141
                                    with parents                -.00273                 .00217       .809   -.0089        .0035
        *. The mean difference is significant at the .05 level.
    33. 33. Test for a relationship between two categorical variables
        • Is there an association between being a member of a fraternity/sorority and ever being diagnosed with depression?
            - Hypotheses
                · Ho: There is no association between being a member of a fraternity/sorority and ever being diagnosed with depression.
                · HA: There is an association between being a member of a fraternity/sorority and ever being diagnosed with depression.
            - Test: Chi-square test for independence
            - Result: Fail to reject null
    34. 34. Test for relationship between two categorical variables

        Ever - Depression * Frat or sorority? Crosstabulation
                                                   Frat or sorority?
        Ever - Depression                          yes       no         Total
        yes                  Count                 681       7692       8373
                             Expected Count        715.6     7657.4     8373.0
        no                   Count                 3744      39657      43401
                             Expected Count        3709.4    39691.6    43401.0
        Total                Count                 4425      47349      51774
                             Expected Count        4425.0    47349.0    51774.0

        Chi-Square Tests
                                        Value       df   Asymp. Sig. (2-sided)   Exact Sig. (2-sided)   Exact Sig. (1-sided)
        Pearson Chi-Square              2.185 (b)   1    .139
        Continuity Correction (a)       2.122       1    .145
        Likelihood Ratio                2.211       1    .137
        Fisher's Exact Test                                                      .141                   .073
        Linear-by-Linear Association    2.185       1    .139
        N of Valid Cases                51774
        a. Computed only for a 2x2 table.
        b. 0 cells (.0%) have expected count less than 5. The minimum expected count is 715.62.
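The Pearson chi-square value on this slide can be recomputed directly from the observed counts. The sketch below derives each expected count as (row total × column total) / N, sums (observed - expected)² / expected over the four cells, and obtains the p value from the identity that holds for one degree of freedom (a chi-square variable with 1 df is a squared standard normal). Variable names are illustrative.

```python
import math

# Observed counts: rows = ever diagnosed with depression (yes, no),
# columns = fraternity/sorority member (yes, no).
observed = [
    [681, 7692],
    [3744, 39657],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count under independence for each cell.
expected = [[r * c / n for c in col_totals] for r in row_totals]

chi_square = sum(
    (observed[i][j] - expected[i][j]) ** 2 / expected[i][j]
    for i in range(2)
    for j in range(2)
)

# For df = 1, P(chi-square > x) equals erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(chi_square / 2))

print(round(expected[0][0], 2), round(chi_square, 3), round(p_value, 3))
```

The p value is above .05, which matches the slide's conclusion of failing to reject the null hypothesis of no association.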
    35. 35. Important Points to Remember
        • A significant association does not indicate causation
        • Statistical significance is not always the same as practical significance
        • Multiple factors contribute to whether your results are significant
        • It gets easier and easier as you practice!
    36. 36. Questions???
