This talk discusses and demonstrates techniques for analyzing survey data. Survey data is a useful source for answering a wide range of questions, but it often requires special analytical techniques to interpret. We'll discuss how to weight data to match known population parameters (such as StatsCan census data) using post-stratification, and how to handle missing data with the MICE algorithm. These techniques are commonly used in political polling and social science research. I'll provide example code in R and explain all the steps using data from a survey of Canadians' values.
2. Background
Founder and Principal of RA2 in Calgary
Apply data analysis, research and digital tools for NGOs, political groups and brands
- Political polling
- Data strategy
- Stakeholder/member relations
Survey research, NLP/machine learning, social network analysis
3. Survey Data
Goal:
- Estimate opinions, attitudes, beliefs, values, behaviour, etc. for a population
- Accurately assess the variable of interest for the whole population, not just the survey respondents
[Figure: Population vs. Sample (source: Stats and R blog)]
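As a toy illustration of that goal, here is a minimal R sketch (all numbers are synthetic and invented for demonstration) comparing the population value to what a sample lets us observe:

# Toy illustration: estimate a population quantity from a sample.
# The population, its true support rate, and the sample size are invented.
set.seed(42)
population  <- rbinom(1e6, size = 1, prob = 0.37)  # true support = 37%
respondents <- sample(population, size = 1000)     # a simple random sample
mean(population)   # the quantity we actually care about
mean(respondents)  # what the survey lets us estimate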
4. Survey error comes from many sources
The most commonly reported error is the margin of error from random sampling
● This only accounts for part of the total error (a quick calculation is sketched after this slide)
Other sources of error come down to how the data was collected, how questions were asked, etc.
● These are more difficult to estimate
Total Survey Error
Source: Biemer (2010), Total Survey Error: Design, Implementation, and Evaluation
Survey Error
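As a quick reference for the margin of sampling error mentioned above, here is a minimal R sketch; it assumes simple random sampling, a 95% confidence level, and the conservative choice p = 0.5:

# Margin of sampling error for a proportion under simple random sampling,
# at 95% confidence, with the conservative assumption p = 0.5.
margin_of_error <- function(n, p = 0.5, conf = 0.95) {
  z <- qnorm(1 - (1 - conf) / 2)
  z * sqrt(p * (1 - p) / n)
}
margin_of_error(1000)  # ~0.031, i.e. "within ~3.1 points, 19 times out of 20"

This is the only error component that gets reported as a single number; the nonsampling sources listed above are not captured by it.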
5. Probability vs. Non-probability sample
● Probability samples are the gold standard
● In a probability sample, every member of the population has a known, nonzero chance of being included in the sample
● New methods are somewhere in between, e.g. probability panels
Stratification and quotas
● Reduce bias in the sample
● Not a silver-bullet solution
Practical Considerations
Cost considerations
● Convenience samples are typically much cheaper than probability samples
Convenience considerations
● Some methods take longer to field
● May not be easy to reach some groups
6. Nonsense/Fraudulent Responses:
● Satisficing: respondents take mental shortcuts
● Respondents may not be paying attention
● May just want the survey incentive (if applicable)
● Could be malicious attempts to distort survey results
Quality Control Checks (a rough scripted version is sketched below)
● Straightlining: respondent chooses the same answer for every question in a grid
● Speeding: respondent completes the survey in superhuman time
● Trap Questions: respondents select implausible answers or don’t follow instructions
Fatigue Leads to Satisficing
● Shorter is better
● Very short (less than 5 questions) is ideal
● Data quality drops significantly after ~20 minutes (YMMV)
Survey Data Quality
For more on trap questions:
● Liu and Wronski (2018), Trap questions in online surveys: Results from three web survey experiments
● Kung et al. (2018), Are Attention Check Questions a Threat to Scale Validity?
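As a rough sketch of how these quality control checks might be scripted, the snippet below flags speeders and straightliners; the data frame, the column names (duration_sec, the q1 to q5 grid items) and the cutoffs are illustrative assumptions, not fixed rules:

library(dplyr)

# Hypothetical respondent-level data: completion time plus a 5-item grid.
responses <- tibble(
  id           = 1:6,
  duration_sec = c(310, 95, 480, 60, 620, 400),
  q1 = c(4, 3, 5, 3, 2, 4),
  q2 = c(4, 3, 1, 3, 3, 4),
  q3 = c(5, 3, 2, 3, 4, 4),
  q4 = c(3, 3, 4, 3, 2, 5),
  q5 = c(4, 3, 5, 3, 1, 4)
)

flagged <- responses |>
  rowwise() |>
  mutate(
    straightliner = n_distinct(c_across(q1:q5)) == 1,  # same answer to every grid item
    speeder       = duration_sec < 120                 # arbitrary "superhuman" cutoff
  ) |>
  ungroup()

filter(flagged, straightliner | speeder)  # respondents to review or drop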
7. Missing Completely at Random (MCAR):
● The best-case scenario
● Data is missing purely at random, so the missingness does not change the distribution of responses
Missing Not at Random (MNAR)
● There is a pattern to the missingness: it depends on the unobserved value itself
● Could indicate a larger issue with data collection
● May indicate response bias (social desirability bias, etc.) or bias in the sample
Missing at Random (MAR)
● Missingness is related only to other observed variables, not to the missing value itself (a simulated sketch of the three mechanisms follows this slide)
Missing Data
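To make the three mechanisms concrete, here is a small simulated sketch (all variables and probabilities are invented for illustration):

# Simulated illustration of MCAR vs. MAR vs. MNAR (all values invented).
set.seed(1)
n      <- 5000
age    <- sample(18:90, n, replace = TRUE)
income <- rnorm(n, mean = 50 + 0.3 * age, sd = 15)

# MCAR: every value has the same chance of being missing
income_mcar <- ifelse(runif(n) < 0.10, NA, income)

# MAR: missingness depends on an observed variable (age), not on income itself
income_mar <- ifelse(runif(n) < 0.02 + 0.002 * age, NA, income)

# MNAR: missingness depends on the unobserved income value itself
income_mnar <- ifelse(runif(n) < plogis((income - 80) / 10), NA, income)

# Compare observed means with the truth: only MCAR leaves them untouched on average
c(truth = mean(income),
  mcar  = mean(income_mcar, na.rm = TRUE),
  mar   = mean(income_mar,  na.rm = TRUE),
  mnar  = mean(income_mnar, na.rm = TRUE))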
8. There are many methods to deal with missing data (complete cases, nearest neighbour, mean, median, etc.)
MICE is a top performer
● Stands for Multivariate Imputation by Chained Equations
● Uses other variables in the dataset to estimate missing values
● Generates “plausible synthetic values”
From the documentation, the default methods are:
● pmm: predictive mean matching (numeric data)
● logreg: logistic regression imputation (binary data, factor with 2 levels)
● polyreg: polytomous regression imputation for unordered categorical data (factor > 2 levels)
● polr: proportional odds model (ordered factor, > 2 levels)
A general rule of thumb is to impute only up to ~5% missingness
Great documentation at
https://www.rdocumentation.org/packages/mice/versions/3.13.0/topics/mice
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3074241/
Imputation with MICE
Image designed by Jaden M. Walters
9. Using data from a Fall 2021 political research poll
Will make some simplifying assumptions for demonstration purposes
● Ignoring stratification
● Assuming MCAR (Missing Completely at Random)
MICE Imputation Example in R
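Since the poll microdata isn't reproduced here, the sketch below runs mice on a small synthetic data frame; the variable names and values are placeholders, and the settings simply follow the package defaults described above (m = 5 imputations, pmm/logreg/polyreg chosen by variable type):

library(mice)

# Synthetic stand-in for the survey data, with some values deleted at random (MCAR).
set.seed(2021)
dat <- data.frame(
  age    = sample(18:90, 500, replace = TRUE),
  gender = factor(sample(c("Woman", "Man"), 500, replace = TRUE)),
  region = factor(sample(c("West", "Ontario", "Quebec", "Atlantic"), 500, replace = TRUE)),
  score  = round(runif(500, 0, 10))
)
dat$score[sample(500, 20)]  <- NA  # ~4% missing, within the ~5% rule of thumb
dat$gender[sample(500, 15)] <- NA

# Defaults: pmm for numeric, logreg for 2-level factors, polyreg for >2 levels
imp <- mice(dat, m = 5, maxit = 10, seed = 2021, printFlag = FALSE)

completed <- complete(imp, 1)  # one of the m = 5 completed datasets
summary(completed)

In practice you would typically analyze all m completed datasets and pool the results (e.g. with mice's with() and pool()) rather than work from a single completed copy.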
10. Propensity weighting
● Adjust survey sample to known population parameters
● Weight by the inverse probability of selection to remove bias
● With probability samples, selection probabilities are known
● With non-probability samples, probabilities are estimated
The raking algorithm iteratively adjusts the weights so the survey's marginal distributions match known population distributions
● Implemented in R using the rake() function from the survey package, as well as the anesrake() function from the anesrake package
This is one of, if not the, most common weighting methods used by researchers and pollsters
● Only requires knowledge of the population marginals, not the full joint distribution
Weighting with Post-Stratification
11. Using population data from:
https://www12.statcan.gc.ca/census-recensement/2016/dp-pd/prof/details/download-telecharger/comp/page_dl-tc.cfm?Lang=E
Post-Stratification Raking Example in R
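A minimal raking sketch with the survey package is below; the respondent data frame, the sex/agegrp categories, and the population counts are placeholders standing in for the survey file and the census marginals downloaded from the StatsCan link above:

library(survey)

# Placeholder sample; in practice this is the respondent file,
# and the Freq columns come from the census download above.
set.seed(123)
smp <- data.frame(
  sex     = factor(sample(c("Female", "Male"), 800, replace = TRUE, prob = c(0.6, 0.4))),
  agegrp  = factor(sample(c("18-34", "35-54", "55+"), 800, replace = TRUE, prob = c(0.2, 0.4, 0.4))),
  support = rbinom(800, 1, 0.45)
)

# Known population margins (illustrative counts, not real census figures)
pop_sex    <- data.frame(sex    = c("Female", "Male"), Freq = c(15300000, 14700000))
pop_agegrp <- data.frame(agegrp = c("18-34", "35-54", "55+"), Freq = c(8000000, 10000000, 12000000))

# Start from an unweighted design, then rake to the marginal distributions
unweighted <- svydesign(ids = ~1, data = smp, weights = ~1)
raked <- rake(unweighted,
              sample.margins     = list(~sex, ~agegrp),
              population.margins = list(pop_sex, pop_agegrp))

svymean(~support, raked)  # weighted estimate of the variable of interest
summary(weights(raked))   # always inspect the resulting weights for extreme values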