Delivered at the World Bank, as part of the Development Data Group Learning Series
Washington DC, 2016-03-07
Response rates do not always provide an accurate depiction of data quality. Research based on a large multi-country survey indicates that when interviewers play a substantial role in sample selection, interviewer manipulation may artificially generate high response rates. For example, when using the random walk selection technique, interviewers should select every kth household, but they have substantial leeway in deciding which household is the kth one, and may preferentially select those where someone is home. Or, when rostering a household to select a random respondent, interviewers may leave off household members who are seldom at home. If many interviewers engage in such behaviors, a high response rate may in fact be the result of biased sample selection and therefore indicate low data quality.
There are two lessons from these findings. First, response rates should not be used as the sole or primary proxy for data quality. Second, whenever possible, interviewers’ role in sample selection should be minimized. The talk concludes with a review of alternative sampling methods that take advantage of geospatial data such as satellite photos, drone imagery and handheld GPS devices. The ideal sampling techniques are ones that minimize interviewer discretion and allow for verification of interviewer performance.
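The mechanism is easy to see in a toy simulation. The sketch below is my own illustration, not from the talk: households vary in how likely someone is to be home, and an interviewer who bends the "every kth household" rule toward reachable doors raises the response rate while skewing the sample toward at-home households.

```python
import random

random.seed(1)
K = 5  # protocol: take every 5th household

# 1,000 households along a walk route; p_home is the chance someone answers
street = [{"p_home": random.uniform(0.2, 0.9)} for _ in range(1000)]

def strict_walk(hhs):
    # Follow the rule exactly: households K-1, 2K-1, ...
    return hhs[K - 1::K]

def manipulated_walk(hhs):
    # Interviewer treats "the kth household" loosely and picks the most
    # reachable household within two doors of each kth position
    return [max(hhs[max(i - 2, 0):i + 3], key=lambda h: h["p_home"])
            for i in range(K - 1, len(hhs), K)]

for name, select in [("strict", strict_walk), ("manipulated", manipulated_walk)]:
    sample = select(street)
    rr = sum(h["p_home"] for h in sample) / len(sample)  # expected response rate
    print(f"{name:12s} expected response rate: {rr:.2f}")
```

The manipulated walk reports a visibly higher response rate, yet its sample over-represents households where someone is usually home — exactly the pattern the talk warns about.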
Are the Hard to Cover Also Less Likely to Respond? – Stephanie Eckman
Growing evidence suggests that the cases identified and added via efforts to improve coverage are disproportionately nonresponders to the survey request. For example, the AAPOR Cell Phone Task Force found that mobile-only households, which are undercovered in landline frames, have lower response rates than households that have a landline. The LISS web panel offered internet access to non-internet households, in an effort to improve coverage of the population, but observed lower recruitment rates among these households. Because response rates are published and seen as quality indicators, whereas coverage rates usually go unreported, there may be incentives for those involved in survey production to increase response rates at the expense of coverage rates. This chapter will systematically review the existing evidence for such a nonresponse-coverage trade-off and use a theoretical lens to search for the mechanisms underlying the connection between nonresponse and undercoverage. We will also call attention to situations in which the distinction between nonresponse and undercoverage is not entirely clear. We consider alternative formulations of the response rate that collapse across response and coverage; such measures will be particularly important as the field moves towards data collection beyond surveys as we know them, where nonresponse and undercoverage cannot be easily distinguished.
Data-Ed Online: Engineering Solutions to Data Quality Challenges – Data Blueprint
This webinar originally aired on Tuesday, October 9th, 2012. It is part of Data Blueprint's ongoing webinar series on data management with Dr. Aiken.
Sign up for future sessions at http://www.datablueprint.com/webinar-schedule.
Abstract:
This presentation provides guidance to organizations considering or preparing for data quality initiatives. We will illustrate how organizations with chronic business challenges often can trace the root of the problem to poor data quality. Showing how data quality can be engineered provides a useful framework in which to develop an organizational approach. This in turn will allow organizations to more quickly identify data problems caused by structural issues versus practice-oriented defects. Participants will also learn the importance of practicing data quality engineering quantification.
Activity #6 (Monday, March 6) (ABC)
Create a PowerPoint presentation giving your own judgment on the concept of accounting, list 5 objectives of accounting with your company in mind, say why accounting is important, mention the types of accounting you consider relevant and why, and finally upload the presentation to www.slideshare.net and send the link to the email address josetorres.sagradocorazon@gmail.com
Driving Marketing & Sales Activities in your Business – Russell Cummings
This webinar looks at 3 key elements for building marketing and sales activities in your business: the 3 essential components of marketing and sales; how to build a sales pipeline for your business; and how to make it happen.
Do you lose precious time due to data quality problems?
Do you need to integrate data from multiple sources and provide an integrated view of your customer or product attributes to other systems?
SQL Server 2016 Data Quality and Master Data Services can help you.
Undercoverage plagues many frames - housing units are missed by listers or do not appear on the postal service list; persons with tenuous connections to households are not captured in rosters; persons hide their eligibility during screener interviews. The literature on undercoverage suggests several methods for improving the coverage of such frames, via a missed housing unit procedure, or detailed probes about household members, or disguising the target population in survey questions. However, each of these solutions introduces additional costs into the survey process. In this way, survey designers face a coverage-cost trade-off. In addition, there is increasing evidence that the cases found via these coverage-improvement measures are disproportionately nonresponders to the survey request. Thus there appears to be a coverage-nonresponse trade-off as well. Together these points raise the question of how much effort we should put into increasing coverage, when such efforts increase both costs and nonresponse. This presentation will review empirical evidence for these trade-offs and search for clues to the mechanisms underlying the connection between nonresponse and undercoverage.
The use of GIS tools in analyzing and conducting large-scale surveys has increased in the last several years and will likely continue to do so as the technologies become less expensive and easier to use. Starting with the Total Survey Error framework, this talk will discuss how GIS tools can help us measure and reduce different error sources, such as coverage, nonresponse and measurement error. In addition, the tools can increase interviewer efficiency and reduce data collection costs. As we embrace these tools, survey researchers should maintain a healthy skepticism about their role. The talk will review the errors that GPS devices and GIS software can introduce; privacy and confidentiality concerns are also important.
Three Studies on Supplementing Survey Data with Active Data – Stephanie Eckman
As survey costs increase and response rates decrease, researchers are looking to alternative methods to collect data from study subjects. Passive data are data collected from subjects without posing questions and recording responses. Examples of passive data are: location data collected from smartphones; applications installed on smartphones; activity data from fitness devices such as Fitbits. Because they are collected without subject involvement, passive data may offer a way to reduce the burden borne by our research subjects while also allowing us to collect the high quality data needed for social science research. However, more research into how to collect and analyze passive data is needed. In this talk, I present three research studies which use passive data to improve the quality and/or reduce the burden of survey data. The talk will focus on what we have learned and what research remains to be done.
Does evidence actually influence policy? What can be done to improve the record?
Presentation by Priya Deshingkar, Research Director of the Migrating out of Poverty RPC
Interviewer Involvement in Selection Shapes the Relationship between Response... – Stephanie Eckman
A high survey response rate may be a sign that interviewers are not following directions and that your data are full of undercoverage and nonresponse error. Presentation at #ITSEW workshop June 2018
Several studies have shown that, contrary to most researchers' expectations, high response rates are not correlated with low bias in survey data. In this paper we show that the relationship between response rates and bias is moderated by the type of sampling method used. When interviewers are involved in selecting the sample of households for the survey, high response rates can in fact be a sign of high bias. We suggest that this relationship is due to interviewers' incentives to select households with high response propensities.
Workshop session 4 - Optimal sample designs for general community telephone s... – The Social Research Centre
Social Research Centre workshop - Telephone Surveying in the Post-Modern Era, held Thursday 10 October 2019. Presentation by Dina Neiger - Chief Statistician (Social Research Centre)
This project, carried out on behalf of the Société d'habitation du Québec (SHQ), established links between various data sources in order to produce a statistical profile of the residents of social housing in Quebec City.
Data Integrity in Decentralized Clinical Trials (DCTs) – InsideScientific
Experts expand on the need for a comprehensive understanding of all sources of data in DCTs, and the need to evaluate those data centrally in real time to mitigate the risks associated with their capture, including data capture at the edge of the network (e.g., wearables).
Every disruptive innovation must be complemented by adapted procedures, and this also applies to decentralized clinical trials (DCTs). Traditionally, sites entered clinical trial data in an Electronic Data Capture (EDC) system and these source data were verified at the site to confirm accuracy. Risk-based monitoring focused on site-level metrics such as screen failure rates, query rates, Serious Adverse Events (SAEs) reported, missed/late visits, etc. With DCTs, as source data are collected directly from participants, this is no longer an option and a different approach is required to ensure the quality and integrity of the data. As a rule, a comprehensive understanding of all sources for data capture in a clinical trial and the process for centralization is essential. Also, it is important to evaluate the data collected in real time to allow early interventions that will ensure data integrity for regulatory submission.
In this webinar, Chitra Lele describes how centralized monitoring strategies can help aggregate and analyze data in real time and provide insights to a variety of functional teams across the trial continuum. Daniel Gutierrez describes how the Clinerion platform can boost data integrity in DCTs. The technology transforms global data sources into one query-able data model for structured medical data, while ensuring that the data keep their full resolution and integrity during aggregated queries.
Pierre Etienne talks about the expanding role of mobile Health Care Professionals (HCPs) and their crucial role in protecting data integrity. Clifton Chow finishes with a comparison of several artificial intelligence (AI) based binary classifiers for detecting the integrity of data obtained from Internet of Things (IoT) enabled wearable sensors.
Data Quality Concerns when Crowdsourcing Scientific Tasks – Stephanie Eckman
Crowdsourcing has become a popular means to solicit assistance for scientific research. From classifying images or texts to responding to surveys, tapping into the knowledge of crowds to complete complex tasks has become a common strategy in social and information sciences. Although the timeliness and cost-effectiveness of crowdsourcing may provide desirable advantages to researchers, the data it generates may be of lower quality for some scientific purposes. The quality control mechanisms, if any, offered by common crowdsourcing platforms may not provide robust measures of data quality. This study explores whether research task participants may engage in motivated misreporting whereby participants tend to cut corners to reduce their workload while performing various scientific tasks online. We conducted an experiment with three common crowdsourcing tasks: answering surveys, coding images, and classifying online social media content. The experiment recruited workers from three sources: a crowdsourcing platform for crowd workers, a commercial survey panel provider for online panelists, and a research volunteering website for citizen scientists. The analysis seeks to address the following two questions: (1) whether online panelists, crowd workers or volunteers may engage in motivated misreporting differently and (2) whether the patterns of misreporting vary by different task types. We further seek to examine potential correlation between the patterns of motivated misreporting and the data quality of complex scientific research tasks. The study closes with suggestions of quality assurance practices of incorporating collective intelligence to improve the system for massive online information analysis in social science research.
Combining Survey and Wearable Data on Exercise and Sleep – Stephanie Eckman
High quality data on physical activity is difficult to collect via surveys. Respondents tend to overreport physical activity due to social desirability bias. Consumer health wearables, such as Fitbits, may offer a method of collecting higher quality data. To explore whether wearable devices provide a reasonable alternative to survey data collection, we collected survey data on exercise and sleep patterns from 500 respondents. These questions were modelled after the Behavioral Risk Factor Surveillance System (BRFSS) in the US. The respondents also provided their previous month’s Fitbit data. These data capture activity, steps, heartrate and sleep every minute. In addition to these survey and passive data, we also have the nationwide BRFSS data and the nationwide data on Fitbit users. Combining these data sources, we will comment on the relative quality of survey and wearable reports of physical activity. Our results will be informative for all researchers thinking of integrating consumer wearables into their studies, as well as anyone who collects or analyzes data on physical activity.
Sampling Nomads: A New Technique for Remote, Hard-to-Reach, and Mobile Popula... – Stephanie Eckman
Livestock are an important component of rural livelihoods in developing countries, but data about this source of income and wealth are difficult to collect due to the nomadic and seminomadic nature of many pastoralist populations. Most household surveys exclude those without permanent dwellings, leading to undercoverage. In this study, we explore the use of a random geographic cluster sample (RGCS) as an alternative to the household-based sample. In this design, points are randomly selected and all eligible respondents found inside circles drawn around the selected points are interviewed. This approach should eliminate undercoverage of mobile populations. We present results of an RGCS survey with a total sample size of 784 households to measure livestock ownership in the Afar region of Ethiopia in 2012. We explore the RGCS data quality relative to a recent household survey, and discuss the implementation challenges.
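As a rough illustration of the RGCS design described above (my own sketch with hypothetical coordinates and radius, not the study's code): draw random points in the study region and keep every household inside a fixed-radius circle around each point.

```python
import math
import random

random.seed(3)
RADIUS = 0.02  # circle radius, in the same (arbitrary) units as the coordinates

# Hypothetical household locations in a unit-square study region
households = [(random.random(), random.random()) for _ in range(5000)]

def rgcs(n_points):
    """Random geographic cluster sample: all households within RADIUS of
    each randomly drawn point are selected for interview."""
    sample = []
    for _ in range(n_points):
        center = (random.random(), random.random())
        sample.extend(h for h in households if math.dist(h, center) <= RADIUS)
    return sample

print(len(rgcs(n_points=25)), "households selected")
```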
Presentation at the European Central Bank, Nov 6, 2013
Panel surveys are used to measure change over time, but previous research has shown that simply asking the same questions of the same respondents in repeated interviews leads to overreporting of change. With proactive dependent interviewing, responses from the previous interview are preloaded into the questionnaire, and respondents are reminded of this information before being asked about their current situation. Existing research has shown that dependent interviewing techniques can reduce spurious change in wave-to-wave reports and thus improve the quality of estimates from longitudinal data. However, the literature provides little guidance on how such questions should be worded. After reminding a respondent of her report in the last wave (“Last time we interviewed you, you said that you were not employed”), we might ask: “Is that still the case?”; “Has that changed?”; “Is that still the case or has that changed?”; or we might ask the original question again: “What is your current labour market activity?”. In this study we present evidence from a longitudinal telephone survey in Germany (n=1500) in which we experimentally manipulated the wording of the dependent questions and contrasted them with independent questions. We report differences in the responses collected by the different question types. Due to the concern that respondents may falsely confirm previous information as still applying, leading to underreporting of change in dependent interviewing, we also test hypotheses about how respondents answer such questions. In these tests, we focus on the roles played by personality, deliberate misreporting to shorten the interview, least effort strategies and cognitive ability in the response process to dependent questions. The paper provides evidence-based guidance on questionnaire design for panel surveys.
Joint work with Annette Jaeckle, University of Essex
Previous research has demonstrated that the way in which filter questions are asked can affect the responses given: respondents tend to give fewer answers which trigger additional questions when the filters are interleafed with the follow up questions than when the filters are asked all in a group. We extend this research to looped questions in which respondents are asked the same battery of questions about every full-time job they have held, or every degree they have received. Such looping questions are common in surveys which collect biographical histories, but little prior work has explored the best way to ask such questions. Like filter questions, looping questions can be asked in two formats: one which asks first how many full time jobs a person has held, and another which first asks about one job and then asks if the respondent has held another job. We call these two formats “how many” and “go again.” In this paper, we investigate whether the format effect that we find in filter questions also applies to these looping questions. Based on the filter question research, we expected to find reduced reporting in the “go again” format. To investigate the phenomenon, we use data from a recent web survey in Germany (n=1,068, AAPOR RR1=10.3%). We do find the expected effect. Exploiting a link between survey responses and administrative data which is available for more than half the sample, we also show that respondents in the “how many” condition give more accurate responses on the number of events, and those in the “go again” condition tend to underreport. However, there may be other reasons to prefer the “go again” format, as it allows respondents to discuss one event at a time. Our results provide guidance to questionnaire designers, survey practitioners and analysts of survey data.
Explore our comprehensive data analysis project presentation on predicting product ad campaign performance. Learn how data-driven insights can optimize your marketing strategies and enhance campaign effectiveness. Perfect for professionals and students looking to understand the power of data analysis in advertising. For more details visit: https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
Adjusting primitives for graph: SHORT REPORT / NOTES – Subhajit Sahu
Notes on primitives for graph algorithms, like PageRank. Compressed Sparse Row (CSR) is an adjacency-list based graph representation.
Multiply with different modes (map)
1. Performance of sequential vs OpenMP-based vector multiply.
2. Comparing various launch configs for CUDA-based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential vs OpenMP-based vector element sum.
2. Performance of memcpy-based vs in-place CUDA-based vector element sum.
3. Comparing various launch configs for CUDA-based vector element sum (memcpy).
4. Comparing various launch configs for CUDA-based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA-based vector element sum (in-place).
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2... – pchutichetpong
M Capital Group (“MCG”) expects demand to keep growing and supply to evolve, facilitated through institutional investment rotation out of offices and into work from home (“WFH”), while the need for data storage expands alongside global internet usage, with experts predicting 5.3 billion users by 2023. These market factors will be underpinned by technological changes, such as progressing cloud services and edge sites, allowing the industry to see strong expected annual growth of 13% over the next 4 years.
Whilst competitive headwinds remain, represented by the recent second bankruptcy filing of Sungard, which blames “COVID-19 and other macroeconomic trends including delayed customer spending decisions, insourcing and reductions in IT spending, energy inflation and reduction in demand for certain services”, the industry has seen key adjustments, and MCG believes that engineering cost management and technological innovation will be paramount to success.
MCG reports that the more favorable market conditions expected over the next few years, helped by the winding down of pandemic restrictions and a hybrid working environment, will drive market momentum forward. The continuous injection of capital by alternative investment firms, as well as the growing infrastructure investment from cloud service providers and social media companies, whose revenues are expected to grow over 3.6x by value by 2026, will likely help propel data center provision and innovation. These factors paint a promising picture for the industry players that offset rising input costs and adapt to new technologies.
According to M Capital Group: “Specifically, the long-term cost-saving opportunities available from the rise of remote managing will likely aid value growth for the industry. Through margin optimization and further availability of capital for reinvestment, strong players will maintain their competitive foothold, while weaker players exit the market to balance supply and demand.”
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23... – John Andrews
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
Response Rates Impact Data Quality, But not How you Might Think
1. www.rti.org | RTI International is a registered trademark and a trade name of Research Triangle Institute.
Response Rates Impact Data Quality,
but Not How you Might Think
Based on 2 papers:
Eckman, S. and Koch, A. “The Relationship between Response Rates, Sampling Method and Data Quality: Evidence from the European Social Survey.” Under review.
Eckman, S., Himelein, K. and Dever, J. “Innovative Sample Designs Using GIS Technology.” Forthcoming in Advances in Comparative Survey Methods: Multicultural, Multinational and Multiregional Contexts.
Stephanie Eckman, RTI Fellow
2. Motivation
Relationship between RR & Data Quality
High response rates signal data are good quality
Response rates uncorrelated with data quality
– High RR survey no more accurate than low (Keeter et al, 2000)
– Merkle & Edelman (2002)
– Groves & Peytcheva (2008)
4. RRs do not Correlate with Nonresponse Bias
Groves & Peytcheva 2008
5. Motivation
Relationship between RR & Data Quality
High response rates signal data are good quality
Response rates uncorrelated with data quality
– High RR survey not more accurate than low (Keeter et al, 2000)
– Merkle & Edelman (2002)
– Groves & Peytcheva (2008)
But maybe high response rates are a sign that data are crap?
6. Data Quality
Total Survey Error Framework
– Undercoverage
– Nonresponse
– Measurement error
– Editing error
– Processing error
– etc.
Misrepresentation error
– Undercoverage + Nonresponse
Tradeoff between undercoverage & NR
– Eckman & Kreuter 2017
Image: http://makeagif.com/dkjuuc
7. European Social Survey
7 waves
30+ countries
Central Committee sets standards
– Core questionnaire
– Minimum effective sample size
– Paradata collection
– Documentation
– Face to face attempts
– RR standard 70%
Our data: 136 country-rounds in first 6 waves
8. Sampling Methods in Analysis
Sampling Method       Includes             Field Staff Involvement in Selecting      n
                                           Household        Person
Individual Register                        None             None                     70
Household Register    Household Register   None             Interviewer              41
                      Address Register     None             Interviewer
Household Walk        Listing              Lister           Interviewer              25
                      Random Walk          Interviewer      Interviewer
10. 2 Measures of Data Quality
External measure:
– How different is ESS from Labor Force Survey?
– On 6 categorical variables: age, gender, HH size, marital status, etc.
– Index of dissimilarity measures how different 2 surveys are
– Average over 6 variables
– Assumes LFS is higher quality
Internal measure:
– 50% of all respondents from gender-heterogeneous couples should be women
– |I_{c,r}| > 1.96 indicates a significant deviation from 50%

I_{c,r} = (%female_{c,r} − 50) / √(50 · 50 / n_{c,r})

D_{c,r,v} = 0.5 · Σ_k | Ȳ^ESS_{c,r,v,k} − Ȳ^LFS_{c,r,v,k} |

(c = country, r = round, v = variable, k = response category)
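To make the two measures concrete, here is a minimal sketch in Python (my own illustration with made-up numbers, not the authors' code):

```python
import math

def dissimilarity(ess_shares, lfs_shares):
    """Index of dissimilarity between ESS and LFS category shares (in %)."""
    return 0.5 * sum(abs(e - l) for e, l in zip(ess_shares, lfs_shares))

def internal_measure(pct_female, n):
    """z-statistic for deviation from the expected 50% female among
    respondents living in gender-heterogeneous couples."""
    return abs(pct_female - 50) / math.sqrt(50 * 50 / n)

# Made-up example: household-size shares and a 56% female share with n=400
print(dissimilarity([30, 40, 20, 10], [28, 38, 22, 12]))  # 4.0
print(internal_measure(56.0, 400))  # 2.4 > 1.96, i.e. a significant deviation
```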
11. 2 DVs, 2 IVs
Dependent variables: misrepresentation error
– External measure
– Internal measure
Independent variables
– RR
– Sampling method
Joint effect of RR and sampling method on data quality
14. Implications
High RRs might signal that you have problems with your data
– When interviewers select samples
– Interviewers seem to manipulate selection process to keep RRs high
Note that the ESS implements random walk better than other surveys do
– Listing should be done by someone other than interviewer
Other problems with random walk
– Walker effects
– No probabilities of selection
15. Possible Solutions
What are some alternatives to random walk?
– Satellite Photos
– Reverse Geocoding
– Qibla Method
– Geosampling
– Listing with Drones
16. GIS Resources
Turn by turn directions on phone
Satellite images
– Daytime images
– Small-sat revolution
– Nighttime lights
Other remote sensing data
How can we exploit these resources for sampling?
– And avoid random walk's problems
22. Qibla Method
Qibla is Arabic for “in the direction of Mecca”
Given random starting coordinate
– Interviewer walks in the direction of Mecca
– Selects first HH encountered
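A small sketch of the geometry behind the method (my own illustration; the Kaaba coordinates and the helper name are my additions): given a random starting coordinate, compute the initial great-circle bearing toward Mecca, which the interviewer would then walk.

```python
import math

MECCA_LAT, MECCA_LON = 21.4225, 39.8262  # Kaaba coordinates

def qibla_bearing(lat, lon):
    """Initial bearing (degrees from north) from (lat, lon) toward Mecca."""
    phi1, phi2 = math.radians(lat), math.radians(MECCA_LAT)
    dlon = math.radians(MECCA_LON - lon)
    x = math.sin(dlon) * math.cos(phi2)
    y = (math.cos(phi1) * math.sin(phi2)
         - math.sin(phi1) * math.cos(phi2) * math.cos(dlon))
    return math.degrees(math.atan2(x, y)) % 360

print(qibla_bearing(9.03, 38.74))  # from Addis Ababa: a few degrees east of north
```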
25. Geosampling
Select first stage units
– Administrative units
– Or 1km squares
Select second stage units
– Smaller squares
Visit and interview all households in smaller unit
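A minimal sketch of the two-stage selection (my own illustration, assuming a square study area divided into a regular grid):

```python
import random

random.seed(42)

def geosample(n_first, n_second, grid=100, subgrid=10):
    """Stage 1: select n_first 1 km squares from a grid x grid study area.
    Stage 2: select n_second smaller squares within each selected square.
    All households found in the selected small squares are interviewed."""
    cells = [(i, j) for i in range(grid) for j in range(grid)]
    first_stage = random.sample(cells, n_first)
    subcells = [(a, b) for a in range(subgrid) for b in range(subgrid)]
    return [(cell, sub)
            for cell in first_stage
            for sub in random.sample(subcells, n_second)]

print(geosample(n_first=3, n_second=2))
```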
27. Geosampling: Second & Third Stage
Eliminates separate listing step
Still vulnerable to interviewer manipulation
Possible QC by interviewer GPS tracks? (Himelein et al, 2014)
28. Use of UAVs for Listing
RTI has tested listing from drone images
– Galapagos & Guatemala
Amer et al 2016
32. Conclusions
Ideal method:
– Removes influence of interviewer
– Results in equal probability sample of HUs
– With known probabilities
No alternative is perfect
– High involvement of interviewers
– High data requirements
Drones may prove useful
Going to explore connection between RR and data quality (UC + NR)
Data collection independent in each country
5 sample types used in ESS
Ordered here by interviewer involvement in selection (low to high)
R selection method via roster + kish table or via birthday method
Recoded into 3
Very few surveys reach 70%: 19% in group 1, 12% in group 2, 51% in group 3
Higher means worse quality
No easy way to put a std err on this
Purposefully using strong language (cause, effect)
Gonna do some prelim analyses and then get into models
Naïve linear regression lines
1 -- nearly all of the country-rounds using the individual register sampling method have low external measures: these samples have relatively low misrepresentation error.
2 -- country-rounds using the household register sampling methods have slightly higher measures on average (meaning the samples are less representative)
3 -- values are also high.
4 -- most fall inside the [0; 1:96] region, meaning that the observed deviations from 50% female may be explained entirely by sampling error.
5 -- 46% of the country-rounds show gender ratios that are significantly different from 50%.
6 -- 76% of the country-rounds show significant deviations from 50% female.
Other thing we’re doing – testing slopes of reg lines
-- sig and + in 3,6
-- others not sig
Random effects by country
Also tried:
RR in tertiles, quintiles, deciles – results unchanged
binary indicator of significant internal bias – results unchanged
Fixed effects models
External
-- strong effect of sample type (but no diff 2 vs 3)
-- no effect of RR
Internal
-- strong effect of sample type (but no diff 2 vs 3)
-- small + effect of RR in model 7: high RR -> high misrep
Sample type matters for data quality
RR does not
When there is no register
Assuming clusters already selected
Many of these solutions make use of GIS data
Planet has 149 satellites, images the entire earth every day
Other data: LIDAR, PhoDAR
To go back to the photo we looked at earlier…
Could number the structures and select from the image
Give interviewers an image showing selected units
Software figures out what closest structure is
Software or interviewer figures out what closest structure is
Probability of selection??
Any direction would work
Similar to reverse geocoding
Gallup interested in piloting this
Selection region for structure 1 shown in blue
Good in theory, but blue area depends on position of all other buildings – how do we know this?
This is similar to reverse geocoding
Many points lead to selections outside the area – what to do?
New problem we didn’t have in reverse geocoding
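The point about the blue selection region can be checked numerically: under nearest-structure selection, a structure's inclusion probability is the area of its Voronoi cell, which depends on where every other structure sits. A Monte Carlo sketch (mine, with hypothetical coordinates):

```python
import math
import random

random.seed(7)

# Hypothetical structure locations in a unit-square segment
structures = [(0.2, 0.3), (0.7, 0.8), (0.5, 0.1)]

counts = [0] * len(structures)
N = 100_000
for _ in range(N):
    p = (random.random(), random.random())  # random point in the segment
    nearest = min(range(len(structures)),
                  key=lambda i: math.dist(p, structures[i]))
    counts[nearest] += 1

# Each share approximates the area of that structure's Voronoi cell:
# the selection probabilities are unequal and depend on all other structures
print([round(c / N, 3) for c in counts])
```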
Challenges in Implementation
Satellite images incomplete, outdated, or unavailable
Satellite image resolution low and captures only rooftops
Difficult to determine if structure is a business, group quarters, vacant, controlled access, etc.
Environmental changes (landslides, etc.) and new buildings not captured
GPS accuracy varies across countries
Detailed rural road network not available in the majority of cases; accessibility issues due to elevation and natural barriers (e.g. ravines)
Improve analysis within Galapagos
Use local staff to extract information from drone imagery
Compare consistency between drones and Geo-listing
Estimate percent error across methodologies
Combine methodologies to improve/update imagery
Extend to Guatemala to explore some urban and rural settings
Assess use in conflict affected and fragile locations
Recommendations and guide to use for
Census updates
Sampling
Field work support
Not yet at the point where drones can replace humans in data collection!
How building looks on street view
How building looks from drone