- 1. Respondent Driven Sampling & Network Sampling with Memory (time permitting…) M. Giovanna Merli Sanford School of Public Policy & Duke Population Research Institute (DUPRI) Duke University
- 2. Funding Acknowledgements • RDS Data Collection in China (2009-2010) – “Place-RDS Comparison Study” • USAID under the terms of cooperative agreements GPO-A-00-03-00003-00 and GPO-A-00-09-00003-0 (Weir, PI) • China National Center for STD Control (Chen, PI) • Duke CFAR AI064518 (Merli, PI) – “Partnership for Social Science Research on HIV/AIDS in China” • NICHD R24 HD056670 (Henderson, PI) • RDS Data Analyses and Simulations (2011-2015) – “Using Multiple Data Sources to Improve RDS Estimation” • NICHD R01HD068523 (Merli, PI) • NSM Data Collection in Tanzania – PFirst Award/DGHI (Merli, PI)
- 3. Problems with the study of hidden populations • Female sex workers, men who have sex with men, injecting drug users, homeless people, and undocumented migrants are hidden populations • For these populations we typically want to: – Obtain accurate and precise estimates of disease prevalence – Discern impact on larger population health dynamics – Identify gaps in HIV/STD prevention • Collecting representative data from hidden populations is difficult because no sampling frame exists: – Members are hard to identify – Stigma – Nonresponse – Lack of trust – Rarity
- 4. Problems with the study of hidden populations • Convenience samples, clinic-based inquiries, and sampling frames with limited coverage (e.g. venue-based sampling) provide no basis for inferring representativeness
- 5. Respondent Driven Sampling (RDS) Heckathorn 1997, 2002; Salganik and Heckathorn 2004; Volz and Heckathorn 2008 • Most popular solution to problems of sampling hidden populations – 450+ studies – 624+ papers, 10k+ citations – Over $185 million from NIH • Compare to “ego centric” – 167 studies funded – $42 million since 1990
- 6. How RDS works • RDS is primarily used to estimate population proportions of binary nodal covariates (e.g. gender, infection status, tier of sex work, etc.) • Leverages the social network of respondents to recruit other respondents • Chain referral / peer recruitment / link tracing sampling strategy – “Seed” participants (selected by convenience) receive coupons (2) – Recruit 2-3 new participants each – Each new respondent is given 2-3 coupons to recruit others – Incentives are paid for participating and for successful recruitment – No one participates more than once – Process continues until the desired sample size is obtained
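
The recruitment steps above can be sketched as a small simulation. This is only an illustrative sketch under stated assumptions: the ring-shaped toy network, the function names `ring_network` and `simulate_rds`, and the choice of two coupons per respondent are hypothetical and are not taken from the study. Recruitment here is non-preferential (every unsampled contact is equally likely), which is the ideal case rather than observed field behavior.

```python
import random


def ring_network(n, k=2):
    """Hypothetical toy network: each node is tied to its k nearest neighbours on each side."""
    return {i: [(i + d) % n for d in range(-k, k + 1) if d != 0] for i in range(n)}


def simulate_rds(adjacency, seeds, coupons=2, target_size=50, rng=None):
    """Simulate coupon-based chain referral in which nobody participates twice.

    Returns (recruiter, recruit) pairs in recruitment order; seeds have recruiter None.
    """
    rng = rng or random.Random()
    sampled = set(seeds)
    recruitments = [(None, seed) for seed in seeds]
    queue = list(seeds)  # respondents who still hold coupons
    while queue and len(sampled) < target_size:
        recruiter = queue.pop(0)
        # Assumption: non-preferential recruitment among contacts not yet sampled.
        eligible = [v for v in adjacency[recruiter] if v not in sampled]
        rng.shuffle(eligible)
        for recruit in eligible[:coupons]:
            if len(sampled) >= target_size:
                break
            sampled.add(recruit)
            recruitments.append((recruiter, recruit))
            queue.append(recruit)
    return recruitments


network = ring_network(200)
chain = simulate_rds(network, seeds=[0, 100], coupons=2, target_size=50,
                     rng=random.Random(1))
print(len(chain), "participants; first recruitments:", chain[:4])
```

Replacing the uniform shuffle with a rule that favors contacts of a particular type would be one way to mimic preferential recruitment.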
- 15. Problems with estimation in link tracing sampling designs of hidden populations • Sampling frame unavailable • Sample inclusion probabilities are not known (hence sampling weights unknown) • Researchers have limited control of the sampling process • Seed respondents not chosen at random
- 16. RDS solution • Sampling probabilities computed under an approximation of the true sampling process – RDS assumes non-seed participants are Sampled with Probability Proportional to (self-reported) Degree (SPPD) – Provable in a random walk on most graphs of interest – Sampling probabilities approximated by degree, hence sampling weight = 1/degree • Weighting/estimation can yield asymptotically unbiased estimates of the population mean • The SPPD assumption underpins most RDS estimation claims
- 17. RDS estimators

  | Estimator | Proportion equation | Notes |
  | --- | --- | --- |
  | Naïve | $\hat{p} = \left(\sum_{i \in \chi} x_i\right) n^{-1}$ | $x_i$ is the value of the focal variable for respondent $i$; $n$ is the sample size |
  | RDS1-SH | $\hat{p} = S_{0,1}\, d_0 \left(S_{0,1}\, d_0 + S_{1,0}\, d_1\right)^{-1}$ | $S_{a,b}$ is the estimated proportion of recruitments from group $a$ to $b$; $d_a$ is the estimated average degree in each group (Salganik and Heckathorn 2004) |
  | RDS1-LEN | $\hat{p} = S^{ego}_{0,1}\, d_0 \left(S^{ego}_{0,1}\, d_0 + S^{ego}_{1,0}\, d_1\right)^{-1}$ | $S^{ego}_{a,b}$ is the estimated proportion of network ties from group $a$ to $b$ based on ego network reports (Lu 2013) |
  | RDS2-VH | $\hat{p} = \left(\sum_{i \in \chi} x_i\, d_i^{-1}\right) \left(\sum_{i \in \chi} d_i^{-1}\right)^{-1}$ | $d_i^{-1}$ is the inverse of self-reported degree for person $i$ (Volz and Heckathorn 2008) |

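
To make the estimators on this slide concrete, here is a minimal sketch of the naïve, RDS1-SH, and RDS2-VH proportion estimators. The function names and the toy data are hypothetical and chosen only for illustration. The RDS1-SH function takes the recruitment proportions and average degrees as given inputs; estimating those quantities from recruitment data is not shown.

```python
def naive_estimate(x):
    """Unweighted sample proportion of a binary focal variable."""
    return sum(x) / len(x)


def rds2_vh_estimate(x, degree):
    """RDS2-VH: weight each respondent by the inverse of self-reported degree."""
    if any(d <= 0 for d in degree):
        raise ValueError("self-reported degrees must be positive")
    weights = [1.0 / d for d in degree]
    return sum(w * xi for w, xi in zip(weights, x)) / sum(weights)


def rds1_sh_estimate(s01, s10, d0, d1):
    """RDS1-SH proportion from cross-group recruitment shares and mean degrees.

    s01 / s10: share of recruitments from group 0 to 1 / from group 1 to 0.
    d0 / d1: estimated average degree in group 0 / group 1.
    """
    return (s01 * d0) / (s01 * d0 + s10 * d1)


# Hypothetical toy sample: members of group 1 report much larger networks.
x = [1, 1, 0, 0]
degree = [10, 10, 2, 2]
print("naive:  ", naive_estimate(x))
print("RDS2-VH:", round(rds2_vh_estimate(x, degree), 4))
print("RDS1-SH:", round(rds1_sh_estimate(s01=0.2, s10=0.4, d0=5, d1=5), 4))
```

In this toy sample the naïve proportion is 0.50, while inverse-degree weighting pulls the estimate down to about 0.17, because high-degree respondents are assumed to be over-sampled.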
- 18. In RDS, all approximations are subject to critical assumptions that are often not met in the field • About the unobserved sample recruitment process (most crucial) – Respondent gives a coupon to a friend – Respondents recruit new participants non-preferentially from amongst their social contacts (each friend has an equal chance of being picked) – The initial set of respondents (“seeds”) are drawn with random probabilities – Respondents report their number of ties accurately (“How many people do you know who are members of the population of interest?”) • About the social network structure – Rapid mixing: The chain referral process converges very quickly to the stationary distribution of a random walk (i.e. node selection probabilities are independent of sample starting point) – Connectedness: The target population must be connected by a network that consists of a single component – Network size: Network must be sufficiently large (sampling fraction small) that sampling without replacement can be treated as equivalent to sampling with replacement
- 19. Prior evaluations of RDS • Comparison of RDS estimates to known parameters of non-hidden populations – (Wejnert 2009; Wejnert & Heckathorn 2008; McCreesh et al. 2012) • Test effects of violating RDS assumptions about social network structure on synthetic populations – (Gile & Handcock 2010; Thomas & Gile 2011; Lu et al. 2011) • Examine effects of network structure in multiple empirical settings with theoretical/ideal RDS samples – (Goel & Salganik 2010; Mouw & Verdery 2012; Verdery, Mouw et al. 2015) • Use full information on participants’ recruitment behavior to evaluate non-preferential recruitment assumption – (Yamanis, Merli, Neely et al. Sociological Methods and Research 2013)
- 20. RDS evaluation in the context of Female Sex Workers in Liuzhou, China • Evaluate SPPD assumption and population coverage (Merli, Moody, Smith et al., 2015 Social Science and Medicine) • Evaluate performance of RDS estimators (Verdery, Merli, Moody et al., 2015 Epidemiology) • Propose RDS data collection innovation to improve estimator performance (Verdery, Merli, Moody, In Progress) • Evaluations with a simulation approach grounded in empirical data from a hidden population of FSWs in China (Liuzhou, Guangxi Province) (Weir, Merli, Li et al. 2012, Sexually Transmitted Infections)
- 21. Data • Two sources – RDS: 583 FSWs (Oct. 2009 – Feb. 2010) (about 8% of total FSW population in Liuzhou) – PLACE (venue based sampling approach): 161 FSWs (Nov. 2009 – Mar. 2010) • Same target population and inclusion definition – Women who reside in Liuzhou who exchanged sex for money in last 4 weeks • Same geographic area and similar time period • Same measurement of key variables – Test for biomarker of lifetime exposure to syphilis and core questionnaire • Same face-to-face interview and common applicant pool for interviewers • Rare to have two concurrent surveys in same population!
- 22. Description of the Liuzhou RDS sample (Fisher and Merli 2014, Network Science)

  | Tier of sex work | Venues where clients are solicited | RDS (N = 576) |
  | --- | --- | --- |
  | High | Karaoke bars, star hotels, discos, night clubs | 250 |
  | Middle | Hair salons, saunas, massage parlors, foot cleaning/massage, bathhouses | 268 |
  | Low | Streets, parks, other public spaces | 27 |
  | Non-venue based | Telephone, text, internet, private referrals | 31 |

- 23. Approach, part 1 • Construct “population social network” from data collected in RDS and PLACE – Used new methodologies for estimating social network parameters and simulating population network • Use Case Control Logistic Regression to estimate homophily parameters from the RDS data (Smith, SM 2012) • Use Exponential Random Graph Modeling to generate full network from local structural features (ERGM; Handcock et al., JOSS 2008) – Tested various sensitivities about the means by which this population social network is constructed • (which data source, venue size estimates, and assumptions about geographic distribution of social network ties)
- 24. “Population social network” • Generate “population characteristics” based on PLACE survey estimates • Add “population social network” based on RDS survey estimates
- 25. Approach, part 2 • Simulate RDS chains over “population social network” (1000 per recruitment scenario) – Scenarios vary according to different sample recruitment assumptions • Seeding of the chain • Recruitment patterns – How much does the ideal case (random seeding and random recruitment) diverge from actual RDS seeding and recruitment matched to the Liuzhou FSW data?
- 26. Results: Violation of SPPD assumption • Compared individual degree to the proportion of times an individual was sampled across the simulated chains – Very high correlation when seeds and referrals are random – SPPD assumption increasingly violated when seeds & referrals are matched to the actual data – Over-recruitment of middle tier sex workers drives the result • Figure panels report correlations of r = 0.82, r = 0.96, and r = 0.97 • For more: – Merli, Moody, Smith et al., Social Science & Medicine, 2015
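
The comparison described on this slide (individual degree against how often an individual is sampled) can be illustrated for the ideal case only. The sketch below runs a plain random walk with replacement on a hypothetical toy network and correlates each node's degree with its share of visits. The network construction (`divisor_network`), the walk length, and the helper names are assumptions made for illustration. This is not the authors' simulation over the Liuzhou population network, and it does not reproduce the matched-recruitment scenarios.

```python
import random


def divisor_network(n):
    """Hypothetical connected toy network with uneven degrees.

    Nodes i < j are tied when (i + 1) divides j, so node 0 is tied to everyone.
    """
    adjacency = {i: [] for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if j % (i + 1) == 0:
                adjacency[i].append(j)
                adjacency[j].append(i)
    return adjacency


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def visit_shares(adjacency, steps, rng):
    """Share of random-walk steps spent at each node (sampling with replacement)."""
    visits = {node: 0 for node in adjacency}
    current = rng.choice(list(adjacency))
    for _ in range(steps):
        current = rng.choice(adjacency[current])  # move to a uniformly chosen contact
        visits[current] += 1
    return {node: count / steps for node, count in visits.items()}


network = divisor_network(40)
shares = visit_shares(network, steps=100_000, rng=random.Random(7))
nodes = sorted(network)
r = pearson([len(network[v]) for v in nodes], [shares[v] for v in nodes])
print(f"correlation between degree and sampling share: {r:.3f}")
```

Under random referral the correlation should be close to 1, which is the behavior the SPPD assumption relies on.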
- 27. Distribution of RDS2-VH proportion estimates (low/middle tier) across seeding and recruitment scenarios (Verdery, Merli, Moody et al. 2015, Epidemiology)
- 28. Variability of estimates: Design effects (ratio of variance in RDS estimates to variance in estimates from same size SRS) • DE very large, but not out of line with findings of prior work (Goel and Salganik 2010) • Large design effects imply that much larger sample sizes would be required to reach the level of precision currently assumed from RDS samples, which are typically in the hundreds • CDC recommends RDS sample sizes in the hundreds for public health surveillance – IMPLICATIONS: Not sufficient power to identify changes in behaviors or disease prevalence

  | Design effect | DemDem | DemRan | RanRan |
  | --- | --- | --- | --- |
  | Middle tier | 6.18 | 19.60 | 28.20 |

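
The design effect defined on this slide, and what it implies for precision, can be sketched as follows. The function names and the three toy estimates are hypothetical. The effective sample size calculation (n divided by DE) is a standard rule of thumb rather than something stated on the slide, and applying it to the 583-respondent RDS sample is purely illustrative.

```python
from statistics import variance


def srs_variance(p, n):
    """Variance of a sample proportion under simple random sampling of size n."""
    return p * (1 - p) / n


def design_effect(estimates, p, n):
    """Ratio of the variance of simulated RDS estimates to the SRS variance."""
    return variance(estimates) / srs_variance(p, n)


def effective_sample_size(n, de):
    """Size of the simple random sample that would give the same precision."""
    return n / de


# Hypothetical toy estimates from three simulated chains of size 100.
toy_de = design_effect([0.4, 0.5, 0.6], p=0.5, n=100)
print("toy design effect:", round(toy_de, 2))

# Design effects reported on the slide for the middle tier, by scenario.
for scenario, de in [("DemDem", 6.18), ("DemRan", 19.60), ("RanRan", 28.20)]:
    print(scenario, "effective n for 583 respondents:",
          round(effective_sample_size(583, de), 1))
```

Read this way, a design effect of 6.18 means a sample of 583 carries roughly the information of a simple random sample of about 94.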
- 29. Discussion • Seeding and recruitment scenarios – Matching on seeds not critical – Matching on recruitment patterns has a larger effect: it exacerbates biases but reduces design effects • Problematic because recruitment seems harder to control than seed selection
- 30. Estimator performance • Estimator development – Only one (RDS1-LEN) works markedly better than others • Robust to preferential recruitment by taking into account respondents’ ego-network composition – BUT unusable for most (unobservable) characteristics we care about – Still problems with variance estimation • Figure: Distributions of estimates of proportions in low tiers of sex work by estimator (recruitment and seeds matched to the Liuzhou FSW data) (Verdery, Merli, Moody et al. 2015, Epidemiology)
- 31. Recent innovation: IP-RDS (Verdery, Merli, Moody, In Progress) • What can be done to improve the performance of RDS estimates while retaining the method’s desirable peer-driven sample recruitment properties? • Modify RDS data collection process • Apply antithetic variate mean estimator to data • Results from simulations: Improved estimation performance
- 32. New data collection protocol IP-RDS • Incentivize respondents to invert their recruitment preferences on the recruitment biasing variable (e.g. tier of sex work) when choosing new respondents
- 36. Antithetic variate mean estimator • $\hat{\mu}_{AV} = \frac{\sum_{i \in m_1} y_i}{2} + \frac{\sum_{i \in m_2} y_i}{2}$, where $y_i$ is the value of the focal variable for the $i$th respondent, $m_1$ is the count of recruitments by members of one group of the recruitment biasing variable (e.g. tier of sex work), and $m_2$ is the count of recruitments by members of the other group
- 37. Distributions of estimates of proportions in low/mid tiers of sex work by estimator (naïve mean, RDS2-VH, AV-IP_RDS) and level of biased recruitment behavior (absolute difference in recruitment probabilities conditional on attribute of targeted peer)
- 38. Discussion of IP-RDS • Simple change to RDS protocol – May or may not require financial incentives for targeted recruitment (empirical question) • Outperforms conventional estimators – Gains in bias reduction comparable to RDS1-LEN estimator • Tested on more networks (similar results) • BUT not yet field tested
- 39. Network Sampling with Memory • Mouw and Verdery 2012, Sociological Methodology • Collects network data • Introduces researcher’s control over the sampling process • Directs the recruitment process to more efficiently explore the network (avoiding bottlenecks)
- 40. How does NSM work? • Recruitment starts with a few seed respondents • Network roster data collected from respondents about minimally identifying information of their network members (last name and last four digits of cell phone number) to connect nodes in the network (up to 10 network members per respondent) • NSM sampling algorithm selects up to 3 nominated network members per respondent and asks respondents for full contact information on these • Process proceeds iteratively to recruit new waves of respondents
- 42. How does NSM work? • NSM sampling algorithm uses two sampling modes, List and Search • List mode – keeps a list, L, of all nominated network members – samples with replacement from L – even sampling of new nodes -- new nodes sampled at the same cumulative sampling rate as earlier nodes – as list of sampled nodes approaches the full population network, NSM sample converges to simple random sampling
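
A heavily simplified sketch of the list mode described on this slide follows. It keeps a list L of nominated network members, draws from L with replacement, and adds the nominations of each newly interviewed respondent. Everything else is an assumption made for illustration: the toy ring network, the function names, and the uniform draw. In particular, the sketch omits search mode and the even-sampling adjustment that gives new nodes the same cumulative sampling rate as earlier ones, so it should not be read as the NSM algorithm of Mouw and Verdery (2012).

```python
import random


def ring_network(n, k=2):
    """Hypothetical toy network: each node is tied to its k nearest neighbours on each side."""
    return {i: [(i + d) % n for d in range(-k, k + 1) if d != 0] for i in range(n)}


def list_mode_sample(adjacency, seeds, n_draws, max_nominations=10, rng=None):
    """Simplified list mode: draw with replacement from the list of nominated members."""
    rng = rng or random.Random()
    roster, on_roster = [], set()      # the list L and a fast membership check
    interviewed, seen = [], set()      # respondents in order of first interview

    def interview(node):
        if node in seen:
            return                     # re-drawn respondents add no new information
        seen.add(node)
        interviewed.append(node)
        for alter in adjacency[node][:max_nominations]:
            if alter not in on_roster:
                on_roster.add(alter)
                roster.append(alter)

    for seed in seeds:
        interview(seed)
    draws = []
    for _ in range(n_draws):
        if not roster:
            break
        node = rng.choice(roster)      # assumption: uniform draw, no even-sampling weights
        draws.append(node)
        interview(node)
    return draws, interviewed, roster


network = ring_network(30)
draws, interviewed, roster = list_mode_sample(network, seeds=[0], n_draws=2000,
                                              rng=random.Random(3))
print(len(roster), "of", len(network), "population members are on the list")
```

Once the list covers the whole toy population, each further draw is a uniform draw from the population, which illustrates the convergence to simple random sampling noted on the slide.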
- 43. How does NSM work? • Search mode—look for “bridge” nodes to unexplored parts of the network. Start in search mode, then switch to list mode.
- 44. Simulation results • Test NSM vs. RDS using 162 university and school networks from Facebook and Add Health • Size of networks ranges from 300 to 16,500 nodes • Estimate % white (Add Health) and % first-year students (Facebook) • Start from a randomly selected student, repeat 500 times for each network • Calculate bias, design effects and mean absolute bias • Test set (162 networks): DE is 1.16 for NSM vs. 77.38 for RDS
- 45. Is it feasible? • Is it feasible to collect network data on hidden populations? • 2010 NSIT (Network Survey of Immigration and Transnationalism) (Mouw, PI) • CAHS (Chinese in Africa Health Survey) (Merli, PI) • Cost effectiveness of gains in precision
- 46. NSM field applications • Network Survey of Immigration and Transnationalism (NSIT): Mouw et al. 2014, Social Problems; Verdery et al. 2016, Social Networks • Chinese in Africa Health Survey (CAHS): Merli, Verdery, Mouw, Li 2016, Migration Studies • Figure legend: red = RDU, blue = Mexico, green = Houston; small nodes = nominated, large nodes = sampled • Figure caption: Network of Chinese migrants in Dar es Salaam sampled by NSM; node size = probability of selecting next node
- 47. Key challenge: Getting referrals from respondents • NSIT required recontacting respondents to get contact information on alters • CAHS -- “forward” sampling variant (FNSM)— more practical – Asked for contact information on a small number of alters at each interview (selected by NSM algorithm)
- 48. NSM -- Future directions • NIH R21 grant to test NSM among Chinese immigrants in RDU (Merli, Mouw, Verdery, Moody, Keister, Sanders) – Pilot various approaches to get referrals from respondents – Evaluate NSM against ACS – Test multiple modes of data collection (in-person, telephone, web)