Winter 2016 & Spring 2016
Research Report
Advisor: Professor Yves Atchadé
Student: Miao Wang
Effect of Sampling Protocol Modification on the Respondent-Driven Sampling Method
Abstract: The main goal of this research project is to understand how slight changes of sampling protocol affect the estimate in the context of Respondent-Driven Sampling (RDS). Two slight changes of protocol are explored: (1) asking each participant to provide all of their contacts, from which the researchers randomly pick one contact to follow; (2) given a known covariate related to the quantity of interest, asking each participant to provide that covariate and tilting the sampling preference according to it. Both scenarios were compared with the usual RDS process in terms of the mean squared error of the estimate and the sensitivity to seed participants. The project finds that the estimates from both scenario 1 and scenario 2 perform better in terms of a smaller mean squared error, but both show the same level of sensitivity to seed participants as the normal RDS procedure. This study may provide a new perspective on sampling, in that a change of sampling protocol can help reduce the variance and yield better estimates.
Keywords: sampling protocol, respondent-driven sampling, variance reduction
1. Introduction
Respondent-Driven Sampling (RDS) is widely used where it is too difficult to sample directly from the desired population. For example, if we are interested in the HIV infection rate among sex workers in the United States, there is no way to obtain a complete list of all sex workers. In this case, it is most practical to find one sex worker and ask her to recruit another worker she knows, who might or might not be infected with HIV. This sampling method of one respondent recruiting another is Respondent-Driven Sampling (RDS). An RDS sample constitutes a Markov chain Monte Carlo (MCMC) process (Goel & Salganik, 2009) and thus can produce valid statistical inference. Given that the Markov chain has an invariant distribution, after some mixing time the sample will follow that invariant distribution, and therefore the estimate constructed from the sample (excluding the samples before the mixing time) still approximates the true population mean. However, as Goel & Salganik (2009) point out, the RDS method suffers from a serious problem of large variance, so reducing the variance of the RDS estimate is essential.
The protocol of RDS is fairly standard: researchers interview participants and then give them some coupons, which they can pass on to their friends. Some of those friends will agree to participate and come back to the study with their coupons. The proposal of this research is to explore whether slight modifications of this sampling process, i.e., Modification 1 and Modification 2 below, can produce better estimates in terms of mean squared error and sensitivity to the seed.
Modification 1: randomly pick one contact
Instead of letting participants recruit their own friends, researchers ask the subjects to list all of their contacts, and the researchers randomly pick one contact to interview. The intuition behind this modification is as follows:
Most researchers think of a connection between individuals as either "yes" or "no": either you know someone or you don't. However, this black-or-white model neglects the magnitude of closeness. Knowing 20 sex workers does not imply that you would give a coupon to each of them with equal probability. In reality, you will probably ask someone you feel most comfortable talking to. The author believes that because people will most likely recommend someone they are close to, the sampling spreads towards the entire population only slowly. To push the spread more quickly, the sampling should, in a sense, favor the least close contacts.
… ↔ A ↔ B ↔ C ↔ D ↔ …
For example, suppose A is closest to B, somewhat close to C, and least close to D. If the researchers simply ask A to recommend a person, A will most likely recommend B and least likely D. However, if the researchers ask A to list all the people he or she knows and randomly pick one, then D is more likely to be chosen than in the previous scenario. This paper's proposal is that in this way the Markov chain can get through the bottleneck between subgroups more quickly (with a smaller sample size) and thus reduce the variance of the estimate.
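This intuition can be made concrete with a toy calculation (the closeness weights 9, 4, 1 are illustrative, matching the scale used later in the Method section): under closeness-proportional recruiting the distant contact D is reached with probability 1/14, while under a uniform random pick it rises to 1/3.

```r
# closeness weights of A's three contacts B, C, D (illustrative values)
w = c(B = 9, C = 4, D = 1)

# ordinary RDS: recruiting probability proportional to closeness
p_weighted = w / sum(w)

# mod.1: researchers pick one listed contact uniformly at random
p_uniform = rep(1 / length(w), length(w))

p_weighted["D"]   # 1/14, about 0.07
p_uniform[3]      # 1/3
```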
Modification 2: have a sampling preference according to some covariate
Suppose a study has a main research question, for example the infection rate among sex workers, and we know that some covariate has a strong correlation with that quantity; for example, we know that the lower the education a sex worker has, the more likely she is to be infected with HIV. Under Modification 2, researchers not only give coupons to participants but also give them extra rewards if their recruited friends have low education. In other words, researchers implement RDS in a way that someone with lower education is more likely to be chosen than the others.
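A minimal sketch of such a tilt (the variable names and the factor 2 are illustrative; the appendix implements the same idea by multiplying group-A connection weights by inf_a/inf_b):

```r
# four contacts of the current respondent, initially recruited with equal weight
weights = c(1, 1, 1, 1)
low_education = c(TRUE, FALSE, TRUE, FALSE)  # the known covariate

tilt = 2  # extra preference given to low-education contacts
tilted = weights * ifelse(low_education, tilt, 1)
probs = tilted / sum(tilted)
probs   # 1/3 1/6 1/3 1/6: low-education contacts are twice as likely
```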
The intuition of weighting the sampling according to some covariate is as follows. According to Markov chain theory, the samples from RDS, although dependent, all follow the same distribution π (the long-run proportion of time spent in each state), and thus the estimate of the infection rate is

p̂ = ( Σ_{i=0}^{n−1} f(X_i)/W_{X_i} ) / ( Σ_{i=0}^{n−1} 1/W_{X_i} ),

where

f(x) = 1 if person x is infected with HIV, and f(x) = 0 if person x is not infected,

W_A = Σ_{x in A} Σ_{y in population} W(x, y) for a set A (so W_x is person x's total connection weight),

π_x = W_x / W_population.
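The estimator above can be computed directly from a recruited sample; a minimal R sketch (the function name and the toy data are illustrative, but the formula is the one used by Test_MCMC_one in the appendix):

```r
# inverse-weight estimator: p_hat = sum(f(X_i)/W_{X_i}) / sum(1/W_{X_i})
# sample_ids: indices of the recruited individuals X_0, ..., X_{n-1}
# infect:     0/1 infection indicator f(x) for every person in the population
# W:          total connection weight W_x of every person
rds_estimate = function(sample_ids, infect, W) {
  w_inv = 1 / W[sample_ids]
  sum(infect[sample_ids] * w_inv) / sum(w_inv)
}

# toy population of 4 people, persons 1 and 2 infected
infect = c(1, 1, 0, 0)
W = c(2, 4, 4, 2)
rds_estimate(c(1, 3, 2, 4), infect, W)   # 0.5
```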
2. Method
The performance of Modification 1 (mod.1) is tested by simulation: design an artificial population, randomly infect a portion of it, and perform both ordinary RDS (the control) and mod.1 in order to compare the mean squared error and seed sensitivity of p̂. The methodology for Modification 2 (mod.2) follows almost the same procedure. The only differences lie in the design of the social network structure and the mathematical representation of the two modifications.
Population Design for mod.1
To echo the reasoning behind mod.1 in the introduction, the population design for mod.1 should have two properties: (1) the connections between people have different levels of closeness; (2) there exist subgroups in the population such that connectivity inside a subgroup is stronger than connectivity across subgroups.
In practice, a population of size 5000 was created, with 1250 people in subgroup A and 3750 in subgroup B. For simplicity, every individual inside a subgroup was assigned 2 close friends, 2 friends, and 2 acquaintances:

Person_{i−3} — Person_{i−2} — Person_{i−1} — Person_i — Person_{i+1} — Person_{i+2} — Person_{i+3}

For every person i, persons i−1 and i+1 are his close friends, persons i−2 and i+2 are his friends, and persons i−3 and i+3 are his acquaintances. The "close friend", "friend", and "acquaintance" relationships are represented by different numerical values in the network relationship matrix. We also assume that person i has no connection with the rest of the population. Then, 200 people from A and 200 people from B were randomly chosen and given one-to-one relationships represented by the lowest level of connection.
Therefore, the relation matrix W looks like the following.
If x, y are in the same group:
W(x, y) = 9, if |x − y| = 1
W(x, y) = 4, if |x − y| = 2
W(x, y) = 1, if |x − y| = 3
W(x, y) = 0, if |x − y| > 3
If x, y are in different groups:
W(x, y) = 1, if x, y are chosen to be connected
W(x, y) = 0, otherwise
where 9, 4, 1, 0 correspond to "close friend", "friend", "acquaintance", and "stranger".
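For a small group this banded pattern can be generated directly from the distance |x − y|; a sketch (this simple version ignores the wrap-around used in the appendix's Square_matirx construction):

```r
# within-group weights: 9, 4, 1 for distance 1, 2, 3; 0 otherwise
n = 8
scale = c(9, 4, 1)
W = matrix(0, n, n)
for (x in 1:n) for (y in 1:n) {
  d = abs(x - y)
  if (d >= 1 && d <= 3) W[x, y] = scale[d]
}
W[1, 1:4]        # 0 9 4 1
isSymmetric(W)   # TRUE
```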
If we assume that participants recruit their friends based on closeness, then the sample of ordinary RDS (the control) follows a Markov chain with transition matrix K, where

K(x, y) = W(x, y) / Σ_{y=1}^{5000} W(x, y).

However, if researchers ask participants to write down all their contacts and then randomly pick one, every connection, regardless of closeness, has an equal chance of being selected. Therefore, for mod.1 the transition matrix becomes

K(x, y) = W̃(x, y) / Σ_{y=1}^{5000} W̃(x, y), where

W̃(x, y) = 1, if W(x, y) > 0
W̃(x, y) = 0, if W(x, y) = 0.
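Both transition matrices are row-normalizations of a weight matrix, so they can be built in two lines; a sketch on a toy 3-person network (the weights are illustrative):

```r
# toy weight matrix for 3 people
W = matrix(c(0, 9, 1,
             9, 0, 4,
             1, 4, 0), nrow = 3, byrow = TRUE)

row_normalize = function(M) M / rowSums(M)

K_control = row_normalize(W)            # K(x,y) = W(x,y) / sum_y W(x,y)
K_mod1    = row_normalize((W > 0) * 1)  # uses the indicator W~(x,y)

K_control[1, ]   # 0.0 0.9 0.1: strongly favors the closest contact
K_mod1[1, ]      # 0.0 0.5 0.5: every contact equally likely
```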
Population Design for mod.2
Mod.2 requires a known covariate h(x) that is correlated with the infection function f(x), where x represents any individual in the population. In this paper, the two subgroups are deliberately designed so that subgroup A has a distinguishably higher infection rate than subgroup B. In this context, mod.2 means that researchers will increase the likelihood of recruiting someone from group A.
A population of size 5000 was then created, with a portion in subgroup A and the rest in subgroup B. Every pair of individuals in the network is connected with some probability; for simplicity a connection is represented as 1 and no connection as 0:
W(x, y) = 1, if x, y are connected
W(x, y) = 0, if x, y are not connected
It is important to realize that infected people are usually 4–5 times more likely than others to be connected, and people in the same subgroup are more likely to be connected as well. In order to reflect the different connectivity among the four types of members in the population (Infected A, Non-Infected A, Infected B, Non-Infected B), we need to control 10 parameters for the probability of connection between these four types.
p(W(x, y) = 1) = p_Aa, if x ∈ Infected A and y ∈ Healthy A, or x ∈ Healthy A and y ∈ Infected A,
and similarly for the remaining 9 cases.
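Each of the 10 cases fills one block of the adjacency matrix; instead of looping element by element as in the appendix, a block can be drawn in one vectorized call (the sizes and probability here are hypothetical):

```r
set.seed(1)
# hypothetical block: connections between n_x members of one type
# and n_y members of another, each edge present with probability p_xy
n_x = 5; n_y = 8; p_xy = 0.3
block = matrix(rbinom(n_x * n_y, size = 1, prob = p_xy),
               nrow = n_x, ncol = n_y)
dim(block)            # 5 8
all(block %in% 0:1)   # TRUE
```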
The 10 parameters should be tuned so that the heatmap of the network has the strongest heat in blocks 5 and 10, some heat in blocks 1, 8, and 4, and not much heat in the rest of the map (left graph below); the right graph shows the actual result from the parameters chosen in this experiment. After choosing proper parameters, I modify the population connectivity only by multiplying all 10 parameters by the same quantity, so that the relative ratios of the parameters stay the same.
● Sensitivity to seed
The sensitivity of p̂ to the seed is measured by the standard deviation of a large number (1000) of p̂ values generated from a rather small sample size (200). It turns out that the standard deviation for mod.1 is smaller than for the control, indicating less sensitivity.
However, the difference is not significant in a practical sense. As mentioned by Heckathorn (1997), as the recruiting goes on, the demographic distribution of the sample converges towards the distribution in the population and thus stabilizes. Therefore, the practical benefit of less sensitivity to the seed, meaning a shorter mixing time of the Markov chain, is to reduce cost by using a smaller sample size while achieving the same estimation accuracy (see Graph 3).
4. Conclusion
To summarize, mod.1's essential intuition is to encourage participants to recruit someone they are less familiar with, in order to speed up the spread of the sampling, while mod.2's essential procedure is to guide the sampling towards the subgroups with higher values, or more observations, of the research question.
The simulation above indicates that both mod.1 and mod.2 have potential for reducing the estimate's variance. However, it is crucial to identify the circumstances under which either modification provides an edge over normal RDS, because the simulation relies heavily on the design of the population network and the choice of parameters.
First, mod.1 should be considered when the connectivity inside subgroups is clearly stronger than the connectivity across subgroups; in other words, when there are somewhat isolated social groups. Also, as mentioned by Heckathorn (1997), mod.1 might not be applicable when the hidden population suffers from social judgement (sex workers, drug injectors), because participants may feel threatened or be unwilling to provide information about others.
Second, mod.2 should be considered when the research question has remarkably diverse answers across subgroups. It also requires that the study have a clear research question and beforehand knowledge of the diverse characteristics of the population. Last but not least, connectivity is also assumed to be correlated with people sharing similar values of the research question. For example, in the experiment above, the research question is the HIV infection rate. We not only assumed that connectivity within subgroups is stronger, we also assumed that an infected person is 4–5 times more likely to be connected with another infected person. This is a key assumption for reaching the conclusion that mod.2 has a smaller mean squared error.
Appendix
1. R code for simulation, Modification 1:
1.1 constructing Kernel Matrix
Pop_size = 5000
inf_pop = 0.5
######## within group parameter ######
Scale = c(9,4,1,0)
####### Infect Parameter ######
inf_a = 0.5
P_a = 1/4
##### Between group Parameter ######
scale_connect = 1
n_connect = 100
######################
A_size = Pop_size * P_a
B_size = Pop_size - A_size
A_Infect = inf_a * A_size
B_Infect = Pop_size*inf_pop - A_Infect
A = Square_matirx(A_size,Scale)
B = Square_matirx(B_size,Scale)
M1 = Connect(A,B,n_connect,scale_connect)
W1 = Weight_v(M1)
A2 = Square_matirx(A_size,c(1,1,1,0))
B2 = Square_matirx(B_size,c(1,1,1,0))
M2 = Connect(A2,B2,n_connect,1)
W2 = Weight_v(M2)
############################################
#### Function
############################################
### The Square Matrix ###
# n: dimension;
# Scale: vector of connection weights, from the closest relation down to 0 (e.g. c(9,4,1,0))
Square_matirx = function(n,Scale){
k = length(Scale); v = Scale
M = matrix(NA,nrow = n, ncol = n)
M[1,] = c(0, v, numeric(n - 2*k - 1), rev(v))
for ( i in 2:k){
M[i,] = c(v[i-1], M[i-1,][-n])
}
for (i in (k+1):(n-k)){
M[i,] = c(0, M[i-1,][-n])
}
for (i in (n-k+1):n){
M[i,] = c(rev(v)[i-n+k], M[i-1,][-n])
}
return (M)
}
### Combine two Matrix ###
# A, B two matrix
# n: the dimension of output matrix
Combine_matrix = function(A,B){
a = nrow(A); b = nrow(B)
A_0 = matrix(0,nrow=a,ncol=b)
A = cbind(A,A_0)
B_0 = matrix(0,nrow=b,ncol=a)
B = cbind(B_0,B)
return(rbind(A,B))
}
### Connect between A, B ###
# A,B two matrix
# n: number of one to one connect between A,B
# k: scale of connection between A,B
Connect = function(A,B,n,k){
M = Combine_matrix(A,B)
a = nrow(A); m = nrow(M)
candid = sample(x = 1:a,size = n,replace=F)
for ( i in candid){
j = sample(x=(a+1):m,size = 1)
M[i,j] = k; M[j,i] = k
}
return(M)
}
### Weight Matrix ###
Weight_v = function(Pop){
W = rowSums(Pop)
return (W)
}
1.2 deciding infect population
Infect = numeric(Pop_size)
Infect[sample(x = 1:A_size,size = A_Infect,replace = F)] = 1
Infect[sample(x = (A_size + 1) :Pop_size,size = B_Infect,replace = F)] = 1
1.3 Test two Method
MCMC_one = function(M,seed,N){
n = nrow(M); Res = numeric(N)
Res[1] = seed
for ( i in 2:N){
m = M[Res[i-1],]
k = m/sum(m)
p = k[which( m > 0)]
v = which( m > 0)
Res[i] = sample(v,1,F,p)}
return(Res)
}
Test_MCMC_one = function(M,seed,N,W){
dt = MCMC_one(M,seed,N)
infect = Infect[dt]
Sum_Wxi = sum(infect*1/W[dt])
Norm = sum(1/W[dt])
return(1/Norm * Sum_Wxi)
}
##################
##################
N = 1000
rep = 200
pi_1 = W1/sum(W1)
pi_2 = W2/sum(W2)
seed1 = sample(1:Pop_size,rep,T,prob = pi_1)
seed2 = sample(1:Pop_size,rep,T,prob = pi_2)
p_hat1 = numeric(rep)
p_hat2 = numeric(rep)
for ( i in 1:rep){
p_hat1[i] = Test_MCMC_one(M1,seed1[i],N,W1)
p_hat2[i] = Test_MCMC_one(M2,seed2[i],N,W2)
}
inf_pop
mean(p_hat1)
mean(p_hat2)
sqrt(mean((p_hat1 - inf_pop)^2)) # RMSE for method 1
sqrt(mean((p_hat2 - inf_pop)^2)) # RMSE for method 2
# sensitivity
n = 200; seed_n = 1000
seed = sample(1:Pop_size,seed_n)
A = c(); B =c()
for ( i in 1:seed_n){
# Test_MCMC_one_mixingtime, K1, and K2 are not defined in this appendix
A = c(A,Test_MCMC_one_mixingtime(K1,seed[i],n,W1))
B = c(B,Test_MCMC_one_mixingtime(K2,seed[i],n,W2))
}
sqrt(mean((A - sum(Infect)/Pop_size)^2)) # RMSE for method 1
sqrt(mean((B - sum(Infect)/Pop_size)^2)) # RMSE for method 2
## Compare the converge speed of the demographic distribution
N = 500
rep = 500
seed = sample(1:Pop_size,rep)
Sample1 = matrix(0,nrow = rep, ncol = N)
Sample2 = matrix(0,nrow = rep, ncol = N)
for ( i in 1:rep){
Sample1[i,] = MCMC_one(M1,seed[i],N)
Sample2[i,] = MCMC_one(M2,seed[i],N)
}
inf_pro = function(vector,Infect){
x = Infect[vector]
pro = numeric(length(x))
for ( i in 1:length(x)){
pro[i] = sum(x[1:i])/i
}
return(pro)
}
pro1 = matrix(0,nrow = rep, ncol = N)
pro2 = matrix(0,nrow = rep, ncol = N)
for ( i in 1:rep){
pro1[i,] = inf_pro(Sample1[i,],Infect)
pro2[i,] = inf_pro(Sample2[i,],Infect)
}
plot(pro1[3,], xlab= "sample size", ylab = "infection rate", type = 'l',main = "Graph 3: converging speed of the sample ")
lines(pro2[3,],col = "red")
legend("topright",col=c("black","red"),legend = c("control","mod.1"), lty = c(1,1))
converg_time = function(vector,err){
n = 1
dist = max(vector) - min(vector)
while(dist > err*2){
vector = vector[-1]
dist = max(vector) - min(vector)
n = n + 1
}
return(n)
}
err = 0.001
time1 = numeric(rep)
time2 = numeric(rep)
for ( i in 1:rep){
time1[i] = converg_time(pro1[i,],err)
time2[i] = converg_time(pro2[i,],err)
}
sum(time1 == 500)
sum(time2 == 500)
mean(time1); mean(time2)
sd(time1); sd(time2)
2. R code for simulation in Modification 2
2.1 Network Construction
#### Population Parameter ####
Total_pop = 500
pro_a = 1/4
#### Infection Parameter ####
inf_a = 0.8
inf_b = 0.4
#### Network Parameter ####
# Define p of a connection between four type:
# A(non infect A), a(infect A), B(non infect B), b(infect B)
# In total 10 type of connection
P_AA = 0.01 #1
P_Aa = 0.0001 #2
P_AB = 0.0005 #3
P_Ab = 0.00025 #4
P_aa = 0.04 #5
P_aB = 0.00025 #6
P_ab = 0.02 #7
P_BB = 0.01 #8
P_Bb = 0.0025 #9
P_bb = 0.04 #10
### Generate Population ###
n_AT = floor(Total_pop*pro_a)
n_BT = Total_pop - n_AT
n_A = floor(n_AT*(1-inf_a))
n_a = n_AT - n_A
n_B = floor(n_BT*(1-inf_b))
n_b = n_BT - n_B
Network = matrix(0,nrow=Total_pop,ncol = Total_pop)
#1
for (i in 1:n_A){
for (j in 1:n_A){
Network[i,j] = as.numeric(runif(1) < P_AA)
}
}
#2
for (i in 1:n_A){
for (j in (n_A + 1):(n_A + n_a)){
Network[i,j] = as.numeric(runif(1) < P_Aa)
}
}
#3
for (i in 1:n_A){
for (j in (n_A + n_a + 1):(n_A + n_a + n_B)){
Network[i,j] = as.numeric(runif(1) < P_AB)
}
}
#4
for (i in 1:n_A){
for (j in (n_A + n_a + n_B + 1):(n_A + n_a + n_B + n_b)){
Network[i,j] = as.numeric(runif(1) < P_Ab)
}
}
#5
for (i in (n_A + 1):(n_A + n_a)){
for (j in (n_A + 1):(n_A + n_a)){
Network[i,j] = as.numeric(runif(1) < P_aa)
}
}
#6
for (i in (n_A + 1):(n_A + n_a)){
for (j in (n_A + n_a + 1):(n_A + n_a + n_B)){
Network[i,j] = as.numeric(runif(1) < P_aB)
}
}
#7
for (i in (n_A + 1):(n_A + n_a)){
for (j in (n_A + n_a + n_B + 1):(n_A + n_a + n_B + n_b)){
Network[i,j] = as.numeric(runif(1) < P_ab)
}
}
#8
for (i in (n_A + n_a + 1):(n_A + n_a + n_B)){
for (j in (n_A + n_a + 1):(n_A + n_a + n_B)){
Network[i,j] = as.numeric(runif(1) < P_BB)
}
}
#9
for (i in (n_A + n_a + 1):(n_A + n_a + n_B)){
for (j in (n_A + n_a + n_B + 1):(n_A + n_a + n_B + n_b)){
Network[i,j] = as.numeric(runif(1) < P_Bb)
}
}
#10
for (i in (n_A + n_a + n_B + 1):(n_A + n_a + n_B + n_b)){
for (j in (n_A + n_a + n_B + 1):(n_A + n_a + n_B + n_b)){
Network[i,j] = as.numeric(runif(1) < P_bb)
}
}
Network[lower.tri(Network,diag = T)] = 0
Network = (Network + t(Network))
isSymmetric(Network)
any(rowSums(Network) == 0)
sum(rowSums(Network) == 0)
which(rowSums(Network) == 0)
Network[which(rowSums(Network) == 0),sample(1:Total_pop,sample(1:2,1))] = 1
M1 = Network
W1 = rowSums(M1)
M2 = M1
for ( i in 1:Total_pop){
M2[i,] = c(M2[i,1:(n_A+n_a)] * (inf_a/inf_b),
M2[i, (n_A+n_a+1):Total_pop])
}
W2 = rowSums(M2)
Infect = c(numeric(n_A),rep(1,n_a),numeric(n_B),rep(1,n_b))
cor(W1,Infect)
cor(W2,Infect)
summary(rowSums(M1))
summary(rowSums(M2))
2.2 Heatmap
### Heat map, testing for small population
source("http://www.phaget4.org/R/myImagePlot.R")
ID_names = c(rep("A",n_A),rep("a",n_a),rep("B",n_B),rep("b",n_b))
colnames(Network) = ID_names
rownames(Network) = ID_names
myImagePlot(Network)
2.3 Test
inf = inf_a*pro_a + inf_b*(1-pro_a)  # true infection rate
#### Testing function
Recru = function(M,seed,N){
n = nrow(M); Res = numeric(N)
Res[1] = seed
for ( i in 2:N){
m = M[Res[i-1],]
p = m[which(m > 0)]
p = p/sum(p)
v = which(m > 0)
if (length(v) > 1){
Res[i] = sample(v,1,T,p)
}
else{Res[i] = v}
}
return(Res)
}
Test = function(M,seed,N,W){
dt = Recru(M,seed,N)
infect = Infect[dt]
Sum_Wxi = sum(infect*1/W[dt])
Norm = sum(1/W[dt])
return(1/Norm * Sum_Wxi)
}
#### Test 1
# Recruit a sample of N people from M1, M2 with the same seed
N = 500
seed = sample(1:Total_pop,1)
Sample1 = Recru(M1,seed, N)
Sample2 = Recru(M2, seed, N)
mat_Sample1 = matrix(0, nrow = N, ncol = 4)
colnames(mat_Sample1) = c("#A at t","#a at t","#B at t","#b at t")
rownames(mat_Sample1) = 1:N
mat_Sample2 = matrix(0, nrow = N, ncol = 4)
colnames(mat_Sample2) = c("#A at t","#a at t","#B at t","#b at t")
rownames(mat_Sample2) = 1:N
for ( i in 1:N){
mat_Sample1[i,2] = sum((Sample1[1:i] > n_A) & (Sample1[1:i] <= (n_A + n_a)))
mat_Sample1[i,4] = sum((Sample1[1:i] > (n_B + n_A + n_a)) & (Sample1[1:i] <= (n_B + n_A + n_a + n_b)))
mat_Sample1[i,1] = sum(Sample1[1:i] < (n_A) + (n_a)) - mat_Sample1[i,2]
mat_Sample1[i,3] = sum(Sample1[1:i] > (n_A) + (n_a)) - mat_Sample1[i,4]
mat_Sample2[i,2] = sum((Sample2[1:i] > n_A) & (Sample2[1:i] <= (n_A + n_a)))
mat_Sample2[i,4] = sum((Sample2[1:i] > (n_B + n_A + n_a)) & (Sample2[1:i] <= (n_B + n_A + n_a + n_b)))
mat_Sample2[i,1] = sum(Sample2[1:i] < (n_A) + (n_a)) - mat_Sample2[i,2]
mat_Sample2[i,3] = sum(Sample2[1:i] > (n_A) + (n_a)) - mat_Sample2[i,4]
}
for ( i in 1:N){
mat_Sample1[i,] = mat_Sample1[i,]/i
mat_Sample2[i,] = mat_Sample2[i,]/i
}
par(mfrow = c(2,1))
plot(mat_Sample1[,1], col = "blue", type = "l",
ylim = c(0,max(max(mat_Sample2),max(mat_Sample1))), xlab = "n th sample wave",
ylab = "% type by nth wave", main = "M1, n= 500, seed is B")
lines(mat_Sample1[,2], col = "red")
lines(mat_Sample1[,3], col = "green")
lines(mat_Sample1[,4], col = "pink")
plot(mat_Sample2[,1], col = "blue", type = "l",
ylim = c(0,max(max(mat_Sample2),max(mat_Sample1))), xlab = "n th sample wave",
ylab = "% type by nth wave", main = "M2, n = 500, seed is B")
lines(mat_Sample2[,2], col = "red")
lines(mat_Sample2[,3], col = "green")
lines(mat_Sample2[,4], col = "pink")
#### Test 2
## measure the theoretical variance of p_hat
N = 1000
rep = 100
pi_1 = W1/sum(W1)
pi_2 = W2/sum(W2)
seed1 = sample(1:Total_pop,rep,T,prob = pi_1)
seed2 = sample(1:Total_pop,rep,T,prob = pi_2)
p_hat1 = numeric(rep)
p_hat2 = numeric(rep)
for ( i in 1:rep){
p_hat1[i] = Test(M1,seed1[i],N,W1)
p_hat2[i] = Test(M2,seed2[i],N,W2)
}
inf
mean(p_hat1)
mean(p_hat2)
sqrt(mean((p_hat1 - inf)^2)) # RMSE for method 1
sqrt(mean((p_hat2 - inf)^2)) # RMSE for method 2
sd(p_hat1)
sd(p_hat2)
par(mfrow = c(2,1))
plot(p_hat1,xlab = "n th p_hat", ylab = "p_hat value",
main = "M1,N = 1000,rep = 100",
type = "l",ylim = c(0.35,0.85))
abline(h = inf, col = "blue", lty = 2)
abline(h = mean(p_hat1) + sd(p_hat1), col = "orange", lty = 6)
abline(h = mean(p_hat1) - sd(p_hat1), col = "orange", lty = 6)
plot(p_hat2,xlab = "n th p_hat", ylab = "p_hat value",
main = "M2,N = 1000,rep = 100",
type = "l", ylim = c(0.35,0.85))
abline(h = inf, col = "blue", lty = 2)
abline(h = mean(p_hat2) + sd(p_hat2), col = "orange", lty = 6)
abline(h = mean(p_hat2) - sd(p_hat2), col = "orange", lty = 6)
### Test 3
#### Sensitivity to the seed
## compare sd(1), sd(2)
N = 100
rep = 300
seed = sample(1:Total_pop, rep)
p_hat1 = numeric(rep)
p_hat2 = numeric(rep)
for ( i in 1:rep){
p_hat1[i] = Test(M1,seed[i],N,W1)
p_hat2[i] = Test(M2,seed[i],N,W2)
}
inf
mean(p_hat1)
mean(p_hat2)
sqrt(mean((p_hat1 - inf)^2)) # RMSE for method 1
sqrt(mean((p_hat2 - inf)^2)) # RMSE for method 2
sd(p_hat1)
sd(p_hat2)
## Compare the converge speed of the demographic distribution
N = 500
rep = 500
seed = sample(1:Total_pop,rep)
Sample1 = matrix(0,nrow = rep, ncol = N)
Sample2 = matrix(0,nrow = rep, ncol = N)
for ( i in 1:rep){
Sample1[i,] = Recru(M1,seed[i],N)
Sample2[i,] = Recru(M2,seed[i],N)
}
inf_pro = function(vector,Infect){
x = Infect[vector]
pro = numeric(length(x))
for ( i in 1:length(x)){
pro[i] = sum(x[1:i])/i
}
return(pro)
}
pro1 = matrix(0,nrow = rep, ncol = N)
pro2 = matrix(0,nrow = rep, ncol = N)
for ( i in 1:rep){
pro1[i,] = inf_pro(Sample1[i,],Infect)
pro2[i,] = inf_pro(Sample2[i,],Infect)
}
plot(pro1[1,])
converg_time = function(vector,err){
n = 1
dist = max(vector) - min(vector)
while(dist > err*2){
vector = vector[-1]
dist = max(vector) - min(vector)
n = n + 1
}
return(n)
}
err = 0.001
time1 = numeric(rep)
time2 = numeric(rep)
for ( i in 1:rep){
time1[i] = converg_time(pro1[i,],err)
time2[i] = converg_time(pro2[i,],err)
}
sum(time1 == 500)
sum(time2 == 500)
mean(time1); mean(time2)
sd(time1); sd(time2)
References
Goel, S., & Salganik, M. J. (2009). Respondent-driven sampling as Markov chain Monte Carlo. Statistics in Medicine, 28(17), 2202-2229. doi:10.1002/sim.3613
Heckathorn, D. D. (1997). Respondent-driven sampling: A new approach to the study of hidden populations. Social Problems, 44(2), 174-199. doi:10.1525/sp.1997.44.2.03x0221m
Klovdahl, A. S. (1989). Urban social networks: Some methodological problems and possibilities. In M. Kochen (Ed.), The Small World (pp. 176-210). Norwood, NJ: Ablex.