Winter 2016 & Spring 2016
Research Report
Advisor: Professor Yves Atchadé
Student: Miao Wang
Effect of Sampling Protocol Modification on the Respondent-Driven Sampling Method
Abstract: The main goal of this research project is to understand how slight changes of sampling protocol affect the estimate in the context of Respondent-Driven Sampling (RDS). Two slight changes of protocol are explored: (1) asking each participant to provide all of their contacts, from which the researchers randomly pick one contact to follow; (2) given a known covariate related to the quantity of interest, asking each participant to provide that covariate and tilting the sampling preference according to it. Both scenarios were compared with the usual RDS process in terms of the mean squared error of the estimate and the sensitivity to seed participants. The project finds that the estimates from both scenario 1 and scenario 2 perform better in terms of a smaller mean squared error, but both show the same level of sensitivity to seed participants as the normal RDS procedure. This study may provide a new perspective on sampling, in that a change of sampling protocol can help reduce the variance and yield better estimates.
Keywords: sampling protocol, respondent-driven sampling, variance reduction
1. Introduction
Respondent-Driven Sampling (RDS) is widely used where it is too difficult to sample directly from the desired population. For example, if we are interested in the HIV infection rate among sex workers in the United States, there is no way to obtain a complete list of all sex workers. In this case, it is most practical to find one sex worker and ask her to recruit another worker she knows, who might or might not be infected with HIV. This sampling method of one respondent recruiting another is Respondent-Driven Sampling (RDS). An RDS sample constitutes a Markov chain Monte Carlo (MCMC) process (Goel & Salganik, 2009) and thus can produce valid statistical inference. Given that the Markov chain has an invariant distribution, after some mixing time the sample will follow that invariant distribution, and therefore the estimate constructed from the sample (excluding the samples before the mixing time) still approximates the true population mean. However, as Goel & Salganik (2009) point out, the RDS method suffers from a serious problem of large variance, so reducing the variance of the RDS estimate is essential.
The protocol of RDS is fairly standard: researchers interview participants and then give them some coupons, which they can pass on to their friends. Some of those friends will agree to participate and come back to the study with their coupons. The proposal of this research is to explore whether slight modifications of this sampling process, i.e., Modification 1 and Modification 2 below, can produce better estimates in terms of mean squared error and sensitivity to the seed.
Modification 1: randomly pick one contact
Instead of letting participants recruit their own friends, researchers ask the subjects to list all of their contacts, and the researchers randomly pick one contact to interview. The intuition behind this modification is as follows:
Most researchers think of a connection between individuals as either "yes" or "no": either you know someone or you don't. However, this black-or-white model neglects the magnitude of closeness. Knowing 20 sex workers does not imply that you would give a coupon to each of them with equal probability. In reality, you will probably ask someone you feel most comfortable talking to. The author believes that because people will most likely recommend someone they are close to, the sampling spreads towards the entire population only slowly. To push the spread more quickly, the sampling should, in a sense, favor the least close contacts.
… ↔ A ↔ B ↔ C ↔ D ↔ …
For example, suppose A is closest to B, somewhat close to C, and least close to D. If the researchers simply ask A to recommend a person, A will most likely recommend B and least likely D. However, if the researchers ask A to list all the people he or she knows and randomly pick one, then D is more likely to be chosen than in the previous scenario. This paper's proposal is that in this way the Markov chain can get through the bottleneck between subgroups more quickly (with a smaller sample size) and thus reduce the variance of the estimate.
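This intuition can be made concrete with a toy calculation (the closeness weights 9, 4, 1 are illustrative, matching the scale used later in the Method section): under closeness-proportional recruiting the distant contact D is reached with probability 1/14, while under a uniform random pick it rises to 1/3.

```r
# closeness weights of A's three contacts B, C, D (illustrative values)
w = c(B = 9, C = 4, D = 1)

# ordinary RDS: recruiting probability proportional to closeness
p_weighted = w / sum(w)

# mod.1: researchers pick one listed contact uniformly at random
p_uniform = rep(1 / length(w), length(w))

p_weighted["D"]   # 1/14, about 0.07
p_uniform[3]      # 1/3
```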
Modification 2: have a sampling preference according to some covariate
Suppose a study has a main research question, for example the infection rate among sex workers, and we know that some covariate has a strong correlation with that quantity; for example, we know that the lower the education a sex worker has, the more likely she is to be infected with HIV. Under Modification 2, researchers not only give coupons to participants but also give them extra rewards if their recruited friends have low education. In other words, researchers implement RDS in a way that someone with lower education is more likely to be chosen than the others.
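A minimal sketch of such a tilt (the variable names and the factor 2 are illustrative; the appendix implements the same idea by multiplying group-A connection weights by inf_a/inf_b):

```r
# four contacts of the current respondent, initially recruited with equal weight
weights = c(1, 1, 1, 1)
low_education = c(TRUE, FALSE, TRUE, FALSE)  # the known covariate

tilt = 2  # extra preference given to low-education contacts
tilted = weights * ifelse(low_education, tilt, 1)
probs = tilted / sum(tilted)
probs   # 1/3 1/6 1/3 1/6: low-education contacts are twice as likely
```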
The intuition of weighting the sampling according to some covariate is as follows. According to Markov chain theory, the samples from RDS, although dependent, all follow the same distribution π (the long-run proportion of time spent in each state), and thus the estimate of the infection rate is

p̂ = ( Σ_{i=0}^{n−1} f(X_i)/W_{X_i} ) / ( Σ_{i=0}^{n−1} 1/W_{X_i} ),

where

f(x) = 1 if person x is infected with HIV, and f(x) = 0 if person x is not infected,

W_A = Σ_{x in A} Σ_{y in population} W(x, y) for a set A (so W_x is person x's total connection weight),

π_x = W_x / W_population.
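The estimator above can be computed directly from a recruited sample; a minimal R sketch (the function name and the toy data are illustrative, but the formula is the one used by Test_MCMC_one in the appendix):

```r
# inverse-weight estimator: p_hat = sum(f(X_i)/W_{X_i}) / sum(1/W_{X_i})
# sample_ids: indices of the recruited individuals X_0, ..., X_{n-1}
# infect:     0/1 infection indicator f(x) for every person in the population
# W:          total connection weight W_x of every person
rds_estimate = function(sample_ids, infect, W) {
  w_inv = 1 / W[sample_ids]
  sum(infect[sample_ids] * w_inv) / sum(w_inv)
}

# toy population of 4 people, persons 1 and 2 infected
infect = c(1, 1, 0, 0)
W = c(2, 4, 4, 2)
rds_estimate(c(1, 3, 2, 4), infect, W)   # 0.5
```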
2. Method
The performance of Modification 1 (mod.1) is tested by simulation: design an artificial population, randomly infect a portion of it, and perform both ordinary RDS (the control) and mod.1 in order to compare the mean squared error and seed sensitivity of p̂. The methodology for Modification 2 (mod.2) follows almost the same procedure. The only differences lie in the design of the social network structure and the mathematical representation of the two modifications.
Population Design for mod.1
To echo the reasoning behind mod.1 in the introduction, the population design for mod.1 should have two properties: (1) the connections between people have different levels of closeness; (2) there exist subgroups in the population such that connectivity inside a subgroup is stronger than connectivity across subgroups.
In practice, a population of size 5000 was created, with 1250 people in subgroup A and 3750 in subgroup B. For simplicity, every individual inside a subgroup was assigned 2 close friends, 2 friends, and 2 acquaintances:

Person_{i−3} — Person_{i−2} — Person_{i−1} — Person_i — Person_{i+1} — Person_{i+2} — Person_{i+3}

For every person i, persons i−1 and i+1 are his close friends, persons i−2 and i+2 are his friends, and persons i−3 and i+3 are his acquaintances. The "close friend", "friend", and "acquaintance" relationships are represented by different numerical values in the network relationship matrix. We also assume that person i has no connection with the rest of the population. Then, 200 people from A and 200 people from B were randomly chosen and given one-to-one relationships represented by the lowest level of connection.
Therefore, the relation matrix W looks like the following.
If x, y are in the same group:
W(x, y) = 9, if |x − y| = 1
W(x, y) = 4, if |x − y| = 2
W(x, y) = 1, if |x − y| = 3
W(x, y) = 0, if |x − y| > 3
If x, y are in different groups:
W(x, y) = 1, if x, y are chosen to be connected
W(x, y) = 0, otherwise
where 9, 4, 1, 0 correspond to "close friend", "friend", "acquaintance", and "stranger".
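For a small group this banded pattern can be generated directly from the distance |x − y|; a sketch (this simple version ignores the wrap-around used in the appendix's Square_matirx construction):

```r
# within-group weights: 9, 4, 1 for distance 1, 2, 3; 0 otherwise
n = 8
scale = c(9, 4, 1)
W = matrix(0, n, n)
for (x in 1:n) for (y in 1:n) {
  d = abs(x - y)
  if (d >= 1 && d <= 3) W[x, y] = scale[d]
}
W[1, 1:4]        # 0 9 4 1
isSymmetric(W)   # TRUE
```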
If we assume that participants recruit their friends based on closeness, then the sample of ordinary RDS (the control) follows a Markov chain with transition matrix K, where

K(x, y) = W(x, y) / Σ_{y=1}^{5000} W(x, y).

However, if researchers ask participants to write down all their contacts and then randomly pick one, every connection, regardless of closeness, has an equal chance of being selected. Therefore, for mod.1 the transition matrix becomes

K(x, y) = W̃(x, y) / Σ_{y=1}^{5000} W̃(x, y), where

W̃(x, y) = 1, if W(x, y) > 0
W̃(x, y) = 0, if W(x, y) = 0.
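Both transition matrices are row-normalizations of a weight matrix, so they can be built in two lines; a sketch on a toy 3-person network (the weights are illustrative):

```r
# toy weight matrix for 3 people
W = matrix(c(0, 9, 1,
             9, 0, 4,
             1, 4, 0), nrow = 3, byrow = TRUE)

row_normalize = function(M) M / rowSums(M)

K_control = row_normalize(W)            # K(x,y) = W(x,y) / sum_y W(x,y)
K_mod1    = row_normalize((W > 0) * 1)  # uses the indicator W~(x,y)

K_control[1, ]   # 0.0 0.9 0.1: strongly favors the closest contact
K_mod1[1, ]      # 0.0 0.5 0.5: every contact equally likely
```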
Population Design for mod.2
Mod.2 requires a known covariate h(x) that is correlated with the infection function f(x), where x represents any individual in the population. In this paper, the two subgroups are deliberately designed so that subgroup A has a distinguishably higher infection rate than subgroup B. In this context, mod.2 means that researchers will increase the likelihood of recruiting someone from group A.
A population of size 5000 was then created, with a portion in subgroup A and the rest in subgroup B. Every pair of individuals in the network is connected with some probability; for simplicity a connection is represented as 1 and no connection as 0:
W(x, y) = 1, if x, y are connected
W(x, y) = 0, if x, y are not connected
It is important to realize that infected people are usually 4–5 times more likely than others to be connected, and people in the same subgroup are more likely to be connected as well. In order to reflect the different connectivity among the four types of members in the population (Infected A, Non-Infected A, Infected B, Non-Infected B), we need to control 10 parameters for the probability of connection between these four types.
p(W(x, y) = 1) = p_Aa, if x ∈ Infected A and y ∈ Healthy A, or x ∈ Healthy A and y ∈ Infected A,
and similarly for the remaining 9 cases.
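Each of the 10 cases fills one block of the adjacency matrix; instead of looping element by element as in the appendix, a block can be drawn in one vectorized call (the sizes and probability here are hypothetical):

```r
set.seed(1)
# hypothetical block: connections between n_x members of one type
# and n_y members of another, each edge present with probability p_xy
n_x = 5; n_y = 8; p_xy = 0.3
block = matrix(rbinom(n_x * n_y, size = 1, prob = p_xy),
               nrow = n_x, ncol = n_y)
dim(block)            # 5 8
all(block %in% 0:1)   # TRUE
```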
The 10 parameters should be tuned so that the heatmap of the network has the strongest heat in blocks 5 and 10, some heat in blocks 1, 8, and 4, and not much heat in the rest of the map (left graph below); the right graph shows the actual result from the parameters chosen in this experiment. After choosing proper parameters, I modify the population connectivity only by multiplying all 10 parameters by the same quantity, so that the relative ratios of the parameters stay the same.
● Sensitivity to seed
The sensitivity of p̂ to the seed is measured by the standard deviation of a large number (1000) of p̂ values generated from a rather small sample size (200). It turns out that the standard deviation for mod.1 is smaller than for the control, indicating less sensitivity.
However, the difference is not significant in a practical sense. As mentioned by Heckathorn (1997), as the recruiting goes on, the demographic distribution of the sample converges towards the distribution in the population and thus stabilizes. Therefore, the practical benefit of less sensitivity to the seed, meaning a shorter mixing time of the Markov chain, is to reduce cost by using a smaller sample size while achieving the same estimation accuracy (see Graph 3).
4. Conclusion
To summarize, mod.1's essential intuition is to encourage participants to recruit someone they are less familiar with, in order to speed up the spread of the sampling, while mod.2's essential procedure is to guide the sampling towards the subgroups with higher values, or more observations, of the research question.
The simulation above indicates that both mod.1 and mod.2 have potential for reducing the estimate's variance. However, it is crucial to identify the circumstances under which either modification provides an edge over normal RDS, because the simulation relies heavily on the design of the population network and the choice of parameters.
First, mod.1 should be considered when the connectivity inside subgroups is clearly stronger than the connectivity across subgroups; in other words, when there are somewhat isolated social groups. Also, as mentioned by Heckathorn (1997), mod.1 might not be applicable when the hidden population suffers from social judgement (sex workers, drug injectors), because participants may feel threatened or be unwilling to provide information about others.
Second, mod.2 should be considered when the research question has remarkably diverse answers across subgroups. It also requires that the study have a clear research question and beforehand knowledge of the diverse characteristics of the population. Last but not least, connectivity is also assumed to be correlated with people sharing similar values of the research question. For example, in the experiment above, the research question is the HIV infection rate. We not only assumed that connectivity within subgroups is stronger, we also assumed that an infected person is 4–5 times more likely to be connected with another infected person. This is a key assumption for reaching the conclusion that mod.2 has a smaller mean squared error.
Appendix
1. R code for simulation, Modification 1:
1.1 constructing Kernel Matrix
Pop_size = 5000
inf_pop = 0.5
######## within group parameter ######
Scale = c(9,4,1,0)
####### Infect Parameter ######
inf_a = 0.5
P_a = 1/4
##### Between group Parameter ######
scale_connect = 1
n_connect = 100
######################
A_size = Pop_size * P_a
B_size = Pop_size - A_size
A_Infect = inf_a * A_size
B_Infect = Pop_size*inf_pop - A_Infect
A = Square_matirx(A_size,Scale)
B = Square_matirx(B_size,Scale)
M1 = Connect(A,B,n_connect,scale_connect)
W1 = Weight_v(M1)
A2 = Square_matirx(A_size,c(1,1,1,0))
B2 = Square_matirx(B_size,c(1,1,1,0))
M2 = Connect(A2,B2,n_connect,1)
W2 = Weight_v(M2)
############################################
#### Function
############################################
### The Square Matrix ###
# n: dimension;
# Scale: vector of connection weights, from the closest relation down to 0 (e.g. c(9,4,1,0))
Square_matirx = function(n,Scale){
k = length(Scale); v = Scale
M = matrix(NA,nrow = n, ncol = n)
M[1,] = c(0, v, numeric(n - 2*k - 1), rev(v))
for ( i in 2:k){
M[i,] = c(v[i-1], M[i-1,][-n])
}
for (i in (k+1):(n-k)){
M[i,] = c(0, M[i-1,][-n])
}
for (i in (n-k+1):n){
M[i,] = c(rev(v)[i-n+k], M[i-1,][-n])
}
return (M)
}
### Combine two Matrix ###
# A, B two matrix
# n: the dimension of output matrix
Combine_matrix = function(A,B){
a = nrow(A); b = nrow(B)
A_0 = matrix(0,nrow=a,ncol=b)
A = cbind(A,A_0)
B_0 = matrix(0,nrow=b,ncol=a)
B = cbind(B_0,B)
return(rbind(A,B))
}
### Connect between A, B ###
# A,B two matrix
# n: number of one to one connect between A,B
# k: scale of connection between A,B
Connect = function(A,B,n,k){
M = Combine_matrix(A,B)
a = nrow(A); m = nrow(M)
candid = sample(x = 1:a,size = n,replace=F)
for ( i in candid){
j = sample(x=(a+1):m,size = 1)
M[i,j] = k; M[j,i] = k
}
return(M)
}
### Weight Matrix ###
Weight_v = function(Pop){
W = rowSums(Pop)
return (W)
}
1.2 deciding infect population
Infect = numeric(Pop_size)
Infect[sample(x = 1:A_size,size = A_Infect,replace = F)] = 1
Infect[sample(x = (A_size + 1) :Pop_size,size = B_Infect,replace = F)] = 1
1.3 Test two Method
MCMC_one = function(M,seed,N){
n = nrow(M); Res = numeric(N)
Res[1] = seed
for ( i in 2:N){
m = M[Res[i-1],]
k = m/sum(m)
p = k[which( m > 0)]
v = which( m > 0)
Res[i] = sample(v,1,F,p)}
return(Res)
}
Test_MCMC_one = function(M,seed,N,W){
dt = MCMC_one(M,seed,N)
infect = Infect[dt]
Sum_Wxi = sum(infect*1/W[dt])
Norm = sum(1/W[dt])
return(1/Norm * Sum_Wxi)
}
##################
##################
N = 1000
rep = 200
pi_1 = W1/sum(W1)
pi_2 = W2/sum(W2)
seed1 = sample(1:Pop_size,rep,T,prob = pi_1)
seed2 = sample(1:Pop_size,rep,T,prob = pi_2)
p_hat1 = numeric(rep)
p_hat2 = numeric(rep)
for ( i in 1:rep){
p_hat1[i] = Test_MCMC_one(M1,seed1[i],N,W1)
p_hat2[i] = Test_MCMC_one(M2,seed2[i],N,W2)
}
inf_pop
mean(p_hat1)
mean(p_hat2)
sqrt(mean((p_hat1 - inf_pop)^2)) # RMSE for method 1
sqrt(mean((p_hat2 - inf_pop)^2)) # RMSE for method 2
# sensitivity
n = 200; seed_n = 1000
seed = sample(1:Pop_size,seed_n)
A = c(); B =c()
for ( i in 1:seed_n){
# Test_MCMC_one_mixingtime, K1, and K2 are not defined in this appendix
A = c(A,Test_MCMC_one_mixingtime(K1,seed[i],n,W1))
B = c(B,Test_MCMC_one_mixingtime(K2,seed[i],n,W2))
}
sqrt(mean((A - sum(Infect)/Pop_size)^2)) # RMSE for method 1
sqrt(mean((B - sum(Infect)/Pop_size)^2)) # RMSE for method 2
## Compare the converge speed of the demographic distribution
N = 500
rep = 500
seed = sample(1:Pop_size,rep)
Sample1 = matrix(0,nrow = rep, ncol = N)
Sample2 = matrix(0,nrow = rep, ncol = N)
for ( i in 1:rep){
Sample1[i,] = MCMC_one(M1,seed[i],N)
Sample2[i,] = MCMC_one(M2,seed[i],N)
}
inf_pro = function(vector,Infect){
x = Infect[vector]
pro = numeric(length(x))
for ( i in 1:length(x)){
pro[i] = sum(x[1:i])/i
}
return(pro)
}
pro1 = matrix(0,nrow = rep, ncol = N)
pro2 = matrix(0,nrow = rep, ncol = N)
for ( i in 1:rep){
pro1[i,] = inf_pro(Sample1[i,],Infect)
pro2[i,] = inf_pro(Sample2[i,],Infect)
}
plot(pro1[3,], xlab= "sample size", ylab = "infection rate", type = 'l',main = "Graph 3: converging speed of the sample ")
lines(pro2[3,],col = "red")
legend("topright",col=c("black","red"),legend = c("control","mod.1"), lty = c(1,1))
converg_time = function(vector,err){
n = 1
dist = max(vector) - min(vector)
while(dist > err*2){
vector = vector[-1]
dist = max(vector) - min(vector)
n = n + 1
}
return(n)
}
err = 0.001
time1 = numeric(rep)
time2 = numeric(rep)
for ( i in 1:rep){
time1[i] = converg_time(pro1[i,],err)
time2[i] = converg_time(pro2[i,],err)
}
sum(time1 == 500)
sum(time2 == 500)
mean(time1); mean(time2)
sd(time1); sd(time2)
2. R code for simulation in Modification 2
2.1 Network Construction
#### Population Parameter ####
Total_pop = 500
pro_a = 1/4
#### Infection Parameter ####
inf_a = 0.8
inf_b = 0.4
#### Network Parameter ####
# Define p of a connection between four type:
# A(non infect A), a(infect A), B(non infect B), b(infect B)
# In total 10 type of connection
P_AA = 0.01 #1
P_Aa = 0.0001 #2
P_AB = 0.0005 #3
P_Ab = 0.00025 #4
P_aa = 0.04 #5
P_aB = 0.00025 #6
P_ab = 0.02 #7
P_BB = 0.01 #8
P_Bb = 0.0025 #9
P_bb = 0.04 #10
### Generate Population ###
n_AT = floor(Total_pop*pro_a)
n_BT = Total_pop - n_AT
n_A = floor(n_AT*(1-inf_a))
n_a = n_AT - n_A
n_B = floor(n_BT*(1-inf_b))
n_b = n_BT - n_B
Network = matrix(0,nrow=Total_pop,ncol = Total_pop)
#1
for (i in 1:n_A){
for (j in 1:n_A){
Network[i,j] = as.numeric(runif(1) < P_AA)
}
}
#2
for (i in 1:n_A){
for (j in (n_A + 1):(n_A + n_a)){
Network[i,j] = as.numeric(runif(1) < P_Aa)
}
}
#3
for (i in 1:n_A){
for (j in (n_A + n_a + 1):(n_A + n_a + n_B)){
Network[i,j] = as.numeric(runif(1) < P_AB)
}
}
#4
for (i in 1:n_A){
for (j in (n_A + n_a + n_B + 1):(n_A + n_a + n_B + n_b)){
Network[i,j] = as.numeric(runif(1) < P_Ab)
}
}
#5
for (i in (n_A + 1):(n_A + n_a)){
for (j in (n_A + 1):(n_A + n_a)){
Network[i,j] = as.numeric(runif(1) < P_aa)
}
}
#6
for (i in (n_A + 1):(n_A + n_a)){
for (j in (n_A + n_a + 1):(n_A + n_a + n_B)){
Network[i,j] = as.numeric(runif(1) < P_aB)
}
}
#7
for (i in (n_A + 1):(n_A + n_a)){
for (j in (n_A + n_a + n_B + 1):(n_A + n_a + n_B + n_b)){
Network[i,j] = as.numeric(runif(1) < P_ab)
}
}
#8
for (i in (n_A + n_a + 1):(n_A + n_a + n_B)){
for (j in (n_A + n_a + 1):(n_A + n_a + n_B)){
Network[i,j] = as.numeric(runif(1) < P_BB)
}
}
#9
for (i in (n_A + n_a + 1):(n_A + n_a + n_B)){
for (j in (n_A + n_a + n_B + 1):(n_A + n_a + n_B + n_b)){
Network[i,j] = as.numeric(runif(1) < P_Bb)
}
}
#10
for (i in (n_A + n_a + n_B + 1):(n_A + n_a + n_B + n_b)){
for (j in (n_A + n_a + n_B + 1):(n_A + n_a + n_B + n_b)){
Network[i,j] = as.numeric(runif(1) < P_bb)
}
}
Network[lower.tri(Network,diag = T)] = 0
Network = (Network + t(Network))
isSymmetric(Network)
any(rowSums(Network) == 0)
sum(rowSums(Network) == 0)
which(rowSums(Network) == 0)
Network[which(rowSums(Network) == 0),sample(1:Total_pop,sample(1:2,1))] = 1
M1 = Network
W1 = rowSums(M1)
M2 = M1
for ( i in 1:Total_pop){
M2[i,] = c(M2[i,1:(n_A+n_a)] * (inf_a/inf_b),
M2[i, (n_A+n_a+1):Total_pop])
}
W2 = rowSums(M2)
Infect = c(numeric(n_A),rep(1,n_a),numeric(n_B),rep(1,n_b))
cor(W1,Infect)
cor(W2,Infect)
summary(rowSums(M1))
summary(rowSums(M2))
2.2 Heatmap
### Heat map, testing for small population
source("http://www.phaget4.org/R/myImagePlot.R")
ID_names = c(rep("A",n_A),rep("a",n_a),rep("B",n_B),rep("b",n_b))
colnames(Network) = ID_names
rownames(Network) = ID_names
myImagePlot(Network)
2.3 Test
inf = inf_a*pro_a + inf_b*(1-pro_a)  # true infection rate
#### Testing function
Recru = function(M,seed,N){
n = nrow(M); Res = numeric(N)
Res[1] = seed
for ( i in 2:N){
m = M[Res[i-1],]
p = m[which(m > 0)]
p = p/sum(p)
v = which(m > 0)
if (length(v) > 1){
Res[i] = sample(v,1,T,p)
}
else{Res[i] = v}
}
return(Res)
}
Test = function(M,seed,N,W){
dt = Recru(M,seed,N)
infect = Infect[dt]
Sum_Wxi = sum(infect*1/W[dt])
Norm = sum(1/W[dt])
return(1/Norm * Sum_Wxi)
}
#### Test 1
# Recruit a sample of N people from M1, M2 with the same seed
N = 500
seed = sample(1:Total_pop,1)
Sample1 = Recru(M1,seed, N)
Sample2 = Recru(M2, seed, N)
mat_Sample1 = matrix(0, nrow = N, ncol = 4)
colnames(mat_Sample1) = c("#A at t","#a at t","#B at t","#b at t")
rownames(mat_Sample1) = 1:N
mat_Sample2 = matrix(0, nrow = N, ncol = 4)
colnames(mat_Sample2) = c("#A at t","#a at t","#B at t","#b at t")
rownames(mat_Sample2) = 1:N
for ( i in 1:N){
mat_Sample1[i,2] = sum((Sample1[1:i] > n_A) & (Sample1[1:i] <= (n_A + n_a)))
mat_Sample1[i,4] = sum((Sample1[1:i] > (n_B + n_A + n_a)) & (Sample1[1:i] <= (n_B + n_A + n_a + n_b)))
mat_Sample1[i,1] = sum(Sample1[1:i] < (n_A) + (n_a)) - mat_Sample1[i,2]
mat_Sample1[i,3] = sum(Sample1[1:i] > (n_A) + (n_a)) - mat_Sample1[i,4]
mat_Sample2[i,2] = sum((Sample2[1:i] > n_A) & (Sample2[1:i] <= (n_A + n_a)))
mat_Sample2[i,4] = sum((Sample2[1:i] > (n_B + n_A + n_a)) & (Sample2[1:i] <= (n_B + n_A + n_a + n_b)))
mat_Sample2[i,1] = sum(Sample2[1:i] < (n_A) + (n_a)) - mat_Sample2[i,2]
mat_Sample2[i,3] = sum(Sample2[1:i] > (n_A) + (n_a)) - mat_Sample2[i,4]
}
for ( i in 1:N){
mat_Sample1[i,] = mat_Sample1[i,]/i
mat_Sample2[i,] = mat_Sample2[i,]/i
}
par(mfrow = c(2,1))
plot(mat_Sample1[,1], col = "blue", type = "l",
ylim = c(0,max(max(mat_Sample2),max(mat_Sample1))), xlab = "n th sample wave",
ylab = "% type by nth wave", main = "M1, n= 500, seed is B")
lines(mat_Sample1[,2], col = "red")
lines(mat_Sample1[,3], col = "green")
lines(mat_Sample1[,4], col = "pink")
plot(mat_Sample2[,1], col = "blue", type = "l",
ylim = c(0,max(max(mat_Sample2),max(mat_Sample1))), xlab = "n th sample wave",
ylab = "% type by nth wave", main = "M2, n = 500, seed is B")
lines(mat_Sample2[,2], col = "red")
lines(mat_Sample2[,3], col = "green")
lines(mat_Sample2[,4], col = "pink")
#### Test 2
## measure the theoretical variance of p_hat
N = 1000
rep = 100
pi_1 = W1/sum(W1)
pi_2 = W2/sum(W2)
seed1 = sample(1:Total_pop,rep,T,prob = pi_1)
seed2 = sample(1:Total_pop,rep,T,prob = pi_2)
p_hat1 = numeric(rep)
p_hat2 = numeric(rep)
for ( i in 1:rep){
p_hat1[i] = Test(M1,seed1[i],N,W1)
p_hat2[i] = Test(M2,seed2[i],N,W2)
}
inf
mean(p_hat1)
mean(p_hat2)
sqrt(mean((p_hat1 - inf)^2)) # RMSE for method 1
sqrt(mean((p_hat2 - inf)^2)) # RMSE for method 2
sd(p_hat1)
sd(p_hat2)
par(mfrow = c(2,1))
plot(p_hat1,xlab = "n th p_hat", ylab = "p_hat value",
main = "M1,N = 1000,rep = 100",
type = "l",ylim = c(0.35,0.85))
abline(h = inf, col = "blue", lty = 2)
abline(h = mean(p_hat1) + sd(p_hat1), col = "orange", lty = 6)
abline(h = mean(p_hat1) - sd(p_hat1), col = "orange", lty = 6)
plot(p_hat2,xlab = "n th p_hat", ylab = "p_hat value",
main = "M2,N = 1000,rep = 100",
type = "l", ylim = c(0.35,0.85))
abline(h = inf, col = "blue", lty = 2)
abline(h = mean(p_hat2) + sd(p_hat2), col = "orange", lty = 6)
abline(h = mean(p_hat2) - sd(p_hat2), col = "orange", lty = 6)
### Test 3
#### Sensitivity to the seed
## compare sd(1), sd(2)
N = 100
rep = 300
seed = sample(1:Total_pop, rep)
p_hat1 = numeric(rep)
p_hat2 = numeric(rep)
for ( i in 1:rep){
p_hat1[i] = Test(M1,seed[i],N,W1)
p_hat2[i] = Test(M2,seed[i],N,W2)
}
inf
mean(p_hat1)
mean(p_hat2)
sqrt(mean((p_hat1 - inf)^2)) # RMSE for method 1
sqrt(mean((p_hat2 - inf)^2)) # RMSE for method 2
sd(p_hat1)
sd(p_hat2)
## Compare the converge speed of the demographic distribution
N = 500
rep = 500
seed = sample(1:Total_pop,rep)
Sample1 = matrix(0,nrow = rep, ncol = N)
Sample2 = matrix(0,nrow = rep, ncol = N)
for ( i in 1:rep){
Sample1[i,] = Recru(M1,seed[i],N)
Sample2[i,] = Recru(M2,seed[i],N)
}
inf_pro = function(vector,Infect){
x = Infect[vector]
pro = numeric(length(x))
for ( i in 1:length(x)){
pro[i] = sum(x[1:i])/i
}
return(pro)
}
pro1 = matrix(0,nrow = rep, ncol = N)
pro2 = matrix(0,nrow = rep, ncol = N)
for ( i in 1:rep){
pro1[i,] = inf_pro(Sample1[i,],Infect)
pro2[i,] = inf_pro(Sample2[i,],Infect)
}
plot(pro1[1,])
converg_time = function(vector,err){
n = 1
dist = max(vector) - min(vector)
while(dist > err*2){
vector = vector[-1]
dist = max(vector) - min(vector)
n = n + 1
}
return(n)
}
err = 0.001
time1 = numeric(rep)
time2 = numeric(rep)
for ( i in 1:rep){
time1[i] = converg_time(pro1[i,],err)
time2[i] = converg_time(pro2[i,],err)
}
sum(time1 == 500)
sum(time2 == 500)
mean(time1); mean(time2)
sd(time1); sd(time2)
References
Goel, S., & Salganik, M. J. (2009). Respondent-driven sampling as Markov chain Monte Carlo. Statistics in Medicine, 28(17), 2202-2229. doi:10.1002/sim.3613
Heckathorn, D. D. (1997). Respondent-driven sampling: A new approach to the study of hidden populations. Social Problems, 44(2), 174-199. doi:10.1525/sp.1997.44.2.03x0221m
Klovdahl, A. S. (1989). Urban social networks: Some methodological problems and possibilities. In M. Kochen (Ed.), The Small World (pp. 176-210). Norwood, NJ: Ablex.