# Using STATA in Survey Data Analysis - Niveen El Zayat

Training Course on The Analysis of the Egyptian Economic Census Survey 2012/2013

Cairo, 17-19 October, 2015

Published in: Government & Nonprofit

### Using STATA in Survey Data Analysis - Niveen El Zayat

1. Using STATA in Survey Data Analysis. Niveen El Zayat, Assistant Lecturer, Department of Statistics, Faculty of Economics and Political Science
2. Outline
   - Aspects of survey data
   - Survey designs
   - Setting survey design in STATA: svyset
   - Describing survey design: svydes
   - Estimation commands using the prefix svy: (in STATA do-file)
   - Estimation for subpopulations (in STATA do-file)
   - Extracting survey-weighted tables and graphs (in STATA do-file)
3. Aspects of survey data: What is a survey?
   - A survey is a research technique for drawing information about a well-defined population by selecting a sample that is systematically questioned; the results are then analyzed and generalized to the population.
   - Surveys are necessary for providing decision-makers with information that serves planning, monitoring and evaluation purposes.
   - Survey data are collected through either "complete enumeration" or "sampling".
   - Complete enumeration means collecting data from all units that exist in the population.
   - Sampling means collecting data from a subset of the population that is designed to be representative of that population and is used to draw conclusions about it.
   - Censuses are examples of complete enumeration, while the Demographic and Health Survey (DHS) and the Labor Force Sample Survey are examples of sample surveys.
4. Aspects of survey data: Populations in research
   - Target population:
     1. Refers to the ENTIRE group of individuals that researchers are interested in.
     2. It is also known as the theoretical population.
   - Accessible population:
     1. Refers to the population whose individuals researchers can actually reach.
     2. It is a subset of the target population and is also known as the study population.
     3. Researchers actually draw their samples from it, so it is defined by the sampling frame.
5. Aspects of survey data: Why sample survey data?
   - Assessing the entire population may be impossible:
     - Impractical, in the case of an infinite population or a population with destructive observations.
     - Data lose their value if they are not collected within a time limit.
     - Expensive; a sample survey is less costly than complete enumeration.
     - Limited resources (money and staff) and an extremely large workload.
     - Complete enumeration produces many errors that are hard to control and monitor.
     - Measurement may destroy the observations.
     - Lists (frames) are rarely up to date.
   - For these reasons, sampling is usually the preferred option.
6. Aspects of survey data: Probability (random) samples
   - Every unit of the population, at any stage, is drawn with a known probability. Inference about population parameters can then be drawn using probability theory.
   - There are two sampling procedures:
     1. Sampling with replacement: a unit may be drawn more than once, so draws from a finite population behave like draws from an infinite one.
     2. Sampling without replacement: a unit can be drawn at most once; with a finite population this is the case in which a finite population correction applies.
7. Survey Designs ► Simple survey design: SRS
   - Data from a single-round survey analyzed with limited reference to other information (aka "flat" or "rectangular" data).
   - Every unit in the population has an equal chance of being part of the sample.
   - Data collection is very simple; it only needs a sampling frame for the study population.
   - It often leads to inaccurate point estimates and/or inaccurate standard errors if the population is not homogeneous.
   - Most standard statistical methods assume simple random sampling.
   - Diagram: simple survey design (simple random sample, SRS) versus complex or multi-stage design (stratified sample, cluster sample, stratified-cluster sample).
8. Survey Designs ► Complex (multi-stage/hierarchical) designs
   - Drawing an SRS is sometimes impossible (e.g., when there is no complete frame of the population elements).
   - Different elements may have different probabilities of selection into the sample because of the nature of the population.
   - Used with hierarchical data (household surveys).
   - Many statistical procedures assume i.i.d. observations; moreover, many statistical packages treat data as an SRS by default.
   - In most surveys elements are not sampled independently, since one may need to select groups of individuals.
9. Survey Designs ► Stratification
   - The entire studied population is divided into well-defined groups called "strata" based on a relevant characteristic, often geographic (region) or demographic (gender, level of education or SES).
   - Sampling units are independently and randomly sampled within each stratum, possibly with different probabilities of selection across strata.
   - It usually results in smaller variances and standard errors than SRS.
   - An example is to stratify the population by locality (urban/rural).
10. Survey Designs ► Clustering
    - Sets of individuals (regions, districts, city blocks or households) are sampled as a group (cluster); population elements are then drawn from the selected clusters.
    - Further subsamples within clusters may be drawn (often called a multi-stage design).
    - The highest level of cluster is referred to as the Primary Sampling Unit (PSU).
    - The next lower level of cluster is referred to as the Secondary Sampling Unit (SSU).
    - For example, geographical regions such as local government areas are selected in the first stage, schools in the second stage, and the units of analysis (perhaps teachers or students) in the third stage. Regions are the PSUs in this example.
    - Different sampling techniques may be applied at different stages, which increases sample-to-sample variability and leads to higher variances and standard errors.
11. Survey Designs ► Exercise
12. Survey Designs ► Exercise (Name each of the following designs)
13. Survey Designs ► Exercise (Name each of the following designs)
14. Survey Designs ► Features of survey designs: Probability (sampling) weights
    - The most common weight is the sampling weight (aka probability weight), which is used to weight the sample back to the population from which it was drawn.
    - By definition, this weight is the inverse of the sampling fraction, i.e. N/n.
    - In a two-stage design with sampling fractions f1 and f2, the probability weight is (1/f1) × (1/f2): the inverse of the sampling fraction for the first stage multiplied by the inverse of the sampling fraction for the second stage.
    - In actual survey data sets, the "final weight" usually starts with the inverse of the sampling fraction; several adjustments may then be applied to account for sampling design problems such as unit non-response, errors in the sampling frame (aka non-coverage) or post-stratification.
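As a worked illustration of the two-stage weight, consider hypothetical numbers (not taken from the slides): 10 of 100 PSUs are selected at the first stage, and 5 of the 50 households in each selected PSU at the second stage.

```latex
\[
f_1 = \frac{n_1}{N_1} = \frac{10}{100} = 0.1, \qquad
f_2 = \frac{n_2}{N_2} = \frac{5}{50} = 0.1, \qquad
w = \frac{1}{f_1}\cdot\frac{1}{f_2} = 10 \times 10 = 100 .
\]
```

Under these assumptions each sampled household represents 100 households: the 50 sampled households weight back to the 5,000 households in the population.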
15. Survey Designs ► Features of survey designs: Finite population correction (FPC)
    - It is an adjustment applied to the variance because of sampling without replacement from a finite population. For a population of size N and a sample of size n it is calculated as $\mathrm{FPC} = \sqrt{(N-n)/(N-1)}$.
    - The FPC is usually applied when the sampling fraction (n/N) is large. When n is small relative to the population size N, the FPC is close to 1, has little impact and can safely be ignored.
    - For a multi-stage survey design, one may apply an FPC at one or more stages.
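To see how the size of the sampling fraction matters, here is a small worked comparison using the formula above with illustrative values of N and n (chosen here, not from the slides).

```latex
\[
N = 1000,\; n = 100:\quad
\mathrm{FPC} = \sqrt{\frac{1000-100}{1000-1}} = \sqrt{\frac{900}{999}} \approx 0.949
\]
\[
N = 100000,\; n = 100:\quad
\mathrm{FPC} = \sqrt{\frac{100000-100}{100000-1}} = \sqrt{\frac{99900}{99999}} \approx 0.9995
\]
```

With a 10% sampling fraction the correction shrinks the standard error by about 5%; with a 0.1% sampling fraction the correction is negligible.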
16. Survey Designs ► Features of survey designs: Design effect (DEFF)
    - Standard errors under different sample designs are compared using design effect statistics. For complex samples, this is typically done by comparison with a hypothetical simple random sample (SRS) of the same size.
    - It is computed as the ratio of the variance of an estimate θ under the complex design to the variance of the estimate θ from an SRS of the same size: $\mathrm{DEFF} = \mathrm{Var}(\hat\theta_{\mathrm{design}}) / \mathrm{Var}(\hat\theta_{\mathrm{SRS}})$.
    - Design factor (DEFT): the square root of the design effect, $\mathrm{DEFT} = \sqrt{\mathrm{DEFF}}$, which sets things back to the scale of standard errors.
    - DEFT = 1: the sample design has no effect on standard errors.
    - DEFT > 1: the sample design inflates standard errors.
    - DEFT < 1: the sample design reduces standard errors.
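In STATA, DEFF and DEFT can be displayed after a svy estimation command with `estat effects`. The following is a minimal sketch on a toy data set invented for this illustration (the variables `stratum`, `psu`, `pw` and `y` are hypothetical, not from the course data).

```stata
* Toy data (hypothetical): 2 strata, 2 PSUs per stratum, 2 observations per PSU
clear
input stratum psu pw y
1 1 10 12
1 1 10 14
1 2 10 20
1 2 10 22
2 3 25 31
2 3 25 29
2 4 25 40
2 4 25 38
end

* Declare a stratified cluster design with probability weights
svyset psu [pweight = pw], strata(stratum)

* Design-based estimate of the mean of y
svy: mean y

* Report the design effect (DEFF) and design factor (DEFT) for the estimate
estat effects
```

With so few observations the reported values are only a demonstration of the mechanics, not a meaningful estimate of the design effect.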
17. Setting survey design in STATA: svyset
    - STATA syntax: the command `svyset` (declare data as survey data) identifies the sample design features of your data to STATA. It can describe a wide range of complex sampling designs.
    - Single-stage design: `svyset [psu_varname] [weight] [, strata(varname) fpc(varname) options]`
    - Multiple-stage design: `svyset psu_var [weight] [, design_options options] [|| ssu_var, design_options]`
    - In both forms the weight is written as `[pweight = weight_var]` (or `[pw = weight_var]`).
    - Once the data are saved with the survey design, the design remains part of the dataset until it is cleared or changed, or a new dataset is loaded into memory.
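A short sketch of the declare, inspect and clear cycle described above, run on a toy data set invented for this purpose (the variable names `stratum`, `psu`, `pw` and `y` are assumptions, not course variables):

```stata
* Toy data (hypothetical): 2 strata, 2 PSUs per stratum
clear
input stratum psu pw y
1 1 10 12
1 2 10 20
2 3 25 31
2 4 25 40
end

* Declare the design: PSU variable, probability weight, strata
svyset psu [pweight = pw], strata(stratum)

* Typing svyset with no arguments displays the current settings
svyset

* Remove the survey settings from the data in memory
svyset, clear
```

Saving the data set after `svyset` (and before `svyset, clear`) stores the design with the file, so it does not have to be declared again the next time the file is opened.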
18. Setting survey design in STATA: svyset ► One-stage design
    - Example 1: The National Maternal and Infant Health Survey (NMIHS) 1988. Data file name: NMIHS.dta (a sample of 9,953 live births).
    - The population was divided into 6 strata according to the subdomains of two birth demographic variables: race (black and non-black) and birth weight (<1,500 g, 1,500-2,499 g and 2,500+ g). Systematic samples of live births were drawn, restricted to women 15+ years of age and to births registered in 48 States.
    - Before the design is declared, `svyset` reports: `no survey characteristics are set`
    - Command: `svyset [pw=finwgt], strata(stratan)`
    - Output: `pweight: finwgt`, `VCE: linearized`, `Single unit: missing`, `Strata 1: stratan`, `SU 1: <observations>`, `FPC 1: <zero>`
    - This is an example of a stratified systematic design. The weight variable was adjusted to get "finwgt" (see Table N, p. 23 of the report PDF file).
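The declaration in Example 1 can be tried out as sketched below. This assumes that the `nmihs` example data set distributed with STATA's documentation (loaded with `webuse`, which needs an internet connection) has the same `finwgt` and `stratan` variables as the course file NMIHS.dta; the outcome variable `birthwgt` is taken from that example data set and is not mentioned in the slides.

```stata
* Load the NMIHS example data from the STATA web site
* (assumption: equivalent to the course file NMIHS.dta)
webuse nmihs, clear

* Declare a one-stage stratified design with probability weights
svyset [pw = finwgt], strata(stratan)

* Design-based estimate of mean birth weight
svy: mean birthwgt
```

Because no PSU variable is given, STATA treats each observation as its own sampling unit, which is why the output lists `SU 1: <observations>`.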
19. Setting survey design in STATA: svyset ► Two-stage design
    - Example 2: Oman World Health Survey (OWHS) 2008. Data file name: OWHS_chronic.dta (40% of the original sample). The survey design was already set as part of the data set.
    - Command: `svyset`
    - Output: `pweight: adjINDweight`, `VCE: linearized`, `Single unit: missing`, `Strata 1: v022`, `SU 1: v021`, `FPC 1: <zero>`
    - This is an example of a stratified cluster design. Strata are represented by "v022", clusters by "v021" and the weight variable by "adjINDweight".
20. Setting survey design in STATA: svyset ► More survey designs (hypothetical exercises)
    - Exercise 1: We want to know the income per household in a certain city, and we do not have a list of households. Instead of trying to create a list of households, it would be more practical to sample blocks. Each block would be considered a sampling unit. Assuming an FPC variable exists in the data set, write the STATA command declaring this survey design.
      - Answer: `svyset block [pw = pwvar], fpc(fpcvar)`
    - Exercise 2: In the example above, instead of sampling clusters from the whole city, we first divide the city into regions and then sample blocks within each region (possibly with different criteria among regions). Assuming an FPC variable exists in the data set, write the STATA command declaring this survey design.
      - Answer: `svyset block [pw = pwvar], strata(region) fpc(fpcvar)`
21. Setting survey design in STATA: svyset ► More survey designs (hypothetical exercises)
    - Exercise 3: We want to perform a survey on the eating habits of children attending elementary schools. A possible design would be: sample independently within each state. For each state, draw a random sample of counties. Within each county, draw a random sample of schools, and interview every student in the selected schools. Assuming FPC variables exist at each stage, write the STATA command declaring this design.
      - Answer: `svyset county [pw = pwvar], strata(state) fpc(fpcvar) || school, fpc(fpcvar2)`
    - Exercise 4: In the example above, if within each school we stratify by grade and sample students independently in each grade, then we need to add another level. Assuming FPC variables exist at each stage, write the STATA command declaring this design.
      - Answer: `svyset county [pw = pwvar], strata(state) fpc(fpcvar) || school, fpc(fpcvar2) || student, fpc(fpcvar3) strata(grade)`
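The two-stage declaration from Exercise 3 can be checked on simulated data. The sketch below builds a small hypothetical data set (2 states, 2 sampled counties per state, 2 sampled schools per county, 2 students per school); the population counts behind `fpcvar` and `fpcvar2` and the outcome `y` are invented for illustration.

```stata
* Simulated data (hypothetical): 16 students nested in schools, counties, states
clear
set obs 16
gen state  = ceil(_n/8)
gen county = ceil(_n/4)
gen school = ceil(_n/2)

* Assumed population counts: 10 counties per state, 8 schools per county
gen fpcvar  = 10
gen fpcvar2 = 8

* Weight = inverse of the overall sampling fraction: (10/2) * (8/2) = 20
gen pwvar = (10/2) * (8/2)

* Arbitrary outcome with some variation
gen y = mod(7*_n, 5) + state

* Stage 1: counties within states; stage 2: schools within counties
svyset county [pw = pwvar], strata(state) fpc(fpcvar) || school, fpc(fpcvar2)
svy: mean y
```

The `fpc()` variables here hold the number of units in the population at each stage; STATA's `fpc()` option also accepts sampling rates between 0 and 1.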
22. Describing survey design: svydes
    - STATA syntax: the command `svydes` describes the survey design that was previously declared for the data set by `svyset`.
    - `svydes [varlist] [, stage(#) finalstage single options]`
    - For a multistage design, the option `stage(#)` selects the stage whose design is described.
    - The option `single` lists only the strata with a single PSU (singletons). In general, `svydes` adds "*" next to the stratum id to show that the stratum has a single PSU.
23. Describing survey design: svydes ► Two-stage design
    - Use data set OWHS_chronic.dta; STATA command: `svydes`
    - Stratum is the stratum id number given by the strata variable.
    - #Units is the number of PSUs in a given stratum.
    - #Obs is the number of observations in a given stratum.
    - The other columns give summary statistics on the number of observations per unit.
    - The important thing to note: if a stratum has a singleton PSU, then #Units will equal 1. This means the stratum includes only one PSU, which is also indicated by a "*".
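The behaviour of `svydes` with a singleton stratum can be sketched on a toy data set in which stratum 2 contains only one PSU. The data and variable names (`stratum`, `psu`, `pw`, `y`) are invented for this illustration.

```stata
* Toy data (hypothetical): stratum 1 has two PSUs, stratum 2 has only one
clear
input stratum psu pw y
1 1 10 12
1 1 10 14
1 2 10 20
1 2 10 22
2 3 25 31
2 3 25 29
end

svyset psu [pweight = pw], strata(stratum)

* Describe the design: one row per stratum, singleton strata are flagged with *
svydes

* List only the strata that contain a single PSU
svydes, single
```

Singleton strata matter because, with the `Single unit: missing` setting shown in the examples above, standard errors are reported as missing when a stratum has only one PSU.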