Sample, advanced epidemiology factor analysis

Kellie Watkins 1
Final Project – Advanced Design Analysis
Introduction: In 2010, Texas had the fourth highest occurrence of new diagnoses of HIV/AIDS
and TB. The Texas Department of State Health Services (DSHS) identified HIV-infected
persons as a high-risk population for TB in Harris County. To explore this relationship
further, co-morbidity was examined through regression analyses.
Problem, Research Question, Hypothesis: A population-based ecological study was conducted
to identify areas with a high number of new TB and HIV diagnoses in Harris County, Texas from
2009 through 2010. Rates of new TB and HIV diagnoses were linked to socio-economic variables at
the census tract level. The independent variables include the following: housing size, American
Indian/Alaska Native, Asian, Associates degree, Bachelor degree, Black, divorced, English
(primary language), female, foreign born, graduate or professional degree, high school degree,
Hispanic, less than 9th grade education, male, married or separated, Native Hawaiian/Pacific
Islander, native born, never married, no educational diploma, other language (non-English
speakers), non-Hispanic, separated, some college level education, White, other ethnicity, two
ethnicities or more, widowed, poverty, and unemployed. All variables are measured as a
percentage of the total population per census tract according to the 2010 census data; therefore,
each independent variable is continuous. The literature suggests that these variables might be
important risk factors for HIV and/or TB among the Harris County population, but initial
analysis indicated multicollinearity among them. In particular, variables within the same domain
(e.g., educational level) might be highly correlated. Factor analysis might reduce redundancy among these
highly correlated variables. Factor analysis was the primary focus of this project, although final
logistic regression models were fit to analyze the results.
The average rate of new HIV diagnoses and the average rate of new TB diagnoses were
6.81 and 1.87 per 10,000 persons, respectively, in Harris County over the two-year period. A
co-morbidity variable was created that identified census tracts with above-average rates of both HIV and
TB, and this variable represents the outcome of interest. Logistic regression analysis was used to
assess the relationship between the predictor variables and co-morbidity. The main research
question is whether known risk factors for HIV and known risk factors
for TB are significantly associated with the co-morbidity of HIV and TB at the population level.
Hypothesis: $H_0: \beta_m = \dots = \beta_{m+t} = 0$; $H_a$: at least one of $\beta_m, \dots, \beta_{m+t} \neq 0$
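As a minimal sketch of how the binary co-morbidity outcome described above could be derived, the following Python marks a tract as co-morbid when both of its rates exceed the county averages reported in the text (6.81 and 1.87 per 10,000). The tract names and rates are made-up illustrations, and treating "above average" as a strict inequality is an assumption.

```python
# County-wide average rates per 10,000 persons, as reported in the text.
HIV_MEAN_RATE = 6.81
TB_MEAN_RATE = 1.87


def comorbidity_flag(hiv_rate, tb_rate):
    """Return 1 if both rates are strictly above the county averages, else 0."""
    return int(hiv_rate > HIV_MEAN_RATE and tb_rate > TB_MEAN_RATE)


# Hypothetical tract-level rates per 10,000 (illustrative values only).
tracts = {
    "tract_A": (12.40, 3.10),  # both rates above average
    "tract_B": (9.00, 0.00),   # HIV above average, TB not
    "tract_C": (2.50, 4.20),   # TB above average, HIV not
}

outcome = {name: comorbidity_flag(hiv, tb) for name, (hiv, tb) in tracts.items()}
print(outcome)
```

Only a tract that is above average on both diseases is coded 1, which is what makes the outcome a measure of co-morbidity rather than of either disease alone.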
Description of Data: TB data are from the Department of State Health Services and originate
from the local TB control program. HIV data are from the Houston Department of Health and
Human Services (HDHHS). TB and HIV are reportable diseases, and reporting is required by law. Although the data are
de-identified and analyzed at the population level, HIV/TB data may not be shared with
unauthorized parties. The project received approval from relevant stakeholders and the
University of Texas Institutional Review Board.
Geocoding: Subject-level data were the source data. The addresses of new HIV diagnoses
and the addresses of new TB diagnoses were aggregated at the census tract level through
geocoding. New HIV diagnoses were geocoded by staff at HDHHS through a Centers for
Disease Control and Prevention grant project. TB data were transmitted to HDHHS and included
all new diagnoses beginning in 2000 for the state of Texas. For the purposes of this project,
point-level addresses were geocoded using the online address locator ArcGIS 9.3 North America
Geocode Service. A total of 735 TB case records from DSHS were located within Harris County
after selecting cases by attribute. There were 395 cases of TB in 2009 and 340 cases of TB in
2010. The average match score was 93.75 with a maximum of 100 and a minimum of 73.45.
Census Level: HIV and TB individual-level residential addresses needed to be linked to a
Harris County census tract shapefile at the polygon level. Summing all the points within an
individual census tract produced the total count per census tract. These counts,
representing new diagnoses of HIV and TB respectively, were divided by the total
census tract population for each tract to produce the rates for 2009 and 2010.
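The aggregation step above can be sketched in a few lines of Python: count geocoded cases per tract, then divide by the tract population and scale to 10,000 persons. The tract IDs, case list, and populations are hypothetical, and the function name `rates_per_10k` is introduced here purely for illustration; the original work performed this step in ArcGIS.

```python
from collections import Counter


def rates_per_10k(case_tracts, populations):
    """Count cases per tract and convert to a rate per 10,000 residents.

    case_tracts: one tract ID per geocoded case.
    populations: tract ID -> total population (tracts with 0 are skipped).
    """
    counts = Counter(case_tracts)
    return {
        tract: 10000 * counts.get(tract, 0) / pop
        for tract, pop in populations.items()
        if pop > 0
    }


# Hypothetical inputs: four geocoded cases falling in three tracts.
cases = ["T1", "T1", "T2", "T1"]
population = {"T1": 4000, "T2": 5000, "T3": 2500}

rates = rates_per_10k(cases, population)
print(rates)
```

Tracts with no cases receive a rate of zero rather than being dropped, so every tract with a nonzero population remains available for the later regression.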
Census Variables: 2010 Census Data was collected by the U.S. Census Bureau through the
decennial census questionnaire that is mandated for every household in the United States. Detailed
population and housing data were released on 21 December 2010. Census tracts represent statistical
subdivisions within each county. According to the U.S. Census Bureau, most tracts contain between
2,500 and 8,000 residents, although tract size varies widely depending on the density of
residents. Census tracts do not cross county boundaries; therefore, each census tract within Harris
County lies entirely within its borders. There are a total of 786 census tracts located within the
borders of Harris County. Census tracts are designed to be homogeneous across demographic
variables such as socioeconomic status and other population characteristics, making them ideal for
public health research focused on community-based intervention and outreach programs such as this
thesis.
Literature suggested that particular variables were potential risk factors for HIV or TB.
These variables include age, housing size, race, sex, educational level, employment status, poverty,
and marital status. These variables exist in various public databases available online through the
U.S. Census Bureau. Each potential variable of interest was extracted individually and linked to the
TB and HIV outcomes through ArcGIS using the census tract as the common key. All variables
were continuous because of their aggregate nature. For instance, instead of an individual “race”
variable with subsequent divisions such as Black, Asian, White, etc., each individual race represents
a separate variable such as the percentage of Black residents in a census tract, the percentage of
White residents in a census tract, and so forth.
Data Analysis and Results: Initially, before the creation of the binary outcome, ordinary least
squares regression was attempted using ArcGIS 9.3. However, the program returned an error
stating that multicollinearity prevented the model from running correctly. Despite
repeated elimination of variables, multicollinearity remained. As a result,
SAS 9.3 was used to run additional analyses in hopes of resolving the multicollinearity
issue.
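One simple way to inspect the multicollinearity described above is to screen pairwise Pearson correlations. The sketch below is not the ArcGIS or SAS procedure used in the project; the column names, data values, and the 0.8 cutoff are all assumptions chosen for illustration. It also shows why complementary percentages, such as percent male and percent female, are perfectly correlated by construction.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def highly_correlated_pairs(columns, threshold=0.8):
    """List variable pairs whose absolute correlation meets the threshold."""
    flagged = []
    for (name_a, a), (name_b, b) in combinations(columns.items(), 2):
        r = pearson(a, b)
        if abs(r) >= threshold:
            flagged.append((name_a, name_b, round(r, 3)))
    return flagged


# Hypothetical tract-level percentages; male and female sum to 100.
columns = {
    "pct_male": [48.0, 51.0, 49.5, 52.0, 47.0],
    "pct_female": [52.0, 49.0, 50.5, 48.0, 53.0],
    "pct_poverty": [10.0, 25.0, 12.0, 18.0, 30.0],
}

flagged = highly_correlated_pairs(columns)
print(flagged)
```

Pairs flagged this way are candidates for dropping one member or for combining into a subscale.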
An initial correlation matrix was produced that demonstrated the high degree of
correlation among the variables (see Appendix I for a partial representation of the correlation
coefficients). A remaining issue was that some variables were clearly independent of each other
while others, such as the “race” variables, essentially measured the same construct but were not equal and,
consequently, could not be combined. Furthermore, continuing the example with race, Black
and Asian were each risk factors for HIV and TB, respectively. Maintaining each as an independent
variable was deemed essential, and logistic regression was still the preferred method of analysis.
After consultation with a biostatistician, factor analysis was recommended for the variables that
were highly correlated and logically related. The goal was to create two subscales, education
and marital status, as predictors in the final regression model to reduce at least some of the
redundancy. For each subscale, the mean of the items in the factor was used as a continuous
variable in the logistic regression model.
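Under one reading of the subscale construction described above, each tract's subscale score is the plain average of the items assigned to that factor. The sketch below assumes that reading; the item names and values are hypothetical, since the text does not state which items entered each subscale.

```python
def subscale_score(tract_row, items):
    """Average the listed item percentages for one census tract."""
    values = [tract_row[item] for item in items]
    return sum(values) / len(values)


# Hypothetical items for a marital-status subscale (not taken from the source).
single_items = ["pct_divorced", "pct_never_married", "pct_widowed"]

# Hypothetical percentages for one tract.
tract = {
    "pct_divorced": 10.0,
    "pct_never_married": 30.0,
    "pct_widowed": 5.0,
    "pct_poverty": 22.0,  # not part of the subscale, so it is ignored
}

single = subscale_score(tract, single_items)
print(single)
```

An unweighted mean treats every item as equally informative; weighting items by their factor loadings would be an alternative design.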
Prior to analyzing the related items for education and marital status, an
initial factor method (principal components) was run with varimax rotation in the SAS
system using all variables of interest. There were a total of 32 eigenvalues with an average of 1.
For the initial factor method (principal components), 7 factors were retained by the mineigen
criterion of 1.0 (see Appendix II), where loadings greater than 0.35 within each factor were flagged
with an asterisk. The variance explained by each factor was 10.73 for Factor 1, 4.80 for Factor
2, 2.79 for Factor 3, 1.94 for Factor 4, 1.35 for Factor 5, 1.21 for Factor 6, and 1.06 for Factor 7.
Using the rotation method varimax, an orthogonal transformation matrix was produced (see
Appendix II). The variance explained by each factor was 8.81 for Factor 1, 5.12 for Factor 2,
2.55 for Factor 3, 2.36 for Factor 4, 2.07 for Factor 5, 1.69 for Factor 6, and 1.27 for Factor 7.
However, when viewing the seven factors and the variables with loadings greater than 0.35 per
factor, it was difficult to interpret the results. The resulting patterns did not
group variables in ways that could be easily or logically explained. While it was understood why
particular correlations between these variables might exist, the factors were not
suitable for producing a meaningful interpretation.
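The figures reported above can be checked with a little arithmetic. Because the eigenvalues of a correlation matrix sum to the number of variables (here 32, hence the average of 1), each factor's share of total variance is its eigenvalue divided by 32. The sketch below uses only the values quoted in the text; the function names are illustrative, and treating the retention rule as "eigenvalue of at least 1.0" is an assumption about the boundary case, which does not arise with these numbers.

```python
N_VARIABLES = 32  # number of eigenvalues reported in the text

# Variance explained by the seven retained factors, as reported in the text.
unrotated = [10.73, 4.80, 2.79, 1.94, 1.35, 1.21, 1.06]
rotated = [8.81, 5.12, 2.55, 2.36, 2.07, 1.69, 1.27]


def retained(eigenvalues, min_eigen=1.0):
    """Keep eigenvalues that meet the minimum-eigenvalue criterion."""
    return [ev for ev in eigenvalues if ev >= min_eigen]


def proportion_explained(variances, n_variables):
    """Share of total variance attributable to each factor."""
    return [v / n_variables for v in variances]


kept = retained(unrotated)
total_unrotated = sum(unrotated)
total_rotated = sum(rotated)
share = total_unrotated / N_VARIABLES

print(len(kept))
print(round(total_unrotated, 2), round(total_rotated, 2))
print(round(share, 3))
print([round(p, 3) for p in proportion_explained(unrotated, N_VARIABLES)])
```

The two totals agree up to rounding in the reported figures, as expected: an orthogonal rotation such as varimax redistributes variance among the retained factors without changing the total they explain.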
Consequently, the same analysis was run but with fewer variables that were meaningfully
related. Due to the dramatically decreased number of variables analyzed in the factor analysis,
this type of analysis might not be the most desirable approach, but it was pursued in hopes of
diminishing the number of variables by finding a meaningful measure of similar components.
For marital status, the percentages divorced, married but separated, never married, separated, and
widowed were analyzed. To maintain consistency, the factor procedure with varimax rotation
was used in the SAS system. Three factors were produced, and within each factor the values
greater than 0.35 were flagged with an asterisk (see Appendix III). The variance explained by
each factor was 0.85 for Factor 1, 0.56 for Factor 2, and 0.33 for Factor 3. A new variable named
“single” was created based on the factors. For education, the percentages with an
associate’s degree, bachelor’s degree, graduate or professional level education, high school
degree, less than 9th grade education, no educational diploma, and some college education were analyzed.
To maintain consistency, the factor procedure with varimax rotation was used in the SAS
system. Three factors were produced, and within each factor the values greater than 0.35 were
flagged with an asterisk (see Appendix IV). The variance explained by each factor was 2.01 for
Factor 1, 1.78 for Factor 2, and 1.19 for Factor 3. A new variable named “education” was created based on the
factors. The goal of the factor analysis was to reduce the number of
variables in the final model through logical and statistical reasoning.
Model Building: A univariate analysis was run for each of the remaining variables of interest
(i.e., potential risk factors) against the binary co-morbidity outcome. Housing size, single,
education, poverty, unemployment, Asian, Black, non-English speakers, foreign born, and male
were significantly associated with HIV/TB at the 5% significance level in the univariate analysis. Age
(p-value of 0.0510) and Hispanic (p-value of 0.1060) were marginally significant and were still
considered important risk factors to carry into the model building process. Multiple
logistic regression models were fit, using the literature to determine which variables were best suited
as predictors of co-morbidity and the univariate analysis to verify their association with the outcome
variable. There was some residual correlation between predictors, such as
between Black and poverty, but each variable was retained because of its established
importance as an individual risk factor. Some interaction terms were considered but rejected based
on their lack of meaningful interpretation.
Assessment and Discussion: After careful consideration and significant debate, it was
concluded that a separate analysis for HIV and TB as individual, binary variables should be run
to determine which independent variables were significantly associated with above average rates
of HIV at the census tract level and above average rates of TB at the census tract level to
compare with the co-morbidity model. For the HIV model, significant variables (p-value<0.05)
were housing size, Asian, Black, foreign born, and age. Poverty (p-value of 0.07) was
marginally significant. For the TB model, significant variables (p-value<0.05) were poverty,
Asian, and Black. The direction of the coefficient for Asian reversed, with increased percentage
of Asian populations associated with a higher rate of TB. Education (p-value of 0.08) was
marginally significant. The final co-morbidity model included the following independent
variables: housing size, single, education, poverty, unemployment, Asian, Black, Hispanic,
foreign born, male, and age. Age and male were considered, although they were not significant
in the single models, because the literature demonstrated that younger age and male sex are
risk factors for HIV and/or TB in Harris County. For the final model, housing size,
unemployment, Black, and foreign born were significant (p-value<0.05). Furthermore, the AIC
value was smaller for this model (620.264) when compared to the other models run previously
for the co-morbidity outcome. The significance of unemployment was unexpected because
poverty is often considered a proxy for this variable, yet poverty was not
significant in this model. For the significant predictors, the relationship indicated that an increased
percentage of smaller households was associated with lower co-morbidity, while increased
percentages of unemployed, Black, and foreign born residents were associated with higher
co-morbidity. Asian is another variable that needs additional consideration because it appears to be
a protective factor against higher rates of HIV at the census tract level but a risk
factor for higher rates of TB at the census tract level. Results indicate that a higher percentage of
Asian populations might be associated with lower co-morbidity.
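The AIC comparison mentioned above follows the standard definition AIC = 2k - 2 ln(L), where k is the number of estimated parameters and ln(L) is the maximized log-likelihood; the model with the smaller value is preferred. The sketch below applies that formula to two hypothetical candidates. The log-likelihoods and parameter counts are invented for illustration and are not the project's results.

```python
def aic(log_likelihood, n_parameters):
    """Akaike information criterion: 2k - 2*ln(L)."""
    return 2 * n_parameters - 2 * log_likelihood


# Hypothetical candidates: (maximized log-likelihood, number of parameters).
candidates = {
    "model_A": (-305.0, 8),
    "model_B": (-298.0, 12),
}

scores = {name: aic(ll, k) for name, (ll, k) in candidates.items()}
best = min(scores, key=scores.get)

print(scores)
print(best)
```

Here the larger model is preferred because its gain in log-likelihood outweighs the penalty for its four extra parameters.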
Limitations and Recommendations: In attempting to address the multicollinearity issues
involved in this procedure, factor analysis was applied. However, there might have been other,
preferred methods to reduce the redundancy of variables, including simple review of the
literature. In addition, creating subscales based on the factor loadings might not have been
effective or done correctly based on the nature of factor analysis. It was determined that it was
the best method to apply within the limited timeframe and tools available, but additional
possibilities should be pursued in the future. Furthermore, the data cleaning and geocoding
process should be reviewed for accuracy due to the complicated and extensive labor that was
undertaken to create the final database. Additionally, the initial analysis was run before any
understanding of multilevel analysis was available. Next steps have already begun to pursue
this analysis using the subject-level TB data that were initially transmitted by DSHS. In this case,
both the subject level and the census tract level can be modeled instead of aggregating the data.

APPENDIX II: Initial Factor Method: Principal Components

APPENDIX III: Marital Status

APPENDIX IV: Education

APPENDIX V: HIV, TB, and HIV/TB Co-Morbidity Models
Outcome: HIV (binary)
Outcome: TB (binary)
Outcome: HIV/TB (co-morbidity, binary)
